Parsing an HTML string into a table of key-value pairs

Hello,

I am doing some HTML parsing, hoping to modify the HTML before sending it to native.newWebView().

What are some efficient ways to parse the HTML strings below into key-value pairs?

Example (1),

<img border="0" style="max-width:100%" src="https://www.brookings.edu/wp-content/uploads/2018/07/trump_putin_helsinki006.jpg?w=248" width="1200" height="800" />

into,

pair = {
    border = "0",
    style = "max-width:100%",
    src = "https://www.brookings.edu/wp-content/uploads/2018/07/trump_putin_helsinki006.jpg?w=248",
    width = "1200",
    height = "800",
}

Example (2),

<a href="#page-content" class="screen-reader-shortcut">Skip to main content</a>

into,

pair = {
    href = "#page-content",
    class = "screen-reader-shortcut",
    text = "Skip to main content",
}

Many thanks for your help.

Regards, Luan

Loop through the attributes using an XML library and store them in a table.
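For a single, well-formed tag like the two examples in the question, the attribute loop can be roughed out with Lua string patterns. This is only a sketch, not an XML library: `parseTag` is a made-up helper name, and it assumes quoted attribute values, no `>` inside a value, and no nested tags in the text. For whole documents a real parser is still the way to go.

```lua
-- Hypothetical helper: parse ONE well-formed tag into a flat key-value table.
-- Assumes values are quoted with " or ' and never contain a ">" character.
local function parseTag(html)
    -- Split the opening tag into its name and the raw attribute string.
    local name, attrs = html:match("^%s*<([%w:_%-]+)(.-)/?>")
    if not name then
        return nil
    end

    local pair = {}

    -- Loop through key="value" (or key='value') pairs; %2 matches the same
    -- quote character that opened the value.
    for key, _, value in attrs:gmatch("([%w:_%-]+)%s*=%s*([\"'])(.-)%2") do
        pair[key] = value
    end

    -- If a closing tag follows, keep whatever sits between the tags as text.
    local text = html:match(">(.-)</[%w:_%-]+%s*>%s*$")
    if text and text ~= "" then
        pair.text = text
    end

    return pair, name
end

local pair = parseTag('<a href="#page-content" class="screen-reader-shortcut">Skip to main content</a>')
print(pair.href, pair.class, pair.text)
```

The second return value is the tag name, in case it is needed when rebuilding the HTML string for the web view.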

I might have some old HTML parsing code (not written by me) in this AskEd answer from way back:

https://github.com/roaminggamer/RG_FreeStuff/raw/master/AskEd/2015/10/mltext_mod.zip

Original: 

https://github.com/luaforge/html/tree/master/html

You might want to look at our business app sample:  https://github.com/coronalabs-samples/business-app-sample

While it does not parse HTML, it parses RSS XML, which is quite similar. You could pull in the xml.lua file to parse the HTML as XML, and then look at the rss.lua or atom.lua files to see how to turn the resulting XML Lua table into something more key-value pair friendly.

The challenge with XML like HTML is that it’s not a key-value setup.  Consider this:

<div id="menu" class="red">This is a block of text</div>

You have two things to put under one key: the content of the div and the key-value attributes inside the tag. This doesn’t fit the simple key-value model of Lua tables. Most XML parsers will create nodes with sub-nodes for each attribute and assign the tag name and the value to those sub-nodes.
 

parsedXML.child[1].name = "div"
parsedXML.child[1].id = "menu"
parsedXML.child[1].child = { key = "class", value = "red" }
parsedXML.child[1].value = "This is a block of text"

This is a made-up structure, but you can see how a fairly simple HTML tag becomes a pretty complex Lua table. This is one reason many developers prefer to work with JSON when possible.

Rob

I would disagree that Lua tables are not friendly with XML. In fact, every element of XML is a key-value pair, except that the key may have attributes (which are themselves key/value pairs).

I think the parsed xml example from Rob Miracle would look more like this: 

parsedXML.child[1].name = "div"
parsedXML.child[1].type = "element"
parsedXML.child[1].text = ""
parsedXML.child[1].attributes["id"] = "menu"
parsedXML.child[1].attributes["class"] = "red"
parsedXML.child[1].firstChild = { name = "", type = "text", text = "This is a block of text" }
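Given a node shaped like that, getting to the flat table the OP asked for is a short loop. The sketch below builds the node by hand; the field names (`attributes`, `firstChild`, `type`, `text`) follow the structure above and are assumptions, since a real XML library may name them differently. `toPairs` is a made-up helper name.

```lua
-- Hypothetical node, built by hand to match the structure sketched above.
local node = {
    name = "div",
    type = "element",
    text = "",
    attributes = { id = "menu", class = "red" },
    firstChild = { name = "", type = "text", text = "This is a block of text" },
}

-- Copy every attribute into a flat table, then add the first text child.
-- Note: an attribute literally named "text" would be overwritten here.
local function toPairs(element)
    local pair = {}
    for key, value in pairs(element.attributes or {}) do
        pair[key] = value
    end
    local child = element.firstChild
    if child and child.type == "text" then
        pair.text = child.text
    end
    return pair
end

local pair = toPairs(node)
print(pair.id, pair.class, pair.text)
```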

There are of course nuances with xml that can be exasperatingly detail oriented.   

I once spent hours trying to track down a bug where my XML was not parsing the way I thought it should. I finally realized that the parser was reading the CR/LF between tags as an element.
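One way around that kind of surprise is to drop whitespace-only text nodes before walking the tree. A minimal sketch, assuming the parser hands back children as an array of nodes with `type` and `text` fields (that shape and the name `dropBlankTextNodes` are made up for illustration):

```lua
-- Hypothetical: `children` is an array of nodes with `type` and `text` fields.
-- Returns a new array without text nodes that contain only whitespace.
local function dropBlankTextNodes(children)
    local kept = {}
    for _, node in ipairs(children) do
        local isBlankText = node.type == "text"
            and (node.text or ""):match("^%s*$") ~= nil
        if not isBlankText then
            kept[#kept + 1] = node
        end
    end
    return kept
end

local children = {
    { type = "text", text = "\r\n  " },      -- line break between tags
    { type = "element", name = "div" },
    { type = "text", text = "\r\n" },
}
local cleaned = dropBlankTextNodes(children)
print(#cleaned, cleaned[1].name)
```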

But for sure the OP will want a pre-built XML library. Parsing HTML/XML is a huge job and you would be reinventing the wheel. Certainly this job is bigger than some simple Lua string pattern matching.

PS:   

Many webView widgets in other systems will actually expose their DOMs. They all parse the HTML, so why not?

Hello all,

Many thanks for your prompt guidance. I will study them in detail.

Regards, Luan
