String matching

I would like to know how to structure a string search so that it returns both the matching and the non-matching parts of the string. For example, given this string:

Hello, <b>this is</b> me.

I would like a search to return these strings:

"Hello, ", "<b>", "this is", "</b>", " me."

I’ve tried this:

string.gsub( input, "<(.-)>", function(str) print( str ) end )

But that only prints:

b /b
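The pattern "<(.-)>" captures only what sits between the angle brackets, so gsub never sees the text outside the tags. One possible way to get both kinds of segment is to walk the string with string.find and slice out the gaps between matches. This is a minimal sketch, assuming plain Lua 5.1 string functions; splitTags is a hypothetical helper name, not part of Lua or Corona.

```lua
-- Split a string into alternating text and tag segments.
-- splitTags is a hypothetical helper name used only for this illustration.
local function splitTags(input)
    local parts = {}
    local pos = 1
    while true do
        -- Find the next "<...>" at or after pos; ".-" takes the shortest match.
        local s, e = string.find(input, "<.->", pos)
        if not s then break end
        if s > pos then
            -- Non-matching text that sits before the tag.
            parts[#parts + 1] = string.sub(input, pos, s - 1)
        end
        -- The matching tag itself, brackets included.
        parts[#parts + 1] = string.sub(input, s, e)
        pos = e + 1
    end
    if pos <= #input then
        -- Whatever is left after the last tag.
        parts[#parts + 1] = string.sub(input, pos)
    end
    return parts
end

local parts = splitTags("Hello, <b>this is</b> me.")
for i = 1, #parts do
    print(i, "[" .. parts[i] .. "]")
end
```

For the example string this gives five segments in order: the leading text, the opening tag, the inner text, the closing tag, and the trailing text. The square brackets in the print call are only there to make leading and trailing spaces visible.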

Well, having toyed with the very regex-like Lua string pattern matching I’ve come up with this little gem:

local t = [[some long html stuff in here]]

-- Each entry is { pattern, replacement }; entries are applied in order.
local cleaner = {
    { "&amp;", "&" },
    { "&#151;", "-" },
    { "&#146;", "'" },
    { "&#160;", " " },
    -- [^>]* instead of .* so the match cannot run past the end of the <br /> tag
    { "<br[^>]*/>", "\n" },
    { "</p>", "\n" },
    { "(%b<>)", "\n" },
    { "\n\n*", "\n" },
    { "\n*$", "" },
    { "^\n*", "" },
}

for i = 1, #cleaner do
    local cleans = cleaner[i]
    t = string.gsub( t, cleans[1], cleans[2] )
end

print(t)

This converts HTML entities to plain characters, turns </p> and <br /> elements into new lines (even if the <br /> carries XML attributes), replaces every remaining HTML tag with a new line, and then collapses runs of new lines and trims them from the start and end of the text.
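To see the rules working on something small, here is a sketch that wraps the same table in a function and runs it on a short sample. cleanHtml and the sample markup are made up for the illustration. The <br /> rule is written with [^>]* so that it cannot match past the closing bracket of the tag; a greedy .* would run from the first <br to the last "/>" in the text.

```lua
-- cleanHtml is a hypothetical wrapper name; the rules mirror the cleaner table above.
local rules = {
    { "&amp;", "&" },
    { "&#151;", "-" },
    { "&#146;", "'" },
    { "&#160;", " " },
    { "<br[^>]*/>", "\n" },  -- self-closed <br />, with or without attributes
    { "</p>", "\n" },        -- paragraph ends become line breaks
    { "%b<>", "\n" },        -- any remaining tag becomes a line break
    { "\n\n*", "\n" },       -- collapse runs of line breaks
    { "\n*$", "" },          -- trim line breaks at the end
    { "^\n*", "" },          -- trim line breaks at the start
}

local function cleanHtml(text)
    for i = 1, #rules do
        local rule = rules[i]
        -- Parentheses drop gsub's second return value (the match count).
        text = (string.gsub(text, rule[1], rule[2]))
    end
    return text
end

-- Made-up sample input, not taken from the original post.
local sample = [[<p>Fish &amp; chips<br class="x" />It&#146;s hot</p><p>Bye</p>]]
local cleaned = cleanHtml(sample)
print(cleaned)
```

With this sample the result is three lines of plain text. Because every leftover tag turns into a line break, inline tags such as <b> would also split a line, which may or may not be what you want.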

It is a very good first pass, if I do say so myself, and cleans pretty much everything I’m interested in. I will now use this for padding the text out in pages using the display.newText function.

Pat yourself on the back.  Nicely done.

Rob

Thanks Rob :)

Gist committed for improvements: http://code.coronalabs.com/code/clean-html
