HTML Cleaning

Hi folks,

I’m looking for a really good string cleaner for fairly dirty HTML. So far, the best results I’ve had come from this code:

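    local o = ""
    -- Collect the text between each ">" and the next "<", i.e. everything outside the tags.
    string.gsub(">" .. htmlContent .. "<", ">(.-)<", function(a) o = o .. a end)

    local p = ""
    -- Collect the text between each ";" and the next "&", i.e. everything outside the entities.
    string.gsub(";" .. o .. "&", ";(.-)&", function(a) p = p .. a end)

    print(p)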

…which obviously isn’t beautiful. What I need is something really robust that can also convert HTML character entities (including the weird ones that show up as �) into normal, plain text.

Does anyone have anything like that, please?

Thanks,

Matt.

Have you looked at this tutorial? http://developer.coronalabs.com/code/strip-html-tags-text

Yes - that code reduced my fairly large string to only a few characters, so I don’t think it’s workable for me. I’m looking for something much more robust.

Do you have a before-and-after example? In our case, we came up with a solution for our tourism app: we transform thousands of HTML files into an SQLite database, and the text is completely reformatted during that phase. Maybe we could share something based on your needs.

Ah, it’s ok - I seem to have solved the problem myself. I’ve asked a related question here, where I’ll post my answer: http://forums.coronalabs.com/topic/44268-string-matching/

Gist committed for improvements: http://code.coronalabs.com/code/clean-html
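
For anyone landing on this thread later, here is a rough sketch of the kind of cleaner discussed above. It is not the code from the linked gist. The names `cleanHtml`, `decodeEntities`, `codepointToUtf8` and `namedEntities` are made up for illustration, the entity table is deliberately tiny, and mapping `&nbsp;` to an ordinary space is an assumption. It uses only plain Lua string patterns, so it will not cope with every malformed page.

```lua
-- Hypothetical helpers for illustration only; not the code from the linked gist.

-- A deliberately small table of named entities; extend it as needed.
-- Assumption: &nbsp; is mapped to an ordinary space.
local namedEntities = {
    amp = "&", lt = "<", gt = ">", quot = '"', apos = "'", nbsp = " ",
}

-- Encode a Unicode code point as UTF-8 (Lua 5.1 has no utf8 library).
local function codepointToUtf8(cp)
    if cp < 0x80 then
        return string.char(cp)
    elseif cp < 0x800 then
        return string.char(0xC0 + math.floor(cp / 0x40),
                           0x80 + cp % 0x40)
    elseif cp < 0x10000 then
        return string.char(0xE0 + math.floor(cp / 0x1000),
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    elseif cp < 0x110000 then
        return string.char(0xF0 + math.floor(cp / 0x40000),
                           0x80 + math.floor(cp / 0x1000) % 0x40,
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    end
    return nil -- out of range: the caller leaves the reference untouched
end

local function decodeEntities(text)
    -- Numeric references: &#233; (decimal) and &#xE9; (hexadecimal).
    text = text:gsub("&#([xX]?)(%x+);", function(hex, digits)
        local cp = tonumber(digits, hex ~= "" and 16 or 10)
        return cp and codepointToUtf8(cp) or nil
    end)
    -- Named references; names missing from the table are left as they are.
    text = text:gsub("&(%a%w*);", function(name)
        return namedEntities[name]
    end)
    return text
end

local function cleanHtml(html)
    -- Drop comments, then script and style blocks together with their contents.
    html = html:gsub("<!%-%-.-%-%->", "")
    html = html:gsub("<[sS][cC][rR][iI][pP][tT][^>]*>.-</[sS][cC][rR][iI][pP][tT]%s*>", "")
    html = html:gsub("<[sS][tT][yY][lL][eE][^>]*>.-</[sS][tT][yY][lL][eE]%s*>", "")
    -- Replace every remaining tag with a space so neighbouring words do not merge.
    html = html:gsub("<[^>]*>", " ")
    -- Decode entities only after the tags are gone, so a decoded "<" survives.
    html = decodeEntities(html)
    -- Collapse whitespace runs, then trim both ends.
    html = html:gsub("%s+", " ")
    return html:match("^%s*(.-)%s*$")
end

print(cleanHtml("<p>Fish &amp; chips &lt;3 caf&#233;</p>")) --> Fish & chips <3 café
```

Stripping the tags before decoding the entities is a deliberate choice here: text such as `&lt;b&gt;` then comes out as a literal `<b>` instead of being mistaken for a tag and removed.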
