Well, having toyed with Lua's very regex-like string pattern matching, I’ve come up with this little gem:
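    local t = [[some long html stuff in here]]

    -- Each rule is { Lua pattern, replacement }; rules run in order.
    local cleaner = {
        { "&amp;", "&" },
        { "—", "-" },
        { "’", "'" },
        { "&nbsp;", " " },
        { "<br[^>]*>", "\n" },  -- <br>, <br/> and <br ... /> with attributes
        { "</p>", "\n" },
        { "(%b<>)", "\n" },     -- any remaining tag
        { "\n\n*", "\n" },      -- collapse runs of newlines
        { "\n*$", "" },         -- trim trailing newlines
        { "^\n*", "" },         -- trim leading newlines
    }

    for i = 1, #cleaner do
        local cleans = cleaner[i]
        t = string.gsub( t, cleans[1], cleans[2] )
    end

    print(t)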
This swaps a few awkward characters and entities for plain equivalents, converts </p> and <br /> tags to new lines (even if the <br /> tag carries attributes!), and then replaces every remaining HTML tag with a new line. Finally it collapses runs of new lines into one and trims them from the start and end of the text.
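To make the effect of the rule order concrete, here is a minimal sketch that wraps the same kind of rule list in a function and runs it on a small sample. The function name `clean_html` and the sample string are made up for illustration, and the entity rules assume the intent is to decode `&amp;` and `&nbsp;`. The `<br>` rule uses `[^>]*` rather than a greedy `.*` so that a match cannot run past the end of the tag.

```lua
-- Hypothetical helper: applies { pattern, replacement } rules in order.
local rules = {
    { "&amp;",     "&"  },  -- assumed entity rule
    { "&nbsp;",    " "  },  -- assumed entity rule
    { "<br[^>]*>", "\n" },  -- <br>, <br/>, <br class="x" />
    { "</p>",      "\n" },  -- paragraph ends become line breaks
    { "%b<>",      "\n" },  -- any remaining tag
    { "\n\n*",     "\n" },  -- collapse runs of newlines
    { "\n*$",      ""   },  -- trim trailing newlines
    { "^\n*",      ""   },  -- trim leading newlines
}

local function clean_html(text)
    for i = 1, #rules do
        local rule = rules[i]
        -- gsub returns (result, count); the assignment keeps only the result
        text = string.gsub(text, rule[1], rule[2])
    end
    return text
end

local sample = '<p class="intro">Fish &amp; chips<br />Second line</p><p>Third</p>'
print(clean_html(sample))
-- Fish & chips
-- Second line
-- Third
```

One thing to watch: because every leftover tag turns into a new line, inline tags such as `<b>` or `<a>` will also split a sentence across lines. Replacing those with an empty string instead may be preferable, depending on the input.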
It is a very good first pass, if I do say so myself, and cleans pretty much everything I’m interested in. I will now use this for padding the text out in pages using the display.newText function.
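As a rough sketch of the paging step, assuming a page is simply a fixed number of lines, the cleaned text could be chunked like this before each chunk is handed to display.newText. The function name `paginate` and the `lines_per_page` parameter are hypothetical, and the display.newText call itself is left out.

```lua
-- Hypothetical helper: split text into pages of at most
-- `lines_per_page` lines each (page size is an assumed parameter).
local function paginate(text, lines_per_page)
    local pages, current = {}, {}
    -- Appending "\n" lets the pattern capture the final line as well.
    for line in (text .. "\n"):gmatch("(.-)\n") do
        current[#current + 1] = line
        if #current == lines_per_page then
            pages[#pages + 1] = table.concat(current, "\n")
            current = {}
        end
    end
    -- Keep any leftover lines as a final, shorter page.
    if #current > 0 then
        pages[#pages + 1] = table.concat(current, "\n")
    end
    return pages
end

local pages = paginate("one\ntwo\nthree\nfour\nfive", 2)
for i, page in ipairs(pages) do
    print(("-- page %d --"):format(i))
    print(page)
end
```

This counts lines only; it does not account for long lines wrapping on screen, so a real layout would probably need to measure the rendered text as well.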