Thank you! I found it useful. Be careful: there are emojis with a codepoint below 65535, like the red heart, smiling face, frowning face, etc. I use codepoint < 8400 (just after the currency symbols).
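For anyone who wants to try different cut-offs without the plugin, here is a minimal pure-Lua sketch. The name `stripAbove` and its `limit` parameter are made up for illustration. It assumes the input is valid UTF-8 and it makes no attempt to handle variation selectors or combined sequences.

```lua
-- Hypothetical helper: keep only characters whose code point is below `limit`.
-- Assumes `s` is valid UTF-8; malformed input is not handled.
local function stripAbove(s, limit)
    local out = {}
    local i = 1
    while i <= #s do
        local b = s:byte(i)
        local len, cp
        if b < 128 then
            len, cp = 1, b            -- ASCII
        elseif b < 224 then
            len, cp = 2, b - 192      -- two-byte sequence
        elseif b < 240 then
            len, cp = 3, b - 224      -- three-byte sequence
        else
            len, cp = 4, b - 240      -- four-byte sequence
        end
        -- Fold in the 6 payload bits of each continuation byte.
        for k = i + 1, i + len - 1 do
            cp = cp * 64 + (s:byte(k) - 128)
        end
        if cp < limit then
            out[#out + 1] = s:sub(i, i + len - 1)
        end
        i = i + len
    end
    return table.concat(out)
end

local heart = "\226\157\164"      -- U+2764 (10084), 3 bytes, inside the BMP
local grin  = "\240\159\152\128"  -- U+1F600 (128512), 4 bytes, outside the BMP

print(stripAbove("a" .. heart .. grin .. "b", 65536))  -- heart survives, grin is removed
print(stripAbove("a" .. heart .. grin .. "b", 8400))   -- both are removed
```

With a limit of 65536 the red heart gets through, which is the case the warning above is about. A limit of 8400 removes it, but also removes every script and symbol at or above that codepoint.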
What string value (in Lua) does an Emoji actually output? I don’t think I’ve ever tested that…
Brent
Got this from stackoverflow but no idea how to replicate in Lua?
function removeEmojis (string) {
  var regex = /(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])/g;
  return string.replace(regex, '');
}
I’d still be curious to see what string you’re getting from an emoji inside Lua… like what you get as the user edits a text input field. I can test it myself I suppose…
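As a rough illustration of what to expect: a Lua string is just a sequence of bytes, so if the text field hands back UTF-8 (an assumption here, not something tested on every platform), an emoji shows up as several bytes rather than one character. The helper name `dumpBytes` is hypothetical.

```lua
-- Hypothetical helper: list the decimal value of every byte in a string.
local function dumpBytes(s)
    local bytes = {}
    for i = 1, #s do
        bytes[#bytes + 1] = tostring(s:byte(i))
    end
    return table.concat(bytes, " ")
end

-- U+1F600 (grinning face) written with decimal byte escapes.
local grin = "\240\159\152\128"
print(#grin)            -- 4
print(dumpBytes(grin))  -- 240 159 152 128
```

Note that `#` counts bytes, so this single emoji reports a length of 4.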
lol… You get this in simulator
and this in console
This is super brute-force-y and would require way too much manual labor to implement fully, but in this limited test case, it works (at least in the simulator). Could help find a path to a more elegant solution:
[lua]
function stripEmoji(str)
    local str = str or ''
    local emojis = {'',''}
    for i = 1, #emojis do
        str = str:gsub(emojis[i], '')
    end
    return str
end
print(stripEmoji("this is an emoji: (it should be blank)"))
[/lua]
Ugh, and it looks like the code blocking in the forum software stripped the emojis, so it doesn’t look right. Let’s try without the block quoting:
function stripEmoji(str)
    local str = str or ''
    local emojis = {'',''}
    for i = 1, #emojis do
        str = str:gsub(emojis[i], '')
    end
    return str
end
print(stripEmoji("this is an emoji: (it should be blank)"))
Grrr. The forum won’t let me post with the emojis copied/pasted in, but here it is in a gist, and it’s working for me in simulator:
https://gist.github.com/schroederapps/365d59fb32c21fcdad764010b3f68f7f
Thanks for the input, but as you said, “find and replace” is not really practical: there are thousands of emojis and new ones get added with each new Android and iOS release.
I’ve found a full list of them - http://getemoji.com/ and http://classic.getemoji.com/
I was hoping that the regex could somehow be coded in Lua?
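Lua patterns are not regular expressions and they match bytes rather than codepoints, so the JavaScript expression cannot be ported one-to-one. As a partial sketch only: the surrogate-pair branch of that regex (`[\ud800-\udbff][\udc00-\udfff]`, i.e. everything above U+FFFF) corresponds to four-byte UTF-8 sequences, and one of the BMP ranges (`[\u2600-\u26FF]`) can be matched through its three-byte encoding. The function name is hypothetical, valid UTF-8 input is assumed, and the remaining branches of the regex are not covered.

```lua
-- Hypothetical partial port of the regex to Lua patterns (byte-based).
local function stripSomeEmoji(s)
    -- Four-byte sequences: every code point above U+FFFF.
    s = s:gsub("[\240-\247][\128-\191][\128-\191][\128-\191]", "")
    -- U+2600..U+26FF (Miscellaneous Symbols), encoded as E2 98..9B 80..BF.
    s = s:gsub("\226[\152-\155][\128-\191]", "")
    return s
end

-- U+2600 (sun) and U+1F600 (grinning face) are both removed.
print(stripSomeEmoji("sun \226\152\128 grin \240\159\152\128 end"))
```

Each further BMP range from the regex would need its own byte pattern, which is why a codepoint-based check tends to be easier to maintain.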
You should probably be using the utf8 plugin if you’re going to be dealing with strings that have utf8 characters in them. The Lua string library is not completely utf8 safe.
Rob
I was going to suggest the same thing. In particular, I wonder what “codepoint” those emojis spit out when you use the “utf8.codes()” function:
https://docs.coronalabs.com/plugin/utf8/codes.html
Brent
I use utf8 and my DB supports utf8… it is just certain emojis break this and I don’t understand why.
My problem is not with Corona per se; it is more that encrypting the values for storage seems to break.
This is how I persist my data
[lua]
local hash = crypto.digest( crypto.md5, <some random string here> )

function save( filename, dataTable )
    local myTable = json.encode( dataTable )
    local dataToWrite = mime.b64( cipher:encrypt( myTable, hash ) )
    local path = system.pathForFile( filename, system.DocumentsDirectory )
    local file = io.open( path, "w" )
    if file then
        file:write( dataToWrite )
        io.close( file )
    end
end
[/lua]
Yet the odd emoji breaks this and when I load and decrypt the json it is invalid.
I had this with an iPhone user today that had a green leaf emoji that broke my code.
I know this is rough code, but it should be good enough as a starting point to get an idea of how to “correct” the problem.
I just made an “emoji killer” with this simple code, which has worked in the limited tests I’ve done:
[lua]
for i = 1, #name do
    local c = name:sub(i,i+3)
    if string.byte(c)==240 then
        print ("found an emoji")
        name=string.gsub(name,c,".")
        print ("passed it to a dot")
    end
end
[/lua]
where name was a string result of a text field.
hope this helps.
That doesn’t work, but thanks for your input.
Hi Adrian,
I would suggest using UTF8 (the plugin) and then detecting if a character’s codepoint is >= 128512. In my testing, on a real device, the most basic “smiley face” emoji outputs a codepoint of 128512, and other emojis just climb higher from there. This leads me to believe that emojis start their codepoint at 128512 in the Unicode standard, and every other non-language character has a codepoint well below that (for example, if I type in Japanese characters, I get codepoints in the 12000-12500 range).
The following chart covers every standard emoji. Perhaps you can find the exact starting codepoint for emojis on the site if you dig around further (unless shown otherwise, I’m sticking with 128512).
http://www.unicode.org/emoji/charts/emoji-list.html
Here’s the doc for the Corona API:
https://docs.coronalabs.com/plugin/utf8/codepoint.html
[lua]
local utf8 = require( "plugin.utf8" )
for i = 1, utf8.len( myString ) do
    local c = utf8.sub( myString, i, i )  -- the single character at position i
    print( utf8.codepoint( c ) )
end
[/lua]
I’ll leave it to you to decide what to do from this. I’m not sure if you want to strip out emojis while they type in the text field, after they submit it, before it gets saved to JSON, or whatever… you’ll have to decide what works best for you, but this gives you a basis of detecting emojis using the UTF8 codepoint.
Brent
Thanks Brent, that gave me a great starting point. According to https://en.wikipedia.org/wiki/Plane_(Unicode), the BMP, which covers almost all modern languages, spans codepoints 0 to 65535, so to strip emojis I wrote this function, which may help others.
[lua]
function stripNonBMPCharacters(s)
    local utf8 = require( "plugin.utf8" )
    local res = ""
    for i = 1, utf8.len(s) do
        local c = utf8.sub(s, i, i)
        if utf8.codepoint(c) <= 65535 then
            res = res .. c
        end
    end
    return res
end
[/lua]
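This also plays nicely with MySQL’s utf8 character set, which stores at most three bytes per character and so cannot hold anything outside the BMP.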
@Sphere Game Studio, I really don’t know why my code didn’t work for you… I tried on Mac, Windows, an iPad mini and an Android device… all worked fine.
My approach was to see what code the emojis returned. The ones I tried all had the same size (4 chars) and all returned code 240, which is why I used string.byte to check their internal numerical codes. So I just had to take a substring of 4 chars and check whether it had that internal code. If it does, you can replace it with whatever you want. I used the “.” (dot).
My only concern was whether there are emojis with different sizes. I didn’t find one, and didn’t investigate further whether there are or will be.
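On the question of sizes: in UTF-8 the first byte of a character determines how many bytes follow, and not every emoji starts with 240. The red heart (U+2764), for instance, sits in the BMP and takes three bytes with a lead byte of 226, so a check for byte 240 alone will miss it. A small sketch, assuming the string starts with a valid lead byte (the helper name `utf8SeqLen` is hypothetical):

```lua
-- Hypothetical helper: UTF-8 sequence length implied by the lead byte.
-- Assumes `s` is non-empty and starts with a valid lead byte.
local function utf8SeqLen(s)
    local b = s:byte(1)
    if b < 128 then
        return 1
    elseif b < 224 then
        return 2
    elseif b < 240 then
        return 3
    else
        return 4
    end
end

local grin  = "\240\159\152\128"  -- U+1F600, lead byte 240
local heart = "\226\157\164"      -- U+2764, lead byte 226

print(utf8SeqLen(grin))   -- 4
print(utf8SeqLen(heart))  -- 3
```

Some emoji are also built from more than one codepoint (flags are pairs of regional indicator characters), so a fixed four-byte window covers only part of the cases.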
I’m glad you figured it out by yourself.
the code i used on devices to test was:
[lua]
local defaultField
local text=display.newText({text="",x=display.contentWidth*.5, y=display.contentHeight*.5+50, fontSize=14})
local text2=display.newText({text="",x=display.contentWidth*.5, y=display.contentHeight*.5+100, fontSize=14})

local function clean_emoji(textIn)
    local text=textIn
    local i=1
    while i<=#text do
        local c = text:sub(i,i+3)
        if string.byte(c)==240 then
            text=string.gsub(text,c,".")
            text2.text=c
        end
        i=i+1
    end
    return text
end

local function textListener( event )
    if ( event.phase == "began" ) then
    elseif ( event.phase == "ended" or event.phase == "submitted" ) then
    elseif ( event.phase == "editing" ) then
        text.text=clean_emoji(event.text)
    end
end

defaultField = native.newTextField( 150, 150, 180, 30 )
defaultField:addEventListener( "userInput", textListener )
[/lua]
I changed the “for” loop to a “while” loop since “while” loops are usually faster than “for” loops… nothing more.
It wasn’t that it didn’t work… it just wasn’t a 100% solution. Using utf8 codepoints was.
Great to hear you solved it Adrian! Since I got you started on the path, I assume you don’t mind that I snag your function and roll it into one of my own apps which needs to strip out emojis before data gets saved out to a JSON file. 
Brent
P.S. - In your actual game code, I assume you’re not require()-ing the UTF8 plugin inside the function which handles this, right? Seems that wouldn’t be the optimal place for it…
Help yourself Brent… I posted it here for the entire community.
I only required it in the function so it was self-contained, self-documenting and simple to copy’n’paste.