After several weeks of trial and error, I “think” that I have this working for my particular situation. However, I still do not understand why I am seeing the different behavior across devices. Either way, here is what I think is happening and what I did to get around it.
Any character outside the Unicode BMP (a codepoint greater than 65,535) is encoded in UTF-16 as two 16-bit units called a high and a low surrogate. It appears that Android 6 and higher, as well as any flavor of iOS, either converts these UTF-16 surrogate pairs into their single Unicode codepoint equivalent, or has newer internal code that calculates the single Unicode codepoint and bypasses the surrogates altogether (or something along those lines). Android 5 and older does not seem to be able to do either; it just leaves them as UTF-16 surrogate pairs. The examples from my previous posts above can attest to that behavior.
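To make the math concrete, here is a minimal sketch of how one codepoint above the BMP maps to its surrogate pair and back. It uses plain Lua arithmetic instead of the bit plugin, just to show the formulas; the example codepoint (128522, the Smiling Face emoji) is my own choice.

-- codepoint -> high/low surrogate pair
local codePoint = 128522                                   -- 0x1F60A, above the BMP
local offset    = codePoint - 0x10000                      -- 0xF60A
local highsur   = 0xD800 + math.floor( offset / 0x400 )    -- 55357 (0xD83D)
local lowsur    = 0xDC00 + ( offset % 0x400 )              -- 56842 (0xDE0A)
print( highsur, lowsur )                                   -- 55357  56842

-- high/low surrogate pair -> codepoint
local decoded = ( (highsur - 0xD800) * 0x400 ) + ( lowsur - 0xDC00 ) + 0x10000
print( decoded )                                           -- 128522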
This becomes a problem when displaying characters above the BMP, such as the many emojis in use today. Even though these newer devices seem to “automatically” convert surrogates into single codepoints during creation, they do not seem to know how to deal with a surrogate pair when they see one in the wild. And the older devices do not seem to know how to deal with a single Unicode codepoint either. This is also the behavior I was noting in my original post when I said, “Group 1 only plays nice with messages from Group 1 users and Group 2 only plays nice with messages from Group 2 users.”
As an example, the following string contains the codepoints for both a surrogate pair (%55357%56842) and a single Unicode codepoint (%127877) for the two emojis shown, which are both located above the BMP.
local testStr = utf8.escape( "%55357%56842%127877" )
If you display testStr via a native.newTextBox, you will get the following:
iOS and Android 6 and newer will ignore the surrogate pair and just print out Father Christmas 🎅, while Android 5 and older will ignore the single Unicode codepoint and just print out the Smiling Face with Smiling Eyes emoji 😊. When I say ignore, some devices will leave them blank, while others will insert little boxes and, on occasion, some gibberish characters.
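For reference, this is roughly how the test string can be dropped into a text box to reproduce the behavior; the position, size, and font size are just placeholder values, not anything special.

local utf8 = require( "plugin.utf8" )

local testStr = utf8.escape( "%55357%56842%127877" )

-- what actually renders here differs by OS version, as described above
local textBox = native.newTextBox( display.contentCenterX, display.contentCenterY, 280, 80 )
textBox.size = 24
textBox.text = testStr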
So I had to find a way for each respective OS to convert what it didn’t understand into something that it did. Below is what I came up with, based upon the JavaScript code that I found here that addressed this same issue, only with older browsers. You will need the utf8 and bit plugins. Whether the device returns a UTF-16 surrogate pair or a single codepoint, I just save it as is. Nothing special happens when the user keys the emojis in; I just save them as codepoints (myCodePoint) using JSON onto my game servers.
local utf8 = require( "plugin.utf8" )

local chatMsg = "😊🎅"
local myCodePoint = ""

-- break the UTF-8 string into characters and record each one's decimal codepoint
if chatMsg ~= nil then
    for code in chatMsg:gmatch( "[%z\1-\127\194-\244][\128-\191]*" ) do
        myCodePoint = myCodePoint .. "%" .. utf8.codepoint( code )
    end
end
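Depending on what the device handed back from the text field, myCodePoint ends up as something like “%128522%127877” (single codepoints) or the surrogate form on older devices, and that string is what gets saved. A rough sketch of the save step is below; sendToServer and the field name are placeholders for however you talk to your own backend, not real APIs.

local json = require( "json" )

-- package the codepoint string for the server; the field name "chatMsg" is just my own
local payload = json.encode( { chatMsg = myCodePoint } )   -- e.g. {"chatMsg":"%128522%127877"}
-- sendToServer( payload )   -- stand-in for network.request() or whatever your backend uses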
But when it comes time to display these saved codepoints, I use the code below to determine if I am on an Android 5 or older device (that info is stored in my myGlobals file). If I am, and I run across a single Unicode codepoint, then I convert it to its respective surrogate pair. And vice versa: if I am on a newer flavor of Android or on iOS and run across a surrogate pair, then I convert that pair to a single codepoint. I know there are better ways to do this, but after two weeks I am just glad that I found something that works.
local utf8 = require( "plugin.utf8" )
local bit  = require( "plugin.bit" )

local myCodePoint = ""
local codePoint, highsur, lowsur, chkCode, point, offset

-- myChatMsg holds the saved codepoint string, e.g. "%55357%56842%127877"

if ( _myG["platName"] == "Android" and _myG["apiLevel"] > 21 ) or _myG["platName"] ~= "Android" then
    -- this group doesn't understand high/low surrogates, so convert those into Unicode codepoints
    for k in string.gmatch( myChatMsg, "%d+" ) do
        codePoint = tonumber( k )
        if codePoint >= 55296 and codePoint <= 56319 then
            -- high surrogate
            highsur = codePoint
        elseif codePoint >= 56320 and codePoint <= 57343 then
            -- low surrogate; I now have my surrogate pair, so convert to a single decimal codepoint
            lowsur = codePoint
            chkCode = ( (highsur - 0xD800) * 0x400 ) + ( lowsur - 0xDC00 ) + 0x10000
            myCodePoint = myCodePoint .. "%" .. chkCode
        else
            -- no surrogates, so just use the codepoint that was passed
            myCodePoint = myCodePoint .. "%" .. codePoint
        end
    end
else
    -- this group doesn't understand Unicode codepoints above the BMP, so convert those into high/low surrogates
    for k in string.gmatch( myChatMsg, "%d+" ) do
        codePoint = tonumber( k )
        if codePoint > 65535 then
            -- I have a single decimal codepoint above the BMP, so convert it to surrogates
            point   = codePoint
            offset  = point - 0x10000
            highsur = 0xD800 + bit.rshift( offset, 10 )
            lowsur  = 0xDC00 + bit.band( offset, 0x3FF )
            myCodePoint = myCodePoint .. "%" .. highsur .. "%" .. lowsur
        else
            myCodePoint = myCodePoint .. "%" .. codePoint
        end
    end
end

local decodedMsg = utf8.escape( myCodePoint )
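In case it helps, here is a rough sketch of how the two values used in that check could be populated in the myGlobals file. The _myG table and key names are just my own; the system.getInfo keys are the standard Corona/Solar2D ones, but double-check them against the docs for your build.

local _myG = {}

_myG["platName"] = system.getInfo( "platformName" )                     -- e.g. "Android" or "iPhone OS"
if _myG["platName"] == "Android" then
    _myG["apiLevel"] = tonumber( system.getInfo( "androidApiLevel" ) )  -- e.g. 22 for Android 5.1, 23 for Android 6
else
    _myG["apiLevel"] = 0                                                -- not used on non-Android devices
end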
So on newer devices, the string “%55357%56842%127877” is converted to “%128522%127877”, and both emojis display correctly.
While on older devices, that same string, “%55357%56842%127877”, is converted to “%55357%56842%55356%57221”, and both emojis display correctly there as well.
So far I have tested on 4 different devices (2 Android, 2 iOS) with around 100 emojis and it seems to be working. Hopefully this will help someone in the future avoid the near Unicode mental breakdown that I had! :blink: