UTF-8 Decoding Differs on Older Versions of Android

Same crazy results, Mr. Clarke. Below is the output I received by running two emojis from plane 0 (BMP) and two emojis from plane 1 (SMP) through both Android 6 and Android 4.4. As you can see, Android 6 is correct for all emojis in both planes, but Android 4.4 is correct only for emojis in plane 0. I don’t understand why I get two of everything for emojis in plane 1 on Android 4.4. It is as if it sees two sets of 3 bytes (6 bytes total) instead of one set of 4 bytes, which makes no sense to this not-so-Unicode-savvy guy. Of course, therein lies my problem.

_Android 6.0_

plane 0 

:hourglass_flowing_sand:

Codepoint Hex: U+23F3

Codepoint Dec: 9203

Num of Bytes: 3

Bytes: \xE2\x8F\xB3

Name: Hourglass With Flowing Sand

:point_up:

Codepoint Hex: U+261D

Codepoint Dec: 9757

Num of Bytes: 3

Bytes: \xE2\x98\x9D

Name: White Up Pointing Index

plane 1

:blush:

Codepoint Hex: U+1F60A

Codepoint Dec: 128522

Num of Bytes: 4

Bytes: \xF0\x9F\x98\x8A

Name: Smiling Face With Smiling Eyes

:santa:

Codepoint Hex: U+1F385

Codepoint Dec: 127877

Num of Bytes: 4

Bytes: \xF0\x9F\x8E\x85

Name: Father Christmas

_Android 4.4_

plane 0 

:hourglass_flowing_sand:

Codepoint Hex: U+23F3

Codepoint Dec: 9203

Num of Bytes: 3

Bytes: \xE2\x8F\xB3

Name: Hourglass With Flowing Sand

:point_up:

Codepoint Hex: U+261D

Codepoint Dec: 9757

Num of Bytes: 3

Bytes: \xE2\x98\x9D

Name: White Up Pointing Index

plane 1

:blush:

Codepoint Hex: U+D83D

Codepoint Dec: 55357

Num of Bytes: 3

Bytes: \xED\xA0\xBD

Name: Smiling Face With Smiling Eyes

Codepoint Hex: U+DE0A

Codepoint Dec: 56842

Num of Bytes: 3

Bytes: \xED\xB8\x8A

:santa:

Codepoint Hex: U+D83C

Codepoint Dec: 55356

Num of Bytes: 3

Bytes: \xED\xA0\xBC

Name: Father Christmas

Codepoint Hex: U+DF85

Codepoint Dec: 57221

Num of Bytes: 3

Bytes: \xED\xBE\x85

Here is the code I used to print these results, where chatMsg is the text entered via a native.newTextField.

    local utf8 = require( "plugin.utf8" )

    -- Walk chatMsg one UTF-8 sequence at a time: a lead byte plus any continuation bytes
    for code in chatMsg:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        -- Build the hex dump in byte order, two hex digits per byte
        local hexString = ""
        for i = 1, string.len(code) do
            hexString = hexString .. string.format("\\x%02X", string.byte(code, i))
        end

        print(" ")
        print("Codepoint Hex:", "U+" .. string.format("%X", utf8.byte(code)))
        print("Codepoint Dec:", utf8.codepoint(code))
        print("Num of Bytes:", string.len(code))
        print("Bytes:", hexString)
    end

Thanks again for all the help!

I think I am dealing with a UTF-16 issue with any characters that are outside the BMP (plane 0). According to this FAQ, Java uses UTF-16 to store strings internally. For emojis above the BMP, the older device appears to be returning a pair of UTF-16 code units (called surrogates) to my app instead of a single code point.
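The six bytes in the Android 4.4 listing are consistent with that explanation: they are what comes out if each surrogate is pushed through the UTF-8 bit layout on its own instead of being combined into one code point first. Here is a minimal sketch in plain Lua (no plugins) that reproduces both byte sequences. `utf8Bytes` is a hypothetical helper written only for this illustration; it is not part of the utf8 plugin.

```lua
-- Hypothetical helper: encode one code point with the standard UTF-8 bit layout
-- and return the bytes as a "\xNN" hex string. It deliberately does not reject
-- surrogates (0xD800-0xDFFF), which a strict encoder would.
local function utf8Bytes(cp)
    local bytes
    if cp < 0x80 then
        bytes = { cp }
    elseif cp < 0x800 then
        bytes = { 0xC0 + math.floor(cp / 0x40),
                  0x80 + cp % 0x40 }
    elseif cp < 0x10000 then
        bytes = { 0xE0 + math.floor(cp / 0x1000),
                  0x80 + math.floor(cp / 0x40) % 0x40,
                  0x80 + cp % 0x40 }
    else
        bytes = { 0xF0 + math.floor(cp / 0x40000),
                  0x80 + math.floor(cp / 0x1000) % 0x40,
                  0x80 + math.floor(cp / 0x40) % 0x40,
                  0x80 + cp % 0x40 }
    end
    local out = {}
    for i, b in ipairs(bytes) do
        out[i] = string.format("\\x%02X", b)
    end
    return table.concat(out)
end

-- One code point, U+1F60A: the 4 bytes Android 6 reports
print(utf8Bytes(0x1F60A))                       --> \xF0\x9F\x98\x8A
-- Two surrogates, U+D83D and U+DE0A, encoded separately: the 3 + 3 bytes Android 4.4 reports
print(utf8Bytes(0xD83D) .. utf8Bytes(0xDE0A))   --> \xED\xA0\xBD\xED\xB8\x8A
```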

If I convert these code units (the hex code points from the emojis above) into standard JavaScript notation (\uD83C\uDF85) and plug them into the encoder/decoder located here, then it displays the correct Father Christmas emoji.

I will experiment with storing these emojis in UTF-16 and report my results back here, but in the meantime, if anyone is aware of a Lua-based UTF-16 encoder/decoder, I would be very grateful!
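Going from the decimal code units in the listing to that notation is only a hex format. A small sketch in plain Lua, where `toJsEscape` is a hypothetical helper name used for illustration:

```lua
-- Hypothetical helper: format decimal UTF-16 code units as JavaScript \uXXXX escapes
local function toJsEscape(...)
    local out = {}
    for i, unit in ipairs({ ... }) do
        out[i] = string.format("\\u%04X", unit)
    end
    return table.concat(out)
end

-- The two code units Android 4.4 reported for :santa:
print(toJsEscape(55356, 57221))  --> \uD83C\uDF85
```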

After several weeks of trial and error, I “think” that I have this working for my particular situation. However, I still do not understand why I am seeing the different behavior across devices. Either way, here is what I think is happening and what I did to get around it. 

Any character outside the Unicode BMP (a code point greater than 65,535) is represented in UTF-16 as two 16-bit code units called the high and low surrogates. It appears that Android 6 and higher, as well as any flavor of iOS, either convert these UTF-16 surrogate pairs into their single Unicode code point equivalent or have newer internal code that calculates the single code point and bypasses the surrogates altogether (or something along those lines). Android 5 and older do not seem to be able to do either; they just leave them as UTF-16 surrogate pairs. The examples from my previous posts above attest to that behavior.
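The arithmetic that links a surrogate pair to its single code point is short enough to show on its own. This is a sketch in plain Lua with no plugins; `toSurrogates` and `fromSurrogates` are hypothetical helper names, and the example values are the :blush: numbers from the listings above.

```lua
-- Hypothetical helpers for illustration; plain Lua arithmetic, no plugins.

-- Split a code point above 0xFFFF into its UTF-16 high and low surrogates.
local function toSurrogates(cp)
    local offset = cp - 0x10000
    local high = 0xD800 + math.floor(offset / 0x400)  -- top 10 bits of the offset
    local low  = 0xDC00 + offset % 0x400              -- bottom 10 bits of the offset
    return high, low
end

-- Combine a high/low surrogate pair back into a single code point.
local function fromSurrogates(high, low)
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
end

print(toSurrogates(128522))          --> 55357   56842
print(fromSurrogates(55357, 56842))  --> 128522
```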

This becomes a problem when displaying characters above the BMP, such as the many emojis in use today. Even though these newer devices seem to “automatically” convert surrogates into single codepoints during creation, they do not seem to know how to deal with a surrogate pair when they see one in the wild. And the older devices do not seem to know how to deal with the one Unicode codepoint either. This is also the behavior I was noting in my original post when I said, “Group 1 only plays nice with messages from Group 1 users and Group 2 only plays nice with messages from Group 2 users.”

As an example, the following string contains the code points for both a surrogate pair (%55357%56842) :blush: and a single Unicode code point (%127877) :santa:, for the two emojis shown, which are both located above the BMP.

local testStr = utf8.escape( "%55357%56842%127877" )

If you display testStr via a native.newTextBox, you will get the following:

iOS and Android 6 and newer will ignore the surrogate pairs and just print out Father Christmas 🎅 while Android 5 and older will ignore the single Unicode codepoint and just print out the Smiling Face with Smiling Eyes emoji 😊. When I say ignore, some devices will leave them blank while others will insert little boxes and on occasion some gibberish characters.

So I had to find a way for each respective OS to convert what it didn’t understand into something that it did. Below is what I came up with, based upon the JavaScript code that I found here, which addressed this same issue only with older browsers. You will need the utf8 and bit plugins. Whether the device returns a UTF-16 surrogate pair or a single code point, I just save it as is; nothing special happens when the user keys the emojis in. I save them as code points (myCodePoint) using JSON on my game servers.

    local utf8 = require( "plugin.utf8" )

    local chatMsg = "😊🎅"
    local myCodePoint = ""  -- must start as a string, or the concatenation below fails on nil

    -- Convert the UTF-8 message into a "%"-delimited list of decimal code points
    if chatMsg ~= nil then
        for code in chatMsg:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
            myCodePoint = myCodePoint .. "%" .. utf8.codepoint(code)
        end
    end

But when it comes time to display these saved code points, I use the code below to determine whether I am on an Android 5 or older device (the platform name and API level are stored in my myGlobals file). If I am and I run across a single Unicode code point above the BMP, I convert it to its respective surrogate pair. And vice versa: if I am on a newer flavor of Android or iOS and run across a surrogate pair, I convert that pair to a single code point. I know there are better ways to do this but after two weeks, I am just glad that I found something that works.

    local bit = require("plugin.bit")

    local myCodePoint = ""
    local codePoint, highsur, lowsur, chkCode, offset

    -- This group doesn't understand high/low surrogates, so convert those into Unicode code points.
    -- Note: apiLevel > 21 also admits API 22 (Android 5.1); use >= 23 if only Android 6 and newer belong here.
    if (_myG["platName"] == "Android" and _myG["apiLevel"] > 21) or _myG["platName"] ~= "Android" then
        for k in string.gmatch( myChatMsg, "%d+" ) do
            codePoint = tonumber(k)
            if codePoint >= 55296 and codePoint <= 56319 then
                -- high surrogate: hold it until the low surrogate arrives
                highsur = codePoint
            elseif codePoint >= 56320 and codePoint <= 57343 and highsur then
                -- low surrogate: I now have my surrogate pair, so convert to a single decimal
                lowsur = codePoint
                chkCode = ((highsur - 0xD800) * 0x400) + (lowsur - 0xDC00) + 0x10000
                myCodePoint = myCodePoint .. "%" .. chkCode
                highsur = nil
            else
                -- no surrogates, so just use the code point that was passed
                myCodePoint = myCodePoint .. "%" .. codePoint
            end
        end
    else
        -- This group doesn't understand code points above the BMP, so convert those into high/low surrogates
        for k in string.gmatch( myChatMsg, "%d+" ) do
            codePoint = tonumber(k)
            if codePoint > 0xFFFF then
                -- I have a single decimal code point, so convert to surrogates
                offset = codePoint - 0x10000
                highsur = 0xD800 + bit.rshift(offset, 10)
                lowsur = 0xDC00 + bit.band(offset, 0x3FF)
                myCodePoint = myCodePoint .. "%" .. highsur .. "%" .. lowsur
            else
                myCodePoint = myCodePoint .. "%" .. codePoint
            end
        end
    end

    decodedMsg = utf8.escape(myCodePoint)

So on newer devices, this string, “%55357%56842%127877” is converted to this string, “%128522%127877” and displays both emojis correctly.

While on older devices, this same string, “%55357%56842%127877” is converted to this string, “%55357%56842%55356%57221” and displays both emojis correctly.
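Both conversions can be checked off-device with a condensed, plugin-free sketch of the same idea. `normalize` and its `useSurrogates` flag are hypothetical names for this illustration; the flag stands in for the platform/API-level check, and `math.floor` and `%` replace the bit plugin. Unpaired surrogates are passed through unchanged here, which is an assumption rather than something I tested on a device.

```lua
-- Hypothetical, plugin-free sketch: rewrite a "%ddd%ddd..." code point string
-- either to surrogate pairs (useSurrogates = true, the "older Android" group)
-- or to single code points (useSurrogates = false, the "newer" group).
local function normalize(codePoints, useSurrogates)
    local out = {}
    local pendingHigh = nil
    for k in string.gmatch(codePoints, "%d+") do
        local cp = tonumber(k)
        if useSurrogates then
            if cp > 0xFFFF then
                -- single code point above the BMP: split into high/low surrogates
                local offset = cp - 0x10000
                out[#out + 1] = 0xD800 + math.floor(offset / 0x400)
                out[#out + 1] = 0xDC00 + offset % 0x400
            else
                out[#out + 1] = cp
            end
        elseif cp >= 0xD800 and cp <= 0xDBFF then
            -- high surrogate: flush any earlier unpaired one, then hold this one
            if pendingHigh then out[#out + 1] = pendingHigh end
            pendingHigh = cp
        elseif cp >= 0xDC00 and cp <= 0xDFFF and pendingHigh then
            -- low surrogate completing a pair: combine into one code point
            out[#out + 1] = (pendingHigh - 0xD800) * 0x400 + (cp - 0xDC00) + 0x10000
            pendingHigh = nil
        else
            if pendingHigh then out[#out + 1] = pendingHigh end
            pendingHigh = nil
            out[#out + 1] = cp
        end
    end
    if pendingHigh then out[#out + 1] = pendingHigh end
    if #out == 0 then return "" end
    return "%" .. table.concat(out, "%")
end

print(normalize("%55357%56842%127877", false))  --> %128522%127877
print(normalize("%55357%56842%127877", true))   --> %55357%56842%55356%57221
```

The result would then be handed to utf8.escape, as in the code above.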

So far I have tested on 4 different devices (2 Android, 2 iOS) with around 100 emojis and it seems to be working. Hopefully this will help someone in the future avoid the near Unicode mental breakdown that I had!  :blink: