UTF-8 Decoding Differs on Older Versions of Android

Hi,

I have a small chat feature in my app where players can send each other messages. Since a player may key in a variety of characters, including multibyte characters such as emojis, I am encoding these messages to UTF-8 and then storing them on our game servers. I then decode those messages back to Unicode when they are displayed. I am using a native.newTextBox field to display these messages, and I am using the code located here to perform the actual UTF-8 encode/decode functions.

My problem is with older versions of Android (5.0 and older). If I put Android 5 and older in, say, Group 1, and Android 6 and newer (as well as any flavor of iOS that supports emojis) in Group 2, then I find that Group 1 only plays nice with messages from Group 1 users and Group 2 only plays nice with messages from Group 2 users.

To test this, I used the emoji keyboard on the devices listed below to input the standard cherries emoji :cherries:. I then UTF-8 encoded that emoji and got the following results:

Group 1

Android 4.4.2 (Samsung Galaxy S4)
í ¼í½\u0092

Android 5.0 (RCA Viking Pro tablet)
í ¼í½\u0092

Group 2

Android 6.0.1 (Samsung Tab A tablet)
ð\u009f\u008d\u0092

iOS 7.1.2 (iPhone 4)
ð\u009f\u008d\u0092

iOS 10.1.1 (iPad 4th gen)
ð\u009f\u008d\u0092 

As you can see, Group 1 and Group 2 differ as far as the UTF-8 encodings go. Does anyone know why they would differ like this? This, of course, is the root of my problem. I would imagine that if they encoded the same, they would decode and display correctly across all of these platforms.

I also ran these UTF-8 results through the online decoder located here. The Group 1 results come back as “invalid input”, while the Group 2 results decode and display the cherries correctly.

This is also my first experience with utf-8 so it could also be my lack of understanding!

Any help is greatly appreciated!

Scott

To test this issue further, I ran 5 common emojis through the code below and had the following results which are consistent with my earlier findings.

Does anyone know why Android 5 and older would encode the unicode characters into utf-8 differently than Android 6 and iOS? Do you see anything wrong with the code itself that would cause this issue?

If the code is correct, could it be possible that these older versions of Android use different codepoints than newer versions of Android for the same characters? I would have thought codepoints would be the same if/when the unicode source characters are the same.

Emoji        Android 4.4            Android 5              Android 6                  iOS 7                      iOS 10

:blush:      í ½í¸\u008a            í ½í¸\u008a            ð\u009f\u0098\u008a        ð\u009f\u0098\u008a        ð\u009f\u0098\u008a
:santa:      í ¼í¾\u0085            í ¼í¾\u0085            ð\u009f\u008e\u0085        ð\u009f\u008e\u0085        ð\u009f\u008e\u0085
:rainbow:    í ¼í¼\u0088            í ¼í¼\u0088            ð\u009f\u008c\u0088        ð\u009f\u008c\u0088        ð\u009f\u008c\u0088
:+1:         í ½í±\u008d            í ½í±\u008d            ð\u009f\u0091\u008d        ð\u009f\u0091\u008d        ð\u009f\u0091\u008d
:octopus:    í ½í°\u0099            í ½í°\u0099            ð\u009f\u0090\u0099        ð\u009f\u0090\u0099        ð\u009f\u0090\u0099

local function utf8_encode( unicode )
    local math = math
    local utf8 = ""
    -- walk the input string one byte at a time and UTF-8 encode each byte value
    for i = 1, string.len( unicode ) do
        local v = string.byte( unicode, i )
        local n, s, b = 1, "", 0
        if v >= 67108864 then n = 6; b = 252
        elseif v >= 2097152 then n = 5; b = 248
        elseif v >= 65536 then n = 4; b = 240
        elseif v >= 2048 then n = 3; b = 224
        elseif v >= 128 then n = 2; b = 192
        end
        -- emit the continuation bytes (low 6 bits each), then the lead byte
        for i = 2, n do
            local c = math.mod( v, 64 ); v = math.floor( v / 64 )
            s = string.char( c + 128 ) .. s
        end
        s = string.char( v + b ) .. s
        utf8 = utf8 .. s
    end
    return utf8
end

myChatMsg = "😊"
encodedChatMsg = utf8_encode( myChatMsg )

My post above got me to thinking about the Unicode code points, so I printed out the code points behind these emojis…see below. Keying these code points into any online encoder/decoder shows the older Android code points below to be invalid, while the Android 6/iOS ones are valid…yet the invalid ones display correctly on these older devices.

Emoji        Android 5 and older      Android 6 and iOS
             (code points)            (code points)

:blush:      55357 56842              128522
:santa:      55356 57221              127877
:rainbow:    55356 57096              127752
:+1:         55357 56397              128077
:octopus:    55357 56345              128025

It doesn’t seem to matter whether I input these emojis directly into my native.newTextField using the native keyboard or cut and paste them from a third-party source; they have the same issue. Just for kicks, I also tried downloading a new keyboard app (GO Keyboard) on the Galaxy S4 (Android 4.4), but the issue remains even with these new keyboard apps.

Code to print the code points:

local utf8 = require("plugin.utf8") local chatMsg = "😊🎅🌈👍🐙" for code in chatMsg:gmatch("[%z\1-\127\194-\244][\128-\191]\*") do print(utf8.codepoint(code)) end

Thoughts, ideas and suggestions are greatly appreciated!

I’ve asked an Engineer to drop in and take a look.

Rob

Thanks Rob

As another followup, this behavior can also be replicated in the NativeKeyboard app in the Corona sample code. I added the utf8 plugin and the code below to the fieldHandler function for the defaultField and it printed the “invalid” code points on Android 5 and older for the emojis I entered.

defaultField:addEventListener( "userInput", fieldHandler( function()
    for code in defaultField.text:gmatch( "[%z\1-\127\194-\244][\128-\191]*" ) do
        print( utf8.codepoint( code ) )
    end
    return defaultField
end ) )

Not sure if this is the case or not, but so far with my limited testing, it seems that any emojis located in Unicode plane 0 (BMP) are translated correctly on Android 5 and older while emojis above plane 0 are not. Unfortunately, the vast majority of emojis are in plane 1 (SMP). 
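For reference, a quick way to check which plane a character sits in is to divide its codepoint by 0x10000. This is just a sketch, and it assumes the utf8 plugin returns the true codepoint, which, as noted above, doesn't seem to happen for plane 1 characters on Android 5 and older:

local utf8 = require( "plugin.utf8" )

-- plane 0 (BMP) codepoints are <= 0xFFFF and take at most 3 UTF-8 bytes;
-- plane 1+ codepoints are >= 0x10000 and always take 4 UTF-8 bytes
local function unicodePlane( char )
    return math.floor( utf8.codepoint( char ) / 0x10000 )
end

print( unicodePlane( "⏳" ) )   -- 0  (U+23F3, hourglass, BMP)
print( unicodePlane( "😊" ) )   -- 1  (U+1F60A, smiling face, SMP)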

I’m unclear why you sometimes write the bytes as numbers (e.g. \u009f) and sometimes as whatever single-byte character they happen to encode (e.g. ¼). I think things would be 100% clearer if you always used numbers for the bytes. You’re getting 4 bytes in both cases, but it’s not possible to usefully compare them because of the way they are presented.

So taking the cherries emoji as an example:

Codepoint: U+1F352

Bytes: \xF0\x9F\x8D\x92

Name: cherries

The question is: what is the 4-byte sequence for this character in each of your differing cases?  Is there an obvious pattern to the differences?
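Something along these lines (an untested sketch; chatMsg is whatever string your text field hands back) should print the raw bytes so they can be compared directly:

-- dump every byte of the string as \xNN so the sequences can be compared
local function hexBytes( s )
    local out = {}
    for i = 1, #s do
        out[#out + 1] = string.format( "\\x%02X", string.byte( s, i ) )
    end
    return table.concat( out )
end

print( hexBytes( chatMsg ) )   -- a "Group 2" device should print \xF0\x9F\x8D\x92 for the cherries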

The other thing that seems to matter, after some light googling, is the exact device (or, more specifically, device manufacturer) you are testing with: prior to Android 6, it seems different forks of Android handled emojis differently, though I’m not 100% sure the differences matter at this level.

Thanks for taking a look, Perry. The output of the UTF-8 encoder code was copied and pasted directly into this post, so the results appear exactly as that code output them (crazy characters and all), i.e., no manipulation from me.

This UTF-8 inconsistency, along with the crazy characters (like ¼), led me to take a look at the underlying code points for these characters. My third post above shows that the older versions of Android “see” the code points (decimal) differently than the newer versions which might explain the crazy characters in the UTF-8 fields. Garbage in, garbage out.

I will print the bytes and see what we get. Also, the devices I used are listed in the original post.

Thanks for the help

Same crazy results, Mr. Clarke. Below is the output I received by running two emojis from plane 0 (BMP) and two emojis from plane 1 (SMP) through both Android 6 and Android 4.4. As you can see, Android 6 is correct for all emojis in both planes, but Android 4.4 is correct only for emojis in plane 0. I don’t understand why I get two of everything for emojis in plane 1 on Android 4.4. It is like it sees two sets of 3 bytes (6 bytes total), which makes no sense to this not-so-savvy-Unicode guy. Of course, therein lies my problem.

Android 6.0

plane 0 

:hourglass_flowing_sand:

Codepoint Hex: U+23F3

Codepoint Dec: 9203

Num of Bytes: 3

Bytes: \xE2\x8F\xB3

Name: Hourglass With Flowing Sand

:point_up:

Codepoint Hex: U+261D

Codepoint Dec: 9757

Num of Bytes: 3

Bytes: \xE2\x98\x9D

Name: White Up Pointing Index

plane 1

:blush:

Codepoint Hex: U+1F60A

Codepoint Dec: 128522

Num of Bytes: 4

Bytes: \xF0\x9F\x98\x8A

Name: Smiling Face With Smiling Eyes

:santa:

Codepoint Hex: U+1F385

Codepoint Dec: 127877

Num of Bytes: 4

Bytes: \xF0\x9F\x8E\x85

Name: Father Christmas

Android 4.4

plane 0 

:hourglass_flowing_sand:

Codepoint Hex: U+23F3

Codepoint Dec: 9203

Num of Bytes: 3

Bytes: \xE2\x8F\xB3

Name: Hourglass With Flowing Sand

:point_up:

Codepoint Hex: U+261D

Codepoint Dec: 9757

Num of Bytes: 3

Bytes: \xE2\x98\x9D

Name: White Up Pointing Index

plane 1

:blush:

Codepoint Hex: U+D83D

Codepoint Dec: 55357

Num of Bytes: 3

Bytes: \xED\xA0\xBD

Name: Smiling Face With Smiling Eyes

Codepoint Hex: U+DE0A

Codepoint Dec: 56842

Num of Bytes: 3

Bytes: \xED\xB8\x8A

:santa:

Codepoint Hex: U+D83C

Codepoint Dec: 55356

Num of Bytes: 3

Bytes: \xED\xA0\xBC

Name: Father Christmas

Codepoint Hex: U+DF85

Codepoint Dec: 57221

Num of Bytes: 3

Bytes: \xED\xBE\x85

Here is the code I used to print these results where chatMsg is entered via a native.newTextField.

local utf8 = require( "plugin.utf8" )

for code in chatMsg:gmatch( "[%z\1-\127\194-\244][\128-\191]*" ) do
    local bytes = {}
    local hexString = ""
    -- collect the raw bytes of this UTF-8 sequence
    for i = 1, string.len( code ) do
        bytes[i] = string.byte( code, i )
    end
    for k, v in pairs( bytes ) do
        hexString = hexString .. string.format( "\\x%02X", v )
    end
    print( " " )
    print( "Codepoint Hex:", "U+" .. string.format( "%X", utf8.byte( code ) ) )
    print( "Codepoint Dec:", utf8.codepoint( code ) )
    print( "Num of Bytes:", string.len( code ) )
    print( "Bytes:", hexString )
end

Thanks again for all the help!

I think I am dealing with a UTF-16 issue with any characters that are outside the BMP (plane 0). According to this FAQ, Java uses UTF-16 to store strings internally. The emojis above the BMP appear to be coming back to my app as a pair of 16-bit code units (called surrogates).

If I convert these code units (hex code points from the emojis above) into standard Javascript notation (\uD83C\uDF85) and plug them into the encoder/decoder located here then it displays the correct Father Christmas emoji.

I will experiment with storing these emojis in UTF-16 and report my results back here but in the meantime, if anyone is aware of a lua-based UTF-16 encoder/decoder, I would be very grateful!

After several weeks of trial and error, I “think” that I have this working for my particular situation. However, I still do not understand why I am seeing the different behavior across devices. Either way, here is what I think is happening and what I did to get around it. 

Any characters outside of the Unicode BMP (code points greater than 65,535) are represented in UTF-16 as two 16-bit units called high and low surrogates. It appears that Android 6 and higher, as well as any flavor of iOS, either convert these UTF-16 surrogate pairs into their single Unicode codepoint equivalent, or they have newer internal code that lets them calculate the single Unicode codepoint and bypass the surrogates altogether (or something along those lines). Android 5 and older do not seem to do either; they just leave them as UTF-16 surrogate pairs. The examples in my previous posts above can attest to that behavior.
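For anyone who wants the raw math, the conversion in both directions is just arithmetic. Here is a minimal sketch in plain Lua (no plugins needed; the function names are only for illustration):

-- convert a codepoint above the BMP to its UTF-16 surrogate pair, and back
local function toSurrogates( cp )
    local offset = cp - 0x10000
    local high = 0xD800 + math.floor( offset / 0x400 )   -- 0xD800..0xDBFF
    local low  = 0xDC00 + ( offset % 0x400 )             -- 0xDC00..0xDFFF
    return high, low
end

local function fromSurrogates( high, low )
    return ( high - 0xD800 ) * 0x400 + ( low - 0xDC00 ) + 0x10000
end

print( toSurrogates( 0x1F352 ) )            --> 55356   57170   (0xD83C 0xDF52, the cherries)
print( fromSurrogates( 0xD83C, 0xDF52 ) )   --> 127826  (0x1F352)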

This becomes a problem when displaying characters above the BMP, such as the many emojis in use today. Even though these newer devices seem to “automatically” convert surrogates into single codepoints during creation, they do not seem to know how to deal with a surrogate pair when they see one in the wild. And the older devices do not seem to know how to deal with the one Unicode codepoint either. This is also the behavior I was noting in my original post when I said, “Group 1 only plays nice with messages from Group 1 users and Group 2 only plays nice with messages from Group 2 users.”

As an example, the following string contains the codepoints for both a surrogate pair (%55357%56842) for :blush: and a single Unicode codepoint (%127877) for :santa:, two emojis that are both located above the BMP.

local testStr = utf8.escape( "%55357%56842%127877" )

If you display testStr via a native.newTextBox, you will get the following:

iOS and Android 6 and newer will ignore the surrogate pairs and just print out Father Christmas 🎅 while Android 5 and older will ignore the single Unicode codepoint and just print out the Smiling Face with Smiling Eyes emoji 😊. When I say ignore, some devices will leave them blank while others will insert little boxes and on occasion some gibberish characters.

So I had to find a way for each respective OS to convert what it didn’t understand into something that it did. Below is what I came up with, based upon the JavaScript code that I found here, which addressed this same issue, only with older browsers. You will need the utf8 and bit plugins. No matter whether the device returns a UTF-16 surrogate pair or a single code point, I just save it as is. Nothing special happens when the user keys the emojis in; I just save them as codepoints (myCodePoint) using JSON onto my game servers.

local utf8 = require( "plugin.utf8" ) local chatMsg = "😊🎅" -- convert unicode into utf8 codepoints if chatMsg ~= nil then for code in chatMsg:gmatch("[%z\1-\127\194-\244][\128-\191]\*") do myCodePoint = myCodePoint.."%"..utf8.codepoint(code) end else myCodePoint = "" end 

But when it comes time to display these saved codepoints, I use the code below to determine whether I am on an Android 5 or older device (stored in my globals file). If I am, and I run across a single Unicode codepoint, I convert it to its respective surrogate pair. And vice versa: if I am on a newer flavor of Android or iOS and run across a surrogate pair, I convert that pair to a single codepoint. I know there are better ways to do this, but after two weeks, I am just glad that I found something that works.

local bit = require("plugin.bit") local myCodePoint = "" local codePoint, higher, lowsur, chkCode, point, offset -- this group doesnt understand high/low surrogates so convert those into Unicode codepoints if \_myG["platName"] == "Android" and \_myG["apiLevel"] \> 21 or \_myG["platName"] ~= "Android" then for k in string.gmatch( myChatMsg, "%d+" ) do codePoint = tonumber(k) if codePoint \>= 55296 and codePoint \<= 56319 then -- high surrogate highsur = codePoint elseif codePoint \>= 56320 and codePoint \<= 57343 then -- low surrogate lowsur = codePoint -- I now have my surrogate pair so convert to single decimal chkCode = ((highsur - 0xD800) \* 0x400) + (lowsur - 0xDC00) + 0x10000 myCodePoint = myCodePoint.."%"..chkCode else myCodePoint = myCodePoint.."%"..codePoint -- no surrogates so just use the codepoint that was passed end end else -- this group doesnt understand Unicode codepoints so convert those into high/low surrogates for k in string.gmatch( myChatMsg, "%d+" ) do codePoint = tonumber(k) if codePoint \> 65535 then point = codePoint offset = point - 0x10000 if point \> 0xFFFF then -- I have a single decimal code point so convert to surrogates highsur = 0xD800 + (bit.rshift(offset, 10)) lowsur = 0xDC00 + (bit.band(offset, 0x3FF)) myCodePoint = myCodePoint.."%"..highsur.."%"..lowsur end else myCodePoint = myCodePoint.."%"..codePoint end end end decodedMsg = utf8.escape(myCodePoint) &nbsp;

So on newer devices, this string, “%55357%56842%127877” is converted to this string, “%128522%127877” and displays both emojis correctly.

While on older devices, this same string, “%55357%56842%127877” is converted to this string, “%55357%56842%55356%57221” and displays both emojis correctly.
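For anyone double-checking those numbers, they are just the same surrogate formulas applied in each direction (plain arithmetic, nothing device-specific):

-- "%55357%56842" (the surrogate pair for 😊) collapses to a single codepoint on newer devices:
print( ( 55357 - 0xD800 ) * 0x400 + ( 56842 - 0xDC00 ) + 0x10000 )   -- 128522

-- "%127877" (🎅) expands to its surrogate pair on the older devices:
local offset = 127877 - 0x10000
print( 0xD800 + math.floor( offset / 0x400 ), 0xDC00 + ( offset % 0x400 ) )   -- 55356   57221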

So far I have tested on 4 different devices (2 Android, 2 iOS) with around 100 emojis and it seems to be working. Hopefully this will help someone in the future avoid the near Unicode mental breakdown that I had!  :blink:
