string.upper broken?

carloscosta · June 15, 2016, 8:59am

i’ve made this simple code:

    local a = "mão maça ícone"
    a = string.upper(a)
    local options = {
        text = a,
        font = native.systemFont,
        fontSize = 14
    }
    local t = display.newText(options)
    t.x = 100
    t.y = 100

This is supposed to put all the words in CAPS, but letters with accents don’t work. The Lua manual says it works with accents, and I remember seeing this work, since I deleted my own function for this and started using the Lua function. I’m using the latest version, 2901.

please confirm this is a bug so i can submit it.

rob · June 15, 2016, 3:51pm

Before you do please look at the UTF8 plugin. It’s a UTF8 friendly version of the string library.

https://store.coronalabs.com/plugin/utf-8

Rob

carloscosta · June 15, 2016, 4:47pm

hi Rob, thanks for the tip.

My first thought was exactly that it was a UTF-8 problem, so I used this old code, not written by me, to check whether that was the cause:

    --
    -- Converts all unicode characters (>127) into UTF-8 character sets
    -- @param unicode ASCII or unicoded string
    -- @return a UTF-8 representation
    function utf8_encode(unicode)
        local math = math
        local utf8 = ""
        for i = 1, string.len(unicode) do
            local v = string.byte(unicode, i)
            local n, s, b = 1, "", 0
            if v >= 67108864 then n = 6; b = 252
            elseif v >= 2097152 then n = 5; b = 248
            elseif v >= 65536 then n = 4; b = 240
            elseif v >= 2048 then n = 3; b = 224
            elseif v >= 128 then n = 2; b = 192
            end
            for i = 2, n do
                local c = math.mod(v, 64); v = math.floor(v / 64)
                s = string.char(c + 128)..s
            end
            s = string.char(v + b)..s
            utf8 = utf8..s
        end
        return utf8
    end

with no success. I saw the Corona plugin but I have to confess I didn’t try it. I thought it took the same approach, and I remember seeing this work without any plugin.

That said, I installed the plugin like you suggested and it worked like a charm. So for now my problem is solved, thanks.

But my question still remains: string.upper() should work. I read this in the Lua manual:

Both string.upper and string.lower follow the current locale. Therefore, if you work with the European Latin-1 locale, the expression string.upper("ação") results in "AÇÃO".

I used system.getPreference to check my locale country/language and it was correct (pt). So the problem seems to be that Corona is not applying the locale correctly in string.upper.

rob · June 15, 2016, 6:14pm

The “Latin-1” character set uses one-byte values to represent each character. It’s an extension of the original ASCII. In ASCII, only 7 of the 8 bits are used to identify the character; the 8th bit had a different use. As computers became more internationalized, the ASCII character set was extended to use all 256 possible values. There are several variants of this one-byte, one-glyph scheme out there; Microsoft had their own, for example. UTF-8, however, is a multi-byte encoding that allows way more glyphs in the character set.

In Latin-1, an ñ is represented by one byte. In UTF-8 it takes two bytes, and other characters can take up to four. The string you are working with is likely a UTF-8 string. Lua’s core string functions just work on bytes. If your editor was set to Latin-1 it probably would work as expected. This is why we provide the UTF-8 plugin, since most things today are UTF-8 based.
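To make the byte counts concrete, here is a minimal sketch in plain Lua. It assumes the interpreter is running in its default "C" locale, and it writes the characters as byte escapes so the result does not depend on how the file itself is saved. The variable names are only for illustration.

```lua
-- "ñ" (U+00F1) written as explicit bytes in each encoding.
local utf8_n_tilde   = "\195\177"  -- UTF-8: two bytes, 0xC3 0xB1
local latin1_n_tilde = "\241"      -- Latin-1: one byte, 0xF1

print(#utf8_n_tilde)    -- length in bytes of the UTF-8 form
print(#latin1_n_tilde)  -- length in bytes of the Latin-1 form

-- string.upper works byte by byte: the ASCII letters change,
-- the two bytes that make up "ñ" are passed through untouched.
local word  = "ma" .. utf8_n_tilde .. "ana"
local upper = string.upper(word)
print(upper == "MA" .. utf8_n_tilde .. "ANA")
```

The length operator counts bytes, not characters, which is why the two forms of the same letter report different lengths.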

Rob

Joshua_Quick · June 15, 2016, 7:07pm

Lua’s string.upper() and string.lower() functions can only convert *ASCII* characters. So, English characters only.

Those functions do not support any other text encoding.

The system locale is not a factor. Trust me. I’ve seen the internals of the official Lua C code (it’s open source).

To convert UTF-8 encoded strings, you need to use the UTF-8 plugin as Rob suggested. You must also ensure that your Lua file is using a UTF-8 encoding without a BOM signature. If you’re coding on Windows, be aware that most code editors on that platform default to ANSI encoding, not UTF-8.
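A rough sketch of the plugin route, assuming the project has the UTF-8 plugin enabled and that it loads as "plugin.utf8" and exposes an upper function mirroring string.upper. The pcall guard and the fallback branch are only there so the snippet still loads where the plugin is not available; they are not part of the plugin.

```lua
-- Assumption: inside Corona with the UTF-8 plugin enabled, this require succeeds.
local ok, utf8 = pcall(require, "plugin.utf8")

-- "mão maça ícone" written as UTF-8 byte escapes.
local a = "m\195\163o ma\195\167a \195\173cone"

local result
if ok and type(utf8) == "table" and utf8.upper then
    -- UTF-8 aware conversion: accented letters are converted too.
    result = utf8.upper(a)
else
    -- Byte-wise conversion: only the ASCII letters change.
    result = string.upper(a)
end
print(result)
```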

carloscosta · June 16, 2016, 8:53am

@rob, @Joshua, I changed my editor’s setting to Latin-1 and it did not work, on the simulator (Windows) or on device (tested on Android). I tried other settings with no success; it only worked with the plugin and with my editor set to UTF-8 without BOM. It should work on Latin-1 since it only uses 1 byte, and Lua supports it.

With Latin-1 (ISO 8859-1), the original text changed to local a="joÃ£o maÃ§a Ãcone", but it printed “joão maça ícone” on screen like it should. I tried to convert both strings to upper case with no success. On Mac I only got a black screen with the other encodings; only UTF-8 without BOM worked.
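The garbled text is what UTF-8 bytes look like when each byte is shown as a separate Latin-1 character. A small sketch of that effect in plain Lua, using a hypothetical helper named latin1_view that re-encodes every byte as its own character:

```lua
-- Hypothetical helper: treat every byte of `s` as a Latin-1 code point and
-- return the UTF-8 text a Latin-1 viewer would effectively display.
local function latin1_view(s)
    local out = {}
    for i = 1, #s do
        local b = string.byte(s, i)
        if b < 128 then
            out[#out + 1] = string.char(b)
        else
            -- Code points 128..255 take two bytes in UTF-8.
            out[#out + 1] = string.char(192 + math.floor(b / 64), 128 + b % 64)
        end
    end
    return table.concat(out)
end

-- "joão" stored as UTF-8: the "ã" is the two bytes 0xC3 0xA3.
local stored = "jo\195\163o"
-- Viewed byte by byte, 0xC3 shows up as "Ã" and 0xA3 as "£".
local viewed = latin1_view(stored)
print(#stored, #viewed)
```

The bytes in the file never changed; only the editor's interpretation of them did, which is why the text still rendered correctly on screen.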

my original editor setting was UTF-8 without BOM.

Joshua_Quick · June 16, 2016, 6:51pm

>> It should work on Latin-1 since it only uses 1 byte, and Lua supports it.

Corona ***ONLY*** supports UTF-8 encoded Lua files without a BOM signature.

This is an intentionally imposed Corona restriction because UTF-8 is the most portable text encoding on all platforms.

It’s also an optimization on most platforms. Particularly on Apple and Android based platforms since their native C/C++ APIs support UTF-8 by default.

Also remember that most text files such as *.lua, *.cpp, *.c, etc. don’t provide any indication of what text encoding they use. This forces a lot of code editors and other software to *guess* at what encoding they use by scanning the file, and they can guess wrong. The Firefox developers have written a lengthy article about this before. And that was something that the XML format designers solved by forcing users to declare the encoding at the top of an XML file. But the bottom line here is that you never want the script interpreter to *guess* at the format at runtime and incur a performance hit. So, it’s better to impose a standard encoding format such as UTF-8, which is what Corona does.

So, bottom line, switch to UTF-8 without BOM.

This is the only encoding Corona will ever support for Lua.
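If you want to check text you have read yourself for a BOM, a minimal sketch in plain Lua could look like this. The helper names hasUtf8Bom and stripUtf8Bom are made up for illustration, and this is not a description of how Corona itself loads Lua files.

```lua
-- The UTF-8 BOM signature is the three bytes EF BB BF.
local UTF8_BOM = "\239\187\191"

-- Hypothetical helper: true if the text starts with the UTF-8 BOM.
local function hasUtf8Bom(text)
    return text:sub(1, 3) == UTF8_BOM
end

-- Hypothetical helper: return the text without a leading BOM, if any.
local function stripUtf8Bom(text)
    if hasUtf8Bom(text) then
        return text:sub(4)
    end
    return text
end

print(hasUtf8Bom(UTF8_BOM .. "local a = 1"))
print(hasUtf8Bom("local a = 1"))
```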

carloscosta · June 17, 2016, 10:09am

Understood. Thanks for the clarification.

I guess this should go straight into the documentation so people don’t ask this question all over again. If Corona imposes something, that should be pointed out in the docs.

regards,

Carlos.

Joshua_Quick · June 17, 2016, 5:45pm

Sure thing. Happy to help. And sorry about the confusion.

I’ll talk to our team and make sure this is better documented.
