Network request response is CP-1250 (html), how to convert to UTF-8?

joedavinci · October 3, 2015, 11:51pm

Hi all,

I’m trying to build a pretty simple app that will read the actual HTML code of a specific website (sort of a very unusual bulletin board) and present it in a mobile-friendly format…

I can parse the HTML to get the data I need, but I’ve run into problems with character encodings.

The site presents all HTML files with CP-1250 encodings, which is even noted in the <META> tags of the HTML response:

<META http-equiv="content-type" content="text/html; charset=windows-1250">

Just to be clear: I have no control over that server and can’t change its output encoding. I have to take the HTML as it is provided and deal with it on the client.

There are a lot of non-English (non-ASCII) characters (the site is in Czech), and I can’t seem to find a way to convert this string into a UTF-8 string that could be shown correctly in the mobile app.

Examples of such characters:

ěščřžýáíéúůüóľďťň

ĚŠČŘŽÝÁÍÉÚŮÜÓĽĎŤŇ

Here on the web they obviously show up fine, but in the simulator’s console output they are replaced by unknown characters (squares; I’m using the Windows simulator). In the mobile app, if I put the string in display.newText, some are shown and some are simply missing completely.

I’m attaching a TXT file with these characters encoded in CP-1250. If you read it in Corona and load it as a string, you can replicate what happens to me: just try to print it to the console output and to display it using display.newText…

Can anyone suggest a way to convert the string so that it can be shown correctly on screen?

Thanks!

joedavinci · October 4, 2015, 1:13pm

Here’s some extra detail on the example… I load the file “badchars.txt”, which is CP-1250 encoded and contains the special characters.

Then I print it out to the console directly, as Corona has read it.

Under that I print it out again after running it through the utf8_encode function that I found online…

-- Treats each input byte as a Unicode code point and encodes it as UTF-8.
local function utf8_encode(unicode)
  local math = math
  local utf8 = ""
  for i = 1, string.len(unicode) do
    local v = string.byte(unicode, i)
    local n, s, b = 1, "", 0
    -- Pick the sequence length n and the lead-byte marker b for this value.
    if v >= 67108864 then n = 6; b = 252
    elseif v >= 2097152 then n = 5; b = 248
    elseif v >= 65536 then n = 4; b = 240
    elseif v >= 2048 then n = 3; b = 224
    elseif v >= 128 then n = 2; b = 192
    end
    -- Build the continuation bytes from the low 6 bits upward.
    for j = 2, n do
      local c = v % 64 -- replaces the deprecated math.mod
      v = math.floor(v / 64)
      s = string.char(c + 128) .. s
    end
    s = string.char(v + b) .. s
    utf8 = utf8 .. s
  end
  return utf8
end

It was meant more for converting raw Unicode code points to UTF-8, I think, so it doesn’t fix the problem… but it does get a few of the characters right… If someone knew how to modify it so it maps from CP-1250 to UTF-8, that would solve the problem.
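A note on why a few characters come out right: the function treats each byte value as if it were the Unicode code point, and that only holds where CP-1250 and Unicode happen to agree (á, é and í, for example). The sketch below illustrates the effect with a hypothetical helper, encodeSmallCodepoint, which is not part of the function above and only handles code points below 0x800.

```lua
-- Hypothetical helper: encode a code point below 0x800 as UTF-8 (1 or 2 bytes).
local function encodeSmallCodepoint(cp)
  if cp < 128 then
    return string.char(cp)
  end
  return string.char(192 + math.floor(cp / 64), 128 + cp % 64)
end

-- CP-1250 byte 0xE1 is "á", and U+00E1 is also "á": treating the byte
-- as a code point happens to give the right result.
local sameInBoth = encodeSmallCodepoint(0xE1)

-- CP-1250 byte 0xE8 is "č", but U+00E8 is "è": the naive result is wrong.
local naive = encodeSmallCodepoint(0xE8)

-- The correct code point for "č" is U+010D, which needs a lookup table.
local correct = encodeSmallCodepoint(0x010D)

print(sameInBoth, naive, correct)
```

So any fix has to translate the CP-1250 byte to its real code point first, and only then encode it as UTF-8.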

Anyway, I also put both texts (straight as read, and after utf8_encode) on the display in the simulator.

You can see it all at once in the attached screenshot.

primoz.cerar · October 7, 2015, 3:26pm

This function basically converts Unicode characters to their UTF-8 representation, which can be single-byte or multibyte. It will not convert your input to a readable string. The first thing to check is what values you get from string.byte() for the invalid characters, and then map them to the appropriate UTF-8 code. I think you would have to do this character by character for all the bytes above 127 in the WIN1250 code page. I do not believe there is any method of mathematical conversion (to my knowledge), since the order in WIN1250 and UTF-8 is not the same. You could make a simple conversion table: win1250_table[win1250_code] = string.char(utf8_code).

You can look up the char values for win1250 and UTF-8 on the web.
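The conversion table idea could look roughly like the sketch below. It is only an illustration: the table holds three Czech letters rather than the full code page, and the convert helper and its "?" fallback for unmapped bytes are assumptions, not something from this thread.

```lua
-- Partial, illustrative table: CP-1250 byte value -> UTF-8 byte sequence.
local win1250_table = {
  [0x9A] = string.char(0xC5, 0xA1), -- š (U+0161)
  [0xE8] = string.char(0xC4, 0x8D), -- č (U+010D)
  [0xF8] = string.char(0xC5, 0x99), -- ř (U+0159)
}

-- Hypothetical helper: replace every byte above 127 via the table.
-- Bytes missing from this partial table become "?".
local function convert(s)
  return (s:gsub("[\128-\255]", function(ch)
    return win1250_table[ch:byte()] or "?"
  end))
end

local converted = convert("\248e\232") -- CP-1250 bytes for "řeč"
print(converted)
```

Note that string.char accepts several byte values at once, which is what makes multibyte UTF-8 sequences fit in a single table entry.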

joedavinci · October 9, 2015, 12:14pm

Yeah, that’s what I was afraid of and trying to avoid.

Thanks for the reply anyway. I guess there’s no workaround, and I’ll just have to go with it.

primoz.cerar · October 9, 2015, 12:54pm

This should help you:

http://stackoverflow.com/a/16627763

I did not test it though.

You can get the mapping table that it needs here:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
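If you would rather generate the table than type it by hand, the mapping file can be parsed into one. The sketch below assumes each data line of CP1250.TXT has the form "0xBB", whitespace, "0xUUUU", then a comment, and that undefined bytes have a blank second column. The names utf8FromCodepoint and buildTable are made up for this example, and the sample text stands in for the downloaded file.

```lua
-- Encode a Unicode code point below 0x10000 as UTF-8 (1 to 3 bytes).
local function utf8FromCodepoint(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  end
  return string.char(0xE0 + math.floor(cp / 0x1000),
                     0x80 + math.floor(cp / 0x40) % 0x40,
                     0x80 + cp % 0x40)
end

-- Build a byte -> UTF-8 string table from the text of the mapping file.
-- Comment lines and undefined bytes do not match the pattern and are skipped.
local function buildTable(mappingText)
  local map = {}
  for line in mappingText:gmatch("[^\r\n]+") do
    local byteHex, cpHex = line:match("^0x(%x+)%s+0x(%x+)")
    if byteHex then
      map[tonumber(byteHex, 16)] = utf8FromCodepoint(tonumber(cpHex, 16))
    end
  end
  return map
end

-- A few lines in the assumed format, standing in for the real file.
local sample = "# comment line\n"
  .. "0x80\t0x20AC\t#EURO SIGN\n"
  .. "0x81\t      \t#UNDEFINED\n"
  .. "0xE8\t0x010D\t#LATIN SMALL LETTER C WITH CARON\n"

local map = buildTable(sample)
print(map[0x80], map[0xE8])
```

Generating the table this way should also make it easier to switch to another code page by swapping in a different mapping file.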

joedavinci · October 11, 2015, 1:18pm

Heh, I didn’t even check your links and wrote the whole function and conversion table myself… It seems to work so far… If it helps anyone, please feel free to use it or parts of it (it can be adapted to any other code page, I suppose).

-- Save this file as UTF-8 so the string literals below are stored as UTF-8 bytes.
-- Entry k holds the character for CP-1250 byte value k + 127 (128 to 255).
-- Bytes 129, 131, 136, 144 and 152 are undefined in CP-1250 and become a space.
local cp1250table = {
  "€", " ", "‚", " ", "„", "…", "†", "‡",        -- 128-135
  " ", "‰", "Š", "‹", "Ś", "Ť", "Ž", "Ź",        -- 136-143
  " ", "‘", "’", "“", "”", "•", "–", "—",        -- 144-151
  " ", "™", "š", "›", "ś", "ť", "ž", "ź",        -- 152-159
  "\194\160", "ˇ", "˘", "Ł", "¤", "Ą", "¦", "§", -- 160-167 (160 = no-break space)
  "¨", "©", "Ş", "«", "¬", "\194\173", "®", "Ż", -- 168-175 (173 = soft hyphen)
  "°", "±", "˛", "ł", "´", "µ", "¶", "·",        -- 176-183
  "¸", "ą", "ş", "»", "Ľ", "˝", "ľ", "ż",        -- 184-191
  "Ŕ", "Á", "Â", "Ă", "Ä", "Ĺ", "Ć", "Ç",        -- 192-199
  "Č", "É", "Ę", "Ë", "Ě", "Í", "Î", "Ď",        -- 200-207
  "Đ", "Ń", "Ň", "Ó", "Ô", "Ő", "Ö", "×",        -- 208-215
  "Ř", "Ů", "Ú", "Ű", "Ü", "Ý", "Ţ", "ß",        -- 216-223
  "ŕ", "á", "â", "ă", "ä", "ĺ", "ć", "ç",        -- 224-231
  "č", "é", "ę", "ë", "ě", "í", "î", "ď",        -- 232-239
  "đ", "ń", "ň", "ó", "ô", "ő", "ö", "÷",        -- 240-247
  "ř", "ů", "ú", "ű", "ü", "ý", "ţ", "˙",        -- 248-255
}

local function convertCP1250toUTF8(inputString)
  if inputString == nil then
    return "ERROR - NULL INPUT STRING!"
  end
  -- ASCII bytes (below 128) pass through unchanged; every other byte is
  -- replaced by a direct table lookup instead of a scan over all entries.
  return (string.gsub(inputString, "[\128-\255]", function(character)
    return cp1250table[string.byte(character) - 127]
  end))
end
