Split UTF-8 string word (with foreign characters) into letters

benny5 · December 3, 2013, 10:02pm

Hi,

We’re making a simple word guessing game where the letters in a word are scrambled. To scramble, we split the word into its component letters and change their positions randomly.

This works fine for English words:

local scramblewordtable = {}
for i = 1, #randomword, 1 do
    scramblewordtable[i] = randomword:sub(i,i)
end

“word” becomes a table { w,o,r,d }

However, when doing this with Swedish characters

“äpple” becomes {?,?,p,p,l,e}

Is there any good way to handle this?
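For context, the question marks most likely appear because `sub(i,i)` indexes bytes, not characters, and UTF-8 encodes "ä" as two bytes. A minimal sketch that shows this, with the bytes written as decimal escapes so it does not depend on how the source file is saved:

```lua
-- "äpple" spelled out with decimal escapes: "ä" is the two bytes 195, 164 in UTF-8.
local word = "\195\164pple"

print(#word)             -- 6: the length operator counts bytes, not letters
print(word:byte(1, 2))   -- 195  164

-- sub(1, 1) returns only the first half of "ä". That byte is not valid
-- UTF-8 on its own, so it is typically displayed as "?".
local firstByte = word:sub(1, 1)
print(#firstByte)        -- 1
```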

ingemar · December 4, 2013, 9:27am

That was fun

I’ve been thinking about how to handle this before but had no project that needed it. This topic sparked my interest, though, so I’ve created a small function that should work.

local UTF8ToCharArray = function(str)
    local charArray = {};
    local iStart = 0;   -- start index of the multi-byte character in progress, 0 if none
    local strLen = str:len();

    -- value of bit b, counting from 1 at the least significant bit
    local function bit(b)
        return 2 ^ (b - 1);
    end

    -- true if the bit with value b is set in w
    local function hasbit(w, b)
        return w % (b + b) >= b;
    end

    -- store the pending multi-byte character, which ends at index i - 1
    local checkMultiByte = function(i)
        if (iStart ~= 0) then
            charArray[#charArray + 1] = str:sub(iStart, i - 1);
            iStart = 0;
        end
    end

    for i = 1, strLen do
        local b = str:byte(i);
        local multiStart = hasbit(b, bit(7)) and hasbit(b, bit(8));
        local multiTrail = not hasbit(b, bit(7)) and hasbit(b, bit(8));

        if (multiStart) then
            checkMultiByte(i);
            iStart = i;
        elseif (not multiTrail) then
            checkMultiByte(i);
            charArray[#charArray + 1] = str:sub(i, i);
        end
    end

    -- process if last character is multi-byte
    checkMultiByte(strLen + 1);

    return charArray;
end

local arr = UTF8ToCharArray("Äpplet är i trädet ÅÄÖåäö");
for k,v in ipairs(arr) do
    print(k, v);
end

Multi-byte characters start with a byte that has bits 7 and 8 set (counting from 1 at the least significant bit, so the byte looks like 11xxxxxx). Trailing bytes have bit 8 set and bit 7 not set (10xxxxxx).

My function checks for these bits and acts accordingly.
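The same bit test can also be written as plain numeric comparisons, since bit 8 set means the byte is 128 or more, and bits 7 and 8 both set means 192 or more. A minimal sketch; `byteKind` is a made-up name for illustration and is not part of the function above:

```lua
-- Classify one byte (0-255) by its two highest bits.
-- Hypothetical helper for illustration only.
local function byteKind(b)
    if b < 128 then
        return "single"   -- 0xxxxxxx: plain ASCII, a whole character by itself
    elseif b < 192 then
        return "trail"    -- 10xxxxxx: trailing byte of a multi-byte character
    else
        return "lead"     -- 11xxxxxx: first byte of a multi-byte character
    end
end

local word = "\195\164pple"   -- "äpple" written as UTF-8 bytes
for i = 1, #word do
    print(i, word:byte(i), byteKind(word:byte(i)))
end
```

For "äpple" this should report byte 1 as a lead byte, byte 2 as a trailing byte, and the remaining four as single-byte characters.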

Give this function a whirl and see if it works for you.

I’ve done some basic testing, and it works well even for Chinese, Japanese and Korean text.
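To tie this back to the original question, here is a rough sketch of the scrambling step once a word can be split into whole characters. It makes two assumptions that are not from this thread: it uses a Fisher-Yates shuffle as one reasonable reading of "change their position randomly", and, to stay self-contained, it splits with a Lua pattern instead of the function above. `splitChars` and `scramble` are hypothetical names.

```lua
-- Split a UTF-8 string into characters: one non-trailing byte followed
-- by any number of trailing bytes (128-191).
-- Assumes the input is valid UTF-8; no validation is done.
local function splitChars(str)
    local chars = {}
    for ch in str:gmatch("[^\128-\191][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    return chars
end

-- Fisher-Yates shuffle of the characters, then join them back into a string.
local function scramble(word)
    local chars = splitChars(word)
    for i = #chars, 2, -1 do
        local j = math.random(i)   -- random index in 1..i
        chars[i], chars[j] = chars[j], chars[i]
    end
    return table.concat(chars)
end

math.randomseed(os.time())
print(scramble("\195\164pple"))   -- some reordering of "äpple"
```

Note that a shuffle can return the word unchanged, so a game may want to reshuffle when the result equals the input.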

benny5 · December 4, 2013, 10:22am

Wow, kudos for going above and beyond! I’m sure this will help a lot of people. I guess the answer to my question of whether there is a good way is a big NO, then. That was really advanced.

I’ll give it a whirl tonight after work! Thanks!

richard9 · December 4, 2013, 3:38pm

Wow. I had no idea how to detect multibyte characters and this specifically solves a problem I didn’t even know I was going to have! Thanks ingemar!

ingemar · December 4, 2013, 5:03pm

No problem, guys. It was fun to get away from my daily coding routine for a while…

benny5 · December 4, 2013, 6:18pm

Worked perfectly!

ingemar · December 5, 2013, 2:09am

Great! Use the code as you wish.

ali4 · March 16, 2014, 1:22pm

@ingemar

thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…

it works fine in Arabic too :))

THANK YOU VERY MUCH

If you could give a brief explanation of the code, that would be great.

jeff15 · March 24, 2015, 3:42pm

Thanks @ingemar, you just made my life easier too!

Cheers,

Jeff

Nob_Studio · June 6, 2015, 10:07pm

You solved my problem! Thank you!

keystagefun · September 11, 2015, 10:27am

Massive thank you for writing this. Solved my issue in seconds. Brilliant - cheers!