Split a UTF-8 word (with non-English characters) into letters

Hi,

We’re making a simple word-guessing game in which the letters of a word are scrambled. To scramble a word, we split it into its letters and randomly change their positions.

This works fine for English words:

local scramblewordtable = {}
for i = 1, #randomword, 1 do
    scramblewordtable[i] = randomword:sub(i,i)
end

“word” becomes a table { w,o,r,d }

However, when doing this with Swedish characters,

“äpple” becomes {?,?,p,p,l,e}

Is there a good way to handle this?
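A small sketch of why the split above produces two "?" entries: in Lua, `#` and `string.sub` count bytes, and in UTF-8 "ä" takes two bytes. The word is written with explicit byte escapes here (an assumption made so the example does not depend on the source file's encoding).

```lua
-- "äpple" spelled out as bytes: "ä" is the two bytes 195 and 164 in UTF-8.
local word = "\195\164pple"

print(#word) -- 6 bytes, although the word has 5 letters

-- Collect the byte value at each position, which is what sub(i, i) picks apart.
local bytes = {}
for i = 1, #word do
    bytes[i] = word:byte(i)
end
print(table.concat(bytes, " ")) -- 195 164 112 112 108 101
```

Neither 195 nor 164 is a valid character on its own, which is why each shows up as "?".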

That was fun  :slight_smile:

I’ve thought about how to handle this before but never had a project that needed it. This topic sparked my interest, so I’ve written a small function that should work.

local UTF8ToCharArray = function(str)
    local charArray = {};
    local iStart = 0; -- start index of a pending multi-byte character, 0 if none
    local strLen = str:len();
    -- value of bit number b, counting from 1 at the least significant bit
    local function bit(b)
        return 2 ^ (b - 1);
    end
    -- true if the bit with value b is set in w
    local function hasbit(w, b)
        return w % (b + b) >= b;
    end
    -- store the pending multi-byte character that ends just before index i
    local checkMultiByte = function(i)
        if (iStart ~= 0) then
            charArray[#charArray + 1] = str:sub(iStart, i - 1);
            iStart = 0;
        end
    end
    for i = 1, strLen do
        local b = str:byte(i);
        local multiStart = hasbit(b, bit(7)) and hasbit(b, bit(8));
        local multiTrail = not hasbit(b, bit(7)) and hasbit(b, bit(8));
        if (multiStart) then
            checkMultiByte(i);
            iStart = i;
        elseif (not multiTrail) then
            checkMultiByte(i);
            charArray[#charArray + 1] = str:sub(i, i);
        end
    end
    -- process if last character is multi-byte
    checkMultiByte(strLen + 1);
    return charArray;
end
local arr = UTF8ToCharArray("Äpplet är i trädet ÅÄÖåäö");
-- ipairs guarantees the characters are printed in order
for k,v in ipairs(arr) do
    print(k, v);
end

Multi-byte characters start with a byte that has bits 7 and 8 set (counting from bit 1 as the least significant, so the byte looks like 11xxxxxx). Trailing bytes have bit 8 set and bit 7 not set (10xxxxxx).

My function checks for these bits and acts accordingly.
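The same bit test can also be written as plain number comparisons, which may make the three cases easier to see. This is only an illustrative sketch; `byteKind` is a hypothetical helper name, not part of the function above.

```lua
-- Hypothetical helper: classify one byte of a UTF-8 string by its top two bits.
local function byteKind(b)
    if b < 128 then
        return "single" -- 0xxxxxxx: a complete one-byte (ASCII) character
    elseif b < 192 then
        return "trail"  -- 10xxxxxx: bit 8 set, bit 7 not set
    else
        return "lead"   -- 11xxxxxx: bits 7 and 8 both set
    end
end

local s = "\195\164p" -- the bytes of "äp"
local kinds = {}
for i = 1, #s do
    kinds[i] = byteKind(s:byte(i))
end
print(table.concat(kinds, " ")) -- lead trail single
```

A character is then one "single" byte, or one "lead" byte plus all the "trail" bytes that follow it.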

Give this function a whirl and see if it works for you. 

I’ve done some basic testing, and it works well even for Chinese, Japanese and Korean text  :wub:

Wow, kudos for going above and beyond! I’m sure this will help a lot of people. I guess the answer to my question of whether there is a good built-in way is a big NO, then :slight_smile: That was really advanced.

I’ll give it a whirl tonight after work! Thanks!

Wow. I had no idea how to detect multibyte characters and this specifically solves a problem I didn’t even know I was going to have! Thanks ingemar!

No problem guys :slight_smile: It was fun to get away from my daily coding routine for a while…

Worked perfectly!

Great! Use the code as you wish.
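For anyone who wants the scrambling step from the original question as well, here is a rough sketch under a few assumptions. It does not use the function posted above; instead it splits with a byte-range pattern (a different, commonly used technique) and then applies a Fisher-Yates shuffle. The names `splitUTF8` and `shuffle` are made up for this example, and the input is assumed to be valid UTF-8.

```lua
-- Split a UTF-8 string into characters with a byte-range pattern:
-- one ASCII or lead byte, followed by any trailing bytes (128-191).
local function splitUTF8(str)
    local chars = {}
    for ch in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    return chars
end

-- Fisher-Yates shuffle: swap each position with a random earlier (or same) one.
local function shuffle(t)
    for i = #t, 2, -1 do
        local j = math.random(i)
        t[i], t[j] = t[j], t[i]
    end
    return t
end

math.randomseed(os.time())
local letters = splitUTF8("\195\164pple") -- the bytes of "äpple"
print(#letters) -- 5
print(table.concat(shuffle(letters))) -- the same letters in random order
```

Because whole characters are swapped rather than bytes, "ä" stays intact wherever it ends up.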

@ingemar

thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…thanks…

It works fine in Arabic too :))

THANK YOU VERY MUCH

Could you add a brief explanation of the code? :slight_smile:

Thanks @ingemar, you just made my life easier too!

Cheers,

Jeff

You solved my problem! Thank you!

Massive thank you for writing this. Solved my issue in seconds. Brilliant - cheers!