Comparing methods for storing/searching through a large list of words

Hi All,

I’m looking for an outside opinion. I’m developing a word game that uses a list of 200,000+ words. I’m trying to decide the best way to store and retrieve these words on the device. I’ve seen different examples but I’d like to better understand the trade-offs.

With my current implementation, I imported the words into a SQLite db. (As an aside – the text file I imported from is only 2.6MB, and my sqlite db was only 13KB before the import. Afterwards the db is 14.8MB, almost 6 times the size of the text file…is that normal??) When I go to validate the words, I put all the words played in a Lua table, loop through it, and query the database for each word to see if it exists. I’ve noticed that on my phone this operation is seamless, without any lag, but on the simulator (for Mac) there seems to be a slight bit of lag at that step. Could querying the db in rapid succession cause noticeable lag on older phones?
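
For reference, the lookup step is along these lines (a rough sketch rather than my exact code – I’m assuming the Corona sqlite3 module, a WordList table with a single "word" column, and reusing one prepared statement for all lookups):

local sqlite3 = require("sqlite3")

local path = system.pathForFile("GameSetup.sqlite", system.ResourceDirectory)
local db = sqlite3.open(path)

-- One prepared statement, reused for every lookup instead of building a new
-- query string per word.
local stmt = db:prepare("SELECT 1 FROM WordList WHERE word = ? LIMIT 1")

local function isValidWord(word)
    stmt:bind_values(word)
    local found = (stmt:step() == sqlite3.ROW)
    stmt:reset()
    return found
end

-- playedWords is a placeholder name for the list of words the player made.
for _, played in ipairs(playedWords) do
    if isValidWord(played) then
        -- score the word
    end
end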

The other method I’ve seen people use is just reading the file directly into a Lua table and keeping all the words in memory. Of course, with this method I’m concerned about memory constraints on the device, but I think it would offer lightning-fast lookups if I inserted every word as a key. Then I could just do something like the following, without having to worry about opening a connection to the db, performing multiple queries, and then closing the db.

if wordlist['word'] ~= nil then
    -- do stuff
end
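
Building that table would presumably be as simple as reading the file line by line (a rough sketch, assuming one word per line in a bundled word.list file):

local path = system.pathForFile("word.list", system.ResourceDirectory)
local wordlist = {}
for line in io.lines(path) do
    -- strip any trailing carriage return in case the file has Windows line endings
    local word = line:gsub("%s+$", "")
    if #word > 0 then
        wordlist[word] = true
    end
end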

Also, I feel like doing it this way would help keep the size of the app bundle low. I don’t know if I messed up the SQLite import somehow, but a 6x increase seems crazy!

What are other people’s thoughts?

@Vince_

Hi.  I saw this yesterday, but deferred till today to answer.  I ended up writing a short article on this topic here, but let me summarize and answer at least a few of your questions now.

SQLite vs Table

  • Size - SQLite wins by a lot.
  • Speed - Tables win by a huge margin.
  • Versatility - SQLite wins hands-down.

Memory Question

I’m a little unclear on what you meant when you said import.  Did you mean your random access memory usage increased when you loaded the DB file into memory?  This seems a little odd.  I could be wrong, but as far as I have tested and recall, one of the benefits of SQLite is that you don’t actually load the whole DB into memory, just indexing data.  However, this may vary from device to device.

Now, if you create a table in memory indexed by words, it will be HUGE.  Check out my article for some meaningful numbers and comparisons.

PS - The article has a link to my test code so you can use it for comparison to your own efforts.

@RoamingGamer

Wow, this is a fantastic write-up! This was the sort of thing I was looking for. If I understand your analysis correctly, SQLite may not be the fastest, but it is preferable to tables because of the added benefits. I guess speed won’t be such a big problem after all, since I’m not validating anywhere near 1,000 words at any given time – only around 16 at most.

Just to clarify, when I said memory I meant RAM, and when I mentioned the size of the DB I was talking about persistent storage. It seems that storing that huge word list in a Lua table isn’t worth it from a RAM perspective.

I do have a question about how you encoded your word list into the SQLite db though. I took the YAWL word list found here and used the sqlite shell for Mac to import the list into my table like so:

sqlite> .open GameSetup.sqlite      (the database)
sqlite> .import word.list WordList  (the word table)

After doing that, the persistent size of my db swelled to the 15MB I mentioned earlier. It’s possible I screwed up the import, since I accidentally ran that command multiple times. However, last night I decided to delete all entries in that table, try the import again using the ENABLE2K word list (only once this time), and then run the VACUUM command. The size shrank from 15MB to ~9MB, but it’s still nowhere near the size of the text file, which is ~2MB.
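
For comparison, one way I could rebuild the table from code instead of the shell would be something like the sketch below (the single-column schema and the WITHOUT ROWID hint are just my guesses at keeping the file small, and on a device the db would have to live in the documents directory since the app bundle is read-only):

local sqlite3 = require("sqlite3")

local dbPath = system.pathForFile("GameSetup.sqlite", system.DocumentsDirectory)
local db = sqlite3.open(dbPath)

-- One TEXT column used as the primary key keeps lookups an index search.
-- WITHOUT ROWID needs a reasonably recent SQLite; drop it if unsupported.
db:exec("CREATE TABLE IF NOT EXISTS WordList (word TEXT PRIMARY KEY) WITHOUT ROWID;")

-- Wrap the inserts in a single transaction so the import is one write.
db:exec("BEGIN TRANSACTION;")
local stmt = db:prepare("INSERT OR IGNORE INTO WordList (word) VALUES (?);")
for line in io.lines(system.pathForFile("word.list", system.ResourceDirectory)) do
    local word = line:gsub("%s+$", "")
    if #word > 0 then
        stmt:bind_values(word)
        stmt:step()
        stmt:reset()
    end
end
stmt:finalize()
db:exec("COMMIT;")

-- Reclaim space left over from earlier botched imports.
db:exec("VACUUM;")
db:close()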

What method did you use to import your word list into the db, and can you verify that all 170,000+ words got included? I’m going to take a gander at your code later tonight when I get home from work.

Thanks again for taking the time to research this.

It really depends on whether the lag matters. There are other approaches which may or may not work, depending on your needs.

For example, you have a set of words encoded in 8-bit ASCII, but that in itself is wasteful: of the 256 possible bit patterns you are only using 26 of them. You could encode all your words so that common groups of letters, like say ‘ful’ and so on, are represented by one of the 230-odd unused 8-bit patterns. (The numbers may be slightly different if you need foreign character codes.) You could write a program to analyse your 200k words looking for the most common groups, or this might actually be known already.
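
Very roughly, the analysis and encoding could look something like this (just a sketch in plain Lua – the number of groups kept, the byte range used, and the file name are arbitrary choices):

-- Count how often each three-letter group appears across the word list.
local counts = {}
for line in io.lines("word.list") do
    local word = line:gsub("%s+$", "")
    for i = 1, #word - 2 do
        local tri = word:sub(i, i + 2)
        counts[tri] = (counts[tri] or 0) + 1
    end
end

-- Rank the groups by frequency.
local ranked = {}
for tri, n in pairs(counts) do
    ranked[#ranked + 1] = { tri = tri, n = n }
end
table.sort(ranked, function(a, b) return a.n > b.n end)

-- Map the most common groups to unused byte values (128..255 here, which
-- avoids colliding with plain ASCII letters).
local encodeMap = {}
for i = 1, math.min(128, #ranked) do
    encodeMap[ranked[i].tri] = string.char(127 + i)
end

-- Replace each common group with its single-byte code. (The substitution
-- order here is arbitrary; a real encoder would apply them deterministically.)
local function encodeWord(word)
    for tri, byte in pairs(encodeMap) do
        word = word:gsub(tri, byte)
    end
    return word
end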

@Vince,

Hi.  It looks like I did use two different lists (sorry about that).  The DB has a list of ~81K words in it.  I’m afraid I pulled it from an old example of mine and didn’t verify that both the DB and table used the same list source.  I’ll correct the article shortly.

Note: There is a link at the end of the article to the tool I used for encoding.

Hello all!

I’ve updated the article and the example so that both the SQL DB and the Table use the exact same word list.  This has improved the TABLE version significantly, whereas previously it had 2.12 times as many words in it as the SQL version.

Sorry about the confusion!

PS - I also added code to show you how I encoded the table version.

@roaminggamer, great article & analysis. Thanks for taking the time. 

@RoamingGamer

Thanks for taking another look! This is good stuff. I decided to stick with the SQLite DB for my purposes.

@paulscottrobson

You bring up a good point about the encoding of the text itself. I hadn’t even considered that.

Cool article!
