Problem Searching for Elements in a Very Large Table

I am creating a word app and need a way to check that a word found by the user is a real word in the English language. To do this, I first read words from the file wordlist.txt (see attached file) and load them into a very large table. The specific problem comes when I search this table to determine whether a given word exists in it. Below is my code, which I have tested using Corona builds 2014.2511 and 2015.2612.

local readWordsFromFile = function( wordList )
    local wordTable = {}
    local path = system.pathForFile( wordList, system.ResourceDirectory )
    local file = io.open( path, "r" )

    for line in file:lines() do
        line = string.sub( line, 1, #line - 1 )
        table.insert( wordTable, line )
    end
    io.close( file )
    file = nil
    return wordTable
end

local wordTable = readWordsFromFile( "wordlist.txt" )
local wordsToSearch = { "and", "bond", "punk", "actor", "from", "ton", "what", "bun", "son", "quit" }

for i = 1, #wordsToSearch do
    local wordInTable = false
    for j = 1, #wordTable do
        if wordTable[j] == wordsToSearch[i] then
            wordInTable = true
        end
    end
    if wordInTable == true then
        print( "word found" )
    else
        print( "word NOT found" )
    end
end

Even though all of the words in wordsToSearch exist in the wordlist.txt file (I have checked), the printed output indicates that only half of the words in wordsToSearch were actually found. I am open to any feedback about what may be causing this issue, as well as alternative approaches to word searching.

How about a second table (or a different way)?

local readWordsFromFile = function( wordList )
    local wordTable = {}
    local path = system.pathForFile( wordList, system.ResourceDirectory )
    local file = io.open( path, "r" )
    for line in file:lines() do
        wordTable[line] = true
    end
    io.close( file )
    file = nil
    return wordTable
end

This way you have a table where your word is the key, not the value.  Then you can simply do:

if wordTable[wordToCheck] then
    -- word exists…
end

Rob

Hi Rob,

Thanks for the tip. I tried implementing your approach and the code I used is below. I am still getting the same result in the output that shows that only half of the words I searched for were found.

local readWordsFromFile = function( wordList )
    local wordTable = {}
    local path = system.pathForFile( wordList, system.ResourceDirectory )
    local file = io.open( path, "r" )

    for line in file:lines() do
        line = string.sub( line, 1, #line - 1 )
        -- table.insert( wordTable, line )
        wordTable[line] = true
    end
    io.close( file )
    file = nil
    return wordTable
end

local wordTable = readWordsFromFile( "wordlist.txt" )
local wordsToSearch = { "and", "bond", "punk", "actor", "from", "ton", "what", "bun", "son", "quit" }

for i = 1, #wordsToSearch do
    if wordTable[wordsToSearch[i]] then -- if word exists in word list
        print( "word found" )
    else
        print( "word NOT found" )
    end
end

-- Output:
-- word NOT found
-- word found
-- word NOT found
-- word NOT found
-- word NOT found
-- word found
-- word NOT found
-- word found
-- word found
-- word found

Hi Rob,

Thanks again for your suggestion. I identified the problem on line no. 7 of my code. Each of the words being extracted from the word list and stored in the table was missing its very last character. For instance, the word “and” was being stored as “an”, which is why it was not being found in the search. Very silly mistake. I borrowed the code for reading files from a different source, and did not proofread it very carefully before implementing it. I fixed the problem by changing line no. 7 to:

line = string.sub( line, 1, #line )

However, I am still curious if your method of storing words as a key rather than a value is more memory efficient than the method I had originally used.

Thanks,

-crawfdc47

That line is actually now a no-op, and could just be removed altogether. (I’m guessing the “- 1” was written assuming that a newline or NUL byte gets tacked onto the end of the line string.)
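
A quick way to convince yourself of that, assuming path still points at the word list: Lua’s lines iterator already strips the end-of-line character, so the raw line length matches the word length.

for line in io.lines( path ) do
    print( #line, line ) -- e.g. prints "3  and", not 4: there is no trailing newline to trim
end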

If memory is actually at a premium, then the original array version will probably be cheaper, since Lua does optimize for contiguous array portions of tables. If you do go that route, you should ensure that your words are in order (run  table.sort on the array, if you don’t have them sorted already), then do something like binary search to look for your word (which turns your O(N) problem into an O(lg N) one).
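
For instance, a minimal binary-search sketch over that sorted array might look like this (the wordExists name is just for illustration):

local function wordExists( wordTable, word )
    local lo, hi = 1, #wordTable
    while lo <= hi do
        local mid = math.floor( (lo + hi) / 2 )
        if wordTable[mid] == word then
            return true
        elseif wordTable[mid] < word then
            lo = mid + 1 -- target sorts after the midpoint
        else
            hi = mid - 1 -- target sorts before the midpoint
        end
    end
    return false
end

Call table.sort( wordTable ) once after loading, then something like wordExists( wordTable, "punk" ) per lookup.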

On the other hand, you could try something like a BK-tree and get some extra abilities along with it, such as finding near-misses within a given edit distance.
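
To give a rough idea, here is a bare-bones BK-tree sketch (editDistance, bkInsert, and bkQuery are names I made up for this example): insert every word under a single root node, then query with a maximum edit distance.

-- standard two-row Levenshtein distance
local function editDistance( a, b )
    local prev, curr = {}, {}
    for j = 0, #b do prev[j] = j end
    for i = 1, #a do
        curr[0] = i
        for j = 1, #b do
            local cost = (string.byte( a, i ) == string.byte( b, j )) and 0 or 1
            curr[j] = math.min( prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost )
        end
        prev, curr = curr, prev
    end
    return prev[#b]
end

local function bkInsert( node, word )
    local d = editDistance( node.word, word )
    if d == 0 then return end -- word already present
    if node.children[d] then
        bkInsert( node.children[d], word )
    else
        node.children[d] = { word = word, children = {} }
    end
end

local function bkQuery( node, word, maxDist, results )
    local d = editDistance( node.word, word )
    if d <= maxDist then results[#results + 1] = node.word end
    -- triangle inequality: only subtrees within [d - maxDist, d + maxDist] can match
    for dist, child in pairs( node.children ) do
        if dist >= d - maxDist and dist <= d + maxDist then
            bkQuery( child, word, maxDist, results )
        end
    end
    return results
end

-- usage sketch:
-- local root = { word = "and", children = {} }
-- bkInsert( root, "bond" ) -- etc. for each word
-- local matches = bkQuery( root, "bnd", 1, {} ) -- everything within edit distance 1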

The method I suggested has about the same memory footprint and is way faster. You’re doing a linear search which scans the whole table on every lookup. You could add a “break” statement in your successful check to avoid looking any further, but even then your average lookup is n / 2 (it’s n now, where n is the number of members). I’m not sure what Lua uses for key lookups, if it’s some ISAM table or a B-tree lookup, but a B-tree search is really, really fast. Regardless, the keyed lookup is super efficient in Lua.
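
For illustration, the early-exit version of the linear scan would look something like this (wordInList is just an illustrative name, and wordTable is the original array version):

local function wordInList( wordTable, word )
    for i = 1, #wordTable do
        if wordTable[i] == word then
            return true -- stop scanning as soon as we find a match
        end
    end
    return false
end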

Rob

I’m not sure how well this would work for you, but could you not split the main file with the list of valid words into 26 tables, one for each letter of the alphabet?

Then when the user enters a word, you check the first letter of that word, and only look up the relevant table.
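
Roughly like this, assuming all words are lowercase (buildBuckets and wordExistsInBucket are hypothetical names for this sketch):

local function buildBuckets( wordTable )
    local buckets = {}
    for i = 1, #wordTable do
        local word = wordTable[i]
        local letter = string.sub( word, 1, 1 )
        buckets[letter] = buckets[letter] or {}
        table.insert( buckets[letter], word )
    end
    return buckets
end

local function wordExistsInBucket( buckets, word )
    local bucket = buckets[string.sub( word, 1, 1 )]
    if not bucket then return false end
    for i = 1, #bucket do -- scan only the one relevant bucket
        if bucket[i] == word then return true end
    end
    return false
end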

No - Rob’s method will use a bit more memory, but it will run orders of magnitude faster, as a hash search in native code.

The most memory-efficient approach, while retaining a bit of native-code performance, would be to avoid the table altogether, something along these lines:

-- setup, at file-loading time:
wholeDictionary = file:read( "*a" )

-- later, when searching for a word:
exists = string.find( wholeDictionary, specificWord )
-- (though you'd want to build in 'delimiters' to eliminate
-- potential false positives within/across words)
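
A sketch of that delimiter trick, assuming one word per line in the file (wordExistsInBlob is an illustrative name):

local path = system.pathForFile( "wordlist.txt", system.ResourceDirectory )
local file = io.open( path, "r" )
-- pad with newlines so every word, including the first and last, is delimited
local wholeDictionary = "\n" .. file:read( "*a" ) .. "\n"
io.close( file )

local function wordExistsInBlob( word )
    -- plain find (fourth argument true) disables pattern magic;
    -- the surrounding newlines prevent matches inside longer words
    return string.find( wholeDictionary, "\n" .. word .. "\n", 1, true ) ~= nil
end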

FYI, you could split up your large table into 5 (or more) smaller ones based on the first letter of the word - a simple trick to speed things up if needed.

Just indexing with a letter will run into a couple of things:

Some runs of words will be relatively small, e.g. those starting with “q” and “x”, possibly giving a false impression of speedup, whereas the “e”, “s”, “t” buckets, and so on, will stay large. (I’ve only done a cursory scan of the word list, but am going off of general English knowledge.)

It also doesn’t account for common prefixes, such as “re” and “su” (“sub”, “super”, etc.).

That said, the idea isn’t necessarily unsound. There probably is a legitimate way to shrink memory with data structures along the lines of ropes or suffix arrays.

@ Rob Miracle

To quote from Lua 5.1.5’s ltable.c:

/*
** Implementation of tables (aka arrays, objects, or hash tables).
** Tables keep its elements in two parts: an array part and a hash part.
** Non-negative integer keys are all candidates to be kept in the array
** part. The actual size of the array is the largest `n' such that at
** least half the slots between 0 and n are in use.
** Hash uses a mix of chained scatter table with Brent's variation.
** A main invariant of these tables is that, if an element is not
** in its main position (i.e. the `original' position that its hash gives
** to it), then the colliding element is in its own main position.
** Hence even when the load factor reaches 100%, performance remains good.
*/

For anybody interested, the file itself is the implementation of Lua tables. This topic, and in particular the links in this post, bring out some interesting analysis on the technique.

@ crawfdc47 Sorry if this has gone far afield of your original question.  :smiley:

far afield of OQ, yes, but i for one at least enjoy reading your replies.   :slight_smile:

as for practical forum advice… given that the OP was doing a full table scan (and initially not even breaking early), my hunch would be that implementing anything considerably more “esoteric” might be a stretch.

if that 400k table is already considered “very large”, maybe we can assume it won’t be growing much toward “truly large” (to the point where some form of encoded storage is required).  if true, and storage really shouldn’t be an issue, then values-as-keys will likely give the most bang for the buck per line of code to implement.
