Dr. Dobb's is part of the Informa Tech Division of Informa PLC




Yet Another Word Puzzle


The Inner Loop

For this program to succeed, it must finish its processing with a container that holds the longest good words in the dictionary. For this particular problem, my choice of container is the C++ hash_set, which is non-standard but universally implemented (modern C++ later standardized the same idea as std::unordered_set).

My bottom-up approach means that I will fill in the hash_set container for words of length 1 first, then words of length 2, and so on. For reasons of efficiency, it works out better if I keep a separate hash_set for each word length, so the hash_set objects are actually stored in a map that is indexed on the word size:

std::map<size_t,hash_set<std::string> > good_words;

To fill in the hash_set for size i, I need a loop that iterates over all words of size i, removing one character at a time and then testing to see if the result is a good word of size i-1. If it is, I add the original word to good_words[ i ]. When I'm done iterating over all words of that size, I have a hash_set that contains all good words of that size, and I can move up to the next larger size.

So if we're testing words of size i, the innermost part of the loop will look like this:

for ( size_t j = 0 ; j < i ; j++ ) {
     string test_word = word;
     test_word.erase( j, 1 );
     if ( good_words[ i-1 ].find( test_word ) != good_words[ i-1 ].end() ) {
          good_words[ i ].insert( word );
          break;
     }
}

In this inner part of the loop, we repeatedly remove a character from the word, testing to see if the resulting word appears in the list of words saved in the previous size hash set. If a match occurs, the word is inserted into the hash set for the current word size, and the loop breaks.

This bottom-up approach seems like the ideal way to build my lists of good words for a couple of reasons. First, I don't go to the expense of adding words to the hash sets unless they are good words. Second, determining that a given word is a good word only requires a test against the words at level i-1; I don't have to retest the complete path down to "a" or "i".
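The inner loop sits inside two outer loops: one over word sizes, one over the words of each size. Here is a sketch of how the whole bottom-up pass might fit together. It uses std::unordered_set, the standardized descendant of hash_set; the name build_up_words matches the program, but this exact signature and the seeding of size 1 with "a" and "i" are my reconstruction, not the article's code:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <unordered_set>

using std::list;
using std::map;
using std::string;
using std::unordered_set;

// Bottom-up pass: a word of size i is good if deleting some single
// character leaves a good word of size i-1. Size 1 is seeded with the
// two one-letter words, "a" and "i". Returns the largest size that
// produced any good words.
size_t build_up_words(const map<size_t, list<string>>& words_by_length,
                      map<size_t, unordered_set<string>>& good_words)
{
    good_words[1] = {"a", "i"};
    size_t longest = 1;
    for (const auto& [i, words] : words_by_length) {
        if (i < 2)
            continue;
        for (const string& word : words) {
            for (size_t j = 0; j < i; j++) {
                string test_word = word;
                test_word.erase(j, 1);
                if (good_words[i - 1].count(test_word)) {
                    good_words[i].insert(word);
                    longest = i;
                    break;
                }
            }
        }
    }
    return longest;
}
```

Because std::map iterates its keys in ascending order, each size sees a fully populated good_words[ i-1 ] before it is processed.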

Building The Input Data

In the previous section I mentioned that the innermost loop was going to be called as I iterated over all the dictionary words of a given size. So how do I get all the words of a given size?

The first thing that might occur to you is that you could read in all the words from the dictionary, then sort them by size. This approach would work, but given that the standard sort routines in the C++ library all run in O(n·lg n) time, it becomes needlessly expensive as the dictionary grows to hundreds of thousands of words.

The good news is that with this data set we're in a position to take advantage of a linear sort. Yes, we can sort data and do substantially better than O(n·lg n) when we know that the data to be sorted is constrained to a small set of values.

In this case, I just create one linked list for each word size, and as I read the input file, I append each input word to the appropriate linked list. This would be a true linear algorithm if I constrained the input size to a fixed number, say 25, but for convenience I actually store the lists in a map that looks like this:

std::map<size_t,std::list<std::string> > words_by_length;

As a result the input code runs in close to linear time. The actual loop that reads in the data is nice and simple:

string word;
while ( input >> word ) {
     words_by_length[ word.size() ].push_back( word );
     count++;
     if ( ( count % 100 ) == 0 )
          cout << count << "\r";
}

Once all the words are read in, I can access the list of words of a given size with a simple map lookup: words_by_length[ i ].
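One caveat about that lookup: map::operator[] inserts an empty list when the key is missing, so asking for a size that never occurred quietly grows the map. A small helper that leaves the map untouched might look like this (words_of_size is a hypothetical name, not part of the article's program):

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>

// Look up the words of a given size without modifying the map.
// map::operator[] would insert an empty list for a missing size;
// find() leaves the map as-is and we fall back to a shared empty list.
const std::list<std::string>& words_of_size(
    const std::map<std::size_t, std::list<std::string>>& words_by_length,
    std::size_t i)
{
    static const std::list<std::string> empty;
    auto it = words_by_length.find(i);
    return it == words_by_length.end() ? empty : it->second;
}
```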

Word Lists

One final detail before I can run my program: I need some lists of words! Lists of words are not hard to come by, although coming up with a suitable one for this exercise might take some work.

One of the first places to look is on your local Linux system. My system has a list of words in /usr/share/dict/words, which is used by the spell checker application, and possibly by the password app. On one of my systems, this dictionary has a whopping 483,524 words, which means it is packed with obscure words. Just as an example, a typical 11-character good word found in /usr/share/dict/words yields this head-scratching derivation:

abranchiate
branchiate
branchiae
branchia
branchi
branch
ranch
rach
ach
ch
h

Probably the first thing you'll want to do with that file is go through and remove all the single-letter words except a and i, but even so, you're going to be boggled by some of what you see.
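That cleanup is easy to automate. A hypothetical helper (prune_single_letters is my name; the article only suggests doing the edit by hand) that keeps "a" and "i" but drops every other one-letter entry:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Keep every word except the one-letter entries that aren't "a" or "i".
// A sketch of the dictionary cleanup step suggested in the article.
std::vector<std::string> prune_single_letters(const std::vector<std::string>& words)
{
    std::vector<std::string> kept;
    for (const std::string& w : words)
        if (w.size() != 1 || w == "a" || w == "i")
            kept.push_back(w);
    return kept;
}
```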

Another good alternative is the collection of word lists distributed through Project Gutenberg as the Moby Word List. The collection includes a wide variety of lists of varying size and character.

Beth Katz had several good dictionaries listed along with her homework assignment, including a nice short one called kids.dict that is good for debugging runs.

Finally, Google searches for "word lists" will turn up many other good choices.

Wrapping It Up

Once I had the bottom-up good word builder working with the file-based word list, I was ready to put it all together. The core of main() now looks like this:

map<size_t,list<string>> words_by_length;
read_words( argv[ 1 ], words_by_length );
map<size_t,hash_set<string>> good_words;
size_t longest = build_up_words( words_by_length, good_words );
print_good_words( longest, words_by_length, good_words );

Procedure read_words() simply reads all the words from the file into the map of list containers called words_by_length, as described earlier.

Then build_up_words collects all the good words and stores them, organized by size, in the map of hash_set containers called good_words. The core of that routine was described earlier.

I show the results of the top two levels in print_good_words(), which is simple enough to follow directly in the source code.
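For reference, here is a plausible sketch of such a routine. The real print_good_words() also receives words_by_length and its exact output format lives in the source code, so treat this as an approximation; taking a stream parameter makes it easy to exercise against a std::ostringstream:

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <unordered_set>

// Print the good words at the top two sizes. The stream parameter
// defaults to std::cout but can be any std::ostream.
void print_good_words(
    std::size_t longest,
    const std::map<std::size_t, std::unordered_set<std::string>>& good_words,
    std::ostream& out = std::cout)
{
    // Walk down from the longest size, stopping after two levels.
    for (std::size_t size = longest; size > 0 && size + 1 >= longest; size--) {
        auto it = good_words.find(size);
        if (it == good_words.end())
            continue;
        out << "Good words of length " << size << ":\n";
        for (const std::string& w : it->second)
            out << "  " << w << "\n";
    }
}
```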

