Dr. Dobb's is part of the Informa Tech Division of Informa PLC




Yet Another Word Puzzle


The Inner Loop

For this program to succeed, it must finish its processing with a container that holds the longest good words in the dictionary. For this particular problem, my choice of container is the C++ hash_set, which is non-standard but universally implemented (modern C++ later standardized the same idea as std::unordered_set).

My bottom-up approach means that I will fill in the hash_set container for words of length 1 first, then words of length 2, and so on. For reasons of efficiency, it works out better if I keep a separate hash_set for each word length, so the hash_set objects are actually stored in a map that is indexed on the word size:

std::map<size_t,hash_set<std::string> > good_words;

To fill in the hash_set for size i, I need a loop that iterates over all words of size i, removing one character at a time and then testing to see if the result is a good word of size i-1. If it is, I add the original word to good_words[ i ]. When I'm done iterating over all words of that size, I have a hash_set that contains all good words of that size, and I can move up to the next larger size.

So if we're testing words of size i, the innermost part of the loop will look like this:

for ( size_t j = 0 ; j < i ; j++ ) {
     string test_word = word;
     test_word.erase( j, 1 );
     if ( good_words[ i-1 ].find( test_word ) != good_words[ i-1 ].end() ) {
          good_words[ i ].insert( word );
          break;
     }
}

In this inner part of the loop, we repeatedly remove a character from the word, testing to see if the resulting word appears in the list of words saved in the previous size hash set. If a match occurs, the word is inserted into the hash set for the current word size, and the loop breaks.

This bottom-up approach seems like the ideal way to build my lists of good words for a couple of reasons. First, I don't go to the expense of adding words to the hash sets unless they are good words. Second, determining that a given word is a good word only requires a test against the words at level i-1; I don't have to retest the complete path down to "a" or "i".
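The inner loop sits inside two outer loops: one over word sizes, one over the words of each size. Here is a sketch of how the whole bottom-up pass might fit together. It uses std::unordered_set, the standardized descendant of hash_set; the name build_up_words matches the program, but this exact signature and the seeding of size 1 with "a" and "i" are my reconstruction, not the article's code:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <unordered_set>

using std::list;
using std::map;
using std::string;
using std::unordered_set;

// Bottom-up pass: a word of size i is good if deleting some single
// character leaves a good word of size i-1. Size 1 is seeded with the
// two one-letter words, "a" and "i". Returns the largest size that
// produced any good words.
size_t build_up_words(const map<size_t, list<string>>& words_by_length,
                      map<size_t, unordered_set<string>>& good_words)
{
    good_words[1] = {"a", "i"};
    size_t longest = 1;
    for (const auto& [i, words] : words_by_length) {
        if (i < 2)
            continue;
        for (const string& word : words) {
            for (size_t j = 0; j < i; j++) {
                string test_word = word;
                test_word.erase(j, 1);
                if (good_words[i - 1].count(test_word)) {
                    good_words[i].insert(word);
                    longest = i;
                    break;
                }
            }
        }
    }
    return longest;
}
```

Because std::map iterates its keys in ascending order, each size sees a fully populated good_words[ i-1 ] before it is processed.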

Building The Input Data

In the previous section I mentioned that the innermost loop was going to be called as I iterated over all the dictionary words of a given size. So how do I get all the words of a given size?

The first thing that might occur to you is that you could read in all the words from the dictionary, then sort them by size. This approach would work, but given that the standard sort routines in the C++ library all run in O(n·lg n) time, it becomes needlessly expensive as the dictionary grows to hundreds of thousands of words.

The good news is that with this data set we're in a position to take advantage of a linear sort. Yes, we can sort data and do substantially better than O(n·lg n) when we know that the data to be sorted is constrained to a small set of values.

In this case, I just create one linked list for each word size, and as I read the input file, I append each input word to the appropriate linked list. This would be a true linear algorithm if I constrained the input size to a fixed number, say 25, but for convenience I actually store the lists in a map that looks like this:

std::map<size_t,std::list<std::string> > words_by_length;

As a result the input code runs in close to linear time. The actual loop that reads in the data is nice and simple:

string word;
while ( input >> word ) {
     words_by_length[ word.size() ].push_back( word );
     count++;
     if ( ( count % 100 ) == 0 )
          cout << count << "\r";
}

Once all the words are read in, I can access the list of words of a given size with a simple map lookup: words_by_length[ i ].
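One caveat about that lookup: map::operator[] inserts an empty list when the key is missing, so asking for a size that never occurred quietly grows the map. A small helper that leaves the map untouched might look like this (words_of_size is a hypothetical name, not part of the article's program):

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>

// Look up the words of a given size without modifying the map.
// map::operator[] would insert an empty list for a missing size;
// find() leaves the map as-is and we fall back to a shared empty list.
const std::list<std::string>& words_of_size(
    const std::map<std::size_t, std::list<std::string>>& words_by_length,
    std::size_t i)
{
    static const std::list<std::string> empty;
    auto it = words_by_length.find(i);
    return it == words_by_length.end() ? empty : it->second;
}
```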

Word Lists

One final detail before I can run my program: I need some lists of words! Lists of words are not hard to come by, although coming up with a suitable one for this exercise might take some work.

One of the first places to look is on your local Linux system. My system has a list of words in /usr/share/dict/words, which is used by the spell checker application, and possibly by the password app. On one of my systems, this dictionary has a whopping 483,524 words, which means it is packed with obscure words. Just as an example, a typical 11-character good word found in /usr/share/dict/words yields this head-scratching derivation:

abranchiate
branchiate
branchiae
branchia
branchi
branch
ranch
rach
ach
ch
h

Probably the first thing you'll want to do with that file is go through and remove all the single-letter words except a and i, but even so, you're going to be boggled by some of what you see.
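That cleanup is easy to automate. A hypothetical helper (prune_single_letters is my name; the article only suggests doing the edit by hand) that keeps "a" and "i" but drops every other one-letter entry:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Keep every word except the one-letter entries that aren't "a" or "i".
// A sketch of the dictionary cleanup step suggested in the article.
std::vector<std::string> prune_single_letters(const std::vector<std::string>& words)
{
    std::vector<std::string> kept;
    for (const std::string& w : words)
        if (w.size() != 1 || w == "a" || w == "i")
            kept.push_back(w);
    return kept;
}
```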

Another good alternative is the collection of word lists distributed through Project Gutenberg as the Moby Word List. The collection includes a wide variety of lists of varying size and character.

Beth Katz had several good dictionaries listed along with her homework assignment, including a nice short one called kids.dict that is good for debugging runs.

Finally, Google searches for "word lists" will turn up many other good choices.

Wrapping It Up

Once I had the bottom-up good word builder working with the file-based word list, I was ready to put it all together. The core of main() now looks like this:

map<size_t,list<string>> words_by_length;
read_words( argv[ 1 ], words_by_length );
map<size_t,hash_set<string>> good_words;
size_t longest = build_up_words( words_by_length, good_words );
print_good_words( longest, words_by_length, good_words );

Procedure read_words() simply reads all the words from the file into the map of list containers called words_by_length, as described earlier.

Then build_up_words collects all the good words and stores them, organized by size, in the map of hash_set containers called good_words. The core of that routine was described earlier.

I show the results of the top two levels in print_good_words(), which is simple enough to follow directly in the source code.
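For reference, here is a plausible sketch of such a routine. The real print_good_words() also receives words_by_length and its exact output format lives in the source code, so treat this as an approximation; taking a stream parameter makes it easy to exercise against a std::ostringstream:

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <unordered_set>

// Print the good words at the top two sizes. The stream parameter
// defaults to std::cout but can be any std::ostream.
void print_good_words(
    std::size_t longest,
    const std::map<std::size_t, std::unordered_set<std::string>>& good_words,
    std::ostream& out = std::cout)
{
    // Walk down from the longest size, stopping after two levels.
    for (std::size_t size = longest; size > 0 && size + 1 >= longest; size--) {
        auto it = good_words.find(size);
        if (it == good_words.end())
            continue;
        out << "Good words of length " << size << ":\n";
        for (const std::string& w : it->second)
            out << "  " << w << "\n";
    }
}
```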

