Wordsnakes

By Dennis E. Shasha, July 01, 2000

Ecco and Liane take on the challenge of wordsnakes -- a list of words without repeats such that some suffix of each word is a nonempty prefix of the next word in the list.

Jul00: Dr. Ecco's Omniheurist Corner

Dennis, a professor of computer science at New York University, is the author of The Puzzling Adventures of Dr. Ecco (Dover, 1998); Codes, Puzzles, and Conspiracy (W.H. Freeman & Co., 1992); Database Tuning: A Principled Approach (Prentice Hall, 1992); (coauthored with Jason Wang and Bruce Shapiro) Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications Oxford University Press, 1999); and (coauthored with Cathy Lazere) Out of Their Minds: The Lives and Discoveries of 15 Great Computer Scientists (Springer Verlag, 1998). He can be contacted at [email protected].

The graphic artists say they are ready," our visitor said with delight. Mike Johnson was the games editor of a big city newspaper that many coffee and latte shop denizens have been seen to read in cities and towns everywhere.

"They have a design of a long sequence of letters that twists along a page and even bridges over itself. You see, our solutions will look like that and it is important that they, well, fit.

"Let me explain the game. Given a collection of words, a wordsnake is a list of those words without repeats such that some suffix of each word is a nonempty prefix of the next word in the list. (Suffixes of the last word are unconstrained.) The score of such a consecutive match is the square of the number of letters in the overlap.

"For example, the words house and sea have an overlap of two letters (hence a score of 2 squared or 4) in the given order because the suffix se of house is the prefix of sea. On the other hand, beret and timber have an overlap of 1 (and score of 1) in the given order because of the letter t but have an overlap of 3 (and score of 9) in the reverse order because of the letters ber. So, some wordsnakes have higher scores than others.

"I spoke about the graphic artists because we will present the wordsnake as a long word having a nonredundant representation of the overlap. We call this 'the wordsnake in long form.' This will give us housea for house and sea and will give us beretimber for beret and timber in that order but timberet in the opposite order. If the initial collection is long, then the wordsnake in long form can be quite lengthy, too."

"Cool game!" 12-year old Liane said. "Give us a try."

"I had hoped you would have that reaction," Johnson said. "Here is a list of words in alphabetical order. Your job is to rearrange the words into a wordsnake that gives the highest score possible. No need for you to lay it out in long form. Our artists will do that."

Reader: Liane found a list having a score of 357 and a length in long form of 145. Can you do at least as well score wise?

"I knew my niece would find a solution, Mr. Johnson," Ecco said after Liane had presented her solution including a design for the wordsnake in long form (she couldn't resist). "We have been working on Hamiltonian paths together lately. In the meantime, I have some variants to ask you about: One can imagine trying to find the longest possible wordsnake in long form (in which case, beretimber would beat timberet) or the shortest possible wordsnake in long form (this need not always have the highest score, though often it will) for this list. I think there could be a fascinating competition in which one tries to design a wordsnake whose representation in long form would be 50 letters or fewer and whose score would be as high as possible. Such a competition could be conducted in any one of several languages."

Johnson promised to think about these challenges, but said he wanted to present the game as it was to his editor-in-chief. Ecco and Liane worked on the 50-letter variant for a while, but remained unsatisfied with their solutions. They did, however, find a list with an 18-letter long form that yields a score of 338.

Reader: Can you do at least as well as Ecco and Liane for 18 letters and see how well you can do for 50 letters? The words must appear in some standard dictionary of your language. Please send your solution as a list of separate words in plain ASCII text, one word per line in the form:

antiauthoritarianism

authoritarianism

I'll create the long form.

Last Month's Solution

Generate a set of sizes and their sequences for each kind of restriction and each combination of restrictions. From this, figure out the entire sequence. Here is an intermediate result:

0: "T"

1-6: "TCCATC"

7-8: "TC"

9-10: "TA"

11-12: "GA"

13-19: "GTCCGTC"

20-28 "TCACACGGC"

29-30 "TC"

31-41 "TCGCACACGGA"

42-45 "GATA"

46: "GC"

48: "TC"

Here is the whole sequence:

"TTCCATCTCT AGAGTCCGTC TCACACGGCT CTCGCACACG GAGATAGCTC"

The longer sequence is:

"GGAGTTACGG CCCCGAGTTG GTGTTACAGT TTACATGTCT TTTATCGCGA

"CTGGAAGGCG GAGGTTATGC GAAGGACGTC TTTAGCTAGC CGCGATCCAT TCGTATA"

Reader Notes

The Blood puzzle (DDJ, April 2000) distracted several intrepid readers from their university research projects. Approaches from statistics, algorithmics, and combinatorial design have been brought to bear on ensuring the purity of the blood supply by Eric Wiseblatt, Michael Goodrich, Andrew D. Todd, Jimmy Hu, Bruce Oddson, Suzanne M. Lea, Magne Ostlyngen, Bruce Oddson, Paul Stockmeyer, Tomas Rokicki, Robert Morrison, Ng Weng Leong, Chris Alonzo, Patrick R. Schonfeld, Dennis Yelle, and Greg Smith.

The basic setting of the puzzle was that there were 100,000 pints of blood, some of which might have hepatitis. A PCR test that takes a day is available and it can detect even tiny amounts of hepatitis, so drops from many pints may be combined and the hepatitis may still be detected.

For the first part of the puzzle in which 1000 pints are bad, 2000 good pints may be thrown away, and two days are available.

Magne Ostlyngen suggested the best strategy when there are exactly 1000 bad pints: Partition the 100,000 into 5882 groups of 17 and 1 of 6. Do 5883 tests. The worst case is that 999 tests are bad. (1000 bad batches are good, since that means there is exactly one bad in each group, meaning only 15 out of each 17 need be tested on the second day, requiring just 5000 tests on the second day.) In the case that there are 999 bad batches, the set of potentially bad pints is 999×17=16983 pints. We then partition these into 5661 groups of three and test each group. We throw away all the pints in each bad group. The total number of tests is 5883+5661=11,544.

Patrick R. Schonfeld, Suzanne M. Lea, Magne Ostlyngen, Paul Stockmeyer, and Michael Goodrich found solutions for two tainted pints that require 34 tests. Most of the suggested solutions involved a binary decomposition in which the pints were partitioned into nearly equal groups and tested. Either both groups test positive, in which case a simple binary search finds each bad pint, or the search space is reduced by half. Paul Stockmeyer, however, suggested a four-way decomposition for the first test, leading to a nine-day solution. Tomas Rokicki improved this independently to an eight-day solution and further proposed a 41-test solution for two days using a matrix approach.

Now consider the puzzle: How can you identify 10 bad pints out of 100,000 in two days? Dennis Yelle suggested the following: "Arrange the 100,000 samples in a large, nearly square rectangle matrix containing 316 rows by 317 columns. (The 172 extra spaces can be filled with empty containers.) On the first day do 633 tests as follows: Do 316 tests where each test uses every sample in one of the 316 rows. Do 317 tests where each test uses every sample in one of the 317 columns (note that 316×317>100,000). The worst case is that 10 of the row tests are positive, and 10 of the column tests are positive. On the second day, do individual tests on any sample that was in a positive row and a positive column. There will be no more than 100 of these tests, for a total of 733 tests; (732 if there are 10 bad pints exactly)."

Surprisingly, combining a binary search and matrix approach yields an even better solution as Ng Weng Leong and Tomas Rokicki showed. Rokicki has the current record at 608 tests.

If any number of days are allowed and there are 10 bad pints, then the first optimal solution was suggested by Mike Goodrich: "The idea is actually quite simple. We form a complete binary tree on top of the ID numbers of the pints. Starting at the root, we proceed round-by-round down the tree in a level-by-level fashion to find all the bad pints at the leaves of this tree. At the beginning of each round i we inductively know all the nodes of the tree at level i that have bad descendents; call these nodes 'bad.' We then perform a test for the two children of each such bad node, which gives us the induction hypothesis for the next level. We can then continue down the tree until we have found all the bad pints. There is also a slight optimization we can apply in some cases: Namely, if at any point during this traversal we have b bad nodes on the current level, then we need only to test left children from then on. The reason for this optimization is that once we have b bad nodes, each one has exactly one bad leaf descendant, so if a left-child badness test is negative, we know that the right child must be bad (without actually testing it). In the worst case, the binary tree method (together with this optimization) will use no more than 264 tests for the 10 bad node case and works in 17 days." That would be (by day) (2,4,8,16, 18×13) or 264 tests.

Rokicki suggested an improvement: "This can be improved, though, by simply starting with a larger number of groups at the very top, say, 25, which would give us (25,18×12) for a max of 241 tests in 13 days max, and this can probably be yet further improved..."

Suzanne M. Lea suggested a generalization: "Assume we have b bad pints in N samples. We test by dividing the pints into k groups during each iteration and testing all the pints belonging to each group. If on any iteration we have b bad groups, then each group can have at most one bad pint (which can then be found using binary search). The worst case is when b-1 groups are bad each time."

Michael Goodrich combined the binary tree and matrix approach to solve the problem of 10 bad pints in three days using only 400 tests. For example, partition the pints into sets of 1000 in the first day. At most, 10 of these sets will be bad, leaving 10,000 pints to be tested. A square matrix of 100×100 can then be used to analyze these as in Dennis Yelle's approach.

DDJ

1 2 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.