Channels ▼
RSS

Global Developer

Approximate String Matches and Dynamic Problems



Dennis Shasha is a professor of Computer Science at New York University. His latest puzzle book Puzzles for Programmers and Pros outlined some general approaches to programmatic puzzle solving. This column challenges you to find the solutions to the puzzles that lie at the core of some cool tough applications. He solicits your help in identifying such cool apps. If you have written an application that required extensive use of heuristics (because no algorithm would solve it) or that is otherwise algorithmically interesting, please send mail to Dennis at shasha@cs.nyu.edu. Put "Tough Apps" in the subject line.


Many problems have the property that they can be broken down into two or more subproblems whose solutions can be combined in some way.

For example, finding the shortest path from point x to point y through z can be expressed as finding the shortest path from x to z and then finding the shortest path from z to y. That realization leads to Dijkstra's classic algorithm:

  
i. Mark the initial node <b>x</b> as having cost 0 and all other nodes as having cost infinity; also mark <b>x</b> as current but all other nodes as not current.
  
ii. Until node y is current: If the cost from <b>x</b> to some current node q plus the cost from q to q' (where q' is not yet current) is the minimum of any cost to a non-current node, then set the cost to q' to the sum cost(x,q) + cost(q,q') and mark q' current.

This remarkably simple algorithm is used millions of times a day, even though Dijkstra had trouble publishing it. (As Cathy Lazere and I found out when we interviewed Edsger Dijkstra for our book Out of their Minds, Dijkstra had trouble publishing the algorithm after he came up with it. "At the time, algorithms were hardly considered a scientific topic. I wouldn't have known where to publish it...The mathematical culture of the day was very much identified with the continuum and infinity. Could a finite discrete problem be of any interest? The number of paths from here to there on a finite graph is finite; each path is a finite length; you must search for the minimum of a finite set. Any finite set has a minimum--next problem,please. It was not considered mathematically respectable..."

Dynamic programming generalizes this approach. Let's look at the classic problem of string comparison, used whenever you run the diff command or biologists compare gene sequences. In string comparison, the goal is to find the smallest number of changes to transform one string into another. For example, starting with the string "some," three changes (also called edits) will yield "sammy" (e.g., replace "o" by "a," replace "e" by "m," and insert "y"). To see how to arrive at such a conclusion, consider a matrix with "some" written on the top and "sammy" written down the side, as in Figure 1.

[Click image to view at full size]
Figure 1.

Next note that in Figure 2, the top row corresponds to deleting first 0 letters (at cost 0), then one letter ("s"), then two letters ("so"), and so on.

[Click image to view at full size]
Figure 2.

So the top row corresponds to transforming some portion of "some" to the empty string. The left column corresponds to transforming the empty string (above the 0) to some portion of "sammy" through insertions. With the setup done, Figure 3 shows the matrix in mid-expansion.

[Click image to view at full size]
Figure 3.

The 0 at the intersection of the two S's corresponds to transforming "s" to "s" (at cost 0,) and the 1 at the intersection of the A row and the O column corresponds to transforming "so" to "sa." The arrows suggest which sub-solution certain numbers come from. So, the 0 at the intersection of the two S's comes from the 0 corresponding to the cost of transforming an empty string to an empty string. Similarly, the 1 at the A row and O column corresponds to mapping "s" to "s" and then mapping "o" to "a" at cost 1. The method to compute the cost of location (i, j) (row i and column j, where (0,0) corresponds to the upper left hand corner of the array) is as follows:

  
cost(i,j) = min( cost(i-1,j) + 1, cost(i,j-1) + 1, cost(i-1,j-1) + delta) where delta is 0 if letter i of the top word ("some" in our case) is the same as letter j of the column word ("sammy" in our case) and delta is 1 otherwise.

So the 0 at the intersection of the S's comes from the top left 0 plus a delta of 0. The 1 of the A row and the O column comes from the 0 plus a delta of 1 because "o" and "a" are unequal.

Continuing the expansion of the matrix, Figure 4, we see that the minimum cost of the 2 in the S row and M column comes from deleting an M after transforming "so" to "s."

[Click image to view at full size]
Figure 4.

Finally, the full matrix Figure 5 shows a path from the bottom right corner to the top left corner.

[Click image to view at full size]
Figure 5.

Each leg of that path corresponds either to the identity transformation (e.g. "m" to "m") at cost 0 or to either a replacement, insertion, or deletion. Thus we can read off the insertion of the "y," then the tranforming of the "e" to an "m," and so on. The cool thing about this algorithm is that it is very simple even though the problem to start with might seem very complex.

Quiz: (For the mathematically inclined) Do you see a way to make this more efficient than quadratic (i.e, proportional to the product of the two strings) when the strings are relatively close? Formally, can you find a method that can find the best edit sequence in time proportional to k*n if k is the number of edits that need to be made?


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video