The Right Tool for Forensics
The right tool for this kind of problem is Cogito (www.cogitoinc.com), a graph-based relationship analytics tool that can do pattern matching and relationship identification, even if patterns are unknown or the relationships are highly separated.
Consider an actual example that is much harder than a simple path. The police collect surveillance data in the form of notes and police reports. There is no fixed structure in which to fit this data. For example, U-Haul reports that a truck has not been returned and they file a police report. That same week, a farm supply company reports someone purchased a large amount of fertilizer. If the same person did both actions, and used his own name (or a known alias) in both cases, then you could join them into a relationship based on the "bad guys" table. This would be fairly easy; you would have this kind of query in a VIEW for weekly reports. This is basically a shortest-path problem and it means that we are trying to find the dumbest terrorist in the United States.
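To make the easy case concrete, here is a minimal sketch of that join, using Python's built-in sqlite3 module so it can be run as-is. The table and column names (BadGuys, TruckReports, FertilizerSales) and the sample rows are hypothetical; the article does not describe an actual schema, and alias handling is left out.

```python
import sqlite3

def same_person_hits(conn):
    """Return watch-list names that appear in both kinds of report."""
    query = """
        SELECT DISTINCT b.person_name
          FROM BadGuys AS b
          JOIN TruckReports AS t ON t.renter_name = b.person_name
          JOIN FertilizerSales AS f ON f.buyer_name = b.person_name
         ORDER BY b.person_name
    """
    return [row[0] for row in conn.execute(query)]

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BadGuys (person_name TEXT PRIMARY KEY);
    CREATE TABLE TruckReports (renter_name TEXT NOT NULL);
    CREATE TABLE FertilizerSales (buyer_name TEXT NOT NULL);
    INSERT INTO BadGuys VALUES ('Suspect One'), ('Suspect Two');
    INSERT INTO TruckReports VALUES ('Suspect One'), ('Someone Else');
    INSERT INTO FertilizerSales VALUES ('Suspect One'), ('Suspect Two');
""")
hits = same_person_hits(conn)
```

The join only fires when the same name shows up in all three tables, which is exactly why it catches nothing once two different people are involved.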
In the real world, conspirator A rents the truck and conspirator B buys the fertilizer. Or one guy rents a truck and cannot return it on time, while another totally unrelated person buys fertilizer. Who knows? To find out if we have a coincidence or a conspiracy, you need a relationship between the people involved. That relationship can be weak (both people live in New York state) or strong (they were cellmates in prison).
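One way to picture the weak-versus-strong distinction is as a weight on each edge of a relationship graph. The sketch below is an illustration only, not a description of how Cogito works: the people, the strength scores, and the 0.5 threshold are all made-up assumptions. It uses a breadth-first search to find the shortest chain of sufficiently strong relationships between two people.

```python
from collections import deque

# Hypothetical relationship edges: (person, person, strength from 0 to 1).
EDGES = [
    ("Renter", "Buyer", 0.1),     # weak: both live in the same state
    ("Renter", "Cellmate", 0.9),  # strong: shared a prison cell
    ("Cellmate", "Buyer", 0.8),   # strong: invented link for the example
]

def build_graph(edges, min_strength):
    """Keep only the edges at or above the strength threshold."""
    graph = {}
    for a, b, strength in edges:
        if strength >= min_strength:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def find_chain(edges, start, goal, min_strength=0.5):
    """Breadth-first search: shortest chain of strong-enough links, or None."""
    graph = build_graph(edges, min_strength)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

strong_chain = find_chain(EDGES, "Renter", "Buyer", min_strength=0.5)
any_chain = find_chain(EDGES, "Renter", "Buyer", min_strength=0.05)
```

With the threshold at 0.5 the weak same-state link is ignored and the chain runs through the cellmate; lowering the threshold lets the direct but weak link through.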
Figure 1 shows the actual subgraph that ties the stolen U-Haul truck and fertilizer to previously unknown people. The subgraph is pulled from a graph with hundreds of nodes and edges. It is unlikely that you would see that subgraph by eye or from the raw data. The pieces are simple: Buying fertilizer is a suspicious activity. A security guard reports a public nuisance at Cooley Dam. Jake Campbell fails to return a U-Haul truck. And finally, all the connections are put together.
Notice that the subgraph is not a simple path, but a network of relationships. It is easy to build a path, but not to think in terms of networks. One property we would like in a graph is that it is planar, meaning you can draw it on a flat piece of paper, without any edges crossing over each other. It is simply easier for a human to process a planar graph.
However, just because a graph is drawn with crossing lines, it doesn't mean that it can't be redrawn as a planar graph. If you want to play with this idea, John Tantalo (www.planarity.net) presents an interactive game that displays a graph of (n) nodes, which can be moved to "untangle" the edges.
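There is also a quick arithmetic screen for planarity. A simple planar graph with v >= 3 nodes has at most 3v - 6 edges, a consequence of Euler's formula. The sketch below applies that bound; the function name is an illustrative choice. Note that the test is necessary but not sufficient: a graph that fails it cannot be untangled, but a graph that passes it may still be non-planar.

```python
def may_be_planar(node_count, edge_count):
    """Necessary (not sufficient) edge-count test for a simple graph."""
    if node_count < 3:
        return True  # graphs this small are always planar
    return edge_count <= 3 * node_count - 6

# K5, the complete graph on 5 nodes, has 10 edges and fails the bound.
k5_passes = may_be_planar(5, 10)

# K3,3 has 6 nodes and 9 edges; it passes the bound even though it is
# known to be non-planar, which shows the test is only a screen.
k33_passes = may_be_planar(6, 9)
```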
But back in the real world, we tend to discover networks rather than start with a network as a hypothesis. The model we are building in Cogito's graph tool is dynamic, while the RDBMS model is static. This is why Cogito works for investigation and intelligence problems, and SQL works for production problems. Cogito, in other words, is an electronic version of that "CSI" whiteboard.
In fact, major police departments have hundreds of cases a week, and the super-genius Sherlock Holmes characters from television shows are few and far between. But even if you could find such geniuses, you simply do not have enough whiteboards to do this kind of analysis one case at a time in the real world. To be effective, intelligence work must be computerized.
Most crimes are committed by repeat offenders, who tend to follow patterns. What a police department wants to do is describe a case, then look through all the open cases to see if there are three or more cases that have the same pattern. The choice of three is not by chance: one occurrence is an event, two occurrences are a coincidence, and three occurrences are a pattern.
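The three-or-more rule can be sketched in a few lines. This is a deliberately crude illustration: it assumes each case has already been reduced to a set of descriptive traits, and it treats two cases as matching only when their trait sets are identical. The case identifiers and traits are invented, and real pattern matching would need to tolerate partial matches.

```python
from collections import defaultdict

def recurring_patterns(cases, min_cases=3):
    """Group cases by identical trait sets; keep groups of min_cases or more."""
    by_pattern = defaultdict(list)
    for case_id, traits in cases:
        by_pattern[frozenset(traits)].append(case_id)
    return {
        pattern: ids
        for pattern, ids in by_pattern.items()
        if len(ids) >= min_cases
    }

# Hypothetical open cases, for illustration only.
OPEN_CASES = [
    ("case-01", {"night", "rear window", "jewelry"}),
    ("case-02", {"daytime", "front door", "electronics"}),
    ("case-03", {"night", "rear window", "jewelry"}),
    ("case-04", {"night", "rear window", "jewelry"}),
    ("case-05", {"daytime", "front door", "electronics"}),
]
patterns = recurring_patterns(OPEN_CASES)
```

Here only the night-time burglaries reach the threshold of three; the two daytime cases remain a coincidence until a third one turns up.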
Another use for Cogito is the creation of a hypothesis and cognitive map from raw data. The visual display can be manipulated to arrange the graph in 2D and 3D visuals. Suddenly, a relationship appears as a clustering of nodes on the screen. You can then check it out and see if it is worth further investigation.
The traditional way of hypothesis discovery was to make a few guesses, then apply statistical tests to see what correlations you could find. But since you have to set up the testing yourself, you tend to make assumptions about causality. For example, a medical study on the skin disease psoriasis collected hundreds of measurements on the patients. Even with a huge computer and a good statistics package, trying to find correlations for creating a testable hypothesis can be a combinatorial explosion. Instead of trying all possible combinations, you tend to look for expected factors and overlook the unexpected factors.
Unexpectedly, one of the strongest factors in the psoriasis study was obesity. A traditional statistical tool might not have found this relationship because nobody would think to look for it.
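The combinatorial explosion in that kind of study is easy to quantify. The sketch below counts how many groups of factors would have to be examined if every combination of two or more measurements were a candidate hypothesis. The figure of 300 measurements is an assumption standing in for "hundreds"; the study's actual count is not given here.

```python
from math import comb

def candidate_hypotheses(measurements, max_factors):
    """Count factor groups of size 2 up to max_factors, order ignored."""
    return sum(comb(measurements, k) for k in range(2, max_factors + 1))

# Assumed number of measurements per patient (hypothetical).
MEASUREMENTS = 300

pairs_only = candidate_hypotheses(MEASUREMENTS, 2)
up_to_triples = candidate_hypotheses(MEASUREMENTS, 3)
```

Under that assumption there are 44,850 pairs to test, and allowing three-way combinations pushes the total to roughly 4.5 million, which is why analysts fall back on the factors they already expect.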
Investigation and knowledge discovery are not the same as production applications or traditional statistical analysis. We are not looking for a measurement (statistics) or a well-described subset (SQL and RDBMS). The problem is that we don't know what we are looking for, much less how to measure it.