Discovering Relationships in Context


The Right Tool for Forensics

The right tool for this kind of problem is Cogito (www.cogitoinc.com), a graph-based relationship analytics tool that can match patterns and identify relationships even when the patterns are not known in advance or the related items are many links apart.

Consider an actual example that is much harder than a simple path. The police collect surveillance data in the form of notes and police reports, and there is no fixed structure into which this data fits. For example, U-Haul files a police report that a truck has not been returned. That same week, a farm supply company reports that someone purchased a large amount of fertilizer. If the same person did both, and used his own name (or a known alias) in both cases, then you could join the two reports into a relationship based on the "bad guys" table. This would be fairly easy; you would put this kind of query in a VIEW for weekly reports. This is basically a shortest-path problem, and it means that we are trying to find the dumbest terrorist in the United States.
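The easy case can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the table names (`bad_guys`, `truck_reports`, `fertilizer_sales`), the column names, the view name, and the sample rows are all hypothetical and do not come from any real police schema.

```python
import sqlite3

def build_demo():
    """Create an in-memory database with hypothetical tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bad_guys (real_name TEXT, alias TEXT);
        CREATE TABLE truck_reports (renter_name TEXT);
        CREATE TABLE fertilizer_sales (buyer_name TEXT);

        INSERT INTO bad_guys VALUES ('Person A', 'Alias A'), ('Person B', 'Alias B');
        INSERT INTO truck_reports VALUES ('Alias A'), ('Person C');
        INSERT INTO fertilizer_sales VALUES ('Person A'), ('Person B');

        -- The weekly report: known people who appear in BOTH kinds of report,
        -- under either their real name or a known alias.
        CREATE VIEW weekly_matches AS
        SELECT DISTINCT b.real_name
          FROM bad_guys AS b
          JOIN truck_reports AS t
            ON t.renter_name IN (b.real_name, b.alias)
          JOIN fertilizer_sales AS f
            ON f.buyer_name IN (b.real_name, b.alias);
    """)
    return conn

def same_person_matches(conn):
    """Return the names that the weekly view flags, in alphabetical order."""
    rows = conn.execute("SELECT real_name FROM weekly_matches ORDER BY real_name")
    return [row[0] for row in rows]

if __name__ == "__main__":
    print(same_person_matches(build_demo()))
```

The join only works because both reports lead back to one row in the "bad guys" table. As soon as two different people are involved, there is nothing to join on.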

In the real world, conspirator A rents the truck and conspirator B buys the fertilizer. Or one guy rents a truck and cannot return it on time, while another totally unrelated person buys fertilizer. Who knows? To find out whether you have a coincidence or a conspiracy, you need a relationship between the people involved. That relationship can be weak (both people live in New York state) or strong (they were cellmates in prison).
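To make the idea of "a relationship between the people involved" concrete, here is a minimal sketch that searches a small graph for the shortest chain of links between two people. It is not how Cogito works internally; it is a plain breadth-first search, the edge list is made up, and it ignores whether a link is weak or strong.

```python
from collections import deque

def find_connection(edges, start, goal):
    """Breadth-first search over undirected (person, relation, person) edges.

    Returns the shortest chain of links from start to goal as a list of
    (person, relation, person) tuples, or None if the two are unconnected.
    """
    neighbors = {}
    for a, relation, b in edges:
        neighbors.setdefault(a, []).append((relation, b))
        neighbors.setdefault(b, []).append((relation, a))

    came_from = {start: None}   # person -> (previous person, relation)
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            chain = []
            while came_from[person] is not None:
                previous, relation = came_from[person]
                chain.append((previous, relation, person))
                person = previous
            return chain[::-1]
        for relation, other in neighbors.get(person, []):
            if other not in came_from:
                came_from[other] = (person, relation)
                queue.append(other)
    return None

# Hypothetical data: A and B never appear in the same report,
# but both are linked to a third person, C.
EDGES = [
    ("conspirator A", "cellmate of", "C"),
    ("C", "brother of", "conspirator B"),
    ("D", "neighbor of", "E"),
]

if __name__ == "__main__":
    print(find_connection(EDGES, "conspirator A", "conspirator B"))
    print(find_connection(EDGES, "conspirator A", "D"))
```

A chain through a prison cellmate deserves more attention than one through a shared state of residence, so a fuller version would weight each edge by the strength of the relationship.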

Figure 1 shows the actual subgraph that ties the stolen U-Haul truck and fertilizer to previously unknown people. The subgraph is pulled from a graph with hundreds of nodes and edges. It is unlikely that you would see that subgraph by eye or from the raw data. The pieces are simple: Buying fertilizer is a suspicious activity. A security guard reports a public nuisance at Cooley Dam. Jake Campbell fails to return a U-Haul truck. And finally, all the connections are put together.

Figure 1: Cogito subgraph.

Notice that the subgraph is not a simple path, but a network of relationships. It is easy to build a path, but not to think in terms of networks. One property we would like in a graph is that it is planar, meaning you can draw it on a flat piece of paper, without any edges crossing over each other. It is simply easier for a human to process a planar graph.

However, just because a graph is drawn with crossing lines, it doesn't mean that it can't be redrawn as a planar graph. If you want to play with this idea, John Tantalo (www.planarity.net) presents an interactive game that displays a graph of n nodes, which can be moved to "untangle" the edges.
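There is a quick arithmetic check for when untangling is hopeless. It follows from Euler's formula that a simple planar graph with v nodes (v at least 3) can have at most 3v - 6 edges. The sketch below applies only that bound; the function name is my own, and passing the check does not prove a graph is planar, since the bound is a necessary condition and not a sufficient one.

```python
def may_be_planar(num_nodes, num_edges):
    """Edge-count test for a simple graph (no loops, no duplicate edges).

    Returns False when the graph has too many edges to be planar.
    Returns True when the bound is satisfied, which means only that
    planarity has not been ruled out.
    """
    if num_nodes < 3:
        return True
    return num_edges <= 3 * num_nodes - 6

if __name__ == "__main__":
    # Five nodes, every pair connected: 10 edges, but the bound allows only 9.
    print(may_be_planar(5, 10))
    # Six nodes with 9 edges passes the bound, yet some such graphs
    # (three houses each wired to three utilities) are still not planar.
    print(may_be_planar(6, 9))
```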

But back in the real world, we tend to discover networks rather than start with a network as a hypothesis. The model we are building in Cogito's graph tool is dynamic, while the RDBMS model is static. This is why Cogito works for investigation and intelligence problems, and SQL works for production problems. Cogito, in other words, is an electronic version of that "CSI" whiteboard.

In fact, major police departments have hundreds of cases a week, and the super-genius Sherlock Holmes characters from television shows are few and far between. But even if you could find such geniuses, you simply do not have enough whiteboards to do this kind of analysis one case at a time in the real world. To be effective, intelligence work must be computerized.

Most crimes are committed by repeat offenders, who tend to follow patterns. What a police department wants to do is describe a case, then look through all the open cases to see if there are three or more cases that have the same pattern. The choice of three is not by chance: one occurrence is an event, two occurrences are a coincidence, and three occurrences are a pattern.
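The "three or more" rule is easy to sketch if you reduce each case to a set of descriptive traits. That reduction is a strong simplifying assumption: real pattern matching over a graph compares structures, not flat tags, and the case numbers and traits below are invented for illustration.

```python
from collections import defaultdict

def find_patterns(cases, min_cases=3):
    """Group open cases that share exactly the same set of traits.

    cases maps a case id to an iterable of trait strings. Returns a dict
    from each trait set (as a frozenset) to the sorted ids of the cases
    that share it, keeping only groups with at least min_cases members.
    """
    by_signature = defaultdict(list)
    for case_id, traits in cases.items():
        by_signature[frozenset(traits)].append(case_id)
    return {
        signature: sorted(ids)
        for signature, ids in by_signature.items()
        if len(ids) >= min_cases
    }

# Hypothetical open cases.
OPEN_CASES = {
    101: ["night", "rear window", "jewelry only"],
    102: ["night", "rear window", "jewelry only"],
    103: ["daytime", "front door", "electronics"],
    104: ["jewelry only", "night", "rear window"],
    105: ["daytime", "front door", "electronics"],
}

if __name__ == "__main__":
    for signature, ids in find_patterns(OPEN_CASES).items():
        print(sorted(signature), ids)
```

Cases 103 and 105 match each other, but two occurrences are only a coincidence, so they are not reported.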

Another use for Cogito is the creation of a hypothesis and cognitive map from raw data. The visual display can be manipulated to arrange the graph in 2D and 3D views. Suddenly, a relationship appears as a clustering of nodes on the screen. You can then check it out and see if it is worth further investigation.

The traditional way of hypothesis discovery was to make a few guesses, then apply statistical tests to see what correlations you could find. But since you have to set up the testing yourself, you tend to make assumptions about causality. For example, a medical study on the skin disease psoriasis collected hundreds of measurements on the patients. Even with a huge computer and a good statistics package, trying to find correlations for creating a testable hypothesis can be a combinatorial explosion. Instead of trying all possible combinations, you tend to look for expected factors and overlook the unexpected factors.
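A little arithmetic shows how fast the explosion happens. The sketch below counts how many distinct groups of factors you would have to test; the figure of 300 measurements is an assumption standing in for "hundreds", not a number from the psoriasis study.

```python
from math import comb

def candidate_hypotheses(num_measurements, max_factors):
    """Count the groups of 2..max_factors measurements that could be tested together."""
    return sum(comb(num_measurements, k) for k in range(2, max_factors + 1))

if __name__ == "__main__":
    for max_factors in (2, 3, 4):
        print(max_factors, candidate_hypotheses(300, max_factors))
```

With 300 measurements there are 44,850 pairs alone, and allowing three-way combinations pushes the total to about 4.5 million.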

Unexpectedly, one of the strongest factors in the psoriasis study was obesity. A traditional statistical tool might not have found this relationship because nobody would think to look for it.

Conclusion

Investigation and knowledge discovery are not the same as production applications or traditional statistical analysis. We are not looking for a measurement (statistics) or a well-described subset (SQL and RDBMS). The problem is that we don't know what we are looking for, much less how to measure it.

