Dr. Dobb's | Discovering Relationships in Context

Cogito is a graph-based relationship analytics tool for pattern matching and relationship identification, making it an ideal tool for computer forensics.

Relational databases hold models of a world that assumes you know the basic relationships among the entities in your problem space. Their purpose is to maintain business rules and access data in a known format. This is a good assumption in a production environment. Actually, it is a necessary assumption. You want to produce known and well-defined transactions and reports in online processing (OLTP) environments. A constantly changing schema would be a model of a world where elephants can drop out of the sky. Even online analytical processing (OLAP) deals with known relationships, usually summaries taken from transactional systems and outside data sources.

The relational model (RDBMS) is a deductive system that can find sets of entities from the data in this model. If you find a new relationship, you have to add tables and new constraints to your schema to model it. Likewise, if you drop a table from your schema, you have to see how that changes the whole model. This is the nature of a deductive approach to data.

RDBMS is very useful, but there are problems in the world that cannot be done with deductive methods. We need inductive tools that let us add relationships, rather than data, to a model. What kind of data requires inductive reasoning? Imagine you are a cop on the "CSI" television show. All you have is a collection of odd facts that do not fall into nice neat relational tables. These facts tie various data elements together in various ways. You have 60 minutes to find a network of associations to connect the bad guys to the crime in some as-of-yet unknown manner. And a new fact introduced after the last commercial break can change the outcome in the last five minutes.

In any of the shows, the detectives go to a marker board and start making "fishbone" diagrams (also known as "Ishikawa" diagrams) and other general-directed graphs. Each event is tied together until we have a path from the perpetrators to the crime. The whole point of the show is finding that path by hard work, some shooting, and clever thinking. It's good drama, but in the real world, it can be slow and we don't have scriptwriters to guarantee outcomes.

What we need is a tool to manipulate a general-directed graph with various relationships between simple entities. Ideally, we would like this tool to be declarative and self-optimizing—like SQL engines.

For instance, graph theory, a branch of mathematics that deals with nodes and edges, is one of the most powerful mathematical tools we have because it is so general. A road map's nodes are cities and the edges are roads. A family tree's nodes are people and the edges are blood relations. A circuit diagram's nodes are electronic components and the edges are the wires between them. Graphs are so general—you've used them all your life, but not thought of them in a formal sense.

Because I'm a big fan of SQL, my first approach to any problem is to write a schema and some queries. I have tried to do general graphs in SQL and my conclusion is that it's possible—but not practical—for any data set of a realistic size.

Six Degrees of Kevin Bacon

For example, "Six Degrees of Kevin Bacon" is a popular game that started at Albright College (www.albright.edu). The idea is that Kevin Bacon can be linked with any other actor in the world via movie appearances. Jack Nicholson was in A Few Good Men with Kevin Bacon. Michelle Pfeiffer was in Wolf with Jack Nicholson. Tab Hunter was in Grease 2 with Michelle Pfeiffer. The number of links is your "Bacon Number," which cannot be greater than six—Bacon (Nicholson) = 1, Bacon (Pfeiffer) = 2, and Bacon (Hunter) = 3.

In fact, there is a web site at the University of Virginia (www.cs.virginia.edu/oracle/) that does nothing but maintain an in-memory graph database for just this single problem. It downloads current data from the Internet Movie Database (www.imdb.com), which contains more than 800,000 actors, 375,000 movies, and 70,000 aliases. It's a cute piece of work for a single—and silly—purpose, but it is not a general tool.

The most common—and best—way to model a graph in SQL is an adjacency-list model. Example 1 is a typical adjacency-list model of a general graph with one kind of edge that is understood from context. Structure goes in one table and the nodes in a separate table, because they are separate kinds of things (entities and relationships).

CREATE TABLE Actors -- nodes
(actor_id INTEGER NOT NULL PRIMARY KEY
 actor_name CHAR(25) NOT NULL);

CREATE TABLE MovieCasts -- edges
(relationship_name  VARCHAR(50) NOT NULL, 
 begin_actor_id INTEGER NOT NULL
         REFERENCES Actors (actor_id)
         ON UPDATE CASCADE
         ON DELETE CASCADE, 
 end_actor_id INTEGER NOT NULL 
          REFERENCES Actors (actor_id)
         ON UPDATE CASCADE
         ON DELETE CASCADE, 
 PRIMARY KEY (relationship_name, begin_actor_id, end_actor_id), 
 CHECK (begin_actor_id <> end_actor_id));

Example 1: Typical adjacency list.

Finding the Shortest Path

A path through a graph is a traversal of consecutive nodes along a sequence of edges. Clearly, the node at the end of one edge in the sequence must also be the node at the beginning of the next edge in the sequence. The length of the path is the number of edges that are traversed along the path. A path does not have a cycle in it.

In this particular problem, the nodes are actors and the edges are the "in a movie with," "directed by," "costarred," and other such relationships.

Here, I'm looking for a path from "Kevin Bacon" to some other actor that has a length less than six. Actually, what I would really like is the shortest path (most direct relationship) within the set of all paths between those two actors.

The advantage of SQL as a database language is that it is a declarative, set-oriented language. I tell the SQL engine what I want to get back and it figures out how to do it. An optimizer looks at statistics, indexes, table sizes, and dozens of other things to get an execution plan. The optimizer is usually smarter than human beings.

But suddenly, this is not the approach we want with a graph. When I specify a rule for a path, I have to know the length (n) of the path I am looking for before I can write the query. Then when I run this query, I get all the paths in the set. Looking for this set can be a combinatorial explosion.

What I really wanted was the shortest path between my two actors. This means I have to start with paths of length=1, length=2, and so forth, until I either find a path or have searched all the edges in the graph. Example 2 is a template for a simple, two-edge path starting at Kevin Bacon.

SELECT M1.end_actor_id, 
        M1.relationship_name  
       M1.end_actor_id, 
        M2.relationship_name,  
       M2.end_actor_name 
  FROM MovieCasts AS M1, MovieCasts AS M2
 WHERE M1.begin_actor_name = 'Kevin Bacon'
   AND M1.end_actor_id = M2.begin_actor_id 
   AND M2.end_actor_id = 'Joe Celko';

Example 2: Two-edge path.

To extend this pattern, we alter the table by adding a column for the next relationship and one for the next actor. The important part is to also add a predicate that prevents cycles. That is, the next step in the path cannot already be in the unaltered table.

Notice in Example 3 that you have to redo the query for every path length for every pair of actors. If there is no relation between the two actors, you waste a lot of time and storage space, assuming that the problem doesn't overflow the limits of your database engine. Luckily, there's a tool that handles this kind of problem.

   AND M1.end_actor_id = M2.begin_actor_id;  
   AND M2.end_actor_id = M3.begin_actor_id
   ... 
   AND Mn.end_actor_id = 'Joe Celko' 
  AND Mn.begin_actor_id 
          NOT IN ('Kevin Bacon', M2.end_actor_id, M3.end_actor_id, ..);

Example 3: Altering the table.

The Right Tool for Forensics

The right tool for this kind of problem is Cogito (www.cogitoinc.com), a graph-based relationship analytics tool that can do pattern matching and relationship identification, even if patterns are unknown or the relationships are highly separated.

Consider an actual example that is much harder than a simple path. The police collect surveillance data in the form of notes and police reports. There is no fixed structure in which to fit this data. For example, U-Haul reports that a truck has not been returned and they file a police report. That same week, a farm supply company reports someone purchased a large amount of fertilizer. If the same person did both actions, and used his own name (or a known alias) in both cases, then you could join them into a relationship based on the "bad guys" table. This would be fairly easy; you would have this kind of query in a VIEW for weekly reports. This is basically a shortest-path problem and it means that we are trying to find the dumbest terrorist in the United States.

In the real world, conspirator A rents the truck and conspirator B buys the fertilizer. Or one guy rents a truck and cannot return it on time, while another totally unrelated person buys fertilizer. Who knows? To find out if we have a coincidence or a conspiracy, you need a relationship between the people involved. That relationship can be weak (both people live in New York state) or strong (they were cellmates in prison).

Figure 1 shows the actual subgraph that ties the stolen U-Haul truck and fertilizer to previously unknown people. The subgraph is pulled from a graph with hundreds of nodes and edges. It is unlikely that you would see that subgraph by eye or from the raw data. The pieces are simple: Buying fertilizer is a suspicious activity. A security guard reports a public nuisance at Cooley Dam. Jake Campbell fails to return a U-Haul truck. And finally, all the connections are put together.

Notice that the subgraph is not a simple path, but a network of relationships. It is easy to build a path, but not to think in terms of networks. One property we would like in a graph is that it is planar, meaning you can draw it on a flat piece of paper, without any edges crossing over each other. It is simply easier for a human to process a planar graph.

However, just because a graph is drawn with crossing lines, it doesn't mean that it can't be redrawn as a planar graph. If you want to play with this idea, John Tantalo (www.planarity.net) presents an interactive game that displays a graph of (n) nodes, which can be moved to "untangle" the edges.

But back in the real world, we tend to discover networks rather than start with a network as a hypothesis. The model we are building in Cogito's graph tool is dynamic, while the RDBMS model is static. This is why Cogito works for investigation and intelligence problems, and SQL works for production problems. Cogito, in other words, is an electronic version of that "CSI" whiteboard.

In fact, major police departments have hundreds of cases a week, and the super-genius Sherlock Holmes characters from television shows are few and far between. But even if you could find such geniuses, you simply do not have enough whiteboards to do this kind of analysis one case at a time in the real world. To be effective, intelligence work must be computerized.

Repeat offenders who tend to follow patterns commit most crimes. What a police department wants to do is to describe a case, then look through all the open cases to see if there are three or more cases that have the same pattern. The choice of three is not by chance—one occurrence is an event, two occurrences is a coincidence, and three occurrences is a pattern.

Another use for Cogito is the creation of a hypothesis and cognitive map from raw data. The visual display can be manipulated to arrange the graph in 2D and 3D visuals. Suddenly, a relationship appears as a clustering of nodes on the screen. You can then check it out and see if it is worth further investigation.

The traditional way of hypothesis discovery was to make a few guesses, then apply statistical tests to see what correlations you could find. But since you have to set up the testing yourself, you tend to make assumptions about causality. For example, a medical study on the skin disease psoriasis collected hundreds of measurements on the patients. Even with a huge computer and a good statistics package, trying to find correlations for creating a testable hypothesis can be a combinatorial explosion. Instead of trying all possible combinations, you tend to look for expected factors and overlook the unexpected factors.

Unexpectedly, one of the strongest factors in the psoriasis study was obesity. A traditional statistical tool might not have found this relationship because nobody would think to look for it.

Conclusion

Investigation and knowledge discovery is not the same as production applications or traditional statistical analysis. We are not looking for a measurement (statistics) or a well-described subset (SQL and RDBMS). The problem is that we don't know what we are looking for, much less how to measure it.