Graphing Perl Data with GraphViz::Data::Structure

Joe is a test automation architect at Yahoo! who has contributed to core Perl test support, supplied the debugger's internal documentation, and supports a number of CPAN modules. He can be contacted at mcmahon@ibiblio.org.

It used to be that if you wanted to draw a diagram of a data structure, or any other picture that consisted of nodes and edges, you had to do it by hand (or with a computer drawing program), carefully working out the layout of what node went where and how the edges were routed to get the clearest possible picture. This could be a big problem if you completed a diagram and then found out there were a couple more nodes to add; sometimes, you'd have to completely discard the old diagram and start over from scratch. This was a big waste of time when you had diagrams that were constantly changing.

The folks at AT&T Bell Labs created the graphviz package to deal with this problem. It parses a language, called "dot," for the description of graphs—the nodes and their names, and the node connections, like this:

```digraph G {
a -> b;    a -> c;
b -> d;    b -> e;
}

```

The "digraph" keyword denotes to dot that this is a directed graph (edges have a start and an end, with an arrowhead showing the direction in which the edge "flows"). The "arrow" specifications show that a is connected to b and c, and that b is connected to d and e. When this is rendered by graphviz, you get the picture in Figure 1. (dotty is part of the graphviz package—among other things, it can display dot graphs.) In a traditional drawing program, if you wanted to add more nodes, you'd have to figure out how to lay out the graph all over again. With dot, you simply add the new definitions:

```digraph G {
a -> b;   a -> c;
b -> d;   b -> e;
e -> c;   f -> a;   b-> f;
}

```

And it takes care of the layout; see Figure 2.

The possibility of simply describing a drawing rather than painstakingly rendering it by hand opens up many more possibilities for programmers. Instead of having to output textual descriptions of data and relationships, it's now possible to describe those relationships and see a picture of them. The aforementioned examples illustrate this well; even with the description of the connections in the dot file, it's still very difficult to envision exactly how the nodes are connected without the picture.

When graphviz was first released, it was still necessary to put together the dot input files by hand. This could be time consuming—less so than hand-diagramming, admittedly—but very often, a graph creator would go through many iterations of edit/graph/view to try to get just the right graph with all the proper connections.

Enter Perl, and Data Structures

Leon Brocard created the GraphViz Perl module to address this problem, making it far easier to create graphs. Now, instead of having to figure out connections and write the necessary dot code to display a graph, one could simply write Perl code to generate the graph, and have GraphViz render the graph for you via dot. For instance, GraphViz has several interesting demo classes included with the package, including one that visualizes XML and one that renders Perl data structures as graphs (GraphViz::Data::Grapher)

This is where GraphViz::Data::Structure got its start. I wanted to use the XML visualizer, but it has a minor problem in that it doesn't preserve strict order of sub trees in the XML. I was going to render these graphs for others who weren't particularly XML-savvy at the time, so accurate order was important. Also, since this was partially a propaganda tool as well (XML was perceived as "too complicated" at the time), the graphs also had to look attractive.

Dot Records

Readers who are familiar with texts such as Donald Knuth's Art of Computer Programming will no doubt remember the beautiful graphs that were used to describe data structure algorithms: nice rectangular nodes with pointers linking them together. I wanted to achieve a similar effect in my XML visualizations. It so happens that the dot language supports what it calls a record: a rectangular node that can be arbitrarily subdivided into rectangular subnodes. These subnodes can then be linked together via ports that describe the source and target sub nodes. This seemed an ideal way to handle the data structure visualizations: It was supported by dot and it would be familiar to my readers.

Record specs look something like this:

```digraph G {
node [shape=record];
a [label = "<f0> foo | x | <f1> bar"];
b [label = "a | { <f0> foo | x | <f1> bar } | b"];
a:f0 -> b:f1
}

```

The resulting graph is shown in Figure 3.

As you can see, the syntax is a bit arcane, but we can do what we want to do: Create a familiar-looking structure and link the subparts to other things. This allows us to use one of the XML parsing classes to create a tree of objects and then display that tree, all properly linked together and in order. I chose to use XML::Parser's style=>objects to do this, mostly because I understood it best and secondarily because it creates relatively uncomplicated objects.

I was able to cobble together a very basic module that could handle the XML::Parser object tree pretty quickly; it only required a very basic tree-walk and simple array-like layouts. I showed it around, and Leon commented that the output was really neat, but he wondered: Could I generalize it?

It turned out that the basic code I had was enough to do the simple job of dumping a very specific set of hashes and arrays that pointed to each other, but was woefully incomplete when it came to doing arbitrary structures. I decided that the module should be able to dump any Perl data structure, no matter what it was, including self-referential ones. This ruled out using Data::Dumper to do the structural analysis because Data::Dumper can't handle these structures easily (it can dump them fine, but I'd have had to parse the resulting Perl to patch up the self-references, which looked like even more work). This meant I needed a tree-crawler that could handle any of Perl's data structures: scalars, arrays, hashes, globs, and all of the different kinds of references, plus undef.

The basic algorithm is rather simple. At each level, recursively call the tree-walk subroutine to render all of the nodes under this one, and then handle the current node. Arrays and hashes (and globs) have to be iterated over to handle each of their elements.

This was all pretty simple—perhaps the most difficult problem in the tree walk was accurately determining the type of thing being pointed to by a reference. Perl 5.6 and Perl 5.8 differ significantly in what ref() returns for scalar references, which led to the code being broken for a considerable amount of time on 5.8 until I could compare 5.6 and 5.8 runs head-to-head to spot the problem.

Brace Yourself

This was no problem at all compared to the research and trial-and-error needed to figure out how records really work. It's possible, using curly braces, to tell GraphViz to switch cell layouts within a record from vertical to horizontal. This was necessary for objects (and globs) because I wanted them have the name and type on top, and the contents on the bottom. This feature was not precisely documented, and it took a lot of work to figure out exactly how the braces should be nested to get the desired results. Once this was worked out via test-driven development (we'll get to that in a minute), it was pretty clear sailing. Globs were handled as a special case of blessed hashes, for instance.

A Good Set of References

Code references were a bit more of a problem. I could see that the debugger was able to track down where code references came from, but it wasn't obvious how this was done. A lot of inspection of the debugger's dumpvar.pl code was needed to figure out how to look up the actual location where a code reference was defined, and I stole a small amount of code from the dumpvar.pl library routine to handle this.

Testing

A particularly interesting question is, how can you automate the testing of something that is fundamentally visual? The Perl code couldn't know if the graphs looked right; it could only judge whether or not the output was as expected. On the other hand, not automating the tests at all seemed really wrong. I was able to solve this problem in the classic computer science manner: When presented with two exclusive alternatives, do both—selectively.

I needed to look at the actual graphical output once to make sure it was right. After that, I could tell the GraphViz object embedded inside my object to render the graph as_canon, which is canonical dot source code. This source code could be cleaned up a bit to make sure that it was easily comparable, and then automated testing could be used to make sure that the canonical graph was still the same each time after I succeeded in getting the graph right visually. This was extremely helpful when trying to work out the compatibility problem between 5.6 and 5.8.

For test-driven development, once I had figured out how the records worked, I could hand-compose one drawing and save it as "expected" output, then develop code that output that dot code. Once the basic optical test and the comparison tests both passed, I had a basis from which I could develop more tests.

It became evident early on that developing the tests by hand, even with Test::More helping out, was extremely painful. So I developed a helper program that let me specify the tests I wanted to run, run them to do a visual check, and then collect the output for later automated testing. This sped development up considerably because I no longer had to hand-code dot output for every test case once the output started looking sensible.

Current Events

At the moment, GraphViz::Data::Structure has the following features:

• It can dump arbitrary complex data structures.
• It can terminate the dump at a given level (nice if you only want to see the top N layers of a data structure).
• It can trim strings output to dot to a specified minimum length (it was discovered during testing of the XML dump code that dot will just segfault if text strings in nodes are "too long").
• It uses plain text wherever possible to minimize graph size (constants, empty hashes, and arrays).

It's quite useful as an adjunct when trying to get a handle on particularly messy data structures. You can dynamically spin off a graph at any point in your program by piping the output of GraphViz::Data::Structure to a compatible program. This is particularly useful in the debugger.

GraphViz::Data::Structure In Action

Here's a quick example that shows just about everything GraphViz::Data::Structure can do. We create a data structure, graph it, and then launch dotty to display it (shown in Figure 4):

```#!/usr/bin/perl -w
use strict;
use GraphViz::Data::Structure;

my (\$a,\$b,@c,%d);

sub over_there { 'sample sub' }

\$a=\*Foo::Bar;
*Foo::Bar=\&over_there;

*Foo::Bar=\\$b;
\$b="this is a really long string which might cause dot to segfault, so it will be truncated";

*Foo::Bar = \@c;
@c=qw(foo bar baz);

*Foo::Bar = bless \%d, "An::Object";
%d = (This=>'That', The=>'Other');

my \$dot_source =
GraphViz::Data::Structure->new(
\\$a, title=>"a little\nof everything"
)->graph->as_canon;
open DOTTY_IN, ">dotty_input"
or die "Can't create dotty input file: \$!\n";
print DOTTY_IN \$z;
close DOTTY_IN;
system "dotty dotty_input &";

```

You can see that we have the *Foo::Bar glob pointing to each of the things we assigned to it; in particular, GraphViz::Data::Structure has trimmed the long string and has labeled the hash we blessed with its class name. Note also that the array's elements are in order, and that the hash's key-value pairs (and the glob's entries) are all lined up nicely, and that we know the actual name of the subroutine associated with the code reference.

Future Plans and Ideas

GraphViz now supports HTML records, which use table-like markup to define the internal contents of a record node. It's possible that the data layout code could be simplified considerably using this method of defining nodes. The tree-walking code, which does all the hard work, should be refactored to separate the graph generation from the visit code; that way, other programs could use the visit code to safely explore arbitrary data structures. Last, some self-referential structures generate dot code that exposes bugs in dot—the graphs will not render. It's currently unclear as to whether this is caused by bugs in the record layout code or if it's a deeper problem; trying HTML layout may possibly solve this problem.

Conclusion

So now there's a tool to draw pretty pictures of your complex Perl data structures, and it's even invokable on the fly. Use it to debug those weird data structure bugs. Draw pretty pictures of your data relationships. And as Larry says, "have the appropriate amount of fun."

TPJ

More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

First C Compiler Now on Github

The earliest known C compiler by the legendary Dennis Ritchie has been published on the repository.

HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

Building Bare Metal ARM Systems with GNU

All you need to know to get up and running... and programming on ARM

Amazon's Vogels Challenges IT: Rethink App Dev

Amazon Web Services CTO says promised land of cloud computing requires a new generation of applications that follow different principles.

How to Select a PaaS Partner

Eventually, the vast majority of Web applications will run on a platform-as-a-service, or PaaS, vendor's infrastructure. To help sort out the options, we sent out a matrix with more than 70 decision points to a variety of PaaS providers.

More "Best of the Web" >>

Dr. Dobb's TV