Channels ▼

Web Development

Molecular Biology With Perl

August, 2004: Molecular Biology With Perl

Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@

A week or so ago, I got an e-mail from one of my housemates—yes, it's that sort of house—showing me some Perl code she'd written. After June's The Perl Journal article by Ivan Tubert-Brohman on Perl and chemistry, this code made me sit up and take notice. Now don't worry if you don't know any chemistry. Before this morning, neither did I. We're going to look at a bit of chemistry, a bit of text processing, a little bit of optimizing basic Perl programs, and quite a lot of graphics programming.

You see, my housemate is an immunology researcher and spends her days (and most of her nights) playing with big glass columns full of proteins. Every so often, her proteins do something exciting and we get a step closer to understanding why the human body can suffer from autoimmune diseases like lupus and multiple sclerosis. Meanwhile, I'm writing web-based search engines or something similarly meaningless. It makes you think.

The Perl code that she'd been writing measures the distance between two atoms in one particular protein, a bit of a human with some Epstein-Barr virus mixed in. She'd got a file that told her where all the atoms were, and applied some trigonometry to work out the distance.

PDB Format

Protein structure data is usually stored in a file format called PDB, after the Protein Data Bank, which collects a huge number of these protein files. A PDB file looks like this:

ATOM      1  N   ALA  0001      -1.053   1.300   0.614
ATOM      2  CA  ALA  0001      -0.304   0.032   0.746
ATOM      3  C   ALA  0001       0.770  -0.014  -0.311
ATOM      4  O   ALA  0001       1.952  -0.167  -0.047
ATOM      5  H   ALA  0001      -1.805   1.385   1.386
ATOM      6  O   ALA  0001       0.354   0.125  -1.567
ATOM      7  H   ALA  0001      -1.522   1.368  -0.358
ATOM      8  H   ALA  0001       0.176   0.013   1.740
ATOM      9  C   ALA  0001      -1.237  -1.200   0.610
ATOM     10  H   ALA  0001      -2.007  -1.183   1.397
ATOM     11  H   ALA  0001      -0.655  -2.129   0.709
ATOM     12  H   ALA  0001      -1.737  -1.199  -0.371
ATOM     13  H   ALA  0001       1.100   0.082  -2.154

This is a pretty simple one, for the amino acid alanine, one of the compounds that makes up DNA. There are a bunch of other lines that might be in this file, such as an AUTHOR line, some REMARK lines, and CONECT lines, which tell you how the atoms bond together. As we'll see later, that information isn't as important as it might appear.

Each of these ATOM lines contain a number of space-separated fields; again, not all of which are in use in this sample. Here's a fuller one, which is actually part of an alanine acid in a bigger protein:

ATOM     65  CB  ALA A  10      94.137   2.952   8.447  1.00 16.71

To make things a little clearer, I'll add a ruler on top of it, like so:

ATOM     65  CB  ALA A  10      94.137   2.952   8.447  1.00 16.71

Here we have an atom with a unique ID of 65; the next field tells us that this is C, a carbon atom, and then things skip about a little.

As well as identifying the atom with a unique ID, we want to give it a structural description that helps us understand what it's doing in the molecule. So the fourth field, A, is a top-level structural marker—we're in the first "bit" of the molecule. Then the third and fifth fields (ALA 10) tell us that this atom belongs to an ALAnine acid, (the "residue"), which is the 10th residue in this bit of the molecule. To further qualify the atom, we give it a unique suffix, which tells us which particular carbon atom this is in that acid; the suffixes tend to go in a particular sequence: blank for the first such carbon atom, then "A" for alpha, "B" for beta, and so on.

Then we have the three numbers that are what it's all about—the three-dimensional coordinates of that atom in the protein, as revealed by X-ray crystallography. The other two fields are more properties of the atom that we don't need to deal with for now.

Heather's Code

What my housemate needed to do was to find the distance between alpha carbon atom in residue number 65 of the first part of the molecule, and alpha carbon atom in residue number 67 of the second part. Her code, slightly reformatted for brevity, looked like this:

opendir(DIR, ".");
@files = grep(/\.pdb$/,readdir(DIR));

foreach $file (@files) {
   print "$file\n";
   open (FILE,$file) || die "Cannae dae it\n";
   while ($line = <FILE>) {
      if ($line =~ /CA  ... A  65/) {
         $ax = substr($line, 31, 8);
         $ay = substr($line, 39, 8);
         $az = substr($line, 47, 8);
      if ($line =~ /CA  ... B  67/) {
         $bx = substr($line, 31, 8);
         $by = substr($line, 39, 8);
         $bz = substr($line, 47, 8);

   $distance = sqrt(($ax - $bx)*($ax-$bx) + ($ay - $by)*($ay-$by)
                    + ($az - $bz)*($az-$bz));
   print "$distance \n";

When run in a directory containing Heather's protein, it produces the following output:


And that's the answer: 18.2 Angstroms between the two atoms.

Basic Refinements

Now I hate myself for this, but it seems like my natural reaction when I look at someone else's Perl code is to see if I can do it better. So before I start mucking around with the program, let me say it has two major advantages over any other implementation: First, my housemate was able to get it written quickly and easily, and second, it gives the right answer. Anything else is window-dressing.

But permit me a bit of window-dressing.

When I sat down to write this article, I had gotten a directory with quite a few PDB files in it, and was quite surprised to see this output from the program:


So the first problem is that we neither stop when we've found the answer nor reset our coordinates every time we open a new file. We'll do both of these things as our first refinement:

foreach $file (@files) {
   print "$file\n";
   open (FILE,$file) || die "Cannae dae it\n";
   my ($ax, $ay, $az, $bx, $by, $bz);
   while ($line = <FILE>) {
      # ...

   $distance = sqrt(($ax - $bx)*($ax-$bx) + ($ay - $by)*($ay-$by)
                    + ($az - $bz)*($az-$bz));
   print "$distance \n";
   last if $distance;

(# ... represents parts we haven't changed.)

Additionally, since we have all the values we need once we've found the second atom, we can put a last in there, too, to make things more efficient:

   while ($line = <FILE>) {
      if ($line =~ /CA  ... A  65/) {
         $ax = substr($line, 31, 8);
         $ay = substr($line, 39, 8);
         $az = substr($line, 47, 8);
      if ($line =~ /CA  ... B  67/) {
         $bx = substr($line, 31, 8);
         $by = substr($line, 39, 8);
         $bz = substr($line, 47, 8);

Now there's a lot of repetition in those 10 lines, and repetition is something that a computer programmer should try to avoid at all costs. I'll say that again: Repetition is something that a computer programmer should try to avoid at all costs.

Why? Suppose we got one of those offsets wrong—suppose the coordinates start at position 30, not 31. We have to change 31 to 30, and then 39 to 38, and then 47 to 46—and, to make things worse, we have to go down a couple of lines and change it all again. If something appears more than once, there's a reasonable chance that we'll end up forgetting to change it one of those times. In short: More repetitions means more maintenance.

Thankfully, sometime in the '60s, computer scientists came up with the concept of a subroutine to help us avoid repetition:

use constant COORDS_OFFSET => 31;

sub co_ords { 
    my $line = shift;
    (substr($line, COORDS_OFFSET,       8),
     substr($line, COORDS_OFFSET + 8,   8),
     substr($line, COORDS_OFFSET + 16,  8))

while ($line = <FILE>) {
   if ($line =~ /CA  ... A  65/) {
      ($ax, $ay, $az) = co_ords($line);
   if ($line =~ /CA  ... B  67/) {
      ($bx, $by, $bz) = co_ords($line);

Similarly, we can use the exponentiation operator (**) to avoid repeating terms when calculating the distance:

$distance = sqrt(($ax - $bx) ** 2 + ($ay - $by) ** 2
                + ($az - $bz) ** 2);

This isn't just trying to shave characters off a program for the sake of it—this is reducing the amount of maintenance we have to do in the future. And now we've got the kind of program as I'd write it—although, to be honest, I'd still be casting an eye over those repeated substrs.

Handling PDB Files

The substrs would be bothering me because Perl has a bunch of tools for extracting data from lines of text and, while substr is a perfectly good one, it's not great at doing several things at once. Since we're using regular expressions to match the atoms we want, a better way might be to also use them to extract the coordinates.

while ($line = <FILE>) {
  if ($line =~ /CA  ... A  65    (.{8})(.{8})(.{8})/) {
      ($ax, $ay, $az) = ($1, $2, $3);
  if ($line =~ /CA  ... B  67    (.{8})(.{8})(.{8})/) {
      ($ax, $ay, $az) = ($1, $2, $3);

This is a lot better for the simple reason that it keeps the information about the structure of a given PDB file line in one place, where previously we had the structure encoded in a regexp and also in offsets in a subroutine.

Perhaps better still, though, might be to use the unpack function—this is used primarily to unpack fixed-length records from a binary file, but it works just as well to unpack fixed-length records from a text file, such a PDB file. Let's look at the PDB line with its ruler again:

ATOM     65  CB  ALA A  10      94.137   2.952   8.447  1.00 16.71

Now, to create an unpack format from this, we simply turn our field length specifiers (123456) into the letter A (for ASCII) followed by the length: A6. As for the x's, which represent unused columns—well, they just stay as x's, to give us:

my $pdb_format = "A6A5xA2A2xA3A2A4xxxxA8A8A8A6A6";
while ($line = <FILE>) {
    my @fields = unpack($pdb_format, $line);

This is great if we're going to use all the information in the line, but because we're only using a little bit of it, we actually end up doing a bit more work:

if ($fields[0] eq "ATOM" and $fields[2] eq " C" 
    and $fields[3] eq "A " and $fields[5] eq " A" and
    $fields[6] == 65) {
    ($ax, $ay, $z) = @fields[7,8,9];

Yuck. So in this case, it looks like using regular expressions is the right way to do it. Of course, being a lazy Perl programmer, the first thing I really did when looking at this program was to look on for PDB, and I turned up Chemistry::File::PDB and Chemistry::Mol, the subject of Tubert-Brohman's June TPJ article.

Now in this instance, Chemistry::File::PDB turns out to not be that useful—the columns in the PDB file that deal with the residue sequence number are biology specific and aren't picked up by the more generic Chemistry module.

But it did give me another idea.

Visualization with OpenGL

While Heather was busy working on her molecules, another friend was playing with writing some new OpenGL bindings for Perl. OpenGL is a graphics library that makes writing 3D visualization applications really easy. I had been thinking about visualizing molecules into those lovely ball-and-stick models that you used to get in high-school chemistry, but I thought it would be way too difficult to work out how all the atoms fit together and where to place them.

But here was my housemate with her PDB files, which told me the precise coordinates of where each atom was supposed to go. The coordinates were being handed to me on a plate, and OpenGL provided a simple way of drawing spheres at a particular set of coordinates, so shouldn't it be trivial to write my own molecular visualization software to rival rasmol or pymol?

Well, it was that simple. The first thing I needed to do was set up the OpenGL framework—get the window up and running, set light and material parameters, and tell OpenGL what do when the mouse moved or the window resized. Thankfully, most of the hard work is hidden in the OpenGL::Simple::Viewer module, which provides default handlers for OpenGL events. We begin by using the relevant modules and setting up GLUT, the OpenGL toolkit:

glutInitDisplayMode( GLUT_RGBA | GLUT_DOUBLE | GLUT_DEPTH );


my $window = glutCreateWindow("PerlMOL");
glutDisplayFunc( basic_displayfunc(\&visualize) );
glutReshapeFunc( \&basic_reshapefunc );
glutMouseFunc( \&basic_mousefunc );
glutMotionFunc( \&basic_motionfunc );

The basic_... functions are provided by OpenGL::Simple::Viewer. Apart from basic_displayfunc, they provide handlers that allow you to resize, rotate, translate, zoom, and image using the mouse. This is pretty much all we need for a basic molecular visualizer. basic_displayfunc provides a wrapper around a subroutine we provide, which will do all the actual drawing.

We also need to set up some shading and lighting:

my @LightAmbient  = ( 0.4, 0.4, 0.4, 1.0 );
my @LightDiffuse  = ( 0.9, 0.9, 0.9, 1.0 );
my @LightSpecular = ( 0.1, 0.1, 0.1, 0.1 );
my @LightPos      = ( 0, 0, 2, 1.0 );

glLight( GL_LIGHT1, GL_AMBIENT,  @LightAmbient );
glLight( GL_LIGHT1, GL_DIFFUSE,  @LightDiffuse );
glLight( GL_LIGHT1, GL_SPECULAR, @LightSpecular );
glLight( GL_LIGHT1, GL_POSITION, @LightPos );


And the final thing we need to do before diving into writing the drawing code is to call the main loop:


So far, we've brought up the window, set the parameters, and told it to go, calling out visualize subroutine to work out what to do.

Since the molecule is not going to change while we're rotating or zooming around it, we build up and cache a model outside of doing the drawing. First, we want to open the molecule file using Chemistry::File::PDB, and then find all the atoms:

use Chemistry::File::PDB;
my $mol = Chemistry::MacroMol->read(shift);
my @atoms = $mol->atoms;
for my $atom (@atoms) {

We're going to need to know three things about each atom: the coordinates (where to put it), the mass (which relates to how big we're going to make it), and the element type (which relates to what color we're going to make it). Thankfully, Chemistry::Mol already handles each of these.

First the mass. The problem with the mass is that hydrogen atoms are very, very light; so much lighter than anything else that if we draw the atoms to a linear scale, they look tiny. So we scale the atomic mass by taking a logarithm to smooth out the differences in atoms, and then scale again by a linear factor to make it look reasonable:

my $mass   = log( 1 + $atom->mass ) / $mass_scale;

Next, the color. We set up two color conversion tables, for convenience. The first maps names of colors to OpenGL vectors:

my %color = (
    red     => [ 1,   0,   0,   1 ],
    yellow  => [ 1,   1,   0,   1 ],
    orange  => [ 1,   0.5, 0,   1 ],
    green   => [ 0,   1,   0,   1 ],
    cyan    => [ 0,   1,   1,   1 ],
    blue    => [ 0,   0,   1,   1 ],
    magenta => [ 1,   0,   1,   1 ],
    grey    => [ 0.5, 0.5, 0.5, 1 ],
    white   => [ 1,   1,   1,   1 ],

And the second maps elements to colors; I've used a relatively standard color map here:

my %element_colors = (
    C => $color{grey},
    O => $color{red},
    N => $color{blue},
    H => $color{white},
    S => $color{yellow}, 
    P => $color{orange}, 

Now for each atom, we can determine the color:

my $color  = $element_colors{ $atom->symbol } || $color{cyan};

We use cyan as a default, in case there are any unexpected atoms like potassium cropping up in our molecule. Finally, we can extract the coordinates from the atom object:

my @coords = $atom->coords->array;

So for each atom, we put all these things into an array reference and place it in an array:

    push @ballpoints, [ $color, $mass, @coords ];

This is the model of our molecule. Now we need to draw it! Thankfully, OpenGL makes things very simple:

sub visualize {
    if ( !@ballpoints ) { make_model() }
    for (@ballpoints) {
        my ( $color, $mass, @coords ) = @$_;
        glutSolidSphere( $mass, $sphericity, $sphericity);

We set the color, translate the drawing frame to the appropriate coordinates, and draw a circle with the given mass and sphericity (sphericity is a constant that tells OpenGL how many segments to make into a circle). In the OpenGL world, everything is made up of polygons; the more polygons in a sphere, the more spherical it looks, but the more complicated it is to render. I've found that a sphericity of about eight provides attractive but not overly resource-heavy atoms.

And, well, that's it. We can load up our molecules, view the atoms on the screen, and rotate and zoom them with the mouse. Great. Except for one thing: Where are the bonds?


We've got the balls in our ball-and-stick diagram, but no sticks—we don't know how the atoms fit together. The PDB file has some CONECT records that purport to tell us this, but Chemistry::File::PDB does not read them—in part because they are not always complete. Instead, we can work out the bonds between atoms by applying science.

Just like two large objects have a gravitational attraction to each other, atoms that are close to each other stay in position by means of something called the "Van der Waals forces." Practically, this means that we can tell if two atoms are bonded by seeing if the distance between them is less than a particular boundary value, which is a function of the mass of the two atoms in question.

Of course, to check the distance and mass of every pair of atoms to see if they're bonded is a horrendous task, and so we have two ways to avoid this: We can either use a clever algorithm or we can let someone else do the work. I prefer the second solution, and a couple of moments searching on CPAN brought me to Chemistry::Bond::Find:

use Chemistry::Bond::Find qw(:all);

And now we can query $mol->bonds to link our atoms together. Inside our build_model routine, we put:

for my $bond ( $mol->bonds ) {
    my ( $from, $to ) = $bond->atoms;
    my @from = $from->coords->array;
    my @to   = $to->coords->array;
    push @ballsticks, [ \@from, \@to ];

Each bond consists of a pair of atoms, so we just take note of the coordinates of each atom in the pair so we can draw a line between them. Once we have this data, we can add the following to the visualize routine:

glColor( @{ $color{'grey'} } );
for (@ballsticks) { glBegin(GL_LINES); glVertex(@$_) for @$_; glEnd; }

And that's all—for each stick we start a GL_LINES object, pass in the coordinates of our two vertices, and then end it. Now we get to look at molecule diagrams like Figure 1.

Perlmol's Progress

There's a lot more than can be done with this project, which I've called "perlmol" (as a nod to Python's "pymol," a similar visualization tool). These future extensions come in two categories: things to do with biology and things to do with OpenGL.

For instance, a simple biology change could be to alter the way atoms are colored; another way molecular biologists like to visualize their molecules is to color each set of atoms according to the residue to which they belong. We can do this by altering our build_model routine:

my $color = gen_color($atom->{attr}{"pdb/residue_name"});

gen_color is a handy little routine that dishes out consistent colors to each textual "tag" (in this case, the residue name) that comes its way: The first one will be red, the second yellow, the next orange, and so on. It works by maintaining a mapping between tags and colors, and using a sequence to assign a color to a tag that doesn't already have one:

my %ccache;
my @colors = values %color;
my $iter = 0;
sub gen_color {
    my $key = shift;
    $ccache{$key} ||= $colors[$iter++ % $#colors];

An OpenGL change, on the other hand, would be something like allowing you to coat your molecules in different textures, as shown in Figure 2.

One planned change that covers both areas is that biologists don't usually look at the whole swarm of balls-on-sticks for a molecule of many thousands of atoms; instead, they track what's called the "backbone," a particular series of atoms that runs from residue to residue. Normally, this is displayed as a "ribbon" swirling around the molecule. Offering this as an option would make visualizing large molecules much more comfortable.

But you don't have to be a biologist to play with OpenGL—any kind of structure or surface can be modeled, displayed, and then viewed from multiple angles in a variety of textures, materials, or lighting conditions, and the OpenGL::Simple modules make it incredibly easy to do that. Happy visualizing!


Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.