Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Web Development

Perl and Chemistry


June, 2004: Perl and Chemistry

Ivan is a graduate student at the Yale University Chemistry department and can be contacted at [email protected].


A typical day in the life of an aspiring computational chemist: I had to find a few thousand structures in the Cambridge Structural Database and figure out the statistical distribution of certain bond lengths. The software for accessing the database was available in the library and it had a feature to do exactly what I wanted. Unfortunately, it did not work as advertised. In the end, I concluded that I could export the structures as text files, which could then be opened with any molecular visualization program with which I could measure the distances that I wanted interactively. Yeah, right. Like I'm going to open a thousand files one by one and click on a bond to find out the bond length and then copy the number to a spreadsheet. I'm a Perl programmer and Perl programmers are lazy!

Luckily, I had some modules I had started writing for dealing with these types of problems. I initially wrote a 20-line script that solved it, but I still thought that that was too long. So, I continued developing the modules (which collectively I dubbed the PerlMol project) with the goal of being able to solve this kind of problem with one line of code.

This article is about how Perl can be useful for dealing with chemical information. I hope that even nonchemists may find it interesting, as it illustrates some problems of general interest in computer science and in the practice of designing and implementing Perl modules.

Chemistry 101

The idea of structure is one of the most fundamental concepts in chemistry: This is the idea that the properties of matter are determined by the arrangement of atoms in space. Throughout the history of modern chemistry, people have developed various ways of representing chemical structures, from the familiar structure diagram to mechanical models and computer-generated images. It goes without saying that computers are used every day in the chemical sciences. Therefore, one of the central issues in computational chemistry and chemical informatics has to be the representation of chemical structures in the computer. Depending on the application, one or both of the following approaches is typically used for representing molecules:

  • Molecular graphs. A graph is a set of nodes and arcs, where the arcs connect the nodes. In a chemical graph, the nodes are the atoms and the arcs are the bonds. Molecular graphs are a subset of the more general mathematical concept of a graph; for example, the arcs do not have direction and both the arcs and the nodes are "colored." The mental model that organic chemists have of bonds, valence, rings, and functional groups is mostly based on this idea of a molecule as a graph.
  • Tridimensional vectors. The position of each atom in space can be represented by its (x, y, z) coordinates. This approach has the advantage that it allows for a more realistic representation of the molecule and it has all the information required, from a rigorous physical standpoint, for the calculation of the properties of the molecule. However, it is less intuitive because it does not account for bonds and functional groups explicitly.

Common Problems in Computational Chemistry

There are already countless programs that are used for doing chemistry in the computer. These programs can:

  • Draw structural diagrams.
  • Calculate the energy and other properties of molecules with a basis in some physical theory, such as quantum mechanics or molecular mechanics.
  • Manage databases of chemical information.
  • Visualize big molecules such as proteins and DNA.
  • Simulate the movement of atoms and molecules in solution.

(See http://dmoz.org/Science/Chemistry/Software/ for a more detailed categorization.)

One problem with many of these programs is that almost every one has its own file format, which makes interoperation hard. Many programs know how to read the output of some other programs, but never all. There have been open-source projects that deal specifically with the problem of file conversion, most notably OpenBabel (http://openbabel.sourceforge.net/).

Another problem is that, while there are many nice programs with glossy GUIs that allow you to visualize or modify molecules using the mouse, when you have to deal with a large number of molecules, the options are not as diverse. What is needed is a very flexible batch tool or a programming toolkit focused on dealing with molecules. Of course, there are already a few of these toolkits around. Some of these are proprietary, such as Daylight (http://www.daylight.com/), OEChem (http://www.eyesopen.com/ products/toolkits/oechem.html), and ChemAxon (http://www .chemaxon.com/). Some are open source, such as the Chemistry Development Kit (http://cdk.sourceforge.net/), OpenBabel, and JOELib (http://www-ra.informatik.uni-tuebingen.de/software/joelib/ index.html). The most popular languages for these toolkits have been C++ and Java (OEChem also has a Python interface to the underlying C++ library).

I was surprised that CPAN was sorely lacking in terms of modules for chemistry. The only available modules were Chemistry::Element, which allows you to convert between atomic number, element symbol, and element name and store other elemental information; and Chemistry::MolecularMass, which calculates the mass from the molecular formula. There were no modules that actually dealt with the structure of molecules. While some of the options in other languages are not bad, I was looking for something with the simplicity and conciseness of Perl that could allow me to write "chemical one-liners" to solve small problems very quickly, without having to compile anything. Hence, PerlMol was born.

OK, enough theoretical stuff. On to the Perl.

Designing PerlMol

Molecules seem to scream that they want to be represented as objects. It is very natural to represent the molecule as an object that contains a group of atom objects and a group of bond objects. Atoms have properties such as coordinates, which are conveniently represented by vector objects. (For the vector objects, I used the Math::VectorReal module by Anthony Thyssen.) This leads naturally to the three most important PerlMol classes: Chemistry::Mol, Chemistry::Atom, and Chemistry::Bond. I like to think of them as the trinity of chemical classes. Some other toolkits (usually written in Java) tend to have a proliferation of classes in a complex hierarchy: An Atom is an AtomType, which is an Isotope, which is an Element, which is a ChemObject. Atoms are held by AtomContainers and so on. I decided to keep the flattest possible hierarchy because I do not share the philosophy that everything has to be an object. For example, I think that Perl arrays are more than powerful enough, so that container objects are not really needed (but I did keep a root class, Chemistry::Obj, to keep some common methods needed by the other classes).

One of the design choices was that PerlMol should "know" as little chemistry as possible, at least for the core classes. That is because there are many different—and often incompatible—ways of thinking about molecules and other chemical entities. Different programs and different file formats often assume different things. For example, it is almost always assumed that a bond connects two atoms. However, that is not entirely true because bonds that connect three atoms are known to exist, although they are not common. To be on the flexible side, atom objects have an array of atoms of unspecified length. In a way, the idea of "molecule" that is used in PerlMol should aim to be the "lowest common denominator," so that it can effectively represent different conventions. It is still possible, and indeed a good idea, to have some classes or methods that are more specific (and restrictive).

The most basic operation in PerlMol is constructing a molecule object. As you would expect, this can be done as follows:

my $mol = Chemistry::Mol->new;

That gives you a "blank molecule," to which you can add atoms and bonds with methods such as $mol->add_atom and $mol->add_bond. However, most of the time you do not want to create a molecule from scratch, but to read it from a file:

my $mol = Chemistry::Mol->read("myfile.mol");

For this to work, you need to have a module that can read that file format. The format can be specified as an option to the read method, or it can try to guess it automagically from the contents of the file or from the filename extension (the .mol extension in the example is used for files with the MDL molfile format). All the file I/O modules inherit a common interface from the Chemistry::File class and they can be used explicitly or they can all be loaded automatically (in which case, you can think of them as plug-ins).

Now you probably want to coerce some information out of the molecule, which you can easily do by calling other Chemistry::Mol methods, most notably $mol->atoms, and then the methods on the atom objects themselves. Some examples:

print $mol->name;
my $a1 = $mol->atoms(1);  # get the first atom
print $a1->coords;        # outputs something like [0.000, 1.679, -1.373]
if ($a1->symbol eq "Pb") { 
    $a1->symbol("Au")     # Transmute lead into gold
}

# select all the carbon atoms in the molecule
my @carbons = grep { $_->symbol eq "C" } $mol->atoms;

# find carbon atoms within 2.5 Angstroms
my @close_carbons = grep { $a1->distance($_) < 2.5 } @carbons;
$mol->delete_atom(@close_carbons);   # get rid of those atoms!

These samples should give you an idea about how Perl lists and functions, such as grep, give you a lot of flexibility for selecting atoms and then playing with them. Some visualization programs give you ways of specifying atoms from a command line by using a specialized minilanguage, but here you can use the full power of Perl. There are many other methods available in the PerlMol toolkit; if you are interested, please refer to PerlMol's accompanying POD.

One of the practical problems that I found when writing PerlMol was the use of circular references. An atom object has a reference to a bond object and vice versa. Since perl uses a simple reference count for deciding which objects are no longer used and should be deleted, circular references cause memory leaks. There are several ways of dealing with this problem; I chose to use weak references by means of the Scalar::Util module by Graham Barr (there is an article on this topic by Randal Schwartz in the July 2003 issue of TPJ).

Pattern Matching

Everybody knows that one of the most powerful features of Perl is its use of regular expressions. You can describe the pattern that you want to match and then find its occurrences in a string. The same idea can be applied to molecules; for example, you may need to find a sulfur atom that is bound to a carbon that is bound to another carbon that is bound to an oxygen. You could use Perl's grep to find the first atom, then look at its neighbors, and then "walk" the molecule one atom at a time until you find a match. That is painful. What's needed is a general pattern-matching engine; for that, I wrote Chemistry::Pattern. A pattern is a subclass of molecule, so you can create it in a similar way:

my $mol  = Chemistry::Mol->read("bigmol.mol");
my $patt = Chemistry::Pattern->read("smallpatt.mol");

# find matches of the pattern in the molecule
while ($patt->match($mol)) {
    print "Atom 1 in the pattern matches atom ", $patt->atoms(1)->map_to,
          "in the molecule!\n";
}

What I show above is a substructure match, which is akin to a substring match. A regular expression is more powerful than that because you can define character classes and use special operators. The Chemistry::Pattern engine actually has some of that flexibility, but you need to use a special language (or file type) to specify the pattern in a concise manner. The most popular language for that purpose is called SMARTS. A simple SMARTS example would be "[F,Cl]*[!C]", which means "either chlorine or fluorine next to any atom (*) next to anything but carbon." It is possible to specify very complex patterns using this language. (For a detailed description of the SMARTS language, see http://www.daylight .com/dayhtml/doc/theory/theory.smarts.html.) At the time of this writing, the Chemistry::File::SMARTS module is still unfinished (it only implements a subset of the language), but you can use the example above as follows:

my $pattern = Chemistry::Pattern->parse("[F,Cl]*[!C]", format => "smarts");

A Molecular grep

One of the most common tasks in chemical informatics is to find molecules matching a pattern. A program to do this has to loop over all the files and match them against the pattern. You can write a simple "molecular grep" as follows:

1      use Chemistry::Mol;
2      use Chemistry::Pattern;
3      use Chemistry::File ':auto'; # use all available file formats
4      use strict;
5      use warnings;

6      if (@ARGV < 2) {
7          die "Usage: molgrep <smarts> <fname> ...\n";
8      }

9      my $smarts = shift;
10      my $patt = Chemistry::Pattern->parse($smarts, format => "smarts");

11      for my $fname (@ARGV) {
12          # note: files may contain more than one molecule
13          my @mols = Chemistry::Mol->read($fname);
14          for my $mol (@mols) {
15              while ($patt->match($mol)) {
16                  print "File $fname matches!\n";
17              }
18          }
19      }

To find molecules matching the pattern just discussed, you could type:

molgrep '[F,Cl]*[!C]' *.mol

Even More Laziness: Mok

This molgrep is all well and good, but what if you need to extract more information from the molecule? That was the problem that I mentioned at the beginning of this article. You can just edit molgrep and add whatever you need in the inner loop (line 16). That sounds like needless repetition and lots of cut-and-paste whenever you want something different. Wouldn't it be nice if the loops, file reading, and the like were implicit and we only had to write line 16? That would be similar to the behavior of awk, or perl -n. What is needed, then, is a "molecular AWK." I decided to shorten it to Mok, which also happens to remind me of Ookla the Mok, a character from the animated series Thundarr the Barbarian. (If you're curious about that, see http://dmoz.org/Arts/Animation/ Cartoons/Titles/T/Thundarr_the_Barbarian/.)

I will not include all the Mok code here, but you can find it on CPAN. The essence of it, as you can imagine, is an eval. Here, we can harness the ability of Perl to compile user-supplied code dynamically.

A traditional AWK program consists of a series of patterns with associated "action blocks." For example, if you want to say "hello" whenever you see someone, you would code the following:

/someone/ { print "hello" }

For the Mok language, I borrowed AWK's pattern-action format, but the code inside the block is Perl code, which is evaled. The patterns are expressed as SMARTS strings. You can rewrite the molecular grep example from the previous section as a Mok one-liner:

mok '/[F,Cl]*[!C]/ { print "$file matches!\n" }' *.mol

All of the required Chemistry::* modules are used implicitly. The block is executed for each molecule matching the pattern and it has several predefined variables available: $MOL is the current molecule, $FILE the current filename, @A are the atoms matched by the pattern, @B are the bonds, and so on. Now if you want to do something other than printing the filename of the matching molecule, you only have to change the inner block. To print out the S=O bond lengths in a group of molecules, you just write:

mok '/S=O/g { printf "%s: %.3f\n", $MOL->name, $B[0]->length }' *.mol

where the g flag after the pattern means "global" (i.e., give all the matches for each molecule).

Notice how we went from hours of interactive point-and-click to writing a 20-line script to something you can type at the command-line in 30 seconds!

Conclusion

In this article, I have shown how Perl can be used for dealing with some of the typical problems in chemical informatics. I mentioned some of the design and programming issues that are applicable to writing almost any object-oriented module, such as using plug-in modules for I/O and ways of dealing with circular references. I showed how creating a specialized minilanguage such as Mok can turn routine 20-line scripts into one-liners. All of this was inspired by Perl's philosophy that "easy things should be easy, hard things should be possible." I have seen other toolkits that make the hard things possible, but at the price of making the easy things complicated or verbose.

If you are interested in the PerlMol project, please visit http://www.perlmol.org/. Contributions are welcome, either with coding and testing, documentation, or by making a donation.

TPJ


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.