Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Open Source

A DNA Sequence Class in Perl


Jun99: A DNA Sequence Class in Perl

Lincoln develops databases, applications, and user interfaces for the Human Genome Project at Cold Spring Harbor Laboratory in Long Island, NY. His books on web software development include The Official Guide to CGI.pm (John Wiley & Sons, 1998) and Writing Apache Modules in Perl and C (O'Reilly & Associates, 1999).


The Human Genome Project is a multinational project to determine the entire human DNA sequence by the year 2003. Obtaining this information means pushing around massive amounts of information -- estimates quickly run into the terabytes. This in turn requires sophisticated software engineering, fault-tolerant information systems, and rapid application development.

In this article, I describe a Perl library for manipulating DNA and RNA sequences. In the course of examining this library, you'll see how Perl's object-oriented features work together to create an elegant API. And hopefully, you'll learn a little biology as well.

DNA, RNA, and Proteins

The stuff of the genome is deoxyribonucleic acid (DNA), a long thin molecule that is usually compactly coiled into the chromosomes of our cells. DNA consists of four distinct subunits, called "nucleotide bases," which are repeated across its entire length. The four bases have been assigned the convenient single-letter names A, G, C, T -- abbreviations for their longer chemical names.

In DNA, the nucleotide bases are linked together into long chains that can be written down as an ASCII string. Figure 1(a), for example, is a DNA sequence consisting of 39 nucleotides.

DNA doesn't usually float around the cell in its single-stranded form. Instead it spends its time in a stable double-stranded form, the famous "double helix." In the double-stranded form, each nucleotide is paired with another nucleotide. Because of their chemical nature, A always pairs with T, and G pairs with C. Written down in a text representation, the double-stranded form of this short sequence looks like Figure 1(b).

Because the nucleotide bases are paired, they are often referred to as base pairs (bp). I've labeled the left end of the top strand 5' and the right end 3'. On the bottom strand, the numbering of the two ends is reversed. This numbering system is related to the way that DNA is put together chemically. Here, the only significance of this is that it emphasizes that DNA strands are directional. The two strands are often arbitrarily labeled the "plus" and "minus" strands to distinguish them.

DNA can do just two things: It can replicate, and it can be transcribed into RNA. The replication process is the key to both cell replication and to propagation of the species. The two strands of DNA unwind like a zipper, and each strand dictates the assembly of its complementary second strand. Schematically, the process looks like Figure 1(c).

More interesting is the transcription and translation process. Along its length, DNA encodes the instructions for many thousands of proteins, everything from the crystalline protein of the eye lens to the enzymes that make up the digestive juices of the gastrointestinal tract. These protein coding regions, separated from each other by large tracts of DNA of unknown function, are in fact genes.

To make a protein from the DNA sequence of a gene, the cell performs two phases of chemical transformation. In the first phase, the gene is transcribed into ribonucleic acid (RNA). RNA is like DNA in many ways, but instead of being double-stranded it usually exists in single-stranded form. In addition, instead of being composed of the four bases A, G, C, and T, RNA has no T, but uses a different nucleotide abbreviated as U.

To transcribe RNA, DNA unwinds just a bit in the region of an activated gene, and the nucleotide sequence of the DNA is read off by enzymes that synthesize an RNA copy of the gene. Sometimes the plus DNA strand is transcribed, and sometimes the minus strand, depending on whether the gene is oriented right-to-left or left-to-right.

Represented in text form, an RNA strand looks just like its parent DNA strand except for the substitution of U for every T. Figure 2(a) depicts our example DNA in RNA form.

Unlike DNA (which never leaves the nucleus of the cell), RNA is free to travel through the nuclear envelope into the cellular cytoplasm. Once there, the RNA is translated into a protein. Like RNA and DNA, proteins are also long strands of repeating units. However, instead of there being only four units, proteins are made up of 21 different "amino acid" subunits. Proteins fold into complex structures dictated by the order of their amino acids. The folding determines the protein's structure and function.

Like the nucleotide bases, biologists use one-letter abbreviations to refer to the amino acids as well. Protein sequences use the letters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. Because there just aren't enough letters in the Latin alphabet to go around, the protein alphabet overlaps with the nucleotide alphabet, but don't let that confuse you. An A found in a nucleic acid sequence has nothing to do with the A of a protein sequence.

Because only four RNA bases must dictate the order of 20 amino acids, there is obviously more to protein translation than the simple one-to-one encoding that takes place during transcription. In fact, the protein translation machinery uses a three-letter code to translate RNA into protein. During translation, the RNA is divided into groups of three-letter "codons," as in Figure 2(b).

The codons are used as a template to synthesize a protein sequence, using a little lookup table that's hardwired into the biological machinery. AUG becomes the amino acid M, UUC becomes F, CGA becomes R, and so forth. Our example DNA is translated into a 12 amino acid protein in Figure 2(c).

There are two things to notice in this example. One is that certain amino acids are encoded by several different codons. For example, the amino acid K is encoded by both AAA and AAG. This should be expected from the fact that there are 64 possible codons, and only 20 amino acids for them to encode. The other thing to notice is that certain codons (three in all) don't encode any amino acids. Instead they are "stop codons," which tell the translation machinery to stop translating and release the finished protein. Generally, the RNA molecule extends farther to the left and right than the protein it encodes (I've glossed over this fact for simplicity of illustration). Like the stop codons, the AUG codon is special because it tells the protein translation machinery with which codon to begin.

A Sequence Class Library for Perl

A lot of Genome informatics involves splicing, dicing, and processing long strings of DNA sequences. I created a library of Perl routines specialized for dealing with DNA (available electronically; see "Resource Center," page 5), with a small class hierarchy like Figure 3.

Sequence::Generic is an abstract class that implements a few generic methods that all biological sequences share, such as a method for determining the sequence's length and a method for concatenating two sequences together. Sequence::Nucleotide is a subclass of Sequence::Generic that adds support for DNA- and RNA-specific operations. One of these new operations is the reverse complementation method, which transforms one strand of DNA into its complement; another is a method to translate RNA into protein.

Sequence::Nucleotide::Subsequence is a descendent of Sequence::Nucleotide. Because the chunks of DNA that need to be analyzed are usually quite long (100,000 bp is not unusual), it's typical to work with one subregion at a time. A Subsequence represents a subregion of a longer sequence.

The Sequence::Alignment class is a utility class that stores information about how two similar sequences are related. It is useful for figuring out how a smaller sequence fits into a larger one.

For completeness, there should also be a Sequence::Protein class descended from Sequence::Generic, but that was too much to squeeze into this article. Instead of returning a real Sequence::Protein object, the method that translates RNA into proteins just returns a simple character string.

The Sequence::Generic Class

Sequence::Generic (Listing One) defines three methods that are intended to be overridden by child classes: new(), seq(), and type(). The new() method is the object constructor. It does nothing but call the croak() function from the Carp package to abort the program with an error message. This prevents the generic class from being instantiated. The seq() method is a low-level routine that returns the raw sequence information as a text string. This method also croaks in case. Sequence ::Generic is subclassed without the seq() method being overridden. The type() method returns a human-readable string describing the type of the sequence, and is intended for debugging work. It's intended to return something like DNA, RNA, or "Protein." In the abstract class, this method returns "Generic Sequence."

The remainder of the methods are generic ones that will work with almost any biological sequence. One of these is length(), which returns the length of the sequence data; see Example 1. By convention, Perl methods are invoked with a reference to the object as the first argument on the subroutine argument list. This method begins with the idiom my $self = shift. The effect of this statement is to shift the object off the argument list and to copy it into a local variable named $self. The methods then invoke our object's seq() method with the Perl method-invocation syntax $self->seq and pass it to the Perl string-length function length() (this is a normal function call, not a method call). The result is then returned to the caller.

Another method defined in this file, concatenate() (see Example 2), concatenates two Sequence::Generic objects together or concatenates a Sequence::Generic object with a string, returning a new sequence object as the result.

In addition to its object reference, the method takes two arguments. The first is the new sequence to concatenate to the current one. The second argument is a flag that indicates whether the new sequence is to be prepended (true) or appended (false). concatenate() is usually called via operator overloading, and the Perl overload machinery actually takes care of setting up the two arguments.

The method first checks whether the new sequence is an object by calling the Perl built-in ref(), which returns the class name for objects, and the undefined value for nonobjects. If ref() indicates that the new sequence is an object, concatenate() checks whether it is a subclass of Sequence::Generic by using the built-in isa() method. The __PACKAGE__ token is replaced by the Perl run time with the name of the current package, and avoids having to hardcode the name of the class. If the object is not a subclass, the routine aborts with an error message. Otherwise, it recovers the sequence as a string by calling its seq() method. If the $new_seq argument isn't an object at all, the method treats it as a string.

The last statement of this method uses the Perl built-in concatenation operator "." to combine the sequence strings together in the order dictated by the $prepend flag. The concatenated string is passed to the object's new() constructor to create a new Sequence object, which is returned to the caller. Because concatenate() will be called from a subclass of Sequence::Generic, the new() constructor that gets called will belong to the subclass, not to Sequence::Generic. In Perl there is no strong distinction between constructors and object methods, which may be a source of confusion for C++ and Java programmers.

Perl lets you overload many of its built-in operators so that when they are applied to objects they invoke a method call rather than take their default actions. I overload three different operators in the Sequence::Generic class (Example 3). For example, by binding the "." operator to concatenate(), each of the constructions in Example 4 will work in the natural way.

The Sequence::Nucleotide Class

Sequence::Nucleotide (Listing Two) is a dual-purpose class that represents both DNA and RNA. Because DNA can be transformed into RNA and vice versa simply by exchanging Ts and Us, I store the data as DNA and transform it into an RNA form on demand.

This module begins by loading the other modules it depends on, including Sequence::Generic, Sequence:: Nucleotide::Subsequence, Sequence::Alignment, and Carp.

One difference between Sequence::Nucleotide and Sequence::Generic is its use of the @ISA global. The @ISA array contains a list of all the classes that the current one inherits from. Unlike Java, Perl's object system allows for multiple inheritance, although this feature is rarely needed. In this case, @ISA is a one-element list containing the name of the superclass, Sequence::Generic.

The next line in Listing Two defines a private package variable named %CODON _TABLE. It is a Perl hash table (associative array) that maps the 64 RNA codons to the 20 amino-acid protein alphabet.

The first method this class defines is the new() constructor, defined in Example 5. new() creates a hash array that contains the key's data and type. The data key will point to the raw nucleotide string data, canonicalized into upper-case DNA form. The "type" key is either DNA or RNA, indicating whether the sequence is to be displayed in DNA or RNA form.

The new() method can be called in several different contexts. It can be called "de novo" as a class constructor with a plain string argument to be interpreted as sequence data, as in Example 6(a). Alternatively, the argument might be another Sequence object (either a Sequence::Nucleotide or another subclass of Sequence::Generic), in which case the constructor should return a clone of the original sequence, as in Example 6(b). A final context in which the constructor might be called is one in which new() is used as an object method call rather than as a constructor. In this case, we want to return a new object of the same subclass as the object, as in Example 6(c). Finally, the new() method takes an optional second argument that can be used to force the sequence type. If the second argument is omitted, the method will guess whether the sequence is RNA or DNA by looking for the presence of "U-base pairs," as in Example 6(d).

When a method is called in the class constructor style (as in the first two examples above), the first argument passed to the function is a string containing the class name. When a method is called as an object method (as in the third example), the argument is a reference to the object. The first thing this method does is to recover the class name from the object reference in the event that the argument is a reference rather than an ordinary string. This ensures that both class constructor and method call styles work properly.

The method recovers the other arguments from the subroutine argument list, storing them in local variables $sequence and $type. It also initializes an empty hash reference using the anonymous hash constructor {}, and uses the bless operator to associate this reference with the current class, turning it into an object reference.

new() then examines the $sequence argument. If it is an object reference, the method determines whether the object implements the seq() method by invoking the built-in can() method. The decision to use can() here rather than isa(), as I did in Sequence::Generic, was an arbitrary choice, motivated only by the desire to show something new. If the object doesn't implement seq(), it isn't likely to be a Sequence, so it croak()s with an error message. Otherwise, it clones the object with the simple expedient operation of copying its entire hash table.

If the $sequence argument is not an object reference, the method treats it as a string. new() does a quick check to see if it looks like a nucleic acid by matching it against a regular expression containing the characters GATC and U. I also allow whitespace to match (character class \s) and the N character, commonly used in experimental data to indicate an unknown or ambiguous base. If the sequence passes this test, new() passes it to a private routine named _canonicalize() to fix case, to strip out whitespace, and convert the sequence into DNA form, if necessary. The canonicalized sequence is then stored in the data field of our object's hash reference. new() also sets the type field to contain either the value provided by the caller, or, if not provided, to a guess based on the nucleotide composition of the provided sequence. new() returns the new object as the function result.

The translate() method takes the current sequence and translates it into protein sequence. It often happens with novel DNA sequences that you don't know in advance where the gene or genes actually begin. The part of the gene that encodes the protein may start at any of three offsets along the strand, and may be read either from the plus or the minus strand, giving a total of six possible "reading frames" that the protein may be read from.

The translate() method accepts an optional frame number argument, which can be any of the integers 1, 2, 3, or -1, -2, -3, and returns the protein translation for that reading frame. If no argument is provided, the routine returns the translation beginning from the first nucleotide in the sequence, reading frame +1. After a bit of adjustment to trim the sequence to an even multiple of three, the core of the translation routine is:

$s=~s/(\S{3})/$CODON_TABLE{$1} || 'X'/eg;

This is a Perl global pattern match and substitution operation. It finds codons by identifying groups of exactly three nonwhitespace characters \S{3} and replaces them with the amino acid value looked up in %CODON_TABLE. If, for some reason, the codon isn't present in the table (perhaps because of an ambiguous "N" in the sequence), an X is used for the corresponding amino acid residue. The translated sequence is then returned as the function result.

Conclusion

Perl has met the needs of the Genome Project admirably so far and will probably continue to do so for years to come. In this article, I've tried to give you a taste of how Perl can reach beyond its "quick and dirty" heritage to build a set of object- oriented classes. These classes can, in turn, serve as the foundation for large and complex software projects.

If you are interested in learning more about the use of Perl in biology, check out the Bioperl Project at http://bio .perl.org/. This cooperative project is creating an extensive class library of biologically important objects. Here you'll find full-featured cousins of the simple nucleotide sequence classes presented here, as well as Perl classes for proteins, genes, genetic maps, phylogenetic trees, and 3D protein structures.

DDJ

Listing One

package Sequence::Generic;
# File: Sequence/Generic.pm

use strict;
use Carp;
use overload 
  '""'        => 'asString',
  'neg'       => 'reverse',
  '.'         => 'concatenate',
  'fallback'  => 'TRUE';

# These methods should be overriden by child classes
# class constructor
sub new {
    my $class = shift;
    croak "$class must override the new() method";
}
# Return the sequence as a string
sub seq {
    my $self = shift;
    croak ref($self)," must override the seq() method";
}
# Return the type of the sequence as a human readable string
sub type {
    return 'Generic Sequence';
}
# These methods probably don't have to be overridden
# The length of the sequence
sub length {
    my $self = shift;
    return length($self->seq);
}
# The reverse of the sequence
sub reverse {
    my $self = shift;
    my $reversed = reverse $self->seq;
    return $reversed;
}
# A human-readable description of the object
sub asString {
  my $self = shift;
  return $self->type . '(' . $self->length . ' residues)';
}
# Concatenate two sequences together and return the result

sub concatenate {
  my $self = shift;
  my ($new_seq,$prepend) = @_;
  my ($to_append);
  if (ref($new_seq)) {
      croak "argument to concatenate must be a string or a Sequence object"
      unless $new_seq->isa(__PACKAGE__);
      $to_append = $new_seq->seq ;
  } else {
      $to_append = $new_seq;
  }
  return $self->new($prepend ? $to_append . $self->seq 
                     : $self->seq . $to_append);
}
1;

Back to Article

Listing Two

package Sequence::Nucleotide;
# file: Sequence/Nucleotide.pm

use Sequence::Generic;
use Sequence::Nucleotide::Subsequence;
use Sequence::Alignment;
use Carp;

use strict;
use vars '@ISA';
:Generic';

my %CODON_TABLE = (
           UCA => 'S',UCG => 'S',UCC => 'S',UCU => 'S',
           UUU => 'F',UUC => 'F',UUA => 'L',UUG => 'L',
           UAU => 'Y',UAC => 'Y',UAA => '*',UAG => '*',
           UGU => 'C',UGC => 'C',UGA => '*',UGG => 'W',
           CUA => 'L',CUG => 'L',CUC => 'L',CUU => 'L',
           CCA => 'P',CCG => 'P',CCC => 'P',CCU => 'P',
           CAU => 'H',CAC => 'H',CAA => 'Q',CAG => 'Q',
           CGA => 'R',CGG => 'R',CGC => 'R',CGU => 'R',
           AUU => 'I',AUC => 'I',AUA => 'I',AUG => 'M',
           ACA => 'T',ACG => 'T',ACC => 'T',ACU => 'T',
           AAU => 'N',AAC => 'N',AAA => 'K',AAG => 'K',
           AGU => 'S',AGC => 'S',AGA => 'R',AGG => 'R',
           GUA => 'V',GUG => 'V',GUC => 'V',GUU => 'V',
           GCA => 'A',GCG => 'A',GCC => 'A',GCU => 'A',
           GAU => 'D',GAC => 'D',GAA => 'E',GAG => 'E',
GGA => 'G',GGG => 'G',GGC => 'G',GGU => 'G',
          );
*complement = *reversec = \&reverse;

sub new {
  my $class = shift;
  $class = ref($class) if ref($class);
  my ($sequence,$type) = @_;

  my $self = bless {},$class;
  if (ref($sequence)) {
    croak "Can't initialize sequence from non-Sequence object.\n"
      unless $sequence->can('seq');
    %{$self} = %{$sequence};  # clone operation
  } else {
    croak "Doesn't look like sequence data" 
      unless $sequence=~/^[gactnu\s]+$/i;
    $self->{'data'} = $self->_canonicalize($sequence);
    $self->{'type'} = $type || ($sequence=~/u/i ? 'RNA' : 'DNA');
  }
  return $self;
}
sub seq {
    my $self = shift;
    $self->{'data'} = $self->_canonicalize($_[0])  if defined($_[0]);
    my $seq = $self->{'data'};
    return $seq unless $self->is_RNA;
    $seq=~tr/T/U/;
    return $seq;
}
sub type {
    my $self = shift;
    return defined($_[0]) ? $self->{'type'} = $_[0] : $self->{'type'};
}
sub is_DNA {
    my $self = shift;
    return $self->type eq 'DNA';
}
sub is_RNA {
  my $self = shift;
  return $self->type eq 'RNA';
}
sub subseq {
  my $self = shift;
  my ($start,$end) = @_;
  return (__PACKAGE__ . '::Subsequence')->new($self,$start,$end);
}
sub reverse {
  my $self = shift;
  return (__PACKAGE__ . '::Subsequence')->new($self,$self->length,1);
}
sub translate {
  my $self = shift;
  my $frame = shift() || 1;
  my $l = $self->length;
  my $seq = $frame > 0 ? $self->subseq($frame,$l-($l-$frame+1)%3)
              : $self->reverse->subseq(abs($frame),$l-($l-abs($frame)+1)%3);
  my $s = $seq->seq;
  $s=~tr/T/U/;  # put it in RNA mode
  $s =~ s/(\S{3})/$CODON_TABLE{$1} || 'X'/eg;
  return $s;
}
sub longest_orf {
    my $self = shift;

    my ($max,$pos,$frame);
    foreach (-3..-1,1..3) {
    my $translation = $self->translate($_);
    while ($translation=~/([^*]+)/g) {
        if (length($1) > length($max)) {
        $max = $1;
        $frame = $_;
        $pos = pos($translation) - length($max); 
        }
    }
    }
    $pos *= 3;
    $pos += abs($frame);
    return ($pos,$pos+3*length($max)-1) if $frame > 0;
    return ($self->length-$pos,$self->length-$pos-3*length($max));
}
sub align {
    my $self = shift;
    my $seq = shift;
    $seq = $seq->seq if ref($seq);
    return new Sequence::Alignment(src=>$seq,target=>$self->seq);
}
sub _canonicalize {
  my $self = shift;
  my $seq = shift;
  $seq =~ tr/uU/tT/;
  $seq =~ s/[^gatcn]//ig;
  return uc($seq);
}
1;



Back to Article


Copyright © 1999, Dr. Dobb's Journal

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.