
Bil Lewis

Dr. Dobb's Bloggers

Fasta Frustrations

December 26, 2008

Last week at work, a problem with our Fasta file reader came up. It
seems that the reader was impressively slow for large files (> 100k
entries) and didn't work at all for very large files (> 1m entries). I
mentioned in the meeting that I had implemented a semi-functional
reader that was much faster and could probably do good things for the
"official" one. That's when the fun started...

A Fasta file is an encoding of DNA information. Typically our DNA
sequencers spit out DNA sequences of 100-800 base pairs, which we then
assemble into gobs of "contiguous regions" (contigs). We often write
these things out as Fasta files so we can share with our friends. It's
a pretty simple format:

>G124FA12.T0
ACGACCTTGGTACCCGTTGAACGACCTTGGTACCCGTTGAAC 
CTTGGTACCCGTTGAACGACCTTGGTACCCGTTGAACGACCT 
AAAAACGACCTTGGTACCCGTTGAACGACCTTGGTACCCGTT 
>G124FA12.T1
AGGCTTGGTACCCGTTGAACGACCTTGGTACCCGTTGAACGA 

Whatever follows a leading ">" is the "header" (effectively the ID)
and the following lines are strings of characters representing either
nucleotides (ACGT as above) or amino acids (ACDEFG... there are twenty
of 'em). Simple, eh?
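
To make the format concrete, here's a minimal sketch of a parser for
it, assuming the whole file fits in memory (the class and method names
are mine, not our reader's):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // A header line starts with '>'; every line until the next '>'
    // belongs to that sequence.
    class FastaRecord {
        final String header;     // the text after the '>'
        final String sequence;   // the concatenated sequence lines
        FastaRecord(String header, String sequence) {
            this.header = header;
            this.sequence = sequence;
        }
    }

    class SimpleFastaParser {
        static List<FastaRecord> parse(String filepath) throws IOException {
            List<FastaRecord> records = new ArrayList<FastaRecord>();
            BufferedReader r = new BufferedReader(new FileReader(filepath));
            String header = null;
            StringBuilder seq = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (header != null)
                        records.add(new FastaRecord(header, seq.toString()));
                    header = line.substring(1);
                    seq.setLength(0);
                } else {
                    seq.append(line.trim());
                }
            }
            if (header != null)
                records.add(new FastaRecord(header, seq.toString()));
            r.close();
            return records;
        }
    }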

Our existing FastaReader would take a file and provide an iterator for
it and optionally allow the programmer to request specific sequences
by header. To support large files (100k sequences, any length), it
would scan the file and write out a sorted index file, which it would
then search on request. It was optimized to allow prefix searches on
the headers. Because it sorted the headers in core, it couldn't handle
more than about 1m sequences.

This was not a major limitation in 2007.

With the new Illumina sequencing machines, we will be working with
upwards of 10m very short sequences (36 BPs). The old reader can't do
it.

It was up to me.

A quick review of the existing indexing code scared the bejesus out of
me. I did not look forward to working with it. The reader consisted of
a single 1300-line source file containing several 200-line methods,
all lacking in clarity. It wrote out a complex index file in a poorly
documented binary format.

So I started by timing it and comparing it to my semi-reader. Mine was
6x faster for iteration and 10x faster for random access. And for
small files, gazillions of times faster. (The old reader would go to
disk no matter how small the file was, whereas I just kept files under
10 megs in core.) And massively simpler.

The main interface is simple:

class FastaReader {

  public FastaReader(String filepath);
  public List<FastaSequence> getSequences(String headerPrefix);
  public FastaSequence getSequence(String header);
  public Iterator<FastaSequence> iterator();

}

And its operations are obvious. You create a reader for a specific
file. You can then iterate through all the sequences in order, or you
could retrieve a single sequence or a set that matched a prefix
string. In practice, you either iterate through the file, or you send
the file to an outside program that will write some kind of analysis
file, which you will then read and look up the individual headers that
appear in it (e.g., we'll "blast" our Fasta file against a database of
known sequences, looking for similarities).
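
For instance, typical use looks something like this (the FastaSequence
accessors here are my guesses at the obvious ones, not the official
API):

    FastaReader reader = new FastaReader("/data/contigs.fasta");

    // Walk the whole file in order.
    Iterator<FastaSequence> it = reader.iterator();
    while (it.hasNext()) {
        FastaSequence seq = it.next();
        System.out.println(seq.getHeader());
    }

    // Look up headers that an outside analysis (e.g., blast) reported.
    FastaSequence hit = reader.getSequence("G124FA12.T0");
    List<FastaSequence> family = reader.getSequences("G124FA12");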

I ignored the existing code.

I also questioned the existing assumptions.

The headers are generally meaningless symbols. It is meaningless to
search for prefixes in meaningless symbols. Do we need it? Does it
ever get used?

The answer to both was "no" and things got simpler.

Being a lazy bum, I wrote out an unsorted index file by scanning the
Fasta file, recording header and position. I then called the Unix sort
routine on the file and I had a sorted file. I tested it up to 10m
entries. Here's the file:

#### v1.0                 |
>G124FA12.T0      123456  |
>G124FA13.T0      124456  |
>G124FA14.T0      120456  |

I made it a human-readable text file so debugging and maintaining it
would be easier. As sorting large files is I/O-limited, I wasn't
worried about spending a little extra CPU time parsing the file.
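
In code, the whole indexing step is roughly this sketch (assuming a
RandomAccessFile so we know each header's byte offset, and shelling
out to sort(1) via ProcessBuilder; the names are mine):

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.io.RandomAccessFile;

    class IndexBuilder {
        // One pass over the Fasta file: record the byte offset of each
        // header line, then hand the unsorted index to Unix sort.
        static void writeIndex(String fastaPath, String indexPath)
                throws IOException, InterruptedException {
            RandomAccessFile in = new RandomAccessFile(fastaPath, "r");
            PrintWriter out = new PrintWriter(new FileWriter(indexPath));
            long pos = in.getFilePointer();  // offset of the next line
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">"))
                    out.println(line + "    " + pos);
                pos = in.getFilePointer();
            }
            in.close();
            out.close();

            // The lazy-bum step: let sort(1) rewrite the file in place.
            new ProcessBuilder("sort", "-o", indexPath, indexPath)
                    .start().waitFor();
        }
    }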

I kept a sorted list in core of every 1000th entry with its position
in the index file. An arbitrary search required 3-4 disk reads and if
sequential requests happened to be close, the disk blocks would often
be in memory already. It ran about 10x faster and was able to handle very
large files.
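
The lookup side is roughly this sketch (names are mine; floorEntry()
is the Java 6 TreeMap call that finds the nearest preceding sample,
and the line parsing assumes the simple space-free headers we started
with):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Map;
    import java.util.TreeMap;

    class SampledIndex {
        // Every 1000th header, mapped to its byte offset in the
        // *index* file (not the Fasta file).
        private final TreeMap<String, Long> samples = new TreeMap<String, Long>();
        private final RandomAccessFile indexFile;

        SampledIndex(RandomAccessFile indexFile) { this.indexFile = indexFile; }

        void addSample(String header, long offsetInIndex) {
            samples.put(header, offsetInIndex);
        }

        // Seek to the nearest preceding sample, then scan at most
        // ~1000 index lines forward. Returns the Fasta-file position,
        // or null if the header isn't present.
        Long lookup(String header) throws IOException {
            Map.Entry<String, Long> floor = samples.floorEntry(header);
            indexFile.seek(floor == null ? 0L : floor.getValue());
            String line;
            while ((line = indexFile.readLine()) != null) {
                int sep = line.lastIndexOf(' ');
                String h = line.substring(1, sep).trim();  // drop the '>'
                if (h.equals(header))
                    return Long.parseLong(line.substring(sep + 1).trim());
                if (h.compareTo(header) > 0)
                    break;   // we've sorted past it; not there
            }
            return null;
        }
    }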

Then the fun began.

"What can we say about the header? How long can it be? What characters
are allowed?"

"No limitations. Any characters at all. Even if it's not officially
sanctioned by the gods of Fasta, some of the labs we work with are
used to putting special things in the headers."

Oh, goody.

(I was for pushing back at our fellow labs. "Insist they put things in
that make sense. They will be happier if they don't try to get
tricky." We agreed that I was right, but that it didn't change
things. I lost.)

We decided that I could reject files that contained duplicate headers,
but that was it. Such abominations as leading/trailing spaces were to
be retained. (A person will look at a header and see "A1", not " A1"
and will be very frustrated when nothing comes up! Too bad.) I did get
to reject non-printing characters.

They also insisted I allow arbitrary characters in the DNA strings.
("But if we allow '>' in the sequence, how will we know it's not a
header line?"--"OK, no '>'. But everything else!")

I think you get the general flow of things.

I like to make things very simple and unambiguous. All lines are max
80 characters wide. Only legal nucleotides are allowed in the sequence
strings. Headers are meaningless strings of alpha-numerics with a few
other characters. No spaces! No more than 20 characters in the
header. All nucleotides are uppercase. It would be nice.

'Course everyone has a reason why their files just have to be different.

Lab 1 uses lowercase letters in the nucleotide sequence to indicate
that a nucleotide is a known "SNP" (one that is often different in
individuals, determining hair color or causing cancer). Lab 2 uses it
to mean that the particular nucleotide had a low confidence level. Now
they decide to share files...

But they do it anyway. And they make mistakes. If only they would
listen to me!

And I have some new problems.

My simple little index file cannot assume that the header ends at the
first space. And there is no unused printing character that I could
use as a terminator. I could place the file position immediately
after the end of the header:

#### v1.0                 |
>G124FA12.T0 13456        |
>G124FA3.T0 124456        |
>G124FA146.T11 01  156    |
 
[How in the heck can we have a blog tool for a technical computer magazine and not have an option for putting text in a fixed-width font??!@#$ All lines in the above file are the same width.]
 
But that complicates the sorting. Now there's no easy way of telling
the sort where the header to be sorted ends. When a header was a
non-whitespace symbol, I could just hand it right to the Unix sort,
because spaces sort before other characters.

At the same time, it was suggested that using Unix sort wouldn't work
too well on a Windows machine. Could I write a file sort routine, please?

OK. I Google "large file merge sort"... Nothing usable?! That should
be a fine search string. I try a few variations to no avail. There
must be a merge sort I can steal...

That evening I wander out to a bar and down some suds. They're showing
"The Wizard of Oz" on the TV screens. So I sit there, reciting the
lines with Dorothy while pondering recursive file sorts.

You know... I'm sticking the headers into a sorted tree map anyway. I
could just read in a million, write 'em out, then keep going. I should
be able to get 100 file descriptors. I could do 100m entries without
doing any recursion. I could just... Yeah...

(This is what you think about in bars, isn't it?)

It was so simple to implement! I read the Fasta file, putting the
header and file position into a sorted tree map. I wrote it out to
temp files at 1m entries per. (I estimated 200 bytes per entry. I can
expect to have 200m available.)

I then opened custom readers for all the temp files, putting them into
a sorted tree map (love them sorted maps!), keyed on the reader's next
header. I then removed the first entry in the map and wrote it out to
a final temp file. I advanced the reader to the next entry and put it
back into the map. Repeat. I didn't have to write a single
comparison! Woo-hoo!

I did decide to define an escape character for the index file so that
I could keep the visual separation in the file. (Well, that and the
fact that I'd already implemented it.) So the "#" is the escape
character and a "#" in a header is doubled. A "# " indicates the end
of the header. I also dropped the version line. Here it is:

>G124FA12.T0#       134566  |
>G1##24FA3.T0###    124456  |
>G124FA146.T11 01 #    156  |

The three headers are "G124FA12.T0", "G1#24FA3.T0#", and "G124FA146.T11 01 ".
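
The escaping itself is a few lines each way (method names are mine):

    // '#' escapes: a literal '#' in a header is doubled, and an
    // unescaped "# " terminates the header field.
    static String escapeHeader(String header) {
        return header.replace("#", "##") + "# ";
    }

    static String unescapeHeader(String field) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c == '#') {
                if (i + 1 < field.length() && field.charAt(i + 1) == '#') {
                    out.append('#');  // doubled escape -> literal '#'
                    i++;              // skip the second '#'
                } else {
                    break;            // unescaped '#': end of header
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }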

It's about 2x slower than Unix sort for small files, and closer to
even for 10m entries. I don't know why. My version eats a lot more CPU
than the Unix version does, but it's still disk bandwidth limited. 

For current programs' access patterns, we spend 100x more time in
random lookups than in sorting. And that's good enough for me. 

If I can get the full 2g of memory, I can sort 1g entries with 100
file descriptors. We are years away from that many entries in Fasta
files and there's a good chance that by the time we get there, we'll
be doing things completely differently. Plus, this one runs
10x faster and is *way* simpler.

Now to the sort...

Being a lazy !@$!$ I just read in 1m entries and stuck 'em in a sorted
tree map. I wrote each of the 1m out to a file. I don't have to call
sort. I don't have to write a comparison object. I'm happy.

   // hlMap is a field: a TreeMap<String, String> keyed on the extracted
   // header, so iterating its values() yields the lines in sorted order.
   private File writeSortedBlockToTempFile(LineReader reader)
           throws IOException {
       hlMap.clear();

       for (int i = 0; i < MAX_ENTRIES_PER_TEMP_FILE
                               && reader.hasNext(); i++) {
           String line = reader.getNextLine();
           String header = extractPrefixedHeader(line);
           hlMap.put(header, line);
       }
       return writeSortedBlockToTempFile();
   }

   // Write the current block out in sorted order and hand back the file.
   private File writeSortedBlockToTempFile() throws IOException {
       File f = File.createTempFile("sortedFile", ".tmp");
       FileWriter w = new FileWriter(f);
       for (String line : hlMap.values()) {
           w.write(line);
           w.write("\n");
       }
       w.close();
       return f;
   }


(I just checked it out and I can get 10k open file descriptors, so I
can sort upwards of 100g headers.)

The merge portion of the sort is just as lazy. I open a reader for
each file, grab the first header, stuff it in a sorted tree map, and
then write out the first element of the map. I stick the "used" reader
back in with its next header, repeat.

   // hrMap is a TreeMap<String, HeaderReader> keyed on each reader's
   // next header; createFileWriter(), writeLine(), and close() are
   // small helpers that wrap the checked IOExceptions.
   private void mergeFiles() {
       for (File f : files) {
           HeaderReader reader = new HeaderReader(f);
           String header = reader.getHeader();
           hrMap.put(header, reader);
       }

       FileWriter w = createFileWriter(outputpath);

       while (hrMap.size() > 0) {
           String header = hrMap.firstKey();
           HeaderReader reader = hrMap.get(header);
           writeLine(w, reader.getLine());
           hrMap.remove(header);
           header = reader.getNextHeader();
           if (header != null)
               hrMap.put(header, reader);
       }

       close(w);
   }

[TreeMap really ought to have a removeFirstEntry() method. (Java 6's NavigableMap does give it pollFirstEntry(), which would fold the firstKey()/get()/remove() sequence above into one call.)]

And... that's pretty much it. It took an absurdly small amount of time
to write. It runs a little faster than Unix sort on 100m entries and
is about the same as the SmallText sorter. (When I started writing
this, I decided I'd better try a little harder to search the web and
found SmallText. It wouldn't handle my mixed-length, escaped headers,
so I stuck with mine.)  

There is one more detail about common forms of the file that I haven't
mentioned yet. Most of the files with > 1m sequences in them will be
Illumina reads that will have consecutive numeric headers and a small
number of base pairs with a very small standard deviation. In other
words, they look like this:


>1
tAGATCGTAGGGTAGAGCCGATG
>2
AGATCGTAGGGTAGAGCCGATTG
>3
CAGATCGTAGGGTAGAGCCGATG
>4
ATAGATCGTAGGGTAGAGCCGAT
>5
AGATCGTAGGGTAGAGCCGATG


Now we can get seriously cheap. We can either allow the user to
promise us that the sequences will be in order, or we can scan the
file ourselves to verify it. I chose to scan the first few sequences
and assume the headers were numeric based on that. I'd scan the rest,
building an in-core one-in-10000 index as I went, and throw an
exception and restart if I ran into a non-numeric header. This
approach brought me very close to the perfect 1 disk read per request.
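
A sketch of that fast path (the names and the exception type are
mine):

    // For files whose headers are just consecutive numbers, a sparse
    // in-core table (one entry per 10,000 sequences) points straight
    // at the right neighborhood of the Fasta file.
    class NonNumericHeaderException extends RuntimeException { }

    class NumericIndex {
        private static final int STRIDE = 10000;
        private final java.util.ArrayList<Long> offsets =
                new java.util.ArrayList<Long>();

        // Called for each sequence, in order, during the initial scan.
        void noteSequence(int ordinal, String header, long fileOffset) {
            if (!header.matches("\\d+"))
                throw new NonNumericHeaderException(); // fall back to the
                                                       // general reader
            if (ordinal % STRIDE == 0)                 // ordinal is 0-based
                offsets.add(fileOffset);
        }

        // Byte offset of the block containing sequence number n (1-based).
        long regionFor(long n) {
            return offsets.get((int) ((n - 1) / STRIDE));
        }
    }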

It was about then that someone said to me "You know, Bil. Some of our
collaborators edit their fasta files in MS Word. So you'll get files
with some lines terminated in the Unix fashion by a New Line character
(aka Line Feed) and others terminated in the Win32 style, by a
Carriage Return and a Line Feed (CRLF)."

Next: The Curse of the CRLF