Bil Lewis

Dr. Dobb's Bloggers

Never Produce Human-Readable Output

December 08, 2009

The worst thing you can do is to output human-readable files or, even
worse, write your results to the terminal. Unfortunately, this is
precisely what most naive bio-informatics programs do.

Why is that so bad?

First, some background...

It is quite common for people in bio-informatics to write all sorts of
different tools that poke, prod, analyze, and manipulate genetic data
in all sorts of forms, using all sorts of computer languages (Java, C,
C++, Python, and Perl being the main ones I see). Because each of
these tools is (often) used in a stand-alone mode, it is necessary to
design some sort of persistence mechanism to record the results of the
computation. The most common mechanism is some sort of simple text
file.

Let's take a simple example. A great many tools write out simple
sequence data. The Fasta file, the closest thing to a standard that
exists, is the usual format. Many other tools write alignment
information, and some sort of Blast-like output file is common for them.

Now, writing files in ad-hoc formats is a fine thing to do when you're
the only one in the game. When your program has to start working with
other programs, the rules change. Now consistency and robustness
become the primary concerns. And this is where so many programs fall
down. They start with a loose format that gives them approximately
what they want. Someone else says "Gee, that sure is a useful
format. If only it had..." And they change the way they use the format
subtly and things go downhill from there.

I think there are two primary culprits in this story: The insular view
("It's my format. I'll change it when I want.") and the human-readable
view ("The output file must make sense to the person reading it.")
Both are harmful.

People do not read output files, computers do.

I'll guess that I actually look at 0.1% of all Fasta files I ever
write (and I wrote a Fasta reader!). Most people NEVER look at their
Fasta files directly. And this is fine. There's no reason they
should. That's what viewer programs are for.

The same goes for all the other files these programs write. They are
almost never seen by human eyes. So all these clever alignment strings
they carefully construct go unseen, unused. Instead, what happens is
that lots of people like me write parsers for them and, as often as
not, the clever text-based, human-readable alignments actually make
our lives harder. Which means that there'll be more bugs in our
programs and we won't use your program as much as we would otherwise.

If you want your program to be used, write your output file for the
computer to read.

If you *really* want your program to be used, supply parsers for it in
all the major languages.

How do you design a format that is easily machine readable? Easy, just
write out the data objects you're using in a simple, uniform
format. XML comes to mind as a good possibility. It's a bit verbose,
but it's a cinch to parse and there are gobs of tools that manipulate
XML. If you're concerned about space, compact the thing.

(An example of a compact binary format for sequence data is the BAM
format file, which restricts sequences to contain only A, C, G, or
T. It is about 6x smaller than a Fasta file. How much is 6x worth, when
compared to the complexity of the code to read it?)
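
To make the XML suggestion a little more concrete, here is a minimal
sketch that writes sequence records with the standard Java StAX writer.
The class name SequenceXmlWriter and the element and attribute names
("sequences", "sequence", "id") are my own assumptions for illustration,
not part of any established format.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: element and attribute names are made up for this sketch.
final class SequenceXmlWriter {

    // Writes each ID -> sequence pair as one <sequence id="..."> element.
    static String toXml(Map<String, String> sequences) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("sequences");
            for (Map.Entry<String, String> entry : sequences.entrySet()) {
                xml.writeStartElement("sequence");
                xml.writeAttribute("id", entry.getKey());
                xml.writeCharacters(entry.getValue());
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException ex) {
            // Writing to an in-memory buffer should not fail; rethrow unchecked.
            throw new IllegalStateException(ex);
        }
        return out.toString();
    }
}
```

The writer takes care of escaping, so any XML parser in any language can
read the result back without a hand-written parser.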

Perhaps you have very special data that is highly compressible and
that you want to be able to manipulate in special fashions. For
example, you might want a FastaReader that can find a named sequence
in a large file. A linear search would be out of the question in a
large file, so you might want the file itself sorted, or you may wish
to create an auxiliary file of the names in sorted order.
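
As a rough sketch of the auxiliary-index idea, the code below scans a
Fasta file once, records the byte offset of each header line in a sorted
map, and then seeks straight to a named record. The class FastaIndex and
its method names are hypothetical, and saving the index to its own file
is left out to keep the example short.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.TreeMap;

// Hypothetical helper, not taken from any existing Fasta library.
final class FastaIndex {

    // Builds a sorted map: sequence name -> byte offset of its ">" line.
    // RandomAccessFile.readLine is unbuffered and byte-oriented, which is
    // acceptable for a sketch over ASCII data but slow on huge files.
    static TreeMap<String, Long> build(Path fasta) {
        TreeMap<String, Long> index = new TreeMap<>();
        try (RandomAccessFile file = new RandomAccessFile(fasta.toFile(), "r")) {
            long offset = file.getFilePointer();
            String line;
            while ((line = file.readLine()) != null) {
                if (line.startsWith(">")) {
                    index.put(line.substring(1).trim(), offset);
                }
                offset = file.getFilePointer();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return index;
    }

    // Seeks to the named record and returns its sequence, or null if absent.
    static String fetch(Path fasta, TreeMap<String, Long> index, String name) {
        Long offset = index.get(name);
        if (offset == null) {
            return null;
        }
        try (RandomAccessFile file = new RandomAccessFile(fasta.toFile(), "r")) {
            file.seek(offset);
            file.readLine(); // skip the header line itself
            StringBuilder sequence = new StringBuilder();
            String line;
            while ((line = file.readLine()) != null && !line.startsWith(">")) {
                sequence.append(line.trim());
            }
            return sequence.toString();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

The lookup costs one map access and one seek, regardless of how large
the data file is.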

So you may decide that XML isn't the thing.

If that's the case, then you simply need to come up with a format that
works and is simple to parse. The Fasta format is a good example of a
human-readable output file that's also easy to parse--it consists of
an ID preceded by ">" and a bunch of nucleotides (or amino acids), each
represented by a single letter:

>Sequence1
ACGTGGTTAACACACACATAC...
TGGTTTTTAACACACACATAC...
>Sequence2
ACGTGGTTAACACACACATAC...
...
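
For the strict form shown here, a reader needs very little code. The
following is only a sketch under my own assumptions: the class name
SimpleFastaReader is hypothetical, the whole header line after ">" is
treated as the ID, and any lines before the first header are ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any existing library.
final class SimpleFastaReader {

    // Reads all records into an insertion-ordered map of ID -> sequence.
    // Multi-line sequences are joined into a single string.
    static Map<String, String> read(Reader in) {
        Map<String, String> records = new LinkedHashMap<>();
        String id = null;
        StringBuilder sequence = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (id != null) {
                        records.put(id, sequence.toString());
                    }
                    id = line.substring(1).trim();
                    sequence.setLength(0);
                } else if (id != null) {
                    sequence.append(line.trim());
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        if (id != null) {
            records.put(id, sequence.toString());
        }
        return records;
    }
}
```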

It's a cinch to parse and if everyone agreed on the details, it would
be a cinch to understand too--and this is where human-readable formats
often go wrong. So some folks looked at the Fasta format and said
"Gee, it sure would be nice to know where those introns were... I
know! Let's write the introns in lower case!"

>Sequence1
ACGTggttaacacacACATAC...

And suddenly we're faced with the question of interpretation. Is "g"
the same as "G"? When should we distinguish between them? Can we read
the file in all upper case if we're not interested in the intron
structure? Do we have to write stuff like this:

if (c == 'G' || c == 'g')...

everywhere? 

Other folks decided to add metadata to the "header line" to indicate
a plethora of different things.

>Sequence1 start:756 end:9999 intron[1]: 1010-1200
ACGTGGTTAACACACACATAC...

And now the reader has to decide what is really the identifier and
what to do with stuff that isn't part of the identifier. 
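
One common convention, which I am assuming here rather than quoting from
any specification, is to treat everything up to the first whitespace as
the identifier and pass the rest along untouched. A minimal sketch, with
the hypothetical class name FastaHeader:

```java
// Hypothetical helper illustrating the "ID ends at first whitespace" rule.
final class FastaHeader {
    final String id;
    final String description;

    private FastaHeader(String id, String description) {
        this.id = id;
        this.description = description;
    }

    // Splits a ">" header line into the identifier and the uninterpreted rest.
    static FastaHeader parse(String headerLine) {
        if (!headerLine.startsWith(">")) {
            throw new IllegalArgumentException("Not a Fasta header: " + headerLine);
        }
        String body = headerLine.substring(1).trim();
        String[] parts = body.split("\\s+", 2);
        String description = parts.length > 1 ? parts[1] : "";
        return new FastaHeader(parts[0], description);
    }
}
```

For the header above this yields the ID "Sequence1" and leaves
"start:756 end:9999 intron[1]: 1010-1200" as an opaque string. Note that
the reader still has no idea what that string means, which is exactly
the problem.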

What happens is that your file format, which started out being quite
useful and universally understandable, becomes a mishmash of competing
"improvements" that render your format impotent. This is precisely
what has happened to the Fasta format, the GFF (1, 2, and 3) formats,
and the aforementioned BAM format.

Don't let this happen to you!

Define your format narrowly and provide a verifier.
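
To give a feel for what a verifier might look like, here is a sketch
for a deliberately narrow Fasta dialect. The rules are my own
assumptions for illustration: a header is ">" plus a bare identifier
with no metadata, sequence lines contain only upper-case A, C, G, and T,
and every record has at least one sequence line. The class name
StrictFastaVerifier is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical verifier for an assumed, narrowly defined Fasta dialect.
final class StrictFastaVerifier {

    // Returns one message per violation; an empty list means the input is valid.
    static List<String> verify(List<String> lines) {
        List<String> errors = new ArrayList<>();
        boolean seenHeader = false;
        boolean seenSequence = true; // no record is pending before the first header
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            int lineNumber = i + 1;
            if (line.startsWith(">")) {
                if (!seenSequence) {
                    errors.add("line " + lineNumber + ": previous record has no sequence");
                }
                if (!line.matches(">[A-Za-z0-9_.-]+")) {
                    errors.add("line " + lineNumber + ": header must be '>' plus a bare identifier");
                }
                seenHeader = true;
                seenSequence = false;
            } else if (!seenHeader) {
                errors.add("line " + lineNumber + ": sequence data before first header");
            } else {
                seenSequence = true;
                if (!line.matches("[ACGT]+")) {
                    errors.add("line " + lineNumber + ": only upper-case A, C, G, T allowed");
                }
            }
        }
        if (!seenSequence) {
            errors.add("end of input: last record has no sequence");
        }
        return errors;
    }
}
```

With rules like these, both the lower-case intron trick and the
metadata-laden header line are rejected up front instead of being
silently reinterpreted by every downstream reader.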

Better yet, provide readers and writers for all the languages.

Your objective is to make it possible for anyone, anywhere, in any
(major) language, to be able to run your program on their data and
process the results. (No, not just "possible", you want it to be
easy!) I should be able to write something like this:

BlastRunner runner = new BlastRunner(parameters);
Iterator<BlastAlignment> it = runner.runBlast(refSequences, qrySequences);

while (it.hasNext()) {
  process(it.next());
}

class BilsBlastRunner extends BlastRunner {

  // Write my Sequence objects to files, then hand those files to your runner.
  public Iterator<BlastAlignment> runBlast(List<Sequence> refSeqs, List<Sequence> qrySeqs) {
    return runBlast(writeToFile(refSeqs), writeToFile(qrySeqs));
  }

  public File writeToFile(List<Sequence> qrySequences) {...}
}

where you have written both the BlastRunner and BlastAlignment
classes, so all I have to do is supply parameters (such as the
location of the blast executable) and a subclass that will take my
Sequence objects and write them to files for your program to use.

In a more perfect world, we would be able to dispense with using files
as the means of communication between your program and mine, but our
current world isn't that perfect. So we use files.

The implication of this is that your tool should accept one or more
files of data as input, a few simple command-line options, and perhaps
an options file, should the parameters be more complex than a simple
number or short string. So your command line should look something
like this:

% yourGreatTool  foo.fasta   bar.fasta  baz.output

or if some small options are required:

% yourGreatTool -Q GOOD  -MAX 15  foo.fasta  bar.fasta  baz.output

or if a pile of options are required:

% yourGreatTool -Q GOOD  -MAX 15  -O options.txt  foo.fasta  bar.fasta baz.output

where options.txt can contain anything you want in a human-writable
format. If you want some binary information, then add a binary file,
but don't call it an options file. (There's no firm rule as to what is
"data" and what is an "option", but generally "option" means it's
simple data that you expect a human to type in by hand. "Data" implies
it's machine-produced and not human-readable.)
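
A command line of this shape is easy to handle without any library. The
sketch below assumes, purely for illustration, that every option takes
exactly one value and that all options come before the file names; the
class name ToolArguments is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for "-NAME value ... file file file" command lines.
final class ToolArguments {
    final Map<String, String> options = new LinkedHashMap<>();
    final List<String> files = new ArrayList<>();

    static ToolArguments parse(String[] args) {
        ToolArguments parsed = new ToolArguments();
        int i = 0;
        // Leading arguments that start with "-" are option/value pairs.
        while (i < args.length && args[i].startsWith("-")) {
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Option " + args[i] + " needs a value");
            }
            parsed.options.put(args[i].substring(1), args[i + 1]);
            i += 2;
        }
        // Everything that remains is an input or output file name.
        while (i < args.length) {
            parsed.files.add(args[i]);
            i++;
        }
        return parsed;
    }
}
```

Anything more elaborate than this probably belongs in the options file
rather than on the command line.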

"BUT I WANT TO OUTPUT HUMAN-READABLE DATA!"

Fine.

That's what viewer programs are for.

The point here is that the tool itself only does the essential
calculations and writes non-human-readable files. Your viewer should
read those files and produce whatever pretty pictures you want. (If
it's just text, fine.)  You do not want to integrate your tool and
your viewer into a single program because it makes it harder to keep
your tool clean and efficient and it makes it easier to accidentally
diverge your binary writer and your human-readable writer.

Blast does this. It has an option that tells it to output in XML format
or in human-readable format. But do the two output formats REALLY say
exactly the same thing? You do not ever want to have to ask this
question.

If you have a single output format, then you're sure of what you've
got. (What do they say? "A man with a watch always knows what time it
is. A man with two watches is never sure.")

If you want to write a script that first runs your tool and then calls
up a viewer on the data file just produced, fine!

You also want to make your input files distinct, as opposed to making
assumptions about their names. Fasta, for example, expects quality
data to be stored in a parallel file, but with a ".qual"
extension. (Thus, foo.fasta and foo.qual would be a pair.) Blast does
something similar. It expects you to pass it the base name of a set of
pre-computed "database" files.

% blast -d blastDb ...

where it expects to see three files: blastDb.nin, blastDb.nhr, and
blastDb.nsq. This can be a real pain in the ***, because many
sophisticated systems copy files around a network and it's a lot
easier to say 

copyFile("blastDb.nin", remoteMachine.createTempFile("blastDb.nin"));
copyFile("blastDb.nhr", remoteMachine.createTempFile("blastDb.nhr"));
copyFile("blastDb.nsq", remoteMachine.createTempFile("blastDb.nsq"));

(producing /tmp/blastDb.nin_123, etc.), than to say:

copyFilesWithThesePrefixesWhileRetainingBaseNameIdentity(remoteMachine,
  "blastDb", new String[] {"nin", "nhr", "nsq"});

and expect this method to create a unique remote name
("blastDb_123.nin") and ensure that the other names are also available
with "_123".

By contrast, this is easy to do:

% blast /tmp/blastDb.nin_123 /tmp/blastDb.nsq_345 /tmp/blastDb.nhr_567

If you wish to create your own directory and stuff a bunch of files
with known names into it, fine. That works too. Just don't put your
users in the situation where they have to mess around with exact file
names.

I've written a variety of wrappers for a number of tools and these are
the biggest issues I've run into when trying to integrate them into
our system.

If you make it easy for me, I will use your tool happily and love you
for it. If you make it hard, I'll bitch about it as I work, and may
even just drop your tool and find a replacement.

If you want your tool to be used, make it easy.

-Bil
