I work with biologists.
August 21, 2008
I work with biologists.
Biologists don't quite get computer science. They may be super with
statistics, but they have no idea how to write programs.
(I don't mean they can't write them, I mean that you can't read
them. You can't extend them. You can't fix them.)
Let us take the example of protocols. We need a way to
associate sequences of DNA with the critters that they came
from. Nothing fancy here, just "This sequence came from Nancy's*
cancer." And a sequence is just a list of nucleotides: ACGTTTGAC,
etc. I spend all day with these things.
The usual transfer file format is called "FASTA" and looks like this:
Where the unique identifiers are the "G1404PF1.T0".
Now by itself, G1404PF1.T0 doesn't tell you it came from Nancy's
tumor. There's a little table somewhere that says "G1404PF1.T0 =>
Nancy". No biggie.
But what they often do is encode extra information in the actual
characters of the identifiers. For example, G1404 indicates that this
is from the 1404th patient with Lyme disease. The P means a partial
sample, the F means it's a "forward" read, the 1 is a simple counter
for sequences that come from the same sample, and T0 is a
constant. (They intended it to mean something, but then decided it wasn't
important. They didn't want to change the format to eliminate it.)
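To make that concrete, here is a rough sketch in Python of what "reading" one of these identifiers amounts to. The pattern is inferred from the single example G1404PF1.T0, so treat it as an assumption: I don't know what other letters can appear in the sample or direction slots, and the names decode_id and DecodedId are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Pattern guessed from the one example "G1404PF1.T0":
#   G<patient number><sample letter><direction letter><counter>.T0
# The real rules may well be looser or stricter than this.
ID_PATTERN = re.compile(r"^G(\d+)([A-Z])([A-Z])(\d+)\.T0$")


class DecodedId(NamedTuple):
    patient: int         # 1404 -> the 1404th patient
    sample_code: str     # "P" means a partial sample
    direction_code: str  # "F" means a forward read
    counter: int         # simple counter within one sample


def decode_id(identifier: str) -> Optional[DecodedId]:
    """Return the decoded fields, or None if the identifier doesn't fit."""
    m = ID_PATTERN.match(identifier)
    if m is None:
        return None
    return DecodedId(int(m.group(1)), m.group(2), m.group(3), int(m.group(4)))
```

Anything that doesn't match comes back as None, which is about the best you can do when the encoding only lives in people's heads.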
Clever, eh? Now they don't have to consult a table to know those
things about a sequence. For more detailed information, it's off
to the table, and that's OK. It's just nice to be able to eyeball a
sequence and know what you're looking at.
No problem so far.
Of course, there's that entire header line in the file...
We could put more information there. There are lots of times when we
want to pass specific, ad-hoc bits to a program. Instead of writing a
new file for that, we can just stick it in the header!
Now we see things like this:
>G1404PF2.T0 Parent:G1404PF1.T0, Start:14556 End:14667 Range:23,55
It's so useful!
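Pulling those bits back out of a header like the one above might look something like this. It's only a sketch: it assumes every extra field is written as Key:Value with no spaces inside the value, which is all the one example tells us. Note that the example mixes a comma separator (after Parent) with plain spaces, and that the Range value has a comma of its own, so the parser has to be forgiving. The name parse_header is mine.

```python
import re


def parse_header(line):
    """Split a FASTA header into (identifier, {field: value}).

    Assumes ad-hoc fields look like Key:Value and that values
    contain no whitespace. Values are left as strings.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # The identifier is everything up to the first space.
    ident, _, rest = line[1:].strip().partition(" ")
    # Drop a trailing comma used as a separator, but keep commas
    # inside a value (as in Range:23,55).
    fields = {key: value.rstrip(",")
              for key, value in re.findall(r"(\w+):(\S+)", rest)}
    return ident, fields


ident, fields = parse_header(
    ">G1404PF2.T0 Parent:G1404PF1.T0, Start:14556 End:14667 Range:23,55"
)
```

Here ident is "G1404PF2.T0" and fields maps Parent, Start, End and Range to their string values.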
Then someone says "Hey! These other guys have some very interesting
information! Let's get their FASTA files!"
And they encode their identifiers differently. As does the next
institute and the next and the next...
Pretty soon you're not sure if two encodings overlap or not. So you
have to attach the encoding protocol (we use the institute's name) to
the FASTA file in order to read it correctly!
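In code, "attach the encoding protocol" turns into a lookup table of decoders keyed by institute name, roughly like the sketch below. Both institute names and the second identifier format are invented for illustration; only the leading G plus patient number comes from the example earlier in this post.

```python
import re

# Maps an institute name to the function that understands its identifiers.
DECODERS = {}


def decoder(institute):
    """Decorator that registers a decoding function under an institute name."""
    def register(func):
        DECODERS[institute] = func
        return func
    return register


@decoder("our_lab")  # hypothetical name
def decode_ours(identifier):
    m = re.match(r"^G(\d+)", identifier)
    return {"patient": int(m.group(1))} if m else None


@decoder("other_institute")  # hypothetical name and format: "<patient>-<sample>"
def decode_theirs(identifier):
    patient, _, sample = identifier.partition("-")
    if not patient.isdigit():
        return None
    return {"patient": int(patient), "sample": sample}


def decode(institute, identifier):
    """Decode an identifier using the protocol of the named institute."""
    if institute not in DECODERS:
        raise ValueError(f"no decoder registered for {institute!r}")
    return DECODERS[institute](identifier)
```

Every new source of files means one more entry in that table, and the institute name has to travel with the file forever.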
Oh, and what if you'd like to print out reports that contain the
identifiers? So you can say things like "Sequence G1404PF4.T0
was unusual". Well, it's a pain to have table entries that might be
10 characters, or might be 50.
So, of course, what I want is a single format that everyone uses. I
want a universally unique identifier (123456 is just fine) that will
be the sole identifier used. You can stick whatever ad-hoc information
on the end of the line you want. I get to ignore it. Just give me one
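For what it's worth, here is roughly how little code the reader needs under that scheme. It's a sketch, assuming the identifier is the first whitespace-delimited token on the header line and everything after it is free-form. The function name iter_records is made up.

```python
def iter_records(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines.

    Only the first token of each header is kept; whatever ad-hoc
    information follows it on the line is ignored.
    """
    ident, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes off the previous record, if any.
            if ident is not None:
                yield ident, "".join(chunks)
            tokens = line[1:].split(None, 1)
            ident = tokens[0] if tokens else ""
            chunks = []
        elif ident is not None:
            # Sequence data may be wrapped across several lines.
            chunks.append(line)
    if ident is not None:
        yield ident, "".join(chunks)
```

Usage is just a loop over an open file: `for ident, seq in iter_records(open(path)): ...`, with no per-institute knowledge anywhere.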
So... What should we do?
What would you do?
* So there's this woman, Nancy or something. She died in like, 1950
from cancer. They wanted to experiment with the tumor, so they kept
it alive. It's still going. Healthy and vibrant as a tumor could wish