Bil Lewis

Dr. Dobb's Bloggers

I work with biologists.

August 21, 2008

I work with biologists.

Biologists don't quite get computer science. They may be super with
statistics, but they have no idea how to write programs.

(I don't mean they can't write them, I mean that you can't read
them. You can't extend them. You can't fix them.)

Let us take the example of protocols. We need a way to be able to
associate sequences of DNA with the critters that they came
from. Nothing fancy here, just "This sequence came from Nancy's*
cancer." And a sequence is just a list of nucleotides: ACGTTTGAC,
etc. I spend all day with these things.

The usual transfer file format is called "FASTA" and looks like this:


Where the unique identifiers are the "G1404PF1.T0".
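For anyone who hasn't met FASTA before, here is a minimal sketch of a reader in Python. It assumes the common convention that a record starts with a ">" header line, that the identifier is the first whitespace-delimited token on that line, and that the sequence may be wrapped over several lines. The sample records are made up for illustration (only the identifier style comes from this post), and `read_fasta` is my own name, not anything from a library.

```python
import io


def read_fasta(handle):
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier = None
    description = ""
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# Hypothetical sample data: the sequences are invented.
SAMPLE = """>G1404PF1.T0
ACGTTTGAC
GGCATTA
>G1404PF2.T0 Parent:G1404PF1.T0
TTGACCA
"""

records = list(read_fasta(io.StringIO(SAMPLE)))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```

Note that the reader hands back whatever follows the identifier as an opaque string. It does not try to interpret it.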

Now by itself, G1404PF1.T0 doesn't tell you it came from Nancy's
tumor. There's a little table somewhere that says "G1404PF1.T0 =>
Nancy". No biggie.

But what they often do is encode extra information in the actual
characters of the identifiers. For example, G1404 indicates that this
is from the 1404th patient with Lyme disease. The P means a partial
sample, the F means it's a "forward" read, the 1 is a simple counter
for sequences that come from the same sample, and T0 is a
constant. (They intended it to mean something, but then decided it wasn't
important. They didn't want to change the format to eliminate it.)
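To make the encoding concrete, here is a rough Python sketch that pulls those fields back out of an identifier. The field layout is the one described above. What I'm assuming (it isn't stated anywhere) is that the P and F slots always hold exactly one uppercase letter and that any other letter simply means "not partial" or "not forward". The names `SampleId` and `decode_identifier` are hypothetical.

```python
import re
from typing import NamedTuple


class SampleId(NamedTuple):
    patient_number: int  # the 1404 in G1404
    partial: bool        # True when the sample flag is "P"
    forward: bool        # True when the read flag is "F"
    counter: int         # counter for sequences from the same sample


# Assumed layout: "G", digits, one letter, one letter, digits, constant ".T0".
ID_PATTERN = re.compile(r"^G(\d+)([A-Z])([A-Z])(\d+)\.T0$")


def decode_identifier(identifier):
    """Decode an identifier such as 'G1404PF1.T0' into its fields."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    patient, sample_flag, read_flag, counter = match.groups()
    return SampleId(
        patient_number=int(patient),
        partial=(sample_flag == "P"),
        forward=(read_flag == "F"),
        counter=int(counter),
    )


decoded = decode_identifier("G1404PF1.T0")
print(decoded)
```

The regular expression is the whole protocol here, which is exactly why this only works as long as everybody agrees on it.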

Clever, eh? Now they don't have to consult a table to know those
things about a sequence. For more detailed information, it's off
to the table, and that's OK. It's just nice to be able to eyeball a
sequence and know what you're looking at.

No problem so far.

Of course, there's that entire header line in the file...

We could put more information there. There's lots of times when we
want to pass specific, ad-hoc bits to a program. Instead of writing a
new file for that, we can just stick it in the header!

Now we see things like this:

>G1404PF2.T0 Parent:G1404PF1.T0, Start:14556 End:14667 Range:23,55

It's so useful!
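Reading such a header back is where the fun starts. Below is a hedged Python sketch that splits the line above into an identifier and a dictionary of fields. It assumes every field looks like Key:Value with no spaces inside the value, and it treats a trailing comma as a separator rather than as data (note that the separators in the example are inconsistent, and that the Range value itself contains a comma). `parse_header` is a hypothetical helper, not part of any FASTA library.

```python
import re

# Assumed field shape: a word, a colon, then a run of non-space characters.
FIELD_PATTERN = re.compile(r"(\w+):(\S+)")


def parse_header(header_line):
    """Split a FASTA header into (identifier, {key: value})."""
    text = header_line.strip().lstrip(">")
    parts = text.split(None, 1)
    identifier = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    # Strip a trailing comma so "Parent:X," and "Parent:X" read the same.
    fields = {key: value.rstrip(",") for key, value in FIELD_PATTERN.findall(rest)}
    return identifier, fields


header = ">G1404PF2.T0 Parent:G1404PF1.T0, Start:14556 End:14667 Range:23,55"
identifier, fields = parse_header(header)
print(identifier)
print(fields)
```

Everything comes back as a string. Whether Range is a pair of integers or something else entirely is knowledge that lives in somebody's head, not in the file.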

Then someone says "Hey! These other guys have some very interesting
information! Let's get their fasta files!"

And they encode their identifiers differently. As does the next
institute and the next and the next...

Pretty soon you're not sure if two encodings overlap or not. So you
have to attach the encoding protocol (we use the institute's name) to
the fasta file in order to read it correctly!
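In code, "attach the encoding protocol" tends to end up as a lookup from institute name to decoder. Here is a minimal sketch under loud assumptions: the institute names are placeholders, the second identifier format ("S-" followed by digits) is entirely invented to stand in for somebody else's scheme, and `DECODERS` and `decode` are hypothetical names.

```python
import re


def decode_institute_a(identifier):
    """Decoder for the G1404PF1.T0 style of identifier."""
    match = re.match(r"^G(\d+)([A-Z])([A-Z])(\d+)\.T0$", identifier)
    if match is None:
        raise ValueError(f"not an institute_a identifier: {identifier!r}")
    return {"patient": int(match.group(1)), "counter": int(match.group(4))}


def decode_institute_b(identifier):
    """Decoder for an invented 'S-000123' style, for illustration only."""
    match = re.match(r"^S-(\d+)$", identifier)
    if match is None:
        raise ValueError(f"not an institute_b identifier: {identifier!r}")
    return {"patient": int(match.group(1))}


# The "protocol" is just the key used to pick a decoder.
DECODERS = {
    "institute_a": decode_institute_a,
    "institute_b": decode_institute_b,
}


def decode(institute, identifier):
    """Decode an identifier using the protocol named by its institute."""
    try:
        decoder = DECODERS[institute]
    except KeyError:
        raise ValueError(f"no decoder registered for {institute!r}") from None
    return decoder(identifier)


print(decode("institute_a", "G1404PF1.T0"))
print(decode("institute_b", "S-000123"))
```

Every new source of files means a new entry in that table, and an identifier without its institute name can't be decoded at all.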

Oh, and if you'd like to print out reports that contain the
identifiers on them? So you can say things like "Sequence G1404PF4.T0
was unusual". Well, it's a pain to have table entries that might be
10 characters, or might be 50.

So, of course, what I want is a single format that everyone uses. I
want a universally unique identifier (123456 is just fine) that will
be the sole identifier used. You can stick whatever ad-hoc information
on the end of the line you want. I get to ignore it. Just give me one

Ain't happenin.

So... What should we do?

What would you do?




* So there's this woman, Nancy or something. She died in like, 1950
  from cancer. They wanted to experiment with the tumor, so they kept
  it alive. It's still going. Healthy and vibrant as a tumor could wish
  to be.
