October 08, 2009
Truth changes over time. What was true for our parents is doubted by us and ridiculed by our children. And the data that supports those truths also changes. In genetics research, truth changes rapidly, because the data that we computer scientists provide is changing even more rapidly.
In my parents' generation, it was a well-known fact that race determined intelligence and the idea of a black president was absurd. My generation doubted this fact and elected a (half) black man as president*. The generation growing up now finds racism idiotic and will elect presidents of variegated color schemes, ignoring sex, cultural background, parental marriage status, etc.
OK, that's a bit of a metaphor, but I think largely accurate.
In my little world of As, Cs, Gs, and Ts, the data that is changing is
the precise analysis of those nucleotides and the conclusions we draw
from them. So the "truth" of yesterday regarding the close
relationship between Hippos and Pigs has been completely overturned
by genetic analysis. Today's truth is that the Hippo's closest living
relative is the whale. And they're both closer to the ruminants
(cows, deer, etc.) than to the Pigs.
In one little area, my work has something to say about this, and that
is the focus of this article.
I write programs to analyze DNA.
Somewhere, somebody obtains a DNA sample somehow. That sample is
processed and run through an elaborate set of machines that ultimately
turn out a file with a bunch of sequences and probability scores for
the accuracy of each nucleotide ("Base Pair"). The classic "Fasta" file:

>read_0001
ACGTTGCAGGA...

and companion "quality score" file:

20, 21, 23, 19, 26, 29, 32, 31, 35...
Depending upon the machine, prep, etc., these sequences ("reads") may
be as short as 40 BPs, or as long as 1,000. (The machines that turn
out short "reads" are about 100x cheaper to run per BP, than the
machines that turn out the longer reads.)
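Those quality scores are Phred-style: a score Q corresponds to an error probability of 10^(-Q/10), so a score of 20 means a 1-in-100 chance the base call is wrong. A minimal sketch of working with them (the function names here are my own, not from any particular tool):

```python
# Phred quality scores encode per-base error probability:
# Q = -10 * log10(P), so P = 10 ** (-Q / 10).

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def expected_errors(quals):
    """Expected number of miscalled bases in a read."""
    return sum(phred_to_error_prob(q) for q in quals)

quals = [20, 21, 23, 19, 26, 29, 32, 31, 35]
print(round(phred_to_error_prob(quals[0]), 3))  # -> 0.01 (Q20 = 1-in-100)
print(round(expected_errors(quals), 4))
```

So a read full of Q20 bases carries roughly one expected error per hundred bases, which is exactly the kind of uncertainty my programs have to reason about.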
From my perspective, these Fasta files constitute axiomatic
reality. "We stuck this DNA sample in the front end and got these
reads out the back." You'll notice there is no assumption about the
accuracy of the machine's analysis--that's part of what I reanalyze.
Now, we do a whole pile of interesting things with these reads. One
classic thing to do is to "assemble" the reads into longer "contigs",
ultimately putting together an entire genome. Other people and other
programs then use that genome, comparing it to other genomes or
using it to guide the analysis of other sets of reads. The "real"
biologists use that to guide their work on a million topics. Medical
people use that data to guide their search for new medicines.
And it's here that the data starts to change.
You see, they are taking our work as axiomatic. They are saying "This
is what the genome looks like, therefore..." and they draw a
conclusion.
What if I make a mistake?
What if two genes look so similar that my program thinks there's just
one gene? And based on that, the bio folks conclude a similarity
between mouse and human that doesn't exist, and the medical folks run
off and spend $1,000,000 searching for a drug that doesn't exist?
It'd be a bummer.
It's happened lots of times.
What I want to ask is "What happens when I realize my mistake and fix it?"
In today's world, very little will happen.
As the correct genome is used in analyses more and more, the old
conclusions will be rebutted and eventually the medical folks will
start spending their money on more viable projects. Eventually. It may
take a long time for the new truth to completely replace the old,
because the conclusions from the old truth may have spread to a
thousand different areas, showing up only as vague references in
papers on ostensibly different topics.
Because the data is considered static.
If we looked at the data itself as being dynamic, we could imagine a
very different world.
In my fantasy world, every bit of processed data would be accompanied
by meta data, relating it back to the axiomatic data (can I say that?)
along with its precise path, all fully reproducible.
In other words, if you gave me a collection of sequences, I could look
at the meta data and know everything about it--both the programs used
AND their current status. In particular, I would know if any of the
processing steps have been changed since the data you gave me was
produced. And--with the push of a button--I could rerun those steps
on the original data using the most recent (and presumably better)
versions.
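One way to picture that metadata: every derived file carries a provenance record naming the raw inputs (identified by content hash, not filename) and every program, version, and parameter in the pipeline. Checking for stale steps then becomes mechanical. A hypothetical sketch--the record fields, tool names, and versions are all made up for illustration:

```python
import hashlib

def content_hash(path):
    """Identify a raw data file by its contents, not its name."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# A provenance record for a derived dataset: the raw ("axiomatic") inputs
# plus every processing step, in order, with exact versions and parameters.
provenance = {
    "inputs": [{"file": "sample.fasta", "sha256": "<hash of the raw reads>"}],
    "steps": [
        {"program": "trim_reads", "version": "2.1", "params": {"min_quality": 20}},
        {"program": "assemble", "version": "5.0", "params": {"kmer": 31}},
    ],
}

def stale_steps(provenance, current_versions):
    """Steps whose program has been revised since this data was produced."""
    return [s["program"] for s in provenance["steps"]
            if current_versions.get(s["program"]) != s["version"]]

# If 'assemble' has since been released as 5.1, this data needs a rerun.
print(stale_steps(provenance, {"trim_reads": "2.1", "assemble": "5.1"}))
```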
As nice as it would be to get that "updated" data, it would be even
nicer if we could complete the loop. The loop from data to conclusion
to new experiments and to new data. That loop.
And why should I have to be the one to push the button? Why not let
the people doing the revisions also propagate their changes?
I would like to improve one of my programs, then make a formal
release. Part of that release would be to trigger reanalysis of all
previous runs. Should the results change, those changes would be
propagated to anything else that relied on it--in particular,
scientific papers.
Scientific papers would have to become dynamic, just like the
data. (You might argue that papers are just fancy data.) So when I
release my program, not only will a pile of other calculations be
updated, but also the data in those papers. People looking at the
papers would see an original version of the paper, along with an
updated version, including updated charts, graphs, and tables. And
even updated conclusions.
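The propagation I'm describing is just a walk over a dependency graph: when one node (a program's output) is revised, everything downstream gets flagged for re-derivation. A toy sketch with hypothetical node names:

```python
from collections import defaultdict, deque

# Who depends on what: the genome is built from reads; comparisons,
# drug screens, and paper figures are built on top of the genome.
deps = {
    "genome": ["reads"],
    "mouse_comparison": ["genome"],
    "drug_screen": ["mouse_comparison"],
    "paper_fig3": ["genome"],
}

def downstream(changed):
    """Everything that must be recomputed when `changed` is revised, breadth-first."""
    children = defaultdict(list)
    for node, parents in deps.items():
        for p in parents:
            children[p].append(node)
    order, queue, seen = [], deque([changed]), {changed}
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Revising the genome invalidates the comparison, the drug screen,
# and figure 3 of the paper.
print(downstream("genome"))
```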
It might be a little bit tricky to automatically rewrite a paper's
conclusion based on the new data, but in a very fundamental sense,
that's exactly what we should be able to do. I mean, the whole idea of
a paper is to look at a bunch of data and demonstrate that the data
implies a certain conclusion, right? A sufficiently intelligent program
could certainly do that.
["Sufficently intelligent"--don't you just love that phrase? So,
perhaps we simply alert authors of significant changes so that they
can rethink things. And propagate any changes on to folks who depended
I see a world where our children will have perfect flexibility of
truth. They will have no problems at all believing completely
contradictory statements from day to day. By 2050, it will all be
automatic.
On Monday, it will be known that (the 2050 equivalents to) Hippos and
Pigs are closely related. On Tuesday, Jane Q Hacker will revise her
program. The results of that revision will spread across the
"Universal Knowledge Repository", updating everything that depends
upon it. By Wednesday morning, the "Memory Assistance Chips" that we
will all have implanted in our brains at birth, will be updated and
when we think about it, we will naturally remember that Whales are in
and Pigs are out.
* I was in the Peace Corps in Kenya. I probably met his father. I am so totally pro-Obama it isn't funny. And the real reason is because he echoes my political views to a T. And it's kinda cool to have an African connection.