How Simple is Too Simple?
September 27, 2009
This question is somewhat related to the weak typing discussion, but
only somewhat.
The basic question is "How complex should a class/method be?"
I could write a bunch of very simple methods, such as add1(int) and
add2(int) and add3(int). But most people would demur and prefer an
add(int, int) method.
More generally, how many and what kinds of options does it make sense
to have for a class, and when does it make more sense just to write a
new class?
I want to write files containing DNA sequences. There are several
competing formats for storing DNA: Fasta, SangerFastq, SolexaFastq,
BAM, SAM, mm. All of these formats store sequence data (e.g.,
ACGCTTAAGGGCACACACA...) along with individual base quality scores
(e.g., 20, 21, 20, 30, 26, 33...) under some sort of id (e.g., G1234P31).
Now... should I write one class that takes a type option (Fasta,
SangerFastq, etc.), or should I have multiple classes (FastaWriter,
SangerFastqWriter, etc.)?
If I have a single writer class with options, then programmers will
only have to figure out what one class does. On the other hand, the
class will be more complex than if I have multiple classes.
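To make the trade-off concrete, here is a rough sketch of the two shapes side by side. Everything in it is hypothetical: SeqRecord, Phred, OptionWriter, RecordWriter and the two small writer classes are illustration names, not our real classes, and the methods return strings instead of writing files so the sketch stays short. The one hard fact it relies on is that Sanger Fastq encodes each quality score as the ASCII character (score + 33). Plain Fasta text has no slot for quality scores, so the Fasta branch simply drops them here.

```java
// Hypothetical record type; the real sequence classes are not shown in this post.
class SeqRecord {
    final String id;
    final String bases;
    final int[] quality;

    SeqRecord(String id, String bases, int[] quality) {
        this.id = id;
        this.bases = bases;
        this.quality = quality;
    }
}

class Phred {
    // Sanger Fastq stores each quality score as the ASCII character (score + 33).
    static String encode(int[] quality) {
        StringBuilder sb = new StringBuilder();
        for (int q : quality) {
            sb.append((char) (q + 33));
        }
        return sb.toString();
    }
}

// Shape 1: a single writer class that takes a type option.
enum SeqFormat { FASTA, SANGER_FASTQ }

class OptionWriter {
    private final SeqFormat format;

    OptionWriter(SeqFormat format) {
        this.format = format;
    }

    String render(SeqRecord r) {
        switch (format) {
            case FASTA:
                // Quality scores are dropped in this sketch.
                return ">" + r.id + "\n" + r.bases + "\n";
            case SANGER_FASTQ:
                return "@" + r.id + "\n" + r.bases + "\n+\n" + Phred.encode(r.quality) + "\n";
            default:
                throw new IllegalArgumentException("Unsupported format " + format);
        }
    }
}

// Shape 2: one small class per format behind a shared interface.
interface RecordWriter {
    String render(SeqRecord r);
}

class FastaRecordWriter implements RecordWriter {
    public String render(SeqRecord r) {
        return ">" + r.id + "\n" + r.bases + "\n";
    }
}

class SangerFastqRecordWriter implements RecordWriter {
    public String render(SeqRecord r) {
        return "@" + r.id + "\n" + r.bases + "\n+\n" + Phred.encode(r.quality) + "\n";
    }
}
```

Both shapes produce the same text. The difference is where the next format goes: another case in the switch, or another class.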
Which would you choose?
Another example of this is pipeline steps.
At the Broad, we use pipelines with multiple steps to analyze our
DNA. A typical pipeline would look like this:
Pipeline p = new Pipeline();
p.add(new FastaReaderStep("/tmp/eColi.fasta"));
p.add(new TrimLowQuality(30));
p.add(new FilterOutBadSequences());
p.add(new FastaWriterStep("/tmp/goodEColi.fasta"));
This pipeline reads the sequences from the Fasta file and passes them
one by one to the trimmer, which drops low-quality bases. The trimmer
passes them to a filter that tosses out "bad" sequences, and the final
step writes the survivors out.
The really cool thing about using a pipeline for this is that it is
simple to run steps on separate machines. And because the sequences
are being streamed through the pipeline, there is no limit on the
number of sequences we can analyze.
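For anyone who has not seen this style, here is a toy version of the streaming idea. It is not our actual Pipeline class: MiniPipeline, Step, and push are made-up names, and the convention that a step returns null to drop an object is an assumption made for this sketch. The point is only that each object goes through every step before the next one is read, so nothing needs to be held in memory beyond the object in flight.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical step contract: return the (possibly modified) object,
// or null to drop it from the stream.
interface Step {
    Object processObject(Object obj);
}

class MiniPipeline {
    private final List<Step> steps = new ArrayList<>();

    void add(Step step) {
        steps.add(step);
    }

    // Push one object through every step in order; stop early if a step drops it.
    Object push(Object obj) {
        Object current = obj;
        for (Step step : steps) {
            current = step.processObject(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

A reader step would then call push once per sequence as it reads the file, rather than loading the whole file first.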
Now for our issue...
What if, instead of reading in sequences from a Fasta file (we
construct FastaSequence objects from Fasta files), we decide to read a
different kind of object, say a BioSequence object from the
database. (BioSequence objects are similar to FastaSequence objects,
with some different fields. They both implement the Sequence interface,
but FastaWriter needs more than that.)
Should our steps accept both FastaSequence objects and BioSequence
objects?
Because the pipeline was designed some years ago, it is not parameterized.
The basic method of a pipeline step is the processObject(Object) method.
So it is not unusual for some of our programmers to write steps that take
all sorts of different input objects.
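For contrast, this is roughly what a parameterized step could look like. It is purely a sketch: TypedStep and then are invented names and nothing like this exists in our pipeline today. With input and output types declared, the compiler rejects a step that is wired to the wrong kind of object, instead of the mistake showing up as a cast failure at run time.

```java
// Hypothetical typed step: I is the input type, O is the output type.
interface TypedStep<I, O> {
    O process(I input);

    // Chain this step with the next one; the compiler checks that
    // this step's output type fits the next step's input type.
    default <R> TypedStep<I, R> then(TypedStep<? super O, ? extends R> next) {
        return input -> next.process(process(input));
    }
}
```

The cost is that a step can no longer quietly accept "anything", which is exactly the flexibility some of our steps rely on.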
Usually, when we want a FastaSequence version of a BioSequence object,
we just use the BioSequence ID field for the Fasta header string. But sometimes we
want something else. So there is no "correct" method of selecting the header.
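One way to live with that is to let the caller say how the header is built. The sketch below is an assumption-heavy illustration, not our real code: the BioSequence and FastaSequence classes are minimal stand-ins with only the three fields used in this post, the quality type (int[]) is a guess, and the convert method and the header function parameter are invented for the example.

```java
import java.util.function.Function;

// Minimal stand-ins for the real classes, which have more fields than shown here.
class BioSequence {
    private final String id;
    private final String sequence;
    private final int[] quality;

    BioSequence(String id, String sequence, int[] quality) {
        this.id = id;
        this.sequence = sequence;
        this.quality = quality;
    }

    String getId() { return id; }
    String getSequence() { return sequence; }
    int[] getQuality() { return quality; }
}

class FastaSequence {
    private final String header;
    private final String sequence;
    private final int[] quality;

    FastaSequence(String header, String sequence, int[] quality) {
        this.header = header;
        this.sequence = sequence;
        this.quality = quality;
    }

    String getHeader() { return header; }
    String getSequence() { return sequence; }
    int[] getQuality() { return quality; }
}

class ConvertBioSequenceToFastaSequence {
    // Hypothetical hook: decides what goes in the Fasta header.
    private final Function<BioSequence, String> headerFor;

    // Default: the usual choice, the BioSequence ID.
    ConvertBioSequenceToFastaSequence() {
        this(BioSequence::getId);
    }

    ConvertBioSequenceToFastaSequence(Function<BioSequence, String> headerFor) {
        this.headerFor = headerFor;
    }

    FastaSequence convert(BioSequence bs) {
        return new FastaSequence(headerFor.apply(bs), bs.getSequence(), bs.getQuality());
    }
}
```

With this shape the common case stays a one-liner, and the "something else" case is new ConvertBioSequenceToFastaSequence(bs -> "eColi|" + bs.getId()) or whatever header rule that pipeline needs.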
====
Should a FastaWriterStep accept BioSequence objects?
We could write a simple, Fasta-only writer step:
class FastaWriterStep {
    public void processObject(Object obj) {
        // Fasta-only: any other type fails with a ClassCastException.
        writer.write((FastaSequence) obj);
    }
    ...
}
and require our programmers to add an explicit conversion step:
Pipeline p = new Pipeline();
p.add(new QueryStep("FROM BioSequence WHERE group.name = 'eColi'"));
p.add(new TrimLowQuality(30));
p.add(new FilterOutBadSequences());
p.add(new ConvertBioSequenceToFastaSequence());
p.add(new FastaWriterStep("/tmp/goodEColi.fasta"));
or should we write a more complex step that won't require a conversion
step?
class FastaWriterStep {
    public void processObject(Object obj) {
        if (obj instanceof FastaSequence) {
            writer.write((FastaSequence) obj);
        }
        else if (obj instanceof BioSequence) {
            // Convert on the fly, using the BioSequence ID as the Fasta header.
            BioSequence bs = (BioSequence) obj;
            FastaSequence fs = new FastaSequence(bs.getId(), bs.getSequence(), bs.getQuality());
            writer.write(fs);
        }
        else if (obj instanceof BamSequence) {
            ...
        }
        else {
            throw new RuntimeException("Unknown object type " + obj);
        }
    }
    ...
}
My group is debating exactly this question right now.
What would you do?
-Bil