Bil Lewis

Dr. Dobb's Bloggers

How Simple is Too Simple?

September 27, 2009

This question is somewhat related to the weak typing discussion, but
only somewhat.

The basic question is "How complex should a class/method be?"

I could write a bunch of very simple methods, such as add1(int) and
add2(int) and add3(int). But most people would demur and prefer an
add(int, int) method.
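To make the contrast concrete, here is a minimal sketch of both styles. The class name Adders is made up for illustration.

```java
// Hypothetical class, for illustration only.
class Adders {
    // The "very simple" style: one method per constant.
    static int add1(int x) { return x + 1; }
    static int add2(int x) { return x + 2; }
    static int add3(int x) { return x + 3; }

    // The general style: one method with one extra parameter.
    static int add(int x, int y) { return x + y; }
}
```

Each of the simple methods is trivial to read, but the caller has to know that all three exist. The general method costs one more argument and covers every case.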

More generally, how many and what kinds of options does it make sense
to have for a class, and when does it make more sense just to have a
new class?

I want to write files containing DNA sequences. There are several
competing formats for storing DNA: Fasta, SangerFastq, SolexaFastq,
BAM, SAM, mm. All of these formats will store sequence data (e.g.,
ACGCTTAAGGGCACACACA...) along with individual base quality scores
(e.g., 20, 21, 20, 30, 26, 33...) under some sort of id (e.g., G1234P31).

Now... should I write one class that takes a type option (Fasta,
SangerFastq, etc.), or should I have multiple classes (FastaWriter,
SangerFastqWriter, etc.)?

If I have a single writer class with options, then programmers will
only have to figure out what one class does. On the other hand, the
class will be more complex than if I have multiple classes.
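A rough sketch of the two designs, side by side, might look like the following. Everything here is an assumption made for illustration: the Format enum, the SeqWriter interface, the write(id, bases, quals) signature, and writing to a StringBuilder rather than a file. Only two of the formats are shown, and the Fasta branch in this sketch writes just the id and the bases.

```java
// Hypothetical sketch: both designs write to a StringBuilder instead of a file.
enum Format { FASTA, SANGER_FASTQ }

// Design 1: a single writer class, configured with a format option.
class SequenceWriter {
    private final Format format;
    private final StringBuilder out;

    SequenceWriter(Format format, StringBuilder out) {
        this.format = format;
        this.out = out;
    }

    void write(String id, String bases, int[] quals) {
        switch (format) {
            case FASTA:
                out.append('>').append(id).append('\n').append(bases).append('\n');
                break;
            case SANGER_FASTQ:
                out.append('@').append(id).append('\n').append(bases).append("\n+\n");
                for (int q : quals) out.append((char) (q + 33)); // Phred+33 encoding
                out.append('\n');
                break;
        }
    }
}

// Design 2: one small class per format behind a shared interface.
interface SeqWriter {
    void write(String id, String bases, int[] quals);
}

class FastaWriter implements SeqWriter {
    private final StringBuilder out;
    FastaWriter(StringBuilder out) { this.out = out; }

    public void write(String id, String bases, int[] quals) {
        out.append('>').append(id).append('\n').append(bases).append('\n');
    }
}

class SangerFastqWriter implements SeqWriter {
    private final StringBuilder out;
    SangerFastqWriter(StringBuilder out) { this.out = out; }

    public void write(String id, String bases, int[] quals) {
        out.append('@').append(id).append('\n').append(bases).append("\n+\n");
        for (int q : quals) out.append((char) (q + 33)); // Phred+33 encoding
        out.append('\n');
    }
}
```

With only two formats the switch is still easy to read. Each added format grows the single class by another case, while the multi-class design grows by another small file.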

Which would you choose?

Another example of this is pipeline steps.

At the Broad, we use pipelines with multiple steps to analyze our
DNA. A typical pipeline would look like this:

Pipeline p = new Pipeline();
p.add(new FastaReaderStep("/tmp/eColi.fasta"));
p.add(new TrimLowQuality(30));
p.add(new FilterOutBadSequences());
p.add(new FastaWriterStep("/tmp/goodEColi.fasta"));

This pipeline reads the sequences from the Fasta file and passes them
one by one to the trimmer, which drops low-quality bases. The trimmed
sequences then go to a filter that tosses out "bad" sequences, and the
final step writes out whatever is left.

The really cool thing about using a pipeline for this is that it is
simple to run steps on separate machines. And because the sequences
are being streamed through the pipeline, there is no limit on the
number of sequences we're analyzing.
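For readers who have not seen this style, here is a minimal, single-machine sketch of how objects can stream through chained steps one at a time. It is not our actual pipeline code: the Step base class, the emit() helper, SketchPipeline, and the two toy steps are all assumptions made up for illustration. Only the processObject(Object) method name and the add() call mirror the real thing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: each step handles one object and may pass it on.
abstract class Step {
    private Step next;

    void setNext(Step next) { this.next = next; }

    // Hand one object to the following step, if there is one.
    protected void emit(Object obj) {
        if (next != null) next.processObject(obj);
    }

    public abstract void processObject(Object obj);
}

// Hypothetical pipeline: add() links each new step to the previous one.
class SketchPipeline {
    private final List<Step> steps = new ArrayList<>();

    void add(Step step) {
        if (!steps.isEmpty()) steps.get(steps.size() - 1).setNext(step);
        steps.add(step);
    }

    // Feed a single object into the first step.
    void push(Object obj) {
        if (!steps.isEmpty()) steps.get(0).processObject(obj);
    }
}

// Toy filter: drops sequences shorter than minLength.
class DropShort extends Step {
    private final int minLength;
    DropShort(int minLength) { this.minLength = minLength; }

    public void processObject(Object obj) {
        if (((String) obj).length() >= minLength) emit(obj);
    }
}

// Toy sink: remembers what reached the end of the pipeline.
class Collect extends Step {
    final List<Object> seen = new ArrayList<>();

    public void processObject(Object obj) {
        seen.add(obj);
        emit(obj);
    }
}
```

Because each object is handed along as soon as it is processed, no step in this sketch needs to hold the whole data set in memory, except the toy sink, which keeps everything only so the result can be inspected.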

Now for our issue...

What if, instead of reading in sequences from a Fasta file (we
construct FastaSequence objects from Fasta files), we decide to read a
different kind of object, say a BioSequence object from the
database. (BioSequence objects are similar to FastaSequence objects,
with some different fields. They both implement the Sequence interface,
but FastaWriter needs more than that.)

Should our steps accept both FastaSequence objects and BioSequence
objects?

Because the pipeline was designed some years ago, it is not parameterized.
The basic method of a pipeline step is the processObject(Object) method.
So it is not unusual for some of our programmers to write steps that take
all sorts of different input objects.

Usually, when we want a FastaSequence version of a BioSequence object,
we just use the BioSequence ID field for the Fasta header string. But sometimes we
want something else. So there is no "correct" method of selecting the header.
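One way to picture that choice is a converter that uses the ID by default but lets the caller supply a different header rule. This is only a sketch under assumptions: the BioToFastaConverter name, the stand-in BioSequence and FastaSequence classes, and their field types (String and int[]) are invented here for illustration.

```java
import java.util.function.Function;

// Minimal stand-ins for the real classes; field types are assumptions.
class BioSequence {
    private final String id;
    private final String sequence;
    private final int[] quality;

    BioSequence(String id, String sequence, int[] quality) {
        this.id = id;
        this.sequence = sequence;
        this.quality = quality;
    }

    String getId() { return id; }
    String getSequence() { return sequence; }
    int[] getQuality() { return quality; }
}

class FastaSequence {
    final String header;
    final String sequence;
    final int[] quality;

    FastaSequence(String header, String sequence, int[] quality) {
        this.header = header;
        this.sequence = sequence;
        this.quality = quality;
    }
}

// Hypothetical converter: the header rule is a parameter, not hard-coded.
class BioToFastaConverter {
    private final Function<BioSequence, String> headerFor;

    // Default rule: use the BioSequence ID as the Fasta header.
    BioToFastaConverter() { this(BioSequence::getId); }

    BioToFastaConverter(Function<BioSequence, String> headerFor) {
        this.headerFor = headerFor;
    }

    FastaSequence convert(BioSequence bs) {
        return new FastaSequence(headerFor.apply(bs), bs.getSequence(), bs.getQuality());
    }
}
```

A caller who wants the usual behavior writes new BioToFastaConverter(), while one who wants something else passes a rule such as s -> "eColi|" + s.getId(). The "eColi|" prefix is just an example value.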
 
====
 
Should a FastaWriterStep accept BioSequence objects?

We could write a simple, Fasta-only writer step:


class FastaWriterStep {

  public void processObject(Object obj) {
    writer.write((FastaSequence) obj);
  }
  ...
}


and require our programmers to add an explicit conversion step:


Pipeline p = new Pipeline();
p.add(new QueryStep("FROM BioSequence WHERE group.name = 'eColi'"));
p.add(new TrimLowQuality(30));
p.add(new FilterOutBadSequences());
p.add(new ConvertBioSequenceToFastaSequence());
p.add(new FastaWriterStep("/tmp/goodEColi.fasta"));


or should we write a more complex step that won't require a conversion
step?


class FastaWriterStep {

  public void processObject(Object obj) {
    if (obj instanceof FastaSequence) {
      writer.write((FastaSequence) obj);
    }
    else if (obj instanceof BioSequence) {
      BioSequence bs = (BioSequence) obj;
      // Use the BioSequence ID as the Fasta header.
      FastaSequence fs = new FastaSequence(bs.getId(), bs.getSequence(), bs.getQuality());
      writer.write(fs);
    }
    else if (obj instanceof BamSequence) {
      ...
    }
    else {
      throw new RuntimeException("Unknown object type " + obj);
    }
  }
  ...
}


My group is debating exactly this question right now.

What would you do?

-Bil
