Lessons Learned Converting Java to Perl

By Simon Cozens, January 01, 2004

With all the horror stories I've heard over the past few years of Perl projects being packed up and replaced wholesale with Java projects, I recently had the happy opportunity to get back in some small way.

Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@simon-cozens.org.

Jakarta Lucene is a Java-based framework for embedding search engines into an application. It provides a simple search engine with analyzers, index writers, index readers, an optimizer, a query parser, and several query processors and scorers. Lucene is steadily being ported to other languages; Dr. Dobb's Journal recently reported on a C# version, and lupy, the Python version, and ruby-lucene are both in the works.

An application we were working on needed a search engine, and Lucene looked like the best of breed, so we decided to use it in conjunction with the Inline::Java module to glue the Perl and Java parts together. However, there were certain problems with this approach.

First, it was extremely complicated—the Java-to-Perl bridge wasn't ideally suited to being used for multiple users and concurrent access. And it was too complicated in terms of architecture—it just didn't feel like a neat design.

Also, we wanted to be able to extend the search in arbitrary ways, including having the ability to dive into the index and pick out indexed terms, and so on. We couldn't really do this in Java as flexibly as we would have liked, not least because only a few of us knew enough Java.

But we knew a lot of Perl, and hey, it's only code. I took one and a half man-months to attack the 13,790 lines of Java code in Lucene 1.2, and produced Plucene. It's not quite ready for prime-time at the time of writing, so don't ask me for it yet, but it's rapidly getting there.

However, you can't work for a month on something absolutely new without learning some lessons, can you? So this month, I hope to share with you some of the lessons I've learned over the past month as part of this Java-to-Perl translation project.

Estimating the Job

The first lesson has absolutely nothing to do with the specific technology but everything to do with project management. The conversion took much longer than I was anticipating, and that's because my estimation was completely off.

When you're converting code, it isn't enough to convert a few files and then extrapolate from how long that took and what percentage of the total source they represent. In this case, I started by converting the textual analysis classes and completed about 10 or 15 in a morning. However, these were very simple ancillary classes, many of which override only one or two methods of an abstract class.

When I began messing with the two index writer classes, I found that each one was going to take at least one day to fully understand, and another day to code up. Suddenly, my estimates were laughable.

So the first lesson is, if nothing else, have an understanding of the project as a gestalt before making any estimates. Simply chipping away at a corner of it and then extrapolating will put you in danger of racing through the simple cases and becoming stuck on the actual meat of the project.

There's a particular hacker fallacy that says you should spend your hacking time hacking, since that's what you're good at. That's very often the best solution if the problem is clearly defined, but in my case, I would have benefited from stepping back and taking two days to really understand the intricacies of the task ahead. It's easy to see two days like that as wasted, but time spent planning is an investment.

It seems so easy when glibly put like that, but there is an undeniable "urge to hack." If nothing else, thinking time gives you nothing to show to your boss, while hacking does. But on the other hand, as Brian Kernighan and Rob Pike put it in their Practice of Programming (Addison-Wesley, 1999): "Resist the urge to start typing; thinking is a worthwhile alternative."

Another important concept to remember when translating existing code is that the vast majority of the code that's there is there for a reason. Our initial port of Lucene was going to be "just enough" to work, and so I estimated that we wouldn't need to port about 20 percent of the Java. But while there were some classes that could be left alone for the moment—for instance, Lucene allows queries that specify that one word should appear "near" another word, but that's not critical to its functionality—most of the classes were there because they were actually useful.

This is another one of those things that should not be a surprise. People don't put code into a project for the fun of it. They put it there because it's used by other code. But it's sometimes tempting to account for pieces of code that we "don't need to do yet." You do need to do them, and if you'd spent a couple of days analyzing in advance, you'd know this.

Use Available Tools

How do we do our analysis? Well, there are almost always useful tools to do some kind of static analysis for us. In the case of Java, I picked up the lovely JAnalyzer (http://www.bodden.de/projects/janalyzer/), which can perform static analysis and tell you where methods are being called and which methods they in turn call.

This was particularly useful when I had to do something about the Java tendency toward method overloading. For instance, we have the two methods:

public void seek(Term term) throws IOException { ... };
void seek(TermInfo ti) throws IOException { ... };

Both are called seek, and which one gets called depends on the type of the argument. Of course, there's a naturally Perlish way to do this:

sub seek {
    my ($self, $t) = @_;
    if ($t->isa("Plucene::Index::Term")) {
        # seek version 1
    } else {
        # seek version 2
    }
}

However, this suffers from muddled thinking—it's certainly not the Perl way to have a subroutine do one thing if it's called with one type of argument and a completely different thing if it's called with another type. Since Java has this kind of method overloading built into the language, it's much more natural to see it in Java; but Perl does not, and so it is not.

Instead, the best way to do this is to identify what's really going on—the TermInfo version does all the work, and the Term one is a front end that turns the Term into a TermInfo. So we'll call one seek and the other seek_ti. Now we need to work out where the two different methods are actually called, and rename appropriately; this is where our analyzer comes in.

With a decent set of analysis tools, this is a simple process—click on the method, you get a list of places where it was called, and you track them down in your ported version. Without analysis tools, it's down to grep, checking the context of each returned line, and painstakingly looking through each one. It's worth taking the time out to see what tools are available to play with the code.
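To make the seek and seek_ti split concrete, here is a minimal sketch. The class name My::TermReader, its lookup table, and the freq_pointer field are invented for illustration and are not Plucene's real internals; the only point is that the front end converts its argument and then delegates.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a term reader; the lookup table is illustrative only.
package My::TermReader;

sub new {
    my ($class, %term_infos) = @_;
    return bless { term_infos => {%term_infos}, position => undef }, $class;
}

# Front end: turn a term into its term-info record, then delegate.
sub seek {
    my ($self, $term) = @_;
    my $ti = $self->{term_infos}{$term};
    return $self->seek_ti($ti);
}

# Does the real work; callers that already hold a term-info call this directly.
sub seek_ti {
    my ($self, $ti) = @_;
    $self->{position} = defined $ti ? $ti->{freq_pointer} : undef;
    return $self->{position};
}

package main;

my $reader = My::TermReader->new(perl => { freq_pointer => 12 });
print $reader->seek("perl"), "\n";                     # goes through the front end
print $reader->seek_ti({ freq_pointer => 28 }), "\n";  # skips the lookup
```

Each method now does exactly one job, and a caller's choice of name documents which kind of argument it is holding.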

The Joy of Tests

Another thing that held me back and could have been done better was the identification of distinct subprojects. Once you've identified the major components of the program, you can treat them as individual, isolated parts, port across the relevant files, test, rinse, and repeat.

Did I mention tests? Tests are your friend. Really, they are. It took me many years to realize this, but tests are not just a tedious thing you do after writing the code to make it look professional. Once you've properly componentized your task, you can use unit tests to ensure each component is doing what you think it ought to be doing.

I'm not one of those people who believes that you should write your tests first, watch them fail, and then build your code until they pass; and yes, I have heard all the arguments for it, thank you very much. However, I have far too many moments of enlightenment just after staring at the code and just prior to uttering, "Wait, does this actually do anything right at all?"

That's where unit tests come in, and there's been a lot of work put in to make unit tests really quite easy in Perl. My favorite testing module is Test::More, which provides, among others, the ok, is, isa_ok, and is_deeply routines. For instance, here's a portion of Plucene's test suite:

my $size = -s DIRECTORY . "/words.tis";
ok($size, "Wrote index of $size bytes");

First, we check that the index writer produced a nonzero sized index. ok takes an argument and prints "ok" if it is a true value and "not ok" if it is not. It also, like all the other Test::More routines, takes an optional comment to identify the test.

my $reader = Plucene::Index::TermInfosReader->new(DIRECTORY, "words", $fis);
isa_ok($reader, "Plucene::Index::TermInfosReader", "Got reader");
my $enum = $reader->terms;
isa_ok($enum, "Plucene::Index::SegmentTermEnum", "Got term enum");

isa_ok is used to ensure that a value is the type we expect it to be.

for my $i (0 .. $#keys) {
    $enum->next;
    my $key = $keys[$i];
    is_deeply($enum->term, $key, "Key $i matches");

is_deeply compares two structures recursively, reporting on where they differ.

    my $ti = $enum->term_info;
    is($ti->doc_freq, $doc_freqs[$i], "Doc frequency at $i matches");
}

And is compares two scalar values, reporting a difference. That's essentially all there is to testing in Perl, so unfortunately, there's hardly any excuse for not doing it.
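The Plucene excerpts above depend on an index on disk, so here is a self-contained sketch of the same four routines that you can run as is. The My::Stack class is made up for the example; only the Test::More calls are the real thing.

```perl
use strict;
use warnings;
use Test::More tests => 4;

# Hypothetical class under test, kept tiny so the test calls stand out.
{
    package My::Stack;
    sub new   { my $class = shift; return bless { items => [] }, $class }
    sub add   { my ($self, @new) = @_; push @{ $self->{items} }, @new; return $self }
    sub size  { my $self = shift; return scalar @{ $self->{items} } }
    sub items { my $self = shift; return [ @{ $self->{items} } ] }
}

my $stack = My::Stack->new;
isa_ok($stack, "My::Stack", "constructor result");     # right class?

$stack->add("java", "perl");
ok($stack->size, "stack is non-empty");                # any true value passes
is($stack->size, 2, "stack holds two items");          # scalar comparison
is_deeply($stack->items, [ "java", "perl" ],
          "items come back in insertion order");       # recursive comparison
```

Declaring the plan up front (tests => 4) means a script that dies halfway through is reported as a failure rather than as a short success.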

Asserting Your Rights

Even once you've got all your code ported across and your unit tests in place, there will be bugs. You can't avoid it. And these will not be friendly bugs, which are easy to diagnose. They will be bugs you don't understand, that will take you a day to work out where they're coming from. They will be bugs that manifest themselves somewhere completely different in the program, and say things like:

Can't take log of 0 at blib/lib/Plucene/Search/Similarity.pm line 61

And that means that you didn't pass in the appropriate parameter to a method 10 frames up the call stack. Of course.

How are we supposed to know this? Because when we find something like this, where there's obviously a parameter gone adrift somewhere, we take the relevant subroutine:

sub idf {
    my ($self, $tf, $docs) = @_;
    my ($x, $y) = ($docs->doc_freq($tf), $docs->max_doc);
    return 1 + log($y / (1 + $x));
}

and just before the failing line, we inject the following code:

use Carp qw(confess);
confess("No documents for that term?")
    unless $x;

or some similarly informational message. This time, instead of a single cryptic error message, you'll get something like:

No documents for that term? at Plucene/Search/Similarity.pm line 62
Plucene::Search::Similarity::idf('Plucene::Search::Similarity','Plucene::Index::Term=HASH(0x942054)','Plucene::Search::IndexSearcher=HASH(0x940890)') called at Plucene/Search/TermQuery.pm line 64
Plucene::Search::TermQuery::sum_squared_weights('Plucene::Search::TermQuery=HASH(0x9423c0)','Plucene::Search::IndexSearcher=HASH(0x940890)') called at Plucene/Search/Query.pm line 78
Plucene::Search::Query::scorer('Plucene::Search::Query','Plucene::Search::TermQuery=HASH(0x9423c0)','Plucene::Search::IndexSearcher=HASH(0x940890)','Plucene::Index::SegmentsReader=HASH(0x93d6d0)') called at Plucene/Search/IndexSearcher.pm line 138
Plucene::Search::IndexSearcher::_search_hc('Plucene::Search::IndexSearcher=HASH(0x940890)','Plucene::Search::TermQuery=HASH(0x9423c0)','undef','Plucene::Search::HitCollector=HASH(0x8cfd84)') called at Plucene/Search/Searcher.pm line 67
Plucene::Search::Searcher::search_hc('Plucene::Search::IndexSearcher=HASH(0x940890)','Plucene::Search::TermQuery=HASH(0x9423c0)','Plucene::Search::HitCollector=HASH(0x8cfd84)') called at Plucene/Simple.pm line 114

Now we know what we're doing and how we got to where we are. This saves us a lot of tedious tracing through the program and trying to find out where it's getting itself in a knot. Because it shows us the arguments to each subroutine, sometimes this trace is enough to spot a stray undef or wrongly typed parameter. Other times, you need to crawl through the values of the arguments; Data::Dumper is an excellent way to do this.

In this case, temporarily adding in:

use Data::Dumper;
print Dumper($docs);

would show me that there's something wrong with the data in the IndexSearcher.

The key point here, though, is that once you've worked out what the bug is, and you've written a handy test case to stop it from coming back again, you don't necessarily have to remove your confess assertions. They'll be helpful for catching similar bugs and things that shouldn't be able to happen in the future.
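Here is a small, runnable sketch of the same idea. The subroutines are simplified stand-ins that take plain numbers rather than Plucene objects, and in this version the guard is placed on the hypothetical $max_doc argument; treat it as an illustration of confess, not as Plucene's code.

```perl
use strict;
use warnings;
use Carp qw(confess);

# Hypothetical, simplified scoring routine: plain numbers instead of objects.
sub idf {
    my ($doc_freq, $max_doc) = @_;
    # Fail here, with a full backtrace, rather than inside log() below.
    confess("No documents in the index?") unless $max_doc;
    return 1 + log($max_doc / (1 + $doc_freq));
}

# A caller one frame up, so the backtrace has something to show.
sub sum_squared_weights {
    my ($doc_freq, $max_doc) = @_;
    return idf($doc_freq, $max_doc) ** 2;
}

# Trap the failure so the script can print the trace and carry on.
my $trace = "";
eval { sum_squared_weights(3, 0); 1 } or $trace = $@;
print $trace;
```

The printed trace names both idf and sum_squared_weights along with the arguments each received, which is usually enough to see where the bad value entered.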

One particularly good way to turn your bug tracing into assertions is to use the Carp::Assert module. This provides a number of functions, the most useful being assert. For instance, given this code, to read a string from a network socket:

my $length = read_string_length($socket);
my $string = " " x $length;
$socket->read($string, $length);

You could ensure that the first thing read, the string's length, is a sensible value, like so:

my $length = read_string_length($socket);
assert($length >= 0);

my $string = " " x $length;
$socket->read($string, $length);

By peppering your code with these assertions, you can be confident that your data is what you think it should be at each stage of your program's operation. If the length returned is negative, you'll get an error, and also a stack trace just like the one we saw earlier. But, surely, it takes up a lot of time to constantly check these assertions, and what happens when you want to go into production?

Carp::Assert also provides the symbolic constant DEBUG, which it sets to 1 on import. This allows you to say:

my $length = read_string_length($socket);
assert($length >= 0) if DEBUG;

and the condition will be tested just like before. However, when you want to go into production and need to get rid of these assertions, just change use Carp::Assert to no Carp::Assert. This sets the DEBUG constant to 0, and Perl is smart enough to know that a statement guarded by if 0 never needs to run, so it optimizes it away.
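You can see the mechanism at work without Carp::Assert at all. The sketch below uses the core constant pragma to define its own DEBUG as a stand-in for the one Carp::Assert exports, and the check_length routine is made up for the example.

```perl
use strict;
use warnings;

# Stand-in for the constant that "no Carp::Assert" would set to 0.
use constant DEBUG => 0;

my $checks_run = 0;

# Hypothetical check that counts how often it is actually called.
sub check_length {
    my ($length) = @_;
    $checks_run++;
    die "negative length\n" unless $length >= 0;
    return 1;
}

for my $length (3, -1) {
    check_length($length) if DEBUG;   # never runs while DEBUG is 0
}

print "checks run: $checks_run\n";
```

With DEBUG set to 0 the counter stays at zero, even for the negative length; flip the constant to 1 and the second iteration dies as you would expect.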

This is particularly useful for testing the interfaces to internal API functions in the absence of strict type checking. In Java, you can declare that a subroutine takes a Plucene::Index::Reader and the compiler can tell at compile time if you've passed it a value that's not going to be a Plucene::Index::Reader.

In Perl, however, variables can contain any kind of scalar, so they can't easily be type checked at compile time. However, we can use Carp::Assert to check them at runtime, which is the next best thing, and saves even more obscure errors later:

sub add {
    my ($self, $reader) = @_;
    assert($reader->isa("Plucene::Index::Reader"));
    ...
}
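One wrinkle with calling isa directly on the argument is that if $reader is undef, the method call itself dies with a different error before any assertion gets a chance to speak. A rough sketch of a more defensive check, using only core modules (Carp and Scalar::Util) and two made-up classes, might look like this:

```perl
use strict;
use warnings;
use Carp ();
use Scalar::Util ();

# Hypothetical classes: a reader, and something that collects readers.
{
    package My::Reader;
    sub new { my $class = shift; return bless {}, $class }
}
{
    package My::Merger;
    sub new { my $class = shift; return bless { readers => [] }, $class }

    sub add {
        my ($self, $reader) = @_;
        # blessed() is false for undef and plain scalars, so isa() is
        # only ever called on real objects.
        Carp::confess("add() expects a My::Reader")
            unless Scalar::Util::blessed($reader) && $reader->isa("My::Reader");
        push @{ $self->{readers} }, $reader;
        return scalar @{ $self->{readers} };
    }
}

my $merger = My::Merger->new;
print $merger->add(My::Reader->new), "\n";           # a real reader is accepted

my $ok = eval { $merger->add("not a reader"); 1 };   # a plain string is rejected
print "Rejected: ", (split /\n/, $@)[0], "\n" unless $ok;
```

The same condition can go inside an assert call if you prefer to keep everything switchable through DEBUG.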

Mind Your Interfaces

Why is this important? The final lesson to learn is that interface consistency is a massive help to avoiding bugs in a large project. For instance, let's consider two things: first, styles of passing parameters. Java and the C-related languages have only one style: You pass a list of typed parameters in a defined order:

public IndexWriter(String path, Analyzer a, boolean create)

But Perl has several different styles that are in common use. There's the C-like style:

IndexWriter->new($path, $analyzer, $create)

Or there's the named parameter style:


IndexWriter->new(path => $p, analyzer => $a, create => $c)

Or sometimes the hash reference style:


IndexWriter->new({path => $p, analyzer => $a, create => $c})

The second thing to consider is that Java has a rather neat way of creating constructors for a class and accessors to its members. You simply declare the accessors as variables inside the class, and create a function with the same name as the class:

final class TermInfo {
 
  int docFreq = 0;
  long freqPointer = 0;
  long proxPointer = 0;

  TermInfo() {}

}

This creates a very simple, data-only class with a constructor and three accessors, with default values, in very little code at all.

The Class::Accessor Perl module gives us very much the same sort of thing:

package TermInfo;
use base 'Class::Accessor';
TermInfo->mk_accessors(qw/ doc_freq freq_pointer prox_pointer /);

This gives us a new method that takes parameters in the hashref style above, and three methods to get or set the values of the appropriate data members. (Did you notice, incidentally, how we changed the names of the members from the usual Java camel-case style to the Perl lower-case-and-underscore style?) Now we can say:

my $ti = TermInfo->new({
            doc_freq     => 2, 
            freq_pointer => 12,
            prox_pointer => 28
         });

$ti->doc_freq(3);
...

Now I came to a dilemma. I wanted to use Class::Accessor to get this rapid development and clean access to data members, but I was also trying to emulate the Lucene API and wanted to keep the arguments roughly the same. This led to a mix of styles in the same program. This is, of course, very bad.

The reason this is particularly bad is that the interfaces between functions are the best place to spot erroneous parameters being passed around, and that's where Carp::Assert comes in handy.

Wouldn't it be nice, I thought, if there was some way to mix Class::Accessor with Carp::Assert to ensure that the values that you give to your constructor and accessors are what you expect? After quite a lot of struggling with the intricacies of Class::Accessor, I produced Class::Accessor::Assert.

This extends Class::Accessor with a tiny smattering of syntax: If you add a + before the name of a data member, it will be marked as required, and the constructor will fail if it is not present:

package Person;
use base 'Class::Accessor::Assert';
__PACKAGE__->mk_accessors(qw/ +name address date_of_birth /);

my $x = Person->new({ name => "Joel" }); # OK
my $y = Person->new({}); # Dies with backtrace

Additionally, if you add =Some::Class to the end of a member's name, it will ensure that that member is always an object of that class:

package Plucene::Index::Writer;
use base 'Class::Accessor::Assert';
__PACKAGE__->mk_accessors(qw/ +path create
                              +analyzer=Plucene::Analysis::Analyzer /);

my $x = Plucene::Index::Writer->new({ path => "/tmp/index/",
    analyzer => Plucene::Analysis::SimpleAnalyzer->new() });

$x->analyzer(undef); # OK
$x->analyzer(1);     # Dies with backtrace - not an ::Analyzer
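If you're curious how such a mini-language can be handled, here is a rough sketch in plain Perl. It is not the module's actual implementation: parse_spec and check_args are names invented for this example, and the behavior is limited to the two rules described above (a leading + means required, a trailing =Some::Class means the value must be an object of that class).

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper: split a spec such as "+analyzer=Some::Class" into parts.
sub parse_spec {
    my ($spec) = @_;
    my ($required, $name, $class) = $spec =~ /^(\+?)(\w+)(?:=([\w:]+))?$/
        or die "Bad member spec '$spec'\n";
    return { name => $name, required => $required ? 1 : 0, class => $class };
}

# Hypothetical helper: check a hashref of constructor arguments against specs.
sub check_args {
    my ($args, @specs) = @_;
    for my $spec (map { parse_spec($_) } @specs) {
        my $name  = $spec->{name};
        my $value = $args->{$name};
        die "Required member '$name' missing\n"
            if $spec->{required} && !exists $args->{$name};
        die "Member '$name' must be a $spec->{class}\n"
            if defined $value && defined $spec->{class}
            && !(blessed($value) && $value->isa($spec->{class}));
    }
    return 1;
}

my @specs = qw( +name address date_of_birth );
print "Joel is fine\n" if check_args({ name => "Joel" }, @specs);

my $ok = eval { check_args({}, @specs); 1 };
print "Empty arguments rejected: $@" unless $ok;
```

A real module would hang these checks on the generated constructor and accessors and die with a backtrace rather than a one-line message.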

Unfortunately, of course, I wrote the module after all of the more heinous interface incompatibility bugs in Plucene had been worked out, but it's something I'll be sure to use the next time I'm converting code from a typed language into Perl...

TPJ
