SOAP: Simplifying Distributed Development


Sep01: SOAP: Simplifying Distributed Development

Neil is a web site developer who can be contacted at [email protected].


I was developing a noncommercial community web site (http://www.crazyguyonabike.com/) using Embperl (http://perl.apache.org/embperl/), which works on top of mod_perl and Apache. In the process, I decided to add a spellchecker for the message boards and journals people were writing. Since I wanted to stay in the open-source arena, I settled on the Perl module Lingua::Ispell (http://www.redhat.com/swr/i386/perl-Lingua-Ispell-0.05-6.i386.html), which encapsulates the ispell (or aspell) program. Lingua::Ispell works by spawning the ispell process once, then using pipes to feed it input. This is nice because it uses an external process, but doesn't need to launch that process for every call.

However, I found I could not practically use Lingua::Ispell directly from my Embperl pages because of an obscure problem related to how mod_perl/Apache handles the spawning of subprocesses (see http://perl.apache.org/guide/performance.html#Forking_and_Executing_Subprocess). Zombie subprocesses were left behind after every call I made, and Lingua::Ispell had to restart ispell every time, which defeats the point of keeping a single ispell process running and is not acceptable on a production system. And even if it did work, every copy of Apache would be launching its own ispell subprocess, which seemed inefficient.

Consequently, I decided to write a spelling server that would remain persistent and accept connections from my Embperl scripts. Traditionally, the lowest common denominator for writing servers and daemons is sockets and IPC. However, I wanted to avoid this since it can get tricky, and I wanted to use one of the higher-level libraries that have been developed. One option was CORBA, but it has always been a pain because of its complexity. That's when I turned to SOAP.

A Brief Overview of SOAP

The Simple Object Access Protocol (SOAP), currently in Version 1.1 with Version 1.2 in working draft, was developed as an open RPC protocol using XML, targeting much the same problem set as CORBA, DCOM, and Java RMI. SOAP has been picked up by the World Wide Web Consortium's XP (XML Protocol) project (http://www.w3.org/TR/2001/WD-soap12-20010709/) and, perhaps most visibly, Microsoft .NET has adopted it as its central RPC standard. For the time being, SOAP remains an independent, open standard. Even if Microsoft tries to "embrace and extend" the protocol, a core set of functionality will likely work between any SOAP client and server. A superset of Microsoft-specific stuff will be necessary to interact fully with Microsoft products, but not essential for the basic RPC that SOAP is designed for.

One advantage of SOAP is that it works over existing protocols, such as HTTP. One of the biggest problems with CORBA is that IIOP needed its own ports, and so firewalls were a huge issue. (None of the big banks I worked with were anywhere near ready to poke holes in their bastions for a new, untested protocol.)

The other advantage is that SOAP is not a binary protocol. Some people see this as a liability, since there are inevitably issues with the amount of data that must be transmitted and parsed (XML is infamous for being more verbose than other data formats). However, the fact that the protocol is text based harks back to the simplicity of HTTP and HTML, and makes applications easier to debug. Also, interoperability is improved because byte-ordering issues (big-endian versus little-endian) go away completely. This means a SOAP client running on Windows should not care that the SOAP server it's talking to is running on Linux.
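
To make the byte-ordering point concrete, here is a minimal sketch using only Perl's built-in pack and unpack. The value 258 and the variable names are illustrative and do not come from the spelling server. It packs the same integer in both byte orders, shows what happens when one is misread as the other, and contrasts that with sending the number as text.

```perl
use strict;
use warnings;

# Illustrative value only (0x00000102); not taken from the article's code.
my $value = 258;

# The same 32-bit integer packed in two different byte orders.
my $big_endian    = pack('N', $value);   # big-endian ("network") order
my $little_endian = pack('V', $value);   # little-endian order

# The raw bytes differ, so both ends of a binary protocol must agree.
my $big_hex    = unpack('H*', $big_endian);
my $little_hex = unpack('H*', $little_endian);

# Reading big-endian bytes as if they were little-endian gives a wrong value.
my $misread = unpack('V', $big_endian);

# Sent as text, the number is just its decimal digits on any platform.
my $as_text   = "$value";
my $from_text = $as_text + 0;

print "big-endian bytes:    $big_hex\n";
print "little-endian bytes: $little_hex\n";
print "misread value:       $misread\n";
print "text round trip:     $from_text\n";
```

A text format pays for this in size and parsing time, which is the liability mentioned above, but neither side needs to know the other's native byte order.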

SOAP::Lite

SOAP::Lite is an open-source package written by Paul Kulchenko (http://www.soaplite.com/ and http://www.cpan.org/), which encapsulates SOAP clients and servers in Perl. (According to the author, the "Lite" refers to the ease of use, not the functionality.) I use Perl for all my web work because it is such a mature language, rock solid in its reliability, and flexible. Throw in the sheer number of available open-source modules and it's a world-class platform.

When you get SOAP::Lite, you have a couple of additional packages to install before using it: the XML::Parser bundle (http://search.cpan.org/) and the Expat toolkit (http://expat.sourceforge.net/). If you're going to use SOAP::Lite from Apache and mod_perl, then there is a gotcha. The problem is that Apache comes with an Expat (Lite) module, which conflicts in the symbol tables with the Expat module XML::Parser uses. So you need to edit the Apache configuration file (src/Configuration), look for the rule EXPAT = default, and change it to EXPAT = no. Then rebuild Apache and all should be well. If you installed Apache from RPM, then you'll have to get the sources and build it manually. Embperl and mod_perl require access to the Apache source when being built anyway, so this isn't too unusual.

SOAP::Lite can be used to implement a server and client without your having to know anything at all about XML. Even though XML is fundamental to the toolkit, it is all under the covers (where it belongs). All you have to do is set up the daemon (see Listing One) and the client (Listing Two).

The Server

The server uses the package SOAP::Transport::HTTP::Daemon, which contains all the functionality needed to implement a daemon; see Listing One for the server code.

The first thing you do is create the daemon object, giving it an address and a port to listen on (localhost and port 81 in this example). You also tell the daemon to use objects_by_reference, a directive that specifies a package that is to be used in a persistent way. This means that between calls to the server, the package is not reinitialized. The objects_by_reference directive is actually more powerful than this, but all you need out of it here is the ability to have the variables of the package (Spelling) persist between calls. You need this because the Lingua::Ispell package maintains the pipe to ispell in a variable. If it were reinitialized on each call, then Lingua::Ispell would lose track of ispell and restart it each time.
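
Why persistence matters can be sketched without SOAP or ispell. In the following illustration, the package SpellPipe, its counter, and the fake handle string are all hypothetical stand-ins for the way Lingua::Ispell keeps its pipe in a variable; nothing here is SOAP::Lite or Lingua::Ispell code. It simply counts how often the "process" has to be started with and without reinitialization.

```perl
use strict;
use warnings;

package SpellPipe;    # hypothetical stand-in for Lingua::Ispell

our $spawn_count = 0;   # how many times the fake process was started
my  $handle;            # lexical that keeps its value between calls

# Return the existing handle, starting the fake process only if needed.
sub handle {
    unless (defined $handle) {
        $spawn_count++;
        $handle = "ispell-pipe-$spawn_count";   # placeholder for a real pipe
    }
    return $handle;
}

# Simulate the package state being thrown away between calls.
sub reinitialize { undef $handle }

package main;

# State persists: three calls need only one spawn.
SpellPipe::handle() for 1 .. 3;
my $persistent_spawns = $SpellPipe::spawn_count;

# State is reset before each call: three more calls, three more spawns.
for (1 .. 3) {
    SpellPipe::reinitialize();
    SpellPipe::handle();
}
my $total_spawns = $SpellPipe::spawn_count;

print "spawns with persistent state: $persistent_spawns\n";
print "spawns in total after resets: $total_spawns\n";
```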

You then tell the daemon to dispatch calls to the package (Spelling) and ask it to handle connections. That's pretty much it. There is no more SOAP-specific stuff in the server code. You have the package definition, which contains a single subroutine, check(). This is in the form of an object method (that is, $self is the first parameter). The other parameter is the text being spellchecked.

The check() method calls Lingua::Ispell, then walks through the resulting terms, which are all the unrecognized words. You then wrap each of these terms in FONT tags, which turn the word red. Terms that appear within HTML tags or Embperl code blocks must be left alone; otherwise, the HTML will get corrupted. The code handles this by temporarily swapping such terms for a dummy placeholder, marking up the remaining occurrences, and then restoring the placeholder. There are probably much more efficient ways to do the pattern matching than the three separate statements that I have here, but this works pretty well.
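
The three substitution steps can be tried on their own, without ispell or a running server. The sketch below is an adaptation rather than the listing itself: the function name mark_terms and the sample text are made up for illustration, the list of misspelled terms is passed in by hand instead of coming from Lingua::Ispell, and each term is passed through quotemeta so it is matched literally.

```perl
use strict;
use warnings;

# Mark each given term in red, skipping terms inside <...> or [...] brackets.
# The terms are supplied by the caller here; in the server they come from ispell.
sub mark_terms {
    my ($text, @terms) = @_;
    my $dummy = '__TERM__';
    for my $term (@terms) {
        my $quoted = quotemeta $term;

        # Step 1: hide occurrences inside HTML or Embperl brackets.
        $text =~ s{([<\[])([^>\]]*?)\b$quoted\b}{$1$2$dummy}g;

        # Step 2: wrap the remaining occurrences in a red FONT tag.
        $text =~ s{(^|[^>&])\b($quoted)\b}{$1<FONT COLOR="red">$2</FONT>}g;

        # Step 3: put the hidden occurrences back.
        $text =~ s{\Q$dummy\E}{$term}g;
    }
    return $text;
}

# Hypothetical sample input: "teh" appears in an attribute and in the text.
my $input  = 'A <a href="teh.html">link</a> to teh page';
my $output = mark_terms($input, 'teh');
print "$output\n";
```

Only the occurrence in the visible text is wrapped; the one inside the href attribute survives untouched, which is the whole purpose of the placeholder step.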

And that's that for the server. When you run it, the daemon sits and waits for connections. I have had the same daemon up and running for days, and it seems stable. It is not multithreaded; converting the daemon into a more ambitious preforking version is left as an exercise for you (there is example code to do this included with SOAP::Lite). For me, the current form is sufficient because spellchecking happens on a relatively infrequent basis (only when someone is posting a message or explicitly requesting a spellcheck on their journal). If the site were to expand, then I would utilize load balancing by having multiple web server machines, each with its own spellcheck server.

The Client

The client code is even simpler. As you'll see from Listing Two, all you do is specify the address and port of the server, and the function to call. You can see that the package you are calling is given as part of the address; this is how the server knows where to dispatch the call.

I have added extra logic as a safeguard just in case the server is not running; this is a directive to catch errors and undefine the result string. If this happens, then the text we are spellchecking is simply left unchanged. Thus, it's a kind of failsafe mechanism.
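
The same failsafe idea can be shown in isolation. This sketch uses a plain eval block rather than SOAP::Lite's on_fault hook, and the names check_or_keep, $working, and $broken are hypothetical; the code references merely stand in for a remote call that either succeeds or fails.

```perl
use strict;
use warnings;

# Run a checker on the text, but fall back to the original text if the
# checker dies or returns nothing. $checker is a hypothetical code
# reference standing in for the remote spellcheck call.
sub check_or_keep {
    my ($text, $checker) = @_;
    my $result = eval { $checker->($text) };
    return defined $result ? $result : $text;
}

# One checker that works, and one that fails as an unreachable server might.
my $working = sub { my ($t) = @_; return uc $t };
my $broken  = sub { die "connection refused\n" };

my $checked   = check_or_keep('some text', $working);
my $unchanged = check_or_keep('some text', $broken);

print "working checker: $checked\n";
print "broken checker:  $unchanged\n";
```

Either way the caller gets usable text back, so a missing spelling server degrades the feature instead of breaking the page.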

The proxy part of the call specifies the server that we are using. The beauty of SOAP is that you can use HTTP, FTP, or even MAILTO URLs here, since SOAP can run over any of these protocols. The uri directive identifies the service (in this case, the Spelling package) that we are calling on the server. Each server may have multiple services available. The on_fault part says how we are to handle errors; in this case, you simply say that $result is undefined.
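
For the curious, here is roughly what such a call looks like on the wire. This is a hand-built sketch of a SOAP 1.1-style request, not the XML that SOAP::Lite actually emits, which differs in details such as prefixes and type attributes. The helper names xml_escape and build_envelope, the m prefix, and the text element name are assumptions made for illustration. It shows how the uri value becomes the namespace of the method element, which is what lets a server tell services apart.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Build a SOAP 1.1-style envelope that calls $method in the namespace $uri.
# The element name "text" and the prefix "m" are illustrative choices.
sub build_envelope {
    my ($uri, $method, $text) = @_;
    my $escaped = xml_escape($text);
    return <<"XML";
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <m:$method xmlns:m="$uri">
      <text>$escaped</text>
    </m:$method>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
XML
}

my $envelope = build_envelope('http://localhost/Spelling', 'check', 'teh <b>cat</b>');
print $envelope;
```

With HTTP as the transport, a document like this would be sent as the body of a POST to the proxy address.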

Once you have made the call, you call result() on the returned object to get the actual result of the function call. This can be more automated, using autodispatch, though I haven't done that here. According to the documentation, the call can end up looking like a normal Perl call.

Conclusion

As an added bonus, thanks to SOAP, I am using XML now on my web server without knowing anything (and caring less) about XML, schemas, XSLT, XPath, XLink, or any of the other technologies I've been avoiding. SOAP is hiding everything, and it can truly be said that SOAP results in cleaner code.

You can see a demo of the server/client discussed in this article at http://www.crazyguyonabike.com/spellcheck/. There is a simple form that lets you type some text and see the unrecognized words marked in red.

DDJ

Listing One

#!/usr/local/bin/perl -w
use strict;
use SOAP::Transport::HTTP;
use Lingua::Ispell;

# Create the daemon on localhost:81, keep the Spelling package
# persistent between calls, and dispatch incoming calls to it
my $daemon = SOAP::Transport::HTTP::Daemon
    -> new (LocalAddr => 'localhost', LocalPort => 81)
    -> objects_by_reference(qw(Spelling))
    -> dispatch_to('Spelling');
print "Contact to SOAP server at ", $daemon->url, "\n";
$daemon->handle;

package Spelling;
sub check
{
    my ($self, $text) = @_;
    $Lingua::Ispell::path = "/usr/bin/ispell";
    my $dummy = '__TERM__';
    for my $word (Lingua::Ispell::spellcheck ($text))
    {
        if ($word->{type} eq 'miss' )
        {
            # Quote the term so it is matched literally in the patterns
            my $term = quotemeta $word->{term};

            # First replace any terms found inside HTML or Embperl
            # brackets with dummy term
            $text =~ s{([\<\[])([^\>\]]*?)(\b$term\b)}{$1$2$dummy}g;

            # Now mark up the remaining terms with red font
            $text =~ s{(^|[^\>\&])(\b$term\b)}{$1<FONT COLOR="red">$2</FONT>}g;

            # Restore the dummy terms
            $text =~ s{$dummy}{$word->{term}}g;
        }
    }
    return $text;
}
1;

Listing Two

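sub check_spelling
{
        my ($self, $textref) = @_;

        # Check spelling via the SOAP spelling server
        use SOAP::Lite;
        my $result;
        my $soap = SOAP::Lite
           -> uri('http://localhost/Spelling')   # service (package) to call
           -> proxy('http://localhost:81')       # address and port of the daemon
           -> on_fault (sub {undef $result});    # on error, leave $result undefined
        $result = $soap->check($$textref);

        # If the call failed, leave the text unchanged
        $$textref = $result ? $result->result() : $$textref;
}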
