Parsing MIME & HTML


The Perl Journal February 2003

By Luis E. Muñoz

Luis is an Open Source and Perl advocate at a nationwide ISP in Venezuela. He can be contacted at luismunoz@cpan.org.


Understanding an e-mail message encoded with MIME can be very difficult. There are many options and many different ways to do the actual encoding. Add to that the sometimes too-liberal interpretations of the relevant RFCs by the e-mail client designers, and you will begin to get the idea. In this article, I will show you how this task can be laughably simple, thanks to Perl's extensive bag of tricks, CPAN.

I started out with a simple and straightforward mission: Fetch an e-mail from a POP mailbox and display it in a 7-bit, text-only capable device. This article describes the different stages for a simple tool that accomplishes this task, written in Perl with a lot of help from CPAN modules. I hope this is useful to other Perl folks who might have a similar mission.

Let's discuss each part of this task, in turn, as we read through mfetch, the script I prepared as an example. The script appears in its entirety in Listing 1.

Setting Up the Script

The first thing is to load all the necessary modules. You should be familiar with strict and warnings. We'll see how to use the rest of the modules a bit later.

 1   #!/usr/bin/perl
 2   
 3   # This script is (c) 2002 Luis E. Muñoz, All
     # Rights Reserved
 4   # This code can be used under the same terms as
     # Perl itself. It comes with absolutely
 5   # NO WARRANTY. Use at your own risk.
 6   
 7   use strict;
 8   use warnings;
 9   use IO::File;
10   use Net::POP3;
11   use NetAddr::IP;
12   use Getopt::Std;
13   use MIME::Parser;
14   use HTML::Parser;
15   use Unicode::Map8;
16   use MIME::WordDecoder;
17   
18   use vars qw($opt_s $opt_u $opt_p $opt_m $wd $e $map);
19   
20   getopts('s:u:p:m:');
21   
22   usage_die("-s server is required\n") unless $opt_s;
23   usage_die("-u username is required\n") unless $opt_u;
24   usage_die("-p password is required\n") unless $opt_p;
25   usage_die("-m message is required\n") unless $opt_m;
26   
27   $opt_s = NetAddr::IP->new($opt_s)
28   or die "Cannot make sense of given server\n";

Note lines 27 and 28, where I use NetAddr::IP to convert whatever the user gave us through the -s option into an IP address. This is a very common use of this module, as its new() method will convert many common IP notations into an object from which an IP address can later be extracted. It will even perform a name resolution if required. So far, everything should look familiar.

It is worth noting that the error handling in lines 22-25 is not a brilliant example of good coding or documentation. It is much better to write your script's documentation in POD, and to use a module such as Pod::Usage to provide useful error messages to the user. At the very least, try to provide an informative usage message. You can see the usage_die() function at the end of Listing 1.
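One way to follow that advice is sketched below. It assumes the same four options as mfetch; the helper names missing_options(), parse_options(), and check_or_usage() and the POD text are illustrative and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;
use Pod::Usage;

=head1 SYNOPSIS

mfetch -s pop-server -u pop-user -p pop-password -m msgnum

=cut

# Hypothetical helper: names of the required options that are absent.
sub missing_options {
    my ($opts, @required) = @_;
    return grep { !defined $opts->{$_} } @required;
}

# Parse an argument list into a hash without touching the caller's @ARGV.
sub parse_options {
    local @ARGV = @_;
    my %opts;
    getopts('s:u:p:m:', \%opts) or return;
    return \%opts;
}

# Report problems through the SYNOPSIS section of the POD above.
sub check_or_usage {
    my $opts = parse_options(@_)
        or pod2usage(-exitval => 2);
    my @missing = missing_options($opts, qw(s u p m));
    pod2usage(-message => "Missing option(s): @missing",
              -exitval => 2) if @missing;
    return $opts;
}

my $opts = check_or_usage(qw(-s pop.foo.bar -u myself -p secret -m 5));
print "Fetching message $opts->{m} from $opts->{s}\n";
```

In a real script, the last two lines would pass @ARGV instead of a fixed list. Keeping the usage text in POD means the documentation and the error message cannot drift apart.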

Fetching a Message Via POP3

On to deeper waters. The first step in parsing a message is getting at the message itself. For this, I'll use Net::POP3, which implements the POP3 protocol described in RFC-1939. This is all done in Example 1.

At line 30, a connection to the POP server is attempted. This is a TCP connection, in this case to port 110. If this connection succeeds, the USER and PASS commands are issued at line 33, which are the simplest form of authentication supported by the POP protocol. Your username and password are being sent here through the network without the protection of cryptography, so a bit of caution is in order.

Net::POP3 supports many operations defined in the POP protocol that allow for more complex operations, such as fetching a list of messages, unseen messages, and the like. It can also fetch messages for us in a variety of ways. Since I want this script to be as lightweight as possible (i.e., to burn as little memory as possible), I want to fetch the message to a temporary on-disk file. The temporary file is nicely provided by the new_tmpfile method of IO::File in line 36, which returns a file handle to a deleted file. I can work on this file, which will magically disappear when the script is finished.

Later, I instruct the Net::POP3 object to fetch the required message from the mail server and write it to the supplied file handle using the get method on line 39. After this, the connection is terminated gracefully by invoking quit and destroying the object. Destroying the object guarantees that the TCP connection with the server is closed, so the resources held by the POP server are freed as soon as possible, which is good practice for network clients. Note that in line 45, I rewind the file so that the fetched message can be read back by subsequent code.
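The temporary-file handling can be tried on its own, without a POP server. The following is a minimal sketch in which a hard-coded string stands in for the message that get() would write; the sample text is an assumption made for illustration.

```perl
use strict;
use warnings;
use IO::File;   # also exports SEEK_SET

# Anonymous temporary file: opened read/write, removed automatically.
my $fh = IO::File->new_tmpfile
    or die "Cannot create temporary file: $!\n";

# Stand-in for the message that Net::POP3's get() would write.
print $fh "Subject: test\n\nThis is a boring and plain message.\n";

# Without this rewind, the next read would start at end-of-file.
$fh->seek(0, SEEK_SET)
    or die "Cannot rewind temporary file: $!\n";

my @lines = <$fh>;
print "Read back ", scalar(@lines), " lines\n";
```

Commenting out the seek() call should make the read return no lines at all, which is the failure the rewind guards against.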

The interaction of mfetch with the POP server is very simple. Net::POP3 provides a very complete implementation of the protocol, and allows for much more sophisticated applications—I'm only using a small fraction of its potential here.

For this particular example, we could also have used Net::POP3Client, which provides a somewhat similar interface. The code would have looked more or less like the following fragment:

my $pops = new Net::POP3Client(USER => $opt_u, 
               PASSWORD => $opt_p, HOST => $opt_s->addr)
    or die "Error connecting or logging in: $!\n";

my $fh = new_tmpfile IO::File
    or die "Cannot create temporary file: $!\n";

$pops->HeadAndBodyToFile($fh, $opt_m)
    or die "Cannot fetch message: $!\n";

$pops->Close();

Parsing the MIME Structure

Just as e-mail travels inside a sort of envelope (the headers), complex messages that include attachments (and generally, HTML messages) travel within a collection of MIME entities. You can think of these entities as containers that can transfer any kind of binary information through the e-mail infrastructure, which in general does not know how to deal with 8-bit data. The code in Example 2 takes care of parsing this MIME structure.

Perl has a wonderful class that provides the ability to understand this MIME encapsulation, returning a nice hierarchy of objects that represent the message. You access this facility through the MIME::Parser class, which is part of the MIME-Tools bundle. MIME::Parser returns a hierarchy of MIME::Entity representing your message. The parser is smart; if you pass it a non-MIME e-mail, it will be returned to you as a text/plain entity.

MIME::Parser can be tweaked in many ways, as its documentation will tell you. One thing that can be tweaked is the decoding process. For my purposes, I needed to be as light on memory usage as possible. The default behavior of MIME::Parser involves the use of temporary files for decoding of the message. These temporary files can be spared and core memory used instead by invoking output_to_core(). Before doing this, note all the caveats cited in the module's documentation. The most important one is that if a 100-MB file ends up in your inbox, this whole thing needs to be slurped into RAM.
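As a rough illustration of in-core parsing, the sketch below feeds a tiny hand-written message to the parser through parse_data(). It assumes MIME-tools is installed; the message text is made up for the example.

```perl
use strict;
use warnings;
use MIME::Parser;   # from the MIME-tools distribution on CPAN

# A made-up, non-MIME message used only for this illustration.
my $raw = join "\n",
    'From: root <root@foo.bar>',
    'To: myself@foo.bar',
    'Subject: This is the plain subject',
    '',
    'This is a boring and plain message.',
    '';

my $mp = MIME::Parser->new;
$mp->output_to_core(1);   # keep decoded bodies in memory, not in files

my $entity = $mp->parse_data($raw);

my $type = $entity->head->mime_type;        # effective MIME type
my $body = $entity->bodyhandle->as_string;  # decoded body text

print "Type: $type\n";
print "Body: $body";
```

Because nothing is written to disk here, there are no temporary files to purge afterwards. The trade-off is the one described above: the whole decoded body lives in memory.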

In line 47 (Example 2), I create the parser object. The call to ignore_errors() in line 48 is an attempt to make this parser as tolerant as possible. The call to extract_uuencode() on line 49 makes the parser automatically recognize pieces of the e-mail that are uuencoded and translate them back into a more readable form.

The actual request to parse the message, available through reading the $fh filehandle, is in line 51. Note that it is enclosed in an eval block. I have to do this as the parser might throw an exception if certain errors are encountered. The eval allows me to catch this exception and react in a way that is sensible. In this case, I want to be sure that any temporary file created by the parsing process is cleared by a call to purge(), as seen in lines 56 and 57.

Setting Up the HTML Parser

Parsing HTML can be a tricky and tedious task. Thankfully, Perl has a number of nice ways to help you do this job. Several excellent books, such as The Perl Cookbook (O'Reilly, 1998), had a couple of examples that came very close to what I needed, especially recipe 20.5, "Converting HTML to ASCII," which I reproduce below.

use HTML::TreeBuilder;
use HTML::FormatText;

$html = HTML::TreeBuilder->new();
$html->parse($document);

$formatter = HTML::FormatText->new(leftmargin => 0,
                                   rightmargin => 50);

$ascii = $formatter->format($html);

I did not want to use this recipe for two reasons: I needed fine-grained control over the HTML-to-ASCII conversion, and I wanted to have as little impact as possible on resources. I wrote a small benchmark (available electronically from www.tpj.com/) that compares the performance of the two options while parsing a copy of one of my web articles. The following result shows that the custom parser runs faster than the Cookbook's recipe. This does not mean that the recipe or the modules it uses are bad. It simply means that the recipe does a lot of additional work that happens not to be useful for this particular task.

bash-2.05a$ ./mbench
Benchmark: timing 100 iterations of Cookbook's, Custom...
Cookbook's: 73 wallclock secs (52.82 usr +  0.00 sys =
                               52.82 CPU) @  1.89/s (n=100)
Custom:  1 wallclock secs ( 1.17 usr +  0.00 sys = 
                            1.17 CPU) @ 85.47/s (n=100)
             Rate Cookbook's     Custom
Cookbook's 1.89/s         --       -98%
Custom     85.5/s      4415%         --
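A comparison table like this one can be produced with Perl's standard Benchmark module. The sketch below is not the mbench script; the two routines are hypothetical stand-ins (a regex and a character loop that both drop anything between angle brackets), chosen only so the harness has something to time. Neither is a real HTML parser.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

my $html = '<p>Hello <b>world</b></p>' x 50;

# Two hypothetical stand-ins for the routines being compared.
sub strip_with_regex {
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

sub strip_with_loop {
    my ($text, $in_tag) = ('', 0);
    for my $ch (split //, $html) {
        if    ($ch eq '<') { $in_tag = 1 }
        elsif ($ch eq '>') { $in_tag = 0 }
        elsif (!$in_tag)   { $text .= $ch }
    }
    return $text;
}

# 'none' silences the per-test report; cmpthese() prints the comparison table.
my $results = timethese(500, {
    Regex => \&strip_with_regex,
    Loop  => \&strip_with_loop,
}, 'none');
cmpthese($results);
```

The absolute numbers depend on the machine, so only the relative rates in the table are worth reading.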

HTML::FormatText does a great job of converting the HTML to plain text. Unfortunately, I have a set of guidelines that I need to follow in the conversion that are not compatible with the output of this module. Additionally, HTML::TreeBuilder does an excellent job of parsing an HTML document, but produces an intermediate structure—the parse tree—which, in my case, wastes resources.

However, Perl has an excellent HTML parser in the HTML::Parser module. In this case, I chose to use this class to implement an event-driven parser, where tokens (syntactic elements) in the source document cause the parser to call functions I provide. This allowed me complete control over the translation while sparing the intermediate data structure.

Converting HTML to text is a lossy transformation. This means that what comes out of this transformation is not exactly equivalent to what went in. Pictures, text layout, style, and a few other information elements are lost. My needs required that I note the existence of images and produce a reasonably accurate rendition of the page's text, but nothing else. Remember that the target device can only display 7-bit text, and this is within a very small and limited display. This piece of code sets up the parser to do what I need:

62  my $parser = HTML::Parser->new
63      (
64       api_version => 3,
65       default_h => [ "" ],
66       start_h   => [ sub { print "[IMG ", 
67             d($_[1]->{alt}) ||  $_[1]->{src},"]\n" 
68               if $_[0] eq 'img';
69                           }, "tagname, attr" ],
70       text_h    => [ sub { print d(shift); }, "dtext" ],
71       ) or die "Cannot create HTML parser\n";
72  
73  $parser->ignore_elements(qw(script style));
74  $parser->strict_comment(1);

Starting on line 62, I set up the HTML::Parser object that will help me do this. First, I tell it I want to use the latest (as of this writing) interface style, which provides more flexibility than earlier interfaces. On line 65, I tell the object that by default, no parse events should do anything. There are other ways to say this, but the one shown is the most efficient.

Lines 66 through 69 define a handler for the start events. This handler will be called each time an opening tag such as <A> or <IMG> is recognized in the source being parsed. Handlers are specified as a reference to an array whose first element tells the parser what to do and whose second element tells it what information to pass to the code. In this example, I supply a function that for any IMG tag will output descriptive text composed with either the ALT or the SRC attribute. I request this handler to be called with the name of the tag as the first argument and a reference to a hash of the tag's attributes as the second, through the string "tagname, attr" found in line 69. The d() function will be explained a bit later; it has to do with decoding its argument.

The text event will be triggered by any text found between tags in the input. I've set up a simpler handler for this event that merely prints out whatever is recognized. I also request that HTML entities such as &euro; or &ntilde; be decoded for me through the string "dtext" on line 70. HTML entities are used to represent special characters outside the traditional ASCII range. In the interest of document accuracy, you should always use entities instead of directly placing 8-bit characters in the text.

Some syntactic elements are used to enclose information that is not important for this application, such as <style>...</style> and <script>...</script>. I ask the parser to ignore those elements with the call to ignore_elements() at line 73. I also request the parser to follow strict comment syntax through the call to strict_comment() on line 74.
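To see the handlers at work in isolation, here is a small sketch along the same lines. It assumes HTML::Parser is installed, collects the output in a string instead of printing it, and leaves out the d() recoding step; the sample markup is invented for the example.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module; version 3 API assumed

my $out = '';       # collected text, instead of printing directly

my $p = HTML::Parser->new(
    api_version => 3,
    # Called for every opening tag: tag name, then a hash ref of attributes.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        return unless $tag eq 'img';
        $out .= '[IMG ' . ($attr->{alt} || $attr->{src} || '') . ']';
    }, 'tagname, attr' ],
    # Called for text between tags, with entities already decoded.
    text_h  => [ sub { $out .= shift }, 'dtext' ],
);
$p->ignore_elements(qw(script style));

# Sample markup invented for this example.
$p->parse('<p>Fish &amp; chips <img src="a.png" alt="menu"></p>'
        . '<script>alert("ignored")</script>');
$p->eof;   # flush any text the parser is still holding

print "$out\n";
```

Collecting into a string rather than printing makes the conversion easy to check, at the cost of holding the converted text in memory.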

Setting Up the Unicode Mappings

MIME defines various ways to encode binary data depending on the frequency of octets greater than 127. With relatively few high-bit octets, Quoted-Printable encoding is used. When many high-bit octets are present, Base-64 encoding is used instead. The reason is that Quoted-Printable is slightly more readable but very inefficient in space, since every high-bit octet becomes three characters, while Base-64 is completely unreadable by standard humans but adds a fixed overhead of roughly one third to the size of encoded files. Often, message headers such as the sender's name are encoded using Quoted-Printable when they contain characters such as "ñ". These headers look like "From: =?ISO-8859-1?Q?Luis_Mu=F1oz?= <some@body.org>" and should be converted to "From: Luis Muñoz <some@body.org>". In plain English, Quoted-Printable encoding is being used to make the extended ISO-8859-1 characters acceptable for any 7-bit transport, such as e-mail. In case you were wondering about all this fuss, many contemporary mail transport agents can properly handle message bodies that contain high-bit octets but will choke on headers with binary data.
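The size trade-off between the two encodings is easy to measure with the MIME::QuotedPrint and MIME::Base64 modules that ship with Perl. The sketch below is only an illustration; the helper name encoded_sizes() and the two sample strings are assumptions.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(encode_qp);
use MIME::Base64      qw(encode_base64);

# Encode a byte string both ways; the empty second argument
# suppresses line breaks so that only the encoding itself is measured.
sub encoded_sizes {
    my ($bytes) = @_;
    return (length(encode_qp($bytes, '')),
            length(encode_base64($bytes, '')));
}

my $mostly_ascii = "Luis Mu\xF1oz";   # one high-bit octet in ten
my $mostly_high  = "\xF1" x 30;       # nothing but high-bit octets

for my $sample ($mostly_ascii, $mostly_high) {
    my ($qp, $b64) = encoded_sizes($sample);
    printf "%2d octets: Quoted-Printable %2d, Base-64 %2d\n",
        length($sample), $qp, $b64;
}
```

For the name, Quoted-Printable is the shorter of the two (12 characters against 16). For the all-high-bit sample the order reverses (90 against 40).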

Lines 92 through 101 define setup_decoder(), which can use the headers contained in a MIME::Head object to set up a suitable decoder based on the MIME::WordDecoder class. This will translate instances of Quoted-Printable text to their high-bit equivalents. Note that I selected ISO-8859-1 as the default when no proper character set can be identified. This was a sensible choice for me, as ISO-8859-1 includes the characters used in Spanish, and Spanish happens to be my native language.

 92   sub setup_decoder
 93   {
 94       my $head = shift;
 95       if ($head->get('Content-Type')
 96           and $head->get('Content-Type') =~ m!charset="([^\"]+)"!)
 97       {
 98           $wd = supported MIME::WordDecoder uc $1;
 99       }
100       $wd = supported MIME::WordDecoder "ISO-8859-1" unless $wd;
101   }
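To watch an encoded header turn back into readable text without installing anything, the core Encode module offers a "MIME-Header" encoding. This is a different tool from the MIME::WordDecoder class used in mfetch, shown here only as a sketch of the same transformation on the sample header quoted above.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw     = 'From: =?ISO-8859-1?Q?Luis_Mu=F1oz?= <some@body.org>';
my $decoded = decode('MIME-Header', $raw);

# $decoded is now a Perl character string; encode it for the terminal.
binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```

The decoded value still contains an 8-bit character (U+00F1), so it is not yet suitable for a 7-bit display.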

But this decoding is not enough on its own. Having recovered the original high-bit characters, I must still recode them into something usable by the 7-bit display device. So in line 76, I set up a mapping based on Unicode::Map8. This module can convert characters from 8-bit character sets such as ISO-8859-1 into wider characters (Unicode) and then back into our chosen representation, ASCII, which only defines 7-bit characters. This means that any character that cannot be properly represented will be lost, which is acceptable for our application.

76    $map = Unicode::Map8->new('ASCII')
77        or die "Cannot create character map\n";

The decoding and character mapping is then brought together at line 90, where I define the d() function, which simply invokes the appropriate MIME decoding method, transforms the resulting string into Unicode via the to16() method, and then transforms it back into ASCII using to8() to ensure printable results on our device. Since I am allergic to warnings related to undef values, I make sure that decode() always gets a defined string to work with.

90    sub d { $map->to8($map->to16($wd->decode (shift||''))); }

As you might notice if you try this code, the conversion is again lossy because there are characters that do not exist in ASCII. You can experiment with the addpair() method of Unicode::Map8 in order to add custom character transformations (for instance, 'ñ' might be 'n'). Another way to achieve this is by deriving a class from Unicode::Map8 and implementing the unmapped_to8 method to supply your own interpretation of the missing characters. Take a look at the module's documentation for more information.
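If Unicode::Map8 is not at hand, a similar 'ñ' to 'n' fallback can be approximated with the core Unicode::Normalize module. This is a different technique from the one mfetch uses: the sketch decomposes accented characters, drops the accents, and replaces whatever is left outside ASCII with a question mark. The function name to_ascii() is an assumption.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Reduce a character string to 7-bit ASCII.
sub to_ascii {
    my ($text) = @_;
    $text = NFD($text);              # split U+00F1 into "n" + combining tilde
    $text =~ s/\p{Mn}//g;            # drop the combining marks
    $text =~ s/[^\x00-\x7F]/?/g;     # flag anything still outside ASCII
    return $text;
}

print to_ascii("Luis Mu\x{F1}oz"), "\n";
print to_ascii("Price: \x{20AC}5"), "\n";
```

The first call keeps the name readable as "Luis Munoz". The euro sign in the second has no ASCII base letter, so it comes out as "?".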

Starting the Decode Process

With all the pieces in place, all that's left is to traverse the hierarchy of entities that MIME::Parser provides after parsing a message. I implemented a very simple recursive function, decode_entities, shown in Example 3.

The condition at line 107 asks if this part or entity contains other parts. If it does, it extracts them and invokes itself recursively to process each subpart at line 109.

If this part is a leaf, its body is processed. Line 111 gets it as a MIME::Body object. On line 115, I set up a decoder for this part's encoding. Based on the type of this part, which is determined at line 113, the code on lines 117 to 122 calls the proper handlers.

In order to fire off the decoding process, I call decode_entities() with the result of the MIME decoding of the message on line 86. This will invoke the HTML parser when needed and, in general, produce the output I look for in this example. After this processing is done, I make sure to wipe temporary files created by MIME::Parser on line 88. Note that if the message is not actually encoded with MIME, MIME::Parser will arrange for you to receive a single part of type text/plain that contains the whole message text, which is perfect for our application.

86  decode_entities($e);
87  
88  $mp->filer->purge;
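The recursive walk can be observed on a small multipart message. The sketch below assumes MIME-tools is installed and uses a hand-written two-part message; leaf_types() is a hypothetical helper that only reports the type of each leaf instead of decoding it.

```perl
use strict;
use warnings;
use MIME::Parser;

# A small multipart message invented for this example.
my $raw = join "\n",
    'From: root <root@foo.bar>',
    'Subject: Two alternatives',
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="XYZ"',
    '',
    '--XYZ',
    'Content-Type: text/plain',
    '',
    'Plain version',
    '--XYZ',
    'Content-Type: text/html',
    '',
    '<p>HTML version</p>',
    '--XYZ--',
    '';

# Collect the MIME type of every leaf, depth first.
sub leaf_types {
    my ($entity) = @_;
    my @parts = $entity->parts;
    return map { leaf_types($_) } @parts if @parts;
    return $entity->head->mime_type;
}

my $mp = MIME::Parser->new;
$mp->output_to_core(1);   # nothing is written to disk in this sketch
my $top = $mp->parse_data($raw);

my @types = leaf_types($top);
print "$_\n" for @types;
```

The same shape of function, with the return of the type replaced by a dispatch on it, is what decode_entities() does.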

Conclusion

With fewer than 130 lines of code, I can easily fetch and decode a message, as in the following example:

bash-2.05a$ ./mfetch -s pop.foo.bar -u myself \
            -p very_secure_password -m 5
Date: Sat, 28 Dec 2002 20:14:37 -0400
From: root <root@foo.bar>
To: myself@foo.bar
Subject: This is the plain subject

This is a boring and plain message.

More complex MIME messages can also be decoded. Look at Example 4, where I dissect a dreaded piece of junk mail. I edited it to spare you pages and pages of worthless image links.

That about covers the operation of the mfetch script. I hope you find it useful if you have a similar MIME or HTML decoding task to accomplish.

TPJ

Listing 1

#!/usr/bin/perl

# This script is (c) 2002 Luis E. Muñoz, All Rights Reserved
# This code can be used under the same terms as Perl itself. It comes
# with absolutely NO WARRANTY. Use at your own risk.

use strict;
use warnings;
use IO::File;
use Net::POP3;
use NetAddr::IP;
use Getopt::Std;
use MIME::Parser;
use HTML::Parser;
use Unicode::Map8;
use MIME::WordDecoder;

use vars qw($opt_s $opt_u $opt_p $opt_m $wd $e $map);

getopts('s:u:p:m:');

usage_die("-s server is required\n") unless $opt_s;
usage_die("-u username is required\n") unless $opt_u;
usage_die("-p password is required\n") unless $opt_p;
usage_die("-m message is required\n") unless $opt_m;

$opt_s = NetAddr::IP->new($opt_s)
    or die "Cannot make sense of given server\n";

my $pops = Net::POP3->new($opt_s->addr)
    or die "Failed to connect to POP3 server: $!\n";

$pops->login($opt_u, $opt_p)
    or die "Authentication failed\n";

my $fh = new_tmpfile IO::File
    or die "Cannot create temporary file: $!\n";

$pops->get($opt_m, $fh)
    or die "No such message $opt_m\n";

$pops->quit();
$pops = undef;

$fh->seek(0, SEEK_SET);

my $mp = new MIME::Parser;
$mp->ignore_errors(1);
$mp->extract_uuencode(1);

eval { $e = $mp->parse($fh); };
my $error = ($@ || $mp->last_error);

if ($error)
{
    $mp->filer->purge;      # Get rid of the temp files
    die "Error parsing the message: $error\n";
}

                # Setup the HTML parser

my $parser = HTML::Parser->new
    (
     api_version => 3,
     default_h   => [ "" ],
     start_h     => [ sub { print "[IMG ",
                                  d($_[1]->{alt}) || $_[1]->{src}, "]\n"
                                if $_[0] eq 'img';
                          }, "tagname, attr" ],
     text_h      => [ sub { print d(shift); }, "dtext" ],
     ) or die "Cannot create HTML parser\n";

$parser->ignore_elements(qw(script style));
$parser->strict_comment(1);

$map = Unicode::Map8->new('ASCII')
    or die "Cannot create character map\n";

setup_decoder($e->head);

print "Date: ", $e->head->get('date');
print "From: ", d($e->head->get('from'));
print "To: ", d($e->head->get('to'));
print "Subject: ", d($e->head->get('subject')), "\n";

decode_entities($e);

$mp->filer->purge;

sub d { $map->to8($map->to16($wd->decode(shift||''))); }

sub setup_decoder
{
    my $head = shift;
    if ($head->get('Content-Type')
        and $head->get('Content-Type') =~ m!charset="([^\"]+)"!)
    {
        $wd = supported MIME::WordDecoder uc $1;
    }
    $wd = supported MIME::WordDecoder "ISO-8859-1" unless $wd;
}

sub decode_entities
{
    my $ent = shift;

    if (my @parts = $ent->parts)
    {
        decode_entities($_) for @parts;
    }
    elsif (my $body = $ent->bodyhandle)
    {
        my $type = $ent->head->mime_type;

        setup_decoder($ent->head);

        if ($type eq 'text/plain')
        { print d($body->as_string); }
        elsif ($type eq 'text/html')
        { $parser->parse($body->as_string); }
        else
        { print "[Unhandled part of type $type]"; }
    }
}

sub usage_die
{
    my $msg = <<EOF;
Usage: mfetch -s pop-server -u pop-user -p pop-password -m msgnum
EOF
    $msg .= shift;
    die $msg, "\n";
}
