Grokking Web Archives


March, 2004: Grokking Web Archives

brian has been a Perl user since 1994. He is founder of the first Perl Users Group, NY.pm, and Perl Mongers, the Perl advocacy organization. He has been teaching Perl through Stonehenge Consulting for the past five years, and has been a featured speaker at The Perl Conference, Perl University, YAPC, COMDEX, and Builder.com. Contact brian at comdog@panix.com.


I am in the middle of the desert in Iraq, but even here, in the modern army, I get to use the Internet sometimes; and with portable diesel generators, I have power to run my laptop. I wonder what this used to be like before Microsoft Word, upon which our military operations seem utterly dependent. Even with all of our fancy equipment, surfing the Web is a challenge both in the limited bandwidth and available time.

When I get a chance to get on the Internet, I have to use a government-approved computer, which means I have to use whatever they happen to have, and it is always some form of Windows. I plan my computer time carefully and work my way through my list as quickly as I can so I can do as much as possible in my limited time, since there is always a line for the computers. I do not stop to read long web pages, for instance. I save them to a floppy disk, then read them later. If I didn't, I would use up all of my time just trying to get through the first page of Damian Conway's Exegesis 6.

Since all of these computers, no matter which part of Iraq I may be in, run Windows, I have to use Internet Explorer, which allows me to save web pages as Text, HTML, Complete Web Page, or a Web Archive. Unfamiliar with these choices, I saved a bunch of pages as Web Archives. I expected it to create a bunch of files just like Mozilla does, but instead, I got a single file. This turned out to be a good thing since I had an easier time copying one file than a file and a directory, so I did not have to think about it much. However, Internet Explorer on my Powerbook did not know what to do with it, although it thought about it for a long time before giving up.

Perl to the rescue! I looked at the file in BBEdit and saw that it was plain text laid out like a multipart HTTP message (strictly speaking, it is a MIME multipart message, which uses the same header-and-body layout). I initially reached for HTTP::Response to handle this, but its interface does not have anything for breaking apart a multipart message. I could use it to munge the web archive, then extract the HTTP message body so I could split it up to get the various parts, then create HTTP::Response objects out of that, but I did not want to do that much work. In this case, a module is more cost than benefit.

I need to do three things to be able to read the web archives. First, I need to extract each part. Next, I need to decode the data to put them in their natural states; and once I have all of the pieces, I need to ensure that the browser finds them all. That last part is the trickiest—I need to make sure the original HTML text tells my browser to look for image files locally instead of at their original location.

To do that, I need to know a little about HTTP messages. They come in two parts, a header and a message body, separated by a blank line. In this case, since it comes from Windows, each line ends with the bytes \x0D\x0A, a carriage return followed by a line feed, and a blank line contains nothing else. I cannot rely on the \r and \n representations because the Mac has an odd notion about what those are; the perlport manual page explains it. As long as I use the literal representation, I am okay. Two of these in a row (one for the previous line end and one for the blank line) show me where the header ends.
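
A minimal sketch of finding that header boundary, using a made-up message rather than a real archive. The sample text and variable names are assumptions for illustration; only the literal \x0D\x0A byte values come from the discussion above.

```perl
use strict;
use warnings;

# A made-up message with Windows line endings, written as literal bytes
# so the result does not depend on what "\r" and "\n" mean locally.
my $crlf    = "\x0D\x0A";
my $message = join $crlf,
    'Content-Type: text/html',
    'Content-Location: http://www.example.com/index.html',
    '',                         # the blank line that ends the header
    '<HTML><BODY>Hello</BODY></HTML>';

# Two line endings in a row mark the end of the header.
my $end_of_header = index $message, $crlf x 2;
die "No blank line found\n" if $end_of_header < 0;

my $header = substr $message, 0, $end_of_header;
my $body   = substr $message, $end_of_header + length( $crlf x 2 );

print "Header:\n$header\n\nBody:\n$body\n";
```

The sketch uses index() and substr() only to make the byte positions visible; a split() on the same pair of line endings reaches the same result.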

I also know that a multipart HTTP message has a boundary string. This string signals the start of a new part, and the particular string shows up in the main header as part of the Content-Type header:

From: <Saved by Microsoft Internet Explorer 5>
Subject: Example Web Page
Date: Mon, 27 Oct 2003 11:58:12 +0300
Content-Type: multipart/related;
	boundary="----=_NextPart_000_0000_01C39C81.9921A8A0";
	type="text/html"

The header for the entire web archive gives me the boundary string ----=_NextPart_000_0000_01C39C81.9921A8A0, which is more complicated than it needs to be. Perl will not care though, so it does not matter.

After the header comes the blank line, then the message body. The message body contains all of the parts of the web archive and each part has the boundary string in front of it. The data after the boundary string is another HTTP message, so I have to extract the data the same way I just did for the web archive message body:

------=_NextPart_000_0000_01C39C81.9921A8A0
Content-Type: text/html;
	charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://www.example.com/index.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Example Web Page</TITLE>
<META http-equiv=3DContent-Type content=3D"text/html; =

------=_NextPart_000_0000_01C39C81.9921A8A0
Content-Type: image/gif
Content-Transfer-Encoding: base64
Content-Location: http://www.example.com/image.gif

R0lGODlhFwAXAPcAAAAAAAoKChQUFDExMTc3Nz09PUJCQkxMTFBQUFZWVlpaWmRkZGZmZm1tbXV1

I can see the parts I want and I can see how I need to process them. The HTML text is quoted-printable: special characters are turned into their hexadecimal representations preceded by an equal sign. For instance, the equal sign itself becomes the literal sequence =3D, and HTML text has a lot of equal signs. Additionally, long lines are wrapped, and each wrapped line ends with a bare equal sign. Somewhere, this format must have solved a problem (or, at least, I hope it did). The images are much easier to process since I just base64-decode them.
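
To get a feel for the two encodings, here is a small sketch that uses the MIME::QuotedPrint and MIME::Base64 modules that ship with Perl. It is an illustration only, not the approach Listing 1 takes for the HTML text, and the sample strings are assumptions.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(encode_qp decode_qp);
use MIME::Base64      qw(decode_base64);

# Quoted-printable: each "=" in the text is written as "=3D".
my $html    = '<META http-equiv=Content-Type content="text/html">';
my $encoded = encode_qp( $html );
print $encoded, "\n";

# Decoding restores the original text.
my $decoded_html = decode_qp( $encoded );

# Base64: the first eight characters of the sample image data
# decode to the six-byte signature at the start of a GIF file.
my $signature = decode_base64( 'R0lGODlh' );
print "$signature\n";
```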

As I took all of this research and turned it into a program, I went through two major designs. The first just dumped all of the files into the current directory. The second mirrors the "Complete Web Page" option by creating a directory for the images and supporting files. The first program was too sloppy (it created a lot of files in the current directory), so in this article, I only show the second program (see Listing 1).

Lines 1-2 start my program in the usual manner, with warnings on and the strict pragma in effect so I do not get sloppy.

On line 4 I pull in the MIME::Base64 module since I will use its decode_base64() function to process the image data. That is as complicated as that will get.

On line 6, I start a do {} block to grab file data. The do {} block creates a scope so I can temporarily affect special variables and limit the lifetime of some temporary variables. I set $/, the input record separator (the fancy way of saying line ending) to undef so I can slurp in the file in one shot. I open the file I specify in the first command-line argument, $ARGV[0], and make sure I was able to do that, die()-ing otherwise. Finally, I read a line—the entire file because $/ is undef—and since that is the last evaluated statement, it becomes the return value of the do {} block, which I assign to $archive.
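
The slurp idiom can be tried in isolation with a throwaway file. This sketch assumes a temporary file created with File::Temp as a stand-in for the archive, and it uses the three-argument form of open, which differs slightly from Listing 1.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file to read back (a stand-in for the archive).
my( $out, $filename ) = tempfile( UNLINK => 1 );
print $out "first line\nsecond line\nthird line\n";
close $out;

my $contents = do {
    local $/;                   # undef inside this block only
    open my $fh, '<', $filename or die "Could not open $filename: $!\n";
    <$fh>;                      # last statement: the block's return value
    };

# Outside the block, $/ has its usual value again.
my $separator_restored = defined $/ ? 1 : 0;

print length( $contents ), " characters read in one go\n";
```

Because $fh is a lexical variable declared inside the block, the filehandle is closed automatically when the block ends.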

On line 12, I split $archive, the web archive HTTP message, into its header and message body and assign those to $header and $multipart, respectively. My split() regular expression uses the literal values of the line endings to avoid portability problems between Windows and my Mac, which are pernicious in this case. I also use split()'s optional third argument to specify that I want to end up only with two parts. I know other blank lines exist in the data, and I do not want to split() on those because I cannot ensure that they come after an HTTP header. They could be in the middle of HTML text, for instance.
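
The effect of split()'s third argument is easy to see on a made-up message whose body contains a blank line of its own. The message text here is an assumption for illustration.

```perl
use strict;
use warnings;

my $blank = "\x0D\x0A" x 2;     # a line ending plus an empty line

# A made-up message whose body happens to contain a blank line.
my $message = "Subject: Example" . $blank
            . "<P>one</P>"       . $blank
            . "<P>two</P>";

my @limited   = split /(?:\x0D\x0A){2}/, $message, 2;
my @unlimited = split /(?:\x0D\x0A){2}/, $message;

print scalar( @limited ),   " pieces with a limit of 2\n";
print scalar( @unlimited ), " pieces with no limit\n";
```

With the limit, everything after the first blank line stays together as the body, blank lines and all.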

On line 14, I extract the boundary string from the header with a simple regular expression. A match in list context returns, as a list, the parts of the regular expression I remembered in parentheses. I create the list context by putting the assignment list in parentheses. Once I have that string in $boundary, I use it in another split(), on line 16, to get each part of the web archive. Each part becomes an element of the array @parts. In that split() regular expression, I start with the \Q sequence to quote any special characters in the boundary string so Perl does not interpret them as regular expression metacharacters. The full stop, ".", shows up in the boundary string, for instance.
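
A short sketch of both ideas, the match in list context and the \Q quoting. The boundary string here is made up and deliberately short, chosen so that its metacharacters cause a visible false match when they are not quoted.

```perl
use strict;
use warnings;

# A made-up header with a short, hypothetical boundary string.
my $header = qq{Content-Type: multipart/related;\x0D\x0A\tboundary="=_Part.1+2"};

# The parentheses on the left force list context, so the match returns
# what the capture group remembered rather than a true/false value.
my( $boundary ) = $header =~ m/boundary="(.*?)"/;

my $lookalike = '=_PartX112';   # similar, but not the boundary

# Without \Q, "." matches any character and "+" means "one or more".
my $unquoted_matches = $lookalike =~ /$boundary/   ? 1 : 0;
my $quoted_matches   = $lookalike =~ /\Q$boundary/ ? 1 : 0;

print "boundary: $boundary\n";
print "unquoted pattern matches lookalike: $unquoted_matches\n";
print "quoted pattern matches lookalike:   $quoted_matches\n";
```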

I have to play with @parts before I go on. The first element is whatever comes before the first boundary string in the message body, which is not part of the page. The message body ends with the boundary string followed by two hyphens, so split() leaves that terminator as the last element of the array. I shift() off the first element and pop() off the last so @parts only contains the elements I need to recreate the web page.
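
The following sketch shows what lands at each end of the list when a multipart body is split on its boundary. The body, the preamble text, and the short boundary XYZ are all made up for illustration.

```perl
use strict;
use warnings;

my $crlf     = "\x0D\x0A";
my $boundary = 'XYZ';           # made-up; real ones are much longer

# A made-up multipart body: preamble, two parts, closing delimiter.
my $multipart = join $crlf,
    'This is a multi-part message.',
    '',
    "--$boundary",
    'Content-Type: text/html',
    '',
    '<P>page</P>',
    "--$boundary",
    'Content-Type: image/gif',
    '',
    'R0lGODlh',
    "--$boundary--",
    '';

my @parts = split /\Q$boundary/, $multipart;
my $first = shift @parts;       # whatever preceded the first boundary
my $last  = pop   @parts;       # whatever followed the last boundary

print scalar( @parts ), " parts left to process\n";
```

Each remaining part still ends with the two hyphens that introduce the next boundary line, which Listing 1 does not strip.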

On line 19, I call my own parse routine on the first element of @parts, which I shift() off the array. I expect that to be the HTML text of the page. The parse() routine starts on line 45 and is really just a wrapper for other routines. Its argument is the HTTP message for the part I am processing. First, parse() calls &divide to split the message into the header and message body. Since I call &divide with the ampersand and without parentheses, the divide() routine sees the argument list, @_, of parse(). I do not have to pass any arguments to divide() because Perl does it for me. Read more about this in the perlsub manual page.
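
The difference between the calling styles shows up in a few lines. The subroutine names below are hypothetical and exist only for this demonstration.

```perl
use strict;
use warnings;

sub show_args { join '|', @_ }

sub with_ampersand   { &show_args   }   # callee sees the caller's @_
sub with_parentheses { show_args()  }   # callee gets an empty list
sub ampersand_parens { &show_args() }   # parentheses win: empty list

my $implicit = with_ampersand(   'header', 'body' );
my $explicit = with_parentheses( 'header', 'body' );
my $both     = ampersand_parens( 'header', 'body' );

print "ampersand only:   '$implicit'\n";
print "parentheses only: '$explicit'\n";
print "both:             '$both'\n";
```

Only the bare ampersand form, with no parentheses, shares the caller's @_.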

Once divide() returns the header and the message body, I use my location() routine, on line 47, to extract the file name from the Content-Location header, and I use the encoding() routine, on line 48, to get the value of the Content-Transfer-Encoding header. I return those two items and the body for a total of three items as the return values for parse(). On line 19, I save those to their own variables.

I want to save the image files in a directory that has a name similar to that of the HTML text file, so I extract the file name portion (sans extension, that is) from $location and assign that to $directory. Again, a match in list context returns a list of the things remembered in parentheses. My intended directory name ends up in $directory and, in line 21, I give it a default value if, for some reason, the match failed. On line 22 I create the directory. If that fails, I catch it later, on line 37, when I try to write a file in that directory.
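
A sketch of how the directory name falls out of different file names. The helper directory_for() is hypothetical; it only wraps the match and the default from lines 20 and 21 so they can be tried on several inputs.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the match and the fallback.
sub directory_for {
    my( $location )  = @_;
    my( $directory ) = $location =~ m/(.*)\./;   # everything before the last dot
    $directory = "$location.d" unless $directory;
    return $directory;
    }

for my $name ( qw( index.html page.v2.html README ) ) {
    printf "%-14s => %s\n", $name, directory_for( $name );
    }
```

The greedy .* keeps everything up to the last dot, and a name with no dot at all falls through to the ".d" default.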

On line 24, I use my own unquote() routine, defined on line 50, to undo the quoted-printable encoding of the HTML text. On line 52, it first removes each bare equal sign at the end of a line along with the line ending that follows it. In the substitution on line 53, I use the substitution's e flag and a little bit of Perl code to determine what I want to use as replacement text. I take the hexadecimal representation, 3D, for instance, and turn that into the correct character. I use Perl's hex() function to ensure Perl turns the string 3D into the correct number, then Perl's chr() function to take that number and return the ASCII character.
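
Here are the same two substitutions applied to a single wrapped line. The sample text is made up, modeled on the fragment shown earlier in the archive.

```perl
use strict;
use warnings;

# A made-up quoted-printable fragment: "=3D" stands for "=", and the
# trailing "=" plus line ending is a soft line break.
my $text = qq{<META http-equiv=3DContent-Type content=3D"text/html; =\x0D\x0A}
         . qq{charset=3Dwindows-1252">};

$text =~ s/=\x0D\x0A//g;                     # rejoin wrapped lines first
$text =~ s/=([0-9A-F]{2})/chr hex $1/ge;     # then "=3D" becomes chr(0x3D)

print "$text\n";
```

The order matters: the soft line breaks have to go first, because afterwards a decoded equal sign would be indistinguishable from an encoded one.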

On line 25, I munge the HTML IMG tags to ensure the URLs point to the directory where I will store all of the images. I use the images' original file names but munge their paths.
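
To see what that substitution does, this sketch runs the pattern from line 25 over a made-up fragment of HTML. The directory name and the image URLs are assumptions.

```perl
use strict;
use warnings;

my $directory = 'index';
my $body = '<P><IMG alt="logo" src="http://www.example.com/images/logo.gif">'
         . '<img src="http://www.example.com/image.gif" width=23></P>';

# Keep whatever sits between "<img" and "src", drop the path up to the
# last slash, and point the file name at the local directory instead.
$body =~ s|<img(.*?)src="[^"]*/(.*?)"|<img$1src="$directory/$2"|isg;

print "$body\n";
```

The pattern only touches double-quoted src values that contain at least one slash, and it lowercases the tag name as a side effect, so pages written in other styles may need a looser pattern.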

On line 27, once I am done processing the HTML text, I save it to a file.

The rest of the script processes the remaining parts. For my purposes, those parts are images that are base64-encoded, but you may run into other things and have to add a bit of programming.

On line 31, I start a foreach loop that will go through each of the remaining parts. First, as I did before, I use the parse() routine to get the information about the part. On line 35, I use the decode_base64() function from MIME::Base64 to turn the body into its original data if the encoding is Base64. On line 37, I open a file in the directory that I created earlier and save the file there.
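
A sketch of decoding one part and writing it out, assuming a temporary directory from File::Temp in place of the real image directory. The binmode call is an addition that Listing 1 does not have; it matters when the script itself runs on Windows and is harmless elsewhere.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use File::Temp   qw(tempdir);
use File::Spec;

my $directory = tempdir( CLEANUP => 1 );    # stand-in for the image directory
my $encoding  = 'base64';
my $body      = 'R0lGODlh';                 # start of the sample image data

$body = decode_base64( $body ) if $encoding eq 'base64';

my $path = File::Spec->catfile( $directory, 'image.gif' );
open my $fh, '>', $path or die "Could not open $path: $!";
binmode $fh;                                # write bytes, not text
print $fh $body;
close $fh or die "Could not close $path: $!";

my $size = -s $path;
print "Wrote $size bytes to $path\n";
```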

Now, when I look in the current working directory, I should have an HTML text file and a directory with the supporting files. When I view the HTML file in my browser, the page looks just like it did when I was on the network.

TPJ



Listing 1

1   #!/usr/bin/perl -w
2   use strict;
3   
4   use MIME::Base64;
5   
6   my $archive = do {
7       local $/;
8       open my $fh, $ARGV[0] or die "Could not open $ARGV[0]: $!\n";
9       <$fh>;
10      };
11  
12  my( $header, $multipart ) = split /(?:\x0D\x0A){2}/, $archive, 2;
13  
14  my( $boundary ) = $header =~ m/boundary="(.*?)"/;
15  
16  my @parts = split /\Q$boundary/, $multipart;
17  shift @parts; pop @parts;
18  
19  my( $location, $encoding, $body ) = parse( shift @parts );
20  my( $directory ) = $location =~ m/(.*)\./;
21  $directory = "$location.d" unless $directory;
22  mkdir $directory;
23  
24  unquote( $body ) if $encoding eq 'quoted-printable';
25  $body =~ s|<img(.*?)src="[^"]*/(.*?)"|<img$1src="$directory/$2"|isg;
26  
27  open my $fh, "> $location" or die "Could not open $location: $!";
28  print $fh $body;
29  close $fh;
30  
31  foreach my $part ( @parts )
32      {
33      my( $location, $encoding, $body ) = parse( $part );
34      
35      $body = decode_base64( $body ) if $encoding eq 'base64';
36          
37      open my $fh, "> $directory/$location" 
38          or die "Could not open $location: $!";
39      print $fh $body;
40      }
41  
42      
43  sub divide   { ( split /(?:\x0D\x0A){2}/, $_[0], 2 )               }
44  
45  sub parse    { my( $h, $b ) = &divide; ( location( $h ), encoding( $h ), $b ) }
46   
47  sub location { ( $_[0] =~ m|Content-Location:.*/(\S+)|i )          }
48  sub encoding { ( $_[0] =~ m|Content-Transfer-Encoding:\s*(\S+)|i ) }
49  
50  sub unquote  
51      { 
52      $_[0] =~ s/=\x0D\x0A//g; 
53      $_[0] =~ s<=([0-9A-F]{2})>{
54              chr hex $1
55              }ge;
56      } 