Scraping Yahoo Groups


February, 2004: Scraping Yahoo Groups

Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@simon-cozens.org.


You can learn a lot about a man from what he writes about. In my previous articles, ostensibly about Perl both here and at perl.com, I've talked about poker, Ruby, church work, Japanese literature, housemates, and linguistics. This month, I'm going to talk about obscure Japanese pop groups. No, really.

The great Pizzicato Five began life in 1984 as a four-piece band made up of Konishi Yasuharu, Takanami Keitaro, Ryo Kamamiya, and Sasaki Mamiko, and went through a series of line-up changes and hit albums. Eventually becoming a duo of Konishi Yasuharu and Nomiya Maki, they split up in March 2001. I got interested in Pizzicato Five (P5) in March 2001. Life is hard sometimes.

The P5 have always been more popular outside Japan than inside it, although "popular" is something to be taken relatively—there's a vibrant English-language discussion group, P5ML, at Yahoo Groups. Unfortunately, coming late into the P5 scene, I missed a lot of the initial mail. I'd love to have a copy of all the old messages in my mailbox, but I don't really want to have to go trolling through the Yahoo Groups web interface to get them.

Maybe Perl can help.

Grabbing Mails

The first port of call in cases like this is search.cpan.org, and asking it about "Yahoo Groups" points us directly to the WWW::Yahoo::Groups module. This is a module based on WWW::Mechanize, a tool we've looked at before for scraping web sites.

Once we've read, marked, and inwardly digested the documentation, we can start working on our program to download the mail:

use WWW::Yahoo::Groups;
my $scraper = WWW::Yahoo::Groups->new();
$scraper->login($username => $password);
$scraper->list("p5ml");

Now we have logged in, all being well, and selected our group. We're ready to see what mail is available to be downloaded. WWW::Yahoo::Groups gives us functions to tell us the first message in the archive and the last one.

my $lwm = $scraper->first_msg_id();
my $hwm = $scraper->last_msg_id();

These IDs form a sequence internal to Yahoo Groups, and are not actually related to the message ID specified in the e-mail. We could take a simple-minded approach to grabbing all the messages, and simply walk from the low water mark to the high water mark, downloading and storing each message in turn:

open OUT, ">> mail/p5ml" or die $!;
for ($lwm...$hwm) {
    print OUT $scraper->fetch_message($_)."\n";
}

However, there are two problems with this idea. The first is that it's dangerous—any mail arriving due to new posts to the P5ML mailing list will be clobbered by our overzealous output. The second is that it is wasteful—we may already have downloaded many of the messages in the archive. Let's take these problems one at a time.

Storing Mail

The first thing we want is a safe way of storing mail into a mailbox. We'll assume that we're going to be using the UNIX mbox format for the purposes of this article since, well, that's what I use. There are, as I've mentioned in the past, a large number of libraries for handling mail folders on CPAN, but we're going to use the simplest one, Email::LocalDelivery.

Its job is to take an e-mail from the "wire," and store it in some kind of file on disk. It's a very simple module with one method: deliver. It takes the e-mail as a plain string, which is quite handy because that's what WWW::Yahoo::Groups gives us. Hence, we can make our code a lot safer by rewriting it like so:

use Email::LocalDelivery;

for ($lwm...$hwm) {
    Email::LocalDelivery->deliver($scraper->fetch_message($_),
                                  "~/mail/p5ml");
}

This makes sure we don't overwrite incoming messages, but it doesn't stop us from pulling down messages we already have. To avoid that, we need to know which messages we've already got.
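To make the danger concrete, here is a minimal sketch of the kind of precaution any delivery code has to take when appending to a shared mbox file. It is not how Email::LocalDelivery is implemented; the helper name append_to_mbox, the envelope sender, and the choice of flock for locking are all assumptions made for illustration, and only core Perl is used.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);

# Hypothetical helper, not part of any CPAN module: append one message
# to an mbox file while holding an exclusive lock on it.
sub append_to_mbox {
    my ($path, $message) = @_;

    # mbox separates messages with a line starting "From ", so any line
    # in the message that begins that way has to be escaped with ">".
    (my $body = $message) =~ s/^(>*From )/>$1/mg;
    $body .= "\n" unless $body =~ /\n\z/;

    open my $fh, '>>', $path or die "Cannot open $path: $!";
    flock $fh, LOCK_EX       or die "Cannot lock $path: $!";
    seek $fh, 0, SEEK_END;   # someone may have appended while we waited

    # Separator line (assumed sender), the message, then a blank line.
    print {$fh} "From scraper\@localhost " . scalar(gmtime) . "\n",
                $body, "\n";
    close $fh or die "Cannot close $path: $!";
}
```

Calling append_to_mbox("mail/p5ml", $message) once per fetched message would then add to the folder without truncating it, provided every other program writing to the file honors the same kind of lock.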

Reading Folders

Again, there is a plethora of modules for dealing with mail folders, but the best one, which does nothing other than split a folder into separate mail messages, is Richard Clamp's Email::Folder:

use Email::Folder;
my $folder = Email::Folder->new("mail/p5ml");

Once we've opened the folder, we can get its messages as individual Email::Simple objects:

my @ids = map { $_->header("Message-ID") } $folder->messages;

Now we know what message IDs we have, but we're going to find it much more convenient to have that list as a hash rather than an array. Did that immediately jump out at you? Don't worry if it didn't; just like learning the idioms and constructions of a human language, the hash formulation is just one of those things you sense as you become more fluent in Perl.

We want to know if an incoming message is one that we already hold a copy of. Another way to phrase this is that the message exists in the set of messages in our folder. Once you start thinking about sets and existence in Perl, you should immediately start thinking about hashes because, like sets, they provide a data structure that doesn't necessarily hold an order, but can be used to test easily whether or not something is a member of the set. Perl's exists function gives us a quick and simple test for set membership.

It's a Perl rule of thumb—if you want to know if something is part of a list of values, that list should become a hash and you should use exists.

So, I'd naturally have written the above line as:

my %ids = map { $_->header("Message-ID") => 1 } $folder->messages;

Notice that we don't really care about the value attached to the message ID in the hash, because all we're going to use it for is to test for existence.
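If you'd like to try the idiom without a mailbox to hand, here is a small self-contained sketch. The Message-IDs are made up, and already_have is a hypothetical helper name rather than anything provided by the Email:: modules.

```perl
use strict;
use warnings;

# Made-up Message-IDs standing in for those read from a folder.
my @have = ('<a1@example.com>', '<b2@example.com>', '<c3@example.com>');

# Turn the list into a set: the keys are the members, the values are ignored.
my %seen = map { $_ => 1 } @have;

# Hypothetical helper: true if we already hold this Message-ID.
sub already_have {
    my ($id) = @_;
    return exists $seen{$id};
}

for my $id ('<b2@example.com>', '<z9@example.com>') {
    printf "%s: %s\n", $id, already_have($id) ? "already stored" : "new";
}
```

The lookup costs the same however many IDs the folder holds, which is the practical reason to prefer the hash over scanning the array with grep for every incoming message.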

Now we can look again at the messages we're downloading via WWW::Yahoo::Groups. If we see a message that's already in the folder, we can assume that we've caught up with the archives.

The only slight nit is that to match the message ID from the downloaded message against our set from the folder, we need to turn that into an Email::Simple object, too. But thankfully, by definition, that's simple!

use Email::Simple;
use Email::LocalDelivery;

for ($lwm...$hwm) {
    my $message = $scraper->fetch_message($_);
    my $id = Email::Simple->new($message)->header("Message-ID");
    last if exists $ids{$id}; # Caught up with archive.

    Email::LocalDelivery->deliver($message, "~/mail/p5ml");
}

Hoorah! Now we should have a mail box full of old Pizzicato Five messages. What could be a better use for Perl?

Threading Mail

I suppose, though, that we might want to have a fast way of catching up with these hundreds of old messages. In next month's article, we're going to look in more depth at summarizing mailing-list archives, but for this month, we'll take a quick look at how to organize these messages into threads. We'll again start with Email::Folder and Email::Simple to split off the messages from the folder, and now we want to thread them.

Thankfully, CPAN is there for us again, and Email::Thread does all the work in arranging a collection of Email::Simple objects into threads. It could hardly be much easier. To kick off the threading process, we create a new threader object with all the messages we want organized, and then call the obviously named thread method:

use Email::Folder;
use Email::Thread;

my $threader = Email::Thread->new(
                Email::Folder->new("mail/p5ml")->messages
              );
$threader->thread;

This goes away and populates a bunch of internal data structures—quite remarkably quickly—and allows us to ask for the set of root messages that form the initial messages in threads:

my @roots = $threader->rootset;

Each of these root messages returned from rootset is an Email::Thread::Container object; it may have a parent, child, or sibling (reached through the parent, child, and next methods, respectively), each of which is another Email::Thread::Container, and a message, which is the Email::Simple object representing the message. The child is a container that refers to a reply to the current node; the sibling is another reply to the same parent.

Sometimes the message will be empty (if we have missed out some posts in the thread), but the threader can still deduce that a message should be there. Let's begin by looking at the subjects of each thread:

for my $root (@roots) {
    # An empty container stands in for a message we never received.
    my $message = $root->message or next;
    print $message->header("Subject")."\n";
}

This gives us a good overview of the various topics that come up on the mailing list, but how popular is each one? To count the number of replies to a thread-starter, we'll have to traverse the child and sibling links of the thread.

Walking a tree-like data structure, such as a message thread, should suggest to you a recursive algorithm: Do something for the node, then perform this algorithm on its sibling, then perform it on its child. I was once told two golden rules for dealing with recursive procedures. First, make sure there's a termination condition, then have faith that it'll do the right thing. It usually will. Here's a basic thread walker:

sub walk_thread {
    my $node = shift;
    do_something($node);
    # The sibling link is exposed by the container's next method.
    walk_thread($node->next)  if $node->next;
    walk_thread($node->child) if $node->child;
}

Will this terminate? Eventually, we'll reach a node that has neither a sibling nor a reply, and the recursion stops there. Will it do the right thing? It's easier to see this if you assume a thread with no siblings, just one reply to each message:

sub walk_thread {
    my $node = shift;
    do_something($node);
    walk_thread($node->child)   if $node->child;
}

Now we can see that this will cause each child to be visited in turn, and then terminate. Putting the sibling link back is exactly the same but makes the walk two-dimensional.
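To watch the two-dimensional walk happen, here is a sketch that runs on plain hashes instead of real thread containers, so it needs no mail at all. The node constructor, the hash keys, and the subjects are all invented for the example; it also visits the child before the sibling, which is the order you want when printing replies nested under their parent.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a thread container: a hash with a subject,
# a 'child' link (first reply) and a 'next' link (next sibling).
sub node {
    my ($subject, %links) = @_;
    return { subject => $subject, %links };
}

# A root with two replies; the first reply has a reply of its own.
my $thread = node("P5 split up?",
    child => node("Re: P5 split up? (1)",
        child => node("Re: Re: P5 split up?"),
        next  => node("Re: P5 split up? (2)"),
    ),
);

# Build an indented outline, one line per node.
sub outline {
    my ($node, $depth) = @_;
    my $indent = "  " x $depth;
    my @lines  = ($indent . $node->{subject});
    # Replies sit one level deeper; siblings stay at the same depth.
    push @lines, outline($node->{child}, $depth + 1) if $node->{child};
    push @lines, outline($node->{next},  $depth)     if $node->{next};
    return @lines;
}

print "$_\n" for outline($thread, 0);
```

The depth argument is the only state carried down the recursion: following a child link increases it, while following a sibling link leaves it alone.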

But how do we go from here to finding out the number of messages in a thread? Imagine an army squad in a field. It's almost in a line, but some people are ahead of others. The commander at the west end of the field wants to know how many soldiers are still alive and responding. He can only see the sergeant, so he stealthily runs over to him and says, "Find out how many soldiers are east of you, add one for yourself, and tell me the answer." The sergeant can see two corporals to his east, so he asks both of them, "Find out how many soldiers are east of you, add one for yourself, and tell me the answer." The corporals go and do exactly the same thing for the men that they can see, and report back the answer. The sergeant gets two answers back, adds them together, adds one for himself and reports it to the commander, who adds one for himself, and knows how many messages, uh, men he's got.

This works because the men at the east end can see no other men, so they add one for themselves, and report back "one," and the answer bubbles up until you've summed everyone. Here's what that looks like in Perl.

sub count_offspring {
    my $node = shift;
    my $count = 0;
    if ($node->next)  { $count += count_offspring($node->next); }
    if ($node->child) { $count += count_offspring($node->child); }
    # And one for mynode
    $count++;
    return $count;
}
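To check the counting logic away from a real mailbox, you can run the same recursion over a toy tree. In this sketch, Demo::Node is a hypothetical stand-in that offers only the child and next methods the routine relies on; it is not the real container class.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a thread container, offering only the two
# link methods that the counting routine calls.
package Demo::Node;
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub child { $_[0]{child} }
sub next  { $_[0]{next} }

package main;

sub count_offspring {
    my $node  = shift;
    my $count = 1;    # one for this node
    $count += count_offspring($node->next)  if $node->next;
    $count += count_offspring($node->child) if $node->child;
    return $count;
}

# A root with three direct replies, the second of which has one reply.
my $root = Demo::Node->new(
    child => Demo::Node->new(
        next => Demo::Node->new(
            child => Demo::Node->new(),
            next  => Demo::Node->new(),
        ),
    ),
);

print count_offspring($root), "\n";    # five nodes in total
```

Note that the total includes the node you start from, so the figure for a thread is its number of messages rather than its number of replies.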

And now, we can sort the threads by the number of messages they contain:

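my @rootset = $threader->rootset;
my %counts = map { $_ => count_offspring($_) } @rootset;
for my $node (sort { $counts{$b} <=> $counts{$a} } @rootset) {
    # The root may be an empty container if its message is missing.
    my $subject = $node->message ? $node->message->header("Subject")
                                 : "(missing message)";
    print "$subject: $counts{$node}\n";
}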

When we do this, we find that, unsurprisingly, the most prominent thread is the one about the band splitting up...

Providing Links

Now let's change the subject a tiny bit. Getting back to these Yahoo Groups, you'll remember I said there was no correlation between a Yahoo Group message ID and a message's Message-ID header. Wouldn't it be nice, then, if we could insert a header into each message giving the URL where the message came from?

We can do this because we know that WWW::Yahoo::Groups uses the WWW::Mechanize robot to do its fetching, that WWW::Mechanize allows us to get at the HTTP::Response object for the last fetch, and that HTTP::Response has a base method. And thankfully, we can get at the underlying WWW::Mechanize object by calling agent. So our modified scraper code looks like this:

for ($lwm...$hwm) {
    my $message = $scraper->fetch_message($_);
    my $uri = $scraper->agent->res->base;

Next, we need to modify the headers of this e-mail. Since we're parsing the mail into Email::Simple anyway to get the Message-ID, we can add an X-Yahoo-URL header in there as well:

    my $simple = Email::Simple->new($message);
    my $id = $simple->header("Message-ID");
    $simple->header_set("X-Yahoo-URL", $uri);

    last if exists $ids{$id}; # Caught up with archive.

    Email::LocalDelivery->deliver($simple->as_string, "~/mail/p5ml");
}

There's only one slight problem with this: the URLs you get back are really horrifically long in some cases. Thankfully, there's a nice solution we can slot in here, another CPAN module called WWW::Shorten. This uses one of the various URL-shortening services out there, such as the well-known MakeAShorterLink and many others, to produce a more palatable URL. Let's support the home side, and use Ask Bjørn Hansen's Metamark. (Ask runs most of the perl.org services.)

use WWW::Shorten 'Metamark';
for ($lwm...$hwm) {
    my $message = $scraper->fetch_message($_);
    my $uri = makeashorterlink($scraper->agent->res->base);
    my $simple = Email::Simple->new($message);
    my $id = $simple->header("Message-ID");
    $simple->header_set("X-Yahoo-URL", $uri);

    last if exists $ids{$id}; # Caught up with archive.

    Email::LocalDelivery->deliver($simple->as_string, "~/mail/p5ml");
}

And there we are! Job done!

Dedication

Some of you may have cottoned on to something strange about this article: As it progressed from a relatively plausible (for me, at least) premise, it got more and more contrived. My apologies for that, but there is a good reason for it.

This month, I particularly wanted to showcase a special set of modules and a special module author. Three of the modules we discussed in this article, WWW::Yahoo::Groups, Email::Thread, and WWW::Shorten, were all written by the same man, Iain Truskett.

Iain not only wrote these modules, but many others, too. He also contributed fixes and advice to a large number of other people's modules, to the Perl 5 core documentation, and to the DateTime project. Additionally, he ran the Perl books site at http://books.perl.org/, and was a regular contributor to many Perl mailing lists, newsgroups, and IRC channels, known throughout for his patience and calmness.

He was a tireless benefactor to the Perl community, even though he largely remained humbly behind the scenes. We lost a good man when Iain passed away on New Year's Eve, 2003. This article is respectfully dedicated to him.

TPJ

