Analyzing Web Traffic with Perl

Our authors present SurfReport, a program that takes the nearly unreadable format of your web-site log entries, and converts them to a meaningful report of who has been accessing your site.


October 01, 1996
URL:http://www.drdobbs.com/web-development/analyzing-web-traffic-with-perl/184410090

Web Development: Analyzing Network Traffic...

The authors work at Bien Logic and can be contacted at [email protected].


The World Wide Web is the first multimedia communication system that can be tracked in real time. Of course, the telephone is a fully trackable system (take a look at your next phone bill), but it's not really multimedia. Television is a true multimedia system, providing sound, images and motion, but it cannot be tracked accurately in real time. No one knows exactly how many people watch "Friends" or the Super Bowl. Although companies such as Nielsen generate remarkably good estimates based on surveys of selected households, this process is not nearly as accurate as your telephone bill.

The Web, however, allows you to accurately count usage. A webmaster can easily keep track of the text, photos, or sound files used by the people coming to a web site. We can find out where they came from, at what time, with what type of computer, what browser software they use, and even if the file was delivered without errors.

Web servers cannot record how many "people" look at a photo, or how old or how rich those people are. If you really need this information, the best solution is simply to ask your users by having them log in.

In this article, we will explain the Common Log Format used by most web servers and present some basic Perl routines that are used in SurfReport 1.2, our statistical-analysis program that's available free of charge at http://bienlogic.com/surfreport.

The Log File

The server stores information about its operation to a collection of "log" files. The most important of these files is usually called "access" or "access_log," and is kept in a directory frequently called "logs" or "stats." One entry is added to the log for every separate file requested. For example, if a particular HTML page has three GIF files included with it, a "hit" to that page will add four entries to the log file. An entry consists of information about the visitor to that page and contains roughly 100-130 characters.

Some servers, such as Apache, also provide a referrer_log file that complements the information found in access_log. Every server I have seen allows you to combine the access and referral information into one access_log file that is then written in an extended version of Common Log Format.

The log entries are usually, but not always, consecutive and are separated from each other by newline characters. Even errors (such as requests for nonexistent files) are logged, so it's easy to see how the log can grow to enormous sizes. A moderately popular web site might get 100,000 hits per day, adding 70 to 100 MB of information to the log file every week. Many servers will compress or archive the log files on a regular basis to prevent infinite growth.

A typical Common Log Format entry looks something like Example 1. The general form of an entry is: remote_host rfc931 authuser [date_time] "request" status bytes. Table 1 explains these fields, and Example 2 lists the codes used in the status field. Some servers can add extra fields to their logs depending on their configuration. For instance, the most recent version of the NCSA HTTPD server has options to include two more fields, creating a "Combined Log Format."

The Combined Log Format augments the Common Log Format with information about the HTTP referrer and the user agent. The HTTP referrer is the URL from which the user comes, and the user agent represents the user's browser and machine type. Example 3(a) shows this format.

Some servers also include fields like the virtual domain name and the time it took for the data to be sent to the client. In this case, the log entries have the format in Example 3(b). However, tools such as SurfReport only use the remote_host, [date_time], "request", and status fields.

So, even with the advent of a "standard," log entries still come in many shapes and flavors. There are even servers, such as WebStar, that use their own unique format.

Processing a Log File

Although the log files themselves can be huge, it's possible to produce meaningful reports using relatively little memory. For example, you only need to store one copy of each visitor and file name in memory, with a counter for each one. Also, you can simply skip many entries, such as entries outside of the relevant date range, or entries with error status codes.

With such optimizations, the memory needed to process a log file depends upon the number of unique files, the number of unique visitors, and the number of days involved. We've found that even for log files larger than 30 MB, the memory requirements stayed under 1 MB, mainly depending upon the number of casual visitors. SurfReport processes the log files in four basic steps:

1. Process the user input, if an interface is provided (it usually is).

2. Parse the log file and create tables.

3. Correct the tables for aliases.

4. Process the final tables into reports.

If the application is available online, user parameters come from an HTML form. The CGI script that receives the form input must check every parameter carefully, both for security reasons and to guard against erroneous input. Be careful when processing checkboxes. An HTML form does not return a value for an unchecked checkbox, and the corresponding name=value pair will simply be omitted. Never check for an off value for a checkbox.

Parsing can be done efficiently by retrieving entries one at a time, parsing each entry for fields of interest, and rejecting any entry that does not fall within the given parameter range. Perl provides an easy (although far from readable) way to match patterns within a given string. Listing One is the Perl code that searches a Common Log Format entry and extracts the remote host name, date request, and status bytes.

As you scan the log file, it's a good idea to convert dates and times into integer timestamps. This saves memory and allows for faster comparisons.

If the entry fits the required parameters, relevant data is gathered into associative arrays. For example, $Files{$request}++ will create a table called %Files, with unique file names as keys and the number of hits for each name as values. Often, you'll want to associate more than a simple counter with each key. One way to do this is to bundle the separate pieces of data into a single string, separating them with the subscript separator $;. Listing Two demonstrates this technique; it associates each visitor name with both a counter and a list of timestamps.

The "request" part of a log entry is recorded exactly as it comes from the client. The server knows about aliases and symbolic links, but doesn't translate them when writing to the log file. Because of these aliases, the log file can contain two or more names that refer to the same file.

The simplest example is the ".." used to indicate the next higher directory. The names /pub/gifs/../doc.html and /pub /doc.html usually refer to the same file.

However, note that if "gifs" in this example were a symbolic link, then /pub/ gifs/../doc.html might not be the same as /pub/doc.html. Similarly, our server uses the alias cgi-(user) to refer to the cgi-bin directory, so the entry /cgi-frederic/ script.cgi is really the same as /cgibin/script.cgi. For this reason, your reporting tool must be told about symbolic links and server aliases.

A similar problem arises with entries that end with the folder delimiter "/", which actually refer to the default HTML page, often called index.html or default.html.

It is faster to do all aliasing after the tables are built, so that you only do this conversion once for each unique key. If the result is a name already in the table, you'll need to merge the data and delete the old key.

Aliasing Visitors

Perl also provides access to network routines, like gethostbyaddr, for converting numerical IP addresses into a more legible form. Listing Three shows how to do this.

The last step is to process the results into the final report that you see on your screen or in your e-mail box.

All of these techniques are used in our SurfReport program (available electronically)to convert the nearly unreadable format of log entries to a meaningful and easily readable report indicating who has been accessing your site.

Example 1: Common Log Format entry.

myhost.mymachine.com - - [07/Apr/1996:14:11:33 -800] "GET /gifs/space.gif HTTP/1.0" 200 4745

Example 2: Server status codes.

Success Codes
 200: OK, request fulfilled.
 201: Created.
 202: Accepted.
 203: Partial information.
 204: No content.

Redirects
 300: Multiple choice.
 301: Moved permanently.
 302: Moved temporarily.
 303: Method.
 304: Not modified.

Errors
 400: Bad Request.
 401: Unauthorized.
 402: Payment required.
 403: Forbidden.
 404: Not Found.
 500: Internal server error.
 501: Not implemented.
 502: Bad gateway, service overload.
 503: Service unavailable or server timeout.

Example 3: (a) The Combined Log Format; (b) Combined Log Format with domain name and duration fields.

(a)

remote_host rfc931 authuser [date_time] "request" status bytes
"http_referrer" "user_agent"


(b)

remote_host rfc931 authuser [date_time] "request" status bytes
"http_referrer" "user_agent" vdm duration_in_secs

Table 1: Common Log entry field definitions.

     Term          Meaning
     
     remote_host   The name or IP address of the client who logged in.
     rfc931        The remote login name of the client if available or dash (-) if not, 
                     which is usually the case.
     authuser      The HTTP/1.0 authenticated user if available or dash (-) if not, 
                     which is usually the case.
     date_time     The date and time that the request was made, formatted as 
                     [DD/Mon/YYYY;hh:mm:ss timezone], where "timezone" is the 
                     server's timezone in the form -/+hhmm offset from GMT.
     "request"     The request line as sent by the client; quoted and stated in the 
                     format: "Method URL HTTP_Version."
     status        The http status code returned to the client or dash (-) when not 
                     available. Example 2 shows the possible status codes.
     bytes         The content length of the document transferred, not including the 
                     header, or dash (-) when not applicable.

Listing One

while(<LOG>)
{
   if(/^(\S*).*\[(\S*).*\].*\"(.*)\"\s*(\d*).*$/)
     {
        $visitor = $1;
        $dateTime = $2;
        $request = $3;
        $status = $4;
      }
}

Listing Two


($counter, @stamps) = split($_ $Visitors{$visitor});  #split the
                                                      #  values for this
                                                      #  key
$counter++;                                           #increment the
                                                      #  counter
do AddStampToList(*stamps, $timeStamp);               #add current
                                                      #  timestamp
                                                      #  to the list of
                                                      #  timestamps for
                                                      #  this visitor
$Visitors{$visitors} = join($_ $counter, @stamps);    #join the values

Listing Three


sub ResolveIP
{
    local ($ip) = @_;
    local ($addNum);

    $AF_INET = 2;

    $addNum = pack ('C4', split (/\./, $ip));
    (gethostbyaddr ($addNum, $AF_INET))[0];
}

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.