Betsy is an independent Internet contractor, adjunct professor, and writer. She can be contacted at [email protected]
Preprocessing makes it possible for complex homepages to be delivered lightning fast, and lets you significantly increase the number of pages served in an extremely cost-effective manner. The idea of preprocessing content for web pages grew out of necessity. As a web developer, I worked on a site that uses an excellent commercial content-management system (CMS). The site had been running well for three or four months when a new traffic pattern emerged. Suddenly, one day at about noon, thousands of people visited the homepage. The server did its best to deliver the pages, but the application could not run quickly enough. The server crashed.
To avoid the problem, I manually substituted HTML for the dynamic data on the homepage, and the server was able to support the traffic. The site was moved from a shared server to a much more powerful dedicated server, and I assumed it would run fine. It didn'tit crashed again. The application was simply too CPU-, memory-, and disk-intensive to support a high-traffic homepage. That is when I began experimenting with preprocessing.
In general, homepages of web sites are the most commonly accessed page. Because they are usually the site visitor's entry point, homepages should be prepared and delivered as quickly as possible, especially if there are busy hours where many visitors access the page simultaneously. Sites that use resource-intensive content on the homepage can crash the server. "Resource-intensive" may refer to the server, including the CPU, disk, and memory, or it may be a bandwidth bottleneck, if content is supplied from different servers with slow connections.
Sites with high-traffic busy hours include online purchasing systems, especially those with time-sensitive sales or transactions. Concert ticket sales, vacation reservations, and auctions require a high-speed homepage to serve large numbers of site visitors simultaneously. News providers should also be prepared for high-volume sites, because many people are turning to the Internet as their primary source of information on current events. A site that reports the winning numbers for high-value, high-sales sweepstakes needs extremely fast pages. Popular Internet radio shows draw many site visitors, yielding a burst of traffic at the start of the broadcast.
The most common reason a site is resource intensive is the use of a sophisticated application. Popular applications are online shopping carts, content-management systems, forums, and blogs. These applications can be preprocessed into HTML, which lets them be delivered much more quickly. Sites with demanding multimedia displays are also resource intensive.
In this article, I present an online example (http://www.wirehopper.com/php/ preprocess.php) that demonstrates how to preprocess the content, and the impact of preprocessing on the delivery time. The scripts that implement this demonstration are available electronically from DDJ (see "Resource Center," page 4) and at http://www.wirehopper.com/php/list.php.
To run the code, go to the preprocessor URL (http://www.wirehopper.com/ php/preprocess.php), and enter the URL of the page you'd like to process. This code works on any web page, from any server with any OS, any server software, and any scripting language. It creates .htm files. Once you press the Submit button, the script reads the content from the page and writes out the .htm file. It then reopens the URL and .htm file, reads them, and calculates the time required to read both and the number of pages per second each would support. Figure 1 shows the results of the preprocessing for the WireHopper homepage.
The delivery times of pages from remote servers (anything that isn't on the WireHopper server) include the full request, process, and response cycle, and all the related transmission times. For this reason, there are three local applications to demonstrate the code:
- The WireHopper homepage. This page uses Smarty templates to make it more manageable. Although the templates are compiled, there is enough overhead that you can see significant performance increases with the preprocessed page. This comparison eliminates all the transmission time delays and accurately reflects the time required to prepare the WireHopper homepage, and a preprocessed version of the same page.
- A bulletin board application that uses MySQL.
- A quote-of-the-day program that uses a flat text file to store the quotes.
The pages per second should be treated as a relative number because server performance and page delivery time can be affected by many variables.
To create a high-speed version of application content, you must first consider the site architecture. If the application is currently serving as the homepage, then it should be preprocessed as a page. In the case where the application content is displayed in an iframe or other container, the preprocessing may be more complex to reduce the size of the content. Figure 2 illustrates the flow of data within a preprocessing architecture.
Preprocessing a page is straightforward. It is read like a file and written out to a new file. Most server-side scripting languages should support the necessary reads and writes. The read delivers the page from the server, just as if the URL was entered on the address line of a browser. By writing it out to an HTML file, the server preserves the processing. It is similar to compilation and caching. If the name of the page is index.htm, the preprocessed file becomes the homepage. The homepage will be extremely fast, and subsequent navigation will be through the application. Dynamic pages can be created with preprocessing. Simply insert the additional code as needed and name the file appropriately.
If the application content is a small amount of data that is pulled in through an iframe, time should first be taken to deliver the application content with as little overhead as possible, usually with a transparent background. Overhead includes application stylesheets, which serve the entire application, metatags, and any unnecessary code or processing. Comments can be used to indicate the sections of code to be preprocessed. Application stylesheets should be replaced by style tags or a specialized stylesheet that supports only the delivered content for optimum performance. Metatags will be superseded by the tags of the parent document.
The homepage would pull in the preprocessed content with an iframe tag, like so:
<iframe src="preprocessed.htm" width=150 height=145
frameborder=no style="overflow: hidden;border:none"
You may run the preprocessor manually through a web interface as in this demonstration, but it is more likely it will be run from the command line by scheduling software such as cron. First, you'll need to determine the correct command line to execute the script. On Linux, it is often:
/usr/local/bin/php cmdline.php parameters
When the PHP script runs from the command line, the parameters are passed as argc (the number of arguments), and argv (the array of arguments). argc is always greater than or equal to one, because the first argument is the name of the script.
Once you have the script running correctly from the command line, you can configure the scheduling software. I describe Linux cron and the crontab, but any scheduling software is fine. The crontab entries are system dependent, but there are a few common rules: The program that is to be executed (in this case, it is PHP, which is processing the script) should be named with the full path name, and the script name should also be fully qualified.
A sample crontab is:
0 1 * * * /usr/local/bin/php
This script would run at 1:00 am every day. /usr/local/bin/php is the PHP executable, the program that processes cmdline.php. The output would be sent to the MAILTO address. Refer to your system software documentation for a full discussion of the available scheduling software.
You should also create a web interface to easily refresh the preprocessed page. This lets you and your coworkers quickly and easily recover if there are problems (often new data becomes available and must be posted on the site). Listing One, a refresh simulator (http://www.wirehopper.com/ php/webrefresh.php), simulates the system command with an echo to show how to run the command-line version of the preprocessor from a web-based script.
Refreshing Delivered Client Content
Refreshing preprocessed content on the client machines can be accomplished with the use of meta refresh tags in Listing Two (http://www.wirehopper.com/php/ refresh.php).
The metatag syntax is like this:
<meta http-equiv="refresh" content="30">
This means that the browser requests the content from the server after 30 seconds. If the site has to reflect currently changing conditions (for example, the number of items available for purchase), this ensures the site visitors see updated data. This tag also supports a redirect, so site visitors that stay on the page beyond the refresh time could be redirected to a different page. This lets you implement a very high-speed homepage, with a more robust back end.
You can use any interval you would like and you may want it to be calculated by the preprocessor in response to the time of delivery. For example, you might want to refresh the page at the top of the hour. The refresh content is the number of seconds until refresh, so you would need to calculate how many seconds there are between the current time and the time you would like to run a refresh.
Search Engine Optimization
The preprocessed content is composed from existing code on the site. If it is serving as a homepage, best practices for search engine optimization (SEO) should be followed. You can probably assume the SEO work has been done already for a homepage and will be preprocessed, but if the content is embedded in an iframe, it is important to use metatags or robots.txt to prevent the search engines from indexing the file and allowing site visitors to access it out of context.
Preprocessing does have some drawbacks. If the application assigns a session ID to site visitors as they enter the system through the homepage, preprocessing may cause problems as the same session ID will be given to all site visitors. In this case, the application should refrain from assigning the session ID until after the user enters a secondary page or logs in. Another alternative would be to evaluate the page as it is read and create a preprocessed version of the content that is displayed on the homepage. If the application cannot be modified or aggressive preprocessing is not possible, a static homepage may be required. An application that uses a redirect on its first page may require that the second page be preprocessed and used as the entry point, to avoid the redirect defeating the preprocessing.
Another side effect of the preprocessing is that some dynamic features may be lost. To compensate, you may adapt the preprocessing code to allow some dynamic content. Balancing the site features with the performance requirements will dictate the type of page possible. You can further adapt your site by running preprocessed pages only during high-traffic hours, using cron jobs and scripts to determine the type of homepage to deliver based on the time of day or server load.
Preprocessing may improve the performance of multimedia sites, but it is probably better to optimize the multimedia content by reducing the size, quantity, or quality of the images, audio, and video during periods of high traffic.
Many applications have template support that allows you to extract content with different displays, paths, or interfaces. A specialized application template can be a great advantage as it simplifies the preprocessing and integration.
Finally, the hosting account may limit how preprocessing can be implemented. Some accounts don't allow shell access or cron jobs. In that case, you may have to use a manual web-based preprocess/refresh interface or a different hosting account.
Every system has a finite performance limit. The limit is a complex function of the server hardware and software configuration, site architecture, page structure, and application specifics, as well as the other processes running on the server. Preprocessing can greatly speed up the delivery of pages and extend a server's capacity in an extremely cost-effective manner.
<?php /* webrefresh.php -------------- Web interface to run a command-line PHP script. The system command is simulated with an echo command for the demonstration. The /usr/local/bin/php is the PHP processor, with the full path. cmdline.php is the name of the PHP script. parameter is an optional parameter. */ echo '<br><b>Refreshing home page content</b><br>'; echo 'system ("/usr/local/bin/php cmdline.php parameter");'; echo '<br><b>Done</b><br>'; ?>Back to article
<!-- refresh.php ----------- This is a demonstration of the use of the meta refresh tag to update dynamic content. In this case, every 5 seconds, the quotes script will run and display a random quote. If you create dynamic pages that need to be refreshed, this ensures that site visitors will have current data. The second refresh tag can be used to redirect site visitors to a different page. --> <html> <head> <title>Refresh Tag Demonstration</title> <!-- Simple refresh tag demonstration, refresh this page after 5 seconds --> <meta http-equiv="refresh" content="5"> <!-- Refresh tag that redirects the site visitor to a different page after 30 seconds <meta http-equiv="refresh" content="30;url=http://www.google.com/"> --> <meta http-equiv="content-type" content="text/html" > </head> <body> <?php include "../quotes/quotes.php"; ?> </body> </html>Back to article