The HTTPsync Incremental Update Utility


Jul99: The HTTPsync Incremental Update Utility

Forrest provides consulting services and software development through Mib Software. He can be contacted at forrest@mibsoftware.com.


Software developers with collections of files to share and distribute often use a compressed archive format such as ZIP or tar.gz for transfer. But compressed archives are inconvenient to browse, and they waste bandwidth when only updates are needed. Even in a rapidly evolving open-source software project, active development usually affects a very small subset of files. When an update is needed, therefore, it is unlikely that compression will make up for the unnecessary transfer of a large number of unchanged files also in the archive. Storing a collection as individual files makes them much easier to browse and maintain. Only files that change need to be updated.

HTTPsync, the utility I present here, is client-side-only software that performs fast and efficient incremental updates to synchronize collections of files. Only the standard features of HTTP are used. No special protocols, server software, or daemons are needed. HTTPsync, available electronically from DDJ (see "Resource Center," page 5) and from http://www.mibsoftware.com/, is implemented as a C source file that compiles for Windows and UNIX-like systems.

Using HTTP for File Distribution

Hypertext Transfer Protocol (HTTP) is a good base protocol for file distribution. A single stream socket connection is used to make a request and return the data. This simplifies the implementation of clients and servers. In comparison, FTP requires a control connection and a separately arranged data connection. CVSup also uses a two-connection approach, with a special protocol. Special protocols and ports are often blocked by firewalls that are configured to pass only HTTP requests and data.

HTTP servers are ubiquitous and available for just about every operating platform. Nearly every software developer can publish web pages at little or no extra cost. The general-purpose nature and simple protocol allow the design of HTTP servers for efficient handling of large numbers of requests, as well as caching intermediaries and proxies. HTTP/1.1 supports multiple requests and data returns over a single connection, which avoids connection setup and tear-down overhead when a large number of files must be transferred.

In contrast, many popular file distribution systems, including CVSup, Sup, track, rdist, cvs, and rsync, require privileged-port server-side daemons, which are not easily available for all platforms. CVSup is implemented in Modula-3, for example. A survey of other systems and comparison to HTTPsync is available at http://www.mibsoftware.com/httpsync/.

HTTP does lack methods that some specially designed protocols and software use to implement efficient incremental updates. There is no method to determine which files are part of a collection or which have changed. The "If-Modified-Since" request header allowing conditional transfer is ignored by some servers (the protocol allows this if the data is always sent). Even if all servers supported this header, making a separate request for each file would be an inefficient way to find 10 changed files in a collection of 500. The "Last-Modified" response header, which provides time stamp information, is not always provided reliably. Read/write/execute permissions are not handled within the protocol at all. HTTPsync shows a way to overcome these limitations with client-side-only software. No modifications to the protocol or the server are required.

A Client-Side-Only Solution

To distribute a collection of files using HTTPsync, you store on the server a packing list that describes the collection. For all the files in the collection, the packing list includes path names relative to the current directory, sizes, time stamps, and read/write/execute permissions. HTTPsync first obtains the packing list specified at run time by a URL, then compares it to the local status of files to determine which are needed for incremental update. HTTP GET requests are made, and the files are stored with the time stamp and status bits provided by the packing list. The ownership of the files is naturally whatever user is running HTTPsync.
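The comparison step described above can be sketched roughly as follows. This is an illustration only, not code from HTTPsync: the names pl_entry, local_state, read_local_state, and needs_update are hypothetical, and the sketch assumes that a file is fetched whenever it is missing or its size or time stamp differs from the packing list entry, regardless of which copy is newer.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical record for one packing list entry (not HTTPsync's own type). */
struct pl_entry {
    const char *path;  /* relative path, for example "./src/main.c" */
    long size;         /* size in bytes, from the packing list */
    time_t mtime;      /* time stamp, from the packing list */
    unsigned mode;     /* permission bits, from the packing list */
};

/* Hypothetical snapshot of the corresponding local file. */
struct local_state {
    int exists;
    long size;
    time_t mtime;
};

/* Fill in the local status of a file using stat(). */
void read_local_state(const char *path, struct local_state *l)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        l->exists = 0;
        l->size = 0;
        l->mtime = 0;
        return;
    }
    l->exists = 1;
    l->size = (long) st.st_size;
    l->mtime = st.st_mtime;
}

/* Returns 1 if the file should be fetched, 0 if the local copy matches.
 * Assumption: any difference triggers a transfer, whichever copy is newer. */
int needs_update(const struct pl_entry *e, const struct local_state *l)
{
    if (!l->exists) return 1;
    if (l->size != e->size) return 1;
    if (l->mtime != e->mtime) return 1;
    return 0;
}
```

A caller would run read_local_state() on each path named in the packing list and issue a GET request only for entries where needs_update() returns 1.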

HTTP/1.1 persistent connections are used when possible, but a fallback to HTTP/1.0 happens after a small number of attempts. (Poorly implemented caches and proxies can cause this, even if the source server supports HTTP/1.1.)
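The difference between the two request styles is small on the wire. The sketch below formats a direct (non-proxy) GET request either way; format_get_request is a hypothetical helper, not one of HTTPsync's routines. An HTTP/1.1 request must carry a Host header and leaves the connection open by default, while an HTTP/1.0 request is normally followed by the server closing the connection.

```c
#include <stdio.h>

/* Format a GET request into buf.
 * Returns the request length, or -1 if it does not fit in cb bytes.
 * use_http11 != 0: HTTP/1.1 request with the required Host header.
 * use_http11 == 0: HTTP/1.0 fallback, one request per connection. */
int format_get_request(char *buf, size_t cb, const char *host,
                       const char *uri, int use_http11)
{
    int n;
    if (use_http11)
        n = snprintf(buf, cb, "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n", uri, host);
    else
        n = snprintf(buf, cb, "GET %s HTTP/1.0\r\n\r\n", uri);
    if (n < 0 || (size_t) n >= cb)
        return -1; /* output error or truncated request */
    return n;
}
```

With a persistent connection, several such requests can be written to the same socket one after another, each after the previous response has been read.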

From this design, the implementation is straightforward and the software is written for use on UNIX-like and Windows systems. As a simple client, HTTPsync has none of the advanced features found in other incremental file distribution protocols. If you're used to working with CVSup, it is important to note that HTTPsync does not merge local changes; rather, it synchronizes the local copy to the remote copy exactly. Local changes will be overwritten regardless of source and destination time stamp ordering. Merges with local changes can be accomplished with CVS, patch, or other software external to HTTPsync.

Common Code For WinSock and BSD Sockets

With a few conditional defines and code blocks, HTTPsync compiles cleanly on WinSock and BSD systems; see Listing One. The most important difference between the systems is that WinSock sockets are not file descriptors. HTTPsync is written using macros readsocket, writesocket, and closesocket, which are defined conditionally to be recv, send, and closesocket for WinSock, and read, write, and close for BSD sockets. The WIN32 preprocessor definition controls which header files are included and the WSAStartup and WSACleanup calls needed under WinSock. This organization allows the sockets code to be platform independent.

There are some other differences in file handling controlled by WIN32 as well. Under the MS-DOS FAT file systems, time stamps have a two-second resolution, handled in HTTPsync with the TDIFFLIMIT macro. Under MS-DOS there are no file mode bits for groups and other classes. These are handled with the MODMASK and GENMODMASK macros. See the HTTPsync code (available electronically) for details.
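To illustrate the idea, the sketch below compares time stamps with a tolerance and compares modes through a mask. The macro names SKETCH_TDIFFLIMIT and SKETCH_MODMASK and their values are assumptions chosen for this example; the actual definitions of TDIFFLIMIT, MODMASK, and GENMODMASK are in the HTTPsync source and may differ.

```c
#include <time.h>

#ifdef WIN32
#define SKETCH_TDIFFLIMIT 2     /* assumed: FAT time stamps move in 2-second steps */
#define SKETCH_MODMASK    0700  /* assumed: only the owner bits are meaningful */
#else
#define SKETCH_TDIFFLIMIT 0     /* assumed: exact match required */
#define SKETCH_MODMASK    0777  /* assumed: owner, group, and other bits */
#endif

/* Returns 1 if two time stamps are equal within the platform tolerance. */
int same_time(time_t a, time_t b)
{
    double d = difftime(a, b);
    if (d < 0) d = -d;
    return d <= SKETCH_TDIFFLIMIT;
}

/* Returns 1 if two modes agree on the bits the platform can represent. */
int same_mode(unsigned a, unsigned b)
{
    return (a & SKETCH_MODMASK) == (b & SKETCH_MODMASK);
}
```

Keeping the platform differences in two macros means the comparison logic itself needs no conditional code.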

Reusable Routines

In early versions of HTTPsync, I used HTTP/1.0 transfers (one request per connection) only. Traffic and performance studies showed that using persistent HTTP/1.1 connections would provide significant improvement. The routines with names beginning with "HTTPaccess_" implement DNS lookup, requests through proxies, and persistent HTTP/1.1 connections with fallback to HTTP/1.0 if problems are detected. These functions will be useful in other projects. Listing Two shows the use of these functions to make multiple requests to a server. In summary, a call to HTTPaccess_OpenConn() is followed by a call to HTTPaccess_Retrieve() for each request.

HTTPsync includes the source code for two functions not provided in all standard libraries. The reusable software directory at http://www.mibsoftware.com/reuse/ was used to locate and include source code for these functions. HTTPsync needs to parse and convert numbers in base 10 (dates and sizes), base 8 (file modes), and base 16 (HTTP/1.1 chunked-transfer encoding). The "Integer Conversion" topic of the directory led to strtol() sources from the FreeBSD cvsweb, which converts any base from 2 to 36. Also, there is no portable and reliable system call to convert a GMT date to a time_t. The mktime() function is close, but does not handle time zones deterministically. The "Calendar and Time" topic listed tm_to_time.c from the comp.sources.unix archive, which was modified slightly to remove time zone handling.
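Both conversions can be sketched in a few lines. The code below is not the strtol() or tm_to_time.c source mentioned above: it calls the standard library strtol() and uses a common days-from-civil-date calculation as a stand-in. The names parse_number, days_from_civil, and gmt_to_time are hypothetical, and the result assumes a time_t that counts seconds since January 1, 1970.

```c
#include <stdlib.h>
#include <time.h>

/* Parse a number in the given base (10 for sizes, 8 for file modes,
 * 16 for chunk sizes). Sets *ok to 0 if no digits were found. */
long parse_number(const char *s, int base, int *ok)
{
    char *end;
    long v = strtol(s, &end, base);
    *ok = (end != s);
    return v;
}

/* Days from 1970-01-01 to the given Gregorian date (m is 1..12). */
static long days_from_civil(long y, int m, int d)
{
    long era, yoe, doy, doe;
    if (m <= 2) y -= 1;                 /* treat Jan and Feb as end of prior year */
    era = (y >= 0 ? y : y - 399) / 400; /* 400-year cycle number */
    yoe = y - era * 400;                /* year within the cycle, 0..399 */
    doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; /* day of year, from Mar 1 */
    doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;          /* day within the cycle */
    return era * 146097 + doe - 719468;
}

/* Convert broken-down GMT fields to seconds since the epoch, with no
 * time zone or daylight saving adjustment of any kind. */
time_t gmt_to_time(const struct tm *t)
{
    long days = days_from_civil(t->tm_year + 1900L, t->tm_mon + 1, t->tm_mday);
    return (time_t) days * 86400 + t->tm_hour * 3600L + t->tm_min * 60L + t->tm_sec;
}
```

Because no time zone is consulted, the same input always yields the same time_t, which is the property mktime() does not guarantee.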

The HTTPsync Packing List

You generate an HTTPsync packing list with the -m command-line parameter, providing a list of files on the standard input, one per line. There are four types of lines in an HTTPsync packing list, determined by the initial characters starting the line.

  • Lines beginning with "." specify a file or directory; see Example 1. The fields of the line in order are file path (which must begin with "./"), file size, file date (day of week, day, month name, four-digit year, hh:mm:ss GMT), and the octal mode. All file paths are relative to the current directory. They may include subdirectories that will be created automatically, but must not contain the character sequence ".." or white space. The ".." restriction is enforced on the client side to prevent a malicious packing list from writing or deleting any files outside the current directory hierarchy. HTTPsync runs as an unprivileged utility, so file ownership will be the same as the process owner. The file mode bits are masked to prevent the installation of symbolic links and setUID executables.

  • Lines beginning with "O" specify obsolete files or directories. These files will be removed from the local collection, if they exist. When creating a packing list, it is important to list all files that have ever been made obsolete. HTTPsync with the -m parameter will write an "O" line for files that are named, but do not exist.

  • Lines beginning with "R" modify the root URL used for the HTTP requests. The current directory used to store local files is not changed. This permits URLs for packing lists generated from CGIs to be separated from the file requests. CGI generated packing lists permit access control, authentication, and other customization as necessary. There may be only one "R" line per packing list.

  • Lines beginning with "#" are ignored as comments, except for a line beginning with the sequence "#-#httpsync." This line specifies the version of the packing list format. The version number provided in the packing list is checked to prevent older HTTPsync software from trying to process future formats.
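The line types and the path restrictions described in this list can be sketched as follows. This is not HTTPsync's parser: the names pl_line_type, classify_line, path_is_safe, and mask_mode are hypothetical, and the 0777 mask is an assumed way of discarding file type and setUID bits.

```c
#include <string.h>

enum pl_line_type {
    PL_FILE,      /* "." line: file or directory */
    PL_OBSOLETE,  /* "O" line: remove locally if present */
    PL_ROOT,      /* "R" line: changes the root URL for requests */
    PL_VERSION,   /* "#-#httpsync" line: packing list format version */
    PL_COMMENT,   /* any other "#" line */
    PL_UNKNOWN
};

/* Classify a packing list line by its initial characters. */
enum pl_line_type classify_line(const char *line)
{
    /* Check the version marker first, since it also begins with "#". */
    if (strncmp(line, "#-#httpsync", 11) == 0) return PL_VERSION;
    switch (line[0]) {
    case '.': return PL_FILE;
    case 'O': return PL_OBSOLETE;
    case 'R': return PL_ROOT;
    case '#': return PL_COMMENT;
    default:  return PL_UNKNOWN;
    }
}

/* Returns 1 if a path field is acceptable: it begins with "./" and
 * contains neither ".." nor white space. */
int path_is_safe(const char *path)
{
    if (strncmp(path, "./", 2) != 0) return 0;
    if (strstr(path, "..") != NULL) return 0;
    if (strpbrk(path, " \t\r\n") != NULL) return 0;
    return 1;
}

/* Keep only the read/write/execute bits (assumed mask). */
unsigned mask_mode(unsigned mode)
{
    return mode & 0777;
}
```

Rejecting every path that contains ".." is stricter than necessary (it also refuses a name such as "a..b"), but it matches the rule as stated and is simple to verify.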

HTTPsync in Action

My first application of HTTPsync was to provide an alternate method of obtaining the InterNetNews development tree, a collection of approximately 500 files totaling more than 4 MB uncompressed (1 MB when compressed). The tree was already available to those with CVSup clients and via FTP as compressed archives.

Over a 10-week period beginning August 11, 1998, HTTPsync was used to synchronize to the INN 2.2 development tree a total of 359 times. For that time period, running HTTPsync once per day resulted in an average transfer of 160 KB, including the 26 KB packing list, which is always transferred. With daily incremental updates, HTTPsync transferred 84 percent less data than FTP of compressed archives would have.
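As a rough check on that figure, assume a daily FTP update would fetch the whole compressed archive of about 1 MB, taken here as 1024 KB (both are assumptions, since the exact archive size is not given):

```latex
\[
1 - \frac{160\ \text{KB}}{1024\ \text{KB}} \approx 1 - 0.156 = 0.844 \approx 84\%
\]
```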

Conclusion

The portable design of HTTPsync and use of basic standards allow use in a much wider range of situations than other incremental update distribution systems. Individual software developers can make source code collections available as individual files on standard web servers, and HTTPsync allows incremental synchronization. With personal web servers for PC-level systems, you can keep a master collection on a small or home machine, and efficiently transfer it to target systems. The implementation itself can be reused in writing other HTTP/1.1 clients for BSD and WinSock systems. Some of the features that may be included in future versions of HTTPsync are data compression, using HTTP Authentication for requests, partial file transfers (only the changed parts of files), and "pipelined" HTTP/1.1 (requesting the next file before the current transfer is complete).

DDJ

Listing One

/**********************************************************/
/* Macros and environment for portable sockets code
 * 1. Code is written using readsocket, writesocket, closesocket,
 *    INVALID_SOCKET, SOCKET_ERROR, SOCKET, and INADDR_NONE
 * 2. Always use SocketStartup() and SocketCleanup()
 * 3. Conditional includes and definitions for those macros
 *    allow operation under Windows and BSD-style sockets.
 */
#ifdef WIN32 /* Windows systems */

#include <winsock.h>
#include <io.h>
#include <stdlib.h> /* exit(), used by SocketStartup() */
#define readsocket(a,b,c) recv(a,b,c,0)
#define writesocket(a,b,c) send(a,b,c,0)
/* closesocket() does not need a macro. INVALID_SOCKET, SOCKET_ERROR, 
 * SOCKET, and INADDR_NONE are already defined in winsock.h
 */
WSADATA libmibWSAdata;
/* Wrapped in do/while so the macro is safe inside an if/else */
#define SocketStartup() do { if (WSAStartup(0x101,&libmibWSAdata)) exit(-1); } while (0)
#define SocketCleanup() WSACleanup()

#else /* Unix-style systems */

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <unistd.h> /* read(), write(), close() */

#define readsocket read
#define writesocket write
#define closesocket close
#define SocketStartup()
#define SocketCleanup()
#define INVALID_SOCKET -1
#define SOCKET_ERROR -1
#define SOCKET int
/* define INADDR_NONE if not already */
#ifndef INADDR_NONE
#define INADDR_NONE ((unsigned long) -1)
#endif

#endif

Listing Two

/* Example use of HTTPaccess_ subroutines to retrieve via HTTP/1.1.
 * This code supports automatic retries and fallback to HTTP/1.0
 */
char *pszProxy = 0; /* Set to hostname if proxy should be used */
char *pszHost = "host.domain.com"; 
int nPort = 80; /* Default HTTP port */
char *pszURI = "/source/packing.lst";
char *pszDest = "./packing.lst";
struct HTTPaccess_s theHTA; /* Holds state across multiple requests */
char *pszReason; /* Gets error return of the form X-NNN-Message */
char buf[8192]; /* Retrieve fails if all headers don't fit in this buffer */
FILE *fDEST;

SocketStartup(); /* Needed once per application. */
HTTPaccess_Initialize(&theHTA, pszProxy, pszHost, nPort, buf, sizeof(buf));

/* Once initialized, theHTA can be used to make more than one request.
 * This example shows just one request, with retries. 
 */
theHTA.cAttempt = 0;
while(1) { /* Will loop and retry until success, or too many attempts */
  /* Open (and truncate) the destination file for each attempt */
  fDEST = fopen(pszDest,"wb");
  if (!fDEST) {
    fprintf(stderr, "Terminated. Could not open %s for writing\n",pszDest);
    exit(-1);
  }
  pszReason = HTTPaccess_Retrieve(&theHTA, pszURI, fnWriteFile, fDEST); 
  fclose(fDEST);
  if (!pszReason) break; /* success */
  if (atoi(pszReason+2)!=3) {
    fprintf(stderr, "Terminated. Could not transfer %s\n",pszDest);
    exit(-1);
  }
  /* loop to retry */
}
closesocket(theHTA.s);
SocketCleanup(); /* Needed once per application. */


Copyright © 1999, Dr. Dobb's Journal
