The HTTPsync Incremental Update Utility

HTTPsync is client-side-only software that performs fast and efficient incremental updates to synchronize collections of files. It uses only the standard features of HTTP.


July 01, 1999
URL:http://www.drdobbs.com/web-development/the-httpsync-incremental-update-utility/184410990


Forrest provides consulting services and software development through Mib Software. He can be contacted at forrest@mibsoftware.com.


Software developers with collections of files to share and distribute often use a compressed archive format such as ZIP or tar.gz for transfer. But compressed archives are inconvenient to browse, and they waste bandwidth when only updates are needed. Even in a rapidly evolving open-source software project, active development usually affects a very small subset of files. When an update is needed, therefore, it is unlikely that compression will make up for the unnecessary transfer of a large number of unchanged files also in the archive. Storing a collection as individual files makes them much easier to browse and maintain. Only files that change need to be updated.

HTTPsync, the utility I present here, is client-side-only software that performs fast and efficient incremental updates to synchronize collections of files. Only the standard features of HTTP are used. No special protocols, server software, or daemons are needed. HTTPsync, available electronically from DDJ (see "Resource Center," page 5) and from http://www.mibsoftware.com/, is implemented as a C source file that compiles for Windows and UNIX-like systems.

Using HTTP for File Distribution

Hypertext Transfer Protocol (HTTP) is a good base protocol for file distribution. A single stream socket connection is used to make a request and return the data. This simplifies the implementation of clients and servers. In comparison, FTP requires a control connection and a separately arranged data connection. CVSup also uses a two-connection approach, with a special protocol. Special protocols and ports are often blocked by firewalls that are configured to pass only HTTP requests and data.

HTTP servers are ubiquitous and available for just about every operating platform. Nearly every software developer can publish web pages at little or no extra cost. The general-purpose nature and simple protocol allow the design of HTTP servers for efficient handling of large numbers of requests, as well as caching intermediaries and proxies. HTTP/1.1 supports multiple requests and data returns over a single connection, which avoids connection setup and tear-down overhead when a large number of files must be transferred.

In contrast, many popular file distribution systems, including CVSup, Sup, track, rdist, cvs, and rsync, require privileged-port server-side daemons, which are not easily available for all platforms. CVSup is implemented in Modula-3, for example. A survey of other systems and comparison to HTTPsync is available at http://www.mibsoftware.com/httpsync/.

HTTP does lack methods that some specially designed protocols and software use to implement efficient incremental updates. There is no method to determine which files are part of a collection or which have changed. The "If-Modified-Since" request header allowing conditional transfer is ignored by some servers (the protocol allows this if the data is always sent). Even if all servers supported this header, making a separate request for each file would be an inefficient way to find 10 changed files in a collection of 500. The "Last-Modified" response header, which provides time stamp information, is not always provided reliably. Read/write/execute permissions are not handled within the protocol at all. HTTPsync shows a way to overcome these limitations with client-side-only software. No modifications to the protocol or the server are required.
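
For reference, the conditional-transfer header mentioned above can be sketched as follows. This is only an illustration of the header format, not code from HTTPsync; the helper name format_if_modified_since is hypothetical, and the sketch assumes the default "C" locale so that day and month names come out in English.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical helper (not part of HTTPsync): writes an
 * "If-Modified-Since" header line for the given time into buf.
 * Returns the number of characters written, or 0 if the time is
 * invalid or the buffer is too small. */
size_t format_if_modified_since(char *buf, size_t size, time_t since)
{
    struct tm *gmt = gmtime(&since);   /* HTTP dates are always GMT */

    if (!gmt)
        return 0;
    /* RFC 1123 date format, e.g. "Tue, 21 Jul 1998 10:38:58 GMT" */
    return strftime(buf, size,
                    "If-Modified-Since: %a, %d %b %Y %H:%M:%S GMT\r\n",
                    gmt);
}
```

A client relying on this header would still need one request per file, which is the inefficiency the packing list avoids.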

A Client-Side-Only Solution

To distribute a collection of files using HTTPsync, you store on the server a packing list that describes the collection. For all the files in the collection, the packing list includes path names relative to the current directory, sizes, time stamps, and read/write/execute permissions. HTTPsync first obtains the packing list specified at run time by a URL, then compares it to the local status of files to determine which are needed for incremental update. HTTP GET requests are made, and the files are stored with the time stamp and status bits provided by the packing list. The ownership of the files is naturally whatever user is running HTTPsync.
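
The comparison step might look something like the sketch below. It assumes that a file is fetched when it is missing locally or when its size or time stamp differs from the packing list entry; the article does not spell out the exact test, so the struct layout and the names pack_entry, entry_differs, and needs_fetch are all hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* One already-parsed file entry from a packing list (hypothetical layout). */
struct pack_entry {
    const char *path;   /* path relative to the current directory */
    long size;          /* size in bytes */
    time_t mtime;       /* modification time, GMT */
    unsigned mode;      /* permission bits, e.g. 0644 */
};

/* Assumed rule: fetch when the local file is missing, or when its size
 * or time stamp differs in either direction. Returns 1 to fetch. */
int entry_differs(const struct pack_entry *e, int exists,
                  long local_size, time_t local_mtime)
{
    if (!exists)
        return 1;
    if (local_size != e->size)
        return 1;
    if (local_mtime != e->mtime)
        return 1;
    return 0;
}

/* Looks up the local file with stat() and applies the rule above. */
int needs_fetch(const struct pack_entry *e)
{
    struct stat st;

    if (stat(e->path, &st) != 0)
        return entry_differs(e, 0, 0, 0);   /* missing or unreadable */
    return entry_differs(e, 1, (long)st.st_size, st.st_mtime);
}
```

Testing for any difference, rather than "remote is newer," matches the stated behavior that the local copy is synchronized to the remote copy exactly.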

HTTP/1.1 persistent connections are used when possible, but a fallback to HTTP/1.0 happens after a small number of attempts. (Poorly implemented caches and proxies can cause this, even if the source server supports HTTP/1.1.)
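
One way to picture the fallback is a request formatter that switches protocol version once a retry threshold is reached. This is a sketch under assumptions, not the HTTPaccess_ implementation: the function name format_get_request and the threshold HTTP11_MAX_ATTEMPTS are hypothetical, and the article says only that the fallback happens after "a small number" of attempts.

```c
#include <stdio.h>

/* Assumed threshold; the real value used by HTTPsync is not given. */
#define HTTP11_MAX_ATTEMPTS 2

/* Hypothetical helper: formats a GET request, using HTTP/1.1 for early
 * attempts and HTTP/1.0 afterward. Returns the request length, or -1
 * if the buffer is too small. */
int format_get_request(char *buf, size_t size, const char *host,
                       const char *path, int attempt)
{
    /* HTTP/1.1 connections are persistent by default; HTTP/1.0 means
     * one request per connection. */
    const char *version =
        (attempt < HTTP11_MAX_ATTEMPTS) ? "HTTP/1.1" : "HTTP/1.0";
    int n = snprintf(buf, size, "GET %s %s\r\nHost: %s\r\n\r\n",
                     path, version, host);

    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

The Host header is mandatory in HTTP/1.1 and harmless in HTTP/1.0, so the sketch sends it in both cases.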

From this design, the implementation is straightforward and the software is written for use on UNIX-like and Windows systems. As a simple client, HTTPsync has none of the advanced features found in other incremental file distribution protocols. If you're used to working with CVSup, it is important to note that HTTPsync does not merge local changes -- rather, it synchronizes the local copy to the remote copy exactly. Local changes will be overwritten regardless of source and destination time stamp ordering. Merges with local changes can be accomplished with CVS, patch, or other software external to HTTPsync.

Common Code For WinSock and BSD Sockets

With a few conditional defines and code blocks, HTTPsync compiles cleanly on WinSock and BSD systems; see Listing One. The most important difference between the systems is that WinSock sockets are not file descriptors. HTTPsync is written using macros readsocket, writesocket, and closesocket, which are defined conditionally to be recv, send, and closesocket for WinSock, and read, write, and close for BSD sockets. The WIN32 preprocessor definition controls which header files are included and the WSAStartup and WSACleanup calls needed under WinSock. This organization allows the sockets code to be platform independent.

There are some other differences in file handling controlled by WIN32 as well. Under the MS-DOS FAT file systems, time stamps have a two-second resolution, handled in HTTPsync with the TDIFFLIMIT macro. Under MS-DOS there are no file mode bits for groups and other classes. These are handled with the MODMASK and GENMODMASK macros. See the HTTPsync code (available electronically) for details.
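
A tolerance-based time stamp comparison could be sketched as below. The article names the TDIFFLIMIT macro but does not show its value or how it is applied, so the limit is passed as a parameter here, and both the function name same_timestamp and the "less than or equal" test are assumptions.

```c
#include <time.h>

/* Hypothetical helper: treats two time stamps as equal when they differ
 * by no more than limit seconds. A caller might pass 2 on FAT file
 * systems (two-second resolution) and 0 elsewhere; those values are
 * assumptions, not taken from the HTTPsync source. */
int same_timestamp(time_t a, time_t b, int limit)
{
    double diff = difftime(a, b);   /* portable difference in seconds */

    if (diff < 0)
        diff = -diff;
    return diff <= (double)limit;
}
```

Without some tolerance, a file stored on FAT with an odd-second time stamp would appear changed on every run.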

Reusable Routines

In early versions of HTTPsync, I used HTTP/1.0 transfers (one request per connection) only. Traffic and performance studies showed that using persistent HTTP/1.1 connections would provide significant improvement. The routines with names beginning with "HTTPaccess_" implement DNS lookup, requests through proxies, and persistent HTTP/1.1 connections with fallback to HTTP/1.0 if problems are detected. These functions will be useful in other projects. Listing Two shows the use of these functions to make multiple requests to a server. In summary, a call to HTTPaccess_OpenConn() is followed by a call to HTTPaccess_Retrieve() for each request.

HTTPsync includes the source code for two functions not provided in all standard libraries. The reusable software directory at http://www.mibsoftware.com/reuse/ was used to locate and include source code for these functions. HTTPsync needs to parse and convert numbers in base 10 (dates and sizes), base 8 (file modes), and base 16 (HTTP/1.1 chunked-transfer encoding). The "Integer Conversion" topic of the directory led to strtol() sources from the FreeBSD cvsweb; strtol() converts numbers in any base from 2 to 36. Also, there is no portable and reliable system call to convert a GMT date to a time_t. The mktime() function is close, but does not handle time zones deterministically. The "Calendar and Time" topic listed tm_to_time.c from the comp.sources.unix archive, which was modified slightly to remove time zone handling.
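
Both needs can be illustrated with a short sketch. It is not the strtol() or tm_to_time.c source that HTTPsync bundles; parse_number and gmt_to_time are hypothetical names, and the date conversion assumes years from 1970 onward with the month given as a number from 1 to 12.

```c
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Parses a complete string as a number in the given base (2 to 36).
 * Returns 0 on success, or -1 on empty input, trailing characters,
 * or overflow. Callers must isolate the numeric token first. */
int parse_number(const char *s, int base, long *out)
{
    char *end;

    errno = 0;
    *out = strtol(s, &end, base);
    return (end == s || *end != '\0' || errno == ERANGE) ? -1 : 0;
}

static int is_leap(int y)
{
    return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

/* Number of leap years in the range 1 .. y-1. */
static long leaps_before(int y)
{
    y--;
    return y / 4 - y / 100 + y / 400;
}

/* Converts a GMT calendar date to seconds since 1970-01-01 00:00:00 GMT
 * without consulting the local time zone. Assumes year >= 1970 and
 * mon in 1..12; no range checking is done. */
time_t gmt_to_time(int year, int mon, int day, int hh, int mm, int ss)
{
    static const int days_before_month[12] =
        { 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334 };
    long days = (year - 1970) * 365L
              + (leaps_before(year) - leaps_before(1970))
              + days_before_month[mon - 1]
              + (mon > 2 && is_leap(year))   /* Feb 29 already passed */
              + (day - 1);

    return (time_t)days * 86400 + hh * 3600L + mm * 60 + ss;
}
```

With these helpers, a size is parsed with base 10, a file mode such as 644 with base 8, and a chunk length with base 16.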

The HTTPsync Packing List

You generate an HTTPsync packing list with the -m command-line parameter, providing a list of files on the standard input, one per line. There are four types of lines in an HTTPsync packing list, determined by the initial characters starting the line. Example 1 shows a sample packing list.
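
Dispatching on the initial characters might be sketched as follows. The four categories here are inferred from the sample in Example 1 (a "#-#" header line, "#" comments, "O" for an obsolete file, and file entries beginning with "./"); they are an assumption rather than a statement of the actual format rules, and the names line_type and classify_line are hypothetical.

```c
#include <string.h>

/* Line categories inferred from Example 1 (assumed, not authoritative). */
enum line_type { LINE_HEADER, LINE_COMMENT, LINE_OBSOLETE, LINE_FILE };

/* Classifies one packing list line by its leading characters.
 * Assumes file entries start with "./", as they do in Example 1, so
 * that a leading 'O' can only mean an obsolete-file line. */
enum line_type classify_line(const char *line)
{
    if (strncmp(line, "#-#", 3) == 0)
        return LINE_HEADER;     /* must be tested before plain '#' */
    if (line[0] == '#')
        return LINE_COMMENT;
    if (line[0] == 'O')
        return LINE_OBSOLETE;   /* path follows the 'O' */
    return LINE_FILE;           /* path, size, date, mode */
}
```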

HTTPsync in Action

My first application of HTTPsync was to provide an alternate method of obtaining the InterNetNews development tree, a collection of approximately 500 files totaling more than 4 MB uncompressed (1 MB when compressed). The tree was already available to those with CVSup clients and via FTP as compressed archives.

Over a 10-week period beginning August 11, 1998, HTTPsync was used to synchronize to the INN 2.2 development tree a total of 359 times. For that time period, running HTTPsync once per day resulted in an average transfer of 160 KB, including the 26 KB packing list, which is always transferred. With daily incremental updates, HTTPsync transferred 84 percent less data than FTP of compressed archives would have.
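
The 84 percent figure is consistent with the sizes quoted above, as this small calculation suggests. It assumes the comparison is against fetching the full 1 MB compressed archive once per day and takes 1 MB as 1024 KB; percent_saved is a hypothetical helper name.

```c
/* Percentage of data saved by an incremental transfer relative to
 * fetching the whole archive. Both sizes are in the same units. */
double percent_saved(double incremental_kb, double archive_kb)
{
    return 100.0 * (1.0 - incremental_kb / archive_kb);
}
```

Calling percent_saved(160.0, 1024.0) gives a little over 84 percent.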

Conclusion

The portable design of HTTPsync and use of basic standards allow use in a much wider range of situations than other incremental update distribution systems. Individual software developers can make source code collections available as individual files on standard web servers, and HTTPsync allows incremental synchronization. With personal web servers for PC-level systems, you can keep a master collection on a small or home machine, and efficiently transfer it to target systems. The implementation itself can be reused in writing other HTTP/1.1 clients for BSD and WinSock systems. Some of the features that may be included in future versions of HTTPsync are data compression, using HTTP Authentication for requests, partial file transfers (only the changed parts of files), and "pipelined" HTTP/1.1 (requesting the next file before the current transfer is complete).

DDJ

Listing One

/**********************************************************/
/* Macros and environment for portable sockets code
 * 1. Code is written using readsocket, writesocket, closesocket,
 *    INVALID_SOCKET, SOCKET_ERROR, SOCKET, and INADDR_NONE
 * 2. Always use SocketStartup() and SocketCleanup()
 * 3. Conditional includes and definitions for those macros
 *    allow operation under Windows and BSD-style sockets.
 */
#ifdef WIN32 /* Windows systems */

#include <stdlib.h> /* exit() */
#include <winsock.h>
#include <io.h>
#define readsocket(a,b,c) recv(a,b,c,0)
#define writesocket(a,b,c) send(a,b,c,0)
/* closesocket() does not need a macro. INVALID_SOCKET, SOCKET_ERROR, 
 * SOCKET, and INADDR_NONE are already defined in winsock.h
 */
WSADATA libmibWSAdata;
/* 0x101 requests WinSock version 1.1. The do/while wrapper keeps the
 * macro safe when used inside an unbraced if/else.
 */
#define SocketStartup() \
    do { if (WSAStartup(0x101,&libmibWSAdata)) exit(-1); } while (0)
#define SocketCleanup() WSACleanup()

#else /* Unix-style systems */

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <unistd.h> /* read(), write(), close() */

#define readsocket read
#define writesocket write
#define closesocket close
#define SocketStartup()
#define SocketCleanup()
#define INVALID_SOCKET (-1)
#define SOCKET_ERROR (-1)
#define SOCKET int
/* define INADDR_NONE if not already */
#ifndef INADDR_NONE
#define INADDR_NONE ((unsigned long) -1)
#endif

#endif

Listing Two

/* Example use of HTTPaccess_ subroutines to retrieve via HTTP/1.1.
 * This code supports automatic retries and fallback to HTTP/1.0
 */
char *pszProxy = 0; /* Set to hostname if proxy should be used */
char *pszHost = "host.domain.com"; 
int nPort = 80; /* Default HTTP port */
char *pszURI = "/source/packing.lst";
char *pszDest = "./packing.lst";
struct HTTPaccess_s theHTA; /* Holds state across multiple requests */
char *pszReason; /* Gets error return of the form X-NNN-Message */
char buf[8192]; /* Retrieve fails if all headers don't fit in this buffer */
FILE *fDEST;

SocketStartup(); /* Needed once per application. */
HTTPaccess_Initialize(&theHTA, pszProxy, pszHost, nPort, buf, sizeof(buf));

/* Once initialized, theHTA can be used to make more than one request.
 * This example shows just one request, with retries. 
 */
theHTA.cAttempt = 0;
while(1) { /* Will loop and retry until success, or too many attempts */
  /* Open the destination file for writing before each attempt */
  fDEST = fopen(pszDest,"wb");
  if (!fDEST) {
    fprintf(stderr, "Terminated. Could not open %s for writing\n",pszDest);
    exit(-1);
  }
  pszReason = HTTPaccess_Retrieve(&theHTA, pszURI, fnWriteFile, fDEST); 
  fclose(fDEST);
  if (!pszReason) break; /* success */
  if (atoi(pszReason+2)!=3) {
    fprintf(stderr, "Terminated. Could not transfer %s\n",pszDest);
    exit(-1);
  }
  /* loop to retry */
}
closesocket(theHTA.s);
SocketCleanup(); /* Needed once per application. */


Copyright © 1999, Dr. Dobb's Journal


#-#httpsync 101 Packing List for httpsync 1.01
# Visit the httpsync home page: http://www.mibsoftware.com/httpsync/
#
# This list was created from a simple list of files, relative
# to '.', one per line.
./astring/astring.h 4934 Tue, 21 Jul 1998 10:38:58 GMT 644
#
# Next is an obsolete file.
O./include/comm.h
./include/tsd.h 57 Mon, 17 Aug 1998 14:59:29 GMT 664

Example 1: Use HTTPsync -m to generate packing lists.

