
Scalable Multicast File Distribution


Dr. Dobb's Journal, May 2000

Fcast offers a multicast solution to overloaded web servers

Jim is a researcher at the Microsoft Bay Area Research Center (http://research.microsoft.com/barc/).




A couple of years ago, students at the University of Washington were surprised to find their Internet access had all but shut down. They must have wondered what had happened. Was a T1 line down? Did some network equipment fail? In fact, what had happened was the release of Internet Explorer (IE) 3.0 from nearby Microsoft. The resulting frenzy of downloads saturated network links in the region to the point that nearly nothing else could get through. Meanwhile, the Microsoft web servers limped along in a state of overload. Similar overloads have occurred with the NASA Pathfinder vehicle landing on Mars, the Kasparov versus Deep Blue chess match, and the Starr report.

Think about the nature of the data overloading the servers and network in these cases: It is just the same set of data being sent out over and over. Surely there must be a better way! There is, and in this article I describe a file distribution mechanism called "Fcast" that uses IP multicast and Forward Error Correction (FEC) to serve extremely large numbers of clients with minimal loads for servers and networks. If IP multicast had been universally deployed in time for the release of IE 3.0, a single server with a modem Internet connection could have used Fcast to distribute the software to every modem user in the world. In contrast, microsoft.com employs 44 servers with a 1.2 Gbits/sec. Internet connection, and overload is still a threat.

Reliable Multicast

Web servers use the TCP/IP protocols to deliver files over the Internet. IP is a best-effort protocol -- it tries its best to get your data to its destination, but there are no guarantees. Data may be discarded en route and IP will do nothing about it. It will not even notify the sender that there has been a problem. TCP is built on top of IP to provide reliability. With TCP, receivers communicate with senders so that losses are detected and data is resent. When packets are received, an acknowledgment (ACK) is transmitted to the sender.

To send the same file to many receivers using TCP/IP, a server must repeatedly transmit the same data, and network links and routers will see the same data pass over them many times. In contrast, IP multicast allows a server to send the data once, and the data passes over a given link one time. It is the responsibility of the routers to make copies of the data to the appropriate links. Links that have receivers downstream receive copies. Links that have no receivers downstream do not.

Major router manufacturers such as Cisco and 3Com have been making IP multicast-capable routers for several years now. Many routers require only a software upgrade to handle IP multicast. However, some network operators have chosen to disable IP multicast. Naturally, it is their job to be cautious, so they will only start using something new if there is a lot of demand (and after a lot of testing). Also, some ISPs want to know how they can make extra money from IP multicast before they enable it. Over the past few years, multicast deployment has been increasing rapidly in enterprises, and is beginning to be supported by major backbones such as Sprint and UUNet. To find out if your network supports IP multicast, you should contact your network administrator or ISP.

While IP multicast is very powerful and efficient, it is only best effort, just like unicast IP. A reliable protocol, analogous to TCP, must be built on top of IP multicast to transmit data such as files. Researchers have proposed a slew of reliable multicast protocols, and some have even been made into commercial products.

The great challenge for reliable multicast protocols is scalability. Many of the protocols are designed to deal with tens or hundreds of receivers. Those protocols that deal with even a thousand have been hailed as scalable. However, this falls far short of what IP multicast can deliver, and even at these modest numbers there are often limitations: performance degradations, requirements for cache deployment and maintenance, or the need to modify routers (which, unless you own a router company, isn't usually an option).

The primary barrier to scalability of these protocols is feedback from the receivers to senders in the form of acknowledgments (ACKs) or negative acknowledgments (NACKs). If many receivers generate feedback, the resulting flood of messages -- known as "implosion" -- may overload the source or the links leading to it. Fcast has a simple solution to this: It has no feedback from the receivers at all.

The roots of Fcast are in the data carousel approach, in which the sender simply multicasts the file repeatedly (see S. Acharya, M. Franklin, and S. Zdonik, "Dissemination-Based Data Delivery Using Broadcast Disks," IEEE Personal Communications, December 1995). The receiver is able to reconstruct missing components of a file by waiting for it to be transmitted again in the next loop without having to request retransmissions. With no feedback from receivers, implosion is avoided. The drawback is efficiency. Suppose you missed only the last packet of the file. You would need to wait for the whole file to be repeated to get the missing packet (and pray you don't miss it again).

Another possibility is to combine Forward Error Correction (FEC) with a data carousel to improve efficiency, and it is this idea that Fcast is based on. Before I get into how this is done, I'll explain how the FEC works.

FEC in Fcast

FEC generally refers to error correction; that is, the ability to detect and repair bit-level corruption. However, an IP multicast application never receives corrupted data packets, because they are detected and discarded by IP or even lower network layers. Therefore, to be more precise, it is not error correction that must be done; it is erasure correction.

Fcast uses a type of erasure correction called an (n,k) linear block code. You start with k source blocks. These are blocks of data from a file, as distinguished from packets, which are data blocks with an attached header. These are then encoded into n blocks, where n>k. If you manage to receive any k of these n blocks, then they can be decoded to obtain the original k blocks, as in Figure 1. The encoded blocks are the same size as the source blocks.

Parity, which you are probably familiar with, is a (k+1,k) linear block code. Parity is an example of what's known as a "systematic" code, where the first k of the n encoded blocks are just copies of the source blocks. Fcast also uses a systematic code.
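
Parity makes this concrete enough to sketch in code. Below is a minimal illustration of a systematic (k+1,k) code in Python (Fcast itself is not written in Python, and the function names here are mine): the first k encoded blocks are the source blocks themselves, the last is their XOR, and any one lost block can be rebuilt from the k survivors.

```python
# Sketch of a systematic (k+1, k) parity erasure code: any k of the
# k+1 encoded blocks suffice to recover the k source blocks.

def encode(source_blocks):
    """Return the k source blocks followed by one XOR parity block."""
    parity = bytes(len(source_blocks[0]))
    for block in source_blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return source_blocks + [parity]

def decode(received, k):
    """received maps encoded-block index -> block and holds at least k
    of the k+1 blocks; indices 0..k-1 are source, index k is parity."""
    missing = next((i for i in range(k + 1) if i not in received), None)
    if missing is None or missing == k:
        return [received[i] for i in range(k)]   # all sources present
    # XOR of the parity and the k-1 surviving sources restores the loss.
    repaired = bytes(len(next(iter(received.values()))))
    for block in received.values():
        repaired = bytes(a ^ b for a, b in zip(repaired, block))
    received[missing] = repaired
    return [received[i] for i in range(k)]

blocks = [b"AAAA", b"BBBB", b"CCCC"]                     # k = 3
survivors = {i: blk for i, blk in enumerate(encode(blocks)) if i != 1}
assert decode(survivors, k=3) == blocks                  # block 1 recovered
```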

There are Reed-Solomon-based (n,k) codes that are efficient enough to be run on PCs. For example, Rizzo's public-domain codec can code and decode data at 90 Mbits/sec. on a 133-MHz Pentium processor (see L. Rizzo and L. Vicisano, "Effective Erasure Codes for Reliable Computer Communication Protocols," ACM SIGCOMM Computer Communication Review, Vol. 27, No. 2, April 1997). However, to achieve speeds like this, k and n must be limited, because the encoding becomes too computationally expensive with large values. (n,k)=(255,64) are typical limits.

An alternative approach is to use Tornado codes, which are based on bipartite graphs (see J.W. Byers, M. Luby, M. Mitzenmacher, and A. Rege, "A Digital Fountain Approach to Reliable Distribution of Bulk Data," Proceedings ACM SIGCOMM '98, September 1998). A Tornado code may require slightly more than k blocks to reconstruct the original k blocks, but the value of k may be on the order of tens of thousands. It is beneficial to increase the value of k as much as possible. However, Tornado codes are proprietary, while Reed-Solomon codes exist in the public domain. Furthermore, the additional efficiency due to Tornado codes, while noticeable, is not so significant compared to the orders of magnitude saved by switching from unicast to multicast.

Multicasting Files Using FEC

Because many files will be longer than k blocks, they must be divided into erasure-correcting (EC) groups of k blocks each. The linear code can be used to obtain n encoded blocks from each group. Each encoded block is identified by an index specifying which of the n encoded blocks it is, as well as a group identifier indicating what EC group it is from.

The order in which blocks are transmitted is important. Suppose, for example, that all n blocks were sent from one group before any were sent from the next. A receiver needing a block from just one group might have to wait through n blocks from every other group to get what it needs -- a situation worse than a simple data carousel. (With a simple data carousel, you wait through at most the whole file; n blocks for every group usually amounts to more than the whole file, since n>k.) To avoid this, the sender sends all blocks with index i before sending any blocks with index i+1. Now a receiver waits through at most one block from each other group before reaching the group it needs. When block n of the last group of the last file is sent, the transmission cycles; see Figure 2.

One subtle danger is that periodic network losses may become synchronized with the transmission so as to always impact blocks from certain groups. For example, suppose that some scheduled network activity makes the network lose packets once every minute and that Fcast just happens to be sending from group 5 every time this happens -- you will never get group 5. To eliminate this possibility, the order of groups for each index is selected randomly. Now a periodic loss will impact a random group each time. A nice feature of this is that any losses, whether bursty or sporadic, are spread randomly across groups, so from the receiver's point of view losses might as well be random. Random loss characteristics make it easy to analyze the performance of Fcast.
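
In code, the sender's transmission order comes down to a few lines. Here is a sketch (the names are mine; the real sender also handles multiple files and the packet headers described later):

```python
import random

def carousel_schedule(num_groups, n):
    """Yield (group, index) pairs forever. All blocks with index i are
    sent before any block with index i+1, and the group order is
    re-randomized for each index so periodic losses hit random groups."""
    while True:                            # the carousel cycles endlessly
        for index in range(n):
            groups = list(range(num_groups))
            random.shuffle(groups)         # fresh random group order
            for group in groups:
                yield (group, index)       # sender transmits this block
```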

To complete the reception, k distinct blocks (that is, with different index values) must be received from each group. Whenever more than k blocks are received for a group, the extra blocks must be discarded. Naturally, receiving redundant blocks means the transfer is less efficient. For example, a receiver may need only a single block from a single group to complete reception, but may have to wait through blocks from all the other groups until the desired group has a packet sent.
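
The completion test is simple per-group bookkeeping. Here is a sketch of one way a receiver might track it (the class and method names are my own, not from the Fcast sources):

```python
class GroupTracker:
    """Tracks which block indices have arrived for each EC group.
    A group is done once k distinct indices are held; extras and
    duplicates are discarded."""

    def __init__(self, num_groups, k):
        self.k = k
        self.seen = [set() for _ in range(num_groups)]

    def accept(self, group, index):
        """Return True if this block is new and still needed."""
        if len(self.seen[group]) >= self.k:
            return False            # group already complete: discard
        if index in self.seen[group]:
            return False            # duplicate index: discard
        self.seen[group].add(index)
        return True

    def complete(self):
        """True when every group holds k distinct blocks."""
        return all(len(s) >= self.k for s in self.seen)
```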

In the case of the simple data carousel, the inefficiency was related to the size of the file. For the FEC carousel, the inefficiency is related to the size of the file divided by k -- that is, to the number of EC groups. Figure 3 plots the time spent receiving against the probability of complete reception for several values of k. The ideal transmission has k set to the number of blocks in the file, so that there is only one EC group. In such an ideal transmission, the receiver would be finished once it had received as many blocks as are in the file -- notice that a unicast method couldn't do any better than this. Figure 3 shows that reduced values of k noticeably increase the transmission time over the ideal. However, the time is still much less than double the ideal, and considering that for this price you get to serve as many clients as you can manage to connect via IP multicast, it seems like a really good deal. How much does your file server slow down with 100,000 clients connected?

The flip side of this is that for a fixed value of k, increasing the file size will increase the number of groups. Figure 4 shows the time expected to complete the reception for file sizes up to 1 GB, with k=32.

Disk Performance Issues

When work on Fcast began, we were trying to solve an Internet distribution problem. In that context, we couldn't imagine the network outpacing the disk. However, when Fcast was applied to distributing a 200+ MB software update on our corporate LAN (the NT build; see the accompanying text box "What About Caching?") we found that the receiver's disk couldn't keep up with the transmission.

The problem was that the receiver was using a method I call "group-area caching" (see Figure 5). Blocks are written to the area of the disk corresponding to the group they belong to, in the next available space. That is, a block from group g is written in the area beginning at block offset gk. When k packets have been received from each group, a single pass is made over the file in which each group is read, decoded, and written back out. Unfortunately, this means that as each packet arrives, the disk must seek to that packet's group area to write it (recall that the sender sends a packet from each group in turn). All this seeking makes writing very slow.
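
The offset arithmetic shows why: each arriving packet lands in a different group's region of the file. A sketch, assuming fixed-size blocks (the helper name is mine):

```python
def group_area_offset(group, slot, k, block_size):
    """Group g owns the region starting at block offset g*k; an
    arriving block goes into the group's next free slot within it."""
    return (group * k + slot) * block_size

# Per received block from group g (one seek per packet -- the problem):
#   f.seek(group_area_offset(g, next_slot[g], k, BLOCK_SIZE))
#   f.write(block)
#   next_slot[g] += 1
```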

There are two obvious ways to get around all this seeking. One is to simply cache everything in memory until it is decoded. Of course, this means your files can't be any bigger than the memory of your receivers, which isn't really practical. Another way is to write all the blocks sequentially to a file on disk in the order they are received. When k packets have been received for each group, the file is sorted into group order; then you can pass over the file, reading, decoding, and writing each group as with the group-area method. This certainly allows fast reception, but adds a significant time penalty to the postprocessing phase of the reception. Also, a typical external sort requires disk space equal to twice the file size.

To achieve improved write performance, while avoiding a full file sort, I developed the crowd-pool scheme. To implement it, the receiver treats multiple adjacent groups of blocks as a "crowd," as in Figure 6. The sender has no awareness of crowds or anything else the receiver is doing; it just sends as it normally would. The receiver writes blocks to a temporary file, just as it would in the group-area scheme, but rather than writing each block to the next available position in its group, it writes it to the next available position in its crowd. When k packets have been received for all groups, the receiver makes a single pass over the file, reading in each crowd, sorting it into its groups (in memory), decoding each group, and writing out the final result. So crowd-pool is like group-area in that it makes only a single pass over the file in postprocessing, and in that it requires only as much disk space as the size of the file plus the metadata (explained later).

To avoid seeking for every block written, crowd-pool maintains a pool of block-sized buffers. Suppose that the goal is to write b blocks at a time. Whenever a crowd has more than b blocks buffered in the pool, it writes b blocks to disk. The write is performed as a single contiguous write in the next available space in the crowd. Whenever all crowds have fewer than b blocks buffered, any crowd may be chosen arbitrarily and its blocks written to disk. By carefully choosing the size of crowds and the size of the buffer pool, you can ensure that crowd-pool keeps pace with any application writing b blocks at a time. See http://research.microsoft.com/scripts/pubDB/pubsasp.asp?RecordID=244 for the gory details. Here, I will just mention that it turns out that the memory needed for crowd-pool is approximately the square root of the file size multiplied by b. Figure 7 shows the memory required for various file sizes with b=16.
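
A stripped-down rendition of the crowd-pool writer might look like the sketch below. It flushes a crowd as soon as it has buffered b blocks, and omits the fixed-size pool, the arbitrary-eviction case, and the final sort-and-decode pass; all names are mine.

```python
class CrowdPool:
    """Buffers incoming blocks per crowd and flushes b of them as one
    contiguous write into the crowd's next free region, so the disk
    seeks roughly once per b blocks instead of once per block."""

    def __init__(self, f, b, num_crowds, blocks_per_crowd, block_size):
        self.f = f
        self.b = b
        self.block_size = block_size
        self.blocks_per_crowd = blocks_per_crowd
        self.pending = [[] for _ in range(num_crowds)]  # pool buffers
        self.written = [0] * num_crowds                 # blocks on disk

    def add(self, crowd, block):
        self.pending[crowd].append(block)
        if len(self.pending[crowd]) >= self.b:
            self._flush(crowd)

    def finish(self):
        """Flush any leftovers (fewer than b blocks per crowd)."""
        for crowd in range(len(self.pending)):
            if self.pending[crowd]:
                self._flush(crowd)

    def _flush(self, crowd):
        """One seek, then the buffered blocks written contiguously."""
        offset = crowd * self.blocks_per_crowd + self.written[crowd]
        self.f.seek(offset * self.block_size)
        for block in self.pending[crowd]:
            self.f.write(block)
        self.written[crowd] += len(self.pending[crowd])
        self.pending[crowd].clear()
```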

For an idea of the kind of results you can get, here is what we measured on a 266-MHz Pentium running Windows 98 with an IDE hard drive: Using group-area, the maximum rate was 1.2 Mbits/sec. Crowd-pool with b set to 4 increased the rate to 2.7 Mbits/sec., and with b set to 8, to 3.8 Mbits/sec.

Fcast Implementation

Besides the file itself, other information must be exchanged between the Fcast sender and receivers. A session description must be shared that contains the multicast address and port number for the transmission. Session descriptions can be attached to an e-mail, embedded in a web page, or carried in protocols like the Session Announcement Protocol (SAP) (see M. Handley, "SAP: Session Announcement Protocol," Internet Draft, IETF MMUSIC Working Group, November 1996).

File metadata, such as the file name and its attributes, also need to be communicated. Fcast does this by appending the file metadata to the end of the file. Once the file is decoded, this metadata can be read to set the file name and attributes, and the file is truncated to be the correct length.
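
The trailer mechanics are straightforward. Here is a sketch of one way to do it; the format (JSON plus a four-byte length suffix) is my own illustration, as the article does not specify Fcast's actual layout:

```python
import json
import os
import struct

def append_metadata(path, name, attrs):
    """Append a metadata trailer, then its length, to the file."""
    trailer = json.dumps({"name": name, "attrs": attrs}).encode()
    with open(path, "ab") as f:
        f.write(trailer)
        f.write(struct.pack("<I", len(trailer)))  # length goes last

def strip_metadata(path):
    """Read the trailer from a decoded file, then truncate it away."""
    with open(path, "r+b") as f:
        f.seek(-4, os.SEEK_END)
        (tlen,) = struct.unpack("<I", f.read(4))
        f.seek(-(4 + tlen), os.SEEK_END)
        meta = json.loads(f.read(tlen))
        f.seek(-(4 + tlen), os.SEEK_END)
        f.truncate()                              # back to true length
    return meta
```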

There is also some information that Fcast includes in each packet header. The group number and index value must be included in each header to identify the packet. Each file transmitted by the sender is assigned an ID number, which is included in the header. The file length is included in the header so that the receiver can allocate memory and disk space after receiving the first packet. The k value is also included, and the number of groups can then be calculated as the file length (in blocks) divided by k (the n value of the (n,k) codec is not needed by the receiver). Including the file length and k in every packet, along with appending the metainformation onto the file, makes it possible for a receiver to tune in to a well-known multicast address and port and begin receiving with no additional knowledge.
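
In code, a header along these lines might be packed as follows. The article names the fields; the ordering and widths here are my guesses for illustration only:

```python
import struct

# file_id, file_len, k, group, index: enough for a receiver to tune in
# to a well-known address and start receiving with no other knowledge.
HEADER = struct.Struct("<IQHII")   # assumed field widths

def pack_header(file_id, file_len, k, group, index):
    return HEADER.pack(file_id, file_len, k, group, index)

def num_groups(file_len, k, block_size):
    """Groups = file length in blocks, divided by k (rounded up)."""
    blocks = -(-file_len // block_size)   # ceiling division
    return -(-blocks // k)
```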

An Fcast sender multicasts a single file (usually a zip or cab file containing multiple files), looping continuously through the encoded blocks as previously described. Usually it will do this throughout a time slot designed to allow receivers to tune in at convenient times. Receivers tune in to the multicast address and cache received packets in a temporary file until they have enough blocks to recreate the file. At that point, the receiver drops out of the multicast and receives no more packets (except in satellite transmission applications, which require the receiver to be continuously receiving the satellite channel). The file is then decoded, and the file name and attributes are set.

The Fcast sender and receiver are implemented as ActiveX controls. These controls may be embedded in web pages, or embedded into applications, as in Figure 8. Each has an Init() call to initialize the Fcast with the multicast address and other transmission settings. A call to Start() begins the sending/receiving in another thread and returns control immediately. Progress of the transmission can be monitored by a call to GetStats(), and via connection points (ActiveX callback routines) that let you know when events occur, such as file completion or an error. The sender calls Stop() to end the transmission.
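
As a rough illustration of that lifecycle, a script driving the sender control might look like the sketch below. Only the Init()/Start()/GetStats()/Stop() names come from the article; the ProgID, the argument lists, and the use of Python's win32com are assumptions of mine.

```python
import time
import win32com.client

# "Fcast.Sender" is a hypothetical ProgID; the Init() signature is assumed.
sender = win32com.client.Dispatch("Fcast.Sender")
sender.Init("234.5.6.7", 5000)   # multicast address and port
sender.Start()                   # returns immediately; sends on another thread
time.sleep(3600)                 # serve a one-hour time slot
print(sender.GetStats())         # poll progress (connection-point events
sender.Stop()                    #   could be used instead of polling)
```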

Conclusion

The Fcast control is available electronically from DDJ (see "Resource Center," page 5) and at http://research.microsoft.com/barc/mbone/fcast.htm. We have used the Fcast control to download updates of the Windows 2000 build, and as part of the Multicast PowerPoint Add-in, which does live telepresentations using multicast.

The power of Fcast is astonishing. I imagine that some day Matt Drudge will Fcast breaking news to the whole world from his laptop in his hotel room. Or, a small company will write the next killer app and be able to download it to millions of customers with only one server and a cable modem.

DDJ

