TCP Programming Gotchas


October 2002


TCP is based on a stream of bytes, not messages. There is no such thing as a TCP message. When you send data via TCP, the TCP layer reserves the right to package that data up into packets any way it wants. The TCP packet may contain just the first part of an application layer (AL) message, or just the last part. The packet may contain the last part of an AL message as well as some number of complete AL messages and the first part of another AL message. The only thing TCP guarantees is that the data will be delivered in sequence and error free or you will be notified of an error.

I see many TCP-based applications written with the same fatal design flaw. It's a flaw that may not be uncovered during testing and may not be apparent when the application is first moved to production. But eventually, during times of high network utilization, over networks with high latency, or when a system is particularly busy, it will strike, and you'll be left wondering what happened.

This flaw stems from assuming that a TCP packet contains one, and only one, complete message. For example, when the sending side of an application sends a 100-byte AL message, you shouldn't assume that the TCP layer creates a packet with just those 100 bytes of data in it and sends that packet to the receiving host, which then places just those 100 bytes of data into a buffer and passes it to the receiving side of your application.

When an application contains this flaw, it exhibits some obvious symptoms. Most of the time the receiving side of the application will work fine for a while and then suddenly hang or abort with an indication of a corrupt AL message. Typically, after stopping and restarting the application it will again work fine — for a while. I've seen cases where the receiving side crashed only during a specific time of day, which turned out to correspond to a time when network usage was heaviest. There was a case where the receiving side crashed only when certain locations tried to run the application; other locations never had a problem. It all depends on what your application does when it gets a partial AL message or more than one complete AL message.

Don't Assume

The receiving side of your application must be prepared to read a buffer of data from the TCP layer and parse it to get the AL message. It may have to combine it with previously read data or hold it until more data is read.

Don't think you can get around this by making all your AL messages a fixed length and then having the receiving side of your application request exactly that number of bytes from its TCP stack. While the TCP stack will never give you more than the requested number of bytes, it may give you fewer. Even in blocking mode, a request to read data will return as soon as there is any data that can be returned to the application — it will not wait until the requested number of bytes is available.
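
For example, even with fixed-length messages, the receiving side has to loop until it has collected all of the bytes it asked for. The following is a minimal sketch of such a loop; the name read_exact is mine and does not come from the listings discussed later.

/* Read exactly len bytes from a connected socket.  recv() may legally
   return fewer bytes than requested, so keep calling until the full
   count has been collected, the connection closes, or an error occurs. */
#include <sys/types.h>
#include <sys/socket.h>

static int read_exact(int fd, char *buf, size_t len)
{
    size_t total = 0;

    while (total < len) {
        ssize_t n = recv(fd, buf + total, len - total, 0);
        if (n == 0)             /* connection closed by the peer */
            return 0;
        if (n < 0)              /* TCP-level error; errno is set */
            return -1;
        total += (size_t)n;     /* partial read; go back for the rest */
    }
    return (int)total;
}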

At this point you should have two questions. First, how could an AL message be split into multiple TCP packets or multiple AL messages combined into one TCP packet? Second, why would testing not uncover these problems? There are at least five things that can cause TCP to split or combine AL messages. These things are all related to the network environment, so if your application testing environment does not duplicate your production network environment, no amount of application layer testing will trigger these problems.

Maximum Segment Size

First, the AL message may be larger than the remote host's maximum segment size (MSS). The MSS is the largest number of data bytes that can be sent to the remote host in one TCP packet over the connection. Each side of a connection announces its MSS when the connection is established. The MSS is not negotiated; that is, it's not necessary for the two sides of a connection to agree on an MSS. Each side can use a different value but must honor the other side's announced value. For connections between hosts on the same subnet, the MSS is typically the maximum transmission unit (MTU) minus 40. The MTU is a function of the type of data link layer; for Ethernet, it is 1500. The 40 is the combined size of the IP and TCP headers. So for local connections, the MSS is typically 1460. Many TCP stacks arbitrarily reduce the MSS for connections between hosts on different subnets, typically to 536 bytes, but the exact behavior depends on the OS. If your message is longer than the MSS, it will be split up. So if the hosts in your test environment are on the same subnet while the production environment has hosts on different subnets, your test environment may be using a different MSS than your production environment.

Don't confuse this with IP fragmentation. If the networks connecting two hosts have different MTUs, the routers connecting those networks will fragment any IP datagram that is too large. The IP layer of the receiving host will reassemble the fragments before passing anything to the TCP layer, so IP fragmentation is completely invisible to the TCP layer and your application.
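
If you want to see the MSS that is actually in effect on a connection, many stacks let you query it with the TCP_MAXSEG socket option after the connection is established. This is a hedged sketch; the exact value reported, and when it becomes valid, vary by OS.

/* Print the MSS in effect on a connected socket.  TCP_MAXSEG is
   supported by many BSD-derived and Linux stacks; behavior varies. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static void print_mss(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);

    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("MSS for this connection: %d bytes\n", mss);
    else
        perror("getsockopt(TCP_MAXSEG)");
}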

The Nagle Algorithm

The second way that a message can be split up or multiple messages combined involves a TCP congestion-control mechanism called the "Nagle" algorithm. This algorithm prevents more than one packet with fewer than MSS bytes from being outstanding (that is, not acknowledged) at the same time. Let's say that the MSS is 1460 bytes and that your application is able to send multiple AL messages without waiting for a reply. Your first AL message is 1000 bytes, and it is sent as soon as your application delivers it to the TCP layer. Now, before an acknowledgment for those bytes is received, your application delivers another AL message of 1000 bytes and then another of 500 bytes to the TCP layer. TCP will send a packet with the entire second AL message and 460 bytes of the third AL message; the remaining 40 bytes of the third AL message are held at this point. If your test environment is on a LAN, which typically has low latency, it would not be unusual for an acknowledgment to be received before the application delivers the second message. But a production environment going over a WAN may have much higher latency, and it would not be unusual for an application to deliver its next AL message before the acknowledgment for the first message is received. Of course, a lot depends on your application and your networks.

You can turn off the buffering created by the Nagle algorithm by setting the TCP_NODELAY socket option. However, doing this can have a very negative effect on your network by dramatically increasing the number of packets and reducing the ratio of application layer data to overhead. Turning off Nagle should be done only after careful analysis of your entire networking environment, including all applications.
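
For reference, turning Nagle off is a single socket option on the sending socket. The sketch below shows the call; the helper name disable_nagle is mine.

/* Disable the Nagle algorithm on a connected socket.  As noted above,
   do this only after analyzing the effect on your whole network. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int disable_nagle(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}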

Retransmission

The third way that a message can be split up or multiple messages combined is from retransmission. All transmitted data remains in buffers in the TCP layer until the data is acknowledged. TCP will wait only so long for an acknowledgment. When the retransmission timer goes off, TCP may send not only the data that has not been acknowledged but also any other data that is in its buffers.

This can result in several AL messages being sent in one TCP packet; the exact behavior will depend on the TCP implementation. If you are testing in a LAN environment, the chances of a lost packet are typically much smaller than in a WAN environment, so you can expect to have fewer, maybe no, retransmissions in your test environment.

Receiving Side Gotchas

So far, all I've talked about are the ways that messages can get combined or split by the sender, but these next two happen on the receiving side. What happens if the sending side of the application is able to send the AL messages as individual packets, but it does so faster than the receiving side is reading them? Remember that the receiving side's TCP layer will get the data and acknowledge it, and sometime later, maybe a lot later, the receiving side of the application will make a call to read the data. Until the receiving side of the application makes that call, all the received AL messages are placed in an input queue by the receiving TCP layer. When the receiving application asks for N bytes of data and at least N bytes are queued, it will get N bytes. Now if all the AL messages are N bytes, you are guaranteed not to get more than one AL message, but if the AL messages vary in length, you're in trouble. One call to read N bytes will give you some number of complete AL messages and the first part of another AL message; the next call will give you the rest of the previous AL message, then some number of complete AL messages, and the first part of another. Because the latency of a LAN is typically smaller than the latency of a WAN, the probability of this scenario is higher in a LAN environment than in a WAN environment. Therefore, this is one scenario that might be easier to spot in a LAN test environment than in a WAN production environment.

This next scenario looks the same as the previous one from the perspective of the receiving side of the application, but it has a different cause, so I'm counting it separately. Let's say that the sender has sent AL messages 3, 4, 5, and 6 in packets 3, 4, 5, and 6. However, packet 4 is dropped by a congested switch (or router, or corrupted when someone started the elevator, or...). The receiving TCP layer will get packet 3 with AL message 3 and pass it to the receiving application when data is requested, but the data from packets 5 and 6 is held and not passed to the application; remember that TCP guarantees the in-order delivery of bytes. Eventually, the sender will retransmit the data in packet 4. It may also resend 5 and 6, but since the receiver already has that data, it doesn't matter. Once packet 4 arrives, all the data from packets 4, 5, and 6 is queued up and we are back to the previous scenario. Because the probability of dropping or corrupting a packet is greater on a WAN than on a LAN, and on a busy LAN than on a quiet LAN, you are more likely to see this in a production environment than in a test environment.

The Solution: read_msg

As stated at the beginning of this article, you can code the receiving side of the application so that it treats the data as a stream of bytes and not as separate messages. Exactly how you do that will depend on the format of your AL messages. If the messages are drawn from a fixed character space, then you can terminate messages with a character that is outside of that space. If the AL messages can contain any character, then you will need to either pick a termination character and a way to escape that character when it appears in an AL message, or use some mechanism to transmit the length of the AL message — perhaps an AL message made up of two parts, a fixed-size part containing the length and then a variable part of the specified length.
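
Here is a minimal sketch of the length-prefix approach just mentioned: a fixed 4-byte length header in network byte order followed by the message bytes. The 4-byte header size and the helper names send_all and send_framed are my own choices, not something taken from the listings.

/* Send one AL message preceded by a 4-byte length header in network
   byte order, so the receiver can read the header first and then knows
   exactly how many message bytes follow. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

static int send_all(int fd, const char *buf, size_t len)
{
    size_t total = 0;

    while (total < len) {       /* send() may accept only part of the data */
        ssize_t n = send(fd, buf + total, len - total, 0);
        if (n < 0)
            return -1;
        total += (size_t)n;
    }
    return 0;
}

static int send_framed(int fd, const char *msg, uint32_t msg_len)
{
    uint32_t net_len = htonl(msg_len);   /* length header, big-endian */

    if (send_all(fd, (const char *)&net_len, sizeof(net_len)) < 0)
        return -1;
    return send_all(fd, msg, msg_len);
}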

The read_msg routine in Listing 1 is a very simple example of a routine to parse AL messages from a TCP byte stream. In this case, AL messages are terminated by the '@' character. Read_msg assumes that a TCP connection has already been established. You call read_msg with a socket descriptor (fd) and a pointer to a buffer (bp), which is large enough to hold a maximum-length message plus the null terminator character. Once the read_msg routine is called, it will not return until it has a complete AL message to deliver to its caller, or an error has occurred. The first thing that the routine does is scan its buffer (static_buffer) looking for the terminator character to determine if there is already a complete AL message; if so, it copies the AL message into the buffer provided by the caller, adjusts the contents of the buffer and length counter (rcnt), and returns with the length of the message. If there is not a complete AL message already in the buffer, it checks to make sure that there is still space in the buffer. If not, it breaks out of its loop and indicates an error by returning -1 as the message length but with an errno value of 0. If there is space in the buffer, it calls recv to read data from the TCP stack and append it to the buffer.

Once data is read, it checks for TCP special conditions by comparing the return value from recv with -1 and 0. A value of -1 indicates an error at the TCP layer. The static buffer is purged by setting the length counter to 0 and -1 is returned to the caller. The real error is set by the TCP stack in the global variable errno, which the caller can access. A value of 0 means that the connection has been closed. This is indicated to the caller by returning a value of 0. If there were no special conditions, the loop is started again.
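
Listing 1 itself is not reproduced here. The routine below is my reconstruction from the description above (the '@' terminator, the static buffer, and the return conventions) and should be read as a sketch rather than the original code; the buffer size is an assumption.

/* Reconstruction sketch of a read_msg-style routine.  AL messages are
   terminated by '@'.  Returns the message length, 0 if the connection
   was closed, or -1 on error; -1 with errno == 0 means the buffer
   filled up without a terminator being found. */
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#define MAX_MSG_LEN 1024        /* assumed maximum AL message length */
#define TERMINATOR  '@'

static char static_buffer[MAX_MSG_LEN];
static int  rcnt = 0;           /* number of bytes currently buffered */

int read_msg(int fd, char *bp)
{
    for (;;) {
        /* Is a complete AL message already sitting in the buffer? */
        char *term = memchr(static_buffer, TERMINATOR, (size_t)rcnt);
        if (term != NULL) {
            int msg_len = (int)(term - static_buffer);   /* excludes '@' */
            memcpy(bp, static_buffer, (size_t)msg_len);
            bp[msg_len] = '\0';
            rcnt -= msg_len + 1;                 /* drop message and '@' */
            memmove(static_buffer, term + 1, (size_t)rcnt);
            return msg_len;
        }

        /* No terminator yet; make sure there is room for more data. */
        if (rcnt >= (int)sizeof(static_buffer)) {
            errno = 0;
            return -1;
        }

        ssize_t n = recv(fd, static_buffer + rcnt,
                         sizeof(static_buffer) - (size_t)rcnt, 0);
        if (n == -1) {          /* TCP error; errno was set by the stack */
            rcnt = 0;           /* purge the buffer */
            return -1;
        }
        if (n == 0)             /* connection closed by the peer */
            return 0;
        rcnt += (int)n;         /* append the new data and look again */
    }
}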

Testing Your Solution: send_msg

Luckily, you can test your application's ability to parse AL messages without building a duplicate of your production environment by creating a routine that buffers multiple AL messages and then sends randomly selected lengths of the buffer, followed by random delays, until the buffer is empty. The send_msg routine in Listing 2 is a simple example of this. Send_msg assumes that the socket connection has already been established and that the TCP_NODELAY socket option has been set. It also assumes that your C library has a function that can generate a random number uniformly distributed between 0 and 1, which send_msg calls as randomly(). The #defined constants MAX_MSGS and MAX_MSG_LEN define the maximum number of messages to buffer and the maximum length of a message. If your application can only have one outstanding message at a time, then you can set MAX_MSGS to 1. You call send_msg with a socket descriptor (fd) and a pointer to a message (bp). It will add the message to its buffer (stream_buffer) and either return immediately or output all buffered messages. It returns without outputting the messages if the number of buffered messages (msg_count) is less than MAX_MSGS and a random value between 0 and MAX_MSGS is greater than the message count. As the number of buffered messages goes up, the probability of returning immediately goes down.

The message output section of the code has two loops. In the outer loop, a random number (count) between 1 and the length of all buffered messages (stream_len) is generated. This value determines how many characters to send in one TCP send call. The inner loop is then entered to send those characters. A loop is needed because it's possible that the send call will return with an indication that not all characters were sent, so we must loop to resend the leftover characters. A character pointer (cp) is used to point to the first character in the buffer to send. After each send, the cp pointer and count value are adjusted. When the inner loop exits, the cp value is adjusted to point to the next character to send based on the count value, but stream_len, the number of characters still to send, must be adjusted based on the original value of count (saved_count). The reason for this is that the values of cp and count have been adjusted in concert in each iteration of the inner loop while the value of stream_len remained unchanged; using the adjusted value of count to update stream_len would not be correct. After adjusting the values of stream_len and count, another random value between 0 and 5 is selected for a sleep time. Then the outer loop repeats if there are still any characters to send.
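
Listing 2 is likewise not reproduced here. The sketch below is my reconstruction of a send_msg-style test routine from the description above; the buffer sizes, the exact random expressions, and the assumption that the random helper is named randomly() are mine.

/* Reconstruction sketch of a send_msg-style test routine.  It buffers
   AL messages and then dribbles them out in randomly sized chunks with
   random delays, to exercise the receiver's message parsing.  Assumes
   the connection is established, TCP_NODELAY is set, and randomly()
   returns a double uniformly distributed between 0 and 1. */
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

#define MAX_MSGS    10          /* assumed maximum buffered messages  */
#define MAX_MSG_LEN 100         /* assumed maximum AL message length  */

extern double randomly(void);   /* assumed uniform random in [0, 1)   */

static char stream_buffer[MAX_MSGS * MAX_MSG_LEN];
static int  stream_len = 0;     /* bytes currently buffered           */
static int  msg_count  = 0;     /* messages currently buffered        */

int send_msg(int fd, const char *bp)
{
    /* Add the new message to the buffer. */
    int len = (int)strlen(bp);
    memcpy(stream_buffer + stream_len, bp, (size_t)len);
    stream_len += len;
    msg_count++;

    /* Sometimes return without sending, so that messages accumulate;
       the more messages are buffered, the less likely this becomes. */
    if (msg_count < MAX_MSGS && randomly() * MAX_MSGS > msg_count)
        return 0;

    /* Send the buffered bytes in randomly sized chunks. */
    char *cp = stream_buffer;
    while (stream_len > 0) {
        int count = 1 + (int)(randomly() * stream_len);  /* 1..stream_len */
        int saved_count = count;

        while (count > 0) {               /* loop to finish partial sends */
            ssize_t n = send(fd, cp, (size_t)count, 0);
            if (n < 0)
                return -1;
            cp    += n;
            count -= (int)n;
        }

        stream_len -= saved_count;        /* use the original chunk size */
        sleep((unsigned int)(randomly() * 6));   /* pause 0 to 5 seconds */
    }
    msg_count = 0;
    return 0;
}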

Stream of Bytes

Remember that TCP is based on a stream of bytes — the only thing it guarantees is that the data will be delivered in sequence and error free or you will be notified of an error. Any program that is written with the expectation that TCP will keep application layer messages separate will fail; it's just a question of when.


Noah Davids ([email protected]) has worked as a LAN technical support specialist for a midsized computer company for the last 10 years. He has published numerous articles on LAN programming and troubleshooting. He has CNX, Network+, and MCSE certifications as well as a Master of Science in Computer Science from Arizona State University.

