Emergent Misbehavior

By Ed Nisley, October 01, 2004

Ed sees potential menace in the proliferation of remotely updatable embedded systems.

Ed's an EE, PE, and author in Poughkeepsie, NY. Contact him at ed.nisley@ieee.org with "Dr Dobbs" in the subject to avoid spam filters.

Three years ago, I mentioned the possibility of Internet-enabled refrigerators mounting a distributed denial-of-service attack on your local 911 center, in the context of security measures for deeply embedded systems. Earlier this year, I described how the control program in some cable modems can be replaced with code of your own choosing.

Perhaps the examples produced a few snickers, but the point remains valid: Allowing remote code updates in a device that's not under your physical control means the device can be hijacked and made to do nearly anything. As embedded devices continue to shrink, multiply, and network, their potential for misbehavior increases dramatically.

Worse, it seems we don't need malignant code to cause major problems, because the verification and testing techniques that help weed out errors may not be capable of ensuring that the overall design scales correctly. Algorithms that work well for one device in isolation or a few devices on a LAN may have completely unexpected consequences when tens or hundreds of thousands of devices go live in the field.

An example of this recently came to light, involving nothing more exotic than consumer-grade firewall routers that automatically set their internal clocks. Let's start by finding out what motivates designers to use automatic timekeeping, then see how it works and what went wrong. If your design calls for a whole bunch of networked gizmos, there are some serious lessons to be learned along the way.

Timekeeping

PCs, workstations, servers, and suchlike include a hardware clock that maintains the correct time and date, with a battery supplying power while the system is turned off. UNIX-flavored systems prefer to have that clock set to UTC (Coordinated Universal Time), while Windows PCs favor local time in the appropriate time zone, and it's typical for fresh Linux installations to report completely bogus time until that's sorted out. The BIOS or operating system copies the hardware clock's time into the kernel's software-clock timekeeping variables during the boot sequence and may copy the current system time back to the hardware during shutdown. In principle, therefore, both the hardware and the software clocks report the same time.

Hardware clocks use a quartz crystal oscillator, typically vibrating at 32.768 kHz, to produce a 1-Hz timekeeping interrupt. While quartz crystal oscillators were a major step forward from pendulum clock technology, relentless marketplace economics ensures that designers must use the cheapest possible crystal that satisfies the basic performance requirements.

But when was the last time you specified the long-term accuracy of the clock inside a computer system of any sort? Unless the computer's main function involves timekeeping, nobody really cares. Hence, the downward economic pressure and the reason why hardware clock oscillators tend to keep inaccurate time. The crystal has an initial error, plus variations with temperature and age. The typical accumulated error can reach a few minutes per month and may vary unpredictably.

An overall frequency error of only 0.0023 percent, 23 parts per million, amounts to an error rate of 1 minute per month. Most PC users set the system's hardware clock once, when they first fire up a new PC, and the accumulated error can reach half an hour after a year. I've seen Windows PCs with far more error, because their owners didn't notice, care, or know how to reset the clock.
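The arithmetic is easy to check. Here's a minimal sketch, assuming a 30-day month and a frequency error that stays constant (real crystals wander with temperature and age); the function name is mine, purely for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def drift_seconds(ppm, days):
    """Accumulated clock error in seconds for a constant frequency error.

    ppm  -- frequency error in parts per million (assumed constant)
    days -- elapsed time in days
    """
    return ppm * 1e-6 * days * SECONDS_PER_DAY

# 23 ppm over an assumed 30-day month, then over a 365-day year
print(round(drift_seconds(23, 30), 1), "seconds per month")
print(round(drift_seconds(23, 365) / 60, 1), "minutes per year")
```

By the same arithmetic, a half-hour error after a year corresponds to an oscillator that's off by something closer to 57 parts per million.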

The hardware clock matters only while the power is off, but the operating system's software clock derives its accuracy from the CPU's crystal oscillator and is subject to similar errors. Both clocks will certainly drift at the whim of their crystals unless you periodically correct them.

The situation with embedded systems is even worse. A PC-class system has a display and keyboard, but an embedded system may have only a few buttons and LEDs. Given how few VCRs display the correct time, you can imagine how little interest most people would have in tweaking gadgets with no display at all.

The solution is to get humans out of the timekeeping business by automating the process. If we can find one system with a known-good clock and define a mechanism where other systems can set their clocks to that time, everything will work perfectly. That's the principle behind NTP, the Network Time Protocol, and it works wonderfully well.

NTP and SNTP

Even the earliest UNIX systems resided on networks, so they encountered the problems caused by clock (mis)synchronization long before DOS and Windows came along. Even a small network runs into troubles when each node carries a different time of day: Files can be created with dates in the future or past, daily archives can miss files created "tomorrow" or "yesterday," and other problems can be even harder to figure out. Manually keeping each node's time correct to within a few seconds simply isn't feasible.

The Network Time Protocol, which dates back to about 1981, defines how networked systems can synchronize their clocks with surprising accuracy: milliseconds on a LAN and tens of milliseconds across the Internet. That's entirely close enough for most practical purposes.

The protocol itself is fairly simple, but deriving correct, stable, accurate time from the raw packets is decidedly nontrivial. The NTP code, now up to Version 4, solves a wide range of reliability and stability problems that have nothing to do with time and everything to do with networks. In short, there's a lot of skull sweat behind this seemingly simple protocol.

Each machine can simultaneously be a client that receives correct time from another source and a server that passes its time along to other systems, with a "stratum" number indicating how far down the client-server chain it resides.

Stratum 1 NTP servers derive their time from primary standards, such as atomic clocks or the GPS satellite constellation, and serve as the ultimate authority for Internet timekeeping. There are about 130 Stratum 1 NTP servers available for public access, with a clock accuracy in the submicrosecond range. They're typically located at universities, research labs, and large ISPs.

Stratum 2 servers synchronize their clocks from the Stratum 1 servers with millisecond accuracy. The NTP "Rules of Engagement" strongly recommend that Internet users synchronize their systems from the 170-odd Stratum 2 servers to reduce the load on the Stratum 1 servers. Although NTP has relatively low bandwidth and CPU requirements, providing clock synchronization to the whole Internet from a few servers obviously won't work.

NTP adjusts the current system time continuously, by changing the frequency of the clock ticks, rather than discretely, by abruptly updating the clock variables. This avoids the bizarre problems that occur when time has gaps or runs backwards.
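To get a feel for what gradual adjustment costs, here's a minimal sketch that estimates how long it takes to slew away a given offset. It assumes the 500-ppm maximum slew rate documented for the reference ntpd implementation and a constant-rate correction; the real clock discipline is considerably more sophisticated, and the function name is hypothetical.

```python
MAX_SLEW_PPM = 500  # assumed cap, per the reference ntpd documentation

def slew_duration(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Seconds of real time needed to absorb an offset by slewing.

    The clock runs fast or slow by slew_ppm parts per million, so each
    real second removes slew_ppm microseconds of error.
    """
    return abs(offset_seconds) / (slew_ppm * 1e-6)

for offset in (0.010, 0.128, 1.0):
    print(f"{offset:6.3f} s offset -> {slew_duration(offset):7.0f} s of slewing")
```

At that rate a one-second error takes over half an hour to work off, which is why implementations typically give up on slewing and step the clock when the offset is large.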

The NTP program is quite complex and well beyond the needs of most embedded systems that need mostly correct time. SNTP, the Simple Network Time Protocol, allows a client to set its clock from an NTP server without going through all the NTP clock-adjustment effort. In effect, SNTP asks the server for the current time, applies windage based on the estimated network delay, and stuffs that value directly into the client's clock. A periodic SNTP synchronization, perhaps every day or so, will compensate for even a terribly inaccurate crystal oscillator.
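The exchange is small enough to sketch in a page. What follows is a minimal illustration, not anybody's production client: it assumes the standard 48-byte NTP packet layout, sends a version 3 client request, and uses the usual four-timestamp offset calculation, which is a bit more careful than simple one-way windage. All function names are mine, and the server you pass to query() should be one you're actually entitled to use.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds from 1900-01-01 to 1970-01-01

def build_request():
    """48-byte SNTP client request: LI=0, version 3, mode 3 (client)."""
    return bytes([0x1B]) + bytes(47)

def ntp_to_unix(seconds, fraction):
    """Convert a 32.32 fixed-point NTP timestamp to Unix seconds."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32

def parse_reply(packet):
    """Return the server's (receive, transmit) timestamps as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    # Receive timestamp is bytes 32-39, transmit timestamp is bytes 40-47
    rx_s, rx_f, tx_s, tx_f = struct.unpack("!4I", packet[32:48])
    return ntp_to_unix(rx_s, rx_f), ntp_to_unix(tx_s, tx_f)

def offset_and_delay(t1, t2, t3, t4):
    """t1 = client send, t2 = server receive,
    t3 = server transmit, t4 = client receive (all in seconds)."""
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay

def query(server, timeout=5.0):
    """One-shot query: returns (clock offset, round-trip delay) in seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(build_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
        t4 = time.time()
    t2, t3 = parse_reply(packet)
    return offset_and_delay(t1, t2, t3, t4)
```

Adding the returned offset to the local clock is the "stuff it into the clock" step. Note that query() makes exactly one attempt and raises on timeout; what the caller does next is where things get interesting.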

A product that sets its software clock from the network doesn't need a battery-powered hardware clock, which can dramatically reduce the system's hardware cost. Product managers really like trading off hardware for software!

An NTP client continuously calculates the reliability and stability of each of several servers, while an SNTP client simply fetches the time from a single server. Both use server IP addresses that are generally chosen when the program is configured and then left alone. An embedded system may have fairly limited configuration options and, at least for consumer-grade systems, might not have any at all.

Can you see it coming?

Bandwidth, Bit by Bit

The University of Wisconsin maintains a Stratum 1 NTP server on its Madison campus that should, in principle, be accessed only by NTP clients that provide time synchronization to large subnetworks. It certainly should not be accessed by folks such as you and me, who maintain only a handful of systems that don't require access to Stratum 1 time accuracy.

In early 2003, the UW-Madison network's inbound packet rate varied between 20,000 and 40,000 packets per second. At 8 a.m. on May 14, 2003, that rate began increasing steadily to nearly 100,000 packets per second, saturating their network monitors. They determined that the bulk of the additional traffic was addressed to their NTP server from many different IP addresses, so it looked like a classic distributed denial-of-service attack.

Curiously, the NTP packets all came from port 23457, so they could easily revise their firewall rules to block incoming traffic from that port to their server. This reduced, but did not completely eliminate, the inbound NTP traffic. They figured the DDoS attack would subside in a few days.

By mid-June, one month later, the data rate upstream of the campus firewall was rising beyond 250,000 packets per second, an average data rate of 150 Mb/sec. That's slightly more than 100 percent utilization of an OC-3 fiber link and twice the typical throughput of a common 100 Mb/sec. LAN.

Some packet analysis revealed the offending IP addresses were generally valid, so they could trace selected NTP requests to actual physical locations. Two such spots had Netgear consumer-grade firewall routers.

Examination of the firmware for several Netgear routers revealed the UW-Madison NTP server's IP address, hard-coded directly in the firmware. This didn't require any complex reverse engineering, just the application of the UNIX strings command to the binary firmware files downloadable from the Netgear website.

Worse, the SNTP implementation was rather impatient. It queried the UW-Madison NTP server, waited for one whole second before assuming the response was lost, and then sent another query. If the round-trip network delay between the router and the NTP server exceeded 1.00 second, the router would hose down the server with a 76-byte SNTP request every second until the network delay dropped under 1.00 second.

Now, for the punch line. Netgear manufactured over 700,000 routers with that SNTP implementation.

Pop Quiz: Do the math. Hint: Mind your decimal places!
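For anyone who'd rather let the computer take the quiz, here's a minimal sketch of the worst case. It assumes every one of those routers is retrying at once, which is my simplification rather than anything measured; the constants come from the figures above and the function name is hypothetical.

```python
ROUTERS = 700_000          # units manufactured, per the text
REQUEST_BYTES = 76         # size of one SNTP request, per the text
REQUESTS_PER_SECOND = 1    # one retry per second per router

def flood_rate(routers=ROUTERS, size=REQUEST_BYTES, rate=REQUESTS_PER_SECOND):
    """Return (packets per second, megabits per second) at the server."""
    packets = routers * rate
    megabits = packets * size * 8 / 1e6   # bytes -> bits -> megabits
    return packets, megabits

packets, megabits = flood_rate()
print(f"{packets:,} packets/s, {megabits:.1f} Mb/s")
```

The same function, fed the 250,000 packets per second actually seen upstream of the campus firewall, gives about 152 megabits per second.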

Resolution

A few years ago, I set up a Linux box with a big hard drive to act as a file server to collect all the stuff from my various systems in one spot for easy backup. I'd been using NTP clients to synchronize my system clocks, so it was easy enough to aim them at my own NTP server, with that one box synchronized from a public-access Stratum 2 server.

Later, I bought a surplus HP Z3801 GPS Frequency Standard, which punched my lowly NTP server into the rarefied Stratum 1 level. Well, not really, because I haven't installed the Pulse-Per-Second kernel patch that would permit precise synchronization, but it's close enough.

None of the three or four firewall routers I've used (don't ask) can issue SNTP requests to an NTP server behind their own firewall. All of them expect to fetch the exact time from the Net, which is a reasonable assumption on the part of their designers.

In fact, my Netgear MR814V2 router does not even have a configurable NTP server address: It's hard-coded in the firmware. You simply connect the router to the Internet and, after a while, shazam, it has the correct time.

Neither of the two firmware updates since the original release mentions anything about changing the NTP server, so it's not clear whether this one was affected. Perhaps that's what "Fixed various minor issues" means?

Netgear has, according to the report on the UW website, installed NTP servers on the Internet to support their routers. Newer routers, as well as older routers with updated firmware, will use the Netgear time servers. Older routers with old firmware will, alas, continue to hammer UW's servers.

There's no way to force a firmware update, because, as viewed from the user's LAN, the routers work perfectly. Recalling them for a mandatory update isn't feasible, as a negligible fraction of the owners bother to register their routers and, besides, who would voluntarily run their network without the firewall router for a few weeks? If it's the only router on the network, which is quite likely for the networks these things support, removing the router would destroy the network.

Lessons (to be) Learned

I'm certain that the router firmware passed all the usual reviews and verification tests. The code is correct, functional, and more-or-less standards compliant. It's just a little, well, impatient. Seen from the router's end of the connection, nothing is wrong. How will you catch a similar situation in your product's code?

Remember, everything works in the small and on your development department's LAN. The problem only emerges when huge numbers of the devices operate in parallel, in the real world, subject to random conditions. Do your specs encompass that situation?
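One common defense against impatience is exponential backoff with random jitter: double the wait after every failure, up to some ceiling, and randomize it so a few hundred thousand devices don't march in lockstep. Nothing below comes from the router firmware or the UW report; it's a minimal sketch with made-up parameters, comparing how many requests a single client sends during an hour-long outage.

```python
import random

def fixed_retry_schedule(outage_seconds, interval=1.0):
    """Send times (seconds) for a client that retries at a fixed interval."""
    times, t = [], 0.0
    while t < outage_seconds:
        times.append(t)
        t += interval
    return times

def backoff_retry_schedule(outage_seconds, base=1.0, cap=3600.0, rng=None):
    """Send times for exponential backoff with random jitter.

    Each wait is drawn uniformly from 0.5 to 1.0 times the current
    ceiling, and the ceiling doubles after every failure up to cap.
    """
    if rng is None:
        rng = random.Random()
    times, t, ceiling = [], 0.0, base
    while t < outage_seconds:
        times.append(t)
        t += rng.uniform(0.5 * ceiling, ceiling)
        ceiling = min(ceiling * 2, cap)
    return times

hour = 3600
print(len(fixed_retry_schedule(hour)), "requests with a fixed 1-second retry")
print(len(backoff_retry_schedule(hour)), "requests with backoff and jitter")
```

With these parameters the backoff client sends a dozen or so requests in the hour instead of 3600. Multiply both numbers by the size of your installed base before deciding which one belongs in the spec.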

Although the UW report specifically declines to comment on financial and legal matters, it seems reasonable to assume that Netgear is subsidizing the list of beefed-up network hardware recently installed at UW to handle the NTP flooding. If so, this would be a fine example of a corporation cleaning up a mess that it (inadvertently) caused. Not hardcoding UW's NTP server surely would have been less expensive, though.

Seen from a greater height, if your product requires a network service for proper operation, you should provide sufficient resources to support your product. For example, Windows XP's SNTP client checks the time at time.windows.com once a week, although it can be configured to use any other NTP server. Sponging off "free" services in a commercial product seems decidedly rude.

The inability to update old routers could, just possibly, be an argument for automated, remote hardware management: WIBNI (Wouldn't It Be Nice If) all those routers just magically started using the new code? Consumers would never know the update happened..., which is an argument for not using remote, automated management, at least on mass-market consumer products. If Netgear could do a remote update, so could script kiddies. Believe it.

I'm not convinced that sending in those innumerable registration cards provides any real benefit to me. I always do, you probably do, but the responses taper off pretty quickly after us. If anyone knows where that information goes, drop me a note.

Reentry Checklist

Those earlier columns of mine were "Life Support" (DDJ, November 2001) with the nifty photo of the melted switch contacts similar to those inside Apollo 13's liquid oxygen tank and "DIY Hacking" (DDJ, June 2004). As Casey Stengel (or someone like him) once observed, it's difficult to make predictions, particularly about the future.

A note describing the Netgear router problem appeared in the Risks Forum (http://catless.ncl.ac.uk/Risks/23.41.html). The complete writeup by Dave Plonka (http://www.cs.wisc.edu/~plonka/netgear-sntp/) includes a chart showing that being Slashdotted is trivial next to their SNTP traffic.

More than you need to know about NTP and timekeeping in general is at http://www.ntp.org/ and http://www.eecis.udel.edu/~mills/ntp/index.html. The Windows 2000 and XP implementations are at http://www.microsoft.com/technet/prodtechnol/winxppro/maintain/xpmanaged/27_xpwts.mspx. Wake up early on the morning of 7 Feb 2036 (06:28:16 UTC) to see what happens when NTP's 32-bit seconds counter rolls over.
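If you want to mark your calendar precisely, here's a minimal sketch that computes the rollover instant. It assumes only that NTP timestamps count seconds from the start of 1900 in an unsigned 32-bit field; the function name is mine.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)

def ntp_rollover(bits=32):
    """UTC instant at which an unsigned seconds counter of this width wraps."""
    return NTP_EPOCH + timedelta(seconds=2**bits)

print(ntp_rollover().isoformat())
```

The result is in UTC, so adjust for your own time zone before setting the alarm.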

More on the HP Z3801 at http://www.realhamradio.com/GPS_Frequency_Standard.htm. Some may still be available, but my source has sold out.

DDJ
