Dr. Dobb's | Fighting Back Against Spam-Zombie Hordes

Cutting off spam at the source, instead of relying on filters

May 23, 2008
URL:http://www.drdobbs.com/windows/fighting-back-against-spam-zombie-hordes/208200258

Researchers at Microsoft Research Silicon Valley have been pursuing a different approach to combating spam-spewing zombie botnets. Instead of focusing on identifying spam in a network and then trying to filter it out, they are out to identify the sources of the spam based on IP-address properties.

"Today, we're seeing a lot of network attacks from compromised zombie hosts, often referred to as botnet hosts," Yinglian Xie explains. "They're being used to launch and distribute denial-of-service attacks, and they're notorious for launching large-scale spam campaigns.

"We are trying to learn whether there's some fundamental property that we might be able to leverage across all these types of botnet attacks."

In seeking answers to that question, Xie and Fang Yu have devised an approach that identifies and analyzes dynamic IP addresses to examine what role they might play in detecting zombie-based spamming servers.

Their findings are intriguing. In a paper entitled How Dynamic Are IP Addresses? -- co-written with Microsoft Research Silicon Valley colleagues Kannan Achan, Moises Goldszmidt, and Ted Wobber, along with Eliot Gillum, principal development manager for the Windows Live Hotmail team -- Xie and Yu report on an analysis of a month of Hotmail user logins. Of the 155 million IP addresses analyzed, without compromising user data, almost 103 million, or roughly two-thirds, were dynamic, a much larger percentage than previously thought.

Further, almost 96 percent of the mail servers with dynamic IP addresses were used only to send spam e-mails. More than 42 percent of all spam sent to Hotmail during the analysis period came from dynamic IP addresses. Such numbers fly in the face of standard assumptions about host representation on the Internet. Many spam-fighting approaches have used host IP addresses to represent host identities, but that practice has been based on an assumption that the lion's share of IP addresses on the Internet are static.

The accuracy of this premise, though, had not been verified until Yu and Xie came along.

"We need to be very careful when we use an IP address to represent hosts," Xie adds. "It may not be true in many cases, and that would have significant implications."

Adds Yu: "We developed a scheme to automatically identify dynamic IP addresses and also look at the mail servers set up on these dynamic IP addresses to capture spam leveraged by those spammers."

"Each computer has an IP address from which it connects to the Internet," Yu says. "For dynamic IP addresses we identify, it's a subset of DHCP [dynamic host configuration protocol]-assigned address not statically bound to a particular IP, a particular computer."

Computers linked to the Internet via DSL or dial-up have their IP addresses changed frequently. Those -- quite possibly yours -- are dynamic IP addresses.

"All previous ways of knowing whether an IP address is dynamic," Xie says, "include manual collection in a database, which requires an enormous amount of effort. Our work established the possibility of an automatic way to do that."

That was one lesson they gleaned from their method, called UDMap, a simple yet powerful algorithm that uses only application-level server logs to identify and analyze dynamic IP addresses. Automatic identification is a key component of the researchers' approach.

In fact, the UDMap work represents the first successful attempt to identify dynamic IP addresses automatically and to understand the dynamics of IP addresses.

As the paper makes clear, the UDMap algorithm has three key advantages:

It is generally applicable. UDMap can be applied to various kinds of logs, not just Hotmail user logs.
It runs autonomously. There is no need to share information across domains, and changes to client software are not necessary.
It delivers fine-grained, current information. UDMap identifies blocks of dynamic IP addresses, often smaller and more precise than IP prefixes, and, because it is fully automated, it can be applied to recent logs to get current information.
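
To make the idea of mining application-level logs concrete, here is a minimal sketch in Python. It is not the UDMap algorithm from the paper; it is a deliberately simplified heuristic built on assumptions. The log format (pairs of user ID and IP address), the fixed /24 grouping, the thresholds, and the function name candidate_dynamic_ips are all hypothetical. The intuition it illustrates: an address looks dynamic when several users log in from it and each of those users also appears on other addresses in the same block.

```python
from collections import defaultdict
from ipaddress import ip_network


def candidate_dynamic_ips(logins, min_users=2, min_ips_per_user=2, prefix_len=24):
    """Flag addresses that look dynamically assigned (simplified heuristic).

    logins: iterable of (user_id, ip_string) pairs from a login log.
    All thresholds and the fixed prefix length are illustrative assumptions.
    """
    # For each (block, user), the set of addresses the user logged in from.
    ips_by_block_user = defaultdict(set)
    # For each address, the set of users seen on it.
    users_by_ip = defaultdict(set)

    for user, ip in logins:
        block = ip_network(f"{ip}/{prefix_len}", strict=False)
        ips_by_block_user[(block, user)].add(ip)
        users_by_ip[ip].add(user)

    flagged = set()
    for ip, users in users_by_ip.items():
        block = ip_network(f"{ip}/{prefix_len}", strict=False)
        # Users on this address who also appear on other addresses in the block.
        roaming = [
            u for u in users
            if len(ips_by_block_user[(block, u)]) >= min_ips_per_user
        ]
        if len(roaming) >= min_users:
            flagged.add(ip)
    return flagged


sample_logins = [
    ("alice", "10.0.0.1"),
    ("alice", "10.0.0.2"),
    ("bob", "10.0.0.2"),
    ("bob", "10.0.0.3"),
    ("carol", "192.0.2.10"),
]
print(candidate_dynamic_ips(sample_logins))
```

In this toy log, only 10.0.0.2 is flagged, because both alice and bob used it and each was also seen on a second address in the same /24. A fixed /24 is a simplification: the article notes that UDMap finds blocks that are often smaller and more precise than IP prefixes.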

The project derived additional value from an in-depth investigation into the correlations between dynamic IP addresses and spamming activities.

"Our conclusion," Xie says, "is that mail servers on dynamic IP addresses -- especially those highly dynamic, those changing within one, two, or three days, or even more frequently -- are highly suspicious. A normal mail server will want not only to send e-mails but also to receive them, so they want a relatively stable IP address. They normally won't use a highly dynamic IP address."
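
The rule of thumb Xie describes could be expressed as a simple filter. The sketch below assumes a hypothetical table, turnover_days, that maps an IP address to an estimate of how often it is reassigned; the article does not describe such a table, and both the function name and the three-day default (taken from the quote above) are illustrative only.

```python
def suspicious_senders(sender_ips, turnover_days, max_days=3.0):
    """Return mail-sending IPs whose estimated reassignment interval is short.

    sender_ips: iterable of IP strings observed sending mail.
    turnover_days: hypothetical mapping of IP string -> estimated days
        between address reassignments; static addresses are simply absent.
    """
    return sorted(
        ip for ip in sender_ips
        if ip in turnover_days and turnover_days[ip] <= max_days
    )


senders = ["198.51.100.7", "203.0.113.9", "192.0.2.1"]
turnover = {"198.51.100.7": 1.0, "203.0.113.9": 30.0}
print(suspicious_senders(senders, turnover))
```

Here only 198.51.100.7 is reported: it sends mail and its address is estimated to change daily. The output of such a filter would be a list of candidates for closer scrutiny, not a verdict that any address is sending spam.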

Yu, who obtained her Ph.D. at the University of California, Berkeley, and Xie, who did her doctoral studies at Carnegie Mellon University, both joined Microsoft Research Silicon Valley in August 2006, and they already are making a significant contribution. A number of entities currently are using knowledge learned via UDMap to filter spam e-mail for real users.

"Both of us had a background in network security," Xie says. "We thought a big problem nowadays with network security is that botnet hosts are prevalent and attackers use botnets to launch spamming attacks."

With Microsoft's Hotmail team located on the same campus, Xie and Yu took advantage of the proximity to enter into discussions with the product group, getting their hands on some invaluable data in the process.

Xie and Yu note that they benefited greatly from their interaction with the Hotmail team.

"Because we are in Microsoft and Silicon Valley," Yu says, "we were actually quite fortunate. We talked to the product group and got a lot of useful data. Without that, this research probably would not have been possible."

Even so, the sheer volume of the data set the Hotmail group was able to supply posed its own challenges.

"We need to process a tremendous amount of data to extract IP-dynamics information out of it," Yu says. "We were lucky to get access to a large computing infrastructure, Dryad and DryadLINQ, for our implementation. Without that, our ideas might be very nice on paper, but a real-world implementation might take longer to process. With this available infrastructure, the whole processing time is very quick."

Finally, Xie and Yu got valuable feedback and help from their coauthors, Achan, Goldszmidt, and Wobber at the lab. Gillum from the Hotmail product group also is supporting their research.

Yu and Xie plan to continue their work, investigating, among other things, the IP properties of zombie hosts. Such properties could indicate whether a host is a user's computer, a proxy, or a shared computer; whether its address falls into a residential or an enterprise IP address range; and whether it corresponds to a DSL line or a wireless line.

Then there is the history of such dynamic IP addresses to consider.

"The second direction we're looking at," Xie says, "is the attack history information of IP addresses or IP address ranges. For example, if we know a specific IP address range and that there are a lot of botnet attacks associated with this range, that may indicate that it contains less-well-protected computers. It's also likely that when we talk to a host from that particular IP address range, we need to be aware that there might be botnet hosts behind that range."
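
One way to picture the attack-history idea is a per-range tally of observed attack sources. The following sketch is an illustration under assumptions, not the researchers' method: the input list of attacking IP addresses, the fixed /24 range size, and the function name attacks_per_range are all hypothetical.

```python
from collections import Counter
from ipaddress import ip_network


def attacks_per_range(attack_ips, prefix_len=24):
    """Count observed attack events per IP range, most-attacked first.

    attack_ips: iterable of IP strings, one entry per observed attack event.
    prefix_len: assumed range size; a real system would likely choose
        ranges from routing or allocation data rather than a fixed length.
    """
    counts = Counter(
        str(ip_network(f"{ip}/{prefix_len}", strict=False))
        for ip in attack_ips
    )
    return counts.most_common()


events = ["203.0.113.5", "203.0.113.77", "203.0.113.5", "198.51.100.1"]
print(attacks_per_range(events))
```

With these sample events, 203.0.113.0/24 ranks first with three events and 198.51.100.0/24 follows with one. A range that keeps appearing near the top of such a ranking is the kind Xie describes as likely to contain less-well-protected computers.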

"We think exposing IP dynamics to the world is the coolest part of this," Xie smiles. "I think our findings are surprising. Ninety-six percent of mail servers on dynamic IP addresses actually send nothing but spam -- this knowledge was not much exposed before."