The Evolution of Anti-Botnet Strategies
Given the proliferation and sophistication of malware, it is not hard to see why traditional anti-malware techniques don't work against botnets. Most IDS focus on detecting known threats, or on detecting the volume of traffic generated by a bot host, after it has been activated. However, most bots are polymorphic: they change with every instantiation so always appear as new. Furthermore, most botnets generate only low-volume periodic communication back to a bot master, and this volume is generally within the thresholds used by IDS.
In the remainder of this article, we describe the Canary detector that targets early botnet detection. The Canary detector encompasses three promising anti-botnet strategies. The first strategy employed is the analysis of real enterprise network traces that reveal how the network is actually used; this analysis, in turn, reveals how certain user-driven traffic properties differ from botnet traffic. Our second strategy is an end-host detection algorithm that is able to root out the botnet C&C channel. Our approach is based on the computation of a single persistence value, a measure of how regularly remote destinations are contacted. The strength of this method is that it requires no a priori knowledge of the botnets that are to be detected, nor does it require inspection of traffic payloads. Although the botnet detection capability may be carried out solely at an individual end-host, we show that detection is further improved by correlating across a population of systems, either at a network operation center (NOC) or in a completely de-centralized fashion, to identify the commonality in persistent destinations across multiple systems. This is our third strategy.
The Design of the Canary Detector
The Canary detector takes a novel approach to detecting stealthy, end-host malware, such as botnets. Here we use the term stealthy to mean not generating a noticeable level of traffic. The central idea in our detection scheme is to track the usage of destination atoms, the logical collections of destination addresses that describe services. Specifically, we measure the correlation of destination atoms -- temporally for individual users, and spatially across sets of users–and scrutinize those destination atoms that become significant. In the case of botnets, for example, the recruited end-hosts typically call home periodically. By tracking this destination atom over time at a coarse level, we can flag it when it becomes significantly persistent.
Destination Atoms in Intel Enterprise Traces
Interested in studying correlations between user activity and network traffic patterns, we launched an enterprise data collection effort from inside Intel's corporate network. We collected traces (over a 5-week period from approximately 400 end-hosts) that we and others subsequently data-mined for interesting phenomena, statistics, and contradictions of long-held assumptions .
Looking at real enterprise traces, we can see that there are substantial efficiencies to be gained when correlating destination usage. Thus, our Canary algorithms rely on a level of abstraction we call destination atoms, that is, logical representations of network services. This level of summarization leads to a significant reduction in the number of destination entities that are tracked, and thus, tracking atoms requires less overhead. The base definition for a destination corresponding to a connection is the tuple (destIP, destPort, proto), which is simply the end-point for the connection consisting of the destination address, the destination port, and the transport protocol that is used. Often, in the case of well-known services, multiple physical hosts provide the same, indistinguishable application service. Thus, we can group the set into a single atom (dstService, dstPort, proto). Here, the service is simply the domain name to which the underlying addresses resolve. Examples of atoms include (www.google.com, 80, tcp), (akamaitech.com, 80, tcp), and (mail.cisco.com, 135, tcp).
Further summarization is also possible by applying heuristics on how ports are used by applications. Consider an FTP server, connected in PASV mode. The initial connection is over port 21, but a separate server-negotiated ephemeral port is used for data transfer. Thus, a single FTP session has two atoms, (ftp.service.com, 21, tcp) and (ftp.service.com, k, tcp), where k is a port number beyond 1024, which can be viewed as offering the same service. By considering FTP semantics, we can add the entire range of ports larger than 1024 to the associated atom (ftp.service. com, 21:>1024, tcp). This means that, when we see a connection on port 21, we can expect an ephemeral port to be used in the near future.
In the real enterprise traces, we had many occasions to perform this level of summarization, most notably on the Microsoft RPC ports between 135 and 139. We then arrive at the full definition of destination atom, the triple (addr set, port set, proto). Here, addr set is a set of destination addresses: these addresses are identical with respect to the applications provided; port set is a set of individual ports or port ranges; and finally, proto is the transport protocol the service uses. Table 1 enumerates some atoms extracted from the enterprise traces.
Note that a single destination host can provide a number of distinct services, and in this case, the port is sufficient to disambiguate the services from each other, even though they may have similar service names, which are obtained by (reverse) DNS lookup. Finally, note that in cases where the addresses cannot be mapped to names, no summarization is possible, and the conventional destination address is the final descriptor.
The key anti-botnet technique we propose is to identify temporal heavy hitters without regard to their level of traffic; that is, identify services that get used with a degree of regularity. Again, this strategy was validated by the analysis of real enterprise traces from a diverse group of end users in varied geographic regions with disparate usage patterns. We believe that the set of significant atoms for an end-host is small and stable, and that when a host is infected with malware, it will connect periodically to a home server, and the latter will stand out. To perform this detection, we must first assign a numeric value to the somewhat nebulous concept of regularity, which we refer to as the persistence of an atom. We want to track the regularity of usage, rather than the connections themselves. Consider the act of using your newsreader to download the news headlines. Each time the newsreader application is launched, it makes a large number of connections. To track the long(er)-term communication with the end-host, we concentrate on tracking highlevel sessions, rather than individual connection frequencies.
To track high-level sessions, we bin connections to the atom by using a small tracking window, w, and we assign a 1 or a 0 to that window (the atom was seen 1 or more times, or not). Clearly, the tracking window length should cover sessions. When we plot the inter-arrival time for individual atoms across a large number of users, we see that 59 percent of the connections to atoms are made within a minute of each other, and 87 percent of connections to the same atom are separated by at least an hour. We therefore select an hour as the tracking window length to compute persistence.
The other step needed to assign a numeric value to persistence is the construction of an observation window, W; that is, we look at how long an atom should be regularly observed before it is classified as significant. Based on experience with the data, we defined the observation window, W = 10w, which roughly covers the average work day. Having defined w and W = (w1, w2 , . . . , w10), we quantify persistence for an atom a, as observed at host h, over the observation window W, p (a, h, W), as the number of individual windows w1, w2 , . . . , wn where the atom was observed.
If we denote p* as a threshold for an atom to be significantly regular, then if p (a, h, W) > p* , the destination a is considered persistent for host h. Note that the definition of persistence has an inherent timescale dictated by W. Suppose that w = 1hour and W = 1day. When computed at this scale, persistence captures the day-to-day behavior of the atom. However, it fails to capture longer-term trends that may exist. Consider two different atoms: a1, seen every hour, and a2, observed once a day. We have p (a1) = 24/24 and p (a2) = 1/24.
Intuitively, however, they are both quite regular and thus both should be termed persistent. In fact, because we are trying to detect stealthy malware about which we have no a priori timescale information, the one timescale we pick may be the exact one that misses the malware activity. Thus, instead of relying on a single timescale W, we consider five different timescales, W1 , W2 , . . . , W5 . Therefore, for every atom, we compute p (a, h, Wi) for i = 1, 2, . . . , 5 and say that it is persistent if maxi p (a, h, Wi) > p*
While persistence is defined as a property of the individual end-user, we use commonality to quantify how correlated a destination atom is across the users in a network. Thus, a destination atom is significant in this dimension if a large fraction of the users are communicating with it. Since these atoms are created because of many users in a network, we expect them to be quite stable among the population. The commonality metric is defined quite simply: let N (a) be the number of users in the population that see the atom a, at least in some observation window. Thus, the commonality of atom a, c (a) = N (a)/N, where N is the total number of hosts in the network. Additionally, we could require a minimum persistence for the atom across the set of hosts that report connections to it; doing so would counter the effect of temporary transients such as flash crowds.
Unlike persistence, this commonality metric cannot be computed in isolation at an individual end-host. Persistence requires a means for the system to collect and correlate information across end-hosts. One solution is to assume the existence of a central IT operations center (ITOC) that can collect periodic reports of atoms observed from all the end-hosts, and that can determine the significant common atoms in the set. Alternatively, peer systems can share persistence information periodically with like-minded subsets of the population (e.g., proximate peers, those running a similar OS or patch level, those deemed trusted via the social network of users at the application layer, and so on).
In contrast to the ITOC approach, significantly common atoms are determined and maintained at the end-hosts, as in . In either scheme an important point is that a sliding window is maintained over the entire observation window (the largest among the different timescales). While computing the commonality metric, only reports within this observation window are considered. Again, the test for significance is when the value of c (a) is greater than a specific threshold c . When c (a) > c , we say that a is common in the population.
We construct a whitelist for each user in two steps. First, the host observes its traffic for a training period, builds the set of atoms, and tracks their persistence; the length of this training period would vary with how stable the traffic patterns are, and we expect this to be defined by the network operator. We define p to be the persistence threshold; that is, if the persistence of a particular atom is larger than p , then the atom is added to the whitelist. In the detection phase, each end-host sends its set of observed atoms (all of them, not just the persistent ones) either to the central ITOC of the enterprise or to a subset of like-minded peers. At the ITOC, the commonality is calculated for each atom in the union. We define a threshold for commonality, c, and collect those atoms whose commonality exceeds c. These atoms are sent to every end-host, where they are incorporated into the whitelist. Thus, every host's whitelist has two components: an individual component capturing behaviors unique to that host, and a global component that corresponds to behavior that is common to the population. The global component can contain atoms that are not part of the individual host's regular behavior.
At a high level, our system generates alarms corresponding to two types of events. These are classified as (1) p-alarms, when a destination atom not contained in the host's whitelist becomes persistent and (2) c-alarms, when a destination atom is observed at a large number of end-hosts in the same window and is identified as common. Note that p-alarms are generated locally; the user is alerted and asked to acknowledge the alarm. In contrast, c-alarms are raised either at the central ITOC or locally, if full whitelists are distributed among peers. Note that when the alarm corresponds to an atom becoming significant, one of two things must happen: either the atom is classified as benign (by a user or operator) in which case it must be added to the appropriate whitelist, or else the alarm indicates malicious behavior, requiring remediation action. In this article, we do not address the remediation stage; we simply note that a number of possibilities have been suggested in the literature, such as throttling traffic, redirecting traffic through a scrubber, blocking traffic, and so on.
processPacket(pkt, t, wi) a <-- getDestAtom(pkt) if a in WHITELIST then return /* ignore atoms already in the whitelist */ end if if a is a new connection initiation then DCT[a][currIdx] = 1 /*update persistence */ sendReport(userID, a, t) /*report sent to central console*/ end if
In the rest of this section, we briefly review the specific actions required to process outgoing packets (summarized in Listing One). When the (outgoing) packet corresponds to an atom already in the individual host whitelist, nothing further is done. If the outgoing packet does not correspond to an atom already in the host whitelist, then the following steps are taken:
- If the atom was not previously seen, a new entry is created in the data structure used to track persistence (DCT); this is indexed by the atom and points to a bitmap. Each bit corresponds to a particular tracking window.
- The data structure that tracks the observations of atoms (labeled DCT) is updated for the current tracking window.
- The atom, if new, is sent to the ITOC (possibly after being filtered through a minimum persistence criterion).
Note that our system is not tied to any particular traffic feature or threshold definition; for convenience, we assume connections per minute as the feature under consideration. To generate p-alarms, we track persistence at all the timescales by employing a sliding window. The data structure to do this is depicted in Figure 2. A dictionary (or hash table) is maintained, in which an atom is indexed, and this dictionary entry reveals the particular bitmap associated with the atom. When the atom is observed in a tracking window wi, the ith bit is set to 1 as described in Figure 2. As the sliding window is advanced, at the end of the last window, the persistence is computed for each atom observed in the last tracking window. It would seem that doing this for multiple timescales would be expensive. However, an interesting observation is that we do not need to replicate the structure at different timescales. Instead, we can exploit the overlapping nature of the timescales (W3 < W4); we can get away with this by using a single long bitmap that has enough bits to cover the longest observation window.
If at any time, the persistence value of the atom exceeds the threshold p , an alarm is raised for the atom; at this time, the user is asked to attest whether the atom is valid and should be added to the whitelist. If the value is not significant even after sufficient tracking windows, the bitmap is cleared out and the atom is no longer tracked (a new bitmap is instantiated if it ever appears again).
To understand the overhead imposed by this procedure, we note that the length of the dictionary need not be large. If an outgoing packet is already in the whitelist (specifically, if its atom is in the whitelist), then no new dictionary entry is required. For everything else, we only need one entry per atom (even if the same atom has many connections or packets associated with it). With atoms that actually need to be tracked, the computation involved is simply the time it takes to index the dictionary and update the bitmap. However, we see in the traffic that most atoms that we track occur very infrequently (and that the most obviously persistent atoms are already in the whitelist and do not need to be tracked). Therefore, most entries in the bitmap are empty; an easy optimization would be to use sparse vectors in lieu of bitmaps. In our analysis, we found that the worst-case scenario over all users, and all observation windows Wmax had 1435 atoms requiring tracking. The average case was 485 atoms. This is almost negligible if one considers the computational power and memory associated with modern-day mobile systems.
We conclude this discussion by briefly discussing how the c-alarms are generated through tracking commonality -- a very straightforward operation. The central console at the ITOC keeps track of atoms seen by different users over the largest observation window. When a report arrives from a host, the corresponding atom is updated. At the same time, old information is expunged (that is, sightings of an atom older than the observation window are discarded). When an atom's entry is updated, and the number of associated users (who have seen this atom recently) crosses the threshold c , a c-alarm is generated. The frequency with which a host sends reports to the central console determines how soon an anomaly will be detected. Dispatching the report immediately (as soon as the atom is first seen) helps with catching the anomaly early, but at the cost of communication. Batching updates reduces the communication cost, but increases the time to detection. While this is an interesting tradeoff to study, we do not explore it in this article.