Dr. Dobb's | Finding Binary Clones with Opstrings & Function Digests: Part III

Finding Binary Clones with Opstrings & Function Digests: Part III

Andrew wraps up his examination of reverse engineering this month, further unraveling binary code.

September 01, 2005
URL:http://www.drdobbs.com/tools/finding-binary-clones-with-opstrings-fu/184406247

September, 2005: Finding Binary Clones With Opstrings & Function Digests: Part III

Andrew is a software litigation consultant who works on the technical aspects of cases involving copyright, patents, trade secrets, antitrust, and privacy. He can be contacted at [email protected] or http://www.undoc.com/.

Can a few extracts from a piece of binary code be sufficient to identify that piece of code in a large database? As discussed in the two previous parts of this article, a piece of binary code's noteworthy features can be extracted and placed into an opstring, an ordered set of instruction mnemonics or operations ("ops"). Noteworthy features include atypical instructions, system or API calls, branch-target locations (places the code may jump to), and the use of unlikely "magic" numbers.

In x86 Win32 code, one such opstring would be: "test, jz, call, test, jz, call, ret, loc, [HeapFree], loc, ret". Using an algorithm such as MD5, this can be turned into a function digest such as 65E482BF6A2F391D84D243D9E02244D7. If the function's name is known, either from debug symbols or from a Win32 DLL export table, the function digest and name can be added to a function database. The "test,jz,..." example and its MD5 signature happen to correspond to one implementation of the free C runtime library (RTL) function. If, in some other program, a function were encountered with a digest of 65E482BF6A2F391D84D243D9E02244D7, it could with a fair degree of certainty be identified as free. How much certainty is one of the subjects of this third and final article on binary code lookup and comparison.

Generating additional symbolic information for otherwise anonymous code—using symbols from one file to ID code in a different file—can tell you, when staring at code in a debugger, that some function is the free memory-deallocation function; this beats trying to read and understand the code.

However, that is a fairly limited use. Given a large database with MD5 signatures for every function in every Win32 file preinstalled on a typical PC or on the Windows XP CD, you can measure the amount of code duplication in the system, classify modules by sorting them into sets based on how much code they share in common, identify code that should perhaps be moved into a shared library, or find multiple instances of flawed code.

Excluding Boilerplate

I had a different purpose when I started this project: Not to use or study the contents of the function database, but instead to filter it out as noise. The function database was intended merely as a list of "boilerplate" code—stereotyped verbiage—that could be ignored when comparing binary modules in copyright infringement litigation.

Almost all litigation happens outside a courtroom. At least in the U.S., the majority of legal cases never reach trial (the system would collapse if more than a handful did). Cases are settled based on each side's estimation of the "value" of the case, determined in part by what each side learns about both sides to the case. Modern rules of "discovery" require each side to turn over to the other side ("produce") certain types of information, such as potentially relevant internal corporate e-mails and other documents (for typical examples, see http://www.usdoj.gov/atr/cases/ms_exhibits.htm).

In software copyright litigation, this includes each side's source code. Each side's expert compares the two sets of source code, looking for literal similarities that shouldn't be there, and for signs of nonliteral copying. (Each source tree may consist of millions of lines of code, so this process must be automated. It's a bit harder than it sounds. Hints: diff is not the right tool; associative arrays are nearly essential.)

Even open-source software is faced with issues like these: What percentage does GNU represent of a typical Linux distribution? Is a commercial vendor reusing open-source code without abiding by the GPL (see gpl-violations.org)? Has a contributor used copyrighted code?

Binary comparison is useful in part because source is sometimes unavailable. Not surprisingly, litigants sometimes resist turning over their source code ("the crown jewels") to the other side, even with court protective orders limiting who may see the source code. Also, a litigant only sees the other side's source code after it becomes an actual litigant. Thus, a software copyright complaint will generally be issued before the plaintiff has seen the defendant's source code. In the U.S., complaints can be filed "upon information and belief," and can later be amended. Still, a plaintiff should try confirming their suspicions. Meanwhile, a potential defendant might also have concerns ("I wonder if my programmers have been 'borrowing' code from their previous jobs"). If and when both sides' source code becomes available, the binary comparison is a double check, and may pinpoint specific source-code areas to investigate.

When detecting similarities in code—be it source or binary—there must be a baseline for comparison. Comparing any two otherwise-unrelated source-code trees, written in the same programming language and/or targeting the same operating system, yields some exact matches that nonetheless don't reflect copying, such as return 0; and for (int i=0; i<count; i++) in C code. Such lines need to be excluded from any source-code comparison. They are, in the context of such a comparison, noise, mere boilerplate: Everyone uses these phrases, so their presence is uninformative. (However, while boilerplate code is generally excluded from such percentages, boilerplate code that is merely a small part of a larger block of nonboilerplate code is generally included in the overall matching percentage.)

You generally "normalize" each line of source code by ignoring all white space (though in one case, a developer had used combinations of spaces and tabs as Morse-coded copyright notices). If you suspect juvenile forms of plagiarism, you might normalize all variable names; one of the benefits of comparing binary code is that the compiler eliminates names for you. You probably would not exclude comments because shared misspellings and errors may indicate copying (see Bob Zeidman's "Detecting Source-Code Plagiarism," DDJ, July 2004).

Lines can be normalized as each line of code is inspected, but filtering out boilerplate is different: You can't tell, just by looking at it, that something is standard verbiage. You need a boilerplate database, constructed at some earlier point by running through lots of other code—the baseline code you're not going to compare—and pulling out the most frequently occurring lines, like int i; and return 0.

Similarly, binary-code comparison needs a baseline—a database of boilerplate functions that, if found in the programs under comparison, should be ignored. The original purpose of Opstring was to build this database, by generating MD5 digests for every function included with Windows XP. This database contains MD5 signatures for C RTL functions, compiler startup code, standard sample code from Microsoft's software development kits, Visual Studio "wizard" code, and so on. Microsoft's debug symbol packages (see http://www .microsoft.com/whdc/devtools/debugging/ symbolpkg.mspx) provide names for the functions in nearly every XP DLL. Even without these names, while I don't necessarily know what each function is, my binary-code copying-detection program knows to ignore any copies of that function encountered in the two sets of binaries under comparison. (Additional code may need to be excluded from consideration, even if it's not found in the standard function database; see Computer Associates v. Altai.)

Does "Structurally Similar" Really Mean "Clone"?

As noted, the function database contains MD5 digests of opstrings, each of which is a string of the noteworthy features of a function. For example, the C source code in Figure 1(a), compiled to produce something like the code in Figure 1(b), yields the opstring in Figure 1(c) and the MD5 digest in Figure 1(d).

The instruction mnemonics have been extracted from the disassembly, the common mov, push, and pop instructions discarded, and the remaining mnemonics placed into the string, along with "loc" pseudo-ops, indicating the targets of the jz and jnz instruction. The operands have been ignored.

While Figure 1(c) plainly abbreviates the code in Figure 1(b), it is less clear that Figure 1(c) truly represents Figure 1(b), much less its source in Figure 1(a). After all, wouldn't a lot of other code, looking nothing like Figure 1(a) or Figure 1(b), also abbreviate to Figure 1(c)?

It turns out that the answer is no in this particular case. A database of about 775,000 function digests, representing most of the binary code in Windows XP SP2, did not contain a match for the function digest in Figure 1(d). Even having discarded so much of the function, what's left are reasonably reliable indicators of the function's identity.

Knowing that the opstring algorithm ignores mov, push, and pop, you could easily construct a function that differed from Figure 1(a) but that yielded the same opstring as in Figure 1(c). For example, Figure 1(e) appears to be a false positive, but only under a literal definition of "clone," in which even obvious cut-and-paste variations are treated as nonmatches. As noted in Part II of this series, Opstring was written to ignore minor differences. In other words, this is a feature, not a bug. It is important to have a uniform method for generating the function database, but a testing version of Opstring (available electronically; see "Resource Center," page 3) does provide command-line options to control what is included or excluded in an opstring, allowing for different definitions of "clone."

If two opstrings match, then—given a minimum opstring length—the binary code from which they were generated (and, to a lesser extent, the overlying source code) will be sufficiently similar that someone eyeballing them would regard them, perhaps not as identical, but as a good match, such that one could be a cut-and-paste variation of the other. In other words, we are backing into a working definition of "clone," based on what this tool can find. A binary-code "clone" can be defined as a match found by Opstring.

Obviously, not all tools are worth listening to. A "tool" that hashed any piece of code, of whatever length, to a single byte, would "tell" us which of the only 256 types of code in existence a given function fell into. Like classifying people by their astrological signs, it would provide only the illusion of usefulness. There's a similar danger here, because we're usually looking at compiler output (rather than hand-written assembler), and one can expect great regularity, even given infinite possible source-code input, because the compiler acts as a funnel, narrowing infinite varieties of source into a relatively small number of microprocessable structures.

Testing for False Positives

Where symbolic names are available, one test for false positives checks for ostensibly matching functions that have significantly different names. Multiple function names mapped to the same function digest (MD5 signature) may indicate that the opstrings from which the signatures are created are not good indicators of the function's identity. In my first test, nearly half of the named signatures appearing more than once had more than one name. However, browsing through the test program's output showed that nearly all the mismatches were C++ mangled/decorated names that differed from each other in only minor respects, such as ?Unlock@Perimeter@@QAEXXZ versus ?Unlock@RWLock@@QBEXXZ, or ?Remove@foo versus ?Update@foo versus ?Set@foo. These functions tended to be shorter than the system average. In many cases, the seemingly mismatched names were all located in a single file, indicating that indeed the functions were related, despite the different names. I also found many cases where, despite having used Microsoft's symchk utility, PDB symbols and code were misaligned.

When I reran the test, this time looking only for cases where code appears in more than one file and where a given symbol is used in more than one file (to eliminate PDB misalignments), ignoring all nonalpha characters, and looking only at the first five characters of the function name, the result was still that almost 3 percent of the signatures had one or more names that apparently mismatched the others in its first five characters.

Of the apparent mismatches, many were C RTL functions, such as isdigit, islower, and ispunct mapping to the same signature. That Opstring currently treats, say, islower and isupper as the same function may reduce its ability to supply exact symbolic names to disassembly listings, but does not undercut its ability to find code clones. Figure 2 shows what some of the remaining apparent mismatches look like (the first column is the opstring length).

I did find one genuine mismatch, so surely there must be others, especially in signatures based on shorter opstrings. Figure 3(a) shows the offending opstring, Figure 3(b) lists a few of the 30 functions whose code abbreviates to this opstring; Figures 3(c) and 3(d) are two examples of the actual code. Even here, given that the code in Figure 3(c) is called "Cleanup" and calls CoTaskMemFree twice, and that the code in Figure 3(d) calls C++ operator delete ("YAXPAX" in Microsoft C++) twice, the functions perhaps aren't totally unrelated. But clones, probably not.

Even though many of the mismatches are only apparent, some clearly are real; the 3 percent given earlier is probably a reasonable upper bound on the false-positive rate. This was based on a minimum opstring length of 12, and the false-positive rate can be reduced simply by requiring longer opstrings. A minimum length of 40 brings the false-positive rate down to about 1 percent.

This raises the question of the ideal minimum length for an opstring. The Opstring program discards any function whose opstring is shorter than the desired minimum. The average opstring length in XP is about 55. As seen in Figure 1, even a fairly small C function yielded an opstring whose length was 15. As a rough back-of-the-envelope figure, say there are about four "ops" per line of nonblank, noncomment C code. A minimum opstring length of 20 or 25 then represents a function definition, curly braces, some variable definitions, and five or six lines of actual code.

False Negatives

Different compilers can take the same source code and produce fairly different binary code. The same compiler, with different optimization settings, can do the same thing. Likewise, a single piece of source code that uses preprocessor macros or C++ templates can generate disparate binary code. Given a template<class T> class Stack, with Push and Pop methods, consider the difference between Stack<int>, Stack<float>, and Stack<class Foo>. The developer wrote Push and Pop once, but the compiler generates three different pieces of code for each. To still detect the underlying similarities between these three different binary versions of Stack::Push, based on the instruction mnemonics, would likely require boiling away so many differences among the three versions that unacceptably high false positives would result.

One approach to reduce false negatives is to match on something other than function boundaries. In addition to using branch targets for the "loc" pseudo-op, they could act as delimiters for the blocks of code from which to generate opstrings. With smaller blocks of code to consider, naturally more matches would be found. But considering each branch target as a separate block of code will result in many blocks of code that fall below the minimum length that avoids false positives. One solution is to consider both functions as a whole, and the branch-target blocks within them.

Another approach is to ignore boundaries entirely. The self-similarity within a file or group of files found using the DotPlot method (see "Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code," by Kenneth Church and Jonathan Helfman; http://imagebeat.com/dotplot/rp.jcgs.pdf). DotPlot is typically a visual technique in which duplication appears as diagonal lines, but the diagonals can be produced nonvisually (and thus selected by another program) by comparing every element of a string with every other element, using nested for loops. Church and Helfman describe several easy-to-implement optimizations that allow most comparisons to be skipped, by ignoring high-frequency tokens (much as Opstring ignores high-frequency instruction mnemonics). The frequency with which tokens appear can also be used to do weighted matching.

All of this can be applied to binary code, once this code has been turned into some form of text, like an opstring. One of the benefits of opstrings, apart from generating MD5 digests, is that they allow textual methods such as DotPlot to be applied, in effect, to binaries. Using the DotPlot method and a hacked version of Opstring, I located large tracts of duplicated code, crossing function boundaries, in Windows programs. However, even with the Church/Helfman optimizations, the program was significantly slower than comparisons with MD5 digests.

Apart from redefining what gets matched with what, we can also allow nonexact matching. By modifying the Levenshtein edit-distance algorithm (see "Finding String Distances," by Ray Valdes; DDJ, April 1992) to work with ops rather than characters, we could find identical functions by finding those whose opstrings have an edit distance of zero, then locate slight variations by selecting those pairs with small edit distances, for instance, only a small number of insertions, deletions, or substitutions needed to turn one opstring into the other. As with the DotPlot method, this too is significantly slower than comparing MD5 digests.

Implementation

The Opstring program (available electronically; see "Resource Center," page 3) is implemented in the AWK programming language, which simplifies text manipulation (if desired, the AWK code can be translated into Perl with a2p). Even though Opstring digests binary files, it's a text-manipulation tool that depends on an external disassembler, DumpPE. AWK splits each line of text input into space-delimited fields, designated as $1, $2, and so on. The entire input line is $0, and the number of fields is NF. In DumpPE disassembly output such as in Figure 1(b), opcode bytes appear in $2, the instruction mnemonic in $3, and operands in $4. A function label or branch target appears in $2. To aid reading the code listing, Figure 4 presents pseudocode.

When Opstring sees a function boundary, it outputs the previously collected filename, function name, and opstring, then sets up for the next opstring. When Opstring sees an instruction, it adds it to the opstring. All else is commentary.

Opstring could work directly on PE files, but it makes more sense, and not only for ease of implementation, to simply create a wrapper around an existing disassembler, DumpPE (http://www.tbcnet.com/~clive/dumppe.htm). While Opstring is currently overly tied to the DumpPE output format in particular, and Wintel code in general, there is little reason it couldn't be generalized for other disassemblers and other instruction sets. (Java and .NET IL come to mind, though there are decent decompilers for both of these, allowing somewhat higher level comparison methods.)

Microsoft's dumpbin is an obvious choice. Its -disasm option produces reasonable disassemblies of Win32 executable files. Newer versions use symbols in a PDB file. However, its output format is inconvenient to work with (each opcode byte is output followed by a space, making the location of the instruction mnemonic difficult to locate), and more importantly it does not identify branch targets. To use Opstring with dumpbin output, which is similar to that of the Linux objdump utility, I've written a converter (available electronically) that makes a first pass over a listing, looking for calls and jumps, to construct a pseudosymbol table that is then used to output function and branch-target labels on the second pass.

IDA Pro from DataRescue (http://www.datarescue.com/) is the gold standard in disassemblers. Its numerous features already include one of the side benefits that I've listed for opstrings and function digests—identification of runtime library functions. IDA Pro's FLIRT ("Fast Library Identification and Recognition Technology") identifies functions using the first 32 bytes of a function, marking all variant bytes, and a CRC of the remaining bytes. Provision is made for the fact that functions such as _strupr and _strlwr share identical code. An IDA Pro external plug-in could presumably employ a database of opstring-style function signatures.

However, IDA Pro is designed largely for interactive use (it even includes a runtime debugger to help with encrypted or obfuscated code that is difficult to untangle when looking at a static listing). To build the database itself, however, which is the purpose of Opstring, interactivity is less useful. Thousands of disassemblies are generated, briefly used by another program, and then discarded. The end result is the function signature.

Clive Turvey's DumpPE is a good choice as the supplier of function names, branch targets, and instruction mnemonics to opstring. It makes two passes over the code to find branch targets, uses symbols from a PDB file, identifies functions and branch targets that are indirectly accessed via pointer tables, and identifies the names of Windows API calls.

A key problem is finding the start and end of each block to be compared, which generally will be a function. Opstring relies on DumpPE to find function starts; DumpPE gets these from entry points (especially from the PE file's export table), from any debug symbols, and from its first pass through the code.

Finding function ends is more difficult; this is the halting problem. As seen in the pseudocode, the program doesn't find function ends at all. It simply relies on seeing the next function, or the end of the disassembly. Function returns or unconditional jumps do not necessarily signal the end of a function, because code often contains multiple returns or jumps per function.

Because Opstring continues to add to the current opstring as long as it thinks it is inside the same function, it is crucial to filter out as much junk as possible. As seen in the pseudocode, opstring ignores NOPs, debugger invocations (INT 3), data that looks like code, and anything that will take the opstring length beyond a reasonable length. Any opstring that doesn't contain at least one jmp or ret is discarded, and any opstring of the maximum length is chopped back to the location of its first jmp or ret.

When Opstring has generated the opstrings for a file, they can be turned into MD5 digests with the MkMD5Db program (available electronically). This is little more than a wrapper around the standard MD5Init, MD5Update, and MD5Final functions originally written by Colin Plumb in 1993. MkMD5Db expects the name for an item specified on a line beginning with '!', followed by one or more lines to be digested, and a blank line. Opstring, of course, outputs this format.

Preliminary Results

I noted in Part I that while this technique can be used for all sorts of things, these articles only construct the technique itself, deferring the actual use of it to some later time. While it is in the true spirit of software to construct tools and then move onto something else, without putting the tool to much use, it would be good to glimpse this hammer actually encountering some nails.

I took a Windows XP CD, expanded all files in the \I386 directory, and then discarded those that were not Win32 PE files. The result was 11,913 files (primarily DLLs), totaling 304 MB. I ran the following command on this directory:

for %f in (\pefiles\*.*) do dumppe -disasm
| awk -f opstring.awk
| mkmd5db >> xp_i386.db

This ran in about half an hour on a cheap laptop computer. The resulting file contained just over 500,000 function signatures, of which 44,000 different signatures appeared more than once (and, on average, was used 3.8 times; thus, about 123,000 signatures were duplicates). The small program funcdupe.awk (available electronically) pulls an entire function database into an associative array and then generates these numbers. It also factors in opstring lengths to calculate a duplication percentage. For Windows XP, the result was 17 percent.

Considering that this includes startup and runtime library code that is included in many programs, 17 percent doesn't sound very high. On the other hand, considering that Windows is built around shared libraries (DLLs), and is supposed to comprise the very operating system itself, and not merely a loose collection of applications, this duplication rate could also be viewed as high. One study of source-code duplication in large systems reports a range (excluding comments and blank lines) from 8.7 percent in GCC, to 29 percent in a web-based message board written in Python, to 36.4 percent for a Smalltalk database server, and 59 percent for a COBOL payroll system ("A Language Independent Approach for Detecting Duplicated Code," by Stephane Ducasse et al., http://www.bauhaus-stuttgart.de/clones/Duploc_ICSM99_IEEECopyright.pdf).

Previous Work

Given that cut and paste comprises a lot of what developers do, it is not surprising that clone detection is a rich field of research, and even has its own annual "Detection of Software Clones" workshop, held in conjunction with the IEEE's annual Working Conference on Reverse Engineering (WCRE). For an excellent overview, see "Detecting Duplicated Code" in Object-Oriented Reenginering Patterns, by Serge Demeyer et al. (Morgan Kaufmann, 2002).

Over 10 years ago, Brenda Baker described "parameterized string matching" for the purpose of finding sections of code that are identical except for a systematic change of parameters. Initially this involved source rather than binary code, but has been extended to Java bytecodes (see http://cm.bell-labs.com/who/bsb/). Other key contributions are discussed in Parts I and II of this article.

Inspecting binary executable files has been particularly important to the antivirus industry. Binary signatures, with provision made for wildcards, are used to scan for known viruses (see Art of Computer Virus Research and Defense, by Peter Szor; Addison-Wesley Professional, 2005). It seems odd that the software used by millions every day isn't analyzed with the same rigor as malware. In large part, Opstring takes malicious-code analysis, as used by antivirus researchers, and tries to appropriate it to analysis of commercial software.

DDJ

September, 2005: Finding Binary Clones With Opstrings & Function Digests: Part III

(a)

int foo(int x, int y) {
    while (b(x,y))
        if (c(x,y)) 
           return 1;
    return 0;
}

(b)

004011F0               	@foo@8:
004011F0 56         		push	esi
004011F1 57        		push 	edi
004011F2 8BF2        		mov 	esi,edx
004011F4 8BF9        		mov 	edi,ecx
004011F6 E869FEFFFF      		call 	fn_00401064 ; _b
004011FB 85C0           		test 	eax,eax
004011FD 741A          		jz   	loc_00401219
004011FF             	loc_004011FF:
004011FF 8BD6            		mov 	edx,esi
00401201 8BCF            		mov 	ecx,edi
00401203 E857FEFFFF        		call 	fn_0040105F ; _c
00401208 85C0            		test	eax,eax
0040120A 7512         		jnz 	loc_0040121E
0040120C 8BD6          		mov 	edx,esi
0040120E 8BCF           		mov 	ecx,edi
00401210 E84FFEFFFF     		call 	fn_00401064 ; _b
00401215 85C0           		test 	eax,eax
00401217 75E6           		jnz  	loc_004011FF
00401219               	loc_00401219:
00401219 5F             		pop 	edi
0040121A 33C0           		xor 	eax,eax
0040121C 5E             		pop 	esi
0040121D C3             		ret
0040121E              	loc_0040121E:
0040121E 5F               		pop  	edi
0040121F B801000000        		mov 	eax,1
00401224 5E                		pop  	esi
00401225 C3                		ret

(c)

call,test,jz,loc,call,test,jnz,call,test,jnz,loc,xor,ret,loc,ret

(d)

EEC1E2A53DA3D4FCF3E57850F302412D

(e)

int bar(int x, int y, int z) {
    while (g(x,y,666))
        if (h(1234,x,y)) 
           return 1;
    return 0;
}

Figure 1: (a) A small C function; (b) disassembly of a compiled version; (c) its opstring; (d) its function digest; (e) and a slightly different C function that yields the same opstring and function digest.

September, 2005: Finding Binary Clones With Opstrings & Function Digests: Part III

65 . A675B247E3DA5CC0CFC0418AE0A33607 taskkill.exe - _wmainCRTStartup
65 . A675B247E3DA5CC0CFC0418AE0A33607 systeminfo.exe - start
65 . A675B247E3DA5CC0CFC0418AE0A33607 syskey.exe - _wmainCRTStartup
 ...
63 . E65204032B23E3DE64472CE99203EF34 wmspdmoe.dll - _m6_ownBitRev1_Z
63 . E65204032B23E3DE64472CE99203EF34 wmspdmoe.dll - _a6_ownBitRev1_Z
63 . E65204032B23E3DE64472CE99203EF34 wmspdmod.dll - _m6_ownBitRev1_Z
63 . E65204032B23E3DE64472CE99203EF34 wmspdmod.dll - _a6_ownBitRev1_Z
 ...
27 . D40F1D68E0D3D7859498B13925213272 provthrd.dll - 
   ?HexCharToDecInteger@ProvAnalyser@@SGKD@Z
27 . D40F1D68E0D3D7859498B13925213272 msmsgs.exe - ?HexCharToDigitA@@YGHD@Z
27 . D40F1D68E0D3D7859498B13925213272 dpwsockx.dll - _GetDigit@4
27 . D40F1D68E0D3D7859498B13925213272 dplayx.dll - _GetDigit@4
27 . D40F1D68E0D3D7859498B13925213272 dplaysvr.exe - _GetDigit@4
 ...
25 . 90ACA191ED795D375823FF86C4CDBB97 sapi.dll - __lseek
25 . 90ACA191ED795D375823FF86C4CDBB97 netfxocm.dll - __lseek
25 . 90ACA191ED795D375823FF86C4CDBB97 msvcr71.dll - _write
25 . 90ACA191ED795D375823FF86C4CDBB97 mstscax.dll - __read
25 . 90ACA191ED795D375823FF86C4CDBB97 msihnd.dll - __lseek

Figure 2: Some function signatures with different names in different files.

September, 2005: Finding Binary Clones With Opstrings & Function Digests: Part III

(a)

test,jz,call,and,loc,test,jz,call,and,loc,test,jz,call,and,loc,ret

(b)

...
16 . BE52429A4DB568969503A30A27D4CB24 rtcshare.exe- ??1CShareErrorInfo@@UAE@XZ
16 . BE52429A4DB568969503A30A27D4CB24 rend.dll - ??1CObjectWithSite@@QAE@XZ
16 . BE52429A4DB568969503A30A27D4CB24 query.dll-?_Cleanup@CDbParameter@@AAEXXZ
16 . BE52429A4DB568969503A30A27D4CB24 qedit.dll - ?Release@CDXDataPtr@@QAEXXZ
 ...

(c)

7D9D7AFB             	?_Cleanup@CDbParameter@@AAEXXZ: ;;; query.dll
7D9D7AFB 8BFF         		mov 	edi,edi
7D9D7AFD 56              		push  	esi
7D9D7AFE 8BF1              		mov  	esi,ecx
7D9D7B00 8B06           		mov  	eax,[esi]
7D9D7B02 85C0            		test 	eax,eax
7D9D7B04 7409            		jz  	loc_7D9D7B0F
7D9D7B06 50              		push 	eax
7D9D7B07 E81EC40B00       		call 	_CoTaskMemFree@4
7D9D7B0C 832600          		and  	dword ptr [esi],0
7D9D7B0F             	loc_7D9D7B0F:
7D9D7B0F 8B4604          		mov  	eax,[esi+4]
7D9D7B12 85C0            		test 	eax,eax
7D9D7B14 740A             		jz  	loc_7D9D7B20
7D9D7B16 8B08            		mov  	ecx,[eax]
7D9D7B18 50             		push 	eax
7D9D7B19 FF5108         		call  	dword ptr [ecx+8]
7D9D7B1C 83660400       		and   	dword ptr [esi+4],0
7D9D7B20               	loc_7D9D7B20:
7D9D7B20 8B4608       		mov  	eax,[esi+8]
7D9D7B23 85C0          		test  	eax,eax
7D9D7B25 740A           		jz   	loc_7D9D7B31
7D9D7B27 50               		push 	eax
7D9D7B28 E8FDC30B00        		call 	_CoTaskMemFree@4
7D9D7B2D 83660800        		and  	dword ptr [esi+8],0
7D9D7B31             	loc_7D9D7B31:
7D9D7B31 5E            		pop  	esi
7D9D7B32 C3              		ret

(d)

5DA13F91            	??1CObjectWithSite@@QAE@XZ: ;;; rend.dll
5DA13F91 56              		push  	esi
5DA13F92 8BF1            		mov  	esi,ecx
5DA13F94 8B4604         		mov  	eax,[esi+4]
5DA13F97 85C0            		test 	eax,eax
5DA13F99 C706AC17A15D    		mov  	dword ptr [esi],offset 
   ??_7CRendezvous@@6BCObjectWithSite@@@
5DA13F9F 740B             		jz   	loc_5DA13FAC
5DA13FA1 50              		push 	eax
5DA13FA2 E819EF0000       		call	??3@YAXPAX@Z
5DA13FA7 83660400       		and	dword ptr [esi+4],0
5DA13FAB 59              		pop 	ecx
5DA13FAC             	loc_5DA13FAC:
5DA13FAC 8B460C         		mov 	eax,[esi+0Ch]
5DA13FAF 85C0            		test	eax,eax
5DA13FB1 740A             		jz  	loc_5DA13FBD
5DA13FB3 8B08             		mov 	ecx,[eax]
5DA13FB5 50              		push 	eax
5DA13FB6 FF5108          		call  	dword ptr [ecx+8]
5DA13FB9 83660C00        		and  	dword ptr [esi+0Ch],0
5DA13FBD               	loc_5DA13FBD:
5DA13FBD 8B4610           		mov 	eax,[esi+10h]
5DA13FC0 85C0             		test 	eax,eax
5DA13FC2 740B             		jz   	loc_5DA13FCF
5DA13FC4 50                		push 	eax
5DA13FC5 E8F6EE0000      		call  	??3@YAXPAX@Z
5DA13FCA 83661000         		and  	dword ptr [esi+10h],0
5DA13FCE 59              		pop 	ecx
5DA13FCF               	loc_5DA13FCF:
5DA13FCF 5E              		pop  	esi
5DA13FD0 C3              		ret

Figure 3: A false positive: (a) a function digest; (b) a few of the 30 functions in XP with this digest; and (c) and (d) two disassemblies showing that the code doesn't really match.

September, 2005: Finding Binary Clones With Opstrings & Function Digests: Part III

input file with "magic numbers" (DEADCAFEh, etc.) into array
given output from DumpPE -disasm:
    get Win32 filename from first line of dumppe output
    ignore everything until see "Disassembly"
    for each line in disassembly
        if it's a function label
            if opstring length > minlength and 
            if opstring contains at least one ret or jmp
                if opstring length is maximum, might be junk so:
                   chop it off at the first ret or jmp
                output filename, previous function name, and opstring
            set up for next opstring:
               set new function name; clear opstring; oplength = 0
        else if it's an instruction line, and opstring isn't at maxlength
            if line contains a large hex operand
                if the hex operand is found in the "magic" array
                   add mnemonic "_" magic number to opstring
            else if it's a Windows API call
                add API name to opstring without trailing 'W' or 'A'
            else if it's a branch target
                 add "loc" to opstring
            else if it's common (mov, push, pop, add esp)
                if it's an API mov
                   add "mov_" and API name to opstring
                else
                   do nothing
            else if it's junk code (nop, int 3, etc.)
                do nothing
            else if it's data ; relying on DumpPE to separate code/data
                do nothing
            else if we've seen this same thing many times in a row
                do nothing
            else if mnemonic has prefix (rep, lock, etc.)
                add prefix "_" mnemonic to opstring
            else
                add mnemonic to opstring
            if added to opstring
                oplength++
    stop processing when see hex dump
    output last one
send output to mkmd5db

Excluding Boilerplate

Does "Structurally Similar" Really Mean "Clone"?

Testing for False Positives

False Negatives

Implementation

Preliminary Results

Previous Work

Figure 1: (a) A small C function; (b) disassembly of a compiled version; (c) its opstring; (d) its function digest; (e) and a slightly different C function that yields the same opstring and function digest.

Figure 2: Some function signatures with different names in different files.

Figure 3: A false positive: (a) a function digest; (b) a few of the 30 functions in XP with this digest; and (c) and (d) two disassemblies showing that the code doesn't really match.

Figure 4: Pseudocode for the Opstring program.