A Disassembler Written in Perl

By Tony Zhang, February 01, 1997

Tony presents the core subroutines of a disassembler written in Perl. Although designed for Intel's x86 instruction set, you can easily modify or customize the disassembler for your own applications.

Dr. Dobb's Journal February 1997: A Disassembler Written in Perl

Tony, a systems software developer at Texas Instruments, can be contacted at [email protected].

Reverse engineering has many applications, including disassembly and microprocessor debugging. Disassembly can be helpful in understanding code written by other programmers. In this article, I'll present the core subroutines of a disassembler written in Perl. This disassembler core is designed for the Intel x86 instruction set. You can modify or customize the disassembler for your own applications. To do so, simply design your application and write a low-level function to read assembled machine code (which may be from a static .EXE file or a bus trace fetched at run time). You then call the core disassembler routines to translate the assembled machine code (that is, the .EXE code) back into human-readable assembly code, which should be close enough to the original source code for most purposes. For instance, I built a caller routine based on bus trace data collected by a logic analyzer connected to a PC system. The caller extracts the executable code from the bus traces and disassembles it by calling the core disassembler routines presented here.

There are several reasons for writing the disassembler in Perl:

Perl has rich built-in features to easily manipulate text, files, and processes.
Perl is ideal for providing quick solutions to programming problems because it needs neither a special compiler nor a linker to convert the source code to machine code.
Perl is rapidly becoming a popular scripting language across almost all platforms.

Instruction Format

Figure 1 shows the general format of the x86 instruction set. All instruction encodings are subsets of the general format. An instruction consists of four main parts. The first part is an optional prefix. The second part -- the opcode (operation code) -- is the most important and is one or two bytes long. The third part is an address specifier consisting of the ModR/M and scale index base (SIB) bytes, also optional. The last is for displacement or immediate data, if required.

Construct Instruction Look-Up Table

The file disasm.lut (available electronically; see "Availability," page 3) is an instruction look-up table based on the opcode map of the x86 instruction set given by Intel (see Pentium Processor Family Developer's Manual, Volume 3: Architecture and Programming Manual, Intel 1995). A colon (":") separates the instructions and operand types. The numeric value in the first column indicates whether the entry belongs to the 1-byte opcode map, the 2-byte opcode map, or another map determined by some attribute.

Each operand follows a "%" sign for parsing purposes. There are several windows in 1-byte and 2-byte opcode maps, such as %g0 and %g[a-q], that will link to 2-byte or other code maps.

A subroutine called "MakeLUT" is called to read the instruction look-up table into three arrays in memory when the disassembler is run for the first time. The three arrays are @One_Byte_LUT, @Two_Byte_LUT, and %Three_Bit_LUT. The last one, %Three_Bit_LUT, is an associative array, indexed by a string, $lsTemp_1, in the subroutine MakeLUT. The associative array is one of the most important features in Perl because it can be used for efficient database manipulation. It's often said that you are not thinking in Perl until you start thinking in terms of associative arrays.

The second column in the instruction look-up table gives the start index into the arrays @One_Byte_LUT and @Two_Byte_LUT. For the associative array %Three_Bit_LUT, the combination of the second and third columns is used as the index. For instance, when the string 1:94:xchg %eax, %esp:xchg %eax, %ebp:xchg %eax, %esi:xchg %eax, %edi is read in, MakeLUT first determines that the string should go to the array @One_Byte_LUT, because the number in the first column (field) is 1. The number 94 is used as the start index into the array. Therefore, it will look something like Example 1(a) in the array @One_Byte_LUT after the reading.

Similarly, after reading the string 3:b:0:add %Ev, %Iv:or %Ev, %Iv:adc %Ev, %Iv:sbb %Ev, %Iv, the associative array %Three_Bit_LUT will have four more items, as in Example 1(b).
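The loading step described above can be sketched roughly as follows. This is not the MakeLUT routine from the listing: load_lut_line is a hypothetical name, and both the hexadecimal start index and the hash-key layout (second column followed by a running 3-bit number) are assumptions inferred from the two sample records.

```perl
use strict;
use warnings;

my (@One_Byte_LUT, @Two_Byte_LUT, %Three_Bit_LUT);

# Hypothetical loader. The field layout follows the article's two sample
# records; the hex indexing and the hash-key format are assumptions.
sub load_lut_line {
    my ($line) = @_;
    chomp $line;
    my ($map, $start, @fields) = split /:/, $line;

    if ($map == 1 || $map == 2) {
        my $lut   = $map == 1 ? \@One_Byte_LUT : \@Two_Byte_LUT;
        my $index = hex $start;               # assumed: start index is hexadecimal
        $lut->[$index++] = $_ for @fields;    # one slot per consecutive opcode
    }
    elsif ($map == 3) {
        my $n = shift(@fields) + 0;           # third column: first 3-bit value
        $Three_Bit_LUT{ $start . $n++ } = $_ for @fields;   # assumed keys: b0, b1, ...
    }
}

load_lut_line('1:94:xchg %eax, %esp:xchg %eax, %ebp:xchg %eax, %esi:xchg %eax, %edi');
load_lut_line('3:b:0:add %Ev, %Iv:or %Ev, %Iv:adc %Ev, %Iv:sbb %Ev, %Iv');

print "$One_Byte_LUT[0x95]\n";    # prints "xchg %eax, %ebp"
print "$Three_Bit_LUT{b2}\n";     # prints "adc %Ev, %Iv"
```

Under these assumptions, the first record fills slots 0x94 through 0x97 of @One_Byte_LUT and the second adds four keys to %Three_Bit_LUT.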

Parse Instruction

The main routine of the disassembler core, X86_Disasm, accepts a string containing the name of a low-level function that will be passed to subroutine ImageByteRead to read in image data (that is, assembled machine code). Also, the start address (either a linear or physical address) and the operand size (16 or 32 bit) will be passed to X86_Disasm. An instruction that accesses words (16 bit) or double words (32 bit) has an operand size attribute of either 16 or 32 bits.

When one byte of machine code is fetched by calling ImageByteRead, it is used as an index to find a symbolic instruction string in the array @One_Byte_LUT. A while loop is then applied to find prefixes in the symbolic string (see Figure 1). Perl provides two pattern-binding operators, =~ and !~. In the while loop, you check whether $lsOpcode contains %P by using the first of these: $lsOpcode =~ /\%P.*/.

The subroutine MethodP contains all four possible prefix groups:

Instruction Prefixes: rep, repe/repz, repne/repnz, and lock (where rep and repe/repz share the same cell in the opcode map).
Segment Override Prefixes: cs, ds, es, fs, gs, and ss.
Operand Size Override ($DisasmnOS).
Address Size Override ($DisasmnAS).
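The prefix loop might look something like the sketch below. The table %lut, its "%P" spellings, and the subroutine strip_prefixes are illustrative stand-ins rather than entries or routines from disasm.lut or the listing; only the variable name $lsOpcode and the pattern test come from the text.

```perl
use strict;
use warnings;

# Tiny stand-in table holding only the entries this sketch needs.
# The "%P..." spellings are illustrative, not copied from disasm.lut.
my %lut = (
    0x66 => '%Popsize',
    0xF3 => '%Prep',
    0x2E => '%Pcs',
    0xA5 => 'movs',
    0x90 => 'nop',
);

# Collect prefix bytes, then return (\@prefixes, $opcode_string).
# Assumes at least one byte is passed in.
sub strip_prefixes {
    my @bytes = @_;
    my @prefixes;
    my $lsOpcode = $lut{ shift @bytes };
    while (defined $lsOpcode && $lsOpcode =~ /%P(\w+)/) {
        push @prefixes, $1;                  # a real MethodP would also set flags
        $lsOpcode = @bytes ? $lut{ shift @bytes } : undef;   # byte after the prefix
    }
    return (\@prefixes, $lsOpcode);
}

my ($prefixes, $op) = strip_prefixes(0xF3, 0x66, 0xA5);
print "@$prefixes $op\n";    # prints "rep opsize movs"
```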

According to Intel, most instructions that can refer to an operand in memory have an addressing-form byte (the ModR/M byte) following the primary opcode byte(s). The ModR/M byte specifies what kind of address form is to be used. Also, in certain cases, the ModR/M byte is followed by a second addressing byte, called the SIB byte, that is required to fully specify the addressing form.

The disassembler I present here (available electronically; see "Availability," page 3) has three subroutines -- ModBits, RegBits, and RMBits. These check the three fields of mod, reg, and r/m in the ModR/M byte. SIBField returns the displacement and the register number of the index register, or the base register, according to the format of the SIB byte. Figure 2 illustrates the formats of the ModR/M and SIB bytes.
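As a rough illustration of the field extraction, the sketch below splits a ModR/M byte into its 2-bit mod, 3-bit reg, and 3-bit r/m fields, and an SIB byte into scale, index, and base. The subroutine names ModBits, RegBits, and RMBits are taken from the text, but the bodies are assumptions based on the standard byte layout, and sib_fields is a hypothetical helper, not the SIBField routine from the listing.

```perl
use strict;
use warnings;

# Bodies are assumed from the standard ModR/M layout: mod(2) reg(3) r/m(3).
sub ModBits { my ($modrm) = @_; return ($modrm >> 6) & 0x3 }
sub RegBits { my ($modrm) = @_; return ($modrm >> 3) & 0x7 }
sub RMBits  { my ($modrm) = @_; return  $modrm       & 0x7 }

# The SIB byte uses the same 2-3-3 split: scale, index, base.
# The scale field is returned as the multiplier 1, 2, 4, or 8.
sub sib_fields {
    my ($sib) = @_;
    return (1 << (($sib >> 6) & 0x3), ($sib >> 3) & 0x7, $sib & 0x7);
}

# 0x44 = 01 000 100b: mod=1, reg=0, r/m=4
printf "mod=%d reg=%d rm=%d\n", ModBits(0x44), RegBits(0x44), RMBits(0x44);

# 0x88 = 10 001 000b: scale factor 4, index=1, base=0
my ($scale, $index, $base) = sib_fields(0x88);
print "scale=$scale index=$index base=$base\n";
```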

After the prefix-adding while loop, the symbolic string is parsed further to find if there are any links to arrays @Two_Byte_LUT or %Three_Bit_LUT. The pattern %g0 implies that a new string of symbolic instruction will be needed from @Two_Byte_LUT, while %g[a-q] leads to the associative array %Three_Bit_LUT, according to how you built your look-up table.
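One way to follow these links is sketched below. The helper follow_link is hypothetical, and the key format (group letter followed by the reg field of the next byte) is an assumption about how the table was built; your own look-up table may be keyed differently.

```perl
use strict;
use warnings;

# Stand-in for part of %Three_Bit_LUT. The "letter . reg" key format
# is an assumption, not taken from the listing.
my %Three_Bit_LUT = (
    b0 => 'add %Ev, %Iv', b1 => 'or %Ev, %Iv',
    b2 => 'adc %Ev, %Iv', b3 => 'sbb %Ev, %Iv',
);

# Hypothetical helper: decide where a symbolic string links to.
sub follow_link {
    my ($symbolic, $next_byte) = @_;
    if ($symbolic =~ /%g0/) {
        return "two-byte map, index $next_byte";
    }
    if ($symbolic =~ /%g([a-q])/) {
        my $reg = ($next_byte >> 3) & 0x7;     # reg field of the ModR/M byte
        return $Three_Bit_LUT{"$1$reg"};
    }
    return $symbolic;                           # no link: use the string as is
}

print follow_link('%gb', 0xD0), "\n";   # reg field of 0xD0 is 2: prints "adc %Ev, %Iv"
```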

Later, each character in the symbolic string is extracted and checked in order to find the corresponding opcode, necessary information regarding the ModR/M byte, the SIB byte, displacement, or immediate data. The Perl substr function is applied to extract those characters from the symbolic string. The tested characters will help to determine which subroutines should be called to interpret the opcode following the addressing methods.
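The character-by-character scan can be sketched with substr and a dispatch table. The handlers here are placeholders that only echo the operand type; in the real disassembler each would be one of the Method subroutines. The names %handler and expand_operands are hypothetical.

```perl
use strict;
use warnings;

# Placeholder handlers standing in for the article's Method* subroutines.
my %handler = (
    E => sub { "E($_[0])" },
    G => sub { "G($_[0])" },
    I => sub { "I($_[0])" },
);

# Walk the symbolic string one character at a time with substr.
sub expand_operands {
    my ($symbolic) = @_;
    my $out = '';
    my $i   = 0;
    while ($i < length($symbolic)) {
        my $ch = substr($symbolic, $i, 1);
        if ($ch eq '%' && exists $handler{ substr($symbolic, $i + 1, 1) }) {
            my $method = substr($symbolic, $i + 1, 1);   # addressing method
            my $type   = substr($symbolic, $i + 2, 1);   # operand type
            $out .= $handler{$method}->($type);
            $i += 3;
        }
        else {
            $out .= $ch;                                  # copy ordinary text
            $i++;
        }
    }
    return $out;
}

print expand_operands('add %Ev, %Iv'), "\n";   # prints "add E(v), I(v)"
```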

The subroutines in the sample disassembler complete the job of translating the symbolic string into text-like assembly code for x86 instructions.

MethodA handles the direct address, in which the instruction has no ModR/M byte and the address of the operand is encoded in the instruction itself. Therefore, no base register, index register, or scaling factor can be applied in this addressing method.
While MethodC uses the reg field of the ModR/M byte to select a control register, MethodD uses the reg field value to determine a debug register.
MethodE fetches the ModR/M byte that follows the opcode and specifies the operand. The operand can be either a general register or a memory address. If the operand is a memory address, it is calculated from a segment register and any of the following values: a base register, an index register, a scaling factor, or a displacement. Following the calculated address, the subroutine can obtain all operand codes, convert them into proper formats, and save the results into a string variable that contains the final disassembled code.
MethodM is basically the same as MethodE, except that MethodM is used only for a memory address. Similarly, MethodR is the same as MethodE, except that MethodR is used only for a general register.
A general register can be selected by calling MethodG, which checks the value of the reg field in the ModR/M byte. For the immediate data encoded in subsequent bytes of the instruction, MethodI can be used. MethodJ computes the relative offset in an instruction and adds the offset to the instruction pointer.
For an instruction that has no ModR/M byte, the offset of the operand can be determined by calling MethodO. In this case, the offset is encoded as a word or double word in the instruction. The address-size attribute, $DisasmnAS, can tell whether it is a word or double word.
Finally, MethodS selects a segment register by checking the value of the reg field in the ModR/M byte.
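Two of the simpler methods can be illustrated in a few lines. The sketch below is not the code from the listing: general_register shows a MethodG-style lookup driven by the reg field and the operand size, and short_jump_target shows a MethodJ-style calculation for an 8-bit relative offset. Both names and signatures are hypothetical.

```perl
use strict;
use warnings;

my @reg16 = qw(ax cx dx bx sp bp si di);
my @reg32 = map { "e$_" } @reg16;

# MethodG-style lookup: the reg field plus the operand size pick the name.
sub general_register {
    my ($modrm, $operand_size) = @_;
    my $reg = ($modrm >> 3) & 0x7;
    return $operand_size == 32 ? $reg32[$reg] : $reg16[$reg];
}

# MethodJ-style target for a short jump: sign-extend the 8-bit offset
# and add it to the address of the next instruction.
sub short_jump_target {
    my ($next_ip, $rel8) = @_;
    my $offset = $rel8 >= 0x80 ? $rel8 - 0x100 : $rel8;
    return $next_ip + $offset;
}

print general_register(0xD8, 32), "\n";              # reg field is 3: prints "ebx"
printf "%04x\n", short_jump_target(0x0102, 0xFE);    # prints "0100"
```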

There are three variable scopes in Perl -- global (the default), local, and my. The life span of the global is as long as that of the whole program module. A local variable in a calling routine can extend its life to a called subroutine. A my variable is limited to the scope of a routine itself. $DisasmnPrefix, $DisasmnOp, $DisasmnModRM, $DisasmnSIB, $DisasmnDispl, $DisasmnImmed, and $DisasmnMapIndex are defined as local variables so that they can be updated and passed easily among different subroutines.
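The difference between local and my is easiest to see in a small demonstration. The variable $DisasmOp below is a hypothetical package variable used only for this example; it is not one of the variables from the listing.

```perl
use strict;
use warnings;

our $DisasmOp = 'global';    # hypothetical package (global) variable

sub show_op { return $DisasmOp }       # reads whatever value is current

sub with_local {
    local $DisasmOp = 'local';         # visible to subroutines called from here
    return show_op();
}

sub with_my {
    my $DisasmOp = 'my';               # lexical: invisible to show_op
    return show_op();
}

print with_local(), "\n";    # prints "local"
print with_my(), "\n";       # prints "global"
print show_op(), "\n";       # prints "global" (the local value was restored)
```

Because a local value stays visible in every subroutine called while it is in effect, it suits state that several decoding routines need to read and update during one instruction.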

Currently, ImageByteRead returns only one byte at a time. The string $mem_read in ImageByteRead contains the name of a low-level function that reads an image file (assembled machine code) and returns data in bytes. The integer $byte_num determines how many bytes of data should be read and returned.
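Calling a function whose name is held in a string can be sketched as follows. The reader read_from_array and the @image data are hypothetical, and the argument order of ImageByteRead is an assumption; only the names ImageByteRead, $mem_read, and $byte_num come from the text.

```perl
use strict;
use warnings;

my @image = (0x90, 0xEB, 0xFE);    # stand-in image data

# Hypothetical low-level reader: returns $count bytes starting at $addr.
sub read_from_array {
    my ($addr, $count) = @_;
    return @image[$addr .. $addr + $count - 1];
}

# Sketch of an ImageByteRead-style wrapper. $mem_read holds the NAME of
# the reader, so the call goes through a symbolic reference.
sub ImageByteRead {
    my ($mem_read, $addr, $byte_num) = @_;
    no strict 'refs';
    return &$mem_read($addr, $byte_num);
}

my ($byte) = ImageByteRead('read_from_array', 1, 1);
printf "%02x\n", $byte;    # prints "eb"
```

In current Perl you could pass a code reference such as \&read_from_array instead of a name, which avoids the symbolic reference and works under strict.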

Building a Sample Disassembler

Disasm.pl (available electronically) is a sample disassembler that demonstrates how you call the disassembler core function X86_Disasm when building disassemblers. Essentially, the disassembler takes your input (in hex format, separated by spaces), splits it into bytes, and saves those bytes in memory. It then calls X86_Disasm to disassemble one byte at a time, and prints the disassembled result on the screen. Remember that the first item you feed into the disassembler should always be the operand size (16 or 32). The disassembler keeps running until you type "q" to quit. You can run the program under UNIX or Windows 95/NT.
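The input handling might be sketched like this. The subroutine parse_input is hypothetical and is not taken from Disasm.pl; it only shows one way to separate the operand size from the hex bytes and to reject malformed input.

```perl
use strict;
use warnings;

# Hypothetical input parser: the first item is the operand size,
# the remaining items are hex bytes.
sub parse_input {
    my ($line) = @_;
    my ($size, @hex) = split ' ', $line;
    die "operand size must be 16 or 32\n"
        unless defined $size && $size =~ /^(?:16|32)$/;
    my @bad = grep { !/^[0-9a-fA-F]{1,2}$/ } @hex;
    die "not a hex byte: @bad\n" if @bad;
    return ($size, map { hex($_) } @hex);
}

my ($size, @bytes) = parse_input('32 b8 01 00 00 00');
print "$size: @bytes\n";    # prints "32: 184 1 0 0 0"
```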

Branch-Trace Message Analysis

One of the uses of the disassembler core in this project is to decode the branch-trace messages (BTM) collected from a logic analyzer. The disassembler is called by a program that can parse the bus cycles and extract BTM data by figuring out the target linear addresses and source linear addresses.

Once the code is retrieved, the disassembler can translate the code back into text-like assembly code by calling a low-level memory-read subroutine through ImageByteRead. I've tested the disassembler package on both UNIX and Windows platforms. The disassembled code matches the original source code (written in assembly) very well.

Conclusion

Clearly, there are many improvements you could make to this disassembler. If you can make ImageByteRead return more than one byte at a time, you might want to rewrite the code in Example 2(a) as shown in Example 2(b), which reduces the redundancy of the for loop. Also, you can add a subroutine to select a test register determined by the reg field of the ModR/M byte. Finally, you can add the part for the coprocessor (floating-point unit) instruction set, similar to the way the integer instructions were added in the article.
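To illustrate the kind of saving a multi-byte read could bring, the sketch below decodes a 32-bit little-endian value two ways: byte by byte in a for loop, and in one step with unpack. It is not a reproduction of Example 2(a) or 2(b); the data and subroutine names are hypothetical.

```perl
use strict;
use warnings;

my @image = (0x78, 0x56, 0x34, 0x12);    # stand-in data: 0x12345678, low byte first

# One byte at a time, combined in a loop.
sub dword_by_loop {
    my $value = 0;
    for my $i (0 .. 3) {
        $value |= $image[$i] << (8 * $i);
    }
    return $value;
}

# All four bytes at once, decoded with unpack's little-endian 32-bit format.
sub dword_at_once {
    return unpack('V', pack('C4', @image[0 .. 3]));
}

printf "%08x %08x\n", dword_by_loop(), dword_at_once();   # prints "12345678 12345678"
```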

Searching, extracting, and pattern matching are three things Perl does very well. Perl lets you concentrate on logic flow and algorithm design while it handles the implementation details -- that's the beauty of programming in Perl.

Acknowledgment

I'd like to thank Dr. Ed Ferguson at Texas Instruments for his help in designing the x86 disassembler. Dr. Ferguson originally wrote a disassembler in awk.

DDJ
