
Bring Out Your Dead

Jan01: Bring Out Your Dead

Dan is a security researcher. He can be contacted at


Data recovery is, of course, of potential interest to anyone who has lost data to the ravages of time, malice, or carelessness. But in forensic computing or analysis, it takes on a new meaning — suddenly what other people have thrown away can become an important component in understanding what has happened in the past, as burglary tools, data files, correspondence, and other clues can be left behind by interlopers.

Taking the PC approach — essentially toggling a delete flag and hoping that no one has overwritten your data — will not work with most UNIX filesystems. Although, on UNIX, the data is not deleted (this would take far too much time), the location of the data blocks is lost when a file is removed. And when combined with the rather mysterious high-performance disk-block allocation methods, the recovery of deleted material becomes rather difficult and, to most, a seemingly hopeless task. The UNIX Internet FAQ (available at .edu/) has said since 1993:

For all intents and purposes, when you delete a file with "rm" it is gone...However, never say never. It is theoretically possible *if* you shut down the system immediately after the "rm" to recover portions of the data. However, you had better have a very wizardly type person at hand with hours or days to spare to get it all back.

Of course, it's actually quite simple to view the data on the disk, deleted or not, by simply looking at the raw disk. Since the data is still there, the easiest way to examine it is by utilizing the standard UNIX tools — strings, grep, text pagers, and the like. Unfortunately, they have no way of discerning what data is allocated and unallocated, but this doesn't mean that they cannot be useful, especially if you know what you're looking for. For example, if an intruder deleted all your system log files (which might start with the month, day, and time) from the first week of January, you could type this to see them:

strings /dev/raw/disk/device | egrep '^Jan 0[1-7] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]'| sort | uniq -c > date-file

(uniq -c compresses duplicate lines, prepending them with the number of occurrences. In all these examples, always verify that you are writing to a file that does not reside on the device you are recovering data from; otherwise, you run the risk of overwriting the data you want to save before recovering it.)
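As a rough illustration of what the pipeline above does, here is a minimal Python sketch. It assumes the raw device has already been copied into memory or an image file. The function name count_log_lines and the sample bytes are hypothetical; the regular expression simply mirrors the egrep pattern.

```python
import re
from collections import Counter

# Runs of 4 or more printable ASCII bytes, roughly what strings(1) reports by default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e\t]{4,}")
# Same pattern as the egrep example: a time stamp for January 1 through 7.
LOG_STAMP = re.compile(rb"^Jan 0[1-7] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]")

def count_log_lines(raw: bytes) -> Counter:
    """Count each distinct log-like line in raw bytes (like sort | uniq -c)."""
    counts = Counter()
    for match in PRINTABLE_RUN.finditer(raw):
        line = match.group()
        if LOG_STAMP.match(line):
            counts[line.decode("ascii")] += 1
    return counts

# Hypothetical sample: two identical log lines separated by binary junk.
sample = (b"\x00\x01Jan 03 10:15:42 host sshd: session opened\n"
          b"\xff\xfeJan 03 10:15:42 host sshd: session opened\n"
          b"\x00Feb 11 09:00:00 host cron: ignored\n")
for line, n in sorted(count_log_lines(sample).items()):
    print(n, line)
```

Like the shell version, this cannot tell allocated data from unallocated data; it only finds text that looks right.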

Since this searches through the entire disk, it can be quite slow. It also mixes in unremoved data with the removed, but it's certainly better than nothing, and it provides us with a starting point. It can actually be useful for recovering data with a regular or repeating form. Of course, more concise regular expressions or programs may be used to separate digital wheat from chaff.

Combing through the entire disk is certainly time consuming, however. In our search for deleted data, we would ideally like to be able to ignore the visible or allocated data and only search through the unallocated portion of the disk. To take this next step, I'll describe how data is stored on disks.

Brief UNIX Filesystem Overview

In UNIX, each disk drive is divided up into one or more disk partitions, each of which may contain a single filesystem. In turn, each filesystem has a list of inodes and a set of data blocks.

An inode holds almost all the information (other than the file name and its actual data) that describes an individual file — the size, the location of disk blocks, and so on. Inode numbers and their corresponding file names are stored in directory entries.

Data blocks are regularly sized chunks of data. Although users might think they are writing or reading individual bits and bytes from a disk, in reality, disks can only access the media in a physical (fixed-size) sector or group of sectors. The filesystem breaks up any data request to or from a file into logical blocks of data that correspond to (not necessarily contiguous) physical blocks on the disk. UNIX typically uses 8 KB for its logical block size.

When a file is deleted, the name remains in the directory but the inode number the name points to is removed. In addition, the inode itself is changed: The ctime is updated and the data block location is erased.
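The deletion behavior just described can be modeled with a toy data structure. This is only a sketch for building intuition, not how any real filesystem is implemented; the class names, field names, and block numbers are all hypothetical.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Inode:
    size: int = 0
    ctime: float = 0.0
    blocks: list = field(default_factory=list)  # block numbers holding the data

@dataclass
class ToyFilesystem:
    data_blocks: dict = field(default_factory=dict)  # block number -> bytes
    inodes: dict = field(default_factory=dict)       # inode number -> Inode
    directory: dict = field(default_factory=dict)    # file name -> inode number

    def create(self, name, inum, blocks):
        for num, payload in blocks.items():
            self.data_blocks[num] = payload
        size = sum(len(p) for p in blocks.values())
        self.inodes[inum] = Inode(size, time.time(), list(blocks))
        self.directory[name] = inum

    def delete(self, name):
        inode = self.inodes[self.directory[name]]
        self.directory[name] = 0      # the name stays, the inode number is cleared
        inode.ctime = time.time()     # ctime records the change
        inode.blocks = []             # block locations are forgotten
        # data_blocks is untouched: the contents are still on the "disk".

fs = ToyFilesystem()
fs.create("notes.txt", 12, {100: b"secret ", 101: b"plans"})
fs.delete("notes.txt")
print(fs.directory, fs.inodes[12].blocks, fs.data_blocks)
```

After the delete, nothing in the directory or the inode leads back to blocks 100 and 101, yet their contents are intact. That gap is what the rest of the article tries to bridge.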

The Coroner's Toolkit includes the unrm program, which can emit all the unallocated blocks on a filesystem. It does this by reading the list of free blocks in a filesystem, going to each logical block, and seeing if it contains any blocks or fragments of unallocated data. Fortunately, the free list covers all blocks in the partition, ignoring disk abstractions such as cylinder group maps, boot blocks, and the like, so you're pretty much guaranteed to get all the data blocks.
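The core idea can be sketched in a few lines of Python. This is not the unrm source: the real program reads the filesystem's own free-block bookkeeping, whereas here the allocated set is a hypothetical stand-in supplied by the caller, and the four-byte block size is chosen only to keep the example readable.

```python
def unallocated_blocks(image: bytes, block_size: int, allocated: set):
    """Yield (block_number, data) for every block not marked as allocated."""
    total = len(image) // block_size
    for number in range(total):
        if number not in allocated:
            start = number * block_size
            yield number, image[start:start + block_size]

# Hypothetical 4-block image; blocks 0 and 2 belong to live files.
image = b"AAAA" + b"dead" + b"BBBB" + b"beef"
free = list(unallocated_blocks(image, 4, allocated={0, 2}))
print(b"".join(data for _, data in free))
```

The output stream is simply the free blocks concatenated in disk order, which is why it can be piped into other tools.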

Unrm can be useful if you're looking for something that you know is deleted. For example, assuming you accidentally deleted your password file (a file composed of lines of seven fields separated by colons, the third and fourth fields being numeric), you could probably recover most of it by using unrm and a bit of editing:

unrm /dev/raw/disk/device | egrep '^.*:.*:[0-9]*:[0-9]*:.*:.*:' | sort -u > unrm-password-file

(Many UNIX systems distinguish between block devices and raw devices. On such systems, the block device may not give you data that is buffered for block device I/O. Always use the raw disk device on any system if available.)
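The filtering step of the unrm example can also be written in Python. The sketch below is deliberately stricter than the egrep pattern: it insists on exactly seven colon-separated fields and at least one digit in the third and fourth, which cuts down on false positives at the risk of missing damaged lines. The function name and sample text are hypothetical.

```python
import re

# Seven colon-separated fields; the third and fourth (UID and GID) are numeric.
PASSWD_LINE = re.compile(r"^[^:]*:[^:]*:[0-9]+:[0-9]+:[^:]*:[^:]*:[^:]*$")

def recover_passwd_lines(text: str) -> list:
    """Return sorted, de-duplicated lines that look like passwd entries."""
    return sorted({line for line in text.splitlines() if PASSWD_LINE.match(line)})

sample = "\n".join([
    "root:x:0:0:root:/root:/bin/sh",
    "garbage line with no structure",
    "dan:x:1001:100:Dan:/home/dan:/bin/csh",
    "root:x:0:0:root:/root:/bin/sh",
    "a:b:c:d:e:f:g",
])
print(recover_passwd_lines(sample))
```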

Indeed, even using a pager (such as "less," rather than the symbol of male fertility that people hang off their belt) or editor that can display and search through binary data suddenly becomes quite useful when applied only to the unallocated data. For performance reasons, the filesystem attempts to allocate disk space in consecutive data blocks — the data for most files tends to stick together — and it becomes easy to examine the data directly.

This is still unsatisfactory for many tasks, because finding content in raw data alone is often difficult; there is a reason for files and directory structure, after all. Actually, when armed with unrm, the hardest part of data recovery stems from the fact that all the time information tied to the content is lost (except with Linux). If you only knew when the data was nuked, you could recover a great deal of it fairly easily.

To proceed onward, another method will be used.

Dawn of the Dead

Lazarus is another program included in The Coroner's Toolkit. Because it's a rather strange but simple program that produces unusual results, I'll describe it in some detail. Its goal is to give unstructured data some form that is both viewable and manipulatable by users. It achieves this goal via a few simple heuristics. The results are predicated on two lemmas:

  • The UNIX FFS never starts writing file data except on well-defined boundaries. If we choose an input block size that is consistent with this, we will never miss an opportunity for dividing up a file appropriately — 1024 bytes should succeed for this goal.

  • UNIX filesystems like to write files in contiguous blocks when possible for performance reasons. (The UNIX filesystem always keeps itself relatively defragmented, unlike many PC filesystems.)

With these ground rules, a sort of primitive digital X-ray device can be created. The map of the disk that is created essentially makes the drive transparent — you can peer into the disk and see the data by content type, but the highly useful filesystem abstraction is lost. Figure 1 is an example of the interface and a once-deleted JPEG file.

Unlike the usual small file-sized bites that most data recovery programs take, this is more like a giant vacuum cleaner that sucks up the entire deleted part of the disk and tries to make some sense of all of it.

How it Works

Lazarus begins by reading in a block of data from its input stream and roughly determining what sort of data — text or binary — the block is. This is done by examining the first 10 percent of the bytes in the block — if they are mostly unprintable characters, then it is flagged as a binary block; otherwise, it is flagged as text data.
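A minimal Python sketch of this first pass follows. It is not the lazarus code (which is a separate program in The Coroner's Toolkit); the function name is hypothetical, and reading "mostly unprintable" as "more than half" is an assumption.

```python
import string

PRINTABLE = set(string.printable.encode("ascii"))

def classify_block(block: bytes, sample_fraction: float = 0.10) -> str:
    """Label a block 'text' or 'binary' from the first 10 percent of its bytes."""
    sample = block[:max(1, int(len(block) * sample_fraction))]
    if not sample:
        return "binary"
    unprintable = sum(1 for byte in sample if byte not in PRINTABLE)
    # Assumed threshold: more than half unprintable means binary.
    return "binary" if unprintable > len(sample) / 2 else "text"

print(classify_block(b"Hello, world! " * 80))     # a block of plain text
print(classify_block(bytes(range(0, 8)) * 128))   # a block of control bytes
```

Sampling only the start of each block keeps the test cheap, at the cost of misjudging blocks whose character changes partway through.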

If the block has been flagged as text, lazarus checks the data against a set of regular expressions to attempt to determine what it is with finer detail. For instance, if it sees "From:," it further marks the text as mail; "<A HREF=" marks the text as HTML code; and so on.
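The refinement step might look something like the following sketch. Only the mail and HTML markers come from the description above; the C pattern, the labels, and the table layout are assumptions added for illustration.

```python
import re

# Hypothetical pattern table; order matters, since the first match wins.
TEXT_PATTERNS = [
    ("mail", re.compile(r"^From:", re.MULTILINE)),
    ("html", re.compile(r"<A HREF=", re.IGNORECASE)),
    ("c", re.compile(r"^#include\s*[<\"]", re.MULTILINE)),  # assumed, not from the article
]

def text_subtype(text: str) -> str:
    """Return a finer label for a text block, or 'text' if nothing matches."""
    for label, pattern in TEXT_PATTERNS:
        if pattern.search(text):
            return label
    return "text"

print(text_subtype("From: dan\nSubject: hi\n"))
print(text_subtype('<a href="x">y</a>'))
```

A table of patterns is easy to extend, which matters because the quality of the results depends almost entirely on how many content types are recognized.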

If the block was binary, the UNIX file(1) command (which attempts to classify files based on content) is run over the chunk of data. File(1) isn't used on all blocks — text and binary — because it gives fairly poor answers in some classes of problems, plus the regular expressions give a finer grained control on finding out what a text-orientated file is — at the cost of performance, of course.

While recovering data, lazarus saves its findings. If the data block is not specifically recognized after the initial text/binary recognition but instead follows a recognized chunk of text/binary data (respectively), lazarus assumes that it is a continuation of the previous data and will concatenate it to the previous data block. These discrete files (or pseudofiles, since they aren't the real files they once were, but instead are based on ephemeral data) are then individually written to disk.
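The continuation rule can be sketched as a single pass over labeled blocks. In this hypothetical version, each block arrives as a (kind, subtype, data) tuple, where subtype is None when the block was not specifically recognized. Treating every recognized block as the start of a new pseudofile is an assumption; the actual program may merge more aggressively.

```python
def group_pseudofiles(labeled_blocks):
    """Merge labeled blocks into pseudofiles.

    Each item is (kind, subtype, data): kind is 'text' or 'binary', and
    subtype is a label such as 'mail', or None if the block was unrecognized.
    """
    pseudofiles = []  # each entry: [kind, subtype, bytearray]
    for kind, subtype, data in labeled_blocks:
        previous = pseudofiles[-1] if pseudofiles else None
        if subtype is None and previous is not None and previous[0] == kind:
            previous[2].extend(data)   # unrecognized, same kind: continuation
        else:
            pseudofiles.append([kind, subtype, bytearray(data)])
    return [(k, s, bytes(d)) for k, s, d in pseudofiles]

blocks = [
    ("text", "mail", b"From: a\n"),
    ("text", None, b"body\n"),
    ("binary", None, b"\x00\x01"),
    ("text", "log", b"Jan 01\n"),
    ("text", None, b"more\n"),
]
for kind, subtype, data in group_pseudofiles(blocks):
    print(kind, subtype, len(data))
```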

With the exception of images, lazarus saves data to neutral file names (having a name representative of the content, but ending in .txt) to avoid interpretation by the browser — the last thing you want is to have your browser execute code while examining potentially unexploded munitions (Java, JavaScript, ASP, and the like) left by an intruder.

Simple text characters are used to represent data chunks using a logarithmic (base 2) compressed scale of representation. This means a single character is one block of data, the second two, and so on. This allows large files to be visually significant but not overwhelming — a megabyte file would only take up 10 times the space of a single block file. In tests, I would typically see two orders of magnitude in visual compression from the one-block == one-character method.

A snippet of typical output might look something like this:


Llllllll....C.Mmmmmm...Mmmmm.... LlllllllllllMmmmm..


where capital letters are the start of a pseudofile, the Cs represent C source code, Ls are log files, Ms mail, and "."s unrecognized binary data.
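To make the compressed scale concrete, here is a small sketch that renders such a map. The rounding rule is an assumption: a pseudofile of n blocks gets floor(log2(n)) + 1 characters, so a 1024-block file takes 11 characters, in line with the "10 times" figure above give or take one. The function names and block counts are hypothetical.

```python
def map_chars(letter: str, block_count: int) -> str:
    """Render one pseudofile: a capital letter first, then lowercase fill."""
    # For n >= 1, n.bit_length() equals floor(log2(n)) + 1 (assumed rounding).
    width = max(1, block_count).bit_length()
    return letter.upper() + letter.lower() * (width - 1)

def render_map(pseudofiles) -> str:
    """Concatenate the rendering of each (letter, block_count) pair."""
    return "".join(map_chars(letter, count) for letter, count in pseudofiles)

# A 200-block log file, a 1-block C file, and a 64-block mail file.
print(render_map([("L", 200), ("C", 1), ("M", 64)]))
```

With one character per block the same three pseudofiles would need 265 characters; on this scale they need 16.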

Since writing a good UI is a very difficult task, I opted to generate HTML and use the familiar browser as an optional interface that could process a modest amount of data. Only modest, however. After watching browser after browser gasp, seize their hearts, and keel over after pumping a few megabytes of HTML with accompanying graphics, I was forced to use colored text instead. Fortunately, the mapping of the disk is significantly smaller than the disk itself.

I soon found that once the disk is mapped out, simply looking at the overall disk map can be a learning process. Whether examining an entire disk or simply the unallocated parts, you can clearly see not only the clustering of types of data, but what part of the disk you're looking at. Clusters of executable files point to system directories, bunches of log files to log repositories, HTML and graphical images point to browser caches or web directories, and so on. The clumping of file types based on directory structure and the relationship of locality on file types is striking.

As a caveat, the unrm/lazarus combination can chew through vast amounts of disk space to store the results: 100 percent of the unallocated space for unrm, and up to 150 percent for lazarus to store all its data files. And while unrm is reasonably efficient, lazarus runs something like a bloated pig dog with a leg or two missing. Going out to coffee while it's running is not the answer; you might need to take a vacation before it finally finishes.


The unrm/lazarus combination is not, unfortunately, a competitor to PC programs such as Norton Utilities and their marvelous unerasing program, where you can almost instantly click on files that have been deleted and get them back with little danger of data loss (assuming you do it relatively quickly). It's not only time consuming, but arduous and very hit and miss; large files, unless very regular in form (such as log files and other tightly constrained data files), can be very difficult to recover. Groups of files that have been destroyed together are also problematic, as they tend to blur together; there is a real reason for files and directory structure.

Obviously, there are no guarantees or statistical probabilities that I can give as to how successful you might be if attempting to recover your own data. But certainly with smaller files, even if they were deleted a very long time ago, you have a reasonable chance of successful regeneration. Anecdotal evidence suggests that large data recovery efforts seem to be able to recreate something like 2/3 of the data — the other third is probably there, but simply too difficult to piece together.

But to paraphrase an earlier article, if you're looking for anything at all, you'll find something, but if you're trying to find a single file it is much more difficult. Grep and other UNIX shell tools can be very useful when homing in on individual data within the mountain of processed files.

Almost perversely, they're fine tools for spying — anything that has hit the disk is fair game, even if immediately deleted. Mail, browser caches, and other ephemeral data are easily examined. Intruders will often download system cracking tools that are compiled, run, and then deleted. When used in combination with MACtimes and other forensic tools, lazarus can be a powerful mechanism for discovering types of activity and not simply recovering data.

Forensic Examinations

In what was perhaps the most striking example of combining what we've learned and written about so far, a break-in occurred in which the system administrators had immediately halted the system upon finding the intruder. Nothing was known about how the interloper gained access, and since no memory or network information was available (the system was simply turned off), The Coroner's Toolkit and unrm were run on the corpse of the system.

After mactime revealed that some unusual C header files were read recently (indicating that the intruder had probably brought along his security-exploit code in source form and then compiled it on the target system), grep was used to search for a reference to this header file in the recognized C code of the unrm/lazarus output, which exposed a single deleted file. Unfortunately, lazarus had only recovered about 2/3 of the source code to this file, but putting a couple of lines from it into a WWW search engine provided the missing third to an exploit that revealed precisely the mechanism used to break into the system.


So, is what the UNIX FAQ says about removed files still valid? Not really. Some files (especially on Linux systems) can be recovered with very little effort or time. And while it can take a great deal of time to actually recover something, wizardly skill is not really required. Ultimately, however, your odds of getting anything useful from the grave are often a question of personal diligence: how much is it worth to you? If it's important enough, it's quite possibly there.

The persistence of data, however, is remarkable. Contrary to the popular belief that it's hard to recover information, it's actually starting to appear that it's very hard to remove something even if you want to. The unrm/lazarus combination is a fine, if a bit unsettling, trash can analyzer. And while the results can be spotty for simple single file "undeletion," robbing graves for fun and profit can be a lucrative venture for an aspiring forensic analyst. Indeed, when testing this software on a disk that had been used for some time on a Windows 95 machine, then reinstalled to be a firewall using Solaris, and finally converted to be a Linux system, files and data from the prior two installations were clearly visible. Now that's data persistence!

Forensic data is everywhere on computers. We urge others to continue the examination, for we have simply scratched the surface.

