Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Open Source

UNIX Filesystems without I-Nodes


Dr. Dobb's Journal February 1997: UNIX Filesystems without I-Nodes

Volker, who implemented the ncpfs and smbfs filesystems for the Linux kernel, studies mathematics and computer science at the University of Göttingen, Germany. He can be contacted at [email protected].


Along with nfs, the Linux kernel smbfs and ncpfs filesystems make it possible to link Linux machines to virtually any file server -- from Pathworks to Windows NT 4.0, from NetWare to any NFS server -- across a LAN. When I was implementing smbfs and ncpfs, however, it became clear that Microsoft's Server Message Block (SMB) protocol is not designed to handle UNIX clients like Linux. SMB, the protocol that implements file services, is designed for DOS. Consequently, SMB has no notion of an i-node, the central structure in every UNIX file-system implementation. On the surface, this would appear to limit Linux's usefulness on heterogeneous networks. However, in this article, I present techniques I developed to work around this limitation.

UNIX Filesystems

The user's view of a UNIX filesystem is straightforward: Everything is arranged in a hierarchy of directories. But this hierarchical order does not reflect the filesystem's layout on disk. In reality, all things that can be stored in a filesystem -- normal files, directories, named pipes, device files, and so on -- are represented by inodes. An i-node stores everything the system needs to know about a file -- owner, size, permission bits, time stamps, and pointers to the file's data. Basically, the inode is the file. You might expect to find the file name in the i-node, but you won't. The name is stored elsewhere. Inodes are arranged in a large array stored at the beginning of a disk partition, or scattered around within the partition. The filesystem can only reference inodes via their numbers, the indexes into the inode table.

The illusion of a hierarchical filesystem is created by the directory, a special type of file. On disk, directories look much like normal files; they have an inode and data bytes. Their data simply consists of a list of filenames paired with inode numbers. These pairs of names and inodes are what users see as files. The difference between normal files and directories is that users are not allowed to access the directories. If a program manipulated directories, the UNIX system could not maintain its hierarchical structure. This loose coupling between a file's name and the file itself is the reason you can have two different names for a single file. You just create two directory entries pointing to the same inode; the names can even be in different directories. Moving files in the filesystem is also simple: Create a filename with the file's inode number in the target directory and delete the file's entry in the source directory. In short, a file gets an inode number when it is created. This inode number remains constant during the lifetime of the file, no matter where you move it in the directory tree.

This central role of inodes can also be seen in the implementation of the Linux virtual filesystem (vfs). In fact, the inode structure (see Figure 1) is the most important object in the Linux vfs. An inode is read from a table on disk when the corresponding file is opened the first time, or when the user wants information about the file. For example, the command ls -l reads every file's inode and shows the metadata of each file it lists. The inode is written back when the last process that accessed the file closes it. All other operations that can be performed on a file, including read/writes, refer to the inode.

Drawbacks for Linux

Both Microsoft's SMB protocol and early versions of Novell's NetWare Core Protocol (NCP) were designed to redirect all file-access functions of the DOS INT 0x21 to the server. The native DOS filesystem (that is, the FAT system) has no notion of inodes. FAT does not separate filenames from the files themselves. The meta data of the files is stored in the directory the file resides in together with its name. Because no inodes or inode numbers are needed to implement the INT 0x21 interface, neither SMB nor NCP provide inode numbers. All that is transferred over the network is the pathname of a file. In NCP, you can allocate 1-byte handles for parts of the path. But these handles cannot be used as inode numbers, as they are too short and do not uniquely identify a file.

To implement a filesystem for Linux, you have to implement read_inode(). The upper layers of the Linux filesystem (implemented in the linux/fs/*.c source files) hand an inode number to this function and expect the actual file-system implementation to fill a prepared inode structure with the values the inode has on disk. But what do you do when you can't ask the file server to give you "the size of the file with inode 1234" because the server doesn't know anything about inode 1234? That's the problem I faced when implementing a filesystem for Linux. Clearly, I had to find a way to fool the Linux kernel.

When writing smbfs and ncpfs, I had to identify all points in the Linux vfs layer that rely on the fact that a file is essentially an inode with a fixed inode number. I did not expect to find only two vfs routines to deal with inode numbers: lookup() and read_inode(). (create() is a third, but it works exactly as lookup() does.) The kernel asks lookup() for the inode of a file, giving it the filename and the directory to search. lookup() finds the inode number of the specified file, then calls iget() to read the inode. iget() is a higher-level function that maintains a cache of inodes. When it does not find the requested inode in its cache, it calls read_inode() with the inode number it just got from lookup().

So smbfs and ncpfs have to make sure that lookup() generates unique inode numbers for all the inodes iget() holds in memory. And lookup() needs a way to tell read_inode() about the inode numbers it has generated artificially. The Linux kernel doesn't care about the inode number stored on disk or anywhere else. The inodes in memory are those that are referenced as open files from one or more processes and are the current working directory of a process. The numbers of all other inodes can be randomly chosen. The only problem that remains is determining how the filesystem should choose the inode numbers.

One solution is to generate them randomly. However, this does not provide uniqueness. Another idea is to increment a counter whenever an inode number is requested. But as Linux systems are expected to run for a long time and must therefore cope with wraparound, a list of numbers in use must be maintained. This is inefficient. The fastest solution turns out to be the simplest: The Linux kernel already has an extremely reliable number generator perfectly suited for this purpose -- kmalloc().

The kmalloc() Approach

Normally, when a file-system type is added to Linux, the global inode structure is augmented by the fields that are needed by the new type. This can be found in the file linux/include/fs.h. When I was developing smbfs on the stabilized kernel 1.2.13, this was not an option, because I wanted to run these filesystems as modules without kernel modifications. Additionally, the structures needed by smbfs and ncpfs are quite large, and I did not want to penalize all other filesystems with such a large structure. So I chose to store the smbfs/ncpfs-specific inode data in a special kmalloc() structure. Now it's very fast to find unique inode numbers: I simply use the address of the allocated structure. We know this number is unique across all filesystems and it references the corresponding structure without overhead.

Inode numbers are generated in the vfs function lookup(). This function allocates the special filesystemspecific structure and gives its address as the inode number to the kernel routine iget(). iget() finds that the inode is not in memory yet and calls read_inode with the wanted number. read_inode is now able to find the filesystemspecific data by a type cast. The routine put_inode can reliably kfree the data, and free the inode number for later use.

This scheme works well from a kernel point of view. The first smbfs implementation used exactly this mechanism. Then the first users complained that the command pwd could not find the current working directory. As the kernel did not complain about any inconsistent structure, the inode numbering scheme seemed to be insufficient. The getwd routine, used by pwd, runs as follows: It works its way back to the root directory via the ".." entries of the directories. It stores the inodes of all the directories it finds on its way to the root. Then it reads all the directories as ls would, and compares the inode numbers it gets with the ones found on the way to the root. Directories are read until the current directory is found. Not really efficient, but under UNIX it's the only way to find the path to the current directory.

It should now be clear why the numbering scheme I implemented has to fail for getwd: Each time a directory is opened for reading, a new inode number is created for it. So getwd has no chance to reconstruct the path to the current directory. To help getwd, not only do the inodes used by the kernel need fixed and unique numbers, but so do all the inodes that are parents and grandparents of any inode used, up to the filesystem's mount point. Thus ncpfs and smbfs build up a tree of inodes that represents the directory structure currently in use. When an inode is released (because the corresponding file is closed or the directory has been left by cd, for example), the corresponding file-system structure is free()ed. Now the path to the root of the mount point is traversed and each directory's file-system structure is released when it is not used by other parts of the directory tree.

By this enhanced scheme, getwd can be satisfied, but some as-yet-unfixed problems remain. In most cases, when you do a ls -i in an ncpfs-mounted directory, each directory entry is assigned the same inode number. This is caused by the kernel malloc routine. When you kfree() a memory area and kmalloc() a memory block of the same size directly after, you get the same chunk of memory, resulting in identical inode numbers for all directory entries. This is especially annoying because the nfs daemon relies on unique inode numbers across the complete filesystem.

Another command that is broken by this scheme is an implementation of the diff command that tests whether the files have the same inode, then refuses to compare the files if the numbers are the same. cp is another candidate: Sometimes it refuses to copy a file over an existing file, because it believes both to be the same file.

Namespaces: NetWare has INodes

Again, Novell's NCP protocol was designed to implement DOS filesystem functionality. But when clients such as OS/2 and UNIX were used, NetWare had to cope with different notions of a filename. Although OS/2 and UNIX allow filenames that don't conform to the 8.3 convention, they do not agree about case sensitivity. To be able to cope with these needs, Novell introduced the concept of namespaces for NetWare 3. To handle namespaces, the NetWare core protocol was enhanced considerably.

From the protocol view, the NetWare 3 filesystem looks similar to UNIX filesystems. NetWare also seems to store inodes, which Novell calls "Directory Entries." The Directory Entry Table fulfills the same purpose as UNIX inode tables. This central table contains everything relevant about a file, such as its size, its owner, and a reference to the file's data. The NCP namespace services let you access entries in this table and, thus, individual files by their indexes into this table. These indexes are 32 bits long and could be called the "inode numbers" of the files.

There is one main directory entry per file, the original DOS filename entry. Each loaded namespace module adds another entry for each file. This is a complete directory entry of its own and, as such, also has a unique 32-bit number. It contains the filename according to the conventions the namespace expects, along with the number of the DOS filename entry. For example, the NFS namespace module stores up to 255 bytes of the filename, and knows that uppercase and lowercase filenames are different.

When a file is created, it is always created in a special namespace. For example, Windows 95 creates all files in the OS/2 namespace, giving the file a long filename. This file has to be visible from all other namespaces, so the other entries have to be created by NetWare itself. NetWare squeezes the name into the 8.3 scheme required by DOS and creates the corresponding DOS directory entry. Windows 95 has to do the same for the VFAT filesystem, where each file has an 8.3 filename in addition to its long filename. When a DOS workstation creates a file, NetWare does not have to truncate the filename, but it still has to create the additional entries in all the namespaces loaded. For volumes with many small files, the way namespaces are used can cause problems because lots of additional directory entries are allocated.

You can look at the namespace information with the NDIR program that Novell delivers with NetWare. When you call NDIR with the option /L, it will display the long filenames that are assigned to the files. Windows 95 only shows you the OS/2 and DOS namespace filenames, but NDIR will also show you the NFS and other namespace's filenames.

The namespace support of NetWare now makes it possible to solve the problem with the inode numbers, as the protocol gives you 32-bit directory-entry numbers that uniquely identify files and directories in the NetWare filesystem. As in UNIX filesystems, where inode numbers identify a file only within a filesystem, directory-entry numbers always refer to a NetWare filesystem, a volume. There might be two different files in different volumes with the same directory-entry number, as it is perfectly possible in UNIX. So to provide unique inode numbers for ncpfs-mounted directories, you have to mount a single volume under one mount point. When you do this, the directory-entry numbers that NetWare shows the client via the NCP protocol are handed to user processes as inode numbers, and user programs can rely on the inode numbers to be unique on the complete filesystem. This way, the filesystems mounted can be reexported by nfs.

I implemented this as an option only, because it is possible that some NetWare servers have a lot of volumes, which would require many mount points for the complete server to be mounted. This could be expensive, since currently each ncpfs mount point uses one NetWare user license. The other issue is that standard Linux kernels allow only 64 mount points, which could be quickly exhausted. It's much cheaper to raise the Linux limit than the NetWare limit, but it's inconvenient. Consequently, you can still mount all volumes under one mount point, but you lose the reexportability.

The Code

In general, the techniques I've described here are implemented in the ncpfs and smbfs kernel code, as well as parts of the general Linux vfs layer. More specifically, the files dir.c and inode.c (available electronically; see "Availability," page 3) are excerpts from the ncpfs code that illustrates the concepts I've presented. The latest versions of smbfs and ncpfs can always be found via http://www.kki.org/linux-lan/.

DDJ


Copyright © 1997, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.