Ron is a Member of Technical Staff at Los Alamos National Laboratory.
Imagine waking up one day and discovering that programming has suddenly regressed: you can only use global variables. Furthermore, imagine that all the names and values of all the variables of all the programs on your system are shared. Without question, this would be a nightmare. However, it's also how operating systems currently manage files. All of my files, for instance, can be made visible to you, and all of yours made visible to me, and there is nothing we can do about it, at least with most operating systems. Fortunately, the concept of private namespaces provides an alternative. Private namespaces let groups of processes construct their own namespace, pass that namespace to other processes on other computers, and keep the namespace invisible to processes outside the group. In this article, I will discuss how I went about implementing private namespaces for Linux to solve problems in both distributed and cluster computing. (The complete source code that implements this technique is available electronically; see "Resource Center," page 5.)
Distributed computing is the use of many computers for one application. As processes run on remote nodes, they often access their home filesystem via the network filesystem (NFS). Figure 1, for instance, illustrates the result of a remote process from machine B running on machine A, where the remote process on machine A can access B's disks with the prefix /net/B.
Now consider a network of 400 machines where, in most cases, the set of distributed programs accesses every filesystem that may be accessed by any user at any time. Because the schedulers are trying to use as many machines as possible, these programs access all these filesystems from every desktop. In the worst case, the network of machines ends up with "spaghetti mounts," with many workstations mounting each other and even circular dependencies. The end result is that the entire collection of machines can become vulnerable to a single desktop machine going down.
You might expect that clusters would not have problems with NFS spaghetti mounts. In practice, however, the problem can sometimes be worse on a cluster. Because the cluster connectivity is very good and data sets are typically much larger than one disk, files are usually striped across the cluster nodes, and the nodes act as servers. If an NFS file server is a client of some other server, cascading outages of servers begin to occur, and whole parts of the cluster can become hung. The delays also exhibit self-synchronizing behavior, which can result in nodes all hanging synchronously, causing even more delay and failure. (The types of problems that can occur were briefly mentioned by the group that rendered the Titanic movie graphics.)
Remote file access is a common problem for both distributed and cluster computing. When a program accesses a remote filesystem, it makes changes to the global state of the operating system, and these changes can affect the operating system and all the processes running on it. Referring to Figure 1, you can see that once the process from machine B is running on machine A, the following happens:
- The process accesses B's filesystems, and they are mounted by the automounter. If B goes down, machine A can hang and may need to be rebooted. NFS only has a few ways of dealing with server outages: hanging the client until the server comes back (hard mounts); losing data on reads and writes while the server is down (soft mounts); or hanging on writes and losing data on reads (BSD spongy mounts).
- Any process on machine A can now peruse machine B's filesystem at will. B cannot distinguish malicious browsing by other processes on A from normal filesystem access by the remote process running on A.
- B's process can peruse A's filesystem at will. In fact, the problem is even worse if A and B are in different administrative domains because the process from B has now gained access to all of the servers that A can access. This single security problem alone makes it impossible to convince different organizations to share compute resources, because administrators are unwilling to leave their filesystem environment open to other organizations.
The type of namespace supported by UNIX is known as a "global namespace." In global namespaces, the filesystem namespace is common to all processes and traversable by any process on the machine. New filesystems become accessible to all processes as they are mounted, promoting "mutual insecurity."
The combination of global namespaces and distributed and cluster computing creates problems that cannot be solved by manipulations of directory permissions, automount tables, or judicious use of the chroot system call.
Again referring to Figure 1, the process running on B might construct a private namespace consisting solely of (for instance) /bdisk1 from B, as in Figure 2. When the process moves to A, it restarts with only /bdisk1 in its namespace and has no access to any of A's disks. No process on A can access /bdisk1 without communicating with B and mounting the disk; A's processes cannot access /bdisk1 just because a process from B happens to be running on A. If B goes down, then the only process that hangs or fails is the remote process started from B. The failure of B has no impact on any other process on A, or on A's operating system. In short, private namespaces remedy problems like this that I've experienced in large-scale distributed computing environments.
Private Namespaces for UNIX
Working from publicly available documents, I've built an implementation of the Plan 9 filesystem protocol and tested its user-mode components on FreeBSD, Solaris, SunOS, and Linux. I have also written a kernel-mode virtual filesystem (VFS) that runs on Linux. (For more information on Plan 9, see http://plan9.bell-labs.com/plan9dist/ and "Designing Plan 9," by Rob Pike, Dave Presotto, Ken Thompson, and Howard Trickey, DDJ, January 1991.)
As Figure 3 shows, the current components of this system provide several different layers of functionality. There is a user-mode library for the client side, a kernel VFS for the client side, and a set of user-mode servers.
The Plan 9 Protocol
The Plan 9 protocol uses stream-oriented communications services to connect clients to servers. Communications are via T messages (requests) and R messages (replies). The structure of T and R messages is very regular, with fixed-size data components and commonality between different message types.
Each T message has a unique 16-bit tag, a TID. Once a TID has been used for a T message, the tag may not be used again until an R message with the same tag has been sent, which retires that TID. The tags allow concurrency in the message transmission, processing, and reply. Requests may be aborted via a TFLUSH message. To gain access to remote filesystems, a process must:
- Create one or more sessions with one or more remote file servers. Part of this process involves authentication. Once a session is created, it is assigned a session ID.
- For each session, the process may mount one or more remote filesystems. A mount is known as an "attach" in Plan 9 terminology. Several attaches may be rooted at the same point in the process's private namespace, in which case that part of the namespace represents a union of the attaches ("union mount"). Mount points and files (opened or not) are referred to by File IDs (FIDs), but unlike NFS or other remote filesystem protocols, FIDs are specified by the client, not the server. A FID for an attach may be cloned (similar to a UNIX dup), and from that point on the cloned FID may be used for further operations, such as walking the filesystem tree.
- Using the cloned FID, the process can then traverse the remote filesystem tree via messages that walk to a place in the remote filesystem or create a new file. Walking does not change the FID.
- Unlike NFS, Plan 9 has an open operator. Files may be opened after the appropriate set of walk operations. The argument to an open is a FID.
- Once a file has been opened or created, it can be read or written via TREAD or TWRITE messages. Since each TREAD or TWRITE uses a unique TID, many concurrent reads or writes can be posted, allowing for read-ahead or write-behind to be supported.
- When a client is done with a file, it sends a TCLUNK message for the FID to indicate that the server can forget about the file. Once the clunk succeeds, the FID may be reused. A clunk for an attach point is equivalent to an unmount.
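The tag and FID rules in the steps above can be sketched as client-side bookkeeping. This is a minimal Python illustration, not code from the implementation described in this article; the class names `TagPool` and `FidTable` and all of their methods are hypothetical. It shows that a tag stays busy until its reply retires it, that the client (not the server) picks FID numbers, and that a walk changes what a FID refers to without changing the FID itself.

```python
class TagPool:
    """Hands out 16-bit tags; a tag stays busy until its reply retires it."""

    def __init__(self, limit=1 << 16):
        self.limit = limit
        self.in_flight = set()
        self.next_tag = 0

    def allocate(self):
        if len(self.in_flight) >= self.limit:
            raise RuntimeError("all tags are in flight")
        # Skip tags that still await a reply.
        while self.next_tag in self.in_flight:
            self.next_tag = (self.next_tag + 1) % self.limit
        tag = self.next_tag
        self.in_flight.add(tag)
        self.next_tag = (tag + 1) % self.limit
        return tag

    def retire(self, tag):
        # Called when the R message carrying this tag arrives.
        self.in_flight.remove(tag)


class FidTable:
    """Client-chosen FIDs mapped to the path each one currently names."""

    def __init__(self):
        self.fids = {}

    def attach(self, fid):
        if fid in self.fids:
            raise ValueError("FID already in use")
        self.fids[fid] = {"path": (), "open": False}

    def clone(self, fid, new_fid):
        if new_fid in self.fids:
            raise ValueError("FID already in use")
        self.fids[new_fid] = {"path": self.fids[fid]["path"], "open": False}

    def walk(self, fid, name):
        # Same FID number, new position in the tree.
        entry = self.fids[fid]
        entry["path"] = entry["path"] + (name,)

    def open(self, fid):
        self.fids[fid]["open"] = True

    def clunk(self, fid):
        # After a clunk the FID number is free to be reused.
        del self.fids[fid]
```

A typical sequence would be `attach(1)`, `clone(1, 2)`, a few `walk(2, ...)` calls, `open(2)`, and finally `clunk(2)`, leaving FID 1 still naming the attach point.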
Status for a file is accessed via a TSTAT message, and the status may be changed via a TWSTAT message. TWSTAT is used to implement chmod, rename, and other file state management operations.
This overview should provide a flavor of the nature of the Plan 9 protocol. Based on my experience with NFS (see my paper "Mether-NFS: A Modified NFS which supports Virtual Shared Memory," http://www.usenix.org/publications/library/proceedings/sedms4/full_papers/minnich.txt), I feel the protocol compares favorably to NFS. In contrast to the many variable-length fields in NFS, the fields in the Plan 9 messages are fixed size and the same fields are always at the same offset. The individual elements are defined in a simple, easily converted, machine-independent format. The file status structure is similarly straightforward. The protocol is connection oriented, circumventing problems that have plagued UDP-based versions of the NFS protocol. Finally, the user-mode mount and file I/O protocol provide for increased security, since there are no privileged programs running to support those protocols.
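To illustrate why fixed-size fields at fixed offsets are convenient, here is a minimal Python sketch. The layout below is an assumption made up for this example and is not the actual Plan 9 wire format; the message code `T_READ`, the field widths, and the function names are all hypothetical.

```python
import struct

# Hypothetical layout (NOT the real wire format), little-endian, no padding:
#   type   1 byte
#   tag    2 bytes
#   fid    2 bytes
#   offset 8 bytes
#   count  2 bytes
READ_REQUEST = struct.Struct("<BHHQH")
T_READ = 1  # illustrative message code


def pack_read(tag, fid, offset, count):
    """Encode a read request with every field at a fixed position."""
    return READ_REQUEST.pack(T_READ, tag, fid, offset, count)


def unpack_read(message):
    """Decode the request back into named fields."""
    mtype, tag, fid, offset, count = READ_REQUEST.unpack(message)
    return {"type": mtype, "tag": tag, "fid": fid,
            "offset": offset, "count": count}


def peek_tag(message):
    """Read only the tag; it always sits at byte offset 1 in this layout."""
    return struct.unpack_from("<H", message, 1)[0]
```

Because every field has a known offset, a receiver can match a reply to its request by reading the tag alone, without parsing any variable-length data first.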
Changes to the Plan 9 Protocol
To accommodate the differences between Plan 9 and UNIX, I made several changes and additions to the Plan 9 protocol. For one thing, I added support for symbolic links by adding two new messages to the protocol to support reading and creation of symbolic links.
The original Plan 9 protocol for reading directories returned not only directory entries, but all the information about each directory entry (for example, file sizes, access times, ownership information, and so on). This extra information is useful if used. However, it does slow directory reads by about a factor of 10, since most UNIX filesystems keep directory entries (the file names in the directory) and the information about those files in separate places. Since no extant UNIX system uses this information when a directory is read, I only return the directory entries, and not the information. David Butler has recently proposed adding this same type of limited directory reading operation to the Plan 9 protocol.
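The difference between the two styles of directory read can be sketched with ordinary Python calls. This is only an illustration of where the extra cost comes from, not the server code; the function names are hypothetical. The first function reads names only, while the second makes one additional status lookup per entry.

```python
import os


def read_names(directory):
    """Names only: a single pass over the directory."""
    return sorted(os.listdir(directory))


def read_names_with_status(directory):
    """Names plus status: one extra lstat call for every entry."""
    result = {}
    for name in os.listdir(directory):
        info = os.lstat(os.path.join(directory, name))
        result[name] = (info.st_size, info.st_mtime, info.st_uid)
    return result
```

On a directory with many entries, the second form touches the per-file information for every name, which is the work the names-only read avoids.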
The client-side user-mode components let unmodified programs access private namespace semantics. The clib_libc.so is a stub library that overrides functions in the Standard C Library so that private namespace functions can be supported, as well as global namespace functions. Access to the global namespace may be disabled if desired. Programs that use this library do not need to use a kernel VFS. This library is used on systems that do not have a VFS or systems that cannot load the VFS.
User-mode components not shown in Figure 3 include the dump/restore functions that support private namespace inheritance transparently (in the C library they are integrated into fork()), as well as additional programs for testing the libraries and building the namespace that is inherited by unmodified UNIX programs.
The kernel VFS consists of several layers. The top layer consists of local directory structures. Every process has a private version of these structures, rooted at "private" in the VFS. For example, if the VFS is mounted at /v9fs, then a process referencing /v9fs/private sees a private copy of a directory tree, just as the name "current" refers to the current process in the /proc VFS. The next layer is a union mount layer, which sits on top of an actual mount layer. From the mount layer, VFS operations are performed over the network to the server. Processes using only the private namespace perform a chroot to /v9fs/private, at which point the root of their namespace is the private namespace and the global namespace is no longer accessible.
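The idea behind the union mount layer can be sketched in a few lines of Python. This is a simplified model, not the kernel code: each attached tree is represented by a plain dictionary, the class name `UnionDirectory` is hypothetical, and the rule that the earliest attach wins on a name clash is an assumption made for the example.

```python
class UnionDirectory:
    """Several attached trees presented as one directory."""

    def __init__(self):
        self.layers = []  # attached trees, in attach order

    def attach(self, tree):
        self.layers.append(tree)

    def lookup(self, name):
        # Assumption: the first attached tree holding the name wins.
        for tree in self.layers:
            if name in tree:
                return tree[name]
        raise FileNotFoundError(name)

    def listing(self):
        # The union shows every name once, whichever tree supplies it.
        names = set()
        for tree in self.layers:
            names.update(tree)
        return sorted(names)
```

A lookup that misses in every layer fails, just as a lookup in an ordinary directory would.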
Servers consist of the communications layer that sends and receives packets; a packet decode layer that determines what functions to call, based on the packet type; and a filesystem-type dependent layer, which consists of the set of functions that implement a given filesystem. Most of the code is common, save for this top layer. Building a new filesystem involves writing a new set of functions and linking them with a server library.
I currently support two types of servers: an interface to the filesystem, to support remote file access; and a simple memory-based filesystem. The memory-based filesystem provides a network RAM disk. The memory-based filesystem can be used to support temporary files that might be shared between several processes on different nodes. Server processes can use the memory server to store information about servers, and thus the memory server can be used as a directory of servers.
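The server layering described above (a decode layer that picks a function by packet type, on top of a filesystem-specific set of functions) can be sketched as a dispatch table. This is a minimal Python model, not the server library itself; requests are plain dictionaries rather than wire packets, and the names `MemoryFilesystem`, `Server`, and the request fields are hypothetical.

```python
class MemoryFilesystem:
    """Filesystem-specific layer: files held entirely in memory."""

    def __init__(self):
        self.files = {}

    def create(self, name):
        self.files[name] = bytearray()

    def write(self, name, offset, data):
        buf = self.files[name]
        if offset > len(buf):
            buf.extend(b"\0" * (offset - len(buf)))  # zero-fill any gap
        buf[offset:offset + len(data)] = data
        return len(data)

    def read(self, name, offset, count):
        return bytes(self.files[name][offset:offset + count])


class Server:
    """Common layer: decode the request type and call the right function."""

    def __init__(self, filesystem):
        self.handlers = {
            "create": filesystem.create,
            "write": filesystem.write,
            "read": filesystem.read,
        }

    def handle(self, request):
        handler = self.handlers.get(request["type"])
        if handler is None:
            return {"type": "error", "tag": request["tag"],
                    "message": "unsupported request"}
        try:
            result = handler(*request["args"])
        except KeyError:
            return {"type": "error", "tag": request["tag"],
                    "message": "no such file"}
        # The reply carries the same tag as the request.
        return {"type": "r" + request["type"], "tag": request["tag"],
                "result": result}
```

In this model, supporting a different kind of filesystem means supplying another object with the same three methods; the `Server` class does not change.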
Private Namespaces in Distributed Computing
The main application that I've found to date for distributed computing is the support of remote file access that does not require automounters or NFS access across organizational boundaries, or special privileges of any kind. I make heavy use of the user-mode client-side support described here, since not all the systems I run on are Linux, and not all Linux systems have VFS installed. UNIX programs (including Emacs, gcc, and all the shells) run without problems.
Another interesting use of the private namespace is a process-family-private /proc filesystem. Users may have the ability to distribute programs to remote nodes but may not be given permission to run (for example) a process status program such as ps, or to peruse the /proc filesystem on the remote nodes. The processes running on the remote nodes can open a session to a memory-based server and write status into a file at intervals. The processes that use this instance of the server are part of a distributed process family. Users can start a session to the server and examine the process status with ordinary ls and cat commands, as in Figure 4. The process that actually writes the file can be the remote computation or a control program that manages the remote computation and reports on its status.
Private Namespaces in Cluster Computing
For clustering, I have found two main uses for private namespaces to date: replacing NFS and building clustered /proc namespaces. In each case I use the filesystem servers, not the memory servers.
On clusters, I use the VFS on Linux nodes. The VFS is SMP-safe and 64-bit clean. Because the mounts are process-to-process, special root access is not required to mount a remote filesystem. Users can mount a remote filesystem without requiring system administrator support or the root password. At the same time, the remote mounts do not compromise security since they occur in the context of the user's private namespace.
I also use the private namespace to build cluster /proc filesystems for families of processes. As a single parent process starts up processes on other cluster nodes, the parent process can mount /proc from those other nodes so as to monitor process status. Once those other /proc directories are mounted and accessible, the cluster /proc lets users easily monitor a family of processes. In Figure 5, for instance, users have mounted /proc from localhost and four nodes into /proc in the private namespace. All of the process status is easily viewed via this cluster /proc. A significant advantage of this type of /proc is that the user sees /proc only on nodes of interest. Instead of having to deal with process status from 160 nodes, users need only see status for the nodes in use by the process family. A remaining step is to modify ps to use paths with /proc/<host-name> in them instead of just /proc.
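What a ps-like tool would have to do with /proc/<host-name> paths can be sketched as follows. This is a hypothetical Python helper, not a modified ps; it assumes the layout described above, with one subdirectory per host and one numeric subdirectory per process beneath it.

```python
import os


def cluster_processes(proc_root):
    """Return sorted (host, pid) pairs found under proc_root/<host>/<pid>."""
    found = []
    for host in os.listdir(proc_root):
        host_dir = os.path.join(proc_root, host)
        if not os.path.isdir(host_dir):
            continue
        for entry in os.listdir(host_dir):
            # Process directories are the ones with purely numeric names.
            if entry.isdigit() and os.path.isdir(os.path.join(host_dir, entry)):
                found.append((host, int(entry)))
    return sorted(found)
```

Only the hosts mounted into the private namespace appear in the result, so the listing covers the process family's nodes and nothing else.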
Common Uses of Private Namespaces
A final application of the private namespace common to both cluster and distributed computing is a directory of servers. Directories of servers let remote clients locate servers by name, without knowing a server's port number. The client contacts a directory of servers and looks up the desired server by name, opens the file for that name, and reads the information needed to contact the server, such as port number and the type of authentication required by that server.
To support the directory of servers application, I have applied for and been granted a registered service name (fsportmap) and port number, 4349/tcp, from the Internet Assigned Numbers Authority. This port number is used as follows: A process contacts a host on this port, establishes a session with the server, and attaches the root of the server that holds the directory of servers. The process can then look up an entry for a server of interest.
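The lookup step can be sketched in Python under some loudly stated assumptions. The article does not specify how an entry is stored, so the one-field-per-line text format below ("port 7001", "auth none") is invented for illustration, the directory is modeled as a dictionary of file names to file contents rather than a live session on port 4349, and the function names are hypothetical.

```python
def parse_entry(text):
    """Parse a hypothetical entry file with one 'key value' pair per line."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return fields


def locate(directory, name):
    """Look up a server by name; return its port and authentication type."""
    try:
        text = directory[name]  # stands in for open-and-read of the file
    except KeyError:
        raise LookupError("no server registered as %r" % name) from None
    fields = parse_entry(text)
    return int(fields["port"]), fields.get("auth", "none")
```

The client needs to know only the server's name; everything else it needs to make contact comes from reading the entry.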
Using a directory of servers (Figure 6) in this manner is much less complex than NFS. To find and connect to a filesystem server, NFS requires three separate types of daemons and RPC protocols. In contrast, using my directory of servers, I use one protocol and two different types of servers for that same protocol. The server that is used for the directory of servers is not special purpose in any sense; rather, it is a general-purpose server used for a specific application. You can eliminate two special RPC protocols and two special daemons using the directory of servers approach.
Private namespaces are an essential component of any large-scale distributed or cluster computing environment. I have developed a first implementation of private namespaces for UNIX, including a kernel VFS implementation for Linux. The performance of this system is comparable to NFS. Nonprivileged processes can construct their own filesystem namespaces, and these namespaces are not visible to external processes, greatly enhancing security in distributed systems. The private namespace also makes the construction of cluster /proc filesystems quite simple. Also, the memory server can be used to provide a directory of servers, eliminating the need for distinct RPC protocols for locating and mounting remote filesystems.
This research was supported by DARPA Contract #F30602-96-C-0297.