Friday, February 29, 2008

S.R. Kleiman, Vnodes: An Architecture for Multiple File System Types in Sun UNIX, USENIX Summer, 1986

Abstract

An architecture for accommodating multiple file system implementations within the Sun UNIX kernel. The file system implementations can encompass local, remote, or even non-UNIX file systems. These file systems can be "plugged" into the kernel through a well-defined interface.

Motivation

  • Split the kernel into file system implementation independent and file system implementation dependent parts, and provide a well-defined interface between the two.
  • The interface must support (but not require) UNIX file system access semantics. In particular, it must support local disk file systems (like FFS), stateless remote file systems (like NFS), stateful remote file systems (like RFS), and non-UNIX file systems (like the MS-DOS FAT file system).
  • The interface must be usable by the server side of a remote file system to satisfy client requests.
  • All file system operations should be atomic. In other words, the set of interface operations should be at a high enough level that there is no need for locking (hard locking, not user advisory locking) across several operations. Locking, if required, should be left up to the file system implementation dependent layer. For example, if a relatively slow computer running a remote file system required a supercomputer server to lock a file while it did several operations, the users of the supercomputer would be noticeably affected. It is much better to give the file system dependent code full information about what operation is being done and let it decide what locking is necessary and practical.

Goals

  • There should be little or no performance degradation.
  • The file system independent layer should not force static table sizes. Most of the new file system types use a dynamic storage allocator to create and destroy objects.
  • Different file system implementations should not be forced to use centralized resources (e.g. inode table, mount table or buffer cache). However, sharing should be allowed.
  • The interface should be reentrant. In other words, there should be no implicit references to global data (e.g. u.u_base) or any global side effect information passed between operations (e.g. u.u_dent). This has the added benefit of cutting down the size of the per user global data area (u area). In addition, all the interface operations return error codes as the return value. Overloaded return codes and u.u_error should not be used.
  • The changes to the kernel should be implemented by an "object oriented" programming approach. Data structures representing objects contain a pointer to a vector of generic operations on the object. Implementations of the object fill in the vector as appropriate. The complete interface to the object is specified by its data structure and its generic operations. The object data structures also contain a pointer to implementation specific data. This allows implementation specific information to be hidden from the interface. (A small sketch of this style follows the list.)
  • Each interface operation is done on behalf of the current process. It is permissible for any interface operation to block.
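To make the "object oriented" style concrete, here is a minimal C sketch of the ops-vector idea; the prototypes and field set are simplified illustrations, not the exact SunOS declarations.

typedef char *caddr_t;

struct vnode;                          /* the generic file object */
struct ucred;                          /* user credentials */
struct vattr;                          /* file attributes */

struct vnodeops {                      /* vector of generic operations */
        int (*vn_lookup)(struct vnode *dvp, char *nm,
                         struct vnode **vpp, struct ucred *cr);
        int (*vn_getattr)(struct vnode *vp, struct vattr *va,
                          struct ucred *cr);
        /* ... one entry per interface operation ... */
};

struct vnode {
        struct vnodeops *v_op;         /* filled in by each implementation */
        caddr_t          v_data;       /* implementation specific data */
        /* public fields omitted here; see the Vnodes section below */
};

/* The vnode layer only ever calls an object through its vector: */
#define VOP_LOOKUP(dvp, nm, vpp, cr) \
        ((*(dvp)->v_op->vn_lookup)(dvp, nm, vpp, cr))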
Vnode Architecture
     +---------------------+
     |    System Calls     |
     +----------+----------+
                |
                v
     +----------+----------+
     |     Vnode Layer     |
     +---+------+------+---+
         |      |      |
   +-----+      |      +------+
   |            |             |
   v            v             v
+-------+   +--------+   +------+------------+
|PC FAT |   |BSD FFS |   | NFS  | NFS Server |
+---+---+   +---+----+   +--+---+-----+------+
    |           |            |        ^
    v           v            v        |
+-------+   +--------+   +------------+------+
|Floppy |   | Disk   |   |      Network      |
+-------+   +--------+   +-------------------+

Implementation

The file system dependent/independent split was done just above the UNIX kernel inode layer. This was an obvious choice, as the inode was the main object for file manipulation in the kernel. The file system independent inode was renamed vnode (virtual node). All file manipulation is done with a vnode object. Similarly, file systems are manipulated through an object called a vfs (virtual file system). The vfs is the analog to the old mount table entry. The file system independent layer is generally referred to as the vnode layer. The file system implementation dependent layer is referred to by the name of the file system type it implements.
VFS's
Each mounted VFS is linked into a list of mounted file systems. The first file system on the list is always the root. The private data pointer (vfs_data) points to the file system dependent data. The public data in the vfs structure contains data used by the vnode layer or data about the mounted file system that does not change.
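As a sketch, the public part of the vfs structure can be pictured like this; vfs_data and vfs_vnodecovered are named in the text, while the remaining fields are assumptions about what the public data holds.

struct vfs {
        struct vfs      *vfs_next;           /* next vfs in the mounted list */
        struct vfsops   *vfs_op;             /* vector of vfs operations */
        struct vnode    *vfs_vnodecovered;   /* vnode of the mount point (null for root) */
        int              vfs_flag;           /* generic flags, e.g. read only */
        caddr_t          vfs_data;           /* file system dependent data */
};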
Since different file system implementations require different mount data, the mount(2) system call was changed. The arguments to mount(2) now specify the file system type, the directory which is the mount point, generic flags (e.g. read only), and a pointer to file system type specific data. When a mount system call is performed, the vnode for the mount point is looked up and the vfs_mount operation for the file system type is called. If this succeeds, the file system is linked into the list of mounted file systems, and the vfs_vnodecovered field is set to point to the vnode for the mount point. This field is null in the root vfs. The root vfs is always first in the list of mounted file systems.
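Schematically, the mount flow described above might look as follows; new_vfs() and add_to_mounted_list() are made-up helper names, error handling is abbreviated, and the call through vfs_op stands in for the vfs_mount operation.

int
mount(char *fstype, char *dir, int flags, caddr_t datap)
{
        struct vnode *covered;
        struct vfs   *vfsp;
        int           error;

        error = lookuppn(dir, &covered);        /* vnode for the mount point */
        if (error)
                return (error);

        vfsp = new_vfs(fstype);                 /* allocate a vfs, set vfs_op
                                                   for this file system type */
        vfsp->vfs_flag = flags;                 /* generic flags, e.g. read only */
        error = (*vfsp->vfs_op->vfs_mount)(vfsp, dir, datap);
        if (error)
                return (error);

        vfsp->vfs_vnodecovered = covered;       /* remember what we cover */
        covered->v_vfsmountedhere = vfsp;       /* mark the mount point */
        add_to_mounted_list(vfsp);              /* link in after the root vfs */
        return (0);
}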
Once mounted, file systems are named by the path name of their mount points. Special device names are no longer used because remote file systems do not necessarily have a unique local device associated with them. Umount(2) was changed to unmount(2), which takes a path name for a file system mount point instead of a device.
The root vnode for a mounted file system is obtained by the vfs_root operation, as opposed to always referencing the root vnode in the vfs structure. This allows the root vnode to be deallocated if the file system is not being referenced. For example, remote mount points can exist in "embryonic" form, which contains just enough info to actually contact the server and complete the remote mount when the file system is referenced. These mount points can exist with minimal allocated resources when they are not being used.
Vnodes
The public data fields in each vnode either contain data that is manipulated only by the vfs layer or data about the file that does not change over the life of the file, such as the file type (v_type). Each vnode contains a reference count (v_count) which is maintained by the generic vnode macros VN_HOLD and VN_RELE. The vnode layer and file systems call these macros when vnode pointers are copied or destroyed. When the last reference to a vnode is destroyed, the vn_inactive operation is called to tell the vnode's file system that there are no more references. The file system may then destroy the vnode or cache it for later use. The v_vfsp field in the vnode points to the vfs for the file system to which the vnode belongs. If a vnode is a mount point, the v_vfsmountedhere points to the vfs for another file system. The private data pointer (v_data) in the vnode points to data that is dependent on the file system.
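A sketch of the public vnode fields named above, with the reference counting macros shown schematically; the real declarations carry a few more fields (e.g. lock counts), so treat this as an illustration.

struct vnode {
        u_short          v_flag;             /* e.g. VROOT */
        u_short          v_count;            /* reference count */
        struct vfs      *v_vfsmountedhere;   /* vfs mounted here, if a mount point */
        struct vnodeops *v_op;               /* vector of vnode operations */
        struct vfs      *v_vfsp;             /* vfs this vnode belongs to */
        enum vtype       v_type;             /* VREG, VDIR, VBLK, VCHR, VLNK, ... */
        caddr_t          v_data;             /* file system dependent data */
};

#define VN_HOLD(vp)     { (vp)->v_count++; }
#define VN_RELE(vp)     { if (--(vp)->v_count == 0) VOP_INACTIVE(vp); }
                        /* VOP_INACTIVE stands for the vn_inactive operation */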
Vnodes are not locked by the vnode layer. All hard locking (i.e. not user advisory locks) is done within the file system dependent layer. Locking could have been done in the vnode layer for synchronization purposes without violating the design goal; however, it was found to be not necessary.
Path name traversal
Path name traversal is done by the lookuppn routine, which takes a path name in a path name buffer and returns a pointer to the vnode which the path represents. This takes the place of the old namei routine.
If the path name begins with a "/", path name traversal starts at the vnode pointed to by either u.u_rdir or the root. Otherwise it starts at the vnode pointed to by u.u_cdir, the current directory. Lookuppn traverses the path one component at a time using the vn_lookup vnode operation. Vn_lookup takes a directory vnode and a component as arguments and returns a vnode representing that component. If a directory vnode has v_vfsmountedhere set, then it is a mount point. When a mount point is encountered going down the file system tree, lookuppn follows the vnode's v_vfsmountedhere pointer to the mounted file system and calls the vfs_root operation to obtain the root vnode for that file system. Path name traversal then continues from this point. If a root vnode is encountered (VROOT flag in v_flag set) when following "..", lookuppn follows the vfs_vnodecovered pointer in the vnode's associated vfs to obtain the covered vnode. If a symbolic link is encountered, lookuppn calls the vn_readlink vnode operation to obtain its contents. If the symbolic link begins with a "/", the path name traversal is restarted from the root (or u.u_rdir); otherwise the traversal continues from the last directory. The caller of lookuppn specifies whether the last component of the path name is to be followed if it is a symbolic link. The process continues until the path name is exhausted or an error occurs. When lookuppn completes, a vnode representing the desired file is returned.
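The traversal loop can be sketched as follows; reference counting, symbolic link expansion, and most error handling are omitted, and next_component() is an assumed helper that peels off one path name component.

int
lookuppn(char *path, struct vnode **vpp)
{
        struct vnode *dir, *next;
        char          component[256];
        int           error;

        /* Absolute paths start at u.u_rdir (or the root); others at u.u_cdir. */
        dir = (*path == '/') ? (u.u_rdir ? u.u_rdir : rootdir) : u.u_cdir;

        while (next_component(&path, component)) {
                /* ".." at the root of a mounted file system crosses back up. */
                if (strcmp(component, "..") == 0 && (dir->v_flag & VROOT))
                        dir = dir->v_vfsp->vfs_vnodecovered;

                error = VOP_LOOKUP(dir, component, &next, u.u_cred);
                if (error)
                        return (error);

                /* Going down through a mount point: take the mounted
                   file system's root vnode and continue from there. */
                while (next->v_vfsmountedhere != NULL)
                        (void) (*next->v_vfsmountedhere->vfs_op->vfs_root)
                            (next->v_vfsmountedhere, &next);

                dir = next;
        }
        *vpp = dir;
        return (0);
}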
Remote file systems
The path name traversal scheme implies that files on remote file systems appear as files within the normal UNIX file name space. Remote files are not named by any special constructs that current programs don't understand. The path name traversal process handles all indirection through mount points. This means that in a remote file system implementation, the client maintains its own mount points. If the client mounts another file system on a remote directory, the remote file system will not see any ".." references at the root of the remote file system.
New system calls
Three new system calls were added in order to make the normal application interface file system implementation independent. The getdirentries(2) system call was added to read directories in a manner which is independent of the on disk directory format. Getdirentries reads directory entries from an open directory file descriptor into a user buffer, in file system independent format. As many directory entries as can fit in the buffer are read. The file pointer is changed so that it points at directory entry boundaries after each call to getdirentries. The statfs(2) and fstatfs(2) system calls were added to get general file system statistics (e.g. space left). Statfs and fstatfs take a path name or a file descriptor, respectively, for a file within a particular file system, and return a statfs structure.
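A user-level sketch of reading a directory with getdirentries(2); the exact headers and the record layout vary by system, so the dir_record structure here is an illustrative stand-in for the system's file system independent directory entry format.

#include <sys/types.h>
#include <stdio.h>
#include <fcntl.h>

struct dir_record {                    /* illustrative layout only */
        u_long  d_fileno;              /* node number */
        u_short d_reclen;              /* length of this whole record */
        u_short d_namlen;              /* length of the name */
        char    d_name[256];           /* null terminated name */
};

int
list_dir(char *path)
{
        char    buf[4096];
        long    base = 0;
        int     fd, n;
        char   *p;

        if ((fd = open(path, O_RDONLY)) < 0)
                return (-1);

        /* Each call fills buf with as many whole entries as fit. */
        while ((n = getdirentries(fd, buf, sizeof (buf), &base)) > 0) {
                for (p = buf; p < buf + n; ) {
                        struct dir_record *dp = (struct dir_record *)p;
                        printf("%s\n", dp->d_name);
                        p += dp->d_reclen;     /* records are variable length */
                }
        }
        close(fd);
        return (0);
}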
Devices
The device interfaces, bdevsw and cdevsw, are hidden from the vnode layer, so that devices are only manipulated through the vnode interface. A special device file system implementation, which is never mounted, is provided to facilitate this. Thus, file systems which have a notion of associating a name within the file system with a local device may redirect vnodes to the special device file system.
The buffer cache
The buffer cache routines have been modified to act either as a physical buffer cache or a logical buffer cache. A local file system typically uses the buffer cache as a cache of physical disk blocks. Other file system types may use the buffer cache as a cache of logical file blocks. Unique blocks are identified by the pair (vnode-pointer, block-number). The vnode pointer points to a device vnode when a cached block is a copy of a physical device block, or it points to a file vnode when the block is a copy of a logical file block.
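A sketch of block lookup keyed by (vnode pointer, block number); the hash and free-buffer helpers (bhash, get_free_buffer, insert_into_hash) are assumed names, not the actual buffer cache routines.

struct buf *
getblk(struct vnode *vp, daddr_t blkno, int size)
{
        struct buf *bp;

        /* Search the hash chain for an existing copy of the block. */
        for (bp = bhash(vp, blkno); bp != NULL; bp = bp->b_forw)
                if (bp->b_vp == vp && bp->b_blkno == blkno)
                        return (bp);            /* cache hit */

        /* Miss: claim a free buffer and tag it with the new identity. */
        bp = get_free_buffer(size);
        bp->b_vp = vp;                          /* device vnode for physical blocks,
                                                   file vnode for logical blocks */
        bp->b_blkno = blkno;
        insert_into_hash(bp);
        return (bp);
}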

VFS operations

vfsp - Argument is a pointer to the vfs that the operation is being applied to.
  • vfs_mount(vfsp, pathp, datap) - Mount vfsp (i.e. read the superblock etc.). Pathp points to the path name to be mounted (for recording purposes), and datap points to file system dependent data.
  • vfs_unmount(vfsp) - Unmount vfsp (i.e. sync the superblock).
  • vfs_root(vfsp, vpp) - Return the root vnode for this file system. Vpp points to a pointer to a vnode for the results.
  • vfs_statfs(vfsp, sbp) - Return file system information. Sbp points to a statfs structure for the results.
struct statfs {
long f_type;     /* type of info */
long f_bsize;    /* block size */
long f_blocks;   /* total blocks */
long f_bfree;    /* free blocks */
long f_bavail;   /* non-su blocks */
long f_files;    /* total # of nodes */
long f_ffree;    /* free nodes in fs */
fsid_t f_fsid;   /* file system id */
long f_spare[7]; /* spare for later */
};
  • vfs_sync(vfsp) - Write out all cached information for vfsp. Note that this is not necessarily done synchronously; when the operation returns, all data has not necessarily been written out, but the writes have been scheduled.
  • vfs_fid(vfsp, vp, fidpp) - Get a unique file identifier for vp which represents a file within this file system. Fidpp points to a pointer to a fid structure for the results.
struct fid {
u_short fid_len;     /* length of data */
char    fid_data[1]; /* variable size */
};
  • vfs_get(vfsp, vpp, fidp) - Turn the unique file identifier fidp into a vnode representing the file associated with the file identifier. vpp points to a pointer to a vnode for the result. (A sketch of how a server can use vfs_fid and this operation follows the list.)
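Here is a sketch of how the server side of a remote file system could use vfs_fid and vfs_get to build and resolve file handles. The fhandle layout, the getvfs() lookup, and the direct calls through vfs_op are assumptions made for illustration, not the actual NFS server code.

struct fhandle {
        fsid_t  fh_fsid;               /* which mounted file system */
        char    fh_fiddata[32];        /* holds the fid from vfs_fid */
};

int
make_handle(struct vnode *vp, struct fhandle *fhp)
{
        struct vfs    *vfsp = vp->v_vfsp;
        struct fid    *fidp;
        struct statfs  sb;
        int            error;

        error = (*vfsp->vfs_op->vfs_fid)(vfsp, vp, &fidp);
        if (error)
                return (error);
        (void) (*vfsp->vfs_op->vfs_statfs)(vfsp, &sb);
        fhp->fh_fsid = sb.f_fsid;      /* file system id from statfs */
        /* Copy the whole fid (length field plus data) into the handle. */
        bcopy((char *)fidp, fhp->fh_fiddata,
            sizeof (u_short) + fidp->fid_len);
        return (0);
}

int
handle_to_vnode(struct fhandle *fhp, struct vnode **vpp)
{
        struct vfs *vfsp = getvfs(&fhp->fh_fsid);   /* assumed helper that
                                                       finds the mounted vfs */

        return ((*vfsp->vfs_op->vfs_get)(vfsp, vpp,
            (struct fid *)fhp->fh_fiddata));
}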

Vnode operations

vp - Pointer to the vnode to which the operation is being applied;
c - Pointer to a credentials structure which contains the user credentials (e.g. uid) to use for the operation;
nm - Pointer to a character string containing a name;
  • vn_open(vpp, f, c) - Perform any open protocol on a vnode pointed to by vpp (e.g. devices). If the open is a "clone" open the operation may return a new vnode. f is the open flags.
  • vn_close(vp, f, c) - Perform a close protocol on a vnode (e.g. devices). If the vnode is a device, this is called when the last reference to the vnode from the file table is closed; otherwise it is called on the last user close of a file descriptor. f is the open flags.
  • vn_rdwr(vp, uiop, rw, f, c) - Read or write vnode. Reads or writes a number of bytes at a specified offset in the file. Uiop points to a uio structure which supplies the I/O arguments. rw specifies the I/O direction. f is the I/O flags, which may specify that the I/O is to be done synchronously (i.e. don't return until all the volatile data is on disk) and/or in a unit (i.e. lock the file to write a large unit).
  • vn_ioctl(vp, com, d, f, c) - Perform an ioctl on vnode vp. com is the command, d is a pointer to the data, and f is the open flags.
  • vn_select(vp, w, c) - Perform a select on vp. w specifies the I/O direction.
  • vn_getattr(vp, va, c) - Get attributes for vp. va points to a vattr structure.
struct vattr {
enum vtype     va_type;      /* vnode type */
u_short        va_mode;      /* acc mode */
short          va_uid;       /* owner uid */
short          va_gid;       /* owner gid */
long           va_fsid;      /* fs id */
long           va_nodeid;    /* node # */
short          va_nlink;     /* #links */
u_long         va_size;      /* file size */
long           va_blocksize; /* block size */
struct timeval va_atime;     /* last acc */
struct timeval va_mtime;     /* last mod */
struct timeval va_ctime;     /* last chg */
dev_t          va_rdev;      /* dev */
long           va_blocks;    /* space used */
};
This must map file system dependent attributes to UNIX file attributes.
  • vn_setattr(vp, va, c) - Set attributes for vp. va points to a vattr structure, but only the mode, uid, gid, file size, and times may be set. This must map UNIX file attributes to file system dependent attributes.
  • vn_access(vp, m, c) - Check access permissions for vp. Returns error if access is denied. M is the mode to check for access (e.g. read, write, execute). This must map UNIX file protection information to file system dependent protection information.
  • vn_lookup(vp, nm, vpp, c) - Lookup a component name nm in directory vp. Vpp points to a pointer to a vnode for the results.
  • vn_create(vp, nm, va, e, m, vpp, c) - Create a new file nm in directory vp. va points to a vattr structure containing the attributes of the new file. e is the exclusive/non-exclusive create flag. m is the open mode. vpp points to a pointer to a vnode for the results.
  • vn_remove(vp, nm, c) - Remove a file nm in directory vp.
  • vn_link(vp, tdvp, tnm, c) - Link the vnode vp to the target name tnm in the target directory tdvp.
  • vn_rename(vp, nm, tdvp, tnm, c) - Rename the file nm in directory vp to tnm in target directory tdvp. The file can't be lost if the system crashes in the middle of the operation.
  • vn_mkdir(vp, nm, va, vpp, c) - Create directory nm in directory vp. va points to a vattr structure containing the attributes of the new directory and vpp points to a pointer to a vnode for the results.
  • vn_rmdir(vp, nm, c) - Remove the directory nm from directory vp.
  • vn_readdir(vp, uiop, c) - Read entries from directory vp. uiop points to a uio structure which supplies the I/O arguments. The uio offset is set to a file system dependent number which represents the logical offset in the directory when the reading is done. This is necessary because the number of bytes returned by vn_readdir is not necessarily the number of bytes in the equivalent part of the on disk directory.
  • vn_symlink(vp, lnm, va, tnm, c) - Symbolically link the path pointed to by tnm to the name lnm in directory vp.
  • vn_readlink(vp, uiop, c) - Read symbolic link vp. uiop points to a uio structure which supplies the I/O arguments.
  • vn_fsync(vp, c) - Write out all cached information for file vp. The operation is synchronous and does not return until the I/O is complete.
  • vn_inactive(vp, c) - The vp is no longer referenced by the vnode layer. It may now be deallocated.
  • vn_bmap(vp, bn, vpp, bnp) - Map logical block number bn in file vp to a physical block number and physical device. bnp is a pointer to a block number for the physical block and vpp is a pointer to a vnode pointer for the physical device. Note that the returned vnode is not necessarily a physical device. This is used by the paging system to premap files before they are paged. In NFS this is a null mapping.
  • vn_strategy(bp) - Block oriented interface to read or write a logical block from a file into or out of a buffer. bp is a pointer to a buffer header which contains a pointer to the vnode to be operated on. Does not copy through the buffer cache if the file system uses it. This is used by the buffer cache routines and paging system to read blocks into memory.
  • vn_bread(vp, bn, bpp) - Read a logical block bn from a file vp and return a pointer to a buffer header in bpp which contains a pointer to the data. This does not necessarily imply the use of the buffer cache. This operation is useful to avoid extra data copying on the server side of a remote file system.
  • vn_brelse(vp, bp) - Release the buffer returned by vn_bread.
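A sketch of how one file system type plugs in: it supplies a function for each operation and fills in the two vectors, in the order of the operations listed above. The ufs_* names are illustrative stand-ins for a BSD FFS implementation, not the actual SunOS identifiers.

extern int ufs_open(), ufs_close(), ufs_rdwr(), ufs_ioctl(), ufs_select(),
        ufs_getattr(), ufs_setattr(), ufs_access(), ufs_lookup(),
        ufs_create(), ufs_remove(), ufs_link(), ufs_rename(), ufs_mkdir(),
        ufs_rmdir(), ufs_readdir(), ufs_symlink(), ufs_readlink(),
        ufs_fsync(), ufs_inactive(), ufs_bmap(), ufs_strategy(),
        ufs_bread(), ufs_brelse();

struct vnodeops ufs_vnodeops = {
        ufs_open, ufs_close, ufs_rdwr, ufs_ioctl, ufs_select,
        ufs_getattr, ufs_setattr, ufs_access, ufs_lookup, ufs_create,
        ufs_remove, ufs_link, ufs_rename, ufs_mkdir, ufs_rmdir,
        ufs_readdir, ufs_symlink, ufs_readlink, ufs_fsync, ufs_inactive,
        ufs_bmap, ufs_strategy, ufs_bread, ufs_brelse,
};

extern int ufs_mount(), ufs_unmount(), ufs_root(), ufs_statfs(),
        ufs_sync(), ufs_fid(), ufs_vget();

struct vfsops ufs_vfsops = {
        ufs_mount, ufs_unmount, ufs_root, ufs_statfs,
        ufs_sync, ufs_fid, ufs_vget,
};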
Kernel interfaces
A layer over the generic vnode interface allows kernel subsystems to easily manipulate files:
  • vn_open - Perform permission checks and then open a vnode given by a path name.
  • vn_close - Close a vnode.
  • vn_rdwr - Build a uio structure and read or write a vnode.
  • vn_create - Perform permission checks and then create a vnode given by a path name.
  • vn_remove - Remove a node given by a path name.
  • vn_link - Link a node given by a source path name to a target given by a target path name.
  • vn_rename - Rename a node given by a source path name to a target given by a target path name.
  • VN_HOLD - Increment the vnode reference count.
  • VN_RELE - Decrement the vnode reference count and call vn_inactive if this is the last reference.
Many system calls which take path names do a lookuppn to resolve the name to a vnode and then call the appropriate vnode routine to do the operation. System calls which work off file descriptors pull the vnode pointer out of the file table and call the appropriate routine.
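A sketch of a name-based system call built on lookuppn plus vnode operations, using chdir as a simple example; VOP_ACCESS and VN_RELE stand for the vn_access operation and the reference counting macro described above, and error handling is abbreviated.

int
chdir(char *path)
{
        struct vnode *vp;
        int           error;

        error = lookuppn(path, &vp);             /* resolve the name to a vnode */
        if (error)
                return (error);
        if (vp->v_type != VDIR) {                /* must be a directory */
                VN_RELE(vp);
                return (ENOTDIR);
        }
        error = VOP_ACCESS(vp, VEXEC, u.u_cred); /* vn_access: search permission */
        if (error) {
                VN_RELE(vp);
                return (error);
        }
        VN_RELE(u.u_cdir);                       /* drop the old current directory */
        u.u_cdir = vp;                           /* keep the hold from lookuppn */
        return (0);
}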

Contributions

Vnodes have proven to be a clean, well-defined interface to different file system implementations.

Future Work

A standard UNIX file system interface.
Some of the current issues are:
  • Allow multiple component lookup in vn_lookup. This would require file systems that implemented this to know about mount points.
  • Cleaner replacements for vn_bmap, vn_strategy, vn_bread and vn_brelse.
  • Symlink handling in the file system independent layer.
  • Eliminate redundant lookups.