Motivation
- Performance (throughput) problems with older UNIX file systems.
- Flexible allocation policies that can be adapted to a wide range of peripheral and processor characteristics.
Old ("traditional") UNIX File System
Each disk drive is divided into one or more partitions. Each disk partition may contain one file system. A file system never spans multiple partitions.
Super-block
A file system is described by its super-block, which contains the basic parameters of the file system. These include the number of data blocks in the file system, a count of the maximum number of files, and a pointer to the free list, a linked list of all the free blocks in the file system.
inode
Within the file system are files. Certain files are distinguished as directories and contain pointers to files that may themselves be directories. Each file has a descriptor associated with it called an index-node (inode). An inode contains information describing ownership of the file, timestamps marking last modification and access times for the file, and an array of indices that point to the data blocks for the file. The file data blocks are directly referenced by values stored in an inode itself. An inode may contain references to indirect blocks containing further data block indices. In a file system with 512 byte block size, a single indirect block contains 128 further block addresses, a doubly indirect block contains 128 addresses of further singly indirect blocks, and a triply indirect block contains 128 addresses of further doubly indirect blocks.
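As a concrete picture of this layout, here is a minimal C sketch of such an inode. The field names, sizes, and the 10-direct/3-indirect split are illustrative assumptions, not the exact on-disk format.

```c
#include <stdio.h>

/* Simplified sketch of a traditional UNIX inode (illustrative layout only).
 * The first NDADDR entries of i_addr point directly at data blocks; the
 * last three point at single, double, and triple indirect blocks.  With
 * 512-byte blocks and 4-byte addresses, each indirect block holds 128
 * further block addresses. */
#define NDADDR 10   /* assumed number of direct block pointers   */
#define NIADDR 3    /* single, double, triple indirect pointers  */

struct inode_sketch {
    unsigned short i_mode;                  /* file type and permission bits   */
    short          i_nlink;                 /* directory entries referencing it */
    short          i_uid, i_gid;            /* ownership                        */
    long           i_size;                  /* file size in bytes               */
    long           i_addr[NDADDR + NIADDR]; /* direct + indirect block numbers  */
    long           i_atime;                 /* time of last access              */
    long           i_mtime;                 /* time of last modification        */
};

int main(void)
{
    printf("sketch inode occupies %zu bytes in memory\n",
           sizeof(struct inode_sketch));
    return 0;
}
```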
Long seek time
A 150 MB traditional UNIX file system consists of 4 MB of inodes followed by 146 MB of data blocks. This organisation segregates the inode information from the data; thus accessing a file normally incurs a long seek from the file's inode to its data. Files in a single directory are not typically allocated consecutive slots in the 4 MB of inodes, causing many non-consecutive blocks of inodes to be accessed when executing operations on the inodes of several files in a directory.
Small blocks!
The 512-byte disk block size limits the maximum transfer per disk transaction to 512 bytes, and the next sequential data block is often not on the same cylinder, forcing seeks between 512-byte transfers. The combination of small block size, limited read-ahead in the system, and many seeks severely limits file system bandwidth: the old file system delivered only about 2% (~20 KB/s per arm) of the maximum disk bandwidth.
Some preliminary work to improve reliability and performance:
- Reliability was improved by staging modifications to critical file system information so that they could either be completed or repaired cleanly by a program after a crash.
- Performance was improved by a factor of more than two by changing the basic block size from 512 bytes to 1024 bytes. The improvement was because of two factors: each disk transfer accessed twice as much data, and most files could be described without need to access indirect blocks since the direct blocks contained twice as much data.
The unordered free-list
Although the throughput had doubled, the file system was still delivering only 4% of the disk bandwidth. The main problem was that even though the free-list was initially ordered for optimal access, it quickly became scrambled as files were created and removed. Eventually the free-list became entirely random, causing files to have their data blocks allocated randomly over the disk. This forced a seek before every block access. This randomization of data block placement caused performance to degrade over a period of use; the only mechanism to restore performance was to dump, rebuild, and restore the file system. Another possibility would be to have a process that periodically reorganized the data on the disk to restore locality.
Advantages
Simple and elegant file system interface
New Berkeley Fast File System
Each disk drive contains one or more file systems. A file system is described by its super-block, located at the beginning of the file system's disk partition. The super-block contains critical data and is replicated to protect against catastrophic loss. This is done when the file system is created; since the super-block data does not change, the copies need not be referenced unless a head crash or other hard disk error causes the default super-block to be unusable.
To insure that it is possible to create files as large as 2^32 bytes with only two levels of indirection, the minimum size of a file system block is 4096 bytes. The size of file system blocks can be any power of two greater than or equal to 4096. The block size is recorded in the file system's super-block, so it is possible for file systems with different block sizes to be simultaneously accessible on the same system. The block size must be decided at the time the file system is created; it cannot be subsequently changed without rebuilding the file system.
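A quick back-of-the-envelope check of the 2^32-byte claim, assuming 4-byte block addresses and 12 direct block pointers (both assumptions for illustration):

```c
#include <stdio.h>

int main(void)
{
    unsigned long long bsize   = 4096;         /* minimum FFS block size         */
    unsigned long long nindir  = bsize / 4;    /* 4-byte addresses per indirect  */
    unsigned long long ndirect = 12;           /* assumed number of direct ptrs  */

    /* direct + single indirect + double indirect blocks, times block size */
    unsigned long long max_bytes =
        (ndirect + nindir + nindir * nindir) * bsize;

    /* 12*4K + 1024*4K + 1024*1024*4K ~= 4.3 GB, i.e. just over 2^32 bytes */
    printf("max file size with two levels of indirection: %llu bytes\n",
           max_bytes);
    return 0;
}
```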
A disk partition is divided into one or more areas called cylinder groups. A cylinder group comprises one or more consecutive cylinders on a disk. Associated with each cylinder group is some bookkeeping information that includes a redundant copy of the super-block, space for inodes, a bit map describing available blocks in the cylinder group, and summary information describing the usage of data blocks within the cylinder group. The bit map of available blocks in the cylinder group replaces the traditional file system's free-list. For each cylinder group a static number of inodes is allocated at file system creation time. The default policy is to create one inode for each 2048 bytes of space in the cylinder group, expecting this to be far more than will ever be needed.
The cylinder group bookkeeping information begins at a varying offset from the beginning of the cylinder group. The offset for each successive cylinder group is calculated to be about one track further from the beginning of the cylinder group than the preceding cylinder group. In this way the redundant information spirals down into the pack so that any single track, cylinder, or platter can be lost without losing all copies of the super-block. Except for the first cylinder group, the space between the beginning of the cylinder group and the beginning of the cylinder group information is used for data blocks.
Optimizing Storage Utilization
Data is laid out so that larger blocks can be transferred in a single disk transaction, greatly increasing file system throughput. In large files, several 4096 byte blocks may be allocated from the same cylinder so that large data transfers are possible before requiring a seek.
The main problem with larger blocks is that most UNIX file systems are composed of many small files; a uniformly large block size wastes space. To be able to use large blocks without undue waste, small files must be stored in a more efficient way. The new file system accomplishes this goal by allowing the division of a single file system block into one or more fragments. The file system fragment size is specified at the time the file system is created; each file system block can optionally be broken into 2, 4, or 8 fragments, each of which is addressable. The lower bound on the size of these fragments is constrained by the disk sector size, typically 512 bytes. The block map associated with each cylinder group records the space available in a cylinder group at the fragment level; to determine if a block is available, aligned fragments are examined.
Space is allocated to a file when a program does a write system call. Each time data is written to a file, the file system checks to see if the size of the file has increased. If the file needs to be expanded to hold the new data, one of three conditions exists (a simplified sketch of this decision follows the list):
- There is enough space left in an already allocated block or fragment to hold the new data. The new data is written into the available space.
- The file contains no fragmented blocks and the last block in the file contains insufficient space to hold the new data. If space exists in a block already allocated, the space is filled with new data. If the remainder of the new data contains more than a full block of data, a full block is allocated and the first full block of new data is written there. The process is repeated until less than a full block of new data remains. If the remaining new data will fit in less than a full block, a block with the necessary fragments is located, otherwise a full block is located. The remaining new data is written into the located space.
- The file contains one or more fragments and the fragments contain insufficient space to hold the new data. If the size of the new data plus the size of the data already in the fragments exceeds the size of a full block, a new block is allocated. The contents of the fragments are copied to the beginning of the block and the remainder of the block is filled with new data. The process then continues as in 2. above. Otherwise, if the new data to be written will fit in less than a full block, a block with the necessary fragments is located, otherwise a full block is located. The contents of the existing fragments appended with the new data are written into the allocated space.
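A simplified, runnable model of the decision above; the sizes, the plan_append() helper, and its accounting are illustrative assumptions rather than the actual kernel routines.

```c
#include <stdio.h>

/* Given the current file size and the bytes being appended, report what
 * must be allocated, following the three cases described in the text.
 * Block and fragment sizes are assumed, not read from a real super-block. */
static void plan_append(long oldsize, long nbytes, long bsize, long fsize)
{
    long newsize  = oldsize + nbytes;
    long old_tail = oldsize % bsize;                   /* bytes held in fragments */
    long old_frag_space = ((old_tail + fsize - 1) / fsize) * fsize;

    if (nbytes <= old_frag_space - old_tail) {         /* case 1: fits as-is */
        printf("%ld bytes fit in the already-allocated space\n", nbytes);
        return;
    }

    int  promote    = old_tail > 0 && newsize / bsize > oldsize / bsize; /* case 3 */
    long new_blocks = newsize / bsize - oldsize / bsize;                 /* case 2 */
    long new_tail   = newsize % bsize;
    long new_frags  = (new_tail + fsize - 1) / fsize;  /* tail stored in fragments */

    printf("copy old fragments into a new block: %s; allocate %ld full block(s) "
           "and %ld tail fragment(s)\n",
           promote ? "yes" : "no", new_blocks, new_frags);
}

int main(void)
{
    long bsize = 4096, fsize = 1024;        /* assumed block and fragment sizes */
    plan_append(100,  200, bsize, fsize);   /* case 1                           */
    plan_append(3000, 6000, bsize, fsize);  /* case 3, then case 2              */
    return 0;
}
```

For example, appending 200 bytes to a 100-byte file fits in the already-allocated fragment (case 1), while appending 6000 bytes to a 3000-byte file promotes the existing fragments into a new block and then allocates further blocks and a tail fragment (cases 3 and 2).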
The problem with expanding a file one fragment at a time is that data may be copied many times as a fragmented block expands to a full block. Fragment reallocation can be minimized if the user program writes a full block at a time, except for a partial block at the end of the file. The file system interface has been extended to provide applications the optimal size for a read or write. For files, the optimal size is the block size of the file system on which the file is being accessed. For other objects, such as pipes and sockets, the optimal size is the underlying buffer size.
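On 4.2BSD-derived and modern POSIX systems this hint surfaces as the st_blksize field of struct stat; a minimal usage sketch:

```c
#include <stdio.h>
#include <sys/stat.h>

/* Print the optimal I/O size for a file, the block-size hint referred to
 * in the text, as returned by stat(). */
int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2 || stat(argv[1], &st) == -1) {
        perror("stat");
        return 1;
    }
    printf("optimal read/write size for %s: %ld bytes\n",
           argv[1], (long)st.st_blksize);
    return 0;
}
```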
In order for the layout policies to be effective, a file system cannot be kept completely full. For each file system there is a parameter, termed the free space reserve, that gives the minimum acceptable percentage of file system blocks that should be free.
File System Parameterization
A goal of the new file system is to parameterize the processor capabilities and mass storage characteristics so that blocks can be allocated in an optimum configuration-dependent way. Parameters used include the speed of the processor, the hardware support for mass storage transfers, and the characteristics of the mass storage devices. Disk technology is constantly improving, and a given installation can have several different disk technologies running on a single processor. Each file system is parameterized so that it can be adapted to the characteristics of the disk on which it is placed.
For mass storage devices such as disks, the new file system tries to allocate new blocks on the same cylinder as the previous block in the same file. Optimally, these new blocks will also be rotationally well positioned. The distance between "rotationally optimal" blocks varies greatly; it can be a consecutive block or a rotationally delayed block depending on system characteristics. On a processor with an I/O channel that does not require any processor intervention between mass storage transfer requests, two consecutive disk blocks can often be accessed without suffering lost time because of an intervening disk revolution. For processors without I/O channels, the main processor must field an interrupt and prepare for a new disk transfer. The expected time to service this interrupt and schedule a new disk transfer depends on the speed of the processor.
The physical characteristics of each disk include the number of blocks per track and the rate at which the disk spins. The allocation routines use this information to calculate the number of msecs required to skip over a block. The characteristics of the processor include the time to service an interrupt and schedule a new disk transfer request. Given a block allocated to a file, the allocation routines calculate the number of blocks to skip over so that the next block in the file will come into position under the disk head in the expected amount of time that it takes to start a new disk transfer operation. For programs that sequentially access large amounts of data, this strategy minimizes the amount of time spent waiting for the disk to position itself.
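A rough, stand-alone version of that calculation with made-up disk and processor parameters (the real values live in the super-block):

```c
#include <stdio.h>

/* How many blocks must be skipped so the next block of a file arrives
 * under the head just as the system is ready to issue the next transfer.
 * All numbers below are illustrative. */
int main(void)
{
    double rpm              = 3600.0;  /* disk rotation speed                 */
    double blocks_per_track = 8.0;     /* e.g. 8 x 4096-byte blocks per track */
    double service_ms       = 4.0;     /* interrupt service + transfer setup  */

    double ms_per_rev   = 60000.0 / rpm;                 /* ~16.7 ms per rev  */
    double ms_per_block = ms_per_rev / blocks_per_track; /* ~2.1 ms per block */
    int    skip         = (int)(service_ms / ms_per_block + 0.999); /* ceil   */

    printf("skip %d block(s) between rotationally optimal blocks\n", skip);
    return 0;
}
```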
To ease the calculation of finding rotationally optimal blocks, the cylinder group summary information includes a count of the available blocks in a cylinder group at different rotational positions.
The parameter that defines the minimum number of msecs between the completion of a data transfer and the initiation of another data transfer on the same cylinder (rotational layout delay) can be changed at any time, even if the file system is mounted and active.
Layout Policies
The file system layout policies are divided into two distinct parts. At the top level are global policies that use file system wide summary information to make decisions regarding the placement of new inodes and data blocks. These routines are responsible for deciding the placement of new directories and files. They also calculate rotationally optimal block layouts, and decide when to force a long seek to a new cylinder group because there are insufficient blocks left in the current cylinder group to do reasonable layouts. Below the global policy routines are local allocation routines that use a locally optimal scheme to lay out data blocks.
Two methods to improve file system performance are
- to increase the locality of reference to minimize seek latency, and,
- to improve the layout of data to make larger transfers possible.
The global policies try to balance the two conflicting goals of localizing data that is concurrently accessed while spreading out unrelated data.
Cluster sequentially accessed data
One allocatable resource is inodes. Inodes are used to describe both files and directories. Inodes of files in the same directory are frequently accessed together. The layout policy tries to place all the inodes of files in a directory in the same cylinder group. To ensure that files are distributed throughout the disk, a different policy is used for directory allocation. A new directory is placed in a cylinder group that has a greater than average number of free inodes, and the smallest number of directories already in it. The intent of this policy is to allow the inode clustering policy to succeed most of the time. The allocation of inodes within a cylinder group is done using a next free strategy. Although this allocates the inodes randomly within a cylinder group, all the inodes for a particular cylinder group can be read with 8 to 16 disk transfers. (At most 16 disk transfers are required because a cylinder group may have no more than 2048 inodes.) This puts a small and constant upper bound on the number of disk transfers required to access the inodes for all the files in a directory. In contrast, the old file system typically requires one disk transfer to fetch the inode for each file in a directory.
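A minimal sketch of the directory-placement rule described above; the cgsum structure and the sample numbers are invented stand-ins for the cylinder group summary information.

```c
#include <stdio.h>

/* Place a new directory in a cylinder group with an above-average number
 * of free inodes and the fewest existing directories. */
struct cgsum { int free_inodes; int ndirs; };

static int pick_dir_cg(const struct cgsum *cg, int ncg)
{
    long total_free = 0;
    for (int i = 0; i < ncg; i++)
        total_free += cg[i].free_inodes;
    long avg_free = total_free / ncg;

    int best = -1;
    for (int i = 0; i < ncg; i++) {
        if (cg[i].free_inodes < avg_free)
            continue;                          /* skip below-average groups  */
        if (best == -1 || cg[i].ndirs < cg[best].ndirs)
            best = i;                          /* prefer fewest directories  */
    }
    return best;
}

int main(void)
{
    struct cgsum cg[] = { {500, 12}, {900, 3}, {850, 7}, {200, 1} };
    printf("new directory goes in cylinder group %d\n",
           pick_dir_cg(cg, 4));   /* group 1: above-average free, fewest dirs */
    return 0;
}
```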
The other major resource is data blocks. Since data blocks for a file are typically accessed together, the policy routines try to place all data blocks for a file in the same cylinder group, preferably at rotationally optimal positions in the same cylinder. The problem with allocating all the data blocks in the same cylinder group is that large files will quickly use up the available space in the cylinder group, forcing a spill over to other areas. Further, using all the space in a cylinder group causes future allocations for any file in the cylinder group to also spill to other areas. Ideally none of the cylinder groups should ever become completely full. The heuristic solution chosen is to redirect block allocation to a different cylinder group when a file exceeds 48 KB, and at every megabyte thereafter. The newly chosen cylinder group is selected from those cylinder groups that have a greater than average number of free blocks left. Although large files tend to be spread out over the disk, a megabyte of data is typically accessible before a long seek must be performed, and the cost of one long seek per megabyte is small.
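A toy illustration of the 48 KB / 1 MB redirection heuristic; the exact boundary arithmetic here is an assumption, the point is only that a growing file periodically forces a move to a new cylinder group.

```c
#include <stdio.h>

/* Decide whether the block being added at this file offset should be
 * redirected to a new cylinder group (illustrative boundaries only). */
static int needs_new_cg(long offset)
{
    if (offset == 48L * 1024)
        return 1;                                 /* first spill at 48 KB   */
    if (offset > 48L * 1024 && offset % (1024L * 1024) == 0)
        return 1;                                 /* then once per megabyte */
    return 0;
}

int main(void)
{
    long probes[] = { 4096, 48L*1024, 512L*1024, 1024L*1024, 2048L*1024 };
    for (int i = 0; i < 5; i++)
        printf("offset %7ld KB: %s\n", probes[i] / 1024,
               needs_new_cg(probes[i]) ? "redirect to a new cylinder group"
                                       : "stay in the current cylinder group");
    return 0;
}
```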
The global policy routines call local allocation routines with requests for specific blocks. The local allocation routines will always allocate the requested block if it is free; otherwise they allocate a free block of the requested size that is rotationally closest to the requested block. If the global layout policies had complete information, they could always request unused blocks and the allocation policies would be reduced to simple bookkeeping. However, maintaining complete information is costly; thus the implementation of the global layout uses heuristics that employ only partial information.
If the requested block is not available, the local allocator uses a four-level allocation strategy (a toy model of the fallback order is sketched below):
- Use the next available block rotationally closest to the requested block on the same cylinder. It is assumed here that head switching time is zero. On disk controllers where this is not the case, it may be possible to incorporate the time required to switch between disk platters when constructing the rotational layout tables.
- If there are no blocks available on the same cylinder, use a block within the same cylinder group.
- If that cylinder group is entirely full, quadratically hash the cylinder group number to choose another cylinder group to look for a free block.
- Finally if the hash fails, apply an exhaustive search to all cylinder groups.
Quadratic hash is used because of its speed in finding unused slots in nearly full hash tables. File systems that are parameterized to maintain at least 10% free space rarely use this strategy.
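A toy model of that fallback order (levels 2 through 4), with quadratic probing over a made-up array of cylinder groups; rotational placement within a cylinder (level 1) is omitted.

```c
#include <stdio.h>
#include <string.h>

#define NCG 16   /* number of cylinder groups in this toy example */

/* Return a cylinder group with a free block, starting from the preferred
 * group; cg_has_free[] is invented state, nonzero meaning a free block. */
static int find_cg(const int cg_has_free[NCG], int preferred)
{
    if (cg_has_free[preferred])                   /* same cylinder group     */
        return preferred;

    for (int i = 1; i < NCG; i++) {               /* quadratic hash probing  */
        int cg = (preferred + i * i) % NCG;
        if (cg_has_free[cg])
            return cg;
    }

    for (int cg = 0; cg < NCG; cg++)              /* exhaustive search       */
        if (cg_has_free[cg])
            return cg;

    return -1;                                    /* file system is full     */
}

int main(void)
{
    int cg_has_free[NCG];
    memset(cg_has_free, 0, sizeof cg_has_free);
    cg_has_free[11] = 1;                          /* only group 11 has space */

    printf("allocate from cylinder group %d\n", find_cg(cg_has_free, 2));
    return 0;
}
```

Because quadratic probing modulo the group count may not visit every group, the exhaustive scan remains as the final safety net, mirroring the strategy described above.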
An upper bound on the transfer rate from a disk is calculated by multiplying the number of bytes on a track by the number of revolutions of the disk per second.
Performance
File access rates up to 10x faster than the traditional UNIX file system.
Both reads and writes are faster in the new file system. The biggest factor in this speedup is because of the larger block size used by the new file system. The overhead of allocating blocks in the new system is greater, however fewer blocks need to be allocated because they are bigger. The net effect is that the cost per byte allocated is about the same for both.
In the new file system, the reading rate is always at least as fast as the writing rate. This is to be expected since the kernel must do more work when allocating blocks than when simply reading them; with smaller block sizes the kernel has to do twice as many disk allocations per second, and the processor is unable to keep up with the disk transfer rate.
The old file system is about 50% faster at writing files than reading them. This is because the write system call is asynchronous and the kernel can generate disk transfer requests much faster than they can be serviced, hence disk transfers queue up in the disk buffer cache. Because the disk buffer cache is sorted by minimum seek distance, the average seek between the scheduled disk writes is much less than it would be if the data blocks were written out in the random disk order in which they are generated. However, when the file is read, the read system call is processed synchronously, so the disk blocks must be retrieved from the disk in the non-optimal seek order in which they are requested. This forces the disk scheduler to do long seeks, resulting in a lower throughput rate.
In the new file system the blocks of a file are more optimally ordered on the disk. Even though reads are still synchronous, the requests are presented to the disk in a much better order. Even though the writes are still asynchronous, they are already presented to the disk in a minimum seek order so there is no gain to be had by reordering them. Hence disk seek latencies that limited the old file system have little effect in the new file system. The cost of allocation is the factor in the new system that causes writes to be slower than reads.
Copy bottleneck
The performance of the new file system is limited by the memory-to-memory copy operations required to move data from buffers in the system's address space to buffers in the user's address space. These copy operations account for about 40% of the time spent performing an I/O operation.
Chain buffers
Greater disk throughput could be achieved by rewriting the disk drivers to chain together kernel buffers. This would allow contiguous disk blocks to be read in a single disk transaction. The inability to use contiguous disk blocks effectively limits the performance on these disks to less than 50% of the available bandwidth.
Preallocate blocks
Currently only one block is allocated to a file at a time. One technique to improve performance would be to preallocate several blocks at once if we find that a file is growing rapidly, releasing them when the file is closed if they remain unused. By batching up allocations, the system can reduce the overhead of allocating at each write, and it can cut down on the number of disk writes needed to keep the block pointers on the disk synchronized with the block allocation.
File System Functional Enhancements
Several generally desired changes were introduced along with the performance improvements.
Long file names
File names can now be of nearly arbitrary length. Only programs that read directories are affected by this change. To ensure portability across systems, a set of directory access routines has been introduced to provide a consistent interface to directories on both old and new systems.
File locking
The old file system had no provision for locking files. Processes that needed to synchronize the updates of a file had to use a separate "lock" file. A process would try to create a "lock" file. If the creation succeeded, then the process could proceed with its update; if the creation failed, then the process would wait and try again. This mechanism had three drawbacks:
- Processes consumed CPU time by looping over attempts to create locks,
- Locks left lying around because of system crashes had to be manually removed (normally in a system startup command script),
- Processes running as system administrator are always permitted to create files, so they were forced to use a different mechanism.
A mechanism to serialize access to a file with locks has been added.
Locking schemes fall into two classes, those using hard locks and those using advisory locks. The primary difference between the two is the extent of enforcement. A hard lock is always enforced when a program tries to access a file; an advisory lock is only applied when it is requested by a program. The advisory locks are only effective when all programs accessing a file use the locking scheme. With hard locks there must be some override policy implemented in the kernel. With advisory locks the policy is left to the user programs.
The file locking facilities allow cooperating programs to apply advisory shared or exclusive locks on files.
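In 4.2BSD this facility is the flock() system call; a minimal sketch of cooperating use (the file name is arbitrary):

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

/* Cooperating programs take a shared or exclusive advisory lock on the
 * file before touching it, and release it when done. */
int main(void)
{
    int fd = open("spool.db", O_RDWR | O_CREAT, 0644);  /* arbitrary file */
    if (fd == -1) { perror("open"); return 1; }

    if (flock(fd, LOCK_EX) == -1) {     /* block until exclusive lock held */
        perror("flock");
        return 1;
    }
    /* ... update the file; other cooperating processes wait here ... */
    flock(fd, LOCK_UN);                 /* release the lock                */
    close(fd);
    return 0;
}
```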
Symbolic links
The traditional UNIX file system allows multiple directory entries in the same file system to reference a single file. Each directory entry "links" a file's name to an inode and its contents. The link concept is fundamental; inodes do not reside in directories, but exist separately and are referenced by links. When all the links to an inode are removed, the inode is deallocated. This style of referencing an inode does not allow references across physical file systems, nor does it support inter-machine linkage. To avoid these limitations symbolic links have been added.
A symbolic link is implemented as a file that contains a pathname. When the system encounters a symbolic link while interpreting a component of a pathname, the contents of the symbolic link are prepended to the rest of the pathname, and this name is interpreted to yield the resulting pathname.
Certain system utilities must be able to detect and manipulate symbolic links. Three new system calls provide the ability to detect, read, and write symbolic links. In the future it will be possible to create symbolic links that span machines by referencing file systems located on remote machines using pathnames.
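These calls surface as symlink(), lstat(), and readlink(); a small self-contained example (the link name and target are arbitrary):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

/* symlink() creates a symbolic link, lstat() detects one without following
 * it, and readlink() returns the pathname stored in it. */
int main(void)
{
    char target[256];
    struct stat st;

    if (symlink("/usr/include/stdio.h", "stdio-link") == -1) {  /* create */
        perror("symlink");
        return 1;
    }
    if (lstat("stdio-link", &st) == 0 && S_ISLNK(st.st_mode)) { /* detect */
        ssize_t n = readlink("stdio-link", target, sizeof target - 1);
        if (n >= 0) {
            target[n] = '\0';            /* readlink does not NUL-terminate */
            printf("stdio-link -> %s\n", target);
        }
    }
    unlink("stdio-link");                /* remove the link itself          */
    return 0;
}
```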
Rename
Programs that create a new version of an existing file typically create the new version as a temporary file and then rename the temporary file with the name of the target file. In the old UNIX file system renaming required three calls to the system. If the program were interrupted or the system crashed between these calls, the target file could be left with only its temporary name. To eliminate this possibility the rename system call has been added. The rename call does the rename operation in a fashion that guarantees the existence of the target name.
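The idiom the paragraph describes, using the rename() call (file names are arbitrary):

```c
#include <stdio.h>

/* Write the new version to a temporary file, then atomically rename it over
 * the target: the target name always refers to either the old contents or
 * the complete new contents, never to nothing. */
int main(void)
{
    FILE *fp = fopen("config.tmp", "w");
    if (fp == NULL) { perror("fopen"); return 1; }
    fputs("new contents\n", fp);
    fclose(fp);

    if (rename("config.tmp", "config") == -1) {   /* atomic replacement */
        perror("rename");
        return 1;
    }
    return 0;
}
```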
Rename works on data files and directories. When renaming directories, the system must do special validation checks to insure that the directory tree structure is not corrupted by the creation of loops or inaccessible directories. Such corruption would occur if a parent directory were moved into one of its descendants. The validation check requires tracing the descendants of the target directory to insure that it does not include the directory being moved.
Quotas
The UNIX file system has traditionally attempted to share all available resources to the greatest extent possible. Thus any single user can allocate all the available space in the file system. In certain environments this is unacceptable. Consequently, a quota mechanism has been added for restricting the amount of file system resources that a user may allocate.
- Separate quota for each user on each file system
- Resources are given both a hard and a soft limit.