Sunday, March 16, 2008

L. W. McVoy, S. R. Kleiman, Extent-like Performance from a UNIX File System, USENIX Winter, 1991

Abstract

Increasing throughput demands were being made on the SunOS file system, both by applications and by higher-performance hardware. The principal constraints were that the on-disk file system format remain the same and that whatever changes were necessary not be user-visible. The solution arrived at was to approximate the behavior of extent-based file systems by grouping I/O operations into clusters instead of dealing in individual blocks. A single clustered I/O may take the place of 15-30 block I/Os, resulting in a factor-of-two increase in sequential performance. The changes described were restricted to a small portion of the file system code; no user-visible changes were necessary and the on-disk format was not altered. The extent size is variable and maintained by the file system. The authors argue that the enhanced UFS will provide better average throughput than extent-based file systems.

Motivation

The SunOS UNIX File System (UFS) was derived from the Berkeley Fast File System (FFS). FFS solves many performance problems found in the original UNIX file system. UFS was modified to fit into the VFS architecture but other than that, it has been tracking the FFS closely. UFS has served well for several years.
Both applications and disk subsystems are demanding higher and higher transfer rates through the file system. A measurement of the existing UFS showed that about half of a 12-MIPS CPU was consumed to obtain half the bandwidth of a 1.5 MB/sec disk. The existing implementation of UFS does not scale to the desired I/O rates.
UFS performance problems
  • Computational overhead: There is little that can be done to reduce the computational overhead itself. The computational cost can, however, be amortized by moving more data for each traversal of the file system code. This idea was a basic motivation for the FFS changes to the original UNIX file system.
  • Placement policy: While UFS has many tuning parameters, including ones that affect the placement policy, it is almost always tuned in the same way. The file system is responsible for placing the logical blocks on the disk in a pattern that is optimal for sequential access. Each block is separated by a gap called the rotational delay, or rotdelay, by the file system code. rotdelay is specified in milliseconds, and the minimum non-zero value is the rotational delay of one block time. The number of blocks placed contiguously between each rotational delay, known as maxcontig, is typically set to 1.
  • Rotational delay: The rotational delay gives the file system enough time to deliver the current block to the requesting process, for the process to compute on the new data and then issue a request for the next block, and for the file system to check that the requested block is already in memory (due to read ahead) and generate the disk I/O for the next read-ahead block. If the file system is properly tuned, the I/O request reaches the disk just as the appropriate block is moving under the head. If there were no rotational delay, the next block would already have started passing under the disk head by the time the disk saw the request, and the disk would have to wait almost a full rotation before starting that request.
This explains why the rotational delay is necessary, but it comes at a cost: those gaps reduce the maximum transfer rate to half the raw disk rate. To solve this performance problem, the rotational delays must be eliminated and the computational overhead of the file system must be reduced.
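As a concrete illustration of that cost (this is not code from the paper), the sketch below computes the effective sequential transfer rate when gap blocks are left between consecutive file blocks; with the typical maxcontig of 1 and a one-block gap, the rate is exactly half the disk rate. The 1.5 MB/sec figure comes from the measurement above; the 7-block cluster in the second call is a hypothetical value.

#include <stdio.h>

/*
 * Illustrative only: of every (data + gap) block slots that pass under
 * the head, only `data` of them carry file data, so the effective
 * sequential rate is scaled accordingly.
 */
static double
effective_rate(double disk_rate, int data_blocks, int gap_blocks)
{
    return disk_rate * data_blocks / (data_blocks + gap_blocks);
}

int
main(void)
{
    /* 1.5 MB/sec disk, one data block followed by one rotational-delay gap */
    printf("%.2f MB/sec\n", effective_rate(1.5, 1, 1));   /* prints 0.75  */
    /* a hypothetical 7-block cluster followed by a single gap block */
    printf("%.2f MB/sec\n", effective_rate(1.5, 7, 1));   /* prints ~1.31 */
    return 0;
}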

Goals

  • UFS should be able to run disks at their full bandwidth while using less CPU.
  • All users of the file system should benefit from the enhancements.
  • The on-disk format of the file system could not change, so that no application would need to be aware of the enhancements.

Background

Virtual file system interfaces
Each file system type implements two object classes: vfs and vnode. A VFS object represents a particular instance of a file system. A vnode object represents a particular file within a VFS. These objects export interface routines that the rest of the kernel uses to manipulate a file system without knowing the details of how it is implemented.
The interfaces used by the read, write and mmap system calls are:
  1. rdwr - read/write. This copies the appropriate file data to or from a buffer supplied by the caller. Many file systems implement this by mapping a portion of the file into the kernel address space and then copying to or from the user's buffer.
  2. getpage - read a page. Returns a page filled with data from the vnode at the file offset specified by the caller. The file system may use a page cache supplied by the VM system to store active page data. The entries in this cache are named by the vnode and file offset of the data in the page.
  3. putpage - write a page. Returns a page to secondary storage.
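A rough sketch of what such an interface vector might look like in C is below. The structure layout and names are simplified stand-ins for the real SunOS declarations; they are included only to show how the kernel can call a file system through function pointers without knowing its implementation.

#include <sys/types.h>          /* off_t */

struct uio;                     /* description of a user I/O request (stand-in) */
struct page;                    /* physical page descriptor (stand-in)          */
struct vnode;

struct vnodeops {
    /* rdwr: copy file data to or from the caller's buffer */
    int (*vop_rdwr)(struct vnode *vp, struct uio *uiop, int rw);
    /* getpage: return a page filled with data at the given file offset */
    int (*vop_getpage)(struct vnode *vp, off_t off, struct page **pp);
    /* putpage: write a page of file data back to secondary storage */
    int (*vop_putpage)(struct vnode *vp, off_t off, struct page *pp);
};

struct vnode {
    struct vnodeops *v_op;      /* operations vector for this file system type */
    void            *v_data;    /* file-system-private data, e.g. a UFS inode  */
};

/* the rest of the kernel calls through the vector, for example: */
#define VOP_GETPAGE(vp, off, pp) ((vp)->v_op->vop_getpage((vp), (off), (pp)))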
SunOS virtual memory subsystem
The VM subsystem works in concert with the file systems to manage a cache of vnode pages. There is no distinction between process pages and I/O pages. Pages are brought into the system for different reasons but they are all labeled in the same way. This unified naming scheme allows all of memory to be used for any purpose, based on demand. All of memory may be an I/O cache if the system is acting primarily as an I/O server, or all of memory may be used up for a single large active process. Older UNIX variants confined I/O pages to a small "buffer cache".
UFS Details
  • inode - UFS represents each active file with an inode. An inode is an in-memory version of the control information associated with a file; the inode is initialized when the file is first read from disk from an on-disk structure called the dinode. The inode contains information such as file size, the location of the first few data blocks on disk, date created, etc. Each inode is directly associated with a vnode. Inodes also contain meta information that the file system uses to help tune performance.
  • logical blocks - UFS breaks up each file into logical blocks. A logical block is the main unit of allocation in UFS. Logical block numbers, or lbns, are numbered from zero and denote a particular block of a particular file. Logical blocks are used for two reasons: to decouple the file system block size from the disk block (or sector), and to decouple the location of a block in a file from the location of the block on the disk.
  • ufs_rdwr - Performs a read by breaking the request into block sized pieces, mapping each file block in turn to an unused portion of the kernel's address space, copying the data to the requesting process, and unmapping the block.
If the page representing the block is not already in memory with an active MMU translation, the copy will fault. The kernel handles the fault by calling ufs_getpage to find the page. After the page is retrieved, the MMU translation to the page is set up, the fault returns, and ufs_rdwr finishes the copy unaware that the fault ever occurred.
  • ufs_getpage - First checks whether the page is already in the page cache and returns the page if it is. Otherwise, it converts the vnode and offset into the equivalent inode and logical block number and calls bmap, which is responsible for mapping logical blocks of an inode to physical blocks on the disk as well as for allocating physical blocks on disk (a simplified sketch of this translation appears after this list). bmap uses the block pointers in the inode to perform the translation, unless the file is large, in which case the inode contains a pointer to a disk block of pointers, or indirect block. For large files, bmap must fetch the indirect block to perform the translation. The physical block number returned by bmap is used to start the I/O.
Heuristics for optimizing read performance: In the absence of other information, ufs_getpage uses the pattern of logical block requests it sees to predict the file access pattern in the near future. If the pattern of requests is such that the current request is one page greater than the last request, it is assumed that the file is being accessed sequentially. If sequential access is detected, the next access is predicted to be the page following the requested page and ufs_getpage will read ahead, i.e., will start the I/O for the page following the one requested. The file system uses an inode field, nextr, to predict the location of the next read. When the inode is initialized, nextr is set to zero, predicting that the first read will be the first block of the file. Starting read ahead at the beginning of the file turns out to be a beneficial heuristic. In outline, ufs_getpage does the following:
bmap() to find disk location
if (requested page not in cache) {
    start I/O for requested page
}
if (sequential I/O) {
    do another bmap() if necessary
    start I/O for next page
}
if (first page was not in cache) {
    wait for I/O to finish
}
predict next I/O location
  • ufs_putpage - When the kernel wishes to free some pages that contain modified data, it calls the appropriate file system's putpage routine which simply writes out the page data to the correct location on secondary storage.
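The block-pointer translation that bmap performs can be pictured with the following simplified sketch. It assumes a hypothetical inode layout with twelve direct pointers and a single indirect block, and omits double/triple indirection, fragments, and allocation, all of which the real UFS bmap handles; read_indirect_block() is an assumed helper, not a real kernel routine.

#define NDADDR  12                 /* direct block pointers in the inode  */
#define NINDIR  2048               /* pointers held by one indirect block */

typedef long daddr_t;              /* disk block number (stand-in)        */

struct sketch_inode {
    daddr_t i_db[NDADDR];          /* direct blocks                       */
    daddr_t i_ib;                  /* single indirect block, 0 if none    */
};

/* hypothetical helper that reads one indirect block of pointers */
extern daddr_t *read_indirect_block(daddr_t blkno);

/* map a logical block number to a physical block number (0 = hole) */
static daddr_t
sketch_bmap(struct sketch_inode *ip, long lbn)
{
    if (lbn < NDADDR)
        return ip->i_db[lbn];                 /* small file: direct pointer  */
    lbn -= NDADDR;
    if (ip->i_ib != 0 && lbn < NINDIR) {
        daddr_t *indir = read_indirect_block(ip->i_ib);   /* extra disk read */
        return indir != 0 ? indir[lbn] : 0;
    }
    return 0;                                 /* beyond this sketch's range  */
}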

Possible Improvements to UFS

Raw disk
Get rid of the file system altogether by using the raw disk. Some users, mostly those running database applications, actually do this.
+ Zero file system overhead: the raw disk is a direct interface plus a few permission checks.
- There is no file system, no file abstraction, no read ahead, no caching, none of the features that are expected of a file system.
File system tuning
Tune the file system to take advantage of track buffers. A track buffer is a memory cache the size of one track, commonly found on newer disks such as SCSI disks that have on-board controllers. When a read request for a block is sent to the disk, the entire track is read into the buffer. If successive blocks are on the same track, they are serviced immediately from the track buffer, so there is no need for a rotational delay between successive file blocks. UFS can be tuned to attempt to place successive blocks contiguously on the disk by setting rotdelay to zero. This increases read performance substantially, since an entire track's worth of file data can be read in one rotation.
+ With no rotational delays a track would contain twice as much relevant data and the effective disk bandwidth would be twice as great.
- Not all drives have track buffers. Drives without track buffers would suffer substantial performance penalties on both reads and writes.
- Write performance suffers horribly when the file system has no rotational delay. The track buffer acts as a write-through cache: each write goes through the track buffer to the disk. Since the writes go directly to the disk, we need the rotational delay between blocks or each write will wait a full rotation before beginning.
Driver clustering
First, tune UFS to allocate sequential logical file blocks contiguously by setting rotdelay to zero. Then have the disk driver combine (cluster) any contiguous requests in its queue into one large request (a sketch of such coalescing follows the list of drawbacks below).
- File system code must be traversed for each block. This is excessively expensive in CPU cycles.
- This helps only writes. There can be many related writes in the disk queue at once, since writes are asynchronous in nature. Reads, on the other hand, are synchronous, so there can be at most two, the primary block and the read ahead block, in the queue at once.
- There are drivers that depend on intelligent controllers to do the ordering of requests.
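The coalescing step might look roughly like the following sketch, which merges physically adjacent requests in a queue already sorted by starting block. The request structure and field names are hypothetical, and a real driver would also have to chain the buffers and completion callbacks of the merged requests.

#include <stddef.h>                /* NULL */

struct io_req {
    struct io_req *next;           /* next request in the sorted queue */
    long           blkno;          /* starting disk block              */
    long           nblocks;        /* length of the request in blocks  */
};

/* merge adjacent requests so one large transfer replaces several small ones */
static void
cluster_queue(struct io_req *head)
{
    struct io_req *req = head;

    while (req != NULL && req->next != NULL) {
        struct io_req *n = req->next;

        if (req->blkno + req->nblocks == n->blkno) {
            req->nblocks += n->nblocks;    /* contiguous: fold n into req */
            req->next = n->next;           /* (buffer chaining omitted)   */
        } else {
            req = n;                       /* gap on disk: move on        */
        }
    }
}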
Extent-based file system
Replace UFS with a new file system type, an extent-based file system. This is a popular answer to file system performance issues. The basic idea is to allocate file data in large, physically contiguous chunks, called extents. Most I/O is done in units of an extent. This improves performance in both I/O rate and CPU utilization, since the I/O is done contiguously and file system CPU overhead is amortized over larger I/Os. Typically, the user can control the size of these extents on a per-file basis. In most cases the on-disk file system represents the mapping of logical file blocks to physical blocks as a tuple of <logical block, physical block, length>. In addition, the on-disk inode is usually expanded to maintain the user's requested extent size(s).
- It is unlikely that the user will be able to choose the "right" extent size. The size will also vary between machine configurations, between file systems on the same machine, or even between different locations on the same file system (variable geometry drive).
- Writing portable code that knows about extents is close to impossible.
- Users rarely want to manage extents. If the file system performed satisfactorily, the user would never consider telling the file system what to do.
- A change in the on-disk file system format would require changes to many system utilities, such as dump, restore, and fsck.
File system clustering
Modify UFS to combine blocks adjacent to the requested blocks into a larger I/O request. This produces most, if not all, of the advantages of an extent-based file system without requiring changes to the on-disk format of UFS.

Clustering Implementation in UFS

Two basic changes:
  1. Tuned the file system to allocate files contiguously, and,
  2. Changed the file system to transfer sequential I/O in units of clusters.
A cluster is simply a group of contiguous blocks that is read or written as a single I/O.
This approach solves both problems:
  1. The rotational delays are removed, which potentially allows a single file to be read or written at the disk speed,
  2. Clusters are used in place of blocks which causes the file system and driver code to be traversed far less frequently.
Allocator details
This work depends heavily on contiguous allocation and hence it is important to have confidence in the allocator's ability to allocate contiguously. The UFS allocator has always been able to allocate files contiguously but it may not be able to do so if the disk is fragmented.
Most extent-based file systems use preallocation to ensure maximum transfer rates. Experience showed that preallocation was largely unnecessary, because the allocator keeps a percentage of the disk (usually 10%) free at all times.
Sizing clusters
maxcontig indicates the desired cluster size; the actual cluster may be smaller. The bmap routine was modified to return, along with the physical block number, the length of the contiguous run of blocks starting there. The length returned by bmap is at most maxcontig blocks and is used as the effective cluster size by the caller (ufs_getpage or ufs_putpage).
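A hypothetical interface for this cluster-sizing step is sketched below. Given a per-block translation (here an assumed block_map() helper rather than the real bmap), it counts how many logical blocks starting at lbn map to contiguous physical blocks, capped at maxcontig, and that count becomes the effective cluster size.

typedef long daddr_t;                      /* disk block number (stand-in)        */

extern int maxcontig;                      /* tunable: max cluster size in blocks */

/* hypothetical per-block translation: returns 0 for a hole */
extern daddr_t block_map(void *ip, long lbn);

/* returns the effective cluster size starting at lbn; fills *pbnp */
static int
cluster_length(void *ip, long lbn, daddr_t *pbnp)
{
    daddr_t pbn = block_map(ip, lbn);
    int     len = 1;

    *pbnp = pbn;
    if (pbn == 0)
        return 0;                          /* hole: no cluster to build        */
    while (len < maxcontig && block_map(ip, lbn + len) == pbn + len)
        len++;                             /* extend while physically adjacent */
    return len;
}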
Read clustering implementation
The read ahead implementation now does read ahead per cluster instead of per page: a synchronous read of the first cluster and an asynchronous read of the second cluster. The nextrio inode field, which is used to remember where to start the next read ahead, is updated to the current location plus the size of the current cluster. The cluster sizes returned by bmap may vary from call to call, and this value is used by the read ahead code (an old file system with a rotational delay between each block will simply have a cluster size of one block).
Write clustering implementation
Writes are handled by assuming sequential I/O and pretending that the I/O completed immediately (in other words, doing nothing). If the sequentiality assumption is found to be wrong at the next call, we write the previous page out and then start over with the current page. If the assumption is correct, we keep stalling until a cluster is built up and then write out the whole cluster. The implementation relies on the page cache to hold the dirty pages that ufs_putpage pretended to flush.
If we detect random writes, we write out the old pages between delayoff and delayoff + delaylen before restarting the algorithm with the current page. We do not know if the file is allocated contiguously until we try to write out the cluster.
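The delayed-write logic can be summarized in the following sketch. delayoff and delaylen are the inode fields named above, while the cluster and page sizes, the flush_range() helper, and the structure itself are hypothetical stand-ins for the real ufs_putpage code.

#define PAGE_BYTES     (8 * 1024)          /* illustrative page size    */
#define CLUSTER_BYTES  (56 * 1024)         /* illustrative cluster size */

struct wr_inode {
    long delayoff;          /* file offset where the pending cluster starts   */
    long delaylen;          /* bytes accumulated so far (0 = nothing pending) */
};

/* hypothetical helper: actually issue the write for [off, off + len) */
extern void flush_range(struct wr_inode *ip, long off, long len);

static void
sketch_putpage_delayed(struct wr_inode *ip, long off)
{
    if (ip->delaylen != 0 && off != ip->delayoff + ip->delaylen) {
        /* sequentiality assumption was wrong: write out the old pages */
        flush_range(ip, ip->delayoff, ip->delaylen);
        ip->delaylen = 0;
    }
    if (ip->delaylen == 0)
        ip->delayoff = off;                /* start accumulating a new cluster */
    ip->delaylen += PAGE_BYTES;            /* "pretend" this page was written  */

    if (ip->delaylen >= CLUSTER_BYTES) {
        /* a full cluster has built up: write it out as one large I/O */
        flush_range(ip, ip->delayoff, ip->delaylen);
        ip->delaylen = 0;
    }
}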

Contributions

Page thrashing
Considering large sequential I/O, we can see that the pages just brought in are recently touched and as such will not be candidates for page replacement. This has the side effect of using all of memory as a buffer cache for I/O pages. For limited I/O this is generally a good policy, but for large (greater than memory size) I/O it is a poor policy, since it will replace all the potentially useful pages with I/O pages that are unlikely to be reused. The VM system implements an LRU page replacement algorithm, but for large I/O it should behave more like MRU.
Sometimes the best thing to do is to use and reuse a small number of pages, say the current cluster's worth. However, if we used MRU for every file, we would effectively turn off caching, which is as bad as the original problem of destroying the cache.
We need a mechanism that would allow large I/O to go through the system with little impact but still leave in place the caching effects for smaller files.
Write limits or fairness
A single process can lock down all of memory by writing a large file (remember that write I/O is asynchronous; the kernel copies it and allows the user process to continue). In old UNIX systems, the buffer cache imposed a natural limit on the amount of memory that could be consumed for I/O. In the SunOS VM, where all of memory is used as a cache, there is nothing to prevent a single process from dirtying every page.
Basic fairness problem: the asynchronous nature of writes may be used to the advantage of one process, but it may be at the expense of other processes in the system.
The solution is to limit the amount of data that can be in the write queue on a per file basis. We do this by adding what is essentially a counting semaphore in the inode. Each process decrements the semaphore when writing and increments it when the write is complete. If the semaphore falls below zero, the writing process is put to sleep until one of the other writes completes.
The initial value of the semaphore has to be chosen carefully. If it is too large we return to the old problem; if it is too small, we will degrade both sequential and random performance.
The sequential problem is exposed when we consider the I/O path as a pipeline. We need to feed the pipe at a fast enough rate that we never have any bubbles. If there is a bubble the drive will have rotated past the desired block by the time the request makes it out to the drive. The pipeline problem can be solved by allowing two or three outstanding writes, but this is still not good enough.
The random problem: Consider a process that seeks to the beginning of the disk, writes a block, seeks to the end, writes a block, back to the beginning, writes a block, and so on until N blocks have been written. If we allow the disk queue to be infinitely large, then disksort will get a chance to sort the requests such that the system will seek to the beginning, write N/2 blocks, seek to the end, and write N/2 blocks. The effective I/O rate will be much higher in the case without a write limit than the case with a write limit of one. For this reason, we allow a fairly large amount of I/O per file in the disk queue.
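A minimal sketch of the per-file throttle, assuming hypothetical sleep_on()/wakeup() primitives in place of the real kernel synchronization, is shown below. The initial count is the value whose sizing the preceding paragraphs discuss; 16 here is purely illustrative.

#define INIT_ASYNC_WRITES  16              /* illustrative initial semaphore value */

struct throttle_inode {
    int i_wrsem;    /* counting semaphore, set to INIT_ASYNC_WRITES at inode setup */
};

/* hypothetical kernel primitives */
extern void sleep_on(void *chan);
extern void wakeup(void *chan);

/* called before queueing an asynchronous write for this file */
static void
write_throttle_enter(struct throttle_inode *ip)
{
    while (ip->i_wrsem <= 0)
        sleep_on(ip);                      /* too much queued: block the writer */
    ip->i_wrsem--;                         /* claim one slot in the write queue */
}

/* called from I/O completion when one of the file's writes finishes */
static void
write_throttle_exit(struct throttle_inode *ip)
{
    ip->i_wrsem++;
    wakeup(ip);                            /* let a blocked writer continue     */
}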

Future Work

Random clustering
Clustering is currently enabled only when sequential access is detected in the ufs_getpage routine. Certain access patterns, such as random reads, will not receive the full benefits of clustering. If the request is a read of a large amount of data, it is possible that the request size could be passed down to the ufs_getpage routine, which could use the request size as a hint to turn on clustering for what is apparently random access.
Bmap cache
The translation from logical location to physical location is done frequently and gets more expensive for large files because of indirect blocks. A small cache in the inode could reduce the cost of bmap substantially.
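One way such a cache could work is sketched below: keep the most recent translation as a (logical start, physical start, length) triple in the inode and satisfy any lookup that falls inside it without touching indirect blocks. All names here are hypothetical.

typedef long daddr_t;                      /* disk block number (stand-in)           */

struct bmap_cache {
    long    bc_lbn;                        /* first cached logical block             */
    daddr_t bc_pbn;                        /* corresponding physical block           */
    int     bc_len;                        /* length of the run in blocks, 0 = empty */
};

/* returns 1 and fills *pbnp on a hit, 0 on a miss (fall back to the full bmap) */
static int
bmap_cache_lookup(struct bmap_cache *bc, long lbn, daddr_t *pbnp)
{
    if (bc->bc_len > 0 && lbn >= bc->bc_lbn && lbn < bc->bc_lbn + bc->bc_len) {
        *pbnp = bc->bc_pbn + (lbn - bc->bc_lbn);
        return 1;
    }
    return 0;
}

/* record the result of a full bmap so nearby lookups hit the cache */
static void
bmap_cache_store(struct bmap_cache *bc, long lbn, daddr_t pbn, int len)
{
    bc->bc_lbn = lbn;
    bc->bc_pbn = pbn;
    bc->bc_len = len;
}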
UFS_HOLE
Since UFS allows files to have holes, it is possible for bmap to return a hole. If we look back at the ufs_getpage algorithm, we see that bmap is called even when the requested page is in memory. The reason for this call is that ufs_getpage needs to know whether the requested page has backing store (i.e., is not a page of zeros from a hole in a UFS file). If the page has no backing store, then ufs_getpage changes the page protection to read-only. A read-only page will fault when written, giving UFS the chance to allocate a block to back the page. If the system did not enforce these rules, a write might appear to succeed only for the system to find later that there is no more space in the file system.
If UFS did not allow holes in files, we could bypass the bmap call in all cases where the page is in memory. One possible solution is to remember whether the file has holes and do the bmap only if the page is not in memory or if the file has holes.
Data in the inode
Many files are small, less than 2KB. Caching small files in the system causes fragmentation, since the cache is made up of pages, which are typically larger than the average file. We would like the caching effect without the fragmentation effect. This could be achieved by increasing the size of the in-memory inode and caching small files in the extra space. This is already done for symbolic links if the link is small enough. Inodes are already cached in the system separately from pages, which means that the system could satisfy many requests directly from the inode instead of the page cache. This would not work for mmap(), since the data would not be page aligned.
Extents vs blocks
UFS maintains a physical block number for each logical block number. Given that UFS now allocates mostly contiguous files, there is a potential for substantial space savings by storing extent tuples of <physical block, length> instead of a long list of physical blocks. Unfortunately, this would mean an on-disk format change, which is not acceptable for UFS. However, if this idea were coupled with the inode cache, large files could use the extra space as a bmap cache. To maximize the benefit of the space, the cache could be a cache of extent tuples.
B_ORDER
We would like to improve performance of UFS for the average user, not just the users who want high sequential I/O rates. One approach is to discard UFS in favor of a log-based file system. However, there are improvements that can be made to UFS today, and the installed base of UFS disks makes them worth considering.
A long-standing problem with UFS is that it does many operations, such as directory updates, synchronously in order to maintain file system consistency on the disk. The file system uses synchronous writes to ensure an absolute ordering when necessary. If there were a way to ensure the order of critical writes, the file system would be able to do many operations asynchronously. The performance of commands like rm * would improve substantially.
The B_ORDER flag would be passed down to the various disk drivers. Requests in the disk queue with the B_ORDER flag may not be reordered by the driver, by disksort or by the controller.