Assumptions
CPU speeds have increased dramatically in recent times while disk access times have only improved slowly. This trend is likely to continue in the future and will cause more and more applications to be disk I/O bound.
Modern filesystems cache recently-used file data in main memory, and ever-increasing memory sizes will make caches more and more effective at satisfying read requests. Most write requests, however, must eventually be reflected on disk for safety. As a result, disk I/O (and disk performance) will become more and more dominated by writes.
The second impact of large file caches is that they can serve as write buffers where large numbers of modified blocks can be collected before writing any of them to disk. Buffering may make it possible to write the blocks more efficiently, for example by writing them all in a single sequential transfer with one seek. Of course write-buffering has the disadvantage of increasing the amount of data lost during a crash; for applications that require better crash recovery, NVRAM may be used for the write buffer.
Focus on the efficiency of small-file accesses. Office and engineering workloads have mean file sizes of only a few kilobytes, resulting in small random disk I/Os; creation and deletion times for such files are often dominated by updates to the filesystem "metadata". Supercomputing workloads, by contrast, are dominated by sequential accesses to large files, and their I/O performance tends to be limited by the bandwidth of the I/O and memory subsystems rather than by file allocation policies.
Background
Conventional filesystems tend to spread information around the disk in a way that causes too many small accesses. Different files are physically separated, the attributes ("inode") of a file are separate from the file's contents, and so is the directory entry containing the file's name. It takes at least 5 separate disk I/Os, each preceded by a seek, to create a new file: two accesses to the file's attributes plus one access each for the file's data, the directory's data, and the directory's attributes. When writing small files in such systems, less than 5% of the disk's potential bandwidth is used for new data; the rest of the time is spent seeking.
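A quick back-of-the-envelope check of that figure, using illustrative disk parameters that are assumptions for this note rather than numbers quoted in the paper:

    # Rough arithmetic behind the "<5% of bandwidth for new data" claim.
    # Disk parameters below are illustrative assumptions, not paper figures.
    seek_plus_rotation_s = 0.015   # ~15 ms average positioning time per I/O
    transfer_rate_bps = 1.5e6      # ~1.5 MB/s sustained transfer rate
    file_size_bytes = 2 * 1024     # a "small" file of a couple of kilobytes
    ios_per_create = 5             # inode (x2), file data, dir data, dir inode

    positioning_time = ios_per_create * seek_plus_rotation_s
    transfer_time = file_size_bytes / transfer_rate_bps    # moving the new data itself

    utilization = transfer_time / (positioning_time + transfer_time)
    print(f"fraction of disk time spent on new data: {utilization:.1%}")   # ~1.8%

With numbers in this range, well under 5% of the disk's time goes to transferring new data; the rest is spent positioning the head.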
Conventional filesystems tend to lay out files on disk with great care for spatial locality and make in-place changes to data structures on disk in order to perform well on magnetic disks, which tend to seek relatively slowly.
Conventional filesystems tend to write synchronously: the application must wait for the write to complete, rather than continuing while the write is handled in the background. For example, even though FFS writes file data blocks asynchronously, filesystem metadata structures such as directories and inodes are written synchronously. For workloads with many small files, disk traffic is dominated by these synchronous metadata writes. Synchronous writes couple the application's performance to that of the disk and make it hard for the application to benefit from faster CPUs; they also defeat the potential use of the file cache as a write buffer. Unfortunately, network file systems (like NFS) have introduced additional synchronous behavior where it didn't previously exist. This has simplified crash recovery but reduced write performance.
Goals
Developed a new disk storage management technique, the log-structured filesystem, which uses disks an order of magnitude more efficiently than conventional filesystems.
Design
The fundamental idea of a log-structured filesystem is to improve write performance by buffering a sequence of filesystem changes in the file cache and then writing all the changes to disk sequentially in a single disk write operation. The information written to disk in a write operation includes file data blocks, attributes, index blocks, directories, and almost all other information used to manage the filesystem. For workloads that contain many small files, a log-structured filesystem converts the many small synchronous random writes of conventional filesystems into large asynchronous sequential transfers that can utilize nearly 100% of the raw disk bandwidth.
A log-structured filesystem treats the disk as a circular log and writes all new information in a sequential structure to the head of the log. This approach increases write performance dramatically by eliminating almost all seeks.
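A minimal sketch of that write path, with an in-memory bytearray standing in for the disk; the block size, segment size, and `block_map` index are illustrative choices for this note, not Sprite LFS data structures:

    BLOCK_SIZE = 4096
    SEGMENT_SIZE = 512 * 1024              # illustrative segment size

    class LogFS:
        """Toy model of the LFS write path: modified blocks are buffered
        and then written to the head of the log in one sequential transfer."""

        def __init__(self, disk_size):
            self.disk = bytearray(disk_size)   # stand-in for the raw disk
            self.log_head = 0                  # next free byte of the log
            self.buffer = []                   # pending (inode_no, block_no, data)
            self.block_map = {}                # (inode_no, block_no) -> log address

        def write_block(self, inode_no, block_no, data):
            """Buffer a modified block in the file cache; no disk I/O yet."""
            self.buffer.append((inode_no, block_no, data))
            if len(self.buffer) * BLOCK_SIZE >= SEGMENT_SIZE:
                self.flush()

        def flush(self):
            """Write every buffered block in a single sequential transfer."""
            if not self.buffer:
                return
            payload = bytearray()
            for inode_no, block_no, data in self.buffer:
                self.block_map[(inode_no, block_no)] = self.log_head + len(payload)
                payload += data.ljust(BLOCK_SIZE, b"\0")
            end = self.log_head + len(payload)
            self.disk[self.log_head:end] = payload   # one seek, one big write
            self.log_head = end                      # the real log is circular; the
            self.buffer.clear()                      # cleaner keeps free segments ahead

Overwriting a block simply appends a fresh copy at the log head and updates the map; the stale copies left behind are exactly the garbage that the segment cleaner (discussed below) reclaims.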
The sequential nature of the log also permits much faster crash recovery: conventional filesystems typically must scan the entire disk to restore consistency after a crash, but a log-structured filesystem need only examine the most recent portion of the log.
A log-structured filesystem has the side effect of creating multiple, chronologically advancing versions of both file data and metadata. In this sense, a log-structured filesystem is a journaling filesystem in which the entire filesystem is the journal.
Two key issues for an efficient log-structured filesystem:
1. How to retrieve information from the log? (A sketch of the read path follows this list.)
2. How to manage free space on the disk so that there are always large extents of free space available for writing new data. This is the most difficult challenge in the design of a log-structured filesystem. The solution presented is based on large extents called "segments", where a "segment cleaner" process continually regenerates empty segments by compressing the live data from heavily fragmented segments.
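For issue 1, the paper's answer is an inode map: a table, itself written to the log and located through a fixed checkpoint region, that records the latest log address of every inode. A sketch of the resulting read path, where the `disk` interface and field names are assumed stand-ins:

    BLOCK_SIZE = 4096

    def read_block(disk, inode_map, inode_no, block_no):
        """Read path in an LFS: one extra indirection through the inode map.

        `disk` is assumed to expose read_inode(addr) and read(addr, n);
        `inode_map` maps inode numbers to each inode's latest address in the
        log and is itself stored in the log, found via the checkpoint region.
        """
        inode_addr = inode_map[inode_no]          # indirection unique to LFS
        inode = disk.read_inode(inode_addr)       # inode looks just like an FFS inode
        data_addr = inode.block_addrs[block_no]   # per-file index, as in FFS
        return disk.read(data_addr, BLOCK_SIZE)

Once the inode is found, reads proceed exactly as in a conventional filesystem, and the inode map is compact enough that its active portions stay cached in memory, so the extra indirection rarely costs a disk access.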
Explored various cleaning policies and discovered a simple but effective algorithm based on cost and benefit: it segregates older, more slowly changing data from young, rapidly changing data and treats them differently during cleaning.
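Concretely, the cost-benefit policy cleans the segments with the highest ratio of benefit to cost, roughly (1 - u) * age / (1 + u), where u is the fraction of the segment still live and age is the age of its youngest data; reading the segment costs 1 and rewriting its live data costs u. A small sketch of that selection, with the `Segment` record standing in for the paper's segment usage table:

    from dataclasses import dataclass

    @dataclass
    class Segment:
        id: int
        utilization: float   # fraction of live bytes in the segment, 0.0 .. 1.0
        age: float           # age of the youngest live block in the segment

    def cleaning_priority(seg: Segment) -> float:
        """Higher is better to clean: lots of free space and old, stable data."""
        return (1.0 - seg.utilization) * seg.age / (1.0 + seg.utilization)

    def pick_segments_to_clean(segments, count):
        """Choose the `count` most profitable segments to clean next."""
        return sorted(segments, key=cleaning_priority, reverse=True)[:count]

The age term is what segregates data: a cold, half-empty segment is worth cleaning now because its live data will stay live, while a hot, half-empty segment is better left alone since more of it will die on its own.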
Advantages
1. May allow access to old versions of files, a feature sometimes called time-travel or snapshotting.
2. Recover quickly after crashes because no consistency checks are needed: the filesystem simply rolls forward from the last consistent point in the log, and the locations of the last disk operations are easy to determine because they are at the end of the log. Recovery is two-pronged: "checkpoints" define positions in the log at which all the filesystem structures are complete and consistent, and "roll-forward" recovers as much as possible of the information written since the last checkpoint (a sketch follows this list).
3. Tend to have good write performance and much more compact arrangement of files on disk.
4. Sprite LFS was not much more complicated to implement than a conventional filesystem. The additional complexity of the segment cleaner is compensated by the elimination of the bitmap and layout policies required by conventional filesystems, and the checkpointing and roll-forward code is no more complicated than the fsck code that scans the disk to restore consistency.
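A sketch of the two-pronged recovery from point 2 above; `disk.read_checkpoint`, `disk.read_segment_summary`, and `FileSystemState` are hypothetical stand-ins used only to show the shape of the procedure, not Sprite LFS code:

    def recover(disk):
        """Crash recovery sketch: checkpoint, then roll-forward."""
        # 1. Checkpoint: LFS keeps two checkpoint regions and alternates
        #    between them, so the newer valid one survives a crash that
        #    happens while the other is being written.
        candidates = [disk.read_checkpoint(0), disk.read_checkpoint(1)]
        cp = max((c for c in candidates if c.valid), key=lambda c: c.timestamp)
        state = FileSystemState.from_checkpoint(cp)   # inode map, segment usage table

        # 2. Roll-forward: scan segment summaries written after the checkpoint
        #    and re-apply the complete writes recorded there.
        addr = cp.log_head                            # log position at checkpoint time
        while True:
            summary = disk.read_segment_summary(addr)
            if summary is None or summary.timestamp <= cp.timestamp:
                break                                 # reached pre-checkpoint data
            state.apply(summary)                      # recover inodes / directory entries
            addr = summary.next_segment_addr

        return state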
Disadvantages
1. Cleaning overhead: the segment cleaner consumes disk bandwidth (and CPU time) that would otherwise be available for reading and writing new data.
Metrics
1. Transfer bandwidth: the percentage of the disk's raw bandwidth used for reading and writing new data, as opposed to cleaning and time spent seeking.
2. Access time:
3. Performance on small and large files
Notes
A log-structured filesystem produces a different form of locality on disk than conventional filesystems. A conventional filesystem achieves "logical locality" by assuming certain access patterns (sequential reading of files, a tendency to use multiple files within a directory, etc.); it then pays extra on writes, if necessary, to organize information optimally on disk for the assumed read patterns. In contrast, a log-structured filesystem achieves "temporal locality": information that is created or modified at the same time is grouped closely on disk. If temporal locality matches logical locality, as it does for a file that is written sequentially and then read sequentially, a log-structured filesystem should have the same performance on large files as a conventional filesystem. If temporal locality differs from logical locality, the two systems will perform differently.