Advanced Topics in Computer Systems

10/10/01

Anthony Joseph & Joe Hellerstein

 

UNIX Fast File System

Log-Structured File System

A Fast File System for UNIX

·        Original UNIX file system was simple and elegant, but slow.

·        Could only achieve about 20 KB/sec/arm; ~2% of 1982 disk bandwidth

 

·        Problems:

o       Blocks too small. Why?

§         VAX page size 512 bytes

§         512 increased to 1024 (but achieved only about 4% of disk BW)

§         Small size limited read-ahead → many seeks

o       Consecutive blocks of files not close together. Why?

§         Free list became randomized (175 KB/sec → 30 KB/sec)

o       i-nodes far from data

o       i-nodes of directory not close together

 

·        Aspects of new file system:

o       4096 or 8192 byte block size (why not larger?)

§         Larger blocks would waste space, because most UNIX files are small

o       Large blocks and small fragments

o       Disk divided into cylinder groups

o       Each contains a copy of the superblock, i-nodes, a bitmap of free blocks, and usage summary info (see the sketch after this list)

o       Keeps i-node near file, i-nodes of a directory together

o       Cylinder groups ~ 16 cylinders, or 7.5 MB

o       Cylinder group headers are placed at varying (rotated) offsets so they don’t all land on one platter, track, or cylinder

o       Account for rotational delay in numbering sectors

o       But, preserve existing filesystem abstraction
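
A rough sketch of the per-cylinder-group bookkeeping described above. The field names and sizes are illustrative assumptions, not the actual 4.2BSD "struct cg":

/* Illustrative per-cylinder-group summary (names/sizes assumed, not the real 4.2BSD "struct cg"). */
#include <stdint.h>

#define CG_NBLOCKS 1920              /* ~7.5 MB of 4 KB blocks per group (assumed)   */
#define CG_NINODES 1024              /* i-nodes allocated per group (assumed)        */

struct cg_summary {
    uint32_t cg_magic;               /* marks a valid cylinder group header          */
    uint32_t cg_ncyl;                /* cylinders in this group (~16)                */
    uint32_t cg_nfree_blocks;        /* usage summary: free data blocks              */
    uint32_t cg_nfree_inodes;        /* usage summary: free i-nodes                  */
    uint8_t  cg_blockmap[CG_NBLOCKS / 8];  /* bitmap of free blocks/fragments        */
    uint8_t  cg_inodemap[CG_NINODES / 8];  /* bitmap of free i-nodes                 */
};
/* Each group also holds a copy of the superblock, stored at a rotated offset so
   that no single platter, track, or cylinder contains every copy. */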

 

·        Old FS view of disk: just blocks

·        New FS view of disk: a detailed model of the hardware (cylinders, tracks, rotational position)

·        Today: the disk once again appears as just blocks (the drive electronics hide the geometry)!


 

·        Two techniques for locality:

o       Don’t let disk fill up in any one area (10% reserve)

o       Paradox: to achieve file block locality, must spread unrelated things far apart

o       Note: a freshly created (old-style) file system got 175 KB/sec because its free list still held sequential blocks (it did generate locality); an aged file system has a randomly ordered free list and got only 30 KB/sec

 

·        Specific application of these techniques:

o       Goal: keep a directory within a cylinder group, spread out different directories → fewer seeks

o       Goal: allocate runs of blocks within a cylinder group, every once in a while switch to a new cylinder group → higher throughput

§         Jump at 48 KB (the point at which a file with 4 KB blocks needs its first singly indirect block), then at every 1 MB thereafter (about 25% of the blocks in a cylinder group).

o       Layout policy: global and local

o       Global policy allocates files & directories to cylinder groups. Picks “optimal” next block for block allocation.

o       Local allocation routines handle specific block requests, selecting from a sequence of alternatives if the preferred block is taken (see the sketch below).
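
A hedged sketch of the global placement decision above: keep a file's early blocks in its i-node's cylinder group, then jump at 48 KB and again after every additional megabyte. The constants and function names are invented for illustration.

#define KB           1024L
#define MB           (1024L * KB)
#define JUMP_POINT   (48 * KB)       /* 12 direct blocks x 4 KB: first indirect block needed */

/* Global policy: which cylinder group should hold the block at file_offset?
   inode_cg is the group holding the file's i-node; ncg is the number of groups. */
long pick_cylinder_group(long file_offset, long inode_cg, long ncg)
{
    if (file_offset < JUMP_POINT)
        return inode_cg;                      /* keep early blocks near the i-node           */

    /* Past 48 KB, move to a new group, and again after every additional megabyte,
       so no single large file monopolizes one cylinder group's free space.      */
    long chunk = 1 + (file_offset - JUMP_POINT) / MB;
    return (inode_cg + chunk) % ncg;          /* spread successive 1 MB chunks across groups */
}

The local routines then try, in order, a rotationally well-placed free block within the chosen group, any free block in that group, and finally a block in another group.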

 

·        Results:

o       20-50% of disk bandwidth for large reads/writes.

o       10-15x original UNIX speeds.

o       Size: 3800 lines of code vs. 2700 in old system.

o       10% of total disk space is held in reserve and effectively unusable (it can be used, but at roughly a 50% performance price)

 

·        Could have done more; later versions do.

o       Example: pre-allocate blocks like DEMOS

 

·        Enhancements made to system interface: (really a second mini-paper)

o       Long file names (14 -> 255)

o       Advisory file locks

o       Symbolic links (contrast to hard links)

o       Atomic rename capability

o       Disk quotas

 

·        3 key features of paper:

o       Parameterize FS implementation for the hardware it’s running on.

o       Measurement-driven design decisions

o       Locality “wins”


 

·        Major flaws:

o       Measurements derived from a single installation.

o       Ignored technology trends: more sophisticated drive electronics

 

A lesson for the future: don’t ignore underlying hardware characteristics.

 

Contrasting research approaches: improve what you’ve got vs. design something new.

 

Log-Structured File System

·        Radically different file system design.

 

·        Technology motivations:

o       CPUs outpacing disks: I/O becoming more and more of a bottleneck.

o       Big memories: file caches work well, making most disk traffic writes.

 

·        Problems with current file systems:

o       Lots of little writes.

o       Synchronous: wait for disk in too many places. (This makes it hard to win much from RAIDs; there is too little concurrency.)

o       Logical locality – the on-disk layout assumes certain read access patterns, so you pay on writes to organize data for later reads.

 

·        Basic idea of LFS:

o       Log all data and metadata with efficient, large, sequential writes (see the sketch after this list).

o       Treat the log as the truth (but keep an index on its contents).

o       Rely on a large memory to provide fast access through caching of log.

o       Temporal locality – information created/modified at the same time is automatically clustered.
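
A minimal sketch of that write path, with invented names and an assumed 512 KB segment size: everything, data and metadata alike, is appended to an in-memory segment buffer and flushed to disk with one large sequential write.

#include <stddef.h>
#include <string.h>

#define SEG_SIZE (512L * 1024)          /* segment size; value assumed for the sketch   */

struct segbuf {
    char   data[SEG_SIZE];
    size_t used;                        /* bytes buffered so far                        */
    long   seg_addr;                    /* disk address of the segment being filled     */
};

/* Stub: a real LFS picks the next segment from its list of clean segments. */
static long next_clean_segment(void) { static long a = 0; return a += SEG_SIZE; }

/* Append any block -- file data, an i-node, a piece of the inode map; the log
   doesn't care -- and return the disk address where the block will live. */
long log_append(struct segbuf *sb, const void *blk, size_t len)
{
    if (sb->used + len > SEG_SIZE) {
        /* Flush: one large sequential write of the whole buffer (omitted),
           then start filling a fresh clean segment. */
        sb->used = 0;
        sb->seg_addr = next_clean_segment();
    }
    memcpy(sb->data + sb->used, blk, len);
    long addr = sb->seg_addr + (long)sb->used;   /* "the log is the truth" */
    sb->used += len;
    return addr;
}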

 

·        Two potential problems:

o       Log retrieval on cache misses.

o       Wrap-around: what happens when end of disk is reached?

§         No longer any big, empty runs available.

§         How to prevent fragmentation?


 

·        Log retrieval:

o       Keep same basic file structure as UNIX (inode, indirect blocks, data).

o       Retrieval is just a question of finding a file’s inode.

o       UNIX inodes are kept in one or a few big arrays; LFS inodes must float to avoid update-in-place.

o       Solution: an inode map that tells where each inode currently is. (Also keeps other stuff: version number, last access time, free/allocated. See the sketch after this list.)

o       Inode map gets written to log like everything else.

o       Map of inode map gets written in special checkpoint location on disk; used in crash recovery.
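
A sketch of what an inode-map entry might hold and how lookup uses it. Field names are invented for this example, not the actual Sprite LFS layout:

#include <stdint.h>

struct imap_entry {
    int64_t  inode_daddr;    /* current disk address of the i-node (moves on every update) */
    uint32_t version;        /* version number, bumped when the i-node number is reused    */
    uint32_t atime;          /* last access time                                           */
    uint8_t  allocated;      /* free/allocated flag                                        */
};

/* Finding a file is "just" finding its i-node: consult the (cached) inode map,
   then read the i-node and follow direct/indirect blocks exactly as in UNIX. */
int64_t inode_location(const struct imap_entry *imap, uint32_t inum)
{
    return imap[inum].allocated ? imap[inum].inode_daddr : -1;
}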

 

·        Disk wrap-around:

o       Compact live information to open up large runs of free space. Problem: long-lived information gets copied over and over.

o       Thread log through free spaces. Problem: disk will get fragmented, so that I/O becomes inefficient again.

o       Solution: segmented log.

§         Divide disk into large, fixed-size segments.

§         Do compaction within a segment; thread between segments.

§         When writing, use only clean segments (i.e. no live data).

§         Occasionally clean segments: read in several, write out the live data in compacted form, leaving whole segments free (see the sketch after this list).

§         Try to collect long-lived information into segments that never need to be cleaned.
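
A simplified sketch of one cleaning pass under the segmented-log design. Structures and sizes are assumed for illustration; a real cleaner would append the live blocks to the head of the log and update the inode map.

#include <string.h>

#define BLOCK_SIZE  4096
#define SEG_NBLOCKS 128                   /* blocks per segment (assumed)                */

enum seg_state { SEG_DIRTY, SEG_CLEAN };

struct block   { char data[BLOCK_SIZE]; int live; };   /* liveness comes from the inode map */
struct segment { struct block blocks[SEG_NBLOCKS]; enum seg_state state; };

/* Read several dirty segments, copy only their live blocks (compacted) into
   clean destination segments, and mark the sources clean so they can be reused
   for large sequential writes. Threading happens between segments, not within. */
void clean_segments(struct segment *src, int nsrc, struct segment *dst)
{
    int seg = 0, out = 0;
    for (int i = 0; i < nsrc; i++) {
        for (int b = 0; b < SEG_NBLOCKS; b++) {
            if (!src[i].blocks[b].live)
                continue;
            if (out == SEG_NBLOCKS) { seg++; out = 0; }          /* next clean destination */
            memcpy(dst[seg].blocks[out].data, src[i].blocks[b].data, BLOCK_SIZE);
            dst[seg].blocks[out].live = 1;
            dst[seg].state = SEG_DIRTY;
            out++;
        }
        src[i].state = SEG_CLEAN;          /* the whole source segment is free again */
    }
}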

 

·        Which segments to clean?

o       Keep estimate of free space in each segment to help find segments with lowest utilization.

o       If utilization of segments being cleaned is U:

§         Write cost = (total bytes read & written)/(new data written) = 2/(1-U): cleaning reads the whole segment (1), rewrites the live data (U), and writes new data into the freed space (1-U). (If U is 0, no read is needed and the cost is 1. A worked sketch follows this list.)

§         Write cost increases as U increases: U = .9 => cost = 20!

§         Need a write cost of less than 4 to 10; => U of less than about .45 to .75.
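
The arithmetic behind the write-cost numbers above, as a small self-contained sketch:

#include <stdio.h>

/* Write cost = bytes moved per byte of new data written. Cleaning a segment at
   utilization u reads the whole segment (1), rewrites the live data (u), and
   writes new data into the freed space (1 - u):
   cost = (1 + u + (1 - u)) / (1 - u) = 2 / (1 - u); cost is 1 when u == 0. */
double write_cost(double u)
{
    return (u == 0.0) ? 1.0 : 2.0 / (1.0 - u);
}

int main(void)
{
    double u[] = { 0.0, 0.45, 0.75, 0.90 };
    for (int i = 0; i < 4; i++)
        printf("u = %.2f  ->  write cost = %.1f\n", u[i], write_cost(u[i]));
    /* prints 1.0, ~3.6, 8.0, 20.0 -- matching the "U = .9 => cost = 20" point above */
    return 0;
}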

 

·        Big assumption that segments can have 25-55% free space!

o       Disks are always full

o       Does buying new, larger disks help?

§         No! Need AutoRAID-like solution to extend “physical” drive space

 

·        Can we archive segments that are long-lived?

o       No, might be heavily read, not written.

 

 

 

·        Simulation of LFS cleaning:

o       Initial model: uniform random distribution of references; greedy algorithm for segment-to-clean selection.

o       Why does the simulation do better than the formula?

§         Because of variance in segment utilizations.

o       Then they added locality (i.e. 90% of references go to 10% of data) and things got worse!

§         Greedy cleaning policy → clean the least utilized of all segments

o       First solution: write out cleaned data ordered by age to obtain hot and cold segments (i.e., time that the space will likely stay free).

§         What programming language feature does this remind you of? Generational GC.

§         Only helped a little.

o       Claimed problem: even cold segments eventually have to reach the cleaning point, but they drift down slowly, while tying up lots of free space. Do you believe that’s true?

o       Solution: it’s worth paying more to clean cold segments because you get to keep the free space longer.

o       New selection function: MAX of (AGE*(1-U)/(1+U)) (see the sketch after this list).

§         Resulted in the desired bimodal distribution of segment utilizations.

§         LFS stays below write cost of 4 up to a disk utilization of 80%.
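
A small sketch of that cost-benefit selection function; the per-segment usage table fields are assumed:

struct seg_usage { double u; double age; };   /* estimated utilization, age of youngest live data */

/* Pick the segment maximizing benefit/cost = age * (1 - u) / (1 + u):
   cold (old) segments are worth cleaning at higher utilization than hot ones,
   which yields the bimodal distribution of segment utilizations noted above. */
int pick_segment_to_clean(const struct seg_usage *tab, int nsegs)
{
    int best = -1;
    double best_score = -1.0;
    for (int i = 0; i < nsegs; i++) {
        double score = tab[i].age * (1.0 - tab[i].u) / (1.0 + tab[i].u);
        if (tab[i].u < 1.0 && score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;    /* index of the most profitable segment to clean, or -1 if none */
}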

 

·        Crash recovery:

o       UNIX (fsck) must scan essentially all of the on-disk metadata to reconstruct a consistent state.

o       LFS reads the checkpoint and rolls forward through the log from the checkpoint state (see the sketch after this list).

o       Result: recovery time measured in seconds instead of minutes to hours.

§         But, checkpoints every 30 seconds!
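
A hedged outline of that recovery path; every helper name here is invented, and the real Sprite LFS details differ:

/* Crash recovery sketch. Assumed helpers (declared, not defined here):
   read_newest_checkpoint, load_inode_map, next_in_log, segment_newer_than,
   replay_segment_summary. */
struct checkpoint { long imap_addr; long log_tail; long timestamp; };

struct checkpoint read_newest_checkpoint(void);  /* two checkpoint regions; use the newer valid one   */
void load_inode_map(long imap_addr);             /* restore the inode map as of the checkpoint        */
long next_in_log(long seg_addr);
int  segment_newer_than(long seg_addr, long ts);
void replay_segment_summary(long seg_addr);      /* re-apply inode-map updates from segment summaries */

void recover(void)
{
    struct checkpoint cp = read_newest_checkpoint();
    load_inode_map(cp.imap_addr);

    /* Roll forward: scan only the segments written after the checkpoint, so
       recovery time is bounded by ~30 seconds of writes, not a whole-disk scan. */
    for (long seg = next_in_log(cp.log_tail);
         segment_newer_than(seg, cp.timestamp);
         seg = next_in_log(seg))
        replay_segment_summary(seg);
}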

 

·        An interesting point: LFS’ efficiency isn’t derived from knowing the details of disk geometry; this implies it can survive changing disk technologies (such as a variable number of sectors per track) better.

 

·        Where is LFS today?

o       NTFS and Linux journaling file systems use a log, but for metadata only (and the log is not the primary storage mechanism)

o       Used in DBMS log FS?

 

·        Key features of paper:

o       CPUs outpacing disk speeds; implies that I/O is becoming more and more of a bottleneck.

o       Write FS information to a log and treat the log as the truth; rely on in-memory caching to obtain speed.

o       Hard problem: finding/creating long runs of disk space to (sequentially) write log records to. Solution: clean live data from segments, picking segments to clean based on a cost/benefit function.

 

·        Some flaws:

o       Assumes that files get written in their entirety; otherwise LFS suffers intra-file fragmentation.

o       If small files “get bigger” then how would LFS compare to UNIX?

o       Disks are always full

 

A Lesson: Rethink your basic assumptions about what’s primary and what’s secondary in a design. In this case, they made the log become the truth instead of just a recovery aid.