Advanced Topics in Computer Systems

10/10/01

Anthony Joseph & Joe Hellerstein

 

UNIX Fast File System

Log-Structured File System

A Fast File System for UNIX

·        Original UNIX file system was simple and elegant, but slow.

·        Could only achieve about 20 KB/sec/arm; ~2% of 1982 disk bandwidth

 

·        Problems:

o       Blocks too small. Why?

§         VAX page size 512 bytes

§         512 increased to 1024 (but achieved only about 4% of disk BW)

§         Small size limited read-ahead → many seeks

o       Consecutive blocks of files not close together. Why?

§         Free list became randomized (175 KB/sec → 30 KB/sec)

o       i-nodes far from data

o       i-nodes of directory not close together

 

·        Aspects of new file system:

o       4096 or 8192 byte block size (why not larger?)

§         Larger blocks would waste space, because most UNIX files are small

o       Large blocks and small fragments

o       Disk divided into cylinder groups

o       Each contains a copy of the superblock, i-nodes, a bitmap of free blocks, and usage summary info (see the sketch after this list)

o       Keeps i-node near file, i-nodes of a directory together

o       Cylinder groups ~ 16 cylinders, or 7.5 MB

o       Cylinder group headers are placed at varying (rotated) offsets so they don’t all land on one platter, track, or cylinder

o       Account for rotational delay in numbering sectors

o       But, preserve existing filesystem abstraction
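
A rough sketch of the per-cylinder-group bookkeeping described above. The field names and sizes are illustrative assumptions, not the actual 4.2BSD "struct cg":

/* Illustrative per-cylinder-group summary (names/sizes assumed, not the real 4.2BSD "struct cg"). */
#include <stdint.h>

#define CG_NBLOCKS 1920              /* ~7.5 MB of 4 KB blocks per group (assumed)   */
#define CG_NINODES 1024              /* i-nodes allocated per group (assumed)        */

struct cg_summary {
    uint32_t cg_magic;               /* marks a valid cylinder group header          */
    uint32_t cg_ncyl;                /* cylinders in this group (~16)                */
    uint32_t cg_nfree_blocks;        /* usage summary: free data blocks              */
    uint32_t cg_nfree_inodes;        /* usage summary: free i-nodes                  */
    uint8_t  cg_blockmap[CG_NBLOCKS / 8];  /* bitmap of free blocks/fragments        */
    uint8_t  cg_inodemap[CG_NINODES / 8];  /* bitmap of free i-nodes                 */
};
/* Each group also holds a copy of the superblock, stored at a rotated offset so
   that no single platter, track, or cylinder contains every copy. */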

 

·        Old FS view of disk: just blocks

·        New FS view of disk: a detailed model of the hardware (cylinders, tracks, rotational position)

·        Today: the disk once again appears as just blocks (the drive electronics hide the geometry)!


 

·        Two techniques for locality:

o       Don’t let disk fill up in any one area (10% reserve)

o       Paradox: to achieve file block locality, must spread unrelated things far apart

o       Note: a freshly created (old-style) file system got 175 KB/sec because its free list still held sequential blocks (it did generate locality); an aged file system has a randomly ordered free list and got only 30 KB/sec

 

·        Specific application of these techniques:

o       Goal: keep a directory within a cylinder group, spread out different directories → fewer seeks

o       Goal: allocate runs of blocks within a cylinder group, every once in a while switch to a new cylinder group → higher throughput

§         Jump at 48 KB (the point at which a file with 4 KB blocks needs its first singly indirect block), then at every 1 MB thereafter (about 25% of the blocks in a cylinder group).

o       Layout policy: global and local

o       Global policy allocates files & directories to cylinder groups. Picks “optimal” next block for block allocation.

o       Local allocation routines handle specific block requests, selecting from a sequence of alternatives if the preferred block is taken (see the sketch below).
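
A hedged sketch of the global placement decision above: keep a file's early blocks in its i-node's cylinder group, then jump at 48 KB and again after every additional megabyte. The constants and function names are invented for illustration.

#define KB           1024L
#define MB           (1024L * KB)
#define JUMP_POINT   (48 * KB)       /* 12 direct blocks x 4 KB: first indirect block needed */

/* Global policy: which cylinder group should hold the block at file_offset?
   inode_cg is the group holding the file's i-node; ncg is the number of groups. */
long pick_cylinder_group(long file_offset, long inode_cg, long ncg)
{
    if (file_offset < JUMP_POINT)
        return inode_cg;                      /* keep early blocks near the i-node           */

    /* Past 48 KB, move to a new group, and again after every additional megabyte,
       so no single large file monopolizes one cylinder group's free space.      */
    long chunk = 1 + (file_offset - JUMP_POINT) / MB;
    return (inode_cg + chunk) % ncg;          /* spread successive 1 MB chunks across groups */
}

The local routines then try, in order, a rotationally well-placed free block within the chosen group, any free block in that group, and finally a block in another group.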

 

·        Results:

o       20-50% of disk bandwidth for large reads/writes.

o       10-15x original UNIX speeds.

o       Size: 3800 lines of code vs. 2700 in old system.

o       10% of total disk space is held in reserve and effectively unusable (it can be used, but at roughly a 50% performance price)

 

·        Could have done more; later versions do.

o       Example: pre-allocate blocks like DEMOS

 

·        Enhancements made to system interface: (really a second mini-paper)

o       Long file names (14 -> 255)

o       Advisory file locks

o       Symbolic links (contrast to hard links)

o       Atomic rename capability

o       Disk quotas

 

·        3 key features of paper:

o       Parameterize FS implementation for the hardware it’s running on.

o       Measurement-driven design decisions

o       Locality “wins”


 

·        Major flaws:

o       Measurements derived from a single installation.

o       Ignored technology trends: more sophisticated drive electronics

 

A lesson for the future: don’t ignore underlying hardware characteristics.

 

Contrasting research approaches: improve what you’ve got vs. design something new.

 

Log-Structured File System

·        Radically different file system design.

 

·        Technology motivations:

o       CPUs outpacing disks: I/O becoming more and more of a bottleneck.

o       Big memories: file caches work well, making most disk traffic writes.

 

·        Problems with current file systems:

o       Lots of little writes.

o       Synchronous: wait for disk in too many places. (This makes it hard to win much from RAIDs; there is too little concurrency.)

o       Logical locality – the on-disk layout assumes certain read access patterns, so you pay on writes to organize data for later reads.

 

·        Basic idea of LFS:

o       Log all data and metadata with efficient, large, sequential writes (see the sketch after this list).

o       Treat the log as the truth (but keep an index on its contents).

o       Rely on a large memory to provide fast access through caching of log.

o       Temporal locality – information created/modified at the same time is automatically clustered.
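
A minimal sketch of that write path, with invented names and an assumed 512 KB segment size: everything, data and metadata alike, is appended to an in-memory segment buffer and flushed to disk with one large sequential write.

#include <stddef.h>
#include <string.h>

#define SEG_SIZE (512L * 1024)          /* segment size; value assumed for the sketch   */

struct segbuf {
    char   data[SEG_SIZE];
    size_t used;                        /* bytes buffered so far                        */
    long   seg_addr;                    /* disk address of the segment being filled     */
};

/* Stub: a real LFS picks the next segment from its list of clean segments. */
static long next_clean_segment(void) { static long a = 0; return a += SEG_SIZE; }

/* Append any block -- file data, an i-node, a piece of the inode map; the log
   doesn't care -- and return the disk address where the block will live. */
long log_append(struct segbuf *sb, const void *blk, size_t len)
{
    if (sb->used + len > SEG_SIZE) {
        /* Flush: one large sequential write of the whole buffer (omitted),
           then start filling a fresh clean segment. */
        sb->used = 0;
        sb->seg_addr = next_clean_segment();
    }
    memcpy(sb->data + sb->used, blk, len);
    long addr = sb->seg_addr + (long)sb->used;   /* "the log is the truth" */
    sb->used += len;
    return addr;
}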

 

·        Two potential problems:

o       Log retrieval on cache misses.

o       Wrap-around: what happens when end of disk is reached?

§         No longer any big, empty runs available.

§         How to prevent fragmentation?


 

·        Log retrieval:

o       Keep same basic file structure as UNIX (inode, indirect blocks, data).

o       Retrieval is just a question of finding a file’s inode.

o       UNIX inodes are kept in one or a few big arrays; LFS inodes must float to avoid update-in-place.

o       Solution: an inode map that tells where each inode currently is. (Also keeps other stuff: version number, last access time, free/allocated. See the sketch after this list.)

o       Inode map gets written to log like everything else.

o       Map of inode map gets written in special checkpoint location on disk; used in crash recovery.
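
A sketch of what an inode-map entry might hold and how lookup uses it. Field names are invented for this example, not the actual Sprite LFS layout:

#include <stdint.h>

struct imap_entry {
    int64_t  inode_daddr;    /* current disk address of the i-node (moves on every update) */
    uint32_t version;        /* version number, bumped when the i-node number is reused    */
    uint32_t atime;          /* last access time                                           */
    uint8_t  allocated;      /* free/allocated flag                                        */
};

/* Finding a file is "just" finding its i-node: consult the (cached) inode map,
   then read the i-node and follow direct/indirect blocks exactly as in UNIX. */
int64_t inode_location(const struct imap_entry *imap, uint32_t inum)
{
    return imap[inum].allocated ? imap[inum].inode_daddr : -1;
}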

 

·        Disk wrap-around:

o       Compact live information to open up large runs of free space. Problem: long-lived information gets copied over and over.

o       Thread log through free spaces. Problem: disk will get fragmented, so that I/O becomes inefficient again.

o       Solution: segmented log.

§         Divide disk into large, fixed-size segments.

§         Do compaction within a segment; thread between segments.

§         When writing, use only clean segments (i.e. no live data).

§         Occasionally clean segments: read in several, write out the live data in compacted form, leaving whole segments free (see the sketch after this list).

§         Try to collect long-lived information into segments that never need to be cleaned.
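
A simplified sketch of one cleaning pass under the segmented-log design. Structures and sizes are assumed for illustration; a real cleaner would append the live blocks to the head of the log and update the inode map.

#include <string.h>

#define BLOCK_SIZE  4096
#define SEG_NBLOCKS 128                   /* blocks per segment (assumed)                */

enum seg_state { SEG_DIRTY, SEG_CLEAN };

struct block   { char data[BLOCK_SIZE]; int live; };   /* liveness comes from the inode map */
struct segment { struct block blocks[SEG_NBLOCKS]; enum seg_state state; };

/* Read several dirty segments, copy only their live blocks (compacted) into
   clean destination segments, and mark the sources clean so they can be reused
   for large sequential writes. Threading happens between segments, not within. */
void clean_segments(struct segment *src, int nsrc, struct segment *dst)
{
    int seg = 0, out = 0;
    for (int i = 0; i < nsrc; i++) {
        for (int b = 0; b < SEG_NBLOCKS; b++) {
            if (!src[i].blocks[b].live)
                continue;
            if (out == SEG_NBLOCKS) { seg++; out = 0; }          /* next clean destination */
            memcpy(dst[seg].blocks[out].data, src[i].blocks[b].data, BLOCK_SIZE);
            dst[seg].blocks[out].live = 1;
            dst[seg].state = SEG_DIRTY;
            out++;
        }
        src[i].state = SEG_CLEAN;          /* the whole source segment is free again */
    }
}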

 

·        Which segments to clean?

o       Keep estimate of free space in each segment to help find segments with lowest utilization.

o       If utilization of segments being cleaned is U:

§         Write cost = (total bytes read & written)/(new data written) = 2/(1-U): cleaning reads the whole segment (1), rewrites the live data (U), and writes new data into the freed space (1-U). (If U is 0, no read is needed and the cost is 1. A worked sketch follows this list.)

§         Write cost increases as U increases: U = .9 => cost = 20!

§         Need a write cost of less than 4 to 10; => U of less than about .45 to .75.
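
The arithmetic behind the write-cost numbers above, as a small self-contained sketch:

#include <stdio.h>

/* Write cost = bytes moved per byte of new data written. Cleaning a segment at
   utilization u reads the whole segment (1), rewrites the live data (u), and
   writes new data into the freed space (1 - u):
   cost = (1 + u + (1 - u)) / (1 - u) = 2 / (1 - u); cost is 1 when u == 0. */
double write_cost(double u)
{
    return (u == 0.0) ? 1.0 : 2.0 / (1.0 - u);
}

int main(void)
{
    double u[] = { 0.0, 0.45, 0.75, 0.90 };
    for (int i = 0; i < 4; i++)
        printf("u = %.2f  ->  write cost = %.1f\n", u[i], write_cost(u[i]));
    /* prints 1.0, ~3.6, 8.0, 20.0 -- matching the "U = .9 => cost = 20" point above */
    return 0;
}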

 

·        Big assumption that segments can have 25-55% free space!

o       Disks are always full

o       Does buying new, larger disks help?

§         No! Need AutoRAID-like solution to extend “physical” drive space

 

·        Can we archive segments that are long-lived?

o       No, might be heavily read, not written.

 

 

 

·        Simulation of LFS cleaning:

o       Initial model: uniform random distribution of references; greedy algorithm for segment-to-clean selection.

o       Why does the simulation do better than the formula?

§         Because of variance in segment utilizations.

o       Then they added locality (i.e. 90% of references go to 10% of data) and things got worse!

§         Greedy cleaning policy → clean the least utilized of all segments

o       First solution: write out cleaned data ordered by age to obtain hot and cold segments (i.e., time that the space will likely stay free).

§         What programming language feature does this remind you of? Generational GC.

§         Only helped a little.

o       Claimed problem: even cold segments eventually have to reach the cleaning point, but they drift down slowly, while tying up lots of free space. Do you believe that’s true?

o       Solution: it’s worth paying more to clean cold segments because you get to keep the free space longer.

o       New selection function: MAX of (AGE*(1-U)/(1+U)) (see the sketch after this list).

§         Resulted in the desired bimodal distribution of segment utilizations.

§         LFS stays below write cost of 4 up to a disk utilization of 80%.
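
A small sketch of that cost-benefit selection function; the per-segment usage table fields are assumed:

struct seg_usage { double u; double age; };   /* estimated utilization, age of youngest live data */

/* Pick the segment maximizing benefit/cost = age * (1 - u) / (1 + u):
   cold (old) segments are worth cleaning at higher utilization than hot ones,
   which yields the bimodal distribution of segment utilizations noted above. */
int pick_segment_to_clean(const struct seg_usage *tab, int nsegs)
{
    int best = -1;
    double best_score = -1.0;
    for (int i = 0; i < nsegs; i++) {
        double score = tab[i].age * (1.0 - tab[i].u) / (1.0 + tab[i].u);
        if (tab[i].u < 1.0 && score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;    /* index of the most profitable segment to clean, or -1 if none */
}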

 

·        Crash recovery:

o       UNIX (fsck) must scan essentially all of the on-disk metadata to reconstruct a consistent state.

o       LFS reads the checkpoint and rolls forward through the log from the checkpoint state (see the sketch after this list).

o       Result: recovery time measured in seconds instead of minutes to hours.

§         But, checkpoints every 30 seconds!
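
A hedged outline of that recovery path; every helper name here is invented, and the real Sprite LFS details differ:

/* Crash recovery sketch. Assumed helpers (declared, not defined here):
   read_newest_checkpoint, load_inode_map, next_in_log, segment_newer_than,
   replay_segment_summary. */
struct checkpoint { long imap_addr; long log_tail; long timestamp; };

struct checkpoint read_newest_checkpoint(void);  /* two checkpoint regions; use the newer valid one   */
void load_inode_map(long imap_addr);             /* restore the inode map as of the checkpoint        */
long next_in_log(long seg_addr);
int  segment_newer_than(long seg_addr, long ts);
void replay_segment_summary(long seg_addr);      /* re-apply inode-map updates from segment summaries */

void recover(void)
{
    struct checkpoint cp = read_newest_checkpoint();
    load_inode_map(cp.imap_addr);

    /* Roll forward: scan only the segments written after the checkpoint, so
       recovery time is bounded by ~30 seconds of writes, not a whole-disk scan. */
    for (long seg = next_in_log(cp.log_tail);
         segment_newer_than(seg, cp.timestamp);
         seg = next_in_log(seg))
        replay_segment_summary(seg);
}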

 

·        An interesting point: LFS’ efficiency isn’t derived from knowing the details of disk geometry; this implies it can survive changing disk technologies (such as a variable number of sectors per track) better.

 

·        Where is LFS today?

o       NTFS and Linux journaling file systems use a log, but for metadata only (and the log is not the primary storage mechanism)

o       Used in DBMS log FS?

 

·        Key features of paper:

o       CPUs outpacing disk speeds; implies that I/O is becoming more and more of a bottleneck.

o       Write FS information to a log and treat the log as the truth; rely on in-memory caching to obtain speed.

o       Hard problem: finding/creating long runs of disk space to (sequentially) write log records to. Solution: clean live data from segments, picking segments to clean based on a cost/benefit function.

 

·        Some flaws:

o       Assumes that files get written in their entirety; otherwise LFS suffers intra-file fragmentation.

o       If small files “get bigger” then how would LFS compare to UNIX?

o       Disks are always full

 

A Lesson: Rethink your basic assumptions about what’s primary and what’s secondary in a design. In this case, they made the log become the truth instead of just a recovery aid.