Advanced Topics in Computer Systems
10/10/01
Anthony Joseph & Joe Hellerstein
· Original UNIX file system was simple and elegant, but slow.
· Could only achieve about 20 KB/sec/arm; ~2% of 1982 disk bandwidth.
· Problems:
o Blocks too small. Why?
§ VAX page size was 512 bytes.
§ 512 increased to 1024 (but achieved only 4% of disk BW).
§ Small size limited read-ahead → many seeks.
o Consecutive blocks of files not close together. Why?
§ Free list became randomized (175 KB/sec → 30 KB/sec).
o i-nodes far from data.
o i-nodes of a directory not close together.
· Aspects of new file system:
o 4096 or 8192 byte block size (why not larger?)
§ Would waste space, because most UNIX files are small.
o Large blocks and small fragments.
o Disk divided into cylinder groups.
o Each contains a superblock, i-nodes, a bitmap of free blocks, and usage summary info.
o Keeps i-nodes near their file data, and the i-nodes of a directory together.
o Cylinder groups ~16 cylinders, or 7.5 MB.
o Cylinder headers spread around so not all on one platter, track, or cylinder.
o Account for rotational delay in numbering sectors.
o But, preserve the existing filesystem abstraction.
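The block/fragment trade-off above can be made concrete with a little arithmetic. This is a sketch, not the paper's code; it assumes a 4096-byte block divided into 1024-byte fragments and compares the internal fragmentation of a small file with and without fragments.

```python
# Sketch (names invented): internal fragmentation with large blocks,
# and how FFS-style fragments recover most of it.
# Assumes a 4096-byte block split into 1024-byte fragments.
BLOCK, FRAG = 4096, 1024

def waste_blocks_only(size):
    """Bytes wasted if every file is rounded up to whole blocks."""
    return -size % BLOCK

def waste_with_fragments(size):
    """Bytes wasted if the file's tail is stored in 1 KB fragments."""
    tail = size % BLOCK
    return -tail % FRAG if tail else 0

# A 1100-byte file wastes 2996 bytes when rounded to a 4 KB block,
# but only 948 bytes when its tail can live in fragments.
```

This is why the paper can afford large blocks for bandwidth while keeping the space overhead on small files modest.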
· Old FS view of disk: just blocks.
· New FS view of disk: detailed hardware abstraction.
· Today: disk appears as just blocks!
· Two techniques for locality:
o Don't let the disk fill up in any one area (10% reserve).
o Paradox: to achieve file block locality, you must spread unrelated things far apart.
o Note: the new file system got 175 KB/sec because its free list contained sequential blocks (it did generate locality), but an old system with a randomly ordered free list got only 30 KB/sec.
· Specific application of these techniques:
o Goal: keep a directory within a cylinder group, spread out different directories → fewer seeks.
o Goal: allocate runs of blocks within a cylinder group, every once in a while switch to a new cylinder group → higher throughput.
§ Jump at 48 KB (the 4K singly-indirect block), then every 1 MB (25% of the blocks in a cylinder group).
o Layout policy: global and local.
o Global policy allocates files & directories to cylinder groups. Picks the "optimal" next block for block allocation.
o Local allocation routines handle specific block requests. Select from a sequence of alternatives if needed.
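The global/local split above can be sketched in a few lines. This is a hypothetical simplification (all names and data structures invented, not the 4.2BSD code): the global policy picks a cylinder group with above-average free space, and the local routine tries the preferred block and then falls back through alternatives.

```python
# Hypothetical sketch of FFS's two-level layout policy.
# Global: place a new directory in a cylinder group with
# above-average free blocks. Local: try the preferred block,
# then any free block in the group, then another group.

def pick_group_for_dir(groups):
    """groups: list of free-block counts per cylinder group."""
    avg = sum(groups) / len(groups)
    # First group at or above the average free space.
    return next(i for i, free in enumerate(groups) if free >= avg)

def allocate_block(preferred, free_in_group, free_elsewhere):
    """Local routine: preferred block, else alternatives in order."""
    if preferred in free_in_group:
        return preferred
    if free_in_group:
        return min(free_in_group)   # any free block in the same group
    return min(free_elsewhere)      # last resort: switch groups
```

The point of the split is that the global policy only reasons about aggregate free-space statistics, while the local routine is the only code that touches actual block numbers.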
· Results:
o 20-50% of disk bandwidth for large reads/writes.
o 10-15x original UNIX speeds.
o Size: 3800 lines of code vs. 2700 in the old system.
o 10% of total disk space unusable (except at a 50% performance price).
· Could have done more; later versions do.
o Example: pre-allocate blocks, as in DEMOS.
· Enhancements made to the system interface (really a second mini-paper):
o Long file names (14 → 255).
o Advisory file locks.
o Symbolic links (contrast to hard links).
o Atomic rename capability.
o Disk quotas.
· 3 key features of the paper:
o Parameterize the FS implementation for the hardware it's running on.
o Measurement-driven design decisions.
o Locality "wins".
· Major flaws:
o Measurements derived from a single installation.
o Ignored technology trends: more sophisticated drive electronics.
A lesson for the future: don't ignore underlying hardware characteristics.
Contrasting research approaches: improve what you've got vs. design something new.
· Radically different file system design.
· Technology motivations:
o CPUs outpacing disks: I/O becoming more and more of a bottleneck.
o Big memories: file caches work well, making most disk traffic writes.
· Problems with current file systems:
o Lots of little writes.
o Synchronous: wait for the disk in too many places. (This also makes it hard to win much from RAIDs: too little concurrency.)
o Logical locality: certain access patterns are assumed, and you pay on writes to organize data accordingly.
· Basic idea of LFS:
o Log all data and metadata with efficient, large, sequential writes.
o Treat the log as the truth (but keep an index on its contents).
o Rely on a large memory to provide fast access through caching of the log.
o Temporal locality: information created/modified at the same time is automatically clustered.
· Two potential problems:
o Log retrieval on cache misses.
o Wrap-around: what happens when the end of the disk is reached?
§ No longer any big, empty runs available.
§ How to prevent fragmentation?
· Log retrieval:
o Keep the same basic file structure as UNIX (inode, indirect blocks, data).
o Retrieval is just a question of finding a file's inode.
o UNIX inodes are kept in one or a few big arrays; LFS inodes must float to avoid update-in-place.
o Solution: an inode map that tells where each inode is. (It also keeps other per-inode state: version number, last access time, free/allocated.)
o The inode map gets written to the log like everything else.
o A map of the inode map gets written to a special checkpoint location on disk; this is used in crash recovery.
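A minimal sketch of the inode-map idea (structures invented for illustration, not LFS's on-disk format): because inodes float in the log, a map from inode number to log address is needed to find the latest copy, and updates append rather than overwrite.

```python
# Minimal sketch of an LFS-style inode map.
log = []          # the append-only log: (kind, payload) records
inode_map = {}    # inode number -> index of its latest copy in the log

def write_inode(inum, inode):
    log.append(("inode", inode))
    inode_map[inum] = len(log) - 1   # newest copy wins

def read_inode(inum):
    return log[inode_map[inum]][1]

write_inode(7, {"size": 100})
write_inode(7, {"size": 200})        # update appends; never in place
```

Note that the first copy of inode 7 is now dead space in the log; reclaiming such space is exactly the cleaning problem discussed below.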
· Disk wrap-around:
o Compact live information to open up large runs of free space. Problem: long-lived information gets copied over and over.
o Thread the log through the free spaces. Problem: the disk gets fragmented, so I/O becomes inefficient again.
o Solution: segmented log.
§ Divide the disk into large, fixed-size segments.
§ Do compaction within a segment; thread between segments.
§ When writing, use only clean segments (i.e., ones with no live data).
§ Occasionally clean segments: read in several, write out their live data in compacted form, leaving some segments free.
§ Try to collect long-lived information into segments that never need to be cleaned.
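The cleaning step can be sketched as follows (a hedged simplification with invented structures, not the Sprite LFS code): read a few segments, keep only the blocks that are still live, and return them compacted so they can be appended to a clean segment while the source segments are freed.

```python
# Hedged sketch of segment cleaning: compact live blocks out of
# several segments, freeing those segments for future log writes.
def clean(segments, live):
    """segments: dict seg_id -> list of block ids.
    live: set of block ids still referenced.
    Returns (compacted live blocks, freed segment ids)."""
    survivors = []
    for seg_id, blocks in segments.items():
        survivors.extend(b for b in blocks if b in live)
    return survivors, list(segments)

segs = {0: [1, 2, 3], 1: [4, 5, 6]}
out, freed = clean(segs, live={2, 5})
# The live data from two mostly-dead segments now fits in one run.
```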
· Which segments to clean?
o Keep an estimate of the free space in each segment to help find the segments with the lowest utilization.
o If the utilization of the segments being cleaned is U:
§ Write cost = (total bytes read & written)/(new data written) = 2/(1-U). (The formula assumes a full segment read; if U is 0, no read is needed and the cost is 1.)
§ Write cost increases as U increases: U = 0.9 → cost = 20!
§ Need a cost of less than 4 to 10 → U of less than .75 to .45.
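The write-cost arithmetic above, as a quick check. Cleaning reads a whole segment and rewrites the live fraction U, so amortized over the (1-U) of new data written, the cost is 2/(1-U); an empty segment needs no read at all.

```python
# LFS write cost as a function of segment utilization U.
def write_cost(u):
    if u == 0:
        return 1.0            # empty segment: nothing to read or copy
    return 2.0 / (1.0 - u)    # read segment + rewrite live + write new

# U = 0.9 gives a write cost of 20, matching the note above;
# U = 0.5 gives 4, the threshold for beating an improved FFS.
```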
· Big assumption: that segments can have 25-55% free space!
o Disks are always full.
o Does buying new, larger disks help?
§ No! Need an AutoRAID-like solution to extend the "physical" drive space.
· Can we archive segments that are long-lived?
o No; they might be heavily read, even if not written.
· Simulation of LFS cleaning:
o Initial model: uniform random distribution of references; greedy algorithm for segment-to-clean selection.
o Why does the simulation do better than the formula?
§ Because of variance in segment utilizations.
o Then they added locality (i.e., 90% of references go to 10% of the data) and things got worse!
§ Greedy cleaning policy → clean the least utilized of all segments.
o First solution: write out cleaned data ordered by age to obtain hot and cold segments (age approximating how long the space will likely stay free).
§ What programming language feature does this remind you of? Generational GC.
§ Only helped a little.
o Claimed problem: even cold segments eventually have to reach the cleaning point, but they drift down slowly, tying up lots of free space along the way. Do you believe that's true?
o Solution: it's worth paying more to clean cold segments, because you get to keep the freed space longer.
o New selection function: MAX of (AGE*(1-U)/(1+U)).
§ Resulted in the desired bimodal utilization distribution.
§ LFS stays below a write cost of 4 up to a disk utilization of 80%.
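The cost-benefit selection function can be sketched directly (the segment tuples here are invented for illustration): pick the segment maximizing AGE*(1-U)/(1+U), which lets a cold segment be cleaned at a higher utilization than a hot one.

```python
# Cost-benefit cleaner: benefit = free space gained * how long it
# stays free (age); cost grows with utilization.
def benefit_cost(age, u):
    return age * (1.0 - u) / (1.0 + u)

def pick_segment(segments):
    """segments: list of (seg_id, age, utilization)."""
    return max(segments, key=lambda s: benefit_cost(s[1], s[2]))[0]

# A cold, fairly full segment can beat a hot, emptier one:
segs = [("hot", 1, 0.50), ("cold", 20, 0.90)]
```

With these numbers the hot segment scores 1*0.5/1.5 ≈ 0.33 while the cold one scores 20*0.1/1.9 ≈ 1.05, so the cleaner picks the cold segment even though greedy selection would prefer the hot one.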
· Crash recovery:
o UNIX must read the entire disk to reconstruct metadata.
o LFS reads the checkpoint and rolls forward through the log from the checkpoint state.
o Result: recovery time measured in seconds instead of minutes to hours.
§ But, checkpoints every 30 seconds!
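Checkpoint-plus-roll-forward can be sketched as follows (a simplification with invented structures, not LFS's actual recovery code): restore the inode map saved at the checkpoint, then replay only the log records written after it.

```python
# Hedged sketch of LFS-style recovery: checkpoint + roll-forward.
def recover(checkpoint, log):
    """checkpoint: (log position, inode-map snapshot).
    log: list of (inum, log_address) inode-map updates by position."""
    pos, imap = checkpoint
    imap = dict(imap)                 # don't mutate the snapshot
    for inum, addr in log[pos:]:      # roll forward from the checkpoint
        imap[inum] = addr
    return imap

ckpt = (2, {1: 10, 2: 11})
log = [(1, 10), (2, 11), (1, 12), (3, 13)]  # last two post-checkpoint
# Only the two post-checkpoint records are replayed, not the whole log.
```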
· An interesting point: LFS's efficiency isn't derived from knowing the details of disk geometry; this implies it can survive changing disk technologies (such as a variable number of sectors/track) better.
· Where is LFS today?
o NTFS and Linux journaling file systems, but for metadata only (and the log is not the primary storage mechanism).
o Used in DBMS log FS?
· Key features of the paper:
o CPUs outpacing disk speeds implies that I/O is becoming more and more of a bottleneck.
o Write FS information to a log and treat the log as the truth; rely on in-memory caching to obtain speed.
o Hard problem: finding/creating long runs of disk space to (sequentially) write log records to. Solution: clean live data from segments, picking segments to clean based on a cost/benefit function.
· Some flaws:
o Assumes that files get written in their entirety; otherwise you get intra-file fragmentation in LFS.
o If small files "get bigger", then how would LFS compare to UNIX?
o Disks are always full.
A lesson: rethink your basic assumptions about what's primary and what's secondary in a design. In this case, they made the log become the truth instead of just a recovery aid.