Advanced Topics in Computer Systems

10/3/01

Anthony Joseph & Joe Hellerstein

 

HP AutoRAID

RAID

·        UC Berkeley project: Redundant Array of Inexpensive Disks

·        Now multi-billion dollar business

·        Why use many small disks instead of a few very large disks (or one)?

o       Striping across drives gives higher data transfer rates on large accesses (see the address-mapping sketch after this list)

o       Higher I/O rates on small data accesses

o       More uniform load balancing across disks (eliminate hot spots – arm contention)

·        Caveats?

o       Vulnerable to failures: a 500K-hour MTTF for one disk implies 500K/100 = 5K hours for an array of 100 disks!

o       Solution: Redundancy

o       But, redundancy means worse performance for writes (have to write more than once)

o       Also, consistency in the presence of concurrent I/O and crashes is complex
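A minimal sketch of the striping idea above (illustrative Python; the function name and parameters are hypothetical, not from the paper): each stripe unit of logical blocks is placed on the next disk in rotation, so large transfers engage every spindle and independent small requests spread across the array.

def map_block(logical_block, num_disks, stripe_unit_blocks):
    # RAID 0-style layout: stripe units rotate round-robin across the disks.
    stripe_number = logical_block // stripe_unit_blocks    # which stripe unit
    offset_in_unit = logical_block % stripe_unit_blocks    # position inside the unit
    disk = stripe_number % num_disks                        # units rotate across disks
    physical_block = (stripe_number // num_disks) * stripe_unit_blocks + offset_in_unit
    return disk, physical_block

# A sequential scan touches consecutive stripe units on different disks (parallel
# transfer); unrelated small requests usually land on different disks (higher I/O rate).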

 

Goals: automate the efficient replication of data in a RAID

·        RAIDs are hard to set up and optimize

o       Three simultaneous goals:

§         Maximize the number of disks being accessed in parallel

§         Minimize the amount of disk space used for redundant data

§         Minimize the overhead required to achieve the above goals

o       Parameters:

§         Granularity of data interleaving

§         The way the redundant data is computed and stored across the array

-         Can be concentrated on a small number of drives or interleaved across all drives

·        Mix fast mirroring (2 copies) with slower, more space-efficient parity disks

·        Automate the migration between these two levels

 

Types of RAID:

·        RAID 0: Non-redundant, Just a Bunch of Disks

o       Make multiple disks appear as one large disk

o       Very high read throughput (can read in parallel)

·        RAID 1: Mirrored

o       Use twice as many disks as non-redundant array

o       Read from either, but write both (spindle synch is an issue)

·        RAID 2: Memory-style

o       Bit-interleave across data disks (e.g., 32) and add ECC on parity drives (e.g., 7 for a Hamming code)

o       Very complex, also modern drives already contain ECC hardware

o       No commercial products released

·        RAID 3: Bit-interleaved parity

o       Modern drives can use ECC to detect which drive failed

o       Only use one parity disk to reconstruct (XOR)

o       High bandwidth, not high I/O rates (only one request at a time, due to need to access parity drive)

·        RAID 4: Block-interleaved parity

o       Stripe in blocks, not bytes

o       Small writes need four I/Os: two to read old data and parity, one to write new data, and one to write recomputed parity

o       Use the inexpensive XOR operation for parity (see the parity sketch after this list)

·        RAID 5: Block-interleaved Distributed Parity

o       Stripe parity across all drives, instead of concentrating it on one

o       Small writes still inefficient due to read-modify-write requirement

·        Others:

o       RAID 10 (RAID 0 striping with RAID 1 mirroring)

o       Lots of proprietary schemes

·        What about tape backup?

o       Do we still need it?

§         Multiple drive failures

§         Human errors

§         Disaster recovery
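A hedged sketch of the parity arithmetic referred to above (plain Python, illustrative only; the block contents are made up): parity is the byte-wise XOR of the data blocks in a stripe, a missing block is rebuilt by XOR-ing the survivors with parity, and a small write recomputes parity from the old data, old parity, and new data, which is exactly why it costs four I/Os.

from functools import reduce

def parity(blocks):
    # Parity block = byte-wise XOR of all blocks in the stripe.
    return bytes(reduce(lambda a, b: a ^ b, cols) for cols in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    # Rebuild the single missing block: XOR of parity and the surviving data.
    return parity(surviving_blocks + [parity_block])

def small_write_parity(old_data, new_data, old_parity):
    # Read-modify-write: new parity = old parity XOR old data XOR new data.
    # Four I/Os: read old data, read old parity, write new data, write new parity.
    return parity([old_parity, old_data, new_data])

# Example stripe: 3 data blocks + 1 parity block
d = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
p = parity(d)
assert reconstruct([d[0], d[2]], p) == d[1]    # recover a failed disk's block
assert small_write_parity(d[1], bytes([9, 9, 9]), p) == parity([d[0], bytes([9, 9, 9]), d[2]])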

 

3 levels of RAID:

·        Mirroring (simple, fast, but requires 2x storage)

·        Parity disk (RAID level 3)

·        Rotating parity disk (RAID level 5)

 

Each kind of replication has a narrow range of workloads for which it is best...

·        Mistake ⇒ 1) poor performance, 2) changing layout is expensive and error prone

·        Also difficult to add storage: new disk ⇒ change layout and rearrange data...

·        (another problem: spare disks are wasted)

Key idea: mirror active data (hot), RAID 5 for cold data

·        Assumes only part of data in active use at one time

·        Working set changes slowly (to allow migration)

 

Where to deploy:

·        Sys-admin: have a human move files around.... BAD: painful and error-prone

·        File system: best choice, but hard to implement/deploy; can’t work with existing systems

·        Smart array controller: (magic disk) block-level device interface. Easy to deploy because there is a well-defined abstraction

 

Features:

·        Block Map: level of indirection so that blocks can be moved around among the disks

·        Mirroring of active blocks

·        RAID 5 for inactive blocks or large sequential writes (why?).

·        Start out fully mirrored, then move to 10% mirrored as disks fill

·        Promote/demote in 64K chunks (8-16 blocks)

·        Hot swap disks, etc. (A hot swap is just a controlled failure.)

·        Add storage easily (goes into the mirror pool)

·        No need for an active hot spare (per se); just keep enough working space around

·        Log-structured RAID 5 writes. (Why is this the right thing? Nice big streams, no need to read old parity for partial writes; see the sketch below)
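To make the last point concrete, here is a minimal sketch (illustrative Python, not the AutoRAID code; the log representation is made up) of why log-structured RAID 5 writes avoid the small-write penalty: demoted RBs are batched into full stripes, so parity is computed entirely from new data and nothing old has to be read back.

from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR across equal-sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, cols) for cols in zip(*blocks))

def append_full_stripe(raid5_log, data_blocks):
    # One new block per data disk in the stripe; parity comes only from the
    # new data, so there is no read of old data or old parity (contrast the
    # 4-I/O small write). Superseded copies elsewhere simply become garbage
    # for the cleaner to reclaim.
    raid5_log.append((list(data_blocks), xor_blocks(data_blocks)))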

 

Issues:

·        When to demote? When there is too much mirrored storage (>10%)

·        Demotion leaves a hole (64KB). What happens to it? Moved to free list and reused

·        Demoted RBs are written to the RAID5 log, one write for data, a second for parity

·        Why is log-based RAID5 better than update in place? Update of data requires reading all the old data to recalculate parity. The log simply ignores the old data (which becomes garbage) and writes only new data/parity stripes.

·        When to promote? When a RAID5 block is written... Just write it to mirrored storage; the old version becomes garbage.

o       What is thrashing? Why does it occur? How to prevent it?

·        How big should an RB be? Smaller ⇒ finer-grain migration; bigger ⇒ less mapping information and fewer seeks

·        How do you find where an RB is? Convert host addresses to (LUN, offset) and then look up the RB in a table keyed by this pair. Map size = number of RBs, so it grows in proportion to total storage (see the block-map sketch after this list).

o       Big issue: protecting this table from errors

·        Controller uses cache for reads

·        Controller uses NVRAM for fast commit, then moves data to disks. What if NVRAM is full? Block until some NVRAM contents are flushed to disk, then write to NVRAM.

·        Disk writes normally go to two disks (since newly written data is “hot”). Must wait for both to complete (why?). Does the host have to wait for both? No, just for NVRAM.

·        What happens in the background? 1) compaction, 2) migration, 3) balancing.

·        Compaction: clean the RAID5 area and plug holes in the mirrored disks. Do mirrored disks get cleaned? Yes, when a PEG is needed for RAID5; i.e., pick a disk with lots of holes and move its used RBs to other disks. The resulting empty PEG is then usable by RAID5.

·        What if there aren’t enough holes? Write the excess RBs to RAID5, then reclaim the PEG.

·        Migration: which RBs to demote? Least-recently-written

·        Balancing: make sure data evenly spread across the disks.

o       Most important when you add a new disk

§         Why? Newer drives usually perform much better (lower seek, faster rotation, …)
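A rough sketch of the block map and least-recently-written bookkeeping described in this list (illustrative Python; the class, field names, and "mirrored"/"raid5" tags are hypothetical): host addresses are turned into (LUN, RB index) keys, the table records where each RB currently lives, and the oldest-written RB is the next demotion candidate.

from collections import OrderedDict

RB_SIZE = 64 * 1024   # 64 KB relocation blocks (RBs), as in the notes above

class BlockMap:
    # One entry per RB, so the map grows in proportion to total storage.
    def __init__(self):
        self.table = {}                   # (lun, rb_index) -> ("mirrored" | "raid5", location)
        self.write_order = OrderedDict()  # ordered oldest-write-first, for LRW demotion

    def lookup(self, lun, byte_offset):
        # Host address -> (LUN, RB index) -> current physical location.
        return self.table[(lun, byte_offset // RB_SIZE)]

    def note_write(self, lun, byte_offset, location):
        key = (lun, byte_offset // RB_SIZE)
        self.table[key] = ("mirrored", location)    # newly written data is hot -> mirrored
        self.write_order.pop(key, None)
        self.write_order[key] = True                # now the most recently written RB

    def demotion_candidate(self):
        # Least-recently-written RB = first key in insertion order.
        return next(iter(self.write_order), None)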

 

Bad cases? One is thrashing when the working set is bigger than the mirrored storage

 

Performance:

·        Consistently better than regular RAID, comparable to plain disks (but worse)

·        They couldn’t get RAIDs to work well....

 

Other things:

·        “shortest seek” -- pick the disk (of the 2 mirrors) whose head is closest to the block (see the sketch after this list)

·        When idle, plug holes in the RAID5 area rather than appending to the log (easier because all RBs are the same size!). Why not all the time? It requires reading the rest of the stripe and recalculating parity

·        Very important that the behavior is dynamic: makes it robust across workloads, across technology changes, and across the addition of new disks. Greatly simplifies management of the disk system
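A tiny sketch of the “shortest seek” read policy mentioned in this list (illustrative Python; it assumes the controller tracks head positions, which is a modeled simplification): of the two mirrored copies, read from the disk whose arm is currently closest to the requested block.

def pick_mirror_copy(request_cylinder, copies):
    # copies: list of (disk_id, current_head_cylinder) for the two mirrors.
    return min(copies, key=lambda c: abs(c[1] - request_cylinder))[0]

# e.g. pick_mirror_copy(500, [(0, 120), (1, 480)]) -> disk 1 (20-cylinder seek vs. 380)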

 

Key features of paper:

·        RAIDs are difficult to use well

·        Mix mirroring and RAID5 automatically

·        Hide magic behind simple SCSI interface