













































#### Reducing Misses by <u>Software</u> Prefetching Data

- Data Prefetch
  - Load data into register (HP PA-RISC loads)
  - Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC V9)
  - Special prefetching instructions cannot cause faults; a form of speculative execution
- Prefetching comes in two flavors:
  - Binding prefetch: Requests load directly into register.
    - Must be correct address and register!
  - Non-Binding prefetch: Load into cache.
    - Can be incorrect. Faults?
- Issuing Prefetch Instructions takes time
  - Is cost of prefetch issues < savings in reduced misses?
  - Wider superscalar issue makes the issue bandwidth consumed by prefetches less of a problem
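
A minimal sketch of a non-binding cache prefetch in C, assuming a GCC- or Clang-compatible compiler that provides the `__builtin_prefetch` builtin. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical; a useful distance depends on the miss latency and on the work done per iteration, and the extra instruction in every iteration is exactly the issue cost the slide asks about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in elements; tune per machine. */
#define PREFETCH_DISTANCE 16

/* Sum an array while requesting data a few iterations ahead.
 * __builtin_prefetch is only a hint: it does not bind the data to a
 * register, and it does not fault if the address is invalid. */
long sum_with_prefetch(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* arguments: address, 0 = read access, 3 = keep in all cache levels */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

The result is the same with or without the prefetch; only the timing can change, which is what makes this form of prefetch safe to issue speculatively.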

CS252/0

#### Reducing Misses by Compiler Optimizations

- McFarling [1989] reduced cache misses by 75% in software on an 8KB direct-mapped cache with 4-byte blocks
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data
  - Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  - Loop Interchange: change nesting of loops to access data in the order stored in memory
  - Loop Fusion: Combine 2 independent loops that have the same looping and some variables overlap
  - Blocking: Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
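
Loop interchange can be sketched as follows. C stores two-dimensional arrays in row-major order, so making the column index the inner loop gives unit-stride accesses. The array sizes `ROWS` and `COLS` and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 64    /* hypothetical sizes */
#define COLS 128

/* Before: the inner loop walks down a column, striding COLS ints per access. */
void scale_column_order(int x[ROWS][COLS], int factor)
{
    assert(x != NULL);
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            x[i][j] = factor * x[i][j];
}

/* After: the inner loop walks along a row, touching consecutive words. */
void scale_row_order(int x[ROWS][COLS], int factor)
{
    assert(x != NULL);
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            x[i][j] = factor * x[i][j];
}
```

Both versions compute the same result. The interchange is legal here because each iteration touches only its own element, so the order of iterations does not matter.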

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.
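
A short sketch of why the merged layout helps, assuming a lookup that compares `key` and then reads the matching `val`. With the array of structures the two fields sit next to each other, so the `val` read usually falls in the block already fetched for `key`. The function `lookup` and the value of `SIZE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE 8   /* hypothetical table size */

struct merge {
    int val;
    int key;
};

/* Linear search over the merged array.
 * Returns 1 and stores the value in *val_out when key is found, else 0. */
int lookup(const struct merge *table, size_t n, int key, int *val_out)
{
    assert(table != NULL && val_out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (table[i].key == key) {
            *val_out = table[i].val;   /* adjacent to the key just read */
            return 1;
        }
    }
    return 0;
}
```

With two separate arrays, `val[i]` and `key[i]` are `SIZE` elements apart, so they can map to the same set of a direct-mapped cache and evict each other.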





















### Write Policy 2: Write Allocate vs Non-Allocate (What happens on write-miss)

- Write allocate: allocate new cache line in cache
  - Usually means that you have to do a "read miss" to fill in the rest of the cache line!
  - Alternative: per-word valid bits
- Write non-allocate (or "write-around"):
  - Simply send write data through to underlying memory/cache; don't allocate new cache line!
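
The two write-miss policies can be compared with a toy model. This is a sketch under stated assumptions, not a description of any real controller: the cache is direct mapped, only tags are tracked, the write-hit policy (write-through vs. write-back) is not modeled, and all names and sizes are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LINES   64u   /* hypothetical cache geometry */
#define BLOCK_BYTES 32u

enum write_miss_policy { WRITE_ALLOCATE, WRITE_NO_ALLOCATE };

struct cache {
    enum write_miss_policy policy;
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
    unsigned block_fills;       /* blocks fetched from the next level */
    unsigned writes_forwarded;  /* write misses sent around the cache */
};

static uint32_t line_of(uint32_t addr) { return (addr / BLOCK_BYTES) % NUM_LINES; }
static uint32_t tag_of(uint32_t addr)  { return addr / (BLOCK_BYTES * NUM_LINES); }

static bool is_hit(const struct cache *c, uint32_t addr)
{
    uint32_t i = line_of(addr);
    return c->valid[i] && c->tag[i] == tag_of(addr);
}

/* Fetch the whole block from the next level and install it. */
static void fill_block(struct cache *c, uint32_t addr)
{
    uint32_t i = line_of(addr);
    c->valid[i] = true;
    c->tag[i] = tag_of(addr);
    c->block_fills++;
}

/* Returns true on a hit. A read miss always allocates. */
bool cache_read(struct cache *c, uint32_t addr)
{
    assert(c != NULL);
    if (is_hit(c, addr)) return true;
    fill_block(c, addr);
    return false;
}

/* Returns true on a hit. Only the write-miss policy is modeled. */
bool cache_write(struct cache *c, uint32_t addr)
{
    assert(c != NULL);
    if (is_hit(c, addr)) return true;
    if (c->policy == WRITE_ALLOCATE)
        fill_block(c, addr);       /* "read miss" to fill the rest of the line */
    else
        c->writes_forwarded++;     /* write-around: no line is allocated */
    return false;
}
```

In this model a write miss followed by a read of the same block hits under write allocate and misses under write non-allocate, which is the trade-off between the two policies.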



#### 1. Reducing Miss Penalty: Read Priority over Write on Miss

- Write-through with write buffers => RAW conflicts with main memory reads on cache misses
  - Simply waiting for the write buffer to empty might increase the read miss penalty (old MIPS 1000 by 50%)
  - Check write buffer contents before read; if no conflicts, let the memory access continue
- Write-back: want a buffer to hold displaced blocks
  - Read miss replacing dirty block
  - Normal: Write dirty block to memory, and then do the read
  - Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  - CPU stalls less since it restarts as soon as the read is done
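
The write-buffer check on a read miss can be sketched as below. The slide only says to check the buffer and let the read continue when there is no conflict; forwarding the buffered value on a conflict is an assumption made here (draining the buffer first would be another option). The buffer depth, the word-addressed `memory` array, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* hypothetical buffer depth */

struct wb_entry { uint32_t addr; uint32_t data; };

struct write_buffer {
    struct wb_entry e[WB_ENTRIES];
    size_t count;              /* entries 0..count-1 are pending, oldest first */
};

/* Queue a write. Returns false when the buffer is full (caller must stall). */
bool wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data)
{
    assert(wb != NULL);
    if (wb->count == WB_ENTRIES) return false;
    wb->e[wb->count].addr = addr;
    wb->e[wb->count].data = data;
    wb->count++;
    return true;
}

/* Search newest to oldest so the most recent pending write wins. */
bool wb_lookup(const struct write_buffer *wb, uint32_t addr, uint32_t *data)
{
    assert(wb != NULL && data != NULL);
    for (size_t i = wb->count; i > 0; i--) {
        if (wb->e[i - 1].addr == addr) {
            *data = wb->e[i - 1].data;
            return true;
        }
    }
    return false;
}

/* Service a read miss without waiting for the buffer to drain. */
uint32_t read_miss(const struct write_buffer *wb, const uint32_t *memory,
                   uint32_t word_addr)
{
    uint32_t data;
    if (wb_lookup(wb, word_addr, &data))
        return data;               /* RAW conflict: use the buffered value */
    return memory[word_addr];      /* no conflict: read goes ahead of the writes */
}
```

Without the check, the read would either return stale data from memory or have to wait for every buffered write to complete.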

CS252/Culler Lec 4.39








#### Reducing Miss Penalty: Non-Blocking Caches

- Requires multi-bank memories
- "<u>hit under miss</u>" reduces the effective miss penalty by continuing to service CPU requests during a miss instead of ignoring them
- "<u>hit under multiple miss</u>" or "<u>miss under miss</u>" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise cannot support)
  - Pentium Pro allows 4 outstanding memory misses






- L2 Equations

$$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$$

$$\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$$

$$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \left(\text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}\right)$$
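
The two-level AMAT relation can be written as a small C function: the L1 miss penalty is the L2 hit time plus the L2 local miss rate times the L2 miss penalty. The struct `cache_level` and the function name are hypothetical, and all times are assumed to be in clock cycles.

```c
#include <assert.h>

/* Hit time and local miss rate of one cache level (hypothetical type). */
struct cache_level {
    double hit_time;         /* clock cycles */
    double local_miss_rate;  /* misses / accesses reaching this level, in [0, 1] */
};

/* AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2) */
double amat_two_level(struct cache_level l1, struct cache_level l2,
                      double l2_miss_penalty)
{
    assert(l1.local_miss_rate >= 0.0 && l1.local_miss_rate <= 1.0);
    assert(l2.local_miss_rate >= 0.0 && l2.local_miss_rate <= 1.0);
    double l1_miss_penalty = l2.hit_time + l2.local_miss_rate * l2_miss_penalty;
    return l1.hit_time + l1.local_miss_rate * l1_miss_penalty;
}
```

For illustrative values of a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% L2 local miss rate, and a 100-cycle L2 miss penalty, the L1 miss penalty is 30 cycles and the AMAT is 2.5 cycles.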

#### Definitions:

- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate<sub>L2</sub>)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (for L2, Miss Rate<sub>L1</sub> x Miss Rate<sub>L2</sub>)
- Global miss rate is what matters
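
A worked example of the two definitions, using made-up counts chosen only for illustration: suppose the CPU issues 1000 memory references, 40 of them miss in L1, and 20 of those also miss in L2.

$$\text{Local Miss Rate}_{L1} = \frac{40}{1000} = 4\%$$

$$\text{Local Miss Rate}_{L2} = \frac{20}{40} = 50\%$$

$$\text{Global Miss Rate}_{L2} = \frac{20}{1000} = 2\% = \text{Local Miss Rate}_{L1} \times \text{Local Miss Rate}_{L2}$$

The 50% local rate makes L2 look poor, yet only 2% of CPU references go to main memory, which is why the global rate is the more useful figure.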











## Cache Optimization Summary

| Category     | Technique                         | MR | MP | HT | Complexity |
|--------------|-----------------------------------|----|----|----|------------|
| miss rate    | Larger Block Size                 | +  | -  |    | 0          |
| miss rate    | Higher Associativity              | +  |    | -  | 1          |
| miss rate    | Victim Caches                     | +  |    |    | 2          |
| miss rate    | Pseudo-Associative Caches         | +  |    |    | 2          |
| miss rate    | HW Prefetching of Instr/Data      | +  |    |    | 2          |
| miss rate    | Compiler Controlled Prefetching   | +  |    |    | 3          |
| miss rate    | Compiler Reduce Misses            | +  |    |    | 0          |
| miss penalty | Priority to Read Misses           |    | +  |    | 1          |
| miss penalty | Early Restart & Critical Word 1st |    | +  |    | 2          |
| miss penalty | Non-Blocking Caches               |    | +  |    | 3          |
| miss penalty | Second Level Caches               |    | +  |    | 2          |

MR = miss rate, MP = miss penalty, HT = hit time; + means the technique improves the factor, - means it hurts it.

1/31/02 CS252/Culler Lec 4.49