

## How to Improve Cache Performance?

### AMAT = HitTime + MissRate × MissPenalty

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
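A minimal sketch of the AMAT formula above, to make the three levers concrete. The function name `amat` and any numbers plugged into it are illustrative assumptions, not measurements from the lecture.

```c
/* Average memory access time.
   hit_time and miss_penalty share one unit (for example, clock cycles);
   miss_rate is a fraction between 0 and 1. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

With a hypothetical 1-cycle hit, 5% miss rate, and 100-cycle miss penalty, `amat(1.0, 0.05, 100.0)` is about 6 cycles. Halving the miss rate or halving the miss penalty each bring it to about 3.5 cycles, while halving the hit time only brings it to about 5.5.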

### Where do misses come from?

- Classifying Misses: 3 Cs
  - Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
  - Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
  - Conflict: If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)
- 4th "C":
  - Coherence: Misses caused by cache coherence.
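The 3 Cs can be separated mechanically by running the same trace through a direct-mapped cache and a fully associative LRU cache of the same size. The sketch below does that under stated assumptions: `NUM_BLOCKS`, `MAX_BLOCK_ADDR`, `struct miss_counts`, and `classify_misses` are hypothetical names, the trace holds block addresses rather than byte addresses, and coherence misses are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_BLOCKS 4      /* cache capacity in blocks (hypothetical, tiny) */
#define MAX_BLOCK_ADDR 64 /* every trace entry must be in [0, MAX_BLOCK_ADDR) */

struct miss_counts {
    int compulsory;
    int capacity;
    int conflict;
};

/* Classify the misses of a direct-mapped cache holding NUM_BLOCKS blocks. */
struct miss_counts classify_misses(const int *trace, size_t n)
{
    bool seen[MAX_BLOCK_ADDR] = { false }; /* block referenced before? */
    int dm[NUM_BLOCKS];                    /* direct-mapped: block in each line */
    int fa[NUM_BLOCKS];                    /* fully associative, fa[0] = most recent */
    int fa_used = 0;
    struct miss_counts c = { 0, 0, 0 };

    for (int i = 0; i < NUM_BLOCKS; i++)
        dm[i] = -1;

    for (size_t t = 0; t < n; t++) {
        int b = trace[t];

        /* Direct-mapped lookup: the block can live in exactly one line. */
        int line = b % NUM_BLOCKS;
        bool dm_hit = (dm[line] == b);
        dm[line] = b;

        /* Fully associative LRU lookup of the same total size. */
        int pos = -1;
        for (int i = 0; i < fa_used; i++)
            if (fa[i] == b) { pos = i; break; }
        bool fa_hit = (pos >= 0);
        if (!fa_hit) {
            if (fa_used < NUM_BLOCKS)
                fa_used++;
            pos = fa_used - 1; /* free slot, or the LRU entry when full */
        }
        for (int i = pos; i > 0; i--) /* move b to the front */
            fa[i] = fa[i - 1];
        fa[0] = b;

        if (!dm_hit) {
            if (!seen[b])
                c.compulsory++;  /* misses even in an infinite cache */
            else if (!fa_hit)
                c.capacity++;    /* misses even with full associativity */
            else
                c.conflict++;    /* caused only by the placement restriction */
        }
        seen[b] = true;
    }
    return c;
}
```

For the trace 0, 4, 0, 4 both blocks map to line 0, so the sketch reports 2 compulsory and 2 conflict misses. For 0, 1, 2, 3, 4, 0 it reports 5 compulsory misses and 1 capacity miss, since five distinct blocks cannot fit in four lines under any placement.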

CS252/Culler Lec 4.3



# Reducing Misses by Hardware Prefetching of Instructions & Data

- E.g., Instruction Prefetching
  - Alpha 21064 fetches 2 blocks on a miss
  - Extra block placed in "stream buffer"
  - On miss check stream buffer
- Works with data blocks too:
  - Jouppi [1990] 1 data stream buffer got 25% misses from 4KB cache; 4 streams got 43%
  - Palacharla & Kessler [1994] for scientific programs for 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty
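A minimal sketch of the "fetch 2 blocks on a miss" idea, assuming a one-entry stream buffer in front of a small direct-mapped cache. It is a counting model only (no timing, no bandwidth limits), and `CACHE_LINES`, `struct prefetch_stats`, and `run_with_stream_buffer` are hypothetical names. Real stream buffers typically hold several blocks in FIFO order.

```c
#include <stddef.h>

#define CACHE_LINES 8 /* hypothetical direct-mapped cache size, in blocks */

struct prefetch_stats {
    int cache_hits;
    int stream_hits;   /* cache misses satisfied by the stream buffer */
    int memory_misses; /* misses that had to go to memory */
};

/* trace[] holds non-negative block addresses. */
struct prefetch_stats run_with_stream_buffer(const int *trace, size_t n)
{
    int cache[CACHE_LINES];
    int stream = -1; /* block held by the stream buffer, -1 = empty */
    struct prefetch_stats s = { 0, 0, 0 };

    for (int i = 0; i < CACHE_LINES; i++)
        cache[i] = -1;

    for (size_t t = 0; t < n; t++) {
        int b = trace[t];
        int line = b % CACHE_LINES;

        if (cache[line] == b) {
            s.cache_hits++;
            continue;
        }
        /* On a miss, check the stream buffer before going to memory. */
        if (stream == b)
            s.stream_hits++;
        else
            s.memory_misses++;

        cache[line] = b; /* requested block goes into the cache */
        stream = b + 1;  /* next sequential block goes into the buffer */
    }
    return s;
}
```

On a purely sequential trace of blocks 0 through 7, only the first miss goes to memory and the other seven are caught by the buffer. On a stride-2 trace such as 0, 2, 4, 6 the prefetched block is never the one requested, so every miss still goes to memory and the prefetches only consume bandwidth.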



### Reducing Misses by Compiler Optimizations

- McFarling [1989] reduced cache misses by 75% on 8KB direct mapped cache, 4 byte blocks in <u>software</u>
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data
  - Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays
  - Loop Interchange: change nesting of loops to access data in order stored in memory
  - Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap
  - Blocking: Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
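The data transformations listed above can be sketched in C as before/after pairs. This is an illustration under assumptions, not code from the lecture: the dimension `N`, the blocking factor `B` (which must divide `N` here), and all function names are hypothetical.

```c
#include <stddef.h>

#define N 64 /* hypothetical matrix dimension */
#define B 16 /* hypothetical blocking factor; N must be a multiple of B */

/* Loop interchange, before: inner loop walks down a column (stride N). */
void scale_column_order(double x[N][N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

/* Loop interchange, after: inner loop walks along a row, the order in
   which C lays the array out in memory. */
void scale_row_order(double x[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}

/* Loop fusion, before: a[] and c[] are each traversed twice. */
void sum_unfused(double a[N], const double b[N], const double c[N], double d[N])
{
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] * c[i];
    for (size_t i = 0; i < N; i++)
        d[i] = a[i] + c[i];
}

/* Loop fusion, after: a[i] and c[i] are reused while still in the cache. */
void sum_fused(double a[N], const double b[N], const double c[N], double d[N])
{
    for (size_t i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] + c[i];
    }
}

/* Blocking, before: each x[i][j] sweeps a whole row of y and column of z. */
void matmul_plain(double y[N][N], double z[N][N], double x[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double r = 0;
            for (size_t k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocking, after: work on B x B sub-blocks so the piece of z in use
   is revisited while it is still resident. */
void matmul_blocked(double y[N][N], double z[N][N], double x[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] = 0;

    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < jj + B; j++) {
                    double r = 0;
                    for (size_t k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r; /* accumulate partial sums per block */
                }
}
```

Each pair computes the same values; only the order of memory accesses changes. The blocked multiply adds partial sums in a different order, so with arbitrary floating-point inputs the last bits of the result may differ from the plain version.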


# Merging Arrays Example

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

Reducing conflicts between val & key; improve spatial locality

















# Write Policy: Write-Through vs Write-Back

- Write-through: all writes update cache and underlying memory/cache
  - Can always discard cached data: most up-to-date data is in memory
  - Cache control bit: only a *valid* bit
- Write-back: all writes simply update cache
  - Can't just discard cached data: may have to write it back to memory
  - Cache control bits: both *valid* and *dirty* bits
- Other Advantages:
  - Write-through:
    - memory (or other processors) always have latest data
    - Simpler management of cache
  - Write-back:
    - much lower bandwidth, since data often overwritten multiple times
    - Better tolerance to long-latency memory?

# Write Policy 2: Write Allocate vs Non-Allocate (What happens on write-miss)

- Write allocate: allocate new cache line in cache
  - Usually means that you have to do a "read miss" to fill in rest of the cache-line!
  - Alternative: per/word valid bits
- Write non-allocate (or "write-around"):
  - Simply send write data through to underlying memory/cache; don't allocate new cache line!
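The two write-policy choices can be compared by counting memory traffic for a stream of writes. The sketch below is a simplified model under assumptions: a direct-mapped cache of `LINES` blocks, writes identified only by block address, and dirty lines flushed at the end so totals are comparable. `enum write_policy`, `struct traffic`, and `run_writes` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define LINES 4 /* hypothetical direct-mapped cache size, in blocks */

enum write_policy { WRITE_THROUGH, WRITE_BACK };

struct line {
    bool valid;
    bool dirty; /* only meaningful for write-back */
    int block;
};

struct traffic {
    int mem_writes;
    int mem_reads; /* line fills caused by write allocate */
};

/* Apply n writes (non-negative block addresses) and count memory traffic. */
struct traffic run_writes(const int *blocks, size_t n,
                          enum write_policy wp, bool write_allocate)
{
    struct line cache[LINES] = { { false, false, 0 } };
    struct traffic t = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        int b = blocks[i];
        struct line *l = &cache[b % LINES];
        bool hit = l->valid && l->block == b;

        if (!hit && write_allocate) {
            if (l->valid && l->dirty)
                t.mem_writes++; /* evicted dirty line is written back */
            t.mem_reads++;      /* "read miss" fills the rest of the line */
            l->valid = true;
            l->dirty = false;
            l->block = b;
            hit = true;
        }

        if (wp == WRITE_THROUGH || !hit)
            t.mem_writes++;     /* write-through, or write-around on a miss */
        else
            l->dirty = true;    /* write-back: update the cache only */
    }

    /* Flush remaining dirty lines so both policies end with memory current. */
    for (int i = 0; i < LINES; i++)
        if (cache[i].valid && cache[i].dirty)
            t.mem_writes++;

    return t;
}
```

For four writes to the same block with write allocate, write-through sends all four writes to memory while write-back sends one, which is the bandwidth advantage noted above. With write-through and no allocation, the same four writes cause no line fill at all.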





### 2. Reduce Miss Penalty: Early Restart and Critical Word First

- Don't wait for full block to be loaded before restarting CPU
  - <u>Early restart</u>: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - <u>Critical Word First</u>: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
- Generally useful only in large blocks
- Spatial locality => tend to want next sequential word, so not clear if benefit by early restart
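A rough timing model makes the difference between the schemes visible. It assumes the first word arrives `latency` cycles after the miss and each later word arrives `per_word` cycles after the previous one; these parameters and the function names are assumptions for illustration, not figures from the lecture.

```c
/* Cycles until the CPU restarts if it waits for the whole block. */
int restart_full_block(int latency, int per_word, int words_per_block)
{
    return latency + (words_per_block - 1) * per_word;
}

/* Early restart: words arrive in address order, and the CPU resumes as
   soon as the requested word (index word_offset, starting at 0) arrives. */
int restart_early(int latency, int per_word, int word_offset)
{
    return latency + word_offset * per_word;
}

/* Critical word first: the requested word is fetched first, so the CPU
   resumes after the initial latency regardless of the word's position. */
int restart_critical_word_first(int latency)
{
    return latency;
}
```

With a hypothetical 8-word block, 10-cycle latency, and 2 cycles per word, a miss on word 5 restarts after 24 cycles when waiting for the full block, 20 with early restart, and 10 with critical word first. With a 2-word block the three numbers are nearly the same, which is why the techniques mainly pay off for large blocks.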

































- I/O must interact with cache, so need virtual address

### 3. Fast Hits by Pipelining Cache (Case Study: MIPS R4000)

#### 8 Stage Pipeline:

- IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
- IS: second half of access to instruction cache.
- RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
- EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF: data fetch, first half of access to data cache.
- DS: second half of access to data cache.
- TC: tag check, determine whether the data cache access hit.
- WB: write back for loads and register-register operations.

#### What is the impact on load delay?

- Need 2 instructions between a load and its use
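The two-instruction gap follows from the stage positions. The sketch below assumes the loaded value can be forwarded at the end of DS (before the tag check in TC completes) and that the dependent instruction needs it at the start of EX. The enum and the function `load_use_stalls` are hypothetical names.

```c
/* Stage positions, counted in cycles from an instruction's IF stage. */
enum stage {
    STAGE_IF, STAGE_IS, STAGE_RF, STAGE_EX,
    STAGE_DF, STAGE_DS, STAGE_TC, STAGE_WB
};

/* Stall cycles for an instruction that consumes a load result in EX.
   distance = how many instructions after the load it issues
   (1 = the instruction immediately following the load). */
int load_use_stalls(int distance)
{
    int data_ready = STAGE_DS + 1;     /* first cycle the value can be forwarded */
    int needed = distance + STAGE_EX;  /* cycle the consumer would enter EX */
    int stalls = data_ready - needed;
    return stalls > 0 ? stalls : 0;
}
```

Under these assumptions `load_use_stalls(1)` is 2, `load_use_stalls(2)` is 1, and `load_use_stalls(3)` is 0, so two independent instructions between a load and its use hide the delay completely.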

Case Study: MIPS R4000 (pipeline timing diagrams)

- TWO Cycle Load Latency
- THREE Cycle Branch Latency (conditions evaluated during EX phase)
  - Delay slot plus two stalls
  - Branch likely cancels delay slot if not taken













### Cache Optimization Summary

MR = miss rate, MP = miss penalty, HT = hit time. A "+" means the technique improves that factor, a "-" means it hurts it.

| Goal         | Technique                         | MR | MP | HT | Complexity |
|--------------|-----------------------------------|----|----|----|------------|
| miss rate    | Larger Block Size                 | +  | -  |    | 0          |
| miss rate    | Higher Associativity              | +  |    | -  | 1          |
| miss rate    | Victim Caches                     | +  |    |    | 2          |
| miss rate    | Pseudo-Associative Caches         | +  |    |    | 2          |
| miss rate    | HW Prefetching of Instr/Data      | +  |    |    | 2          |
| miss rate    | Compiler Controlled Prefetching   | +  |    |    | 3          |
| miss rate    | Compiler Reduce Misses            | +  |    |    | 0          |
| miss penalty | Priority to Read Misses           |    | +  |    | 1          |
| miss penalty | Early Restart & Critical Word 1st |    | +  |    | 2          |
| miss penalty | Non-Blocking Caches               |    | +  |    | 3          |
| miss penalty | Second Level Caches               |    | +  |    | 2          |
| miss penalty | Better memory system              |    | +  |    | 3          |
| hit time     | Small & Simple Caches             | -  |    | +  | 0          |
| hit time     | Avoiding Address Translation      |    |    | +  | 2          |
| hit time     | Pipelining Caches                 |    |    | +  | 2          |

1/31/02 CS252/Culler Lec 4.46