

## Introduction to HPC

Lecture 2

Jakub Gałecki





- CPU architecture crash course
- Assembly language: just a taste
- The problem with branching
- Memory cache
- The TLB
- Memory alignment
- Case study: the Goto algorithm

# CPU architecture – a primer



# CPU = Central Processing Unit

# It is the brain of the computer

But that's a bit vague...

The CPU









# But that was too specific...





Let's start by placing things in context. The (modern) computer consists of:

- The CPU
- RAM
- GPU (last 2 lectures)
- The hard drive (whether HDD or SSD is irrelevant to us)
- I/O devices
- ...

#### The CPU



#### Fundamentally, the CPU performs the following tasks:

- Fetches instructions
- Decodes instructions
- Executes instructions

How does it know where to fetch instructions from? The program counter holds the address of the next instruction.

Examples of instructions:

- arithmetic operations
- reads from and writes to memory
- conditional jumps

Instructions usually operate on operands (arguments)
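The fetch/decode/execute cycle can be sketched as a toy interpreter. This is an illustration only: the four-instruction set, the `Instruction` layout, and the names `run` and `sum_to_five` are all made up for this example and do not correspond to any real CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy instruction set, invented for this illustration only.
enum class Op : std::uint8_t { LoadImm, Add, JumpIfNotZero, Halt };

struct Instruction {
    Op op;
    std::int64_t a;  // first operand: always a register index
    std::int64_t b;  // second operand: register index, immediate, or jump target
};

// Runs `program` on a machine with 4 registers and returns register 0.
inline std::int64_t run(const std::vector<Instruction>& program) {
    std::int64_t reg[4] = {0, 0, 0, 0};
    std::size_t pc = 0;  // program counter: index of the next instruction
    while (true) {
        assert(pc < program.size());
        const Instruction ins = program[pc];  // fetch
        ++pc;                                 // default: fall through to the next one
        switch (ins.op) {                     // decode, then execute
        case Op::LoadImm:
            reg[ins.a] = ins.b;
            break;
        case Op::Add:
            reg[ins.a] += reg[ins.b];
            break;
        case Op::JumpIfNotZero:  // conditional jump: overwrite the program counter
            if (reg[ins.a] != 0) pc = static_cast<std::size_t>(ins.b);
            break;
        case Op::Halt:
            return reg[0];
        }
    }
}

// Example program: r0 = 5 + 4 + 3 + 2 + 1, computed with a loop.
inline std::vector<Instruction> sum_to_five() {
    return {
        {Op::LoadImm, 0, 0},        // 0: r0 = 0  (accumulator)
        {Op::LoadImm, 1, 5},        // 1: r1 = 5  (counter)
        {Op::LoadImm, 2, -1},       // 2: r2 = -1 (step)
        {Op::Add, 0, 1},            // 3: r0 += r1
        {Op::Add, 1, 2},            // 4: r1 += r2
        {Op::JumpIfNotZero, 1, 3},  // 5: if r1 != 0, jump to 3
        {Op::Halt, 0, 0},           // 6: stop
    };
}
```

Calling `run(sum_to_five())` walks the loop five times. Note that the jump is the only instruction that changes the program counter to something other than "next".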



Registers: small (e.g. 64-bit) volatile memory units inside the CPU

Most instructions involve data stored in registers

Registers have effectively zero access latency

Examples:

- General purpose
- RFLAGS
- Control
- Debug
- Vector (spoiler alert)



[Figure: overview of the x86-64 register set with a legend of register widths (8-bit to 512-bit): vector registers ZMM0-ZMM31 with their YMM/XMM sub-registers, x87/MMX registers ST(0)-ST(7) / MM0-MM7, general-purpose registers RAX-R15, RIP, RFLAGS, segment registers, and control and debug registers.]



    .LC0:
            .string "Hello, World!\n"
    main:
            push    rbp
            mov     rbp, rsp
            mov     edi, OFFSET FLAT:.LC0
            call    puts
            mov     eax, 0
            pop     rbp
            ret

https://godbolt.org/z/or5Tz3cEb



#### Transistor reaction speed is not instantaneous.

- Gate delay: $d$
- Desired clock rate: $f$
- Theoretical max gate chain length: $1/(d \cdot f)$

If we want fast clock frequencies, we have a hard, physical limit on the complexity of our circuit.
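As a rough feel for the numbers, here is the bound evaluated for made-up values (both $d = 10\,\mathrm{ps}$ and $f = 4\,\mathrm{GHz}$ are assumptions chosen for round arithmetic, not measurements of any real chip):

$$
\frac{1}{d \cdot f} = \frac{1}{(10 \times 10^{-12}\,\mathrm{s}) \cdot (4 \times 10^{9}\,\mathrm{s^{-1}})} = \frac{1}{0.04} = 25 \text{ gates}
$$

Under these assumptions, no signal path that must settle within one cycle can pass through more than about 25 gates in sequence.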

The solution: pipelining

We can break the instructions down into stages and execute 1 stage per cycle. Different stages of subsequent instructions are executed concurrently!



#### Simplified example, 5-stage pipeline:



Problem: what happens when instruction n + 1 depends on the result of instruction n?



Pipelining introduces potential delays when subsequent operations depend on one another:

- Structural hazard: resource conflict
- Data hazard: logical dependency between instructions
  - Read-after-write
  - Write-after-read
  - Write-after-write
- Control hazard: control flow depends on the result of a previous instruction

We should keep these in mind when programming, although the hardware and compiler do most of the heavy lifting.
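The three kinds of data hazard can be mimicked at source level. This is only an analogy: real hazards arise between machine instructions operating on registers, and the compiler is free to rearrange or eliminate these statements. The function name and variables are illustrative.

```cpp
#include <cassert>

// Each pair of statements mirrors one kind of data hazard.
// The local variables stand in for registers.
inline int data_hazard_examples(int x, int y) {
    int a = x + y;  // writes a
    int b = a * 2;  // read-after-write: reads a, so it needs the result above

    int c = b + 1;  // reads b
    b = x - y;      // write-after-read: must not overwrite b before the read above

    a = c + b;      // writes a ...
    a = c - b;      // write-after-write: ... and again; this write must land last
    return a;
}
```

Only read-after-write is a true dependency. The other two are conflicts over a name, which is why register renaming (giving each write a fresh physical register) can remove them.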



#### What kind of hazard is this?

| Instruction    | 1  | 2  | 3    | 4    | 5    | 6  | 7   | 8    | 9  | 10 | 11  | 12 | 13 | 14 |
|----------------|----|----|------|------|------|----|-----|------|----|----|-----|----|----|----|
| ADD R8, R5, R5 | IF | ID | EX   | MEM  | WR   |    |     |      |    |    |     |    |    |    |
| ADD R2, R5, R8 |    | IF | Idle | Idle | ID   | EX | MEM | WR   |    |    |     |    |    |    |
| SUB R3, R8, R4 |    |    | IF   | Idle | Idle | ID | EX  | MEM  | WR |    |     |    |    |    |
| ADD R2, R2, R3 |    |    |      |      |      |    | IF  | Idle | ID | EX | MEM | WR |    |    |



- Structural hazards: get a better CPU (sorry)
- Data hazards: compiler optimization, out-of-order execution, register renaming, inline assembly if we're feeling dangerous
- Control hazards: branch prediction, write better code (stay tuned)

## A branching example



```
.LC0:
        .string "Hello, World!"
.LC1:
        .string "So many arguments :o"
main:
        sub     rsp, 8
        cmp     edi, 1
        jle     .L6
        mov     edi, OFFSET FLAT:.LC1
        call    puts
.L3:
        xor     eax, eax
        add     rsp, 8
        ret
.L6:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        jmp     .L3
```

https://godbolt.org/z/5sWsYcvTd



- The problem with branching: the CPU doesn't even know which instruction to fetch until some previous instruction executes
- Control hazard == "Data hazard on steroids"
- The solution: take a guess and see what happens
  - Correct guess: no stall, no performance penalty
  - Incorrect guess: pipeline flush, undo changes (expensive)

We need to try to be predictable.
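One way to be predictable is to remove the branch altogether. The sketch below sums all elements at or above a threshold, first with a data-dependent branch and then with the comparison folded into arithmetic. The function names are illustrative, and whether the second form is actually faster is an assumption to verify by measurement: an optimizing compiler may already turn the first version into branch-free code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sums all elements >= threshold using a data-dependent branch.
// On unsorted data the branch direction is hard to predict.
inline std::int64_t sum_branchy(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        if (v >= threshold) sum += v;
    }
    return sum;
}

// Same result, but the comparison becomes a 0/1 factor,
// so there is no branch in the loop body to mispredict.
inline std::int64_t sum_branchless(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        sum += static_cast<std::int64_t>(v >= threshold) * v;
    }
    return sum;
}
```

Sorting the input before calling `sum_branchy` is the other classic fix: the branch is then taken in long uniform runs, which the predictor handles well.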

Complete talk on the subject: https://youtu.be/g-WPhYREFjk

### Live demo





Accessing memory



DRAM == Dynamic Random Access Memory

Very large - up to hundreds of GB

Very slow to access - hundreds of cycles





#### Cache == fast, on-die memory

#### Usually several levels, nowadays: L1I, L1D, L2, L3

Smaller $\to$ faster





This is, of course, dependent on the specific CPU model, but the order of magnitude for modern CPUs is as follows:

| Memory hierarchy component | Latency [cycles] |
|----------------------------|------------------|
| Register                   | 0                |
| L1 Cache                   | 4                |
| L2 Cache                   | 10-25            |
| L3 Cache                   | ~40              |
| Main memory                | 200+             |

Source: Bakhvalov, D. (2020). Performance Analysis and Tuning on Modern CPUs.



The cache does not operate on individual bytes, but rather on sets of bytes, called **cachelines**.

The size of a cacheline on modern CPUs is 64B.

This has consequences:

- Aligning data to cacheline boundaries can increase performance
- Accessing neighboring data is faster
- Potential pitfall for concurrent programs (false sharing)
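The false sharing pitfall comes down to data layout. A minimal sketch, assuming the 64 B cacheline quoted above: two counters that would be updated by different threads either sit side by side (likely on one cacheline) or are each padded out to their own line. All type and function names here are illustrative, and the threads themselves are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cachelines, as stated above.
constexpr std::size_t kCacheLine = 64;

// Two counters packed next to each other: they likely share one cacheline,
// so writes by two threads would keep invalidating each other's copy.
struct PackedCounters {
    long a;
    long b;
};

// Aligning a counter to the cacheline size also pads it to that size,
// so each instance occupies a cacheline of its own.
struct alignas(kCacheLine) PaddedCounter {
    long value;
};

struct PaddedCounters {
    PaddedCounter a;
    PaddedCounter b;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cacheline");

// Returns the distance in bytes between two objects.
inline std::size_t byte_distance(const void* p, const void* q) {
    const auto x = reinterpret_cast<std::uintptr_t>(p);
    const auto y = reinterpret_cast<std::uintptr_t>(q);
    return static_cast<std::size_t>(x < y ? y - x : x - y);
}
```

The padded layout trades memory for independence: it is 8 times larger here, which only pays off when the counters really are written concurrently.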



There is no instruction for "write N bytes from memory to LX cache"\*

We have to structure our data access so that it is naturally cache-friendly

Spatial locality:

- Subsequent addresses are likely to be on the same cacheline
- The CPU can detect access patterns and prefetch our data

Temporal locality:

- Least recently used cacheline gets evicted first
- Data which was recently accessed is likely still in cache
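Spatial locality is easiest to see on a 2D array. The sketch below sums the same row-major matrix in two loop orders; the function names and the flat `std::vector` layout are choices made for this example. Both return the same value, and only the memory access pattern differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored in row-major order,
// visiting elements in the order they are laid out in memory.
inline long long sum_row_order(const std::vector<int>& m,
                               std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            sum += m[i * cols + j];  // stride 1: neighbors share a cacheline
    return sum;
}

// Same sum, but consecutive accesses are `cols` elements apart,
// so for large matrices almost every access touches a different cacheline.
inline long long sum_column_order(const std::vector<int>& m,
                                  std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            sum += m[i * cols + j];  // stride cols
    return sum;
}
```

For matrices that fit in cache the two versions should behave similarly; the gap is expected to open up once the working set exceeds the cache size.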

### Live demo







- Data is ultimately represented by electrons residing in the DRAM die, at a *physical* address
- Our program references memory via virtual addresses
- To de-conflict different processes, the OS *translates* virtual addresses to physical addresses
- The CPU has special hardware which helps with translation
- For improved efficiency, memory is divided into 4kB\* pages
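With 4 kB pages, a virtual address splits into a page number (upper bits) and an offset within the page (lower 12 bits). Only the page number is translated, and the offset is carried over unchanged. A minimal sketch, with hypothetical names and with the physical frame number simply passed in rather than looked up in a page table:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;  // 4 kB pages, as above
constexpr std::uint64_t kOffsetBits = 12;  // log2(4096)

struct SplitAddress {
    std::uint64_t page_number;  // this part gets translated
    std::uint64_t offset;       // position within the page, kept as-is
};

inline SplitAddress split(std::uint64_t virtual_address) {
    return {virtual_address >> kOffsetBits, virtual_address & (kPageSize - 1)};
}

// Toy translation: the caller supplies the physical frame that the
// virtual page maps to. A real system would look this up in the page table.
inline std::uint64_t translate(std::uint64_t virtual_address, std::uint64_t frame) {
    assert(frame < (std::uint64_t{1} << (64 - kOffsetBits)));
    return (frame << kOffsetBits) | split(virtual_address).offset;
}
```

For example, virtual address `0x12345` lies on page `0x12` at offset `0x345`; if that page maps to frame `0x7`, the physical address is `0x7345`.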



TLB = Translation Lookaside Buffer

Cache for the page translation process

- TLB size: 1536 pages
- TLB hit time: <1 cycle
- TLB miss penalty: 10-100 cycles

*Memory thrashing* for large working sets with random memory access

Usually not an issue
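Taking the figures above at face value (1536 entries, 4 kB pages), the amount of memory the TLB can cover at once is:

$$
1536 \times 4\,\mathrm{KiB} = 6144\,\mathrm{KiB} = 6\,\mathrm{MiB}
$$

Random access over a working set much larger than this can be expected to miss in the TLB regularly, while sequential access touches each page many times per translation.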



We say address $i$ is aligned to $a$ (or has alignment $a$) iff

$$i \bmod a = 0$$

where $a$ must be a power of 2. For example:

- **0xa0** is aligned to 16
- **0x0777b2** is aligned to 2
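Because $a$ is a power of 2, the modulo reduces to a bit mask. A small sketch of the definition (the helper names are made up for this example):

```cpp
#include <cassert>
#include <cstdint>

constexpr bool is_power_of_two(std::uint64_t a) {
    return a != 0 && (a & (a - 1)) == 0;
}

// True iff address i is aligned to a, i.e. i mod a == 0.
inline bool is_aligned(std::uint64_t i, std::uint64_t a) {
    assert(is_power_of_two(a));
    return (i & (a - 1)) == 0;  // same as i % a == 0 when a is a power of 2
}
```

With this, `0xa0` is aligned to 16 and also to 32, but not to 64, while `0x0777b2` is aligned to 2 but not to 4.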

CPUs are much better at accessing data stored at its natural alignment, i.e., at an address that is a multiple of its size.

For usual cases, this is handled by the compiler with padding: https://godbolt.org/z/39aWbGoKW.

We can use **alignas** or aligned allocation to override the defaults. We will soon see why this may be desired.
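Both mechanisms can be observed directly. In the sketch below, `Mixed` shows compiler-inserted padding and `CachelineAligned` shows an `alignas` override. The struct names and the 64-byte figure (one cacheline) are choices made for this example, and the exact size of `Mixed` depends on the platform ABI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Mixed {
    char c;    // 1 byte
    double d;  // the compiler pads after c so that d is naturally aligned
};

// Overrides the default alignment: every instance starts on a 64-byte boundary.
struct alignas(64) CachelineAligned {
    float data[16];
};

// True iff p is a multiple of a.
inline bool address_is_multiple_of(const void* p, std::size_t a) {
    assert(a != 0);
    return reinterpret_cast<std::uintptr_t>(p) % a == 0;
}
```

On a typical x86-64 ABI, `sizeof(Mixed)` is 16 rather than 9: seven bytes of padding follow `c`. Reordering members from largest to smallest is the usual way to reduce such padding.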



- Author: Kazushige Goto (early 2000s)
- Matrix-matrix multiply algorithm explicitly catering to the 3-level cache memory hierarchy
- Slice & dice approach
- General structure: simple, no CS PhD required
- Micro-kernel: detailed knowledge of the CPU architecture is required
- Fantastic explanation: https://youtu.be/07SMaudtH6k
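The "slice & dice" idea can be hinted at with plain loop blocking. To be clear, the sketch below is not the Goto algorithm: it omits panel packing, the multi-level blocking matched to each cache level, and the hand-tuned micro-kernel. It only shows how the same triple loop can be reordered to work on tiles that fit in cache. The function names and the single `block` parameter are assumptions of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // n x n, row-major

// Reference triple loop: C += A * B.
inline void matmul_naive(const Matrix& A, const Matrix& B, Matrix& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Blocked variant: the same arithmetic, performed tile by tile so that
// the tiles of A, B and C being worked on can stay in cache.
inline void matmul_blocked(const Matrix& A, const Matrix& B, Matrix& C,
                           std::size_t n, std::size_t block) {
    assert(block > 0);
    for (std::size_t ii = 0; ii < n; ii += block)
        for (std::size_t kk = 0; kk < n; kk += block)
            for (std::size_t jj = 0; jj < n; jj += block)
                // "Kernel": multiply one tile of A by one tile of B.
                for (std::size_t i = ii; i < std::min(ii + block, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + block, n); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + block, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The best `block` depends on the cache sizes of the target CPU and has to be found by measurement. The innermost three loops are the part a real implementation would replace with an architecture-specific micro-kernel.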

#### To the whiteboard!






- CPU architecture 101
- Assembly 101
- CPUs are pipelined
- Avoid unpredictable branches
- Cache is king



- Want performance? Know your hardware!
- The speed of feeding data to the CPU is just as important as the speed of processing it
- Break down the problem, optimize the kernel



