## Script generated by TTT Title: Petter: Programmiersprachen (23.10.2019) Date: Wed Oct 23 12:30:52 CEST 2019 Duration: 76:13 min Pages: 27 ## **MESI Example** Consider how the following code might execute: ``` Thread A a = 1; // A.1 b = 1; // A.2 ``` ``` Thread B while (b == 0) {}; // B.1 assert(a == 1); // B.2 ``` - in all examples, the initial values of variables are assumed to be 0 - suppose that a and b reside in different cache lines - assume that a cache line is larger than the variable itself - we write the content of a cache line as - Mx: modified, with value x - ► Ex: exclusive, with value x - Sx: shared, with value x - I: invalid ## **MESI Example** Consider how the following code might execute: ``` Thread A a = 1; // A.1 b = 1; // A.2 ``` | Thread B | | | | | |------------------------|--|-----|-------|------------| | <b>while</b><br>assert | | - " | . , , | B.1<br>B.2 | - in all examples, the initial values of variables are assumed to be 0 - suppose that a and b reside in different cache lines - assume that a cache line is larger than the variable itself - we write the content of a cache line as - Mx: modified, with value x - Ex: exclusive, with value x - Sx: shared, with value x - I: invalid ## The MESI Protocol: Messages Moving data between caches is coordinated by sending messages [McK10]: - Read: sent if CPU needs to read from an address - Read Response: when in state E or S, response to a Read message, carries the data for the requested address - Invalidate: asks others to evict a cache line - Invalidate Acknowledge: reply indicating that a cache line has been evicted - Read Invalidate: like Read + Invalidate (also called "read with intend to modify") - Writeback: Read Response when in state M, as a side effect noticing main memory about modifications to the cacheline, changing sender's state to S We mostly consider messages between processors. Upon *Read Invalidate*, a processor replies with *Read Response/Writeback* before the *Invalidate Acknowledge* is sent. ## **MESI Example** Consider how the following code might execute: # Thread A a = 1; // A.1 b = 1; // A.2 ``` Thread B while (b == 0) {}; // B.1 assert(a == 1); // B.2 ``` - in all examples, the initial values of variables are assumed to be 0 - suppose that a and b reside in different cache lines - assume that a cache line is larger than the variable itself - we write the content of a cache line as - Mx modified, with value x - Ex: exclusive, with value x - $\triangleright$ Sx: shared, with value x - l: invalid # **MESI Example** Consider how the following code might execute: # Thread A a = 1; // A.1 b = 1; // A.2 ``` Thread B while (b == 0) {}; // B.1 assert(a == 1); // B.2 ``` - in all examples, the initial values of variables are assumed to be 0 - suppose that a and b reside in different cache lines - assume that a cache line is larger than the variable itself - we write the content of a cache line as - Mx: modified, with value x - ► Ex: exclusive, with value x - ► Sx: shared, with value x - I: invalid ## **MESI Example (I)** | statement | CP | U A | CF | PU B | Ш | R/ | ١M | message | |-----------|----|-----|----|------|---|----|----|-----------------------------------| | | а | b | a | b | | a | b | | | A.1 | I | Ι | I | П | | 0 | 0 | read invalidate of a from CPU A | | | 1 | 1 | 1 | 1 | | 0 | 0 | invalidate ack. of a from CPU B | | | 1 | 1 | 1 | 1 | | 0 | 0 | read response of a=0 from RAM | | B.1 | M1 | 1 | 1 | 1 | | 0 | 0 | read of b from CPU B | | | M1 | 1 | 1 | 1 | | 0 | 0 | read response with b=0 from RAM | | B.1 | M1 | 1 | 1 | E0 | | 0 | 0 | V. | | A.2 | M1 | 1 | 1 | E0 | | 0 | 0 | ) read invalidate of b from CPU A | | | M1 | 1 | 1 | E0 | | 0 | 0 | read response of b=0 from CPU B | | | M1 | S 0 | 1 | S0 | | 0 | 0 | invalidate ack. of b from CPU B | | | M1 | M1 | 1 | 1 | | 0 | 0 | ν. | \_\_\_\_\_ # **MESI Example (I)** Thread A Thread B | while | | | | // | В. | |--------|----|----|-----|----|----| | assert | (a | == | 1); | // | В. | | statement | CP | U A | CF | PUB | RA | MA | message | |-----------|----|-----|----|-----|----------------|-----|-----------------------------------| | | a | b | а | b | а | b | | | A.1 | 1 | I | | _ | 0 | 0 | ) read invalidate of a from CPU A | | | | | | 1 | 0 | 0 | ) invalidate ack. of a from CPU B | | | 1 | | 1 | 1 | 0 | 0 | read response of a=0 from RAM | | B.1 | M1 | | 1 | 1 | 0 | 0 | read of b from CPU B | | | M1 | 1 | | 1 | 0 | - 0 | read response with b=0 from RAM | | B.1 | M1 | | 1 | E0 | <del>1</del> 0 | 0 | <u>v</u> . | | A.2 | M1 | | 1 | E0 | 0 | 0 | ) read invalidate of b from CPU A | | | M1 | | 1 | E0 | 0 | 0 | read response of b=0 from CPU B | | | M1 | S 0 | 1 | S0 | 0 | 0 | ) invalidate ack. of b from CPU B | | | M1 | M1 | 1 | 1 | 0 | 0 | <u>v</u> | # **MESI Example (I)** ## Thread A #### // A.1 a = 1;// A.2 b = 1; # Thread B while (b == 0) {}; // B.1 assert(a == 1); // B.2 | statement | CP | U A | CF | PU B | R/ | M | message | |-----------|-----|-----|----|------|----|---|-----------------------------------| | | а | b | а | b | a | b | | | A.1 | I | I | I | I | 0 | 0 | ) read invalidate of a from CPU A | | | 1 | | 1 | 1 | 0 | 0 | ) invalidate ack. of a from CPU B | | | 1 | | 1 | 1 | 0 | 0 | read response of a=0 from RAM | | B.1 | M1 | 1 | 1 | 1 | 0 | 0 | read of b from CPU B | | | M1 | 1 | 1 | 1 | 0 | 0 | read response with b=0 from RAM | | B.1 | M 1 | 1 | 1 | E0 | 0 | 0 | V. | | A.2 | M1 | | 1 | E0 | 0 | 0 | ) read invalidate of b from CPU A | | | M1 | | 1 | E0 | 0 | 0 | read response of b=0 from CPU B | | | M1 | S 0 | 1 | S0 | 0 | 0 | invalidate ack. of b from CPU B | | | M 1 | M1 | 1 | 1 | 0 | 0 | <b>↓</b> | # **MESI Example (II)** ### Thread B | statement | CP | U A | CP | U B | R/ | λM | message | |-----------|-----|-----|----|-----|----|----|---------------------------------| | | а | b | а | b | a | b | | | B.1 | M 1 | M 1 | 1 | I | 0 | 0 | ) read of b from CPU B | | | M 1 | M 1 | 1 | 1 | 0 | 0 | write back of b=1 from CPU A | | B.2 | M 1 | S 1 | 1 | S1 | 0 | 1 | read of a from CPU B | | | M 1 | S 1 | 1 | S1 | 0 | 1 | write back of a=1 from CPU A | | | S 1 | S 1 | S1 | S1 | 1 | 1 | Α. | | : | : | : | : | : | : | : | : | | A.1 | S 1 | S 1 | S1 | S1 | 1 | 1 | invalidate of a from CPU A | | | S 1 | S 1 | 1 | S1 | 1 | 1 | invalidate ack. of a from CPU B | | | M 1 | S 1 | 1 | S1 | 1 | 1 | ¥ | # **MESI Example (I)** # Thread A a = 1; ### Thread B | statement | CP | U A | CF | PUB | R/ | ٩M | message | |-----------|-----|-----|----|-----|----|----|-----------------------------------| | | a | b | a | b | a | b | | | A.1 | ı | I | ı | ı | 0 | 0 | ) read invalidate of a from CPU A | | | 1 | | 1 | 1 | 0 | 0 | ) invalidate ack. of a from CPU B | | | 1 | 1 | 1 | 1 | 0 | 0 | read response of a=0 from RAM | | B.1 | M1 | | 1 | 1 | 0 | 0 | read of b from CPU B | | | M1 | 1 | 1 | 1 | 0 | 0 | read response with b=0 from RAM | | B.1 | M1 | 1 1 | 1 | E0 | 0 | 0 | V. | | A.2 | M1 | | 1 | E0 | 0 | 0 | ) read invalidate of b from CPU A | | | M1 | 1 | 1 | E0 | 0 | 0 | read response of b=0 from CPU B | | | M1 | S 0 | 1 | S0 | 0 | 0 | ) invalidate ack. of b from CPU B | | | M 1 | M1 | 1 | 1 | 0 | 0 | ¥ | # **MESI Example (II)** ## Thread B | statement | CP | U A | CP | U B | RA | AΜ | message | |-----------|-----|-----|-----|-----|-----|----|---------------------------------| | | a | b | a | b | a | b | | | B.1 | M 1 | M 1 | I | _ | 0 | 0 | ) read of b from CPU B | | | M 1 | M 1 | 1 | 1 | 0 | 0 | write back of b=1 from CPU A | | B.2 | M 1 | S 1 | 1 | S1 | 0 | 1 | read of a from CPU B | | | M 1 | S 1 | 1 | S1 | 0 | 1 | write back of a=1 from CPU A | | | S 1 | S 1 | S1 | S1 | 1 | 1 | <b>₹</b> | | 1 : | : | : | 1 : | : | 1 : | : | <u>:</u> | | A.1 | S 1 | S 1 | S1 | S1 | 1 | 1 | ) invalidate of a from CPU A | | | S 1 | S 1 | 1 | S1 | 1 | 1 | invalidate ack. of a from CPU B | | | M 1 | S 1 | 1 | S1 | 1 | 1 | <u>*</u> | ## **MESI Example (II)** ``` Thread A // A.1 a = 1; // A.2 ``` ``` Thread B while (b == 0) {}; // B.1 assert(a == 1); ``` | statement | CP | U A | CP | U B | R/ | λM | message | |-----------|-----|-----|-----|-----|----|----|---------------------------------| | | а | b | а | b | a | b | | | B.1 | M 1 | M 1 | I | | 0 | 0 | ) read of b from CPU B | | | M 1 | M 1 | 1 | 1 | 0 | 0 | write back of b=1 from CPU A | | B.2 | M 1 | S 1 | 1 | S1 | 0 | 1 | read of a from CPU B | | | M 1 | S 1 | 1 | S1 | 0 | 1 | write back of a=1 from CPU A | | | S 1 | S 1 | S1 | S1 | 1 | 1 | <b>₹</b> | | : | : | : | 1 : | : | : | : | <u>:</u> | | A.1 | S 1 | S 1 | S1 | S1 | 1 | 1 | ) invalidate of a from CPU A | | | S 1 | S 1 | 1 | S1 | 1 | 1 | invalidate ack. of a from CPU B | | | M 1 | S 1 | 1 | S1 | 1 | 1 | <u>*</u> | Introducing Store Buffers: Out-Of-Order Stores # **MESI Example: Happened Before Model** Idea: each cache line one process, A caches b=0 as E, B caches a=0 as E #### Observations: • each memory access must complete before executing next instruction $\leadsto$ add edge ## **Out-of-Order Execution** ♠ performance problem: writes always stall ## **Store Buffers** ⚠ Abstract Machine Model: defines semantics of memory accesses - put each store into a store buffer and continue execution - Store buffers apply stores in various orders: - ► FIFO (Sparc/x86 TSO) - ▶ unordered (Sparc PSO) - program order still needs to be observed locally - store buffer snoops read channel and - on matching address, returns the youngest value in buffer ## **Store Buffers** Abstract Machine Model: defines semantics of memory accesses - put each store into a store buffer and continue execution - Store buffers apply stores in various orders: - ► FIFO (Sparc/x86-*TSO*) - ► unordered (Sparc PSO) - program order still needs to be observed locally - store buffer snoops read channel and - ▶ on matching address, returns the youngest value in buffer ## **Store Buffers** Abstract Machine Model: defines semantics of memory accesses - put each store into a store buffer and continue execution - Store buffers apply stores in various orders: - ► FIFO (Sparc/x86-*TSO*) - ► unordered (Sparc *PSÓ*) - $\triangle$ program order still needs to be observed locally - ► store buffer snoops read channel and - on matching address, returns the youngest value in buffer # **TSO Model: Formal Spec [SI92]** #### **Definition (Total Store Order)** The store order wrt. memory ( □ ) is total $\forall_{a,b} \in \mathit{addr}\ i,j \in \mathit{CPU} \quad \big(\mathsf{St}_{\pmb{i}}[a] \sqsubseteq \mathsf{St}_{\pmb{i}}[b]\big) \lor \big(\mathsf{St}_{\pmb{j}}[b] \sqsubseteq \mathsf{St}_{\pmb{i}}[a]\big)$ ② Stores in program order ( $\leq$ ) are embedded into the memory order ( $\sqsubseteq$ ) $\operatorname{St}_{i}[a] \leq \operatorname{St}_{i}[b] \Rightarrow \operatorname{St}_{i}[a] \sqsubseteq \operatorname{St}_{i}[b]$ $\mathrm{Ld}_{i}[a] \leq \mathrm{Op}_{i}[b] \Rightarrow \mathrm{Ld}_{i}[a] \sqsubseteq \mathrm{Op}_{i}[b]$ A load's value is determined by the latest write as observed by the local CPU Particularly, one ordering property is not guaranteed: $$\operatorname{St}_{i}[a] \leq \operatorname{Ld}_{i}[b] \not\Rightarrow \operatorname{St}_{i}[a] \sqsubseteq \operatorname{Ld}_{i}[b]$$ △ Local stores may be observed earlier by local loads then from somewhere else! ## **Store Buffers** Abstract Machine Model: defines semantics of memory accesses - put each store into a store buffer and continue execution - Store buffers apply stores in various orders: - ► FIFO (Sparc/x86-*TSO*) - ▶ unordered (Sparc *PSO*) - program order still needs to be observed locally - ► store buffer snoops read channel and - on matching address, returns the youngest value in buffer ## **Happened-Before Model for TSO** Assume cache A contains: a: S0, b: S0, cache B contains: a: S0, b: S0 ## **TSO Model: Formal Spec [SI92]** #### **Definition (Total Store Order)** - The store order wrt. memory ( □ ) is total - $\forall_{a,b} \in \mathit{addr}\ i,j \in \mathit{CPU} \quad \left(\mathsf{St}_i[a] \sqsubseteq \mathsf{St}_j[b]\right) \lor \left(\mathsf{St}_j[b] \sqsubseteq \mathsf{St}_i[a]\right)$ - Stores in program order ( $\leq$ ) are embedded into the memory order ( $\sqsubseteq$ ) - $\operatorname{St}_{i}[a] \leq \operatorname{St}_{i}[b] \Rightarrow \operatorname{St}_{i}[a] \sqsubseteq \operatorname{St}_{i}[b]$ - ② Loads preceding an other operation (wrt. program order ≤ ) are embedded into the memory order ( □ ) - $\mathrm{Ld}_{i}[a] \leq \mathrm{Op}_{i}[b] \Rightarrow \mathrm{Ld}_{i}[a] \sqsubseteq \mathrm{Op}_{i}[b]$ - A load's value is determined by the latest write as observed by the local CPU $$val(\operatorname{Ld}_i[a]) = val(\operatorname{St}_j[a] \mid \operatorname{St}_j[a] = \max_{\subseteq} \left( \left\{ \operatorname{St}_k[a] \mid \operatorname{St}_k[a] \sqsubseteq \operatorname{Ld}_i[a] \right\} \cup \left\{ \operatorname{St}_i[a] \mid \operatorname{St}_i[a] \le \operatorname{Ld}_i[a] \right\} \right)$$ Particularly, one ordering property is not guaranteed: $$\operatorname{Str}[a] \subseteq \operatorname{Ld}[b] \Rightarrow \operatorname{St}_i[a] \subseteq \operatorname{Ld}_i[b]$$ △ Local stores may be observed earlier by local loads then from somewhere else! ## TSO in the Wild: x86 The x86 CPU, powering desktops and servers around the world is a common representative of a TSO Memory Model based CPU. - FIFO store buffers keep quite strong consistency properties - The major obstacle to Sequential Consistency is $$\operatorname{\mathsf{St}}_i[a] \leq \operatorname{\mathsf{Ld}}_i[b] ot \Rightarrow ot \operatorname{\mathsf{St}}_i[a] \sqsubseteq \operatorname{\mathsf{Ld}}_i[b]$$ - ▶ modern x86 CPUs provide the mfence instruction - mfence orders all memory instructions: $$\mathsf{Op}_i \leq \mathit{mfence}() \leq \mathsf{Op}_i'$$ $\Rightarrow$ $\mathsf{Op}_i \sqsubseteq \mathsf{Op}_i'$ - a fence between write and loads gives sequentially consistent CPU behavior (and is as slow as a CPU without store buffer) - → use fences only when necessary ## **Happened-Before Model for TSO** #### Thread A #### Thread B ``` b = 1; printf("%d",a); ``` Assume cache A contains: a: S0, b: S0, cache B contains: a: S0, b: S0 # **PSO Model: Formal Spec [SI92]** #### **Definition (Partial Store Order)** - The store order wrt. memory ( □ ) is total - $\forall_{a,b \in addr \ i,j \in CPU} \quad (St_i[a] \sqsubseteq St_j[b]) \lor (St_j[b] \sqsubseteq St_i[a])$ - 2 Fenced stores in program order ( $\leq$ ) are embedded into the memory order ( $\sqsubseteq$ ) - $\operatorname{St}_{i}[a] \leq \operatorname{sfence}() \leq \operatorname{St}_{i}[b] \Rightarrow \operatorname{St}_{i}[a] \sqsubseteq \operatorname{St}_{i}[b]$ - 3 Stores to the same address in program order ( $\leq$ ) are embedded into the memory order ( $\sqsubseteq$ ) - $\operatorname{St}_{i}[a] \leq \operatorname{St}_{i}[a]' \Rightarrow \operatorname{St}_{i}[a] \sqsubseteq \operatorname{St}_{i}[a]'$ - - $\mathrm{Ld}_{i}[a] \leq \mathrm{Op}_{i}[b] \Rightarrow \mathrm{Ld}_{i}[a] \sqsubseteq \mathrm{Op}_{i}[b]$ - A load's value is determined by the latest write as observed by the local CPU - $val(\mathrm{Ld}_i[a]) = val(\mathrm{St}_j[a] \mid \mathrm{St}_j[a] = \max_{\sqsubseteq} \left( \left\{ \mathrm{St}_k[a] \mid \mathrm{St}_k[a] \sqsubseteq \mathrm{Ld}_i[a] \right\} \cup \left\{ \mathrm{St}_i[a] \mid \mathrm{St}_i[a] \le \mathrm{Ld}_i[a] \right\} \right)$ - Now also stores are not guaranteed to be in order any more: $$\operatorname{St}_{i}[a] \leq \operatorname{St}_{i}[b] \not\Rightarrow \operatorname{St}_{i}[a] \sqsubseteq \operatorname{St}_{i}[b]$$ What about sequential consistency for the whole system? #### TSO in the Wild: x86 The x86 CPU, powering desktops and servers around the world is a common representative of a TSO Memory Model based CPU. - FIFO store buffers keep quite strong consistency properties - The major obstacle to Sequential Consistency is $$\operatorname{St}_{i}[a] \leq \operatorname{Ld}_{i}[b] \implies \operatorname{St}_{i}[a] \sqsubseteq \operatorname{Ld}_{i}[b]$$ - ▶ modern x86 CPUs provide the mfence instruction - ▶ mfence orders all memory instructions: $$Op_i \leq mfence() \leq Op_i' \Rightarrow Op_i \sqsubseteq Op_i'$$ - a fence between write and loads gives sequentially consistent CPU behavior (and is as slow as a CPU without store buffer) - → use fences only when necessary ## **Explicit Synchronization: Write Barrier** Overtaking of messages *may be desirable* and does not need to be prohibited in general. - generalized store buffers render programs incorrect that assume sequential consistency between different CPUs - whenever a store in front of another operation in one CPU must be observable in this order by a different CPU, an explicit write barrier has to be inserted - ▶ a write barrier marks all current store operations in the store buffer - ▶ the next store operation is only executed when all marked stores in the buffer have completed