# Lecture Note 7. IA: History and Features

November 11, 2023
Jongmoo Choi
Dept. of Software
Dankook University

http://embedded.dankook.ac.kr/~choijm

#### **Objectives**

- Discuss Issues on ISA (Instruction Set Architecture)
  - ✓ Opcode and operand addressing modes
- Understand how the ISA affects system programs
  - Context switch, Memory alignment, Memory overflow
- Describe the history of IA (both IA-32 and Intel 64)
- Grasp the key technologies in recent IA
  - ✓ Pipeline and Moore's law

Refer to Chapters 3 and 4 of CSAPP and the Intel Software Developer's Manual





#### Issues on ISA (1/2)

Considerations in ISA (Instruction Set Architecture) design

```
asm_sum: addl $1, %ecx
movl -4(%ebx, %ebp, 4), %eax
call func1
leave
```

- ✓ opcode issues
  - how many? (add vs. inc → RISC vs. CISC)
  - multi functions? (SISD vs. SIMD vs. MIMD ...)
- ✓ operand issues
  - fixed vs. variable operands
  - fixed: how many?
    - Consider `C = A + B;`
  - operand addressing modes
- ✓ performance issues
  - pipeline
  - superscalar
  - multicore





#### Issues on ISA (2/2)

- Features of IA (Intel Architecture)
  - ✓ Basically CISC (Complex Instruction Set Computing)
    - Variable length instruction
    - Variable number of operands (0~3)
    - Diverse operand addressing modes
    - Stack based function call
    - Supporting SIMD (Single Instruction Multiple Data)
  - ✓ Try to take advantage of RISC (Reduced Instruction Set Computing)
    - Micro-operations (for instance, an instruction of "add %eax, a" is divided into three u-ops, and each u-op is executed in a pipeline manner)
    - Load-store architecture
    - Independent multi-units
    - Out-of-order execution
    - Register based function call on x86-64
    - Register renaming

    - ...

(Source: CSAPP Chapter 4)

More recent CISC machines also take advantage of high-performance pipeline structures. As we will discuss in Section 5.7, they fetch the CISC instructions and dynamically translate them into a sequence of simpler, RISC-like operations. For example, an instruction that adds a register to memory is translated into three operations: one to read the original memory value, one to perform the addition, and a third to write the sum to memory. Since the dynamic translation can generally be performed well in advance of the actual instruction execution, the processor can sustain a very high execution rate.

## Operand addressing modes (1/5)

#### Addressing modes

- Immediate addressing
- Register addressing
- Register Indirect addressing
- Direct (Absolute) addressing
- ✓ Indirect addressing
- Base plus Offset addressing
- Base plus Index addressing
- Base plus Scaled Index addressing
- Base plus Scaled Index plus Offset addressing
- Stack addressing



## Operand addressing modes (2/5)

#### Subtle differences in operand



# Operand addressing modes (3/5)

#### Operand Addressing in IA

- ✓ Immediate operand: `addl $0x12, %eax`
- ✓ Register operand: `addl %esp, %ebp`
- ✓ Memory operand
  - Direct addressing: `addl 0x8049384, %eax`
  - Register indirect addressing: `addl (%ebp), %eax`
  - Base plus offset addressing: `addl 4(%ebp), %eax`
  - Base plus scaled index plus offset addressing: `addl 4(%ebp, %eax, 4), %ebx`
    - General form: `displacement(base, index, scale)`

# Operand addressing modes (4/5)

#### Example

✓ Base plus Scaled index plus offset

| Base                                                         |   | Index                                                 |   | Scale factor     |   | Displacement                      |
|--------------------------------------------------------------|---|-------------------------------------------------------|---|------------------|---|-----------------------------------|
| EAX<br>EBX<br>ECX<br>EDX<br>ESI<br>EDI<br>EBP<br>ESP<br>None | + | EAX<br>EBX<br>ECX<br>EDX<br>ESI<br>EDI<br>EBP<br>None | × | 1<br>2<br>4<br>8 | + | None<br>8-bit<br>16-bit<br>32-bit |



Question: which address does `4(%ebx, %ecx, 4)` refer to?

## Operand addressing modes (5/5)

#### Summary

| Type      | Form              | Operand value                      | Name                |
|-----------|-------------------|------------------------------------|---------------------|
| Immediate | \$Imm             | Imm                                | Immediate           |
| Register  | $E_a$             | $R[E_a]$                           | Register            |
| Memory    | Imm               | $M[Imm]$                           | Absolute            |
| Memory    | $(E_a)$           | $M[R[E_a]]$                        | Indirect            |
| Memory    | $Imm(E_b)$        | $M[Imm + R[E_b]]$                  | Base + displacement |
| Memory    | $(E_b, E_i)$      | $M[R[E_b] + R[E_i]]$               | Indexed             |
| Memory    | $Imm(E_b, E_i)$   | $M[Imm + R[E_b] + R[E_i]]$         | Indexed             |
| Memory    | $(,E_i,s)$        | $M[R[E_i] \cdot s]$                | Scaled indexed      |
| Memory    | $Imm(,E_i,s)$     | $M[Imm + R[E_i] \cdot s]$          | Scaled indexed      |
| Memory    | $(E_b, E_i, s)$   | $M[R[E_b] + R[E_i] \cdot s]$       | Scaled indexed      |
| Memory    | $Imm(E_b,E_i,s)$  | $M[Imm + R[E_b] + R[E_i] \cdot s]$ | Scaled indexed      |

Figure 3.3 Operand forms. Operands can denote immediate (constant) values, register values, or values from memory. The scaling factor s must be either 1, 2, 4, or 8.

(Source: CSAPP Chapter 3)



### Impact of ISA on system program: Multitasking (1/5)

#### Time sharing system

- ✓ Tasks run in turn, interleaved on the CPU
- ✓ Need to remember where to resume → Context
  - Context: registers, address space, opened files, IPCs, ...
- Context switch
  - When: timeout (time quantum expired), sleep, blocking I/O, ...
  - How
    - Context save: CPU registers → task structure (memory)
    - Context restore: task structure (memory) → CPU registers





### Impact of ISA on system program: Multitasking (2/5)

#### Virtual CPU: running A



When the time quantum expires, the system program (scheduler) selects Task B to run next.



## Impact of ISA on system program: Multitasking (3/5)

#### Virtual CPU: switch to B



- When the time quantum expires, the system program (scheduler) selects Task B to run next.
- When the time quantum expires again, Task A is scheduled. Where should it resume?



## Impact of ISA on system program: Multitasking (4/5)

#### Virtual CPU: how to switch back to A



### Impact of ISA on system program: Multitasking (5/5)

#### Time sharing system

- ✓ Tasks run in turn, interleaved on the CPU
- ✓ Need to remember where to resume → Context
  - Context: registers, address space, opened files, IPCs, ...
- Context switch
  - When: timeout (time quantum expired), sleep, blocking I/O, ...
  - How
    - Context save: CPU registers → task structure (memory)
    - Context restore: task structure (memory) → CPU registers



#### Impact of ISA on system program: Memory Usage (1/6)

#### Little Endian vs. Big Endian

```
/* byte_order.c */
#include <stdio.h>

int main(void)
{
    int a = 0x12345678;
    unsigned char *p_a;

    p_a = (unsigned char *)&a;
    printf("p_a[0] = %x\n", p_a[0]);
    printf("p_a[3] = %x\n", p_a[3]);
}

--- Linux on x86 (little endian) ---
[choijm@localhost choijm]$ uname -a
Linux localhost.localdomain 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 i686 i386 GNU/Linux
[choijm@localhost choijm]$ ls -l byte_order.c
-rw-rw-r-- 1 choijm choijm 175 Nov 19 20:18 byte_order.c
[choijm@localhost choijm]$ gcc byte_order.c
[choijm@localhost choijm]$ ./a.out
p_a[0] = 78
p_a[3] = 12

--- Solaris on SPARC (big endian) ---
choijm@embedded ~ $ uname -a
SunOS embedded 5.10 Generic 127127-11 sun4u sparc SUNW,Sun-Fire-880 Solaris
choijm@embedded ~ $ gcc byte_order.c
choijm@embedded ~ $ ./a.out
p_a[0] = 12
p_a[3] = 78
```

## Impact of ISA on system program: Memory Usage (2/6)

#### Little Endian vs. Big Endian

Continuing our earlier example, suppose the variable x of type int and at address 0x100 has a hexadecimal value of 0x01234567. The ordering of the bytes within the address range 0x100 through 0x103 depends on the type of machine:

| Byte order    | 0x100 | 0x101 | 0x102 | 0x103 |
|---------------|-------|-------|-------|-------|
| Big endian    | 01    | 23    | 45    | 67    |
| Little endian | 67    | 45    | 23    | 01    |

(Source: CSAPP)







## Impact of ISA on system program: Memory Usage (3/6)

- Where can we see the little endian?
  - ✓ `readelf` command

```
$ more test.c
#include <stdio.h>

int a = 10;
int b = 20;
int c;

int main()
{
    c = a + b;
    printf("C = %d\n", c);
}
$ gcc -c test.c
$ size test.o
   text    data     bss     dec     hex filename
    ...     ...     ...     ...      a4 test.o
$ gcc test.c
$ size a.out
   text    data     bss     dec     hex filename
    ...     ...     ...     ...     8a3 a.out
$ objdump -h a.out

a.out:     file format elf64-x86-64

Sections:
Idx Name               Size      VMA               LMA               File off  Algn
  0 .interp            0000001c  0000000000000318  0000000000000318  00000318  2**0
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .note.gnu.property 00000020  0000000000000338  0000000000000338  00000338  2**3
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .note.gnu.build-id 00000024  0000000000000358  0000000000000358  00000358  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .note.ABI-tag      00000020  000000000000037c  000000000000037c  0000037c  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  ...

$ readelf -a a.out
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       ...
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           ...
  Entry point address:               0x1060
  Start of program headers:          64 (bytes into file)
  Start of section headers:          14784 (bytes into file)
  Flags:                             ...
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         ...
  Section header string table index: 30

Section Headers:
  ...
```

#### Impact of ISA on system program: Memory Usage (4/6)

- Memory Alignment in data structure
  - ✓ To reduce the number of memory fetches (and to keep accesses atomic)
  - ✓ To respect cache line boundaries (and avoid false sharing)

```
/* Byte alignment test by choijm (byte_alignment.c) */
#include <stdio.h>

// #define TEST_PACKED

#ifdef TEST_PACKED
typedef struct {
    double d1;
    char ch;
    double d2;
} __attribute__((packed)) Test;
#else
typedef struct {
    double d1;
    char ch;
    double d2;
} Test;
#endif

int main()
{
    Test test;

    /* sizeof yields a size_t, so print it with %zu */
    printf("Size of Test is %zu\n", sizeof(test));
}
```

The printed size depends on the compiler and the CPU, and on whether `__attribute__((packed))` is applied.

### Impact of ISA on system program: Memory Usage (5/6)

#### Memory Alignment in stack

- ✓ Caller: needs 16 bytes (for 2 local variables + 2 arguments in Fig. 3.24)
  - Note that sum and diff are not allocated on the stack in this example
- ✓ However, 24 bytes are allocated in the frame for alignment (`subl $24, %esp`)
  - 4 B for saved %ebp + 24 B allocated + 4 B for return address = 32 B → 16-byte alignment
```
int swap_add(int *xp, int *yp)
{
    int x = *xp;
    int y = *yp;

    *xp = y;
    *yp = x;
    return x + y;
}

int caller()
{
    int arg1 = 534;
    int arg2 = 1057;
    int sum = swap_add(&arg1, &arg2);
    int diff = arg1 - arg2;

    return sum * diff;
}
```

Figure 3.23 Example of procedure definition and call.

(Source: CSAPP)



Figure 3.24 Stack frames for caller and swap\_add. Procedure swap\_add retrieves its arguments from the stack frame for caller.

#### Impact of ISA on system program: Memory Usage (6/6)

- Revisit the stack in LN6
  - ✓ IA recommends 16-byte alignment: `andl $-16, %esp`



# Impact of ISA on system program: Memory overflow (1/2)

#### Memory overflow

- ✓ Caused by the absence of bounds checking: buffer overflow, stack overflow
- ✓ How to thwart buffer overflow on the stack (or heap)
  - Stack randomization
    - One step further: ASLR (Address Space Layout Randomization) → randomizes even code, data, and heap (see Appendix)
  - Stack protector (a.k.a. stack guard): e.g., canary

```
/* Sample implementation of library function gets() */
char *gets(char *s)
{
    int c;
    char *dest = s;
    int gotchar = 0; /* Has at least one character been read? */
    while ((c = getchar()) != '\n' && c != EOF) {
        *dest++ = c; /* No bounds checking! */
        gotchar = 1;
    }
    *dest++ = '\0'; /* Terminate string */
    if (c == EOF && !gotchar)
        return NULL; /* End of file or error */
    return s;
}

/* Read input line and write it back */
void echo()
{
    char buf[8]; /* Way too small! */
    gets(buf);
    puts(buf);
}

# Assembly code for echo (IA-32)
echo:
    pushl  %ebp               # Save %ebp on stack
    movl   %esp, %ebp
    pushl  %ebx               # Save %ebx
    subl   $20, %esp          # Allocate 20 bytes on stack
    leal   -12(%ebp), %ebx    # Compute buf as %ebp-12
    movl   %ebx, (%esp)       # Store buf at top of stack
    call   gets               # Call gets
    movl   %ebx, (%esp)       # Store buf at top of stack
    call   puts               # Call puts
    addl   $20, %esp          # Deallocate stack space
    popl   %ebx               # Restore %ebx
    popl   %ebp               # Restore %ebp
    ret                       # Return

# Stack layout (higher addresses first)
    Stack frame for caller    ...
                              Return address
    Stack frame for echo      Saved %ebp
                              Saved %ebx
                              [7] [6] [5] [4]
                              [3] [2] [1] [0]   buf
```

Figure 3.31 Stack organization for echo function. Character array buf is just below part of the saved state. An out-of-bounds write to buf can corrupt the program state.

# Impact of ISA on system program: Memory overflow (2/2)

#### Stack protector

- ✓ Typical example: canary
- ✓ Included as default in modern gcc



Figure 3.33

Stack organization for echo function with stack protector enabled. A special "canary" value is positioned between array buf and the saved state. The code checks the canary value to determine whether or not the stack state has been corrupted.







#### Intel CPU History (1/9)

- **8080 (1974)**
  - ✓ 8bit register, 8bit bus, 64KB memory support
- **8086 (1978)**
  - ✓ 16bit register, 16bit data bus, 20bit address bus, 1st generation of x86 ISA (8088: 8bit data bus for backward compatibility, otherwise the same as 8086)
  - ✓ Segmentation (real addressing mode, 1MB memory support)
- **80286 (1982)**
  - ✓ 16bit, 24bit address bus
  - ✓ Segmentation (uses segment descriptors, 16MB memory support)
  - ✓ 4 privilege levels
- **80386 (1985)**
  - ✓ 32bit register and bus (80386 SX: 16bit bus for backward compatibility)
  - ✓ First 32bit addressing (4GB memory support)
  - ✓ Paging with a fixed 4-KByte page size



## Intel CPU History (2/9)

- **80486 (1989)** 
  - ✓ Pipelining support (5 stages of execution, introduce u-op)
  - ✓ Use L1 cache (keep recently used instruction, 8KB)
  - ✓ An integrated x87 FPU (no FPU in 486SX)
  - ✓ power saving support, system management mode for notebook (486SL)
- Pentium (1993, 5<sup>th</sup> generation)
  - ✓ 5-stage pipeline, Superscalar support (two pipelines (u and v), which allow at most two u-ops to execute in parallel per cycle)
  - ✓ L1 cache is divided into D-Cache and I-Cache, uses L2 cache, write-back protocol (MESI protocol)
  - ✓ Introduces branch prediction
  - ✓ APIC for multiple processors

Why not the 80586?

- Pentium with MMX Technology
  - ✓ Equipped with a multimedia accelerator
  - ✓ SIMD (Single Instruction Multiple Data): high performance for matrix processing (one of the big changes in x86 ISA, CISC flavor)



## Intel CPU History (3/9)

- P6 family (1995~1999, 6<sup>th</sup> generation )
  - ✓ P6 Microarchitecture: Dynamic execution
    - Out-of-order execution
    - Branch prediction
    - Speculative execution: decouple execution and commitment
    - Data flow analysis: detect independent instructions in real time
    - Register renaming
  - ✓ Pentium Pro
    - Three instructions per clock cycle (3-way superscalar), 256KB L2 cache
    - Even though its name is similar to Pentium, its internals are quite novel (e.g. it employs diverse RISC features such as the first out-of-order execution)
  - ✓ Pentium II
    - MMX enhancement, 16KB L1 cache, 1MB L2 cache
    - Multiple low power state (Autohalt, Stop-grant, sleep, deep sleep)
    - Pentium II Xeon: Premium Pentium II (for server, large cache and scalability)
    - Pentium II Celeron: for lower system cost (cost-optimized, no L2 cache or a small one)
  - ✓ Pentium III
    - SSE (Streaming SIMD Extension): 128bit register (XMM), FPU support,
       Multimedia specialized instructions (around 70), Coppermine, Tualatin, ...
    - Pentium III Xeon: Premium Pentium III



## Intel CPU History (4/9)

- Pentium 4 Processor Family (2000~2006, also release Itanium)
  - ✓ NetBurst microarchitecture
    - Deep pipelining (Hyper Pipelining: 20~31 stages u-op, expected up to 10GHz)
    - Wider design: Rapid Execution (ALU 2X), System Bus (4X)
    - Advanced Dynamic Execution
      - Deep, out-of-order execution engine, Enhanced branch prediction
    - New cache system (Advanced Trace Cache for decoded instructions)
  - ✓ Hyper-Threading: supports multithreading at the CPU level (duplicated architectural state, AS)
  - ✓ Pentium 4 with SSE2, SSE3
  - ✓ Pentium D (Smithfield, beginning of the dual core era)
  - ✓ 64-bit CPU (IA-64, x86-64)
  - ✓ Virtualization technology
  - ✓ Market Name
    - Pentium 4
      - Northwood, Prescott, Cedar Mill, Smithfield, Willamette, ...
    - Pentium M: low power, high performance mobile CPU
    - Intel Xeon Processor: Premium Pentium 4
      - 64-bit Xeon MP: 3.3GHz, 16KB L1, 1MB L2, 8MB L3
    - Intel Pentium Processor Extreme Edition (Gallatin)
      - For High performance PC






## Intel CPU History (5/9)

- Intel Core Processor Family (2006 ~)
  - ✓ Intel Core microarchitecture
    - NetBurst problem: high power consumption, pipeline inefficiency
    - Reengineering based on P6 Microarchitecture (14-stage pipeline)
    - Increased L2 cache (6MB), 4 way superscalar, combine u-ops
    - Native Dualcore: not just packaging two cores, but integrating them at the design stage (e.g. Advanced Smart Cache (L2 sharing), Enhanced prefetcher)
  - ✓ Marketing name: use Core, not Pentium
    - Core Solo/Duo (32 bit)
      - Yonah (laptop), actually based on P6 microarchitecture
    - Core 2 Solo/Duo/Quad (64 bit)
      - Merom, Penryn (laptop), Conroe, Kentsfield, Yorkfield (desktop), Woodcrest, Clovertown(Server)
      - Rapid development toward multiple cores



(source: http://motoc.tistory.com/)



### Intel CPU History (6/9)

- Intel Core i3/i5/i7 Family (2009 ~)
  - ✓ Nehalem microarchitecture (and its tick version, Westmere)
    - QuickPath Interconnect (to compete with AMD's HyperTransport, supports NUMA), IMC (Integrated Memory Controller), SMT, 45nm
    - Turbo mode, 256KB L2 cache/core, 12MB L3 cache, Intel Core 1st generation
  - ✓ Sandy Bridge, Haswell, Skylake, Sunny Cove, Raptor Cove u-architecture
    - Successor of Nehalem, <= 32 nm, Tick-Tock strategy
    - AVX (Advanced Vector extension, 256 bit SSE), Integrated GPU, DDR4, 10 nm
  - ✓ Marketing name: Core i3, i5, i7 (From mid-range (i3) to high-end (i7))
    - Lynnfield, Sandy bridge(Laptop), Gulftown, Sandy bridge-E(P) (Server),
       Arrandale, Sandy bridge-M (Mobile)









### Intel CPU History (7/9)

#### Intel tick-tock model

- ✓ Tick: innovations in manufacturing process technology
- ✓ Tock: innovations in processor microarchitecture

#### The Tick-Tock model through the years



(Source: http://www.intel.com/content/www/us/en/silicon-innovations/intel-tick-tock-model-general.html)



(Intel logos for Sandy Bridge, Haswell, Skylake, Sunny Cove. Source: http://namu.wiki)

## Intel CPU History (8/9)

#### Summary of Intel CPU microarchitecture

| Year | Microarchitecture                                      | Pipeline stages                         | Max clock [MHz] | Tech process [nm] |
|------|--------------------------------------------------------|-----------------------------------------|-----------------|-------------------|
| 1978 | 8086 (8086, 8088)                                      | 2                                       | 5               | 3000              |
| 1982 | 186 (80186, 80188)                                     | 2                                       | 25              | 3000              |
| 1982 | 286 (80286)                                            | 3                                       | 25              | 1500              |
| 1985 | 386 (80386)                                            | 3                                       | 33              | 1500              |
| 1989 | 486 (80486)                                            | 5                                       | 100             | 1000              |
| 1993 | P5 (Pentium)                                           | 5                                       | 200             | 800, 600, 350     |
| 1995 | P6 (Pentium Pro, Pentium II)                           | 14 (17 with load & store/retire)        | 450             | 500, 350, 250     |
| 1997 | P5 (Pentium MMX)                                       | 6                                       | 233             | 350               |
| 1999 | P6 (Pentium III)                                       | 12 (15 with load & store/retire)        | 1400            | 250, 180, 130     |
| 2000 | NetBurst (Pentium 4) (Willamette)                      | 20 unified with branch prediction       | 2000            | 180               |
| 2002 | NetBurst (Pentium 4) (Northwood, Gallatin)             | 20 unified with branch prediction       | 3466            | 130               |
| 2003 | Pentium M (Banias, Dothan), Enhanced Pentium M (Yonah) | 10 (12 with fetch/retire)               | 2333            | 130, 90, 65       |
| 2004 | NetBurst (Pentium 4) (Prescott)                        | 31 unified with branch prediction       | 3800            | 90                |
| 2006 | Intel Core                                             | 12 (14 with fetch/retire)               | 3000            | 65                |
| 2007 | Penryn (die shrink)                                    | 12 (14 with fetch/retire)               | 3333            | 45                |
| 2008 | Nehalem                                                | 20 unified (14 without misprediction)   | 3600            | 45                |
| 2008 | Bonnell                                                | 16 (20 with misprediction)              | 2100            | 45                |
| 2010 | Westmere (die shrink)                                  | 20 unified (14 without misprediction)   | 3730            | 32                |
| 2011 | Saltwell (die shrink)                                  | 16 (20 with misprediction)              | 2130            | 32                |
| 2011 | Sandy Bridge                                           | 14 (16 with fetch/retire)               | 4000            | 32                |
| 2012 | Ivy Bridge (die shrink)                                | 14 (16 with fetch/retire)               | 4100            | 22                |
| 2013 | Silvermont                                             | 14–17 (16–19 with fetch/retire)         | 2670            | 22                |
| 2013 | Haswell                                                | 14 (16 with fetch/retire)               | 4400            | 22                |
| 2014 | Broadwell (die shrink)                                 | 14 (16 with fetch/retire)               | 3700            | 14                |
| 2015 | Airmont (die shrink)                                   | 14–17 (16–19 with fetch/retire)         | 2640            | 14                |
| 2015 | Skylake                                                | 14 (16 with fetch/retire)               | 5200            | 14                |
| 2016 | Goldmont                                               | 20 unified with branch prediction       | 2600            | 14                |
| 2017 | Goldmont Plus                                          | 20 unified with branch prediction (?)   | 2800            | 14                |
| 2018 | Palm Cove                                              | 14 (16 with fetch/retire)               | 3200            | 10                |
| 2019 | Sunny Cove                                             | 14–20 (misprediction)                   | 4100            | 10                |
| 2020 | Tremont                                                | 20 unified                              | 3300            | 10                |
| 2020 | Willow Cove                                            | 14 unified                              | 5300            | 10                |
| 2021 | Cypress Cove                                           | 14 unified                              | 5300            | 14                |
| 2021 | Golden Cove                                            | 12 unified                              | 5500            | Intel 7           |
| 2021 | Gracemont                                              | 20 unified with misprediction penalty   | 4300            | Intel 7           |
| 2022 | Raptor Cove                                            | 12 unified                              | 6000            | Intel 7           |
| 2022 | Redwood Cove                                           |                                         |                 | Intel 4           |
| 2023 | Crestmont                                              |                                         |                 | Intel 4           |

(Source: en.wikipedia.org/wiki/List\_of\_Intel\_CPU\_microarchitectures)

#### Intel CPU History (9/9)

#### Summary of Intel CPU microarchitecture

- ✓ From https://en.wikipedia.org/wiki/List\_of\_Intel\_CPU\_microarchitectures
- ✓ Pre-P5: 1) 8086: first x86 processor, 2) 286: protected mode, 3) 386: 32-bit CPU, paging, 4) 486: FPU, pipeline, L1 cache
- ✓ P5: Advanced pipeline, Superscalar, MMX
- ✓ P6 (Pentium Pro, II, III): O3 (out-of-order execution), SSE (quite novel)
- ✓ NetBurst (Pentium 4, Xeon): Deep pipeline
- ✓ Core (Core, Xeon): Mar. 2006, reengineered P6-based microarchitecture, 65nm, Multicore (Tick → Penryn: 45nm)
- ✓ Nehalem (i3, i5, i7): 2008, 45nm, Integrated Memory Controller, QPI (Tick → Westmere: 32nm)
- ✓ Sandy Bridge: 2011, 32nm, AVX, HW support for video encoding and decoding, encryption instruction set (Tick → Ivy Bridge: 22nm)
- ✓ Haswell: 2013, 22nm, Integrated GPU, advanced power-saving (Tick → Broadwell: 14nm)
- ✓ Skylake: 2015, 14nm, DDR4 (64GB), PCI-e 3.0 (20 lanes) (Optimization → Kaby Lake, Tick → Cannon Lake, 2018)
- ✓ Sunny Cove (Ice Lake): 2019, 10nm (Optimization → Raptor Cove (2022), Redwood Cove (2023)), HW accelerators for security and AI features
  - → Rival: AMD (x86), ARM (Atom), Nvidia (Haswell), Samsung (IMC), ...

### Technologies of Intel CPU (1/12)

#### What does a processor do?

| Instruction type      | Dynamic usage |
|-----------------------|---------------|
| Data movement         | 43%           |
| Control flow          | 23%           |
| Arithmetic operations | 15%           |
| Comparisons           | 13%           |
| Logic operations      | 5%            |
| Other                 | 1%            |

- ✓ Data movement needs to be optimized
  - → CPU cache, write buffer
- ✓ Some components are idle while executing instruction
  - → Pipelining
  - → Superscalar



### Technologies of Intel CPU (2/12)

#### Pipeline

- Execution of an instruction is divided into multiple stages
- Overlapping execution of multiple instructions



## Technologies of Intel CPU (3/12)

- For the efficiency of Pipelining (no free lunch)
  - ✓ All instructions should have similar execution time (simple format)
    - RISC (addl a, b vs. movl a, %eax; addl b, %eax; movl %eax, b)
  - ✓ CPU components are independent of each other → I/D cache
  - ✓ No resource conflict (sharing at the same time) → dual component
  - ✓ Overcome pipeline hazards (data, control)







## Technologies of Intel CPU (4/12)

- Techniques to overcome pipeline hazards
  - ✓ Compiler optimization
    - Instruction reordering
    - Loop unrolling
  - ✓ Branch prediction
    - Static prediction
    - Dynamic prediction
  - ✓ Out of order execution
    - Dynamic reordering with data flow analysis
  - ✓ Speculative execution and retirement
  - ✓ Register renaming



## Technologies of Intel CPU (5/12)

#### P6 microarchitecture revisit

- ✓ Dynamic execution
  - Out-of-order execution
  - Branch prediction
  - Speculative execution: decouple execution and commitment (retirement unit)
  - Data flow analysis: detect independent instructions in real time
  - Register renaming
- ✓ Pipelined (12 stage) architecture, 3-way superscalar
- ✓ L1 cache and L2 cache



Figure 2-1. The P6 Processor Microarchitecture with Advanced Transfer Cache Enhancement



# Technologies of Intel CPU (6/12)

#### Moore's law

#### Moore's Law - The number of transistors on integrated circuit chips (1971-2018)



Moore's law describes the empirical regularity that the number of transistors on integrated circuits doubles approximately every two years. This advancement is important as other aspects of technological progress – such as processing speed or the price of electronic products – are linked to Moore's law.



Data source: Wikipedia (https://en.wikipedia.org/wiki/Transistor\_count)
The data visualization is available at OurWorldinData.org. There you find more visualizations and research on this topic.

Licensed under CC-BY-SA by the author Max Roser.

(Source: https://en.wikipedia.org/wiki/Moore%27s\_law)



# Technologies of Intel CPU (7/12)

#### Trend

- ✓ Increasing available transistors: multi components, multi channels
- ✓ Superscalar
- ✓ Multimedia support: SIMD
  - MMX technology
  - SSE
  - SSE2/3, AVX
- ✓ Hyper threading
- ✓ 64-bit support
  - IA64 (EPIC)
  - Intel 64
- ✓ Multicore
- ✓ Virtualization



Intel Core 2 Architecture

(From http://en.wikipedia.org/wiki/File:Intel\_Core2\_arch.svg)

# Technologies of Intel CPU (8/12)

#### SIMD instructions

- ✓ One instruction operates on a group of data elements in parallel
- ✓ Using MMX (64bit), XMM (128bit), YMM (256bit) registers
- ✓ MMX
  - integer
- ✓ SSE (Pentium 3)
  - Streaming SIMD Extension
  - Single precision floating point
- ✓ SSE2 (Pentium 4)
  - Double precision floating point
- ✓ SSE3 (Pentium 4)
  - HT support
  - 13 new SIMD instructions
- ✓ AVX (Sandy Bridge)
  - Advanced Vector Extension
  - From Sandy Bridge, 256 bit (YMM)



Figure 2-4. SIMD Extensions, Register Layouts, and Data Types



## Technologies of Intel CPU (9/12)

- Hyper threading Technology
  - ✓ Support multi-threading at CPU level
  - ✓ 2 or more separate code streams using shared execution resources



Figure 2-5. Comparison of an IA-32 Processor Supporting Hyper-Threading Technology and a Traditional Dual Processor System



# Technologies of Intel CPU (10/12)

### Multi core Technology

- ✓ Intel Pentium D: dual core based on two Pentium 4 (without HT)
- ✓ Intel Core Duo, Core 2 Duo: dual core with shared bus interface (dual core performance with low cost)
- ✓ Intel Core 2 Quad Processor: Duplicated Core Duo, Core 2 Duo
  - Extreme edition: multi-core with multi architectural states (with HT)
- ✓ Intel Core i7: Quick Path Interconnect, L3, IMC,



Figure 2-6. Intel 64 and IA-32 Processors that Support Dual-Core

Figure 2-7. Intel 64 Processors that Support Quad-Core

Figure 2-8. Intel Core i7 Processor

# Technologies of Intel CPU (11/12)

### Intel 64

- ✓ Support 64bit address extension: EM64T (Extended Memory 64 Technology), x86-64, IA-32e
- ✓ new operation modes
- ✓ new/enhanced register sets
- ✓ new/enhanced instruction sets
- ✓ 64bit address translation



Figure 4-8. Linear-Address Translation to a 4-KByte Page using IA-32e Paging



| Software Visible<br>Register | 64-Bit Mode                                         |        |             | Legacy and Compatibility Modes               |        |             |
|------------------------------|-----------------------------------------------------|--------|-------------|----------------------------------------------|--------|-------------|
|                              | Name                                                | Number | Size (bits) | Name                                         | Number | Size (bits) |
| General Purpose<br>Registers | RAX, RBX, RCX,<br>RDX, RBP, RSI,<br>RDI, RSP, R8-15 | 16     | 64          | EAX, EBX, ECX,<br>EDX, EBP, ESI,<br>EDI, ESP | 8      | 32          |
| Instruction Pointer          | RIP                                                 | 1      | 64          | EIP                                          | 1      | 32          |
| Flags                        | EFLAGS                                              | 1      | 32          | EFLAGS                                       | 1      | 32          |
| FP Registers                 | ST0-7                                               | 8      | 80          | ST0-7                                        | 8      | 80          |
| Multi-Media<br>Registers     | MM0-7                                               | 8      | 64          | MM0-7                                        | 8      | 64          |
| Streaming SIMD<br>Registers  | XMM0-15                                             | 16     | 128         | XMM0-7                                       | 8      | 128         |
| Stack Width                  |                                                     |        | 64          | -                                            |        | 16 or 32    |



# Technologies of Intel CPU (12/12)

- VT (Virtualization Technology)
  - ✓ VMX (Virtual Machine Extension)
    - Direct execution
    - New privilege level







### **CPU** information in Linux

### lscpu

```
choijm@embedded:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):
On-line CPU(s) list:   0,1
Thread(s) per core:
Core(s) per socket:
Socket(s):
NUMA node(s):
Vendor ID:             GenuineIntel
CPU family:
Model:
Model name:            Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz
Stepping:
CPU MHz:               2933.000
CPU max MHz:           2933.0000
CPU min MHz:           1600.0000
BogoMIPS:              5852.10
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              3072K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm pti retpoline tpr_shadow vnmi flexpriority dtherm
choijm@embedded:~$
```

```
[root@prism81 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:
Core(s) per socket:
Socket(s):
NUMA node(s):
Vendor ID:             GenuineIntel
CPU family:
Model:
Stepping:
CPU MHz:               2400.043
BogoMIPS:              4799.30
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
```

## x86-64: extending IA-32 to 64-bit CPU (1/4)

- Brief history from IA-32 (x86) to Intel 64 (x86-64)
  - ✓ Intel traditional ISA: called IA-32
    - Started in 1985 (80386)
    - Evolution: adds new instructions (e.g. conditional move) while keeping backward compatibility
  - ✓ New Intel ISA for 64-bit CPU: called IA-64
    - A totally new ISA called EPIC (Explicitly Parallel Instruction Computing)
       → MIMD
    - Market name: Itanium (2001)
  - ✓ AMD ISA for 64-bit CPU
    - Compatible with IA-32 → won in the market
    - Intel followed: Intel 64 (this is why the SW developer manual is named Intel 64 and IA-32 ...)
    - AMD renamed it AMD64 (but x86-64 "persists as a favored name")



## x86-64: extending IA-32 to 64-bit CPU (2/4)

#### Features of x86-64

- ✓ New size for some data types
  - E.g. Pointer becomes 8 bytes
- ✓ Make use of RISC techniques
  - 8 GPR → 16 GPR
  - Register based argument passing
- ✓ 2<sup>64</sup> address space (2<sup>48</sup> in practice)
- ✓ Backward compatible
  - Can run existing SW in compatibility mode

| C declaration | Intel data type    | Assembly code suffix | x86-64 size (bytes) | IA32 size (bytes) |
|---------------|--------------------|----------------------|---------------------|-------------------|
| char          | Byte               | b                    | 1                   | 1                 |
| short         | Word               | w                    | 2                   | 2                 |
| int           | Double word        | l                    | 4                   | 4                 |
| long int      | Quad word          | q                    | 8                   | 4                 |
| long long int | Quad word          | q                    | 8                   | 8                 |
| char *        | Quad word          | q                    | 8                   | 4                 |
| float         | Single precision   | s                    | 4                   | 4                 |
| double        | Double precision   | d                    | 8                   | 8                 |
| long double   | Extended precision | t                    | 10/16               | 10/12             |

Figure 3.34 Sizes of standard data types with x86-64. These are compared to the sizes for IA32. Both long integers and pointers require 8 bytes, as compared to 4 for IA32.



Figure 3.35 Integer registers. The existing eight registers are extended to 64-bit versions, and eight new registers are added. Each register can be accessed as either 8 bits (byte), 16 bits (word), 32 bits (double word), or 64 bits (quad word).

## x86-64: extending IA-32 to 64-bit CPU (3/4)

### Assembly code example1

- ✓ Syntax: 1) rax instead of eax, 2) movq instead of movl, 3) argument passing using registers, 4) no stack frame if possible, 5) use PIC (Position Independent Code) based on rip (rip-relative address), ...
  - Register passing → the IA32 version makes 7 memory references, the x86-64 version only 3

```
long int simple_l(long int *xp, long int y)
{
    long int t = *xp + y;
    *xp = t;
    return t;
}

IA32 implementation of function simple_l.
xp at %ebp+8, y at %ebp+12
simple_l:
  pushl  %ebp              Save frame pointer          (W)
  movl   %esp, %ebp        Create new frame pointer
  movl   8(%ebp), %edx     Retrieve xp                 (R)
  movl   12(%ebp), %eax    Retrieve y                  (R)
  addl   (%edx), %eax      Add *xp to get t            (R)
  movl   %eax, (%edx)      Store t at xp               (W)
  popl   %ebp              Restore frame pointer       (R)
  ret                      Return                      (R)

x86-64 version of function simple_l.
xp in %rdi, y in %rsi
simple_l:
  movq   %rsi, %rax        Copy y
  addq   (%rdi), %rax      Add *xp to get t            (R)
  movq   %rax, (%rdi)      Store t at xp               (W)
  ret                      Return                      (R)
```

## x86-64: extending IA-32 to 64-bit CPU (4/4)

- Assembly code example2
  - ✓ Recent gcc using PIC
  - ✓ 1) using GOT (Global Offset Table), 2) using rip-relative addressing

```
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ more test.c
#include <stdio.h>
int a = 10;
int b = 20;
int c;
int main()
{
       c = a + b;
       printf("C = %d\n", c);
}
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ gcc -S -o test64.s test.c -m64
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ gcc -S -o test32.s test.c -m32
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.3.0-10ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$
```

```
        .comm   c,4,4
        .section        .rodata
.LC0:
        .string "C = %d\n"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        endbr32
        leal    4(%esp), %ecx
        .cfi_def_cfa 1, 0
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        .cfi_escape 0x10,0x5,0x2,0x75,0
        movl    %esp, %ebp
        pushl   %ebx
        pushl   %ecx
        .cfi_escape 0xf,0x3,0x75,0x78,0x6
        .cfi_escape 0x10,0x3,0x2,0x75,0x7c
        call    __x86.get_pc_thunk.ax
        addl    $_GLOBAL_OFFSET_TABLE_, %eax
        movl    a@GOTOFF(%eax), %ecx
        movl    b@GOTOFF(%eax), %edx
        addl    %edx, %ecx
        movl    c@GOT(%eax), %edx
        movl    %ecx, (%edx)
        movl    c@GOT(%eax), %edx
        movl    (%edx), %edx
        subl    $8, %esp
        pushl   %edx
        leal    .LC0@GOTOFF(%eax), %edx
        pushl   %edx
        movl    %eax, %ebx
        call    printf@PLT
        addl    $16, %esp
        movl    $0, %eax
        leal    -8(%ebp), %esp
        popl    %ecx
        .cfi_restore 1
        .cfi_def_cfa 1, 0
        popl    %ebx
"test32.s" line 59
```

```
        .align 4
        .type   a, @object
        .size   a, 4
a:
        .long   10
        .globl  b
        .align 4
        .type   b, @object
        .size   b, 4
b:
        .long   20
        .comm   c,4,4
        .section        .rodata
.LC0:
        .string "C = %d\n"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        endbr64
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    a(%rip), %edx
        movl    b(%rip), %eax
        addl    %edx, %eax
        movl    %eax, c(%rip)
        movl    c(%rip), %eax
        movl    %eax, %esi
        leaq    .LC0(%rip), %rdi
        movl    $0, %eax
        call    printf@PLT
        movl    $0, %eax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
"test64.s" line 47
```

## **Summary**

- Discuss the issues of ISA
- Grasp several operand addressing modes
- Understand how context switch works, memory alignment, ...
- Apprehend the technologies of IA
  - ✓ Pipelining
  - ✓ Dynamic execution
  - ✓ Cache (L1, L2, L3)
  - ✓ Superscalar
  - ✓ MMX
  - ✓ Hyper-threading
  - ✓ Multi core
  - ✓ Intel 64
  - ✓ Virtualization Technology





### Quiz for this Lecture

#### Quiz

- ✓ 1. Explain the differences between "movl \$array, %ebx" and "movl array, %ebx" in operand addressing modes.
- ✓ 2. Assume that a student reads three books (A, B, and C) in a library, reading each book for 10 minutes before switching to the next one. Explain the context save and context restore in this scenario.
- ✓ 3. Explain the key techniques of dynamic execution in the Intel P6 microarchitecture (5 techniques).
- ✓ 4. Discuss which pipeline hazard can occur in the center figure below (from LN6) and how to overcome it.
- ✓ 5. What are the Spectre (or Meltdown) vulnerabilities? Explain them using the Intel technologies covered in this LN.
- ✓ 6. Discuss at least three differences between x86 (32-bit) and x86-64 (64-bit) assembly code.







## **Appendix**

### Stack randomization

- ✓ To disable ASLR: echo 0 > /proc/sys/kernel/randomize\_va\_space
- ✓ To disable stack protector: -fno-stack-protector

```
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ cat stack_struct.c
/* stack_struct.c: stack structure analysis, by choijm. choijm@dku.edu */
#include <stdio.h>

int func2(int x, int y) {
        int f2_local1 = 21, f2_local2 = 22;
        int *pointer;

        printf("func2 local: \t%p, \t%p, \t%p\n", &f2_local1, &f2_local2, &pointer);
        pointer = &f2_local1;
        printf("\t%p \t%d\n", (pointer), *(pointer));
        printf("\t%p \t%d\n", (pointer-1), *(pointer-1));
        printf("\t%p \t%d\n", (pointer+3), *(pointer+3));
        printf("\t%p \t%d\n", (pointer+4), *(pointer+4));
        printf("\t%p \t%d\n", (pointer+5), *(pointer+5));       // new
        printf("\t%p \t%d\n", (pointer+6), *(pointer+6));       // new
        *(pointer+4) = 333;
        printf("\ty = %d\n", y);
        return 222;
}

void func1() {
        int ret_val, f1_local1 = 11, f1_local2 = 12;
        ret_val = func2(111, 112);
}

int main() {
        func1();
}
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ gcc stack_struct.c -m32
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ ./a.out
func2 local:    0xff9df5f0,     0xff9df5f4,     0xff9df5f8
        0xff9df5f0
        0xff9df5ec
        0xff9df5fc      1935993600
        0xff9df600      -135065600
        0xff9df604
        0xff9df608      -6425032
        y = 112
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ ./a.out
func2 local:    0xff932970,     0xff932974,     0xff932978
        0xff932970
        0xff93296c      -7132808
        0xff93297c      -1763943680
        0xff932980      -135049216
        0xff932988      -7132744
        y = 112
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$

choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ vi stack_struct.c
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ cat /proc/sys/kernel/randomize_va_space
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ ./a.out
func2 local:    0xffffd3f0,     0xffffd3f4,     0xffffd3f8
        0xffffd3f0
        0xffffd3ec      -11272
        0xffffd3fc      472302336
        0xffffd400      -134520832
        0xffffd404
        0xffffd408      -11208
        y = 112
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ ./a.out
func2 local:    0xffffd3f0,     0xffffd3f4,     0xffffd3f8
        0xffffd3f0
        0xffffd3ec      -11272
        0xffffd3fc      967315200
        0xffffd400      -134520832
        0xffffd404
        0xffffd408      -11208
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ gcc -fno-stack-protector stack_struct.c -m32
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ ./a.out
func2 local:    0xffffd3fc,     0xffffd3f8,     0xffffd3f4
        0xffffd3fc
        0xffffd3f8
        0xffffd408      -11208
        0xffffd40c      1448436529
        0xffffd410      111
        0xffffd414      112
        y = 112
Segmentation fault
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$ gcc --version
gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
choijm@LAPTOP-LR5HOQBH:~/Syspro/LN4$
```