计算机体系结构-02

MIPS, Pipeline and Instruction Scheduling

Posted by Yvan on November 21, 2020

目录

  Folien Chapter Video
MIPS64 02 Appendix A and K 1.2.1 - 1.2.4
Pipelining & Harzard 03 Appendix C.1 1.3.1 - 1.3.10
Multicycle Operations & Harzard 03 Appendix C.5 1.4.1 - 1.4.2
MIPS R4000 03 Appendix C.6 1.4.3
Scheduling 04 Chapter 3.2 1.4.4 - 1.4.7

MIPS64

Folien 02 / Appendix A and K / Video 1.2.1 - 1.2.4

  • simple load-store instruction set
    • Only load/store instructions access memory
    • To process data, need to load it into register first
  • designed for pipelining efficiency
    • Fixed instruction set encoding
  • efficient instruction set for compiler to target

Registers

  • has 32 64-bit general-purpose registers (GPRs) R0 - R31
    • R0 is a hardwired 0
  • has 32 64-bit floating-point registers (FPRs) F0 - F31
    • to hold a 32-bit single precision floating point or 64-bit double precision floating point
    • when used as single precision float, half of FPR is unused
  • few special registers
    • hi and lo:
      • eg. for multiply: n bit * n bit = 2n bit which can not be placed in a n bit register
    • floating-point status register
    • exception program counter

Data Types

  • Integer:

    • 8-bit: bytes (char)

    • 16-bit: half words (short, Unicode)

    • 32-bit: words (only in mips)

    • 64-bit: double words

      • 32-bit to 64-bit:

        to increase the amount of directly addressable memory

  • Operations process 64-bit integers

    • 如果使用bytes等小于64位的integer,需要用zero extended或sign extended转换成64位integer

      • zero extended:

        free positions are filled with zeros

      • sign extended:

        a sign bit will be copied into the free position

  • Floating-point

    • 32-bit single-precision (float)
    • 64-bit double-precision (double)

Addressing Modes

image-20201121220036543

  • Mode bit specifies:

    • Big Endian

    • Little Endian

    • 图中为32bits整数

      Big-Endian.svg

Instruction Formats

  • 32 bits long

  • R type: <Register direct>

  • I type: <Immediate>

    • sign extended
  • J type:

    • 26 bits instruction address,地址是64位
    • 怎么从26变成64?
    • When a J-type instruction is executed, a full target address is formed by concatenating [1] the high order bits of the PC (the address of the instruction following the jump), the [2] 26 bits of the target field, and [3] 2 zero bits.
    • [2] 26 bits of the target field
    • [3] 2 zero bits:因为地址/指令地址是对齐的(4Bytes 32bits: 比如地址结尾是 …000 的后一条指令地址是 …100 , 000同时占用了001,010,011 四格) 最后2bits可以省略 (相当于除以4/右移2位)
    • [1] high order bits of the PC: 指令只接受64位输入,也就是说28-63位的值由当前PC提供 (即 $(PC+4)_{63..28}$)

    image-20201121221733330

MIPS64 长度
整数寄存器 64 bits
浮点数寄存器 64 bits
指令 32 bits
指令opcode 6 bits
寄存器数量 32个 = 5 bits = 2的5次方:
1. 两种寄存器各32个,2.R/I指令的rs/rt/rd部分都是5位长的
地址 64 bits (一个寄存器能存下的长度)
* J指令地址部分 26 bits (32bits的指令-6bits的opcode) 但地址是64bits的,
1. 26位的地址补两个0变成28位(0-27),因为地址是4byte对齐的所以补两个0没关系
2. 28位的地址加上当前PC提供第28-63位,补充成64位(0-64)
各种数据 8 bits - 64 bits
bytes / Char 8 bits
half words / Short / Unicode 16 bits
words 32 bits
double words 64 bits
single-precision floating-point / Float 32 bits
double-precision floating-point / Double 64 bits
指令操作时处理的数据 64 bits (不足的需要补成64位长度 比如8bit的char在操作中会补充成64bit)

Basic Code Optimization

  • Eliminating redundant branches
  • Eliminating induction variables
  • Loop unrolling

Pipelining

Classical 5-Stage Pipeline

Folien 03 / Appendix C.1 / Video 1.3.1 - 1.3.10

  • IF: Instruction fetch
    • Fetch instruction using PC
    • PC=PC+4 *(4×8=32 每条指令的长度)
  • ID: Instruction decode
    • decode
    • read rs and rt fields in each instruction
    • read IMM as sign extended 64bit Integer
    • pass destination register rd and the next sequential PC to ID/EX pipeline register
  • EX: Execute
    • ALU
  • MEM: Memory access
    • Load + Store
  • WB: Write-back cycle
    • Write to register

Advantage

  • all MIPS instructions same length
  • few instruction formats, source registers always in same place
  • only loads and stores access memory (MIPS is load/store architecture)

Hazard

  1. structural hazards: HW cannot support parallel execution of instructions
    • 1 Mem w/ 1 read/write port
      • IF,MEM can not be in a same Clock Cycle
  2. data hazards: instr. depends on result of prior instr. still in the pipeline
    • Reg-Reg的矛盾 第一条指令的写Reg在CC5,第二条指令的读Reg在CC3
      • Forwarding / Bypassing
        • Using Pipeline Register (不需要写不需要读 第一条ALU计算好的直接进第二条ALU)
        • write in the first half Clock Cycle, read in the second half Clock Cycle
    • **load use data hazard ** 在forwarding基础上 第一条指令load在MEM结束后读完CC4后半期,第二条指令需要ALU在CC4前半期
      • Pipeline Interlock
        • Bubble:空几个CC
      • Scheduling
  3. control(/branch) hazards: delay between fetching control flow instruction (branch or jump) and actual jump
    • 分支指令在ALU比较完 决定是否跳转 找到跳转地址前有3个CC的空隙(penalty)
    • Solution
      • Determine Branch taken / not taken sooner
      • Compute Branch Target address earlier
    • MIPS solution
      • Move branch condition evaluation to ID stage
      • Calculate branch target address in ID stage
      • Always execute instruction after the branch (branch delay slot)
    • 这样做以后只有1个CC penalty 这个penalty 用一个指令来填满 这个指令一定会被运行(不能对之后有影响)
      • From before branch instruction
      • From target address: only valuable when branch taken
      • From fall through: only valuable when branch not taken
      • 都没有则用nop:dadd r0,r0,r0

image-20201130122320691


Multicycle Operations

Folien 03 / Appendix C.5 / Video 1.4.1 - 1.4.2

Certain instructions (e.g., mul, div, floating-point operations) do not execute in one cycle

for example:

image-20201130121957140

image-20201130115251476image-20201130115351335

Hazard

  1. Structural Hazard
    • Divider not pipelined (Division只能全部做完再开始下一个)
      • stall and scheduling
    • Varying latencies: write at the same cycle
      • stall
      • increase write ports
    • image-20201130122948379
  2. WAW(write-after-write) Hazard
    • do not always write back in order
      • busy bit
      • scheduling
    • image-20201130122923210
  3. instruction complete out of order
  4. RAW(read-after-write) Hazard
    • this instruction depends on the instructions before
      • stall
    • image-20201130122902595
  5. Complexity of forwarding logic increases

MIPS R4000

Folien 03 / Appendix C.6 / Video 1.4.3

image-20201130122030080


Scheduling

Folien 04 / Chapter 3.2 / Video 1.4.4 - 1.4.7

to solve the stall cycles in Multicycle Operations RAW Hazard

  1. Swap the position of the instructions

    to use unrelated instruction

    • to fill the stall
    • to fill the nop after branch instruction
  2. Loop Unrolling

    image-20201130124732515

  3. register renaming

  • moving all loads before stores
  • moving two SDs after DADDI and BNE
  • using different registers

Compiler Steps

  • Eliminate extra test and branch instructions and adjust loop termination and iteration code
  • Determine that it is possible to move S.D after DADDI and BNE, and find amount to adjust S.D offset
  • Determine that loop unrolling would be useful by finding that loop iterations were independent
  • Rename registers to avoid name dependencies
  • Determine loads and stores in unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
    • requires analyzing memory addresses and finding that they do not refer to the same address.
  • Schedule the code, preserving any dependences needed to yield same result as original code

Advantages and Disadvantages

+improves potential for instruction scheduling

+reduces loop overhead

-increases code size

-shortfall in registers (register pressure)