## Instruction Scheduling - Part 3

#### Y.N. Srikant

Department of Computer Science Indian Institute of Science Bangalore 560 012

#### NPTEL Course on Compiler Design

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のQ@

- Reordering of instructions so as to keep the pipelines of functional units full with no stalls
- NP-Complete and needs heuristcs
- Applied on basic blocks (local)
- Global scheduling requires elongation of basic blocks (super-blocks)

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のので

- This is useful for the evaluation of instruction scheduling heuristics that do not generate optimal schedules
- Careful implementation may enable these methods to be deployed even in production quality compilers
- Assume a simple resource model in which all the functional units are fully pipelined
- Assume an architecture with integer ALU, FP add unit, FP mult/div unit, and load/store unit with possibly differing execution latencies
- Assume that there are R<sub>r</sub> instances of the functional unit r

・ロット (雪) (日) (日) (日)

- Let  $\sigma_i$  be the time at which instruction i is scheduled
- Let  $d_{(i,j)}$  be the weight of the edge (i, j) of the DAG
- To satisfy dependence constraints, for each arc (*i*, *j*) of the DAG

$$\sigma_j \geq \sigma_i + d_{(i,j)} \tag{1}$$

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のので

- A matrix K<sub>n×T</sub>, where n is the number of instructions in the DAG and T is an estimate of the worst case execution time of the schedule, is used
  - *T* can be estimated by summing up the execution times of all the instructions in the DAG
- *K*[*i*, *t*] is 1, if instruction *i* is scheduled at time step *t* and 0 otherwise

• The schedule time  $\sigma_i$  of instruction *i* can be expressed as

$$\sigma_i = k_{i,0} \cdot \mathbf{0} + k_{i,1} \cdot \mathbf{1} + \cdots + k_{i,T-1} \cdot (T-1)$$

where exactly one of the  $k_{i,j}$  is 1

• This can be written in matrix form for all  $\sigma_i$ 's as:

$$\begin{bmatrix} \sigma_{0} \\ \sigma_{1} \\ \vdots \\ \sigma_{n-1} \end{bmatrix} = \begin{bmatrix} k_{0,0} & k_{0,1}, & \cdots & k_{0,T-1} \\ k_{1,0} & k_{1,1} & \cdots & k_{1,T-1} \\ \vdots & \vdots & \vdots & \vdots \\ k_{n-1,0} & k_{n-1,1} & \cdots & k_{n-1,T-1} \end{bmatrix} * \begin{bmatrix} 0 \\ 1 \\ \vdots \\ T-1 \end{bmatrix}$$
(2)

• To express that each instruction is scheduled exactly once, we include the constraint

$$\sum_{t} k_{i,t} = 1, \quad \forall i \tag{3}$$

• The resource constraint that no more than *R<sub>r</sub>* instructions are scheduled in any time step can be expressed as

$$\sum_{i \in F(r)} k_{i,t} \le R_r, \quad \forall \ t \ \text{ and } \forall \ r$$
(4)

・ロット (雪) (日) (日) (日)

where F(r) represents the set of instructions that can be executed in functional unit type *r*.

 The objective function is to minimize the execution time or schedule length, subject to the constraints in equations 1-4 above. This can be represented as:

minimize(
$$\max_{i}(\sigma_{i} + d_{(i,j)})$$
)

## Delayed Load Scheduling Algorithm for Trees

- RISC load/store architecture with delayed loads
- Single cycle issue/execution, with only loads pipelined (load delay = 1 cycle)
- Generates optimal code without any interlocks for expression trees
- Three phases
  - Computation of *minReg* as in Sethi-Ullman code generation algorithm
  - Ordering of loads and operations as in the SU algorithm
  - Emitting code in canonical DLS order
- Uses 1 + minReg number of registers and can handle only one cycle load delay

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のQ@

```
if (isLeaf(node)) then {node.minReg = 1}
else
```

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のので

### Sethi-Ullman minReg Computation Example

| i1:  | t1 | $\leftarrow$ | load a  |
|------|----|--------------|---------|
| i2:  | t2 | $\leftarrow$ | load b  |
| i3:  | t3 | $\leftarrow$ | t1 + t2 |
| i4:  | t4 | $\leftarrow$ | load c  |
| i5:  | t5 | $\leftarrow$ | load a  |
| i6:  | t6 | $\leftarrow$ | load b  |
| i7:  | t7 | $\leftarrow$ | t5 + t6 |
| i8:  | t8 | $\leftarrow$ | t7 - t4 |
| i9:  | t9 | $\leftarrow$ | t3 * t8 |
| i10: | d  | $\leftarrow$ | st t9   |

(a) 3-Address Code



(b) Expression Tree

・ ロ ト ・ 雪 ト ・ 目 ト ・ 日 ト

ъ

### Sethi-Ullman Algorithm Code Gen Example

| i1:  | r1 | $\leftarrow$ | load a  |
|------|----|--------------|---------|
| i2:  | r2 | $\leftarrow$ | load b  |
| i3:  | r1 | $\leftarrow$ | r1 + r2 |
| i4:  | r2 | $\leftarrow$ | load c  |
| i5:  | r3 | $\leftarrow$ | load a  |
| i6:  | r4 | $\leftarrow$ | load b  |
| i7:  | r3 | $\leftarrow$ | r3 + r4 |
| i8:  | r2 | $\leftarrow$ | r3 - r2 |
| i9:  | r1 | $\leftarrow$ | r1 * r2 |
| i10: | d  | $\leftarrow$ | st r1   |
|      |    |              |         |

(a) Code Sequence using 4 Registers

| i5:  | r1 | $\leftarrow$ | load a  |
|------|----|--------------|---------|
| i6:  | r2 | $\leftarrow$ | load b  |
| i7:  | r1 | $\leftarrow$ | r1 + r2 |
| i4:  | r2 | $\leftarrow$ | load c  |
| i8:  | r1 | $\leftarrow$ | r1 - r2 |
| i1:  | r2 | $\leftarrow$ | load a  |
| i2:  | r3 | $\leftarrow$ | load b  |
| i3:  | r2 | $\leftarrow$ | r2 + r3 |
| i9:  | r1 | $\leftarrow$ | r1 * r2 |
| i10: | d  | $\leftarrow$ | st r1   |
|      |    |              |         |

(b) Optimal Code Sequence with 3 Registers

(日)

### **DLS** Computation Example

| i5:  | r1 | $\leftarrow$ | load a  |           |
|------|----|--------------|---------|-----------|
| i6:  | r2 | $\leftarrow$ | load b  |           |
| i7:  | r1 | $\leftarrow$ | r1 + r2 | % 1 stall |
| i4:  | r2 | $\leftarrow$ | load c  |           |
| i8:  | r1 | $\leftarrow$ | r1 - r2 | % 1 stall |
| i1:  | r2 | $\leftarrow$ | load a  |           |
| i2:  | r3 | $\leftarrow$ | load b  |           |
| i3:  | r2 | $\leftarrow$ | r2 + r3 | % 1 stall |
| i9:  | r1 | $\leftarrow$ | r1 * r2 |           |
| i10: | d  | $\leftarrow$ | st r1   |           |

(a) Stalls in Sethi-Ullman Sequence

| i5:  | r1 | $\leftarrow$ | load a  |
|------|----|--------------|---------|
| i6:  | r2 | $\leftarrow$ | load b  |
| i4:  | r3 | $\leftarrow$ | load c  |
| i1:  | r4 | $\leftarrow$ | load a  |
| i7:  | r1 | $\leftarrow$ | r1 + r2 |
| i2:  | r2 | $\leftarrow$ | load b  |
| i8:  | r1 | $\leftarrow$ | r1 - r3 |
| i3:  | r4 | $\leftarrow$ | r4 + r2 |
| i9:  | r1 | $\leftarrow$ | r1 * r4 |
| i10: | d  | $\leftarrow$ | st r1   |

(b) DLS Sequence with No Stalls

・ ロ ト ・ 雪 ト ・ 目 ト

Procedure Generate(root: ExprNode)
{ label(root); //Calculate minReg values
 opSched = loadSched = emptyList(); //Initialize
 order(root, opSched, loadSched);
 //Find load and operation order
 schedule(opSched, loadSched, root.minReg+1);
 //Emit canonical order
}

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のので

```
Procedure Order(root: ExprNode;
                var opSched, loadSched: NodeList)
{ if (not(isLeaf(root))
    { if (root.left.minReg < root.right.minReg)</pre>
        { order(root.right, opSched, loadSched);
          order(root.left, opSched, loadSched);
        } else
           {order(root.left, opSched, loadSched);
            order(root.right, opSched, loadSched);
        append(root, opSched);
  else { append(root, loadSched);
```

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のので

## **DLS Algorithm - Scheduling**

Procedure schedule(opSched, loadSched: NodeList; Reqs: integer) { for i = 1 to MIN(Regs, length(loadSched)) do // loads first { ld = popHead(loadSched); ld.reg = getReg(); gen(Load, ld.name, ld.Reg)} while (not Empty(loadSched)) // (Operation,Load) pairs next { op = popHead(opSched); op.reg = op.left.reg; gen(op.op, op.left.reg, op.right.reg, op.reg); ld = popHead(loadSched); ld.reg = op.right.reg; gen(Load, ld.name, ld.reg) } while (not Empty(opSched)) //Remaining Operations { op = popHead(opSched); op.reg = op.left.reg; gen(op.op, op.left.reg, op.right.reg, op.reg); freeReg(op.right.reg) } ◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のQ@

- Average size of a basic block is quite small (5 to 20 instructions)
  - Effectiveness of instruction scheduling is limited
  - This is a serious concern in architectures supporting greater ILP
    - VLIW architectures with several function units
    - superscalar architectures (multiple instruction issue)
- Global scheduling is for a set of basic blocks
  - Overlaps execution of successive basic blocks
  - Trace scheduling, Superblock scheduling, Hyperblock scheduling, Software pipelining, etc.

・ロット (雪) (日) (日) (日)

### Trace Scheduling

- A Trace is a frequently executed acyclic sequence of basic blocks in a CFG (part of a path)
- Identifying a trace
  - Identify the most frequently executed basic block
  - Extend the trace starting from this block, forward and backward, along most frequently executed edges
- Apply list scheduling on the trace (including the branch instructions)
- Execution time for the trace may reduce, but execution time for the other paths may increase
- However, overall performance will improve

・ロット (雪) (日) (日) (日)

#### **Trace Example**



・ロ ・ ・ 一 ・ ・ 日 ・ ・ 日 ・

### Trace - Basic Block Schedule

- 2-way issue architecture with 2 integer units
- add, sub, store: 1 cycle, load: 2 cycles, goto: no stall
- 9 cycles for the main trace and 6 cycles for the off-trace

| Time          |      | Int. Unit 1                |      | Int. Unit 2            |
|---------------|------|----------------------------|------|------------------------|
| 0             | i1:  | $r2 \leftarrow load a(r1)$ |      |                        |
| $\frac{1}{2}$ |      |                            |      |                        |
|               | i2:  | if (r2 != 0) goto i7       |      |                        |
| 3             | i3:  | $r3 \leftarrow load b(r1)$ |      |                        |
| 4 5           |      |                            |      |                        |
|               | i4:  | $r4 \leftarrow r3 + r7$    |      |                        |
| 6             | i5:  | $b(r1) \leftarrow r4$      | i6:  | goto i9                |
| 3             | i7:  | $r4 \leftarrow r2$         | i8:  | $b(r1) \leftarrow r2$  |
|               |      |                            |      |                        |
| 7(4)          | i9:  | $r5 \leftarrow r5 + r4$    | i10: | $r1 \leftarrow r1 + 4$ |
| 8(5)          | i11: | if (r1 $<$ r6) goto i1     |      |                        |

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のQ@

#### **Trace Schedule**



・ロ ・ ・ 一 ・ ・ 日 ・ ・ 日 ・

### **Trace Schedule**

• 6 cycles for the main trace and 7 cycles for the off-trace

| Time |      | Int. Unit 1                | Int. Unit 2 |       |              | t 2        |
|------|------|----------------------------|-------------|-------|--------------|------------|
| 0    | i1:  | $r2 \leftarrow load a(r1)$ | i3:         | r3    | $\leftarrow$ | load b(r1) |
| 1    |      |                            |             |       |              |            |
| 2    |      | if (r2 != 0) goto i7       | i4:         | r4    | $\leftarrow$ | r3 + r7    |
| 3    | i5:  | $b(r1) \leftarrow r4$      |             |       |              |            |
| 4(5) | i9:  | $r5 \leftarrow r5 + r4$    | i10:        | r1    | $\leftarrow$ | r1 + 4     |
| 5(6) | i11: | if (r1 $<$ r6) goto i1     |             |       |              |            |
|      |      |                            |             |       |              |            |
| 3    |      | $r4 \leftarrow r2$         | i8:         | b(r1) | $\leftarrow$ | r2         |
| 4    | i12: | goto i9                    |             |       |              |            |

◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 のへで

- Side exits and side entrances are ignored during scheduling of a trace
- Required compensation code is inserted during book-keeping (after scheduling the trace)
- Speculative code motion *load* instruction moved ahead of conditional branch
  - Example: Register r3 should not be live in block B3 (off-trace path)
  - May cause unwanted exceptions
    - Requires additional hardware support!

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のので

## **Compensation Code**



What compensation code is required when Instr 1 is moved below the side exit in the trace?

(日)

### Compensation Code (contd.)



・ロ ・ ・ 一 ・ ・ 日 ・ ・ 日 ・

э

### Compensation Code (contd.)



What compensation code is required when Instr 5 moves above the side entrance in the trace?

(日)

### Compensation Code (contd.)



・ ロ ト ・ 雪 ト ・ 目 ト ・ 日 ト

э

#### A Superblock is a trace without side entrances

- Control can enter only from the top
- Many exits are possible
- Eliminates several book-keeping overheads
- Superblock formation
  - Trace formation as before
  - Tail duplication to avoid side entrances into a superblock
  - Code size increases

・ロット (雪) (日) (日) (日)

### Superblock Example

• 5 cycles for the main trace and 6 cycles for the off-trace



| Time |       | Int. Unit 1 |                                                                                                                  |       | Int. Unit 2                             |  |  |
|------|-------|-------------|------------------------------------------------------------------------------------------------------------------|-------|-----------------------------------------|--|--|
| 0    | i1:   | r2          | $\leftarrow \texttt{load a(r1)}$                                                                                 | i3:   | $r3 \leftarrow load b(r1)$              |  |  |
| 1    |       |             |                                                                                                                  |       |                                         |  |  |
| 2    | i2:   | if (r2      |                                                                                                                  |       | $r4 \leftarrow r3 + r7$                 |  |  |
| 3    | i5:   | b(r1)       | $\leftarrow$ r4                                                                                                  | i10:  | $r1 \leftarrow r1 + 4$                  |  |  |
| 4    | i9:   | r5          | $\leftarrow$ r5 + r4                                                                                             | i11:  | if (r1 <r6) goto="" i1<="" td=""></r6)> |  |  |
|      |       |             |                                                                                                                  |       |                                         |  |  |
|      |       |             |                                                                                                                  | i8:   | $b(r1) \leftarrow r2$                   |  |  |
| 4    | i9':  | <b>r</b> 5  | $\leftarrow$ r5 + r4<br><r6) goto="" i1<="" th=""><th>i10':</th><th><math>r1 \leftarrow r1 + 4</math></th></r6)> | i10': | $r1 \leftarrow r1 + 4$                  |  |  |
| 5    | i11': | if (r1      | . <r6) goto="" i1<="" th=""><th></th><th></th></r6)>                                                             |       |                                         |  |  |

(a) Control Flow Graph

(b) Superblock Schedule

(日)

- Superblock scheduling does not work well with control-intensive programs which have many control flow paths
- Hyperblock scheduling was proposed to handle such programs
- Here, the control flow graph is IF-converted to eliminate conditional branches
- IF-conversion replaces conditional branches with appropriate predicated instructions
- Now, control dependence is changed to a data dependence

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のQ@

### **IF-Conversion Example**

for I = 1 to 100 do { if (A(I) <= 0) then contnue A(I) = B(I) + 3 } for I = 1 to N do { S1: A(I) = D(I) + 1 S2: if (B(I) > 0) then S3: C(I) = C(I) + A(I) S4: else D(I+1) = D(I+1) + 1 end if }

for I = 1 to 100 do { p = (A(I) <= 0) (p) A(I) = B(I) + 3 }

for I = 1 to N do { S1: A(I) = D(I) + 1 S2: p = (B(I) > 0) S3: (p) C(I) = C(I) + A(I) S4: (!p) D(I+1) = D(I+1) + 1 }

くロン くぼと くさと くちとう

=

### Hyperblock Example Code



・ロット (雪) (日) (日) (日)

### Hyperblock Example

- 6 cycles for the entire set of predicated instructions
- Instructions i3 and i4 can be executed speculatively and can be moved up, instead of being scheduled after cycle 2



| $\operatorname{Time}$ | Int. Unit 1 |    |              | it 1       |      | Int. Unit 2                             |
|-----------------------|-------------|----|--------------|------------|------|-----------------------------------------|
| 0                     | i1:         | r2 | $\leftarrow$ | load a(r1) | i3:  | $r3 \leftarrow load b(r1)$              |
| 1                     |             |    |              |            |      |                                         |
| 2                     | i2':        | p1 | $\leftarrow$ | (r2 == 0)  | i4:  | $r4 \leftarrow r3 + r7$                 |
|                       |             |    |              |            |      | b(r1) $\leftarrow$ r2, if !p1           |
| 4                     | i10:        | r1 | $\leftarrow$ | r1 + 4     | i7:  | $r4 \leftarrow r2, if !p1$              |
| 5                     | i9:         | r5 | $\leftarrow$ | r5 + r4    | i11: | if (r1 <r6) goto="" i1<="" td=""></r6)> |

(b) Hyperblock Schedule

伺 とくき とくきと

(a) Control Flow Graph

## **Delayed Branch Scheduling**

- Delayed branching
  - One instruction immediately following the delayed branch instruction will be executed before the branch is taken
  - The instruction occupying the delay slot should be *independent* of the branch instruction
- It is best to fill the branch delay slot with an instruction from the basic block that the branch terminates
- Otherwise, an instruction from either the target block or the fall-through block, whichever is most likely to be executed, is selected
  - The selected instruction should either be a *root node* of the DAG of the basic block (target of fall-through)
  - and has a destination register that is not live-in in the other block
  - or has a destination register that can be renamed

・ロット (雪) (日) (日) (日)

### Delay Branch Scheduling Conditions - 1



### Delay Branch Scheduling Conditions - 2



くつわえ くさん くちん

=