# Instruction Scheduling - Part 1

#### Y.N. Srikant

Department of Computer Science Indian Institute of Science Bangalore 560 012

#### NPTEL Course on Compiler Design

- Instruction Scheduling
  - Simple Basic Block Scheduling
  - Automaton-based Scheduling
  - Integer programming based scheduling
  - Optimal Delayed-load Scheduling (DLS) for trees
  - Trace, Superblock and Hyperblock scheduling

(日)

э

- Reordering of instructions so as to keep the pipelines of functional units full with no stalls
- NP-Complete and needs heuristcs
- Applied on basic blocks (local)
- Global scheduling requires elongation of basic blocks (super-blocks)

## Instruction Scheduling - Motivating Example

- time: load 2 cycles, op 1 cycle
- This code has 2 stalls, at i3 and at i5, due to the loads



(b) DAG

・ロン ・聞 と ・ ヨ と ・ ヨ と

3

| i1: | r1 | $\leftarrow$ | load a  |
|-----|----|--------------|---------|
| i2: | r2 | $\leftarrow$ | load b  |
| i3: | r3 | $\leftarrow$ | r1 + r2 |
| i4: | r4 | $\leftarrow$ | load c  |
| i5: | r5 | $\leftarrow$ | r3 - r4 |
| i6: | r6 | $\leftarrow$ | r3 * r5 |
| i7: | d  | $\leftarrow$ | st r6   |

(a) Sample Code Sequence

#### Scheduled Code - no stalls

• There are no stalls, but dependences are indeed satisfied

| i1: | r1 | $\leftarrow$ | load a  |  |  |  |  |
|-----|----|--------------|---------|--|--|--|--|
| i2: | r2 | $\leftarrow$ | load b  |  |  |  |  |
| i4: | r4 | $\leftarrow$ | load c  |  |  |  |  |
| i3: | r3 | $\leftarrow$ | r1 + r2 |  |  |  |  |
| i5: | r5 | $\leftarrow$ | r3 - r4 |  |  |  |  |
| i6: | r6 | $\leftarrow$ | r3 * r5 |  |  |  |  |
| i7: | d  | $\leftarrow$ | st r6   |  |  |  |  |

- Consider the following code:
  - $\textit{i}_1: \textit{ r1} \leftarrow \textit{load}(\textit{r2})$
  - $\textit{i}_2: \textit{ r3} \leftarrow \textit{r1} + 4$
  - $i_3: r1 \leftarrow r4 + r5$
- The dependences are
  - $i_1 \delta i_2$  (flow dependence)  $i_2 \overline{\delta} i_3$  (anti-dependence)
  - $i_1 \delta^o i_3$  (output dependence)
- anti- and ouput dependences can be eliminated by register renaming

・ロット (雪) (日) (日) (日)

# Dependence DAG

- full line: *flow* dependence, dash line: *anti*-dependence dash-dot line: *output* dependence
- some anti- and output dependences are because memory disambiguation could not be done

| i1: | t1 | $\leftarrow$ | load a  |
|-----|----|--------------|---------|
| i2: | t2 | $\leftarrow$ | load b  |
| i3: | t3 | $\leftarrow$ | t1 + 4  |
| i4: | t4 | $\leftarrow$ | t1 - 2  |
| i5: | t5 | $\leftarrow$ | t2 + 3  |
| i6: | t6 | $\leftarrow$ | t4 * t2 |
| i7: | t7 | $\leftarrow$ | t3 + t6 |
| i8: | с  | $\leftarrow$ | st t7   |
| i9: | ъ  | $\leftarrow$ | st t5   |

(a) Instruction Sequence



(b) DAG

- Basic block consists of micro-operation sequences (MOS), which are indivisible
- Each MOS has several steps, each requiring resources
- Each step of an MOS requires one cycle for execution
- Precedence constraints and resource constraints must be satisfied by the scheduled program
  - PC's relate to data dependences and execution delays
  - RC's relate to limited availability of shared resources

・ ロ ト ・ 雪 ト ・ 目 ト ・ 日 ト

-

## The Basic Block Scheduling Problem

- Basic block is modelled as a digraph, G = (V, E)
  - R: number of resources
  - Nodes (V): MOS; Edges (E): Precedence
  - Label on node v
    - resource usage functions, ρ<sub>ν</sub>(i) for each step of the MOS associated with ν
    - length I(v) of node v
  - Label on edge e: Execution delay of the MOS, d(e)
- Problem: Find the shortest schedule  $\sigma : V \to N$  such that  $\forall e = (u, v) \in E, \ \sigma(v) - \sigma(u) \ge d(e)$  and  $\forall i, \sum_{v \in V} \rho_v(i - \sigma(v)) \le R$ , where length of the schedule is  $\max_{v \in V} \{\sigma(v) + I(v)\}$

# Instruction Scheduling - Precedence and Resource Constraints



MOS substeps (time)

Consider R = 5. Each MOS substep takes 1 time unit.

At i=4,  $\varsigma_{v4}(1)+\varsigma_{v3}(2)+\varsigma_{v2}(3)+\varsigma_{v1}(4) = 2+2+1+0 = 5 \le R$ , satisfied

At i=2,  $\zeta_{v3}(0)+\zeta_{v2}(1)+\zeta_{v1}(2) = 3+3+2=8 > R$ , NOT satisfied

# A Simple List Scheduling Algorithm

Find the shortest schedule  $\sigma: V \to N$ , such that precedence and resource constraints are satisfied. Holes are filled with NOPs.

```
FUNCTION ListSchedule (V,E)
BEGIN
  Ready = root nodes of V; Schedule = \phi;
  WHILE Ready \neq \phi DO
  BEGIN
   v = highest priority node in Ready;
    Lb = SatisfyPrecedenceConstraints (v, Schedule, \sigma);
   \sigma(v) = SatisfyResourceConstraints (v, Schedule, \sigma, Lb);
    Schedule = Schedule + \{v\};
    Ready = Ready - \{v\} + \{u \mid NOT (u \in Schedule)
              AND \forall (w, u) \in E, w \in Schedule};
  END
  RETURN \sigma;
FND
```

#### List Scheduling - Ready Queue Update



```
FUNCTION SatisfyPrecedenceConstraint(v, Sched, \sigma)
BEGIN
RETURN (\max_{u \in Sched} \sigma(u) + d(u, v))
END
```

```
FUNCTION SatisfyResourceConstraint(v, Sched, \sigma, Lb)
BEGIN
FOR i := Lb TO \infty DO
IF \forall 0 \le j < l(v), \ \rho_v(j) + \sum_{\substack{u \in Sched \\ \rho_u(i+j-\sigma(u)) \le R \ THEN \\ RETURN (i);}END
```

### Precedence Constraint Satisfaction



.⊟...>

# **Resource Constraint Satisfaction**

ş

| Resource constraint satisfaction<br>Consider R = 5. Each MOS<br>substep takes 1 time unit. |                      | MOS substeps (time) |   |   |   |   |
|--------------------------------------------------------------------------------------------|----------------------|---------------------|---|---|---|---|
|                                                                                            |                      | 0                   | 1 | 2 | 3 | 4 |
| Schedule<br>Time $\sigma(u)$                                                               | σ(v <sub>1</sub> )=0 | 1                   | 1 | 2 | 2 |   |
|                                                                                            | σ(v <sub>2</sub> )=1 | 2                   | 3 | 1 | 1 | 2 |
|                                                                                            | 2                    |                     |   |   |   |   |
|                                                                                            | 3                    |                     |   |   |   |   |
|                                                                                            | σ(v <sub>3</sub> )=4 | 3                   | 1 | 2 |   |   |
|                                                                                            | σ(v <sub>4</sub> )=5 | 1                   | 2 | 3 | 2 |   |

Time slots 2 and 3 are vacant because scheduling node  $v_3$  in either of them violates resource constraints

=

# List Scheduling - Priority Ordering for Nodes

- Height of the node in the DAG (*i.e.*, longest path from the node to a terminal node
- 2 Estart, and Lstart, the earliest and latest start times
  - Violating Estart and Lstart may result in pipeline stalls
  - Estart(v) = max<sub>i=1,...,k</sub> (Estart(u<sub>i</sub>) + d(u<sub>i</sub>, v))
     where u<sub>1</sub>, u<sub>2</sub>, ..., u<sub>k</sub> are predecessors of v. Estart value of the source node is 0.
  - Lstart(u) =  $\min_{i=1,\dots,k} (Lstart(v_i) d(u, v_i))$ where  $v_1, v_2, \dots, v_k$  are successors of *u*. Lstart value of the

sink node is set as its Estart value.

• Estart and Lstart values can be computed using a top-down and a bottom-up pass, respectively, either statically (before scheduling begins), or dynamically during scheduling

### Estart and Lstart Computation



Estart (v) = max (Esart (u<sub>i</sub>) + d<sub>i</sub>) i = 1,...,3 = max(25+4, 45+7, 16+2) = max(29, 52, 18) = 52

- A node with a lower Estart (or Lstart) value has a higher priority
- Slack = Lstart Estart
  - Nodes with lower slack are given higher priority
  - Instructions on the critical path may have a slack value of zero and hence get priority

## Simple List Scheduling - Example - 1



# Simple List Scheduling - Example - 2

- Iatencies
  - add,sub,store: 1 cycle; load: 2 cycles; mult: 3 cycles
- path length and slack are shown on the left side and right side of the pair of numbers in parentheses



(c) DAG with *(Estart, Lstart)* Values

7(0, 1)1

・ロト ・ 一日 ・ ・ 日 ・ ・ 日 ・