1 of 44

x86 Architecture & Its Assembly Language Programming

Dr A Sahu

Dept of Computer Science & Engineering

IIT Guwahati

2 of 44

Outline

  • Review of 8086 Architecture
    • Block diagram (Data Path)
  • Similarity with x86 (i386, Pentium, etc)
    • Very important for interviews and general knowledge
    • Not part of the examination
  • x86 Assembly language program
    • Memory model
    • Example programs
    • Data Segment
    • Loop and Nested Loop
  • Next Class: Detail of assembly language
    • Summary of 8085/8086/i386 Arch & programming

3 of 44

Introduction to 8086 & i386 processor

  • 16-bit microprocessor
  • All internal registers, as well as the internal and external data buses, are 16 bits wide
  • 4 main registers, 4 index/pointer registers, 4 segment registers, a status (flag) register and an instruction pointer
  • Not compatible with the 8085, but compatible with its successors
  • Two units work in parallel:
    • Bus Interface Unit (BIU)
    • Execution Unit (EU)

4 of 44

8086 Architecture

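[Figure: 8086 block diagram (data path). Execution Unit: general registers AX (AH/AL), BX (BH/BL), CX (CH/CL), DX (DH/DL); SI (source index), DI (destination index), BP (base pointer), SP (stack pointer); flag register; temporary registers Temp A, Temp B, Temp C; ALU. Bus Interface Unit: segment registers CS (code), DS (data), ES (extra), SS (stack); IP (instruction pointer); Operand and InDirect temporary registers; address adder (SUM); 6-byte prefetch queue Q1-Q6. Other labels: sequencer, A bus, C bus.]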

5 of 44

8086 Architecture

  • Execution Unit:
    • The ALU may be loaded from three temporary registers (TMPA, TMPB, TMPC)
    • Executes operations on bytes or 16-bit words
    • The result is stored in a temporary register or in a register connected to the internal data bus
  • Bus Interface Unit:
    • The BIU computes the addresses
    • Two temporary registers, used for indirect addressing
    • Four segment registers (DS, CS, SS and ES)
    • Program counter (IP, the Instruction Pointer)
    • A 6-byte queue buffer that stores pre-fetched opcodes and data
    • This prefetch queue optimizes bus usage
    • To execute a jump instruction the queue has to be flushed, since the pre-fetched instructions must not be executed
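The address computation done by the BIU can be sketched in a few lines. The slides do not spell out the formula; the sketch below assumes the standard 8086 real-mode rule (segment shifted left by 4 bits plus offset, truncated to 20 address bits). The function name `physical_address` is hypothetical and chosen only for illustration.

```python
def physical_address(segment: int, offset: int) -> int:
    """Return the 20-bit physical address for a 16-bit segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    # Shift the segment left by 4 bits (multiply by 16), add the offset,
    # and keep only 20 bits because the 8086 has 20 address lines.
    return ((segment << 4) + offset) & 0xFFFFF


# Example: CS:IP = 1234h:0010h
example = physical_address(0x1234, 0x0010)
```

Here `example` is 12350h. Because the sum is truncated to 20 bits, FFFFh:FFFFh wraps around to 0FFEFh.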

6 of 44

History of Intel Architectures

  • 1978: 8086 (16-bit architecture)
  • 1980: 8087
    • Floating-point coprocessor added
  • 1982: 80286
    • Address space increased to 24 bits
  • 1985: 80386
    • 32-bit addresses
    • Virtual memory and new addressing modes
    • Protected mode (OS support)
  • 1989-95: 80486/Pentium/Pentium Pro
    • Added a few instructions to the base set

7 of 44

History of Intel Architectures

  • 1997: Pentium II
    • 57 new “MMX” instructions added
  • 1999: Pentium III
    • Out-of-order execution; another 70 instructions added as Streaming SIMD Extensions (SSE)
  • 2001: Pentium 4
    • NetBurst; another 144 instructions (SSE2)
  • 2003: Pentium 4 with Hyper-Threading (HT), trace cache
  • 2005: Centrino, low power
  • 2007: Core architecture, Duo
  • 2008: Atom, quad core with HT
  • 2009 onward: Multi-core (large chip multiprocessors)

8 of 44

Superscalar Pipeline

[Figure: superscalar pipeline timing diagram. Stages IF, D, IS, EX and WB are plotted against time in base cycles (0 to 9).]

How complex will the hardware be?

9 of 44

ILP in Superscalar processors

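[Figure: ILP in a superscalar processor. The fetch unit reads a sequential stream of instructions from cache/memory and passes multiple instructions to the decode & issue unit, which feeds several functional units (FU) connected to the register file. Legend: instruction/control paths, data paths; FU = functional unit.]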

10 of 44

Intel P5 Architecture (Generation 5)

  • Used in the initial Pentium processor
  • Could execute up to 2 instructions simultaneously
  • Instructions are sent through the pipeline in order: if the next two instructions have a dependency, only one instruction is executed and the second execution unit (pipe) goes unused for that clock cycle.

[Figure: P5 pipeline. Instructions 1-4 pass in order through the decoder, EX unit and WB unit, split across Pipe 1 and Pipe 2.]
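The in-order pairing rule on this slide can be illustrated with a small cycle counter. This is a simplified sketch, not the actual P5 pairing logic (the real rules are more restrictive). The instruction format `(dest, sources)` and the function name `dual_issue_cycles` are assumptions made for illustration.

```python
def dual_issue_cycles(program):
    """Count cycles for in-order issue with at most two instructions per cycle.

    program: list of (dest, sources) tuples, e.g. ("T", ("A", "B")).
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        first_dest, _ = program[i]
        i += 1  # pipe 1 always takes the next instruction
        if i < len(program):
            second_dest, second_srcs = program[i]
            # Pipe 2 is used only if the second instruction neither reads
            # nor overwrites the result of the first one.
            if first_dest not in second_srcs and first_dest != second_dest:
                i += 1
    return cycles


independent = [("T", ("A", "B")), ("W", ("C", "D"))]
dependent = [("T", ("A", "B")), ("W", ("T", "D"))]
```

With these inputs, `dual_issue_cycles(independent)` is 1 because both pipes are used, and `dual_issue_cycles(dependent)` is 2 because pipe 2 stays idle in the first cycle.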

11 of 44

Intel P6 Architecture (Generation 6)

  • Used in the Pentium Pro, Pentium II and Pentium III processors
  • 3 instruction decoders, which break each CISC instruction (macro-op) into equivalent micro-operations (µops) for the out-of-order execution unit
  • This architecture uses a 10-stage instruction pipeline

[Figure: P6 pipeline. Instructions 1-4 feed three decoders; the out-of-order execution unit (scheduler, EX unit, reorder buffer) is followed by the WB unit.]

12 of 44

Intel NetBurst MicroArchitecture

  • New architecture used for the Intel Pentium 4 and Xeon processors

[Figure: NetBurst pipeline. Instructions 1-4 feed the decoders; decoded instructions are held in the execution trace cache and supplied to the out-of-order execution unit (scheduler, EX unit, reorder buffer), followed by the WB unit.]

13 of 44

Tasks of superscalar processing

  • Parallel decoding and issue
  • Parallel instruction execution
  • Preserving the sequential consistency of instruction execution and exception processing

14 of 44

Superscalar decode and issue

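[Figure: scalar vs superscalar issue. In both cases instructions flow from the I-cache through the instruction buffer to decode & issue. Scalar issue: one instruction per cycle goes through IF and D/I. Superscalar issue: several instructions go through IF, D and I in parallel.]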

15 of 44

Parallel Decoding

  • Fetch multiple instructions into the instruction buffer
  • Decode multiple instructions in parallel (the instruction window)
  • Possibly check dependencies among them, as well as with instructions already under execution

16 of 44

Dependent/Independent Instructions

  • ADD T, A, B    ; T = A + B
  • ADD W, C, D    ; W = C + D
  • LD A, 0(W)     ; A = M[W]
  • ST C, 0(B)     ; M[B] = C

Dependency types: Read After Write (RAW), Write After Write (WAW), Write After Read (WAR)

RAW (Ins2-Ins3, on W): true dependency

WAR (Ins1-Ins3, on A) and WAW: false dependencies
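The dependency check for the four instructions above can be sketched as a pairwise comparison of the registers each instruction reads and writes. This is a minimal illustration: it ignores memory operands and any write that falls between two instructions, and the names `PROGRAM` and `find_dependencies` are hypothetical.

```python
from itertools import combinations

# (text, registers written, registers read); memory operands are ignored.
PROGRAM = [
    ("ADD T, A, B", {"T"}, {"A", "B"}),   # T = A + B
    ("ADD W, C, D", {"W"}, {"C", "D"}),   # W = C + D
    ("LD A, 0(W)",  {"A"}, {"W"}),        # A = M[W]
    ("ST C, 0(B)",  set(), {"C", "B"}),   # M[B] = C
]


def find_dependencies(program):
    """Return a sorted list of (kind, earlier, later, register), 1-based."""
    found = []
    numbered = enumerate(program, 1)
    for (i, (_, w1, r1)), (j, (_, w2, r2)) in combinations(numbered, 2):
        for reg in w1 & r2:       # later instruction reads what earlier wrote
            found.append(("RAW", i, j, reg))
        for reg in r1 & w2:       # later instruction overwrites what earlier read
            found.append(("WAR", i, j, reg))
        for reg in w1 & w2:       # both instructions write the same register
            found.append(("WAW", i, j, reg))
    return sorted(found)


dependencies = find_dependencies(PROGRAM)
```

For this program the sketch reports one RAW dependency (Ins2 to Ins3 on W) and one WAR dependency (Ins1 to Ins3 on A). No WAW dependency occurs, since T, W and A are each written only once.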

17 of 44

Issue vs Dispatch

Blocking Issue

  • Decode and issue to EU

Instructions may be blocked due to data dependency

Non-blocking Issue

  • Decode and issue to buffer
  • From buffer dispatch to EU

Instructions are not blocked due to data dependency

18 of 44

Blocking Issue

[Figure: blocking issue. Instructions in the issue window of the instruction buffer are decoded, checked and issued directly to the EUs.]

19 of 44

Non-blocking (shelved) Issue

[Figure: non-blocking (shelved) issue. Instructions pass from the instruction buffer through decode & issue into reservation stations; each reservation station performs dependency checking and dispatches to its EU.]

20 of 44

Handling of Issue Blockages

  • Preserving issue order
    • In-order
    • Out-of-order
  • Alignment of instruction issue
    • Aligned
    • Unaligned

21 of 44

Dependent/Independent Instructions

  • ADD T, A, B    ; T = A + B
  • ADD W, C, D    ; W = C + D
  • LD A, 0(W)     ; A = M[W]
  • ST C, 0(B)     ; M[B] = C

22 of 44

Issue Order

[Figure: issue order. Two issue windows holding instructions a to e, each showing the instructions to be issued and the instructions issued: one for issue in strict program order, one for out-of-order issue (example: MC 88110, PowerPC 601). Legend: independent instruction, dependent instruction, issued instruction.]

23 of 44

Alignment

[Figure: aligned vs unaligned issue of instructions a to h over cycles 1 to 3, showing which instructions are checked and which are issued in each cycle. Aligned issue uses a fixed window followed by the next window; unaligned issue uses a gliding window.]

24 of 44

Design choices in instruction issue

  • Coping with false data dependencies
    • No renaming
    • Register renaming
  • Coping with unresolved control dependencies
    • Wait
    • Speculative execution
  • Use of shelving
    • Blocking issue
    • Shelved issue
  • Handling of issue blockages
  • Issue rate (2-6)

25 of 44

Layout of Shelving Buffers

  • Type of the shelving buffers
    • Stand-alone (RS): individual, group or central
    • Combined with renaming and reordering
  • Number of shelving buffer entries
    • Individual: 2-4
    • Group: 6-16
    • Central: 20
    • Total: 15-40
  • Number of read and write ports
    • Depends on the number of EUs connected

26 of 44

Reservation Stations (RS)

[Figure: reservation station (RS) layouts. Individual RSs: one RS per EU. Group RSs: one RS shared by a group of EUs. Central RS: a single RS feeding all EUs.]

27 of 44

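Issue bound operand fetch (with single register file)

[Figure: issue-bound operand fetch. Decode/issue reads operands from the register file (RF) and passes instructions together with their data to the reservation stations (RS), each feeding an EU. Legend: instruction paths, data paths.]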

28 of 44

Dispatch bound operand fetch (with single register file)

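[Figure: dispatch-bound operand fetch. Decode/issue passes instructions to the reservation stations (RS); operands are read from the register file (RF) when instructions are dispatched to the EUs. Legend: instruction paths, data paths.]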

29 of 44

Why Renaming and Reordering?

  • Register renaming
    • Removes false dependencies (WAR and WAW)
  • Reorder buffer (ROB): Pentium out-of-order instruction processing
    • Ensures sequential consistency of interrupts (precise vs imprecise interrupts)
    • Facilitates speculative execution
      • Branch execution
      • Execute both paths and discard the wrong one once the condition code (CC) value is known
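The role of the reorder buffer can be sketched as a queue that accepts completions in any order but releases results only in program order. This is a deliberately minimal model under stated assumptions, not the Pentium's actual structure. The class `ReorderBuffer` and its methods are hypothetical names.

```python
from collections import deque


class ReorderBuffer:
    """Minimal ROB model: complete in any order, retire in program order."""

    def __init__(self):
        self._entries = deque()   # each entry is [tag, done]

    def issue(self, tag):
        """Allocate an entry at the tail, in program order."""
        self._entries.append([tag, False])

    def complete(self, tag):
        """Mark an instruction as finished; this may happen out of order."""
        for entry in self._entries:
            if entry[0] == tag:
                entry[1] = True
                return
        raise KeyError(tag)

    def retire(self):
        """Pop and return every finished instruction at the head of the buffer."""
        retired = []
        while self._entries and self._entries[0][1]:
            retired.append(self._entries.popleft()[0])
        return retired
```

If instructions i1, i2, i3 are issued and i3 completes first, `retire()` returns nothing until i1 is done. An interrupt therefore always sees a state in which every instruction before the retirement point has finished and none after it has.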

30 of 44

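RAW, WAR and WAW (in Superscalar)

[Figure: three instructions, each passing through IF, IS, DP, EX and WB, that perform a write, a read and a write respectively; the RAW, WAR and WAW dependencies between them are marked.]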

31 of 44

Register renaming

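Without renaming: write R5, read R5 (RAW), write R5 (WAR with the previous read), read R5 (RAW).

With renaming: write R5, read R5 (RAW); write R8, read R8 (RAW). The second write now targets R8, so the WAR dependency disappears.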

32 of 44

Who does renaming?

  • Compiler
    • Done statically
    • Limited by registers visible to compiler
  • Hardware
    • Done dynamically
    • Limited by registers available to hardware
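Dynamic (hardware) renaming can be sketched with a mapping table from architectural to physical registers. The sketch below is a simplification: it never frees physical registers, and the names `rename`, `P0`, `P1`, ... and the `(dest, sources)` instruction format are assumptions made for illustration.

```python
def rename(program, num_physical=8):
    """Rewrite (dest, sources) instructions to use fresh physical registers."""
    mapping = {}          # architectural register -> current physical register
    next_free = 0
    renamed = []
    for dest, sources in program:
        # Sources read whichever physical register currently holds the value;
        # registers never written in this program keep their original name.
        new_sources = tuple(mapping.get(s, s) for s in sources)
        if next_free >= num_physical:
            raise RuntimeError("out of physical registers")
        # Every write gets a fresh physical register, which removes WAR and WAW.
        mapping[dest] = f"P{next_free}"
        next_free += 1
        renamed.append((mapping[dest], new_sources))
    return renamed


# write R5, read R5, write R5 again, read R5 again
program = [
    ("R5", ("R1", "R2")),
    ("R6", ("R5",)),
    ("R5", ("R3", "R4")),
    ("R7", ("R5",)),
]
renamed = rename(program)
```

After renaming, the two writes to R5 go to different physical registers (P0 and P2), so only the two RAW dependencies remain. The `num_physical` limit mirrors the point that hardware renaming is bounded by the registers available to the hardware.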

33 of 44

X86 Assembly language program

  • Kernel code writing
    • Process switching code
    • Thread synchronization code
    • Locks and barriers (test & set, fetch & increment, xchg)
    • pthread spin_lock (in ASM) is 28% faster than Intel tbb::spin_mutex (in C)
  • Time-critical coding (coding for DSP)
    • /usr/include/asm-i386
  • Use of MASM/TASM/NASM assemblers or the GCC compiler
    • gcc -S test.c -o test.s
  • C/C++ code with asm block
  • 8086 code is compatible with i386/Pentium

34 of 44

8086 Registers

  • AX - the accumulator register (divided into AH / AL)
  • BX - the base address register (divided into BH / BL)
  • CX - the count register (divided into CH / CL)
  • DX - the data register (divided into DH / DL)

  • SI - source index register.
  • DI - destination index register.
  • BP - base pointer.
  • SP - stack pointer.

  • CS, DS, ES, SS - code, data, extra and stack segment registers.
  • IP - instruction pointer.
  • Flag register.

35 of 44

i386/i486/i686 Registers

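[Figure: i386/i486/i686 register layout, bit positions 31, 15, 7, 0. The 16-bit registers are extended to 32 bits: AX, BX, CX, DX become EAX, EBX, ECX, EDX (AH/AL, BH/BL, CH/CL, DH/DL remain accessible as bits 15-8 and 7-0); SI, DI, BP, SP become ESI, EDI, EBP, ESP; IP becomes EIP; the flag register becomes EFLAGS. The figure also labels the segment registers ECS, EDS, EES, ESS, but CS, DS, ES and SS remain 16 bits wide on the i386.]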
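How the extended registers overlap their 16-bit and 8-bit parts can be sketched with bit masks. The class `Reg32` and its property names (`full`, `word`, `high`, `low`) are hypothetical; they stand for EAX, AX, AH and AL (or the corresponding parts of EBX, ECX, EDX).

```python
class Reg32:
    """Model of one 32-bit general register and its sub-registers."""

    def __init__(self, value=0):
        self.full = value & 0xFFFFFFFF          # all 32 bits (e.g. EAX)

    @property
    def word(self):                             # bits 15..0 (e.g. AX)
        return self.full & 0xFFFF

    @word.setter
    def word(self, value):
        self.full = (self.full & 0xFFFF0000) | (value & 0xFFFF)

    @property
    def high(self):                             # bits 15..8 (e.g. AH)
        return (self.full >> 8) & 0xFF

    @high.setter
    def high(self, value):
        self.full = (self.full & 0xFFFF00FF) | ((value & 0xFF) << 8)

    @property
    def low(self):                              # bits 7..0 (e.g. AL)
        return self.full & 0xFF

    @low.setter
    def low(self, value):
        self.full = (self.full & 0xFFFFFF00) | (value & 0xFF)


eax = Reg32(0x12345678)
```

Reading `eax.word`, `eax.high` and `eax.low` gives 5678h, 56h and 78h. Writing one of the smaller parts leaves all other bits of the 32-bit register unchanged, which is why 8086 code that only touches AX, AH or AL still behaves the same on an i386.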

36 of 44

Memory layout of C program

  • Stack
    • Automatic (default) and local variables
    • Initialized or uninitialized
  • Data
    • Global, static, extern variables
    • BSS (Block Started by Symbol): the uninitialized data segment
  • Code
    • Program instructions
  • Heap
    • malloc, calloc

[Figure: memory map with Code, Data, BSS, Heap and Stack regions.]

37 of 44

Memory layout of C program

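#include <stdlib.h>

int A;          /* uninitialized global: BSS */
int B = 10;     /* initialized global: Data */

int main(void) {
    int Alocal;               /* automatic local variable: Stack */
    int *p;                   /* the pointer itself: Stack */
    p = (int *)malloc(10);    /* the 10 allocated bytes: Heap */
    free(p);
    return 0;
}

[Figure: memory map with Code, Data, BSS, Heap and Stack regions.]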

38 of 44

MASM : Hello world

.model small
.stack 100h              ; reserve 256 bytes of stack space
.data
message db "Hello world, I'm learning Assembly$"   ; '$' marks the end of the string for DOS
.code
main proc
    mov ax, seg message  ; DS must point to the segment that holds message
    mov ds, ax
    mov ah, 09h          ; DOS function 09h: write a '$'-terminated string to the screen
    lea dx, message      ; load effective address (offset) of message into DX
    int 21h              ; call DOS
    mov ax, 4c00h        ; DOS function 4Ch: exit program with return code 0
    int 21h
main endp
end main

39 of 44

Memory Model: Segment Definition

  • .model small
    • Most widely used memory model.
    • The code must fit in 64k.
    • The data must fit in 64k.
  • .model medium
    • The code can exceed 64k.
    • The data must fit in 64k.
  • .model compact
    • The code must fit in 64k.
    • The data can exceed 64k.
  • The medium and compact models are opposites.

40 of 44

Data Allocation Directives

  • db: define byte
  • dw: define word (2 bytes)
  • dd: define double word (4 bytes)
  • dq: define quad word (8 bytes)
  • equ: equate, assigns a numeric expression to a name

.data
A        db 100 dup (?)   ; define 100 bytes, with no initial values
         db "Hello"       ; define 5 bytes, the ASCII codes of "Hello"
PtrArray dd 4 dup (?)     ; array[0..3] of dword
maxint   equ 32767        ; define maxint = 32767
count    equ 10 * 20      ; calculate a value (200)
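The storage reserved by these directives can be checked with a short tally. The sketch only encodes the element sizes listed above; the names `ELEMENT_SIZE`, `reserved_bytes` and `layout` are hypothetical. `equ` lines are left out because they define assembly-time constants and reserve no memory.

```python
# Bytes per element for each data directive named on the slide.
ELEMENT_SIZE = {"db": 1, "dw": 2, "dd": 4, "dq": 8}


def reserved_bytes(directive: str, count: int) -> int:
    """Bytes reserved by `directive` for `count` elements (as in `count dup (?)`)."""
    return ELEMENT_SIZE[directive.lower()] * count


# Tally of the three storage definitions in the .data section above.
layout = [
    ("A",        reserved_bytes("db", 100)),           # A db 100 dup (?)
    ("(string)", reserved_bytes("db", len("Hello"))),  # db "Hello"
    ("PtrArray", reserved_bytes("dd", 4)),             # PtrArray dd 4 dup (?)
]
total = sum(size for _, size in layout)
```

The three definitions reserve 100, 5 and 16 bytes, so `total` is 121 bytes.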

41 of 44

MASM: Loop

  • Assembly code: Loop
    • LOOP decrements CX and, if CX != 0, jumps to the specified label
    • LOOPNZ (same as LOOPNE): decrements CX and jumps only if CX != 0 and the zero flag is not set

          MOV CX, 100
_LABEL:   INC AX
          LOOP _LABEL

          MOV CX, 10
_CMPLOOP: DEC AX
          CMP AX, 3
          LOOPNE _CMPLOOP
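The behaviour of the two snippets can be traced with a small simulation. This is a sketch of the LOOP and LOOPNE semantics only. The slide does not initialize AX, so the starting AX values are assumptions, and the function names `run_loop` and `run_loopne` are hypothetical.

```python
MASK16 = 0xFFFF  # registers are 16 bits wide


def run_loop(cx: int, ax: int = 0):
    """Simulate `INC AX / LOOP label`; returns the final (ax, cx)."""
    while True:
        ax = (ax + 1) & MASK16          # INC AX
        cx = (cx - 1) & MASK16          # LOOP decrements CX ...
        if cx == 0:                     # ... and falls through once CX is 0
            return ax, cx


def run_loopne(cx: int, ax: int, target: int = 3):
    """Simulate `DEC AX / CMP AX,target / LOOPNE label`; returns (ax, cx)."""
    while True:
        ax = (ax - 1) & MASK16          # DEC AX
        zero_flag = (ax == target)      # CMP sets ZF when the operands are equal
        cx = (cx - 1) & MASK16          # LOOPNE decrements CX, flags untouched
        if cx == 0 or zero_flag:        # jump only while CX != 0 and ZF == 0
            return ax, cx
```

Starting from AX = 0, `run_loop(100)` ends with AX = 100 and CX = 0. Starting from AX = 8, `run_loopne(10, 8)` stops early with AX = 3 and CX = 5, because the comparison sets the zero flag before the count runs out.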

42 of 44

MASM: Nested Loop

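  • Assembly code: Nested loop with one CX register

       mov cx, 8
Loop1: push cx        ; save the outer counter
       mov cx, 4
Loop2: stmts          ; inner-loop statements
       loop Loop2
       pop cx         ; restore the outer counter
       stmts          ; outer-loop statements
       loop Loop1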
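The push/pop pattern can be traced with a short simulation that counts how often each loop body runs. It is a sketch that assumes both counts are positive; the function name `nested_loop` is hypothetical and a Python list stands in for the stack.

```python
def nested_loop(outer: int = 8, inner: int = 4):
    """Simulate the nested loop; returns (inner_iterations, outer_iterations)."""
    stack = []
    inner_count = outer_count = 0
    cx = outer                      # mov cx, 8
    while True:                     # Loop1:
        stack.append(cx)            #   push cx   (save the outer counter)
        cx = inner                  #   mov cx, 4
        while True:                 # Loop2:
            inner_count += 1        #   inner-loop statements
            cx -= 1                 #   loop Loop2
            if cx == 0:
                break
        cx = stack.pop()            #   pop cx    (restore the outer counter)
        outer_count += 1            #   outer-loop statements
        cx -= 1                     #   loop Loop1
        if cx == 0:
            return inner_count, outer_count


counts = nested_loop()
```

With the values from the slide the inner body runs 8 x 4 = 32 times and the outer body 8 times. Without the push and pop, the inner loop would leave CX at 0 and the outer counter would be lost.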

43 of 44

Next Class Agenda

  • Detail of Assembly language program
  • Addressing mode
  • Example program
  • Assignment problem
    • Summary of 8085/8086/i386 Arch & programming (1st Unit of Syllabus)
  • Introduction to device interfacing
    • Device type
    • Interfacing

44 of 44

Thanks