My attempt to teach others about microprocessors and programming in IA-32 and x86_64 (AMD64/Intel 64) assembly, and to spread the word of how awesome it is.
- Pre-requisites
- x86_64 Assembly
The core elements of today's computing devices are consistent with those designed in the early days of computing, so it's worth studying them first, before moving on to their more complex counterparts.
Architecture model | Description |
---|---|
Von Neumann | According to this model, instructions and data are stored in the same memory (you'll come to understand more about this later; the distinction is important in the case of shellcoding). |
Harvard Architecture | According to this model, instructions and data are stored in separate memories, with separate pathways to the CPU. |
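A CPU has many internal components, which we will discuss one by one: the Control Unit, Arithmetic Logic Unit (ALU), Registers, Cache, and Buses.
Name | Description |
---|---|
Control Unit | Directs the operation of the processor: it drives the fetch, decode and execute steps and sends control signals to the other components. |
Arithmetic Logic Unit (ALU) | Performs arithmetic operations (addition, subtraction, ...) and logical operations (AND, OR, NOT, ...) on data. |
Registers | Small, very fast storage locations inside the CPU. Their size depends on the architecture (64 bits for the general purpose registers on amd64), and they're also limited in number. |
CPU Clock | At a low level, the CPU is just another circuit built from sequential and combinational logic, and it needs a clock to synchronize its internal circuitry. The clock sends electrical pulses at regular intervals, which dictates how fast the CPU can execute its internal logic. |
Cache | Small, fast memory between the CPU and main memory that keeps recently used instructions and data close to the processor, so they don't have to be fetched from slower main memory again. |
The number of registers varies from architecture to architecture, but they can be categorized into: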
Name | Description |
---|---|
Accumulator | The most frequently used register, sometimes built into the ALU, used to store intermediary data when logical/ arithmetic calculations are being done |
Instruction register | Holds the instruction currently being decoded and executed by the processor. |
Program Counter (Instruction Pointer) | Keeps track of execution by pointing to the next instruction to be executed after the current one. |
Counters | Used in loops, to count iterations. |
Stack/ Base Pointer | Point to the top of the stack and the base of the current stack frame, respectively; essential for understanding stack frames. |
FLAGS | A register whose bits are independent of one another; each bit (flag) records some aspect of the processor's status, such as whether the last result was zero or produced a carry. |
Additional registers | Depends on the architecture, and they're extensions to the basic set of registers, such as x87, MMX, SSE etc. |
In the case of x86_64, general purpose registers are 64 bits in size.
- The lower 32 bits of `RAX`, `RBX`, `RCX` and `RDX` can be accessed via `EAX`, `EBX`, `ECX` and `EDX`, and their lower 16 bits via `AX`, `BX`, `CX` and `DX`. The lower half of those 16 bits can be accessed via `AL`, `BL`, `CL` and `DL`, and the upper half via `AH`, `BH`, `CH` and `DH`.
- The GPRs `RSI`, `RDI`, `RBP` and `RSP` are 64 bits in size; their lower 32 bits can be accessed via `ESI`, `EDI`, `EBP` and `ESP`, their lower 16 bits via `SI`, `DI`, `BP` and `SP`, and their lower 8 bits via `SIL`, `DIL`, `BPL` and `SPL`.
- There are 8 other GPRs, named `R8` to `R15`. Taking `R8` as an example, the whole 64 bits are accessed via `R8`, the lower 32 bits via `R8D` (double word), the lower 16 bits via `R8W` (word), and the lowest 8 bits via `R8B` (byte).
- Due to design decisions made for x86_64, writing to `EAX` wipes out the upper 32 bits of `RAX`: the 32-bit result is zero-extended to 64 bits. The same applies to the 32-bit form of every other GPR (e.g. writing `R8D` clears the upper half of `R8`).
General purpose registers (64-bit)
RAX, RBX, RCX, RDX                        RSI, RDI, RSP, RBP
                                          R8, R9, R10, R11, R12, R13, R14, R15
┌──────────────────RAX──────────────────┐ ┌─────────────RSI/R8────────────────────┐
┌───────────────────┬─────────┬────┬────┐ ┌───────────────────┬─────────┬────┬────┐
│                   │         │    │    │ │                   │         │    │SIL │
│                   │         │ AH │ AL │ │                   │         │    │R8B │
└───────────────────┴─────────┴────┴────┘ └───────────────────┴─────────┴────┴────┘
                              └───AX────┘                               └─SI/R8W──┘
                    └───────EAX─────────┘                     └────ESI/R8D────────┘
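To make the aliasing concrete, here is a minimal C sketch that treats RAX as a plain 64-bit integer and extracts each named view with casts and shifts. It is only a model of which bits each name refers to. The function names (`eax_of`, `ax_of`, `ah_of`, `al_of`) are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* RAX modelled as a plain 64-bit integer; each named view is a slice of its bits. */
static uint32_t eax_of(uint64_t rax) { return (uint32_t)rax; }        /* bits 0-31 */
static uint16_t ax_of(uint64_t rax)  { return (uint16_t)rax; }        /* bits 0-15 */
static uint8_t  ah_of(uint64_t rax)  { return (uint8_t)(rax >> 8); }  /* bits 8-15 */
static uint8_t  al_of(uint64_t rax)  { return (uint8_t)rax; }         /* bits 0-7  */
```

With `RAX = 0x1122334455667788`, this gives `EAX = 0x55667788`, `AX = 0x7788`, `AH = 0x77` and `AL = 0x88`.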
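- A bus is a group of wires sharing a common function. Buses interconnect components inside the CPU, and connect the CPU to memory and I/O devices.
- Some higher end systems use switched interconnects instead of a bus-based architecture, but that's outside the scope of this post.
Name | Description |
---|---|
Control Bus | Bi-directional in nature (CPU <---> other parts) and used to control the data flow. Control signals (such as read/write) are transferred through this bus, and they synchronize everything connected to the data bus. |
Address Bus | Carries the address of the memory location (or I/O port) being accessed. It is unidirectional: addresses flow from the CPU to memory and devices. |
Data Bus | Carries the data being read or written. It is bi-directional, since data flows both to and from the CPU. |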
The CPU works on the basis of the fetch-decode-execute cycle. The clock rate of a CPU is the number of clock pulses it receives per second (measured in hertz). An instruction can take one or several clock cycles, and modern CPUs overlap the steps of several instructions, so the clock rate is only a rough indication of a processor's speed.
- Most modern CPUs support stored program execution: the instructions to be executed first reside in memory, from which they are fetched into the CPU's instruction register, decoded and executed. This process is known as the fetch-decode-execute cycle.
- The Control Unit drives the fetch, decode, execute and store functions of the processor
initialise the program counter
repeat forever
fetch instruction
increment the program counter
decode the instruction
execute the instruction
end repeat
        ┌────────────┐
 ┌──────►Control Unit├────┐
 │      └────────────┘    │Execute
 │Decode                  │
┌┴────────┐             ┌─▼─┐
│Registers◄─────────────┤ALU│
└─────────┘    Fetch    └───┘
Step | Description |
---|---|
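Fetch | The instruction at the address held in the program counter is read from memory into the instruction register, and the program counter is incremented. |
Decode | The control unit interprets the instruction: which operation to perform and where its operands are. |
Execute | The operation is carried out (for example by the ALU), and the result is written back to a register or to memory. |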
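The pseudocode above can be made concrete with a toy machine. This is a minimal C sketch, not x86: the opcodes `OP_LOAD`, `OP_ADD`, `OP_SUB` and `OP_HALT`, the function `run`, and the two-byte instruction format are all hypothetical, chosen only to show the fetch, increment, decode and execute steps in order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy accumulator machine (not real x86 encodings). */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_SUB = 3 };

/* Runs a program in which every instruction is two bytes: an opcode and an
 * immediate operand. Returns the accumulator when OP_HALT is reached. */
static int run(const uint8_t *memory, size_t size) {
    size_t pc = 0;   /* program counter: address of the next instruction */
    int acc = 0;     /* accumulator */
    while (pc + 1 < size) {
        /* Fetch: copy the instruction at PC into the "instruction register" */
        uint8_t ir_opcode = memory[pc];
        uint8_t ir_operand = memory[pc + 1];
        /* Increment the program counter so it points at the next instruction */
        pc += 2;
        /* Decode the opcode, then execute the matching operation */
        switch (ir_opcode) {
        case OP_LOAD: acc = ir_operand; break;
        case OP_ADD:  acc += ir_operand; break;
        case OP_SUB:  acc -= ir_operand; break;
        default:      return acc;   /* OP_HALT or an unknown opcode: stop */
        }
    }
    return acc;      /* ran off the end of memory */
}
```

For example, the program `{OP_LOAD, 5, OP_ADD, 3, OP_SUB, 2, OP_HALT, 0}` leaves 6 in the accumulator.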
How an instruction locates its operands depends on its addressing mode:
Addressing Modes
Register Direct: Both the operands are registers
ADD EAX, EAX
Register Indirect: A register holds the memory address of the operand (here, EBX holds the address of the value loaded into ECX)
MOV ECX, [EBX]
Immediate: The operand's value is encoded in the instruction itself, immediately after the opcode
ADD EAX, 10
Indexed: The address is calculated using a base address plus an index, which can be another register
MOV EAX, [ESI+0x4010000]
MOV EAX, [EBX+EDI]
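As a rough analogy, each addressing mode can be written as a C expression over a simulated memory. This is a sketch under one simplifying assumption: an "address" here is a cell number in the hypothetical `memory` array, whereas real x86 addresses count bytes. The function names are hypothetical too.

```c
#include <stdint.h>

/* Simulated memory of 32-bit cells. Simplification: an "address" here is a
 * cell number, whereas real x86 addresses count bytes. */
static uint32_t memory[16];

/* Register direct: ADD EAX, EAX (both operands are registers) */
static uint32_t add_register_direct(uint32_t eax) { return eax + eax; }

/* Immediate: ADD EAX, 10 (the value 10 is part of the instruction) */
static uint32_t add_immediate(uint32_t eax) { return eax + 10; }

/* Register indirect: MOV ECX, [EBX] (EBX holds the address of the operand) */
static uint32_t mov_register_indirect(uint32_t ebx) { return memory[ebx]; }

/* Indexed: MOV EAX, [EBX+EDI] (address = base register + index register) */
static uint32_t mov_indexed(uint32_t ebx, uint32_t edi) { return memory[ebx + edi]; }
```

The register-indirect and indexed forms are the ones that actually touch memory. The other two only read registers and the instruction itself.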
Name | Description |
---|---|
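Mnemonics | Human-readable names for instructions, such as MOV or ADD, which an assembler translates into machine code. |
Machine code | The binary encoding of instructions, which the micro-processor understands directly without any middleman. |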
- Instructions are defined as per a specification known as the Instruction Set Architecture (ISA). It specifies things such as the type and size of operands, register states, the memory model, and how interrupts and exceptions are handled; viz. it defines the syntax and semantics of the instructions.
- Some examples are: x86, x86_64, ARM, MIPS, PowerPC, RISC-V etc.
Name | Description |
---|---|
Complex Instruction Sets (CISC) | Many instructions, often of variable length, some of which perform multi-step operations such as operating directly on memory; x86 is an example. |
Reduced Instruction Sets (RISC) | A smaller set of simple instructions, usually of fixed length, with memory accessed only through dedicated load and store instructions; ARM, MIPS and RISC-V are examples. |
Some other approaches are Minimal Instruction Set Computer (MISC), One Instruction Set Computer (OISC), Long Instruction Word (LIW) and Very Long Instruction Word (VLIW), but these are not so common these days.
Micro-architecture is how the instruction set is implemented. Multiple micro-architectures can support the same ISA: for example, both Intel and AMD support the x86 ISA, but with different implementations (micro-architectures).
- The bitness of a CPU (8-, 16-, 32- or 64-bit) is defined by the native word size of its ISA, which is the amount of data the CPU processes at once, viz. if the word size is 1 byte, 1 byte of data can be processed in a single fetch-decode-execute cycle.
- For example, if the ISA specifies 8 data lines, 8 bits can be transferred simultaneously and each register holds 8 bits, so the CPU is 8-bit in nature. The width of the address bus is irrelevant to this classification.
- The native word size also determines the addressable memory, because registers such as the program counter and the stack pointer hold memory addresses, and the native word size defines the size of these registers.
- A 32/64-bit program means something different from a 32/64-bit CPU. A 32-bit program means the CPU operates in 32-bit mode, and only $2^{32}$ addresses (4 GiB of byte-addressable memory) are accessible.
Name | Description |
---|---|
Micro-processor | An electronic chip functioning as the CPU of computer |
Micro-controller | A single chip combining a micro-processor, I/O ports and memory. |
Micro-computer | A computer with a micro-processor and limited resources; it combines the micro-processor with I/O devices and memory. |
CPU = the hardware that executes instructions, can have multiple cores in it
Processor = A physical chip containing one or more CPUs
Core = The basic computational unit of a CPU
Multicore = Having multiple cores on the same CPU
Multiprocessor = Having multiple processors
- Installing the required tools
sudo apt install build-essential clang nasm gdb gdbserver
- A text editor (I personally use neovim)
- A guest OS (x86_64)
Read more about this here: link
Higher addresses
┌───────┐
│ Stack │ Grows downwards
│   │   │ Contains things that are local
│   │   │ to a function (local variables,
│   ▼   │ return addresses, parameters etc)
├───────┤
│   ▲   │ Grows upwards
│ Heap  │ Dynamic memory allocation takes place here
├───────┤
│  BSS  │ Contains uninitialized global/static data
├───────┤
│ Data  │ Initialized global/static variables
├───────┤
│ Text  │ Contains our program code
└───────┘
Lower addresses
# Using gdb
$ gdb -q ./binary
(gdb) break <breakPoint>
(gdb) run
(gdb) info proc mappings
# Using pmap
$ pmap <processID>
;;_start is the entry point: when the program starts, execution jumps to the address of the _start label
;;"global" makes the _start symbol visible to the linker
global _start
section .text
_start:
;;The executable code goes here
mov RAX, 60 ;;exit system call number on x86_64 Linux
xor RDI, RDI ;;exit status 0
syscall
section .data
;;Initialized data goes here
section .bss
;;Uninitialized data goes here
- Read more about assembly, linking and such stuff here
- Read more about position independent code here
# Assemble the code
$ nasm ./code.asm -f elf64 -o output.o
# Linking
$ ld output.o -o finalExecutable #Use the -pie flag to get position independent code
$ ./finalExecutable
Name | Size | Pseudo-instruction |
---|---|---|
Byte | 8 bits | db |
Word | 16 bits | dw |
Double Word | 32 bits (16 * 2) | dd |
Quad Word | 64 bits (16 * 4) | dq |
Double Quad Word | 128 bits (16 * 8) | do |
;;Defining the byte 0x23
db 0x23
;;Defining three bytes in succession in memory: 0x12, 0x34, 0x56
db 0x12, 0x34, 0x56
;;Defining a character constant and a byte
db 'x', 0x00
;;Defining a string constant and a byte in succession
db 'hi', 0x10
;;Defining a word (2 bytes, 16 bits)
dw 0x1234 ; 0x34 0x12 (little-endian)
dw 'a' ; 0x61 0x00
dw 'ab' ; 0x61 0x62
;;Defining a double word (32 bits, 4 bytes)
dd 0x12345678 ; 0x78 0x56 0x34 0x12 (little-endian)
;;Defining a Quad Word (64 bits, 8 bytes)
dq 0x123456789abcdef0 ; 0xf0 0xde 0xbc 0x9a 0x78 0x56 0x34 0x12 (little-endian)
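The byte orders in the comments above follow one rule: least significant byte first. Here is a minimal C sketch of that rule, written with shifts so it gives the same result on any host. The function name `store_le` is hypothetical.

```c
#include <stdint.h>

/* Lay out `value` as `size` bytes, least significant byte first: the order in
 * which NASM emits dw/dd/dq data for x86. */
static void store_le(uint64_t value, uint8_t *out, int size) {
    for (int i = 0; i < size; i++)
        out[i] = (uint8_t)(value >> (8 * i));   /* byte i holds bits 8i..8i+7 */
}
```

Calling `store_le(0x12345678, buf, 4)` fills `buf` with `0x78 0x56 0x34 0x12`, matching the `dd` line above.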
Uninitialized data is stored in the .bss section. Since it has no initial contents, it takes up no space in the object file; the file only records how much space is needed, and the memory (filled with zeros) is allocated when the program is loaded.
;;Reserve a byte
section .bss
label: resb <numberOfBytes> ;;the label will point to the first byte
;;Reserve a word
section .bss
label: resw <numberOfWords> ;;the label will point to the first byte
If we're moving 64-bit data into a 64-bit register, the data will occupy the whole register. When the destination is a 32-bit register, the data occupies the lower 32 bits and the upper 32 bits are zeroed out. When dealing with 8- or 16-bit operands, the other bits are not modified.
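The three cases above can be modelled in C by computing what RAX holds after each kind of write. This is a minimal sketch, and the function names (`write_eax`, `write_ax`, `write_al`, `write_ah`) are hypothetical. Note how only the 32-bit write discards the old upper bits.

```c
#include <stdint.h>

/* Each function returns the new value of RAX after writing `v` to one of its views. */
static uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax;                 /* 32-bit writes zero-extend: the old upper half is lost */
    return (uint64_t)v;
}
static uint64_t write_ax(uint64_t rax, uint16_t v) {
    return (rax & ~UINT64_C(0xFFFF)) | v;              /* bits 16-63 untouched */
}
static uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF)) | v;                /* bits 8-63 untouched */
}
static uint64_t write_ah(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF00)) | ((uint64_t)v << 8);  /* only bits 8-15 change */
}
```

Starting from `RAX = 0x1122334455667788`, writing `0xBEEF` to AX gives `0x112233445566BEEF`, while writing `0xAABBCCDD` to EAX gives `0x00000000AABBCCDD`.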
;;Between registers
mov registerA, registerB
;;Memory to registers
mov RAX, qword [memoryAddress]
mov EAX, dword [memoryAddress]
mov AX, word [memoryAddress]
mov AL, byte [memoryAddress]
;;Register to Memory
mov byte [memoryAddress], AL
mov dword [memoryAddress], EAX
;;Immediate data to register
mov AX, 0x1234
;;Immediate data to Memory
mov byte [label], 0x99
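LEA (Load Effective Address) is used to load pointer values: it computes the address of its memory operand and stores that address in the destination, without reading memory.
lea RAX, [sample] ;;RAX will point to the memory region of sample
lea RBX, [RAX] ;;copies the address held in RAX into RBX; no memory is read (mov RBX, [RAX] would load the contents instead)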
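In C terms, `lea` behaves like the address-of operator `&`, and a `mov` from memory behaves like reading through that address. The sketch below is only an analogy: `sample` stands in for a NASM label pointing at four quad words, and `lea_like`/`mov_like` are hypothetical names.

```c
#include <stdint.h>

/* `sample` stands in for a NASM label pointing at four quad words. */
static int64_t sample[4] = {10, 20, 30, 40};

/* lea RAX, [sample + 8]: computes an address (&sample[1]) and reads no memory */
static int64_t *lea_like(void) { return &sample[1]; }

/* mov RAX, [sample + 8]: reads the 8 bytes stored at that address */
static int64_t mov_like(void) { return sample[1]; }
```

`lea_like()` yields the address of the second element, and `mov_like()` yields the value 20 stored there.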
XCHG swaps the values of its two operands.
XCHG registerA, registerB
XCHG memory, register
XCHG register, memory
ADD registerA, registerB
ADD register, memory
ADD register, immediateData
;;Add with carry
ADC registerA, registerB
;;registerA += registerB + 1 (If carry bit is set)
;;registerA += registerB + 0 (If carry bit is not set)
ADC register, immediateData
ADC register, [memoryAddress]
SUB registerA, registerB
SUB register, memory
SUB register, immediateData
;;Subtract with borrow
SBB registerA, registerB
;;registerA -= registerB + 1 (If carry bit is set)
;;registerA -= registerB + 0 (If carry bit is not set)
SBB register, immediateData
SBB register, [memoryAddress]
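ADC and SBB exist mainly so that numbers wider than a register can be added or subtracted in pieces. Take a 128-bit number held as two 64-bit halves. You `add` the low halves and then `adc` the high halves, so the carry out of the low half flows into the high half. The minimal C sketch below models that sequence, computing the carry flag explicitly. The `u128` type and the function names are hypothetical.

```c
#include <stdint.h>

/* A 128-bit number stored as two 64-bit halves (e.g. RDX:RAX). */
typedef struct { uint64_t hi, lo; } u128;

/* ADD on the low halves, then ADC on the high halves. */
static u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;                /* ADD: wraps modulo 2^64 */
    uint64_t carry = r.lo < a.lo;      /* carry flag: set if the low addition wrapped */
    r.hi = a.hi + b.hi + carry;        /* ADC: adds the carry in */
    return r;
}

/* SUB on the low halves, then SBB on the high halves. */
static u128 sub128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo - b.lo;                /* SUB: wraps modulo 2^64 */
    uint64_t borrow = a.lo < b.lo;     /* carry flag acts as the borrow */
    r.hi = a.hi - b.hi - borrow;       /* SBB: subtracts the borrow too */
    return r;
}
```

For example, adding 1 to `{hi = 0, lo = 0xFFFFFFFFFFFFFFFF}` gives `{hi = 1, lo = 0}`, because the carry moves into the high half.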
inc <register>
inc qword [memoryAddress] ;;memory operands need a size (byte/word/dword/qword)
dec <register>
dec qword [memoryAddress]
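- Implied operands are used, viz. for a/b with a 64-bit divisor, a is always the 128-bit value in RDX:RAX, and b can be any register (or memory operand)
- The quotient will be stored in RAX, and the remainder will be stored in RDX
- Since RDX holds the upper half of the dividend, it is usually zeroed before an unsigned division; if b is 0 or the quotient doesn't fit in RAX, the CPU raises a divide error
xor RDX, RDX ;;clear the upper half of the dividend
div <register>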
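- The first operand must always be present in RAX
- The second operand can be put into any register (or memory)
- The full 128-bit product is stored in RDX:RAX
mul <register> ;;RDX:RAX = RAX * register (upper 64 bits in RDX, lower 64 bits in RAX)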
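The register conventions of `mul` and `div` can be modelled in portable C. This is a minimal sketch with hypothetical function names. `mul64` builds the full 128-bit product from 32-bit halves, so no compiler-specific 128-bit type is needed. `div64` covers only the common case where RDX was zeroed first.

```c
#include <stdint.h>

/* mul src: RDX:RAX = RAX * src. The 128-bit product is built from 32-bit halves
 * so that only standard 64-bit arithmetic is needed. */
static void mul64(uint64_t rax, uint64_t src, uint64_t *rdx_out, uint64_t *rax_out) {
    uint64_t a_lo = rax & 0xFFFFFFFFu, a_hi = rax >> 32;
    uint64_t b_lo = src & 0xFFFFFFFFu, b_hi = src >> 32;
    uint64_t p0 = a_lo * b_lo;   /* each partial product fits in 64 bits */
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* bits 32..63 of the product, plus whatever carries out of them */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);
    *rax_out = (mid << 32) | (p0 & 0xFFFFFFFFu);            /* low 64 bits  */
    *rdx_out = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);  /* high 64 bits */
}

/* div src, assuming RDX was zeroed first: RAX = quotient, RDX = remainder.
 * src must not be 0 (the CPU would raise a divide error; in C it is undefined). */
static void div64(uint64_t rax, uint64_t src, uint64_t *rdx_out, uint64_t *rax_out) {
    *rax_out = rax / src;
    *rdx_out = rax % src;
}
```

For example, multiplying `0xFFFFFFFFFFFFFFFF` by 2 gives RDX = 1 and RAX = `0xFFFFFFFFFFFFFFFE`, and dividing 17 by 5 gives RAX = 3 and RDX = 2.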
not <register>
not <memoryAddress>
and <registerA>, <registerB>
and <register>, <memoryLocation>
or <registerA>, <registerB>
or <register>, <memoryLocation>
xor <registerA>, <registerB>
xor <register>, <memoryLocation>
xor <memoryLocation>, <register>
- The RCX register (ECX in 32-bit code) is used as the counter register: it gets decremented each time `loop` executes, and as soon as it reaches 0, the iteration stops.
- Looping is not as simple as how it's done in HLLs; there's an inherent logic involved, and to really understand a loop you need to step through it instruction by instruction (in gdb) and track all the registers.
;;1. Indentation doesn't matter in ASM, it's only for readability's sake
;;2. The processor runs the fetch-decode-execute cycle, executing instructions in sequence (unless there's a branch)
global _start
section .text
_start:
mov RAX, 0x1 ;;Some data
mov RCX, 0x3 ;;How many times to iterate
someLabel:
ADD RAX, 0x1
loop someLabel
mov RAX, 0x10
;;Exit cleanly so execution doesn't run past the end of the code
mov RAX, 60 ;;exit system call number on x86_64 Linux
xor RDI, RDI ;;exit status 0
syscall
- 1 gets moved into RAX
- 0x3 gets moved into RCX
- `ADD RAX, 0x1` is executed for the first time
- `loop someLabel` is executed, and the value of RCX is decremented by 1
- Since the counter register is not equal to 0, the execution flow jumps to where the label `someLabel` is pointing to
- `ADD RAX, 0x1` is executed accordingly, until the value of RCX becomes 0
- After the value of RCX reaches 0, the next instruction (`mov RAX, 0x10`) is executed
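The walkthrough above corresponds to a C `do ... while` loop, because `loop` decrements RCX first and only then tests it. This is a minimal sketch: `run_loop` is a hypothetical name, and it returns the value RAX holds just before `mov RAX, 0x10`.

```c
#include <stdint.h>

/* Mirrors the loop example: returns RAX as it is just before mov RAX, 0x10. */
static uint64_t run_loop(uint64_t rax, uint64_t rcx) {
    do {
        rax += 1;            /* someLabel: ADD RAX, 0x1 */
        rcx -= 1;            /* loop someLabel: decrement RCX first... */
    } while (rcx != 0);      /* ...then jump back while RCX != 0 */
    return rax;
}
```

With the example's starting values (RAX = 1, RCX = 3), the body runs three times and RAX ends up as 4. Because the decrement happens before the test, starting with RCX = 0 does not skip the loop. Instead, RCX wraps around to its maximum value and the body runs 2^64 times.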
These instructions can be categorised into two types: Conditional jumps and unconditional jumps
- No conditions are checked, and the execution flow is transferred to the location specified
- The target can be given as a label, a register, or a memory operand holding the address
jmp <memoryLocation>
- There are a lot of different conditional jump instructions
- Their names start with a J, followed by one or more letters describing the condition, viz. Jcc (e.g. JE, JNZ, JGE)
- The condition is evaluated from the bits of the FLAGS register, which are set by earlier instructions such as CMP or SUB
Jcc <label>
Here is a reference (taken from Intel's manual)
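To see how the flags decide a jump, here is a minimal C sketch. It models the flags that `CMP a, b` sets: CMP computes `a - b` and discards the result. It then expresses a few Jcc conditions in terms of those flags. The names `flags`, `cmp64`, `je`, `jb`, `ja`, `jl` and `jg` are hypothetical. The flag formulas follow the usual definitions for 64-bit subtraction.

```c
#include <stdbool.h>
#include <stdint.h>

/* The flags CMP a, b sets: it computes a - b and discards the result. */
typedef struct { bool zf, cf, sf, of; } flags;

static flags cmp64(uint64_t a, uint64_t b) {
    uint64_t r = a - b;
    flags f;
    f.zf = (r == 0);                          /* zero flag: a == b */
    f.cf = (a < b);                           /* carry (borrow): unsigned a < b */
    f.sf = (r >> 63) & 1;                     /* sign flag: top bit of the result */
    f.of = (((a ^ b) & (a ^ r)) >> 63) & 1;   /* overflow: signed result wrapped */
    return f;
}

/* A few Jcc conditions, expressed in terms of the flags. */
static bool je(flags f) { return f.zf; }                   /* equal */
static bool jb(flags f) { return f.cf; }                   /* below (unsigned <) */
static bool ja(flags f) { return !f.cf && !f.zf; }         /* above (unsigned >) */
static bool jl(flags f) { return f.sf != f.of; }           /* less (signed <) */
static bool jg(flags f) { return !f.zf && f.sf == f.of; }  /* greater (signed >) */
```

The same bit pattern can compare differently depending on the jump. For `0xFFFFFFFFFFFFFFFF` vs 1, JA is taken (unsigned: a huge number is above 1) and JL is also taken (signed: -1 is less than 1).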
- Procedures are similar to functions in C or other HLLs. In NASM, they are defined using labels and called using the `call` instruction.
- When the program is freshly loaded into memory, the stack is mostly empty. It holds things like `argc`, the command line arguments table (the pointer variables and the actual command line arguments they point to) and the environment variables table (the pointer variables and the actual environment variables they point to).
- Arguments can be passed to a procedure via registers, via the stack, or by passing the address of a data structure present in memory.
procedureLabel:
;;instructions
ret
call procedureLabel
When a sub-procedure is called using `call`, the address of the next instruction (the one beneath the `call` instruction) is pushed onto the stack, and the value of `RIP` is changed to the address `procedureLabel` points to.
When `ret` is executed, the return address on top of the stack is popped into `RIP`, viz. the execution flow returns to the instruction beneath the `call` instruction.
Address  Instruction
┌─────┬──────────────────────┐
│     │ procedureLabel:      │
│  1  │   mov RAX, RBX       │
│  2  │   ret                │
│     │                      │
│  3  │ call procedureLabel  │
│  4  │ xor RAX, RAX         │
└─────┴──────────────────────┘

 Before call              After call               After ret
 RIP = 3                  RIP = 1                  RIP = 4
┌────────────┐           ┌────────────┐           ┌────────────┐
│   (data)   │◄──RSP     │   (data)   │           │   (data)   │◄──RSP
├────────────┤           ├────────────┤           ├────────────┤
│            │           │     4      │◄──RSP     │  4 (stale) │
├────────────┤           ├────────────┤           ├────────────┤
│            │           │            │           │            │
└────────────┘           └────────────┘           └────────────┘
Higher addresses are at the top; the stack grows downwards.
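The call/ret sequence can be checked with a toy C model. This is a sketch, not how the CPU is built: the stack is an array of 16 slots that grows towards lower indices, `rsp` is a slot index, and addresses are the instruction numbers 1 to 4 from the diagram. The names `machine`, `do_call` and `do_ret` are hypothetical. A real `call` works out the return address itself (the address just past the `call` instruction). Here it is passed in explicitly because the toy has no instruction lengths.

```c
#include <stdint.h>

/* Toy machine: a downward-growing stack of 16 slots plus an instruction pointer. */
typedef struct {
    uint64_t slots[16];
    int rsp;          /* index of the current top of stack; 16 means empty */
    uint64_t rip;     /* "address" of the current instruction */
} machine;

/* call target: push the return address, then jump to the target. */
static void do_call(machine *m, uint64_t target, uint64_t return_address) {
    m->slots[--m->rsp] = return_address;   /* push: move RSP down, then store */
    m->rip = target;
}

/* ret: pop the return address into RIP. */
static void do_ret(machine *m) {
    m->rip = m->slots[m->rsp++];           /* pop: load, then move RSP back up */
}
```

Starting from `RIP = 3`, `do_call(&m, 1, 4)` leaves 4 on top of the stack with `RIP = 1`. `do_ret(&m)` then sets `RIP` back to 4 and restores `RSP`.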
- Whenever a procedure is called, its prologue creates a stack frame on the stack, which is like a theoretical wall isolating its data from all data created by previous procedures; when the procedure ends, the theoretical wall is destroyed.
- Two registers are used to maintain the theoretical wall, viz. `RSP` (top of the stack) and `RBP` (base of the current frame).
- When a sub-procedure is called, the current `RBP` is pushed onto the stack, and `RBP` gets the same value as `RSP` (the base of the new wall starts building from here).
- At the very end of a sub-procedure come the `leave` and `ret` instructions. `leave` does the opposite of the step above (it is equivalent to `mov RSP, RBP` followed by `pop RBP`), and `ret` changes `RIP` to the next instruction of the caller.
- Thanks to stack frames, a procedure can push and pop freely within its own frame: as long as it restores `RSP` and `RBP` before returning, the caller's data on the stack is preserved.
procedureLabel:
;;Function prologue
push RBP
mov RBP, RSP
;;Instructions
;;Function epilogue
mov RSP, RBP ;;discard everything this frame put on the stack (between RBP and RSP), it can be overwritten later
pop RBP ;;restore the caller's RBP (these two lines are what leave does)
ret
call procedureLabel
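To check that the prologue and epilogue leave the caller's stack exactly as they found it, here is a toy C model in the same spirit as a downward-growing slot array. It is a sketch: `rsp` and `rbp` are slot indices rather than real addresses, and the names `frame_machine`, `push`, `pop`, `prologue` and `epilogue` are hypothetical. The two values pushed in the usage below stand in for whatever a procedure might put in its frame.

```c
#include <stdint.h>

/* Toy stack for the prologue/epilogue; slots grow towards lower indices. */
typedef struct {
    uint64_t slots[16];
    int rsp, rbp;      /* slot indices standing in for RSP and RBP */
} frame_machine;

static void push(frame_machine *m, uint64_t v) { m->slots[--m->rsp] = v; }
static uint64_t pop(frame_machine *m) { return m->slots[m->rsp++]; }

static void prologue(frame_machine *m) {
    push(m, (uint64_t)m->rbp);   /* push RBP     : save the caller's base pointer */
    m->rbp = m->rsp;             /* mov RBP, RSP : the new frame starts here */
}

static void epilogue(frame_machine *m) {
    m->rsp = m->rbp;             /* mov RSP, RBP : discard everything pushed in this frame */
    m->rbp = (int)pop(m);        /* pop RBP      : restore the caller's base pointer */
}
```

Starting with `rsp = 14` and `rbp = 15`, calling `prologue`, pushing two values, and then calling `epilogue` brings both registers back to 14 and 15, however much the procedure pushed in between.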