<asm> Assembly Language;
The closest you can get to the metal — a complete guide from registers to real programs
01. What is Assembly Language?
Assembly language is a low-level programming language with a very strong correspondence between its instructions and the machine code instructions of a specific computer architecture. Unlike high-level languages (Python, Java, C++), there is almost a 1-to-1 mapping between an assembly instruction and a machine code opcode.
Every CPU architecture has its own assembly language — x86, ARM, RISC-V, MIPS — all speak different dialects. An assembler translates human-readable assembly mnemonics into binary machine code the CPU can execute directly.
02. Why Learn Assembly?
| Use Case | Why Assembly Matters |
|---|---|
| OS Kernel Development | Boot loaders, interrupt handlers, context switching require direct hardware control |
| Reverse Engineering | Disassembling malware or closed-source binaries produces assembly output |
| Embedded Systems | Microcontrollers with KB of RAM can't afford compiler overhead |
| Cryptography & SIMD | Hand-optimised vector instructions can outperform compiler output in hot loops |
| Security Research | Shellcode, buffer overflow exploits, ROP chains are written in assembly |
| Understanding Compilers | Reading compiler output helps write faster high-level code |
03. Architecture Overview
x86 (32-bit)
Intel's classic architecture, dominant in PCs from the 1980s through the 2000s. Its eight general-purpose registers include EAX, EBX, ECX, and EDX. Instructions are variable-length (CISC), making encoding complex but code compact.
x86-64 (64-bit)
AMD's 64-bit extension of x86, now universal on desktops and servers. Registers are prefixed with R (RAX, RBX…). Adds 8 new general-purpose registers (R8–R15). The System V AMD64 ABI governs Linux; Microsoft has its own calling convention.
ARM (AArch64)
RISC architecture dominant in smartphones, tablets, Apple Silicon, and embedded systems. Fixed-length 32-bit instructions, 31 general-purpose registers (X0–X30). Designed for power efficiency.
04. CPU Registers
Registers are tiny, lightning-fast storage locations inside the CPU. In x86-64, each 64-bit register has 32-bit, 16-bit, and 8-bit sub-names:
    ; 64-bit    32-bit    16-bit    8-bit
      RAX    →  EAX    →  AX     →  AL / AH
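As a minimal sketch of how these sub-names alias one physical register, the program below writes to progressively wider parts of RAX. It assumes 64-bit Linux and NASM syntax, and the constants are arbitrary illustration values. Writing a 32-bit sub-register zeroes the upper 32 bits of the full register, while 8-bit and 16-bit writes leave the remaining bits untouched.

```nasm
; Sketch only: assumes 64-bit Linux, built with nasm -f elf64 and ld.
global _start

section .text
_start:
    mov rax, 0x1122334455667788
    mov al, 0xFF            ; RAX = 0x11223344556677FF (only bits 0-7 change)
    mov ah, 0x00            ; RAX = 0x11223344556600FF (only bits 8-15 change)
    mov ax, 0x1234          ; RAX = 0x1122334455661234 (only bits 0-15 change)
    mov eax, 1              ; RAX = 0x0000000000000001 (upper 32 bits zeroed)

    mov rdi, rax            ; use the final value as the exit status (1)
    mov eax, 60             ; Linux syscall number for exit
    syscall
```

Running the program and printing the shell's `$?` should show the final value of RAX, which makes the zero-extension visible without a debugger.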
05. Memory Model & Addressing Modes
x86-64 uses a flat memory model — all memory is one large address space. How you reference memory is called an addressing mode:
    ; Immediate – literal value
    mov rax, 42
    ; Register – value in a register
    mov rbx, rax
    ; Direct memory – fixed address
    mov rax, [0x601020]
    ; Register indirect – address in register
    mov rax, [rbx]
    ; Base + offset – struct field access
    mov rax, [rbx + 8]
    ; Base + index * scale + disp – array access
    mov rax, [rbx + rcx*8 + 16]
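To see the base + index * scale form in a complete program, here is a small sketch that sums an array of 64-bit integers. The array name `values` and its contents are hypothetical, and the program assumes 64-bit Linux with a static, non-PIE link via plain `ld`.

```nasm
; Sketch only: 'values' and its contents are made up for illustration.
global _start

section .data
values  dq 3, 5, 7, 9           ; four 64-bit integers

section .text
_start:
    mov rbx, values             ; base address of the array
    xor ecx, ecx                ; index i = 0
    xor eax, eax                ; sum = 0
.next:
    add rax, [rbx + rcx*8]      ; sum += values[i]; scale 8 = sizeof(qword)
    inc rcx                     ; i++
    cmp rcx, 4
    jl  .next                   ; repeat while i < 4

    mov rdi, rax                ; exit status = sum (24)
    mov eax, 60                 ; Linux syscall number for exit
    syscall
```

The scale factor must be 1, 2, 4, or 8, which is why this mode maps so naturally onto arrays of bytes, words, dwords, and qwords.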
Memory Segments
| Segment | Contents | Permissions |
|---|---|---|
| .text | Executable instructions | Read + Execute |
| .data | Initialised global variables | Read + Write |
| .bss | Uninitialised globals (zeroed) | Read + Write |
| Stack | Local vars, return addresses | Read + Write |
| Heap | Dynamic allocations (malloc) | Read + Write |
06. Core Instruction Set
Data Movement
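| Instruction | Description |
|---|---|
| MOV dst, src | Copy value from src to dst |
| PUSH src | Push onto stack; RSP -= 8 |
| POP dst | Pop from stack; RSP += 8 |
| LEA dst, [mem] | Load effective address (pointer arithmetic) |
| XCHG a, b | Swap two operands (atomic when one operand is in memory) |
Arithmetic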
| Instruction | Description |
|---|---|
| ADD dst, src | dst = dst + src; sets flags |
| SUB dst, src | dst = dst - src; sets flags |
| MUL src | Unsigned: RDX:RAX = RAX × src |
| IMUL src | Signed: RDX:RAX = RAX × src |
| DIV src | Unsigned: RDX:RAX ÷ src; quotient→RAX, rem→RDX |
| INC / DEC | Increment or decrement by 1 |
| NEG dst | Two's complement negation |
Bitwise & Logic
| Instruction | Description |
|---|---|
| AND dst, src | Bitwise AND |
| OR dst, src | Bitwise OR |
| XOR dst, src | Bitwise XOR (XOR reg,reg zeros it) |
| NOT dst | Bitwise NOT (one's complement) |
| SHL / SHR | Shift left / right (multiply / unsigned divide by 2ⁿ) |
| ROL / ROR | Rotate bits left / right |
Comparison & Flags
| Instruction | Description |
|---|---|
| CMP a, b | SUB without storing result; sets flags |
| TEST a, b | AND without storing result; sets flags |
07. Program Structure
    ; ── Declarations ──────────────────────────
    global _start               ; entry point for linker
    extern printf               ; external C function

    ; ── Data section ──────────────────────────
    section .data
    msg     db "Hello!", 0x0A   ; string + newline
    msglen  equ $ - msg         ; calc length at assemble time
    count   dq 0                ; 64-bit integer, value 0

    ; ── BSS section ───────────────────────────
    section .bss
    buffer  resb 64             ; reserve 64 bytes (uninitialised)

    ; ── Code section ──────────────────────────
    section .text
    _start:
        ; ... your code here ...
        mov rax, 60             ; syscall: exit
        xor rdi, rdi            ; exit code 0
        syscall
NASM data definition directives: db (byte), dw (word/2B), dd (dword/4B), dq (qword/8B). Reserve uninitialised space with resb/resw/resd/resq.
08. Hello, World! — Complete Program
    global _start

    section .data
    msg     db "Hello, World!", 0x0A
    msglen  equ $ - msg

    section .text
    _start:
        ; write(1, msg, msglen)
        mov rax, 1          ; syscall number: sys_write
        mov rdi, 1          ; file descriptor: stdout
        mov rsi, msg        ; pointer to message
        mov rdx, msglen     ; number of bytes to write
        syscall

        ; exit(0)
        mov rax, 60         ; syscall number: sys_exit
        xor rdi, rdi        ; status = 0
        syscall
    # Assemble into object file
    nasm -f elf64 hello.asm -o hello.o
    # Link into executable
    ld hello.o -o hello
    # Execute
    ./hello
    Hello, World!
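On 64-bit Linux, system calls are made with the syscall instruction. RAX holds the syscall number, and up to 6 arguments go in RDI, RSI, RDX, R10, R8, R9. The return value comes back in RAX, and the kernel overwrites RCX and R11.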
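The same register convention applies to every Linux syscall, not just write and exit. As a hedged sketch, the program below reads up to 64 bytes from standard input with sys_read (syscall number 0) and echoes them back with sys_write. The buffer name `buf` and its size are arbitrary choices for illustration.

```nasm
; Sketch only: assumes 64-bit Linux; 'buf' and its size are illustrative.
global _start

section .bss
buf     resb 64

section .text
_start:
    ; n = read(0, buf, 64)
    xor eax, eax            ; syscall number 0: sys_read
    xor edi, edi            ; file descriptor 0: stdin
    mov rsi, buf            ; destination buffer
    mov edx, 64             ; maximum number of bytes
    syscall                 ; RAX = bytes read, 0 at EOF, negative on error

    test rax, rax
    jle .exit               ; nothing to echo on EOF or error

    ; write(1, buf, n)
    mov rdx, rax            ; byte count returned by read
    mov eax, 1              ; syscall number 1: sys_write
    mov edi, 1              ; file descriptor 1: stdout
    mov rsi, buf
    syscall

.exit:
    mov eax, 60             ; syscall number 60: sys_exit
    xor edi, edi            ; status = 0
    syscall
```

Note that the byte count is copied out of RAX before RAX is reloaded with the next syscall number, since the return value and the syscall number share the same register.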
09. Control Flow
Unconditional Jump
jmp loop_start ; always jump to label
Conditional Jumps (after CMP/TEST)
| Instruction | Condition |
|---|---|
| JE / JZ | Jump if equal / zero flag set |
| JNE / JNZ | Jump if not equal / not zero |
| JL / JG | Jump if less / greater (signed) |
| JLE / JGE | Jump if less-equal / greater-equal |
| JB / JA | Jump if below / above (unsigned) |
Loop Example: Sum 1 to 10
    _start:
        xor rax, rax        ; sum = 0
        mov rcx, 1          ; counter i = 1
    .loop:
        add rax, rcx        ; sum += i
        inc rcx             ; i++
        cmp rcx, 11         ; compare i to 11
        jl  .loop           ; if i < 11, repeat
        ; rax = 55
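The signed and unsigned jump families listed in the table above can give different answers for the same CMP. The sketch below compares the bit pattern 0xFFFFFFFFFFFFFFFF with 1: interpreted as signed it is -1 (less than 1), interpreted as unsigned it is the largest 64-bit value (above 1). The label names and the bit-flag encoding of the exit status are illustrative choices, not part of any convention.

```nasm
; Sketch only: assumes 64-bit Linux. Exit status bit 0 = signed "less" taken,
; bit 1 = unsigned "above" taken.
global _start

section .text
_start:
    mov rax, -1             ; 0xFFFFFFFFFFFFFFFF
    mov rbx, 1
    xor edi, edi            ; status = 0 (done before CMP, since XOR changes flags)

    cmp rax, rbx
    jl  .signed_less        ; taken: as signed values, -1 < 1
    jmp .check_unsigned
.signed_less:
    or  edi, 1

.check_unsigned:
    cmp rax, rbx            ; compare again; OR above overwrote the flags
    ja  .unsigned_above     ; taken: as unsigned values, 2^64 - 1 > 1
    jmp .done
.unsigned_above:
    or  edi, 2

.done:
    mov eax, 60             ; syscall number: sys_exit
    syscall                 ; exit status is 3: both branches were taken
```

Picking the wrong family is a common source of bugs when loop bounds or lengths can exceed the signed range.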
LOOP Instruction
The LOOP instruction decrements RCX and jumps if RCX ≠ 0 — a compact countdown:
        mov rcx, 5
    .repeat:
        ; body executes 5 times
        loop .repeat
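One caveat worth sketching: LOOP decrements before it tests, so entering the loop with RCX = 0 wraps around and runs 2^64 iterations. A common guard is JRCXZ, which jumps when RCX is zero. The example below is a minimal illustration under that assumption; the iteration count of 5 is arbitrary.

```nasm
; Sketch only: assumes 64-bit Linux. Sums 5 + 4 + 3 + 2 + 1 using LOOP.
global _start

section .text
_start:
    xor eax, eax            ; sum = 0
    mov rcx, 5              ; iteration count (try 0 to exercise the guard)
    jrcxz .done             ; skip the loop entirely when RCX == 0
.repeat:
    add rax, rcx            ; sum += RCX
    loop .repeat            ; RCX -= 1; jump back while RCX != 0
.done:
    mov rdi, rax            ; exit status = sum (15)
    mov eax, 60             ; syscall number: sys_exit
    syscall
```

Both LOOP and JRCXZ only encode short jumps, so the loop body has to stay within roughly 128 bytes of the branch.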
10. Stack & Calling Conventions
The stack grows downward in x86-64. RSP always points to the topmost (lowest address) valid data.
    push rax    ; RSP -= 8
                ; [RSP] = RAX
    pop  rbx    ; RBX = [RSP]
                ; RSP += 8
System V AMD64 ABI (Linux calling convention)
| Parameter # | Register | Note |
|---|---|---|
| 1st | RDI | — |
| 2nd | RSI | — |
| 3rd | RDX | — |
| 4th | RCX | — |
| 5th | R8 | — |
| 6th | R9 | — |
| 7th+ | Stack | Right to left |
| Return value | RAX | 64-bit integer result |
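To make the table concrete, here is a small sketch of a function that follows this convention and a caller that passes three integer arguments in registers. The function name `add3` and the argument values are hypothetical.

```nasm
; Sketch only: assumes 64-bit Linux; 'add3' is a made-up helper.
global _start

section .text
; long add3(long a, long b, long c)
;   a in RDI, b in RSI, c in RDX; result returned in RAX
add3:
    lea rax, [rdi + rsi]    ; rax = a + b
    add rax, rdx            ; rax += c
    ret

_start:
    mov rdi, 10             ; 1st argument
    mov rsi, 20             ; 2nd argument
    mov rdx, 12             ; 3rd argument
    call add3               ; RAX = 42

    mov rdi, rax            ; exit status = return value
    mov eax, 60             ; syscall number: sys_exit
    syscall
```

Because `add3` touches only argument registers and RAX, it needs no stack frame and no saved registers.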
Function Prologue & Epilogue
    my_func:
        ; ── Prologue ──
        push rbp            ; save caller's base pointer
        mov  rbp, rsp       ; establish our frame
        sub  rsp, 32        ; allocate 32 bytes of locals

        ; ... function body ...

        ; ── Epilogue ──
        mov  rsp, rbp       ; restore stack pointer
        pop  rbp            ; restore caller's base pointer
        ret                 ; pop return address → RIP
The ABI requires RSP to be 16-byte aligned immediately before every call instruction. After call pushes the return address (8 bytes), RSP is misaligned by 8. The prologue's push rbp restores 16-byte alignment.
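Alignment matters most when calling into C library code, which may use aligned SSE loads and stores on the stack. The following is a hedged sketch of a `main` that calls `printf`. It assumes linking through gcc with `-no-pie` so the C runtime is set up, and the file name `align.asm` and the format string are illustrative.

```nasm
; Sketch only. Assumed build:
;   nasm -f elf64 align.asm -o align.o && gcc -no-pie align.o -o align
global main
extern printf

section .data
fmt     db "value = %d", 0x0A, 0    ; NUL-terminated C format string

section .text
main:
    push rbp                ; entry RSP is 8 mod 16; this push realigns it to 16
    mov  rbp, rsp

    lea  rdi, [rel fmt]     ; 1st argument: format string
    mov  esi, 42            ; 2nd argument: the integer to print
    xor  eax, eax           ; AL = 0: no vector registers used by this variadic call
    call printf

    xor  eax, eax           ; return 0 from main
    pop  rbp
    ret

section .note.GNU-stack noalloc noexec nowrite progbits   ; mark stack non-executable
```

If the function needed local storage, the `sub rsp, N` amount would have to be a multiple of 16 to keep the stack aligned at the call.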
11. Tools: Assemblers & Debuggers
| Tool | Type | Description |
|---|---|---|
| NASM | Assembler | Netwide Assembler — most popular, clean Intel syntax, multiplatform |
| GAS (GNU as) | Assembler | GNU Assembler — uses AT&T syntax by default; part of binutils |
| MASM | Assembler | Microsoft Macro Assembler — Windows only, Intel syntax |
| YASM | Assembler | Rewrite of NASM with extra features |
| GDB | Debugger | GNU Debugger — step through instructions, inspect registers/memory |
| LLDB | Debugger | LLVM debugger — default on macOS, clean TUI |
| Radare2 | Disassembler | Open-source reverse engineering framework |
| Ghidra | Decompiler | NSA-released reversing suite with pseudo-C decompiler |
| godbolt.org | Online IDE | Compiler Explorer — see assembly output for any C/C++ snippet |
    # Disassemble function
    (gdb) disas _start
    # Set breakpoint at label
    (gdb) b _start
    # Step one instruction
    (gdb) si
    # Print all registers
    (gdb) info registers
    # Examine 8 bytes at RSP
    (gdb) x/8xb $rsp
    # Print RAX in hex
    (gdb) p/x $rax
12. Real-World Applications
1. Inline Assembly in C
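    /* GCC/Clang extended asm, AT&T operand order (source, destination) */
    int add(int a, int b) {
        int result = a;
        __asm__ volatile (
            "addl %1, %0"       /* result += b */
            : "+r"(result)      /* read-write operand: a on entry, a + b on exit */
            : "r"(b)            /* input */
            : "cc"              /* addl modifies the flags */
        );
        return result;
    }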
2. SIMD Optimisation
Using SSE/AVX instructions, you can process 4, 8, or 16 values with a single instruction, which is critical for image processing, audio, and machine learning kernels.
    ; Load 4×32-bit floats from memory into XMM registers
    movaps xmm0, [rsi]      ; load A[0..3]
    movaps xmm1, [rdx]      ; load B[0..3]
    addps  xmm0, xmm1       ; xmm0 = A + B (4 additions, 1 instr)
    movaps [rdi], xmm0      ; store result
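The fragment above can be wrapped into a complete program as a rough sketch. The names `vec_a`, `vec_b`, and `vec_sum` and their contents are hypothetical. MOVAPS and memory operands of ADDPS fault on addresses that are not 16-byte aligned, so the sections are declared with 16-byte alignment here; MOVUPS is the unaligned alternative.

```nasm
; Sketch only: assumes 64-bit Linux; vector names and values are made up.
default rel
global _start

section .data align=16
vec_a   dd 1.0, 2.0, 3.0, 4.0
vec_b   dd 10.0, 20.0, 30.0, 40.0

section .bss align=16
vec_sum resd 4

section .text
_start:
    movaps xmm0, [vec_a]                ; load four floats from vec_a
    addps  xmm0, [vec_b]                ; add four floats from vec_b in one instruction
    movaps [vec_sum], xmm0              ; store {11.0, 22.0, 33.0, 44.0}

    cvttss2si edi, dword [vec_sum + 4]  ; convert element 1 (22.0) to an integer
    mov eax, 60                         ; syscall number: sys_exit
    syscall                             ; exit status = 22
```

Converting one lane to an integer exit status is only a convenient way to observe the result without linking a C library for printing.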
3. OS Bootloader
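On legacy BIOS systems, the firmware loads the first 512-byte sector of the boot disk (the MBR) and jumps to it in 16-bit real mode, so that sector must contain raw x86 code. An operating system booted this way starts with a tiny assembly stub that switches to 32-bit protected mode or 64-bit long mode before calling into C.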
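As a minimal sketch of such a stub, the boot sector below prints a short message through the BIOS teletype service (INT 0x10, AH = 0x0E) and then halts. It assumes a legacy BIOS or an emulator that provides one; the message text and the file name `boot.asm` are illustrative. It stays in real mode and does not attempt the switch to protected or long mode.

```nasm
; Sketch only. Assumed build: nasm -f bin boot.asm -o boot.bin
bits 16
org 0x7C00                  ; BIOS loads the boot sector at physical address 0x7C00

start:
    cli
    xor ax, ax
    mov ds, ax              ; data segment = 0 so labels match the org address
    mov ss, ax
    mov sp, 0x7C00          ; stack grows downward from just below the code
    sti
    cld                     ; make LODSB advance SI upward

    mov si, msg
.print:
    lodsb                   ; AL = next byte of the message
    test al, al
    jz  .halt               ; stop at the NUL terminator
    mov ah, 0x0E            ; BIOS teletype output
    xor bh, bh              ; display page 0
    int 0x10
    jmp .print

.halt:
    cli
    hlt
    jmp .halt

msg db "Booted!", 0

times 510 - ($ - $$) db 0   ; pad the sector to 510 bytes
dw 0xAA55                   ; boot signature in the last two bytes
```

The assembled file is exactly 512 bytes and can be tried in an emulator as a raw disk image rather than on real hardware.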