INTRODUCTION

To break computers we must first understand how they operate. That’s what this chapter tries to cover, at least in its most fundamental aspects (perhaps a bit more). We are going to cover what RAM is, how it works and what its purpose is. Then, we will dive deep into two of the most important topics in computer architecture: CPU registers and the call stack. Understanding both is imperative for covert operations and malware development (and reverse engineering too, of course).

Grab your favorite note taking app and an energy drink, let’s roll.

Kiwi typing in a tablet - Cyberpunk 2077: Edgerunners

THE RAM

Introduction

The Random-Access Memory, a.k.a RAM, is a form of electronic computer memory that can be read and changed in any order, typically used to store working data and machine code. A random-access memory device allows data items to be read or written in almost the same amount of time irrespective of the physical location of data inside the memory, in contrast with legacy direct-access data storage media (such as hard disks and magnetic tape), where the time required to read and write data items varies significantly depending on their physical locations on the recording medium, due to mechanical limitations such as media rotation speeds and arm movement.

You might already be asking: “if we have high-speed storage such as SATA and NVMe SSDs, what is the difference between RAM and SSDs?”

The main difference is how the data is accessed. RAM is byte-addressable directly by the CPU over the memory bus, whereas secondary storage (even fast SSDs) is block-addressable via I/O controllers. The objective is also different: RAM is volatile, while SSDs are non-volatile (meaning they keep the data they store even after the machine is powered off).

Physical Memory

But what is physical memory anyway? Physical memory refers to the actual hardware components installed on a computer’s motherboard, most commonly the RAM. It is a physical storage medium consisting of microscopic capacitors and transistors that hold electrical charges representing binary data (0s and 1s).

Physical memory is volatile, meaning it requires continuous power to retain data; if the system loses power, the information stored in RAM is lost almost immediately. The total capacity of physical memory is strictly limited by the hardware installed in the machine (how many gigs of RAM your machine has installed). Every byte of this hardware has an absolute, hardwired location known as a physical address, which the system’s memory bus uses to electronically route data to and from the CPU.

Virtual Memory

Virtual memory is an operating system abstraction (a very important one, for that matter). Abstraction is the process of hiding complex, low-level implementation details and exposing only the essential features or behaviors of a system. It allows you to use a tool, function, or concept without needing to know exactly how it works under the hood.

Going back to it, virtual memory is a conceptual layer created by the operating system (like Linux or Windows) to hide the complex reality of the hardware. Instead of forcing programs to interact directly with physical RAM chips (which would be really insufferable to code and manage), the operating system gives every running program (known as a process) the illusion that it has its own massive, private, and unbroken block of memory.

When you write a program in C or C++, the pointers you define and the memory addresses you print out or manipulate are entirely fictional; they are virtual addresses. To the process, memory appears as a blank slate starting from address zero and extending into the gigabytes or terabytes, regardless of how much physical RAM is actually installed in the machine. Fundamentally, each and every process in your machine believes in two beautiful lies:

The machine has “infinite” (technically based on the architecture and other stuff not relevant for us right now) random-access memory;
It is the only process running in the entire machine, which means it believes every byte of such infinite RAM is for itself to use.

Without those two lies, a series of problems would emerge. A primary operational challenge resolved by virtual memory is the strict requirement for absolute physical addressing and the consequent issue of hardware fragmentation.

In an environment lacking virtual memory, the operating system must allocate contiguous physical RAM for a program to load and execute. As multiple applications start, allocate memory, and terminate, the physical RAM becomes heavily fragmented, leaving scattered, non-contiguous gaps of unallocated space. While the total free memory across the system might be sufficient for a new process, the lack of a single, continuous block prevents the program from running. Compiled code inherently expects a sequential memory layout to handle execution flow, arrays, and data structures correctly.

Additionally, executing directly on physical RAM implies that a program’s binary must be compiled to execute at specific, hardcoded physical addresses, or the operating system must perform complex, performance-heavy dynamic relocation of all pointers at load time. If a program is compiled to load at a specific physical base address, but another process already occupies that hardware space, execution will fail.

Because software operates using virtual addresses, but the hardware can only store data at physical addresses, a translation is required every time memory is accessed (including fetching the instruction itself). This involves a coordinated effort between the CPU hardware and the operating system software.

Pages and Frames: To make translation efficient, memory is not tracked byte-by-byte. Instead, both virtual and physical memory are chopped into fixed-size chunks, typically 4 Kilobytes (4096 bytes) in size. A chunk of virtual memory is called a Page, and a chunk of physical memory is called a Frame.
The Page Table: The operating system maintains a data structure in RAM called a Page Table. This acts as the master map. It records exactly which virtual Page corresponds to which physical Frame. It also stores permissions (Read, Write, Execute) for each page.
The MMU (Memory Management Unit): The MMU is a dedicated hardware component located directly inside the CPU. Whenever a C/C++ pointer attempts to read or write data, the CPU sends the virtual address to the MMU. The MMU looks up the address in the Page Table, finds the corresponding physical Frame, and outputs the true physical address to the memory bus.
The TLB (Translation Look-aside Buffer): Because checking the Page Table in RAM for every single instruction would be incredibly slow, the CPU contains a high-speed hardware cache called the TLB. The TLB stores the most recent virtual-to-physical translations. The MMU checks the TLB first; if the translation is there, execution continues at full speed.
Page Faults and Paging: If a process accesses a virtual address that has not been mapped to physical RAM yet, the MMU triggers a hardware exception called a Page Fault. The CPU suspends the process and hands control to the operating system. If the required data is on secondary storage (like a hard drive or SSD swap file), the OS fetches the data, places it into an empty physical Frame, updates the Page Table, and resumes the process.
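
To make the pieces above click together, here is a minimal sketch of the translation path in Python. It is a toy model, not how a real MMU is implemented: the page table contents, the `translate` function and the `PageFault` exception are all made up for illustration, and a real page table is a multi-level structure walked by hardware.

```python
PAGE_SIZE = 4096  # 4 KiB pages, as described above

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {0: 7, 1: 3, 2: 12}
tlb = {}  # tiny stand-in for the TLB: remembers recent translations


class PageFault(Exception):
    """Raised when a virtual page has no mapping (the OS would step in here)."""


def translate(virtual_address):
    # Split the address into a page number and an offset inside that page.
    page, offset = divmod(virtual_address, PAGE_SIZE)

    if page in tlb:
        frame = tlb[page]            # TLB hit: no page table lookup needed
    elif page in page_table:
        frame = page_table[page]     # TLB miss: consult the page table...
        tlb[page] = frame            # ...and cache the result for next time
    else:
        raise PageFault(f"no mapping for virtual page {page}")

    # The offset is unchanged; only the page number is swapped for a frame.
    return frame * PAGE_SIZE + offset


print(hex(translate(0x1234)))  # page 1, offset 0x234 -> frame 3 -> 0x3234
```

Notice that only the upper part of the address changes during translation; the offset inside the 4 KiB page is carried over as-is.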

Memory Hierarchy

Cache memory

Previously we’ve seen the mention of “hardware cache memory”. Hardware cache memory, also known as on-die cache memory, is a small, low-latency, volatile memory tier located directly on or physically adjacent to the processor die. Unlike the Dynamic RAM (DRAM) used in main physical memory, SRAM (the technology behind on-die cache) does not require periodic electrical refreshing, allowing it to operate at frequencies that closely match the CPU’s internal clock cycles.

In other words, cache memory is designed to supply the CPU with data and instructions at a speed matching its execution rate.

The primary purpose of cache memory is to mitigate the von Neumann bottleneck. This bottleneck refers to the significant latency gap between how fast a modern CPU can process operations and how slowly main physical memory (DRAM) can deliver the required data over the motherboard’s memory bus. Without cache, the CPU would spend the majority of its clock cycles idle, waiting for data to arrive from RAM.

To achieve this speed, cache memory is built using SRAM (Static Random Access Memory).

SRAM uses a flip-flop transistor configuration (bistable latching circuitry, for the nerds out there) to hold a binary state. It does not leak charge, meaning it does not need to be constantly refreshed by the memory controller, allowing the fastest cache tier to respond within a handful of CPU clock cycles.
DRAM (Dynamic RAM), used in main memory, relies on microscopic capacitors that leak charge and must be electrically refreshed many times per second, making it inherently slower (often requiring 100+ clock cycles to respond).

Because SRAM requires six transistors per bit of storage (compared to DRAM’s one transistor and one capacitor), it is physically larger and significantly more expensive to manufacture. This size and cost constraint dictates that cache cannot replace RAM entirely; it must be used strategically as a temporary buffer.

How cache works

Cache controllers populate SRAM based on predictive heuristics known as the Principle of Locality. I know, I know, I am also tired of those names but hear me out.

The Principle of Locality rests on two fundamental computing truths:

Temporal Locality: If the CPU accesses a specific memory address (e.g., a counter variable in a loop), it is highly likely to access that exact address again very soon. The cache retains this data to serve subsequent requests instantly.
Spatial Locality: If the CPU accesses a memory address, it is highly likely to access adjacent addresses next (e.g., reading through an array or executing sequential instructions).

To exploit spatial locality, the CPU does not fetch single bytes from RAM. It fetches data in fixed-size blocks called Cache Lines (typically 64 bytes). When a single byte is requested, the entire 64-byte block surrounding it is pulled into the cache.
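
A quick way to see what “the entire 64-byte block surrounding it” means is to compute which cache line an address falls into. The sketch below assumes 64-byte lines and uses made-up addresses; the helper names are mine, not part of any real API.

```python
CACHE_LINE_SIZE = 64  # bytes; the typical value mentioned above


def cache_line_base(address):
    # Clearing the low 6 bits gives the first address of the 64-byte line.
    return address & ~(CACHE_LINE_SIZE - 1)


def same_cache_line(a, b):
    return cache_line_base(a) == cache_line_base(b)


addr = 0x1007
base = cache_line_base(addr)
print(hex(base), hex(base + CACHE_LINE_SIZE - 1))  # 0x1000 0x103f
print(same_cache_line(0x1007, 0x103F))  # True: both live in the same line
print(same_cache_line(0x103F, 0x1040))  # False: adjacent bytes, different lines
```

Requesting the byte at 0x1007 drags in everything from 0x1000 to 0x103f, which is why walking an array sequentially is so much friendlier to the cache than jumping around memory.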

Levels

Because a single cache cannot be both infinitely large and infinitesimally fast, CPU architects use a tiered hierarchy. As you move down the tiers, capacity increases, but latency (delay) also increases.

These primary tiers are located on the CPU Die, the actual physical piece of silicon housing the processing cores.

L1 Cache: The smallest, fastest tier, integrated directly inside each individual CPU core.
- Split Design: It is structurally divided into an L1-I (Instruction Cache) and an L1-D (Data Cache). This split allows the CPU to fetch the next execution instruction and the data required for that instruction simultaneously without the two requests blocking each other on a single bus.
L2 Cache: A larger, slightly slower buffer that backs up the L1 cache. In modern architectures, L2 is also dedicated per individual core. It holds data that was recently evicted from L1 or pre-fetched data that the CPU will likely need soon.
L3 Cache: The largest on-die cache, shared globally across all the CPU cores.
- Purpose: It serves as the final internal buffer before a request is sent out to the slow system RAM. Because it is shared, it plays a critical role in Cache Coherence, a protocol ensuring that if Core 1 alters a variable, Core 2 immediately sees the updated value in the shared L3, preventing the execution of stale data.

L4 cache

If data is not found in L1, L2, or L3, a Cache Miss occurs. Traditionally, the request goes directly to system RAM. However, some specialized or high-performance processors implement an L4 Cache, also known as an External Cache.

L4 is a massive cache tier (often 128MB or more) located physically off the primary compute die, but usually still within the CPU Package (the protective green or blue fiberglass substrate that seats into the motherboard socket). Because standard SRAM is too physically large for this capacity, L4 is typically built using eDRAM (Embedded DRAM) or HBM (High Bandwidth Memory). These technologies are faster than standard motherboard RAM but denser than SRAM.

The purpose of L4 cache is to intercept requests that miss the L3 cache before they traverse the system memory bus to the motherboard. Certain operations, like complex 3D rendering, large-scale database queries or integrated graphics processing, have massive Working Sets (the total amount of memory a process requires active access to at a given moment). These working sets easily overflow the capacity of the L3 cache. The L4 cache acts as a high-bandwidth safety net, preventing these heavy workloads from constantly suffering the severe latency penalties of querying main motherboard DRAM.

The hierarchy

Memory hierarchy organizes a computer’s data storage into layers based on access speed, capacity, and cost. As you move from the CPU outward, memory becomes slower and larger, but cheaper per byte. This structure bridges the gap between ultra-fast processors and high-capacity storage. It exists as a reference for designing high-performance systems following the previously mentioned principles.

The absolute top of the pyramid contains the CPU Registers. We will be covering them in the next section. Next, the second-fastest memory available is SRAM (hardware cache). After that, we have our traditional Random-Access Memory (DRAM), the middle ground between capacity, latency and price per bit. Finally, at the absolute base of the pyramid we have our permanent storage media like SATA and NVMe SSDs. They are much, much slower than any other memory type but they have huge capacities, which makes them perfect for long-term storage.

THE CPU

Introduction

In the previous section we uncovered how memory works, what types of memory there are, what physical addresses are and how they differ from virtual addresses, and we learned about cache memory. We briefly mentioned the CPU and its registers while covering those topics. Now, we will actually dive into the CPU.

The Central Processing Unit (CPU) is the primary logic and control circuitry of a computer. It is a highly complex integrated circuit composed of billions of microscopic transistors fabricated onto a silicon die. The absolute purpose of a CPU is to sequentially execute instructions defined by software. It acts as the orchestration engine for the entire system, performing arithmetic calculations, evaluating logical conditions, and directing the flow of data between memory, storage, and peripheral devices. Software, whether an operating system kernel, a custom C loader, or a PowerShell script, must ultimately be reduced to basic binary instructions that the CPU’s hardware is physically wired to understand.

Fundamental internal components

To understand how a CPU operates, we must understand the discrete hardware units that compose it.

Instruction Set Architecture (ISA)

While not a physical component, the ISA is the foundational design specification of the CPU (e.g., x86_64, ARM64). It defines the exact vocabulary of machine code instructions (opcodes) the hardware can physically execute, the size of memory addresses, and the available registers.

Registers

A processor register is a quickly accessible storage location available to a computer’s processor. Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-only. In computer architecture, registers are typically addressed by mechanisms other than main memory, but may in some cases be assigned a memory address.

Registers operate at the exact speed of the CPU clock. They do not hold abstract variables; they hold the raw, immediate values the CPU is calculating right now.

Instruction Pointer (IP / PC)

A critical register that strictly holds the memory address of the next instruction to be executed. Altering execution flow (like a jump instruction or a buffer overflow payload) relies on modifying this register. Depending on the architecture of the machine, the instruction pointer might have a different name.

IP, the original Instruction Pointer, is a legacy 16-bit register;
EIP, or Extended Instruction Pointer, is IP’s next iteration, introduced with the 32-bit architecture;
RIP is the modern 64-bit version of the register (the R prefix is commonly read as “register”).

General purpose registers

The purpose of general purpose registers is to hold the immediate working data, memory addresses, and arithmetic operands that the CPU is actively processing. While cache and main memory hold data that will be needed, GPRs hold the data being manipulated at the exact nanosecond of execution. In the x86_64 architecture, all mathematical operations, bitwise logic, and memory addressing calculations occur within these registers.

The x86_64 (also known as AMD64) architecture provides 16 primary 64-bit general-purpose registers. While they are termed “general purpose” because modern compilers can use them interchangeably for most calculations, they retain specific historical designations and implicit roles in certain assembly instructions. Knowing these nuances helps us during reverse engineering.

64-bit Register	Historical Name	Primary Implicit Roles & Uses
`RAX`	Accumulator	Holds the primary operand for multiplication/division. Crucially, it dictates the system call number before a `syscall` instruction and stores the return value of functions.
`RBX`	Base	Historically used as a base pointer for memory access. Often used as a callee-saved register (a function must restore its original value before returning).
`RCX`	Counter	Used implicitly as a loop counter (e.g., `rep`, `loop` instructions) and for bit-shift operations. It is also used by the x64 Windows calling convention for passing a function’s first parameter.
`RDX`	Data	Used in conjunction with `RAX` for I/O operations and holding the upper 64 bits of 128-bit multiplication/division results. Also used by the x64 Windows calling convention for passing a function’s second parameter.
`RSI`	Source Index	Holds the source memory pointer for bulk string/memory operations (e.g., `movsb`).
`RDI`	Destination Index	Holds the destination memory pointer for bulk string/memory operations.
`RBP`	Base Pointer	Points to the base of the current stack frame. Used to reference local variables and function arguments relative to a fixed point.
`RSP`	Stack Pointer	Points to the current top of the stack. It is dynamically altered by `push`, `pop`, `call`, and `ret` instructions. Manipulating this register is central to Return Oriented Programming (ROP) chain execution.
`R8` - `R15`	Extended	Introduced with the 64-bit transition. They have no historical naming constraints and are purely general-purpose, though they play specific roles in modern calling conventions. For instance, `R8` and `R9` are used by the x64 Windows calling convention for the third and fourth parameters.

[ ! ] If you don’t know what some terms mean in the above table, don’t worry. We will see many of them later on.

There is an important note to be made about RBP. The table above shows its historical use. Most of the other GPRs still serve, for the most part, the same purpose they have historically served, but RBP stands out because it often does not.

Modern compilers such as MSVC, by default, utilize an optimization called Frame Pointer Omission (FPO). On x64 Windows, the compiler tracks the depth of the stack statically by utilizing metadata (such as the .pdata and .xdata sections of an executable) and avoids using RBP as a base pointer in the vast majority of functions that do not utilize alloca. This, fundamentally, allows RBP to be used as just another general purpose register.

Backwards compatibility

The x86_64 architecture is strictly backwards compatible. You can address fractions of a 64-bit register to manipulate 32-bit, 16-bit, or even 8-bit data without affecting the other bits (with one architectural exception).

Using RAX as an example:

RAX: Full 64-bit register.
EAX: Lower 32 bits.
AX: Lower 16 bits.
AH / AL: The High 8 bits and Low 8 bits of the AX register.

[ ! ] Note: Writing to a 32-bit sub-register automatically zero-extends, clearing the upper 32 bits of the 64-bit register. This is an x64 architectural optimization.
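
The aliasing above is easier to trust once you see it as plain bit masking. The following sketch models RAX as a Python integer; the helper names (`sub_registers`, `write_ax`, `write_eax`) and the sample value are made up for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sub_registers(rax):
    """Return the aliased views of a 64-bit RAX value."""
    return {
        "RAX": rax & MASK64,
        "EAX": rax & 0xFFFFFFFF,   # lower 32 bits
        "AX": rax & 0xFFFF,        # lower 16 bits
        "AH": (rax >> 8) & 0xFF,   # high byte of AX
        "AL": rax & 0xFF,          # low byte of AX
    }


def write_ax(rax, value):
    # 16-bit write: only the low 16 bits change, the rest is preserved.
    return (rax & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_eax(rax, value):
    # 32-bit write: the old upper 32 bits are discarded (zero-extension).
    return value & 0xFFFFFFFF


rax = 0x1122334455667788
for name, value in sub_registers(rax).items():
    print(f"{name}: {value:#x}")

print(hex(write_ax(rax, 0xBEEF)))       # 0x112233445566beef
print(hex(write_eax(rax, 0xDEADBEEF)))  # 0xdeadbeef
```

The last two lines show the exception in action: a write to AX leaves the upper 48 bits alone, while a write to EAX wipes the upper 32 bits.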

Control Unit (CU)

The internal traffic director. It is responsible for reading the binary instructions from memory, interpreting what they mean according to the ISA, and sending the electrical control signals to the rest of the processor to carry out the command.

Arithmetic Logic Unit (ALU)

The mathematical engine. It contains the logic gates required to perform integer arithmetic (addition, subtraction) and bitwise logical operations (AND, OR, XOR, shifts).

The System Clock

An internal oscillator that generates a continuous pulse (measured in Gigahertz, or billions of cycles per second). Every component in the CPU synchronizes its actions to this ticking clock. An operation might take one clock cycle, or it might take several.

Understanding GPRs is heavily tied to how compilers translate C/C++ code into assembly, specifically regarding how functions communicate. This is standardized by Calling Conventions, which we will explore further in the next section.

When executing direct operating system functions, bypassing standard user land APIs (such as initiating indirect syscalls to evade EDR hooks), you must manually load the registers according to the kernel’s expectation.

THE CALL STACK

Introduction

The call stack (also referred to as the execution stack, but most of the time just as “the stack”) is an in-memory (meaning it resides in RAM) dynamic data structure. Its memory is set up by the operating system, and its main purpose is to control the way procedures and functions call each other and the way they pass parameters to each other. Every thread of every process gets its own call stack.

The stack has a “last-in, first-out” (LIFO) structure, meaning that the most recently added item is always the first one you can remove, then the second most recent, then the third and so on. In simple terms: you can’t remove the very first item from the stack without first removing all the items added after it.

Think about it like a book stack. If you add book A first, then book B and book C subsequently, to access book A you would have to remove book C and book B, in this exact order, from their current location.

When something adds to the stack, we say it pushes to the stack. When something removes from the stack, we say it pops from the stack. The reason is simple: those are the assembly instructions that add an item to the call stack (push) and remove one from it (pop), respectively.
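
If you prefer code to books, the same idea can be sketched with a Python list, where the end of the list plays the role of the top of the stack. This is only an analogy for the LIFO behavior, not how the real call stack is stored.

```python
stack = []  # the end of the list is the top of the stack

for book in ("A", "B", "C"):
    stack.append(book)  # push

removed = []
while stack[-1] != "A":
    removed.append(stack.pop())  # pop always takes the most recent item

print(removed)  # ['C', 'B']: what had to come off before reaching A
print(stack)    # ['A']
```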

Of course, the CPU does not magically know where the top of the stack is. For this, it uses a specific CPU register called the Stack Pointer (RSP on 64-bit architecture, ESP on 32-bit). You can think of the Stack Pointer (sometimes also just “SP”) as a bookmark: it always holds the memory address of the most recently added item on the stack. Every time you push or pop, the CPU automatically updates the RSP register to point to the new top.

The Stack Foundation

Let’s comprehend the foundational knowledge about how the stack operates. Because the stack is a LIFO structure, the order items are pushed to it matters. Let’s see an example:

Imagine we have a Procedure 1 that calls a Procedure 2, which in turn calls a Procedure 3, which ends up calling a Procedure 4. This means that, for Procedure 1 to get what it needs from Procedure 2, Procedures 3 and 4 must be run too, so together they look like a chain.

Let’s suppose that Procedure 2 has 2 input parameters and that it declares 2 local variables. Let’s also suppose that both Procedure 3 and Procedure 4 receive and declare the same number of parameters and local variables as Procedure 2.

This is what happens, in our scenario, when Procedure 1 calls Procedure 2:

1. Parameters to the stack

Both needed parameters are pushed to the stack in order to execute Procedure 2. For visualization purposes, let’s see what it usually looks like from a command-line perspective:

procedure2 param1 param2

The stack is a LIFO structure, therefore the first needed parameter comes last:

It would look like this:

For programmers out there, imagine the following function:


def foo(a, b):
	# Reject the call unless a is a string and b is an integer.
	if not isinstance(a, str) or not isinstance(b, int):
		return False

	print(f'[+] Your string is {a}')
	print(f'[+] Your integer is {b}')
	return True

When calling that function, we would do it the following way:

foo('test', 5)

But in assembly, when the stack is used to store function arguments, it would actually be REVERSED:

; Pseudo-assembly. this is not completely correct.
push 5
push 'test'
call foo

This allows the parameters to end up in the correct order inside the stack (first param1, then param2, reading from the top).
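
Here is a small sketch of why pushing in reverse works, again using a Python list as a stand-in for the stack. The `push_arguments` helper is hypothetical and exists only for this illustration.

```python
def push_arguments(args):
    stack = []
    for arg in reversed(args):  # the last argument is pushed first
        stack.append(arg)
    return stack


stack = push_arguments(["test", 5])
print(stack)        # [5, 'test'] (bottom -> top)
print(stack.pop())  # test: the first parameter sits on top
print(stack.pop())  # 5
```

Because the first parameter is pushed last, it is the first thing the callee finds when it looks at the top of the stack.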

2. Return address of P1

[ ! ] This is actually architecture dependent. On x86_64, the RET address is not pushed by an explicit instruction: the call instruction pushes it implicitly, right before control is handed to the callee. Also, depending on the calling convention (like the x64 Windows calling convention), some parameters might not be pushed onto the stack to begin with. This is just an example to understand how the call stack works as a concept.

After pushing the necessary parameters to the stack, the return address into Procedure 1 is pushed as well. The return address is a memory address that tells the CPU where to resume program execution after completing a function call or subroutine. In other words, this is a “get back to me when you are done” address. The stack, then, would look like this:

3. P2 local variables to stack

Finally, Procedure 2 can push its local variables to the stack so it can access them when needed, like all the other data currently in the stack. The stack would now look like this:

With all those values pushed onto the stack, Procedure 2 now has its own stack frame. A stack frame is a specific, contiguous block of memory allocated within the program’s call stack every time a function or subroutine is invoked. Every procedure has its own stack frame, which in turn contains all the data it needs. Procedure 2 can access each element of its stack frame individually and directly, meaning it does not need to pop or push every time it needs data from the stack: it can simply read or write the element at its address inside the frame.

4. Final stack look

Afterwards, Procedure 2 would have to call Procedure 3 and Procedure 3 would have to call Procedure 4, as we discussed. The process described above would repeat, so the final look of the stack would be:

Returning and Cleaning

Once Procedure 4 finishes its work, it is time to return to Procedure 3, then Procedure 2 and, finally, Procedure 1. When returning, the stack frame of the procedure that just finished is cleaned up. Here is an example of how that can look:

The P4 local variables are popped off the stack;
Then the RET address pointing back into P3 is popped off too, which repopulates the program counter in the CPU and returns control to P3 (this is also architecture dependent);
In the convention used in this example, it is P3’s responsibility to finish cleaning up P4’s parameters and, when its own turn comes, to start its own cleanup.

This pattern repeats until it gets back to Procedure 1.
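
The whole chain can be sketched as a toy simulation. The frame layout below (parameters, return target, locals) mirrors the simplified example from this section, not a real ABI, and the `call` and `ret` helpers are made up for illustration.

```python
call_stack = []


def call(caller, callee):
    # One frame per invocation, laid out like the simplified example.
    call_stack.append({
        "owner": callee,
        "params": ["param1", "param2"],
        "return_to": caller,
        "locals": ["local1", "local2"],
    })


def ret():
    frame = call_stack.pop()   # the most recent frame is removed first
    return frame["return_to"]  # control goes back to the caller


call("P1", "P2")
call("P2", "P3")
call("P3", "P4")
print([frame["owner"] for frame in call_stack])  # ['P2', 'P3', 'P4']

while call_stack:
    finished = call_stack[-1]["owner"]
    print(f"{finished} returns to {ret()}")
```

The frames come off in exactly the opposite order they went on: P4 first, then P3, then P2, until control is back in Procedure 1.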

Architectural Differences

As noted before, what data is placed on the stack, and how, is architecture dependent. In our example, a procedure’s parameters are pushed onto the stack before the return address. Also, in our example, the stack seems to be growing upwards, meaning the addresses would go up as new items are added. That is not necessarily the case; I used that explanation only so you can have a concrete, fundamental understanding of how a call stack works. Let’s look at how the x86_64 call stack typically works.

Imagine the following scenario:

push arg3  
push arg2  
push arg1  
call func

The stack in this instance, if accurately represented, would look like this:

Now things start to get a little bit more complicated, so pay attention. Because in x86_64 the stack grows downwards, the base of the stack starts at a high memory address and the stack grows towards lower memory addresses. This is where the mathematical shenanigans of the Stack Pointer come into play.

When you push an item onto the stack, the CPU actually does two things: it subtracts 8 bytes from RSP (moving the pointer to a lower address) and then writes the data to that new address. Conversely, when you pop an item, the CPU reads the data from the current RSP address and then adds 8 bytes to RSP. The data isn’t erased from RAM, the active stack just shrinks, and the pointer moves back up towards the higher base address. Functionally, the stack acts exactly as a LIFO structure, but mechanically, we are just doing math on the Stack Pointer.
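
To see that “math on the Stack Pointer” in motion, here is a toy model of a downward-growing stack. The `Stack64` class and the base address are arbitrary choices for illustration; a dictionary stands in for RAM.

```python
class Stack64:
    """Toy model of a downward-growing stack with 8-byte slots."""

    def __init__(self, base=0x7FFFFFFFF000):  # arbitrary example address
        self.rsp = base
        self.memory = {}  # address -> value

    def push(self, value):
        self.rsp -= 8                  # 1) move RSP to a lower address
        self.memory[self.rsp] = value  # 2) write the data at the new top

    def pop(self):
        value = self.memory[self.rsp]  # 1) read from the current top
        self.rsp += 8                  # 2) move RSP back up
        return value                   # note: nothing is erased from memory


s = Stack64()
start = s.rsp
s.push("arg3")
s.push("arg2")
s.push("arg1")
print(hex(start - s.rsp))  # 0x18: three 8-byte slots below the base
print(s.pop())             # arg1
print(start - s.rsp)       # 16: the stack shrank by one slot
print(len(s.memory))       # 3: the popped value is still sitting in memory
```

That last line matters for us: popped data lingers below RSP until something overwrites it.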

Calling Conventions

CPU math shenanigans aside, now that we understand that the stack might grow upwards or downwards (and how it does it) and that which way it grows is architecture dependent, we might as well learn the differences between calling conventions. Previously it was mentioned that, depending on the calling convention, parameters might not be pushed onto the stack at all. Let’s dive a little bit deeper into the x64 Windows calling convention to understand these differences.

First, let’s answer what is a calling convention. An assembly calling convention is a standardized, low-level protocol that dictates how functions pass data and control back and forth in machine code. It serves as a strict contract between the “caller” (the function initiating the call) and the “callee” (the function being executed). In other words, it is a standard for how functions are called in assembly.

x64 Windows calling convention

Windows x64 Application Binary Interface (a.k.a ABI, a low-level specification that dictates how two compiled binary program modules interact with one another) utilizes a four-register, fast-call calling convention, per Microsoft’s documentation, while also allocating what is known as a shadow space (or home space) on the call stack (at least 32 bytes, usually plus 8 bytes of padding for stack alignment). This is done so callees can save those registers to the stack if they need to (since registers are volatile and get overwritten). One 8-byte “chunk” is reserved for each of the four fast-call registers (8 x 4 = 32).

[rsp+20h] <- for r9
[rsp+18h] <- for r8
[rsp+10h] <- for rdx
[rsp+08h] <- for rcx

The registers used for integer arguments are RCX, RDX, R8 and R9, in that order. The use of registers changes how the values are passed to the function, since the LIFO ordering only applies to the stack. This means assembly stops messing with your head and starts looking like a common modern programming language, where a function call in high-level code looks like this:

foo(arg1, arg2, arg3, arg4)

And exactly the same in assembly, using the “first left, last right” order:

; foo(arg1, arg2, arg3, arg4)
mov rcx, arg1
mov rdx, arg2
mov r8, arg3
mov r9, arg4
call foo

There is a little more nuance to this and we will see it shortly after our previous example, now adapted to Windows calling convention, below:

sub rsp, 40 ; shadow space 32-bytes + 8 bytes for padding.
mov rcx, arg1
mov rdx, arg2
mov r8, arg3
call foo

; cleanup
add rsp, 40 ; clears the shadow space
ret

[ ! ] Windows x64 ABI strictly demands the stack to be 16-byte aligned before making a call.

You might’ve noticed that to allocate the shadow space we subtract instead of adding. This is because, like we have seen, the stack grows downwards. Since it grows downwards, by making RSP smaller we are actually enlarging the stack,

If our function received more than four arguments, we would have to place the extra arguments on the stack, as previously seen. Instead of just pushing them, we write them right above the shadow space, because the shadow space is reserved specifically for the first four parameters, as we’ve discussed:

; short version for brevity
mov rcx, arg1
mov rdx, arg2
mov r8, arg3
mov r9, arg4
mov qword [rsp + 32], arg5 ; places arg5 directly above the shadow space
; ...
call foo

That’s the simplified version. As I mentioned, there is a little bit more nuance about which registers get used for each argument.

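For integer and pointer arguments among the first four, the previously seen registers are used (RCX, RDX, R8 and R9);
For floating-point arguments (floats and doubles) among the first four, the registers used are XMM0, XMM1, XMM2 and XMM3.
In both cases the register is chosen by the argument's position: a double passed as the second argument goes in XMM1, and RDX is simply left unused for that call.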

; floating-point example
; this is pseudo-assembly: real code cannot move an immediate straight into an XMM register
; and would load the constant from memory instead, but this form is easier to visualize

movss xmm0, 1.5 ; movss = mov for floats
movsd xmm1, 2.22 ; movsd = mov for doubles
call foo

Just like for integer arguments, from the fifth argument onward the call stack is used.

movss xmm0, 1.5
movsd xmm1, 2.22
movss xmm2, 24.9
movss xmm3, 0.3
movss dword [rsp + 32], 0.1 ; fifth argument: a float fills the low 4 bytes of its 8-byte slot
; ...
call foo
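One detail that is easy to miss is that the register is picked by the argument's position, not by counting integers and floats separately. Below is a small Python sketch of that rule. The function assign_arguments and its "int"/"float" labels are invented for this example, and it ignores special cases such as variadic functions and large structs:

```python
INT_REGS = ["rcx", "rdx", "r8", "r9"]
FLOAT_REGS = ["xmm0", "xmm1", "xmm2", "xmm3"]


def assign_arguments(kinds):
    """Map a list of argument kinds ("int" or "float") to their locations.

    Stack locations are given from the caller's view, right before `call`.
    """
    locations = []
    for position, kind in enumerate(kinds):
        if position < 4:
            regs = FLOAT_REGS if kind == "float" else INT_REGS
            locations.append(regs[position])          # chosen by position
        else:
            locations.append(f"[rsp+{position * 8}]")  # above the shadow space
    return locations


print(assign_arguments(["int", "float", "int", "float", "int"]))
```

For a mixed signature like this one, the second argument lands in XMM1 and the fourth in XMM3, while RDX and R9 go unused.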

Prologues and Epilogues

Function prologues

Functions have a prologue and an epilogue that are inserted by the compiler to satisfy ABI requirements. Take the following image as an example:

This is the prologue of a main() function. As we’ve previously seen in the registers chapter, RBP and RDI are nonvolatile registers, so the function preserves their original values by pushing them onto the stack. The next instruction, sub rsp,118h, allocates main()’s stack frame. Of the 280 bytes being allocated (118 in hex is 280 in decimal), 32 bytes are the shadow space for the functions main() calls, and the remaining 248 bytes are for main’s local variables (including a bit of padding to keep RSP 16-byte aligned). Since main() receives no arguments, it doesn’t spill any registers.

Let’s see what the prologue for a function called by main() would look like. Take this as an example:

; A foo() function called by main()
mov dword ptr [rsp+20h],r9d
mov dword ptr [rsp+18h],r8d
mov dword ptr [rsp+10h],edx
mov dword ptr [rsp+8h],ecx
push rbp
push rdi
sub rsp,128h

The above example is a hypothetical function the previous main() function might call. The first thing it does is copy the arguments passed via the fast-call registers into the shadow space its caller reserved. Then it preserves the state of RBP and RDI by pushing them onto the stack and, finally, it enlarges the stack by another 296 bytes (128 in hex). Fundamentally, what foo() is doing (just like main() did) is setting up its own stack frame.
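To make the layout concrete, the Python sketch below replays foo()'s prologue on a made-up entry value of RSP and records which address each instruction touches. The function name and the value 0xFF8 are assumptions chosen so the numbers stay small. Any entry RSP that is 8 modulo 16 behaves the same way.

```python
def foo_frame_layout(entry_rsp):
    """Addresses touched by the example foo() prologue, keyed by what they hold.

    entry_rsp is RSP at foo()'s first instruction (a hypothetical value here).
    """
    layout = {
        "home r9": entry_rsp + 0x20,    # mov dword ptr [rsp+20h],r9d
        "home r8": entry_rsp + 0x18,    # mov dword ptr [rsp+18h],r8d
        "home rdx": entry_rsp + 0x10,   # mov dword ptr [rsp+10h],edx
        "home rcx": entry_rsp + 0x08,   # mov dword ptr [rsp+8h],ecx
        "return address": entry_rsp,    # pushed by the caller's `call`
    }
    rsp = entry_rsp
    for reg in ("rbp", "rdi"):          # push rbp / push rdi
        rsp -= 8
        layout[f"saved {reg}"] = rsp
    rsp -= 0x128                        # sub rsp,128h
    layout["rsp after prologue"] = rsp
    return layout


layout = foo_frame_layout(0xFF8)
for name, address in sorted(layout.items(), key=lambda item: -item[1]):
    print(f"{address:#06x}  {name}")
print("aligned:", layout["rsp after prologue"] % 16 == 0)
```

The home slots sit above the return address, in space that belongs to the caller's frame. Everything below the return address belongs to foo(), and the final RSP ends up 16-byte aligned again.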

Function epilogues

Function epilogues, similarly to prologues, are a set of instructions added by the compiler, intended to restore the stack to how it was before execution entered the function. As we’ve seen in the above examples, both main() and foo() enlarge the stack by decreasing the stack pointer. Therefore, before returning, they increase the stack pointer by the same amount to shrink the stack back to its original size. They then pop the nonvolatile registers in the reverse order they were pushed and, finally, the ret instruction pops the return address off the stack and jumps to it.

; example foo() epilogue
add rsp,128h
pop rdi
pop rbp
ret
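The symmetry between prologue and epilogue can be checked with a toy model. The Python sketch below is only an illustration: ToyStack and call_foo are invented names, and memory is just a dictionary. It runs foo()'s prologue, clobbers the saved registers, runs the epilogue, and leaves RSP and the registers exactly as they were.

```python
class ToyStack:
    """Very small model of RSP plus the memory it points to."""

    def __init__(self, rsp):
        self.rsp = rsp
        self.memory = {}

    def push(self, value):
        self.rsp -= 8
        self.memory[self.rsp] = value

    def pop(self):
        value = self.memory[self.rsp]
        self.rsp += 8
        return value


def call_foo(stack, regs, return_address):
    stack.push(return_address)      # what `call foo` does
    # prologue
    stack.push(regs["rbp"])         # push rbp
    stack.push(regs["rdi"])         # push rdi
    stack.rsp -= 0x128              # sub rsp,128h
    # body: foo() is free to clobber the registers it saved
    regs["rbp"], regs["rdi"] = 0xDEAD, 0xBEEF
    # epilogue: the prologue in reverse
    stack.rsp += 0x128              # add rsp,128h
    regs["rdi"] = stack.pop()       # pop rdi
    regs["rbp"] = stack.pop()       # pop rbp
    return stack.pop()              # ret pops the return address


stack = ToyStack(0x1000)
regs = {"rbp": 0x1111, "rdi": 0x2222}
returned_to = call_foo(stack, regs, return_address=0x401000)
print(hex(returned_to), hex(stack.rsp), regs)
```

If the pops were done in the same order as the pushes, RBP and RDI would come back swapped, which is why the epilogue mirrors the prologue.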

Stack Unwinding

Introduction

Now that we have a fundamental knowledge of the stack, assembly calling conventions and registers, we can dive into stack unwinding. This is where a process walks back over the frames on the stack at runtime, which is primarily used for exception handling. If a function throws an exception, then the program begins to unwind, walking back over each frame on the stack, following each return address and cleaning up each frame, until it reaches a function that can catch and handle the exception or the program crashes (i.e. the exception is left unhandled).

When a program crashes and you see something like this:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
   at blablabla.Program.ExFunction3() in C:\blablabla\Program.cs:line 35
   at blablabla.Program.ExFunction2() in C:\blablabla\Program.cs:line 30
   at blablabla.Program.ExFunction1() in C:\blablabla\Program.cs:line 25
   at blablabla.Program.Main(String[] args) in C:\blablabla\Program.cs:line 20

That’s a stack trace: a text representation of the frames that were on the stack when the exception was thrown. It’s useful for developers to see which function threw the exception and what the execution path was at the time. In this case, we see that Main() called ExFunction1(), which called ExFunction2(), which ended up calling ExFunction3(), where the exception happened.
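The same idea can be reproduced in a few lines of Python. This is only an analogy: Python's interpreter unwinds its own frames and does not use the Windows machinery described next. The function names simply mirror the trace above:

```python
import traceback


def ex_function3():
    raise ValueError("something went wrong")


def ex_function2():
    ex_function3()


def ex_function1():
    ex_function2()


def main():
    try:
        ex_function1()
    except ValueError as exc:
        # The exception travelled back through every frame until this handler.
        return [frame.name for frame in traceback.extract_tb(exc.__traceback__)]


print(main())
```

Python lists the frames from the handler down to the function that raised, which is the reverse of the order in the .NET trace.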

Windows x64 implements stack unwinding in a very complicated way. If you have a hard time getting it, rest assured that we all do. This is a best effort at simplifying it without losing key information. Three structures do most of the work:

RUNTIME_FUNCTION: the executable carries a table of these entries (in its .pdata section), one for every function that allocates stack space or calls other functions. Each entry holds the function’s start address, end address, and the address of its UNWIND_INFO structure, all stored as offsets from the image base.
UNWIND_INFO: this structure describes how a function’s prologue sets up the stack and where it saves nonvolatile registers. It indicates whether the function has an exception handler (e.g. a try/catch) and, if so, provides a pointer to that handler. It also contains an array of UNWIND_CODE structures.
UNWIND_CODE: this structure describes a single operation in the function’s prologue, e.g. pushing a nonvolatile register onto the stack, allocating stack space, setting the frame pointer, or saving a register to the stack. This information is used to “undo” the operation when the stack is unwound.
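To give these structures some shape, here is a hedged Python sketch that decodes them from raw bytes. The field layout follows Microsoft's documented definitions as far as I know them. The sample bytes are hand-built to describe the foo() prologue from earlier and do not come from a real binary. The RVAs and code offsets in them are assumptions, and only the three unwind operations needed for that prologue are handled.

```python
import struct

UNWIND_OPS = {0: "UWOP_PUSH_NONVOL", 1: "UWOP_ALLOC_LARGE", 2: "UWOP_ALLOC_SMALL"}
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [
    f"r{n}" for n in range(8, 16)
]


def parse_runtime_function(entry):
    """RUNTIME_FUNCTION is three 32-bit RVAs: begin, end, unwind info."""
    begin, end, unwind_info = struct.unpack("<3I", entry)
    return {"begin": begin, "end": end, "unwind_info": unwind_info}


def parse_unwind_info(blob):
    """Decode the UNWIND_INFO header and a subset of its UNWIND_CODE slots."""
    ver_flags, prolog_size, code_count, frame = struct.unpack_from("<4B", blob, 0)
    info = {
        "version": ver_flags & 0x7,        # low 3 bits
        "flags": ver_flags >> 3,           # high 5 bits
        "prolog_size": prolog_size,
        "frame_register": frame & 0xF,
        "frame_offset": frame >> 4,
        "codes": [],
    }
    slots = [struct.unpack_from("<BB", blob, 4 + 2 * i) for i in range(code_count)]
    i = 0
    while i < code_count:
        offset, op_byte = slots[i]
        op, op_info = op_byte & 0xF, op_byte >> 4
        if op == 0:                        # push of a nonvolatile register
            detail = REGS[op_info]
        elif op == 1 and op_info == 0:     # size / 8 lives in the next slot
            i += 1
            detail = struct.unpack_from("<H", blob, 4 + 2 * i)[0] * 8
        elif op == 2:                      # small allocation, 8 to 128 bytes
            detail = op_info * 8 + 8
        else:
            raise ValueError(f"unwind op {op} is not handled in this sketch")
        info["codes"].append((offset, UNWIND_OPS[op], detail))
        i += 1
    return info


# Hand-built sample for foo(): sub rsp,128h / push rdi / push rbp (assumed offsets).
SAMPLE_UNWIND_INFO = bytes(
    [0x01, 0x1B, 0x04, 0x00, 0x1B, 0x01, 0x25, 0x00, 0x14, 0x70, 0x13, 0x50]
)
SAMPLE_ENTRY = struct.pack("<3I", 0x2000, 0x2080, 0x5000)  # hypothetical RVAs

print(parse_runtime_function(SAMPLE_ENTRY))
for code in parse_unwind_info(SAMPLE_UNWIND_INFO)["codes"]:
    print(code)
```

The codes are stored last operation first, so reading the array front to back undoes the prologue in the right order: release the 296 bytes, then restore RDI, then RBP.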

To unwind the stack, RtlLookupFunctionEntry returns a pointer to the RUNTIME_FUNCTION entry that covers the current instruction pointer. That entry provides the address of the UNWIND_INFO, and RtlVirtualUnwind then walks through its UNWIND_CODE entries, undoing the prologue’s operations to recover the caller’s stack pointer and return address. If an exception handler is present, it gets invoked. The stack continues to unwind, one frame at a time, until the exception is handled or the program crashes.
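Stripped of every detail, the loop looks roughly like the Python sketch below. It is a toy model and not how the real APIs are called. The function table, addresses and memory contents are invented, each function's unwind codes are collapsed into a single number (the bytes its prologue took from RSP), and every frame is assumed to be past its prologue.

```python
import bisect

# Hypothetical function table: (begin, end, name, bytes the prologue takes from RSP).
FUNCTIONS = [
    (0x1000, 0x1040, "main", 0x28),   # sub rsp,28h
    (0x2000, 0x2080, "foo", 0x138),   # two pushes + sub rsp,128h
]
BEGINS = [entry[0] for entry in FUNCTIONS]


def lookup_function(rip):
    """Toy stand-in for RtlLookupFunctionEntry: binary search by address."""
    index = bisect.bisect_right(BEGINS, rip) - 1
    if index >= 0 and rip < FUNCTIONS[index][1]:
        return FUNCTIONS[index]
    return None


def walk_stack(rip, rsp, memory):
    """Return (name, rsp) for each frame until a zero return address is read."""
    frames = []
    while rip != 0:
        entry = lookup_function(rip)
        # A function with no table entry is treated as a leaf: [rsp] is its return address.
        name, frame_bytes = (entry[2], entry[3]) if entry else ("<leaf or unknown>", 0)
        frames.append((name, rsp))
        return_slot = rsp + frame_bytes   # undo the prologue
        rip = memory[return_slot]         # read the return address
        rsp = return_slot + 8             # undo the `call`
    return frames


# Stack contents: foo() returns into main(), and main() has a zero return address.
MEMORY = {0xFF8: 0x1020, 0x1028: 0}
for name, frame_rsp in walk_stack(rip=0x2030, rsp=0xEC0, memory=MEMORY):
    print(f"{frame_rsp:#06x}  {name}")
```

Notice that the walker trusts two things it reads from the process itself: the return addresses on the stack and the frame sizes from the table.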

Security products can use stack unwinding to analyse the origin of suspicious activity, such as API calls, without performing any of the associated stack cleanup. For example, if we have an executable that runs MessageBoxW, we can place a breakpoint on that API and look at the stack frames at the moment it is called.

0:000> bp User32!MessageBoxW
0:000> g
Breakpoint 0 hit

0:000> k
 # Child-SP         RetAddr              Call Site
00 000000c43a12ebd8 00007ff6b2a4101d     USER32!MessageBoxW
01 000000c43a12ebe0 00007ff6b2a41039     Example!ExampleFunction+0x1d
02 000000c43a12ec10 00007ff6b2a41260     Example!main+0x9
03 (Inline Function) ----------------    Example!invoke_main+0x22 
04 000000c43a12ec40 00007ffba142e8d7     Example!__scrt_common_main_seh+0x10c 
05 000000c43a12ec80 00007ffbb215c34c     KERNEL32!BaseThreadInitThunk+0x17 
06 000000c43a12ecb0 00000000`00000000    ntdll!RtlUserThreadStart+0x2c

These frames reveal the proper execution flow of a process from ntdll!RtlUserThreadStart, through Example!main, Example!ExampleFunction and finally User32!MessageBoxW.
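The Child-SP column also tells us how big each frame is. The short Python sketch below copies the values from the output above and subtracts neighbouring stack pointers. The helper name frame_spans is made up, and the inline frame is left out because it has no stack pointer of its own.

```python
# (Child-SP, call site) pairs copied from the `k` output above.
FRAMES = [
    (0x000000C43A12EBD8, "USER32!MessageBoxW"),
    (0x000000C43A12EBE0, "Example!ExampleFunction+0x1d"),
    (0x000000C43A12EC10, "Example!main+0x9"),
    (0x000000C43A12EC40, "Example!__scrt_common_main_seh+0x10c"),
    (0x000000C43A12EC80, "KERNEL32!BaseThreadInitThunk+0x17"),
    (0x000000C43A12ECB0, "ntdll!RtlUserThreadStart+0x2c"),
]


def frame_spans(frames):
    """Distance in bytes from each frame's Child-SP to its caller's Child-SP."""
    return [
        (site, caller_sp - sp)
        for (sp, site), (caller_sp, _) in zip(frames, frames[1:])
    ]


for site, span in frame_spans(FRAMES):
    print(f"{span:#04x} ({span:3d} bytes)  {site}")
```

MessageBoxW spans only 8 bytes because the breakpoint hit at its first instruction, when nothing but the return address had been pushed. ExampleFunction and main each span 30h (48) bytes, which is consistent with a sub rsp, 40 allocation plus the 8-byte return address, although the output alone cannot confirm what their prologues look like.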

By intimately understanding how the stack operates, the x64 Calling Convention, and how RtlVirtualUnwind parses UNWIND_INFO and RUNTIME_FUNCTION structures, malware developers can artificially forge fake stack frames before executing a malicious API call. They manipulate the stack pointers and return addresses so that, when the EDR decides to inspect the stack, it walks a beautifully crafted, 100% legitimate-looking execution path.

CONCLUSION

We have covered a massive amount of ground in this chapter. We started by understanding the physical constraints of RAM and the beautiful lies told by Virtual Memory. We zoomed into the CPU to understand its microscopic brain, the execution flow dictated by registers like RIP, and how General Purpose Registers hold the raw data of our operations.

Finally, we disassembled the Call Stack. We learned that the stack is a dynamic LIFO data structure, we navigated the strict rules of the Windows x64 Calling Convention (and its shadow spaces), and we peeked behind the curtain of Stack Unwinding to see how the system and security products trace execution paths.

Why go through all this trouble? Because you cannot manipulate what you do not fundamentally comprehend. When we eventually start writing shellcode, hooking functions, or evading sophisticated EDRs, this low-level knowledge will be the difference between a payload executing silently in the shadows and it crashing the target process instantly due to a misaligned stack or a corrupted register.

Now that we understand how a computer organizes memory and executes instructions natively, the next step is to understand how the operating system utilizes all those cool abstractions. The next chapters will uncover Portable Executables and Processes and Threads.

As usual,
Keep Hacking,
Sp1d3rM_*^!

Kiwi smoking - Cyberpunk 2077: Edgerunners