How does the Linux Kernel start a Process

In this article, you will learn what happens inside the Linux Kernel when a process calls execve(), how the Kernel prepares the stack and how control is then passed to the userland process for execution.

I had to learn this for the development of Zapper - a Linux tool to delete all command line options from any process (without needing root).

Overview

The Kernel receives SYS_execve() by a userland program.
The Kernel reads the executable file (specific sections) into specific memory locations.
The Kernel prepares the stack, heap, signals, ...
The Kernel passes execution to the userland program.

Examining a binary

Let us start with a simple Linux C program:

int main(int argc, char *argv[0]) {
        return 0;
}

Compile it with gcc -static -o none none.c and find out some details:

$ readelf -h none
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 03 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - GNU
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4014f0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          760112 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         10
  Size of section headers:           64 (bytes)
  Number of section headers:         30
  Section header string table index: 29

The first instructions start at the 'Entry Point' at 0x4014f0. These instructions were created by the compiler (gcc, go, etc). They differ by compiler.

Let's load the binary into gdb and disass 0x4014f0 the instructions. The instructions perform a bit of housekeeping but eventually will call main() (or the GoLang equivalent).

Let's set a break-point at the Entry Point (0x4014f0)and run the app with two command line options (firstarg and secondarg):

gdb ./none
pwndbg> disass 0x4014f0
pwndbg> br *0x4014f0
pwndbg> r firstarg secondarg
 ► 0x4014f0 <_start>       xor    ebp, ebp
   0x4014f2 <_start+2>     mov    r9, rdx
   0x4014f5 <_start+5>     pop    rsi
[...]
──────────────────────[ STACK ]──────────────────────
00:0000│ rsp 0x7ffca4229540 ◂— 0x3
01:0008│     0x7ffca4229548 —▸ 0x7ffca422a4b3 ◂— '/sec/root/none'
02:0010│     0x7ffca4229550 —▸ 0x7ffca422a4c2 ◂— 'firstarg'
03:0018│     0x7ffca4229558 —▸ 0x7ffca422a4cb ◂— 'secondarg'
04:0020│     0x7ffca4229560 ◂— 0x0
05:0028│     0x7ffca4229568 —▸ 0x7ffca422a4d5 ◂— 'BASH_ENV=/etc/shellrc'
[...]

(If you are using gdb without pwngdb then you may need to x/64a $rsp to list the first 64 entries from the stack.)

The Stack Pointer rsp is at 0x7ffd4f48bd10. Let's find out the end of the stack with grep -F '[stack]' /proc/$(pidof none)/maps:

7ffd4f46c000-7ffd4f48d000 rw-p 00000000 00:00 0         [stack]

The Kernel has allocated the stack memory from 0x7ffd4f46c000 to 0x7ffd4f48d000 - a total of 132 KB. It will grow dynamically up to 8MB (ulimit -s kilobytes). Our program (so far; see rsp) only uses the stack from the rsp address (0x7ffd4f48bd10) down to the same end of the stack (0x7ffd4f48d000) - a total of 4,848 bytes (echo $((0x7ffd4f48d000 - 0x7ffd4f48bd10)) == 4848).

This is the 'birth' of the execution: The Kernel, in all its braveness, has passed control to our program. Our program is about to execute its very first instruction - to take its very first step (so to say).

What is on the stack right now is all the information the program gets from the Kernel to run. It contains the argument list, the environment variables and a lot of other interesting information.

For Zapper we had to manipulate the argument list, move stack values around, adjust the pointers and then pass control back to the program - without it falling over. It was prudent to understand a bit better what the Kernel had put on the stack.

Let's dump the stack:

pwndbg> dump binary memory stack.dat $rsp 0x7ffd4f48d000

and load it into hd <stack.dat or xxd <stac.dat and have a little sneak preview...

 03 00 00 00 00 00 00 00  b3 a4 22 a4 fc 7f 00 00  |..........".....|
 c2 a4 22 a4 fc 7f 00 00  cb a4 22 a4 fc 7f 00 00  |..".......".....|
 00 00 00 00 00 00 00 00  d5 a4 22 a4 fc 7f 00 00  |..........".....|
 eb a4 22 a4 fc 7f 00 00  11 a5 22 a4 fc 7f 00 00  |..".......".....|
 25 a5 22 a4 fc 7f 00 00  30 a5 22 a4 fc 7f 00 00  |%.".....0.".....|
[...]
 b4 5c 18 e0 ed f9 fb 0d  30 78 38 36 5f 36 34 00  |.\......0x86_64.|
 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
 00 00 00 2f 73 65 63 2f  72 6f 6f 74 2f 6e 6f 6e  |.../sec/root/non|
 65 00 66 69 72 73 74 61  72 67 00 73 65 63 6f 6e  |e.firstarg.secon|
 64 61 72 67 00 42 41 53  48 5f 45 4e 56 3d 2f 65  |darg.BASH_ENV=/e|
 74 63 2f 73 68 65 6c 6c  72 63 00 43 48 45 41 54  |tc/shellrc.CHEAT|
 5f 43 4f 4e 46 49 47 5f  50 41 54 48 3d 2f 65 74  |_CONFIG_PATH=/et|
[...]
 55 4d 4e 53 3d 31 31 38  00 2f 73 65 63 2f 72 6f  |UMNS=118./sec/ro|
 6f 74 2f 6e 6f 6e 65 00  00 00 00 00 00 00 00 00  |ot/none.........|

Lots of pointers. Lots of strings. Lots of unknowns.

Let's follow the call from execve() to the program's entry point.

The execve() calls the Kernel via a syscall which then calls do_execve ():

Eventually, this ends up in do_execveat_common(). The bprm structure is created and all kinds of information about the program are assigned to it (see binfmts.h).

Important to us, the program's filename, environment variables and options (argv) are copied from Kernel memory to the process's stack. The stack grows towards the lower addresses: The first item put on the stack (the bprm->filename) is thus at the largest address on the stack (the bottom) and above it (e.g. smaller addresses) comes the envp and then the argv.

We follow to bprm_execve() where some checks are completed before calling exec_binprm(). From there into search_binary_handler() where the Kernel checks if the binary is ELF, a shebang (#!) or any other type registered via the binfmt-misc module. The kernel then calls the appropriate function to load the binary.

In our case, it's an ELF binary and so load_elf_binary() is called. The kernel creates the memory and thereafter maps the sections from the binary file into memory. It calls begin_new_exec() to set all credentials and permissions for the new process.

The kernel then checks if the ELF binary should be loaded by an interpreter (ld.so):

or, in the case of a static binary like ours, loaded directly without an interpreter:

Finally, create_elf_tables() is called. This is where all the stack magic happens that we are interested in.

First, the function arch_align_stack() adds a random amount of zeros to the stack (e.g. stack randomization) to make (some) buffer overflow exploits work (a little) less reliably. It then aligns the stack to 16 bytes (e.g. sets the stack pointer to the next lower address that is aligned to 16 bytes with & ~0xf).

The kernel then puts x86_64\0 onto the stack and next adds 16 bytes of random data on top (which libc uses as a seed for its PRNG):

The kernel then creates the elf auxiliary table: A collection of (id, value) pairs that describe useful information about the program being run and the environment it is running in, communicated from the kernel to user space.

The list ends with a Zero Identifier and Zero value (e.g. 16 bytes of 0x00). There are about 20 entries (320 bytes) in the list.

The table starts with ARCH_DLINFO (which expands to AT_SYSINFO_EHDR + AT_MINSIGSTKSZ).

The 'Identifiers' are defined in auxvec.h:

In gdb the elf auxiliary table from our program's stack looks like this (Note: The 'Identifier' values above are in decimal but gdb shows them in hex):

[... above is argc ...]
[... above is argvp ...]
[... above here is envp ...]
0x7ffca42296a8: 0x21    0x7ffca4351000    <-- AT_SYSINFO_EDHR
0x7ffca42296b8: 0x33    0xd30             <-- AT_MINSIGSTKSZ
0x7ffca42296c8: 0x10    0x178bfbff        <-- AT_HWCAP
0x7ffca42296d8: 0x6     0x1000            <-- AT_PAGESZ
0x7ffca42296e8: 0x11    0x64
0x7ffca42296f8: 0x3     0x400040
0x7ffca4229708: 0x4     0x38
0x7ffca4229718: 0x5     0xa
0x7ffca4229728: 0x7     0x0
0x7ffca4229738: 0x8     0x0
0x7ffca4229748: 0x9     0x4014f0         <-- Our entry point
0x7ffca4229758: 0xb     0x0
0x7ffca4229768: 0xc     0x0
0x7ffca4229778: 0xd     0x0
0x7ffca4229788: 0xe     0x0
0x7ffca4229798: 0x17    0x0
0x7ffca42297a8: 0x19    0x7ffca42297f9   <-- Ptr to Random
0x7ffca42297b8: 0x1a    0x2
0x7ffca42297c8: 0x1f    0x7ffca422afe9   <-- Ptr to filename
0x7ffca42297d8: 0xf     0x7ffca4229809   <-- Ptr to x86_64
0x7ffca42297e8: 0x0     0x0                  <-- NULL + NULL
[... ELF table stops here ...]
0x7ffca42297f8: 0xe8e8de3a49831f00      0xdfbf9ede0185cb4 <-- RND16
0x7ffca4229808: 0x34365f363878af        0x0  <-- "x86_64" + '\0'
[... below is empty space from stack randomization ...]
[... below are argv strings ...]
[... below are env strings ...]
[... last is the filename (/root/none) ...]

Thereafter the kernel allocates stack memory to store the elf-aux-table, the argv- and env-pointers and the argc value (+1) and aligns the top of the stack to 16 bytes. (It does not yet copy the elf-aux-table onto the stack just yet. This happens later):

...and then puts the argc, argv-pointers and env-pointers onto the stack:

...and then copies the elf-aux table (elf_info; from above) on the stack (aligned; below the env pointers).

(The clever reader may have noticed that 'RND16' does not start at an aligned address - 0x7ffca42297f9: It is because RND16 was put on the stack before the STACK_ROUND() call to put the elf-info table and env/argv pointers).

Now back in load_elf_binary(), the kernel sets the registers, clears some stuff and finally (!) calls START_THREAD() to start the program.

Afterthought: Someone pointed https://lwn.net/Articles/631631/. Their ASCII art is superior to mine 🫶. It shows the layout of the stack just before execution (but they drew it the other way around; starting with the largest address at the top and the smallest addresses at the bottom):

How to Zapper

Wheeee. What a ride. For Zapper we ptrace() at the entry-point, increase the stack to make a copy of argv/env-strings, adjust all the pointers to point to 'our' copy of the argv/env-strings and ZERO the original ones to 0x00. The kernel does not know about it and still references the now zero'd argv/env-strings and ...wush..they are gone from the process list.

Telegram: https://t.me/thcorg 👈 Releases published here 1 day earlier. 👻

Mastodon: @thc@infosec.exchange

Twat: https://twitter.com/hackerschoice

Knowledge Base

Knowledge Base

How does Linux start a process

...and how to ptrace the entry point and m3ss w1th da stack.

Overview

Examining a binary

How to Zapper