Compiling binary executables
February 22, 2024  |  asm c zigAlmost all compiled languages generate binary executables via the following pipeline.
text -> AST -> IR -> machine code
Here text refers to code written in the programming language in question. In this article we will follow how text passes through the pipeline to produce executable binary machine code.
All examples will be written in C99 and target x86_64
Linux.
To follow along with
every example you will need gcc, make, and binutils
for the ld
linker, as
assembler, and ar
archive utility. The
LLVM toolchain is mentioned, but not necessary.
$ mkdir linking_examples
$ cd linking_examples
You will also need the cproc C compiler, the QBE backend, and
musl libc. On most Linux systems these can be built from source
with make
and gcc
as follows.
$ git clone --depth=1 git://c9x.me/qbe.git
$ cd qbe
$ make
$ make install
$ cd ..
$ git clone --depth=1 https://git.sr.ht/~mcf/cproc
$ cd cproc
$ ./configure
$ make
$ make install
$ cd ..
$ git clone --depth=1 https://git.musl-libc.org/git/musl
$ cd musl
$ ./configure
$ make
$ cd ..
You don’t need to run make install
for musl since the examples will
all use musl/lib
in the working directory. Technically, you don’t need to run
make install
for qbe
or cproc
either:
- you could add them to your path, e.g.
export PATH="/path/to/cproc:$PATH"
; - or just run the local executables, e.g. use
./cproc/cproc-qbe
instead ofcproc-qbe
.
$ echo "
int collatz(int n) {
if (n % 2 == 0) {
return n / 2;
}
return 3 * n + 1;
}
" > collatz.c
An abstract syntax tree or AST is a data structure that is built to represent parsed text code. While AST construction is an important compilation step, it usually is not of too much use to compiler users like us. The LLVM toolchain is the only popular C compiler I know of that will dump a human readable AST.
$ clang -E -Xclang -ast-dump collatz.c | clean-up-by-hand
`-FunctionDecl collatz "int (int)"
|-ParmVarDecl used n "int"
`-CompoundStmt
|-IfStmt
| |-BinaryOperator "int" "=="
| | |-BinaryOperator "int" "%"
| | | |-ImplicitCastExpr "int" <LValueToRValue>
| | | | `-DeclRefExpr "int" lvalue ParmVar "n" "int"
| | | `-IntegerLiteral "int" 2
| | `-IntegerLiteral "int" 0
| `-CompoundStmt
| `-ReturnStmt
| `-BinaryOperator "int" "/"
| |-ImplicitCastExpr "int" <LValueToRValue>
| | `-DeclRefExpr "int" lvalue ParmVar "n" "int"
| `-IntegerLiteral "int" 2
`-ReturnStmt
`-BinaryOperator "int" "+"
|-BinaryOperator "int" "*"
| |-IntegerLiteral "int" 3
| `-ImplicitCastExpr "int" <LValueToRValue>
| `-DeclRefExpr "int" lvalue ParmVar "n" "int"
`-IntegerLiteral "int" 1
collatz.c
file
IR stands for intermediate representation: code designed to run on a simple abstract machine. The IR step is mainly used to perform analysis and optimizations in an architecture agnostic environment. Multiple IR stages may be involved in compilation, e.g. a compiler might have a bespoke internal IR which is converted to a separate IR used by a machine code generation backend (e.g. LLVM or QBE).
$ cproc-qbe collatz.c > collatz.ssa
$ cat collatz.ssa
export function w $collatz(w %.1) {
@start.1
%.2 =l alloc4 4
storew %.1, %.2
@body.2
%.3 =w loadw %.2
%.4 =w rem %.3, 2
%.5 =w ceqw %.4, 0
jnz %.5, @if_true.3, @if_false.4
@if_true.3
%.6 =w loadw %.2
%.7 =w div %.6, 2
ret %.7
@if_false.4
%.8 =w loadw %.2
%.9 =w mul 3, %.8
%.10 =w add %.9, 1
ret %.10
}
Finally, machine code is a collection of binary data and CPU instructions that might eventually be loaded into memory and run on a physical (or virtual) machine. A human readable assembly language version of the output from the machine code step may also be generated.
$ qbe -o collatz.s collatz.ssa
$ cat collatz.s
.text
.globl collatz
collatz:
pushq %rbp
movq %rsp, %rbp
movl %edi, %eax
movl $2, %esi
movl %eax, %ecx
cltd
idivl %esi
movl %ecx, %eax
movl %edx, %ecx
cmpl $0, %ecx
jz .Lbb3
imull $3, %eax, %eax
addl $1, %eax
jmp .Lbb4
.Lbb3:
movl $2, %ecx
cltd
idivl %ecx
.Lbb4:
leave
ret
.type collatz, @function
.size collatz, .-collatz
The binary generated by the basic compilation process is usually an organized collection of machine code called an object file. On Linux object files use the ELF file format. Assembly language can be re-constructed from binaries, a process called disassembly.
$ as -o collatz.o collatz.s
$ objdump -da collatz.o
collatz.o: file format elf64-x86-64
collatz.o
Disassembly of section .text:
0000000000000000 <collatz>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 89 f8 mov %edi,%eax
6: be 02 00 00 00 mov $0x2,%esi
b: 89 c1 mov %eax,%ecx
d: 99 cltd
e: f7 fe idiv %esi
10: 89 c8 mov %ecx,%eax
12: 89 d1 mov %edx,%ecx
14: 83 f9 00 cmp $0x0,%ecx
17: 74 08 je 21 <collatz+0x21>
19: 6b c0 03 imul $0x3,%eax,%eax
1c: 83 c0 01 add $0x1,%eax
1f: eb 08 jmp 29 <collatz+0x29>
21: b9 02 00 00 00 mov $0x2,%ecx
26: 99 cltd
27: f7 f9 idiv %ecx
29: c9 leave
2a: c3 ret
collatz.o
object file
Object files contain a symbol table section that stores
names and types for select pieces of machine code. Each named entry in the
symbol table is called a symbol.
In C a symbol will be generated for each function and global variable
not declared with the static
keyword.
$ readelf -s collatz.o
Symbol table '.symtab' contains 2 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 43 FUNC GLOBAL DEFAULT 1 collatz
collatz.o
contains the exported collatz
function symbol
An executable is an object file with a designated place to start program execution called an entry point. In C it is common for text files to be separately compiled into object files, then the object files are combined to form a single executable.
$ echo "
int printf(const char *fmt, ...);
int collatz(int n);
int main() {
printf("collatz(5) = %d\n", collatz(5));
return 0;
}
" > print_collatz.c
$ cproc-qbe print_collatz.c | qbe | as -o print_collatz.o
$ ld -o print_collatz print_collatz.o collatz.o musl/lib/crt1.o musl/lib/libc.a
$ ./print_collatz
collatz(5) = 16
The process of combining object files together into a single binary
is called linking or static linking1. Most C compilers will
automatically link object files for the C standard library (libc.a
) and the C
runtime (crt1.o
)2. The C runtime defines an entry point that calls the
main
function.
$ readelf -a print_collatz.o | grep "Entry point"
Entry point address: 0x0
$ readelf -a print_collatz | grep "Entry point"
Entry point address: 0x40106e
Object files can contain references to symbols that they do not define. For
example, our print_collatz.o
file refers to two undefined symbols: collatz
and
printf
. During linking the collatz
symbol is defined in collatz.o
and
printf
is provided by libc.a
.
$ readelf -s print_collatz.o
Symbol table '.symtab' contains 5 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 SECTION LOCAL DEFAULT 3 .data
2: 0000000000000000 40 FUNC GLOBAL DEFAULT 1 main
3: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND collatz
4: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND printf
print_collatz.o
Sometimes several object files are combined into a single archive file or static
library for convenient re-use.
The musl implementation of the C standard library compiles into
the static libc.a
library used above.
$ echo "
int fibonacci(int n) {
if (n <= 1) return 1;
return fibonacci(n - 1) + fibonacci(n - 2);
}
" > fibonacci.c
$ cproc-qbe fibonacci.c | qbe | as -o fibonacci.o
$ ar -rs libfunmath.a collatz.o fibonacci.o
$ echo "
int printf(const char *fmt, ...);
int collatz(int n);
int fibonacci(int n);
int main() {
printf("collatz(5) = %d\n", collatz(5));
printf("fibonacci(5) = %d\n", fibonacci(5));
return 0;
}
" > print_funmath.c
$ cproc-qbe print_funmath.c | qbe | as -o print_funmath.o
$ ld -o print_funmath print_funmath.o libfunmath.a musl/lib/crt1.o musl/lib/libc.a
$ ./print_funmath
collatz(5) = 16
fibonacci(5) = 8
An executable that was created using only static linking is called a static executable. The big strength of static executables is that they are simple and self contained: as long as two computers share the same CPU architecture and operating system, a static executable can generally be copied from one machine to the other and run without additional setup.
Sometimes additional metadata will be passed down
between compilation stages in order to preserve information that would otherwise be
lost. E.g. gcc
debug builds (enabled with the -g
flag) use the
DWARF data format to track the name and
line number of the text file from which each piece of machine code originated
(and a lot of other things).
$ gcc -g -c -o collatz_dbg.o collatz.c
$ gcc -g -c -o print_collatz_dbg.o print_collatz.c
$ ld -o print_collatz_dbg collatz_dbg.o print_collatz_dbg.o \
musl/lib/crt1.o musl/lib/libc.a
$ gdb print_collatz_dbg
$(gdb) break main
$(gdb) run
Breakpoint 1, main () at print_collatz.c:5
5 printf("collatz(5) = %d\n", collatz(5));
$(gdb) step
collatz (n=5) at collatz.c:2
2 if (n % 2 == 0) {
$(gdb) step
5 return 3 * n + 1;
$(gdb) step
6 }
$(gdb) step
collatz(5) = 16
main () at print_collatz.c:6
6 return 0;
$(gdb) step
[Inferior 1 (process 367249) exited normally]
While static linking with libraries is one option for code re-use, an alternative is to compile library code into a special shared library binary, also known as a dynamic library. Shared library code is not included directly in the linked executable binary, instead it is loaded into memory at runtime.
There are two primary ways that shared libraries are used: dynamic linking and dynamic loading3.
$ ld -shared -o libfunmath.so fibonacci.o collatz.o
$ ld --dynamic-linker musl/lib/libc.so -L. -rpath=. -Lmusl/lib -rpath=musl/lib \
-lc -lfunmath -o print_funmath_dyn print_funmath.o musl/lib/crt1.o
$ ./print_funmath_dyn
collatz(5) = 16
fibonacci(5) = 8
In dynamic linking a dynamic executable lists a path to a binary called a dynamic linker or program interpreter along with a path to each linked shared library. At runtime the dynamic linker will be used load each shared library into memory and locate the necessary symbols.
$ readelf -a print_funmath_dyn | grep "interpreter"
[Requesting program interpreter: musl/lib/libc.so]
$ readelf -a print_funmath | grep "Shared library"
0x0000000000000001 (NEEDED) Shared library: [libc.so]
0x0000000000000001 (NEEDED) Shared library: [libfunmath.so]
$ readelf --dyn-syms print_funmath_dyn
Symbol table '.dynsym' contains 7 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FUNC GLOBAL DEFAULT UND printf
2: 0000000000000000 0 FUNC GLOBAL DEFAULT UND collatz
3: 0000000000000000 0 FUNC WEAK DEFAULT UND _init
4: 0000000000000000 0 FUNC GLOBAL DEFAULT UND fibonacci
5: 0000000000000000 0 FUNC WEAK DEFAULT UND _fini
6: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __libc_start_main
One benefit of dynamic linking is memory savings: the dynamic linker can sometimes have just one copy of each unique library loaded at a given time, shared between executables4. A shared library can also be patched without necessarily having to re-compile every executable that depends on it.
An executable also has the option to manually load and link a shared library while it is running. This is often called dynamic loading and it is useful for purposes such as hot code swapping, i.e. updating a program’s code while it is running.
$ echo "
int open(const char *fname, int flags, int mode);
long read(int fd, char *buffer, long len);
int inotify_init(void);
int inotify_add_watch(int fd, const char *pname, int mask);
void *dlopen(const char *fname, int flags);
void *dlsym(void *restrict handle, const char *restrict symbol);
int dlclose(void *handle);
int printf(const char *fmt, ...);
int main() {
void *lib_handle;
int (*foo)(int);
struct { int wd, mask, cookie, len; char name[128]; } event;
int infd = inotify_init();
inotify_add_watch(infd, "./lock", 0x200);
while (1) {
lib_handle = dlopen("./libfoo.so", 2);
foo = dlsym(lib_handle, "foo");
printf("foo(5) = %d\n", foo(5));
read(infd, (char *)&event, sizeof(event));
dlclose(lib_handle);
}
}
" > hot_reload.c
$ cproc-qbe hot_reload.c | qbe | as -o hot_reload.o
$ ld --dynamic-linker musl/lib/libc.so -Lmusl/lib -rpath=musl/lib -lc
-o hot_reload hot_reload.o musl/lib/crt1.o
$ mkdir lock
$ echo "int foo(int n) { return 2 * n; }" > foo.c
$ cproc-qbe foo.c | qbe | as -o foo.o
$ ld -shared -o libfoo.so foo.c
$ ./hot_reload.c &
foo(5) = 10
$ touch lock/foo
$ echo "int foo(int n) { return 3 * n; }" > foo.c
$ cproc-qbe foo.c | qbe | as -o foo.o
$ ld -shared -o libfoo.so foo.c
$ rm lock/foo
foo(5) = 15
Note that in general dynamic loading is only supported for dynamic executables.