Reversing a binary can come easily to some, but knowing how a binary is built from source gives you more to work with when analysing and reversing it.

Today we will understand how a binary is constructed from a .c file.

Compiling a C binary consists of four phases:

  • Preprocessing Phase
  • The Compilation Phase
  • The Assembly Phase
  • The Linking Phase
End-to-End Compilation Process in C

Preprocessing Phase

C source files contain macros (defined with #define) and #include directives. The #include directives pull in the header files (with the extension .h) on which the source file depends.

The preprocessing phase expands every #define and #include directive in the source file, so that what is left is pure C code ready to be compiled.

Let's see how that works:

Suppose we want to compile a C source file that prints the ubiquitous "Hello, world!" message to the screen.

#include <stdio.h>
#define FORMAT_STRING "%s"
#define MESSAGE "Hello, world!\n"
int main(int argc, char *argv[]) {
printf(FORMAT_STRING, MESSAGE);
return 0;
}

By default, gcc will automatically execute all compilation phases, so you have to explicitly tell it to stop after preprocessing and show you the intermediate output.

For gcc, this can be done using the command gcc -E -P, where -E tells gcc to stop after preprocessing and -P tells the preprocessor to leave out linemarkers (the annotations that record which file and line each piece of output came from) so that the output is a bit cleaner.

┌──(himanshu@Kaaammui)-[/tmp/htb]-(16-04-2026 10:26:33)
└─$ gcc -E -P main.c              

typedef long unsigned int size_t;
typedef __builtin_va_list __gnuc_va_list;
typedef unsigned char __u_char;
typedef unsigned short int __u_short;
typedef unsigned int __u_int;
typedef unsigned long int __u_long;
typedef signed char __int8_t;
typedef unsigned char __uint8_t;
typedef signed short int __int16_t;
typedef unsigned short int __uint16_t;
typedef signed int __int32_t;
typedef unsigned int __uint32_t;
typedef signed long int __int64_t;
typedef unsigned long int __uint64_t;

/* ... */

extern size_t fwrite (const void *__restrict __ptr, size_t __size,
        size_t __n, FILE *__restrict __s) __attribute__ ((__nonnull__ (4)));
extern size_t fread_unlocked (void *__restrict __ptr, size_t __size,
         size_t __n, FILE *__restrict __stream)
  __attribute__ ((__nonnull__ (4)));
extern size_t fwrite_unlocked (const void *__restrict __ptr, size_t __size,
          size_t __n, FILE *__restrict __stream)
  __attribute__ ((__nonnull__ (4)));
extern int fseek (FILE *__stream, long int __off, int __whence)
  __attribute__ ((__nonnull__ (1)));
extern long int ftell (FILE *__stream) __attribute__ ((__nonnull__ (1)));
extern void rewind (FILE *__stream) __attribute__ ((__nonnull__ (1)));
extern int fseeko (FILE *__stream, __off_t __off, int __whence)
  __attribute__ ((__nonnull__ (1)));
extern __off_t ftello (FILE *__stream) __attribute__ ((__nonnull__ (1)));
extern int fgetpos (FILE *__restrict __stream, fpos_t *__restrict __pos)
  __attribute__ ((__nonnull__ (1)));
extern int puts (const char *__s);

int main(int argc, char *argv[]) {
printf("%s", "Hello, world!\n");
return 0;
}

The stdio.h header is included in its entirety, with all of its type definitions, global variables, and function prototypes "copied in" to the source file. Because this happens for every #include directive, preprocessor output can be quite verbose.

The preprocessor also fully expands all uses of any macros you defined using #define. In the example, this means both arguments to printf (FORMAT_STRING and MESSAGE) are replaced by the constant strings they represent.

The Compilation Phase

The compilation phase takes the preprocessed code and translates it into assembly language. Most compilers also perform heavy optimization in this phase, usually controlled by an optimization level set on the command line, such as -O0 through -O3 in gcc.

Why not compile directly into machine code?

Consider how many compiled languages exist: C, C++, Objective-C, Common Lisp, Delphi, Go, and Haskell, to name a few.

Writing a compiler that directly emits machine code for each of these languages would be an extremely demanding and time-consuming task.

It's better to instead emit assembly code (a task that is already challenging enough) and have a single dedicated assembler that can handle the final translation of assembly to machine code for every language.

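Assembly generated by the compilation phase for the "Hello, world!" program (-S tells gcc to stop after compilation and write the assembly to main.s, and -masm=intel selects Intel syntax instead of the default AT&T syntax):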

┌──(himanshu@Kaaammui)-[/tmp/htb]-(16-04-2026 10:46:36)
└─$ gcc -S -masm=intel main.c 
┌──(himanshu@Kaaammui)-[/tmp/htb]-(16-04-2026 10:46:53)
└─$ cat main.s  
   .file "main.c"
   .intel_syntax noprefix
   .text
   .section .rodata
.LC0:
   .string "Hello, world!"
   .text
   .globl main
   .type main, @function
main:
.LFB0:
   .cfi_startproc
   push rbp
   .cfi_def_cfa_offset 16
   .cfi_offset 6, -16
   mov rbp, rsp
   .cfi_def_cfa_register 6
   sub rsp, 16
   mov DWORD PTR -4[rbp], edi
   mov QWORD PTR -16[rbp], rsi
   lea rax, .LC0[rip]
   mov rdi, rax
   call puts@PLT
   mov eax, 0
   leave
   .cfi_def_cfa 7, 8
   ret
   .cfi_endproc
.LFE0:
   .size main, .-main
   .ident "GCC: (Debian 15.2.0-15) 15.2.0"
   .section .note.GNU-stack,"",@progbits

Nothing has been stripped at this point, so symbols such as main are still visible, which makes the assembly relatively easy to read.

Any references to code and data are also symbolic, such as the reference to the "Hello, world!" string through the label .LC0. You'll have no such luxury when dealing with stripped binaries. (Stripping a binary removes the symbol tables, debugging symbols, and other metadata that the executable does not need in order to run.)

The Assembly Phase

In the assembly phase, your code is finally turned into real machine code. The assembler takes the assembly files created in the compilation phase and converts them into object files (also called modules). These object files contain machine instructions that the processor can execute.

Generating an object file with gcc (the -c flag stops after the assembly phase, before linking):

┌──(himanshu@Kaaammui)-[/tmp]-(17-04-2026 19:43:08)
└─$ gcc -c main.c                                                
┌──(himanshu@Kaaammui)-[/tmp]-(17-04-2026 19:43:28)
└─$ file main.o                                       
main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

The first part of the file output shows that the file conforms to the ELF specification for binary executables. More specifically, it's a 64-bit ELF file and it is LSB (little-endian), meaning that numbers are stored in memory with their least significant byte first. Most important, the file is relocatable, meaning its code and data are not yet tied to any particular memory address.

Relocatable:

A relocatable file means:

It is NOT a final executable. It is an object file (.o) that still needs to be linked with other files to become a final program.

It does NOT assume any fixed memory address.

Inside this .o file:

  • Functions
  • Variables
  • Symbols

are not assigned final memory addresses yet.

Instead, the object file records relocations: entries that mark each placeholder and say:

"The linker will fill this address later."

Why?

Because gcc -c compiles only one file at a time.

The compiler does NOT know:

  • where the code will be placed in memory
  • where other functions from other object files will live
  • final addresses of functions, variables, globals, etc.

So the compiler leaves these as fixable spots.

Later, during linking (running gcc without -c performs every phase and ends with the link step):

┌──(himanshu@Kaaammui)-[/tmp]-(17-04-2026 19:43:35)
└─$ gcc main.c -o main   

The linker resolves:

  • final addresses
  • symbol references
  • function call targets
  • global variables

This turns one or more relocatable files into an executable or a shared library (.so).

The Linking Phase

The linking phase is the final step where all object files are combined into one executable program. This is done by a tool called the linker.

Object files may depend on functions or variables from other files or libraries. Since the exact memory addresses are not known yet, they use symbolic references (placeholders) instead of real addresses.

There are two types of libraries:

  • Static libraries (.a) → copied into the final executable.
  • Dynamic (shared) libraries (.so) → kept outside the executable, shared between programs, and loaded only when the program runs.

For dynamic libraries, the actual addresses are filled in later by the dynamic linker, either when the program is loaded or when a function is first called.

Most compilers, like GCC, automatically run the linker at the end.

┌──(himanshu@Kaaammui)-[/tmp]-(17-04-2026 19:58:59)
└─$ gcc main.c         
┌──(himanshu@Kaaammui)-[/tmp]-(17-04-2026 20:05:38)
└─$ file a.out 
a.out: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=850c42aa5c5c4017222cf6b9082e8559ea62ee7a, for GNU/Linux 3.2.0, not stripped
┌──(himanshu@Kaaammui)-[/tmp]-(17-04-2026 20:05:42)
└─$ ./a.out                            
Hello, world!

The file command shows that a.out is now a 64-bit ELF executable (a position-independent one, hence "pie"), not just a relocatable file anymore.

It is dynamically linked, which means it uses shared libraries instead of including everything inside the file.

It also shows the interpreter (/lib64/ld-linux-x86-64.so.2), which is the dynamic linker used to load those libraries when the program runs.

When you run ./a.out, it prints "Hello, world!", confirming that the program works correctly.