Understanding Full-Source Bootstrapping
Target audience: Developers, evaluators, and security auditors who want to understand StageX's trust foundation
Time to complete: ~20 minutes
Goal: Understand how StageX builds a complete, modern toolchain from a 181-byte machine code seed, with zero pre-compiled binaries.
Why This Tutorial?
In the Quick Start, you built a Rust binary with StageX and verified it's reproducible. But you might wonder: where did the Rust compiler itself come from? How do we know it hasn't been tampered with?
This is the classic Trusting Trust problem: Ken Thompson's 1984 observation that a compiler can be backdoored to inject vulnerabilities into any program it compiles, including itself. The most direct defense is to bootstrap everything from source, starting from a seed so small that a programmer can audit it by hand.
StageX solves this through full-source bootstrapping — a chain of four stages that builds an entire modern toolchain starting from 181 bytes of hand-crafted x86 machine code, with no pre-compiled binaries at any intermediate step.
The Bootstrap Chain at a Glance
┌──────────────────────────────────────────────────────────────────┐
│ stage0 — From Nothing to C │
│ 181-byte hex0-seed → hex0 → hex1 → hex2 → M2-Planet → kaem │
│ Produces: assemblers, linker, C compiler, build system, utils │
│ Platform: linux/386 (32-bit x86 only) │
└──────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ stage1 — Full 32-bit Userland via live-bootstrap │
│ mes → tcc → musl → gcc-4.0.4 → ... → gcc-13.1.0 → python,etc │
│ Produces: complete 32-bit Linux toolchain & userland │
│ Platform: linux/386 │
└──────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ stage2 — Cross-Compiler Bridge to 64-bit │
│ Cross binutils + Cross GCC → musl → libgcc → libstdc++ │
│ Produces: x86_64 and aarch64 cross-toolchains │
│ Platform: linux/386 (builds), targets: x86_64, aarch64 │
└──────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ stage3 — Native 64-bit Toolchain │
│ gcc-13.1.0 + binutils + musl + cmake + python + busybox │
│ Produces: complete 64-bit build environment for all StageX │
│ Platform: native x86_64 (also usable for aarch64) │
└──────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ stage X — Package Builds (pallet, core, user) │
│ LLVM/Clang, Rust (via pallet-rust), Go, Python, Node.js ... │
│ Produces: every StageX package — all from source │
└──────────────────────────────────────────────────────────────────┘
Every stage is deterministic (same inputs always produce same outputs) and independently verifiable (multiple maintainers build each stage and compare digests).
Stage 0: From 181 Bytes to a C Compiler
The hex0 Seed
At the very bottom of the trust chain is a single file in the StageX repository:
https://codeberg.org/stagex/stagex/src/branch/main/packages/bootstrap/stage0/hex0-seed
curl -s https://codeberg.org/stagex/stagex/raw/branch/main/packages/bootstrap/stage0/hex0-seed | wc -c
181
181 bytes. That's smaller than most email signatures. Every byte was written by hand — no assembler, no compiler, no toolchain of any kind produced this file. It's raw machine code, a 32-bit ELF binary for the i386 architecture:
file <(curl -s https://codeberg.org/stagex/stagex/raw/branch/main/packages/bootstrap/stage0/hex0-seed)
ELF 32-bit LSB executable, Intel i386, version 1 (GNU/Linux), statically linked, no section header
Read the hex dump — every byte is meaningful:
000000 7f 45 4c 46 01 01 01 03 00 00 00 00 00 00 00 00 │.ELF............│
000010 02 00 03 00 01 00 00 00 4c 80 04 08 2c 00 00 00 │........L...,...│
000020 00 00 00 00 00 00 00 00 34 00 20 00 01 00 00 00 │........4. .....│
...
0000b0 04 cd 80 eb b4 │.....│
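The first three rows of the dump are enough to check the `file` output by hand. The following is a minimal sketch that decodes them with Python's standard `struct` module, assuming the standard ELF32 header layout; the variable names are illustrative, and the elided rows of the dump are not reconstructed.

```python
import struct

# First three rows of the hex dump shown above (48 bytes).
HEADER_HEX = """
7f 45 4c 46 01 01 01 03 00 00 00 00 00 00 00 00
02 00 03 00 01 00 00 00 4c 80 04 08 2c 00 00 00
00 00 00 00 00 00 00 00 34 00 20 00 01 00 00 00
"""
header = bytes.fromhex("".join(HEADER_HEX.split()))

# ELF32 header fields up to e_shentsize, little-endian.
(e_ident, e_type, e_machine, e_version, e_entry, e_phoff,
 e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
 e_shentsize) = struct.unpack("<16sHHIIIIIHHHH", header)

print("magic:        ", e_ident[:4])      # b'\x7fELF'
print("class:        ", e_ident[4])       # 1 = ELFCLASS32 (32-bit)
print("byte order:   ", e_ident[5])       # 1 = little-endian (LSB)
print("OS ABI:       ", e_ident[7])       # 3 = GNU/Linux
print("type:         ", e_type)           # 2 = ET_EXEC (executable)
print("machine:      ", e_machine)        # 3 = EM_386 (Intel i386)
print("entry point:  ", hex(e_entry))
print("program hdrs: ", e_phnum, "at offset", hex(e_phoff))
print("section hdrs at offset", e_shoff)  # 0 = no section header table
```

Each decoded field lines up with one phrase of the `file` output: 32-bit, LSB, executable, Intel i386, GNU/Linux, no section header.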
What Does hex0 Do?
hex0 is a hexadecimal-to-binary converter. It reads a text file containing ASCII hexadecimal bytes (like 7f 45 4c 46...) separated by whitespace, and writes the corresponding binary output. Lines starting with # or ; are comments.
That's it. 181 bytes of code that reads hex and writes binary.
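To make the behaviour concrete, here is a rough Python sketch of the same conversion. It is an illustration only, not the seed's actual logic: the function name `hex0_convert` is hypothetical, and it assumes a comment runs from `#` or `;` to the end of the line.

```python
def hex0_convert(text: str) -> bytes:
    """Turn commented hex text into raw bytes, two hex digits per byte."""
    out = bytearray()
    pending = None  # high nibble waiting for its partner
    for line in text.splitlines():
        for ch in line:
            if ch in "#;":
                break  # assumption: the rest of the line is a comment
            if ch in "0123456789abcdefABCDEF":
                nibble = int(ch, 16)
                if pending is None:
                    pending = nibble
                else:
                    out.append((pending << 4) | nibble)
                    pending = None
            # anything else (spaces, tabs) is simply skipped
    return bytes(out)


source = """# ELF magic number
7f 45 4c 46
01 01  ; class and byte order
"""
print(hex0_convert(source))
```

The whole job is a loop that pairs up hex digits, which is why it fits in so few bytes of machine code.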
Why this is revolutionary: Most Linux distributions start with a pre-compiled C compiler (GCC) that was built by an even older compiler, which was built by... the chain goes back decades, ultimately to opaque binary blobs. StageX starts from a program so small that any competent programmer can:
- Read every byte and understand what it does
- Reproduce its functionality from scratch in any language
- Verify that it matches the source (the .hex0 file) with no hidden logic
The hex0 seed has been reproduced with the same hash across multiple Linux distributions using wildly different toolchains, which is strong evidence that it contains nothing hidden.
Stage0 Build Chain
Starting from hex0-seed, Stage0 builds progressively more capable tools. Here's the sequence (from the Stage0 Containerfile):
hex0-seed (181 bytes)
│
├── hex0 — reads hex0 source, writes binary
│ └── hex1 — reads hex1 source (more compact), writes binary
│ └── hex2-0 — reads hex2 source (linker + assembler)
│ │
│ ├── catm — file concatenation utility
│ ├── M0 — micro assembler
│ ├── cc_x86 — primitive C compiler (C → M1 assembly)
│ ├── M2-Planet — C compiler written in C subset
│ ├── blood-elf — ELF metadata extraction tool
│ ├── M1-macro — macro assembler
│ ├── hex2-1 — improved linker
│ ├── M1 — full macro assembler
│ ├── hex2 — final linker
│ ├── kaem — minimal build system (like make)
│ ├── M2-Mesoplanet — simplified C compiler
│ ├── get_machine — CPU detection
│ └── Utilities — sha256sum, untar, ungz, catm, cp, chmod, ...
Every tool in this tree was either:
- Written in hex0/hex1/hex2 assembly (readable as hexadecimal text)
- Compiled by a tool that was itself built from such sources
No binary ever appears that wasn't produced by an earlier stage in the chain.
What Stage0 Produces
The Stage0 build outputs live at /usr/bin/ in the stagex/bootstrap-stage0 image:
| Tool | Purpose | Size |
|---|---|---|
| hex2 | Final linker: combines object files | ~10 KB |
| kaem | Build orchestrator (like make) | ~15 KB |
| M1 | Macro assembler: assembly → machine code | ~40 KB |
| M2-Planet | Full C compiler (subset of C, compiles itself) | ~80 KB |
| M2-Mesoplanet | Simplified C compiler for utility tools | ~50 KB |
| blood-elf | ELF symbol extraction | ~15 KB |
| sha256sum | SHA-256 hash verification | ~10 KB |
| untar, ungz, unxz | Archive extraction utilities | ~10 KB each |
The entire Stage0 image is about 2 MB — a complete self-hosting development environment built from 181 bytes.
Stage 1: The Long Climb — Full 32-bit Userland
Stage1 takes the primitive tools from Stage0 and builds a complete 32-bit Linux userland. This is the most complex stage — the Containerfile is nearly 300 lines, and the source list encompasses hundreds of packages.
The Build Sequence (Simplified)
Stage0 tools (kaem, M2-Planet, hex2, ...)
│
├── checksum-transcriber — verify source integrity
├── mes-0.27 — GNU Mes (Scheme interpreter + C compiler)
├── nyacc — Scheme-based parser generator
│
├── tcc-0.9.26 → tcc-0.9.27 — TinyCC (small, fast C compiler)
│ │
│ ├── musl-1.1.24 — C library (essential for GCC)
│ ├── gcc-4.0.4 — First GCC in the chain (built by TinyCC)
│ │ └── gcc-4.7.4 → gcc-10.4.0 → gcc-13.1.0
│ │ Incremental GCC upgrades through versions
│ │
│ ├── binutils-2.30 → binutils-2.41
│ ├── make-3.82 → make-4.2.1
│ ├── coreutils-5.0 → coreutils-9.4
│ ├── bash-2.05b → bash-5.2.15
│ ├── perl-5.000 → perl-5.32.1
│ ├── python-2.0.1 → python-3.11.1
│ └── autoconf, automake, libtool, bison, flex, gawk,
│ sed, grep, patch, tar, xz, gzip, bzip2, openssl, ...
Why So Many Steps?
Each version upgrade in the chain exists because:
- No compiler can skip versions: GCC 4.0.4 can build GCC 4.7.4, which can build GCC 10.4.0, which can build GCC 13.1.0. But GCC 4.0.4 cannot directly build GCC 13.1.0, because the language standards and build system have changed too much.
- Incremental bootstrapping: each new version is built by the previous version until we reach a modern toolchain. This is the same principle as climbing a ladder one rung at a time.
- Self-hosting verification: at each major version, the compiler is used to rebuild itself. A matching result shows the compiler is self-consistent; the assurance against tampering comes from combining this check with the source-only chain beneath it.
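The "one rung at a time" idea can be modelled as a path search over "can build" relationships. This is a small sketch using only the four GCC versions named above; the `CAN_BUILD` table and `build_path` helper are illustrative and not part of live-bootstrap.

```python
from collections import deque

# Direct "can build" steps named in the text; nothing else is assumed.
CAN_BUILD = {
    "gcc-4.0.4": ["gcc-4.7.4"],
    "gcc-4.7.4": ["gcc-10.4.0"],
    "gcc-10.4.0": ["gcc-13.1.0"],
    "gcc-13.1.0": [],
}


def build_path(start, goal):
    """Breadth-first search for the shortest chain of builds from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in CAN_BUILD.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of builds exists


print(" -> ".join(build_path("gcc-4.0.4", "gcc-13.1.0")))
```

There is no direct edge from gcc-4.0.4 to gcc-13.1.0, so the only route passes through every intermediate version.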
What Stage1 Produces
The Stage1 image is about 300 MB — a complete 32-bit development environment with:
- GCC 13.1.0 (C, C++, Objective-C, Fortran)
- binutils 2.41 (assembler, linker, archiver)
- musl 1.2.4 (C library)
- bash 5.2.15, coreutils 9.4, make 4.2.1
- python 3.11.1, perl 5.32.1
- openssl 3.0.13, autotools, and dozens more
Everything was built from source, starting from those 181 bytes.
Stage 2: The Architecture Bridge
Stage1 is 32-bit x86 only. But modern hardware is 64-bit (x86_64 and aarch64). Stage2 builds cross-compilers that run on 32-bit x86 but produce code for 64-bit targets.
32-bit Stage1 toolchain (gcc-13.1.0, binutils, musl)
│
├── Cross binutils for x86_64-linux-musl
├── Cross binutils for aarch64-linux-musl
│
├── Cross GCC stage 1 (static libgcc only) for x86_64
├── Cross GCC stage 1 (static libgcc only) for aarch64
│
├── Cross musl (C library) for x86_64
├── Cross musl (C library) for aarch64
│
├── Cross GCC stage 2 (shared libgcc + libstdc++) for x86_64
└── Cross GCC stage 2 (shared libgcc + libstdc++) for aarch64
Why a Separate Bridge Stage?
Cross-compilation is complex. Building a compiler that runs on architecture A but produces code for architecture B requires careful ordering:
- Binutils first — You need a cross-assembler and cross-linker before you can build anything for the target
- Minimal GCC — A bootstrap GCC with only static libraries can compile a minimal C library
- C library — musl must be cross-compiled with the minimal GCC
- Full GCC — With the C library available, GCC can build shared libraries and C++ support
Splitting this into its own stage means Stage1 stays focused on building the 32-bit userland, and Stage2 cleanly produces the cross-toolchains without contaminating either environment.
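The ordering constraint above is a dependency graph, and a topological sort recovers the build order. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the step names are illustrative labels for the four steps in the list, not identifiers from the Stage2 Containerfile.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# step -> steps that must be finished first (for one target architecture)
DEPENDS_ON = {
    "cross-binutils": set(),
    "cross-gcc-stage1": {"cross-binutils"},   # minimal GCC, static libgcc only
    "cross-musl": {"cross-gcc-stage1"},       # C library built by the minimal GCC
    "cross-gcc-stage2": {"cross-musl"},       # full GCC with shared libs and C++
}

order = list(TopologicalSorter(DEPENDS_ON).static_order())
for step_number, step in enumerate(order, start=1):
    print(step_number, step)
```

Because each step depends on exactly the previous one, the order is fully determined; reversing any two steps would leave one of them without the tool it needs.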
What Stage2 Produces
The Stage2 image is about 700 MB and contains:
- Cross-assembler, cross-linker, cross-archiver for x86_64-linux-musl and aarch64-linux-musl
- Cross-GCC (C and C++) for both targets
- Cross-musl libc for both targets
- Linux kernel headers for both targets
- Static and shared libgcc, libstdc++
Stage 3: Native 64-bit Toolchain
Stage3 uses the cross-compilers from Stage2 to build a native 64-bit toolchain — a complete environment that runs natively on x86_64 (or aarch64) without any 32-bit emulation.
Cross-compilers from Stage2 (running on 386, targeting x86_64)
│
├── Native musl — 64-bit C library
├── Native gcc-13.1.0 — 64-bit compiler stack (with gmp, mpfr, mpc, isl)
├── Native binutils — 64-bit assembler, linker
├── Native make, cmake — build systems
├── busybox — 50+ Unix utilities in one binary (sh, ls, cp, mv, ...)
├── Native python — interpreter
├── xz — compression library and tool
├── libucontext, libunwind, libffi — system libraries
└── libzstd, zlib — compression libraries
What's Different About Stage 3?
Stage 3 is the last bootstrap stage. After this, you have a modern, native 64-bit development environment that can build anything in the StageX package tree — including LLVM/Clang, Rust, Go, and everything else.
Native compilation matters because:
- Performance: 64-bit code runs at full speed without an emulation layer
- Memory: 64-bit addressing enables working with large codebases
- Modern tooling: most build systems assume a native 64-bit environment
What Stage3 Produces
The Stage3 image is approximately 1 GB and contains everything needed to build StageX packages:
- gcc-13.1.0 (native C/C++/Fortran with GMP, MPFR, MPC, ISL)
- binutils-2.45 (native assembler, linker)
- musl-1.1.24 (64-bit C library)
- cmake-3.31.5, make-4.4 (build systems)
- python-3.11.8 (scripting and build tooling)
- busybox-1.35.0 (shell and utilities)
- xz, zlib, libzstd, libucontext, libunwind, libffi
Stage X: Everything Else
With a native 64-bit toolchain from Stage 3, StageX builds all of its application packages — organized into three groups:
| Group | Purpose | Examples |
|---|---|---|
| pallet/ | Language runtime images | pallet-rust, pallet-go, pallet-python, pallet-node |
| core/ | Low-level infrastructure | LLVM/Clang, Go compiler, Rust compiler, OpenSSL |
| user/ | Application packages | 300+ packages from the broader ecosystem |
Each pallet image (like stagex/pallet-rust that you used in the Quick Start) is a FROM-scratch image containing just the language toolchain — built from source, deterministically, with multi-party signed digests.
The Trusting Trust Problem
Ken Thompson's Turing Award lecture, "Reflections on Trusting Trust" (published in 1984), demonstrated a fundamental vulnerability:
A compiler can be modified to insert a backdoor into any program it compiles, and to re-insert that backdoor whenever it compiles its own source code, making the backdoor invisible even when reading the source.
This means that any binary compiler is a potential single point of failure. If the GCC binary you downloaded contains a Thompson-style backdoor, it could:
- Insert vulnerabilities into every program it compiles (including your application)
- Re-insert the backdoor into future versions of GCC, persisting across upgrades
- Hide from source-code audits because the source doesn't contain the exploit
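A deliberately tiny model can make the three points above easier to see. In this toy sketch the "compiler" just returns the source text as the "binary"; every name here is hypothetical, and nothing in it resembles a real compiler or a real exploit.

```python
def honest_compile(source: str) -> str:
    """Toy 'compiler': the 'binary' is simply the source text."""
    return source


BACKDOOR = "\n# injected: also accept the password 'letmein'"
SELF_COPY = "\n# injected: re-add both injections when compiling a compiler"


def trojaned_compile(source: str) -> str:
    """The same toy compiler with two Thompson-style triggers added."""
    binary = honest_compile(source)
    if "def check_password" in source:   # trigger 1: a login program
        binary += BACKDOOR
    if "def honest_compile" in source:   # trigger 2: the compiler itself
        binary += SELF_COPY
    return binary


LOGIN_SOURCE = "def check_password(pw): return pw == 'secret'"
COMPILER_SOURCE = "def honest_compile(source): return source"

# Both sources are clean, yet both outputs carry injected payloads.
print(BACKDOOR in LOGIN_SOURCE, BACKDOOR in trojaned_compile(LOGIN_SOURCE))
print(SELF_COPY in COMPILER_SOURCE, SELF_COPY in trojaned_compile(COMPILER_SOURCE))
```

An auditor reading `LOGIN_SOURCE` or `COMPILER_SOURCE` finds nothing wrong, because the payload lives only in the compiler binary that processes them.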
StageX's full-source bootstrap eliminates this attack vector entirely:
- No binary compilers: every compiler is built from source, starting from 181 bytes of auditable machine code
- Multiple independent builds: at least two maintainers independently rebuild every package and compare hashes
- Deterministic outputs: when the same source produces the same binary on independently operated machines, a backdoor would have to be injected identically on every one of them to go unnoticed
As the StageX whitepaper states:
"By starting from a minimal, auditable trust anchor and building everything from source, StageX eliminates the single points of failure that plague conventional software supply chains."
The Full Trust Chain
graph TB
subgraph "Trust Anchor"
SEED["hex0-seed<br/>181 bytes<br/>hand-crafted x86"]
end
subgraph "Stage 0 — From Nothing"
S0_HEX0["hex0<br/>hexadecimal→binary"]
S0_HEX1["hex1<br/>compact hex assembler"]
S0_HEX2["hex2<br/>linker + assembler"]
S0_C["M2-Planet<br/>C compiler"]
S0_KAEM["kaem<br/>build system"]
S0_UTILS["sha256sum, untar, ..."]
end
subgraph "Stage 1 — 32-bit Userland"
S1_MES["GNU Mes<br/>Scheme + C compiler"]
S1_TCC["TinyCC<br/>small C compiler"]
S1_MUSL["musl libc"]
S1_GCC4["gcc-4.0.4"]
S1_GCC13["gcc-13.1.0"]
S1_USERLAND["bash, python, make,<br/>autotools, openssl, ..."]
end
subgraph "Stage 2 — Cross Compilers"
S2_BINUTILS["Cross binutils<br/>x86_64 + aarch64"]
S2_GCC["Cross GCC"]
S2_MUSL["Cross musl"]
end
subgraph "Stage 3 — Native 64-bit"
S3_GCC["Native gcc-13.1.0"]
S3_BINUTILS["Native binutils"]
S3_CMAKE["cmake, python,<br/>busybox, ..."]
end
subgraph "Stage X — Everything"
SX_PALLET["pallet-rust, pallet-go, ..."]
SX_CORE["LLVM, Go, Rust, ..."]
SX_USER["300+ packages"]
end
SEED --> S0_HEX0 --> S0_HEX1 --> S0_HEX2
S0_HEX2 --> S0_C --> S0_KAEM --> S0_UTILS
S0_UTILS --> S1_MES --> S1_TCC --> S1_MUSL
S1_MUSL --> S1_GCC4 --> S1_GCC13
S1_GCC13 --> S1_USERLAND
S1_USERLAND --> S2_BINUTILS --> S2_GCC --> S2_MUSL
S2_MUSL --> S3_GCC --> S3_BINUTILS --> S3_CMAKE
S3_CMAKE --> SX_PALLET
S3_CMAKE --> SX_CORE
S3_CMAKE --> SX_USER
style SEED fill:#f96,stroke:#333,color:#000
style S0_HEX0 fill:#ff9,stroke:#333,color:#000
style S1_TCC fill:#9cf,stroke:#333,color:#000
style S1_GCC13 fill:#9cf,stroke:#333,color:#000
style S2_GCC fill:#9f9,stroke:#333,color:#000
style S3_GCC fill:#9f9,stroke:#333,color:#000
What to notice:
- Orange — The 181-byte seed. Everything derives from this.
- Yellow — Stage 0. Every tool was either written in hex assembly or compiled by a tool that was.
- Blue — Stage 1. The long climb from TinyCC to modern GCC.
- Green — Stages 2 and 3. From 32-bit to native 64-bit.
- Everything after Stage 3 is built from source using only tools produced in these stages.
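The claim that everything derives from the seed can be checked mechanically against the edges drawn in the diagram. The sketch below copies those edges verbatim and walks them from `SEED`; the helper names `parse_edges` and `reachable` are illustrative.

```python
# Edge chains copied from the diagram above.
CHAINS = """
SEED --> S0_HEX0 --> S0_HEX1 --> S0_HEX2
S0_HEX2 --> S0_C --> S0_KAEM --> S0_UTILS
S0_UTILS --> S1_MES --> S1_TCC --> S1_MUSL
S1_MUSL --> S1_GCC4 --> S1_GCC13
S1_GCC13 --> S1_USERLAND
S1_USERLAND --> S2_BINUTILS --> S2_GCC --> S2_MUSL
S2_MUSL --> S3_GCC --> S3_BINUTILS --> S3_CMAKE
S3_CMAKE --> SX_PALLET
S3_CMAKE --> SX_CORE
S3_CMAKE --> SX_USER
"""


def parse_edges(text):
    """Build {node: set of successors} from 'A --> B --> C' chains."""
    edges = {}
    for line in text.strip().splitlines():
        nodes = [n.strip() for n in line.split("-->")]
        for src, dst in zip(nodes, nodes[1:]):
            edges.setdefault(src, set()).add(dst)
            edges.setdefault(dst, set())
    return edges


def reachable(edges, root):
    """Return every node that can be reached from root by following edges."""
    seen, stack = {root}, [root]
    while stack:
        for nxt in edges[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


edges = parse_edges(CHAINS)
unreached = set(edges) - reachable(edges, "SEED")
print(len(edges), "nodes;", len(unreached), "not derived from the seed")
```

An empty `unreached` set means no node in the diagram has an origin other than the seed.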
Verification at Scale
Full-source bootstrapping isn't just a philosophical exercise — it's a practical security mechanism. Here's how it works in practice:
Deterministic Builds
Every StageX build is deterministic — the same source always produces the same binary, byte for byte. This is achieved through:
- SOURCE_DATE_EPOCH=1: all timestamps are set to a fixed value
- Pinned toolchain versions: every Containerfile uses a specific @sha256: digest
- --network=none: hermetic builds prevent network-dependent behavior
- --frozen: lockfiles prevent dependency drift
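To see why fixed timestamps matter, consider packing the same files into an archive twice. The following sketch uses Python's standard `tarfile` module and is only an illustration of the principle, not how StageX assembles its images; `deterministic_tar` is a hypothetical helper.

```python
import hashlib
import io
import tarfile

SOURCE_DATE_EPOCH = 1  # fixed timestamp, matching the value in the list above


def deterministic_tar(files: dict) -> bytes:
    """Pack {name: bytes} into a tar archive whose bytes depend only on content."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):             # fixed member order
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = SOURCE_DATE_EPOCH     # no wall-clock time
            info.mode = 0o644
            info.uid = info.gid = 0            # no builder identity
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


# Same content, different insertion order: the digests should agree.
first = deterministic_tar({"b.txt": b"two", "a.txt": b"one"})
second = deterministic_tar({"a.txt": b"one", "b.txt": b"two"})
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

If `mtime` were taken from the clock or members were added in directory-listing order, two honest builders could produce different digests from identical sources.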
Multi-Party Verification
For every StageX package:
- Maintainer A builds the package on their machine
- Maintainer B independently builds the same package on a different machine (different CPU vendor, different hardware)
- Both compute the SHA-256 digest of the output image
- If the digests match, both sign — the artifact is proven reproducible
- A quorum of 2+ signatures is required before any image is published
This means a compromise would require simultaneously subverting two maintainers' independent build environments — across different hardware, different locations, and different verification processes.
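The publication rule can be expressed as a small check over the digests each maintainer reports. This is a sketch under one added assumption: any disagreement blocks publication, even when two other maintainers agree. The function name and the sample digests are hypothetical.

```python
def publishable_digest(reports: dict, quorum: int = 2):
    """Return the agreed digest if at least `quorum` maintainers reported
    and all of them reported the same value; otherwise return None.

    `reports` maps a maintainer name to the SHA-256 hex digest they computed.
    Treating any mismatch as a blocker is an assumption of this sketch.
    """
    digests = set(reports.values())
    if len(reports) >= quorum and len(digests) == 1:
        return digests.pop()
    return None


GOOD = "ab" * 32  # placeholder digests, not real StageX values
BAD = "cd" * 32

print(publishable_digest({"maintainer-a": GOOD, "maintainer-b": GOOD}))
print(publishable_digest({"maintainer-a": GOOD, "maintainer-b": BAD}))
print(publishable_digest({"maintainer-a": GOOD}))
```

Only the first call returns a digest; a mismatch or a lone builder yields `None`, so nothing would be published.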
Chain of Custody
The entire chain — from hex0 seed to pallet-rust — can be traced and independently rebuilt:
# Build stage 0 (from hex0 seed)
podman build -t bootstrap-stage0 packages/bootstrap/stage0/
# Build stage 1 (using stage 0)
podman build -t bootstrap-stage1 packages/bootstrap/stage1/
# Build stage 2 (using stage 1)
podman build -t bootstrap-stage2 packages/bootstrap/stage2/
# Build stage 3 (using stage 2)
podman build -t bootstrap-stage3 packages/bootstrap/stage3/
# Build pallet-rust (using stage 3)
podman build -t pallet-rust packages/pallet/rust/
Any developer can reproduce this entire chain and verify their digests match the published, signed digests.
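The last step of that check is comparing a locally computed digest with a published one. Here is a minimal sketch, assuming the built artifact has been exported to a local file; the helper names are hypothetical, and how a published digest is obtained and its signatures verified is outside its scope.

```python
import hashlib
import hmac


def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex):
    """Compare the local digest with a published hex digest (case-insensitive)."""
    return hmac.compare_digest(file_sha256(path), published_hex.strip().lower())
```

A call such as `matches_published("stage0.tar", digest_from_signed_release)` would return `True` only for a byte-identical artifact; both argument values here are placeholders.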
What You've Learned
| Concept | Meaning |
|---|---|
| hex0 seed | 181 bytes of hand-crafted machine code — the root of trust |
| Stage 0 | Builds assemblers, linkers, and a C compiler from nothing |
| Stage 1 | Bootstraps a complete 32-bit userland via live-bootstrap |
| Stage 2 | Builds cross-compilers from 32-bit to 64-bit architectures |
| Stage 3 | Produces the native 64-bit toolchain used for all packages |
| Trusting Trust | The vulnerability that full-source bootstrapping eliminates |
| Deterministic builds | Same source → same binary, every time, on any machine |
| Multi-party verification | 2+ independent maintainers must agree on every digest |
Now You Understand
The Rust compiler inside stagex/pallet-rust — the one you used in the Quick Start to build your first reproducible binary — traces its lineage back through Stage 3, Stage 2, and Stage 1, to Stage 0, and ultimately to those 181 bytes of hand-crafted x86 machine code in the hex0-seed file.
Every binary in every StageX image is built from source, with no pre-compiled dependencies at any point in the chain. This is what full-source bootstrapping means — and it's the foundation of StageX's security model.
References
- StageX Whitepaper — "Eliminating Single Points of Failure in Linux Distributions"
- Bootstrappable Builds Project — Community effort to reduce binary seeds
- Reproducible Builds Project — Standards and tools for deterministic builds
- live-bootstrap — The project that powers Stage1
- stage0-posix-x86 — The hex0/hex1/hex2/M0 toolchain
- M2-Planet — C compiler that can compile itself
- GNU Mes — Scheme-based full-source bootstrap
Next Steps
- Verifying Your First StageX Image — Put this knowledge into practice by verifying GPG multi-sigs on a real StageX image
- Why Full-Source Bootstrapping Matters — Deeper dive into the philosophy and security implications
- Reproducible Builds & Supply Chain Integrity — How StageX achieves deterministic builds in practice