Understanding Full-Source Bootstrapping

Target audience: Developers, evaluators, and security auditors who want to understand StageX's trust foundation
Time to complete: ~20 minutes
Goal: Understand how StageX builds a complete, modern toolchain from a 181-byte machine code seed, with zero pre-compiled binaries.

Why This Tutorial?

In the Quick Start, you built a Rust binary with StageX and verified it's reproducible. But you might wonder: where did the Rust compiler itself come from? How do we know it hasn't been tampered with?

This is the classic Trusting Trust problem: Ken Thompson's 1984 observation that a compiler can be backdoored to inject vulnerabilities into any program it compiles, including itself. The most direct defense is to bootstrap everything from source, starting from a seed so small that a programmer can audit it by hand.

StageX solves this through full-source bootstrapping — a chain of four stages that builds an entire modern toolchain starting from 181 bytes of hand-crafted x86 machine code, with no pre-compiled binaries at any intermediate step.


The Bootstrap Chain at a Glance

┌──────────────────────────────────────────────────────────────────┐
│                    stage0 — From Nothing to C                    │
│  181-byte hex0-seed → hex0 → hex1 → hex2 → M2-Planet → kaem    │
│  Produces: assemblers, linker, C compiler, build system, utils  │
│  Platform: linux/386 (32-bit x86 only)                          │
└──────────────────────┬───────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│             stage1 — Full 32-bit Userland via live-bootstrap     │
│  mes → tcc → musl → gcc-4.0.4 → ... → gcc-13.1.0 → python,etc  │
│  Produces: complete 32-bit Linux toolchain & userland            │
│  Platform: linux/386                                              │
└──────────────────────┬───────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│           stage2 — Cross-Compiler Bridge to 64-bit               │
│  Cross binutils + Cross GCC → musl → libgcc → libstdc++         │
│  Produces: x86_64 and aarch64 cross-toolchains                    │
│  Platform: linux/386 (builds), targets: x86_64, aarch64          │
└──────────────────────┬───────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│           stage3 — Native 64-bit Toolchain                        │
│  gcc-13.1.0 + binutils + musl + cmake + python + busybox         │
│  Produces: complete 64-bit build environment for all StageX       │
│  Platform: native x86_64 (also usable for aarch64)                │
└──────────────────────┬───────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│   stage X — Package Builds (pallet, core, user)                   │
│  LLVM/Clang, Rust (via pallet-rust), Go, Python, Node.js ...     │
│  Produces: every StageX package — all from source                 │
└──────────────────────────────────────────────────────────────────┘

Every stage is deterministic (the same inputs always produce the same outputs) and independently verifiable (multiple maintainers build each stage and compare digests).


Stage 0: From 181 Bytes to a C Compiler

The hex0 Seed

At the very bottom of the trust chain is a single file in the StageX repository:

https://codeberg.org/stagex/stagex/src/branch/main/packages/bootstrap/stage0/hex0-seed

curl -s https://codeberg.org/stagex/stagex/raw/branch/main/packages/bootstrap/stage0/hex0-seed | wc -c
181

181 bytes. That's smaller than most email signatures. Every byte was written by hand — no assembler, no compiler, no toolchain of any kind produced this file. It's raw machine code, a 32-bit ELF binary for the i386 architecture:

file <(curl -s https://codeberg.org/stagex/stagex/raw/branch/main/packages/bootstrap/stage0/hex0-seed)
ELF 32-bit LSB executable, Intel i386, version 1 (GNU/Linux), statically linked, no section header

Read the hex dump — every byte is meaningful:

000000 7f 45 4c 46 01 01 01 03  00 00 00 00 00 00 00 00   │.ELF............│
000010 02 00 03 00 01 00 00 00  4c 80 04 08 2c 00 00 00   │........L...,...│
000020 00 00 00 00 00 00 00 00  34 00 20 00 01 00 00 00   │........4. .....│
...
0000b0 04 cd 80 eb b4                                      │.....│
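
The three complete rows of the dump above are enough to decode most of the ELF header. The following is a minimal sketch, assuming the bytes are exactly as printed above and using the standard ELF32 field layout; it is an illustration for readers, not part of the StageX tooling.

```python
import struct

# First 48 bytes of hex0-seed, copied from the hex dump above.
HEADER_HEX = """
7f 45 4c 46 01 01 01 03  00 00 00 00 00 00 00 00
02 00 03 00 01 00 00 00  4c 80 04 08 2c 00 00 00
00 00 00 00 00 00 00 00  34 00 20 00 01 00 00 00
"""
header = bytes.fromhex("".join(HEADER_HEX.split()))

# ELF32 header, little-endian, up to and including e_shentsize (48 bytes).
fields = struct.unpack("<16sHHIIIIIHHHH", header)
(e_ident, e_type, e_machine, e_version, e_entry, e_phoff,
 e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize) = fields

print("magic      :", e_ident[:4])    # the ELF magic number
print("class      :", e_ident[4])     # 1 means 32-bit
print("byte order :", e_ident[5])     # 1 means little-endian
print("type       :", e_type)         # 2 means executable
print("machine    :", e_machine)      # 3 means Intel 80386
print("entry      :", hex(e_entry))
print("phoff      :", e_phoff)
print("ph entries :", e_phnum)
```

In these bytes the program header offset (44) is smaller than the declared ELF header size (52), so the program header appears to overlap the tail of the ELF header, a known trick for shaving bytes off tiny executables.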

What Does hex0 Do?

hex0 is a hexadecimal-to-binary converter. It reads a text file containing ASCII hexadecimal bytes (like 7f 45 4c 46...) separated by whitespace, and writes the corresponding binary output. Lines starting with # or ; are comments.

That's it. 181 bytes of code that reads hex and writes binary.
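
To make the behaviour concrete, here is a rough Python sketch of the conversion described above. It is not a transcription of the seed: the function name `hex0_to_bytes` is hypothetical, and the sketch assumes that a `#` or `;` starts a comment that runs to the end of the line and that any non-hex character is ignored.

```python
def hex0_to_bytes(source: str) -> bytes:
    """Convert hex0-style text (hex digit pairs plus comments) to raw bytes."""
    out = bytearray()
    pending = None  # high nibble waiting for its partner
    for line in source.splitlines():
        for ch in line:
            if ch in "#;":
                break  # assumption: the rest of the line is a comment
            if ch in "0123456789abcdefABCDEF":
                nibble = int(ch, 16)
                if pending is None:
                    pending = nibble
                else:
                    out.append((pending << 4) | nibble)
                    pending = None
            # whitespace and anything else is skipped
    return bytes(out)


sample = "7f 45 4c 46  # ELF magic\n; a full-line comment\n01 01\n"
print(hex0_to_bytes(sample))
```

Re-implementing this logic in another language and feeding it the hex0 source is one way to check for yourself that the seed does what it claims.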

Why this is revolutionary: Most Linux distributions start with a pre-compiled C compiler (GCC) that was built by an even older compiler, which was built by... the chain goes back decades, ultimately to opaque binary blobs. StageX starts from a program so small that any competent programmer can:

  1. Read every byte and understand what it does
  2. Reproduce its functionality from scratch in any language
  3. Verify that it matches the source (the .hex0 file) with no hidden logic

The hex0 seed has been reproduced with the same hash across multiple Linux distributions using wildly different toolchains, which is strong evidence that the binary matches its published source and contains nothing hidden.

Stage0 Build Chain

Starting from hex0-seed, Stage0 builds progressively more capable tools. Here's the sequence (from the Stage0 Containerfile):

hex0-seed (181 bytes)
  │
  ├── hex0 — reads hex0 source, writes binary
  │     └── hex1 — reads hex1 source (more compact), writes binary
  │           └── hex2-0 — reads hex2 source (linker + assembler)
  │                 │
  │                 ├── catm — file concatenation utility
  │                 ├── M0 — micro assembler
  │                 ├── cc_x86 — primitive C compiler (C → M1 assembly)
  │                 ├── M2-Planet — C compiler written in C subset
  │                 ├── blood-elf — ELF metadata extraction tool
  │                 ├── M1-macro — macro assembler
  │                 ├── hex2-1 — improved linker
  │                 ├── M1 — full macro assembler
  │                 ├── hex2 — final linker
  │                 ├── kaem — minimal build system (like make)
  │                 ├── M2-Mesoplanet — simplified C compiler
  │                 ├── get_machine — CPU detection
  │                 └── Utilities — sha256sum, untar, ungz, catm, cp, chmod, ...

Every tool in this tree was either:

  • Written in hex0/hex1/hex2 assembly (readable as hexadecimal text)
  • Compiled by a tool that was itself built from such sources

No binary ever appears that wasn't produced by an earlier stage in the chain.

What Stage0 Produces

The Stage0 build outputs live at /usr/bin/ in the stagex/bootstrap-stage0 image:

Tool               Purpose                                         Size
-----------------  ----------------------------------------------  -----------
hex2               Final linker: combines object files             ~10 KB
kaem               Build orchestrator (like make)                  ~15 KB
M1                 Macro assembler: assembly → machine code        ~40 KB
M2-Planet          Full C compiler (subset of C, compiles itself)  ~80 KB
M2-Mesoplanet      Simplified C compiler for utility tools         ~50 KB
blood-elf          ELF symbol extraction                           ~15 KB
sha256sum          SHA-256 hash verification                       ~10 KB
untar, ungz, unxz  Archive extraction utilities                    ~10 KB each

The entire Stage0 image is about 2 MB — a complete self-hosting development environment built from 181 bytes.


Stage 1: The Long Climb — Full 32-bit Userland

Stage1 takes the primitive tools from Stage0 and builds a complete 32-bit Linux userland. This is the most complex stage — the Containerfile is nearly 300 lines, and the source list encompasses hundreds of packages.

The Build Sequence (Simplified)

Stage0 tools (kaem, M2-Planet, hex2, ...)
  │
  ├── checksum-transcriber — verify source integrity
  ├── mes-0.27 — GNU Mes (Scheme interpreter + C compiler)
  ├── nyacc — Scheme-based parser generator
  │
  ├── tcc-0.9.26 → tcc-0.9.27 — TinyCC (small, fast C compiler)
  │     │
  │     ├── musl-1.1.24 — C library (essential for GCC)
  │     ├── gcc-4.0.4 — First GCC (a C codebase that TinyCC can build)
  │     │     └── gcc-4.7.4 → gcc-10.4.0 → gcc-13.1.0
  │     │           Incremental GCC upgrades through versions
  │     │
  │     ├── binutils-2.30 → binutils-2.41
  │     ├── make-3.82 → make-4.2.1
  │     ├── coreutils-5.0 → coreutils-9.4
  │     ├── bash-2.05b → bash-5.2.15
  │     ├── perl-5.000 → perl-5.32.1
  │     ├── python-2.0.1 → python-3.11.1
  │     └── autoconf, automake, libtool, bison, flex, gawk,
  │         sed, grep, patch, tar, xz, gzip, bzip2, openssl, ...

Why So Many Steps?

Each version upgrade in the chain exists because:

  1. Compilers cannot skip too many versions — GCC 4.0.4 can build GCC 4.7.4, which can build GCC 10.4.0, which can build GCC 13.1.0. But GCC 4.0.4 cannot directly build GCC 13.1.0, because the language standards and build system have changed too much in between.

  2. Incremental bootstrapping — Each new version is built by the previous version, until we reach a modern toolchain. This is the same principle as climbing a ladder one rung at a time.

  3. Self-hosting verification — At each major version, the compiler is used to rebuild itself. If the result matches, we know the compiler is self-consistent, which is a necessary (though not by itself sufficient) sign that nothing was altered along the way.
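
The self-hosting check in point 3 boils down to a byte-for-byte comparison of two compiler generations. The sketch below shows the idea with SHA-256 digests; the helper names `sha256_of` and `is_fixed_point` and the file paths in the comment are hypothetical, and this is not how the StageX Containerfiles express the check.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_fixed_point(generation_n, generation_n_plus_1) -> bool:
    """True when a compiler, rebuilt by itself, reproduces itself exactly."""
    return sha256_of(generation_n) == sha256_of(generation_n_plus_1)


# Hypothetical usage, where gcc-gen2 was compiled by gcc-gen1 from the same source:
#   is_fixed_point("build/gcc-gen1", "build/gcc-gen2")
```

A mismatch does not say which generation is wrong, only that the compiler's output depends on something other than its source.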

What Stage1 Produces

The Stage1 image is about 300 MB — a complete 32-bit development environment with:

  • GCC 13.1.0 (C, C++, Objective-C, Fortran)
  • binutils 2.41 (assembler, linker, archiver)
  • musl 1.2.4 (C library)
  • bash 5.2.15, coreutils 9.4, make 4.2.1
  • python 3.11.1, perl 5.32.1
  • openssl 3.0.13, autotools, and dozens more

Everything was built from source, starting from those 181 bytes.


Stage 2: The Architecture Bridge

Stage1 is 32-bit x86 only. But modern hardware is 64-bit (x86_64 and aarch64). Stage2 builds cross-compilers that run on 32-bit x86 but produce code for 64-bit targets.

32-bit Stage1 toolchain (gcc-13.1.0, binutils, musl)
  │
  ├── Cross binutils for x86_64-linux-musl
  ├── Cross binutils for aarch64-linux-musl
  │
  ├── Cross GCC stage 1 (static libgcc only) for x86_64
  ├── Cross GCC stage 1 (static libgcc only) for aarch64
  │
  ├── Cross musl (C library) for x86_64
  ├── Cross musl (C library) for aarch64
  │
  ├── Cross GCC stage 2 (shared libgcc + libstdc++) for x86_64
  └── Cross GCC stage 2 (shared libgcc + libstdc++) for aarch64

Why a Separate Bridge Stage?

Cross-compilation is complex. Building a compiler that runs on architecture A but produces code for architecture B requires careful ordering:

  1. Binutils first — You need a cross-assembler and cross-linker before you can build anything for the target
  2. Minimal GCC — A bootstrap GCC with only static libraries can compile a minimal C library
  3. C library — musl must be cross-compiled with the minimal GCC
  4. Full GCC — With the C library available, GCC can build shared libraries and C++ support
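
The ordering above is a dependency graph, and a topological sort recovers the build order from it. This is a small sketch using Python's standard `graphlib` module (Python 3.9+); the step names are illustrative labels, not StageX package names.

```python
from graphlib import TopologicalSorter

# Each key depends on every step in its set. Step names are illustrative.
depends_on = {
    "cross-binutils": set(),
    "minimal-gcc": {"cross-binutils"},
    "cross-musl": {"minimal-gcc"},
    "full-gcc": {"cross-musl", "cross-binutils"},
}

build_order = list(TopologicalSorter(depends_on).static_order())
print(build_order)
```

Because each step depends on the one before it, there is only one valid order here, which is why the bridge has to be built strictly in sequence for each target.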

Splitting this into its own stage means Stage1 stays focused on building the 32-bit userland, and Stage2 cleanly produces the cross-toolchains without contaminating either environment.

What Stage2 Produces

The Stage2 image is about 700 MB and contains:

  • Cross-assembler, cross-linker, cross-archiver for x86_64-linux-musl and aarch64-linux-musl
  • Cross-GCC (C and C++) for both targets
  • Cross-musl libc for both targets
  • Linux kernel headers for both targets
  • Static and shared libgcc, libstdc++

Stage 3: Native 64-bit Toolchain

Stage3 uses the cross-compilers from Stage2 to build a native 64-bit toolchain — a complete environment that runs natively on x86_64 (or aarch64) without any 32-bit emulation.

Cross-compilers from Stage2 (running on 386, targeting x86_64)
  │
  ├── Native musl — 64-bit C library
  ├── Native gcc-13.1.0 — 64-bit compiler stack (with gmp, mpfr, mpc, isl)
  ├── Native binutils — 64-bit assembler, linker
  ├── Native make, cmake — build systems
  ├── busybox — 50+ Unix utilities in one binary (sh, ls, cp, mv, ...)
  ├── Native python — interpreter
  ├── xz — compression library and tool
  ├── libucontext, libunwind, libffi — system libraries
  └── libzstd, zlib — compression libraries

What's Different About Stage 3?

Stage 3 is the last bootstrap stage. After this, you have a modern, native 64-bit development environment that can build anything in the StageX package tree — including LLVM/Clang, Rust, Go, and everything else.

Native compilation matters because:

  • Performance — 64-bit code runs at full speed without an emulation layer
  • Memory — 64-bit addressing enables working with large codebases
  • Modern tooling — Most build systems assume a native 64-bit environment

What Stage3 Produces

The Stage3 image is approximately 1 GB and contains everything needed to build StageX packages:

  • gcc-13.1.0 (native C/C++/Fortran with GMP, MPFR, MPC, ISL)
  • binutils-2.45 (native assembler, linker)
  • musl-1.1.24 (64-bit C library)
  • cmake-3.31.5, make-4.4 (build systems)
  • python-3.11.8 (scripting and build tooling)
  • busybox-1.35.0 (shell and utilities)
  • xz, zlib, libzstd, libucontext, libunwind, libffi

Stage X: Everything Else

With a native 64-bit toolchain from Stage 3, StageX builds all of its application packages — organized into three groups:

Group    Purpose                   Examples
-------  ------------------------  --------------------------------------------------
pallet/  Language runtime images   pallet-rust, pallet-go, pallet-python, pallet-node
core/    Low-level infrastructure  LLVM/Clang, Go compiler, Rust compiler, OpenSSL
user/    Application packages      300+ packages from the broader ecosystem

Each pallet image (like stagex/pallet-rust that you used in the Quick Start) is a FROM-scratch image containing just the language toolchain — built from source, deterministically, with multi-party signed digests.


The Trusting Trust Problem

Ken Thompson's Turing Award lecture, "Reflections on Trusting Trust" (published in 1984), demonstrated a fundamental vulnerability:

A compiler can be modified to insert a backdoor into any program it compiles, and to re-insert that backdoor into its own binary whenever it compiles a clean copy of its own source, making the backdoor invisible even to someone reading the source.

This means that any binary compiler is a potential single point of failure. If the GCC binary you downloaded contains a Thompson-style backdoor, it could:

  1. Insert vulnerabilities into every program it compiles (including your application)
  2. Re-insert the backdoor into future versions of GCC, persisting across upgrades
  3. Hide from source-code audits because the source doesn't contain the exploit

StageX's full-source bootstrap eliminates this attack vector entirely:

  • No binary compilers — every compiler is built from source, starting from 181 bytes of auditable machine code
  • Multiple independent builds — at least two maintainers independently rebuild every package and compare hashes
  • Deterministic outputs — when independent builders get bit-identical binaries from the same source, a backdoor injected in only one build environment would show up as a digest mismatch

As the StageX whitepaper states:

"By starting from a minimal, auditable trust anchor and building everything from source, StageX eliminates the single points of failure that plague conventional software supply chains."


The Full Trust Chain

graph TB
    subgraph "Trust Anchor"
        SEED["hex0-seed<br/>181 bytes<br/>hand-crafted x86"]
    end

    subgraph "Stage 0 — From Nothing"
        S0_HEX0["hex0<br/>hexadecimal→binary"]
        S0_HEX1["hex1<br/>compact hex assembler"]
        S0_HEX2["hex2<br/>linker + assembler"]
        S0_C["M2-Planet<br/>C compiler"]
        S0_KAEM["kaem<br/>build system"]
        S0_UTILS["sha256sum, untar, ..."]
    end

    subgraph "Stage 1 — 32-bit Userland"
        S1_MES["GNU Mes<br/>Scheme + C compiler"]
        S1_TCC["TinyCC<br/>small C compiler"]
        S1_MUSL["musl libc"]
        S1_GCC4["gcc-4.0.4"]
        S1_GCC13["gcc-13.1.0"]
        S1_USERLAND["bash, python, make,<br/>autotools, openssl, ..."]
    end

    subgraph "Stage 2 — Cross Compilers"
        S2_BINUTILS["Cross binutils<br/>x86_64 + aarch64"]
        S2_GCC["Cross GCC"]
        S2_MUSL["Cross musl"]
    end

    subgraph "Stage 3 — Native 64-bit"
        S3_GCC["Native gcc-13.1.0"]
        S3_BINUTILS["Native binutils"]
        S3_CMAKE["cmake, python,<br/>busybox, ..."]
    end

    subgraph "Stage X — Everything"
        SX_PALLET["pallet-rust, pallet-go, ..."]
        SX_CORE["LLVM, Go, Rust, ..."]
        SX_USER["300+ packages"]
    end

    SEED --> S0_HEX0 --> S0_HEX1 --> S0_HEX2
    S0_HEX2 --> S0_C --> S0_KAEM --> S0_UTILS

    S0_UTILS --> S1_MES --> S1_TCC --> S1_MUSL
    S1_MUSL --> S1_GCC4 --> S1_GCC13
    S1_GCC13 --> S1_USERLAND

    S1_USERLAND --> S2_BINUTILS --> S2_GCC --> S2_MUSL

    S2_MUSL --> S3_GCC --> S3_BINUTILS --> S3_CMAKE

    S3_CMAKE --> SX_PALLET
    S3_CMAKE --> SX_CORE
    S3_CMAKE --> SX_USER

    style SEED fill:#f96,stroke:#333,color:#000
    style S0_HEX0 fill:#ff9,stroke:#333,color:#000
    style S1_TCC fill:#9cf,stroke:#333,color:#000
    style S1_GCC13 fill:#9cf,stroke:#333,color:#000
    style S2_GCC fill:#9f9,stroke:#333,color:#000
    style S3_GCC fill:#9f9,stroke:#333,color:#000

What to notice:

  • Orange — The 181-byte seed. Everything derives from this.
  • Yellow — Stage 0. Every tool was either written in hex assembly or compiled by a tool that was.
  • Blue — Stage 1. The long climb from TinyCC to modern GCC.
  • Green — Stages 2 and 3. From 32-bit to native 64-bit.
  • Everything after Stage 3 is built from source using only tools produced in these stages.

Verification at Scale

Full-source bootstrapping isn't just a philosophical exercise — it's a practical security mechanism. Here's how it works in practice:

Deterministic Builds

Every StageX build is deterministic — the same source always produces the same binary, byte for byte. This is achieved through:

  • SOURCE_DATE_EPOCH=1 — All timestamps set to a fixed value
  • Pinned toolchain versions — Every Containerfile uses a specific @sha256: digest
  • --network=none — Hermetic builds prevent network-dependent behavior
  • --frozen — Lockfiles prevent dependency drift
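
To see why fixed timestamps and a fixed file order matter, consider packing the same files into an archive twice. The sketch below is a generic illustration with Python's `tarfile` module, not StageX's build code; the function name `deterministic_tar` is hypothetical, and the fallback epoch of 1 simply mirrors the setting listed above.

```python
import hashlib
import io
import os
import tarfile


def deterministic_tar(files: dict, epoch: int) -> bytes:
    """Pack name -> bytes into a tar with sorted entries and fixed metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):           # fixed order, whatever the input order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = epoch                # fixed timestamp instead of "now"
            info.mode = 0o644
            info.uid = info.gid = 0           # no builder-specific ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


epoch = int(os.environ.get("SOURCE_DATE_EPOCH", "1"))
first = deterministic_tar({"b.txt": b"two", "a.txt": b"one"}, epoch)
second = deterministic_tar({"a.txt": b"one", "b.txt": b"two"}, epoch)
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

Changing only the timestamp changes the digest, which is exactly the kind of drift that a fixed `SOURCE_DATE_EPOCH` rules out.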

Multi-Party Verification

For every StageX package:

  1. Maintainer A builds the package on their machine
  2. Maintainer B independently builds the same package on a different machine (different CPU vendor, different hardware)
  3. Both compute the SHA-256 digest of the output image
  4. If the digests match, both sign — the artifact is proven reproducible
  5. A quorum of 2+ signatures is required before any image is published

This means a compromise would require simultaneously subverting two maintainers' independent build environments — across different hardware, different locations, and different verification processes.
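
The publish rule can be summarised in a few lines. The sketch below assumes a conservative policy (every reported digest must agree, and at least two maintainers must have reported); the function name `publishable`, the maintainer labels, and the placeholder digests are all hypothetical, and the real StageX signing workflow is not shown here.

```python
from typing import Dict


def publishable(reports: Dict[str, str], quorum: int = 2) -> bool:
    """True when enough maintainers reported and all digests are identical."""
    distinct_digests = set(reports.values())
    return len(reports) >= quorum and len(distinct_digests) == 1


# Placeholder digests for illustration only.
agreeing = {"maintainer-a": "sha256:aaaa", "maintainer-b": "sha256:aaaa"}
disagreeing = {"maintainer-a": "sha256:aaaa", "maintainer-b": "sha256:bbbb"}

print(publishable(agreeing))
print(publishable(disagreeing))
```

Under this rule a single builder can never publish alone, and any disagreement blocks publication until it is explained.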

Chain of Custody

The entire chain — from hex0 seed to pallet-rust — can be traced and independently rebuilt:

# Build stage 0 (from hex0 seed)
podman build -t bootstrap-stage0 packages/bootstrap/stage0/

# Build stage 1 (using stage 0)
podman build -t bootstrap-stage1 packages/bootstrap/stage1/

# Build stage 2 (using stage 1)
podman build -t bootstrap-stage2 packages/bootstrap/stage2/

# Build stage 3 (using stage 2)
podman build -t bootstrap-stage3 packages/bootstrap/stage3/

# Build pallet-rust (using stage 3)
podman build -t pallet-rust packages/pallet/rust/

Any developer can reproduce this entire chain and verify their digests match the published, signed digests.


What You've Learned

Concept                   Meaning
------------------------  ----------------------------------------------------------------
hex0 seed                 181 bytes of hand-crafted machine code, the root of trust
Stage 0                   Builds assemblers, linkers, and a C compiler from nothing
Stage 1                   Bootstraps a complete 32-bit userland via live-bootstrap
Stage 2                   Builds cross-compilers from 32-bit to 64-bit architectures
Stage 3                   Produces the native 64-bit toolchain used for all packages
Trusting Trust            The vulnerability that full-source bootstrapping eliminates
Deterministic builds      Same source → same binary, every time, on any machine
Multi-party verification  2+ independent maintainers must agree on every digest

Now You Understand

The Rust compiler inside stagex/pallet-rust — the one you used in the Quick Start to build your first reproducible binary — traces its lineage back through Stage 3, Stage 2, and Stage 1, to Stage 0, and ultimately to those 181 bytes of hand-crafted x86 machine code in the hex0-seed file.

Every binary in every StageX image is built from source, with no pre-compiled dependencies at any point in the chain. This is what full-source bootstrapping means — and it's the foundation of StageX's security model.

