Project Setup
Keith Gangarahwe
@keith-gang
The "I Don't Have Hardware Yet" Problem
So this is a post on how I'm starting the setup of my ARM7TDMI interpreter.
For now, we don't have the actual hardware. That is a problem. So, instead of waiting around, we are going to build a basic interpreter in Zig that runs on standard machines. We are effectively hallucinating a CPU in software until the real silicon arrives.
Step 1: Generating Test Binaries
To test an interpreter, you need code to interpret. I didn't want to mess around with complex toolchains, so I set up this: zig-arm7tdmi_build.zig.
This repo is basically a Zig project set up to write inline assembly in Zig and use the Zig build system to generate a flat ARM .bin file. No ELF overhead, just raw instructions.
There is a catch, though. Because we are doing this weird freestanding build, we hit this: `pub const panic = @compileError("panic not allowed in freestanding build");`
This means we can't debug on the target. It is a tradeoff to make truly flat binaries.
Build Requirements
Due to the freestanding nature of the code and the absence of a panic handler, you must build with the ReleaseSmall optimization level.
Why ReleaseSmall?
The file src/main.zig explicitly forbids panic generation. In standard Debug builds, Zig inserts safety checks (e.g., for overflow) that require a panic handler to function. Using ReleaseSmall eliminates these checks, allowing the code to compile without a runtime panic implementation. This ensures the resulting binary contains only the assembly instructions we intended, without any extra bloat.
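For reference, the two build-only steps exposed by the build.zig shown later can be invoked like this (a sketch of the commands; the step names and output paths are the ones defined in this post's build script):

```
zig build elf -Doptimize=ReleaseSmall   # ELF with debug symbols -> zig-out/bin/dbg
zig build bin -Doptimize=ReleaseSmall   # flat ARM binary       -> zig-out/bin/main.bin
```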
The build produces the following artifacts:
- Flat Binary: `zig-out/bin/main.bin` (the final artifact for our interpreter).
- ELF Executable: `zig-out/bin/dbg` (an intermediate artifact with debug symbols).
We also have options to run this in QEMU to verify what we are generating:
| Step | Command | Description |
|---|---|---|
| Run | `zig build run -Doptimize=ReleaseSmall` | Runs the binary in QEMU (versatilepb, arm926). |
| Debug | `zig build run-dbg -Doptimize=ReleaseSmall` | Runs QEMU paused (`-S`) with a GDB stub on port 1234 (`-s`). |
| WSL | `zig build run-wsl -Doptimize=ReleaseSmall` | Launches the Windows version of QEMU from a WSL environment. |
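Once the debug target is running, you can attach a debugger to QEMU's GDB stub from another terminal. A rough session sketch (assuming `gdb-multiarch` is installed; `arm-none-eabi-gdb` works the same way):

```
$ gdb-multiarch zig-out/bin/dbg
(gdb) target remote :1234
(gdb) break _start
(gdb) continue
(gdb) info registers
```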
Here is what the main.zig looks like (yes, it's just inline assembly):
pub const panic = @compileError("panic not allowed in freestanding build");
export fn _start() callconv(.naked) noreturn {
asm volatile (
\\ ldr sp, =0x4000
\\ mov r0, #10
\\ mov r1, #40
\\ mov r4, #66
\\ add r0, r1, r4
);
while (true) {}
}
Deconstructing the Binary Generator: main.zig
The code above is our entire program. It's a bare-metal application that does nothing more than load a few values into registers. Let's break it down:
- `export fn _start()`: This defines our entry point. We name it `_start`, which is the conventional starting point for programs without a standard library or runtime. `export` makes it visible to the linker.
- `callconv(.naked)`: This is the crucial part. It tells the compiler not to add any standard function setup or cleanup code (no function prologue or epilogue). We get full control over the registers and the stack, which is exactly what we need for bare-metal programming.
- `noreturn`: We promise the compiler this function will never return. In a freestanding environment, there's no OS to return to, so we end in an infinite loop (`while (true) {}`).
- `asm volatile (...)`: This is Zig's inline assembly. `volatile` ensures the compiler doesn't reorder or optimize away our raw instructions. Inside, we perform basic ARM operations: initialize the stack pointer (`ldr sp, =0x4000`) and then perform some simple arithmetic.
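As a quick sanity check outside the Zig project, you can pull one of these encodings apart by hand. The post's own test program lists `0xE3A0000A` as the encoding of `MOV R0, #10`, and the field layout below is the standard ARM data-processing format (this is a Python scratch script, not part of the repo):

```python
# Extract the data-processing fields from an ARM word by hand.
# 0xE3A0000A is "MOV R0, #10" (immediate form).
word = 0xE3A0000A

cond   = (word >> 28) & 0xF  # 0xE (14) = AL, execute unconditionally
i_bit  = (word >> 25) & 0x1  # 1 = operand 2 is an immediate
opcode = (word >> 21) & 0xF  # 0b1101 (13) = MOV
rd     = (word >> 12) & 0xF  # destination register
imm    = word & 0xFF         # 8-bit immediate (rotate field ignored here)

print(cond, i_bit, opcode, rd, imm)  # 14 1 13 0 10
```

These are exactly the shifts the interpreter's decode stage will use later in this post.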
Why Inline Assembly and _start?
You might wonder why we're writing assembly inside a Zig file instead of using traditional .s assembly files.
- Simplicity & Integration: Everything stays in one file. This simplifies the project structure and the build process. We don't need to tell the build system how to find and assemble separate files.
- Control: The combination of `_start` and `callconv(.naked)` gives us the low-level control of a pure assembly entry point, but from the comfort of our Zig source. It's the best of both worlds.
- Unified Toolchain: We let the Zig compiler handle everything. It parses the Zig code and the inline assembly in one pass, streamlining the process from source code to machine code.
In short, this approach is a modern, integrated way to achieve what once required separate assembly files and more complex build configurations.
And the build.zig that makes the magic happen:
const std = @import("std");
pub fn build(b: *std.Build) void {
const target = b.resolveTargetQuery(.{
.cpu_arch = .arm,
.cpu_model = .{ .explicit = &std.Target.arm.cpu.arm7tdmi },
.os_tag = .freestanding,
.abi = .eabi,
});
const optimize = b.standardOptimizeOption(.{});
const root_module = b.createModule(.{
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
});
// -------------------------------------------------
// 1. Build ELF executable (WITH debug symbols)
// -------------------------------------------------
const exe = b.addExecutable(.{
.name = "dbg",
.root_module = root_module,
});
exe.root_module.strip = false; // Keep symbols for GDB
exe.pie = false;
exe.entry = .{ .symbol_name = "_start" };
exe.link_gc_sections = true;
exe.linker_script = b.path("linker.ld");
// Install ELF (creates zig-out/bin automatically)
b.installArtifact(exe);
// -------------------------------------------------
// 2. Convert ELF -> flat binary
// -------------------------------------------------
// Ensure output directory exists
const mkdir = b.addSystemCommand(&.{
"mkdir",
"-p",
"zig-out/bin",
});
mkdir.step.dependOn(&exe.step);
const objcopy = b.addSystemCommand(&.{
"zig",
"objcopy",
"-O",
"binary",
});
objcopy.addArtifactArg(exe);
objcopy.addArg("zig-out/bin/main.bin");
objcopy.step.dependOn(&mkdir.step);
// -------------------------------------------------
// 3. QEMU Normal Run (flat binary)
// -------------------------------------------------
const run_qemu = b.addSystemCommand(&.{
"qemu-system-arm",
"-M", "versatilepb", // The Board
"-cpu", "arm926", // <--- CHANGED: Matches your list exactly
//"-nographic", // use current terminal for I/O
"-S", // Freeze CPU at startup
"-s", // Start GDB server on port 1234
"-serial", "stdio", // Redirect serial to terminal
"-semihosting", // enable semihosting for print support
"-kernel",
"zig-out/bin/main.bin",
});
run_qemu.step.dependOn(&objcopy.step);
// -------------------------------------------------
// 4. QEMU Debug Mode (freeze + GDB)
// -------------------------------------------------
const run_qemu_dbg = b.addSystemCommand(&.{
"qemu-system-arm",
"-M", "versatilepb", // The Board
"-cpu", "arm926", // <--- CHANGED: Matches your list exactly
//"-nographic", // use current terminal for I/O
"-serial", "stdio", // Redirect serial to terminal
"-semihosting", // enable semihosting for print support
"-kernel",
"zig-out/bin/main.bin",
});
run_qemu_dbg.step.dependOn(&objcopy.step);
// -------------------------------------------------
// 5. Windows QEMU (from WSL)
// -------------------------------------------------
const run_qemu_win = b.addSystemCommand(&.{
"/mnt/c/msys64/ucrt64/bin/qemu-system-arm.exe",
"-M", "versatilepb", // The Board
"-cpu", "arm926", // <--- CHANGED: Matches your list exactly
//"-nographic", // use current terminal for I/O
"-serial", "stdio", // Redirect serial to terminal
"-semihosting", // enable semihosting for print support
"-kernel",
"zig-out/bin/main.bin",
});
run_qemu_win.step.dependOn(&objcopy.step);
// -------------------------------------------------
// 6. Expose build steps
// -------------------------------------------------
b.step("elf", "Build ELF with debug symbols")
.dependOn(&exe.step);
b.step("bin", "Build flat ARM binary")
.dependOn(&objcopy.step);
b.step("run", "Run in QEMU (Linux/WSL)")
.dependOn(&run_qemu.step);
b.step("run-dbg", "Run in QEMU Debug mode (freeze)")
.dependOn(&run_qemu_dbg.step);
b.step("run-wsl", "Run Windows QEMU from WSL")
.dependOn(&run_qemu_win.step);
}
Deconstructing the Build Magic: build.zig
This build script is the heart of our binary generation process. It automates a multi-step pipeline:
1. Targeting ARM Freestanding: We first tell Zig that our target is `arm`, specifically `arm7tdmi`, and that we're building for a `.freestanding` environment. This is the key to stripping away all operating system dependencies.
2. Generating ELF with Debug Symbols: The first step (`exe`) builds a standard ELF executable named `dbg`. The line `exe.root_module.strip = false;` is critical: it tells the linker to keep the debugging symbols. This allows us to use a debugger like GDB to inspect our code, even though the final binary won't have these symbols.
3. Creating the Flat Binary: We then use the `zig objcopy` command. This tool takes the `dbg` ELF file as input and applies the `-O binary` flag. This flag is the magic wand: it strips away all the ELF headers, metadata, and section information, leaving only the raw, executable machine code. The output is our final `main.bin`, a pure, flat binary file.
4. Setting up QEMU: The script defines several commands to run our binary in the QEMU emulator.
   - `zig build run -Doptimize=ReleaseSmall`: Runs the code immediately.
   - `zig build run-dbg -Doptimize=ReleaseSmall`: This is for debugging. It uses `-S` to freeze the CPU at startup and `-s` to open a GDB server on port 1234, allowing you to connect a debugger and step through the code from the very first instruction.
   - The `-kernel zig-out/bin/main.bin` argument tells QEMU to load our flat binary file directly into memory and execute it, just as if it were a real piece of hardware booting up.
Step 2: The Naive Interpreter
Now that we have binary blobs, we need something to run them. Enter the basic interpreter: arm7tdmi-interpreter-zig.
In this project, I built a simple pipeline to load the ARM bin file and try to extract the commands. The design is intentionally "naive" to start with, focusing on a clean separation of data and logic.
The Cpu struct is a prime example of this. It's pure data: just an array of registers and the status register. It has no methods. This is a Data-Oriented Design approach that keeps the state clean and separate from the execution logic.
The interpreter itself has two modes, which reveal the learning process. The run function operates on a pre-decoded array of Instruction structs. This is a "perfect world" simulation where the binary has been fully analyzed and structured beforehand. The loop function is more realistic, attempting to read raw 32-bit words from a memory slice one by one, decode them, and execute them in a tight loop, much closer to how a real CPU operates.
Here is the code. It looks nice, doesn't it?
const std = @import("std");
/// ============================================
/// CPU STATE (PURE DATA - NO BEHAVIOR)
/// ============================================
pub const Cpu = struct {
/// 16 General Purpose Registers (R0-R15)
/// R15 is the Program Counter on real ARM,
/// but in this simulation we track PC separately.
regs: [16]u32 = [_]u32{0} ** 16,
/// Current Program Status Register
cpsr: u32 = 0x000000D3,
};
/// ============================================
/// Simulated Memory
/// ============================================
pub const Memory = struct {
data: []u8,
};
/// ============================================
/// DECODED INSTRUCTION FORMAT
/// ============================================
pub const Instruction = union(enum) {
mov: struct { rd: u8, imm: u32 },
add: struct { rd: u8, rn: u8, imm: u32 },
sub: struct { rd: u8, rn: u8, imm: u32 },
bl: struct { target: usize },
};
/// ============================================
/// DECODE STAGE
/// Converts raw 32-bit ARM word into structured Instruction
/// ============================================
fn decode(word: u32) !Instruction {
const type_bits = (word >> 26) & 0b11;
// Data Processing instructions (bits 27-26 must be 00)
if (type_bits == 0) {
const opcode = (word >> 21) & 0xF;
const rd: u8 = @intCast((word >> 12) & 0xF);
const rn: u8 = @intCast((word >> 16) & 0xF);
const i_bit = (word >> 25) & 1;
// MOV (opcode 13)
if (opcode == 13 and i_bit == 1) {
const imm = word & 0xFF;
return .{ .mov = .{ .rd = rd, .imm = imm } };
}
// ADD (opcode 4)
if (opcode == 4 and i_bit == 1) {
const imm = word & 0xFF;
return .{ .add = .{ .rd = rd, .rn = rn, .imm = imm } };
}
// SUB (opcode 2)
if (opcode == 2 and i_bit == 1) {
const imm = word & 0xFF;
return .{ .sub = .{ .rd = rd, .rn = rn, .imm = imm } };
}
}
// Branch with Link (BL)
// Bits 27-25 = 101
if (((word >> 25) & 0b111) == 0b101) {
const imm24 = word & 0x00FFFFFF;
// Sign-extend the 24-bit immediate: (imm24 << 8) >> 6 yields a
// sign-extended byte offset (already shifted left by 2), so dividing
// by 4 gives an offset in instructions. (A real ARM7TDMI would also
// add the 8-byte pipeline offset, which this naive decoder ignores.)
const signed = @as(i32, @bitCast(imm24 << 8)) >> 6;
return .{ .bl = .{ .target = @intCast(@as(isize, @divTrunc(signed, 4))) } };
}
return error.UnsupportedInstruction;
}
/// Decode entire program once (DOD style)
fn decodeProgram(
raw: []const u32,
allocator: std.mem.Allocator,
) ![]Instruction {
var decoded = try allocator.alloc(Instruction, raw.len);
for (raw, 0..) |word, i| {
decoded[i] = try decode(word);
}
return decoded;
}
/// ============================================
/// DEBUG PRINT HELPERS
/// ============================================
fn printInstruction(inst: Instruction) void {
switch (inst) {
.mov => |i| std.debug.print("MOV R{d}, #{d}", .{ i.rd, i.imm }),
.add => |i| std.debug.print("ADD R{d}, R{d}, #{d}", .{ i.rd, i.rn, i.imm }),
.sub => |i| std.debug.print("SUB R{d}, R{d}, #{d}", .{ i.rd, i.rn, i.imm }),
.bl => |i| std.debug.print("BL {d}", .{i.target}),
}
}
fn dumpRegisters(cpu: *const Cpu) void {
for (cpu.regs, 0..) |r, i| {
std.debug.print("R{d: <2}: {d: <10} ", .{ i, r });
if ((i + 1) % 4 == 0)
std.debug.print("\n", .{});
}
}
/// ============================================
/// Helper Function to read from the file
/// ============================================
fn readU32(memory: []u8, addr: u32) u32 {
return @as(u32, memory[addr]) |
(@as(u32, memory[addr + 1]) << 8) |
(@as(u32, memory[addr + 2]) << 16) |
(@as(u32, memory[addr + 3]) << 24);
}
/// ============================================
/// EXECUTION STAGE (HOT LOOP)
/// ============================================
fn run(cpu: *Cpu, program: []Instruction) void {
var pc: u32 = 0; // entry point: index 0 of the decoded instruction array
var cycle: usize = 0;
while (pc < program.len) {
const inst = program[pc];
std.debug.print(
"\n==============================\n",
.{},
);
std.debug.print("Cycle: {d}\n", .{cycle});
std.debug.print("PC: {d}\n", .{pc});
std.debug.print("Instruction: ", .{});
printInstruction(inst);
std.debug.print("\n\n", .{});
// Execute instruction
switch (inst) {
.mov => |i| {
cpu.regs[i.rd] = i.imm;
pc += 1;
},
.add => |i| {
cpu.regs[i.rd] = cpu.regs[i.rn] + i.imm;
pc += 1;
},
.sub => |i| {
cpu.regs[i.rd] = cpu.regs[i.rn] - i.imm;
pc += 1;
},
.bl => |i| {
// Save return address in R14 (Link Register)
cpu.regs[14] = @intCast(pc + 1);
// Branch
pc = @intCast(i.target);
},
}
dumpRegisters(cpu);
cycle += 1;
}
}
fn loop(memory: []u8) void {
var cpu = Cpu{};
var pc: u32 = 0x10; // entry point
var cycle: usize = 0;
while (cycle < 10) { // temporary limit
const word = readU32(memory, pc);
std.debug.print(
"\n====================\nCycle: {d}\nPC: 0x{x}\nRaw: 0x{x}\n",
.{ cycle, pc, word },
);
const inst = decode(word) catch {
std.debug.print("Unsupported instruction\n", .{});
break;
};
// Execute (we will expand this next)
switch (inst) {
.mov => |i| {
cpu.regs[i.rd] = i.imm;
pc += 4;
},
.add => |i| {
cpu.regs[i.rd] = cpu.regs[i.rn] + i.imm;
pc += 4;
},
.sub => |i| {
cpu.regs[i.rd] = cpu.regs[i.rn] - i.imm;
pc += 4;
},
else => {
std.debug.print("Instruction not implemented yet\n", .{});
// break;
},
}
dumpRegisters(&cpu);
cycle += 1;
}
}
/// ============================================
/// MAIN - SIMULATION ENTRY POINT
/// ============================================
pub fn main() !void {
// var cpu = Cpu{};
// Simulated ARM machine code
//const raw_program = [_]u32{
// 0xE3A0000A, // MOV R0, #10
// 0xE3A0100B, // MOV R1, #11
// 0xE2802005, // ADD R2, R0, #5
// 0xE2403002, // SUB R3, R0, #2
//};
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const allocator = gpa.allocator();
//const decoded = try decodeProgram(&raw_program, allocator);
//defer allocator.free(decoded);
//run(&cpu, decoded);
const memory_size = 128 * 1024;
const memory = try allocator.alloc(u8, memory_size);
defer allocator.free(memory);
@memset(memory, 0);
// open main.bin
const file_path = "bin/main.bin"; // The arm binary formed by our project, copied here
const file = try std.fs.cwd().openFile(file_path, .{});
defer file.close();
const file_size = try file.readAll(memory);
std.debug.print("Loaded {d} bytes into memory.\n", .{file_size});
loop(memory);
}
The Reality Check
Here is where my naivety comes back to bite me.
In the code above, you can see two implementations. One is a nice, clean CPU simulation (run function) that takes a pre-cooked array of u32 hex codes. It works perfectly because I hand-picked instructions that I had already implemented (MOV, ADD, SUB).
Then there is the loop function. This is the work-in-progress pipeline attempting to load an actual ARM7TDMI binary file generated by my build system.
Obviously, it breaks.
Why? Because I haven't worked on the actual pipeline to do the underlying stuff that isn't a basic mov or add. When you compile real code, even simple code, the compiler generates instructions like ldr (Load Register) to move data around. My interpreter looks at that ldr instruction, realizes it has no idea what that is, and bails out with `error.UnsupportedInstruction`.
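You can see the failure with the same bit tests decode() uses. `0xE59F0010` is a typical literal-pool load (`ldr r0, [pc, #16]`); its class bits match neither of the two patterns the decoder knows about (this Python mirrors the Zig checks, it isn't project code):

```python
word = 0xE59F0010               # ldr r0, [pc, #16] - a literal-pool load

type_bits = (word >> 26) & 0b11   # decode() wants 0 (data processing)
bl_bits   = (word >> 25) & 0b111  # or 0b101 (branch with link)

# 1 is the load/store class, so both checks fail and the decoder
# returns error.UnsupportedInstruction.
print(type_bits, bl_bits)  # 1 2
```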
I was so focused on the "pure data" CPU struct that I forgot that a CPU actually has to fetch memory, and memory access is complicated. The run function assumed a perfect world where instructions just exist. The loop function is facing the cold, hard reality of binary files where you have to handle everything, or you handle nothing.