Complex Benchmarks: Jda vs The World

Real-world benchmarks comparing Jda against C, Rust, Go, Python, and Ruby on five non-trivial algorithms: a constraint-propagation sudoku solver, LZ77 compression, Thompson NFA regex, a B-tree, and a ray tracer. Each has full source code in all six languages.

Environment

Measured on macOS on Apple Silicon (ARM64). C, Rust, and Go compile to native ARM64. Jda compiles to an x86-64 Mach-O binary that runs under Rosetta 2. This is a deliberate ISA handicap: every Jda number includes translation overhead that the native binaries do not pay.

| Language | Version | Flags |
| --- | --- | --- |
| C | Apple clang 21 | -O2 -lm |
| Rust | rustc 1.94 | -O |
| Go | go 1.26 | default |
| Jda | jda1 (self-hosted) | build --macos |
| Ruby | 4.0.2 | interpreted |
| Python | 3.11 | interpreted |

Results Summary (ms, lower is better)

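| Benchmark | C | Rust | Go | Jda | Ruby | Python |
| --- | --- | --- | --- | --- | --- | --- |
| Sudoku, 500 puzzles | 62 | 62 | 66 | 41 | 3,854 | 1,753 |
| LZ77, 1 MB compress+decompress | 1,830 | 2,185 | 2,721 | 277 | 222,424 | timed out (120 s) |
| Regex, 8 patterns × 100K strings | 98 | 221 | 813 | 186 | 7,940 | 7,406 |
| B-Tree, 1M insert + 2M lookup | 282 | 297 | 318 | 586 | 11,529 | 10,955 |
| Raytracer, 800×600, 5 spheres | 19 | 21 | 35 | 331 | 3,301 | 4,080 |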
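The speedup figures quoted in the per-problem sections are plain ratios of these runtimes. As a quick arithmetic check, the sketch below recomputes them for the compiled languages; the dictionary layout and the function name `ratio_vs_jda` are illustrative, not part of the benchmark suite.

```python
# Runtimes in ms, copied from the summary table (compiled languages only).
runtimes_ms = {
    "Sudoku":    {"C": 62,   "Rust": 62,   "Go": 66,   "Jda": 41},
    "LZ77":      {"C": 1830, "Rust": 2185, "Go": 2721, "Jda": 277},
    "Regex":     {"C": 98,   "Rust": 221,  "Go": 813,  "Jda": 186},
    "B-Tree":    {"C": 282,  "Rust": 297,  "Go": 318,  "Jda": 586},
    "Raytracer": {"C": 19,   "Rust": 21,   "Go": 35,   "Jda": 331},
}

def ratio_vs_jda(bench, lang):
    """Return other/Jda: above 1 means Jda was faster, below 1 means slower."""
    row = runtimes_ms[bench]
    return row[lang] / row["Jda"]

for bench in runtimes_ms:
    cells = ", ".join(
        f"{lang} {ratio_vs_jda(bench, lang):.2f}x" for lang in ("C", "Rust", "Go")
    )
    print(f"{bench:10s} {cells}")
```

A ratio of 1.51 for Sudoku/C corresponds to the "1.5× faster" claim, and the reciprocal of the Raytracer/C ratio (about 17.4) corresponds to "17× slower".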

Problem 1: Sudoku Solver

Task: Solve 500 hard Sudoku puzzles using constraint propagation (AC-3) and backtracking with MRV heuristic. Tests branch-heavy integer logic, bitmasking, and recursive state management.

Results

| | C | Rust | Go | Jda | Ruby | Python |
| --- | --- | --- | --- | --- | --- | --- |
| Runtime (ms) | 62 | 62 | 66 | 41 | 3,854 | 1,753 |

Jda is the fastest of the six languages: 1.5× faster than C and Rust, and 1.6× faster than Go. The compiler’s source-level optimizations (incremental row/col/box tracking, Kernighan bit counting, forced-cell early exit) produce tighter code than clang on this workload.

Why Jda wins

The hot path is a bitmask scan. Jda’s MOD→AND peephole, copy propagation, and loop register promotion eliminate redundant work, keeping in registers values that the C build leaves in memory. Jda also has no GC pauses and no bounds checks.
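To make the bitmask scan concrete, here is a minimal sketch of Kernighan bit counting combined with an MRV (minimum remaining values) cell choice. It is not the benchmark source: the 81-entry `candidates` list, the function names, and the early break at two candidates are assumptions made for illustration.

```python
def popcount_kernighan(mask):
    """Count set bits by clearing the lowest set bit until none remain."""
    count = 0
    while mask:
        mask &= mask - 1   # drops the lowest set bit
        count += 1
    return count

def pick_mrv_cell(candidates):
    """Return the index of the unsolved cell with the fewest candidates.

    `candidates` holds one nine-bit mask per cell; bit d-1 set means digit d
    is still possible. Cells with a single candidate count as solved here
    (a simplification for this sketch). Returns -1 if no cell qualifies.
    """
    best, best_count = -1, 10
    for i, mask in enumerate(candidates):
        n = popcount_kernighan(mask)
        if 1 < n < best_count:
            best, best_count = i, n
            if n == 2:     # two candidates is the minimum for an unsolved cell
                break
    return best
```

The loop in `popcount_kernighan` runs once per set bit rather than once per bit position, which is why it suits sparse candidate masks.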


Problem 2: LZ77 Compression

Task: Compress and decompress 1 MB of pseudo-random repetitive text using LZ77 with a hash-chain dictionary (4096-entry window, 258-byte lookahead). Tests memory access patterns, branch prediction, and integer arithmetic throughput.
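For readers unfamiliar with the scheme, the following is a small, unoptimized sketch of LZ77 with a hash-chain dictionary. The window and lookahead sizes come from the task description; the minimum match length, the bucket count, the hash function `hash3`, and the token format are assumptions, and the benchmark sources may differ on all of them.

```python
WINDOW = 4096      # sliding-window size, from the task description
MAX_MATCH = 258    # lookahead limit, from the task description
MIN_MATCH = 3      # assumption: shortest match worth encoding
HASH_SIZE = 4096   # assumption: number of hash buckets (a power of two)

def hash3(data, i):
    """Hash three bytes into a bucket index (hypothetical hash function)."""
    h = (data[i] << 10) ^ (data[i + 1] << 5) ^ data[i + 2]
    return h % HASH_SIZE

def compress(data):
    """Return tokens: ('lit', byte) or ('match', distance, length)."""
    n = len(data)
    head = [-1] * HASH_SIZE   # most recent position per bucket
    prev = [-1] * n           # previous position in the same bucket
    out, i = [], 0

    def insert(pos):
        if pos + MIN_MATCH <= n:
            h = hash3(data, pos)
            prev[pos] = head[h]
            head[h] = pos

    while i < n:
        best_len, best_dist = 0, 0
        if i + MIN_MATCH <= n:
            limit = min(MAX_MATCH, n - i)
            cand = head[hash3(data, i)]
            # Walk the chain from newest to oldest while inside the window.
            while cand >= 0 and i - cand <= WINDOW:
                length = 0
                while length < limit and data[cand + length] == data[i + length]:
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - cand
                cand = prev[cand]
        if best_len >= MIN_MATCH:
            out.append(("match", best_dist, best_len))
            for p in range(i, i + best_len):
                insert(p)
            i += best_len
        else:
            out.append(("lit", data[i]))
            insert(i)
            i += 1
    return out

def decompress(tokens):
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])   # byte-wise copy handles overlapping matches
    return bytes(out)
```

The byte-wise copy in `decompress` matters: a match may overlap the bytes it is producing (distance smaller than length), so a block copy of the source range would be wrong.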

Results

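| | C | Rust | Go | Jda | Ruby | Python |
| --- | --- | --- | --- | --- | --- | --- |
| Runtime (ms) | 1,830 | 2,185 | 2,721 | 277 | 222,424 | timed out (120 s) |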

Jda is 6.6× faster than C, 7.9× faster than Rust, and 9.8× faster than Go. Python times out at 120 s. This is the most striking result: a language bootstrapped from assembly outperforms every other compiled language in the comparison on a memory-intensive task.

Why Jda wins

Two compiler optimizations dominate:

  1. MOD→AND strength reduction: hash % 4096 → hash & 4095, which eliminates every IDIV in the hot loop
  2. LICM (loop-invariant code motion): the maximum match length and first-byte filter are hoisted out of the inner loop, halving iterations

The C, Rust, and Go implementations use the naive % operator, which their compilers fail to strength-reduce in this pattern.
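Both rewrites can be illustrated by hand. The sketch below shows the source-level equivalent of each transformation; it does not show Jda's compiler internals, and the function names are hypothetical.

```python
HASH_SIZE = 4096           # must be a power of two for the rewrite to hold
MASK = HASH_SIZE - 1       # 4095, i.e. the low twelve bits set

def bucket_mod(h):
    return h % HASH_SIZE   # division-based form

def bucket_and(h):
    return h & MASK        # strength-reduced form

# The two forms agree for every non-negative hash value.
assert all(bucket_mod(h) == bucket_and(h) for h in range(1 << 16))

def match_len_naive(data, a, b, max_match):
    """Compare data[a:] with data[b:] (a < b); the bound is recomputed each pass."""
    n = 0
    while n < min(max_match, len(data) - b) and data[a + n] == data[b + n]:
        n += 1
    return n

def match_len_hoisted(data, a, b, max_match):
    """Same result, with the loop-invariant bound computed once (manual LICM)."""
    limit = min(max_match, len(data) - b)
    n = 0
    while n < limit and data[a + n] == data[b + n]:
        n += 1
    return n
```

One caveat when carrying this over to C: for signed operands `%` truncates toward zero, so `-1 % 4096` is `-1` while `-1 & 4095` is `4095`. The rewrite is only valid when the value is unsigned or provably non-negative. Python's `%` is floored, so the two forms happen to agree for negative integers there as well.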


Problem 3: NFA Regex Engine

Task: Thompson NFA construction + DFA subset construction. Match 8 patterns against 100,000 strings of length 10–59. Tests recursive parsing, bit manipulation, and table-driven state machines.

Results

| | C | Rust | Go | Jda | Ruby | Python |
| --- | --- | --- | --- | --- | --- | --- |
| Runtime (ms) | 98 | 221 | 813 | 186 | 7,940 | 7,406 |

Jda finishes between C and Rust, and is 4.4× faster than Go. The DFA is built via 64-bit NFA state bitmasks with stride-32 transition tables, fitting entirely in L1 cache.
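The bitmask idea can be sketched as follows: each DFA state is an integer whose set bits name the NFA states it contains, so set union becomes a bitwise OR. This is a simplified illustration, not the benchmark code. The NFA for `(a|b)*c` is written out by hand instead of being produced by a Thompson construction, and the table is a plain list of rows, not a stride-32 array.

```python
def eps_closure(mask, eps):
    """Expand a state bitmask with everything reachable over epsilon edges."""
    stack = [s for s in range(len(eps)) if (mask >> s) & 1]
    while stack:
        s = stack.pop()
        new = eps[s] & ~mask
        mask |= new
        stack.extend(t for t in range(len(eps)) if (new >> t) & 1)
    return mask

def build_dfa(eps, delta, start, alphabet):
    """Subset construction; returns (table, sets) with table[state][symbol]."""
    sets = [eps_closure(1 << start, eps)]   # sets[i] is the bitmask of DFA state i
    ids = {sets[0]: 0}
    table = []
    i = 0
    while i < len(sets):
        row = []
        for ch in alphabet:
            nxt = 0
            for s in range(len(eps)):
                if (sets[i] >> s) & 1:
                    nxt |= delta[s].get(ch, 0)
            nxt = eps_closure(nxt, eps)
            if nxt not in ids:              # the empty mask becomes the dead state
                ids[nxt] = len(sets)
                sets.append(nxt)
            row.append(ids[nxt])
        table.append(row)
        i += 1
    return table, sets

# Hand-built NFA for (a|b)*c with 7 states; state 6 accepts.
EPS = [(1 << 1) | (1 << 3) | (1 << 5), 0, 1 << 0, 0, 1 << 0, 0, 0]
DELTA = [{}, {"a": 1 << 2}, {}, {"b": 1 << 4}, {}, {"c": 1 << 6}, {}]
ACCEPT = 1 << 6
ALPHABET = "abc"
TABLE, SETS = build_dfa(EPS, DELTA, 0, ALPHABET)

def matches(text):
    """Run the table-driven DFA over text (full match)."""
    state = 0
    for ch in text:
        col = ALPHABET.find(ch)
        if col < 0:
            return False
        state = TABLE[state][col]
    return (SETS[state] & ACCEPT) != 0
```

After construction, matching costs one table lookup per input character, which is the property the table-driven benchmark relies on.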


Problem 4: Order-32 B-Tree

Task: Insert 1,000,000 random keys into an order-32 B-tree, then perform 2,000,000 lookups (half known keys, half random). Tests pointer-heavy data structure traversal and cache behaviour.

Results

| | C | Rust | Go | Jda | Ruby | Python |
| --- | --- | --- | --- | --- | --- | --- |
| Runtime (ms) | 282 | 297 | 318 | 586 | 11,529 | 10,955 |

Jda runs ~2× slower than C here. The gap reflects two factors: Rosetta 2 x86-64 emulation overhead on pointer-chasing code, and the lack of a linear-scan register allocator (scheduled for a future release).
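To show where the pointer chasing comes from, here is a compact B-tree sketch with insert and lookup. It assumes "order 32" means at most 32 children per node (minimum degree 16 in the textbook formulation) and uses proactive splitting on the way down; the benchmark sources may define the order or the split strategy differently.

```python
from bisect import bisect_left

T = 16   # assumed minimum degree: at most 2*T - 1 = 31 keys and 32 children

class Node:
    __slots__ = ("keys", "children")

    def __init__(self):
        self.keys = []
        self.children = []   # stays empty for leaves

def search(node, key):
    """Descend one child pointer per level until the key is found or a leaf ends."""
    while True:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False
        node = node.children[i]

def split_child(parent, i):
    """Split the full child parent.children[i] around its median key."""
    child = parent.children[i]
    right = Node()
    median = child.keys[T - 1]
    right.keys = child.keys[T:]
    child.keys = child.keys[:T - 1]
    if child.children:
        right.children = child.children[T:]
        child.children = child.children[:T]
    parent.keys.insert(i, median)
    parent.children.insert(i + 1, right)

def insert(root, key):
    """Insert key (duplicates are ignored) and return the possibly new root."""
    if len(root.keys) == 2 * T - 1:
        new_root = Node()
        new_root.children.append(root)
        split_child(new_root, 0)
        root = new_root
    node = root
    while True:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return root
        if not node.children:
            node.keys.insert(i, key)
            return root
        if len(node.children[i].keys) == 2 * T - 1:
            split_child(node, i)
            if key == node.keys[i]:      # the promoted median is the key itself
                return root
            if key > node.keys[i]:
                i += 1
        node = node.children[i]
```

Each level of `search` follows one child reference to a node that is usually not adjacent in memory, so lookup time is dominated by dependent loads. That is the access pattern the paragraph above attributes the slowdown to.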


Problem 5: Ray Tracer

Task: Render an 800×600 scene with 5 spheres, shadows, and Blinn-Phong lighting using scalar f64 arithmetic. Tests floating-point throughput and branch-heavy shading logic.

Results

| | C | Rust | Go | Jda | Ruby | Python |
| --- | --- | --- | --- | --- | --- | --- |
| Runtime (ms) | 19 | 21 | 35 | 331 | 3,301 | 4,080 |

Jda runs 17× slower than C. C and Rust auto-vectorize the inner loop with ARM NEON SIMD on Apple Silicon; Jda emits scalar x86-64 SSE2 code that runs through Rosetta 2. The gap is the SIMD delta, not algorithm quality: the checksums match exactly.
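The per-pixel work consists of a ray-sphere intersection followed by a shading term. A minimal scalar sketch of both is given below; the vector helpers, the epsilon, and the default shininess of 32 are assumptions for illustration and are not taken from the benchmark sources.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def normalize(v):
    n = math.sqrt(dot(v, v))
    return (v[0] / n, v[1] / n, v[2] / n)

def hit_sphere(origin, direction, center, radius):
    """Return the nearest positive t along a unit-length ray, or None on a miss."""
    oc = sub(origin, center)
    b = dot(oc, direction)
    c = dot(oc, oc) - radius * radius
    disc = b * b - c                 # quarter discriminant, since |direction| == 1
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    t = -b - root                    # nearer intersection first
    if t <= 1e-9:
        t = -b + root                # the origin may be inside the sphere
    return t if t > 1e-9 else None

def blinn_phong(normal, to_light, to_eye, shininess=32.0):
    """Return (diffuse, specular) for unit vectors; assumes to_light != -to_eye."""
    diffuse = max(0.0, dot(normal, to_light))
    if diffuse == 0.0:
        return 0.0, 0.0              # surface faces away from the light
    half = normalize(add(to_light, to_eye))
    specular = max(0.0, dot(normal, half)) ** shininess
    return diffuse, specular
```

Every operation here is a scalar f64 multiply, add, or square root, which is the kind of code the text above describes as auto-vectorized in the C and Rust builds.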


Reproduce

# Clone the repo
git clone https://github.com/jdalang/jda-lang.git && cd jda-lang

# Build jda1 (needs Docker)
docker run --rm --platform linux/amd64 --ulimit stack=524288000:524288000 \
  -v "$(pwd)":/jda -w /jda/bootstrap/stage0 jda-build make stage1

# Compile all Jda benchmarks for macOS
for b in sudoku lz77 btree regex raytracer; do
  docker run --rm --platform linux/amd64 --ulimit stack=524288000:524288000 \
    -v "$(pwd)":/jda -w /jda/bootstrap/stage0 jda-build \
    ./jda1 build --macos /jda/benchmarks/complex/$b/$b.jda -o /jda/${b}_mac
  codesign -s - ${b}_mac
done

# Compile C, Rust, Go (native ARM64)
clang -O2 -o sudoku_c benchmarks/complex/sudoku/sudoku.c
rustc -O  -o sudoku_rs benchmarks/complex/sudoku/sudoku.rs
go build  -o sudoku_go benchmarks/complex/sudoku/sudoku.go

# Run
./sudoku_mac < benchmarks/complex/sudoku/puzzles.txt
./sudoku_c   < benchmarks/complex/sudoku/puzzles.txt

All source code is in benchmarks/complex/ — six implementations of each problem, side by side.
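This page does not say how the runtimes were measured. One simple option is a wall-clock harness like the sketch below, which takes the best of several runs. Note that it includes process startup (and, for the Jda binaries, Rosetta 2 translation), so its numbers need not match the tables above. The function name `time_command` and the choice of three runs are assumptions.

```python
import subprocess
import time

def time_command(cmd, stdin_path=None, runs=3):
    """Run cmd `runs` times and return the best wall-clock time in milliseconds."""
    best = float("inf")
    for _ in range(runs):
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            start = time.perf_counter()
            subprocess.run(cmd, stdin=stdin, stdout=subprocess.DEVNULL, check=True)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
        finally:
            if stdin is not None:
                stdin.close()
        best = min(best, elapsed_ms)
    return best

# Example usage, assuming the binaries built above are in the current directory:
#   puzzles = "benchmarks/complex/sudoku/puzzles.txt"
#   for binary in ("./sudoku_mac", "./sudoku_c"):
#       print(binary, round(time_command([binary], puzzles)), "ms")
```

Taking the minimum instead of the mean reduces noise from other processes; it does not remove the one-time translation cost that Rosetta 2 adds on a binary's first launch.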