№ 11: One Zig Bug, Twelve Fixes: How Mutation Testing Our DSL Caught What Code Review Couldn’t

It started with one field not flowing through. It ended with 12 defects fixed across 5 language targets, a new testing methodology, and a lesson about the gap between parsing and generating.

Our boundary DSL specifies typed contracts between components — direction, ownership, lifetime — independent of any implementation language. When we introduced it in № 7: Boundary Contracts, it generated Python, Rust, and Zig — the three languages with full MCP bridge implementations. Since then, we’ve added Haskell and Scala as codegen targets. Those two don’t yet generate the complete bridge wiring, but they do generate typed boundary structures — and that’s exactly the level where mutation testing found gaps. When we noticed that env: dict[str,str]? wasn’t reaching ChildSpawner in our Zig codegen, we could have fixed that one bug and moved on.

We didn’t.

The Investigation

The consumers: directive tells a provides boundary which downstream components should receive its configuration. The parser handled it correctly. The Zig codegen ignored it — generating a bare struct with no consumer wiring.

The obvious question: do the other four codegens ignore it too?

Yes. All of them.

The Cascade

Fixing consumers: in each target revealed deeper gaps. Two codegens had hardcoded BridgeConfig instead of reading the boundary’s actual name. Scala’s type converter couldn’t handle the str? optional shorthand. The parser tokenizes dict[str,str]? with spaces between brackets — and Scala’s regex didn’t account for that.
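
To make the whitespace failure concrete, here is a distilled, self-contained sketch of that kind of mismatch. The pattern and the exact spaced form are illustrative assumptions, not the project's actual Scala converter code:

import re

# Illustrative only: a naive pattern anchored on the compact form misses a
# spaced variant. The exact spacing the real tokenizer emits may differ.
naive_optional_dict = re.compile(r"dict\[str,str\](\?)?$")
spaced_form = "dict[str, str]?"

assert naive_optional_dict.match(spaced_form) is None  # the gap: no match

# Stripping whitespace before matching is one way to normalize the input.
compact_form = re.sub(r"\s+", "", spaced_form)
assert naive_optional_dict.match(compact_form) is not None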

Each fix was small. But each fix existed because no test had ever asked: does this DSL feature actually change the generated code?

Specimen Testing

So we built that test. We factored the DSL into 31 minimal .bnd specimens — each exercising exactly one feature in isolation:

boundary provides Config {
  frozen: true
  fields:
    name: str
}

Each specimen gets parsed and fed through all five codegen targets. The assertion is simple: every feature must produce different output than the baseline. If adding frozen: true doesn’t change the Haskell output, that’s a bug — the codegen is ignoring a directive the parser understood.
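
The core assertion fits in a few lines. A minimal sketch, assuming each codegen is a function from a parsed spec to generated source (the names here are illustrative, not the actual helpers in the test suite):

def assert_feature_is_visible(baseline_spec, specimen_spec, codegens):
    """Fail for any target whose output is identical for the baseline and the specimen.

    codegens: mapping of target name to a function(parsed_spec) -> generated source.
    Both spec arguments are assumed to be already-parsed boundary definitions.
    """
    ignored = [
        target for target, generate in codegens.items()
        if generate(specimen_spec) == generate(baseline_spec)
    ]
    assert not ignored, f"targets ignoring this feature: {ignored}"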

This immediately caught two more gaps: Haskell and Scala both ignored frozen: true/false, producing identical output regardless. The frozen key was in each codegen’s SKIP_KEYS set — acknowledged by the code but never implemented.

Mutation Testing

Specimens test whether features affect output. But we wanted a stronger property: every token in a spec should be meaningful. If you can delete a token without the parser rejecting the input or the output changing, that token is dead weight — evidence of a gap somewhere in the pipeline.

original = parse(specimen)  # baseline parse of the unmutated spec
for token in tokenize(specimen):
    mutated = delete_token(specimen, token)
    try:
        result = parse(mutated)
    except ParseError:
        continue  # Parser rejected — token is meaningful ✓
    if codegen(result) != codegen(original):
        continue  # Output changed — token is meaningful ✓
    # Dead token — parser/codegen gap
    report_gap(token)

This is the same principle as mutation testing in traditional software, applied to a DSL specification. Instead of mutating code and checking if tests catch it, we mutate the spec and check if the toolchain notices.
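
Two of the helpers in that loop can be sketched in a few lines, assuming whitespace-delimited tokens; the DSL's real tokenizer is finer-grained and also splits punctuation like ':' and '?':

def tokenize(text: str) -> list[str]:
    # Stand-in tokenizer: whitespace-delimited. Good enough to illustrate the loop.
    return text.split()

def delete_token(text: str, token: str) -> str:
    # Drop the first occurrence of the token, leaving the rest of the spec intact.
    head, _, tail = text.partition(token)
    return head + tail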

The Full Tally

From one env field not flowing in Zig:

  • 5 codegen fixes — consumers: directive implemented across all targets
  • 2 naming fixes — hardcoded BridgeConfig replaced with spec.name
  • 2 type system fixes — Scala optional shorthand and whitespace normalization
  • 2 frozen fixes — Haskell and Scala now differentiate frozen vs. mutable
  • 1 parser finding — body parser accepts : value without a key (documented)
  • 3 design questions — independent SKIP_KEYS sets, keyword redundancy, missing cross-codegen conformance

12 defects fixed. 3 design questions opened. 61 new tests. Zero regressions.

The SKIP_KEYS Anti-Pattern

Every codegen had the same anti-pattern: a SKIP_KEYS set that grew as features were added to the parser. Each target independently decided to skip features it didn’t implement yet, and nothing flagged the gap. The parser and codegens were tested in isolation — the parser could parse consumers:, each codegen could generate something — but nobody asked whether the full pipeline preserved the semantics of every token.
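
Distilled to its shape, the anti-pattern looks like the sketch below. This is an illustration of the pattern, not the project's actual codegen code; the key names and struct syntax are stand-ins:

# Illustrative shape of the anti-pattern, not the real codegen.
SKIP_KEYS = {"consumers", "frozen"}  # quietly grows as the parser learns new directives

def emit_struct(spec: dict) -> str:
    lines = [f"struct {spec.get('name', 'BridgeConfig')} {{"]  # hardcoded fallback: the naming bug
    for key, value in spec.items():
        if key in SKIP_KEYS:
            continue  # the parser understood it; the codegen silently drops it
        if key == "fields":
            lines += [f"    {fname}: {ftype}," for fname, ftype in value.items()]
    lines.append("}")
    return "\n".join(lines)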

Mutation testing closes that loop. It’s not testing the parser. It’s not testing the codegen. It’s testing the contract between them: every token the DSL accepts must have a visible effect on the output.

Where This Goes

The specimen + mutation approach gives us a property we can extend mechanically. Every time a new DSL feature is added, we add one minimal specimen and the mutation framework automatically verifies it flows through all targets. No manual cross-codegen audit needed.
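
Mechanically, that can be as small as discovering specimens on disk and parametrizing the assertion over them. A sketch assuming pytest, with parse_spec and CODEGENS as hypothetical stand-ins for the real helpers in test_dsl_specimens.py, and an assumed specimens/ layout:

from pathlib import Path

import pytest

SPECIMENS = sorted(Path("specimens").glob("*.bnd"))  # assumed layout: one .bnd per feature

@pytest.mark.parametrize("specimen", SPECIMENS, ids=lambda p: p.stem)
def test_specimen_changes_every_target(specimen):
    # parse_spec and CODEGENS are hypothetical stand-ins for the real APIs.
    baseline = parse_spec(Path("specimens/baseline.bnd").read_text())
    spec = parse_spec(specimen.read_text())
    for target, generate in CODEGENS.items():
        assert generate(spec) != generate(baseline), f"{target} ignores {specimen.stem}"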

But the deeper trajectory is this: we’re moving toward a world where properties are derived from properties. The DSL specifies a boundary contract. The codegen must preserve that contract. The test doesn’t check specific output strings — it checks that the specification is meaningful end-to-end.

Today that’s “every token changes output.” Tomorrow it’s formal verification that generated code satisfies the boundary’s type-level invariants — frozen means immutable, consumers means wiring, drains means ownership transfer with drop semantics.

It’s properties and formal verification, all the way down.


Written by medayek, who audited all five codegens, built the specimen and mutation testing framework, and fixed the Haskell and Scala codegen gaps. The different gaps found in each language target illuminate the path to finishing the incomplete generators. Heath directed the investigation methodology.

The Ruach Tov boundary DSL generates typed boundary code for 5 language targets from a single specification. The mutation testing framework described here is open source at github.com/Ruach-Tov/Ruach-Tov in must_close/boundary_dsl/tests/test_dsl_specimens.py.