babelbench

Working repo of the Cross-Paradigm Proof-Translation Benchmark (Secure Program
Synthesis Fellowship): measuring whether LLMs understand proof structure or merely
pattern-match system-specific syntax, via verifier-grounded translation tasks across
proof systems (target: ~150 theorem pairs; Coq↔Lean, TLAPS↔Dafny, cross-cluster).

The repo began life at the gr.inc OpenReward hackathon (April 2026) as LTLVerifyEnv,
a verifier-grounded RL environment for formal-verification tasks across five
languages. It has since been restructured into the installable babelbench package
and repointed at the translation-benchmark research arc described above. See
History and ROADMAP.md for where it's headed.

What exists today

5 verifier adapters (Dafny, TLA+/TLC, FlyVy, FizzBee, mypyvy) — src/babelbench/verifiers/
An OpenReward RL/eval environment with 8 splits, 2,048 tasks total — src/babelbench/server.py
Rollout drivers (OpenAI + Anthropic/Vertex) and verifier-grounded scoring — src/babelbench/rollout/, src/babelbench/scoring/
Audited hackathon-era results — results/hackathon/ (see AUDIT.md there for which numbers are trustworthy)

What's not built yet — Coq/Lean adapters, the pluggable scorer, the curated theorem-pair
corpus — is tracked in ROADMAP.md.

Quickstart (first hour)

git clone git@github.com:swist/fv-worlds.git && cd fv-worlds
python3 -m pip install -e ".[dev,rollout]"
bash scripts/fetch_datasets.sh          # populates datasets/ (pinned upstreams)
python3 -m pytest tests/                # offline selftests — no verifiers needed

# Run 3 Dafny->TLA translation tasks against the deployed env:
export OPENREWARD_API_KEY=...           # + ANTHROPIC_API_KEY, or Vertex (below)
python3 -m babelbench.rollout.run_rollout_claude \
    --env-name swist/fv-worlds --split translation_dafny_to_tla \
    --max-tasks 3 --run-name hello-xlate
python3 -m babelbench.scoring.score_translation results-translation_dafny_to_tla-hello-xlate.jsonl

run_rollout_claude.py also accepts --task-index (pick a starting task),
--max-steps / --max-tokens (per-turn caps), --model, and --results (override
the output path). Run python3 -m babelbench.rollout.run_rollout_claude --help for
the full, current list — flags occasionally change as the harness evolves, so treat
--help as the source of truth over any doc, including this one.

Auth

Anthropic API key: export ANTHROPIC_API_KEY=... (console.anthropic.com)
Google Vertex (no Anthropic key): export ANTHROPIC_VERTEX_PROJECT_ID=... CLOUD_ML_REGION=global.
The drivers auto-detect the provider from these env vars (--provider forces one path).
Use this path if you have a corporate GCP project instead of a personal
Anthropic key; run gcloud auth application-default login once if credentials
are missing.
OpenAI (run_rollout.py): export OPENAI_API_KEY=...
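
The auto-detection described above can be pictured with a minimal sketch. This is not the drivers' actual code: the function name detect_provider, the returned labels "anthropic" and "vertex", and the precedence of the API key over Vertex are all assumptions made for illustration. Only the environment variable names come from this README.

```python
from typing import Mapping, Optional


def detect_provider(env: Mapping[str, str], forced: Optional[str] = None) -> str:
    """Pick an auth path from environment variables (hypothetical helper)."""
    if forced is not None:
        # Stands in for the documented --provider flag, which forces one path.
        return forced
    if env.get("ANTHROPIC_API_KEY"):
        # Assumption: a direct API key wins when both are set.
        return "anthropic"
    if env.get("ANTHROPIC_VERTEX_PROJECT_ID"):
        return "vertex"
    raise RuntimeError(
        "Set ANTHROPIC_API_KEY or ANTHROPIC_VERTEX_PROJECT_ID before running a rollout"
    )
```

Passing os.environ as env gives the behaviour for the current shell; passing a plain dict makes the helper easy to test.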

Splits

8 splits, 2,048 tasks total. All splits share the same env API:
check(language, source) is non-terminal and always yields reward 0, while
submit(language, source) ends the episode and yields the final reward. Most splits
use a binary reward (1.0 iff the verifier accepts, 0.0 otherwise), but
spec_synthesis and translation_dafny_to_tla use a composite reward instead
(see Reward structure).

Split	Type	Count	Source
`dataset_invariants`	train	539	masked TLA+/FlyVy/FizzBee/mypyvy/ivybench, invariants stripped. DafnyBench is excluded — see below.
`dataset_proof`	train	655	same upstream corpora, whole proof body stripped instead of just invariants
`dataset_dafnybench`	test	503	DafnyBench `ground_truth` files, proof-masked in-repo by our own loader (not DafnyBench's `hints_removed` release) — in-distribution for any 2024+ frontier model
`dataset_mutated`	test	245	DafnyBench problems passed through syntactic mutation (loop conversion, method splitting, requires generalisation, ghost-variable injection) — out-of-training-distribution
`spec_synthesis`	train	10	hand-curated Dafny method bodies; model writes the spec; composite reward
`translation_dafny_to_tla`	train	30	small Dafny methods to translate into TLA+ — the split this fellowship's work extends
`synthesis`	train	61	7 hand-written + 3 IronFleet-distilled + 51 Dwyer LTL-pattern instantiations; model picks the language
`smoke`	test	5	one already-verifying file per language, for end-to-end plumbing checks
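
The totals quoted for the table can be cross-checked with a few lines of Python. The dictionary below only transcribes the Type and Count columns; the variable names are illustrative and do not correspond to anything in server.py.

```python
# (type, count) per split, transcribed from the table above.
SPLITS = {
    "dataset_invariants": ("train", 539),
    "dataset_proof": ("train", 655),
    "dataset_dafnybench": ("test", 503),
    "dataset_mutated": ("test", 245),
    "spec_synthesis": ("train", 10),
    "translation_dafny_to_tla": ("train", 30),
    "synthesis": ("train", 61),
    "smoke": ("test", 5),
}

total_tasks = sum(count for _, count in SPLITS.values())
train_tasks = sum(count for kind, count in SPLITS.values() if kind == "train")
test_tasks = total_tasks - train_tasks

# The README's headline figures: 8 splits, 2,048 tasks.
assert len(SPLITS) == 8
assert total_tasks == 2048
```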

See docs/splits.md for per-split task shape, field semantics, and the difficulty
ladder these splits were originally designed around.

Why DafnyBench is excluded from `dataset_invariants`

DafnyBench has been public since 2024, so we assume it is in every frontier model's
training corpus. Including it in the main invariant-filling split would inflate the
headline number through memorization rather than capability. We keep DafnyBench on
disk for two purposes instead:

dataset_dafnybench — the contaminated baseline, useful as a known recall surface
dataset_mutated — the same problems, structurally rewritten; the gap between
pass rates on these two splits is the memorization measurement

babelbench.scoring.eval_mutation_gap computes that gap directly.
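
As a rough illustration of the measurement, the gap is a difference of two pass rates. The sketch below is not the code in eval_mutation_gap. The function names and the choice to count a task as passed only at reward 1.0 are assumptions.

```python
from typing import Sequence


def pass_rate(rewards: Sequence[float]) -> float:
    """Fraction of tasks with full reward (assumed pass criterion)."""
    if not rewards:
        raise ValueError("need at least one reward")
    return sum(1 for r in rewards if r >= 1.0) / len(rewards)


def mutation_gap(dafnybench_rewards: Sequence[float],
                 mutated_rewards: Sequence[float]) -> float:
    """Pass rate on dataset_dafnybench minus pass rate on dataset_mutated."""
    return pass_rate(dafnybench_rewards) - pass_rate(mutated_rewards)
```

A large positive gap suggests the unmutated problems are being recalled. A gap near zero suggests the model handles the rewritten problems about as well as the originals.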

Reward structure

Sparse and verifier-grounded — reward comes from a real verifier's accept/reject
decision, delivered only at episode termination via submit.

Standard splits (dataset_*, synthesis, smoke):

1.0 if the verifier exits cleanly AND the output classifier returns VERIFIED
0.0 on rejection, parse error, timeout, or Status.REWARD_HACK (Dafny's syntactic
guard caught a literal assume false / assume {:axiom} false variant)
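
The binary rule can be sketched as follows. Status, VerifyResult, VERIFIED and REWARD_HACK are names this README mentions. The other enum members, the two fields on VerifyResult, and the function binary_reward are assumptions for illustration and may not match verifiers/base.py.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    VERIFIED = "verified"
    REJECTED = "rejected"        # assumed member name
    PARSE_ERROR = "parse_error"  # assumed member name
    TIMEOUT = "timeout"          # assumed member name
    REWARD_HACK = "reward_hack"


@dataclass(frozen=True)
class VerifyResult:
    status: Status
    exit_code: int  # assumed field: 0 means the verifier exited cleanly


def binary_reward(result: VerifyResult) -> float:
    """1.0 only for a clean exit AND a VERIFIED classification, else 0.0."""
    if result.exit_code == 0 and result.status is Status.VERIFIED:
        return 1.0
    return 0.0
```

Requiring both conditions means a clean exit with an unexpected classification still scores 0.0, which matches the AND in the rule above.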

spec_synthesis split (composite, still fully programmatic — no LLM judge):

0.0 if the source contains a reward-hack construct (multiplicative gate)
0.5 if the spec verifies but is vacuous (probe: delete the first body statement;
if the spec still verifies, it wasn't load-bearing)
1.0 if the spec verifies AND the probe rejects (the spec was genuinely needed)
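
A minimal sketch of this three-level rule, taking the verifier and probe outcomes as booleans. The function and parameter names are hypothetical. The 0.0 for a spec that does not verify at all is an assumption, since the list above does not spell that case out.

```python
def spec_synthesis_reward(has_reward_hack: bool,
                          spec_verifies: bool,
                          probe_still_verifies: bool) -> float:
    """Composite reward for spec_synthesis (illustrative only)."""
    if has_reward_hack:
        # Multiplicative gate: any reward-hack construct zeroes the score.
        return 0.0
    if not spec_verifies:
        # Assumption: an unverified spec earns nothing.
        return 0.0
    # Probe: the first body statement was deleted. If verification still
    # succeeds, the spec was vacuous and earns half credit.
    return 0.5 if probe_still_verifies else 1.0
```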

translation_dafny_to_tla split (composite, still fully programmatic — no LLM
judge — see docs/splits.md for the full detail):

0.5 if TLC accepts the submitted TLA+ source
0.3 if the property is non-trivial (a mutation probe shows it's load-bearing)
0.2 reserved for a bidirectional check — not yet implemented
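
Reading the three weights as additive components, the reachable maximum today would be 0.8. The sketch below encodes that reading. The function name, the additive combination, and the choice to award the 0.3 only when TLC has accepted are assumptions; docs/splits.md is the authority.

```python
TLC_ACCEPT_WEIGHT = 0.5
NON_TRIVIAL_WEIGHT = 0.3
BIDIRECTIONAL_WEIGHT = 0.2  # reserved: the check is not implemented yet


def translation_reward(tlc_accepts: bool, non_trivial: bool) -> float:
    """Composite reward for translation_dafny_to_tla (illustrative only)."""
    reward = 0.0
    if tlc_accepts:
        reward += TLC_ACCEPT_WEIGHT
        # Assumption: the non-triviality probe only counts for accepted specs.
        if non_trivial:
            reward += NON_TRIVIAL_WEIGHT
    # BIDIRECTIONAL_WEIGHT is never added until that check exists.
    return reward
```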

The check tool returns the same verifier feedback mid-episode but yields reward 0
and does not terminate, so the model can iterate against it.
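
The check-then-submit pattern can be illustrated with a toy environment. ToyEnv, StepResult and run_episode are hypothetical stand-ins rather than the OpenReward API. The stub verifier is a plain callable and the reward is the binary one.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StepResult:
    feedback: str
    reward: float
    done: bool


class ToyEnv:
    """Stand-in env: check never rewards or terminates, submit does both."""

    def __init__(self, verifier: Callable[[str, str], bool]) -> None:
        self._verifier = verifier

    def _feedback(self, language: str, source: str) -> str:
        return "accepted" if self._verifier(language, source) else "rejected"

    def check(self, language: str, source: str) -> StepResult:
        return StepResult(self._feedback(language, source), 0.0, False)

    def submit(self, language: str, source: str) -> StepResult:
        feedback = self._feedback(language, source)
        return StepResult(feedback, 1.0 if feedback == "accepted" else 0.0, True)


def run_episode(env: ToyEnv, language: str, candidates: Sequence[str]) -> StepResult:
    """Check candidates in order; submit the first accepted one, else the last."""
    if not candidates:
        raise ValueError("need at least one candidate")
    chosen = candidates[-1]
    for source in candidates:
        if env.check(language, source).feedback == "accepted":
            chosen = source
            break
    return env.submit(language, chosen)
```

In a real rollout the candidates would come from successive model turns that react to the check feedback, not from a fixed list.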

The scoring layer is becoming pluggable. The verifier's accept/reject decision
remains ground truth for every task, but the layer around it (how "verified" gets
turned into a number, and eventually how cross-system translations get judged as
"proving the same thing") is being generalized so that alternative scorers, such as
proof-trace heuristics or an LLM grader evaluated against verifier truth, can be
compared on equal footing. See ROADMAP.md and CLAUDE.md's Scoring policy section.
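
One way to picture "compared on equal footing" is to score every alternative scorer by how often it agrees with the verifier's verdict. Nothing below exists in the repo yet. The Scorer type, agreement_rate, and the toy heuristic are hypothetical, and ROADMAP.md may settle on a different interface.

```python
from typing import Callable, Sequence, Tuple

# A scorer maps a submission's source text to an accept/reject guess.
Scorer = Callable[[str], bool]


def agreement_rate(scorer: Scorer,
                   labelled: Sequence[Tuple[str, bool]]) -> float:
    """Fraction of (source, verifier_accepted) pairs the scorer gets right."""
    if not labelled:
        raise ValueError("need at least one labelled submission")
    hits = sum(1 for source, accepted in labelled if scorer(source) == accepted)
    return hits / len(labelled)


def no_assume_heuristic(source: str) -> bool:
    """Toy scorer: guess 'verified' unless the text contains 'assume'."""
    return "assume" not in source
```

Because the labels come from the verifier, any scorer expressible as such a callable can be ranked with the same function.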

Repo layout

pyproject.toml                  # pip install -e .
README.md  CLAUDE.md  ROADMAP.md
src/babelbench/
    server.py                   # OpenReward env: splits, list_tasks, check/submit tools
    paths.py                    # single source of truth for repo-root/data-dir resolution
    verifiers/                  # base.py (Status enum, VerifyResult) + one adapter per language
    tasks/                      # mask.py, translation.py, mutate.py, spec_synth.py,
                                 #   perturb_synthesis.py, dwyer/{patterns,generate}.py
    scoring/                    # score_run.py, score_translation.py, rescore_non_trivial.py,
                                 #   eval_mutation_gap.py, deck_analyses.py, present_results.py
    rollout/                    # run_rollout.py (OpenAI), run_rollout_claude.py (Anthropic/Vertex),
                                 #   test_agent.py
tests/                          # pytest-runnable: selftest, mutate, translation, paths, rollout-client
scripts/                        # fetch_datasets.sh, spec_synth_demo.py
datasets/                       # vendored corpora + generated mutations (data, not code)
examples/                       # synthesis JSONLs + ironfleet_reference/ (reference only)
results/hackathon/              # audit-verified headline jsonls + AUDIT.md
docs/
    archive/                    # HANDOFF.md, DECK.md, DECK_BRIEF.md, findings-2026-04-25.md
    splits.md                   # per-split task shape and field semantics
    COMPUTE_ESTIMATE.md         # due 2026-07-10, see ROADMAP.md
training/tinker/                # vendored RL training cookbook code, own dep stack
Dockerfile                      # multi-stage; builds + ships all 5 verifiers
run_full_eval.sh                # 6-tier eval orchestrator

Provenance & licenses

MIT for the environment scaffolding. Vendored corpora retain their upstream licenses
(all permissive — DafnyBench MIT, tlaplus/Examples MIT, FlyVy MIT, FizzBee Apache-2.0,
mypyvy BSD-3-Clause, IronFleet MIT). Full attribution, source repos, and pinned
commits are in datasets/PROVENANCE.md — run bash scripts/fetch_datasets.sh to
populate datasets/ and examples/ironfleet_reference/ from those pins.

IronFleet is vendored as reference material only for synthesis prompts, not
graded directly — it requires Dafny 3.4.0 and ~50K LoC of interdependent proofs that
won't verify file-by-file. Don't try to "fix" this; it's a deliberate scope boundary.

History

Built at the gr.inc OpenReward hackathon (April 2026) as LTLVerifyEnv, a
verifier-grounded RL environment spanning five formal-verification languages. The
hackathon-2026-04-final git tag preserves that state in full, including the eval
logs that were later pruned from master. docs/archive/ carries the deck-era
narrative docs (DECK.md, DECK_BRIEF.md, HANDOFF.md, findings-2026-04-25.md) —
useful for the "how did we get here" story, but their quantitative claims should be
checked against results/hackathon/AUDIT.md before being cited anywhere new.

Compute Configuration

Component	Configuration
Environment Server	4 vCPUs / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Component	Cost / second
Environment	$0.0000740
Sandbox	Not configured
Total	$0.0000740
