fv-worlds
LTLVerifyEnv
Description
An RL environment for training and evaluating LLMs on formal verification tasks. The agent writes formal-methods source (in Dafny, TLA+, FlyVy, FizzBee, or mypyvy) and the reward signal comes from the actual verifier accepting or rejecting it — not from an LLM judge. Tasks span six families:
- Filling in stripped invariants in real upstream proofs (TLA+/FlyVy/FizzBee/mypyvy/ivybench).
- Reconstructing whole proof bodies across all five languages.
- Synthesising spec+proof from a natural-language property — model picks the language.
- DafnyBench baseline (
dataset_dafnybench) — the contaminated reference; pair withdataset_mutatedto measure the memorization gap. - DafnyBench mutated (
dataset_mutated) — same problems, syntactically transformed (loop conversion, method splitting, requires generalisation, ghost-variable injection) so they are out-of-training-distribution. - Spec synthesis (
spec_synthesis) — body given, model writes the spec, with composite reward: 0.5 verified + 0.5 non-trivial under mutation, multiplicatively gated to 0 by reward-hack constructs.
Plus a translation_dafny_to_tla cross-language split and a 5-task smoke set.
Capabilities exercised
- Reading and editing five different formal-verification languages
- Inferring inductive invariants strong enough to close a proof
- Writing specs that capture an NL requirement (vs. trivially satisfiable specs)
- Translating natural-language safety/liveness properties into formal specs
- Cross-language translation (Dafny ⇄ TLA+ on simple programs)
- Iterating against a verifier's counterexample feedback (multi-turn)
- Choosing the right formalism for a property (synthesis tasks let the model pick)
Compute Requirements
Single sandbox container, ~3.4 GB image (TLAPS bundles its own backends — drop tlapm to halve the size if TLC alone suffices). No GPU. Each verifier call is CPU-bound on a single thread; the default per-call timeout is 60s. A typical rollout runs ~3-10 verifier invocations.
License
MIT for the environment scaffolding. Vendored corpora retain their upstream licenses (all permissive — DafnyBench MIT, tlaplus/Examples MIT, FlyVy MIT, FizzBee Apache-2.0, IronFleet MIT). Full attribution in datasets/PROVENANCE.md.
Splits
| Split | Type | Count | Source |
|---|---|---|---|
dataset_invariants | train | 294 | masked TLA+/FlyVy/FizzBee/mypyvy + ivybench. DafnyBench excluded — see below. |
dataset_proof | train | 306 | regex-masked proof bodies across all five languages |
dataset_dafnybench | train | 503 | DafnyBench hints_removed baseline. In-distribution for any 2024+ frontier model. |
dataset_mutated | train | ~2,500 | DafnyBench files passed through scripts/mutate.py's 4 syntactic transforms. Out-of-training-distribution. |
spec_synthesis | train | 10 | hand-curated Dafny method bodies; model writes the spec. Composite reward. |
translation_dafny_to_tla | train | 30 | small Dafny methods to translate into TLA+ |
synthesis | train | 61 | 7 hand-written + 3 IronFleet-distilled + 51 Dwyer LTL pattern instantiations |
smoke | test | 5 | one ground-truth file per language for sanity checking |
Each task is a TaskSpec with id, kind, language (or null for synthesis), masked_source or nl_spec, and instructions containing the user-facing prompt. Dataset tasks additionally carry reference_source (the unmasked original) for proof_length_ratio scoring.
Why DafnyBench is excluded from dataset_invariants
DafnyBench has been in every frontier model's training corpus since 2024. Including it inflates the headline number with memorization, not capability. We keep DafnyBench files on disk for two purposes:
dataset_dafnybench(the contaminated baseline) — useful as a known recall surface to compare againstdataset_mutated(the same problems, structurally rewritten) — out-of-distribution; the gap between pass rates on these two splits is the memorization measurement
scripts/eval_mutation_gap.py computes that gap directly.
Reward Structure
Sparse and verifier-grounded — no LLM judge anywhere. Reward delivered only at episode termination via the submit tool.
Standard splits (dataset_*, synthesis, smoke):
- 1.0 if the verifier exits cleanly AND the output classifier returns
VERIFIED - 0.0 if rejected, parse error, timeout, or
Status.REWARD_HACK(Dafny syntactic guard caughtassume false/assume {:axiom} false)
spec_synthesis split (composite, all programmatic):
- 0.0 if the source contains any reward-hack construct (multiplicative gate)
- 0.5 if the spec verifies but is vacuous (probe: delete first body statement → spec still verifies → spec was redundant)
- 1.0 if the spec verifies AND the probe rejects (the spec was load-bearing)
The check tool returns the same verifier feedback during the episode but yields reward 0 and does not terminate, so the model can iterate.
Additive scoring metadata (no reward change)
Every submit() returns metadata for downstream aggregation. This is additive — the binary reward contract above is unchanged:
| Field | Meaning |
|---|---|
verified | bool — verifier said VERIFIED |
non_trivial | bool/null — on by default (opt out with LTLV_NONTRIV=0). Tries deleting the last ensures/invariant clause; if verifier still accepts, the original spec was redundant in that clause. |
proof_length_ratio | float/null — len(submitted) / len(reference) for tasks with a reference source |
n_check_calls | int — how many check() calls the model made before submitting |
used_unsound_constructs | bool — regex scan for known trivializers (assume false, Inv == TRUE, decreases *, {:axiom}, # UNSOUND, etc.) |
scripts/score_run.py aggregates these into a slide-ready table with per-language and per-status breakdowns.
Tools
Two tools, both taking (language, source):
check(language, source)— Run the verifier onsource. Returns the classified status (VERIFIED / REJECTED / PARSE_ERROR / TIMEOUT / REWARD_HACK) plus the verifier's stdout/stderr. Non-terminal, reward 0. Use to iterate.submit(language, source)— Final answer. Ends the episode. Reward as above.
language ∈ {dafny, tlaplus, flyvy, fizzbee, mypyvy}.
Time Horizon
Multi-turn. A typical rollout is 3-8 tool calls: one or two check calls to test hypotheses, then submit. The example agents (run_rollout.py for OpenAI, run_rollout_claude.py for Anthropic + Vertex) cap at 8 turns by default.
Environment Difficulty
Measured numbers from gpt-5.4 (one full pass each):
synthesis: 95.1% pass (61 tasks, hand-curated)synthesis(surface-perturbed): 98.4% pass (61 tasks via--tasks-from) — perturbation didn't drop the score, suggesting the 95% wasn't pure surface-form recall
Other splits: see DECK_BRIEF.md for the methodology. The headline experiment is dataset_dafnybench (in-distribution) vs dataset_mutated (out-of-distribution) — the gap is the memorization signal.
Other Environment Requirements
- An LLM provider API key for sampling (OpenAI / Anthropic direct / Anthropic via Vertex / Google).
- An OpenReward API key.
- No third-party services; all five verifiers run locally inside the container.
Optional environment variables:
| Var | Default | Purpose |
|---|---|---|
LTLV_DATA_ROOT | /app/datasets | dataset location. Set to /orwd_data if uploading via OpenReward Files tab. |
LTLV_SYNTH_ROOT | /app/examples | synthesis JSONL location |
LTLV_VERIFY_TIMEOUT | 60 | per-call verifier timeout in seconds |
LTLV_NONTRIV | 1 | mutation-based non_trivial scoring runs on every verified submit by default (one extra verifier call). Set to 0 to disable when smoke wall-clock matters more than the metadata. |
Safety
- The agent has no network access and only writes to a temporary scratch directory inside the sandbox; each verifier invocation runs in an isolated subdir.
- All verifier subprocess invocations have a wall-clock timeout to prevent runaway proof attempts.
- Reward-hack mitigation is live for Dafny:
verifiers/dafny.py:_is_reward_hackshort-circuits withStatus.REWARD_HACK(reward 0) before invoking the verifier on any source containingassume false,assume {:axiom} false, or related variants.scripts/selftest.pycovers 9 variants. - For the other four languages, the trivializer regex is applied as additive metadata (
used_unsound_constructs) but does NOT zero the reward — those verifiers' soundness properties are well-understood and the regex flags are diagnostic only. - Domain dual-use risk is low — formal-verification proofs are not weaponisable.
Citations
@misc{ltl_verify_env_2026,
title = {LTLVerifyEnv: Verifier-grounded RL environment for formal-methods reasoning},
year = {2026},
note = {OpenReward hackathon submission},
url = {https://openreward.ai/}
}
@article{loughridge2024dafnybench,
title = {DafnyBench: A Benchmark for Formal Software Verification},
author = {Loughridge, Chloe and Sun, Qinyi and Ahrenbach, Seth and Cassano, Federico
and Sun, Chuyue and Sheng, Ying and Mudide, Anish and Misu, Md Rakib Hossain
and Amin, Nada and Tegmark, Max},
journal = {arXiv preprint arXiv:2406.08467},
year = {2024}
}
@inproceedings{hawblitzel2015ironfleet,
title = {IronFleet: Proving Practical Distributed Systems Correct},
author = {Hawblitzel, Chris and Howell, Jon and Kapritsos, Manos and Lorch, Jacob R.
and Parno, Bryan and Roberts, Michael L. and Setty, Srinath and Zill, Brian},
booktitle = {Proc. ACM Symposium on Operating Systems Principles (SOSP)},
year = {2015}
}Quickstart
# 1. Populate corpora (idempotent, ~5 min, ~500 MB on disk)
bash scripts/fetch_datasets.sh
# 2. Generate dataset_mutated (~3,500 candidate files; uses Dafny verifier
# inside the container to filter — pass --dry-run if no Dafny locally)
python3 scripts/mutate.py
# 3. Generate perturbed synthesis tasks
python3 scripts/perturb_synthesis.py
# 4. Offline tests (no verifiers needed)
python3 scripts/selftest.py
# 5. Build the container (~10 min, builds FlyVy from source)
docker build -t ltl-verify-env .
docker run --rm -p 8000:8000 -e LTLV_NONTRIV=1 ltl-verify-env
# 6. Run a rollout against a deployed env (OpenReward; not the local container)
export OPENREWARD_API_KEY=... OPENAI_API_KEY=...
python3 run_rollout.py --split synthesis --max-tasks 61 --workers 8
# 7. Or run the full 5-tier eval pipeline
./run_full_eval.sh
# 8. Score any run for the slide
python3 scripts/score_run.py --by-language results-<split>-<run-name>.jsonlLayout
.
├── server.py # ORS Environment: splits, list_tasks,
│ # check + submit tools, additive metadata
├── verifiers/
│ ├── base.py # Status enum, VerifyResult, base class
│ ├── {dafny,tlaplus,flyvy,fizzbee,mypyvy}.py
│ ├── dafny.py # also: REWARD_HACK syntactic guard
│ └── heuristics.py # spec-quality regex + mutation helpers
├── scripts/
│ ├── selftest.py # offline tests; classifier + masker + guard
│ ├── fetch_datasets.sh # pinned, idempotent corpus fetcher
│ ├── mask.py # invariants/proof regex masking
│ ├── mutate.py # generate dataset_mutated from DafnyBench
│ ├── perturb_synthesis.py # surface perturbations of synthesis split
│ ├── spec_synth.py # hand-curated body→spec task pairs
│ ├── translation.py # Dafny↔TLA+ translation tasks
│ ├── eval_mutation_gap.py # dataset_dafnybench vs dataset_mutated comparator
│ ├── score_run.py # slide-ready aggregation table
│ ├── score_translation.py # translation-specific scoring
│ ├── spec_synth_demo.py # end-to-end composite reward demo
│ └── dwyer/ # 51 LTL-pattern synthesis prompts
├── datasets/ # vendored corpora + generated mutations
├── examples/
│ ├── synthesis.jsonl # 7 hand-written NL specs
│ ├── synthesis_ironfleet.jsonl # 3 IronFleet-derived
│ ├── synthesis_dwyer.jsonl # 51 LTL pattern instantiations
│ ├── perturbed_synthesis_v1.jsonl # surface-perturbed synthesis
│ └── ironfleet_reference/ # IronFleet source (reference only)
├── run_rollout.py # OpenAI rollout driver (parallel, JSONL out)
├── run_rollout_claude.py # Anthropic + Vertex rollout driver
├── run_full_eval.sh # 5-tier orchestrator with cost preview
├── DECK_BRIEF.md # orientation for slide-making
├── HANDOFF.md # multi-tab session log
├── CLAUDE.md # project conventions for Claude instances
├── Dockerfile # multi-stage; builds + ships all 5 verifiers
└── README.md