fv-worlds

API Endpoint
Leaderboard
Loading leaderboard...
README

LTLVerifyEnv

OpenReward Environment

Description

An RL environment for training and evaluating LLMs on formal verification tasks. The agent writes formal-methods source (in Dafny, TLA+, FlyVy, FizzBee, or mypyvy) and the reward signal comes from the actual verifier accepting or rejecting it — not from an LLM judge. Tasks span six families:

  1. Filling in stripped invariants in real upstream proofs (TLA+/FlyVy/FizzBee/mypyvy/ivybench).
  2. Reconstructing whole proof bodies across all five languages.
  3. Synthesising spec+proof from a natural-language property — model picks the language.
  4. DafnyBench baseline (dataset_dafnybench) — the contaminated reference; pair with dataset_mutated to measure the memorization gap.
  5. DafnyBench mutated (dataset_mutated) — same problems, syntactically transformed (loop conversion, method splitting, requires generalisation, ghost-variable injection) so they are out-of-training-distribution.
  6. Spec synthesis (spec_synthesis) — body given, model writes the spec, with composite reward: 0.5 verified + 0.5 non-trivial under mutation, multiplicatively gated to 0 by reward-hack constructs.

Plus a translation_dafny_to_tla cross-language split and a 5-task smoke set.

Capabilities exercised

  • Reading and editing five different formal-verification languages
  • Inferring inductive invariants strong enough to close a proof
  • Writing specs that capture an NL requirement (vs. trivially satisfiable specs)
  • Translating natural-language safety/liveness properties into formal specs
  • Cross-language translation (Dafny ⇄ TLA+ on simple programs)
  • Iterating against a verifier's counterexample feedback (multi-turn)
  • Choosing the right formalism for a property (synthesis tasks let the model pick)

Compute Requirements

Single sandbox container, ~3.4 GB image (TLAPS bundles its own backends — drop tlapm to halve the size if TLC alone suffices). No GPU. Each verifier call is CPU-bound on a single thread; the default per-call timeout is 60s. A typical rollout runs ~3-10 verifier invocations.

License

MIT for the environment scaffolding. Vendored corpora retain their upstream licenses (all permissive — DafnyBench MIT, tlaplus/Examples MIT, FlyVy MIT, FizzBee Apache-2.0, IronFleet MIT). Full attribution in datasets/PROVENANCE.md.

Splits

SplitTypeCountSource
dataset_invariantstrain294masked TLA+/FlyVy/FizzBee/mypyvy + ivybench. DafnyBench excluded — see below.
dataset_prooftrain306regex-masked proof bodies across all five languages
dataset_dafnybenchtrain503DafnyBench hints_removed baseline. In-distribution for any 2024+ frontier model.
dataset_mutatedtrain~2,500DafnyBench files passed through scripts/mutate.py's 4 syntactic transforms. Out-of-training-distribution.
spec_synthesistrain10hand-curated Dafny method bodies; model writes the spec. Composite reward.
translation_dafny_to_tlatrain30small Dafny methods to translate into TLA+
synthesistrain617 hand-written + 3 IronFleet-distilled + 51 Dwyer LTL pattern instantiations
smoketest5one ground-truth file per language for sanity checking

Each task is a TaskSpec with id, kind, language (or null for synthesis), masked_source or nl_spec, and instructions containing the user-facing prompt. Dataset tasks additionally carry reference_source (the unmasked original) for proof_length_ratio scoring.

Why DafnyBench is excluded from dataset_invariants

DafnyBench has been in every frontier model's training corpus since 2024. Including it inflates the headline number with memorization, not capability. We keep DafnyBench files on disk for two purposes:

  • dataset_dafnybench (the contaminated baseline) — useful as a known recall surface to compare against
  • dataset_mutated (the same problems, structurally rewritten) — out-of-distribution; the gap between pass rates on these two splits is the memorization measurement

scripts/eval_mutation_gap.py computes that gap directly.

Reward Structure

Sparse and verifier-grounded — no LLM judge anywhere. Reward delivered only at episode termination via the submit tool.

Standard splits (dataset_*, synthesis, smoke):

  • 1.0 if the verifier exits cleanly AND the output classifier returns VERIFIED
  • 0.0 if rejected, parse error, timeout, or Status.REWARD_HACK (Dafny syntactic guard caught assume false / assume {:axiom} false)

spec_synthesis split (composite, all programmatic):

  • 0.0 if the source contains any reward-hack construct (multiplicative gate)
  • 0.5 if the spec verifies but is vacuous (probe: delete first body statement → spec still verifies → spec was redundant)
  • 1.0 if the spec verifies AND the probe rejects (the spec was load-bearing)

The check tool returns the same verifier feedback during the episode but yields reward 0 and does not terminate, so the model can iterate.

Additive scoring metadata (no reward change)

Every submit() returns metadata for downstream aggregation. This is additive — the binary reward contract above is unchanged:

FieldMeaning
verifiedbool — verifier said VERIFIED
non_trivialbool/null — on by default (opt out with LTLV_NONTRIV=0). Tries deleting the last ensures/invariant clause; if verifier still accepts, the original spec was redundant in that clause.
proof_length_ratiofloat/null — len(submitted) / len(reference) for tasks with a reference source
n_check_callsint — how many check() calls the model made before submitting
used_unsound_constructsbool — regex scan for known trivializers (assume false, Inv == TRUE, decreases *, {:axiom}, # UNSOUND, etc.)

scripts/score_run.py aggregates these into a slide-ready table with per-language and per-status breakdowns.

Tools

Two tools, both taking (language, source):

  • check(language, source) — Run the verifier on source. Returns the classified status (VERIFIED / REJECTED / PARSE_ERROR / TIMEOUT / REWARD_HACK) plus the verifier's stdout/stderr. Non-terminal, reward 0. Use to iterate.
  • submit(language, source) — Final answer. Ends the episode. Reward as above.

language ∈ {dafny, tlaplus, flyvy, fizzbee, mypyvy}.

Time Horizon

Multi-turn. A typical rollout is 3-8 tool calls: one or two check calls to test hypotheses, then submit. The example agents (run_rollout.py for OpenAI, run_rollout_claude.py for Anthropic + Vertex) cap at 8 turns by default.

Environment Difficulty

Measured numbers from gpt-5.4 (one full pass each):

  • synthesis: 95.1% pass (61 tasks, hand-curated)
  • synthesis (surface-perturbed): 98.4% pass (61 tasks via --tasks-from) — perturbation didn't drop the score, suggesting the 95% wasn't pure surface-form recall

Other splits: see DECK_BRIEF.md for the methodology. The headline experiment is dataset_dafnybench (in-distribution) vs dataset_mutated (out-of-distribution) — the gap is the memorization signal.

Other Environment Requirements

  • An LLM provider API key for sampling (OpenAI / Anthropic direct / Anthropic via Vertex / Google).
  • An OpenReward API key.
  • No third-party services; all five verifiers run locally inside the container.

Optional environment variables:

VarDefaultPurpose
LTLV_DATA_ROOT/app/datasetsdataset location. Set to /orwd_data if uploading via OpenReward Files tab.
LTLV_SYNTH_ROOT/app/examplessynthesis JSONL location
LTLV_VERIFY_TIMEOUT60per-call verifier timeout in seconds
LTLV_NONTRIV1mutation-based non_trivial scoring runs on every verified submit by default (one extra verifier call). Set to 0 to disable when smoke wall-clock matters more than the metadata.

Safety

  • The agent has no network access and only writes to a temporary scratch directory inside the sandbox; each verifier invocation runs in an isolated subdir.
  • All verifier subprocess invocations have a wall-clock timeout to prevent runaway proof attempts.
  • Reward-hack mitigation is live for Dafny: verifiers/dafny.py:_is_reward_hack short-circuits with Status.REWARD_HACK (reward 0) before invoking the verifier on any source containing assume false, assume {:axiom} false, or related variants. scripts/selftest.py covers 9 variants.
  • For the other four languages, the trivializer regex is applied as additive metadata (used_unsound_constructs) but does NOT zero the reward — those verifiers' soundness properties are well-understood and the regex flags are diagnostic only.
  • Domain dual-use risk is low — formal-verification proofs are not weaponisable.

Citations

@misc{ltl_verify_env_2026,
  title  = {LTLVerifyEnv: Verifier-grounded RL environment for formal-methods reasoning},
  year   = {2026},
  note   = {OpenReward hackathon submission},
  url    = {https://openreward.ai/}
}

@article{loughridge2024dafnybench,
  title   = {DafnyBench: A Benchmark for Formal Software Verification},
  author  = {Loughridge, Chloe and Sun, Qinyi and Ahrenbach, Seth and Cassano, Federico
             and Sun, Chuyue and Sheng, Ying and Mudide, Anish and Misu, Md Rakib Hossain
             and Amin, Nada and Tegmark, Max},
  journal = {arXiv preprint arXiv:2406.08467},
  year    = {2024}
}

@inproceedings{hawblitzel2015ironfleet,
  title     = {IronFleet: Proving Practical Distributed Systems Correct},
  author    = {Hawblitzel, Chris and Howell, Jon and Kapritsos, Manos and Lorch, Jacob R.
               and Parno, Bryan and Roberts, Michael L. and Setty, Srinath and Zill, Brian},
  booktitle = {Proc. ACM Symposium on Operating Systems Principles (SOSP)},
  year      = {2015}
}

Quickstart

# 1. Populate corpora (idempotent, ~5 min, ~500 MB on disk)
bash scripts/fetch_datasets.sh

# 2. Generate dataset_mutated (~3,500 candidate files; uses Dafny verifier
#    inside the container to filter — pass --dry-run if no Dafny locally)
python3 scripts/mutate.py

# 3. Generate perturbed synthesis tasks
python3 scripts/perturb_synthesis.py

# 4. Offline tests (no verifiers needed)
python3 scripts/selftest.py

# 5. Build the container (~10 min, builds FlyVy from source)
docker build -t ltl-verify-env .
docker run --rm -p 8000:8000 -e LTLV_NONTRIV=1 ltl-verify-env

# 6. Run a rollout against a deployed env (OpenReward; not the local container)
export OPENREWARD_API_KEY=... OPENAI_API_KEY=...
python3 run_rollout.py --split synthesis --max-tasks 61 --workers 8

# 7. Or run the full 5-tier eval pipeline
./run_full_eval.sh

# 8. Score any run for the slide
python3 scripts/score_run.py --by-language results-<split>-<run-name>.jsonl

Layout

. ├── server.py # ORS Environment: splits, list_tasks, │ # check + submit tools, additive metadata ├── verifiers/ │ ├── base.py # Status enum, VerifyResult, base class │ ├── {dafny,tlaplus,flyvy,fizzbee,mypyvy}.py │ ├── dafny.py # also: REWARD_HACK syntactic guard │ └── heuristics.py # spec-quality regex + mutation helpers ├── scripts/ │ ├── selftest.py # offline tests; classifier + masker + guard │ ├── fetch_datasets.sh # pinned, idempotent corpus fetcher │ ├── mask.py # invariants/proof regex masking │ ├── mutate.py # generate dataset_mutated from DafnyBench │ ├── perturb_synthesis.py # surface perturbations of synthesis split │ ├── spec_synth.py # hand-curated body→spec task pairs │ ├── translation.py # Dafny↔TLA+ translation tasks │ ├── eval_mutation_gap.py # dataset_dafnybench vs dataset_mutated comparator │ ├── score_run.py # slide-ready aggregation table │ ├── score_translation.py # translation-specific scoring │ ├── spec_synth_demo.py # end-to-end composite reward demo │ └── dwyer/ # 51 LTL-pattern synthesis prompts ├── datasets/ # vendored corpora + generated mutations ├── examples/ │ ├── synthesis.jsonl # 7 hand-written NL specs │ ├── synthesis_ironfleet.jsonl # 3 IronFleet-derived │ ├── synthesis_dwyer.jsonl # 51 LTL pattern instantiations │ ├── perturbed_synthesis_v1.jsonl # surface-perturbed synthesis │ └── ironfleet_reference/ # IronFleet source (reference only) ├── run_rollout.py # OpenAI rollout driver (parallel, JSONL out) ├── run_rollout_claude.py # Anthropic + Vertex rollout driver ├── run_full_eval.sh # 5-tier orchestrator with cost preview ├── DECK_BRIEF.md # orientation for slide-making ├── HANDOFF.md # multi-tab session log ├── CLAUDE.md # project conventions for Claude instances ├── Dockerfile # multi-stage; builds + ships all 5 verifiers └── README.md
swist/fv-worlds | OpenReward