# SATBench
## Description
SATBench is an environment for evaluating LLM logical reasoning through natural language puzzles derived from Boolean satisfiability (SAT) formulas. Agents must determine whether a set of conditions can be simultaneously satisfied and, if so, provide a valid variable assignment.
## Capabilities
- Search-based logical reasoning (backtracking, constraint satisfaction)
- Formal verification of satisfying assignments against CNF clauses
- Natural language understanding of logical constraints
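To make "search-based logical reasoning" concrete, here is a minimal sketch of a backtracking SAT solver over CNF clauses. This is illustrative only, not the environment's generator or grader; the clause representation (integer literals, where `k` means variable `k` is true and `-k` means it is false) is an assumption for the example.

```python
def solve(clauses, assignment=None):
    """Return a satisfying {var: bool} dict, or None if UNSAT."""
    if assignment is None:
        assignment = {}
    # Simplify clauses under the current partial assignment.
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            continue  # clause already satisfied
        rest = [lit for lit in clause if abs(lit) not in assignment]
        if not rest:
            return None  # clause falsified under this branch -> backtrack
        simplified.append(rest)
    if not simplified:
        return assignment  # every clause satisfied
    var = abs(simplified[0][0])  # branch on the first unassigned variable
    for value in (True, False):
        result = solve(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None
```

For example, `solve([[1], [-1]])` returns `None` (UNSAT), while `solve([[1, 2], [-1], [2]])` returns a satisfying assignment.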
## Compute Requirements
No special compute requirements. Lightweight text-based grading with no external API calls.
## License
## Tasks
- Single split: `train` with 2,100 tasks
- SAT/UNSAT balance: 1,050 satisfiable + 1,050 unsatisfiable (50/50)
- Difficulty: Easy (4–15 clauses), Medium (20–30 clauses), Hard (31–50 clauses)
- Variable count: 5–90 variables (mean 36), with 1D/2D/3D structure
Each task presents a natural language scenario with numbered conditions and asks whether all conditions can be satisfied simultaneously.
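For illustration, a miniature task of this shape might look like the following. The scenario, field names, and values are all invented for the example; they are not the dataset's actual schema.

```python
# A hypothetical miniature task (invented; not drawn from the dataset).
# Variables: 1 = "Alice plants tomatoes", 2 = "Bob plants carrots".
task = {
    "scenario": "Two gardeners decide what to plant.",
    "conditions": [
        "1. Alice plants tomatoes or Bob plants carrots.",
        "2. If Alice plants tomatoes, Bob does not plant carrots.",
    ],
    "question": "Can all conditions be satisfied simultaneously?",
    "variable_mapping": {1: "Alice plants tomatoes", 2: "Bob plants carrots"},
    # CNF in integer-literal form: k = variable k true, -k = variable k false.
    "cnf_clauses": [[1, 2], [-1, -2]],
    "label": "SAT",
}

# One satisfying assignment: Alice plants tomatoes, Bob does not plant carrots.
assignment = {1: True, 2: False}
satisfied = all(
    any(assignment[abs(lit)] == (lit > 0) for lit in clause)
    for clause in task["cnf_clauses"]
)
```

Here condition 2 becomes the clause `[-1, -2]` (an implication rewritten as a disjunction), and the candidate assignment satisfies both clauses.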
## Reward Structure
Sparse, fully verifiable rewards (no LLM judge):
| Prediction | Ground Truth | Condition | Reward |
|---|---|---|---|
| UNSAT | UNSAT | — | 1.0 |
| UNSAT | SAT | — | 0.0 |
| SAT | UNSAT | — | 0.0 |
| SAT | SAT | Valid assignment provided | 1.0 |
| SAT | SAT | Invalid/missing assignment | 0.0 |
For SAT predictions, the agent must provide a satisfying variable assignment that is verified clause-by-clause against the underlying CNF formula. This prevents guessing — the agent must actually solve the problem.
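The reward table above can be sketched as a grading function. This is a minimal re-implementation of the described logic under the same integer-literal CNF convention, not the environment's actual grader.

```python
def grade(prediction, ground_truth, assignment, cnf_clauses):
    """Return 1.0 or 0.0 per the reward table.

    prediction / ground_truth: "SAT" or "UNSAT".
    assignment: {variable_index: bool} or None.
    cnf_clauses: list of clauses; each clause is a list of integer
    literals, where k means variable k is true and -k means it is false.
    """
    if prediction == "UNSAT":
        return 1.0 if ground_truth == "UNSAT" else 0.0
    # SAT prediction: a wrong label or a missing assignment scores 0.
    if ground_truth != "SAT" or not assignment:
        return 0.0
    # Verify the assignment clause-by-clause against the CNF formula.
    for clause in cnf_clauses:
        if not any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            return 0.0  # this clause is unsatisfied -> invalid assignment
    return 1.0
```

For example, `grade("SAT", "SAT", {1: True}, [[1]])` yields 1.0, while the same prediction with `{1: False}` yields 0.0 because the single clause goes unsatisfied.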
## Data
- Source: HuggingFace LLM4Code/SATBench
- Size: ~7 MB (Parquet)
- Format: Parquet with scenario, conditions, question, variable mapping, CNF clauses, and ground truth labels
- Production path: `/orwd_data/satbench.parquet`
## Tools
| Tool | Description |
|---|---|
| submit | Submit SAT/UNSAT determination with optional variable assignment |
## Time Horizon
Single-turn. The agent receives the puzzle prompt and makes one submit call.
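A submit call might carry payloads like the following. The argument names are assumptions for illustration; consult the environment's actual tool schema.

```python
# Hypothetical submit payloads (field names are assumptions, not the real schema).
sat_submission = {
    "answer": "SAT",
    "assignment": {1: True, 2: False},  # required only for SAT predictions
}
unsat_submission = {
    "answer": "UNSAT",  # no assignment needed for an UNSAT claim
}
```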
## Environment Difficulty
- Easy (4–15 clauses): Most models achieve >80% accuracy
- Medium (20–30 clauses): Performance degrades significantly across models
- Hard (31–50 clauses): Best model (o4-mini) achieves only 65% on hard UNSAT
## Other Environment Requirements
No other requirements. No external API keys needed for grading.
## Safety
SAT puzzles are abstract logical reasoning tasks with no safety concerns. Scenarios are fictional and benign (e.g., superheroes choosing missions, gardeners planting vegetables).
## Citations
```bibtex
@inproceedings{wei-etal-2025-satbench,
    title = "{SATB}ench: Benchmarking {LLM}s' Logical Reasoning via Automated Puzzle Generation from {SAT} Formulas",
    author = "Wei, Anjiang and
      Wu, Yuheng and
      Wan, Yingjia and
      Suresh, Tarun and
      Tan, Huanmi and
      Zhou, Zhanke and
      Koyejo, Sanmi and
      Wang, Ke and
      Aiken, Alex",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1716/",
    doi = "10.18653/v1/2025.emnlp-main.1716",
    pages = "33832--33849",
}
```