# SATBench
## Description
SATBench is an environment for evaluating LLM logical reasoning through natural language puzzles derived from Boolean satisfiability (SAT) formulas. Agents must determine whether a set of conditions can be simultaneously satisfied and, if so, provide a valid variable assignment.
## Capabilities
- Search-based logical reasoning (backtracking, constraint satisfaction)
- Formal verification of satisfying assignments against CNF clauses
- Natural language understanding of logical constraints
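To make "search-based logical reasoning" concrete, here is a minimal sketch of a backtracking SAT solver over CNF clauses. This is illustrative only, not the environment's generator or grader; the clause representation (integer literals, where `k` means variable `k` is true and `-k` means it is false) is an assumption for the example.

```python
def solve(clauses, assignment=None):
    """Return a satisfying {var: bool} dict, or None if UNSAT."""
    if assignment is None:
        assignment = {}
    # Simplify clauses under the current partial assignment.
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            continue  # clause already satisfied
        rest = [lit for lit in clause if abs(lit) not in assignment]
        if not rest:
            return None  # clause falsified under this branch -> backtrack
        simplified.append(rest)
    if not simplified:
        return assignment  # every clause satisfied
    var = abs(simplified[0][0])  # branch on the first unassigned variable
    for value in (True, False):
        result = solve(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None
```

For example, `solve([[1], [-1]])` returns `None` (UNSAT), while `solve([[1, 2], [-1], [2]])` returns a satisfying assignment.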
## Compute Requirements
No special compute requirements. Lightweight text-based grading with no external API calls.
## License
## Tasks
- Single split: `train` with 2,100 tasks
- SAT/UNSAT balance: 1,050 satisfiable + 1,050 unsatisfiable (50/50)
- Difficulty: Easy (4–15 clauses), Medium (20–30 clauses), Hard (31–50 clauses)
- Variable count: 5–90 variables (mean 36), with 1D/2D/3D structure
Each task presents a natural language scenario with numbered conditions and asks whether all conditions can be satisfied simultaneously.
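For illustration, a miniature task of this shape might look like the following. The scenario, field names, and values are all invented for the example; they are not the dataset's actual schema.

```python
# A hypothetical miniature task (invented; not drawn from the dataset).
# Variables: 1 = "Alice plants tomatoes", 2 = "Bob plants carrots".
task = {
    "scenario": "Two gardeners decide what to plant.",
    "conditions": [
        "1. Alice plants tomatoes or Bob plants carrots.",
        "2. If Alice plants tomatoes, Bob does not plant carrots.",
    ],
    "question": "Can all conditions be satisfied simultaneously?",
    "variable_mapping": {1: "Alice plants tomatoes", 2: "Bob plants carrots"},
    # CNF in integer-literal form: k = variable k true, -k = variable k false.
    "cnf_clauses": [[1, 2], [-1, -2]],
    "label": "SAT",
}

# One satisfying assignment: Alice plants tomatoes, Bob does not plant carrots.
assignment = {1: True, 2: False}
satisfied = all(
    any(assignment[abs(lit)] == (lit > 0) for lit in clause)
    for clause in task["cnf_clauses"]
)
```

Here condition 2 becomes the clause `[-1, -2]` (an implication rewritten as a disjunction), and the candidate assignment satisfies both clauses.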
## Reward Structure
Sparse, fully verifiable rewards (no LLM judge):
| Prediction | Ground Truth | Condition | Reward |
|---|---|---|---|
| UNSAT | UNSAT | — | 1.0 |
| UNSAT | SAT | — | 0.0 |
| SAT | UNSAT | — | 0.0 |
| SAT | SAT | Valid assignment provided | 1.0 |
| SAT | SAT | Invalid/missing assignment | 0.0 |
For SAT predictions, the agent must provide a satisfying variable assignment that is verified clause-by-clause against the underlying CNF formula. This prevents guessing — the agent must actually solve the problem.
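The reward table above can be sketched as a grading function. This is a minimal re-implementation of the described logic under the same integer-literal CNF convention, not the environment's actual grader.

```python
def grade(prediction, ground_truth, assignment, cnf_clauses):
    """Return 1.0 or 0.0 per the reward table.

    prediction / ground_truth: "SAT" or "UNSAT".
    assignment: {variable_index: bool} or None.
    cnf_clauses: list of clauses; each clause is a list of integer
    literals, where k means variable k is true and -k means it is false.
    """
    if prediction == "UNSAT":
        return 1.0 if ground_truth == "UNSAT" else 0.0
    # SAT prediction: a wrong label or a missing assignment scores 0.
    if ground_truth != "SAT" or not assignment:
        return 0.0
    # Verify the assignment clause-by-clause against the CNF formula.
    for clause in cnf_clauses:
        if not any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            return 0.0  # this clause is unsatisfied -> invalid assignment
    return 1.0
```

For example, `grade("SAT", "SAT", {1: True}, [[1]])` yields 1.0, while the same prediction with `{1: False}` yields 0.0 because the single clause goes unsatisfied.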
## Data
- Source: HuggingFace LLM4Code/SATBench
- Size: ~7 MB (Parquet)
- Format: Parquet with scenario, conditions, question, variable mapping, CNF clauses, and ground truth labels
- Production path: `/orwd_data/satbench.parquet`
## Tools
| Tool | Description |
|---|---|
| submit | Submit SAT/UNSAT determination with optional variable assignment |
## Time Horizon
Single-turn. The agent receives the puzzle prompt and makes one submit call.
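A submit call might carry payloads like the following. The argument names are assumptions for illustration; consult the environment's actual tool schema.

```python
# Hypothetical submit payloads (field names are assumptions, not the real schema).
sat_submission = {
    "answer": "SAT",
    "assignment": {1: True, 2: False},  # required only for SAT predictions
}
unsat_submission = {
    "answer": "UNSAT",  # no assignment needed for an UNSAT claim
}
```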
## Environment Difficulty
- Easy (4–15 clauses): Most models achieve >80% accuracy
- Medium (20–30 clauses): Performance degrades significantly across models
- Hard (31–50 clauses): Best model (o4-mini) achieves only 65% on hard UNSAT
## Other Environment Requirements
No other requirements. No external API keys needed for grading.
## Safety
SAT puzzles are abstract logical reasoning tasks with no safety concerns. Scenarios are fictional and benign (e.g., superheroes choosing missions, gardeners planting vegetables).
## Citations
```bibtex
@inproceedings{wei-etal-2025-satbench,
    title = "{SATB}ench: Benchmarking {LLM}s' Logical Reasoning via Automated Puzzle Generation from {SAT} Formulas",
    author = "Wei, Anjiang and
      Wu, Yuheng and
      Wan, Yingjia and
      Suresh, Tarun and
      Tan, Huanmi and
      Zhou, Zhanke and
      Koyejo, Sanmi and
      Wang, Ke and
      Aiken, Alex",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1716/",
    doi = "10.18653/v1/2025.emnlp-main.1716",
    pages = "33832--33849",
}
```