crustbench
CRUST-Bench
Description
CRUST-Bench is an environment for evaluating C-to-safe-Rust transpilation. Each task provides a C repository and a manually-written safe Rust interface, and the agent must produce a complete Rust implementation that conforms to the interface, compiles without errors, and passes all test cases. Tasks span diverse software categories including compilers, algorithmic libraries, system utilities, and data structures.
This OpenReward implementation is ported from the Harbor Framework implementation originally written by Slimshilin.
Capabilities
- Translating C code to idiomatic, safe Rust
- Understanding and implementing Rust interfaces with ownership and borrowing
- Working with Rust's type system, traits, and lifetime annotations
- Debugging compilation errors and test failures
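As a rough illustration of the kind of translation these capabilities cover, the sketch below pairs a small C function with a safe Rust counterpart. The function `max_of` is hypothetical and is not taken from any benchmark task; it only shows how a pointer-and-length pair, an error code, and an out-parameter might map onto a slice and an `Option`.

```rust
// Hypothetical C original (illustrative only, not from the benchmark):
//
//   int max_of(const int *xs, size_t n, int *out) {
//       if (n == 0) return -1;            /* error code for empty input */
//       int best = xs[0];
//       for (size_t i = 1; i < n; i++)
//           if (xs[i] > best) best = xs[i];
//       *out = best;
//       return 0;
//   }

/// Safe Rust counterpart: the pointer/length pair becomes a borrowed slice,
/// and the error code plus out-parameter collapse into an `Option`.
pub fn max_of(xs: &[i32]) -> Option<i32> {
    // `max` returns `None` for an empty iterator, which covers the C error path.
    xs.iter().copied().max()
}
```

Calling `max_of(&[3, 9, 2])` yields `Some(9)`, and an empty slice yields `None`, with no raw pointers or `unsafe` blocks involved.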
Compute Requirements
Agents are given a sandboxed environment with Rust and C toolchains, bash access, and file editing tools. The default sandbox has 1 CPU and 2 GB of RAM.
License
Tasks
There is one split in this environment:
- Test: 100 C-to-Rust transpilation tasks
Tasks are sourced from open-source C repositories on GitHub, covering programming language infrastructure, algorithmic libraries, system utilities, and more. Each task is paired with expert-written Rust interfaces and test suites.
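To make the idea of a Rust interface concrete, here is a minimal sketch of what a filled-in interface might look like for a C stack built on `malloc` and `free`. The type `IntStack` and its methods are hypothetical names chosen for illustration; the sketch assumes the interface fixes the signatures and the agent supplies the bodies.

```rust
/// Hypothetical interface type: an owned stack of integers standing in for
/// a manually managed C linked list.
pub struct IntStack {
    items: Vec<i32>,
}

impl IntStack {
    /// Creates an empty stack; memory is released automatically on drop.
    pub fn new() -> Self {
        IntStack { items: Vec::new() }
    }

    /// Pushes a value; `&mut self` encodes exclusive access during mutation.
    pub fn push(&mut self, value: i32) {
        self.items.push(value);
    }

    /// Pops the most recent value, or returns `None` if the stack is empty.
    pub fn pop(&mut self) -> Option<i32> {
        self.items.pop()
    }

    /// Number of stored values; `&self` allows shared, read-only access.
    pub fn len(&self) -> usize {
        self.items.len()
    }

    /// Reports whether the stack holds no values.
    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }
}
```

A test suite for such an interface would typically push a few values, pop them, and assert on the order and the remaining length.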
Reward Structure
This is a multi-turn, sandbox-based environment. The agent reads C source code, implements the corresponding Rust code conforming to the provided interface, and calls submit_answer for verification. The verifier runs cargo build and cargo test.
- 1.0: Code compiles and all tests pass.
- 0.0: Compilation fails or any test fails.
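The binary reward rule above can be sketched as a small function. The `Verification` struct and its field names are assumptions made for this example and do not describe the verifier's actual data model; treating a run with zero tests as a pass is likewise an assumption.

```rust
/// Outcome of one verification run (hypothetical type for illustration).
pub struct Verification {
    /// Whether `cargo build` succeeded.
    pub build_ok: bool,
    /// Number of tests that passed under `cargo test`.
    pub tests_passed: usize,
    /// Total number of tests that were run.
    pub tests_total: usize,
}

/// Returns 1.0 only when the build succeeds and every test passes;
/// any compilation failure or failing test yields 0.0.
pub fn reward(v: &Verification) -> f64 {
    if v.build_ok && v.tests_passed == v.tests_total {
        1.0
    } else {
        0.0
    }
}
```

There is no partial credit in this scheme: passing 4 of 5 tests scores the same as failing to compile.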
Data
Each task directory contains an instruction.md with the transpilation task, C source files, Rust interface definitions, and test suites. Task data is stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox (e.g., cargo build, cargo test). |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated compilation and test verification. |
Time Horizon
CRUST-Bench is a multi-turn environment. Agents read C source code, study the Rust interface, implement the transpilation, compile and test iteratively, and submit for verification.
Environment Difficulty
CRUST-Bench is challenging. The original paper reports the following success rates for frontier models, both one-shot and after three rounds of repair:
| Model | One-Shot | With Repair (3 rounds) |
|---|---|---|
| o3 | 19% | 48% |
| Claude Opus 4 | 22% | 40% |
| o1 | 15% | 37% |
| Claude 3.7 Sonnet | 13% | 32% |
Even with iterative repair, the best model (o3) reaches only 48% success, indicating that generating safe and idiomatic Rust remains difficult.
Other Environment Requirements
There are no further environment requirements; CRUST-Bench works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in CRUST-Bench translate open-source C code to Rust in a sandboxed environment. The environment does not present direct safety risks.
Citations
@article{khatry2025crustbench,
author = {Anirudh Khatry and Robert Zhang and Jia Pan and Ziteng Wang and Qiaochu Chen and Greg Durrett and Isil Dillig},
title = {CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation},
journal = {arXiv preprint arXiv:2504.15254},
year = {2025},
url = {https://arxiv.org/abs/2504.15254}
}