AnthropicPerformance
Description
AnthropicPerformance is an environment based on Anthropic's original performance engineering takehome challenge. Agents optimize a VLIW SIMD kernel implementation to minimize clock cycles for a tree traversal workload. The challenge involves instruction scheduling, vectorization, and low-level optimization in a simulated machine architecture.
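The core idea behind VLIW scheduling can be illustrated with a toy example (this is not the actual machine model in problem.py, just a sketch of why instruction packing reduces cycles): independent operations packed into the same bundle issue in a single cycle, so fewer bundles means fewer cycles.

```python
# Toy illustration of VLIW bundling (hypothetical, not the problem.py model):
# four independent operations, issued one per cycle vs. two per bundle.
serial = [("add", "r0"), ("mul", "r1"), ("add", "r2"), ("mul", "r3")]

# One op per cycle: 4 cycles.
serial_cycles = len(serial)

# Packing two independent ops per bundle halves the cycle count: 2 cycles.
bundles = [serial[i:i + 2] for i in range(0, len(serial), 2)]
packed_cycles = len(bundles)

print(serial_cycles, packed_cycles)  # 4 2
```

The real challenge is finding such packings (plus vectorization) for a full tree-traversal kernel under the simulator's actual issue constraints.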
Capabilities
- VLIW SIMD instruction optimization
- Kernel code generation and scheduling
- Performance profiling and iterative improvement
- Low-level systems programming
Compute Requirements
Agents are given a sandbox with 2 CPU cores and 2GB RAM.
License
Tasks
There is one split in this environment:
- train: 1 task (perf-opt-v1)
The task starts from a baseline implementation at 18,532 cycles; agents optimize it toward the Claude model benchmarks listed under Environment Difficulty.
Reward Structure
This is a dense, verifiable reward environment. Rewards use linear interpolation between Claude model baselines:
- Baseline: 18,532 cycles (starting point)
- 0.0: Matched worst Claude baseline (2,164 cycles)
- 1.0: Matched best Claude baseline (1,363 cycles)
- >1.0: Superhuman performance
The reward for each submission is computed from its cycle count by linear interpolation between the two baselines:
reward = (current_cycles - 2164) / (1363 - 2164)
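The interpolation above can be sketched as a small helper (hypothetical client-side code; the real harness runs server-side):

```python
def reward(current_cycles: int,
           worst_baseline: int = 2164,
           best_baseline: int = 1363) -> float:
    """Linear interpolation between the Claude baselines.

    Returns 0.0 at the worst baseline (2,164 cycles), 1.0 at the best
    (1,363 cycles), and values above 1.0 for superhuman results.
    """
    return (current_cycles - worst_baseline) / (best_baseline - worst_baseline)

print(reward(2164))  # 0.0
print(reward(1363))  # 1.0
```

Note that the denominator is negative, so lower cycle counts yield higher rewards; the unoptimized 18,532-cycle baseline maps to a large negative value.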
Data
Source files are provided in the sandbox workspace:
- problem.py: Machine simulator (VLIW SIMD architecture)
- perf_takehome.py: Kernel builder (optimization target)
Test harness runs server-side and is never exposed to agents.
Tools
CLI Tools:
| Tool | Description |
|---|---|
| bash | Execute bash commands |
| read | Read file contents |
| write | Write files |
| edit | Edit files |
| glob | Find files by pattern |
| grep | Search file contents |
| ls | List directory contents |
Environment Tools:
| Tool | Description |
|---|---|
| submit_solution | Test current implementation (50 submission limit) |
| finish_challenge | Complete challenge with current best |
Time Horizon
Multi-turn. Agents analyze code, make optimizations, test with submit_solution, and iterate until satisfied or reaching the 50 submission limit.
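The iteration loop sketched above might look like the following (a hypothetical client-side wrapper; edit_kernel and submit_solution are stubs standing in for the real file edits and server-side harness):

```python
import random

def edit_kernel():
    """Stub: in the real environment the agent edits perf_takehome.py."""
    pass

def submit_solution() -> int:
    """Stub: the real test harness runs server-side and returns cycles."""
    return random.randint(1363, 18532)

SUBMISSION_LIMIT = 50  # per the environment tools table

best = None
for _ in range(SUBMISSION_LIMIT):
    edit_kernel()
    cycles = submit_solution()
    if best is None or cycles < best:
        best = cycles
# finish_challenge() would then lock in the best result
```

In practice an agent would stop early once it is satisfied with its cycle count rather than spend the full submission budget.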
Environment Difficulty
Benchmarks from Claude models:
| Model | Cycles | Notes |
|---|---|---|
| Claude Opus 4 (extended) | 2,164 | |
| Claude Opus 4.5 (casual) | 1,790 | ~human level |
| Claude Opus 4.5 (2hr) | 1,579 | |
| Claude Sonnet 4.5 | 1,548 | |
| Claude Opus 4.5 (11.5hr) | 1,487 | Hiring threshold |
| Claude Opus 4.5 (improved) | 1,363 | Current best |
Other Environment Requirements
There are no further environment requirements; AnthropicPerformance works out of the box with the OpenReward endpoint.
Safety
Agents in AnthropicPerformance optimize code in an isolated sandbox environment. The environment does not present direct safety risks.
Citation
@misc{anthropic2024perftakehome,
  title={Anthropic Original Performance TakeHome},
  author={Anthropic},
  year={2024},
  url={https://github.com/anthropics/original_performance_takehome}
}