AnthropicPerformance

OpenReward Environment

Description

AnthropicPerformance is an environment based on Anthropic's original performance engineering takehome challenge. Agents optimize a VLIW SIMD kernel implementation to minimize clock cycles for a tree traversal workload. The challenge involves instruction scheduling, vectorization, and low-level optimization in a simulated machine architecture.

Capabilities

  • VLIW SIMD instruction optimization
  • Kernel code generation and scheduling
  • Performance profiling and iterative improvement
  • Low-level systems programming

Compute Requirements

Agents are given a sandbox with 2 CPU cores and 2GB RAM.

License

MIT

Tasks

There is one split in this environment:

  • train: 1 task (perf-opt-v1)

The task starts with a baseline implementation at 18,532 cycles and challenges agents to optimize toward Claude model benchmarks.

Reward Structure

This is a dense, verifiable reward environment. Rewards use linear interpolation between Claude model baselines:

  • Baseline: 18,532 cycles (starting point)
  • 0.0: Matched worst Claude baseline (2,164 cycles)
  • 1.0: Matched best Claude baseline (1,363 cycles)
  • >1.0: Superhuman performance

Reward is recomputed from the cycle count of each submission:

reward = (current_cycles - 2164) / (1363 - 2164)
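The interpolation above can be sketched as a small function (the baseline constants are from the table below; the function name and signature are illustrative, not part of the environment's API):

```python
def reward(current_cycles, worst=2164, best=1363):
    """Linear interpolation between the worst and best Claude baselines.

    Returns 0.0 at the worst baseline (2,164 cycles), 1.0 at the best
    (1,363 cycles); values above 1.0 indicate superhuman performance.
    """
    return (current_cycles - worst) / (best - worst)
```

Because the denominator is negative, fewer cycles yields a higher reward; the starting implementation at 18,532 cycles scores well below 0.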

Data

Source files are provided in the sandbox workspace:

  • problem.py: Machine simulator (VLIW SIMD architecture)
  • perf_takehome.py: Kernel builder (optimization target)

The test harness runs server-side and is never exposed to agents.

Tools

CLI Tools:

Tool    Description
bash    Execute bash commands
read    Read file contents
write   Write files
edit    Edit files
glob    Find files by pattern
grep    Search file contents
ls      List directory contents

Environment Tools:

Tool                Description
submit_solution     Test current implementation (50 submission limit)
finish_challenge    Complete challenge with current best

Time Horizon

Multi-turn. Agents analyze code, make optimizations, test with submit_solution, and iterate until satisfied or reaching the 50 submission limit.
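The iterate-until-done loop can be sketched as follows. This is a hypothetical illustration: submit_solution and finish_challenge are environment tools, modeled here as plain callables, and make_candidate stands in for whatever editing the agent performs between submissions.

```python
MAX_SUBMISSIONS = 50  # hard limit enforced by the environment

def optimize(submit_solution, finish_challenge, make_candidate):
    """Sketch of an agent's optimization loop against the environment tools."""
    best_cycles = None
    for attempt in range(MAX_SUBMISSIONS):
        make_candidate(attempt)        # edit the kernel in the workspace
        cycles = submit_solution()     # test the current implementation
        if best_cycles is None or cycles < best_cycles:
            best_cycles = cycles       # track the best result seen so far
    return finish_challenge()          # complete with the current best
```

In practice an agent would stop early once satisfied rather than exhausting all 50 submissions.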

Environment Difficulty

Benchmarks from Claude models:

Model                         Cycles    Notes
Claude Opus 4 (extended)      2,164
Claude Opus 4.5 (casual)      1,790     ~human level
Claude Opus 4.5 (2hr)         1,579
Claude Sonnet 4.5             1,548
Claude Opus 4.5 (11.5hr)      1,487     Hiring threshold
Claude Opus 4.5 (improved)    1,363     Current best

Other Environment Requirements

There are no further environment requirements; AnthropicPerformance works out of the box with the OpenReward endpoint.

Safety

Agents in AnthropicPerformance optimize code in an isolated sandbox environment. The environment does not present direct safety risks.

Citation

@misc{anthropic2024perftakehome,
  title={Anthropic Original Performance TakeHome},
  author={Anthropic},
  year={2024},
  url={https://github.com/anthropics/original_performance_takehome}
}