Aider-Polyglot
Description
Aider-Polyglot is an environment for evaluating code generation and editing across multiple programming languages. It is based on the Exercism polyglot benchmark used in Aider evaluations: agents implement solutions to programming exercises in Python, Go, Rust, JavaScript, Java, and C++, and each task is evaluated by running the exercise's test suite.
Capabilities
- Multi-language code generation (Python, Go, Rust, JavaScript, Java, C++)
- Code editing with str_replace and insert tools
- Test-driven development evaluation
- File system operations via bash and view tools
Compute Requirements
Agents are given a sandbox with 4 CPU cores and 8GB RAM, with language-specific toolchains pre-installed.
Tasks
There are seven splits in this environment:
- all: All tasks across all languages
- python: Python exercises
- go: Go exercises
- rust: Rust exercises
- javascript: JavaScript exercises
- java: Java exercises
- cpp: C++ exercises
Each task provides instructions from Exercism and stub files to implement.
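As a rough illustration of how the splits relate to one another, the sketch below filters a task list by split name. The `Task` record, its field names, and `select_split` are hypothetical and are not part of the environment's API; only the split names come from the list above.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Per-language split names as listed above; "all" covers every language.
LANGUAGE_SPLITS = ("python", "go", "rust", "javascript", "java", "cpp")


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; the real task schema may differ.
    language: str
    exercise: str


def select_split(tasks: Iterable[Task], split: str) -> List[Task]:
    """Return the tasks that belong to the named split."""
    tasks = list(tasks)
    if split == "all":
        return tasks
    if split not in LANGUAGE_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return [task for task in tasks if task.language == split]
```

Under these assumptions, `select_split(tasks, "all")` returns every task, while `select_split(tasks, "rust")` returns only the Rust exercises.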
Reward Structure
This is a sparse, verifiable reward environment. The agent calls the answer tool to run the exercise's test suite, and the reward depends on the result:
- 1.0: All tests pass
- 0.0: One or more tests fail
Agents get up to 2 attempts per task. Tests run with a 3-minute timeout.
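The scoring rule can be sketched as follows. This is a minimal illustration, not the environment's implementation: the function names are hypothetical, an empty test run is assumed to score 0.0, and the task is assumed to be solved if any attempt passes. The 3-minute timeout is not modeled.

```python
from typing import Callable, Sequence

MAX_ATTEMPTS = 2  # from the text: up to 2 attempts per task


def reward(test_results: Sequence[bool]) -> float:
    """Return 1.0 only if every test passed, otherwise 0.0.

    Assumption: a run that produced no test results counts as a failure.
    """
    if len(test_results) == 0:
        return 0.0
    return 1.0 if all(test_results) else 0.0


def run_task(answer: Callable[[], Sequence[bool]],
             max_attempts: int = MAX_ATTEMPTS) -> float:
    """Call `answer` up to `max_attempts` times, stopping at the first full pass.

    `answer` is a hypothetical stand-in for the answer tool and returns
    one pass/fail flag per test.
    """
    for _ in range(max_attempts):
        if reward(answer()) == 1.0:
            return 1.0
    return 0.0
```

Because the reward is binary, a solution that passes all but one test earns the same 0.0 as one that passes none, which is what makes the signal sparse.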
Data
Tasks are sourced from the Exercism polyglot benchmark. Exercise files include instructions, stub code, and test files. Task data is stored in the repository.
Tools
| Tool | Description |
|---|---|
| answer | Run test suite and submit solution |
| bash | Execute bash commands in sandbox |
| view | View file contents with optional line range |
| str_replace | Replace text in files |
| insert | Insert content at a line number |
| create | Create new files |
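To make the two editing tools concrete, here is a sketch of the semantics they plausibly have, written as pure functions over file contents. The exact behavior is not specified here, so the details are assumptions: `str_replace` is assumed to require exactly one match, and `insert` is assumed to place content after a 1-indexed line number (with 0 meaning the top of the file).

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace the single occurrence of `old` in `text` with `new`.

    Assumption: a missing or ambiguous match is an error, not a silent no-op.
    """
    if not old:
        raise ValueError("`old` must be a non-empty string")
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return text.replace(old, new, 1)


def insert(text: str, line_number: int, content: str) -> str:
    """Insert `content` after the 1-indexed `line_number` (0 = top of file)."""
    lines = text.splitlines(keepends=True)
    if not 0 <= line_number <= len(lines):
        raise ValueError(f"line {line_number} is out of range")
    if not content.endswith("\n"):
        content += "\n"
    # Make sure the preceding line is terminated before appending after it.
    if line_number > 0 and not lines[line_number - 1].endswith("\n"):
        lines[line_number - 1] += "\n"
    lines.insert(line_number, content)
    return "".join(lines)
```

Requiring a unique match is a common design choice for this kind of tool because it forces the agent to quote enough context to identify the edit site unambiguously.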
Time Horizon
Multi-turn. Agents iterate on solutions using editing tools before calling answer to run tests.
Environment Difficulty
Exercism exercises range from beginner to advanced difficulty. Selected results from the Aider polyglot leaderboard:
| Model | Polyglot Score | Correct Edit Format |
|---|---|---|
| GPT-5 (high) | 88.0% | 91.6% |
| GPT-5 (medium) | 86.7% | 88.4% |
| o3-pro (high) | 84.9% | 97.8% |
| Gemini 2.5 Pro Preview (32k think) | 83.1% | 99.6% |
| GPT-5 (low) | 81.3% | 86.7% |
Other Environment Requirements
There are no further environment requirements; Aider-Polyglot works out of the box with the OpenReward endpoint.
Safety
Agents in Aider-Polyglot write and execute code in an isolated sandbox environment. The environment does not present direct safety risks.
Citation
@misc{gauthier2024aider,
title={Aider Polyglot Benchmark},
author={Gauthier, Paul},
year={2024},
url={https://aider.chat/docs/leaderboards/}
}