code-contests
Code Contests
Description
Code Contests is an environment for evaluating AI agents on competitive programming problems. Built on DeepMind's CodeContests dataset, it requires agents to read problem statements, write Python solutions, and pass all test cases. Problems are sourced from platforms including Codeforces, AtCoder, CodeChef, and HackerEarth.
Capabilities
- Algorithmic problem-solving
- Data structure implementation
- Mathematical reasoning and optimization
- Input/output parsing and formatting
- Code debugging and refinement
- Competitive programming techniques
Compute Requirements
Agents are given a sandbox with 1 CPU and 2 GB of RAM. Each task runs in an isolated Docker container. The agent timeout is 900 seconds (15 minutes) and the verifier timeout is 720 seconds (12 minutes).
License
Tasks
There is one split in this environment:
- train: 9,644 competitive programming problems
Problems range from simple bracket matching to complex algorithmic challenges involving graph theory, dynamic programming, number theory, and combinatorics.
Reward Structure
This is a sparse, verifiable reward environment. Rewards are computed when the agent submits its answer:
- 1.0: All test cases pass
- 0.0: Any test case fails
No LLM grader is used. Solutions are validated against multiple input/output test cases using pytest. No partial credit is given.
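To make the verification flow concrete, the following is a minimal sketch of how such a pytest harness could check a solution. The actual tests/test_state.py shipped with each task may differ; the test_data.json field names ("inputs"/"outputs") and the per-case timeout are assumptions.

```python
# Illustrative sketch only; field names and per-case timeout are assumptions.
import json
import subprocess
from pathlib import Path

def test_solution_passes_all_cases():
    data = json.loads(Path("tests/test_data.json").read_text())
    for stdin_text, expected in zip(data["inputs"], data["outputs"]):
        result = subprocess.run(
            ["python3", "/app/solution.py"],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=60,  # hypothetical per-case limit
        )
        # All-or-nothing reward: any mismatched case fails the whole test.
        assert result.stdout.strip() == expected.strip()
```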
Data
Each task contains:
- instruction.md: Problem statement with constraints and examples
- task.toml: Metadata (difficulty, tags, timeouts)
- tests/test_data.json: Input/output test cases
- tests/test_state.py: Pytest-based test harness
- tests/test.sh: Test execution script
Agents must write their solution to /app/solution.py, which reads from stdin and writes to stdout.
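As an illustration of this contract (not a problem drawn from the dataset), a minimal /app/solution.py for a simple bracket-matching task might look like:

```python
# Hypothetical example showing the stdin/stdout contract: read a bracket
# sequence from stdin and print whether it is balanced.
import sys

def main():
    s = sys.stdin.readline().strip()
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    print("YES" if depth == 0 else "NO")

if __name__ == "__main__":
    main()
```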
Tools
Agents have access to 5 tools:
- bash: Execute bash commands in the container
- view: View file contents or directory listings
- str_replace: Replace strings in files
- create_file: Create new files with specified content
- submit_answer: Submit solution and run test cases
Time Horizon
Code Contests is a multi-turn environment where agents read the problem, develop a solution iteratively, test with example cases, and submit for final evaluation.
[Statistics on average tool calls here]
Environment Difficulty
The original AlphaCode paper reports that, with up to a million samples per problem, AlphaCode solved 34.2% of problems on its validation set. In Codeforces competitions, AlphaCode achieved an average ranking in the top 54.3% of participants.
Safety
Code Contests tasks are run in isolated Docker containers. The environment focuses on algorithmic problem-solving and does not involve external network access or system-level operations.
Citations
This environment uses DeepMind's CodeContests dataset. If you use this environment, please cite the original paper:
@article{Li_2022,
  title={Competition-level code generation with AlphaCode},
  author={Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and Schrittwieser, Julian and Leblond, Rémi and Eccles, Tom and Keeling, James and Gimeno, Felix and Dal Lago, Agustin and Hubert, Thomas and Choy, Peter and de Masson d’Autume, Cyprien and Babuschkin, Igor and Chen, Xinyun and Huang, Po-Sen and Welbl, Johannes and Gowal, Sven and Cherepanov, Alexey and Molloy, James and Mankowitz, Daniel J. and Sutherland Robson, Esme and Kohli, Pushmeet and de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol},
  journal={Science},
  volume={378},
  number={6624},
  pages={1092–1097},
  year={2022},
  month=dec,
  publisher={American Association for the Advancement of Science (AAAS)},
  ISSN={1095-9203},
  DOI={10.1126/science.abq1158},
  url={http://dx.doi.org/10.1126/science.abq1158}
}