FrontierCS

API Endpoint
Leaderboard
Loading leaderboard...
Implementation of
README

FrontierCS

OpenReward Environment

Description

FrontierCS is an environment for evaluating AI agents on 240 open-ended, unsolved computer science problems from the FrontierCS benchmark. It consists of two tracks:

  • Algorithmic (172 problems — 98 standard + 74 interactive): Competitive programming problems requiring C++17 solutions, scored by testlib checkers/interactors with partial credit.
  • Research (66 problems): Python optimization problems spanning GPU kernels (Triton/CUDA), scheduling, ML, symbolic regression, and more.

Agents receive interactive sandbox environments to write, compile, test, and submit solutions.

Capabilities

  • Interactive sandbox with bash, file read/write, and directory listing
  • Server-side C++ compilation and testlib checker evaluation (algorithmic track)
  • In-sandbox Python evaluator execution (research track)
  • Partial scoring (0-100, normalized to 0.0-1.0 reward)

Compute Requirements

  • Algorithmic track: 1 CPU, 2GB RAM sandbox for agent; server needs g++ for evaluation
  • Research track: Varies per problem — CPU-only for scheduling/optimization, GPU (CUDA) for kernel problems
  • Network access enabled for both tracks

License

MIT License

Tasks

  • Algorithmic: 172 competitive programming problems (98 standard + 74 interactive). Single split: test.
  • Research: 66 problems across GPU kernels, scheduling, ML optimization, security, and more. Single split: test.

Each task includes a problem statement, constraints, and evaluation criteria.

Reward Structure

  • Sparse, verifiable rewards on a 0.0-1.0 scale
  • Algorithmic: reward = (mean score ratio across test cases) where each test case scored 0-1 by testlib checker
  • Research: reward = evaluator_score / 100 where evaluator scores performance relative to baselines

Data

  • Source: FrontierCS GitHub + HuggingFace dataset
  • Algorithmic: 172 problem directories with test data, checkers, and config
  • Research: 25 problem groups with evaluators, resources, and baselines

Tools

Algorithmic (FrontierCSAlgorithmic)

  • bash — Execute shell commands (compile, test, explore)
  • read_file — Read file contents (50KB limit)
  • write_file — Write file contents
  • list_files — List directory contents
  • submit — Submit C++ code for server-side evaluation (terminal)
  • todo_write — Task planning and progress tracking

Research (FrontierCSResearch)

  • bash — Execute shell commands
  • read_file — Read file contents (50KB limit)
  • write_file — Write file contents
  • list_files — List directory contents
  • submit — Run evaluator on solution.py (terminal)
  • todo_write — Task planning and progress tracking

Time Horizon

Multi-turn. Agents typically need 10-50+ tool calls to understand the problem, develop a solution, test it, and submit.

Environment Difficulty

These are frontier/unsolved problems — high difficulty by design. Most problems have no known perfect solution. Partial credit is awarded.

Safety

  • Algorithmic: Server-side C++ execution is sandboxed with resource limits. Agents cannot access test data or checker source code.
  • Research: Evaluator runs inside the agent's sandbox. The agent could theoretically modify the evaluator, but this is acceptable for benchmark use.

Citations

@article{mang2025frontiercs,
      title={FrontierCS: Evolving Challenges for Evolving Intelligence},
      author={Qiuyang Mang and Wenhao Chai and Zhifei Li and Huanzhi Mao and Shang Zhou and Alexander Du and Hanchen Li and Shu Liu and Edwin Chen and Yichuan Wang and Xieting Chu and Zerui Cheng and Yuan Xu and Tian Xia and Zirui Wang and Tianneng Shi and Jianzhu Yao and Yilong Zhao and Qizheng Zhang and Charlie Ruan and Zeyu Shen and Kaiyuan Liu and Runyuan He and Dong Xing and Zerui Li and Zirong Zeng and Yige Jiang and Lufeng Cheng and Ziyi Zhao and Youran Sun and Wesley Zheng and Meiyuwang Zhang and Ruyi Ji and Xuechang Tu and Zihan Zheng and Zexing Chen and Kangyang Zhou and Zhaozi Wang and Jingbang Chen and Aleksandra Korolova and Peter Henderson and Pramod Viswanath and Vijay Ganesh and Saining Xie and Zhuang Liu and Dawn Song and Sewon Min and Ion Stoica and Joseph E. Gonzalez and Jingbo Shang and Alvin Cheung},
      year={2025},
      eprint={2512.15699},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
}
GeneralReasoning/FrontierCS | OpenReward