FrontierCS

Description

FrontierCS is an environment for evaluating AI agents on roughly 240 open-ended, unsolved computer science problems from the FrontierCS benchmark. It consists of two tracks:

Algorithmic (172 problems — 98 standard + 74 interactive): Competitive programming problems requiring C++17 solutions, scored by testlib checkers/interactors with partial credit.
Research (66 problems): Python optimization problems spanning GPU kernels (Triton/CUDA), scheduling, ML, symbolic regression, and more.

Agents receive interactive sandbox environments to write, compile, test, and submit solutions.

Capabilities

Interactive sandbox with bash, file read/write, and directory listing
Server-side C++ compilation and testlib checker evaluation (algorithmic track)
In-sandbox Python evaluator execution (research track)
Partial scoring (0-100, normalized to 0.0-1.0 reward)

Compute Requirements

Algorithmic track: 1 CPU and 2 GB RAM sandbox for the agent; the environment server needs g++ for evaluation
Research track: Varies per problem — CPU-only for scheduling/optimization, GPU (CUDA) for kernel problems
Network access enabled for both tracks

License

MIT License

Tasks

Algorithmic: 172 competitive programming problems (98 standard + 74 interactive). Single split: test.
Research: 66 problems across GPU kernels, scheduling, ML optimization, security, and more. Single split: test.

Each task includes a problem statement, constraints, and evaluation criteria.

Reward Structure

Sparse, verifiable rewards on a 0.0-1.0 scale
Algorithmic: reward = mean of the per-test-case score ratios, where each test case is scored 0-1 by the testlib checker
Research: reward = evaluator_score / 100, where the evaluator scores performance relative to baselines
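
The two reward rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the environment's actual scoring code: the function names are hypothetical, and the clamping and the empty-list behavior are assumptions added to keep the result inside the documented 0.0-1.0 range.

```python
from statistics import fmean


def algorithmic_reward(case_scores):
    """Mean of per-test-case scores, each expected to lie in [0, 1]."""
    if not case_scores:
        return 0.0  # assumption: no test cases means no reward
    # Assumption: out-of-range checker scores are clamped into [0, 1].
    clamped = [min(1.0, max(0.0, s)) for s in case_scores]
    return fmean(clamped)


def research_reward(evaluator_score):
    """Normalize a 0-100 evaluator score to a 0.0-1.0 reward."""
    # Assumption: scores outside 0-100 are clamped before dividing.
    return min(100.0, max(0.0, evaluator_score)) / 100.0
```

For example, three test cases scored 1.0, 0.5, and 0.0 would average to a reward of 0.5 under this sketch.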

Data

Source: FrontierCS GitHub + HuggingFace dataset
Algorithmic: 172 problem directories with test data, checkers, and config
Research: 25 problem groups with evaluators, resources, and baselines

Tools

Algorithmic (`FrontierCSAlgorithmic`)

bash — Execute shell commands (compile, test, explore)
read_file — Read file contents (50KB limit)
write_file — Write file contents
list_files — List directory contents
submit — Submit C++ code for server-side evaluation (terminal)
todo_write — Task planning and progress tracking
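
Before calling submit, an agent would typically compile and try its solution locally through the bash tool. The sketch below shows one way to script that step, assuming g++ is available in the sandbox; the file names, the `-O2` flag, and the two-second timeout are illustrative assumptions rather than documented settings.

```python
import subprocess


def compile_command(source="solution.cpp", binary="solution"):
    """Build the g++ command line for a C++17 solution."""
    # -std=c++17 matches the track's language requirement; -O2 is an assumption.
    return ["g++", "-std=c++17", "-O2", "-o", binary, source]


def run_on_sample(binary, sample_input, timeout_s=2.0):
    """Run a compiled binary on sample input text and return its stdout."""
    result = subprocess.run(
        [f"./{binary}"],
        input=sample_input,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # hypothetical limit, not the judge's real one
        check=True,         # raise if the program exits with a nonzero status
    )
    return result.stdout
```

A local run like this only checks behavior on inputs the agent constructs itself; the hidden test data and checkers remain on the server.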

Research (`FrontierCSResearch`)

bash — Execute shell commands
read_file — Read file contents (50KB limit)
write_file — Write file contents
list_files — List directory contents
submit — Run evaluator on solution.py (terminal)
todo_write — Task planning and progress tracking

Time Horizon

Multi-turn. Agents typically need 10-50+ tool calls to understand the problem, develop a solution, test it, and submit.

Environment Difficulty

These are frontier/unsolved problems — high difficulty by design. Most problems have no known perfect solution. Partial credit is awarded.

Safety

Algorithmic: Server-side C++ execution is sandboxed with resource limits. Agents cannot access test data or checker source code.
Research: Evaluator runs inside the agent's sandbox. The agent could theoretically modify the evaluator, but this is acceptable for benchmark use.
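
For users who want to detect evaluator tampering on the research track, one option is to compare a digest of the evaluator file recorded before the episode with one taken afterwards. This is a hedged sketch of that idea, not a feature of FrontierCS; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def evaluator_unchanged(path, expected_digest):
    """True if the file at `path` still matches the digest recorded earlier."""
    return file_sha256(path) == expected_digest
```

The digest would need to be recorded and stored outside the agent's sandbox for the comparison to be meaningful.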

Citations

@article{mang2025frontiercs,
      title={FrontierCS: Evolving Challenges for Evolving Intelligence},
      author={Qiuyang Mang and Wenhao Chai and Zhifei Li and Huanzhi Mao and Shang Zhou and Alexander Du and Hanchen Li and Shu Liu and Edwin Chen and Yichuan Wang and Xieting Chu and Zerui Cheng and Yuan Xu and Tian Xia and Zirui Wang and Tianneng Shi and Jianzhu Yao and Yilong Zhao and Qizheng Zhang and Charlie Ruan and Zeyu Shen and Kaiyuan Liu and Runyuan He and Dong Xing and Zerui Li and Zirong Zeng and Yige Jiang and Lufeng Cheng and Ziyi Zhao and Youran Sun and Wesley Zheng and Meiyuwang Zhang and Ruyi Ji and Xuechang Tu and Zihan Zheng and Zexing Chen and Kangyang Zhou and Zhaozi Wang and Jingbang Chen and Aleksandra Korolova and Peter Henderson and Pramod Viswanath and Vijay Ganesh and Saining Xie and Zhuang Liu and Dawn Song and Sewon Min and Ion Stoica and Joseph E. Gonzalez and Jingbo Shang and Alvin Cheung},
      year={2025},
      eprint={2512.15699},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
}

Compute Configuration

Component	Configuration
Environment Server	4 vCPUs / 8 GB RAM
Sandbox Machine	Task-Dependent

Estimated Cost

Component	Cost / second
Environment	$0.0000920
Sandbox	Task-Dependent
Total	$0.0000920
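
To turn the per-second rate above into a budget for a run, multiply by the run's duration. The helper below is a minimal sketch with a hypothetical name; it covers only the environment server, since sandbox cost is task-dependent and not listed.

```python
ENVIRONMENT_COST_PER_SECOND = 0.0000920  # USD, from the table above


def environment_cost(seconds):
    """Environment-server cost in USD for a run of the given length."""
    return ENVIRONMENT_COST_PER_SECOND * seconds
```

At this rate, one hour (3600 seconds) of environment time comes to about $0.33 before any sandbox cost.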
