Implementation of arXiv/apex-agents

APEX-Agents

Description

APEX-Agents (AI Productivity Index for Agents) is an environment for evaluating AI agents on realistic workplace tasks across three professional domains: Investment Banking, Law, and Management Consulting. It contains 480 tasks based on 33 realistic workplace scenarios, requiring multi-turn interaction with file exploration, document analysis, and creation of professional deliverables.

Capabilities

Multi-turn workplace task completion
Document analysis and file exploration (PDFs, spreadsheets, Word, PowerPoint)
Professional deliverable creation
Sandboxed command execution and file manipulation

Compute Requirements

Each agent is given an isolated Docker sandbox with 2 CPUs and 2 GB RAM. Task-specific filesystems with PDFs, spreadsheets, and documents are mounted read-only.

License

CC BY 4.0.

Tasks

There is one split in this environment:

test: 480 tasks (160 per domain: Investment Banking, Law, Management Consulting)

Tasks include task-specific filesystems based on 33 realistic workplace scenarios ("worlds") populated with relevant files, emails, presentations, and spreadsheets.

Reward Structure

This is a multi-turn environment with rubric-based evaluation. The agent uses CLI tools to explore files and complete the task, then submits via submit_answer (for console message tasks) or submit_files (for file-based outputs). An LLM grader (gpt-5-mini) evaluates the submission against the task's rubric, which consists of between 1 and 10 binary (pass/fail) criteria. The reward is all-or-nothing: reward=1.0 only if every criterion passes, otherwise reward=0.0.
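The all-or-nothing rule can be illustrated with a minimal sketch. The function name and the list-of-booleans input are assumptions made for illustration; the environment's actual grader code is not shown in this README.

```python
from typing import Sequence


def all_or_nothing_reward(criteria_passed: Sequence[bool]) -> float:
    """Return 1.0 only if every binary rubric criterion passed, else 0.0.

    `criteria_passed` is a hypothetical representation of the grader's
    per-criterion verdicts (one bool per rubric criterion).
    """
    if not criteria_passed:
        raise ValueError("a rubric needs at least one criterion")
    return 1.0 if all(criteria_passed) else 0.0


# One failed criterion out of three is enough to zero the reward.
print(all_or_nothing_reward([True, True, True]))   # 1.0
print(all_or_nothing_reward([True, False, True]))  # 0.0
```

Because a single failed criterion zeroes the reward, tasks with longer rubrics are harder to pass than the per-criterion pass rate alone would suggest.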

Data

Data consists of JSON metadata (tasks_and_rubrics.json), world filesystems (world_files/) containing realistic workplace documents for each of the 33 scenarios, and task-specific files (task_files/) for individual tasks. The data is sourced from the Hugging Face dataset mercor/apex-agents and is stored on the OpenReward platform.

Tools

Tool	Description
`submit_answer`	Submit text response for console message tasks. Ends the episode.
`submit_files`	Submit created/edited files for file-based tasks. Ends the episode.
`bash`	Execute shell commands in sandbox.
`read`	Read text file contents.
`write`	Write files.
`edit`	Edit existing files.
`grep`	Search file contents.
`glob`	Find files by pattern.
`ls`	List directory contents.
`excel_read`	Read Excel file contents.
`excel_list_sheets`	List sheets in an Excel file.
`word_read`	Read Word document contents.
`pdf_read`	Read PDF file contents.
`powerpoint_read`	Read PowerPoint file contents.
`powerpoint_list_slides`	List slides in a PowerPoint file.
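As a rough illustration of how an agent harness might choose among the reader tools above, the sketch below maps a file extension to a tool name. The extension mapping, the function name, and the example paths are assumptions; the README does not state which file extensions each tool accepts.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the reader tools listed above.
READER_BY_EXTENSION = {
    ".pdf": "pdf_read",
    ".xlsx": "excel_read",
    ".docx": "word_read",
    ".pptx": "powerpoint_read",
}


def pick_reader(path: str) -> str:
    """Pick a reader tool name for `path`, falling back to the plain-text `read`."""
    suffix = PurePosixPath(path).suffix.lower()
    return READER_BY_EXTENSION.get(suffix, "read")


# Example paths are made up for illustration.
print(pick_reader("reports/valuation_model.xlsx"))  # excel_read
print(pick_reader("notes/meeting.txt"))             # read
```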

Time Horizon

Multi-turn. Agents explore files and execute commands before submitting final deliverables.

Environment Difficulty

Tasks are complex, multi-step professional workflows; experienced professionals estimate that a task takes 1-2 hours to complete. Current leaderboard scores (Pass@1) from mercor.com/apex:

Model	Pass@1
Gemini 3.1 Pro (High)	33.5%
GPT 5.3 Codex (High)	31.7%
Opus 4.6 (High)	29.8%
GPT 5.2 Codex (High)	27.6%
Gemini 3 Flash (High)	24.0%
GPT 5.2 (High)	23.0%
GPT 5.1 Codex (High)	20.6%
GPT 5 Codex (High)	20.0%
Opus 4.5 (High)	18.4%
Gemini 3 Pro (High)	18.4%
GPT 5 (High)	18.3%
Grok 4	15.2%
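Since each attempt yields a reward of 0.0 or 1.0, a Pass@1 figure like those above can be read as the mean reward over single attempts. The sketch below assumes that reading; how the leaderboard actually aggregates runs is not specified here, and the reward values are made up.

```python
from typing import Sequence


def pass_at_1(rewards: Sequence[float]) -> float:
    """Mean of binary rewards (0.0 or 1.0), one per single-attempt task run."""
    if not rewards:
        raise ValueError("need at least one reward")
    return sum(rewards) / len(rewards)


# Illustrative rewards only: 2 passes out of 8 attempts.
example_rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(f"{pass_at_1(example_rewards):.1%}")  # 25.0%
```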

Other Environment Requirements

An OpenAI API key is required for LLM-based grading. Pass it via secrets={"openai_api_key": "..."}.
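To avoid hard-coding the key, the secrets mapping can be built from an environment variable. This is a minimal sketch: the helper name and the OPENAI_API_KEY variable name are assumptions, and the call that accepts the secrets argument is not shown because this README does not name it.

```python
import os


def build_secrets(env_var: str = "OPENAI_API_KEY") -> dict[str, str]:
    """Build the secrets mapping from an environment variable.

    The env_var default is an assumption; only the "openai_api_key"
    dictionary key comes from the README.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"set {env_var} before starting a session")
    return {"openai_api_key": key}
```

The returned dictionary is what would be passed as secrets=... when starting a session.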

Safety

Agents in APEX-Agents operate within sandboxed environments with read-only data mounts. The environment does not present direct safety risks.

Citation

@misc{vidgen2026apexagents,
  title={APEX--Agents},
  author={Vidgen, Bertie and Mann, Austin and Fennelly, Abby and Wright Stanly, John and Rothman, Lucas and Burstein, Marco and Benchek, Julien and Ostrofsky, David and Ravichandran, Anirudh and Sur, Debnil and Venugopal, Neel and Hsia, Alannah and Robinson, Isaac and Huang, Calix and Varones, Olivia and Khan, Daniyal and Haines, Michael and Richards, Zach and Mahapatra, Chirag and Foody, Brendan and Nitski, Osvald},
  year={2026},
  howpublished={arXiv},
  url={https://arxiv.org/abs/2601.14242}
}

Repository

Source repository: EnvCommons/APEX-Agents

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	2 vCPUs / 4 GB RAM
Sandbox Machine	2 vCPUs / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000460
Sandbox	$0.0000370
Total	$0.0000830

Examples

5-minute session: $0.0249

1-hour session: $0.2988
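The example figures follow directly from the per-second rates in the table. A short sketch of the arithmetic, with constant and function names chosen for illustration:

```python
# Per-second rates in USD, taken from the cost table.
ENVIRONMENT_RATE = 0.0000460
SANDBOX_RATE = 0.0000370


def session_cost(seconds: float) -> float:
    """Cost in USD of an active session lasting `seconds` seconds."""
    return seconds * (ENVIRONMENT_RATE + SANDBOX_RATE)


print(f"5-minute session: ${session_cost(5 * 60):.4f}")   # $0.0249
print(f"1-hour session:   ${session_cost(60 * 60):.4f}")  # $0.2988
```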
