Toolathlon Gym

Name: Eigent/toolathlon-gym
Author: Eigent

Description

Toolathlon Gym is an ORS environment for evaluating multi-tool coordination capabilities, developed by Eigent AI. It contains 503 tasks requiring agents to coordinate 4-8 MCP servers (out of 25) backed by PostgreSQL. Tasks span productivity workflows (spreadsheets, documents, email), data analysis (SQL, finance, web scraping), content management (YouTube, Google Forms, Notion), and system administration (terminal, filesystem).

Capabilities

Multi-tool coordination across 25 MCP servers
Database querying via Snowflake/PostgreSQL
Document creation (Excel, Word, PowerPoint, PDF)
Web scraping and browser automation via Playwright
Email, calendar, and forms management
File system operations and Python code execution

Compute Requirements

Each task runs inside a Docker container with PostgreSQL, 25 MCP server implementations, Node.js 22, and Python 3.12. The ORS server runs alongside all services in a single container.

License

Apache 2.0.

Tasks

There is one split in this environment:

test: 503 tasks

Tasks require coordinating 4-8 MCP servers per task across domains including YouTube analysis, spreadsheet automation, database querying, web scraping, document generation, email management, and calendar scheduling.

Reward Structure

This is a multi-turn environment with script-based validation. The agent uses MCP tools and built-in tools to complete tasks, then calls claim_done to run the evaluation script. The reward is binary: 1.0 if all checks pass, 0.0 otherwise.
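The all-or-nothing scoring described above can be sketched in a few lines. This is only an illustration: the function name `binary_reward` and the idea that checks arrive as a list of booleans are assumptions, not the environment's actual evaluation code.

```python
from typing import Iterable


def binary_reward(check_results: Iterable[bool]) -> float:
    """Return 1.0 only if every check passed, otherwise 0.0.

    Hypothetical helper: the real evaluation scripts may structure
    their checks differently. Note that all([]) is True, so an empty
    list of checks would score 1.0 here.
    """
    return 1.0 if all(check_results) else 0.0


# One failed check is enough to zero the reward; there is no partial credit.
print(binary_reward([True, True, True]))
print(binary_reward([True, False, True]))
```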

Data

Data consists of 503 task directories sourced from the GitHub repository eigent-ai/toolathlon_gym. Each task includes task_config.json, docs/task.md, preprocess/main.py, evaluation/main.py, and initial workspace files. PostgreSQL is seeded with data across 6 schemas (Canvas LMS, HR/Sales, e-commerce, finance, YouTube, rail).
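As a rough sketch of the per-task layout listed above, the following helper reports which of the four named files are missing from a task directory. The function name and the example path `tasks/example_task` are hypothetical, and the initial workspace files are not checked because their names vary by task.

```python
from pathlib import Path

# Relative paths named in the task description above.
EXPECTED_FILES = (
    "task_config.json",
    "docs/task.md",
    "preprocess/main.py",
    "evaluation/main.py",
)


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the expected files that do not exist under task_dir."""
    return [rel for rel in EXPECTED_FILES if not (task_dir / rel).is_file()]


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the tasks are checked out.
    print(missing_task_files(Path("tasks/example_task")))
```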

Tools

Shared tools (always available):

Tool	Description
`claim_done`	Run evaluation and get reward. Ends the episode.
`python_execute`	Execute Python code in the workspace.

Task-specific tools (per-task, from MCP servers):

Each task exposes tools from its required MCP servers. Examples include read_file, list_spreadsheets, search_videos, run_command, list_databases, browser_navigate, and ~300 others across 25 servers.

Time Horizon

Multi-turn. Agents read task instructions, use MCP tools to gather data, create files, query databases, and perform analysis, then call claim_done for evaluation.
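The shape of one episode can be illustrated with a toy stand-in. Everything here is an assumption made for illustration: `ToySession` is not the ORS client API, the tool bodies are placeholders, and the toy pass condition replaces the task's real evaluation script. Only the tool names `read_file`, `python_execute`, and `claim_done` come from this document.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToySession:
    """Hypothetical stand-in for one episode; not the ORS client API."""

    tools: dict[str, Callable[..., str]]
    history: list[str] = field(default_factory=list)
    done: bool = False
    reward: float = 0.0

    def call(self, name: str, **kwargs: str) -> str:
        if self.done:
            raise RuntimeError("episode already ended by claim_done")
        if name == "claim_done":
            # Toy evaluation: passes if any tool was used beforehand.
            # The real environment runs the task's evaluation script instead.
            self.done = True
            self.reward = 1.0 if self.history else 0.0
            return f"reward={self.reward}"
        self.history.append(name)
        return self.tools[name](**kwargs)


# Stub tools named after examples in this document; bodies are placeholders.
session = ToySession(tools={
    "read_file": lambda path: f"<contents of {path}>",
    "python_execute": lambda code: "<stdout>",
})

session.call("read_file", path="docs/task.md")             # read instructions
session.call("python_execute", code="print('analysis')")   # do the work
print(session.call("claim_done"))                          # ends the episode
```

The point of the sketch is the ordering: tool calls accumulate over several turns, and `claim_done` is terminal, so any call made after it is rejected.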

Environment Difficulty

Tasks require coordinating multiple tools across different domains. Most tasks involve 4-8 MCP servers and require multi-step reasoning, data transformation, and file generation.

Other Environment Requirements

None.

Safety

Agents operate within Docker containers with isolated PostgreSQL databases and per-session workspaces. MCP servers are scoped to the agent's workspace directory.

Citation

@misc{toolathlon,
  author    = {Eigent AI},
  title     = {{Toolathlon Gym: A Multi-Tool Benchmark for LLM Agents}},
  year      = {2025},
  url       = {https://github.com/eigent-ai/toolathlon_gym}
}

Repository

Source repository

EnvCommons/toolathlon-gym

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	2 vCPUs / 4 GB RAM
Sandbox Machine	N/A

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000460
Sandbox	N/A
Total	$0.0000460

Examples

5-minute session: $0.0138

1-hour session: $0.1656
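
The example figures follow directly from the per-second rate: cost is session length in seconds multiplied by $0.0000460. A minimal sketch, assuming billing is strictly linear with no minimum charge; the function name and the rounding to four decimal places are choices made here to match the figures shown:

```python
RATE_PER_SECOND = 0.0000460  # USD per second, from the cost table above


def session_cost(seconds: float, rate: float = RATE_PER_SECOND) -> float:
    """Estimated session cost in USD, rounded to four decimal places."""
    return round(seconds * rate, 4)


print(session_cost(5 * 60))   # 5-minute session
print(session_cost(60 * 60))  # 1-hour session
```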