API Endpoint

Leaderboard

Loading leaderboard...

README

SWE-Atlas-QnA

Description

SWE-Atlas-QnA is an environment for evaluating agents on deep code comprehension across production-grade software systems. It is based on the Codebase QnA benchmark from Scale AI, part of the SWE-Atlas suite. Agents must investigate real open-source codebases — tracing execution paths, explaining architectural decisions, and answering deeply technical questions — using runtime verification rather than static code reading alone.

The benchmark spans 11 open-source repositories across 4 languages (Go, Python, C, TypeScript), covering systems including mail servers, terminal emulators, distributed object storage, observability platforms, and secret scanners.

Capabilities

Deep code comprehension across multiple programming languages
Runtime verification and execution tracing
Architectural understanding of production-grade systems
Evidence-based reasoning with artifact collection (logs, traces, API responses)
Multi-step codebase exploration and investigation

Compute Requirements

Agents in SWE-Atlas-QnA are given a sandbox with 2GB of RAM and 2 CPUs. Each task uses a dedicated Docker image with the target repository pre-installed at /app.

Tasks

There is one split: test (124 tasks). Each task presents a question about a specific open-source codebase. The agent must explore the repository, run code, collect evidence, and provide a comprehensive answer. Tasks span 5 categories across 11 repositories and 4 programming languages.

Each task includes a rubric with multiple criteria of two types:

Positive verifiers ("must have" / "should have"): factual claims the answer should contain
Negative verifiers ("must have" / "should have"): incorrect claims the answer should avoid

Reward Structure

This is a sparse, binary reward environment matching the original SWE-Atlas Task Resolve Rate metric. The agent explores the codebase using CLI tools, then calls the answer tool with its response. An LLM grader (Claude Opus 4.5, matching the original evaluation) scores each rubric criterion independently as met or not met.

The reward is binary — all-or-nothing:

1.0 if ALL positive verifier criteria are met AND no negative verifier criteria are triggered
0.0 otherwise

There is no partial credit. The "must have" and "should have" annotations are informational and do not affect the score.

Data

Tasks are sourced from the SWE-Atlas Codebase QnA benchmark by Scale AI. Each task includes a question, expert reference answer, evaluation rubric, and a Docker image containing the target repository at a pinned commit.

Tools

Agents are given access to 7 CLI tools mirroring the Claude Code toolset, plus an environment-specific tool:

bash: Execute shell commands
read: Read file contents
write: Write to files
edit: Edit files with exact string replacement
grep: Search for regex patterns in files
glob: Find files matching glob patterns
todo_write: Track investigation progress
answer: Submit the final answer for grading (ends the task)

Time Horizon

SWE-Atlas-QnA is a multi-turn environment. Agents typically make many tool calls to explore the codebase (reading files, running commands, tracing execution) before submitting a final answer.

Environment Difficulty

Even frontier models scoring >80% on SWE-Bench achieve less than 30% on SWE-Atlas Codebase QnA, highlighting significant gaps in deep code comprehension capabilities.

Other Environment Requirements

SWE-Atlas-QnA requires an Anthropic API key (anthropic_api_key secret) for LLM-based grading of answers using Claude Opus 4.5, matching the original SWE-Atlas evaluation methodology.

Safety

Agents in SWE-Atlas-QnA explore open-source codebases within isolated Docker containers. The environment does not present direct safety risks — agents interact only with the sandboxed repository and have no access to external systems beyond the sandbox. Agents are instructed not to permanently modify source code.

Citations

@dataset{ScaleAISWEAtlasQnA,
  author    = {Scale AI},
  title     = {SWE-Atlas Codebase QnA},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ScaleAI/SWE-Atlas-QnA}
}

Repository

Source repository

EnvCommons/swe-atlas-qna

Clone Repository

Tools

Tools available in the environment

No tools available for this environment, it probably hasn't been indexed yet.

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	2 vCPUs / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000370
Total	$0.0000690

Examples

5-minute session$0.0207

1-hour session$0.2484

SWE-Atlas-QnA

GeneralReasoning/SWE-Atlas-QnA

SWE-Atlas-QnA

Description

Capabilities

Compute Requirements

Tasks

Reward Structure

Data

Tools

Time Horizon

Environment Difficulty

Other Environment Requirements

Safety

Citations

Repository

Clone Repository

Tools

Compute Configuration

Estimated Cost

Examples