SWE-Atlas-QnA
SWE-Atlas-QnA
Description
SWE-Atlas-QnA is an environment for evaluating agents on deep code comprehension across production-grade software systems. It is based on the Codebase QnA benchmark from Scale AI, part of the SWE-Atlas suite. Agents must investigate real open-source codebases — tracing execution paths, explaining architectural decisions, and answering deeply technical questions — using runtime verification rather than static code reading alone.
The benchmark spans 11 open-source repositories across 4 languages (Go, Python, C, TypeScript), covering systems including mail servers, terminal emulators, distributed object storage, observability platforms, and secret scanners.
Capabilities
- Deep code comprehension across multiple programming languages
- Runtime verification and execution tracing
- Architectural understanding of production-grade systems
- Evidence-based reasoning with artifact collection (logs, traces, API responses)
- Multi-step codebase exploration and investigation
Compute Requirements
Agents in SWE-Atlas-QnA are given a sandbox with 2GB of RAM and 2 CPUs. Each task uses a dedicated Docker image with the target repository pre-installed at /app.
Tasks
There is one split: test (124 tasks). Each task presents a question about a specific open-source codebase. The agent must explore the repository, run code, collect evidence, and provide a comprehensive answer. Tasks span 5 categories across 11 repositories and 4 programming languages.
Each task includes a rubric with multiple criteria of two types:
- Positive verifiers ("must have" / "should have"): factual claims the answer should contain
- Negative verifiers ("must have" / "should have"): incorrect claims the answer should avoid
Reward Structure
This is a sparse, binary reward environment matching the original SWE-Atlas Task Resolve Rate metric. The agent explores the codebase using CLI tools, then calls the answer tool with its response. An LLM grader (Claude Opus 4.5, matching the original evaluation) scores each rubric criterion independently as met or not met.
The reward is binary — all-or-nothing:
- 1.0 if ALL positive verifier criteria are met AND no negative verifier criteria are triggered
- 0.0 otherwise
There is no partial credit. The "must have" and "should have" annotations are informational and do not affect the score.
Data
Tasks are sourced from the SWE-Atlas Codebase QnA benchmark by Scale AI. Each task includes a question, expert reference answer, evaluation rubric, and a Docker image containing the target repository at a pinned commit.
Tools
Agents are given access to 7 CLI tools mirroring the Claude Code toolset, plus an environment-specific tool:
bash: Execute shell commandsread: Read file contentswrite: Write to filesedit: Edit files with exact string replacementgrep: Search for regex patterns in filesglob: Find files matching glob patternstodo_write: Track investigation progressanswer: Submit the final answer for grading (ends the task)
Time Horizon
SWE-Atlas-QnA is a multi-turn environment. Agents typically make many tool calls to explore the codebase (reading files, running commands, tracing execution) before submitting a final answer.
Environment Difficulty
Even frontier models scoring >80% on SWE-Bench achieve less than 30% on SWE-Atlas Codebase QnA, highlighting significant gaps in deep code comprehension capabilities.
Other Environment Requirements
SWE-Atlas-QnA requires an Anthropic API key (anthropic_api_key secret) for LLM-based grading of answers using Claude Opus 4.5, matching the original SWE-Atlas evaluation methodology.
Safety
Agents in SWE-Atlas-QnA explore open-source codebases within isolated Docker containers. The environment does not present direct safety risks — agents interact only with the sandboxed repository and have no access to external systems beyond the sandbox. Agents are instructed not to permanently modify source code.
Citations
@dataset{ScaleAISWEAtlasQnA,
author = {Scale AI},
title = {SWE-Atlas Codebase QnA},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ScaleAI/SWE-Atlas-QnA}
}