FinanceAgent-Terminal

Description

FinanceAgent-Terminal is an environment for evaluating agents on real-world financial research tasks requiring SEC filing analysis. Agents use Google Search, EDGAR database queries, HTML parsing, and LLM-based information retrieval to answer expert-authored financial questions. Tasks span nine categories from simple retrieval to complex financial modeling.

This OpenReward implementation is based on the Finance Agent Benchmark by Bigeard et al.

Capabilities

SEC EDGAR filing search and analysis
Financial document parsing and information extraction
Multi-hop reasoning across filings and web sources
Quantitative and qualitative financial analysis

Compute Requirements

Agents are given a sandboxed Docker environment. Default sandbox size is 1 CPU and 2 GB RAM.

License

MIT.

Tasks

There is one split in this environment:

Test: 50 financial research questions

Each task presents a financial question requiring research across SEC filings and web sources.

Reward Structure

This is a multi-turn environment with binary reward:

1.0 — Correct answer (matches expected answer exactly, or judged correct by gpt-5-mini)
0.0 — Incorrect answer

The agent writes its final answer to /app/answer.txt. An LLM judge (gpt-5-mini) evaluates semantic correctness against the expert-provided expected answer; an answer that matches the expected answer exactly also earns the full reward.
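The scoring rule above can be sketched in a few lines of Python. This is only an illustration: `binary_reward` and the `judge` callable are hypothetical names, the real test harness and its gpt-5-mini prompt are not shown in this document, and trying the exact match before calling the judge is an assumption.

```python
from typing import Callable


def binary_reward(answer: str, expected: str, judge: Callable[[str, str], bool]) -> float:
    """Return 1.0 for a correct answer and 0.0 otherwise.

    `judge` stands in for the LLM judge (hypothetical interface): it takes
    the submitted and expected answers and returns True when the submission
    is semantically correct.
    """
    # Assumption: surrounding whitespace is ignored for the exact-match check.
    if answer.strip() == expected.strip():
        return 1.0
    # Otherwise defer to the judge's verdict.
    return 1.0 if judge(answer, expected) else 0.0
```

In a real run the judge would be a model call; for local experiments a stub such as `lambda answer, expected: False` is enough to exercise the exact-match path.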

Data

Data consists of 50 task directories, each containing an instruction file, expected answer, and test harness. Questions are derived from the Finance Agent Benchmark's public validation set.

Tools

Tool	Description
`bash`	Run bash commands in the sandbox container.
`str_replace`	Replace a unique string in a file with another string.
`view`	View file contents or directory listings.
`create_file`	Create a new file with specified content.
`submit_answer`	Submit work for verification. Runs the test harness and returns reward.

Additionally, in-sandbox tools are available via Python scripts:

In-Sandbox Tool	Description
`google_web_search`	Search the web via the Google Search API.
`edgar_search`	Search the SEC EDGAR database for filings by form type, CIK, and date range.
`parse_html_page`	Parse HTML content and store it for later retrieval.
`retrieve_information`	Query stored documents using an LLM, with character-range extraction.
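To make the `edgar_search` filters concrete, here is a minimal sketch of selecting filings by form type and date range for one filer. It does not reproduce the in-sandbox script, whose interface is not documented here. It assumes filing metadata arrives as parallel `form`, `filingDate`, and `accessionNumber` lists (the column-oriented layout that SEC's public submissions JSON uses); `pad_cik` and `filter_filings` are hypothetical helper names, and the sample data is made up.

```python
from datetime import date


def pad_cik(cik) -> str:
    """Write a CIK as 10 digits with leading zeros, as EDGAR URLs expect."""
    return str(int(cik)).zfill(10)


def filter_filings(recent: dict, form_type: str, start: date, end: date) -> list:
    """Select filings of one form type filed within [start, end], inclusive."""
    rows = zip(recent["form"], recent["filingDate"], recent["accessionNumber"])
    return [
        {"form": form, "filingDate": filed, "accessionNumber": accession}
        for form, filed, accession in rows
        if form == form_type and start <= date.fromisoformat(filed) <= end
    ]


# Made-up sample metadata in the assumed column layout.
sample = {
    "form": ["10-K", "8-K", "10-K"],
    "filingDate": ["2024-02-01", "2024-03-15", "2023-02-03"],
    "accessionNumber": [
        "0000000000-24-000001",
        "0000000000-24-000002",
        "0000000000-23-000001",
    ],
}

hits = filter_filings(sample, "10-K", date(2024, 1, 1), date(2024, 12, 31))
```

Here `hits` holds only the 10-K filed on 2024-02-01, since the 8-K has the wrong form type and the other 10-K falls outside the date range.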

Time Horizon

FinanceAgent-Terminal is a multi-turn environment. Agents search SEC filings and web sources, parse documents, and synthesize answers before submitting.
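The parse-then-retrieve part of this loop can be illustrated with a small sketch using only the Python standard library. It is not the environment's `parse_html_page` or `retrieve_information` implementation: `TextExtractor`, `STORE`, `parse_and_store`, and `read_range` are hypothetical names, and the LLM query step is omitted so that only the storage and character-range slicing are shown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


# In-memory document store keyed by a caller-chosen name (an assumption).
STORE = {}


def parse_and_store(key: str, html: str) -> int:
    """Parse HTML to plain text, store it under `key`, return its length."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    STORE[key] = " ".join(parser.parts)
    return len(STORE[key])


def read_range(key: str, start: int, end: int) -> str:
    """Return characters [start, end) of a stored document."""
    return STORE[key][start:end]


page = "<html><style>p{}</style><body><h1>Revenue</h1><p>Total revenue was $10 million.</p></body></html>"
length = parse_and_store("doc", page)
heading = read_range("doc", 0, 7)
```

Slicing by character range keeps each retrieval call small, which matters when a single filing runs to hundreds of thousands of characters.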

Environment Difficulty

The original paper evaluates LLMs on financial research tasks:

Model	Accuracy
Claude Opus 4.6 (Thinking)	60.65%
GPT 5.1	56.55%
Claude Sonnet 4.5 (Thinking)	55.32%
OpenAI o3	46.8%

Models perform best on simple quantitative/qualitative retrieval tasks but struggle with complex financial modeling and market analysis.

Other Environment Requirements

OpenAI API key: Required for LLM judge verification and for the in-sandbox `retrieve_information` tool.

Pass it via secrets={"openai_api_key": "..."}.
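One way to build that `secrets` mapping without hard-coding the key is sketched below. The `OPENAI_API_KEY` environment variable name and the `build_secrets` helper are assumptions for illustration; how the mapping is handed to the environment client is not covered here.

```python
import os


def build_secrets(env=None) -> dict:
    """Build the secrets mapping from an environment-like mapping.

    Assumption: the key is exported as OPENAI_API_KEY in the caller's shell.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {"openai_api_key": key}
```

Failing early on a missing key avoids starting a run whose judge and `retrieve_information` calls would all fail later.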

Safety

Agents in FinanceAgent-Terminal access public SEC filings and web search results. The environment does not involve real financial transactions or private data.

Citations

@article{bigeard2025financeagent,
  author    = {Antoine Bigeard and Langston Nashold and Rayan Krishnan and Shirley Wu},
  title     = {Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks},
  journal   = {arXiv preprint arXiv:2508.00828},
  year      = {2025},
  url       = {https://arxiv.org/abs/2508.00828}
}

Compute Configuration

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320
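The per-second rate listed above converts to a run cost by simple multiplication. The sketch below covers only the environment server, since the sandbox cost is listed as not configured; `environment_cost` is a hypothetical helper name.

```python
# Per-second environment rate from the cost table, in US dollars.
ENVIRONMENT_COST_PER_SECOND = 0.0000320


def environment_cost(seconds: float, rate: float = ENVIRONMENT_COST_PER_SECOND) -> float:
    """Estimated environment-server cost in dollars for a run of `seconds`."""
    return seconds * rate


hourly = environment_cost(3600)
```

At this rate one hour of environment time comes to roughly $0.1152.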
