FinanceAgent-Terminal
Description
FinanceAgent-Terminal is an environment for evaluating agents on real-world financial research tasks requiring SEC filing analysis. Agents use Google Search, EDGAR database queries, HTML parsing, and LLM-based information retrieval to answer expert-authored financial questions. Tasks span nine categories from simple retrieval to complex financial modeling.
This OpenReward implementation is based on the Finance Agent Benchmark by Bigeard et al.
Capabilities
- SEC EDGAR filing search and analysis
- Financial document parsing and information extraction
- Multi-hop reasoning across filings and web sources
- Quantitative and qualitative financial analysis
Compute Requirements
Agents are given a sandboxed Docker environment. Default sandbox size is 1 CPU and 2 GB RAM.
License
MIT.
Tasks
There is one split in this environment:
- Test: 50 financial research questions
Each task presents a financial question requiring research across SEC filings and web sources.
Reward Structure
This is a multi-turn environment with binary reward:
- 1.0 — Correct answer (matches expected answer exactly, or judged correct by gpt-5-mini)
- 0.0 — Incorrect answer
The agent writes its final answer to /app/answer.txt. An LLM judge evaluates semantic correctness against the expert-provided expected answer.
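The binary reward described above can be sketched roughly as follows. This is a minimal illustration, not the environment's actual test harness: `score_answer` and the `judge` callable are hypothetical names, the exact-match check is assumed to ignore only leading and trailing whitespace, and the judge is stubbed out rather than calling a real model.

```python
from typing import Callable


def score_answer(
    answer: str,
    expected: str,
    judge: Callable[[str, str], bool],
) -> float:
    """Return 1.0 for a correct answer and 0.0 otherwise.

    `judge` is a hypothetical stand-in for the LLM judge: it receives the
    agent's answer and the expected answer and returns True if it considers
    the answer semantically correct.
    """
    # Assumption: "exact match" ignores surrounding whitespace only.
    if answer.strip() == expected.strip():
        return 1.0
    # Fall back to the judge only when the strings differ.
    return 1.0 if judge(answer, expected) else 0.0


def stub_judge(answer: str, expected: str) -> bool:
    """Toy judge for demonstration: accepts answers containing the expected text."""
    return expected.strip().lower() in answer.lower()


if __name__ == "__main__":
    print(score_answer("$4.2 billion\n", "$4.2 billion", stub_judge))
    print(score_answer("Revenue was $4.2 billion.", "$4.2 billion", stub_judge))
    print(score_answer("Unknown", "$4.2 billion", stub_judge))
```

Checking for an exact match first avoids a judge call when the answer file already contains the expected string verbatim.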
Data
Data consists of 50 task directories, each containing an instruction file, expected answer, and test harness. Questions are derived from the Finance Agent Benchmark's public validation set.
Tools
| Tool | Description |
|---|---|
| bash | Run bash commands in the sandbox container. |
| str_replace | Replace a unique string in a file with another string. |
| view | View file contents or directory listings. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for verification. Runs the test harness and returns reward. |
Additionally, in-sandbox tools are available via Python scripts:
| In-Sandbox Tool | Description |
|---|---|
| google_web_search | Search the web via Google Search API. |
| edgar_search | Search SEC EDGAR database for filings by form type, CIK, date range. |
| parse_html_page | Parse HTML content and store for later retrieval. |
| retrieve_information | Query stored documents using LLM with character range extraction. |
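To illustrate what "character range extraction" over stored documents might look like, here is a small sketch. It does not reproduce the real in-sandbox scripts: the in-memory `store`, `save_document`, and `build_prompt` are hypothetical, and the actual tools may persist documents and format prompts differently.

```python
# Hypothetical in-memory store standing in for whatever parse_html_page persists.
store: dict[str, str] = {}


def save_document(key: str, text: str) -> None:
    """Keep parsed page text under a key so it can be queried later."""
    store[key] = text


def build_prompt(question: str, key: str, start: int, end: int) -> str:
    """Build an LLM prompt from the characters [start, end) of a stored document."""
    if not 0 <= start <= end:
        raise ValueError("expected 0 <= start <= end")
    text = store[key]
    excerpt = text[start:end]  # slicing clamps `end` to the document length
    actual_end = start + len(excerpt)
    return (
        f"Question: {question}\n\n"
        f"Excerpt ({key}, chars {start}-{actual_end}):\n{excerpt}"
    )


if __name__ == "__main__":
    save_document("example_10k", "Item 7. Management's Discussion and Analysis ...")
    print(build_prompt("What does Item 7 cover?", "example_10k", 0, 30))
```

Limiting the prompt to a character range keeps long filings within the model's context window, at the cost of the agent having to pick a useful range.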
Time Horizon
FinanceAgent-Terminal is a multi-turn environment. Agents search SEC filings and web sources, parse documents, and synthesize answers before submitting.
Environment Difficulty
The original paper evaluates LLMs on financial research tasks:
| Model | Accuracy |
|---|---|
| Claude Opus 4.6 (Thinking) | 60.65% |
| GPT 5.1 | 56.55% |
| Claude Sonnet 4.5 (Thinking) | 55.32% |
| OpenAI o3 | 46.8% |
Models perform best on simple quantitative/qualitative retrieval tasks but struggle with complex financial modeling and market analysis.
Other Environment Requirements
- OpenAI API key: Required for LLM judge verification and in-sandbox retrieve_information tool. Pass via secrets={"openai_api_key": "..."}.
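One way to assemble the secrets mapping without hard-coding the key is sketched below. The helper `build_secrets` and the `OPENAI_API_KEY` environment variable name are assumptions for illustration; how the resulting dictionary is handed to the environment depends on the client in use and is not shown here.

```python
import os


def build_secrets(env_var: str = "OPENAI_API_KEY") -> dict[str, str]:
    """Read the API key from an environment variable and wrap it in the
    {"openai_api_key": ...} mapping the environment expects."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before starting the environment")
    return {"openai_api_key": key}


if __name__ == "__main__":
    secrets = build_secrets()
    # Print only the key names, never the secret value itself.
    print(sorted(secrets))
```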
Safety
Agents in FinanceAgent-Terminal access public SEC filings and web search results. The environment does not involve real financial transactions or private data.
Citations
@article{bigeard2025financeagent,
author = {Antoine Bigeard and Langston Nashold and Rayan Krishnan and Shirley Wu},
title = {Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks},
journal = {arXiv preprint arXiv:2508.00828},
year = {2025},
url = {https://arxiv.org/abs/2508.00828}
}