FinanceAgent-Terminal

⭐ OpenReward Environment

Description

FinanceAgent-Terminal is an environment for evaluating agents on real-world financial research tasks requiring SEC filing analysis. Agents use Google Search, EDGAR database queries, HTML parsing, and LLM-based information retrieval to answer expert-authored financial questions. Tasks span nine categories from simple retrieval to complex financial modeling.

This OpenReward implementation is based on the Finance Agent Benchmark by Bigeard et al.

Capabilities

  • SEC EDGAR filing search and analysis
  • Financial document parsing and information extraction
  • Multi-hop reasoning across filings and web sources
  • Quantitative and qualitative financial analysis

Compute Requirements

Agents are given a sandboxed Docker environment. Default sandbox size is 1 CPU and 2 GB RAM.

License

MIT.

Tasks

There is one split in this environment:

  • Test: 50 financial research questions

Each task presents a financial question requiring research across SEC filings and web sources.

Reward Structure

This is a multi-turn environment with a binary reward:

  • 1.0: correct answer (matches the expected answer exactly, or is judged correct by gpt-5-mini)
  • 0.0: incorrect answer

The agent writes its final answer to /app/answer.txt. An LLM judge evaluates semantic correctness against the expert-provided expected answer.
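The grading rule above can be sketched in Python as follows. This is a minimal sketch, not the environment's actual harness. Several details are assumptions: the whitespace and case normalization before the exact-match check, the 0.0 reward when no answer file exists, and the `judge` callable, which stands in for the gpt-5-mini call.

```python
import re
from pathlib import Path
from typing import Callable

# Hypothetical judge interface: (question, expected, answer) -> is the answer correct?
# In the environment this role is played by gpt-5-mini; here any callable works.
Judge = Callable[[str, str, str], bool]


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase (an assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def grade(question: str, expected: str, answer_path: Path, judge: Judge) -> float:
    """Binary reward: exact match first, then fall back to the LLM judge."""
    if not answer_path.is_file():
        return 0.0  # assumption: no answer file means no credit
    answer = answer_path.read_text(encoding="utf-8")
    if normalize(answer) == normalize(expected):
        return 1.0
    return 1.0 if judge(question, expected, answer) else 0.0
```

In the sandbox, `answer_path` would be `Path("/app/answer.txt")`. Checking the cheap exact match first means the judge is only consulted for answers that differ in wording.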

Data

Data consists of 50 task directories, each containing an instruction file, an expected answer, and a test harness. Questions are derived from the Finance Agent Benchmark's public validation set.
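A loader for this directory layout might look like the sketch below. The README does not name the files, so `instruction.md` and `expected_answer.txt` are hypothetical placeholders. The test harness is ignored here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Hypothetical file names; the actual task directories may use different ones.
INSTRUCTION_FILE = "instruction.md"
ANSWER_FILE = "expected_answer.txt"


@dataclass(frozen=True)
class Task:
    task_id: str
    instruction: str
    expected_answer: str


def load_tasks(root: Path) -> List[Task]:
    """Load each subdirectory of `root` that has both an instruction and an expected answer."""
    tasks = []
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        instruction = task_dir / INSTRUCTION_FILE
        answer = task_dir / ANSWER_FILE
        if not (instruction.is_file() and answer.is_file()):
            continue  # skip incomplete task directories
        tasks.append(
            Task(
                task_id=task_dir.name,
                instruction=instruction.read_text(encoding="utf-8").strip(),
                expected_answer=answer.read_text(encoding="utf-8").strip(),
            )
        )
    return tasks
```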

Tools

Tool            Description
bash            Run bash commands in the sandbox container.
str_replace     Replace a unique string in a file with another string.
view            View file contents or directory listings.
create_file     Create a new file with specified content.
submit_answer   Submit work for verification. Runs the test harness and returns the reward.
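The "unique string" requirement on str_replace means an edit applies only when the target text occurs exactly once, so an ambiguous match fails instead of silently editing the wrong place. Below is a minimal Python sketch of that rule. It assumes the tool reports an error in the zero-match and multi-match cases; the real tool's error behavior is not documented here.

```python
from pathlib import Path


def str_replace(path: Path, old: str, new: str) -> None:
    """Replace `old` with `new` in the file at `path`, but only if `old` occurs exactly once."""
    if not old:
        raise ValueError("old string must be non-empty")
    text = path.read_text(encoding="utf-8")
    count = text.count(old)
    if count != 1:
        # Zero matches means nothing to edit; two or more means the target is ambiguous.
        raise ValueError(f"expected exactly one occurrence of {old!r}, found {count}")
    path.write_text(text.replace(old, new, 1), encoding="utf-8")
```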

Additionally, in-sandbox tools are available via Python scripts:

In-Sandbox Tool        Description
google_web_search      Search the web via the Google Search API.
edgar_search           Search the SEC EDGAR database for filings by form type, CIK, and date range.
parse_html_page        Parse HTML content and store it for later retrieval.
retrieve_information   Query stored documents with an LLM, optionally limited to specified character ranges.
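parse_html_page and retrieve_information work as a pair: parsed pages are stored, and later queries pull in only the stored text (or a character range of it) that the LLM needs. The sketch below covers the prompt-assembly half of that pattern. The in-memory store, the `{{key}}` placeholder syntax, and the half-open `(start, end)` ranges are assumptions for illustration, not the tools' documented interface. The LLM call itself is omitted.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the storage that parse_html_page writes to.
DocumentStore = Dict[str, str]


def store_document(store: DocumentStore, key: str, text: str) -> None:
    """Save parsed page text under `key` so later prompts can reference it."""
    store[key] = text


def build_retrieval_prompt(
    store: DocumentStore,
    prompt: str,
    char_ranges: Optional[Dict[str, Tuple[int, int]]] = None,
) -> str:
    """Replace each {{key}} in `prompt` with the stored text, sliced to an optional [start, end) range."""
    ranges = char_ranges or {}

    def fill(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in store:
            raise KeyError(f"no stored document named {key!r}")
        text = store[key]
        start, end = ranges.get(key, (0, len(text)))
        return text[start:end]

    # The filled prompt is what would be sent to the LLM; that call is omitted here.
    return re.sub(r"\{\{(\w+)\}\}", fill, prompt)
```

Slicing by character range keeps long filings (10-Ks can run to hundreds of thousands of characters) from overflowing the model's context window.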

Time Horizon

FinanceAgent-Terminal is a multi-turn environment. Agents search SEC filings and web sources, parse documents, and synthesize answers before submitting.

Environment Difficulty

The original paper evaluates LLMs on financial research tasks:

Model                          Accuracy
Claude Opus 4.6 (Thinking)     60.65%
GPT-5.1                        56.55%
Claude Sonnet 4.5 (Thinking)   55.32%
OpenAI o3                      46.8%
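Each task's reward is 0.0 or 1.0, so accuracy on the 50-task Test split is the mean reward p. Its binomial standard error is sqrt(p(1 - p) / n). With n = 50 and p = 0.6, this comes to about 0.069, so differences of a few percentage points between runs on this split fall within sampling noise. Here is a small helper, assuming rewards are collected as a list of floats:

```python
import math
from typing import Sequence, Tuple


def accuracy_with_stderr(rewards: Sequence[float]) -> Tuple[float, float]:
    """Return (accuracy, standard error) for a sequence of 0.0/1.0 rewards."""
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one reward")
    p = sum(rewards) / n  # accuracy is the mean binary reward
    return p, math.sqrt(p * (1 - p) / n)  # binomial standard error
```

For example, `accuracy_with_stderr([1.0] * 30 + [0.0] * 20)` gives an accuracy of 0.6 with a standard error of roughly 0.069.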

Models perform best on simple quantitative/qualitative retrieval tasks but struggle with complex financial modeling and market analysis.

Other Environment Requirements

  • OpenAI API key: required for the LLM judge and for the in-sandbox retrieve_information tool

Pass it via secrets={"openai_api_key": "..."}.
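To avoid hard-coding the key, you can build the secrets mapping from an environment variable. The variable name `OPENAI_API_KEY` and the helper name are assumptions. The call that accepts `secrets=` is not shown because this README does not specify it.

```python
import os
from typing import Dict


def openai_secrets(env_var: str = "OPENAI_API_KEY") -> Dict[str, str]:
    """Build the secrets mapping from an environment variable, failing early if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"set {env_var} before launching FinanceAgent-Terminal")
    return {"openai_api_key": key}
```

Failing before launch is cheaper than discovering a missing key when the judge or retrieve_information first runs.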

Safety

Agents in FinanceAgent-Terminal access public SEC filings and web search results. The environment does not involve real financial transactions or private data.

Citations

@article{bigeard2025financeagent,
  author    = {Antoine Bigeard and Langston Nashold and Rayan Krishnan and Shirley Wu},
  title     = {Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks},
  journal   = {arXiv preprint arXiv:2508.00828},
  year      = {2025},
  url       = {https://arxiv.org/abs/2508.00828}
}
GeneralReasoning/financeagent-terminal | OpenReward