dsbc

Description

The paper introduces a benchmark for evaluating data science agents that reflects real-world user interactions observed in commercial applications. It compares three LLMs (Claude-4.0-Sonnet, Gemini-2.5-Flash, OpenAI-o4-Mini) across three approaches: zero-shot with context engineering, multi-step with context engineering, and SmolAgent. The benchmark spans eight data science task categories and measures sensitivity to prompting issues (data leakage and slight ambiguities) and to temperature settings. The paper also provides a dataset and evaluation framework to support future research on robust data science agents.

Implementations (1)
Environment: GeneralReasoning/DSBC
Stars: 0
Last Updated: 1 month ago
arXiv/dsbc | OpenReward