Nemotron-Agentic-v1
Nemotron-Agentic
Description
Nemotron-Agentic is an environment for evaluating agents on agentic tool-use decision-making. It is based on the Nemotron-Agentic-v1 dataset from NVIDIA, consisting of 335,122 multi-turn conversations with tool use. Each assistant turn in a conversation is extracted as a decision point: the agent sees the conversation history so far and the available tools, then must predict the correct next action -- either calling a specific function or responding with a message.
Capabilities
- Deciding when to call a tool vs. respond with a message
- Selecting the correct function from a set of available tools
- Generating correct function arguments as JSON
- Multi-turn conversation comprehension
- Reasoning about tool capabilities relative to user requests
Compute Requirements
Nemotron-Agentic does not require a sandbox. It has minimal compute requirements.
License
Tasks
There are two splits with a total of 1,197,894 tasks:
- tool_calling (1,127,100 tasks): General-purpose tool-calling scenarios with simulated multi-turn conversations. 50.7% function call / 49.3% message.
- interactive_agent (70,794 tasks): Synthetic multi-turn agentic trajectories for conversational tool use. 42.4% function call / 57.6% message.
Each task presents the conversation history up to a specific assistant turn and asks the agent to predict the correct next action.
Reward Structure
This is a sparse reward environment with continuous scoring. The agent makes a single submission per task:
- Function call tasks: Reward = 0.5 * (name match) + 0.5 * (argument match). Name match is binary (0 or 1). Argument match is the fraction of key-value pairs that match between expected and submitted arguments.
- Message tasks: Reward is computed via LLM grading (gpt-5-mini) if an OpenAI API key is provided, or via keyword overlap fallback otherwise. Scores range from 0.0 to 1.0.
- Wrong action type: Calling a function when a message was expected (or vice versa) yields reward 0.0.
Data
Decision points are extracted from the Nemotron-Agentic-v1 dataset by NVIDIA. The original dataset contains 335,122 multi-turn conversations with tool use in JSONL format. The download_data.py script processes each conversation to extract every assistant turn as a separate decision point, yielding ~1.2M tasks. Each pivot captures the conversation context (all messages before the assistant turn) and the expected action (function call or message).
Tools
This environment uses task-specific tools. Each task dynamically exposes the actual tools from the dataset (e.g., get_weather, search_flights, calculate_tip) via list_task_tools(). The agent interacts with these tools through native function calling.
In addition, there is one shared tool:
submit_message: Submit a text message response. Use when no function call is appropriate and the agent should respond directly to the user.
Time Horizon
Nemotron-Agentic is a single-turn environment. The agent receives a conversation context and submits one action. Each task requires exactly one tool call.
Other Environment Requirements
Nemotron-Agentic optionally accepts an OpenAI API key (openai_api_key secret) for LLM-based grading of message responses. Without it, a simple keyword-overlap fallback grader is used for message tasks. Function call tasks do not require an API key.
Safety
Agents in Nemotron-Agentic are asked to predict the next action in a synthetic conversation. The environment does not present direct safety risks, as agents only submit predictions with no access to external systems, real tools, or the internet.
Citations
@dataset{nvidia_nemotron_agentic_v1,
author = {NVIDIA Corporation},
title = {Nemotron-Agentic-Tool-Use-v1},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/nvidia/Nemotron-Agentic-v1},
license = {CC-BY-4.0}
}No implementations linked yet. Add one to showcase related work.