Nemotron IF-Eval
Description
Nemotron IF-Eval evaluates instruction-following in multi-turn conversations. Based on the Nemotron-RL-Instruction-Following-MultiTurnChat-v1 dataset from NVIDIA, each task presents a multi-turn conversation with a detailed system prompt containing specific constraints, followed by several user/assistant exchanges. The agent must generate the next assistant response adhering to all instructions. Responses are graded by an LLM judge against 1-6 rubric criteria per task.
The dataset is part of the MultiChallenge benchmark, which tests whether models can persistently follow instructions across complex multi-turn interactions.
Capabilities
- Following complex, multi-constraint instructions across conversation turns
- Maintaining behavioral consistency over extended multi-turn dialogues
- Adhering to formatting, tone, length, and content requirements
- Instruction retention under conversational pressure
Compute Requirements
Nemotron IF-Eval does not require a sandbox. It has minimal compute requirements.
Tasks
There are 2,011 tasks in a single train split. Each task presents a multi-turn conversation prompt containing system, user, and assistant messages (9-32 messages per task, mean 15.2). The agent must provide the next assistant response, which is graded against 1-6 evaluation rubric items per task (mean 2.6). Rubric items test specific instruction-following criteria such as tone, formatting constraints, content requirements, and behavioral consistency.
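As a sketch, a task record can be modeled as a conversation plus a list of rubric items. The names below (`Task`, `RubricItem`) are illustrative only and do not reflect the environment's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    # A yes/no question put to the LLM judge, plus the answer
    # that counts as a pass for this criterion.
    question: str
    expected: bool = True

@dataclass
class Task:
    # Conversation so far: a system prompt followed by alternating
    # user/assistant turns (9-32 messages per task in this dataset).
    messages: list[dict]
    rubric: list[RubricItem] = field(default_factory=list)

example = Task(
    messages=[
        {"role": "system", "content": "Always answer in exactly two sentences."},
        {"role": "user", "content": "What is instruction following?"},
    ],
    rubric=[RubricItem("Is the response exactly two sentences long?")],
)
```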
Reward Structure
This is a sparse-reward environment with continuous scoring. The agent calls the answer tool once with its response. Each rubric item is evaluated independently by an LLM grader (gpt-5-mini), which answers a yes/no question and compares the result to the item's expected pass criterion. The overall score is the fraction of rubric items passed:

score = (rubric items passed) / (total rubric items)

Scores range from 0.0 to 1.0.
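The fraction-of-rubric-items scoring described above can be sketched as follows. Here `judge` is a hypothetical stand-in for the gpt-5-mini yes/no grader, not the environment's real grading interface.

```python
from typing import Callable

def score_response(
    response: str,
    rubric: list[tuple[str, bool]],      # (yes/no question, expected answer)
    judge: Callable[[str, str], bool],   # hypothetical grader: (question, response) -> yes/no
) -> float:
    """Return the fraction of rubric items whose judged answer
    matches the expected pass criterion (0.0 to 1.0)."""
    if not rubric:
        return 0.0
    passed = sum(
        1 for question, expected in rubric
        if judge(question, response) == expected
    )
    return passed / len(rubric)

# Toy judge for illustration: checks a simple length criterion.
toy_judge = lambda question, response: len(response.split()) <= 20
print(score_response("Short answer.", [("Is it under 20 words?", True)], toy_judge))  # 1.0
```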
Data
Conversations are sourced from the Nemotron-RL-Instruction-Following-MultiTurnChat-v1 dataset by NVIDIA. Data files are stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| answer | Submit a response to continue the conversation. The response is graded by the LLM grader against the rubric criteria. Returns the overall score. Called once per task. |
Time Horizon
Nemotron IF-Eval is a single-turn environment. The agent receives a conversation context and submits one response. Each task requires exactly one tool call.
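The one-call interaction above can be sketched as a minimal driver loop. Both `generate_response` and `answer` are placeholders here, standing in for the agent's model call and the environment's grading tool respectively.

```python
def run_task(messages, generate_response, answer):
    # The agent sees the conversation context, produces exactly one
    # assistant turn, and submits it via the answer tool, which
    # returns the overall rubric score for the task.
    response = generate_response(messages)
    return answer(response)

score = run_task(
    [{"role": "user", "content": "Hi"}],
    generate_response=lambda msgs: "Hello!",
    answer=lambda resp: 1.0,  # stub grader: always returns full score
)
```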
Other Environment Requirements
Nemotron IF-Eval requires an OpenAI API key (openai_api_key secret) for LLM-based grading of responses.
Safety
Agents in Nemotron IF-Eval are asked to respond to multi-turn conversations. The environment does not present direct safety risks, as agents only provide text responses with no access to external systems, tools, or the internet.
Citations
@article{sirdeshmukh2025multichallenge,
title={MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs},
author={Sirdeshmukh, Ved and Deshpande, Kaustubh and Mols, Johannes and Jin, Lifeng and Cardona, Ed-Yeremai and Lee, Dean and Kritz, Jeremy and Primack, Willow and Yue, Summer and Xing, Chen},
journal={arXiv preprint arXiv:2501.17399},
year={2025}
}