Nemotron IF-Eval
Description
Nemotron IF-Eval evaluates instruction-following in multi-turn conversations. Based on the Nemotron-RL-Instruction-Following-MultiTurnChat-v1 dataset from NVIDIA, each task presents a multi-turn conversation with a detailed system prompt containing specific constraints, followed by several user/assistant exchanges. The agent must generate the next assistant response adhering to all instructions. Responses are graded by an LLM judge against 1-6 rubric criteria per task.
The dataset is part of the MultiChallenge benchmark, which tests whether models can persistently follow instructions across complex multi-turn interactions.
Capabilities
- Following complex, multi-constraint instructions across conversation turns
- Maintaining behavioral consistency over extended multi-turn dialogues
- Adhering to formatting, tone, length, and content requirements
- Instruction retention under conversational pressure
Compute Requirements
Nemotron IF-Eval does not require a sandbox. It has minimal compute requirements.
Tasks
There are 2,011 tasks in a single train split. Each task presents a multi-turn conversation prompt containing system, user, and assistant messages (9-32 messages per task, mean 15.2). The agent must provide the next assistant response, which is graded against 1-6 evaluation rubric items per task (mean 2.6). Rubric items test specific instruction-following criteria such as tone, formatting constraints, content requirements, and behavioral consistency.
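As a sketch, a task record can be modeled as a conversation plus a list of rubric items. The names below (`Task`, `RubricItem`) are illustrative only and do not reflect the environment's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    # A yes/no question put to the LLM judge, plus the answer
    # that counts as a pass for this criterion.
    question: str
    expected: bool = True

@dataclass
class Task:
    # Conversation so far: a system prompt followed by alternating
    # user/assistant turns (9-32 messages per task in this dataset).
    messages: list[dict]
    rubric: list[RubricItem] = field(default_factory=list)

example = Task(
    messages=[
        {"role": "system", "content": "Always answer in exactly two sentences."},
        {"role": "user", "content": "What is instruction following?"},
    ],
    rubric=[RubricItem("Is the response exactly two sentences long?")],
)
```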
Reward Structure
This is a sparse-reward environment with continuous scoring. The agent calls the answer tool once with its response. Each rubric item is evaluated independently by an LLM grader (gpt-5-mini), which answers a yes/no question and compares the result to the item's expected pass criterion. The overall score is the fraction of rubric items passed:

score = (rubric items passed) / (total rubric items)

Scores range from 0.0 to 1.0.
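The fraction-of-rubric-items scoring described above can be sketched as follows. Here `judge` is a hypothetical stand-in for the gpt-5-mini yes/no grader, not the environment's real grading interface.

```python
from typing import Callable

def score_response(
    response: str,
    rubric: list[tuple[str, bool]],      # (yes/no question, expected answer)
    judge: Callable[[str, str], bool],   # hypothetical grader: (question, response) -> yes/no
) -> float:
    """Return the fraction of rubric items whose judged answer
    matches the expected pass criterion (0.0 to 1.0)."""
    if not rubric:
        return 0.0
    passed = sum(
        1 for question, expected in rubric
        if judge(question, response) == expected
    )
    return passed / len(rubric)

# Toy judge for illustration: checks a simple length criterion.
toy_judge = lambda question, response: len(response.split()) <= 20
print(score_response("Short answer.", [("Is it under 20 words?", True)], toy_judge))  # 1.0
```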
Data
Conversations are sourced from the Nemotron-RL-Instruction-Following-MultiTurnChat-v1 dataset by NVIDIA. Data files are stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| answer | Submit a response to continue the conversation. The response is graded by the LLM grader against the rubric criteria. Returns the overall score. Called once per task. |
Time Horizon
Nemotron IF-Eval is a single-turn environment. The agent receives a conversation context and submits one response. Each task requires exactly one tool call.
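The one-call interaction above can be sketched as a minimal driver loop. Both `generate_response` and `answer` are placeholders here, standing in for the agent's model call and the environment's grading tool respectively.

```python
def run_task(messages, generate_response, answer):
    # The agent sees the conversation context, produces exactly one
    # assistant turn, and submits it via the answer tool, which
    # returns the overall rubric score for the task.
    response = generate_response(messages)
    return answer(response)

score = run_task(
    [{"role": "user", "content": "Hi"}],
    generate_response=lambda msgs: "Hello!",
    answer=lambda resp: 1.0,  # stub grader: always returns full score
)
```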
Other Environment Requirements
Nemotron IF-Eval requires an OpenAI API key (openai_api_key secret) for LLM-based grading of responses.
Safety
Agents in Nemotron IF-Eval are asked to respond to multi-turn conversations. The environment does not present direct safety risks, as agents only provide text responses with no access to external systems, tools, or the internet.
Citations
@article{sirdeshmukh2025multichallenge,
title={MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs},
author={Sirdeshmukh, Ved and Deshpande, Kaustubh and Mols, Johannes and Jin, Lifeng and Cardona, Ed-Yeremai and Lee, Dean and Kritz, Jeremy and Primack, Willow and Yue, Summer and Xing, Chen},
journal={arXiv preprint arXiv:2501.17399},
year={2025}
}