Nemotron-RLHF-GenRM-v1

Name: NVIDIA/Nemotron-RLHF-GenRM-v1
Author: NVIDIA

Description

Nemotron-RLHF-GenRM-v1 is an environment for training Generative Reward Models (GenRMs) that perform pairwise comparison of LLM responses. Given a conversation context and two assistant responses, the agent must evaluate both responses and produce individual helpfulness scores and a comparative ranking.

This environment implements the GenRM training task from NVIDIA's Nemotron 3 Super training recipe. The dataset is sourced from nvidia/Nemotron-RLHF-GenRM-v1, which is based on allenai/WildChat-1M.

Capabilities

Pairwise comparison of LLM responses across diverse domains
Helpfulness scoring on a 1-5 scale
Comparative ranking on a 1-6 scale
Safety and refusal evaluation
Single-turn evaluation (one submission per task)

License

CC-BY-4.0 (same as the underlying dataset).

Tasks

There is one split:

train: 299,517 pairwise comparison tasks

Each task presents a conversation context with two candidate assistant responses. The agent must reason through the strengths and weaknesses of both responses, then produce:

score_1: helpfulness score for Response 1 (1-5)
score_2: helpfulness score for Response 2 (1-5)
ranking: comparative preference (1-6, where 1 = Response 1 far superior, 6 = Response 2 far superior)

Some tasks have only a ground-truth ranking (no individual helpfulness scores); these are typically clear-cut safety refusal scenarios.
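As an illustration of the output format described above, the following sketch parses a submission and checks each field against the stated ranges. The function name `parse_answer` is hypothetical, and treating the scores as integers is an assumption: the card gives only the ranges, not the value type.

```python
import json

def parse_answer(raw: str) -> dict:
    """Parse a submission and check its three fields (hypothetical helper)."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("answer must be a JSON object")
    # Ranges taken from the task description: scores 1-5, ranking 1-6.
    limits = {"score_1": (1, 5), "score_2": (1, 5), "ranking": (1, 6)}
    for key, (lo, hi) in limits.items():
        value = data.get(key)
        # Assumption: values are integers. bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or not lo <= value <= hi:
            raise ValueError(f"{key} must be an integer in [{lo}, {hi}], got {value!r}")
    return {key: data[key] for key in limits}

example = parse_answer('{"score_1": 4, "score_2": 2, "ranking": 2}')
```

A submission with an out-of-range field, such as a ranking of 7, raises `ValueError` in this sketch.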

Reward Structure

Rewards are computed using the formula from the Nemotron 3 Nano paper:

R = -C1 * I_format - |P_h1 - G_h1| - |P_h2 - G_h2| - C2 * |P_r - G_r|

Where:

C1 = 10: format violation penalty (binary: output must be valid JSON)
C2 = 1: ranking deviation weight
I_format: 1 if output doesn't parse to valid JSON, 0 otherwise
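P / G: predicted / ground-truth values; the subscripts h1 and h2 denote the helpfulness scores of Response 1 and Response 2, and r denotes the ranking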

The reward is then normalized to [0, 1]. When ground-truth helpfulness scores are absent, only the format and ranking terms apply.

No LLM graders are used; all rewards are rule-based.
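The unnormalized formula can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual grader: the function name `raw_reward` is hypothetical, a valid JSON output with missing or non-numeric fields is assumed to count as a format violation, and a format violation is assumed to yield only the `-C1` term. The normalization to [0, 1] is left out because the card does not state the mapping.

```python
import json
from typing import Optional

C1 = 10  # format violation penalty (from the card)
C2 = 1   # ranking deviation weight (from the card)

def raw_reward(output: str, g_rank: int,
               g_h1: Optional[int] = None, g_h2: Optional[int] = None) -> float:
    """Unnormalized reward R for one submission (hypothetical helper)."""
    try:
        pred = json.loads(output)
        p_rank = float(pred["ranking"])
        p_h1 = float(pred["score_1"])
        p_h2 = float(pred["score_2"])
    except (ValueError, KeyError, TypeError):
        # Assumption: any unparseable or incomplete output is a format violation,
        # and the remaining terms are skipped because there is nothing to compare.
        return -float(C1)
    reward = -C2 * abs(p_rank - g_rank)
    # Helpfulness terms apply only when ground-truth helpfulness scores exist.
    if g_h1 is not None and g_h2 is not None:
        reward -= abs(p_h1 - g_h1) + abs(p_h2 - g_h2)
    return reward

full = raw_reward('{"score_1": 4, "score_2": 2, "ranking": 2}', g_rank=1, g_h1=5, g_h2=2)
rank_only = raw_reward('{"score_1": 4, "score_2": 2, "ranking": 2}', g_rank=2)
broken = raw_reward("not json", g_rank=3)
```

In this sketch `full` is -2.0 (one point off on the ranking and one on the first helpfulness score), `rank_only` is 0, and `broken` is -10.0.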

Tools

Tool	Description
`answer`	Submit evaluation as JSON with `score_1`, `score_2`, and `ranking`

Other Environment Requirements

No external API keys are required. This environment uses purely rule-based grading.

Data

Data is sourced from nvidia/Nemotron-RLHF-GenRM-v1 on Hugging Face. Run `python download_data.py` to download the dataset and convert it to local parquet format.

Citations

@article{nvidia2025nemotron3nano,
  title={Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning},
  author={NVIDIA},
  journal={arXiv preprint arXiv:2512.20848},
  year={2025}
}

Repository

Source repository

EnvCommons/Nemotron-RLHF-GenRM-v1

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

Session length	Cost
5-minute session	$0.0096
1-hour session	$0.1152
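The example figures follow directly from the per-second rate. A small sketch of the arithmetic, using a hypothetical helper `session_cost` and `Decimal` to avoid floating-point rounding:

```python
from decimal import Decimal

# Environment rate per second of active session, taken from the cost table.
RATE_PER_SECOND = Decimal("0.0000320")

def session_cost(seconds: int) -> Decimal:
    """Cost in dollars for a session of the given length (hypothetical helper)."""
    return RATE_PER_SECOND * seconds

five_minutes = session_cost(5 * 60)   # 300 seconds
one_hour = session_cost(60 * 60)      # 3600 seconds
```

Here `five_minutes` equals $0.0096 and `one_hour` equals $0.1152, matching the examples listed.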