Nemotron-RLHF-GenRM-v1
Description
Nemotron-RLHF-GenRM-v1 is an environment for training Generative Reward Models (GenRMs) that perform pairwise comparison of LLM responses. Given a conversation context and two assistant responses, the agent must evaluate both responses and produce individual helpfulness scores and a comparative ranking.
This environment implements the GenRM training task from NVIDIA's Nemotron 3 Super training recipe. The dataset is sourced from nvidia/Nemotron-RLHF-GenRM-v1, which is based on allenai/WildChat-1M.
Capabilities
- Pairwise comparison of LLM responses across diverse domains
- Helpfulness scoring on a 1-5 scale
- Comparative ranking on a 1-6 scale
- Safety and refusal evaluation
- Single-turn evaluation (one submission per task)
License
CC-BY-4.0 (same as the underlying dataset).
Tasks
There is one split:
- train: 299,517 pairwise comparison tasks
Each task presents a conversation context with two candidate assistant responses. The agent must reason through the strengths and weaknesses of both responses, then produce:
- score_1: helpfulness score for Response 1 (1-5)
- score_2: helpfulness score for Response 2 (1-5)
- ranking: comparative preference (1-6, where 1 = Response 1 far superior, 6 = Response 2 far superior)
Some tasks only have ground-truth ranking (no individual helpfulness scores), typically for clear-cut safety refusal scenarios.
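As an illustration of the output described above, the following sketch builds one submission and checks each field against its stated range. The `validate_submission` helper is hypothetical and not part of the environment; only the field names and ranges come from the task description, and treating the values as integers is an assumption.

```python
import json

# Allowed ranges, taken from the task description above.
RANGES = {"score_1": (1, 5), "score_2": (1, 5), "ranking": (1, 6)}


def validate_submission(text: str) -> dict:
    """Parse a JSON submission and check each field against its range.

    Hypothetical helper for illustration; the environment's own checks may differ.
    """
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("submission must be a JSON object")
    for field, (low, high) in RANGES.items():
        value = data.get(field)
        # Assumption: scores are integers (bool is excluded explicitly).
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{field} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{field} must be in [{low}, {high}], got {value}")
    return data


# Response 1 judged moderately better than Response 2.
submission = json.dumps({"score_1": 4, "score_2": 2, "ranking": 2})
checked = validate_submission(submission)
```

Running a check like this before submitting may be worthwhile because each task allows only one submission.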
Reward Structure
Rewards are computed using the formula from the Nemotron 3 Nano paper:
R = -C1 * I_format - |P_h1 - G_h1| - |P_h2 - G_h2| - C2 * |P_r - G_r|
Where:
- C1 = 10: format violation penalty (binary: output must be valid JSON)
- C2 = 1: ranking deviation weight
- I_format: 1 if output doesn't parse to valid JSON, 0 otherwise
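- P / G: predicted / ground-truth values, with subscripts h1 and h2 for the helpfulness scores of Response 1 and Response 2, and r for the ranking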
The reward is normalized to [0, 1]. When ground-truth helpfulness scores are absent, only the format and ranking terms apply.
No LLM graders are used; all rewards are rule-based.
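The formula above can be sketched in Python as follows. This is a minimal illustration, not the environment's grader: the function names are hypothetical, and three behaviors are assumptions because the text does not specify them: (1) JSON that is not an object or lacks a numeric field is treated as a format violation, (2) on a format violation only the format term applies, and (3) normalization is min-max scaling against the worst case of every term.

```python
import json
from typing import Optional

C1 = 10.0  # format violation penalty, from the text above
C2 = 1.0   # ranking deviation weight, from the text above
FIELDS = ("score_1", "score_2", "ranking")


def parse_prediction(output: str) -> Optional[dict]:
    """Return the prediction dict, or None on a format violation.

    Assumption: a JSON value that is not an object, or that lacks a numeric
    field, also counts as a format violation.
    """
    try:
        data = json.loads(output)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    if not all(isinstance(data.get(f), (int, float)) for f in FIELDS):
        return None
    return data


def raw_reward(output: str, truth: dict) -> float:
    """Unnormalized R: 0 is a perfect prediction, lower is worse."""
    pred = parse_prediction(output)
    if pred is None:
        # Assumption: with no parsed scores, only the format term applies.
        return -C1
    penalty = C2 * abs(pred["ranking"] - truth["ranking"])
    # Helpfulness terms apply only when ground-truth scores exist.
    if "score_1" in truth and "score_2" in truth:
        penalty += abs(pred["score_1"] - truth["score_1"])
        penalty += abs(pred["score_2"] - truth["score_2"])
    return -penalty


def normalize(raw: float, has_scores: bool = True) -> float:
    """Map a raw reward to [0, 1] by min-max scaling (assumed scheme)."""
    # Worst case: format penalty, ranking off by 5, each score off by 4.
    worst = C1 + C2 * 5 + (8.0 if has_scores else 0.0)
    return max(0.0, min(1.0, 1.0 + raw / worst))


truth = {"score_1": 5, "score_2": 2, "ranking": 3}
reward = normalize(raw_reward('{"score_1": 4, "score_2": 2, "ranking": 2}', truth))
```

Here the prediction misses score_1 by one point and the ranking by one step, so the raw reward is -2 before scaling. For tasks with only a ground-truth ranking, passing a `truth` dict without the score fields drops the two helpfulness terms.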
Tools
| Tool | Description |
|---|---|
| answer | Submit evaluation as JSON with score_1, score_2, and ranking |
Other Environment Requirements
No external API keys are required. This environment uses purely rule-based grading.
Data
Data sourced from nvidia/Nemotron-RLHF-GenRM-v1 on Hugging Face. Run python download_data.py to download and convert to local parquet format.
Citations
@article{nvidia2025nemotron3nano,
title={Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning},
author={NVIDIA},
journal={arXiv preprint arXiv:2512.20848},
year={2025}
}