Nemotron-RLHF-GenRM-v1

Description

Nemotron-RLHF-GenRM-v1 is an environment for training Generative Reward Models (GenRMs) that perform pairwise comparison of LLM responses. Given a conversation context and two assistant responses, the agent must evaluate both responses and produce individual helpfulness scores and a comparative ranking.

This environment implements the GenRM training task from NVIDIA's Nemotron 3 Super training recipe. The dataset is sourced from nvidia/Nemotron-RLHF-GenRM-v1, which is based on allenai/WildChat-1M.

Capabilities

  • Pairwise comparison of LLM responses across diverse domains
  • Helpfulness scoring on a 1-5 scale
  • Comparative ranking on a 1-6 scale
  • Safety and refusal evaluation
  • Single-turn evaluation (one submission per task)

License

CC-BY-4.0 (same as the underlying dataset).

Tasks

There is one split:

  • train: 299,517 pairwise comparison tasks

Each task presents a conversation context with two candidate assistant responses. The agent must reason through the strengths and weaknesses of both responses, then produce:

  • score_1: helpfulness score for Response 1 (1-5)
  • score_2: helpfulness score for Response 2 (1-5)
  • ranking: comparative preference (1-6, where 1 = Response 1 far superior, 6 = Response 2 far superior)

Some tasks have only a ground-truth ranking (no individual helpfulness scores); these are typically clear-cut safety refusal scenarios.
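
The expected output can be checked before submission with a small validator. This is a minimal sketch, not the environment's own code: the function names parse_submission and preferred_response are hypothetical, and it assumes that all three fields are integers and that rankings 1-3 favor Response 1 while 4-6 favor Response 2 (the text above only defines the endpoints 1 and 6).

```python
import json

# Allowed ranges taken from the task description above.
LIMITS = {"score_1": (1, 5), "score_2": (1, 5), "ranking": (1, 6)}


def parse_submission(raw: str) -> dict:
    """Parse a JSON submission and range-check it; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, (low, high) in LIMITS.items():
        value = data.get(key)
        # Assumption: fields must be integers (bool is excluded explicitly,
        # since Python treats it as a subclass of int).
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{key} must be in [{low}, {high}]")
    return {key: data[key] for key in LIMITS}


def preferred_response(ranking: int) -> int:
    """Assumption: the 1-6 scale splits at the midpoint (1-3 vs. 4-6)."""
    return 1 if ranking <= 3 else 2


submission = parse_submission('{"score_1": 4, "score_2": 2, "ranking": 2}')
print(submission, "-> prefers Response", preferred_response(submission["ranking"]))
```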

Reward Structure

Rewards are computed using the formula from the Nemotron 3 Nano paper:

R = -C1 * I_format - |P_h1 - G_h1| - |P_h2 - G_h2| - C2 * |P_r - G_r|

Where:

  • C1 = 10: format violation penalty (binary: output must be valid JSON)
  • C2 = 1: ranking deviation weight
  • I_format: 1 if output doesn't parse to valid JSON, 0 otherwise
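  • P / G: predicted / ground-truth values; the subscripts h1 and h2 denote the helpfulness scores of Response 1 and Response 2, and r denotes the ranking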

The raw reward R is then normalized to [0, 1]. When ground-truth helpfulness scores are absent, only the format and ranking terms apply.

No LLM graders are used; all rewards are rule-based.
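
As an illustration, the raw (unnormalized) reward can be computed as follows. This is a sketch under stated assumptions rather than the environment's actual grader: raw_reward is a hypothetical name, a format violation is assumed to yield exactly -C1 with no other terms, and missing or non-numeric fields are assumed to count as format violations. The normalization to [0, 1] is not specified here, so the sketch stops at the raw value.

```python
import json
from typing import Optional

C1 = 10  # format violation penalty
C2 = 1   # ranking deviation weight


def raw_reward(output: str, g_r: int,
               g_h1: Optional[int] = None, g_h2: Optional[int] = None) -> float:
    """Raw reward R for one submission, before normalization."""
    try:
        pred = json.loads(output)
        p_h1, p_h2, p_r = pred["score_1"], pred["score_2"], pred["ranking"]
        # Ranking term always applies.
        reward = -C2 * abs(p_r - g_r)
        # Helpfulness terms apply only when ground-truth scores exist.
        if g_h1 is not None and g_h2 is not None:
            reward -= abs(p_h1 - g_h1) + abs(p_h2 - g_h2)
    except (ValueError, KeyError, TypeError):
        # Assumption: unparseable or incomplete output gets the format penalty only.
        return float(-C1)
    return float(reward)


# Prediction (4, 2, ranking 2) against ground truth (5, 2, ranking 1):
# R = -|4 - 5| - |2 - 2| - 1 * |2 - 1| = -2
print(raw_reward('{"score_1": 4, "score_2": 2, "ranking": 2}', g_r=1, g_h1=5, g_h2=2))
```

A perfect prediction gives a raw reward of 0, the maximum; every other outcome is negative.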

Tools

Tool      Description
answer    Submit evaluation as JSON with score_1, score_2, and ranking

Other Environment Requirements

No external API keys are required. This environment uses purely rule-based grading.

Data

Data is sourced from nvidia/Nemotron-RLHF-GenRM-v1 on Hugging Face. Run python download_data.py to download it and convert it to local Parquet format.

Citations

@article{nvidia2025nemotron3nano,
  title={Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning},
  author={NVIDIA},
  journal={arXiv preprint arXiv:2512.20848},
  year={2025}
}