Nemotron-Cascade-2-RL-data

Description

Nemotron-Cascade-2-RL-data is a curated reinforcement learning dataset blend developed by NVIDIA for training the Nemotron-Cascade-2-30B-A3B model. This environment implements three variants covering instruction following, multi-domain tasks, and on-policy distillation. Grading is LLM-based (gpt-5-mini), except for structured-output tasks, which are validated programmatically against a JSON schema.

The SWE-RL subset (3,612 software engineering tasks) from the original dataset is excluded from this environment, as those tasks are already available on OpenReward via the dedicated SWE-Gym and R2E-Gym environments.

Capabilities

  • Satisfying complex instruction constraints (sentence counts, keyword placement, formatting rules)
  • Answering multiple-choice knowledge questions across STEM domains
  • Executing workplace function calls (email, calendar, analytics, project management)
  • Generating structured JSON outputs conforming to schemas

Compute Requirements

No sandbox is required. An OpenAI API key is required for LLM grading.

License

Open Data Commons Attribution License (ODC-By) v1.0.

Tasks

This environment uses 3 variants (one per dataset subset), each with a train split:

| Variant | Tasks | Description |
|---|---|---|
| nemotronifrl | 45,879 | Instruction following with verifiable formatting constraints |
| nemotronmultidomainrl | 17,592 | MCQA, workplace function calling, structured outputs |
| nemotronmopd | 6,111 | Multi-domain on-policy distillation (mixed: IF, MCQA, function calling, schema) |

Total: 69,582 tasks.

615 rows from the original dataset (555 from multi-domain-RL, 60 from MOPD) are excluded because they lack a verifiable grading signal (no expected answer, ground truth, constraints, or schema).
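The exclusion rule above can be sketched as a simple row filter. This is a minimal illustration, not the environment's actual preprocessing: the field names (`expected_answer`, `ground_truth`, `constraints`, `schema`) are hypothetical stand-ins for whatever columns carry the grading signal in the real dataset.

```python
# Hypothetical field names; the real dataset columns may differ.
SIGNAL_FIELDS = ("expected_answer", "ground_truth", "constraints", "schema")
EMPTY_VALUES = (None, "", [], {})


def has_grading_signal(row: dict) -> bool:
    """Return True if at least one grading field is present and non-empty."""
    return any(row.get(field) not in EMPTY_VALUES for field in SIGNAL_FIELDS)


def split_rows(rows):
    """Partition rows into (kept, dropped) by presence of a grading signal."""
    kept = [r for r in rows if has_grading_signal(r)]
    dropped = [r for r in rows if not has_grading_signal(r)]
    return kept, dropped


# Toy rows for illustration only.
rows = [
    {"prompt": "Pick A-D", "expected_answer": "B"},
    {"prompt": "Write a poem", "constraints": []},  # empty constraints: no signal
    {"prompt": "Emit JSON", "schema": {"type": "object"}},
    {"prompt": "Free-form chat"},                   # no grading fields at all
]
kept, dropped = split_rows(rows)
print(len(kept), len(dropped))
```

Rows with an empty list or empty string in a grading field are treated the same as rows missing the field, since neither gives a grader anything to check against.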

Reward Structure

All variants use binary reward (1.0 correct, 0.0 incorrect):

  • IF-RL: LLM (gpt-5-mini) checks that all instruction constraints are satisfied
  • Multi-domain-RL: MCQA uses LLM answer matching; function calling uses LLM comparison to ground truth; structured output uses programmatic JSON schema validation
  • MOPD: Mixed grading depending on task type (same strategies as above)
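The dispatch described above might look roughly like the following sketch. It is an assumption-laden illustration, not the environment's code: the task layout (`type`, `schema`), the `llm_judge` callable, and the `conforms` helper are all hypothetical, and `conforms` checks only a small subset of JSON Schema (`type`, `required`, `properties`) rather than performing full validation.

```python
import json
from typing import Callable

# Maps JSON Schema type names to Python types (small subset only).
_TYPES = {"object": dict, "array": list, "string": str,
          "number": (int, float), "integer": int, "boolean": bool}


def conforms(value, schema: dict) -> bool:
    """Check a value against `type`, `required`, and `properties` only."""
    expected = schema.get("type")
    if expected in _TYPES:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and expected != "boolean":
            return False
        if not isinstance(value, _TYPES[expected]):
            return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, sub in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub):
                return False
    return True


def grade(task: dict, response: str,
          llm_judge: Callable[[dict, str], bool]) -> float:
    """Return a binary reward: 1.0 if the response passes, else 0.0."""
    if task["type"] == "structured_output":
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return 0.0
        return 1.0 if conforms(parsed, task["schema"]) else 0.0
    # All other task types defer to the (hypothetical) LLM judge.
    return 1.0 if llm_judge(task, response) else 0.0


schema = {"type": "object", "required": ["name", "age"],
          "properties": {"name": {"type": "string"},
                         "age": {"type": "integer"}}}
task = {"type": "structured_output", "schema": schema}


def stub_judge(task: dict, response: str) -> bool:
    """Stand-in for an LLM call: exact match on a hypothetical field."""
    return response.strip() == task.get("expected_answer")


print(grade(task, '{"name": "Ada", "age": 36}', stub_judge))
print(grade(task, '{"name": "Ada"}', stub_judge))
print(grade({"type": "mcqa", "expected_answer": "B"}, " B ", stub_judge))
```

Passing the judge in as a callable keeps the programmatic path testable without an API key; a real grader would replace `stub_judge` with a model call.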

Data

Data is sourced from nvidia/Nemotron-Cascade-2-RL-data on Hugging Face. This environment uses 3 of the 4 subsets:

  • IF-RL: Instruction-following tasks derived from nvidia/Nemotron-RL-instruction_following
  • multi-domain-RL: Knowledge MCQA, workplace assistant, structured outputs
  • MOPD: Blend from AceReason-Math, instruction following, STEM MCQA, and workplace tasks

Data is stored as parquet files on the OpenReward platform.

Tools

| Tool | Description |
|---|---|
| answer | Submit your response. Grading depends on task type (instruction following, MCQA, function calling, schema validation). |

Time Horizon

Single-turn. One tool call (answer).

Environment Difficulty

The dataset spans a wide range of difficulty:

  • IF-RL tasks range from simple formatting (word count) to complex multi-constraint satisfaction
  • MCQA covers STEM knowledge questions across multiple domains
  • Function calling tasks require correct tool selection and parameter formatting
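To make the difference between a single formatting constraint and multi-constraint satisfaction concrete, here is a small sketch of what such constraints can look like when written as code. This is purely illustrative: the environment grades IF-RL tasks with an LLM rather than with checks like these, and the helper names and the naive sentence splitter are assumptions made for the example.

```python
import re


def count_sentences(text: str) -> int:
    """Naive sentence count: split on runs of '.', '!' or '?'."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def word_count_at_most(limit: int):
    """Build a check that the text has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def exactly_n_sentences(n: int):
    """Build a check that the text has exactly `n` sentences."""
    return lambda text: count_sentences(text) == n


def starts_with_keyword(keyword: str):
    """Build a case-insensitive check on the opening keyword."""
    return lambda text: text.lstrip().lower().startswith(keyword.lower())


def all_satisfied(text: str, checks) -> bool:
    """A response passes only if every constraint holds."""
    return all(check(text) for check in checks)


text = "Summary: the model is small. It trains quickly."

simple = [word_count_at_most(10)]
multi = [word_count_at_most(10), exactly_n_sentences(2),
         starts_with_keyword("summary")]
strict = multi + [word_count_at_most(5)]

print(all_satisfied(text, simple), all_satisfied(text, multi),
      all_satisfied(text, strict))
```

Because the reward is binary, a response that meets all but one constraint in a multi-constraint task scores the same as one that meets none.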

Other Environment Requirements

  • OpenAI API key: Required for LLM grading. Pass via secrets={"openai_api_key": "..."}.
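One way to build the `secrets` mapping without hard-coding the key is to read it from an environment variable, as in the sketch below. The `build_secrets` helper is hypothetical, and `OPENAI_API_KEY` is the variable name conventionally used by OpenAI tooling; how the resulting dict is handed to the platform client is not shown here because it depends on the client being used.

```python
import os


def build_secrets(env=os.environ) -> dict:
    """Build the secrets mapping from an environment-like mapping.

    Raises RuntimeError if the key is missing or empty, so a misconfigured
    run fails early rather than at grading time.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {"openai_api_key": key}


# Example with an explicit mapping; in practice, call build_secrets()
# with no arguments to read from the real environment.
secrets = build_secrets({"OPENAI_API_KEY": "sk-example"})
print(sorted(secrets))
```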

Safety

This environment evaluates instruction following, knowledge recall, and function calling. It does not present direct safety risks.

Citations

@article{Nemotron_Cascade_2,
  title={Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation},
  author={Yang, Zhuolin and Liu, Zihan and Chen, Yang and Dai, Wenliang and Wang, Boxin and Lin, Sheng-Chieh and Lee, Chankyu and Chen, Yangyi and Jiang, Dongfu and He, Jiafan and Pi, Renjie and Lam, Grace and Lee, Nayeon and Bukharin, Alexander and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  year={2026},
  journal={arXiv preprint arXiv:2603.19220}
}