IFEval

Description

IFEval (Instruction-Following Evaluation) is an environment for evaluating an agent's ability to follow verifiable instructions. Based on Google's IFEval benchmark, it contains prompts with specific constraints (e.g., "write in more than 400 words", "mention the keyword AI at least 3 times") that can be programmatically verified without LLM grading.
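
For instance, a constraint like "mention the keyword AI at least 3 times" reduces to a few lines of string matching. The check below is an illustrative sketch, not the benchmark's own verifier code:

```python
import re

def mentions_keyword_at_least(response: str, keyword: str, n: int) -> bool:
    """Illustrative verifier: count whole-word, case-insensitive occurrences."""
    hits = re.findall(rf"\b{re.escape(keyword)}\b", response, re.IGNORECASE)
    return len(hits) >= n

assert mentions_keyword_at_least("AI here, AI there, and AI everywhere.", "AI", 3)
```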

Capabilities

  • Following precise formatting and content instructions
  • Generating text under multiple simultaneous constraints
  • Handling 25 types of verifiable instructions

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

Apache 2.0.

Tasks

There is one split in this environment:

  • test: 541 tasks

Each task presents a prompt containing one or more verifiable instructions. The agent must generate a response that satisfies all constraints.

Reward Structure

This is a single-turn environment. The agent submits a response via the submit tool. Each prompt contains one or more verifiable instructions (e.g., word-count constraints, keyword inclusion, formatting requirements). The environment evaluates each instruction independently using "loose" mode, which tests 8 variants of the response (e.g., with markdown asterisks stripped, or with the first and/or last line removed) before declaring an instruction unfollowed. If all instructions in the prompt are satisfied, the reward is 1.0; if any single instruction is not followed, the reward is 0.0. No LLM grading is used; all evaluation is deterministic and programmatic.
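
The sketch below mirrors that scoring logic: build the 8 loose variants, check every instruction against each variant, and return an all-or-nothing reward. The `verifiers` registry is a hypothetical stand-in for the benchmark's per-instruction checkers, not the environment's actual API.

```python
def loose_variants(response: str) -> list[str]:
    """The 8 variants tested under loose mode: the response with and without
    its first line, its last line, and markdown asterisks."""
    lines = response.split("\n")
    bases = [
        response,
        "\n".join(lines[1:]).strip(),    # drop first line
        "\n".join(lines[:-1]).strip(),   # drop last line
        "\n".join(lines[1:-1]).strip(),  # drop both
    ]
    return bases + [b.replace("*", "") for b in bases]

def score_prompt(response: str, instruction_ids: list[str],
                 kwargs_list: list[dict], verifiers: dict) -> float:
    """Reward is 1.0 only if every instruction passes on some loose variant."""
    for inst_id, kw in zip(instruction_ids, kwargs_list):
        check = verifiers[inst_id]  # hypothetical registry lookup
        if not any(check(v, **kw) for v in loose_variants(response) if v):
            return 0.0  # one unfollowed instruction zeroes the whole reward
    return 1.0
```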

Data

Data consists of a JSONL file (ifeval_input_data.jsonl) sourced from the Hugging Face dataset google/IFEval. Each row contains a prompt, a list of instruction IDs, and keyword arguments for each instruction's verifier. Data is stored on the OpenReward platform.
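
Assuming the schema of the upstream google/IFEval dataset (fields prompt, instruction_id_list, and kwargs), a row can be read like this; the snippet is a sketch, not platform code:

```python
import json

# Read the first task from the file named above.
with open("ifeval_input_data.jsonl") as f:
    row = json.loads(f.readline())

print(row["prompt"])               # the task prompt shown to the agent
print(row["instruction_id_list"])  # e.g. ["length_constraints:number_words"]
print(row["kwargs"])               # per-verifier arguments, e.g. [{"relation": "at least", "num_words": 400}]
```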

Tools

  • submit: Submit your response for evaluation against instruction constraints. Ends the episode.
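
As an illustration, a submit call needs to carry only the final response text. The payload shape below is hypothetical; consult the OpenReward documentation for the actual tool-call format.

```python
# Hypothetical tool-call payload (the real wire format may differ).
tool_call = {
    "name": "submit",
    "arguments": {"response": "Final answer written to satisfy all constraints."},
}
```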

Time Horizon

Single-turn. The agent receives a prompt with constraints and submits one response.

Environment Difficulty

IFEval is one of the core benchmarks on the Open LLM Leaderboard. In the original paper, GPT-4 achieved 76.89% prompt-level strict accuracy and 83.57% instruction-level strict accuracy. Frontier models generally perform well, but the combination of multiple simultaneous constraints in a single prompt can be challenging.

Other Environment Requirements

There are no further environment requirements.

Safety

This environment evaluates instruction-following ability and does not present direct safety risks.

Citation

@article{zhou2023instruction,
  title={Instruction-Following Evaluation for Large Language Models},
  author={Zhou, Jeffrey and Lu, Tianjian and Mishra, Swaroop and Brahma, Siddhartha and Basu, Sujoy and Luan, Yi and Zhou, Denny and Hou, Le},
  journal={arXiv preprint arXiv:2311.07911},
  year={2023}
}