IFEval

Description

IFEval (Instruction-Following Evaluation) is an environment for evaluating an agent's ability to follow verifiable instructions. Based on Google's IFEval benchmark, it contains prompts with specific constraints (e.g., "write in more than 400 words", "mention the keyword AI at least 3 times") that can be programmatically verified without LLM grading.
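
For instance, a constraint like "mention the keyword AI at least 3 times" reduces to a few lines of string matching. The check below is an illustrative sketch, not the benchmark's own verifier code:

```python
import re

def mentions_keyword_at_least(response: str, keyword: str, n: int) -> bool:
    """Illustrative verifier: count whole-word, case-insensitive occurrences."""
    hits = re.findall(rf"\b{re.escape(keyword)}\b", response, re.IGNORECASE)
    return len(hits) >= n

assert mentions_keyword_at_least("AI here, AI there, and AI everywhere.", "AI", 3)
```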

Capabilities

  • Following precise formatting and content instructions
  • Generating text under multiple simultaneous constraints
  • Handling 25 types of verifiable instructions

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

Apache 2.0.

Tasks

There is one split in this environment:

  • test: 541 tasks

Each task presents a prompt containing one or more verifiable instructions. The agent must generate a response that satisfies all constraints.

Reward Structure

This is a single-turn environment. The agent submits a response via the submit tool. Each prompt contains one or more verifiable instructions (e.g., word-count constraints, keyword inclusion, formatting requirements). The environment evaluates each instruction independently using "loose" mode, which tests 8 variants of the response (e.g., with markdown asterisks stripped, or with the first and/or last line removed) before declaring an instruction unfollowed. If all instructions in the prompt are satisfied, the reward is 1.0; if any single instruction is not followed, the reward is 0.0. No LLM grading is used; all evaluation is deterministic and programmatic.
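
The sketch below mirrors that scoring logic: build the 8 loose variants, check every instruction against each variant, and return an all-or-nothing reward. The `verifiers` registry is a hypothetical stand-in for the benchmark's per-instruction checkers, not the environment's actual API.

```python
def loose_variants(response: str) -> list[str]:
    """The 8 variants tested under loose mode: the response with and without
    its first line, its last line, and markdown asterisks."""
    lines = response.split("\n")
    bases = [
        response,
        "\n".join(lines[1:]).strip(),    # drop first line
        "\n".join(lines[:-1]).strip(),   # drop last line
        "\n".join(lines[1:-1]).strip(),  # drop both
    ]
    return bases + [b.replace("*", "") for b in bases]

def score_prompt(response: str, instruction_ids: list[str],
                 kwargs_list: list[dict], verifiers: dict) -> float:
    """Reward is 1.0 only if every instruction passes on some loose variant."""
    for inst_id, kw in zip(instruction_ids, kwargs_list):
        check = verifiers[inst_id]  # hypothetical registry lookup
        if not any(check(v, **kw) for v in loose_variants(response) if v):
            return 0.0  # one unfollowed instruction zeroes the whole reward
    return 1.0
```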

Data

Data consists of a JSONL file (ifeval_input_data.jsonl) sourced from the Hugging Face dataset google/IFEval. Each row contains a prompt, a list of instruction IDs, and keyword arguments for each instruction's verifier. Data is stored on the OpenReward platform.
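
Assuming the schema of the upstream google/IFEval dataset (fields prompt, instruction_id_list, and kwargs), a row can be read like this; the snippet is a sketch, not platform code:

```python
import json

# Read the first task from the file named above.
with open("ifeval_input_data.jsonl") as f:
    row = json.loads(f.readline())

print(row["prompt"])               # the task prompt shown to the agent
print(row["instruction_id_list"])  # e.g. ["length_constraints:number_words"]
print(row["kwargs"])               # per-verifier arguments, e.g. [{"relation": "at least", "num_words": 400}]
```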

Tools

  • submit: Submit your response for evaluation against instruction constraints. Ends the episode.
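
As an illustration, a submit call needs to carry only the final response text. The payload shape below is hypothetical; consult the OpenReward documentation for the actual tool-call format.

```python
# Hypothetical tool-call payload (the real wire format may differ).
tool_call = {
    "name": "submit",
    "arguments": {"response": "Final answer written to satisfy all constraints."},
}
```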

Time Horizon

Single-turn. The agent receives a prompt with constraints and submits one response.

Environment Difficulty

IFEval is one of the core benchmarks on the Open LLM Leaderboard. In the original paper, GPT-4 achieved 76.89% prompt-level strict accuracy and 83.57% instruction-level strict accuracy. Frontier models generally perform well, but the combination of multiple simultaneous constraints in a single prompt can be challenging.

Other Environment Requirements

There are no further environment requirements.

Safety

This environment evaluates instruction-following ability and does not present direct safety risks.

Citation

@article{zhou2023instruction,
  title={Instruction-Following Evaluation for Large Language Models},
  author={Zhou, Jeffrey and Lu, Tianjian and Mishra, Swaroop and Brahma, Siddhartha and Basu, Sujoy and Luan, Yi and Zhou, Denny and Hou, Le},
  journal={arXiv preprint arXiv:2311.07911},
  year={2023}
}