InverseIFEval
Description
InverseIFEval is an environment for evaluating an agent's ability to follow counterintuitive instructions. Based on the Inverse IFEval benchmark, it requires agents to override conventional training behaviors and follow unconventional or counterintuitive instructions across 8 instruction categories in both Chinese and English.
Capabilities
- Following counterintuitive instructions
- Overriding trained conventions
- Bilingual instruction following (Chinese and English)
- Handling 8 types of unconventional instruction patterns
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
Tasks
There is one split in this environment:
- test: 1,012 tasks (506 Chinese + 506 English)
Tasks span 8 instruction types:
| Instruction Type | Count |
|---|---|
| Instructional Induction | 154 |
| Mid-turn Instruction Modification | 108 |
| Counterfactual Answering | 108 |
| Code without Comments | 198 |
| Deliberately Incorrect Answers | 186 |
| Counter-Conventional Formatting | 82 |
| Question Correction | 90 |
| Intentional Textual Flaws | 86 |
Each task presents the agent with a counterintuitive instruction that deliberately deviates from standard conventions. The agent must read the instruction and submit a response that correctly follows the unconventional directive.
Reward Structure
Single-turn with LLM-graded rewards. The agent submits a response via the submit_response tool. Each task in the dataset includes a judge_prompt_template and judge_system_prompt that define grading criteria specific to that instruction type. The grader (gpt-5-mini) substitutes the agent's response and a reference answer into the template, then outputs a JSON verdict with an answer_score of 0 or 1.
The reward is binary:
- 1.0 if the response correctly follows the counterintuitive instruction
- 0.0 if it does not
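The grading flow described above can be sketched as follows. This is a minimal illustration, not the environment's actual grader: the template placeholder names and the reference_answer field name are assumptions, and a stub stands in for the gpt-5-mini call.

```python
import json

def grade_response(task: dict, agent_response: str, call_grader) -> float:
    """Substitute the agent's response and the reference answer into the
    task's judge template, ask the grader for a JSON verdict, and map
    answer_score (0 or 1) to a binary reward."""
    # Placeholder names {response}/{reference} are assumed, not the real schema.
    judge_prompt = task["judge_prompt_template"].format(
        response=agent_response,
        reference=task["reference_answer"],
    )
    verdict = json.loads(call_grader(task["judge_system_prompt"], judge_prompt))
    return 1.0 if verdict["answer_score"] == 1 else 0.0

# Stub standing in for the gpt-5-mini grader call.
def fake_grader(system_prompt: str, user_prompt: str) -> str:
    return '{"answer_score": 1}'

task = {
    "judge_prompt_template": "Response: {response}\nReference: {reference}\nGrade it.",
    "judge_system_prompt": "You are a strict grader. Output JSON.",
    "reference_answer": "an answer that deliberately breaks the usual convention",
}
print(grade_response(task, "my unconventional answer", fake_grader))  # 1.0
```

In the real environment the grader output is produced by gpt-5-mini against the per-task criteria; only the binary mapping from answer_score to reward is fixed.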
Data
The dataset consists of a single file:
inverse_ifeval.parquet (6.27 MB, 1,012 samples)
Sourced from the HuggingFace dataset m-a-p/Inverse_IFEval. Each sample contains a prompt, response reference, and judge template for LLM-based grading. Data is stored on the OpenReward platform.
Tools
Agents have access to a single tool:
submit_response: Submit a text response for LLM-based grading against the counterintuitive instruction. Accepts a response string parameter. The episode ends after calling this tool.
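The single tool could be represented in a standard function-calling schema along these lines. The schema below is illustrative only; the tool name and the single response string parameter come from the description above, while the description strings are assumptions.

```python
import json

# Illustrative tool schema; only the name "submit_response" and the
# required "response" string parameter are taken from the environment docs.
submit_response_tool = {
    "type": "function",
    "function": {
        "name": "submit_response",
        "description": "Submit a text response for LLM-based grading. "
                       "Calling this tool ends the episode.",
        "parameters": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string",
                    "description": "The answer to the counterintuitive instruction.",
                }
            },
            "required": ["response"],
        },
    },
}

print(json.dumps(submit_response_tool["function"]["parameters"]["required"]))
```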
Time Horizon
Single-turn. The agent reads the counterintuitive instruction and submits one response.
Environment Difficulty
The original paper evaluates frontier models on Inverse IFEval (Overall Score, English):
| Model | Score |
|---|---|
| o3-high | 75.7 |
| o3-mini | 74.7 |
| GPT-5-high | 73.7 |
| Claude-4-Opus-Thinking | 67.2 |
| Claude-4-Sonnet-Thinking | 64.0 |
| DeepSeek-R1 | 50.0 |
| DeepSeek-V3 | 39.6 |
Models show a ~30% performance drop versus conventional IFEval; thinking mechanisms improve scores by ~15% on average.
Other Environment Requirements
OpenAI API key required for LLM-based grading. Pass via secrets={"openai_api_key": "..."} when creating a session.
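As a minimal sketch, the secrets mapping might be assembled from an environment variable like this. Only the secrets={"openai_api_key": "..."} shape comes from the documentation above; the surrounding session keyword arguments are hypothetical.

```python
import os

# Only the secrets={"openai_api_key": ...} shape is documented; the rest
# of these session kwargs are hypothetical placeholders.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "...")}
session_kwargs = {
    "environment": "InverseIFEval",  # hypothetical parameter name
    "secrets": secrets,
}

print(sorted(session_kwargs["secrets"].keys()))
```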
Safety
Agents in InverseIFEval follow counterintuitive instructions in a standard environment. While some tasks ask agents to produce deliberately incorrect or unconventional outputs, this is done in a controlled evaluation context and does not present direct safety risks.
Citation
@article{inverse_ifeval_2024,
  title={Inverse IFEval: Evaluating LLMs' Ability to Follow Counterintuitive Instructions},
  author={MAP Team},
  year={2024},
  url={https://huggingface.co/datasets/m-a-p/Inverse_IFEval}
}