InverseIFEval

OpenReward Environment · Hugging Face Dataset

Description

InverseIFEval is an environment for evaluating an agent's ability to follow counterintuitive instructions. Based on the Inverse IFEval benchmark, it requires agents to override conventional training behaviors and follow unconventional or counterintuitive instructions across 8 instruction categories in both Chinese and English.

Capabilities

  • Following counterintuitive instructions
  • Overriding trained conventions
  • Bilingual instruction following (Chinese and English)
  • Handling 8 types of unconventional instruction patterns

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

Tasks

There is one split in this environment:

  • test: 1,012 tasks (506 Chinese + 506 English)

Tasks span 8 instruction types:

Instruction Type                     Count
Instructional Induction                154
Mid-turn Instruction Modification      108
Counterfactual Answering               108
Code without Comments                  198
Deliberately Incorrect Answers         186
Counter-Conventional Formatting         82
Question Correction                     90
Intentional Textual Flaws               86

Each task presents the agent with a counterintuitive instruction that deliberately deviates from standard conventions. The agent must read the instruction and submit a response that correctly follows the unconventional directive.

Reward Structure

Single-turn with LLM-graded rewards. The agent submits a response via the submit_response tool. Each task in the dataset includes a judge_prompt_template and judge_system_prompt that define grading criteria specific to its instruction type. The grader (gpt-5-mini) substitutes the agent's response and a reference answer into the template, then outputs a JSON verdict with an answer_score of 0 or 1.

The reward is binary:

  • 1.0 if the response correctly follows the counterintuitive instruction
  • 0.0 if it does not
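
For illustration, here is a minimal sketch of this grading flow in Python using the OpenAI SDK. The placeholder tokens ({response}, {reference}), the reference_answer field name, and the JSON response_format request are assumptions; the environment's actual grader wiring may differ:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(task: dict, agent_response: str) -> float:
    # Substitute the agent's response and the reference answer into the
    # task's judge template. Placeholder token names and the
    # "reference_answer" field are assumptions for illustration.
    judge_prompt = (
        task["judge_prompt_template"]
        .replace("{response}", agent_response)
        .replace("{reference}", task["reference_answer"])
    )
    completion = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": task["judge_system_prompt"]},
            {"role": "user", "content": judge_prompt},
        ],
        response_format={"type": "json_object"},  # request a JSON verdict
    )
    verdict = json.loads(completion.choices[0].message.content)
    # Binary reward: 1.0 only when the judge scores the response a 1.
    return 1.0 if int(verdict["answer_score"]) == 1 else 0.0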

Data

The dataset consists of a single file:

  • inverse_ifeval.parquet (6.27 MB, 1,012 samples)

Sourced from the Hugging Face dataset m-a-p/Inverse_IFEval. Each sample contains a prompt, a reference answer, and a judge template for LLM-based grading. Data is stored on the OpenReward platform.
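
To inspect the source data directly, the Hugging Face datasets library can load it; the split name and per-sample field names below are assumptions for illustration:

from datasets import load_dataset

# Pull the source dataset from the Hugging Face Hub for inspection.
ds = load_dataset("m-a-p/Inverse_IFEval", split="train")  # split name assumed
print(len(ds))       # expected: 1012 samples
print(ds[0].keys())  # prompt, reference, and judge-template fields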

Tools

Agents have access to a single tool:

  • submit_response: Submit a text response for LLM-based grading against the counterintuitive instruction. Accepts a response string parameter. The episode ends after calling this tool.
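
As a sketch, the tool could be described with a function-calling schema like the following (written here in the OpenAI tools format as a Python dict; the exact schema used by OpenReward is an assumption):

# Hypothetical function-calling schema for the environment's single tool.
SUBMIT_RESPONSE_TOOL = {
    "type": "function",
    "function": {
        "name": "submit_response",
        "description": (
            "Submit a text response for LLM-based grading against the "
            "counterintuitive instruction. Ends the episode."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string",
                    "description": "The agent's final answer text.",
                }
            },
            "required": ["response"],
        },
    },
}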

Time Horizon

Single-turn. The agent reads the counterintuitive instruction and submits one response.

Environment Difficulty

The original paper evaluates frontier models on Inverse IFEval (Overall Score, English):

Model                       Score
o3-high                      75.7
o3-mini                      74.7
GPT-5-high                   73.7
Claude-4-Opus-Thinking       67.2
Claude-4-Sonnet-Thinking     64.0
DeepSeek-R1                  50.0
DeepSeek-V3                  39.6

Models show roughly a 30% performance drop relative to conventional IFEval, and thinking mechanisms improve scores by about 15% on average.

Other Environment Requirements

OpenAI API key required for LLM-based grading. Pass via secrets={"openai_api_key": "..."} when creating a session.
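
A minimal session-setup sketch follows. The client module name and the create_session signature are assumptions; only the secrets mapping is taken from this README:

# Hypothetical OpenReward client usage; adjust to the actual SDK.
from openreward import OpenReward  # assumed client name

client = OpenReward()
session = client.create_session(
    environment="GeneralReasoning/InverseIFEval",
    secrets={"openai_api_key": "sk-..."},  # enables gpt-5-mini grading
)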

Safety

Agents in InverseIFEval follow counterintuitive instructions in a standard environment. While some tasks ask agents to produce deliberately incorrect or unconventional outputs, this is done in a controlled evaluation context and does not present direct safety risks.

Citation

@misc{inverse_ifeval_2025,
  title={Inverse IFEval: Evaluating LLMs' Ability to Follow Counterintuitive Instructions},
  author={MAP Team},
  year={2025},
  url={https://huggingface.co/datasets/m-a-p/Inverse_IFEval}
}