EsoLang-Bench

Description

EsoLang-Bench is an environment for evaluating LLM code generation in esoteric programming languages. Models score ~90% on Python coding tasks but only ~3.8% when the same problems must be solved in esoteric languages, so the benchmark is intended to separate genuine reasoning from pattern matching. Agents are given programming problems and must write solutions in one of five esoteric languages.

Capabilities

Code generation in Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare
Iterative testing and debugging via interpreter execution
Algorithmic problem solving under severe language constraints

Compute Requirements

No special compute requirements. All interpreters are pure-Python and execute server-side with a 5-second timeout per execution.
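
To make "pure-Python interpreter with a timeout" concrete, here is a minimal Brainfuck sketch. It is not the benchmark's interpreter: the function name `run_brainfuck`, the 30,000-cell wrapping tape, 8-bit cells, EOF-reads-as-zero behaviour, and the shape of the returned dictionary are all assumptions chosen for illustration.

```python
import time


def run_brainfuck(code: str, stdin: str = "", timeout_s: float = 5.0) -> dict:
    """Illustrative interpreter; NOT the benchmark's implementation.

    Assumptions: 30,000 wrapping 8-bit cells, EOF reads as 0.
    """
    # Pre-compute matching bracket positions so loops jump in O(1).
    jumps, stack = {}, []
    for pos, ch in enumerate(code):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                return {"stdout": "", "stderr": "unmatched ]", "error": True}
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        return {"stdout": "", "stderr": "unmatched [", "error": True}

    tape, ptr, pc = [0] * 30_000, 0, 0
    inp, out = iter(stdin), []
    deadline = time.monotonic() + timeout_s
    steps = 0
    while pc < len(code):
        # Check the wall clock every 4096 steps to keep the loop cheap.
        steps += 1
        if steps % 4096 == 0 and time.monotonic() > deadline:
            return {"stdout": "".join(out), "stderr": "timeout", "error": True}
        ch = code[pc]
        if ch == ">":
            ptr = (ptr + 1) % len(tape)
        elif ch == "<":
            ptr = (ptr - 1) % len(tape)
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
    return {"stdout": "".join(out), "stderr": "", "error": False}


# 8 * 8 + 1 = 65, the code point of "A".
print(run_brainfuck("++++++++[>++++++++<-]>+.")["stdout"])  # A
```

A wall-clock deadline checked inside the dispatch loop is one simple way to bound runaway programs such as `+[]` without threads or subprocesses; the real server may enforce its limit differently.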

License

CC BY 4.0 (dataset), MIT (interpreters)

Tasks

400 tasks in a single test split (80 problems × 5 languages). Each task combines a programming problem with a target esoteric language. Problems range from simple string output to complex algorithmic challenges across four difficulty levels (easy, medium, hard, extra_hard).
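
The task count is just the cross product of problems and languages. The sketch below shows that arithmetic; the language identifier strings and the `problem_id` format are hypothetical and may not match the dataset's actual field values.

```python
from itertools import product

# Hypothetical identifiers; the dataset may spell these differently.
LANGUAGES = ["brainfuck", "befunge98", "whitespace", "unlambda", "shakespeare"]
DIFFICULTIES = ["easy", "medium", "hard", "extra_hard"]
PROBLEMS_PER_DIFFICULTY = 20

# Made-up problem IDs, 20 per difficulty level.
problems = [
    (f"{difficulty}_{index:02d}", difficulty)
    for difficulty in DIFFICULTIES
    for index in range(1, PROBLEMS_PER_DIFFICULTY + 1)
]

# One task per (problem, language) pair.
tasks = [
    {"problem_id": pid, "difficulty": difficulty, "language": language}
    for (pid, difficulty), language in product(problems, LANGUAGES)
]

print(len(problems), len(tasks))  # 80 400
```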

Reward Structure

Partial credit based on hidden test cases:

reward = num_passing_tests / total_tests (6 test cases per problem)
Output matching uses the paper's outputs_match_lang function (language-aware, tolerant of trailing whitespace, with numeric normalization)
Reward range: [0.0, 1.0]
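
The reward formula above can be sketched as follows. The `outputs_match` helper here is a simplified stand-in written for this example, not the paper's `outputs_match_lang`: it is not language-aware, and its numeric normalization only handles outputs that parse as a single number.

```python
def normalize(text: str) -> str:
    """Strip trailing whitespace per line and drop trailing blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)


def outputs_match(actual: str, expected: str) -> bool:
    """Simplified stand-in for the paper's outputs_match_lang (assumption)."""
    a, e = normalize(actual), normalize(expected)
    if a == e:
        return True
    # Crude numeric normalization: treat "3.0" and "3" as equal.
    try:
        return float(a) == float(e)
    except ValueError:
        return False


def compute_reward(actual_outputs, expected_outputs) -> float:
    """reward = num_passing_tests / total_tests, in [0.0, 1.0]."""
    if not expected_outputs:
        return 0.0
    passed = sum(
        outputs_match(a, e) for a, e in zip(actual_outputs, expected_outputs)
    )
    return passed / len(expected_outputs)


expected = ["1", "2", "3", "4", "5", "6"]
actual = ["1\n", "2  ", "3.0", "x", "5", "7"]
# Cases 1, 2, 3 and 5 pass, so 4 of 6 tests pass.
print(round(compute_reward(actual, expected), 3))  # 0.667
```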

Data

Source: Lossfunk/Esolang-Bench on HuggingFace
Size: 80 problems, 6 test cases each
Format: Parquet file with problem descriptions and test cases

Tools

run_code(code, stdin): Execute code in the target esoteric language with optional stdin. Returns stdout, stderr, and error status. 5-second timeout.
submit(code): Submit the final solution for grading against all hidden test cases. This is a terminal action and only one submission is allowed. Returns per-case results and the partial-credit reward.
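
The two tools imply a simple episode shape: call run_code as often as needed, then call submit exactly once. The sketch below models that contract locally. The `Session` class, the `EpisodeFinished` exception, the result dictionary keys, and the one-word toy "language" are all hypothetical; the real environment's client API and return format may differ, and the 5-second timeout is omitted here.

```python
from typing import Callable, List, Tuple


class EpisodeFinished(RuntimeError):
    """Raised when a tool is called after the terminal submit."""


class Session:
    """Hypothetical stand-in for the environment's two tools."""

    def __init__(self, execute: Callable[[str, str], str],
                 hidden_tests: List[Tuple[str, str]]) -> None:
        self._execute = execute            # (code, stdin) -> stdout
        self._hidden_tests = hidden_tests  # (stdin, expected stdout) pairs
        self._finished = False

    def _run(self, code: str, stdin: str) -> dict:
        # Interpreter failures are reported in the result, not raised.
        try:
            return {"stdout": self._execute(code, stdin), "stderr": "", "error": False}
        except Exception as exc:
            return {"stdout": "", "stderr": str(exc), "error": True}

    def run_code(self, code: str, stdin: str = "") -> dict:
        if self._finished:
            raise EpisodeFinished("episode ended at submit()")
        return self._run(code, stdin)

    def submit(self, code: str) -> dict:
        if self._finished:
            raise EpisodeFinished("only one submission is allowed")
        self._finished = True
        per_case = []
        for stdin, expected in self._hidden_tests:
            result = self._run(code, stdin)
            passed = (not result["error"]
                      and result["stdout"].rstrip() == expected.rstrip())
            per_case.append(passed)
        return {"per_case": per_case, "reward": sum(per_case) / len(per_case)}


def toy_execute(code: str, stdin: str) -> str:
    """Toy one-word 'language' used only to exercise the session."""
    if code == "ECHO":
        return stdin
    if code == "UPPER":
        return stdin.upper()
    raise ValueError(f"unknown command: {code!r}")


session = Session(toy_execute, [("ab", "AB"), ("x", "X"), ("Hi", "HI")])
print(session.run_code("ECHO", "ab")["stdout"])  # ab
print(session.submit("UPPER")["reward"])         # 1.0
```

Because submit is terminal, an agent should exhaust its own test inputs through run_code first; any call after submit raises in this sketch.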

Time Horizon

Multi-turn. Agents iteratively write, test, and debug code before submitting. Expected 5-30 tool calls depending on problem difficulty and language complexity.

Environment Difficulty

Easy (20 problems): Basic I/O, string manipulation
Medium (20 problems): Loops, conditionals, arithmetic
Hard (20 problems): Complex algorithms, data structures
Extra Hard (20 problems): Advanced algorithmic challenges

Difficulty is compounded by the esoteric language constraint: even easy problems become challenging in Brainfuck or Whitespace.

Other Environment Requirements

No other requirements. No sandbox, API keys, or external services needed.

Safety

All code execution happens server-side in pure-Python interpreters with strict timeouts. Esoteric languages operate on abstract machines (tapes, stacks, grids) with no filesystem or network access. No safety concerns.

Citations

@article{sharma2026esolangbench,
      title={EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages},
      author={Aman Sharma and Paras Chopra},
      year={2026},
      eprint={2603.09678},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
}

Repository

Source repository

EnvCommons/EsoLang-Bench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

5-minute session: $0.0096

1-hour session: $0.1152
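
The example figures follow directly from the per-second rate. A minimal sketch of that arithmetic, using the environment rate from the cost table and a hypothetical helper name `session_cost`:

```python
from decimal import Decimal

# USD per second for the environment server, taken from the cost table.
ENVIRONMENT_RATE = Decimal("0.0000320")


def session_cost(seconds: int) -> Decimal:
    """Cost of an active session billed per second (sandbox not configured)."""
    return ENVIRONMENT_RATE * seconds


print(f"{session_cost(5 * 60):.4f}")   # 0.0096
print(f"{session_cost(60 * 60):.4f}")  # 0.1152
```

`Decimal` is used instead of `float` so that the tiny per-second rate multiplies out exactly.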
