DeepSynth

OpenReward Environment

Description

DeepSynth is an environment for evaluating program synthesis from input-output examples. Agents are given examples of integer and integer-list transformations and must write a Python function that implements the transformation, generalizing beyond the shown examples.

Capabilities

  • Pattern recognition from input-output examples
  • Program synthesis / function induction
  • Integer and list manipulation
  • Generalization from examples to unseen inputs

Compute Requirements

No special compute requirements. The environment is lightweight and does not run inside a sandbox.

Tasks

  • Train split: 1,353 tasks (T=1: 21, T=2: 464, T=3: 455, T=4: 413)
  • Test split: 792 tasks (T=1: 24, T=2: 99, T=3: 487, T=4: 96, T=5: 86)
  • Difficulty levels T=1 (single operation) through T=5 (five chained operations)
  • Each task has 5 visible I/O examples and ~20 hidden test cases

Tasks are drawn from the DeepCoder dataset, which covers integer list transformations using ~40 primitives (sorting, filtering, mapping, zipping, scanning, etc.).
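To make the task format concrete, the sketch below shows a hypothetical task and one function that fits it. The example pairs are invented for illustration and are not taken from the dataset; the function name `transform` and its signature are likewise assumptions, since the exact prompt format is not specified here.

```python
# Hypothetical task: five visible input/output examples (invented for illustration).
VISIBLE_EXAMPLES = [
    ([3, -1, 2], [4, 6]),
    ([-5, -2], []),
    ([1, 1, 0], [2, 2]),
    ([10, -3, 4, 7], [8, 14, 20]),
    ([0], []),
]


def transform(xs: list[int]) -> list[int]:
    """Keep the positive elements, double them, and sort ascending."""
    return sorted(2 * x for x in xs if x > 0)


# The candidate reproduces every visible example.
for inp, expected in VISIBLE_EXAMPLES:
    assert transform(inp) == expected
```

A function like this would still need to agree with the hidden test cases, which probe inputs that the visible examples do not cover.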

Reward Structure

Binary reward: 1.0 if the submitted function passes all hidden test cases, 0.0 otherwise. Grading is fully programmatic; no LLM grader is involved.
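The all-or-nothing rule can be sketched as follows. This is an illustrative reimplementation, not the environment's grader: the helper name `binary_reward`, the `(args, expected)` case layout, and the choice to treat an exception as a failure are all assumptions.

```python
from typing import Callable, Sequence


def binary_reward(fn: Callable, hidden_cases: Sequence[tuple[tuple, object]]) -> float:
    """Return 1.0 only if fn matches the expected output on every case."""
    for args, expected in hidden_cases:
        try:
            if fn(*args) != expected:
                return 0.0
        except Exception:
            # Assumption: a crash on any case counts as a failure.
            return 0.0
    return 1.0


# Hypothetical hidden cases for a "sum the list" task.
cases = [(([1, 2, 3],), 6), (([],), 0)]
print(binary_reward(sum, cases))  # 1.0
print(binary_reward(max, cases))  # 0.0
```

Because a single mismatch yields 0.0, a function that is correct on most cases earns no partial credit.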

Data

  • Source: DeepCoder dataset (Balog et al., ICLR 2017)
  • Enriched with hidden test cases generated from ground-truth DSL programs

Tools

  • test(code) — Test Python code against the visible I/O examples. Returns pass/fail per example. Non-terminal.
  • submit(code) — Submit final Python code for grading against hidden test cases. Terminal action, one attempt only.
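For local experimentation, the behaviour of the test tool can be approximated with a small harness. The sketch below is an assumption-laden stand-in rather than the environment's implementation: the helper `run_visible_tests`, the entry-point name `transform`, and the list-of-booleans return format are hypothetical.

```python
def run_visible_tests(code: str, examples, entry_point: str = "transform") -> list[bool]:
    """Execute code in a fresh namespace and report pass/fail per example."""
    namespace: dict = {}
    exec(code, namespace)  # Not sandboxed: only run code you trust.
    fn = namespace[entry_point]
    results = []
    for args, expected in examples:
        try:
            results.append(fn(*args) == expected)
        except Exception:
            # Assumption: an exception counts as a failed example.
            results.append(False)
    return results


# Hypothetical submission and examples for a "sort descending" task.
code = "def transform(xs):\n    return sorted(xs)[::-1]\n"
examples = [(([3, 1, 2],), [3, 2, 1]), (([5],), [5]), (([],), [])]
print(run_visible_tests(code, examples))  # [True, True, True]
```

Since submit allows only one attempt, a harness like this is mainly useful for checking that the code string parses and defines the expected function before spending a tool call.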

Time Horizon

Multi-turn. Agents can iterate with the test tool before submitting. A typical episode uses 1-5 tool calls.
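One way to picture the iterate-then-submit pattern is a loop that checks candidate functions against the visible examples and keeps the first one that fits. The helper `pick_candidate`, the budget of four checks, and the example data are hypothetical; each check stands in for one call to the test tool.

```python
def pick_candidate(candidates, examples, max_test_calls=4):
    """Return the first candidate that passes every visible example, or None.

    Each candidate check stands in for one call to the test tool, so the
    default budget leaves room for a single final submit call.
    """
    for fn in candidates[:max_test_calls]:
        try:
            if all(fn(*args) == expected for args, expected in examples):
                return fn
        except Exception:
            continue  # A crashing candidate is simply discarded.
    return None


# Hypothetical visible examples and candidate programs.
examples = [(([1, 2, 3],), [3, 2, 1]), (([4, 4, 9],), [9, 4, 4]), (([7],), [7])]
candidates = [sorted, lambda xs: xs[::-1], lambda xs: sorted(xs, reverse=True)]
chosen = pick_candidate(candidates, examples)
```

In this made-up case both "reverse" and "sort descending" fit the visible examples, yet they disagree on an input such as `[2, 1, 3]`. Passing the test tool is therefore necessary but not sufficient for passing the hidden cases.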

Environment Difficulty

  • T=1: Easy (single operation, e.g., sort, map, filter)
  • T=2: Medium (two chained operations)
  • T=3: Hard (three chained operations, often with multiple inputs)
  • T=4-5: Very hard (four to five chained operations, complex compositions)
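The difficulty levels above correspond to how many operations are chained. The sketch below illustrates T=1 through T=3 with Python stand-ins for a few list primitives; the helper names are illustrative and do not reproduce the dataset's DSL.

```python
from itertools import accumulate


# Python stand-ins for a few list primitives (names are illustrative).
def keep_even(xs):
    return [x for x in xs if x % 2 == 0]


def double(xs):
    return [2 * x for x in xs]


def running_sum(xs):
    return list(accumulate(xs))


def t1(xs):  # T=1: one operation
    return sorted(xs)


def t2(xs):  # T=2: two chained operations
    return double(sorted(xs))


def t3(xs):  # T=3: three chained operations
    return running_sum(double(keep_even(xs)))


xs = [5, 2, 8, 3]
print(t1(xs))  # [2, 3, 5, 8]
print(t2(xs))  # [4, 6, 10, 16]
print(t3(xs))  # [4, 20]
```

Each added operation makes the overall transformation harder to infer from five examples, because intermediate results are never shown.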

Safety

No safety concerns — tasks involve only integer and list manipulation.

Citations

@inproceedings{Balog2017,
  author    = {Balog, Matej and Gaunt, Alexander L. and Brockschmidt, Marc and Nowozin, Sebastian and Tarlow, Daniel},
  title     = {DeepCoder: Learning to Write Programs},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2017},
  url       = {https://arxiv.org/abs/1611.01989}
}

@inproceedings{Fijalkow2022,
  author    = {Fijalkow, Nathana{\"{e}}l and Lagarde, Guillaume and Matricon, Th{\'{e}}o and Ellis, Kevin and Ohlmann, Pierre and Potta, Akarsh},
  title     = {Scaling Neural Program Synthesis with Distribution-based Search},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {36},
  number    = {6},
  pages     = {6623--6630},
  year      = {2022},
  doi       = {10.1609/aaai.v36i6.20616}
}