Implementation of

arXiv/deepsynth

DeepSynth

Description

DeepSynth is an environment for evaluating program synthesis from input-output examples. Agents are given examples of integer and integer-list transformations and must write a Python function that implements the transformation, generalizing beyond the shown examples.
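To make the task format concrete, the sketch below shows one hypothetical task and a candidate solution. The examples, the `(input, expected_output)` layout, and the function name `transform` are illustrative assumptions; none of them is taken from the dataset or from the environment's actual schema.

```python
# Hypothetical task: these examples are invented for illustration,
# they are not drawn from the DeepSynth data.
visible_examples = [
    ([3, -1, 4, -5, 2], [2, 3, 4]),
    ([-2, -7], []),
    ([10, 0, -3, 6], [0, 6, 10]),
    ([5], [5]),
    ([], []),
]


def transform(xs):
    """Candidate solution: keep non-negative values, then sort ascending."""
    return sorted(x for x in xs if x >= 0)


# A candidate is only plausible if it reproduces every visible example.
all_pass = all(transform(inp) == out for inp, out in visible_examples)
print(all_pass)
```

A function that matches the five visible examples can still be wrong on unseen inputs, which is why the candidate should capture the underlying rule rather than memorize the examples.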

Capabilities

Pattern recognition from input-output examples
Program synthesis / function induction
Integer and list manipulation
Generalization from examples to unseen inputs

Compute Requirements

No special compute requirements. Lightweight non-sandbox environment.

Tasks

Train split: 1,353 tasks (T=1: 21, T=2: 464, T=3: 455, T=4: 413)
Test split: 792 tasks (T=1: 24, T=2: 99, T=3: 487, T=4: 96, T=5: 86)
Difficulty levels range from T=1 (single operation) to T=5 (five chained operations); T=5 tasks appear only in the test split
Each task has 5 visible I/O examples and ~20 hidden test cases

Tasks are drawn from the DeepCoder dataset, which covers integer list transformations using ~40 primitives (sorting, filtering, mapping, zipping, scanning, etc.).
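The difficulty level T can be read as the number of primitives chained into one program. The sketch below illustrates that idea with three stand-in primitives; the names `keep_even`, `double`, `running_sum`, and `chain` are hypothetical and are not the identifiers used by the DeepCoder DSL.

```python
from functools import reduce
from itertools import accumulate


# Illustrative stand-ins for list primitives (hypothetical names).
def keep_even(xs):
    return [x for x in xs if x % 2 == 0]


def double(xs):
    return [2 * x for x in xs]


def running_sum(xs):
    return list(accumulate(xs))


def chain(*ops):
    """Compose primitives left to right into a single program."""
    def program(xs):
        return reduce(lambda acc, op: op(acc), ops, xs)
    return program


t1_program = chain(sorted)                          # one operation
t3_program = chain(keep_even, double, running_sum)  # three chained operations

t1_result = t1_program([3, 1, 2])
t3_result = t3_program([1, 2, 3, 4, 6])
print(t1_result)  # [1, 2, 3]
print(t3_result)  # [2, 4, 6] -> [4, 8, 12] -> [4, 12, 24]
```

The agent sees only input-output pairs, not the chain itself, so longer chains leave more candidate programs consistent with the same five examples.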

Reward Structure

Binary reward: 1.0 if the submitted function passes ALL hidden test cases, 0.0 otherwise. Fully verifiable — no LLM grader.
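The all-or-nothing rule can be sketched as follows. This is a minimal illustration, not the environment's grader: the `(inputs, expected)` case layout is assumed, and treating a raised exception as a failed case is also an assumption.

```python
def binary_reward(candidate, hidden_cases):
    """Return 1.0 only if every case passes, else 0.0.

    hidden_cases: iterable of (inputs_tuple, expected_output) pairs
    (an assumed layout, chosen so tasks with several inputs fit).
    """
    for inputs, expected in hidden_cases:
        try:
            if candidate(*inputs) != expected:
                return 0.0
        except Exception:
            # Assumption: a crash on any case counts as a failure.
            return 0.0
    return 1.0


# Hypothetical cases for "sum of the list".
cases = [(([1, 2, 3],), 6), (([],), 0), (([-4, 4],), 0)]

reward_correct = binary_reward(lambda xs: sum(xs), cases)
reward_partial = binary_reward(lambda xs: sum(xs) if xs else 1, cases)
print(reward_correct, reward_partial)  # 1.0 0.0
```

The second candidate passes two of the three cases and still scores 0.0, which is the behavior the binary reward describes.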

Data

Source: DeepCoder dataset (Balog et al., ICLR 2017)
Enriched with hidden test cases generated from ground-truth DSL programs

Tools

test(code) — Test Python code against the visible I/O examples. Returns pass/fail per example. Non-terminal.
submit(code) — Submit final Python code for grading against hidden test cases. Terminal action, one attempt only.
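As a rough picture of what a call to test might do, the sketch below executes submitted source code and reports pass/fail per visible example. Everything here is an assumption made for illustration: the helper name `run_visible_tests`, the entry-point name `solve`, the use of `exec`, and the list-of-booleans return format are not confirmed by the environment.

```python
def run_visible_tests(code, examples, func_name="solve"):
    """Run source code against (inputs_tuple, expected) pairs.

    Returns one boolean per example. Hypothetical helper, not the
    environment's implementation.
    """
    namespace = {}
    exec(code, namespace)  # defines the candidate function in `namespace`
    func = namespace[func_name]
    results = []
    for inputs, expected in examples:
        try:
            results.append(func(*inputs) == expected)
        except Exception:
            results.append(False)  # assumption: errors count as failures
    return results


# Hypothetical two-input task: keep elements greater than or equal to k.
code = "def solve(xs, k):\n    return [x for x in xs if x > k]"
examples = [
    (([1, 5, 3], 2), [5, 3]),
    (([0], 1), []),
    (([4, 4], 4), [4, 4]),
]
results = run_visible_tests(code, examples)
print(results)  # [True, True, False]
```

The failing third example shows the kind of feedback that lets an agent spot an off-by-one comparison (`>` instead of `>=`) before using its single submit attempt.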

Time Horizon

Multi-turn. Agents can iterate using the test tool before submitting. Typical: 1-5 tool calls.
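The iterate-then-submit pattern can be sketched as a small loop. The helper `pick_submission` and the `max_test_calls` budget are hypothetical; the sketch calls candidates directly instead of going through the real tools.

```python
def pick_submission(candidates, visible_examples, max_test_calls=4):
    """Test candidates in order and stop at the first that passes all
    visible examples. Returns (chosen_candidate, test_calls_used).

    Assumes `candidates` is non-empty. If none passes within the budget,
    the last candidate tested is returned.
    """
    calls = 0
    chosen = candidates[0]
    for cand in candidates[:max_test_calls]:
        calls += 1
        chosen = cand
        if all(cand(*inputs) == expected for inputs, expected in visible_examples):
            break
    return chosen, calls


def sort_descending(xs):
    return sorted(xs, reverse=True)


def reverse(xs):
    return xs[::-1]


# Hypothetical task whose true rule is "reverse the list".
visible = [(([1, 2, 3],), [3, 2, 1]), (([2, 1],), [1, 2])]
chosen, calls = pick_submission([sort_descending, reverse], visible)
print(chosen.__name__, calls)  # reverse 2
```

With a budget of four test calls plus the final submit, the loop stays within the typical range of 1-5 tool calls.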

Environment Difficulty

T=1: Easy (single operation, e.g., sort, map, filter)
T=2: Medium (two chained operations)
T=3: Hard (three chained operations, often with multiple inputs)
T=4-5: Very hard (four or five chained operations, complex compositions)

Safety

No safety concerns — tasks involve only integer and list manipulation.

Citations

@inproceedings{Balog2017,
  author    = {Balog, Matej and Gaunt, Alexander L. and Brockschmidt, Marc and Nowozin, Sebastian and Tarlow, Daniel},
  title     = {DeepCoder: Learning to Write Programs},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2017},
  url       = {https://arxiv.org/abs/1611.01989}
}

@inproceedings{Fijalkow2022,
  author    = {Fijalkow, Nathana{\"{e}}l and Lagarde, Guillaume and Matricon, Th{\'{e}}o and Ellis, Kevin and Ohlmann, Pierre and Potta, Akarsh},
  title     = {Scaling Neural Program Synthesis with Distribution-based Search},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {36},
  number    = {6},
  pages     = {6623--6630},
  year      = {2022},
  doi       = {10.1609/aaai.v36i6.20616}
}

Repository

Source repository

EnvCommons/DeepSynth

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

5-minute session: $0.0096

1-hour session: $0.1152
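The example figures follow from multiplying the per-second rate by the session length in seconds. A small sketch of that arithmetic, with `session_cost` as a hypothetical helper name:

```python
from decimal import Decimal

# Per-second total from the cost table above.
RATE_PER_SECOND = Decimal("0.0000320")


def session_cost(seconds):
    """Cost in dollars for a session of the given length in seconds."""
    return RATE_PER_SECOND * seconds


five_minutes = session_cost(5 * 60)
one_hour = session_cost(60 * 60)
print(f"${five_minutes:.4f}")  # $0.0096
print(f"${one_hour:.4f}")      # $0.1152
```

`Decimal` is used here only to avoid binary floating-point rounding in the displayed amounts.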

DeepSynth

GeneralReasoning/DeepSynth
