KernelBench

Description

KernelBench is an environment for evaluating language model agents on GPU kernel optimization. It is based on the KernelBench benchmark from the Stanford Scaling Intelligence Lab: agents are given a reference PyTorch implementation and must write custom CUDA kernels that are both functionally correct and faster than the original. Tasks span single operators, fused kernel patterns, full model architectures, and HuggingFace models, across four difficulty levels.

Capabilities

  • Writing custom CUDA kernels to replace PyTorch operators
  • Fusing multiple operations into single optimized kernels
  • Optimizing full neural network architectures end-to-end
  • Iterating on kernel implementations based on compilation and correctness feedback

Compute Requirements

KernelBench uses a 2-stage sandbox pattern per run_kernel tool call:

  • CPU sandbox (2 CPUs, 8GB RAM): Compiles CUDA kernels without requiring a GPU
  • GPU sandbox (NVIDIA L4): Benchmarks compiled kernels against the reference implementation

Both sandboxes are ephemeral and created/destroyed per tool call. The sandbox execution timeout is 300 seconds, matching the original KernelBench evaluation.
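
The sandbox internals are not documented here, but the split works because nvcc does not need a GPU at compile time if the target architecture is pinned explicitly. A minimal sketch of the idea, assuming the standard torch.utils.cpp_extension toolchain (the environment may do this differently):

```python
import os

# Without a visible GPU, PyTorch's extension builder cannot probe the device,
# so the target architecture must be set explicitly. "8.9" is the compute
# capability of the NVIDIA L4 used in the benchmarking stage.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.9"

# Stage 1 (CPU sandbox): torch.utils.cpp_extension.load_inline can now compile
# CUDA/C++ sources with nvcc alone, no GPU required.
# Stage 2 (GPU sandbox): the compiled module is loaded and benchmarked against
# the reference implementation on the L4.
```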

License

MIT.

Tasks

There are 7 splits with tasks sourced from the ScalingIntelligence/KernelBench HuggingFace dataset:

| Split | Tasks | Type | Description |
| --- | --- | --- | --- |
| dev | 1 | validation | Single task for development testing |
| level1 | 100 | test | Single-kernel operators (matrix multiply, activations, norms, pooling, convolutions, reductions, loss functions) |
| level1_verified | 79 | test | Verified subset of level1 |
| level2 | 100 | test | Fused kernel patterns (e.g., Conv2D + ReLU + BiasAdd) |
| level2_verified | 60 | test | Verified subset of level2 |
| level3 | 50 | test | Full model architectures (MLP, ResNet, VGG, DenseNet, EfficientNet, ViT, LSTM, GRU, MiniGPT, UNet, Mamba) |
| level4 | 20 | test | HuggingFace model architectures (GPT-Neo, OPT, BART, BigBird, Reformer, ELECTRA, GPT-2) |

Each task presents a reference PyTorch Model class. The agent must produce a ModelNew class that implements the same functionality using custom CUDA kernels, along with the corresponding CUDA and C++ code.
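
As a purely illustrative sketch (this is not a task from the dataset, and the environment may wire things differently), a level-1-style answer built with torch.utils.cpp_extension.load_inline might look like:

```python
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

# Reference implementation as the agent would see it (illustrative only).
class Model(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x)

cuda_source = r"""
#include <torch/extension.h>

__global__ void relu_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

torch::Tensor relu_forward(torch::Tensor x) {
    auto out = torch::empty_like(x);
    const int n = x.numel();
    const int threads = 256;
    relu_kernel<<<(n + threads - 1) / threads, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

cpp_source = "torch::Tensor relu_forward(torch::Tensor x);"

relu_ext = load_inline(
    name="relu_ext",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["relu_forward"],
)

# The agent's answer: same interface and semantics as Model, backed by the
# custom kernel instead of the PyTorch operator.
class ModelNew(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return relu_ext.relu_forward(x.contiguous())
```

Splitting the sources this way mirrors the run_kernel tool, which accepts the Python, CUDA, and C++ pieces as separate arguments.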

Reward Structure

This is a dense reward environment with continuous scoring. Each run_kernel call compiles and benchmarks the agent's kernel, returning a reward based on the outcome:

| Outcome | Reward | Description |
| --- | --- | --- |
| Correct, faster | median_speedup - 1 | Positive reward proportional to speedup over the reference |
| Correct, same speed | 0.0 | Matches reference performance |
| Incorrect output | -2.0 | Kernel runs but produces wrong results |
| Execution failure | -3.0 | Kernel initialized but failed during execution |
| Initialization failure | -4.0 | Kernel compiled but ModelNew could not be initialized |
| Compilation failure | -5.0 | CUDA kernel failed to compile |
| Unknown error | -6.0 | Unclassified error during evaluation |
| Exception/timeout | -7.0 | Sandbox error or evaluation exceeded the 300 s timeout |

The agent can call run_kernel multiple times to iterate on its solution. The finish tool ends the episode with reward 0.

We do not use LLM graders for this task. Correctness is verified by comparing the kernel output against the reference implementation, and speedup is measured over 1,000 benchmark trials.
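
Put into code, the scoring rule reduces to a small mapping. The outcome labels and helper below are an illustrative reconstruction of the table above, not the environment's actual implementation:

```python
# Fixed penalties for the failure modes in the reward table (names invented).
FAILURE_REWARDS = {
    "incorrect_output": -2.0,
    "execution_failure": -3.0,
    "initialization_failure": -4.0,
    "compilation_failure": -5.0,
    "unknown_error": -6.0,
    "exception_or_timeout": -7.0,
}

def kernel_reward(outcome: str, median_speedup: float = 1.0) -> float:
    """Map a run_kernel outcome to a scalar reward."""
    if outcome == "correct":
        # A median speedup of 1.0 (same speed as the reference) gives 0.0;
        # e.g. a 1.8x speedup earns a reward of 0.8.
        return median_speedup - 1.0
    return FAILURE_REWARDS[outcome]
```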

Data

Task data is loaded at runtime from the ScalingIntelligence/KernelBench HuggingFace dataset. Each task contains a complete PyTorch file with a Model class, get_inputs(), and get_init_inputs() functions. The agent sees only the Model class and imports (test harness functions are stripped).
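
A task file in that format might look like the following (contents invented for illustration; real tasks come from the dataset). The agent is shown only the imports and the Model class; the two harness functions are stripped:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """Reference implementation the agent must match and outperform."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))

# Harness functions (stripped before the task is shown to the agent):
def get_inputs():
    # Arguments passed to Model.forward for correctness checks and benchmarks.
    return [torch.randn(128, 1024)]

def get_init_inputs():
    # Arguments passed to Model.__init__ (and ModelNew.__init__).
    return [1024]
```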

Tools

Agents are given two tools:

  • run_kernel: Submit Python code (defining ModelNew), CUDA code, and C++ code; a hypothetical payload is sketched after this list. The environment compiles the kernel on CPU, then benchmarks it on GPU against the reference implementation. Returns correctness, speedup, and detailed error messages on failure.
  • finish: End the task. Should be called once the agent has produced its best solution.
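
Assuming a JSON-style tool interface, a run_kernel call might carry the three code artifacts like this (the argument names are guesses from the description above, not a documented schema):

```python
# Hypothetical run_kernel payload; all field names are assumptions.
run_kernel_args = {
    # Python source defining ModelNew (plus any load_inline glue).
    "python_code": "class ModelNew(nn.Module): ...",
    # CUDA source with the kernel(s) and their launch wrappers.
    "cuda_code": "__global__ void relu_kernel(...) { ... }",
    # C++ declarations of the functions exposed to Python.
    "cpp_code": "torch::Tensor relu_forward(torch::Tensor x);",
}
```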

Time Horizon

KernelBench is a multi-turn environment. The agent receives a reference implementation and iteratively writes and tests CUDA kernels. There is no hard limit on the number of run_kernel calls; the agent decides when to call finish.

Environment Difficulty

One-shot performance from the original KernelBench paper (fast_1: the fraction of kernels that are functionally correct and faster than the PyTorch eager baseline):

| Model | Level 1 | Level 2 | Level 3 |
| --- | --- | --- | --- |
| DeepSeek R1 | 12% | 36% | 2% |
| OpenAI o1 | 10% | 24% | 12% |
| Claude 3.5 Sonnet | 10% | 7% | 2% |
| DeepSeek V3 | 6% | 4% | 8% |
| GPT-4o | 4% | 5% | 0% |
| Llama 3.1-405B | 3% | 0% | 2% |
| Llama 3.1-70B | 3% | 0% | 0% |

With 10 turns of iterative refinement (execution + profiler feedback), DeepSeek R1 improves to 43% on Level 1, 72% on Level 2, and 18% on Level 3. Writing functionally correct kernels remains the primary challenge, as models struggle with CUDA correctness even when compilation succeeds.

Other Environment Requirements

There are no further environment requirements beyond the OpenReward platform.

Safety

Agents in KernelBench write and execute CUDA code inside isolated sandboxes. The sandboxes are ephemeral and destroyed after each tool call, limiting the blast radius of any generated code. The environment does not present direct safety risks beyond those inherent in arbitrary code execution, which is contained by the sandbox.

Citations

@article{ouyang2025kernelbench,
  title={KernelBench: Can LLMs Write Efficient GPU Kernels?},
  author={Ouyang, Anne and Guo, Simon and Arora, Simran and Zhang, Alex L. and Hu, William and R{\'e}, Christopher and Mirhoseini, Azalia},
  journal={arXiv preprint arXiv:2502.10517},
  year={2025}
}