reasoning-token-allocation
This repo provides an OpenReward-compatible ORS environment for learning reasoning-token caps on GSM8K tasks.
The objective is to keep accuracy high while minimizing reasoning token spend.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Dataset files
train-00000-of-00001.parquet
test-00000-of-00001.parquet
Place both files in the project root for local runs, or set ORWD_DATA_DIR to the directory containing the files.
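The lookup described above can be sketched as follows. This is a minimal illustration, not the code in server.py: the helper names resolve_data_dir and missing_files are hypothetical, and the fallback to the project root when ORWD_DATA_DIR is unset is an assumption based on the sentence above.

```python
import os
from pathlib import Path

# The two dataset files named in this README.
DATA_FILES = (
    "train-00000-of-00001.parquet",
    "test-00000-of-00001.parquet",
)


def resolve_data_dir(env=None, project_root="."):
    """Return ORWD_DATA_DIR if set and non-empty, else the project root (assumed fallback)."""
    env = os.environ if env is None else env
    return Path(env.get("ORWD_DATA_DIR") or project_root)


def missing_files(data_dir):
    """List the dataset files that are not present in data_dir."""
    return [name for name in DATA_FILES if not (Path(data_dir) / name).is_file()]


if __name__ == "__main__":
    data_dir = resolve_data_dir()
    missing = missing_files(data_dir)
    if missing:
        print(f"Missing in {data_dir}: {', '.join(missing)}")
    else:
        print(f"Both dataset files found in {data_dir}")
```

Running a check like this before starting the server can help tell a misplaced file apart from a wrong ORWD_DATA_DIR value.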
Run server
python3 server.py
The server runs on http://localhost:8080.
Docker
docker build -t reasoning-token-allocation .
docker run --rm -p 8080:8080 -e ORWD_DATA_DIR=/orwd_data -v "$(pwd):/orwd_data" reasoning-token-allocation
Verify ORS endpoints
curl http://localhost:8080/health
curl http://localhost:8080/list_environments
curl http://localhost:8080/gsm8ktokencapenvironment/splits
curl http://localhost:8080/gsm8ktokencapenvironment/tools
Train policy through ORS client
python3 train.py --steps 30000 --base-url http://localhost:8080 --env-name gsm8ktokencapenvironment
Artifacts are saved to artifacts/policy.pt.
Evaluate learned policy
python3 evaluate.py --mode policy --checkpoint artifacts/policy.pt --episodes 2000 --base-url http://localhost:8080 --env-name gsm8ktokencapenvironment
Evaluate fixed cap baselines
python3 evaluate.py --mode fixed --episodes 2000 --base-url http://localhost:8080 --env-name gsm8ktokencapenvironment
ORS behavior
- Splits: train, test
- Prompt: current GSM8K question
- Tools:
  - set_token_cap(cap) chooses a reasoning-token cap
  - answer(answer) ends episode and returns reward
- Cap choices: {128, 512, 1024, 2048, 4096}
- Reward: accuracy * (1 - allocated_tokens/4096)
- ToolOutput includes reward, finished, and metadata with correct, cap, allocated_tokens
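To make the reward trade-off concrete, the formula above can be evaluated for each cap choice. This is a sketch under two assumptions that the README does not confirm: accuracy is 1.0 for a correct answer and 0.0 otherwise, and allocated_tokens equals the chosen cap. The function name reward is illustrative and is not taken from the repo.

```python
CAP_CHOICES = (128, 512, 1024, 2048, 4096)
MAX_CAP = 4096  # denominator in the reward formula


def reward(correct: bool, allocated_tokens: int) -> float:
    """accuracy * (1 - allocated_tokens / 4096), with accuracy assumed to be 0 or 1."""
    accuracy = 1.0 if correct else 0.0
    return accuracy * (1 - allocated_tokens / MAX_CAP)


if __name__ == "__main__":
    # Assumes allocated_tokens is exactly the chosen cap.
    for cap in CAP_CHOICES:
        print(f"cap={cap:>4}  reward if correct={reward(True, cap):.5f}")
```

Under these assumptions a correct answer at the 4096 cap earns a reward of 0, the same as a wrong answer, so the policy is only rewarded when it solves the question with a smaller cap.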