PLawBench
Description
PLawBench is an environment for evaluating LLMs in real-world Chinese legal practice. Agents solve tasks across three categories — practical case analysis, public legal consultation, and legal document generation — and are graded by an LLM judge against expert-designed rubrics covering fine-grained legal reasoning.
Capabilities
- Legal case analysis with structured reasoning (conclusion, facts, reasoning, statutes)
- Lawyer-style client consultation questioning
- Legal document drafting (complaints, defense statements)
Compute Requirements
No special compute requirements. The environment makes OpenAI API calls for LLM-based judging.
Tasks
- Single split: test (280 tasks total)
  - Case analysis: 250 tasks — given a legal scenario and question, produce a structured analysis
  - Legal consultation: 18 tasks — given a client's statement, ask targeted questions to gather information
  - Defendant statements: 6 tasks — draft a defense statement from a defendant's account
  - Plaintiff statements: 6 tasks — draft a complaint from a plaintiff's account
Reward Structure
Continuous reward in [0.0, 1.0], computed as total_points / max_points (see the sketch below).
- Case analysis: Multi-dimension rubric grading (conclusion, facts, reasoning, statutes) with per-dimension scores aggregated
- Legal consultation: Rubric-based matching of lawyer questions against expert-designed information-gathering criteria
- Document generation: Rubric-based scoring of legal document quality against structured criteria
All evaluation is performed by an LLM judge (gpt-5-mini) following the original PLawBench evaluation protocol.
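To make the normalization concrete, here is a minimal sketch of aggregating per-dimension rubric scores into the final reward. The dimension names and point values are hypothetical examples; only the total_points / max_points normalization follows the description above.

```python
# Sketch of the reward normalization described above. The rubric dimensions
# and point values below are made-up examples; only the final
# total_points / max_points normalization mirrors the environment's reward.

def normalize_reward(dimension_scores: dict[str, float],
                     dimension_max: dict[str, float]) -> float:
    """Aggregate per-dimension rubric scores into a reward in [0.0, 1.0]."""
    total_points = sum(dimension_scores.values())
    max_points = sum(dimension_max.values())
    return total_points / max_points if max_points > 0 else 0.0


# Hypothetical case-analysis rubric with four dimensions.
scores = {"conclusion": 3.0, "facts": 2.5, "reasoning": 4.0, "statutes": 1.0}
maxima = {"conclusion": 4.0, "facts": 3.0, "reasoning": 5.0, "statutes": 3.0}
print(normalize_reward(scores, maxima))  # 0.7
```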
Data
Source: SKYLENAGE-AI/PLawBench
- practical_case_analysis_250.jsonl — 250 case analysis tasks with structured rubrics
- public_legal_consultation_18.json — 18 consultation scenarios with scoring rubrics
- Defendants_Statement.json — 6 defendant statement tasks with rubrics
- Plantiffs_Statement.json — 6 plaintiff statement tasks with rubrics
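A hedged sketch of pulling one of these files from the Hugging Face dataset repo. It assumes the files sit at the repository root and that huggingface_hub is installed; the environment itself may load data differently.

```python
# Sketch: download and read the case-analysis split from the dataset repo.
# Assumes the JSONL file lives at the root of SKYLENAGE-AI/PLawBench; the
# environment's own data loading path may differ.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="SKYLENAGE-AI/PLawBench",
    filename="practical_case_analysis_250.jsonl",
    repo_type="dataset",
)

with open(path, encoding="utf-8") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(tasks)} case analysis tasks")
```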
Tools
- submit(answer: str) — Submit a legal answer for rubric-based LLM grading. Ends the episode.
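For illustration, the tool could be exposed to a model in the standard OpenAI function-calling shape. Only the tool name and its single string argument come from the description above; the schema framing is an assumption.

```python
# Illustrative tool schema for submit(answer: str). Only the name and the
# single string argument come from the environment description; the
# JSON-schema wrapper is the generic OpenAI function-calling format.
submit_tool = {
    "type": "function",
    "function": {
        "name": "submit",
        "description": "Submit a legal answer for rubric-based LLM grading. Ends the episode.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string", "description": "The final legal answer."},
            },
            "required": ["answer"],
        },
    },
}
```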
Time Horizon
Single-turn. The agent receives the legal task and submits one answer.
Environment Difficulty
High. Tasks require domain expertise in Chinese law, including knowledge of specific statutes, legal reasoning patterns, and professional document formats.
Other Environment Requirements
- OpenAI API key — required for the LLM judge (openai_api_key in secrets)
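A minimal sketch of supplying the judge key, assuming the judge client can be constructed from an openai_api_key secret read out of the process environment; the exact secrets mechanism may differ from this local setup.

```python
# Sketch: provide the judge's OpenAI API key. The environment expects an
# `openai_api_key` secret; reading it from the process environment here is
# an assumption about the local setup, not the only supported mechanism.
import os
from openai import OpenAI

secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}

# The LLM judge (gpt-5-mini per the description) is driven through the
# standard OpenAI client.
judge_client = OpenAI(api_key=secrets["openai_api_key"])
```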
Safety
Tasks involve Chinese legal scenarios including civil disputes, family law, and corporate liability. All scenarios are fictional or anonymized. The environment evaluates legal reasoning quality, not the provision of actual legal advice.
Citations
@article{shi2026plawbench,
  title={PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice},
  author={Shi, Yuzhen and Liu, Huanghai and Hu, Yiran and Song, Gaojie and Xu, Xinran and Ma, Yubo and Tang, Tianyi and Zhang, Li and Chen, Qingjing and Feng, Di and Lv, Wenbo and Wu, Weiheng and Yang, Kexin and Yang, Sen and Wang, Wei and Shi, Rongyao and Qiu, Yuanyang and Qi, Yuemeng and Zhang, Jingwen and Sui, Xiaoyu and Chen, Yifan and Zhang, Yi and Yang, An and Yu, Bowen and Liu, Dayiheng and Lin, Junyang and Shen, Weixing and Zhao, Bing and Clarke, Charles L.A. and Wei, Hu},
  journal={arXiv preprint arXiv:2601.16669},
  year={2026}
}