PLawBench
Description
PLawBench is an environment for evaluating LLMs in real-world Chinese legal practice. Agents solve tasks across three categories — practical case analysis, public legal consultation, and legal document generation — and are graded by an LLM judge against expert-designed rubrics covering fine-grained legal reasoning.
Capabilities
- Legal case analysis with structured reasoning (conclusion, facts, reasoning, statutes)
- Lawyer-style client consultation questioning
- Legal document drafting (complaints, defense statements)
Compute Requirements
No special compute requirements. The environment makes OpenAI API calls for LLM-based judging.
Tasks
- Single split: test (280 tasks total)
  - Case analysis: 250 tasks — given a legal scenario and question, produce a structured analysis
  - Legal consultation: 18 tasks — given a client's statement, ask targeted questions to gather information
  - Defendant statements: 6 tasks — draft a defense statement from a defendant's account
  - Plaintiff statements: 6 tasks — draft a complaint from a plaintiff's account
Reward Structure
Continuous reward in [0.0, 1.0], computed as total_points / max_points (see the sketch below).
- Case analysis: Multi-dimension rubric grading (conclusion, facts, reasoning, statutes) with per-dimension scores aggregated
- Legal consultation: Rubric-based matching of lawyer questions against expert-designed information-gathering criteria
- Document generation: Rubric-based scoring of legal document quality against structured criteria
All evaluation is performed by an LLM judge (gpt-5-mini) following the original PLawBench evaluation protocol.
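To make the normalization concrete, here is a minimal sketch of aggregating per-dimension rubric scores into the final reward. The dimension names and point values are hypothetical examples; only the total_points / max_points normalization follows the description above.

```python
# Sketch of the reward normalization described above. The rubric dimensions
# and point values below are made-up examples; only the final
# total_points / max_points normalization mirrors the environment's reward.

def normalize_reward(dimension_scores: dict[str, float],
                     dimension_max: dict[str, float]) -> float:
    """Aggregate per-dimension rubric scores into a reward in [0.0, 1.0]."""
    total_points = sum(dimension_scores.values())
    max_points = sum(dimension_max.values())
    return total_points / max_points if max_points > 0 else 0.0


# Hypothetical case-analysis rubric with four dimensions.
scores = {"conclusion": 3.0, "facts": 2.5, "reasoning": 4.0, "statutes": 1.0}
maxima = {"conclusion": 4.0, "facts": 3.0, "reasoning": 5.0, "statutes": 3.0}
print(normalize_reward(scores, maxima))  # 0.7
```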
Data
Source: SKYLENAGE-AI/PLawBench
- practical_case_analysis_250.jsonl — 250 case analysis tasks with structured rubrics
- public_legal_consultation_18.json — 18 consultation scenarios with scoring rubrics
- Defendants_Statement.json — 6 defendant statement tasks with rubrics
- Plantiffs_Statement.json — 6 plaintiff statement tasks with rubrics
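A hedged sketch of pulling one of these files from the Hugging Face dataset repo. It assumes the files sit at the repository root and that huggingface_hub is installed; the environment itself may load data differently.

```python
# Sketch: download and read the case-analysis split from the dataset repo.
# Assumes the JSONL file lives at the root of SKYLENAGE-AI/PLawBench; the
# environment's own data loading path may differ.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="SKYLENAGE-AI/PLawBench",
    filename="practical_case_analysis_250.jsonl",
    repo_type="dataset",
)

with open(path, encoding="utf-8") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(tasks)} case analysis tasks")
```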
Tools
- submit(answer: str) — Submit a legal answer for rubric-based LLM grading. Ends the episode.
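For illustration, the tool could be exposed to a model in the standard OpenAI function-calling shape. Only the tool name and its single string argument come from the description above; the schema framing is an assumption.

```python
# Illustrative tool schema for submit(answer: str). Only the name and the
# single string argument come from the environment description; the
# JSON-schema wrapper is the generic OpenAI function-calling format.
submit_tool = {
    "type": "function",
    "function": {
        "name": "submit",
        "description": "Submit a legal answer for rubric-based LLM grading. Ends the episode.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string", "description": "The final legal answer."},
            },
            "required": ["answer"],
        },
    },
}
```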
Time Horizon
Single-turn. The agent receives the legal task and submits one answer.
Environment Difficulty
High. Tasks require domain expertise in Chinese law, including knowledge of specific statutes, legal reasoning patterns, and professional document formats.
Other Environment Requirements
- OpenAI API key — required for the LLM judge (openai_api_key in secrets)
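A minimal sketch of supplying the judge key, assuming the judge client can be constructed from an openai_api_key secret read out of the process environment; the exact secrets mechanism may differ from this local setup.

```python
# Sketch: provide the judge's OpenAI API key. The environment expects an
# `openai_api_key` secret; reading it from the process environment here is
# an assumption about the local setup, not the only supported mechanism.
import os
from openai import OpenAI

secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}

# The LLM judge (gpt-5-mini per the description) is driven through the
# standard OpenAI client.
judge_client = OpenAI(api_key=secrets["openai_api_key"])
```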
Safety
Tasks involve Chinese legal scenarios including civil disputes, family law, and corporate liability. All scenarios are fictional or anonymized. The environment evaluates legal reasoning quality, not the provision of actual legal advice.
Citations
@article{shi2026plawbench,
  title={PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice},
  author={Shi, Yuzhen and Liu, Huanghai and Hu, Yiran and Song, Gaojie and Xu, Xinran and Ma, Yubo and Tang, Tianyi and Zhang, Li and Chen, Qingjing and Feng, Di and Lv, Wenbo and Wu, Weiheng and Yang, Kexin and Yang, Sen and Wang, Wei and Shi, Rongyao and Qiu, Yuanyang and Qi, Yuemeng and Zhang, Jingwen and Sui, Xiaoyu and Chen, Yifan and Zhang, Yi and Yang, An and Yu, Bowen and Liu, Dayiheng and Lin, Junyang and Shen, Weixing and Zhao, Bing and Clarke, Charles L.A. and Wei, Hu},
  journal={arXiv preprint arXiv:2601.16669},
  year={2026}
}