A Quantitative Comparison of Large Language Models and Commercial Services for the Translation of Chinese Legal Texts

Authors

  • Fei Qu, Southwest University of Political Science and Law

Keywords:

Large Language Models, ChatGPT, BLEU, TER, legal translation

Abstract

The proliferation of Large Language Models (LLMs) presents transformative potential for professional domains, yet their application in the high-stakes field of legal translation requires rigorous empirical validation. This study conducts a quantitative comparison of translation quality between two leading LLMs (Gemini 2.5 Pro, ChatGPT 4o) and two reputable commercial translation (CT) services (PKU Law, Wolters Kluwer). The evaluation is based on English translations of the General Provisions of the Criminal Law of the People’s Republic of China, with quality assessed through the automated metrics of Bilingual Evaluation Understudy (BLEU) and Translation Edit Rate (TER). Statistical analysis of the four individual sources revealed significant performance differences, with Gemini producing superior output relative to ChatGPT and, on some measures, PKU Law. However, a subsequent comparison between the aggregated LLM and CT groups found no statistically significant difference in translation quality on either BLEU or TER scores. This study posits that this apparent parity is a methodological illusion stemming from the profound limitations of lexically based metrics: they reward the superficial fluency of LLM output but cannot assess functional equivalence, and thus fail to penalize critical semantic and legal errors. The study concludes that despite the impressive coherence of LLM outputs, the nuanced, jurisdiction-specific expertise of human professionals remains the indispensable arbiter of quality and validity in legal translation.
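For readers unfamiliar with the metrics, the sketch below illustrates the kind of evaluation pipeline the abstract describes: corpus-level BLEU and TER scores per system, plus a group-level significance test. The sacrebleu and SciPy libraries, the Mann-Whitney U test, and all sample strings are assumptions for illustration; the study's actual tooling, corpus, and statistical procedure are not specified in the abstract.

    # Minimal sketch of a BLEU/TER evaluation with a group significance test.
    # Assumptions: sacrebleu for metrics, SciPy's Mann-Whitney U as the test;
    # the strings below are placeholders, not text from the study's corpus.
    from sacrebleu.metrics import BLEU, TER
    from scipy.stats import mannwhitneyu

    # One reference translation per article, paired with the output of an
    # LLM and of a commercial translation (CT) service.
    references = [
        "This Law is enacted to punish crime and protect the people.",
        "All crimes must be investigated according to law.",
    ]
    llm_outputs = [
        "This Law is enacted in order to punish crime and protect the people.",
        "Every crime shall be investigated in accordance with the law.",
    ]
    ct_outputs = [
        "This Law is formulated to penalize crime and safeguard the people.",
        "All crimes must be investigated according to law.",
    ]

    bleu = BLEU(effective_order=True)  # effective_order suits sentence-level BLEU
    ter = TER()

    # Corpus-level scores summarize each system against the reference set.
    print("LLM:", bleu.corpus_score(llm_outputs, [references]),
          ter.corpus_score(llm_outputs, [references]))
    print("CT: ", bleu.corpus_score(ct_outputs, [references]),
          ter.corpus_score(ct_outputs, [references]))

    # Per-article scores permit a group-level comparison; with real data
    # there would be one score per article and system.
    llm_bleu = [bleu.sentence_score(h, [r]).score
                for h, r in zip(llm_outputs, references)]
    ct_bleu = [bleu.sentence_score(h, [r]).score
               for h, r in zip(ct_outputs, references)]
    stat, p = mannwhitneyu(llm_bleu, ct_bleu)
    print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")

A non-parametric test is used here because sentence-level BLEU and TER scores are bounded and often skewed, so a normality assumption may not hold; the study itself may have used a different procedure.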

Published

2025-09-30