
Analysis and Application of Large Language Models in Legal Judgment Prediction

Advisor: 雷欽隆

Abstract


With the rapid development of artificial intelligence (AI), professional fields continue to introduce new applications spanning machine learning, deep learning, and large language models (LLMs). Driven by significant gains in computing resources, AI technology has advanced at a remarkable pace. This study investigates the application of deep learning and LLMs to legal trials and judgment drafting, aiming to verify their feasibility and evaluate their practical performance.

In recent years, with the formal implementation of the lay judge system, trials have shifted from being led solely by professional judges to joint participation by lay citizens and judges. While this system strengthens the diversity and credibility of trials, it also brings a new challenge: each participant's understanding of the criminal charge and sentence may differ due to subjective perception, leading to inconsistent trial outcomes. How to use AI to assist trials and reduce the influence of subjective bias on outcomes has therefore become a topic worth studying in depth.

This study proposes a judicial decision support method based on the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers). In practice, BERT faces an input-length limitation: among the judgment documents collected for this study, as many as 85.43% exceed BERT's input limit. We therefore propose an improved method, "chunking BERT," which splits over-length text into chunks and uses contextual associations between chunks to extend the effective input length while preserving the model's predictive accuracy.

Taking the offenses of "injury" and "public endangerment" as examples, chunking BERT accurately predicts the criminal charge and the sentence from descriptions of the criminal facts. Judgment documents serve as the dataset for the fine-tuning stage of the language model; applying the circumstances and factual description of a crime to the fine-tuned model yields predictions of the charges and sentence a suspect may face, providing judges with a reference and mitigating the impact of divergent expectations and perceptions on trial outcomes. The results show that with chunking BERT, charge-prediction accuracy improves from 97.53% to 98.95%. For sentencing, accuracy on injury cases rises from 68.82% to 72.37%, and on public endangerment cases from 73.03% to 80.93%.

In addition, given the large caseload, judicial personnel must not only conduct trials but also spend considerable effort drafting judgments, which adds to their workload. To reduce this burden, this study uses the generative language model Llama (Large Language Model Meta AI) to generate judgments. Taking "injury" as an example, an LLM first extracts key information from a judgment, and an LLM then generates the judgment text. Building on Llama, this study applies supervised fine-tuning (SFT) and RLHF-PPO (Reinforcement Learning from Human Feedback with Proximal Policy Optimization) to produce efficient and accurate legal judgments. ROUGE is used as the evaluation metric to validate the generative models; taking Chinese Alpaca 2 7B as an example, ROUGE scores improve with both SFT and RLHF-PPO.

In summary, this study explores the feasibility and practical effectiveness of applying AI and LLM techniques in the legal domain. With their assistance, we hope to reduce the influence of human subjectivity in the trial process and, through automatic judgment generation, lighten the workload of judicial personnel, thereby further improving trial efficiency and quality.

Abstract (English)


With the rapid development of artificial intelligence (AI), various professional fields have continually introduced innovative applications covering multiple aspects, such as machine learning, deep learning, and large language models (LLMs). Driven by significant advancements in computing resources, AI technology has been evolving at an astonishing pace. This study examines the application of deep learning and LLMs in legal trials and judgment generation, aiming to validate their feasibility and assess their actual performance.

With the formal implementation of the lay judge system in recent years, the judicial process has shifted from being led solely by professional judges to a collaborative effort involving lay citizens and judges. While this system enhances the diversity and credibility of trials, it also presents new challenges: differences in subjective understanding among participants regarding criminal charges and sentencing may lead to inconsistent trial outcomes. Consequently, how to leverage AI to aid in trials and reduce the influence of subjective bias on the results has become a subject warranting in-depth research.

This study proposes a judicial decision support method based on the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers). In practical applications, BERT faces challenges posed by input length limitations. Among the judgment texts collected in this study, as many as 85.43% exceed the input length limit of the BERT model. To address this issue, we introduce an improved BERT model called "chunking BERT." This approach segments overly long texts into smaller chunks and leverages contextual associations to extend the input length while preserving the model's predictive accuracy. Using "injury" and "public endangerment" as examples, this study demonstrates how the chunking BERT approach can accurately predict criminal charges and sentencing based on descriptions of criminal facts.
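The thesis does not reproduce its implementation here, but the core chunking idea can be sketched in a few lines. The window size, overlap width, and mean-pooling aggregation below are illustrative assumptions, not the study's exact configuration:

```python
# Minimal sketch of "chunking BERT": split an over-length token sequence into
# overlapping windows, score each window with the classifier, then pool the
# per-chunk class scores into one document-level prediction.

def chunk_tokens(token_ids, max_len=510, overlap=64):
    """Split token ids into windows of at most max_len tokens, each sharing
    `overlap` tokens with the previous window to preserve context across
    chunk boundaries. 510 leaves room for BERT's [CLS] and [SEP] tokens."""
    if len(token_ids) <= max_len:
        return [token_ids]
    chunks, start, step = [], 0, max_len - overlap
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += step
    return chunks

def pool_chunk_scores(per_chunk_scores):
    """Mean-pool per-chunk class scores into one document-level score vector;
    the argmax of the pooled vector gives the predicted charge or sentence class."""
    n_classes = len(per_chunk_scores[0])
    return [sum(scores[c] for scores in per_chunk_scores) / len(per_chunk_scores)
            for c in range(n_classes)]
```

In a full pipeline, each window (wrapped with [CLS]/[SEP]) would pass through the fine-tuned BERT classifier to produce `per_chunk_scores`; other aggregation schemes (max-pooling, attention over chunk embeddings) are equally plausible readings of "contextual associations."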
During the fine-tuning stage, judgment documents serve as the dataset for the language model. Applying the circumstances and factual descriptions of crimes to the fine-tuned language model makes it possible to predict the charges and sentencing that a defendant may face, providing a reference for decision-makers and mitigating the impact of differing individual expectations and perceptions on trial outcomes. The results show that after adopting chunking BERT, the accuracy of predicting criminal charges improves from 97.53% to 98.95%. Regarding sentencing predictions, accuracy for injury cases increases from 68.82% to 72.37%, while accuracy for public endangerment cases increases from 73.03% to 80.93%.

In addition, due to the large number of judicial cases, judicial personnel must not only handle trial-related tasks but also spend a considerable amount of time drafting judgments, which increases their workload. To reduce this burden, this study employs the generative language model Llama (Large Language Model Meta AI) to generate judgments. Using "injury" as an example, an LLM is utilized to extract the key information from a judgment. Then, that same model is used to generate the judgment text. Building upon Llama, this study adopts supervised fine-tuning (SFT) and RLHF-PPO (Reinforcement Learning from Human Feedback with Proximal Policy Optimization) to produce legal judgments that are both highly efficient and accurate. This study uses ROUGE as an evaluation metric to validate the effectiveness of generative models. Taking Chinese Alpaca 2 7B as an example, ROUGE scores improved through SFT and RLHF-PPO training.

In summary, this study explores the feasibility and practical effectiveness of applying AI and LLMs in the legal domain. By leveraging these technologies, we aim to significantly reduce the impact of human subjectivity in the trial process.
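At the heart of the RLHF-PPO step is PPO's clipped surrogate objective, which limits how far each update can move the policy away from the rollout policy. A self-contained sketch of that objective for a single sampled token; the clip range and the toy values in the comments are illustrative assumptions, not the thesis's training hyperparameters:

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate loss for one sampled action (token).

    logp_new / logp_old: log-probability of the sampled token under the
    current policy and the rollout (old) policy.
    advantage: advantage estimate; in RLHF this is derived from the reward
    model's score for the generated judgment, minus a baseline.
    Returns the negated clipped objective, to be minimized by the optimizer.
    """
    ratio = math.exp(logp_new - logp_old)          # importance ratio
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clamp to [1-eps, 1+eps]
    # Pessimistic bound: take the smaller (worse) of the two surrogates.
    return -min(ratio * advantage, clipped * advantage)
```

In practice RLHF also adds a per-token KL penalty against a frozen reference model (usually the SFT checkpoint) to the reward, so the policy does not drift into text the reward model scores well but the reference model finds implausible.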
Additionally, through automatic judgment generation, we seek to reduce the workload of judicial personnel, thereby further improving the efficiency and quality of trials.
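The ROUGE evaluation mentioned above is straightforward to reproduce. A minimal ROUGE-1 F1 sketch over whitespace tokens; the thesis would more plausibly score Chinese judgment text on characters or segmented words, so the tokenization here is an assumption for illustration:

```python
from collections import Counter

def rouge_1_f1(candidate, reference):
    """ROUGE-1 F1: unigram overlap between a generated text and a reference.

    Precision = overlap / candidate unigrams, recall = overlap / reference
    unigrams; F1 is their harmonic mean. Whitespace tokenization is used
    here for simplicity; for Chinese, replace .split() with character-level
    or word-segmented tokens.
    """
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same recipe with bigrams and the longest common subsequence, respectively, which is why improvements after SFT and RLHF-PPO show up consistently across the ROUGE family.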
