A Comparative Analysis of Large Language Models for Extracting Judicial Entities

Authors

  • Amir Siddique, Department of Applied Computing Technology, Faculty of Information Technology, University of Central Punjab, Lahore, Pakistan. Email: amir017@hotmail.com
  • Ali Saeed, Department of Software Engineering, Faculty of Information Technology, University of Central Punjab, Lahore, Pakistan

DOI:

https://doi.org/10.63163/jpehss.v3i4.978

Abstract

Legal judgments contain rich structured information, but that content is typically embedded in long, formal narrative text. Named Entity Recognition (NER) is therefore a practical building block for legal search, analytics, and document understanding. In Pakistan, however, there is limited comparative evidence on how well current large language model (LLM) services extract judicial entities from Lahore High Court (LHC) judgments under consistent experimental conditions. This paper benchmarks four LLM families (ChatGPT, Gemini, Grok, and DeepSeek) for prompt-only judicial NER on a corpus of 500 LHC judgments. A 22-type entity schema is defined to reflect common information needs in case law, including parties, judges, citations, acts, sections, and dates. Gold labels are produced as token-level IOB tags and then converted into a gold JSON representation with reconstructed entity spans and character offsets. For model inference, a batch API application submits a unified prompt template to each model and converts responses into a standardized prediction JSON format. A separate evaluation application matches the prediction JSON against the gold JSON using strict span-level exact-match rules and reports micro-averaged precision, recall, and F1, supported by qualitative error analysis. Results show that the top three models perform closely under strict scoring: Grok achieves the highest micro F1 (0.6854), followed by Gemini (0.6820) and ChatGPT (0.6783), while DeepSeek scores lower (0.5790). ChatGPT produces the highest precision (0.7366), whereas Grok and Gemini achieve higher recall (0.6628 and 0.6534, respectively), indicating different operating points that matter for deployment. The error analysis highlights recurring legal-specific challenges, particularly span-boundary drift, variability in citation formatting, and confusion among closely related legal labels.
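
As a reading aid, the pipeline in the abstract reduces to two mechanical steps: decoding token-level IOB tags into labeled entity spans with character offsets, and scoring predictions against gold spans under strict exact match with micro-averaged precision, recall, and F1. The Python sketch below illustrates both steps; all function names, the tuple layout, and the example tags are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the evaluation pipeline described in the abstract.
# Everything here (names, tuple layouts, the example) is a hypothetical
# illustration, not the authors' code.

def iob_to_spans(tokens):
    """Decode token-level IOB tags into (label, start, end) entity spans.

    `tokens` is a list of (text, start, end, tag) tuples with character
    offsets, e.g. ("Lahore", 0, 6, "B-COURT").
    """
    spans, current = [], None
    for _text, start, end, tag in tokens:
        if tag.startswith("B-"):               # a new entity opens
            if current:
                spans.append(tuple(current))
            current = [tag[2:], start, end]
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[2] = end                   # extend the open entity
        else:                                  # "O" or an inconsistent I- tag
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans

def micro_prf(gold_spans, pred_spans):
    """Strict span-level exact match: a predicted span counts as a true
    positive only if its label, start, and end all agree with a gold span.
    For corpus-level micro-averaging, pass spans pooled over all documents
    (e.g. tagged with a document id) so counts are summed before the ratios:
    P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R)."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

if __name__ == "__main__":
    gold = iob_to_spans([
        ("Lahore", 0, 6, "B-COURT"), ("High", 7, 11, "I-COURT"),
        ("Court", 12, 17, "I-COURT"), ("held", 18, 22, "O"),
    ])
    pred = [("COURT", 0, 17)]                  # exact match on label and offsets
    print(micro_prf(gold, pred))               # -> (1.0, 1.0, 1.0)
```

Under this strict rule a span that drifts by even one token scores zero, which is why the span-boundary drift noted in the error analysis depresses both precision and recall directly.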

Published

2025-12-30

How to Cite

Siddique, A., & Saeed, A. (2025). A Comparative Analysis of Large Language Models for Extracting Judicial Entities. Physical Education, Health and Social Sciences, 3(4), 831-857. https://doi.org/10.63163/jpehss.v3i4.978