As artificial intelligence transforms legal practice, deploying Large Language Models effectively has become critical. While LLMs show promise across legal tasks, challenges around factual accuracy and domain-specific reasoning persist, particularly for citation prediction—where authoritative references carry binding legal force. We introduce the AusLaw Citation Benchmark, comprising 55,000 real-world Australian instances and 18,677 unique citations—the largest jurisdiction-specific dataset for this task. We systematically compare prompting, retrieval, fine-tuning, and hybrid strategies, including instruction-tuned models, sparse and dense retrieval, and re-ranker ensembles. Our findings reveal that stand-alone generative models—whether general or law-specific—fail almost entirely, underscoring the risks of unaugmented deployment. Task-specific instruction tuning dramatically improves performance, BM25 outperforms dense embeddings in retrieval, and jurisdiction-specific pre-training surpasses larger but less targeted models. Hybrid approaches with trained re-rankers achieve the best results, yet a substantial 40\% performance gap remains, exposing the persistent long-tail challenge in citation prediction. These results reframe assumptions about scale, retrieval, and fine-tuning, and establish a foundation for building reliable, jurisdiction-aware legal AI systems.
@article{auslawcitebenchmark,
title={Legal Citation Prediction with LLMs: A Comparative Evaluation of Instruction Tuning, Retrieval, and Jurisdiction-Specific Pre-training on the AusLaw Citation Benchmark},
author={Jiuzhou Han and Paul Burgess and Ehsan Shareghi},
year={2026},
journal={Artificial Intelligence and Law}
}