As artificial intelligence transforms legal practice, deploying Large Language Models effectively has become critical. While LLMs show promise across legal tasks, challenges around factual accuracy and domain-specific reasoning persist, particularly for citation prediction—where authoritative references carry binding legal force. We introduce the AusLaw Citation Benchmark, comprising 55,000 real-world Australian instances and 18,677 unique citations—the largest jurisdiction-specific dataset for this task. We systematically compare prompting, retrieval, fine-tuning, and hybrid strategies, including instruction-tuned models, sparse and dense retrieval, and re-ranker ensembles. Our findings reveal that stand-alone generative models—whether general or law-specific—fail almost entirely, underscoring the risks of unaugmented deployment. Task-specific instruction tuning dramatically improves performance, BM25 outperforms dense embeddings in retrieval, and jurisdiction-specific pre-training surpasses larger but less targeted models. Hybrid approaches with trained re-rankers achieve the best results, yet a substantial 40\% performance gap remains, exposing the persistent long-tail challenge in citation prediction. These results reframe assumptions about scale, retrieval, and fine-tuning, and establish a foundation for building reliable, jurisdiction-aware legal AI systems.
@article{auslawcitebenchmark,
title={Legal Citation Prediction with LLMs: A Comparative Evaluation of Instruction Tuning, Retrieval, and Jurisdiction-Specific Pre-training on the AusLaw Citation Benchmark},
author={Jiuzhou Han and Paul Burgess and Ehsan Shareghi},
year={2026},
journal={Artificial Intelligence and Law}
}