Technical Lead & Researcher · JuriLens · Legal-AI research

Turning 10,000-word judgments into structured legal intelligence

JuriLens is published, peer-reviewed research — a RAG and prompt-engineering pipeline that reads Sri Lankan court judgments, summarizes them, and tags them against a custom 89-category taxonomy. I built and evaluated the whole system end-to-end for about $30.

Role

Technical Lead & Researcher

Corpus

1,200+ judgments

Taxonomy

89 categories

Cost

~$30 to build & test

RAG + prompt-engineering pipeline
preprocess → retrieve → summarize → synthesize → classify

The paper ships a working system alongside a frank account of its own evaluation limits — self-evaluation bias, a small benchmark. That candor is the point: I’d rather publish what the numbers can’t prove than dress them up.

Published as part of my MSc in Big Data Analytics at Robert Gordon University: “Utilizing RAG and Prompt Engineering for Categorization & Summarization of Judgments in the Sri Lankan Jurisprudence” (ICIIT Conclave 2024), available on ResearchGate.

The problem

Legal information access in Sri Lanka is broken

The Supreme Court and Court of Appeal sites are the primary source for judicial precedent — and they suffer from poor search, weak interfaces, and frequent downtime. The judgments themselves routinely run past 10,000 words of dense, domain-specific language across criminal, civil, constitutional, and procedural matters. Finding the right precedent means reading lengthy documents by hand, or hoping the search works that day.

What was missing was a way to turn that unstructured text into something searchable: structured, multi-label categorization grounded in how lawyers actually think about case law.

How I approached it

Preserve legal context at every stage

At the time, a full judgment was too long to hand a model whole, and legal nuance was too important to lose in naive summarization. So I built a hierarchical pipeline that breaks the problem into stages, each tuned for one job: clean and chunk the document, summarize each chunk with domain-aware prompts, synthesize a coherent whole, then categorize. I processed 1,200+ judgments (2009–2024), built a custom 89-category taxonomy by engaging with Sri Lankan legal codes, and evaluated five LLMs — all on a developing-country budget.
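The staged design can be sketched as a chain of swappable functions. This is a minimal illustration, not the code from the paper: `run_pipeline`, `JudgmentResult`, and the four stage callables are hypothetical names, and categorizing from the synthesized summary rather than the full text is an assumption about the ordering described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JudgmentResult:
    summary: str
    categories: List[str]


def run_pipeline(
    text: str,
    chunk: Callable[[str], List[str]],
    summarize_chunk: Callable[[str], str],
    synthesize: Callable[[List[str]], str],
    categorize: Callable[[str], List[str]],
) -> JudgmentResult:
    """Run the stages in order; every stage is an injected callable."""
    chunks = chunk(text)                                    # clean and chunk
    chunk_summaries = [summarize_chunk(c) for c in chunks]  # one summary per chunk
    summary = synthesize(chunk_summaries)                   # merge into one summary
    # Assumption: categories are assigned from the synthesized summary.
    return JudgmentResult(summary=summary, categories=categorize(summary))
```

Keeping each stage behind its own callable means a prompt or a model can be swapped for one stage without touching the others, which is also what makes a multi-model comparison cheap to run.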

The hard part of legal AI was never the model. It was keeping the law intact while the document got smaller.

What I built

A RAG and prompt-engineering pipeline for legal documents

Three pieces did the work: the pipeline, the taxonomy, and a five-model evaluation I read with its limits in plain view.

1 · The pipeline

Preprocess and chunk each judgment into 900–1000-token blocks (cl100k_base tokenizer) that respect sentence boundaries, so a legal argument never splits mid-reasoning. Summarize each chunk with prompts engineered to keep legal terminology, procedural detail, and judicial reasoning intact. Synthesize the chunk summaries into one coherent document summary. Then run multi-label categorization. Each stage is tuned for its own job; legal context survives end to end.

Pipeline diagram
preprocess / chunk → summarize → synthesize → classify
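A sentence-respecting chunker along these lines might look like the sketch below. It is an illustration under assumptions, not the published implementation: `split_sentences` and `chunk_by_sentences` are hypothetical names, the regex splitter is deliberately naive (abbreviations such as "v." or "No." would need special handling in real judgments), and the token counter is injected so the example runs without extra dependencies. Only the upper bound is enforced; greedy packing tends to fill chunks close to that ceiling.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_sentences(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 1000,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        # A sentence longer than max_tokens ends up as its own oversized chunk.
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# With tiktoken installed, the cl100k_base counter could be plugged in like this:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   chunks = chunk_by_sentences(text, lambda s: len(enc.encode(s)))
```

Summing per-sentence counts slightly approximates the token count of the joined chunk, so a small safety margin below the model limit is a sensible default.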

2 · The 89-category taxonomy

A custom classification system grounded in Sri Lankan legal codes, built with legal-expert consultation: primary domains (criminal, family, property, commercial, labour) plus cross-cutting dimensions (procedural, behavioral, entity, outcome). A single case can be tagged across all of them at once — criminal law, the entities involved, the procedural posture, the outcome — building a rich, searchable structure that didn’t exist for this jurisdiction before.

89-category taxonomy treemap
primary domains + cross-cutting dimensions
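One way to picture tagging across dimensions is a per-case map from dimension to a set of labels. The sketch below is hypothetical: the dimension names follow the description above, but `TaggedJudgment`, `search`, the `domain` key, and the labels used in the usage note are illustrative and not taken from the actual 89-category taxonomy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# "domain" stands in for the primary domains; the rest are the cross-cutting dimensions.
DIMENSIONS = ("domain", "procedural", "behavioral", "entity", "outcome")


@dataclass
class TaggedJudgment:
    case_id: str
    tags: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, dimension: str, label: str) -> None:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        self.tags.setdefault(dimension, set()).add(label)

    def matches(self, **query: str) -> bool:
        """True if every dimension=label pair in the query is tagged on this case."""
        return all(label in self.tags.get(dim, set()) for dim, label in query.items())


def search(cases: List[TaggedJudgment], **query: str) -> List[str]:
    """Return the ids of cases that satisfy every dimension=label filter."""
    return [c.case_id for c in cases if c.matches(**query)]
```

With cases tagged this way, a call such as `search(cases, domain="criminal", outcome="appeal allowed")` filters on two dimensions at once, which is the kind of query a flat, single-label scheme cannot answer.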

3 · Comparative model evaluation

I evaluated five LLMs on summarization and categorization. The result that held up best came from the more reliable metric: a small open-source model (Meta-Llama-3-8B) edged out the proprietary flagship on summarization semantic similarity (0.908 vs GPT-4o’s 0.885), a signal that affordable legal AI is technically feasible, and one since independently validated by Lawma and SaulLM. I read the categorization scores with explicit caveats, because the benchmark that produced them had real limits.

5-model comparison
Meta-Llama-3-8B 0.908 · GPT-4o 0.885 · + 3 models
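For readers unfamiliar with semantic-similarity scoring, a common formulation is the cosine similarity between an embedding of the generated summary and an embedding of a reference text, averaged over documents. Whether the paper computed its scores exactly this way is an assumption here; the embedding model is left out, and `cosine_similarity` and `mean_similarity` are hypothetical helper names operating on plain vectors.

```python
import math
from typing import Iterable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_similarity(
    pairs: Iterable[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Average cosine similarity over (summary, reference) embedding pairs."""
    scores = [cosine_similarity(a, b) for a, b in pairs]
    if not scores:
        raise ValueError("no pairs to score")
    return sum(scores) / len(scores)
```

Because the score depends only on embeddings of the texts, it does not rely on one model's labels as ground truth, which is part of why it is the sturdier of the two metrics.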

The categorization F1 table looks lopsided because it is: GPT-4o both generated the ground-truth labels and was scored against them, so its lead is inflated by self-evaluation bias, and 60 documents across 89 categories is too thin for statistical power. I said all of that in the paper. The summarization result is the one to trust.
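To make the caveat concrete, here is a minimal sketch of multi-label F1 over sets of tags. The averaging scheme is an assumption (micro-averaging is shown; this page does not say which one the paper used), and `micro_prf` is a hypothetical helper name.

```python
from typing import List, Set, Tuple


def micro_prf(
    gold: List[Set[str]], predicted: List[Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 across all documents and labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same documents")
    tp = sum(len(g & p) for g, p in zip(gold, predicted))  # labels in both sets
    fp = sum(len(p - g) for g, p in zip(gold, predicted))  # predicted, not in gold
    fn = sum(len(g - p) for g, p in zip(gold, predicted))  # in gold, not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Scoring any label set against itself returns 1.0 on all three numbers. That is the limiting case of self-evaluation bias: the closer a model's outputs sit to the process that produced the gold labels, the more the score measures agreement with itself rather than correctness.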

The outcome

Structured, searchable, and honestly evaluated

1,200+

Court judgments (2009–2024) turned from unstructured text into categorized, searchable intelligence.

~$30

Total API cost to build and evaluate the whole system — evidence that affordable legal AI is viable for the jurisdictions that need it most.

Candid

Published with a frank account of its evaluation limits (self-evaluation bias, a small benchmark), which is what makes the summarization result credible.

The open-source finding aged well — independently validated by Lawma (ICLR 2025) and SaulLM (NeurIPS 2024). The taxonomy still fills a gap no public tool has filled for this jurisdiction.

Built with

RAG + prompt engineering
Intelligent chunking (cl100k_base)
Domain-aware prompting
Multi-label categorization
5-model evaluation
Serverless / API-driven

