LLM Workflow for Legal Doc Review
> We need an LLM workflow for legal doc review. How should we structure retrieval, validation, and human review checkpoints?
Sonnet Default gave solid practical guidance: split documents by clause rather than fixed length, use hybrid search to find relevant text, and route anything uncertain to a human. It correctly identified confidence scoring and hallucination as the main things to watch.
Sonnet Deepthink caught a specific failure that Sonnet Default missed entirely. A contract’s critical carve-out lived in Schedule 3B — an exhibit that was poorly scanned, got ranked low in search, and fell outside the context window. The AI never saw it. The confidence score still came back high, because the main contract text was retrieved cleanly. The system passed. The attorney never read Schedule 3B.
The key finding: a high confidence score only tells you the AI analyzed what it received well. It says nothing about whether the right documents were retrieved in the first place. Those are two different checks, and they need to be treated separately.
| Assumption | Method | Finding |
| --- | --- | --- |
| “LLM hallucination is the main risk” | Causal chain analysis | Reversed — corpus definition and reviewer fatigue are higher-risk; hallucination is third |
| “Human checkpoint = human reviews document” | Pre-mortem + queue modeling | A 2,400-doc queue with 8 min/doc is rubber-stamping, not review — queue length is a quality parameter |
| “Confidence score validates retrieval” | Inversion + cross-domain analogy | Confidence score and retrieval completeness are orthogonal — an LLM can be highly confident on an incomplete context window |
| “Auto-accept requires full human review fallback” | Technology-assisted review analogy | Protocol soundness (seed set + rolling validation) makes auto-accept defensible — but only with proper documentation |
Core pipeline: Ingest → Chunk & Embed → Retrieval → Analysis → Validation → Human Review → Decision. Chunking strategy matters most for legal docs — split by section/clause boundaries, not fixed token counts. Preserve hierarchy: document → section → clause → sub-clause.
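A minimal sketch of clause-boundary chunking. The `SECTION_RE`/`CLAUSE_RE` patterns here are illustrative assumptions; real contract ingestion needs a much richer heading grammar, and must catch exhibits and schedules, not just numbered sections.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    path: list  # hierarchy, e.g. ["Section 2", "Clause 2.1"]

# Illustrative patterns only; production ingestion needs a real grammar.
SECTION_RE = re.compile(r"^(Section\s+\d+|Schedule\s+\w+)", re.M)
CLAUSE_RE = re.compile(r"^(\d+\.\d+)", re.M)

def chunk_by_clause(doc_text: str) -> list:
    """Split on section/clause boundaries instead of fixed token counts,
    preserving the document -> section -> clause hierarchy on each chunk."""
    parts = SECTION_RE.split(doc_text)  # [preamble, head1, body1, head2, ...]
    chunks = []
    if parts[0].strip():
        chunks.append(Chunk(parts[0].strip(), ["preamble"]))
    for head, body in zip(parts[1::2], parts[2::2]):
        sub = CLAUSE_RE.split(body)
        if sub[0].strip():
            chunks.append(Chunk(sub[0].strip(), [head]))
        for num, text in zip(sub[1::2], sub[2::2]):
            chunks.append(Chunk(f"{num} {text.strip()}", [head, f"Clause {num}"]))
    return chunks
```

Each chunk carries its full hierarchy path, which is what later makes parent-section and sibling-clause expansion possible at retrieval time.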
Hybrid retrieval (dense + sparse): semantic similarity via embeddings at 0.6 weight, BM25 for exact legal terms at 0.4 weight. Always retrieve the surrounding context (parent section + sibling clauses) — legal meaning is heavily context-dependent.
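The score fusion can be sketched as below, assuming per-chunk score dicts from the two retrievers. Min-max normalization is an assumption made so the 0.6/0.4 weights are comparable; a rank-fusion scheme such as RRF could replace it.

```python
def hybrid_score(dense: dict, sparse: dict,
                 w_dense: float = 0.6, w_sparse: float = 0.4) -> dict:
    """Fuse dense (embedding) and sparse (BM25) scores per chunk id.
    Scores are min-max normalized before weighting."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}
    d, s = norm(dense), norm(sparse)
    # Context expansion (parent section + sibling clauses) happens after
    # ranking, when the top chunks are assembled into the context window.
    return {i: w_dense * d.get(i, 0.0) + w_sparse * s.get(i, 0.0)
            for i in set(d) | set(s)}
```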
Validation gates run sequentially, fail fast: Completeness (required clauses present), Consistency (defined terms used consistently), Cross-reference (internal refs resolve), Jurisdiction (escalate to human), Conflict (contradictions between sections).
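The fail-fast chain reduces to an ordered gate list. The gate names follow the text; the example completeness check and its `required` clause set are hypothetical.

```python
def run_gates(doc: dict, gates: list) -> dict:
    """Run validation gates in order and stop at the first failure."""
    for name, gate in gates:
        ok, detail = gate(doc)
        if not ok:
            return {"passed": False, "failed_gate": name, "detail": detail}
    return {"passed": True, "failed_gate": None, "detail": ""}

def completeness_gate(doc):
    # Hypothetical required-clause set, for illustration only.
    required = {"governing_law", "indemnification"}
    missing = required - set(doc["clauses"])
    return (not missing, f"missing: {sorted(missing)}" if missing else "")

GATES = [
    ("completeness", completeness_gate),
    # ("consistency", ...), ("cross_reference", ...),
    # ("jurisdiction", ...), ("conflict", ...)  # same shape, in this order
]
```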
Human review designed around risk level, not task completion: Low risk = LLM summary + attorney spot-check. Medium risk = LLM draft + attorney full review. High risk = LLM flags issues only, attorney owns document. Any uncertainty = attorney, no exceptions.
Confidence and uncertainty handling: if confidence < 0.8 on any legal conclusion, set requires_human_review=True. Legal is a domain where ‘I’m not sure’ must route to a human.
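The tier-plus-confidence rule from the two paragraphs above, as a sketch. Escalating a low-risk document to full review on low confidence is one interpretation of “any uncertainty = attorney, no exceptions.”

```python
def route(risk: str, min_confidence: float) -> dict:
    """Map risk level and the lowest per-conclusion confidence to a
    review mode. Confidence below 0.8 always sets requires_human_review."""
    modes = {
        "low": "LLM summary + attorney spot-check",
        "medium": "LLM draft + attorney full review",
        "high": "LLM flags issues only; attorney owns document",
    }
    requires_human_review = min_confidence < 0.8
    if requires_human_review and risk == "low":
        risk = "medium"  # uncertainty escalates past the spot-check tier
    return {"mode": modes[risk], "requires_human_review": requires_human_review}
```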
The model’s job is to surface issues, not to make legal decisions. Structure the workflow so the model reduces attorney time, not attorney involvement. Any workflow where the LLM is the final decision-maker on legal conclusions is a liability.
Done · 11 sec
Validation passes on grounded-but-incomplete retrieval. The LLM output IS grounded in the text it received; the problem is that the text was incomplete. The confidence estimator scores output quality, not retrieval completeness. These must be separate gates.
Human-in-the-loop becomes human-in-name-only under workload pressure. Queue length is a quality parameter, not just a throughput metric. Auto-downgrade tier when average time-per-doc drops below minimum threshold.
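One way to implement the auto-downgrade, with a hypothetical 15-minute floor and 50-document rolling window (the text fixes neither number):

```python
from collections import deque

class QueueMonitor:
    """Track rolling time-per-doc; if the average falls below the floor,
    reviews are likely rubber stamps and the auto-accept tier should be
    downgraded until the queue recovers."""
    def __init__(self, floor_sec: int = 900, window: int = 50):
        self.floor_sec = floor_sec
        self.times = deque(maxlen=window)

    def record(self, seconds: float) -> None:
        self.times.append(seconds)

    def degraded(self) -> bool:
        if not self.times:
            return False
        return sum(self.times) / len(self.times) < self.floor_sec
```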
Tier 3 (high-risk) must suppress LLM pre-summary until after the attorney reads the document independently. Showing the AI’s analysis first anchors the reviewer to the model’s frame.
Tier assignment is driven by BOTH risk category AND retrieval completeness. A boilerplate clause gets escalated to full attorney review if its document family was incompletely indexed. The two signals combine via max, not addition: either one alone can force escalation.
The real failure modes are upstream and downstream of the LLM. The highest-risk components are Corpus Definition (humans decide what enters the review universe before any LLM runs) and the Human Review Queue (reviewer fatigue degrades quality on long queues). The LLM reasoning layer is in the middle and is not the primary failure source.
Pre-mortem traced a realistic failure: an MAE clause had a carve-out in Schedule 3B. The exhibit was OCR’d with a 4% error rate, got low embedding scores, ranked 9th in retrieval (past the context window’s token limit), and was never seen by the LLM. The confidence estimator returned 0.87 because the base contract was clearly retrieved — it had no signal that context was incomplete. The finding was auto-accepted. The attorney never read Schedule 3B.
Critical distinction revealed: confidence score ≠ retrieval completeness. These are different signals and must be different gates. The most dangerous causal chain: validation passes on grounded-but-incomplete retrieval; the LLM output IS grounded in the text it received, but that text was incomplete.
Tiered architecture with a structural resolution: tier = max(risk_tier(clause_type), completeness_penalty(doc_graph)). Even a boilerplate clause gets escalated to full attorney review if its document family was incompletely indexed. Tier 3 suppresses LLM pre-summary until after attorney reads — prevents anchoring bias.
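The `tier = max(...)` rule as a sketch. The clause-type map and the completeness thresholds are illustrative assumptions, not values from the text.

```python
def assign_tier(clause_type: str, family_indexed_fraction: float) -> int:
    """tier = max(risk_tier(clause_type), completeness_penalty(doc_graph)):
    the two signals combine via max, so either one alone can escalate."""
    RISK_TIER = {"boilerplate": 1, "indemnification": 2, "mae_carveout": 3}

    def completeness_penalty(frac: float) -> int:
        # Hypothetical thresholds on the fraction of the document family
        # (contract + schedules + exhibits) that was successfully indexed.
        if frac >= 0.99:
            return 1
        if frac >= 0.90:
            return 2
        return 3

    return max(RISK_TIER.get(clause_type, 2),
               completeness_penalty(family_indexed_fraction))
```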
Aviation pre-flight checklist analogy: the key transfer is that the copilot has explicit authority to challenge the captain on safety items. Maps to: any reviewer can escalate any document to Tier 3 regardless of system assignment. This is a hard right built into the system, not a permission the reviewer has to request.
The exploration started with ‘how do we do RAG + checkpoints’ and ended somewhere more structurally interesting. The LLM call is the least problematic step: it can be validated, traced, and logged. The hard problems are document graph completeness and human checkpoint integrity.
15 constraints resolved · 5m 3s
Lawyers consistently override low-confidence findings, training the system to suppress borderline issues. The system learns to flag less, which looks like improving precision but is actually increasing false negatives. Canary audits and floor constraints are required.
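One way to implement the canary audit, assuming known-issue documents are seeded into the live queue; the 95% detection floor is a hypothetical value.

```python
def canary_health(results: dict, floor: float = 0.95) -> dict:
    """results maps seeded canary-doc ids to whether the system flagged
    the known issue. Detection below the floor signals calibration drift:
    the feedback loop is suppressing borderline findings."""
    rate = sum(results.values()) / len(results)
    return {"detection_rate": rate, "drift_alarm": rate < floor}
```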
Junior associate uses LLM review as primary review under time pressure. The workflow designed for human-in-the-loop becomes human-in-name-only. Enforced gates with minimum review time and random audit of sessions are required.
A 97% agreement rate becomes ammunition for stakeholders to argue for less human review depth. The success metric becomes the attack vector. Not preventable by technical design alone — requires externally-anchored institutional guardrails.
All four stakeholder perspectives converge: the LLM’s core value is triage and routing, not autonomous legal judgment. This resolves the liability question — standard of care shifts to whether the right humans reviewed the right things.
Individually innocuous clauses can create emergent unfavorable outcomes when combined. A holistic document-level analysis pass is required in addition to clause-level review.
This exploration started from a technical architecture question and ended at an organizational design problem. The initial assumption was that retrieval, validation, and human checkpoints are primarily engineering challenges. What emerged is that the most dangerous failure modes are behavioral — systems that technically work but degrade through human behavior changes around them.
Five independent failure modes:
1. Ingestion misses document structure (exhibits, schedules).
2. Feedback loops drift risk thresholds downward over time.
3. Jurisdictional misapplication — US-trained reasoning applied to UK-governed documents.
4. Human checkpoints degrade to rubber stamps under time pressure.
5. Clause-by-clause analysis misses adversarial document construction, where individually innocuous clauses create emergent unfavorable outcomes.
The two most dangerous failures — calibration drift and checkpoint degradation — are BEHAVIORAL, not technical. They emerge slowly and are invisible until incident. Calibration drift: lawyers consistently override low-confidence findings, training the system to suppress borderline issues. Checkpoint degradation: junior associate uses LLM review as primary review under time pressure.
Four stakeholders converge on one reframe: the LLM’s role is triage and routing, not autonomous review. This resolves the liability question entirely — standard of care shifts from ‘was the AI review adequate?’ to ‘did the right humans review the right things?’
Multi-pass output variance serves double duty: hallucination detection AND clause complexity signaling. High variance between passes = genuinely complex clause, which feeds tier routing. Quarterly blind comparison audits are the only direct quality measurement and must be established from day one.
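A crude sketch of the double-duty variance signal, treating each pass’s extracted findings as a set. Real systems would compare findings semantically, and the 0.7 agreement threshold is an assumption.

```python
def multi_pass_signal(passes: list) -> dict:
    """Run the same clause through the model several times; low agreement
    across passes is both a hallucination flag and a complexity signal
    that feeds tier routing."""
    sets = [set(p) for p in passes]
    union = set.union(*sets)
    common = set.intersection(*sets)
    agreement = len(common) / len(union) if union else 1.0
    return {"agreement": agreement, "high_variance": agreement < 0.7}
```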
The final pre-mortem revealed the most uncomfortable finding: good audit results get weaponized to justify reducing the very oversight the system depends on. A 97% agreement rate becomes ammunition for stakeholders to argue for less human review. The success metric becomes the attack vector. This failure mode is not preventable by technical design alone — it requires externally-anchored institutional guardrails that internal stakeholders cannot unilaterally change.
12 constraints resolved · 5m 12s