The hallucination harness — a regression demo bundled as a credibility artifact. Runs a matrix of prompts × workspace-context sizes through the real `system_prompt.md` end-to-end via OpenRouter, classifies each response for hallucination signatures, and writes JSON + Markdown reports. Honest about being a demo: the 8 prompts in prompts.py are innocuous starters, not the full adversarial surface (~100 curated jailbreak/link-insistence/citation- fabrication prompts, multi-language variants, per-skill regression) that lives in the hosted product. Documents the runner commands (a quick smoke test, the 56-call full demo, and a `--strip-fix` negative control that confirms the harness actually detects hallucinations) and the near-0% acceptance target for naturalistic prompts. When running or extending the hallucination regression tests, or citing how the system is evaluated.
Hallucination harness — OSS demo
A regression harness for testing the agent's hallucination behavior on legal-shaped prompts. Bundled with AnyLegal OSS as a credibility artifact: "we test our system."
What it does
Runs a configurable matrix of prompts × workspace context sizes through the actual system_prompt.md end-to-end via OpenRouter, classifies the agent's response for hallucination signatures, and writes JSON + Markdown reports.
What it isn't
This is a demo set. The 8 prompts in prompts.py are innocuous starter examples — enough to demonstrate the harness end-to-end. They do NOT cover the full adversarial surface that production-grade legal-AI evaluation needs.
The production AnyLegal hallucination evaluation runs a much larger curated set (jailbreak attempts, link-insistence prompts, non-English adversarial variants, document-mention fabrications, citation fabrications, URL fabrications) — that set is part of the hosted product, not this OSS distribution.
Running the harness
cd backend
source .venv/bin/activate
export OPENROUTER_API_KEY=<your-key>
# Quick smoke test
python -m tests.hallucination.runner --sizes 1 --max-prompts 1
# Full demo run (8 prompts × 7 sizes = 56 LLM calls, ~$0.10–0.30)
python -m tests.hallucination.runner \
--sizes 1,5,10,25,50,75,100 \
--out tests/hallucination/results/demo_run.json
# Negative control (confirms the harness actually detects hallucinations)
python -m tests.hallucination.runner --strip-fix \
--out tests/hallucination/results/negative_control.json
Acceptance target
For naturalistic prompts (the demo set), the hallucination rate should be near 0%. Some hallucinations on adversarial prompts are expected — track the delta vs. prior baseline rather than the absolute number.
Production-grade evaluation
For comprehensive hallucination evaluation tied to your jurisdiction, model selection, and skill catalog — see anylegal.ai. The production stack includes:
- ~100 curated adversarial prompts covering jailbreak / link-insistence / citation fabrication / document fabrication
- Multi-language adversarial variants (non-English contract review)
- Per-skill regression tracking (review / draft / docx-editing each have their own evaluation surface)
- Lawyer-review subset for redline-quality scoring