Codex Annex I · Published 2026-03 · 30 pp

The hallucination benchmark.

Headline result

Asked the same suite of professional-grade questions, Apothy answered the answerable ones correctly ninety-five times in a hundred, and invented an answer fewer than two times in a hundred overall.

Apothy v6.4 · n = 200 questions · 3 systems × 3 independent LLM judges.

95.0%

Accuracy on answerable questions

1.5%

Overall hallucination rate

For reference: Standard RAG reached 31.0% accuracy on the same suite.

Method

How we measured.

Two hundred questions drawn from fifteen real Australian financial-services documents · DDQs, ODD questionnaires, ESG surveys, APRA guidelines. Three systems · Plain LLM, Standard RAG, and Apothy · scored head-to-head by three independent LLM judges from different model families, with position-swap debiasing and bootstrap hypothesis tests. The full method, dataset construction, and per-system results are in the paper.
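As an illustration of position-swap debiasing, here is a minimal sketch, assuming each judge compares two answers pairwise and that a preference counts only if it survives swapping the presentation order. The paper may implement this differently; the `Judge` signature, the `"first"`/`"second"`/`"tie"` labels, and the two toy judges are hypothetical names introduced for this example.

```python
from typing import Callable

# Hypothetical judge signature (an assumption, not the paper's API):
# takes (question, first_answer, second_answer) and returns
# "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(question, answer_a, answer_b)  # A shown first
    reverse = judge(question, answer_b, answer_a)  # B shown first

    # Map positional verdicts back to the underlying systems.
    forward_winner = {"first": "a", "second": "b"}.get(forward, "tie")
    reverse_winner = {"first": "b", "second": "a"}.get(reverse, "tie")

    # Only a preference that survives the swap counts; otherwise call it a tie.
    return forward_winner if forward_winner == reverse_winner else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers the first slot."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge with no position bias: prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With these toy judges, `always_first` produces a tie once both orders are checked, while `prefers_longer` keeps its preference because it does not depend on position.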

N: 200 questions · 100 answerable · 100 unanswerable
Judges: 3 independent LLM judges · DeepSeek R1, GPT-4.1, Claude Sonnet 4 · position-swap debiased
Systems: Plain LLM · Standard RAG (top-5 pgvector) · Apothy
Statistics: Bootstrap + McNemar's exact · Bonferroni correction · Fleiss' kappa · SQuAD 2.0 cross-validation
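For readers unfamiliar with two of the tests listed above, the following is a minimal sketch of McNemar's exact test on paired outcomes, followed by a Bonferroni correction across several comparisons. It uses only the Python standard library. The function names are illustrative and are not taken from the paper, and the paper's own analysis code may differ.

```python
from math import comb
from typing import List


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = questions system 1 got right and system 2 got wrong,
    c = questions system 1 got wrong and system 2 got right.
    Under the null hypothesis the discordant pairs split as Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the one observed (one tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

A typical use would be to compute one p-value per pair of systems from their discordant counts, then pass the list to `bonferroni` before comparing against the chosen significance level.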

The Paper

Read in full.



Send a brief

Discuss the result.

The full method is in the paper; the implications usually need a call. Tell us where you're applying this and we will respond within two business days.