Codex Annex I · Published 2026-03 · 30 pp

The hallucination benchmark.

Headline result

When asked the same suite of professional-grade questions, Apothy returned the right answer ninety-five times in a hundred, and invented the answer in fewer than two.

Apothy v6.4 · n = 200 questions · 3 systems × 3 independent LLM judges.
95.0%
Accuracy on answerable questions
1.5%
Overall hallucination rate
For reference: Standard RAG scored 31.0% accuracy on the same suite.
Method

How we measured.

Two hundred questions drawn from fifteen real Australian financial-services documents · DDQs, ODD questionnaires, ESG surveys, APRA guidelines. Three systems · Plain LLM, Standard RAG, and Apothy · scored head-to-head by three independent LLM judges from different model families, with position-swap debiasing and bootstrap hypothesis tests. The full method, dataset construction, and per-system results are in the paper.
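Position-swap debiasing can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: it assumes a pairwise judging setup in which each judge sees two answers and names the better one, and the names `debiased_preference` and `majority_vote`, the verdict labels, and the majority-vote aggregation are all hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption, not from the paper):
# judge(question, first_answer, second_answer) -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str,
                        answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answer order swapped.

    Returns "a" or "b" only when both orderings agree; otherwise "tie".
    A judge that simply favours whichever answer is shown first
    therefore never produces a win for either system.
    """
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    forward_pick = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_pick = {"first": "b", "second": "a"}.get(backward, "tie")
    return forward_pick if forward_pick == backward_pick else "tie"


def majority_vote(picks: Sequence[str]) -> str:
    """Combine per-judge picks; anything short of a strict majority is a tie.

    Majority voting is an assumed aggregation rule for this sketch.
    """
    top, count = Counter(picks).most_common(1)[0]
    return top if count > len(picks) / 2 else "tie"
```

With three judges, each question would yield three debiased picks, which `majority_vote` reduces to one verdict. Treating inconsistent orderings as ties is the conservative choice: it discards verdicts that depend on presentation order rather than content.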

  • N: 200 questions · 100 answerable · 100 unanswerable
  • Judges: 3 independent LLM judges · DeepSeek R1, GPT-4.1, Claude Sonnet 4 · position-swap debiased
  • Systems: Plain LLM · Standard RAG (top-5 pgvector) · Apothy
  • Statistics: Bootstrap + McNemar's exact · Bonferroni correction · Fleiss' kappa · SQuAD 2.0 cross-validation
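For readers unfamiliar with the statistics listed above, the sketch below shows how a paired comparison between two systems might be tested. It is an illustration under stated assumptions, not the paper's code: per-question correctness is assumed to be recorded as 0 or 1 for each system on the same questions, and the function names, the resample count, and the seed are hypothetical.

```python
import math
import random
from typing import Sequence, Tuple


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b = questions system A got right and system B got wrong,
    c = questions system A got wrong and system B got right.
    Under the null hypothesis the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni(p_value: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


def paired_bootstrap_ci(correct_a: Sequence[int], correct_b: Sequence[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    Questions are resampled with replacement, keeping each question's
    pair of outcomes together so the comparison stays paired.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

If all three systems were compared pairwise, `n_comparisons` would be 3; that count is an assumption here, and the paper should be consulted for the comparisons it actually corrects for.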
The Paper

Read in full.

Send a brief

Discuss the result.

The full method is in the paper; the implications usually need a call. Tell us where you're applying this and we will respond within two business days.

By introduction only.