The hallucination benchmark.
When asked the same suite of professional-grade questions, Apothy returned the right answer ninety-five times in a hundred, and invented an answer fewer than two times in a hundred.
How we measured.
Two hundred questions drawn from fifteen real Australian financial-services documents · DDQs, ODD questionnaires, ESG surveys, APRA guidelines. Three systems · Plain LLM, Standard RAG, and Apothy · scored head-to-head by three independent LLM judges from different model families, with position-swap debiasing and bootstrap hypothesis tests. The full method, dataset construction, and per-system results are in the paper.
- 200 questions · 100 answerable · 100 unanswerable
- 3 independent LLM judges · DeepSeek R1, GPT-4.1, Claude Sonnet 4 · position-swap debiased
- Plain LLM · Standard RAG (top-5 pgvector) · Apothy
- Bootstrap + McNemar's exact · Bonferroni correction · Fleiss' kappa · SQuAD 2.0 cross-validation
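For readers who want to see what the paired comparison in the last bullet involves, here is a minimal sketch of McNemar's exact test followed by a Bonferroni correction. It is an illustration under stated assumptions, not the code used in the paper: the function names `mcnemar_exact` and `bonferroni` and the discordant-pair counts in the usage lines are hypothetical, and the paper remains the reference for the actual procedure.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for two systems scored on the same questions.

    b: questions the first system got right and the second got wrong.
    c: questions the second system got right and the first got wrong.
    Questions where both systems agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split like fair coin flips: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag which comparisons stay significant after dividing alpha by their count."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical discordant counts for three pairwise comparisons (not real results).
p_values = [mcnemar_exact(10, 0), mcnemar_exact(5, 5), mcnemar_exact(12, 3)]
significant = bonferroni(p_values)
```

The exact form is preferred over the chi-squared approximation when discordant pairs are few, which can happen on a 200-question set where two systems agree on most items.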
Read in full.
Discuss the result.
The full method is in the paper; the implications usually need a call. Tell us where you're applying this and we will respond within two business days.