Research · 22 June 2026 · 5 min read

The hallucination benchmark: 95% right, under 2% invented

Apothy answers professional questions correctly 95.0% of the time and invents an answer less than 2% of the time. A standard retrieval setup gets them wrong more than half the time. Here is how, and why it matters.

For a consumer chatbot, a confident wrong answer is an annoyance. For a law firm, an accounting practice, or a clinic, it is a liability. The model that invents a citation, misstates a standard, or fabricates a number does not just waste time. It puts the firm on the hook.

So we measured it.

What we measured

We built a benchmark of professional questions, the kind a practitioner actually asks: regulatory standards, record-keeping requirements, and procedural detail where the answer is verifiable and the cost of being wrong is real. Then we scored two things separately, because they are not the same:

  • Accuracy · how often the system gave the correct answer.
  • Hallucination rate · how often it produced a confident answer that was simply made up.

A system can score well on accuracy and still hallucinate. The failures that matter most come when the honest answer is "I do not know" and the system answers anyway.
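To make the distinction concrete, here is a minimal sketch of scoring the two numbers separately. It assumes each response has already been graded into one of four outcome labels. The labels, the `score` function, and the sample grades are illustrative assumptions, not the benchmark's actual harness or data.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    # Hypothetical grading labels, assumed for illustration.
    CORRECT = "correct"            # answer matches the verified reference
    HALLUCINATED = "hallucinated"  # confident answer that was made up
    ABSTAINED = "abstained"        # system said it did not know
    OTHER_WRONG = "other_wrong"    # wrong, but not a fabrication


def score(outcomes):
    """Return (accuracy, hallucination_rate) as fractions of all questions."""
    if not outcomes:
        raise ValueError("need at least one graded question")
    counts = Counter(outcomes)
    total = len(outcomes)
    accuracy = counts[Outcome.CORRECT] / total
    hallucination_rate = counts[Outcome.HALLUCINATED] / total
    return accuracy, hallucination_rate


# Illustrative grades only, not the benchmark's data.
graded = (
    [Outcome.CORRECT] * 8
    + [Outcome.HALLUCINATED] * 1
    + [Outcome.ABSTAINED] * 1
)
accuracy, hallucination_rate = score(graded)
print(f"accuracy={accuracy:.1%} hallucination_rate={hallucination_rate:.1%}")
```

In this toy set the abstention lowers accuracy but does not count as a hallucination, which is why the two numbers have to be tracked separately.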

The results

On Apothy v6.4:

  • 95.0% accuracy on professional questions.
  • 1.5% hallucination rate.

A standard retrieval-augmented setup, the common "bolt a vector database onto an LLM" approach, scored 47.3% accuracy on the same questions. More than half wrong.

The gap is not a better model. It is a different posture toward not knowing.

Knowing when not to answer

Most systems are built to always answer. Ask a question, get a paragraph. The architecture has no concept of declining.

Apothy's trust layer is built the other way around. Before it answers, it checks whether it actually has grounds to. When it does not, it says so, rather than filling the silence with something plausible. We call this epistemic gating, and it is the subject of the eighth application in our patent portfolio.

The effect is counterintuitive: a system that refuses more often is a system you can trust more. The 1.5% is not luck. It is the design.
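The general idea of checking for grounds before answering can be sketched in a few lines. This is not Apothy's implementation: the `Passage` type, the `support` score, the `gated_answer` function, and the 0.8 threshold are all hypothetical, chosen only to show the shape of a gate that declines when evidence is weak.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    support: float  # hypothetical 0..1 score for how well this passage backs an answer


def gated_answer(passages, draft_answer, min_support=0.8):
    """Return the draft answer only if some passage supports it strongly enough."""
    # With no passages at all, the best available support is treated as zero.
    best = max((p.support for p in passages), default=0.0)
    if best < min_support:
        # Decline rather than return something merely plausible.
        return "I do not know."
    return draft_answer


# Illustrative inputs only.
weak = [Passage("loosely related text", 0.35)]
strong = [Passage("text that states the requirement directly", 0.92)]
print(gated_answer(weak, "a drafted answer"))
print(gated_answer(strong, "a drafted answer"))
```

Raising `min_support` in a sketch like this trades more refusals for fewer unsupported answers, which is the trade-off the paragraph above describes.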

Why a firm should care

If you are evaluating AI for a regulated practice, the headline accuracy number is the easy part. The number that should keep you up is the hallucination rate, because that is the one that ends up in a filing, an audit, or a client deliverable.

A baseline that is right 47% of the time is not a tool you can put in front of a matter. A sub-2% hallucination rate, paired with a model that tells you when it is unsure, is a different conversation.

The full method and the per-category breakdown are in the white paper. If you run a firm that cannot afford a made-up answer, that is what Private AI is built for.