VERITAS
2026-06-10 · The audit log

The Best Model Anthropic Has Shipped Still Hallucinates. Here Is What the System Card Says.

Anthropic released Claude Fable 5 and Claude Mythos 5, the most capable models the company has published to date. On the AA-Omniscience knowledge-and-hallucination benchmark, Fable scored 40 points, seven points above the previous leader. By the field's own scoring, this is the new high-water mark for getting facts right and inventing fewer of them.

Read that number again. The best score on the field's knowledge-and-hallucination benchmark is 40 out of a possible 100.

The model is better than anything that came before it. The residual error did not reach zero. And inside the 319-page system card Anthropic published alongside the release is a quieter finding: in some tests, Fable 5 and Mythos 5 hallucinate more than the older Opus model they sit above. The frontier moved forward on average and slipped backward in places. Both statements are true at once, and both are documented by the company that built the model.

For most uses, a 40 is a triumph. For a filed brief, the residual is the entire problem.

The announcement does not tell you the part that matters

The public launch post is about what the model can do. Coding, reasoning, vision, speed. It does not quantify how often the model states something that is not true. That figure lives in the appendix, in a system-card document longer than most appellate briefs.

This is not a criticism of Anthropic. Measuring your own residual error and publishing it is honest engineering, and it is what a responsible lab does. The point for a litigator is narrower: the headline tells you the model is better, and the appendix tells you it is not perfect. A filing decision has to be made against the appendix, not the headline.

A citation either resolves to a real case in the reporter it names, or it does not. There is no partial credit. A model that is right 96 percent of the time produces, across a year of drafting, a steady supply of the other 4 percent. One of those, filed, is the pattern that got the attorneys in Mata v. Avianca sanctioned. We took that case apart in an earlier post.
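To put rough numbers on that residual, here is a minimal sketch of the arithmetic. The 96 percent figure is the illustrative rate from the paragraph above. The citation counts (500 a year, 25 in one brief) are assumptions chosen only for illustration, and the second function treats every citation as an independent draw, which real drafting does not guarantee.

```python
def expected_bad_citations(n_citations: int, accuracy: float) -> float:
    """Expected number of wrong citations among n_citations."""
    return n_citations * (1.0 - accuracy)


def prob_at_least_one_bad(n_citations: int, accuracy: float) -> float:
    """Chance that at least one citation is wrong, assuming independence."""
    return 1.0 - accuracy ** n_citations


if __name__ == "__main__":
    accuracy = 0.96   # illustrative rate from the text, not a measured figure
    per_year = 500    # assumption: citations drafted in a year
    per_brief = 25    # assumption: citations in a single brief

    print(f"Expected bad citations per year: "
          f"{expected_bad_citations(per_year, accuracy):.1f}")
    print(f"Chance a brief has at least one: "
          f"{prob_at_least_one_bad(per_brief, accuracy):.2f}")
```

Under those assumptions the expected count is about 20 bad citations a year, and a 25-citation brief has roughly a 64 percent chance of carrying at least one. Changing the assumed counts changes the numbers, not the shape of the problem.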

Why a better model does not lower the verification duty

The intuition runs the other way. Each release is more accurate, so the verification step should matter less over time. It does not, for three reasons.

The first is that accuracy and trust move together. The more reliable the model feels, the less an associate scrutinizes its output, and the easier a clean-looking phantom citation slides into a footnote. Higher accuracy raises the cost of the errors that remain, because fewer people are still looking for them.

The second is that the obligation does not scale with model quality. Florida Rule 2.515(d)(2), effective June 15, 2026, requires every signer to certify that all cited authorities exist and are accurately cited. New York's Part 161, effective June 1, 2026, makes a signature a certification that the paper contains no fabricated AI-generated cases or statutes. Neither rule has an exception for a good model. The duty attaches to the signature, not to the tool.

The third is that the courts already settled the question of whether a paid, capable AI tool earns a pass. It does not. In one 2026 matter, a firm had used a frontier model through the vendor console, had a written AI policy on the books, and still filed phantom quotations. The court's response was direct: a policy does not verify citations. The capability of the model was never the issue. The absence of an independent check under Rule 11 was.

AI-assisted drafting is the workflow. Verification is the discipline layer.

Veritas does not say AI is dangerous. The opposite. Fable 5 is a remarkable drafting partner, and the lawyers who use it will out-produce the ones who refuse to. The position is simpler than caution: every drafting tool that improves throughput needs a check that runs after drafting and before filing. That check is the discipline layer, and it is what makes the speed defensible.

Veritas verifies the citations in a draft against reporter and public-record sources. It works backward from the cite to the source, the inverse of how a research platform works. Each cite resolves to one of a small set of recorded outcomes. A case that could not be located in the reporter it names is flagged as not located, not as fabricated. A transport error or a typo is not the same as an invented case, and a tool that falsely accuses a real citation is its own kind of malpractice. The output is a Verification Certificate: SHA-256 hashed, time-stamped, and addressable at a public URL. It is the record that someone confirmed the authorities before the brief left the firm.
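The general idea of a hashed, time-stamped verification record can be sketched in a few lines. This is not Veritas's implementation. The names Outcome, CiteResult, and build_certificate are hypothetical, the outcome labels other than "not located" are assumptions, and the public URL step is left out because the text does not describe it.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Outcome(str, Enum):
    # Hypothetical labels; the text names only "not located" explicitly.
    LOCATED = "located"
    NOT_LOCATED = "not_located"  # deliberately not called "fabricated"


@dataclass(frozen=True)
class CiteResult:
    citation: str
    outcome: Outcome


def build_certificate(results: list[CiteResult], issued_at: datetime) -> dict:
    """Return the verification record plus a SHA-256 digest of its contents."""
    payload = {
        "issued_at": issued_at.astimezone(timezone.utc).isoformat(),
        "results": [
            {"citation": r.citation, "outcome": r.outcome.value}
            for r in results
        ],
    }
    # Canonical JSON (sorted keys, fixed separators) so the same record
    # always serializes to the same bytes and therefore the same digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**payload, "sha256": digest}


if __name__ == "__main__":
    results = [
        CiteResult("PLACEHOLDER CITATION A", Outcome.LOCATED),
        CiteResult("PLACEHOLDER CITATION B", Outcome.NOT_LOCATED),
    ]
    cert = build_certificate(results, datetime.now(timezone.utc))
    print(json.dumps(cert, indent=2))
```

Because the digest is computed over a canonical serialization, anyone holding the record can recompute the hash and confirm that neither the outcomes nor the timestamp were altered after the fact.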

That certificate is the artifact a partner can hand to a client, file in the matter folder, or produce to a malpractice insurer if the worst happens. It is the difference between "we have an AI policy" and "we verified this filing, and here is the proof."

The takeaway

The models will keep getting better. The benchmark scores will keep climbing. None of that retires the verification step, because the residual error never reaches zero and the certification duty never relaxes. Anthropic measured its own model honestly and reported that even its best score leaves room for error. The disciplined response is not to stop using the model. It is to add the layer that turns a fast first draft into a filing you can defend.

Run the brief through Veritas before it is filed. Keep the certificate.


Filed under · Claude Fable 5 · AI-assisted legal drafting · Verification Certificate