How the Nightmare numbers are made | Methodology of the Aginor extraction test

This is the methodology behind the Nightmare post. Two reviewers flagged things in v0, one on the draft before publication and one after; the numbers in the headline post and the public repo are v1, post-fix.

How we define hallucination

A hallucination is an extracted value that doesn't appear anywhere in the packet's ground truth after normalization:

Strings: lowercase, punctuation collapsed to spaces, whitespace runs collapsed. So "Preston Center Tower, Inc." and "preston center tower inc" match.
Numbers: $, commas, spaces, and trailing % stripped before float-cast.
A second tier strips everything but [a-z0-9] for ID-style matches ("CL-2023-12345" ↔ "CL202312345"), with a 4-char minimum so trivially short values don't pass.

The generator writes that ground truth alongside the documents themselves: one packet-level JSON, plus per-document artifacts (document_truth_*.json, field_truth_*.json, manifest_*.json, packet_truth.json) next to each rendered PDF or XLSX. No model output ever feeds back into it, so one model's hallucinations can't mask another's. A second pass (alias_audit.py) re-checks each flagged value by token-level coverage, and only values that fail both passes get counted.
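The normalization rules can be sketched roughly as follows. This is a minimal illustration, not the repo's code: as_float is the name the pipeline uses, but normalize_string and id_key are hypothetical names, and applying the 4-char minimum to the stripped key is an assumption.

```python
import re

def normalize_string(value):
    """Lowercase, turn punctuation into spaces, collapse whitespace runs."""
    lowered = value.lower()
    no_punct = re.sub(r"[^\w\s]|_", " ", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()

def as_float(value):
    """Strip $, commas, spaces and a trailing % before casting; None if not numeric."""
    cleaned = value.replace("$", "").replace(",", "").replace(" ", "")
    if cleaned.endswith("%"):
        cleaned = cleaned[:-1]
    try:
        return float(cleaned)
    except ValueError:
        return None

def id_key(value):
    """Second tier (hypothetical helper): keep only [a-z0-9]; keys under 4 chars are ineligible."""
    key = re.sub(r"[^a-z0-9]", "", value.lower())
    return key if len(key) >= 4 else None
```

With these, "CL-2023-12345" and "CL202312345" reduce to the same key, while a two-character ID never reaches the second tier.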

The hallucination-check universe is pooled per-packet by union'ing the per-document and packet-level ground truth. Customer info (insured, producer, preparer, carrier) is shared across the 25 to 36 documents in a packet, so a real value modeled in doc-A's ground truth but not in doc-B's shouldn't false-flag on doc-B. Per-document field_truth_*.json still drives field-level scoring (composite, correctness rate, omission rate).
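As a rough sketch of the pooling step, assuming the values have already been normalized into sets (pool_universe and the argument shapes are illustrative, not the repo's actual interface):

```python
def pool_universe(packet_values, per_doc_values):
    """Union packet-level values with every document's values in the same packet."""
    universe = set(packet_values)
    for values in per_doc_values.values():
        universe |= values
    return universe
```

A value modeled only in doc-A's ground truth is then valid when it shows up in an extraction of doc-B from the same packet.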

v0 → v1 fixes

Path-resolver fix

A pre-publish audit caught the ground-truth path resolver silently failing on the canonical packet layout, which inflated v0 string rates by 4.5 to 7.9pp per model and numeric rates by 0.9 to 2.4pp.

Strict-mode schema enforcement

The v0 extraction calls ran in default JSON mode without schema enforcement. Apurv Gandhi at Reducto flagged it, and v1 converts every prompt schema to declarative JSON Schema and enables strict mode on each provider:

OpenAI: response_format with type: "json_schema" and strict: true (Structured Outputs).
Anthropic: a single extract tool with tool_choice: "auto", and a defensive unwrap of the occasional envelope key.
Gemini: response_mime_type: "application/json" plus response_json_schema, with propertyOrdering injected on every object (Gemini reorders alphabetically by default, which breaks key-aligned diffs across models).
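Two of those pieces are easy to illustrate. The sketch below assumes the Chat Completions shape for OpenAI's response_format and shows one way propertyOrdering could be injected for Gemini; both function names are hypothetical, and the repo's own helpers may differ.

```python
import copy

def openai_response_format(name, schema):
    """Build a strict Structured Outputs response_format payload (assumed shape)."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }

def inject_property_ordering(schema):
    """Return a copy with propertyOrdering on every object, in declared key order."""
    out = copy.deepcopy(schema)

    def visit(node):
        if isinstance(node, dict):
            props = node.get("properties")
            if isinstance(props, dict):
                node["propertyOrdering"] = list(props.keys())
            for child in node.values():
                visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(out)
    return out
```

Pinning the order to the declared key order keeps outputs from different providers diffable key by key.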

The 26 per-doc-type schemas (7 non-ACORD + 19 ACORD per-form) live in public/schemas/ and are built through _build_schemas.py, which enforces additionalProperties: false, puts every property in required, and represents nullable types as ["string", "null"]. Without that, "did the model hallucinate a key" and "did the model decline to populate a key" weren't cleanly separable.
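A minimal sketch of what that enforcement amounts to, assuming plain dict schemas; strictify and nullable are illustrative names rather than the contents of _build_schemas.py:

```python
import copy

def nullable(type_name):
    """Represent an optional value as a union with null, e.g. ["string", "null"]."""
    return {"type": [type_name, "null"]}

def strictify(schema):
    """Return a copy where every object forbids extra keys and requires all of its properties."""
    out = copy.deepcopy(schema)

    def visit(node):
        if isinstance(node, dict):
            props = node.get("properties")
            if isinstance(props, dict):
                node["additionalProperties"] = False
                node["required"] = list(props.keys())
            for child in node.values():
                visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(out)
    return out
```

With every key required and nullable, a missing value has to come back as an explicit null, and an unexpected key is rejected outright, which is what keeps the two failure types separable.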

Schema completeness

The trigger was the loss_run status enum: v0 had ["Open", "Closed"], and strict-mode validation silently nulled everything else. Ground truth modeled 24 Denied, 18 Subrogation, and 15 Re-opened claims, all of which became null on output. Apurv flagged the loss_run case; I treated it as a class-of-issue and walked every leaf in all 5 ground-truth packets to rebuild the enums from observed values.

Three concrete changes:

Nine fields promoted from free additional_fields to named enums: construction, roof_type, sprinkler_type, fire_alarm_type, vehicle_class, cdl_class, policy_form_type, state, sex.
Eight fields had their enum dropped because they had zero ground-truth presence.
Checkbox values standardized to ["Yes", "No"], replacing the v0 mix of X / checked / unchecked.

loss_run.status ended up as ["Open", "Closed", "Denied", "Subrogation", "Re-opened", "Closed Without Payment"].
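Walking every leaf to rebuild enums can be sketched like this. The function name and the toy packet structure are assumptions for illustration; the real ground-truth layout is richer.

```python
from collections import defaultdict

def collect_observed_values(node, observed=None, key=None):
    """Walk nested dicts/lists and record every string leaf under its field name."""
    if observed is None:
        observed = defaultdict(set)
    if isinstance(node, dict):
        for child_key, child in node.items():
            collect_observed_values(child, observed, child_key)
    elif isinstance(node, list):
        for item in node:
            collect_observed_values(item, observed, key)
    elif isinstance(node, str) and key is not None:
        observed[key].add(node)
    return observed

def missing_from_enum(observed_values, enum):
    """Values present in ground truth that a schema enum would silently null out."""
    return set(observed_values) - set(enum)
```

Running the second helper against the v0 enum ["Open", "Closed"] is the kind of check that would have surfaced Denied, Subrogation, and Re-opened before strict mode nulled them.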

Exact-match numeric scoring

v0 used a ±1% relative tolerance on numeric comparisons. An audit caught it masking real model errors: extracted $24,344,800 against rendered $24,514,100 on the N1 financial statement, cents-truncated $153,631 against $153,631.51 on loss-run rows, and a year off by one (2009 against 2010) were all silently scored as correct. Villify renders exact numeric values and ground truth mirrors them, so any post-normalization mismatch is model error. v1 drops the band; as_float() still normalizes ("$1,500,000" to 1500000.0), but the comparison itself is ==. Numeric hallucination rates moved 2 to 8pp per cohort.
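The difference between the two rules is small in code and large in effect. A sketch, assuming the v0 band was measured relative to the ground-truth value (the function names are illustrative):

```python
def matches_v0(extracted, truth, rel_tol=0.01):
    """v0-style check: accept anything within +/-1% of the ground-truth value."""
    return abs(extracted - truth) <= rel_tol * abs(truth)

def matches_v1(extracted, truth):
    """v1-style check: values must be equal after normalization."""
    return extracted == truth

# The three cases from the audit, already cast to floats.
AUDIT_CASES = [
    (24344800.0, 24514100.0),  # N1 financial statement
    (153631.0, 153631.51),     # cents-truncated loss-run row
    (2009.0, 2010.0),          # year off by one
]
```

All three audit cases fall inside the 1% band and fail the equality check.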

Exact-token composed-string acceptance

v0 accepted a multi-token string if 80% of its tokens appeared in the source universe, on the theory that legitimate model concatenations like "LOC-001: Preston Center Tower, 8117 Preston Road, Dallas, TX 75225" shouldn't be penalized just because the combined form isn't in the universe. The same audit that caught the numeric tolerance found this rule admitting real hallucinations: "9900 state road philadelphia pa 19136" against rendered "7600 state road..." scored as correct because five of six tokens matched. Same failure mode, one-sided against the over-emitting (typically OpenAI) cohorts. v1 requires every token to appear in the universe; legitimate concatenations still pass because every component does in fact appear. String hallucination rates moved 0.2 to 1.5pp per cohort, with the OpenAI side hit roughly 3x harder than Anthropic or Gemini.
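A sketch of the two acceptance rules on already-normalized strings, using the address from the audit. The function names are illustrative, and the universe here holds only the tokens of the one rendered address.

```python
def token_coverage(value, universe_tokens):
    """Fraction of the value's whitespace-separated tokens found in the universe."""
    tokens = value.split()
    if not tokens:
        return 0.0
    return sum(token in universe_tokens for token in tokens) / len(tokens)

def accepted_v0(value, universe_tokens, threshold=0.8):
    """v0 rule: accept when at least 80% of tokens appear in the universe."""
    return token_coverage(value, universe_tokens) >= threshold

def accepted_v1(value, universe_tokens):
    """v1 rule: accept only when every token appears in the universe."""
    tokens = value.split()
    return bool(tokens) and all(token in universe_tokens for token in tokens)

UNIVERSE_TOKENS = set("7600 state road philadelphia pa 19136".split())
```

The fabricated "9900 ..." address covers five of six tokens (about 83%), which clears the v0 threshold but not the v1 rule.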

Precision vs recall: the omission question

The headline rate is hallucinations / emitted, so a model that returns null on hard fields gets a smaller denominator and looks better (raised by Apurv after the v0 post).

score.py now classifies every ground-truth-populated field as correct, wrong value, or omitted, with the recall denominator being those same fields. A null where ground truth has a value counts as a failure. The aggregate is in results_aggregate/field_breakdown.json.
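The three-way classification can be sketched as below. This assumes values are already normalized and that only ground-truth-populated fields are passed in; classify_field and field_breakdown are illustrative names, not score.py's.

```python
from collections import Counter

LABELS = ("correct", "wrong_value", "omitted")

def classify_field(truth, extracted):
    """Label one ground-truth-populated field against the model's value."""
    if extracted is None:
        return "omitted"
    return "correct" if extracted == truth else "wrong_value"

def field_breakdown(pairs):
    """Share of each label over all (truth, extracted) pairs."""
    counts = Counter(classify_field(truth, extracted) for truth, extracted in pairs)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}
```

Because the denominator is the set of ground-truth fields rather than the emitted values, returning null no longer shrinks it.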

Field-level error breakdown at default effort

Model	Ground-truth fields	Correct	Wrong value	Omitted	Any error
GPT-5.5	5,632	73.6%	7.0%	19.4%	26.4%
GPT-5.4	5,710	71.9%	8.9%	19.2%	28.1%
Opus 4.7	5,898	71.9%	8.7%	19.4%	28.1%
Sonnet 4.6	5,866	70.8%	9.9%	19.3%	29.2%

Omission rates cluster between 18.8% and 22.1% across the 15 cohorts. The cross-model spread on the headline rate is driven by the wrong-value column, not the omitted column, so models aren't gaming it by being selectively silent.

Methodology choices

Thinking defaults are asymmetric

GPT-5.5, GPT-5.4, Opus 4.7, and Sonnet 4.6 default to thinking off when no reasoning parameter is passed. Gemini 3.x defaults to thinking on at HIGH and cannot be disabled. Google's docs are explicit: Gemini 3 uses dynamic thinking by default at high, and thinking cannot be turned off for Gemini 3 Pro or 3.1 Pro.

So a default-vs-default cell with Gemini next to the other four isn't a matched comparison: it's GPT/Claude at thinking-off against Gemini at thinking-on/HIGH. The default-only table excludes Gemini; the matched HIGH/XHIGH tables include all five. Sonnet 4.6's "XHIGH" uses max because Sonnet 4.6 doesn't accept xhigh (Opus 4.7 does); the label is kept for side-by-side reading.

XHIGH is each model's ceiling, not a matched setting

"XHIGH" is the highest reasoning level each provider exposes, and the four providers don't expose the same thing. GPT-5.5/5.4 use reasoning_effort="xhigh". Opus 4.7 uses effort="xhigh" (a real Anthropic level). Sonnet 4.6 uses effort="max" because xhigh 400s on Sonnet. Gemini 3.1 Pro uses thinking_budget=32000 (numeric budget; Gemini has no named XHIGH). Tables print "XHIGH" for side-by-side reading, but the right way to read an XHIGH-row delta is "each model at its own ceiling," not "matched effort."

Runtime asymmetries: output caps and retries

Two provider-side knobs aren't matched and would be hard to match without subsidizing one provider.

Default-mode output cap. OpenAI and Anthropic are configured at 32K output tokens, Gemini at 128K, so Gemini has 4× the default-mode headroom. Reasoning-mode caps are all 128K and matched. A token-cap audit (scripts/token_cap_audit.py) finds no completion approached its configured ceiling on v1, so the asymmetry didn't bite, but it's worth knowing about.

Retry budget on transient errors. OpenAI gets 5 attempts (~90s backoff), Anthropic 5 (~90s backoff, matched to OpenAI), Gemini 7 (~250s). The Gemini extra is in-code-justified by N1/N2 ACORDs hitting deterministic 503 "high demand" windows; without it Gemini would lose a handful of docs to reliability rather than to capability. We chose to subsidize reliability rather than blame the model for vendor flakiness.

ACORD enum aliasing

ACORD 125/140/160/24/27/28/45 schemas hard-enum construction and roof_type to short ACORD-formal lists (MNC, JM, F, etc.), but SOV and engineering schemas leave the same fields as free strings. The same building shows up as "MNC" on one schema and "Masonry Non-Combustible" on another, and the per-doc-type comparison goes sideways for reasons that have nothing to do with model behavior. score.py accepts the documented abbreviation/full-name mappings in either direction, so the same building scores the same way no matter which schema the value showed up under.
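A sketch of bidirectional alias matching. Only the MNC pair comes from the example above; the JM entry is an illustrative assumption, and the table and function names are not score.py's.

```python
# Abbreviation -> full name. MNC is from the post; JM is an illustrative assumption.
CONSTRUCTION_ALIASES = {
    "MNC": "Masonry Non-Combustible",
    "JM": "Joisted Masonry",
}

# Map both spellings of each entry to one canonical lowercase key.
_CANONICAL = {}
for abbreviation, full_name in CONSTRUCTION_ALIASES.items():
    _CANONICAL[abbreviation.lower()] = abbreviation.lower()
    _CANONICAL[full_name.lower()] = abbreviation.lower()

def canonical(value):
    """Reduce an abbreviation or full name to its canonical key; unknown values pass through."""
    key = value.strip().lower()
    return _CANONICAL.get(key, key)

def same_enum_value(extracted, truth):
    """True when both values name the same enum member, in either direction."""
    return canonical(extracted) == canonical(truth)
```

Canonicalizing both sides means the check is symmetric: it doesn't matter whether the schema or the ground truth used the short form.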

Field-path aliases (scoring coverage)

GT may store a value as a flat scalar like is_revenue while the prompt schema nests it at a longer path like income_statement.revenue.net_revenue. The FIELD_ALIASES table in score.py maps each flat GT field to every path a model might emit the value at; without the right entry, even a correct extraction scores as omitted instead of correct, and a wrong-value emission lands in omitted instead of wrong. An audit found is_revenue and is_operating_expenses missing from the alias table; both were fixed. Numeric and string hallucination rates are computed by hallucination_analysis.py against the packet universe and were unchanged by the fix. Composite scores shifted by ~0.2pp per cohort.

Micro vs macro averaging

We calculate both micro and macro averages. The headline rates are micro (total_hallucinated / total_emitted); macro averages (mean of per-doc rates) are in paired_stats.json for readers who want each doc weighted equally regardless of fact volume.
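The two averages differ only in where the division happens. A minimal sketch, with each doc given as a (hallucinated, emitted) pair; the function names are illustrative:

```python
def micro_rate(docs):
    """Pool all values first: total hallucinated / total emitted."""
    total_hallucinated = sum(h for h, _ in docs)
    total_emitted = sum(e for _, e in docs)
    return total_hallucinated / total_emitted

def macro_rate(docs):
    """Average the per-doc rates, so each doc counts equally."""
    rates = [h / e for h, e in docs if e > 0]
    return sum(rates) / len(rates)
```

For one small doc at 1/10 and one large doc at 30/100, the micro rate is 31/110 (about 28%) while the macro rate is 20%: the large doc dominates the pooled figure.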

Doc-level independence in the sign test

paired_stats.json has both sign tests and bootstrap confidence intervals (CIs). The sign test treats the 148 docs as independent paired observations, since every comparison is the same doc across different models. The bootstrap CIs are for when shared context matters: insured and producer names repeat across the docs in a packet, so errors on those fields can correlate.
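For readers who want the mechanics, an exact two-sided paired sign test can be sketched as follows. This is a textbook version, not necessarily what produces paired_stats.json; both function names are illustrative.

```python
from math import comb

def count_signs(rates_a, rates_b):
    """Per-doc paired comparison: docs where A is lower, docs where B is lower (ties dropped)."""
    a_lower = sum(a < b for a, b in zip(rates_a, rates_b))
    b_lower = sum(b < a for a, b in zip(rates_a, rates_b))
    return a_lower, b_lower

def sign_test_p(wins, losses):
    """Exact two-sided p-value under the null that either sign is equally likely."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The test only looks at which model did better on each doc, not by how much, which is why it leans on the docs being treated as independent.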

Internal arithmetic consistency

internal_consistency.py checks each extraction against itself: for SOV docs, sum of per-location TIVs (total insurable values) vs reported totals.tiv; for loss runs, sum of claim incurred vs reported grand total. Any mismatch at cents precision flags the doc.
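A sketch of the SOV check at cents precision. totals.tiv is the path named above; the locations list and its tiv key are assumed field names, and the helpers are illustrative rather than taken from internal_consistency.py.

```python
from decimal import Decimal

def to_cents(value):
    """Convert a dollar amount to integer cents without binary float drift."""
    return int((Decimal(str(value)) * 100).to_integral_value())

def sov_tiv_mismatch(extraction):
    """True when per-location TIVs don't sum to the reported total, to the cent."""
    summed = sum(to_cents(location["tiv"]) for location in extraction["locations"])
    return summed != to_cents(extraction["totals"]["tiv"])
```

Comparing integer cents avoids flagging a doc just because float addition produced a value a hair off the reported total.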

For financial statements the same idea applies: at any dict level containing a total_* key, the non-total numeric siblings must sum to the reported total. For sibling dicts, the check uses each dict's own declared total_* instead of a raw sum. This catches an extraction that reports total_operating_expenses alongside depreciation and other_operating whose values don't add up. Cases where a sub-bucket is opaque are skipped, since flagging those would over-fire. No cohort produces a financial_total_mismatch on v1: when models declare a total alongside its line items, they sum consistently.
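A simplified sketch of that recursive check. It assumes a level is checkable only when it declares exactly one numeric total_* key, and treats any sibling dict without its own single total as opaque; the real rules in internal_consistency.py may be more nuanced, and all names here are illustrative.

```python
def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def _cents(value):
    return round(value * 100)

def _declared_total(level):
    """The single numeric total_* value of a dict, or None if absent or ambiguous."""
    totals = [v for k, v in level.items() if k.startswith("total_") and _is_number(v)]
    return totals[0] if len(totals) == 1 else None

def find_total_mismatches(node, path=""):
    """Paths of dict levels whose line items don't sum to the declared total."""
    mismatches = []
    if not isinstance(node, dict):
        return mismatches
    total = _declared_total(node)
    if total is not None:
        parts, opaque = [], False
        for key, value in node.items():
            if key.startswith("total_"):
                continue
            if _is_number(value):
                parts.append(value)
            elif isinstance(value, dict):
                sub_total = _declared_total(value)
                if sub_total is None:
                    opaque = True  # sub-bucket without a usable total: skip this level
                else:
                    parts.append(sub_total)
        if parts and not opaque and sum(_cents(p) for p in parts) != _cents(total):
            mismatches.append(path or "<root>")
    for key, value in node.items():
        if isinstance(value, dict):
            child_path = f"{path}.{key}" if path else key
            mismatches.extend(find_total_mismatches(value, child_path))
    return mismatches
```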

Universe audit (rendered-page-bounded)

The hallucination check's "universe" is the set of values that count as real, pooled from the packet GT plus the generator's per-document artifacts. The methodological claim is that every value in that universe corresponds to something on a rendered page, so a model gets credit only for emitting values a human reader could have read. An audit sampled universe entries that came only from generator artifacts (not in the packet GT) and confirmed the canonical text is on the page, but also caught that PDF layout coordinates (x, y, width, height, bbox, page indices) were being walked into the numeric universe alongside content. 34-64% of the pre-fix universe number-set was layout metadata; 14 out of 25,969 model-emitted numerics across 5 default-mode cohorts were rescued by a coordinate match (0.054%, below the rounding the published rates use). _UNIVERSE_SKIP_KEYS in hallucination_analysis.py now skips layout keys and containers; rates shifted by 0.0-0.3pp per cohort with no rank changes.
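A sketch of a universe walker that skips layout keys. _UNIVERSE_SKIP_KEYS is the name used in hallucination_analysis.py, but the specific key spellings and the artifact structure below are assumptions for illustration.

```python
# Layout keys to skip; the exact key names in the real artifacts are assumed here.
_UNIVERSE_SKIP_KEYS = {"x", "y", "width", "height", "bbox", "page", "page_index"}

def collect_numbers(node, out=None):
    """Gather numeric leaves, skipping any key (and its whole subtree) that is layout metadata."""
    if out is None:
        out = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in _UNIVERSE_SKIP_KEYS:
                continue
            collect_numbers(value, out)
    elif isinstance(node, list):
        for item in node:
            collect_numbers(item, out)
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        out.add(float(node))
    return out
```

Skipping at the key level also drops containers such as a bbox list, so a fabricated 72 or 540 can no longer be rescued by a page coordinate.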

What the hallucination check doesn't see

The hallucination rate is computed over leaf values that survive three filters; the filters are symmetric across models but worth disclosing because they shape what "hallucination" means here.

Short strings. hallucination_analysis.py requires a value to have at least 3 characters to enter checked_strings. Two-letter codes (state, country, ACORD construction abbreviations) never enter the denominator. A model fabricating state: "XX" isn't counted; a model fabricating an insured name is. We chose this because 2-letter values are a different problem (enum-match, not fabrication) and including them would let one provider's enum-strictness inflate its measured rate relative to providers that let off-enum tokens through.

Hedging strings. Filler tokens (Multiple, Various, See below, N/A, checked/unchecked, etc.) are skipped from the numerator. A model that hedges 20% of fields has a lower measured rate than a model that guesses on 20% of fields, even if the practical value is the same. The pure precision of "fraction of emitted values that are real" can't distinguish these, which is why the post also publishes the omission rate.

Free-text prose. SKIP_LEAF_NAMES and SKIP_PATH_FRAGMENTS exclude summary/description/notes/narrative/recommendations leaves. Prose is the part of an extraction most prone to fabrication, but matching prose against a packet "universe" produces noise on every model and we don't have a defensible scorer for it. The headline "hallucination rate" is a structured-field rate.
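The three filters can be sketched as a single predicate over string leaves. SKIP_LEAF_NAMES is the real constant's name, but its contents here, the hedging list, and the is_checked helper are illustrative; SKIP_PATH_FRAGMENTS is left out for brevity.

```python
MIN_CHECKED_LENGTH = 3
HEDGING_TOKENS = {"multiple", "various", "see below", "n/a", "checked", "unchecked"}
SKIP_LEAF_NAMES = {"summary", "description", "notes", "narrative", "recommendations"}

def is_checked(path, value):
    """True when a string leaf at a dotted path enters the hallucination check."""
    leaf_name = path.split(".")[-1]
    if leaf_name in SKIP_LEAF_NAMES:
        return False  # free-text prose
    text = value.strip()
    if len(text) < MIN_CHECKED_LENGTH:
        return False  # short strings such as two-letter codes
    if text.lower() in HEDGING_TOKENS:
        return False  # hedging filler
    return True
```

The same predicate is applied to every model's output, which is what makes the filters symmetric even though they narrow what the headline rate measures.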

Limitations

What v1 still isn't

We have 148 docs across 5 packets, so all tests are N=1. This is a test, not a benchmark. We plan to turn it into one with more documents per difficulty tier so per-difficulty rates aren't dominated by noise.

Sonnet at XHIGH cannot complete the hardest docs

Sonnet 4.6 at XHIGH effort deterministically times out on the 4 biggest N5 documents (3 loss-run variants and the driver schedule). Multiple retries with the 1200s API timeout and a fail-fast diagnostic run all return "request timed out or interrupted". The extended-thinking response on those inputs exceeds Anthropic's long-request budget. Sonnet 4.6 at HIGH hits the same wall on 1 N5 doc. Rates for those cohorts are computed on what did complete (n=144 for sonnet_xhigh, n=147 for sonnet_high), and report.md footnotes every affected cell. Treat the Sonnet XHIGH column with a discount: the dropped docs are the hardest in the corpus, so the cohort rate is biased low. No other provider has this failure mode at this scale.

Reproduce

Test set + scoring pipeline

All 148 documents, prompts, scoring engine, and recall analyzer live in the public repo. determinism_test.py is the hard pre-publish CI gate.

Three levels of effort: inspect the checked-in aggregates with jq, re-score the pre-run extractions without API keys, or re-run extraction against the APIs. The repo README walks through each.

Full hallucination tables, recall-view breakdown, and per-model run counts live in report.md.

All data is synthetic: no real PII, no real policies, no real companies.