For carriers, MGAs, and IDP teams

Synthetic SOVs, loss runs, and ACORDs with ground truth built in.

Test data for in-house insurance extraction pipelines. Complete submission packets, correct by construction, never derived from customer data, and re-runnable as a fixed regression suite on every model upgrade.

No forms. Send a doc to tim@aginor.ai, get up to 20 labeled variants back in about 48 hours.

The library

Insurance is where the engine is deepest.

84 carrier templates
19 ACORD forms
65 adversarial patterns

Complete document packets with ground truth at three levels: document, field, and bounding box. Loss runs, ACORD forms, SOVs, dec pages, broker narratives, and more, each rendered through 84 carrier-specific templates with 56 visual variants sourced from real reference PDFs. Difficulty is a config value, from clean digital output to the tier that breaks frontier models.

Document types
Loss runs · 19 ACORD forms · SOVs · Dec pages · Broker narratives · Financial statements · Driver schedules · Experience mods
Pattern categories
Table structure · Data formats · Visual overlays · PDF internals · Scan effects · Cross-doc inconsistency
Example patterns
Broken CMaps · Kerning-as-spaces · Bezier curve borders · Invisible OCR layers · Merged headers · JS-computed fields · Reserve vs Outstanding · Phone photo warp · Font glyph corruption · Mainframe print
What breaks

The failure modes are insurance-specific. So is the test data.

Loss runs that don't reconcile

Claim rows, subtotals, and grand totals that quietly disagree. Your pipeline needs to notice, and a clean test set never forces it to. Here the discrepancy is generated on purpose and recorded in the ground truth.
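As a rough illustration of the check a pipeline needs here, the sketch below sums claim rows per policy period, compares them to the printed subtotals, then compares the subtotals to the printed grand total. The field names (`period`, `total_incurred`), the function name, and the sample figures are assumptions made for this example, not the schema of the shipped ground truth.

```python
from decimal import Decimal


def reconcile_loss_run(claims, subtotals, grand_total, tolerance=Decimal("0.01")):
    """Return a list of discrepancies; an empty list means the loss run ties out."""
    issues = []

    # Sum the claim rows for each policy period.
    row_sums = {}
    for claim in claims:
        period = claim["period"]
        row_sums[period] = row_sums.get(period, Decimal("0")) + Decimal(claim["total_incurred"])

    # Compare each printed subtotal with the sum of its rows.
    for period, printed in subtotals.items():
        computed = row_sums.get(period, Decimal("0"))
        if abs(computed - Decimal(printed)) > tolerance:
            issues.append(f"subtotal {period}: rows sum to {computed}, document prints {printed}")

    # Compare the printed grand total with the sum of the printed subtotals.
    subtotal_sum = sum((Decimal(v) for v in subtotals.values()), Decimal("0"))
    if abs(subtotal_sum - Decimal(grand_total)) > tolerance:
        issues.append(f"grand total: subtotals sum to {subtotal_sum}, document prints {grand_total}")

    return issues


# Illustrative values only: the 2023 subtotal is off by 300.00,
# while the grand total still agrees with the printed subtotals.
claims = [
    {"period": "2022", "total_incurred": "12500.00"},
    {"period": "2022", "total_incurred": "3200.00"},
    {"period": "2023", "total_incurred": "8100.00"},
]
subtotals = {"2022": "15700.00", "2023": "8400.00"}
grand_total = "24100.00"

issues = reconcile_loss_run(claims, subtotals, grand_total)
for issue in issues:
    print(issue)
```

A check that only compares the grand total with the subtotals would pass this document, which is why the row-level comparison is kept separate.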

Reserve vs Outstanding

The same concept wears different labels across carriers, and sometimes different concepts wear the same label. Label ambiguity is one of the 65 generated patterns, not an accident of sampling.

Carrier template sprawl

The loss run you get next month won't look like the one you tuned on. 84 carrier templates and 56 visual variants exist so your test set covers layouts before your customers send them.

Packets that disagree with themselves

An SOV that contradicts the ACORD 140 behind it. Cross-document inconsistency is a difficulty setting, planted deliberately, so ground truth tells you exactly which value is right.
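A minimal sketch of a cross-document check, assuming both documents have already been extracted into rows keyed by a location number. The field names (`location`, `building_value`, `contents_value`) and the sample values are hypothetical and chosen only for illustration.

```python
from decimal import Decimal


def cross_check(sov_rows, acord_rows, fields=("building_value", "contents_value")):
    """Compare per-location values across two extracted documents.

    Returns tuples of (location, field, sov_value, acord_value) for each conflict.
    """
    acord_by_location = {row["location"]: row for row in acord_rows}
    conflicts = []
    for sov in sov_rows:
        other = acord_by_location.get(sov["location"])
        if other is None:
            # The location appears on the SOV but not on the ACORD form.
            conflicts.append((sov["location"], "missing_in_acord", None, None))
            continue
        for field in fields:
            sov_value = Decimal(sov[field])
            acord_value = Decimal(other[field])
            if sov_value != acord_value:
                conflicts.append((sov["location"], field, sov_value, acord_value))
    return conflicts


# Illustrative values only: location 2 carries a different building value on each document.
sov_rows = [
    {"location": 1, "building_value": "2500000", "contents_value": "400000"},
    {"location": 2, "building_value": "1800000", "contents_value": "150000"},
]
acord_rows = [
    {"location": 1, "building_value": "2500000", "contents_value": "400000"},
    {"location": 2, "building_value": "1300000", "contents_value": "150000"},
]

conflicts = cross_check(sov_rows, acord_rows)
print(conflicts)
```

The check can only report that the two documents disagree. Deciding which value is correct requires the ground truth that accompanies the packet.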

Proof
Exhibit · Nightmare tier, 148 adversarial insurance documents

GPT-5.4 read a $42.0M revenue line and reported $21.65M. On an ACORD 45, GPT-5.5 got 7 of 9 building values wrong while the grand total still summed exactly right.

OpenAI models fabricated numeric values at 2-3x the Anthropic and Google rate.
31% of extractions scored below 0.5 without tripping a catastrophic flag.
Hallucination rates at matched default effort (thinking off)
Metric                          GPT-5.5  GPT-5.4  Opus 4.7  Sonnet 4.6
Numeric hallucination rate      11.4%    11.9%    3.4%      5.2%
String hallucination rate       5.8%     5.9%     2.5%      3.1%
Fabricated values per document  5.5      5.8      2.2       3.2

These are the documents your exact-match evals never see. The outputs look fine in isolation: well-formed currency, schema-valid rows, numbers that tie back to each other. You only catch this class of failure with labeled adversarial ground truth, and the full test is public.
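To make the point concrete, here is a small sketch of how a sum check can pass while individual fields are wrong. The values are invented for illustration and are not taken from the exhibit. The two helper functions are hypothetical names, not part of any published scoring code.

```python
from decimal import Decimal


def total_ties(values, printed_total):
    """Internal consistency check: do the extracted values sum to the printed total?"""
    return sum((Decimal(v) for v in values), Decimal("0")) == Decimal(printed_total)


def field_errors(extracted, truth):
    """Ground-truth check: indices of fields whose extracted value is wrong."""
    if len(extracted) != len(truth):
        raise ValueError("extracted and truth must have the same number of fields")
    return [i for i, (e, t) in enumerate(zip(extracted, truth)) if Decimal(e) != Decimal(t)]


# Illustrative values only: the first two building values are swapped between rows.
truth = ["500000", "750000", "250000"]
extracted = ["750000", "500000", "250000"]
printed_total = "1500000"

print(total_ties(extracted, printed_total))  # the sum still matches the printed total
print(field_errors(extracted, truth))        # yet two of the three fields are wrong
```

Only the second check needs labels, and only the second check catches the error.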

How it works

Email one document

A loss run your pipeline misreads, an SOV with a layout you dread, an ACORD packet. PDF, XLSX, CSV, scans: most file types work.

The engine generates variants

Up to 20 variants: same carrier layout and format, new underlying data, adversarial patterns injected where you want them so the model can't lean on memorization. Typical turnaround is 48 hours.

You test, then re-run forever

We placed the data, so the labels are exact: document, field, and bounding-box level. No annotation step, no SME bottleneck. Re-run the same suite on every model upgrade.
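One way to use a fixed suite on a model upgrade is sketched below: score both model versions against the same ground truth, then list the fields the old model got right and the new one gets wrong. The dictionary layout (document id, then field name, then value) and every name and value in the example are assumptions for illustration, not the delivered label format.

```python
def field_accuracy(predictions, ground_truth):
    """Fraction of ground-truth fields predicted exactly, across all documents."""
    correct = 0
    total = 0
    for doc_id, truth_fields in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        for name, value in truth_fields.items():
            total += 1
            if predicted.get(name) == value:
                correct += 1
    return correct / total if total else 0.0


def regressions(baseline, candidate, ground_truth):
    """Fields the baseline model got right and the candidate model gets wrong."""
    regressed = []
    for doc_id, truth_fields in ground_truth.items():
        for name, value in truth_fields.items():
            was_right = baseline.get(doc_id, {}).get(name) == value
            now_right = candidate.get(doc_id, {}).get(name) == value
            if was_right and not now_right:
                regressed.append((doc_id, name))
    return regressed


# Illustrative values only.
ground_truth = {
    "loss_run_01": {"insured": "Acme Freight", "total_incurred": "24100.00"},
    "sov_01": {"tiv": "4850000"},
}
baseline = {
    "loss_run_01": {"insured": "Acme Freight", "total_incurred": "24100.00"},
    "sov_01": {"tiv": "4580000"},
}
candidate = {
    "loss_run_01": {"insured": "Acme Freight", "total_incurred": "21400.00"},
    "sov_01": {"tiv": "4850000"},
}

print(field_accuracy(baseline, ground_truth), field_accuracy(candidate, ground_truth))
print(regressions(baseline, candidate, ground_truth))
```

In this example both versions score the same aggregate accuracy, yet the candidate has broken a field the baseline handled. Tracking per-field regressions alongside the headline number surfaces that.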

Send the loss run that breaks your pipeline.

Email one hard document. You get up to 20 variants back: same layout, new data, adversarial patterns where you want them, ground truth attached.

Email one hard document

Prefer to write your own email? tim@aginor.ai works.