Synthetic training data for AI document extraction and security agents

We build synthetic insurance documents and security logs. Broken tables, corrupted fonts, scan artifacts, carrier-specific layouts, attack scenarios buried in noise. The stuff your agents choke on in production, with labels already attached.

Your AI agents can't train on data that looks like what they'll actually see

Real data is off-limits

HIPAA, SOC 2, carrier agreements. The documents your agent needs to handle are the ones you can't legally keep or reuse for training.

LLM synthetic data doesn't break anything

Teams that try it get clean, self-consistent docs. The agent looks great in eval, then falls apart on real inputs with real formatting problems.

Human labeling hits a ceiling fast

SME hourly rates, weeks of turnaround. You get maybe a few hundred labeled docs before the budget or the timeline kills you, and that's nowhere near production volume.

Format explosion

Carriers format differently, lines of business have different fields, and every scan introduces different artifacts. The combinations multiply faster than any labeling team can keep up with.

Frankly it's enough to convince me you'd be capable of getting us real-enough data for benchmarking and potentially finetuning. You have genuinely neat tech I have not seen from any other company.

CTO, AI Document Extraction Company · Insurance vertical · After reviewing full technical proposal

I used the data to validate a patent-pending authorization framework against real enterprise access patterns. The quality of the synthetic data was good enough to build production AI on — that's an extremely high bar.

Design partner, AI Security · Built directly on top of the platform

Synthetic Insurance Documents

65 patterns
82 carrier templates
19 ACORD forms
56 visual variants

Complete document packets with ground truth at three levels: document, field, and bounding box. Loss runs, ACORD forms, SOVs, dec pages, broker narratives, and more, each rendered through 82 carrier-specific templates with 56 visual variants sourced from real reference PDFs.

Pattern categories
Table structure, Data formats, Visual overlays, PDF internals, Scan effects, Cross-doc inconsistency
Example patterns
Broken CMaps, Kerning-as-spaces, Bezier curve borders, Invisible OCR layers, Merged headers, JS-computed fields, Reserve vs Outstanding, Phone photo warp, Font glyph corruption, Mainframe print
Document types
Loss runs, 19 ACORD forms, SOVs, Dec pages, Broker narratives, Financial statements, Driver schedules, Experience mods

Synthetic Security Logs

42 scenarios
52K events / scenario
130 simulated actors
33 false positive patterns

Okta System Log event streams with attack signals buried in 52K events of normal traffic. Attack events are <0.2% of volume, and 33 false-positive patterns make sure your agent can't cheat with single-signal detection. No structural tells. Agents have to reason about behavior.
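To put those proportions in numbers, here is a rough Python sketch. It assumes "52K" means exactly 52,000 events and treats 0.2% as an upper bound on the attack share; the constant and variable names are illustrative, not part of any product API. It shows why raw accuracy is a weak metric at this class balance: a detector that never alerts still scores at least 99.8%.

```python
# Illustrative arithmetic only. Assumes "52K" means exactly 52,000 events
# and takes the stated "<0.2%" as an upper bound on the attack share.
EVENTS_PER_SCENARIO = 52_000
MAX_ATTACK_FRACTION = 0.002

# Upper bound on the number of attack events in one scenario.
max_attack_events = round(EVENTS_PER_SCENARIO * MAX_ATTACK_FRACTION)

# A detector that labels every event "benign" is correct on every benign
# event and wrong on every attack event, so its accuracy is the benign share.
always_benign_accuracy = 1 - max_attack_events / EVENTS_PER_SCENARIO
always_benign_recall = 0.0  # it never flags a single attack event

print(max_attack_events)                 # 104
print(f"{always_benign_accuracy:.3%}")   # 99.800%
```

Recall and precision on the attack events are the more telling measures here, since accuracy barely moves whether the agent finds the attack or not.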

Attack categories
Direct credential attacks, SSO & post-compromise, Absence detection, Kill chains
Example scenarios
Credential stuffing, MFA fatigue, Golden SAML bypass, AiTM phishing, Scattered Spider, AWS role escalation, SaaS data exfil, Token replay, Consent phishing, Ghost sessions
Anti-fingerprinting
Real ASNs/ISPs, ms timestamp jitter, Per-actor device IDs, 14-day baselines, 1,025 event types

01

Define scope

Tell us the doc type, carrier format, or attack scenarios you care about, and which edge cases matter most.

02

We generate

Our engine builds the documents or logs with the specific layout problems, format variation, and corruption you asked for.

03

You train

Same idea as computer vision: we placed the data, so we already know what's in it. Every file ships with ground truth. No annotation step, no SME bottleneck.
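A toy Python sketch of that "labels by construction" idea follows. Everything in it is hypothetical: `FieldLabel`, `kerning_as_spaces`, `generate_sample`, the character-cell bounding boxes, and the sample values are made up for illustration and do not describe the actual engine or its output format. The corruption is a simple stand-in loosely inspired by the "Kerning-as-spaces" pattern name. The point is only that the generator records the clean value and its position before adding noise, so the label exists without any annotation pass.

```python
import random
from dataclasses import dataclass


@dataclass
class FieldLabel:
    name: str    # field-level truth: which field this is
    value: str   # the clean value the generator placed
    bbox: tuple  # (x, y, width, height) in character cells (hypothetical units)


def kerning_as_spaces(text, rng):
    """Toy corruption: randomly insert a space after some characters."""
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < 0.5:
            out.append(" ")
    return "".join(out).rstrip()


def generate_sample(doc_type, fields, seed=0):
    """Render noisy lines and return them with ground truth at three levels."""
    rng = random.Random(seed)
    rendered, labels = [], []
    for row, (name, value) in enumerate(fields.items()):
        noisy = kerning_as_spaces(value, rng)
        rendered.append(f"{name}: {noisy}")
        # The generator knows where it put the value, so the box is free.
        labels.append(FieldLabel(name, value, (len(name) + 2, row, len(noisy), 1)))
    return rendered, {"doc_type": doc_type, "fields": labels}


# Made-up field names and values, for illustration only.
rendered, truth = generate_sample(
    "loss_run", {"claim_number": "CLM-1042", "reserve": "12500.00"}
)
for line in rendered:
    print(line)
print(truth["doc_type"], [f.value for f in truth["fields"]])
```

The rendered lines carry the corruption; the returned labels carry the document type, the clean field values, and where each value landed.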

65 extraction-breaking patterns
42 attack scenarios
82 carrier templates
56 visual variants

Tim Michaud, Founder
YC Alum · Previously Staff Security Eng @ Moveworks (acq. ServiceNow, $2.85B)

I shipped 250+ AI agents to Fortune 500 companies at Moveworks and watched them break on real-world inputs. Before that I spent a decade in security research finding bugs in Apple, Chrome, and Qualcomm. Aginor came from putting those two things together: I know what production data does to agents, and I know how to generate the inputs that break them.

Get sample synthetic training data

Tell us your doc type or attack scenario and we'll generate a sample batch with labels.

Request sample