Building an LLM agent for a US insurer in 2026: five layers of the stack, and where most projects fail.
US insurers are past the LLM-agent hype stage. The question in 2026 is not whether to build, but how to ship a narrow agent that survives contact with a claims-admin system, a state DOI rate filing, and a policyholder expecting resolution within 48 hours. The stack has five layers, and most projects that fail do so on layer 2.
TL;DR
- LLM agents for US insurers in 2026 are shipping in production, not R&D. Most carriers that will run an agent in 2027 are already building one.
- The stack has five layers: data (what the agent reads), tools (what APIs it calls), orchestration (which framework stitches them together), evaluation (how the carrier tests it pre-production), governance (what state DOIs will ask about).
- Most agent projects that fail do so on layer 2, not on model choice. Teams underestimate how many vendor-tool integrations a useful agent actually needs, and how much each one costs.
- The pattern that works: narrow agent scope. One line of business, one use case, one measurable outcome. Horizontal "insurance AI assistants" consistently underdeliver.
- State DOIs (California first, New York close behind) are moving on LLM-agent governance ahead of federal guidance. Carriers shipping now should assume an explainability artefact requirement by 2027.
Why this matters in 2026
Three signals tell you the category has moved from experiment to production.
First, carrier procurement. The 2026 RFP cycle at most tier-1 and tier-2 US P&C carriers now contains explicit LLM-agent provisions. Twelve months ago the same language sat in "AI innovation roadmap" documents. The move from innovation budget to procurement cycle is the signal that counts.
Second, vendor M&A. CCC's $730M acquisition of EvolutionIQ in January 2025 was explicitly positioned around AI guidance for disability and injury claims management. Moody's acquiring Cape Analytics is, in part, an agent-tooling play: property-level attributes become structured input that an underwriting agent can call. The consolidation pattern is aggregating the tool layers that agents need.
Third, regulator posture. California's Department of Insurance has been running explicit AI-in-rating reviews since 2024. New York DFS has similar guidance. The direction of travel is clear: state-level governance will formalise before federal guidance arrives.
The five layers of the stack
Layer 1 — Data: what the agent can read
The most-overlooked layer. An agent is only as useful as the data it can reach at inference time. For a US P&C claims agent this means:
- Policy records (coverage, limits, deductibles, endorsements) — usually on the carrier's policy-admin system (Guidewire, Duck Creek, Insurity).
- Claim records (FNOL, photos, estimates, medical bills, adjuster notes) — on the claims-admin system.
- Property / vehicle data (attributes, telematics, imagery) — often from third-party platforms: Cape Analytics, Nearmap, Cambridge Mobile Telematics.
- Industry data cooperatives (loss runs, claim histories) — Verisk ClaimSearch, LexisNexis Risk Solutions.
- Unstructured documents (policy PDFs, loss-run attachments, police reports, medical bills) — handled by document-intelligence platforms like Hyperscience or Rossum.
Every carrier building an agent is forced to confront how much of this data lives behind systems that were not designed for agentic access. That confrontation is usually harder than the agent build itself.
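The data layer above can be sketched as a single inference-time assembly step. This is a minimal sketch under assumptions: the `ClaimContext` fields and the fetcher names are illustrative placeholders, not any vendor's actual schema, and the stubs stand in for real system calls.

```python
from dataclasses import dataclass

# Hypothetical shape for the data a claims agent reads at inference time.
# Field names are illustrative, not any carrier's or vendor's schema.

@dataclass
class ClaimContext:
    policy: dict          # coverage, limits, deductibles, endorsements
    claim: dict           # FNOL, estimates, adjuster notes
    asset: dict           # property/vehicle attributes from third parties
    loss_history: list    # industry-cooperative records
    documents: list       # extracted text from PDFs, bills, reports

def assemble_context(claim_id: str, fetchers: dict) -> ClaimContext:
    """Pull everything the agent needs for one claim, one call per source.
    `fetchers` maps source name -> callable, so each vendor integration
    stays swappable and testable with stubs."""
    return ClaimContext(
        policy=fetchers["policy_admin"](claim_id),
        claim=fetchers["claims_admin"](claim_id),
        asset=fetchers["asset_data"](claim_id),
        loss_history=fetchers["loss_runs"](claim_id),
        documents=fetchers["doc_extract"](claim_id),
    )

# Stubbed fetchers stand in for the real systems listed above.
stubs = {
    "policy_admin": lambda cid: {"limit": 50_000, "deductible": 500},
    "claims_admin": lambda cid: {"fnol": "rear-end collision"},
    "asset_data":   lambda cid: {"vehicle_year": 2021},
    "loss_runs":    lambda cid: [],
    "doc_extract":  lambda cid: ["police report text"],
}
ctx = assemble_context("CLM-001", stubs)
```

The design point is the `fetchers` indirection: every source behind a callable means the agent build can proceed against stubs while the real integrations (the hard part) land one by one.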
Layer 2 — Tools: what the agent can call
This is where most projects fail.
A useful insurance agent does not "reason" about claims in the abstract. It calls specific tools: a fraud-scoring API, a document-extraction service, an estimate calculator, a subrogation identifier. Each tool is a vendor integration, each with its own authentication, rate limits, data shapes, and cost.
A realistic claims-triage agent might call:
- Shift Technology or FRISS for fraud scoring
- Hyperscience or Rossum for document extraction
- CCC Intelligent Solutions for auto damage estimation
- Cape Analytics for property attributes
- Cambridge Mobile Telematics for driving behaviour
- Native Guidewire or Duck Creek endpoints for claim record updates
That is six-plus vendor integrations before the agent has done any reasoning. Each integration is a real procurement, an API agreement, a data-mapping exercise, and a maintenance commitment. Carriers that budget one engineering quarter for "integrating the agent's tools" consistently underestimate by 2-3x.
The failure mode: the model can reason beautifully in a prototype that calls stubbed tools, and the prototype ships in three months. The production agent, with real vendor APIs under load, takes 12-18 months. Leadership loses patience somewhere in month 9.
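What one production integration carries, beyond the happy-path call a prototype makes, can be sketched as follows. Everything here is an assumption for illustration: the `VendorTool` class, the per-call price, and the stub transport are placeholders, not any vendor's real client.

```python
import time

# Minimal sketch of what a single production tool integration carries:
# timeouts, retries with backoff, and per-call cost tracking. The class
# and the stub transport are hypothetical, not a real vendor SDK.

class VendorTool:
    def __init__(self, name, cost_per_call, transport, max_retries=2):
        self.name = name
        self.cost_per_call = cost_per_call   # dollars per API call
        self.transport = transport           # callable doing the HTTP work
        self.max_retries = max_retries
        self.calls = 0
        self.spend = 0.0

    def call(self, payload):
        last_err = None
        for attempt in range(self.max_retries + 1):
            try:
                result = self.transport(payload)
                self.calls += 1
                self.spend += self.cost_per_call   # meter every call
                return result
            except TimeoutError as err:            # retry transient failures only
                last_err = err
                time.sleep(0.1 * (2 ** attempt))   # exponential backoff
        raise RuntimeError(f"{self.name} failed after retries") from last_err

# A stub transport standing in for a real fraud-scoring API.
fraud = VendorTool("fraud_score", cost_per_call=0.40,
                   transport=lambda p: {"score": 0.12})
result = fraud.call({"claim_id": "CLM-001"})
```

A prototype with stubbed tools skips all of this; the production agent needs it per vendor, which is where the 2-3x underestimate comes from.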
Layer 3 — Orchestration: which framework stitches it together
Orchestration is the glue: the framework that lets the model decide which tool to call, handle failures, manage multi-turn context, and log what happened.
In 2026 the practical choices are:
- Direct API use (OpenAI Responses API, Anthropic Messages API with tool_use blocks, Google Gemini function-calling). Minimal framework overhead. Fine for narrow single-purpose agents.
- Anthropic MCP (Model Context Protocol). Open standard, maturing fast. Works well when the agent needs to reach a growing number of tools without custom integration per tool.
- LangChain, LangGraph, LlamaIndex. Framework-heavy. Real value in complex multi-step workflows. Cost: non-trivial dependency management.
- Carrier-built orchestration. Common at tier-1 carriers with strong engineering teams. Avoids vendor lock-in; requires ongoing investment.
Orchestration is a real choice but rarely the load-bearing one. Most agent projects succeed or fail on layers 2 and 4.
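Whichever option a carrier picks, the loop underneath is the same: the model proposes a tool call, the glue executes it, appends the result, and repeats until a final answer. A minimal sketch, with the model call stubbed (`fake_model` is a placeholder, not a real Messages or Responses API call):

```python
# The orchestration loop every framework on the list ultimately
# implements. `fake_model` stands in for a real LLM API call.

def fake_model(messages, tools):
    # Stub: request a fraud score once, then produce a final answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "fraud_score",
                "args": {"claim_id": "CLM-001"}}
    return {"type": "final", "text": "Low fraud risk; route to fast track."}

def run_agent(model, tools, user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):              # hard step budget: never loop forever
        out = model(messages, tools)
        if out["type"] == "final":
            return out["text"], messages    # answer plus the full audit trail
        result = tools[out["name"]](**out["args"])   # execute the chosen tool
        messages.append({"role": "tool", "name": out["name"],
                         "content": result})
    raise RuntimeError("agent exceeded step budget")

tools = {"fraud_score": lambda claim_id: {"score": 0.12}}
answer, trail = run_agent(fake_model, tools, "Triage claim CLM-001")
```

Note that the loop's message list doubles as the audit trail layer 5 will ask for; keeping it is free here and expensive to retrofit.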
Layer 4 — Evaluation: how the carrier tests it pre-production
An LLM claims agent that returns a wrong denial reason once in a thousand interactions is not an acceptable production system. Evaluation has to surface those one-in-a-thousand failures before production, not after.
Practical eval for insurance agents combines:
- Golden-set testing: 500-5000 human-labelled claim scenarios with known-correct agent behaviour. Re-run on every model / prompt / tool change. Score: failure rate, false-denial rate, false-approval rate.
- Regression suite: the specific failure modes discovered in development, frozen as tests. Every new build must not regress against them.
- Adversarial testing: deliberately crafted edge cases. Policyholder in late-pay status, ambiguous coverage, fraud-adjacent claim, edge-case peril.
- Shadow-mode production: the agent runs silently on real claims, and its outputs are compared against human adjusters. The shadow-mode data is the validation gate before any decision the agent makes becomes binding.
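Golden-set scoring is mechanically simple; the expense is in the labelling. A minimal sketch of the scoring side, where the `agent` callable and the scenario shape are illustrative assumptions:

```python
# Score an agent against a human-labelled golden set, computing the
# three rates named above. Scenario shape and stub agent are illustrative.

def score_golden_set(agent, golden_set):
    false_denials = false_approvals = failures = 0
    for scenario in golden_set:
        got = agent(scenario["input"])
        want = scenario["expected"]          # human-labelled correct behaviour
        if got != want:
            failures += 1
            if got == "deny" and want == "approve":
                false_denials += 1           # the costliest failure class
            elif got == "approve" and want == "deny":
                false_approvals += 1
    n = len(golden_set)
    return {"failure_rate": failures / n,
            "false_denial_rate": false_denials / n,
            "false_approval_rate": false_approvals / n}

# Toy golden set; a real one is 500-5000 labelled claim scenarios.
golden = [
    {"input": "clean physical-damage claim", "expected": "approve"},
    {"input": "claim on lapsed policy",      "expected": "deny"},
    {"input": "fraud-flagged claim",         "expected": "escalate"},
]
stub_agent = lambda text: "approve" if "clean" in text else "deny"
report = score_golden_set(stub_agent, golden)
```

Re-running this report on every model, prompt, or tool change is the whole discipline; the regression suite is the same loop over the frozen failure cases.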
Tools: ragas, promptfoo, and a growing set of insurance-specific eval frameworks. Most tier-1 carriers end up with a home-grown eval harness; tier-2 typically adopt one off-the-shelf.
Carriers that skip this layer ship an agent that works in demo and fails in quarterly regulatory review.
Layer 5 — Governance: what state DOIs will ask about
The DOI surface is moving faster than most teams expect. California, New York, and a growing list of other states have released or begun drafting AI-in-insurance guidance.
The specific questions carriers should expect at the next rate filing or market conduct exam:
- Explainability per decision. If the agent denied a claim or priced a policy, show the reasoning trail. Not the full model chain-of-thought; the decision-level attribution (which data, which tools, which rule applied). Requirements differ by state; California is the most advanced.
- Bias testing. Did the agent systematically produce different outcomes across protected classes? Tested against realistic demographic distributions, not just against the training set.
- Audit log retention. How long are agent decisions stored, in what format, and who can retrieve them?
- Human-in-the-loop documentation. At what thresholds does the agent escalate to a human? Who approves the threshold? Can the DOI audit escalation compliance?
- Vendor-data flow diagrams. Which third-party tools did the agent consult? Was any policyholder PII shared with vendors that shouldn't have received it?
None of this is science fiction. The California DOI's 2024-2025 guidance on AI in rate filings already requires most of these artefacts for ML-pricing models. Agents will follow.
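A decision-level explainability artefact can be as plain as a structured record per decision, emitted at the moment the agent acts. The field names below are assumptions for illustration, not any DOI's required schema:

```python
import datetime
import json

# Sketch of a per-decision explainability artefact: decision-level
# attribution (data, tools, rule), not model chain-of-thought.
# All field names are hypothetical, not a regulator's schema.

def decision_record(claim_id, decision, data_sources, tool_calls,
                    rule_applied, escalated_to_human):
    return {
        "claim_id": claim_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "decision": decision,                 # e.g. "draft-approve"
        "data_sources": data_sources,         # which records the agent read
        "tool_calls": tool_calls,             # which vendor tools it consulted
        "rule_applied": rule_applied,         # the controlling guideline clause
        "escalated_to_human": escalated_to_human,
    }

record = decision_record(
    claim_id="CLM-001",
    decision="draft-approve",
    data_sources=["policy_admin", "claims_admin"],
    tool_calls=[{"tool": "fraud_score", "result": {"score": 0.12}}],
    rule_applied="underwriting guideline 4.2 (physical damage, under limit)",
    escalated_to_human=False,
)
archived = json.dumps(record)   # serialised for retention and retrieval
```

Emitting this at decision time costs a few lines in the orchestration loop; reconstructing it after a market conduct exam request costs a remediation programme.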
The pattern that works: narrow scope
The pattern that ships in production and survives a quarterly review is narrow: one LOB, one use case, one measurable outcome.
Concrete examples that have shipped or are close to shipping at US carriers in 2026:
Claims-adjuster assistant for auto-physical-damage. The agent reads the FNOL, pulls the CCC estimate and driver telematics, surfaces fraud flags from Shift or FRISS, drafts a coverage-determination note for the adjuster. It does not make the decision. Measurable outcome: adjuster time-to-decision on physical-damage claims.
Submission-triage agent for commercial lines. The agent reads broker-originated ACORD submissions, extracts structured data via Hyperscience or Rossum, calls Cytora or Send for risk-enrichment, routes the submission to the correct underwriter. Measurable outcome: broker-to-underwriter cycle time.
Underwriter Q&A assistant for specialty. The agent has read the policy forms, the carrier's underwriting guidelines, and the SOV. When an underwriter asks "does this roof-age trigger a deductible adjustment under policy form X?" the agent cites the specific clause and the specific SOV row. No binding decisions. Measurable outcome: underwriter productivity on complex specialty policies.
Notice what these examples share. Scope is tight. Outcome is measurable. The agent never makes a binding decision without a human. Tools called are finite and known.
Horizontal "insurance AI co-pilots" that claim to handle "any claim, any policy, any line" consistently underdeliver. The breadth means the agent has never been properly evaluated on any specific use case.
Three honest failure modes
1. Pilot theatre. An agent built on fake data, fake APIs, and narrow happy-path scenarios demos beautifully. Senior leadership signs off. Production integration reveals that the real claims-admin system does not support the data access the demo assumed. The project runs 18 months longer than planned and ships at 30% of promised functionality.
2. Vendor-tool-cost surprise. Six vendor integrations at $N per call, times M calls per claim, times K claims per year, equals a number that shocks the finance team in year 2. Agents that are profitable on paper become break-even or loss-making once real usage ramps up.
3. Regulatory reset. The agent is in production. A state DOI market conduct exam surfaces a claim where the agent's decision cannot be adequately explained. The carrier is required to pause agent-driven decisions on that LOB, re-submit rate filings, and pay remediation. All of this is avoidable with upfront governance investment; most of it happens because the carrier treated governance as a phase-3 concern instead of a phase-1 constraint.
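The arithmetic in failure mode 2 is worth making concrete. All numbers below are illustrative assumptions, not vendor quotes:

```python
# Worked example of the vendor-tool cost arithmetic. Per-call prices,
# call counts, and volumes are illustrative assumptions only.

integrations = {           # dollars per call for each vendor tool
    "fraud_score": 0.40, "doc_extract": 0.25, "estimate": 0.60,
    "property_attrs": 0.30, "telematics": 0.20, "claims_admin": 0.05,
}
calls_per_claim = 2        # average calls to each tool per claim
claims_per_year = 250_000

cost_per_claim = sum(integrations.values()) * calls_per_claim
annual_tool_cost = cost_per_claim * claims_per_year
print(f"${cost_per_claim:.2f} per claim -> ${annual_tool_cost:,.0f}/year")
```

With these assumed numbers the tool spend alone is $3.60 per claim, roughly $900,000 a year, before model inference costs or engineering headcount, which is the shape of the number that surprises finance in year 2.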
What a sensible agent roadmap looks like
For a tier-1 or tier-2 US P&C carrier starting in the next 12 months:
- Months 1-3: pick one LOB, one use case, one outcome. Map the five layers for that one scope. Identify which tools the agent will call and confirm vendor API availability + pricing.
- Months 4-6: build the eval harness before building the agent. Label a 500-claim golden set. Define success: e.g. "agent draft matches human adjuster decision on 92% of physical-damage claims at $N average cost per claim."
- Months 7-12: build the agent. Shadow-mode production from month 9. Pass the eval harness at target thresholds before any binding decision.
- Months 13-18: governance pre-certification with relevant state DOIs. Rate-filing refresh if the agent touches pricing. Human-in-the-loop documentation.
- Month 18+: scale to second use case in same LOB. Do not widen to a new LOB until the first one has survived a full regulatory review cycle.
Carriers that compress this to "build and ship in 6 months" either ship a demo, not a production system, or skip governance and pay for it in year 2.
Closing
Building an LLM agent for a US insurer in 2026 is a solvable engineering problem, a harder vendor-integration problem, and a soon-to-be-harder governance problem. The engineering problem is the one teams focus on; the other two are what kill projects.
The carriers that will look good in 2027 are the ones shipping narrow, well-evaluated, well-governed agents in 2026. Not horizontal co-pilots. Not fully autonomous claim handlers. Narrow tools that make one specific decision measurably better, with a human in the loop and a regulator in the background.
The rest of this decade's insurance-tech story is going to be written by carriers who got the narrow scope right, not by carriers who built the cleverest model.
Frequently asked
Is Anthropic MCP a good choice for an insurer's agent stack in 2026?
Conditionally. MCP works well when the agent needs to reach a growing number of tools without custom integration per tool. It is Anthropic-originated and still early in adoption outside the Anthropic ecosystem. For a narrow agent calling 5-6 known vendor APIs, direct Messages API / Responses API tool_use may be simpler. For agents expected to grow in tool surface over years, MCP is a defensible bet but not risk-free.
How many vendor integrations does a typical claims agent need?
A realistic US P&C claims-triage agent calls 5-8 vendor tools: fraud scoring, document extraction, estimate calculation, property or vehicle attributes, claims-admin read/write, and one or two industry-data-cooperative APIs. Each integration is real procurement work; carriers should budget 2-3x the engineering time their first estimate suggests.
What is the most common failure mode for insurance agent projects?
Vendor-tool integration timeline overrun. The model works in a prototype with stubbed tools. The production version, with real vendor APIs under actual load and pricing, takes 12-18 months instead of 6. Leadership loses patience in month 9, scope gets cut, and the agent ships at 30-50% of the originally promised functionality.
What governance should an insurer assume state DOIs will require?
At minimum: per-decision explainability artefacts (not full model chain-of-thought, but decision-level attribution to data + tools + rules used), bias testing across protected classes, audit log retention, and documented human-in-the-loop thresholds. California's DOI guidance on AI in rate filings (2024-2025) already requires most of these for ML-pricing models; agents will follow the same trajectory.
Sources
- CCC Intelligent Solutions Completes Acquisition of EvolutionIQ — CCC Intelligent Solutions
- California Department of Insurance — California Department of Insurance
- Moody's to Acquire CAPE Analytics — Moody's
- Model Context Protocol (MCP) announcement — Anthropic