phidea
Time-stability check · 2026-04-24 → 2026-05-04 · 10-day diff

Property-type held. Rideshare flipped back to the specialist. MAPFRE-Boston softened.

Re-ran the home and auto validated-study probes 10 days after the original 2026-04-24 baseline, using the same probe scripts, the same 5-runs-per-LLM cadence, and the same query set. The retest checks whether each headline finding is real signal or a single-day editorial accident.
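A minimal sketch of how a cell such as Nationwide 3/5 can be read, assuming each run is reduced to a single top-recommended carrier name. The probe scripts themselves are not shown in this report, so the function name modal_share, the input shape, and the use of None for an unparseable run are all hypothetical.

```python
from collections import Counter

def modal_share(run_winners):
    """Return (carrier, count, total) for the most frequent carrier.

    run_winners holds one top-carrier string per run. None marks a run
    whose response could not be parsed (an assumed convention, not the
    probe scripts' actual storage format).
    """
    total = len(run_winners)
    parsed = [w for w in run_winners if w is not None]
    if not parsed:
        return (None, 0, total)
    carrier, count = Counter(parsed).most_common(1)[0]
    return (carrier, count, total)

# Illustrative input only, not actual probe output.
runs = ["Nationwide", "Nationwide", "State Farm", "Nationwide", "Allstate"]
carrier, count, total = modal_share(runs)
print(f"{carrier}:{count}/{total}")  # prints "Nationwide:3/5"
```

Keeping the denominator at the full run count, rather than the parsed count, means a null run lowers the share instead of silently shrinking the sample.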

TL;DR
  • Property-type ownership held or strengthened. Condo → Nationwide tightened from 3/5 + 5/5 to 5/5 + 5/5. Historic → Chubb tightened from 4/5 to 5/5 on Perplexity. Luxury → Chubb stayed 5/5 + 5/5. Jewelry → Chubb softened slightly on Gemini (5/5 → 4/5) but still clear.
  • Newark NJ auto → NJM strengthened. Both LLMs tightened from 4/5 to 5/5. The cleanest cross-day, cross-LLM finding in the entire dataset.
  • Rideshare reversal. Round 2 found State Farm (generalist) winning rideshare cross-LLM at 3/5 + 5/5. Today Progressive (the actual TNC-endorsement specialist) wins on both LLMs at 4/5 + 3/5. The “inverted-ownership” framing for rideshare is invalidated.
  • Boston home → MAPFRE softened. Was 5/5 + 3/5; today 2/5 + 0/5 (Gemini modal flipped to Amica). The MAPFRE-Boston-home claim no longer holds at the original confidence.
  • SR-22 split. Perplexity still has GEICO 5/5; Gemini flipped to State Farm 4/5. The inverted-ownership claim for SR-22 is now Perplexity-only.
  • Claude tool-use parsing still gappy. Most Claude runs returned null today (web_search_20260209 tool response shape continues to break our parser). The Perplexity + Gemini cross-LLM panel remains the load-bearing comparison.

Home — property-type ownership (H2)

| Query | Predicted | Perp 2026-04-24 | Perp 2026-05-04 | Gem 2026-04-24 | Gem 2026-05-04 | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| Best housing insurance for a condo in Seattle? | Nationwide | Nationwide 3/5 | Nationwide 5/5 | Nationwide 5/5 | Nationwide 5/5 | strengthened |
| Best housing insurance for a luxury home in Seattle? | Chubb | Chubb 5/5 | Chubb 5/5 | Chubb 5/5 | Chubb 5/5 | stable |
| Best housing insurance for a historic home in Seattle? | Chubb | Chubb 4/5 | Chubb 5/5 | Chubb 5/5 | Chubb 5/5 | strengthened |

Best housing insurance for a condo in Seattle? Perplexity went from 3/5 to 5/5, a cleaner cross-LLM pattern than at baseline.

Home — specialty peril (H3)

| Query | Predicted | Perp 2026-04-24 | Perp 2026-05-04 | Gem 2026-04-24 | Gem 2026-05-04 | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| Best insurance in Seattle for high-value jewelry coverage? | Chubb / PURE | Chubb 5/5 | Chubb 5/5 | Chubb 5/5 | Chubb 4/5 | stable |

Best insurance in Seattle for high-value jewelry coverage? Slight Gemini softening (5/5 → 4/5), same modal carrier. Cross-LLM stability holds.

Home — regional champion

| Query | Predicted | Perp 2026-04-24 | Perp 2026-05-04 | Gem 2026-04-24 | Gem 2026-05-04 | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| Best value housing insurance in Boston, Massachusetts? | MAPFRE (regional champion) | MAPFRE 5/5 | MAPFRE 2/5 | MAPFRE 3/5 | Amica 3/5 | drifted |

Best value housing insurance in Boston, Massachusetts? MAPFRE collapsed on Perplexity (5/5 → 2/5), and the Gemini modal flipped from MAPFRE to Amica. The Boston-MAPFRE-home claim no longer holds at the same confidence: it is degraded from 'clear' to 'softened-or-drifted'.

Auto — emerging category

| Query | Predicted | Perp 2026-04-24 | Perp 2026-05-04 | Gem 2026-04-24 | Gem 2026-05-04 | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| Best car insurance for an electric vehicle in Seattle? | varies (open category) | Travelers 4/5 | Travelers 5/5 | Travelers 5/5 | Travelers 4/5 | stable |

Best car insurance for an electric vehicle in Seattle? EV → Travelers held. Perplexity tightened from 4/5 to 5/5; Gemini softened from 5/5 to 4/5. Net: the same carrier still leads clearly on both LLMs.

Auto — specialty use-case ownership

| Query | Predicted | Perp 2026-04-24 | Perp 2026-05-04 | Gem 2026-04-24 | Gem 2026-05-04 | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| Best car insurance for rideshare drivers in Seattle? | Progressive (specialist) | State Farm 3/5 | Progressive 4/5 | State Farm 5/5 | Progressive 3/5 | drifted |
| Best car insurance in Seattle for SR-22? | Dairyland (specialist) or generalist | GEICO 5/5 | GEICO 5/5 | GEICO 3/5 | State Farm 4/5 | softened |

Best car insurance for rideshare drivers in Seattle? Major reversal. Round 2 found State Farm (a generalist) winning rideshare cross-LLM. Today Progressive (the actual TNC-endorsement specialist) wins on both LLMs. The 'inverted ownership' framing for rideshare is now invalid: the specialist actually wins. The wider claim that 'editorial depth beats specialty' may be more conditional than we said.

Best car insurance in Seattle for SR-22? Perplexity is stable on GEICO at 5/5, so the inverted finding holds there. Gemini flipped from GEICO 3/5 to State Farm 4/5, so cross-LLM agreement broke down. SR-22 inverted ownership is now Perplexity-only, not cleanly cross-LLM.

Auto — regional monopoly (line-specific)

| Query | Predicted | Perp 2026-04-24 | Perp 2026-05-04 | Gem 2026-04-24 | Gem 2026-05-04 | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| Best car insurance in Newark, New Jersey? | NJM | NJM 4/5 | NJM 5/5 | NJM 4/5 | NJM 5/5 | strengthened |

Best car insurance in Newark, New Jersey? Both LLMs tightened from 4/5 to 5/5. NJM-NJ-auto is now the cleanest cross-day, cross-LLM finding in the dataset.

What this means for the validated-study claims

Strengthened — promote to high-confidence. Property-type ownership in home insurance (condo, luxury, historic → Nationwide / Chubb / Chubb) replicates at the highest stability we’ve measured. Newark NJ auto → NJM replicates at 5/5 + 5/5 — the regional-monopoly claim for NJ auto is now bullet-proof.

Soften — reduce confidence. The MAPFRE-Boston-home finding is no longer holding at the 5/5 + 3/5 measured at baseline. Today it returned MAPFRE 2/5 on Perplexity and Amica 3/5 on Gemini. The lever (regional monopoly works for line-specific dominance) survives, but this specific instantiation is degrading. Limitations section of ablation-home-insurance updated.

Reverse — the rideshare inverted-ownership claim is invalidated. Round 2 reported State Farm winning rideshare cross-LLM, framed as “generalists annex specialty surfaces.” Today Progressive (the actual specialist) wins on both LLMs. The wider claim that “editorial depth beats specialty” needs to be stated more conditionally going forward. Limitations section of ablation-auto-insurance updated.

Method limitation: Claude tool-use parser is still gappy. The web_search_20260209 tool response shape breaks our parser on most multi-run bursts. The original baseline noted this; today’s retest confirms it persists despite the Anthropic credit balance now being adequate. The Perplexity + Gemini cross-LLM panel remains the load-bearing comparison.
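Since the Perplexity + Gemini panel carries the comparison, the cross-LLM reading can be checked mechanically. The sketch below copies the 2026-05-04 modal carriers and counts from the tables in this report. The short query labels and the function name cross_llm_agreement are hypothetical, and "agreement" here means only that both LLMs share the same modal carrier, which is an assumed simplification of how the verdicts were reached.

```python
# Retest (2026-05-04) modal results copied from this report's tables:
# label -> ((Perplexity carrier, count), (Gemini carrier, count)), out of 5 runs.
RETEST = {
    "condo": (("Nationwide", 5), ("Nationwide", 5)),
    "luxury": (("Chubb", 5), ("Chubb", 5)),
    "historic": (("Chubb", 5), ("Chubb", 5)),
    "jewelry": (("Chubb", 5), ("Chubb", 4)),
    "boston-home": (("MAPFRE", 2), ("Amica", 3)),
    "ev": (("Travelers", 5), ("Travelers", 4)),
    "rideshare": (("Progressive", 4), ("Progressive", 3)),
    "sr22": (("GEICO", 5), ("State Farm", 4)),
    "newark-auto": (("NJM", 5), ("NJM", 5)),
}

def cross_llm_agreement(results):
    """Split query labels by whether both LLMs share a modal carrier."""
    agree, split = [], []
    for label, ((perp_carrier, _), (gem_carrier, _)) in results.items():
        if perp_carrier == gem_carrier:
            agree.append(label)
        else:
            split.append(label)
    return agree, split

agree, split = cross_llm_agreement(RETEST)
print("agree:", agree)
print("split:", split)  # prints "split: ['boston-home', 'sr22']"
```

Seven of the nine queries agree on the modal carrier in the retest. The two splits are the Boston home and SR-22 queries, matching the softened and drifted readings above.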

Raw data

  • data/probe-validation-v2-2026-05-04.json — home variance retest, 5 runs × 13 queries × 3 LLMs
  • data/probe-ablation-home-insurance-2026-05-04.json — home ablation retest
  • data/probe-auto-insurance-2026-05-04.json — auto retest

Compared against the 2026-04-24 versions of the same files in the same directory. Verdict counts: 3 strengthened, 3 stable, 1 softened, 2 drifted.
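The verdict counts can be re-derived from the per-query verdicts in the tables. This is a small sketch for checking the tally, not part of the probe tooling, and the query labels are hypothetical shorthand.

```python
from collections import Counter

# Per-query verdicts as listed in the tables of this report.
VERDICTS = {
    "condo": "strengthened",
    "luxury": "stable",
    "historic": "strengthened",
    "jewelry": "stable",
    "boston-home": "drifted",
    "ev": "stable",
    "rideshare": "drifted",
    "sr22": "softened",
    "newark-auto": "strengthened",
}

tally = Counter(VERDICTS.values())
for verdict in ("strengthened", "stable", "softened", "drifted"):
    print(verdict, tally[verdict])
```

This covers the nine queries tabulated here and reproduces the 3 / 3 / 1 / 2 split.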

See also the 5-day stability check on the multi-query observation tool probes (price-anchor, bundles, commercial cyber).