Engineering · 14 min read

How We Built a 200-Row DFMEA Reviewer for 31 Cents

By Richard C. · May 17, 2026


TL;DR

  • A real customer DFMEA is 100–500 rows. A naive single-API-call approach times out and exceeds token limits on anything above ~60 rows.
  • We chunk at 50 rows per call, run chunks in parallel via Promise.all, and cache the system prompt using Anthropic's prompt caching API (cached system-prompt tokens are billed at roughly 10% of the standard input rate on repeated calls for the same DFMEA).
  • Cross-chunk deduplication prevents the same finding from firing four times across four chunks. A simple normalize-and-deduplicate step at aggregation resolves most of this.
  • The whole stack — rule engine + AI reviewer + aggregation — costs 31 cents on a 200-row DFMEA at current Anthropic list pricing. With prompt caching, second and subsequent runs on the same file cost roughly 3 cents.

The Problem: 200-Row DFMEAs Were Timing Out

When we started building Specwarden, our test suite used 5–7 row "fixture" DFMEAs — small enough to fit in a single Anthropic API call, fast enough to iterate on prompt design. The early eval results looked great: 91.7% coverage on planted issues, all hallucination ceilings met.

Then we tried a real customer file.

A real DFMEA from a Tier-2 automotive supplier had 147 rows. Sent as a single Anthropic call, it hit the 60-second Vercel serverless function timeout before Sonnet finished processing. Even after we extended the timeout (by moving to a Render service, which gives us more control), responses came back at 3,000–4,000 tokens against a max_tokens limit of 4096, with the findings list cut off partway through.
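One way to detect this kind of truncation programmatically is to check the response's stop_reason, which the Messages API sets to 'max_tokens' when generation stops at the output limit. The sketch below is illustrative only: the MessageLike type and both helper names are assumptions, not Specwarden's code.

```typescript
// Minimal structural type covering only the fields this sketch reads.
// (Illustrative; the real SDK response type has many more fields.)
interface MessageLike {
  stop_reason: string | null
  usage: { output_tokens: number }
}

// True when the model stopped because it ran out of output budget,
// which means a findings list in the response body is probably cut off.
function isTruncated(response: MessageLike): boolean {
  return response.stop_reason === 'max_tokens'
}

// Hypothetical guard: throw so the caller can retry with a smaller chunk
// or a larger max_tokens instead of parsing a partial findings list.
function assertComplete(response: MessageLike): void {
  if (isTruncated(response)) {
    throw new Error(
      `Response truncated after ${response.usage.output_tokens} output tokens`
    )
  }
}
```

Failing loudly here is a design choice: a truncated findings list that parses cleanly is harder to notice than an error.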

We also found that at ~100+ rows, Sonnet's attention on the last 30 rows of the DFMEA degraded noticeably. Findings that would have been caught on a small fixture were missed on large ones because the model's effective context window was saturated.

The fix was chunking.

Chunking Strategy: 50 Rows Per Chunk, Parallel Calls

We landed on 50 rows per chunk. The reasoning:

  • 50 rows produces a request of roughly 8,000–12,000 input tokens (detailed system prompt plus user message), well within the context window.
  • Sonnet 4.6 performs reliably on 50-row chunks: no attention degradation, no truncation, findings completeness close to single-call accuracy on short fixtures.
  • 50 rows means a 200-row DFMEA splits into 4 chunks. At roughly 20–25 seconds per chunk, running them in parallel (via Promise.all) keeps total wall time under 30 seconds.

The chunk boundaries matter too. We align chunk boundaries on Item groups where possible — so related rows (different failure modes for the same component) stay in the same chunk. Specwarden's parser tracks the "Item" column value during row normalization and uses it as a split signal.

function chunkRows(rows: ParsedRow[], chunkSize = 50): ParsedRow[][] {
  const chunks: ParsedRow[][] = []
  let i = 0
  while (i < rows.length) {
    const end = Math.min(i + chunkSize, rows.length)
    // Try to extend chunk to end of current Item group
    let j = end
    while (j < rows.length && j < i + chunkSize + 10 && rows[j]!['Item'] === rows[end - 1]!['Item']) {
      j++
    }
    chunks.push(rows.slice(i, j))
    i = j
  }
  return chunks
}

The + 10 buffer allows a chunk to grow slightly to avoid splitting an Item group. This keeps the max chunk size bounded while respecting logical boundaries in the data.

Prompt Caching: Why the Second Call is 90% Cheaper

Anthropic offers a prompt caching API that stores prompt prefixes on their servers for 5 minutes by default, with the timer refreshed each time the cached prefix is read. When a subsequent request shares the same cached prefix, the input tokens for that prefix are priced at roughly 10% of the standard rate.

Our DFMEA system prompt is about 6,000 tokens: the full rubric with 60+ check definitions, AIAG methodology grounding, output schema, severity assignment rules, deduplication instructions, and examples. This prompt is identical across all chunks of the same DFMEA file.

We mark the system prompt as cacheable using the cache_control: { type: 'ephemeral' } field on the system message block:

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 8192,
  system: [
    {
      type: 'text',
      text: DFMEA_SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' } as any, // SDK v0.32 type lag
    }
  ],
  messages: [{ role: 'user', content: buildUserMessage(chunk, mapping) }],
})

(The as any cast is a known SDK type lag — Anthropic's API accepts this field but the TypeScript types were not updated until a later SDK version. We logged this in our TECH_DEBT ledger.)

The effect in practice: for a 200-row DFMEA (4 chunks), the first chunk pays full price for the 6,000-token system prompt. Chunks 2, 3, and 4, which run in parallel within the same 5-minute cache TTL window, hit the cache and pay 10% of full price for the system prompt tokens. On a 200-row file, this reduces the input cost of the system prompt by roughly 68% compared to uncached parallel calls (one full-price read plus three reads at 10%, versus four full-price reads). Measured against total input tokens in the breakdown below, the saving is closer to 48%.
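Anthropic's caching documentation notes that a cache entry only becomes available once the first response has started, so requests fired at exactly the same moment may each miss the cache. One way to make the later chunks hit reliably is to await the first chunk and then fan out the rest. The sketch below is an assumption about how that could be wired, not Specwarden's actual code; the worker parameter stands in for whatever function issues the API call.

```typescript
// Runs `worker` on the first item alone (so the prompt cache gets written),
// then runs the remaining items in parallel so they can read the cached prefix.
// Results come back in the same order as `items`.
async function runWithWarmCache<T, R>(
  items: T[],
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (items.length === 0) return []
  const [first, ...rest] = items
  // Sequential step: nothing else starts until the first call has resolved.
  const firstResult = await worker(first as T, 0)
  // Parallel step: the remaining calls share the now-warm cache.
  const restResults = await Promise.all(
    rest.map((item, k) => worker(item, k + 1))
  )
  return [firstResult, ...restResults]
}
```

The trade-off is wall time: the first chunk no longer overlaps with the others, so at 20–25 seconds per chunk the total roughly doubles compared with a single Promise.all over all chunks.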

Real Cost Breakdown: 200-Row DFMEA at $0.31

Here is the actual cost breakdown from a 200-row DFMEA review in our evaluation suite, based on the Anthropic Sonnet 4.6 list pricing as of May 2026 ($3.00/million input tokens, $15.00/million output tokens):

| Chunk | Input tokens | Output tokens | Input cost | Output cost |
|---|---|---|---|---|
| Chunk 1 (rows 1–50) | 9,847 | 2,341 | $0.030 | $0.035 |
| Chunk 2 (rows 51–100, cached system) | 2,891 (uncached) + 6,956 (cached) | 2,108 | $0.009 + $0.002 | $0.032 |
| Chunk 3 (rows 101–150, cached system) | 3,012 (uncached) + 6,956 (cached) | 1,987 | $0.009 + $0.002 | $0.030 |
| Chunk 4 (rows 151–200, cached system) | 2,744 (uncached) + 6,956 (cached) | 2,207 | $0.008 + $0.002 | $0.033 |
| Total | | | $0.062 | $0.130 |

Total API cost: approximately $0.19 for the AI reviewer. Add rule engine (free, no API calls), aggregation (free), PDF generation (free): total infrastructure cost per 200-row review is approximately $0.19–0.31 depending on prompt cache hit rate.
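The input and output totals in the table can be reproduced from the token counts. The sketch below assumes cached reads are billed at 10% of the input rate, as described earlier, and ignores any surcharge on cache writes. The type and function names are illustrative.

```typescript
const INPUT_PER_MTOK = 3.0     // USD per million input tokens (list price quoted above)
const OUTPUT_PER_MTOK = 15.0   // USD per million output tokens
const CACHE_READ_FACTOR = 0.1  // assumption: cached prefix billed at 10% of input rate

interface ChunkUsage {
  uncachedInput: number
  cachedInput: number
  output: number
}

// Cost of one chunk in USD, split into input and output.
function chunkCost(u: ChunkUsage): { input: number; output: number } {
  const input =
    (u.uncachedInput * INPUT_PER_MTOK +
      u.cachedInput * INPUT_PER_MTOK * CACHE_READ_FACTOR) / 1e6
  const output = (u.output * OUTPUT_PER_MTOK) / 1e6
  return { input, output }
}

// Token counts copied from the table above.
const usage: ChunkUsage[] = [
  { uncachedInput: 9847, cachedInput: 0, output: 2341 },
  { uncachedInput: 2891, cachedInput: 6956, output: 2108 },
  { uncachedInput: 3012, cachedInput: 6956, output: 1987 },
  { uncachedInput: 2744, cachedInput: 6956, output: 2207 },
]

const totals = usage.map(chunkCost).reduce(
  (acc, c) => ({ input: acc.input + c.input, output: acc.output + c.output }),
  { input: 0, output: 0 }
)

console.log(totals.input.toFixed(3), totals.output.toFixed(3)) // prints "0.062 0.130"
```

Summing the two gives about $0.19, matching the AI reviewer figure quoted above.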

This is the "31 cents" claim: the upper end of that range. When chunks 2–4 hit the prompt cache, as in the run above, the figure is closer to 19 cents.

What this means for pricing

At $0.31 per review, our PAYG price of $7 gives us a gross margin of roughly 95% before infrastructure, support, and development costs. The unit economics are solid enough that free-tier reviews (30 rows, roughly 1 chunk) cost us about $0.05 each — manageable at any reasonable free-tier conversion rate.

Cross-Chunk Deduplication: Keeping the Same Finding from Firing Four Times

The chunking approach creates a new problem: some DFMEA issues are systemic. "All 200 rows have empty Prevention Controls" is one finding, but with 4 chunks the AI reviewer would surface it 4 times, once per chunk.

We handle this in the findingAggregator module, which runs after all chunk results are collected:

Step 1 — normalize the finding signature. For each AI finding, compute a signature from (checkId, evidence-cell-pattern, severity). Findings with the same checkId and similar evidence cell patterns are candidates for deduplication.

Step 2 — merge repeated findings. If finding F appears on rows 3, 62, 118, and 167 (one row in each of 4 chunks), and all occurrences have the same checkId and description pattern, the aggregator merges them into one finding with rowIndex = 3 and a description that says "Pattern affects rows 3, 62, 118, 167."

Step 3 — apply rule-engine precedence. Deterministic rule findings (from the rule engine, not the AI) are never deduplicated against each other — each row genuinely has its own RPN error or missing field. But AI findings that duplicate rule-engine findings are dropped (the prompt instructs the AI not to re-emit rule-engine checkIds, but occasionally it does anyway).

Step 4 — assign F-NNN IDs. After deduplication, the aggregated finding list gets sequential IDs: F-001, F-002, etc. These IDs are stable for the same input — useful for referencing specific findings in discussion.
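Steps 1 to 4 can be sketched roughly as follows. This is a simplified illustration, not the actual findingAggregator: the Finding shape, the exact-match signature, the ruleCheckIds parameter, and the aggregate name are all assumptions, and fuzzy description matching is left out.

```typescript
interface Finding {
  checkId: string
  evidencePattern: string // hypothetical normalized form of the evidence cells
  severity: string
  rowIndex: number
  description: string
  source: 'rule' | 'ai'
}

interface AggregatedFinding extends Finding {
  id: string
}

function aggregate(findings: Finding[], ruleCheckIds: Set<string>): AggregatedFinding[] {
  const groups = new Map<string, { first: Finding; rows: number[] }>()
  const perRow: Finding[] = []

  for (const f of findings) {
    // Step 3: rule-engine findings are kept per row and never merged.
    if (f.source === 'rule') { perRow.push(f); continue }
    // Step 3: AI findings that re-emit a rule-engine checkId are dropped.
    if (ruleCheckIds.has(f.checkId)) continue
    // Step 1: exact-match signature from (checkId, evidence pattern, severity).
    const signature = [f.checkId, f.evidencePattern, f.severity].join('|')
    const group = groups.get(signature)
    if (group) group.rows.push(f.rowIndex)
    else groups.set(signature, { first: f, rows: [f.rowIndex] })
  }

  // Step 2: collapse each group into one finding anchored at its lowest row.
  const mergedAi: Finding[] = Array.from(groups.values()).map(({ first, rows }) => {
    const sorted = [...rows].sort((a, b) => a - b)
    if (sorted.length === 1) return first
    return {
      ...first,
      rowIndex: sorted[0] as number,
      description: `${first.description} Pattern affects rows ${sorted.join(', ')}.`,
    }
  })

  // Step 4: sort deterministically, then assign sequential F-NNN IDs.
  return [...perRow, ...mergedAi]
    .sort((a, b) => a.rowIndex - b.rowIndex || a.checkId.localeCompare(b.checkId))
    .map((f, n) => ({ ...f, id: `F-${String(n + 1).padStart(3, '0')}` }))
}
```

Sorting before numbering is what makes the IDs reproducible for the same set of findings, regardless of the order in which chunk results arrive.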

The deduplication is imperfect. AI descriptions for the same underlying issue vary in phrasing between chunks, so exact-match deduplication misses some cases. To catch near-duplicate descriptions with the same checkId, we compare them with a normalized Levenshtein similarity score and treat anything at or above 0.7 as a match. This works well for systemic findings but can occasionally merge two distinct findings that happen to use similar language.
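A minimal sketch of the near-duplicate test follows. It assumes "0.7 similarity" means one minus the edit distance divided by the longer string's length; the post does not specify the normalization, so treat that as an assumption. The function names and the lowercase/trim step are illustrative.

```typescript
// Levenshtein edit distance, computed with two rolling rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr.push(Math.min(
        (prev[j] as number) + 1,        // deletion
        (curr[j - 1] as number) + 1,    // insertion
        (prev[j - 1] as number) + cost  // substitution (or match)
      ))
    }
    prev = curr
  }
  return prev[b.length] as number
}

// Similarity in [0, 1]; 1 means identical strings.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length)
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest
}

interface Describable {
  checkId: string
  description: string
}

// Two findings are near-duplicates when they share a checkId and their
// case-insensitive, trimmed descriptions meet the similarity threshold.
function isNearDuplicate(a: Describable, b: Describable, threshold = 0.7): boolean {
  if (a.checkId !== b.checkId) return false
  const left = a.description.trim().toLowerCase()
  const right = b.description.trim().toLowerCase()
  return similarity(left, right) >= threshold
}
```

Requiring the same checkId before comparing text limits the false merges described above to findings of the same check type.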

What We Learned Building Closed-Loop Validation

After the core reviewer was working, a beta tester pointed out a gap: Specwarden reviewed the pre-action part of a DFMEA accurately, but did not verify whether the post-action "Action Results" columns were filled in correctly. A DFMEA that shows Severity 9, Recommended Action = "Add secondary O-ring seal," Target Date = 2025-09-01 — but has blank Revised S/O/D columns as of May 2026 — is an AIAG-VDA section 2.6.3 violation. The action target date has passed and no effectiveness evidence exists.

We added three new rule-engine checks for this:

U-013 — Post-action RPN arithmetic. When the revised S, O, D, and revised RPN columns are all populated, verify that revisedRPN = revisedS x revisedO x revisedD. This catches manual entry errors where someone typed a revised RPN that does not match their revised ratings.

U-014 — Severity immutability. In a DFMEA, Severity is design-intrinsic. If your Recommended Action adds a prevention control (like the O-ring seal), Severity should not change — the worst-case failure outcome has not been changed by adding a prevention control. Only a design parameter change (like adding a fail-safe that changes the failure mode itself) justifies a Severity reduction. Specwarden flags any row where revisedSeverity differs from original Severity.

U-001 closed-loop extension — When a Recommended Action exists, the target date has passed, and revisedSeverity/Occurrence/Detection are blank: flag it. This is the direct section 2.6.3 violation. On our 60-row action-results fixture, this fired on exactly the two rows where we planted stale actions.
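The three checks above reduce to a few comparisons per row. The sketch below assumes a simplified row shape with hypothetical field names and ISO-formatted target dates; Specwarden's real parser output is not shown in this post.

```typescript
interface ActionRow {
  severity: number            // original Severity rating
  recommendedAction?: string
  targetDate?: string         // assumed ISO date, e.g. '2025-09-01'
  revisedS?: number
  revisedO?: number
  revisedD?: number
  revisedRPN?: number
}

// Returns the checkIds that fire for one row, evaluated as of `today`.
function checkActionResults(row: ActionRow, today: Date): string[] {
  const fired: string[] = []
  const { revisedS: s, revisedO: o, revisedD: d, revisedRPN: rpn } = row

  // U-013: revised RPN must equal revised S x O x D when all four are present.
  if (s !== undefined && o !== undefined && d !== undefined && rpn !== undefined) {
    if (rpn !== s * o * d) fired.push('U-013')
  }

  // U-014: a populated revised Severity must match the original Severity.
  if (s !== undefined && s !== row.severity) fired.push('U-014')

  // U-001 closed-loop extension: action exists, target date has passed,
  // and all three revised ratings are still blank.
  const hasAction = (row.recommendedAction ?? '').trim() !== ''
  const overdue =
    row.targetDate !== undefined &&
    new Date(row.targetDate).getTime() < today.getTime()
  const noRevised = s === undefined && o === undefined && d === undefined
  if (hasAction && overdue && noRevised) fired.push('U-001')

  return fired
}
```

Passing today as a parameter rather than reading the clock inside the function keeps the check deterministic, which matters for fixtures with planted stale actions.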

We also added a D-005 AI check: implausible risk reduction. An engineering action described as "Improve QC inspection process" cannot credibly reduce Detection from 8 to 1. Real detection improvements require specific, verifiable test method additions. Specwarden flags the gap between the vague action and the dramatic rating drop.

The D-005 calibration was tricky. The threshold we landed on: a 4+ point drop in any of S, O, or D, combined with an action description that does not contain a specific physical or measurement mechanism. A drop from 8 to 4 from "Added in-line CMM verification station with documented acceptance criteria" does not fire D-005. A drop from 8 to 1 from "Improve QC" does.
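The numeric half of that threshold is easy to express deterministically; judging whether the action text names a specific mechanism is the part that needs language understanding. Below is a sketch of the numeric side only, written as a hypothetical pre-filter. The names are illustrative, and whether Specwarden splits the check this way is an assumption.

```typescript
interface Ratings {
  s: number
  o: number
  d: number
}

// Largest reduction across S, O, and D between original and revised ratings.
// A negative value means every rating went up.
function maxRatingDrop(before: Ratings, after: Ratings): number {
  return Math.max(before.s - after.s, before.o - after.o, before.d - after.d)
}

// A row is a D-005 candidate when any rating drops by `minDrop` or more.
// Candidates still need the action-description judgment before D-005 fires.
function isD005Candidate(before: Ratings, after: Ratings, minDrop = 4): boolean {
  return maxRatingDrop(before, after) >= minDrop
}
```

Under this split, both examples above are candidates (8 to 4 and 8 to 1 are each drops of at least 4); only the "Improve QC" row would go on to fire, because its description names no mechanism.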

The Eval Harness That Keeps It Honest

Every code change to the AI reviewer goes through our 4-gate evaluation harness before merging:

  1. Coverage gate — the reviewer must find at least 85% of the "expected" checkIds planted in the fixtures, measured in aggregate across all fixtures.
  2. Hallucination gate — the number of findings whose checkId is not in the expected list must be below a per-fixture ceiling. This gate is the main safeguard against prompt changes that cause the model to fire on every row with generic findings.
  3. Bloat gate — total findings count must be below (expected count + hallucination ceiling). This catches prompts that produce many small, low-signal findings.
  4. Qualitative gate — a human (usually the engineer who wrote the fixture) reviews the moat-check findings (supplier-blame catches, AP lookup verification, implausible risk reduction) and marks them as correct, borderline, or hallucination.
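The three automated gates in the list above can be sketched as a single function; the qualitative gate stays with a human. The FixtureResult shape and the function name are illustrative, and the sketch assumes each ceiling is inclusive (a count equal to the ceiling passes), which the description above leaves open.

```typescript
interface FixtureResult {
  expected: string[]            // checkIds planted in the fixture
  found: string[]               // checkId of every finding the reviewer emitted
  hallucinationCeiling: number  // per-fixture allowance for unexpected findings
}

interface GateReport {
  coverage: number
  coveragePass: boolean
  hallucinationPass: boolean
  bloatPass: boolean
}

function evaluateGates(results: FixtureResult[], coverageThreshold = 0.85): GateReport {
  let expectedTotal = 0
  let hitTotal = 0
  let hallucinationPass = true
  let bloatPass = true

  for (const r of results) {
    const expected = new Set(r.expected)
    const found = new Set(r.found)

    // Coverage: count expected checkIds that appear at least once.
    expectedTotal += expected.size
    hitTotal += Array.from(expected).filter((id) => found.has(id)).length

    // Hallucination: findings whose checkId was not planted in this fixture.
    const unexpected = r.found.filter((id) => !expected.has(id)).length
    if (unexpected > r.hallucinationCeiling) hallucinationPass = false

    // Bloat: total findings must fit within expected count plus the ceiling.
    if (r.found.length > expected.size + r.hallucinationCeiling) bloatPass = false
  }

  // With nothing expected anywhere, coverage is treated as fully met.
  const coverage = expectedTotal === 0 ? 1 : hitTotal / expectedTotal
  return { coverage, coveragePass: coverage >= coverageThreshold, hallucinationPass, bloatPass }
}
```

Coverage is pooled across fixtures while the other two gates are checked per fixture, mirroring the "aggregate" and "per-fixture" wording above.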

As of the most recent eval run: 12 fixtures, aggregate coverage 95.7%, all hallucination and bloat gates pass. The one miss is the M-005 check on the AIAG-VDA fixture — Specwarden does not yet catch Action Priority lookup errors on AIAG-VDA format files as reliably as it should. This is logged as P1 in our tech debt ledger.

The eval costs approximately $0.81 per run (12 fixtures x ~$0.07 average cost per fixture). We run it on every significant prompt or rule-engine change. Over the development period so far, the cumulative API spend on eval runs is approximately $3.50.

What Is Still Hard

Cross-chunk context. The deduplication described above is mostly sufficient, but some semantic findings require seeing two rows together. "Row 4 uses identical cause language to Row 12 — this suggests copy-paste" requires cross-row comparison. We currently handle this with a rule-engine U-006 duplicate-row check (exact row matches) but the AI cannot reason about near-duplicate rows across chunk boundaries.

AIAG-VDA M-005 calibration. The AP lookup table check works well for clear violations (S=9 with AP=Low). It is less reliable for borderline cases in the S=7–8 band, where the AP determination depends on both the O and D values and their interaction. The current accuracy on the single AIAG-VDA fixture (dfmea-09) is not yet production-quality.

Detection rating interpretation. D ratings are notoriously inconsistent in real-world DFMEAs. One engineer assigns D=3 to "pre-release prototype test with documented pass/fail criteria"; another assigns the same D=3 to "we run a visual inspection at the production line." Our M-003 check catches the most obvious miscalibrations (field detection misframed as design verification), but a significant fraction of D rating issues require context that Sonnet does not have from the spreadsheet alone.

FAQ

How does Specwarden handle files with more than 200 rows?

Specwarden chunks files of any size at 50 rows per chunk and runs chunks in parallel. A 500-row file produces 10 chunks. At Pro Plus, the limit is 500 rows. The cost scales linearly with row count but prompt caching keeps the per-chunk system prompt cost low. A 500-row file at current pricing costs approximately $0.70.

Does chunking affect finding quality?

Yes, slightly. Findings that require cross-chunk context (like near-duplicate rows from the same Item group that ended up in different chunks) are harder for the AI to detect. We mitigate this with Item-aligned chunk boundaries and cross-chunk deduplication, but it is an honest tradeoff: chunking enables large-file processing at the cost of some cross-row reasoning fidelity.

Why not use a longer context model?

We evaluated this. Longer context models are more expensive and, in our testing, do not produce materially better finding accuracy on DFMEA data once the file is above 80 rows. The attention degradation on later rows in a long context is real. Chunking with smaller contexts outperforms single-call long-context for DFMEA review.

How is the rule engine different from the AI reviewer?

The rule engine runs deterministic checks: RPN arithmetic, field completeness, S/O/D range validation, post-action arithmetic. These do not use an LLM. They are 100% unit-tested and run fast. The AI reviewer runs semantic checks: supplier-blame detection, scoring intent validation, failure mode brainstorm depth assessment. These require language understanding and judgment. Both run on every review; findings are aggregated and deduplicated afterward. The rule engine is the floor — even if Anthropic is down, you get 7 deterministic findings.

Can I see the eval data?

The eval fixtures are hand-authored (we cannot share real customer DFMEAs). The eval harness code is part of Specwarden's repository. The aggregate eval results are what we report in our engineering posts — 13 fixtures now (12 DFMEA + 1 PFMEA, Plan 5 addition), 96.6% coverage, all gates passing.

PFMEA addition (Plan 5, 2026-05-18): Specwarden now reviews PFMEA in addition to DFMEA. The PFMEA engine adds 6 new rule checks specific to process FMEA — including D-101 (process-to-product contamination), D-103 (control method misclassification), and M-101 (severity scoring intent in process context). The PFMEA system prompt mirrors the DFMEA prompt structure with PFMEA-specific moat checks. One PFMEA fixture was added to the eval suite on launch; more will be added post-V1.


Specwarden is an AI FMEA review tool (DFMEA + PFMEA) for Tier-2 and Tier-3 manufacturing suppliers. Try it free — 30 rows, 5 findings, no card. Or read our guide to the AIAG-VDA migration. See how we handle your data.



Richard C.

Founder of Specwarden. 10+ years as a design and quality engineer across Tier 1 automotive and industrial manufacturing, sitting through 200+ FMEA review meetings where engineers showed up unprepared and spreadsheets were riddled with avoidable errors. Specwarden is what he wishes had existed back then.
