The Silent Failure Gap
AI systems fail in a way that is invisible to every monitoring tool you already own.
- Servers are green. APIs respond HTTP 200 OK. No errors logged.
- But the content is wrong. Potentially catastrophically wrong.
- Failures are indistinguishable from correct answers in format and confidence.
- In domains where mistakes have outsized consequences.
- In regulated contexts where the organisation is liable for the AI's output.
At a scale and speed that makes human review of every output impractical.
85mm vs 185mm
An engineer designing a suspension bridge asks an AI copilot for minimum cable diameter — 400m span, 50 kN/m live load.
- AI responds: "Minimum cable diameter: 85mm using Grade 1770 steel"
- Correct answer: 185mm. The AI dropped a digit. Confidently.
- With full explanation. Perfect professional language. Server: 200 OK.
- No alert fired. Wrong value enters design document.
$2.3M — Three Silent Failures
A claims adjuster uses AI to assess a commercial property claim. AI recommends: "$2.3M covering structural damage, inventory loss, and business interruption for 6 weeks."
Failure A — Policy Violation
Policy caps business interruption at 4 weeks, not 6. AI ignored a specific policy clause.
Failure B — Regulatory Breach
Settlement doesn't account for state-specific depreciation schedule required by the insurance commissioner.
Failure C — Outdated Law
AI cited a coverage interpretation overturned in case law two years ago.
Who Verified the AI's Work?
The problem isn't "AI gives wrong answers." Everyone knows that. The problem is what happens organisationally when AI gives wrong answers at scale.
- A civil engineering firm using AI has a governance problem, not a technology problem.
- When a junior engineer uses AI output in a design document, and a licensed PE stamps it — who verified the AI?
- What process caught the errors? What evidence exists that reasonable care was taken?
- Right now, in most organisations: nothing. No systematic process, no audit trail, no institutional standard.
The flight recorder log — every intervention, every flag — is what you show a regulator or a court.
Who Needs Lumen
Who doesn't: a startup using ChatGPT to write marketing copy. If the AI writes a bad tagline, someone notices and rewrites it. Cost of failure: low.
Any organisation where AI output enters a professional workflow with regulatory, legal, financial, or safety consequences — and where the volume makes human review of every output impractical.
- A regulator can ask "what controls did you have on your AI?" and you need an answer.
- A court can ask "what evidence of reasonable care?" and you need documentation.
- A professional licence is on the line every time someone signs off on AI-assisted work.
The Trust Gap After the SaaSpocalypse
February 3, 2026: Anthropic's Claude Cowork plugins triggered a $285 billion sell-off in SaaS stocks. The market correctly priced in AI replacing significant white-collar work.
Lumen — The AI Governance Layer
Every response verified. Every decision logged.
Works With Any LLM
Lumen is middleware — it sits between your team and whatever model you already use. OpenAI, Anthropic, open-source, or your organisation's own private internal model. Not a replacement for your LLM.
Check Against Your Standards
Runbooks encode your professional standards as machine-enforceable constraints. Written by your domain experts. Owned by you.
Flag, Challenge, or Block
Three-level remediation. Your team sees flags and decides on challenges; hard violations are blocked before they land.
"PagerDuty for AI correctness" — a middleware layer that catches AI-specific failures before the user perceives them, regardless of which model produced them.
Two Tiers. One Platform.
Mid-Market — Lumen Chat
SaaS subscription. Business owner signs up, selects industry runbooks, team uses it instead of raw ChatGPT. Self-serve.
- Pre-built, industry-specific runbooks out of the box
- No developer needed — Lumen IS the frontend
- Monthly fee tiered by AI interactions monitored
- Includes flight recorder audit trail + alerting
Enterprise — Lumen SDK
Headless evaluation engine. Developer integrates into existing AI applications. High-touch, consultative sale.
- Custom runbook generation from your standards documents
- Hybrid / on-premise deployment option
- Customer-managed encryption keys
- Standards body partnerships for authoritative runbooks
The Runbook Library — Compounding Advantage
The SDK is open source and therefore replicable. The evaluation engine is solid engineering, but a determined competitor could build it. The moat lives in the accumulated knowledge.
Runbook Library
Accumulated, incident-data-refined, professionally grounded constraints. Continuously updated. Takes years of enterprise engagements to build — cannot be shortcut with funding.
Standards Body Endorsements
BSI endorses the electrical runbook as the approved machine-enforceable version of BS 7671. IStructE endorses structural engineering. A trust signal no competitor can replicate.
Network Effects
Like antivirus signatures — customers subscribe to maintained "constraint definitions." Professionals share runbooks on the platform. Community becomes the moat.
Where Incumbents Cannot Follow
| Capability | OpenAI Guardrails | Datadog LLM | AWS Bedrock | Lumen by Qanata |
|---|---|---|---|---|
| Model-agnostic | ✗ | ✓ | Partial | ✓ |
| Real-time intervention | Partial | ✗ | ✓ | ✓ |
| Claim-level evaluation | ✗ | ✗ | ✗ | ✓ |
| Reasoning window UX | ✗ | ✗ | ✗ | ✓ |
| Domain expert runbooks | ✗ | ✗ | ✗ | ✓ |
| Professional sign-off workflow | ✗ | ✗ | ✗ | ✓ |
| Compliance-grade flight recorder | ✗ | Partial | Partial | ✓ |
| Horizontal runbook library | ✗ | ✗ | ✗ | ✓ |
OpenAI cannot be model-agnostic — contradicts platform strategy. Datadog sits alongside the stack, not inside it. AWS won't curate professional governance runbooks.
Three Products. One Engine.
① Lumen Chat
Primary product. A governed AI chat interface that connects to any LLM — OpenAI, Anthropic, open-source, or your organisation's private internal model. Lumen sits between your team and whatever model you already use. No integration required. No switching providers.
② Lumen Workspace
Upload your standards documents. Lumen's LLM extracts constraints and produces a draft runbook. Your domain expert reviews and approves. From documents to deployed in a week.
③ Lumen SDK
Headless evaluation engine extracted from Chat. Open source. Full event model, callbacks, headless mode for API-driven workflows. For companies building their own AI applications.
The Reasoning Window
All tokens stream into a provisional "reasoning window" first. Confirmed text promotes to the final response. If evaluation catches a problem — the text was never presented as final. No jarring retraction.
- Minimal mode — clean flow, subtle indicator, daily use
- Full Details mode — shows runbook, constraint, confidence score. Powerful for demos and compliance review.
- Solves the "silent guardian" problem: visible on every interaction, building continuous perceived value
- Consistent 200–300ms visible rhythm — user never notices difference between local and Lumen Cloud evaluation
- Model-agnostic — works identically whether the upstream LLM is GPT-4, Claude, Gemini, Llama, or an internal private model
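The provisional-buffer mechanic above can be sketched in a few lines. This is a minimal illustration, assuming a token-callback stream; the class and method names (`ReasoningWindow`, `promote`, `retract`) are illustrative, not Lumen's actual API.

```typescript
// Minimal sketch of the reasoning window: tokens land in a provisional
// buffer, and only evaluation-approved text is promoted to the final
// response. Retracted text was never presented as final.
class ReasoningWindow {
  private provisional = "";
  private confirmed = "";

  append(token: string): void {
    this.provisional += token;          // all tokens stream in here first
  }

  promote(): void {
    this.confirmed += this.provisional; // evaluation passed: text becomes final
    this.provisional = "";
  }

  retract(): void {
    this.provisional = "";              // evaluation failed: no jarring retraction,
  }                                     // because nothing was ever shown as final

  get finalText(): string {
    return this.confirmed;
  }
}
```

Because retraction only ever touches the provisional buffer, the confirmed response the user sees never changes after the fact.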
Three Levels. Graduated Response.
A single kill switch treats every problem identically. The right model is a professional intervention — not a system crash.
Level 1 — Flag (precision > 80%)
Low confidence divergence. Claim renders with an unobtrusive indicator. Engineer can hover for detail. Workflow intact. Engineer makes the call. Like a colleague saying: "you might want to double-check that."
Level 2 — Challenge (precision > 95%)
Clear constraint violation. AI response pauses. Modal shows: violation description, runbook reference, authority. Options: Show Detail / Accept Anyway / Recalculate. Professional is informed and in control. Flight recorder captures user decision.
Level 3 — Block (precision > 99%)
Hard constraint violation. Mandatory stop. Output never reaches user as actionable. Interaction logged. "Escalate to senior" option available. Fires only on deterministic violations — the system is essentially certain. This is the flight recorder entry.
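The graduated response above amounts to a small dispatch function. The sketch below is illustrative only: the field names and the mapping of precision thresholds to actions are assumptions drawn from this section, not Lumen's implementation.

```typescript
// Hypothetical three-level remediation dispatch. "deterministic" stands in
// for a hard runbook violation (e.g. an exact numeric bound); "confidence"
// is the evaluator's confidence that a violation occurred.
type Remediation = "pass" | "flag" | "challenge" | "block";

interface Evaluation {
  violated: boolean;       // did any constraint fire at all?
  deterministic: boolean;  // hard, rule-checkable violation
  confidence: number;      // 0..1
}

function remediate(e: Evaluation): Remediation {
  if (!e.violated) return "pass";
  // Level 3 fires only on deterministic violations at near-certainty.
  if (e.deterministic && e.confidence > 0.99) return "block";
  // Level 2: clear violation, but the professional stays in control.
  if (e.confidence > 0.95) return "challenge";
  // Level 1: low-confidence divergence, unobtrusive indicator.
  if (e.confidence > 0.80) return "flag";
  return "pass";
}
```

The key design choice is that only the deterministic path can reach "block": probabilistic signals can interrupt, but never silently suppress, a professional's output.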
The Flight Recorder
Every interaction logged with full audit trail. This is not a side feature of Lumen; for regulated environments, it is the core product.
- The user's query, every token returned, every claim boundary detected
- Every evaluation that fired — which runbook, which constraint, what score
- Every remediation action (flag, challenge, block)
- Every user decision — who, when, what they saw
- The final output that was promoted to "confirmed"
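One flight-recorder entry could be shaped roughly as follows. The field names are assumptions derived from the list above, not Lumen's actual schema.

```typescript
// Illustrative shape for a single flight-recorder entry.
interface FlightRecord {
  query: string;                                   // the user's query
  tokens: string[];                                // every token returned
  claims: { start: number; end: number }[];        // claim boundaries detected
  evaluations: {
    runbook: string;                               // which runbook fired
    constraint: string;                            // which constraint
    score: number;                                 // evaluator score
  }[];
  remediation: "flag" | "challenge" | "block" | null;
  userDecision?: { user: string; at: string; saw: string }; // who, when, what they saw
  confirmedOutput: string;                         // what was promoted to "confirmed"
}

// A hypothetical entry for the bridge-cable example.
const example: FlightRecord = {
  query: "minimum cable diameter?",
  tokens: ["185", "mm"],
  claims: [{ start: 0, end: 2 }],
  evaluations: [
    { runbook: "structural-eng", constraint: "min-diameter", score: 0.99 },
  ],
  remediation: null,
  confirmedOutput: "185mm",
};
```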
Full Logging (Default)
Everything logged. Full audit trail. Maximum compliance value. Customer owns retention period and access controls.
Minimal Logging
Aggregate metrics only. No individual claim text retained. Intervention counts, runbook hit rates, system health. For risk-averse legal departments.
Lumen Workspace
From blank-page authoring problem to editorial review problem. "Upload your standards, get a machine-enforceable constraint set in 30 minutes."
Enterprise Track
Provide standards/regulations → Lumen extracts constraints → draft runbook with confidence scores → domain expert reviews per-constraint → approves → deployed. Familiar editorial workflow.
Mid-Market Track
Subscribe to Lumen's pre-built runbook library → senior person reviews → signs off → live. The sign-off means: "we've reviewed this and accept it as our control set."
Why the Original Spec Was Broken
The original architecture proposed evaluating streaming chunks — small batches of tokens sent to a cloud evaluator in parallel.
- Cloud round-trip: 100–400ms
- Users notice rhythm breaks at: 50–100ms
- Irresolvable conflict: you can't evaluate fast enough without adding noticeable latency
- Deeper problem: hallucinations don't live at the chunk level
- A chunk of 5–10 tokens is semantically meaningless for evaluation purposes
- The spec was evaluating the wrong unit entirely
Claims, Not Chunks
A hallucination is a complete factual claim — a whole, finished assertion about the world.
The Timing Window — Paradox Dissolved
A claim doesn't arrive instantly. The LLM streams it word by word, and a single sentence takes 1–3 seconds to fully stream. That window is long enough to hide even a 200–400ms cloud evaluation inside the natural streaming rhythm.
Claim Boundary Detector
Job: identify the moment a complete, verifiable assertion has finished streaming. Nothing more. Like a line judge — not evaluating the shot, only calling in or out.
- Signals: punctuation, syntactic structure (SVO), semantic completeness
- Rule-based handles ~80–90% of cases
- Tiny classifier handles ambiguous remainder
- Runs entirely locally — no network, no cloud
- Average: ~1ms per token
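The rule-based layer can be caricatured in a few lines. This is a deliberately toy sketch: it checks only sentence-final punctuation and a minimum length, whereas the detector described above also uses syntactic structure and a small classifier.

```typescript
// Toy rule-based claim-boundary check: sentence-final punctuation plus a
// crude minimum-length filter. Illustrates the "line judge" idea only --
// it calls the claim in or out, it does not evaluate the claim itself.
const SENTENCE_END = /[.!?]["')\]]?\s*$/;

function looksComplete(buffer: string): boolean {
  if (!SENTENCE_END.test(buffer)) return false; // punctuation signal
  const words = buffer.trim().split(/\s+/);
  return words.length >= 3;                     // too short to be a verifiable claim
}
```

A check this cheap is why the detector can run locally at roughly a millisecond per token, handing only the ambiguous remainder to a classifier.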
Two Types of Hallucination
Type 1 — RAG Contradiction
Model contradicts its own source material. The retrieved documents said one thing; the model says another.
Detectable locally — no internet, no large model required. Fast, cheap, privacy-preserving. ~30ms.
Dominant case in production enterprise AI — healthcare, fintech, legal, engineering almost always use RAG.
Type 2 — Generative Invention
No retrieved documents. Model confidently states something false from its own parameters.
Requires cloud evaluation — needs external knowledge or a powerful evaluator. 200–400ms.
Honestly positioned as best-effort detection with transparent confidence scoring. The edge case, not the dominant case.
Three-Layer Local Evaluation
Simple cosine similarity fails on negations and numerical precision — exactly the errors that matter in regulated domains. Three layers, each solving one weakness.
Layer 1 — Entity Extraction ~10ms
Extract numbers, units, entities from claim. Compare deterministically against source documents.
Fires Level 3 Block — >99% precision.
Layer 2 — NLI Entailment ~20ms
Does the source entail or contradict the claim? Catches negations, qualifications, inversions.
Fires Level 2 Challenge — >95% precision.
Layer 3 — Vector Similarity ~5ms
Cosine similarity as a final weak-signal safety net. Catches only extreme broad-scope divergence.
Fires Level 1 Flag — >80% precision.
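Layer 1, the deterministic layer, is the easiest to sketch. The regex and unit list below are illustrative assumptions; a production extractor would normalise units and handle ranges, tolerances, and locale formats.

```typescript
// Sketch of Layer 1: extract number+unit pairs from a claim and compare
// them deterministically against the retrieved source text.
function extractQuantities(text: string): Map<string, number> {
  const out = new Map<string, number>();
  for (const m of text.matchAll(/(\d+(?:\.\d+)?)\s*(mm|m|kN\/m)/g)) {
    out.set(m[2], parseFloat(m[1])); // keyed by unit for this toy example
  }
  return out;
}

// Returns the units where claim and source disagree. A deterministic
// mismatch like 85mm vs 185mm is exactly what fires a Level 3 Block.
function quantityMismatches(claim: string, source: string): string[] {
  const claimQ = extractQuantities(claim);
  const sourceQ = extractQuantities(source);
  const bad: string[] = [];
  for (const [unit, value] of claimQ) {
    if (sourceQ.has(unit) && sourceQ.get(unit) !== value) bad.push(unit);
  }
  return bad;
}
```

Note how the dropped-digit failure from the bridge example is trivially catchable at this layer, even though cosine similarity would score the two sentences as near-identical.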
Cloud Ensemble for Generative Claims
For claims without RAG context — three weak signals combined provide best-effort detection, honest and bounded.
Internal Consistency
Maintains a session-level claim ledger. New claims checked against prior claims in the session for contradictions. Catches logical inconsistencies, but not isolated false claims.
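A minimal claim ledger might look like the sketch below. It assumes claims have already been reduced to (subject, value) pairs upstream; that reduction step is the genuinely hard part and is not shown.

```typescript
// Session-level claim ledger: new claims are checked against prior claims
// in the same session for direct contradictions. Catches logical
// inconsistency, not isolated false claims.
class ClaimLedger {
  private claims = new Map<string, string>();

  // Returns the earlier conflicting value if the new claim contradicts a
  // prior assertion this session, else null (and records the claim).
  record(subject: string, value: string): string | null {
    const prior = this.claims.get(subject);
    if (prior !== undefined && prior !== value) return prior;
    this.claims.set(subject, value);
    return null;
  }
}
```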
Confidence Calibration
Cloud SLM evaluates the claim and reports its own confidence. Catches gross errors where the evaluator model is uncertain. Limited by correlated failure — both models may share blind spots.
Claim Risk Categorisation
Specific numerical claims in professional contexts are inherently higher risk than general qualitative statements. Claim type affects flagging aggressiveness.
Runbooks as Knowledge Authority
A runbook written by a licensed professional doesn't just define incident response — it encodes what correct looks like for that domain.
The Human Analogy
Question → Junior retrieves information → Senior validates against established standards → Safe answer delivered.
Lumen: Query → RAG retrieves → Vector comparison checks consistency → Expert runbook validates → Safe response.
Runbooks Excel At
Hard boundaries ("dose must never exceed X"), known dangerous combinations, regulatory requirements, physical constraints ("for spans > 300m, diameter cannot be below 150mm")
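The bridge constraint quoted above is the kind of hard boundary a runbook encodes well. The sketch below is an assumed structure, not the runbook format; real constraints would carry authority references, units handling, and tolerances.

```typescript
// "For spans > 300m, diameter cannot be below 150mm", encoded as a
// machine-enforceable constraint over a structured design claim.
interface DesignClaim {
  spanM: number;            // bridge span, metres
  cableDiameterMm: number;  // proposed cable diameter, millimetres
}

// Returns a violation message, or null when the claim is within bounds.
function minDiameterConstraint(c: DesignClaim): string | null {
  if (c.spanM > 300 && c.cableDiameterMm < 150) {
    return `span ${c.spanM}m requires diameter >= 150mm, got ${c.cableDiameterMm}mm`;
  }
  return null;
}
```

Against the 85mm-vs-185mm failure earlier in this document, a constraint like this fires deterministically, which is what makes a Level 3 Block defensible.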
Privacy Architecture — Three Tiers
| Component | Mid-Market (Cloud) | Enterprise (Hybrid) | Enterprise Premium (On-Prem) |
|---|---|---|---|
| SDK + Local evaluation | ✓ Customer | ✓ Customer | ✓ Customer |
| Runbook constraint check | ✓ Local | ✓ Local | ✓ Local |
| Type 1 evaluation | ✓ Local | ✓ Local | ✓ Local |
| Type 2 evaluation | Lumen Cloud (redacted) | Lumen Cloud (redacted) | Customer infra (full) |
| Flight recorder | Lumen Cloud (redacted) | Customer infra (full) | Customer infra (full) |
| PII exposure | Redacted only | Redacted only | Zero |
qanata.configure({ industry: 'insurance' }) — that's it. Loads insurance-specific PII patterns. Redacted text (not vectors) for auditability.
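One plausible reading of "loads insurance-specific PII patterns" is sketched below. Both the pattern set and the `redact` helper are assumptions for illustration; they are not Qanata's actual implementation, and the policy-number format is invented.

```typescript
// Hypothetical industry-keyed PII redaction. Redacts in the *text* domain
// (not vectors) so the flight recorder stays human-auditable.
const PII_PATTERNS: Record<string, RegExp[]> = {
  insurance: [
    /\bPOL-\d{6,}\b/g,         // invented policy-number format, for illustration
    /\b\d{3}-\d{2}-\d{4}\b/g,  // US SSN-style identifiers
  ],
};

function redact(text: string, industry: string): string {
  let out = text;
  for (const re of PII_PATTERNS[industry] ?? []) {
    out = out.replace(re, "[REDACTED]");
  }
  return out;
}
```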
Updated Architecture Flow
Horizontal Business Functions, Not Industry Verticals
Building insurance-specific runbooks makes Lumen an industry product. That narrows the market, requires domain expertise, and slows mass adoption.
Every company does these regardless of industry:
Accounting & Finance
VAT, tax, financial reporting, expense classification
HR & Employment
Employment contracts, statutory pay, working time, GDPR
Customer Service
Response accuracy, commitment validation, regulatory comms
Legal & Compliance
Contract review, regulatory filing, data protection
Launch Runbooks
Accounting/Finance + HR/Employment — built from well-structured public regulatory documents. HMRC VAT guidance applies to every UK business. Employment Rights Act applies to every UK employer.
Two-Track Runbook Creation
Track A — Enterprise: Lumen-Assisted Generation
Customer provides standards → Lumen's LLM extracts constraints → draft runbook with confidence scores → customer's domain expert reviews per-constraint → approves → deployed.
- Solves cold-start — no external contributors needed at launch
- Blank-page authoring → editorial review (familiar workflow)
- Professional sign-off creates the audit documentation
- Natural sales motion: "Give us your standards, we'll draft runbooks, you're live in a week"
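A drafted constraint awaiting expert review might be shaped as follows. Field names are illustrative, not the Workspace schema; the ordering helper reflects the review-fatigue mitigation mentioned later in this document.

```typescript
// Hypothetical shape of one extracted constraint in a draft runbook.
interface DraftConstraint {
  text: string;       // the constraint, in plain language
  sourceRef: string;  // where in the uploaded standard it came from
  confidence: number; // extraction confidence, 0..1
  status: "draft" | "approved" | "rejected";
}

// Surface low-confidence extractions first, so the expert's attention goes
// where the LLM was least sure.
function reviewOrder(cs: DraftConstraint[]): DraftConstraint[] {
  return [...cs].sort((a, b) => a.confidence - b.confidence);
}
```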
Track B — Community Contributions (Later)
Domain experts publish runbooks they've written. Others install and adapt. Emerges naturally once Track A has seeded enough runbooks. The marketplace becomes a distribution channel for runbooks that already exist.
Complications to Manage
Review fatigue — mitigate with confidence scores per constraint.
Source document quality variation — BS standards are precise; NHS guidelines may be narrative-heavy.
IP / copyright — customer provides their own licensed copy of the standard.
Standards Body Partnerships
The ambitious version: BSI endorses Lumen's electrical engineering runbook as the approved machine-enforceable version of BS 7671. That's a trust signal no competitor can replicate.
For the Standards Body
Their standards become more relevant, not less. A standard embedded in an AI safety layer is enforced automatically, thousands of times a day. New digital revenue stream from a market they currently can't reach.
For Qanata
The endorsed version. A startup building a competing runbook without the partnership is selling an unofficial interpretation. Qanata sells the authorised version. Moat that isn't about technology at all.
Target Partners First
IET — publishes BS 7671, engaged in technology policy.
ICE — Institution of Civil Engineers.
NICE — clinical guidelines already structured for constraint extraction.
Build Sequence
Lumen Chat + Lumen Workspace
Ship the product. Revenue from day one. Lumen Chat IS the reasoning window — no integration needed. Lumen Workspace IS the runbook management tool.
- Accounting/Finance + HR runbooks at launch
- UK jurisdiction first
- Mid-market self-serve onboarding
Extract Lumen SDK
Open source the headless evaluation pipeline. Developer adoption for custom AI applications. Enterprise segment opens up.
- Enterprise pilots with hybrid deployment
- Industry-specific runbooks (funded by growth capital)
- Standards body partnership outreach begins
Network Effects
SDK users contribute back to the runbook library. Community runbooks emerge organically. Standards body partnerships become viable with adoption numbers.
- Runbook marketplace (npm for AI constraints)
- First standards body endorsement
- International jurisdictions
Open Questions for Session 4
- Lumen Chat UX — LLM provider connection flow, team management, reasoning window rendering
- Lumen Workspace — constraint extraction pipeline, confidence scoring, review UI
- Claim Boundary Benchmark — test corpus from target domains, precision/recall thresholds
- Three-Layer Type 1 Validation — NLI model selection, end-to-end latency benchmarking
- Concurrency & Load — p95/p99 latency, CPU overhead, GPU threshold
- Legal Liability — "tool not authority" terms, liability when user overrides Level 2
- Flight Recorder Retention — customer-managed encryption, data residency by jurisdiction
- Pricing Model — per-interaction vs per-token, tier structure, target price point
- Go-to-Market Execution — first 100 customers, demo strategy, self-serve onboarding
- Buyer Persona Validation — talk to 10+ mid-market companies, map decision-making chain
- Fundraising Narrative — VC deck, SaaSpocalypse framing, productivity enablement story