From Demo to Production in Six Weeks: An FDE Delivery Timeline
2026-06-08 · 12 min read
The demo took three days. Production takes six weeks if you are disciplined, or never if you confuse a working notebook with an operable system. This timeline reflects what forward-deployed engineering (FDE) teams with a repeatable delivery process use for conversational AI in enterprise environments with security, eval, and handoff requirements.
Week 1: Discovery and access
Run the seventy-two-hour discovery sprint: shadow users, data archaeology, intent taxonomy, one-page brief. Parallel track: secure read access to data sources, staging environment credentials, logging destination approval. Deliverables: ranked backlog, architecture sketch per intent class, executive decision on MVP scope. Common failure: starting embeddings before IAM roles exist. Recovery: narrow MVP to public or sample data while security catches up — but do not pretend sample results equal production quality.
Week 2: MVP on representative data
Build interpretive pipeline first — usually highest volume, lowest write risk. Twenty-question eval set from real utterances. Measure faithfulness and citation accuracy before adding agents. Deliverables: working staging endpoint, eval report v0, known failure catalog. Common failure: chunking breaks on tables and headers in PDFs. Recovery: structure-aware parsing, parent-child chunks, or markdown intermediate format.
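One of the recoveries named above, parent-child chunks over a markdown intermediate format, can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the Chunk class, the function names, and the rule of splitting on heading lines and blank lines are all assumptions made for the example. Small child chunks are what you would embed and search; the parent section is what you would hand to the model, so a table keeps its header and surrounding context.

```python
from dataclasses import dataclass


@dataclass
class Chunk:  # hypothetical record type for this sketch
    chunk_id: str
    parent_id: str
    text: str


def parent_child_chunks(markdown: str) -> tuple[dict[str, str], list[Chunk]]:
    """Return (parents, children): sections keyed by id, plus small child chunks."""
    # Assumption: a line starting with "#" opens a new parent section.
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    parents: dict[str, str] = {}
    children: list[Chunk] = []
    for section in sections:
        if not section:
            continue
        parent_id = f"p{len(parents)}"
        parents[parent_id] = section
        child_count = 0
        # Children are blank-line-separated blocks, so a table stays in one chunk.
        for block in section.split("\n\n"):
            body = "\n".join(
                ln for ln in block.strip().splitlines() if not ln.startswith("#")
            )
            if body:  # skip heading-only blocks; the heading lives in the parent
                children.append(Chunk(f"{parent_id}-c{child_count}", parent_id, body))
                child_count += 1
    return parents, children


def expand_to_parent(hit: Chunk, parents: dict[str, str]) -> str:
    """After retrieval matches a child, return the full parent section."""
    return parents[hit.parent_id]
```

In use, retrieval would score the child chunks and call expand_to_parent on each hit before building the prompt. Whether sections or some larger unit make the right parent depends on the customer's documents.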
Week 3: Integrations and guardrails v1
Wire customer auth, audit logging, input/output filters for PII and jailbreak patterns. If transactional intents are in MVP, add agent with one tool, confirmation step, and idempotent writes. Deliverables: security review packet, guardrail test cases, updated eval set at fifty questions. Common failure: tool calls succeed in dev but fail OAuth in staging. Recovery: pair with customer identity team; mock tool with recorded responses only as temporary bridge — label clearly as non-production.
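The "one tool, confirmation step, and idempotent writes" pattern above might look something like the sketch below. Everything here is hypothetical: the TicketTool class, the in-memory result store, and the choice to derive the idempotency key from conversation id, action, and payload. A real deployment would persist keys in the customer's system of record and call their actual API.

```python
import hashlib
import json


class TicketTool:
    """Hypothetical single write tool with a confirmation gate and idempotency."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}  # idempotency key -> stored result
        self.writes = 0  # counts real side effects, for illustration only

    @staticmethod
    def idempotency_key(conversation_id: str, action: str, payload: dict) -> str:
        # Canonical JSON so that key order in the payload does not change the key.
        canonical = json.dumps(
            {"conversation": conversation_id, "action": action, "payload": payload},
            sort_keys=True,
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(
        self, conversation_id: str, action: str, payload: dict, confirmed: bool
    ) -> dict:
        if not confirmed:
            # No side effect until the user explicitly confirms the preview.
            return {"status": "needs_confirmation", "preview": payload}
        key = self.idempotency_key(conversation_id, action, payload)
        if key in self._results:
            # A retry or duplicate call returns the original result unchanged.
            return self._results[key]
        self.writes += 1
        result = {"status": "created", "ticket_id": f"T-{self.writes}"}
        self._results[key] = result
        return result
```

The design choice worth noting is that a model retry, a network retry, and a user double-click all collapse into one write, which is what makes a failed tool call safe to replay.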
Week 4: UAT with real users
Five to ten end users, recorded sessions with consent. Trace analysis on failures — wrong intent class, retrieval miss, latency spike. Fix top three failure modes; defer the long tail. Deliverables: UAT summary, runbook draft, training one-pager. Common failure: users ask high-risk questions the MVP must not handle. Recovery: tighten classifier routing to handoff; never improvise regulated advice.
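Picking the top three failure modes is easier to defend when it comes from a count rather than from memory of the loudest session. A minimal sketch, assuming each UAT trace has already been labeled by a reviewer with a hypothetical "failure_mode" field (None for successful turns):

```python
from collections import Counter


def top_failure_modes(traces: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Rank labeled failure modes by frequency and keep the n most common."""
    counts = Counter(
        trace["failure_mode"] for trace in traces if trace.get("failure_mode")
    )
    return counts.most_common(n)


def deferred_failure_modes(traces: list[dict], n: int = 3) -> list[str]:
    """Everything outside the top n: the long tail to log and defer."""
    keep = {mode for mode, _ in top_failure_modes(traces, n)}
    seen = {t["failure_mode"] for t in traces if t.get("failure_mode")}
    return sorted(seen - keep)
```

The labels themselves (wrong intent class, retrieval miss, latency spike, and so on) still come from a human reading the traces; the code only keeps the prioritization honest.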
Week 5: Production deployment
Blue-green or canary release. On-call rotation defined, with ownership assigned to either the customer or the vendor. Dashboards: P95 latency, cost per conversation, faithfulness on a sample of conversations, escalation rate. Deliverables: production URL, runbook, rollback procedure. Common failure: production traffic differs from UAT volume and triggers rate limits. Recovery: request a quota increase, cache frequent queries, route the classifier to a smaller model.
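Three of the dashboard numbers above can be computed directly from per-conversation logs. The sketch below assumes hypothetical record fields (latency_ms, cost_usd, escalated) and uses the nearest-rank definition of P95; monitoring tools often interpolate instead, so their values may differ slightly from this one.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value with >= 95% at or below it."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers only
    return ordered[rank - 1]


def dashboard_summary(conversations: list[dict]) -> dict:
    """Summarize latency, cost, and escalation over a window of conversations."""
    n = len(conversations)
    return {
        "p95_latency_ms": p95([c["latency_ms"] for c in conversations]),
        "cost_per_conversation_usd": sum(c["cost_usd"] for c in conversations) / n,
        "escalation_rate": sum(1 for c in conversations if c["escalated"]) / n,
    }
```

Faithfulness is left out on purpose: it needs a graded sample rather than a log aggregate, so it belongs in a separate review job.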
Week 6: Metrics review and expansion
Executive readout with business metrics, not embedding dimensions. Expansion backlog grounded in what worked. Extract reusable patterns for platform team. Deliverables: thirty-day metric trend, phase-two proposal, honest postmortem of deferred scope. Common failure: sponsor wants expansion before week-six metrics prove value. Recovery: propose limited pilot expansion with kill criteria.
Definition of done for conversational AI
Not "LLM responds." Done means: faithfulness at or above agreed threshold on golden set, P95 latency within budget, escalation path tested, runbook validated by non-FDE operator, security sign-off documented, and at least one executive metric moving in the right direction — even if modestly.
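The definition of done above can be written down as an explicit gate so that "done" is checked rather than argued. A minimal sketch: the ReleaseEvidence fields and their names are assumptions made for illustration, and the thresholds are whatever the team and sponsor agreed on.

```python
from dataclasses import dataclass


@dataclass
class ReleaseEvidence:  # hypothetical record of sign-off evidence
    faithfulness: float
    faithfulness_threshold: float
    p95_latency_ms: float
    latency_budget_ms: float
    escalation_path_tested: bool
    runbook_validated_by_non_fde_operator: bool
    security_signoff_documented: bool
    executive_metric_improving: bool


def unmet_criteria(evidence: ReleaseEvidence) -> list[str]:
    """Return the criteria that still block 'done'; an empty list means done."""
    checks = {
        "faithfulness below agreed threshold on golden set": (
            evidence.faithfulness >= evidence.faithfulness_threshold
        ),
        "P95 latency over budget": (
            evidence.p95_latency_ms <= evidence.latency_budget_ms
        ),
        "escalation path not tested": evidence.escalation_path_tested,
        "runbook not validated by non-FDE operator": (
            evidence.runbook_validated_by_non_fde_operator
        ),
        "security sign-off not documented": evidence.security_signoff_documented,
        "no executive metric moving in the right direction": (
            evidence.executive_metric_improving
        ),
    }
    return [problem for problem, passed in checks.items() if not passed]
```

Returning the list of unmet criteria, rather than a single boolean, gives the week-six readout a concrete list of what is still open.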
Six weeks is aggressive for regulated industries and reasonable for focused use cases with engaged sponsors. The timeline fails when discovery is skipped or when "production" means a demo with a prod URL. FDE credibility is measured in operated systems, not screenshots.