A practitioner-first curriculum for building production LLM systems: what the market actually hires for, how to architect conversational AI across industries using an intent taxonomy, and the full stack from RAG to agents to guardrails. Synthesized from 200+ job postings, enterprise deployment patterns, and real product work.
Applied AI is not training foundation models from scratch. It is the discipline of taking existing models (GPT, Claude, Gemini, Llama) and engineering reliable products around them: retrieval pipelines, agent orchestration, guardrails, evaluation, deployment, and continuous improvement in production.
| Role | Core Output | PhD Required? | Typical Stack |
|---|---|---|---|
| ML Engineer | Trains models from data; feature stores; batch inference | Often helpful | PyTorch, SageMaker, Kubeflow |
| Applied AI / GenAI Engineer | Ships LLM features users touch daily | No | LangGraph, RAG, FastAPI, eval pipelines |
| Forward Deployed Engineer | Embeds with customer; end-to-end delivery on-site | No | Platform + custom integrations + stakeholder mgmt |
| AI PM | Prioritizes use cases, metrics, guardrails, rollout | No | Eval rubrics, cost models, user research |
| Research Scientist | Publishes; trains/fine-tunes foundation models | Usually yes | Distributed training, novel architectures |
Analysis synthesized from 200+ job postings across LinkedIn, Levels.fyi, company career pages, and specialized boards (AgenticCareers, NLP People, Boundev's 50-post sample, Deloitte/Palantir FDE listings). Percentages reflect frequency across LLM-focused roles in 2025–2026.
| Skill / Requirement | Frequency | What "Good" Looks Like in Interviews |
|---|---|---|
| RAG (Retrieval-Augmented Generation) | ~74% | Debugged chunking failures; tuned hybrid search + reranking; measured groundedness |
| Python + API development | ~95% | FastAPI/Flask, async, clean service boundaries, typed schemas |
| Prompt engineering (systematic) | ~88% | Versioned prompts, few-shot libraries, structured outputs — not "I wrote a good prompt once" |
| Agent / tool-use orchestration | ~65% | LangGraph state machines, retries, human-in-the-loop, tool schema design |
| Vector databases | ~70% | pgvector, Pinecone, Weaviate, OpenSearch — plus when NOT to use vectors |
| Evaluation & observability | ~58% | RAGAS, DeepEval, LangSmith/Langfuse, production trace analysis |
| Cloud deployment (AWS/Azure/GCP) | ~72% | Bedrock, Vertex AI, Azure OpenAI — with cost/latency tradeoffs |
| Guardrails & safety | ~45% | Input/output filtering, PII redaction, escalation paths |
| MCP / tool protocol | ~22% (rising fast) | FastMCP servers, OpenAPI-bound action groups |
| Fine-tuning | ~25% | Nice-to-have; most roles expect RAG + prompting first |
| Graph RAG / knowledge graphs | ~18% | Differentiator for enterprise, compliance, lineage queries |
Orchestration: LangChain, LangGraph, LlamaIndex, Semantic Kernel, CrewAI, AutoGen
LangGraph/LangChain dominate enterprise JDs
Evaluation: RAGAS, DeepEval, LangSmith, Braintrust, PromptFoo, G-Eval
Eval is the fastest-growing differentiator
The same work appears under many titles: AI Engineer, Applied AI Engineer, GenAI Engineer, LLM Engineer, AgentOps Engineer, AI Delivery Engineer, Forward Deployed Engineer. Compensation varies by up to $86K for identical scope. Optimize for portfolio + production stories, not title collection.
The biggest mistake in enterprise conversational AI is treating every user message as "chat with RAG." Your screenshot captures the right instinct: classify intent first, then route to the correct technical pattern. Below is an expanded taxonomy for Applied AI across industries — building on the 3-class model (Interpretive → RAG, Transactional → Agents, High-Risk → Guardrails + Handoff) plus three additional classes seen in production systems and research (TUNA framework, Broder's web search taxonomy, intent-first RAG literature).
Intent-first architecture inverts naive RAG: classify before retrieve, route before generate.
| Approach | Latency | Accuracy | When to Use |
|---|---|---|---|
| Small LLM with structured output (Haiku, GPT-4o-mini) | ~200ms | Good | Fast MVP, <10 intent classes |
| Fine-tuned classifier (BERT, DistilBERT) | <50ms | Very good in-domain | High volume, stable taxonomy |
| Rules + embeddings hybrid | <30ms | Moderate | Regulated industries with explicit policies |
| LLM + confidence threshold → clarifying question | ~300ms | Best UX | Ambiguous queries common |
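
The classify-then-route idea, including the confidence-threshold fallback from the last table row, can be sketched in a few lines. This is a minimal illustration, not a production classifier: the keyword sets, the pipeline names in `ROUTES`, and the `0.6` threshold are all assumptions made for the example, and the keyword scorer stands in for whichever approach from the table is actually used.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword cues per intent class (illustrative only).
INTENT_KEYWORDS = {
    "interpretive": {"what", "how", "cover", "explain", "policy"},
    "transactional": {"schedule", "update", "cancel", "refund", "order"},
    "high_risk": {"diagnose", "lawsuit", "fraud", "chest", "invest"},
}

# Hypothetical mapping from intent class to downstream pipeline name.
ROUTES = {
    "interpretive": "rag_pipeline",
    "transactional": "agent_pipeline",
    "high_risk": "guardrail_handoff",
}


@dataclass
class Classification:
    intent: str
    confidence: float


def classify(message: str) -> Classification:
    """Score each intent by keyword hits; confidence is the winner's share of hits."""
    tokens = set(re.findall(r"[a-z']+", message.lower()))
    scores = {intent: len(tokens & kws) for intent, kws in INTENT_KEYWORDS.items()}
    total = sum(scores.values())
    if total == 0:
        return Classification("unknown", 0.0)
    best = max(scores, key=scores.get)
    return Classification(best, scores[best] / total)


def route(message: str, threshold: float = 0.6) -> str:
    """Classify first, then route; ask a clarifying question when unsure."""
    result = classify(message)
    if result.confidence < threshold:
        return "clarifying_question"
    return ROUTES[result.intent]


print(route("What does my plan cover for physical therapy?"))
print(route("Schedule my follow-up"))
print(route("explain refund"))
```

Swapping `classify` for a small LLM with structured output or a fine-tuned classifier leaves `route` unchanged, which is the main reason to keep classification and routing as separate steps.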
Ingestion → chunking → embeddings → vector store → hybrid retrieval → rerank → grounded generation with citations.
When: Interpretive intents, document-heavy domains (insurance, legal, HR policies).
Watch-outs: Fixed chunking breaks semantic coherence; stale docs; no citation = no trust.
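The hybrid retrieval step of this pipeline can be sketched as below: two rankers are fused with reciprocal rank fusion, and the source IDs are returned so they can serve as citations. It is a toy under stated assumptions: the `DOCS` contents and IDs are invented, and a character-trigram cosine stands in for a real embedding model and vector store.

```python
import math
from collections import Counter

# Hypothetical chunk store: chunk ID (usable as a citation) -> chunk text.
DOCS = {
    "benefits.pdf#p12": "Physical therapy is covered up to 20 visits per year.",
    "benefits.pdf#p30": "Maternity care includes prenatal visits and delivery.",
    "benefits.pdf#p41": "Dental cleanings are covered twice per year.",
}


def tokens(text: str) -> list[str]:
    return [w.strip(".,?!").lower() for w in text.split()]


def keyword_rank(query: str, docs: dict[str, str]) -> list[str]:
    """Rank chunks by the number of exact query terms they contain."""
    q = set(tokens(query))
    scored = {doc_id: len(q & set(tokens(body))) for doc_id, body in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)


def trigrams(text: str) -> Counter:
    s = " ".join(tokens(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_rank(query: str, docs: dict[str, str]) -> list[str]:
    """Stand-in for embedding search: cosine similarity over character trigrams."""
    q = trigrams(query)
    scored = {doc_id: cosine(q, trigrams(body)) for doc_id, body in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)


def hybrid_retrieve(query: str, docs: dict[str, str], top_k: int = 2, k: int = 60) -> list[str]:
    """Fuse both rankings with reciprocal rank fusion: score = sum of 1 / (k + rank)."""
    fused: Counter = Counter()
    for ranking in (keyword_rank(query, docs), vector_rank(query, docs)):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common(top_k)]


citations = hybrid_retrieve("Is physical therapy covered?", DOCS)
print(citations)
```

The returned IDs would be passed to the generation step along with the chunk text, so every answer can cite the page it came from.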
State graph → tool selection → execution → observation → loop until done or budget exhausted.
When: Transactional intents, multi-step processes (claims, IT tickets, scheduling).
Watch-outs: Unbounded agent loops burn cost; always set max hops + timeout.
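A bounded version of this loop might look like the sketch below, which enforces both a hop budget and a wall-clock timeout. The tool registry, the planner functions, and the tuple-based action format are hypothetical; `plan_step` stands in for the LLM call that picks the next tool.

```python
import time
from typing import Callable

# Hypothetical tool registry; the name and behaviour are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda arg: f"order {arg}: shipped",
}


def run_agent(plan_step: Callable[[str, list[str]], tuple], goal: str,
              max_hops: int = 5, timeout_s: float = 10.0) -> dict:
    """Loop: plan -> execute tool -> observe, until done or the budget runs out.

    plan_step returns ("tool", name, arg) or ("final", answer).
    """
    deadline = time.monotonic() + timeout_s
    observations: list[str] = []
    for hop in range(max_hops):
        if time.monotonic() > deadline:
            return {"status": "timeout", "hops": hop, "answer": None}
        action = plan_step(goal, observations)
        if action[0] == "final":
            return {"status": "done", "hops": hop, "answer": action[1]}
        _, name, arg = action
        tool = TOOLS.get(name)
        # Unknown tools become an observation instead of crashing the loop.
        observations.append(tool(arg) if tool else f"error: unknown tool {name}")
    return {"status": "budget_exhausted", "hops": max_hops, "answer": None}


def scripted_planner(goal: str, observations: list[str]) -> tuple:
    """Stand-in for an LLM planner: one tool call, then answer."""
    if not observations:
        return ("tool", "lookup_order", "A123")
    return ("final", observations[-1])


def looping_planner(goal: str, observations: list[str]) -> tuple:
    """A planner that never finishes, to show the hop budget kicking in."""
    return ("tool", "lookup_order", "A123")


print(run_agent(scripted_planner, "Where is my order?"))
print(run_agent(looping_planner, "Where is my order?", max_hops=3))
```

Returning an explicit `status` lets the caller tell a finished task from an exhausted budget and hand the latter to a human or a fallback path.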
Input filtering (PII, jailbreaks) → policy check → output validation → escalation triggers.
When: High-risk intents, regulated industries, customer-facing support.
Watch-outs: Guardrails are not optional "later" — ship them with v1.
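The input-side half of this pipeline can be sketched with plain regular expressions and phrase lists. Everything here is a simplifying assumption: the two PII patterns, the jailbreak phrases, and the escalation terms are illustrative, and a real deployment would pair such rules with a dedicated PII detector and a policy model.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical phrase lists for the policy check and escalation triggers.
JAILBREAK_PHRASES = ("ignore previous instructions", "disregard your rules")
ESCALATION_TERMS = ("chest pain", "lawsuit", "fraud")


def redact_pii(text: str) -> str:
    """Replace emails and US-style SSNs with placeholders before logging or prompting."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


def check_input(text: str) -> dict:
    """Return an action (block, escalate, allow) plus the redacted text to pass on."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in JAILBREAK_PHRASES):
        return {"action": "block", "text": None}
    if any(term in lowered for term in ESCALATION_TERMS):
        return {"action": "escalate", "text": redact_pii(text)}
    return {"action": "allow", "text": redact_pii(text)}


def validate_output(text: str) -> str:
    """Output validation step: never echo PII back, even if the model produced it."""
    return redact_pii(text)


print(check_input("My SSN is 123-45-6789, email a.b@example.com"))
print(check_input("Is this chest pain serious?"))
```

Running the same redaction on input and output keeps PII out of prompts, logs, and responses with one shared rule set.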
Offline eval sets → CI gates → online A/B → trace analysis → prompt versioning → cost dashboards.
When: Always. The #1 skill gap in candidates who can demo but can't ship.
Watch-outs: "It works on my laptop" is not production.
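The "offline eval sets → CI gates" step can be as small as the sketch below: score a candidate against a golden set and fail the build if it regresses against a stored baseline. The golden-set entries, the phrase-match scoring rule, and the `0.02` tolerance are assumptions for illustration; frameworks such as RAGAS or DeepEval would supply richer metrics.

```python
from typing import Callable

# Hypothetical golden set: each case lists phrases a correct answer must contain.
GOLDEN_SET = [
    {"question": "What is the return window?", "must_include": ["30 days"]},
    {"question": "Do you ship internationally?", "must_include": ["yes", "customs"]},
]


def pass_rate(answer_fn: Callable[[str], str], golden_set: list[dict]) -> float:
    """Fraction of golden cases whose answer contains every required phrase."""
    passed = 0
    for case in golden_set:
        answer = answer_fn(case["question"]).lower()
        if all(phrase.lower() in answer for phrase in case["must_include"]):
            passed += 1
    return passed / len(golden_set)


def ci_gate(answer_fn: Callable[[str], str], golden_set: list[dict],
            baseline: float, tolerance: float = 0.02) -> bool:
    """True when the candidate does not regress beyond the tolerance."""
    return pass_rate(answer_fn, golden_set) >= baseline - tolerance


def good_bot(question: str) -> str:
    """Stand-in for the system under test."""
    return "Yes, we ship worldwide; customs fees may apply. Returns are accepted within 30 days."


def bad_bot(question: str) -> str:
    """A regressed candidate that lost the shipping answer."""
    return "Returns are accepted within 30 days."


print(pass_rate(good_bot, GOLDEN_SET), ci_gate(good_bot, GOLDEN_SET, baseline=1.0))
print(pass_rate(bad_bot, GOLDEN_SET), ci_gate(bad_bot, GOLDEN_SET, baseline=1.0))
```

In a test suite, the gate would typically be a single `assert ci_gate(...)` that runs on every prompt or retrieval change.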
| Modality | Example Products | Applied AI Pattern |
|---|---|---|
| Copilot / Inline Assist | GitHub Copilot, Cursor, Notion AI | Context from current doc + lightweight completion; not full chat |
| Voice AI | ElevenLabs, Vapi, OpenAI Realtime API | STT → intent → TTS; latency-critical; often hybrid with human transfer |
| Workflow Automation | n8n, Zapier AI, UiPath | Deterministic triggers + LLM for unstructured steps |
| Search + Gen | Perplexity-style, enterprise search | RAG with web or internal index; citation-first UX |
| Multi-Agent Teams | CrewAI, AutoGen patterns | Role-specialized agents; high cost — use when decomposition is proven necessary |
| Computer Use / UI Agents | Browser automation, RPA+LLM | Fragile; prefer API-first transactional agents when APIs exist |
Interpretive: "What does my plan cover for physical therapy?" → RAG over benefits docs.
Transactional: "Schedule my follow-up" → Agent + EHR scheduling API.
High-Risk: "Is this chest pain serious?" → Guardrail → nurse triage line. Never diagnose.
Compliance: HIPAA, no PHI in logs, BAA with vendors.
Interpretive: Advisor RAG over 500-page policy PDFs (maternity vs pregnancy keyword gap).
Transactional: FNOL claim intake, beneficiary updates → Agents + core policy admin APIs.
High-Risk: Investment advice, fraud disputes → human handoff + audit trail.
Analytical: Coverage gap summaries across product lines → multi-doc agentic RAG.
Interpretive: "Will this fit my 2019 Honda?" → RAG + structured fitment graph.
Transactional: Order status, returns, refunds → Agents bound to OMS APIs.
Navigational: "Where's my seller dashboard?" → route, don't generate.
Search AI: Query understanding + semantic retrieval (not generative answers for every search).
Interpretive: "How do I tailor my resume for this JD?" → RAG over role requirements + user profile.
Transactional: Apply to job, schedule mock interview, export PDF → product actions.
Analytical: Fit score explanation across skills gap → structured comparison agent.
Meta: "Make it more senior" → session memory + iterative refinement loop.
Interpretive: Benefits eligibility, permit requirements → RAG over official docs only.
Transactional: Form pre-fill, case status → Agents with strict auth.
High-Risk: Legal interpretation, immigration → mandatory human review.
Constraint: FedRAMP, data residency, no external model calls for classified data.
Interpretive: SOP lookup, safety procedures → RAG with version-controlled docs.
Analytical: "Which suppliers depend on component X?" → graph RAG.
Transactional: PO creation, inventory holds → ERP agents.
Voice: Hands-free floor worker queries via voice AI.
Interpretive: Contract clause lookup → RAG with precise citations (page, section).
Analytical: Compare redlines across versions → multi-doc agent.
High-Risk: "Should I sign this?" → never autonomous; attorney handoff.
Interpretive: Concept explanation, study guides → RAG over curriculum.
Meta: "Quiz me harder" → adaptive difficulty from session state.
Transactional: Enroll, submit assignment → LMS integration.
Eval focus: Factual accuracy rubrics; hallucination = student harm.
Palantir pioneered the Forward Deployed Engineer (FDE) role, which has since been adopted by Deloitte, Scale AI, Databricks, and enterprise AI consultancies. In 2026, FDE roles explicitly require GenAI/agentic delivery, not just data integration.
| Dimension | Platform / Product Engineer | Applied AI Engineer | Forward Deployed Engineer |
|---|---|---|---|
| Customer proximity | Indirect (PM proxy) | Sometimes | Embedded on-site / in war room |
| Problem shape | Generalizable features | LLM system components | One client's ambiguous problem → working solution |
| Success metric | DAU, feature adoption | Latency, accuracy, cost | Customer mission outcome in weeks |
| Skills emphasis | Scale, abstractions | RAG, agents, eval | Stakeholder mgmt + rapid prototyping + politics |
| Layer | Build | Buy / Managed | Recommendation |
|---|---|---|---|
| Foundation model | Train from scratch | OpenAI, Anthropic, Bedrock, Vertex | Buy API — 99% of Applied AI roles |
| Vector search | Custom FAISS | Pinecone, OpenSearch, pgvector | Managed until scale proves otherwise |
| Agent orchestration | Custom state machine | LangGraph, Bedrock Agents | Framework first, custom only for edge cases |
| Evaluation | Custom pytest + rubrics | LangSmith, Braintrust, DeepEval | Hybrid — platform for traces, custom for domain rubrics |
| Full conversational platform | From scratch | Kore.ai, Cognigy, Ada, Sierra | Buy for standard support bots; build for differentiated IP |
| Metric | Definition | Why Executives Care |
|---|---|---|
| Containment rate | % resolved without human | Support cost reduction |
| Groundedness / faithfulness | Answer supported by retrieved context | Legal/compliance risk |
| Time-to-resolution | Median conversation length to outcome | Customer satisfaction |
| Cost per conversation | Tokens + infra / session | Unit economics at scale |
| Escalation rate | % routed to human | Guardrail effectiveness |
| Hallucination rate | % of answers factually wrong on the golden set | Brand trust |
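
Several of the metrics above fall out directly from per-session logs. The sketch below computes containment rate, escalation rate, median turns to outcome, and cost per conversation. The session records, field names, and both prices are hypothetical values chosen for the example.

```python
from statistics import median

# Hypothetical session log; field names are assumptions for this sketch.
SESSIONS = [
    {"resolved": True, "escalated": False, "turns": 4, "tokens": 2000},
    {"resolved": True, "escalated": False, "turns": 6, "tokens": 3000},
    {"resolved": False, "escalated": True, "turns": 3, "tokens": 1000},
    {"resolved": True, "escalated": True, "turns": 8, "tokens": 4000},
]


def summarize(sessions: list[dict], usd_per_1k_tokens: float = 0.002,
              infra_usd_per_session: float = 0.01) -> dict:
    """Aggregate business metrics from raw sessions (prices are placeholders)."""
    n = len(sessions)
    contained = sum(1 for s in sessions if s["resolved"] and not s["escalated"])
    escalated = sum(1 for s in sessions if s["escalated"])
    resolved_turns = [s["turns"] for s in sessions if s["resolved"]]
    token_cost = sum(s["tokens"] for s in sessions) / 1000 * usd_per_1k_tokens
    return {
        # Resolved without any human involvement.
        "containment_rate": contained / n,
        "escalation_rate": escalated / n,
        # Conversation length, in turns, for sessions that reached an outcome.
        "median_turns_to_resolution": median(resolved_turns),
        # (Token spend + infrastructure) averaged over sessions.
        "cost_per_conversation_usd": (token_cost + infra_usd_per_session * n) / n,
    }


print(summarize(SESSIONS))
```

Groundedness and hallucination rate need labeled or model-judged data, so they are left out of this log-only summary.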
Upload benefits PDFs. Classify: interpretive vs transactional vs high-risk. RAG with citations for Q&A; mock API for "update beneficiary"; handoff UI for advice requests.
Signals: Taxonomy thinking, RAG debugging, guardrails awareness.
Parse resume + JD → fit analysis → tailored bullet suggestions → mock interview questions. Separate interpretive (explain gap) from transactional (save application).
Signals: Product sense, structured outputs, multi-step workflow.
50 FAQ golden set. Agent with order lookup tool. CI fails on faithfulness regression. Public trace viewer.
Signals: LLMOps maturity — the #1 differentiator in 2026 hiring.
Ingest architecture docs + YAML configs. Answer "what breaks if X fails?" using Neo4j + vectors.
Signals: Advanced retrieval — differentiates senior candidates.
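The graph half of the "what breaks if X fails?" question is a reverse-dependency traversal. The sketch below uses an in-memory dictionary rather than Neo4j so that it stays self-contained; the service names and edges are invented, and in the project they would be extracted from the ingested YAML configs.

```python
from collections import deque

# Hypothetical dependency edges: component -> components it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["redis"],
    "search": ["opensearch"],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk over reversed edges: everything that transitively needs `failed`."""
    # Invert the graph so each component maps to its direct dependents.
    dependents: dict[str, list[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    seen = {failed}
    queue = deque([failed])
    impacted: list[str] = []
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                impacted.append(parent)
                queue.append(parent)
    return impacted


print(impacted_by("postgres", DEPENDS_ON))
print(impacted_by("redis", DEPENDS_ON))
```

The impacted list gives the structural answer; vector retrieval over the architecture docs would then supply the passages that explain each affected component.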
Audience breaks into groups. Classify each query using the 6-class taxonomy and name the technical solution:
Pick one industry (healthcare, e-commerce, HR). Draw: intent classifier → pipelines → eval layer. Identify ONE high-risk path that must never be fully automated.
Connecting seminar concepts to product work already in this codebase — useful for demonstrating real Applied AI delivery:
| Product Feature | Intent Class | Pattern Used |
|---|---|---|
| Resume tailoring / curator | Interpretive + Meta | Structured LLM outputs, iterative refinement, user profile context |
| Mock interview | Interpretive + Analytical | Multi-turn conversation, follow-up generation, rubric scoring |
| Job fit scoring | Analytical | Structured comparison, semantic matching, explainable gaps |
| Simple Apply / FlashApply | Transactional | Agent-like workflow — form mutation via integrations (Greenhouse, extension) |
| Screening sessions | High-Risk (hiring decisions) | Human-in-loop, scored reports, not autonomous hire/reject |
| Study pack generation | Interpretive | RAG over role requirements + generated curriculum |