Day 25 — Technical Interview Drills — Learn & Revise
Pre-reading: Week 4 Overview · Learning Plan
🎯 What You'll Master Today
Technical interviews for AI engineering roles test five distinct categories of skill, each requiring a different response strategy. Today you will learn the five question categories, master the RADSS framework for LLM system design interviews, and work through five fully answered technical question pairs covering RAG design, debugging, evaluation, agent systems, and tradeoff analysis. By end of day you will have a repeatable answer structure for every technical question type you are likely to face.
📖 Core Concepts
The Five Categories of AI Engineer Technical Interviews
| Category | What Is Being Tested | Example Prompt |
|---|---|---|
| System Design | Architecture thinking, scalability, component selection | "Design a RAG system for a legal document Q&A platform" |
| Coding | Implementation skill, data handling, eval code | "Write a function to chunk a PDF and compute chunk-level similarity scores" |
| Concept Explanation | Depth of understanding, communication clarity | "Explain how attention works and when it is a bottleneck" |
| Debugging | Systematic diagnosis, production reasoning | "This RAG system is returning hallucinated answers. What do you check?" |
| Trade-off Analysis | Judgment, awareness of real-world constraints | "When would you choose fine-tuning over RAG?" |
Most AI engineering interviews are heavy on system design and trade-off analysis at the senior level. Coding interviews for AI roles lean toward evaluation code, data pipelines, and simple retrieval logic — not algorithm puzzles.
System Design: The RADSS Framework
RADSS is a five-step framework for answering LLM system design questions in a structured and complete way:
| Step | Focus | Output |
|---|---|---|
| R — Requirements | Clarify functional and non-functional requirements | "What volume? What latency SLO? What domain?" |
| A — Architecture | Sketch the high-level component diagram | Ingestion → Retrieval → Generation → Response |
| D — Data | Discuss data sources, chunking, indexing, updates | Index freshness, chunk strategy, embedding model |
| S — Scale | Address volume, concurrency, and reliability concerns | Caching, async processing, replication |
| S — Safety | Cover guardrails, content moderation, eval gates | Hallucination detection, PII handling, output filters |
Always start with Requirements. Interviewers penalise candidates who jump to architecture before clarifying scope. A good clarifying question demonstrates senior thinking; a missing one signals that you will build the wrong thing.
Concept Explanation Pattern
Use this four-step structure for any concept explanation question:
- Define — one sentence, plain language
- Example — one concrete example from production or practice
- Tradeoff — what does this approach sacrifice?
- Production consideration — what would change at scale or in a real system?
This structure takes 60–90 seconds and covers everything an interviewer is scoring.
Coding Interviews for AI Roles
AI engineering coding interviews test:
- Data processing — chunking, cleaning, batching documents
- Evaluation code — computing recall@K, precision, RAGAS-style metrics
- Pipeline logic — chaining retrieval, prompt construction, and parsing
- Simple retrieval — cosine similarity search, reranking logic
They do not typically test: graph algorithms, dynamic programming, or tree traversal. If you encounter those, the role is likely ML infrastructure, not applied AI.
Debugging LLM Systems
A systematic debugging approach for a RAG system returning bad results:
1. Check the query — is the question well-formed? Out of domain?
2. Check retrieval — run the query against the index. Are relevant chunks in the top-K?
3. Check chunk quality — are chunks well-formed? Do they contain the answer?
4. Check the prompt — is the context being injected correctly? Is the instruction clear?
5. Check the model — is the model version correct? Is temperature appropriate?
6. Check the output — is the failure a retrieval miss or a generation error?
Always separate retrieval failures from generation failures. They have different root causes and different fixes.
🗺️ Strategy Map
⚡ Key Facts — Quick Revision Table
| Concept | One-Line Definition | Why It Matters |
|---|---|---|
| RADSS | Requirements, Architecture, Data, Scale, Safety — system design framework | Ensures completeness in LLM system design answers |
| Five question categories | System design, coding, concept explanation, debugging, trade-off analysis | Each needs a different answer strategy |
| Concept explanation pattern | Define, Example, Tradeoff, Production consideration | Covers everything an interviewer scores in 90 seconds |
| Retrieval failure | Bad results caused by missing or wrong chunks in top-K | First thing to check in a RAG debugging scenario |
| Generation failure | Bad results caused by the LLM misusing correct context | Requires prompt or model-level fix |
| Retrieval confidence filter | Threshold below which the system declines to answer | Prevents hallucination when retrieval quality is low |
| Chunking strategy | How documents are split before indexing | Major driver of retrieval quality |
| Hybrid retrieval | Combining BM25 and dense retrieval with fusion | Handles both lexical and semantic query types |
| Eval gate | Automated quality check that blocks deployment on regression | Standard practice in production AI systems |
| Trade-off question | Question asking you to compare two valid approaches | Tests senior judgment, not just knowledge |
🔬 Deep Dive with Examples
Q1 — System Design: Design a RAG System for Legal Document Q&A
Question: "Design a RAG system for a legal document question-answering use case."
Model Answer:
Requirements (clarify first):
- Expected query volume: 1,000 queries/day initially, scaling to 50,000
- Latency SLO: P95 under 3 seconds
- Domain: contracts, case law, internal legal policies
- Accuracy requirements: high — wrong answers in a legal context have serious consequences
- Update frequency: documents updated weekly
Architecture:
Documents → PDF Extraction → Chunking → Embedding → Vector Store
Query → Embedding → Hybrid Retrieval (BM25 + Dense) → Reranker → Top-5 Chunks
Top-5 Chunks + Query → Prompt Construction → LLM → Output Filter → Response
Data layer:
- Chunk size: 512 tokens with 10% overlap (legal documents have long paragraphs)
- Embedding model: domain-adapted legal embeddings if available, otherwise text-embedding-3-large
- Index: FAISS with HNSW for fast ANN search; separate BM25 index for term matching
Scale:
- Cache frequently repeated queries using a semantic cache (embedding similarity threshold)
- Async ingestion pipeline for document updates with index consistency checks
- Load-balanced serving layer with health checks
Safety:
- Retrieval confidence filter: decline to answer if top chunk score < 0.70
- Output filter: NLI-based hallucination detection for answers in high-stakes categories
- Citation requirement: every answer must cite the source document and section
- Audit log: all queries and responses logged for compliance review
Q2 — Debugging: RAG System Returning Hallucinated Answers
Question: "Here is a RAG system returning hallucinated answers. What would you check?"
Model Answer:
Step 1: Isolate the failure type. Run the query against the retrieval stage in isolation. Are the retrieved chunks relevant? If not, the failure is in retrieval, not generation.
Step 2: Check retrieval quality. Look at the top-5 chunks for the failing query. Do any contain the correct answer? If yes — retrieval is working but generation is ignoring it. If no — retrieval is the root cause.
Step 3: If retrieval is the root cause: Check chunk boundaries — is the answer split across chunks? Check the embedding model — is the query terminology in-domain? Run BM25 as a cross-check — if BM25 retrieves correctly but dense does not, the issue is the embedding model.
Step 4: If generation is the root cause: Check whether the LLM output is contradicting the retrieved context (retrieval contains the answer but the model ignores or contradicts it). This indicates a prompt grounding issue. Add an explicit instruction: "Answer only using facts from the provided context."
Step 5: Add instrumentation. Log the retrieved context alongside every response. Without this, you cannot diagnose the failure type at scale.
Q3 — Evaluation: How Would You Design an Eval System for an LLM Pipeline?
Question: "Walk me through how you would design an evaluation system for an LLM pipeline in production."
Model Answer:
I design evals in three layers: offline, online, and continuous.
Offline: Before any deployment, I build a golden dataset of 200–500 representative queries with human-labelled ground truth answers. I measure retrieval metrics (recall@K, precision@K) separately from generation metrics (faithfulness, answer relevance, RAGAS scores). This dataset is locked before experimentation begins.
Online: Once in production, I add a sampling-based quality monitor that runs the LLM-as-judge evaluation on 5% of live queries. Scores are written to a dashboard. Alert thresholds trigger when any metric drops more than 2% below the deployment baseline.
Continuous: Every pull request that touches prompt files, retrieval logic, or model configuration triggers the full offline eval suite in CI/CD. A regression on any metric blocks the merge. This is the automated eval pipeline pattern.
Q4 — Agent Design: Design an Agent for an Internal IT Help Desk
Question: "Design an LLM agent for an internal IT help desk that can answer questions AND create tickets."
Model Answer:
Architecture: ReAct-style agent with three tools: (1) Knowledge search — RAG over IT documentation, (2) Ticket creation — write to the ticketing API, (3) Ticket status lookup — read from the ticketing API.
Control flow: The agent receives the user query, decides whether a knowledge search is sufficient or whether ticket creation is needed, executes the appropriate tool sequence, and returns a response with a ticket number if created.
Guardrails: Before ticket creation, the agent must confirm intent with the user. Ticket creation is a write operation and requires a confirmation step to prevent accidental duplicates. All tool calls are logged for audit.
Failure modes to address: Infinite loops (if the agent keeps searching without finding an answer — add a max-iterations limit), tool failure (ticket API goes down — add graceful fallback with a manual link), and scope creep (agent tries to solve problems outside IT scope — add a topic classifier as a guardrail before routing to the agent).
Q5 — Trade-off: Fine-tuning vs RAG
Question: "What are the tradeoffs between fine-tuning and RAG?"
Model Answer:
| Dimension | Fine-tuning | RAG |
|---|---|---|
| When to use | Teaching the model new behaviour or style | Grounding answers in specific documents |
| Cost | High — requires training compute and labelled data | Lower — retrieval infrastructure, no training |
| Freshness | Stale — requires retraining to update | Dynamic — update the index, not the model |
| Explainability | Low — reasoning is in weights | High — can cite retrieved source |
| Hallucination risk | Moderate — model may confabulate learned "facts" | Lower — grounded in retrieved context |
| Best for | Code generation style, instruction following, domain vocabulary | Document Q&A, knowledge lookup, citation-required answers |
I use RAG as the default for knowledge-grounding tasks because it is cheaper, fresher, and more explainable. I use fine-tuning when I need to change model behaviour or style at a level that RAG cannot reach — for example, teaching a model to produce a specific output format or to adopt a domain's terminology consistently.
🧪 Practice Drills
Drill 1 — RADSS Drill (30 minutes)
- Pick a system design question: "Design a customer support chatbot for an e-commerce company."
- Set a timer for 5 minutes. Write Requirements only — do not move to Architecture yet.
- After Requirements, sketch Architecture in bullet points (3 minutes).
- Cover Data, Scale, Safety in sequence (2 minutes each).
- Review: did you address all five RADSS components?
Drill 2 — Concept Explanation Speed Drill (20 minutes)
Practice these four topics using the Define → Example → Tradeoff → Production pattern:
- What is embedding similarity?
- What is HNSW?
- What is chain-of-thought prompting?
- What is a semantic cache?
Keep each answer to under 90 seconds.
Drill 3 — Debugging Simulation (20 minutes)
- Imagine a RAG system where 30% of answers for "date-related" queries are wrong.
- Write a systematic debug checklist following the six-step debugging structure.
- What is your hypothesis about the root cause? How would you test it?
💬 Interview Q&A
Design a RAG system for a legal document Q&A use case.
I'd start with requirements: expected query volume, latency SLO, update frequency, and accuracy requirements — legal Q&A has near-zero tolerance for hallucination. For architecture: documents go through PDF extraction, chunking at 512 tokens with overlap, embedding with a domain-adapted model, and storage in a vector index alongside a BM25 index. At query time I run hybrid retrieval — BM25 for exact legal citations, dense for semantics — fused with RRF, then a reranker to get top-5 chunks. The prompt enforces citation: every answer must reference the source document and section. For safety I add a retrieval confidence filter — if similarity is below 0.70 the system says "I cannot find a reliable answer" rather than hallucinating. For scale I add query-level semantic caching since legal users often ask similar jurisdiction questions. All queries are logged for compliance audit.
How would you debug a production LLM system returning hallucinated answers?
I first isolate whether the failure is in retrieval or generation — these have different root causes. I run the failing queries through just the retrieval stage and check whether the correct answer is present in the top-K chunks. If not, I look at chunking — is the answer split across boundaries? — and at embedding coverage — is the terminology in-domain? If retrieval is correct but generation hallucinates, I check the prompt grounding instruction and look for cases where the model is ignoring the context in favour of parametric memory. The fix for generation hallucination is usually adding explicit grounding instructions and a retrieval confidence filter. I also instrument the pipeline to log retrieved context alongside every response — without that data, reproducing and categorising failures at scale is very difficult.
What are the tradeoffs between fine-tuning and RAG?
The core tradeoff is between behaviour change and knowledge grounding. RAG is better for grounding answers in specific, updateable documents — it is cheaper, fresher, and more explainable because you can cite the retrieved source. Fine-tuning is better for changing how the model responds — style, format, domain vocabulary, instruction-following patterns. Fine-tuning is expensive in training compute and labelling cost, and it becomes stale the moment your knowledge base changes. I default to RAG for knowledge tasks and only consider fine-tuning when I need something that RAG cannot provide: consistent output format, domain terminology adoption at the vocabulary level, or instruction-following improvements that prompting alone cannot achieve. In practice, the best production systems often combine both — a fine-tuned model served with a RAG retrieval layer.
✅ End-of-Day Checklist
| Item | Status |
|---|---|
| Five question categories reviewed and understood | ☐ |
| RADSS framework practised on one system design question | ☐ |
| Concept explanation pattern drilled on 4 topics | ☐ |
| Debugging checklist written for a RAG failure scenario | ☐ |
| Fine-tuning vs RAG tradeoff table reviewed | ☐ |
| Drafted answers for all three interview Qs above | ☐ |