Day 17 — Observability and Tracing — Learn & Revise

Pre-reading: Week 3 Overview · Learning Plan

🎯 What You'll Master Today

When an LLM feature behaves unexpectedly in production, you need to know exactly where in the pipeline the failure occurred — retrieval, prompt construction, model inference, or post-processing. Today you will learn the three pillars of observability, how to instrument an LLM pipeline with traces and spans, and which tools (LangSmith, Arize Phoenix, Langfuse) make this practical. By the end you will be able to debug a slow or incorrect LLM response by reading a trace.

📖 Core Concepts

The Three Pillars of Observability

Pillar	What it captures	LLM example
Logs	Discrete events with timestamps and context	"Query received: 'What is RAG?' at 14:03:22, user_id=abc123"
Metrics	Numeric aggregates over time	P95 latency = 3.2s, token usage = 1,240/request, error rate = 0.4%
Traces	End-to-end request journeys broken into spans	User input → retrieval (210ms) → prompt build (12ms) → model (1,840ms) → post-process (45ms)

Logs are rich but hard to aggregate. Metrics are easy to alert on but hide root causes. Traces are the most powerful for LLM debugging — they show you exactly which step was slow or wrong.

LLM-Specific Tracing

A trace is a tree of spans, each representing one unit of work. For a RAG request the span tree looks like:

Root span: handle_query (2,107ms)
  ├── retrieve_documents (210ms)
  │     retrieval_score: 0.82, chunks_returned: 3
  ├── build_prompt (12ms)
  │     prompt_tokens: 1,240
  ├── model_inference (1,840ms)
  │     model: gpt-4o, completion_tokens: 187, finish_reason: stop
  └── post_process (45ms)
        guardrail_pass: true

Tracing lets you answer: "Was the model slow, or was retrieval slow?" and "What exact prompt did the model receive?"

What to Instrument

Every production LLM pipeline should instrument the following on every request:

Signal	Where to capture	Why
Latency per step	Each span	Identifies bottleneck step (model vs retrieval)
Token usage	Model span	Drives cost attribution and context-length alerts
Retrieval scores	Retrieval span	Low scores indicate poor query-document match
Model metadata	Model span	Model name, temperature, finish reason, stop reason
Input/output	Root span	Required for debugging incorrect answers
Error type	Any span	Rate-limit errors vs timeout vs validation errors

Be careful with input/output logging

Logging full prompts and answers creates PII risk. Use field-level redaction on user inputs before writing to your trace store, or use a tool like Presidio to scrub PII before logging.

LangSmith — Tracing for LangChain Apps

LangSmith is LangChain's hosted observability platform. It automatically traces every LangChain and LangGraph run when you set two environment variables. Each run appears in the LangSmith UI as a tree of steps with latency, token counts, inputs, and outputs at each node.

What LangSmith shows:

Full chain trace with every LLM call, tool call, and retrieval step
Token usage and cost per step
Input/output at every node (including intermediate chain steps)
Run comparison — diff two runs side by side
Feedback annotations — attach thumbs-up/down from your UI

Setup:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls-..."
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"
# All subsequent LangChain calls are automatically traced

Tool Comparison — LangSmith vs Arize Phoenix vs Langfuse

Feature	LangSmith	Arize Phoenix	Langfuse
Host	Managed (LangChain)	Local / managed	Self-hosted / managed
Framework integration	LangChain, LangGraph native	OpenTelemetry, generic	SDK for any framework
LLM eval built-in	Yes (RAGAS-style)	Yes	Yes
Dataset management	Yes	Yes	Yes
Cost	Paid tiers; free dev plan	Open source	Open source / paid cloud
Best for	LangChain-heavy stacks	Vendor-agnostic, local dev	Self-hosted, data privacy

Use LangSmith if your stack is primarily LangChain. Use Langfuse if you want full data ownership or a non-LangChain stack. Use Arize Phoenix for local debugging without any cloud dependency.

Alerts and Dashboards

Alert on signals that indicate user-facing harm, not just system health:

Metric	Alert condition	What it means
Error rate	> 1% over 5 minutes	Model calls failing — users get no answer
P95 latency	> 5s for 5 minutes	Responses too slow — users abandon
Token usage per request	> 90% of context window	Risk of truncation and quality degradation
Hallucination proxy	Faithfulness < 0.7 (sampled)	Model fabricating answers — highest risk
Retrieval score	Mean score < 0.5 over 100 requests	Index degraded or query distribution shifted

The hallucination proxy is the hardest to alert on in real-time because scoring requires an LLM judge. A practical approach is to score a 1% sample continuously and alert if the rolling 1-hour average drops below threshold.

🗺️ Architecture / How It Works

graph TD A["User Request"]:::blue --> B["Root Span\nhandle_query"]:::blue B --> C["Retrieval Span\nchunks · scores · latency"]:::orange B --> D["Prompt Span\ntoken count · template"]:::orange B --> E["Model Span\nmodel · tokens · finish reason"]:::orange B --> F["Post-Process Span\nguardrail · redaction"]:::orange C --> G["Trace Store\nLangSmith · Langfuse · Phoenix"]:::green D --> G E --> G F --> G G --> H["Dashboards\np95 · error rate · token cost"]:::green G --> I["Alerts\nlatency · hallucination proxy"]:::red classDef blue fill:#1976d2,color:#fff classDef orange fill:#ff9800,color:#fff classDef green fill:#2e7d32,color:#fff classDef red fill:#c62828,color:#fff

⚡ Key Facts — Quick Revision Table

Concept	One-Line Definition	Why It Matters
Trace	End-to-end record of one request broken into spans	Pinpoints which step caused a failure
Span	One unit of work within a trace (retrieval, model, etc.)	Carries latency, inputs, outputs, metadata
P95 latency	The 95th-percentile response time	Better SLO signal than mean; catches tail slowness
LangSmith	LangChain's managed tracing platform	Zero-config tracing for LangChain stacks
Langfuse	Open-source observability for any LLM stack	Best for self-hosted or non-LangChain stacks
Arize Phoenix	Local-first LLM observability	No cloud dependency; great for local dev
Hallucination proxy	Sampled faithfulness score from production	Real-time quality signal without full eval
Token usage alert	Alert when usage approaches context limit	Prevents silent truncation and quality drop
Retrieval score	Similarity score of retrieved chunks	Low score = poor retrieval = likely bad answer
Error rate alert	Alert when model call failure rate exceeds threshold	Direct user-facing availability signal

🔬 Deep Dive

LangSmith Tracing + Custom Span Annotation

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client, traceable
from langsmith.run_helpers import get_current_run_tree

# Enable tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls-your-key-here"
os.environ["LANGCHAIN_PROJECT"] = "rag-production"

client = Client()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


@traceable(name="retrieve_documents")
def retrieve_documents(query: str) -> list[str]:
    """Stub retrieval — replace with real vector DB call."""
    # Attach custom metadata to the current span
    run = get_current_run_tree()
    if run:
        run.extra = {
            "retrieval_scores": [0.91, 0.87, 0.74],
            "index_name": "prod-knowledge-base",
        }
    return [
        "RAG combines retrieval with generation to ground LLM outputs.",
        "Faithfulness measures whether the answer only uses retrieved facts.",
        "Context precision measures what fraction of retrieved chunks were relevant.",
    ]


@traceable(name="generate_answer")
def generate_answer(question: str, contexts: list[str]) -> str:
    context_str = "\n\n".join(contexts)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using only the provided context. Say 'I don't know' if the context is insufficient."),
        ("human", "Context:\n{context}\n\nQuestion: {question}"),
    ])
    chain = prompt | llm
    response = chain.invoke({"context": context_str, "question": question})
    return response.content


@traceable(name="handle_query", run_type="chain")
def handle_query(query: str) -> str:
    contexts = retrieve_documents(query)
    answer = generate_answer(query, contexts)

    # Attach outcome metadata to root span
    run = get_current_run_tree()
    if run:
        run.extra = {
            "answer_length_tokens": len(answer.split()),
            "contexts_used": len(contexts),
        }
    return answer


# Run and observe in LangSmith UI
if __name__ == "__main__":
    result = handle_query("What is faithfulness in RAG evaluation?")
    print(result)
    print("\nView trace at: https://smith.langchain.com")

Reading the trace: Open LangSmith → your project → latest run. You will see the handle_query root span with two child spans: retrieve_documents and generate_answer. Each span shows latency, token usage, and your custom metadata (retrieval scores, index name).

🧪 Practice Drills

Drill 1 — Instrument Your RAG Pipeline

Add LANGCHAIN_TRACING_V2=true and your LangSmith API key to your environment.
Wrap each stage of your RAG pipeline in a @traceable decorator (retrieve, build_prompt, generate).
Run 5 queries and open LangSmith — verify all spans appear with correct latencies.
Identify your slowest span. Is it retrieval or model inference?

Drill 2 — Build a Latency Dashboard

In LangSmith, open your project and switch to the "Metrics" view.
Set up a chart: P95 latency by day, filtered to run_type=chain.
Identify if latency has a pattern (time of day, query length correlation).
Set a P95 alert threshold of 5 seconds.

Drill 3 — Hallucination Proxy Monitoring

After running 20+ queries, export the run data from LangSmith as JSON.
For a 10% sample, manually check faithfulness (does the answer only use retrieved context?).
Record your spot-check faithfulness rate.
Document: "My production faithfulness proxy is X% based on manual spot-check of N samples."

💬 Interview Q&A

How would you trace a slow LLM request to find the bottleneck?

I would open the trace for the slow request in LangSmith (or Langfuse if self-hosted) and look at the span tree. Each span shows its own latency, so I can immediately see whether the bottleneck is retrieval, prompt construction, model inference, or post-processing. Retrieval slowness usually points to a slow vector DB query — I would check index health and query complexity. Model inference slowness usually means a large context window or high load on the inference endpoint — I would check token counts and move to a faster model for latency-sensitive queries. If it is the first time I am seeing slowness, I compare the trace to the P95 baseline to confirm it is an outlier.

What is LangSmith and what does it show?

LangSmith is LangChain's observability platform. When you set LANGCHAIN_TRACING_V2=true, every LangChain run is automatically traced and sent to LangSmith. You can see the full span tree for each request — every LLM call, tool invocation, retrieval step — with latency, token counts, inputs, and outputs at each node. It also supports dataset management, run comparisons, and human feedback annotations. For non-LangChain stacks, Langfuse and Arize Phoenix provide equivalent functionality via OpenInference or OpenTelemetry instrumentation.

What metrics would you alert on for a production RAG system?

My primary alert set is: error rate above 1% over 5 minutes (model calls failing), P95 latency above 5 seconds for 5 minutes (users experiencing timeouts), and token usage above 90% of context window on more than 10% of requests (risk of truncation). For quality, I run a continuous 1% sample through a faithfulness judge and alert if the rolling hourly average drops below 0.7 — this is my hallucination proxy. I also alert on retrieval score dropping below 0.5, which indicates index degradation or a query distribution shift.

✅ End-of-Day Checklist

Item	Status
Can explain logs, metrics, and traces and what each captures for LLMs	☐
Can describe the span tree for a RAG request	☐
Can explain what LangSmith shows and how to set it up	☐
Can compare LangSmith, Langfuse, and Arize Phoenix	☐
Can list 5 metrics to alert on in production	☐
LangSmith tracing working locally with custom span annotations	☐
Latency dashboard or chart created	☐
All 3 interview answers rehearsed out loud	☐