01.01 · How LLMs Work — Deep Dive

Level: Intermediate
Pre-reading: 01 · AI & LLM Foundations


The Transformer Architecture

Every modern LLM is built on the Transformer architecture (Vaswani et al., 2017). The key innovation is self-attention — every token in the input can "attend to" every other token, allowing the model to capture long-range dependencies.

graph TD
    A[Input Tokens] --> B[Token Embeddings]
    B --> C[Positional Encoding]
    C --> D[Multi-Head Self-Attention]
    D --> E[Feed-Forward Network]
    E --> F[Layer Norm]
    F --> D2[Repeat N times]
    D2 --> G[Output Logits]
    G --> H[Softmax · Next Token Probabilities]

Component | Role
Token Embeddings | Maps each token ID to a dense vector in high-dimensional space
Positional Encoding | Encodes each token's position in the sequence (order matters)
Self-Attention | Each token weighs all other tokens' relevance to itself
Feed-Forward Network | Per-token transformation applied after attention
Layer Norm | Stabilizes training; applied before or after each sub-layer
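
To make self-attention concrete, below is a minimal NumPy sketch of single-head scaled dot-product attention. The sizes and weight matrices are illustrative, not taken from any real model; multi-head attention simply runs several of these in parallel and concatenates the results.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each token embedding into query, key and value vectors
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token scores every other token (the "attend to everything" step)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)             # rows sum to 1: per-token attention weights
    return weights @ V                             # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 tokens, embedding dimension 8 (illustrative)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (4, 8): one contextualized vector per token

Decoder-only LLMs additionally apply a causal mask to the scores so each token attends only to earlier positions.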

Context Window in Practice

The context window is the maximum number of tokens the model can process in a single forward pass — inputs + outputs combined.

Model | Context Window | Practical Implication
GPT-4o | 128K tokens | ~100K words — an entire Spring Boot service
Claude 3.5 Sonnet | 200K tokens | ~150K words — an entire module with tests
Gemini 1.5 Pro | 1M tokens | An entire small codebase
LLaMA 3 70B | 8K–128K tokens | Varies by deployment
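
Because the window covers input and output together, it pays to budget both before a call. A rough sketch, assuming the tiktoken library and an illustrative 128K window:

import tiktoken

CONTEXT_WINDOW = 128_000                     # illustrative GPT-4o-class limit
OUTPUT_BUDGET = 4_000                        # tokens reserved for the model's answer

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer choice is illustrative

def fits_in_window(prompt: str) -> bool:
    # The prompt plus the reserved output budget must fit in one forward pass
    return len(enc.encode(prompt)) + OUTPUT_BUDGET <= CONTEXT_WINDOW

print(fits_in_window("Summarize this Spring Boot service ..."))   # True for short prompts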

Context ≠ Memory

Sending a long context costs tokens on every request. It does NOT persist between calls. For persistent memory across sessions, you need an external store — see 03 · RAG.


Tokens Explained

graph LR
    A["'Hello, world!'"] --> B["['Hello', ',', ' world', '!']"]
    B --> C["[15496, 11, 995, 0]"]

• English text: roughly ¾ of a word per token (about 4 characters per token)
• Code: more tokens per line (special characters, indentation)
• @SpringBootApplication ≈ 4–6 tokens

Practical rule of thumb: 1,000 tokens ≈ 750 words ≈ 30–40 lines of Java code.
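
The exact split and IDs depend on the tokenizer. A quick way to check real counts is the tiktoken library; the GPT-2 encoding used here happens to reproduce the IDs in the diagram above:

import tiktoken

enc = tiktoken.get_encoding("gpt2")               # tokenizer choice is illustrative
tokens = enc.encode("Hello, world!")
print(tokens)                                     # [15496, 11, 995, 0]
print([enc.decode([t]) for t in tokens])          # ['Hello', ',', ' world', '!']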


Temperature and Sampling

Setting | Effect | Use Case
temperature=0.0 | Fully deterministic (greedy) | Code generation, JSON output
temperature=0.2 | Mostly deterministic, minor variation | Tests, structured data
temperature=0.7 | Creative, diverse | Documentation, explanations
temperature=1.0+ | High randomness | Creative writing
top_p=0.9 | Nucleus sampling — only the top 90% of probability mass | Balances diversity and coherence
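
Temperature simply rescales the logits before the softmax that produces next-token probabilities. A small NumPy sketch with made-up logits for three candidate tokens:

import numpy as np

def token_probs(logits, temperature):
    z = np.array(logits) / max(temperature, 1e-6)  # temperature=0 degenerates to greedy argmax
    z = z - z.max()                                # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 3.5, 1.0]                           # made-up scores for three candidate tokens
for t in (0.0, 0.2, 0.7, 1.0):
    print(t, np.round(token_probs(logits, t), 3))
# Low temperature concentrates probability on the top token; high temperature flattens the distribution.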

Agent Configuration

For the JIRA→PR agent, use temperature=0 for code changes and temperature=0.3 for RCA explanations. Deterministic code generation is reproducible and reviewable.
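
As a sketch of how these settings map onto an API call, here the OpenAI Python SDK is used for illustration; the model name and prompts are placeholders:

from openai import OpenAI

client = OpenAI()                                  # reads OPENAI_API_KEY from the environment

def generate(prompt: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",                            # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

patch = generate("Produce the code change for this ticket ...", temperature=0.0)     # deterministic
rca = generate("Explain the likely root cause of this failure ...", temperature=0.3) # slight variation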


Function Calling / Tool Use

Modern LLMs expose a structured function calling interface. Instead of embedding tool instructions in text, you provide a JSON schema of available tools and the model outputs a structured invocation.

sequenceDiagram
    participant App
    participant LLM
    participant Tool
    App->>LLM: Prompt + tool schemas
    LLM-->>App: { "tool": "read_file", "args": {"path": "..."} }
    App->>Tool: Execute read_file(path)
    Tool-->>App: File contents
    App->>LLM: Prompt + tool result
    LLM-->>App: Final answer

This is the foundation of all agentic systems — the LLM decides which tool to call and with what arguments, but never executes anything directly.
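
A sketch of that loop with the OpenAI Python SDK; the read_file tool, its schema, the model name and the prompt are all illustrative:

import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the repository",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "What does the service's main config file set up?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:                                   # the model asked for a tool; it did not run it
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)       # structured arguments, e.g. {"path": "..."}
    result = open(args["path"]).read()               # the application executes the tool
    messages.append(msg)                             # keep the assistant's tool-call turn
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)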


What is the difference between temperature=0 and top_p=0?

Both drive the model toward deterministic output, but via different mechanisms. temperature=0 is the limiting case of rescaling the logits so that the highest-probability token dominates; implementations typically treat it as greedy argmax decoding. top_p=0 restricts the sampling pool to the single most likely token, which is effectively the same greedy behavior. In practice, temperature=0 is the standard choice for reproducible outputs.

Why does GPT-4 sometimes give different answers for the same prompt?

Even at temperature=0, system-level non-determinism (floating point rounding, parallel GPU computation) can cause minor output variation. For true reproducibility, capture and cache the first response.
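
One way to do that capture is a simple hash-keyed cache; the generate function below is a placeholder for whatever client call you already use:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_generate(prompt: str, generate) -> str:
    # Return the first response ever produced for this prompt, ignoring later variation
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = generate(prompt)                      # first (and only) real model call
    path.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response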

How does function calling differ from just asking the model to return JSON?

Native function calling gives the model an explicit JSON schema for each tool and lets it signal "I need to call a tool" as a distinct response type — not just free text that happens to look like JSON. This allows the runtime (LangGraph, LangChain, etc.) to intercept tool calls, validate the arguments against the schema, and execute them automatically.