01.01 · How LLMs Work — Deep Dive

Level: Intermediate
Pre-reading: 01 · AI & LLM Foundations


The Transformer Architecture

Every modern LLM is built on the Transformer architecture (Vaswani et al., 2017). The key innovation is self-attention — every token in the input can "attend to" every other token, allowing the model to capture long-range dependencies.

graph TD
    A[Input Tokens] --> B[Token Embeddings]
    B --> C[Positional Encoding]
    C --> D[Multi-Head Self-Attention]
    D --> E[Feed-Forward Network]
    E --> F[Layer Norm]
    F --> D2[Repeat N times]
    D2 --> G[Output Logits]
    G --> H[Softmax · Next Token Probabilities]

Component | Role
Token Embeddings | Maps each token ID to a dense vector in high-dimensional space
Positional Encoding | Encodes each token's position in the sequence (order matters)
Self-Attention | Each token weighs all other tokens' relevance to itself
Feed-Forward Network | Per-token transformation applied after attention
Layer Norm | Stabilizes training; applied before or after each sub-layer
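
To make self-attention concrete, below is a minimal NumPy sketch of single-head scaled dot-product attention. The sizes and weight matrices are illustrative, not taken from any real model; multi-head attention simply runs several of these in parallel and concatenates the results.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each token embedding into query, key and value vectors
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token scores every other token (the "attend to everything" step)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)             # rows sum to 1: per-token attention weights
    return weights @ V                             # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 tokens, embedding dimension 8 (illustrative)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (4, 8): one contextualized vector per token

Decoder-only LLMs additionally apply a causal mask to the scores so each token attends only to earlier positions.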

Context Window in Practice

The context window is the maximum number of tokens the model can process in a single forward pass — inputs + outputs combined.

Model | Context Window | Practical Implication
GPT-4o | 128K tokens | ~100K words — an entire Spring Boot service
Claude 3.5 Sonnet | 200K tokens | ~150K words — an entire module with tests
Gemini 1.5 Pro | 1M tokens | An entire small codebase
LLaMA 3 70B | 8K–128K tokens | Varies by deployment
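
Because the window covers input and output together, it pays to budget both before a call. A rough sketch, assuming the tiktoken library and an illustrative 128K window:

import tiktoken

CONTEXT_WINDOW = 128_000                     # illustrative GPT-4o-class limit
OUTPUT_BUDGET = 4_000                        # tokens reserved for the model's answer

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer choice is illustrative

def fits_in_window(prompt: str) -> bool:
    # The prompt plus the reserved output budget must fit in one forward pass
    return len(enc.encode(prompt)) + OUTPUT_BUDGET <= CONTEXT_WINDOW

print(fits_in_window("Summarize this Spring Boot service ..."))   # True for short prompts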

Context ≠ Memory

Sending a long context costs tokens on every request. It does NOT persist between calls. For persistent memory across sessions, you need an external store — see 03 · RAG.


Tokens Explained

graph LR
    A["'Hello, world!'"] --> B["['Hello', ',', ' world', '!']"]
    B --> C["[15496, 11, 995, 0]"]

• English text: roughly ¾ of a word per token (about 4 characters per token)
• Code: more tokens per line (special characters, indentation)
• @SpringBootApplication ≈ 4–6 tokens

Practical rule of thumb: 1,000 tokens ≈ 750 words ≈ 30–40 lines of Java code.
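
The exact split and IDs depend on the tokenizer. A quick way to check real counts is the tiktoken library; the GPT-2 encoding used here happens to reproduce the IDs in the diagram above:

import tiktoken

enc = tiktoken.get_encoding("gpt2")               # tokenizer choice is illustrative
tokens = enc.encode("Hello, world!")
print(tokens)                                     # [15496, 11, 995, 0]
print([enc.decode([t]) for t in tokens])          # ['Hello', ',', ' world', '!']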


Temperature and Sampling

Setting | Effect | Use Case
temperature=0.0 | Fully deterministic (greedy) | Code generation, JSON output
temperature=0.2 | Mostly deterministic, minor variation | Tests, structured data
temperature=0.7 | Creative, diverse | Documentation, explanations
temperature=1.0+ | High randomness | Creative writing
top_p=0.9 | Nucleus sampling — only the top 90% of probability mass | Balances diversity and coherence
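
Temperature simply rescales the logits before the softmax that produces next-token probabilities. A small NumPy sketch with made-up logits for three candidate tokens:

import numpy as np

def token_probs(logits, temperature):
    z = np.array(logits) / max(temperature, 1e-6)  # temperature=0 degenerates to greedy argmax
    z = z - z.max()                                # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 3.5, 1.0]                           # made-up scores for three candidate tokens
for t in (0.0, 0.2, 0.7, 1.0):
    print(t, np.round(token_probs(logits, t), 3))
# Low temperature concentrates probability on the top token; high temperature flattens the distribution.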

Agent Configuration

For the JIRA→PR agent, use temperature=0 for code changes and temperature=0.3 for RCA explanations. Deterministic code generation is reproducible and reviewable.
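
As a sketch of how these settings map onto an API call, here the OpenAI Python SDK is used for illustration; the model name and prompts are placeholders:

from openai import OpenAI

client = OpenAI()                                  # reads OPENAI_API_KEY from the environment

def generate(prompt: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",                            # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

patch = generate("Produce the code change for this ticket ...", temperature=0.0)     # deterministic
rca = generate("Explain the likely root cause of this failure ...", temperature=0.3) # slight variation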


Function Calling / Tool Use

Modern LLMs expose a structured function calling interface. Instead of embedding tool instructions in text, you provide a JSON schema of available tools and the model outputs a structured invocation.

sequenceDiagram
    participant App
    participant LLM
    participant Tool
    App->>LLM: Prompt + tool schemas
    LLM-->>App: { "tool": "read_file", "args": {"path": "..."} }
    App->>Tool: Execute read_file(path)
    Tool-->>App: File contents
    App->>LLM: Prompt + tool result
    LLM-->>App: Final answer

This is the foundation of all agentic systems — the LLM decides which tool to call and with what arguments, but never executes anything directly.
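
A sketch of that loop with the OpenAI Python SDK; the read_file tool, its schema, the model name and the prompt are all illustrative:

import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the repository",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "What does the service's main config file set up?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:                                   # the model asked for a tool; it did not run it
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)       # structured arguments, e.g. {"path": "..."}
    result = open(args["path"]).read()               # the application executes the tool
    messages.append(msg)                             # keep the assistant's tool-call turn
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)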


What is the difference between temperature=0 and top_p=0?

Both drive the model toward deterministic output, but via different mechanisms. temperature=0 is the limiting case of rescaling the logits so that the highest-probability token dominates; implementations typically treat it as greedy argmax decoding. top_p=0 restricts the sampling pool to the single most likely token, which is effectively the same greedy behavior. In practice, temperature=0 is the standard choice for reproducible outputs.

Why does GPT-4 sometimes give different answers for the same prompt?

Even at temperature=0, system-level non-determinism (floating point rounding, parallel GPU computation) can cause minor output variation. For true reproducibility, capture and cache the first response.
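
One way to do that capture is a simple hash-keyed cache; the generate function below is a placeholder for whatever client call you already use:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_generate(prompt: str, generate) -> str:
    # Return the first response ever produced for this prompt, ignoring later variation
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = generate(prompt)                      # first (and only) real model call
    path.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response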

How does function calling differ from just asking the model to return JSON?

Native function calling gives the model an explicit JSON schema for each tool and lets it signal "I need to call a tool" as a distinct response type — not just free text that happens to look like JSON. This allows the runtime (LangGraph, LangChain, etc.) to intercept tool calls, validate the arguments against the schema, and execute them automatically.