04.02 · State Machines & Workflows

Level: Advanced
Pre-reading: 04 · LangGraph · 04.01 · LangGraph Deep Dive

Agent as a State Machine

Every agentic workflow is fundamentally a state machine: a system with explicit states, and transitions between them that are triggered by conditions or events.

stateDiagram-v2
    [*] --> ReadingTicket
    ReadingTicket --> IdentifyingService
    IdentifyingService --> RetrievingCode
    RetrievingCode --> Analyzing
    Analyzing --> Generating: confident
    Analyzing --> RetrievingCode: need more context
    Generating --> AwaitingApproval
    AwaitingApproval --> CreatingPR: approved
    AwaitingApproval --> Generating: rejected with feedback
    CreatingPR --> [*]

Modelling your agent as a state machine first makes the LangGraph implementation obvious.
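As a rough illustration, the diagram above can be written down as a plain transition table before any graph library is involved. This is a minimal sketch, not a LangGraph API: the `State` enum, the `step` helper, and the event names (including `"next"` for the unlabelled arrows) are assumptions made for this example.

```python
from enum import Enum, auto


class State(Enum):
    READING_TICKET = auto()
    IDENTIFYING_SERVICE = auto()
    RETRIEVING_CODE = auto()
    ANALYZING = auto()
    GENERATING = auto()
    AWAITING_APPROVAL = auto()
    CREATING_PR = auto()
    DONE = auto()


# (current state, event) -> next state.
# Event names are hypothetical; "next" stands for the unlabelled arrows.
TRANSITIONS = {
    (State.READING_TICKET, "next"): State.IDENTIFYING_SERVICE,
    (State.IDENTIFYING_SERVICE, "next"): State.RETRIEVING_CODE,
    (State.RETRIEVING_CODE, "next"): State.ANALYZING,
    (State.ANALYZING, "confident"): State.GENERATING,
    (State.ANALYZING, "need_more_context"): State.RETRIEVING_CODE,
    (State.GENERATING, "next"): State.AWAITING_APPROVAL,
    (State.AWAITING_APPROVAL, "approved"): State.CREATING_PR,
    (State.AWAITING_APPROVAL, "rejected"): State.GENERATING,
    (State.CREATING_PR, "next"): State.DONE,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise if the diagram has no such arrow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None
```

Each key in `TRANSITIONS` corresponds to one arrow in the diagram, so a missing or surprising arrow shows up as a missing or surprising dictionary entry.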

Workflow Patterns

Sequential Workflow

Each node completes before the next starts. Simple, predictable, easy to debug.

graph LR
    A[Read Ticket] --> B[Identify Service] --> C[Retrieve Code] --> D[Generate Fix] --> E[Create PR]
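A sequential workflow reduces to a loop over node functions that each take and return the shared state. The sketch below assumes placeholder nodes that only record their own name; `make_node`, `PIPELINE`, and `run_sequential` are hypothetical names, not part of any library.

```python
from typing import Callable

Node = Callable[[dict], dict]


def make_node(name: str) -> Node:
    """Build a placeholder node that only records that it ran."""
    def node(state: dict) -> dict:
        # Return a new dict instead of mutating, so each step is easy to inspect.
        return {**state, "trace": state["trace"] + [name]}
    return node


PIPELINE = [
    make_node(name)
    for name in ("read_ticket", "identify_service", "retrieve_code",
                 "generate_fix", "create_pr")
]


def run_sequential(nodes: list[Node], state: dict) -> dict:
    for node in nodes:
        state = node(state)  # each node finishes before the next one starts
    return state
```

Calling `run_sequential(PIPELINE, {"trace": []})` returns a state whose `trace` lists the nodes in pipeline order, which is what makes this pattern easy to debug.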

Parallel Workflow

Independent tasks run concurrently. Reduces total latency.

graph LR
    A[Analyse Bug] --> B[Generate Fix]
    A --> C[Write Test]
    B --> D[Merge results]
    C --> D
    D --> E[Create PR]
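The fan-out and merge in this diagram can be sketched with the standard library's thread pool. This assumes the two branches are independent and I/O-bound (as LLM or API calls usually are); `generate_fix`, `write_test`, and `run_parallel` are placeholder names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_fix(analysis: str) -> str:
    return f"fix for {analysis}"  # placeholder for a real LLM call


def write_test(analysis: str) -> str:
    return f"test for {analysis}"  # placeholder for a real LLM call


def run_parallel(analysis: str) -> dict[str, str]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        fix_future = pool.submit(generate_fix, analysis)
        test_future = pool.submit(write_test, analysis)
        # result() blocks, so the merge only happens once both branches finish.
        return {"fix": fix_future.result(), "test": test_future.result()}
```

Total latency is then roughly that of the slower branch rather than the sum of both.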

Map-Reduce Workflow

Fan out to process many items; reduce results into a single output.

graph LR
    A[10 failing Playwright tests] --> B[Spawn 10 analysis agents in parallel]
    B --> C[Aggregate: common root causes]
    C --> D[Generate single RCA document]
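A minimal map-reduce sketch, assuming each failure can be classified independently: the map step labels every log, and the reduce step counts labels to surface common root causes. The keyword-based `analyse_failure` is a stand-in for a real analysis agent, and the category names are invented for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def analyse_failure(log: str) -> str:
    """Map step: classify one failure. A real agent would call an LLM here."""
    text = log.lower()
    if "timeout" in text:
        return "timeout"
    if "selector" in text:
        return "stale selector"
    return "unknown"


def common_root_causes(logs: list[str]) -> list[tuple[str, int]]:
    # Fan out: one worker per log, capped at 10 concurrent analyses.
    with ThreadPoolExecutor(max_workers=10) as pool:
        causes = list(pool.map(analyse_failure, logs))
    # Reduce: collapse the per-test labels into counts, most frequent first.
    return Counter(causes).most_common()
```

The reduced list is the input for the final step, where a single RCA document is generated from the aggregated causes instead of from ten separate reports.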

Event-Driven Workflow

Agent is triggered by external events rather than a direct call.

graph LR
    A[CI webhook: tests failed] --> B[Agent triggered]
    B --> C[Read failure report]
    C --> D[Analyse and fix]
    D --> E[Open MR]
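One way to sketch the trigger is a dispatch table from event type to workflow. The event type `ci.tests_failed`, the payload shape, and the handler names below are assumptions for illustration; a real CI system defines its own webhook schema.

```python
from typing import Callable


def run_fix_workflow(payload: dict) -> list[str]:
    """Return the planned steps; a real handler would start the agent instead."""
    return [
        f"read_failure_report:{payload['report_url']}",
        "analyse_and_fix",
        "open_mr",
    ]


# Hypothetical event type -> handler mapping.
HANDLERS: dict[str, Callable[[dict], list[str]]] = {
    "ci.tests_failed": run_fix_workflow,
}


def on_webhook(event: dict) -> list[str]:
    handler = HANDLERS.get(event.get("type", ""))
    if handler is None:
        return []  # events the agent does not care about are ignored
    return handler(event.get("payload", {}))
```

Keeping the mapping in data means new triggers can be added without touching the webhook entry point.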

Long-Running Workflows

Some JIRA tickets require hours of agent work. Design for interruption:

Concern	Solution
Server restart	Checkpointed state in PostgreSQL
Token budget exceeded mid-run	State snapshots; resume from the last checkpoint
Waiting on an external event	Call `interrupt()` and resume when the webhook arrives (e.g., CI build completes)
Human feedback latency	Async interrupt; the agent resumes when the developer clicks approve

Idempotency and Retries

Rule	Why
All tool calls should be idempotent	Retrying a failed step shouldn't create duplicate PRs
Use unique IDs for all resources created	Including the JIRA ticket ID in the PR description lets a retry detect the existing PR and avoid a duplicate
Check if a resource already exists before creating	Query GitHub API for existing PRs on the same branch
Write state before acting, not after	If the action fails, state shows the intent and retry is safe

Workflow Observability

Each state transition should emit a structured event:

Event	Payload
`node.started`	`{ node: "retrieve_code", state_snapshot, timestamp }`
`node.completed`	`{ node: "retrieve_code", duration_ms, tokens_used }`
`tool.called`	`{ tool: "read_file", args, result_size }`
`interrupt.raised`	`{ reason: "human_review", diff_preview }`
`workflow.completed`	`{ pr_url, total_tokens, total_duration_ms }`

Feed these to your observability stack (for example Datadog, or any backend that accepts OpenTelemetry data) for cost tracking and anomaly detection.
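A wrapper around each node is one simple way to emit the `node.started` and `node.completed` events from the table. The sketch below writes JSON lines to stdout and keeps them in an in-memory `EVENTS` list; `emit`, `traced`, and `EVENTS` are hypothetical helpers, and a real setup would forward the records to an exporter instead.

```python
import json
import time
from typing import Callable

EVENTS: list[dict] = []  # in-memory sink, useful for tests


def emit(event: str, **payload) -> None:
    record = {"event": event, "timestamp": time.time(), **payload}
    EVENTS.append(record)
    print(json.dumps(record, default=str))  # one structured line per event


def traced(name: str, node: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Wrap a node so that it reports its start, duration, and token usage."""
    def wrapper(state: dict) -> dict:
        emit("node.started", node=name, state_snapshot=state)
        start = time.perf_counter()
        new_state = node(state)
        duration_ms = round((time.perf_counter() - start) * 1000, 3)
        emit(
            "node.completed",
            node=name,
            duration_ms=duration_ms,
            # Assumes the node reports its own usage under "tokens_used".
            tokens_used=new_state.get("tokens_used", 0),
        )
        return new_state
    return wrapper
```

Because the wrapper is applied from outside, node functions stay free of logging code and every node reports the same fields.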

How do you handle a workflow where the agent discovers it needs to change multiple services?

This is scope expansion, and the agent should NOT silently expand its blast radius. Add a scope validation node that checks whether the proposed changes cross service boundaries. If they do, interrupt and ask the JIRA ticket creator to confirm the new scope. Never let the agent autonomously modify multiple microservices without human sign-off.
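A scope validation node can be as small as a set comparison. The sketch below assumes a monorepo layout of `services/<name>/...` and a single service allowed per ticket; both are assumptions for illustration, and paths outside `services/` are ignored here. The returned dict is a hypothetical signal that the surrounding graph would turn into an interrupt.

```python
def services_touched(paths: list[str]) -> set[str]:
    """Extract service names, assuming paths look like services/<name>/..."""
    return {
        path.split("/")[1]
        for path in paths
        if path.startswith("services/") and path.count("/") >= 2
    }


def validate_scope(changed_paths: list[str], allowed_service: str) -> dict:
    extra = sorted(services_touched(changed_paths) - {allowed_service})
    if extra:
        # Crossing a service boundary: stop and ask a human to confirm.
        return {
            "action": "interrupt",
            "reason": "scope_expansion",
            "extra_services": extra,
        }
    return {"action": "continue"}
```

For example, a change set touching `services/billing/api.py` and `services/auth/token.py` on a ticket scoped to `billing` yields an interrupt that names `auth` as the extra service.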
