04.02 · State Machines & Workflows
Level: Advanced
Pre-reading: 04 · LangGraph · 04.01 · LangGraph Deep Dive
Agent as a State Machine
Every agentic workflow is fundamentally a state machine — a system with explicit states and transitions that are triggered by conditions or events.
stateDiagram-v2
[*] --> ReadingTicket
ReadingTicket --> IdentifyingService
IdentifyingService --> RetrievingCode
RetrievingCode --> Analyzing
Analyzing --> Generating: confident
Analyzing --> RetrievingCode: need more context
Generating --> AwaitingApproval
AwaitingApproval --> CreatingPR: approved
AwaitingApproval --> Generating: rejected with feedback
CreatingPR --> [*]
Modelling your agent as a state machine first makes the LangGraph implementation obvious.
Workflow Patterns
Sequential Workflow
Each node completes before the next starts. Simple, predictable, easy to debug.
graph LR
A[Read Ticket] --> B[Identify Service] --> C[Retrieve Code] --> D[Generate Fix] --> E[Create PR]
Parallel Workflow
Independent tasks run concurrently. Reduces total latency.
graph LR
A[Analyse Bug] --> B[Generate Fix]
A --> C[Write Test]
B --> D[Merge results]
C --> D
D --> E[Create PR]
Map-Reduce Workflow
Fan out to process many items; reduce results into a single output.
graph LR
A[10 failing Playwright tests] --> B[Spawn 10 analysis agents in parallel]
B --> C[Aggregate: common root causes]
C --> D[Generate single RCA document]
Event-Driven Workflow
Agent is triggered by external events rather than a direct call.
graph LR
A[CI webhook: tests failed] --> B[Agent triggered]
B --> C[Read failure report]
C --> D[Analyse and fix]
D --> E[Open MR]
Long-Running Workflows
Some JIRA tickets require hours of agent work. Design for interruption:
| Concern | Solution |
|---|---|
| Server restart | Checkpointed state in PostgreSQL |
| Token budget exceeded mid-run | State snapshots, resume from last checkpoint |
| Dependent external event | interrupt() until webhook arrives (e.g., CI build completes) |
| Human feedback latency | Async interrupt, agent resumes when developer clicks approve |
Idempotency and Retries
| Rule | Why |
|---|---|
| All tool calls should be idempotent | Retrying a failed step shouldn't create duplicate PRs |
| Use unique IDs for all resources created | PR description includes JIRA ticket ID to prevent duplicates |
| Check if a resource already exists before creating | Query GitHub API for existing PRs on the same branch |
| Write state before acting, not after | If the action fails, state shows the intent and retry is safe |
Workflow Observability
Each state transition should emit a structured event:
| Event | Payload |
|---|---|
node.started |
{ node: "retrieve_code", state_snapshot, timestamp } |
node.completed |
{ node: "retrieve_code", duration_ms, tokens_used } |
tool.called |
{ tool: "read_file", args, result_size } |
interrupt.raised |
{ reason: "human_review", diff_preview } |
workflow.completed |
{ pr_url, total_tokens, total_duration_ms } |
Feed these to your observability platform (Datadog, OpenTelemetry) for cost tracking and anomaly detection.
How do you handle a workflow where the agent discovers it needs to change multiple services?
This is a scope expansion — the agent should NOT silently expand its blast radius. Add a scope validation node that checks if proposed changes cross service boundaries. If yes, interrupt and ask the JIRA ticket creator to confirm scope. Never let the agent autonomously modify multiple microservices without human sign-off.