LLM Integration Overview¶
What is an LLM?¶
A Large Language Model (LLM) is an AI system trained on vast amounts of text data that can:
- Understand context from prompts
- Generate human-like responses
- Perform reasoning and analysis
- Understand nuance and intent
Common LLMs¶
- OpenAI: GPT-4, GPT-3.5 (API-based)
- Anthropic: Claude (API-based)
- Open Source: Llama, Mistral (self-hosted via Ollama)
- Google: PaLM, Gemini (API-based)
The Traditional Spring Boot Flow¶
HTTP Request
↓
Controller (handles routing)
↓
Service (business logic)
↓
Repository (data access)
↓
Database/External APIs
↓
Response to Client
Characteristics:
- Deterministic results based on cached/stored data
- No AI reasoning or context understanding
- Fast and predictable
- Limited to pre-computed information
The AI-Enhanced Flow¶
HTTP Request
↓
Controller (handles routing)
↓
Service Layer:
1. Fetch data from DB
2. Build context for LLM
3. Call LLM with context
4. Process LLM response
5. Enhance original data
↓
Database/External APIs + LLM API
↓
Enhanced Response to Client
Characteristics:
- Context-Aware: Service provides rich context to the LLM
- Intelligent Processing: LLM reasons about the situation
- Adaptive: Different responses for different contexts
- Slower: Network latency to the LLM
- Costlier: Pay per API call
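As a concrete (if simplified) sketch of those five service-layer steps, a support auto-reply service might look like the following. TicketRepository, Ticket, and SupportReply are hypothetical domain types, and AiClient is the abstraction introduced below.

import java.util.List;

// Hypothetical service walking through the five steps in the flow above.
public class SupportReplyService {

    private final TicketRepository ticketRepository; // assumed data access type
    private final AiClient aiClient;                 // abstraction introduced below

    public SupportReplyService(TicketRepository ticketRepository, AiClient aiClient) {
        this.ticketRepository = ticketRepository;
        this.aiClient = aiClient;
    }

    public SupportReply draftReply(long ticketId) {
        // 1. Fetch data from the DB
        Ticket ticket = ticketRepository.findById(ticketId);
        List<Ticket> history = ticketRepository.findByCustomerId(ticket.getCustomerId());

        // 2. Build context for the LLM
        StringBuilder context = new StringBuilder("Current ticket: ")
                .append(ticket.getText()).append("\n");
        history.forEach(t -> context.append("Past ticket: ").append(t.getText()).append("\n"));

        // 3. Call the LLM with that context
        String aiResponse = aiClient.generateResponseWithContext(
                "You are a support agent. Draft a helpful, empathetic reply.",
                context.toString());

        // 4. Process the LLM response and 5. enhance the original data
        return new SupportReply(ticket, aiResponse.trim());
    }
}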
Why Integrate AI in the Service Layer?¶
The Right Place¶
The service layer is ideal because:
- Already handles business logic → Adding AI context is natural
- Has access to all data → Can build rich prompts
- Can make decisions → Decide when to call AI
- Encapsulates complexity → Controllers don't care about AI
- Easy to test → Mock AI client in unit tests
NOT in the Controller¶
- ❌ Controllers should handle HTTP routing, not AI logic
- ❌ Couples HTTP handling with AI complexity
- ❌ Hard to test and maintain
NOT as a Separate System¶
- ❌ Creates architectural complexity
- ❌ Requires new infrastructure/monitoring
- ❌ Hard to use AI results in existing features
- ❌ Increases latency (additional service calls)
Key Pattern: Abstraction via Interface¶
// This interface is key to flexibility
public interface AiClient {
    String generateResponse(String prompt);
    String generateResponseWithContext(String systemPrompt, String userPrompt);
    boolean isAvailable();
    String getModelName();
}
Benefits of Abstraction¶
| Aspect | Benefit |
|---|---|
| Provider Independence | Switch OpenAI → Claude without changing services |
| Testing | Mock implementation for unit tests |
| Gradual Rollout | Start with mock, migrate to real LLM |
| Fallback Logic | isAvailable() lets you handle failures |
| Cost Control | Easier to swap to cheaper provider |
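For example, a trivial mock implementation (a sketch; the canned response text is arbitrary) lets services be built and unit-tested before any real provider is wired in:

// Sketch of a mock AiClient used during development and in unit tests.
public class MockAiClient implements AiClient {

    @Override
    public String generateResponse(String prompt) {
        return "MOCK RESPONSE for prompt of length " + prompt.length();
    }

    @Override
    public String generateResponseWithContext(String systemPrompt, String userPrompt) {
        return generateResponse(systemPrompt + "\n" + userPrompt);
    }

    @Override
    public boolean isAvailable() {
        return true; // the mock is always "up"
    }

    @Override
    public String getModelName() {
        return "mock-model";
    }
}

Swapping in an OpenAI- or Claude-backed implementation later is then a wiring/configuration change; no service code needs to be touched.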
Core AI Integration Concepts¶
1. Prompt Engineering¶
The quality of the LLM response depends heavily on the prompt:
// Good: Provides context and clear instructions
String prompt = "You are a product recommendation assistant. " +
        "User's purchase history: " + userHistory + "\n" +
        "Available products: " + products + "\n" +
        "Recommend the top 5 products they would enjoy.";

// Bad: Vague and lacking context
String prompt = "Recommend products";
2. Context Building¶
Service enriches prompt with domain-specific data:
// In ProductSearchService
private String buildSearchContext(ProductSearchRequest request, List<Product> results) {
    StringBuilder context = new StringBuilder();
    context.append("User Search: ").append(request.getQuery()).append("\n");
    context.append("Found Products:\n");
    results.forEach(p ->
        context.append("- ").append(p.getName()).append(" ($").append(p.getPrice()).append(")\n")
    );
    return context.toString();
}
3. Response Processing¶
Don't use LLM output directly; process it:
// In service layer
String aiResponse = aiClient.generateResponse(context);
// Process and validate
List<Product> recommendations = parseAiResponse(aiResponse);
// Filter based on business rules
recommendations = filterByInventory(recommendations);
// Rank by relevance
recommendations = rankByRelevance(recommendations, request);
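What parseAiResponse does depends entirely on the prompt contract. A minimal sketch, assuming the prompt asked the model to return one product name per line and assuming a productRepository.findByName lookup that returns an Optional:

import java.util.List;
import java.util.Optional;

// Sketch: map each non-empty line of the LLM output back to a known product.
// Anything that doesn't match an existing product is silently dropped,
// which guards against hallucinated names.
private List<Product> parseAiResponse(String aiResponse) {
    return aiResponse.lines()
            .map(String::trim)
            .filter(line -> !line.isEmpty())
            .map(line -> line.replaceFirst("^[-*\\d.\\s]+", "")) // strip list markers
            .map(productRepository::findByName)                  // assumed lookup method
            .flatMap(Optional::stream)
            .toList();
}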
4. Error Handling¶
LLMs can fail or produce invalid output:
if (!aiClient.isAvailable()) {
    // Fallback to traditional approach
    return getRecommendationsWithoutAI();
}

try {
    String response = aiClient.generateResponse(prompt);
    return parseResponse(response);
} catch (LlmException e) {
    // Log, monitor, fallback
    log.error("LLM call failed", e);
    return getFallbackResponse();
}
Decision Matrix: When to Use AI¶
| Scenario | Use AI? | Why/Why Not |
|---|---|---|
| Product Search | ✅ Yes | Understands intent, can reason about relevance |
| Exact Record Lookup | ❌ No | LLM slower than index lookup |
| Support Auto-Reply | ✅ Yes | Needs reasoning and empathy |
| Data Validation | ❌ No | Deterministic rules are faster |
| Recommendations | ✅ Yes | Must understand user preferences |
| User Authentication | ❌ No | Security critical, must be deterministic |
| Category Classification | ~ Maybe | Depends on complexity vs. latency budget |
| Summarization | ✅ Yes | LLMs excel at this |
LLM Call Cost Analysis¶
Token Pricing Examples (as of 2024):
- GPT-4: $0.03 per 1K input tokens, $0.06 per 1K output tokens
- Claude 3 Opus: $0.015 per 1K input tokens, $0.075 per 1K output tokens
- Open Source (Ollama): $0 (self-hosted)
Cost Optimization:
1. Cache responses for similar requests (see the caching sketch below)
2. Batch prompts when possible
3. Use cheaper models for simple tasks
4. Implement request timeouts (don't retry indefinitely)
5. Monitor token usage per feature
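A minimal sketch of point 1, caching by exact prompt text. A plain in-memory map is used here for brevity; a real deployment would more likely use Caffeine or Redis with a TTL, and identical-prompt caching does not cover semantically similar requests.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: wrap any AiClient and memoize responses keyed on the prompt.
public class CachingAiClient implements AiClient {

    private final AiClient delegate;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public CachingAiClient(AiClient delegate) {
        this.delegate = delegate;
    }

    @Override
    public String generateResponse(String prompt) {
        return cache.computeIfAbsent(prompt, delegate::generateResponse);
    }

    @Override
    public String generateResponseWithContext(String systemPrompt, String userPrompt) {
        return cache.computeIfAbsent(systemPrompt + "\n" + userPrompt,
                key -> delegate.generateResponseWithContext(systemPrompt, userPrompt));
    }

    @Override
    public boolean isAvailable() {
        return delegate.isAvailable();
    }

    @Override
    public String getModelName() {
        return delegate.getModelName();
    }
}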
Performance Implications¶
Latency¶
Traditional Flow: 50-200ms (DB query + serialization)
AI-Enhanced Flow: 50ms (DB) + 1-5s (LLM) + 50ms (processing)
= roughly 1.1-5.1 seconds total
Mitigation:
- Cache LLM responses
- Run LLM calls asynchronously
- Implement timeouts and degrade to a non-AI response (see the sketch below)
- Use faster LLM models for latency-critical features
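One way to enforce a hard latency budget is to run the LLM call off the request thread and fall back when it overruns. A sketch using CompletableFuture; the 2-second budget and the fallback text are arbitrary assumptions:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch: degrade to a non-AI response if the LLM doesn't answer in time.
// Note: a timed-out call still consumes a pool thread until it finishes.
public String describeWithTimeout(AiClient aiClient, String prompt, String fallback) {
    return CompletableFuture
            .supplyAsync(() -> aiClient.generateResponse(prompt))
            .completeOnTimeout(fallback, 2, TimeUnit.SECONDS)
            .exceptionally(ex -> fallback)
            .join();
}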
Throughput¶
Each LLM API call consumes resources:
- Rate limiting from the provider
- Network bandwidth
- Cost per request
Mitigation:
- Queue requests during high load
- Implement circuit breakers (see the sketch below)
- Fall back to the traditional method under load
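A sketch of the circuit-breaker point, assuming the Resilience4j library is on the classpath; getFallbackResponse() stands in for the traditional, non-AI path:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;

// Sketch: when the breaker is open, calls are rejected immediately and we fall back
// instead of piling more load onto a struggling LLM provider.
private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("aiClient");

public String generateWithBreaker(String prompt) {
    try {
        return breaker.executeSupplier(() -> aiClient.generateResponse(prompt));
    } catch (Exception e) {
        // Open circuit or provider failure: degrade to the traditional path
        return getFallbackResponse();
    }
}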
Common Implementation Patterns¶
Pattern 1: Optional Enhancement¶
Request comes in
├─ Fetch data from DB
├─ If AI enabled for this request
│ ├─ Call LLM with context
│ └─ Enhance results
└─ Return response (with or without AI)
Use Case: Product search, recommendations
Benefit: Backwards compatible; clients opt in
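A sketch of the opt-in check, assuming the request object carries an aiEnabled flag and a SearchResponse type that tolerates a missing AI summary:

import java.util.List;

// Sketch: AI enhancement is strictly additive and opt-in.
public SearchResponse search(ProductSearchRequest request) {
    List<Product> results = productRepository.search(request.getQuery());

    String aiSummary = null;
    if (request.isAiEnabled() && aiClient.isAvailable()) {
        aiSummary = aiClient.generateResponse(buildSearchContext(request, results));
    }
    // Clients that didn't opt in get exactly the response they always got
    return new SearchResponse(results, aiSummary);
}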
Pattern 2: AI-as-Filter¶
Request comes in
├─ Get ALL candidate items
├─ Call LLM to score/rank items
├─ Return top N items
└─ Return response
Use Case: Recommendations that filter thousands of candidate items
Benefit: The LLM only scores candidates; it doesn't need to generate content
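A sketch of scoring-then-ranking. The prompt format and the helpers formatCandidates and parseScores (mapping product id to score) are assumptions:

import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch: the LLM only scores candidates; ranking and cutting happen locally.
public List<Product> topRecommendations(List<Product> candidates, String userProfile, int n) {
    String prompt = "Score each product from 0 to 10 for this user.\n"
            + "User profile: " + userProfile + "\n"
            + "Products:\n" + formatCandidates(candidates)
            + "Reply with one '<productId>: <score>' per line.";

    Map<Long, Double> scores = parseScores(aiClient.generateResponse(prompt));

    return candidates.stream()
            .sorted(Comparator.comparingDouble(
                    (Product p) -> scores.getOrDefault(p.getId(), 0.0)).reversed())
            .limit(n)
            .toList();
}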
Pattern 3: Async AI Enhancement¶
Request comes in
├─ Fetch primary data
├─ Return response immediately
├─ In background: Call LLM
├─ Store enhanced result for next request
└─ Next user gets pre-enhanced data
Use Case: Summaries, descriptions, categorization
Benefit: Doesn't block the response; spreads cost over time
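A sketch using Spring's @Async, assuming a JPA-style repository and an aiSummary field on the entity; the background mechanism (message queue, scheduled job) can vary:

import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

// Sketch: the request path returns immediately; enrichment happens off the
// request thread and is persisted for future reads.
@Service
public class ProductEnrichmentService {

    private final AiClient aiClient;
    private final ProductRepository productRepository;

    public ProductEnrichmentService(AiClient aiClient, ProductRepository productRepository) {
        this.aiClient = aiClient;
        this.productRepository = productRepository;
    }

    @Async // requires @EnableAsync on a configuration class
    public void enrichDescription(Long productId) {
        productRepository.findById(productId).ifPresent(product -> {
            String summary = aiClient.generateResponse(
                    "Write a one-paragraph marketing summary for: " + product.getName());
            product.setAiSummary(summary); // assumed field on the entity
            productRepository.save(product);
        });
    }
}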
Key Architectural Considerations¶
Should I Use AI?¶
Think Before Adding AI
AI isn't always the answer. Ask:
- Does it need reasoning? → Yes = AI might help
- Is latency critical? → Yes = Avoid AI or cache
- Is accuracy critical? → Yes = Maybe avoid AI or use ensemble
- What's the cost? → Document $/request
- What's the fallback? → Must have one
- Can it be cached? → Yes = Definitely use
- Can I test it? → If no, reconsider
Requirements for AI Integration¶
- Abstraction interface (AiClient)
- Fallback mechanism (if AI unavailable)
- Error handling (parsing, validation)
- Monitoring (latency, costs, errors)
- Caching strategy (if applicable)
- Clear context building
- Request timeout