Design Decision Matrix¶
Decision 1: Where to Integrate AI?¶
Options Comparison¶
| Location | Controllers | Service Layer | Separate Service |
|---|---|---|---|
| Responsibility | HTTP → Business | Business Logic | External AI Service |
| Data Access | Via services | Direct | Via services |
| Cohesion | Low | High | Low |
| Reusability | Per endpoint | Multiple endpoints | Multiple apps |
| Testability | Hard | Easy | Medium |
| Latency | ~1-5s | ~1-5s | ~2-6s (+RPC) |
| Complexity | Medium | High | High |
| Scaling | With main app | With main app | Independent |
Recommendation: Service Layer ✅¶
Why:
- Single Responsibility Principle ✅
- Easy to test ✅
- Reusable across endpoints ✅
- No additional infrastructure ✅
- Best latency ✅
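As a rough sketch of the service-layer option (class names such as ProductSearchService, ProductRepository, AiClient, and SearchRequest are illustrative assumptions, not the demo's actual code):

Java
import java.util.List;

import org.springframework.stereotype.Service;

// Hypothetical service-layer integration: controllers call this service,
// and only the service talks to the LLM abstraction.
@Service
public class ProductSearchService {

    private final ProductRepository repository; // direct data access
    private final AiClient aiClient;            // LLM behind a small interface

    public ProductSearchService(ProductRepository repository, AiClient aiClient) {
        this.repository = repository;
        this.aiClient = aiClient;
    }

    public List<Product> search(SearchRequest request) {
        List<Product> candidates = repository.findByKeywords(request.query());
        // The AI step lives here, so it is reusable across endpoints and easy to mock in tests.
        return aiClient.rankByRelevance(request.query(), candidates);
    }
}

Because controllers only see search(), the same AI-enhanced logic can back multiple endpoints, and tests can swap AiClient for a mock.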
Decision 2: Which LLM Provider?¶
Provider Comparison Matrix¶
| Provider | Model | Cost/1K Tokens | Speed | Accuracy | Customization | Self-Hosted |
|---|---|---|---|---|---|---|
| OpenAI | GPT-4 | $0.03 in / $0.06 out | 2-5s | ⭐⭐⭐⭐⭐ | Medium | ❌ |
| Anthropic | Claude 3 Opus | $0.015 in / $0.075 out | 2-5s | ⭐⭐⭐⭐⭐ | Medium | ❌ |
| Google | Gemini Pro | $0.0005 in / $0.0015 out | 1-3s | ⭐⭐⭐⭐ | Medium | ❌ |
| Open Source | Llama 2 (Ollama) | $0 | 1-10s | ⭐⭐⭐ | High | ✅ |
| Your Project | GPT-3.5 | $0.0015 in / $0.002 out | 1-3s | ⭐⭐⭐⭐ | Low | ❌ |
Decision Framework¶
Choose based on:
- Cost Sensitivity
  - High cost sensitivity → Ollama (self-hosted)
  - Medium budget → GPT-3.5 Turbo
  - High budget → GPT-4 or Claude Opus
- Latency Requirements
  - <500ms → Can't use LLM, use traditional ML
  - 1-3s tolerance → GPT-3.5, Gemini
  - 5s+ tolerance → GPT-4, Claude Opus
- Accuracy Requirements
  - High accuracy needed → GPT-4, Claude Opus
  - Medium accuracy → GPT-3.5, Claude Sonnet, Gemini
  - Quick answers → Any provider
- Data Privacy
  - Data stays internal → Ollama
  - Can use cloud → Any cloud provider
- Expertise
  - Prompt engineering → Claude or GPT-4
  - Fine-tuning → OpenAI
  - Custom models → Self-hosted Ollama
Our Demo: MockAiClient¶
Rationale:
- No API costs
- No external dependencies
- Easy to replace with real provider
- Perfect for learning
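A minimal sketch of what that swap-friendly abstraction could look like, assuming an AiClient interface and a MockAiClient with hard-coded heuristics (all names here are illustrative):

Java
import java.util.Comparator;
import java.util.List;

// Hypothetical interface: a real OpenAI- or Claude-backed client can replace the mock later
// without touching the services that depend on it.
public interface AiClient {

    boolean isAvailable();

    String categorize(String productDescription);

    List<Product> rankByRelevance(String query, List<Product> candidates);
}

// Deterministic mock: no API key, no cost, stable output for local runs and tests.
class MockAiClient implements AiClient {

    @Override
    public boolean isAvailable() {
        return true; // the mock is always "up"
    }

    @Override
    public String categorize(String productDescription) {
        String text = productDescription.toLowerCase();
        return (text.contains("laptop") || text.contains("phone")) ? "electronics" : "general";
    }

    @Override
    public List<Product> rankByRelevance(String query, List<Product> candidates) {
        // Naive stand-in for an LLM: products mentioning the query come first.
        String q = query.toLowerCase();
        return candidates.stream()
                .sorted(Comparator.comparingInt((Product p) -> p.getName().toLowerCase().contains(q) ? 0 : 1))
                .toList();
    }
}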
Decision 3: Caching Strategy¶
Caching Options¶
| Strategy | Implementation | Hit Rate | Latency | Cost Saved |
|---|---|---|---|---|
| No Cache | Direct LLM call | N/A | 1-5s | $0 |
| Full Response | Cache entire response | 30-60% | 100ms | 30-60% |
| Semantic Cache | Smart deduplication | 50-80% | 100ms | 50-80% |
| Embedding Cache | Vector similarity | 40-70% | 100ms | 40-70% |
When to Cache?¶
| Scenario | Cache? | Reason |
|---|---|---|
| Product Search | ✅ Yes | Same queries repeat often |
| Support Auto-Response | ❌ No | Each ticket is unique |
| Recommendations | ~ Maybe | Cache by user profile hash |
| Categorization | ✅ Yes | Similar inputs have similar outputs |
| Summarization | ✅ Yes | Same documents repeat |
| Translation | ✅ Yes | Same phrases repeat |
Implementation Decision¶
Text Only
Latency Critical?
├─ Yes → Use cache (100ms lookup)
└─ No → Evaluate cost vs quality

Cost High?
├─ Yes → Use cache
└─ No → Can afford fresh responses

Same Input Likely?
├─ Yes → Cache works well
└─ No → Cache not helpful
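A possible shape of the "Full Response" strategy, keyed on a normalized prompt. Spring's @Cacheable is an assumption about the stack (it needs @EnableCaching and a configured cache such as Caffeine or Redis), and the class and cache names are illustrative:

Java
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

// Sketch of a full-response cache in front of the LLM call.
@Service
public class CachedCategorizationService {

    private final AiClient aiClient;

    public CachedCategorizationService(AiClient aiClient) {
        this.aiClient = aiClient;
    }

    // Identical normalized inputs hit the cache (~100ms) instead of the LLM (1-5s).
    @Cacheable(cacheNames = "llmResponses", key = "#description.trim().toLowerCase()")
    public String categorize(String description) {
        return aiClient.categorize(description);
    }
}

Normalizing the key (trim + lowercase) is what pushes the hit rate up for near-duplicate inputs; a semantic or embedding cache generalizes the same idea further.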
Decision 4: Async vs Sync¶
Comparison¶
| Aspect | Sync | Async |
|---|---|---|
| User Waits | Yes (1-5s) | No (immediate) |
| Response Time | Slow | Fast |
| Complexity | Simple | More complex |
| Testing | Easy | Harder |
| Error Handling | Straightforward | Complex |
| Cost | Per-request | Spread over time |
Decision Tree¶
Text Only
Does the User Need an Immediate Response?
├─ Yes → Sync (simpler, acceptable latency)
└─ No → Check priority

Is 2-5s Latency Acceptable?
├─ Yes → Sync
└─ No → Async

Can Processing Happen in the Background?
├─ Yes → Async (better UX)
└─ No → Sync

Is Cost High?
├─ Yes → Async (spread cost over time)
└─ No → Async for UX
Timeline Example¶
Sync Product Search (blocking):
Text Only
User clicks search
↓ (User waits)
Service fetches products (50ms)
↓ (User waits)
Service calls LLM (2-5s)
↓ (User waits)
Response returns to user
TOTAL: 2-5.5 seconds ⏱️
Async Categorization (background):
Text Only
User submits product
↓
Service saves product immediately
↓ (User sees confirmation)
Response returns (instant) ✅
↓
Background: LLM categorizes product
↓
Result saved and available for next user
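A sketch of that background flow with Spring's @Async (requires @EnableAsync on a configuration class); the repository and entity names are assumptions:

Java
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;

// Sketch: the save returns immediately, categorization happens later in the background.
@Service
public class ProductIngestionService {

    private final ProductRepository repository;
    private final BackgroundCategorizer categorizer;

    public ProductIngestionService(ProductRepository repository, BackgroundCategorizer categorizer) {
        this.repository = repository;
        this.categorizer = categorizer;
    }

    public Product save(Product product) {
        Product saved = repository.save(product); // user sees the confirmation instantly
        categorizer.categorize(saved.getId());    // fire-and-forget enrichment
        return saved;
    }
}

// Separate bean so the @Async proxy actually applies (self-invocation would bypass it).
@Component
class BackgroundCategorizer {

    private final ProductRepository repository;
    private final AiClient aiClient;

    BackgroundCategorizer(ProductRepository repository, AiClient aiClient) {
        this.repository = repository;
        this.aiClient = aiClient;
    }

    @Async
    public void categorize(Long productId) {
        repository.findById(productId).ifPresent(product -> {
            product.setCategory(aiClient.categorize(product.getDescription()));
            repository.save(product); // result becomes available for the next reader
        });
    }
}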
Decision 5: Error Handling Strategy¶
Fallback Options¶
| Error Type | Response | Cost |
|---|---|---|
| Timeout | Traditional results | ✅ Low cost |
| Rate Limit | Cached results or traditional | ✅ Low |
| Auth Failure | Traditional results | ~ Medium |
| Invalid Output | Traditional results | ✅ Low |
| Service Down | Traditional results | ✅ Low |
Recommended Strategy¶
Java
public Response search(Request request) {
    try {
        // Try the AI-enhanced path first
        if (aiClient.isAvailable()) {
            return aiEnhancedSearch(request);
        }
    } catch (TimeoutException e) {
        log.warn("LLM timeout, using fallback");
        metrics.recordLLMFailure("timeout");
    } catch (RateLimitException e) {
        log.warn("LLM rate limit, using cached results");
        metrics.recordLLMFailure("rate_limit");
        return cachedResults(request); // assumes a cache-lookup helper next to this method
    } catch (Exception e) {
        log.error("LLM error", e);
        metrics.recordLLMFailure("error");
    }
    // Fallback: traditional (non-AI) approach
    return traditionalSearch(request);
}
Key Points:
- Always have a fallback
- Log failures for monitoring
- Record metrics for observability
- Don't let an LLM failure break the user experience
Decision 6: Monitoring What?¶
Metrics to Track¶
| Metric | Why | Alarm Threshold |
|---|---|---|
| LLM Latency | Detect slowdowns | > 5s |
| LLM Cost/Request | Budget tracking | > $0.10 |
| Error Rate | System health | > 5% |
| Cache Hit Rate | Cost optimization | < 40% (improve caching) |
| Timeout Rate | Performance issue | > 2% |
| Token Usage | Cost tracking | Trending up? |
Observability Stack¶
Text Only
Application
↓
Spring Boot Micrometer
↓
Metrics (Prometheus)
↓
Dashboards (Grafana)
↓
Alerts (PagerDuty)
↓
Logging (ELK Stack)
↓
Trace Analysis
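A sketch of how those metrics could be recorded with Micrometer, which Spring Boot already ships with; the metric names (llm.latency, llm.requests) and the wrapper class are assumptions:

Java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import org.springframework.stereotype.Component;

// Sketch: wrap every LLM call so latency and error rate show up in Prometheus/Grafana.
@Component
public class MeteredAiClient {

    private final AiClient delegate;
    private final MeterRegistry registry;

    public MeteredAiClient(AiClient delegate, MeterRegistry registry) {
        this.delegate = delegate;
        this.registry = registry;
    }

    public String categorize(String description) {
        Timer.Sample sample = Timer.start(registry);
        try {
            String result = delegate.categorize(description);
            registry.counter("llm.requests", "outcome", "success").increment();
            return result;
        } catch (RuntimeException e) {
            registry.counter("llm.requests", "outcome", "error").increment(); // feeds the >5% error-rate alert
            throw e;
        } finally {
            sample.stop(registry.timer("llm.latency")); // alert if this exceeds 5s
        }
    }
}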
Decision 7: Testing Strategy¶
Test Pyramid¶
Text Only
            /\
           /  \      E2E Tests (1-5% of tests)
          /────\     Real LLM calls
         /______\
        /        \
       /  Integ.  \     Integration Tests (15-20%)
      /────────────\    Mock LLM
     /______________\
    /                \
   /    Unit Tests    \   Unit Tests (75-80%)
  /                    \  Mocked service & AI
 /______________________\
Mock vs Real LLM Tests¶
| Test Type | Mock LLM | Real LLM |
|---|---|---|
| Speed | Fast (<100ms) | Slow (1-5s) |
| Cost | Free | $$ per test |
| Reliability | Deterministic | Can fail |
| Frequency | Every test run | Few times/week |
| Purpose | Development | Validation |
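A sketch of the "Mock LLM" column in practice, using JUnit 5 and Mockito; the classes reuse the earlier hypothetical sketches and are not the demo's real API:

Java
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.List;

import org.junit.jupiter.api.Test;

// Deterministic, free, millisecond-fast: no real LLM is called.
class ProductSearchServiceTest {

    private final ProductRepository repository = mock(ProductRepository.class);
    private final AiClient aiClient = mock(AiClient.class);
    private final ProductSearchService service = new ProductSearchService(repository, aiClient);

    @Test
    void delegatesRankingToTheAiClient() {
        List<Product> candidates = List.of(new Product("laptop"));
        when(repository.findByKeywords("laptop")).thenReturn(candidates);
        when(aiClient.rankByRelevance("laptop", candidates)).thenReturn(candidates);

        List<Product> result = service.search(new SearchRequest("laptop"));

        assertThat(result).containsExactlyElementsOf(candidates);
    }
}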
Decision 8: Cost Optimization¶
Cost Reduction Checklist¶
| Action | Potential Saving | Effort |
|---|---|---|
| Use cheaper model | 30-40% | Medium |
| Cache responses | 30-60% | Medium |
| Batch requests | 10-20% | High |
| Token optimization | 10-30% | Medium |
| Load shedding (skip AI under load) | Varies | Low |
| Self-host (Ollama) | 90%+ | High |
Cost Example: 1M requests/month¶
Text Only
Model: GPT-3.5 Turbo
Avg tokens per request: 500 input, 100 output
Price: $0.0015/1K input tokens, $0.002/1K output tokens

Monthly cost WITHOUT optimization:
- 1M requests × 500 tokens × $0.0015/1K = $750
- 1M requests × 100 tokens × $0.002/1K = $200
- Total: ~$950/month

Monthly cost WITH 50% caching:
- Only ~500K requests reach the LLM → 50% × $950 ≈ $475/month
- Savings: ~$475/month (50%)
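The same arithmetic as a small self-contained sketch, handy for plugging in other models' prices (the values below are the example's assumptions):

Java
// Back-of-the-envelope monthly LLM cost, using the numbers from the example above.
public class LlmCostEstimate {

    public static void main(String[] args) {
        long requestsPerMonth = 1_000_000;
        double inputTokensPerRequest = 500;
        double outputTokensPerRequest = 100;
        double inputPricePer1k = 0.0015;  // $ per 1K input tokens (GPT-3.5 Turbo)
        double outputPricePer1k = 0.002;  // $ per 1K output tokens
        double cacheHitRate = 0.5;        // 50% of requests never reach the LLM

        double inputCost = requestsPerMonth * inputTokensPerRequest / 1_000 * inputPricePer1k;
        double outputCost = requestsPerMonth * outputTokensPerRequest / 1_000 * outputPricePer1k;
        double total = inputCost + outputCost;

        System.out.printf("Without caching: ~$%.0f/month%n", total);
        System.out.printf("With %.0f%% caching: ~$%.0f/month%n", cacheHitRate * 100, total * (1 - cacheHitRate));
    }
}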
Your Decision: Start Here¶
If you're building an AI feature:
- Where? → Service Layer
- Which LLM? → Start with mock, then OpenAI/Claude
- Cache? → Implement if repeat queries expected
- Sync or Async? → Sync for now, async if latency critical
- Errors? → Always fallback to traditional
- Monitor? → Latency, cost, errors
- Test? → Mock for unit tests, real LLM for feature validation
- Cost? → Track and optimize caching first
Next: Use Cases →