Performance Optimization¶
1. Caching Strategy¶
Three-Level Cache:
Java
@Service
public class CachedAiClient implements AiClient {
private final AiClient delegate;
private final ConcurrentHashMap<String, String> l1; // In-memory
private final RedisTemplate<String, String> l2; // Redis
private final DatabaseCache l3; // Database
@Override
public String generateResponse(String prompt) {
String key = hashPrompt(prompt);
// Try L1 cache (in-process, <1ms)
String cached = l1.get(key);
if (cached != null) {
metrics.counter("cache.hit.l1").increment();
return cached;
}
// Try L2 cache (Redis, <10ms)
cached = l2.opsForValue().get(key);
if (cached != null) {
metrics.counter("cache.hit.l2").increment();
l1.put(key, cached); // Populate L1
return cached;
}
// Try L3 cache (DB, <100ms)
cached = l3.get(key);
if (cached != null) {
metrics.counter("cache.hit.l3").increment();
l2.opsForValue().set(key, cached); // Populate L2
l1.put(key, cached);
return cached;
}
// Cache miss - call AI
metrics.counter("cache.miss").increment();
String result = delegate.generateResponse(prompt);
// Store in all caches
l1.put(key, result);
l2.opsForValue().set(key, result, Duration.ofHours(24));
l3.put(key, result);
return result;
}
}
Cache Hit Strategy:
Text Only
Goal: 60% cache hit rate
Tactics:
1. Semantic caching: Similar queries → same response
2. Bucketing: Round timestamp to hour (same request at 14:05 and 14:35)
3. Normalization: "best laptop" = "laptop recommendations"
2. Batch Processing¶
Batch Multiple Requests:
Java
@Service
public class BatchedAiService {
private final BlockingQueue<AiRequest> queue =
new LinkedBlockingQueue<>(1000);
public void scheduleWithBatch(String prompt) {
queue.add(new AiRequest(prompt));
}
@Scheduled(fixedDelay = 100, initialDelay = 100)
public void processBatch() {
List<AiRequest> batch = new ArrayList<>();
queue.drainTo(batch, 100); // Get up to 100
if (batch.isEmpty()) return;
// Process all at once (more efficient)
List<String> results = aiClient.generateResponseBatch(
batch.stream().map(r -> r.prompt).collect(toList())
);
for (int i = 0; i < batch.size(); i++) {
batch.get(i).future.complete(results.get(i));
}
}
}
3. Async Pattern¶
Java
@Service
public class AsyncAiEnhancement {
public ProductSearchResponse searchAsync(ProductSearchRequest req) {
// Immediately return database results
List<Product> immediate = db.search(req);
// Schedule AI ranking in background
CompletableFuture<List<Product>> aiRanking =
CompletableFuture.supplyAsync(() ->
enhanceWithAi(immediate)
);
// Return response without waiting
return ProductSearchResponse.builder()
.results(immediate)
.aiEnhanced(false)
.estimatedAiTime(500) // Hint to client
.build();
}
}
4. Compression¶
Java
@Service
public class CompressedAiClient {
public String generateResponse(String prompt) {
// Compress prompt for transmission
byte[] compressed = compress(prompt.getBytes());
// Send to AI service
ResponseEntity<String> response = rest.postForEntity(
"/api/generate",
compressed,
String.class,
new HttpHeaders() {{
set("Content-Encoding", "gzip");
}}
);
return response.getBody();
}
}
5. Connection Pooling¶
Java
@Configuration
public class AiClientConfig {
@Bean
public RestTemplate restTemplate() {
HttpClientHttpRequestFactory factory =
new HttpClientHttpRequestFactory();
HttpClient httpClient = HttpClientBuilder.create()
.setMaxConnTotal(100) // Total connections
.setMaxConnPerRoute(50) // Per route
.setConnectionReuseStrategy((request, response, context) -> true)
.build();
factory.setHttpClient(httpClient);
return new RestTemplate(factory);
}
}
Performance Targets¶
Text Only
Service Layer AI Performance:
Database query: 10ms
Build context: 10ms
AI API call: 200ms
Parse response: 10ms
Merge with catalog: 20ms
─────────────────────────────
Total latency: ~250ms
Target p95: <500ms
Target p99: <2000ms