08.02 · Data Privacy in AI Pipelines
Level: Advanced
Pre-reading: 08 · AI Security
What Data Flows Through Your AI Pipeline?
Before building, audit every data type your agent touches:
| Data Type | Sensitivity | Where It Goes |
|---|---|---|
| JIRA ticket description | Low–Medium | Sent to LLM API |
| Source code | High (IP) | Sent to LLM API |
| Test code | Medium | Sent to LLM API |
| Stack traces | Medium (architecture exposure) | Sent to LLM API |
| API credentials in code | Critical | Must NEVER reach LLM API |
| PII in JIRA tickets | High (GDPR) | Must be redacted before LLM API |
| Customer data in test payloads | Critical | Must NEVER be in test fixtures |
PII Detection and Redaction
Before sending any data to an external LLM API, scan for PII:
```mermaid
graph LR
    A[Raw JIRA ticket] --> B[PII Scanner]
    B --> C{PII found?}
    C -->|No| D[Send to LLM API]
    C -->|Yes| E[Redact PII tokens]
    E --> D
    D --> F[LLM response]
    F --> G[Re-inject original PII if needed]
```
PII categories to detect and redact:
- Email addresses → [EMAIL]
- Phone numbers → [PHONE]
- Names in bug descriptions → [NAME]
- Customer IDs that might imply personal data → [CUSTOMER_ID]
- IP addresses → [IP_ADDRESS]
Tools: Microsoft Presidio (open source), AWS Comprehend (managed PII detection).
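The redact-then-re-inject flow can be sketched with plain regexes. This is a minimal illustration, not a substitute for Presidio or Comprehend: the patterns below cover only a few of the categories listed and will miss many real-world PII formats.

```python
import re

# Illustrative patterns only; a production pipeline should use a dedicated
# PII engine (e.g. Microsoft Presidio). Names need NER, not regex, so they
# are omitted here.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text: str):
    """Replace PII with numbered tokens; return the redacted text plus a
    token -> original mapping so PII can be re-injected after the LLM call."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        def _sub(match, label=label):
            token = f"[{label}_{len(mapping)}]"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(_sub, text)
    return text, mapping

def restore(text: str, mapping: dict) -> str:
    """Re-inject the original values into the LLM response if needed."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

redacted, pii = redact("Contact alice@example.com from 10.0.0.5")
print(redacted)  # Contact [EMAIL_0] from [IP_ADDRESS_1]
```

Keeping the token → original mapping on your side of the boundary is what makes the final "re-inject" step in the diagram possible without the LLM ever seeing the raw values.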
Credential Scanning Before RAG Indexing
Your codebase may contain accidentally committed secrets. Scan before indexing:
```mermaid
graph LR
    A[Code file] --> B[Secret Scanner · gitleaks, trufflehog]
    B -->|Clean| C[Chunk + embed + store]
    B -->|Secret found| D[Alert + skip file]
    D --> E[Notify security team to rotate credential]
```
Never index a file with a detected credential — it would be embedded and potentially surfaced to the LLM in retrieved context.
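The gate can be sketched as a pre-indexing check. In practice you would shell out to gitleaks or trufflehog, which ship far more comprehensive rule sets; the regexes and the `store.add` call below are illustrative stand-ins.

```python
import re

# A few well-known secret shapes, for illustration only. Real scanners
# (gitleaks, trufflehog) maintain hundreds of rules plus entropy checks.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID format
    re.compile(r"-----BEGIN (?:RSA|EC|OPENSSH) PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
]

def safe_to_index(source: str) -> bool:
    """Return False if the file appears to contain a credential.
    Files that fail this check must be skipped, never embedded."""
    return not any(p.search(source) for p in SECRET_PATTERNS)

def index_file(path: str, source: str, store) -> None:
    """Gate a file before it enters the RAG index. `store` is a
    hypothetical vector-store wrapper with an `add` method."""
    if safe_to_index(source):
        store.add(path, source)  # chunk + embed + store
    else:
        # Alert + skip: the credential must be rotated, and the file
        # re-indexed only after the secret is purged from history.
        print(f"ALERT: possible secret in {path}; file skipped.")
```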
Data Residency and Model Selection
| Scenario | Recommended Approach |
|---|---|
| Public/open source codebase | Any cloud LLM API |
| Internal codebase, no compliance constraints | Cloud LLM with enterprise API agreement |
| Financial or healthcare regulated codebase | Self-hosted LLM (LLaMA, Mistral) or Azure OpenAI with data residency |
| Code containing military or government IP | Air-gapped, fully self-hosted |
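The table above is a policy decision, and it helps to encode it once rather than choose a backend per call. A sketch, where the classification labels and endpoint URLs are hypothetical examples:

```python
# Hypothetical routing table: labels and endpoints are illustrative,
# not real services. Backend choice is a one-time policy decision.
ROUTING = {
    "public": "https://api.cloud-llm.example/v1",            # any cloud LLM API
    "internal": "https://enterprise.cloud-llm.example/v1",   # enterprise agreement
    "regulated": "http://vllm.internal:8000/v1",             # self-hosted vLLM
    "air_gapped": "http://llm.airgap.local:8000/v1",         # fully offline
}

def endpoint_for(classification: str) -> str:
    """Map a codebase classification to an LLM endpoint."""
    try:
        return ROUTING[classification]
    except KeyError:
        # Fail closed: an unknown classification gets the most
        # restrictive backend, never the public cloud.
        return ROUTING["air_gapped"]
```

The fail-closed default matters: a misclassified repository should land on the most restrictive backend, not the least.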
Self-Hosted LLM Stack
For organisations that cannot send code to external APIs:
| Component | Open Source Option |
|---|---|
| LLM inference | Ollama (local), vLLM (GPU server), llama.cpp |
| Model | LLaMA 3.3 70B, Mistral Large, DeepSeek Coder |
| Embedding | nomic-embed-text (Ollama), sentence-transformers |
| Vector DB | Qdrant (self-hosted), pgvector |
| Orchestration | LangGraph (no cloud dependency in the framework itself) |
Quality Trade-off
Self-hosted models in the 7B–70B parameter range are noticeably weaker than GPT-4o or Claude for complex code-reasoning tasks. Test your specific use cases thoroughly before committing to self-hosted. A hybrid approach (self-hosted for context retrieval, cloud API only for code generation with redacted context) can be a middle ground.
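That hybrid split can be sketched as a two-stage pipeline. `local_retrieve`, `redact`, and `cloud_generate` are hypothetical stand-ins for your own retrieval, redaction, and API layers:

```python
def hybrid_generate(ticket: str, local_retrieve, redact, cloud_generate) -> str:
    """Hybrid pipeline sketch: retrieval runs on self-hosted infrastructure,
    and only redacted material ever reaches the cloud API. All three
    callables are hypothetical stand-ins for real components."""
    # Stage 1: context retrieval against the self-hosted vector DB.
    context_chunks = local_retrieve(ticket)
    # Stage 2: redact ticket and context before anything leaves the network.
    # `redact` returns (redacted_text, mapping); only the text is sent out.
    safe_ticket, _ = redact(ticket)
    safe_context = [redact(chunk)[0] for chunk in context_chunks]
    # Stage 3: only redacted material goes to the cloud code-generation model.
    return cloud_generate(prompt=safe_ticket, context=safe_context)
```

The key property is structural: the cloud-facing function is only ever handed values that have already passed through the redactor.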
Does OpenAI train on API call data?
Under its current data-usage policy, OpenAI does not use API data to train models by default; enterprise customers can additionally negotiate a zero-data-retention agreement. Always check the current policy before relying on this, since it can change.
How do you handle sensitive customer data that appears in bug reproduction steps?
Add an explicit check in the ticket ingestion node: if the ticket description contains what appears to be real customer data (PII scanner hit), interrupt the workflow and ask the ticket author to replace it with anonymised test data. Post a JIRA comment explaining why. Never pass real customer data to the LLM pipeline.
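A sketch of that ingestion check, assuming a `scan_for_pii` helper (returning the set of PII categories found) and a JIRA client with an `add_comment` method, both hypothetical:

```python
class SensitiveDataError(Exception):
    """Raised to interrupt the workflow before any LLM call is made."""

def ingest_ticket(ticket: dict, scan_for_pii, jira) -> dict:
    """Ticket-ingestion node sketch: halt on a PII hit instead of redacting,
    because reproduction steps built on real customer data should be
    rewritten by the author, not silently masked. `scan_for_pii` and
    `jira` are hypothetical stand-ins for real components."""
    hits = scan_for_pii(ticket["description"])
    if hits:
        # Explain the interruption on the ticket itself.
        jira.add_comment(
            ticket["key"],
            "This ticket appears to contain real customer data "
            f"({', '.join(sorted(hits))}). Please replace it with anonymised "
            "test data; automated processing is paused until then.",
        )
        raise SensitiveDataError(ticket["key"])
    return ticket  # clean: continue into the LLM pipeline
```

Raising rather than redacting is deliberate here: redaction could strip the very details needed to reproduce the bug, while an explicit interrupt pushes the fix back to the ticket author.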