08.02 · Data Privacy in AI Pipelines

Level: Advanced
Pre-reading: 08 · AI Security


What Data Flows Through Your AI Pipeline?

Before building, audit every data type your agent touches:

| Data type | Sensitivity | Where it goes |
|---|---|---|
| JIRA ticket description | Low–Medium | Sent to LLM API |
| Source code | High (IP) | Sent to LLM API |
| Test code | Medium | Sent to LLM API |
| Stack traces | Medium (architecture exposure) | Sent to LLM API |
| API credentials in code | Critical | Must NEVER reach LLM API |
| PII in JIRA tickets | High (GDPR) | Must be redacted before LLM API |
| Customer data in test payloads | Critical | Must NEVER be in test fixtures |
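The audit table above can be enforced mechanically at the pipeline boundary. A minimal sketch, assuming hypothetical data-type keys and action names — the point is that unknown data types fail closed:

```python
BLOCK, REDACT, ALLOW = "block", "redact", "allow"

# Hypothetical policy mirroring the audit table; keys are illustrative.
PIPELINE_POLICY = {
    "jira_ticket": REDACT,     # PII must be stripped first
    "source_code": ALLOW,      # allowed only under enterprise API terms
    "test_code": ALLOW,
    "stack_trace": ALLOW,
    "credentials": BLOCK,      # must never reach the LLM API
    "customer_data": BLOCK,    # must never appear in test fixtures
}

def gate(data_type: str) -> str:
    """Return the required action, failing closed for unknown types."""
    return PIPELINE_POLICY.get(data_type, BLOCK)
```

Failing closed matters: a new data source added to the pipeline is blocked until someone classifies it.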

PII Detection and Redaction

Before sending any data to an external LLM API, scan for PII:

```mermaid
graph LR
    A[Raw JIRA ticket] --> B[PII Scanner]
    B --> C{PII found?}
    C -->|No| D[Send to LLM API]
    C -->|Yes| E[Redact PII tokens]
    E --> D
    D --> F[LLM response]
    F --> G[Re-inject original PII if needed]
```

PII categories to detect and redact:

  • Email addresses → [EMAIL]
  • Phone numbers → [PHONE]
  • Names in bug descriptions → [NAME]
  • Customer IDs that might imply personal data → [CUSTOMER_ID]
  • IP addresses → [IP_ADDRESS]

Tools: Microsoft Presidio (open source), AWS Comprehend (managed PII detection).
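The redact/re-inject loop from the diagram can be sketched with plain regexes. This is a deliberately minimal illustration, not a substitute for a dedicated detector such as Presidio; the patterns and token format are assumptions:

```python
import re

# Simplified detectors -- production systems need far more robust patterns.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text):
    """Replace PII with numbered tokens; return the redacted text plus a
    mapping so original values can be re-injected after the LLM call."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        def sub(match, label=label):
            token = f"[{label}_{len(mapping)}]"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(sub, text)
    return text, mapping

def reinject(text, mapping):
    """Restore original values in the LLM response where tokens survive."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Keeping the token-to-value mapping local means the raw PII never crosses the API boundary, yet the final output shown to the user can still reference the real values.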


Credential Scanning Before RAG Indexing

Your codebase may contain accidentally committed secrets. Scan before indexing:

```mermaid
graph LR
    A[Code file] --> B["Secret Scanner · gitleaks, trufflehog"]
    B -->|Clean| C[Chunk + embed + store]
    B -->|Secret found| D[Alert + skip file]
    D --> E[Notify security team to rotate credential]
```

Never index a file with a detected credential — it would be embedded and potentially surfaced to the LLM in retrieved context.
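A minimal gate in front of the indexer might look like the sketch below. The patterns are a tiny illustrative subset — real scanners like gitleaks ship hundreds of curated rules — and `vector_store`/`alert` are placeholder interfaces:

```python
import re

# Illustrative secret patterns only; do not rely on these for real scanning.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key
    re.compile(r"(?i)(?:api[_-]?key|secret|token)\s*[:=]\s*[\"'][^\"']{16,}[\"']"),
]

def safe_to_index(source: str) -> bool:
    """Return False if the file appears to contain a credential."""
    return not any(p.search(source) for p in SECRET_PATTERNS)

def index_file(path: str, source: str, vector_store, alert) -> None:
    """Gate the RAG indexer: clean files are embedded, dirty files skipped."""
    if safe_to_index(source):
        vector_store.add(path, source)  # chunk + embed + store in real code
    else:
        alert(f"Secret detected in {path}; file skipped. Rotate the credential.")
```

Running the real scanner (gitleaks or trufflehog) as a subprocess in place of `SECRET_PATTERNS` keeps the same gate structure while using battle-tested rules.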


Data Residency and Model Selection

| Scenario | Recommended approach |
|---|---|
| Public/open source codebase | Any cloud LLM API |
| Internal codebase, no compliance constraints | Cloud LLM with enterprise API agreement |
| Financial or healthcare regulated codebase | Self-hosted LLM (LLaMA, Mistral) or Azure OpenAI with data residency |
| Code containing military or government IP | Air-gapped, fully self-hosted |
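Encoding the table as a routing function makes the decision auditable and keeps it from being re-litigated per project. A sketch with illustrative flag and tier names:

```python
def choose_deployment(public: bool = False,
                      regulated: bool = False,
                      air_gapped: bool = False) -> str:
    """Map codebase attributes to a deployment tier, strictest rule first."""
    if air_gapped:
        return "air-gapped, fully self-hosted"
    if regulated:
        return "self-hosted LLM or Azure OpenAI with data residency"
    if public:
        return "any cloud LLM API"
    return "cloud LLM with enterprise API agreement"
```

Evaluating the strictest constraint first means conflicting flags (e.g. a public repo inside an air-gapped environment) resolve conservatively.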

Self-Hosted LLM Stack

For organisations that cannot send code to external APIs:

| Component | Open source option |
|---|---|
| LLM inference | Ollama (local), vLLM (GPU server), llama.cpp |
| Model | LLaMA 3.3 70B, Mistral Large, DeepSeek Coder |
| Embedding | nomic-embed-text (Ollama), sentence-transformers |
| Vector DB | Qdrant (self-hosted), pgvector |
| Orchestration | LangGraph (no cloud dependency in the framework itself) |

Quality Trade-off

Self-hosted models at the 7B–70B parameter range are noticeably weaker than GPT-4o or Claude for complex code reasoning tasks. Test your specific use cases thoroughly before committing to self-hosted. A hybrid approach (self-hosted for context retrieval, cloud API only for code generation with redacted context) can be a middle ground.
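The hybrid approach can be expressed as a small pipeline in which only redacted text ever leaves the network. In this sketch, `local_embed`, `vector_search`, `redact`, and `cloud_generate` are placeholders for your own self-hosted embedder, vector store, PII redactor, and cloud API client:

```python
def hybrid_answer(question, local_embed, vector_search, redact, cloud_generate):
    """Self-hosted retrieval, cloud generation on redacted context only."""
    query_vec = local_embed(question)           # embedding never leaves the network
    chunks = vector_search(query_vec, top_k=5)  # self-hosted vector DB lookup
    context = "\n".join(redact(chunk) for chunk in chunks)
    return cloud_generate(question, context)    # only redacted text crosses the boundary
```

The security property lives in the structure: there is no code path from raw chunks to `cloud_generate`, so a reviewer can verify the boundary by reading one function.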


Does OpenAI train on API call data?

As of their current data usage policy, OpenAI does not use data submitted via the API to train its models by default. Enterprise customers can additionally obtain a zero-data-retention agreement for a stronger contractual guarantee. Always check the current policy before relying on this; it can change.

How do you handle sensitive customer data that appears in bug reproduction steps?

Add an explicit check in the ticket ingestion node: if the ticket description contains what appears to be real customer data (PII scanner hit), interrupt the workflow and ask the ticket author to replace it with anonymised test data. Post a JIRA comment explaining why. Never pass real customer data to the LLM pipeline.
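The guard described above can be sketched as an ingestion function that interrupts the workflow on a scanner hit. `scan_pii` and `post_jira_comment` are stand-ins for your detector and JIRA client:

```python
class TicketBlocked(Exception):
    """Raised to interrupt the workflow when real customer data is detected."""

def ingest_ticket(ticket_id, description, scan_pii, post_jira_comment):
    """Gate ticket ingestion: block and notify if the description has PII."""
    hits = scan_pii(description)  # e.g. {"EMAIL", "PHONE"} on a hit
    if hits:
        post_jira_comment(
            ticket_id,
            "This ticket appears to contain real customer data "
            f"({', '.join(sorted(hits))}). Please replace it with anonymised "
            "test data so automated processing can continue.",
        )
        raise TicketBlocked(ticket_id)
    return description  # safe to pass downstream
```

Raising an exception (rather than silently redacting) forces a human decision, which is the right default when the data might be real customer information.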