Data governance and privacy
This document defines Astrocyte’s unified approach to data classification, PII handling, data residency, encryption, regulatory compliance, data lineage, and data loss prevention. It consolidates and extends the governance features scattered across the policy layer (policy-layer.md), lifecycle management (memory-lifecycle.md), access control (access-control.md), and event hooks (event-hooks.md).
This maps to Principle 6 (Barrier maintenance): the blood-brain barrier is not just a wall. It is a selective, actively maintained boundary that classifies what crosses, enforces rules per substance, and adapts to threats.
1. Data classification
1.1 Classification levels
Every piece of content entering the retain path is classified into a sensitivity level. Classification drives all downstream governance decisions.
| Level | Label | Description | Examples |
|---|---|---|---|
| 0 | public | No restrictions. Safe to store, synthesize, export. | Product docs, public FAQs, general knowledge |
| 1 | internal | Not sensitive, but not for external sharing. | Internal processes, team preferences, project context |
| 2 | confidential | Business-sensitive. Restricted access, audit required. | Customer data, financial metrics, trade secrets, strategic plans |
| 3 | restricted | Regulated data. Legal requirements govern handling. | PII, PHI, PCI, ITAR, legal privilege |
1.2 Classification mechanisms
Classification can be assigned three ways, in priority order:
- Explicit: caller sets classification in retain metadata
- Rule-based: pattern matching against configured rules (regex, keyword lists)
- Automatic: LLM-based classification on the retain path
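The priority order above can be sketched as a single function. This is a minimal illustration, not the framework's implementation: `Rule`, `classify`, the `llm_classifier` callable, and the `"default"` value for `classified_by` are hypothetical names, and picking the most restrictive matching rule is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Sensitivity labels from section 1.1, mapped to their numeric levels.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass
class Rule:
    pattern: str  # regex, matched case-insensitively (assumption)
    level: str    # one of the LEVELS keys


def classify(
    content: str,
    metadata: dict,
    rules: list[Rule],
    llm_classifier: Optional[Callable[[str], Optional[str]]] = None,
    default_level: str = "internal",
) -> tuple[str, str]:
    """Return (label, classified_by), trying mechanisms in priority order."""
    # 1. Explicit: a valid label supplied by the caller wins outright.
    explicit = metadata.get("classification")
    if explicit in LEVELS:
        return explicit, "caller"

    # 2. Rule-based: among matching rules, take the most restrictive one.
    matches = [r for r in rules if re.search(r.pattern, content, re.IGNORECASE)]
    if matches:
        best = max(matches, key=lambda r: LEVELS[r.level])
        return best.level, "rules"

    # 3. Automatic: ask the (hypothetical) LLM classifier, if one is configured.
    if llm_classifier is not None:
        label = llm_classifier(content)
        if label in LEVELS:
            return label, "llm"

    # Nothing matched: fall back to the configured default level.
    return default_level, "default"
```

Because the explicit label is checked first, a caller can always override rules and the LLM, which also means the LLM call is skipped entirely for pre-labelled content.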

    governance:
      classification:
        default_level: internal       # When no classifier matches
        auto_classify: true           # Enable automatic classification
        auto_classify_mode: rules     # "rules" | "llm" | "rules_then_llm"

1.3 Sub-classification: data categories
Within the restricted level, data is categorized by regulatory domain:
| Category | Code | Regulatory context |
|---|---|---|
| Personally Identifiable Information | PII | GDPR, PDPA, CCPA, LGPD |
| Protected Health Information | PHI | HIPAA, local health data laws |
| Payment Card Industry data | PCI | PCI-DSS |
| Financial data | FIN | SOX, MAS regulations |
| Legal privilege | LEGAL | Attorney-client privilege |
| Trade secrets | TRADE | Trade secret law, NDA obligations |
| Children’s data | COPPA | COPPA, age-gated processing |

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class DataClassification:
        level: int              # 0-3
        label: str              # "public", "internal", "confidential", "restricted"
        categories: list[str]   # ["PII", "PHI", etc.] - for restricted data
        classified_by: str      # "caller", "rules", "llm"
        classified_at: datetime

2. PII detection and handling
2.1 Detection taxonomy
The PII barrier (introduced in policy-layer.md section 2.1) detects specific PII types:
| PII type | Detection method | Examples |
|---|---|---|
| email | Regex | user@example.com |
| phone | Regex | +1-555-0123, (555) 012-3456 |
| ssn | Regex | 123-45-6789 |
| credit_card | Regex + Luhn check | 4111-1111-1111-1111 |
| passport | Regex (country-specific) | A12345678 |
| ip_address | Regex | 192.168.1.1, 2001:db8::1 |
| date_of_birth | Regex + context | 1990-01-15, “born on January 15” |
| address | NER / LLM | “123 Main St, Anytown, CA 90210” |
| name | NER / LLM | Personal names in context |
| medical_record | LLM | Diagnosis, treatment, conditions |
| financial_account | Regex | Bank account numbers, routing numbers |
| national_id | Regex (country-specific) | NRIC (SG), Aadhaar (IN), etc. |
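To make the "Regex + Luhn check" method for credit_card concrete, here is a small sketch. The pattern and the names `CARD_RE`, `luhn_valid`, and `find_credit_cards` are illustrative assumptions, not the patterns the framework ships with; the Luhn checksum itself is the standard algorithm.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_credit_cards(text: str) -> list[str]:
    """Regex finds candidates; the Luhn check discards look-alike digit runs."""
    return [m.group(0) for m in CARD_RE.finditer(text) if luhn_valid(m.group(0))]
```

The checksum step is what keeps order numbers and other long digit strings from being flagged as cards.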
2.2 Detection modes
| Mode | How it works | Cost | Accuracy |
|---|---|---|---|
| regex | Pattern matching against configured patterns | Zero (CPU only) | High for structured PII (email, SSN), low for unstructured (names, addresses) |
| ner | Named entity recognition via spaCy or similar | Low (local model) | Good for names, orgs, locations. Misses context-dependent PII. |
| llm | LLM classifies content for sensitive data | Medium (API call) | Best. Catches context-dependent PII (“my mother’s maiden name is Smith”). |
| rules_then_llm | Regex first, LLM only if regex finds nothing | Medium | Best coverage with cost optimization. |
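The rules_then_llm row can be read as a simple fallback chain. The sketch below assumes a hypothetical `llm_detect` callable that returns `(pii_type, matched_text)` pairs; the two regex patterns are simplified stand-ins for the configured patterns.

```python
import re
from typing import Callable

# Simplified illustrative patterns; real deployments would configure their own.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

Finding = tuple[str, str]  # (pii_type, matched_text)


def detect_regex(text: str) -> list[Finding]:
    """Cheap pass: run every configured pattern over the text."""
    return [
        (pii_type, m.group(0))
        for pii_type, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def detect_rules_then_llm(
    text: str, llm_detect: Callable[[str], list[Finding]]
) -> list[Finding]:
    """Regex first; call the LLM only when regex finds nothing."""
    findings = detect_regex(text)
    if findings:
        return findings
    return llm_detect(text)
```

The cost saving comes from the early return: content with structured PII never triggers an API call, while context-dependent PII still reaches the LLM.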
2.3 Actions per PII type
Different PII types can have different actions:

    governance:
      pii:
        mode: rules_then_llm
        default_action: redact
        type_overrides:
          email:
            action: redact
            replacement: "[EMAIL_REDACTED]"
          credit_card:
            action: reject          # Never store credit cards
          name:
            action: warn            # Names are often needed for context
          medical_record:
            action: reject
          address:
            action: redact
            replacement: "[ADDRESS_REDACTED]"

2.4 Redaction strategy
When action: redact:

    Input:  "Calvin's email is calvin@example.com and he lives at 123 Main St"
    Output: "Calvin's email is [EMAIL_REDACTED] and he lives at [ADDRESS_REDACTED]"

Redaction is applied before the content reaches the memory provider. The provider never sees the original content. This is the blood-brain barrier (BBB): the barrier is at the boundary, not inside the brain.
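As a rough sketch of how per-type actions and redaction could combine, the function below takes already-detected spans and applies redact, reject, or warn per type. `Finding`, `PiiRejected`, and `apply_pii_policy` are hypothetical names, and the fallback replacement format `[TYPE_REDACTED]` is an assumption extrapolated from the two examples in this section.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    pii_type: str
    start: int  # offset of the first character of the PII span
    end: int    # offset one past the last character


class PiiRejected(Exception):
    """Raised when a detected PII type is configured with action: reject."""


def apply_pii_policy(
    text: str,
    findings: list[Finding],
    type_overrides: dict[str, dict],
    default_action: str = "redact",
) -> tuple[str, list[str]]:
    """Return (possibly redacted text, PII types that only produced a warning)."""
    warnings: list[str] = []
    # Replace right-to-left so earlier offsets stay valid after each substitution.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        rule = type_overrides.get(f.pii_type, {})
        action = rule.get("action", default_action)
        if action == "reject":
            raise PiiRejected(f.pii_type)
        if action == "warn":
            warnings.append(f.pii_type)
            continue
        # Assumed default placeholder when no replacement is configured.
        replacement = rule.get("replacement", f"[{f.pii_type.upper()}_REDACTED]")
        text = text[: f.start] + replacement + text[f.end :]
    return text, warnings
```

With the overrides from section 2.3, the input sentence above yields the redacted output shown, and the name "Calvin" is kept but reported as a warning.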
Reversible redaction (optional, for authorized recovery):

    governance:
      pii:
        redaction:
          reversible: true                          # Store encrypted original alongside redacted
          encryption_key_ref: ${PII_ENCRYPTION_KEY}

When reversible: true, the original PII is encrypted and stored in a separate, access-controlled field. Only principals with admin permission and an explicit pii_access grant can decrypt it.
3. Data residency
3.1 The problem
Regulations require certain data to stay within geographic boundaries:
- GDPR: EU personal data must be processed within the EU (or with adequate safeguards)
- PDPA: Singapore personal data has cross-border transfer restrictions
- China PIPL: personal data often must remain in China
Memory systems involve two data flows that cross boundaries:
- Storage: where the memory provider stores vectors and metadata
- LLM processing: where retain (entity extraction) and reflect (synthesis) send content
3.2 Residency zones

    governance:
      residency:
        zones:
          eu:
            regions: [eu-west-1, eu-central-1]
            regulations: [gdpr]
          sg:
            regions: [ap-southeast-1]
            regulations: [pdpa]
          us:
            regions: [us-east-1, us-west-2]
            regulations: [ccpa, hipaa]

        bank_assignments:
          "user-eu-*":              # Wildcard matching
            zone: eu
          "user-sg-*":
            zone: sg
          default:
            zone: us

3.3 Residency enforcement
When a bank is assigned to a residency zone:
- Tier 1 (retrieval): the framework validates that the configured retrieval provider’s region matches the zone. Misconfiguration is a startup error (fail-fast).
- LLM calls: the framework routes LLM provider calls to region-appropriate endpoints.
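For the Tier 1 check, wildcard bank assignment followed by a fail-fast region validation might look like the sketch below. It mirrors the zone configuration from section 3.2; `zone_for_bank`, `validate_provider_region`, and `ResidencyConfigError` are hypothetical names, and shell-style matching via `fnmatch` is an assumption about how the wildcards behave.

```python
from fnmatch import fnmatchcase

# Mirrors the example zones and bank assignments from section 3.2.
ZONES = {
    "eu": {"eu-west-1", "eu-central-1"},
    "sg": {"ap-southeast-1"},
    "us": {"us-east-1", "us-west-2"},
}
BANK_ASSIGNMENTS = {"user-eu-*": "eu", "user-sg-*": "sg"}
DEFAULT_ZONE = "us"


class ResidencyConfigError(Exception):
    """Raised at startup when a provider region is outside the bank's zone."""


def zone_for_bank(bank_id: str) -> str:
    """Resolve a bank to its zone: first matching wildcard wins, else the default."""
    for pattern, zone in BANK_ASSIGNMENTS.items():
        if fnmatchcase(bank_id, pattern):
            return zone
    return DEFAULT_ZONE


def validate_provider_region(bank_id: str, provider_region: str) -> str:
    """Fail fast if the retrieval provider's region is not part of the bank's zone."""
    zone = zone_for_bank(bank_id)
    if provider_region not in ZONES[zone]:
        raise ResidencyConfigError(
            f"bank {bank_id!r} is in zone {zone!r}, "
            f"but the provider region is {provider_region!r}"
        )
    return zone
```

Running this check once at startup, rather than per request, is what turns a misconfiguration into an immediate error instead of a silent residency breach.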

    governance:
      residency:
        llm_routing:
          eu:
            llm_provider: litellm
            llm_provider_config:
              model: azure/gpt-4o
              api_base: https://eu-west.openai.azure.com   # EU endpoint
          sg:
            llm_provider: litellm
            llm_provider_config:
              model: bedrock/anthropic.claude-sonnet-4-20250514-v1:0
              aws_region: ap-southeast-1                   # Singapore

3.4 Cross-border transfer controls
Multi-bank recall across zones requires explicit configuration:

    governance:
      residency:
        cross_border:
          allowed: false            # Default: no cross-zone recall
          exceptions:
            - from: eu
              to: us
              requires: [adequacy_decision]   # Document the legal basis
              log: true                       # Audit every cross-border access

When allowed: false and a multi-bank recall spans zones, the framework either:
- Excludes banks from other zones (with a warning in the result)
- Returns a CrossBorderViolation error (if strict: true)
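The two outcomes above can be sketched as one selection step that runs before a multi-bank recall. `select_banks` and the notion of a `home_zone` for the request are assumptions made for illustration; the configured exceptions list is deliberately left out to keep the sketch short.

```python
class CrossBorderViolation(Exception):
    """Raised when strict mode is on and a recall would span zones."""


def select_banks(
    home_zone: str,
    bank_zones: dict[str, str],
    *,
    allowed: bool = False,
    strict: bool = False,
) -> tuple[list[str], list[str]]:
    """Return (banks to query, warnings) for a multi-bank recall.

    `bank_zones` maps each requested bank ID to its residency zone.
    """
    if allowed:
        return list(bank_zones), []

    # Banks whose zone differs from the zone the request is served from.
    foreign = {bank: zone for bank, zone in bank_zones.items() if zone != home_zone}
    if foreign and strict:
        raise CrossBorderViolation(
            f"recall from zone {home_zone!r} spans zones {sorted(set(foreign.values()))}"
        )

    kept = [bank for bank in bank_zones if bank not in foreign]
    warnings = [
        f"bank {bank!r} excluded: zone {zone!r} differs from {home_zone!r}"
        for bank, zone in foreign.items()
    ]
    return kept, warnings
```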
4. Encryption
4.1 In-transit
All provider communication must use TLS. The framework validates:
- Tier 1 retrieval provider endpoints use https:// or encrypted database connections (SSL mode)
- LLM provider endpoints use https://
- MCP server SSE transport uses TLS when exposed externally
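A minimal version of the endpoint check could look like this. It assumes database endpoints are given as PostgreSQL connection URIs with a libpq-style `sslmode` query parameter, which the document does not specify; `endpoint_uses_tls` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit

# libpq sslmode values that guarantee an encrypted connection.
ENCRYPTED_SSLMODES = {"require", "verify-ca", "verify-full"}


def endpoint_uses_tls(url: str) -> bool:
    """Return True if the endpoint URL implies an encrypted transport."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    if parts.scheme in {"postgresql", "postgres"}:
        # Assumption: database endpoints are PostgreSQL URIs carrying sslmode.
        sslmode = parse_qs(parts.query).get("sslmode", [""])[0]
        return sslmode in ENCRYPTED_SSLMODES
    return False
```

Modes such as `prefer` are treated as unencrypted here because they allow a silent fallback to plaintext.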

    governance:
      encryption:
        require_tls: true         # Reject non-TLS provider endpoints
        min_tls_version: "1.2"

4.2 At-rest
At-rest encryption is the provider’s responsibility (database-level encryption, cloud KMS). Astrocyte validates that the provider reports encryption capability:

    from dataclasses import dataclass
    from typing import Protocol

    class EngineCapabilities:
        # ... existing fields ...
        encryption_at_rest: bool = False   # Does the provider encrypt stored data?

    # Defined before VectorStore so the return annotation below resolves.
    @dataclass
    class StorageInfo:
        encrypted_at_rest: bool
        encryption_method: str | None      # "AES-256", "AWS KMS", etc.
        region: str | None                 # Where data is physically stored

    class VectorStore(Protocol):
        def storage_info(self) -> StorageInfo:
            """Report storage characteristics including encryption."""
            ...

If governance.encryption.require_at_rest: true and the provider reports encrypted_at_rest: false, the framework refuses to initialize.
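The refusal to initialize can be pictured as a small startup guard. This is a sketch under assumptions: `check_at_rest` and `EncryptionPolicyError` are hypothetical names, and `StorageInfo` is redeclared here only so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageInfo:
    encrypted_at_rest: bool
    encryption_method: Optional[str] = None  # e.g. "AES-256", "AWS KMS"
    region: Optional[str] = None             # where data is physically stored


class EncryptionPolicyError(Exception):
    """Raised at startup when at-rest encryption is required but not reported."""


def check_at_rest(info: StorageInfo, require_at_rest: bool) -> None:
    """Refuse to continue if policy requires at-rest encryption and it is absent."""
    if require_at_rest and not info.encrypted_at_rest:
        raise EncryptionPolicyError(
            "governance.encryption.require_at_rest is true, "
            "but the provider reports encrypted_at_rest=False"
        )
```

Note that the framework can only trust what the provider reports; the check validates the declaration, not the actual disk encryption.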
4.3 Field-level encryption
For sensitive metadata fields, the framework can encrypt specific values before passing them to the provider:

    governance:
      encryption:
        field_level:
          enabled: true
          key_ref: ${FIELD_ENCRYPTION_KEY}
          encrypted_metadata_keys:
            - customer_id
            - account_number
            - internal_reference

Encrypted fields are stored as opaque ciphertext in the provider. The framework decrypts them on recall; the provider cannot read them.
5. Regulatory compliance profiles
5.1 Pre-built compliance configurations
Like use-case profiles (use-case-profiles.md), compliance profiles configure governance policies for specific regulatory regimes:

    governance:
      compliance_profile: gdpr    # "gdpr" | "hipaa" | "pdpa" | "ccpa" | "pci" | "none"

5.2 Profile definitions
GDPR profile:

    # governance.compliance_profiles.gdpr
    classification:
      auto_classify: true
      auto_classify_mode: rules_then_llm
    pii:
      mode: rules_then_llm
      default_action: redact
      type_overrides:
        name: { action: redact }
        email: { action: redact }
        phone: { action: redact }
        address: { action: redact }
    lifecycle:
      right_to_forget: true                 # brain.forget(compliance=True) available
      ttl:
        archive_unretrieved_after_days: 365
        delete_archived_after_days: 730     # 2 years max retention
      audit:
        enabled: true
        retention_days: 2555                # 7 years for audit records
    residency:
      cross_border:
        allowed: false                      # Default deny cross-border
    encryption:
      require_tls: true
      require_at_rest: true
    access_control:
      enabled: true
      default_policy: deny                  # Explicit grants required
    dlp:
      enabled: true
      block_pii_in_reflect: true            # Don't synthesize PII into answers
      block_pii_in_export: true

HIPAA profile:

    # governance.compliance_profiles.hipaa
    classification:
      auto_classify: true
      auto_classify_mode: llm         # LLM is best at detecting PHI
    pii:
      mode: llm
      default_action: reject          # HIPAA: don't store PHI unless explicitly designed for it
      type_overrides:
        medical_record: { action: reject }
        name: { action: redact }
    lifecycle:
      audit:
        enabled: true
        retention_days: 2555
        include_content_hash: true    # Audit record includes content hash (not content)
    encryption:
      require_tls: true
      require_at_rest: true
      field_level:
        enabled: true
    access_control:
      enabled: true
      default_policy: deny
    dlp:
      enabled: true
      block_phi_in_reflect: true

PDPA (Singapore) profile:

    # governance.compliance_profiles.pdpa
    classification:
      auto_classify: true
      auto_classify_mode: rules_then_llm
    pii:
      mode: rules_then_llm
      default_action: redact
      type_overrides:
        national_id: { action: reject }     # NRIC must not be stored
    lifecycle:
      right_to_forget: true
      ttl:
        delete_archived_after_days: 1825    # 5 years max retention (common PDPA guidance)
      audit:
        enabled: true
    residency:
      cross_border:
        allowed: true                       # PDPA allows with safeguards
        requires: [transfer_impact_assessment]
        log: true
    encryption:
      require_tls: true
    access_control:
      enabled: true
      default_policy: owner_only

5.3 Composing compliance profiles
Multiple profiles can be composed (strictest rule wins):

    governance:
      compliance_profile: [gdpr, pci]   # Both GDPR and PCI-DSS

When profiles conflict, the more restrictive rule applies. For example, if GDPR says redact and PCI says reject for the same PII type, the result is reject.
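For PII actions, "strictest rule wins" amounts to taking a maximum over a severity ordering. The sketch below assumes the ordering warn < redact < reject; the text only confirms that reject beats redact, so the position of warn is an assumption, as are the names `SEVERITY` and `compose_type_overrides`.

```python
# Assumed severity ordering; only "reject beats redact" is stated in the text.
SEVERITY = {"warn": 0, "redact": 1, "reject": 2}


def compose_type_overrides(*profiles: dict[str, str]) -> dict[str, str]:
    """Merge per-type PII actions from several profiles, keeping the strictest."""
    merged: dict[str, str] = {}
    for profile in profiles:
        for pii_type, action in profile.items():
            current = merged.get(pii_type)
            if current is None or SEVERITY[action] > SEVERITY[current]:
                merged[pii_type] = action
    return merged


# Hypothetical per-profile actions, for illustration only.
gdpr = {"credit_card": "redact", "name": "redact"}
pci = {"credit_card": "reject"}
composed = compose_type_overrides(gdpr, pci)
```

Here `composed` maps credit_card to reject and leaves name at redact, and the result does not depend on the order in which profiles are listed.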
6. Data lineage
6.1 The problem
For compliance and debugging, you need to answer:
- Where did this memory come from? (source)
- What transformations were applied? (pipeline actions)
- Where has this memory been sent? (consumption)
- Who accessed it? (audit)
6.2 Lineage metadata
Every memory carries lineage metadata, automatically maintained by the framework:

    from __future__ import annotations    # Lets DataLineage reference classes defined below

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class DataLineage:
        source: LineageSource
        transformations: list[LineageTransformation]
        access_log: list[LineageAccess]   # Populated on recall/reflect/export

    @dataclass
    class LineageSource:
        origin: str                       # "api:retain", "import:ama", "integration:langgraph"
        principal: str                    # Who stored it
        timestamp: datetime
        classification: DataClassification    # Defined in section 1.3
        source_system: str | None         # External system identifier
        external_id: str | None           # ID in source system

    @dataclass
    class LineageTransformation:
        action: str                       # "pii_redacted", "classified", "consolidated", "re_embedded"
        timestamp: datetime
        details: dict[str, str]           # e.g., {"pii_type": "email", "action": "redact"}

    @dataclass
    class LineageAccess:
        operation: str                    # "recall", "reflect", "export"
        principal: str
        timestamp: datetime
        bank_id: str

6.3 Lineage in practice

    # Query lineage for a specific memory
    lineage = await brain.get_lineage(bank_id="user-123", memory_id="mem_001")

    # Query all memories from a specific source
    memories = await brain.recall(
        "all memories",
        bank_id="user-123",
        metadata_filters={"_lineage_source": "integration:crewai"},
    )

6.4 Lineage storage
Lineage metadata is stored alongside memory metadata. It is:
- Included in AMA exports (memory-portability.md)
- Included in audit trail events (memory-lifecycle.md)
- Queryable via metadata filters on recall
- Never deleted by TTL policies (lineage of deleted memories is retained in the audit log)
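To illustrate how lineage might accumulate over a memory's life, the sketch below appends transformation and access records to a simplified lineage object. `MemoryLineage`, `record_transformation`, and `record_access` are hypothetical, and `MemoryLineage` is a cut-down stand-in for the DataLineage structure in section 6.2, not the real DTO.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageTransformation:
    action: str
    timestamp: datetime
    details: dict[str, str]


@dataclass
class LineageAccess:
    operation: str
    principal: str
    timestamp: datetime
    bank_id: str


@dataclass
class MemoryLineage:
    """Simplified stand-in for DataLineage: the source is reduced to its origin."""
    origin: str
    transformations: list[LineageTransformation] = field(default_factory=list)
    access_log: list[LineageAccess] = field(default_factory=list)


def record_transformation(
    lineage: MemoryLineage, action: str, details: dict[str, str]
) -> None:
    """Append a pipeline action, e.g. after the PII barrier redacts content."""
    lineage.transformations.append(
        LineageTransformation(action, datetime.now(timezone.utc), dict(details))
    )


def record_access(
    lineage: MemoryLineage, operation: str, principal: str, bank_id: str
) -> None:
    """Append an access entry on recall, reflect, or export."""
    lineage.access_log.append(
        LineageAccess(operation, principal, datetime.now(timezone.utc), bank_id)
    )
```

Both lists are append-only in this sketch, which matches the idea that lineage is a history rather than mutable state.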
7. Data Loss Prevention (DLP)
7.1 The problem
Even with PII redaction on the retain path, sensitive data can leak through:
- Reflect synthesis: the LLM might reconstruct PII from surrounding context
- Export: bulk export could extract sensitive data from the memory provider
- Recall results: returning raw memories to unauthorized callers
- Cross-bank leakage: multi-bank recall exposing data across isolation boundaries
7.2 DLP controls

    governance:
      dlp:
        enabled: true

        # Reflect output scanning
        scan_reflect_output: true           # Scan synthesis for PII before returning
        reflect_pii_action: redact          # "redact" | "reject" | "warn"

        # Export controls
        require_export_approval: false      # If true, exports require admin principal
        block_restricted_in_export: true    # Don't include restricted-level memories in exports
        strip_metadata_on_export:           # Remove sensitive metadata fields from exports
          - customer_id
          - account_number

        # Recall output scanning
        scan_recall_output: false           # Usually off (high cost); enable for regulated workloads
        recall_pii_action: warn

        # Cross-bank controls
        enforce_classification_boundary: true   # Don't fuse restricted + public bank results

7.3 Reflect output scanning
When scan_reflect_output: true, the framework runs PII detection on the LLM’s synthesis output before returning it to the caller:

    flowchart TD
        C[caller: brain.reflect] --> P1[Policy: rate limit, access control]
        P1 --> GEN[Provider or pipeline: synthesis]
        GEN --> DLP[DLP: scan synthesis for PII]
        DLP -->|redact| R1[Redact PII in answer + warning]
        DLP -->|reject| R2[Error - no answer]
        DLP -->|warn| R3[Answer + pii_warning flag]
        DLP -->|none| R4[Normal answer]
        R1 --> OUT[ReflectResult]
        R2 --> OUT
        R3 --> OUT
        R4 --> OUT

This catches cases where the LLM reconstructs PII from context even though the stored memories were redacted.
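The branches of the flow above can be sketched as follows. Only email detection is shown, using a simplified regex; the fields on `ReflectResult` other than the pii_warning flag, the `DlpRejected` exception, and the warning string are assumptions for illustration.

```python
import re
from dataclasses import dataclass, field

# Simplified email pattern; a real scan would reuse the full PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class ReflectResult:
    answer: str
    pii_warning: bool = False
    warnings: list[str] = field(default_factory=list)


class DlpRejected(Exception):
    """Raised when reflect_pii_action is 'reject' and PII is found."""


def scan_reflect_output(answer: str, action: str = "redact") -> ReflectResult:
    """Apply the configured reflect_pii_action to a synthesized answer."""
    if not EMAIL_RE.search(answer):
        return ReflectResult(answer)                    # none: normal answer
    if action == "reject":
        raise DlpRejected("PII detected in reflect output")
    if action == "warn":
        return ReflectResult(answer, pii_warning=True)  # answer + flag
    # redact: replace the PII and attach a warning to the result.
    redacted = EMAIL_RE.sub("[EMAIL_REDACTED]", answer)
    return ReflectResult(redacted, warnings=["pii_redacted_in_reflect"])
```

Because the scan runs on the synthesized text rather than the stored memories, it covers the reconstruction case regardless of how the PII reached the answer.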
7.4 Classification boundary enforcement
When enforce_classification_boundary: true and a multi-bank recall spans banks with different classification levels:
- Results are partitioned by classification level before fusion
- restricted memories are only included if the caller has appropriate access grants
- Cross-level fusion can be configured:

    governance:
      dlp:
        classification_fusion:
          allow_cross_level: false        # Don't mix restricted + internal in results
          # OR
          # allow_cross_level: true
          # downgrade_restricted: true    # Redact restricted content before fusing

8. Governance observability
8.1 Governance-specific metrics
| Metric | Type | Labels |
|---|---|---|
| astrocyte_classification_total | Counter | bank_id, level, classified_by |
| astrocyte_pii_detected_total | Counter | bank_id, pii_type, action |
| astrocyte_pii_redacted_total | Counter | bank_id, pii_type |
| astrocyte_pii_rejected_total | Counter | bank_id, pii_type |
| astrocyte_dlp_reflect_blocked_total | Counter | bank_id |
| astrocyte_dlp_export_blocked_total | Counter | bank_id |
| astrocyte_residency_violation_total | Counter | from_zone, to_zone |
| astrocyte_compliance_forget_total | Counter | bank_id, regulation |
| astrocyte_legal_hold_active | Gauge | bank_id |
| astrocyte_classification_distribution | Gauge | bank_id, level |
8.2 Governance dashboard
Key panels for a governance dashboard:
- Classification distribution: what percentage of memories are public/internal/confidential/restricted per bank
- PII detection rate: how often is PII detected, what types, what actions taken
- DLP blocks: how often is reflect/export blocked by DLP
- Compliance operations: forget requests, legal holds, cross-border access attempts
- Residency compliance: all banks mapped to zones, any violations
8.3 Governance audit events
All governance actions are emitted as audit events (see memory-lifecycle.md section 5):
| Event | Description |
|---|---|
| governance.classified | Content classified at a sensitivity level |
| governance.pii_detected | PII detected in content |
| governance.pii_redacted | PII redacted before storage |
| governance.pii_rejected | Retain rejected due to PII policy |
| governance.dlp_reflect_blocked | Reflect output blocked by DLP |
| governance.dlp_export_blocked | Export blocked by DLP |
| governance.residency_violation | Cross-border access attempted |
| governance.residency_routed | LLM call routed to region-specific endpoint |
| governance.encryption_validated | Provider encryption validated at startup |
| governance.compliance_forget | Compliance-driven forget executed |
| governance.legal_hold_set | Legal hold placed on bank |
| governance.legal_hold_released | Legal hold released from bank |
9. Configuration reference
Complete governance configuration:

    governance:
      # Compliance profile (sets defaults for everything below)
      compliance_profile: gdpr    # "gdpr" | "hipaa" | "pdpa" | "ccpa" | "pci" | "none" | list

      # Data classification
      classification:
        default_level: internal
        auto_classify: true
        auto_classify_mode: rules_then_llm    # "rules" | "llm" | "rules_then_llm"
        rules:
          - pattern: "credit card|card number|CVV"
            level: restricted
            categories: [PCI]
          - pattern: "diagnosis|treatment|medication|patient"
            level: restricted
            categories: [PHI]

      # PII detection and handling
      pii:
        mode: rules_then_llm
        default_action: redact
        type_overrides: {}          # Per-type action overrides
        custom_patterns: []         # Additional regex patterns
        redaction:
          reversible: false
          encryption_key_ref: null

      # Data residency
      residency:
        zones: {}
        bank_assignments: {}
        llm_routing: {}
        cross_border:
          allowed: false
          strict: false
          exceptions: []

      # Encryption
      encryption:
        require_tls: true
        min_tls_version: "1.2"
        require_at_rest: false
        field_level:
          enabled: false
          key_ref: null
          encrypted_metadata_keys: []

      # Data Loss Prevention
      dlp:
        enabled: false
        scan_reflect_output: false
        reflect_pii_action: redact
        require_export_approval: false
        block_restricted_in_export: false
        strip_metadata_on_export: []
        scan_recall_output: false
        enforce_classification_boundary: false

      # Per-bank overrides
      bank_overrides:
        sensitive-customer:
          compliance_profile: [gdpr, hipaa]
          pii:
            mode: llm
            default_action: reject
          dlp:
            enabled: true
            scan_reflect_output: true

10. Governance and the two-tier model
| Governance feature | Tier 1 (Storage) | Tier 2 (Memory Engine) |
|---|---|---|
| Data classification | Framework classifies on retain path | Framework classifies on retain path |
| PII detection | Framework scans before pipeline processes | Framework scans before forwarding to engine |
| PII redaction | Framework redacts before embedding/storage | Framework redacts before engine.retain() |
| Residency enforcement | Framework validates retrieval provider region | Framework validates engine endpoint region |
| LLM routing by region | Framework routes LLM SPI calls per zone | Engine handles its own LLM calls (residency must be configured in engine) |
| Encryption validation | Framework checks retrieval provider | Framework checks engine capabilities |
| DLP on reflect | Framework scans pipeline synthesis output | Framework scans engine reflect output |
| Audit trail | Framework logs all governance events | Framework logs all governance events |
| Compliance forget | Framework calls retrieval SPI delete | Framework calls engine.forget() |
Key point: governance is enforced at the framework layer, not delegated to providers. This ensures consistent data protection regardless of which backend is active.
11. Relationship to other docs
| Concern | Primary doc | How this doc extends it |
|---|---|---|
| PII scanning mechanism | policy-layer.md section 2.1 | Adds classification taxonomy, per-type actions, reversible redaction, DLP |
| Use-case PII presets | use-case-profiles.md | Adds compliance profiles (GDPR, HIPAA, PDPA, CCPA, PCI) |
| Compliance forget, legal hold | memory-lifecycle.md | Adds regulatory context, retention minimums/maximums, cross-border controls |
| Access control | access-control.md | Adds classification-based access (restricted data requires explicit grant) |
| Audit events | event-hooks.md | Adds governance-specific event types |
| Memory export | memory-portability.md | Adds DLP controls on export (strip metadata, block restricted) |
| Reflect synthesis | built-in-pipeline.md | Adds DLP scanning on reflect output |
| Portable DTO constraints | implementation-language-strategy.md | DataClassification and DataLineage DTOs follow portable-type rules |