Bank health & utilization

The observability layer (see policy-layer.md) captures per-operation metrics and traces. This document describes in-process memory analytics: system-level health assessment of whether memory is helping agents or just accumulating noise. For warehouse / lakehouse export (append-only events to external tables), see memory-export-sink.md and storage-and-data-planes.md; do not confuse that SPI with the bank health metrics described here.

This maps to Principle 9 (Observable state) taken to its full conclusion: not just tracing operations, but understanding the health of the memory system as a whole.


Each bank gets a health score derived from usage patterns.

| Metric | What it measures | Healthy range |
| --- | --- | --- |
| `recall_hit_rate` | % of recalls that return >=1 result | 60-95% |
| `avg_recall_score` | Average relevance score of top hit | 0.5-0.9 |
| `retain_rate` | Retains per hour | Depends on use case |
| `dedup_rate` | % of retains that are near-duplicates | <20% |
| `avg_content_length` | Average retained content length (chars) | >50 |
| `recall_to_retain_ratio` | How often memory is read vs. written | >0.5 |
| `reflect_success_rate` | % of reflects that produce a non-empty answer | >80% |
| `memory_count` | Total memories in bank | Depends on use case |
| `entity_count` | Unique entities tracked | Growing over time |
| `staleness` | % of memories not recalled in last 30 days | <70% |
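As a rough illustration, the healthy ranges above can be folded into a single score. The equal-weight combination below is a hypothetical simplification, not Astrocyte's actual weighting:

```python
# Hypothetical sketch: combine per-metric checks into one 0.0-1.0 score.
# Thresholds mirror the "healthy range" column above; the real scoring
# logic and weights inside Astrocyte may differ.
def score_bank(metrics: dict[str, float]) -> float:
    checks = [
        0.60 <= metrics.get("recall_hit_rate", 0.0) <= 0.95,
        0.5 <= metrics.get("avg_recall_score", 0.0) <= 0.9,
        metrics.get("dedup_rate", 1.0) < 0.20,
        metrics.get("avg_content_length", 0.0) > 50,
        metrics.get("recall_to_retain_ratio", 0.0) > 0.5,
        metrics.get("reflect_success_rate", 0.0) > 0.80,
        metrics.get("staleness", 1.0) < 0.70,
    ]
    return sum(checks) / len(checks)

print(score_bank({
    "recall_hit_rate": 0.85, "avg_recall_score": 0.7, "dedup_rate": 0.05,
    "avg_content_length": 120, "recall_to_retain_ratio": 1.4,
    "reflect_success_rate": 0.9, "staleness": 0.3,
}))  # all checks pass → 1.0
```

Use-case-dependent metrics (`retain_rate`, `memory_count`, `entity_count`) are deliberately excluded, since they have no universal healthy range.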
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class BankHealth:
    bank_id: str
    score: float  # 0.0 (unhealthy) to 1.0 (healthy)
    status: Literal["healthy", "warning", "unhealthy"]
    issues: list[HealthIssue]
    metrics: dict[str, float]
    assessed_at: datetime

@dataclass
class HealthIssue:
    severity: Literal["info", "warning", "critical"]
    code: str  # e.g., "HIGH_DEDUP_RATE", "LOW_RECALL_HIT_RATE"
    message: str
    recommendation: str
```

```python
# Single bank
health = await brain.bank_health("user-123")
# → BankHealth(score=0.82, status="healthy", issues=[...])

# All banks
healths = await brain.all_bank_health()
# → [BankHealth(...), BankHealth(...), ...]
```

Agents in loops or with misconfigured auto-retain can flood memory with low-value content. Analytics detects this.

| Signal | Threshold (configurable) | Meaning |
| --- | --- | --- |
| Retain rate spike | >5x rolling 1-hour average | Agent is in a loop |
| Dedup rate spike | >80% of retains are duplicates | Agent is re-storing the same content |
| Content length drop | Average <20 chars over 100 retains | Agent is storing junk |
| Recall hit rate drop | <20% over 100 recalls | Memories are not relevant to queries |
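The retain-rate spike rule can be sketched with a rolling window. This is an illustrative implementation, not Astrocyte's internal detector:

```python
from collections import deque

class RetainSpikeDetector:
    """Hypothetical sketch of the ">5x rolling 1-hour average" rule above.

    Feed it one retain count per minute; a spike is flagged once the
    rolling window is full and the latest count exceeds `factor` times
    the window average.
    """

    def __init__(self, window_minutes: int = 60, factor: float = 5.0):
        self.counts: deque[int] = deque(maxlen=window_minutes)
        self.factor = factor

    def observe(self, retains_this_minute: int) -> bool:
        spiking = False
        if len(self.counts) == self.counts.maxlen:
            avg = sum(self.counts) / len(self.counts)
            spiking = avg > 0 and retains_this_minute > self.factor * avg
        self.counts.append(retains_this_minute)
        return spiking

det = RetainSpikeDetector(window_minutes=5)
for n in [2, 3, 2, 3, 2]:
    det.observe(n)
print(det.observe(30))  # 30 > 5 * 2.4 → True
```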
```yaml
analytics:
  noisy_agent:
    enabled: true
    action: warn              # "warn" | "throttle" | "pause"
    throttle_to_percent: 10   # If action=throttle, allow 10% of retains through
    alert_hook: on_noisy_agent_detected  # Trigger event hook (see event-hooks.md)
```
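A hook wired via `alert_hook` might look like the following. The handler name matches the config above, but the actual hook signature and event payload are defined in event-hooks.md; the `event` dict shape here is an assumption:

```python
import asyncio

# Hypothetical handler; the real signature may differ (see event-hooks.md).
async def on_noisy_agent_detected(event: dict) -> str:
    # Real handlers might page an operator or pause the offending agent.
    return f"noisy agent on bank {event['bank_id']}: action={event['action']}"

msg = asyncio.run(on_noisy_agent_detected({"bank_id": "user-123", "action": "warn"}))
print(msg)  # noisy agent on bank user-123: action=warn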

Understanding what’s actually useful in memory.

```python
report = await brain.utilization_report("user-123")
```

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime

@dataclass
class UtilizationReport:
    bank_id: str
    total_memories: int
    active_memories: int    # Recalled at least once in last 30 days
    stale_memories: int     # Not recalled in last 30 days
    orphaned_entities: int  # Entities with no linked memories
    top_recalled: list[MemoryUsage]  # Most frequently recalled memories
    never_recalled: int     # Memories never recalled since creation
    fact_type_distribution: dict[str, int]  # {"world": 120, "experience": 85, ...}
    tag_distribution: dict[str, int]        # {"preference": 40, "technical": 60, ...}
    storage_estimate_bytes: int  # Approximate storage used
    period: tuple[datetime, datetime]  # Analysis period

@dataclass
class MemoryUsage:
    memory_id: str
    text: str  # First 100 chars
    recall_count: int
    last_recalled_at: datetime
```
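One practical use of the report is picking banks for cleanup. A hypothetical helper, using `SimpleNamespace` objects shaped like `UtilizationReport`:

```python
from types import SimpleNamespace

def cleanup_candidates(reports, stale_fraction: float = 0.5) -> list[str]:
    """Banks where more than `stale_fraction` of memories are stale
    (hypothetical helper; field names follow UtilizationReport above)."""
    return [
        r.bank_id
        for r in reports
        if r.total_memories and r.stale_memories / r.total_memories > stale_fraction
    ]

reports = [
    SimpleNamespace(bank_id="assistant-a", total_memories=100, stale_memories=80),
    SimpleNamespace(bank_id="assistant-b", total_memories=100, stale_memories=10),
]
print(cleanup_candidates(reports))  # ['assistant-a']
```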

Track memory quality over time:

```python
trends = await brain.quality_trends(
    bank_id="user-123",
    period_days=30,
    granularity="daily",
)
```

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date

@dataclass
class QualityTrends:
    bank_id: str
    data_points: list[QualityDataPoint]

@dataclass
class QualityDataPoint:
    date: date
    retain_count: int
    recall_count: int
    recall_hit_rate: float
    avg_recall_score: float
    dedup_rate: float
    reflect_success_rate: float
```

4. Prometheus metrics (analytics-specific)


In addition to the per-operation metrics in policy-layer.md, analytics exports:

| Metric | Type | Labels |
| --- | --- | --- |
| `astrocyte_bank_health_score` | Gauge | `bank_id` |
| `astrocyte_bank_memory_count` | Gauge | `bank_id`, `fact_type` |
| `astrocyte_bank_stale_memory_percent` | Gauge | `bank_id` |
| `astrocyte_bank_dedup_rate` | Gauge | `bank_id` |
| `astrocyte_bank_recall_hit_rate` | Gauge | `bank_id` |
| `astrocyte_noisy_agent_detected_total` | Counter | `bank_id`, `action` |

Analytics metrics are designed for Grafana or equivalent dashboards:

  • Bank overview: health score, memory count, utilization across all banks
  • Agent activity: retain/recall rates per bank, noisy agent alerts
  • Quality trends: recall hit rate and dedup rate over time
  • Lifecycle status: memories by state (active, archived, deleted), TTL expirations

```yaml
analytics:
  enabled: true
  health_check_schedule: "*/15 * * * *"     # Every 15 minutes
  utilization_report_schedule: "0 4 * * *"  # 4am daily
  trends_retention_days: 90                 # Keep trend data for 90 days
  noisy_agent:
    enabled: true
    action: warn
```

7. Durable analytical plane (warehouses and lakehouses)


The in-process bank health, utilization, and Prometheus metrics above serve operational observability. For warehouse-scale history, BI, and compliance over SQL, Parquet, Iceberg, Delta, and similar layouts, use the Memory Export Sink SPI and the `memory_export_sinks:` config block (memory-export-sink.md, ecosystem-and-packaging.md). Sinks complement this module rather than replace it: sinks focus on append-only event streams to external tables, while this document focuses on scores and dashboards inside the Astrocyte process.