Evaluation and benchmarking
Astrocyte ships built-in tools to measure memory quality, compare providers, and monitor accuracy over time. This enables data-driven provider selection and regression detection.
1. Why evaluation matters
Section titled “1. Why evaluation matters”- Users choosing between Tier 1 (built-in pipeline + pgvector) and Tier 2 (Mystique) need objective comparison data.
- Production systems need ongoing quality monitoring - memory quality can degrade as banks grow or content patterns change.
- Provider upgrades need regression testing - did the new version of Mem0 make recall worse?
2. Evaluation API
Section titled “2. Evaluation API”from astrocyte.eval import MemoryEvaluator
brain = Astrocyte.from_config("astrocyte.yaml")evaluator = MemoryEvaluator(brain)
# Run a standard benchmark suiteresults = await evaluator.run_suite( suite="basic", # "basic" | "longmemeval" | custom path bank_id="eval-test-bank", # Dedicated test bank (created if needed) clean_after=True, # Delete test bank after evaluation)
print(results.summary())2.1 Evaluation result
Section titled “2.1 Evaluation result”@dataclassclass EvalResult: suite: str provider: str provider_tier: str # "storage" or "engine" timestamp: datetime metrics: EvalMetrics per_query_results: list[QueryResult] config_snapshot: dict # Astrocyte config at eval time
@dataclassclass EvalMetrics: # Recall accuracy recall_precision: float # Relevant results / total results recall_hit_rate: float # Queries with >=1 relevant result / total queries recall_mrr: float # Mean reciprocal rank of first relevant result recall_ndcg: float # Normalized discounted cumulative gain
# Reflect quality (if suite includes reflect tests) reflect_accuracy: float | None # LLM-judged answer correctness reflect_completeness: float | None # LLM-judged answer completeness reflect_hallucination_rate: float | None # LLM-judged hallucination %
# Performance retain_latency_p50_ms: float retain_latency_p95_ms: float recall_latency_p50_ms: float recall_latency_p95_ms: float reflect_latency_p50_ms: float | None reflect_latency_p95_ms: float | None
# Efficiency total_tokens_used: int # LLM tokens consumed during eval total_duration_seconds: float
@dataclassclass QueryResult: query: str expected: list[str] # Expected memory texts (ground truth) actual: list[MemoryHit] # Actual recall results relevant_found: int precision: float reciprocal_rank: float latency_ms: float3. Built-in test suites
Section titled “3. Built-in test suites”3.1 basic - Quick validation
Section titled “3.1 basic - Quick validation”A lightweight suite (20 retain + 20 recall) that validates basic functionality:
- Retain various content types (facts, experiences, conversations)
- Recall with exact matches, semantic matches, and negative queries
- Verify tag filtering, time range filtering
- Basic reflect test (if provider supports it)
Runtime: ~30 seconds.
3.2 stress - Scale testing
Section titled “3.2 stress - Scale testing”Tests behavior under load:
- 1000 retains across 10 banks
- 500 recalls with varying specificity
- Measures latency degradation as bank size grows
- Tests dedup detection under bulk insert
- Measures concurrent access performance
Runtime: ~5 minutes.
3.3 accuracy - Retrieval quality
Section titled “3.3 accuracy - Retrieval quality”Detailed accuracy measurement:
- 50 retain + 100 recall pairs with labeled ground truth
- Tests semantic similarity (paraphrased queries)
- Tests temporal reasoning (“what happened last week”)
- Tests entity-based recall (“what do we know about Calvin”)
- Tests negative recall (queries with no relevant memories)
- Computes precision, MRR, NDCG
Runtime: ~2 minutes.
3.4 reflect - Synthesis quality
Section titled “3.4 reflect - Synthesis quality”Tests reflect/synthesis capability:
- 20 scenarios with retained context + reflect queries
- LLM-as-judge evaluation of answer quality
- Measures accuracy, completeness, hallucination rate
- Tests with and without dispositions (if supported)
Runtime: ~3 minutes (includes LLM judge calls).
3.5 Custom suites
Section titled “3.5 Custom suites”Users can define custom test suites as YAML files:
name: "Custom domain evaluation"retain: - content: "Calvin prefers dark mode in all applications" tags: [preference, ui] fact_type: experience - content: "The deployment pipeline uses GitHub Actions with a 10-minute timeout" tags: [technical, deployment] fact_type: world
recall: - query: "What are Calvin's UI preferences?" expected_contains: ["dark mode"] tags: [preference] - query: "How does our deployment work?" expected_contains: ["GitHub Actions", "timeout"]
reflect: - query: "Summarize what we know about Calvin's preferences" expected_topics: ["dark mode", "UI"]results = await evaluator.run_suite( suite="./my-eval-suite.yaml", bank_id="eval-custom",)4. Provider comparison
Section titled “4. Provider comparison”Compare two providers side-by-side:
from astrocyte.eval import compare_providers
comparison = await compare_providers( configs=["config-pgvector.yaml", "config-mystique.yaml"], suite="accuracy",)
print(comparison.summary())Provider Comparison: accuracy suite──────────────────────────────────────────────── pgvector (Tier 1) Mystique (Tier 2)Recall precision 0.72 0.89Recall MRR 0.65 0.82Recall NDCG 0.71 0.86Recall p50 (ms) 45 62Recall p95 (ms) 120 145Reflect accuracy 0.68 (fallback) 0.91 (native)Reflect p50 (ms) 1200 850Tokens used 12,400 8,200────────────────────────────────────────────────This gives users concrete data for the Tier 1 → Tier 2 upgrade decision.
5. Continuous quality monitoring
Section titled “5. Continuous quality monitoring”Run evaluations on a schedule against production banks (using a dedicated test partition):
evaluation: continuous: enabled: true schedule: "0 5 * * 1" # Weekly, Monday 5am suite: basic bank_id: eval-continuous # Dedicated bank, not production alert_on_regression: true regression_threshold: 0.05 # Alert if any metric drops >5% alert_hook: on_eval_regression # Trigger event hook5.1 Regression detection
Section titled “5.1 Regression detection”The evaluator compares current results against the last N runs:
@dataclassclass RegressionAlert: metric: str # e.g., "recall_precision" current_value: float baseline_value: float # Average of last 5 runs delta: float # Absolute change delta_percent: float # Percentage change severity: Literal["warning", "critical"]6. CLI support
Section titled “6. CLI support”# Run a benchmarkastrocyte eval --suite basic --config astrocyte.yaml
# Compare providersastrocyte eval compare --configs config-a.yaml config-b.yaml --suite accuracy
# Run custom suiteastrocyte eval --suite ./my-suite.yaml --config astrocyte.yaml
# Output formatsastrocyte eval --suite basic --format json > results.jsonastrocyte eval --suite basic --format table7. Evaluation and the two-tier model
Section titled “7. Evaluation and the two-tier model”| Suite | Tier 1 behavior | Tier 2 behavior |
|---|---|---|
basic | Tests built-in pipeline + storage | Tests engine directly |
accuracy | Measures pipeline recall quality | Measures engine recall quality |
reflect | Tests fallback LLM synthesis | Tests native engine reflect |
stress | Tests pipeline + storage under load | Tests engine under load |
The evaluation framework treats both tiers identically - it uses the public API (retain, recall, reflect). This ensures apples-to-apples comparison between tiers.