Evaluation and benchmarking
Astrocyte ships built-in tools to measure memory quality, compare providers, and monitor accuracy over time. This enables data-driven provider selection and regression detection.
1. Why evaluation matters
- Users choosing between Tier 1 (built-in pipeline + pgvector) and Tier 2 (Mystique) need objective comparison data.
- Production systems need ongoing quality monitoring - memory quality can degrade as banks grow or content patterns change.
- Provider upgrades need regression testing - did the new version of Mem0 make recall worse?
2. Evaluation API
```python
from astrocyte import Astrocyte  # import path assumed; adjust to your install
from astrocyte.eval import MemoryEvaluator

brain = Astrocyte.from_config("astrocyte.yaml")
evaluator = MemoryEvaluator(brain)

# Run a standard benchmark suite (call from within an async function)
results = await evaluator.run_suite(
    suite="basic",             # "basic" | "longmemeval" | custom path
    bank_id="eval-test-bank",  # Dedicated test bank (created if needed)
    clean_after=True,          # Delete test bank after evaluation
)

print(f"Precision: {results.metrics.recall_precision}, Hit rate: {results.metrics.recall_hit_rate}")
```

2.1 Evaluation result
```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EvalResult:
    suite: str
    provider: str
    provider_tier: str      # "storage" or "engine"
    timestamp: datetime
    metrics: EvalMetrics
    per_query_results: list[QueryResult]
    config_snapshot: dict   # Astrocyte config at eval time


@dataclass
class EvalMetrics:
    # Recall accuracy
    recall_precision: float  # Relevant results / total results
    recall_hit_rate: float   # Queries with >=1 relevant result / total queries
    recall_mrr: float        # Mean reciprocal rank of first relevant result
    recall_ndcg: float       # Normalized discounted cumulative gain

    # Reflect quality (if suite includes reflect tests)
    reflect_accuracy: float | None            # LLM-judged answer correctness
    reflect_completeness: float | None        # LLM-judged answer completeness
    reflect_hallucination_rate: float | None  # LLM-judged hallucination %

    # Performance
    retain_latency_p50_ms: float
    retain_latency_p95_ms: float
    recall_latency_p50_ms: float
    recall_latency_p95_ms: float
    reflect_latency_p50_ms: float | None
    reflect_latency_p95_ms: float | None

    # Efficiency
    total_tokens_used: int         # LLM tokens consumed during this eval run
    total_duration_seconds: float


@dataclass
class QueryResult:
    query: str
    expected: list[str]      # Expected memory texts (ground truth)
    actual: list[MemoryHit]  # Actual recall results
    relevant_found: int
    precision: float
    reciprocal_rank: float
    latency_ms: float
```

2.2 Per-run token tracking vs. global LLM spend
`total_tokens_used` tracks the LLM tokens consumed within a single eval run: the sum of all `complete()` and `embed()` calls that the pipeline makes during the suite’s retain, recall, and reflect phases. This serves two purposes:
- Cost-efficiency comparison. When `compare_providers()` runs the same suite against different configs, token usage is the third leg of the tradeoff alongside latency and accuracy. One provider may score higher but consume 10× more tokens.
- Cost regression detection. A config change that silently doubles LLM calls shows up as a spike in `total_tokens_used` even if accuracy holds steady.
This is a separate concern from global LLM spend tracking (cumulative cost across all production calls, budget alerts, per-model breakdowns). Global spend tracking is the responsibility of your LLM gateway or aggregator (LiteLLM, Portkey, OpenRouter, etc.) — see architecture.md §5 for the boundary. Astrocyte’s per-run token counter works with any LLMProvider implementation, including ones backed by those gateways, because it accumulates Completion.usage at the framework level regardless of which backend is behind the protocol.
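The framework-level accumulation described above can be sketched roughly as follows. `Usage`, `Completion`, `TokenCounter`, `CountingProvider`, and `FakeProvider` are hypothetical stand-ins, not Astrocyte's actual classes, and the sketch is synchronous for brevity. The only behaviour taken from the text is that the counter sums `Completion.usage` without knowing which backend produced it.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    # Hypothetical shape; the real Completion.usage fields may differ.
    prompt_tokens: int = 0
    completion_tokens: int = 0


@dataclass
class Completion:
    text: str
    usage: Usage


class TokenCounter:
    """Accumulates token usage for a single eval run."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, usage: Usage) -> None:
        self.total += usage.prompt_tokens + usage.completion_tokens


class CountingProvider:
    """Wraps any object exposing complete(); the backend stays opaque."""

    def __init__(self, inner, counter: TokenCounter) -> None:
        self._inner = inner
        self._counter = counter

    def complete(self, prompt: str) -> Completion:
        result = self._inner.complete(prompt)
        self._counter.add(result.usage)  # count at the wrapper, not in the backend
        return result


class FakeProvider:
    """Stand-in backend that reports a fixed usage per call."""

    def complete(self, prompt: str) -> Completion:
        return Completion(text="ok", usage=Usage(prompt_tokens=10, completion_tokens=5))


counter = TokenCounter()
provider = CountingProvider(FakeProvider(), counter)
provider.complete("first call")
provider.complete("second call")
print(counter.total)  # 30
```

Because the counting happens in the wrapper, swapping `FakeProvider` for a gateway-backed provider would not change how the per-run total is computed.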
3. Built-in test suites
3.1 basic - Quick validation
A lightweight suite (20 retain + 20 recall) that validates basic functionality:
- Retain various content types (facts, experiences, conversations)
- Recall with exact matches, semantic matches, and negative queries
- Verify tag filtering, time range filtering
- Basic reflect test (if provider supports it)
Runtime: ~30 seconds.
3.2 stress - Scale testing
Tests behavior under load:
- 1000 retains across 10 banks
- 500 recalls with varying specificity
- Measures latency degradation as bank size grows
- Tests dedup detection under bulk insert
- Measures concurrent access performance
Runtime: ~5 minutes.
3.3 accuracy - Retrieval quality
Detailed accuracy measurement:
- 50 retain + 100 recall pairs with labeled ground truth
- Tests semantic similarity (paraphrased queries)
- Tests temporal reasoning (“what happened last week”)
- Tests entity-based recall (“what do we know about Calvin”)
- Tests negative recall (queries with no relevant memories)
- Computes precision, MRR, NDCG
Runtime: ~2 minutes.
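As a rough illustration of the metrics this suite reports, the sketch below computes precision, hit rate, MRR, and NDCG from per-query relevance flags (one boolean per returned result, in rank order). It assumes binary relevance, macro-averaging across queries, and an ideal DCG based only on the relevant items that were actually returned; the evaluator's own definitions may differ on any of these points.

```python
import math


def precision(relevant: list[bool]) -> float:
    """Relevant results / total results for one query."""
    return sum(relevant) / len(relevant) if relevant else 0.0


def reciprocal_rank(relevant: list[bool]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def ndcg(relevant: list[bool]) -> float:
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, is_relevant in enumerate(relevant, start=1)
        if is_relevant
    )
    # Ideal ranking puts every returned relevant item first (a simplification).
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, sum(relevant) + 1))
    return dcg / idcg if idcg else 0.0


def aggregate(per_query: list[list[bool]]) -> dict[str, float]:
    """Macro-average each metric over all queries."""
    if not per_query:
        raise ValueError("need at least one query")
    n = len(per_query)
    return {
        "recall_precision": sum(precision(q) for q in per_query) / n,
        "recall_hit_rate": sum(any(q) for q in per_query) / n,
        "recall_mrr": sum(reciprocal_rank(q) for q in per_query) / n,
        "recall_ndcg": sum(ndcg(q) for q in per_query) / n,
    }


# Three queries: hit at rank 1, hit at rank 2, and a negative query with no hits.
metrics = aggregate([[True, False, False], [False, True], [False, False]])
print(metrics)
```

The third query shows how a miss pulls down hit rate and MRR even when the other queries rank well.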
3.4 reflect - Synthesis quality
Tests reflect/synthesis capability:
- 20 scenarios with retained context + reflect queries
- LLM-as-judge evaluation of answer quality
- Measures accuracy, completeness, hallucination rate
- Tests with and without dispositions (if supported)
Runtime: ~3 minutes (includes LLM judge calls).
3.5 Custom suites
Users can define custom test suites as YAML files:

```yaml
name: "Custom domain evaluation"
retain:
  - content: "Calvin prefers dark mode in all applications"
    tags: [preference, ui]
    fact_type: experience
  - content: "The deployment pipeline uses GitHub Actions with a 10-minute timeout"
    tags: [technical, deployment]
    fact_type: world

recall:
  - query: "What are Calvin's UI preferences?"
    expected_contains: ["dark mode"]
    tags: [preference]
  - query: "How does our deployment work?"
    expected_contains: ["GitHub Actions", "timeout"]

reflect:
  - query: "Summarize what we know about Calvin's preferences"
    expected_topics: ["dark mode", "UI"]
```

```python
results = await evaluator.run_suite(
    suite="./my-eval-suite.yaml",
    bank_id="eval-custom",
)
```

4. Provider comparison
Compare two providers side-by-side:

```python
from astrocyte.eval import compare_providers, format_comparison

results = await compare_providers(
    configs=["config-pgvector.yaml", "config-mystique.yaml"],
    suite="accuracy",
)

print(format_comparison(results))
```

```
Provider Comparison: accuracy suite
────────────────────────────────────────────────
                    pgvector (Tier 1)   Mystique (Tier 2)
Recall precision    0.72                0.89
Recall MRR          0.65                0.82
Recall NDCG         0.71                0.86
Recall p50 (ms)     45                  62
Recall p95 (ms)     120                 145
Reflect accuracy    0.68 (fallback)     0.91 (native)
Reflect p50 (ms)    1200                850
Tokens used         12,400              8,200
────────────────────────────────────────────────
```

This gives users concrete data for the Tier 1 → Tier 2 upgrade decision.
5. Running evaluations in CI
Evaluations run as a separate GitHub Actions workflow (`eval.yml`), not as part of the main CI pipeline. This keeps fast unit-test feedback decoupled from slower, LLM-dependent eval runs.
5.1 Cadence
| Cadence | Suite | Trigger | Purpose |
|---|---|---|---|
| Nightly | basic | Cron (0 5 * * *) | Catch regressions from data drift or provider-side changes |
| Ad-hoc | Any | workflow_dispatch (GitHub UI or gh CLI) | On-demand validation after config changes, provider upgrades, or before releases |
The basic suite (~$0.13/run on GPT-4o-class models) is cheap enough for nightly runs. Heavier suites (accuracy, stress) are intended for ad-hoc use.
5.2 Ad-hoc runs
Trigger from the terminal:

```bash
gh workflow run eval.yml --field suite=basic
gh workflow run eval.yml --field suite=accuracy
```

Or from the GitHub Actions UI: Actions → Eval → Run workflow → select suite.
5.3 Eval workflow design
The eval workflow:
- Checks out the repo and installs dependencies.
- Runs `MemoryEvaluator.run_suite()` with the selected suite.
- Uploads `EvalResult` as a JSON artifact (for historical comparison).
- Posts a summary table to the workflow run.
The workflow requires an OPENAI_API_KEY (or equivalent LLM provider credential) as a repository secret — see the repo’s contributing guide for setup. Eval results are not gated on PR merge; they are informational. Regressions surface as workflow annotations, not merge blockers.
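The last two workflow steps could look something like the sketch below. It assumes the metrics are available as a plain dict; `write_outputs` and the `eval-result.json` file name are hypothetical. `GITHUB_STEP_SUMMARY` is the file path GitHub Actions provides for job summaries; uploading the JSON file as an artifact would be a separate workflow step.

```python
import json
import os
from pathlib import Path


def write_outputs(suite: str, metrics: dict[str, float], out_dir: Path) -> Path:
    """Write metrics as JSON and append a Markdown table to the job summary."""
    out_dir.mkdir(parents=True, exist_ok=True)
    artifact = out_dir / "eval-result.json"  # hypothetical file name
    artifact.write_text(json.dumps({"suite": suite, "metrics": metrics}, indent=2))

    rows = ["| Metric | Value |", "|---|---|"]
    rows += [f"| {name} | {value:.3f} |" for name, value in sorted(metrics.items())]
    summary = "\n".join([f"### Eval: {suite}", "", *rows, ""])

    # In GitHub Actions this variable points at a file; outside CI, print instead.
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(summary + "\n")
    else:
        print(summary)
    return artifact
```

Keeping the JSON and the summary in one function means the table shown on the run page always matches the archived artifact.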
5.4 Why not per-PR?
Per-PR eval runs are feasible (~$0.13/run) but deferred for now:
- Eval suites need a real LLM backend, which means secrets in CI. That is acceptable for nightly runs on `main`, but requires careful scoping for PRs from forks.
- The `basic` suite takes ~30 seconds, which is fast, but still slower than unit tests. Keeping eval separate avoids slowing down the PR feedback loop.
- Nightly runs on `main` catch the same regressions within 24 hours.
This can be revisited if the team wants tighter feedback.
5.5 Regression detection
The evaluator compares current results against a baseline:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class RegressionAlert:
    metric: str            # e.g., "recall_precision"
    current_value: float
    baseline_value: float  # Average of last 5 runs
    delta: float           # Absolute change
    delta_percent: float   # Percentage change
    severity: Literal["warning", "critical"]
```

5.6 Astrocyte config for eval
```yaml
evaluation:
  continuous:
    enabled: true
    schedule: "0 5 * * *"           # Nightly, 5am UTC
    suite: basic
    bank_id: eval-continuous        # Dedicated bank, not production
    alert_on_regression: true
    regression_threshold: 0.05      # Alert if any metric drops >5%
    alert_hook: on_eval_regression  # Trigger event hook
```

6. CLI support
```bash
# Run a benchmark
astrocyte eval --suite basic --config astrocyte.yaml

# Compare providers
astrocyte eval compare --configs config-a.yaml config-b.yaml --suite accuracy

# Run custom suite
astrocyte eval --suite ./my-suite.yaml --config astrocyte.yaml

# Output formats
astrocyte eval --suite basic --format json > results.json
astrocyte eval --suite basic --format table
```

Note: The `astrocyte eval` CLI above is specified but not yet implemented. Use `scripts/run_benchmarks.py` directly or the `make` targets described in section 7.3.
7. External benchmarks
Astrocyte includes adapters for two academic memory benchmarks that test retrieval and reasoning quality on realistic conversational data.
7.1 LoCoMo (ACL 2024)
LoCoMo tests very long-term conversational memory across four QA categories:
| Category | Tests | Question count |
|---|---|---|
| Single-hop | Direct fact recall from a single session | 282 |
| Multi-hop | Reasoning across multiple sessions | 321 |
| Open-domain | Broad knowledge questions | 96 |
| Temporal | Time-aware reasoning (ordering, dates) | 841 |
The dataset contains 10 conversations with ~200 sessions each and 1,986 total questions. The adapter (astrocyte.eval.benchmarks.locomo) retains each session as a conversational memory with occurred_at timestamps and dialogue-aware chunking.
Scoring conventions: Two judges are supported:
- Stemmed token-F1 (`--canonical-judge` without a real LLM provider): the original paper’s metric. Reproducible and deterministic.
- LLM-judge (`--canonical-judge` with a real provider): binary yes/no per question, matching the convention used by Mem0 (ECAI 2025), Hindsight, and MemMachine. Required for numbers directly comparable to published competitor scores. Automatically used when `bench-full` runs with a real provider; falls back to stemmed F1 with the mock provider so `bench-smoke` stays API-key-free.
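For orientation, token-level F1 is usually computed as the harmonic mean of precision and recall over the tokens shared by prediction and reference. The sketch below follows that common definition; `crude_stem` is a toy suffix stripper standing in for a proper stemmer (such as Porter), and none of these function names come from the adapter itself.

```python
import re
from collections import Counter


def crude_stem(token: str) -> str:
    """Toy suffix stripper; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and stem each token."""
    return [crude_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def token_f1(prediction: str, reference: str) -> float:
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection: each token counts at most as often as in both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Calvin prefers dark mode", "dark mode"))
```

The scorer is deterministic for a fixed tokenizer and stemmer, which is what makes it usable without an API key.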
7.2 LongMemEval
LongMemEval (ICLR 2025) tests five long-term memory abilities across 500 questions: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. The adapter (`astrocyte.eval.benchmarks.longmemeval`) retains full session haystacks (~1,500 unique sessions) and evaluates recall + reflect against labeled QA pairs.
Scoring: --canonical-judge uses the paper’s LLM-judge (one LLM call per question). Without the flag, the legacy text_overlap_score > 0.3 scorer is used (faster, not comparable to published numbers).
7.3 Running benchmarks locally
Datasets are fetched automatically on first run to `datasets/` (gitignored). Results are written to `benchmark-results/` locally and archived to Cloudflare R2 (`s3://astrocyte-benchmarks/`) at end-of-run for durable history. The local `benchmark-results/` directory is the working scratch dir; R2 is the system of record. See bench-archive.md for the bucket layout and the archive/fetch tooling. Requires Doppler for API keys (LLM provider + R2) on real-provider runs.

```bash
# Smoke test: in-memory, no API key needed (~25s)
make bench-smoke

# Quick subsets (requires API key via Doppler)
doppler run -- make bench-locomo-quick        # 50 questions, ~2-3 min
doppler run -- make bench-locomo-fair         # 200 questions (20×10), ~15-20 min
doppler run -- make bench-longmemeval-quick   # 100 questions, ~30-60 min

# Full canonical run: LME + LoCoMo in parallel, LLM-judge
# Produces competitor-comparable numbers
doppler run -- make bench-full

# Resume after interruption (laptop close, kill signal, etc.)
doppler run -- make bench-full RESUME=1
```

LoCoMo: choosing the right tier
Three LoCoMo bench tiers cover different needs along the speed/signal trade-off:
| | `bench-locomo-quick` | `bench-locomo-fair` | `bench-locomo` |
|---|---|---|---|
| Questions | 50 | 200 (20 × 10 convos) | 1,986 (full) |
| Wall time | ~3 min | ~15–20 min | ~3 hrs |
| Cost (gpt-4o-mini) | ~$0.30 | ~$1 | ~$5–10 |
| Sampling | First 50 (head-slice) | Category-stratified within each of all 10 conversations | All |
| Conversation coverage | 1 conversation | All 10 conversations × all categories | All 10 |
| Per-category n | ~10 (too small) | ~30–60 (even, balanced) | ~400 |
| 95% CI on overall | ±14 pts | ±7 pts | ±2.2 pts |
| Comparable across runs | ✅ deterministic | ✅ deterministic | ✅ |
| Detects 4-pt change | ❌ | marginal | ✅ |
| Detects 8-pt change | ❌ | ✅ | ✅ |
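The "95% CI on overall" row appears consistent with the normal-approximation half-width for a binomial proportion at the worst case p = 0.5. That reading is an inference from the numbers, not something the bench tooling states. A short check under that assumption:

```python
import math


def ci_half_width_pts(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """95% CI half-width of a proportion, in percentage points (normal approx.)."""
    return 100 * z * math.sqrt(p * (1 - p) / n)


for n in (50, 200, 1986):
    print(n, round(ci_half_width_pts(n), 1))
# 50 13.9
# 200 6.9
# 1986 2.2
```

Under the same assumption, a difference between two runs smaller than the half-width is hard to separate from sampling noise, which matches the "Detects 4-pt change" and "Detects 8-pt change" rows.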
When to use which:
- `quick`: sanity check / smoke test only. 50 questions can’t tell you anything statistically meaningful about quality. Use for: “does my code crash on real bench data?” and CI gates.
- `fair`: recommended fast-iteration target. Same speed as a 200-question head-slice but stratified across both conversations and categories: for each of 10 conversations, take ⌈N / num_categories⌉ questions per category. Every conversation and every category gets representation. Per-category numbers are reliable enough to detect ~8-pt swings.
- `bench-locomo`: release-quality measurement. Tight per-category CIs (±5 pts) let you make claims like “multi-hop +5 pts.” Direct comparison to published numbers (Hindsight, BEAM, paper baselines).
Recommended workflow for each change you want to ship:
1. Implement the change
2. `make bench-locomo-fair` (~20 min) → does it move the needle?
3. If no signal → reject, iterate, or shelve
4. If positive signal → `make bench-locomo` (~3 hrs) → confirm magnitude
5. If confirmed at full scale → ship

Cost per feature: ~3.5 hrs of bench time, ~$6–11. Compare to “always full bench” at 6 hrs and ~$10–20 per feature.
Why no fixed 200-question head-slice tier?
A head-slice (`--max-questions 200`) draws all 200 questions from the first one or two conversations in the dataset. Persona scoping, cross-conversation tag filters, and entity disambiguation across storylines all go untested at that sample. `bench-locomo-fair` (`--max-questions-per-conversation 20`) costs the same but exercises every conversation. The deprecated `bench-locomo-200` target was removed for this reason; use `bench-locomo-fair` instead.
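The per-conversation stratification can be sketched as below. The `conversation` and `category` field names and the `stratified_sample` function are hypothetical; the only rule taken from the text is ⌈N / num_categories⌉ questions per category within each conversation. Whether the real sampler trims the result back to exactly N per conversation when N is not divisible by the category count is not specified, so this sketch does not trim.

```python
import math
from collections import defaultdict


def stratified_sample(questions: list[dict], per_conversation: int) -> list[dict]:
    """Take ceil(per_conversation / num_categories) questions per category per conversation."""
    if not questions:
        return []
    categories = {q["category"] for q in questions}
    per_category = math.ceil(per_conversation / len(categories))

    grouped: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for q in questions:  # dataset order is preserved, so the sample is deterministic
        grouped[(q["conversation"], q["category"])].append(q)

    sample: list[dict] = []
    for key in sorted(grouped):
        # A group with fewer questions than per_category contributes all it has.
        sample.extend(grouped[key][:per_category])
    return sample


# Toy dataset: 10 conversations x 4 categories x 8 questions each.
questions = [
    {"conversation": f"conv-{c}", "category": cat, "id": i}
    for c in range(10)
    for cat in ("single-hop", "multi-hop", "open-domain", "temporal")
    for i in range(8)
]
print(len(stratified_sample(questions, per_conversation=20)))  # 200
```

With N = 20 and four categories, each conversation contributes five questions per category, so no conversation or category is left out of the sample.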
Key CLI flags (scripts/run_benchmarks.py):
| Flag | Effect |
|---|---|
| `--canonical-judge` | Use each benchmark’s paper-specified judge. Required for competitor comparisons. |
| `--multi-query` | Enable multi-query expansion in retrieval (extra LLM calls, improves multi-hop recall). |
| `--max-sessions N` | Cap LongMemEval retain phase at N unique sessions (default: all ~1,500). |
| `--resume` | Continue an interrupted run from `benchmark-results/checkpoints/`. |
| `--max-questions N` | Hard cap on total questions (deterministic head-slice; biased toward early conversations when small). |
| `--max-questions-per-conversation N` | LoCoMo only: take N questions per conversation, stratified across categories (⌈N / num_categories⌉ per category). Ensures every category gets representation; previously head-sliced and could exclude rare categories. Used by `bench-locomo-fair`. |
Checkpoint / resume: Every evaluated question is checkpointed to benchmark-results/checkpoints/. If a run is interrupted, --resume (or RESUME=1 in make bench-full) replays already-scored questions from cache and skips already-retained sessions (with persistent stores). The checkpoint is deleted on successful completion.
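A minimal sketch of the checkpoint-and-resume pattern described here, assuming a single JSON file of question scores. `run_with_checkpoint`, the file layout, and the `score_fn` callback are hypothetical; the actual checkpoint format under `benchmark-results/checkpoints/` is not documented in this section.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoint(
    question_ids: list[str],
    score_fn: Callable[[str], float],
    checkpoint: Path,
    resume: bool = False,
) -> dict[str, float]:
    """Score each question once, persisting progress after every question."""
    scores: dict[str, float] = {}
    if resume and checkpoint.exists():
        scores = json.loads(checkpoint.read_text())

    for qid in question_ids:
        if qid in scores:  # scored in an earlier run: reuse the cached value
            continue
        scores[qid] = score_fn(qid)
        checkpoint.parent.mkdir(parents=True, exist_ok=True)
        checkpoint.write_text(json.dumps(scores))

    checkpoint.unlink(missing_ok=True)  # delete the checkpoint on success
    return scores
```

Writing after every question costs some I/O but bounds the work lost to an interruption at a single question.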
Parallel execution: When both longmemeval and locomo are requested (e.g. bench-full), they run concurrently via asyncio.gather(). Each uses its own bank ID so there is no state conflict.
State reset between runs
Bench targets depend on `bench-db-start`, which spins up a disposable Postgres container (`astrocyte-bench-pg` on port 5433). The container persists across runs by default for speed; `make bench-db-reset` (alias for `bench-db-stop && bench-db-start`) recreates it from scratch when the schema or extensions need to change.
For data-level reset between runs (without recreating the container), every benchmark `run()` method calls `astrocyte/eval/_state_reset.py` at start of execution. The helper:
- TRUNCATEs every bench-relevant table (vectors, banks, wiki pages, entity tables, temporal facts, PgQueuer queue) in FK-safe order
- Drops and recreates the AGE graph (`astrocyte`)
- Skips tables that don’t exist (deployments without wiki layer, AGE-not-installed, etc.) as best-effort cleanup
- Is a no-op when `DATABASE_URL` is unset (the in-memory test path)
Without this reset, leftover state from prior runs silently corrupts scores: stale wiki pages dominate _try_wiki_tier, accumulated entity aliases mis-resolve canonical IDs, and orphaned PgQueuer compile tasks race the recall path. The orchestrator (scripts/run_benchmarks.py) does ONE reset at the top and passes reset_state_before=False to each individual benchmark so asyncio.gather’d parallel runs (bench-full) don’t race on TRUNCATE / drop_graph.
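The single-reset arrangement can be illustrated with the sketch below. `reset_state`, `run_benchmark`, `run_all`, and the bank-ID naming are stand-ins rather than the real orchestrator code; only the `reset_state_before=False` flag and the use of `asyncio.gather` come from the text.

```python
import asyncio


async def reset_state(log: list[str]) -> None:
    """Stand-in for the shared reset (TRUNCATE tables, recreate the graph)."""
    log.append("reset")


async def run_benchmark(name: str, log: list[str], reset_state_before: bool = True) -> str:
    """Stand-in for one benchmark's run(); resets first unless told not to."""
    if reset_state_before:
        await reset_state(log)
    bank_id = f"bench-{name}"  # hypothetical naming; each benchmark gets its own bank
    log.append(f"run:{bank_id}")
    return bank_id


async def run_all(names: list[str]) -> list[str]:
    log: list[str] = []
    await reset_state(log)  # one reset, finished before any benchmark starts
    await asyncio.gather(
        *(run_benchmark(name, log, reset_state_before=False) for name in names)
    )
    return log


log = asyncio.run(run_all(["longmemeval", "locomo"]))
print(log)
```

If each benchmark reset on its own under `gather`, one benchmark's TRUNCATE could wipe rows the other had already retained; doing the reset once up front avoids that.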
7.4 Preset ablation matrix
Astrocyte ships five named bench presets under `astrocyte-py/benchmarks/`, each toggling a coherent set of pipeline features so the bench can attribute score deltas to specific design choices rather than a tangle of co-changes:
| Preset | Config file | What it toggles |
|---|---|---|
| `baseline` | `config-baseline.yaml` | Minimal pipeline: vector + keyword recall only, no agentic reflect, no causal links, no semantic kNN, no abstention. The bottom of the matrix. |
| `fast-recall` | `config-fast-recall.yaml` | Adds query analyzer, structured fact extraction, observation consolidation, intent-conditional adversarial defense; keeps cross-encoder rerank and agentic reflect off. Currently the highest-overall preset on LoCoMo (n=200). |
| `hindsight-parity` | `config-hindsight-parity.yaml` | Cross-encoder rerank ON, agentic reflect ON, causal links ON, semantic kNN ON, query analyzer OFF. Approximates Hindsight’s documented stack. |
| `hindsight-balanced` | `config-hindsight-balanced.yaml` | Single-variable diff vs. parity: adds the intent-conditional abstention floor + adversarial system-prompt rule, no premise verification. |
| `quality-max` | `config-quality-max.yaml` | All quality features on simultaneously (multi-query expansion, premise verification, agentic reflect, cross-encoder rerank). Trades latency and cost for accuracy. |
`make bench-compare` runs the full matrix sequentially against the same dataset and emits one results file per preset:

```bash
doppler run -- make bench-compare
# emits benchmark-results/results-matrix-<preset>.json
```

See benchmark-presets.md for the live results matrix, per-category scores, the post-mortem on why `quality-max` underperformed, and the rationale behind each preset’s specific knob settings.
7.5 CI integration
The GitHub Actions workflow (`.github/workflows/benchmarks.yml`) runs benchmarks weekly and on manual dispatch. It uses `--canonical-judge` and compares results against `benchmarks/baselines-openai.json` (falling back to `baselines-test-provider.json` if no real-provider baseline exists yet). The `bench-smoke` job runs on every PR using the mock provider and `baselines-test-provider.json`.
The regression gate (scripts/check_benchmark_regression.py) exits non-zero when any metric drops more than a configurable tolerance (default: 2pp overall, 3pp per category, 3pp retrieval metrics).
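A gate of this kind can be sketched as follows. The flat `metric -> score` layout, the `kinds` mapping, and the function names are assumptions made for illustration; the actual structure of the baseline files and of `check_benchmark_regression.py` is not shown in this section. Only the default tolerances come from the text.

```python
import sys

# Default tolerances from the text, in percentage points.
TOLERANCE_PP = {"overall": 2.0, "category": 3.0, "retrieval": 3.0}


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    kinds: dict[str, str],
) -> list[str]:
    """Return one message per metric that dropped by more than its tolerance."""
    failures = []
    for metric, base in baseline.items():
        if metric not in current:
            continue  # metric missing from this run: nothing to compare
        drop_pp = (base - current[metric]) * 100  # scores assumed to be in [0, 1]
        limit = TOLERANCE_PP[kinds.get(metric, "overall")]
        if drop_pp > limit:
            failures.append(f"{metric}: -{drop_pp:.1f}pp (limit {limit}pp)")
    return failures


def main(baseline: dict[str, float], current: dict[str, float], kinds: dict[str, str]) -> int:
    """Print failures to stderr and return a process exit code."""
    failures = find_regressions(baseline, current, kinds)
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0


exit_code = main(
    baseline={"overall": 0.80, "multi_hop": 0.70},
    current={"overall": 0.77, "multi_hop": 0.68},
    kinds={"multi_hop": "category"},
)
print(exit_code)  # 1: overall dropped 3pp (limit 2pp); multi_hop dropped 2pp (limit 3pp)
```

In a real script the return value would be passed to `sys.exit()` so the CI job fails when any metric regresses past its tolerance.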
CI runs also archive their results-*.json to R2 via the same post-run hook used locally; the bench Doppler config carries the R2_* credentials. Trajectory analysis (make bench-archive-trajectory) reads from R2, so weekly CI numbers and ad-hoc local runs share one history.
8. Evaluation and the two-tier model
| Suite | Tier 1 behavior | Tier 2 behavior |
|---|---|---|
| `basic` | Tests built-in pipeline + storage | Tests engine directly |
| `accuracy` | Measures pipeline recall quality | Measures engine recall quality |
| `reflect` | Tests fallback LLM synthesis | Tests native engine reflect |
| `stress` | Tests pipeline + storage under load | Tests engine under load |
The evaluation framework treats both tiers identically - it uses the public API (retain, recall, reflect). This ensures apples-to-apples comparison between tiers.