Eval Dashboard
Astrocyte runs nightly evaluation suites against the built-in Tier 1 pipeline to track retrieval quality, latency, and LLM token usage over time. Results are published here automatically.
The basic suite runs 20 retain + 20 recall + reflect operations against an in-memory vector store with OpenAI embeddings and completions. See the Evaluation design doc for details on suites, metrics, and regression detection.
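To make the shape of the basic suite concrete, here is a minimal sketch of a retain/recall/reflect loop against a stand-in store. The `StubMemory` class, its substring-match recall, and the `run_basic_suite` helper are all hypothetical; the real `scripts/run_eval.py` uses OpenAI embeddings and completions and may structure the suite differently.

```python
class StubMemory:
    """Hypothetical stand-in for the real pipeline (no embeddings, no LLM)."""

    def __init__(self):
        self.facts = []

    def retain(self, fact):
        self.facts.append(fact)

    def recall(self, query):
        # Naive substring match in place of vector similarity search.
        return [f for f in self.facts if query.lower() in f.lower()]

    def reflect(self, prompt):
        # Trivial "synthesis": concatenate stored facts.
        return " ".join(self.facts)


def run_basic_suite(memory, cases):
    """Run the retain phase, then recall each query, then one reflect pass."""
    for case in cases:
        memory.retain(case["fact"])
    recalls = [memory.recall(case["query"]) for case in cases]
    answer = memory.reflect("summarize what you know")
    return recalls, answer
```

In the real suite each phase runs 20 operations and the recall results feed the retrieval metrics below.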
Results
Metrics reference
| Metric | What it measures |
|---|---|
| Hit rate | Fraction of queries that return at least one relevant result |
| MRR | Mean reciprocal rank — how high the first relevant result ranks |
| NDCG | Normalized discounted cumulative gain — overall ranking quality |
| Precision | Relevant results / total results returned |
| Reflect accuracy | Topic coverage in synthesized answers (keyword overlap) |
| Tokens used | Total LLM tokens consumed during the eval run (not global spend) |
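The retrieval metrics in the table can be sketched as follows, assuming each query's results are reduced to a ranked list of binary relevance labels (1 = relevant). These are standard textbook definitions, not necessarily the exact implementation in the eval scripts.

```python
import math


def hit_rate(runs):
    """Fraction of queries with at least one relevant result."""
    return sum(1 for rels in runs if any(rels)) / len(runs)


def mrr(runs):
    """Mean reciprocal rank of the first relevant result (0 if none found)."""
    total = 0.0
    for rels in runs:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(runs)


def ndcg(runs):
    """Binary-relevance NDCG averaged over queries."""
    total = 0.0
    for rels in runs:
        dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
        ideal = sorted(rels, reverse=True)
        idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
        total += dcg / idcg if idcg > 0 else 0.0
    return total / len(runs)


def precision(runs):
    """Relevant results divided by total results returned, over all queries."""
    returned = sum(len(rels) for rels in runs)
    relevant = sum(sum(rels) for rels in runs)
    return relevant / returned if returned else 0.0


# Three queries: ranked 0/1 relevance labels per returned result.
runs = [[1, 0, 1], [0, 0, 1], [0, 0, 0]]
```

With this sample, two of three queries hit, so the hit rate is 2/3, and MRR is (1 + 1/3 + 0) / 3 = 4/9.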
How it works
- The eval workflow runs nightly (5am UTC) or on-demand via `gh workflow run eval.yml`
- It executes `scripts/run_eval.py` with an inline OpenAI adapter + in-memory vector store
- Results are appended to `docs/public/eval/history.json` and committed
- The docs site rebuilds, and this page renders the latest data
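The append-and-commit step above can be sketched like this. The `append_run` helper and the record shape (a JSON array of timestamped metric dicts) are assumptions for illustration; the actual `history.json` schema may differ.

```python
import datetime
import json
import pathlib


def append_run(history_path, metrics):
    """Append one eval run's metrics to the history file, creating it if absent."""
    path = pathlib.Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        **metrics,  # e.g. hit_rate, mrr, ndcg, precision, tokens_used
    })
    # Rewrite the whole file; the CI job then commits the change.
    path.write_text(json.dumps(history, indent=2) + "\n")
```

Appending to a committed JSON file keeps the full run history in git, so the docs site can render trends without a separate database.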