Eval Dashboard
Astrocyte runs nightly evaluation suites against the built-in Tier 1 pipeline to track retrieval quality, latency, and LLM token usage over time. Results are published here automatically.
The basic suite runs 20 retain + 20 recall + reflect operations against an in-memory vector store with OpenAI embeddings and completions. See the Evaluation design doc for details on suites, metrics, and regression detection.
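To make the shape of the basic suite concrete, here is a minimal sketch of a retain/recall/reflect loop against a stand-in store. The `StubMemory` class, its substring-match recall, and the `run_basic_suite` helper are all hypothetical; the real `scripts/run_eval.py` uses OpenAI embeddings and completions and may structure the suite differently.

```python
class StubMemory:
    """Hypothetical stand-in for the real pipeline (no embeddings, no LLM)."""

    def __init__(self):
        self.facts = []

    def retain(self, fact):
        self.facts.append(fact)

    def recall(self, query):
        # Naive substring match in place of vector similarity search.
        return [f for f in self.facts if query.lower() in f.lower()]

    def reflect(self, prompt):
        # Trivial "synthesis": concatenate stored facts.
        return " ".join(self.facts)


def run_basic_suite(memory, cases):
    """Run the retain phase, then recall each query, then one reflect pass."""
    for case in cases:
        memory.retain(case["fact"])
    recalls = [memory.recall(case["query"]) for case in cases]
    answer = memory.reflect("summarize what you know")
    return recalls, answer
```

In the real suite each phase runs 20 operations and the recall results feed the retrieval metrics below.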
Results
Metrics reference
| Metric | What it measures |
|---|---|
| Hit rate | Fraction of queries that return at least one relevant result |
| MRR | Mean reciprocal rank — how high the first relevant result ranks |
| NDCG | Normalized discounted cumulative gain — overall ranking quality |
| Precision | Relevant results / total results returned |
| Reflect accuracy | Topic coverage in synthesized answers (keyword overlap) |
| Tokens used | Total LLM tokens consumed during the eval run (not global spend) |
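The retrieval metrics in the table can be sketched as follows, assuming each query's results are reduced to a ranked list of binary relevance labels (1 = relevant). These are standard textbook definitions, not necessarily the exact implementation in the eval scripts.

```python
import math


def hit_rate(runs):
    """Fraction of queries with at least one relevant result."""
    return sum(1 for rels in runs if any(rels)) / len(runs)


def mrr(runs):
    """Mean reciprocal rank of the first relevant result (0 if none found)."""
    total = 0.0
    for rels in runs:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(runs)


def ndcg(runs):
    """Binary-relevance NDCG averaged over queries."""
    total = 0.0
    for rels in runs:
        dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
        ideal = sorted(rels, reverse=True)
        idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
        total += dcg / idcg if idcg > 0 else 0.0
    return total / len(runs)


def precision(runs):
    """Relevant results divided by total results returned, over all queries."""
    returned = sum(len(rels) for rels in runs)
    relevant = sum(sum(rels) for rels in runs)
    return relevant / returned if returned else 0.0


# Three queries: ranked 0/1 relevance labels per returned result.
runs = [[1, 0, 1], [0, 0, 1], [0, 0, 0]]
```

With this sample, two of three queries hit, so the hit rate is 2/3, and MRR is (1 + 1/3 + 0) / 3 = 4/9.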
How it works
- The eval workflow runs nightly (5am UTC) or on-demand via `gh workflow run eval.yml`
- It executes `scripts/run_eval.py` with an inline OpenAI adapter + in-memory vector store
- Results are appended to `docs/public/eval/history.json` and committed
- The docs site rebuilds, and this page renders the latest data
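The append-and-commit step above can be sketched like this. The `append_run` helper and the record shape (a JSON array of timestamped metric dicts) are assumptions for illustration; the actual `history.json` schema may differ.

```python
import datetime
import json
import pathlib


def append_run(history_path, metrics):
    """Append one eval run's metrics to the history file, creating it if absent."""
    path = pathlib.Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        **metrics,  # e.g. hit_rate, mrr, ndcg, precision, tokens_used
    })
    # Rewrite the whole file; the CI job then commits the change.
    path.write_text(json.dumps(history, indent=2) + "\n")
```

Appending to a committed JSON file keeps the full run history in git, so the docs site can render trends without a separate database.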