Benchmark trajectory

Live time-series of Astrocyte’s LoCoMo and LongMemEval scores. Data is read from the public R2 bucket at runtime — every archived run shows up within seconds of upload, with no docs rebuild needed.

The chart is rendered client-side from trajectory/<bench>.json in the public R2 bucket. See bench-archive for the bucket layout, the archive / fetch tooling, and how the trajectory artifact is regenerated on each run.
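As a rough illustration of the runtime read described above, the sketch below builds the artifact URL and loads the JSON. It is written in Python rather than the page's client-side code. The helper names `trajectory_url` and `load_trajectory` are hypothetical, the base URL is whatever the public bucket is served from, and nothing is assumed about the JSON's internal schema.

```python
import json
import urllib.request


def trajectory_url(base_url: str, bench: str) -> str:
    """Build the public URL of trajectory/<bench>.json under a bucket base URL."""
    return f"{base_url.rstrip('/')}/trajectory/{bench}.json"


def load_trajectory(base_url: str, bench: str, timeout: float = 10.0):
    """Fetch and parse the trajectory artifact; its schema is left opaque here."""
    url = trajectory_url(base_url, bench)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Because the artifact is fetched on every page load, a newly archived run only needs the JSON to be regenerated, not the docs site.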

The two shields.io endpoint badges at the top of the main README link to this page:

LoCoMo LongMemEval

The badge JSON lives at badges/<bench>.json on the same public R2 bucket — regenerated whenever make bench-refresh-labels runs. Its content depends on what BENCH_PARITY.yaml (at the repo root) records:

  • If a published release exists for the astrocyte package: badge shows that release’s frozen scores (e.g. LoCoMo (astrocyte v0.14.0) 83.8%). This is what pip install astrocyte actually produces.
  • If no release exists yet but a cycle has been marked shipped: badge shows the mean of that cycle’s shipped run pair (e.g. LoCoMo (n=200, M19 × 2 runs) 84.1%).
  • Otherwise: falls back to the most recent non-smoke run.
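The three-tier precedence above can be sketched as follows. This is a minimal illustration, not the actual badge writer: `BadgeInputs` is a hypothetical, already-parsed view of what BENCH_PARITY.yaml records, and the split of the badge text into `label` and `message` is an assumption. Only `schemaVersion`, `label` and `message` are taken from the shields.io endpoint-badge format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class BadgeInputs:
    """Hypothetical parsed inputs; the real YAML layout may differ."""
    bench_name: str                        # display name, e.g. "LoCoMo"
    release_version: Optional[str] = None  # set if a published release exists
    release_score: Optional[float] = None  # frozen score for that release, in %
    shipped_cycle: Optional[str] = None    # set if a cycle is marked shipped
    shipped_scores: Sequence[float] = ()   # scores of the shipped runs, in %
    n_questions: Optional[int] = None      # question count of the shipped runs
    latest_score: Optional[float] = None   # most recent non-smoke run, in %


def badge_payload(inp: BadgeInputs) -> dict:
    """Apply release > shipped cycle > latest run, return an endpoint payload."""
    if inp.release_version is not None and inp.release_score is not None:
        label = f"{inp.bench_name} (astrocyte v{inp.release_version})"
        score = inp.release_score
    elif inp.shipped_cycle is not None and inp.shipped_scores:
        runs = len(inp.shipped_scores)
        label = (
            f"{inp.bench_name} "
            f"(n={inp.n_questions}, {inp.shipped_cycle} × {runs} runs)"
        )
        score = mean(inp.shipped_scores)  # mean of the shipped replicate runs
    elif inp.latest_score is not None:
        label = inp.bench_name
        score = inp.latest_score
    else:
        raise ValueError("no release, shipped cycle, or non-smoke run available")
    return {"schemaVersion": 1, "label": label, "message": f"{score:.1f}%"}
```

Checking the release branch first means a shipped-but-unreleased cycle never overrides the scores of the version users actually install.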

The full cycle-close + release ritual that drives badge changes lives in RELEASING.md at the repo root.

Current operating point (post-M17 close, 2026-05-17)

Conversation Engine + Memory Engine, gpt-4o-mini answerer + judge, --user-profile enabled. Fair subsets: LME-30 via --per-type 5; LoCoMo-200 via MEM0_HARNESS_LOCOMO_MAX_Q=20 (matches baseline subset). 3-run replication.

| Bench | Cutoff | M14 baseline | M17 3-run mean (Conversation Engine) | Δ |
|---|---|---|---|---|
| LME | top_20 | 65.0% ±3.3pp | 75.56% ±5.09pp | +10.56pp (3.2σ) |
| LoCoMo | top_20 | 78.25% ±1.5pp | 80.50% ±1.50pp | +2.25pp (1.5σ) |

M17 routes LME/LoCoMo through the new Conversation Engine (astrocyte/conversations/ + session-aware Hindsight-parity chunking) → ConversationIngestor → retain SPI → existing extraction. The Document Engine (astrocyte/documents/) ships in parallel but is not benched on conversation workloads (its bench is FinanceBench / DoubleBench, future cycle). See m17-pageindex-ingestion.md §8.

LME run-to-run std (±5.09pp) is wider than the baseline's (±3.3pp), with the extra variance concentrated in the N=5 single-session-preference and single-session-assistant categories. A conservative “central tendency” estimate of the LME lift, discounting one single-session-preference (SSP) outlier, is closer to +8pp than the headline +10.56pp. Both LME estimates clear the locked 2σ ship gate.
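The σ figures in the table are consistent with dividing each lift by the baseline's run-to-run std (10.56 / 3.3 ≈ 3.2 and 2.25 / 1.5 = 1.5). That reading is inferred from the published numbers rather than from a documented formula, so treat the sketch below as a worked check under that assumption. The name `lift_in_sigma` is hypothetical.

```python
SHIP_GATE_SIGMA = 2.0  # the locked ship gate mentioned above


def lift_in_sigma(baseline_pct: float, candidate_pct: float,
                  baseline_std_pp: float) -> tuple[float, float]:
    """Return (lift in pp, lift in units of the baseline std)."""
    lift = candidate_pct - baseline_pct
    return lift, lift / baseline_std_pp


lme_lift, lme_sigma = lift_in_sigma(65.0, 75.56, 3.3)
locomo_lift, locomo_sigma = lift_in_sigma(78.25, 80.50, 1.5)
conservative_sigma = 8.0 / 3.3  # the ~+8pp conservative LME estimate

print(f"LME    {lme_lift:+.2f}pp ({lme_sigma:.1f}σ)")        # +10.56pp (3.2σ)
print(f"LoCoMo {locomo_lift:+.2f}pp ({locomo_sigma:.1f}σ)")  # +2.25pp (1.5σ)
print(f"LME conservative {conservative_sigma:.1f}σ")         # 2.4σ
```

Under this reading, both the headline and the conservative LME lift sit above the 2σ gate.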

Gap to Hindsight’s published numbers (94.6% LME, 92.0% LoCoMo): −19pp LME, −12pp LoCoMo remaining. M18 quick-wins target the cheapest engine-side levers.

Previous operating point (post-M14 close, 2026-05-15)

Multi-run aggregates on the Mem0 harness, gpt-4o-mini judge + answerer, --user-profile enabled. HEAD was at commit 6ec61ea (revert(m14.2)).

| Bench | Cutoff | 4-run baseline | M14 experiment WIP (3-run mean) |
|---|---|---|---|
| LME | top_20 | 65.0% ±3.3pp | 68.9% ±5.0pp |
| LoCoMo | top_20 | 78.25% ±1.5pp | 75.83% ±1.5pp |

The “M14 experiment WIP” column reflects bench runs of a working-tree implementation that was never committed — the experiment was torn down after replication invalidated the single-run +8.3pp LME lift. Single-run highs from the M14 cycle (LME 73.3%, LoCoMo 81.0%) did NOT replicate; treat them as variance. See m13-m14-roadmap.md §§8.7-8.10 for the full retrospective and null-verdict explanation.

Trajectory charts not configured. Set R2_PUBLIC_URL at docs build time (doppler run --config bench -- pnpm build) to enable the locomo and longmemeval charts. See benchmarks-doppler-setup.

Each point on the overall accuracy chart is one archived bench run. Hovering reveals the run’s stage (e.g. pr2-d55-gate, weekly-ci), the git commit it was launched from, and the question count.

The per-category accuracy chart breaks the same runs down by category. For LoCoMo: single-hop, multi-hop, temporal, open-domain, adversarial. For LongMemEval: single-session-user, single-session-assistant, single-session-preference, multi-session, temporal-reasoning, knowledge-update.

make bench-locomo / bench-longmemeval / bench-parallel
↓ writes per-project result JSON under
↓ benchmark-results/<harness>/<bench>/<project>/<bench>_results_*.json
make bench-archive-rescan
↓ scripts/archive_bench_results.py walks the canonical tree,
↓ skips projects with _ARCHIVED marker, ingests both schemas
↓ (Mem0 metrics_by_cutoff.top_20 OR PageIndex overall_accuracy)
↓ gzip + put_object → private bucket
↓ patch per-day manifest, regenerate trajectory + badges
private bucket: runs/<date>/<stage>/<bench>/results-*.json.gz
public bucket: trajectory/<bench>.json ← this page reads from here
public bucket: badges/<bench>.json ← README badges read from here
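The "ingests both schemas" and "gzip + put_object" steps in the flow above might look roughly like this. It is a sketch, not the contents of scripts/archive_bench_results.py. The shape of the value at metrics_by_cutoff.top_20 (a number, or a mapping with an "accuracy" key) and the `run_id` suffix in the object key are assumptions, and all three function names are hypothetical.

```python
import gzip
import json
from typing import Any, Mapping


def overall_accuracy(result: Mapping[str, Any]) -> float:
    """Read the headline accuracy from either result schema."""
    cutoffs = result.get("metrics_by_cutoff")
    if isinstance(cutoffs, Mapping) and "top_20" in cutoffs:
        top_20 = cutoffs["top_20"]  # Mem0-style schema
        if isinstance(top_20, Mapping):
            return float(top_20["accuracy"])  # assumed key name
        return float(top_20)
    if "overall_accuracy" in result:  # PageIndex-style schema
        return float(result["overall_accuracy"])
    raise ValueError("unrecognised result schema")


def gzip_payload(result: Mapping[str, Any]) -> bytes:
    """Serialise a result to JSON and gzip it for upload."""
    return gzip.compress(json.dumps(result, sort_keys=True).encode("utf-8"))


def archive_key(date: str, stage: str, bench: str, run_id: str) -> str:
    """Object key following runs/<date>/<stage>/<bench>/results-*.json.gz."""
    return f"runs/{date}/{stage}/{bench}/results-{run_id}.json.gz"
```

The compressed bytes and key would then be handed to an S3-compatible `put_object` call against the private bucket. Normalising both schemas to a single number up front keeps the trajectory and badge writers independent of which harness produced a run.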

A run that completes without R2 credentials still writes locally; it just doesn’t show up here. To archive everything not already pushed (idempotent — projects with an _ARCHIVED marker are skipped):

cd astrocyte-py
make bench-archive-rescan
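The idempotency comes from the marker check. A minimal sketch of that idea, assuming the marker is an empty file named `_ARCHIVED` inside each benchmark-results/<harness>/<bench>/<project>/ directory (the real script may record it differently; `pending_projects` and `mark_archived` are hypothetical names):

```python
from pathlib import Path
from typing import Iterator

MARKER = "_ARCHIVED"


def pending_projects(root: Path) -> Iterator[Path]:
    """Yield <harness>/<bench>/<project> directories that lack the marker."""
    for project in sorted(root.glob("*/*/*")):
        if project.is_dir() and not (project / MARKER).exists():
            yield project


def mark_archived(project: Path) -> None:
    """Drop the marker so later rescans skip this project."""
    (project / MARKER).touch()
```

Running the rescan twice is then harmless: the second pass finds no pending projects.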

Smoke / micro runs (stage containing smoke, or n_questions < 30) are filtered out by default — they would otherwise show up as 0% / 100% extreme points on the trajectory. Pass INCLUDE_SMOKE=1 to archive them anyway.
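The filter rule can be expressed in a few lines. This is a sketch of the rule as stated, with hypothetical function names; whether the real check on the stage name is case-sensitive is not specified, so a plain substring match is assumed.

```python
import os
from typing import Optional

MIN_QUESTIONS = 30  # runs below this count as micro runs


def is_smoke(stage: str, n_questions: int) -> bool:
    """True for smoke stages or runs with fewer than MIN_QUESTIONS questions."""
    return "smoke" in stage or n_questions < MIN_QUESTIONS


def should_archive(stage: str, n_questions: int,
                   include_smoke: Optional[bool] = None) -> bool:
    """Archive unless the run is smoke/micro and INCLUDE_SMOKE is not set."""
    if include_smoke is None:
        include_smoke = os.environ.get("INCLUDE_SMOKE") == "1"
    return include_smoke or not is_smoke(stage, n_questions)
```

With a handful of questions, a single right or wrong answer swings accuracy by many points, which is why such runs are kept off the trajectory by default.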

When a cycle’s ship-gate condition is identified, mark its replicate runs with a label so the badge writer can pick them out cleanly:

make bench-mark-shipped PROJECT=m19-b1-dp-rrf-run-1 LABEL=m19 RATIONALE="..."
make bench-mark-shipped PROJECT=m19-b1-dp-rrf-run-2 LABEL=m19 RATIONALE="..."
make bench-refresh-labels # patches R2 manifests, regenerates badges
make bench-tag-shipped LABEL=m19 # annotated git tag bench/m19 anchors the cycle

At release time, release-mark-all links the released package version(s) to the cycle in BENCH_PARITY.yaml and the README badges flip to display the released version’s frozen scores. Full ritual in RELEASING.md.