Coverage for astrocyte/pipeline/fact_entity_expansion.py: 98%
41 statements
coverage.py v7.15.0, created at 2026-07-04 05:24 +0000
"""M12.4: entity-graph expansion over fact-grain hits (REVERTED EXPERIMENT).

**Status:** Implemented but not wired into the bench pipeline. The bench
gate at M12.4 showed LME regressed -4.5pp (55.5→51.0), with
multi-session dropping 11.8→2.9: the exact category expansion was
supposed to help. LoCoMo was flat (small gains on multi-hop +2.5 and
temporal +2.5, offset by losses on adversarial -2.4 and open-domain -2.7).

Root cause hypothesis: LME's user-haystack has 50+ sessions with dense
entity overlap (same user, recurring topics). Naive co-occurrence
expansion floods the candidate pool with off-topic but entity-linked
facts that the cross-encoder rerank can't filter on a per-fact basis.
LoCoMo's 10-conversation graph is sparse enough that signal/noise
breaks even.

Kept for two reasons: as documentation of what was tried, and so the
test suite can act as a contract pin if a future attempt re-uses this
primitive with a smarter gating strategy (e.g. only expand when entities
are RARE across the bank, or only when the picker selected ≥2 lines).

Sits between fact semantic-retrieval and fact rerank. For multi-hop
and multi-session questions, the question's anchor entities and the
answer's anchor entities are different; the bridge is a chain of
co-occurring entities across sections. The bi-encoder semantic search
finds facts that are textually similar to the question, but it can't
follow that chain.

This module follows it. Given the top-K semantic hits, collect their
entities, find OTHER facts that mention those entities (cross-section),
and return the expanded set. The downstream cross-encoder rerank
([fact_rerank.py]) picks which expanded facts actually answer the
question.

Generic across benches: entity strings (proper nouns + typed labels
like ``role:doctor``) are bench-agnostic. No question parsing, no LLM
call: the seed entities come from the bi-encoder's own top hits.

Design knob trade-offs:

- ``max_seed_entities``: too high and we expand from noisy entities
  the bi-encoder happened to surface; too low and we miss valid
  bridges. 8 is the Hindsight default for similar graph-walk depth.
- ``max_neighbor_facts_per_entity``: caps the fan-out per seed entity.
  A common entity like "User" could pull thousands of facts; the cap
  keeps total candidates bounded.
- ``max_expanded_facts``: total cap across all entities. Combined with
  the downstream rerank's ``rerank_top_k=30``, this bounds inference
  cost.

See:
- ``docs/_design/recall.md`` §15 (M12.4)
- ``astrocyte.pipeline.fact_rerank`` for the next stage
- Hindsight's ``search_unit_links`` for the section-grain analogue
"""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from astrocyte.provider import PageIndexStore
    from astrocyte.types import PageIndexFactHit

logger = logging.getLogger("astrocyte.pipeline.fact_entity_expansion")


async def expand_via_entity_graph(
    initial_hits: list[PageIndexFactHit],
    *,
    store: PageIndexStore,
    bank_id: str,
    document_id: str | None = None,
    max_seed_hits: int = 5,
    max_seed_entities: int = 8,
    max_neighbor_facts_per_entity: int = 10,
    max_expanded_facts: int = 20,
) -> list[PageIndexFactHit]:
    """Expand a candidate fact set by walking entity co-occurrence.

    Args:
        initial_hits: Top-ranked semantic / picker-filtered fact hits.
            Their entities seed the expansion.
        store: PageIndexStore for entity-anchored fact lookup.
        bank_id: Bank scope for all lookups.
        document_id: Optional doc-scope. If set, only facts within
            this document are considered. ``None`` lets the expansion
            cross documents (useful for LME multi-session, where the
            bridge spans haystack sessions).
        max_seed_hits: How many of ``initial_hits`` to draw entities
            from. The top hits are most likely to be on-topic; lower
            hits introduce noise.
        max_seed_entities: Cap on distinct entities used as seeds.
        max_neighbor_facts_per_entity: Cap on facts fetched per seed
            entity.
        max_expanded_facts: Total cap on the returned expanded set.

    Returns:
        A list of ``PageIndexFactHit`` that are NOT in ``initial_hits``
        (deduped by ``fact_id``) and that mention at least one entity
        shared with the seed hits' top entities. Order matches the
        store's per-entity ranking; downstream rerank reorders.
    """
    if not initial_hits:
        return []

    # 1. Collect seed entities from the top initial hits. Preserve
    #    order of first appearance so deterministic tests are easy.
    seed_entities: list[str] = []
    seen_entities: set[str] = set()
    for hit in initial_hits[:max_seed_hits]:
        for entity in hit.entities or []:
            key = entity.lower()
            if key in seen_entities:
                continue
            seen_entities.add(key)
            seed_entities.append(entity)
            if len(seed_entities) >= max_seed_entities:
                break
        if len(seed_entities) >= max_seed_entities:
            break

    if not seed_entities:
        return []

    # 2. For each seed entity, fetch facts that mention it. Dedup
    #    against the initial hits and against each other by fact_id.
    initial_ids: set[str] = {h.fact_id for h in initial_hits}
    expanded: list[PageIndexFactHit] = []
    expanded_ids: set[str] = set()

    for entity in seed_entities:
        if len(expanded) >= max_expanded_facts:
            break
        try:
            neighbor_hits = await store.search_facts_by_entity(
                bank_id,
                entity,
                top_k=max_neighbor_facts_per_entity,
                document_id=document_id,
            )
        except Exception as exc:  # noqa: BLE001
            logger.warning(
                "search_facts_by_entity(%r) failed: %s: %s",
                entity,
                type(exc).__name__,
                exc,
            )
            continue

        for hit in neighbor_hits:
            if hit.fact_id in initial_ids or hit.fact_id in expanded_ids:
                continue
            expanded.append(hit)
            expanded_ids.add(hit.fact_id)
            if len(expanded) >= max_expanded_facts:
                break

    return expanded
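

Usage sketch (not part of the measured module). The block below is a minimal, self-contained illustration of the one-hop expansion described above, written so it runs without the astrocyte package. `FakeFactHit`, `FakeStore`, `expand_sketch`, and `demo` are hypothetical stand-ins invented for this example: the fake hit carries only the `fact_id` and `entities` fields the expansion reads, and the fake store only mimics the `search_facts_by_entity(bank_id, entity, top_k=..., document_id=...)` call shape used above. `expand_sketch` condenses the two steps and omits the per-entity error handling.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeFactHit:
    """Hypothetical stand-in for PageIndexFactHit (only the fields read here)."""
    fact_id: str
    entities: list[str] = field(default_factory=list)


class FakeStore:
    """Hypothetical in-memory stand-in for PageIndexStore."""

    def __init__(self, facts: list[FakeFactHit]) -> None:
        self.facts = facts

    async def search_facts_by_entity(self, bank_id, entity, *, top_k, document_id=None):
        # Case-insensitive entity match, capped at top_k.
        # bank_id and document_id are accepted but ignored in this sketch.
        wanted = entity.lower()
        matches = [f for f in self.facts if wanted in (e.lower() for e in f.entities)]
        return matches[:top_k]


async def expand_sketch(initial_hits, store, *, max_seed_hits=5,
                        max_seed_entities=8, per_entity=10, max_expanded=20):
    # Step 1: ordered, case-insensitively deduped seed entities from the top hits.
    seeds: list[str] = []
    seen: set[str] = set()
    for hit in initial_hits[:max_seed_hits]:
        for entity in hit.entities or []:
            if entity.lower() not in seen and len(seeds) < max_seed_entities:
                seen.add(entity.lower())
                seeds.append(entity)

    # Step 2: one hop per seed entity, skipping initial hits and duplicates.
    skip = {h.fact_id for h in initial_hits}
    expanded: list[FakeFactHit] = []
    for entity in seeds:
        hits = await store.search_facts_by_entity("bank", entity, top_k=per_entity)
        for hit in hits:
            if hit.fact_id not in skip and len(expanded) < max_expanded:
                skip.add(hit.fact_id)
                expanded.append(hit)
    return expanded


def demo() -> list[str]:
    facts = [
        FakeFactHit("f1", ["Alice", "Paris"]),
        FakeFactHit("f2", ["alice", "Bob"]),
        FakeFactHit("f3", ["Bob", "Rome"]),
        FakeFactHit("f4", ["paris"]),
    ]
    # Seed from f1 only: its entities are Alice and Paris.
    expanded = asyncio.run(expand_sketch([facts[0]], FakeStore(facts)))
    return [h.fact_id for h in expanded]
```

With `f1` as the only initial hit, the sketch reaches `f2` (via Alice) and `f4` (via Paris) but not `f3`, which is linked only through Bob. The expansion is a single hop from the seed entities; it does not recurse into entities discovered on the expanded facts.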