Coverage for astrocyte/pipeline/fact_entity_expansion.py: 98%

41 statements  

coverage.py v7.15.0, created at 2026-07-04 05:24 +0000

"""M12.4: entity-graph expansion over fact-grain hits (REVERTED EXPERIMENT).

**Status:** Implemented but not wired into the bench pipeline. The bench
gate at M12.4 showed LME regressing by 4.5pp (55.5→51.0), with
multi-session dropping 11.8→2.9, the exact category expansion was
supposed to help. LoCoMo was flat (small gains on multi-hop +2.5 and
temporal +2.5, offset by losses on adversarial -2.4 and open-domain -2.7).

Root cause hypothesis: LME's user-haystack has 50+ sessions with dense
entity overlap (same user, recurring topics). Naive co-occurrence
expansion floods the candidate pool with off-topic but entity-linked
facts that the cross-encoder rerank can't filter on a per-fact basis.
LoCoMo's 10-conversation graph is sparse enough that signal/noise
breaks even.

Kept for: documentation of what was tried, and the test suite as a
contract pin if a future attempt re-uses this primitive with a
smarter gating strategy (e.g. only expand when entities are RARE
across the bank, or only when the picker selected ≥2 lines).

Designed to sit between fact semantic-retrieval and fact rerank. For
multi-hop and multi-session questions, the question's anchor entities
and the answer's anchor entities are different; the bridge is a chain of
co-occurring entities across sections. The bi-encoder semantic search
finds facts that are textually similar to the question, but it can't
follow that chain.

This module follows it. Given the top-K semantic hits, collect their
entities, find OTHER facts that mention those entities (cross-section),
and return the expanded set. The downstream cross-encoder rerank
(``fact_rerank.py``) picks which expanded facts actually answer the
question.

Generic across benches: entity strings (proper nouns + typed labels
like ``role:doctor``) are bench-agnostic. No question parsing, no LLM
call: the seed entities come from the bi-encoder's own top hits.

Design knob trade-offs:

- ``max_seed_entities``: too high and we expand from noisy entities
  the bi-encoder happened to surface; too low and we miss valid
  bridges. 8 is the Hindsight default for similar graph-walk depth.
- ``max_neighbor_facts_per_entity``: caps the fan-out per seed entity.
  A common entity like "User" could pull thousands of facts; the cap
  keeps total candidates bounded.
- ``max_expanded_facts``: total cap across all entities. Combined with
  the downstream rerank's ``rerank_top_k=30``, this bounds inference
  cost.

See:
- ``docs/_design/recall.md`` §15 (M12.4)
- ``astrocyte.pipeline.fact_rerank`` for the next stage
- Hindsight's ``search_unit_links`` for the section-grain analogue
"""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from astrocyte.provider import PageIndexStore
    from astrocyte.types import PageIndexFactHit

logger = logging.getLogger("astrocyte.pipeline.fact_entity_expansion")


async def expand_via_entity_graph(
    initial_hits: list[PageIndexFactHit],
    *,
    store: PageIndexStore,
    bank_id: str,
    document_id: str | None = None,
    max_seed_hits: int = 5,
    max_seed_entities: int = 8,
    max_neighbor_facts_per_entity: int = 10,
    max_expanded_facts: int = 20,
) -> list[PageIndexFactHit]:
    """Expand a candidate fact set by walking entity co-occurrence.

    Args:
        initial_hits: Top-ranked semantic / picker-filtered fact hits.
            Their entities seed the expansion.
        store: PageIndexStore for entity-anchored fact lookup.
        bank_id: Bank scope for all lookups.
        document_id: Optional doc-scope: if set, only facts within
            this document are considered. ``None`` lets the expansion
            cross documents (useful for LME multi-session, where the
            bridge spans haystack sessions).
        max_seed_hits: How many of ``initial_hits`` to draw entities
            from. The top hits are most likely to be on-topic; lower
            hits introduce noise.
        max_seed_entities: Cap on distinct entities used as seeds.
        max_neighbor_facts_per_entity: Cap on facts fetched per seed
            entity.
        max_expanded_facts: Total cap on the returned expanded set.

    Returns:
        A list of ``PageIndexFactHit`` that are NOT in ``initial_hits``
        (deduped by ``fact_id``) and that mention at least one entity
        shared with the seed hits' top entities. Order matches the
        store's per-entity ranking; downstream rerank reorders.
    """

    if not initial_hits:
        return []

    # 1. Collect seed entities from the top initial hits. Preserve
    #    order of first appearance so deterministic tests are easy.
    seed_entities: list[str] = []
    seen_entities: set[str] = set()
    for hit in initial_hits[:max_seed_hits]:
        for entity in hit.entities or []:
            key = entity.lower()
            if key in seen_entities:
                continue
            seen_entities.add(key)
            seed_entities.append(entity)
            if len(seed_entities) >= max_seed_entities:
                break
        if len(seed_entities) >= max_seed_entities:
            break

    if not seed_entities:
        return []

    # 2. For each seed entity, fetch facts that mention it. Dedup
    #    against the initial hits and against each other by fact_id.
    initial_ids: set[str] = {h.fact_id for h in initial_hits}
    expanded: list[PageIndexFactHit] = []
    expanded_ids: set[str] = set()

    for entity in seed_entities:
        if len(expanded) >= max_expanded_facts:
            break
        try:
            neighbor_hits = await store.search_facts_by_entity(
                bank_id,
                entity,
                top_k=max_neighbor_facts_per_entity,
                document_id=document_id,
            )
        except Exception as exc:  # noqa: BLE001
            logger.warning(
                "search_facts_by_entity(%r) failed: %s: %s",
                entity,
                type(exc).__name__,
                exc,
            )
            continue

        for hit in neighbor_hits:
            if hit.fact_id in initial_ids or hit.fact_id in expanded_ids:
                continue
            expanded.append(hit)
            expanded_ids.add(hit.fact_id)
            if len(expanded) >= max_expanded_facts:
                break

    return expanded
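
The module docstring suggests that a future attempt might expand only from entities that are rare across the bank. A minimal sketch of such a gate, assuming per-entity fact counts are already available from somewhere, might look like the following. The names `filter_rare_entities`, `entity_fact_counts`, and `max_fact_count` are hypothetical and not part of astrocyte, and the counts shown are made-up illustration data.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence


def filter_rare_entities(
    seed_entities: Sequence[str],
    entity_fact_counts: Mapping[str, int],
    *,
    max_fact_count: int = 25,
) -> list[str]:
    """Keep only seed entities that few facts in the bank mention.

    Hypothetical helper, not part of astrocyte. ``entity_fact_counts``
    maps a lowercased entity string to the number of facts mentioning
    it; how those counts are obtained is left open. Entities missing
    from the mapping are treated as count 0 and therefore kept.
    """
    rare: list[str] = []
    for entity in seed_entities:
        # Match the module's case-insensitive entity comparison.
        if entity_fact_counts.get(entity.lower(), 0) <= max_fact_count:
            rare.append(entity)
    return rare


# Made-up counts: "user" is ubiquitous, the other entities are rare.
counts = {"user": 4200, "dr. patel": 3, "role:doctor": 12}
seeds = ["User", "Dr. Patel", "role:doctor", "Lisbon"]
rare_seeds = filter_rare_entities(seeds, counts, max_fact_count=25)
print(rare_seeds)
```

In this sketch the filtered list would stand in for `seed_entities` before the per-entity lookups. Whether such a gate would recover the LME regression is untested.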