Coverage for astrocyte/pipeline/fact_entity

"""M12.4: entity-graph expansion over fact-grain hits — REVERTED EXPERIMENT.

**Status:** Implemented but not wired into the bench pipeline. The bench
gate at M12.4 showed LME regressed -4.5pp (55.5→51.0) with
multi-session dropping 11.8→2.9 — the exact category expansion was
supposed to help. LoCoMo was flat (small gains on multi-hop +2.5,
temporal +2.5 offset by losses on adversarial -2.4, open-domain -2.7).

Root cause hypothesis: LME's user-haystack has 50+ sessions with dense
entity overlap (same user, recurring topics). Naive co-occurrence
expansion floods the candidate pool with off-topic but entity-linked
facts that the cross-encoder rerank can't filter on a per-fact basis.
LoCoMo's 10-conversation graph is sparse enough that signal/noise
breaks even.

Kept for: documentation of what was tried, the test suite as a
contract pin if a future attempt re-uses this primitive with a
smarter gating strategy (e.g. only expand when entities are RARE
across the bank, or only when the picker selected ≥2 lines).

Sits between fact semantic-retrieval and fact rerank. For multi-hop
and multi-session questions, the question's anchor entities and the
answer's anchor entities are different — the bridge is a chain of
co-occurring entities across sections. The bi-encoder semantic search
finds facts that are textually similar to the question, but it can't
follow that chain.

This module follows it. Given the top-K semantic hits, collect their
entities, find OTHER facts that mention those entities (cross-section),
and return the expanded set. The downstream cross-encoder rerank
([fact_rerank.py]) picks which expanded facts actually answer the
question.

Generic across benches — entity strings (proper nouns + typed labels
like ``role:doctor``) are bench-agnostic. No question parsing, no LLM
call: the seed entities come from the bi-encoder's own top hits.

Design knob trade-offs:

- ``max_seed_entities``: too high and we expand from noisy entities
  the bi-encoder happened to surface; too low and we miss valid
  bridges. 8 is the Hindsight default for similar graph-walk depth.
- ``max_neighbor_facts_per_entity``: caps the fan-out per seed entity.
  A common entity like "User" could pull thousands of facts; the cap
  keeps total candidates bounded.
- ``max_expanded_facts``: total cap across all entities. Combined with
  the downstream rerank's ``rerank_top_k=30``, this bounds inference
  cost.

See:
- ``docs/_design/recall.md`` §15 (M12.4)
- ``astrocyte.pipeline.fact_rerank`` for the next stage
- Hindsight's ``search_unit_links`` for the section-grain analogue
"""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from astrocyte.provider import PageIndexStore
    from astrocyte.types import PageIndexFactHit

logger = logging.getLogger("astrocyte.pipeline.fact_entity_expansion")

async def expand_via_entity_graph(
    initial_hits: list[PageIndexFactHit],
    *,
    store: PageIndexStore,
    bank_id: str,
    document_id: str | None = None,
    max_seed_hits: int = 5,
    max_seed_entities: int = 8,
    max_neighbor_facts_per_entity: int = 10,
    max_expanded_facts: int = 20,
) -> list[PageIndexFactHit]:
    """Expand a candidate fact set by walking entity co-occurrence.

    Args:
        initial_hits: Top-ranked semantic / picker-filtered fact hits.
            Their entities seed the expansion.
        store: PageIndexStore for entity-anchored fact lookup.
        bank_id: Bank scope for all lookups.
        document_id: Optional doc-scope — if set, only facts within
            this document are considered. ``None`` lets the expansion
            cross documents (useful for LME multi-session, where the
            bridge spans haystack sessions).
        max_seed_hits: How many of ``initial_hits`` to draw entities
            from. The top hits are most likely to be on-topic; lower
            hits introduce noise.
        max_seed_entities: Cap on distinct entities used as seeds.
        max_neighbor_facts_per_entity: Cap on facts fetched per seed
            entity.
        max_expanded_facts: Total cap on the returned expanded set.

    Returns:
        A list of ``PageIndexFactHit`` that are NOT in ``initial_hits``
        (deduped by ``fact_id``) and that mention at least one entity
        shared with the seed hits' top entities. Order matches the
        store's per-entity ranking; downstream rerank reorders.
    """
    if not initial_hits:
        return []

    # 1. Collect seed entities from the top initial hits. Preserve
    #    order of first appearance so deterministic tests are easy.
    seed_entities: list[str] = []
    seen_entities: set[str] = set()
    for hit in initial_hits[:max_seed_hits]:
        for entity in hit.entities or []:
            key = entity.lower()
            if key in seen_entities:
                continue
            seen_entities.add(key)
            seed_entities.append(entity)
            if len(seed_entities) >= max_seed_entities:
                break
        if len(seed_entities) >= max_seed_entities:
            break

    if not seed_entities:
        return []

    # 2. For each seed entity, fetch facts that mention it. Dedup
    #    against the initial hits and against each other by fact_id.
    initial_ids: set[str] = {h.fact_id for h in initial_hits}
    expanded: list[PageIndexFactHit] = []
    expanded_ids: set[str] = set()

    for entity in seed_entities:
        if len(expanded) >= max_expanded_facts:
            break
        try:
            neighbor_hits = await store.search_facts_by_entity(
                bank_id,
                entity,
                top_k=max_neighbor_facts_per_entity,
                document_id=document_id,
            )
        except Exception as exc:  # noqa: BLE001
            logger.warning(
                "search_facts_by_entity(%r) failed: %s: %s",
                entity,
                type(exc).__name__,
                exc,
            )
            continue

        for hit in neighbor_hits:
            if hit.fact_id in initial_ids or hit.fact_id in expanded_ids:
                continue
            expanded.append(hit)
            expanded_ids.add(hit.fact_id)
            if len(expanded) >= max_expanded_facts:
                break

    return expanded
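
The module docstring suggests that a future attempt might only expand when entities are RARE across the bank. The block below is a minimal sketch of one way such a gate could look, not part of this module. It assumes the caller can enumerate each fact's entity list. The helper names `entity_fact_counts` and `filter_rare_seeds`, the `max_fraction` threshold, and the toy bank contents are all hypothetical.

```python
from __future__ import annotations

from collections import Counter
from collections.abc import Iterable


def entity_fact_counts(fact_entities: Iterable[Iterable[str]]) -> Counter[str]:
    """Count how many facts mention each entity (case-insensitive).

    Hypothetical helper: the real store may expose these counts differently.
    """
    counts: Counter[str] = Counter()
    for entities in fact_entities:
        # Use a set so a fact that repeats an entity is counted once.
        counts.update({entity.lower() for entity in entities})
    return counts


def filter_rare_seeds(
    seed_entities: list[str],
    counts: Counter[str],
    total_facts: int,
    max_fraction: float = 0.05,
) -> list[str]:
    """Keep seeds mentioned by at most ``max_fraction`` of all facts.

    Order of ``seed_entities`` is preserved. ``max_fraction`` is an
    assumed knob, not a tuned value.
    """
    if total_facts <= 0:
        return []
    limit = max_fraction * total_facts
    return [e for e in seed_entities if counts.get(e.lower(), 0) <= limit]


# Toy bank: "User" appears in every fact, the other entities in one each.
bank = [
    ["User", "Dr. Patel"],
    ["User", "role:doctor"],
    ["User", "Lisbon"],
    ["User"],
]
counts = entity_fact_counts(bank)
rare = filter_rare_seeds(
    ["User", "Dr. Patel", "Lisbon"], counts, total_facts=len(bank), max_fraction=0.5
)
print(rare)
```

Under these assumptions the ubiquitous "User" seed is dropped before any per-entity lookup, while the rarer seeds pass through in their original order. Whether a gate like this would recover the LME regression is untested.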

Coverage for astrocyte/pipeline/fact_entity_expansion.py: 98%

41 statements