Multimodal LLM SPI and DTOs
This document specifies first-class image and audio (and optional document) payloads for the LLM Provider SPI and the built-in pipeline, so gateways such as LiteLLM and OpenRouter can pass vision / audio inputs to multimodal models without ad-hoc string encoding. For non-LLM video/voice platforms (Tavus, ElevenLabs TTS-only, etc.), see presentation-layer-and-multimodal-services.md.
1. Design principles
Section titled “1. Design principles”| Principle | Implication |
|---|---|
| Backward compatible | Message.content: str remains valid; all existing adapters and configs work unchanged. |
| Gateway-aligned | DTOs map cleanly to OpenAI-style chat messages (text + image_url / input_audio parts) and to LiteLLM’s multimodal completion API. |
| Explicit capability | Providers declare LLMCapabilities so the pipeline can degrade or refuse when a model cannot accept a modality. |
| Text-first pipeline default | Tier 1 pipeline stages that need plain text (chunking, NER, BM25) use deterministic extraction from multimodal turns (see §6). |
| No video generation in SPI | This SPI covers inputs to multimodal LLMs (vision/audio understanding). Video generation as output is out of scope for complete() unless folded into provider-specific extensions later. |
2. Content parts (DTOs)
Section titled “2. Content parts (DTOs)”Parts are discriminated unions (tagged structs). Serializers map them to provider JSON.
2.1 Text
Section titled “2.1 Text”@dataclass(frozen=True)class TextPart: type: Literal["text"] = "text" text: str2.2 Image
Section titled “2.2 Image”| Part | Fields | Notes |
|---|---|---|
ImageUrlPart | url: str, optional detail: Literal["auto","low","high"] | None | Remote URL; OpenAI / many gateways. |
ImageBase64Part | mime_type: str (e.g. image/png), base64_data: str | Raw base64 without data: prefix; adapters may add prefix per provider. |
@dataclass(frozen=True)class ImageUrlPart: type: Literal["image_url"] = "image_url" url: str detail: Literal["auto", "low", "high"] | None = None
@dataclass(frozen=True)class ImageBase64Part: type: Literal["image_base64"] = "image_base64" mime_type: str base64_data: str2.3 Audio
Section titled “2.3 Audio”| Part | Fields | Notes |
|---|---|---|
AudioUrlPart | url: str | When the provider accepts a URL for input audio. |
AudioBase64Part | mime_type: str (e.g. audio/wav, audio/mpeg), base64_data: str | Inline audio bytes. |
@dataclass(frozen=True)class AudioUrlPart: type: Literal["audio_url"] = "audio_url" url: str
@dataclass(frozen=True)class AudioBase64Part: type: Literal["audio_base64"] = "audio_base64" mime_type: str base64_data: str2.4 Optional: documents (PDF, etc.)
Section titled “2.4 Optional: documents (PDF, etc.)”@dataclass(frozen=True)class DocumentUrlPart: type: Literal["document_url"] = "document_url" url: strUse when the model + gateway support file / PDF inputs in chat (e.g. some Gemini / OpenAI flows). Omit from minimal implementations if unsupported.
2.5 Union type
Section titled “2.5 Union type”ContentPart = ( TextPart | ImageUrlPart | ImageBase64Part | AudioUrlPart | AudioBase64Part | DocumentUrlPart)3. Message and completion
Section titled “3. Message and completion”3.1 Message
Section titled “3.1 Message”@dataclassclass Message: role: str # "system" | "user" | "assistant" | "tool" (if used by adapter) content: str | list[ContentPart]Rules:
strcontent - legacy; treated as a singleTextPart.list[ContentPart]- at least one part; typically oneTextPartplus zero or more media parts (OpenAI recommends text first for some models).
Helpers (core library):
message_text_for_policy(msg: Message) -> str- concatenates onlyTextPart.textfor PII / token estimation warnings when media is present (see §7).message_has_media(msg: Message) -> bool- true if any non-text part exists.
3.2 Completion (unchanged default)
Section titled “3.2 Completion (unchanged default)”@dataclassclass Completion: text: str model: str usage: TokenUsage | None = NoneMultimodal models still return text for complete() in the default SPI. Optional future extension: output_parts: list[ContentPart] | None for image/audio output models - not required for Tier 1 pipeline v1.
4. LLM capabilities and protocol extensions
Section titled “4. LLM capabilities and protocol extensions”4.1 LLMCapabilities
Section titled “4.1 LLMCapabilities”@dataclass(frozen=True)class LLMCapabilities: supports_multimodal_completion: bool modalities_supported: frozenset[str] # subset of {"text", "image", "audio", "document"} supports_multimodal_embedding: bool # e.g. Gemini multimodal embedding APIs4.2 LLMProvider protocol
Section titled “4.2 LLMProvider protocol”Required (existing):
async def complete(self, messages: list[Message], ...) -> Completionasync def embed(self, texts: list[str], ...) -> list[list[float]]
New optional method (preferred over breaking embed):
async def llm_capabilities(self) -> LLMCapabilities: """Default: text-only completion + text embedding.""" return LLMCapabilities( supports_multimodal_completion=False, modalities_supported=frozenset({"text"}), supports_multimodal_embedding=False, )Optional multimodal embedding (providers that support image/audio/document vectors):
async def embed_multimodal( self, inputs: list[MultimodalEmbeddingInput], model: str | None = None,) -> list[list[float]]: ... raise NotImplementedError@dataclass(frozen=True)class MultimodalEmbeddingInput: """One vector input; provider defines how many modalities.""" parts: list[ContentPart] # e.g. TextPart + ImageUrlPart for Gemini multimodal embedIf embed_multimodal is not implemented, the pipeline uses text-only embedding paths (see §6).
4.3 Errors
Section titled “4.3 Errors”| Exception | When |
|---|---|
MultimodalNotSupported | complete() received messages with media parts but supports_multimodal_completion is false or model cannot accept the modality. |
MultimodalEmbeddingNotSupported | Pipeline requested multimodal embed but provider did not implement embed_multimodal. |
5. Adapter convention (LiteLLM, OpenRouter, OpenAI SDK)
Section titled “5. Adapter convention (LiteLLM, OpenRouter, OpenAI SDK)”5.1 complete() mapping
Section titled “5.1 complete() mapping”- Walk
messages; for eachMessage, ifcontentisstr, build a single user/system message with string content. - If
contentislist[ContentPart], map to provider JSON:- OpenAI-compatible:
content: [{"type":"text","text":...},{"type":"image_url","image_url":{"url":...}}, ...] - LiteLLM: pass the same structure; LiteLLM forwards to provider-specific format.
- Anthropic: map parts to Anthropic
contentblocks per their SDK.
- OpenAI-compatible:
- Call the gateway SDK once per
complete()request.
5.2 embed() mapping
Section titled “5.2 embed() mapping”- Unchanged:
embed(texts: list[str])for text embeddings. embed_multimodal: only implemented in adapters where the upstream API supports it; otherwise omit and let pipeline fall back (§6).
5.3 Capability discovery
Section titled “5.3 Capability discovery”- Static: config flag
llm_provider_config.multimodal: true+ model name pattern. - Dynamic: optional
llm_capabilities()calling provider metadata orlitellm.supports_vision(model=...).
6. Built-in pipeline (Tier 1) behavior
Section titled “6. Built-in pipeline (Tier 1) behavior”6.1 Retain path with multimodal content
Section titled “6.1 Retain path with multimodal content”Callers may pass RetainRequest (or equivalent) with content_parts: list[ContentPart] in addition to or instead of plain content: str (see public API evolution in core).
Strategies (configurable pipeline.multimodal.retain_strategy):
| Strategy | Behavior |
|---|---|
caption_then_embed (default) | Run one complete() with a vision-capable model using all parts → caption/summary text → chunk + embed text (works with any text embedder). |
multimodal_embed | If embed_multimodal is supported, embed parts directly; else fall back to caption_then_embed. |
text_only | Reject or strip non-text parts if capabilities disallow multimodal. |
6.2 Recall / query analysis
Section titled “6.2 Recall / query analysis”- Text queries - unchanged.
- Voice/image query: caller passes
Messagelist withImageUrlPart/AudioBase64Partin the user turn;complete()for query analysis uses multimodal model when capabilities allow.
6.3 Stages that require plain text
Section titled “6.3 Stages that require plain text”Entity extraction, BM25 document store, and similar steps consume message_text_for_policy() or caption output, not raw pixels.
7. Policy and observability
Section titled “7. Policy and observability”- PII / barriers: Media parts may be excluded from regex PII scans; optional future integration with vision PII models (out of scope for this SPI).
- Token budgets: Token estimation for multimodal requests is provider-dependent; adapters may return approximate
usageor markinput_tokensas partial. - OTel: Span attributes
astrocyte.llm.modalities=image,audio(low cardinality).
8. Relationship to other documents
Section titled “8. Relationship to other documents”| Document | Role |
|---|---|
provider-spi.md §4.11 | Summary of multimodal extensions in the SPI doc |
presentation-layer-and-multimodal-services.md | Non-LLM video/voice products vs LLM multimodal |
built-in-pipeline.md | Pipeline stages; align multimodal retain strategies with §6 here |
9. Summary
Section titled “9. Summary”- DTOs:
ContentPartunion +Message.content: str | list[ContentPart]. - SPI:
llm_capabilities(), optionalembed_multimodal(), samecomplete()entrypoint with richer messages. - Adapters: Map DTOs → OpenAI-compatible / LiteLLM / provider SDKs.
- Pipeline: Text-first defaults; caption or multimodal embed strategies for retain.