Generated 2026-06-09T01:23:58.539Z from commit dbaa25962edc11ab30e8d4ca1715f9ae5bf77331. Worktree: clean.

# DOC25 — Document Intelligence and Universal Ingestion Architecture

**Version:** V2.0
**Date:** 2026-04-26
**Status:** Operative spec

**Supersedes:** Document Intelligence & Agent Context Management Specification V1.0 (2026-04-11) and Document Intelligence & Extraction Pipeline — Updates Proposal R1 (2026-04-19). Both source documents are absorbed in full and should not be consulted for current behavior; this document is authoritative.

**Scope statement:** DOC25 V2.0 is ELNOR's universal ingestion and document intelligence layer. It owns: (1) tiered LLM context management for documents in agent conversations; (2) universal ingestion orchestration across all surfaces (corpus ingestion, email attachments, browser captures, chat uploads, demonstration captures, task outputs, autonomous gathering, direct uploads); (3) document conversion routing across MarkItDown, Docling, NuExtract, GLiNER, Firecrawl, Tesseract, OCR; (4) content-addressable storage and cross-surface deduplication; (5) the `DOC25_IngestionResult` consumer contract that DOC73 and other consumers depend on; (6) runtime retrieval tools and re-read posture; (7) the marker scheme for content injected into LLM prompts.

DOC25 V2.0 does NOT own: corpus-specific extraction rules (DOC73), the entity graph schema or node taxonomy (DOC72), capability registry or JIT tool mounting (DOC24), durable queue execution or sole-writer invariants (EC Core), prompt-composition governance (DOC15 + DOC24), or the content of skills, memories, or domain profiles. DOC25 supplies content into prompts in conformance with whatever marker and budget governance DOC15 + DOC24 ultimately codify.

**Primary dependencies:**

- **DOC72 R5.72** — entity graph schema; node taxonomy; intake pipeline; the `document` entity subtype DocumentEntity records hydrate into.
- **DOC73 V1.4.1** — primary consumer of `DOC25_IngestionResult`. V1.4.1 has the schema inlined for its freeze (§15.2). Future DOC73 V1.5 will swap the inline schema to a normative reference into DOC25 V2.0 §17. Until then, DOC25 V2.0 §17 and DOC73 V1.4.1 §15.2 must remain field-for-field identical; updates land here first and propagate to DOC73 in V1.5.
- **DOC18 R2** — LlamaIndex retrieval sidecar; chunk indexing target.
- **DOC20 R4.3** — Document Viewer surface; OCR UI entry points; browser capture surface.
- **DOC24 R2.5** — capability registry; tool definitions; routing cascade. The tools DOC25 specifies (`retrieve_document_pages`, `retrieve_full_document`, `q.document.ocr`, `document.convert_to_markdown`, the memory query tool family) register through DOC24's tool registry.
- **DOC11 R14** — OpenClaw gateway; runtime model capability surface; sub-agent spawning (`sessions_spawn`).
- **EC Core (Addendum A V3.3)** — durable queue, background orchestrator, sole-writer invariant, batch-commit policy, resource throttle.
- **DOC15** — prompt composition governance (eventual owner of the marker scheme format; DOC25 conforms).
- **DOC1** — memory governance (DOC1 governs durable promotion of any memory; DOC25 produces source-grounded extracted content but never directly writes memory directives).

**Primary consumers:**

- **DOC73 V1.4.1 corpus extraction (§14, §15)** — consumes `DOC25_IngestionResult` per §17. Hard-fails any document where the result is malformed.
- **DOC72 §20A surface intake contracts** — route document conversion through DOC25 (MarkItDown/Docling/OCR cascade) before running entity extraction.
- **DOC20 R4.3 Document Viewer** — consumes pre-computed page text, OCR results, classification, and structured summaries for in-viewer features (search, area select, page summaries).
- **DOC10 / DOC14 chat surfaces** — consume `prepareDocumentContext` for tiered LLM context on chat-attached documents and red-team room substrates.
- **DOC23 task module outputs** — task modules emitting documents route them through DOC25's ingestion pipeline.
- **DOC16 Entry 16.7 M365 integration** — email attachments arriving via M365 route through DOC25 ingestion.
- **DOC3 demonstration capture** — demonstration runs that include documents route them through DOC25.

---

## What's New in V2.0 vs V1.0

### Major additions

V1.0 and the R1 proposal were narrow: V1.0 covered tiered PDF context management for LLM conversations only; R1 added three operational improvements (LLM document escalation, OCR pipeline, MarkItDown backend). Neither covered universal ingestion. V2.0 expands DOC25 to own the full document intelligence and ingestion layer. The major V2.0 additions are:

1. **Universal ingestion orchestration** (§11). Any document entering ELNOR through any surface routes through one pipeline, one tool roster, one storage model, one status-tracking system. Surface-specific work happens after the universal ingestion layer completes.
2. **Content-addressable storage model** (§12). Originals stay where the user puts them; derived artifacts are stored under content-hash keys in ELNOR's private document store. Originals are never copied or modified.
3. **Cross-surface deduplication** (§13). The same document encountered through multiple surfaces (corpus, email, browser, chat, EDGAR pull, etc.) is recognized as one document with multiple source paths. No re-conversion, no re-summary, no duplicate entity nodes.
4. **R1 absorbed** (§§8-10). The `retrieve_document_pages` tool (R1 §2), OCR pipeline architecture (R1 §3), and MarkItDown universal extraction backend (R1 §4) are integrated as full operative sections, not proposal references.
5. **DOC73 V1.4.1 R5.3 items honored** (multiple sections). E1 IngestionQualityReport (§15), E2 multi-hash content addressability (§12), E3 strict ingestion state machine (§14), E4 hard-fail reason codes (§14), E5 storage exhaustion controls (§15), E6 thermal/memory throttle (§15), E7 tool health checks and capacity leases (§15), E8 versioned immutable artifacts (§12), E9-E12 tool routing (§10), E13 per-document reconvert affordance (§11, §19), E17 LlamaCloud policy gating (§10), E18 5000-doc batch commit policy (§15), E19 200-doc realistic timing (§15).
6. **`DOC25_IngestionResult` consumer contract authoritatively owned** (§17). DOC73 V1.4.1 inlined the schema for its freeze. DOC25 V2.0 §17 is now the authoritative source. Future DOC73 V1.5 will reference DOC25 §17 instead of inlining.
7. **Pipeline state machine** (§14). Strict per-document state progression with atomic transitions; memory writes forbidden until required source/content/conversion references exist.
8. **Tool health, failure handling, status model** (§15). Pre-dispatch health checks, capacity leases, retryable-vs-hard-fail distinction, hard-fail reason code vocabulary, storage-exhaustion circuit breaker, thermal throttle, batch-commit policy, realistic timing bands.
9. **Runtime retrieval tools and re-read posture** (§16). Document retrieval tools (`retrieve_document_pages`, `retrieve_full_document`, memory-to-source resolution wrapper), memory query tools (consumed by MemoryAgent), and the retrieval posture directive (when to rely on memory vs re-read source).
10. **Marker scheme for injected content** (§18). Consistent format for `extracted_memory`, `corpus_context`, and other content tags injected into LLM prompts. Enables programmatic reference and verification by specialist sub-agents.
11. **Settings surface** (§19). Document Intelligence settings panel from R1 §5 plus universal ingestion controls (per-corpus retention, tool installation status, dedup behavior, OCR controls).
12. **Specialist sub-agent integration** (referenced throughout). DocumentIntelligenceAgent and MemoryAgent (per `ELNOR_SUBAGENT_PRECOMPUTE_TOOL_OPTIMIZATION_NOTES_V4.md` §§1.7-1.11) consume DOC25's runtime tools. DOC25 specifies the tool surface; the sub-agent specification specifies the agents that use them.

### Open question resolutions

The eight open questions in `DOC25_PLANNING_NOTES_V2.md` are resolved as follows. Each resolution is normative for V2.0:

- **Q1 Near-dedup threshold for revisions.** Multi-hash content addressability per E2 (raw_file_hash, normalized_binary_hash, normalized_text_hash, page_hashes, chunk_hashes, source_instance_id) handles V2.0's dedup needs. Deeper near-dedup (shingle/minhash similarity scoring above a configurable threshold) is deferred to V2.1.
- **Q2 Permission/privacy zones.** Same content with different policy/sensitivity contexts gets a different `source_instance_id` even when raw_file_hash matches. A document uploaded to a `firewalled` corpus and the same document uploaded to an `ambient` corpus share raw_file_hash but have different source_instance_ids. This prevents incorrect dedup that would collapse policy distinctions. See §12.
- **Q3 Auto-propose batch reconversion when tools update.** Manual per-document only for V2.0. The Documents tab in DOC7 provides per-document reconvert affordances per E13. Auto-propose batch reconversion when, for example, Docling releases an improved table extractor is deferred to V2.1.
- **Q4 Chunk reuse vs re-chunking when chunking config changes.** Versioned `chunking_config_version` on chunks. Re-chunking is opt-in per document or per corpus. Matches the versioned-immutable-artifacts pattern from E8. See §12, §14.
- **Q5 Storage retention for large corpora.** User-configurable per corpus, default keep-everything. Aligns with the per-corpus trust posture pattern in DOC73 V1.4.1 R5.3 A2 (per-corpus configuration of behavioral defaults). See §15.
- **Q6 Cross-corpus document browser.** Yes. Q > Knowledge > Documents provides a read-only browser of all documents in ELNOR's document store across all corpora and surfaces, with an export option. See §19.
- **Q7 Originals at unstable paths.** Hybrid, in three parts: (1) best-effort auto-repath by content-hash matching: when a document is encountered at a new path with the same hash as a known document, the system silently adds the new path to `original_paths` and continues; (2) a manual relink fallback with a UI affordance for cases where auto-repath fails (e.g., the user moved a OneDrive document the system hasn't seen since the move); (3) derived artifacts are ALWAYS preserved regardless of original-path breaks. The `document_id` and converted markdown remain valid even if all known original paths break. See §13.
- **Q8 Cross-user dedup in multi-user deployments.** Out of scope for V2.0. Single-user deployment assumption holds. Flagged for any future multi-user revision.
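The Q2 and Q7 resolutions can be read together as a single lookup rule. The following is a minimal sketch under stated assumptions, not the normative model: `Sighting`, `DocumentRecord`, `policyZone`, `instanceKey`, `recordSighting`, and the placeholder id scheme are all hypothetical names introduced for illustration. The authoritative storage and dedup schemas are in §12 and §13.

```typescript
// Hypothetical shapes for illustration only; §12 and §13 hold the normative schemas.
interface Sighting {
  rawFileHash: string; // hash of the original bytes
  path: string;        // where the document was encountered
  policyZone: string;  // assumed label, e.g. 'firewalled' or 'ambient'
}

interface DocumentRecord {
  rawFileHash: string;
  sourceInstanceId: string;
  policyZone: string;
  originalPaths: string[];
}

// Key on (zone, hash) so identical bytes in different policy zones
// never collapse into one source instance (Q2).
function instanceKey(rawFileHash: string, policyZone: string): string {
  return policyZone + ":" + rawFileHash;
}

function recordSighting(
  store: Map<string, DocumentRecord>,
  sighting: Sighting,
): DocumentRecord {
  const key = instanceKey(sighting.rawFileHash, sighting.policyZone);
  const existing = store.get(key);
  if (existing !== undefined) {
    // Known content at a new path: append the path and continue (Q7 auto-repath).
    if (existing.originalPaths.indexOf(sighting.path) === -1) {
      existing.originalPaths.push(sighting.path);
    }
    return existing;
  }
  const created: DocumentRecord = {
    rawFileHash: sighting.rawFileHash,
    sourceInstanceId: "src-" + (store.size + 1), // placeholder id scheme (assumption)
    policyZone: sighting.policyZone,
    originalPaths: [sighting.path],
  };
  store.set(key, created);
  return created;
}
```

In this sketch, the same bytes seen twice in one zone yield one record with two paths, while the same bytes seen in a second zone yield a second record that shares `rawFileHash` but carries a different `sourceInstanceId`.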

### Cleared from V1.0

- **V1.0 §1.3 misidentified DOC73** as "Agent Communication Architecture." DOC73 is Positronic Brain Enhancement (corpus extraction, ConsolidatedUnderstanding, living memory). Corrected throughout V2.0.
- **V1.0 §6.1 model capabilities matrix** is updated to current April 2026 reality (§6). GPT-4o, GPT-4 Turbo, and other major models now support native PDF; the V1.0 matrix listing them as no-PDF was stale.
- **V1.0 §14 Implementation Phases** is dropped. V2.0 is an end-state spec. Phasing belongs in build planning, not in normative architecture.
- **V1.0 §15 Open Questions** are triaged into V2.0 §26: those resolved by R1 + planning notes content are marked resolved with the resolution; those that remain genuinely open carry forward.
- **R1 was a proposal**; V2.0 absorbs it as fully integrated content. No "R1 §X" or "as proposed in R1" references appear in V2.0; the content stands on its own.

---

## Document Organization

DOC25 V2.0 has 27 numbered sections (§1-§27) preceded by a reading guide (§0), grouped into themes. The themes don't appear as headers in the document (each section is numbered and standalone), but the thematic grouping helps readers navigate.

**Foundational concepts** (§0-§2, §5, §7). What documents are, how they're classified, what intelligence is pre-computed, how non-PDF formats are handled. Read first.

**PDF context management for LLM conversations** (§3, §4, §6, §22). The original V1.0 problem domain: how PDFs are tiered into LLM context, how prompt caching works, which models support what, how chat attachments work.

**Universal ingestion pipeline** (§11-§15). The orchestration, storage, dedup, state machine, and operational layer for every document entering ELNOR. This is the largest V2.0 addition.

**Tool integration** (§9, §10). OCR pipeline architecture and MarkItDown as the universal extraction backend.

**Runtime LLM tools** (§8, §16). The tool surface available to LLMs and specialist sub-agents at runtime: document escalation (`retrieve_document_pages`, `retrieve_full_document`), memory query tools, retrieval posture directive.

**Consumer contract** (§17). The `DOC25_IngestionResult` schema. The single most important section for any consumer building against DOC25.

**Content addressability and dedup** (§12, §13). How originals and derived artifacts are stored and how cross-surface deduplication works.

**Marker scheme** (§18). The format for content injected into LLM prompts.

**UI and surface integration** (§19-§23). Settings surface, agent conversation context manager, workflow optimizations, chat attachment handling, Files API integration.

**Operations** (§24-§27). Performance metrics, cross-document obligations, open questions, closing note.

### Source mapping

The sections draw from V1.0, the R1 proposal, the planning notes, and the DOC73 V1.4.1 §15 ingestion contract content. The mapping:

| Section | Sources |
|---|---|
| §0 How to read | new in V2.0 |
| §1 Overview and scope | V1.0 §1 + V2.0 expansion |
| §2 Document Type Classification | V1.0 §2 + V2.0 additions (presentation, audio, scanned-page metadata, classification confidence) |
| §3 Tiered Context System for PDFs | V1.0 §3 + V2.0 escalation integration from R1 §2.5 |
| §4 Prompt Caching Integration | V1.0 §4 + V2.0 multi-model handling |
| §5 Pre-Computed Document Intelligence | V1.0 §5 + V2.0 expanded DocumentEntity |
| §6 Model-Specific Routing | V1.0 §6, capabilities matrix updated |
| §7 Non-PDF Document Handling | V1.0 §7 + V2.0 presentation/audio/scanned-PDF additions |
| §8 LLM Document Escalation Tool | R1 §2 + new `retrieve_full_document` and memory-to-source wrapper |
| §9 OCR Pipeline Architecture | R1 §3 |
| §10 MarkItDown Universal Extraction Backend | R1 §4 + DOC73 V1.4.1 §15.3 routing detail |
| §11 Universal Ingestion Orchestration | Planning notes §1-3 |
| §12 Content-Addressable Storage Model | Planning notes §4 + DOC73 V1.4.1 §15.4 (multi-hash, versioned artifacts) |
| §13 Cross-Surface Deduplication | Planning notes §5 |
| §14 Pipeline State Machine | Planning notes §6-7 + DOC73 V1.4.1 §15.5 |
| §15 Tool Health, Failure Handling, Status | Planning notes §7, §9 + DOC73 V1.4.1 §15.6 (E5, E6, E7, E18, E19) + §15.4 (E1) |
| §16 Runtime Retrieval Tools and Re-Read Posture | Planning notes §10, §11, §12, §13 (memory query tools moved here from §18 — see note below) |
| §17 DOC25_IngestionResult Consumer Contract | DOC73 V1.4.1 §15.2 (DOC25 owns authoritatively in V2.0; DOC73 V1.5 will reference) |
| §18 Marker Scheme for Injected Content | Planning notes §14 |
| §19 Frontend UI and Settings | V1.0 §8 + R1 §5 |
| §20 Agent Conversation Context Manager | V1.0 §9 |
| §21 Workflow-Specific Optimizations | V1.0 §10 |
| §22 Chat Attachment Handling | V1.0 §11 |
| §23 Files API Integration | V1.0 §12 |
| §24 Performance Metrics and Monitoring | V1.0 §13 |
| §25 Cross-Document Obligations | new (synthesized from all sources) |
| §26 Open Questions | V1.0 §15 (triaged) + planning notes open questions (resolved) + new V2.0 questions |
| §27 Closing Note | new |

**Note on §16 vs §18 grouping.** The planning notes grouped memory query tools (planning notes §13) with marker scheme (planning notes §14). V2.0 separates them: memory query tools join the document retrieval tools in §16 (Runtime Retrieval Tools and Re-Read Posture) because both tool families share the same shape (LLM-callable, scope-defaulting, result-limited, both consumed by specialist sub-agents) and the retrieval-posture directive in §16 already discusses memory-vs-source tradeoffs. §18 is left to the marker scheme alone, which is a prompt-formatting concern of a different kind.

### Reading paths

**For a consumer-spec author** (e.g., a DOC73 implementer or someone integrating a new ingestion surface): §1 (scope), §17 (consumer contract — read carefully), §11 (orchestration), §14 (state machine), §15 (failure modes). After those, §10 (extraction tools), §12 (storage model), §13 (dedup).

**For a code agent** building any part of this: §1, §17, §14, §11, §12 in that order. §10 and §15 next. The schemas in §17, §12, and §14 are normative; conform to them exactly.

**For an integration reviewer** doing red-team or cross-doc analysis: §1, §25 (cross-doc obligations), §17, §11. §16 and §18 for the prompt/tool surface. §26 for known gaps.

---

## §0 How to Read This Document

### 0.1 Normative language

DOC25 V2.0 uses RFC 2119 conventions. **MUST**, **MUST NOT**, and **REQUIRED** indicate absolute requirements; **SHOULD** and **SHOULD NOT** indicate strong recommendations with deviation justified by stated reasons; **MAY** indicates optionality.

### 0.2 Versioning

This is V2.0. Future versions get new filenames (V2.1, V3.0, etc.); V2.0 is never overwritten. Sections that materially change get a section-level version bump noted in the section header. Schemas in §17, §12, and §14 are versioned independently, and breaking changes to them propagate to consumers (§0.4 defines what counts as breaking for §17).

### 0.3 Scope-of-ownership statements

Every cross-cutting concern in this document carries an explicit ownership statement. "DOC25 owns X" means DOC25 is authoritative on X; other docs reference and conform. "DOC25 consumes X (owned by DOC72/DOC24/etc.)" means DOC25 is a downstream consumer and conforms to the upstream owner's spec. When a section asserts ownership, the corresponding obligation appears in §25 (cross-document obligations).

### 0.4 Contract semantics

Section §17 is the authoritative `DOC25_IngestionResult` consumer contract. It has the same status as a public API: consumers depend on the field shapes and semantics defined there. Adding fields is non-breaking; changing or removing fields is breaking and triggers a contract version bump. DOC73 V1.4.1 §15.2 currently inlines this schema as a freeze artifact; the field-for-field identical version lives in §17 of this document, and DOC73 V1.5 will swap to a normative reference.
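The compatibility rule above (adding fields is non-breaking; changing or removing fields is breaking) can be checked mechanically. The sketch below is illustrative and assumes a deliberately simplified schema representation: `FieldMap`, `ContractChange`, and `compareContracts` are hypothetical names, and a real check would compare the actual §17 types rather than type-name strings.

```typescript
// Field name -> type name: a simplified stand-in for one schema version (assumption).
type FieldMap = Record<string, string>;

type ContractChange = "identical" | "non_breaking" | "breaking";

function hasField(fields: FieldMap, field: string): boolean {
  return Object.prototype.hasOwnProperty.call(fields, field);
}

function compareContracts(previous: FieldMap, next: FieldMap): ContractChange {
  for (const field of Object.keys(previous)) {
    if (!hasField(next, field)) return "breaking";            // field removed
    if (next[field] !== previous[field]) return "breaking";   // field type changed
  }
  // Every previous field survives unchanged; any extra field is an addition.
  const added = Object.keys(next).some((field) => !hasField(previous, field));
  return added ? "non_breaking" : "identical";
}
```

Under this sketch, the requirement that DOC25 §17 and DOC73 V1.4.1 §15.2 stay field-for-field identical corresponds to `compareContracts` returning `"identical"` in both directions.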

### 0.5 What DOC25 does not own

DOC25 does not own corpus-specific extraction rules (DOC73 territory), the entity graph schema or node taxonomy (DOC72), capability registry (DOC24), durable queue execution (EC Core), prompt-composition governance (DOC15 + DOC24), or the content of skills or memories (DOC1, DOC3). When this document references those concerns, it is conforming to upstream specs.

---

## §1 Overview and Scope

### 1.1 Purpose

DOC25 V2.0 solves two related problems: how to handle document content efficiently when communicating with LLM agents, and how to orchestrate the universal ingestion of every document that enters ELNOR through any surface.

The first problem — the original V1.0 problem — is one of token economics. Documents, particularly PDFs, are expensive to send to LLMs in tokens, cost, and latency. A 100-page complaint sent as a PDF to Claude costs approximately 300K tokens. In a 10-turn red-teaming session, naively re-sending the document every turn costs approximately 3M tokens. The V1.0 solution is a tiered context management system that automatically selects the optimal method for including document content in agent conversations, balancing comprehension quality against cost and latency. V2.0 preserves this system with refinements.
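The arithmetic behind these figures is simple enough to state directly. The sketch below is illustrative: the per-page constant is derived from the 300K-tokens-per-100-pages figure above and is an approximation, and the function names are hypothetical. The savings fractions used in the usage note are the 80-90% range cited in §1.3, principle 3.

```typescript
// Approximate average implied by the text: 300K tokens / 100 pages.
const TOKENS_PER_PDF_PAGE = 3000;

// Cost of re-sending the full native PDF on every turn.
function naiveSessionTokens(pageCount: number, turns: number): number {
  return pageCount * TOKENS_PER_PDF_PAGE * turns;
}

// Tokens remaining after a fractional saving (0.8 means 80% saved).
function tokensAfterSavings(naiveTokens: number, savingsFraction: number): number {
  return Math.round(naiveTokens * (1 - savingsFraction));
}
```

For the 100-page, 10-turn example, `naiveSessionTokens(100, 10)` gives 3,000,000 tokens; at 80% and 90% savings that falls to roughly 600,000 and 300,000 tokens respectively.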

The second problem — the V2.0 expansion — is one of orchestration. ELNOR ingests documents from many surfaces: corpus ingestion for deep knowledge work, email attachments arriving via M365, Q Browser captures during browsing sessions, chat-attached documents in red-team rooms, demonstration captures from DOC3, DOC23 task outputs, autonomous gathering pipelines (EDGAR pulls, PACER dockets, web scrapes), and direct file uploads. Without a universal ingestion layer, each surface re-implements conversion, storage, deduplication, and quality-tracking; the same document encountered through multiple surfaces gets re-processed; quality degrades with no visibility; failure modes proliferate. V2.0 unifies all of this under DOC25.

These two problems share infrastructure. The pre-computed document intelligence that V1.0 caches per document (extracted text, structured summary, page summaries, classification) is exactly the artifact set that universal ingestion needs to produce once per document and reuse across surfaces. Tiering, caching, and runtime retrieval all read from this artifact set. V2.0's universal ingestion produces the artifacts; V1.0's tiered context management consumes them.

### 1.2 Scope

DOC25 V2.0 covers:

- **PDF document handling** — primary focus for tiered LLM context management. Tier 1 (full PDF native), Tier 2 (cached PDF or extracted text), Tier 3 (summary plus targeted excerpts). Automatic tier selection. Prompt caching integration. Deposition special case. LLM-initiated escalation via `retrieve_document_pages`.
- **Non-PDF document handling** — Word, plain text, spreadsheets, images, presentations (PPTX), audio (transcribed), HTML. Format-specific extraction routed through MarkItDown as the universal backend.
- **Universal ingestion orchestration** — single pipeline for all surfaces. Per-step concurrency pools managed by EC's background orchestrator. Status tracking, retry, fallback, dynamic pool sizing.
- **Document conversion routing** — Docling vs MarkItDown profile-routed hybrid; OCR cascade (MarkItDown OCR plugin primary, Tesseract.js browser fallback, Azure Document Intelligence opt-in cloud upgrade); NuExtract for literal+schema targets; GLiNER opt-in for entity-rich profiles; Firecrawl for web fetch fallback; LlamaCloud policy-gated.
- **Content-addressable storage and cross-surface deduplication** — originals stay where the user put them; derived artifacts stored under content-hash keys; multi-hash addressing for different dedup questions; original_paths tracked as an array for same-content-multiple-locations.
- **Pipeline state machine and hard-fail reason codes** — strict per-document state progression; atomic transitions; bounded retries; explicit reason-code vocabulary for hard failures; no quiet retry cycling.
- **Tool health, capacity leases, resource controls** — pre-dispatch health checks; tool capacity leases that distinguish transient tool failure from document failure; storage exhaustion circuit breaker with reserved disk floor; thermal and memory throttle for sustained workloads on M4 Pro hardware.
- **Quality reporting** — `IngestionQualityReport` per document, surfaced in document UI, corpus dashboard, and Runs tab. Single mechanism turning silent degradation into visible degradation.
- **Versioned immutable ingestion artifacts** — reconversion creates a new conversion artifact version; re-extraction creates a new extraction run. Old artifacts preserved. Enables rollback, audit trail, and recovery from misconfiguration.
- **Pre-computed document intelligence** — text extraction, classification, structured summary, page summaries, computed at document open time, cached, refreshed on content change. Consumed by tiering, retrieval, search, and downstream extraction.
- **Runtime retrieval tools** — `retrieve_document_pages`, `retrieve_full_document`, memory-to-source resolution wrapper, memory query tool family.
- **Marker scheme for injected content** — consistent format for `extracted_memory`, `corpus_context`, `tool_schema`, and other injected content tags, enabling programmatic reference and verification.
- **Frontend UI and settings** — Document Intelligence settings panel, in-viewer OCR controls, document badges, cost indicators, cross-corpus document browser.
- **Files API integration** — when to upload to Claude Files API vs send as base64; file_id management; invalidation.
- **Performance metrics** — token usage, cache hit rates, tier distribution, classification accuracy, ingestion timing.
- **Consumer contract** — `DOC25_IngestionResult` schema for downstream consumers (DOC73 corpus extraction, DOC72 intake, etc.).

DOC25 V2.0 does NOT cover:

- Corpus-specific extraction rules. DOC73 owns those. DOC25 supplies the universal artifacts; DOC73 runs corpus extraction over them.
- Entity graph node taxonomy or schema. DOC72 owns those. DOC25 writes a `document` entity record per its node-shape contract; DOC72 governs the shape.
- Capability registry. DOC24 owns the registry; DOC25's tools register through it.
- Durable queue execution and sole-writer invariant. EC Core owns those.
- Prompt-composition governance. DOC15 + DOC24 own assembly of the full prompt envelope; DOC25 supplies content (injected memories, corpus context, retrieval results) in conformance with the agreed marker scheme.
- The content of skills, memories, or domain profiles. DOC1, DOC3, DOC72 own those.
- Extraction quality of LLM-produced summaries. The summarization model is configurable; DOC25 specifies that a summary is produced and stored, not the prompt that produced it.

### 1.3 Design principles

DOC25 V2.0 has eight design principles. The first five are inherited from V1.0; the last three are added in V2.0 to cover the universal ingestion expansion.

1. **The user should never have to think about tiering or routing.** Tier selection is automatic. Tool routing is automatic. The system makes the right choice based on conversation state, document type, model capabilities, and corpus profile. Manual overrides are available for power users but not required.

2. **First interaction is perfect.** The model sees the full document exactly as the user sees it on the first pass. No degraded experience on the first turn. Tiering reduces cost on subsequent turns, never on the first.

3. **Subsequent interactions are efficient.** The system automatically reduces token usage on follow-up turns without sacrificing comprehension quality. Prompt caching and tiering work together to deliver typical 80-90% savings on multi-turn document sessions.

4. **Visual understanding is preserved when it matters.** For charts, tables, exhibits, and signatures, the system knows when these are important (`visualImportance` high) and when text-only is sufficient (`visualImportance` low; deposition special case). Tier choice respects visual importance.

5. **Pre-computation happens invisibly.** Document intelligence is extracted in the background when documents are opened, not when the user asks a question. Steps 3-4 of the extraction pipeline (structured summary, page summaries) run async; the system works without them, just less efficiently.

6. **Single ingestion path across all surfaces.** Every document entering ELNOR routes through one pipeline, one tool roster, one storage model, one status-tracking system. Surface-specific work (corpus extraction, email triage, browser display) happens after universal ingestion completes. This is the V2.0 expansion's load-bearing principle.

7. **Content-addressable storage; originals never modified.** Original files stay where the user put them — OneDrive, local filesystem, email server, etc. ELNOR never copies originals into its own directories or modifies them. Derived artifacts (converted markdown, summaries, chunks, entity tags) are stored in ELNOR's private document store keyed by content hash. This preserves user portability and enables exact dedup.

8. **Truthful degraded states; no silent quality loss.** Every conversion produces an `IngestionQualityReport`. Quality issues surface in document UI, corpus dashboard, and Runs tab. A document with garbled tables shows a "tables degraded" badge, not a successful-looking conversion. Hard failures show explicit reason codes, not generic errors. The user always knows what's working, what's degraded, and what's failed.

These principles compose. Principles 1-5 govern the LLM-context experience; principles 6-8 govern the ingestion experience. Together they describe a system where the user trusts the document layer to do the right thing automatically, but can always see exactly what it did and override when needed.
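Principles 2 through 4 imply one invariant that any tier-selection policy has to satisfy: the first turn is always Tier 1. The sketch below illustrates only that invariant. The rule it applies to later turns is a placeholder assumption, not the §3 selection logic, and `sketchTierForTurn` is a hypothetical name.

```typescript
type Tier = 1 | 2 | 3; // 1: full native PDF, 2: cached PDF or extracted text, 3: summary plus excerpts
type VisualImportance = "high" | "medium" | "low";

function sketchTierForTurn(turnIndex: number, visualImportance: VisualImportance): Tier {
  // Principle 2: the first pass is never degraded, whatever the document looks like.
  if (turnIndex === 0) return 1;
  // Placeholder policy for later turns (assumption; the real rules live in §3):
  // keep a richer tier when layout carries meaning (principle 4).
  return visualImportance === "high" ? 2 : 3;
}
```

Whatever replaces the placeholder branch, the `turnIndex === 0` case should stay unconditional for the policy to conform to principle 2.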

---

## §2 Document Type Classification

Every document in the system is classified into a category, which determines how it's handled. Classification is done once at document open or ingestion time, stored in the entity graph, and never recomputed unless the document content changes.

### 2.1 Document categories

| Category | File types | Typical size | Token cost (full) | Handling strategy |
|---|---|---|---|---|
| **PDF — Visual** | .pdf with exhibits, charts, signatures, multi-column layouts, complex tables | 10-200 pages | 3K-600K tokens | Tiered system (§3) |
| **PDF — Text-heavy** | .pdf depositions, correspondence, simple motions, plain-text PDFs | 10-500 pages | 3K-1.5M tokens | Text extraction preferred, tiered fallback |
| **PDF — Scanned** | .pdf with image-based pages, no text layer | Any | Depends on OCR cost | OCR pipeline (§9) then text-heavy |
| **Word** | .docx, .doc | 1-100 pages | 2K-50K tokens | Always send extracted text |
| **Plain text** | .md, .txt, .html | Any | 1K-30K tokens | Send as-is (HTML stripped to clean text) |
| **Spreadsheet** | .xlsx, .csv, .tsv | Any | 5K-50K tokens | Convert to markdown table or CSV |
| **Presentation** | .pptx, .ppt | 10-100 slides | 5K-30K tokens | Extract slide text + speaker notes via MarkItDown |
| **Image** | .png, .jpg, .jpeg, .tiff, .gif | Single page | ~1.5K tokens each | Send as image content block; LLM vision describes |
| **Audio** | .mp3, .wav, .m4a, .flac | Any duration | Variable | Transcribe via Whisper (MarkItDown audio plugin); send transcript as text |

V2.0 adds the Presentation, Audio, and PDF-Scanned categories explicitly. V1.0 implicitly handled scanned PDFs as PDF-Visual; V2.0 calls them out separately because their handling routes through OCR (§9) before tiering.

### 2.2 Automatic classification

When a document is first opened or ingested, the system classifies it by analyzing format-specific signals.

For PDFs, classification analyzes:

1. **Text density per page.** High text density (>500 chars/page average) with low variation suggests text-heavy (deposition, motion). Low text density with high variation suggests visual content (exhibits, charts, expert reports). Per-page text density is computed via `page.getTextContent()` from PDF.js.
2. **Page structure patterns.** Detection patterns include Q/A format with line numbers (deposition transcript), caption block on first page (court filing), uniform paragraphs (motion or brief), mixed content of text plus images plus tables (expert report or complaint with exhibits), and form fields (fillable form).
3. **Document metadata.** PDF title, author, creation date, page labels.
4. **First-page analysis.** Caption blocks, "EXHIBIT" labels, table-of-contents presence.
5. **Per-page scanned detection.** For each page, if `getTextContent()` returns no items or returns text density below 0.05 chars per unit area, the page is classified as scanned. If any page is scanned, the document is classified PDF-Scanned (or remains its primary classification with `hasScannedPages: true` set; see §2.3 schema).
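Signal 5 reduces to a small predicate over per-page text statistics. The sketch below is a minimal illustration that works on already-extracted counts rather than calling PDF.js. `PageTextStats` and the function names are hypothetical, and because the text above does not define the area unit behind the 0.05 threshold, `pageArea` is left abstract here.

```typescript
// Hypothetical per-page summary of what text extraction returned.
interface PageTextStats {
  pageNumber: number;
  itemCount: number; // number of text items returned for the page
  charCount: number; // total characters across those items
  pageArea: number;  // page area in whatever unit the threshold assumes (not defined here)
}

const SCANNED_DENSITY_THRESHOLD = 0.05; // chars per unit area, from signal 5

function textDensity(stats: PageTextStats): number {
  return stats.pageArea > 0 ? stats.charCount / stats.pageArea : 0;
}

// A page is scanned if extraction returned nothing or the text is too sparse.
function isScannedPage(stats: PageTextStats): boolean {
  return stats.itemCount === 0 || textDensity(stats) < SCANNED_DENSITY_THRESHOLD;
}

// Document-level flag: true as soon as any single page is scanned.
function anyPageScanned(pages: PageTextStats[]): boolean {
  return pages.some(isScannedPage);
}
```

The document-level result corresponds to the `hasScannedPages` flag, and the per-page results to `pageMetadata[].isScanned` in the §2.3 schema.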

For non-PDF formats, classification is by file extension and MIME type. Sub-classification (e.g., `documentType: "contract"` for a Word doc) is best-effort based on filename patterns and first-page content but is not strictly required.
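For non-PDF formats the extension half of this lookup follows directly from the §2.1 table. The sketch below is illustrative: it covers file extensions only (MIME-type checks are omitted), and `EXTENSION_CATEGORIES`, `categoryForFilename`, and the `needs_pdf_analysis` and `unknown` return values are hypothetical names.

```typescript
type NonPdfCategory =
  | "word" | "plaintext" | "spreadsheet" | "presentation" | "image" | "audio";

// Extension -> category, transcribed from the §2.1 table.
const EXTENSION_CATEGORIES: Record<string, NonPdfCategory> = {
  ".docx": "word", ".doc": "word",
  ".md": "plaintext", ".txt": "plaintext", ".html": "plaintext",
  ".xlsx": "spreadsheet", ".csv": "spreadsheet", ".tsv": "spreadsheet",
  ".pptx": "presentation", ".ppt": "presentation",
  ".png": "image", ".jpg": "image", ".jpeg": "image", ".tiff": "image", ".gif": "image",
  ".mp3": "audio", ".wav": "audio", ".m4a": "audio", ".flac": "audio",
};

function categoryForFilename(
  filename: string,
): NonPdfCategory | "needs_pdf_analysis" | "unknown" {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return "unknown";
  const ext = filename.slice(dot).toLowerCase();
  // PDFs cannot be categorized by extension alone; they need the signal analysis above.
  if (ext === ".pdf") return "needs_pdf_analysis";
  const hit = EXTENSION_CATEGORIES[ext];
  return hit !== undefined ? hit : "unknown";
}
```

Note that `.html` maps to `plaintext`, matching the table's handling of HTML as text stripped to clean content.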

V2.0 adds a `classificationConfidence` score on the output (0-1). Classification with confidence below 0.5 falls back to the most general category for the file type (e.g., `pdf_text` rather than a specific document type). Low-confidence classifications surface as "unclassified" in the UI with an option for the user to set the type manually.
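The confidence fallback can be sketched as a pure function. This is an illustration under assumptions, not the specified behavior in full: `TypeGuess`, `userSetType`, `effectiveDocumentType`, and `uiTypeLabel` are hypothetical names, and mapping a low-confidence result to the `'unknown'` document type is one plausible reading of the fallback described above.

```typescript
const CONFIDENCE_FLOOR = 0.5; // threshold stated in the text above

interface TypeGuess {
  documentType: string;             // e.g. 'deposition'
  classificationConfidence: number; // 0.0-1.0
  userSetType?: string;             // hypothetical field holding a manual override
}

function effectiveDocumentType(guess: TypeGuess): string {
  // A manually set type always wins over the automatic guess.
  if (guess.userSetType !== undefined) return guess.userSetType;
  // Strictly below the floor falls back; exactly 0.5 is kept.
  return guess.classificationConfidence < CONFIDENCE_FLOOR ? "unknown" : guess.documentType;
}

function uiTypeLabel(guess: TypeGuess): string {
  const resolved = effectiveDocumentType(guess);
  return resolved === "unknown" ? "unclassified" : resolved;
}
```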

Classification runs locally and is fast (under 50ms for typical documents). It is deterministic — same content produces same classification — except for the per-page scanned detection which depends on PDF.js text extraction (occasionally non-deterministic on edge-case PDFs).

### 2.3 Classification output

```typescript
interface DocumentClassification {
  category:
    | 'pdf_visual'
    | 'pdf_text'
    | 'pdf_scanned'
    | 'word'
    | 'plaintext'
    | 'spreadsheet'
    | 'presentation'
    | 'image'
    | 'audio';

  documentType: string;
  // 'complaint', 'deposition', 'expert_report', 'motion', 'contract',
  // 'correspondence', 'sec_filing', 'earnings_call_transcript',
  // 'manual', 'paper', 'unknown' — this list grows as domain profiles register types

  visualImportance: 'high' | 'medium' | 'low';
  // whether visual layout matters for understanding

  classificationConfidence: number;
  // 0.0-1.0; classifications below 0.5 surface as 'unclassified' in UI

  estimatedTokens: {
    fullPdf: number;      // cost to send as native PDF
    textOnly: number;     // cost to send extracted text
    summary: number;      // cost to send structured summary only
  };

  pageCount: number;

  // Per-page metadata
  pageMetadata: Array<{
    pageNumber: number;
    isScanned: boolean;
    textDensity: number;  // chars per unit area
    hasOcrCache: boolean; // OCR result cached for this page
  }>;

  // Document-level flags
  hasExhibits: boolean;
  hasCharts: boolean;
  hasTables: boolean;
  hasSignatures: boolean;
  hasFormFields: boolean;
  hasScannedPages: boolean;

  // For audio
  audioDuration?: number;  // seconds
  audioLanguage?: string;
}
```

V2.0 expands V1.0's `DocumentClassification` with `pdf_scanned`, `presentation`, and `audio` categories; the `classificationConfidence` field; the per-page `pageMetadata` array; the `hasScannedPages` flag; and audio-specific fields.

Classification is stored in the entity graph as part of the DocumentEntity record (§5.2) and is not recomputed unless the document content changes (detected via `contentAddressableKeys.rawFileHash`; see §5.3 cache invalidation).
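
As a first approximation, the `estimatedTokens` field can be derived from page count alone. The sketch below uses the per-page ranges this spec quotes in §3.1 and reports the midpoint of each range; the midpoint choice and the names `PER_PAGE_TOKENS`, `estimateRange`, and `estimateTokens` are assumptions for illustration. Image tokens for the native-PDF path are not modeled.

```typescript
interface TokenRange {
  low: number;
  high: number;
}

// Per-page text-token ranges as quoted in §3.1.
const PER_PAGE_TOKENS = {
  fullPdf: { low: 1500, high: 3000 },
  textOnly: { low: 500, high: 1000 },
  summary: { low: 50, high: 200 },
} as const;

function estimateRange(pageCount: number, perPage: TokenRange): TokenRange {
  return { low: pageCount * perPage.low, high: pageCount * perPage.high };
}

// Midpoint of each range, shaped like DocumentClassification.estimatedTokens.
function estimateTokens(pageCount: number) {
  const midpoint = (r: TokenRange) => Math.round((r.low + r.high) / 2);
  return {
    fullPdf: midpoint(estimateRange(pageCount, PER_PAGE_TOKENS.fullPdf)),
    textOnly: midpoint(estimateRange(pageCount, PER_PAGE_TOKENS.textOnly)),
    summary: midpoint(estimateRange(pageCount, PER_PAGE_TOKENS.summary)),
  };
}
```

A production estimator would more plausibly count tokens from the extracted text (`textTokenCount` in §5.2) once step 1 of the pipeline has run.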

---

## §3 Tiered Context System (PDF Documents)

The tiered context system is the core of how PDFs are sent to LLMs. Three tiers balance comprehension quality against token cost.

### 3.1 Tier definitions

#### Tier 1 — Full PDF (native)

**What's sent:** the raw PDF file as a base64-encoded `document` content block, or a Files API `file_id` reference if the document has been uploaded.

**Token cost:** approximately 1,500-3,000 tokens per page (text) plus image tokens per page. A 50-page document is approximately 75K-150K tokens.

**When used:**
- First time the model encounters this document in a conversation.
- User explicitly requests "send full document."
- Document has high visual importance (exhibits, charts, complex tables).
- Document is short enough that cost is negligible (under 10 pages).

**API format (Claude):**

```json
{
  "type": "document",
  "source": {
    "type": "base64",
    "media_type": "application/pdf",
    "data": "<base64-encoded-pdf>"
  },
  "cache_control": { "type": "ephemeral" }
}
```

**Note:** always include `cache_control: ephemeral` on Tier 1 sends. This enables prompt caching for subsequent turns. See §4 for caching mechanics.

#### Tier 2 — Cached PDF or text reference

**What's sent:** the prompt-cached PDF reference (if the cache is still warm in the same session), OR the pre-computed structured summary plus full extracted text.

**Token cost:**
- With prompt caching: same as Tier 1 but at ~10% cost on subsequent turns.
- Without caching (text-only path): approximately 500-1,000 tokens per page (text only) plus summary overhead.

**When used:**
- Turns 2 and later in a conversation where the document was sent as Tier 1 on turn 1 (prompt-cached path).
- The model has already seen the visual layout and now just needs text references.
- Cross-conversation reuse where the document was analyzed in a prior conversation.
- Document classification is text-heavy (low visual importance) — Tier 2 used from turn 1 in this case.

**Content structure (text-only path, without caching):**

```
[Document: {title}]
Type: {documentType} | Pages: {pageCount} | Classification: {category}

## Structured Summary
{AI-generated summary with key facts, parties, dates, legal standards, monetary amounts, case numbers}

## Full Text
### Page 1
{extracted text, MarkItDown structured markdown preserving headings, tables, lists}

### Page 2
{extracted text}
...
```

V2.0 specifies that Tier 2 text content uses MarkItDown structured markdown, not raw text blobs. The LLM receives a document that reads like a document — headings, tables, and lists preserved — not a wall of text. See §10 for MarkItDown details.
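
The Tier 2 text-only structure above could be assembled roughly as follows. This is a sketch, not the spec's `formatStructuredText`: the `Tier2Input` shape is hypothetical, and the per-page markdown is assumed to have been produced already by the conversion step.

```typescript
interface Tier2Input {
  title: string;
  documentType: string;
  category: string;
  structuredSummary: string;
  pagesMarkdown: string[]; // one markdown string per page; index 0 is page 1
}

// Builds the header, summary, and per-page sections of the Tier 2 text payload.
function formatStructuredText(doc: Tier2Input): string {
  const header =
    `[Document: ${doc.title}]\n` +
    `Type: ${doc.documentType} | Pages: ${doc.pagesMarkdown.length} | Classification: ${doc.category}`;
  const pages = doc.pagesMarkdown.map((md, i) => `### Page ${i + 1}\n${md}`);
  return [
    header,
    `## Structured Summary\n${doc.structuredSummary}`,
    `## Full Text\n${pages.join("\n\n")}`,
  ].join("\n\n");
}
```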

#### Tier 3 — Summary plus targeted excerpts

**What's sent:** the document's structured summary plus specific pages or sections relevant to the current question.

**Token cost:** approximately 50-200 tokens per page (summary only) plus targeted page text. A reference to a 100-page document might cost only 2K-5K tokens.

**When used:**
- Document is one of several in the conversation (context window pressure).
- User's question references a specific section or page.
- Document is being referenced tangentially, not analyzed deeply.
- Long conversations where context is tight.

**Content structure:**

```
[Document: {title}]
Type: {documentType} | Pages: {pageCount}
Document ID: {document_id}  -- for retrieve_document_pages tool calls

## Summary
{structured summary}

## Relevant Excerpts
### Page {N}: {page summary}
{extracted text for relevant pages only}

### Page {M}: {page summary}
{extracted text}

[Note: full document available via retrieve_document_pages(document_id="{document_id}", pages=[...]) — see §8]
```

V2.0 adds the `Document ID` line and the `retrieve_document_pages` note. When a Tier 3 summary is sent, the document_id allows the LLM to escalate via the runtime tool (§8) for additional pages without the system having to pre-decide which pages to include.
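
A sketch of how the Tier 3 payload might be built from pre-computed summaries and a list of relevant pages. The `Tier3Input` shape is hypothetical, and the sketch assumes the `-- for retrieve_document_pages tool calls` and `see §8` fragments in the template are spec annotations rather than text sent to the model.

```typescript
interface Tier3Input {
  documentId: string;
  title: string;
  documentType: string;
  pageCount: number;
  structuredSummary: string;
  pageSummaries: string[]; // index 0 is page 1
  pageTexts: string[];     // index 0 is page 1
}

// Emits summary plus excerpts for the requested pages (deduplicated, in page order).
function formatTargetedExcerpts(doc: Tier3Input, relevantPages: number[]): string {
  const excerpts = relevantPages
    .filter((p, i, all) => all.indexOf(p) === i)
    .filter((p) => Math.floor(p) === p && p >= 1 && p <= doc.pageCount)
    .sort((a, b) => a - b)
    .map((p) => `### Page ${p}: ${doc.pageSummaries[p - 1] ?? ""}\n${doc.pageTexts[p - 1] ?? ""}`);
  return [
    `[Document: ${doc.title}]\nType: ${doc.documentType} | Pages: ${doc.pageCount}\nDocument ID: ${doc.documentId}`,
    `## Summary\n${doc.structuredSummary}`,
    `## Relevant Excerpts\n${excerpts.join("\n\n")}`,
    `[Note: full document available via retrieve_document_pages(document_id="${doc.documentId}", pages=[...])]`,
  ].join("\n\n");
}
```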

### 3.2 Automatic tier selection algorithm

```
function selectTier(document, conversation, currentMessage):

  // Rule 1: If document has never been sent in this conversation
  if not conversation.hasSeenDocument(document.id):

    // Sub-rule: If document is text-heavy (deposition, simple motion),
    // use text from turn 1 (see §3.1 Tier 2 "When used" and §3.3)
    if document.classification.visualImportance == 'low':
      return TIER_2_TEXT_ONLY

    // Sub-rule: If model supports prompt caching AND document is within page limit
    if model.supportsPromptCaching AND document.pageCount <= 600:
      return TIER_1_WITH_CACHE

    // Sub-rule: If document is small
    if document.pageCount <= 10:
      return TIER_1

    // Sub-rule: Large visual document, no caching available
    return TIER_1  // worth the cost for first pass

  // Rule 2: Document has been sent before in this conversation

  // Sub-rule: Prompt cache is still warm (same session, within TTL)
  if conversation.promptCacheActive(document.id):
    return TIER_1_CACHED  // ~90% cost reduction, zero re-processing latency

  // Sub-rule: User's question references specific pages or sections
  if currentMessage.referencesSpecificPages:
    return TIER_3_TARGETED

  // Sub-rule: Multiple documents in context (window pressure)
  if conversation.documentCount > 2:
    return TIER_3_TARGETED

  // Default for subsequent turns
  return TIER_2

  // Rule 3 (V2.0): LLM-initiated escalation
  // When the LLM calls retrieve_document_pages on a document at Tier 3:
  //   1. The tool returns the requested pages as tool output (§8)
  //   2. conversation.seenPages[document.id] is updated with the returned pages
  //   3. On subsequent turns, retrieved pages are included alongside the Tier 3 summary
  //   4. If total retrieved pages exceed 50% of the document, the system
  //      auto-escalates the document to TIER_2 for the rest of the conversation
  // This rule supplements Rule 2; it doesn't replace it.
```

V2.0 adds Rule 3 (LLM-initiated escalation). When the LLM working with a Tier 3 summary needs more detail, it calls `retrieve_document_pages` rather than the system having to guess which pages to pre-include. The tool returns the requested pages; the conversation state tracks what's been retrieved; if retrieval crosses 50% of the document, the system gives up on Tier 3 and promotes to Tier 2 to avoid repeated piecemeal retrievals.
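
Rule 3's bookkeeping can be sketched as a pure state update. The names `Tier3DocState` and `recordRetrievedPages` are hypothetical; the sketch reads "exceed 50%" as strictly more than half of the document's pages.

```typescript
interface Tier3DocState {
  pageCount: number;
  tier: 1 | 2 | 3;
  retrievedPages: number[]; // distinct page numbers, ascending
}

const ESCALATION_FRACTION = 0.5; // Rule 3, step 4

// Merges newly retrieved pages and promotes Tier 3 to Tier 2 past the threshold.
function recordRetrievedPages(state: Tier3DocState, pages: number[]): Tier3DocState {
  const valid = pages.filter((p) => Math.floor(p) === p && p >= 1 && p <= state.pageCount);
  const merged = state.retrievedPages
    .concat(valid)
    .filter((p, i, all) => all.indexOf(p) === i)
    .sort((a, b) => a - b);
  const escalate = state.tier === 3 && merged.length > ESCALATION_FRACTION * state.pageCount;
  return { ...state, retrievedPages: merged, tier: escalate ? 2 : state.tier };
}
```

Duplicate and out-of-range page requests are ignored, so repeated retrieval of the same pages does not push a document over the threshold.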

### 3.3 Deposition transcript special case

Deposition transcripts are a specific optimization opportunity. They are often 200-500 pages, the format is highly structured (Q/A with line numbers), visual layout adds zero information value, and the text extracts losslessly.

**Rule:** if `documentType == 'deposition'`, always use Tier 2 (text-only) — even on the first pass. Text extraction for depositions is lossless; nothing in the visual layout adds information the model needs.

**Additional optimization for depositions:** instead of sending the full 500-page text on every turn, the system can:

1. Pre-index the transcript by topic and witness (done at ingestion time as part of pre-computed intelligence; see §5.1).
2. Send only the sections relevant to the current question (effectively Tier 3 from the start, with the LLM able to escalate via `retrieve_document_pages` for full sections).
3. Include a full table of contents (page-numbered topic index) so the model can request specific sections by reference.

The deposition optimization is a special case of Tier 3 with a domain-specific page index. Other domain profiles (e.g., SEC filings with their predictable section structure) can register similar pre-indexed handling.

---

## §4 Prompt Caching Integration

Prompt caching is the load-bearing mechanism for cost-efficient multi-turn document conversations. V2.0 expands V1.0's caching coverage with multi-model handling.

### 4.1 How prompt caching works

Claude's API supports caching content blocks across turns in the same session. The mechanics:

- **First send.** Full token cost. Content is cached server-side at the cache breakpoint.
- **Subsequent turns.** Cached content costs approximately 10% of original tokens. No re-processing latency on the cached portion.
- **Cache TTL.** 5 minutes from last use, extended with each use. Active conversations keep the cache warm.
- **Cache invalidation.** Any change to the cached content block invalidates the cache. Subsequent sends re-cache from scratch.

Other models with native PDF support handle caching differently or not at all. V2.0's caching strategy is model-aware (§4.5).

### 4.2 Implementation strategy

For single-document deep work — red-teaming, analysis, drafting — the caching pattern is:

```
Turn 1: Send PDF with cache_control: ephemeral
  → Model processes full document
  → Content cached server-side
  → Cost: 150K tokens (50-page document)

Turn 2: Send same PDF reference with cache_control
  → Cache hit: ~15K tokens (~90% savings)
  → Zero re-processing latency
  → Model has full document understanding

Turns 3-10: Same as Turn 2
  → ~15K tokens each

Total for 10-turn session: 150K + (9 × 15K) = 285K tokens
Without caching: 150K × 10 = 1,500K tokens
Savings: ~81%
```

Real-world savings vary based on cache hit rate (depends on conversation pacing — gaps over 5 minutes cause cache misses) and on whether the conversation includes other large content blocks competing for cache slots.
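
The arithmetic in the example above generalizes to a small helper. This sketch assumes every turn after the first is a cache hit and that a hit costs a fixed fraction of the full send; `sessionTokenCost` is a hypothetical name.

```typescript
// Compares a cached session against naive re-sending, in tokens.
function sessionTokenCost(fullSendTokens: number, turns: number, cachedCostFraction: number) {
  if (turns < 1) throw new RangeError("turns must be at least 1");
  const cachedTurnTokens = fullSendTokens * cachedCostFraction;
  const withCaching = fullSendTokens + (turns - 1) * cachedTurnTokens;
  const withoutCaching = fullSendTokens * turns;
  return {
    withCaching,
    withoutCaching,
    savingsFraction: 1 - withCaching / withoutCaching,
  };
}

// The 50-page example: 150K tokens on the first send, 10 turns, ~10% cost per cache hit.
const example = sessionTokenCost(150_000, 10, 0.1);
console.log(Math.round(example.withCaching), example.withoutCaching, Math.round(example.savingsFraction * 100));
```

Each cache miss (for example, after a gap longer than the TTL) replaces one cached turn with a full send, which is why the realized savings fall as conversation pacing slows.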

### 4.3 Cache management

The Agent Conversation Context Manager (§20) tracks per-conversation per-document cache state:

```typescript
interface ConversationDocumentState {
  documentId: string;
  firstSentAt: string;           // ISO timestamp
  tier: 1 | 2 | 3;               // current tier for this document in this conversation
  promptCacheActive: boolean;    // whether the cache breakpoint is currently warm
  lastCacheUse: string;          // ISO timestamp of last cached send
  cacheContentHash: string;      // hash of cached content for invalidation detection
  totalTokensSent: number;       // running total across all turns
  totalTokensSaved: number;      // running total of savings vs naive re-send
  retrievedPages: number[];      // pages retrieved via retrieve_document_pages (§8)
}
```

This state is held in EC's runtime state and persisted across conversation reloads.

### 4.4 Cache warming strategy

For workflows where the user is likely to ask multiple questions about a document, the cache warming sequence is:

1. **Document opened in viewer.** Pre-extract text and classify (background, no API call).
2. **User opens Ask panel.** UI shows "Ready to send" indicator with estimated cost.
3. **User sends first message.** Send full PDF with `cache_control: ephemeral`. UI shows "Document sent (cached for this session)."
4. **Subsequent messages.** Automatically use cached reference. UI shows "Using cached document."
5. **Cache expires (5 min idle).** If user sends another message, re-send with cache. UI shows "Re-cached document."
6. **User switches model mid-conversation.** Cache is model-specific; see §4.5.

The cache warming sequence is invisible to the user under normal conditions. UI indicators surface only when state changes (new cache, re-cache, expired cache).
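
A sketch of the warm-cache check behind steps 3-5, using the 5-minute TTL from §4.1. `CacheTiming`, `isCacheWarm`, and `sendIndicator` are hypothetical names; passing `null` stands for a document that has not yet been sent in the conversation.

```typescript
const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes from last use (§4.1)

interface CacheTiming {
  promptCacheActive: boolean;
  lastCacheUse: string; // ISO timestamp of the last cached send
}

function isCacheWarm(state: CacheTiming, nowMs: number): boolean {
  const lastUseMs = Date.parse(state.lastCacheUse);
  if (isNaN(lastUseMs)) return false; // unparseable timestamp: treat as cold
  return state.promptCacheActive && nowMs - lastUseMs < CACHE_TTL_MS;
}

// Returns the UI indicator text for the next send, per steps 3-5.
function sendIndicator(state: CacheTiming | null, nowMs: number): string {
  if (state === null) return "Document sent (cached for this session).";
  return isCacheWarm(state, nowMs) ? "Using cached document." : "Re-cached document.";
}
```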

### 4.5 Multi-model handling (V2.0 addition)

When a user switches models mid-conversation — for example, starts with Sonnet, switches to Opus for a complex question — the prompt cache is invalidated. The cached content block lives at the model level; switching models means re-caching from scratch.

V2.0 handles this with a model-switch detection step in the Context Manager:

```
On user message send:
  if conversation.lastModelUsed != currentModel AND
     any document in context has promptCacheActive == true:
    
    // Mark all caches as invalid for the new model
    for each document in conversation.documents:
      document.promptCacheActive = false
      document.cacheModel = null
    
    // Optional UI warning
    if user has not disabled this warning:
      show notice: "Switching from {lastModel} to {currentModel}.
                    Document cache will be re-built. Estimated added cost: {estimate}."
```

The warning is optional and dismissible. Power users who frequently switch models can disable it in settings (§19). The cache rebuild itself is automatic — no user action required beyond confirming the switch.

For workflows that intentionally route different turns to different models (e.g., Haiku for quick questions, Opus for complex analysis), the multi-model warning would fire on every model switch and become noise. The user can disable it; the rebuild still happens correctly. V2.1 may add a per-conversation "model routing mode" that suppresses warnings when intentional routing is configured.
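
The detection step above might look like this in TypeScript. This is a sketch under assumptions: the `CachedDocument` and `ConversationCaches` shapes are hypothetical, `cacheModel` is taken from the pseudocode rather than from the §4.3 interface, the cost estimate is left out of the notice because its calculation is not specified here, and updating `lastModelUsed` on every send is an assumption.

```typescript
interface CachedDocument {
  documentId: string;
  promptCacheActive: boolean;
  cacheModel: string | null;
}

interface ConversationCaches {
  lastModelUsed: string | null;
  documents: CachedDocument[];
}

// Invalidates per-document caches on a model switch and optionally builds a notice.
function applyModelSwitch(
  conv: ConversationCaches,
  currentModel: string,
  warningEnabled: boolean,
): { conversation: ConversationCaches; notice: string | null } {
  const switched = conv.lastModelUsed !== null && conv.lastModelUsed !== currentModel;
  const hadWarmCache = conv.documents.some((d) => d.promptCacheActive);
  if (!switched || !hadWarmCache) {
    return { conversation: { ...conv, lastModelUsed: currentModel }, notice: null };
  }
  const documents = conv.documents.map((d) => ({ ...d, promptCacheActive: false, cacheModel: null }));
  const notice = warningEnabled
    ? `Switching from ${conv.lastModelUsed} to ${currentModel}. Document cache will be re-built.`
    : null;
  return { conversation: { lastModelUsed: currentModel, documents }, notice };
}
```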

---

## §5 Pre-Computed Document Intelligence

Pre-computed document intelligence is the artifact set that universal ingestion produces and runtime systems consume. V2.0 expands V1.0's DocumentEntity to incorporate multi-hash content addressability, source_instance_id, ingestion_status, and versioned conversion artifacts.

### 5.1 Extraction pipeline

When a document is opened or ingested, the following extraction happens. Some steps are local and immediate; others run in the background and are optional in the sense that the system works without them, just less efficiently.

```
1. Text Extraction (immediate, local, no API call)
   └─ MarkItDown on the source file (or PDF.js getTextContent() as fallback for PDFs;
      see §10 for routing)
   └─ Store as rawTextByPage: string[] for paginated formats
   └─ Store as rawTextFull: string for non-paginated formats
   └─ ~0-200ms latency for typical documents

2. Classification (immediate, local, no API call)
   └─ Analyze text density, structure patterns, metadata, per-page scanned detection
   └─ Compute classificationConfidence
   └─ Store classification result
   └─ ~10-50ms latency

3. Structured Summary (background, one LLM call)
   └─ Send extracted text or markdown to a fast model (Haiku or Sonnet by default)
   └─ Prompt extracts: parties, dates, key facts, legal standards, monetary amounts,
      case numbers, plus a one-paragraph high-level summary
   └─ Store as structuredSummary: string
   └─ ~2-5 second latency, runs invisibly

4. Page-Level Summaries (background, one LLM call)
   └─ Send page texts in batch to a fast model
   └─ Prompt: "For each page, write a 1-sentence summary."
   └─ Store as pageSummaries: string[]
   └─ ~3-8 second latency, runs invisibly

5. Domain-specific indexing (background, conditional)
   └─ For depositions: Q/A topic indexing
   └─ For SEC filings: section-structure indexing
   └─ For court filings: caption/jurisdiction extraction
   └─ Profile-driven; not run for documents without a registered profile
   └─ ~1-3 second latency

6. Entity tagging (background, conditional, opt-in per corpus or surface)
   └─ GLiNER zero-shot NER for entity-rich profiles
   └─ Produces entity_observation records (NOT confirmed entities by default)
   └─ See §10.3 for promotion rules

7. Chunking (background)
   └─ Chunks computed per chunking_config_version
   └─ Indexed into DOC18 LlamaIndex sidecar
   └─ ~1-5 seconds depending on document size

8. Index writes to DOC72 (background)
   └─ document entity node written via DOC72 §20A intake
   └─ Linked entity nodes (people, organizations, dates) written if entity tagging ran
   └─ Edges created per intake contract

9. Processing log write (immediate after each step)
   └─ Per-step timestamps, outcomes, costs, errors
   └─ Stored in document store under processing_log.json
```

Steps 1-2 are instant and local. Steps 3-9 happen in the background. If the user asks a question about the document before steps 3-4 complete, the system uses Tier 1 (full PDF) instead of Tier 3 (which needs summaries). Tier 2 works as soon as step 1 completes.

Steps 5-8 are conditional on document type, corpus profile, and surface configuration. They are part of universal ingestion (§11) when the document enters via a corpus or other intake-aware surface; they may be skipped for direct chat attachments where the document is read once and not retained for retrieval.
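
The tier-availability rule described above (Tier 2 once step 1 finishes, Tier 3 once the summaries from steps 3-4 exist, Tier 1 otherwise) can be sketched as below. `availableTiers` and `resolveTier` are hypothetical names, and the sketch assumes the target model accepts native PDF so that Tier 1 is always a valid fallback.

```typescript
type Tier = 1 | 2 | 3;

// completedSteps holds the pipeline step numbers (1-9) that have finished.
function availableTiers(completedSteps: number[]): Tier[] {
  const done = (step: number) => completedSteps.indexOf(step) !== -1;
  const tiers: Tier[] = [1]; // the native PDF needs no pre-computation
  if (done(1)) tiers.push(2); // extracted text is ready
  if (done(1) && done(3) && done(4)) tiers.push(3); // summaries are ready
  return tiers;
}

// Falls back to Tier 1 when the requested tier's artifacts are not ready yet.
function resolveTier(requested: Tier, completedSteps: number[]): Tier {
  return availableTiers(completedSteps).indexOf(requested) !== -1 ? requested : 1;
}
```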

### 5.2 Entity graph document record (DocumentEntity)

Every document in the system gets a record in the DOC72 entity graph as a `world_entity` node with `entity_type: "document"`. The DocumentEntity payload extends V1.0's schema with V2.0's content addressability and versioning fields.

```typescript
interface DocumentEntity {
  // Identity
  id: string;                           // unique document_id (UUID)
  title: string;                        // filename or extracted title
  
  // Content addressability (V2.0; see §12)
  contentAddressableKeys: {
    rawFileHash: string;                // SHA-256 of original bytes
    normalizedBinaryHash: string;       // post-normalization (whitespace, line endings)
    normalizedTextHash: string;         // hash of converted markdown (for OCR-different copies)
    pageHashes: string[];               // per page, for partial-document dedup
    chunkHashes: string[];              // per chunk, for retrieval-level dedup
    sourceInstanceId: string;           // unique per ingestion event with policy context
  };
  
  // Source tracking (V2.0)
  originalPaths: Array<{
    path: string;                       // absolute path or URI
    locationType: 'onedrive' | 'local' | 'email_attachment' | 'browser' |
                  'chat_attachment' | 'task_output' | 'web_fetch' | 'edgar_pull' |
                  'pacer_docket' | 'demonstration_capture' | 'direct_upload';
    addedAt: string;                    // ISO timestamp
    isStale: boolean;                   // path no longer resolves
  }>;
  
  fileSize: number;                     // bytes
  mimeType: string;
  
  // Classification
  classification: DocumentClassification;  // see §2.3
  
  // Extracted Content
  rawTextByPage: string[];              // full text per page (paginated formats)
  rawTextFull: string;                  // concatenated full text
  textTokenCount: number;               // pre-computed token count
  
  // AI-generated intelligence (background, cached)
  structuredSummary: string | null;
  pageSummaries: string[] | null;
  summaryGeneratedAt: string | null;
  summaryModel: string | null;
  
  // Conversion artifacts (V2.0; versioned per E8)
  conversionArtifacts: Array<{
    version: string;                    // 'v1', 'v2', etc.
    tool: 'docling' | 'markitdown' | 'pdfjs' | 'mammoth' | 'ocr_markitdown' |
          'ocr_tesseract' | 'ocr_azure' | 'whisper';
    toolVersion: string;                // tool's own version
    config: Record<string, any>;        // tool config used for this conversion
    createdAt: string;
    convertedMarkdownPath: string;      // pointer to converted file in document store
    qualityReport: IngestionQualityReport;  // see §15.4
    isActive: boolean;                  // current authoritative conversion
  }>;
  
  activeConversionVersion: string;      // points into conversionArtifacts
  
  // Chunks (V2.0; versioned per Q4)
  chunks: Array<{
    chunkId: string;
    chunkingConfigVersion: string;
    spanStart: number;                  // character offset in convertedMarkdown
    spanEnd: number;
    pageNumber: number | null;
    embeddingId: string;                // pointer to DOC18 vector store entry
  }>;
  
  // Page anchor map (V2.0; for documents with page structure)
  pageAnchorMap: {
    pageToOffset: Record<number, number>;  // page_number → character_offset
    offsetToPage: Array<{                  // sorted array for binary search
      offset: number;
      page: number;
    }>;
  };
  
  // Ingestion state (V2.0; see §14)
  ingestionStatus:
    | 'registered'
    | 'hash_checked'
    | 'source_stored'
    | 'conversion_pending'
    | 'converted'
    | 'classification_failed'
    | 'classified'
    | 'summary_failed'
    | 'summarized'
    | 'indexed'
    | 'extraction_pending'
    | 'extracted'
    | 'verified'
    | 'written'
    | 'complete'
    | 'degraded_complete'
    | 'hard_failed';
  
  hardFailReasonCode: string | null;    // see §14.5.2 vocabulary
  
  // Files API reference
  apiFileId: string | null;             // Claude Files API file_id
  apiFileUploadedAt: string | null;
  
  // Conversation history
  conversationsSentTo: Record<string, ConversationDocumentState>;  // keyed by conversationId
  
  // Corpus memberships (V2.0)
  corpusMemberships: Array<{
    corpusId: string;
    addedAt: string;
    addedBy: string;
    extractionStatus: string;
    extractionHistory: Array<{
      runId: string;
      specVersion: string;
      completedAt: string;
      memoryCount: number;
    }>;
  }>;
  
  // Timestamps
  firstOpenedAt: string;
  firstIngestedAt: string;
  lastAccessedAt: string;
  lastReingestedAt: string | null;
  extractedAt: string;
  
  // User annotations
  bookmarks: Array<{ pageIdx: number; label: string }>;
  highlights: Array<{ pageIdx: number; text: string; color: string }>;
  notes: Array<{ pageIdx: number; content: string }>;
}
```

V2.0 expands V1.0's DocumentEntity with: (1) `contentAddressableKeys` carrying all six content-addressable keys from E2 (five hashes plus `sourceInstanceId`, which is an identifier rather than a hash); (2) `originalPaths` as an array supporting same-content-multiple-locations; (3) `conversionArtifacts` as a versioned list per E8 with quality reports per E1; (4) `chunks` carrying `chunkingConfigVersion` per Q4; (5) `pageAnchorMap` for bidirectional page-to-offset lookup; (6) `ingestionStatus` tracking the current state-machine state per E3; (7) `hardFailReasonCode` per E4; (8) `corpusMemberships` for cross-corpus membership tracking.

The single `fileHash` field from V1.0 is removed; consumers now use `contentAddressableKeys.rawFileHash` instead. V1.0 conversion was implicitly single-version (one extracted-text blob per document); V2.0 makes conversion explicit and versioned. The `extractedAt` field remains but now means "first complete extraction"; per-version timestamps live in `conversionArtifacts[].createdAt`.
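
The `pageAnchorMap.offsetToPage` array in the schema above is described as sorted for binary search. A lookup over it might look like the following sketch, which assumes each entry's `offset` is the character offset at which that page starts; `pageForOffset` is a hypothetical name.

```typescript
interface OffsetPage {
  offset: number; // character offset where the page starts (assumption)
  page: number;
}

// Returns the page containing charOffset, or null if it precedes the first page.
function pageForOffset(offsetToPage: OffsetPage[], charOffset: number): number | null {
  let lo = 0;
  let hi = offsetToPage.length - 1;
  let found: number | null = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const entry = offsetToPage[mid];
    if (entry === undefined) break; // defensive; unreachable while lo <= hi
    if (entry.offset <= charOffset) {
      found = entry.page; // candidate: the last page starting at or before the offset
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}
```

This is the direction needed to attach a page number to a chunk's `spanStart`; the reverse direction is a direct `pageToOffset` lookup.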

### 5.3 Cache invalidation

The DocumentEntity's derived artifacts are invalidated when:

- The file's `rawFileHash` changes (document was modified externally — caught at next ingestion attempt or on-demand re-hash).
- The user modifies the PDF through Q Dashboard (annotations, redactions); modification produces a new `sourceInstanceId` and a new conversion artifact version.
- The structured summary is older than 30 days (regenerate for freshness — configurable).
- A user-initiated reconversion is performed (creates a new conversion artifact version; old version preserved per E8).

On invalidation:

- `rawTextByPage` and `rawTextFull` are re-extracted via the active conversion tool; a new conversion artifact version is created.
- `structuredSummary` and `pageSummaries` are regenerated against the new conversion.
- `apiFileId` is invalidated (re-upload needed when the document next goes to Tier 1).
- Active prompt caches in conversations are marked stale; next send re-caches.
- Chunks are re-chunked if `chunkingConfigVersion` changed; reused otherwise.

V2.0 invalidation differs from V1.0 in that the old conversion artifact is preserved (per E8 versioning), not overwritten. The user can revert to a prior conversion through the per-document reconvert affordance (§11) within the 30-day retention window, after which old conversions are eligible for purge.
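
A sketch of three of the invalidation triggers listed above (hash change, summary age, user-initiated reconversion). The `InvalidationCheck` shape and the reason strings are hypothetical, the 30-day limit is hardcoded here although the spec describes it as configurable, and the Q Dashboard modification trigger is not modeled.

```typescript
const SUMMARY_MAX_AGE_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

type InvalidationReason = "raw_file_hash_changed" | "summary_stale" | "user_reconversion";

interface InvalidationCheck {
  storedRawFileHash: string;
  currentRawFileHash: string;
  summaryGeneratedAt: string | null; // ISO timestamp, null if no summary yet
  userRequestedReconversion: boolean;
}

// Returns every trigger that currently applies; an empty array means nothing is stale.
function invalidationReasons(check: InvalidationCheck, nowMs: number): InvalidationReason[] {
  const reasons: InvalidationReason[] = [];
  if (check.storedRawFileHash !== check.currentRawFileHash) {
    reasons.push("raw_file_hash_changed");
  }
  if (check.summaryGeneratedAt !== null) {
    const ageMs = nowMs - Date.parse(check.summaryGeneratedAt);
    if (ageMs > SUMMARY_MAX_AGE_MS) reasons.push("summary_stale");
  }
  if (check.userRequestedReconversion) reasons.push("user_reconversion");
  return reasons;
}
```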

---

## §6 Model-Specific Routing

Different LLM providers have different capabilities for native PDF support, prompt caching, and context window size. Routing chooses the best content format for each model.

### 6.1 Model capabilities matrix

The matrix below reflects April 2026 reality. Specific page limits and caching behaviors evolve; the routing logic in §6.2 reads from a runtime capability registry rather than hardcoding these values.

| Model | Native PDF | Prompt caching | Practical max pages | Recommended strategy |
|---|---|---|---|---|
| Claude Opus 4.x | Yes | Yes | ~600 | Tier 1 + cache for deep work; Tier 2/3 for multi-doc |
| Claude Sonnet 4.x | Yes | Yes | ~600 | Same as Opus, lower cost |
| Claude Haiku 4.x | Yes | Yes | ~600 | Same, lowest cost — good for summaries |
| GPT-4o, GPT-4 Turbo | Yes | Limited | ~500 | Tier 1 first pass; Tier 2 subsequent |
| GPT-5 series | Yes | Yes | ~500 | Tier 1 + cache; behavior similar to Claude |
| Gemini 2.5 Pro | Yes | No | ~1000 | Tier 1 first pass; Tier 2 subsequent (no cache benefit) |
| Gemini 2.5 Flash | Yes | No | ~1000 | Same, lower cost |
| DeepSeek | No | No | N/A | Always Tier 2 (text) |
| Kimi 2.5 | Limited | No | Variable | Tier 2 (text) primary; Tier 1 if explicitly supported |
| Qwen | No | No | N/A | Always Tier 2 (text) |
| Open-source local (Ollama, etc.) | No | No | N/A | Tier 2 (text), shorter context windows |

V1.0's matrix listed GPT models as no-PDF; that is stale. Major commercial models added native PDF support across 2024-2025. The routing logic now treats native PDF as broadly available with model-specific page limits and caching behavior.

The capability registry that populates this matrix at runtime is defined in DOC11 (model routing). DOC25 reads from it; it is not a DOC25-owned data source.

### 6.2 Routing logic

```
function getDocumentContentForModel(document, model, tier, relevantPages):

  capabilities = capabilityRegistry.get(model)

  if capabilities.supportsNativePdf AND document.pageCount <= capabilities.maxPdfPages:
    if tier == 1 OR tier == 1_cached:
      // Same content shape on the first send and on a cache hit;
      // a cache hit costs ~10% of the original tokens
      content = { type: 'document', source: base64Pdf }
      if capabilities.supportsPromptCaching:
        content.cache_control = { type: 'ephemeral' }
      return content

    if tier == 2:
      return { type: 'text', content: formatStructuredText(document) }

    if tier == 3:
      return { type: 'text', content: formatTargetedExcerpts(document, relevantPages) }

  else:  // model doesn't support PDF or document exceeds page limit
    if tier == 1 OR tier == 1_cached OR tier == 2:
      return { type: 'text', content: formatStructuredText(document) }
    if tier == 3:
      return { type: 'text', content: formatTargetedExcerpts(document, relevantPages) }
```

`formatStructuredText` produces MarkItDown-converted markdown with the metadata header described in §3.1 (Tier 2 content structure). `formatTargetedExcerpts` produces the §3.1 Tier 3 content structure with summary plus targeted pages plus the document_id reference.
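
The branching in §6.2 reduces to a small decision over model capabilities, page count, and tier. The sketch below is illustrative only: `ModelCapabilities`, `RoutingTier`, and `choosePayload` are hypothetical names, and treating a cached Tier 1 request on a model without native PDF as a structured-text send is an assumption.

```typescript
interface ModelCapabilities {
  supportsNativePdf: boolean;
  supportsPromptCaching: boolean;
  maxPdfPages: number;
}

type RoutingTier = "1" | "1_cached" | "2" | "3";
type PayloadKind = "native_pdf" | "structured_text" | "targeted_excerpts";

// Decides which content format to send; building the payload itself is out of scope.
function choosePayload(
  cap: ModelCapabilities,
  pageCount: number,
  tier: RoutingTier,
): { kind: PayloadKind; useCacheControl: boolean } {
  if (tier === "3") return { kind: "targeted_excerpts", useCacheControl: false };
  const nativeOk = cap.supportsNativePdf && pageCount <= cap.maxPdfPages;
  if (nativeOk && (tier === "1" || tier === "1_cached")) {
    return { kind: "native_pdf", useCacheControl: cap.supportsPromptCaching };
  }
  return { kind: "structured_text", useCacheControl: false };
}
```

In a real deployment the `ModelCapabilities` values would come from the runtime capability registry rather than from literals.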

### 6.3 Image fallback for vision-capable but non-PDF models

A few legacy or specialty models support image input but not native PDF. For these:

- Convert relevant pages to images (PNG or JPEG via PDF.js render).
- Send as image content blocks.
- More expensive than text per token but preserves visual layout.
- Use only when visual understanding is critical (charts, exhibits, signatures).

In April 2026 this fallback is rarely needed in practice — most vision-capable models also support native PDF. The fallback exists for completeness and for models that may add vision but not PDF in the future.

---

## §7 Non-PDF Document Handling

Non-PDF documents have simpler handling than PDFs. Most are extracted to text or structured markdown and sent as-is; visual layout matters less or not at all. Extraction routes through MarkItDown as the universal backend (§10).

### 7.1 Word documents (.docx, .doc)

**Strategy:** always extract text. Never send the raw file.

**Extraction:** MarkItDown converts the document to structured markdown, preserving paragraph text with heading levels, table content (as markdown tables), list items, and footnotes/endnotes. The legacy mammoth.js path remains available as a fallback if MarkItDown is unavailable, but its output is less structured (HTML-converted text rather than native markdown).

**Token cost:** negligible. A 100-page Word document is typically 10K-30K tokens as text.

**Format sent to LLM:**

```
[Document: Contract_Agreement.docx]
Type: Word Document | Pages: ~45

## Content

# Section 1: Definitions
{paragraph text}

## Table: Payment Schedule
| Date | Amount | Status |
|------|--------|--------|
| ... | ... | ... |

{remaining text}
```

For Word documents containing tracked changes or comments, MarkItDown extracts the accepted text as the document body and emits tracked changes as a structured section at the end. This preserves track-change visibility for the LLM without contaminating the main content.

### 7.2 Plain text (.md, .txt, .html)

**Strategy:** send `.md` and `.txt` as-is. Convert `.html` to clean markdown (see below) instead of stripping tags, so that heading structure survives.

**Token cost:** typically 1K-30K tokens.

For `.html` specifically, MarkItDown converts to clean markdown rather than raw text. This preserves headings, lists, and tables that would otherwise be lost in tag stripping.

### 7.3 Spreadsheets (.xlsx, .csv, .tsv)

**Strategy:** convert to structured text format.

**Extraction:** MarkItDown extracts:

- Sheet names and structure
- Column headers
- Data rows (truncate to first 100 rows for very large sheets, with a note indicating truncation)
- Formulas as their computed values
- Named ranges

**Format sent to LLM:**

```
[Spreadsheet: Damages_Calculation.xlsx]
Sheets: Summary, Transactions, Adjustments

## Sheet: Summary
| Category | Amount | Basis |
|----------|--------|-------|
| Lost Profits | $2.4M | DCF Analysis |
| ...

## Sheet: Transactions (showing 100 of 1,247 rows)
| Date | Description | Amount |
| ... | ... | ... |
```

When a sheet has more than 100 rows, the truncation note tells the LLM the full data is available and can be requested via `retrieve_document_pages` with a section_query like "all rows from Transactions sheet" (§8).
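
A sketch of the row-truncation rule and the resulting markdown table. It does not reproduce MarkItDown's own output; `sheetToMarkdown` is a hypothetical helper operating on already-extracted cell strings.

```typescript
// Renders one sheet as a markdown table, keeping at most maxRows data rows.
function sheetToMarkdown(
  name: string,
  headers: string[],
  rows: string[][],
  maxRows: number = 100,
): string {
  const shown = rows.slice(0, maxRows);
  const title =
    rows.length > shown.length
      ? `## Sheet: ${name} (showing ${shown.length} of ${rows.length.toLocaleString("en-US")} rows)`
      : `## Sheet: ${name}`;
  // Escape pipes so cell contents cannot break the table layout.
  const escapeCell = (cell: string) => cell.replace(/\|/g, "\\|");
  const line = (cells: string[]) => `| ${cells.map(escapeCell).join(" | ")} |`;
  return [title, line(headers), line(headers.map(() => "---"))]
    .concat(shown.map(line))
    .join("\n");
}
```

The truncation note lives in the sheet heading, matching the format shown above, so the model sees the true row count before it reads the data.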

### 7.4 Presentations (.pptx, .ppt) (V2.0)

**Strategy:** extract slide text and speaker notes via MarkItDown.

**Format sent to LLM:**

```
[Presentation: Q1_Strategy_Deck.pptx]
Slides: 32

## Slide 1: Title — Q1 Strategy
{slide text}
[Speaker notes: {speaker notes if present}]

## Slide 2: Executive Summary
{slide text}
[Speaker notes: ...]
...
```

Embedded images in slides are described by MarkItDown's image-description capability when available (LLM vision-based) or noted as "[image: filename]" placeholders when not. Embedded charts are extracted as data tables when MarkItDown can parse them; otherwise as image placeholders with the underlying chart title.

### 7.5 Audio (.mp3, .wav, .m4a, .flac) (V2.0)

**Strategy:** transcribe via MarkItDown's audio plugin (Whisper integration).

**Extraction:**

- Whisper transcription with timestamps
- Speaker diarization where the audio quality supports it
- Language detection

**Format sent to LLM:**

```
[Audio: Deposition_Johnson_2026_03_15.mp3]
Duration: 4h 12m | Language: en | Speakers detected: 3

## Transcript

[00:00:23] Speaker 1 (Q): Mr. Johnson, please state your full name for the record.
[00:00:28] Speaker 2 (A): Robert James Johnson.
[00:00:31] Speaker 1 (Q): And your role at Marex Industries during the relevant period?
[00:00:35] Speaker 2 (A): Chief Financial Officer, from June 2023 through October 2025.
...
```

For deposition audio specifically, the resulting transcript is treated by downstream systems as a deposition (deposition special case applies; see §3.3). The transcript is paginated by time blocks for retrieval purposes — `retrieve_document_pages` with `pages: [3]` returns the 3rd time block.
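Time-block pagination might look like the following sketch. The block length is not stated in this section, so the five-minute `BLOCK_SECONDS` value is an assumption, as are the `TranscriptSegment` shape and the `paginateByTime` name.

```typescript
// Hypothetical transcript segment shape; this section does not define one.
interface TranscriptSegment {
  startSeconds: number;
  speaker: string;
  text: string;
}

// Assumed block length. The spec does not state the time-block size.
const BLOCK_SECONDS = 300;

// Group segments into 1-indexed time blocks so that a block number can be
// used like a page number in `retrieve_document_pages`.
function paginateByTime(
  segments: TranscriptSegment[],
  blockSeconds: number = BLOCK_SECONDS,
): Map<number, TranscriptSegment[]> {
  const blocks = new Map<number, TranscriptSegment[]>();
  for (const segment of segments) {
    const blockNumber = Math.floor(segment.startSeconds / blockSeconds) + 1;
    const bucket = blocks.get(blockNumber) ?? [];
    bucket.push(segment);
    blocks.set(blockNumber, bucket);
  }
  return blocks;
}
```

Under these assumptions, `pages: [3]` would map to the segments starting between 10:00 and 14:59 of the recording.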

### 7.6 Images (.png, .jpg, .jpeg, .tiff, .gif)

**Strategy:** send as image content blocks. All major LLMs in the April 2026 model lineup support image input.

**Optimization:** resize large images to reduce token cost. A 4K screenshot does not need to be sent at full resolution; resizing to 1024px on the longest edge typically preserves all needed information.
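The target dimensions can be computed as in this small sketch. `fitLongestEdge` is a hypothetical helper name; the actual resampling would be done by whatever image library the implementation uses.

```typescript
const MAX_EDGE_PX = 1024; // target from the optimization note above

// Compute output dimensions so the longest edge is at most maxEdge,
// preserving aspect ratio. Images already within the limit are not upscaled.
function fitLongestEdge(
  width: number,
  height: number,
  maxEdge: number = MAX_EDGE_PX,
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) return { width, height };
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```

For example, a 3840x2160 screenshot is reduced to 1024x576, while an 800x600 image passes through unchanged.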

For images that contain primarily text (screenshots of articles, photographed pages of paper documents), the system runs OCR (§9) and sends the OCR output as a text content block alongside or instead of the image, depending on whether visual context matters.

### 7.7 Scanned PDFs (no text layer)

**Strategy:** OCR pipeline (§9) before tiering. Once OCR has produced a text layer, the document handles like a regular PDF (Tier 1/2/3 selection per §3) using the OCR-derived text.

**Format sent to LLM after OCR:** identical to PDF — Tier 1 sends the OCR'd PDF (now with searchable text layer); Tier 2 sends the extracted text. The OCR confidence is reflected in the `IngestionQualityReport` and surfaces in the document badge as a "low OCR confidence" warning when the per-page confidence average falls below threshold.

---

## §8 LLM Document Escalation Tool

When a document is sent to an LLM at Tier 3 (summary plus targeted excerpts) and the LLM determines it needs more detail than the summary provides, it must have a way to request that detail without the system having to pre-decide which pages to include. This section defines the runtime tool surface for LLM-initiated document escalation.

V1.0 §3.2 included the line "if the model asks for more detail: Automatically escalate to Tier 2" without defining a tool, schema, or trigger. R1 §2 specified `retrieve_document_pages` as that tool. V2.0 absorbs `retrieve_document_pages` and adds two related tools (`retrieve_full_document`, the memory-to-source wrapper) plus the multi-document retrieval design decision.

### 8.1 Purpose

When the LLM receives a document as a Tier 3 summary and needs more detail, it calls these tools to retrieve specific pages, sections, or full documents. The tools return extracted text or markdown for the requested range. This enables the LLM to drill down into a document without the system re-sending the entire thing speculatively.

These tools also serve a second purpose: when the LLM is reasoning from extracted memories (per DOC73 corpus extraction) and needs to verify the source or read context the memory didn't preserve, the same tool family supports source re-reading.

### 8.2 retrieve_document_pages

#### 8.2.1 Tool registry entry

```typescript
// DOC24 tool registry addition
const retrieveDocumentPages: ActionRegistryEntry = {
  action_id: "document.retrieve_pages",
  domain: "document_processing",
  display_name: "Retrieve Document Pages",
  user_goal: "Get specific pages or sections from a document already in context",
  description:
    "When a document was sent as a summary (Tier 3), retrieve the full text of " +
    "specific pages or sections for deeper analysis. Returns extracted text or " +
    "markdown for the requested range.",
  stability_class: "stable_action",
  agent_invocable: true,
  invocation_bindings: [
    {
      transport: "native_openclaw_tool",
      binding_name: "retrieve_document_pages",
      readiness: "ready",
      client_exposure: { elnor_native: true, mcp_external: false, q_ui: false },
      schema_version: 1,
    },
  ],
  confirmation_policy: "none",
  safety_class: "read",
  aliases: ["get pages", "show page", "read section", "pull excerpt"],
  common_phrases: [
    "show me page 47",
    "what does section 3 say",
    "read the conclusion",
  ],
  codegen_source: "companion_registration",
  schema_version: 1,
};
```

#### 8.2.2 Request and response schemas

```typescript
interface RetrieveDocumentPagesRequest {
  document_id: string;           // Internal document reference from context
  pages?: number[];              // Specific page numbers, e.g., [47, 48, 49]
  page_range?: {                 // OR a contiguous range
    start: number;
    end: number;
  };
  section_query?: string;        // OR a semantic query, e.g., "the scienter allegations"
  format: "text" | "markdown";   // Output format; markdown preserves structure
  max_tokens?: number;           // Optional cap; defaults to 20-page equivalent
}

interface RetrieveDocumentPagesResponse {
  document_id: string;
  document_title: string;
  pages_returned: number[];
  total_pages: number;
  content: string;               // Extracted text or markdown for requested pages
  extraction_method:
    | "pdfjs_text"
    | "markitdown"
    | "ocr_markitdown"
    | "ocr_tesseract"
    | "ocr_azure";
  token_count: number;           // Cost of the returned content
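  truncated: boolean;            // true if the page cap or max_tokens cap was hit
  notes?: string;                // Guidance when no section match is found (§8.2.4) or a cap is hit (§8.2.5)
}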
```

#### 8.2.3 How it works

1. When the tiered context system sends a Tier 3 summary, it includes a `document_id` reference in the formatted Tier 3 content (per §3.1 Tier 3 content structure). The LLM sees this reference and can use it in subsequent tool calls.
2. The LLM reads the summary, determines it needs more detail, and calls `retrieve_document_pages` with the relevant page numbers, page range, or semantic query.
3. The tool looks up the document — already cached locally because the document was opened in Q Dashboard or otherwise pre-ingested. It extracts the requested pages from the active conversion artifact (MarkItDown markdown by default; PDF.js text as fallback) and returns the content.
4. The conversation's document state is updated: `conversation.documents[document_id].retrievedPages` is appended with the returned page numbers.
5. On subsequent turns, the previously retrieved pages are included alongside the Tier 3 summary in the formatted content (until cumulative retrieval crosses the §3.2 Rule 3 50% threshold, at which point the document auto-escalates to Tier 2).
6. If the LLM makes multiple `retrieve_document_pages` calls in a single turn (e.g., querying for two unrelated sections), each call is independent and returns its own response. The tool does not batch within a turn.
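Steps 4 and 5 can be sketched as a pure state update. The `DocumentContextState` shape and `recordRetrieval` name are hypothetical; only `retrievedPages` is named above. Deduplicating pages and treating "crosses" as strictly greater than 50% are both assumptions.

```typescript
// Hypothetical per-document conversation state; only `retrievedPages` is
// named in the steps above.
interface DocumentContextState {
  totalPages: number;
  tier: 1 | 2 | 3;
  retrievedPages: number[];
}

const ESCALATION_FRACTION = 0.5; // the §3.2 Rule 3 threshold cited in step 5

// Record newly returned pages and auto-escalate a Tier 3 document to Tier 2
// once cumulative retrieval exceeds the threshold.
function recordRetrieval(
  state: DocumentContextState,
  pagesReturned: number[],
): DocumentContextState {
  const merged = Array.from(
    new Set([...state.retrievedPages, ...pagesReturned]),
  ).sort((a, b) => a - b);
  const fraction = state.totalPages > 0 ? merged.length / state.totalPages : 0;
  const tier: DocumentContextState["tier"] =
    state.tier === 3 && fraction > ESCALATION_FRACTION ? 2 : state.tier;
  return { ...state, retrievedPages: merged, tier };
}
```

Returning a new state object rather than mutating in place keeps each turn's document state easy to log and compare.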

#### 8.2.4 Section query resolution

When the LLM uses `section_query` instead of page numbers (e.g., "the scienter allegations" or "the damages section"), the tool resolves the query to specific pages without an LLM call. Resolution order:

1. **Heading match.** Search the document's extracted headings (from MarkItDown's structured markdown — heading lines starting with `#`) for matches against the query terms. Return pages containing matched headings plus any pages without their own headings between matched headings.
2. **Full-text search.** Search the extracted text for query terms. Return pages with the highest match density (computed as match count divided by page text length).
3. **Section-summary match.** If the document has page-level summaries (from §5.1 step 4), match the query against those summaries and return the highest-scoring pages.

Resolution is deterministic and runs in milliseconds. It does not consume LLM tokens. If no resolution matches with reasonable confidence (default threshold 0.3 normalized match score), the response includes `pages_returned: []` with a `notes` field indicating no match found, allowing the LLM to retry with different terms or fall back to specific page numbers.
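The ordered fallback could be sketched as follows. This is a simplified illustration under stated assumptions: it uses term coverage (fraction of query terms present) as the normalized score at every stage, and it omits match-density ranking and continuation pages. `PageIndexEntry`, `termCoverage`, and `resolveSectionQuery` are hypothetical names.

```typescript
// Hypothetical per-page index entry; field names are illustrative.
interface PageIndexEntry {
  page: number;
  headings: string[];
  text: string;
  summary?: string;
}

const MATCH_THRESHOLD = 0.3; // default normalized match score from this section

const tokenize = (s: string): string[] =>
  s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Fraction of query terms found in the candidate text (0 to 1). This scoring
// function is an assumption; the spec only requires a normalized score.
function termCoverage(queryTerms: string[], candidate: string): number {
  if (queryTerms.length === 0) return 0;
  const words = new Set(tokenize(candidate));
  return queryTerms.filter((t) => words.has(t)).length / queryTerms.length;
}

// Try headings, then full text, then summaries; stop at the first stage
// that produces any page at or above the threshold.
function resolveSectionQuery(
  query: string,
  pages: PageIndexEntry[],
): { pages: number[]; notes: string } {
  const terms = tokenize(query);
  const stages: Array<[string, (p: PageIndexEntry) => string]> = [
    ["heading", (p) => p.headings.join(" ")],
    ["full-text", (p) => p.text],
    ["summary", (p) => p.summary ?? ""],
  ];
  for (const [label, pick] of stages) {
    const hits = pages
      .filter((p) => termCoverage(terms, pick(p)) >= MATCH_THRESHOLD)
      .map((p) => p.page);
    if (hits.length > 0) return { pages: hits, notes: `resolved by ${label} match` };
  }
  return {
    pages: [],
    notes: "No match found; retry with different terms or page numbers.",
  };
}
```

A production scorer would likely drop stop words before computing coverage, since common words such as "the" can push unrelated pages over the threshold in the full-text stage.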

#### 8.2.5 Per-call page cap

The tool defaults to returning at most 20 pages per call. This prevents an LLM call asking for "all pages" from blowing through the conversation token budget in a single retrieval. The cap is configurable per-deployment via the Settings surface (§19) and per-call via the optional `max_tokens` request field.

When the cap is hit, the response sets `truncated: true` and `notes` includes guidance like "Pages 1-20 returned of 47 requested. Call again with `page_range: { start: 21, end: 47 }` to retrieve the remainder."
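For a contiguous range request, the cap and its guidance note might be produced as in this sketch. `applyPageCap` and `CappedRange` are hypothetical names, and the note wording is illustrative.

```typescript
interface PageRange {
  start: number; // 1-indexed, inclusive
  end: number;   // inclusive
}

interface CappedRange {
  returned: PageRange;
  truncated: boolean;
  notes: string;
}

const DEFAULT_PAGE_CAP = 20; // per-call default from this section

// Trim a requested range to the cap and tell the caller how to continue.
function applyPageCap(range: PageRange, cap: number = DEFAULT_PAGE_CAP): CappedRange {
  const requested = range.end - range.start + 1;
  if (requested <= cap) {
    return { returned: range, truncated: false, notes: "" };
  }
  const returned = { start: range.start, end: range.start + cap - 1 };
  const next = { start: returned.end + 1, end: range.end };
  return {
    returned,
    truncated: true,
    notes:
      `Pages ${returned.start}-${returned.end} returned of ${requested} requested. ` +
      `Call again with page_range: { start: ${next.start}, end: ${next.end} } ` +
      `to retrieve the remainder.`,
  };
}
```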

### 8.3 retrieve_full_document

For small documents or when full context is needed (e.g., when a memory references "the Johnson March 15 memo" and the entire short memo is easier to re-read than guessing which sections matter), `retrieve_full_document` returns the complete document at once.

#### 8.3.1 Tool registry entry

```typescript
const retrieveFullDocument: ActionRegistryEntry = {
  action_id: "document.retrieve_full",
  domain: "document_processing",
  display_name: "Retrieve Full Document",
  user_goal: "Retrieve the complete text of a document",
  description:
    "Returns the full converted markdown for a document. Use when the document " +
    "is small enough to read in full or when broad context is needed.",
  stability_class: "stable_action",
  agent_invocable: true,
  invocation_bindings: [
    {
      transport: "native_openclaw_tool",
      binding_name: "retrieve_full_document",
      readiness: "ready",
      client_exposure: { elnor_native: true, mcp_external: false, q_ui: false },
      schema_version: 1,
    },
  ],
  confirmation_policy: "none",
  safety_class: "read",
  aliases: ["read full document", "show full text", "give me everything"],
  common_phrases: [
    "show me the whole document",
    "read the full memo",
    "what's the entire filing say",
  ],
  codegen_source: "companion_registration",
  schema_version: 1,
};
```

#### 8.3.2 Request and response schemas

```typescript
interface RetrieveFullDocumentRequest {
  document_id: string;
  format: "text" | "markdown";
  max_tokens?: number;           // Optional cap; defaults to 50K tokens
}

interface RetrieveFullDocumentResponse {
  document_id: string;
  document_title: string;
  total_pages: number;
  content: string;
  extraction_method: string;     // same enumeration as retrieve_document_pages
  token_count: number;
  truncated: boolean;            // true if document exceeds max_tokens
}
```

#### 8.3.3 Behavior

The tool returns the full converted markdown for the document. If the document exceeds `max_tokens` (default 50K, configurable), the response truncates and sets `truncated: true`, returning the first N pages whose cumulative tokens fit within the cap. The LLM can then call `retrieve_document_pages` for the remaining pages.
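Selecting "the first N pages whose cumulative tokens fit within the cap" can be sketched as a single pass. `PageTokens` and `selectPagesWithinCap` are hypothetical names; per-page token counts are assumed to be available from the conversion artifact.

```typescript
// Hypothetical per-page token count record.
interface PageTokens {
  page: number;
  tokenCount: number;
}

const DEFAULT_MAX_TOKENS = 50000; // default cap from this section

// Take pages in order until adding the next page would exceed the cap.
function selectPagesWithinCap(
  pages: PageTokens[],
  maxTokens: number = DEFAULT_MAX_TOKENS,
): { pagesIncluded: number[]; tokenCount: number; truncated: boolean } {
  const pagesIncluded: number[] = [];
  let used = 0;
  for (const p of pages) {
    if (used + p.tokenCount > maxTokens) break;
    pagesIncluded.push(p.page);
    used += p.tokenCount;
  }
  return {
    pagesIncluded,
    tokenCount: used,
    truncated: pagesIncluded.length < pages.length,
  };
}
```

The selection stops at the first page that does not fit rather than skipping ahead to smaller pages, so the returned content is always a contiguous prefix of the document.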

Calling `retrieve_full_document` on a document already at Tier 1 in the conversation is wasteful but not erroneous — the tool returns its content regardless. The Context Manager (§20) may log a warning when this happens; it does not refuse the call.

### 8.4 Memory-to-source resolution wrapper

When the LLM is reasoning from a memory (per DOC73 corpus extraction or DOC72 entity graph) and decides to re-read the source, a wrapper tool resolves the memory_id to its source document and span, then calls `retrieve_document_pages` under the hood. This is a single-call ergonomic affordance for the LLM rather than a new capability.

#### 8.4.1 Tool registry entry

```typescript
const retrieveMemorySource: ActionRegistryEntry = {
  action_id: "memory.retrieve_source",
  domain: "memory_processing",
  display_name: "Retrieve Memory Source",
  user_goal: "Re-read the source passage a memory was extracted from",
  description:
    "Given a memory_id, resolves to the source document and span and returns " +
    "the relevant pages. Use when verifying a memory or needing source language.",
  stability_class: "stable_action",
  agent_invocable: true,
  invocation_bindings: [
    {
      transport: "native_openclaw_tool",
      binding_name: "retrieve_memory_source",
      readiness: "ready",
      client_exposure: { elnor_native: true, mcp_external: false, q_ui: false },
      schema_version: 1,
    },
  ],
  confirmation_policy: "none",
  safety_class: "read",
  aliases: ["verify memory", "go to source", "re-read source"],
  common_phrases: [
    "re-read the source for that",
    "verify this memory",
    "show me where that came from",
  ],
  codegen_source: "companion_registration",
  schema_version: 1,
};
```

#### 8.4.2 Request and response schemas

```typescript
interface RetrieveMemorySourceRequest {
  memory_id: string;
  context_pages?: number;        // Pages of surrounding context; default 1 (page before/after)
  format: "text" | "markdown";
}

interface RetrieveMemorySourceResponse {
  memory_id: string;
  source_resolved: boolean;
  document_id: string | null;
  document_title: string | null;
  source_span: {
    page_start: number;
    page_end: number;
    char_offset_start: number;
    char_offset_end: number;
  } | null;
  content: string | null;        // pages covering the span plus context_pages of buffer
  extraction_method: string | null;
  token_count: number;
  notes: string;                 // populated when source_resolved is false
}
```

#### 8.4.3 Behavior

The tool reads the memory's `source_ref` field (DOC72-governed; per DOC72 §20A intake contract, every extracted memory carries a source_ref pointing back to its document and span). The wrapper resolves source_ref to a `document_id` plus character span, locates the page or pages containing the span, expands by `context_pages` of buffer on each side, and calls `retrieve_document_pages` internally. The response shape is the same content payload `retrieve_document_pages` returns, with the resolved source_span fields included for the LLM's reference.

When source_ref is missing or invalid (e.g., memory was extracted before source-anchor enforcement landed; document has been deleted from the document store; source_ref points to a document the user has revoked access to), `source_resolved: false` is returned with `notes` explaining the failure mode. The LLM should not assume source resolution will always succeed.
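The page expansion step can be illustrated with a small helper. `pagesForSpan` is a hypothetical name; clamping to the document bounds is an assumption about how edge pages are handled.

```typescript
// Subset of the source_span fields needed for page expansion.
interface SpanPages {
  page_start: number;
  page_end: number;
}

// Expand the span by contextPages on each side, clamped to [1, totalPages].
function pagesForSpan(
  span: SpanPages,
  totalPages: number,
  contextPages: number = 1,
): number[] {
  const first = Math.max(1, span.page_start - contextPages);
  const last = Math.min(totalPages, span.page_end + contextPages);
  const pages: number[] = [];
  for (let p = first; p <= last; p++) pages.push(p);
  return pages;
}
```

The resulting list is what the wrapper would pass as `pages` to `retrieve_document_pages`.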

### 8.5 Multi-document retrieval design

When the LLM needs to read multiple documents (e.g., comparing arguments across three case filings), it has two options:

1. **Multiple sequential calls** — call `retrieve_document_pages` once per document. Simple for the LLM to reason about. Each call is independent. Tool-invocation overhead is incurred per call.
2. **Single batched call** — a hypothetical `retrieve_documents_batch` tool taking a list of document_ids and returning a list of contents. Cheaper on tool-invocation overhead but more complex tool surface.

**V2.0 decision: multiple sequential calls.** The tool surface stays simple. LLMs reason more reliably about single-purpose tools than multi-target ones. Tool-invocation overhead is low in absolute terms (a few hundred tokens per call) and the ability to inspect per-call results before deciding what to retrieve next is more valuable than batching savings.

This decision can be revisited if profiling shows tool-invocation overhead becoming a real bottleneck. Until then, the tool family remains: `retrieve_document_pages` (specific pages or sections from one document), `retrieve_full_document` (full text from one document), `retrieve_memory_source` (source for one memory). Multi-document workflows make multiple calls.

### 8.6 Consumption by specialist sub-agents

These tools are consumed primarily by **DocumentIntelligenceAgent** (per `ELNOR_SUBAGENT_PRECOMPUTE_TOOL_OPTIMIZATION_NOTES_V4.md` §1.9), the named specialist sub-agent that handles retrieval-heavy document work. The primary chat agent may also call these tools directly for simple cases, but for retrieval-heavy tasks (e.g., "compare arguments across these five filings") the primary spawns DocumentIntelligenceAgent rather than running the retrieval inline.

DocumentIntelligenceAgent's tool roster includes the tools defined in this section plus DOC18 retrieval tools (`search_documents` for chunk-level retrieval). The split: DOC18 is for "find chunks discussing X across the corpus"; DOC25's tools in this section are for "give me pages 47-52 of document Y" or "re-read the source this memory came from."

The memory query tool family (`retrieve_memory_by_id`, `retrieve_related_memories`, `retrieve_memory_cluster`, `search_memories`, `retrieve_memories_by_entity`, `verify_memory`) is specified in §16 and is consumed primarily by **MemoryAgent** rather than DocumentIntelligenceAgent. The split protects the boundary that memory is not source text and source text is not memory. Both agents share output schemas so the primary can consume their findings consistently.

---

## §9 OCR Pipeline Architecture

OCR support is required for the substantial portion of legal and historical document corpora that arrive as scanned PDFs without text layers. V1.0 mentioned Tesseract.js as installed but did not specify where OCR lives in the system. R1 §3 specified the architecture; V2.0 absorbs it as an operative section.

### 9.1 Design principles

1. **Model-agnostic.** OCR cannot depend on the reasoning LLM having vision capabilities. Text gets extracted once, cached, and every LLM gets searchable text regardless of its capabilities. This is the difference between OCR (a deterministic local transformation) and LLM vision (a model-specific capability).
2. **Extract once, use everywhere.** OCR output is written back into the PDF as a text layer AND stored separately for DOC72 intake and downstream consumers. No re-OCR needed once a page has been processed (unless the user explicitly requests re-OCR).
3. **Local first.** Default OCR runs locally on user hardware. Cloud OCR is opt-in for accuracy-critical cases (poor scans, handwriting, complex tables) and respects privilege/local-only corpus restrictions.
4. **Automatic detection.** The system detects scanned and image-based pages automatically and OCRs them without user intervention. The user does not have to know whether a PDF is scanned; the system figures it out.

### 9.2 OCR stack

| Layer | Tool | When used | Accuracy | Privacy |
|---|---|---|---|---|
| **Primary** | MarkItDown OCR plugin | Server-side extraction for DOC72 intake and LLM context. Default path. | High (LLM vision-based via configurable backend) | Local — self-hosted MarkItDown |
| **Browser fallback** | Tesseract.js | In-viewer quick OCR; offline scenarios; when MarkItDown unavailable | Medium (~80-90% on clean scans) | Local — runs in Electron renderer |
| **Cloud upgrade (opt-in)** | Azure Document Intelligence | Poor scans, handwriting, complex tables, forms | Highest of the three | Cloud — user opt-in per workspace; respects corpus policy |

The MarkItDown OCR plugin is the default. Tesseract.js handles the in-viewer "OCR this page right now" affordance because it produces results in seconds rather than the longer wait for a queued MarkItDown job. Azure Document Intelligence is gated behind explicit user opt-in and is unavailable for corpora marked privileged or local-only.

### 9.3 Automatic scanned page detection

When a PDF is opened or ingested, the system classifies each page to determine whether OCR is needed:

```typescript
interface PageClassification {
  page_number: number;
  has_text_layer: boolean;       // PDF.js getTextContent() returned content
  text_density: number;          // Characters per page area (normalized 0-1)
  is_scanned: boolean;           // true if has_text_layer is false OR text_density < 0.05
  needs_ocr: boolean;            // is_scanned AND no cached OCR result exists
}

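// Assumed normalization constant (not in the original spec): a densely typeset
// page carries roughly one character per 150 square points, which is about
// 3,200 characters on US Letter. Calibrate alongside the 0.05 threshold.
const SQ_POINTS_PER_CHAR_FULL_PAGE = 150;
const SCANNED_DENSITY_THRESHOLD = 0.05;

async function classifyPage(page: PDFPage): Promise<PageClassification> {
  const textContent = await page.getTextContent();
  // Marked-content items carry no `str`; count them as zero characters.
  const charCount = textContent.items.reduce(
    (sum, item) => sum + ("str" in item ? item.str.length : 0),
    0
  );
  // page.view is [x1, y1, x2, y2] in PDF points, so derive width and height.
  const [x1, y1, x2, y2] = page.view;
  const pageArea = Math.abs(x2 - x1) * Math.abs(y2 - y1);
  // Normalize to 0-1 so the value matches the `text_density` field contract.
  const fullPageChars = pageArea / SQ_POINTS_PER_CHAR_FULL_PAGE;
  const density = fullPageChars > 0 ? Math.min(1, charCount / fullPageChars) : 0;
  const isScanned = charCount === 0 || density < SCANNED_DENSITY_THRESHOLD;

  return {
    page_number: page.pageNumber,
    has_text_layer: charCount > 0,
    text_density: density,
    is_scanned: isScanned,
    needs_ocr: isScanned && !ocrCache.has(page.fingerprint),
  };
}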
```

The threshold (a normalized text density of 0.05) is empirical: it correctly classifies almost all scanned pages and rarely misclassifies text-heavy pages as scanned. The threshold is configurable per-deployment for edge cases.

`page.fingerprint` is a per-page content hash (computed from the rendered page bytes) used as the OCR cache key. This is independent of the document-level `rawFileHash` because OCR caching is per-page — when one page in a multi-page PDF is re-rendered (e.g., as part of a subset extraction), unchanged pages reuse cached OCR.
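A per-page fingerprint of this kind could be computed as sketched below. SHA-256 is an assumed choice (the text only says "content hash"), and `pageFingerprint`, `ocrTextByFingerprint`, and `cachedOcrText` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hash the rendered page bytes. SHA-256 is an assumption; any stable
// content hash would serve as a cache key.
function pageFingerprint(renderedPageBytes: Uint8Array): string {
  return createHash("sha256").update(renderedPageBytes).digest("hex");
}

// Hypothetical in-memory cache keyed by page fingerprint.
const ocrTextByFingerprint = new Map<string, string>();

// Look up cached OCR text for a page; undefined means OCR is still needed.
function cachedOcrText(renderedPageBytes: Uint8Array): string | undefined {
  return ocrTextByFingerprint.get(pageFingerprint(renderedPageBytes));
}
```

Because the key depends only on the page's rendered bytes, an identical page appearing in a subset extraction would hit the same cache entry.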

### 9.4 OCR trigger points

| Trigger | Action | OCR method |
|---|---|---|
| PDF opened in Q viewer | Classify pages. If any need OCR, queue for background processing. | MarkItDown (primary) |
| Document sent to LLM (Tier 1 or 2) | If scanned pages exist, OCR them before sending text. | MarkItDown |
| DOC72 intake extraction | OCR all scanned pages during entity extraction. | MarkItDown |
| User clicks "OCR this page" in viewer | Immediate single-page OCR with visual feedback. | Tesseract.js (instant) or MarkItDown (queued) |
| Batch OCR from Knowledge Manager | OCR all unprocessed documents in a folder. | MarkItDown |
| Corpus ingestion (any document with scanned pages) | OCR all scanned pages as part of universal ingestion (§11). | MarkItDown |

Background OCR (queued via the EC orchestrator) does not block user actions. The user can ask questions about a partially-OCR'd document; the system uses Tier 1 (full PDF) until OCR completes, then can shift to Tier 2 once the text is available.

### 9.5 OCR result and caching

```typescript
interface OCRResult {
  document_id: string;
  page_number: number;
  page_fingerprint: string;      // per-page content hash (cache key)
  extracted_text: string;
  confidence: number;            // 0.0-1.0, from OCR engine
  method: "markitdown_vision" | "tesseract_js" | "azure_doc_intelligence";
  processed_at: string;          // ISO datetime
  language: string;              // detected language (ISO 639-1)
  tool_version: string;          // for cache invalidation on tool upgrade
}
```

**Storage:** OCR results are cached at `~/.elnor/document_store/by_hash/{document_hash}/ocr_cache/{page_number}.json`, keyed by document content hash plus page number. Pages are never re-OCR'd if a cached result exists, unless: (a) the user explicitly requests re-OCR; or (b) the OCR tool's `tool_version` field has advanced past a configured threshold and re-OCR was scheduled as part of tool upgrade.

**Text layer write-back:** after OCR completes for a page, the extracted text is written back into the PDF as an invisible text layer. The PDF becomes searchable in the Q viewer (PDF.js text search works on the new text layer) and in any external PDF viewer.

```typescript
import * as fs from 'node:fs';
import { PDFDocument } from 'pdf-lib';

async function writeTextLayerToPDF(
  pdfPath: string,
  ocrResults: OCRResult[],
): Promise<string> {
  const pdfDoc = await PDFDocument.load(fs.readFileSync(pdfPath));

  for (const result of ocrResults) {
    const page = pdfDoc.getPage(result.page_number - 1);
    // Add invisible text layer matching the OCR output, positioned to overlay
    // the scanned image so text selection and search work correctly.
    page.drawText(result.extracted_text, {
      x: 0,
      y: 0,
      size: 1,
      opacity: 0,
    });
  }

  // Anchor the match to the extension so a ".pdf" elsewhere in the path is untouched.
  const outputPath = pdfPath.replace(/\.pdf$/i, '_searchable.pdf');
  fs.writeFileSync(outputPath, await pdfDoc.save());
  return outputPath;
}
```

**Note on text positioning.** The `drawText` call shown above writes all text at position (0, 0), which is a simplified placeholder. A production implementation needs proper text positioning, matching each OCR'd word or line to its visual location on the page, so that text selection in the viewer is accurate (a user dragging across visible text gets the corresponding text layer content). This is a known hard problem; `pdf-lib` may need supplementation with a more sophisticated overlay approach (e.g., positioned text fragments matching OCR bounding boxes, or a PDF.js-rendered overlay layer maintained separately from the PDF itself). This is flagged as an implementation research item in the §26 open questions. The stub implementation above is sufficient for searchability via PDF.js text search, which works on the entire text layer regardless of position.
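One piece of the positioning problem, mapping an OCR bounding box onto PDF coordinates, can be sketched as follows. This is an illustration only: the `OcrWordBox` shape is hypothetical (OCR engines differ in what they report), the page is assumed to be unrotated with its origin at (0, 0), and using the box height as the font size and the box bottom as the baseline are rough approximations.

```typescript
// Hypothetical OCR word box in image pixels, origin at the top-left corner.
interface OcrWordBox {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Placement in PDF user space: points, origin at the bottom-left corner.
interface PdfTextPlacement {
  text: string;
  x: number;
  y: number;
  size: number;
}

// Scale pixel coordinates to points and flip the vertical axis.
function toPdfPlacement(
  box: OcrWordBox,
  imageWidthPx: number,
  imageHeightPx: number,
  pageWidthPt: number,
  pageHeightPt: number,
): PdfTextPlacement {
  const scaleX = pageWidthPt / imageWidthPx;
  const scaleY = pageHeightPt / imageHeightPx;
  return {
    text: box.text,
    x: box.x * scaleX,
    // The bottom edge of the box approximates the text baseline.
    y: pageHeightPt - (box.y + box.height) * scaleY,
    size: box.height * scaleY,
  };
}
```

Each placement could then feed one positioned `drawText` call per word instead of a single call at the origin.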

### 9.6 UI entry points

**PDF Viewer toolbar (DOC20 R4.3):**

```
[OCR ▾]
├─ OCR this page (Tesseract — instant, local)
├─ OCR all scanned pages (MarkItDown — background)
├─ OCR status: 3 of 47 pages are scanned, 0 OCR'd
└─ ☑ Auto-OCR scanned pages when opened
```

**Document badge in viewer header:**

- If all pages have text layer: no badge.
- If some pages are scanned and un-OCR'd: amber badge "3 scanned pages — OCR available."
- If OCR is in progress: spinner with "OCR processing..."
- If OCR is complete: green badge "Searchable."
- If OCR completed with low confidence: yellow badge "Searchable (low OCR confidence on some pages)."

**Knowledge Manager batch OCR:**

A "Run OCR on selected" action in the Documents tab allows the user to select multiple documents and queue them for batch OCR processing. Progress surfaces in the Runs tab.

### 9.7 DOC24 tool registry entry

```typescript
const ocrTool: ActionRegistryEntry = {
  action_id: "q.document.ocr",
  domain: "document_processing",
  display_name: "Document OCR",
  user_goal: "Extract text from scanned/image-based document pages",
  description:
    "Run OCR on scanned PDF pages. Returns extracted text. Automatically detects " +
    "scanned pages and uses the best available OCR method.",
  stability_class: "stable_action",
  agent_invocable: true,
  invocation_bindings: [
    {
      transport: "native_openclaw_tool",
      binding_name: "ocr_document",
      readiness: "ready",
      client_exposure: { elnor_native: true, mcp_external: false, q_ui: true },
      schema_version: 1,
    },
  ],
  confirmation_policy: "none",
  safety_class: "read",
  aliases: ["OCR", "scan text", "extract text from image"],
  common_phrases: [
    "OCR this document",
    "make this PDF searchable",
    "extract text from this scan",
  ],
  codegen_source: "companion_registration",
  schema_version: 1,
};
```

R1 specified `q.document.ocr` as "planned, not yet built" in the DOC24 tool registry. V2.0 promotes it to fully specified — the tool is operative as soon as DOC25 V2.0 lands.

### 9.8 OCR confidence and quality reporting

Per-page OCR confidence is captured in `OCRResult.confidence` and aggregated into the document's `IngestionQualityReport.ocr_confidence_summary` (§15.4). Documents whose OCR confidence average falls below threshold (default 0.7) surface as "degraded" in the document badge and in the corpus dashboard. The user sees which documents have unreliable OCR and can prioritize manual review or escalation to Azure Document Intelligence (cloud opt-in path).

When OCR confidence is low and the document is sent to an LLM at Tier 2, the formatted text content includes a note: "Note: This document was extracted via OCR with low confidence on some pages. Reading errors may be present in the text below." The LLM is informed about the reliability of the source it is reading.
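The aggregation and the Tier 2 note could be sketched as below. The shape of `ocr_confidence_summary` is not given in this section, so `OcrConfidenceSummary` and both function names are hypothetical; triggering the note on the document-level average is also an assumption.

```typescript
const LOW_CONFIDENCE_THRESHOLD = 0.7; // default from this section
const LOW_CONFIDENCE_NOTE =
  "Note: This document was extracted via OCR with low confidence on some pages. " +
  "Reading errors may be present in the text below.";

// Hypothetical summary shape; the real field layout is defined in §15.4.
interface OcrConfidenceSummary {
  average: number | null;       // null when no pages were OCR'd
  degraded: boolean;
  lowConfidencePages: number[]; // 1-indexed page numbers below threshold
}

function summarizeOcrConfidence(
  pageConfidences: number[],
  threshold: number = LOW_CONFIDENCE_THRESHOLD,
): OcrConfidenceSummary {
  if (pageConfidences.length === 0) {
    return { average: null, degraded: false, lowConfidencePages: [] };
  }
  const average =
    pageConfidences.reduce((sum, c) => sum + c, 0) / pageConfidences.length;
  const lowConfidencePages = pageConfidences
    .map((c, i) => (c < threshold ? i + 1 : 0))
    .filter((p) => p > 0);
  return { average, degraded: average < threshold, lowConfidencePages };
}

// Prepend the reliability note to Tier 2 text when the document is degraded.
function formatTier2Text(text: string, summary: OcrConfidenceSummary): string {
  return summary.degraded ? `${LOW_CONFIDENCE_NOTE}\n\n${text}` : text;
}
```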

---

## §10 MarkItDown Universal Extraction Backend

MarkItDown is the single extraction backend for all document types entering ELNOR's intelligence layer. It does NOT replace format-specific tools for viewing or editing — only for text extraction. R1 §4 introduced MarkItDown in this role; DOC73 V1.4.1 §15.3 specified the routing detail. V2.0 absorbs both as one operative section.

### 10.1 Architecture

```
Document enters ELNOR (any surface)
  │
  ├─ Viewing/editing path → Format-specific tool
  │   ├─ PDF → PDF.js (in-viewer rendering, text selection, area select)
  │   ├─ Word → OnlyOffice or Word Online (formatting, tracked changes, TOC)
  │   ├─ Spreadsheet → OnlyOffice (formula editing)
  │   └─ Other → format-specific viewer
  │
  └─ Text extraction path → MarkItDown
       │
       ├─ PDF with text layer → MarkItDown extracts structured markdown
       ├─ PDF scanned/image → MarkItDown OCR plugin extracts text (§9)
       ├─ DOCX → MarkItDown extracts structured markdown
       ├─ PPTX → MarkItDown extracts slide text + speaker notes
       ├─ XLSX → MarkItDown extracts as markdown tables
       ├─ Images → MarkItDown describes content (via LLM vision backend)
       ├─ Audio → MarkItDown transcribes (via Whisper integration)
       └─ HTML → MarkItDown converts to clean markdown
```

The two paths never cross. The viewing/editing path serves the user looking at or modifying a document. The text extraction path serves DOC25's intelligence layer (LLM context, DOC72 intake, retrieval indexing). A user editing a Word document in OnlyOffice never has the document's MarkItDown markdown shown to them; an LLM analyzing the document never sees the OnlyOffice-rendered version.

### 10.2 What MarkItDown replaces

| Current component | Status after MarkItDown |
|---|---|
| mammoth.js (DOCX → HTML → text) | **Replaced** for DOC72 intake extraction. Kept available if needed for Tiptap import or other UI-side conversions. |
| PDF.js `getTextContent()` for LLM context | **Replaced** for DOC72 intake and DOC25 Tier 2/3. PDF.js still used for in-viewer text selection and search. |
| Tesseract.js (OCR) | **Demoted** to browser fallback. MarkItDown OCR plugin is primary (§9). |
| No PPTX extraction | **Solved** — MarkItDown handles PPTX natively, including speaker notes. |
| No XLSX extraction | **Solved** — MarkItDown handles XLSX natively. |
| No audio transcription | **Solved** — MarkItDown handles audio via Whisper. |
| HTML tag stripping | **Replaced** — MarkItDown converts to clean structured markdown rather than stripped text. |

### 10.3 Profile-routed Docling vs MarkItDown hybrid

Docling and MarkItDown both target document-to-markdown conversion. They have different strengths: MarkItDown is fast and lightweight; Docling has stronger handling of complex tables, multi-column layouts, and image-heavy documents. V2.0 specifies a profile-routed hybrid per DOC73 V1.4.1 R5.3 E11.

#### 10.3.1 Default routing by document type

```
Default per document_type:
  text_native_pdf, text_native_doc, html, markdown → MarkItDown
  scanned_pdf, image_heavy_pdf, table_heavy_filing → Docling full pipeline
```

Document type comes from the classification step (§2). The classifier flags `table_heavy_filing` when the document has more than three tables totaling more than 30 rows, or when the document type is one of the registered table-heavy types (10-K, 10-Q, 8-K, court filings with damages tables).
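The default routing table and the table-heavy rule can be expressed as a lookup plus a predicate. This is a sketch: the `DocumentType` and `Converter` unions mirror the labels above, while `DEFAULT_CONVERTER` and `isTableHeavy` are hypothetical names.

```typescript
type DocumentType =
  | "text_native_pdf"
  | "text_native_doc"
  | "html"
  | "markdown"
  | "scanned_pdf"
  | "image_heavy_pdf"
  | "table_heavy_filing";

type Converter = "markitdown" | "docling";

// Default converter per document type, as listed in §10.3.1.
const DEFAULT_CONVERTER: Record<DocumentType, Converter> = {
  text_native_pdf: "markitdown",
  text_native_doc: "markitdown",
  html: "markitdown",
  markdown: "markitdown",
  scanned_pdf: "docling",
  image_heavy_pdf: "docling",
  table_heavy_filing: "docling",
};

// Table-heavy rule: more than three tables totaling more than 30 rows,
// or a document type registered as table-heavy (e.g., 10-K, 10-Q, 8-K).
function isTableHeavy(
  tableRowCounts: number[],
  isRegisteredTableHeavyType: boolean,
): boolean {
  const totalRows = tableRowCounts.reduce((sum, n) => sum + n, 0);
  return (
    isRegisteredTableHeavyType ||
    (tableRowCounts.length > 3 && totalRows > 30)
  );
}
```

Typing the table as `Record<DocumentType, Converter>` makes the compiler reject a new document type that has no default route.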

#### 10.3.2 Override conditions

Even with default routing, certain runtime conditions trigger escalation:

```
Override conditions:
  scan detection (low text yield from MarkItDown attempt) → escalate to Docling
  garbled tables in MarkItDown output → escalate to Docling
  user override at corpus or document level (per E13) → respect
```

When MarkItDown's first-pass output indicates problems (low text yield or garbled tables), the orchestrator queues a Docling re-conversion. The MarkItDown output is preserved as the v1 conversion artifact; the Docling re-conversion produces a v2 conversion artifact (per E8 versioning, §12.4); the user sees both versions and can promote either as active.

User overrides at corpus or document level — set via the Documents tab in DOC7 — are respected unconditionally. A user who has determined that Docling produces better output for their corpus's document mix can route everything through Docling regardless of default rules.

#### 10.3.3 Conversion-quality contract — operational definitions

The override conditions reference quality concepts that need operational definitions to avoid coding-agent guesswork:

- **"Visual-important"** — document has tables with more than 5 rows, multi-column layouts, embedded images with text content, or a visual layout score above 0.6 on the classifier's visual-importance scale.
- **"Scanned"** — automatic detection via text-yield ratio (under 100 chars per page on text extraction attempt), OCR-success-on-no-extractable-text test, or explicit user marking.
- **"Low text yield"** — converted markdown contains under 200 chars per source page on average. Configurable per-deployment.
- **"Garbled tables"** — pipe-delimited table rows with more than 30% malformed positions (cell counts not matching header column count, missing cell separators), OR table cell counts in the markdown not matching detected column count from PDF.js layout analysis.

These thresholds are configurable; the values above are V2.0 defaults.
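
The two markdown-side checks can be sketched as follows. This is an illustration under stated assumptions: tables are GitHub-style pipe tables, escaped pipes inside cells are not handled, the first row of each table is treated as its header, and the PDF.js layout cross-check is out of scope. Function names are hypothetical.

```python
def low_text_yield(markdown: str, page_count: int, min_chars_per_page: int = 200) -> bool:
    """True when converted markdown averages under the per-page character floor."""
    if page_count <= 0:
        return True  # assumption: no pages means nothing usable was extracted
    return len(markdown) / page_count < min_chars_per_page


def garbled_table_ratio(markdown: str) -> float:
    """Fraction of pipe-table rows whose cell count differs from the header's."""
    rows = malformed = 0
    header_cells = None
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            header_cells = None  # the current table (if any) has ended
            continue
        # Drop exactly one leading pipe and, if present, one trailing pipe.
        if stripped.endswith("|") and len(stripped) > 1:
            inner = stripped[1:-1]
        else:
            inner = stripped[1:]
        cells = inner.split("|")
        if header_cells is None:
            header_cells = len(cells)
            continue
        rows += 1
        if len(cells) != header_cells:
            malformed += 1
    return malformed / rows if rows else 0.0


def has_garbled_tables(markdown: str, max_malformed_ratio: float = 0.30) -> bool:
    return garbled_table_ratio(markdown) > max_malformed_ratio
```

Either check returning `True` would be grounds for the Docling escalation described in §10.3.2.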

#### 10.3.4 Dual-conversion sampling rule

Routing decisions need empirical grounding rather than assertions like "Docling is better" or "MarkItDown is faster." V2.0 specifies a dual-conversion sampling protocol per E11:

For the first N documents per profile (N configurable, default 5), run both Docling and MarkItDown on each document. Compare the outputs across:

- Text yield (total characters extracted)
- Table count (preserved tables)
- Heading count (preserved headings)
- Page anchor integrity (page boundaries preserved)

If divergence between the two outputs exceeds the threshold (default: any of the four metrics differing by more than 25%), flag the profile routing rule for user review in the DOC7 Settings tab. The user reviews the comparison and confirms which tool to default to for that profile going forward.

This is the empirical basis for routing — not abstract claims about tool capability, but measured per-corpus quality on representative documents the user actually has.
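
A sketch of the divergence test, assuming each metric is reduced to a single non-negative number and that "differing by more than 25%" is measured relative to the larger of the two values. The section does not fix the denominator, so that choice is an assumption, as is representing page anchor integrity as a count of preserved anchors.

```python
from dataclasses import dataclass, asdict


@dataclass
class ConversionMetrics:
    text_yield: int         # total characters extracted
    table_count: int        # preserved tables
    heading_count: int      # preserved headings
    page_anchor_count: int  # preserved page boundaries (assumed representation)


def relative_difference(a: float, b: float) -> float:
    """Difference as a fraction of the larger value; 0.0 when both are zero."""
    larger = max(a, b)
    return 0.0 if larger == 0 else abs(a - b) / larger


def divergent_metrics(docling: ConversionMetrics,
                      markitdown: ConversionMetrics,
                      threshold: float = 0.25) -> list[str]:
    """Names of the metrics whose relative difference exceeds the threshold."""
    d, m = asdict(docling), asdict(markitdown)
    return [name for name in d if relative_difference(d[name], m[name]) > threshold]


def needs_user_review(docling: ConversionMetrics, markitdown: ConversionMetrics) -> bool:
    return bool(divergent_metrics(docling, markitdown))
```

Returning the metric names, not just a boolean, lets the review surface show the user which dimension diverged.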

### 10.4 GLiNER opt-in entity tagging

GLiNER is a local zero-shot named-entity recognition model. V2.0 specifies opt-in routing per DOC73 V1.4.1 R5.3 E10.

#### 10.4.1 Default behavior

GLiNER is **opt-in per corpus, default-on for entity-rich profiles, default-off elsewhere.**

Default-on profiles: research, legal, transcript-like (depositions, earnings calls), case files.
Default-off profiles: code, generic notes, personal correspondence, audio, images.

The default can be overridden per corpus or per document type within a corpus.

#### 10.4.2 Promotion hygiene

GLiNER produces:

- `entity_observation` records — NOT confirmed entities by default
- `appears_in` evidence edges — NOT strong co-occurrence edges

Promotion to confirmed entity requires one of:

- **Corroboration:** entity appears in N documents (default N = 3, configurable per corpus).
- **User confirmation:** manual promotion in the Knowledge Manager UI.
- **Downstream reuse:** entity referenced in extracted memories or in user actions (search, link).

This prevents GLiNER false positives from poisoning the entity graph — a documented failure mode where one mis-tagged page name becomes a confirmed entity that contaminates retrieval indefinitely.

User override granularity: GLiNER on/off per corpus, per document type, per entity type list. A research corpus might enable GLiNER for `person`, `organization`, `date`, `citation` but not for every monetary amount or generic role mentioned in passing.
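
The promotion rule in this section reduces to a three-way OR over an observation record. The sketch below assumes a hypothetical `EntityObservation` shape; the field names are not taken from DOC72 and are for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class EntityObservation:
    name: str
    entity_type: str
    document_hashes: set = field(default_factory=set)  # documents the entity appears in
    user_confirmed: bool = False                       # manual promotion in the UI
    downstream_references: int = 0                     # memories or user actions citing it


def should_promote(obs: EntityObservation, corroboration_n: int = 3) -> bool:
    """Promote on corroboration, user confirmation, or downstream reuse."""
    corroborated = len(obs.document_hashes) >= corroboration_n
    reused = obs.downstream_references > 0
    return corroborated or obs.user_confirmed or reused
```

An observation seen in only one or two documents, with no confirmation and no reuse, stays an `entity_observation` and never reaches the confirmed graph.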

### 10.5 NuExtract literal-extraction routing

NuExtract is for structured field extraction with literal output. V2.0 specifies tightened routing per DOC73 V1.4.1 R5.3 E9.

#### 10.5.1 NuExtract candidate conditions

NuExtract is a candidate ONLY if all of the following are true:

```
verification_mode = "grounding_only"
output_schema is defined
target_kind ∈ {"literal", "metadata", "numeric", "citation", "source_span"}
judgment_required = false
expected output is source-verbatim or normalized-from-source-verbatim
literal_only flag = true
```

#### 10.5.2 Routing exclusions

If any of these conditions hold, route to LLM extraction instead of NuExtract:

- Target language contains interpretive terms: "misleading," "supports," "implies," "holding," "relevance," "scienter," "materiality." LLM regardless of schema presence.
- Schema requires summarization or paraphrase. LLM.
- Evidence must be weighed across sources. LLM.
- Counterfactual mode is active. LLM (NuExtract cannot verify counterfactual outcomes).
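
The candidate conditions (§10.5.1) and the exclusions above combine into one routing decision. The sketch below assumes a hypothetical `ExtractionTarget` record whose boolean fields are populated upstream; the whole-word keyword match is a deliberately naive stand-in for whatever interpretive-language detection the implementation uses.

```python
import re
from dataclasses import dataclass

LITERAL_TARGET_KINDS = {"literal", "metadata", "numeric", "citation", "source_span"}
INTERPRETIVE_TERMS = ("misleading", "supports", "implies", "holding",
                      "relevance", "scienter", "materiality")
# Whole-word match so that "shareholding" does not trip on "holding".
_INTERPRETIVE = re.compile(r"\b(?:" + "|".join(INTERPRETIVE_TERMS) + r")\b", re.IGNORECASE)


@dataclass
class ExtractionTarget:
    description: str
    target_kind: str
    verification_mode: str
    has_output_schema: bool
    judgment_required: bool
    literal_only: bool
    source_verbatim_expected: bool = True
    requires_summarization: bool = False
    weighs_evidence_across_sources: bool = False
    counterfactual_mode: bool = False


def route_extraction(t: ExtractionTarget) -> str:
    """Return "nuextract" only when every candidate condition holds and no exclusion applies."""
    excluded = (
        bool(_INTERPRETIVE.search(t.description))
        or t.requires_summarization
        or t.weighs_evidence_across_sources
        or t.counterfactual_mode
    )
    if excluded:
        return "llm"
    candidate = (
        t.verification_mode == "grounding_only"
        and t.has_output_schema
        and t.target_kind in LITERAL_TARGET_KINDS
        and not t.judgment_required
        and t.source_verbatim_expected
        and t.literal_only
    )
    return "nuextract" if candidate else "llm"
```

Checking exclusions first keeps the "LLM regardless of schema presence" rule from being shadowed by a target that otherwise looks literal.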

#### 10.5.3 Source-span verification

Every NuExtract output goes through source-span verification before being accepted. The verifier checks: does the extracted value appear at the cited source span in the original document? Don't rely on the model's structured output alone.

When verification fails, the extraction is either retried with relaxed source-span matching or routed to LLM extraction as a fallback. Verification failures are logged and surfaced in the document's `IngestionQualityReport`.
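
A minimal sketch of the verifier, assuming spans are character offsets into the original text and that "relaxed" matching means comparing after whitespace collapsing and case folding. Both are assumptions; the section does not define the span format or the relaxation.

```python
import re


def _normalize(text: str) -> str:
    """Collapse runs of whitespace and fold case for relaxed comparison."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def verify_source_span(document_text: str, start: int, end: int,
                       value: str, relaxed: bool = False) -> bool:
    """Check that the extracted value appears inside the cited span."""
    if not (0 <= start < end <= len(document_text)):
        return False  # the cited span does not exist in the document
    span = document_text[start:end]
    if relaxed:
        return _normalize(value) in _normalize(span)
    return value in span


def verify_with_retry(document_text: str, start: int, end: int, value: str) -> str:
    """Strict match first, then relaxed; otherwise hand off to LLM extraction."""
    if verify_source_span(document_text, start, end, value):
        return "verified"
    if verify_source_span(document_text, start, end, value, relaxed=True):
        return "verified_relaxed"
    return "fallback_to_llm"
```

The three return values map onto the outcomes the text describes; the non-strict outcomes are the ones that would be logged in the `IngestionQualityReport`.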

#### 10.5.4 Fallback when NuExtract is unavailable

If NuExtract is not installed or not running, extraction falls back to schema-constrained LLM output. The fallback output is labeled "lower literal-extraction confidence" unless source-span verification passes — in which case the LLM output meets the same verification bar as NuExtract output and is treated equivalently.

### 10.6 Web fetch and Firecrawl cascade

Plain web fetch is the default for retrieving web content. Firecrawl is the fallback for cases where plain fetch produces unusable output. V2.0 expands Firecrawl trigger conditions per DOC73 V1.4.1 R5.3 E12.

#### 10.6.1 Firecrawl triggers

In addition to hard fetch failure (HTTP error, timeout, DNS failure), Firecrawl is triggered by:

- **Low main-content ratio:** fetched HTML has under 20% body text vs. boilerplate or chrome (configurable threshold).
- **Suspicious boilerplate dominance:** cookie banners, navigation chrome, paywall text dominate the extracted content.
- **Missing title or body:** fetched page has no `<title>` element or empty `<body>`.
- **Script-heavy page with no rendered text:** suggests JavaScript-rendered content that the static fetch did not execute.
- **User-configured "always use crawler for this domain":** per-source override stored in settings.
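
The trigger list can be sketched as a function that returns the first matching reason, or `None` when plain fetch output is usable. The `FetchOutcome` fields are hypothetical and assumed to be computed by the fetch layer; in particular, how boilerplate dominance and script-heaviness are detected is not specified here.

```python
from dataclasses import dataclass
from typing import AbstractSet, Optional


@dataclass
class FetchOutcome:
    hard_failure: bool                  # HTTP error, timeout, or DNS failure
    domain: str
    has_title: bool
    main_text_chars: int                # body text attributed to main content
    boilerplate_chars: int              # navigation, banners, other chrome
    boilerplate_dominant: bool = False  # cookie banner or paywall text dominates
    script_heavy_no_text: bool = False  # scripts present, no rendered text


def firecrawl_reason(outcome: FetchOutcome,
                     always_crawl_domains: AbstractSet[str] = frozenset(),
                     min_main_ratio: float = 0.20) -> Optional[str]:
    """Return why Firecrawl should be used, or None to keep the plain fetch."""
    if outcome.domain in always_crawl_domains:
        return "domain_override"
    if outcome.hard_failure:
        return "hard_fetch_failure"
    if outcome.script_heavy_no_text:
        return "script_heavy_no_rendered_text"
    if not outcome.has_title or outcome.main_text_chars == 0:
        return "missing_title_or_body"
    total = outcome.main_text_chars + outcome.boilerplate_chars
    if outcome.main_text_chars / total < min_main_ratio:
        return "low_main_content_ratio"
    if outcome.boilerplate_dominant:
        return "boilerplate_dominance"
    return None
```

Returning a reason string instead of a boolean gives the processing log something specific to record when the cascade fires.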

#### 10.6.2 Self-hosted vs hosted Firecrawl

V2.0 assumes **self-hosted Firecrawl (open source) for privacy.** Hosted Firecrawl is a cloud service; for privileged or local-only corpora, self-hosted is required. For cloud-allowed corpora, hosted Firecrawl may be permissible per visibility class — see §10.7 for the policy gating pattern.

### 10.7 LlamaCloud policy-gated availability

LlamaCloud (LlamaParse, LlamaExtract, LlamaClassify cloud services) is **declined by default** for privileged or local-only corpora. V2.0 specifies policy-gated availability per DOC73 V1.4.1 R5.3 E17 — not absolute decline.

#### 10.7.1 Availability conditions

```
LlamaCloud allowed only if all of these are true:
  corpus visibility = cloud_allowed (or ambient with no privilege/local_only tags)
  no privilege/local_only tags on corpus or sources
  user explicitly enables cloud document AI for this deployment
  source-specific outbound policy permits upload
```

For privileged or local-only corpora: prohibited. For cloud-allowed corpora with explicit user opt-in: available as an optional provider.
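
As a sketch, the gate is a conjunction of the four conditions. The `CloudPolicyContext` shape is hypothetical; the tag spellings `privilege` and `local_only` and the visibility values follow the block above.

```python
from dataclasses import dataclass

RESTRICTED_TAGS = frozenset({"privilege", "local_only"})


@dataclass(frozen=True)
class CloudPolicyContext:
    corpus_visibility: str                  # e.g. "cloud_allowed", "ambient"
    tags: frozenset                         # tags on the corpus and its sources
    user_enabled_cloud_document_ai: bool    # deployment-level opt-in
    source_outbound_upload_permitted: bool  # source-specific outbound policy


def llamacloud_allowed(ctx: CloudPolicyContext) -> bool:
    """All four conditions must hold; any restricted tag blocks the upload."""
    visibility_ok = ctx.corpus_visibility in ("cloud_allowed", "ambient")
    untagged = not (RESTRICTED_TAGS & ctx.tags)
    return (
        visibility_ok
        and untagged
        and ctx.user_enabled_cloud_document_ai
        and ctx.source_outbound_upload_permitted
    )
```

Because the tag check applies to every visibility value, an `ambient` corpus carrying either restricted tag is rejected, which matches the parenthetical in the first condition.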

#### 10.7.2 LlamaIndex OSS vs LlamaCloud distinction

The two are different products with different policy implications:

- **LlamaIndex OSS** (DOC18): local Python library; runs on user hardware; no data leaves the machine. Always available.
- **LlamaCloud services**: cloud-hosted; data is uploaded to the cloud. Policy-gated per above.

This preserves ELNOR's domain-agnostic posture without forcing absolute decline that would prevent useful capability for non-privileged corpora (e.g., a personal research corpus on a non-confidential topic).

### 10.8 DOC72 intake pipeline integration

For surface intake contracts (DOC72 §20A), MarkItDown plugs in at the extraction step:

```
Surface emits event (document opened, email received, file attached, etc.)
  → EC command triggers intake pipeline
  → Significance gate (DOC72 §20.10 — user action demonstrates intent)
  → MarkItDown converts document to structured markdown
       (Docling escalation if E11 conditions hit; see §10.3)
  → Extraction LLM runs entity extraction on the markdown
       (or NuExtract for literal+schema targets; see §10.5)
       (or GLiNER for opt-in entity tagging; see §10.4)
  → DOC72 nodes and edges written to entity graph via EC sole-writer path
```

This applies uniformly across all surfaces: Q Browser (Tier 2 extraction), Notes (attachment extraction), Document Viewer (opened document extraction), Email (attachment extraction via M365), Tasks (attached file extraction), corpus ingestion (the canonical pipeline; see §11).

### 10.9 DOC25 Tier 2/3 integration

MarkItDown improves the tiered context system by replacing PDF.js raw-text output with structured markdown for Tier 2 and Tier 3:

| Tier | Without MarkItDown | With MarkItDown |
|---|---|---|
| Tier 1 (Full PDF) | Native PDF sent to Claude or other native-PDF model | Unchanged — native PDF for visual understanding |
| Tier 2 (Text only) | PDF.js `getTextContent()` → raw text blob | MarkItDown → structured markdown (preserves headings, tables, lists) |
| Tier 3 (Summary) | PDF.js text → LLM summary | MarkItDown markdown → LLM summary (better structure → better summary) |

The key improvement: Tier 2 currently sends raw text blobs with no structure. MarkItDown sends structured markdown with headings, tables, and lists preserved. The LLM receives a document that reads like a document, not a wall of text. This directly improves comprehension quality on subsequent turns and improves the structured-summary quality on Tier 3.

### 10.10 M365 / DOC16 alignment

DOC16 Entry 16.7 specifies that ELNOR must always operate on actual `.docx` files for formatting work (TOC, TOA, tracked changes, heading styles), never on markdown conversions. MarkItDown does NOT violate this requirement.

The two paths remain separate:

- **User says "What does this motion argue?"** — read-and-understand path. MarkItDown extracts text. LLM reads it.
- **User says "Add a TOC to this brief"** — edit-and-produce path. OnlyOffice or Word Online opens the actual `.docx`. Formatting preserved.

The two paths never cross. A user editing a Word document never has the document's MarkItDown output substituted for the source. An LLM analyzing the document never sees the OnlyOffice-rendered version. The DOC16 requirement and the DOC25 universal extraction architecture compose cleanly.

### 10.11 Tool installation tracking

DOC25 maintains an inventory of extraction tools and their installation status:

- Docling — install check, version tracking, config per document type.
- MarkItDown — install check, version tracking.
- GLiNER — install check, model weight location, version tracking.
- NuExtract — install check, model weight location, version tracking.
- Firecrawl — self-hosted Docker check, version tracking.
- Tesseract — install check (browser fallback for OCR).

The Q Dashboard "Document Intelligence" settings panel (§19) shows installed tools with status and version, with an install-on-demand flow for tools that aren't yet installed. DOC25 defines the install scripts and config defaults; specifics (pip packages, Docker containers, model weight paths) are addendum-level details.

When a required tool is not installed, DOC25 routes around it (using fallback chains where defined: MarkItDown OCR → Tesseract.js → manual handling) or flags the document as "needs manual handling" with the specific missing tool identified in the failure receipt.
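
The OCR fallback chain named above can be sketched as a first-available lookup. The tool identifiers and the shape of the returned record are illustrative assumptions, not the failure-receipt schema.

```python
from typing import Iterable, Optional

# Order follows the chain in the text: MarkItDown OCR, then Tesseract.js.
OCR_FALLBACK_CHAIN = ("markitdown_ocr", "tesseract_js")


def pick_ocr_tool(installed: Iterable[str]) -> Optional[str]:
    """Return the first installed tool in the chain, or None if none is installed."""
    available = set(installed)
    for tool in OCR_FALLBACK_CHAIN:
        if tool in available:
            return tool
    return None


def ocr_routing(installed: Iterable[str]) -> dict:
    """Route to a tool, or flag for manual handling and name what is missing."""
    tool = pick_ocr_tool(installed)
    if tool is None:
        return {"status": "needs_manual_handling",
                "missing_tools": list(OCR_FALLBACK_CHAIN)}
    return {"status": "routed", "tool": tool}
```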

---

## §11 Universal Ingestion Orchestration

Every document entering ELNOR through any surface routes through one universal ingestion pipeline. This is the V2.0 expansion's load-bearing architectural commitment: surfaces differ in what they do *after* ingestion (substantive extraction under corpus rules, triage for emails, display for browser captures, etc.) but the raw ingestion work is shared.

### 11.1 Surfaces routed through the orchestrator

The orchestrator handles ingestion for every surface that brings a document into ELNOR:

- **Corpus ingestion** — primary consumer; documents added to a corpus for deep knowledge work. The original use case the orchestrator was designed against.
- **Email attachments** — documents arriving via the M365 integration (DOC16 Entry 16.7). Attachments to inbound email; attachments the user adds to outbound email; documents from Teams chat.
- **Q Browser document captures** — documents captured during browsing sessions through DOC20's browser intake, including PDFs opened from web links and HTML pages saved as documents.
- **Chat-attached documents** — documents the user attaches to a chat message (Ask panel, full chat, room substrate per DOC10/DOC14).
- **DOC3 demonstration captures** — documents that appear during a recorded demonstration (e.g., a Word doc opened during the demo).
- **DOC23 task outputs** — documents emitted by task module pipelines (extracted attachments, generated reports, fetched filings).
- **Direct file uploads** — files dragged into Q Dashboard, files added through the file picker, files placed in a watched folder.
- **Autonomous gathering task outputs** — documents pulled by autonomous gathering pipelines (EDGAR pulls, PACER docket fetches, web scrape outputs, RSS feed PDFs).

All surfaces emit ingestion events to EC. EC enqueues the document on the universal ingestion pipeline. Surface-specific work — corpus extraction, email triage, browser display, chat context preparation — happens after the universal ingestion pipeline completes.

### 11.2 Pipeline steps

The universal pipeline runs nine steps per document. Steps 1-5 and 7-9 are universal; step 6 is conditional on consuming-surface configuration.

1. **Upload reception and durable storage of original.** Every document is registered in ELNOR's document store under its content hash. Original files at user-controlled paths (OneDrive, local filesystem) are not copied; the system hashes them in place and stores derived artifacts under the hash. Original files arriving as bytes (email attachments, browser captures) are written once to the document store at `originals/{content_hash}.{extension}` and treated as user-controlled paths thereafter (see §12.1).

2. **Classification.** Document type classification per §2.2. Runs locally, fast, no API call. Output stored as part of the document's classification record.

3. **Conversion to markdown.** MarkItDown by default; Docling for `scanned_pdf`, `image_heavy_pdf`, `table_heavy_filing`, or escalation conditions per §10.3. Conversion produces the active conversion artifact; if a prior version exists, the new conversion gets the next version number per E8.

4. **Metadata extraction.** Profile-driven. For literal+schema targets (date fields, monetary amounts, citations) NuExtract is used per §10.5. For interpretive metadata (document type, summary, key parties) LLM extraction is used. Output stored as document metadata.

5. **Document summary.** One LLM pass against the converted markdown. Default model: configurable per deployment (Haiku or Sonnet typical for cost). Output stored as `structuredSummary` on the DocumentEntity.

6. **Entity tagging via GLiNER (conditional).** Opt-in per corpus or surface per §10.4. When enabled, GLiNER runs zero-shot NER over the converted markdown. Output: `entity_observation` records and `appears_in` evidence edges. Promotion to confirmed entities requires corroboration per §10.4.2.

7. **Chunking for retrieval.** Chunks computed per the corpus's `chunking_config_version`. Chunks indexed into the DOC18 LlamaIndex sidecar.

8. **Index writes to DOC72.** A `world_entity` node with `entity_type: "document"` is written via DOC72's §20A intake. Linked entity nodes (people, organizations, dates) are written if entity tagging ran. Edges created per the intake contract.

9. **Processing log write.** Per-step timestamps, outcomes, costs, errors. Stored in the document store at `by_hash/{content_hash}/processing_log.json`. Updated after each step completes; never truncated.

### 11.3 Per-step concurrency pools

The orchestrator manages per-step concurrency rather than per-document parallelism. This gives more granular control: the LLM-summary step can have 16 concurrent calls in flight while OCR is limited to 2 (because OCR is heavy on memory), without coupling those decisions.

Initial defaults, all configurable:

| Step | Default concurrency | Reasoning |
|---|---|---|
| LLM classification | 16 | Cheap; latency-sensitive only in aggregate |
| LLM summary | 16 | Same |
| LLM extraction | 16 | Same |
| Docling conversion | 4 | Heavy on memory and CPU |
| MarkItDown conversion | 12 | Lightweight |
| OCR | 2 | Very heavy when needed; thermal-sensitive |
| GLiNER entity tagging | 8 | Moderate; benefits from batching |
| Chunking | 8 | Lightweight |
| DOC72 index writes | 4 | Bounded by EC sole-writer; sequential within EC's write queue |

These values are starting points. The orchestrator monitors observed throughput and latency per step and surfaces "bottleneck of the run" in the Runs tab. The user can adjust per-step pools in the Document Intelligence settings (§19).
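
Per-step pools can be illustrated with one semaphore per step. This is a sketch using `asyncio`, not a statement of how EC implements its worker pools; the step keys are hypothetical names for the rows in the table above.

```python
import asyncio

# Defaults copied from the table above; keys are illustrative.
DEFAULT_POOLS = {
    "llm_classification": 16, "llm_summary": 16, "llm_extraction": 16,
    "docling_conversion": 4, "markitdown_conversion": 12, "ocr": 2,
    "gliner_tagging": 8, "chunking": 8, "doc72_index_writes": 4,
}


class StepPools:
    """One semaphore per pipeline step, so limits are tuned independently."""

    def __init__(self, limits):
        # Construct inside a running event loop (see demo) for portability.
        self._semaphores = {step: asyncio.Semaphore(n) for step, n in limits.items()}

    async def run(self, step, coro_fn, *args):
        async with self._semaphores[step]:
            return await coro_fn(*args)


async def demo():
    """Run six fake OCR jobs and report the peak number in flight."""
    pools = StepPools(DEFAULT_POOLS)
    active = peak = 0

    async def fake_ocr(page):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for real OCR work
        active -= 1
        return page

    results = await asyncio.gather(*(pools.run("ocr", fake_ocr, p) for p in range(6)))
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The OCR limit of 2 holds regardless of how many LLM-summary calls are in flight, which is the decoupling this section describes.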

### 11.4 EC orchestrator ownership

The orchestrator is part of EC's background services, not a standalone service. EC owns:

- **Durable queue.** SQLite table tracking every document's pipeline state. Resumable across crashes (per §14.5 state-machine recovery).
- **Worker pool management.** Per-step worker pools with the concurrency limits above.
- **Retry, timeout, and fallback policies per step.** Configurable per step. Retries are bounded (default 3); after retries exhaust, the document enters `hard_failed` with an explicit reason code per §14.5.2.
- **Status tracking visible to consuming surfaces.** Surfaces query the orchestrator for document status; the orchestrator does not push to surfaces.
- **Dynamic pool sizing based on user activity.** When the user is actively chatting, ingestion pools throttle down to free CPU and memory for chat responsiveness. When the user is idle, pools open up to maximum configured concurrency.

DOC25 specifies the ingestion pipeline; EC executes it. DOC25 does not run its own background processes outside of EC's orchestrator. The single-orchestrator pattern is invariant.

### 11.5 Reuse versus reconversion

When a document is encountered for the second time (same `rawFileHash`), the orchestrator decides whether to reuse existing artifacts or reconvert. The default rule is: same hash → reuse conversion, summary, metadata, and entity tags; create a new `corpus_membership` if the document is entering a new corpus.

#### 11.5.1 What is reused

For an already-ingested document encountered again:

- Conversion artifact reused. The active conversion version is what subsequent surfaces consume.
- Document summary reused. Not regenerated.
- Per-page summaries reused. Not regenerated.
- Entity tags reused. Not re-run.
- Chunks reused if `chunking_config_version` matches the corpus's current config. Otherwise re-chunked at corpus's config (Q4 versioned chunking).

For an already-ingested document being added to a new corpus:

- All of the above is reused.
- A new `corpus_membership` is created.
- Corpus-specific extraction (DOC73 territory) runs anew because extraction is per-corpus-membership; conversion is not.
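
The reuse rules can be sketched as a planner that lists what happens when a known hash is seen again. The `ExistingDocument` record and the action names are hypothetical; they only mirror the bullets above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExistingDocument:
    raw_file_hash: str
    chunk_versions: set = field(default_factory=set)  # chunking_config_versions on disk
    corpus_ids: set = field(default_factory=set)      # existing corpus memberships


def plan_reencounter(existing: ExistingDocument,
                     corpus_id: Optional[str],
                     corpus_chunking_version: Optional[str]) -> list[str]:
    """Actions for an already-ingested document seen again."""
    actions = ["reuse_conversion", "reuse_summary",
               "reuse_page_summaries", "reuse_entity_tags"]
    if corpus_chunking_version is not None:
        if corpus_chunking_version in existing.chunk_versions:
            actions.append("reuse_chunks")
        else:
            actions.append("rechunk_at_corpus_config")
    if corpus_id is not None and corpus_id not in existing.corpus_ids:
        # Extraction is per corpus membership; conversion is not.
        actions += ["create_corpus_membership", "run_corpus_extraction"]
    return actions
```

A chat upload with no corpus engaged passes `None` for both corpus arguments and gets the four reuse actions only.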

#### 11.5.2 Reconversion triggers

Reconversion is triggered by:

- **User explicit request** ("reconvert this document — Docling mangled the tables"). Per-document reconvert affordance in the Documents tab per §19. This is the E13 contract.
- **Document type profile assignment changes.** A document re-classified from `generic_document` to `structured_filing` may need reconversion with a different tool. Surfaced as a suggestion, not auto-triggered.
- **Tool upgrade with significant quality improvement.** Rare; user-driven. The system flags documents that would benefit from reconversion when a tool releases a known-better version, but does not auto-reconvert in V2.0 (Q3 resolved: manual per-document only; auto-propose batch reconversion deferred to V2.1).

When reconverting, the previous conversion artifact is preserved per E8 versioning (§12.4). The new conversion gets the next version number; both are accessible in the Documents tab. The user can revert to a prior version through the per-document reconvert affordance within the 30-day retention window.

#### 11.5.3 Per-document reconvert affordance (E13)

The Documents tab in DOC7 provides per-document actions:

- **Reconvert with Docling full pipeline** — for documents where MarkItDown's output looks degraded.
- **Reconvert with MarkItDown** — for documents where Docling overcomplicated.
- **Mark conversion acceptable** — for documents where the user has reviewed and accepts the current conversion.

Reconversion creates a new artifact version (per E8); old artifacts preserved for the 30-day retention window. Memories extracted from the new conversion version can be integrated as a re-extraction directive (per DOC73 V1.4.1 §14.7) or as a fresh extraction. The reconvert affordance does not automatically trigger re-extraction; that decision is corpus-level, surfaced separately in DOC7.

### 11.6 Status visibility

Document status is visible at three levels:

- **Per-document status** in the Documents tab. Current ingestion state per §14.5.1, last completed step, current step in progress, any quality flags, processing log on demand.
- **Per-corpus aggregate** in the corpus dashboard (Settings tab). Ingestion progress (n of N documents complete; n in each step; n hard-failed); aggregate quality flags; "bottleneck of this run" indicator.
- **Per-run** in the DOC7 Runs tab. Run-level summary: documents processed, time elapsed, hard failures, quality flags rolled up.

All three views read from the same orchestrator state. They are read-models, not separate state stores. Updates propagate as the orchestrator advances each document's state.

---

## §12 Content-Addressable Storage Model

V2.0 uses content-addressable storage rather than path-addressable. Originals stay where the user puts them; derived artifacts are stored under content-hash keys in ELNOR's private document store. This preserves user portability, enables exact dedup, and makes the system robust to user-controlled path changes.

### 12.1 Document store layout

ELNOR's private document store lives at `~/.elnor/document_store/`. Layout:

```
~/.elnor/document_store/
├── originals/                            # only for files arriving as bytes
│   └── {content_hash}.{extension}        # email attachments, browser captures
│                                          # (user-controlled path files NOT stored here)
│
├── by_hash/{content_hash}/
│   ├── conversions/
│   │   ├── v1/
│   │   │   ├── converted.md              # primary markdown conversion
│   │   │   ├── tool.json                 # which tool, which config produced this version
│   │   │   ├── quality_report.json       # IngestionQualityReport for this version
│   │   │   └── metadata.json             # extracted metadata fields
│   │   ├── v2/                           # if reconverted
│   │   │   └── ...
│   │   └── active                        # symlink or pointer to the active version
│   │
│   ├── chunks/
│   │   └── {chunking_config_version}/
│   │       └── chunks.db                 # chunks indexed (or reference into DOC18)
│   │
│   ├── ocr_cache/
│   │   └── {page_number}.json            # per-page OCR results
│   │
│   ├── entities.json                     # GLiNER output (if run)
│   │
│   ├── summary.txt                       # document summary
│   │
│   ├── page_summaries.json               # per-page summaries
│   │
│   ├── processing_log.json               # per-step timestamps, outcomes, costs, errors
│   │
│   └── previous_conversions/             # 30-day retention window for prior conversions
│       └── ...                           # populated when active version changes
│
└── document_index.sqlite                  # SQLite document index (see §12.2)
```

Originals at user-controlled paths (OneDrive, local filesystem) are not copied into `originals/`. ELNOR computes their hash and writes derived artifacts under `by_hash/{content_hash}/`. Originals arriving as bytes (email attachments, in-memory uploads) are written once to `originals/{content_hash}.{extension}` and then treated as user-controlled paths for all subsequent operations.
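
A small sketch of how paths in this layout could be derived from a content hash. The helper names are illustrative; the hash is SHA-256 over the original bytes, matching `raw_file_hash` in §12.3.

```python
import hashlib
from pathlib import Path

STORE_ROOT = Path.home() / ".elnor" / "document_store"


def content_hash_of_bytes(data: bytes) -> str:
    """Hash for originals arriving as bytes (email attachments, uploads)."""
    return hashlib.sha256(data).hexdigest()


def content_hash_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a user-controlled file in place, without copying it."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def artifact_dir(content_hash: str, root: Path = STORE_ROOT) -> Path:
    return root / "by_hash" / content_hash


def original_path(content_hash: str, extension: str, root: Path = STORE_ROOT) -> Path:
    """Location for byte-arriving originals only."""
    return root / "originals" / f"{content_hash}.{extension.lstrip('.')}"


def conversion_path(content_hash: str, version: str, root: Path = STORE_ROOT) -> Path:
    return artifact_dir(content_hash, root) / "conversions" / version / "converted.md"


def chunks_dir(content_hash: str, chunking_config_version: str,
               root: Path = STORE_ROOT) -> Path:
    return artifact_dir(content_hash, root) / "chunks" / chunking_config_version
```

Because every derived path is a pure function of the hash, moving or renaming the original file does not invalidate any artifact.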

### 12.2 SQLite document index

The document index tracks documents by their identity (content hash) and their locations (original paths, corpus memberships). It is the primary lookup table for "is this document already ingested?"

```sql
CREATE TABLE documents (
  document_id          TEXT PRIMARY KEY,           -- UUID
  content_hash         TEXT NOT NULL UNIQUE,       -- raw_file_hash (primary dedup key)
  normalized_text_hash TEXT,                       -- for OCR-different copies
  
  title                TEXT,
  document_type        TEXT,
  file_size            INTEGER,
  mime_type            TEXT,
  
  classification_json  TEXT,                       -- DocumentClassification serialized
  
  active_conversion_version TEXT,                  -- 'v1', 'v2', etc.
  
  ingestion_status     TEXT NOT NULL,              -- state machine state
  hard_fail_reason     TEXT,                       -- populated only when hard_failed
  
  first_ingested_at    TEXT NOT NULL,              -- ISO datetime
  last_reingested_at   TEXT,
  last_accessed_at     TEXT,
  
  processing_log_ref   TEXT                        -- pointer to processing_log.json
);

CREATE INDEX idx_documents_content_hash ON documents(content_hash);
CREATE INDEX idx_documents_normalized_text_hash ON documents(normalized_text_hash);
CREATE INDEX idx_documents_status ON documents(ingestion_status);

CREATE TABLE document_original_paths (
  path_id            INTEGER PRIMARY KEY AUTOINCREMENT,
  document_id        TEXT NOT NULL REFERENCES documents(document_id),
  path               TEXT NOT NULL,
  location_type      TEXT NOT NULL,                -- 'onedrive' | 'local' | 'email_attachment' | etc.
  added_at           TEXT NOT NULL,
  is_stale           INTEGER NOT NULL DEFAULT 0,   -- 1 if path no longer resolves
  UNIQUE(document_id, path)
);

CREATE TABLE document_corpus_memberships (
  membership_id        INTEGER PRIMARY KEY AUTOINCREMENT,
  document_id          TEXT NOT NULL REFERENCES documents(document_id),
  corpus_id            TEXT NOT NULL,
  added_at             TEXT NOT NULL,
  added_by             TEXT NOT NULL,              -- user_id or system
  extraction_status    TEXT,                        -- per-corpus extraction state
  extraction_history_json TEXT,                     -- list of completed extraction runs
  UNIQUE(document_id, corpus_id)
);

CREATE TABLE document_source_instances (
  source_instance_id   TEXT PRIMARY KEY,
  document_id          TEXT NOT NULL REFERENCES documents(document_id),
  policy_context_json  TEXT NOT NULL,              -- visibility, sensitivity, scope_inference_basis
  ingested_at          TEXT NOT NULL,
  ingestion_surface    TEXT NOT NULL              -- which surface ingested this instance
);
```

The `documents` table is the canonical document identity. `document_original_paths` tracks all locations where this document has been seen. `document_corpus_memberships` tracks corpus participation. `document_source_instances` tracks each policy-context-distinct ingestion event (per Q2 / E2 source_instance_id).

EC owns writes to all four tables under its sole-writer invariant. Other components read via DOC25 service interfaces, not directly.
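
The "is this document already ingested?" lookup can be sketched with Python's `sqlite3` against a trimmed copy of the schema. Only the columns needed for the lookup are reproduced, and the initial status value `"queued"` is a placeholder assumption (the real state names are defined in §14.5.1). In the actual system this write path belongs to EC.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

TRIMMED_SCHEMA = """
CREATE TABLE documents (
  document_id       TEXT PRIMARY KEY,
  content_hash      TEXT NOT NULL UNIQUE,
  ingestion_status  TEXT NOT NULL,
  first_ingested_at TEXT NOT NULL
);
CREATE TABLE document_original_paths (
  path_id       INTEGER PRIMARY KEY AUTOINCREMENT,
  document_id   TEXT NOT NULL REFERENCES documents(document_id),
  path          TEXT NOT NULL,
  location_type TEXT NOT NULL,
  added_at      TEXT NOT NULL,
  is_stale      INTEGER NOT NULL DEFAULT 0,
  UNIQUE(document_id, path)
);
"""


def register_sighting(conn, content_hash, path, location_type):
    """Return (document_id, already_ingested) and record where the file was seen."""
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute(
        "SELECT document_id FROM documents WHERE content_hash = ?",
        (content_hash,),
    ).fetchone()
    already_ingested = row is not None
    if already_ingested:
        document_id = row[0]
    else:
        document_id = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO documents (document_id, content_hash, ingestion_status,"
            " first_ingested_at) VALUES (?, ?, ?, ?)",
            (document_id, content_hash, "queued", now),  # placeholder status
        )
    # UNIQUE(document_id, path) makes a repeat sighting at the same path a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO document_original_paths"
        " (document_id, path, location_type, added_at) VALUES (?, ?, ?, ?)",
        (document_id, path, location_type, now),
    )
    conn.commit()
    return document_id, already_ingested
```

A second sighting of the same hash at a new path adds a row to `document_original_paths` but never a second row to `documents`.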

### 12.3 Multi-hash content addressability

A single hash is insufficient for legal document variance (Bates stamps applied, OCR differences, redactions, watermarks). V2.0 specifies multiple hash types per E2:

```
content_addressable_keys {
  raw_file_hash             // SHA-256 of original bytes
  normalized_binary_hash    // Post-normalization for trivial differences
                            // (whitespace, line endings, BOM removal)
  normalized_text_hash      // Hash of converted markdown
                            // (catches OCR-different copies of the same scan)
  page_hashes: []           // Per page, for partial-document dedup
  chunk_hashes: []          // Per chunk, for retrieval-level dedup
  source_instance_id        // Unique per ingestion event with policy context
                            // Same content from different ingestion paths
                            // has same raw_file_hash but different
                            // source_instance_ids
}
```

Dedup logic uses the appropriate hash for the question being asked:

| Question | Hash used |
|---|---|
| "Have I ingested this exact file before?" | `raw_file_hash` |
| "Have I ingested this content (after normalization)?" | `normalized_binary_hash` or `normalized_text_hash` |
| "Have I seen this page before across documents?" | `page_hashes` |
| "Have I retrieved this chunk before?" | `chunk_hashes` |
| "Is this the same content but with different policy context?" | matching `raw_file_hash` but distinct `source_instance_id` |

The `source_instance_id` is critical for policy correctness. The same document uploaded to a `firewalled` corpus and the same document uploaded to an `ambient` corpus share `raw_file_hash` but have different `source_instance_id`s — the policy context differs. This prevents incorrect dedup that would collapse policy distinctions and let firewalled content leak into ambient retrieval.
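
The interaction between `raw_file_hash` and `source_instance_id` can be sketched with an in-memory registry. Two assumptions are made here: whitespace collapsing stands in for the unspecified text normalization, and a repeat ingestion with an identical policy context reuses the existing `source_instance_id` instead of minting a new one. All class and function names are illustrative.

```python
import hashlib
import json
import uuid


def raw_file_hash(data: bytes) -> str:
    """SHA-256 of the original bytes."""
    return hashlib.sha256(data).hexdigest()


def normalized_text_hash(markdown: str) -> str:
    """Hash of converted markdown after whitespace collapsing (assumed normalization)."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SourceInstanceRegistry:
    """Shares content by hash while keeping policy contexts distinct."""

    def __init__(self):
        self._by_hash = {}  # raw_file_hash -> {policy_key: source_instance_id}

    def record(self, data: bytes, policy_context: dict):
        """Return (raw_file_hash, source_instance_id, content_seen_before)."""
        file_hash = raw_file_hash(data)
        policy_key = json.dumps(policy_context, sort_keys=True)
        instances = self._by_hash.setdefault(file_hash, {})
        content_seen_before = bool(instances)
        if policy_key not in instances:
            instances[policy_key] = str(uuid.uuid4())
        return file_hash, instances[policy_key], content_seen_before
```

Conversion artifacts can be shared whenever `content_seen_before` is true, while retrieval filters on `source_instance_id` so that the firewalled instance is never served through the ambient one.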

### 12.4 Versioned immutable ingestion artifacts

The worst data-loss scenario is overwriting a prior conversion or extraction artifact. V2.0 makes all ingestion artifacts versioned and immutable per E8:

```
source_instance_v1                  # original file, never modified
conversion_artifact_v1              # initial conversion to markdown
extraction_run_v1                   # memories produced from spec_version A on conversion_v1

# Reconversion creates v2:
conversion_artifact_v2              # Docling reconversion, e.g., per E13
                                    # v1 preserved

# Re-extraction creates new run:
extraction_run_v2                   # memories from spec_version B on conversion_v1
                                    # v1 preserved as historical
```

**Rules:**

- Reconversion creates a new conversion_artifact version; never overwrites.
- Re-extraction creates a new extraction_run; never overwrites prior runs.
- Dedup shares content artifacts but never collapses source membership, sensitivity labels, or extraction history.
- Each artifact version has its own `IngestionQualityReport` (for conversion versions) or its own memory set (for extraction versions).
- Old versions are kept for the 30-day retention window. After 30 days, old versions are eligible for purge but not auto-purged in V2.0 — the user can manually purge or leave them indefinitely. Q5 resolved: user-configurable retention, default keep-everything.

Versioning enables:

- **Rollback.** User can revert to a prior conversion or prior extraction run.
- **Audit trail.** "What did the system know on date X?" answerable from the versioned record.
- **Comparison across spec versions.** Two extraction runs against the same conversion can be compared.
- **Recovery from accidental misconfiguration.** A bad reconversion does not destroy the working conversion; revert is one click.
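The append-only rules above can be sketched as a small in-memory store. This is a minimal illustration, assuming hypothetical names (`VersionedArtifact`, `append`, `rollbackTo`) that are not part of the DOC25 schema; it models rollback as moving an "active" pointer, which is one possible reading of "revert".

```typescript
// Hypothetical shapes for illustration only; not the DOC25 storage schema.
interface ArtifactVersion<T> {
  readonly version: number;   // 1-based, assigned on append
  readonly payload: T;
  readonly createdAt: string; // ISO timestamp
}

class VersionedArtifact<T> {
  private readonly versions: ArtifactVersion<T>[] = [];
  private activeVersion = 0;  // 0 means no versions exist yet

  // Reconversion or re-extraction appends a version; nothing is overwritten.
  append(payload: T, createdAt: string): ArtifactVersion<T> {
    const entry: ArtifactVersion<T> = {
      version: this.versions.length + 1,
      payload,
      createdAt,
    };
    this.versions.push(entry);
    this.activeVersion = entry.version;
    return entry;
  }

  // Rollback only moves the active pointer; later versions stay on record.
  rollbackTo(version: number): void {
    if (version < 1 || version > this.versions.length) {
      throw new RangeError(`unknown version ${version}`);
    }
    this.activeVersion = version;
  }

  active(): ArtifactVersion<T> | undefined {
    return this.versions[this.activeVersion - 1];
  }

  history(): readonly ArtifactVersion<T>[] {
    return this.versions;
  }
}
```

Because a rollback never deletes anything, a bad reconversion and its predecessor both remain available for audit and comparison.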

### 12.5 Chunk versioning (Q4)

Chunks are versioned by `chunking_config_version`. When a corpus's chunking config changes (e.g., chunk size from 600 to 800 tokens), previously chunked documents keep their old chunks at the prior version. The new config does not auto-trigger re-chunking; re-chunking is opt-in per corpus or per document.

Document chunks are stored under the document store path `by_hash/{content_hash}/chunks/{chunking_config_version}/`. Multiple chunking versions can coexist for the same document. Retrieval queries include the corpus's current `chunking_config_version` and read the matching chunk set.

When a corpus is opted in to re-chunk, the orchestrator queues the affected documents on the chunking step. Re-chunking does not re-run extraction; chunks are independent of extracted memories.
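As a rough sketch of the layout and the opt-in queueing step, the path template below is copied from the text above, while `ChunkedDocument` and `documentsToRechunk` are hypothetical names introduced only for illustration.

```typescript
// Directory layout quoted from the spec text above.
function chunkDir(contentHash: string, chunkingConfigVersion: string): string {
  return `by_hash/${contentHash}/chunks/${chunkingConfigVersion}/`;
}

// Hypothetical record of which chunk versions already exist for a document.
interface ChunkedDocument {
  contentHash: string;
  chunkVersions: string[];
}

// Documents the orchestrator would queue when a corpus opts in to re-chunking:
// those with no chunk set at the corpus's current config version.
function documentsToRechunk(
  docs: ChunkedDocument[],
  currentVersion: string
): ChunkedDocument[] {
  return docs.filter((d) => d.chunkVersions.indexOf(currentVersion) === -1);
}
```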

---

## §13 Cross-Surface Deduplication

Cross-surface deduplication is the pattern that makes universal ingestion useful in practice. The same document encountered through multiple surfaces (corpus, email, browser, chat, EDGAR pull) is recognized as one document, processed once, and reused everywhere it appears.

### 13.1 Scenarios

The dedup mechanism must correctly handle these scenarios.

**Scenario A.** User ingests a 200-page PDF into Corpus A. Two weeks later, in a chat with no corpus engaged, the user drags the same file into the chat. The system must recognize it has already been processed. No re-conversion, no re-summary, no re-entity-tag. The chat immediately has access to the converted markdown, summary, and entity tags.

**Scenario B.** User uploads a document to Corpus A, then later tries to add the same document to Corpus B. The system must recognize it as already-ingested and already-converted. No re-conversion. A second `corpus_membership` is created, but corpus-specific extraction runs per-corpus-membership so Corpus B's extraction is new work even though the conversion is not.

**Scenario C.** User receives a PDF as an email attachment (M365 pipeline processes it). Later, the sender sends a fresh copy of the same PDF. Or the user separately downloads the same PDF from EDGAR. The system recognizes it as the same document. One document node, multiple `original_paths`, one set of derived artifacts, separate `source_instance_id`s if policy contexts differ.

**Scenario D.** User has ingested version 1 of a document. Later they ingest version 2 (a revision — same title, different content). The system must NOT dedup — these are different documents. The system may note "this looks similar to a previous document — link or keep separate?" as an affordance, but does not force a dedup.

### 13.2 Mechanism

#### 13.2.1 Exact dedup by content hash

Scenarios A, B, and C are handled by exact hash matching: same bytes → same `raw_file_hash` → instant match. This is cheap (SHA-256 on the file, computed once per ingestion event) and deterministic.

When a new ingestion event arrives, the orchestrator computes `raw_file_hash` and checks the `documents` table:

- **Hit:** the document is already ingested. Add the new path to `document_original_paths`. If the policy context differs (different corpus visibility, different sensitivity), create a new `source_instance_id`. Skip steps 3-7 of the pipeline (conversion, metadata, summary, entity tagging, chunking) — they are reused. Run step 8 (DOC72 index writes) only if the new event introduces new corpus_membership or source_instance.
- **Miss:** new document. Run the full pipeline.
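The hit/miss branch above can be sketched as a pure planning function. The types and field names here (`KnownDocument`, `IngestionEvent`, `DedupPlan`) are assumptions for illustration, and policy contexts are reduced to plain strings; the real orchestrator works against the `documents` table.

```typescript
// Hypothetical in-memory view of an already-ingested document.
interface KnownDocument {
  rawFileHash: string;
  policyContexts: string[]; // one entry per existing source instance
  corpusIds: string[];      // existing corpus memberships
}

interface IngestionEvent {
  rawFileHash: string;
  path: string;
  policyContext: string;
  corpusId: string | null;  // null when no corpus is engaged (e.g. chat upload)
}

interface DedupPlan {
  hit: boolean;
  newSourceInstance: boolean;
  newCorpusMembership: boolean;
  runSteps3to7: boolean;    // conversion through chunking
  runStep8: boolean;        // DOC72 index writes
}

function planIngestion(event: IngestionEvent, known: KnownDocument[]): DedupPlan {
  const match: KnownDocument | undefined =
    known.filter((d) => d.rawFileHash === event.rawFileHash)[0];
  if (match === undefined) {
    // Miss: new document, full pipeline.
    return {
      hit: false,
      newSourceInstance: true,
      newCorpusMembership: event.corpusId !== null,
      runSteps3to7: true,
      runStep8: true,
    };
  }
  // Hit: reuse derived artifacts; only record what is new about this event.
  const newSourceInstance = match.policyContexts.indexOf(event.policyContext) === -1;
  const newCorpusMembership =
    event.corpusId !== null && match.corpusIds.indexOf(event.corpusId) === -1;
  return {
    hit: true,
    newSourceInstance,
    newCorpusMembership,
    runSteps3to7: false,
    runStep8: newSourceInstance || newCorpusMembership,
  };
}
```

In both branches the event's path would also be recorded in `document_original_paths`; that write is omitted here to keep the sketch focused on which pipeline steps run.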

#### 13.2.2 Near-dedup for revisions (Scenario D)

Near-dedup is harder. Three options were considered:

- **Document content fingerprint.** Hash of normalized text (whitespace-collapsed, case-normalized). Catches formatting-only changes but not content changes.
- **Shingle / minhash similarity.** Standard near-dup detection; computes a similarity score. Above 95%, treat as revisions of the same document; user confirms whether to link or keep separate.
- **Title + date fingerprint.** Less reliable; relies on metadata.

**V2.0 decision (Q1 resolution):** exact dedup day one via `raw_file_hash`, plus normalized text matching via `normalized_text_hash` for OCR-different copies of the same scan. Deeper near-dedup (shingle/minhash similarity scoring) is deferred to V2.1. Exact dedup catches the majority of real cases; near-dedup is a refinement.

When the user manually flags two documents as related-but-distinct (or the system detects high content similarity in the future), the relationship is captured as an edge in DOC72 (`related_to` or `revises`) rather than collapsed into a single document.
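The normalization behind `normalized_text_hash` can be sketched as follows, using the two transformations the text names (whitespace collapsing and case normalization). The function names are hypothetical, and the hashing step itself is left out: the assumption is that the normalized string would be fed to the same hash function used elsewhere.

```typescript
// Collapses whitespace runs and case so formatting-only differences disappear.
// Assumption: the result is what would be hashed into `normalized_text_hash`.
function normalizeForHash(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// Two extractions count as the same text when their normalized forms match.
function sameNormalizedText(a: string, b: string): boolean {
  return normalizeForHash(a) === normalizeForHash(b);
}
```

This catches copies that differ only in line breaks, spacing, or letter case. It does not catch real content changes, which is why revisions (Scenario D) stay separate documents.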

### 13.3 Original path tracking

`original_paths` is an array, not a single field. Every time a document appears at a new location, the new location is added to the array without changing the document's identity. "This PDF exists in OneDrive at X, was also sent as email attachment from Y, was also downloaded from EDGAR at Z."

The `document_original_paths` table (§12.2) stores all known paths with `location_type` and `is_stale` flag. When a path no longer resolves (file was moved or deleted), the path is marked stale rather than removed. The document remains identified; subsequent encounters at new paths add to the array.

### 13.4 Cross-surface visibility UX

When a document is recognized as already-ingested, the surface that's bringing it in shows the user:

```
This document is already in your system.
It's in [Corpus A] and was originally ingested [date].

[Use existing copy] [Reingest]
```

Default action: use existing.

The reingest option is available when the user believes the document changed or the conversion was bad. Reingest re-runs the full pipeline and creates new conversion/extraction artifact versions (per E8); the prior versions are preserved.

### 13.5 Deletion semantics

Deletion is layered to prevent accidental purge of useful artifacts:

1. **Delete from a corpus.** The `corpus_membership` is removed but the document node and derived artifacts persist (other corpora or surfaces may reference them).
2. **Delete from all corpora AND no chat/email/browser surface references it.** The document node is dormant. Derived artifacts persist.
3. **User explicitly confirms "remove from ELNOR entirely."** The document node and derived artifacts are purged. Originals at user-controlled paths are not touched (ELNOR never modifies user files).

No automatic orphan cleanup in V2.0. Documents accumulate; storage management is manual or via the configurable retention setting per corpus (Q5 resolution).
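The three deletion layers reduce to a small decision over reference counts. In this sketch the lifecycle labels `active` and `purged` and the `DocumentReferences` shape are hypothetical; only `dormant` is a term the text itself uses.

```typescript
type DocumentLifecycle = "active" | "dormant" | "purged";

// Hypothetical summary of what still points at a document node.
interface DocumentReferences {
  corpusMemberships: number;
  surfaceReferences: number;         // chat, email, or browser references
  userConfirmedFullRemoval: boolean; // explicit "remove from ELNOR entirely"
}

function lifecycleAfterDeletion(refs: DocumentReferences): DocumentLifecycle {
  // Only an explicit user confirmation purges the node and derived artifacts.
  if (refs.userConfirmedFullRemoval) return "purged";
  // No corpus and no surface references: node goes dormant, artifacts persist.
  if (refs.corpusMemberships === 0 && refs.surfaceReferences === 0) return "dormant";
  return "active";
}
```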

### 13.6 Permission and visibility (Q2 resolution)

A document ingested in a chat with no corpus engaged is still a document node in ELNOR's index. When later added to a corpus, it is the same node. No permission changes — corpus membership is an additional access, not a more-restrictive one.

Same content with different policy/sensitivity contexts gets a different `source_instance_id` even when `raw_file_hash` matches. A document uploaded to a `firewalled` corpus and the same document uploaded to an `ambient` corpus share `raw_file_hash` but have different `source_instance_id`s — the policy context differs. This prevents incorrect dedup that would collapse policy distinctions.

The `document_source_instances` table (§12.2) stores each instance with its `policy_context_json` (visibility, sensitivity, scope_inference_basis). Retrieval queries scoped to a particular policy context only return source_instances matching that context, even if the underlying document is shared.
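A policy-scoped lookup might be sketched like this. The row shape is a simplified assumption (the real `policy_context_json` also carries `scope_inference_basis`), exact equality is only one possible reading of "matching", and the sensitivity value in the usage example is invented for illustration.

```typescript
interface PolicyContext {
  visibility: string;  // e.g. "firewalled" or "ambient"
  sensitivity: string; // values are corpus-defined; treated as opaque here
}

// Hypothetical, simplified row from `document_source_instances`.
interface SourceInstanceRow {
  source_instance_id: string;
  document_id: string;
  policy: PolicyContext;
}

// Returns only the instances whose policy context matches the query's context,
// even when several instances share one underlying document.
function instancesForContext(
  rows: SourceInstanceRow[],
  ctx: PolicyContext
): SourceInstanceRow[] {
  return rows.filter(
    (r) =>
      r.policy.visibility === ctx.visibility &&
      r.policy.sensitivity === ctx.sensitivity
  );
}
```

Usage: if one document has a `firewalled` instance and an `ambient` instance, a query in the `ambient` context sees only the second, so the firewalled copy never leaks through the shared `raw_file_hash`.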

### 13.7 Unstable original paths (Q7 resolution)

Original paths can break. A user moves a OneDrive document the system knows about; the original path no longer resolves. The system handles this with a hybrid approach:

1. **Auto-repath best-effort by content hash matching.** When a document is encountered at a new path with the same `raw_file_hash` as a known document, the system silently adds the new path to `document_original_paths` and continues. The break is healed without user intervention.

2. **Manual relink fallback with UI affordance.** When the system cannot auto-repath (e.g., the user moved the document while ELNOR was offline; the new path has not been encountered yet), the affected document surfaces in the Documents tab with a "Relink" affordance. The user points the system at the new location; the system verifies the hash and updates the path array.

3. **ALWAYS preserve derived artifacts regardless of original-path break.** The `document_id`, conversion artifacts, summary, chunks, entity tags, and corpus memberships remain valid even if every known original path breaks. The system can answer questions about the document, retrieve from it, and continue extraction work. Only re-ingestion (which would compute a fresh `raw_file_hash`) is blocked while no original path resolves.

This three-part approach prioritizes resilience: most broken paths heal automatically; a small minority require user action; in the worst case, the document remains useful even with no resolvable original.
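The stale-marking and relink steps can be sketched as below. `TrackedDocument`, `markStale`, and `relink` are hypothetical names, and the hash of the file at the new location is assumed to be computed by the caller.

```typescript
interface TrackedPath {
  path: string;
  isStale: boolean;
}

// Hypothetical in-memory view of a document and its known locations.
interface TrackedDocument {
  rawFileHash: string;
  paths: TrackedPath[];
}

// A path that no longer resolves is flagged, never removed.
function markStale(doc: TrackedDocument, path: string): void {
  for (const p of doc.paths) {
    if (p.path === path) p.isStale = true;
  }
}

// Auto-repath and manual relink share one check: accept the new location
// only if the bytes found there hash to the document's known value.
function relink(doc: TrackedDocument, newPath: string, hashAtNewPath: string): boolean {
  if (hashAtNewPath !== doc.rawFileHash) return false;
  for (const p of doc.paths) {
    if (p.path === newPath) {
      p.isStale = false; // a previously stale path resolves again
      return true;
    }
  }
  doc.paths.push({ path: newPath, isStale: false });
  return true;
}
```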

### 13.8 Cross-surface benefits

Once DOC25 owns universal ingestion, these patterns become possible:

- User forwards an email with a PDF attachment. M365 pipeline processes it. Later in chat, user asks "didn't I see something about X last week?" — the PDF is already ingested, summarized, entity-tagged, searchable.
- Q Browser captures a document during a browsing session. Same document is later uploaded to a corpus — no re-processing.
- DOC23 gathering task pulls 50 EDGAR filings. They flow into EC's ingestion queue, processed as a batch, deduplicated against any of the same filings already in ELNOR.
- Red-team session on a single document uses ingested artifacts without re-conversion.

These patterns require zero special-case code in the consuming surfaces. Each surface emits an ingestion event; the orchestrator handles dedup and reuse decisions; the surface gets a uniform DocumentEntity to work with regardless of whether the document was newly ingested or recognized.

---

## §14 Pipeline State Machine

Without a strict state machine, ingestion can stick in bad states or partially commit — e.g., chunks indexed but extraction never run, leaving retrieval pointing at incomplete representations. V2.0 specifies the state machine per E3.

### 14.1 Document states

```
Document ingestion states:

  registered                  # Document known to system; hash not yet computed
       ↓
  hash_checked                # Content hashes computed; dedup check performed
       ↓
  source_stored               # Original artifact persisted (or hash-only if user-controlled path)
       ↓
  conversion_pending          # Queued for conversion
       ↓
  converted | conversion_failed
       ↓
  classified | classification_failed
       ↓
  summarized | summary_failed
       ↓
  indexed | indexing_failed
       ↓
  extraction_pending          # Universal ingestion done; extraction is corpus-specific
       ↓
  extracted | extraction_failed
       ↓
  verified | verification_failed
       ↓
  written | write_failed
       ↓
  complete | degraded_complete | hard_failed
```

The states up through `indexed` are universal: they cover pipeline steps 1-8, with step 8 (the DOC72 index write) ending in `indexed`. States from `extraction_pending` onward are corpus-specific extraction territory (DOC73 owns the extraction work but reads/writes the same state machine via DOC25 service interfaces).

`degraded_complete` indicates the pipeline ran to completion but with quality flags raised (e.g., low OCR confidence, garbled tables, missing page anchors). The document is usable; the user knows it has issues.

`hard_failed` indicates a step failed after retries were exhausted, with a specific reason code per §14.5.
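One way to make the diagram enforceable is a transition table. The sketch below follows the diagram literally: state names are copied from §14.1, while `allowedNext` is a hypothetical helper, and the path from a `*_failed` state through retries into `hard_failed` is deliberately not modeled.

```typescript
// Success path in the order the diagram lists it.
const SUCCESS_PATH = [
  "registered", "hash_checked", "source_stored", "conversion_pending",
  "converted", "classified", "summarized", "indexed",
  "extraction_pending", "extracted", "verified", "written", "complete",
] as const;

type SuccessState = (typeof SUCCESS_PATH)[number];

// Failure states from the diagram, keyed by the success state they replace.
const FAILURE_OF: Partial<Record<SuccessState, string>> = {
  converted: "conversion_failed",
  classified: "classification_failed",
  summarized: "summary_failed",
  indexed: "indexing_failed",
  extracted: "extraction_failed",
  verified: "verification_failed",
  written: "write_failed",
};

// States a document may move to from `current`, per the diagram's arrows.
function allowedNext(current: SuccessState): string[] {
  const next = SUCCESS_PATH[SUCCESS_PATH.indexOf(current) + 1];
  if (next === undefined) return []; // `complete` is terminal
  const options: string[] = [next];
  const failure = FAILURE_OF[next];
  if (failure !== undefined) options.push(failure);
  if (next === "complete") options.push("degraded_complete", "hard_failed");
  return options;
}
```

A writer that checks `allowedNext` before every transition cannot, for example, mark a document `extracted` straight from `conversion_pending`.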

### 14.2 Memory write preconditions

Memory writes are forbidden unless the required source/content/conversion references exist. This prevents memories from being written that cannot be traced back to their source.

```
Memory write requires:
  Valid source_ref                (pointer to document)
  Valid content_ref               (pointer to converted markdown — normalized_content_ref)
  Valid conversion_artifact_ref   (specific conversion version)
  Valid source span/page/section anchor (where applicable to memory type)
```

A memory written without these references is rejected by EC's write path with a `source_anchor_missing` error. This is non-negotiable: every memory in the entity graph traces back to its source.
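The precondition check can be sketched as a guard that runs before any write. The request shape and the `anchor_required` flag are assumptions for illustration; the sketch only checks that references are present, whereas a real check would also confirm that each reference resolves.

```typescript
// Hypothetical write request; field names mirror the precondition list above.
interface MemoryWriteRequest {
  source_ref?: string;
  content_ref?: string;
  conversion_artifact_ref?: string;
  source_anchor?: { page?: number; section?: string };
  anchor_required: boolean; // assumption: set per memory type
}

type WriteCheck =
  | { ok: true }
  | { ok: false; error: "source_anchor_missing"; missing: string[] };

function checkMemoryWrite(req: MemoryWriteRequest): WriteCheck {
  const missing: string[] = [];
  if (!req.source_ref) missing.push("source_ref");
  if (!req.content_ref) missing.push("content_ref");
  if (!req.conversion_artifact_ref) missing.push("conversion_artifact_ref");
  if (req.anchor_required && !req.source_anchor) missing.push("source_anchor");
  if (missing.length === 0) return { ok: true };
  // Any missing reference rejects the write with the spec's error code.
  return { ok: false, error: "source_anchor_missing", missing };
}
```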

### 14.3 Atomic transitions

State transitions are atomic with respect to graph writes. A document cannot be in `extracted` state without `converted` having been true at some point. State machine integrity is maintained by EC's sole-writer invariant: only EC can transition document state, and EC's transitions are wrapped in SQLite transactions that include any associated graph writes.

Concurrent ingestion of the same document through two surfaces is handled by EC's write queue: the first event wins on initial state transitions; the second event finds the document already past the relevant state and skips ahead (or attaches as a new corpus_membership / source_instance without re-running pipeline steps).

### 14.4 Recovery on crash

The state machine resumes from the last completed state when EC restarts. The recovery procedure:

1. EC starts and reads the orchestrator queue from SQLite.
2. For each document in non-terminal state, EC determines whether the step it was on completed before the crash (idempotent steps re-run safely; non-idempotent steps check completion before retrying).
3. **Idempotent steps.** Hashing, classification, summarization, entity tagging — these can be re-run with no harmful effect. If interrupted, EC re-runs them.
4. **Non-idempotent steps.** Index writes to DOC72 (creating nodes and edges). EC checks whether the writes already happened before retrying.
5. Documents in `hard_failed` state are not auto-retried. They surface in the Runs tab with their reason code; the user can manually retry from there.

The processing log (`by_hash/{content_hash}/processing_log.json`) is the source of truth for "what happened so far" and is consulted during recovery.
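The per-document recovery decision can be sketched as follows. Step labels, `InterruptedDocument`, and the action names are hypothetical; the idempotent/non-idempotent split is taken from the list above, and unknown steps are treated conservatively as non-idempotent.

```typescript
type StepKind = "idempotent" | "non_idempotent";

// Classification from the recovery procedure; the step keys are illustrative labels.
const STEP_KIND: Record<string, StepKind> = {
  hashing: "idempotent",
  classification: "idempotent",
  summarization: "idempotent",
  entity_tagging: "idempotent",
  index_write: "non_idempotent",
};

interface InterruptedDocument {
  state: string;               // current document state at restart
  interruptedStep: string;     // step that was running when EC stopped
  stepLoggedComplete: boolean; // as recorded in processing_log.json
}

type RecoveryAction = "rerun_step" | "advance" | "leave_for_user";

function recoveryAction(doc: InterruptedDocument): RecoveryAction {
  // Hard-failed documents are never auto-retried.
  if (doc.state === "hard_failed") return "leave_for_user";
  // Idempotent steps are simply re-run.
  if (STEP_KIND[doc.interruptedStep] === "idempotent") return "rerun_step";
  // Non-idempotent (or unknown) steps: check completion before retrying.
  return doc.stepLoggedComplete ? "advance" : "rerun_step";
}
```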

### 14.5 Hard-fail reason codes

After bounded retries (default 3 per step), documents that still fail enter `hard_failed` with explicit reason codes. The reason code vocabulary is fixed:

```
Hard-fail reason codes:
  conversion_timeout                    # Conversion tool did not finish in time
  ocr_failed                            # OCR could not extract usable text
  unsupported_format                    # File format not handled by any tool
  auth_unavailable                      # Source requires auth that is not configured
  storage_exhausted                     # Disk space exhausted (per E5)
  tool_unavailable                      # Required tool is not running (per E7)
  schema_validation_failed              # Output did not conform to expected schema
  source_anchor_missing                 # Required source anchor (page, line) missing
  content_too_large                     # Content exceeds processing limits
  duplicate_with_different_metadata     # Hash matches existing but metadata differs;
                                        # requires user resolution
  doc25_contract_violation              # DOC25 returned malformed result (per §17)
```

#### 14.5.1 Reason code semantics

- **conversion_timeout.** The conversion tool ran past its configured timeout. Default timeout: 5 minutes per document per tool. User can manually retry with extended timeout from the Runs tab.
- **ocr_failed.** OCR could not extract usable text. Possible causes: image is too low-resolution; the page is blank; the OCR tool is misconfigured. The document is usable for visual reading (Tier 1) but not for text-based retrieval.
- **unsupported_format.** No tool in the system handles the file format. The failure surfaces with the file extension and MIME type so the user can decide whether to install additional tooling or skip the document.
- **auth_unavailable.** Source requires authentication that is not currently available (e.g., expired SharePoint token, missing PACER credentials). The document is queued; ingestion resumes when auth is restored.
- **storage_exhausted.** Disk space dropped below the reserved floor (per §15.2). Ingestion paused; user prompted to free space or adjust retention.
- **tool_unavailable.** A required tool (Docling container, MarkItDown process, GLiNER model) is not running. Ingestion queued; resumes when the tool is healthy. This is distinct from `conversion_timeout` — `tool_unavailable` is upstream (tool isn't there), `conversion_timeout` is downstream (tool is there but didn't finish).
- **schema_validation_failed.** A tool returned output that didn't conform to its expected schema (e.g., NuExtract returning unstructured text where structured fields were expected). The document falls back to a less-strict extraction path on retry; if all paths fail schema validation, hard-failure is logged.
- **source_anchor_missing.** A memory write was attempted without a valid source span. The extraction step is held; the user reviews the failed extraction in the Runs tab.
- **content_too_large.** Content exceeds processing limits (e.g., a 10,000-page PDF). User can manually split the document or extend limits per corpus.
- **duplicate_with_different_metadata.** The hash matches an existing document but the metadata (title, date, source) differs significantly. Ingestion is paused pending user resolution: was this a re-ingest of the existing document with bad metadata, or a different document that happened to share content?
- **doc25_contract_violation.** DOC25 returned a `DOC25_IngestionResult` that did not conform to the contract in §17. This is an internal error that indicates a DOC25 implementation bug; it surfaces in error logs as well as in the Runs tab.

#### 14.5.2 No quiet cycling

A failing document does not sit in a retry loop forever. Once the bounded retries (default 3) are exhausted, it enters `hard_failed` and surfaces for user attention. There is no automatic retry after that unless the failure reason is explicitly transient (`tool_unavailable` may auto-retry when the tool becomes available; `auth_unavailable` may auto-retry when auth is restored). Other reason codes require user action.
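The retry bound and the transient-code exception can be sketched as two small functions. The function names are hypothetical, and `retriesUsed` reflects one reading of "default 3 per step": the count of retries already spent, not counting the first attempt.

```typescript
const MAX_RETRIES = 3; // spec default per step

// Reason codes the text marks as explicitly transient.
const TRANSIENT_REASONS = ["tool_unavailable", "auth_unavailable"];

type FailureDecision = "retry" | "hard_fail";

// Called each time a step fails; `retriesUsed` excludes the first attempt.
function onStepFailure(retriesUsed: number): FailureDecision {
  return retriesUsed < MAX_RETRIES ? "retry" : "hard_fail";
}

// After hard failure, only transient reasons may resume without user action,
// and only once the missing dependency (tool or auth) is back.
function mayAutoRetry(reasonCode: string, dependencyRestored: boolean): boolean {
  return dependencyRestored && TRANSIENT_REASONS.indexOf(reasonCode) !== -1;
}
```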

### 14.6 State visibility

The current state of every document is visible in three places:

- **Documents tab** — per-document current state, hard-fail reason code if present, processing log on demand.
- **Runs tab** — aggregate of states across the current run, with hard-failures grouped by reason code. One-click retry per document.
- **Corpus dashboard** — high-level "n of N documents complete; n hard-failed" aggregate.

These views are read-only projections of the orchestrator state; they do not duplicate state.

---

## §15 Tool Health, Failure Handling, Status

The operational layer that supports the pipeline: quality reporting, storage controls, resource throttle, tool health, batch policy, realistic timing, cost reporting.

### 15.1 IngestionQualityReport (E1)

Every conversion produces a quality report. DOC25 emits, and DOC73 (or any other consumer) consumes, a structured report per document version:

```
IngestionQualityReport {
  source_ref
  selected_tool                 // "docling" | "markitdown" | "ocr_fallback" | "other"
  selected_tool_reason          // Why this tool was chosen
  conversion_artifact_ref       // Pointer to the conversion artifact this report describes
  
  quality_score                 // 0.0-1.0 overall
  page_anchor_integrity         // 0.0-1.0; 1.0 = all page boundaries preserved
  table_integrity_score         // 0.0-1.0; 1.0 = tables preserved with structure
  text_yield_score              // 0.0-1.0; 1.0 = max expected text yield achieved
  ocr_confidence_summary: {
    applied                     // boolean; was OCR run?
    page_confidences: [...]     // per-page OCR confidence (0.0-1.0) if OCR was applied
  }
  
  fallback_attempts: [
    {
      tool                      // First tool attempted
      reason                    // Why it failed
      outcome                   // "fallback_to_X" | "abandoned"
    }
  ]
  
  degraded_flags: [
    "low_text_yield" |
    "tables_garbled" |
    "missing_page_anchors" |
    "ocr_low_confidence" |
    "partial_conversion" |
    // ...
  ]
  
  user_visible_status           // "ok" | "degraded" | "failed"
}
```

Reports surface in:

- **Individual document UI** (Documents tab in DOC7).
- **Corpus dashboard** (Settings tab): aggregate flags.
- **DOC7 Runs tab** — per-run aggregate quality.

This is the single mechanism that turns silent degradation into visible degradation. Without it, conversion quality issues hide; with it, the user sees which documents have which issues and can prioritize fixes.

### 15.2 Storage exhaustion controls (E5)

Large corpus ingestion can fill the disk with originals (when arriving as bytes), conversions, OCR intermediates, page images, embeddings, and logs. V2.0 specifies controls to prevent storage exhaustion from corrupting the canonical store.

**Required controls:**

- **Preflight free-space estimate.** Before starting batch ingestion, estimate disk requirement (originals + conversions + chunks + embeddings + intermediate artifacts). Compare to available space. If estimated requirement exceeds available, warn the user before starting.
- **Reserved disk floor.** Configurable reserve, default 5 GB. Ingestion pauses if free space drops below the floor.
- **Low-space circuit breaker.** Halts new ingestion before durable writes can corrupt SQLite or partial-write a conversion. The circuit breaker triggers at the reserved disk floor; ingestion remains halted until free space rises 1 GB above the floor (hysteresis to prevent oscillation).
- **Cleanup of temporary artifacts before retry.** OCR intermediates, conversion scratch space, and partial uploads are cleaned proactively rather than waiting for tool exit.
- **Visible Q alert.** "Free space needed: X GB. Derived artifacts safe to purge: Y GB." With one-click cleanup option (purges previous_conversions older than retention window; clears OCR cache for documents not in active corpora; etc.).

Storage exhaustion never partially writes memories or corrupts SQLite. The orchestrator pauses queues before durable writes if the reserve threshold is breached.
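The floor-plus-hysteresis behaviour of the circuit breaker can be sketched as a tiny state holder. `LowSpaceBreaker` is a hypothetical name; the 5 GB floor and 1 GB resume margin are the defaults stated above, and treating "1 GB above the floor" as inclusive is an assumption.

```typescript
const GB = 1024 * 1024 * 1024;

class LowSpaceBreaker {
  private halted = false;

  constructor(
    private readonly floorBytes: number = 5 * GB,
    private readonly resumeMarginBytes: number = 1 * GB
  ) {}

  // Feed the latest free-space reading; returns true when ingestion may proceed.
  update(freeBytes: number): boolean {
    if (this.halted) {
      // Stay halted until free space clears the floor by the margin (hysteresis).
      if (freeBytes >= this.floorBytes + this.resumeMarginBytes) this.halted = false;
    } else if (freeBytes < this.floorBytes) {
      this.halted = true;
    }
    return !this.halted;
  }
}
```

With the defaults, a reading of 5.5 GB after a trip keeps ingestion halted; only at 6 GB or more does it resume, so small fluctuations around the floor do not toggle the queue.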

### 15.3 Resource throttle for thermal and memory (E6)

M4 Pro thermal and memory constraints are real during sustained workloads. Running Docling, GLiNER, NuExtract, and local vector indexing simultaneously during a 200-document batch can lock up the machine. V2.0 specifies a global resource throttle:

- **Memory pressure throttle.** Pause new ingestion tasks if memory pressure is over 85%; resume below 75% (hysteresis prevents flapping).
- **CPU temperature throttle.** Pause if sustained CPU temperature exceeds threshold (configurable; default 95°C sustained for 30 seconds).
- **Concurrent-task back-off.** Reduce per-step pool sizes dynamically when the system is under load. The orchestrator sees observed throughput drop and shrinks the pool to match.

User-visible status when throttling is active: "Ingestion paused — system under load. Resuming when memory drops below 75%."

This prevents the failure mode where a 200-document batch ingestion locks the user out of their own machine during the run.
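The "sustained for 30 seconds" condition on the CPU temperature throttle can be sketched as a check over recent samples. The sample shape and function name are hypothetical, samples are assumed to arrive sorted by time, and how temperature is actually read from the hardware is out of scope.

```typescript
interface TempSample {
  tSec: number;    // sample time in seconds
  celsius: number; // CPU temperature reading
}

// True when the most recent unbroken run of over-threshold samples
// spans at least `sustainSec` seconds.
function sustainedOverTemp(
  samples: TempSample[],
  thresholdC: number = 95,
  sustainSec: number = 30
): boolean {
  let runStart: number | null = null;
  let lastT: number | null = null;
  for (const s of samples) {
    if (s.celsius > thresholdC) {
      if (runStart === null) runStart = s.tSec;
    } else {
      runStart = null; // any dip below the threshold resets the run
    }
    lastT = s.tSec;
  }
  if (runStart === null || lastT === null) return false;
  return lastT - runStart >= sustainSec;
}
```

Requiring an unbroken run means a brief spike does not pause ingestion, which plays the same anti-flapping role as the 85%/75% band on memory pressure.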

### 15.4 Tool health checks and capacity leases (E7)

Tools die mid-run. A Docling container crashes, the MarkItDown process exits, the Firecrawl service hangs. V2.0 specifies tool health and lease semantics:

- **Pre-dispatch health check.** Before assigning a job to a tool (Docling container, MarkItDown process, Firecrawl service, GLiNER worker), verify it is responsive. The health check is fast (under 100ms). A tool that fails it is marked unhealthy, and jobs are queued rather than dispatched until it recovers.
- **Tool capacity lease.** During execution, the orchestrator holds a lease on tool capacity. The lease times out if the tool stops responding (configurable per tool; default 10 minutes for conversion tools, 5 minutes for OCR, 2 minutes for entity tagging).
- **Lease invalidation on tool death.** Active jobs whose tool dies move to `retryable_tool_failure` state, not `document_failed`. This preserves the option to retry once the tool recovers, distinct from genuine document failure.
- **Failed-lease recovery.** After tool recovery, queued retryable jobs resume.

This distinguishes transient tool failure (recoverable) from document-level failure (hard fail). A Docling container crash should not fail every queued document if Docling can be restarted.
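Lease expiry can be sketched with the default timeouts listed above. The heartbeat field is an assumption about how "stops responding" would be detected, and `running` is a hypothetical label for a job whose lease is still live; `retryable_tool_failure` is the state named in the text.

```typescript
type ToolKind = "conversion" | "ocr" | "entity_tagging";

// Default lease timeouts from the text, in milliseconds.
const LEASE_TIMEOUT_MS: Record<ToolKind, number> = {
  conversion: 10 * 60 * 1000,
  ocr: 5 * 60 * 1000,
  entity_tagging: 2 * 60 * 1000,
};

interface Lease {
  jobId: string;
  tool: ToolKind;
  lastHeartbeatMs: number; // assumption: the tool reports liveness periodically
}

function leaseExpired(lease: Lease, nowMs: number): boolean {
  return nowMs - lease.lastHeartbeatMs > LEASE_TIMEOUT_MS[lease.tool];
}

// A job whose lease expired becomes retryable; it is not a document failure.
function jobStateAfterCheck(
  lease: Lease,
  nowMs: number
): "running" | "retryable_tool_failure" {
  return leaseExpired(lease, nowMs) ? "retryable_tool_failure" : "running";
}
```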

### 15.5 5000-doc batch commit policy (E18)

At small scale, per-document commits provide responsiveness — the user sees each document complete in real time. At scale (thousands of documents in one batch), per-document commits create SQLite write-lock contention with FTS5 index updates and vector inserts, slowing the entire batch.

V2.0 specifies batch-commit queues for DB writes:

| Batch size | Commit policy |
|---|---|
| Under 100 docs | Per-document commits (small enough that responsiveness wins) |
| 100-1000 docs | Batched commits in groups of N (default 50) or every T seconds (default 5s), whichever fires first |
| Over 1000 docs | Larger batches (default 100) or longer intervals (default 10s) |

All values are configurable per-corpus or globally.

User-visible status during batch ingestion: "Ingesting 5000 documents. Committing in batches of 100 every 10 seconds."
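The table maps directly onto a selection function. `CommitPolicy` and `commitPolicyFor` are hypothetical names; the thresholds and defaults are the ones in the table, and per-corpus overrides are left out of the sketch.

```typescript
interface CommitPolicy {
  mode: "per_document" | "batched";
  batchSize: number;             // documents per commit
  maxIntervalSec: number | null; // flush interval; null for per-document commits
}

// Picks the default commit policy from the size of the batch being ingested.
function commitPolicyFor(docCount: number): CommitPolicy {
  if (docCount < 100) {
    return { mode: "per_document", batchSize: 1, maxIntervalSec: null };
  }
  if (docCount <= 1000) {
    return { mode: "batched", batchSize: 50, maxIntervalSec: 5 };
  }
  return { mode: "batched", batchSize: 100, maxIntervalSec: 10 };
}
```

For a 5000-document run this yields batches of 100 with a 10-second interval, which matches the status line shown to the user.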

### 15.6 200-doc realistic timing (E19)

Earlier specs implied 15-20 minutes for 200-document ingestion. That estimate did not account for thermal throttling, mixed Docling/MarkItDown work, or OCR on scanned material. V2.0 corrects to realistic timing on M4 Pro hardware:

**Realistic timing for a 200-document mixed corpus on M4 Pro:**

- **Time to source registered:** seconds to minutes (depends on whether originals arrive as bytes or are at user-controlled paths).
- **Time to converted/indexable:** 30-45 minutes. Accounts for thermal throttling, mixed Docling/MarkItDown work, OCR on scanned material.
- **Time to shallow usable:** approximately 1 hour after registration. Basic retrieval works; chunks are indexed; summaries available.
- **Time to deep extraction complete:** several hours. Depends on hard target count, judgment-bearing target proportion, document complexity. Corpus-specific extraction (DOC73 territory) is the long tail.
- **Time to counterfactual review complete:** depends on review queue workflow; not bounded by ingestion. The user may take days or weeks to triage the counterfactual queue at their own pace.

**Phased usability checkpoints** surface in the UI: the user can use the corpus progressively as each phase completes. Taking 30-45 minutes to reach converted/indexable is acceptable when usability checkpoints make the corpus progressively useful.

Dashboard shows current phase: "Conversion 87% complete (172/200 documents). Shallow retrieval available now. Deep extraction in progress."

### 15.7 Cost reporting — model cost vs wall-clock

Cost has two dimensions and V2.0 reports them separately:

- **Model cost band.** LLM API costs only (token usage × model pricing). Optimistic base case for text-native documents.
- **Wall-clock time band.** End-to-end elapsed time on the user's hardware with local orchestration, OCR bottlenecks, thermal throttling, API rate limits.

Both bands are presented in:

- Onboarding UI (during corpus tool-requirement plan compilation).
- Run preview UI (Runs tab).
- Cost circuit breaker confirmation (corpus-level cost limit).

User sees: "Estimated $70-100 LLM cost; estimated 30-45 minutes wall-clock time on this hardware."

This prevents the optimistic-API-cost-only number from creating false expectations of total ingestion time.

### 15.8 Failure surfacing principle

Failure surfacing is prominent in corpus UI. The default behavior is to continue processing other documents when one fails — one bad PDF should not halt a 200-document batch. Failed documents surface with one-click retry affordances and explicit reason codes. Resumption after crash is automatic via the durable queue.

---

## §16 Runtime Retrieval Tools and Re-Read Posture

This section specifies the runtime tool surface available to LLMs and specialist sub-agents for retrieving content during a turn. It pulls together the document retrieval tools defined in §8, adds the memory query tool family, and defines the retrieval posture directive that governs when to rely on memory versus re-read source.

### 16.1 Tool families overview

Two tool families share the same overall shape (LLM-callable, scope-defaulting, result-limited):

- **Document retrieval tools** (specified in §8): `retrieve_document_pages`, `retrieve_full_document`, `retrieve_memory_source`. Return authoritative raw content from documents.
- **Memory query tools** (specified in this section): `retrieve_memory_by_id`, `retrieve_related_memories`, `retrieve_memory_cluster`, `search_memories`, `retrieve_memories_by_entity`, `verify_memory`. Return compact extracted content from the memory graph.

The split is deliberate: memory is not source text, source text is not memory. Both families share output schema patterns so the primary chat agent can consume their findings consistently when delegating to specialist sub-agents.

### 16.2 Memory query tool family

#### 16.2.1 retrieve_memory_by_id

Takes a `memory_id`, returns the full memory with metadata.

```typescript
const retrieveMemoryById: ActionRegistryEntry = {
  action_id: "memory.retrieve_by_id",
  domain: "memory_processing",
  display_name: "Retrieve Memory by ID",
  user_goal: "Get the full content and metadata for a specific memory",
  description: "Returns a memory with its provenance, confidence, and source references.",
  stability_class: "stable_action",
  agent_invocable: true,
  invocation_bindings: [
    { transport: "native_openclaw_tool", binding_name: "retrieve_memory_by_id",
      readiness: "ready",
      client_exposure: { elnor_native: true, mcp_external: false, q_ui: false },
      schema_version: 1 },
  ],
  confirmation_policy: "none",
  safety_class: "read",
  schema_version: 1,
};

interface RetrieveMemoryByIdRequest {
  memory_id: string;
}

interface RetrieveMemoryByIdResponse {
  memory_id: string;
  found: boolean;
  memory: {
    content: string;
    type: string;                  // memory_directive subtype
    confidence: number;            // 0.0-1.0
    source_ref: string;            // pointer to source document or conversation
    provenance: object;            // full provenance record
    created_at: string;
    last_verified_at: string;
  } | null;
}
```

#### 16.2.2 retrieve_related_memories

Follows graph edges from a known memory.

```typescript
interface RetrieveRelatedMemoriesRequest {
  memory_id: string;
  edge_types?: string[];           // e.g., ["supports", "contradicts"]; default: all
  max_depth?: number;              // graph traversal depth; default 1
  limit?: number;                  // default 20
  scope?: string;                  // corpus_id; default: current corpus
}

interface RetrieveRelatedMemoriesResponse {
  source_memory_id: string;
  related: Array<{
    memory_id: string;
    edge_type: string;
    edge_confidence: number;
    memory_content: string;
    memory_confidence: number;
  }>;
  truncated: boolean;
}
```

#### 16.2.3 retrieve_memory_cluster

All memories from a source document or section.

```typescript
interface RetrieveMemoryClusterRequest {
  source_ref: string;              // document_id, document_id:page_range, or conversation_id
  limit?: number;                  // default 50
}

interface RetrieveMemoryClusterResponse {
  source_ref: string;
  memories: Array<{
    memory_id: string;
    content: string;
    type: string;
    confidence: number;
    source_span: { page: number | null; offset_start: number; offset_end: number };
  }>;
  total_count: number;
  truncated: boolean;
}
```

#### 16.2.4 search_memories

Free-form query against the memory graph.

```typescript
interface SearchMemoriesRequest {
  query: string;
  scope?: string;                  // corpus_id; default: current corpus
  type_filter?: string[];          // memory types to include
  limit?: number;                  // default 20
  include_low_confidence?: boolean; // default false (excludes confidence < 0.5)
}

interface SearchMemoriesResponse {
  query: string;
  results: Array<{
    memory_id: string;
    content: string;
    type: string;
    confidence: number;
    relevance_score: number;       // search relevance (separate from confidence)
    source_ref: string;
  }>;
  truncated: boolean;
}
```
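
The request options interact in a fixed order: confidence cutoff, type filter, ranking, then limit. A minimal sketch of that post-processing follows. `selectHits` is a hypothetical helper, and ranking by `relevance_score` (highest first) is an assumption; the interface only says relevance is separate from confidence.

```typescript
interface MemoryHit {
  memory_id: string;
  type: string;
  confidence: number;      // 0.0-1.0
  relevance_score: number; // search relevance, independent of confidence
}

interface SearchOptions {
  type_filter?: string[];
  limit?: number;
  include_low_confidence?: boolean;
}

// Cutoff taken from the include_low_confidence comment in the request interface.
const LOW_CONFIDENCE_CUTOFF = 0.5;

function selectHits(
  hits: MemoryHit[],
  opts: SearchOptions = {},
): { results: MemoryHit[]; truncated: boolean } {
  const limit = opts.limit ?? 20;
  const typeFilter = opts.type_filter;
  const eligible = hits
    .filter((h) => opts.include_low_confidence === true || h.confidence >= LOW_CONFIDENCE_CUTOFF)
    .filter((h) => typeFilter === undefined || typeFilter.includes(h.type))
    // filter() returned a fresh array, so sorting does not mutate the caller's input.
    .sort((a, b) => b.relevance_score - a.relevance_score);
  return { results: eligible.slice(0, limit), truncated: eligible.length > limit };
}
```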

#### 16.2.5 retrieve_memories_by_entity

All memories referencing a given entity.

```typescript
interface RetrieveMemoriesByEntityRequest {
  entity_id: string;
  edge_types?: string[];           // limit by relationship type to entity
  scope?: string;
  limit?: number;                  // default 20
}

interface RetrieveMemoriesByEntityResponse {
  entity_id: string;
  entity_name: string;
  memories: Array<{
    memory_id: string;
    content: string;
    type: string;
    confidence: number;
    relationship_to_entity: string;
  }>;
  truncated: boolean;
}
```

#### 16.2.6 verify_memory

Grounding verification against source. Re-reads the source span the memory was extracted from and reports whether the memory's content is supported.

```typescript
interface VerifyMemoryRequest {
  memory_id: string;
  context_pages?: number;          // pages of context around source span; default 1
}

interface VerifyMemoryResponse {
  memory_id: string;
  source_resolvable: boolean;
  verification_result:
    | "supported"                  // memory content matches source
    | "partially_supported"        // some claims supported, some not
    | "contradicted"               // source contradicts memory
    | "source_changed"             // source has been modified since extraction
    | "source_unavailable"         // source no longer accessible
    | null;                        // verification could not run
  notes: string;                   // explanation when result is non-trivial
  source_excerpt: string | null;   // the relevant source passage
}
```

### 16.3 Common conventions

All tools in both families share these conventions:

- **Scope defaults.** Default scope is the current corpus when one is engaged; broader scope requires explicit parameter. This prevents runaway retrieval that would query the entire memory graph when the user is working in a single corpus.
- **Result limits.** All tools default to `limit: 20` (`limit: 50` for cluster retrieval); callers can override the limit explicitly. Results beyond the limit are not returned; the response sets `truncated: true` so the caller knows more exist.
- **Result schema consistency.** Both tool families return content with `confidence` (where applicable), `source_ref`, and a `truncated` flag. This lets specialist sub-agents consume both kinds of results uniformly.
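
The scope and limit conventions reduce to two small rules. The helpers below are a sketch with hypothetical names (`resolveScope`, `applyLimit`); what a tool does when no corpus is engaged and no scope is passed is not specified here, so the sketch simply returns `null` in that case.

```typescript
// Explicit scope wins; otherwise fall back to the engaged corpus (null if none).
function resolveScope(explicit: string | undefined, currentCorpus: string | null): string | null {
  return explicit ?? currentCorpus;
}

// Cap a result list and report whether anything was left out.
function applyLimit<T>(items: T[], limit: number = 20): { results: T[]; truncated: boolean } {
  return { results: items.slice(0, limit), truncated: items.length > limit };
}
```

A cluster tool would call `applyLimit(memories, request.limit ?? 50)`; the other tools rely on the default of 20.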

### 16.4 Retrieval posture directive

When documents and memories are in scope, the system prompt includes a short retrieval posture directive:

> Memories are compressed representations of documents, conversations, or other sources. They are typically sufficient when they fully answer the question and the task isn't about verification. When a task requires additional context from the source, verification of source content, interpretive reasoning, or specific language from the document, the source should be re-read. Memories may be incomplete reflections of their sources and should be treated with reasonable skepticism. Use judgment based on the task at hand.

The directive is principle-based rather than rule-based. Long rule-based directives perform poorly in practice — they accumulate exceptions, get lost in prompt density, and don't generalize across task types. The principle above is short, generalizable, and trusts the LLM's judgment for the specific case.

The directive is injected only when documents or memories are actually in scope (avoids token waste in pure-chat contexts). It is consolidated with other retrieval-related directives so multiple short principles arrive together rather than scattered through the prompt. Format conformance follows the marker scheme in §18.

### 16.5 UX visibility for re-reads

When the LLM invokes document or memory retrieval during a response, the user sees it surfaced inline in the chat UI:

> Elnor re-read pages 47-52 of the Johnson deposition for this answer.

Or for memory retrieval:

> Elnor consulted the memory graph (15 results found, 3 high-confidence).

This builds user trust that the answer is grounded. It also lets the user notice when re-reading didn't happen but perhaps should have ("Elnor said X confidently but never actually re-read the source — should I verify?").

The inline indication is concise — a single line, expandable for detail — not a separate pane. The expandable provenance shows specific source refs (corpus_id, document_id, page_range), tool receipts (which tools were called), and confidence/degradation flags from the relevant `IngestionQualityReport` for documents or memory confidence for memories.

### 16.6 Specialist sub-agent consumption pattern

Both tool families are consumed primarily by specialist sub-agents (per `ELNOR_SUBAGENT_PRECOMPUTE_TOOL_OPTIMIZATION_NOTES_V4.md` §§1.7-1.11):

- **MemoryAgent** owns memory query tools. Its tools: `retrieve_memory_by_id`, `retrieve_related_memories`, `retrieve_memory_cluster`, `search_memories`, `retrieve_memories_by_entity`, `verify_memory`. MemoryAgent reasons about graph confidence, provenance, corpus scope, ConsolidatedUnderstandings, conflicts, and memory clusters.
- **DocumentIntelligenceAgent** owns document retrieval tools. Its tools: `retrieve_document_pages`, `retrieve_full_document`, `retrieve_memory_source`, plus DOC18 retrieval (`search_documents`). DocumentIntelligenceAgent owns pages, chunks, conversion artifacts, OCR, section maps, exact quotes.

The primary chat agent may call these tools directly for simple cases. For retrieval-heavy tasks, the primary spawns the appropriate specialist with an isolated context and a structured task description. The specialist returns a structured finding; the primary composes the user-facing response from it.

DOC24 routing makes MemoryAgent and DocumentIntelligenceAgent available as spawnable sub-agents when the corresponding scope is active (memory scope active → MemoryAgent available; document scope active → DocumentIntelligenceAgent available).
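
The ownership split and the scope-gated availability can be pictured as a lookup table plus a filter. This is a sketch only: `TOOL_OWNER`, `availableSubAgents`, and `specialistFor` are hypothetical names, and actual routing is owned by DOC24.

```typescript
type SubAgent = "MemoryAgent" | "DocumentIntelligenceAgent";

// Tool ownership as listed in this section.
const TOOL_OWNER: Record<string, SubAgent> = {
  retrieve_memory_by_id: "MemoryAgent",
  retrieve_related_memories: "MemoryAgent",
  retrieve_memory_cluster: "MemoryAgent",
  search_memories: "MemoryAgent",
  retrieve_memories_by_entity: "MemoryAgent",
  verify_memory: "MemoryAgent",
  retrieve_document_pages: "DocumentIntelligenceAgent",
  retrieve_full_document: "DocumentIntelligenceAgent",
  retrieve_memory_source: "DocumentIntelligenceAgent",
  search_documents: "DocumentIntelligenceAgent",
};

interface ActiveScope {
  memory: boolean;
  document: boolean;
}

// A specialist is spawnable only while its scope is active.
function availableSubAgents(scope: ActiveScope): SubAgent[] {
  const agents: SubAgent[] = [];
  if (scope.memory) agents.push("MemoryAgent");
  if (scope.document) agents.push("DocumentIntelligenceAgent");
  return agents;
}

// Specialist to spawn for a tool, or null when the tool is unknown or its scope is inactive.
function specialistFor(tool: string, scope: ActiveScope): SubAgent | null {
  const owner = TOOL_OWNER[tool];
  if (owner === undefined) return null;
  return availableSubAgents(scope).includes(owner) ? owner : null;
}
```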

A single RetrievalAgent would be simpler initially but would blur the boundary that V2.0 protects: memory is not source text, and source text is not memory. The split is justified despite the added orchestration complexity.

---

## §17 DOC25_IngestionResult Consumer Contract

This section is the authoritative `DOC25_IngestionResult` contract. DOC73 V1.4.1 §15.2.1 inlines the same schema for its freeze; DOC73 V1.5 will swap to a normative reference into this section. Until the swap, §17 here and DOC73 V1.4.1 §15.2.1 must remain field-for-field identical; updates land here first and propagate to DOC73 V1.5.

### 17.1 Why this contract exists

V2.0 corpus extraction (DOC73 V1.4) and other downstream consumers depend on DOC25 returning consistent, well-shaped ingestion results per document. Without a minimum consumer contract, coding agents implementing DOC73 will guess what DOC25 returns; surfaces will fail in inconsistent ways; debugging becomes guesswork. The contract specifies the minimum DOC25 must honor.

### 17.2 DOC25_IngestionResult schema

```
DOC25_IngestionResult {
  // Original artifact (immutable)
  original_artifact_ref         // Pointer to original file/source
  
  // Normalized content
  normalized_content_ref        // Pointer to converted markdown/text
  
  // Conversion artifact (versioned)
  conversion_artifact_ref       // Specific conversion version (per E8)
  
  // Document identity
  document_type                 // Classified at ingestion
  metadata: {
    title
    date
    author
    source                      // EDGAR, Westlaw, manual upload, etc.
    // ... per profile metadata expectations
  }
  
  // Chunking
  chunk_manifest: [
    {
      chunk_id
      span                      // Character range or page range
      embedding_id              // Pointer to vector store entry
    }
  ]
  
  // Page anchor map (for documents with page structure)
  page_anchor_map: {
    page_number → character_offset
    // Reverse map also maintained
  }
  
  // Status
  ingestion_status              // "complete" | "degraded_complete" | "hard_failed"
  
  // Quality report (per E1)
  quality_report: IngestionQualityReport
  
  // Failure receipts (if any)
  failure_receipts: [
    {
      stage                     // Which state machine stage failed (per E3)
      reason_code               // Per E4
      details
      timestamp
    }
  ]
  
  // Content hashes (per E2)
  content_hashes: {
    raw_file_hash
    normalized_binary_hash
    normalized_text_hash
    page_hashes: [...]
    chunk_hashes: [...]
    source_instance_id
  }
}
```

DOC73 corpus extraction reads this structure as input. If DOC25 does not return this shape, DOC73 reports a contract violation; ingestion fails for that document with reason code `doc25_contract_violation` (per §14.5).
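
A consumer-side shape check might look like the sketch below. It is not the DOC73 implementation: `checkIngestionResult` is a hypothetical helper, it checks only top-level presence and the status enum, and it assumes every top-level field except `page_anchor_map` is always present (the schema scopes that field to documents with page structure).

```typescript
// Top-level fields from the schema above, minus page_anchor_map (assumed optional).
const REQUIRED_FIELDS = [
  "original_artifact_ref",
  "normalized_content_ref",
  "conversion_artifact_ref",
  "document_type",
  "metadata",
  "chunk_manifest",
  "ingestion_status",
  "quality_report",
  "failure_receipts",
  "content_hashes",
];

const INGESTION_STATUSES = ["complete", "degraded_complete", "hard_failed"];

type ContractCheck =
  | { ok: true }
  | { ok: false; reason_code: "doc25_contract_violation"; problems: string[] };

// Shape-only check. Unknown extra fields are ignored on purpose.
function checkIngestionResult(result: Record<string, unknown>): ContractCheck {
  const problems = REQUIRED_FIELDS
    .filter((field) => !(field in result))
    .map((field) => `missing field: ${field}`);

  const status = result["ingestion_status"];
  if (status !== undefined && (typeof status !== "string" || !INGESTION_STATUSES.includes(status))) {
    problems.push(`invalid ingestion_status: ${String(status)}`);
  }
  return problems.length === 0
    ? { ok: true }
    : { ok: false, reason_code: "doc25_contract_violation", problems };
}
```

Ignoring unknown fields keeps the check compatible with additive schema changes.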

### 17.3 Field semantics

- **original_artifact_ref.** A pointer to the original file or byte source. For user-controlled paths, this is the path. For byte-arriving sources, this is the document store location at `originals/{content_hash}.{extension}`. The reference is stable across the document's lifetime; if a user-controlled path moves, the reference is updated through the path-tracking mechanism (§13.7) without breaking the contract.
- **normalized_content_ref.** Pointer to the converted markdown or text — the active conversion version's markdown file. Consumers should read this rather than the original for text-based work; visual work (Tier 1 LLM context) uses the original.
- **conversion_artifact_ref.** Pointer to the specific conversion version (per E8). When reconversion happens, the result's reference advances to the new version, but a consumer reading through a `conversion_artifact_ref` it already holds always sees the version it was bound to and is never silently upgraded. Re-extraction uses the latest active version unless explicitly bound to an older version.
- **document_type.** The classification result from §2. Profile-specific document types extend the base enumeration.
- **metadata.** Profile-driven metadata fields. Common across profiles: title, date, author, source. Profile extensions add fields (e.g., legal_case_corpus adds case_number, jurisdiction, court).
- **chunk_manifest.** List of chunks with their span (character range or page range), and `embedding_id` pointing to the vector store entry in DOC18. Consumers retrieve chunks by `chunk_id`; embeddings stay in DOC18.
- **page_anchor_map.** For documents with page structure, the bidirectional map between page_number and character_offset. Consumers use this to convert between page-based references (user-friendly) and offset-based references (precise).
- **ingestion_status.** One of `complete`, `degraded_complete`, `hard_failed` — corresponding to the terminal states of the §14 state machine.
- **quality_report.** The `IngestionQualityReport` (§15.1) for the conversion artifact this result describes. Consumers can determine "is this document reliable for my purpose?" by reading the quality report's flags and degradation indicators.
- **failure_receipts.** Populated when ingestion took shortcuts due to step failures (e.g., conversion succeeded but classification failed, so the document is `degraded_complete`). Each receipt names the stage, reason code, and details. Consumers can decide whether the failures matter for their use case.
- **content_hashes.** All six hash types from E2 (§12.3). Consumers needing dedup, partial-document matching, or chunk-level dedup use the appropriate hash for their question.
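
As an illustration of the `page_anchor_map` semantics, the sketch below converts in both directions. The concrete representation is an assumption: the schema shows only `page_number → character_offset`, so the sorted `PageAnchor[]` array and both helper names are hypothetical.

```typescript
// Assumed representation: one entry per page, sorted by character_offset.
interface PageAnchor {
  page_number: number;
  character_offset: number; // offset in the normalized text where the page starts
}

// Page -> offset: direct lookup.
function offsetForPage(anchors: PageAnchor[], page: number): number | null {
  const anchor = anchors.find((a) => a.page_number === page);
  return anchor ? anchor.character_offset : null;
}

// Offset -> page: binary search for the last page starting at or before `offset`.
function pageForOffset(anchors: PageAnchor[], offset: number): number | null {
  let lo = 0;
  let hi = anchors.length - 1;
  let page: number | null = null;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (anchors[mid].character_offset <= offset) {
      page = anchors[mid].page_number;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return page;
}
```

With anchors at offsets 0, 1200, and 2450 for pages 1 to 3, offset 1234 resolves to page 2 and page 3 resolves to offset 2450.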

### 17.4 CU authority dependency

The implementation of consumers like DOC73 corpus extraction depends on the ConsolidatedUnderstanding (CU) authority aggregation algorithm landing in DOC72/DOC73. DOC1 explicitly disclaims authority-resolution scope (DOC1 §4.12); the actual CU authority work is specified in DOC72/DOC73 §3.2A.

For DOC25's purposes, this means: DOC25 produces `DOC25_IngestionResult` results; DOC73 consumes them and applies CU authority resolution per §3.2A; DOC25 does not need to know about CU authority semantics. The contract here is content-shape only.

The DOC25 → DOC73 dependency chain is:

```
§3.2A CU authority aggregation must land
   ↓
§14 corpus extraction can produce CUs with computed authority
   ↓
§15 ingestion writes CUs with valid authority semantics
   ↓
Cross-corpus CU resolution works (per §3.2A.3)
```

DOC25's responsibility ends at producing well-formed `DOC25_IngestionResult` per this contract. CU evaluation downstream is not DOC25's concern.

### 17.5 Versioning and breaking changes

The contract is versioned independently of DOC25 itself. Adding fields is non-breaking; consumers can ignore unknown fields. Removing or changing field semantics is breaking and triggers a contract version bump. Breaking changes are coordinated with all known consumers (DOC73 the primary; surfaces consuming DOC25 IngestionResults) before landing.

A field-for-field identical copy of the schema lives in DOC73 V1.4.1 §15.2.1 as a freeze artifact. DOC73 V1.5 will swap that inline schema to a normative reference into this section (§17.2 above). Until V1.5 ships, the two locations must remain identical; any update lands here first and propagates to DOC73 in V1.5.

---

## §18 Marker Scheme for Injected Content

This section specifies the marker scheme used to wrap content injected into LLM prompts. The scheme is what allows specialist sub-agents and the primary chat agent to programmatically reference injected content (e.g., "the third extracted memory above") and to verify the source attribution of any claim.

### 18.1 Why markers exist

LLM prompts are dense. Without consistent markers, content injected into the prompt blurs into the surrounding text and the LLM cannot reliably distinguish: (a) what came from a memory versus what came from a document, (b) which corpus a piece of content is scoped to, (c) which source the content was extracted from, (d) what is a tool schema versus an extracted fact. The marker scheme makes these distinctions explicit and machine-readable.

Markers also support the verification workflow. When the LLM cites an extracted memory in its response, the user can click through to the source by following the `memory_id` and `source_ref` carried in the marker. When MemoryAgent finds a related memory, it can return the `memory_id` to the primary, which can then re-inject that specific memory by id rather than re-running the search.

### 18.2 Marker types

V2.0 specifies the following marker types:

- **`extracted_memory`** — a memory injected from the entity graph (typically by MemoryAgent or by DOC24 turn-start memory injection).
- **`corpus_context`** — narrative context about the active corpus (corpus name, scope, working theory).
- **`tool_schema`** — a tool definition injected for the LLM's tool roster.
- **`skill`** — a skill description injected from the DOC3 skill registry.
- **`document_excerpt`** — a passage retrieved from a document (typically by DocumentIntelligenceAgent or by `retrieve_document_pages`).
- **`retrieval_posture_directive`** — the §16.4 directive when documents or memories are in scope.

This list extends as new content categories are introduced. The marker format is consistent across types; the tag name identifies the category.

### 18.3 Format specification

Markers are XML-like tags with attributes. The opening tag carries identification and source attribution; the body is the content; the closing tag matches the opening type.

```
<{marker_type} {attributes}>
  {content}
</{marker_type}>
```

#### 18.3.1 extracted_memory marker

```
<extracted_memory id="mem-a3f7c2"
                  source_type="document"
                  source_ref="johnson_depo_2026_03_15:p47:offset_1234-1567"
                  corpus="marex"
                  extracted_at="2026-04-15"
                  confidence="0.78">
  Johnson stated that management reviewed DePIN revenue figures weekly.
</extracted_memory>
```

Required attributes: `id`, `source_type`, `source_ref`, `extracted_at`. Optional: `corpus`, `confidence`, `type`.
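
A marker emitter for this type could be as small as the sketch below. `buildExtractedMemory` and `escapeAttr` are hypothetical names, and the XML-style escaping of attribute values is an assumption: the spec does not say how quotes inside values are handled, nor whether the body is escaped (the sketch leaves the body untouched).

```typescript
interface ExtractedMemoryAttrs {
  id: string;
  source_type: string;
  source_ref: string;
  extracted_at: string;
  corpus?: string;
  confidence?: number;
  type?: string;
}

// Assumption: XML-style escaping for attribute values.
function escapeAttr(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

function buildExtractedMemory(attrs: ExtractedMemoryAttrs, content: string): string {
  // Attribute order mirrors the example above; unset optional attributes are skipped.
  const pairs: Array<[string, string | undefined]> = [
    ["id", attrs.id],
    ["source_type", attrs.source_type],
    ["source_ref", attrs.source_ref],
    ["corpus", attrs.corpus],
    ["extracted_at", attrs.extracted_at],
    ["confidence", attrs.confidence === undefined ? undefined : String(attrs.confidence)],
    ["type", attrs.type],
  ];
  const rendered = pairs
    .filter((pair): pair is [string, string] => pair[1] !== undefined)
    .map(([name, value]) => `${name}="${escapeAttr(value)}"`)
    .join(" ");
  return `<extracted_memory ${rendered}>\n  ${content}\n</extracted_memory>`;
}
```

The sketch emits the opening tag on a single line; the multi-line layout in the example is presentational.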

#### 18.3.2 corpus_context marker

```
<corpus_context corpus="marex">
  This corpus supports work on Marex v. Bluefin securities litigation.
  Working theory: scienter via deliberate revenue mis-recognition tied to
  DePIN token unlocking. Primary defendants: Bluefin executives plus
  external auditor relationship.
</corpus_context>
```

Required attributes: `corpus`. Body is the corpus narrative text.

#### 18.3.3 document_excerpt marker

```
<document_excerpt id="doc-9f3a"
                  document_id="doc-johnson-depo"
                  pages="47-49"
                  extraction_method="markitdown"
                  retrieved_via="retrieve_document_pages">
  Q. Mr. Johnson, what was your role in the revenue recognition decision?
  A. I was the CFO. I approved the final revenue figures each quarter
     based on the recognition policy memos prepared by the controller.
  ...
</document_excerpt>
```

Required attributes: `document_id`, `pages` (or `span`), `extraction_method`. Optional: `id` (for back-reference), `retrieved_via` (which tool fetched this excerpt).

#### 18.3.4 tool_schema marker

```
<tool_schema name="retrieve_document_pages" version="1">
  {tool definition JSON}
</tool_schema>
```

Required attributes: `name`, `version`. Body is the tool schema in the format the LLM's tool calling system consumes.

#### 18.3.5 conversation marker (for memories from chat)

```
<extracted_memory id="mem-b8e4d1"
                  source_type="conversation"
                  source_ref="chat:2026-04-10:msg-234"
                  extracted_at="2026-04-11"
                  confidence="0.82">
  Will indicated that Glazer's independence will be a primary theme.
</extracted_memory>
```

Same structure as document-source memories; `source_type` distinguishes.

### 18.4 Identification and reference

Markers that can be cited carry an `id` attribute allowing programmatic reference: it is required on `extracted_memory` and optional on `document_excerpt` (§18.3). The `id` is stable across injections, so the same memory or excerpt always carries the same id; an LLM can refer to "extracted_memory id mem-a3f7c2" in its reasoning and downstream consumers can resolve the reference back to the underlying record.

For specialist sub-agents, this enables the pattern:

1. Primary spawns MemoryAgent with a task description.
2. MemoryAgent retrieves memories via `search_memories`, returns findings to primary.
3. Primary's response cites specific memory ids: "the analysis above relies on extracted memories mem-a3f7c2, mem-c1b08e."
4. User or downstream verifier resolves those ids to the underlying memory records.

Without stable ids, the chain of citation breaks and verification becomes guesswork.
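
Step 4 of the pattern above can be sketched as a scan of the response text followed by a lookup. The id grammar is an assumption inferred from the examples in this section (`mem-` plus six hex characters); the `Map` standing in for the memory store and both function names are hypothetical.

```typescript
// Assumption: ids look like "mem-" plus six hex characters, as in the examples.
const MEMORY_ID_PATTERN = /\bmem-[0-9a-f]{6}\b/g;

// Distinct memory ids cited in a response, in first-seen order.
function citedMemoryIds(response: string): string[] {
  const matches = response.match(MEMORY_ID_PATTERN) ?? [];
  return Array.from(new Set(matches));
}

// Split cited ids into those the store can resolve and those it cannot.
function resolveCitations(
  response: string,
  store: Map<string, string>,
): { resolved: string[]; unresolved: string[] } {
  const resolved: string[] = [];
  const unresolved: string[] = [];
  for (const id of citedMemoryIds(response)) {
    (store.has(id) ? resolved : unresolved).push(id);
  }
  return { resolved, unresolved };
}
```

An unresolved id is the signal that the citation chain is broken and the claim needs manual verification.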

### 18.5 Source attribution requirements

Every marker carrying extracted content (extracted_memory, document_excerpt) MUST include source attribution:

- `source_ref` for memories — points to the document or conversation the memory came from, with span information where applicable.
- `document_id` and `pages` (or `span`) for excerpts — points to the document and the specific portion.
- `extracted_at` for memories — when the memory was created, so the user can assess whether it might be stale.

These attributes are non-optional. A marker without them is malformed, and the marker emitter (the DOC25 service that built the marker) is in error.
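
A malformed-marker check can be sketched by parsing the opening tag and comparing it against the required attributes listed in §18.3. The regex-based parsing and the helper names below are assumptions for illustration; the prompt assembler's real validation is not specified here.

```typescript
// Each inner list is a set of alternatives; at least one must be present.
const REQUIRED_ATTRS: Record<string, string[][]> = {
  extracted_memory: [["id"], ["source_type"], ["source_ref"], ["extracted_at"]],
  document_excerpt: [["document_id"], ["pages", "span"], ["extraction_method"]],
};

// Parse `<type a="1" b="2">` (attributes may span several lines).
function parseOpeningTag(marker: string): { type: string; attrs: Record<string, string> } | null {
  const open = /^<([a-z_]+)((?:\s+[a-z_]+="[^"]*")*)\s*>/.exec(marker.trim());
  if (open === null) return null;
  const attrs: Record<string, string> = {};
  const attrPattern = /([a-z_]+)="([^"]*)"/g;
  let m: RegExpExecArray | null;
  while ((m = attrPattern.exec(open[2])) !== null) {
    attrs[m[1]] = m[2];
  }
  return { type: open[1], attrs };
}

// Unmet requirements (alternatives joined with "|"), or null if the marker is unparsable.
function missingAttributes(marker: string): string[] | null {
  const tag = parseOpeningTag(marker);
  if (tag === null) return null;
  const required = REQUIRED_ATTRS[tag.type] ?? [];
  return required
    .filter((alternatives) => !alternatives.some((name) => name in tag.attrs))
    .map((alternatives) => alternatives.join("|"));
}
```

Marker types without an entry in the table (for example `corpus_context`) pass with an empty list in this sketch.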

### 18.6 Governance

DOC25 provides content with the agreed marker format; it does NOT own the format itself. **Marker scheme governance is owned by the eventual prompt-composition coordinator — DOC15 + DOC24.** Once that coordinator lands, DOC15/DOC24 own the format authoritatively; DOC25's content production conforms to whatever scheme they codify.

V2.0 specifies the scheme as it stands today as a working contract until DOC15/DOC24 governance lands. The format above is operative; format changes require coordination with DOC15 + DOC24 and migration of all content emitters in DOC25 (and other sources of injected content). Backwards compatibility is preserved during migration windows.

The split of responsibilities:

- **DOC25 owns:** building markers around DOC25-produced content (extracted memories from corpus extraction, document excerpts from retrieval tools).
- **DOC15 + DOC24 own (eventually):** the marker format itself; the rules for which markers may be injected in which prompt contexts; the budget allocation across marker types.
- **Marker format conformance:** DOC25 conforms to the active format; content injected without conforming markers is rejected by the prompt assembler.

### 18.7 Multi-marker prompts

A single prompt typically contains multiple markers. The composition order is governed by the prompt-composition layer, not by DOC25. DOC25's role is to produce well-formed individual markers; assembly into the full prompt envelope is downstream work.

When markers reference each other (e.g., a `document_excerpt` is the source for an `extracted_memory`), the relationship is captured in attribute references rather than in marker nesting. Markers are flat — never nested — to keep parsing simple and avoid ambiguity about scope.

---

## §19 Frontend UI and Settings

This section specifies the user-facing surfaces for document intelligence and ingestion. UI implementation lives in DOC20 (Document Viewer) and the Q Dashboard general framework; this section specifies the contracts those surfaces must honor.

### 19.1 Ask Panel — document context indicator

When the Ask panel is opened from a document tab, a context indicator below the document title shows the document's current status and tier selection:

```
┌──────────────────────────────────────────┐
│ Sanli Expert Report.pdf                  │
│ 50 pages · ~150K tokens                  │
│   ○ Full document (recommended)          │
│   ○ Text only (faster, cheaper)          │
│   ○ Summary + current page               │
│ Last extracted: 2026-04-25 (MarkItDown)  │
└──────────────────────────────────────────┘
```

Default selection: automatic — the system chooses the optimal tier per §3.2. The auto-selected tier is shown with a subtle "(auto)" label.

Manual override: the user can click to force a specific tier. This is a power-user feature; most users should never need it.

When the document has degraded quality flags from the `IngestionQualityReport` (§15.1), the indicator shows them: "Tables may be garbled — reconvert?" with a one-click affordance to trigger reconversion via §11.5.3.

### 19.2 Conversation context badge

In the chat message area, when a document is included in context, a badge shows:

```
[doc] Sanli Expert Report.pdf — Full document (cached)     [×]
```

For subsequent turns:

```
[doc] Sanli Expert Report.pdf — Using cached version       [×]
```

For text-only sends:

```
[doc] Sanli Expert Report.pdf — Text extracted (50 pages)  [×]
```

The `[×]` removes the document from context for the current message.

When a model switch invalidates the cache (§4.5), the badge updates to:

```
[doc] Sanli Expert Report.pdf — Re-caching for new model
```

### 19.3 Cost indicator

In the Ask panel footer or a settings popover, estimated token usage and dollar cost surface:

```
Est. cost: ~15K tokens (cached) · $0.02
```

This helps users understand the cost implications of their tier choice. The dollar estimate uses the active model's pricing from the model capability registry; when pricing is unknown the indicator shows tokens only.
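
The tokens-only fallback can be sketched as a single formatter. `formatCostIndicator` and its `usdPerMillionTokens` parameter are hypothetical; the sketch assumes the registry hands back one effective price per million tokens and does not model separate cached and uncached rates.

```typescript
// Renders the indicator string; omits the dollar part when pricing is unknown.
function formatCostIndicator(tokens: number, cached: boolean, usdPerMillionTokens?: number): string {
  const tokenPart = `~${Math.round(tokens / 1000)}K tokens${cached ? " (cached)" : ""}`;
  if (usdPerMillionTokens === undefined) {
    return `Est. cost: ${tokenPart}`;
  }
  const usd = (tokens / 1e6) * usdPerMillionTokens;
  return `Est. cost: ${tokenPart} · $${usd.toFixed(2)}`;
}
```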

### 19.4 Document status in entity graph browser

In the browser pane, document items show intelligence status:

```
[doc] Sanli Expert Report.pdf      DOC  10m ago
      Text extracted | Summary cached | Classified: Expert Report
```

Documents without pre-computed intelligence show:

```
[doc] New_Filing.pdf               DOC  just now
      Processing...
```

Documents with quality flags show them inline:

```
[doc] Old_Scanned_Brief.pdf        DOC  yesterday
      OCR'd (low confidence on 3 of 47 pages) | Searchable
```

### 19.5 Document Intelligence settings panel

Settings > Document Intelligence is the canonical settings surface. The panel structure:

```
Settings > Document Intelligence
├─ Extraction Backend
│   ├─ Primary: MarkItDown — Active
│   ├─ Heavy-document fallback: Docling — Active
│   ├─ Browser fallback: PDF.js getTextContent()
│   └─ ☑ Use MarkItDown for all DOC72 intake extraction
│
├─ OCR
│   ├─ Primary: MarkItDown OCR plugin (LLM vision)
│   ├─ Browser fallback: Tesseract.js (local)
│   ├─ ☐ Azure Document Intelligence (cloud, opt-in)
│   ├─ ☑ Auto-OCR scanned pages when documents are opened
│   └─ ☑ Write OCR text layer back into PDFs
│
├─ LLM Document Access
│   ├─ ☑ Allow LLM to request specific pages from summarized documents
│   ├─ Max pages per retrieval request: [20]
│   └─ ☑ Show inline indicators when LLM re-reads documents
│
├─ Tool Status
│   ├─ Docling: installed, v2.7.3
│   ├─ MarkItDown: installed, v0.1.4
│   ├─ GLiNER: installed, v0.2.5 (model: urchade/gliner_medium-v2.1)
│   ├─ NuExtract: installed, v0.5.0
│   ├─ Firecrawl: self-hosted Docker, v1.6.2 — Healthy
│   └─ Tesseract: installed, v5.3.4 (browser-side)
│
├─ Universal Ingestion
│   ├─ Per-step concurrency pools (advanced)
│   ├─ Reserved disk floor: [5 GB]
│   ├─ Storage retention: [Keep everything] [Configure per corpus...]
│   └─ ☑ Show realistic timing estimates (model-cost + wall-clock)
│
├─ Multi-Model
│   ├─ ☑ Warn when model switch will invalidate document cache
│   └─ ☐ Configure per-conversation routing modes (V2.1)
│
└─ Privacy
    ├─ ☑ All extraction runs locally (MarkItDown self-hosted)
    ├─ ☐ Allow cloud OCR for accuracy-critical documents (opt-in)
    └─ ☐ Allow LlamaCloud services for cloud-allowed corpora (opt-in)
```

The "Tool Status" subsection populates from the §10.11 tool installation tracking and refreshes on a polling interval. Tools whose installed version falls outside the spec's expected version range surface a warning indicator.

### 19.6 Cross-corpus document browser (Q6 resolution)

A read-only document browser surfaces in Q > Knowledge > Documents, providing a unified view of all documents in ELNOR's document store across all corpora and surfaces. The browser shows:

- Document title, classification, original paths (collapsed if multiple)
- Corpus memberships (which corpora reference this document)
- Last accessed, first ingested
- Quality flags from the `IngestionQualityReport`
- Source instances if multiple (different policy contexts)

Actions available from the browser:

- View converted markdown (read-only)
- Navigate to corpus where document is a member
- Trigger reconversion (per §11.5.3)
- Export document metadata + converted markdown as a zipped bundle
- Mark for deletion (with confirmation per §13.5 layered deletion semantics)

Q6 resolution: yes, this browser exists. Export option included.

### 19.7 Per-corpus retention configuration (Q5 resolution)

Each corpus has a retention setting in its corpus settings panel:

```
Corpus Settings > Storage
├─ Retention: ● Keep everything (default)
│             ○ Keep originals + active conversion only
│             ○ Custom rules
└─ Estimated storage usage: 4.2 GB across 847 documents
```

Default is keep-everything. Users with very large corpora (10,000+ documents on the same hardware) can adjust to manage disk space. Custom rules allow expressing things like "purge previous_conversions older than 60 days but keep active conversions indefinitely."
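
The quoted custom rule could be evaluated roughly as follows. The record shape and `selectForPurge` are hypothetical; the spec does not define how custom rules are expressed or stored.

```typescript
// Hypothetical record for one conversion of a document.
interface ConversionRecord {
  conversion_id: string;
  active: boolean;    // true for the active conversion
  created_at: string; // ISO 8601 timestamp
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Previous (non-active) conversions older than maxAgeDays; active ones are never selected.
function selectForPurge(records: ConversionRecord[], now: Date, maxAgeDays: number = 60): ConversionRecord[] {
  return records.filter((r) => {
    if (r.active) return false;
    const ageDays = (now.getTime() - Date.parse(r.created_at)) / MS_PER_DAY;
    return ageDays > maxAgeDays;
  });
}
```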

---

## §20 Agent Conversation Context Manager

The Context Manager is the service that sits between the frontend (Ask panel, Chat) and the LLM API. It implements the routing decisions specified in §3 (tiering), §4 (caching), §6 (model-specific), and §22 (chat attachment handling). This section specifies the Context Manager's interface; its internal mechanics are implementation-bounded.

### 20.1 Architecture

The Context Manager:

1. Receives the user's message plus the list of attached documents.
2. For each document, selects the optimal tier using the algorithm in §3.2.
3. Formats the document content for the selected model using §6 routing logic.
4. Manages prompt caching state (§4) including multi-model invalidation.
5. Tracks token usage and savings.
6. Returns the formatted API request.

The Context Manager reads from EC's runtime state (DocumentEntity records, conversation document state) and never writes durable state — durable writes go through EC's sole-writer path.

### 20.2 Interface

```typescript
interface ContextManager {
  // Prepare document content for an API call
  prepareDocumentContext(
    documentId: string,
    conversationId: string,
    model: string,
    userMessage: string,
    options?: {
      forceTier?: 1 | 2 | 3;
      specificPages?: number[];
      maxTokenBudget?: number;
    }
  ): Promise<DocumentContextBlock>;

  // Track that a document was sent in a conversation
  recordDocumentSend(
    documentId: string,
    conversationId: string,
    tier: number,
    tokenCount: number,
    promptCached: boolean
  ): void;

  // Check current state of a document in a conversation
  getDocumentState(
    documentId: string,
    conversationId: string
  ): ConversationDocumentState | null;

  // Estimate cost before sending
  estimateCost(
    documentId: string,
    model: string,
    tier?: number
  ): { tokens: number; estimatedCost: number; tier: number };
}

interface DocumentContextBlock {
  tier: number;
  contentBlocks: any[];   // formatted for the target API
  tokenEstimate: number;
  promptCached: boolean;
  metadata: {
    documentId: string;
    documentTitle: string;
    pageCount: number;
    pagesIncluded: number[] | 'all';
    extractionMethod: string;       // which tool produced the content
    qualityFlags: string[];         // from IngestionQualityReport
  };
}
```

V2.0 additions to V1.0's interface: `extractionMethod` and `qualityFlags` in `DocumentContextBlock.metadata`. These let the consuming API call surface degradation flags to the LLM (per §9.8 OCR low-confidence note pattern).

### 20.3 Token budget management

When multiple documents are in context, the Context Manager allocates token budget across them. Example for Claude Opus with a 200K context window:

```
Total context window:        200K tokens
System prompt:                ~2K tokens
Conversation history:        ~10K tokens (variable)
User message:               ~500 tokens
Reserved for response:        ~4K tokens
Available for documents:    ~183K tokens

Document A (100 pages):  300K full PDF | 50K text | 5K summary
Document B (20 pages):    60K full PDF | 10K text | 2K summary
Document C (5 pages):     15K full PDF |  3K text | 1K summary

Budget allocation:
- Document A: Tier 2 (text)         50K tokens (primary focus)
- Document B: Tier 1 (full, cached) 60K tokens (needs visual analysis)
- Document C: Tier 1 (full)         15K tokens (small enough)
Total:                              125K tokens — fits within budget
```

If the budget is tight, the Context Manager automatically downtiers the least-relevant documents first. Relevance is determined by recency in the conversation (most recent attachments rank higher) and by document size (smaller documents are cheap to keep at higher tiers).
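
The budget arithmetic and downtiering can be sketched as below, using the figures from the example. How recency and size combine into a single relevance order is not specified, so the sketch assumes oldest attachment first with larger current cost as the tie-break, and it downtiers one document as far as needed before moving to the next. All names are hypothetical.

```typescript
type Tier = 1 | 2 | 3;

interface PlannedDocument {
  id: string;
  attachedAtTurn: number;                         // higher = attached more recently
  tokensByTier: { 1: number; 2: number; 3: number };
  tier: Tier;                                     // tier chosen before budgeting
}

// Tokens left for documents after fixed prompt components are reserved.
function availableForDocuments(b: {
  window: number; system: number; history: number; userMessage: number; reservedResponse: number;
}): number {
  return b.window - b.system - b.history - b.userMessage - b.reservedResponse;
}

// Downtier least-relevant documents first until the plan fits (or nothing is left to downtier).
function fitToBudget(
  docs: PlannedDocument[],
  budget: number,
): { plan: PlannedDocument[]; total: number; fits: boolean } {
  const plan = docs.map((d) => ({ ...d })); // copy so the caller's list is untouched
  const total = () => plan.reduce((sum, d) => sum + d.tokensByTier[d.tier], 0);
  // Assumption: least relevant = oldest attachment; ties go to the larger current cost.
  const order = [...plan].sort(
    (a, b) => a.attachedAtTurn - b.attachedAtTurn || b.tokensByTier[b.tier] - a.tokensByTier[a.tier],
  );
  for (const d of order) {
    while (total() > budget && d.tier < 3) {
      d.tier = (d.tier + 1) as Tier;
    }
  }
  return { plan, total: total(), fits: total() <= budget };
}
```

With the example's inputs, `availableForDocuments` yields 183,500 tokens (shown rounded as ~183K above), and the 125K allocation passes through unchanged.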

When the LLM has retrieved pages via `retrieve_document_pages` (§8), those pages count against the budget on subsequent turns. If retrieved-page accumulation crosses the §3.2 Rule 3 50% threshold, the document auto-escalates to Tier 2 to consolidate; the per-page accumulation is replaced by the full Tier 2 content.
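
The allocation policy above can be sketched as a two-pass greedy routine. This is a minimal illustration under stated assumptions, not the Context Manager's actual implementation: `BudgetCandidate`, `BudgetPlan`, `allocateDocumentBudget`, and the numeric `relevance` score are hypothetical names, and the first pass (starting a document at a lower tier when its Tier 1 form cannot fit the budget on its own) is an assumption added so that the sketch reproduces the worked example.

```typescript
type Tier = 1 | 2 | 3;

interface BudgetCandidate {
  id: string;
  relevance: number;                  // hypothetical score: higher = keep at a richer tier
  tokensByTier: Record<Tier, number>; // estimated tokens at Tier 1, 2, 3
}

interface BudgetPlan {
  tiers: Record<string, Tier>;
  totalTokens: number;
  fits: boolean;
}

function allocateDocumentBudget(docs: BudgetCandidate[], availableTokens: number): BudgetPlan {
  const tiers: Record<string, Tier> = {};

  // Pass 1 (assumption): a document that cannot fit alone starts at a lower tier.
  for (const doc of docs) {
    let tier: Tier = 1;
    while (tier < 3 && doc.tokensByTier[tier] > availableTokens) {
      tier = (tier + 1) as Tier;
    }
    tiers[doc.id] = tier;
  }

  const total = (): number =>
    docs.reduce((sum, doc) => sum + doc.tokensByTier[tiers[doc.id]], 0);

  // Pass 2: downtier the least-relevant document that can still drop a tier.
  const leastRelevantFirst = [...docs].sort((a, b) => a.relevance - b.relevance);
  while (total() > availableTokens) {
    const victim = leastRelevantFirst.find((doc) => tiers[doc.id] < 3);
    if (victim === undefined) break; // everything is already at Tier 3
    tiers[victim.id] = (tiers[victim.id] + 1) as Tier;
  }

  const totalTokens = total();
  return { tiers, totalTokens, fits: totalTokens <= availableTokens };
}

// Figures from the worked example above; the relevance values are illustrative.
const exampleDocs: BudgetCandidate[] = [
  { id: "A", relevance: 3, tokensByTier: { 1: 300_000, 2: 50_000, 3: 5_000 } },
  { id: "B", relevance: 2, tokensByTier: { 1: 60_000, 2: 10_000, 3: 2_000 } },
  { id: "C", relevance: 1, tokensByTier: { 1: 15_000, 2: 3_000, 3: 1_000 } },
];
const examplePlan = allocateDocumentBudget(exampleDocs, 183_000);
```

With the example figures and a 183K budget, the sketch puts Document A at Tier 2 and keeps B and C at Tier 1, for 125K tokens in total. A real implementation would derive relevance from conversation recency and document size, as described above.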

---

## §21 Workflow-Specific Optimizations

Different document workflows have different patterns. The Context Manager applies workflow-aware optimizations when it can detect the workflow.

### 21.1 Red-teaming sessions

**Pattern:** user sends a document and iterates through multiple rounds of analysis or critique.

**Optimization:**

1. Turn 1: send full PDF with prompt cache enabled.
2. Turns 2-N: automatic cache hits at ~10% cost.
3. System tracks which arguments or sections have been addressed.
4. Each turn's prompt references the cached document without re-sending.

**UI:** "Document Analysis Session" mode in the Ask panel. Shows:

- Document attached with cache status.
- Argument or section tracker.
- Consolidated findings accumulator.

**Estimated savings:** 80-90% reduction in document token cost over a 10-turn session.
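
The savings estimate can be checked with simple arithmetic. The sketch below is illustrative only: `cacheSavingsFraction` is a hypothetical helper, `readMultiplier` mirrors the "~10% cost" figure for cache hits, and `writeMultiplier` is an assumed parameter for any first-turn cache-write premium a provider may charge (set to 1.0 here, meaning no premium).

```typescript
// Fraction of document-token cost saved by prompt caching over a session,
// relative to re-sending the full document at full price on every turn.
function cacheSavingsFraction(
  turns: number,
  readMultiplier: number = 0.1,  // cache-hit cost relative to a full send
  writeMultiplier: number = 1.0  // assumed first-turn cost relative to a full send
): number {
  if (!Number.isInteger(turns) || turns < 1) {
    throw new RangeError("turns must be a positive integer");
  }
  const naiveCost = turns; // one full-price send per turn
  const cachedCost = writeMultiplier + (turns - 1) * readMultiplier;
  return 1 - cachedCost / naiveCost;
}

const tenTurnSavings = cacheSavingsFraction(10);
```

For a 10-turn session with no write premium this comes to about 0.81. A provider that charges a premium on the cache-writing turn lands slightly lower, and longer sessions land higher.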

### 21.2 Cross-document comparison

**Pattern:** user wants the agent to compare two or more documents (e.g., complaint vs answer, two expert reports).

**Optimization:**

- Primary document: Tier 1 (full PDF, cached) when visual layout matters; Tier 2 (text) when comparison is purely textual.
- Comparison document: Tier 2 (text) — visual layout rarely matters for comparison.
- If both are text-heavy: both Tier 2.

**UI:** Multi-document attach in the Ask panel. Shows each document with its tier and token cost.
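
A minimal sketch of the tier rules for comparison, assuming the caller already knows which document is primary and whether visual layout matters. `ComparisonInput`, `ComparisonAssignment`, and `assignComparisonTiers` are hypothetical names; how `visualLayoutMatters` is detected is not specified here.

```typescript
interface ComparisonInput {
  documentId: string;
  isPrimary: boolean;
  visualLayoutMatters: boolean; // hypothetical flag supplied by the caller
}

interface ComparisonAssignment {
  documentId: string;
  tier: 1 | 2;
  cached: boolean;
}

function assignComparisonTiers(docs: ComparisonInput[]): ComparisonAssignment[] {
  return docs.map((doc): ComparisonAssignment => {
    // Only a primary document whose layout matters is sent as full PDF (Tier 1, cached).
    const fullPdf = doc.isPrimary && doc.visualLayoutMatters;
    return { documentId: doc.documentId, tier: fullPdf ? 1 : 2, cached: fullPdf };
  });
}
```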

### 21.3 Quick reference

**Pattern:** user asks a quick question that references a document tangentially ("What was the discount rate in the Sanli report?").

**Optimization:**

- If document has been seen in this conversation: Tier 3 (summary + targeted page).
- If not seen: Tier 3 (summary only) — the question can likely be answered from the summary.
- If the model needs more detail: it calls `retrieve_document_pages` (§8) directly.

**UI:** No special UI. The system handles it automatically. The context badge shows "Summary only" with the page-retrieval availability noted.

### 21.4 Document drafting with reference

**Pattern:** user is drafting a motion and references an exhibit or deposition transcript.

**Optimization:**

- The document being drafted: always full text in context (it's a note, not a PDF).
- Referenced documents: Tier 3 (summaries + cited pages).
- If user selects specific text (Area Select tool) and sends it: just that text, no document context needed.

**UI:** the note editor's Ask panel shows both the current note content and any referenced documents with their tiers.

---

## §22 Chat Attachment Handling

When a user attaches a document directly to a chat message (via the + button or drag-and-drop), the system handles it through the universal ingestion pipeline (§11) and then prepares it for the LLM via the Context Manager.

### 22.1 PDF attachment in chat

When a PDF is attached:

1. **Read the file.** FileReader reads the bytes.
2. **Hash and dedup check.** Compute `raw_file_hash`. Check `documents` table. If hit, reuse existing artifacts (§13).
3. **If new:** route through universal ingestion (§11). Steps 1-7 of the pipeline run.
4. **Classify.** Per §2.2.
5. **Store in entity graph.** Per §11.2 step 8.
6. **Send to model.** Use Tier 1 (full PDF) for native-PDF models; Tier 2 (text) for non-native-PDF models. Use cached reference if cache is warm.
7. **Enable caching.** Include `cache_control` for subsequent turns.

Because a chat attachment arrives as bytes rather than from a user-controlled path, the original is stored at `originals/{content_hash}.{extension}` per §12.1.
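
The hash, dedup, and tier-selection steps above can be sketched as follows. This is a simplified, synchronous illustration, not the actual pipeline: `IngestedDocument`, `AttachmentResolution`, and `resolvePdfAttachment` are hypothetical names, the in-memory `Map` stands in for the `documents` table, the `ingest` callback stands in for universal ingestion (§11), and gating `cache_control` on a model capability flag is an assumption.

```typescript
interface IngestedDocument {
  documentId: string;
  rawFileHash: string;
}

interface AttachmentResolution {
  document: IngestedDocument;
  deduplicated: boolean;    // true when the hash lookup hit
  tier: 1 | 2;              // Tier 1 full PDF vs Tier 2 text
  useCacheControl: boolean; // include cache_control for subsequent turns
}

function resolvePdfAttachment(
  rawFileHash: string,
  knownDocuments: Map<string, IngestedDocument>,
  ingest: (hash: string) => IngestedDocument,
  model: { nativePdf: boolean; promptCaching: boolean }
): AttachmentResolution {
  // Dedup check: reuse existing artifacts when the hash is already known.
  const existing = knownDocuments.get(rawFileHash);
  const document = existing ?? ingest(rawFileHash);
  if (existing === undefined) {
    knownDocuments.set(rawFileHash, document);
  }
  return {
    document,
    deduplicated: existing !== undefined,
    tier: model.nativePdf ? 1 : 2,
    useCacheControl: model.promptCaching,
  };
}
```

A production version would be asynchronous and would compute the hash from the attachment bytes before calling this step.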

### 22.2 Non-PDF attachment in chat

- **Word doc:** route through universal ingestion (extracts via MarkItDown). Send extracted markdown as text content block.
- **Image:** send as image content block. Image-to-text via OCR (§9) only if the user requests text or the image is primarily textual.
- **Spreadsheet:** convert to markdown table via MarkItDown; send as text.
- **Text file:** send as-is.
- **Presentation:** extract slide text + speaker notes via MarkItDown; send as text.
- **Audio:** transcribe via Whisper through MarkItDown's audio plugin; send transcript as text.

In all cases, the document is also routed through universal ingestion so it becomes available for retrieval and dedup in subsequent conversations.
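
The per-type handling above amounts to a small routing table. The sketch below is one way to express it, with hypothetical names (`AttachmentKind`, `ChatContentPlan`, `planNonPdfAttachment`). It assumes the image block is still sent when OCR runs, which is not stated above.

```typescript
type AttachmentKind = "word" | "image" | "spreadsheet" | "text" | "presentation" | "audio";

interface ChatContentPlan {
  blockType: "text" | "image";
  converter: "markitdown" | "none";
  runOcr: boolean;
}

function planNonPdfAttachment(
  kind: AttachmentKind,
  opts: { userRequestsText?: boolean; primarilyTextual?: boolean } = {}
): ChatContentPlan {
  switch (kind) {
    case "word":
    case "spreadsheet":
    case "presentation":
    case "audio": // transcription goes through MarkItDown's audio plugin
      return { blockType: "text", converter: "markitdown", runOcr: false };
    case "image":
      return {
        blockType: "image",
        converter: "none",
        // OCR only when the user asks for text or the image is mostly text.
        runOcr: Boolean(opts.userRequestsText || opts.primarilyTextual),
      };
    case "text":
      return { blockType: "text", converter: "none", runOcr: false };
  }
}
```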

### 22.3 Multiple attachments

When multiple files are attached to a single message:

1. Sort by relevance and size (most relevant first; then smallest first for tie-breaking).
2. Allocate token budget per §20.3.
3. Tier each document appropriately.
4. Show user the total token estimate before sending.

If the user has many attachments and the budget is tight, the UI suggests removing the least-relevant attachments rather than silently downgrading them to Tier 3 with poor results.
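
A minimal sketch of the ordering and the removal suggestion, assuming a caller-supplied relevance score and using the token estimate as the size for tie-breaking. `PendingAttachment`, `orderAttachments`, and `suggestRemovals` are hypothetical names.

```typescript
interface PendingAttachment {
  name: string;
  relevance: number;     // hypothetical score, higher = more relevant
  tokenEstimate: number; // used here as the "size" for tie-breaking
}

// Most relevant first; among equally relevant items, smallest first.
function orderAttachments(items: PendingAttachment[]): PendingAttachment[] {
  return [...items].sort(
    (a, b) => b.relevance - a.relevance || a.tokenEstimate - b.tokenEstimate
  );
}

// Attachments the UI could suggest removing so the rest fit the budget.
function suggestRemovals(
  items: PendingAttachment[],
  availableTokens: number
): PendingAttachment[] {
  const kept = orderAttachments(items);
  const removals: PendingAttachment[] = [];
  let total = kept.reduce((sum, item) => sum + item.tokenEstimate, 0);
  while (total > availableTokens && kept.length > 0) {
    const dropped = kept.pop() as PendingAttachment; // least relevant is last
    removals.push(dropped);
    total -= dropped.tokenEstimate;
  }
  return removals;
}
```

The suggestions are only surfaced to the user; nothing in this sketch removes or downtiers an attachment on its own.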

---

## §23 Files API Integration

For documents referenced across many conversations or for very large documents where re-sending is wasteful, the Files API allows uploading once and referencing by `file_id`. V2.0 integrates Files API support into the Context Manager.

### 23.1 When to use the Files API

The Files API is used for:

- Documents that are referenced across many conversations.
- Very large documents (reduces per-request payload size).
- Frequently-used templates or reference materials.

```typescript
function shouldUploadToFilesAPI(document: DocumentEntity): boolean {
  // Upload if the document is large and likely to be reused
  if (document.fileSize > 5_000_000 &&  // 5 MB
      Object.keys(document.conversationsSentTo).length > 2) {
    return true;
  }
  
  // Upload if explicitly marked as a reference document
  if (document.userMarkedAsReference) {
    return true;
  }
  
  return false;
}
```

### 23.2 File ID management

```typescript
interface FileAPIRecord {
  localDocumentId: string;     // Q Dashboard document_id
  apiFileId: string;           // provider's file_id (Claude, OpenAI, etc.)
  apiProvider: string;         // 'anthropic' | 'openai' | 'google'
  uploadedAt: string;
  fileHash: string;            // raw_file_hash at upload time, for invalidation
  expiresAt: string | null;    // API file expiration if known
}
```

When a document's `apiFileId` is set for the active provider, the Context Manager uses `{ type: "file", file_id: "..." }` instead of base64, reducing request payload size significantly.

V2.0 stores `apiFileId` per provider (not single-valued) because the same document may be uploaded to multiple providers when the user routes turns to different models. Each provider's file_id is independent.
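
The per-provider lookup can be sketched as below. `ProviderUploads`, `DocumentSource`, and `selectDocumentSource` are hypothetical names; the `file` variant follows the shape quoted above, while the `base64` variant is an assumed stand-in for whatever inline shape the target provider expects.

```typescript
type ApiProvider = "anthropic" | "openai" | "google";

interface ProviderUpload {
  apiFileId: string;
  fileHash: string;
}

// One entry per provider the document has been uploaded to.
type ProviderUploads = Partial<Record<ApiProvider, ProviderUpload>>;

type DocumentSource =
  | { type: "file"; file_id: string } // shape quoted in the text above
  | { type: "base64"; data: string }; // hypothetical inline fallback shape

function selectDocumentSource(
  uploads: ProviderUploads,
  activeProvider: ApiProvider,
  base64Data: string
): DocumentSource {
  const upload = uploads[activeProvider];
  return upload !== undefined
    ? { type: "file", file_id: upload.apiFileId }
    : { type: "base64", data: base64Data };
}
```

Keying uploads by provider means a turn routed to a different model falls back to inline content instead of sending a `file_id` that the other provider would not recognize.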

### 23.3 Invalidation

When the document's `raw_file_hash` changes (the document was modified), all `FileAPIRecord` entries for the document are invalidated. The next time the document is sent to a model, it is re-uploaded with a fresh `apiFileId`.

When the API expiration is known (some providers expire uploads after N days), the Context Manager refreshes the upload before expiration. When expiration is unknown, the Context Manager re-uploads on demand if the API returns a "file not found" error.
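
The invalidation and refresh rules can be sketched as a single decision function. `UploadRecord`, `UploadAction`, and `decideUploadAction` are hypothetical names, and the 24-hour refresh lead time is an assumed default, not a value taken from this spec.

```typescript
interface UploadRecord {
  apiFileId: string;
  fileHash: string;         // raw_file_hash at upload time
  expiresAt: string | null; // ISO 8601 timestamp, or null when unknown
}

type UploadAction = "reuse" | "reupload_hash_changed" | "refresh_before_expiry";

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function decideUploadAction(
  record: UploadRecord,
  currentRawFileHash: string,
  now: Date,
  refreshLeadMs: number = ONE_DAY_MS // assumed lead time before expiry
): UploadAction {
  // A changed hash invalidates the upload regardless of expiry.
  if (record.fileHash !== currentRawFileHash) {
    return "reupload_hash_changed";
  }
  if (record.expiresAt !== null) {
    const expiresMs = Date.parse(record.expiresAt);
    if (!Number.isNaN(expiresMs) && expiresMs - now.getTime() <= refreshLeadMs) {
      return "refresh_before_expiry";
    }
  }
  // Unknown expiry: reuse, and re-upload on demand if the API reports the file missing.
  return "reuse";
}
```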

---

## §24 Performance Metrics and Monitoring

The system tracks metrics for monitoring document intelligence performance, identifying bottlenecks, and surfacing user-facing cost and savings information.

### 24.1 Metrics tracked

- **Tokens sent per conversation turn** — broken down by document context vs user message vs system prompt vs marker scheme overhead.
- **Cache hit rate** — percentage of document sends that used prompt caching successfully.
- **Tier distribution** — how often each tier is used across all conversations.
- **Cost per conversation** — total token cost, broken down by document vs non-document.
- **Latency per turn** — time from user message to first response token, correlated with document tier.
- **Classification accuracy** — manual review sample of document type classifications. Tracked via a periodic audit; not a live metric.
- **Ingestion timing** — per-step timing across the §11.2 pipeline steps. Identifies bottlenecks.
- **Tool health** — per-tool uptime, capacity lease usage, retryable failure rate.
- **Quality flag distribution** — counts of `low_text_yield`, `tables_garbled`, `missing_page_anchors`, `ocr_low_confidence`, `partial_conversion` across the document store.
- **Storage usage** — per-corpus and total document store size.
- **Retrieval tool usage** — counts of `retrieve_document_pages`, `retrieve_full_document`, `retrieve_memory_source`, and the memory query tool family. Identifies which retrieval patterns are common.

Metrics are stored in a separate metrics database and aggregated on a configurable interval. They are not part of the durable graph state.
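
As an illustration of how two of these metrics (cache hit rate and tier distribution) could be aggregated from per-send events, here is a minimal sketch. `DocumentSendEvent`, `SendMetrics`, and `aggregateSendMetrics` are hypothetical names; the actual event schema and metrics store are not defined in this section.

```typescript
interface DocumentSendEvent {
  tier: 1 | 2 | 3;
  cacheHit: boolean;
  tokens: number;
}

interface SendMetrics {
  cacheHitRate: number; // fraction of sends served from the prompt cache
  tierDistribution: Record<1 | 2 | 3, number>;
  totalTokens: number;
}

function aggregateSendMetrics(events: DocumentSendEvent[]): SendMetrics {
  const tierDistribution: Record<1 | 2 | 3, number> = { 1: 0, 2: 0, 3: 0 };
  let hits = 0;
  let totalTokens = 0;
  for (const event of events) {
    tierDistribution[event.tier] += 1;
    if (event.cacheHit) hits += 1;
    totalTokens += event.tokens;
  }
  return {
    // Report 0 rather than NaN when there were no sends in the interval.
    cacheHitRate: events.length === 0 ? 0 : hits / events.length,
    tierDistribution,
    totalTokens,
  };
}
```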

### 24.2 User-facing dashboard

In Settings > Usage, the dashboard shows:

- Total tokens used for document context this month.
- Estimated savings from caching and tiering (computed against a hypothetical naive-resend baseline).
- Most expensive document interactions (by token cost).
- Documents with quality issues (degraded conversions, low OCR confidence).
- Recommendations for optimization (e.g., "Document X has been re-sent 8 times this month — consider Files API upload").

The dashboard is informational, not interactive. Configuration changes happen in the Document Intelligence settings panel (§19.5), not from the metrics dashboard.

---

## §25 Cross-Document Obligations

DOC25 V2.0's scope intersects with many other specs. This section enumerates the obligations DOC25 V2.0 places on those specs and that those specs place on DOC25.

### 25.1 DOC72 (Hyper Intelligence Overlay)

**DOC25 obligations to DOC72:**

- DocumentEntity records are written through DOC72's §20A intake contract; DOC25 conforms to DOC72's `world_entity` node shape with `entity_type: "document"`.
- Linked entity nodes (people, organizations, dates) extracted by GLiNER or LLM extraction are written via DOC72's intake; DOC25 produces them as `entity_observation` candidates per §10.4.2.
- Source spans on memory directives (when DOC25 produces them downstream of corpus extraction) conform to DOC72's source_ref schema.

**DOC72 obligations to DOC25:**

- DOC72 §20A intake contract is stable: DOC25 routes documents through it and depends on the intake handling them correctly.
- DOC72 publishes the `document` entity_type subtype; if it changes, DOC72 coordinates with DOC25.

### 25.2 DOC73 (Positronic Brain Enhancement)

**DOC25 obligations to DOC73:**

- DOC25 produces `DOC25_IngestionResult` per §17. The schema matches DOC73 V1.4.1 §15.2.1 field-for-field. DOC73 V1.5 will swap to a normative reference into §17; DOC25 V2.0 §17 is authoritative.
- DOC25 provides hard-fail reason codes per §14.5; DOC73 corpus extraction surfaces them as failure receipts.

**DOC73 obligations to DOC25:**

- DOC73 V1.5 SHALL swap the inline schema in §15.2.1 to a normative reference into DOC25 V2.0 §17.
- Until V1.5 lands, DOC73 V1.4.1 §15.2.1 and DOC25 V2.0 §17 must remain field-for-field identical; updates land in DOC25 first and propagate.
- DOC73 corpus extraction respects the per-corpus trust posture (`aggressive_auto_commit | normal_auto_commit | review_before_commit`) when consuming DOC25 results.
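
The field-for-field requirement lends itself to an automated drift check. The sketch below assumes each schema has been reduced to a map of field name to declared type (as text); `FieldMap`, `SchemaDrift`, and `diffIngestionResultSchemas` are hypothetical names, and how the two schemas are extracted from the spec documents is left open.

```typescript
type FieldMap = Record<string, string>; // field name -> declared type, as text

interface SchemaDrift {
  missingInDoc73: string[];
  missingInDoc25: string[];
  typeMismatches: string[];
}

function diffIngestionResultSchemas(doc25: FieldMap, doc73: FieldMap): SchemaDrift {
  const keys25 = new Set(Object.keys(doc25));
  const keys73 = new Set(Object.keys(doc73));
  return {
    missingInDoc73: [...keys25].filter((k) => !keys73.has(k)).sort(),
    missingInDoc25: [...keys73].filter((k) => !keys25.has(k)).sort(),
    typeMismatches: [...keys25]
      .filter((k) => keys73.has(k) && doc25[k] !== doc73[k])
      .sort(),
  };
}

function schemasIdentical(drift: SchemaDrift): boolean {
  return (
    drift.missingInDoc73.length === 0 &&
    drift.missingInDoc25.length === 0 &&
    drift.typeMismatches.length === 0
  );
}
```

A check like this could run whenever §17 changes, flagging drift until DOC73 V1.5 replaces its inline copy with a normative reference.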

### 25.3 DOC18 (LlamaIndex Retrieval Sidecar)

**DOC25 obligations to DOC18:**

- Chunks produced by DOC25 (per §11.2 step 7) are indexed into DOC18's vector store. Chunk_id and embedding_id appear in `DOC25_IngestionResult.chunk_manifest`.
- Chunking config version is propagated; DOC18 stores chunks under the version key.

**DOC18 obligations to DOC25:**

- DOC18 maintains the chunk index and exposes search APIs for chunk-level retrieval.
- DOC18 honors the `chunking_config_version` partition; it does not silently re-chunk under a new config.

### 25.4 DOC20 (Browser, Notes, Document Viewer)

**DOC25 obligations to DOC20:**

- Document Viewer renders converted markdown for the read-only "ELNOR's extracted version" view.
- OCR controls in the viewer toolbar (per §9.6) are wired to DOC25's OCR pipeline.
- Document badge in the viewer header shows the `IngestionQualityReport`-derived status.

**DOC20 obligations to DOC25:**

- DOC20's Document Viewer SHALL expose the OCR toolbar described in §9.6.
- The document badge SHALL surface the ingestion status and quality flags per §19.4.
- The cross-corpus document browser surface (§19.6) lives in Q > Knowledge > Documents per Q6 resolution.

### 25.5 DOC24 (Capability Registry, Tool Routing, Onboarding)

**DOC25 obligations to DOC24:**

- All DOC25 tools (`retrieve_document_pages`, `retrieve_full_document`, `retrieve_memory_source`, `q.document.ocr`, `document.convert_to_markdown`, the six memory query tools) register through DOC24's tool registry per §8 and §16.
- Tool registration includes capability metadata (stability_class, agent_invocable, invocation_bindings, confirmation_policy, safety_class, aliases, common_phrases) that DOC24 routes against.

**DOC24 obligations to DOC25:**

- DOC24's tool registry SHALL accept registrations matching the `ActionRegistryEntry` schema shown in §8 and §16.
- DOC24's routing cascade resolves entity references in user queries that map to documents; resolved document_ids flow into DOC25 retrieval calls.
- DOC24 makes MemoryAgent and DocumentIntelligenceAgent available as spawnable specialists when their corresponding scope is active.

### 25.6 DOC11 (OpenClaw Gateway)

**DOC25 obligations to DOC11:**

- The capability registry that the §6 routing logic reads from is owned by DOC11. DOC25 reads; does not write.
- Tool invocation for DOC25-registered tools flows through OpenClaw's `sessions_spawn` and tool-invocation paths.

**DOC11 obligations to DOC25:**

- DOC11 maintains accurate model capability metadata (native PDF support, prompt caching support, max page limits) per §6.1.
- DOC11 exposes sub-agent spawning for MemoryAgent and DocumentIntelligenceAgent per the patterns specified in `ELNOR_SUBAGENT_PRECOMPUTE_TOOL_OPTIMIZATION_NOTES_V4.md` §§1.7-1.11.

### 25.7 DOC15 (Cognitive Infrastructure Layer) and DOC24 (Prompt Composition)

**DOC25 obligations:**

- DOC25 conforms to the active marker scheme (§18) when injecting content into prompts. When DOC15 + DOC24 publish updated governance, DOC25 migrates emitters.
- DOC25's retrieval posture directive (§16.4) is composed by DOC15/DOC24 into the full system prompt; DOC25 supplies the directive text.

**DOC15 + DOC24 obligations to DOC25:**

- DOC15 + DOC24 SHALL eventually own marker scheme governance per §18.6. Until they do, DOC25's V2.0 marker format stands as a working contract.
- The token budget allocation across marker types (extracted_memory budget, corpus_context budget, etc.) is owned by DOC15/DOC24; DOC25 conforms.

### 25.8 DOC10 (Engagement Orchestration) and DOC14 (CANDOR)

**DOC25 obligations:**

- Chat-attached documents in the Ask panel and red-team room substrates route through DOC25's chat attachment handling (§22) and Context Manager (§20).

**DOC10/DOC14 obligations to DOC25:**

- Chat surfaces emit ingestion events to EC for any attached documents.
- Document context for a turn is requested from DOC25's Context Manager via `prepareDocumentContext`.

### 25.9 DOC16 (M365 Deep Integration)

**DOC25 obligations:**

- Email attachments arriving via M365 route through DOC25's universal ingestion (§11.1).
- The DOC16 read-vs-edit path separation per §10.10 is honored.

**DOC16 obligations to DOC25:**

- The M365 pipeline emits ingestion events for every inbound email attachment.
- M365 provides the source metadata (sender, subject, received timestamp) that becomes part of `DOC25_IngestionResult.metadata`.

### 25.10 DOC23 (Task System)

**DOC25 obligations:**

- Task module outputs that emit documents route through DOC25's universal ingestion.

**DOC23 obligations to DOC25:**

- Task modules emit ingestion events for any documents they produce or fetch.
- Task gathering pipelines (EDGAR, PACER, web scrape) flag the source class so DOC25 can apply autonomous-gathering trust posture per DOC73 V1.4.1 R5.3 D24.

### 25.11 EC Core (Addendum A V3.3)

**DOC25 obligations:**

- DOC25's orchestrator runs as an EC background service per §11.4. DOC25 does not run separate background processes outside EC.
- All durable writes from DOC25 go through EC's sole-writer path.

**EC Core obligations to DOC25:**

- EC provides the durable queue, per-step worker pools, retry/timeout/fallback policies, and dynamic pool sizing.
- EC's sole-writer invariant covers DOC25's writes to the document index, document store, and DOC72 entity graph.
- EC honors the §15.5 batch-commit policy.

### 25.12 DOC1 (Memory Resilience)

**DOC25 obligations:**

- DOC25 produces source-grounded extracted content but never directly writes memory directives. DOC1 governs memory promotion.
- DOC25 conforms to DOC1's `memory_directive` payload contract for any extracted content that downstream consumers (DOC73 corpus extraction) promote.

**DOC1 obligations to DOC25:**

- DOC1 disclaims authority-resolution scope per DOC1 §4.12; CU authority lives in DOC72/DOC73 §3.2A.
- DOC1's Write Gate accepts memory candidates DOC73 produces from DOC25 ingestion outputs.

---

## §26 Open Questions

V2.0 closes the eight planning-notes open questions per the resolutions in the "What's New" section. Some V1.0 questions remain open; some new V2.0 questions surface from the absorption of R1 and the planning notes.

### 26.1 Resolved questions from V1.0

V1.0 §15 listed six open questions. V2.0 dispositions:

- **Cache TTL management.** Partially addressed. The §4.4 cache warming strategy handles normal cases. Long reading sessions (10+ minutes idle) where the cache expires before the user's next message remain a cost concern. A keep-alive ping is proposed for V2.1; for V2.0, the user re-incurs first-turn cost on cache miss.
- **Multi-model conversations.** Partially addressed. §4.5 specifies cache invalidation on model switch with a dismissible warning. Per-conversation routing modes that suppress warnings when intentional routing is configured remain open for V2.1.
- **Document versioning.** Partially addressed. E8 versioned immutable artifacts (§12.4) handles conversion versioning. The narrower question — when a user annotates a PDF (highlights, redactions, text insertions), should the modified PDF be sent to the model or the original — remains open. Current behavior: send the modified PDF (it's what the user is looking at). V2.1 may add a per-document "send original" override.
- **Privacy / confidentiality "local only" flag.** Partially addressed. The `source_instance_id` mechanism (§12.3) supports policy-context distinctions. A document-level "never send to cloud" flag distinct from corpus-level visibility remains open. Work-around for V2.0: place sensitive documents only in corpora with `firewalled` visibility and rely on §10.7 LlamaCloud policy gating plus per-corpus model routing.
- **Token budget user control.** Open. Power users may want to set a per-conversation token budget ("don't spend more than 500K tokens on this conversation, even if it means degrading document quality"). V2.1 candidate.
- **Cross-session document memory.** Out of scope for DOC25. This is corpus and memory architecture territory (DOC72 + DOC73). DOC25 supplies the document layer; cross-session memory of insights from prior conversations about the same document is the corpus extraction layer's concern.

### 26.2 New V2.0 questions

- **OCR text positioning for searchable PDF write-back.** §9.5 flags this. The simplified `pdf-lib` approach in §9.5 produces a searchable PDF (text search via PDF.js works) but does not match OCR'd text to its visual location, so text selection in the viewer does not return the corresponding text layer content. Production implementation requires a more sophisticated overlay approach: positioned text fragments matching OCR bounding boxes, or a PDF.js-rendered overlay layer maintained separately from the PDF itself. Investigation needed before depending on selection-correspondence.
- **Auto-propose batch reconversion when tools update.** Q3 deferred to V2.1. When Docling or MarkItDown ships a quality-improvement release, the system should be able to flag affected documents and propose batch reconversion. The mechanism — surfacing as suggestions in the Runs tab versus opt-in per corpus — needs design.
- **Near-dedup similarity scoring for revisions.** Q1 deferred to V2.1. Documents whose shingle/MinHash similarity exceeds a threshold would be treated as candidate revisions, with the user confirming whether to link them or keep them separate. The threshold value, the user resolution UX, and the integration with `document_source_instances` need design.
- **Cross-user dedup in multi-user deployments.** Q8 deferred. ELNOR's V2.0 single-user assumption holds. Future multi-user deployments need: cross-user privacy boundaries on the document store; multi-user `source_instance_id` semantics; multi-user permission/visibility model. Flagged for any future multi-user revision.
- **Marker scheme governance landing.** §18.6 notes that DOC15 + DOC24 will eventually own marker format governance. The handoff path — when does DOC15/DOC24 governance ship; what is the migration sequence for existing emitters; who validates emitted markers in the interim — needs coordination across DOC15/DOC24/DOC25 working sessions.
- **Model capability registry refresh policy.** §6.1 reads from a runtime capability registry maintained by DOC11. When new models ship (e.g., a new GPT release with native PDF in a previously-unsupported model), the registry refresh policy is open. Manual user-driven refresh is the V2.0 baseline; an auto-refresh-from-provider mechanism is open for future revision.
- **`retrieve_full_document` token cap default.** §8.3 defaults `max_tokens: 50000`. This is a working default; it has not been validated against typical document sizes in production usage. The cap may need tuning based on observed truncation rates.
- **Per-step pool size auto-tuning.** §11.3 specifies fixed default pool sizes. Auto-tuning based on observed throughput and latency was discussed but not specified for V2.0. The mechanism (when to expand a pool, when to shrink, how to detect oscillation) is open.

---

## §27 Closing Note

DOC25 V2.0 is operative. It supersedes the V1.0 Document Intelligence & Agent Context Management Specification (2026-04-11) and the R1 Document Intelligence & Extraction Pipeline Updates Proposal (2026-04-19) in full. Neither source document should be consulted for current behavior; this document is authoritative.

The major architectural shift from V1.0 is the expansion from PDF context management to universal ingestion orchestration. V1.0 covered LLM cost optimization for one document type; V2.0 covers the full document intelligence and ingestion layer for every document entering ELNOR through every surface. The V1.0 tiered context management is preserved and refined; the universal ingestion orchestration is the V2.0 addition that makes the rest of the system work consistently.

DOC25 V2.0 is a producer of the `DOC25_IngestionResult` consumer contract (§17) that DOC73 V1.4.1 corpus extraction depends on. DOC73 V1.4.1 §15.2.1 currently inlines the schema as a freeze artifact; the field-for-field identical version is in §17 here. DOC73 V1.5 will swap the inline schema to a normative reference into this section; until that swap, the two locations must remain identical, with updates landing here first and propagating to DOC73 in V1.5.

Three V2.0 items remain genuinely partial and are flagged for V2.1 as the most likely next-revision priorities:

1. **Auto-propose batch reconversion when tools update** (deferred Q3). Without it, users have to manually trigger reconversion document by document when a quality-improvement tool release arrives.
2. **Near-dedup similarity scoring for revisions** (deferred Q1). Exact dedup catches the majority of cases; revision detection without explicit user linkage is a refinement that would help in document-review workflows.
3. **OCR text positioning for searchable PDF write-back** (new V2.0 question). The §9.5 stub produces searchable but not selection-correspondent OCR layers. Production implementation needs deeper work.

The cross-document obligations listed in §25 are normative. Stale companion docs that conflict with this spec on topics within DOC25's ownership scope should be updated to reference V2.0 rather than the other way around. DOC25 is the authoritative spec for document intelligence and universal ingestion in ELNOR.

---

*End of DOC25 V2.0.*