ELNOR REPO READER TEXT MIRROR
Original path: Current Specs/DOC25/DOC25_FILE_MATERIALIZATION_AND_PROVIDER_PROFILES_PROPOSAL_V1_1.md
Source repo: /Users/OpenClaw1/Elnor/Elnor Specs
Git branch: main
Git commit: dbaa25962edc11ab30e8d4ca1715f9ae5bf77331
Generated: 2026-06-09T01:23:58.539Z

---

# DOC25 Proposal: File Materialization, Provider Profiles, and Remote Access

**Source:** Real-world failure surfaced during the 2026-04-27 brief-bank search session: 18 of 26 curated MTD briefs in OneDrive Personal returned `errno 35` ("Resource deadlock avoided") when read from the sandboxed bash process. Files were OneDrive Files-On-Demand placeholders (non-zero size, zero blocks). DOC25 V2.0 has no model for this state.

**Target:** DOC25 V2.0. Additive amendment introducing file-materialization tri-state, expanded sync-provider profiles (including SharePoint, iCloud, network mounts), pre-flight probe, materialization trigger sequence, and brief-bank-class consumer obligations. Absorption into DOC25 V2.1 (or successor operative version).

**Status:** Proposal V1.1 (draft, awaiting fresh-window red-team). Domain-agnostic; PACER plugin V1.2, OneDrive watcher (DOC16 16.7), brief-bank corpora (DOC73 §3.1 + Corpus Source Bindings V1), and any future cloud-sync-backed surface will inherit.

**Version:** V1.1

**Supersedes:** `DOC25_FILE_MATERIALIZATION_AND_PROVIDER_PROFILES_PROPOSAL_V1.md` (2026-04-27)

**Date:** 2026-04-27

**Author context:** Will Brody, principal architect.

**Companion specs read for cross-doc consistency:**

- DOC25 V2.0 (2026-04-26): operative target.
- DOC16 Entry 16.7 R2.1 (M365 deep integration): already specifies Microsoft Graph API access, `OneDrivePathResolver` (DOC20 Addendum B §3), Graph downloads to working dir, multiple OneDrive accounts (firm + personal), Graph Search across tenant. This addendum HOOKS INTO that surface rather than re-specifying it.
- DOC73 V1.4.1 §15.2: `DOC25_IngestionResult` consumer contract.
  Extension here is additive (new optional fields), not breaking. DOC73 V1.5 inherits via consumer contract; no parallel DOC73 change required.
- DOC72 §20A: entity intake contracts; document entity nodes hydrate from `DocumentEntity`. No DOC72 changes required.
- DOC18 R2: LlamaIndex retrieval sidecar. Chunk-index storage interaction clarified in §12 (V1.1 addition).
- ELNOR Search Architecture Reference R1: V1.1 clarifications (§12) ensure routing logic in the Search Architecture Reference's Layer A path remains coherent under cloud-fetch / working-copy fallback.

---

## V1.1 Changes from V1

V1.1 is a documentary refinement of V1: no schema changes, no behavioral changes, no new failure modes. Two clarifications surfaced during integration with the Search Architecture Reference R1 and are folded in to prevent ambiguity at implementation time.

1. **§12.2: DOC18 chunk-index lifecycle vs. working-copy lifecycle.** Made explicit that DOC18 LlamaIndex chunks are stored under the canonical `document_id`, not under `working_copy_path`. Working-copy eviction (per §6 `working_copy_expires_at`) does NOT invalidate the chunk index. Only a `raw_file_hash` change triggers re-indexing.
2. **§12.1: Docling / MarkItDown routing unaffected by materialization.** Made explicit that the DOC25 V2.0 §10.3 profile-routed Docling vs. MarkItDown hybrid runs AFTER bytes are accessible. The router operates on materialized bytes regardless of whether they live at the original path or in `working_copy_path`. Reading order is unchanged: probe → trigger if needed → route → convert.

Both are clarifications, not architectural shifts. V1's architecture stands.

---

## 0. Why this exists

DOC25 V2.0 §11–§14 assume that when a path resolves, the file's bytes are readable. This assumption fails for cloud-sync placeholder files (OneDrive Files-On-Demand, iCloud Drive optimized storage, Google Drive File Stream, Dropbox Smart Sync) and for network-mount paths whose host is unreachable.
The failure modes are *different*:

- **Cloud sync placeholder:** path resolves, metadata returns, `stat` shows non-zero size and `Blocks: 0`. The read attempt fails with `errno 35` because the user-space sync daemon isn't materializing the bytes for the requesting process. Errno numbering is platform-specific: 35 is `EAGAIN` on macOS but `EDEADLK` ("Resource deadlock avoided") on Linux, so both codes are worth treating as placeholder signals.
- **Network mount unreachable:** path may not resolve at all (mount point absent), or may hang on read (server timeout), or may return permission denied (credentials expired). There is no "materialization" concept; bytes are network-reachable or not.
- **Stale path:** file moved or deleted, path doesn't resolve. Already partially handled by DOC25 V2.0 §13 (auto-repath via content hash matching).

DOC25 V2.0 conflates these into either silent retry loops or an undifferentiated `read_error` reason code, neither of which is correct.

This addendum:

1. Models the missing state (`materialized` / `placeholder` / `network_unreachable` / `permission_denied` / `stale`) explicitly.
2. Adds a cheap pre-flight probe before any read.
3. Specifies materialization triggers ordered from cheapest to most expensive.
4. Wires DOC16 16.7's existing Graph API path as the upgrade path for OneDrive / SharePoint placeholders.
5. Adds remote-access handling for firm-internal ELNOR network folders accessed over VPN / Tailscale / similar.
6. Distinguishes brief-bank-class consumers (require full text + OCR + metadata for every member) from lighter consumers (filename-only or chat-attachment uses).

---

## 1. Scope and Domain-Agnosticism

This addendum operates at the **document level** (DOC25's domain), not the corpus level (DOC73's). Every document handled by DOC25 (corpus member, ambient graph node, chat attachment, Q viewer document, demonstration capture, EDGAR pull, anything) gets the materialization treatment. Brief banks are a use case, not a special case.

Brief-bank corpora (DOC73 `knowledge_corpus` with `extraction_profile: securities_litigation_briefs` etc., per `DOC73_CORPUS_SOURCE_BINDINGS_PROPOSAL_V1.md`) require full materialization for every member because their extraction profile reads full text and runs OCR. But the same materialization fix applies to a one-off chat-attached PDF.

Out of scope: corpus-specific extraction logic (DOC73 §14 owns), entity graph schema (DOC72 owns), capability registration (DOC24 owns), Microsoft Graph auth and scope (DOC3 + DOC16 16.7 own).

---

## 2. The Materialization Tri-State (extension to §14 state machine)

Add an enumerated property `materialization_state` to every entry in `document_original_paths` (DOC25 V2.0 §12.2). It is independent of the existing `is_stale` flag.

```typescript
type MaterializationState =
  | "materialized"        // bytes are locally readable now (or path is local-only and always readable)
  | "placeholder"         // path resolves, metadata reads, but bytes are cloud-only
  | "network_unreachable" // network mount path; mount missing or host unreachable
  | "permission_denied"   // path resolves but read returns EACCES (creds expired, ACL change, etc.)
  | "unknown";            // not yet probed since last invalidation
```

`is_stale` (V2.0 §13) covers a different concept: *the path no longer resolves at all*. The two are orthogonal. A path can be `materialized + not stale`, `placeholder + not stale`, or `stale` (in which case materialization is moot).

State transitions are not durable on their own. They are a cached observation of the filesystem at probe time, refreshed per §3 below. The cache exists so consumers don't repeatedly probe the same file in a tight loop.

---

## 3. Pre-Flight Probe (insertion in §11 pipeline)

Insert a probe step at the top of the §11 pipeline, before conversion / extraction. The probe is intentionally cheap so corpus-scale workloads (5,000-doc batches per E18) don't pay a heavy tax.

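
One way to keep the probe cheap at corpus scale is to reuse the cached observation while it is still within its TTL. The sketch below is illustrative only: it assumes the §3.2 defaults (5 minutes for active documents, 24 hours for archived) and the two cache columns from §6. The `PathCacheEntry` shape, its `archived` flag, and the function name `isProbeCacheFresh` are hypothetical and not part of the DOC25 schema.

```typescript
// Hypothetical in-memory view of a path entry. Only the two cache columns
// come from §6; the `archived` flag is an assumption made for this sketch.
interface PathCacheEntry {
  last_materialization_state?: string;    // cached probe result (§6)
  last_materialization_check_at?: string; // ISO timestamp (§6)
  archived: boolean;                      // assumed flag, not in the schema
}

const ACTIVE_TTL_MS = 5 * 60 * 1000;         // 5 minutes (§3.2 default)
const ARCHIVED_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours (§3.2 default)

// Returns true when the cached state may be reused without re-probing.
function isProbeCacheFresh(entry: PathCacheEntry, nowMs: number): boolean {
  const state = entry.last_materialization_state;
  const checkedAtIso = entry.last_materialization_check_at;
  // Never probed, or explicitly invalidated back to "unknown": must re-probe.
  if (!state || !checkedAtIso || state === "unknown") return false;

  const checkedAtMs = Date.parse(checkedAtIso);
  if (Number.isNaN(checkedAtMs)) return false; // unparseable timestamp

  const ttlMs = entry.archived ? ARCHIVED_TTL_MS : ACTIVE_TTL_MS;
  return nowMs - checkedAtMs < ttlMs;
}
```

A caller would run the full probe only when this returns `false`. Webhook-driven invalidation could be modeled by resetting `last_materialization_state` to `"unknown"`.
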
### 3.1 Probe sequence

```typescript
// `stale` is reported by the probe but recorded through `is_stale` (§2),
// so it is unioned in here rather than added to MaterializationState.
type ProbeState = MaterializationState | "stale";

// safeStat, openNonBlocking, readBytes and closeFd are spec-level helpers
// whose implementations are left to the platform layer.
async function probeMaterialization(path: string): Promise<{
  state: ProbeState;
  cause?: string;          // optional human-readable explanation
  bytes_present?: number;  // from stat.st_size
  blocks_present?: number; // from stat.st_blocks
}> {
  // Layer A: stat (very cheap, ~microseconds)
  const st = await safeStat(path);
  if (st === null) return { state: "stale" };

  if (st.size > 0 && st.blocks === 0) {
    // Classic cloud-sync placeholder signature on macOS APFS.
    // (Windows uses the attribute-based probe described below.)
    return { state: "placeholder", bytes_present: st.size, blocks_present: 0 };
  }

  // Layer B: non-blocking 1-byte read probe (~milliseconds)
  // Detects placeholders that don't show blocks=0 (e.g., partially materialized)
  // and detects EACCES / EAGAIN that stat alone misses.
  try {
    const fd = await openNonBlocking(path, "r");
    try {
      await readBytes(fd, 1); // read 1 byte; only success or failure matters
      return { state: "materialized", bytes_present: st.size, blocks_present: st.blocks };
    } finally {
      await closeFd(fd);
    }
  } catch (err) {
    const e = err as { code?: string; message?: string };
    // EDEADLK is included because the triggering incident reported
    // "Resource deadlock avoided" (errno 35 on Linux).
    if (e.code === "EAGAIN" || e.code === "EWOULDBLOCK" || e.code === "EDEADLK")
      return { state: "placeholder", cause: e.message };
    if (e.code === "EACCES" || e.code === "EPERM")
      return { state: "permission_denied", cause: e.message };
    if (e.code === "ENOENT") return { state: "stale" };
    if (e.code === "EHOSTDOWN" || e.code === "ETIMEDOUT" || e.code === "ECONNREFUSED")
      return { state: "network_unreachable", cause: e.message };
    // Unrecognized error: report `unknown`. §3.3 treats it like a placeholder
    // until the §4 trigger sequence attempts a full read, so we never
    // prematurely conclude `materialized`.
    return { state: "unknown", cause: e.message };
  }
}
```

Implementations on Windows MUST use the equivalent attribute-based probe (`FILE_ATTRIBUTE_RECALL_ON_OPEN`, `FILE_ATTRIBUTE_RECALL_ON_DATA_ACCESS`) instead of `Blocks: 0`, which is a Unix-only signal.

### 3.2 Probe budget and caching

- Per-document probe runs once per pipeline cycle (or on UI hover for the Documents tab).
- Result is cached on the path entry with `last_materialization_state` + `last_materialization_check_at` (per §6 schema additions).
- Cache TTL: configurable, default 5 minutes for active documents, 24 hours for archived. After TTL, next access re-probes.
- Webhook notifications from DOC16 16.7 (Graph file-modification webhooks) invalidate the cache for OneDrive / SharePoint paths.

### 3.3 What the pipeline does with the result

| Probe state | Pipeline action |
|---|---|
| `materialized` | Proceed to conversion (V2.0 §11.X). |
| `placeholder` | Invoke materialization trigger sequence (§4). On success, re-probe and proceed. On final failure, hard-fail with `materialization_pending_or_failed` (§5). |
| `network_unreachable` | Invoke remote-access handling (§7). On success, proceed. On failure, hard-fail with `network_unreachable_or_offline`. |
| `permission_denied` | Hard-fail with `permission_denied`. Surface to user with re-auth affordance per §8. |
| `stale` | Existing §13 V2.0 auto-repath flow runs. If repath fails, hard-fail with `path_stale_no_alternate`. |
| `unknown` | Treat as `placeholder` and run trigger sequence. If trigger sequence's full read succeeds, transition cached state to `materialized`. |

---

## 4. Materialization Trigger Sequence

Ordered from cheapest to most expensive. Each step is bounded; failure escalates to the next.

### 4.1 Step 1: Filesystem-native trigger (cheap, in-process)

**macOS (APFS file provider):**

```c
#include <sys/xattr.h>

// Sets the materialize hint on the file. The OS calls back into the file
// provider daemon (OneDrive, iCloud, etc.) which downloads the bytes.
// The attribute name is as proposed in this addendum; validate it against
// the current File Provider behavior before relying on it.
setxattr(path, "com.apple.fileprovider.materialize", NULL, 0, 0, 0);
```

Then a blocking read of one byte waits for completion, bounded by a configurable timeout (default 30s for files <50MB, 120s for larger).

**Windows:** The goal is a hydrated file with `FILE_ATTRIBUTE_RECALL_ON_DATA_ACCESS` cleared. Use `CopyFileEx` with the `COPY_FILE_OPEN_AND_COPY_REPARSE_POINT` flag, or the Cloud Files API (`CfHydratePlaceholder`).

**Linux (FUSE-mounted cloud syncs):** Provider-specific. Most FUSE-based clients respond to a normal blocking read. If not, this step is a no-op and we fall to step 2.

### 4.2 Step 2: User-space provider trigger via AppleScript / IPC (mid-cost)

If step 1 returns failure or times out, send the provider's user-space daemon an explicit "download this file" command. For macOS OneDrive:

```applescript
tell application "System Events"
  tell process "OneDrive"
    -- Provider-specific. As of OneDrive Mac client v25.X, the
    -- correct invocation is via the file's Finder URL plus the
    -- "Always keep on this device" command. Spec leaves the exact
    -- syntax to implementation; this addendum requires only that
    -- such a trigger exists.
  end tell
end tell
```

Bounded by the same timeout as step 1.

This step is best-effort. If the provider doesn't expose a scriptable trigger, step 2 is skipped.

### 4.3 Step 3: Cloud provider API fallback (most expensive, network-bound)

If steps 1 and 2 fail or aren't available, invoke the provider's cloud API to fetch bytes directly. This sidesteps the local sync state entirely.

**Microsoft Graph (OneDrive Personal, OneDrive for Business, SharePoint):** route through DOC16 16.7's existing infrastructure.

- `OneDrivePathResolver` (DOC20 Addendum B §3) resolves the local path to `(account, drive_id, item_id)`. It already exists for the "Open in Word Online" flow per DOC16 16.7 §F. This addendum extends its use to materialization.
- Graph API call: `GET /drives/{drive_id}/items/{item_id}/content` returns the file bytes.
- Per DOC16 16.7 §K, Graph downloads land in the configured working directory (`copy_to_working_dir` mode). DOC25 reads from the working-dir copy, not the original path.
- The original path's `materialization_state` remains `placeholder`, but the document's `working_copy_path` field (§6) holds the readable local path. DOC25 transparently uses the working copy for downstream conversion.

**Google Drive, Dropbox, iCloud:** equivalent provider API calls. Out of scope to fully specify in V1; the architecture is identical (resolver → API call → working-dir copy → DOC25 reads). A later revision of this addendum may flesh out specific providers as they're prioritized.

### 4.4 What "success" means

A step succeeds when a subsequent re-probe returns `materialized` (steps 1, 2) OR when a `working_copy_path` is set and readable (step 3). Either way, the pipeline can proceed to conversion.

### 4.5 Final failure

If all three steps fail, hard-fail with `materialization_pending_or_failed`. Reasons are preserved in the failure record for user diagnosis (provider unreachable, daemon unresponsive, item not found via Graph, OAuth scope missing, etc.).

---

## 5. Hard-Fail Reason Code Vocabulary (extension to §14 V2.0)

Add the following reason codes to DOC25 V2.0 §14's hard-fail vocabulary. Existing codes are unchanged.

| Code | Meaning | Suggested user remediation |
|---|---|---|
| `materialization_pending_or_failed` | Bytes not local; trigger sequence (§4) exhausted without success. | Right-click "Always keep on this device" in Finder, OR re-auth Graph credentials, OR check provider daemon health. |
| `network_unreachable_or_offline` | Network mount / remote-host path; reachability test failed. | Verify VPN connection, Tailscale state, mount status. |
| `permission_denied` | Read returned EACCES; stat OK but bytes inaccessible. | Re-auth (Graph token expired) or check ACL. |
| `path_stale_no_alternate` | §13 auto-repath attempted and failed; no other known path holds same content hash. | Manual relink via Documents tab. |
| `provider_api_failed` | Step 3 cloud API call failed for non-auth reasons (rate limit, throttling, item moved server-side, etc.). | Surface error; usually self-resolves. |

These codes integrate with the existing `IngestionQualityReport` (V2.0 §15 / E1) without breaking the schema. They are additional discrete values for the existing `reason_code` enum, which DOC73 V1.4.1 §15.2 reads from.

---

## 6. Schema Extensions to `document_original_paths` (V2.0 §12.2)

Backward-compatible additions. DOC73 V1.4.1 §15.2's consumer contract is unchanged; new fields are optional from the consumer perspective.

```sql
-- V2.0 §12.2 columns preserved unchanged.
-- Additions:

ALTER TABLE document_original_paths ADD COLUMN sync_provider TEXT;
-- enum: 'local' | 'onedrive_personal' | 'onedrive_business' | 'sharepoint'
--     | 'icloud' | 'gdrive' | 'dropbox' | 'box' | 'network_mount' | 'unknown'
-- 'onedrive_business' and 'sharepoint' both route through Microsoft Graph
-- via DOC16 16.7's OneDrivePathResolver. They are kept distinct for UI
-- clarity (a SharePoint team site path looks structurally different from
-- a personal OneDrive for Business path) but share the materialization
-- trigger sequence.

ALTER TABLE document_original_paths ADD COLUMN sync_provider_subtype TEXT;
-- For 'network_mount': 'smb' | 'afp' | 'nfs' | 'webdav' | 'sshfs' | 'other'
-- For 'sharepoint': site identifier or 'personal' for ODfB
-- For 'icloud': 'optimized' | 'always_keep' | 'unknown'
-- Free-form for forward compatibility.

ALTER TABLE document_original_paths ADD COLUMN sync_provider_stable_id TEXT;
-- Provider-side stable identifier:
--   Microsoft Graph: '{drive_id}/{item_id}'
--   Google Drive:    '{file_id}'
--   iCloud:          '{cloud_doc_id}'
--   network_mount:   '{server_uri}#{stable_path_or_inode}'
-- Used for direct cloud API access (§4 step 3) and webhook subscriptions.

ALTER TABLE document_original_paths ADD COLUMN cloud_drive_id TEXT;
-- Disambiguates multiple accounts (e.g., personal OneDrive vs firm OneDrive
-- for the same user). Maps to a DOC16 16.7 account record.

ALTER TABLE document_original_paths ADD COLUMN last_materialization_state TEXT;
-- Cached probe result per §3.

ALTER TABLE document_original_paths ADD COLUMN last_materialization_check_at TEXT;
-- ISO timestamp.

ALTER TABLE document_original_paths ADD COLUMN working_copy_path TEXT;
-- Set when §4 step 3 (cloud API fallback) wrote a local working copy.
-- DOC25 conversion reads from working_copy_path when set, otherwise from
-- the original path. The working copy is content-hashed and content-equal
-- to the original; it's a read-side cache, not a divergent copy.

ALTER TABLE document_original_paths ADD COLUMN working_copy_expires_at TEXT;
-- Working copies are evictable. After expiry without re-use, the working
-- copy is deleted; next access re-fetches via §4 step 3.
```

Equivalent additions apply to the in-memory shape used by the §11 pipeline.

---

## 7. Network Mount Handling (firm-internal ELNOR / remote operation)

Will's use case: an ELNOR instance running on the firm's LAN with files on a firm file server (`/Volumes/SchallFirm-Cases/` or similar SMB share), accessed remotely (Will at home over VPN / Tailscale). This is structurally different from cloud sync:

- No placeholder concept. Bytes are either reachable over the network or not.
- No materialization trigger. The closest analog is "is the network reachable + is the mount alive."
- Failure modes: `host_unreachable`, `mount_dropped`, `slow_read`, `permission_denied` (creds expired or ACL changed).

`sync_provider: 'network_mount'` covers this. Subtypes (`smb`, `afp`, `nfs`, `webdav`, `sshfs`) are documented for forward compatibility, but the materialization logic is the same: probe via stat + 1-byte read, treat `EHOSTDOWN` / `ETIMEDOUT` / `ECONNREFUSED` as `network_unreachable`, and surface a re-mount affordance to the user.

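
The stat-with-timeout pattern for network mounts can be illustrated with a small sketch. It is a minimal example under stated assumptions, not the specified implementation: the names `MountState`, `classifyMountError`, and `checkMountReachable` are hypothetical, the stat function is injected so the sketch carries no filesystem dependency, and mapping every unrecognized error to `host_unreachable` is a simplification. The 3-second default timeout is the one §7.1 mentions.

```typescript
type MountState = "reachable" | "mount_missing" | "host_unreachable" | "permission_denied";

// Maps an errno-style code to a mount state. ENOENT means the mount point
// is gone; EACCES / EPERM mean credentials or ACLs changed. Treating every
// other code (ETIMEDOUT, EHOSTDOWN, ECONNREFUSED, ...) as host_unreachable
// is an assumption of this sketch.
function classifyMountError(code: string | undefined): MountState {
  if (code === "ENOENT") return "mount_missing";
  if (code === "EACCES" || code === "EPERM") return "permission_denied";
  return "host_unreachable";
}

// Stats the mount root, giving up after timeoutMs. statFn is injected;
// in Node one could pass fs.promises.stat.
async function checkMountReachable(
  mountRoot: string,
  statFn: (path: string) => Promise<unknown>,
  timeoutMs: number = 3000,
): Promise<MountState> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(Object.assign(new Error("stat timed out"), { code: "ETIMEDOUT" })),
      timeoutMs,
    );
  });
  try {
    // Whichever settles first wins: the stat result or the timeout rejection.
    await Promise.race([statFn(mountRoot), timeout]);
    return "reachable";
  } catch (err) {
    return classifyMountError((err as { code?: string } | null)?.code);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

The race only unblocks the caller; a stat that hangs on a dead mount may still occupy a worker thread until the OS gives up. The "mount appears empty when it shouldn't be" check is not covered here.
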
### 7.1 Reachability probe

For `network_mount` paths, a pre-pre-flight check runs before the §3.1 probe:

```typescript
async function probeMountReachability(mount_root: string): Promise<{
  state: 'reachable' | 'mount_missing' | 'host_unreachable' | 'permission_denied';
}> {
  // Cheap: stat the mount root. If it returns ENOENT or the mount appears
  // empty when it shouldn't be, the mount is dropped.
  // If stat hangs (default 3s timeout), the host is unreachable.
  throw new Error("not implemented: body is left to the implementation");
}
```

This runs once per mount per pipeline cycle, not per document. The result is cached at the mount level so a 5,000-document corpus on the firm share doesn't perform 5,000 redundant reachability probes.

### 7.2 ELNOR-to-ELNOR remote API (forward-looking, V2 of this addendum)

A more robust path than relying on raw network mounts: the firm-internal ELNOR instance exposes a small authenticated API for "give me document X's bytes." Will's home ELNOR instance authenticates and fetches via that API rather than over an SMB mount that may go stale.

This is structurally similar to the cloud-API fallback (§4 step 3): a different network endpoint, the same shape (resolver, fetch, working-dir copy). Out of scope for V1 of this addendum; flagged for V2 once Will has decided whether to run one or two ELNOR instances.

### 7.3 What the addendum specifies for V1.x

- `sync_provider: 'network_mount'` is a recognized value.
- `network_unreachable_or_offline` is a recognized hard-fail reason.
- The probe and remediation pattern is documented.
- The remote ELNOR-to-ELNOR API path is identified as future work.

---

## 8. UI Surfaces (extensions to V2.0 §19)

### 8.1 Documents tab

Per-document state column extends to show materialization. Visual treatment:

- ● Materialized (green): readable now.
- ◐ Placeholder (amber): cloud-only, will be auto-fetched on next access.
- ◯ Stale (red): path broken; click to relink.
- ⚠ Permission denied (red): re-auth required.
- ⊘ Network unreachable (red): mount or VPN down. Hover for cause string.

Per-document context menu adds:

- "Bring back from cloud": invokes the §4 trigger sequence inline. Useful when the user wants to pre-stage a document before running corpus extraction.
- "Open original in Finder": for diagnosis.
- "Re-probe state": bypasses cache.

### 8.2 Corpus dashboard

Each corpus shows aggregate materialization status:

```
securities_mtd_oppositions (47 members)
  ● 29 materialized
  ◐ 12 cloud-only (auto-fetch on extraction)
  ⚠ 3 permission denied (re-auth needed)
  ⊘ 3 network unreachable
  [Pre-stage all members] [Run extraction now]
```

"Pre-stage all members" walks the materialization trigger sequence for every member with state `placeholder`. Useful before running a long extraction batch, since it avoids mid-run stalls.

This is exactly the visibility the V2.0 §15 `IngestionQualityReport` was supposed to provide for silent degradation; the addendum extends its surface area to cover materialization.

### 8.3 Settings (V2.0 §19)

Add a "Sync providers" panel under Document Intelligence settings:

```
Sync Providers

  Microsoft 365 (Graph API)
    Firm OneDrive (wbrody@schallfirm.com): ● Connected, last refreshed 2m ago
    Personal OneDrive: ● Connected, last refreshed 1h ago
    SharePoint sites: 4 indexed
    [Re-auth firm] [Re-auth personal]

  Network Mounts
    /Volumes/SchallFirm-Cases (SMB, server smb://files.schallfirm.local):
      ● Reachable · Last probe: 5m ago
    [Add mount] [Re-probe all]

  Other Providers
    iCloud Drive: not configured
    Google Drive: not configured
    Dropbox: not configured

  Materialization Behavior
    ☑ Auto-fetch placeholders on first access
    ☑ Pre-stage corpus members before batch extraction
    ☐ Pre-pin all corpus members to local sync state at write time
    [Cache TTL: 5 minutes (active) / 24 hours (archived)]
```

The "Pre-pin all corpus members" toggle is the cross-cutting fix described in §9. When on, files written into a sync-managed folder by ELNOR plugins (PACER, OneDrive watcher, etc.) are marked locally-pinned at write time so they don't drift back to placeholder state.

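
The aggregate counts in the §8.2 corpus dashboard, and the member selection behind "Pre-stage all members", might be derived along these lines. This is a sketch under assumptions: the `CorpusMember` shape and both function names are hypothetical, and `stale` is included alongside the §2 states because the §3 probe can report it.

```typescript
type MemberState =
  | "materialized" | "placeholder" | "network_unreachable"
  | "permission_denied" | "stale" | "unknown";

// Hypothetical per-member view: canonical id plus last cached probe state.
interface CorpusMember {
  document_id: string;
  state: MemberState;
}

// Counts members per state for the dashboard summary lines.
function summarizeCorpus(members: CorpusMember[]): Record<MemberState, number> {
  const counts: Record<MemberState, number> = {
    materialized: 0,
    placeholder: 0,
    network_unreachable: 0,
    permission_denied: 0,
    stale: 0,
    unknown: 0,
  };
  for (const member of members) counts[member.state] += 1;
  return counts;
}

// "Pre-stage all members" targets only members currently in placeholder
// state (§8.2); every other state needs a different remediation.
function membersToPreStage(members: CorpusMember[]): string[] {
  return members
    .filter((member) => member.state === "placeholder")
    .map((member) => member.document_id);
}
```

Each returned `document_id` would then be handed to the §4 trigger sequence, with results written back to the cached state.
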
---

## 9. Cross-Doc Obligation: Write-Side Surfaces Pre-Pin Local

Plugins that save files into user-controlled sync folders (PACER plugin V1.2 §6 / §7.3, OneDrive watcher per DOC16 16.7, future RSS / browser saver / email-attachment ingest) MUST pre-pin the file as locally-cached at write time. Otherwise the corpus extraction profile (DOC73 §14) hits placeholder state on its own freshly-written documents.

**macOS:**

```c
#include <sys/xattr.h>

// After writing the file. The attribute name is as proposed in this
// addendum; validate it against the current File Provider behavior.
setxattr(path, "com.apple.fileprovider.AlwaysKeepDownloaded", "1", 1, 0, 0);
// Plus (per provider): provider-specific "always keep" attribute.
```

**Windows:** equivalent file attribute calls.

This is a required behavior; opting out is configurable per the §8.3 settings toggle, but the default is on.

PACER plugin V1.2 will reference this section by adding a one-paragraph requirement to its §6.1 (CM/ECF email auto-capture save flow) and §7.3 (corpus-binding-driven write to OneDrive). Future plugin specs do the same.

---

## 10. `IngestionQualityReport` Extension (V2.0 §15 / E1)

Backward-compatible field additions:

```typescript
interface IngestionQualityReport {
  // ... existing V2.0 fields preserved unchanged ...

  // NEW (this addendum):
  materialization_status?: {
    state: MaterializationState;
    probed_at: string;       // ISO timestamp
    cause?: string;          // human-readable
    trigger_sequence_used?:  // populated when §4 was invoked
      'fs_native' | 'provider_ipc' | 'cloud_api' | 'multiple';
    working_copy_path?: string;
  };
  sync_provider?: string;    // mirrors document_original_paths
  sync_provider_stable_id?: string;
}
```

The corpus-aggregate version surfaces "12 cloud-only / 3 permission denied / 3 network unreachable" per the §8.2 dashboard.

DOC73 V1.4.1 §15.2 reads the existing report shape; the additions are optional and don't break consumers. DOC73 V1.5 may expose a corpus-level summary view that reads the new fields, but that's a DOC73 V1.5 enhancement, not a DOC73 obligation triggered by this addendum.

---

## 11. Runtime Tool: `ensure_materialized` (extension to V2.0 §16)

A new LLM-callable tool joins the V2.0 §16 retrieval tool family. Same shape as `retrieve_document_pages`: scoped, rate-limited, sub-agent-callable.

```typescript
// Tool registered through DOC24 R2.5.
declare function ensure_materialized(args: {
  document_id: string;      // The DOC25 canonical document_id.
  timeout_seconds?: number; // Default 60.
  prefer_local?: boolean;   // Default true; if false, skip §4 steps 1-2 and go directly to cloud API.
}): Promise<{
  state: MaterializationState;
  trigger_sequence_used?: string;
  working_copy_path?: string;
  cause?: string;
}>;
```

Use cases:

- DOC73 corpus extraction calls `ensure_materialized` for each member before the extraction profile runs. Inline single-doc materialization.
- "Pre-stage all members" UI button (§8.2) iterates members and calls `ensure_materialized` for each.
- Specialist sub-agents that read documents on behalf of LLM conversations call this before reading.

The call resolves immediately when the file is already materialized (no-op). Otherwise it blocks up to `timeout_seconds` while the trigger sequence runs.

---

## 12. Brief-Bank-Class Consumers (clarifying §1)

Some consumers need full text + OCR + metadata for every document they touch. Brief banks are the canonical example: a brief bank corpus (DOC73 `knowledge_corpus` with `extraction_profile: securities_litigation_briefs`) needs to extract argument structure, citations, and fact patterns from every member, plus OCR if any briefs are scanned. This is heavier than the chat-attachment use case, where DOC25 might surface just a structured summary.

DOC25 V2.0 §9 already specifies the OCR pipeline (R1 §3 absorbed). DOC25 V2.0 §10 already specifies MarkItDown as the universal text extraction backend, with the §10.3 profile-routed Docling vs. MarkItDown hybrid for documents that benefit from Docling's stronger handling of complex tables, multi-column layouts, and image-heavy content.
**The capability is present; the missing piece is reliable byte access.** That's what this addendum delivers.

For brief-bank corpora specifically:

- The corpus extraction profile (DOC73 §14) calls `ensure_materialized` for each member as the first step of extraction. Until it returns `state: materialized` (or sets a `working_copy_path`), extraction is paused for that document.
- The pre-stage button (§8.2) lets Will warm up the entire corpus locally before running batch extraction, so the run doesn't pause partway through to fetch from cloud.
- OCR (V2.0 §9) runs against the materialized bytes, not against the original path, because for cloud-API fallback (§4 step 3) the bytes live in the working copy.
- Full-text indexing into LlamaIndex (DOC18 R2) reads the same materialized bytes / working copy.

### 12.1 Docling/MarkItDown routing is unaffected by materialization (V1.1 clarification)

The DOC25 V2.0 §10.3 profile-routed Docling vs. MarkItDown hybrid runs **after** bytes are accessible. The router selects the converter based on the document's classified type (text-native PDF / Word / HTML → MarkItDown; scanned / image-heavy / table-heavy filing → Docling) regardless of whether the bytes live at the original path or in `working_copy_path`. Reading order is unchanged from V2.0:

```
probe (§3) → trigger if needed (§4) → route per §10.3 → convert → chunk → index
```

The materialization layer is upstream of the routing decision, not a peer of it. A brief that comes through cloud-API fallback gets the same Docling vs. MarkItDown treatment as a brief read straight from a local path with the same content.

### 12.2 DOC18 chunk-index lifecycle is independent of working-copy lifecycle (V1.1 clarification)

Chunks produced by DOC18 LlamaIndex indexing are stored under the canonical `document_id` (DOC25 §17), **not** under `working_copy_path`. This has two important consequences:

1. **Working-copy eviction does not invalidate the chunk index.** When `working_copy_expires_at` fires (§6) and the working copy is deleted, the chunks remain valid for retrieval. Subsequent semantic searches (Search Architecture Reference R1 Layer A) hit the existing chunk index without re-indexing.
2. **Re-indexing is required only on `raw_file_hash` change.** If the document's underlying content changes (detected via the existing V2.0 §13 cache-invalidation rules and re-hash on next access), both the chunk index and any prior conversion artifacts are versioned per the existing E8 versioned-immutable-artifacts pattern. Materialization cycling alone (placeholder → working copy → eviction → re-fetch with same bytes) does NOT trigger re-indexing.

This is the operationally important property: brief banks can be read repeatedly, with working copies materializing and evicting on a fast cycle, without paying re-indexing cost. The chunk index is the durable artifact; the working copy is a transient byte-access mechanism.

This satisfies Will's stated requirement: full metadata + OCR'd text search for every doc, including OneDrive / SharePoint / network-mount sources, with predictable cost characteristics under repeated access.

---

## 13. Outside-Corpus Use Cases (clarifying §1)

The addendum operates at the document level. Documents not in any corpus (chat attachments, ambient ingestion, demonstration captures, browser saves, EDGAR pulls without a target corpus) get the same materialization treatment. Specifically:

- A user drops a OneDrive-Personal placeholder into a chat: §3 probe fires; if placeholder, §4 trigger runs; once materialized, V2.0 §22 (chat attachment handling) proceeds normally.
- A demonstration capture (DOC3) records a path; on later replay, materialization probe + trigger fires before the recorded extraction step.
- An EDGAR pull lands a 10-K in a OneDrive-synced folder: §9 (write-side pre-pin) ensures it's pinned local at write time; later corpus binding evaluations (per `DOC73_CORPUS_SOURCE_BINDINGS_PROPOSAL_V1.md`) see it as already materialized.

No consumer-side change required. DOC25 transparently delivers the bytes regardless of corpus participation.

---

## 14. Cross-Doc Obligations

| Doc | Obligation |
|---|---|
| DOC25 V2.1 | Absorbs this addendum. Sections affected: §11 (pipeline pre-flight probe), §12.2 (schema), §14 (state machine + reason codes), §15 (quality report extension), §16 (`ensure_materialized` tool), §19 (settings + Documents tab + corpus dashboard). |
| DOC16 16.7 | No change. Addendum hooks into existing `OneDrivePathResolver` and Graph API plumbing. The `OneDrivePathResolver` may need to be exported as a callable from outside DOC16 16.7's "Open in Word Online" use case to be reachable from DOC25's §4 step 3; likely a one-line export change, not a spec change. |
| DOC73 V1.5 | No required change. The consumer-contract fields added in §10 of this addendum are additive; DOC73 inherits transparently. DOC73 V1.5 may add corpus-dashboard hooks (per §8.2) but that's an enhancement, not an obligation from this addendum. |
| DOC72 | No change. Document entity nodes hydrate from `DocumentEntity` (DOC25 §17) which is unchanged at the consumer-contract level. |
| DOC24 R2.5 | One new tool registration (`ensure_materialized` per §11 of this addendum). Otherwise no change. |
| EC Core Addendum A | The §3.1 probe and §4 trigger sequence run inside the existing intake pipeline and use existing EC capacity-lease + throttle (V2.0 §15). No EC Core spec change. |
| DOC18 R2 | No required change. V1.1 §12.2 clarifies that chunks are stored under `document_id`, which matches DOC18 R2's existing identification model. The clarification is documentary; no DOC18 schema change. |
| DOC20 R4.3 | No required change; the Documents tab UI extensions (§8.1) are a Q-side rendering concern. If the Documents tab spec is owned in DOC20, that spec may need a one-paragraph addition for the new state column. |
| DOC23 | No change. Task module outputs flowing into DOC25 inherit transparently. |
| PACER plugin V1.2 (forthcoming) | MUST reference §9 of this addendum and apply pre-pin-on-write to all files saved into OneDrive-synced folders. |
| OneDrive watcher (DOC16 16.7) | MUST reference §9 of this addendum. |
| Future cloud-sync-backed plugins (RSS, browser saver, email attachment ingest) | MUST reference §9. |

---

## 15. Red-Team Targets

For fresh-window review, the following points warrant explicit attention:

1. **Probe budget realism (§3).** Two layers (stat + 1-byte read) per document. For a 5,000-document corpus per E18, that's 5,000 stats + up to 5,000 1-byte reads at pipeline-start. Likely well under a second total on local disk; potentially expensive over a slow SMB mount. Consider whether layer B should be opt-in for `network_mount` paths.
2. **Cache TTL correctness (§3.2).** 5 minutes for active / 24 hours for archived. May be wrong defaults; worth empirical tuning. Webhook-driven invalidation should make most cache staleness self-correcting in practice.
3. **AppleScript / IPC trigger reliability (§4 step 2).** OneDrive Mac client behavior changes between versions. The spec deliberately leaves the exact invocation to implementation, but a brittle implementation here means we fall through to step 3 (cloud API) more often than necessary. Worth validating against the current OneDrive Mac client.
4. **Working-copy storage growth (§4.3, §6).** Cloud-API fallbacks land in working dir. For a corpus of 5,000 ~5MB briefs that's 25GB of working copies, which is real. Eviction policy via `working_copy_expires_at` is specified; correctness of the eviction (LRU? most-likely-needed? user-configurable?) deserves scrutiny. V1.1 §12.2 clarifies that the chunk index survives eviction, so re-fetching after eviction is bytes-only, not bytes-plus-reindex.
5. **Permission/scope drift (§5 `permission_denied`).** Re-auth flow needs to be explicit and obvious. Otherwise a single token expiry silently breaks materialization for the entire firm OneDrive corpus until someone notices the dashboard showing red. Worth specifying a proactive refresh policy.
6. **Concurrent materialization (§4).** If 50 documents in a batch all hit step 3 simultaneously, that's 50 Graph API calls in parallel. Need rate limiting against Graph throttling. Existing EC capacity-lease / throttle (V2.0 §15) may or may not be sufficient; worth verifying.
7. **Content equality of working copies (§4.3).** The cloud API fetch might return a *different* version than the local placeholder represents (e.g., if there's an unsynced server-side change). Recompute `raw_file_hash` on the working copy and compare to the document's known hash. If they differ, this is actually a "newer version exists" signal, not just materialization; it may be worth surfacing as an explicit user-visible "this document has been updated server-side" notification. V1.1 §12.2 also notes that hash-change is what triggers re-indexing.
8. **Network-mount detection precision (§7).** macOS `/Volumes/X` paths obviously look like mounts; Linux `/mnt/X` similarly; but mounts can be at any path. The schema's `sync_provider: 'network_mount'` requires reliable detection. Possibly consult `mount(8)` output at probe time. Spec leaves the implementation choice open, but a clearer detection algorithm is worth specifying.
9. **The PACER plugin V1.2 obligation (§9).** Pre-pin-on-write for files landing in OneDrive folders means the PACER plugin needs to call `setxattr` (or Windows equivalent) at every write. Spec says "MUST" but the cost of accidentally not doing it is invisible until extraction time. Consider adding a smoke test in DOC25 V2.1 that periodically checks corpus members for placeholder state and surfaces a warning if newly-written files drift back.
10.
**iCloud-specific behavior.** macOS iCloud Drive uses its own optimization model that's distinct from OneDrive. The `setxattr` API may be the same but behavior differs. Spec is OneDrive-tested; iCloud should be empirically validated before claiming support. --- ## 16. Versioning and Filename Discipline This is V1.1, superseding V1 (`DOC25_FILE_MATERIALIZATION_AND_PROVIDER_PROFILES_PROPOSAL_V1.md`, 2026-04-27). V1 is preserved per the post-absorption versioning rule — never re-edited. V1.1 adds the two §12 clarifications (DOC18 chunk lifecycle and Docling/MarkItDown routing order); architecture and schema are unchanged from V1. When this is absorbed into DOC25 V2.1 (or later operative version), V1.1 is archived alongside V1. Future revisions to file-materialization concepts author a fresh proposal against the absorbed §X owner text in DOC25, not a re-edit of this V1.1 proposal. If review reveals structural changes are needed before absorption, V1.2 (or V2 for breaking changes) of this proposal is authored as a new file, never as a re-edit of V1.1.
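
As a non-normative illustration of the §12.2 rule and red-team target 7 (§15), the hash comparison after a working copy is fetched could look roughly like the sketch below. Everything named here is hypothetical and not part of DOC25: the `DocumentRecord` shape, the `Action` values, the function names, and the choice of SHA-256 for `raw_file_hash` are assumptions made only so the example is self-contained.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical outcomes; DOC25 does not define these names.
    REUSE_INDEX = "reuse_index"      # same bytes: existing chunk index stays valid
    REINDEX_NEW_VERSION = "reindex"  # bytes changed: version artifacts, then re-index


@dataclass(frozen=True)
class DocumentRecord:
    document_id: str    # canonical key the chunk index is stored under (§12.2)
    raw_file_hash: str  # hash recorded at the last successful ingestion


def sha256_hex(data: bytes) -> str:
    # Assumption: raw_file_hash is a SHA-256 hex digest; the spec may use another hash.
    return hashlib.sha256(data).hexdigest()


def decide_after_materialization(record: DocumentRecord, working_copy_bytes: bytes) -> Action:
    """Compare freshly materialized bytes against the document's known hash."""
    if sha256_hex(working_copy_bytes) == record.raw_file_hash:
        # Materialization cycling alone: no re-indexing cost.
        return Action.REUSE_INDEX
    # Server-side content differs: treat as "newer version exists".
    return Action.REINDEX_NEW_VERSION
```

Under these assumptions, a re-fetch that returns identical bytes maps to `Action.REUSE_INDEX`, while any byte-level difference maps to `Action.REINDEX_NEW_VERSION`, which is also the point where a user-visible "updated server-side" notification could be raised.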