ELNOR REPO READER TEXT MIRROR
Original path: Current Specs/DOC73/DOC73_CORPUS_SOURCE_BINDINGS_PROPOSAL_V1.md
Source repo: /Users/OpenClaw1/Elnor/Elnor Specs
Git branch: main
Git commit: dbaa25962edc11ab30e8d4ca1715f9ae5bf77331
Generated: 2026-06-09T01:23:58.539Z

---

# DOC73 Proposal: Corpus Source Bindings

**Source:** Cross-doc need surfaced during PACER plugin spec V1.1 review (2026-04-27).

**Target:** DOC73 §3 extension (new §3.6 Corpus Source Bindings) for absorption into DOC73 V1.5 (or successor operative version).

**Status:** Proposal (draft), awaiting fresh-window red-team before absorption.

**Date:** 2026-04-27

**Author context:** Will Brody, principal architect.

**Companion:** `EC_CORE_ADDENDUM_A_INTAKE_ROUTING_FOR_CORPUS_BINDINGS_PROPOSAL_V1.md` (EC-side implementation reference).

---

## 1. Purpose

DOC73 V1.4.1 §3.1 specifies the `knowledge_corpus` entity and its lifecycle. V1.4.1 documents three lifecycle paths by which a node can become a corpus member:

1. **Cluster absorption:** automatic promotion when a corpus-shaped cluster crosses thresholds (§4.2).
2. **Suggested-corpora promotion:** user confirms a candidate corpus (§4.2 promote action).
3. **Explicit user tagging:** user manually assigns `corpus_id` to a node.

This addendum adds a fourth path: **direct intake from a bound source**. A scheduled or watched source (PACER monitor, OneDrive folder watcher, RSS feed, calendar sync, email mailbox label, browser-captured page, future plugins) is configured to contribute its output, optionally filtered by content type or predicate, directly into a named `knowledge_corpus`. New items emitted by the source carry the corpus's `corpus_id` scope tag from the moment they are written.

This is a domain-agnostic capability, applicable to any source surface that emits ingestion events through EC. It is not specific to PACER or to the legal domain.
The PACER plugin spec V1.1 §7.3 is the first user of this contract and motivates it; future plugins (OneDrive watcher, RSS, calendar, email, browser saver, third-party plugins) inherit it without re-specifying.

---

## 2. Data Model

### 2.1 The `corpus_source_binding` Node

A binding is itself a first-class DOC72 node. This keeps it durable, queryable, audit-trailable, and consistent with EC-as-sole-writer.

```typescript
interface CorpusSourceBinding {
  binding_id: string;                    // Stable identifier
  node_kind: "corpus_source_binding";    // DOC72 node kind
  entity_subtype: "corpus_source_binding";

  // ─── Targeting ────────────────────────────────────────
  corpus_id: string;                     // Target corpus (DOC73 §3.1)
  source_kind: string;                   // Registered source-kind enum (see §4)
                                         // e.g., "pacer.watched_case",
                                         //       "pacer.scheduled_search",
                                         //       "onedrive.folder_watch",
                                         //       "rss.feed",
                                         //       "calendar.sync",
                                         //       "email.mailbox",
                                         //       "email.label",
                                         //       "browser.captured_page"
  source_id: string;                     // Source-specific stable id
                                         // (e.g., the specific watched case's
                                         // case_id, the specific RSS feed_id,
                                         // the specific OneDrive folder path)

  // ─── Filtering ────────────────────────────────────────
  content_type_filter?: string[];        // Optional whitelist of registered
                                         // content types per DOC20 §6.18.2.
                                         // If set, only events whose
                                         // registered content_type appears
                                         // in this list trigger the binding.
                                         // e.g., for PACER:
                                         //   ["legal.pacer.filing.opposition_brief",
                                         //    "legal.pacer.filing.reply_brief"]
  inclusion_predicate?: string;          // Optional declarative filter
                                         // evaluated at intake. Expressed in
                                         // the existing DOC72 predicate DSL
                                         // (no new language). Examples:
                                         //   "filing.pages > 5"
                                         //   "filing.filed_date >= '2024-01-01'"
                                         //   "case.nature_of_suit == '850'"
                                         // Predicates that fail to evaluate
                                         // are treated as `false` and logged
                                         // to the binding's error log.

  // ─── Behavior ─────────────────────────────────────────
  contribution_mode: "scope_only" | "scope_and_extract";
                                         // scope_only: tag corpus_id on the
                                         // resulting node; defer extraction
                                         // to other corpus-membership paths
                                         // or to user-driven promotion.
                                         // scope_and_extract: tag, then run
                                         // the corpus's extraction_profile
                                         // against the node payload per
                                         // DOC73 §14 (Corpus Extraction
                                         // Architecture).
                                         // Default: "scope_only" (safer:
                                         // respects per-corpus trust
                                         // posture without doubling cost).

  // ─── State ────────────────────────────────────────────
  active: boolean;                       // Disabled bindings are inert but
                                         // preserved for audit and re-enable.
  created_at: string;
  created_by: "user" | "system" |        // "user" = manually configured
              "suggested_acceptance";    // "suggested_acceptance" = user
                                         // accepted a system-suggested
                                         // binding via §6.3.
  last_contribution_at?: string;
  contribution_count: number;            // Total members produced by this binding.
  last_evaluation_error?: {              // Most recent predicate or filter
    timestamp: string;                   // evaluation failure, if any.
    error: string;
  };
}
```

### 2.2 Edges

Two edge kinds, both DOC72 native:

- `binding_targets_corpus`: from the `corpus_source_binding` node to the `knowledge_corpus` node. Cardinality: each binding targets exactly one corpus.
- `binding_consumes_source`: from the `corpus_source_binding` node to the source entity (which is itself a DOC72 node: a watched-case `world_entity`, a scheduled-search `procedure`, a OneDrive folder `work_product`, etc.). Cardinality: each binding consumes exactly one source.

Member nodes produced via a binding carry the `corpus_id` scope tag per DOC73 §3.1. They do NOT carry an edge back to the binding; the binding's contribution is recorded via `contribution_count` and via the audit trail in EC's JSONL log, not via a per-member edge. This avoids edge-table bloat for high-throughput sources.

### 2.3 Cardinality

- A given **corpus** may have N bindings (multi-source feed).
E.g., the `firm_brief_bank` corpus might be fed by both `pacer.watched_case` bindings (live cases) and `onedrive.folder_watch` bindings (historical briefs migrated from OneDrive).
- A given **source** may participate in M bindings (one source can feed several corpora). E.g., a single PACER watched case might feed both `securities_mtd_oppositions` (filtered to opposition briefs only) AND `firm_active_litigation` (all substantive filings) simultaneously.
- A given **content type** can be permitted by overlapping `content_type_filter` lists across bindings; there is no requirement that filters be disjoint. A single inbound event may satisfy multiple bindings and produce multi-corpus membership on the resulting node.

---

## 3. Intake Protocol (EC-side)

When EC processes an intake event from any source, the binding lookup is integrated into the existing intake pipeline. EC remains the sole writer; the binding lookup adds a routing step, not a new write path.

### 3.1 Sequence

1. Plugin or surface emits an intake event to EC carrying `(source_kind, source_id, content_type, payload)`.
2. EC normalizes the event into the appropriate intake contract per DOC72 §20A (e.g., `intake.pacer.filing`, `intake.onedrive.file`, `intake.rss.item`).
3. **Before** writing the resulting node, EC queries the binding table for bindings matching `(source_kind, source_id)` with `active = true`.
4. For each matching binding, in order of `created_at`:
   1. If `content_type_filter` is set, verify that the event's registered `content_type` appears in it; if it does not, skip this binding.
   2. If `inclusion_predicate` is set, evaluate it against the event payload; if evaluation fails, log to `last_evaluation_error` and treat as no-match.
   3. If both checks pass, append `corpus_id` to the resulting node's scope-tag set (multi-corpus permitted; deduplicated).
5. EC writes the resulting node with the accumulated scope tags. Single write, multiple corpus memberships.
6. For each fired binding with `contribution_mode: "scope_and_extract"`, EC enqueues a corpus-scoped extraction task per DOC73 §14 (Corpus Extraction Architecture), bound to that corpus's `extraction_profile`. Multiple `scope_and_extract` bindings on the same event produce multiple extraction tasks (one per binding).
7. For each fired binding, EC updates `last_contribution_at` and increments `contribution_count`. These are durable writes via EC; the binding node is mutable in metadata only.

### 3.2 EC-as-sole-writer compliance

The plugin / source emits an event. EC reads the binding table, evaluates filters, and writes both the resulting domain node AND the binding metadata updates. The plugin does NOT touch the binding table. This preserves the "EC sole writer" invariant per DOC72 R5.73.

### 3.3 Failure modes

| Failure | Handling |
|---|---|
| Binding refers to a `corpus_id` that no longer exists (corpus deleted) | Auto-deactivate the binding; surface a "rebind to..." prompt in the corpus admin UI on next user session. |
| Binding's source entity no longer exists (source deleted) | Same: auto-deactivate, prompt user. |
| `inclusion_predicate` evaluation throws | Log to `last_evaluation_error`, treat as no-match for this event, continue with other bindings. Surface to user only after N consecutive failures (default 5) to avoid noise. |
| `content_type` is unregistered | The event itself fails DOC20 §6.18.2 validation upstream of binding evaluation. Bindings never see it. This is by design: the binding system trusts content-type registration. |
| Multi-corpus membership produces conflicting trust postures (one corpus is `aggressive_auto_commit`, another is `review_before_commit`) | The most restrictive trust posture wins for this node. Member appears as a candidate in the restrictive corpus and as confirmed in the permissive corpus, with an explanatory note. |
| Concurrent intake of the same event from two sources (deduplication) | Out of scope for this addendum; handled by existing DOC25 deduplication logic. The binding lookup is idempotent on the deduplicated node. |

---

## 4. Source-Kind Registration

Each source that wants to participate in bindings must declare its `source_kind` enum and the content types it can emit. This is part of the surface intake contract obligations per DOC72 §20A and content-type registration per DOC20 §6.18.2.

A `source_kind` registration record:

```typescript
interface SourceKindRegistration {
  source_kind: string;                   // e.g., "pacer.watched_case"
  owning_plugin_or_doc: string;          // e.g., "plugin:pacer", "doc:doc16"
  human_label: string;                   // e.g., "PACER watched case"
  description: string;
  emits_content_types: string[];         // Whitelist of content types this
                                         // source can emit. Used by the
                                         // binding UI to populate
                                         // `content_type_filter` choices.
  example_source_ids?: string[];         // For docs / debugging.
}
```

V1.0 source-kind registrations to ship with this addendum's absorption (subject to V1.2 plugin specs declaring them):

| `source_kind` | Owner | Emits |
|---|---|---|
| `pacer.watched_case` | plugin:pacer | All `legal.pacer.filing.*` content types |
| `pacer.scheduled_search` | plugin:pacer | All `legal.pacer.filing.*`, `legal.pacer.case.*` |
| `onedrive.folder_watch` | doc:doc16 | All `file.*` content types matched by folder filter |
| `rss.feed` | plugin:rss (forthcoming) | `web.feed.item` |
| `calendar.sync` | doc:doc11 (or plugin) | `calendar.event` |
| `email.mailbox` | doc:doc16 | `email.message` |
| `email.label` | doc:doc16 | `email.message` filtered by label |
| `browser.captured_page` | plugin:browser (forthcoming) | `web.captured_page` |

Source-kind registrations are themselves DOC72 nodes (`node_kind: "source_kind_registration"`) for auditability and to support the binding UI's source picker.

---

## 5. Trust Posture Interaction (R5.3 A2)

Per-corpus trust posture (DOC73 §3.1) governs whether new members are auto-confirmed or land as candidates awaiting review. Bindings inherit and respect this:

- **Corpus has `aggressive_auto_commit`:** Members arriving via bindings are auto-confirmed immediately. Use case: a personal reading-list corpus fed by a browser-saver binding, low-stakes, user trusts the source.
- **Corpus has `normal_auto_commit`:** Members arriving via bindings are auto-confirmed after the corpus's standard delay (default per DOC73 §3.1). Members appear in the corpus immediately but are flagged "pending confirmation" until the delay elapses, with a one-click revoke.
- **Corpus has `review_before_commit`:** Members arriving via bindings land as **candidates**. They appear in the corpus admin UI under a "Pending review" section. The user confirms (member becomes part of the corpus), edits (modify scope tags or extraction targets before confirming), or rejects (tombstone the candidate, optionally feed back as a binding tuning signal).

The PACER `securities_mtd_oppositions` use case from PACER spec V1.1 §7.3 is naturally `review_before_commit` for a high-stakes legal corpus: extraction sits as candidates until Will confirms. A casual `interesting_articles` corpus fed by a browser saver might be `aggressive_auto_commit`.

This interaction matches Will's CLAUDE.md failure-mode concern: **"silent auto-promotion of weak signals"**. Bindings cannot bypass trust posture. A binding into a `review_before_commit` corpus produces a queue, not silent ingestion.

---

## 6. UI Surfaces

Two equivalent entry points; both write to the same DOC72 `corpus_source_binding` table.

### 6.1 Source-side configuration (the common case)

Each source's own configuration UI offers a "Feed corpus" picker. PACER spec V1.1 §7.3 shows the per-watched-case version.
The pattern is:

```
┌─ CORPUS CONTRIBUTION ───────────────────┐
│ Feed items to corpus(es):               │
│ ☑ {corpus_a}                            │
│     Filter: {content_type or "all"}     │
│ ☐ {corpus_b}                            │
│     Filter: {content_type or "all"}     │
│ [+ Add binding]   [Manage in DOC73]     │
└─────────────────────────────────────────┘
```

The picker is populated from the user's existing corpora plus a "[+ Create new corpus]" option that hands off to DOC73's corpus admin (creates the corpus, returns to source UI with the new corpus pre-selected). The "Filter" dropdown is populated from the source's `emits_content_types` registration.

### 6.2 Corpus-side configuration

DOC73's corpus admin UI shows a "Source Bindings" panel for each corpus, listing every binding feeding the corpus across every source:

```
{corpus_name}: Source Bindings (3 active)

● PACER watched case: White v. Brooge ─ filter: opposition_brief
  Last contribution: 2 days ago · Total: 4
  [Edit filter] [Pause] [Remove]

● PACER scheduled search: "S.D.N.Y. sec MTD oppositions" ─ filter: opposition_brief, reply_brief
  Last contribution: 6 hours ago · Total: 47
  [Edit filter] [Pause] [Remove]

● OneDrive folder: /Briefs/Securities/MTD/ ─ filter: file.docx, file.pdf
  Last contribution: 18 days ago · Total: 312
  [Edit filter] [Pause] [Remove]

[+ Add binding...]
  (picker walks the user through source_kind → source_id → content_type_filter selection)
```

This panel is where a user sees a corpus holistically: what feeds it, how often, and at what filter. It is the natural place to add bindings for sources the user hadn't thought of from the source-side UI.

### 6.3 System-suggested bindings (lightweight, optional in V1)

When the user explicitly tags a node with a `corpus_id` (path 3 from §1) AND that node arrived via a source the user has used before, the system MAY suggest "Always feed items like this from {source} to {corpus}? [Yes — create binding] [No, just this one]".
This is a soft pattern: it is optional in V1 of this addendum and can be deferred to V2 if it complicates red-team. If shipped, the resulting binding has `created_by: "suggested_acceptance"` for audit clarity.

---

## 7. Lifecycle and Retention

- Bindings outlive individual events. A binding persists until the user disables / deletes it, the source is removed, or the target corpus is deleted (auto-deactivation per §3.3).
- **Disabling** a binding (`active: false`) stops new contributions but does NOT retroactively remove `corpus_id` from previously contributed members. Members produced under the binding remain corpus members on their own merits.
- **Deleting** a binding tombstones the binding node. By default, contributed members keep their `corpus_id` (same rationale as disable). Optionally, the delete action exposes an "also remove `corpus_id` from N previously contributed members" toggle for cleanup of mis-configured bindings. Default is off, which preserves user work.
- **Source removal** (e.g., user removes the watched case from PACER) auto-deactivates all bindings consuming that source. Members remain.
- **Corpus deletion** is the most disruptive case. DOC73 §3.1 (V1.4.1) does not specify corpus-deletion semantics in detail; this addendum defers to whatever DOC73 V1.5 (or later) defines, but requires that corpus deletion auto-deactivate all bindings targeting the deleted corpus and surface a "rebind these to..." prompt aggregating all affected sources.

---

## 8. Cross-Doc Obligations

Absorbing this addendum implies the following obligations on adjacent specs:

- **DOC73 (this doc's home)**: add §3.6 referencing this addendum's contract; preserve §3.1 corpus model unchanged. The four lifecycle paths (cluster absorption, suggested promotion, explicit user tag, source binding) should be enumerated together in a refreshed §3.1 lifecycle paragraph.
- **DOC72 §20A (Surface-Specific Intake Contracts)**: each contract (`intake.todo`, `intake.calendar`, `intake.pacer`, etc.) declares its `source_kind` enums per §4 of this addendum.
- **DOC20 §6.18.2 (Content-Type Registration)**: every content type referenced in a `content_type_filter` must be registered. No unregistered types (Will's stated failure mode). Bindings against unregistered types fail at creation time, not at runtime.
- **EC Core Addendum A (intake routing)**: implements steps 1–7 of §3.1. See the companion EC Core proposal `EC_CORE_ADDENDUM_A_INTAKE_ROUTING_FOR_CORPUS_BINDINGS_PROPOSAL_V1.md`.
- **DOC24 R2.5 (delivery / tag vocabulary)**: corpus-scoped extraction tasks queued at §3.1 step 6 emit delivery events using the existing primary_tag / hedge_mode / force_level vocabulary. No second prompt-control language. (One-tag-vocabulary invariant.)
- **DOC8 (learning computation)**: bindings that fire frequently with low downstream user engagement (rejected candidates, immediately-archived members) are signals into DOC8's learning computation: "this binding is over-broad." Out of scope to specify the learning logic here; flagged for DOC8's own treatment.
- **BDSM V6.4 (utility ledger)**: each contributed member is a utility-ledger event. Members the user actually engages with (read, cited, surfaced in chat) credit the binding's utility; rejected members debit. Out of scope to specify here; flagged for BDSM integration during DOC24 absorption.
- **PropA R6.3 (knowledge pipeline sensitivity)**: bindings are an intake-pipeline structure. Sensitivity / self-improvement signals from PropA may include "this binding's content-type filter is too narrow / too broad" (out of scope, flagged).

---

## 9. Domain-Agnostic Demonstration

To verify domain-agnosticism (per the "Domain-agnostic core" architectural invariant), the contract should demonstrably work for at least three unrelated domains.
Reference instantiations:

**Legal (PACER plugin V1.1):**

```
binding {
  corpus_id: "securities_mtd_oppositions",
  source_kind: "pacer.scheduled_search",
  source_id: "search_sdny_sec_mtd_opps_2020_2026",
  content_type_filter: ["legal.pacer.filing.opposition_brief"],
  contribution_mode: "scope_only",
  active: true,
}
```

**Personal / hobby (recipe corpus):**

```
binding {
  corpus_id: "weeknight_recipes",
  source_kind: "browser.captured_page",
  source_id: "rule_recipe_sites_filter",
  content_type_filter: ["web.captured_page.recipe"],
  inclusion_predicate: "page.estimated_cook_time_minutes <= 45",
  contribution_mode: "scope_and_extract",
  active: true,
}
```

**Music production (tutorial corpus):**

```
binding {
  corpus_id: "bitwig_modulation_techniques",
  source_kind: "rss.feed",
  source_id: "feed_bitwig_youtube_channel_id_xxx",
  content_type_filter: ["web.feed.item.video"],
  inclusion_predicate: "item.title LIKE '%modulation%' OR item.title LIKE '%modulator%'",
  contribution_mode: "scope_only",
  active: true,
}
```

Same contract, three unrelated domains, no domain-specific logic in the binding model. The PACER plugin is the first user but not the architectural shape.

---

## 10. Plugins Immediately Benefiting

| Plugin / Surface | source_kind(s) | Example corpus |
|---|---|---|
| PACER (V1.1) | `pacer.watched_case`, `pacer.scheduled_search` | `securities_mtd_oppositions`, `firm_brief_bank` |
| OneDrive watcher (DOC16) | `onedrive.folder_watch` | `firm_brief_bank` (legacy migration), `expert_reports` |
| RSS / news (forthcoming plugin) | `rss.feed` | `securities_news`, `enforcement_actions` |
| Calendar (DOC11 / plugin) | `calendar.sync` | `litigation_deadlines` |
| Email (DOC16) | `email.mailbox`, `email.label` | `client_correspondence_smith_v_acme` |
| Browser saver (forthcoming plugin) | `browser.captured_page` | `legal_research_pleadings`, `weeknight_recipes` |

Each plugin spec adds a "Corpus Bindings" subsection citing this addendum rather than reinventing the concept.

---

## 11. What This Addendum Does NOT Do

- Does NOT define corpus extraction profiles; DOC73 §14 owns that.
- Does NOT define corpus trust posture semantics; DOC73 §3.1 owns that.
- Does NOT define content-type registration; DOC20 §6.18.2 owns that.
- Does NOT replace cluster absorption (§4.2) or explicit user tagging; bindings are a fourth path, not a replacement.
- Does NOT touch ambient-graph nodes (no `corpus_id`); those continue to behave per DOC73 V1.4.1 untouched.
- Does NOT introduce a new tag vocabulary; DeliveryDirective is unchanged.
- Does NOT introduce a new node kind beyond `corpus_source_binding` and `source_kind_registration`, both modest extensions to DOC72's existing node-kind set.

---

## 12. Red-Team Targets

For fresh-window review, the following points warrant explicit attention:

1. **Predicate language scope.** §2.1 says `inclusion_predicate` uses "the existing DOC72 predicate DSL." If no such DSL exists yet, the addendum has a hidden dependency. Confirm DOC72 has a predicate sub-language; if not, propose either deferring `inclusion_predicate` to V2 or specifying a minimal expression grammar inline.
2. **Multi-corpus extraction cost.** §3.1 step 6 says multiple `scope_and_extract` bindings produce multiple extraction tasks. For a node hitting N bindings with N profiles, cost is N×. Review whether deduplication (run the more permissive profile that subsumes the others) is worth specifying, or whether duplication is acceptable.
3. **Predicate evaluation security.** If `inclusion_predicate` is user-authored, evaluating it at intake time crosses a trust boundary. Should there be a sandboxing requirement? Probably yes; flag for V2 if not addressed in DOC72 predicate DSL spec.
4. **Suggested-binding pattern (§6.3).** The "always feed items like this" suggestion is a learning-loop into DOC8. May warrant explicit DOC8 spec text rather than just a flag in §8.
5. **Corpus deletion cascade.** §7 defers to DOC73 V1.5+ for corpus-deletion semantics. If V1.5 does not specify, this addendum needs a fallback rule.
6. **Ambient-graph promotion.** Can a previously ambient (no `corpus_id`) node be retroactively bound via a binding fired against historical data? §3 implies bindings only fire on new events. Retroactive backfill is a useful capability but architecturally distinct, and should be explicitly out of scope here, with V2 introducing it via a separate `binding_backfill` operation rather than overloading the live binding.
7. **Cardinality limits.** Should there be a per-corpus binding count cap? Not strictly necessary, but worth considering for UI sanity (a corpus with 200 bindings is hard to manage).

---

## 13. Versioning and Filename Discipline

This is V1 of the proposal. Per the post-absorption versioning rule, when this is absorbed into DOC73 V1.5 (or a later operative version), V1 is archived. Future revisions to the source-bindings concept author a fresh proposal against the absorbed §3.6 owner text in DOC73, not a revision of this V1 proposal. If review reveals the addendum needs structural changes, V2 of this proposal is authored as `DOC73_CORPUS_SOURCE_BINDINGS_PROPOSAL_V2.md`, never as a re-edit of V1.
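
---

As an illustration of the §3.1 intake sequence, the following TypeScript sketch shows one way steps 3 through 7 could be arranged in memory. It is not EC's implementation. `routeIntakeEvent`, `BindingSketch`, `IntakeEventSketch`, `RoutingResult`, the injected `PredicateEvaluator`, the payload shape, and every id in the demo are assumptions introduced here for illustration. The DOC72 predicate DSL is stubbed rather than modeled, and trust posture, persistence, and the node write itself are omitted.

```typescript
// Sketch only: a trimmed copy of the §2.1 binding shape plus an in-memory router.
type ContributionMode = "scope_only" | "scope_and_extract";

interface BindingSketch {
  binding_id: string;
  corpus_id: string;
  source_kind: string;
  source_id: string;
  content_type_filter?: string[];
  inclusion_predicate?: string;
  contribution_mode: ContributionMode;
  active: boolean;
  created_at: string; // assumed ISO-8601 UTC, so string order is chronological
  contribution_count: number;
  last_contribution_at?: string;
  last_evaluation_error?: { timestamp: string; error: string };
}

interface IntakeEventSketch {
  source_kind: string;
  source_id: string;
  content_type: string;
  payload: { [key: string]: unknown };
}

// Assumption: the predicate evaluator is injected; the DOC72 DSL is not modeled here.
type PredicateEvaluator = (predicate: string, payload: { [key: string]: unknown }) => boolean;

interface RoutingResult {
  scopeTags: string[];         // deduplicated corpus_ids for the single node write (step 5)
  extractionCorpora: string[]; // one entry per fired scope_and_extract binding (step 6)
  firedBindingIds: string[];   // bindings whose counters were updated (step 7)
}

function routeIntakeEvent(
  event: IntakeEventSketch,
  bindings: BindingSketch[],
  evaluate: PredicateEvaluator,
  now: string
): RoutingResult {
  const result: RoutingResult = { scopeTags: [], extractionCorpora: [], firedBindingIds: [] };

  // Step 3: active bindings for this (source_kind, source_id); step 4: oldest first.
  const candidates = bindings
    .filter((b) => b.active && b.source_kind === event.source_kind && b.source_id === event.source_id)
    .sort((a, b) => (a.created_at < b.created_at ? -1 : a.created_at > b.created_at ? 1 : 0));

  for (const b of candidates) {
    // Step 4.1: a filter, when set, must list the event's content type.
    if (b.content_type_filter && b.content_type_filter.indexOf(event.content_type) === -1) continue;

    // Step 4.2: an evaluation failure is logged and treated as no-match.
    if (b.inclusion_predicate !== undefined) {
      let matched = false;
      try {
        matched = evaluate(b.inclusion_predicate, event.payload);
      } catch (err) {
        b.last_evaluation_error = { timestamp: now, error: String(err) };
      }
      if (!matched) continue;
    }

    // Step 4.3: accumulate the scope tag, deduplicated.
    if (result.scopeTags.indexOf(b.corpus_id) === -1) result.scopeTags.push(b.corpus_id);
    if (b.contribution_mode === "scope_and_extract") result.extractionCorpora.push(b.corpus_id);

    // Step 7: binding metadata update (in memory here; EC would persist it).
    b.last_contribution_at = now;
    b.contribution_count += 1;
    result.firedBindingIds.push(b.binding_id);
  }
  return result;
}

// Hypothetical demo data: ids, timestamps, and the stub evaluator are illustrative only.
const demoBindings: BindingSketch[] = [
  { binding_id: "b1", corpus_id: "securities_mtd_oppositions", source_kind: "pacer.watched_case",
    source_id: "case_example", content_type_filter: ["legal.pacer.filing.opposition_brief"],
    contribution_mode: "scope_and_extract", active: true, created_at: "2026-01-02T00:00:00Z",
    contribution_count: 0 },
  { binding_id: "b2", corpus_id: "firm_active_litigation", source_kind: "pacer.watched_case",
    source_id: "case_example", inclusion_predicate: "filing.pages > 5",
    contribution_mode: "scope_only", active: true, created_at: "2026-01-01T00:00:00Z",
    contribution_count: 0 },
  { binding_id: "b3", corpus_id: "firm_brief_bank", source_kind: "pacer.watched_case",
    source_id: "case_example", inclusion_predicate: "not a valid predicate",
    contribution_mode: "scope_only", active: true, created_at: "2026-01-03T00:00:00Z",
    contribution_count: 0 },
];

const stubEvaluate: PredicateEvaluator = (predicate, payload) => {
  if (predicate === "filing.pages > 5") return Number(payload["pages"]) > 5;
  throw new Error("unsupported predicate: " + predicate);
};

const demoResult = routeIntakeEvent(
  { source_kind: "pacer.watched_case", source_id: "case_example",
    content_type: "legal.pacer.filing.opposition_brief", payload: { pages: 12 } },
  demoBindings, stubEvaluate, "2026-04-27T12:00:00Z");
console.log(demoResult);
```

In the demo, the older binding `b2` is evaluated before `b1`, both fire, and `b3` records a `last_evaluation_error` without contributing, which mirrors the no-match handling in the §3.3 failure table. Injecting the evaluator keeps the sketch neutral on the open predicate-DSL question raised in §12.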