Current Specs/DOC72/DOC72_DOC7_BUCKET_INTAKE_INSERTION_TEXT.md
Generated 2026-06-09T01:23:58.539Z from commit dbaa25962edc11ab30e8d4ca1715f9ae5bf77331. Worktree: clean.
# DOC72 + DOC7 — Bucket File Intake: Insertion Text

**Instructions:** These are exact text blocks to insert into DOC72 R5.5 and DOC7 R6. Each block specifies its insertion point.

---

## DOC72 Insertions

### Insertion 1: Add "bucket" to IntakeSurface type (§20A.2)

**Location:** §20A.2, line ~2128, the `IntakeSurface` type definition.

**Replace:**
```ts
type IntakeSurface = "note" | "document" | "browser" | "chat" | "room" | "panel" | "forum" | "candor" | "task" | "email";
```

**With:**
```ts
type IntakeSurface = "note" | "document" | "browser" | "chat" | "room" | "panel" | "forum" | "candor" | "task" | "email" | "bucket";
```
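
Widening the union has a compile-time side effect: any structure keyed exhaustively on `IntakeSurface` must gain a `bucket` entry before the code type-checks again. The following is a minimal sketch of that effect, assuming a per-surface collection-toggle map. The names `surfaceCollectionEnabled` and `isIntakeSurface` and the default values are hypothetical and not taken from DOC72.

```ts
type IntakeSurface =
  | "note" | "document" | "browser" | "chat" | "room"
  | "panel" | "forum" | "candor" | "task" | "email" | "bucket";

// Hypothetical per-surface toggle map. Because it is typed as
// Record<IntakeSurface, boolean>, omitting "bucket" is a compile error.
const surfaceCollectionEnabled: Record<IntakeSurface, boolean> = {
  note: true, document: true, browser: true, chat: true, room: true,
  panel: true, forum: true, candor: true, task: true, email: true,
  bucket: true,
};

// Runtime guard for untyped input (e.g. a surface name read from JSON).
// Uses an own-property check so inherited keys such as "toString" are rejected.
function isIntakeSurface(value: string): value is IntakeSurface {
  return Object.prototype.hasOwnProperty.call(surfaceCollectionEnabled, value);
}
```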

---

### Insertion 2: Add §20B.14 — Bucket Files (after §20B.13)

**Location:** After §20B.13 (Extraction Scheduling Summary), before the `---` separator and §20C.

**Insert:**

```
### 20B.14 Context Bucket Files (DOC7) — `intake.bucket.file_added`

Context bucket files are high-value intake sources. The user explicitly curated them as relevant reference material — adding a document to a bucket is an intentional act that signals the content matters. Bucket files receive elevated extraction priority and starting confidence.

**Trigger:** When EC processes a `context_bucket_file_add` command (DOC7) and the file reaches `index_status: "ready"`, EC emits an `intake.bucket.file_added` observation. File content updates (detected via `content_hash` change on re-add or version increment) also trigger extraction. File removals do NOT trigger extraction or entity deletion — extracted knowledge persists independently of the bucket file's lifecycle.

**Default significance rules:**

| Signal | Assessment | Rationale |
|---|---|---|
| File added to any bucket | `deep` | User explicitly curated this content; extract unless the same `content_hash` was already extracted |
| File updated (`content_hash` changed) | `deep` | User refreshed the document; re-extract |
| File removed from bucket | No action | Knowledge persists; bucket membership is ephemeral |
| File already extracted (same `content_hash`) | `skip` | Deduplication: do not re-extract identical content |

**Extraction prompt (bucket-file-specific):**

The extraction prompt for bucket files is enriched with bucket context:

```ts
const BUCKET_FILE_EXTRACTION_PROMPT = `
Extract structured knowledge from this document.

Document title: {file_title}
Source bucket: {bucket_title}
Bucket summary: {bucket_summary}
Bucket matter associations: {linked_matter_names}
User's active matters: {active_matter_names}

Extract:
- Named entities (people, organizations, courts, case names, case numbers)
- Dates and deadlines mentioned
- Key facts, holdings, or rulings
- Document type and purpose (complaint, motion, brief, letter, memo, order, etc.)
- Parties and their roles
- Claims or causes of action
- Key arguments or positions taken
- Relief sought
- Any procedural history mentioned

For each extracted item, include:
- The entity or fact
- Confidence (how clearly stated vs inferred)
- The approximate location in the document (beginning/middle/end or section reference)

Output as structured JSON candidates.
Do NOT extract boilerplate, signature blocks, certificates of service, or formatting instructions.
`;
```

**Provenance:** All candidates extracted from bucket files carry:

```ts
provenance: {
  entry_type: "learned_from_bucket_file",
  source_ref: "{bucket_id}:{file_id}",
  source_content_hash: "{content_hash}",
  extraction_model: "{model_used}",
  extracted_at: "{timestamp}",
}
```

This provenance chain enables DOC24's overlap detection — DOC24 can identify knowledge cards that were extracted FROM a bucket file and suppress them when the same file is being inlined by DOC7.

**Starting confidence:** Bucket file candidates receive elevated starting confidence because the user chose to put the document in a bucket, which is an endorsement of its relevance:
- Entities: starting α = 3 (vs α = 2 for conversation-mined entities)
- Facts and holdings: starting α = 3

**Matter association:** If the bucket is assigned to a project or matter, all extracted entities SHALL be linked to that matter via edges. If the bucket has no matter association, extracted entities are linked based on standard entity resolution against existing graph nodes.

**Scheduling:** Bucket file extraction is queued as high priority in the BackgroundJobOrchestrator (EC Core Addendum A §3) using the `tier2_extractor` agent profile. The priority is high because the user explicitly added the file, an intentional action indicating that the content matters now.

**Memory mode gating:** Bucket file extraction respects the global memory control hierarchy (EC Core Addendum A §2). If `memory_system_enabled = false`, `collection_enabled = false`, or the bucket surface collection toggle is off, no extraction occurs. Incognito mode does not apply to bucket files (buckets are persistent curated content, not ephemeral browsing).
```
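
The default significance rules in §20B.14 reduce to a small decision function. The sketch below is illustrative only: the `BucketFileSignal` shape, the `"none"` value standing in for "No action", and the in-memory set of already-extracted hashes are assumptions, not part of DOC72.

```ts
// "none" is a hypothetical stand-in for the table's "No action" row.
type Assessment = "deep" | "skip" | "none";

// Hypothetical signal shape; DOC72 does not define this type.
type BucketFileSignal =
  | { kind: "added"; contentHash: string }
  | { kind: "updated"; contentHash: string }
  | { kind: "removed" };

function assessBucketFile(
  signal: BucketFileSignal,
  extractedHashes: ReadonlySet<string>, // hashes that were already extracted
): Assessment {
  // Removal never triggers extraction or deletion; knowledge persists.
  if (signal.kind === "removed") return "none";
  // Dedup: identical content was already extracted, so do nothing more.
  if (extractedHashes.has(signal.contentHash)) return "skip";
  // New or changed content in a user-curated bucket is always extracted.
  return "deep";
}
```

In this sketch the dedup check runs before the `deep` default, so a file that is re-added or reverted to previously extracted content yields `skip`.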

---

### Insertion 3: Add provenance entry type (§20B.11)

**Location:** §20B.11, the provenance entry type table (line ~2677-2687).

**Add after** `"agent_observation"`:

```
| "learned_from_bucket_file"       // Extracted from DOC7 context bucket file
```
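
A provenance record carrying the new entry type can be assembled and parsed along the following lines. This is a sketch under stated assumptions: the helper names `buildBucketFileProvenance` and `parseSourceRef` are hypothetical, the ISO 8601 timestamp format is assumed, and rejecting `:` in `bucket_id` is a choice made here to keep `source_ref` unambiguous, not a rule from DOC72.

```ts
interface BucketFileProvenance {
  entry_type: "learned_from_bucket_file";
  source_ref: string;          // "{bucket_id}:{file_id}"
  source_content_hash: string;
  extraction_model: string;
  extracted_at: string;        // ISO 8601 here; the format is an assumption
}

function buildBucketFileProvenance(
  bucketId: string,
  fileId: string,
  contentHash: string,
  model: string,
  now: Date = new Date(),
): BucketFileProvenance {
  // A ":" inside bucket_id would make source_ref ambiguous to split.
  if (bucketId.includes(":")) {
    throw new Error(`bucket_id must not contain ":" (got ${bucketId})`);
  }
  return {
    entry_type: "learned_from_bucket_file",
    source_ref: `${bucketId}:${fileId}`,
    source_content_hash: contentHash,
    extraction_model: model,
    extracted_at: now.toISOString(),
  };
}

// Split on the first ":" only, so a file_id containing ":" survives the round trip.
function parseSourceRef(sourceRef: string): { bucketId: string; fileId: string } {
  const i = sourceRef.indexOf(":");
  if (i <= 0 || i === sourceRef.length - 1) {
    throw new Error(`malformed source_ref: ${sourceRef}`);
  }
  return { bucketId: sourceRef.slice(0, i), fileId: sourceRef.slice(i + 1) };
}
```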

---

### Insertion 4: Add to extraction scheduling summary (§20B.13)

**Location:** §20B.13, the extraction scheduling summary table (line ~2695-2705).

**Add row after** the Browser row:

```
| Bucket files (DOC7) | File add or content update | Tier 2 (high) | 1-5/day |
```
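
Before a bucket file reaches the Tier 2 queue, the memory mode gates from §20B.14 have to pass. The sketch below shows one way to combine the gate check with the job descriptor. The field names `bucket_collection_enabled` and `incognito`, the `ExtractionJob` shape, and the function `planBucketExtraction` are hypothetical; the real BackgroundJobOrchestrator interface is defined in EC Core Addendum A §3 and is not reproduced here.

```ts
interface MemorySettings {
  memory_system_enabled: boolean;
  collection_enabled: boolean;
  bucket_collection_enabled: boolean; // hypothetical name for the bucket surface toggle
  incognito: boolean;                 // hypothetical; deliberately ignored for buckets
}

// Hypothetical job descriptor, not the orchestrator's actual type.
interface ExtractionJob {
  agent_profile: "tier2_extractor";
  priority: "high";
  source_ref: string;
}

// Returns a job descriptor, or null when any gate in the hierarchy is closed.
function planBucketExtraction(
  settings: MemorySettings,
  sourceRef: string,
): ExtractionJob | null {
  const allowed =
    settings.memory_system_enabled &&
    settings.collection_enabled &&
    settings.bucket_collection_enabled;
  // settings.incognito is intentionally not consulted:
  // incognito mode does not apply to bucket files.
  if (!allowed) return null;
  return { agent_profile: "tier2_extractor", priority: "high", source_ref: sourceRef };
}
```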

---

## DOC7 Insertion

### Insertion 5: Add §4.7 to DOC7 (after §4.6)

**Location:** DOC7 R6, after §4.6 (Synergies), before §5 (or wherever the next major section begins after the context injection section).

**Insert:**

```
### 4.7 DOC72 knowledge extraction from bucket files

When a file reaches `index_status: "ready"` after a `context_bucket_file_add` command, EC SHALL emit an `intake.bucket.file_added` observation to DOC72's intake pipeline (DOC72 §20B.14). The file content is queued for entity extraction through the BackgroundJobOrchestrator (EC Core Addendum A §3) as a high-priority tier2_extractor task.

DOC7 is NOT responsible for the extraction logic — it only emits the trigger event. DOC72 owns the extraction pipeline, entity linking, and promotion rules. DOC7 provides the file content and metadata; DOC72 produces knowledge nodes.

File updates (content_hash change on re-add or version increment) re-trigger extraction. File removals do NOT trigger extraction or entity deletion — extracted knowledge persists independently of the bucket file's lifecycle.

This ensures that bucket files — which the user explicitly curated as relevant reference material — contribute to DOC72's knowledge graph and DOC24's entity resolution, not just DOC7's raw context injection.

**Provenance coordination with DOC24:** DOC72 tags all bucket-file-extracted entities with `provenance.source_ref = "{bucket_id}:{file_id}"`. DOC24 uses this provenance to detect overlap between knowledge cards and inlined bucket content, suppressing redundant cards when the full document is already in the LLM's context. See DOC24 Unified Context Budget Governance for the coordination mechanism.
```
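
The provenance coordination described in §4.7 can be pictured as a filter over candidate knowledge cards. This is only a sketch of the idea: the `KnowledgeCard` shape and the function `suppressInlinedCards` are hypothetical, and the actual mechanism is specified in DOC24 Unified Context Budget Governance.

```ts
// Hypothetical card shape; only the fields needed for overlap detection.
interface KnowledgeCard {
  id: string;
  provenance: { entry_type: string; source_ref?: string };
}

// Drops cards that were extracted from a bucket file whose full text is
// already being inlined, identified by its "{bucket_id}:{file_id}" reference.
function suppressInlinedCards(
  cards: KnowledgeCard[],
  inlinedFileRefs: string[],
): KnowledgeCard[] {
  const inlined = new Set(inlinedFileRefs);
  return cards.filter((card) => {
    const p = card.provenance;
    const fromInlinedFile =
      p.entry_type === "learned_from_bucket_file" &&
      p.source_ref !== undefined &&
      inlined.has(p.source_ref);
    return !fromInlinedFile;
  });
}
```

Cards with any other entry type pass through untouched, as do bucket-file cards whose source file is not inlined in the current context.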