DOC 23 Add B RT Reviews 1 (5.28).md

Active Working and Red Team/DOC23 Working/DOC23 Red Teaming/DOC 23 Add B RT Reviews 1 (5.28).md
Short text page 032586c73294. Generated 2026-06-09T01:23:58.539Z from commit dbaa25962edc11ab30e8d4ca1715f9ae5bf77331. Worktree: clean.
ELNOR REPO READER TEXT MIRROR
Source repo: /Users/OpenClaw1/Elnor/Elnor Specs
Git branch: main

---

# Claude

# Claude — Consolidated Red-Team Review & Design Notes

## DOC23 Addenda B Specification Family

**Reviewer:** Claude (fresh-window read of all six in-scope documents)
**Documents under review:** Core R0.7.1 · Outcome Evaluator + Revisor V3.3.1 · Evaluation Common Contracts V1.1.1 · Source Workspace V1.0.1 · Feedback Delivery V1.0.1 · Task Forum + Run Board V1.0.1
**Prompt:** RED_TEAM_DOC23_ADDENDA_B_SET_V2.md
**Status of this document:** Save-point snapshot of Claude's complete position before external reviewer responses are folded in. Sections 1–2 are the original structured review; Sections 3–8 are everything produced after it (deeper-dive findings, proposed surfaces, the fork/chat/re-prompt design, the evaluation of the four other reviewers, the recommended hardening pass, and the reusable purpose-question library).

---

## 0. How to read this document

This consolidates several review passes into one reference:

- **§1 Executive synthesis** — the overall assessment, the two dominant defect patterns, the decision-set rule, and where an implementer would be forced to guess.

- **§2 Original findings** — 33 findings from the structured red-team response: 6 conceptual (A1–A6), 12 targeted (B1–B12), 15 defect-hunt (C1–C15). Verbatim. (The original pass self-labeled "32"; it undercounted Part B — there are 12 targeted findings, not 11.)

- **§3 Deeper-dive additional findings** — 24 further findings (D1–D24) produced after the original pass, including two CONFIRMED affirmations.

- **§4 New user-facing surfaces proposed** — six surfaces (S1–S6) with schemas.

- **§5 Fork / chat-in-session / re-prompt unified design** — TaskRunFork, ModuleActivationChat, and the single-addendum three-section decision.

- **§6 Evaluation of the four external reviewers** — what changed Claude's thinking, the additions to adopt, and the proposals deliberately not adopted (with reasons).

- **§7 Recommended R0.4 "Math + Contract Hardening Pass" scope.**

- **§8 Appendix: reusable purpose-question library** + reviewer-assignment grid.

Finding IDs are stable. Cite them directly (e.g. "B1", "D14", "S2") in downstream work.

Tags follow the prompt's set: `BUG` (wrong/contradictory), `GAP` (must be specified but isn't), `RISK` (distributed-systems/reliability hazard), `UX` (trust/transparency/reviewability), `IDEA` (reconceptualization/missing capability), `CONFIRMED` (genuinely well-designed). Severity: `CRITICAL` / `HIGH` / `MEDIUM` / `LOW`.

**On severity and disposition (read this).** Severity describes how broken a given defect is: its impact if left as-is. It is **not** a triage instruction and does not rank what to fix first. There are no priority tiers in this document. The disposition is binary: **every finding and proposal here is to be addressed, unless it appears in the "Proposed but not adopted" set (§6.2), each item of which carries an explicit reason for declining.** Severity tells you what each defect costs; it does not tell you whether to fix it. (All findings are worth fixing; that is why they were surfaced rather than discarded.)

---

## 1. Executive synthesis

**Overall assessment.** This set is architecturally ambitious and substantially sound, but its integration layer is not yet complete — as a *set*, the six documents do not yet fully agree, and that is where the defects concentrate. The strongest design decision is the strict separation of planner (Revisor) from executor (revision-capable modules) via the `revision_in` port contract — that is the single highest-leverage idea in the set and it is well-specified. The defeasibility of findings (FD V1 §3.4) and the goal-advancement sycophancy fix (V3.3.1 §6.12 + §13.3.1) are unusually thoughtful pieces of design that most AI orchestration systems get wrong. The cross-pattern coordination work (Pattern A/B/C) and the explicit graph-native discipline (no hidden dispatch) reflect mature thinking about why AI systems lose user trust.

However, the family-topology split into six documents has produced concrete schema collisions (the `EvaluationFinding` double-definition between V3.3.1 §5.7 and FD V1 §3.3 is the most severe — same name, incompatible fields, both currently authoritative). Phantom schema references (`EvaluationAffirmation` doesn't exist; `ClaimSetBundle` is in the wrong document) tell a coding agent to look in the wrong place. Mid-level mechanisms like the revalidation cascade lack convergence proofs, indeterminate-state handling is incomplete in routing policies, and several enums are referenced before being defined. The set displays the classic symptom of multiple documents authored at slightly different times: each is internally consistent, the integration isn't.

The bar Will set ("a high-stakes professional relies on it") is not yet met — but mostly because of *integration defects*, not core design. The shape of the system is right; the seams aren't tightened.

**Two dominant patterns.** (1) *Schema-level locks are used precisely where they matter* — AutonomousModePolicy locked-false fields (§6.6.2), the goal_advancement sycophancy exclusion (§6.12 / §13.3, AttributionRecord explicitly excludes `revisor_self_assessment`). That instinct is correct and should be the model for the rest. (2) *The boundary problems are bigger than the interior problems* — the most severe findings live at the seams between documents (B1, B5, B6, B11), not inside any one document's logic. The interior of each document is largely solid.

**Disposition (the decision set).** Everything in this document is to be addressed. There is no priority ordering and no "fix these, defer those" — the only items *not* being acted on are the four in §6.2 (proposed mechanisms declined, each with its reason). The most consequential single defect is **B1** (two incompatible `EvaluationFinding` schemas, both currently authoritative); the most consequential *structural* fix surfaced by the other reviewers is the **Formula Registry** (§6.1), which collapses a dozen-plus "this number has no formula" gaps into one place. The new capabilities worth calling out — because they add reach the system doesn't have today, not because they outrank the fixes — are **TaskHealthCard (A1)**, **WorkProductCertification (S1)**, and **FindingsInbox (S2)**. All are in the adopt set.

**Where an implementer would be forced to guess.** Handed this set as-is, a coding agent would have to invent answers to: (a) which `EvaluationFinding` schema is canonical when both are referenced; (b) where `ClaimSetBundle` actually lives; (c) what to do when the Pattern C chain ID doesn't resolve; (d) how a sub-agent-less Evaluator should proceed; (e) what taint class to assign to a `web_source`. None are deep design problems; they are unfinished spec details, and each is closed by a specific finding in the sections that follow (for example, B1 for (a), B5 for (b), B3 for (c), B10 for (d)). Closing the cross-doc obligations and naming Common Contracts as the single schema-of-record for shared types removes the largest cluster of guess-points at once; the natural place to do that is the DOC23 R3.2 absorption.

---

## 2. Original findings (33)

### Part A — Conceptual / UX findings (§3)

#### A1 · [IDEA] [HIGH] · Set-wide — No first-class "task health" surface, only fragmented signals

**Finding:** The set produces extraordinarily rich per-step instrumentation — `RevisionOperationReceipt`, `FeedbackConsumptionReceipt`, `EvaluationResultEnvelope`, `RepairCycleSignal`, `PatternHealthState`, `SubAgentReputation`, eight signal types, four delivery channels, five evaluation slices. But for a professional asking "is this task going well?", no document defines a unified, decision-grade health view. The user must reconstruct health by joining run board events, current evaluation states, pending Hard Calls, budget burn, taint state, and repeated-failure counters across at least four documents.

**Why it matters:** A high-stakes user does not read receipts; they read dashboards. CI/CD systems, clinical decision support, and Bloomberg terminals all converged on this lesson decades ago. Without a defined "Task Health Card" with a clear top-line ("on track / at risk / blocked / unrecoverable") and the three to five drivers behind that label, the user is forced into the operator role this design explicitly rejects (Core §28 closing note: "every task created, run, inspected... makes the Task Agent better"). The richness of the substrate is wasted if reviewability does not exist at this level.

**Recommendation:** Add a `TaskHealthCard` schema to Core R0.7.1 (probably §20 or a new §20A) that aggregates the runtime signals into a top-line state with explicit driver references. Specify the inputs (e.g., `unresolved_blocking_findings_count`, `repeated_failure_pattern_active`, `budget_burn_to_date`, `hard_calls_pending`, `upstream_failure_outcomes`, `taint_escalations_in_run`, `cost_estimator_confidence`) and the deterministic aggregation function. The card lives next to — not replacing — the Run Board.

**Reference:** Core R0.7.1 §12 telemetry spine (passive); Run Board V1 §1.2 events; FD V1 §6.4 repeated-failure; V3.3.1 §21 surfaces (cards by record kind, no aggregate).
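
A minimal TypeScript sketch of what the deterministic aggregation in the A1 recommendation could look like. The input field names are taken from the recommendation; everything else is an assumption made for illustration: the `TaskHealthInputs` and `TaskHealthCard` shapes, the field typings, the 0.9 budget-burn threshold, and the precedence order (unrecoverable, then blocked, then at risk). None of these values come from the six documents.

```typescript
// Hypothetical shapes: field names follow the A1 recommendation, types are assumed.
type TaskHealthState = "on_track" | "at_risk" | "blocked" | "unrecoverable";

interface TaskHealthInputs {
  unresolved_blocking_findings_count: number;
  repeated_failure_pattern_active: boolean;
  budget_burn_to_date: number; // assumed: fraction of budget consumed, 0..1
  hard_calls_pending: number;
  upstream_failure_outcomes: number;
  taint_escalations_in_run: number;
  cost_estimator_confidence: "low" | "medium" | "high"; // assumed typing
}

interface TaskHealthCard {
  state: TaskHealthState;
  drivers: string[]; // names of the inputs that produced the top-line state
}

// Assumed precedence: unrecoverable > blocked > at_risk > on_track.
function aggregateTaskHealth(i: TaskHealthInputs): TaskHealthCard {
  if (i.upstream_failure_outcomes > 0) {
    return { state: "unrecoverable", drivers: ["upstream_failure_outcomes"] };
  }

  const blocking: string[] = [];
  if (i.unresolved_blocking_findings_count > 0) blocking.push("unresolved_blocking_findings_count");
  if (i.hard_calls_pending > 0) blocking.push("hard_calls_pending");
  if (blocking.length > 0) return { state: "blocked", drivers: blocking };

  const risk: string[] = [];
  if (i.repeated_failure_pattern_active) risk.push("repeated_failure_pattern_active");
  if (i.budget_burn_to_date >= 0.9) risk.push("budget_burn_to_date"); // illustrative threshold
  if (i.taint_escalations_in_run > 0) risk.push("taint_escalations_in_run");
  if (i.cost_estimator_confidence === "low") risk.push("cost_estimator_confidence");
  if (risk.length > 0) return { state: "at_risk", drivers: risk };

  return { state: "on_track", drivers: [] };
}
```

Returning the driver names alongside the state is what keeps the card decision-grade: the top line can always be traced back to the specific signals that produced it.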

#### A2 · [IDEA] [HIGH] · Set-wide — Cost predictability is asserted but never computable end-to-end before run

**Finding:** Budget bifurcation (V3.3.1 §11.15) and `EstimatorConfidence` (§11.15.4) are sophisticated, but there is no `TaskRunCostForecast` produced *before* a run — only RevisionCostEstimate produced inside a planning compiler invocation, after the user has committed to running. The Source Research module attributes cost to the workspace and tools (Source Workspace §7.6) but never produces a forecast either. The forum's BoardDigest carries `token_count` after generation, not predicted.

**Why it matters:** A professional needs to know "this task will probably cost about $X and 8 minutes" *before* clicking Run, especially for delegated/scheduled runs and overnight batches. Without a forecast, the user discovers cost only by running, which forces over-cautious manual oversight on every run — the opposite of the platform's purpose.

**Recommendation:** Define `TaskRunCostForecast` in Core R0.7.1 produced at task pre-execution (Task Assessment time per Core §16): module-level cost estimates + estimator confidence per module, summed with a forecast confidence band. Display in the Task Assessment surface (Core §24.8). Defer accuracy goals to Phase 2 but ship the surface and the band.

**Reference:** Core R0.7.1 §16 (Task Assessment); V3.3.1 §11.15.4 RevisionCostEstimate; Source Workspace §7.6 cost attribution.
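
As a rough illustration of "module-level cost estimates + estimator confidence per module, summed with a forecast confidence band", the sketch below sums per-module estimates and widens the band for lower-confidence modules. The spec names `TaskRunCostForecast` but does not define its fields, so the shapes, the three confidence levels, and the band widths here are all hypothetical.

```typescript
// Hypothetical shapes: the field names and band widths are illustrative assumptions.
type EstimatorConfidenceLevel = "low" | "medium" | "high";

interface ModuleCostEstimate {
  module_id: string;
  expected_usd: number;
  confidence: EstimatorConfidenceLevel;
}

interface TaskRunCostForecast {
  expected_usd: number;
  low_usd: number;
  high_usd: number;
  weakest_confidence: EstimatorConfidenceLevel;
}

// Assumed relative band half-width per confidence level (not taken from the spec).
const BAND_WIDTH: Record<EstimatorConfidenceLevel, number> = { high: 0.1, medium: 0.25, low: 0.5 };
const CONFIDENCE_RANK: Record<EstimatorConfidenceLevel, number> = { low: 0, medium: 1, high: 2 };

function forecastTaskRun(modules: ModuleCostEstimate[]): TaskRunCostForecast {
  let expected = 0;
  let low = 0;
  let high = 0;
  let weakest: EstimatorConfidenceLevel = "high";
  for (const m of modules) {
    expected += m.expected_usd;
    low += m.expected_usd * (1 - BAND_WIDTH[m.confidence]);
    high += m.expected_usd * (1 + BAND_WIDTH[m.confidence]);
    // The forecast is only as trustworthy as its least confident module.
    if (CONFIDENCE_RANK[m.confidence] < CONFIDENCE_RANK[weakest]) weakest = m.confidence;
  }
  return { expected_usd: expected, low_usd: low, high_usd: high, weakest_confidence: weakest };
}
```

Summing per-module bands is the simplest possible choice and overstates the spread when module errors are independent; it is used here only because it is easy to explain on the Task Assessment surface.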

#### A3 · [UX] [HIGH] · V3.3.1 §21 + FD V1 §8 — Reviewability is fragmented across at least three surfaces

**Finding:** When a revision regenerates a 30-page brief, the user is offered: SemanticChangelog (V3.3.1 §7.11 / §21.6), then raw diff (§21.6), then FeedbackConsumptionReceipt chain (FD V1 §10.3 DOC20 insert "show the path from finding → repair instruction → consumer → produced artifact"), then RunBoard timeline (Run Board V1 §8.2), then Pattern display (V3.3.1 §21.8 — `from_memory` vs `adapted_from_memory`), then ExplanationTrace markdown (V3.3.1 §7.10), then SubAgent metrics (V3.3.1 §15.8.3). Each makes sense in isolation; none defines the *review session* shape a user actually performs.

**Why it matters:** A securities litigator reviewing AI work product does not work surface-by-surface. They want to ask: "what changed substantively, what was the AI's reasoning, what evidence backs it, where could it have gotten this wrong, and can I revert?" That is one continuous review activity, not seven separate surfaces. The spec puts the burden of integration on the implementer; the implementer will guess.

**Recommendation:** Define a `RevisionReviewSession` UI contract in V3.3.1 §21 (or a new sub-section of Core §20) that names the canonical surfaces the user navigates in order: top-of-pack `SemanticChangelog` summary → ExplanationTrace narrative → diff drill-in → Evaluation Result Card showing why this was needed → SubAgent provenance for advised steps → Revert/partial-accept controls. The contract names the surfaces and their navigation links; visual layout is non-normative.

**Reference:** V3.3.1 §21.6, §21.7, §21.8, §7.10, §7.11; FD V1 §10.3 (DOC20 insert).

#### A4 · [IDEA] [MEDIUM] · Set-wide — No "fork and continue" primitive at the task-run level

**Finding:** V3.3.1 has `fork_from_checkpoint` as a `RevisionPlanStepKind` (§0.4.6, §7.5) and Core mentions "rerun/fork-from-module" UI affordances in §27H.13. But at the task-run level — when a professional says "this run was almost right, branch from here and try a different drafting approach" — no schema and no flow are defined. `RunWorkspace` is explicitly run-scoped (V3.3.1 §12.1.1); branching is plan-internal only.

**Why it matters:** This is the most-asked feature for any drafting workflow. Without it, the user's recovery mechanism for "this run went sideways at module 5" is to delete or restart, losing the upstream work. Will's explicit goal (Core §1.4) is "high-stakes professional work where errors are expensive" — and the most expensive error is throwing away work that was nearly right.

**Recommendation:** Define `TaskRunFork` in Core R0.7.1 as a first-class primitive with `parent_run_id`, `fork_point_module_id_and_activation_seq`, `copied_workspace_refs`, `divergence_reason`, and `fork_disposition: "experimental" | "alternate_path" | "recovery"`. Define the user gesture (right-click a Run Board event → "fork from here"). Specify how `SourceWorkspace` (persistent) vs `RunWorkspace` (ephemeral) are handled at fork time. *(See §5 — this proposal was substantially expanded after the original review.)*

**Reference:** Core R0.7.1 §27H.13 (mentions but does not specify); V3.3.1 §0.4.6 (plan-internal only).

#### A5 · [UX] [MEDIUM] · FD V1 §3.4 + V3.3.1 §6.12 — Over-relies on user contesting; doesn't surface contestability

**Finding:** Findings are explicitly defeasible (FD V1 §1.3, §3.4 rules 1–7). User-contesting transitions a finding to `contested` and unblocks downstream. But nowhere is "this finding is the kind a user often contests" surfaced in the UI before they look at it. The Pattern primitive tracks `contested_finding_count` (V3.3.1 §13.3) and `false_positive_count` for sub-agents (§15.8); neither surfaces into the user's review flow.

**Why it matters:** A professional with limited review time wants to know up-front: "this evaluator tends to over-flag X — start there." Without contestability signaling, the user reviews findings in order and burns attention budget on findings that aren't really problems. The infrastructure exists; the surfacing doesn't.

**Recommendation:** Add `historical_false_positive_rate_in_context` and `historical_contest_rate_in_context` fields to the EvaluationFinding (or computed view) when rendered. Sort findings by reliability descending in the §21.1 Evaluation Result Card by default. Mark findings produced by sub-agents currently in `watch` status (§15.8.3) with a "lower-confidence" badge.

**Reference:** FD V1 §3.4; V3.3.1 §13.3 contested_finding_count; §15.8 sub-agent reputation; §21.1.
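
The ordering and badge rule in the A5 recommendation can be sketched as follows. Only the two rate field names and the `watch` status come from the text above; the `RenderedFinding` shape, the `"active"` status value, and the reliability score (one minus the worse of the two historical rates) are assumptions chosen for illustration, since the recommendation does not define "reliability".

```typescript
// Hypothetical rendered view of a finding.
interface RenderedFinding {
  finding_id: string;
  historical_false_positive_rate_in_context: number; // 0..1
  historical_contest_rate_in_context: number; // 0..1
  sub_agent_status?: "watch" | "active"; // "active" is an assumed counterpart to `watch`
}

interface ReviewRow extends RenderedFinding {
  lower_confidence_badge: boolean;
}

// Assumed score: one minus the worse of the two historical rates.
function reliability(f: RenderedFinding): number {
  return 1 - Math.max(f.historical_false_positive_rate_in_context, f.historical_contest_rate_in_context);
}

// Default ordering for the Evaluation Result Card: most reliable findings first.
function orderForReview(findings: RenderedFinding[]): ReviewRow[] {
  return findings
    .map((f) => ({ ...f, lower_confidence_badge: f.sub_agent_status === "watch" }))
    .sort((a, b) => reliability(b) - reliability(a));
}
```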

#### A6 · [IDEA] [MEDIUM] · Set-wide — The "process gap vs substantive gap" distinction is correct, but no learning loop closes it

**Finding:** Forum V1 §9 correctly distinguishes process gaps (graph design problem) from substantive gaps (work-product problem). `TaskProcessGapSignal` (Core §9.0.3) feeds the Task Agent assessment queue. But Task Agent's response to process gaps is described as "proposes graph patches"; there is no enumerated trigger that converts repeated process-gap signals into Task Blueprint amendment proposals or DOC72 task-design patterns. The signal flows but the loop doesn't close at the design layer.

**Why it matters:** A system that detects "your task design lacks a source verification stage" is half a system. The other half — promoting the detection into a learned default for future tasks of this kind — is what makes the platform compound (Core §28: "the value comes from compounding"). Without the closure, every task is taught the same lesson.

**Recommendation:** Add a §9.X to Core R0.7.1 specifying the process-gap-to-design-pattern conversion: threshold N of similar `TaskProcessGapSignal`s over similar TaskBlueprints → Task Agent emits a `TaskBlueprintAmendmentProposal` to the Task Design Learning Review Queue (Core §24.8 mentions this queue but does not specify the conversion rule). Specify what "similar" means structurally. *(Note: this touches the paused self-learning surface; treat as flagged, not deep-dived.)*

**Reference:** Forum V1 §9; Core R0.7.1 §9.0.3, §24.8; Core §28 closing.

### Part B — Targeted findings (§4)

#### B1 · [BUG] [CRITICAL] · FD V1 §3.3 vs V3.3.1 §5.7 — Two incompatible `EvaluationFinding` schemas

**Finding:** `EvaluationFinding` is defined in BOTH V3.3.1 §5.7 and Feedback Delivery V1.0.1 §3.3 with mutually exclusive fields. V3.3.1's version has `finding_text`, `severity (4 values)`, `state: FindingState (12 values)`, `basis: AssuranceBasis`, `target_artifact_ref`, `taint_class`, `confidence: "low"|"medium"|"high"`. FD V1's version has `finding_kind (12 values)`, `severity (4 values)`, `authority_basis: EvaluationAuthorityBasis[] (9 values, NEW enum)`, `lifecycle_state: EvaluationFindingLifecycleState (7 values)`, `target_criterion_id`, `target_scope_ref`, `affected_claim_refs`, `confidence: number (0-1)`, `based_on_board_digest_ref`. No mapping is provided. Common Contracts §4.2 references "EvaluationFinding[] — Addenda B V3.1 §5.7" — meaning the V3.3.1 schema is canonical. So FD V1 is declaring a schema that doesn't match the canonical reference.

**Why it matters:** A coding agent implementing the Outcome Evaluator emits `EvaluationFinding` per V3.3.1 §5.7; the same agent implementing Feedback Delivery reads `EvaluationFinding` per FD V1 §3.3. Apart from `severity`, the schemas share no compatible fields: `confidence` appears in both but is a three-level label in one and a 0-1 number in the other. The `FindingState` enum (V3.3.1) and `EvaluationFindingLifecycleState` enum (FD V1) overlap by name only. This is a single-name-two-schemas defect; the build will produce one model that fits neither.

**Recommendation:** Declare one canonical `EvaluationFinding` schema in DOC23 Evaluation Common Contracts V1.1.1 (the natural home — it is shared across Addenda A and B). Move it now, before V3.4 / V1.2 work proceeds. Reconcile field sets: V3.3.1's `state` and FD V1's `lifecycle_state` are the same concept; `basis: AssuranceBasis` (V3.3.1) is unrelated to `authority_basis: EvaluationAuthorityBasis[]` (FD V1) — pick one model or define how both coexist (they aren't redundant — `authority_basis` is what makes a finding a hard blocker per FD V1 §3.4 rule 2). Make the consolidation a coordinated schema bump per Common Contracts §10.2.

**Reference:** V3.3.1 §5.7; FD V1.0.1 §3.3; Common Contracts §4.2.

#### B2 · [BUG] [HIGH] · V3.3.1 §0.4.1 vs Common Contracts §3.1/§3.2 — `OutcomeEvaluationState` has 15 values; mapping treats it as 14

**Finding:** V3.3.1 §0.4.1 enumerates `OutcomeEvaluationState` with 15 values (`pending`, `pending_dependency`, `evaluating`, `satisfied`, `needs_revision`, `needs_information`, `needs_verification`, `needs_human_judgment`, `unable_to_evaluate`, `blocked_by_policy`, `regressed`, `unrecoverable`, `dirty`, `superseded`, `upstream_failure`). Common Contracts §3.1 says the field "is populated from V3.1's 14-value enum"; §3.2's verdict mapping covers 14 of the 15 — `evaluating` is not mapped at all, and the doc itself counts wrong.

**Why it matters:** A producer in the `evaluating` state has an undefined `evaluation_verdict` per the mapping. Real systems hit this transient state regularly (it's literally the in-flight state). Without a mapping, either the producer can't emit an envelope while evaluating (which would block §5.18's "every Evaluator activation emits exactly one envelope" guarantee), or it emits something the schema doesn't sanction.

**Recommendation:** Either (a) declare `evaluating` a transient state that MUST NOT emit an envelope (matching how `dirty`, `superseded`, `pending`, `pending_dependency` are treated in §3.2), AND fix Common Contracts to say "15-value enum, of which 5 are transient and 10 map to envelope verdicts," or (b) define `evaluating → indeterminate` and correct the stated enum count to 15 (with a version bump). Update §5.18.4 emission discipline to disambiguate.

**Reference:** V3.3.1 §0.4.1; Common Contracts V1.1.1 §3.1, §3.2; V3.3.1 §5.18.4.
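
One way to make this class of miscount impossible is to express the mapping as an exhaustive table, so that a missing state is a compile error rather than a spec footnote. The sketch below illustrates option (a). The 15 state names, the five indeterminate states, and `unrecoverable → failed` come from this review; the verdicts marked "assumed" are placeholders, because the Common Contracts §3.2 mapping for those states is not quoted here.

```typescript
type OutcomeEvaluationState =
  | "pending" | "pending_dependency" | "evaluating" | "satisfied"
  | "needs_revision" | "needs_information" | "needs_verification"
  | "needs_human_judgment" | "unable_to_evaluate" | "blocked_by_policy"
  | "regressed" | "unrecoverable" | "dirty" | "superseded" | "upstream_failure";

type EnvelopeVerdict = "passed" | "failed" | "indeterminate" | "not_applicable";

// null marks a transient state that must not emit an envelope (option (a)).
// Record<...> makes the compiler reject the table if any of the 15 states is missing.
const VERDICT_FOR_STATE: Record<OutcomeEvaluationState, EnvelopeVerdict | null> = {
  pending: null,
  pending_dependency: null,
  evaluating: null, // the state the current mapping omits
  dirty: null,
  superseded: null,
  satisfied: "passed", // assumed
  needs_revision: "failed", // assumed
  regressed: "failed", // assumed
  upstream_failure: "failed", // assumed
  unrecoverable: "failed",
  needs_information: "indeterminate",
  needs_verification: "indeterminate",
  needs_human_judgment: "indeterminate",
  unable_to_evaluate: "indeterminate",
  blocked_by_policy: "indeterminate",
};

function verdictFor(state: OutcomeEvaluationState): EnvelopeVerdict | null {
  return VERDICT_FOR_STATE[state];
}
```

With this shape, "15-value enum, of which 5 are transient and 10 map to envelope verdicts" is a property that can be checked mechanically instead of being restated by hand in two documents.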

#### B3 · [GAP] [HIGH] · V3.3.1 §5.18 + Common Contracts §3.7 — Pattern C chain ID lifecycle is unspecified

**Finding:** V3.3.1 §5.18.4 says the upstream Evaluator populates `target_evaluation_chain_id`. Common Contracts §3.7 says "the upstream Evaluator's envelope populates `target_evaluation_chain_id` with a UUID identifying the evaluation chain" and the downstream Judge sets the same value. But neither doc specifies: (1) Who generates the chain UUID — the Evaluator at every activation, even when no Pattern C Judge will attach? (2) What happens when the chain ID does not resolve at the consumer side? (3) How long the chain ID is retained / when it can be GC'd. (4) Whether different Evaluator activations of the same outcome share a chain ID or always get fresh ones.

**Why it matters:** If every Evaluator activation always gets a fresh UUID, then audit reconstruction (§3.7 "given the chain id, retrieve all envelopes in the chain") works only when Pattern C wiring is present — but the spec says every Evaluator emits this field. If the Evaluator does NOT always emit it, the field is sometimes-null and downstream Judges may attach to envelopes lacking the linkage primitive. Either way the field has no lifecycle.

**Recommendation:** In Common Contracts §3.7 specify: the upstream Evaluator MUST generate a fresh UUID at activation time (or at envelope emission time) and emit it as `target_evaluation_chain_id`; the Judge in Pattern C MUST read it from `evaluator_output_in.target_evaluation_chain_id` and set its own envelope's field to the same value; orphan envelopes (no downstream consumer) keep the field but it is unused. Add a validation code `validation.pattern_c_chain_id_mismatch` for when the Judge's value does not match its declared upstream Evaluator. Specify retention: the chain ID has the same retention as the envelope itself.

**Reference:** V3.3.1 §5.18.4; Common Contracts V1.1.1 §3.7. *(See also the cross-reviewer finding in §6 on the Pattern C envelope FIELD mismatch — `evaluated_target`/`evaluation_basis` are read by the Judge but live on a different schema. That is a distinct, additional defect.)*
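
The Judge-side check proposed in B3 is small enough to sketch directly. `target_evaluation_chain_id` and the validation code are from the recommendation; the `EnvelopeRef` shape, the `envelope_id` field, and the decision to report a missing upstream chain ID under the same code are assumptions.

```typescript
// Hypothetical minimal view of an envelope; only target_evaluation_chain_id is from the spec.
interface EnvelopeRef {
  envelope_id: string;
  target_evaluation_chain_id: string | null;
}

type ChainCheck =
  | { ok: true; chain_id: string }
  | { ok: false; code: "validation.pattern_c_chain_id_mismatch"; detail: string };

// Verifies that a Pattern C Judge envelope carries the same chain ID as its upstream Evaluator.
function checkPatternCChain(upstream: EnvelopeRef, judge: EnvelopeRef): ChainCheck {
  const expected = upstream.target_evaluation_chain_id;
  if (expected === null) {
    // Assumption: an upstream envelope without a chain ID is reported under the same code.
    return {
      ok: false,
      code: "validation.pattern_c_chain_id_mismatch",
      detail: "upstream envelope " + upstream.envelope_id + " carries no chain id",
    };
  }
  if (judge.target_evaluation_chain_id !== expected) {
    return {
      ok: false,
      code: "validation.pattern_c_chain_id_mismatch",
      detail: "judge envelope " + judge.envelope_id + " does not match upstream chain " + expected,
    };
  }
  return { ok: true, chain_id: expected };
}
```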

#### B4 · [GAP] [HIGH] · FD V1 §6.2 + V3.3.1 §5.18 — `on_indeterminate` is missing from FeedbackRoutingPolicy

**Finding:** `EvaluationDecision.verdict` has four values: `"passed" | "failed" | "indeterminate" | "not_applicable"` (FD V1 §2.1). `FeedbackRoutingPolicy` has branches for `on_satisfied`, `on_needs_revision`, `on_needs_more_sources`, `on_needs_source_verification`, `on_needs_format_repair`, `on_repeated_failure` — but no `on_indeterminate` and no `on_not_applicable`. V3.3.1 has 5 `OutcomeEvaluationState` values mapping to `indeterminate` (`needs_information`, `needs_verification`, `needs_human_judgment`, `unable_to_evaluate`, `blocked_by_policy`), so `indeterminate` is not rare.

**Why it matters:** When an evaluator returns `indeterminate`, the policy router has no branch to fire. Implementations will either pick a default silently (forum post? pause?) or hit a no-route condition and stall. This is precisely the kind of edge case the spec warns about ("trail off" semantics for indeterminate per the prompt §4.10).

**Recommendation:** Add explicit branches:

```
on_indeterminate:
  | "pause_for_human"
  | "post_to_forum"
  | "ask_task_agent_for_process_assessment"
  | "continue_with_warning"
on_not_applicable:
  | "continue"
  | "log_only"
```

The five indeterminate sub-states map to `on_indeterminate` unless the policy declares finer-grained handling. Add `on_unrecoverable` similarly (V3.3.1 maps `unrecoverable` to `failed` for the envelope, but the routing implications differ — repeated `unrecoverable` should not get the same router treatment as a fixable `needs_revision`).

**Reference:** FD V1.0.1 §6.2, §2.1; V3.3.1 §0.4.1; Common Contracts §3.2.
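
The default-plus-override rule ("the five indeterminate sub-states map to `on_indeterminate` unless the policy declares finer-grained handling") could be resolved as in the sketch below. The action values and sub-state names are from this finding; the `IndeterminateRouting` shape and the `on_indeterminate_overrides` field name are hypothetical.

```typescript
type IndeterminateState =
  | "needs_information" | "needs_verification" | "needs_human_judgment"
  | "unable_to_evaluate" | "blocked_by_policy";

type IndeterminateAction =
  | "pause_for_human" | "post_to_forum"
  | "ask_task_agent_for_process_assessment" | "continue_with_warning";

interface IndeterminateRouting {
  on_indeterminate: IndeterminateAction;
  // Hypothetical field: optional finer-grained handling per sub-state.
  on_indeterminate_overrides?: Partial<Record<IndeterminateState, IndeterminateAction>>;
}

// Every indeterminate sub-state resolves to an action, so the router never hits a no-route stall.
function routeIndeterminate(policy: IndeterminateRouting, state: IndeterminateState): IndeterminateAction {
  const overrides = policy.on_indeterminate_overrides;
  const specific = overrides ? overrides[state] : undefined;
  return specific !== undefined ? specific : policy.on_indeterminate;
}
```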

#### B5 · [BUG] [HIGH] · V3.3.1 §5.17.7 vs Common Contracts §1.2 — `ClaimSetBundle` and `ExtractedEvaluationUnit` ownership conflict

**Finding:** V3.3.1 §5.17.7 says "DOC23 Evaluation Common Contracts — `ExtractedEvaluationUnit` union, `ClaimSetBundle` schema, `ArtifactScopeRef` source-anchor primitives." Common Contracts V1.1.1 §1.2 explicitly says **out of scope**: "The Claim Extractor module schema or its 22-type unit registry (lives in Addenda A R4.1 V3)." Common Contracts §6.1 references "ClaimType[] (from Addenda A 22-type registry)" — pointing at Addenda A as owner. Grep confirms Common Contracts does NOT define `ClaimSetBundle` or `ExtractedEvaluationUnitBundle` anywhere.

**Why it matters:** Two of the three named primitives V3.3.1 §5.17 says are in Common Contracts aren't there. A coding agent implementing the `claims_in` port (typed for `"ClaimSetBundle" | "ExtractedEvaluationUnitBundle"`) cannot find these schemas in the document V3.3.1 sends them to.

**Recommendation:** Update V3.3.1 §5.17.7 cross-references to point `ExtractedEvaluationUnit` and `ClaimSetBundle` to Addenda A R4.1 V3 (their actual owner), keeping only `ArtifactScopeRef` cross-referenced to Common Contracts §7. Add an OP-A row tracking the import; if Addenda A doesn't expose them yet as public contracts, mark the obligation `pending`.

**Reference:** V3.3.1 §5.17.7; Common Contracts V1.1.1 §1.2, §6.1, §6.5.

#### B6 · [BUG] [HIGH] · Common Contracts §4.2 — `ResearchNeed` and `EvaluationAffirmation` are not where the doc says they are

**Finding:** Common Contracts §4.2 (QualitativeSlice) says: "The full `EvaluationFinding`, `OutcomeRepairInstruction`, `ResearchNeed`, and `EvaluationAffirmation` schemas live in Addenda B Core R0.7.1 (their owning addendum)." This is wrong on three counts: (1) `ResearchNeed` is defined in **Source Workspace V1.0.1 §6.2**, NOT in Core (grep of Core for `EvaluationAffirmation`/`ResearchNeed` returns zero hits). (2) `EvaluationAffirmation` is not defined anywhere: Core does not contain the term, and no schema appears in V3.3.1, Source Workspace, or Run Board. (3) `OutcomeRepairInstruction` is defined in **FD V1.0.1 §5.2**, NOT Core. Only `EvaluationFinding` is in V3.3.1 (and per B1 it conflicts with a second definition in FD V1).

**Why it matters:** A coding agent implementing the QualitativeEvaluationSlice fills four fields whose owning documents are misidentified. `EvaluationAffirmation` does not exist as a schema at all — it's a phantom type. Implementations will either skip the field, invent a structure, or copy from FD V1 §3 (which has its own bugs).

**Recommendation:** Fix Common Contracts §4.2 to point each schema to its actual owner: `EvaluationFinding` → consolidated location per B1; `OutcomeRepairInstruction` → FD V1.0.1 §5.2; `ResearchNeed` → Source Workspace V1.0.1 §6.2. Either define `EvaluationAffirmation` somewhere or remove it from the slice. If kept, specify what it carries (presumably "what the artifact got right" — the positive counterpart to a finding) and where it lives. *(This phantom is also a lead-in to the cross-reviewer "flawless-execution denominator" finding in §6.)*

**Reference:** Common Contracts V1.1.1 §4.2; Source Workspace V1.0.1 §6.2; FD V1.0.1 §5.2.

#### B7 · [RISK] [HIGH] · V3.3.1 §11.21 — Revalidation cascade has no convergence proof

**Finding:** §11.21 Phase 4 says: "If revalidation produces regression (outcome was satisfied, now needs_revision): Trigger Revisor activation for the regressed outcome. Plan may use fork_from_checkpoint to address regression." No bound is specified on cascade depth. If outcome A's revision causes B to regress, B's revision can cause A to re-regress (because the dependency declaration goes both ways via `OutcomeDependencySpec.invalidated_by_outcomes` traversal). The `per_outcome_retry_budget` is per-outcome but cascades cross outcomes; the only stated bound is `per_plan_max_replans` which is at the *plan* level, not the cascade level.

**Why it matters:** Two outcomes with bidirectional declared dependencies can ping-pong revisions indefinitely. Each cycle consumes logical budget on a different outcome, so per-outcome budget never trips. `consecutive_insufficient_limit` (§6.7.3) detects insufficient plans for the same outcome, not regression chains across outcomes. The cascade can livelock the task without ever firing a budget exhaustion.

**Recommendation:** Add `RevisorConfig.max_revalidation_cascade_depth: number` (default 5) measured from the originating mutation receipt. The Loop Controller tracks cascade chains; when a regression triggers revision and that revision triggers another regression on an outcome already in the chain, the chain is "circular" and aborts with `validation.revalidation_cascade_loop`. Add the validation code to V3.3.1 §22. Surface the cycle to the user as a Hard Call (`HardRevisionCallKind = "revalidation_cycle"` — new value) so the user can break the tie.

**Reference:** V3.3.1 §11.21; §6.7.3; §6.14 RevisorConfig.
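
A minimal sketch of the chain tracking the B7 recommendation asks of the Loop Controller. The default depth of 5 and the `validation.revalidation_cascade_loop` code are from the recommendation; the `CascadeChain` shape, the `originating_receipt_id` field name, and the reuse of the same code for the depth-exceeded case are assumptions.

```typescript
// Hypothetical record of one revalidation cascade, measured from the originating mutation receipt.
interface CascadeChain {
  originating_receipt_id: string;
  outcome_ids: string[]; // outcomes that have regressed in this chain, in order
}

type CascadeDecision =
  | { action: "revise"; chain: CascadeChain }
  | { action: "abort"; code: "validation.revalidation_cascade_loop"; reason: "circular" | "depth_exceeded" };

function onRegression(chain: CascadeChain, regressedOutcomeId: string, maxDepth: number = 5): CascadeDecision {
  // A regression on an outcome already in the chain means the cascade has looped.
  if (chain.outcome_ids.indexOf(regressedOutcomeId) !== -1) {
    return { action: "abort", code: "validation.revalidation_cascade_loop", reason: "circular" };
  }
  // Assumption: exceeding the depth bound is reported under the same validation code.
  if (chain.outcome_ids.length >= maxDepth) {
    return { action: "abort", code: "validation.revalidation_cascade_loop", reason: "depth_exceeded" };
  }
  return {
    action: "revise",
    chain: {
      originating_receipt_id: chain.originating_receipt_id,
      outcome_ids: chain.outcome_ids.concat([regressedOutcomeId]),
    },
  };
}
```

Because the chain is keyed to the originating receipt rather than to any one outcome, the A-regresses-B-regresses-A case is caught on the third regression even though neither outcome's own retry budget has tripped.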

#### B8 · [BUG] [HIGH] · V3.3.1 §11.21.2 + §5.14.1 — `upstream_failure_cascade` rule duplicated with slight divergence

**Finding:** The upstream-failure rule is specified twice. §5.14.1 and §11.21.2 currently agree on the trigger conditions (`execution_status in {could_not_fix, failed_runtime, rejected_capability}` AND `retry_count >= per_outcome_retry_budget`; both restrict the cascade target to outcomes in `pending_dependency`). But they live in different sections that a maintainer might update independently, and neither handles the case where an outcome is in `evaluating` (not `pending_dependency`) but is *about to* depend on the missing artifact — a race where the outcome becomes `pending_dependency` *after* the cascade fires.

**Why it matters:** The set is internally consistent here today, but duplication invites drift, and the race leaves an outcome hanging on an artifact that will never arrive.

**Recommendation:** Consolidate the rule in one section (§11.21.2, the named cascade); §5.14.1 links to it rather than restating. Add handling: when an outcome transitions to `pending_dependency` after a cascade has already fired on the upstream module, the new outcome is auto-evaluated against the `upstream_failure` set as part of its state-entry guard, not via a new cascade pass. Document the race in §11.21.

**Reference:** V3.3.1 §5.14.1; §11.21.2.
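
The state-entry guard proposed in B8 can be illustrated as a pure function evaluated at the moment an outcome would enter `pending_dependency`. The two state names are from the spec; the `OutcomeDeps` shape and the `depends_on_module_ids` field are hypothetical, and the upstream-failure set is modeled simply as a list of module IDs on which a cascade has already fired.

```typescript
type GuardedState = "pending_dependency" | "upstream_failure";

// Hypothetical view of an outcome's module dependencies.
interface OutcomeDeps {
  outcome_id: string;
  depends_on_module_ids: string[];
}

// Runs on state entry, so a late arrival is resolved without a second cascade pass.
function enterPendingDependency(outcome: OutcomeDeps, failedUpstreamModuleIds: string[]): GuardedState {
  const dependsOnFailedModule = outcome.depends_on_module_ids.some(
    (id) => failedUpstreamModuleIds.indexOf(id) !== -1
  );
  return dependsOnFailedModule ? "upstream_failure" : "pending_dependency";
}
```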

#### B9 · [RISK] [MEDIUM] · V3.3.1 §11.22 + §11.20.2 — Parallel batches racing the live-edit rolling hash

**Finding:** §11.22 allows parallel step execution up to `max_parallel_steps_per_plan: 4`. §11.20.2 Rolling Hash Mode B requires "Step N+1 validates against predicted hash from Step N output." But two parallel steps in the same batch with rolling-hash semantics cannot both validate against the same base; they need to be serialized. §11.20.2 mentions rolling hash is "available only when no concurrent plans target the artifact" — but does NOT say "no concurrent *steps* within the same plan."

**Why it matters:** A coding agent reading §11.22 + §11.20.2 will allow rolling-hash plans with parallel steps targeting the same artifact, producing nondeterministic hash chains. Sometimes the chain validates; sometimes it doesn't depending on completion order. The bug surfaces as flaky `validation.rolling_hash_chain_broken` in test and corrupted artifacts in production.

**Recommendation:** Add to §11.20.2: "Rolling hash mode B requires sequential step execution across all steps that mutate the same artifact. Parallelism within a plan is allowed only between steps targeting disjoint artifacts." Add `validation.rolling_hash_parallel_steps_same_artifact`. Document in §11.22 that parallel batches automatically degrade to sequential when any step is rolling-hash mode B.

**Reference:** V3.3.1 §11.22.1, §11.22.2; §11.20.2, §11.20.3.
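
One way to picture the proposed `validation.rolling_hash_parallel_steps_same_artifact` check is the sketch below. The `PlanStepSketch` shape, the `"rolling_b"` label, and the function name are illustrative assumptions, not V3.3.1 schema; the rule it encodes is only the one stated in the recommendation.

```ts
// Hypothetical shapes for illustration; not taken from V3.3.1.
type HashModeSketch = "strict" | "rolling_b";

interface PlanStepSketch {
  stepId: string;
  artifactRef: string;
  hashMode: HashModeSketch;
}

// Returns the artifact refs that two or more steps in one parallel batch
// would mutate while at least one of those steps uses rolling-hash mode B.
function findRollingHashConflicts(batch: PlanStepSketch[]): string[] {
  const byArtifact: { [artifactRef: string]: PlanStepSketch[] } = {};
  for (const step of batch) {
    const group = byArtifact[step.artifactRef] || (byArtifact[step.artifactRef] = []);
    group.push(step);
  }
  const conflicts: string[] = [];
  for (const artifactRef of Object.keys(byArtifact)) {
    const steps = byArtifact[artifactRef] || [];
    const usesRolling = steps.some((s) => s.hashMode === "rolling_b");
    if (steps.length > 1 && usesRolling) conflicts.push(artifactRef);
  }
  return conflicts;
}
```

A scheduler could run this per batch and either raise the validation code or fall back to sequential execution for the conflicting steps.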

#### B10 · [GAP] [MEDIUM] · V3.3.1 §8.4 + FD V1 §7 — Sub-agent dispatch has no "no sub-agent available" fallback for the evaluator point

**Finding:** §8.4 `AdvisorySubAgentProfile.allowed_coordination_points: ["outcome_compiler" | "evaluator" | "revision_compiler" | "feedback_interpreter"]`. Three of four are well-handled (outcome_compiler / revision_compiler / feedback_interpreter have explicit invoke + accept/reject/defer semantics). But for `evaluator`, no spec text describes what happens when an evaluator coordination point has no sub-agent registered for the relevant `AdvisorySubAgentOutput` variant. Run Board V1 §7.3 allows fan-out to `target: "task_agent"`, but task-agent fallback for an evaluator-stage sub-agent gap isn't specified.

**Why it matters:** A coding agent implementing the Evaluator can ship without sub-agent support and call it conformant by silence. The "no sub-agent available" path is exactly the most important fallback to specify: every implementation hits it before any sub-agent is registered.

**Recommendation:** Add V3.3.1 §8.X "No-sub-agent fallback per coordination point" listing, for each of the four allowed points, the deterministic behavior when no compatible sub-agent is available. For evaluator: proceed with the default specialist-subevaluator path; emit a `quality_signal` with `signal_kind = "sub_agent_unavailable_evaluator"`, `actionability = "metric_only"`; do not emit a Hard Call (sub-agent absence is not a strategic gap). For revision_compiler: proceed with single-Compiler reasoning; emit signal.

**Reference:** V3.3.1 §8.4; §3.4; §6.1.5; Run Board V1 §7.3.

#### B11 · [GAP] [MEDIUM] · FD V1 §8 — DOC23/DOC15/DOC24 boundary leaks at the `instruction_in` overload

**Finding:** §8.4 says "Existing ports carry feedback today: `instruction_in` — Repair Instructions and structured directives; `context_in` — Feedback bundle as context; `data_in` — Research needs as input data." But `instruction_in` is a general DOC23 R3.1 port. Overloading it to carry both ordinary instructions AND typed `OutcomeRepairInstruction` payloads means receiving modules must runtime-discriminate the payload — and the spec gives no discriminator field.

**Why it matters:** A module's `instruction_in` consumer cannot know whether it's getting an OutcomeRepairInstruction (structured: `target_scope_refs`, `preservation_constraints`, `suggested_route`) or a free-form instruction string. DOC15/CIL prompt assembly renders the wired input but doesn't know the type to render. The typed ports (`repair_instruction_in` etc.) are listed as "ergonomics, not required for V1" — so all V1 implementations go through the overload and each solves discrimination differently.

**Recommendation:** Either (a) elevate the typed ports (`feedback_in`, `repair_instruction_in`, `run_guidance_in`, `source_need_in`) to **required** for V1, with DOC23 R3.1's port registry insert (§10.3) listing them, OR (b) define a discriminator field in the payload union that DOC15/CIL uses to dispatch rendering. Don't ship the overload.

**Reference:** FD V1.0.1 §8.4, §10.3.
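
Option (b) can be sketched as a discriminated union. The `payload_kind` field and its two values are hypothetical names chosen for this illustration; the three repair-instruction fields come from the finding, but their types here (`string[]`, `string`) are assumed.

```ts
// Hypothetical discriminator field: payload_kind.
interface FreeFormInstructionSketch {
  payload_kind: "free_form_instruction";
  text: string;
}

interface OutcomeRepairInstructionSketch {
  payload_kind: "outcome_repair_instruction";
  target_scope_refs: string[];      // field types are assumptions
  preservation_constraints: string[];
  suggested_route: string;
}

type InstructionInPayloadSketch =
  | FreeFormInstructionSketch
  | OutcomeRepairInstructionSketch;

// A renderer dispatches on the discriminator instead of guessing the shape.
function renderInstruction(payload: InstructionInPayloadSketch): string {
  switch (payload.payload_kind) {
    case "free_form_instruction":
      return payload.text;
    case "outcome_repair_instruction":
      return (
        "Repair [" + payload.target_scope_refs.join(", ") + "] via " +
        payload.suggested_route + "; preserve: " +
        payload.preservation_constraints.join("; ")
      );
  }
}
```

With a discriminator in place, a new payload variant becomes a compile-time error in every consumer that has not handled it, which is the property the overload currently lacks.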

#### B12 · [RISK] [MEDIUM] · Run Board V1 §1.1 + §5.3 — Passive board "auto-publishes every event" is unbounded

**Finding:** §1.1: "Every task run has a passive Run Board... every system event auto-publishes." §1.2 lists ~12 event categories. §5.3: "Posts are append-only; MUST NOT be modified after creation." No retention, deduplication, throttling, or coalescing policy is specified.

**Why it matters:** A long-running task with frequent evaluator activations and high-cardinality artifact updates can produce thousands of posts. With no coalescing, the BoardDigest (§6.2, default `max_digest_tokens: 1200`) truncates non-deterministically or summarizes lossily. A debugging-grade audit feed is great; for the UI feed it's noise.

**Recommendation:** Add §1.5 "Coalescing and throttling policy": (a) cost/duration events coalesce to per-module summaries every N seconds; (b) artifact_reference events deduplicate by `artifact_ref + version_ref`; (c) the digest selection rule prefers terminal/blocking events over status events. Retention follows EC Core policy per matter class (Source Workspace §9.5 precedent). UI default filter excludes status_update; user opts in.

**Reference:** Run Board V1.0.1 §1.1, §1.2, §5.3, §6.2.

### Part C — Defect-hunt findings (§5)

#### C1 · [BUG] [HIGH] · Common Contracts §3.7 — Pattern C `route_recommendation` resolution is procedural, not specified

**Finding:** §3.7 says "Judge's quantitative recommendation governs when Pattern C is wired, since Judge is the more recent producer in the chain." But where is this resolved? In the Loop Controller? The Switch module? "Resolution is by consumer policy" — but no consumer policy schema exists. A Switch wired to both upstream Evaluator and downstream Judge envelopes via Pattern C has no rule to decide which `route_recommendation` to obey.

**Why it matters:** Pattern C is sold as a killer feature (§5.18.1). If routing is "consumer policy" without a schema, every consumer invents its own.

**Recommendation:** Define `PatternCRouteResolutionPolicy` in Common Contracts §3.7 (or V3.3.1 §5.18.7): "When a Switch or Loop Controller reads multiple envelopes sharing a `target_evaluation_chain_id`, it resolves the effective `route_recommendation` as the recommendation of the envelope with the largest `producer_activation_seq` (most recent). When recommendations conflict at the same seq, the consumer policy declares precedence (Judge > Evaluator > deterministic_scorer, or explicit override)." Add a validation code for missing policy.

**Reference:** Common Contracts V1.1.1 §3.7; V3.3.1 §5.18.7.
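
The proposed resolution rule is small enough to sketch directly. The envelope shape below is a reduced, hypothetical stand-in (only the fields the rule needs), and `resolveRoute` is an illustrative name; the default precedence mirrors the Judge > Evaluator > deterministic_scorer order from the recommendation.

```ts
type ProducerKindSketch = "judge" | "evaluator" | "deterministic_scorer";

// Reduced, hypothetical envelope: only the fields the rule reads.
interface EnvelopeSketch {
  producer_kind: ProducerKindSketch;
  producer_activation_seq: number;
  target_evaluation_chain_id: string;
  route_recommendation: string;
}

const DEFAULT_PRECEDENCE: ProducerKindSketch[] = ["judge", "evaluator", "deterministic_scorer"];

// Lower rank wins a tie; producers missing from the list rank last.
function rank(kind: ProducerKindSketch, precedence: ProducerKindSketch[]): number {
  const i = precedence.indexOf(kind);
  return i === -1 ? precedence.length : i;
}

function resolveRoute(
  envelopes: EnvelopeSketch[],
  chainId: string,
  precedence: ProducerKindSketch[] = DEFAULT_PRECEDENCE
): string | undefined {
  const candidates = envelopes.filter((e) => e.target_evaluation_chain_id === chainId);
  if (candidates.length === 0) return undefined; // nothing wired for this chain
  return candidates.reduce((best, e) => {
    if (e.producer_activation_seq !== best.producer_activation_seq) {
      return e.producer_activation_seq > best.producer_activation_seq ? e : best;
    }
    return rank(e.producer_kind, precedence) < rank(best.producer_kind, precedence) ? e : best;
  }).route_recommendation;
}
```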

#### C2 · [BUG] [HIGH] · V3.3.1 §0.4.4 vs §5.7 + §5.7.1 — `FindingState` has a pass-through state with no negative exit

**Finding:** §0.4.4 enumerates `FindingState`: `proposed | active | contested | resolved | superseded_by_revision | superseded_by_source_change | user_approved | tool_verified | human_verified | rejected_by_user | dismissed | unrecoverable`. The §5.7.1 transition table lists `proposed → active` as the only outbound transition from `proposed`. There is no `proposed → dismissed` transition for a proposed finding the Evaluator decides NOT to confirm.

**Why it matters:** What happens to a `proposed` finding that's never confirmed? Implementations will leave findings in `proposed` indefinitely or invent a transition the table doesn't list.

**Recommendation:** Add `proposed → dismissed` to §5.7.1 with a typed predicate ("Evaluator did not confirm before activation completion"). Specify that a finding in `proposed` at activation termination auto-transitions to `dismissed` with `dismissal_reason = "not_confirmed_at_termination"`.

**Reference:** V3.3.1 §0.4.4; §5.7.1.

#### C3 · [BUG] [HIGH] · V3.3.1 §6.5.2 + §7.9 — `HardRevisionCall.options` may be empty, but spec requires bounded options

**Finding:** §6.5.1 says detection produces a Hard Call "with bounded `HumanDecisionOption[]`." §7.9.1's schema has `options: HumanDecisionOption[]` with no non-empty constraint. If the Compiler cannot enumerate options (e.g., `human_preference_needed` with unbounded alternatives), the schema allows empty `options`.

**Why it matters:** Empty `options[]` makes the §21.4 UI surface non-functional — no buttons. The user cannot resolve; the Dispatcher stays in `waiting_hard_call` indefinitely (§6.5.3). The `default_if_no_response` fallback's trigger ("no response") is itself unspecified.

**Recommendation:** Constrain `options[]` to `MIN_LENGTH = 2` (e.g., "Accept" and "Reject" always present). For Hard Calls where the Compiler cannot enumerate substantive options, default to `{"continue_with_compiler_proposal", "pause_for_my_input"}`. Add `validation.hard_call_options_empty`. Specify the timeout that triggers `default_if_no_response`.

**Reference:** V3.3.1 §6.5.1, §6.5.3, §7.9.1, §21.4.

#### C4 · [GAP] [HIGH] · V3.3.1 §11.21.1 — Outcome dependency direction is undefined for cycles

**Finding:** §11.21.1 says cascading is determined by declared `OutcomeDependencySpec` and `EvaluationTargetClosurePolicy` (§5.11). Neither says what happens when `OutcomeDependencySpec` is bidirectional (A depends on B and B depends on A). The closure traversal will hit a cycle.

**Why it matters:** "Apply EvaluationTargetClosurePolicy to ensure closure" will not terminate if cycles aren't detected. Even with a visited-set guard, the *order* the cycle is evaluated in is undefined, and the user can't express the resolution.

**Recommendation:** Add to §5.11: cycle detection during closure with `validation.outcome_dependency_cycle_detected` (warning, not error — cycles are allowed; spec just needs an order). Specify cycle-evaluation order: topological order among non-cyclic outcomes, then cyclic outcomes by `outcome_priority` descending (require `outcome_priority` to be declared for cyclic outcomes).

**Reference:** V3.3.1 §11.21.1; §5.11.
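
A rough sketch of the proposed ordering, under one reading of the recommendation: Kahn's algorithm yields the topological prefix, and whatever it cannot release (outcomes in a cycle, or blocked behind one) is ordered by `outcome_priority` descending. The `OutcomeNodeSketch` shape is hypothetical, every dependency is assumed to name a declared outcome, and a missing priority is treated as 0 here even though the recommendation would require it to be declared.

```ts
// Hypothetical shape; not the OutcomeDependencySpec schema.
interface OutcomeNodeSketch {
  id: string;
  dependsOn: string[];
  outcome_priority?: number;
}

function evaluationOrder(outcomes: OutcomeNodeSketch[]): { order: string[]; cyclic: string[] } {
  const remainingDeps: { [id: string]: number } = {};
  const dependents: { [id: string]: string[] } = {};
  for (const o of outcomes) {
    remainingDeps[o.id] = o.dependsOn.length;
    for (const dep of o.dependsOn) {
      const list = dependents[dep] || (dependents[dep] = []);
      list.push(o.id);
    }
  }

  // Kahn's algorithm: release outcomes whose dependencies are all ordered.
  const ready = outcomes.filter((o) => o.dependsOn.length === 0).map((o) => o.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const child of dependents[id] || []) {
      remainingDeps[child] = (remainingDeps[child] || 0) - 1;
      if (remainingDeps[child] === 0) ready.push(child);
    }
  }

  // Anything never released sits in a cycle or behind one.
  const cyclic = outcomes
    .filter((o) => order.indexOf(o.id) === -1)
    .sort((a, b) => (b.outcome_priority ?? 0) - (a.outcome_priority ?? 0))
    .map((o) => o.id);

  return { order: order.concat(cyclic), cyclic };
}
```

A non-empty `cyclic` list is the natural place to emit `validation.outcome_dependency_cycle_detected` as a warning.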

#### C5 · [BUG] [MEDIUM] · V3.3.1 §6.16 + §13.3 — `LearningMode` enum is referenced but never enumerated

**Finding:** §6.14 `RevisorConfig.learning_mode: LearningMode // default "production"; see §6.16`. §0.4 says the V3.2 inventory was "extended with `LearningMode`, `ModelClass`, `CrossModelApplicability`" but the `LearningMode` values are not enumerated in §0.4 where they should be. The enum is used as `"production"` in the default and as `"signal_generation"` in the Common Contracts §5.4 gating, but the complete value set is not declared in the canonical enum inventory.

**Why it matters:** The signal envelope gates cheap-model learning on `model_class` AND `learning_mode = "signal_generation"`. Implementations need the full value set (presumably `production`, `signal_generation`, `cross_calibration`).

**Recommendation:** Explicitly enumerate `LearningMode` in V3.3.1 §0.4 alongside the other enums; cross-reference §6.16 implementation. Verify all referenced enum values from §6.16 onward exist in §0.4. *(Touches the paused learning surface; flag-only.)*

**Reference:** V3.3.1 §6.14, §6.16; Common Contracts §5.4.

#### C6 · [BUG] [MEDIUM] · Common Contracts §6.4 vs §9.4 — `unanchored_llm_judgment` acknowledgment has no field to write into

**Finding:** §6.4: `unanchored_llm_judgment` is NOT aggregation-eligible by default; Judge's `OutcomeComplianceScoringConfig` can override with an audit flag. §9.4: `scoring_basis = "unanchored_llm_judgment"` AND `required = true` requires explicit user acknowledgment. But no schema has a field to store that acknowledgment.

**Why it matters:** The warning has no acknowledgment field to write into. Implementations either fire the warning every time (loud) or store ack in an undocumented field.

**Recommendation:** Add `Criterion.unanchored_aggregation_acknowledged_by_user: boolean` (with `user_ref` + `acknowledged_at` for audit). The warning fires when `scoring_basis == "unanchored_llm_judgment" && required == true && unanchored_aggregation_acknowledged_by_user == false`; once acknowledged, it silences.

**Reference:** Common Contracts §6.4, §9.4.

#### C7 · [GAP] [MEDIUM] · Common Contracts §7.1 + §9.5 — `ArtifactScopeRef` structured anchor can be validly empty

**Finding:** §9.5: "`ArtifactScopeRef.anchor` null is allowed only when `scope_kind = "document"`." So a `field` scope requires a non-null anchor. But `StructuredAnchor` has all-optional sub-fields (`section_id?`, `field_path?`, `citation_ref?`). A `StructuredAnchor {}` is non-null but carries no anchor information, passing §9.5 while being useless.

**Why it matters:** A coding agent can construct `ArtifactScopeRef { scope_kind: "field", anchor: StructuredAnchor {} }` that validates but cannot be resolved. Findings or repair instructions referencing this scope are un-locatable.

**Recommendation:** Strengthen §9.5: `StructuredAnchor` must have at least one of `section_id`, `field_path`, `citation_ref` populated. Add `validation.structured_anchor_empty`.

**Reference:** Common Contracts V1.1.1 §7.1, §7.3, §9.5.
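
The strengthened §9.5 rule amounts to a one-line predicate. In this sketch the three optional fields are taken from the finding and typed as strings by assumption; treating an empty or whitespace-only string as unpopulated is also an assumption.

```ts
// Field names from the finding; string types are assumed.
interface StructuredAnchorSketch {
  section_id?: string;
  field_path?: string;
  citation_ref?: string;
}

function isPopulated(value: string | undefined): boolean {
  return value !== undefined && value.trim().length > 0;
}

// Returns validation codes; an empty array means the anchor is usable.
function validateStructuredAnchor(anchor: StructuredAnchorSketch): string[] {
  const hasAny =
    isPopulated(anchor.section_id) ||
    isPopulated(anchor.field_path) ||
    isPopulated(anchor.citation_ref);
  return hasAny ? [] : ["validation.structured_anchor_empty"];
}
```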

#### C8 · [BUG] [MEDIUM] · Source Workspace §4.1 + V3.3.1 §15.10 — `SourceRecord.taint_class` "inherited from source kind" is unspecified

**Finding:** Source Workspace §4.1: `SourceRecord.taint_class: TaintClass // inherited from source kind / retrieval method`. But no `source_kind → taint_class` mapping exists. The 15 `source_kind` values don't map to any of the 8 taint classes anywhere.

**Why it matters:** Two implementations of `step.source_research` pick different defaults. A PACER `case_law` source might be `external_authority_trusted` in one impl and `external_untrusted` in another. Pattern C Judges that gate sandboxed-vs-trusted handling on the taint class produce inconsistent results.

**Recommendation:** Add Source Workspace §4.1A "Default taint_class per source_kind": `document/email/file → user_trusted_bounded`; `web_source → external_untrusted`; `api_result/database_record → external_authority_trusted`; `library_entry → internal_corpus_trusted`; `case_law/statute/regulation → external_authority_trusted`; `prior_task_output → user_trusted_bounded`. Allow `taint_class_override` per query.

**Reference:** Source Workspace V1.0.1 §4.1; V3.3.1 §15.10, §15.10.1.
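
The recommended defaults could be carried as a plain lookup table, as in this sketch. The table transcribes only the mappings listed in the recommendation; `resolveTaintClass` is an illustrative name, and returning `undefined` for an unmapped `source_kind` is an assumption (the recommendation does not say what an unlisted kind should default to).

```ts
// Defaults transcribed from the recommendation above; nothing else is implied.
const DEFAULT_TAINT_BY_SOURCE_KIND: { [sourceKind: string]: string } = {
  document: "user_trusted_bounded",
  email: "user_trusted_bounded",
  file: "user_trusted_bounded",
  web_source: "external_untrusted",
  api_result: "external_authority_trusted",
  database_record: "external_authority_trusted",
  library_entry: "internal_corpus_trusted",
  case_law: "external_authority_trusted",
  statute: "external_authority_trusted",
  regulation: "external_authority_trusted",
  prior_task_output: "user_trusted_bounded",
};

// A per-query override wins; otherwise fall back to the table.
function resolveTaintClass(sourceKind: string, taintClassOverride?: string): string | undefined {
  if (taintClassOverride !== undefined) return taintClassOverride;
  const known = Object.prototype.hasOwnProperty.call(DEFAULT_TAINT_BY_SOURCE_KIND, sourceKind);
  return known ? DEFAULT_TAINT_BY_SOURCE_KIND[sourceKind] : undefined;
}
```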

#### C9 · [GAP] [MEDIUM] · Run Board V1 §3.1 + §6.2 — BoardDigest filter rule is unspecified

**Finding:** §6.2 BoardDigest carries `included_post_ids` etc., but the **filter rule** that selects `included_post_ids` from all forum posts is not specified. `BoardDigestPolicy` (§3.2) has include-flags but no `include_post_kinds`, no severity threshold, no max-count, no selection strategy.

**Why it matters:** A 500-post forum cannot ship all posts in a 1200-token digest. Without selection rules, implementations pick differently (newest 20? highest-severity? tagged?), producing divergent downstream behavior.

**Recommendation:** Extend `BoardDigestPolicy` with `included_post_kinds: TaskRunBoardPostKind[]`, `included_severity_threshold`, `max_posts: number`, `selection_strategy: "recency" | "severity" | "score" | "mixed"`. Default: include `evaluation_finding`, `repair_instruction`, `process_gap`, `user_guidance`; severity ≥ medium; max 30; mixed (50% recency, 50% severity).

**Reference:** Run Board V1.0.1 §3.1, §3.2, §6.2.
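
A sketch of how the extended policy might drive selection under the `"mixed"` strategy. The post shape, the three-level severity scale, and `created_seq` are hypothetical; "50% recency, 50% severity" is interpreted here as filling half the slots with the newest eligible posts and the remainder with the most severe of what is left, which is one possible reading.

```ts
type SeveritySketch = "low" | "medium" | "high";

// Hypothetical reduced shapes for illustration.
interface BoardPostSketch {
  post_id: string;
  post_kind: string;
  severity: SeveritySketch;
  created_seq: number;
}

interface DigestPolicySketch {
  included_post_kinds: string[];
  included_severity_threshold: SeveritySketch;
  max_posts: number;
}

const SEVERITY_RANK: Record<SeveritySketch, number> = { low: 0, medium: 1, high: 2 };

function selectDigestPosts(posts: BoardPostSketch[], policy: DigestPolicySketch): string[] {
  const eligible = posts.filter(
    (p) =>
      policy.included_post_kinds.indexOf(p.post_kind) !== -1 &&
      SEVERITY_RANK[p.severity] >= SEVERITY_RANK[policy.included_severity_threshold]
  );
  // Half the slots go to the newest eligible posts.
  const byRecency = eligible.slice().sort((a, b) => b.created_seq - a.created_seq);
  const recentPicks = byRecency.slice(0, Math.ceil(policy.max_posts / 2));
  // Remaining slots go to the most severe leftovers (newest first on ties).
  const severityPicks = byRecency
    .filter((p) => recentPicks.indexOf(p) === -1)
    .sort((a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity] || b.created_seq - a.created_seq)
    .slice(0, policy.max_posts - recentPicks.length);
  return recentPicks.concat(severityPicks).map((p) => p.post_id);
}
```

Because the rule is deterministic for a given post set and policy, two implementations given the same forum would produce the same `included_post_ids`.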

#### C10 · [BUG] [MEDIUM] · V3.3.1 §11.6 + §0.4.7 — `RevisionOperationKind = "hard_call_resolved"` has no producer

**Finding:** §0.4.7 `RevisionOperationKind` includes `hard_call_resolved`. §7.9 defines `HardCallResolution` persisted to the ledger. §11.6 lists operation kinds but doesn't say which actor emits an operation with `operation_kind = "hard_call_resolved"` (Dispatcher? UI? Revisor?).

**Why it matters:** Operation receipts feed RepairCycleSignal (Core §9.0.2). If the actor is ambiguous, the receipt is missing or duplicated, breaking the `hard_call_resolved → revision_operation_receipt_ref` chain in `RevisorActionRecord`.

**Recommendation:** Add §7.9.4: "When a HardCallResolution is recorded, the Dispatcher emits a `RevisionOperationReceipt` with `operation_kind = "hard_call_resolved"` and `hard_call_ref` pointing to the resolved Hard Call. The receipt's `actor_ref` is the Dispatcher's runtime identity; the resolution records `resolved_by: UserRef` separately."

**Reference:** V3.3.1 §0.4.7; §7.9; §11.6; Core §9.0.2.

#### C11 · [GAP] [MEDIUM] · FD V1 §9.4 — "Silent ignoring fires validation" is unenforceable as written

**Finding:** §9.4: modules that explicitly ignore feedback emit a receipt; "Silent ignoring (no receipt) fires `validation.feedback_consumed_without_receipt` at run audit time." But the detection mechanism — which bundles were routed to which modules and whether receipts returned — is not specified, so the audit has no way to enumerate "expected receipts."

**Why it matters:** The validation never fires in practice. A module that processed-and-ignored without a receipt looks identical at audit time to one that received nothing.

**Recommendation:** Add §9.4A "Receipt expectation tracking": when the feedback router (§6) dispatches a bundle to a consumer, it records a `FeedbackDispatchExpectation` keyed to `(feedback_bundle_id, consumer_module_id, consumer_activation_seq)`. At run end (or after a receipt-grace period) the audit compares expectations to receipts; missing pairs fire the validation. Specify the grace period default (e.g., 5 minutes after consumer activation completes).

**Reference:** FD V1.0.1 §9.4.
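
The audit comparison in the proposed §9.4A can be sketched as a set difference over the expectation key. The `DispatchKeySketch` shape reuses the three key fields from the recommendation (types assumed); the grace period is left out of this sketch and would decide when the comparison is allowed to run.

```ts
// Key fields from the recommendation; types are assumptions.
interface DispatchKeySketch {
  feedback_bundle_id: string;
  consumer_module_id: string;
  consumer_activation_seq: number;
}

// JSON-encode the tuple so ids containing separators cannot collide.
function dispatchKey(k: DispatchKeySketch): string {
  return JSON.stringify([k.feedback_bundle_id, k.consumer_module_id, k.consumer_activation_seq]);
}

// Expectations with no matching receipt; each one would fire
// validation.feedback_consumed_without_receipt at audit time.
function findMissingReceipts(
  expectations: DispatchKeySketch[],
  receipts: DispatchKeySketch[]
): DispatchKeySketch[] {
  const received: { [key: string]: boolean } = {};
  for (const r of receipts) received[dispatchKey(r)] = true;
  return expectations.filter((e) => received[dispatchKey(e)] !== true);
}
```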

#### C12 · [BUG] [MEDIUM] · Run Board V1 §5.4 + Source Workspace §9.4 — Cross-matter post visibility unspecified

**Finding:** Run Board §5.4's 5 visibility values don't restrict by matter. §5.2 has a `matter_id?` field. Source Workspace §9.4 says cross-matter retrieval doesn't surface privileged-matter entries. But a forum post in a privileged matter with `visibility: "all_task_modules"` — is it visible to a module in another matter? The rule isn't stated.

**Why it matters:** A privilege firewall breach via forum posts is catastrophic for a litigator. If matter scoping is implicit, it's not enforced.

**Recommendation:** Add Run Board §5.5 "Cross-matter visibility rule": a post with `matter_id == X` is visible only to readers operating under `matter_id == X`, regardless of `visibility`; `visibility` scopes within the matter. Add `validation.forum_post_cross_matter_leak`. Document interaction with `privileged: true` (always matter-scoped; never cross-matter regardless of access tier).

**Reference:** Run Board V1.0.1 §5.2, §5.4; Source Workspace V1.0.1 §9.4; V3.3.1 §13.4.
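
The proposed rule puts the matter check ahead of the `visibility` check. In this sketch the reader model (`qualifies_for`, a list of visibility values the reader satisfies) is hypothetical, as is denying a privileged post that carries no `matter_id`; the recommendation only says privileged posts are always matter-scoped.

```ts
// Hypothetical reduced shapes for illustration.
interface ForumPostSketch {
  matter_id?: string;
  visibility: string;
  privileged?: boolean;
}

interface ForumReaderSketch {
  matter_id?: string;
  qualifies_for: string[]; // visibility values this reader satisfies (assumed model)
}

function canReadPost(post: ForumPostSketch, reader: ForumReaderSketch): boolean {
  // Assumption: a privileged post with no matter cannot be scoped, so deny.
  if (post.privileged === true && post.matter_id === undefined) return false;
  // Matter gate first: visibility never widens access across matters.
  if (post.matter_id !== undefined && post.matter_id !== reader.matter_id) return false;
  // Within the matter, visibility scopes as before.
  return reader.qualifies_for.indexOf(post.visibility) !== -1;
}
```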

#### C13 · [GAP] [LOW] · Source Workspace §3.3 — Tier transitions lack policy/validation

**Finding:** §3.3 `SourceTierTransition` is recorded on promotion, but no rule prevents demotion (tier 3 → tier 1), no validation requires a substantive reason, and no policy gates ad-hoc tier changes by access tier (per V3.3.1 §16).

**Why it matters:** A read-only user can demote a tier-3 source card to tier 1, dropping rich content and metadata. The transition is recorded but the data is gone.

**Recommendation:** Add §3.3A "Tier transition policy": demotion (`from_tier > to_tier`) requires `cleared_by_access_tier >= "matter_team_access"` and a non-empty `reason`. Promotion is unrestricted by access tier. Add `validation.source_tier_demotion_without_authority`.

**Reference:** Source Workspace V1.0.1 §3.3; V3.3.1 §16.

#### C14 · [BUG] [LOW] · Common Contracts §3.4 — Parallel Judge+Evaluator example contradicts §5.18.8

**Finding:** §3.4: "Two envelopes can reference the same snapshot... e.g., Judge and Evaluator running in parallel on the same artifact version." But V3.3.1 §5.18.8 places Patterns A/B Evaluator inside an Experiment context, and Pattern C has the Judge *consume* the Evaluator's output (so they cannot run in parallel). The example fits no specified topology.

**Why it matters:** A coding agent reading §3.4 thinks parallel Judge+Evaluator is a supported topology and tries to wire it.

**Recommendation:** Fix the example: "e.g., two Evaluator activations on the same snapshot during an Experiment, or an Evaluator and a deterministic scorer in parallel." Either remove the Judge example or specify the topology where Judge and Evaluator are genuinely parallel.

**Reference:** Common Contracts V1.1.1 §3.4; V3.3.1 §5.18.8.

#### C15 · [GAP] [LOW] · Core §3D + Run Board §6.4 — `TaskOpportunityPacket` and `TaskRunContextPacket` taxonomy is unstated

**Finding:** Core §3D defines `TaskOpportunityPacket` (DOC24-assembled; direct-mode; 150–600 token budget). Run Board §6.3 defines `TaskRunContextPacket` (DOC24-assembled; for module activations during a run). Both are DOC24-assembled contextual packets; the differentiation is real (pre-run opportunity vs in-run context) but unstated, and their fields overlap conceptually.

**Why it matters:** A DOC24 coding agent sees both and picks one or invents a third.

**Recommendation:** Add a "Packet taxonomy" subsection to Core §3D listing the known packet types: `TaskOpportunityPacket` (pre-task, ambient), `TaskRunContextPacket` (in-run, module-scoped), and the future `TaskAgentDesignPacket` (§3D.3). State each one's DOC24 lane and audience. *(See D18 — this overlaps the "token budget fragmented across packets" finding.)*

**Reference:** Core R0.7.1 §3D.2, §3D.3; Run Board V1.0.1 §6.3, §6.5.

---

## 3. Deeper-dive additional findings (D1–D24)

These were produced after the original 33 findings (self-labeled "32" in the original pass), on closer reading of the two large documents and the seams. Two are CONFIRMED affirmations.

#### D1 · [BUG] [HIGH] · V3.3.1 §17.1 vs §8.4 — Sub-agent coordination point count mismatch (4 vs 5)

§17.1 lists five sub-agent coordination points (including a Plan Verifier), but the `allowed_coordination_points` enum at §8.4 has only four (`outcome_compiler`, `evaluator`, `revision_compiler`, `feedback_interpreter`). A sub-agent profile cannot declare itself for the Plan Verifier point. Reconcile: either add the fifth enum value or remove the §17.1 reference. **Reference:** V3.3.1 §17.1, §8.4.

#### D2 · [RISK] [MEDIUM] · V3.3.1 §6.7.2 — Success-condition 5 races the cascade

Success condition 5 ("Cascaded dependent outcomes are re-evaluated") is checked at revision-cycle completion, but the cascade (§11.21) can still be firing when the Loop Controller evaluates the seven conditions. A revision can be marked successful before its own regression cascade has settled. Gate condition 5 on cascade quiescence. **Reference:** V3.3.1 §6.7.2, §11.21.

#### D3 · [GAP] [MEDIUM] · Run Board V1 §4.6 — Moderator failure path unspecified

`ForumModeratorPolicy` defines five moderator modes but no behavior when the moderator agent (for `task_agent_advisory`, `domain_moderator`) is unavailable or errors. The forum has no defined degradation (fall back to `none`? pause? queue?). Specify a moderator fallback analogous to the Task Agent fallback (§6.9.1). **Reference:** Run Board V1.0.1 §4, §4.6.

#### D4 · [GAP] [MEDIUM] · V3.3.1 §6.12 / §13.3 — `goal_advancement_count` has no decrement path

`goal_advancement_count` only increments. A pattern whose later applications stop advancing the goal never loses count; the metric is monotonic and so becomes stale-positive over time. Define a decay or re-evaluation rule (or a windowed denominator). *(Touches paused learning surface; flag-only, but the no-decrement asymmetry is structural.)* **Reference:** V3.3.1 §13.3.

#### D5 · [BUG] [MEDIUM] · Common Contracts §11.5 — Backward-compat claim overstates stability

§11.5 claims read-compatibility across versions, but the documents reference each other by section number (e.g., "Addenda B V3.1 §5.7"). On absorption into DOC23 R3.2, section numbers shift and these references break. The compatibility claim holds for schemas but not for the cross-references that bind them. Convert section-number cross-refs to stable symbolic anchors before absorption. **Reference:** Common Contracts V1.1.1 §11.5.

#### D6 · [GAP] [MEDIUM] · Source Workspace §6.2 — `ResearchNeed.status` lacks exit transitions for `human_needed`

`ResearchNeed` has a status lifecycle, but the `human_needed` status has no defined exit (who resolves it, what transitions it to satisfied/abandoned, what happens if never resolved). Open research needs in `human_needed` can leak across the run. Specify the exit transitions and a default disposition at run end. **Reference:** Source Workspace V1.0.1 §6.2.

#### D7 · [RISK] [MEDIUM] · Core §9.0.6 — Signal emission ordering vs receipt persistence is undefined

The §9.0.6 signal flow shows signals emitted then passing through the EC policy gate, but the ordering relationship between signal emission and the durable persistence of the receipts those signals reference is not specified. A consumer can receive a signal referencing a receipt not yet durably written. Specify emit-after-persist (or a read-your-writes guarantee at the gate). **Reference:** Core R0.7.1 §9.0.6.

#### D8 · [BUG] [LOW] · Common Contracts §3.1 vs V3.3.1 §5.1 — `evaluation_chain_id` naming asymmetry

The field is `target_evaluation_chain_id` on the envelope (Common Contracts §3.1) but referred to as `evaluation_chain_id` in V3.3.1 §5.1 prose. Same concept, two names; a coding agent may treat them as distinct. Normalize the name. **Reference:** Common Contracts §3.1; V3.3.1 §5.1.

#### D9 · [CONFIRMED] · V3.3.1 §6.6.2 — AutonomousModePolicy locked fields

Locking the four `may_skip_*` fields to `false` at the schema level (the validator rejects any record where they differ), with only `skip_low_risk_judgment_gate` user-settable, is exactly the right way to make autonomy safe by construction. Hard Calls, policy gates, privileged artifacts, and external side effects always require human gates. This is the model the rest of the spec's safety-critical invariants should follow. **Reference:** V3.3.1 §6.6.2.

#### D10 · [CONFIRMED] · FD V1 §3.4 — Defeasible findings

The seven defeasibility rules (findings are contestable; user-contesting transitions to `contested` and unblocks downstream; `authority_basis` is what makes a finding a hard blocker) are a genuinely good piece of design. Most evaluation systems treat findings as ground truth; treating them as defeasible-but-typed is the correct trust model for high-stakes review. **Reference:** FD V1.0.1 §3.4.

#### D11 · [IDEA] [MEDIUM] · Set-wide — A `TaskReplay` primitive would close the determinism story

The set has `execution_watermark_ref` (Common Contracts §3.1) and snapshots, but no first-class replay primitive that reconstructs a run deterministically from recorded state. For a litigator who must demonstrate "this is exactly what the system did," replay is the proof. Define `TaskReplay` (inputs: run_id, watermark; output: reconstructed event sequence with divergence detection). *(Relates to S5 RunReplayPreview and to the determinism purpose-question A3.)* **Reference:** Common Contracts §3.1; V3.3.1 §3.7.4.

#### D12 · [UX] [LOW] · V3.3.1 §6.10 — Planner confidence threshold needs calibration guidance

The default 0.7 confidence threshold (below which plans surface for human review) is a single magic number with no calibration guidance. Different goal kinds warrant different thresholds. Add per-goal-kind threshold guidance or a calibration mechanism, and surface the active threshold in the UI. **Reference:** V3.3.1 §6.10.

#### D13 · [GAP] [HIGH] · V3.3.1 §5.2 / §6.10 — `compiler_confidence_score` has no computation

The Compiler confidence score gates human review (§6.10) but no document specifies how it is computed. A self-reported confidence with no grounding is a sycophancy vector — the same risk the §6.12 goal-advancement fix carefully avoids. Specify the computation (or bind it to an independent signal, not self-assessment). *(This is subsumed by the cross-reviewer "Formula Registry" finding — see §6.)* **Reference:** V3.3.1 §5.2, §6.10.

#### D14 · [GAP] [MEDIUM] · V3.3.1 §7.9.3 — HardCallResolution hash normalization unspecified

The reuse rule (§7.9.3) compares `outcome_definition_hash` and `goal_context_hash` for compatibility, but how those hashes are computed/normalized is unspecified. Two semantically-identical outcomes with cosmetic differences (whitespace, field ordering) produce different hashes, so a valid prior resolution is needlessly re-escalated. Specify a canonical normalization before hashing. **Reference:** V3.3.1 §7.9.3.
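
As an illustration of what "canonical normalization" could mean, the sketch below serializes a JSON-like value with sorted object keys and collapsed string whitespace, so cosmetically different inputs yield the same string to feed into whatever digest the spec selects. Whether whitespace inside strings is really cosmetic is a spec decision; collapsing it here is an assumption, and `canonicalize` is a hypothetical helper name.

```ts
type JsonValueSketch =
  | string
  | number
  | boolean
  | null
  | JsonValueSketch[]
  | { [key: string]: JsonValueSketch };

// Deterministic serialization: sorted keys, no insignificant whitespace.
function canonicalize(value: JsonValueSketch): string {
  if (value === null || typeof value !== "object") {
    // Assumption: runs of whitespace inside strings are cosmetic.
    const scalar = typeof value === "string" ? value.trim().replace(/\s+/g, " ") : value;
    return JSON.stringify(scalar);
  }
  if (Array.isArray(value)) {
    // Array order is treated as meaningful and preserved.
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  return (
    "{" +
    keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k] as JsonValueSketch)).join(",") +
    "}"
  );
}
```

Two outcome definitions that differ only in field order or padding then produce identical input to the hash, so a prior `HardCallResolution` stays reusable.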

#### D15 · [GAP] [HIGH] · Source Workspace + Run Board — Concurrency model undefined

Neither Source Workspace nor the Forum specifies a concurrency model for simultaneous writes (two modules adding sources to the same workspace, two participants posting to the same forum segment). Append-only posts help the forum, but Source Workspace `SourceRecord` mutation and tier transitions have no concurrency control. Specify last-write-wins vs optimistic-locking vs serialized writes per record kind. *(Relates to scenario C1(a) in the purpose-question library — two modules satisfying the same ResearchNeed.)* **Reference:** Source Workspace V1.0.1; Run Board V1.0.1 §5.

#### D16 · [GAP] [HIGH] · Set-wide — `HumanOutcomeFeedbackEvent` referenced but never schema-defined

`HumanOutcomeFeedbackEvent` appears in the Core §0.3.6 terminology table ("Feedback event | HumanOutcomeFeedbackEvent | ... yes") and is the natural target of the Teach-from-feedback flow and the ModuleActivationChat advise mode (§5 below), but no document defines its schema. Define it (probably in FD V1 or Common Contracts) before any feedback-producing surface depends on it. **Reference:** Core §0.3.6; FD V1 §2; V3.3.1 §14.3.

#### D17 · [RISK] [MEDIUM] · V3.3.1 §5.18 — Pattern C doubles per-turn evaluation latency with no budget

Pattern C wires a Judge downstream of every standalone Evaluator. In an iterative revision loop, that doubles the evaluation latency (and cost) per turn, but no budget governs the Pattern C Judge invocation separately from the Evaluator. A long revision loop silently doubles cost. Add a Pattern C invocation budget or a "score every N turns, not every turn" cadence option. **Reference:** V3.3.1 §5.18; §11.15.

#### D18 · [GAP] [MEDIUM] · Set-wide — Token budget is fragmented across four packets

Token budgets are declared independently on `TaskOpportunityPacket` (§3D), `TaskRunContextPacket` (Run Board §6.3), `BoardDigest` (§6.2), and the evaluation packets — but nothing reconciles them into a single per-activation budget. A module can receive a context packet + a board digest + a feedback bundle that jointly exceed its model's context window with no single authority checking the sum. Define a per-activation total-context budget that the packet assembler enforces across all packet sources. *(Overlaps C15.)* **Reference:** Core §3D; Run Board §6.2, §6.3.
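
A minimal sketch of a single authority checking the sum. The finding asks only that a total budget be enforced; how overflow is resolved is not specified, so the priority-ordered greedy fill below, the `PacketSourceSketch` shape, and the numeric `priority` field are all assumptions made for illustration.

```ts
// Hypothetical shape: one entry per packet source feeding an activation.
interface PacketSourceSketch {
  source: string;
  tokens: number;
  priority: number; // higher is kept first (assumed field)
}

function fitToActivationBudget(
  packets: PacketSourceSketch[],
  totalBudgetTokens: number
): { included: string[]; dropped: string[]; used: number } {
  const ordered = packets.slice().sort((a, b) => b.priority - a.priority);
  const included: string[] = [];
  const dropped: string[] = [];
  let used = 0;
  for (const p of ordered) {
    // Greedy fill: a packet that does not fit is dropped, but smaller
    // lower-priority packets after it may still be included.
    if (used + p.tokens <= totalBudgetTokens) {
      included.push(p.source);
      used += p.tokens;
    } else {
      dropped.push(p.source);
    }
  }
  return { included, dropped, used };
}
```

A real assembler might truncate or re-summarize an overflowing packet rather than drop it; the point of the sketch is only that the sum is checked in one place.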

#### D19 · [GAP] [MEDIUM] · Set-wide — Task Agent appears in two forum-shaped surfaces

The Task Agent participates both via the Forum (`task_agent_advisory` moderator mode, Run Board §4.3) and via its own assessment queue (Core §24.8) consuming `TaskProcessGapSignal`. These are two forum-shaped surfaces for the same agent with no stated relationship. Clarify whether they are one surface or two, and how a process-gap observation in one reaches the other. **Reference:** Run Board §4.3; Core §9.0.3, §24.8.

#### D20 · [GAP] [MEDIUM] · Forum V1 — Forum is run-scoped only; no task-scoped forum

The Forum and Run Board are scoped to a single run (`run_id` throughout). But coordination that should persist across runs of the same task (recurring research needs, standing user guidance, design discussions) has no home. A task-scoped forum (or a promotion path from run-forum to task-forum) is missing. **Reference:** Run Board V1.0.1 §1, §3.

#### D21 · [GAP] [MEDIUM] · V3.3.1 §13.1 — `cross_model_applicability = "requires_validation"` has no runtime behavior

`Pattern.cross_model_applicability` can be `"requires_validation"`, but no document specifies what the runtime does with that value — what validation, when, what happens while validation is pending, what gates on it. The enum value exists with no consuming behavior. Specify the validation procedure or mark the value Phase 2. *(Touches paused learning surface; flag-only.)* **Reference:** V3.3.1 §13.1.

#### D22 · [GAP] [MEDIUM] · V3.3.1 §15.8 — Sub-agent reputation portability is unspecified

`SubAgentReputation` accrues per `advisory_agent_id`, but whether reputation is scoped per-matter, per-firm, or global — and whether it ports when a sub-agent is promoted across scopes — is unspecified. A sub-agent trusted in one matter may be untested in another; reputation portability needs a scope rule mirroring the taint/pattern scope discipline. **Reference:** V3.3.1 §15.8; §16.

#### D23 · [BUG] [MEDIUM] · FD V1 — `ApplicabilityScope.authority_level` vs `domain_payload.authority_level` conflict

Two `authority_level` fields exist at different nesting levels (on the applicability scope and on the domain payload) with no rule for which governs when they disagree. A coding agent cannot determine the effective authority level. Define precedence (or merge to one field). **Reference:** FD V1.0.1 §3, §5.

#### D24 · [RISK] [MEDIUM] · FD V1 §6.3 — Multiple delivery branches can fire simultaneously with no cost/idempotency control

The `FeedbackRoutingPolicy` branches (§6.3) are not stated to be mutually exclusive; a single evaluation result can match multiple branches (e.g., `on_needs_revision` and `on_needs_more_sources`), firing multiple deliveries. No idempotency key or cost guard prevents duplicate or conflicting deliveries. Specify branch exclusivity (or an explicit multi-fire policy with idempotency keys). **Reference:** FD V1.0.1 §6.3.
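
A minimal sketch of the idempotency-key half of this recommendation, assuming a delivery is uniquely identified by the evaluation result, the branch that fired, and the target module. `DeliveryRequest`, `idempotencyKey`, and `DeliveryLedger` are hypothetical names, and the in-memory set stands in for whatever durable store the runtime would use.

```ts
// Hypothetical sketch: the key shape and the in-memory ledger are assumptions.
interface DeliveryRequest {
  evaluationResultId: string;
  branch: string; // e.g. "on_needs_revision"
  targetModuleId: string;
}

function idempotencyKey(req: DeliveryRequest): string {
  // JSON-encode the tuple so a delimiter inside an id cannot cause a collision.
  return JSON.stringify([req.evaluationResultId, req.branch, req.targetModuleId]);
}

class DeliveryLedger {
  private readonly seen = new Set<string>();

  // Returns true the first time a key is presented and false on any repeat,
  // so a retried or re-evaluated branch cannot deliver twice.
  tryClaim(req: DeliveryRequest): boolean {
    const key = idempotencyKey(req);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

This only suppresses duplicates of the same branch. Whether `on_needs_revision` and `on_needs_more_sources` may both fire for one result is still a policy decision the spec has to make.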

---

## 4. New user-facing surfaces proposed (S1–S6)

These are six surfaces the set needs but does not deliver. S1 and S2 are the highest-leverage. Schemas are starting points, not final.

#### S1 · WorkProductCertification — "the page you staple to the cover sheet" (highest-leverage)

**Purpose.** A single artifact a professional can attach to a finished work product that certifies what the system did and didn't verify. Today the assurance story is scattered across findings, verification records, judgment limitations, and assurance slices. S1 collapses them into one signed statement: what was checked, by what basis, what was NOT checked, what the user accepted over a warning, and what remains a judgment limitation.

**Schema (sketch).**

```ts
WorkProductCertification {
  certification_id: string
  task_id: string
  run_id: string
  artifact_ref: StorageRef
  artifact_version_ref: StorageRef

  verified_items: Array<{ description: string; basis: AssuranceBasis; verification_record_refs: StorageRef[] }>
  not_verified_items: Array<{ description: string; reason: "out_of_scope" | "no_capability" | "degraded" | "user_declined" }>
  judgment_limitations: JudgmentLimitationRecord[]        // §5.9
  user_overrides: Array<{ what: string; warning_shown: string; user_ref: UserRef; at: ISO8601 }>
  unresolved_hard_calls: string[]                          // should be empty for a clean cert
  taint_summary: { highest_uncleared: TaintClass; cleared_records: StorageRef[] }
  certified_at: ISO8601
  certified_by: UserRef | "system_auto"
  schema_version: 1
}
```

**Home.** Core R0.7.1 (new §20.X). Generated at run completion; surfaced in the §21 review surfaces and as a downloadable artifact.
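
As an illustration of how a consumer might read the certification, the sketch below derives a coarse status from three of the schema's fields. Only the rule that `unresolved_hard_calls` is empty for a clean cert comes from the schema comment. The three-way `clean` / `qualified` / `blocked` split, `CertificationSummary`, and `certificationStatus` are assumptions made for the example.

```ts
// Hypothetical sketch: a reduced view of WorkProductCertification (field subset only).
interface CertificationSummary {
  unresolved_hard_calls: string[];
  not_verified_items: Array<{ description: string; reason: string }>;
  user_overrides: Array<{ what: string }>;
}

type CertStatus = "clean" | "qualified" | "blocked";

function certificationStatus(cert: CertificationSummary): CertStatus {
  // Any unresolved Hard Call rules out a clean certification outright.
  if (cert.unresolved_hard_calls.length > 0) return "blocked";
  // Unverified items or user overrides do not block, but they must be disclosed.
  if (cert.not_verified_items.length > 0 || cert.user_overrides.length > 0) return "qualified";
  return "clean";
}
```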

#### S2 · FindingsInbox — cross-task review queue (highest-leverage)

**Purpose.** A litigator running many matters needs one place to triage every open finding across every active task, sorted by reliability and stakes — not one Evaluation Result Card per run. This is the "what needs my attention right now, everywhere" surface that A1's TaskHealthCard implies at the per-task level, lifted to the portfolio level.

**Schema (sketch).**

```ts
FindingsInboxView {
  generated_for: UserRef
  scope: "all_matters" | { matter_ids: string[] }
  entries: Array<{
    finding_ref: StorageRef
    task_id: string; run_id: string; matter_id?: string
    severity: "low" | "medium" | "high" | "blocking"
    state: FindingState
    historical_false_positive_rate_in_context?: number    // per A5
    produced_by_subagent_status?: "active" | "watch" | "quarantined"
    is_blocking_delivery: boolean
  }>
  sort: "reliability_desc" | "severity_desc" | "stakes_desc" | "recency"
  schema_version: 1
}
```

**Home.** Core R0.7.1 or DOC20. Honors matter firewall (C12) — cross-matter entries never leak privileged content.
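
One possible reading of the `severity_desc` sort, sketched under assumptions: severity is ranked `blocking > high > medium > low`, ties are broken by the lower historical false-positive rate, and entries with no recorded rate sort last within a tie. `InboxEntry` is a reduced, hypothetical stand-in for the entry shape above, with `finding_ref` simplified to a string.

```ts
// Hypothetical sketch: the tie-break rule is an assumption, not spec behavior.
type Severity = "low" | "medium" | "high" | "blocking";

interface InboxEntry {
  finding_ref: string; // stand-in for StorageRef
  severity: Severity;
  historical_false_positive_rate_in_context?: number;
  is_blocking_delivery: boolean;
}

const SEVERITY_RANK: Record<Severity, number> = { low: 0, medium: 1, high: 2, blocking: 3 };

function sortBySeverityDesc(entries: InboxEntry[]): InboxEntry[] {
  // Copy first so the caller's array is not reordered in place.
  return [...entries].sort((a, b) => {
    const bySeverity = SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity];
    if (bySeverity !== 0) return bySeverity;
    // A missing rate is treated as 1 (least reliable) for ordering purposes.
    const rateA = a.historical_false_positive_rate_in_context ?? 1;
    const rateB = b.historical_false_positive_rate_in_context ?? 1;
    return rateA - rateB;
  });
}
```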

#### S3 · RunDiff — compare two runs of the same task

**Purpose.** When a user forks a run (A4/S5) or re-runs a task, they need to see what differed between runs: which outcomes flipped verdict, which findings appeared/disappeared, cost delta, which modules behaved differently. Today there is no run-to-run comparison.
**Home.** Core R0.7.1; consumes run records + EvaluationResultEnvelopes from both runs. Pairs naturally with TaskRunFork (§5).

#### S4 · DecisionAuditView — the "why did it decide that" surface

**Purpose.** ExplanationTrace (§7.10) is per-plan. S4 is the run-level audit view that threads every decision point — strategy selections, escalations, sub-agent consultations, Hard Call resolutions, taint clearances — into one auditable timeline a supervising attorney can walk. This is what makes the system defensible, not just usable.
**Home.** V3.3.1 §21.X, consuming ExplanationTrace decision_points + HardCallResolution ledger + TaintClearanceRecords.

#### S5 · RunReplayPreview — see what a replay would produce before committing

**Purpose.** Bound to D11 (TaskReplay). Before re-running or forking, show the user a dry-run preview: which steps would re-execute, which would reuse cached results, what would diverge given current state vs the recorded watermark.
**Home.** V3.3.1 §15.X; consumes `execution_watermark_ref` and EvaluationSnapshots.

#### S6 · LongitudinalPatternView — pattern behavior over time

**Purpose.** Extends §21.8's per-pattern card into a longitudinal view: how a pattern's convergence/regression/rollback counts have moved across many uses, in this context vs globally. Surfaces the D4 monotonicity problem visibly (a pattern that stopped working but never lost count shows a flat line a human can question).
**Home.** V3.3.1 §21.8 extension. *(Touches paused learning surface; build the view, defer the learning mechanics.)*

---

## 5. Fork / chat-in-session / re-prompt unified design

This section consolidates the design work that grew out of A4. Three capabilities turned out to share one mechanism (session continuation), and the decision was to ship them as one three-section addendum rather than as edits scattered across DOC23 R3.2.

### 5.1 TaskRunFork (Core R0.7.1)

A first-class fork primitive keyed to any module activation, not just plan-internal checkpoints. The fork point is `(module_id, activation_seq)`. The decision after the original review: forking copies workspace refs, but **side effects cannot branch** — an email already sent is sent in every branch. So the schema carries an explicit irrevocable-side-effects record at fork time.

```ts
TaskRunFork {
  fork_id: string
  parent_run_id: string
  fork_point: { module_id: string; activation_seq: number }
  fork_disposition: "experimental" | "alternate_path" | "recovery"
  copied_workspace_refs: string[]                  // persistent SourceWorkspace copied; ephemeral RunWorkspace re-derived
  divergence_inputs?: { user_directive?: string }  // what the user changed at the fork
  irrevocable_side_effects_at_fork: Array<{         // present in parent before fork point; cannot be undone in the child
    side_effect_kind: "external_send" | "external_tool_call" | "durable_promotion" | "candidate_accepted"
    receipt_ref: StorageRef
  }>
  created_at: ISO8601
  schema_version: 1
}
```

**User gesture.** Right-click a Run Board event → "fork from here." **Pairs with** S3 (RunDiff) and S5 (RunReplayPreview).
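
A minimal sketch of how `irrevocable_side_effects_at_fork` could be populated, assuming the parent run keeps a side-effect log in which each entry carries a run-wide ordinal. That ordinal (`run_position`), `RecordedSideEffect`, and `irrevocableAtFork` are hypothetical; the schema itself says only that these effects are present in the parent before the fork point.

```ts
// Hypothetical sketch: run_position is an assumed run-wide ordinal, not a spec field.
type SideEffectKind =
  | "external_send"
  | "external_tool_call"
  | "durable_promotion"
  | "candidate_accepted";

interface RecordedSideEffect {
  side_effect_kind: SideEffectKind;
  receipt_ref: string; // stand-in for StorageRef
  run_position: number;
}

// Effects committed strictly before the fork position are carried into the
// child as irrevocable; later ones belong only to the parent's continuation.
function irrevocableAtFork(
  log: RecordedSideEffect[],
  forkPosition: number,
): Array<{ side_effect_kind: SideEffectKind; receipt_ref: string }> {
  return log
    .filter((e) => e.run_position < forkPosition)
    .map(({ side_effect_kind, receipt_ref }) => ({ side_effect_kind, receipt_ref }));
}
```

Whether an effect produced by the fork-point activation itself counts as "before" is one of the details the fork spec would need to pin down; the sketch treats it as not carried.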

### 5.2 ModuleActivationChat (Core R0.7.1, new §4.4)

A chat attached to any module activation, with three modes. Chat is the *input mechanism*; fork and feedback are the two *destinations*.

- **inspect (DEFAULT).** Read-only conversation against a frozen snapshot of the activation. No state change. "Why did you flag this?" / "What sources did you use here?"

- **advise.** Produces a `HumanOutcomeFeedbackEvent` (D16 — needs its schema defined) that flows into the existing Feedback Interpreter pipeline (V3.3.1 §14.3). This is teach-from-feedback reached conversationally.

- **branch.** Opens a `TaskRunFork` (§5.1) from this activation, carrying the chat's `user_directive` as `divergence_inputs`.

**Cross-doc obligations:** DOC23 R3.1 supplies `TaskModuleSessionRef` and the port mechanics; DOC11/OpenClaw supplies the read-only session handle; DOC15 renders chat context (in branch mode only); DOC20 supplies the three-mode toggle UI.
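
The three modes map naturally onto a discriminated union. The sketch below assumes a chat turn carries an optional mode and that each mode yields exactly one kind of outcome; `ChatTurn`, `ChatOutcome`, and `routeChatTurn` are hypothetical names, and the `HumanOutcomeFeedbackEvent` payload is reduced to a bare string because its schema is not yet defined (D16).

```ts
// Hypothetical sketch: outcome shapes are illustrative, not spec schemas.
type ChatMode = "inspect" | "advise" | "branch";

interface ChatTurn {
  module_id: string;
  activation_seq: number;
  mode?: ChatMode; // omitted means inspect, the default
  message: string;
}

type ChatOutcome =
  | { kind: "no_state_change" }
  | { kind: "feedback_event"; event_type: "HumanOutcomeFeedbackEvent"; text: string }
  | {
      kind: "fork_request";
      fork_point: { module_id: string; activation_seq: number };
      divergence_inputs: { user_directive: string };
    };

function routeChatTurn(turn: ChatTurn): ChatOutcome {
  const mode: ChatMode = turn.mode ?? "inspect";
  switch (mode) {
    case "inspect":
      // Read-only conversation against a frozen snapshot: nothing is written.
      return { kind: "no_state_change" };
    case "advise":
      return { kind: "feedback_event", event_type: "HumanOutcomeFeedbackEvent", text: turn.message };
    case "branch":
      return {
        kind: "fork_request",
        fork_point: { module_id: turn.module_id, activation_seq: turn.activation_seq },
        divergence_inputs: { user_directive: turn.message },
      };
  }
}
```

Making `inspect` the fallback keeps the default path free of state changes, which matches the intent that chat is only an input mechanism until the user picks a destination.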

### 5.3 Re-prompts — sibling of chat-in-session

Re-prompts are **pre-scripted** prompts (`re_prompts: string[]`, defined at design time) that fire on module kinds such as `agent_task`, `coding`, `agent_review_gate`, and `red_team`. They are a *sibling* of chat-in-session, not the same thing: both extend the OpenClaw/Gateway session and share session-continuation mechanics, but re-prompts are author-time and deterministic while chat is user-time and interactive.

### 5.4 Packaging decision (settled)

Rejected: separate edits touching DOC23 R3.2 (too costly). **Agreed:** one standalone addendum, three sections —

- **§1 Re-Prompt System** (re_prompts mechanics, applicability table, DOC23 compilation points)

- **§2 Module Activation Chat + TaskRunFork** (the §5.1/§5.2 design)

- **§3 Shared Module Session Continuation Mechanics** (references Addenda A §A9; portable; fold into DOC23 R3.2 later only if another feature needs it)

This addendum is the **next build** after the current review round closes. Session continuity itself lives in Addenda A §A9.

---

## 6. Evaluation of the four external reviewers

Will routed the red-team prompt (and, separately, purpose-questions) to ChatGPT, Grok, Gemini, and Claude in various configurations. Below: what those reviews surfaced that changed or extended Claude's thinking (§6.1, all adopted), the proposals deliberately *not* adopted with the reason for each (§6.2), and a flat list of the reviewer-surfaced additions being adopted (§6.3). No priority ordering — per the disposition rule in §1, everything in §6.1 and §6.3 is to be done; §6.2 is the only "not doing this" set.

### 6.1 What the other reviewers found that Claude rates high-value (and largely missed)

- **Pattern C envelope FIELD mismatch (highest-value catch).** A reviewer found that in Pattern C, the Judge reads `evaluated_target` and `evaluation_basis`, but those fields are NOT on `EvaluationResultEnvelope` — they live on `EvaluationFeedbackBundle`. This is distinct from Claude's B3 (chain-ID *lifecycle*); it's a field-location bug that breaks Pattern C wiring directly. **Adopted as a fix; it sits with B1 and B3 as a contract-consolidation item (§7).**
- **Memory Hydration phase missing.** No formal pre-run memory-read phase is specified. The system writes signals and patterns prolifically but never specifies when/how a run *reads* memory before executing. A genuine architectural absence (purpose-audit class, not wiring).
- **Memory Precedence Hierarchy undefined.** When local intent, matter policy, and global DOC72 patterns conflict, nothing says which wins. Proposed order: Local Intent > Matter Policy > Global DOC72.
- **Sub-Agent Amnesia / SubAgentPrior.** Sub-agents are invoked without a specified mechanism for carrying prior context (a `SubAgentPrior` injection). Each invocation starts cold.
- **Formula Registry (subsumes ~15 findings).** Scores throughout the set (compiler_confidence_score per D13, quality_index, reputation, goal_advancement) are bare numbers with no specified computation. A single Formula Registry — one place that defines how every numeric score is computed — subsumes D13 and roughly a dozen scattered "this number has no formula" findings. This is the highest-leverage *structural* fix the other reviewers surfaced.
- **CalibratedScore type.** A score type that carries its own confidence interval and calibration metadata, so a 0.7 from one scorer is comparable to a 0.7 from another. Pairs with the Formula Registry.
- **Flawless-Execution Denominator.** Learning fires only on failure; there is no positive-reinforcement signal when a run executes flawlessly. The denominator is missing, so the system can't distinguish "never failed" from "never ran." (Relates to Claude's B6 phantom `EvaluationAffirmation` — the positive counterpart to a finding.)
- **Other named gaps worth absorbing:** `HardCallBlockingScope` (what exactly a blocking Hard Call blocks — the step, the outcome, the whole run); `TaskCancelProtocol` (clean cancel semantics mid-run — relates to scenario C1(e)); `ResearchNeedLease` (so two modules don't both satisfy one need — D15/scenario C1(a)); `EvaluationContractReview` (a pre-execution check that the evaluation plan is coherent before spending); `EvidencePackage` (bundling evidence for a finding into one reviewable unit); `TaskBlueprint` Topology/Payload bifurcation (separating graph shape from graph content); `KnownGoodState` (a named, restorable checkpoint).
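
The proposed memory precedence order (Local Intent > Matter Policy > Global DOC72) can be sketched as a first-match lookup over an ordered list of layers. The layer labels, `MemoryDirective`, and `resolveDirective` are hypothetical; the sketch assumes conflicts are keyed by a simple string and that a higher layer fully overrides a lower one, with no merging.

```ts
// Hypothetical sketch: layer labels and directive shape are assumptions.
type MemoryLayer = "local_intent" | "matter_policy" | "global_doc72";

// Highest precedence first.
const PRECEDENCE: readonly MemoryLayer[] = ["local_intent", "matter_policy", "global_doc72"];

interface MemoryDirective<T> {
  layer: MemoryLayer;
  key: string;
  value: T;
}

// Returns the directive from the highest-precedence layer that defines `key`,
// or undefined when no layer has an opinion.
function resolveDirective<T>(
  directives: MemoryDirective<T>[],
  key: string,
): MemoryDirective<T> | undefined {
  for (const layer of PRECEDENCE) {
    const hit = directives.find((d) => d.layer === layer && d.key === key);
    if (hit !== undefined) return hit;
  }
  return undefined;
}
```

Returning the whole directive rather than just the value keeps the winning layer visible, which would let an audit surface show why a given setting applied.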

### 6.2 Proposed mechanisms not adopted (the declined set, with reasons)

These are the only items in this document **not** being acted on. Each was genuinely proposed (by a reviewer or by Claude) and is declined for a stated reason. In every case the *underlying need* is real and is met another way — so "declined" means "not this mechanism," not "ignore the problem."

- **Literal Git-style branching / copy-on-write workspace with reversible operations (Grok's ShadowWorkspace, taken literally).** *Declined* because side effects cannot branch — an email already sent, an external tool call already made, a durable promotion already committed cannot be "un-sent" in a child branch, and a copy-on-write model that implies they can is a phantom (the exact failure class Will guards against). *Adopted instead:* the branching *concept* via `TaskRunFork` (§5.1) plus the explicit `irrevocable_side_effects_at_fork` record, which makes the un-undoable visible at the fork point rather than pretending it's reversible.
- **`TaskConfirmationSignal` as a new signal type.** *Declined* because it duplicates existing signal/receipt machinery; the need (recording confirmation) is already served by existing signals and receipts. Adding a new type widens the surface for no benefit. *Adopted instead:* nothing new — fold into existing signal/receipt records.
- **Flawless-execution as a new signal type.** *Declined* for the same redundancy reason: `OutcomeEvaluationSignal` already carries the verdict, so a separate "it went fine" signal is unnecessary. *Adopted instead:* the real problem (the missing positive-reinforcement denominator — the system can't tell "never failed" from "never ran") is fixed by making the existing `OutcomeEvaluationSignal` verdict-aware and counting clean passes in it. (See §6.3.)
- **Chunking findings to manage KV-cache bloat.** *Declined* because chunking fragments a unit the reviewer needs whole; splitting a findings set across cache boundaries trades one problem (size) for a worse one (loss of coherence at review time). *Adopted instead:* render a *compressed envelope view* at prompt-assembly time — keep the finding set whole, compress the rendering.
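
To make the denominator point concrete, the sketch below tallies signals per subject and keeps "never ran" distinct from a failure rate of zero. It assumes a verdict-aware signal reduced to two verdict values; `OutcomeSignal`, `subject_id`, `tallySignals`, and `failureRate` are hypothetical and do not reflect the actual `OutcomeEvaluationSignal` fields.

```ts
// Hypothetical sketch: signal fields and verdict values are assumptions.
type Verdict = "pass" | "fail";

interface OutcomeSignal {
  subject_id: string; // e.g. the pattern or module the signal is about
  verdict: Verdict;
}

interface Tally {
  runs: number; // the denominator: every evaluated run, clean or not
  failures: number;
}

function tallySignals(signals: OutcomeSignal[]): Map<string, Tally> {
  const tallies = new Map<string, Tally>();
  for (const s of signals) {
    const t = tallies.get(s.subject_id) ?? { runs: 0, failures: 0 };
    t.runs += 1;
    if (s.verdict === "fail") t.failures += 1;
    tallies.set(s.subject_id, t);
  }
  return tallies;
}

// "never_ran" and a failure rate of 0 are different answers; without
// counting clean passes the two cannot be told apart.
function failureRate(tallies: Map<string, Tally>, subjectId: string): number | "never_ran" {
  const t = tallies.get(subjectId);
  if (t === undefined) return "never_ran";
  return t.failures / t.runs;
}
```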

### 6.3 Reviewer-surfaced additions to adopt (no ordering)

Every item below is adopted. They are listed flat, not ranked — pick them up in whatever sequence the work naturally takes (several cluster into the §7 hardening pass; the learning-touching ones carry the Phase B dependency noted in §7). The Formula Registry is called out only because it is the one item that *subsumes* others, not because it ranks above them.

- **Pattern C envelope field fix** — the `evaluated_target` / `evaluation_basis` field-location bug (§6.1); sits with B1/B3.
- **Formula Registry** — one place defining how every numeric score is computed; subsumes D13 (`compiler_confidence_score`) and the dozen-plus scattered "this number has no formula" gaps.
- **Memory Hydration phase** — a formal pre-run memory-read phase.
- **Memory Precedence Hierarchy** — Local Intent > Matter Policy > Global DOC72 when they conflict.
- **SubAgentPrior injection** — a mechanism for sub-agents to carry prior context instead of starting cold.
- **Flawless-execution denominator** — via the enhanced (verdict-aware) `OutcomeEvaluationSignal`, not a new signal type (per §6.2).
- **HardCallBlockingScope** — define what a blocking Hard Call actually blocks (step / outcome / run).
- **TaskCancelProtocol** — clean cancel semantics mid-run (relates to scenario C1(e)).
- **ResearchNeedLease** — so two modules don't both satisfy one need (closes D15 / scenario C1(a)).
- **EvaluationContractReview** — a pre-execution check that the evaluation plan is coherent before spending.
- **EvidencePackage** — bundle the evidence for a finding into one reviewable unit.
- **CalibratedScore type** — a score that carries its own confidence interval and calibration metadata (pairs with the Formula Registry).
- **RunGuidanceItem persistence** — persist run-guidance items rather than letting them evaporate.
- **Forum deadlock breaker** — a defined mechanism to break a stalled forum (relates to D3, D20).
- **TaskBlueprint Topology/Payload bifurcation** — separate graph shape from graph content.
- **Regenerate `previous_attempt_hash`** — so regeneration can detect and reference the prior attempt.
- **Chaos test fixtures** — fixtures that inject the failure conditions the purpose-questions surface (storage-full, malformed LLM output, mid-run privilege change).
- **KnownGoodState** — a named, restorable checkpoint.
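
A possible shape for `CalibratedScore`, sketched under assumptions: scores live in [0, 1], the interval brackets the point estimate, and each score names the Formula Registry entry that produced it. The field names, `makeCalibratedScore`, and `clearlyHigher` are hypothetical; the specs do not define this type yet.

```ts
// Hypothetical sketch: field names are illustrative only.
interface CalibratedScore {
  value: number; // point estimate in [0, 1]
  interval: readonly [number, number]; // [lower, upper] bound around value
  formula_id: string; // which Formula Registry entry produced it
  calibration_sample_size: number; // labeled cases the calibration rests on
}

function makeCalibratedScore(
  value: number,
  interval: readonly [number, number],
  formulaId: string,
  sampleSize: number,
): CalibratedScore {
  const [lower, upper] = interval;
  if (!(0 <= lower && lower <= value && value <= upper && upper <= 1)) {
    throw new RangeError("expected 0 <= lower <= value <= upper <= 1");
  }
  if (!Number.isInteger(sampleSize) || sampleSize < 0) {
    throw new RangeError("sample size must be a non-negative integer");
  }
  return { value, interval, formula_id: formulaId, calibration_sample_size: sampleSize };
}

// True only when the intervals do not overlap, so the ordering does not
// rest on the point estimates alone.
function clearlyHigher(a: CalibratedScore, b: CalibratedScore): boolean {
  return a.interval[0] > b.interval[1];
}
```

Under this sketch a 0.9 with interval [0.85, 0.95] is not "clearly higher" than a 0.7 with interval [0.4, 0.9], because the second score's upper bound reaches past the first score's lower bound.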

### 6.4 Note on the Gemini memory review

Gemini's strong memory-system review arose from answering a *purpose* question ("does this unify the memory system?"), not the standard completeness-audit prompt. The lesson, captured in §8: the standard prompt is a **completeness audit** (finds defects in what's specified); purpose-questions are a **purpose audit** (find architectural absences like Memory Hydration). Both are needed; they find different classes of problem.

---

## 7. Recommended R0.4 — "Math + Contract Hardening Pass"

This is not a priority subset — it is an *execution grouping*. The contract, enum, routing, math, and hashing fixes below are tightly coupled (they all touch the shared-schema layer), so doing them as one coherent pass is more efficient than scattering them. Everything else in this document — the conceptual findings (A-series), the new surfaces (§4), the fork/chat/re-prompt addendum (§5), and the reviewer-surfaced additions (§6.3) — is equally to be done, in its own workstream. Grouping these here says "do these together," not "do these instead of those." Scope of the pass:

1. **Contract consolidation.** Make Common Contracts V1.1.x the schema-of-record for all shared types. Resolve B1 (one `EvaluationFinding`), fix B5/B6 phantom and misattributed references, fix the Pattern C envelope field location (§6.1).

2. **Enum and mapping closure.** Fix B2 (`OutcomeEvaluationState` count + `evaluating` mapping), C5 (`LearningMode` enumeration), D8 (chain-id naming).

3. **Routing completeness.** B4 (`on_indeterminate`/`on_not_applicable`/`on_unrecoverable`), C1 (Pattern C route resolution policy), D24 (branch exclusivity + idempotency).

4. **Cascade safety.** B7 (cascade depth + cycle detection), B8 (consolidate upstream-failure rule + race), C4 (dependency cycles), D2 (success-condition-5 quiescence).

5. **The Formula Registry.** One document/section defining how every numeric score is computed (subsumes D13 and the scattered score-formula gaps); introduce CalibratedScore.

6. **Hash/normalization + concurrency.** D14 (HardCallResolution hash normalization), D15 (Source Workspace/Forum concurrency model), C7 (structured-anchor non-empty), C8 (source_kind→taint mapping).

7. **Define the named-but-missing schemas.** D16 (`HumanOutcomeFeedbackEvent`), and decide on `EvaluationAffirmation` (define or remove).

8. **Pre-absorption hygiene.** D5 (convert section-number cross-refs to stable anchors before DOC23 R3.2 absorption).

**Dependency note (not a deprioritization).** The learning-engine work and the learning-surface findings (D4, D21, A6, S6, and the learning side of the flawless-execution denominator) carry a hard sequencing dependency: they are **gated on the Phase B corpus audit**, because the audit is what surfaces their own requirements; writing them ahead of it means guessing at the spec they're supposed to implement. This is a "can't be done yet," not a "do later by choice." The hardening pass above is the contract/math layer and has no such dependency, so it can proceed now.
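
For the cascade-safety item (B7, C4), cycle detection over outcome dependencies is a standard depth-first search. The sketch below assumes dependencies are available as an adjacency map from an outcome id to the ids it depends on; `Graph` and `findCycle` are hypothetical names, and the spec would still need to say what the runtime does once a cycle is found.

```ts
// Hypothetical sketch: the adjacency-map representation is an assumption.
type Graph = Record<string, string[]>; // outcome id -> ids it depends on

// Returns one cycle as a path that starts and ends on the same node,
// or null when the graph is acyclic.
function findCycle(graph: Graph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  const visit = (node: string): string[] | null => {
    const s = state.get(node);
    if (s === "done") return null;
    if (s === "visiting") {
      // The node is already on the current path: the path from it closes a cycle.
      return [...stack.slice(stack.indexOf(node)), node];
    }
    state.set(node, "visiting");
    stack.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle !== null) return cycle;
    }
    stack.pop();
    state.set(node, "done");
    return null;
  };

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle !== null) return cycle;
  }
  return null;
}
```

Because the search is recursive, the depth of `stack` at any point is also the current cascade depth, so a depth limit could be enforced in the same traversal.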

---

## 8. Appendix — Reusable purpose-question library

The standard red-team prompt is a **completeness audit**: it finds defects in what is specified. These questions are a **purpose audit**: each one, run in a fresh window, finds architectural *absences* the completeness audit structurally cannot (Memory Hydration was found this way). Run one question per fresh window — one-question focus is what produces the depth; bundling several into one window dilutes it.

Domain-specific phrases are marked `{like this}` as fill-in slots so the set is portable across the whole spec suite (DOC72, DOC24, etc.). Swap three tokens and the kit works for any spec.
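
If the library is stored as templates, slot filling can be a one-line substitution. The sketch assumes each slot is keyed by its default phrase (the text between the braces) and that unmatched slots are left untouched; `fillTemplate` is a hypothetical helper, not part of any spec.

```ts
// Hypothetical sketch: slots are keyed by the default phrase inside the braces.
function fillTemplate(template: string, slots: Record<string, string>): string {
  return template.replace(/\{([^{}]+)\}/g, (match: string, name: string) => {
    // Leave the slot as written when no replacement is supplied.
    return Object.prototype.hasOwnProperty.call(slots, name) ? slots[name] : match;
  });
}
```

For example, filling `"Read {the SPEC} as {opposing counsel}."` with `{ "the SPEC": "DOC72" }` replaces the first slot and leaves `{opposing counsel}` in place for the next pass.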

**Set A — purpose / reliance**

- **A1 (reliance).** "You are reviewing `{the SPEC}`. Answer one question in depth: can `{a securities litigator}` rely on a work product this system produces without re-checking the underlying sources and reasoning themselves? If yes, walk through exactly what they'd rely on and why it's sufficient. If no, identify precisely what's missing before reliance is justified. Don't list general bugs — answer the reliance question and let the gaps fall out of it."

- **A2 (inventability).** "Reading `{the SPEC}` as the AI coding agent that must build it: where would you be forced to invent behavior the spec doesn't determine? List each place you'd have to choose between unstated alternatives, and state exactly what specification would remove the guess."

- **A3 (determinism / replayability).** "For `{the SPEC}`: if you had to reproduce a past run exactly — same inputs, same recorded state — what would and wouldn't reproduce deterministically? Inventory every source of nondeterminism (model versions, hashes, ordering, timing) and say where the spec is silent on pinning it."

- **A4 (degradation honesty).** "Reviewing `{the SPEC}`: when this system can't do something — a source is unavailable, a sub-agent is down, a budget is exhausted, a required contract is missing — does it tell the user the truth about what it couldn't do, or can it return something that looks complete while being silently degraded? Find every place a partial or degraded result could be presented as a clean one, and say what's missing that would force the degradation to surface."

**Set B — persona shifts**

- **B1 (opposing-counsel adversarial audit).** "Read `{the SPEC}` as `{opposing counsel}` trying to discredit a work product it produced. What about how this system operates could you attack — gaps in provenance, unverifiable claims, decisions with no recorded rationale? Everything you could attack is a finding."

- **B2 (inheriting operator).** "Read `{the SPEC}` as a colleague handed someone else's `{task run}` that's halfway finished — the original person is unavailable, and you must understand where it is and carry it forward. What can and can't you determine from the system's recorded state: what's been done, what's pending, what decisions were made and why, what's safe to touch? Everything you can't determine is the finding."

- **B3 (partner review).** "Read `{the SPEC}` as the `{supervising partner}` who must sign off. What would you need to see to take responsibility for this output, and what does the system fail to give you?"

- **B4 (scale).** "Read `{the SPEC}` as the user running this across `{forty active legal matters}` at once, each with its own `{tasks, sources, forums, findings}`. What breaks or degrades at that volume that works fine for one? Be specific about where attention becomes unmanageable, where `{cross-matter}` isolation could fail, where shared resources contend, and where cost or latency stops being acceptable. Name the mechanisms that quietly assume a small number of concurrent `{matters}` without saying so."

**Set C — scenario / probe**

- **C1 ("what happens when").** "For `{the SPEC}`, answer 'what happens when…' for each, and where the spec doesn't determine the answer, say so: (a) `{two modules try to satisfy the same need at once}`; (b) `{a privilege/permission level changes mid-run}`; (c) `{an LLM returns malformed output on a load-bearing call}`; (d) `{storage fills during a durable write}`; (e) `{the user cancels mid-operation after an irreversible side effect}`."

- **C2 (staleness / temporal).** "Walk `{the SPEC}` and answer one question: what goes stale, and what happens when it does? Inventory everything with a freshness dependency — `{caches, context packets, snapshots, resolved decisions, applied patterns, hash preconditions}`. For each, state how the system knows it's stale, what invalidates it, and what happens if a consumer uses it after it's gone stale. Where the spec is silent on invalidation, that silence is the finding."

- **C3 (reversibility / irreversible actions).** "For `{the SPEC}`, answer one question: what are all the irreversible actions this system can take — anything that leaves the system, mutates shared state others depend on, or can't be cleanly undone? For each, describe what protects the user from triggering it by mistake, and whether they can recover if it was wrong. The ones with no guard and no recovery path are the findings."

- **C4 (load-bearing assumptions).** "Name the load-bearing assumptions in `{the SPEC}` that are never stated as constraints — things the design quietly depends on that aren't written down anywhere as requirements. For each, describe what breaks if the assumption is false, and whether the spec gives any signal that the assumption exists."

### Reviewer-assignment grid (for Addenda B, current round)

Soft assignment; for findings it matters less who gets what than that each is answered in a fresh window. The low-friction way to get a fresh window without re-attaching documents is a **new chat inside the project** (docs already attached via project knowledge; no prior-conversation anchoring). Reusing an old window anchors the reviewer to its earlier completeness-audit conclusions and partly defeats the purpose-audit framing. Round 1 first; Round 2 if it pays off.

| Round | Question                    | Reviewer | Hunts for                              |
| ----- | --------------------------- | -------- | -------------------------------------- |
| 1     | C2 staleness                | Grok     | invalidation gaps, temporal coupling   |
| 1     | C3 reversibility            | ChatGPT  | unguarded irreversible side effects    |
| 1     | A4 degradation honesty      | Gemini   | silent-degradation paths               |
| 1     | B4 scale (40 matters)       | Claude   | cross-matter isolation, contention     |
| 2     | A1 reliance                 | Gemini   | architectural absences before reliance |
| 2     | C1 scenarios                | Grok     | concrete undetermined failure modes    |
| 2     | B2 inheriting operator      | Claude   | continuity / mid-run legibility        |
| 2     | C4 load-bearing assumptions | ChatGPT  | unstated requirements                  |

Dropped from this round as overlapping the completeness audit: A2, A3, B1, B3.

---

*End of consolidated review. Next inputs expected: external reviewer responses to the Round 1 / Round 2 purpose questions, to be folded into §3 (new findings), §6.1 (cross-reviewer catches), and §6.3 (additions to adopt) — with any genuinely not-worth-it proposals added to §6.2 with their reason.*





# ChatGPT

Below is a single consolidated red-team response to the V2 prompt, incorporating the first-pass critique, the deeper math/function review, the follow-up hardening pass, and the Claude self-learning analysis only where it affects the Addenda B task-system surface.

------

# Red Team Review — DOC23 Task System + Addenda B Document Set V2

## Review posture

This review treats the six Addenda B documents as an implementation contract for AI coding agents, not as design notes. The prompt’s bar is explicit: ELNOR’s specs are the product at this stage, and defects in the spec become defects in the system. The requested review has three simultaneous obligations: conceptual critique, seam review, and exhaustive defect hunting.

That standard matters because V3.3.1 itself forbids implementers from filling gaps with “reasonable defaults”: unspecified mechanisms must block, no-op with typed failure, or enter an explicitly named degraded path. It also requires schema fidelity, state-machine fidelity, error-taxonomy fidelity, and executable invariants.

The conclusion is blunt:

**The architecture is directionally excellent, but the six-document set is not yet build-ready.** It is strongest at the level of architectural shape — typed plans, deterministic dispatch, candidate versions, receipts, graph-native wiring, no hidden agent orchestration. It is weakest at the contract/math layer: duplicated schemas, route ambiguity, non-total formulas, missing state matrices, missing packet receipts, and underspecified recovery semantics.

The next draft should not focus only on a “top five.” It should be a full **Contract + Math + Runtime + Professional Reliance Hardening Pass**.

------

# Findings

## [IDEA] [CRITICAL] [System-wide / Conceptual] — Add a final professional reliance artifact

**Finding:** The current set has evaluation envelopes, feedback bundles, source workspaces, run-board posts, candidate versions, operation receipts, and learning signals, but no single final artifact that answers: “Can a professional rely on this output, and why?”

**Why it matters:** High-stakes professionals do not rely on scattered internal state. They need an integrated reliance packet showing final artifacts, what was checked, what sources support it, what changed, what remains uncertain, what humans approved, and what failed or was skipped.

**Recommendation:** Add `TaskReliancePacket` as a final task-run artifact. It should aggregate final artifact refs, evaluation chains, assurance basis, unresolved limitations, source evidence package, revision review packets, HardCall resolutions, policy decisions, budget narrative, known-good states, and reliance status.

**Reference:** Conceptual reliability bar in prompt; Common Contracts `EvaluationResultEnvelope`; Feedback Delivery `EvaluationFeedbackBundle`; Task Forum run-board and context packet surfaces.

------

## [IDEA] [CRITICAL] [System-wide / Conceptual] — Add pre-execution Evaluation Contract Review

**Finding:** Outcome Compiler inference is powerful, but the inferred plan is not treated as a reviewable contract. The user states an outcome, and the Compiler infers lanes, thresholds, source needs, assurance basis, and hard-call triggers.

**Why it matters:** If the Compiler misunderstands the outcome, the system optimizes against the wrong target while appearing disciplined. A false green check is worse than an obvious failure.

**Recommendation:** Add `EvaluationContractReview` before execution. For high-stakes tasks, low compiler confidence, material preview/resolved drift, or hard-call-sensitive work, require explicit approval or policy-approved waiver.

**Reference:** V3.3.1 Outcome Compiler and CompiledEvaluationPlan preview design; prompt’s professional reliance requirement.

------

## [IDEA] [HIGH] [System-wide / Reviewability] — Add RevisionReviewPacket for every meaning-bearing revision

**Finding:** SemanticChangelog exists, especially around regenerate, but professional review needs more than a changelog. A user needs to see what changed, which findings were addressed, whether sources changed, whether preservation constraints passed, and what new regression risk was introduced.

**Why it matters:** A revision can locally satisfy one failed criterion while globally degrading the artifact.

**Recommendation:** Add `RevisionReviewPacket` for every meaning-bearing revision, not only regenerate. It should include semantic diff, finding-to-change map, source changes, preservation checks, revalidation results, and regression-risk summary.

**Reference:** V3.3.1 regeneration changelog fixture and revision-cycle mechanics.

------

## [IDEA] [HIGH] [System-wide / Professional UX] — Add an AttentionLedger / DecisionQueue

**Finding:** The system creates many user-actionable objects: HardCalls, feedback bundles, research needs, forum posts, repair instructions, source warnings, Task Agent proposals, policy gates, and run guidance. They are individually useful but not normalized into one user attention model.

**Why it matters:** Professionals should not have to become operators of the machine. Too many scattered surfaces cause alert fatigue; too few cause hidden automation risk.

**Recommendation:** Add `AttentionLedger` with materiality tiers: blocks reliance, blocks execution, needs approval, advisory, audit-only. Every user-visible item should say why it is being shown now and what happens if ignored.

**Reference:** Feedback Delivery routes findings and repair instructions; Task Forum and Run Board publish task-run posts and context packets.
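
As a rough illustration of the materiality tiers, the sketch below normalizes heterogeneous user-actionable items into one ordered ledger. The tier names come from the recommendation; the `AttentionItem` fields, the `TIER_RANK` ordering, and both helper functions are assumptions made for this example, not part of any reviewed spec.

```typescript
// Hypothetical sketch: only the five tier names come from the recommendation.
type MaterialityTier =
  | "blocks_reliance"
  | "blocks_execution"
  | "needs_approval"
  | "advisory"
  | "audit_only";

// Lower rank = more urgent. Assumed ordering, taken from the order in which
// the recommendation lists the tiers.
const TIER_RANK: Record<MaterialityTier, number> = {
  blocks_reliance: 0,
  blocks_execution: 1,
  needs_approval: 2,
  advisory: 3,
  audit_only: 4,
};

interface AttentionItem {
  item_id: string;
  source_kind: string; // e.g. "hard_call" or "source_warning" (illustrative values)
  tier: MaterialityTier;
  why_shown_now: string;
  consequence_if_ignored: string;
}

// Returns a new array with the most material items first.
function orderLedger(items: AttentionItem[]): AttentionItem[] {
  return items.slice().sort((a, b) => TIER_RANK[a.tier] - TIER_RANK[b.tier]);
}

// Assumption: only the top three tiers demand a user action.
function requiresUserAction(item: AttentionItem): boolean {
  return TIER_RANK[item.tier] <= TIER_RANK.needs_approval;
}
```

Requiring `why_shown_now` and `consequence_if_ignored` on every item is one way to make the "why now, what if ignored" rule a schema obligation rather than a UI convention.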

------

## [IDEA] [HIGH] [System-wide / Recovery] — Productize KnownGoodState

**Finding:** The specs include candidates, rollback, checkpoints, and immutable snapshots, but the user-facing recovery concept is still too internal.

**Why it matters:** Professionals think in terms of “return to the last safe draft,” “fork before source update,” or “compare current to accepted v2,” not candidate-state internals.

**Recommendation:** Add `KnownGoodState` with restore/fork/compare actions, accepted artifact refs, source snapshot refs, evaluation snapshot refs, reason-safe summary, and user acceptance/refusal history.

**Reference:** V3.3.1 candidate versions, rollback, receipts, and revalidation.

------

## [IDEA] [HIGH] [Source Workspace / Evidence] — Convert Source Workspace into an exportable EvidencePackage

**Finding:** Source Workspace solves shared source state inside a task, but the final user need is an evidence package keyed to output claims and evaluation criteria.

**Why it matters:** “Source A was in the workspace” is not enough. The reviewer needs to verify which source supports which claim, at which anchor, with what freshness and applicability.

**Recommendation:** Add `EvidencePackage` generated from Source Workspace: source records, evidence anchors, claim-support links, unsupported claims, source freshness, verification results, applicability scopes, and unresolved research needs.

**Reference:** Source Workspace defines persistent queryable source substrate, SourceRecord, ResearchNeed, and verification records.

------

## [IDEA] [HIGH] [System-wide / Budget UX] — Add BudgetNarrative

**Finding:** V3.3.1 separates logical and infrastructure budgets, but the user-facing cost story is incomplete.

**Why it matters:** A professional needs to know not only that a cap existed, but what quality tradeoffs occurred as the run approached the cap: which optional checks were skipped, which safety checks were preserved, and what budget domain drove the stop.

**Recommendation:** Add `BudgetNarrative` to the Run Board and final reliance packet: planned estimate, actual cost, logical calls, infrastructure retries, local compute, external tools, DOC24 packet assembly, source research, skipped helpers, degraded modes, and quality impact.

**Reference:** Prompt explicitly calls out budget governance; V3.3.1 has budget and local-compute surfaces.

------

## [IDEA] [HIGH] [System-wide / Reliability Engineering] — Add chaos and concurrency fixtures

**Finding:** V3.3.1 has conformance fixtures, but the hardest failures are distributed-systems failures: artifact written but receipt failed, candidate index failed, parallel sibling failed, policy expired mid-plan, context packet stale, source workspace race, user live-edited during acceptance.

**Why it matters:** High-stakes systems fail through state corruption and ambiguous recovery as often as model error.

**Recommendation:** Add `TaskChaosFixture` classes for crash-after-artifact-before-receipt, receipt persistence failure, candidate index failure, parallel partial failure, policy expiry mid-dispatch, stale context packet after blocker, source workspace race, rolling-hash mismatch, and live user edit during candidate acceptance.

**Reference:** V3.3.1 already treats fixtures as conformance tests, not prose.

------

## [CONFIRMED] [LOW] [V3.3.1] — Typed planning plus deterministic dispatch is the core strength

**Finding:** The separation between LLM planning and deterministic dispatch is the strongest architectural move in the set.

**Why it matters:** It prevents the system from becoming an opaque free-form agent loop.

**Recommendation:** Preserve this boundary aggressively. Any future convenience feature should be tested against: “Does this smuggle runtime LLM routing into dispatch?”

**Reference:** V3.3.1 implementation discipline and five-stage pipeline.

------

## [CONFIRMED] [LOW] [Core R0.7.1] — Direct-first task invocation is the right product invariant

**Finding:** Core’s direct-first invariant is correct: task awareness must not degrade ordinary chat, and task mode should require meaningful task signals.

**Why it matters:** Users will reject ELNOR if every complex request turns into workflow theater.

**Recommendation:** Keep this invariant and add conformance tests around vague requests, recent task-suggestion rejection, and high-complexity but direct-answer-appropriate work.

**Reference:** Core defines task veto signals, vague-request direct-first behavior, and TaskOpportunityPacket injection limits.

------

# Cross-document seam and hard-mechanic findings

## [BUG] [CRITICAL] [Pattern C / V3.3.1 §5.18.5 / Common Contracts §3] — Judge reads fields not present on EvaluationResultEnvelope

**Finding:** V3.3.1 says Pattern C Judge reads `evaluated_target`, `evaluation_basis`, and `target_artifact_version_ref` from the upstream envelope. Common Contracts’ `EvaluationResultEnvelope` instead carries `target_artifact_ref`, `target_artifact_version_ref`, `target_scope_ref`, and `evaluation_snapshot_ref`; `evaluated_target` and `evaluation_basis` are FeedbackBundle fields.

**Why it matters:** A coding agent will either crash, read the wrong object, or silently couple Judge to FeedbackBundle when Pattern C says it consumes the envelope.

**Recommendation:** Amend V3.3.1 §5.18.5: Judge reads `target_artifact_ref`, `target_artifact_version_ref`, `target_scope_ref`, `evaluation_snapshot_ref`, `criterion_lineage`, and `qualitative_slice.findings`. If Judge needs `EvaluationBasis`, it should consume the linked `EvaluationFeedbackBundle` as a separate optional input.

**Reference:** V3.3.1 §5.18.5; Common Contracts §3; Feedback Delivery §2.1.

------

## [GAP] [CRITICAL] [Pattern C / Common Contracts §3.7] — No EvaluationChainRegistry or chain-resolution failure semantics

**Finding:** Pattern C uses `target_evaluation_chain_id`, but the specs do not define what happens when the chain ID is missing, duplicated, stale, resolves to the wrong artifact version, or collides across parallel chains.

**Why it matters:** Chain corruption damages audit, UI, learning correlation, and route decisions.

**Recommendation:** Add `EvaluationChainRegistry` with chain status, expected producers, target artifact/version/scope, created_at, superseded_by_chain_id, and validation failure codes: `chain_id_missing`, `chain_target_mismatch`, `chain_stale_snapshot`, `chain_ambiguous`, `chain_consumer_timeout`.

**Reference:** Common Contracts defines Pattern C chain linkage by `target_evaluation_chain_id`.
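
One possible shape for chain resolution with explicit failure codes is sketched below. The failure codes are the ones proposed in the recommendation; the `ChainRecord` fields, the check order, and the rule that a superseded chain counts as stale are assumptions. `chain_consumer_timeout` is declared but not exercised, since timeouts would be raised by the waiting consumer rather than by lookup.

```typescript
type ChainFailureCode =
  | "chain_id_missing"
  | "chain_target_mismatch"
  | "chain_stale_snapshot"
  | "chain_ambiguous"
  | "chain_consumer_timeout";

// Hypothetical registry row; field names are illustrative.
interface ChainRecord {
  chain_id: string;
  target_artifact_version_ref: string;
  evaluation_snapshot_ref: string;
  superseded_by_chain_id?: string;
}

type ChainResolution =
  | { ok: true; chain: ChainRecord }
  | { ok: false; code: ChainFailureCode };

function resolveChain(
  registry: ChainRecord[],
  chainId: string | undefined,
  expectedVersionRef: string,
  currentSnapshotRef: string,
): ChainResolution {
  if (chainId === undefined || chainId === "") return { ok: false, code: "chain_id_missing" };
  const matches = registry.filter((c) => c.chain_id === chainId);
  if (matches.length > 1) return { ok: false, code: "chain_ambiguous" };
  const chain = matches[0];
  if (chain === undefined) return { ok: false, code: "chain_id_missing" };
  if (chain.target_artifact_version_ref !== expectedVersionRef) {
    return { ok: false, code: "chain_target_mismatch" };
  }
  // Assumption: a superseded chain is reported as stale, same as a snapshot drift.
  if (chain.superseded_by_chain_id !== undefined || chain.evaluation_snapshot_ref !== currentSnapshotRef) {
    return { ok: false, code: "chain_stale_snapshot" };
  }
  return { ok: true, chain };
}
```

Returning a typed failure instead of throwing keeps every chain error visible to audit and routing consumers.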

------

## [BUG] [HIGH] [Pattern C / Common Contracts §3.7] — Route resolution is left to unsafe “consumer policy”

**Finding:** Common Contracts admits that upstream Evaluator and downstream Judge may emit different route recommendations, then says resolution is by consumer policy, “typically” Judge governs.

**Why it matters:** “Typically” is not executable. Revisor, Loop Controller, UI, and learning consumers can disagree on whether the task passed.

**Recommendation:** Add `EvaluationChainResolutionPolicy`. Default: blocking qualitative findings survive quantitative pass unless the finding is non-blocking, contested, or explicitly configured as Judge-overridable.

**Reference:** Common Contracts Pattern C route recommendation language.
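
A literal reading of the proposed default can be written as a small pure function. This is only a sketch: the `QualitativeFindingSummary` and `ChainResolutionResult` shapes are hypothetical, and the rule encoded is exactly the one sentence in the recommendation (a blocking finding survives unless it is non-blocking, contested, or configured as Judge-overridable).

```typescript
// Hypothetical projection of a qualitative finding; all field names are assumptions.
interface QualitativeFindingSummary {
  finding_id: string;
  blocking: boolean;
  contested: boolean;
  judge_overridable: boolean;
}

interface ChainResolutionResult {
  chain_passed: boolean;
  surviving_blocker_ids: string[];
}

function resolveChainOutcome(
  quantitativePassed: boolean,
  findings: QualitativeFindingSummary[],
): ChainResolutionResult {
  // A finding survives only if it is blocking, uncontested, and not Judge-overridable.
  const survivors = findings
    .filter((f) => f.blocking && !f.contested && !f.judge_overridable)
    .map((f) => f.finding_id);
  return {
    // The chain passes only when the numeric side passed and no blocker survived.
    chain_passed: quantitativePassed && survivors.length === 0,
    surviving_blocker_ids: survivors,
  };
}
```

Because the function is deterministic and takes no consumer identity, Revisor, Loop Controller, UI, and learning consumers would all derive the same answer from the same chain.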

------

## [BUG] [CRITICAL] [V3.3.1 / Feedback Delivery / Common Contracts] — `EvaluationFinding` is defined incompatibly in multiple places

**Finding:** V3.3.1 defines `EvaluationFinding` with `finding_text`, `state: FindingState`, `basis: AssuranceBasis`, target artifact/version refs, confidence labels, and taint. Feedback Delivery defines another `EvaluationFinding` with `finding_kind`, `authority_basis: EvaluationAuthorityBasis[]`, `explanation`, `affected_artifact_refs`, lifecycle enum, and optional provenance refs. Common Contracts references `EvaluationFinding[]` but says the full schema lives in Addenda B.

**Why it matters:** Producer and consumer cannot both be correct. This will break serialization and make downstream repair, forum publication, and learning inconsistent.

**Recommendation:** Create one canonical `EvaluationFinding` in Common Contracts or V3.3.1. Rename Feedback Delivery’s version to `FeedbackFindingView` if it is a delivery projection. Do not allow two normative schemas with the same name.

**Reference:** V3.3.1 §5.7; Feedback Delivery §3.3; Common Contracts §4.2.

------

## [BUG] [HIGH] [Feedback Delivery §3 / V3.3.1 §5.4] — `EvaluationAuthorityBasis` conflicts with `AssuranceBasis`

**Finding:** Feedback Delivery uses `EvaluationAuthorityBasis` values such as `source_reference` and `saved_criteria`, while V3.3.1 and Common Contracts use `AssuranceBasis` values such as `source_verified_external`, `claim_grounded_internal`, and `structured_validation`.

**Why it matters:** Blocking-finding logic cannot reconcile authority rules with envelope assurance rules.

**Recommendation:** Replace `EvaluationAuthorityBasis` with `AssuranceBasis[]` plus delivery-specific fields, or define a normative mapping table.

**Reference:** Feedback Delivery authority basis; V3.3.1 assurance basis.

------

## [BUG] [HIGH] [Feedback Delivery §1.2 / §2.3] — Feedback bundle emission discipline contradicts itself

**Finding:** Feedback Delivery says both envelope and feedback bundle are emitted by every evaluator producer, then says deterministic scorers emit bundles only on failure and pass cases may emit envelope only.

**Why it matters:** Absence of a bundle becomes ambiguous: not emitted by design, lost, not applicable, or policy-filtered.

**Recommendation:** Add an emission matrix by producer kind and verdict. Require either a bundle or explicit `feedback_bundle_absent_reason`.

**Reference:** Feedback Delivery §1.2, §2.3.
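
A minimal sketch of the "bundle or explicit absence reason" rule follows. The verdict values are those listed for `EvaluationDecision`; the producer kinds, the absence-reason values, and the matrix rule itself (only a deterministic scorer that passed may omit the bundle) are assumptions standing in for the real emission matrix.

```typescript
type ProducerKind = "deterministic_scorer" | "llm_evaluator"; // illustrative producer kinds
type Verdict = "passed" | "failed" | "indeterminate" | "not_applicable";

// Either a bundle ref or an explicit reason for its absence; never neither.
type BundleEmission =
  | { feedback_bundle_ref: string }
  | { feedback_bundle_absent_reason: "not_emitted_by_design" | "not_applicable" | "policy_filtered" };

// Assumed matrix rule: only a passing deterministic scorer may skip the bundle.
function bundleRequired(producer: ProducerKind, verdict: Verdict): boolean {
  return !(producer === "deterministic_scorer" && verdict === "passed");
}

// Returns a validation error code, or null when the emission is acceptable.
function validateEmission(
  producer: ProducerKind,
  verdict: Verdict,
  emission: BundleEmission,
): string | null {
  const hasBundle = "feedback_bundle_ref" in emission;
  if (bundleRequired(producer, verdict) && !hasBundle) return "bundle_required_but_absent";
  return null;
}
```

With this shape, a missing bundle can no longer be confused with a lost one: the envelope always says which case applies.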

------

## [BUG] [HIGH] [Feedback Delivery §5.3 / V3.3.1 §9] — Revisor input is confused with `revision_in`

**Finding:** Feedback Delivery says Revisor consumes `OutcomeRepairInstruction` via `revision_in`. But V3.3.1 treats `revision_in` as the standard port on revision-capable modules that perform repairs; Revisor is the planner that compiles repair instructions into a plan.

**Why it matters:** This reverses the architecture and risks implementing Revisor as a revision target instead of a planner.

**Recommendation:** Add Revisor inputs such as `feedback_bundle_in`, `repair_instruction_in`, or `evaluation_result_in`. Reserve `revision_in` for artifact-mutating modules.

**Reference:** Feedback Delivery §5.3; V3.3.1 revision contract.

------

## [BUG] [CRITICAL] [Feedback Delivery §7.2 / V3.3.1 §3.1.3] — Direct repair wiring can bypass the central `revision_in` safety contract

**Finding:** V3.3.1 says meaning-bearing repair must go through declared `revision_in` capability. Feedback Delivery examples route repair directly to `DraftRevision.instruction_in`, `FormatChecker.data_in`, and `RevisionModule.context_in`.

**Why it matters:** This bypasses capability validation, preconditions, candidate versions, policy gates, preservation checks, and receipts.

**Recommendation:** Amend Feedback Delivery: repair instructions may execute only through `revision_in` or a port explicitly declared `revision_compatible = true` with `ModuleRevisionCapability` coverage. Other direct wiring is advisory context only.

**Reference:** V3.3.1 meaning-bearing repair invariant; Feedback Delivery direct wiring.

------

## [BUG] [HIGH] [Feedback Delivery §7.2 / Source Workspace §7.4] — ResearchNeed is routed to the wrong Source Research port

**Finding:** Feedback Delivery routes `research_need_out → SourceResearch.data_in`, while Source Workspace defines `research_need_in` for unresolved ResearchNeed records.

**Why it matters:** Typed source needs become generic data input, losing lifecycle and queue semantics.

**Recommendation:** Change wiring to `research_need_out → step.source_research.research_need_in`. Reserve `data_in` for artifacts to be source-checked.

**Reference:** Feedback Delivery §7.2; Source Workspace §7.4.

------

## [GAP] [CRITICAL] [State semantics / all docs] — Add canonical state matrix

**Finding:** V3.3.1 defines a 15-value `OutcomeEvaluationState` including transient states such as `pending`, `pending_dependency`, `evaluating`, `dirty`, and `superseded`; other sections still refer to V3.1’s 14-value enum; the state machine introduces `max_iterations_reached` without adding it to the enum.

**Why it matters:** State values drive persistence, feedback routing, UI, learning eligibility, revalidation, and final aggregation.

**Recommendation:** Add a canonical state matrix with columns: `state`, `runtime_or_persisted`, `terminal`, `emitted_to_envelope`, `verdict_mapping`, `feedback_branch`, `revisor_action`, `UI_status`, `learning_eligibility`, and `allowed_transitions`.

**Reference:** V3.3.1 §0.4.1, §5.6.1; Common Contracts §3.2; Feedback Delivery §6.3.

------

## [BUG] [CRITICAL] [V3.3.1 §5.5.3 / §5.6] — Sufficiency failure maps limitation enum values into outcome-state field

**Finding:** V3.3.1 says failed required `SufficiencyCheck` emits `OutcomeEvaluationResult` with `overall_state` set to the limitation kind. But `overall_state` is `OutcomeEvaluationState`, while `insufficient_reason` is `EvaluationLimitationKind`.

**Why it matters:** A faithful implementation will write invalid states such as `insufficient_evidence` into `overall_state`.

**Recommendation:** Add a normative limitation-to-state mapping. Store limitation reason in `limitations[]`, never as state.

**Reference:** V3.3.1 §5.5.3, §5.6.

------

## [BUG] [CRITICAL] [V3.3.1 §5.5.3 / Common Contracts §3.1] — “No verdict” is required but not representable cleanly

**Finding:** V3.3.1 says failed sufficiency produces a limitation rather than a verdict, and verdict/limitation are mutually exclusive. Common Contracts requires `evaluation_verdict`, with `indeterminate` as the closest value.

**Why it matters:** “Indeterminate after substantive evaluation” and “blocked before substantive verdict” collapse into the same value.

**Recommendation:** Keep `evaluation_verdict = "indeterminate"` for compatibility, but add `substantive_verdict_status = "substantive_verdict_reached" | "blocked_before_substantive_verdict" | "partial_substantive_verdict"`.

**Reference:** V3.3.1 sufficiency gate; Common Contracts envelope.

------

## [BUG] [HIGH] [Common Contracts §3.1] — `IndeterminateCause` is referenced but not defined in the set

**Finding:** `EvaluationResultEnvelope.indeterminate_reasons` uses `IndeterminateCause[]`, but the six-doc set does not define the taxonomy.

**Why it matters:** Indeterminate outcomes are the hardest to route. Without cause taxonomy, the system cannot decide whether to ask for sources, human judgment, policy escalation, retry, or proceed with warning.

**Recommendation:** Define `IndeterminateCause` and map each cause to default route, blocking behavior, recovery action, and user-facing language.

**Reference:** Common Contracts `EvaluationResultEnvelope`.

------

## [BUG] [HIGH] [Feedback Delivery §2.1] — `EvaluationDecision.pass: boolean` cannot represent indeterminate or not-applicable states

**Finding:** `EvaluationDecision` has both `verdict: passed | failed | indeterminate | not_applicable` and `pass: boolean`.

**Why it matters:** For `indeterminate` and `not_applicable`, boolean `pass` is misleading. Implementers may encode both as false and route them like failures.

**Recommendation:** Replace `pass` with `decision_outcome = "pass" | "fail" | "indeterminate" | "not_applicable"`. Keep the boolean only as a deprecated, derived display field that is defined for pass/fail states and absent otherwise.

**Reference:** Feedback Delivery `EvaluationDecision`.
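
The difference between the four-valued outcome and the legacy boolean can be shown in a few lines. The `decision_outcome` values are from the recommendation; `derivedPass` and `routesAsFailure` are hypothetical helper names used only for this sketch.

```typescript
type DecisionOutcome = "pass" | "fail" | "indeterminate" | "not_applicable";

// Deprecated display helper: defined only for pass/fail, undefined otherwise,
// so callers cannot mistake "indeterminate" for a failure.
function derivedPass(outcome: DecisionOutcome): boolean | undefined {
  switch (outcome) {
    case "pass":
      return true;
    case "fail":
      return false;
    default:
      return undefined;
  }
}

// Routing should branch on the outcome itself, never on the derived boolean.
function routesAsFailure(outcome: DecisionOutcome): boolean {
  return outcome === "fail";
}
```

Under this sketch, `indeterminate` and `not_applicable` yield `undefined` rather than `false`, which forces consumers to handle them explicitly.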

------

## [BUG] [HIGH] [Feedback Delivery §6] — Routing policy cannot represent multiple simultaneous actions

**Finding:** FeedbackRoutingPolicy branch fields are single union values, but the UI shows checkboxes and prose says multiple branches may fire.

**Why it matters:** The UI can express a policy the schema cannot store.

**Recommendation:** Make branch fields arrays of `FeedbackRouteAction`, and make `FeedbackRoutingRecommendation` carry `route_actions[]` plus optional `primary_route_action_id`.

**Reference:** Feedback Delivery §6.3–§6.6.

------

## [GAP] [HIGH] [Feedback Delivery §6.3] — Missing full state-to-routing matrix

**Finding:** The router reads `EvaluationDecision.state` and maps to policy branches, but no table maps every `OutcomeEvaluationState` and `evaluation_verdict` to routing branches.

**Why it matters:** Implementers must guess how `regressed`, `upstream_failure`, `blocked_by_policy`, `unable_to_evaluate`, `needs_human_judgment`, and `indeterminate` route.

**Recommendation:** Add a normative routing matrix covering every state and verdict, including the precedence rule that applies when multiple branches match.

**Reference:** Feedback Delivery §6.3.

------

## [BUG] [MEDIUM] [Feedback Delivery §2.2 / §3.3] — Optional fields are required by invariants

**Finding:** Feedback Delivery says `findings[i].based_on_artifact_version_ref` must resolve to a version in the snapshot, but the field is optional.

**Why it matters:** Validators cannot enforce the invariant consistently.

**Recommendation:** Make it required for artifact-targeted findings, or add `based_on_artifact_version_absent_reason`.

**Reference:** Feedback Delivery §2.2, §3.3.

------

## [GAP] [HIGH] [Feedback Delivery §9] — Delivery receipts are missing

**Finding:** Consumption receipts confirm a consumer used or ignored feedback, but there is no separate delivery receipt showing whether feedback was delivered, filtered, timed out, or failed due to missing port/consumer.

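**Why it matters:** The absence of a consumption receipt is ambiguous: the feedback may never have reached the consumer, or it may have arrived and not yet been acted on.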

**Recommendation:** Add `FeedbackDeliveryReceipt` with delivery status: delivered, filtered_by_policy, consumer_unavailable, port_unwired, timed_out, not_applicable.

**Reference:** Feedback Delivery consumption receipt discipline.

------

## [GAP] [HIGH] [Feedback Delivery §9] — Consumption receipts need per-item status

**Finding:** A module may use one finding, reject another, defer a repair instruction, and ignore a run-guidance item. A single receipt-level mode cannot express that.

**Why it matters:** Learning signals and audit need item-level truth.

**Recommendation:** Add `consumed_items[]` with item kind, status, reason, and resulting action ref.

**Reference:** Feedback Delivery §9.4–§9.5.
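
A per-item receipt might look like the sketch below. `consumed_items[]` and its four attributes (kind, status, reason, resulting action ref) come from the recommendation; the concrete enum values, the other receipt fields, and the rule that non-`used` items need a reason are assumptions.

```typescript
type ConsumedItemKind = "finding" | "repair_instruction" | "run_guidance";
type ConsumedItemStatus = "used" | "rejected" | "deferred" | "ignored";

interface ConsumedItem {
  item_ref: string;
  item_kind: ConsumedItemKind;
  status: ConsumedItemStatus;
  reason?: string;
  resulting_action_ref?: string;
}

// Hypothetical receipt shape; only consumed_items is taken from the recommendation.
interface ConsumptionReceipt {
  receipt_id: string;
  consumer_module_id: string;
  consumed_items: ConsumedItem[];
}

// Item-level tally for learning signals and audit.
function summarizeReceipt(receipt: ConsumptionReceipt): Record<ConsumedItemStatus, number> {
  const counts: Record<ConsumedItemStatus, number> = { used: 0, rejected: 0, deferred: 0, ignored: 0 };
  for (const item of receipt.consumed_items) counts[item.status] += 1;
  return counts;
}

// Assumed validation rule: anything not "used" must explain why.
function itemsMissingReason(receipt: ConsumptionReceipt): string[] {
  return receipt.consumed_items
    .filter((i) => i.status !== "used" && (i.reason === undefined || i.reason === ""))
    .map((i) => i.item_ref);
}
```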

------

## [BUG] [HIGH] [Common Contracts §4.2] — Qualitative slice owner map is wrong

**Finding:** Common Contracts says `EvaluationFinding`, `OutcomeRepairInstruction`, `ResearchNeed`, and `EvaluationAffirmation` live in Addenda B Core R0.7.1. In the current set, repair instructions live in Feedback Delivery, ResearchNeed lives in Source Workspace, and EvaluationFinding is duplicated.

**Why it matters:** Common Contracts is supposed to prevent schema drift; here it points to the wrong owners.

**Recommendation:** Add `TypeOwnerRegistry` and correct ownership: `EvaluationFinding` canonical in Common Contracts or V3.3.1; `OutcomeRepairInstruction` in Feedback Delivery; `ResearchNeed` in Source Workspace; define or remove `EvaluationAffirmation`.

**Reference:** Common Contracts qualitative slice.

------

## [BUG] [HIGH] [Core §9 / Common Contracts §5] — Core duplicates the common learning signal envelope with malformed TypeScript

**Finding:** Core reproduces `EvaluationLearningSignalEnvelope` and includes `task_id, run_id: string`, which is invalid TypeScript-like notation and differs from Common Contracts.

**Why it matters:** A coding agent may copy the malformed Core schema.

**Recommendation:** Remove duplicate normative envelope body from Core. Core should say it uses Common Contracts §5 and owns only signal payload schemas.

**Reference:** Core §9.0; Common Contracts §5.

------

## [GAP] [HIGH] [Common Contracts §3.3] — Required `EvaluationArtifactEnvelope` wrapper lives outside the document

**Finding:** Common Contracts requires `EvaluationArtifactEnvelope<EvaluationResultEnvelope>`, but the wrapper is owned by Addenda A and not minimally defined in Common Contracts.

**Why it matters:** A coding agent implementing Common Contracts alone cannot validate or serialize the required wrapper.

**Recommendation:** Either inline a minimal wrapper contract, or add a hard-dependency section that names the exact owner, schema hash, and validation requirements.

**Reference:** Common Contracts §3.3.

------

## [GAP] [HIGH] [Common Contracts §12] — Pending consumers lack degraded-mode behavior

**Finding:** Common Contracts lists pending target updates to DOC8/BDSM, EC Core, DOC20, DOC72, PropA, and others.

**Why it matters:** Producers may emit envelopes or signals that no consumer can interpret.

**Recommendation:** Add “until target lands” behavior for each pending obligation: persist-only, suppress promotion, disable UI affordance, emit validation warning, or block route.

**Reference:** Common Contracts cross-doc obligations.

------

# Revision runtime and distributed-systems findings

## [BUG] [CRITICAL] [V3.3.1 §11.8 / §7.5] — Idempotency formula is not total over all plan-step kinds

**Finding:** The idempotency formula assumes fields such as `target_module_id`, `target_version_precondition_ref`, and `typed_instruction.idempotency_input_hash`, but many `RevisionPlanStepKind` variants are not module revision steps.

**Why it matters:** Crash recovery and retry safety are impossible if idempotency keys are missing or unstable.

**Recommendation:** Replace single formula with discriminated `StepIdempotencyInput` per step kind, each with test vectors.

**Reference:** V3.3.1 step kinds and dispatch mechanics.
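
The discriminated-input idea can be sketched as follows. The `module_revision` fields are the ones the finding quotes from the existing formula; the `revalidation` and `human_gate` variants, their fields, and the key encoding are hypothetical stand-ins for whatever step kinds `RevisionPlanStepKind` actually defines. A real implementation would likely hash the resulting string; that step is omitted here.

```typescript
type StepIdempotencyInput =
  | {
      kind: "module_revision";
      target_module_id: string;
      target_version_precondition_ref: string;
      idempotency_input_hash: string;
    }
  | { kind: "revalidation"; outcome_id: string; evaluation_snapshot_ref: string } // hypothetical variant
  | { kind: "human_gate"; gate_id: string; question_hash: string }; // hypothetical variant

// Compile-time exhaustiveness guard: adding a step kind without a case fails type-checking.
function assertNever(value: never): never {
  throw new Error("unhandled step kind: " + JSON.stringify(value));
}

function canonicalParts(input: StepIdempotencyInput): string[] {
  switch (input.kind) {
    case "module_revision":
      return [input.target_module_id, input.target_version_precondition_ref, input.idempotency_input_hash];
    case "revalidation":
      return [input.outcome_id, input.evaluation_snapshot_ref];
    case "human_gate":
      return [input.gate_id, input.question_hash];
    default:
      return assertNever(input);
  }
}

// JSON array encoding keeps field boundaries unambiguous ("a|b" vs "a", "b").
function idempotencyKey(planId: string, input: StepIdempotencyInput): string {
  return JSON.stringify([planId, input.kind].concat(canonicalParts(input)));
}
```

The exhaustive `switch` is what makes the formula total: every step kind must declare its own stable inputs before the code compiles.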

------

## [BUG] [HIGH] [V3.3.1 §11.20] — Rolling hash mode is linear but revision plans are DAGs

**Finding:** Rolling hash validation says step N validates against step N−1 in topological order, but plan steps are DAGs and may run in parallel. Topological order is not unique.

**Why it matters:** Validity may depend on arbitrary ordering; parallel disjoint edits cannot be represented correctly.

**Recommendation:** Either forbid parallelism in rolling-hash mode or use Merkle-style per-artifact/per-scope read/write-set hashes.

**Reference:** V3.3.1 rolling hash mode and revalidation.

------

## [BUG] [HIGH] [V3.3.1 §11.20.3] — Rolling hash fallback after mutation is unsafe

**Finding:** A rolling-hash mismatch “forces fallback to candidate-only mode plus replan,” but if in-place mutation already occurred, fallback is not safe unless rollback succeeds and is verified.

**Why it matters:** The live artifact may already be partially mutated.

**Recommendation:** Mismatch → freeze artifact → verify live hash → rollback if mutation occurred → emit rollback receipt → only then candidate-only replan. If rollback verification fails, enter `manual_recovery_required`.

**Reference:** V3.3.1 §11.20.3.
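
The recommended recovery order can be expressed as a short procedure. This is a sketch under stated assumptions: the `MismatchContext` callbacks are hypothetical injection points, and "mutation occurred" is approximated by comparing the live hash against the hash recorded before the in-place step began.

```typescript
type RecoveryOutcome = "candidate_only_replan" | "manual_recovery_required";

// Hypothetical hooks; a real runtime would bind these to its artifact store.
interface MismatchContext {
  expectedLiveHash: string; // hash recorded before the in-place step began
  currentLiveHash: () => string;
  freeze: () => void;
  rollback: () => void;
  emitRollbackReceipt: () => void;
}

function handleRollingHashMismatch(ctx: MismatchContext): RecoveryOutcome {
  // 1. Freeze first so nothing else mutates the artifact during recovery.
  ctx.freeze();
  // 2. Verify the live hash; if nothing was mutated, replanning is already safe.
  if (ctx.currentLiveHash() === ctx.expectedLiveHash) return "candidate_only_replan";
  // 3. Mutation occurred: roll back, then verify the rollback actually restored the artifact.
  ctx.rollback();
  if (ctx.currentLiveHash() !== ctx.expectedLiveHash) return "manual_recovery_required";
  // 4. Only a verified rollback gets a receipt and proceeds to candidate-only replan.
  ctx.emitRollbackReceipt();
  return "candidate_only_replan";
}
```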

------

## [GAP] [HIGH] [V3.3.1 §11.21] — Declared-dependency-only revalidation is auditable but too brittle

**Finding:** V3.3.1 explicitly says only declared dependents are re-evaluated, even if a human would intuit an impact.

**Why it matters:** Professional artifacts have implicit dependencies: definitions, source assumptions, factual premises, and strategic framing.

**Recommendation:** Keep declared dependencies as the deterministic core, but add an advisory `SemanticImpactScan` that proposes additional revalidation targets. Deterministic policy can then auto-expand the target set for high-risk tasks or require human approval.

**Reference:** V3.3.1 revalidation cascade.

------

## [RISK] [HIGH] [V3.3.1 §11.21.2] — Upstream failure cascade only catches artifact-ref waits

**Finding:** The cascade only transitions outcomes pending on `missing_artifact_refs` produced by a failed module.

**Why it matters:** Outcomes can wait on module activations, port emissions, source needs, human answers, capabilities, external tools, or forum decisions.

**Recommendation:** Generalize `PendingDependencyInfo` to `DependencyRef` union: artifact, module activation, port emission, source need, human answer, capability, forum decision, external tool result.

**Reference:** V3.3.1 pending-dependency cascade.
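
A generalized dependency reference could be modeled as a discriminated union. The eight variant kinds mirror the list in the recommendation; every field name, the `PendingOutcome` shape, and the rule for which variants a module failure blocks are assumptions for illustration.

```typescript
type DependencyRef =
  | { kind: "artifact"; artifact_ref: string; producer_module_id: string }
  | { kind: "module_activation"; module_id: string }
  | { kind: "port_emission"; module_id: string; port: string }
  | { kind: "source_need"; research_need_id: string }
  | { kind: "human_answer"; question_id: string }
  | { kind: "capability"; capability_id: string }
  | { kind: "forum_decision"; post_id: string }
  | { kind: "external_tool_result"; tool_call_id: string };

// True when the failed module makes this dependency unsatisfiable.
function blockedByModuleFailure(dep: DependencyRef, failedModuleId: string): boolean {
  switch (dep.kind) {
    case "artifact":
      return dep.producer_module_id === failedModuleId;
    case "module_activation":
    case "port_emission":
      return dep.module_id === failedModuleId;
    default:
      return false; // assumed to be resolved through other failure channels
  }
}

interface PendingOutcome {
  outcome_id: string;
  waiting_on: DependencyRef[];
}

// Outcomes that should transition when the given module fails.
function outcomesToCascade(pending: PendingOutcome[], failedModuleId: string): string[] {
  return pending
    .filter((o) => o.waiting_on.some((d) => blockedByModuleFailure(d, failedModuleId)))
    .map((o) => o.outcome_id);
}
```

In this sketch an outcome waiting on a port emission from the failed module cascades as well, which is the case the artifact-only rule misses.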

------

## [RISK] [HIGH] [V3.3.1 §11.22] — Parallel sibling outputs can become orphaned after batch failure

**Finding:** If one parallel step fails, successful siblings may produce candidates, but the spec does not fully define their disposition.

**Why it matters:** Downstream stages may consume candidates from an aborted plan, or valid work may be lost without review.

**Recommendation:** Add `ParallelBatchFinalizationReceipt` with candidate dispositions: discarded, retained_for_manual_review, blocked_from_downstream, accepted.

**Reference:** V3.3.1 parallel execution and candidate semantics.

------

## [GAP] [HIGH] [V3.3.1 §6.14 / §11.22] — Parallelism controls are not surfaced as first-class config

**Finding:** Parallel execution is allowed, but `RevisorConfig` does not clearly expose max parallel steps, max parallel artifact writers, or parallel failure policy.

**Why it matters:** Parallelism is a race, cost, and reviewability parameter.

**Recommendation:** Add `max_parallel_steps_per_plan`, `max_parallel_artifact_writers`, `parallel_batch_failure_policy`, and `parallel_candidate_disposition`.

**Reference:** V3.3.1 RevisorConfig and parallel dispatch.

------

## [BUG] [HIGH] [V3.3.1 §11.19] — Policy freshness checks a field not in `PolicyDecisionRef`

**Finding:** Freshness requires no `superseded_by_decision_id` flag, but `PolicyDecisionRef` has no such field.

**Why it matters:** Supersession cannot be represented.

**Recommendation:** Add `issued_at`, `subject_hash`, `decision_scope_hash`, `superseded_by_decision_id`, and `policy_engine_version`, or make EC’s policy record the required freshness source of truth.

**Reference:** V3.3.1 §11.19.2.

------

## [BUG] [HIGH] [V3.3.1 §11.6 / §11.17] — `receipt_persist_failed` is emitted but not registered

**Finding:** Receipt persistence failure produces `receipt_persist_failed`, but that value is not in `RevisionOperationKind`.

**Why it matters:** A failure to persist the audit record is one of the most important events to capture, so it must not be emitted as an unregistered, schema-invalid value.

**Recommendation:** Add `RevisionIntegrityEventKind` with `receipt_persist_failed`, `candidate_index_repair_failed`, `artifact_receipt_chain_broken`, and `rollback_receipt_missing`.

**Reference:** V3.3.1 operation receipt and workspace failure sections.

------

## [BUG] [HIGH] [V3.3.1 §11.16 / §0.4.8] — `preempted` is used as `ExecutionStatus` but only defined as failure event

**Finding:** Preemption protocol emits `execution_status = "preempted"`, but `ExecutionStatus` does not include it.

**Why it matters:** Validators will reject any receipt that carries the undeclared `preempted` status.

**Recommendation:** Use `execution_status = "failed_runtime"` plus `failure_event_kind = "preempted"`, or add `preempted` to `ExecutionStatus`. Prefer the former.

**Reference:** V3.3.1 enum inventory and preemption.

------

## [BUG] [HIGH] [V3.3.1 §8.8 / §0.4.2] — `local_resource_exhausted` is emitted but absent from failure taxonomy

**Finding:** Hardware-aware degradation says resource exhaustion aborts with `local_resource_exhausted`, but the failure taxonomy has only `budget_exhausted` and other generic values.

**Why it matters:** Local-first Apple Silicon constraints need first-class failure classification.

**Recommendation:** Add `local_resource_exhausted` or encode as `budget_exhausted` plus `budget_failure_kind = "local_compute_budget_exceeded"`.

**Reference:** V3.3.1 local compute and failure taxonomy.

------

## [BUG] [HIGH] [V3.3.1 §15 / §0.4.2] — Budget failure taxonomy is split three ways

**Finding:** Metrics refer to logical and infrastructure budget failures, local compute has separate preemption behavior, and `RevisionFailureEventKind` only has `budget_exhausted`.

**Why it matters:** Dashboards cannot distinguish logical LLM budget, infrastructure retry budget, local compute, external tool fees, or token/context budget.

**Recommendation:** Add `BudgetFailureKind` and attach it to `RevisionFailureEventKind = "budget_exhausted"`.

**Reference:** V3.3.1 budget governance.
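
The sketch below shows how a budget failure kind attached to the existing `budget_exhausted` event would let a dashboard separate the domains. The five domains are the ones named in the finding, and `local_compute_budget_exceeded` follows the naming used in the preceding finding; the other four value names, the `failure_event_kind` field name, and the tally helper are assumptions.

```typescript
type BudgetFailureKind =
  | "logical_llm_budget_exceeded"
  | "infrastructure_retry_budget_exceeded"
  | "local_compute_budget_exceeded"
  | "external_tool_fee_budget_exceeded"
  | "token_context_budget_exceeded";

// The event kind stays "budget_exhausted"; the new field carries the domain.
interface BudgetExhaustedEvent {
  failure_event_kind: "budget_exhausted";
  budget_failure_kind: BudgetFailureKind;
  step_id: string;
}

// Per-domain tally, the minimum a dashboard needs to tell the budgets apart.
function countByBudgetKind(events: BudgetExhaustedEvent[]): Partial<Record<BudgetFailureKind, number>> {
  const counts: Partial<Record<BudgetFailureKind, number>> = {};
  for (const e of events) {
    counts[e.budget_failure_kind] = (counts[e.budget_failure_kind] ?? 0) + 1;
  }
  return counts;
}
```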

------

## [BUG] [MEDIUM] [V3.3.1 §11.11] — CandidateArtifactVersion state mixes lifecycle and head-pointer state

**Finding:** `ArtifactVersionState` includes `current`, `candidate`, `accepted`, `rejected`, `superseded`, and `reverted`.

**Why it matters:** “Current” is an artifact head pointer, not candidate lifecycle state.

**Recommendation:** Split `ArtifactHeadState` from `CandidateVersionState`.

**Reference:** V3.3.1 artifact version state enum.

------

## [BUG] [HIGH] [V3.3.1 §11.18] — Candidate artifacts cannot model external side effects

**Finding:** Candidate versions work for artifacts, but side effects include external messages, calendar writes, webhooks, filings, and memory writes.

**Why it matters:** A candidate email send or filing is not an artifact candidate; it needs dry-run intent and approval semantics.

**Recommendation:** Add `SideEffectIntentCandidate` with dry-run payload, side-effect class, approval status, execution policy, and final execution receipt.

**Reference:** V3.3.1 side-effect policy.

------

## [GAP] [HIGH] [Cancel / Skip / Abort] — TaskCancelProtocol is missing

**Finding:** Gate skip has metadata, but user cancel during an active multi-step task is not fully specified.

**Why it matters:** Cancel must define what happens to in-flight module calls, child sessions, candidates, source workspace writes, pending questions, side effects, Run Board state, and learning signals.

**Recommendation:** Add `TaskCancelProtocol` with scope, in-flight handling, side-effect handling, candidate disposition, workspace state, learning suppression, and cancel receipt.

**Reference:** Prompt explicitly requires cancel/skip/indeterminate review.

------

## [GAP] [HIGH] [HardCall / Dispatcher / UI] — Pending HardCall blocking scope is missing

**Finding:** The specs define `waiting_hard_call`, HardCall signals, and human resolution, but not what parts of a task may continue while a HardCall is pending.

**Why it matters:** Parallel modules may continue based on unresolved judgment, or the whole task may pause unnecessarily.

**Recommendation:** Add `HardCallBlockingScope = entire_run | segment | artifact | outcome | module | side_effect_only`, plus resume/timeout/defer/reject semantics.

**Reference:** Common Contracts Hard Call surface; V3.3.1 dispatcher states.

------

# Math, formula, and scoring findings

## [BUG] [CRITICAL] [Common Contracts §4.1 / Addenda A seam] — `QualityIndex` and `DimensionScore` are not computable from this set

**Finding:** `QuantitativeEvaluationSlice` references `QualityIndex` and `DimensionScore`, but the set does not define scale, aggregation, required gates, missing dimension policy, or threshold comparison.

**Why it matters:** Pattern C’s numeric side cannot be implemented without invention.

**Recommendation:** Add `OutcomeComplianceScoreFunction` with normalized 0–1 dimension scores, weights, required-gate handling, aggregation method, missing-dimension policy, threshold comparison, and test vectors.

**Reference:** Common Contracts `QuantitativeEvaluationSlice`.
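
One possible shape for the recommended `OutcomeComplianceScoreFunction`, shown only to make the missing pieces concrete. The field names, the default gate threshold of 0.5, and the two missing-dimension policies are assumptions; the specs define none of them.

```ts
// Hypothetical input shape; field names are assumptions, not the spec's.
interface DimensionInput {
  id: string;
  score: number | null;  // normalized to [0, 1]; null means the dimension was not scored
  weight: number;        // non-negative
  required: boolean;
  required_min?: number; // gate threshold for required dimensions (assumed default 0.5)
}

type MissingDimensionPolicy = "fail" | "skip_and_renormalize";

interface ComplianceResult {
  quality_index: number | null; // null when nothing could be aggregated
  passed: boolean;
  failed_gates: string[];
}

function outcomeComplianceScore(
  dims: DimensionInput[],
  threshold: number,
  missing: MissingDimensionPolicy = "fail",
): ComplianceResult {
  const failedGates: string[] = [];
  let weighted = 0;
  let totalWeight = 0;
  for (const d of dims) {
    if (d.score === null) {
      // A missing required dimension always fails; optional ones follow the policy.
      if (d.required || missing === "fail") failedGates.push(d.id);
      continue;
    }
    if (d.score < 0 || d.score > 1) throw new RangeError(`score out of [0, 1]: ${d.id}`);
    if (d.required && d.score < (d.required_min ?? 0.5)) failedGates.push(d.id);
    weighted += d.weight * d.score;
    totalWeight += d.weight;
  }
  // Weighted mean over the dimensions that were actually scored.
  const qualityIndex = totalWeight > 0 ? weighted / totalWeight : null;
  const passed =
    failedGates.length === 0 && qualityIndex !== null && qualityIndex >= threshold;
  return { quality_index: qualityIndex, passed, failed_gates: failedGates };
}
```

The point of the gate list is that a high weighted mean cannot mask a failed required dimension: the index is still reported, but `passed` stays false.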

------

## [BUG] [HIGH] [Criterion math] — Criterion weights and required gates are underspecified

**Finding:** `Criterion.weight` is 0–1 or null; `default_weight_policy` can be uniform/from weight/from priority; `required` is separate. No normalization or required-gate rule is defined.

**Why it matters:** One implementation may let optional scores offset failed required criteria; another may fail hard.

**Recommendation:** Required criteria are hard gates by default. Optional criteria are normalized only after required gates pass. Null weights under `from_criterion_weight` are validation errors. `from_priority` maps must_have/should_have/nice_to_have to declared defaults.

**Reference:** V3.3.1 `Criterion` schema.
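
The weight rules in the recommendation can be sketched as follows. The three policy names mirror `default_weight_policy` as described in the finding; `CriterionLike`, `PRIORITY_DEFAULTS`, and the numeric defaults are hypothetical and would need to be declared by the spec.

```ts
type Priority = "must_have" | "should_have" | "nice_to_have";
type DefaultWeightPolicy = "uniform" | "from_criterion_weight" | "from_priority";

// Reduced stand-in for the `Criterion` schema; only the fields used here.
interface CriterionLike {
  id: string;
  weight: number | null; // 0 to 1, or null
  required: boolean;
  priority: Priority;
}

// Assumed defaults for illustration only.
const PRIORITY_DEFAULTS: Record<Priority, number> = {
  must_have: 1.0,
  should_have: 0.6,
  nice_to_have: 0.3,
};

// Returns normalized weights (summing to 1) for optional criteria only.
// Required criteria are treated as hard gates and never enter the weighted sum.
function normalizedOptionalWeights(
  criteria: CriterionLike[],
  policy: DefaultWeightPolicy,
): Record<string, number> {
  if (policy === "from_criterion_weight") {
    const bad = criteria.filter((c) => c.weight === null).map((c) => c.id);
    if (bad.length > 0) {
      throw new Error(`validation error: null weight on ${bad.join(", ")}`);
    }
  }
  const raw = criteria
    .filter((c) => !c.required)
    .map((c) => ({
      id: c.id,
      w:
        policy === "uniform"
          ? 1
          : policy === "from_priority"
            ? PRIORITY_DEFAULTS[c.priority]
            : (c.weight ?? 0),
    }));
  const total = raw.reduce((sum, r) => sum + r.w, 0);
  const out: Record<string, number> = {};
  for (const r of raw) out[r.id] = total > 0 ? r.w / total : 0;
  return out;
}
```

A caller would evaluate the required gates first and only consult these weights once every gate has passed.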

------

## [BUG] [MEDIUM] [Common Contracts §6.3 / V3.3.1 §5.1.1] — `criterion_semantics_hash` is lexical, not semantic

**Finding:** The hash is lowercased, stripped, content-word normalized text. That is lexical normalization, not semantic equivalence.

**Why it matters:** Equivalent rephrasings can hash differently; small but material changes may hash similarly.

**Recommendation:** Rename to `criterion_text_normalization_hash`. Add `criterion_lineage_id` for adjudicated semantic continuity.

**Reference:** V3.3.1 `Criterion`.
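
A small illustration of why lexical normalization is not semantic equivalence. The stop-word list and the `normalizeCriterionText` helper are assumptions, since the spec does not enumerate its content-word rule; hashing is omitted because equal normalized strings hash equally under any deterministic hash.

```ts
// Assumed stop-word list; the real content-word rule is not specified.
const STOP_WORDS = new Set(["the", "a", "an", "of", "to", "be", "must", "shall", "not", "all"]);

// Lowercase, strip punctuation, drop stop words, and rejoin the remaining content words.
function normalizeCriterionText(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w))
    .join(" ");
}

const original = normalizeCriterionText("The memo must cite all sources.");
const negated = normalizeCriterionText("The memo must not cite all sources.");
const paraphrase = normalizeCriterionText("The memo shall reference every authority.");

// A material change (negation) collides with the original under this normalization,
// while a rough paraphrase lands on a different key.
const negationCollides = original === negated;
const paraphraseDiffers = original !== paraphrase;
```

Under these assumed rules the negated criterion would share a hash with the original, which is the case a separate `criterion_lineage_id` is meant to catch.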

------

## [GAP] [HIGH] [V3.3.1 §4.5] — Novelty score assumes an undefined metric space

**Finding:** `novelty_score = closest_pattern_distance`, with threshold 0.7, but the metric, normalization, model version, feature vector, and no-pattern behavior are unspecified.

**Why it matters:** A 0.7 threshold is meaningless unless distance is calibrated and bounded.

**Recommendation:** Add `NoveltyMetricSpec` with feature definition, model/version, normalization, calibration dataset, no-pattern fallback, and threshold confidence.

**Reference:** V3.3.1 novelty assessment.

------

## [BUG] [HIGH] [Core §3C] — Task-mode scores are not calibrated enough to protect direct-first behavior

**Finding:** `task_fit_score`, `direct_answer_fit_score`, and `existing_template_match_score` have no ranges, weights, hard-veto precedence, thresholds, or calibration.

**Why it matters:** ELNOR can become too task-happy or too task-avoidant.

**Recommendation:** Add `TaskModeScoringFunction` with features, weights, veto precedence, thresholds, calibration metric ref, and deterministic decision rules after feature extraction.

**Reference:** Core TaskModeDecision and TaskOpportunityPacket.

------

## [BUG] [HIGH] [Core §7.5] — TemplateMatchScore has no aggregation formula

**Finding:** Component scores and `overall_score` are listed, but no formula combines them or applies hard vetoes.

**Why it matters:** Template ranking will diverge across implementations.

**Recommendation:** Add weighted normalized component formula, soft penalties, hard veto cap, default weights, and test vectors.

**Reference:** Core TemplateMatchScore.
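
One way the missing aggregation could look, as a sketch under stated assumptions: component scores in [0, 1], a weighted mean, additive soft penalties, and a hard veto that caps the result. `TemplateMatchInput`, `HARD_VETO_CAP`, and the component names in the usage note are hypothetical.

```ts
// Hypothetical input; Core lists component scores but not this shape.
interface TemplateMatchInput {
  components: Record<string, number>; // each in [0, 1]
  weights: Record<string, number>;    // non-negative; missing keys count as 0
  soft_penalties: number[];           // each subtracted after aggregation
  hard_veto: boolean;
}

// Assumed cap: a vetoed template can never score above this value.
const HARD_VETO_CAP = 0;

function templateOverallScore(input: TemplateMatchInput): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const name of Object.keys(input.components)) {
    const value = input.components[name];
    if (value < 0 || value > 1) throw new RangeError(`component out of [0, 1]: ${name}`);
    const w = input.weights[name] ?? 0;
    weighted += w * value;
    totalWeight += w;
  }
  const base = totalWeight > 0 ? weighted / totalWeight : 0;
  const penalty = input.soft_penalties.reduce((a, b) => a + b, 0);
  const penalized = Math.max(0, base - penalty);
  return input.hard_veto ? Math.min(penalized, HARD_VETO_CAP) : penalized;
}
```

For example, components `{ intent: 1, inputs: 0.5 }` with weights `{ intent: 3, inputs: 1 }` give a weighted mean of 3.5 / 4, and a single soft penalty of 0.125 is then subtracted from that. Pairs like this are the kind of test vector the recommendation asks for.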

------

## [BUG] [HIGH] [V3.3.1 §15] — Several quality metrics misuse numerator/denominator terminology

**Finding:** Some metrics are rates, but others are means or correlations. `avg_revision_cycles_to_convergence` and `planner_confidence_calibration` are not numerator/denominator metrics in the same sense.

**Why it matters:** Dashboards and gates become mathematically misleading.

**Recommendation:** Add `MetricKind = rate | mean | median | correlation | distribution | calibration_curve` and define per-kind fields.

**Reference:** V3.3.1 quality program.

------

## [BUG] [HIGH] [V3.3.1 §15.8] — Sub-agent metrics lack zero-denominator and insufficient-sample handling

**Finding:** Metrics like hit rate and cost per successful recommendation can divide by zero.

**Why it matters:** New or sandboxed sub-agents often have zero accepted or successful recommendations.

**Recommendation:** Add `MetricValue = number | "undefined_insufficient_data"`, sample size, denominator, confidence interval, and zero-denominator policy.

**Reference:** V3.3.1 sub-agent quality metrics.
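
A minimal sketch of the recommended `MetricValue` for a rate metric. The `"undefined_insufficient_data"` sentinel comes from the recommendation; the `reason` field, the `minSample` parameter, and `computeRate` are hypothetical, and the confidence interval is left out for brevity.

```ts
type MetricValue = number | "undefined_insufficient_data";

interface RateMetric {
  value: MetricValue;
  numerator: number;
  denominator: number;
  sample_size: number;
  reason?: "zero_denominator" | "below_min_sample"; // hypothetical diagnostic field
}

// Computes numerator / denominator, refusing to emit a number when the
// denominator is zero or the sample is too small to be meaningful.
function computeRate(numerator: number, denominator: number, minSample = 1): RateMetric {
  const base = { numerator, denominator, sample_size: denominator };
  if (denominator === 0) {
    return { ...base, value: "undefined_insufficient_data", reason: "zero_denominator" };
  }
  if (denominator < minSample) {
    return { ...base, value: "undefined_insufficient_data", reason: "below_min_sample" };
  }
  return { ...base, value: numerator / denominator };
}
```

A new or sandboxed sub-agent with no recommendations yet would report `computeRate(0, 0)`, which yields the sentinel instead of `NaN` or a misleading zero.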

------

## [BUG] [HIGH] [V3.3.1 §11.15] — Cost estimator calibration can divide by zero and ignores tail risk

**Finding:** Calibration uses `actual_cost / estimated_cost`, but does not handle `estimated_cost = 0`; it also ignores P90 overruns and dimension-specific errors.

**Why it matters:** A calibrated estimator can still understate rare high-cost runs.

**Recommendation:** For each cost dimension, report mean absolute percentage error (MAPE), median absolute percentage error, P90 overrun, and bias, and define a sample-size minimum and zero-estimate handling.

**Reference:** V3.3.1 cost estimator confidence.
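
The recommended calibration statistics can be sketched for a single cost dimension as below. The field names, the nearest-rank quantile, the choice to skip zero estimates and count them separately, and the default minimum of three samples are all assumptions.

```ts
interface CostSample {
  estimated_cost: number;
  actual_cost: number;
}

interface CalibrationSummary {
  sample_size: number;
  skipped_zero_estimates: number;
  mape: number;                    // mean absolute percentage error, as a fraction
  median_ape: number;              // median absolute percentage error (nearest rank)
  p90_actual_over_estimate: number; // 90th percentile of actual / estimated
  bias: number;                    // mean signed error; positive means underestimation
}

// Nearest-rank quantile over an ascending array.
function nearestRank(sortedAsc: number[], q: number): number {
  const idx = Math.min(sortedAsc.length - 1, Math.max(0, Math.ceil(q * sortedAsc.length) - 1));
  return sortedAsc[idx];
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;
const sortAsc = (xs: number[]): number[] => [...xs].sort((a, b) => a - b);

function summarizeCalibration(
  samples: CostSample[],
  minSamples = 3,
): CalibrationSummary | "undefined_insufficient_data" {
  // Zero (or negative) estimates cannot be used as a denominator; count and skip them.
  const usable = samples.filter((s) => s.estimated_cost > 0);
  if (usable.length < minSamples) return "undefined_insufficient_data";
  const ratios = usable.map((s) => s.actual_cost / s.estimated_cost);
  const signed = ratios.map((r) => r - 1);
  const ape = signed.map((e) => Math.abs(e));
  return {
    sample_size: usable.length,
    skipped_zero_estimates: samples.length - usable.length,
    mape: mean(ape),
    median_ape: nearestRank(sortAsc(ape), 0.5),
    p90_actual_over_estimate: nearestRank(sortAsc(ratios), 0.9),
    bias: mean(signed),
  };
}
```

Reporting the P90 ratio next to MAPE is what exposes tail risk: an estimator can have a small average error and still understate a rare expensive run by a factor of two.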

------

## [GAP] [HIGH] [System-wide] — Shared `CostEstimate` / `TaskCostRecord` is missing

**Finding:** Source Workspace uses `CostEstimate`; V3.3.1 uses revision estimates; sub-agent metrics include cost; Task Forum uses token counts. There is no shared cost schema.

**Why it matters:** Source research, DOC24 packet assembly, local compute, LLM calls, external tools, and retries cannot reconcile into a task-level cost ledger.

**Recommendation:** Add `CostEstimate` and `TaskCostRecord` to Common Contracts or Core.

**Reference:** SourceQueryRecord uses `result_cost: CostEstimate`; V3.3.1 has cost governance.

------

## [IDEA] [HIGH] [System-wide math/functions] — Add FormulaRegistry with test vectors

**Finding:** Formula-like rules are scattered: QualityIndex, criterion weighting, novelty, task fit, template match, sub-agent metrics, cost calibration, taint aggregation, dependency impact, and route resolution.

**Why it matters:** Every undefined formula becomes invented code.

**Recommendation:** Add `FormulaRegistry` with typed inputs, output range, units, missing-input policy, zero-denominator policy, normalization policy, version, and test vectors.

**Reference:** Cross-cutting.
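
A reduced sketch of what a `FormulaRegistry` entry with executable test vectors might look like. Only a subset of the recommended fields is shown (units and normalization policy are omitted), and `FormulaRegistryEntry`, `verifyFormula`, and the `sub_agent.hit_rate` example are hypothetical.

```ts
interface FormulaTestVector {
  inputs: Record<string, number>;
  expected: number | null; // null models an "undefined" result
}

interface FormulaRegistryEntry {
  formula_id: string;
  version: string;
  input_names: string[];
  output_range: [number, number];
  zero_denominator_policy: "error" | "return_null";
  compute: (inputs: Record<string, number>) => number | null;
  test_vectors: FormulaTestVector[];
}

// Runs every test vector and returns a list of failure messages (empty means all passed).
function verifyFormula(entry: FormulaRegistryEntry, tolerance = 1e-9): string[] {
  const failures: string[] = [];
  entry.test_vectors.forEach((v, i) => {
    const actual = entry.compute(v.inputs);
    const inRange =
      actual === null ||
      (actual >= entry.output_range[0] && actual <= entry.output_range[1]);
    const matches =
      actual === null || v.expected === null
        ? actual === v.expected
        : Math.abs(actual - v.expected) <= tolerance;
    if (!inRange || !matches) {
      failures.push(`${entry.formula_id}@${entry.version} vector ${i}: got ${actual}`);
    }
  });
  return failures;
}

// Illustrative entry; the formula id and inputs are assumptions.
const HIT_RATE: FormulaRegistryEntry = {
  formula_id: "sub_agent.hit_rate",
  version: "1.0",
  input_names: ["accepted", "total"],
  output_range: [0, 1],
  zero_denominator_policy: "return_null",
  compute: ({ accepted, total }) => (total === 0 ? null : accepted / total),
  test_vectors: [
    { inputs: { accepted: 1, total: 4 }, expected: 0.25 },
    { inputs: { accepted: 0, total: 0 }, expected: null },
  ],
};
```

Keeping the vectors next to the formula means two implementations can be checked against the same expected values instead of each inventing its own reading of the prose.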

------

# Source Workspace findings

## [BUG] [HIGH] [Source Workspace §0.3 / V3.3.1 §12] — `TaskSourceWorkspace` and V3.3 `SourceWorkspace` identity is split

**Finding:** Source Workspace says both documents consume the same underlying `TaskSourceWorkspace`, but V3.3.1’s workspace split uses a narrower revision/candidate-oriented `SourceWorkspace` concept.

**Why it matters:** Implementers may create two workspaces or merge fields ad hoc.

**Recommendation:** Make `TaskSourceWorkspace` canonical and define `RevisionSourceWorkspaceView` for V3.3.1.

**Reference:** Source Workspace §0.3, §2; V3.3.1 §12.

------

## [BUG] [MEDIUM] [Source Workspace §3 / §4] — Tier 0 contradicts `SourceRecord.tier`

**Finding:** Tier 0 is “tool receipt only,” but `SourceRecord.tier` allows `0`.

**Why it matters:** Either Tier 0 creates no SourceRecord, or it creates a minimal SourceRecord. Current text says both.

**Recommendation:** Remove `0` from `SourceRecord.tier`; Tier 0 should be `ToolReceipt` only.

**Reference:** Source Workspace tier table and SourceRecord.

------

## [GAP] [MEDIUM] [Source Workspace §3.3] — Source tier transitions are defined but not stored

**Finding:** `SourceTierTransition` exists but `TaskSourceWorkspace` has no transition log field.

**Why it matters:** The spec says transitions are recorded but gives no storage location.

**Recommendation:** Add `source_tier_transition_records: SourceTierTransition[]` or route transitions through telemetry with a canonical workspace index.

**Reference:** Source Workspace §3.3.

------

## [BUG] [HIGH] [Source Workspace §6.2] — `ResearchNeed.run_id` conflicts with task-scoped workspaces

**Finding:** `TaskSourceWorkspace.run_id` is optional for task-scoped workspaces, but `ResearchNeed.run_id` is required.

**Why it matters:** Persistent task-scoped research needs cannot be represented cleanly.

**Recommendation:** Make `ResearchNeed.run_id?: string` and add `need_scope = "run" | "task" | "workspace"`.

**Reference:** Source Workspace `TaskSourceWorkspace` and `ResearchNeed`.

------

## [BUG] [MEDIUM] [Source Workspace §6.2] — Source refs and target refs use wrong reference types

**Finding:** `ResearchNeed.source_refs` is typed as `StorageRef[]` but described as SourceRecords produced; `target_claim_or_section_ref` is typed as `StorageRef` but described as typically `ArtifactScopeRef`.

**Why it matters:** Joins between research needs, source records, claims, and artifact scopes break.

**Recommendation:** Use explicit `SourceRecordRef[]` and `ArtifactScopeRef | ClaimRef`.

**Reference:** Source Workspace §6.2.

------

## [BUG] [MEDIUM] [Source Workspace §2.3 / §4.1] — Verification states do not align

**Finding:** `SourceVerificationResult.verification_state` uses `partial`, while `SourceRecord.verification_state` uses `partially_verified`.

**Why it matters:** Rollup from verification results to SourceRecord state is undefined.

**Recommendation:** Standardize enum or define mapping: `partial → partially_verified`.

**Reference:** Source Workspace verification schemas.

------

## [GAP] [HIGH] [Source Workspace §2 / Governance] — Workspace taint aggregation is undefined

**Finding:** Workspace has `current_taint_class`; records have taint and privilege fields; no aggregation lattice is defined.

**Why it matters:** One adversarial or privileged source can change workspace prompt availability and cross-matter safety.

**Recommendation:** Add `TaintAggregationPolicy`: max-taint lattice, any-privileged propagation, matter-scope union/error rule, policy override references.

**Reference:** Source Workspace taint fields.
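
A sketch of the recommended `TaintAggregationPolicy`, assuming the taint classes form a simple ordered chain. The three class names, `RecordGovernance`, `WorkspaceGovernance`, and the `allowMultipleMatters` switch are hypothetical; the spec's actual taint values are not listed in this finding.

```ts
// Assumed ordering from least to most tainted.
const TAINT_ORDER = ["clean", "unverified", "adversarial"] as const;
type TaintClass = (typeof TAINT_ORDER)[number];

interface RecordGovernance {
  taint_class: TaintClass;
  privileged: boolean;
  matter_id?: string;
}

interface WorkspaceGovernance {
  current_taint_class: TaintClass;
  privileged: boolean;
  matter_ids: string[];
}

function aggregateGovernance(
  records: RecordGovernance[],
  allowMultipleMatters = false,
): WorkspaceGovernance {
  let maxRank = 0;
  let privileged = false;
  const matters: string[] = [];
  for (const r of records) {
    // Max-taint rule: the workspace is as tainted as its most tainted record.
    maxRank = Math.max(maxRank, TAINT_ORDER.indexOf(r.taint_class));
    // Any-privileged rule: one privileged record marks the whole workspace.
    privileged = privileged || r.privileged;
    if (r.matter_id !== undefined && matters.indexOf(r.matter_id) === -1) {
      matters.push(r.matter_id);
    }
  }
  // Matter-scope rule: mixing matters is an error unless a policy override allows the union.
  if (matters.length > 1 && !allowMultipleMatters) {
    throw new Error(`cross-matter workspace: ${matters.join(", ")}`);
  }
  return { current_taint_class: TAINT_ORDER[maxRank], privileged, matter_ids: matters };
}
```

A production version would also carry the policy override references named in the recommendation, so that an allowed cross-matter union is traceable to a decision.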

------

## [GAP] [MEDIUM] [Source Workspace §4] — Source records need first-class evidence anchors

**Finding:** Source records have summaries, key points, and applicability scopes, but not required page/quote/section/timestamp anchors linked to claims.

**Why it matters:** A professional must verify claim support, not just inspect source summaries.

**Recommendation:** Add `SourceEvidenceAnchor` with anchor kind, anchor ref, supported claim refs, support strength, and contradiction flag.

**Reference:** Source Workspace SourceRecord and ApplicabilityScope.

------

## [GAP] [HIGH] [Source Workspace §6.4] — ResearchNeed routing needs leases and idempotency

**Finding:** Multiple modules can route or answer the same ResearchNeed, but no lease/lock/idempotency model exists.

**Why it matters:** Duplicate expensive research and status races are likely.

**Recommendation:** Add `ResearchNeedLease` with holder, expires_at, lease_version, idempotency_key, and status precondition.

**Reference:** Source Workspace ResearchNeed routing.
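
A minimal in-memory sketch of the recommended `ResearchNeedLease`. The lease fields follow the recommendation; `ResearchNeedLeaseTable`, its `acquire` signature, and the use of epoch milliseconds are assumptions, and the status precondition is omitted to keep the example short.

```ts
interface ResearchNeedLease {
  need_id: string;
  holder: string;
  expires_at: number; // epoch milliseconds (assumed unit)
  lease_version: number;
  idempotency_key: string;
}

// Hypothetical single-process lease table; a real system would need a durable store.
class ResearchNeedLeaseTable {
  private leases: Record<string, ResearchNeedLease> = {};

  // Returns the lease on success, or null when another request holds a live lease.
  acquire(
    needId: string,
    holder: string,
    idempotencyKey: string,
    now: number,
    ttlMs: number,
  ): ResearchNeedLease | null {
    const current: ResearchNeedLease | undefined = this.leases[needId];
    if (current !== undefined && current.expires_at > now) {
      // A retry carrying the same idempotency key gets the existing lease back.
      return current.idempotency_key === idempotencyKey ? current : null;
    }
    const lease: ResearchNeedLease = {
      need_id: needId,
      holder,
      expires_at: now + ttlMs,
      // The version increases on every takeover so stale writers can be rejected.
      lease_version: (current?.lease_version ?? 0) + 1,
      idempotency_key: idempotencyKey,
    };
    this.leases[needId] = lease;
    return lease;
  }
}
```

With this shape, a second module asking for the same need while the lease is live gets `null` and can skip the duplicate research, while an expired lease is taken over with an incremented `lease_version`.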

------

## [GAP] [MEDIUM] [Source Workspace §6.3] — Tool receipt as “material research” record is too weak

**Finding:** Material research can be recorded as a universal tool receipt.

**Why it matters:** Tool receipts may not be queryable or digestible by downstream modules.

**Recommendation:** Material research affecting output claims must produce at least `SourceQueryRecord` plus `SourceRecord`, `ResearchAnswer`, or `SourceVerificationResult`. Tool receipt alone is acceptable only for non-material diagnostics.

**Reference:** Source Workspace hidden-research rule.

------

## [BUG] [MEDIUM] [Source Workspace §5] — `domain_payload` needs registry/versioning

**Finding:** Domain payloads are open `Record<string, unknown>`, but examples imply structured legal, securities, and regulatory payloads.

**Why it matters:** Consumers cannot validate or safely use domain-specific fields.

**Recommendation:** Add `domain_payload_kind`, `domain_payload_schema_ref`, and `domain_payload_version`.

**Reference:** Source Workspace domain payload examples.

------

# Task Forum + Run Board findings

## [BUG] [HIGH] [Task Forum §5.2] — `selected_modules` visibility cannot represent selected modules

**Finding:** Visibility supports `selected_modules`, but schema only has singular `target_module_id` and `target_segment_id`.

**Why it matters:** A post visible to modules A and B but not C cannot be represented.

**Recommendation:** Replace singular target fields with `visibility_target_refs[]`.

**Reference:** TaskRunBoardPost schema and visibility resolution.

------

## [BUG] [MEDIUM] [Task Forum §5.3] — Supersession is described but not modeled

**Finding:** Posts are append-only and may be superseded, but schema lacks lifecycle and supersession refs.

**Why it matters:** UI and digest generation cannot filter stale posts.

**Recommendation:** Add `lifecycle_state`, `supersedes_post_ids`, `superseded_by_post_id`, and `supersession_reason`.

**Reference:** TaskRunBoardPost creation discipline.

------

## [RISK] [HIGH] [Task Forum §1 / §5] — Passive auto-publication needs privacy and volume controls

**Finding:** Every system event auto-publishes to the Run Board, while governance is only `privileged?: boolean` and `matter_id?: string`.

**Why it matters:** Metadata can leak privileged information, and high-volume runs can flood the board.

**Recommendation:** Add `RunBoardRetentionPolicy`, `RunBoardCompactionPolicy`, `RunBoardEventClass`, and full `RunBoardGovernanceEnvelope`.

**Reference:** Passive board always-on and post schema.

------

## [BUG] [HIGH] [Task Forum §5.2] — Board posts need full governance envelopes

**Finding:** Board posts only carry optional privilege and matter fields.

**Why it matters:** Board posts can contain tainted findings, sensitive source identities, or policy-restricted summaries.

**Recommendation:** Add `data_class`, `taint_class`, `policy_decision_refs`, `sanitization_required`, `governance_class`, and digest filtering by these fields.

**Reference:** TaskRunBoardPost schema.

------

## [GAP] [HIGH] [Task Forum §6.3 / §6.4] — TaskRunContextPacket needs request, receipt, freshness, and omission manifest

**Finding:** DOC24 assembles TaskRunContextPacket and caps it at `max_tokens`, but the packet has no request schema, actual token count, content hash, omissions list, expiry, or invalidation triggers.

**Why it matters:** A module can consume stale or incomplete context after a blocking finding or source update.

**Recommendation:** Add `TaskRunContextPacketRequest`, `TaskRunContextPacketAssemblyReceipt`, `packet_content_hash`, `actual_token_count`, `omitted_items[]`, `valid_until_event_seq`, and `invalidated_by_event_kinds[]`. Blockers should be non-droppable.

**Reference:** Task Forum §6.3–§6.4.
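
The freshness fields in the recommendation imply a staleness check along these lines. `valid_until_event_seq` and `invalidated_by_event_kinds` come from the recommendation; `assembled_at_event_seq`, `RunEvent`, `isPacketStale`, and the event kind names used in the note are hypothetical.

```ts
interface PacketFreshness {
  packet_content_hash: string;
  assembled_at_event_seq: number; // hypothetical: last event seen at assembly time
  valid_until_event_seq: number;
  invalidated_by_event_kinds: string[];
}

interface RunEvent {
  seq: number;
  kind: string;
}

// A packet is stale once any event after assembly either passes the validity
// horizon or has a kind that the packet declared as invalidating.
function isPacketStale(packet: PacketFreshness, events: RunEvent[]): boolean {
  return events.some(
    (e) =>
      e.seq > packet.assembled_at_event_seq &&
      (e.seq > packet.valid_until_event_seq ||
        packet.invalidated_by_event_kinds.indexOf(e.kind) !== -1),
  );
}
```

A consuming module would run this check before using a cached packet and request reassembly when it returns true, for instance after a `blocking_finding` or `source_updated` event.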

------

## [BUG] [MEDIUM] [Task Forum §6.2 / §6.3] — Digest and packet audience enums diverge

**Finding:** `BoardDigest.generated_for` includes `drafting_module`; `TaskRunContextPacket.audience` uses `authoring_module`.

**Why it matters:** DOC24 cannot reliably map digest audience to packet audience.

**Recommendation:** Create one shared `TaskContextAudience` enum.

**Reference:** BoardDigest and TaskRunContextPacket schemas.

------

## [BUG] [HIGH] [Task Forum §7.2] — ModuleAssistanceRequest lacks endpoint, lease, timeout, and answer schema

**Finding:** Assistance requests target source workspace, research module, evaluator, Task Agent, human, specific module, or forum, but do not specify target endpoint, input port, lease, timeout, or answer schema.

**Why it matters:** `pause_until_answer` can deadlock; route-to-instruction can target the wrong port.

**Recommendation:** Add `target_endpoint_ref`, `target_input_port_id`, `answer_schema_ref`, `lease`, `timeout_ms`, `on_timeout`, and route-resolver policy.

**Reference:** Task Forum §7.2.

------

## [BUG] [MEDIUM] [Task Forum §3.2] — Participant policy only models modules

**Finding:** `ForumParticipantPolicy` has `module_id`, but participants include users, Task Agent, domain moderators, sub-agents, and system actors.

**Why it matters:** Non-module participants must pretend to be modules.

**Recommendation:** Replace with `participant_ref` and `participant_kind`.

**Reference:** Task Forum participant policy.

------

## [BUG] [MEDIUM] [Task Forum §4.6] — Moderator agent requirement condition is incoherent

**Finding:** `moderator_agent_ref` is commented as required if `mode != "none" && "human_moderator"`.

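**Why it matters:** The condition is syntactically and semantically wrong. Its right-hand operand is a bare string literal rather than a comparison, so under JavaScript-style truthiness it always evaluates as true and the whole condition collapses to `mode != "none"`.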

**Recommendation:** Use explicit conditional fields: `moderator_agent_ref` for domain/task-agent modes, `moderator_human_ref` for human moderator, `router_rules_ref` for deterministic router.

**Reference:** Task Forum moderator policy.
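
The explicit conditional fields in the recommendation map naturally onto a discriminated union, sketched below. Only `"none"` and `"human_moderator"` appear in the quoted condition; the other mode literals (`"domain_moderator"`, `"task_agent"`, `"deterministic_router"`) and the `moderatorRef` helper are hypothetical names.

```ts
// Each mode carries exactly the reference it needs, so no comment-level
// "required if" condition is necessary.
type ForumModeratorPolicy =
  | { mode: "none" }
  | { mode: "human_moderator"; moderator_human_ref: string }
  | { mode: "domain_moderator" | "task_agent"; moderator_agent_ref: string }
  | { mode: "deterministic_router"; router_rules_ref: string };

// Returns the reference that identifies who or what moderates, or null for "none".
function moderatorRef(policy: ForumModeratorPolicy): string | null {
  switch (policy.mode) {
    case "none":
      return null;
    case "human_moderator":
      return policy.moderator_human_ref;
    case "domain_moderator":
    case "task_agent":
      return policy.moderator_agent_ref;
    case "deterministic_router":
      return policy.router_rules_ref;
  }
}
```

With this shape, a policy such as `{ mode: "human_moderator", moderator_agent_ref: "..." }` is rejected by the type checker instead of being caught by a prose rule.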

------

## [GAP] [MEDIUM] [Task Forum §3.1] — `decision_out` and `signal_out` payload schemas are missing

**Finding:** Forum outputs include `decision_out` and `signal_out`, but no `ForumDecision` or `ForumSignal` schemas exist.

**Why it matters:** Plan approvals, moderator decisions, and learning signals become untyped.

**Recommendation:** Add `ForumDecision` and `ForumSignal` schemas.

**Reference:** Task Forum module ports.

------

# Core R0.7.1 task-mode findings

## [GAP] [CRITICAL] [Core §3D / DOC24] — InjectionSlotRegistry gap must be closed now

**Finding:** Core defines `TaskOpportunityPacket` and token budgets but not DOC24 InjectionSlotRegistry entries: slot IDs, slot kinds, rendering constraints, required-per-packet matrix, redaction rules, or receipts. The red-team prompt flags this as a known open item.

**Why it matters:** DOC24 cannot implement runtime packet lanes without guessing.

**Recommendation:** Add Core §3D.6 `TaskSystemInjectionSlotRegistration` with slots for ambient card, task opportunity packet, Task Agent design packet, top-k invocation directive cards, template cards, and module preset cards.

**Reference:** Core §3D; prompt §4.9.

------

## [GAP] [HIGH] [Core §3D] — Compact top-k card schemas are missing

**Finding:** `TaskOpportunityPacket` contains `CompactTaskInvocationDirectiveCard[]`, `CompactTaskTemplateCard[]`, and `CompactModulePresetCard[]`, but Core does not define their compact schemas.

**Why it matters:** DOC24 cannot assemble the packet without card shape, scores, redaction, token estimates, and source authority.

**Recommendation:** Add all three compact card schemas.

**Reference:** Core `TaskOpportunityPacket`.

------

## [BUG] [HIGH] [Core §4B] — Core defines a DOC24-owned capability registry entry normatively

**Finding:** Core defines `TaskAgentCapabilityRegistryEntry` normatively, but the entry's own `owner_doc` field says `"DOC24"`.

**Why it matters:** Core and DOC24 may both own the same schema.

**Recommendation:** Core should define only the required semantic contract and mark the schema with `[XDOC-INSERT: DOC24]`; DOC24 should own the canonical capability registry schema.

**Reference:** Core Task Agent capability registry.

------

## [GAP] [HIGH] [Core §4.6 / §6 / §7] — Undefined imported types need registry

**Finding:** Core schemas reference many types not defined locally or imported via explicit owner registry: `TaskRunStatus`, `TaskDefinitionStatus`, `GraphProgressSummary`, `InputContract`, `OutputContract`, `QualityCriterion`, `CapabilityNeed`, and others.

**Why it matters:** Coding agents must guess owners and versions.

**Recommendation:** Add `CoreTypeImportRegistry`.

**Reference:** Core TaskAgentRunLens, Blueprint, template/preset schemas.

------

## [GAP] [HIGH] [Core §5A] — TrackedTaskReceipt booleans need command refs

**Finding:** `TrackedTaskReceipt` has `can_cancel`, `can_pause`, `can_edit`, and `can_convert_to_template`, but no command refs or failure modes.

**Why it matters:** This risks phantom UI controls.

**Recommendation:** Replace booleans with `available_actions[]` including command_ref, enabled, disabled_reason, and idempotency requirement.

**Reference:** Core visible saved-task receipt.

------

## [GAP] [HIGH] [Core routes/commands] — Command registry appendix should be mandatory

**Finding:** Core describes many routes and actions but does not uniformly specify their request schemas, response schemas, idempotency keys, durable writes, telemetry events, or read-model invalidations.

**Why it matters:** Coding agents will invent APIs.

**Recommendation:** Add `TaskCommandRegistry` for every user-visible action introduced by Core and sub-addenda.

**Reference:** Core no-phantom UI principle.

------

## [GAP] [HIGH] [Core §24A / Status] — “Build-ready” conflicts with pending cross-doc obligations

**Finding:** Core labels itself build-ready, but several target-doc updates remain pending: DOC24, DOC15, DOC20, DOC23 R3.2, DOC8/BDSM, DOC72, EC Core, Addenda A.

**Why it matters:** A coding agent handed only the six docs cannot implement without guessing or parking behavior.

**Recommendation:** Change status to “architecturally complete; contract-hardening required before build,” and add a build-readiness matrix: must-land-before-build, degraded-compatible, optional, deferred.

**Reference:** Core frontmatter and cross-doc obligations.

------

# Sub-agent dispatch findings

## [BUG] [HIGH] [V3.3.1 §8.4 / §8.5] — `output_contract_ref` singular conflicts with variant semantics

**Finding:** `AdvisorySubAgentProfile.output_contract_ref` is singular, while prose says profiles declare which variants the sub-agent may emit.

**Why it matters:** A sub-agent emitting multiple validated output kinds cannot be represented.

**Recommendation:** Change to `output_contract_refs: SchemaRef[]` and `accepted_output_variant_kinds: AdvisorySubAgentOutputKind[]`.

**Reference:** V3.3.1 advisory sub-agent profile.

------

## [GAP] [HIGH] [V3.3.1 §8 / DOC24 Boundary] — No-sub-agent fallback policy is incomplete

**Finding:** Sub-agents are optional/advisory in many flows, but no fallback matrix exists for Outcome Compiler, Evaluator lanes, Revision Compiler, and Feedback Interpreter.

**Why it matters:** A missing sub-agent can silently downgrade quality or block execution unpredictably.

**Recommendation:** Add `SubAgentFallbackPolicy` per invocation point: proceed_without, use_generalist, downgrade_assurance, ask_human, block_evaluation, or graph_patch_required.

**Reference:** Prompt §4.7; V3.3.1 sub-agent contract.
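
A sketch of the fallback matrix the recommendation describes. The policy values and the four invocation points come from the finding; the specific assignments in `FALLBACK_MATRIX`, the `FallbackDecision` shape, and `resolveFallback` are assumptions chosen only to show the mechanism.

```ts
type SubAgentFallbackPolicy =
  | "proceed_without"
  | "use_generalist"
  | "downgrade_assurance"
  | "ask_human"
  | "block_evaluation"
  | "graph_patch_required";

type InvocationPoint =
  | "outcome_compiler"
  | "evaluator_lane"
  | "revision_compiler"
  | "feedback_interpreter";

// Assumed assignments for illustration; the spec would declare the real matrix.
const FALLBACK_MATRIX: Record<InvocationPoint, SubAgentFallbackPolicy> = {
  outcome_compiler: "use_generalist",
  evaluator_lane: "downgrade_assurance",
  revision_compiler: "proceed_without",
  feedback_interpreter: "ask_human",
};

interface FallbackDecision {
  policy: SubAgentFallbackPolicy;
  may_proceed: boolean;
  must_record_downgrade: boolean; // hypothetical: forces a visible record when continuing
}

function resolveFallback(point: InvocationPoint): FallbackDecision {
  const policy = FALLBACK_MATRIX[point];
  const blocking =
    policy === "ask_human" ||
    policy === "block_evaluation" ||
    policy === "graph_patch_required";
  // Whenever execution continues without the sub-agent, the downgrade must be recorded.
  return { policy, may_proceed: !blocking, must_record_downgrade: !blocking };
}
```

Typing the matrix as `Record<InvocationPoint, ...>` means a newly added invocation point fails to compile until a fallback is chosen for it, which addresses the "silently downgrade" risk at the schema level.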

------

## [GAP] [MEDIUM] [Sub-agent learning] — Reputation should slice by outcome/task context and model class

**Finding:** Sub-agent quality metrics exist, but context-conditioned reputation is under-specified.

**Why it matters:** A sub-agent can be excellent for one citation/source type and poor for another.

**Recommendation:** Slice reputation by `(sub_agent_id, outcome_cluster_or_task_signature, model_class, source_kind, failure_kind)`, with insufficient-data handling and TIE review.

**Reference:** The Claude review flags sub-agent dispatch and model-class mismatches as system-wide TIE targets.

------

# Taint, privilege, and governance findings

## [GAP] [HIGH] [Common Contracts / Feedback / Forum / Source] — Taint propagation into feedback and forum objects is incomplete

**Finding:** Envelopes have governance fields, but FeedbackBundle, RunGuidance, RepairInstruction, Forum posts, BoardDigest, and TaskRunContextPacket do not all carry full taint/access governance.

**Why it matters:** A tainted finding can become prompt-visible guidance or board digest without clearance.

**Recommendation:** Every feedback, board, digest, context, and guidance object that can influence prompts must carry `taint_class`, `data_class`, `matter_id`, `privileged`, `policy_decision_refs`, and `sanitization_required`.

**Reference:** Common Contracts safety slice; Feedback and Forum schemas.

------

## [GAP] [MEDIUM] [Source Workspace §9 / Retention] — Persistence policy should reference EC policy decisions

**Finding:** Source Workspace persistence fields include DOC72 summary and DOC73 promotion flags, but retention/promotion are not explicitly bound to policy decisions.

**Why it matters:** Source records can contain privileged or matter-scoped data.

**Recommendation:** Add `retention_policy_ref`, `policy_decision_ref`, and `promotion_gate_ref` on workspace and source records.

**Reference:** Source Workspace persistence policy.

------

## [GAP] [HIGH] [Learning signals] — Multi-user forward-compatibility fields should be added now

**Finding:** Learning signal envelope has `data_class` and `matter_id`, but lacks `principal_id`, learning scope, share eligibility, and scope inference basis.

**Why it matters:** Retrofitting team/firm/networked learning scope later is dangerous, especially with privilege and matter data.

**Recommendation:** Add `principal_id`, `learning_scope`, `scope_inference_basis`, `default_scope_rule`, and `share_eligibility` to learning artifacts now. Claude’s self-learning review separately identified this as a forward-compatibility requirement.

**Reference:** Common Contracts learning envelope; Claude self-learning analysis.

------

# Self-learning surface — flagged, not deep-dived

## [SELF-LEARNING] [HIGH] [Claude analysis / Addenda B learning seam] — Add Task Improvement Engineer, but keep it outside this runtime hardening pass

**Finding:** Claude correctly identifies TIE as the missing LLM-reasoning layer over diagnostic data: it turns accumulated signals into recommendations for rubrics, OutcomeDefinitions, configurations, strategy selection, sub-agent dispatch, topology, schemas, and code.

**Why it matters:** Without TIE, the system accumulates signals but lacks a structured reasoning agent that proposes improvements beyond DSPy and passive pattern emergence.

**Recommendation:** Add TIE schemas and lifecycle hooks as future-facing learning substrate, but do not let TIE become a hot-path runtime decision-maker.

**Reference:** Claude self-learning review.

------

## [SELF-LEARNING] [HIGH] [Claude analysis] — Add LoopEffectivenessTestRunRecord

**Finding:** Claude correctly identifies the missing measurement loop: baseline Judge score vs Evaluator→Revisor→Judge revised score.

**Why it matters:** Without loop-effectiveness testing, the revision system cannot prove it improves outputs.

**Recommendation:** Add `LoopEffectivenessTestRunRecord` and feed it to dashboards, TIE, DSPy training, and trust calibration.

**Reference:** Claude review.

------

## [SELF-LEARNING] [HIGH] [Claude analysis] — BDSM utility-compilation gap remains a blocker

**Finding:** Claude correctly notes that adding more signal types does not help if BDSM’s utility compilation rules remain unspecified.

**Why it matters:** Signals without compilation rules become inert telemetry.

**Recommendation:** Add BDSM utility-compilation spec before relying on learning signals for runtime guidance.

**Reference:** Claude review.

------

## [SELF-LEARNING] [MEDIUM] [Claude analysis] — Revealed-preference loops need dampening

**Finding:** Edit-trace learning can collapse into parroting if user edits become overly strong priors.

**Why it matters:** The system can become sycophantic while appearing personalized.

**Recommendation:** Add time decay, diversity injection, drift detection, active disagreement signals, and TIE oversight.

**Reference:** Claude review.

------

## [SELF-LEARNING] [MEDIUM] [Claude analysis] — Outcome clustering may fail at single-user scale

**Finding:** HDBSCAN-like clustering can fragment or collapse with sparse single-user data.

**Why it matters:** Cluster IDs may be unstable or meaningless.

**Recommendation:** Consider k-NN retrieval or hierarchical memory architecture instead of hard cluster IDs; if clustering remains, persist durable cluster definitions and admission rules.

**Reference:** Claude review.

------

# Paste-ready amendment package for the next draft

The next draft should include these appendices or equivalent sections. This is the concrete “what to put into the spec” package.

------

## Appendix A — TypeOwnerRegistry

```ts
type OwnerDoc =
  | "DOC23_PARENT"
  | "DOC23_ADDENDA_A"
  | "DOC23_ADDENDA_B_CORE"
  | "DOC23_ADDB_V3_3_1"
  | "DOC23_EVALUATION_COMMON_CONTRACTS"
  | "DOC23_ADDB_FEEDBACK_DELIVERY"
  | "DOC23_ADDB_SOURCE_WORKSPACE"
  | "DOC23_ADDB_TASK_FORUM_RUN_BOARD"
  | "DOC24"
  | "DOC15"
  | "DOC20"
  | "DOC72"
  | "DOC8_BDSM"
  | "EC_CORE"
  | "DOC11_OPENCLAW";

type TypeOwnershipStatus =
  | "canonical"
  | "projection"
  | "imported"
  | "deprecated_alias"
  | "pending_absorption";

interface TypeOwnerRegistryEntry {
  type_name: string;
  canonical_owner_doc: OwnerDoc;
  canonical_owner_section: string;
  schema_version: string;
  status: TypeOwnershipStatus;
  local_aliases?: string[];
  projection_types?: string[];
  consumed_by_docs: OwnerDoc[];
  validation_code_on_drift: string;
  notes?: string;
}

const ADDENDA_B_TYPE_OWNER_REGISTRY: TypeOwnerRegistryEntry[] = [
  {
    type_name: "EvaluationResultEnvelope",
    canonical_owner_doc: "DOC23_EVALUATION_COMMON_CONTRACTS",
    canonical_owner_section: "§3",
    schema_version: "1.0",
    status: "canonical",
    consumed_by_docs: [
      "DOC23_ADDB_V3_3_1",
      "DOC23_ADDB_FEEDBACK_DELIVERY",
      "DOC23_ADDB_TASK_FORUM_RUN_BOARD",
      "DOC20",
      "DOC8_BDSM"
    ],
    validation_code_on_drift: "validation.type_owner_drift.evaluation_result_envelope"
  },
  {
    type_name: "EvaluationFinding",
    canonical_owner_doc: "DOC23_EVALUATION_COMMON_CONTRACTS",
    canonical_owner_section: "§4.2A",
    schema_version: "2.0",
    status: "canonical",
    projection_types: ["FeedbackFindingView"],
    consumed_by_docs: [
      "DOC23_ADDB_V3_3_1",
      "DOC23_ADDB_FEEDBACK_DELIVERY",
      "DOC23_ADDB_TASK_FORUM_RUN_BOARD",
      "DOC8_BDSM"
    ],
    validation_code_on_drift: "validation.type_owner_drift.evaluation_finding"
  },
  {
    type_name: "FeedbackFindingView",
    canonical_owner_doc: "DOC23_ADDB_FEEDBACK_DELIVERY",
    canonical_owner_section: "§3",
    schema_version: "1.1",
    status: "projection",
    consumed_by_docs: ["DOC23_ADDB_FEEDBACK_DELIVERY", "DOC23_ADDB_TASK_FORUM_RUN_BOARD"],
    validation_code_on_drift: "validation.type_owner_drift.feedback_finding_view"
  },
  {
    type_name: "OutcomeRepairInstruction",
    canonical_owner_doc: "DOC23_ADDB_FEEDBACK_DELIVERY",
    canonical_owner_section: "§5",
    schema_version: "1.1",
    status: "canonical",
    consumed_by_docs: ["DOC23_ADDB_V3_3_1", "DOC23_ADDB_TASK_FORUM_RUN_BOARD"],
    validation_code_on_drift: "validation.type_owner_drift.outcome_repair_instruction"
  },
  {
    type_name: "ResearchNeed",
    canonical_owner_doc: "DOC23_ADDB_SOURCE_WORKSPACE",
    canonical_owner_section: "§6",
    schema_version: "1.1",
    status: "canonical",
    consumed_by_docs: [
      "DOC23_ADDB_FEEDBACK_DELIVERY",
      "DOC23_ADDB_TASK_FORUM_RUN_BOARD",
      "DOC23_ADDB_V3_3_1"
    ],
    validation_code_on_drift: "validation.type_owner_drift.research_need"
  },
  {
    type_name: "TaskSourceWorkspace",
    canonical_owner_doc: "DOC23_ADDB_SOURCE_WORKSPACE",
    canonical_owner_section: "§2",
    schema_version: "1.1",
    status: "canonical",
    projection_types: ["RevisionSourceWorkspaceView"],
    consumed_by_docs: ["DOC23_ADDB_V3_3_1", "DOC23_ADDB_TASK_FORUM_RUN_BOARD"],
    validation_code_on_drift: "validation.type_owner_drift.task_source_workspace"
  },
  {
    type_name: "CostEstimate",
    canonical_owner_doc: "DOC23_EVALUATION_COMMON_CONTRACTS",
    canonical_owner_section: "§9A",
    schema_version: "1.0",
    status: "canonical",
    consumed_by_docs: [
      "DOC23_ADDB_V3_3_1",
      "DOC23_ADDB_SOURCE_WORKSPACE",
      "DOC23_ADDB_TASK_FORUM_RUN_BOARD",
      "DOC23_ADDENDA_B_CORE"
    ],
    validation_code_on_drift: "validation.type_owner_drift.cost_estimate"
  }
];
```

------

## Appendix B — Canonical state semantics matrix

```ts
type EvaluationVerdict =
  | "passed"
  | "failed"
  | "indeterminate"
  | "not_applicable";

type OutcomeEvaluationRuntimeState =
  | "pending"
  | "pending_dependency"
  | "evaluating"
  | "dirty";

type OutcomeEvaluationDisposition =
  | "satisfied"
  | "needs_revision"
  | "needs_information"
  | "needs_verification"
  | "needs_human_judgment"
  | "unable_to_evaluate"
  | "blocked_by_policy"
  | "regressed"
  | "upstream_failure"
  | "unrecoverable"
  | "superseded"
  | "max_iterations_reached";

type OutcomeEvaluationState =
  | OutcomeEvaluationRuntimeState
  | OutcomeEvaluationDisposition;

type FeedbackBranch =
  | "on_satisfied"
  | "on_needs_revision"
  | "on_needs_more_sources"
  | "on_needs_source_verification"
  | "on_needs_human_judgment"
  | "on_blocked_by_policy"
  | "on_upstream_failure"
  | "on_unrecoverable"
  | "on_repeated_failure"
  | "none";

interface OutcomeStateSemantics {
  state: OutcomeEvaluationState;
  state_class: "runtime" | "disposition";
  persisted_in_outcome_runtime_state: boolean;
  emitted_to_evaluation_envelope: boolean;
  terminal: boolean;
  verdict_mapping?: EvaluationVerdict;
  feedback_branch: FeedbackBranch;
  blocks_downstream_default: boolean;
  revisor_default_action:
    | "none"
    | "route_to_revisor"
    | "request_information"
    | "request_verification"
    | "request_human_judgment"
    | "abort_or_escalate"
    | "mark_superseded";
  ui_label:
    | "Pending"
    | "Waiting on dependency"
    | "Evaluating"
    | "Satisfied"
    | "Needs revision"
    | "Needs information"
    | "Needs verification"
    | "Needs human judgment"
    | "Unable to evaluate"
    | "Blocked by policy"
    | "Regressed"
    | "Upstream failure"
    | "Unrecoverable"
    | "Superseded"
    | "Max iterations reached"
    | "Dirty";
  learning_eligibility:
    | "eligible_positive"
    | "eligible_negative"
    | "eligible_diagnostic_only"
    | "ineligible_transient";
}

const OUTCOME_STATE_MATRIX: OutcomeStateSemantics[] = [
  {
    state: "pending",
    state_class: "runtime",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: false,
    terminal: false,
    feedback_branch: "none",
    blocks_downstream_default: true,
    revisor_default_action: "none",
    ui_label: "Pending",
    learning_eligibility: "ineligible_transient"
  },
  {
    state: "pending_dependency",
    state_class: "runtime",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: false,
    terminal: false,
    feedback_branch: "none",
    blocks_downstream_default: true,
    revisor_default_action: "none",
    ui_label: "Waiting on dependency",
    learning_eligibility: "eligible_diagnostic_only"
  },
  {
    state: "evaluating",
    state_class: "runtime",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: false,
    terminal: false,
    feedback_branch: "none",
    blocks_downstream_default: true,
    revisor_default_action: "none",
    ui_label: "Evaluating",
    learning_eligibility: "ineligible_transient"
  },
  {
    state: "dirty",
    state_class: "runtime",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: false,
    terminal: false,
    feedback_branch: "none",
    blocks_downstream_default: true,
    revisor_default_action: "none",
    ui_label: "Dirty",
    learning_eligibility: "eligible_diagnostic_only"
  },
  {
    state: "satisfied",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: true,
    verdict_mapping: "passed",
    feedback_branch: "on_satisfied",
    blocks_downstream_default: false,
    revisor_default_action: "none",
    ui_label: "Satisfied",
    learning_eligibility: "eligible_positive"
  },
  {
    state: "needs_revision",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: false,
    verdict_mapping: "failed",
    feedback_branch: "on_needs_revision",
    blocks_downstream_default: true,
    revisor_default_action: "route_to_revisor",
    ui_label: "Needs revision",
    learning_eligibility: "eligible_negative"
  },
  {
    state: "needs_information",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: false,
    verdict_mapping: "indeterminate",
    feedback_branch: "on_needs_more_sources",
    blocks_downstream_default: true,
    revisor_default_action: "request_information",
    ui_label: "Needs information",
    learning_eligibility: "eligible_diagnostic_only"
  },
  {
    state: "needs_verification",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: false,
    verdict_mapping: "indeterminate",
    feedback_branch: "on_needs_source_verification",
    blocks_downstream_default: true,
    revisor_default_action: "request_verification",
    ui_label: "Needs verification",
    learning_eligibility: "eligible_diagnostic_only"
  },
  {
    state: "needs_human_judgment",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: false,
    verdict_mapping: "indeterminate",
    feedback_branch: "on_needs_human_judgment",
    blocks_downstream_default: true,
    revisor_default_action: "request_human_judgment",
    ui_label: "Needs human judgment",
    learning_eligibility: "eligible_diagnostic_only"
  },
  {
    state: "unable_to_evaluate",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: true,
    verdict_mapping: "indeterminate",
    feedback_branch: "on_unrecoverable",
    blocks_downstream_default: true,
    revisor_default_action: "abort_or_escalate",
    ui_label: "Unable to evaluate",
    learning_eligibility: "eligible_diagnostic_only"
  },
  {
    state: "blocked_by_policy",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: true,
    verdict_mapping: "indeterminate",
    feedback_branch: "on_blocked_by_policy",
    blocks_downstream_default: true,
    revisor_default_action: "abort_or_escalate",
    ui_label: "Blocked by policy",
    learning_eligibility: "eligible_diagnostic_only"
  },
  {
    state: "regressed",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: false,
    verdict_mapping: "failed",
    feedback_branch: "on_needs_revision",
    blocks_downstream_default: true,
    revisor_default_action: "route_to_revisor",
    ui_label: "Regressed",
    learning_eligibility: "eligible_negative"
  },
  {
    state: "upstream_failure",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: true,
    verdict_mapping: "failed",
    feedback_branch: "on_upstream_failure",
    blocks_downstream_default: true,
    revisor_default_action: "abort_or_escalate",
    ui_label: "Upstream failure",
    learning_eligibility: "eligible_diagnostic_only"
  },
  {
    state: "unrecoverable",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: true,
    verdict_mapping: "failed",
    feedback_branch: "on_unrecoverable",
    blocks_downstream_default: true,
    revisor_default_action: "abort_or_escalate",
    ui_label: "Unrecoverable",
    learning_eligibility: "eligible_negative"
  },
  {
    state: "max_iterations_reached",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: true,
    terminal: true,
    verdict_mapping: "failed",
    feedback_branch: "on_repeated_failure",
    blocks_downstream_default: true,
    revisor_default_action: "abort_or_escalate",
    ui_label: "Max iterations reached",
    learning_eligibility: "eligible_negative"
  },
  {
    state: "superseded",
    state_class: "disposition",
    persisted_in_outcome_runtime_state: true,
    emitted_to_evaluation_envelope: false,
    terminal: true,
    feedback_branch: "none",
    blocks_downstream_default: false,
    revisor_default_action: "mark_superseded",
    ui_label: "Superseded",
    learning_eligibility: "eligible_diagnostic_only"
  }
];
```
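
The rows above appear to follow a few invariants: one row per state, runtime rows never reach the envelope, and a row is emitted exactly when it maps to a verdict. A minimal sketch of a checker for those invariants follows. The invariants are inferred from the matrix rather than stated normatively, and `MatrixRow` and `checkMatrixInvariants` are hypothetical names.

```ts
// Trimmed row shape: only the fields the checks below read.
interface MatrixRow {
  state: string;
  state_class: "runtime" | "disposition";
  emitted_to_evaluation_envelope: boolean;
  terminal: boolean;
  verdict_mapping?: "passed" | "failed" | "indeterminate" | "not_applicable";
  feedback_branch: string;
}

// Returns human-readable violations; an empty array means the rows are consistent.
function checkMatrixInvariants(rows: MatrixRow[], expectedStates: string[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();

  for (const row of rows) {
    if (seen.has(row.state)) problems.push(`duplicate row: ${row.state}`);
    seen.add(row.state);

    if (row.state_class === "runtime") {
      if (row.emitted_to_evaluation_envelope) problems.push(`runtime state emitted: ${row.state}`);
      if (row.terminal) problems.push(`runtime state marked terminal: ${row.state}`);
    }
    // Assumed invariant: a row reaches the envelope if and only if it maps to a verdict.
    const hasVerdict = row.verdict_mapping !== undefined;
    if (row.emitted_to_evaluation_envelope !== hasVerdict) {
      problems.push(`verdict/emission mismatch: ${row.state}`);
    }
    // Assumed invariant: states that are never emitted carry no feedback branch.
    if (!row.emitted_to_evaluation_envelope && row.feedback_branch !== "none") {
      problems.push(`non-emitted state has a feedback branch: ${row.state}`);
    }
  }
  for (const state of expectedStates) {
    if (!seen.has(state)) problems.push(`missing row: ${state}`);
  }
  return problems;
}

const sampleRows: MatrixRow[] = [
  { state: "pending", state_class: "runtime", emitted_to_evaluation_envelope: false, terminal: false, feedback_branch: "none" },
  { state: "satisfied", state_class: "disposition", emitted_to_evaluation_envelope: true, terminal: true, verdict_mapping: "passed", feedback_branch: "on_satisfied" },
  { state: "superseded", state_class: "disposition", emitted_to_evaluation_envelope: false, terminal: true, feedback_branch: "none" }
];

// "dirty" is expected but absent from the sample, so exactly one problem is reported.
const matrixProblems = checkMatrixInvariants(sampleRows, ["pending", "satisfied", "superseded", "dirty"]);
```

Running a check of this kind at load time would catch a state added to `OutcomeEvaluationState` without a matching matrix row.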

------

## Appendix C — Limitation and indeterminate taxonomy

```ts
type EvaluationLimitationKind =
  | "insufficient_evidence"
  | "human_judgment_needed"
  | "missing_capability"
  | "source_unavailable"
  | "policy_blocked"
  | "stale_evidence"
  | "unable_to_ground_claim";

type IndeterminateCause =
  | "missing_information"
  | "missing_source"
  | "stale_source"
  | "missing_capability"
  | "policy_block"
  | "human_judgment_required"
  | "conflicting_evidence"
  | "tool_failure"
  | "timeout"
  | "unsupported_scope"
  | "blocked_before_substantive_verdict";

interface LimitationStateMapping {
  limitation: EvaluationLimitationKind;
  default_state: OutcomeEvaluationDisposition;
  default_indeterminate_cause: IndeterminateCause;
  default_recovery_route:
    | "request_more_information"
    | "route_to_source_research"
    | "route_to_source_verifier"
    | "route_to_human"
    | "route_to_policy_resolution"
    | "graph_patch_required"
    | "abort";
}

const LIMITATION_STATE_MAPPING: LimitationStateMapping[] = [
  {
    limitation: "insufficient_evidence",
    default_state: "needs_information",
    default_indeterminate_cause: "missing_information",
    default_recovery_route: "request_more_information"
  },
  {
    limitation: "human_judgment_needed",
    default_state: "needs_human_judgment",
    default_indeterminate_cause: "human_judgment_required",
    default_recovery_route: "route_to_human"
  },
  {
    limitation: "missing_capability",
    default_state: "unable_to_evaluate",
    default_indeterminate_cause: "missing_capability",
    default_recovery_route: "graph_patch_required"
  },
  {
    limitation: "source_unavailable",
    default_state: "needs_information",
    default_indeterminate_cause: "missing_source",
    default_recovery_route: "route_to_source_research"
  },
  {
    limitation: "policy_blocked",
    default_state: "blocked_by_policy",
    default_indeterminate_cause: "policy_block",
    default_recovery_route: "route_to_policy_resolution"
  },
  {
    limitation: "stale_evidence",
    default_state: "needs_verification",
    default_indeterminate_cause: "stale_source",
    default_recovery_route: "route_to_source_verifier"
  },
  {
    limitation: "unable_to_ground_claim",
    default_state: "needs_verification",
    default_indeterminate_cause: "unsupported_scope",
    default_recovery_route: "route_to_source_verifier"
  }
];

type SubstantiveVerdictStatus =
  | "substantive_verdict_reached"
  | "blocked_before_substantive_verdict"
  | "partial_substantive_verdict";
```
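
A consumer will usually want to look up defaults by limitation kind rather than scan the array. The sketch below re-keys the same rows as a `Record`, which lets the compiler reject a missing limitation kind, and adds a hypothetical helper, `distinctRecoveryRoutes`, that collapses several limitations into an ordered route list. The `...Sketch` names and the de-duplication rule are assumptions, not part of the contract.

```ts
type LimitationKindSketch =
  | "insufficient_evidence"
  | "human_judgment_needed"
  | "missing_capability"
  | "source_unavailable"
  | "policy_blocked"
  | "stale_evidence"
  | "unable_to_ground_claim";

interface LimitationDefaultsSketch {
  default_state: string;
  default_recovery_route: string;
}

// Same rows as LIMITATION_STATE_MAPPING, keyed by limitation kind.
const LIMITATION_DEFAULTS: Record<LimitationKindSketch, LimitationDefaultsSketch> = {
  insufficient_evidence: { default_state: "needs_information", default_recovery_route: "request_more_information" },
  human_judgment_needed: { default_state: "needs_human_judgment", default_recovery_route: "route_to_human" },
  missing_capability: { default_state: "unable_to_evaluate", default_recovery_route: "graph_patch_required" },
  source_unavailable: { default_state: "needs_information", default_recovery_route: "route_to_source_research" },
  policy_blocked: { default_state: "blocked_by_policy", default_recovery_route: "route_to_policy_resolution" },
  stale_evidence: { default_state: "needs_verification", default_recovery_route: "route_to_source_verifier" },
  unable_to_ground_claim: { default_state: "needs_verification", default_recovery_route: "route_to_source_verifier" }
};

// Recovery routes for a set of limitations, de-duplicated, in first-seen order.
function distinctRecoveryRoutes(limitations: LimitationKindSketch[]): string[] {
  const routes: string[] = [];
  for (const kind of limitations) {
    const route = LIMITATION_DEFAULTS[kind].default_recovery_route;
    if (routes.indexOf(route) === -1) routes.push(route);
  }
  return routes;
}

// Two verifier-bound limitations collapse into one route, followed by the policy route.
const recoveryRoutes = distinctRecoveryRoutes(["stale_evidence", "unable_to_ground_claim", "policy_blocked"]);
```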

------

## Appendix D — Canonical EvaluationFinding and FeedbackFindingView

```ts
type FindingState =
  | "proposed"
  | "active"
  | "contested"
  | "resolved"
  | "superseded_by_revision"
  | "superseded_by_source_change"
  | "user_approved"
  | "tool_verified"
  | "human_verified"
  | "rejected_by_user"
  | "dismissed"
  | "expired"
  | "unrecoverable";

type FindingKind =
  | "criterion_failed"
  | "missing_requirement"
  | "unsupported_assertion"
  | "source_mismatch"
  | "format_violation"
  | "style_violation"
  | "incomplete_coverage"
  | "research_gap"
  | "subjective_quality_issue"
  | "tool_verified_failure"
  | "process_observation"
  | "custom";

type ConfidenceBasis =
  | "deterministic_check"
  | "source_anchor"
  | "tool_verification"
  | "rubric_anchored"
  | "multi_reviewer_consensus"
  | "human_label"
  | "model_judgment_only"
  | "mixed";

interface FindingMatchKey {
  failure_kind: FailureKind;
  target_scope_ref: ArtifactScopeRef | null;
  finding_summary_hash: string;
  criterion_id?: string;
}

interface EvaluationFinding {
  finding_id: string;
  result_id: string;

  finding_kind: FindingKind;
  finding_text: string;
  explanation: string;

  severity: "low" | "medium" | "high" | "blocking";
  state: FindingState;

  assurance_basis: AssuranceBasis[];
  confidence_score: number;             // 0..1
  confidence_basis: ConfidenceBasis[];
  confidence_explanation: string;

  target_artifact_ref?: StorageRef;
  target_version_ref?: StorageRef;
  target_scope_ref?: ArtifactScopeRef;
  target_criterion_id?: string;
  affected_claim_refs?: string[];

  evidence_refs: StorageRef[];
  verification_record_refs: StorageRef[];
  supporting_material_snapshot_refs: StorageRef[];

  based_on_artifact_version_ref?: StorageRef;
  based_on_artifact_version_absent_reason?:
    | "non_artifact_target"
    | "human_review_no_artifact"
    | "process_observation";
  based_on_source_workspace_snapshot_ref?: StorageRef;
  based_on_board_digest_ref?: StorageRef;

  taint_class: TaintClass;
  data_class: "public" | "internal" | "privileged" | "local_only";
  matter_id?: string;
  privileged: boolean;
  policy_decision_refs: PolicyEvaluationRef[];

  evaluation_target_state:
    | "current_artifact"
    | "candidate_artifact"
    | "sandboxed_candidate";
  candidate_artifact_version_ref?: CandidateArtifactVersionRef;
  promotion_policy_ref?: StorageRef;

  match_key: FindingMatchKey;

  superseded_by_finding_id?: string;
  expires_at?: ISO8601;
  created_at: ISO8601;
  schema_version: "2.0";
}

interface FeedbackFindingView {
  finding_id: string;
  source_evaluation_finding_ref: string;
  display_summary: string;
  display_explanation: string;
  finding_kind: FindingKind;
  severity: "low" | "medium" | "high" | "blocking";
  lifecycle_state: FindingState;
  blocking_authority_satisfied: boolean;
  routed_action_refs: StorageRef[];
  schema_version: "1.1";
}
```
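
`match_key` exists so that the same underlying problem can be recognized across evaluation passes even though each pass mints a new `finding_id`. A minimal sketch of that comparison follows. It assumes `ArtifactScopeRef` can be represented as a string and that two keys match when all four fields are equal; `matchKeyToString` and `findRecurringFindings` are hypothetical helpers, and the `failure_kind` values are illustrative only.

```ts
interface MatchKeySketch {
  failure_kind: string;
  target_scope_ref: string | null; // stand-in for ArtifactScopeRef
  finding_summary_hash: string;
  criterion_id?: string;
}

interface FindingSketch {
  finding_id: string;
  match_key: MatchKeySketch;
}

// Canonical string form: fixed field order, missing criterion_id normalized to null.
function matchKeyToString(key: MatchKeySketch): string {
  return JSON.stringify([
    key.failure_kind,
    key.target_scope_ref,
    key.finding_summary_hash,
    key.criterion_id ?? null
  ]);
}

// Returns ids of current findings whose match key also appeared in the previous pass.
function findRecurringFindings(previous: FindingSketch[], current: FindingSketch[]): string[] {
  const previousKeys = new Set(previous.map(f => matchKeyToString(f.match_key)));
  return current
    .filter(f => previousKeys.has(matchKeyToString(f.match_key)))
    .map(f => f.finding_id);
}

const passOne: FindingSketch[] = [
  { finding_id: "f-1", match_key: { failure_kind: "unsupported_assertion", target_scope_ref: "sec-2", finding_summary_hash: "abc", criterion_id: "c-1" } }
];
const passTwo: FindingSketch[] = [
  { finding_id: "f-7", match_key: { failure_kind: "unsupported_assertion", target_scope_ref: "sec-2", finding_summary_hash: "abc", criterion_id: "c-1" } },
  { finding_id: "f-8", match_key: { failure_kind: "format_violation", target_scope_ref: null, finding_summary_hash: "def" } }
];

// Only f-7 repeats a key from the first pass.
const recurringIds = findRecurringFindings(passOne, passTwo);
```

A recurrence count built this way is one plausible input to the `on_repeated_failure` branch.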

------

## Appendix E — Pattern C chain registry and route policy

```ts
type EvaluationChainKind =
  | "pattern_c_evaluator_then_judge"
  | "single_evaluator"
  | "single_judge"
  | "multi_hop_evaluation";

type EvaluationChainStatus =
  | "open"
  | "complete"
  | "partial"
  | "blocked"
  | "superseded"
  | "invalid";

type EvaluationChainResolutionMode =
  | "blocking_qualitative_first"
  | "judge_quantitative_first"
  | "human_if_disagreement";

interface EvaluationChainRegistryRecord {
  chain_id: string;
  chain_kind: EvaluationChainKind;
  task_id: string;
  run_id: string;

  target_artifact_ref: StorageRef | null;
  target_artifact_version_ref: StorageRef | null;
  target_scope_ref: ArtifactScopeRef | null;
  evaluation_snapshot_ref: StorageRef;

  expected_producers: ProducerKind[];
  received_result_ids: string[];

  status: EvaluationChainStatus;
  created_at: ISO8601;
  completed_at?: ISO8601;
  superseded_by_chain_id?: string;

  validation_failures: Array<
    | "chain_id_missing"
    | "chain_target_mismatch"
    | "chain_stale_snapshot"
    | "chain_ambiguous"
    | "chain_consumer_timeout"
  >;

  schema_version: "1.0";
}

interface EvaluationChainResolutionPolicy {
  policy_id: string;
  chain_kind: EvaluationChainKind;

  qualitative_blockers_survive_numeric_pass: boolean;
  judge_can_override_evaluator: boolean;
  override_allowed_only_for_finding_states: FindingState[];

  route_precedence: EvaluationChainResolutionMode;

  disagreement_route:
    | "human_review"
    | "task_agent_assessment"
    | "block_until_resolved"
    | "prefer_blocking";

  schema_version: "1.0";
}

const DEFAULT_PATTERN_C_RESOLUTION_POLICY: EvaluationChainResolutionPolicy = {
  policy_id: "default.pattern_c.v1",
  chain_kind: "pattern_c_evaluator_then_judge",
  qualitative_blockers_survive_numeric_pass: true,
  judge_can_override_evaluator: false,
  override_allowed_only_for_finding_states: ["contested", "dismissed", "rejected_by_user"],
  route_precedence: "blocking_qualitative_first",
  disagreement_route: "human_review",
  schema_version: "1.0"
};
```
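
The default policy says a qualitative blocker survives a passing numeric score and the judge cannot override the evaluator. The sketch below shows one possible reading of how those flags combine. It assumes `override_allowed_only_for_finding_states` lists the states in which a judge may set a blocker aside, and only when `judge_can_override_evaluator` is true. It also assumes the caller passes only findings that are still in force. `resolvePatternC` and the `...Sketch` types are hypothetical.

```ts
type FindingStateSketch = "active" | "contested" | "dismissed" | "rejected_by_user";

interface ChainPolicySketch {
  qualitative_blockers_survive_numeric_pass: boolean;
  judge_can_override_evaluator: boolean;
  override_allowed_only_for_finding_states: FindingStateSketch[];
}

interface EvaluatorFindingSketch {
  severity: "low" | "medium" | "high" | "blocking";
  state: FindingStateSketch;
}

type ChainOutcomeSketch = "pass" | "fail" | "blocked_by_qualitative_finding";

function resolvePatternC(
  policy: ChainPolicySketch,
  findings: EvaluatorFindingSketch[],
  judgePassed: boolean
): ChainOutcomeSketch {
  // A failing numeric verdict fails the chain regardless of qualitative findings.
  if (!judgePassed) return "fail";
  if (!policy.qualitative_blockers_survive_numeric_pass) return "pass";

  // A blocker survives unless the judge is allowed to override it in its current state.
  const survivingBlockers = findings.filter(f =>
    f.severity === "blocking" &&
    !(policy.judge_can_override_evaluator &&
      policy.override_allowed_only_for_finding_states.includes(f.state))
  );
  return survivingBlockers.length > 0 ? "blocked_by_qualitative_finding" : "pass";
}

const defaultLikePolicy: ChainPolicySketch = {
  qualitative_blockers_survive_numeric_pass: true,
  judge_can_override_evaluator: false,
  override_allowed_only_for_finding_states: ["contested", "dismissed", "rejected_by_user"]
};

// The judge passed numerically, but an active blocking finding still blocks the chain.
const chainOutcome = resolvePatternC(defaultLikePolicy, [{ severity: "blocking", state: "active" }], true);
```

Under this reading the override list has no effect while `judge_can_override_evaluator` is false, which may be worth stating explicitly in the policy text.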

------

## Appendix F — Feedback routing, delivery, and consumption receipts

```ts
type FeedbackRouteAction =
  | "continue"
  | "post_summary"
  | "score_only"
  | "emit_repair_instruction"
  | "route_to_prior_module"
  | "route_to_revision_module"
  | "post_to_forum"
  | "pause_for_human"
  | "emit_research_need"
  | "route_to_source_research"
  | "route_to_source_verifier"
  | "route_to_format_checker"
  | "ask_task_agent_for_process_assessment"
  | "fork_from_checkpoint"
  | "none";

interface FeedbackRoutingPolicy {
  on_satisfied: FeedbackRouteAction[];
  on_needs_revision: FeedbackRouteAction[];
  on_needs_more_sources: FeedbackRouteAction[];
  on_needs_source_verification: FeedbackRouteAction[];
  on_needs_human_judgment: FeedbackRouteAction[];
  on_blocked_by_policy: FeedbackRouteAction[];
  on_upstream_failure: FeedbackRouteAction[];
  on_unrecoverable: FeedbackRouteAction[];
  on_repeated_failure: FeedbackRouteAction[];

  broadcast_policy:
    | "none"
    | "post_blockers_only"
    | "post_all_findings"
    | "post_summary_only";

  downstream_visibility:
    | "none"
    | "same_segment"
    | "selected_modules"
    | "all_future_modules";

  schema_version: "1.1";
}

interface FeedbackRouteActionRecord {
  route_action_id: string;
  action: FeedbackRouteAction;
  target_ref?: string;
  source_item_refs: string[];
  status:
    | "planned"
    | "emitted"
    | "delivered"
    | "consumed"
    | "failed"
    | "skipped_by_policy";
  rationale: string;
}

interface FeedbackRoutingRecommendation {
  applied_policy_ref: string;
  route_actions: FeedbackRouteActionRecord[];
  primary_route_action_id?: string;
  routing_rationale: string;
  schema_version: "1.1";
}

interface FeedbackDeliveryReceipt {
  delivery_id: string;
  feedback_bundle_id: string;
  item_refs: string[];
  intended_consumer_ref: string;
  delivery_channel:
    | "control_flow"
    | "direct_wiring"
    | "forum"
    | "doc24_packet"
    | "process_gap";
  delivery_status:
    | "delivered"
    | "filtered_by_policy"
    | "consumer_unavailable"
    | "port_unwired"
    | "timed_out"
    | "not_applicable"
    | "failed";
  failure_reason?: string;
  delivered_at?: ISO8601;
  schema_version: "1.0";
}

interface FeedbackConsumedItem {
  item_ref: string;
  item_kind:
    | "finding"
    | "repair_instruction"
    | "research_need"
    | "run_guidance"
    | "feedback_bundle";
  consumption_status:
    | "used"
    | "ignored"
    | "rejected"
    | "deferred"
    | "blocked"
    | "not_applicable";
  consumption_reason?: string;
  resulting_action_ref?: StorageRef;
}

interface FeedbackConsumptionReceipt {
  receipt_id: string;
  feedback_bundle_id: string;
  consumer_ref: string;
  delivery_receipt_ref?: string;
  consumed_items: FeedbackConsumedItem[];
  created_at: ISO8601;
  schema_version: "1.1";
}
```
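
Delivery and consumption are recorded separately, so a delivered item with no consumption entry is detectable. The sketch below reconciles the two receipt kinds. It assumes a consumption receipt links back to its delivery through `delivery_receipt_ref` and that any `consumption_status` counts as an acknowledgement; `findUnacknowledgedItems` and the trimmed `...Sketch` shapes are hypothetical.

```ts
interface DeliverySketch {
  delivery_id: string;
  item_refs: string[];
  delivery_status: "delivered" | "filtered_by_policy" | "failed";
}

interface ConsumptionSketch {
  delivery_receipt_ref?: string;
  consumed_items: Array<{ item_ref: string; consumption_status: string }>;
}

interface UnacknowledgedItem {
  delivery_id: string;
  item_ref: string;
}

// Lists items that were delivered but appear in no linked consumption receipt.
function findUnacknowledgedItems(
  deliveries: DeliverySketch[],
  receipts: ConsumptionSketch[]
): UnacknowledgedItem[] {
  const missing: UnacknowledgedItem[] = [];
  for (const delivery of deliveries) {
    // Only successful deliveries create an obligation to acknowledge.
    if (delivery.delivery_status !== "delivered") continue;

    const acknowledged = new Set<string>();
    for (const receipt of receipts) {
      if (receipt.delivery_receipt_ref !== delivery.delivery_id) continue;
      for (const item of receipt.consumed_items) acknowledged.add(item.item_ref);
    }
    for (const ref of delivery.item_refs) {
      if (!acknowledged.has(ref)) missing.push({ delivery_id: delivery.delivery_id, item_ref: ref });
    }
  }
  return missing;
}

const sampleDeliveries: DeliverySketch[] = [
  { delivery_id: "d-1", item_refs: ["i-1", "i-2"], delivery_status: "delivered" },
  { delivery_id: "d-2", item_refs: ["i-3"], delivery_status: "failed" }
];
const sampleReceipts: ConsumptionSketch[] = [
  { delivery_receipt_ref: "d-1", consumed_items: [{ item_ref: "i-1", consumption_status: "used" }] }
];

// i-2 was delivered but never acknowledged; i-3 is skipped because its delivery failed.
const unacknowledged = findUnacknowledgedItems(sampleDeliveries, sampleReceipts);
```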

------

## Appendix G — Source Workspace hardening additions

```ts
type SourceRecordRef = string;
type SourceSetRef = string;
type ClaimRef = string;

interface SourceEvidenceAnchor {
  anchor_id: string;
  source_id: SourceRecordRef;
  anchor_kind:
    | "page"
    | "quote"
    | "section"
    | "timestamp"
    | "row"
    | "url_fragment"
    | "paragraph"
    | "line_range";
  anchor_ref: TextAnchor | StructuredAnchor | StorageRef;
  supports_claim_refs: ClaimRef[];
  support_strength: "direct" | "indirect" | "contextual" | "contradicts";
  extracted_text_ref?: StorageRef;
  created_at: ISO8601;
  schema_version: "1.0";
}

interface ResearchNeed {
  need_id: string;
  task_id: string;
  run_id?: string;
  need_scope: "run" | "task" | "workspace";

  created_by_module_id: string;
  created_by_activation_seq: number;

  need_kind:
    | "new_source_needed"
    | "source_currentness_check"
    | "citation_support_needed"
    | "contrary_source_check"
    | "background_document_missing"
    | "format_rule_needed"
    | "domain_rule_needed"
    | "source_gap_from_evaluator"
    | "clarification_needed"
    | "custom";

  question: string;
  target_claim_or_section_ref?: ArtifactScopeRef | ClaimRef;
  priority: "low" | "medium" | "high" | "blocking";

  routed_to_module_id?: string;
  status:
    | "open"
    | "leased"
    | "in_progress"
    | "answered"
    | "unresolved"
    | "human_needed"
    | "cancelled";

  answer_ref?: StorageRef;
  source_record_refs: SourceRecordRef[];

  created_at: ISO8601;
  updated_at: ISO8601;
  schema_version: "1.1";
}

interface ResearchNeedLease {
  lease_id: string;
  need_id: string;
  holder_ref: string;
  lease_version: number;
  expires_at: ISO8601;
  idempotency_key: string;
  status_precondition: ResearchNeed["status"];
  created_at: ISO8601;
  schema_version: "1.0";
}

interface TaintAggregationPolicy {
  policy_id: string;
  taint_lattice_order: TaintClass[];
  workspace_taint_rule: "max_taint" | "policy_computed";
  privileged_rule: "any_privileged_record_marks_workspace_privileged";
  matter_scope_rule:
    | "single_matter_only"
    | "policy_gate_for_multi_matter"
    | "allow_multi_matter_with_explicit_policy";
  schema_version: "1.0";
}

interface DomainPayloadRegistryRef {
  domain_payload_kind: string;
  domain_payload_schema_ref: SchemaRef;
  domain_payload_version: string;
}
```
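
For `workspace_taint_rule: "max_taint"`, the workspace takes the highest taint found among its records, and the privileged rule is a simple "any". A minimal sketch follows. It assumes `taint_lattice_order` is listed from least to most tainted, and the taint names used (`trusted`, `external`, `untrusted`) are placeholders rather than actual `TaintClass` values. `aggregateWorkspaceTaint` and `isWorkspacePrivileged` are hypothetical helpers.

```ts
// Returns the highest-ranked taint among the records, or the lowest class for an empty workspace.
function aggregateWorkspaceTaint(latticeOrder: string[], recordTaints: string[]): string {
  const lowest = latticeOrder[0];
  if (lowest === undefined) throw new Error("taint lattice is empty");

  let maxRank = 0;
  for (const taint of recordTaints) {
    const rank = latticeOrder.indexOf(taint);
    // An unknown class is rejected rather than silently treated as clean.
    if (rank === -1) throw new Error(`unknown taint class: ${taint}`);
    if (rank > maxRank) maxRank = rank;
  }
  return latticeOrder[maxRank] ?? lowest;
}

// privileged_rule: any privileged record marks the whole workspace privileged.
function isWorkspacePrivileged(records: Array<{ privileged: boolean }>): boolean {
  return records.some(r => r.privileged);
}

const latticeSketch = ["trusted", "external", "untrusted"];
const workspaceTaint = aggregateWorkspaceTaint(latticeSketch, ["trusted", "untrusted", "external"]);
const workspacePrivileged = isWorkspacePrivileged([{ privileged: false }, { privileged: true }]);
```

Here `workspaceTaint` is `"untrusted"` and `workspacePrivileged` is `true`, since one record in each case dominates the rest.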

------

## Appendix H — Task Forum / Run Board hardening additions

```ts
type TaskContextAudience =
  | "authoring_module"
  | "drafting_module"
  | "source_research_module"
  | "outcome_evaluator"
  | "style_revision_module"
  | "format_checker"
  | "task_agent"
  | "human_reviewer"
  | "custom";

interface VisibilityTargetRef {
  target_kind: "module" | "segment" | "actor" | "role";
  target_ref: string;
}

interface RunBoardGovernanceEnvelope {
  data_class: "public" | "internal" | "privileged" | "local_only";
  taint_class: TaintClass;
  policy_decision_refs: PolicyEvaluationRef[];
  sanitization_required: boolean;
  governance_class: AccessTier;
  privileged: boolean;
  matter_id?: string;
}

interface TaskRunBoardPost {
  post_id: string;
  task_id: string;
  run_id: string;
  forum_module_id?: string;

  author_kind:
    | "module"
    | "subagent"
    | "outcome_evaluator"
    | "task_agent"
    | "user"
    | "system";
  author_ref: string;
  post_kind: TaskRunBoardPostKind;

  segment_id?: string;
  module_id?: string;
  activation_seq?: number;
  visibility:
    | "all_task_modules"
    | "selected_modules"
    | "segment_only"
    | "user_only"
    | "task_agent_only";
  visibility_target_refs: VisibilityTargetRef[];

  content_ref: StorageRef;
  artifact_refs: StorageRef[];
  source_workspace_refs: string[];
  repair_instruction_refs: StorageRef[];
  outcome_result_refs: StorageRef[];
  feedback_bundle_refs: StorageRef[];

  lifecycle_state: "active" | "superseded" | "retracted";
  supersedes_post_ids?: string[];
  superseded_by_post_id?: string;
  supersession_reason?: string;

  governance: RunBoardGovernanceEnvelope;

  created_at: ISO8601;
  schema_version: "1.1";
}

interface TaskRunContextPacketRequest {
  request_id: string;
  task_id: string;
  run_id: string;
  requesting_module_id: string;
  requesting_activation_seq: number;
  audience: TaskContextAudience;
  requested_max_tokens: number;
  required_item_refs: StorageRef[];
  optional_item_refs: StorageRef[];
  staleness_policy: "latest_only" | "allow_cached_until_event_seq" | "allow_cached_for_ms";
  created_at: ISO8601;
  schema_version: "1.0";
}

interface OmittedPacketItem {
  item_ref: StorageRef;
  item_kind:
    | "board_post"
    | "research_need"
    | "repair_instruction"
    | "run_guidance"
    | "source_record"
    | "artifact"
    | "policy_record";
  reason: "token_cap" | "permission" | "irrelevant" | "stale" | "duplicate";
  required: boolean;
}

interface TaskRunContextPacketAssemblyReceipt {
  receipt_id: string;
  request_id: string;
  packet_id: string;
  packet_content_hash: string;
  actual_token_count: number;
  omitted_items: OmittedPacketItem[];
  valid_until_event_seq: number;
  invalidated_by_event_kinds: string[];
  assembly_policy_ref: StorageRef;
  created_at: ISO8601;
  schema_version: "1.0";
}

interface ModuleAssistanceRequest {
  request_id: string;
  task_id: string;
  run_id: string;

  from_module_id: string;
  from_activation_seq: number;

  target:
    | "source_workspace"
    | "source_research_module"
    | "outcome_evaluator"
    | "task_agent"
    | "specific_module"
    | "human"
    | "task_forum";

  target_module_id?: string;
  target_endpoint_ref?: string;
  target_input_port_id?: string;
  answer_schema_ref?: SchemaRef;

  request_kind:
    | "source_lookup"
    | "more_research"
    | "verify_reference"
    | "clarify_prior_output"
    | "format_rule_lookup"
    | "style_guidance"
    | "repair_instruction"
    | "human_question"
    | "custom";

  question: string;
  priority: "low" | "medium" | "high" | "blocking";

  response_policy:
    | "post_to_board"
    | "route_to_instruction_in"
    | "route_to_context_in"
    | "pause_until_answer"
    | "continue_with_warning";

  lease?: {
    holder_ref: string;
    lease_version: number;
    expires_at: ISO8601;
  };

  timeout_ms: number;
  on_timeout:
    | "resume_with_warning"
    | "abort"
    | "escalate_human"
    | "ask_task_agent";

  response_ref?: StorageRef;
  created_at: ISO8601;
  schema_version: "1.1";
}
```
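
`TaskRunContextPacketRequest` carries a token budget and the assembly receipt records what was left out. The sketch below shows one way the two could connect: a greedy fill that places required items first and records every item that does not fit with reason `token_cap`. The greedy order, the per-item `tokens` field, and the names `assemblePacket` and `...Sketch` are assumptions, not part of the contract.

```ts
interface CandidateItemSketch {
  item_ref: string;
  tokens: number;      // assumed pre-computed token cost of the item
  required: boolean;
}

interface OmittedItemSketch {
  item_ref: string;
  reason: "token_cap";
  required: boolean;
}

interface AssemblyResultSketch {
  included_refs: string[];
  omitted_items: OmittedItemSketch[];
  actual_token_count: number;
}

function assemblePacket(items: CandidateItemSketch[], requestedMaxTokens: number): AssemblyResultSketch {
  // Required items are considered before optional ones; input order is kept within each group.
  const ordered = [...items.filter(i => i.required), ...items.filter(i => !i.required)];
  const result: AssemblyResultSketch = { included_refs: [], omitted_items: [], actual_token_count: 0 };

  for (const item of ordered) {
    if (result.actual_token_count + item.tokens <= requestedMaxTokens) {
      result.included_refs.push(item.item_ref);
      result.actual_token_count += item.tokens;
    } else {
      // Omissions are recorded, including required ones, so the consumer can react.
      result.omitted_items.push({ item_ref: item.item_ref, reason: "token_cap", required: item.required });
    }
  }
  return result;
}

const packetResult = assemblePacket(
  [
    { item_ref: "A", tokens: 40, required: true },
    { item_ref: "B", tokens: 50, required: false },
    { item_ref: "C", tokens: 30, required: true },
    { item_ref: "D", tokens: 20, required: false }
  ],
  100
);
```

With a 100-token budget, A and C (required) are placed first, B no longer fits, and D fills the remaining space, for 90 tokens in total. An omitted item with `required: true` is the case a consumer would most likely need to treat as a failure rather than a degradation.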

------

## Appendix I — Core task packet, slots, and command registry

```ts
interface CompactTaskInvocationDirectiveCard {
  card_id: string;
  directive_id: string;
  display_label: string;
  invocation_policy:
    | "suggest_task"
    | "run_existing_task_after_confirmation"
    | "invoke_task_agent_for_design"
    | "invoke_task_agent_for_review"
    | "answer_direct_but_offer_task_chip"
    | "do_not_suggest_task";
  reason: string;
  score: CalibratedScore;
  risk_flags: string[];
  token_estimate: number;
  redaction_state: "none" | "redacted" | "blocked";
  source_authority: "user_taught" | "system_learned" | "architect_authored";
  schema_version: "1.0";
}

interface CompactTaskTemplateCard {
  card_id: string;
  template_id: string;
  display_label: string;
  description: string;
  recommended_action:
    | "use_as_is"
    | "customize"
    | "fork"
    | "create_new_from_pattern"
    | "create_new_blank";
  match_score: CalibratedScore;
  reason: string;
  risk_flags: string[];
  token_estimate: number;
  redaction_state: "none" | "redacted" | "blocked";
  source_authority: "template_registry" | "user_saved" | "architect_authored";
  schema_version: "1.0";
}

interface CompactModulePresetCard {
  card_id: string;
  preset_id: string;
  module_type: string;
  display_label: string;
  role_in_workflow: string;
  match_score: CalibratedScore;
  reason: string;
  risk_flags: string[];
  token_estimate: number;
  redaction_state: "none" | "redacted" | "blocked";
  source_authority: "preset_registry" | "user_saved" | "architect_authored";
  schema_version: "1.0";
}

interface TaskSystemInjectionSlotRegistration {
  slot_id:
    | "DOC24.ambient_task_awareness_card"
    | "DOC24.task_opportunity_packet"
    | "DOC24.task_agent_design_packet"
    | "DOC24.task_template_top_k"
    | "DOC24.task_invocation_directive_top_k"
    | "DOC24.module_preset_top_k";

  slot_kind: "card" | "packet" | "top_k_card_set";
  surfaces: Array<"prompt_inline" | "ui_inspector" | "agent_context_only">;
  token_cap: number;
  pii_redaction_required: boolean;
  required_for_packet_kinds: string[];
  optional_for_packet_kinds: string[];
  on_unavailable: "omit" | "degrade_direct_first" | "block_explicit_task_route";
  receipt_required: boolean;
  schema_version: "1.0";
}

interface TaskCommandRegistryEntry {
  command_id: string;
  user_visible_action_label: string;
  request_schema_ref: SchemaRef;
  response_schema_ref: SchemaRef;
  idempotency_key_required: boolean;
  durable_write:
    | "required"
    | "optional"
    | "none"
    | "explicit_no_op";
  telemetry_event_kind: string;
  read_model_invalidations: string[];
  failure_codes: string[];
  required_policy_checks: string[];
  owning_doc: OwnerDoc;
  schema_version: "1.0";
}

interface AvailableTaskAction {
  action_kind:
    | "cancel"
    | "pause"
    | "resume"
    | "edit"
    | "fork"
    | "restore_known_good_state"
    | "convert_to_template"
    | "inspect_run"
    | "open_artifact";
  command_ref: string;
  enabled: boolean;
  disabled_reason?: string;
  idempotency_key_required: boolean;
}

interface TrackedTaskReceipt {
  receipt_id: string;
  task_id: string;
  run_id?: string;
  creation_mode: TaskCreationMode;
  visible_to_user: true;
  reason_for_task_mode: string;
  graph_inspector_url: string;
  run_inspector_url?: string;
  available_actions: AvailableTaskAction[];
  doc72_task_memory_ref?: string;
  created_at: ISO8601;
  schema_version: "1.1";
}
```
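
`idempotency_key_required` implies that a retried command must not execute twice. The sketch below shows a minimal in-memory dispatcher that enforces this. The result cache, the command id `task.fork`, and the two error codes are hypothetical and are not defined by the registry above.

```ts
interface CommandEntrySketch {
  command_id: string;
  idempotency_key_required: boolean;
}

class IdempotentDispatcher {
  private readonly entries: Map<string, CommandEntrySketch>;
  private readonly results: Map<string, string>;

  constructor(registry: CommandEntrySketch[]) {
    this.entries = new Map<string, CommandEntrySketch>();
    this.results = new Map<string, string>();
    for (const entry of registry) this.entries.set(entry.command_id, entry);
  }

  // Runs the command once per (command, key) pair and replays the stored result on retry.
  dispatch(commandId: string, idempotencyKey: string | undefined, run: () => string): string {
    const entry = this.entries.get(commandId);
    if (entry === undefined) throw new Error("validation.unknown_command"); // hypothetical code
    if (!entry.idempotency_key_required) return run();
    if (idempotencyKey === undefined) {
      throw new Error("validation.idempotency_key_missing"); // hypothetical code
    }

    const cacheKey = `${commandId}:${idempotencyKey}`;
    const cached = this.results.get(cacheKey);
    if (cached !== undefined) return cached;

    const result = run();
    this.results.set(cacheKey, result);
    return result;
  }
}

let forkExecutions = 0;
const dispatcher = new IdempotentDispatcher([{ command_id: "task.fork", idempotency_key_required: true }]);

// The retry reuses the key, so the handler runs once and both calls see the first result.
const firstFork = dispatcher.dispatch("task.fork", "key-1", () => { forkExecutions += 1; return "run-42"; });
const retriedFork = dispatcher.dispatch("task.fork", "key-1", () => { forkExecutions += 1; return "run-43"; });
```

A durable implementation would persist the result map alongside the `durable_write` the command performs; the in-memory map here only illustrates the replay behavior.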

------

## Appendix J — FormulaRegistry and scoring functions

```ts
interface TypedInput {
  name: string;
  type_ref: string;
  required: boolean;
}

interface FormulaTestVector {
  vector_id: string;
  inputs_ref: StorageRef;
  expected_output_ref: StorageRef;
  tolerance?: number;
}

interface FormulaSpec {
  formula_id: string;
  owner_doc: OwnerDoc;
  inputs: TypedInput[];
  output_type: string;
  units?: string;
  range?: [number, number];
  missing_input_policy:
    | "fail_validation"
    | "indeterminate"
    | "default_value"
    | "exclude_with_penalty";
  zero_denominator_policy?:
    | "undefined_insufficient_data"
    | "return_zero"
    | "return_one"
    | "fail_validation";
  normalization_policy?: string;
  version: string;
  test_vectors: FormulaTestVector[];
}

interface CalibratedScore {
  value: number;
  range: [number, number];
  metric_version: string;
  calibration_ref?: StorageRef;
  confidence_interval?: [number, number];
  sample_size?: number;
  explanation_ref?: StorageRef;
}

interface DimensionScore {
  dimension_id: string;
  criterion_id?: string;
  raw_score: number;
  normalized_score: number;       // 0..1
  weight: number;
  required_gate_status?: "passed" | "failed" | "not_applicable";
  scoring_basis: Criterion["scoring_basis"];
  evidence_refs: StorageRef[];
  metric_semantics_version: string;
}

interface QualityIndex {
  aggregate_score: number;        // 0..1
  aggregation_method:
    | "all_required_then_weighted_mean"
    | "weighted_mean"
    | "min_required_score"
    | "all_or_nothing";
  pass_threshold: number;
  required_gate_failures: string[];
  missing_dimension_policy:
    | "fail_required"
    | "exclude_optional_with_penalty"
    | "indeterminate";
  passed: boolean;
  metric_semantics_version: string;
}

interface OutcomeComplianceScoringConfig {
  aggregation_method: QualityIndex["aggregation_method"];
  pass_threshold: number;
  required_gate_default: "hard_gate" | "weighted";
  default_weight_policy: "uniform" | "from_criterion_weight" | "from_priority";
  priority_weight_map: {
    must_have: number;
    should_have: number;
    nice_to_have: number;
  };
  missing_dimension_policy: QualityIndex["missing_dimension_policy"];
  unanchored_llm_judgment_policy:
    | "exclude"
    | "include_with_audit_flag"
    | "indeterminate";
  schema_version: "1.0";
}
```

Default criterion weight normalization rule:

```ts
function normalizeCriterionWeights(criteria: Criterion[], policy: OutcomeComplianceScoringConfig): Record<string, number> {
  // Unanchored LLM judgments are excluded here regardless of policy.unanchored_llm_judgment_policy.
  const eligible = criteria.filter(c => c.scoring_basis !== "unanchored_llm_judgment");

  // Check emptiness first so every weight policy reports the same error for an empty set.
  const n = eligible.length;
  if (n === 0) throw new Error("validation.no_aggregation_eligible_criteria");

  if (policy.default_weight_policy === "from_criterion_weight") {
    for (const c of eligible) {
      if (c.weight == null) {
        throw new Error("validation.criterion_weight_missing_under_from_criterion_weight");
      }
    }
    const sum = eligible.reduce((acc, c) => acc + (c.weight ?? 0), 0);
    if (sum <= 0) throw new Error("validation.criterion_weight_sum_zero");
    return Object.fromEntries(eligible.map(c => [c.criterion_id, (c.weight ?? 0) / sum]));
  }

  if (policy.default_weight_policy === "from_priority") {
    const raw = eligible.map(c => {
      // Criteria without an explicit priority are treated as "should_have".
      const priority = c.priority ?? "should_have";
      return [c.criterion_id, policy.priority_weight_map[priority]] as const;
    });
    const sum = raw.reduce((acc, [, w]) => acc + w, 0);
    // Guard against an all-zero priority_weight_map, which would otherwise yield NaN weights.
    if (sum <= 0) throw new Error("validation.criterion_weight_sum_zero");
    return Object.fromEntries(raw.map(([id, w]) => [id, w / sum]));
  }

  // "uniform": every eligible criterion gets the same share.
  return Object.fromEntries(eligible.map(c => [c.criterion_id, 1 / n]));
}
```
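
With per-dimension weights in hand, a `QualityIndex` can be derived. The sketch below illustrates `all_required_then_weighted_mean` under one assumed reading: any failed required gate fails the index outright, and otherwise the weighted mean of `normalized_score` is compared against `pass_threshold`. `computeQualityIndex` and the trimmed `...Sketch` shapes are hypothetical.

```ts
interface DimensionSketch {
  dimension_id: string;
  normalized_score: number; // 0..1
  weight: number;
  required_gate_status?: "passed" | "failed" | "not_applicable";
}

interface QualityIndexSketch {
  aggregate_score: number;
  required_gate_failures: string[];
  passed: boolean;
}

function computeQualityIndex(dimensions: DimensionSketch[], passThreshold: number): QualityIndexSketch {
  const totalWeight = dimensions.reduce((acc, d) => acc + d.weight, 0);
  if (totalWeight <= 0) throw new Error("validation.criterion_weight_sum_zero");

  // Weighted mean; dividing by totalWeight means the weights need not already sum to 1.
  const aggregate =
    dimensions.reduce((acc, d) => acc + d.normalized_score * d.weight, 0) / totalWeight;

  // Gate failures are reported by dimension id and veto the pass regardless of the mean.
  const gateFailures = dimensions
    .filter(d => d.required_gate_status === "failed")
    .map(d => d.dimension_id);

  return {
    aggregate_score: aggregate,
    required_gate_failures: gateFailures,
    passed: gateFailures.length === 0 && aggregate >= passThreshold
  };
}

// Mean is about 0.75 against a 0.7 threshold with no gate failures, so the index passes.
const passingIndex = computeQualityIndex(
  [
    { dimension_id: "accuracy", normalized_score: 0.9, weight: 0.5, required_gate_status: "passed" },
    { dimension_id: "style", normalized_score: 0.6, weight: 0.5 }
  ],
  0.7
);

// Same scores, but a failed required gate vetoes the pass.
const gatedIndex = computeQualityIndex(
  [
    { dimension_id: "accuracy", normalized_score: 0.9, weight: 0.5, required_gate_status: "failed" },
    { dimension_id: "style", normalized_score: 0.6, weight: 0.5 }
  ],
  0.7
);
```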

------

## Appendix K — Task mode and template match formulas

```ts
interface TaskModeScoringFunction {
  formula_id: "task_mode_score_v1";
  feature_weights: Record<TaskUseSignal, number>;
  veto_precedence: Record<TaskUseVetoSignal, "hard" | "soft">;
  score_range: [0, 1];
  thresholds: {
    none_max: number;
    low_max: number;
    medium_min: number;
    high_min: number;
    explicit_rule: "explicit_task_request_or_existing_task_reference";
  };
  calibration_metric_ref: StorageRef;
  schema_version: "1.0";
}

interface TemplateMatchFormula {
  formula_id: "template_match_score_v1";
  component_weights: {
    semantic_intent_match: number;
    task_type_match: number;
    input_contract_match: number;
    output_contract_match: number;
    capability_availability_match: number;
    entity_context_match: number;
    user_preference_match: number;
    prior_assessment_score: number;
    recency_or_staleness_score: number;
  };
  hard_veto_cap: number;
  soft_penalty_max_total: number;
  schema_version: "1.0";
}

function computeTemplateOverallScore(
  components: Record<keyof TemplateMatchFormula["component_weights"], number>,
  weights: TemplateMatchFormula["component_weights"],
  softPenaltySum: number,
  hardVetoCount: number,
  hardVetoCap: number
): number {
  const weighted = Object.entries(weights).reduce((acc, [key, weight]) => {
    const value = components[key as keyof typeof components];
    if (value < 0 || value > 1) throw new Error("validation.template_match_component_out_of_range");
    return acc + value * weight;
  }, 0);

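  const totalWeight = Object.values(weights).reduce((a, b) => a + b, 0);
  // Reject an all-zero weight vector; otherwise the division below yields NaN.
  if (totalWeight <= 0) throw new Error("validation.template_match_weight_sum_zero");
  const normalized = weighted / totalWeight;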
  const penalized = Math.max(0, normalized - softPenaltySum);

  return hardVetoCount > 0 ? Math.min(penalized, hardVetoCap) : penalized;
}
```
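
To make the arithmetic concrete, here is a simplified restatement of the template score with two components. The `templateScore` name, the weights, the component values, and the treatment of a missing component as 0 are all assumptions for the example; only the weighted-average, soft-penalty, and hard-veto-cap steps come from the formula above.

```typescript
// Simplified restatement for a worked example; names and numbers are illustrative.
function templateScore(
  componentValues: Record<string, number>,
  componentWeights: Record<string, number>,
  softPenaltySum: number,
  hardVetoCount: number,
  hardVetoCap: number
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const [key, weight] of Object.entries(componentWeights)) {
    const value = componentValues[key] ?? 0; // assumption: a missing component scores 0
    if (value < 0 || value > 1) throw new Error("component out of range");
    weighted += value * weight;
    totalWeight += weight;
  }
  if (totalWeight <= 0) throw new Error("weights must sum to a positive number");

  // Weighted average, then subtract soft penalties without going below 0.
  const penalized = Math.max(0, weighted / totalWeight - softPenaltySum);

  // Any hard veto caps the score rather than zeroing it.
  return hardVetoCount > 0 ? Math.min(penalized, hardVetoCap) : penalized;
}

const sampleWeights = { semantic_intent_match: 3, task_type_match: 1 };
const sampleValues = { semantic_intent_match: 0.9, task_type_match: 0.5 };

// (0.9 * 3 + 0.5 * 1) / 4 = 0.8, minus a 0.1 soft penalty, is about 0.7.
const scoreNoVeto = templateScore(sampleValues, sampleWeights, 0.1, 0, 0.2);

// One hard veto caps the same score at hardVetoCap = 0.2.
const scoreWithVeto = templateScore(sampleValues, sampleWeights, 0.1, 1, 0.2);
```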

------

## Appendix L — Novelty metric specification

```ts
interface NoveltyMetricSpec {
  metric_id: string;
  feature_vector_definition_ref: StorageRef;
  embedding_model_ref?: string;
  symbolic_feature_version?: string;
  distance_metric: "cosine_distance" | "euclidean_normalized" | "jaccard" | "hybrid";
  distance_range: [0, 1];
  no_pattern_fallback: "novelty_one" | "indeterminate" | "use_domain_baseline";
  calibration_dataset_ref: StorageRef;
  threshold_default: number;
  threshold_confidence_ref?: StorageRef;
  metric_semantics_version: string;
  schema_version: "1.0";
}

interface NoveltyAssessment {
  input_signature_hash: string;
  closest_pattern_id?: string;
  closest_pattern_distance: number;
  similarity_score: number;
  novelty_score: number;
  forces_fresh_reasoning: boolean;
  triggers_task_agent: boolean;
  metric_spec_ref: string;
  metric_semantics_version: string;
  schema_version: 2;
}
```
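
The schema names the score fields but not how they relate. One plausible reading, sketched below, takes `novelty_score` as the distance to the closest known pattern, `similarity_score` as its complement, and sets `forces_fresh_reasoning` when novelty reaches `threshold_default`. Those relationships, the clamping of cosine distance into `[0, 1]`, and the helper names are all assumptions, not spec text.

```typescript
// Sketch under assumptions: novelty = closest distance, similarity = 1 - distance.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector has no direction");
  const cosine = dot / (Math.sqrt(normA) * Math.sqrt(normB));
  // Clamp into the declared distance_range of [0, 1].
  return Math.min(1, Math.max(0, 1 - cosine));
}

interface NoveltyScores {
  closest_pattern_distance: number;
  similarity_score: number;
  novelty_score: number;
  forces_fresh_reasoning: boolean;
}

function assessNovelty(
  input: number[],
  knownPatterns: number[][],
  thresholdDefault: number
): NoveltyScores {
  // no_pattern_fallback = "novelty_one": with no stored patterns, treat the input as fully novel.
  const distance =
    knownPatterns.length === 0
      ? 1
      : Math.min(...knownPatterns.map(p => cosineDistance(input, p)));
  return {
    closest_pattern_distance: distance,
    similarity_score: 1 - distance,
    novelty_score: distance,
    forces_fresh_reasoning: distance >= thresholdDefault,
  };
}
```

Whatever relationship the Formula Registry finally adopts, it should be stated once so `similarity_score` and `novelty_score` cannot drift apart between producers.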

------

## Appendix M — Runtime recovery, cancel, and hard-call protocols

```ts
type DependencyRef =
  | { kind: "artifact"; artifact_ref: ArtifactRef }
  | { kind: "module_activation"; module_id: string; activation_seq?: number }
  | { kind: "port_emission"; module_id: string; port_id: string; activation_seq?: number }
  | { kind: "source_need"; need_id: string }
  | { kind: "human_answer"; request_id: string }
  | { kind: "capability"; capability_ref: CapabilityRef }
  | { kind: "forum_decision"; decision_id: string }
  | { kind: "external_tool_result"; tool_call_ref: StorageRef };

interface PendingDependencyInfo {
  outcome_id: string;
  dependencies: DependencyRef[];
  expected_producer_modules: string[];
  wait_timeout: number;
  on_timeout:
    | "fail_outcome"
    | "mark_indeterminate"
    | "escalate_human"
    | "ask_task_agent"
    | "abort_task"
    | "continue_without_outcome";
  dependency_optional: boolean;
  required_for_overall_pass: boolean;
  schema_version: 2;
}

type HardCallBlockingScope =
  | "entire_run"
  | "segment"
  | "artifact"
  | "outcome"
  | "module"
  | "side_effect_only";

interface HardCallPendingPolicy {
  hard_call_id: string;
  blocking_scope: HardCallBlockingScope;
  blocked_refs: string[];
  allowed_to_continue_refs: string[];
  context_visible_to_continuing_modules:
    | "none"
    | "hard_call_pending_summary"
    | "full_context_redacted";
  on_defer:
    | "continue_with_warning"
    | "pause_scope"
    | "abort_scope";
  on_timeout:
    | "escalate"
    | "abort"
    | "continue_with_warning";
  timeout_ms?: number;
  schema_version: "1.0";
}

interface TaskCancelProtocol {
  cancel_request_id: string;
  task_id: string;
  run_id: string;
  requested_by_ref: string;
  cancel_scope:
    | "entire_run"
    | "segment"
    | "module_activation"
    | "revision_plan"
    | "side_effect_intent";
  target_refs: string[];

  in_flight_handling:
    | "request_graceful_stop"
    | "preempt_immediately"
    | "finish_current_step_then_stop";

  side_effect_policy:
    | "do_not_cancel_executed_side_effects"
    | "cancel_unexecuted_intents"
    | "create_corrective_artifact";

  candidate_disposition:
    | "discard"
    | "retain_for_manual_review"
    | "orphan_until_reconciled";

  source_workspace_disposition:
    | "retain_records"
    | "mark_records_cancelled"
    | "rollback_if_uncommitted";

  learning_signal_policy:
    | "suppress_success_signals"
    | "emit_cancel_diagnostic_only"
    | "emit_full_signals_with_cancel_flag";

  user_receipt_ref?: StorageRef;
  created_at: ISO8601;
  schema_version: "1.0";
}

interface KnownGoodState {
  state_id: string;
  task_id: string;
  run_id: string;
  artifact_version_refs: StorageRef[];
  source_workspace_snapshot_ref?: StorageRef;
  evaluation_snapshot_refs: StorageRef[];
  accepted_by_ref?: string;
  reason_safe: string;
  created_at: ISO8601;
  schema_version: "1.0";
}
```
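
`HardCallPendingPolicy` carries both a `blocking_scope` and explicit ref lists, so a consumer needs a rule for combining them. The sketch below shows one possible check. The precedence rules, the `HardCallPolicyLite` and `isBlockedByHardCall` names, and the error code are hypothetical; the schema above does not define them.

```typescript
// Sketch only: the precedence rules below are assumptions, not spec text.
type HardCallBlockingScope =
  | "entire_run"
  | "segment"
  | "artifact"
  | "outcome"
  | "module"
  | "side_effect_only";

interface HardCallPolicyLite {
  blocking_scope: HardCallBlockingScope;
  blocked_refs: string[];
  allowed_to_continue_refs: string[];
}

function isBlockedByHardCall(policy: HardCallPolicyLite, ref: string): boolean {
  const blocked = policy.blocked_refs.includes(ref);
  const allowed = policy.allowed_to_continue_refs.includes(ref);

  // Assumed validation rule: a ref may not appear in both lists.
  if (blocked && allowed) {
    throw new Error("validation.hard_call_ref_in_both_lists"); // hypothetical error code
  }

  // Assumed: "entire_run" blocks every ref, regardless of the lists.
  if (policy.blocking_scope === "entire_run") return true;

  // Assumed: narrower scopes block only the refs they name.
  return blocked;
}
```

A usage example: with `blocking_scope: "module"` and `blocked_refs: ["module:drafting"]`, a hypothetical `module:research` ref would continue while `module:drafting` waits.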

------

## Appendix N — Side-effect intent candidates

```ts
interface SideEffectIntentCandidate {
  intent_id: string;
  task_id: string;
  run_id: string;
  side_effect_class: RevisionSideEffectClass;
  target_system_class: string;
  target_ref?: StorageRef;

  dry_run_payload_ref: StorageRef;
  rendered_preview_ref?: StorageRef;

  approval_required: boolean;
  approval_status:
    | "not_required"
    | "pending"
    | "approved"
    | "rejected"
    | "expired"
    | "blocked_by_policy";

  execution_policy_ref: StorageRef;
  policy_decision_refs: PolicyEvaluationRef[];

  state:
    | "draft"
    | "approved"
    | "executed"
    | "cancelled"
    | "blocked";

  execution_receipt_ref?: StorageRef;
  created_at: ISO8601;
  schema_version: "1.0";
}
```
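
The candidate carries two overlapping fields, `approval_status` and `state`, and the schema alone does not say which combinations permit execution. A minimal guard, assuming execution requires `state === "approved"` plus a compatible approval status, might look like the following. The rule itself and the `IntentLite` and `canExecuteIntent` names are assumptions.

```typescript
// Sketch only: the executability rule is an assumption, not spec text.
type ApprovalStatus =
  | "not_required"
  | "pending"
  | "approved"
  | "rejected"
  | "expired"
  | "blocked_by_policy";

type IntentState = "draft" | "approved" | "executed" | "cancelled" | "blocked";

interface IntentLite {
  approval_required: boolean;
  approval_status: ApprovalStatus;
  state: IntentState;
}

function canExecuteIntent(intent: IntentLite): boolean {
  // Only an approved, not-yet-executed intent is a candidate for execution.
  if (intent.state !== "approved") return false;

  // When approval is required, "not_required" is treated as inconsistent and refused.
  if (intent.approval_required) return intent.approval_status === "approved";

  return (
    intent.approval_status === "approved" ||
    intent.approval_status === "not_required"
  );
}
```

Making this guard explicit also exposes invalid pairs, such as `state: "approved"` with `approval_status: "expired"`, that the schema currently allows.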

------

## Appendix O — Professional reliance artifacts

```ts
interface EvaluationContractReview {
  review_id: string;
  task_id: string;
  run_id?: string;
  compiled_plan_ref: StorageRef;

  interpreted_goal: string;
  criteria_summary: string[];
  threshold_summary: string[];
  source_requirements: string[];
  required_capabilities: CapabilityRef[];
  hard_call_triggers: HardRevisionCallKind[];

  material_differences_from_preview?: string[];
  user_approval_required: boolean;
  approval_status: "pending" | "approved" | "rejected" | "edited" | "waived_by_policy";
  approval_ref?: StorageRef;

  created_at: ISO8601;
  schema_version: "1.0";
}

interface RevisionReviewPacket {
  packet_id: string;
  task_id: string;
  run_id: string;
  revision_plan_ref: StorageRef;

  before_artifact_version_ref: StorageRef;
  candidate_artifact_version_ref: StorageRef;

  semantic_diff_ref: StorageRef;
  finding_to_change_map: Record<string, string[]>;
  preservation_constraint_result_refs: StorageRef[];
  source_changes: SourceRecordRef[];
  revalidation_result_refs: StorageRef[];

  regression_risk_summary: string;
  reviewer_action:
    | "accept"
    | "reject"
    | "fork"
    | "request_changes"
    | "restore_known_good_state"
    | "no_user_review_required";

  created_at: ISO8601;
  schema_version: "1.0";
}

interface EvidencePackage {
  evidence_package_id: string;
  task_id: string;
  run_id: string;

  final_artifact_refs: StorageRef[];
  source_workspace_snapshot_ref: StorageRef;
  source_record_refs: SourceRecordRef[];
  evidence_anchor_refs: string[];

  claim_support_map: Array<{
    claim_ref: ClaimRef;
    supporting_anchor_refs: string[];
    contradicting_anchor_refs: string[];
    support_status:
      | "supported"
      | "partially_supported"
      | "unsupported"
      | "contradicted"
      | "not_checked";
  }>;

  unresolved_research_need_refs: string[];
  stale_or_unverified_source_refs: SourceRecordRef[];

  created_at: ISO8601;
  schema_version: "1.0";
}

interface BudgetNarrative {
  budget_narrative_id: string;
  task_id: string;
  run_id: string;

  planned_estimate_ref?: StorageRef;
  actual_cost_records: TaskCostRecord[];

  summary: string;
  logical_llm_calls: number;
  infrastructure_retries: number;
  local_compute_seconds: number;
  external_tool_cost_usd?: number;
  doc24_packet_assembly_cost?: CostEstimate;
  source_research_cost?: CostEstimate;

  skipped_optional_helpers: string[];
  preserved_non_degradable_modes: string[];
  degraded_modes_used: string[];
  quality_impact_summary: string;

  created_at: ISO8601;
  schema_version: "1.0";
}

interface TaskReliancePacket {
  packet_id: string;
  task_id: string;
  run_id: string;

  final_artifact_refs: StorageRef[];
  evaluation_chain_ids: string[];
  assurance_summary_ref: StorageRef;
  unresolved_limitations: EvaluationLimitationKind[];

  evidence_package_ref?: StorageRef;
  revision_review_packet_refs: StorageRef[];
  hard_call_resolution_refs: StorageRef[];
  policy_decision_refs: PolicyEvaluationRef[];
  budget_narrative_ref?: StorageRef;
  known_good_state_refs: string[];

  reliance_status:
    | "safe_to_rely_within_scope"
    | "rely_with_limitations"
    | "not_safe_to_rely"
    | "human_review_required";

  reliance_scope: string;
  user_visible_summary: string;
  created_at: ISO8601;
  schema_version: "1.0";
}
```

------

## Appendix P — TIE and loop effectiveness additions

```ts
type LearningScope =
  | "user_only"
  | "team_eligible"
  | "firm_eligible"
  | "public_eligible";

interface EvaluationLearningSignalEnvelopeExtension {
  principal_id: PrincipalRef;
  learning_scope: LearningScope;
  scope_inference_basis:
    | "user_explicit"
    | "policy_inferred"
    | "default";
  default_scope_rule: string;
  share_eligibility:
    | "not_shareable"
    | "eligible_after_review"
    | "eligible_after_redaction"
    | "shareable";
}

interface LoopEffectivenessTestRunRecord {
  test_run_id: string;
  task_id: string;
  run_id: string;

  original_artifact_ref: StorageRef;
  revised_artifact_ref: StorageRef;

  branch_a_judge_result_ref: StorageRef;
  branch_b_evaluator_result_ref: StorageRef;
  branch_b_revisor_plan_ref: StorageRef;
  branch_b_judge_result_ref: StorageRef;

  judge_score_original: number;
  judge_score_revised: number;
  score_delta_by_dimension: Record<string, number>;

  loop_iterations: number;
  cost_total: CostEstimate;
  outcome:
    | "improved"
    | "no_effect"
    | "worsened"
    | "inconclusive";

  consumed_by_bDSM: boolean;
  consumed_by_tie: boolean;
  created_at: ISO8601;
  schema_version: "1.0";
}

interface ImprovementIssue {
  issue_id: string;
  detected_at: ISO8601;
  severity: "low" | "medium" | "high" | "critical";
  pattern_summary: string;
  evidence_signal_ids: string[];
  threshold_crossed: string;
  routed_to_tie: boolean;
  routed_at?: ISO8601;

  principal_id: PrincipalRef;
  learning_scope: LearningScope;
  schema_version: "1.0";
}

interface DiagnosticImprovementRecommendation {
  rec_id: string;
  issue_id: string;
  tier: 1 | 2 | 3 | 4;

  observed_pattern: string;
  diagnostic_interpretation: string;
  evidence_signal_ids: string[];

  recommended_changes: Array<{
    change_kind:
      | "rubric_refinement"
      | "outcome_definition_refinement"
      | "pattern_primitive_emergence"
      | "stated_values_update"
      | "memory_injection_rule"
      | "configuration_change"
      | "strategy_selection_update"
      | "model_class_change"
      | "sub_agent_dispatch_change"
      | "task_graph_topology_change"
      | "schema_change"
      | "code_change"
      | "new_module_proposal";
    change_target: string;
    change_specification_ref: StorageRef;
    confidence: number;
    expected_impact: string;
  }>;

  produced_at: ISO8601;
  produced_by_model: string;
  schema_version: "1.0";
}

interface TieAnalysisRecord {
  analysis_id: string;
  recommendation_ref: string;
  reviewed_by_models: StorageRef[];
  refined_recommendation_ref: string;
  user_gate_status:
    | "pending"
    | "skipped"
    | "approved"
    | "rejected"
    | "revised";
  user_revision_notes?: string;
  schema_version: "1.0";
}

interface ImplementationProposal {
  proposal_id: string;
  analysis_id: string;
  change_kind: string;
  proposed_artifact_changes: StorageRef[];
  proposed_code_changes?: StorageRef[];
  generated_by_agent: string;
  user_approval_status:
    | "pending"
    | "approved"
    | "rejected"
    | "approved_with_modifications";
  applied_at?: ISO8601;
  schema_version: "1.0";
}

interface ImprovementOutcomeRecord {
  proposal_id: string;
  applied_at: ISO8601;
  baseline_metrics_ref: StorageRef;
  post_change_metrics_ref: StorageRef;
  outcome: "resolved" | "improved" | "no_effect" | "worsened";
  tie_accuracy_signal: number;
  outcome_assessed_at: ISO8601;
  schema_version: "1.0";
}
```
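
`LoopEffectivenessTestRunRecord.outcome` has to be derived from the two judge scores somehow. The sketch below shows one way, using a dead band so that small score noise counts as `no_effect`. The `epsilon` default of `0.01`, the non-finite-score rule, and the `classifyLoopOutcome` name are assumptions.

```typescript
// Sketch only: the dead-band width and the inconclusive rule are assumptions.
type LoopOutcome = "improved" | "no_effect" | "worsened" | "inconclusive";

function classifyLoopOutcome(
  judgeScoreOriginal: number,
  judgeScoreRevised: number,
  epsilon: number = 0.01
): LoopOutcome {
  // A missing or non-numeric score cannot support a directional claim.
  if (!Number.isFinite(judgeScoreOriginal) || !Number.isFinite(judgeScoreRevised)) {
    return "inconclusive";
  }
  const delta = judgeScoreRevised - judgeScoreOriginal;
  // Changes inside the dead band are treated as noise.
  if (Math.abs(delta) <= epsilon) return "no_effect";
  return delta > 0 ? "improved" : "worsened";
}
```

If the Formula Registry adopts a rule like this, the dead-band value belongs in a versioned config rather than in code, so that historical records stay comparable.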

------

# Synthesis

## Overall assessment

The Addenda B family has the right architectural instincts. It rejects hidden autonomous orchestration; it keeps DOC23 graph primacy; it separates LLM planning from deterministic dispatch; it treats candidate versions, receipts, snapshots, taint, and policy as first-class; and it moves toward inspectable evaluation/revision rather than free-form agent loops. Those are the correct foundations for high-stakes professional work.

The weakness is not ambition. The weakness is a lack of executable precision. Too many shared primitives are duplicated or version-skewed. Too many formulas are named but not defined. Too many state transitions are prose-only. Too many DOC23/DOC15/DOC24 boundaries rely on implicit understanding. Too many recovery cases cover only the happy path or are only partially specified. That is especially dangerous because ELNOR will be built by AI coding agents, and the specs themselves prohibit implementers from inventing missing mechanisms.

My final grade: **architecturally strong, contract-incomplete, not build-ready as a six-document set.** The next draft should not add major capability. It should harden every schema, formula, state, port, packet, route, receipt, and recovery rule until a coding agent can implement without interpretation.

## Top 5 must-fix before build

This list sets priority; it does not limit scope. The full backlog above should still be addressed.

1. **Canonical type-owner and schema normalization** — especially `EvaluationFinding`, `OutcomeRepairInstruction`, `ResearchNeed`, `EvaluationLearningSignalEnvelope`, `SourceWorkspace`, and `CostEstimate`.
2. **Canonical state and routing matrix** — outcome states, verdicts, terminality, indeterminate causes, feedback branches, UI labels, learning eligibility.
3. **Formula Registry** — QualityIndex, DimensionScore, criterion weights, task-fit scoring, template match, novelty, sub-agent metrics, cost calibration, taint aggregation.
4. **Runtime recovery and distributed-systems semantics** — idempotency per step kind, DAG-safe rolling hash, rollback after mismatch, parallel batch finalization, cancel, HardCall blocking scope, policy freshness, receipt failure.
5. **DOC23/DOC15/DOC24 packet and slot contracts** — InjectionSlotRegistry, compact card schemas, TaskRunContextPacket request/receipt/freshness/omissions, command registry, delivery receipts.

## Biggest conceptual opportunity

The biggest product-level improvement is the **Professional Reliance Layer**: `EvaluationContractReview` before execution, `RevisionReviewPacket` during revision, `EvidencePackage` from Source Workspace, `KnownGoodState` for recovery, `BudgetNarrative` for cost/quality tradeoffs, and `TaskReliancePacket` at completion. This converts internal correctness machinery into something a high-stakes professional can actually audit and rely on.

## Coding-agent readiness

A coding agent could implement isolated portions of these specs today, but it could not implement the full set without guessing. The largest sources of required invention are:

- duplicate and conflicting schemas;
- missing formula definitions;
- state/version skew;
- route-policy underspecification;
- Pattern C envelope mismatch and chain ambiguity;
- feedback delivery vs `revision_in` safety conflict;
- DOC24 packet assembly and injection-slot gaps;
- incomplete cancel/skip/HardCall/indeterminate semantics;
- missing delivery/context/packet receipts;
- Source Workspace reference-type and taint aggregation gaps;
- unowned or pending cross-doc obligations.

The next revision should be a **full hardening pass**, not a selective bugfix pass. Every bug and underspecification above should either be fixed, explicitly parked with degraded behavior, or assigned to a named owner-doc insertion with validation failure behavior until it lands.

Yes — after auditing the consolidated response against the prior turns, I would add an explicit **Audit Addendum**. The prior response covered the main architecture, seam defects, formula gaps, and most of the proposed schemas, but it softened or omitted several narrower issues from the later deep passes, especially around revalidation naming, skip semantics, source snapshot hashing, metric schemas, Run Guidance receipts, forum decision payloads, and Claude’s self-learning additions. The prompt’s standard is “every bug” and “every missing piece,” so these should be included rather than assumed.

You can paste the following as an addition after the main findings and before the Synthesis, or as “Audit Addendum — Additional Findings and Amendment Schemas.”

------

# Audit Addendum — Additional Findings and Amendments

## Audit note

The consolidated red-team response is directionally complete, but this audit identifies several findings and paste-ready schema additions that should be explicitly added so the revision chat does not miss them. These are not a replacement for the prior response; they are amendments to it.

------

## [BUG] [HIGH] [V3.3.1 §7.5 / §11.7] — Revalidation policy is named inconsistently

**Finding:** The revision-step base layer uses `revalidation_trigger`, while the mutation protocol and dispatch language refer to `step.revalidation_expectation`. The module-level typed instruction also carries revalidation expectations, creating three possible sources of truth.

**Why it matters:** The Dispatcher cannot deterministically know which field governs revalidation scheduling. A coding agent may read the wrong field or implement multiple conflicting fields.

**Recommendation:** Replace all local variants with one base-level field:

```ts
type RevisionStepRevalidationPolicy =
  | "none"
  | "revalidate_targeted_outcomes"
  | "revalidate_declared_dependents"
  | "revalidate_full_closure"
  | "revalidate_full_task";

interface RevisionPlanStepBase {
  step_id: string;
  step_kind: RevisionPlanStepKind;
  depends_on_step_ids: string[];
  target_refs: StorageRef[];
  revalidation_policy: RevisionStepRevalidationPolicy;
  revalidation_rationale: string;
  schema_version: "1.1";
}
```

Module-level `TypedRevisionInstruction.revalidation_expectation` should either be removed or marked advisory only. The Dispatcher should read only `RevisionPlanStepBase.revalidation_policy`.
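
A single governing field also lets the Dispatcher resolve revalidation targets with an exhaustive switch. The sketch below is one way to do that. `RevalidationContext` and its four id lists are hypothetical inputs assumed to be precomputed elsewhere, and including the targeted outcomes under `revalidate_declared_dependents` is an assumption.

```typescript
type RevisionStepRevalidationPolicy =
  | "none"
  | "revalidate_targeted_outcomes"
  | "revalidate_declared_dependents"
  | "revalidate_full_closure"
  | "revalidate_full_task";

// Hypothetical context: outcome-id sets the Dispatcher is assumed to have precomputed.
interface RevalidationContext {
  targetedOutcomeIds: string[];
  declaredDependentIds: string[];
  fullClosureIds: string[];
  allTaskOutcomeIds: string[];
}

function outcomesToRevalidate(
  policy: RevisionStepRevalidationPolicy,
  ctx: RevalidationContext
): string[] {
  switch (policy) {
    case "none":
      return [];
    case "revalidate_targeted_outcomes":
      return ctx.targetedOutcomeIds.slice();
    case "revalidate_declared_dependents":
      // Assumption: dependents are revalidated in addition to the targeted outcomes.
      return Array.from(new Set(ctx.targetedOutcomeIds.concat(ctx.declaredDependentIds)));
    case "revalidate_full_closure":
      return ctx.fullClosureIds.slice();
    case "revalidate_full_task":
      return ctx.allTaskOutcomeIds.slice();
    default: {
      // Compile-time exhaustiveness check: adding a policy value without a case fails here.
      const unreachable: never = policy;
      throw new Error(`unknown revalidation policy: ${String(unreachable)}`);
    }
  }
}
```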

**Reference:** V3.3.1 revision plan and revalidation cascade.

------

## [BUG] [HIGH] [V3.3.1 §5.16 / Source Workspace] — Source workspace snapshot hashes are typed as artifact hashes

**Finding:** `EvaluationSnapshot.source_workspace_head_hashes` is typed as `Record<ArtifactRef, string>`, but the values being snapshotted are Source Workspace heads, source records, source sets, research queues, and verification records — not artifact refs. The current field appears in the snapshot schema alongside artifact hashes, creating an invalid type model.

**Why it matters:** Snapshot consumers cannot reliably compare source-workspace state if workspace hashes are keyed by artifact refs. This breaks audit, revalidation, live-edit checks, and source freshness detection.

**Recommendation:** Replace with a workspace-native snapshot hash structure:

```ts
interface SourceWorkspaceSnapshotHashSet {
  source_workspace_ref: string;
  source_workspace_head_hash: string;
  source_record_hashes: Record<SourceRecordRef, string>;
  source_set_hashes?: Record<SourceSetRef, string>;
  research_need_queue_hash?: string;
  verification_record_hashes?: Record<string, string>;
  freshness_record_hashes?: Record<string, string>;
  run_guidance_hashes?: Record<string, string>;
  schema_version: "1.0";
}

interface EvaluationSnapshot {
  // existing fields...

  source_workspace_state_ref: StorageRef;
  source_workspace_snapshot_hashes: SourceWorkspaceSnapshotHashSet[];

  // remove:
  // source_workspace_head_hashes: Record<ArtifactRef, string>
}
```

**Reference:** V3.3.1 `EvaluationSnapshot`; Source Workspace `TaskSourceWorkspace`.

------

## [BUG] [HIGH] [V3.3.1 §5.5.4 / §5.7] — ProgressSignal matching references fields missing from EvaluationFinding

**Finding:** `ProgressSignalRecord` compares findings by `(failure_kind, target_artifact_section_ref, finding_summary_hash)`, but V3.3.1 `EvaluationFinding` does not carry those fields directly.

**Why it matters:** Repeated-failure detection, same-reason classification, and strategy switching depend on match keys. Without an explicit match-key projection, implementers must invent hashing and scope rules.

**Recommendation:** Add a derived `FindingMatchKey` to canonical `EvaluationFinding`:

```ts
interface FindingMatchKey {
  failure_kind: FailureKind;
  target_scope_ref: ArtifactScopeRef | null;
  finding_summary_hash: string;
  normalized_finding_text_hash: string;
  criterion_id?: string;
  evidence_signature_hash?: string;
  schema_version: "1.0";
}

interface EvaluationFinding {
  // existing canonical fields...
  match_key: FindingMatchKey;
}
```

Add deterministic rules:

```text
finding_summary_hash = sha256(normalized one-sentence finding summary)
normalized_finding_text_hash = sha256(lowercase + punctuation-stripped finding_text)
target_scope_ref = ArtifactScopeRef when available; null only for global findings
failure_kind must be assigned before ProgressSignal computation
```
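
The hash rules above could be implemented roughly as follows. This is a sketch using Node's built-in `crypto` module. The exact normalization (ASCII punctuation only, whitespace collapsed and trimmed) is an assumption, since the rules do not define it, and `sha256Hex`, `normalizeFindingText`, and `normalizedFindingTextHash` are illustrative names.

```typescript
import { createHash } from "node:crypto";

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Assumed normalization: lowercase, strip ASCII punctuation, collapse whitespace.
// Non-ASCII punctuation is left untouched; a real rule would need to decide on it.
function normalizeFindingText(findingText: string): string {
  return findingText
    .toLowerCase()
    .replace(/[!-\/:-@\[-`{-~]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

function normalizedFindingTextHash(findingText: string): string {
  return sha256Hex(normalizeFindingText(findingText));
}

// Two renderings of the same finding produce the same match key component.
const hashA = normalizedFindingTextHash("Missing citation!");
const hashB = normalizedFindingTextHash("  missing   CITATION ");
const sameFinding = hashA === hashB;
```

Whatever normalization is chosen, it should be versioned with the `FindingMatchKey` schema, because changing it silently breaks repeated-failure matching against older findings.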

**Reference:** V3.3.1 ProgressSignal and EvaluationFinding.

------

## [BUG] [HIGH] [Common Contracts §3.1] — `overall_state` is too Addenda-B-specific for all producers

**Finding:** `EvaluationResultEnvelope.overall_state` is required, but Common Contracts allows producers such as Judge, deterministic scorers, human review, and agent review gates. Not all of those producers naturally emit the full Addenda B `OutcomeEvaluationState` enum.

**Why it matters:** Non-Addenda-B producers will either fake Addenda B states or silently implement incompatible mappings.

**Recommendation:** Make `overall_state` nullable for non-Addenda-B producers and require `producer_state_mapping` when populated by an adapter:

```ts
interface ProducerStateMapping {
  producer_kind: ProducerKind;
  native_state_value?: string;
  mapped_outcome_state?: OutcomeEvaluationState;
  mapping_policy_ref: string;
  mapping_confidence:
    | "exact"
    | "lossy"
    | "not_applicable";
  schema_version: "1.0";
}

interface EvaluationResultEnvelope {
  // existing fields...
  evaluation_verdict: EvaluationVerdict;
  result_lifecycle_status: ResultLifecycleStatus;

  overall_state?: OutcomeEvaluationState;
  producer_state_mapping?: ProducerStateMapping;
}
```

**Reference:** Common Contracts producer model and envelope schema.
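
The nullable-state rule only helps if it is validated. Below is a minimal sketch of such a check. The `is_addenda_b_native` flag, the error codes, the `EnvelopeLite` shape, and the requirement that `mapped_outcome_state` equal `overall_state` are assumptions introduced for illustration.

```typescript
// Sketch only: the flag, error codes, and consistency rule are hypothetical.
interface ProducerStateMappingLite {
  mapped_outcome_state?: string;
  mapping_confidence: "exact" | "lossy" | "not_applicable";
}

interface EnvelopeLite {
  is_addenda_b_native: boolean; // hypothetical flag distinguishing native producers
  overall_state?: string;
  producer_state_mapping?: ProducerStateMappingLite;
}

function validateOverallState(envelope: EnvelopeLite): string[] {
  const errors: string[] = [];

  if (envelope.is_addenda_b_native) {
    // Assumed: native Addenda B producers must still populate overall_state.
    if (envelope.overall_state === undefined) {
      errors.push("validation.overall_state_missing_for_native_producer");
    }
    return errors;
  }

  // Non-native producers may omit overall_state entirely.
  if (envelope.overall_state !== undefined) {
    const mapping = envelope.producer_state_mapping;
    if (mapping === undefined) {
      errors.push("validation.producer_state_mapping_missing");
    } else if (mapping.mapped_outcome_state !== envelope.overall_state) {
      errors.push("validation.producer_state_mapping_mismatch");
    }
  }
  return errors;
}
```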

------

## [GAP] [HIGH] [Skip semantics / V3.3.1 §21] — Add TaskSkipProtocol and skip receipts

**Finding:** The prior response covered cancel and HardCall scope, but did not fully specify skip semantics. V3.3.1 has `GateSkippability`, but a skip operation needs scope, consequences, receipts, learning rules, and downstream invalidation.

**Why it matters:** “Skip” can mean “ignore this optional review,” “accept risk,” “abort plan,” or “continue without required evidence.” Those are materially different.

**Recommendation:** Add `TaskSkipProtocol` and `TaskSkipReceipt`:

```ts
type SkipScope =
  | "human_gate"
  | "hard_call"
  | "evaluation_outcome"
  | "revision_step"
  | "source_need"
  | "verification_request"
  | "forum_decision"
  | "optional_subagent_review";

type SkipConsequence =
  | "no_consequence"
  | "continue_with_warning"
  | "risk_acceptance_required"
  | "mark_indeterminate"
  | "blocks_reliance"
  | "aborts_plan"
  | "aborts_run";

interface TaskSkipProtocol {
  skip_policy_id: string;
  skip_scope: SkipScope;
  target_ref: string;
  skippability: GateSkippability;
  default_consequence: SkipConsequence;
  requires_user_confirmation: boolean;
  requires_risk_acceptance_receipt: boolean;
  downstream_state_after_skip:
    | OutcomeEvaluationState
    | RevisionPlanStatus
    | "unchanged";
  learning_signal_policy:
    | "diagnostic_only"
    | "suppress"
    | "emit_user_preference_signal";
  schema_version: "1.0";
}

interface TaskSkipReceipt {
  receipt_id: string;
  task_id: string;
  run_id: string;
  skipped_by_ref: string;
  skip_policy_ref: string;
  skip_scope: SkipScope;
  target_ref: string;
  consequence_applied: SkipConsequence;
  risk_acceptance_text?: string;
  downstream_refs_invalidated: string[];
  created_at: ISO8601;
  schema_version: "1.0";
}
```
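
As a sketch of how the two schemas would interact, the function below refuses a skip that lacks the confirmation or risk-acceptance text its policy demands, and otherwise emits a reduced receipt. `SkipRequestLite`, the trimmed-down policy and receipt shapes, the `applySkip` name, and the error codes are hypothetical; identifiers, timestamps, and downstream invalidation are omitted.

```typescript
type SkipConsequence =
  | "no_consequence"
  | "continue_with_warning"
  | "risk_acceptance_required"
  | "mark_indeterminate"
  | "blocks_reliance"
  | "aborts_plan"
  | "aborts_run";

interface SkipPolicyLite {
  skip_policy_id: string;
  default_consequence: SkipConsequence;
  requires_user_confirmation: boolean;
  requires_risk_acceptance_receipt: boolean;
}

// Hypothetical request shape: what the caller supplies when asking to skip.
interface SkipRequestLite {
  skipped_by_ref: string;
  user_confirmed: boolean;
  risk_acceptance_text?: string;
}

interface SkipReceiptLite {
  skip_policy_ref: string;
  skipped_by_ref: string;
  consequence_applied: SkipConsequence;
  risk_acceptance_text?: string;
}

function applySkip(policy: SkipPolicyLite, request: SkipRequestLite): SkipReceiptLite {
  if (policy.requires_user_confirmation && !request.user_confirmed) {
    throw new Error("validation.skip_requires_user_confirmation");
  }
  const acceptance = (request.risk_acceptance_text ?? "").trim();
  if (policy.requires_risk_acceptance_receipt && acceptance.length === 0) {
    throw new Error("validation.skip_requires_risk_acceptance");
  }
  const receipt: SkipReceiptLite = {
    skip_policy_ref: policy.skip_policy_id,
    skipped_by_ref: request.skipped_by_ref,
    consequence_applied: policy.default_consequence,
  };
  // Record the acceptance text only when the user actually supplied one.
  if (acceptance.length > 0) receipt.risk_acceptance_text = acceptance;
  return receipt;
}
```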

------

## [GAP] [HIGH] [Feedback Delivery §4] — Run Guidance needs promotion and lifecycle receipts

**Finding:** Feedback Delivery defines `RunGuidanceCandidate` → `RunGuidanceItem` promotion and anti-drift filtering, but no receipt records the promotion, rejection, contest, expiry, or supersession of guidance. Feedback-derived guidance is later selected by DOC15/DOC24 into prompts, so its lifecycle needs auditability.

**Why it matters:** Without guidance receipts, the system cannot explain why a downstream module saw a guidance item, why it did not see one, or why stale guidance disappeared.

**Recommendation:** Add:

```ts
type RunGuidanceTransitionKind =
  | "candidate_created"
  | "promoted_to_active"
  | "user_approved"
  | "tool_verified"
  | "contested"
  | "superseded"
  | "expired"
  | "rejected"
  | "scope_changed";

interface RunGuidanceLifecycleReceipt {
  receipt_id: string;
  guidance_id: string;
  source_candidate_id?: string;
  source_finding_refs: string[];

  transition_kind: RunGuidanceTransitionKind;
  prior_lifecycle_state?: string;
  new_lifecycle_state: string;

  performed_by_ref: string;
  basis:
    | "routing_policy"
    | "user_action"
    | "tool_verification"
    | "source_update"
    | "artifact_version_change"
    | "timeout"
    | "manual_admin";

  scope_before?: string[];
  scope_after?: string[];

  doc15_visibility_after_transition:
    | "eligible_for_prompt"
    | "audit_only"
    | "suppressed";
  doc24_packet_visibility_after_transition:
    | "eligible_for_packet"
    | "audit_only"
    | "suppressed";

  created_at: ISO8601;
  schema_version: "1.0";
}
```

------

## [GAP] [HIGH] [Forum §3.1] — `decision_out` and `signal_out` need payload schemas

**Finding:** Task Forum defines `decision_out` and `signal_out` ports, but the consolidated response only flagged the issue; it did not provide the actual payload schemas. The port list names the outputs, but not what they emit.

**Why it matters:** Plan approval, plan rejection, moderator decisions, forum escalations, and learning signals will become untyped.

**Recommendation:** Add:

```ts
type ForumDecisionKind =
  | "plan_approved"
  | "plan_rejected"
  | "plan_needs_revision"
  | "guidance_accepted"
  | "guidance_rejected"
  | "research_need_routed"
  | "repair_instruction_routed"
  | "human_answer_recorded"
  | "moderator_escalation"
  | "graph_patch_recommended"
  | "custom";

interface ForumDecision {
  decision_id: string;
  task_id: string;
  run_id: string;
  forum_module_id?: string;

  decision_kind: ForumDecisionKind;
  target_ref: StorageRef | string;
  target_kind:
    | "revision_plan"
    | "run_guidance"
    | "research_need"
    | "repair_instruction"
    | "module_assistance_request"
    | "board_post"
    | "graph_patch"
    | "custom";

  decided_by_kind:
    | "user"
    | "task_agent"
    | "domain_moderator"
    | "deterministic_router"
    | "system";
  decided_by_ref: string;

  decision:
    | "approved"
    | "rejected"
    | "needs_revision"
    | "routed"
    | "deferred"
    | "escalated";

  rationale: string;
  evidence_refs: StorageRef[];
  resulting_action_refs: StorageRef[];

  governance: RunBoardGovernanceEnvelope;
  created_at: ISO8601;
  schema_version: "1.0";
}

type ForumSignalKind =
  | "coordination_failure"
  | "unanswered_request"
  | "routing_conflict"
  | "moderator_disagreement"
  | "stale_context_detected"
  | "participant_unavailable"
  | "forum_overload"
  | "user_attention_overload"
  | "learning_candidate";

interface ForumSignal {
  signal_id: string;
  task_id: string;
  run_id: string;
  forum_module_id?: string;

  signal_kind: ForumSignalKind;
  severity: "low" | "medium" | "high" | "blocking";
  source_post_refs: string[];
  source_request_refs: string[];
  payload_ref: StorageRef;

  recommended_action:
    | "none"
    | "notify_user"
    | "ask_task_agent"
    | "pause_run"
    | "route_to_human"
    | "emit_learning_signal";

  governance: RunBoardGovernanceEnvelope;
  created_at: ISO8601;
  schema_version: "1.0";
}
```

------

## [GAP] [HIGH] [Run Board §1 / §5] — Add RunBoardRetentionPolicy, CompactionPolicy, and EventClass

**Finding:** The Run Board auto-records every major task event, and every system event auto-publishes to the passive board. The consolidated response noted privacy and volume risk but did not include a complete retention/compaction schema.

**Why it matters:** Long-running and complex tasks will produce massive event streams. Without retention and compaction, the board becomes expensive, noisy, and potentially privacy-leaky.

**Recommendation:** Add:

```ts
type RunBoardEventClass =
  | "audit_critical"
  | "user_visible"
  | "module_lifecycle"
  | "artifact_lifecycle"
  | "source_lifecycle"
  | "evaluation_lifecycle"
  | "feedback_lifecycle"
  | "cost_update"
  | "debug_only"
  | "policy_sensitive"
  | "privileged_metadata";

interface RunBoardRetentionPolicy {
  policy_id: string;
  task_id?: string;
  default_retention_days: number;
  per_event_class_retention_days: Partial<Record<RunBoardEventClass, number>>;
  privileged_retention_rule:
    | "matter_policy"
    | "retain_until_manual_purge"
    | "short_retention_then_summary";
  debug_event_retention_days: number;
  purge_requires_policy_decision: boolean;
  schema_version: "1.0";
}

interface RunBoardCompactionPolicy {
  policy_id: string;
  compaction_trigger:
    | "event_count_threshold"
    | "token_budget_threshold"
    | "run_completion"
    | "age_threshold"
    | "manual";
  event_count_threshold?: number;
  preserve_event_classes: RunBoardEventClass[];
  compact_event_classes: RunBoardEventClass[];

  compaction_output_kind:
    | "summary_post"
    | "digest_snapshot"
    | "audit_bundle_ref"
    | "external_archive_ref";

  original_events_retained: boolean;
  original_events_retention_days_after_compaction?: number;

  schema_version: "1.0";
}

interface RunBoardEventClassification {
  post_id: string;
  event_class: RunBoardEventClass;
  digest_eligible: boolean;
  user_visible_default: boolean;
  module_visible_default: boolean;
  audit_only: boolean;
  policy_suppressed: boolean;
  classification_rationale: string;
  schema_version: "1.0";
}
```
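
`RunBoardRetentionPolicy` has three places a retention period can come from, so resolution order matters. The sketch below assumes per-class overrides win, then the debug-specific field, then the default. That precedence, the reduced `RetentionPolicyLite` shape, and the helper names are assumptions.

```typescript
type RunBoardEventClass =
  | "audit_critical"
  | "user_visible"
  | "module_lifecycle"
  | "artifact_lifecycle"
  | "source_lifecycle"
  | "evaluation_lifecycle"
  | "feedback_lifecycle"
  | "cost_update"
  | "debug_only"
  | "policy_sensitive"
  | "privileged_metadata";

interface RetentionPolicyLite {
  default_retention_days: number;
  per_event_class_retention_days: Partial<Record<RunBoardEventClass, number>>;
  debug_event_retention_days: number;
}

// Assumed precedence: per-class override, then the debug-specific field, then the default.
function retentionDaysFor(policy: RetentionPolicyLite, eventClass: RunBoardEventClass): number {
  const override = policy.per_event_class_retention_days[eventClass];
  if (override !== undefined) return override;
  if (eventClass === "debug_only") return policy.debug_event_retention_days;
  return policy.default_retention_days;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// An event is past retention once its age in days exceeds the resolved period.
function isPastRetention(
  policy: RetentionPolicyLite,
  eventClass: RunBoardEventClass,
  createdAtMs: number,
  nowMs: number
): boolean {
  const ageDays = (nowMs - createdAtMs) / MS_PER_DAY;
  return ageDays > retentionDaysFor(policy, eventClass);
}
```

Note that `privileged_retention_rule` and `purge_requires_policy_decision` are deliberately left out here; an expired event under this sketch is only a purge candidate, not something to delete automatically.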

------

## [GAP] [HIGH] [Runtime / V3.3.1 §11] — Step idempotency needs concrete discriminated inputs

**Finding:** The consolidated response identified idempotency-key incompleteness but did not provide a full replacement schema. The current formula is not total across `module_revision`, `direct_fix`, `revalidate`, `information_request`, `verification_request`, `human_judgment_request`, `fork_from_checkpoint`, `wait`, and `no_op_record`.

**Why it matters:** Retry and crash recovery require idempotency keys that are stable and kind-specific.

**Recommendation:** Add:

```ts
type StepIdempotencyInput =
  | ModuleRevisionIdempotencyInput
  | DirectFixIdempotencyInput
  | RevalidationIdempotencyInput
  | InformationRequestIdempotencyInput
  | VerificationRequestIdempotencyInput
  | HumanJudgmentRequestIdempotencyInput
  | ForkFromCheckpointIdempotencyInput
  | WaitIdempotencyInput
  | NoOpRecordIdempotencyInput;

interface BaseStepIdempotencyInput {
  task_id: string;
  run_id: string;
  revision_plan_id: string;
  source_evaluation_result_id: string;
  step_id: string;
  step_kind: RevisionPlanStepKind;
  base_artifact_version_refs: StorageRef[];
  normalized_step_payload_hash: string;
  schema_version: "1.0";
}

interface ModuleRevisionIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "module_revision";
  target_module_id: string;
  module_revision_capability_ref: string;
  target_version_precondition_ref: StorageRef;
  typed_instruction_hash: string;
}

interface DirectFixIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "direct_fix";
  direct_fix_class: DirectFixAllowedClass;
  target_scope_refs: ArtifactScopeRef[];
  expected_content_hashes: Record<string, string>;
  direct_fix_payload_hash: string;
}

interface RevalidationIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "revalidate";
  target_outcome_ids: string[];
  evaluation_snapshot_refs: StorageRef[];
  revalidation_policy: RevisionStepRevalidationPolicy;
}

interface InformationRequestIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "information_request";
  request_text_hash: string;
  target_user_or_module_ref: string;
}

interface VerificationRequestIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "verification_request";
  verification_kind: string;
  target_refs: StorageRef[];
  required_source_refs: StorageRef[];
}

interface HumanJudgmentRequestIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "human_judgment_request";
  hard_call_kind: HardRevisionCallKind;
  judgment_prompt_hash: string;
}

interface ForkFromCheckpointIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "fork_from_checkpoint";
  checkpoint_ref: StorageRef;
  fork_reason_hash: string;
}

interface WaitIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "wait";
  dependency_refs: DependencyRef[];
  wait_timeout_ms: number;
}

interface NoOpRecordIdempotencyInput extends BaseStepIdempotencyInput {
  step_kind: "no_op_record";
  no_op_reason_hash: string;
}

interface StepIdempotencyKeyRecord {
  key_id: string;
  step_id: string;
  input: StepIdempotencyInput;
  idempotency_key: string; // sha256(canonical_json(input))
  created_at: ISO8601;
  schema_version: "1.0";
}
```
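
The `sha256(canonical_json(input))` comment above depends on a canonical serialization, which the schema does not define. The following is a minimal sketch of one way to do it, assuming recursive key sorting is an acceptable canonical form. `JsonValue`, `HashFn`, `canonicalJson`, and `makeIdempotencyKey` are hypothetical names introduced for illustration, and the hash function is injected so the sketch does not assume a particular crypto library.

```ts
// Illustrative sketch only: these names are hypothetical, not part of the specs.
type JsonValue =
  | string
  | number
  | boolean
  | null
  | JsonValue[]
  | { [key: string]: JsonValue };

type HashFn = (text: string) => string;

// Serialize with object keys sorted recursively, so two structurally equal
// inputs produce the same string regardless of property insertion order.
// Array order is preserved on purpose: it is treated as significant here.
function canonicalJson(value: JsonValue): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const members = Object.keys(value)
    .sort()
    .map((key) => JSON.stringify(key) + ":" + canonicalJson(value[key]));
  return "{" + members.join(",") + "}";
}

// The idempotency key is the hash of the canonical form of the step input.
function makeIdempotencyKey(input: JsonValue, hash: HashFn): string {
  return hash(canonicalJson(input));
}
```

In a Node.js runtime the injected function could wrap `createHash("sha256")` from `node:crypto`. Whether ref arrays such as `base_artifact_version_refs` should be sorted before hashing is left open by the schema; this sketch assumes they should not be.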

------

## [GAP] [HIGH] [Runtime / rolling hash] — Add DAG-safe rolling hash state

**Finding:** The consolidated response identified the linear rolling-hash problem but did not provide the replacement state structure.

**Why it matters:** Revision plans are DAGs; topological order is not unique. Rolling hash must validate read/write sets, not “previous step.”

**Recommendation:** Add:

```ts
interface ArtifactScopeHash {
  artifact_ref: StorageRef;
  artifact_version_ref: StorageRef;
  scope_ref?: ArtifactScopeRef;
  normalized_content_hash: string;
  hash_algorithm: "sha256";
}

interface RevisionStepReadWriteSet {
  step_id: string;
  reads: ArtifactScopeHash[];
  writes: ArtifactScopeHash[];
  declared_conflict_group?: string;
  schema_version: "1.0";
}

interface RollingHashDAGState {
  plan_id: string;
  base_hashes: ArtifactScopeHash[];
  step_read_write_sets: RevisionStepReadWriteSet[];

  merkle_root_before: string;
  merkle_root_after?: string;

  conflict_detection_policy:
    | "exact_scope_overlap"
    | "semantic_neighborhood_overlap"
    | "artifact_level_exclusive";

  on_hash_mismatch:
    | "freeze_and_rollback"
    | "manual_recovery_required";

  schema_version: "1.0";
}
```

Rule:

```text
A step validates against the hash of its declared read set at dispatch time.
A step produces hashes for its declared write set at completion.
The plan Merkle root is recomputed from all accepted write-set hashes.
Parallel steps may execute only if their write sets do not overlap under conflict_detection_policy.
```
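
The rule above can be sketched as three small checks. This is a simplified illustration, not the spec's implementation: refs are plain strings, only the `exact_scope_overlap` policy is modeled, Merkle-root recomputation is omitted, and a write with no `scope_ref` is assumed to cover the whole artifact. All names ending in `Sketch`, plus the helper functions, are hypothetical.

```ts
// Illustrative sketch only: simplified, hypothetical shapes.
interface ScopeHashSketch {
  artifact_ref: string;
  scope_ref?: string; // assumption: absent means "whole artifact"
  normalized_content_hash: string;
}

interface StepReadWriteSketch {
  step_id: string;
  reads: ScopeHashSketch[];
  writes: ScopeHashSketch[];
}

function scopeKey(h: ScopeHashSketch): string {
  return h.artifact_ref + "#" + (h.scope_ref ?? "*");
}

function scopesOverlap(a: ScopeHashSketch, b: ScopeHashSketch): boolean {
  if (a.artifact_ref !== b.artifact_ref) return false;
  if (a.scope_ref === undefined || b.scope_ref === undefined) return true;
  return a.scope_ref === b.scope_ref;
}

// Parallel execution is allowed only when this returns false.
function writeSetsOverlap(a: StepReadWriteSketch, b: StepReadWriteSketch): boolean {
  return a.writes.some((wa) => b.writes.some((wb) => scopesOverlap(wa, wb)));
}

// Dispatch-time check: every declared read must match the currently accepted hash.
function readSetIsCurrent(step: StepReadWriteSketch, current: Map<string, string>): boolean {
  return step.reads.every((r) => current.get(scopeKey(r)) === r.normalized_content_hash);
}

// Completion: accepted writes replace the current hashes for their scopes.
function applyWrites(step: StepReadWriteSketch, current: Map<string, string>): void {
  for (const w of step.writes) {
    current.set(scopeKey(w), w.normalized_content_hash);
  }
}
```

Because each step is validated against its own read set rather than against "the previous step", any topological order of non-conflicting steps passes the same checks.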

------

## [GAP] [HIGH] [Runtime / parallelism] — Add ParallelBatchFinalizationReceipt

**Finding:** The prior response described the risk of orphan candidates after parallel batch failure but did not include the receipt in the paste-ready schema package.

**Recommendation:** Add:

```ts
interface ParallelBatchFinalizationReceipt {
  receipt_id: string;
  task_id: string;
  run_id: string;
  revision_plan_id: string;
  batch_id: string;

  failed_step_ids: string[];
  completed_step_ids: string[];
  cancelled_step_ids: string[];
  still_running_step_ids_at_failure_detection: string[];

  candidate_dispositions: Record<
    CandidateArtifactVersionRef,
    | "discarded"
    | "retained_for_manual_review"
    | "blocked_from_downstream"
    | "accepted"
    | "orphaned_until_reconciled"
  >;

  final_batch_status:
    | "completed"
    | "partially_completed"
    | "aborted"
    | "manual_recovery_required";

  downstream_consumption_blocked: boolean;
  created_at: ISO8601;
  schema_version: "1.0";
}
```

------

## [GAP] [HIGH] [Budget / metrics] — Add BudgetFailureKind, CostEstimate, TaskCostRecord, and MetricKind

**Finding:** The consolidated response identified the budget and metric gaps but did not include enough executable schema for cost and metric reconciliation. V3.3.1 separates logical and infrastructure budgets, while Source Workspace, sub-agent metrics, local compute, DOC24 packet assembly, and external tools also consume cost.

**Recommendation:** Add:

```ts
type BudgetFailureKind =
  | "logical_llm_budget_exceeded"
  | "infrastructure_retry_budget_exceeded"
  | "local_compute_budget_exceeded"
  | "external_tool_cost_cap_exceeded"
  | "token_context_budget_exceeded"
  | "doc24_packet_budget_exceeded"
  | "source_research_budget_exceeded";

interface CostEstimate {
  usd_estimate?: number;
  usd_actual?: number;
  input_tokens?: number;
  output_tokens?: number;
  local_compute_seconds?: number;
  external_tool_cost_usd?: number;
  doc24_packet_tokens?: number;
  estimator_confidence: EstimatorConfidence;
  cost_basis: "estimate" | "actual" | "mixed";
  cost_dimension:
    | "llm"
    | "local_compute"
    | "external_tool"
    | "doc24_packet"
    | "source_research"
    | "infrastructure_retry"
    | "mixed";
  schema_version: "1.0";
}

interface TaskCostRecord {
  cost_record_id: string;
  task_id: string;
  run_id: string;
  module_id?: string;
  activation_seq?: number;
  subsystem:
    | "outcome_compiler"
    | "evaluator"
    | "revisor"
    | "revision_dispatcher"
    | "source_research"
    | "task_forum"
    | "doc24_packet_assembly"
    | "subagent"
    | "external_tool"
    | "infrastructure_retry";
  cost: CostEstimate;
  charged_to_budget_kind?: BudgetFailureKind;
  created_at: ISO8601;
  schema_version: "1.0";
}

type MetricKind =
  | "rate"
  | "mean"
  | "median"
  | "correlation"
  | "distribution"
  | "calibration_curve"
  | "count"
  | "ratio";

type MetricValue =
  | number
  | "undefined_insufficient_data"
  | "undefined_zero_denominator"
  | "not_applicable";

interface QualityMetricRecord {
  metric_id: string;
  metric_name: string;
  metric_kind: MetricKind;
  value: MetricValue;
  numerator?: number;
  denominator?: number;
  sample_size: number;
  minimum_sample_size: number;
  confidence_interval?: [number, number];
  zero_denominator_policy:
    | "undefined_zero_denominator"
    | "return_zero"
    | "return_one"
    | "fail_validation";
  metric_semantics_version: string;
  computed_at: ISO8601;
  schema_version: "1.0";
}
```
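
To show how `MetricValue`, `minimum_sample_size`, and `zero_denominator_policy` might interact, here is a minimal sketch for a ratio-style metric. The evaluation order (sample-size check first, then the zero-denominator policy) is an assumption; the schema does not specify it. `computeRatioMetric` and the `Sketch` type aliases are hypothetical names.

```ts
// Illustrative sketch only: the evaluation order below is an assumption.
type MetricValueSketch =
  | number
  | "undefined_insufficient_data"
  | "undefined_zero_denominator"
  | "not_applicable";

type ZeroDenominatorPolicySketch =
  | "undefined_zero_denominator"
  | "return_zero"
  | "return_one"
  | "fail_validation";

function computeRatioMetric(
  numerator: number,
  denominator: number,
  sampleSize: number,
  minimumSampleSize: number,
  policy: ZeroDenominatorPolicySketch
): MetricValueSketch {
  // Assumption: too little data is reported before any division is attempted,
  // so "return_zero" cannot disguise an empty sample as a real 0.
  if (sampleSize < minimumSampleSize) {
    return "undefined_insufficient_data";
  }
  if (denominator === 0) {
    switch (policy) {
      case "undefined_zero_denominator":
        return "undefined_zero_denominator";
      case "return_zero":
        return 0;
      case "return_one":
        return 1;
      case "fail_validation":
        throw new Error("zero denominator rejected by policy");
    }
  }
  return numerator / denominator;
}
```

Returning a tagged string instead of `NaN` or `0` keeps downstream consumers, such as `SubAgentReputationSlice.hit_rate`, from treating an undefined metric as a measured value.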

------

## [GAP] [HIGH] [Sub-agent dispatch] — Add SubAgentFallbackPolicy schema

**Finding:** The consolidated response flagged no-sub-agent fallback but did not include a full schema. The red-team prompt specifically asks whether no-sub-agent fallback is fully specified.

**Recommendation:** Add:

```ts
type SubAgentInvocationPoint =
  | "outcome_compiler"
  | "evaluator_specialist_lane"
  | "revision_compiler"
  | "feedback_interpreter"
  | "task_agent_design_review"
  | "forum_moderator";

type SubAgentFallbackAction =
  | "proceed_without"
  | "use_generalist"
  | "downgrade_assurance"
  | "ask_human"
  | "block_evaluation"
  | "graph_patch_required"
  | "route_to_task_agent";

interface SubAgentFallbackPolicy {
  policy_id: string;
  invocation_point: SubAgentInvocationPoint;
  required_subagent_profile_ref?: string;
  fallback_action: SubAgentFallbackAction;
  assurance_impact:
    | "none"
    | "minor"
    | "material"
    | "blocks_reliance";
  must_surface_to_user: boolean;
  must_record_in_compiled_plan: boolean;
  allowed_in_autonomous_mode: boolean;
  schema_version: "1.0";
}

interface SubAgentFallbackReceipt {
  receipt_id: string;
  task_id: string;
  run_id: string;
  invocation_point: SubAgentInvocationPoint;
  missing_subagent_profile_ref?: string;
  fallback_policy_ref: string;
  fallback_action_applied: SubAgentFallbackAction;
  assurance_impact: SubAgentFallbackPolicy["assurance_impact"];
  created_at: ISO8601;
  schema_version: "1.0";
}
```
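
One possible reading of how a policy turns into a receipt is sketched below. Two behaviors are assumptions that the schema does not state: when `allowed_in_autonomous_mode` is false during an autonomous run, the sketch substitutes `ask_human`; and a `blocks_reliance` impact is always surfaced to the user. `applyFallback` and the `Sketch` types are hypothetical and carry only the fields needed for the illustration.

```ts
// Illustrative sketch only: the substitution and surfacing rules are assumptions.
type FallbackActionSketch =
  | "proceed_without"
  | "use_generalist"
  | "downgrade_assurance"
  | "ask_human"
  | "block_evaluation"
  | "graph_patch_required"
  | "route_to_task_agent";

type AssuranceImpactSketch = "none" | "minor" | "material" | "blocks_reliance";

interface FallbackPolicySketch {
  policy_id: string;
  fallback_action: FallbackActionSketch;
  assurance_impact: AssuranceImpactSketch;
  must_surface_to_user: boolean;
  allowed_in_autonomous_mode: boolean;
}

interface FallbackOutcomeSketch {
  fallback_policy_ref: string;
  fallback_action_applied: FallbackActionSketch;
  assurance_impact: AssuranceImpactSketch;
  surface_to_user: boolean;
}

function applyFallback(
  policy: FallbackPolicySketch,
  autonomousMode: boolean
): FallbackOutcomeSketch {
  // Assumption: a policy that is not allowed in autonomous mode degrades to a human ask.
  const blockedByMode = autonomousMode && !policy.allowed_in_autonomous_mode;
  return {
    fallback_policy_ref: policy.policy_id,
    fallback_action_applied: blockedByMode ? "ask_human" : policy.fallback_action,
    assurance_impact: policy.assurance_impact,
    surface_to_user:
      policy.must_surface_to_user ||
      blockedByMode ||
      policy.assurance_impact === "blocks_reliance",
  };
}
```

The returned object carries the fields a `SubAgentFallbackReceipt` would need beyond identifiers and timestamps.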

------

## [BUG] [MEDIUM] [Core §3C] — `requires_background_progress` is overloaded

**Finding:** `requires_background_progress` appears as a positive task signal, but “background progress” can mean native OpenClaw background work, EC scheduled jobs, or saved-task formalization.

**Why it matters:** It may push requests into saved tasks when native OpenClaw or EC scheduler would be the better formalization level.

**Recommendation:** Replace it with three narrower signals:

```ts
type TaskUseSignal =
  // existing values (minus "requires_background_progress")...
  | "requires_progress_visibility"
  | "requires_independent_lifecycle"
  | "requires_reusable_graph";
```

Mapping:

```text
requires_progress_visibility → may be native OpenClaw progress or saved task
requires_independent_lifecycle → saved/tracked task likely
requires_reusable_graph → saved task/template likely
```

------

## [GAP] [MEDIUM] [V3.3.1 §5.17] — Missing claim extractor should not be reported as missing source

**Finding:** If criteria require extracted claim types and `claims_in` is not wired, V3.3.1 says the Compiler emits `needs_missing_source` or proposes a graph patch. The missing object may be an extraction module, not a source.

**Why it matters:** The UI may tell the user to add a source when the correct repair is to wire `step.claim_extractor`.

**Recommendation:** Add:

```ts
type CompiledEvaluationPlanStatus =
  // existing values...
  | "needs_missing_extraction"
  | "needs_missing_claim_extractor";
```

Default recovery:

```text
needs_missing_claim_extractor → propose graph patch adding step.claim_extractor upstream of step.evaluator.claims_in
needs_missing_source → request source material or source workspace update
```

------

# Additional self-learning amendments to preserve Claude-analysis coverage

The consolidated response included TIE, Loop Effectiveness, BDSM utility compilation, dampening, clustering, and multi-user fields, but it did not explicitly carry forward all of Claude’s recommended additions. Claude’s review specifically listed TIE, Loop Effectiveness, multi-user scope fields, UserConstitution, active learning, in-session self-critique, retrieval effectiveness, trust calibration, process pattern emergence, user attention signals, strategic-intent tags, privilege firewall design, cluster stability, PlanDiff, revealed-preference dampening, deterministic multi-prior conflict resolution, and DSPy metric architecture.

## [SELF-LEARNING] [HIGH] [Claude analysis] — Preserve strategic-intent learning without restoring goal-advancement self-grading

**Finding:** The consolidated response mentioned goal-learning sycophancy but did not explicitly preserve Claude’s proposed `strategic_intent_tag` axis. Claude’s point is important: structural outcome clusters do not replace strategic intent.

**Recommendation:** Add:

```ts
interface StrategicIntentTag {
  tag_id: string;
  task_id: string;
  run_id?: string;
  statement: string; // free text, optionally normalized to vocabulary later
  source:
    | "user_declared"
    | "task_agent_inferred"
    | "tie_recommended"
    | "template_default";
  confidence: number;
  governance: {
    principal_id: PrincipalRef;
    learning_scope: LearningScope;
    matter_id?: string;
    data_class: "public" | "internal" | "privileged" | "local_only";
  };
  created_at: ISO8601;
  schema_version: "1.0";
}

interface PatternPerformanceSliceContextSignature {
  // existing fields...
  outcome_cluster_id?: string;
  strategic_intent_tag_ids: string[];
}
```

Rule:

```text
Do not restore comparative-judge self-graded goal_advancement_count.
Do preserve strategic intent as a user-declared or reviewed conditioning axis.
Aggregate pattern performance by (outcome_cluster_id, strategic_intent_tag_id) when both are available.
```

------

## [SELF-LEARNING] [HIGH] [Claude analysis] — Add UserConstitution as stated-values prior

**Finding:** The consolidated response added learning-scope fields but did not include Claude’s `UserConstitution` artifact. This matters because stated values should outrank revealed-preference priors when they conflict.

**Recommendation:** Add:

```ts
type ConstitutionOrigin =
  | "user_authored"
  | "tie_recommended_and_user_accepted"
  | "task_agent_extracted_and_user_confirmed"
  | "imported_from_policy";

interface UserConstitutionValue {
  value_id: string;
  domain: string;
  statement: string;
  scope: LearningScope;
  origin: ConstitutionOrigin;
  evidence_refs: StorageRef[];
  captured_at: ISO8601;
  last_reviewed_at?: ISO8601;
  confidence: number;
}

interface UserConstitutionPreference {
  preference_id: string;
  domain: string;
  statement: string;
  scope: LearningScope;
  origin: ConstitutionOrigin;
  evidence_refs: StorageRef[];
  captured_at: ISO8601;
  confidence: number;
}

interface UserConstitution {
  constitution_id: string;
  principal_id: PrincipalRef;
  values: UserConstitutionValue[];
  preferences: UserConstitutionPreference[];
  conflict_resolution_policy_ref?: StorageRef;
  schema_version: "1.0";
}
```

Rule:

```text
UserConstitution values outrank revealed-preference priors.
They are not time-decayed unless explicitly superseded or edited by the user.
```

------

## [SELF-LEARNING] [HIGH] [Claude analysis] — Add deterministic multi-prior conflict resolution

**Finding:** The consolidated response did not explicitly add the multi-prior conflict-resolution policy. Claude flagged that leaving conflicts to LLM inference causes nondeterministic prompt behavior.

**Recommendation:** Add:

```ts
type PriorKind =
  | "user_constitution"
  | "explicit_user_instruction"
  | "task_template_default"
  | "strategic_intent_tag"
  | "revealed_preference"
  | "pattern_performance"
  | "subagent_reputation"
  | "capability_availability"
  | "historical_case";

interface PriorConflictRecord {
  conflict_id: string;
  task_id?: string;
  run_id?: string;
  prior_refs: StorageRef[];
  conflict_kind:
    | "same_kind_conflict"
    | "values_vs_revealed_preference"
    | "template_vs_user_instruction"
    | "pattern_vs_capability"
    | "scope_conflict";
  resolution_policy:
    | "explicit_user_instruction_wins"
    | "constitution_wins"
    | "recency_weighted"
    | "frequency_weighted"
    | "highest_authority_scope_wins"
    | "user_arbitration_required"
    | "tie_review_required";
  resolved_prior_ref?: StorageRef;
  dropped_prior_refs: StorageRef[];
  rationale: string;
  created_at: ISO8601;
  schema_version: "1.0";
}
```
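
A deterministic resolver could look like the sketch below. Only one ordering comes from this review (constitution values outrank revealed-preference priors); placing explicit user instructions above the constitution, and sending same-kind ties to user arbitration, are assumptions made for illustration. `resolvePriorConflict`, `PRECEDENCE_SKETCH`, and the `Sketch` types are hypothetical, and only three prior kinds are modeled.

```ts
// Illustrative sketch only: the precedence table is partly an assumption.
type PriorKindSketch =
  | "explicit_user_instruction"
  | "user_constitution"
  | "revealed_preference";

interface PriorSketch {
  ref: string;
  kind: PriorKindSketch;
}

interface ResolutionSketch {
  resolution_policy:
    | "explicit_user_instruction_wins"
    | "constitution_wins"
    | "user_arbitration_required";
  resolved_prior_ref?: string;
  dropped_prior_refs: string[];
}

// Lower number means higher authority.
const PRECEDENCE_SKETCH: Record<PriorKindSketch, number> = {
  explicit_user_instruction: 0,
  user_constitution: 1,
  revealed_preference: 2,
};

function resolvePriorConflict(priors: PriorSketch[]): ResolutionSketch {
  if (priors.length < 2) {
    throw new Error("a conflict needs at least two priors");
  }
  // Sort by authority, then by ref, so input order never changes the result.
  const ordered = [...priors].sort(
    (a, b) =>
      PRECEDENCE_SKETCH[a.kind] - PRECEDENCE_SKETCH[b.kind] ||
      (a.ref < b.ref ? -1 : a.ref > b.ref ? 1 : 0)
  );
  const winner = ordered[0];
  const sameKindCount = ordered.filter((p) => p.kind === winner.kind).length;
  if (sameKindCount > 1) {
    // Assumption: same-kind ties are never resolved silently.
    return { resolution_policy: "user_arbitration_required", dropped_prior_refs: [] };
  }
  return {
    resolution_policy:
      winner.kind === "explicit_user_instruction"
        ? "explicit_user_instruction_wins"
        : "constitution_wins",
    resolved_prior_ref: winner.ref,
    dropped_prior_refs: ordered.slice(1).map((p) => p.ref),
  };
}
```

The point of the sketch is that resolution happens in code before prompt assembly, so the same set of priors always yields the same `PriorConflictRecord` fields.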

------

## [SELF-LEARNING] [MEDIUM] [Claude analysis] — Add PlanDiff / ProposalDiff shared schema

**Finding:** Claude flagged that edit-trace learning depends critically on the diff representation. The consolidated response did not provide the shared diff schema.

**Recommendation:** Add to Common Contracts:

```ts
type DiffLayer =
  | "structural"
  | "text"
  | "semantic";

interface StructuralFieldDiff {
  field_path: string;
  prior_value_ref?: StorageRef;
  proposed_value_ref?: StorageRef;
  accepted_value_ref?: StorageRef;
  change_kind:
    | "added"
    | "removed"
    | "modified"
    | "reordered"
    | "unchanged";
}

interface TextFieldDiff {
  field_path: string;
  prior_text_hash: string;
  proposed_text_hash: string;
  accepted_text_hash: string;
  diff_ref: StorageRef;
  edit_distance?: number;
}

interface SemanticDiffSummary {
  field_path: string;
  embedding_model_ref?: string;
  semantic_distance?: number;
  summary: string;
  confidence: number;
}

interface PlanDiff {
  diff_id: string;
  source_object_kind:
    | "CompiledEvaluationPlan"
    | "TaskBlueprint"
    | "RevisionPlan"
    | "TaskAgentProposal"
    | "OutcomeCompilerProposal";
  prior_ref: StorageRef;
  proposed_ref: StorageRef;
  accepted_ref?: StorageRef;
  structural_diffs: StructuralFieldDiff[];
  text_diffs: TextFieldDiff[];
  semantic_diffs?: SemanticDiffSummary[];
  created_at: ISO8601;
  schema_version: "1.0";
}
```

------

## [SELF-LEARNING] [MEDIUM] [Claude analysis] — Add active learning, in-session self-critique, retrieval effectiveness, trust calibration, process patterns, and user-attention signals

**Finding:** The consolidated response mentioned TIE and LoopEffectiveness but did not explicitly include the remaining learning mechanisms Claude listed.

**Recommendation:** Add these as placeholder-but-typed mechanisms, without deep-diving the paused self-learning workstream:

```ts
interface ActiveLearningQuery {
  query_id: string;
  task_id?: string;
  run_id?: string;
  target_uncertainty_ref: StorageRef;
  question_text: string;
  expected_information_gain: number;
  user_burden_estimate: "low" | "medium" | "high";
  ask_policy:
    | "ask_now"
    | "defer_to_end"
    | "ask_only_if_blocking"
    | "do_not_ask";
  answer_ref?: StorageRef;
  schema_version: "1.0";
}

interface InSessionSelfCritiqueRecord {
  critique_id: string;
  task_id: string;
  run_id: string;
  module_id: string;
  activation_seq: number;
  artifact_ref?: StorageRef;
  critique_summary: string;
  detected_issue_refs: StorageRef[];
  action_taken:
    | "none"
    | "revised_before_output"
    | "flagged_for_evaluator"
    | "asked_human"
    | "routed_to_revisor";
  schema_version: "1.0";
}

interface RetrievalEffectivenessSignal {
  signal_id: string;
  task_id: string;
  run_id?: string;
  retrieval_request_ref: StorageRef;
  retrieved_item_refs: StorageRef[];
  used_item_refs: StorageRef[];
  rejected_item_refs: StorageRef[];
  effectiveness_score?: number;
  rationale: string;
  schema_version: "1.0";
}

interface TrustCalibrationRecord {
  calibration_id: string;
  subject_kind:
    | "module"
    | "subagent"
    | "compiler"
    | "revisor"
    | "judge"
    | "source_research";
  subject_ref: string;
  predicted_confidence: number;
  observed_success: boolean;
  observed_score?: number;
  calibration_error: number;
  sample_context_ref: StorageRef;
  schema_version: "1.0";
}

interface ProcessPatternEmergenceCandidate {
  candidate_id: string;
  pattern_kind:
    | "task_graph_topology"
    | "module_sequence"
    | "repair_strategy"
    | "source_research_strategy"
    | "forum_coordination_pattern";
  supporting_run_refs: StorageRef[];
  pattern_summary: string;
  proposed_pattern_ref?: StorageRef;
  promotion_status:
    | "candidate"
    | "needs_review"
    | "approved"
    | "rejected"
    | "quarantined";
  schema_version: "1.0";
}

interface UserAttentionSignal {
  signal_id: string;
  task_id?: string;
  run_id?: string;
  surfaced_item_ref: StorageRef;
  attention_event:
    | "viewed"
    | "ignored"
    | "dismissed"
    | "expanded"
    | "acted_on"
    | "snoozed"
    | "marked_too_noisy"
    | "marked_useful";
  elapsed_ms_before_action?: number;
  schema_version: "1.0";
}
```

------

## [SELF-LEARNING] [MEDIUM] [Claude analysis] — Clarify outcome clusters vs pattern primitives

**Finding:** Claude correctly flagged that outcome clusters and DOC72 pattern primitives are separate emergence mechanisms and need a boundary. The consolidated response did not state the boundary explicitly.

**Recommendation:** Add:

```text
Boundary rule:
- Outcome clusters classify the shape of work being evaluated.
- Pattern primitives describe interventions that work: revision strategies, configurations, prompts, module topologies, source-research strategies.
- Pattern performance slices may be keyed by outcome cluster, but outcome clusters are not themselves intervention patterns.
- Durable pattern promotion is blocked until the canonical schema-collision issues in EvaluationFinding, ResearchNeed, RepairInstruction, and EvaluationResultEnvelope are resolved.
```

------

## [SELF-LEARNING] [MEDIUM] [Claude analysis] — Add sub-agent reputation slice extensions

**Finding:** The consolidated response included sub-agent fallback but did not explicitly include Claude’s point that reputation must slice by task/outcome context.

**Recommendation:** Add:

```ts
interface SubAgentReputationContextSignature {
  sub_agent_id: string;
  outcome_cluster_id?: string;
  strategic_intent_tag_ids: string[];
  task_type?: string;
  source_kind?: string;
  failure_kind?: FailureKind;
  model_class: ModelClass;
  domain_tags: string[];
  schema_version: "1.0";
}

interface SubAgentReputationSlice {
  slice_id: string;
  context_signature: SubAgentReputationContextSignature;
  accepted_recommendation_count: number;
  successful_recommendation_count: number;
  advice_led_to_regression_count: number;
  insufficient_sample: boolean;
  hit_rate: MetricValue;
  cost_per_success: MetricValue;
  confidence_interval?: [number, number];
  last_updated_at: ISO8601;
  schema_version: "1.0";
}
```

------

# Synthesis amendment

Add this paragraph to the Synthesis:

> **Audit amendment:** The response should be read as an all-defect hardening backlog, not as a prioritized shortlist. The “Top 5” section is only a synthesis aid. The next draft should also explicitly add: `TaskSkipProtocol`, `RunGuidanceLifecycleReceipt`, `ForumDecision`, `ForumSignal`, `RunBoardRetentionPolicy`, DAG-safe rolling hash state, per-kind idempotency inputs, source-workspace-native snapshot hashes, `MetricKind`/`QualityMetricRecord`, `BudgetFailureKind`, `CostEstimate`, `TaskCostRecord`, `SubAgentFallbackPolicy`, `StrategicIntentTag`, `UserConstitution`, `PlanDiff`, `PriorConflictRecord`, and the remaining typed self-learning mechanism stubs. These additions close omissions from the consolidated response and reduce the chance that the spec-revision chat focuses only on the largest five issues.



# Gemini

## §4 Conceptual Critique

The foundational topology of the DOC23 Addenda B set—decoupling execution (DOC23 proper), intent (Task Blueprint), evaluation (Outcome Evaluator), and revision (Revisor)—is exceptionally rigorous. Moving away from monolithic "agent loops" into a discrete, compiler-driven pipeline where LLMs propose `CandidateArtifactVersions` and deterministic buses execute them solves the non-determinism problem that plagues most agentic frameworks.

However, when reviewing the set holistically, the architecture relies heavily on a fatal unstated assumption: **that the underlying runtime executes perfectly in a single-threaded, isolated vacuum.** When you introduce asynchronous Task Forums, background source gathering, and human-in-the-loop overrides, this single-threaded assumption collapses. The system currently lacks a robust concurrency model, transaction boundaries, and deadlock resolution mechanisms. Furthermore, while the system meticulously budgets LLM calls (e.g., `retry_budget`), it is virtually blind to wall-clock compute time and KV-cache bloat, making it highly susceptible to local-first thermal throttling and context window overflow when dealing with high-complexity professional tasks.

**The Reconceptualization Opportunity:** The biggest conceptual miss is treating the `ShadowWorkspace` or `CandidateArtifactVersion` merely as a transient Revisor payload. These should be elevated to **first-class system primitives (Branch and Merge)**. If you formalize graph-state branching—where evaluations, forum debates, and revisor plans all execute on an isolated branch and only fast-forward merge upon `satisfied` outcomes—you instantly eliminate the need for brittle rollback mechanics, concurrency locks, and mid-flight artifact corruption.

## §5 Subsystem & Seam Findings

### [F-01] Source Workspace Transaction Isolation vs. Revisor Direct Fixes

**Location:** DOC23 Addenda B / Source Workspace V1.0.1 + Outcome Evaluator/Revisor V3.3.1 (Candidate Artifact Versions)

**Severity:** CRITICAL

**The Defect/Risk:** Evaluator V3.3.1 allows a `direct_fix` to be applied via the Revision Dispatcher against the Source Workspace. However, Source Workspace V1.0.1 fails to define transaction boundaries, read-locks, or write-locks. If a user edits a SourceRecord via the UI (Source Workspace V1.0.1 §8 context menu) *while* the Revisor is computing a `CandidateArtifactVersion`, the live hash changes.

**The Consequence:** The Revisor's line-number or AST-based direct fix will be applied to a stale artifact state, resulting in silent data corruption, overlapping edits, or broken document syntax without triggering a formal schema validation error.

**The Fix:** Implement strict `SnapshotHash` validation at the seam. The Revision Dispatcher MUST check if `live_workspace_hash == candidate_origin_hash` before applying a direct fix. If mismatched, the Dispatcher MUST automatically abort the revision plan, fail the step, and force a re-evaluation against the new live state.
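
The precondition described in the fix can be sketched as a compare-then-write guard. This is an in-memory illustration under stated assumptions: `WorkspaceSketch`, `applyDirectFix`, and the `aborted_stale_snapshot` status are hypothetical names, and a real store would need the comparison and the write to happen atomically (for example as a compare-and-swap), which a single synchronous function only approximates.

```ts
// Illustrative sketch only: an in-memory stand-in for the Source Workspace.
interface WorkspaceSketch {
  content: string;
  hash: string;
}

type DirectFixOutcomeSketch =
  | { status: "applied"; new_hash: string }
  | { status: "aborted_stale_snapshot"; live_hash: string };

function applyDirectFix(
  workspace: WorkspaceSketch,
  candidateOriginHash: string,
  newContent: string,
  hashFn: (text: string) => string
): DirectFixOutcomeSketch {
  // Refuse to write if the live state moved since the candidate was computed.
  if (workspace.hash !== candidateOriginHash) {
    return { status: "aborted_stale_snapshot", live_hash: workspace.hash };
  }
  workspace.content = newContent;
  workspace.hash = hashFn(newContent);
  return { status: "applied", new_hash: workspace.hash };
}
```

On an aborted outcome the caller would fail the step and trigger re-evaluation against `live_hash`, as the fix describes.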

### [F-02] Plan Review Forum Deadlock (Consensus Failure)

**Location:** Task Forum + Run Board V1.0.1 §8.5 (Plan Review Forum) + Outcome Evaluator/Revisor V3.3.1 §14.9

**Severity:** HIGH

**The Defect/Risk:** V3.3.1 delegates plan approval to the Task Forum. Task Forum V1.0.1 defines "per-plan vote summary" and "annotation tools," but completely omits tie-breaking algorithms, maximum deliberation rounds, or timeout fallbacks.

**The Consequence:** If the multi-agent forum splits its votes 50/50, or if agents endlessly generate conflicting annotations without reaching a terminal `approve` or `reject` state, the task graph hangs indefinitely. There is no circuit breaker to route a hung forum back to the Revision Compiler or up to a human `HardCallResolution`.

**The Fix:** Define a strict FSM for the Plan Review Forum: `max_deliberation_ticks` and `consensus_threshold_pct`. If ticks are exhausted without consensus, the forum MUST emit a `forum_deadlock` reason code, forcing the Revisor to trigger an immediate `escalation_out` (Hard Call).
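
A minimal sketch of the proposed state machine follows. It assumes votes are binary, that `consensus_threshold_pct` is above 50 so approve and reject cannot both qualify, and that consensus is checked before the tick limit. `nextForumState` and the `Sketch` types are hypothetical names.

```ts
// Illustrative sketch only: assumes binary votes and a threshold above 50.
type ForumStateSketch = "deliberating" | "approved" | "rejected" | "forum_deadlock";

interface ForumConfigSketch {
  max_deliberation_ticks: number;
  consensus_threshold_pct: number;
}

interface VoteTallySketch {
  approve: number;
  reject: number;
}

function nextForumState(
  tick: number,
  tally: VoteTallySketch,
  config: ForumConfigSketch
): ForumStateSketch {
  const total = tally.approve + tally.reject;
  if (total > 0) {
    if ((tally.approve / total) * 100 >= config.consensus_threshold_pct) return "approved";
    if ((tally.reject / total) * 100 >= config.consensus_threshold_pct) return "rejected";
  }
  // No consensus: keep deliberating until the tick budget is exhausted.
  return tick >= config.max_deliberation_ticks ? "forum_deadlock" : "deliberating";
}
```

A `forum_deadlock` result is the terminal state that would trigger the `escalation_out` hard call, so a 50/50 split can no longer hang the task graph.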

### [F-03] Transitive Taint Laundering via Sub-Agents

**Location:** Outcome Evaluator/Revisor V3.3.1 (Taint and Policy Model) + Task Forum + Run Board V1.0.1 §8.3

**Severity:** CRITICAL

**The Defect/Risk:** The architecture defines taint boundaries for the Revisor's input packet. However, if a tainted artifact is posted to the active Task Forum (§8.3 "Mentions stream") for sub-agent debate, the sub-agents read the tainted text and generate "clean" summary critiques or annotations.

**The Consequence:** The taint is laundered. The generated annotations do not explicitly inherit the `external_untrusted` taint label. When the Revisor consumes the forum's critique to draft a plan, it incorporates adversarial payloads via the untainted summary, bypassing the safety boundary entirely.

**The Fix:** Taint must be defined as a transitive graph property across ALL Sub-addenda. Any output produced by a module or sub-agent that consumed a tainted payload MUST automatically inherit that taint label, unless it passes through a specifically designated structural `SanitizationNode`.
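
Transitive inheritance reduces to a set union over consumed inputs, with removal allowed only at a sanitization node. The sketch below assumes taint labels are plain strings and that a sanitization node declares which labels it clears; `deriveOutputTaints` and `TaintedPayloadSketch` are hypothetical names.

```ts
// Illustrative sketch only: taint labels are modeled as plain strings.
interface TaintedPayloadSketch {
  payload_id: string;
  taints: string[];
}

function deriveOutputTaints(
  inputs: TaintedPayloadSketch[],
  producerIsSanitizationNode: boolean,
  clearedLabels: string[] = []
): string[] {
  // An output inherits every taint label of every payload it consumed.
  const inherited = new Set<string>();
  for (const input of inputs) {
    for (const label of input.taints) {
      inherited.add(label);
    }
  }
  // Only a designated sanitization node may remove labels, and only the ones it declares.
  if (producerIsSanitizationNode) {
    for (const label of clearedLabels) {
      inherited.delete(label);
    }
  }
  return Array.from(inherited).sort();
}
```

Under this rule a forum annotation that summarizes an `external_untrusted` artifact carries that label forward, so the Revisor sees it when it consumes the critique.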

## §6 Exhaustive Defect Hunt

### [D-01] RunGuidanceItem Cross-Run Propagation Gap

**Location:** Feedback Delivery V1.0.1 (Run Guidance Items) + Core R0.7.1

**Severity:** HIGH

**The Defect/Risk:** Feedback Delivery V1.0.1 specifies `RunGuidanceItem`s generated from feedback. However, it fails to specify *how* a `RunGuidanceItem` emitted in Run A is injected into the payload of future Run B.

**The Consequence:** An implementing agent has no wiring diagram for durability. Do these items write to the `TaskBlueprint`? Do they live in a floating KV store? If the injection vector is undefined, the guidance is ephemeral and learning fails to compound.

**The Fix:** Explicitly map `RunGuidanceItem` emission to a DOC24 BDSM ledger update, OR explicitly state they are written to the `TaskBlueprint`'s `persistent_guidance` array via an EC durable write.

### [D-02] EvaluationFindings KV-Cache Bloat

**Location:** Evaluation Common Contracts V1.1.1 §3 (EvaluationResultEnvelope)

**Severity:** MEDIUM

**The Defect/Risk:** `EvaluationResultEnvelope` contains arrays of `CriterionLineage` and findings. For a 100-page document evaluated against 50 criteria, the envelope JSON size will easily exceed 100k tokens.

**The Consequence:** When this envelope is passed to the Outcome Compiler or Revisor, the local LLM's context window will overflow or time-to-first-token (TTFT) will spike, breaking local execution constraints.

**The Fix:** Mandate that `EvaluationResultEnvelope` chunks findings by `ArtifactScopeRef`, requiring the Revisor to pull paginated findings via an iterator rather than forcing a massive single-shot payload injection.
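
As a rough illustration of chunking by scope, the sketch below groups findings by a string scope ref and splits each group into fixed-size pages. `FindingSketch`, `FindingPageSketch`, and `chunkFindingsByScope` are hypothetical names, and the page size is counted in findings rather than tokens, which a real implementation would likely need to budget instead.

```ts
// Illustrative sketch only: pages are sized by finding count, not by tokens.
interface FindingSketch {
  finding_id: string;
  scope_ref: string;
}

interface FindingPageSketch {
  scope_ref: string;
  page_index: number;
  findings: FindingSketch[];
}

function chunkFindingsByScope(
  findings: FindingSketch[],
  pageSize: number
): FindingPageSketch[] {
  if (pageSize < 1) {
    throw new Error("pageSize must be at least 1");
  }
  // Group by scope, preserving first-seen scope order.
  const byScope = new Map<string, FindingSketch[]>();
  for (const finding of findings) {
    const bucket = byScope.get(finding.scope_ref);
    if (bucket) {
      bucket.push(finding);
    } else {
      byScope.set(finding.scope_ref, [finding]);
    }
  }
  // Split each scope's findings into pages the consumer can pull one at a time.
  const pages: FindingPageSketch[] = [];
  byScope.forEach((bucket, scope) => {
    for (let start = 0; start < bucket.length; start += pageSize) {
      pages.push({
        scope_ref: scope,
        page_index: start / pageSize,
        findings: bucket.slice(start, start + pageSize),
      });
    }
  });
  return pages;
}
```

The Revisor would then request pages for the scope it is currently repairing instead of loading the full envelope.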

### [D-03] pending_dependency Deadlock Cascade

**Location:** Outcome Evaluator/Revisor V3.3.1

**Severity:** HIGH

**The Defect/Risk:** If an upstream module crashes or hits its `per_outcome_retry_budget` and halts, downstream evaluators waiting on those artifacts remain in a `pending_dependency` state.

**The Consequence:** The Loop Controller will hang waiting for an artifact that can never arrive.

**The Fix:** Introduce a `TransitiveFailureCascade` rule in Common Contracts. If an upstream module emits an `unrecoverable_failure`, the graph MUST instantly sweep all downstream nodes waiting on its output and transition them from `pending_dependency` to `upstream_failure_aborted`.
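
The sweep can be sketched as a breadth-first walk over downstream edges. The sketch assumes the graph is available as an adjacency map and that only nodes still in `pending_dependency` are aborted, so nodes that already finished keep their status. `cascadeFailure` and `NodeStatusSketch` are hypothetical names.

```ts
// Illustrative sketch only: the graph is a plain adjacency map of node ids.
type NodeStatusSketch =
  | "pending_dependency"
  | "running"
  | "satisfied"
  | "unrecoverable_failure"
  | "upstream_failure_aborted";

function cascadeFailure(
  failedNodeId: string,
  downstreamOf: Map<string, string[]>,
  nodeStatus: Map<string, NodeStatusSketch>
): string[] {
  const aborted: string[] = [];
  const visited = new Set<string>([failedNodeId]);
  const queue: string[] = [failedNodeId];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of downstreamOf.get(current) ?? []) {
      if (visited.has(next)) continue;
      visited.add(next);
      // Assumption: only waiting nodes are aborted, and the sweep continues through them.
      if (nodeStatus.get(next) === "pending_dependency") {
        nodeStatus.set(next, "upstream_failure_aborted");
        aborted.push(next);
        queue.push(next);
      }
    }
  }
  return aborted;
}
```

The returned list gives the Loop Controller an explicit set of aborted nodes to report instead of an open-ended wait.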

### [D-04] Ambiguous 'Regenerate' Capability Contract

**Location:** Outcome Evaluator/Revisor V3.3.1 (`revision_in` port contract)

**Severity:** MEDIUM

**The Defect/Risk:** The spec relies on the module declaring capabilities (e.g., `regenerate`). But if `regenerate` is triggered and simply returns the exact same failing output due to deterministic temperature settings or rigid prompt templates, the Revisor will loop.

**The Consequence:** Infinite budget burn bouncing between identical regenerations.

**The Fix:** The `revision_in` payload MUST require a `previous_attempt_hash` array. The target module's integration layer MUST automatically reject generation if the new output hash matches a `previous_attempt_hash`, forcing it to alter temperature/top-k or escalate.
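
The duplicate check itself is small. The sketch below assumes output hashes are computed upstream over normalized content and leaves the response to a rejection (altering sampling parameters or escalating) to the caller. `checkRegeneration` and `RegenerationCheckSketch` are hypothetical names.

```ts
// Illustrative sketch only: hashes are assumed to be computed upstream.
type RegenerationCheckSketch =
  | { accepted: true; attempt_hashes: string[] }
  | { accepted: false; reason: "duplicate_of_previous_attempt" };

function checkRegeneration(
  outputHash: string,
  previousAttemptHashes: string[]
): RegenerationCheckSketch {
  // Reject an output that is byte-identical to one that already failed.
  if (previousAttemptHashes.indexOf(outputHash) !== -1) {
    return { accepted: false, reason: "duplicate_of_previous_attempt" };
  }
  // Otherwise accept it and extend the history for the next attempt.
  return { accepted: true, attempt_hashes: [...previousAttemptHashes, outputHash] };
}
```

Exact-hash matching only catches identical regenerations; near-duplicates that differ trivially would pass this check and still need the retry budget as a backstop.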

### [D-05] Missing Write Locks in Source Workspace

**Location:** Source Workspace V1.0.1

**Severity:** HIGH

**The Defect/Risk:** The document specifies a "shared source/reference/research substrate that task runs use when modules need to find, verify, retrieve, or update sources". It fails to specify mutual exclusion (mutex) locks.

**The Consequence:** If `step.source_research` and a direct user edit both attempt to mutate the same `SourceRecord` at the same time, the race can silently overwrite one change or corrupt the reference library.

**The Fix:** Add a strict `VersionVector` or `RowLevelLock` requirement for all mutations applied to the Source Workspace.

## §7 Synthesis & Verdict

**Grade:** A- (Build-Ready pending Transaction/Concurrency Fixes)

**Justification:** The modular isolation here is excellent. The separation of evaluation envelopes (Common Contracts) from task-graph execution ensures that observability doesn't corrupt business logic. However, the system loses its "A" grade because it conceptually ignores distributed-systems realities—specifically transaction isolation on the Source Workspace, transitive taint across the Task Forum, and consensus deadlocks. It reads like a system perfectly designed for a single asynchronous thread, which will shatter under parallel evaluation and human-in-the-loop concurrency.

**Top 5 must-fix before build:**

1. **[F-01] Snapshot Hash Preconditions:** Mandate hash-matching for `direct_fix` to prevent concurrent-write corruption on the Source Workspace.
2. **[F-03] Transitive Taint:** Force Task Forum sub-agents and annotations to inherit the taint of the payloads they summarize.
3. **[F-02] Forum Deadlock Breaker:** Implement a strict `max_deliberation_ticks` timeout on the Task Forum, defaulting to a Hard Call escalation.
4. **[D-03] Transitive Failure Cascade:** Ensure that an upstream `could_not_fix` halt cascades immediately to downstream `pending_dependency` outcomes to prevent graph hangs.
5. **[D-01] RunGuidance Injection Vector:** Define exactly which durable store `RunGuidanceItem`s are written to so they actually survive between runs.

**Biggest conceptual opportunity:**

**Promote the "ShadowWorkspace" to a Core System Primitive.** Stop treating candidate revisions as transient Revisor payloads. Treat the entire execution state like Git. Every task execution creates a Branch. Revisions, forum debates, and generations happen on the Branch. If the Evaluator yields `satisfied`, you Fast-Forward Merge to Main. If it fails, you drop the branch. Because Main is only ever touched by the merge, rollback reduces to discarding the branch, and most of the complex error-recovery state machines currently specified become unnecessary.

**Coding-agent readiness:**

**Not ready for autonomous implementation.** While the schemas in Common Contracts are tight, the execution semantics are underspecified. Because transaction locks, tie-breakers, and DLQ routing are missing, an AI coding agent will fall back on default Node.js asynchronous behavior (e.g., running writes through `Promise.all` with no locking on the Source Workspace), and the deployed system will be riddled with race conditions. The spec must explicitly define concurrency handling before being handed to a coding agent.



This fifth and deepest pass focuses explicitly on your question: **Does this structure accomplish the goals of unifying and improving the memory system?**

When evaluating DOC23 Addenda B strictly as the "nervous system" that writes to and reads from ELNOR’s memory (DOC72 Skill Graph, BDSM Utility, and local Task Blueprints), the architecture achieves **unprecedented observability but fails at unified applicability**.

It perfectly captures *what* happened, but the mechanism for injecting those lessons back into future runs is fragmented, resulting in agent amnesia and colliding memory tiers.

Here is the deep-dive analysis of the memory system fractures, missed bugs, and new architectural ideas to unify it.

### 1. The "Sub-Agent Amnesia" Bug

**Location:** Feedback Delivery V1.0.1 + Task Forum V1.0.1

**The Flaw:** Feedback Delivery V1.0.1 specifies that `RepairInstructions` and feedback are routed to "downstream consumers," including sub-agents. This works for the *current* run (revising a failed draft). However, Addenda B completely forgets to define a persistent memory state for sub-agents across *future* runs.

**The Consequence:** The `system.task_forum` agents are functionally stateless. If the "Specialist Litigator" sub-agent hallucinates a bad case citation, the Outcome Evaluator catches it, the Revisor sends a `RepairInstruction`, and the sub-agent fixes it for that run. But tomorrow, on a new task, the sub-agent will make the exact same hallucination because Addenda B provides no mechanism to inject past `EvaluationFindings` into the prompt contexts of specialized sub-agents.

**The Fix:** Introduce **Sub-Agent Experience Profiles** in DOC24, managed by the Task Agent. When the Task Agent spins up the Task Forum, it MUST inject a `SubAgentPrior` payload into each participant's configuration, containing the last 5 relevant `RepairInstructions` that specific sub-agent received across the workspace.
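
A rough sketch of how a `SubAgentPrior` could be assembled, assuming a hypothetical `RepairInstructionRecord` shape with a `sub_agent_id` and an ISO-8601 `issued_at` field (neither is defined in the reviewed documents). The "relevant" filter mentioned in the fix is not modeled here; this only selects the most recent instructions for one sub-agent.

```typescript
// Hypothetical record shape; only `RepairInstructions` and `SubAgentPrior` are named in the finding.
interface RepairInstructionRecord {
  instruction_id: string;
  sub_agent_id: string;
  issued_at: string; // ISO-8601 timestamp (assumed)
  text: string;
}

interface SubAgentPrior {
  sub_agent_id: string;
  recent_repair_instructions: RepairInstructionRecord[];
}

// Select the most recent `limit` instructions for one sub-agent, newest first.
function buildSubAgentPrior(
  subAgentId: string,
  history: RepairInstructionRecord[],
  limit = 5
): SubAgentPrior {
  const recent = history
    .filter((r) => r.sub_agent_id === subAgentId) // filter returns a copy, so sort is safe
    .sort((a, b) => Date.parse(b.issued_at) - Date.parse(a.issued_at))
    .slice(0, limit);
  return { sub_agent_id: subAgentId, recent_repair_instructions: recent };
}
```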

### 2. The Missing "Flawless Execution" Denominator (Critical Learning Bug)

**Location:** Evaluation Common Contracts V1.1.1 (Envelope Schemas) + Evaluator V3.3.1

**The Flaw:** The `EvaluationLearningSignalEnvelope` taxonomies (`RepairCycleSignal`, `TaskProcessGapSignal`, `OutcomeEvaluationSignal`) are almost exclusively triggered by failures, revisions, hard calls, or human overrides.

**The Consequence:** You are building a memory system based entirely on negative reinforcement and delta-corrections. If a Task Blueprint executes perfectly and passes the Outcome Evaluator on the first try, the system emits no substantive learning envelope. BDSM (DOC8) and the DOC72 Skill Graph *require* positive denominators to build statistical confidence (e.g., calculating beta-decay utility). If Addenda B only screams when it hurts, the memory system will falsely decay the utility of perfectly functioning patterns because it thinks they aren't being used.

**The Fix:** Common Contracts MUST define a `TaskConfirmationSignal`. The Evaluator MUST emit this envelope when a task completes with zero `OutcomeEvaluation` failures, explicitly reinforcing the exact `TaskBlueprint` and `UnifiedPattern` versions used.

### 3. The Local vs. Global Precedence Collision

**Location:** Core R0.7.1 (Task Blueprint) vs. Evaluator V3.3.1 (Pattern Learning)

**The Flaw:** Addenda B establishes two distinct memory loops.

1. **Local:** `RunGuidanceItem`s update the `TaskBlueprint` (local intent).

2. **Global:** `PatternPerformanceSignal`s update the DOC72 Skill Graph (global capability).

**The Consequence:** What happens when global memory and local memory directly contradict? If DOC72 learns a highly confident global pattern ("Always format citations in Bluebook"), but the user manually added a `RunGuidanceItem` to a specific Task Blueprint ("Use California Style Manual for this judge"), the architecture does not specify which memory tier wins during task assembly.

**The Fix:** Core R0.7.1 must explicitly define the **Memory Precedence Hierarchy**. The invariant must be: `Local Intent (Blueprint/Guidance) > Matter/Scope Policy > Global Skill Graph (DOC72)`. The Task Agent MUST enforce this hierarchy when assembling the execution context, explicitly masking global patterns that conflict with local guidance.
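
The proposed hierarchy can be sketched as a small resolver. Everything below is an illustration under assumptions: the tier names, the `MemoryDirective` shape, and the use of a `topic` key to detect conflicts are hypothetical, and ties within the same tier are resolved first-come here only for simplicity.

```typescript
// Tiers ordered from highest to lowest precedence, following the proposed invariant.
const TIER_ORDER = ["local_intent", "matter_scope_policy", "global_skill_graph"] as const;
type MemoryTier = (typeof TIER_ORDER)[number];

// Hypothetical directive shape: `topic` is an assumed key for detecting conflicts.
interface MemoryDirective {
  tier: MemoryTier;
  topic: string;
  instruction: string;
}

interface ResolvedContext {
  active: MemoryDirective[]; // one winning directive per topic
  masked: MemoryDirective[]; // directives overridden by a higher tier
}

function resolvePrecedence(directives: MemoryDirective[]): ResolvedContext {
  const rank = (t: MemoryTier): number => TIER_ORDER.indexOf(t);
  const winners = new Map<string, MemoryDirective>();
  const masked: MemoryDirective[] = [];
  for (const d of directives) {
    const held = winners.get(d.topic);
    if (!held) {
      winners.set(d.topic, d);
    } else if (rank(d.tier) < rank(held.tier)) {
      masked.push(held); // a higher-precedence tier displaces the current holder
      winners.set(d.topic, d);
    } else {
      masked.push(d);
    }
  }
  return { active: Array.from(winners.values()), masked };
}
```

Applied to the citation example above, the local "California Style Manual" guidance stays active and the global Bluebook pattern lands in `masked`, which also gives an audit trail of what was overridden.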

### 4. Conflating "State" with "Structure" in the Task Blueprint

**Location:** Core R0.7.1 §4 (Task Agent & Task Blueprint)

**The Flaw:** The Task Agent is tasked with using memory to self-improve the `TaskBlueprint`. But the blueprint currently mixes *Structural Automation* (which modules are wired together) with *Ephemeral Context* (the specific legal matter, the specific files).

**The Consequence:** If a user edits a task to fix a broken pipeline (Structural), the Task Agent updates the blueprint. If the user edits the task to point to a new case file (State), the Task Agent updates the blueprint. When the system tries to generalize "what it learned" from the blueprint to save it as a template, the memory is poisoned with highly specific case data.

**The Fix:** The `TaskBlueprint` MUST be formally bifurcated into `TaskTopology` (the wiring, patterns, and logic, which the memory system optimizes globally) and `TaskPayload` (the context, source workspace refs, and parameters, which are strictly ephemeral and never generalized).

### The Conceptual Verdict on Memory Unification

Does this structure unify the memory system? **No, not yet.** Right now, Addenda B acts like a mailroom. It prints out receipts (`EvaluationFeedbackBundle`) and throws them into the void, hoping DOC72 and BDSM catch them. But it does not take responsibility for *reading* the mail before the next run.

**The New Idea to Improve This (The "Memory Hydration" Phase):**

Addenda B currently focuses heavily on the *end* of the run (Outcome Evaluator, Revisor, Signal Emission). You need to balance this by formalizing the *beginning* of the run.

You must introduce a strict **"Memory Hydration"** lifecycle phase in Core R0.7.1.

Before a task moves to `running`, the Task Agent MUST execute a `HydrateMemory` step. This step queries BDSM, DOC72, and the local RunGuidance ledger, deterministically resolves conflicts using the Memory Precedence Hierarchy (item 3 above), injects the priors into the sub-agents (item 1 above), and stamps the run with a `HydratedMemoryHash`.

Without formalizing *how* memory is hydrated back into the graph before execution, the elegant feedback loop you designed in V1.0.1 is a one-way street.



A strict technical audit of the **quantitative functions, formulas, metrics, cache hashes, state transitions, and routing thresholds** in the DOC23 Addenda B set (specifically the V3.3.1 Outcome Evaluator/Revisor and Core R0.7.1) reveals several critical mathematical breaks and technical paradoxes.

While the architectural routing is sound, the actual *formulas and metrics* intended to drive the self-improving and safety mechanisms contain survivorship biases, clock-skew race conditions, and impossible hashing expectations.

Here is the quantitative and technical-function audit, along with concrete schema patches.

### [BUG-01] Survivorship Bias in Convergence Metrics (CRITICAL)

**Location:** V3.3.1 §15.1 Revisor quality metrics (`avg_revision_cycles_to_convergence`)

**The Mathematical Flaw:** The spec defines the metric as:

- `Denominator: outcomes reaching satisfied state`
- `Numerator: revision cycles per outcome`

This formula completely excludes outcomes that *never* reach a satisfied state (e.g., those that burn through the `retry_budget` and escalate to a Hard Call or abort). This introduces massive **survivorship bias**. If the Revisor wastes 10 cycles on a complex issue, fails, and escalates, those 10 wasted cycles are completely erased from the metric. The Revisor will look mathematically "highly efficient" because it only scores the easy fixes that succeed in 1-2 cycles.

**The Fix:** You must bifurcate this metric into an efficiency score (for successes) and a waste score (for failures).

**Schema Patch (V3.3.1 §15.1):**

```text
// Replace existing metric with these two:
Metric: avg_cycles_to_success
  Denominator: outcomes transitioning to satisfied state
  Numerator: revision cycles spent on those specific outcomes

Metric: wasted_cycle_burn_rate
  Denominator: total revision cycles executed globally
  Numerator: revision cycles spent on outcomes that ultimately aborted, escalated, or reached 'unrecoverable'
```
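
To make the two replacement metrics concrete, here is a small sketch of how they might be computed from per-outcome records. The `OutcomeRecord` shape and the `final_state` values are assumptions for illustration; only the metric names and their numerator/denominator definitions come from the patch above.

```typescript
type FinalState = "satisfied" | "aborted" | "escalated" | "unrecoverable";

// Hypothetical per-outcome record; field names are illustrative.
interface OutcomeRecord {
  outcome_id: string;
  revision_cycles: number;
  final_state: FinalState;
}

interface ConvergenceMetrics {
  avg_cycles_to_success: number | null;  // null when no outcome reached satisfied
  wasted_cycle_burn_rate: number | null; // null when no cycles were executed
}

function computeConvergenceMetrics(outcomes: OutcomeRecord[]): ConvergenceMetrics {
  let successCount = 0;
  let successCycles = 0;
  let wastedCycles = 0;
  let totalCycles = 0;
  for (const o of outcomes) {
    totalCycles += o.revision_cycles;
    if (o.final_state === "satisfied") {
      successCount += 1;
      successCycles += o.revision_cycles;
    } else {
      wastedCycles += o.revision_cycles; // cycles spent on outcomes that never converged
    }
  }
  return {
    avg_cycles_to_success: successCount > 0 ? successCycles / successCount : null,
    wasted_cycle_burn_rate: totalCycles > 0 ? wastedCycles / totalCycles : null,
  };
}
```

For two successes at 1 and 2 cycles plus one escalation after 10 cycles, the efficiency score is 1.5 while the burn rate is 10/13, so the failed effort stays visible instead of vanishing from the average.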

### [BUG-02] The Hash Hallucination Paradox in Mutator State (CRITICAL)

**Location:** V3.3.1 §7.2.1 `RevisionPlan` schema (Mutation mode)

**The Technical Flaw:** To prevent concurrent overwrite, the plan schema declares:

```typescript
mutation_mode: "candidate_only" | "rolling_hash_in_place"
rolling_hash_chain?: Array<{ step_id: string, predicted_pre_hash: string, predicted_post_hash: string }>
```

The `RevisionPlan` is output by the Revision Compiler (an LLM). The schema therefore requires an LLM to emit a `predicted_post_hash` (a SHA-256 or BLAKE3 digest) for a text mutation that has not yet been applied. An LLM cannot compute a cryptographic hash of a future string state; it will emit a plausible-looking but fabricated value (e.g., `"a1b2c3d4..."`). When the deterministic Dispatcher compares the real hash after Step 1 against the fabricated `predicted_pre_hash` for Step 2, the exact-match gate will fail essentially every time, breaking all multi-step in-place plans.

**The Fix:** Remove `predicted_post_hash` from the LLM's required output contract. State transitions must rely on stable section anchors, and the Dispatcher must compute expected hashes dynamically at runtime.

**Schema Patch (V3.3.1 §7.2.1):**

```typescript
// Replace rolling_hash_chain with LLM-safe lock targeting:
mutation_mode: "candidate_only" | "rolling_hash_in_place"
in_place_mutation_locks?: Array<{
  step_id: string
  target_section_anchor_hash: string // Stable hash from §12.4.2, known at read-time
  expected_base_version_id: string 
}>
```
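
A sketch of the Dispatcher side of this patch, under assumptions: the lock fields mirror the patch above, while `SectionState`, `applyLockedStep`, and the choice of SHA-256 are hypothetical. The point it illustrates is that the plan carries only values known at read time, and the post-mutation hash is computed by deterministic code after the text exists.

```typescript
import { createHash } from "node:crypto";

// Mirrors the lock fields in the schema patch.
interface InPlaceMutationLock {
  step_id: string;
  target_section_anchor_hash: string;
  expected_base_version_id: string;
}

// Hypothetical view of the section the Dispatcher is about to mutate.
interface SectionState {
  anchor_hash: string;
  version_id: string;
  text: string;
}

type StepResult =
  | { applied: true; post_hash: string; next: SectionState }
  | { applied: false; reason: "anchor_mismatch" | "stale_base_version" };

const sha256 = (s: string): string => createHash("sha256").update(s, "utf8").digest("hex");

// Validate the lock against live state, then compute the post-hash at runtime.
function applyLockedStep(
  lock: InPlaceMutationLock,
  section: SectionState,
  newText: string,
  nextVersionId: string
): StepResult {
  if (section.anchor_hash !== lock.target_section_anchor_hash) {
    return { applied: false, reason: "anchor_mismatch" };
  }
  if (section.version_id !== lock.expected_base_version_id) {
    return { applied: false, reason: "stale_base_version" };
  }
  const next: SectionState = { ...section, version_id: nextVersionId, text: newText };
  return { applied: true, post_hash: sha256(newText), next };
}
```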

### [BUG-03] Flat-Ratio Math in "Weighted" Reputation Scores (HIGH)

**Location:** V3.3.1 §15.8.2 `SubAgentReputation`

**The Mathematical Flaw:** The spec explicitly claims the `advice_regression_rate` is an "asymmetric-weighted metric (per §15.8.1 false negative cost is 2x)". However, the actual formula defined in the schema is:

```
regression_rate: number // regression_count / accepted_count
```

This is a flat, unweighted ratio. If the math does not multiply the counts by the severity weights, a sub-agent that introduces 5 "critical" regressions mathematically scores identically to a sub-agent that introduces 5 "minor" formatting regressions.

**The Fix:** Apply the asymmetric severity weights in the numerator so that sub-agents are penalized more heavily for critical breaks than for minor ones.

**Schema Patch (V3.3.1 §15.8.2):**

```typescript
advice_regression_rate: {
  accepted_count: number
  regression_introduced_count: number
  severity_breakdown: Record<"minor" | "major" | "critical", number>
  
  // NEW: The actual weighted math
  unweighted_regression_rate: number // regression_introduced_count / accepted_count
  severity_weighted_penalty_score: number 
  // Formula: ((minor*1) + (major*2) + (critical*5)) / accepted_count,
  // where minor, major, and critical are the counts in severity_breakdown
}
```
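
A short worked sketch of the weighted formula, using the 1/2/5 weights proposed in the patch. The function name and the input shape are illustrative, and the sketch assumes `regression_introduced_count` equals the sum of the severity counts.

```typescript
type Severity = "minor" | "major" | "critical";

// Weights taken from the formula in the schema patch.
const SEVERITY_WEIGHTS: Record<Severity, number> = { minor: 1, major: 2, critical: 5 };

interface AdviceRegressionInput {
  accepted_count: number;
  severity_breakdown: Record<Severity, number>;
}

interface AdviceRegressionScore {
  regression_introduced_count: number;
  unweighted_regression_rate: number | null;      // null when nothing was accepted
  severity_weighted_penalty_score: number | null; // null when nothing was accepted
}

function scoreAdviceRegression(input: AdviceRegressionInput): AdviceRegressionScore {
  const b = input.severity_breakdown;
  const count = b.minor + b.major + b.critical;
  if (input.accepted_count === 0) {
    return {
      regression_introduced_count: count,
      unweighted_regression_rate: null,
      severity_weighted_penalty_score: null,
    };
  }
  const weighted =
    b.minor * SEVERITY_WEIGHTS.minor +
    b.major * SEVERITY_WEIGHTS.major +
    b.critical * SEVERITY_WEIGHTS.critical;
  return {
    regression_introduced_count: count,
    unweighted_regression_rate: count / input.accepted_count,
    severity_weighted_penalty_score: weighted / input.accepted_count,
  };
}
```

With 10 accepted pieces of advice, five critical regressions and five minor regressions both give an unweighted rate of 0.5, but the weighted scores are 2.5 and 0.5 respectively, which is the separation the flat ratio cannot express.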

### [BUG-04] Distributed Clock-Skew in Concurrency Tie-Breakers (HIGH)

**Location:** V3.3.1 §11.9 `concurrency_tie_breaker`

**The Technical Flaw:** Rule 4 of the tie-breaker for conflicting revision plans is `RevisionPlan.created_at ascending (earlier = wins)`.

In a local-first, multi-threaded agentic architecture, `created_at` timestamps are subject to clock skew and scheduling jitter. If two sub-agents or evaluators compile plans at nearly the same moment, using a millisecond timestamp to grant write access to the `TaskSourceWorkspace` makes the winner effectively arbitrary. The plan that wins might take 45 seconds to execute, needlessly locking the workspace, while a 1-second format fix sits in the queue.

**The Fix:** Tie-breakers for concurrent write-locks should optimize for risk reduction and lock-release speed, not for small timestamp differences.

**Formula Patch (V3.3.1 §11.9):**

```text
RULE concurrency_tie_breaker (UPDATED):
1. OutcomeDependencySpec.required_for_overall_pass = true > false
2. EvaluationOutcomeDefinition.is_high_stakes = true > false
3. EvaluationOutcomeDefinition.priority ascending (lower priority value = wins)
4. RevisionPlan.plan_risk_score ascending (lower risk score = wins; safer plans acquire lock first to prevent catastrophic collisions)
5. RevisionCostEstimate.total_tokens ascending (faster/smaller plans execute and release lock quicker)
```
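
The updated rule can be expressed as a comparator. This is a sketch with a flattened, hypothetical `PlanCandidate` shape. It compares risk so that the lower-risk plan wins, matching the stated intent that safer plans acquire the lock first, which assumes a higher `plan_risk_score` means a riskier plan. The final `plan_id` comparison is an added assumption so that the ordering is total even when all five rules tie.

```typescript
// Hypothetical flattened view of the fields the five rules consult.
interface PlanCandidate {
  plan_id: string;
  required_for_overall_pass: boolean;
  is_high_stakes: boolean;
  priority: number;        // lower value wins
  plan_risk_score: number; // assumed: higher means riskier
  total_tokens: number;    // smaller plans release the lock sooner
}

// Negative result means `a` acquires the lock before `b`.
function compareForLock(a: PlanCandidate, b: PlanCandidate): number {
  return (
    Number(b.required_for_overall_pass) - Number(a.required_for_overall_pass) ||
    Number(b.is_high_stakes) - Number(a.is_high_stakes) ||
    a.priority - b.priority ||
    a.plan_risk_score - b.plan_risk_score ||
    a.total_tokens - b.total_tokens ||
    // Assumed final tie-break so the order never depends on arrival time.
    (a.plan_id < b.plan_id ? -1 : a.plan_id > b.plan_id ? 1 : 0)
  );
}

function pickLockWinner(candidates: PlanCandidate[]): PlanCandidate | undefined {
  return candidates.slice().sort(compareForLock)[0];
}
```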

### [BUG-05] Thermal Throttling Outliers in Latency Aggregations (MEDIUM)

**Location:** Core R0.7.1 §16.2 `TaskAssessment` & V3.3.1 §12.5 `CostLatencyFinding`

**The Statistical Flaw:** The metrics track `avg_duration_ms` for tasks and modules. On local Apple Silicon (M4/M5), latency is violently bimodal due to unified memory swapping, model cold-starts, and thermal throttling. A module might take 2 seconds normally, but 35 seconds if the OS swaps memory. If you use a simple arithmetic mean (`avg_duration_ms`), a few thermal outliers will massively skew the average, creating false-positive "performance degradation" signals in the `TaskAssessment` that trigger useless Task Agent alerts.

**The Fix:** Drop standard averages for local-compute latency. You must use median ($P_{50}$) and 90th percentile ($P_{90}$) or apply an Interquartile Range (IQR) filter before aggregation.

**Schema Patch (Core R0.7.1 §16.2 / Common Contracts):**

```typescript
CostLatencyFinding {
  // REMOVE: avg_duration_ms: number
  // ADD:
  latency_distribution: {
    p50_duration_ms: number // Primary benchmark
    p90_duration_ms: number // Thermal/Cold-start benchmark
    sample_size: number
  }
}
```
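
For illustration, the `latency_distribution` fields could be filled with a nearest-rank percentile over raw samples. The nearest-rank method and the function names are assumptions; the spec patch does not say which percentile definition to use.

```typescript
interface LatencyDistribution {
  p50_duration_ms: number;
  p90_duration_ms: number;
  sample_size: number;
}

// Nearest-rank percentile on an ascending array: rank = ceil(p * n / 100), 1-indexed.
function percentileNearestRank(sortedAsc: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p * sortedAsc.length) / 100));
  return sortedAsc[rank - 1];
}

function summarizeLatency(samplesMs: number[]): LatencyDistribution | null {
  if (samplesMs.length === 0) return null; // no samples, nothing to report
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  return {
    p50_duration_ms: percentileNearestRank(sorted, 50),
    p90_duration_ms: percentileNearestRank(sorted, 90),
    sample_size: sorted.length,
  };
}
```

Take eight runs at 2000 ms and two swapped or throttled runs at 35000 ms. The arithmetic mean is 8600 ms, which describes no run that actually occurred. The median stays at 2000 ms and the 90th percentile reports 35000 ms, so the normal case and the slow tail are visible separately.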

### Summary of Audit

To ensure this spec functions operationally, you must strip LLM non-determinism out of the hashing chains (BUG-02), correct the denominator logic to prevent survivorship bias (BUG-01), and enforce the severity multipliers in the actual formula (BUG-03). Replacing the timestamp tie-breaker (BUG-04) and the mean-based latency metric (BUG-05) closes the remaining gaps. Updating these equations transforms the system from conceptually elegant to mathematically sound.





# Grok

**Consolidated Red Team Review — DOC23 Addenda B Family (V2)**

I have reviewed the complete current family of specifications as a single coherent system:

- DOC23 Addenda B Core R0.7.1  
- DOC23 Addenda B / Outcome Evaluator+Revisor V3.3.1  
- DOC23 Evaluation Common Contracts V1.1.1  
- DOC23 Addenda B / Source Workspace V1.0.1  
- DOC23 Addenda B / Task Forum + Run Board V1.0.1  
- DOC23 Addenda B / Feedback Delivery V1.0.1  

This review is fresh-window, adversarial, and complete. It contains every finding, every bug, every gap, every risk, and every new idea surfaced in the red-team round.

---

### §3 Broad / Conceptual Review (6 findings)

**IDEA HIGH** Core R0.7.1 §2.12 & Evaluator+Revisor V3.3.1 §3.9 — Autonomy principle is mis-scoped  
The “planner not reviser” rule and “user is final gate” rhetoric are undermined by the Revisor’s actual power (CompiledRevisionStrategy, direct_fix, regenerate, HardCall bypass paths). The system claims to keep humans in control while giving the Revisor enough autonomy to silently mutate high-stakes artifacts.  
**Why it matters:** A securities litigator cannot rely on a system that both promises reviewability and quietly ships revisions.  
**Recommendation:** Make `AutonomousModePolicy` the single source of truth for every mutation path; expose it in the task UI as a visible toggle with live risk score. Remove any hidden bypasses.

**IDEA CRITICAL** Overall architecture — The system still treats revision as a repair loop rather than a design feedback loop  
The Revisor improves the artifact; almost nothing improves the task graph itself except the narrow TaskProcessGapSignal path.  
**Why it matters:** The highest-leverage improvement in professional work is usually “the process itself was wrong.”  
**Recommendation:** Elevate Task Agent graph-patch proposals to first-class, measurable outcomes inside the Revisor cycle. Make “improve the task” a peer to “improve the artifact.”

**UX CRITICAL** Feedback Delivery V1.0.1 §8 + Common Contracts V1.1.1 §3.7 — DOC23/DOC15/DOC24 boundary is a seam of ambiguity  
Three documents each claim partial ownership of how feedback becomes prompt context; none owns the end-to-end contract.  
**Why it matters:** An AI coding agent will have to guess whether a Repair Instruction travels by wiring, DOC24 packet, or CIL render. That guess becomes runtime nondeterminism.  
**Recommendation:** One document (Feedback Delivery) must own the canonical data-flow diagram with numbered steps and responsibility matrix.

**RISK HIGH** Evaluator+Revisor V3.3.1 §11.21 + Core R0.7.1 §9.0 — Revalidation cascade has no termination guarantee  
The cascade can re-trigger on the same finding after a partial fix, with no documented max-depth or convergence predicate.  
**Why it matters:** Infinite revision loops are possible under realistic failure patterns.  
**Recommendation:** Add a mandatory `revalidation_depth` and `convergence_predicate` to every RevisionPlan; exceed either → escalation_out.
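
A possible reading of this recommendation, sketched under assumptions: the reviewed documents do not define the convergence predicate, so the one below (each pass must strictly shrink the set of open findings and introduce no new ones) is hypothetical, as are the `CascadePass` shape and the `decideNextStep` name.

```typescript
interface CascadePass {
  revalidation_depth: number; // 0 for the first revalidation pass
  open_finding_ids: string[]; // findings still failing after this pass
}

type CascadeDecision = "converged" | "continue" | "escalate";

// Assumed predicate: the open set must strictly shrink and gain no new members.
function decideNextStep(
  previous: CascadePass,
  current: CascadePass,
  maxDepth: number
): CascadeDecision {
  if (current.open_finding_ids.length === 0) return "converged";
  if (current.revalidation_depth >= maxDepth) return "escalate";
  const before = new Set(previous.open_finding_ids);
  const introducedNew = current.open_finding_ids.some((id) => !before.has(id));
  const shrank = current.open_finding_ids.length < previous.open_finding_ids.length;
  return !introducedNew && shrank ? "continue" : "escalate";
}
```

Because every `continue` requires a strictly smaller open set and depth is capped, the cascade cannot re-trigger indefinitely on the same finding; a pass that leaves the set unchanged escalates immediately.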

**IDEA MEDIUM** Task Forum + Run Board V1.0.1 §9 + Source Workspace V1.0.1 §9 — Substantive vs process gap distinction is elegant but unenforced  
The boundary is described but there is no runtime gate that prevents a substantive evaluator from emitting a TaskProcessGapSignal or vice-versa.  
**Why it matters:** Task Agent can receive domain questions it is not competent to answer.  
**Recommendation:** Add a typed `gap_category` enum on every signal and enforce it at emission time.

**UX HIGH** Common Contracts V1.1.1 §3.7 (Pattern C) — Chain linkage is clever but invisible to the user  
`target_evaluation_chain_id` binds Evaluator → Judge results, but no UI surface or audit view exposes the chain as a single conceptual evaluation.  
**Why it matters:** A professional sees two separate verdicts and cannot tell they are part of one evaluation.  
**Recommendation:** Require DOC20 to render chained envelopes as a single card with both qualitative and quantitative slices side-by-side.

---

### §4 Targeted Review — Seams & Hard Mechanics (10 findings)

**GAP CRITICAL** Evaluator+Revisor V3.3.1 §11 + Feedback Delivery V1.0.1 §7 — Revision Dispatcher lifecycle is specified in one place, consumption receipts in another, no unified state machine.  
**RISK HIGH** Source Workspace V1.0.1 §7.5 + Evaluator+Revisor V3.3.1 §12 — Workspace API operations are referenced but never defined (circular reference).  
**RISK HIGH** Task Forum + Run Board V1.0.1 §6.4 + Feedback Delivery V1.0.1 §8.2 — TaskRunContextPacket and DOC24 packet assembly have overlapping responsibilities with no versioned contract.  
**GAP MEDIUM** Core R0.7.1 §24B + Evaluator+Revisor V3.3.1 §29 — Multiple OP-A rows claim the same primitive with slightly different names.  
**RISK HIGH** Evaluator+Revisor V3.3.1 §11.8 + Source Workspace V1.0.1 §2.2 — Idempotency key handling is specified in two places with no shared definition.  
**GAP CRITICAL** Feedback Delivery V1.0.1 §9.5 + Common Contracts V1.1.1 §5 — Consumption Receipts are supposed to feed RepairCycleSignal but the linkage field is optional and never mandated.  
**BUG HIGH** Task Forum + Run Board V1.0.1 §5.3 — Posts are append-only but the UI description in §8.1 implies editing is possible.  
**GAP HIGH** Common Contracts V1.1.1 §9.1 — Validation code `validation.envelope_judge_emitted_qualitative_slice` is a warning, but Pattern C requires qualitative slice from Evaluator.  
**RISK CRITICAL** Evaluator+Revisor V3.3.1 §6.7 — Revisor sufficiency protocol has no defined detection algorithm for “no sufficient procedure.”  
**GAP HIGH** Source Workspace V1.0.1 §9.3 + Core R0.7.1 §24A — Library promotion policy references EC Core policy but the exact gate is not defined.

---

### §5 Exhaustive Defect Hunt (14 findings)

**BUG CRITICAL** Evaluator+Revisor V3.3.1 §11.3 — Deterministic plan linting references a “DAG acyclic” check but the RevisionPlan schema has no `depends_on` field defined for steps.  
**GAP CRITICAL** Feedback Delivery V1.0.1 §3.4 rule 2 — “Hard blockers require backed authority” but the schema for EvaluationFinding.authority_basis is an array that can be empty.  
**RISK HIGH** Source Workspace V1.0.1 §4.1 — SourceRecord.taint_class is present but never populated or checked on read.  
**BUG MEDIUM** Task Forum + Run Board V1.0.1 §5.3 — Posts are append-only but the UI description implies editing is possible.  
**GAP HIGH** Common Contracts V1.1.1 §7.2 — TextAnchor.context_hash is defined but never computed or validated anywhere.  
**RISK CRITICAL** Overall set — No document owns the canonical RevisionExecutionLifecycle state machine that ties dispatcher, receipts, and revalidation.  
**GAP HIGH** Evaluator+Revisor V3.3.1 §15.8 — Sub-agent advice quality metrics are named but no schema records them.  
**RISK MEDIUM** Feedback Delivery V1.0.1 §4.5 — RunGuidanceItem.lifecycle_state can be “contested” but no consumer ever checks that state before using guidance.  
**BUG HIGH** Task Forum + Run Board V1.0.1 §6.3 — BoardDigest includes `open_research_needs` but ResearchNeed schema lives in Source Workspace V1 and is not referenced.  
**RISK HIGH** Source Workspace V1.0.1 §7.6 — Cost attribution says “attribute cost to the module itself” but no schema field records per-module cost on SourceRecord.  
**GAP CRITICAL** Core R0.7.1 §3D — TaskOpportunityPacket.token_budget is defined but no consumer ever populates or checks it.  
**BUG HIGH** Feedback Delivery V1.0.1 §6.4 — Repeated-failure detection uses `criterion_id + finding_kind + affected_artifact_refs` but affected_artifact_refs can be versioned and therefore non-stable.  
**GAP HIGH** Evaluator+Revisor V3.3.1 §11.8 — Idempotency key format is never canonicalized across the family.  
**RISK HIGH** Task Forum + Run Board V1.0.1 §7.4 + Evaluator+Revisor V3.3.1 §14.9 — Plan review forum uses `RoomKind.plan_review` but the room kind is never registered in any document.
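
The first finding in this list says the "DAG acyclic" lint cannot run because steps have no `depends_on` field. As a sketch of what the check could look like once that field exists, the example below uses Kahn's algorithm. The `PlanStep` shape is hypothetical, and step IDs are assumed to be unique.

```typescript
// Hypothetical step shape; `depends_on` is the field the finding says is missing.
interface PlanStep {
  step_id: string;
  depends_on: string[];
}

type LintResult =
  | { acyclic: true; order: string[] }
  | { acyclic: false; unresolved: string[] };

// Kahn's algorithm: repeatedly remove steps whose dependencies are all satisfied.
function lintPlanDag(steps: PlanStep[]): LintResult {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const s of steps) {
    indegree.set(s.step_id, 0);
    dependents.set(s.step_id, []);
  }
  for (const s of steps) {
    for (const dep of s.depends_on) {
      if (!indegree.has(dep)) {
        throw new Error(`step ${s.step_id} depends on unknown step ${dep}`);
      }
      indegree.set(s.step_id, indegree.get(s.step_id)! + 1);
      dependents.get(dep)!.push(s.step_id);
    }
  }
  const queue = steps.filter((s) => indegree.get(s.step_id) === 0).map((s) => s.step_id);
  const order: string[] = [];
  while (queue.length > 0) {
    const id = queue.shift()!;
    order.push(id);
    for (const next of dependents.get(id)!) {
      const remaining = indegree.get(next)! - 1;
      indegree.set(next, remaining);
      if (remaining === 0) queue.push(next);
    }
  }
  if (order.length === steps.length) return { acyclic: true, order };
  // Any step never emitted is part of, or downstream of, a cycle.
  const unresolved = steps.map((s) => s.step_id).filter((id) => order.indexOf(id) === -1);
  return { acyclic: false, unresolved };
}
```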

---

### New Ideas & Improvements (5 concrete proposals)

1. **"Design Feedback" as Peer to "Artifact Repair"**  
   Elevate Task Agent graph-patch proposals to a first-class outcome inside the Revisor cycle (parallel to repair_instructions). Make “improve the task” a measurable, learned outcome. This is the single highest-leverage change for professional users.

2. **Mandatory Dry-Run + ShadowWorkspace for All Plans Above Risk Threshold**  
   Make ShadowWorkspace default-on for any plan containing `regenerate`, `restructure`, or HardCall detection. Surface the dry-run diff + risk summary before user approval. Ties directly into [E16] version-and-diff workflow.

3. **Explanation Trace as First-Class Artifact**  
   Require every `CompiledRevisionStrategy` and final `RevisionPlan` to produce a short human-readable Explanation Trace (causal diagnosis + why this repair + preservation guarantees). Cheap to generate, dramatically improves trust and debuggability.

4. **Automatic Pattern Suggestion UI**  
   When an outcome has failed ≥2 times, proactively surface the top-3 similar successful patterns from DOC72 with an “Apply this pattern?” button. Turns passive memory into active assistance.

5. **Unified “Evaluation Chain” View in DOC20**  
   Render Pattern C chains (and any future multi-hop evaluation chains) as a single card in the UI with both qualitative (Evaluator) and quantitative (Judge) slices side-by-side. Solves the invisibility problem in Common Contracts §3.7.

---

### Synthesis

**Overall assessment:**  
The design is conceptually ambitious and contains several genuine architectural advances (dual compilers, defeasible findings, rich taxonomies, Pattern C chain linkage, taint-tier binding). It is not yet build-ready. The system will almost certainly produce a working prototype, but it will not yet produce something a high-stakes professional can trust with consequential work. The greatest strength is the explicitness of the graph + revision cycle. The single greatest weakness is the number of critical seams and underspecified runtime contracts that force implementers to guess. The architecture is still “plan-and-execute with many missing guardrails” rather than “reliable, auditable, reviewable professional workflow engine.”

**Top 5 must-fix before build (ranked):**  
1. Unified RevisionExecutionLifecycle + receipt linkage (critical distributed-systems gap).  
2. Hard Call detection mechanism (safety blocker).  
3. Complete taint propagation and clearance contract (compliance blocker).  
4. Cross-document seam ownership for context/packet assembly (implementation blocker).  
5. Mandatory Plan Verification Agent for high-risk plans (reliability blocker).

**Biggest conceptual opportunity:**  
Shift from “fix the artifact” to “improve the task” as the primary Revisor outcome (this is New Idea #1 above). Make graph-patch proposals first-class, measurable, and learned. That single change would turn this from a sophisticated repair loop into a true self-improving professional workflow system.

**Coding-agent readiness:**  
No. An AI coding agent given this set today would have to invent behavior in at least 12 places. The largest sources of guesswork are the missing unified state machines and the circular or incomplete cross-document obligations. The set needs one more tightening pass focused on contracts and seams before it is implementation-safe.

---

This is the single, complete response containing **everything** from the latest Red Team round: all conceptual findings, all targeted seams, all defect-hunt findings, all 5 new ideas in full, and the full synthesis. No information was omitted.