ELNOR REPO READER TEXT MIRROR
Original path: Active Working and Red Team/DOC23 Working/DOC23 Red Teaming/CLAUDE_DOC23_ADDENDA_B_REVIEW_CONSOLIDATED_V2.md
Source repo: /Users/OpenClaw1/Elnor/Elnor Specs
Git branch: main
Git commit: dbaa25962edc11ab30e8d4ca1715f9ae5bf77331
Generated: 2026-06-09T01:23:58.539Z
---
# Claude — Consolidated Red-Team Review & Design Notes
## DOC23 Addenda B Specification Family
**Reviewer:** Claude (fresh-window read of all six in-scope documents)
**Documents under review:** Core R0.7.1 · Outcome Evaluator + Revisor V3.3.1 · Evaluation Common Contracts V1.1.1 · Source Workspace V1.0.1 · Feedback Delivery V1.0.1 · Task Forum + Run Board V1.0.1
**Prompt:** RED_TEAM_DOC23_ADDENDA_B_SET_V2.md
**Status of this document:** Save-point snapshot of Claude's complete position before external reviewer responses are folded in. Sections 1–2 are the original structured review; Sections 3–8 are everything produced after it (deeper-dive findings, proposed surfaces, the fork/chat/re-prompt design, the evaluation of the four other reviewers, the recommended hardening pass, and the reusable purpose-question library).
---
## 0. How to read this document
This consolidates several review passes into one reference:
- **§1 Executive synthesis** — the overall assessment, the two dominant defect patterns, the decision-set rule, and where an implementer would be forced to guess.
- **§2 Original findings** — 33 findings from the structured red-team response: 6 conceptual (A1–A6), 12 targeted (B1–B12), 15 defect-hunt (C1–C15). Verbatim. (The original pass self-labeled "32"; it undercounted Part B — there are 12 targeted findings, not 11.)
- **§3 Deeper-dive additional findings** — 24 further findings (D1–D24) produced after the original pass, including two CONFIRMED affirmations.
- **§4 New user-facing surfaces proposed** — six surfaces (S1–S6) with schemas.
- **§5 Fork / chat-in-session / re-prompt unified design** — TaskRunFork, ModuleActivationChat, and the single-addendum three-section decision.
- **§6 Evaluation of the four external reviewers** — what changed Claude's thinking, the additions to adopt, and the proposals deliberately not adopted (with reasons).
- **§7 Recommended R0.4 "Math + Contract Hardening Pass" scope.**
- **§8 Appendix: reusable purpose-question library** + reviewer-assignment grid.
Finding IDs are stable. Cite them directly (e.g. "B1", "D14", "S2") in downstream work.
Tags follow the prompt's set: `BUG` (wrong/contradictory), `GAP` (must be specified but isn't), `RISK` (distributed-systems/reliability hazard), `UX` (trust/transparency/reviewability), `IDEA` (reconceptualization/missing capability), `CONFIRMED` (genuinely well-designed). Severity: `CRITICAL` / `HIGH` / `MEDIUM` / `LOW`.
**On severity and disposition (read this).** Severity describes how broken a given defect is — its impact if left as-is. It is **not** a triage instruction and does not rank what to fix first. There are no priority tiers in this document. The disposition is binary: **every finding and proposal here is to be addressed, unless it appears in the "Proposed but not adopted" set (§6.2), each item of which carries an explicit reason for declining.** Severity tells you what each defect costs; it does not tell you whether to fix it. (All findings are worth fixing — that is why they were surfaced rather than discarded.)
---
## 1. Executive synthesis
**Overall assessment.** This set is architecturally ambitious and substantially sound, but its integration layer is not yet complete — as a *set*, the six documents do not yet fully agree, and that is where the defects concentrate. The strongest design decision is the strict separation of planner (Revisor) from executor (revision-capable modules) via the `revision_in` port contract — that is the single highest-leverage idea in the set and it is well-specified. The defeasibility of findings (FD V1 §3.4) and the goal-advancement sycophancy fix (V3.3.1 §6.12 + §13.3.1) are unusually thoughtful pieces of design that most AI orchestration systems get wrong. The cross-pattern coordination work (Pattern A/B/C) and the explicit graph-native discipline (no hidden dispatch) reflect mature thinking about why AI systems lose user trust.
However, the family-topology split into six documents has produced concrete schema collisions (the `EvaluationFinding` double-definition between V3.3.1 §5.7 and FD V1 §3.3 is the most severe — same name, incompatible fields, both currently authoritative). Phantom schema references (`EvaluationAffirmation` doesn't exist; `ClaimSetBundle` is in the wrong document) tell a coding agent to look in the wrong place. Mid-level mechanisms like the revalidation cascade lack convergence proofs, indeterminate-state handling is incomplete in routing policies, and several enums are referenced before being defined. The set displays the classic symptom of multiple documents authored at slightly different times: each is internally consistent, the integration isn't.
The bar Will set ("a high-stakes professional relies on it") is not yet met — but mostly because of *integration defects*, not core design. The shape of the system is right; the seams aren't tightened.
**Two dominant patterns.** (1) *Schema-level locks are used precisely where they matter* — AutonomousModePolicy locked-false fields (§6.6.2), the goal_advancement sycophancy exclusion (§6.12 / §13.3, AttributionRecord explicitly excludes `revisor_self_assessment`). That instinct is correct and should be the model for the rest. (2) *The boundary problems are bigger than the interior problems* — the most severe findings live at the seams between documents (B1, B5, B6, B11), not inside any one document's logic. The interior of each document is largely solid.
**Disposition (the decision set).** Everything in this document is to be addressed. There is no priority ordering and no "fix these, defer those" — the only items *not* being acted on are the four in §6.2 (proposed mechanisms declined, each with its reason). The most consequential single defect is **B1** (two incompatible `EvaluationFinding` schemas, both currently authoritative); the most consequential *structural* fix surfaced by the other reviewers is the **Formula Registry** (§6.1), which collapses a dozen-plus "this number has no formula" gaps into one place. The new capabilities worth calling out — because they add reach the system doesn't have today, not because they outrank the fixes — are **TaskHealthCard (A1)**, **WorkProductCertification (S1)**, and **FindingsInbox (S2)**. All are in the adopt set.
**Where an implementer would be forced to guess.** Handed this set as-is, a coding agent would have to invent answers to: (a) which `EvaluationFinding` schema is canonical when both are referenced; (b) where `ClaimSetBundle` actually lives; (c) what to do when the Pattern C chain ID doesn't resolve; (d) how a sub-agent-less Evaluator should proceed; (e) what taint class to assign to a `web_source`. None are deep design problems; they are unfinished spec details, and each is closed by a specific finding in §2 or §3 (for example B1 for (a), B5 for (b), B3 for (c), B10 for (d)). Closing the cross-doc obligations and naming Common Contracts as the single schema-of-record for shared types removes the largest cluster of guess-points at once; the natural place to do that is the DOC23 R3.2 absorption.
---
## 2. Original findings (33)
### Part A — Conceptual / UX findings (§3)
#### A1 · [IDEA] [HIGH] · Set-wide — No first-class "task health" surface, only fragmented signals
**Finding:** The set produces extraordinarily rich per-step instrumentation — `RevisionOperationReceipt`, `FeedbackConsumptionReceipt`, `EvaluationResultEnvelope`, `RepairCycleSignal`, `PatternHealthState`, `SubAgentReputation`, eight signal types, four delivery channels, five evaluation slices. But for a professional asking "is this task going well?", no document defines a unified, decision-grade health view. The user must reconstruct health by joining run board events, current evaluation states, pending Hard Calls, budget burn, taint state, and repeated-failure counters across at least four documents.
**Why it matters:** A high-stakes user does not read receipts; they read dashboards. CI/CD systems, clinical decision support, and Bloomberg terminals all converged on this lesson decades ago. Without a defined "Task Health Card" with a clear top-line ("on track / at risk / blocked / unrecoverable") and the three to five drivers behind that label, the user is forced into the operator role this design explicitly rejects (Core §28 closing note: "every task created, run, inspected... makes the Task Agent better"). The richness of the substrate is wasted if reviewability does not exist at this level.
**Recommendation:** Add a `TaskHealthCard` schema to Core R0.7.1 (probably §20 or a new §20A) that aggregates the runtime signals into a top-line state with explicit driver references. Specify the inputs (e.g., `unresolved_blocking_findings_count`, `repeated_failure_pattern_active`, `budget_burn_to_date`, `hard_calls_pending`, `upstream_failure_outcomes`, `taint_escalations_in_run`, `cost_estimator_confidence`) and the deterministic aggregation function. The card lives next to — not replacing — the Run Board.
**Reference:** Core R0.7.1 §12 telemetry spine (passive); Run Board V1 §1.2 events; FD V1 §6.4 repeated-failure; V3.3.1 §21 surfaces (cards by record kind, no aggregate).
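A minimal sketch of what a deterministic aggregation function for such a card could look like. It uses a subset of the input names from the recommendation; the `TaskHealthState` labels mirror the four top-line states quoted in the finding, while the thresholds, the precedence order, and the treatment of `upstream_failure_outcomes` as unrecoverable are illustrative assumptions, not values from any of the six documents.

```typescript
// Hypothetical sketch. Thresholds and precedence are assumptions for illustration.
type TaskHealthState = "on_track" | "at_risk" | "blocked" | "unrecoverable";

interface TaskHealthInputs {
  unresolved_blocking_findings_count: number;
  repeated_failure_pattern_active: boolean;
  budget_burn_to_date: number; // assumed: fraction of budget spent, 0..1
  hard_calls_pending: number;
  upstream_failure_outcomes: number;
  taint_escalations_in_run: number;
}

interface TaskHealthCard {
  state: TaskHealthState;
  drivers: string[]; // names of the inputs that produced the state
}

function aggregateTaskHealth(i: TaskHealthInputs): TaskHealthCard {
  // Worst condition wins; every branch records which inputs drove the label.
  if (i.upstream_failure_outcomes > 0) {
    return { state: "unrecoverable", drivers: ["upstream_failure_outcomes"] };
  }
  const blocked: string[] = [];
  if (i.unresolved_blocking_findings_count > 0) blocked.push("unresolved_blocking_findings_count");
  if (i.hard_calls_pending > 0) blocked.push("hard_calls_pending");
  if (blocked.length > 0) return { state: "blocked", drivers: blocked };

  const risk: string[] = [];
  if (i.repeated_failure_pattern_active) risk.push("repeated_failure_pattern_active");
  if (i.budget_burn_to_date > 0.8) risk.push("budget_burn_to_date"); // 0.8 is an assumed threshold
  if (i.taint_escalations_in_run > 0) risk.push("taint_escalations_in_run");
  if (risk.length > 0) return { state: "at_risk", drivers: risk };

  return { state: "on_track", drivers: [] };
}
```

Because the function is pure and every returned state carries its drivers, the same inputs always yield the same card and the user can see why the label was assigned.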
#### A2 · [IDEA] [HIGH] · Set-wide — Cost predictability is asserted but never computable end-to-end before run
**Finding:** Budget bifurcation (V3.3.1 §11.15) and `EstimatorConfidence` (§11.15.4) are sophisticated, but there is no `TaskRunCostForecast` produced *before* a run — only RevisionCostEstimate produced inside a planning compiler invocation, after the user has committed to running. The Source Research module attributes cost to the workspace and tools (Source Workspace §7.6) but never produces a forecast either. The forum's BoardDigest carries `token_count` after generation, not predicted.
**Why it matters:** A professional needs to know "this task will probably cost about $X and 8 minutes" *before* clicking Run, especially for delegated/scheduled runs and overnight batches. Without a forecast, the user discovers cost only by running, which forces over-cautious manual oversight on every run — the opposite of the platform's purpose.
**Recommendation:** Define `TaskRunCostForecast` in Core R0.7.1, produced at task pre-execution (Task Assessment time per Core §16): module-level cost estimates plus estimator confidence per module, summed into a task-level total with a forecast confidence band. Display it in the Task Assessment surface (Core §24.8). Defer accuracy goals to Phase 2 but ship the surface and the band.
**Reference:** Core R0.7.1 §16 (Task Assessment); V3.3.1 §11.15.4 RevisionCostEstimate; Source Workspace §7.6 cost attribution.
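As a rough illustration of the summation described in the recommendation, the sketch below adds per-module estimates into a task-level forecast. The field names, the three-level confidence scale, the USD unit, and the weakest-link rule for the overall confidence are all assumptions made for this example; the source only names `TaskRunCostForecast` and the idea of a band.

```typescript
// Hypothetical sketch: shapes and the weakest-link rule are assumptions.
type AssumedConfidence = "low" | "medium" | "high";

interface ModuleCostEstimate {
  module_id: string;
  expected_usd: number;
  low_usd: number;  // optimistic bound for this module
  high_usd: number; // pessimistic bound for this module
  confidence: AssumedConfidence;
}

interface TaskRunCostForecast {
  expected_usd: number;
  band_usd: [number, number];
  forecast_confidence: AssumedConfidence;
}

function forecastTaskRunCost(modules: ModuleCostEstimate[]): TaskRunCostForecast {
  const rank: Record<AssumedConfidence, number> = { low: 0, medium: 1, high: 2 };
  let expected = 0;
  let low = 0;
  let high = 0;
  let weakest: AssumedConfidence = "high";
  for (const m of modules) {
    expected += m.expected_usd;
    low += m.low_usd;
    high += m.high_usd;
    // The forecast is only as trustworthy as its least confident module.
    if (rank[m.confidence] < rank[weakest]) weakest = m.confidence;
  }
  return { expected_usd: expected, band_usd: [low, high], forecast_confidence: weakest };
}
```

Summing the per-module bounds gives a deliberately conservative band; a tighter statistical combination would need distributional assumptions the set does not state.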
#### A3 · [UX] [HIGH] · V3.3.1 §21 + FD V1 §8 — Reviewability is fragmented across at least three surfaces
**Finding:** When a revision regenerates a 30-page brief, the user is offered: SemanticChangelog (V3.3.1 §7.11 / §21.6), then raw diff (§21.6), then FeedbackConsumptionReceipt chain (FD V1 §10.3 DOC20 insert "show the path from finding → repair instruction → consumer → produced artifact"), then RunBoard timeline (Run Board V1 §8.2), then Pattern display (V3.3.1 §21.8 — `from_memory` vs `adapted_from_memory`), then ExplanationTrace markdown (V3.3.1 §7.10), then SubAgent metrics (V3.3.1 §15.8.3). Each makes sense in isolation; none defines the *review session* shape a user actually performs.
**Why it matters:** A securities litigator reviewing AI work product does not work surface-by-surface. They want to ask: "what changed substantively, what was the AI's reasoning, what evidence backs it, where can it have gotten this wrong, and can I revert?" That is one continuous review activity, not seven separate surfaces. The spec puts the burden of integration on the implementer; the implementer will guess.
**Recommendation:** Define a `RevisionReviewSession` UI contract in V3.3.1 §21 (or a new sub-section of Core §20) that names the canonical surfaces the user navigates in order: top-of-pack `SemanticChangelog` summary → ExplanationTrace narrative → diff drill-in → Evaluation Result Card showing why this was needed → SubAgent provenance for advised steps → Revert/partial-accept controls. The contract names the surfaces and their navigation links; visual layout is non-normative.
**Reference:** V3.3.1 §21.6, §21.7, §21.8, §7.10, §7.11; FD V1 §10.3 (DOC20 insert).
#### A4 · [IDEA] [MEDIUM] · Set-wide — No "fork and continue" primitive at the task-run level
**Finding:** V3.3.1 has `fork_from_checkpoint` as a `RevisionPlanStepKind` (§0.4.6, §7.5) and Core mentions "rerun/fork-from-module" UI affordances in §27H.13. But at the task-run level — when a professional says "this run was almost right, branch from here and try a different drafting approach" — no schema and no flow are defined. `RunWorkspace` is explicitly run-scoped (V3.3.1 §12.1.1); branching is plan-internal only.
**Why it matters:** This is the most-asked feature for any drafting workflow. Without it, the user's recovery mechanism for "this run went sideways at module 5" is to delete or restart, losing the upstream work. Will's explicit goal (Core §1.4) is "high-stakes professional work where errors are expensive" — and the most expensive error is throwing away work that was nearly right.
**Recommendation:** Define `TaskRunFork` in Core R0.7.1 as a first-class primitive with `parent_run_id`, `fork_point_module_id_and_activation_seq`, `copied_workspace_refs`, `divergence_reason`, and `fork_disposition: "experimental" | "alternate_path" | "recovery"`. Define the user gesture (right-click a Run Board event → "fork from here"). Specify how `SourceWorkspace` (persistent) vs `RunWorkspace` (ephemeral) are handled at fork time. *(See §5 — this proposal was substantially expanded after the original review.)*
**Reference:** Core R0.7.1 §27H.13 (mentions but does not specify); V3.3.1 §0.4.6 (plan-internal only).
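A sketch of the record shape proposed above, using the five field names from the recommendation. The field types, the split of `fork_point_module_id_and_activation_seq` into two members, and the `makeTaskRunFork` helper with its validation rules are assumptions for illustration only; how `SourceWorkspace` and `RunWorkspace` references are treated at fork time is left open, as it is in the recommendation.

```typescript
// Hypothetical sketch: field types and the helper are assumptions.
type ForkDisposition = "experimental" | "alternate_path" | "recovery";

interface TaskRunFork {
  parent_run_id: string;
  fork_point_module_id_and_activation_seq: { module_id: string; activation_seq: number };
  copied_workspace_refs: string[];
  divergence_reason: string;
  fork_disposition: ForkDisposition;
}

function makeTaskRunFork(
  parentRunId: string,
  moduleId: string,
  activationSeq: number,
  workspaceRefs: string[],
  reason: string,
  disposition: ForkDisposition
): TaskRunFork {
  // A fork without a stated reason is hard to audit later, so require one.
  if (reason.trim().length === 0) throw new Error("divergence_reason is required");
  if (activationSeq < 0 || Math.floor(activationSeq) !== activationSeq) {
    throw new Error("activation_seq must be a non-negative integer");
  }
  return {
    parent_run_id: parentRunId,
    fork_point_module_id_and_activation_seq: { module_id: moduleId, activation_seq: activationSeq },
    copied_workspace_refs: workspaceRefs.slice(), // defensive copy of the ref list
    divergence_reason: reason,
    fork_disposition: disposition,
  };
}
```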
#### A5 · [UX] [MEDIUM] · FD V1 §3.4 + V3.3.1 §6.12 — Over-relies on user contesting; doesn't surface contestability
**Finding:** Findings are explicitly defeasible (FD V1 §1.3, §3.4 rules 1–7). User-contesting transitions a finding to `contested` and unblocks downstream. But nowhere is "this finding is the kind a user often contests" surfaced in the UI before they look at it. The Pattern primitive tracks `contested_finding_count` (V3.3.1 §13.3) and `false_positive_count` for sub-agents (§15.8); neither surfaces into the user's review flow.
**Why it matters:** A professional with limited review time wants to know up-front: "this evaluator tends to over-flag X — start there." Without contestability signaling, the user reviews findings in order and burns attention budget on findings that aren't really problems. The infrastructure exists; the surfacing doesn't.
**Recommendation:** Add `historical_false_positive_rate_in_context` and `historical_contest_rate_in_context` fields to the EvaluationFinding (or computed view) when rendered. Sort findings by reliability descending in the §21.1 Evaluation Result Card by default. Mark findings produced by sub-agents currently in `watch` status (§15.8.3) with a "lower-confidence" badge.
**Reference:** FD V1 §3.4; V3.3.1 §13.3 contested_finding_count; §15.8 sub-agent reputation; §21.1.
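One way the default ordering and the badge could be computed is sketched below. The two rate fields are the ones named in the recommendation; the `reliabilityOf` formula (one minus the larger of the two rates), the `producer_status` field, and the `orderForReview` name are assumptions introduced for this example.

```typescript
// Hypothetical sketch: the reliability formula is an assumption, not a spec rule.
interface RenderedFinding {
  finding_id: string;
  historical_false_positive_rate_in_context: number; // 0..1
  historical_contest_rate_in_context: number;        // 0..1
  producer_status: "normal" | "watch"; // assumed stand-in for the §15.8.3 status
}

function reliabilityOf(f: RenderedFinding): number {
  return 1 - Math.max(
    f.historical_false_positive_rate_in_context,
    f.historical_contest_rate_in_context
  );
}

function orderForReview(
  findings: RenderedFinding[]
): { finding_id: string; badge: string | null }[] {
  return findings
    .slice() // do not mutate the caller's array
    .sort((a, b) => reliabilityOf(b) - reliabilityOf(a)) // reliability descending
    .map((f) => ({
      finding_id: f.finding_id,
      badge: f.producer_status === "watch" ? "lower-confidence" : null,
    }));
}
```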
#### A6 · [IDEA] [MEDIUM] · Set-wide — The "process gap vs substantive gap" distinction is correct, but no learning loop closes it
**Finding:** Forum V1 §9 correctly distinguishes process gaps (graph design problem) from substantive gaps (work-product problem). `TaskProcessGapSignal` (Core §9.0.3) feeds the Task Agent assessment queue. But Task Agent's response to process gaps is described as "proposes graph patches"; there is no enumerated trigger that converts repeated process-gap signals into Task Blueprint amendment proposals or DOC72 task-design patterns. The signal flows but the loop doesn't close at the design layer.
**Why it matters:** A system that detects "your task design lacks a source verification stage" is half a system. The other half, promoting the detection into a learned default for future tasks of this kind, is what makes the platform compound (Core §28: "the value comes from compounding"). Without the closure, every new task has to be taught the same lesson again.
**Recommendation:** Add a §9.X to Core R0.7.1 specifying the process-gap-to-design-pattern conversion: threshold N of similar `TaskProcessGapSignal`s over similar TaskBlueprints → Task Agent emits a `TaskBlueprintAmendmentProposal` to the Task Design Learning Review Queue (Core §24.8 mentions this queue but does not specify the conversion rule). Specify what "similar" means structurally. *(Note: this touches the paused self-learning surface; treat as flagged, not deep-dived.)*
**Reference:** Forum V1 §9; Core R0.7.1 §9.0.3, §24.8; Core §28 closing.
### Part B — Targeted findings (§4)
#### B1 · [BUG] [CRITICAL] · FD V1 §3.3 vs V3.3.1 §5.7 — Two incompatible `EvaluationFinding` schemas
**Finding:** `EvaluationFinding` is defined in BOTH V3.3.1 §5.7 and Feedback Delivery V1.0.1 §3.3 with mutually exclusive fields. V3.3.1's version has `finding_text`, `severity (4 values)`, `state: FindingState (12 values)`, `basis: AssuranceBasis`, `target_artifact_ref`, `taint_class`, `confidence: "low"|"medium"|"high"`. FD V1's version has `finding_kind (12 values)`, `severity (4 values)`, `authority_basis: EvaluationAuthorityBasis[] (9 values, NEW enum)`, `lifecycle_state: EvaluationFindingLifecycleState (7 values)`, `target_criterion_id`, `target_scope_ref`, `affected_claim_refs`, `confidence: number (0-1)`, `based_on_board_digest_ref`. No mapping is provided. Common Contracts §4.2 references "EvaluationFinding[] — Addenda B V3.1 §5.7" — meaning the V3.3.1 schema is canonical. So FD V1 is declaring a schema that doesn't match the canonical reference.
**Why it matters:** A coding agent implementing the Outcome Evaluator emits `EvaluationFinding` per V3.3.1 §5.7; the same agent implementing Feedback Delivery reads `EvaluationFinding` per FD V1 §3.3. The schemas don't share fields. The `FindingState` enum (V3.3.1) and `EvaluationFindingLifecycleState` enum (FD V1) overlap by name only. This is a single-name-two-schemas defect; the build will produce one model that fits neither.
**Recommendation:** Declare one canonical `EvaluationFinding` schema in DOC23 Evaluation Common Contracts V1.1.1 (the natural home — it is shared across Addenda A and B). Move it now, before V3.4 / V1.2 work proceeds. Reconcile field sets: V3.3.1's `state` and FD V1's `lifecycle_state` are the same concept; `basis: AssuranceBasis` (V3.3.1) is unrelated to `authority_basis: EvaluationAuthorityBasis[]` (FD V1) — pick one model or define how both coexist (they aren't redundant — `authority_basis` is what makes a finding a hard blocker per FD V1 §3.4 rule 2). Make the consolidation a coordinated schema bump per Common Contracts §10.2.
**Reference:** V3.3.1 §5.7; FD V1.0.1 §3.3; Common Contracts §4.2.
#### B2 · [BUG] [HIGH] · V3.3.1 §0.4.1 vs Common Contracts §3.1/§3.2 — `OutcomeEvaluationState` has 15 values; mapping treats it as 14
**Finding:** V3.3.1 §0.4.1 enumerates `OutcomeEvaluationState` with 15 values (`pending`, `pending_dependency`, `evaluating`, `satisfied`, `needs_revision`, `needs_information`, `needs_verification`, `needs_human_judgment`, `unable_to_evaluate`, `blocked_by_policy`, `regressed`, `unrecoverable`, `dirty`, `superseded`, `upstream_failure`). Common Contracts §3.1 says the field "is populated from V3.1's 14-value enum"; §3.2's verdict mapping covers 14 of the 15 — `evaluating` is not mapped at all, and the doc itself counts wrong.
**Why it matters:** A producer in the `evaluating` state has an undefined `evaluation_verdict` per the mapping. Real systems hit this transient state regularly (it's literally the in-flight state). Without a mapping, either the producer can't emit an envelope while evaluating (which would block §5.18's "every Evaluator activation emits exactly one envelope" guarantee), or it emits something the schema doesn't sanction.
**Recommendation:** Either (a) declare `evaluating` a transient state that MUST NOT emit an envelope (matching how `dirty`, `superseded`, `pending`, `pending_dependency` are treated in Common Contracts §3.2), and fix Common Contracts to say "15-value enum, of which 5 are transient and 10 map to envelope verdicts," or (b) define `evaluating → indeterminate`, correct the documented value count to 15, and bump the version. Update the V3.3.1 §5.18.4 emission discipline to disambiguate.
**Reference:** V3.3.1 §0.4.1; Common Contracts V1.1.1 §3.1, §3.2; V3.3.1 §5.18.4.
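Option (a) can be sketched as an emission guard. The 15 state names and the five transient ones are taken from the finding and recommendation above; the `mayEmitEnvelope` function and the constant names are hypothetical, and the sketch deliberately does not attempt the verdict mapping itself, since this review does not reproduce it in full.

```typescript
// Sketch of option (a): transient states never emit an envelope.
type OutcomeEvaluationState =
  | "pending" | "pending_dependency" | "evaluating" | "satisfied"
  | "needs_revision" | "needs_information" | "needs_verification"
  | "needs_human_judgment" | "unable_to_evaluate" | "blocked_by_policy"
  | "regressed" | "unrecoverable" | "dirty" | "superseded" | "upstream_failure";

const ALL_STATES: OutcomeEvaluationState[] = [
  "pending", "pending_dependency", "evaluating", "satisfied",
  "needs_revision", "needs_information", "needs_verification",
  "needs_human_judgment", "unable_to_evaluate", "blocked_by_policy",
  "regressed", "unrecoverable", "dirty", "superseded", "upstream_failure",
];

// The four states already treated as transient, plus `evaluating`.
const TRANSIENT_STATES: OutcomeEvaluationState[] = [
  "pending", "pending_dependency", "evaluating", "dirty", "superseded",
];

function mayEmitEnvelope(state: OutcomeEvaluationState): boolean {
  return TRANSIENT_STATES.indexOf(state) === -1;
}

// Count of states that must map to an envelope verdict under option (a).
const EMITTING_STATE_COUNT = ALL_STATES.filter(mayEmitEnvelope).length;
```

Under this reading the arithmetic in the recommendation holds: 15 states, 5 transient, 10 that need a verdict mapping.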
#### B3 · [GAP] [HIGH] · V3.3.1 §5.18 + Common Contracts §3.7 — Pattern C chain ID lifecycle is unspecified
**Finding:** V3.3.1 §5.18.4 says the upstream Evaluator populates `target_evaluation_chain_id`. Common Contracts §3.7 says "the upstream Evaluator's envelope populates `target_evaluation_chain_id` with a UUID identifying the evaluation chain" and the downstream Judge sets the same value. But neither doc specifies: (1) Who generates the chain UUID — the Evaluator at every activation, even when no Pattern C Judge will attach? (2) What happens when the chain ID does not resolve at the consumer side? (3) How long the chain ID is retained / when it can be GC'd. (4) Whether different Evaluator activations of the same outcome share a chain ID or always get fresh ones.
**Why it matters:** If every Evaluator activation always gets a fresh UUID, then audit reconstruction (§3.7 "given the chain id, retrieve all envelopes in the chain") works only when Pattern C wiring is present — but the spec says every Evaluator emits this field. If the Evaluator does NOT always emit it, the field is sometimes-null and downstream Judges may attach to envelopes lacking the linkage primitive. Either way the field has no lifecycle.
**Recommendation:** In Common Contracts §3.7 specify: the upstream Evaluator MUST generate a fresh UUID at activation time (or at envelope emission time) and emit it as `target_evaluation_chain_id`; the Judge in Pattern C MUST read it from `evaluator_output_in.target_evaluation_chain_id` and set its own envelope's field to the same value; orphan envelopes (no downstream consumer) keep the field but it is unused. Add a validation code `validation.pattern_c_chain_id_mismatch` for when the Judge's value does not match its declared upstream Evaluator. Specify retention: the chain ID has the same retention as the envelope itself.
**Reference:** V3.3.1 §5.18.4; Common Contracts V1.1.1 §3.7. *(See also the cross-reviewer finding in §6 on the Pattern C envelope FIELD mismatch — `evaluated_target`/`evaluation_basis` are read by the Judge but live on a different schema. That is a distinct, additional defect.)*
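The copy-and-verify rule in the recommendation can be sketched as follows. `target_evaluation_chain_id` and the validation code come from the text above; the `ChainLinkedEnvelope` shape, the helper names, and the choice to treat a missing upstream ID as a mismatch are assumptions, since the unresolved-ID case is exactly what the finding says is unspecified.

```typescript
// Hypothetical sketch: envelope shape and helper names are assumptions.
interface ChainLinkedEnvelope {
  envelope_id: string;
  producer_role: "evaluator" | "judge";
  target_evaluation_chain_id: string | null;
}

// The Judge copies the chain ID from its declared upstream Evaluator envelope.
function buildJudgeEnvelope(
  envelopeId: string,
  upstream: ChainLinkedEnvelope
): ChainLinkedEnvelope {
  return {
    envelope_id: envelopeId,
    producer_role: "judge",
    target_evaluation_chain_id: upstream.target_evaluation_chain_id,
  };
}

// Returns the validation code on failure, or null when the chain is intact.
function checkPatternCChain(
  upstream: ChainLinkedEnvelope,
  judge: ChainLinkedEnvelope
): string | null {
  const id = upstream.target_evaluation_chain_id;
  // Assumption: a missing upstream ID is reported with the same code.
  if (id === null || judge.target_evaluation_chain_id !== id) {
    return "validation.pattern_c_chain_id_mismatch";
  }
  return null;
}
```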
#### B4 · [GAP] [HIGH] · FD V1 §6.2 + V3.3.1 §5.18 — `on_indeterminate` is missing from FeedbackRoutingPolicy
**Finding:** `EvaluationDecision.verdict` has four values: `"passed" | "failed" | "indeterminate" | "not_applicable"` (FD V1 §2.1). `FeedbackRoutingPolicy` has branches for `on_satisfied`, `on_needs_revision`, `on_needs_more_sources`, `on_needs_source_verification`, `on_needs_format_repair`, `on_repeated_failure` — but no `on_indeterminate` and no `on_not_applicable`. V3.3.1 has 5 `OutcomeEvaluationState` values mapping to `indeterminate` (`needs_information`, `needs_verification`, `needs_human_judgment`, `unable_to_evaluate`, `blocked_by_policy`), so `indeterminate` is not rare.
**Why it matters:** When an evaluator returns `indeterminate`, the policy router has no branch to fire. Implementations will either pick a default silently (forum post? pause?) or hit a no-route condition and stall. This is precisely the kind of edge case the spec warns about ("trail off" semantics for indeterminate per the prompt §4.10).
**Recommendation:** Add explicit branches:
```
on_indeterminate:
  | "pause_for_human"
  | "post_to_forum"
  | "ask_task_agent_for_process_assessment"
  | "continue_with_warning"
on_not_applicable:
  | "continue"
  | "log_only"
```
The five indeterminate sub-states map to `on_indeterminate` unless the policy declares finer-grained handling. Add `on_unrecoverable` similarly (V3.3.1 maps `unrecoverable` to `failed` for the envelope, but the routing implications differ — repeated `unrecoverable` should not get the same router treatment as a fixable `needs_revision`).
**Reference:** FD V1.0.1 §6.2, §2.1; V3.3.1 §0.4.1; Common Contracts §3.2.
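To show how a router might consume the two proposed branches, here is a small sketch. The verdict values and the branch option strings are taken from the finding and recommendation; `RoutingPolicySketch`, `routeVerdict`, and the `"use_existing_branch"` placeholder (standing in for the policy's existing `on_satisfied` and failure branches) are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch of routing with the two added branches.
type Verdict = "passed" | "failed" | "indeterminate" | "not_applicable";

type OnIndeterminate =
  | "pause_for_human"
  | "post_to_forum"
  | "ask_task_agent_for_process_assessment"
  | "continue_with_warning";

type OnNotApplicable = "continue" | "log_only";

interface RoutingPolicySketch {
  on_indeterminate: OnIndeterminate;
  on_not_applicable: OnNotApplicable;
}

function routeVerdict(verdict: Verdict, policy: RoutingPolicySketch): string {
  if (verdict === "indeterminate") return policy.on_indeterminate;
  if (verdict === "not_applicable") return policy.on_not_applicable;
  // "passed" and "failed" keep whatever branches the policy already defines.
  return "use_existing_branch";
}
```

With both branches required on the policy type, a policy that omits indeterminate handling fails at compile time instead of stalling at run time.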
#### B5 · [BUG] [HIGH] · V3.3.1 §5.17.7 vs Common Contracts §1.2 — `ClaimSetBundle` and `ExtractedEvaluationUnit` ownership conflict
**Finding:** V3.3.1 §5.17.7 says "DOC23 Evaluation Common Contracts — `ExtractedEvaluationUnit` union, `ClaimSetBundle` schema, `ArtifactScopeRef` source-anchor primitives." Common Contracts V1.1.1 §1.2 explicitly says **out of scope**: "The Claim Extractor module schema or its 22-type unit registry (lives in Addenda A R4.1 V3)." Common Contracts §6.1 references "ClaimType[] (from Addenda A 22-type registry)" — pointing at Addenda A as owner. Grep confirms Common Contracts does NOT define `ClaimSetBundle` or `ExtractedEvaluationUnitBundle` anywhere.
**Why it matters:** Two of the three primitives that V3.3.1 §5.17.7 places in Common Contracts aren't there. A coding agent implementing the `claims_in` port (typed for `"ClaimSetBundle" | "ExtractedEvaluationUnitBundle"`) cannot find these schemas in the document V3.3.1 points to.
**Recommendation:** Update V3.3.1 §5.17.7 cross-references to point `ExtractedEvaluationUnit` and `ClaimSetBundle` to Addenda A R4.1 V3 (their actual owner), keeping only `ArtifactScopeRef` cross-referenced to Common Contracts §7. Add an OP-A row tracking the import; if Addenda A doesn't expose them yet as public contracts, mark the obligation `pending`.
**Reference:** V3.3.1 §5.17.7; Common Contracts V1.1.1 §1.2, §6.1, §6.5.
#### B6 · [BUG] [HIGH] · Common Contracts §4.2 — `ResearchNeed` and `EvaluationAffirmation` are not where the doc says they are
**Finding:** Common Contracts §4.2 (QualitativeSlice) says: "The full `EvaluationFinding`, `OutcomeRepairInstruction`, `ResearchNeed`, and `EvaluationAffirmation` schemas live in Addenda B Core R0.7.1 (their owning addendum)." This is wrong on three counts: (1) `ResearchNeed` is defined in **Source Workspace V1.0.1 §6.2**, NOT in Core (grep of Core for `EvaluationAffirmation`/`ResearchNeed` returns zero hits). (2) `EvaluationAffirmation` is not defined anywhere: Core does not contain the term, and no schema appears in V3.3.1, Source Workspace, or Run Board. (3) `OutcomeRepairInstruction` is defined in **FD V1.0.1 §5.2**, NOT Core. Of the four, only `EvaluationFinding` is defined in V3.3.1 (and per B1 it conflicts with a second definition in FD V1).
**Why it matters:** A coding agent implementing the QualitativeEvaluationSlice fills four fields whose owning documents are misidentified. `EvaluationAffirmation` does not exist as a schema at all — it's a phantom type. Implementations will either skip the field, invent a structure, or copy from FD V1 §3 (which has its own bugs).
**Recommendation:** Fix Common Contracts §4.2 to point each schema to its actual owner: `EvaluationFinding` → consolidated location per B1; `OutcomeRepairInstruction` → FD V1.0.1 §5.2; `ResearchNeed` → Source Workspace V1.0.1 §6.2. Either define `EvaluationAffirmation` somewhere or remove it from the slice. If kept, specify what it carries (presumably "what the artifact got right" — the positive counterpart to a finding) and where it lives. *(This phantom is also a lead-in to the cross-reviewer "flawless-execution denominator" finding in §6.)*
**Reference:** Common Contracts V1.1.1 §4.2; Source Workspace V1.0.1 §6.2; FD V1.0.1 §5.2.
#### B7 · [RISK] [HIGH] · V3.3.1 §11.21 — Revalidation cascade has no convergence proof
**Finding:** §11.21 Phase 4 says: "If revalidation produces regression (outcome was satisfied, now needs_revision): Trigger Revisor activation for the regressed outcome. Plan may use fork_from_checkpoint to address regression." No bound is specified on cascade depth. If outcome A's revision causes B to regress, B's revision can cause A to re-regress (because the dependency declaration goes both ways via `OutcomeDependencySpec.invalidated_by_outcomes` traversal). The `per_outcome_retry_budget` is per-outcome but cascades cross outcomes; the only stated bound is `per_plan_max_replans` which is at the *plan* level, not the cascade level.
**Why it matters:** Two outcomes with bidirectional declared dependencies can ping-pong revisions indefinitely. Each cycle consumes logical budget on a different outcome, so per-outcome budget never trips. `consecutive_insufficient_limit` (§6.7.3) detects insufficient plans for the same outcome, not regression chains across outcomes. The cascade can trap the task in an endless revision loop (a livelock) without ever firing a budget exhaustion.
**Recommendation:** Add `RevisorConfig.max_revalidation_cascade_depth: number` (default 5) measured from the originating mutation receipt. The Loop Controller tracks cascade chains; when a regression triggers revision and that revision triggers another regression on an outcome already in the chain, the chain is "circular" and aborts with `validation.revalidation_cascade_loop`. Add the validation code to V3.3.1 §22. Surface the cycle to the user as a Hard Call (`HardRevisionCallKind = "revalidation_cycle"` — new value) so the user can break the tie.
**Reference:** V3.3.1 §11.21; §6.7.3; §6.14 RevisorConfig.
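The loop check described in the recommendation can be sketched as a pure function over the chain of outcomes regressed so far. `validation.revalidation_cascade_loop` and the default depth of 5 come from the recommendation; the `CascadeChain` shape, the `onRegression` name, and the separate depth-exceeded code are assumptions made for this example.

```typescript
// Hypothetical sketch: chain shape and the depth-exceeded code are assumptions.
interface CascadeChain {
  origin_receipt_id: string;   // the originating mutation receipt
  outcomes_in_chain: string[]; // outcome IDs regressed so far, in order
}

type CascadeDecision =
  | { kind: "continue"; chain: CascadeChain }
  | { kind: "abort"; code: string };

function onRegression(
  chain: CascadeChain,
  regressedOutcomeId: string,
  maxDepth: number
): CascadeDecision {
  // Revisiting an outcome already in the chain means the cascade is circular.
  if (chain.outcomes_in_chain.indexOf(regressedOutcomeId) !== -1) {
    return { kind: "abort", code: "validation.revalidation_cascade_loop" };
  }
  // Assumed companion code for a long but non-circular chain.
  if (chain.outcomes_in_chain.length >= maxDepth) {
    return { kind: "abort", code: "validation.revalidation_cascade_depth_exceeded" };
  }
  return {
    kind: "continue",
    chain: {
      origin_receipt_id: chain.origin_receipt_id,
      outcomes_in_chain: chain.outcomes_in_chain.concat([regressedOutcomeId]),
    },
  };
}
```

Because the chain is keyed to the originating receipt rather than to any single outcome, the A-to-B-to-A ping-pong in the finding is caught on the third regression even though neither outcome's own retry budget has tripped.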
#### B8 · [BUG] [HIGH] · V3.3.1 §11.21.2 + §5.14.1 — `upstream_failure_cascade` rule duplicated with slight divergence
**Finding:** The upstream-failure rule is specified twice. §5.14.1 and §11.21.2 currently agree on the trigger conditions (`execution_status in {could_not_fix, failed_runtime, rejected_capability}` AND `retry_count >= per_outcome_retry_budget`; both restrict the cascade target to outcomes in `pending_dependency`). But they live in different sections that a maintainer might update independently, and neither handles the case where an outcome is in `evaluating` (not `pending_dependency`) but is *about to* depend on the missing artifact — a race where the outcome becomes `pending_dependency` *after* the cascade fires.
**Why it matters:** The set is internally consistent here today, but duplication invites drift, and the race leaves an outcome hanging on an artifact that will never arrive.
**Recommendation:** Consolidate the rule in one section (§11.21.2, the named cascade); §5.14.1 links to it rather than restating. Add handling: when an outcome transitions to `pending_dependency` after a cascade has already fired on the upstream module, the new outcome is auto-evaluated against the `upstream_failure` set as part of its state-entry guard, not via a new cascade pass. Document the race in §11.21.
**Reference:** V3.3.1 §5.14.1; §11.21.2.
#### B9 · [RISK] [MEDIUM] · V3.3.1 §11.22 + §11.20.2 — Parallel batches racing the live-edit rolling hash
**Finding:** §11.22 allows parallel step execution up to `max_parallel_steps_per_plan: 4`. §11.20.2 Rolling Hash Mode B requires "Step N+1 validates against predicted hash from Step N output." But two parallel steps in the same batch with rolling-hash semantics cannot both validate against the same base; they need to be serialized. §11.20.2 mentions rolling hash is "available only when no concurrent plans target the artifact" — but does NOT say "no concurrent *steps* within the same plan."
**Why it matters:** A coding agent reading §11.22 + §11.20.2 will allow rolling-hash plans with parallel steps targeting the same artifact, producing nondeterministic hash chains. Sometimes the chain validates; sometimes it doesn't depending on completion order. The bug surfaces as flaky `validation.rolling_hash_chain_broken` in test and corrupted artifacts in production.
**Recommendation:** Add to §11.20.2: "Rolling hash mode B requires sequential step execution across all steps that mutate the same artifact. Parallelism within a plan is allowed only between steps targeting disjoint artifacts." Add `validation.rolling_hash_parallel_steps_same_artifact`. Document in §11.22 that parallel batches automatically degrade to sequential when any step is rolling-hash mode B.
**Reference:** V3.3.1 §11.22.1, §11.22.2; §11.20.2, §11.20.3.
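The check recommended above can be sketched as a plan-level validator. This is a minimal sketch, assuming a hypothetical `PlanStep` shape (`step_id`, `batch_index`, `target_artifact_ref`, `hash_mode` are illustrative names, not spec fields); only the validation code string comes from the recommendation.

```typescript
// Hypothetical step shape; the field names are illustrative, not taken from V3.3.1.
interface PlanStep {
  step_id: string;
  batch_index: number; // assumption: steps sharing a batch_index run in parallel
  target_artifact_ref: string;
  hash_mode: "strict" | "rolling_b";
}

interface PlanValidationIssue {
  code: string;
  step_ids: string[];
}

// Flags every parallel batch in which two or more steps target the same
// artifact and at least one of them uses rolling hash mode B.
function checkRollingHashParallelism(steps: PlanStep[]): PlanValidationIssue[] {
  const groups = new Map<string, PlanStep[]>();
  for (const step of steps) {
    const key = JSON.stringify([step.batch_index, step.target_artifact_ref]);
    const group = groups.get(key) ?? [];
    group.push(step);
    groups.set(key, group);
  }
  const issues: PlanValidationIssue[] = [];
  groups.forEach((group) => {
    if (group.length > 1 && group.some((s) => s.hash_mode === "rolling_b")) {
      issues.push({
        code: "validation.rolling_hash_parallel_steps_same_artifact",
        step_ids: group.map((s) => s.step_id),
      });
    }
  });
  return issues;
}
```

A scheduler could run this before dispatching a batch and fall back to sequential execution for the flagged steps, which is the degradation the recommendation describes for §11.22.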
#### B10 · [GAP] [MEDIUM] · V3.3.1 §8.4 + FD V1 §7 — Sub-agent dispatch has no "no sub-agent available" fallback for the evaluator point
**Finding:** §8.4 `AdvisorySubAgentProfile.allowed_coordination_points: ["outcome_compiler" | "evaluator" | "revision_compiler" | "feedback_interpreter"]`. Three of four are well-handled (outcome_compiler / revision_compiler / feedback_interpreter have explicit invoke + accept/reject/defer semantics). But for `evaluator`, no spec text describes what happens when an evaluator coordination point has no sub-agent registered for the relevant `AdvisorySubAgentOutput` variant. Run Board V1 §7.3 allows fan-out to `target: "task_agent"`, but task-agent fallback for an evaluator-stage sub-agent gap isn't specified.
**Why it matters:** A coding agent implementing the Evaluator can ship without sub-agent support and call it conformant by silence. The "no sub-agent available" path is exactly the most important fallback to specify: every implementation hits it before any sub-agent is registered.
**Recommendation:** Add V3.3.1 §8.X "No-sub-agent fallback per coordination point" listing, for each of the four allowed points, the deterministic behavior when no compatible sub-agent is available. For evaluator: proceed with the default specialist-subevaluator path; emit a `quality_signal` with `signal_kind = "sub_agent_unavailable_evaluator"`, `actionability = "metric_only"`; do not emit a Hard Call (sub-agent absence is not a strategic gap). For revision_compiler: proceed with single-Compiler reasoning; emit signal.
**Reference:** V3.3.1 §8.4; §3.4; §6.1.5; Run Board V1 §7.3.
#### B11 · [GAP] [MEDIUM] · FD V1 §8 — DOC23/DOC15/DOC24 boundary leaks at the `instruction_in` overload
**Finding:** §8.4 says "Existing ports carry feedback today: `instruction_in` — Repair Instructions and structured directives; `context_in` — Feedback bundle as context; `data_in` — Research needs as input data." But `instruction_in` is a general DOC23 R3.1 port. Overloading it to carry both ordinary instructions AND typed `OutcomeRepairInstruction` payloads means receiving modules must runtime-discriminate the payload — and the spec gives no discriminator field.
**Why it matters:** A module's `instruction_in` consumer cannot know whether it's getting an OutcomeRepairInstruction (structured: `target_scope_refs`, `preservation_constraints`, `suggested_route`) or a free-form instruction string. DOC15/CIL prompt assembly renders the wired input but doesn't know the type to render. The typed ports (`repair_instruction_in` etc.) are listed as "ergonomics, not required for V1" — so all V1 implementations go through the overload and each solves discrimination differently.
**Recommendation:** Either (a) elevate the typed ports (`feedback_in`, `repair_instruction_in`, `run_guidance_in`, `source_need_in`) to **required** for V1, with DOC23 R3.1's port registry insert (§10.3) listing them, OR (b) define a discriminator field in the payload union that DOC15/CIL uses to dispatch rendering. Don't ship the overload.
**Reference:** FD V1.0.1 §8.4, §10.3.
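Option (b) can be illustrated with a discriminated union. This is a sketch under stated assumptions: the discriminator name `payload_kind` and the `PlainInstruction` type are hypothetical, and the field types on `OutcomeRepairInstruction` are guesses; only the three field names come from the finding.

```typescript
// Hypothetical discriminator field `payload_kind`; the spec defines none today.
interface PlainInstruction {
  payload_kind: "plain_instruction";
  text: string;
}

// Field names from the finding; the string types are assumptions.
interface OutcomeRepairInstruction {
  payload_kind: "outcome_repair_instruction";
  target_scope_refs: string[];
  preservation_constraints: string[];
  suggested_route: string;
}

type InstructionInPayload = PlainInstruction | OutcomeRepairInstruction;

// The renderer dispatches on the discriminator instead of guessing the shape.
function renderInstruction(payload: InstructionInPayload): string {
  switch (payload.payload_kind) {
    case "plain_instruction":
      return payload.text;
    case "outcome_repair_instruction":
      return [
        `Repair scope: ${payload.target_scope_refs.join(", ")}`,
        `Preserve: ${payload.preservation_constraints.join(", ")}`,
        `Suggested route: ${payload.suggested_route}`,
      ].join("\n");
  }
}
```

With a discriminator in the union, a new payload variant becomes a compile-time error in every consumer that switches on it, which is the property the untyped overload lacks.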
#### B12 · [RISK] [MEDIUM] · Run Board V1 §1.1 + §5.3 — Passive board "auto-publishes every event" is unbounded
**Finding:** §1.1: "Every task run has a passive Run Board... every system event auto-publishes." §1.2 lists ~12 event categories. §5.3: "Posts are append-only; MUST NOT be modified after creation." No retention, deduplication, throttling, or coalescing policy is specified.
**Why it matters:** A long-running task with frequent evaluator activations and high-cardinality artifact updates can produce thousands of posts. With no coalescing, the BoardDigest (§6.2, default `max_digest_tokens: 1200`) truncates non-deterministically or summarizes lossily. A debugging-grade audit feed is great; for the UI feed it's noise.
**Recommendation:** Add §1.5 "Coalescing and throttling policy": (a) cost/duration events coalesce to per-module summaries every N seconds; (b) artifact_reference events deduplicate by `artifact_ref + version_ref`; (c) the digest selection rule prefers terminal/blocking events over status events. Retention follows EC Core policy per matter class (Source Workspace §9.5 precedent). UI default filter excludes status_update; user opts in.
**Reference:** Run Board V1.0.1 §1.1, §1.2, §5.3, §6.2.
### Part C — Defect-hunt findings (§5)
#### C1 · [BUG] [HIGH] · Common Contracts §3.7 — Pattern C `route_recommendation` resolution is procedural, not specified
**Finding:** §3.7 says "Judge's quantitative recommendation governs when Pattern C is wired, since Judge is the more recent producer in the chain." But where is this resolved? In the Loop Controller? The Switch module? "Resolution is by consumer policy" — but no consumer policy schema exists. A Switch wired to both upstream Evaluator and downstream Judge envelopes via Pattern C has no rule to decide which `route_recommendation` to obey.
**Why it matters:** Pattern C is sold as a killer feature (§5.18.1). If routing is "consumer policy" without a schema, every consumer invents its own.
**Recommendation:** Define `PatternCRouteResolutionPolicy` in Common Contracts §3.7 (or V3.3.1 §5.18.7): "When a Switch or Loop Controller reads multiple envelopes sharing a `target_evaluation_chain_id`, it resolves the effective `route_recommendation` as the recommendation of the envelope with the largest `producer_activation_seq` (most recent). When recommendations conflict at the same seq, the consumer policy declares precedence (Judge > Evaluator > deterministic_scorer, or explicit override)." Add a validation code for missing policy.
**Reference:** Common Contracts V1.1.1 §3.7; V3.3.1 §5.18.7.
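The proposed resolution rule is small enough to sketch directly. The envelope shape below is an assumption: `producer_kind` is a hypothetical field, and the default precedence list simply encodes the "Judge > Evaluator > deterministic_scorer" ordering from the recommendation.

```typescript
type ProducerKind = "judge" | "evaluator" | "deterministic_scorer";

// Hypothetical envelope subset; `producer_kind` is not a confirmed spec field.
interface RouteEnvelope {
  target_evaluation_chain_id: string;
  producer_kind: ProducerKind;
  producer_activation_seq: number;
  route_recommendation: string;
}

const DEFAULT_PRECEDENCE: ProducerKind[] = ["judge", "evaluator", "deterministic_scorer"];

// Kinds missing from the precedence list rank last.
function precedenceRank(kind: ProducerKind, precedence: ProducerKind[]): number {
  const idx = precedence.indexOf(kind);
  return idx === -1 ? precedence.length : idx;
}

// Most recent activation wins; ties at the same seq fall back to precedence.
function resolveRoute(
  envelopes: RouteEnvelope[],
  chainId: string,
  precedence: ProducerKind[] = DEFAULT_PRECEDENCE,
): string | undefined {
  const candidates = envelopes.filter((e) => e.target_evaluation_chain_id === chainId);
  if (candidates.length === 0) return undefined;
  const maxSeq = Math.max(...candidates.map((e) => e.producer_activation_seq));
  const latest = candidates
    .filter((e) => e.producer_activation_seq === maxSeq)
    .sort((a, b) => precedenceRank(a.producer_kind, precedence) - precedenceRank(b.producer_kind, precedence));
  return latest[0]?.route_recommendation;
}
```

Returning `undefined` when no envelope matches leaves room for the "missing policy" validation code the recommendation asks for, rather than silently picking a default route.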
#### C2 · [BUG] [HIGH] · V3.3.1 §0.4.4 vs §5.7 + §5.7.1 — `FindingState` has a pass-through state with no negative exit
**Finding:** §0.4.4 enumerates `FindingState`: `proposed | active | contested | resolved | superseded_by_revision | superseded_by_source_change | user_approved | tool_verified | human_verified | rejected_by_user | dismissed | unrecoverable`. The §5.7.1 transition table lists `proposed → active` as the only outbound transition from `proposed`. There is no `proposed → dismissed` transition for a proposed finding the Evaluator decides NOT to confirm.
**Why it matters:** What happens to a `proposed` finding that's never confirmed? Implementations will leave findings in `proposed` indefinitely or invent a transition the table doesn't list.
**Recommendation:** Add `proposed → dismissed` to §5.7.1 with a typed predicate ("Evaluator did not confirm before activation completion"). Specify that a finding in `proposed` at activation termination auto-transitions to `dismissed` with `dismissal_reason = "not_confirmed_at_termination"`.
**Reference:** V3.3.1 §0.4.4; §5.7.1.
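The auto-transition at activation termination could look like the following sketch. The `Finding` shape and the function name are illustrative; the state values are the §0.4.4 list quoted in the finding, and the `dismissal_reason` value is the one proposed above.

```typescript
type FindingState =
  | "proposed" | "active" | "contested" | "resolved"
  | "superseded_by_revision" | "superseded_by_source_change"
  | "user_approved" | "tool_verified" | "human_verified"
  | "rejected_by_user" | "dismissed" | "unrecoverable";

// Hypothetical minimal record; real findings carry many more fields.
interface Finding {
  finding_id: string;
  state: FindingState;
  dismissal_reason?: string;
}

// Returns new records; findings still `proposed` at termination are dismissed.
function closeOutProposedFindings(findings: Finding[]): Finding[] {
  return findings.map((f): Finding =>
    f.state === "proposed"
      ? { ...f, state: "dismissed", dismissal_reason: "not_confirmed_at_termination" }
      : f,
  );
}
```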
#### C3 · [BUG] [HIGH] · V3.3.1 §6.5.2 + §7.9 — `HardRevisionCall.options` may be empty, but spec requires bounded options
**Finding:** §6.5.1 says detection produces a Hard Call "with bounded `HumanDecisionOption[]`." §7.9.1's schema has `options: HumanDecisionOption[]` with no non-empty constraint. If the Compiler cannot enumerate options (e.g., `human_preference_needed` with unbounded alternatives), the schema allows empty `options`.
**Why it matters:** Empty `options[]` makes the §21.4 UI surface non-functional — no buttons. The user cannot resolve; the Dispatcher stays in `waiting_hard_call` indefinitely (§6.5.3). The `default_if_no_response` fallback's trigger ("no response") is itself unspecified.
**Recommendation:** Constrain `options[]` to `MIN_LENGTH = 2` (e.g., "Accept" and "Reject" always present). For Hard Calls where the Compiler cannot enumerate substantive options, default to `{"continue_with_compiler_proposal", "pause_for_my_input"}`. Add `validation.hard_call_options_empty`. Specify the timeout that triggers `default_if_no_response`.
**Reference:** V3.3.1 §6.5.1, §6.5.3, §7.9.1, §21.4.
#### C4 · [GAP] [HIGH] · V3.3.1 §11.21.1 — Outcome dependency direction is undefined for cycles
**Finding:** §11.21.1 says cascading is determined by declared `OutcomeDependencySpec` and `EvaluationTargetClosurePolicy` (§5.11). Neither says what happens when `OutcomeDependencySpec` is bidirectional (A depends on B and B depends on A). The closure traversal will hit a cycle.
**Why it matters:** "Apply EvaluationTargetClosurePolicy to ensure closure" will not terminate if cycles aren't detected. Even with a visited-set guard, the *order* the cycle is evaluated in is undefined, and the user can't express the resolution.
**Recommendation:** Add to §5.11: cycle detection during closure with `validation.outcome_dependency_cycle_detected` (warning, not error — cycles are allowed; spec just needs an order). Specify cycle-evaluation order: topological order among non-cyclic outcomes, then cyclic outcomes by `outcome_priority` descending (require `outcome_priority` to be declared for cyclic outcomes).
**Reference:** V3.3.1 §11.21.1; §5.11.
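One way to realize the proposed ordering is Kahn's algorithm followed by a priority sort of whatever it cannot place. This is a sketch, not the spec's procedure: the `OutcomeNode` shape and `depends_on` field are assumptions, and the leftover set includes outcomes that merely depend on a cycle as well as the cycle members themselves, which the recommendation does not distinguish.

```typescript
// Hypothetical node shape; `depends_on` lists the outcome_ids this outcome needs first.
interface OutcomeNode {
  outcome_id: string;
  depends_on: string[];
  outcome_priority?: number;
}

interface EvaluationOrder {
  order: string[];
  warnings: string[];
}

function orderOutcomes(nodes: OutcomeNode[]): EvaluationOrder {
  const known = new Set(nodes.map((n) => n.outcome_id));
  const remaining = new Map<string, number>(); // unmet dependency count per outcome
  const dependents = new Map<string, string[]>(); // reverse edges
  nodes.forEach((n) => {
    const deps = n.depends_on.filter((d) => known.has(d));
    remaining.set(n.outcome_id, deps.length);
    deps.forEach((d) => {
      const list = dependents.get(d) ?? [];
      list.push(n.outcome_id);
      dependents.set(d, list);
    });
  });

  // Kahn's algorithm: repeatedly emit outcomes with no unmet dependencies.
  const ready = nodes.filter((n) => remaining.get(n.outcome_id) === 0).map((n) => n.outcome_id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    (dependents.get(id) ?? []).forEach((dep) => {
      const left = (remaining.get(dep) ?? 0) - 1;
      remaining.set(dep, left);
      if (left === 0) ready.push(dep);
    });
  }

  // Anything unplaced is in, or downstream of, a cycle: order by priority descending.
  const placed = new Set(order);
  const leftover = nodes
    .filter((n) => !placed.has(n.outcome_id))
    .sort((a, b) => (b.outcome_priority ?? 0) - (a.outcome_priority ?? 0));
  const warnings = leftover.length > 0 ? ["validation.outcome_dependency_cycle_detected"] : [];
  return { order: order.concat(leftover.map((n) => n.outcome_id)), warnings };
}
```

Treating a missing `outcome_priority` as 0 is a placeholder choice here; the recommendation would instead require the field for cyclic outcomes.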
#### C5 · [BUG] [MEDIUM] · V3.3.1 §6.16 + §13.3 — `LearningMode` enum is referenced but never enumerated
**Finding:** §6.14 `RevisorConfig.learning_mode: LearningMode // default "production"; see §6.16`. §0.4 says the V3.2 inventory was "extended with `LearningMode`, `ModelClass`, `CrossModelApplicability`" but the `LearningMode` values are not enumerated in §0.4 where they should be. The enum is used as `"production"` in the default and as `"signal_generation"` in the Common Contracts §5.4 gating, but the complete value set is not declared in the canonical enum inventory.
**Why it matters:** The signal envelope gates cheap-model learning on `model_class` AND `learning_mode = "signal_generation"`. Implementations need the full value set (presumably `production`, `signal_generation`, `cross_calibration`).
**Recommendation:** Explicitly enumerate `LearningMode` in V3.3.1 §0.4 alongside the other enums; cross-reference §6.16 implementation. Verify all referenced enum values from §6.16 onward exist in §0.4. *(Touches the paused learning surface; flag-only.)*
**Reference:** V3.3.1 §6.14, §6.16; Common Contracts §5.4.
#### C6 · [BUG] [MEDIUM] · Common Contracts §6.4 vs §9.4 — `unanchored_llm_judgment` acknowledgment has no field to write into
**Finding:** §6.4: `unanchored_llm_judgment` is NOT aggregation-eligible by default; Judge's `OutcomeComplianceScoringConfig` can override with an audit flag. §9.4: `scoring_basis = "unanchored_llm_judgment"` AND `required = true` requires explicit user acknowledgment. But no schema has a field to store that acknowledgment.
**Why it matters:** The warning has no acknowledgment field to write into. Implementations either fire the warning every time (loud) or store ack in an undocumented field.
**Recommendation:** Add `Criterion.unanchored_aggregation_acknowledged_by_user: boolean` (with `user_ref` + `acknowledged_at` for audit). The warning fires when `scoring_basis == "unanchored_llm_judgment" && required == true && unanchored_aggregation_acknowledged_by_user == false`; once acknowledged, it silences.
**Reference:** Common Contracts §6.4, §9.4.
#### C7 · [GAP] [MEDIUM] · Common Contracts §7.1 + §9.5 — `ArtifactScopeRef` structured anchor can be validly empty
**Finding:** §9.5: "`ArtifactScopeRef.anchor` null is allowed only when `scope_kind = "document"`." So a `field` scope requires a non-null anchor. But `StructuredAnchor` has all-optional sub-fields (`section_id?`, `field_path?`, `citation_ref?`). A `StructuredAnchor {}` is non-null but carries no anchor information, passing §9.5 while being useless.
**Why it matters:** A coding agent can construct `ArtifactScopeRef { scope_kind: "field", anchor: StructuredAnchor {} }` that validates but cannot be resolved. Findings or repair instructions referencing this scope are un-locatable.
**Recommendation:** Strengthen §9.5: `StructuredAnchor` must have at least one of `section_id`, `field_path`, `citation_ref` populated. Add `validation.structured_anchor_empty`.
**Reference:** Common Contracts V1.1.1 §7.1, §7.3, §9.5.
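The strengthened §9.5 rule reduces to a short validator. A minimal sketch, assuming the three optional sub-fields are strings; `validation.artifact_scope_anchor_missing` is a hypothetical code used here only to cover the existing null-anchor rule.

```typescript
interface StructuredAnchor {
  section_id?: string;
  field_path?: string;
  citation_ref?: string;
}

interface ArtifactScopeRef {
  scope_kind: string;
  anchor: StructuredAnchor | null;
}

function validateScopeRef(ref: ArtifactScopeRef): string[] {
  if (ref.anchor === null) {
    // Existing rule: a null anchor is allowed only for document scope.
    // The code name below is hypothetical.
    return ref.scope_kind === "document" ? [] : ["validation.artifact_scope_anchor_missing"];
  }
  // Proposed rule: a non-null anchor must populate at least one sub-field.
  const populated = [ref.anchor.section_id, ref.anchor.field_path, ref.anchor.citation_ref]
    .some((v) => typeof v === "string" && v.trim().length > 0);
  return populated ? [] : ["validation.structured_anchor_empty"];
}
```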
#### C8 · [BUG] [MEDIUM] · Source Workspace §4.1 + V3.3.1 §15.10 — `SourceRecord.taint_class` "inherited from source kind" is unspecified
**Finding:** Source Workspace §4.1: `SourceRecord.taint_class: TaintClass // inherited from source kind / retrieval method`. But no `source_kind → taint_class` mapping exists. The 15 `source_kind` values don't map to any of the 8 taint classes anywhere.
**Why it matters:** Two implementations of `step.source_research` pick different defaults. A PACER `case_law` source might be `external_authority_trusted` in one impl and `external_untrusted` in another. Pattern C Judges that gate sandboxed-vs-trusted handling on the taint class produce inconsistent results.
**Recommendation:** Add Source Workspace §4.1A "Default taint_class per source_kind": `document/email/file → user_trusted_bounded`; `web_source → external_untrusted`; `api_result/database_record → external_authority_trusted`; `library_entry → internal_corpus_trusted`; `case_law/statute/regulation → external_authority_trusted`; `prior_task_output → user_trusted_bounded`. Allow `taint_class_override` per query.
**Reference:** Source Workspace V1.0.1 §4.1; V3.3.1 §15.10, §15.10.1.
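The proposed default table can be written as data plus a resolver. This sketch covers only the source kinds and taint classes named in the recommendation (the spec has 15 kinds and 8 classes); the function name is illustrative, and returning `undefined` for an unmapped kind is an assumption that forces the caller to decide rather than guess.

```typescript
// Only the four classes named in the recommendation; the full enum has eight.
type TaintClass =
  | "user_trusted_bounded"
  | "external_untrusted"
  | "external_authority_trusted"
  | "internal_corpus_trusted";

const DEFAULT_TAINT_BY_SOURCE_KIND = new Map<string, TaintClass>([
  ["document", "user_trusted_bounded"],
  ["email", "user_trusted_bounded"],
  ["file", "user_trusted_bounded"],
  ["web_source", "external_untrusted"],
  ["api_result", "external_authority_trusted"],
  ["database_record", "external_authority_trusted"],
  ["library_entry", "internal_corpus_trusted"],
  ["case_law", "external_authority_trusted"],
  ["statute", "external_authority_trusted"],
  ["regulation", "external_authority_trusted"],
  ["prior_task_output", "user_trusted_bounded"],
]);

// A per-query override, when present, takes precedence over the default.
function resolveTaintClass(sourceKind: string, override?: TaintClass): TaintClass | undefined {
  return override ?? DEFAULT_TAINT_BY_SOURCE_KIND.get(sourceKind);
}
```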
#### C9 · [GAP] [MEDIUM] · Run Board V1 §3.1 + §6.2 — BoardDigest filter rule is unspecified
**Finding:** §6.2 BoardDigest carries `included_post_ids` etc., but the **filter rule** that selects `included_post_ids` from all forum posts is not specified. `BoardDigestPolicy` (§3.2) has include-flags but no `include_post_kinds`, no severity threshold, no max-count, no selection strategy.
**Why it matters:** A 500-post forum cannot ship all posts in a 1200-token digest. Without selection rules, implementations pick differently (newest 20? highest-severity? tagged?), producing divergent downstream behavior.
**Recommendation:** Extend `BoardDigestPolicy` with `included_post_kinds: TaskRunBoardPostKind[]`, `included_severity_threshold`, `max_posts: number`, `selection_strategy: "recency" | "severity" | "score" | "mixed"`. Default: include `evaluation_finding`, `repair_instruction`, `process_gap`, `user_guidance`; severity ≥ medium; max 30; mixed (50% recency, 50% severity).
**Reference:** Run Board V1.0.1 §3.1, §3.2, §6.2.
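A deterministic reading of the "mixed" strategy might look like this sketch: filter by kind and severity, give half the slots to the most recent posts, then fill the rest by severity. The `BoardPost` shape, the `created_seq` field, and the four-level severity scale are assumptions; the policy field names follow the recommendation.

```typescript
type Severity = "low" | "medium" | "high" | "critical"; // assumed scale
const SEVERITY_RANK: Record<Severity, number> = { low: 0, medium: 1, high: 2, critical: 3 };

// Hypothetical post subset; `created_seq` stands in for any monotonic ordering key.
interface BoardPost {
  post_id: string;
  post_kind: string;
  severity: Severity;
  created_seq: number;
}

interface DigestSelectionPolicy {
  included_post_kinds: string[];
  included_severity_threshold: Severity;
  max_posts: number;
}

function selectDigestPosts(posts: BoardPost[], policy: DigestSelectionPolicy): string[] {
  const eligible = posts.filter(
    (p) =>
      policy.included_post_kinds.indexOf(p.post_kind) !== -1 &&
      SEVERITY_RANK[p.severity] >= SEVERITY_RANK[policy.included_severity_threshold],
  );
  // Half of the slots (rounded up) go to the most recent eligible posts.
  const byRecency = eligible.slice().sort((a, b) => b.created_seq - a.created_seq);
  const chosen = byRecency.slice(0, Math.ceil(policy.max_posts / 2));
  const chosenIds = new Set(chosen.map((p) => p.post_id));
  // Remaining slots go to the most severe posts not yet chosen (recency breaks ties).
  const bySeverity = eligible
    .filter((p) => !chosenIds.has(p.post_id))
    .sort((a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity] || b.created_seq - a.created_seq);
  return chosen.concat(bySeverity.slice(0, policy.max_posts - chosen.length)).map((p) => p.post_id);
}
```

Because every sort has a total tie-break, two implementations given the same posts and policy would select the same `included_post_ids`, which is the property the finding says is missing.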
#### C10 · [BUG] [MEDIUM] · V3.3.1 §11.6 + §0.4.7 — `RevisionOperationKind = "hard_call_resolved"` has no producer
**Finding:** §0.4.7 `RevisionOperationKind` includes `hard_call_resolved`. §7.9 defines `HardCallResolution` persisted to the ledger. §11.6 lists operation kinds but doesn't say which actor emits an operation with `operation_kind = "hard_call_resolved"` (Dispatcher? UI? Revisor?).
**Why it matters:** Operation receipts feed RepairCycleSignal (Core §9.0.2). If the actor is ambiguous, the receipt is missing or duplicated, breaking the `hard_call_resolved → revision_operation_receipt_ref` chain in `RevisorActionRecord`.
**Recommendation:** Add §7.9.4: "When a HardCallResolution is recorded, the Dispatcher emits a `RevisionOperationReceipt` with `operation_kind = "hard_call_resolved"` and `hard_call_ref` pointing to the resolved Hard Call. The receipt's `actor_ref` is the Dispatcher's runtime identity; the resolution records `resolved_by: UserRef` separately."
**Reference:** V3.3.1 §0.4.7; §7.9; §11.6; Core §9.0.2.
#### C11 · [GAP] [MEDIUM] · FD V1 §9.4 — "Silent ignoring fires validation" is unenforceable as written
**Finding:** §9.4: modules that explicitly ignore feedback emit a receipt; "Silent ignoring (no receipt) fires `validation.feedback_consumed_without_receipt` at run audit time." But the detection mechanism — which bundles were routed to which modules and whether receipts returned — is not specified, so the audit has no way to enumerate "expected receipts."
**Why it matters:** The validation never fires in practice. A module that processed-and-ignored without a receipt looks identical at audit time to one that received nothing.
**Recommendation:** Add §9.4A "Receipt expectation tracking": when the feedback router (§6) dispatches a bundle to a consumer, it records a `FeedbackDispatchExpectation` keyed to `(feedback_bundle_id, consumer_module_id, consumer_activation_seq)`. At run end (or after a receipt-grace period) the audit compares expectations to receipts; missing pairs fire the validation. Specify the grace period default (e.g., 5 minutes after consumer activation completes).
**Reference:** FD V1.0.1 §9.4.
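The audit comparison described above can be sketched as a set difference over the proposed key. The key fields follow the recommendation; `activation_completed_at_ms`, the function names, and the millisecond clock are assumptions, and the 5-minute default is the example value suggested above.

```typescript
interface ReceiptKey {
  feedback_bundle_id: string;
  consumer_module_id: string;
  consumer_activation_seq: number;
}

interface FeedbackDispatchExpectation extends ReceiptKey {
  // Hypothetical field; left unset while the consumer activation is still running.
  activation_completed_at_ms?: number;
}

const DEFAULT_GRACE_MS = 5 * 60 * 1000;

function keyOf(k: ReceiptKey): string {
  return JSON.stringify([k.feedback_bundle_id, k.consumer_module_id, k.consumer_activation_seq]);
}

// Returns expectations whose grace period has elapsed without a matching receipt.
// Each result would fire validation.feedback_consumed_without_receipt.
function auditMissingReceipts(
  expectations: FeedbackDispatchExpectation[],
  receipts: ReceiptKey[],
  nowMs: number,
  graceMs: number = DEFAULT_GRACE_MS,
): FeedbackDispatchExpectation[] {
  const received = new Set(receipts.map((r) => keyOf(r)));
  return expectations.filter(
    (e) =>
      e.activation_completed_at_ms !== undefined &&
      nowMs - e.activation_completed_at_ms >= graceMs &&
      !received.has(keyOf(e)),
  );
}
```

The sketch deliberately skips activations that never completed; how a run-end audit should treat those is a separate decision the section would need to state.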
#### C12 · [BUG] [MEDIUM] · Run Board V1 §5.4 + Source Workspace §9.4 — Cross-matter post visibility unspecified
**Finding:** Run Board §5.4's 5 visibility values don't restrict by matter. §5.2 has a `matter_id?` field. Source Workspace §9.4 says cross-matter retrieval doesn't surface privileged-matter entries. But a forum post in a privileged matter with `visibility: "all_task_modules"` — is it visible to a module in another matter? The rule isn't stated.
**Why it matters:** A privilege firewall breach via forum posts is catastrophic for a litigator. If matter scoping is implicit, it's not enforced.
**Recommendation:** Add Run Board §5.5 "Cross-matter visibility rule": a post with `matter_id == X` is visible only to readers operating under `matter_id == X`, regardless of `visibility`; `visibility` scopes within the matter. Add `validation.forum_post_cross_matter_leak`. Document interaction with `privileged: true` (always matter-scoped; never cross-matter regardless of access tier).
**Reference:** Run Board V1.0.1 §5.2, §5.4; Source Workspace V1.0.1 §9.4; V3.3.1 §13.4.
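The proposed rule amounts to a matter gate evaluated before `visibility`. A minimal sketch with hypothetical type and function names; the treatment of posts that carry no `matter_id` is an assumption, since the recommendation only covers posts that have one.

```typescript
interface ForumPostScope {
  matter_id?: string;
  privileged?: boolean;
}

interface ReaderScope {
  matter_id?: string;
}

// Matter scoping is checked first and cannot be widened by `visibility`.
function passesMatterGate(post: ForumPostScope, reader: ReaderScope): boolean {
  if (post.matter_id === undefined) {
    // Assumption: unscoped posts are readable unless marked privileged.
    return post.privileged !== true;
  }
  return reader.matter_id === post.matter_id;
}

// `visibilityAllows` is the result of the existing §5.4 visibility check.
function canReadPost(post: ForumPostScope, reader: ReaderScope, visibilityAllows: boolean): boolean {
  return passesMatterGate(post, reader) && visibilityAllows;
}
```

A read that passes the visibility check but fails the matter gate is the case `validation.forum_post_cross_matter_leak` would report.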
#### C13 · [GAP] [LOW] · Source Workspace §3.3 — Tier transitions lack policy/validation
**Finding:** §3.3 `SourceTierTransition` is recorded on promotion, but no rule prevents demotion (tier 3 → tier 1), no validation requires a substantive reason, and no policy gates ad-hoc tier changes by access tier (per V3.3.1 §16).
**Why it matters:** A read-only user can demote a tier-3 source card to tier 1, dropping rich content and metadata. The transition is recorded but the data is gone.
**Recommendation:** Add §3.3A "Tier transition policy": demotion (`from_tier > to_tier`) requires `cleared_by_access_tier >= "matter_team_access"` and a non-empty `reason`. Promotion is unrestricted by access tier. Add `validation.source_tier_demotion_without_authority`.
**Reference:** Source Workspace V1.0.1 §3.3; V3.3.1 §16.
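The demotion policy can be expressed as a pure check on the transition record. In this sketch the access-tier ordering is a placeholder: only `matter_team_access` comes from the recommendation, and the other two tier names are hypothetical stand-ins for whatever V3.3.1 §16 defines. The single validation code is reused for both the authority and the missing-reason failure.

```typescript
// Placeholder ordering, lowest to highest. Only "matter_team_access" is from the text.
const ACCESS_TIER_ORDER = ["read_only_access", "matter_team_access", "firm_admin_access"] as const;
type AccessTier = typeof ACCESS_TIER_ORDER[number];

interface SourceTierTransition {
  from_tier: number;
  to_tier: number;
  cleared_by_access_tier: AccessTier;
  reason?: string;
}

function validateTierTransition(t: SourceTierTransition): string[] {
  // Promotion (or no change) is unrestricted by access tier.
  if (t.from_tier <= t.to_tier) return [];
  const hasAuthority =
    ACCESS_TIER_ORDER.indexOf(t.cleared_by_access_tier) >= ACCESS_TIER_ORDER.indexOf("matter_team_access");
  const hasReason = (t.reason ?? "").trim().length > 0;
  return hasAuthority && hasReason ? [] : ["validation.source_tier_demotion_without_authority"];
}
```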
#### C14 · [BUG] [LOW] · Common Contracts §3.4 — Parallel Judge+Evaluator example contradicts §5.18.8
**Finding:** §3.4: "Two envelopes can reference the same snapshot... e.g., Judge and Evaluator running in parallel on the same artifact version." But V3.3.1 §5.18.8 places Patterns A/B Evaluator inside an Experiment context, and Pattern C has the Judge *consume* the Evaluator's output (so they cannot run in parallel). The example fits no specified topology.
**Why it matters:** A coding agent reading §3.4 thinks parallel Judge+Evaluator is a supported topology and tries to wire it.
**Recommendation:** Fix the example: "e.g., two Evaluator activations on the same snapshot during an Experiment, or an Evaluator and a deterministic scorer in parallel." Either remove the Judge example or specify the topology where Judge and Evaluator are genuinely parallel.
**Reference:** Common Contracts V1.1.1 §3.4; V3.3.1 §5.18.8.
#### C15 · [GAP] [LOW] · Core §3D + Run Board §6.4 — `TaskOpportunityPacket` and `TaskRunContextPacket` taxonomy is unstated
**Finding:** Core §3D defines `TaskOpportunityPacket` (DOC24-assembled; direct-mode; 150–600 token budget). Run Board §6.3 defines `TaskRunContextPacket` (DOC24-assembled; for module activations during a run). Both are DOC24-assembled contextual packets; the differentiation is real (pre-run opportunity vs in-run context) but unstated, and their fields overlap conceptually.
**Why it matters:** A DOC24 coding agent sees both and picks one or invents a third.
**Recommendation:** Add a "Packet taxonomy" subsection to Core §3D listing the known packet types: `TaskOpportunityPacket` (pre-task, ambient), `TaskRunContextPacket` (in-run, module-scoped), and the future `TaskAgentDesignPacket` (§3D.3). State each one's DOC24 lane and audience. *(See D18 — this overlaps the "token budget fragmented across packets" finding.)*
**Reference:** Core R0.7.1 §3D.2, §3D.3; Run Board V1.0.1 §6.3, §6.5.
---
## 3. Deeper-dive additional findings (D1–D24)
These were produced after the original 33 findings (self-labeled as 32), on closer reading of the two large documents and the seams. Two are CONFIRMED affirmations.
#### D1 · [BUG] [HIGH] · V3.3.1 §17.1 vs §8.4 — Sub-agent coordination point count mismatch (4 vs 5)
§17.1 lists five sub-agent coordination points (including a Plan Verifier), but the `allowed_coordination_points` enum at §8.4 has only four (`outcome_compiler`, `evaluator`, `revision_compiler`, `feedback_interpreter`). A sub-agent profile cannot declare itself for the Plan Verifier point. Reconcile: either add the fifth enum value or remove the §17.1 reference. **Reference:** V3.3.1 §17.1, §8.4.
#### D2 · [RISK] [MEDIUM] · V3.3.1 §6.7.2 — Success-condition 5 races the cascade
Success condition 5 ("Cascaded dependent outcomes are re-evaluated") is checked at revision-cycle completion, but the cascade (§11.21) can still be firing when the Loop Controller evaluates the seven conditions. A revision can be marked successful before its own regression cascade has settled. Gate condition 5 on cascade quiescence. **Reference:** V3.3.1 §6.7.2, §11.21.
#### D3 · [GAP] [MEDIUM] · Run Board V1 §4.6 — Moderator failure path unspecified
`ForumModeratorPolicy` defines five moderator modes but no behavior when the moderator agent (for `task_agent_advisory`, `domain_moderator`) is unavailable or errors. The forum has no defined degradation (fall back to `none`? pause? queue?). Specify a moderator fallback analogous to the Task Agent fallback (§6.9.1). **Reference:** Run Board V1.0.1 §4, §4.6.
#### D4 · [GAP] [MEDIUM] · V3.3.1 §6.12 / §13.3 — `goal_advancement_count` has no decrement path
`goal_advancement_count` only increments. A pattern whose later applications stop advancing the goal never loses count; the metric is monotonic and so becomes stale-positive over time. Define a decay or re-evaluation rule (or a windowed denominator). *(Touches paused learning surface; flag-only, but the no-decrement asymmetry is structural.)* **Reference:** V3.3.1 §13.3.
#### D5 · [BUG] [MEDIUM] · Common Contracts §11.5 — Backward-compat claim overstates stability
§11.5 claims read-compatibility across versions, but the documents reference each other by section number (e.g., "Addenda B V3.1 §5.7"). On absorption into DOC23 R3.2, section numbers shift and these references break. The compatibility claim holds for schemas but not for the cross-references that bind them. Convert section-number cross-refs to stable symbolic anchors before absorption. **Reference:** Common Contracts V1.1.1 §11.5.
#### D6 · [GAP] [MEDIUM] · Source Workspace §6.2 — `ResearchNeed.status` lacks exit transitions for `human_needed`
`ResearchNeed` has a status lifecycle, but the `human_needed` status has no defined exit (who resolves it, what transitions it to satisfied/abandoned, what happens if never resolved). Open research needs in `human_needed` can leak across the run. Specify the exit transitions and a default disposition at run end. **Reference:** Source Workspace V1.0.1 §6.2.
#### D7 · [RISK] [MEDIUM] · Core §9.0.6 — Signal emission ordering vs receipt persistence is undefined
The §9.0.6 signal flow shows signals emitted then passing through the EC policy gate, but the ordering relationship between signal emission and the durable persistence of the receipts those signals reference is not specified. A consumer can receive a signal referencing a receipt not yet durably written. Specify emit-after-persist (or a read-your-writes guarantee at the gate). **Reference:** Core R0.7.1 §9.0.6.
#### D8 · [BUG] [LOW] · Common Contracts §3.1 vs V3.3.1 §5.1 — `evaluation_chain_id` naming asymmetry
The field is `target_evaluation_chain_id` on the envelope (Common Contracts §3.1) but referred to as `evaluation_chain_id` in V3.3.1 §5.1 prose. Same concept, two names; a coding agent may treat them as distinct. Normalize the name. **Reference:** Common Contracts §3.1; V3.3.1 §5.1.
#### D9 · [CONFIRMED] · V3.3.1 §6.6.2 — AutonomousModePolicy locked fields
Locking the four `may_skip_*` fields to `false` at the schema level (the validator rejects any record where they differ), with only `skip_low_risk_judgment_gate` user-settable, is exactly the right way to make autonomy safe by construction. Hard Calls, policy gates, privileged artifacts, and external side effects always require human gates. This is the model the rest of the spec's safety-critical invariants should follow. **Reference:** V3.3.1 §6.6.2.
#### D10 · [CONFIRMED] · FD V1 §3.4 — Defeasible findings
The seven defeasibility rules (findings are contestable; user-contesting transitions to `contested` and unblocks downstream; `authority_basis` is what makes a finding a hard blocker) are a genuinely good piece of design. Most evaluation systems treat findings as ground truth; treating them as defeasible-but-typed is the correct trust model for high-stakes review. **Reference:** FD V1.0.1 §3.4.
#### D11 · [IDEA] [MEDIUM] · Set-wide — A `TaskReplay` primitive would close the determinism story
The set has `execution_watermark_ref` (Common Contracts §3.1) and snapshots, but no first-class replay primitive that reconstructs a run deterministically from recorded state. For a litigator who must demonstrate "this is exactly what the system did," replay is the proof. Define `TaskReplay` (inputs: run_id, watermark; output: reconstructed event sequence with divergence detection). *(Relates to S5 RunReplayPreview and to the determinism purpose-question A3.)* **Reference:** Common Contracts §3.1; V3.3.1 §3.7.4.
#### D12 · [UX] [LOW] · V3.3.1 §6.10 — Planner confidence threshold needs calibration guidance
The default 0.7 confidence threshold (below which plans surface for human review) is a single magic number with no calibration guidance. Different goal kinds warrant different thresholds. Add per-goal-kind threshold guidance or a calibration mechanism, and surface the active threshold in the UI. **Reference:** V3.3.1 §6.10.
#### D13 · [GAP] [HIGH] · V3.3.1 §5.2 / §6.10 — `compiler_confidence_score` has no computation
The Compiler confidence score gates human review (§6.10) but no document specifies how it is computed. A self-reported confidence with no grounding is a sycophancy vector — the same risk the §6.12 goal-advancement fix carefully avoids. Specify the computation (or bind it to an independent signal, not self-assessment). *(This is subsumed by the cross-reviewer "Formula Registry" finding — see §6.)* **Reference:** V3.3.1 §5.2, §6.10.
#### D14 · [GAP] [MEDIUM] · V3.3.1 §7.9.3 — HardCallResolution hash normalization unspecified
The reuse rule (§7.9.3) compares `outcome_definition_hash` and `goal_context_hash` for compatibility, but how those hashes are computed/normalized is unspecified. Two semantically-identical outcomes with cosmetic differences (whitespace, field ordering) produce different hashes, so a valid prior resolution is needlessly re-escalated. Specify a canonical normalization before hashing. **Reference:** V3.3.1 §7.9.3.
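A canonical form that is insensitive to field ordering and incidental whitespace could be produced as in the sketch below; the hash would then be computed over the returned string. This is an illustration, not the spec's normalization: collapsing whitespace inside string values is an assumption that may be too aggressive for some fields, and the function name is hypothetical.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

// Serializes a value with sorted object keys and whitespace-normalized strings.
function canonicalize(value: JsonValue): string {
  if (value === null || typeof value === "number" || typeof value === "boolean") {
    return JSON.stringify(value);
  }
  if (typeof value === "string") {
    // Assumption: leading, trailing, and repeated whitespace is cosmetic.
    return JSON.stringify(value.trim().replace(/\s+/g, " "));
  }
  if (Array.isArray(value)) {
    // Array order is treated as meaningful and preserved.
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const parts = keys.map((k) => {
    const v = value[k];
    return JSON.stringify(k) + ":" + canonicalize(v === undefined ? null : v);
  });
  return "{" + parts.join(",") + "}";
}
```

Two outcome definitions that differ only in key order or spacing yield the same canonical string, so `outcome_definition_hash` and `goal_context_hash` computed over it would match and a valid prior resolution could be reused.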
#### D15 · [GAP] [HIGH] · Source Workspace + Run Board — Concurrency model undefined
Neither Source Workspace nor the Forum specifies a concurrency model for simultaneous writes (two modules adding sources to the same workspace, two participants posting to the same forum segment). Append-only posts help the forum, but Source Workspace `SourceRecord` mutation and tier transitions have no concurrency control. Specify last-write-wins vs optimistic-locking vs serialized writes per record kind. *(Relates to scenario C1(a) in the purpose-question library — two modules satisfying the same ResearchNeed.)* **Reference:** Source Workspace V1.0.1; Run Board V1.0.1 §5.
#### D16 · [GAP] [HIGH] · Set-wide — `HumanOutcomeFeedbackEvent` referenced but never schema-defined
`HumanOutcomeFeedbackEvent` appears in the Core §0.3.6 terminology table ("Feedback event | HumanOutcomeFeedbackEvent | ... yes") and is the natural target of the Teach-from-feedback flow and the ModuleActivationChat advise mode (§5 below), but no document defines its schema. Define it (probably in FD V1 or Common Contracts) before any feedback-producing surface depends on it. **Reference:** Core §0.3.6; FD V1 §2; V3.3.1 §14.3.
#### D17 · [RISK] [MEDIUM] · V3.3.1 §5.18 — Pattern C doubles per-turn evaluation latency with no budget
Pattern C wires a Judge downstream of every standalone Evaluator. In an iterative revision loop, that doubles the evaluation latency (and cost) per turn, but no budget governs the Pattern C Judge invocation separately from the Evaluator. A long revision loop silently doubles cost. Add a Pattern C invocation budget or a "score every N turns, not every turn" cadence option. **Reference:** V3.3.1 §5.18; §11.15.
#### D18 · [GAP] [MEDIUM] · Set-wide — Token budget is fragmented across four packets
Token budgets are declared independently on `TaskOpportunityPacket` (§3D), `TaskRunContextPacket` (Run Board §6.3), `BoardDigest` (§6.2), and the evaluation packets — but nothing reconciles them into a single per-activation budget. A module can receive a context packet + a board digest + a feedback bundle that jointly exceed its model's context window with no single authority checking the sum. Define a per-activation total-context budget that the packet assembler enforces across all packet sources. *(Overlaps C15.)* **Reference:** Core §3D; Run Board §6.2, §6.3.
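A single enforcement point could be as simple as the greedy sketch below, which admits packets in priority order until the per-activation total is reached. Everything here is an assumption for illustration: the `PacketContribution` shape, the `priority` field, and the choice to drop whole packets rather than truncate them.

```typescript
// Hypothetical description of one packet competing for the activation's context.
interface PacketContribution {
  source: string;
  tokens: number;
  priority: number; // higher is admitted first
}

interface BudgetResult {
  kept: string[];
  dropped: string[];
  used_tokens: number;
}

// Greedy admission: a packet is kept only if it fits in the remaining budget.
function fitToActivationBudget(packets: PacketContribution[], totalBudget: number): BudgetResult {
  const ordered = packets.slice().sort((a, b) => b.priority - a.priority);
  const result: BudgetResult = { kept: [], dropped: [], used_tokens: 0 };
  for (const p of ordered) {
    if (result.used_tokens + p.tokens <= totalBudget) {
      result.kept.push(p.source);
      result.used_tokens += p.tokens;
    } else {
      result.dropped.push(p.source);
    }
  }
  return result;
}
```

The point of the sketch is the single authority over the sum; a real assembler would more likely shrink a digest than drop it, and would record what was dropped for audit.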
#### D19 · [GAP] [MEDIUM] · Set-wide — Task Agent appears in two forum-shaped surfaces
The Task Agent participates both via the Forum (`task_agent_advisory` moderator mode, Run Board §4.3) and via its own assessment queue (Core §24.8) consuming `TaskProcessGapSignal`. These are two forum-shaped surfaces for the same agent with no stated relationship. Clarify whether they are one surface or two, and how a process-gap observation in one reaches the other. **Reference:** Run Board §4.3; Core §9.0.3, §24.8.
#### D20 · [GAP] [MEDIUM] · Forum V1 — Forum is run-scoped only; no task-scoped forum
The Forum and Run Board are scoped to a single run (`run_id` throughout). But coordination that should persist across runs of the same task (recurring research needs, standing user guidance, design discussions) has no home. A task-scoped forum (or a promotion path from run-forum to task-forum) is missing. **Reference:** Run Board V1.0.1 §1, §3.
#### D21 · [GAP] [MEDIUM] · V3.3.1 §13.1 — `cross_model_applicability = "requires_validation"` has no runtime behavior
`Pattern.cross_model_applicability` can be `"requires_validation"`, but no document specifies what the runtime does with that value — what validation, when, what happens while validation is pending, what gates on it. The enum value exists with no consuming behavior. Specify the validation procedure or mark the value Phase 2. *(Touches paused learning surface; flag-only.)* **Reference:** V3.3.1 §13.1.
#### D22 · [GAP] [MEDIUM] · V3.3.1 §15.8 — Sub-agent reputation portability is unspecified
`SubAgentReputation` accrues per `advisory_agent_id`, but whether reputation is scoped per-matter, per-firm, or global — and whether it ports when a sub-agent is promoted across scopes — is unspecified. A sub-agent trusted in one matter may be untested in another; reputation portability needs a scope rule mirroring the taint/pattern scope discipline. **Reference:** V3.3.1 §15.8; §16.
#### D23 · [BUG] [MEDIUM] · FD V1 — `ApplicabilityScope.authority_level` vs `domain_payload.authority_level` conflict
Two `authority_level` fields exist at different nesting levels (on the applicability scope and on the domain payload) with no rule for which governs when they disagree. A coding agent cannot determine the effective authority level. Define precedence (or merge to one field). **Reference:** FD V1.0.1 §3, §5.
#### D24 · [RISK] [MEDIUM] · FD V1 §6.3 — Multiple delivery branches can fire simultaneously with no cost/idempotency control
The `FeedbackRoutingPolicy` branches (§6.3) are not stated to be mutually exclusive; a single evaluation result can match multiple branches (e.g., `on_needs_revision` and `on_needs_more_sources`), firing multiple deliveries. No idempotency key or cost guard prevents duplicate or conflicting deliveries. Specify branch exclusivity (or an explicit multi-fire policy with idempotency keys). **Reference:** FD V1.0.1 §6.3.
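**Illustrative sketch (not spec).** A minimal picture of the two options the finding names: branch exclusivity versus multi-fire with idempotency keys. The two branch names come from the example above; `DeliveryLedger`, `planDeliveries`, the `"first_match_only" | "multi_fire"` policy values, and the key format are all assumptions.

```typescript
// Only these two branch names appear in D24; a real policy has more.
type RoutingBranch = "on_needs_revision" | "on_needs_more_sources";

interface DeliveryRequest {
  evaluation_result_id: string;
  branch: RoutingBranch;
}

// Hypothetical in-memory ledger; a real one would need durable storage.
class DeliveryLedger {
  private readonly seenKeys = new Set<string>();

  // Assumed key format: one delivery per (evaluation result, branch).
  static keyFor(req: DeliveryRequest): string {
    return `${req.evaluation_result_id}::${req.branch}`;
  }

  // Returns true the first time a key is recorded, false on any repeat.
  tryRecord(req: DeliveryRequest): boolean {
    const key = DeliveryLedger.keyFor(req);
    if (this.seenKeys.has(key)) return false;
    this.seenKeys.add(key);
    return true;
  }
}

function planDeliveries(
  evaluationResultId: string,
  matchedBranches: readonly RoutingBranch[],
  ledger: DeliveryLedger,
  policy: "first_match_only" | "multi_fire",
): DeliveryRequest[] {
  const candidates = policy === "first_match_only" ? matchedBranches.slice(0, 1) : matchedBranches.slice();
  return candidates
    .map((branch) => ({ evaluation_result_id: evaluationResultId, branch }))
    .filter((req) => ledger.tryRecord(req));
}
```

Under `multi_fire`, re-processing the same evaluation result yields no new deliveries because every key is already in the ledger. A cost guard would sit on top of this and is not shown.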
---
## 4. New user-facing surfaces proposed (S1–S6)
These are six surfaces the set needs but does not deliver. S1 and S2 are the highest-leverage. Schemas are starting points, not final.
#### S1 · WorkProductCertification — "the page you staple to the cover sheet" (highest-leverage)
**Purpose.** A single artifact a professional can attach to a finished work product that certifies what the system did and didn't verify. Today the assurance story is scattered across findings, verification records, judgment limitations, and assurance slices. S1 collapses them into one signed statement: what was checked, by what basis, what was NOT checked, what the user accepted over a warning, and what remains a judgment limitation.
**Schema (sketch).**
```ts
WorkProductCertification {
certification_id: string
task_id: string
run_id: string
artifact_ref: StorageRef
artifact_version_ref: StorageRef
verified_items: Array<{ description: string; basis: AssuranceBasis; verification_record_refs: StorageRef[] }>
not_verified_items: Array<{ description: string; reason: "out_of_scope" | "no_capability" | "degraded" | "user_declined" }>
judgment_limitations: JudgmentLimitationRecord[] // §5.9
user_overrides: Array<{ what: string; warning_shown: string; user_ref: UserRef; at: ISO8601 }>
unresolved_hard_calls: string[] // should be empty for a clean cert
taint_summary: { highest_uncleared: TaintClass; cleared_records: StorageRef[] }
certified_at: ISO8601
certified_by: UserRef | "system_auto"
schema_version: 1
}
```
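**Illustrative sketch (not spec).** How a consumer might read the certification. The field names follow the schema sketch above, but `CertificationCounts`, `isCleanCertification`, and `coverSheetLines` are hypothetical. The schema only says `unresolved_hard_calls` "should be empty for a clean cert", so the cleanliness check tests that one condition and nothing more.

```typescript
type NotVerifiedReason = "out_of_scope" | "no_capability" | "degraded" | "user_declined";

// Subset of WorkProductCertification needed for a cover-sheet summary.
interface CertificationCounts {
  verified_items: ReadonlyArray<{ description: string }>;
  not_verified_items: ReadonlyArray<{ description: string; reason: NotVerifiedReason }>;
  user_overrides: ReadonlyArray<{ what: string }>;
  unresolved_hard_calls: readonly string[];
}

// Whether overrides or degraded items should also disqualify a cert is an
// open question; this check deliberately does not decide it.
function isCleanCertification(cert: CertificationCounts): boolean {
  return cert.unresolved_hard_calls.length === 0;
}

// Hypothetical rendering: counts first, then every NOT-verified item spelled out.
function coverSheetLines(cert: CertificationCounts): string[] {
  const lines = [
    `Verified: ${cert.verified_items.length}`,
    `User overrides: ${cert.user_overrides.length}`,
    `Unresolved hard calls: ${cert.unresolved_hard_calls.length}`,
  ];
  for (const item of cert.not_verified_items) {
    lines.push(`NOT verified (${item.reason}): ${item.description}`);
  }
  return lines;
}
```

Listing each not-verified item individually, instead of only counting them, keeps the "what was NOT checked" half of the statement as visible as the verified half.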
**Home.** Core R0.7.1 (new §20.X). Generated at run completion; surfaced in the §21 review surfaces and as a downloadable artifact.
#### S2 · FindingsInbox — cross-task review queue (highest-leverage)
**Purpose.** A litigator running many matters needs one place to triage every open finding across every active task, sorted by reliability and stakes — not one Evaluation Result Card per run. This is the "what needs my attention right now, everywhere" surface that A1's TaskHealthCard implies at the per-task level, lifted to the portfolio level.
**Schema (sketch).**
```ts
FindingsInboxView {
generated_for: UserRef
scope: "all_matters" | { matter_ids: string[] }
entries: Array<{
finding_ref: StorageRef
task_id: string; run_id: string; matter_id?: string
severity: "low" | "medium" | "high" | "blocking"
state: FindingState
historical_false_positive_rate_in_context?: number // per A5
produced_by_subagent_status?: "active" | "watch" | "quarantined"
is_blocking_delivery: boolean
}>
sort: "reliability_desc" | "severity_desc" | "stakes_desc" | "recency"
schema_version: 1
}
```
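**Illustrative sketch (not spec).** The schema names the sort keys but not their semantics. This sketch assumes `severity_desc` ranks `blocking > high > medium > low`, and that `reliability_desc` means "lowest historical false-positive rate first, unknown rates last". Both readings, and the names `InboxEntryLike`, `sortSeverityDesc`, and `sortReliabilityDesc`, are assumptions.

```typescript
type Severity = "low" | "medium" | "high" | "blocking";

// Subset of a FindingsInboxView entry; finding_ref is a string stand-in for StorageRef.
interface InboxEntryLike {
  finding_ref: string;
  severity: Severity;
  historical_false_positive_rate_in_context?: number;
}

// Assumed ordering of the severity enum.
const SEVERITY_RANK: Record<Severity, number> = { low: 0, medium: 1, high: 2, blocking: 3 };

function sortSeverityDesc(entries: readonly InboxEntryLike[]): InboxEntryLike[] {
  return entries.slice().sort((a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity]);
}

// Assumed reading of "reliability": 1 minus the false-positive rate.
// Entries with no recorded rate get -1 so they sort after every known rate.
function sortReliabilityDesc(entries: readonly InboxEntryLike[]): InboxEntryLike[] {
  const reliability = (entry: InboxEntryLike): number =>
    entry.historical_false_positive_rate_in_context === undefined
      ? -1
      : 1 - entry.historical_false_positive_rate_in_context;
  return entries.slice().sort((a, b) => reliability(b) - reliability(a));
}
```

Both functions copy before sorting, so the view's `entries` array is never mutated by a change of sort key.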
**Home.** Core R0.7.1 or DOC20. Honors matter firewall (C12) — cross-matter entries never leak privileged content.
#### S3 · RunDiff — compare two runs of the same task
**Purpose.** When a user forks a run (A4/§5.1) or re-runs a task, they need to see what differed between runs: which outcomes flipped verdict, which findings appeared or disappeared, the cost delta, and which modules behaved differently. Today there is no run-to-run comparison.
**Home.** Core R0.7.1; consumes run records + EvaluationResultEnvelopes from both runs. Pairs naturally with TaskRunFork (§5).
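**Illustrative sketch (not spec).** S3 has no schema yet, so this is a guess at the minimum a diff would carry. `RunSnapshotLike`, `RunDiffSketch`, and `diffRuns` are hypothetical, verdicts are plain strings, and `cost` is a unitless number. The "which modules behaved differently" part is left out because nothing here says how module behavior is recorded.

```typescript
// Hypothetical flattened view of one run's records and evaluation envelopes.
interface RunSnapshotLike {
  run_id: string;
  outcome_verdicts: { [outcome_id: string]: string | undefined };
  finding_ids: string[];
  cost: number;
}

interface RunDiffSketch {
  base_run_id: string;
  compared_run_id: string;
  flipped_outcomes: Array<{ outcome_id: string; before: string; after: string }>;
  findings_appeared: string[];
  findings_disappeared: string[];
  cost_delta: number;
}

function diffRuns(base: RunSnapshotLike, compared: RunSnapshotLike): RunDiffSketch {
  // An outcome "flips" only if both runs recorded a verdict and the verdicts differ.
  const flipped: RunDiffSketch["flipped_outcomes"] = [];
  for (const outcomeId of Object.keys(base.outcome_verdicts)) {
    const before = base.outcome_verdicts[outcomeId];
    const after = compared.outcome_verdicts[outcomeId];
    if (before !== undefined && after !== undefined && before !== after) {
      flipped.push({ outcome_id: outcomeId, before, after });
    }
  }

  const baseFindings = new Set(base.finding_ids);
  const comparedFindings = new Set(compared.finding_ids);

  return {
    base_run_id: base.run_id,
    compared_run_id: compared.run_id,
    flipped_outcomes: flipped,
    findings_appeared: compared.finding_ids.filter((id) => !baseFindings.has(id)),
    findings_disappeared: base.finding_ids.filter((id) => !comparedFindings.has(id)),
    cost_delta: compared.cost - base.cost,
  };
}
```

Matching findings by id assumes finding identity is stable across runs, which the set would need to guarantee for a real RunDiff.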
#### S4 · DecisionAuditView — the "why did it decide that" surface
**Purpose.** ExplanationTrace (§7.10) is per-plan. S4 is the run-level audit view that threads every decision point — strategy selections, escalations, sub-agent consultations, Hard Call resolutions, taint clearances — into one auditable timeline a supervising attorney can walk. This is what makes the system defensible, not just usable.
**Home.** V3.3.1 §21.X, consuming ExplanationTrace decision_points + HardCallResolution ledger + TaintClearanceRecords.
#### S5 · RunReplayPreview — see what a replay would produce before committing
**Purpose.** Bound to D11 (TaskReplay). Before re-running or forking, show the user a dry-run preview: which steps would re-execute, which would reuse cached results, what would diverge given current state vs the recorded watermark.
**Home.** V3.3.1 §15.X; consumes `execution_watermark_ref` and EvaluationSnapshots.
#### S6 · LongitudinalPatternView — pattern behavior over time
**Purpose.** Extends §21.8's per-pattern card into a longitudinal view: how a pattern's convergence/regression/rollback counts have moved across many uses, in this context vs globally. Surfaces the D4 monotonicity problem visibly (a pattern that stopped working but never lost count shows a flat line a human can question).
**Home.** V3.3.1 §21.8 extension. *(Touches paused learning surface; build the view, defer the learning mechanics.)*
---
## 5. Fork / chat-in-session / re-prompt unified design
This section consolidates the design work that grew out of A4. Three capabilities turned out to share one mechanism (session continuation), and the decision was to ship them as one three-section addendum rather than as edits scattered across DOC23 R3.2.
### 5.1 TaskRunFork (Core R0.7.1)
A first-class fork primitive keyed to any module activation, not just plan-internal checkpoints. The fork point is `(module_id, activation_seq)`. The decision after the original review: forking copies workspace refs, but **side effects cannot branch** — an email already sent is sent in every branch. So the schema carries an explicit irrevocable-side-effects record at fork time.
```ts
TaskRunFork {
fork_id: string
parent_run_id: string
fork_point: { module_id: string; activation_seq: number }
fork_disposition: "experimental" | "alternate_path" | "recovery"
copied_workspace_refs: string[] // persistent SourceWorkspace copied; ephemeral RunWorkspace re-derived
divergence_inputs?: { user_directive?: string } // what the user changed at the fork
irrevocable_side_effects_at_fork: Array<{ // present in parent before fork point; cannot be undone in the child
side_effect_kind: "external_send" | "external_tool_call" | "durable_promotion" | "candidate_accepted"
receipt_ref: StorageRef
}>
created_at: ISO8601
schema_version: 1
}
```
**User gesture.** Right-click a Run Board event → "fork from here." **Pairs with** S3 (RunDiff) and S5 (RunReplayPreview).
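**Illustrative sketch (not spec).** How `irrevocable_side_effects_at_fork` might be populated. The four `side_effect_kind` values come from the schema; the side-effect log, its `run_order` field, and the function name are assumptions. Whether "before the fork point" is strict or inclusive is also undecided; this sketch uses strict.

```typescript
type SideEffectKind = "external_send" | "external_tool_call" | "durable_promotion" | "candidate_accepted";

// Hypothetical log entry. `run_order` is an assumed run-wide ordering,
// not a field in the TaskRunFork schema.
interface RecordedSideEffect {
  side_effect_kind: SideEffectKind;
  receipt_ref: string; // stand-in for StorageRef
  run_order: number;
}

interface IrrevocableSideEffect {
  side_effect_kind: SideEffectKind;
  receipt_ref: string;
}

// Everything the parent did strictly before the fork point is carried into
// the child as a fact it cannot undo.
function irrevocableSideEffectsAtFork(
  sideEffectLog: readonly RecordedSideEffect[],
  forkRunOrder: number,
): IrrevocableSideEffect[] {
  return sideEffectLog
    .filter((entry) => entry.run_order < forkRunOrder)
    .map((entry) => ({ side_effect_kind: entry.side_effect_kind, receipt_ref: entry.receipt_ref }));
}
```

Mapping a `(module_id, activation_seq)` fork point onto a single run-wide order is itself a design question this sketch sidesteps.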
### 5.2 ModuleActivationChat (Core R0.7.1, new §4.4)
A chat attached to any module activation, with three modes. Chat is the *input mechanism*; fork and feedback are the two *destinations*.
- **inspect (DEFAULT).** Read-only conversation against a frozen snapshot of the activation. No state change. "Why did you flag this?" / "What sources did you use here?"
- **advise.** Produces a `HumanOutcomeFeedbackEvent` (D16 — needs its schema defined) that flows into the existing Feedback Interpreter pipeline (V3.3.1 §14.3). This is teach-from-feedback reached conversationally.
- **branch.** Opens a `TaskRunFork` (§5.1) from this activation, carrying the chat's `user_directive` as `divergence_inputs`.
**Cross-doc obligations:** DOC23 R3.1 supplies `TaskModuleSessionRef` and the port mechanics; DOC11/OpenClaw supplies the read-only session handle; DOC15 renders chat context (in branch mode only); DOC20 supplies the three-mode toggle UI.
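**Illustrative sketch (not spec).** The three modes map cleanly onto a discriminated union: one mode changes nothing, one emits feedback, one requests a fork. The mode names, the `inspect` default, `HumanOutcomeFeedbackEvent`, and `divergence_inputs.user_directive` come from the text above; `ActivationRef`, `ChatTurnEffect`, and `effectOfChatTurn` are hypothetical.

```typescript
type ChatMode = "inspect" | "advise" | "branch";

interface ActivationRef {
  module_id: string;
  activation_seq: number;
}

// What a single chat turn is allowed to cause, per mode.
type ChatTurnEffect =
  | { kind: "none" }
  | { kind: "feedback"; event_type: "HumanOutcomeFeedbackEvent"; activation: ActivationRef; text: string }
  | { kind: "fork"; fork_point: ActivationRef; divergence_inputs: { user_directive: string } };

function effectOfChatTurn(
  activation: ActivationRef,
  userText: string,
  mode: ChatMode = "inspect", // inspect is the default mode
): ChatTurnEffect {
  switch (mode) {
    case "inspect":
      // Read-only conversation against a frozen snapshot: no state change.
      return { kind: "none" };
    case "advise":
      return { kind: "feedback", event_type: "HumanOutcomeFeedbackEvent", activation, text: userText };
    case "branch":
      return { kind: "fork", fork_point: activation, divergence_inputs: { user_directive: userText } };
    default: {
      // Compile-time exhaustiveness check: adding a mode without a case fails here.
      const unreachable: never = mode;
      throw new Error(`unknown chat mode: ${String(unreachable)}`);
    }
  }
}
```

Keeping the effect as a returned value, and not something performed inside the chat handler, is what lets `inspect` be provably side-effect free.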
### 5.3 Re-prompts — sibling of chat-in-session
Re-prompts are **pre-scripted** prompts (`re_prompts: string[]`, defined at design time) that fire on module kinds such as `agent_task`, `coding`, `agent_review_gate`, and `red_team`. They are a *sibling* of chat-in-session, not the same thing: both extend the OpenClaw/Gateway session and share session-continuation mechanics, but re-prompts are author-time and deterministic while chat is user-time and interactive.
### 5.4 Packaging decision (settled)
Rejected: separate edits touching DOC23 R3.2 (too costly). **Agreed:** one standalone addendum, three sections —
- **§1 Re-Prompt System** (re_prompts mechanics, applicability table, DOC23 compilation points)
- **§2 Module Activation Chat + TaskRunFork** (the §5.1/§5.2 design)
- **§3 Shared Module Session Continuation Mechanics** (references Addenda A §A9; portable; fold into DOC23 R3.2 later only if another feature needs it)
This addendum is the **next build** after the current review round closes. Session continuity itself lives in Addenda A §A9.
---
## 6. Evaluation of the four external reviewers
Will routed the red-team prompt (and, separately, purpose-questions) to ChatGPT, Grok, Gemini, and Claude in various configurations. Below: what those reviews surfaced that changed or extended Claude's thinking (§6.1, all adopted), the proposals deliberately *not* adopted with the reason for each (§6.2), and a flat list of the reviewer-surfaced additions being adopted (§6.3). No priority ordering — per the disposition rule in §1, everything in §6.1 and §6.3 is to be done; §6.2 is the only "not doing this" set.
### 6.1 What the other reviewers found that Claude rates high-value (and largely missed)
- **Pattern C envelope FIELD mismatch (highest-value catch).** A reviewer found that in Pattern C, the Judge reads `evaluated_target` and `evaluation_basis`, but those fields are NOT on `EvaluationResultEnvelope` — they live on `EvaluationFeedbackBundle`. This is distinct from Claude's B3 (chain-ID *lifecycle*); it's a field-location bug that breaks Pattern C wiring directly. **Adopted as a fix; it sits with B1 and B3 as a contract-consolidation item (§7).**
- **Memory Hydration phase missing.** No formal pre-run memory-read phase is specified. The set has the system write signals and patterns prolifically, but it never says when or how a run *reads* memory before executing. A genuine architectural absence (purpose-audit class, not wiring).
- **Memory Precedence Hierarchy undefined.** When local intent, matter policy, and global DOC72 patterns conflict, nothing says which wins. Proposed order: Local Intent > Matter Policy > Global DOC72.
- **Sub-Agent Amnesia / SubAgentPrior.** Sub-agents are invoked without a specified mechanism for carrying prior context (a `SubAgentPrior` injection). Each invocation starts cold.
- **Formula Registry (subsumes ~15 findings).** Scores throughout the set (compiler_confidence_score per D13, quality_index, reputation, goal_advancement) are bare numbers with no specified computation. A single Formula Registry — one place that defines how every numeric score is computed — subsumes D13 and roughly a dozen scattered "this number has no formula" findings. This is the highest-leverage *structural* fix the other reviewers surfaced.
- **CalibratedScore type.** A score type that carries its own confidence interval and calibration metadata, so a 0.7 from one scorer is comparable to a 0.7 from another. Pairs with the Formula Registry.
- **Flawless-Execution Denominator.** Learning fires only on failure; there is no positive-reinforcement signal when a run executes flawlessly. The denominator is missing, so the system can't distinguish "never failed" from "never ran." (Relates to Claude's B6 phantom `EvaluationAffirmation` — the positive counterpart to a finding.)
- **Other named gaps worth absorbing:** `HardCallBlockingScope` (what exactly a blocking Hard Call blocks — the step, the outcome, the whole run); `TaskCancelProtocol` (clean cancel semantics mid-run — relates to scenario C1(e)); `ResearchNeedLease` (so two modules don't both satisfy one need — D15/scenario C1(a)); `EvaluationContractReview` (a pre-execution check that the evaluation plan is coherent before spending); `EvidencePackage` (bundling evidence for a finding into one reviewable unit); `TaskBlueprint` Topology/Payload bifurcation (separating graph shape from graph content); `KnownGoodState` (a named, restorable checkpoint).
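**Illustrative sketch (not spec).** The proposed Memory Precedence Hierarchy (Local Intent > Matter Policy > Global DOC72) reduces to a first-match lookup over an ordered list. The layer identifiers, `LayeredValue`, and `resolveByPrecedence` are hypothetical names for that proposal.

```typescript
type MemoryLayer = "local_intent" | "matter_policy" | "global_doc72";

// Proposed order from the bullet above: earlier layers win.
const MEMORY_PRECEDENCE: readonly MemoryLayer[] = ["local_intent", "matter_policy", "global_doc72"];

interface LayeredValue<T> {
  layer: MemoryLayer;
  value: T;
}

// Returns the candidate from the highest-precedence layer that supplied one,
// or undefined when no layer has an opinion.
function resolveByPrecedence<T>(candidates: readonly LayeredValue<T>[]): LayeredValue<T> | undefined {
  for (const layer of MEMORY_PRECEDENCE) {
    const hit = candidates.find((candidate) => candidate.layer === layer);
    if (hit !== undefined) return hit;
  }
  return undefined;
}
```

Returning the winning layer along with the value keeps the resolution auditable: a reviewer can see that matter policy overrode a global pattern, not just the final value.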
### 6.2 Proposed mechanisms not adopted (the declined set, with reasons)
These are the only items in this document **not** being acted on. Each was genuinely proposed (by a reviewer or by Claude) and is declined for a stated reason. In every case the *underlying need* is real and is met another way — so "declined" means "not this mechanism," not "ignore the problem."
- **Literal Git-style branching / copy-on-write workspace with reversible operations (Grok's ShadowWorkspace, taken literally).** *Declined* because side effects cannot branch — an email already sent, an external tool call already made, a durable promotion already committed cannot be "un-sent" in a child branch, and a copy-on-write model that implies they can is a phantom (the exact failure class Will guards against). *Adopted instead:* the branching *concept* via `TaskRunFork` (§5.1) plus the explicit `irrevocable_side_effects_at_fork` record, which makes the un-undoable visible at the fork point rather than pretending it's reversible.
- **`TaskConfirmationSignal` as a new signal type.** *Declined* because it duplicates existing signal/receipt machinery; the need (recording confirmation) is already served by existing signals and receipts. Adding a new type widens the surface for no benefit. *Adopted instead:* nothing new — fold into existing signal/receipt records.
- **Flawless-execution as a new signal type.** *Declined* for the same redundancy reason: `OutcomeEvaluationSignal` already carries the verdict, so a separate "it went fine" signal is unnecessary. *Adopted instead:* the real problem (the missing positive-reinforcement denominator — the system can't tell "never failed" from "never ran") is fixed by making the existing `OutcomeEvaluationSignal` verdict-aware and counting clean passes in it. (See §6.3.)
- **Chunking findings to manage KV-cache bloat.** *Declined* because chunking fragments a unit the reviewer needs whole; splitting a findings set across cache boundaries trades one problem (size) for a worse one (loss of coherence at review time). *Adopted instead:* render a *compressed envelope view* at prompt-assembly time — keep the finding set whole, compress the rendering.
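**Illustrative sketch (not spec).** What "verdict-aware and counting clean passes" buys, in miniature. `OutcomeSignalLike`, the two verdict values, `subject_id`, and both function names are hypothetical; the real `OutcomeEvaluationSignal` shape is not shown in this document.

```typescript
// Hypothetical two-value verdict; the real signal may carry more states.
type SignalVerdict = "clean_pass" | "failed";

interface OutcomeSignalLike {
  subject_id: string; // e.g. a pattern or module being tracked (assumed)
  verdict: SignalVerdict;
}

interface ExecutionTally {
  runs: number;
  clean_passes: number;
  failures: number;
}

// Counting every signal, not just failures, is what supplies the denominator.
function tallyBySubject(signals: readonly OutcomeSignalLike[]): Map<string, ExecutionTally> {
  const tallies = new Map<string, ExecutionTally>();
  for (const signal of signals) {
    const tally = tallies.get(signal.subject_id) ?? { runs: 0, clean_passes: 0, failures: 0 };
    tally.runs += 1;
    if (signal.verdict === "clean_pass") tally.clean_passes += 1;
    else tally.failures += 1;
    tallies.set(signal.subject_id, tally);
  }
  return tallies;
}

function describeTrackRecord(tally: ExecutionTally | undefined): "never_ran" | "never_failed" | "has_failures" {
  if (tally === undefined || tally.runs === 0) return "never_ran";
  return tally.failures === 0 ? "never_failed" : "has_failures";
}
```

With failure-only signals, the first two return values are indistinguishable; both show up as "no failures recorded".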
### 6.3 Reviewer-surfaced additions to adopt (no ordering)
Every item below is adopted. They are listed flat, not ranked — pick them up in whatever sequence the work naturally takes (several cluster into the §7 hardening pass; the learning-touching ones carry the Phase B dependency noted in §7). The Formula Registry is called out only because it is the one item that *subsumes* others, not because it ranks above them.
- **Pattern C envelope field fix** — the `evaluated_target` / `evaluation_basis` field-location bug (§6.1); sits with B1/B3.
- **Formula Registry** — one place defining how every numeric score is computed; subsumes D13 (`compiler_confidence_score`) and the dozen-plus scattered "this number has no formula" gaps.
- **Memory Hydration phase** — a formal pre-run memory-read phase.
- **Memory Precedence Hierarchy** — Local Intent > Matter Policy > Global DOC72 when they conflict.
- **SubAgentPrior injection** — a mechanism for sub-agents to carry prior context instead of starting cold.
- **Flawless-execution denominator** — via the enhanced (verdict-aware) `OutcomeEvaluationSignal`, not a new signal type (per §6.2).
- **HardCallBlockingScope** — define what a blocking Hard Call actually blocks (step / outcome / run).
- **TaskCancelProtocol** — clean cancel semantics mid-run (relates to scenario C1(e)).
- **ResearchNeedLease** — so two modules don't both satisfy one need (closes D15 / scenario C1(a)).
- **EvaluationContractReview** — a pre-execution check that the evaluation plan is coherent before spending.
- **EvidencePackage** — bundle the evidence for a finding into one reviewable unit.
- **CalibratedScore type** — a score that carries its own confidence interval and calibration metadata (pairs with the Formula Registry).
- **RunGuidanceItem persistence** — persist run-guidance items rather than letting them evaporate.
- **Forum deadlock breaker** — a defined mechanism to break a stalled forum (relates to D3, D20).
- **TaskBlueprint Topology/Payload bifurcation** — separate graph shape from graph content.
- **Regenerate `previous_attempt_hash`** — so regeneration can detect and reference the prior attempt.
- **Chaos test fixtures** — fixtures that inject the failure conditions the purpose-questions surface (storage-full, malformed LLM output, mid-run privilege change).
- **KnownGoodState** — a named, restorable checkpoint.
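**Illustrative sketch (not spec).** One possible shape for the `CalibratedScore` item above. Every field name here is an assumption, including `formula_ref` as the link to a Formula Registry entry. The overlap test is only meaningful when both scores use the same confidence level.

```typescript
// Hypothetical shape: a bare number plus the metadata needed to compare it.
interface CalibratedScore {
  value: number;
  interval: { low: number; high: number; confidence_level: number };
  calibration: { scorer_id: string; formula_ref: string };
}

function makeCalibratedScore(
  value: number,
  low: number,
  high: number,
  confidenceLevel: number,
  scorerId: string,
  formulaRef: string,
): CalibratedScore {
  if (!(low <= value && value <= high)) {
    throw new RangeError("value must lie inside [low, high]");
  }
  if (!(confidenceLevel > 0 && confidenceLevel < 1)) {
    throw new RangeError("confidence_level must be strictly between 0 and 1");
  }
  return {
    value,
    interval: { low, high, confidence_level: confidenceLevel },
    calibration: { scorer_id: scorerId, formula_ref: formulaRef },
  };
}

// Two scores are treated as distinguishable only if their intervals do not overlap.
function clearlyDiffer(a: CalibratedScore, b: CalibratedScore): boolean {
  return a.interval.high < b.interval.low || b.interval.high < a.interval.low;
}
```

Under this reading, a 0.7 and a 0.72 with wide intervals are "the same" for gating purposes, which is the comparability property the bare numbers lack.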
### 6.4 Note on the Gemini memory review
Gemini's strong memory-system review arose from answering a *purpose* question ("does this unify the memory system?"), not the standard completeness-audit prompt. The lesson, captured in §8: the standard prompt is a **completeness audit** (finds defects in what's specified); purpose-questions are a **purpose audit** (find architectural absences like Memory Hydration). Both are needed; they find different classes of problem.
---
## 7. Recommended R0.4 — "Math + Contract Hardening Pass"
This is not a priority subset — it is an *execution grouping*. The contract, enum, routing, math, and hashing fixes below are tightly coupled (they all touch the shared-schema layer), so doing them as one coherent pass is more efficient than scattering them. Everything else in this document — the conceptual findings (A-series), the new surfaces (§4), the fork/chat/re-prompt addendum (§5), and the reviewer-surfaced additions (§6.3) — is equally to be done, in its own workstream. Grouping these here says "do these together," not "do these instead of those." Scope of the pass:
1. **Contract consolidation.** Make Common Contracts V1.1.x the schema-of-record for all shared types. Resolve B1 (one `EvaluationFinding`), fix B5/B6 phantom and misattributed references, fix the Pattern C envelope field location (§6.1).
2. **Enum and mapping closure.** Fix B2 (`OutcomeEvaluationState` count + `evaluating` mapping), C5 (`LearningMode` enumeration), D8 (chain-id naming).
3. **Routing completeness.** B4 (`on_indeterminate`/`on_not_applicable`/`on_unrecoverable`), C1 (Pattern C route resolution policy), D24 (branch exclusivity + idempotency).
4. **Cascade safety.** B7 (cascade depth + cycle detection), B8 (consolidate upstream-failure rule + race), C4 (dependency cycles), D2 (success-condition-5 quiescence).
5. **The Formula Registry.** One document/section defining how every numeric score is computed (subsumes D13 and the scattered score-formula gaps); introduce CalibratedScore.
6. **Hash/normalization + concurrency.** D14 (HardCallResolution hash normalization), D15 (Source Workspace/Forum concurrency model), C7 (structured-anchor non-empty), C8 (source_kind→taint mapping).
7. **Define the named-but-missing schemas.** D16 (`HumanOutcomeFeedbackEvent`), and decide on `EvaluationAffirmation` (define or remove).
8. **Pre-absorption hygiene.** D5 (convert section-number cross-refs to stable anchors before DOC23 R3.2 absorption).
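**Illustrative sketch (not spec).** The cycle-detection half of item 4, as a standard depth-first search with a "visiting" mark. `DependencyGraph` and `findDependencyCycle` are hypothetical names; how the set actually represents cascade or dependency edges is not assumed beyond "node id to list of node ids".

```typescript
// Adjacency list: each node id maps to the ids it depends on (assumed representation).
type DependencyGraph = { [node: string]: string[] | undefined };

// Returns one cycle as a path that starts and ends on the same node, or null if acyclic.
function findDependencyCycle(graph: DependencyGraph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    const mark = state.get(node);
    if (mark === "done") return null;
    if (mark === "visiting") {
      // Back edge: the cycle is the current path from the first occurrence of `node`.
      return path.slice(path.indexOf(node)).concat(node);
    }
    state.set(node, "visiting");
    path.push(node);
    for (const next of graph[node] ?? []) {
      const cycle = visit(next);
      if (cycle !== null) return cycle;
    }
    path.pop();
    state.set(node, "done");
    return null;
  };

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle !== null) return cycle;
  }
  return null;
}
```

For example, a graph where `a` depends on `b`, `b` on `c`, and `c` on `a` yields the path `["a", "b", "c", "a"]`. A cascade depth limit is a separate guard and is not shown; this recursion is also unbounded in depth, so a production version would likely be iterative.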
**Dependency note (not a deprioritization).** The learning-engine work and the learning-surface findings (D4, D21, A6, S6, and the learning side of the flawless-execution denominator) carry a hard sequencing dependency: they are **gated on the Phase B corpus audit**, because the audit is what surfaces their own requirements — writing them ahead of it means guessing at the spec they're supposed to implement. This is a "can't be done yet," not a "do later by choice." The hardening pass above is the contract/math layer and has no such dependency — it can proceed now.
---
## 8. Appendix — Reusable purpose-question library
The standard red-team prompt is a **completeness audit**: it finds defects in what is specified. These questions are a **purpose audit**: each one, run in a fresh window, finds architectural *absences* the completeness audit structurally cannot (Memory Hydration was found this way). Run one question per fresh window — one-question focus is what produces the depth; bundling several into one window dilutes it.
Domain-specific phrases are marked `{like this}` as fill-in slots so the set is portable across the whole spec suite (DOC72, DOC24, etc.). Swap three tokens and the kit works for any spec.
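**Illustrative sketch (not part of the library).** The slot convention can be applied mechanically. `fillSlots` is a hypothetical helper; it treats the phrase inside each `{...}` as the slot key and, as an assumed default, leaves the original phrase in place (without braces) when no replacement is supplied.

```typescript
// Replaces each `{phrase}` with its mapped value; unmapped slots fall back to
// the phrase itself so the question still reads as a sentence.
function fillSlots(template: string, replacements: ReadonlyMap<string, string>): string {
  return template.replace(
    /\{([^{}]+)\}/g,
    (_whole: string, phrase: string) => replacements.get(phrase) ?? phrase,
  );
}
```

For instance, filling `"Read {the SPEC} as {opposing counsel}."` with a map containing only `"the SPEC" → "DOC72"` gives `"Read DOC72 as opposing counsel."`.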
**Set A — purpose / reliance**
- **A1 (reliance).** "You are reviewing `{the SPEC}`. Answer one question in depth: can `{a securities litigator}` rely on a work product this system produces without re-checking the underlying sources and reasoning themselves? If yes, walk through exactly what they'd rely on and why it's sufficient. If no, identify precisely what's missing before reliance is justified. Don't list general bugs — answer the reliance question and let the gaps fall out of it."
- **A2 (inventability).** "Reading `{the SPEC}` as the AI coding agent that must build it: where would you be forced to invent behavior the spec doesn't determine? List each place you'd have to choose between unstated alternatives, and state exactly what specification would remove the guess."
- **A3 (determinism / replayability).** "For `{the SPEC}`: if you had to reproduce a past run exactly — same inputs, same recorded state — what would and wouldn't reproduce deterministically? Inventory every source of nondeterminism (model versions, hashes, ordering, timing) and say where the spec is silent on pinning it."
- **A4 (degradation honesty).** "Reviewing `{the SPEC}`: when this system can't do something — a source is unavailable, a sub-agent is down, a budget is exhausted, a required contract is missing — does it tell the user the truth about what it couldn't do, or can it return something that looks complete while being silently degraded? Find every place a partial or degraded result could be presented as a clean one, and say what's missing that would force the degradation to surface."
**Set B — persona shifts**
- **B1 (opposing-counsel adversarial audit).** "Read `{the SPEC}` as `{opposing counsel}` trying to discredit a work product it produced. What about how this system operates could you attack — gaps in provenance, unverifiable claims, decisions with no recorded rationale? Everything you could attack is a finding."
- **B2 (inheriting operator).** "Read `{the SPEC}` as a colleague handed someone else's `{task run}` that's halfway finished — the original person is unavailable, and you must understand where it is and carry it forward. What can and can't you determine from the system's recorded state: what's been done, what's pending, what decisions were made and why, what's safe to touch? Everything you can't determine is the finding."
- **B3 (partner review).** "Read `{the SPEC}` as the `{supervising partner}` who must sign off. What would you need to see to take responsibility for this output, and what does the system fail to give you?"
- **B4 (scale).** "Read `{the SPEC}` as the user running this across `{forty active legal matters}` at once, each with its own `{tasks, sources, forums, findings}`. What breaks or degrades at that volume that works fine for one? Be specific about where attention becomes unmanageable, where `{cross-matter}` isolation could fail, where shared resources contend, and where cost or latency stops being acceptable. Name the mechanisms that quietly assume a small number of concurrent `{matters}` without saying so."
**Set C — scenario / probe**
- **C1 ("what happens when").** "For `{the SPEC}`, answer 'what happens when…' for each, and where the spec doesn't determine the answer, say so: (a) `{two modules try to satisfy the same need at once}`; (b) `{a privilege/permission level changes mid-run}`; (c) `{an LLM returns malformed output on a load-bearing call}`; (d) `{storage fills during a durable write}`; (e) `{the user cancels mid-operation after an irreversible side effect}`."
- **C2 (staleness / temporal).** "Walk `{the SPEC}` and answer one question: what goes stale, and what happens when it does? Inventory everything with a freshness dependency — `{caches, context packets, snapshots, resolved decisions, applied patterns, hash preconditions}`. For each, state how the system knows it's stale, what invalidates it, and what happens if a consumer uses it after it's gone stale. Where the spec is silent on invalidation, that silence is the finding."
- **C3 (reversibility / irreversible actions).** "For `{the SPEC}`, answer one question: what are all the irreversible actions this system can take — anything that leaves the system, mutates shared state others depend on, or can't be cleanly undone? For each, describe what protects the user from triggering it by mistake, and whether they can recover if it was wrong. The ones with no guard and no recovery path are the findings."
- **C4 (load-bearing assumptions).** "Name the load-bearing assumptions in `{the SPEC}` that are never stated as constraints — things the design quietly depends on that aren't written down anywhere as requirements. For each, describe what breaks if the assumption is false, and whether the spec gives any signal that the assumption exists."
### Reviewer-assignment grid (for Addenda B, current round)
Assignments are soft: which reviewer gets which question matters less than answering each question in a fresh window. The low-friction way to get a fresh window without re-attaching documents is a **new chat inside the project** (docs already attached via project knowledge; no prior-conversation anchoring). Reusing an old window anchors the reviewer to its earlier completeness-audit conclusions and partly defeats the purpose-audit framing. Run Round 1 first; run Round 2 only if Round 1 pays off.
| Round | Question | Reviewer | Hunts for |
|---|---|---|---|
| 1 | C2 staleness | Grok | invalidation gaps, temporal coupling |
| 1 | C3 reversibility | ChatGPT | unguarded irreversible side effects |
| 1 | A4 degradation honesty | Gemini | silent-degradation paths |
| 1 | B4 scale (40 matters) | Claude | cross-matter isolation, contention |
| 2 | A1 reliance | Gemini | architectural absences before reliance |
| 2 | C1 scenarios | Grok | concrete undetermined failure modes |
| 2 | B2 inheriting operator | Claude | continuity / mid-run legibility |
| 2 | C4 load-bearing assumptions | ChatGPT | unstated requirements |
Dropped from this round as overlapping the completeness audit: A2, A3, B1, B3.
---
*End of consolidated review. Next inputs expected: external reviewer responses to the Round 1 / Round 2 purpose questions, to be folded into §3 (new findings), §6.1 (cross-reviewer catches), and §6.3 (additions to adopt) — with any genuinely not-worth-it proposals added to §6.2 with their reason.*