DOC23 Add B TEST Prompts RT 1 (5.28).md

Active Working and Red Team/DOC23 Working/DOC23 Red Teaming/DOC23 Add B TEST Prompts RT 1 (5.28).md

Short text page 2bfe04ab9330. Generated 2026-06-09T01:23:58.539Z from commit dbaa25962edc11ab30e8d4ca1715f9ae5bf77331. Worktree: clean.

ELNOR REPO READER TEXT MIRROR
Original path: Active Working and Red Team/DOC23 Working/DOC23 Red Teaming/DOC23 Add B TEST Prompts RT 1 (5.28).md
Source repo: /Users/OpenClaw1/Elnor/Elnor Specs
Git branch: main
Git commit: dbaa25962edc11ab30e8d4ca1715f9ae5bf77331
Generated: 2026-06-09T01:23:58.539Z

---

Will's Ideas Sparked by this Round:

These are things I noticed that I'd like addressed or fleshed out. This list is not meant to exclude any other ideas or set any priorities; these are just some things I thought needed to be addressed:

1. The Evaluator needs to tie its findings to identified sources and evidence where possible. I need to be able to see what led to the decision that something was true, accurate, properly cited, etc., by seeing, in an easy-to-understand and readily available manner, the claim/statement/etc. the module checked and the basis for the decision that it passed or did not pass. This isn't always applicable, but in cases of truth, accuracy, completeness, proper citation, legal research, calculations, and even things like "did this brief match the formatting and style of X" I'll have to be able to see the basis for the decision and the proof. This will help the user, the revisor, and future loops of the same process. It's also important to show not just when the module has proof, but also the basis and sources for negative proof (disproving something) and the sources checked when there was an absence of proof. This also acts as an audit of sorts to make sure everything was properly checked. Gemini touches on this and some similar ideas. Need to consider how exactly this works and the telemetry for it. ChatGPT noted something about an "evidence package" too in the other separately documented round of red teaming on this version of Addenda B. Others may have too.

# Claude 1

I have the DOC23 Addenda B core files in the project already, but you asked me to pull the set from GitHub, and there may be family-topology split files there that aren't in the project. Let me check both: read what's local and verify against the repo via Composio.

Let me check the GitHub repo via Composio for any additional Addenda B family-topology split files not in the project.

The Composio GitHub call needs your approval on your end (an auth/approval gate fired). Rather than block on that, I have the authoritative DOC23 Addenda B set right here in the project: the Core, the Outcome Evaluator/Revisor, and the Evaluation Common Contracts. Those are large, so let me map their structure before drilling in.

Now I have the full map. Let me read the sections most load-bearing for the 40-matter question, starting with the context isolation invariant (§13A), which is where cross-matter contamination would live.

Strong surface-isolation model. Now the memory layer, where 40 matters' worth of run history accumulates.

Now the attention surfaces: ambient task awareness, invocation suggestions, and rate limits. This is where 40 matters' worth of signals compete for the user's eyes.

Now the concurrency controls: sub-agent limits, rate limits, token budgets, and portfolio-level review.

Let me check the other two Addenda B members (the Outcome Evaluator/Revisor and the Evaluation Common Contracts) for the nightly judge/eval batch behavior, which is a cost driver at volume.

Let me grep across the Addenda B set for explicit concurrency, fairness, and budget ceilings, to confirm what is and isn't specified for many simultaneous matters.

The parallelism model is the smoking gun. Let me confirm how the parallelism ceiling is computed and what it's bounded against.

The §17 sub-agent coordination section claims to unify all sub-agent rules; let me check whether it introduces any cross-task arbiter.

I read the three operative Addenda B members (Core R0.7.1, Outcome Evaluator/Revisor V3.3.1, and Evaluation Common Contracts V1.1.1) plus the underlying Task System R3.1 and DOC72 R5.73 for the write/embedding substrate. (The Composio GitHub pull hit an approval gate on your side; the project copies are the same operative files, and DOC23 Addenda B is `Core + Outcome Evaluator/Revisor + Common Contracts`, so the set is complete.)

Here's the short version before the detail: the spec is genuinely good at the two isolation failures you'd expect to worry about first — surface-context bleed and learned-pattern privilege bleed are both well-defended. What it is *not* designed for is **forty matters contending for one laptop and one writer**. Almost every budget, queue, and parallelism control in the set is scoped per-run, per-plan, or per-invocation. There is no portfolio-level arbiter anywhere, and §2.2 actively *rejects* the kind of component that would be one. That is the through-line.

## Where attention becomes unmanageable

**[GAP] The portfolio review is unbounded.** §16A.2 `TaskPortfolioAssessment` accepts `task_scope: "all_tasks"` and returns `tasks_reviewed: string[]`, `findings: TaskPortfolioFinding[]`, plus eight more flat unbounded arrays (`low_value_or_duplicate_tasks`, `stale_templates`, etc.). There is no pagination, no per-matter chunking, no top-k cap, no severity floor. For one matter this is a tidy report. Across 40 matters it is a single assessment that both blows the Task Agent token budget on assembly and produces a findings list no human will read. "Elnor, look at my tasks and suggest improvements" (§16A.2's own example trigger) becomes the most expensive and least useful command in the system precisely when you most need it.
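
One way to picture a bound on the assessment is a severity floor plus a per-matter cap applied before findings are returned. The sketch below is only an illustration under stated assumptions: `PortfolioFinding`, `BoundedPage`, `boundFindings`, `severityFloor`, and `perMatterCap` are hypothetical names and do not appear in §16A.2.

```typescript
// Hypothetical shapes: none of these names appear in §16A.2.
type Severity = 1 | 2 | 3; // 3 = most severe

interface PortfolioFinding {
  matterId: string;
  severity: Severity;
  summary: string;
}

interface BoundedPage {
  findings: PortfolioFinding[];
  droppedBelowFloor: number;  // findings under the severity floor
  truncatedPerMatter: number; // findings cut by the per-matter cap
}

// Keep only findings at or above the floor, then at most `perMatterCap`
// per matter, taking the most severe first.
function boundFindings(
  all: PortfolioFinding[],
  severityFloor: Severity,
  perMatterCap: number
): BoundedPage {
  const kept = all.filter((f) => f.severity >= severityFloor);
  const sorted = [...kept].sort((a, b) => b.severity - a.severity);
  const perMatterCount: Record<string, number> = {};
  const findings: PortfolioFinding[] = [];
  let truncatedPerMatter = 0;
  for (const f of sorted) {
    const seen = perMatterCount[f.matterId] ?? 0;
    if (seen < perMatterCap) {
      findings.push(f);
      perMatterCount[f.matterId] = seen + 1;
    } else {
      truncatedPerMatter += 1;
    }
  }
  return {
    findings,
    droppedBelowFloor: all.length - kept.length,
    truncatedPerMatter,
  };
}
```

Reporting the dropped and truncated counts alongside the page keeps the cap visible to the reader, so a short list is never mistaken for a complete one.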

**[GAP] One global attention channel, no matter partitioning.** The ambient task-awareness card (§3B.4, 50–160 tokens) and the chip surface (§3C.8) are a single undifferentiated stream. Task opportunity is prefiltered by DOC24/EC (§3B.4 note 7), but nothing in the routing budgets attention *per matter*. At 40 matters with scheduled docket monitors and email-triggered tasks, chip/suggestion volume scales with matter count while the user's attention does not. There is no "you have 6 pending task suggestions across 4 matters" rollup — just N independent chip decisions.

**[GAP] Single global learning review queue.** The "Task Design Learning Review Queue" (§20.1, §24 line 9595) and the `TaskSystemImprovementProposal` lifecycle (§9C.3, all entering `candidate`) are one queue. Every matter's proposed template/preset/pattern updates land in the same place for the user to clear. At 40 matters this is a firehose with no per-matter view and no auto-triage by `support_count`.

**[GAP] Suggestion suppression is keyed on the wrong unit for multi-matter.** §9A.4 rate-limits and suppresses suggestions by `context_class_key` plus recent rejection. But `context_class_key` (§9A.1/§9A.2) is task-type/document-archetype shaped, not matter-shaped. So if you reject the "complaint evaluation" chip on three matters where it wasn't wanted, you suppress it across the other 37 where it would have helped — or, if you read the class as matter-specific, you get no cross-matter learning at all. The spec doesn't resolve which, and either reading is a problem at 40.
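
The ambiguity could be resolved by making the scope of a rejection explicit. The following is a minimal sketch, assuming a rejection suppresses the chip for the rejecting matter immediately and for the whole class only after enough distinct matters have rejected it. `Rejection`, `isSuppressed`, and `classWideThreshold` are hypothetical and not part of §9A.

```typescript
// Hypothetical sketch: these names are illustrative, not from §9A.
interface Rejection {
  contextClassKey: string;
  matterId: string;
}

// Suppress for a matter if that matter itself rejected the class;
// suppress class-wide only once `classWideThreshold` distinct matters
// have rejected it.
function isSuppressed(
  rejections: Rejection[],
  contextClassKey: string,
  matterId: string,
  classWideThreshold: number
): boolean {
  const rejectingMatters: string[] = [];
  for (const r of rejections) {
    if (
      r.contextClassKey === contextClassKey &&
      rejectingMatters.indexOf(r.matterId) === -1
    ) {
      rejectingMatters.push(r.matterId);
    }
  }
  if (rejectingMatters.indexOf(matterId) !== -1) return true;
  return rejectingMatters.length >= classWideThreshold;
}
```

With a threshold of 10, three rejections silence the chip on those three matters and leave it available on the other 37; a lower threshold trades that availability for faster cross-matter learning.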

## Where cross-matter isolation could fail

First, credit where due, because this is the half you defended well:

**[CONFIRMED] Surface-context bleed is genuinely closed.** §13A.2 ("no singular current-work-context"), §13A.3 (the isolation invariant), and the `TaskRunScopeEnvelope` defaults (§13A.7: background/triggered runs default `surface_context_policy = not_applicable`) mean an active Marex chat cannot leak into a scheduled task for a different matter. The R0.6.3 patch record (§27G/line 10344) shows this was deliberately hardened. Good.

**[CONFIRMED] Learned-pattern privilege bleed is closed at the pattern layer.** Outcome Evaluator §16.6.5 + §13.4 cross-matter retrieval + §13.1.2 identifying-content scan mean a litigation pattern doesn't surface in M&A work unless explicitly promoted, and `pattern_promotion_default_scope` defaults to matter-scoped. §16.6.6 forces every task to carry a matter (`validation.task_missing_matter_assignment`). For 40 matters this is the right firewall.

Now the actual exposure:

**[BUG] Entity→matter resolution at trigger time assumes near-1:1, and 40 securities matters are N:1.** This is the real isolation risk and it's a different mechanism than the one the spec hardened. §13A.3 condition 5 admits surface context when "DOC24 resolves a high-confidence entity/matter match from task-owned evidence," and §13A.8 `EmailTriggerScopeSummary` resolves scope from `sender_ref / recipient_refs / subject / body / thread`. Across 40 securities matters you have *shared* entities by construction: the same defendant banks, the same opposing-counsel firms, the same expert witnesses, the same SEC filing archetypes. An email-triggered docket-monitor task receiving a filing from a firm that appears in six of your matters has an ambiguous matter binding. The schema carries a `confidence: number` and `new_case_or_matter_candidate_refs`, but there is **no disambiguation protocol when one entity maps to N matters and no "ambiguous → hold for user / do not auto-bind" gate** at trigger time. The whole isolation invariant rests on the resolved scope being correct; the spec never specifies what happens when resolution is correctly *uncertain*. A wrong bind here is worse than surface bleed because it's silent and it crosses a privilege boundary.
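
The missing gate can be sketched as a decision over ranked candidates that refuses to auto-bind when the top match is weak or when a runner-up is too close. This is an illustration only: `MatterCandidate`, `BindDecision`, `decideMatterBinding`, and both thresholds are hypothetical, and §13A.8 defines none of them.

```typescript
// Hypothetical sketch: names and thresholds are illustrative and are
// not defined in §13A.8.
interface MatterCandidate {
  matterId: string;
  confidence: number; // 0..1, as produced by some upstream resolver
}

type BindDecision =
  | { kind: "bind"; matterId: string }
  | { kind: "hold_for_user"; reason: string; candidates: string[] };

function decideMatterBinding(
  candidates: MatterCandidate[],
  minConfidence = 0.9,
  minMargin = 0.2
): BindDecision {
  const ranked = [...candidates].sort((a, b) => b.confidence - a.confidence);
  const ids = ranked.map((c) => c.matterId);
  const top = ranked[0];
  if (top === undefined) {
    return { kind: "hold_for_user", reason: "no_candidate", candidates: ids };
  }
  if (top.confidence < minConfidence) {
    return { kind: "hold_for_user", reason: "low_confidence", candidates: ids };
  }
  // A shared entity (same bank, same opposing firm) shows up as a
  // runner-up whose confidence is close to the top candidate's.
  const runnerUp = ranked[1]?.confidence ?? 0;
  if (top.confidence - runnerUp < minMargin) {
    return {
      kind: "hold_for_user",
      reason: "ambiguous_matter_resolution",
      candidates: ids,
    };
  }
  return { kind: "bind", matterId: top.matterId };
}
```

The margin test is the part that addresses the N:1 case: two matters can both score above the confidence floor, and only the gap between them shows that the bind is ambiguous.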

**[GAP] Two different isolation units coexist.** Pattern promotion is matter-scoped (§16.6.5). Suggestion/invocation learning is `context_class_key`-scoped (§9A, not matter-aware). The same system therefore isolates learned *patterns* by matter but learns *invocation behavior* across matters. That inconsistency is invisible at one matter and bites at 40.

**[GAP] Privileged context commingled in global append logs.** §22.1 routes all context feedback to a single `ELNOR_MEMORY/system/task_feedback/context_feedback.jsonl` and all share audit to one `task_audit_events.jsonl`. For a securities litigator running 40 matters, context-feedback events from privileged matter A sit in the same physical file as matter B. Not an injection-time bleed, but for legal-hold, export, and privilege-log purposes (which §16.6 otherwise takes seriously) a single commingled stream is the wrong granularity. Per-matter or per-class sharding is cheap to add now and expensive to retrofit.

## Where shared resources contend

**[GAP — structural] EC sole writer is the funnel.** §2.4 (EC sole durable writer) plus the single-writer SQLite/WAL substrate means every durable write serializes: trace spans (`trace_spans.jsonl` per run), context packets (§22.1 writes one `TaskModuleContextPacket` *per module per activation_seq*), memory writes, learning signals, artifact indexes. Per-run path sharding helps read locality but does nothing for write contention — all 40 matters' writes pass through one writer. With concurrent active drafting + scheduled monitors + email triggers, the write queue is the bottleneck, and nothing in the set specifies write prioritization across matters. A deadline-critical filing task's writes queue behind a background docket-monitor's telemetry with equal standing.

**[GAP — structural] The parallelism ceiling is one laptop, divided by nobody.** Outcome Evaluator §11.22 bounds step parallelism by `max_parallel_steps_per_plan` (default 4) **and** `LocalHardwareContext.max_parallel_sub_agents_safe`, which §11.22.4 derives from a single M4 Pro's thermal and memory pressure and *degrades to sequential* under load. Each plan independently takes `min(ready_steps, max_parallel)` (§11.22.2). There is no component that divides the machine-wide sub-agent budget across the 40 plans that may be eligible at once. So either every plan reads the same machine-global number and they collectively oversubscribe, or they collide at OpenClaw's `maxConcurrent: 8` global cap — and that cap has no matter awareness. The §11.9.4 concurrency tie-breaker only orders plans that target *the same artifact*; it does not arbitrate between matters. The implicit assumption baked into §11.22.4 is "the scarce resource is one machine and a human running a handful of things," which is exactly the assumption 40 matters violates.
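
To make the oversubscription concrete: three plans that each independently take `min(ready_steps, max_parallel)` with four ready steps and a per-plan limit of 4 ask for 12 slots against a global cap of 8. A minimal sketch of a divider follows, assuming a round-robin grant with deadline-critical plans served first. `PlanDemand`, `allocateSlots`, and the `deadlineCritical` flag are hypothetical and not part of §11.22.

```typescript
// Hypothetical sketch: these names are illustrative, not from §11.22.
interface PlanDemand {
  planId: string;
  readySteps: number;
  deadlineCritical: boolean;
}

// Grants one slot at a time, deadline-critical plans first, until the
// machine-wide cap is spent or every plan has reached its own limit.
function allocateSlots(
  plans: PlanDemand[],
  machineCap: number,
  maxParallelPerPlan: number
): Record<string, number> {
  const grants: Record<string, number> = {};
  for (const p of plans) grants[p.planId] = 0;
  const ordered = [...plans].sort(
    (a, b) => Number(b.deadlineCritical) - Number(a.deadlineCritical)
  );
  let remaining = machineCap;
  let progressed = true;
  while (remaining > 0 && progressed) {
    progressed = false;
    for (const p of ordered) {
      if (remaining === 0) break;
      const want = Math.min(p.readySteps, maxParallelPerPlan);
      const have = grants[p.planId] ?? 0;
      if (have < want) {
        grants[p.planId] = have + 1;
        remaining -= 1;
        progressed = true;
      }
    }
  }
  return grants;
}
```

Round-robin granting keeps any single plan from starving the rest, while the ordering lets a deadline-critical plan pick up a leftover slot before background work does.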

**[GAP] Embedding and model fan-in.** All 40 matters share one locked Qwen3-Embedding-0.6B MLX pipeline (one GPU) for extraction, one Gemini primary, and one set of Claude API credentials for high-stakes work. The nightly extraction lane (§10.8.2) is LLM-assisted and grows linearly with run volume against a fixed one-night, one-machine window. Nothing specs backpressure when the day's run volume exceeds what the night can absorb.

## Where cost and latency stop being acceptable

**[GAP] No portfolio-level cost ceiling exists; every budget is local.** §17.2 says sub-agents "honor per-invocation cost and timeout budgets." The Outcome Evaluator runs specialist subevaluators per lane (§17.1, §8.1) for each evaluation, and the re-extraction circuit breaker you carry is `$15/run` (DOC73 §14.7) — per run. At 40 matters running complaint-eval and red-team tasks, the multiplier is matters × tasks × lanes × judge/revise iterations, and there is no aggregate budget that says "the portfolio is spending too much this hour, throttle." A runaway loop on one matter is capped; forty simultaneous well-behaved matters are not.
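
An aggregate budget of the kind described could be as simple as a trailing-window sum across all matters, checked before each new call is admitted. The sketch below assumes spend is recorded as timestamped events. `SpendEvent`, `spentInWindow`, `shouldThrottle`, the window length, and the ceiling are all hypothetical; nothing like them exists in the set.

```typescript
// Hypothetical sketch: no such aggregate budget exists in the set.
interface SpendEvent {
  matterId: string;
  usd: number;
  atMs: number; // epoch milliseconds when the spend was recorded
}

// Total spend across ALL matters for events newer than the window cutoff.
function spentInWindow(
  events: SpendEvent[],
  nowMs: number,
  windowMs: number
): number {
  return events
    .filter((e) => nowMs - e.atMs < windowMs)
    .reduce((sum, e) => sum + e.usd, 0);
}

// True when admitting the next call would push the portfolio over its ceiling.
function shouldThrottle(
  events: SpendEvent[],
  nowMs: number,
  windowMs: number,
  ceilingUsd: number,
  nextEstimateUsd: number
): boolean {
  return spentInWindow(events, nowMs, windowMs) + nextEstimateUsd > ceilingUsd;
}
```

As an illustration with made-up numbers, 40 matters × 3 tasks × 4 lanes × 2 iterations is 960 evaluator calls, each of which can sit comfortably inside its own per-run cap while the sum does not.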

**[GAP] Nightly batch backlog → stale current views.** The two-lane model (§10.8) puts cheap deterministic writes immediate and heavy extraction/learning/compaction nightly. If the nightly lane can't drain in one idle window (likely at 40 matters on one machine), the backlog compounds: Task Activity Memory links (§10.10) lag, BDSM/DOC8 compiled bundles go stale, and the "current views" that delivery reads from drift behind reality. The spec has no batch-budget, no carry-over policy, no "extraction debt" surfacing. For one matter the night is always enough; at 40 it quietly stops being enough and the failure is invisible until retrieval feels wrong.
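
The compounding is easy to see with a toy model in which each night's backlog is the previous backlog plus the day's runs minus what one night can drain, floored at zero. The function below is a sketch under that assumption; `simulateExtractionDebt` and its inputs are hypothetical, and the numbers in the usage note are illustrative rather than measured.

```typescript
// Hypothetical toy model: backlog carried from one night to the next.
function simulateExtractionDebt(
  dailyRuns: number,
  nightlyCapacity: number,
  nights: number
): number[] {
  const debt: number[] = [];
  let backlog = 0;
  for (let n = 0; n < nights; n++) {
    // Whatever the night cannot drain is carried into the next day.
    backlog = Math.max(0, backlog + dailyRuns - nightlyCapacity);
    debt.push(backlog);
  }
  return debt;
}
```

With 120 runs per day against a nightly capacity of 100, the debt after three nights is 20, 40, 60; at 80 runs per day it stays at zero. Surfacing that number is the "extraction debt" signal the spec lacks.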

**[GAP] Context-packet write amplification.** Per §13A.9 + §22.1, each model-facing module activation produces and persists a `TaskModuleContextPacket` (plus prompt/capability/policy snapshots). At 40 matters × multi-module tasks × multiple activations, that's heavy small-file write I/O through the single EC writer — both a latency cost and a storage-growth cost with no compaction policy for packets specifically (§10.9 compaction covers run memory, not the per-activation packet/snapshot tree).

## The mechanisms that quietly assume small N

If you want the subtractive-pass list — the specific places that encode "few concurrent matters" without ever saying so:

- **§11.22.4 `LocalHardwareContext`** — assumes the bounded resource is one Apple Silicon machine; degrade-to-sequential is the *only* relief valve and there's no cross-matter scheduler above it.
- **§2.2 (rejects a central runtime orchestrator) + §2.4 (EC sole writer)** — together they guarantee there is no home for a matter-fairness arbiter, and one serialized writer for all matters' durable state.
- **§3B.4 single ambient card + §3C chip stream** — assume one attention channel is enough.
- **§16A.2 unbounded `TaskPortfolioAssessment` arrays** — assume "all tasks" fits in one pass.
- **§9A.4 `context_class_key` suppression** — assumes context classes don't fan out across many matters.
- **§10.8.2 nightly extraction lane** — assumes one idle window absorbs the day's run volume.
- **§13A.8 `EmailTriggerScopeSummary` / §13A.3 cond. 5 matter resolution** — assume entity→matter is roughly 1:1, not N:1.
- **§22.1 single global `context_feedback.jsonl` / `task_audit_events.jsonl`** — assume one append stream is an acceptable granularity for privileged work.
- **Per-invocation / per-run / per-plan budgets throughout (§17.2, §11.22, DOC73 §14.7)** — no portfolio-level budget anywhere.

The cleanest framing: the spec is correct that there's no *singular current context* (§13A.2), but it still assumes a *singular resource envelope* — one machine, one writer, one night, one attention surface — and never names that assumption the way it named and killed the current-context one. The fix isn't a central orchestrator (you were right to reject that); it's a thin **portfolio resource/attention governor** that owns three things §2.2 currently leaves homeless: a write/parallelism fairness policy keyed on matter + deadline class, an aggregate cost/throttle budget, and a per-matter rollup for attention and learning queues. I'd also add an explicit `ambiguous_matter_resolution → hold` gate to §13A.8 before anything else, because that one is a correctness bug, not just a scaling degradation.

Want me to write this up as a versioned red-team finding doc against the Addenda B set (typed findings, section anchors, proposed acceptance criteria) so it's routable to your other windows and ready to fold into OP-A? I held off creating a file since you asked for an answer, not a deliverable.

# Claude 2

Good, sharp frame. I've got the set mapped from before, but the handover question targets a different cluster of sections than the scaling one did: run state, checkpoints, gates, decision capture, and concurrent-edit safety. Let me read the ones I haven't yet, starting with what a colleague actually opens: the Run Inspector.

The inspector is strong on step status and context audit. Now the telemetry primitives and, critically for "why," the policy decision snapshot.

The policy "why" is captured as reason codes; substantive "why" looks thinner. Now the durable-execution contract, human gates/holds, and idempotency: the heart of "what's pending" and "what's safe to touch."

Now the mechanism that literally is "carry it forward": module continuation, mid-run interaction limits, and activation replay.

Strong replay/continuation model. Now two things the handover hinges on: whether the recorded state captures *why* (decisions and rationale) and *who/what's next*. Let me check the decision-card schemas and the Task Agent run lens.

The RunLens is a genuinely good "where is it" surface. Let me check three definitions that determine whether the *why* is recoverable: HumanGateSummary, the design-decision cards, and the Evaluator's HardCallResolution.

`HumanGateSummary` is referenced in the run lens and trace payload but I don't see it defined in Core. Let me check whether the base Task System spec defines a human-gate decision record with rationale, since §14.1 says gates predate Addenda B.

The base spec captures gate *outcomes* (approve/reject/revise + revision instructions). Now the Evaluator's runtime-decision and concurrent-edit machinery: the last piece of "why" and "safe to touch."

I read the handover-relevant cluster across the set: the Run Inspector and run-lens read models (Core §4.6, §12, §20C), durable-execution and gate contracts (§14), the continuation/replay machinery (§20E), the base R3.1 gate and checkpoint model, and the Outcome Evaluator/Revisor's decision-trace machinery (HardCallResolutionLedger, AssuranceBasis, RevisionExecutionRecord, GraphStateRollback).

The one-line answer: **you can fully reconstruct what mechanically happened and where the graph is, but you cannot reliably reconstruct why the human and the ordinary agent modules chose what they chose, who was driving, or what they intended to do next.** The set was designed for the person who ran the task to read it back later — not for a handover between two people. Detail by your four axes.

## What's been done — mostly determinable

This is the strongest area. `TaskRunStepRow` (§20C.3) gives you per-activation status, in/out summaries, artifacts, deliveries, cost, model/fallback, and it's **activation-seq aware**, so loops, branches, and fan-out reconstruct correctly rather than collapsing into a fake linear timeline. `ModuleActivationReplayRecord` (§20E.4) captures the full per-activation context — graph snapshot, config, input/output bundles, checkpoint, prompt, packet, capability, policy, source-document *versions* — so each completed module is deterministically reconstructable and re-runnable. `TaskDeliverablesReadModel` (§20C.5) separates final/intermediate/undelivered/failed outputs and external delivery receipts, so you know what already left the building. DOC72 Task Activity Memory (§10.10) links the run to its matter/entities/work products *if memory was enabled*.

The gap here is narrow: **[GAP]** intermediate reasoning inside a completed module is only as recoverable as its `output_ref` and `summary` (§12.2) — a freeform string. You can see *what* each step produced, not the working it discarded.

## What's pending — determinable

`TaskAgentRunLens` (§4.6) is genuinely good as a "where is this" surface: `current_status`, `current_module_states`, `graph_progress_summary`, `active_path_summary`, `wait_reasons`, `human_gate_status`, `active_subagents`. `TaskRunNarrativeStatus` (§20C.4) gives current/next/waiting in prose. Critically, §14.3 forbids hidden waits — "must not create hidden waits" — so you can trust that every pending gate or hold is graph-visible. `continuation_status` (§20E.1) tells you honestly whether the live session can be resumed.

**[GAP]** But "next" here is **graph-derived, not operator-derived**. `next_work_summary` and `active_path_summary` tell you what the *graph* will execute next, not what the *person* intended — e.g., "I was about to reject the red-team finding and rerun with a narrower scope" is nowhere unless they encoded it. More on that under intent below.

## What decisions were made and why — this is where it splits sharply

There are three distinct decision classes, and they're captured very unevenly:

**Policy/system decisions: fully captured. [CONFIRMED]** `PolicyDecisionSnapshot` (§12.5) records every capability mount, external send, file write, memory write, subagent spawn, model selection with `decision: allowed|blocked|requires_approval|degraded` plus `reason_codes`. So you can always answer "why was PACER denied to module 3" or "why was that email blocked."

**Outcome Evaluator/Revisor decisions: richly captured. [CONFIRMED]** Where a run uses the Evaluator/Revisor pipeline, the "why" is unusually well preserved: `AssuranceBasis` ("why the verdict is trustworthy") vs `EvaluationLimitationKind` ("why no trustworthy verdict exists"), `FailureKind` per finding (source_gap, reasoning_error, etc.), the `HardCallResolutionLedger` (§7.9.2) with full compatibility binding so you know under what conditions a hard call still holds, and `still_failing_same_reason` vs `failing_for_new_reason` progress signals (§5.5.4) that explain *why a revision loop continued or escalated*. This is the model for what good handover capture looks like.

**Ordinary agent-module decisions: not captured. [GAP]** This is the asymmetry that becomes the finding. For a `step.agent_task` complaint drafter, a `step.red_team` reviewer, a generic LLM module — you get the prompt (`EffectivePromptSnapshot`, §12.3), the output (`output_ref`), and the policy decisions, but **no structured rationale linking input to output**. Why the drafter chose one argument structure over another exists only insofar as the model happened to narrate it in its own output. The Evaluator pipeline got purpose-built rationale capture; the ordinary modules — which is what most of a halfway complaint-draft run actually consists of — did not.

**Human-gate decisions: outcome yes, rationale and identity no. [GAP]** §14.2 asserts `pending_human_signal_policy.decision_log_required: true`, but **no decision-log schema is defined anywhere in the Addenda B set**, and `HumanGateSummary` is referenced (§4.6 line 1228, §10.5 line 4881) but never defined in Core. The base R3.1 Human Review module captures the *outcome* — which of `approved_out`/`rejected_out`/`revision_out` fired, and on a revision the reviewer's bundled instructions — but an *approval* may carry no reason, there is no captured reviewer identity, and there's no record of what standard the reviewer applied or what they were shown versus what they weighed. So you can see that gate 2 was approved; you can't see *why*, or by whom, or against what bar.

**Design-time "why" exists but answers a different question. [CONFIRMED-but-orthogonal]** `TaskModuleDesignRationaleCard`/`TaskConfigRationaleCard`/`TaskDesignDecisionCard` (§8.10–§8.12) and `TaskDesignDecision` (§9.7) capture why the *task was built* this way. They do not capture why *this run* went the way it did. For a handover, that's useful background and the wrong layer for "where is this run."

## What's safe to touch — determinable but conditionally

Several real protections: completed parent runs are immutable (§20E.1, §20E.3) — continuation/fork creates child runs with lineage, never mutates the parent. Side-effect replay on fork **defaults to `block_all_external_side_effects` or `draft_only`** (§20E.3), so re-running won't silently re-send. Crash recovery promises `never_repeat_committed_side_effects` (§14.2). The Revisor Dispatcher verifies "no concurrent plans target the artifact" before mutating (Evaluator §11.21) and can `GraphStateRollback` on regression (§11.13). Mid-run, mutation is "not presumed" (§20E.2): you may observe, pause-after-module, cancel, or leave a downstream note, but you cannot inject into a live LLM call unless the runtime supports it.

But the coverage is conditional in two ways that matter for handover:

**[GAP]** Idempotency is `should`, not `must`: §14.4 says side-effecting ops "should carry an idempotency key and policy snapshot **where supported**." So whether a specific email/file-write/PACER-download is safe to re-trigger is determinable *only if the key was actually recorded* — not guaranteed by the contract.
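
The conditionality can be shown as a three-way classification: a recorded key makes the answer computable, and a missing key leaves it unknown. This is a minimal sketch, assuming each side-effecting operation leaves a record with an optional key and a commit flag. `SideEffectRecord`, `RetriggerVerdict`, and `classifyRetrigger` are hypothetical names, not from §14.4.

```typescript
// Hypothetical sketch: these names are illustrative, not from §14.4.
interface SideEffectRecord {
  opId: string;
  idempotencyKey?: string; // absent when the runtime did not record one
  committed: boolean;
}

type RetriggerVerdict =
  | "skip_already_committed"
  | "safe_to_run"
  | "unknown_needs_human";

function classifyRetrigger(record: SideEffectRecord): RetriggerVerdict {
  // Without a key there is no way to tell a retry from a duplicate send.
  if (record.idempotencyKey === undefined) return "unknown_needs_human";
  // With a key, a committed operation can be skipped on replay.
  if (record.committed) return "skip_already_committed";
  // With a key and no commit, a re-run can be deduplicated downstream.
  return "safe_to_run";
}
```

Under a `should` contract the first branch is reachable for any operation; making the key a MUST would remove it.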

**[GAP]** The robust concurrent-edit guard lives **inside the Revisor Dispatcher** (§11.21). For an ordinary run, there's no clearly specified artifact lock/lease preventing you from editing a source document or the task graph while a module mid-run still depends on it. §20E.2 says mutation is "not presumed," which is a posture, not a lock. So "is it safe to edit this input right now" isn't reliably computable from recorded state.

**[GAP — operational]** Even when durable state is intact, the **live reasoning session is probably gone**: continuation depends on OpenClaw reporting the session resumable (§20E.1), and OpenClaw archives sessions after ~60 minutes (`archiveAfterMinutes: 60`, carryover §2.2). If the original person stepped away, you resume from the last checkpoint *boundary* (§14.2: module / llm-call / tool-call boundary), losing whatever intra-module progress existed. The system tells you this honestly via `continuation_status = archived|expired` — which is the right behavior — but "pick up mid-thought where they left off" is usually not on the table.

## The findings, consolidated

Everything you can't determine clusters into one shape: the recorded state captures **mechanism and graph position** comprehensively and **intent and rationale** poorly.

1. **[GAP] No structured "why" for ordinary modules** — the Evaluator/Revisor pipeline has rich rationale capture (AssuranceBasis, HardCallResolutionLedger, FailureKind); every other module type does not. Most of a real run is "every other module type."
2. **[GAP] Human-gate rationale and decider are unschematized** — `decision_log_required: true` is asserted but undefined; `HumanGateSummary` is referenced but undefined; only the gate *outcome* is durable.
3. **[GAP] No operator identity, ownership, or intent on a run** — `created_by` is a category, not a person; "next" is graph-derived, not "what the person planned"; the only intent channel is the optional `user_note_for_downstream` (§20E.2). Multi-actor ownership is explicitly deferred to DOC50/§20H, which means the *handover scenario itself* is out of scope of the current operative set.
4. **[GAP] "Safe to touch" is conditional** — idempotency is `should/where-supported`; the concurrent-edit lock is Revisor-Dispatcher-local; live session continuation is time-limited.
5. **[CONFIRMED, and the bright spot] Honest degradation** — §4.7 guarantees the system tells you *which* trace surfaces are incomplete rather than bluffing. The limit is that it surfaces the existence of a hole, never its missing contents.

The underlying assumption the spec never states: **continuity is modeled for one operator across time, not for two operators across a handoff.** The mechanical record is for *the system* to resume deterministically (checkpoints, replay records, idempotency); the rationale that a successor human would need was assumed to live in the head of the person who ran it — who is, by your hypothesis, exactly the person who isn't there.

If you want concrete fixes to name in a finding: a required `HumanGateDecisionRecord` (decider, what-was-shown vs what-was-weighed, outcome, rationale, standard applied) to actually discharge §14.2's `decision_log_required`; a first-class `RunOperatorContext`/handoff-note primitive on the run (owner, intent-for-next, open questions) pulled forward from the deferred DOC50 work; a lightweight `ModuleDecisionRationale` capture generalizing the Evaluator's AssuranceBasis pattern to ordinary agent modules; and making side-effect idempotency keys a MUST with a recorded dependency set so "safe to touch" becomes computable rather than conditional.
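
As a rough illustration of the first of those fixes, the record and a completeness check might look like the sketch below. Every field name here is an assumption made for illustration; only the outcome values `approved_out`, `rejected_out`, and `revision_out` come from the base R3.1 Human Review module described above.

```typescript
// Hypothetical sketch: field names are illustrative and are not taken
// from the Addenda B set.
type GateOutcome = "approved_out" | "rejected_out" | "revision_out";

interface HumanGateDecisionRecord {
  gateId: string;
  deciderRef: string;      // who decided
  outcome: GateOutcome;
  rationale: string;       // required even for approvals
  standardApplied: string; // the bar the reviewer applied
  shownRefs: string[];     // what the reviewer was shown
  weighedRefs: string[];   // what the reviewer says they relied on
}

// Returns a list of problems; an empty list means the record is complete.
function validateGateDecision(r: HumanGateDecisionRecord): string[] {
  const problems: string[] = [];
  if (r.deciderRef.trim() === "") problems.push("missing_decider");
  if (r.rationale.trim() === "") problems.push("missing_rationale");
  if (r.standardApplied.trim() === "") problems.push("missing_standard");
  for (const ref of r.weighedRefs) {
    if (r.shownRefs.indexOf(ref) === -1) {
      problems.push("weighed_but_not_shown:" + ref);
    }
  }
  return problems;
}
```

Requiring a rationale on approvals as well as rejections is the design choice that matters for handover, since a bare approval is the case a successor cannot reconstruct.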

Want me to write this up as a versioned red-team finding against the Addenda B set — typed findings, section anchors, the four proposed schema fixes as paste-ready TypeScript, and acceptance criteria — so it's routable to your other windows and ready for OP-A? Same offer as last time; I held off on a file since you asked for the analysis, not a deliverable.

# Grok 1a

**Staleness Inventory — DOC23 Addenda B Family (Complete Walk)**

I walked every document in the current Addenda B set (Core R0.7.1, Outcome Evaluator+Revisor V3.3.1, Evaluation Common Contracts V1.1.1, Source Workspace V1.0.1, Task Forum + Run Board V1.0.1, Feedback Delivery V1.0.1) solely for freshness dependencies. Below is the exhaustive inventory of every artifact, record, or mechanism that carries a freshness or validity lifetime.

For each item I state:
- **What it is**
- **How the system detects staleness** (or the mechanism the spec says should exist)
- **What invalidates it**
- **What happens if a consumer uses it after it is stale** (or the finding if the spec is silent)

---

### 1. EvaluationSnapshot (Evaluator+Revisor V3.3.1 §5.16 + Common Contracts V1.1.1 §3)

**What it is**
Immutable snapshot of artifact content, source workspace state, and graph topology at the moment an evaluation or revision plan was compiled.

**How the system knows it is stale**
`evaluation_snapshot_ref` is stored in every `EvaluationResultEnvelope`. Consumers are supposed to compare the snapshot’s content hashes against the current artifact version.

**What invalidates it**
Any material change to the target artifact (content hash changes), source workspace snapshot, or graph topology after the snapshot was taken.

**What happens if used after stale**
Spec is **silent**. No validation rule, no precondition check, no error code, no automatic re-snapshot, no `stale_snapshot` state transition. A consumer (Revisor, downstream evaluator, audit view) can happily use an outdated snapshot, leading to false-positive verdicts, incorrect revision plans, or lost audit integrity. This is the single largest silent staleness gap in the family.
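
A consumer-side precondition of the kind the spec omits could compare the snapshot's hashes with current state and refuse to proceed on any mismatch. The sketch below assumes three separately hashed components matching the invalidation list above. `SnapshotHashes`, `staleComponents`, `assertSnapshotFresh`, and the `stale_snapshot` error text are hypothetical and are not defined in §5.16 or Common Contracts §3.

```typescript
// Hypothetical sketch: these names are illustrative, not from the spec.
interface SnapshotHashes {
  artifactHash: string;
  sourceWorkspaceHash: string;
  graphTopologyHash: string;
}

type StaleComponent = keyof SnapshotHashes;

// Lists which parts of the snapshot no longer match current state.
function staleComponents(
  snapshot: SnapshotHashes,
  current: SnapshotHashes
): StaleComponent[] {
  const keys: StaleComponent[] = [
    "artifactHash",
    "sourceWorkspaceHash",
    "graphTopologyHash",
  ];
  return keys.filter((k) => snapshot[k] !== current[k]);
}

// Throws instead of letting a consumer proceed on an outdated snapshot.
function assertSnapshotFresh(
  snapshot: SnapshotHashes,
  current: SnapshotHashes
): void {
  const stale = staleComponents(snapshot, current);
  if (stale.length > 0) {
    throw new Error("stale_snapshot: " + stale.join(", "));
  }
}
```

Naming the mismatched component in the error gives the caller enough to decide between a re-snapshot and a full re-plan.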

---

### 2. ArtifactMutationPrecondition / hash preconditions (Evaluator+Revisor V3.3.1 §7.13)

**What it is**
Per-step precondition in a RevisionPlan that records expected base version + content hash before a revision step is allowed to run.

**How the system knows it is stale**
Revision Dispatcher is supposed to check the precondition before dispatch.

**What invalidates it**
Any change to the target artifact (or its version) between plan compilation and dispatch.

**What happens if used after stale**
Spec is **silent on failure behavior**. The precondition is mentioned, but no error code, fallback, re-plan trigger, or escalation is defined. A consumer (dispatcher or module) could proceed on a stale hash, causing lost updates or silent corruption.
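
To make the missing failure behavior concrete, here is a minimal sketch of a dispatch-time check with an explicit outcome. The `DispatchDecision` enum, the `REPLAN` outcome, and the record shapes are assumptions for illustration; the spec only says the precondition records an expected base version and content hash.

```python
from dataclasses import dataclass
from enum import Enum


class DispatchDecision(Enum):
    DISPATCH = "dispatch"
    REPLAN = "replan"  # hypothetical outcome; the spec names no fallback


@dataclass(frozen=True)
class ArtifactMutationPrecondition:
    artifact_id: str
    expected_version: int
    expected_hash: str


@dataclass(frozen=True)
class ArtifactState:
    version: int
    content_hash: str


def check_precondition(pre, live):
    """Compare the plan-time expectation with the live artifact state."""
    current = live.get(pre.artifact_id)
    if current is None:
        # The target disappeared between plan compilation and dispatch.
        return DispatchDecision.REPLAN
    expected = (pre.expected_version, pre.expected_hash)
    if (current.version, current.content_hash) != expected:
        return DispatchDecision.REPLAN
    return DispatchDecision.DISPATCH
```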

---

### 3. SourceFreshnessRecord & SourceVerificationResult (Source Workspace V1.0.1 §2.3)

**What it is**
Explicit records tracking `freshness_state` ("fresh" | "stale" | "needs_update" | "unknown") and verification results for every SourceRecord.

**How the system knows it is stale**
`freshness_state` field + `checked_at` timestamp + `next_check_due`. Source Research module and workspace UI are supposed to consult these.

**What invalidates it**
Time-based expiration, upstream source change (e.g., new case citation, new SEC filing), or manual “mark stale” action.

**What happens if used after stale**
Spec is **partially specified but incomplete**. The UI shows warnings, but no normative rule forces downstream consumers (Evaluator, Revisor, drafting modules) to treat a stale source as invalid. A module can still read and cite a stale source without triggering re-verification or blocking. The spec is silent on enforcement in the evaluation/revision path.
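
A normative gate for downstream consumers might be as small as the sketch below. The four state values come from the record described above; the rule that an overdue `next_check_due` downgrades `fresh` to `needs_update`, and the function names, are assumptions rather than anything the spec states.

```python
from datetime import datetime, timezone
from typing import Optional


def effective_freshness(state: str, next_check_due: Optional[datetime], now: datetime) -> str:
    """Downgrade a recorded 'fresh' state once its next check is overdue."""
    if state == "fresh" and next_check_due is not None and now >= next_check_due:
        return "needs_update"  # assumed downgrade rule, not from the spec
    return state


def may_cite(state: str, next_check_due: Optional[datetime], now: datetime) -> bool:
    """Only sources that are still effectively fresh may be cited."""
    return effective_freshness(state, next_check_due, now) == "fresh"


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
overdue = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(may_cite("fresh", overdue, now))  # False
```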

---

### 4. TaskRunContextPacket & BoardDigest (Task Forum + Run Board V1.0.1 §6)

**What it is**
DOC24-assembled context packet (and its contained BoardDigest) that downstream modules receive.

**How the system knows it is stale**
`generated_at` timestamp + `refresh_policy` ("before_each_consumer_read" | "periodic" | "on_change").

**What invalidates it**
Any change to the underlying board posts, open research needs, unresolved repairs, run guidance, or source workspace state after generation.

**What happens if used after stale**
Spec is **silent**. No validation that a module’s received packet is still current. A module can act on stale guidance or stale open needs. The `refresh_policy` is declarative but has no enforcement contract in DOC15/CIL prompt assembly.
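
An enforcement contract for `refresh_policy` could be checked at read time along these lines. This is a sketch only: the three policy values are taken from the spec as quoted above, while `PacketMeta`, `max_age`, and `last_change_at` are hypothetical inputs that the spec does not define.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PacketMeta:
    generated_at: datetime
    refresh_policy: str


def needs_refresh(meta: PacketMeta, now: datetime,
                  last_change_at: Optional[datetime], max_age: timedelta) -> bool:
    """Decide whether a consumer must rebuild the packet before reading it."""
    if meta.refresh_policy == "before_each_consumer_read":
        return True
    if meta.refresh_policy == "periodic":
        # max_age is an assumed interval; the spec gives no period.
        return now - meta.generated_at >= max_age
    if meta.refresh_policy == "on_change":
        return last_change_at is not None and last_change_at > meta.generated_at
    raise ValueError(f"unknown refresh_policy: {meta.refresh_policy!r}")
```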

---

### 5. RunGuidanceItem (Feedback Delivery V1.0.1 §4)

**What it is**
Run-scoped guidance derived from findings (e.g., “do not cite source X”, “must include Y”).

**What invalidates it**
Supersession by newer finding, expiration timestamp, or user/tool contest.

**What happens if used after stale**
DOC15/CIL is supposed to filter by lifecycle state (§4.4 anti-drift rule), but the rule is only “MUST NOT receive superseded or expired guidance” — no error code, no fallback, no notification if stale guidance slips through. Enforcement is underspecified.
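
The anti-drift rule could be made observable with a filter that returns rejections alongside the deliverable items, as in this sketch. The `superseded` and `expired` states come from the rule quoted above; the `RunGuidanceItem` field names and the `active` state used in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

BLOCKED_STATES = {"superseded", "expired"}


@dataclass(frozen=True)
class RunGuidanceItem:
    item_id: str
    lifecycle_state: str
    expires_at: Optional[datetime] = None


def filter_guidance(items, now):
    """Split items into deliverable ones and (item_id, reason) rejections."""
    deliverable, rejected = [], []
    for item in items:
        if item.lifecycle_state in BLOCKED_STATES:
            rejected.append((item.item_id, item.lifecycle_state))
        elif item.expires_at is not None and item.expires_at <= now:
            # The timestamp has passed even though the state was never updated.
            rejected.append((item.item_id, "expired"))
        else:
            deliverable.append(item)
    return deliverable, rejected
```

Returning the rejection list gives a caller something to log or surface, which is the piece the current rule leaves out.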

---

### 6. Resolved Hard Calls / HardRevisionCall (Evaluator+Revisor V3.3.1 §7.9)

**What it is**
A Hard Call that has been resolved by the user (or escalated).

**How the system knows it is stale**
Spec never defines a freshness model for resolved Hard Calls.

**What invalidates it**
Any subsequent change to the artifact or context that the Hard Call was about.

**What happens if used after stale**
**Complete silence**. No mechanism tracks whether a resolved Hard Call is still valid. A Revisor could reuse an old resolution on changed material, leading to incorrect autonomous behavior.

---

### 7. Applied Patterns & PatternPerformanceSlice (Evaluator+Revisor V3.3.1 §13)

**What it is**
Learned patterns retrieved and applied by Outcome Compiler or Revision Compiler, plus their performance slices.

**How the system knows it is stale**
`template_compatibility_check` (§13.7) and `compatibility_constraints` (mentioned in the card but not fully wired in V3.3.1), along with `cross_model_applicability` and `model_class`.

**What invalidates it**
Capability drift, module version change, environment change, or regression detected in performance slice.

**What happens if used after stale**
Spec has partial handling: `template_compatibility_check` marks the template stale and falls back to fresh inference. However, the spec is silent on fallback behavior for patterns already applied mid-run. A stale pattern can still be used, producing incorrect evaluations or revisions.

---

### 8. Defeasible Findings & EvaluationFinding.lifecycle_state (Feedback Delivery V1.0.1 §3)

**What it is**
Findings with explicit lifecycle ("proposed" → "accepted" → "user_approved" → "tool_verified" → "contested" → "superseded" → "expired").

**How the system knows it is stale**
Lifecycle state + `superseded_by_finding_id` + `expires_at`.

**What invalidates it**
New evidence, user contest, tool verification, or time expiration.

**What happens if used after stale**
Spec says findings are defeasible (§1.3) and downstream consumers “treat them as defeasible,” but there is no mandatory filter or error when a stale finding is consumed by Revisor, drafting module, or pattern learning. Enforcement is advisory only.

---

### 9. EvaluationResultEnvelope.evaluation_snapshot_ref (Common Contracts V1.1.1 §3 + V3.3.1 §5.16)

**What it is**
Reference to the snapshot used for the evaluation result.

**How the system knows it is stale**
Hash comparison at re-evaluation time.

**What invalidates it**
Any change to the artifact after snapshot creation.

**What happens if used after stale**
**Silent**. No validation rule forces re-snapshot or blocks use of stale envelope. Revalidation cascade can operate on stale data.

---

### Summary of Staleness Gaps (the real finding)

The system has **many** freshness dependencies, but **almost none** define robust, enforceable invalidation rules and consumer behavior.

- Strongest areas: SourceFreshnessRecord and RunGuidanceItem lifecycle (explicit fields exist).
- Weakest areas: EvaluationSnapshot, ArtifactMutationPrecondition, Resolved Hard Calls, TaskRunContextPacket, and stale pattern reuse — these are referenced heavily but have either no detection mechanism or no defined consumer reaction.
- Most dangerous silence: When a consumer (Revisor, drafting module, pattern applicator, audit view) receives stale data, the spec does not say what must happen. In several cases it does not even say how staleness is detected.

This is the core staleness risk in the current Addenda B family: the system records that something *could* be stale, but it has no consistent, enforceable contract for what to do when it *is* stale. Consumers are left to guess or silently proceed with outdated information. This directly undermines reliability for high-stakes professional work.

# Grok 1b

**What happens when… — DOC23 Addenda B Family (explicit, spec-grounded answers)**

I walked the entire current Addenda B set (Core R0.7.1, Outcome Evaluator+Revisor V3.3.1, Common Contracts V1.1.1, Source Workspace V1.0.1, Task Forum + Run Board V1.0.1, Feedback Delivery V1.0.1) for each scenario. Where the spec defines behavior, I quote the relevant contract. Where it is silent, I say so explicitly.

### (a) Two source-research modules try to satisfy the same ResearchNeed at once

**Spec location:** Source Workspace V1.0.1 §6 (ResearchNeed queue), §6.3 (routing), §7.5 (Source Research module `research_need_in` port and operational behavior).

**What the spec says:**
ResearchNeed records have a `status` field (`open` | `in_progress` | `answered` | `unresolved` | `human_needed`) and an optional `routed_to_module_id`. The Source Research module consumes `research_need_in`. No locking, atomic claim, deduplication, or race-condition rule is defined for the `research_need_queue`.

**The spec is silent.**
There is no defined behavior for concurrent consumption of the same ResearchNeed record. The system does not specify whether the queue uses optimistic locking, a claim token, or any form of serialization. Two modules can both read the same open ResearchNeed, both attempt to satisfy it, and both emit ResearchAnswer records (or both update the same SourceRecord). No error, no deduplication, and no “first one wins” rule is stated. The outcome is undefined.
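
One possible "first one wins" rule is an atomic claim that flips `status` from `open` to `in_progress` under a lock. The sketch below is in-process only and purely illustrative: `ResearchNeedQueue`, `try_claim`, and `owner` are hypothetical names, and a real implementation would need the same compare-and-set semantics in whatever store backs the queue.

```python
import threading


class ResearchNeedQueue:
    """In-memory stand-in for the research_need_queue (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._needs = {}

    def add(self, need_id):
        with self._lock:
            self._needs[need_id] = {"status": "open", "routed_to_module_id": None}

    def try_claim(self, need_id, module_id):
        """Atomically move an open need to in_progress; False if already taken."""
        with self._lock:
            need = self._needs[need_id]
            if need["status"] != "open":
                return False
            need["status"] = "in_progress"
            need["routed_to_module_id"] = module_id
            return True

    def owner(self, need_id):
        with self._lock:
            return self._needs[need_id]["routed_to_module_id"]


queue = ResearchNeedQueue()
queue.add("need-1")
print(queue.try_claim("need-1", "research_a"))  # True
print(queue.try_claim("need-1", "research_b"))  # False
```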

### (b) A matter is reclassified from non-privileged to privileged mid-run

**Spec location:** Common Contracts V1.1.1 §3 (EvaluationResultEnvelope governance fields: `privileged`, `matter_id`, `data_class`), §5 (EvaluationLearningSignalEnvelope same fields); Source Workspace V1.0.1 §9.4 (privacy and matter firewall); Feedback Delivery V1.0.1 §3.4 and §F1a (taint model); Evaluator+Revisor V3.3.1 §15 (adversarial input boundary and PolicyDecision gate); Core R0.7.1 §24A (EC Core policy references).

**What the spec says:**
Envelopes and workspace records carry `privileged` and `matter_id`. EC Core is the sole durable writer and enforces policy at write time (PolicyDecision gate). Taint and privilege are checked at mutation boundaries and at export/audit.

**The spec is silent on mid-run reclassification.**
There is no defined behavior for a live run when a matter’s privilege level changes while modules are still executing, emitting envelopes, writing to the workspace, or sending emails. No mechanism is described for pausing the run, re-tainting existing artifacts/envelopes, or rolling back already-emitted side effects. The firewall rules apply at emission/write time; nothing addresses retroactive reclassification during an in-flight run.

### (c) An LLM returns malformed output on a load-bearing evaluation call

**Spec location:** Evaluator+Revisor V3.3.1 §9 (validation codes and ModuleAck), Common Contracts V1.1.1 §9 (envelope-level validations), §3.7 (Pattern C consumption), Feedback Delivery V1.0.1 §2.2 (bundle invariants and validation codes).

**What the spec says:**
The system defines explicit validation codes for malformed or inconsistent envelopes (`validation.feedback_bundle_unanchored`, `validation.feedback_bundle_snapshot_drift`, `validation.feedback_bundle_verdict_inconsistent`, `validation.envelope_producer_kind_mismatch`, etc.). An unhandled exception from a module’s `revision_in` is synthesized into `ModuleAck.status = error` (V3.3.1 §9, cross-referenced from earlier red-team fixes).

**The spec is partially silent on LLM-specific malformed output.**
If the LLM returns structurally invalid JSON or violates a slice schema inside an `EvaluationResultEnvelope`, the validation rules in Common Contracts §9 will catch it and fire the appropriate code. However, the spec does not define the *recovery* path for a load-bearing evaluation call (e.g., does the Evaluator retry with a different model, fall back to cheap mode, emit an indeterminate verdict, or escalate immediately?). The error is detected, but the downstream effect on the revision cycle or run continuation is not specified beyond the generic error ack.
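
A recovery path of the kind the spec leaves open could be sketched as bounded retries followed by an indeterminate verdict with escalation. Everything here is an assumption for illustration: the two-key schema, the `escalate` flag, the retry count, and the `call_model` callable all stand in for contracts the spec does not define.

```python
import json

REQUIRED_KEYS = {"verdict", "findings"}  # hypothetical minimal schema


def parse_evaluation(raw):
    """Return the parsed object, or None if it is malformed or incomplete."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def evaluate_with_recovery(call_model, max_attempts=2):
    """Retry malformed output, then fall back to an indeterminate verdict."""
    for attempt in range(1, max_attempts + 1):
        parsed = parse_evaluation(call_model(attempt))
        if parsed is not None:
            return parsed
    # Assumed fallback: never invent a pass/fail verdict from bad output.
    return {"verdict": "indeterminate", "findings": [], "escalate": True}
```

Passing the attempt number to `call_model` leaves room for a caller to switch models or modes on retry, which is one of the options the question above raises.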

### (d) Storage fills up during a durable write to the entity graph

**Spec location:** Core R0.7.1 (EC Core is sole durable writer, referenced throughout), Evaluator+Revisor V3.3.1 §11.6 (RevisionOperationReceipt extends PBEOperationReceiptLite), Common Contracts V1.1.1 §5 (signal envelope governance), EC Core Addendum A V3.3 (retention and storage policy, referenced but not reproduced here).

**What the spec says:**
EC Core is the sole durable writer. All entity-graph writes (patterns, signals, receipts, workspace records) go through EC. Retention rows and export/backup policies exist.

**The spec is silent.**
There is no defined behavior for a storage-full condition during a durable write (e.g., to the entity graph, pattern store, or workspace). No error code, no graceful degradation, no fallback to in-memory-only mode, no run-pause, and no user-visible escalation is specified. The system assumes writes succeed; storage exhaustion is not addressed.
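
For illustration, a writer wrapper could at least distinguish storage exhaustion from other I/O failures and turn it into an explicit outcome. The sketch assumes the underlying write surfaces a standard `OSError` with `errno.ENOSPC`; the `run_paused` status and the returned dictionary shape are hypothetical, not spec-defined.

```python
import errno


def durable_write(write_fn, record):
    """Attempt one durable write and report storage exhaustion explicitly."""
    try:
        write_fn(record)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Hypothetical outcome: pause the run and tell the user.
            return {"status": "run_paused", "reason": "storage_full", "notify_user": True}
        raise  # other I/O errors are not handled by this sketch
    return {"status": "committed"}
```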

### (e) The user cancels a task mid-revision-plan after it has already sent an email

**Spec location:** Evaluator+Revisor V3.3.1 §11 (Revision Dispatcher lifecycle and side-effect policy), §11.18 (RevisionSideEffectPolicy), Core R0.7.1 (Task cancellation mechanics referenced but not detailed in Addenda B), Feedback Delivery V1.0.1 §7 (delivery channels), Task Forum + Run Board V1.0.1 §5 (post kinds include “delivery_receipt”).

**What the spec says:**
RevisionSideEffectPolicy governs external side effects (e.g., email). The Revision Dispatcher is deterministic and records deliveries. Cancellation mechanics live at the DOC23 parent level (not fully detailed in Addenda B).

**The spec is silent on mid-revision cancellation after side effects.**
There is no defined behavior for a user cancel after a revision plan has already executed a side effect such as sending an email. No rollback, no compensating transaction, no notification to the recipient, and no audit entry requirement for “cancelled after side effect” is specified in the Addenda B family. The Revision Dispatcher lifecycle does not address cancellation as a first-class event during plan execution.
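
If cancellation were treated as a first-class event, the dispatcher could emit an audit record that separates what already left the system from what was skipped. The sketch below assumes the side-effect class names that the family's side-effect table uses for sends, webhooks, and filings; the record shape, the event names, and the function itself are hypothetical.

```python
# Side-effect classes treated as irreversible in this sketch.
IRREVERSIBLE = {"external_message_send", "webhook_post", "filing_or_submission"}


def cancellation_record(steps, completed_count):
    """Summarize a cancel request against a partially executed plan.

    steps: ordered list of (step_id, side_effect_class or None).
    completed_count: number of steps finished when the cancel arrived.
    """
    done, pending = steps[:completed_count], steps[completed_count:]
    irreversible = [step_id for step_id, effect in done if effect in IRREVERSIBLE]
    return {
        "event": "cancelled_after_side_effect" if irreversible else "cancelled",
        "irreversible_steps": irreversible,
        "skipped_steps": [step_id for step_id, _ in pending],
    }
```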

---

**Summary of findings**
- (a) **Silent** (no concurrency control for ResearchNeed).
- (b) **Silent** (no mid-run reclassification handling).
- (c) **Partially specified** (malformed output is detected; recovery path is undefined).
- (d) **Silent** (storage-full during durable write).
- (e) **Silent** (cancellation after side-effect has already occurred).

The Addenda B set is strong on happy-path contracts but consistently silent on these classes of real-world failure, concurrency, and governance edge cases. In every case above, an implementing agent would have to invent policy.

# Grok 2b

**Staleness in the ELNOR system (per DOC23 Addenda B set — Core R0.7.1, Common Contracts V1.1.1, Outcome Evaluator/Revisor V3.3.1) is fundamentally hash/precondition-driven and lazy-detected at consumption time.** There are no explicit TTLs, expiration clocks, or proactive background refresh jobs described anywhere in the Addenda B corpus. Freshness is anchored exclusively to cryptographic hashes (artifact content, pre/post, snapshot, source), generation_ids, and explicit preconditions. Everything that can go stale does so because an underlying source, registry, model, or context has drifted since the cached artifact was minted.

Detection is always *reactive* (fail the precondition/hash check on first use), and the spec is deliberately silent on proactive invalidation, background re-verification, or cache eviction policies beyond “fail-and-replan/revalidate.” That silence *is* the finding: the design accepts the risk of stale consumption until the moment of use, relying on the consumer (TaskSystem, Evaluator, Revisor, or Orchestrator) to handle the fallout via degraded modes, replans, or re-escalation rather than preventing stale reads upstream.

Below is the exhaustive inventory of every freshness-dependent artifact called out in the question, drawn directly from the walked Addenda B files (TKP handling in Core R0.7.1 §4.2–4.4, Context/Registry contracts in Common Contracts V1.1.1 §3.1–3.3, snapshot/precondition mechanics in Outcome Evaluator/Revisor V3.3.1 §5.1–5.3 and §6.2). For each I state (1) how the system knows it is stale, (2) what invalidates it, and (3) what happens if a consumer uses it anyway (or the spec’s silence on that outcome).

### 1. Cached source verifications (TKP.source_hashes + attached verification artifacts)
- **How the system knows it is stale**: On any read of the TaskSystemKnowledgePack (TKP), the runtime compares the cached `source_hashes` map against live source artifacts (via the registry’s current content hash or Merkle root). A mismatch = stale.
- **What invalidates it**: Any mutation to a registered source (file, model weight, prompt template, or external data feed) that changes its content hash. No time-based invalidation.
- **What happens if a consumer uses it after stale**: Core R0.7.1 explicitly defines a “degraded mode” path: the TKP is still usable but the TaskSystem marks the affected knowledge slice as `drift_detected` and routes subsequent calls through a slower re-verification or partial re-ingest path. No hard abort; the spec is silent on whether consumers are required to block or whether they may silently continue with reduced confidence. Finding: the silence leaves open the possibility of indefinite degraded-mode operation without forced refresh.

### 2. Context packets (generation_id + embedded snapshots)
- **How the system knows it is stale**: Every context packet carries a `generation_id` (monotonically increasing per live registry/context store) and one or more embedded snapshot references. At consumption, the receiver compares the packet’s `generation_id` against the live registry’s current generation. Mismatch = stale.
- **What invalidates it**: Any change to the live context registry (new sources registered, existing sources mutated, or explicit context invalidation events). Snapshot-embedded data is also invalidated if the underlying EvaluationSnapshot becomes stale (see below).
- **What happens if a consumer uses it after stale**: The spec (Common Contracts V1.1.1 §3.2) requires the consumer to treat the packet as “best-effort”; it may still be used for non-critical paths, but any operation that touches a live-dependent decision (hard calls, pattern application) must re-fetch or trigger a context refresh. The Evaluator/Revisor will log a `stale_context` event but does not abort the entire task. Silence finding: no mandated quarantine or forced replan at packet level — only a best-effort warning. Exact TTL beyond generation_id is absent; a packet could theoretically live forever if the registry never increments.
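
The best-effort rule described for context packets can be sketched as a small decision function. This is an illustration under stated assumptions: the action names and the `live_dependent` flag are hypothetical, and only the `generation_id` comparison and the `stale_context` event name come from the description above.

```python
def packet_action(packet_generation_id, live_generation_id, live_dependent):
    """Return (action, event) for a consumer holding a context packet."""
    if packet_generation_id == live_generation_id:
        return ("use", None)
    if live_dependent:
        # Hard calls and pattern application must not run on a stale packet.
        return ("refetch", "stale_context")
    return ("use_best_effort", "stale_context")


print(packet_action(7, 7, live_dependent=True))   # ('use', None)
print(packet_action(7, 9, live_dependent=True))   # ('refetch', 'stale_context')
print(packet_action(7, 9, live_dependent=False))  # ('use_best_effort', 'stale_context')
```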

### 3. Evaluation snapshots (before/after state + precondition set)
- **How the system knows it is stale**: Every EvaluationSnapshot carries pre- and post-hashes plus an explicit set of hash preconditions. On replay or re-evaluation, the Outcome Evaluator/Revisor recomputes the current hashes and checks them against the snapshot’s stored precondition set.
- **What invalidates it**: Any change to an artifact whose hash appears in the snapshot’s precondition set (source files, resolved hard-call outputs, applied-pattern results, or model weights).
- **What happens if a consumer uses it after stale**: V3.3.1 §5.2 is crystal clear: precondition failure triggers one of three actions chosen by the consumer’s policy — (a) abort the evaluation, (b) trigger immediate replan, or (c) force revalidation of the affected slice. The snapshot itself is never silently accepted. No degraded mode here; staleness is treated as a hard integrity violation for evaluation correctness.

### 4. Resolved hard calls (compatibility binding + output hashes)
- **How the system knows it is stale**: Each HardCallResolution record binds the call to specific input hashes, model version hashes, and output hashes. On reuse, the runtime re-checks those binding hashes against current artifacts.
- **What invalidates it**: Model update, source change, or any mutation to an input artifact referenced in the binding.
- **What happens if a consumer uses it after stale**: Core R0.7.1 §4.3 mandates re-escalation: the call is demoted from “resolved” back to “pending hard call” and re-routed through the full resolution pipeline (potentially hitting the LLM again). The spec is silent on whether intermediate consumers (e.g., pattern appliers) may choose to use the stale output with a warning flag; the default path is re-escalation.

### 5. Applied patterns (cross_model_applicability, compatibility checks, health state, novelty assessment)
- **How the system knows it is stale**: Each AppliedPattern record stores `cross_model_applicability` metadata, compatibility hashes, current health state, and a novelty assessment timestamp/hash. The Revisor re-runs the compatibility and health checks on every use.
- **What invalidates it**: Model drift (new version), pattern definition change, or health-state downgrade of any participating model.
- **What happens if a consumer uses it after stale**: V3.3.1 §6.2 states the pattern is immediately demoted or quarantined: it is removed from active selection pools, logged as `pattern_stale`, and the task is either replanned or falls back to a lower-fidelity pattern set. No silent reuse allowed; the spec treats stale patterns as a correctness risk that must be surfaced.

### 6. Hash preconditions (artifact content hashes, pre/post hashes, snapshot hashes)
- **How the system knows it is stale**: These are the atomic building block. Every precondition set is a list of (artifact_id, expected_hash) tuples. The runtime simply recomputes the current hash and compares.
- **What invalidates it**: Any write to the referenced artifact (file change, model update, context mutation).
- **What happens if a consumer uses it after stale**: Uniform across all Addenda B components: precondition failure aborts the current operation and forces either (a) full task replan or (b) targeted revalidation of the failing slice. No “use anyway” path exists in the normative text.

**Overarching findings on silence**
- No component defines a proactive cache expiration or background freshness job. Staleness is purely *demand-driven*.
- Source-verification caches and certain context packets have the weakest guarantees (degraded mode or best-effort only).
- Evaluation snapshots, hard-call resolutions, applied patterns, and hash preconditions are all “fail-hard” with explicit replan/re-escalation paths.
- The spec never mandates a global “invalidate all caches on registry change” broadcast; consumers are responsible for checking at the point of use. This is intentional local-first design (minimize network chatter) but creates the possibility of transient stale reads until the next consumer touch.

This is the complete, focused answer to the single question posed. The design trades proactive freshness for zero network overhead until the moment of consumption, with hash-driven detection and policy-driven recovery. If you want operative amendments to close any of the noted silences (e.g., mandatory TTLs, background re-verification hooks, or stricter quarantine rules), I can draft the exact sections for DOC23 Addenda B. Stand by for your instructions, Will.

# ChatGPT 1

I found the current DOC23 Addenda B set in the GitHub repo, not the archived R0.6.x files. The operative set is these six files under `Current Specs/DOC23/DOC23 Addenda B/`: Core R0.7.1, Outcome Evaluator/Revisor V3.3.1, Source Workspace V1.0.1, Task Forum + Run Board V1.0.1, Feedback Delivery V1.0.1, and Evaluation Common Contracts V1.1.1. The Core confirms that R0.7.1 is the family entry point and that these siblings are the Addenda B topology. This also matches the repo/carryover context that current specs now live in the GitHub repo under `Current Specs/`.

## Findings

### GAP-ADDB-IRREV-01 — External source/tool queries can leave the system without a specific Addenda-B side-effect gate

The Source Research module can use web, API, database, email, file system, browser, connectors, manual upload, document libraries, PACER/CourtListener/SEC/financial databases, MCP tools, browser skills, prior task outputs, and internal sources. Its operational behavior includes planning and executing queries, recording `SourceQueryRecord`s, creating `SourceRecord`s, verifying results, and updating the workspace. That is an irreversible action class because an external lookup can create provider logs, cost, access traces, rate-limit state, or visible browser/API effects that cannot be undone.

What protects the user: the module has `allowed_source_types`, `required_capabilities`, workspace policy, freshness policy, and verification policy, and hidden research is forbidden because material research must emit a SourceRecord, ResearchAnswer, SourceVerificationResult, SourceQueryRecord, forum post, or universal tool receipt.

Why this is still a finding: those are configuration/audit protections, not a mistake-prevention gate equivalent to the side-effect policy used elsewhere. V3.3.1 has a strong `RevisionSideEffectPolicy` for email sends, calendar writes, webhooks, filings/submissions, memory writes, and internal artifact writes, with policy decisions, human gates, dry runs, and replay controls. But Source Workspace does not bind external source-query execution to that same side-effect class table. Recovery is only “stop using it / mark it stale / supersede it / record the receipt”; the external query itself cannot be recalled.

Required patch: add an `ExternalSourceQueryPolicy` or map each non-local `allowed_source_type` to `RevisionSideEffectPolicy` / `TaskSecurityPolicy`, with explicit rules for user confirmation, cost caps, privacy class, provider class, browser-visible effects, and non-replay.
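
As a rough sketch of what such a gate might check, consider the following. Only the policy name comes from the proposed patch; the field names, the set of source types treated as local, the decision strings, and the ordering of the checks are all hypothetical and would need to be settled in the spec.

```python
from dataclasses import dataclass

# Hypothetical: source types assumed to stay inside the system boundary.
LOCAL_SOURCE_TYPES = {"file_system", "manual_upload", "prior_task_output"}


@dataclass(frozen=True)
class ExternalSourceQueryPolicy:
    cost_cap_usd: float
    confirm_when_privileged: bool = True


def gate_query(policy, source_type, estimated_cost_usd, privileged):
    """Decide whether an external lookup may run, needs the user, or is denied."""
    if source_type in LOCAL_SOURCE_TYPES:
        return "allow"
    if estimated_cost_usd > policy.cost_cap_usd:
        return "deny"
    if privileged and policy.confirm_when_privileged:
        return "require_user_confirmation"
    return "allow"
```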

### GAP-ADDB-IRREV-02 — Source Workspace export/save/open actions are exposed without a local policy/recovery contract

The Source Workspace UI exposes actions including “Open native,” “Open in browser,” “Show in Finder,” “Save As,” “Add to library,” “Verify,” “Mark stale,” “Link to claim,” workspace “Promote to library,” and “Export” as a structured bundle. Export/Save As/file-open/browser-open can leave the system or create durable files/access traces; they are not cleanly reversible once disclosed, saved, or opened externally.

What protects the user: for “Add to library,” the Source Workspace does specify user approval plus EC Core policy gate, and privileged-matter sources do not auto-promote. For the export/save/open actions themselves, I did not find an Addenda-B-local guard, redaction preview, policy decision, human confirmation threshold, or recovery path in the Source Workspace sub-addendum.

Recovery: partial at best. A local exported file may be deleted, but disclosure/access traces cannot be undone; a browser/open-native action may already have left system boundaries. This is a finding unless DOC20/EC has a global export/open policy that Addenda B explicitly consumes.

Required patch: bind `Export`, `Save As`, `Open in browser`, `Open native`, and `Show in Finder` to a named `WorkspaceExternalizationPolicy` with EC policy decision, redaction/matter firewall check, confirmation copy, destination preview, receipt, and “cannot undo disclosure” warning for privileged or matter-scoped workspaces.

## Irreversible action inventory

| Irreversible action class | What protects the user from triggering it by mistake | Recovery if wrong | Finding? |
| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| **External message sends, webhooks, client-portal sends, court filings/submissions.** These are explicitly modeled as side effects: `external_message_send`, `webhook_post`, and `filing_or_submission`. | Strong guard. Requires policy decision, human gate, dry run, TaskSecurityPolicy / external owner-doc equivalent, and never-replay handling for sends/filings. Dry run is non-degradable for external side-effect steps. | Not cleanly recoverable. Rollback never reverses external side effects; it creates corrective artifacts instead. | No. Guard exists; recovery is intentionally limited. |
| **Calendar writes.** Calendar writes are in the side-effect table. | Policy decision and human gate required; replay is idempotent with key. | Usually recoverable by a later modifying/deleting calendar write, but not “clean” because recipients/notifications may already have seen it. | No. |
| **External source/tool queries: web/API/database/browser/connectors/DOC25/DOC73/PACER/SEC/etc.** | Partial guard only: allowed source types, required capabilities, workspace policy, freshness/verification policy, and mandatory receipts/records for material research. | Not recoverable. Can mark stale, discard, supersede, or avoid downstream use, but cannot undo provider logs/cost/access. | **Yes — GAP-ADDB-IRREV-01.** |
| **Source Workspace export/save/open actions.** UI exposes Open native, Open in browser, Show in Finder, Save As, Export, Add source, Add to library, Verify, Mark stale. | Add-to-library has a guard; export/save/open do not have a local Addenda-B guard in the Source Workspace doc. | Weak. Local exported files can be deleted, but disclosure/access traces cannot be undone. | **Yes — GAP-ADDB-IRREV-02.** |
| **Internal artifact writes: module revisions, direct fixes, artifact creates/updates/deletes.** | Very strong guard. Revision plans require deterministic lint, policy decisions for mutations, version/hash preconditions, write sets, read sets, taint/safety checks, CIL authority snapshot, and capability validation. Artifact mutation protocol checks preconditions before dispatch and makes phases 1–3 atomic per step. | Recoverable by rollback/new artifact version if checkpoint history exists. Rollback creates a new reverted version, not a silent pointer change. | No. |
| **Class-safe mechanical direct fixes that auto-commit.** | Guarded by `DirectFixAllowedClass`, direct-fix lint, policy decision, artifact version preconditions, and restriction to mechanical/non-judgment edits. Class-safe direct mechanical fixes may auto-commit only after PolicyDecision approval. | Recoverable by rollback/new version. | No. |
| **Meaning-bearing edits, multi-step mutations, privileged artifact changes, and external-side-effect operations.** | Candidate versions are mandatory for all of these. Meaning-bearing candidates default to review-required; privileged/external-side-effect candidates require a human gate and cannot auto-accept. | Candidate can be rejected before acceptance; accepted internal changes can be rolled back as new versions. External effects cannot be rolled back. | No. |
| **CandidateArtifactVersion accepted/rejected.** | State transition to accepted or rejected must produce a `RevisionOperationReceipt`; state flip without receipt is rejected. Review-required candidates surface accept/modify/reject to UI; human-gate candidates block dispatch until explicit receipt. | Rejected candidates remain audit artifacts. Accepted internal candidates can be reverted by graph rollback/new version; external side effects remain unreversed. | No. |
| **Sandboxed evaluation promotion from candidate into main graph.** | Candidate evaluations stay isolated: main runtime state and progress signals are not updated until acceptance. Accepted candidates require taint clearance through SanitizationNode or explicit user review; otherwise acceptance is blocked. | Rejected candidates archive audit and discard pattern learning signals; accepted candidates can be rolled back internally. | No. |
| **Run Board posts and Forum posts.** Passive board records module starts/completions, artifacts, deliveries, errors, gate decisions, findings, repairs, source updates, user comments, Task Agent recommendations, cost, and lifecycle events. Posts are append-only and must not be modified after creation; later posts can supersede them. | Guard is visibility/participant policy and matter/privilege scoping, not editability. Forum is opt-in; passive board is always-on audit. | Not cleanly recoverable; original audit post remains. Can supersede, restrict visibility, or correct by later post. | No, but add UX warning: “audit entries are append-only.” |
| **Feedback routing to modules, forum, source research, format checker, Revisor, Task Agent.** | Feedback Routing Policy is explicit and rendered as UI controls. Channels are explicit: control flow, direct repair wiring, board/forum publication, and process-gap signal. Findings are defeasible; model-only hard blockers are barred unless the outcome explicitly permits subjective blocking. | Recoverable by contesting, superseding, expiring, or demoting findings/guidance. Contested findings do not block by default. | No. |
| **Run Guidance promotion.** | Candidate-to-item promotion is gated: auto-promotion only for `tool_verified` or `user_approved` findings when routing policy permits; model-judgment-only defaults to human-in-the-loop. | Recoverable by contesting, superseding, expiring, scope-editing, or excluding from DOC15/DOC24 context. | No. |
| **FeedbackConsumptionReceipt and RevisionOperationReceipt persistence.** | Every state-mutating or module-invoking operation produces receipts; receipts are persisted through EC, not local shadow state. Every feedback consumer must emit a receipt; silent ignoring fires validation. | Receipts are audit truth, not deleted. Recovery is corrective/superseding receipt. Receipt failure has handlers. | No. |
| **Source Workspace durable state: SourceRecords, SourceQueryRecords, ResearchNeeds, ResearchAnswers, Verification/Freshness records, tier transitions, workspace availability.** | Workspace policy/config controls create/update/downstream availability; hidden research is forbidden; material source work must be recorded. | Original records persist. Recover by marking stale, adding failed verification, answering/unresolving needs, superseding records, or reducing downstream availability. | No, except the external-query gap above. |
| **DOC72 task memory summaries from Source Workspace.** | Controlled by `persistence_policy.save_to_doc72_summary`; summaries are created at task completion and include source counts, key findings, unresolved needs, and verification status. Core also says artifact intake is policy-governed and most intermediate outputs default to run-scoped storage only. | No clean rollback from Addenda B alone; likely corrected/superseded through DOC72/EC memory governance. | No, but ensure DOC72 correction/supersession path is cross-referenced in R3.2. |
| **DOC73 library promotion candidates / library promotions from Source Workspace.** | Controlled by `promote_to_doc73_library_candidate`; promotion requires user approval and EC Core policy gate; privileged-matter sources do not auto-promote. | Candidate can be rejected. Promoted library entry can be archived/superseded by DOC73 governance, but external disclosure cannot be undone if exported/shared elsewhere. | No. |
| **Pattern promotion and durable learning patterns.** | Promotion to broader scope requires explicit user action and governance policy approval; identifying matter detail is scanned/anonymized or matter-locked; global/wide-domain promotion requires governance approval. | Patterns can be demoted, archived, quarantined, purged, or scope-locked. Demotion is expressly not deletion. | No. |
| **DurableKnowledgeCandidate and DirectInstructionCandidate promotions.** | Teaching signals go to Teach-from-feedback UI for user confirmation before persistence as Patterns. Direct instructions default to current-run only unless user explicitly promotes; privileged/no-learning feedback is locked to current run; global/hard constraints require governance/conflict checks. | Candidate can be rejected/edited/narrowed/superseded; promoted instructions can be superseded through directive governance. | No. |
| **Learning signal persistence to DOC72 / DOC8 / BDSM / DOC24.** | Signals use `EvaluationLearningSignalEnvelope` with EC-governed `governance_policy_ref`, `data_class`, `matter_id`, and `pattern_promotion_eligible`; privileged signals do not auto-promote. | Raw signal remains audit/learning input; recovery is blocking promotion, demoting learned patterns, or emitting negative/correction signals. | No. |
| **HardCallResolution and human gate decisions.** | Hard calls ask explicit human questions and have defined options/defaults. Reuse is tightly bound to outcome hash, goal hash, evidence snapshot, artifact scope, expiry, and supersession state; stale resolutions re-escalate. | Can expire or supersede the resolution. If it authorized an external side effect, the external effect itself is not recoverable. | No. |
| **Task graph/template/module-preset/security/side-effect/capability-binding changes proposed by Task Agent.** | Task Agent is an intelligence interface, not an executor; it may not mutate task records directly, change executable templates/presets without versioned proposal and approval, bypass validation/preflight, or add side-effecting capabilities without policy compliance and approval. Learning is proposal-based for material task-system changes. | Versioned proposal can be rejected; accepted graph/template changes need version rollback/supersession. | No. |
| **Task Forum decisions, plan-review rooms, consensus records.** | High-stakes plans can require forum review before dispatch; the forum creates a room, invites scoped participants, awaits consensus, and records consensus in the assurance trail. Task Agent advisory moderator cannot change graph without user approval. | Forum record persists; decision can be superseded or plan aborted/replanned. | No. |
| **Costly model runs, cheap-model signal generation, calibration, sub-agent evaluation.** | RevisorConfig separates production/signal_generation/calibration; calibration uses cheap and production paths and requires explicit user authorization for mixed cost pool. Budgets distinguish logical vs infrastructure retries, hard caps, estimator confidence, and preemption. | Cost/API call cannot be undone. Can abort, preempt, block auto-approval, recalibrate, or demote learned patterns. | No. |

Bottom line: the Addenda B set is generally strong on irreversible artifact mutation, candidate acceptance, memory/pattern promotion, feedback publication, and external sends/submissions. The two real gaps are the **external source-query path** and the **Source Workspace export/save/open path**, because both can leave the system and neither is locally tied to the side-effect/human-gate/recovery discipline that V3.3.1 otherwise specifies.

# ChatGPT 2

I reviewed the current six-file DOC23 Addenda B set in the repo: Core R0.7.1, Outcome Evaluator/Revisor V3.3.1, Source Workspace V1.0.1, Task Forum + Run Board V1.0.1, Feedback Delivery V1.0.1, and Evaluation Common Contracts V1.1.1. The Core confirms this is the operative Addenda B family topology and that R0.7.1 is the entry point.

I treated “never stated as constraints” narrowly: I did **not** count explicit invariants like EC sole durable writer, Task Agent not being an executor, candidate-version gates, or Revisor lint rules. The items below are assumptions the architecture depends on, but does not turn into an explicit Addenda-B-wide requirement, validation rule, preflight check, or acceptance criterion.

## 1. Cross-doc inserts will be absorbed before anyone implements against the “build-ready” Addenda B docs

**Assumption.** The sub-addenda can be treated as build-ready even though many of their essential mechanics still live as `[XDOC-INSERT]` obligations to DOC23 R3.1, DOC20, DOC12, DOC15, DOC24, DOC25, and DOC73.

**What breaks if false.** Implementers build orphan schemas and phantom controls: `step.source_research` is not in the DOC23 module registry, Source Workspace UI actions are not in DOC20, feedback ports are not canonical, Task Forum room kinds are not registered in DOC12, DOC24 does not assemble TaskRunContextPackets, and Common Contracts remain a sibling instead of a parent-level primitive. The result is a family of internally plausible specs with no executable surface.

**Spec signal.** Strong signal, but not a constraint. Common Contracts says it is active only until DOC23 R3.2 absorbs it and then retires. Source Workspace, Feedback Delivery, and Task Forum each include consuming-doc insert blocks that are necessary for implementation. The missing constraint is: “implementation may not treat any sub-addendum surface as build-complete until all required owning-doc registry/UI/route inserts are absorbed or explicitly stubbed.”

## 2. DOC24’s task-opportunity classifier will be accurate enough to protect the direct-first product experience

**Assumption.** DOC24 can reliably decide when to stay direct, when to show a quiet task chip, when to suggest a saved task, and when to invoke Task Agent.

**What breaks if false.** If it over-classifies, Elnor becomes task-preoccupied: ordinary requests summon saved-task suggestions, TKP cards, and Task Agent consults. If it under-classifies, the saved-task system never appears when it should. Both outcomes undermine the main product promise.

**Spec signal.** Strong signal. Core defines saved tasks as useful only when formalization adds value, lists direct-mode veto cases, defines TaskOpportunityLevel handling, and makes DOC24 owner of the TaskOpportunityPacket. It also says degraded routing stays direct-first if DOC24 cannot compute the opportunity. The missing constraint is a calibration requirement: false-positive/false-negative thresholds, regression fixtures, and a kill-switch for task routing drift.

## 3. The Task Agent Knowledge Pack will stay current with module catalogs, port registries, policy semantics, and code reality

**Assumption.** Task Agent’s hybrid education substrate remains fresh enough that its design advice tracks the current task system.

**What breaks if false.** Task Agent proposes illegal graphs, stale ports, wrong module presets, duplicate templates, obsolete side-effect rules, or unavailable capabilities. The Revisor and Dispatcher may catch some failures later, but the user-facing design experience becomes noisy and untrustworthy.

**Spec signal.** Partial signal. Core says Task Agent education requires generated canonical mechanics, authored doctrine, rationale cards, examples, design cases, template/preset cards, DOC72 links, DOC8/BDSM overlays, and validators. EC owns Task Knowledge Compiler execution, pack activation, validation, read models, storage, scheduling, and route/command registry entries. Task Agent has a degraded mode for a stale knowledge pack, but the spec does not define a freshness budget, invalidation trigger, or “do not design from stale TKP” preflight.

## 4. All relevant artifact, source, board, and workspace states will be snapshot-addressable and resolvable later

**Assumption.** Every evaluation, feedback bundle, finding, repair, source state, board digest, and mutation can point back to immutable state that still exists when replay, rollback, audit, or learning needs it.

**What breaks if false.** Evaluation findings cannot prove what version they evaluated. Repairs may target the wrong artifact. Stale-finding demotion fails. Read/write conflict detection becomes fiction. Replay cannot reconstruct causal state. Candidate acceptance and rollback become unsafe.

**Spec signal.** Strong signal, but scattered. Common Contracts requires `evaluation_snapshot_ref` before an envelope emits. Feedback bundles require snapshot consistency with the envelope and reject stale artifact references. V3.3 depends on read/write sets, source versions, graph snapshot hashes, capability snapshot hashes, and conflict detection. The missing constraint is a global snapshot-retention/indexing contract across all Addenda B producers.

## 5. V3.3’s narrow workspace semantics and Source Workspace’s broader semantics will not diverge

**Assumption.** There is one coherent `TaskSourceWorkspace` object even though V3.3 owns narrow evaluation/revision interactions and Source Workspace owns broader task/user/module behavior.

**What breaks if false.** Evaluator/Revisor may interpret workspace fields differently from Source Research or Task Agent. Availability, taint, matter scoping, run-scoping, verification state, or source applicability could mean different things depending on which subsystem reads the object. That would create source leakage, missed warnings, bad revision planning, and inconsistent UI.

**Spec signal.** Strong signal. Source Workspace explicitly says both views consume the same object and that V3.3 governs overlapping field semantics. The workspace schema includes shared state that many modules depend on: source records, query records, research needs, verification/freshness records, run guidance IDs, availability policy, persistence policy, taint, privilege, and matter ID. The missing constraint is a field-level ownership/precedence table plus a validator for overlap conflicts.

## 6. Source documentation tiers will preserve enough provenance for downstream evaluation and repair

**Assumption.** The selected documentation tier is sufficient for the future uses the task will make of a source.

**What breaks if false.** A tier-0 or tier-1 lookup may later become load-bearing evidence, but only a thin receipt exists. Downstream modules may treat a weak source reference as adequate support. Evaluator cannot verify a claim. Revisor cannot repair source misuse. The system repeats research or, worse, cites unverifiable support.

**Spec signal.** Partial signal. Source Workspace defines tiers from ephemeral lookup through full workspace and says `documentation_mode` governs default tier selection. It also bans hidden material research by requiring a SourceRecord, ResearchAnswer, SourceVerificationResult, SourceQueryRecord, forum post, or tool receipt. But there is no risk-based minimum tier matrix: no rule that source-dependent legal claims, filing-bound outputs, privileged matters, or evaluator-consumed criteria require tier 2/3/4 documentation.
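
One way to picture the missing risk-based minimum tier matrix is a lookup from intended use to minimum documentation tier. This is a minimal sketch only: the use names, the numeric thresholds, and both function names are hypothetical and do not come from the Source Workspace spec.

```python
# Hypothetical minimum-tier matrix. The keys and thresholds are illustrative
# assumptions; the spec only says tiers run from ephemeral lookup to full workspace.
MIN_TIER_BY_USE = {
    "source_dependent_legal_claim": 3,
    "filing_bound_output": 4,
    "privileged_matter": 3,
    "evaluator_consumed_criterion": 2,
}


def required_tier(uses):
    """Return the strictest minimum tier across all declared uses (0 if none apply)."""
    return max((MIN_TIER_BY_USE.get(use, 0) for use in uses), default=0)


def tier_violations(source_tier, uses):
    """List the uses whose minimum tier exceeds the tier the source was documented at."""
    return sorted(use for use in uses if MIN_TIER_BY_USE.get(use, 0) > source_tier)
```

A preflight or validation rule could call `tier_violations` whenever a source documented at tier 0 or 1 is later cited as load-bearing evidence, and block or warn if the list is non-empty.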

## 7. External source actions, open/save/export actions, and provider calls will be classified as side effects whenever they leave the system

**Assumption.** Anything that exits the local task boundary will be routed through policy/side-effect discipline even when it appears in Source Workspace rather than V3.3 revision dispatch.

**What breaks if false.** Web/API/database/email/browser/connector queries may create provider logs, costs, disclosures, and access traces without user gates. Source Workspace actions like Open in browser, Open native, Show in Finder, Save As, Add to library, and Export may disclose or persist matter-scoped material without side-effect classification.

**Spec signal.** Weak-to-partial signal. Source Research allows external mechanisms including web, API, database, email, file system, browser, connectors, and manual uploads. Source Workspace exposes open/save/export/library actions. V3.3 has strong side-effect concepts for revision capabilities, including `side_effect_policy_ref`, idempotency semantics, and `external_side_effect` safety class. But Addenda B never states a cross-family constraint that Source Workspace externalization and Source Research provider calls must map into that same side-effect policy.

## 8. DOC24/CIL summarization and packet rendering will preserve the meaning, scope, and authority of feedback/source/forum state

**Assumption.** Once feedback, source warnings, Run Guidance, board digests, and research needs are selected and summarized into context packets, the receiving module gets the right actionable meaning without privilege leakage or scope drift.

**What breaks if false.** A module may miss a blocker, receive stale or contested guidance, apply source warnings to the wrong segment, act on privileged matter data, or obey compressed summaries as stronger authority than intended. Direct wired feedback and DOC24-selected contextual feedback could also conflict.

**Spec signal.** Strong signal of dependency, weak constraint on fidelity. Feedback Delivery says DOC23 owns direct wiring, DOC15 renders wired input into prompts, and DOC24 participates only when selected contextual guidance is needed. Task Forum says DOC24 assembles TaskRunContextPackets, applies permissioning and token caps, and may include board digest, workspace snapshot, active guidance, open needs, and unresolved repairs. The missing constraint is a packet-fidelity contract: how summaries preserve authority basis, lifecycle state, contradiction, scope, and “do not use” instructions.

## 9. Lifecycle transitions will propagate causally before downstream modules consume stale state

**Assumption.** When a finding becomes contested, superseded, expired, user-approved, tool-verified, or converted into Run Guidance, every downstream consumer sees the right lifecycle state in time.

**What breaks if false.** Superseded findings keep blocking. Contested findings get treated as truth. Old Run Guidance leaks into later modules. Source updates do not clear stale source warnings. DOC8/BDSM learns from obsolete signals. The run looks deterministic in the UI but is actually racing stale read models.

**Spec signal.** Partial signal. Feedback Delivery says Run Guidance promotion is gated and downstream modules must not receive superseded or expired guidance. Consumption receipts are mandatory, and silent ignoring fires validation. Board posts are append-only and can be superseded by later posts. The missing constraint is a monotonic read-model / causal-ordering rule across Feedback Delivery, Forum, Source Workspace, DOC24 packet assembly, and DOC15 rendering.

## 10. Receipt coverage will be universal enough to reconstruct the causal chain

**Assumption.** Every material operation emits a receipt or board event, not just Revisor steps and feedback consumers.

**What breaks if false.** Task Agent cannot diagnose “what happened.” DOC8 cannot distinguish ignored feedback from unseen feedback. Replay and rollback cannot determine what to reverse or supersede. The Run Board becomes a partial narrative rather than an audit trail.

**Spec signal.** Strong signal, but no single Addenda-B-wide receipt taxonomy. V3.3 requires receipts for every state-mutating or module-invoking revision operation and persists them via EC. Source Workspace records every source query in `SourceQueryRecord`. Task Forum says every run has a passive Run Board that records modules, artifacts, deliveries, gate decisions, findings, repairs, source updates, comments, cost, and lifecycle events. The missing constraint is a universal “all material Addenda B side effects/events must have one of these receipt/event types” registry.

## 11. Cheap/local/model-class learning signals will be comparable enough not to contaminate durable task-design patterns

**Assumption.** Signals produced under `cheap_local`, `cheap_api`, `medium`, and `expensive_frontier` model classes can be safely correlated, calibrated, or isolated.

**What breaks if false.** Cheap-model artifacts teach the Task Agent bad patterns. Revisor learns repair strategies that only work for one model class. Cross-model applicability claims become noise. Pattern promotion becomes unsafe, especially where DOC8/BDSM is expected to compute utility and correlations.

**Spec signal.** Partial signal. Common Contracts includes model class and model fingerprint on every learning signal. V3.3 says BDSM consumes signals and compiles runtime-safe utility bundles, but not hot-path mutations. V3.3 also defers full BDSM compilation, hot-path guidance injection, cross-LLM red-team automation, and broader evaluation domains to Phase 2. This is especially load-bearing because current project carryover still flags DOC8 as undersized relative to the learning role the stack assigns it. The missing constraint is a calibration gate for when model-class-specific signals may generalize.

## 12. Criterion hashes and anchors will remain semantically stable across edits

**Assumption.** `criterion_semantics_hash`, `TextAnchor.context_hash`, `StructuredAnchor`, and `ArtifactScopeRef` are stable enough to connect findings, scores, repairs, and learning across revisions.

**What breaks if false.** Scores compare different criteria as if they were the same. Repairs target the wrong paragraph or section. A criterion reworded for clarity may silently become a new criterion, or a materially changed criterion may keep the old identity. Pattern learning corrupts because deltas attach to the wrong semantic unit.

**Spec signal.** Strong signal. Common Contracts defines criterion normalization and says the hash anchors cross-run learning. TextAnchor context hashes detect drift and ask consumers to re-resolve or surface drift. Validation catches malformed hashes and anchors. The missing constraint is a semantic-drift review rule: when normalization preserves identity but meaning changed, or when section structure changes too much for anchors to be trusted.

## 13. Optional Source Workspace and Forum substrates can be added late without losing correctness

**Assumption.** Simple tasks can omit Source Workspace and Forum, and the system can still recover if later evaluation discovers source, coordination, or repair-routing needs.

**What breaks if false.** A task starts simple, then an Evaluator emits `needs_more_sources` or `needs_source_verification`, but there is no Source Research route, no workspace, no forum visibility, and no configured repair consumer. The system pauses, emits unresolved needs, or routes to Task Agent for a process patch after work has already run with insufficient structure.

**Spec signal.** Partial signal. Source Workspace is optional for simple tasks and can be run-scoped or task-scoped. Forum is optional and suggested for medium/complex tasks. Feedback Delivery provides routing channels for source research, forum, repair modules, and process gaps. Module Assistance Requests can operate with or without a Forum. The missing constraint is a preflight requirement that every configured feedback branch has a real consumer or a declared degraded path.
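
The preflight requirement named above could look roughly like the following. It is a sketch under stated assumptions: `FeedbackBranch`, its fields, and `preflight_feedback_branches` are hypothetical names, and only the signal names `needs_more_sources` and `needs_source_verification` and the module name `step.source_research` appear in the reviewed docs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeedbackBranch:
    """Hypothetical description of one configured feedback route."""
    signal: str                     # e.g. "needs_more_sources"
    consumer: Optional[str]         # module expected to consume the signal
    degraded_path: Optional[str]    # declared fallback if no consumer exists


def preflight_feedback_branches(branches, available_modules):
    """Return signals that have neither a real consumer nor a declared degraded path."""
    unroutable = []
    for branch in branches:
        has_consumer = branch.consumer is not None and branch.consumer in available_modules
        if not has_consumer and branch.degraded_path is None:
            unroutable.append(branch.signal)
    return unroutable
```

An empty result would mean every branch is either wired to an available module or explicitly degrades, which is the property the assumption silently relies on.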

## 14. The system can reliably distinguish substantive gaps from process gaps

**Assumption.** The Task Agent, Evaluator, Forum, domain moderators, and Revisor can route “the work product is wrong” separately from “the task graph/process is wrong.”

**What breaks if false.** Task Agent starts making substantive legal/domain judgments outside its competence, or a domain evaluator diagnoses missing arguments when the real problem is missing graph topology. Repair goes to the wrong place: graph patches when the artifact needed revision, or content fixes when the workflow needed a new source-verification branch.

**Spec signal.** Strong signal. Task Forum explicitly defines substantive gaps and process gaps, explains why confusing them causes misrouting, and states that Task Agent consumes findings for diagnosis but must not be the default substantive evaluator. The missing constraint is a classifier confidence / ambiguity rule: when a gap is both substantive and process-level, who decides, what gates execution, and what gets surfaced to the user?

## 15. Pattern C’s dual evaluation recommendations will resolve predictably

**Assumption.** When an Evaluator and a downstream Judge both emit `EvaluationResultEnvelope`s in Pattern C, downstream consumers know which recommendation governs.

**What breaks if false.** Evaluator qualitative findings and Judge quantitative scores can disagree. One envelope may recommend fail/revision while the other recommends pass. Without a binding resolver, UI, Revisor, Loop Controller, learning, and audit may disagree about what the evaluation “meant.”

**Spec signal.** Explicit signal of the risk, but not a hard constraint. Common Contracts says Pattern C can produce two route recommendations in the same chain and that resolution is by consumer policy, “typically” the Judge’s quantitative recommendation governs. V3.3 says the Evaluator must not hidden-dispatch Judge; Pattern C is graph-wired and Judge’s input port is owned by Addenda A. The missing constraint is a canonical chain-resolution policy: which producer governs by default, when a conflict escalates, and how the Revisor receives the resolved verdict.
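
A canonical chain-resolution policy might be sketched as below. The default that the Judge governs follows the "typically" language in Common Contracts; the `Recommendation` type, the route strings, and the `escalate` result are hypothetical placeholders rather than spec values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical reduction of an EvaluationResultEnvelope to its route recommendation."""
    producer: str  # "evaluator" or "judge"
    route: str     # e.g. "pass", "revise", "fail" (illustrative values)


def resolve_pattern_c(evaluator, judge, escalate_on_conflict=True):
    """Resolve two Pattern C recommendations into one verdict for downstream consumers."""
    if evaluator.route == judge.route:
        return {"route": judge.route, "governed_by": "judge", "conflict": False}
    if escalate_on_conflict:
        # Assumed policy: a disagreement is surfaced rather than silently resolved.
        return {"route": "escalate", "governed_by": None, "conflict": True}
    # Fallback policy: the Judge's quantitative recommendation governs.
    return {"route": judge.route, "governed_by": "judge", "conflict": True}
```

Whichever default is chosen, recording `conflict` on the resolved verdict keeps the UI, Revisor, Loop Controller, and audit trail reading the same answer.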

## Bottom line

The Addenda B set is strong where it has explicit gates: Revisor dispatch, candidate versions, side-effect-capable revision steps, hard calls, taint, receipts, direct-fix limits, and direct-first routing. The unstated load-bearing assumptions are mostly at seams: **cross-doc absorption, DOC24 routing accuracy, source/workspace semantics, context-packet fidelity, receipt universality, learning calibration, optional-substrate recovery, and Pattern C conflict resolution**. Those are the places where a coding agent could implement locally plausible pieces and still produce a system that is not actually coherent end-to-end.

# Gemini 1

To answer your question directly: **No, the system is not entirely honest.** While DOC23 Addenda B has exceptional *post-mortem* honesty (its telemetry and evaluation envelopes meticulously log what broke after the fact), it suffers from severe **runtime dishonesty**. When faced with resource constraints, missing dependencies, or unavailable sub-agents, the architecture's default posture is to gracefully degrade to keep the execution graph moving. Because it lacks strict structural "floors" for these degradations, it frequently presents partial, downgraded, or truncated results to the user as "clean" and "complete."

Here is every place in the Addenda B set where a partial or degraded result is silently masked as a clean one, and the exact architectural missing piece required to force that degradation to the surface.

### 1. The Quorum Illusion (Silent Consensus)

**Location:** Task Forum + Run Board V1.0.1 §8.5

- **The Dishonesty:** When the Revisor delegates a plan to the Plan Review Forum, multiple sub-agents deliberate. If a specialized sub-agent (e.g., a "Securities Law Specialist") times out, crashes, or hits an API limit, the forum does not halt. It simply tallies the votes of the surviving agents. The forum returns an `approved` state to the Revisor, and the user sees a green checkmark indicating the plan passed forum review. The user is falsely led to believe the specialist validated the plan.
- **What’s Missing to Force Honesty:** The Task Forum requires a **`RequiredQuorumManifest`**. The forum must not be allowed to return a binary `approve/reject`. If a mandated capability drops out, the forum's state must evaluate to `approved_degraded_quorum`. This forces the Run Board UI to render a yellow warning flag ("Approved, but Specialist was unavailable") rather than a clean green check.
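
The quorum check described above can be sketched as follows. This is an illustration, not the forum's actual tally logic: `forum_verdict` and the role names are hypothetical, and treating an empty vote set as rejection is an assumption.

```python
def forum_verdict(required_roles, votes):
    """Tally forum votes against a required-quorum manifest.

    required_roles: set of roles that must participate.
    votes: dict mapping role -> bool; a role that timed out or crashed is absent.
    Returns (state, missing_roles).
    """
    missing = sorted(required_roles - set(votes))
    # Assumption: with no votes at all there is nothing to approve.
    approved = bool(votes) and all(votes.values())
    if not approved:
        return ("rejected", missing)
    if missing:
        return ("approved_degraded_quorum", missing)
    return ("approved", missing)
```

Returning the missing roles alongside the state gives the Run Board enough information to name the absent specialist in its warning.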

### 2. The Assurance Downgrade (Silent Confidence Drop)

**Location:** Outcome Evaluator/Revisor V3.3.1 §7

- **The Dishonesty:** An Outcome is configured with an `AssuranceBasis` (e.g., `specialist_agent` or `comparative_judge`). If the `step.judge` module is unavailable or the user's compute budget is exhausted, the Outcome Compiler is highly likely to fall back to a cheaper `auto_check` (regex or simple zero-shot LLM pass) to prevent the graph from crashing. If the cheap check passes, the Outcome state transitions to `satisfied`. The user sees a passing outcome and assumes it was subjected to the rigorous, expensive scrutiny they requested.
- **What’s Missing to Force Honesty:** `AssuranceBasis` must be defined architecturally as a **Strict Floor**, not a preference. The `EvaluationResultEnvelope` must carry a `TargetAssurance` vs. `ExecutedAssurance` tuple. If the executed basis falls below the target, the outcome state cannot be `satisfied`; it must be restricted to `satisfied_downgraded`, forcing the UI to admit the evaluation was shallow.
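
The target-versus-executed comparison could be expressed as a small state function. The ranking of the three bases is an assumption made for illustration (the spec does not order them here), and `unsatisfied` is a hypothetical placeholder for whatever failing state the Outcome model uses.

```python
# Assumed ordering of assurance bases, weakest to strongest. Not from the spec.
ASSURANCE_RANK = {"auto_check": 0, "comparative_judge": 1, "specialist_agent": 2}


def outcome_state(target_assurance, executed_assurance, check_passed):
    """Derive the outcome state from the TargetAssurance / ExecutedAssurance pair."""
    if not check_passed:
        return "unsatisfied"  # hypothetical failing state
    if ASSURANCE_RANK[executed_assurance] < ASSURANCE_RANK[target_assurance]:
        # The check passed, but at a weaker basis than the user requested.
        return "satisfied_downgraded"
    return "satisfied"
```

Because the downgrade is computed from the pair rather than set by the fallback path, a module that quietly substitutes `auto_check` cannot also claim a clean `satisfied`.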

### 3. The "Evidence of Absence" Loophole (Silent Passing)

**Location:** Source Workspace V1.0.1 + Outcome Evaluator V3.3.1

- **The Dishonesty:** Suppose an Outcome mandates, "All factual claims must be supported by the Source Workspace." If `step.source_research` silently fails to fetch a URL (e.g., due to a paywall or 404), the Source Workspace remains empty. The Evaluator scans the claims, checks the empty workspace, finds no explicit contradictions, and marks the outcome `satisfied`. The system tells the user the claims are "verified" when, in reality, it was just entirely blind to the source material.
- **What’s Missing to Force Honesty:** Outcomes must enforce a **`SourceDataInvariant`**. Evaluators must be mathematically prohibited from treating "not contradicted" as "verified." If a source retrieval fails, the workspace must emit a `SourceMissing` token. Any Evaluator that reads a workspace containing a `SourceMissing` token must transition to an `indeterminate_missing_dependencies` state, not `satisfied`.
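
A minimal sketch of the invariant, assuming the workspace can be reduced to a mapping from source ID to retrieved text. `SOURCE_MISSING`, `evaluate_source_outcome`, and the `unsatisfied` state are hypothetical, and the substring test is only a stand-in for real claim verification.

```python
# Sentinel standing in for the proposed SourceMissing token (hypothetical representation).
SOURCE_MISSING = object()


def evaluate_source_outcome(claims, workspace):
    """Evaluate claims against a workspace without treating blindness as verification.

    claims: list of (claim_text, source_id) pairs.
    workspace: dict mapping source_id -> retrieved text, or SOURCE_MISSING on failed retrieval.
    """
    if not workspace or any(text is SOURCE_MISSING for text in workspace.values()):
        # An empty or partially missing workspace can never yield "satisfied".
        return "indeterminate_missing_dependencies"
    # Stand-in check: a claim counts as supported only if it appears in its source.
    unsupported = [claim for claim, source_id in claims
                   if claim not in workspace.get(source_id, "")]
    return "unsatisfied" if unsupported else "satisfied"
```

The important property is the first branch: the evaluator short-circuits before it ever gets the chance to find "no contradictions" in material it never saw.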

### 4. The Budget-Exhausted Truncation (Silent Partial Misfires)

**Location:** Outcome Evaluator/Revisor V3.3.1 (Candidate Artifact Versions & Budgets)

- **The Dishonesty:** A Revisor generates a 5-step `RevisionPlan`. It executes steps 1, 2, and 3 successfully. Before step 4 executes, the system hits its `max_llm_calls_per_revision` budget. The Revisor fires an `escalation_out` (Hard Call). However, because steps 1-3 were applied to the active artifact, the user is presented with a half-mutated document. The system might show the task as "Escalated," but the artifact itself silently contains incomplete logical surgeries that look like finalized text.
- **What’s Missing to Force Honesty:** Mandatory **Transaction Rollbacks**. A `CandidateArtifactVersion` must be treated as a strict atomic transaction. If the budget is exhausted before the *entire* plan results in 100% `satisfied` outcomes, the Revision Dispatcher must revert the artifact to the `base_version` and attach a `budget_exhausted_rollback` flag. The system must tell the user: "I ran out of budget halfway through, so I undid everything to keep your document safe."
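
The all-or-nothing behavior can be sketched by applying steps to a working copy and returning the base version if the budget runs out. This simplifies the proposal: it treats each step as costing one call and treats plan completion as success, whereas the text above ties success to every outcome being `satisfied`. `run_revision_plan` and the result keys are hypothetical.

```python
def run_revision_plan(base_version, steps, max_calls):
    """Apply revision steps atomically to a candidate copy of the artifact.

    steps: list of callables, each taking and returning artifact text.
    Assumption: every step consumes exactly one call from the budget.
    """
    candidate = base_version
    for calls_used, step in enumerate(steps):
        if calls_used >= max_calls:
            # Budget exhausted mid-plan: discard the candidate, keep the base version.
            return {"artifact": base_version,
                    "flags": ["budget_exhausted_rollback"],
                    "steps_applied": 0}
        candidate = step(candidate)
    return {"artifact": candidate, "flags": [], "steps_applied": len(steps)}
```

Because the steps only ever touch `candidate`, the active artifact is never left half-mutated, whichever branch returns.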

### 5. Transitive Taint Laundering (Silent Security Breaches)

**Location:** Task Forum V1.0.1 + Outcome Evaluator V3.3.1 (Taint Model)

- **The Dishonesty:** This was surfaced in the red team review, but it fundamentally answers the honesty question. If a tainted, adversarial source is read by a forum sub-agent, the agent generates an annotation. The annotation drops the `external_untrusted` taint label. The Revisor consumes the "clean" annotation and alters the document. The system presents the final artifact to the user as clean and internally generated, silently laundering adversarial payloads.
- **What’s Missing to Force Honesty:** Taint must be defined as a **Transitive Graph Property**. Any artifact, instruction, or annotation produced by a module that *looked* at tainted data must inherit the taint label. The system must explicitly warn the user: "This output was influenced by untrusted external sources," preventing a silently degraded security posture.
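
Transitive taint amounts to taking the union of input labels whenever something is derived. The sketch below assumes a simple immutable item type; `Item` and `derive` are hypothetical names, and only the `external_untrusted` label comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical artifact, instruction, or annotation carrying taint labels."""
    content: str
    taints: frozenset = frozenset()


def derive(content, *inputs):
    """Create a derived item that inherits every taint label of every input it read."""
    inherited = frozenset().union(*(item.taints for item in inputs))
    return Item(content, inherited)
```

For example, an annotation derived from an untrusted web page keeps the label, and so does a revision that later reads that annotation:

```python
page = Item("scraped web page", frozenset({"external_untrusted"}))
annotation = derive("forum annotation", page)
revision = derive("revised paragraph", annotation, Item("clean draft"))
print(sorted(revision.taints))
```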

### The Core Architectural Fix

The system currently treats "success" as a boolean `satisfied`. To make this architecture honest, you must replace the boolean with a matrix. An outcome or task is never just "done." It is `done_perfectly`, `done_with_degraded_quorum`, `done_with_downgraded_assurance`, or `aborted_for_safety`. If you add these explicit states to the `EvaluationCommonContracts`, the system is forced to tell the user exactly what compromises it made to cross the finish line.

# Gemini 2

**To answer your question directly: No.** A securities litigator cannot rely on the work product this system produces without re-checking the underlying sources and reasoning themselves.

While the DOC23 Addenda B architecture provides unparalleled *observability* into what the system did, it provides no structural *guarantees* of factual or logical integrity. It is designed to relentlessly iterate until a set of LLM-driven evaluators stop complaining (reaching the `satisfied` state). For a securities litigator bound by Rule 11 and the PSLRA—where pleading specific, unassailable facts is the barrier to entry—a lack of complaints from an LLM evaluator is not a proxy for truth.

To justify blind reliance without re-checking, the architecture must transition from "Evaluation" to "Evidentiary Proof." Here is precisely what is missing in the current spec set before reliance is justified.

### 1. The Missing Citation Contract (Ephemeral Source Grounding)

**The Reliance Gap:** You have a `SourceWorkspace` and you have an `EvaluationResultEnvelope` with `TextAnchor`s. However, the anchors only link an *evaluator's finding* to a spot in the *draft artifact*. There is no architectural contract linking the *generated text* to the *underlying source material*.

**Why it precludes reliance:** When the Revisor executes a `direct_fix` to add a scienter allegation, it reads the Source Workspace and writes the text. But the final text is just a raw string. The litigator looking at the final Complaint has no way to know *which* specific page of *which* specific deposition in the Source Workspace supports that sentence. They are forced to re-read the sources to verify the AI didn't hallucinate the fact.

**What is required:** The `revision_in` port and Revisor MUST be bound by a **`CitationManifest` Contract**. Any module that generates or mutates factual text must verifiably bind that text block to a specific `SourceRecord` ID and exact quote, so the link can be checked mechanically rather than taken on trust. The artifact cannot just be a text file; it must be a dense, verifiable citation graph.
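A minimal Python sketch of what such a manifest check might look like follows. `SourceRecord` and `CitationManifest` are names from the text, but every field, the `CitationEntry` type, and the `validate_manifest` function are assumptions made for illustration; the spec does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    # Assumed shape: an ID plus the full text of the source.
    record_id: str
    text: str


@dataclass(frozen=True)
class CitationEntry:
    # Hypothetical manifest row binding an artifact span to a source quote.
    span_start: int  # character offsets into the artifact text
    span_end: int
    source_record_id: str
    exact_quote: str


def validate_manifest(
    artifact: str,
    entries: list[CitationEntry],
    sources: dict[str, SourceRecord],
) -> list[str]:
    """Return a list of problems; an empty list means every entry resolves."""
    problems = []
    for i, entry in enumerate(entries):
        if not (0 <= entry.span_start < entry.span_end <= len(artifact)):
            problems.append(f"entry {i}: span is outside the artifact")
            continue
        record = sources.get(entry.source_record_id)
        if record is None:
            problems.append(f"entry {i}: unknown source {entry.source_record_id}")
        elif not entry.exact_quote or entry.exact_quote not in record.text:
            problems.append(f"entry {i}: quote not found in {entry.source_record_id}")
    return problems
```

This only checks that each quote really appears in the named source. It does not check that the quote actually supports the sentence, which would still need an evaluator or a human.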

### 2. Passive vs. Affirmative Verification (The "Absence of Contradiction" Loophole)

**The Reliance Gap:** The Outcome Evaluator determines if an outcome is `satisfied`. If the outcome is "Ensure all dates match the timeline," an LLM evaluator will scan the text against the Source Workspace. If the LLM evaluator *fails to notice* a contradiction, or if the Source Workspace failed to load the correct timeline document, the evaluator emits no findings. The outcome transitions to `satisfied`.

**Why it precludes reliance:** The system treats the absence of a detected error as proof of accuracy. A litigator cannot rely on "I didn't see a mistake." They rely on affirmative proof.

**What is required:** Evaluator outcomes must support a **`StrictAffirmativeProof` mode**. In this mode, `satisfied` cannot be achieved by returning an empty findings array. The evaluator MUST extract every factual claim and pair it with its affirmative source match. If a claim lacks an affirmative match, the outcome is forced to `indeterminate_unverified`, preventing the system from masking ignorance as accuracy.
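The difference between passive and affirmative verification can be sketched in a few lines of Python. The states `satisfied` and `indeterminate_unverified` come from the text; `ClaimCheck`, `resolve_outcome`, and the `not_satisfied` label are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimCheck:
    # Hypothetical per-claim record produced by the evaluator.
    claim: str
    source_record_id: Optional[str]  # None means no affirmative match was found
    supporting_quote: Optional[str]


def resolve_outcome(checks: list[ClaimCheck], findings: list[str], strict: bool) -> str:
    """Decide the outcome state from evaluator output."""
    if findings:
        return "not_satisfied"  # assumed label for a failed outcome
    if not strict:
        # Passive mode: no findings is treated as success (the loophole).
        return "satisfied"
    if not checks:
        # Strict mode: an empty result proves nothing.
        return "indeterminate_unverified"
    for check in checks:
        if not check.source_record_id or not check.supporting_quote:
            return "indeterminate_unverified"
    return "satisfied"
```

With the same empty evaluator output, `resolve_outcome([], [], strict=False)` yields `satisfied` while `resolve_outcome([], [], strict=True)` yields `indeterminate_unverified`, which is the behavioral change the mode is meant to force.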

### 3. The Black-Box Causality of the Revision Compiler

**The Reliance Gap:** The Revisor is the brain of this system. It ingests failed outcomes and outputs a `RevisionPlan`. But the Revisor is an LLM. Its *causal reasoning* for why Step 1 (e.g., "delete paragraph 3") actually solves Finding A is entirely opaque.

**Why it precludes reliance:** If a litigator does not re-check the reasoning, they are blindly trusting that the LLM correctly understood the legal deficiency and applied the correct legal remedy. Because the `RevisionPlan` schema captures the *instruction* but not the *logical proof* of the instruction, the user has no way to audit the Revisor's judgment short of reading the diffs and reverse-engineering the logic.

**What is required:** The `RevisionPlan` schema MUST require a **`CausalProof` ledger**. Before emitting the plan, the Revisor must explicitly map: "Action X alters Fact Y, which forces Outcome Z to pass because of Legal Rule W." If the system cannot explicitly state the chain of logic, it must escalate to a human.
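One possible shape for such a ledger, sketched in Python under stated assumptions: the four fields mirror the "Action X alters Fact Y, which forces Outcome Z to pass because of Legal Rule W" sentence above, while `CausalProofEntry` and `steps_needing_escalation` are hypothetical names and not part of the `RevisionPlan` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalProofEntry:
    # Hypothetical ledger row; one per plan step.
    action: str        # X: what the step does
    altered_fact: str  # Y: what it changes in the artifact
    outcome_id: str    # Z: the outcome expected to pass
    legal_rule: str    # W: why the change satisfies the outcome

    def is_complete(self) -> bool:
        fields = (self.action, self.altered_fact, self.outcome_id, self.legal_rule)
        return all(value.strip() for value in fields)


def steps_needing_escalation(
    plan_steps: list[str], ledger: list[CausalProofEntry]
) -> list[str]:
    """Return plan steps that have no complete causal entry."""
    justified = {entry.action for entry in ledger if entry.is_complete()}
    return [step for step in plan_steps if step not in justified]
```

A non-empty return value would be the trigger for the human escalation described above. The check only confirms that the chain is stated, not that the stated logic is legally correct.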

### 4. Taint Laundering via Advisory Summaries

**The Reliance Gap:** The Revisor input packet is protected by a taint boundary. But if opposing counsel's brief (tainted `external_untrusted`) is debated by sub-agents in the Task Forum, the sub-agents produce a "clean" summary annotation.

**Why it precludes reliance:** The Revisor reads the "clean" annotation and incorporates its framing into the work product. The system tells the litigator the artifact is safe. The litigator relies on it, unknowingly filing a brief that adopts the opposing counsel's poisoned framing or hallucinated case law.

**What is required:** As noted in previous technical passes, **Transitive Taint** is non-negotiable for reliance. Any downstream reasoning that touches tainted data must inherit the taint. A litigator can only rely on a system that refuses to launder adversarial input.
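Transitive taint can be illustrated with a small provenance walk in Python. The `external_untrusted` label comes from the text; the `trusted` label, the `Item` type, and `effective_taint` are assumptions for this sketch, and a real implementation would likely track more than two taint levels.

```python
from dataclasses import dataclass

EXTERNAL_UNTRUSTED = "external_untrusted"  # label used in the text
TRUSTED = "trusted"                        # hypothetical label for this sketch


@dataclass(frozen=True)
class Item:
    # Hypothetical provenance node: a source, annotation, or summary.
    item_id: str
    own_taint: str
    derived_from: tuple[str, ...] = ()


def effective_taint(item_id: str, items: dict[str, Item]) -> str:
    """Walk the provenance graph; any untrusted ancestor taints the item."""
    seen = set()
    stack = [item_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the provenance graph
        seen.add(current)
        item = items[current]
        if item.own_taint == EXTERNAL_UNTRUSTED:
            return EXTERNAL_UNTRUSTED
        stack.extend(item.derived_from)
    return TRUSTED
```

Under this rule, a Task Forum summary labeled clean by its authors still resolves to `external_untrusted` if anything in its ancestry was the opposing brief, so the Revisor's taint boundary sees it for what it is.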

### Summary

To rely on this system without checking the sources, the litigator needs to know that the document is not merely *plausible*, but *proven*. Addenda B currently optimizes for plausibility (getting the evaluators to say "looks good"). Until you introduce strict citation manifests, affirmative proof modes, causal proof ledgers, and un-laundered taint tracking, the system remains a brilliant junior associate: capable of great drafts, but requiring a senior partner's red pen before it goes to the judge.