RED_TEAM_DOC23_ADDENDA_B_SET_V2.md

Generated 2026-06-09T01:23:58.539Z from commit dbaa25962edc11ab30e8d4ca1715f9ae5bf77331. Worktree: clean.

# Red Team Review Prompt — DOC23 Task System + Addenda B Document Set (V2)

**Instructions for the reviewing model:** You are a principal systems architect and product reviewer with deep expertise in AI orchestration systems, distributed-systems failure analysis, schema and contract design, and the user experience of professional-grade software tools. You are reviewing a set of six interlocking specification documents for the task-execution system of an AI platform.

The author does not want validation. The author wants you to do two jobs at once. First, think big: critique whether this design will actually produce a system that high-stakes professional knowledge workers can *rely on*, and propose how to make it genuinely excellent — including reconceptualizations, not just fixes. Second, hunt defects: find every bug, every piece of broken or fragile logic, every missing piece, every underspecified mechanism, and every gap in wiring, schema, or contract that would cause the built system to fail or force an implementing AI coding agent to guess.

Your review therefore has three parts, all required: a broad conceptual critique (§3), a targeted review of the known-hard parts and the seams between documents (§4), and an exhaustive defect hunt across all six documents (§5). Do not skip the conceptual part to get to the technical findings, and do not treat the defect hunt as an afterthought: the author wants bugs squashed.

This is a fresh-window review. All necessary context is in this prompt and the attached documents.

---

## §1 System context

**ELNOR** is a local-first, general-purpose AI agent operating system. It runs on Apple Silicon using OpenClaw as the runtime substrate, with a mix of local quantized models, cheap API models, and frontier models. It is built by AI coding agents working from specifications — the author writes and refines specs; AI agents implement them. **The specifications are the product at this stage; their completeness and precision directly determine build quality. A defect in the spec becomes a defect in the system.**

ELNOR is **domain-agnostic by design**. Its first deployment context is a securities litigation practice, and you may use high-stakes legal work as a concrete lens when reasoning about reliability and user experience — but the system itself must work equally well for any professional knowledge-work domain (research, engineering, finance, medicine, journalism). Do not let the legal framing narrow your review; treat "high-stakes professional knowledge work" as the general target.

**The task system** is ELNOR's mechanism for composing, executing, evaluating, and revising multi-step AI work. A user — or an AI Task Agent assisting them — builds a task graph out of modules. Modules execute, produce artifacts, get evaluated against criteria, and, when they fall short, get revised. The core modules in scope:

- **Outcome Compiler** — interprets a human request into a structured `CompiledEvaluationPlan` with explicit evaluation criteria and `OutcomeSpec` records
- **Outcome Evaluator** (`step.evaluator`) — runs evaluations against compiled plans; produces findings and revision requests
- **Revisor** (`step.revisor`) — composes revision plans from evaluator findings; dispatches to modules to perform fixes; manages the revision cycle
- **Judge** and **Experiment** (Addenda A modules, referenced but not the primary review target here) — quantitative scoring and variant comparison
- **Sub-agent dispatch** — specialist sub-agents that main modules can delegate to (registry mechanics owned by a separate spec, DOC24)

---

## §2 Documents under review

Six interlocking specifications. Review them as a *set* — many of the highest-value findings are at the seams between documents, not inside any one.

1. **DOC23 Addenda B Core R0.7** — the family-core specification. Defines shared primitives, the Task Mode taxonomy, the `TaskOpportunityPacket` schema and token budgets (§3D), and a large block of cross-document obligations (§24/§24A).

2. **DOC23 Addenda B / Outcome Evaluator+Revisor V3.3** — the Evaluator and Revisor modules. Includes Pattern C evaluation wiring (§5.18 `evaluation_result_out` port), the `AdvisorySubAgentProfile` schema (§8.4), the revision cycle and revalidation cascade (§11.21), `PatternPerformanceSlice` (§13.3), HardCall resolution (§21), and a set of patches from a prior red-team round (taint-clearance-bound-to-AccessTier, upstream-failure cascade, logical-vs-infrastructure budget bifurcation, SemanticChangelog for regeneration, and the goal-learning sycophancy fix at §6.12).

3. **DOC23 Evaluation Common Contracts V1.1** — shared schemas across the evaluation family: `EvaluationResultEnvelope`, `EvaluationArtifactEnvelope`, the signal envelope, and Pattern C consumption semantics (§3.7, `target_evaluation_chain_id` binding).

4. **DOC23 Addenda B / Source Workspace V1** — the source-bound workspace substrate for task runs; `SourceRecord` schema and workspace API.

5. **DOC23 Addenda B / Feedback Delivery V1** — delivery of evaluation findings to feedback consumers; defeasible findings; the DOC23 / DOC15 / DOC24 boundary (§8).

6. **DOC23 Addenda B / Task Forum + Run Board V1** — the `TaskRunContextPacket` pattern (§6.3), digest packets, and run-board surfaces.

---

## §3 Broad / conceptual review

The author's bar is not "the system works." It is "a high-stakes professional relies on it" — for work where errors are expensive, deadlines are hard, and reputations are exposed. Review against that bar. Step back from the spec text and assess the design as architecture and as product.

### §3.1 Reliability and trust
Will a professional trust this system with consequential work? Identify what in the design earns trust and what quietly erodes it. Trace concrete failure scenarios — a module produced a wrong artifact; an evaluation passed something that should have failed; a revision made the work worse — and for each, determine whether the user can find out what happened and why. Assess whether behavior is predictable enough that a professional can build a durable workflow on top of it, or whether the system is too variable to depend on.

### §3.2 Failure transparency and recovery
When something fails — a module crashes, a revision cycle exhausts its budget, an outcome is indeterminate — does the user understand what happened and what to do next, or does the system fail opaquely? Can the user recover gracefully: undo, fork, return to a known-good state? Are partial failures contained, or does one broken module poison an entire task?

### §3.3 The human's role
The system asks the user for input, judgment, and approval at various points, and escalates "Hard Calls" for human resolution. Assess whether this is calibrated correctly: too many interruptions destroy a professional's flow; too few surrender control over high-stakes output. Are the right decisions escalated, at the right moments? Is the cognitive load reasonable — can the user direct this system without becoming its operator?

### §3.4 Reviewability of AI work
High-stakes work must be reviewable before it is relied on. When a module regenerates or restructures a large artifact, can the user actually review what changed? The set includes a SemanticChangelog mechanism for this; assess whether it is sufficient. Are evaluation results presented so a professional can audit the reasoning, the evidence, and the sources, rather than only seeing pass/fail?

### §3.5 Is the concept right?
Beyond execution quality, interrogate the concept. What does this task system fundamentally get right, and what does it get wrong at the level of idea rather than detail? Set the six documents aside and ask: designing a reliable AI task system for high-stakes professional work from first principles, where would you diverge from what is here? Name the weakest load-bearing idea in the architecture — the assumption that, if it does not hold, undermines everything built on it. Name what the system cannot currently do that a demanding professional would expect. And look outward: workflow engines, CI/CD systems, clinical decision support, aviation checklist discipline, and professional drafting tools each carry decades of practice in making complex multi-step work reliable and reviewable — what of that is missing here? The author explicitly wants ambitious improvement ideas at this stage, including ones outside the system's current framing.

---

## §4 Targeted review — seams, open items, complex mechanics

Focused scrutiny of the hard parts. For each, determine whether the current design is sound, complete, and implementable.

### §4.1 Cross-document seams
The single highest-value category of finding. Six documents written at different times will not fully agree. Hunt for: a primitive defined one way in one document and used differently in another; a Core R0.7 §24/§24A obligation whose target-document side is missing or inconsistent; a Common Contracts V1.1 schema that the Evaluator+Revisor V3.3 document populates incompletely or contradicts; a boundary (DOC23/DOC15/DOC24) where each side assumes the other handles something; and version-skew where the documents have drifted out of agreement.

### §4.2 The revision cycle
Trace the full cycle: original artifact → Evaluator → findings → Revisor → revision plan → dispatched modules → revised artifact → Evaluator re-runs (revalidation cascade, §11.21) → loop or terminate. Is every transition specified? Where can it get stuck, loop forever, or terminate ambiguously? What exactly gets re-evaluated on revalidation — only the targeted outcomes or all of them — and what happens when a revision fixes one outcome and breaks another? Are all terminal states (convergence, budget exhaustion, `could_not_fix`) reachable and well-defined, with no state the cycle can enter and not leave?
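
One way to make the "enter and not leave" question mechanical is to write the cycle down as a transition table and audit it. The sketch below is illustrative only: apart from `could_not_fix`, every state name is a hypothetical label for a stage in the trace above, not an identifier taken from the V3.3 spec, and the transition table itself is an assumption to be replaced with whatever the documents actually specify.

```python
from collections import deque

# Hypothetical transition table. Only `could_not_fix` is a name from the
# prompt; the other state names are illustrative stand-ins.
TRANSITIONS = {
    "evaluating": {"converged", "revising", "budget_exhausted"},
    "revising": {"dispatching", "could_not_fix", "budget_exhausted"},
    "dispatching": {"revalidating", "could_not_fix"},
    "revalidating": {"converged", "revising", "budget_exhausted"},
    "converged": set(),
    "budget_exhausted": set(),
    "could_not_fix": set(),
}
TERMINAL = {"converged", "budget_exhausted", "could_not_fix"}


def reachable(start, transitions):
    """Return every state reachable from `start`, including `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in transitions.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def audit(start, transitions, terminal):
    """Return (unreachable states, states from which no terminal is reachable)."""
    live = reachable(start, transitions)
    unreachable = set(transitions) - live
    traps = {s for s in live if not (reachable(s, transitions) & terminal)}
    return unreachable, traps
```

This only catches structural traps. A `revising` to `revalidating` loop still passes the audit because an exit exists; whether it terminates in practice depends on the budget rules, which the table does not model.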

### §4.3 Failure cascades and distributed-systems hazards
The set includes an `upstream_failure` / transitive failure cascade — a fix for a deadlock where downstream outcomes wait forever on artifacts a crashed upstream module will never produce. Examine whether the cascade is complete: can any outcome still hang, and how is partial upstream success handled? Hunt for race conditions in the evaluation/revision wiring, concurrent revisions touching the same artifact, and hazards in parallel run targeting. Examine whether `pending_dependency` and related suspended states can deadlock or leak. Then simulate emergent behavior over hundreds of task ticks across many interacting modules and report latent hazards invisible in any single-module reading.
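
As a baseline to compare the specified cascade against, the transitive part can be sketched as a breadth-first walk over the dependency graph. This is a minimal sketch under a strong assumption: an outcome fails as soon as any one of its dependencies fails. The function and parameter names are hypothetical, and partial upstream success is deliberately not modeled, since that is one of the questions posed above.

```python
from collections import deque


def cascade_upstream_failure(depends_on, failed):
    """Return outcomes that transitively depend on a failed outcome.

    depends_on maps each outcome id to the ids it waits on.
    The originally failed outcomes are not included in the result.
    """
    # Invert the graph: for each outcome, which outcomes wait on it?
    dependents = {}
    for outcome, upstream in depends_on.items():
        for dep in upstream:
            dependents.setdefault(dep, set()).add(outcome)

    marked, queue = set(), deque(failed)
    while queue:
        for waiter in dependents.get(queue.popleft(), ()):
            if waiter not in marked and waiter not in failed:
                marked.add(waiter)
                queue.append(waiter)
    return marked
```

Any outcome the real system leaves in a waiting state that this walk would have marked is a candidate hang.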

### §4.4 Pattern C evaluation wiring
Pattern C wires evaluation results between modules via an `evaluation_result_out` port (V3.3 §5.18) and `target_evaluation_chain_id` binding (Common Contracts §3.7). Trace it end to end: producer emits, consumer binds, result flows. Is every step specified? What happens when the chain ID fails to resolve, or resolves to a stale or wrong target? Is the `EvaluationResultEnvelope` / `EvaluationArtifactEnvelope` distinction clean, or are there cases where the wrong envelope is used?

### §4.5 Taint, privilege, and access tiers
A prior review bound taint clearance to access tiers so that a junior user clearing `external_untrusted` taint cannot poison firm-wide or global scope. Examine whether the binding is complete — are there paths where taint is cleared without the access-tier check? Trace a taint-bearing payload through the full revision cycle and confirm taint propagates correctly across revision, re-evaluation, and pattern formation.
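
The intended invariant can be stated as a small check: a clearance removes a taint only for scopes no wider than the scope the clearing user's access tier covers. The sketch below assumes a simple linear ordering of scopes; the `personal` scope, the `Clearance` record, and the function name are all hypothetical and are not taken from the V3.3 schema.

```python
from dataclasses import dataclass

# Hypothetical scope ordering, narrowest first. The prompt names firm-wide
# and global scope; "personal" is an assumed narrower scope.
SCOPE_RANK = {"personal": 0, "firm_wide": 1, "global": 2}


@dataclass(frozen=True)
class Clearance:
    taint: str             # e.g. "external_untrusted"
    cleared_by_scope: str  # widest scope the clearing user's tier covers


def effective_taints(taints, clearances, target_scope):
    """Return the taints still in force when a payload enters target_scope."""
    target = SCOPE_RANK[target_scope]
    cleared = {
        c.taint
        for c in clearances
        if SCOPE_RANK[c.cleared_by_scope] >= target
    }
    return set(taints) - cleared
```

Under this model, a path "where taint is cleared without the access-tier check" is any code path that drops a taint without going through a function like `effective_taints`.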

### §4.6 Budget governance
A prior review bifurcated budgets into logical (`max_logical_llm_calls`) and infrastructure (`max_infrastructure_retries`) so local-model stuttering does not consume the planning budget. Examine whether the bifurcation is complete and consistently applied — are there budget-consuming paths that fall into neither category or are charged to the wrong one? Assess cost predictability: can a user know roughly what a task will cost before running it, and see where the budget went after?
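
A minimal ledger makes the "neither category" question concrete. In the sketch below only the two limit names, `max_logical_llm_calls` and `max_infrastructure_retries`, come from the prompt; the `BudgetLedger` class, the `charge` method, and the one-unit-per-charge accounting are assumptions for illustration.

```python
from dataclasses import dataclass, field


class BudgetExhausted(Exception):
    """Raised when a charge would exceed its budget."""


@dataclass
class BudgetLedger:
    max_logical_llm_calls: int
    max_infrastructure_retries: int
    spent: dict = field(
        default_factory=lambda: {"logical": 0, "infrastructure": 0}
    )
    log: list = field(default_factory=list)

    def charge(self, kind, reason):
        """Charge one unit to exactly one of the two budgets."""
        limits = {
            "logical": self.max_logical_llm_calls,
            "infrastructure": self.max_infrastructure_retries,
        }
        if kind not in limits:
            # A path that fits neither category is the gap to flag.
            raise ValueError(f"unclassified budget kind: {kind!r}")
        if self.spent[kind] >= limits[kind]:
            raise BudgetExhausted(kind)
        self.spent[kind] += 1
        self.log.append((kind, reason))
```

The `log` list is what would let a user see where the budget went after a run; whether the specs require an equivalent record is part of the cost-predictability question.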

### §4.7 Sub-agent dispatch contract
The `AdvisorySubAgentProfile` schema (V3.3 §8.4) defines specialist sub-agents that main modules dispatch to; registry mechanics live in a separate spec (DOC24). Examine the *contract* this set defines: is it complete enough for DOC24 to implement against without ambiguity, and what is underspecified at the boundary? Main modules must function correctly when no sub-agent is available — confirm the no-sub-agent fallback is fully specified for every dispatch point.

### §4.8 The DOC23 / DOC15 / DOC24 boundary
Feedback Delivery V1 §8 defines the boundary between DOC23 (wiring), DOC15 (final prompt assembly), and DOC24 (context selection and permissioning). Examine whether the boundary is crisp, or whether responsibilities fall into a gap between the three or are claimed by two. Examine whether the `TaskRunContextPacket` (Task Forum + Run Board §6.3) and the broader packet-assembly story fully specify the producer/consumer contract.

### §4.9 The InjectionSlotRegistry gap (known open item)
Core R0.7 §3D defines the `TaskOpportunityPacket` schema and token budgets but does not specify the corresponding slot-registration entries — slot_id naming, slot_kind, rendering constraints, the required-per-packet-kind matrix. This is a known gap flagged for a future Core R0.8 revision. Determine precisely what must be specified to close it, and whether other schema/registry gaps of the same kind exist elsewhere in the set.

### §4.10 HardCall, cancel, skip, and indeterminate semantics
Examine the HardCall escalation mechanism: what triggers a Hard Call, is the trigger set complete (are there situations that should escalate but don't), is the resolution flow well-defined, and what happens to a task while a Hard Call is pending? Examine cancel, skip, and indeterminate-outcome semantics across the set: when a user cancels mid-task, what state is the task left in; when an outcome is `indeterminate`, what happens downstream? Determine whether these edge states are fully specified or trail off.

---

## §5 Defect hunt — every document, exhaustive

This is the layer where you squash bugs. Go through all six documents and hunt, adversarially and exhaustively, for everything that is wrong, broken, missing, or too vague to build. The bar is blunt: if this set were handed to an AI coding agent tomorrow, every defect you miss becomes a defect in the system. Find them now.

Hunt for all of the following.

**Bugs — things that are wrong.** A described algorithm with a logic error. A state machine with an unreachable state, a state with no exit, or a missing transition. A schema that cannot represent what the system needs, or that permits invalid or contradictory data. A contract whose producer side and consumer side cannot both be satisfied. Off-by-one, boundary, ordering, and concurrency errors in any described behavior. Pseudocode, type definitions, or schema definitions that are simply incorrect as written. A behavior specified one way that should be another. A claim in one document contradicted by another. Anything that, implemented faithfully as written, yields a crash, corrupted state, a hang, or a wrong result.

**Bad or fragile design.** Mechanisms that are technically specified but will behave badly: approaches that do not scale, designs that hold on the happy path but degrade or fail under load or volume, absent or incorrect error handling, retry logic that can storm or loop, missing resource cleanup, missing idempotency where operations can be retried.

**Missing pieces.** Anything the system needs that no document provides. A required step with no specified mechanism. A failure mode with no handler. An enum used but never fully enumerated. A data structure referenced but never defined. An obligation stated with no acceptance criteria. A surface the user needs that no document delivers.

**Underspecification — anything that forces the implementer to guess.** Vague triggers ("when appropriate"), undefined thresholds and limits, "or equivalent" hand-waves, behavior given for the happy path but not the edges, schemas silent on missing or malformed input. Every place an AI coding agent would have to choose between unstated alternatives is a defect risk: flag each one and state exactly what specification — what code, schema, contract, or rule — would remove the guess.

**Missing wiring.** A signal emitted that nothing consumes. A field required but never populated. A port with one end specified and the other end missing. A state transition named but with no defined trigger. A producer with no consumer, or a consumer with no producer.

The standard for this layer: a competent AI coding agent should be able to implement all six documents without inventing behavior, without choosing between unstated alternatives, and without missing a connection because it was never drawn. Apply the hunt to every document — Core R0.7, Evaluator+Revisor V3.3, Common Contracts V1.1, Source Workspace V1, Feedback Delivery V1, and Task Forum + Run Board V1. Do not let the smaller documents escape scrutiny because the larger ones absorb attention.
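
For the missing-wiring category above, one workable technique is to tabulate every emitted and every consumed port or signal name while reading, then diff the two sets. A minimal sketch follows; the function name is hypothetical, and in the usage example only `evaluation_result_out` is a name from the document set.

```python
def find_unwired(produced, consumed):
    """Diff the names emitted somewhere against the names read somewhere.

    Returns (orphan_producers, orphan_consumers), each as a sorted list.
    """
    produced, consumed = set(produced), set(consumed)
    return sorted(produced - consumed), sorted(consumed - produced)


# Usage with one real port name and two made-up ones.
orphans = find_unwired(
    produced=["evaluation_result_out", "example_signal_a"],
    consumed=["evaluation_result_out", "example_signal_b"],
)
```

Each name in either returned list is a candidate `GAP` finding: a producer with no consumer, or a consumer with no producer.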

---

## §6 Self-learning surface — flag but do not deep-dive

The author is separately reorganizing the self-learning / improvement architecture (BDSM signals, pattern performance learning, the goal-advancement axis, optimization integration). That work is **deliberately paused** pending a memory-system reorganization.

Therefore: if you encounter self-learning material in these documents — `PatternPerformanceSlice` learning axes, the `goal_advancement_count` mechanism, the §6.12 sycophancy fix, BDSM signal emission, learning-signal envelopes — **flag anything genuinely new or broken in one line, then move on.** Do not spend review budget deep-diving the self-learning surface or re-deriving its gaps; the author has already mapped them. Self-learning findings should be a small, clearly labeled minority of your review. Spend your effort on the task-system mechanics, reliability, UX, and the technical defects in §4 and §5.

---

## §7 Output format and coverage

Produce findings in this structured format. The author routes this prompt to multiple models in parallel and compares outputs; structure enables comparison.

```
## [TAG] [SEVERITY] [DOC/SECTION] — [Finding title]

**Finding:** [1-3 sentences — what the issue or idea is]

**Why it matters:** [Concrete consequence — what breaks, what the user experiences, what the coding agent gets wrong, or what improvement is unlocked]

**Recommendation:** [Specific action — what to add, change, remove, specify, or investigate]

**Reference:** [Document and section; cross-references to other documents in the set where relevant]
```

**Tags:**
- `BUG` — something wrong, broken, internally contradictory, or contradicted across documents
- `GAP` — something that must be specified but isn't: missing pieces, underspecification, missing wiring, schema/contract gaps
- `RISK` — a distributed-systems, reliability, or emergent-behavior hazard
- `UX` — a user-experience, trust, transparency, or reviewability problem
- `IDEA` — a broad improvement, reconceptualization, or missing capability (the conceptual layer)
- `CONFIRMED` — a part that is genuinely well-designed; affirm briefly and move on

**Severity:**
- `CRITICAL` — must fix before build; the system fails or misleads the user without it
- `HIGH` — significant defect or gap; should fix before build
- `MEDIUM` — meaningful improvement; address during build
- `LOW` — minor; document for later
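
Because outputs from several models are compared, a finding header that parses mechanically is useful. The sketch below shows one possible parser. It assumes the square brackets in the template are placeholders rather than literal characters, and that the separator before the title is the long dash shown in the template (written as the escape `\u2014` in the code); `parse_header` is a hypothetical helper, not part of any spec.

```python
import re

TAGS = ("BUG", "GAP", "RISK", "UX", "IDEA", "CONFIRMED")
SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")

# Assumed shape: "## TAG SEVERITY doc/section <long dash> title".
HEADER = re.compile(
    r"^##\s+(?P<tag>{})\s+(?P<severity>{})\s+(?P<ref>.+?)\s+\u2014\s+(?P<title>.+)$".format(
        "|".join(TAGS), "|".join(SEVERITIES)
    )
)


def parse_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = HEADER.match(line.strip())
    return match.groupdict() if match else None
```

A header with an unknown tag or severity returns `None`, which makes malformed findings easy to spot before outputs are compared.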

**Required coverage** — your review must include at least:
- 6 conceptual findings (`IDEA` / `UX`) from §3 — the excellence question
- 10 targeted findings (`BUG` / `GAP` / `RISK`) from §4, of which at least 4 are cross-document seam findings (§4.1)
- 12 defect-hunt findings (`BUG` / `GAP`) from §5, spread across all six documents — not concentrated in one or two

Total minimum: 28 findings. If you have fewer, you have not read closely enough. More is better. Do not pad with trivia — but do not stop early, and do not let the defect hunt come up short; the author wants bugs found.
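
The coverage minimums above can be checked mechanically before a review is submitted. The sketch below is one possible reading: the `layer` and `doc` keys are hypothetical bookkeeping fields that are not part of the finding format, and "spread across all six documents" is interpreted as at least one defect-hunt finding per document, which is an assumption.

```python
DOCS = (
    "Core R0.7",
    "Evaluator+Revisor V3.3",
    "Common Contracts V1.1",
    "Source Workspace V1",
    "Feedback Delivery V1",
    "Task Forum + Run Board V1",
)


def coverage_shortfalls(findings):
    """Return a list of shortfall messages; an empty list means covered.

    Each finding is a dict with hypothetical keys:
      tag   - one of the tags defined above
      layer - prompt section that motivated it: "3", "4", "4.1", or "5"
      doc   - one of DOCS (only needed for layer "5")
    """
    def count(layers, tags):
        return sum(
            1 for f in findings if f["layer"] in layers and f["tag"] in tags
        )

    targeted = {"BUG", "GAP", "RISK"}
    problems = []
    if count({"3"}, {"IDEA", "UX"}) < 6:
        problems.append("fewer than 6 conceptual findings")
    if count({"4", "4.1"}, targeted) < 10:
        problems.append("fewer than 10 targeted findings")
    if count({"4.1"}, targeted) < 4:
        problems.append("fewer than 4 cross-document seam findings")
    if count({"5"}, {"BUG", "GAP"}) < 12:
        problems.append("fewer than 12 defect-hunt findings")
    hunted = {
        f.get("doc")
        for f in findings
        if f["layer"] == "5" and f["tag"] in {"BUG", "GAP"}
    }
    missing = [d for d in DOCS if d not in hunted]
    if missing:
        problems.append("defect hunt skips: " + ", ".join(missing))
    if len(findings) < 28:
        problems.append("fewer than 28 findings in total")
    return problems
```

Passing this check says nothing about quality; it only confirms the counts.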

---

## §8 Synthesis

After all findings, produce a synthesis section:

```
## Synthesis

**Overall assessment:** [2-3 paragraphs. Is this design sound? Will it produce a system high-stakes professionals can rely on? What is its single greatest strength and single greatest weakness? Is it build-ready, refine-first, or does some part need rethinking?]

**Top 5 must-fix before build:** [Rank-ordered CRITICAL/HIGH findings]

**Biggest conceptual opportunity:** [The one §3 idea that would most improve the system]

**Coding-agent readiness:** [Could an AI coding agent build this set today without guessing? If not, what are the largest sources of required guesswork?]
```

---

## §9 Constraints

- **Do not validate.** Brief `CONFIRMED` tags for genuinely strong parts, then move on. Spend your effort on what is wrong, missing, or improvable.
- **Be specific.** Cite document and section. "This is unclear" is useless; "Core R0.7 §3D defines X but not Y, so when Z happens the implementer must guess between A and B" is useful.
- **Review the set, not the documents in isolation.** The seams between the six documents are where many of the highest-value findings live.
- **Hold the excellence bar.** The target is not "works" — it is "a high-stakes professional relies on it."
- **Stay domain-agnostic.** Use professional-knowledge-work reliability as the lens; do not assume the system is legal-specific.
- **Do not assume unstated detail.** If the spec does not say it, that is a `GAP`, not something to fill in charitably.
- **Squash bugs.** The defect hunt is not secondary. Every defect found now is a defect kept out of the build.

Begin your review. Produce all 28+ findings in the structured format, then the synthesis.