# DOC23 Family Self-Learning Coherence Map V1 — Claude Red Team Review V2
**Reviewer:** Claude (Anthropic)
**Date:** 2026-05-19
**Document reviewed:** `DOC23_FAMILY_SELF_LEARNING_COHERENCE_MAP_V1.md` (706 lines, dated 2026-05-19)
**Supersedes:** V1 of this review (errors corrected after author feedback; superseded entirely)
**Supporting documents consulted:**
- DOC23 Addenda B Core R0.7 (2026-05-17)
- DOC23 Addenda B Outcome Evaluator+Revisor V3.3 (2026-05-17)
- DOC23 Addenda B Feedback Delivery Subsystem V1
- DOC23 Addenda A R4.1 V6 (2026-05-17)
- OP-A V3.16
- Recent ML literature (2024-2026) on process reward models, LLM agent memory, agent-based reasoning
**Review discipline:** Harsh critique per author's instruction. No validation padding. Substantive findings only.
**V2 vs V1 changes:**
- DSPy correctly framed as v1 component (not R5-deferred)
- Free-text rationale architecture preserved (Will's design choice; BDSM already calibrated for it)
- Sub-prompt optimization framing removed (architecture doesn't have sub-prompts; learning targets are task artifacts, patterns, values, configurations, strategies)
- BDSM nightly chat-extraction architecture correctly understood (Task Agent doesn't self-label)
- Research citations updated from 2022-2023 to 2024-2026 work
- "Defer to R2" language removed throughout (no phasing philosophy)
- Task Improvement Engineer (TIE) specified as first-class architectural component
- Multi-user forward-compatibility schema additions specified
- Unified learning architecture with 7 distinct mechanisms enumerated
**Coverage:** 32 findings — 5 Layer 1 (satellite), 11 Layer 2 (architectural detail), 7 Layer 3 (cutting-edge research), 5 Layer 4 (specific topic banks), 4 findings specific to the unified learning architecture, plus concrete examples appendix.
---
# Part I — Findings
## Layer 1: Satellite view
### [GAP] [CRITICAL] §1 / §4 / §10 — No measurement loop specified for the proposal itself
**Statement of finding:** The proposal adds substantial machinery (5 new signals, outcome clustering, multi-prior coordination, rationale standardization) but defines no mechanism to measure whether any of it actually improves task outcomes. After the spec lands and is implemented, there's no specified way to know if the system is learning or just generating signals into a void.
**Why this matters:** Without a measurement loop, the proposal is unfalsifiable. The system could ship the entire architecture, accumulate signals for months, and produce no detectable improvement — and there'd be no structured way to discover that. Worse, the absence of measurement means the proposal cannot itself be evaluated for ROI.
**Recommendation:** Specify a concrete measurement substrate as a first-class component. The Loop Effectiveness Test pattern compares (a) artifact → Judge baseline against (b) artifact → Evaluator → Revisor → Judge revised score, producing structured delta records that serve as: (1) operator visibility into loop ROI per task type, (2) automatic accumulation of labeled training data for Revisor and Evaluator DSPy targets, (3) diagnostic substrate for TIE's identification of when the loop fails, (4) input to trust calibration for autonomy thresholds. Specify `LoopEffectivenessTestRunRecord` schema; specify the task-graph wiring (two parallel paths feeding into a comparison node, no Experiments module wrapper needed); specify the BDSM consumption.
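A minimal sketch of what one delta record could look like, assuming the Judge emits numeric scores on a shared scale. Only the name `LoopEffectivenessTestRunRecord` comes from this review; every field name, the `noiseFloor` parameter, the `buildLoopEffectivenessRecord` helper, and the example task type are hypothetical placeholders, not the schema itself.

```typescript
// Hypothetical field set: only the record name appears in the review.
interface LoopEffectivenessTestRunRecord {
  task_type: string;
  baseline_judge_score: number; // path (a): artifact -> Judge
  revised_judge_score: number; // path (b): artifact -> Evaluator -> Revisor -> Judge
  score_delta: number; // revised minus baseline
  loop_helped: boolean; // true when the delta exceeds the noise floor
}

// Comparison node: joins the two parallel paths into one delta record.
function buildLoopEffectivenessRecord(
  taskType: string,
  baselineScore: number,
  revisedScore: number,
  noiseFloor = 0, // assumed tolerance for Judge scoring noise
): LoopEffectivenessTestRunRecord {
  const delta = revisedScore - baselineScore;
  return {
    task_type: taskType,
    baseline_judge_score: baselineScore,
    revised_judge_score: revisedScore,
    score_delta: delta,
    loop_helped: delta > noiseFloor,
  };
}

// "demand_letter" is an invented task type for illustration.
const record = buildLoopEffectivenessRecord("demand_letter", 6, 8);
```

Keeping the delta and the two raw scores in the same record lets the same row serve as a dashboard data point and as a labeled training example.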
**Reference:** §1 (executive summary lists what works but not how anyone would know), §4 (cross-module mechanisms — measurement absent).
---
### [BUG] [HIGH] §4.1 / §5 — Outcome cluster replacement imports problems the goal axis didn't have
**Statement of finding:** The proposal removes goal-advancement learning entirely and replaces it with emergent outcome clustering. But strategic-goal-conditioned learning is genuinely useful in real legal practice. Removing it loses signal entirely. The new clustering mechanism doesn't actually replace what goal-conditioning would have provided — clusters group outcomes by structural similarity, not by strategic intent.
**Why this matters:** Two outcomes that look structurally identical (same AssuranceBasis, same lane composition) may serve completely different strategic purposes (one is a settlement demand, one is a hostile-takeover-defense brief). Clustering by structure groups them together; goal-conditioning would distinguish them. The removal-plus-replacement loses real information.
**Recommendation:** Restructure as an addition. Keep `goal_advancement_count` removed (the broken implementation), but keep `GoalRef[]` as a real learning axis under a different name and mechanism. Rename to `strategic_intent_tag` — a user-declared, free-text-with-vocabulary field on tasks. Aggregate pattern slices per (cluster_id, strategic_intent_tag) when both are present. The clustering mechanism stays; the strategic conditioning stays; only the broken comparative-judge-self-grading is removed.
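The aggregation rule above can be sketched as follows, assuming outcomes carry an optional cluster ID and an optional intent tag. The `TaskOutcome` and `SliceStats` shapes, the `aggregateSlices` helper, the `::` key separator, and the example tag values are all hypothetical; only `cluster_id` and `strategic_intent_tag` come from the recommendation.

```typescript
// Hypothetical shapes: only cluster_id and strategic_intent_tag come from the review.
interface TaskOutcome {
  cluster_id?: string;
  strategic_intent_tag?: string; // user-declared, e.g. "settlement_demand"
  converged: boolean;
}

interface SliceStats {
  runs: number;
  converged: number;
}

// Aggregate only when both axes are present, keyed by (cluster_id, strategic_intent_tag).
function aggregateSlices(outcomes: TaskOutcome[]): Record<string, SliceStats> {
  const slices: Record<string, SliceStats> = {};
  for (const o of outcomes) {
    if (!o.cluster_id || !o.strategic_intent_tag) continue;
    const key = `${o.cluster_id}::${o.strategic_intent_tag}`;
    const stats = slices[key] ?? { runs: 0, converged: 0 };
    stats.runs += 1;
    if (o.converged) stats.converged += 1;
    slices[key] = stats;
  }
  return slices;
}

const slices = aggregateSlices([
  { cluster_id: "c-2891", strategic_intent_tag: "settlement_demand", converged: true },
  { cluster_id: "c-2891", strategic_intent_tag: "takeover_defense", converged: false },
  { cluster_id: "c-2891", converged: true }, // no intent tag: skipped
]);
```

Two structurally identical outcomes in the same cluster land in separate slices once their declared intents differ, which is the distinction clustering alone cannot make.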
**Reference:** §5.1 (proposes full removal), §5.2 (proposes weaker Option-2 metadata-only repurpose), §4.1 (cluster mechanism).
---
### [GAP] [HIGH] §4.1 — Cluster mechanics at single-user data scale not analyzed
**Statement of finding:** HDBSCAN is a density-based clustering algorithm designed for population-scale data. ELNOR is single-user. After several months of normal use, a single user produces tens to low hundreds of outcomes total. HDBSCAN at this scale typically produces either one giant cluster (collapse) or many singleton clusters (fragmentation).
**Why this matters:** The proposal stakes the entire goal-axis replacement on outcome clustering producing useful clusters. If clustering produces degenerate results at single-user scale, the replacement axis is worse than the broken axis it replaces.
**Recommendation:** Before committing to HDBSCAN, run a paper-design analysis: estimate realistic outcome volumes per matter, per month, per cluster admission threshold. If forecasts show degenerate behavior, switch to a different mechanism. Recommended alternative: **k-Nearest-Neighbors retrieval** — no clustering at all, just retrieve N most-similar past outcomes for context injection. Avoids cluster-ID stability problems entirely. Maps to current frontier memory-architecture research (Zep, H-MEM, Mem0 — see Layer 3 findings). Provides cluster-context-like behavior without the failure modes at single-user data scale.
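For illustration, k-NN retrieval over past outcomes needs very little machinery. The sketch below assumes each outcome already has a precomputed embedding vector and uses cosine similarity; the `PastOutcome` shape and both function names are hypothetical, and a real deployment would likely use a vector index rather than a linear scan.

```typescript
// Hypothetical record: assumes an embedding has been computed for each outcome.
interface PastOutcome {
  outcome_id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Retrieve the n most similar past outcomes; no cluster IDs are created or maintained.
function retrieveNearestOutcomes(
  query: number[],
  history: PastOutcome[],
  n: number,
): PastOutcome[] {
  return history
    .map((outcome) => ({ outcome, sim: cosineSimilarity(query, outcome.embedding) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, n)
    .map((x) => x.outcome);
}

const nearest = retrieveNearestOutcomes(
  [1, 0],
  [
    { outcome_id: "a", embedding: [1, 0] },
    { outcome_id: "b", embedding: [0, 1] },
    { outcome_id: "c", embedding: [0.9, 0.1] },
  ],
  2,
);
```

Because retrieval is recomputed per query, there is no persistent cluster identity that can drift as new outcomes arrive.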
**Reference:** §4.1 (entire mechanism); HDBSCAN is known to produce sparse-data pathologies (Campello et al. 2013).
---
### [BUG] [HIGH] §3.12 / §4 — Edit-trace feedback loops need explicit dampening; otherwise collapse to parroting
**Statement of finding:** The proposed lifecycle for edit-trace signals (§3.12 OutcomeCompilerProposalEditTrace, §3.4 TaskAgentProposalEditTrace) creates feedback loops with degenerate stable states. User edits Compiler's proposal toward X → BDSM aggregates → DOC24 injects "user typically edits toward X" as in-session prior → next Compiler proposal biases toward X → user accepts without editing → bias stabilizes around X → Compiler now produces what user said last, no novelty introduced. If user changes mind, the prior persists until enough new evidence accumulates, which may take many runs.
**Why this matters:** This is the failure mode the goal-axis removal was supposed to fix (sycophancy), but the new architecture reintroduces it through revealed-preference learning. A Compiler that always proposes what the user last accepted is parroting, not learning.
**Recommendation:** Specify explicit dampening and unlearning mechanisms in §4.1 and §4.2:
- **Time decay on edit-trace influence:** prior strength halves every N days of no reinforcement
- **Diversity injection requirement:** Compiler MUST sometimes propose against the revealed pattern (low probability, e.g., 10%) to maintain exploration
- **Drift detection:** if Compiler's proposals haven't varied meaningfully across N runs, flag as suspected collapsed state
- **Active disagreement signal:** when user makes a large edit (high diff distance), emit a `CompilerPriorRefutationSignal` that aggressively decays the relevant prior
- **TIE oversight:** TIE periodic audit checks for prior-injection drift; if detected, recommends prior decay or reset
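The first four mechanisms above can be sketched as small pure functions. Everything here is an assumption for illustration: the `EditTracePrior` shape, the function names, the `refutationFactor` default of 0.25, and the idea of measuring proposal variation as a list of diff distances. None of these values are proposed as calibrated settings.

```typescript
// Hypothetical prior record; strength is the influence on the next proposal (0..1).
interface EditTracePrior {
  strength: number;
  days_since_reinforcement: number;
}

// Time decay: strength halves every `halfLifeDays` without reinforcement.
function decayedStrength(prior: EditTracePrior, halfLifeDays: number): number {
  return prior.strength * Math.pow(0.5, prior.days_since_reinforcement / halfLifeDays);
}

// Active disagreement: a large edit (high diff distance) cuts the prior sharply.
function applyRefutation(
  strength: number,
  diffDistance: number,
  largeEditThreshold: number,
  refutationFactor = 0.25, // assumed decay multiplier
): number {
  return diffDistance >= largeEditThreshold ? strength * refutationFactor : strength;
}

// Diversity injection: with probability `explorationRate`, propose against the prior.
function shouldExplore(explorationRate: number, random: () => number = Math.random): boolean {
  return random() < explorationRate;
}

// Drift detection: flag when the last `windowSize` proposals barely differ from each other.
function isSuspectedCollapse(
  consecutiveDiffDistances: number[],
  windowSize: number,
  minVariation: number,
): boolean {
  const recent = consecutiveDiffDistances.slice(-windowSize);
  return recent.length === windowSize && recent.every((d) => d < minVariation);
}

const afterOneHalfLife = decayedStrength({ strength: 1, days_since_reinforcement: 30 }, 30);
```

Passing the random source as a parameter keeps the exploration decision testable and auditable, which matters if TIE later reviews why a proposal diverged from the revealed pattern.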
**Reference:** §3.12 (lifecycle proposal — no dampening), §4.1 (cluster context injection), §4.2 (multi-prior coordination — time-decay mentioned but not specified).
---
### [GAP] [HIGH] §1 / §4 — BDSM specification gap is the actual blocker; proposal adds obligations without fixing it
**Statement of finding:** The document repeatedly identifies BDSM's "utility compilation rules" as unspecified (§3.1 step 8, §3.2 step 8, §3.4 step 8, §3.5 step 8, §3.7 step 9, §3.8 step 10, §3.9 step 10, §3.11). It then adds 5 new signal types and a standardized rationale field, all of which BDSM must consume. The proposal piles obligations onto BDSM without specifying how BDSM transforms signals into utility bundles.
**Why this matters:** Adding more signal types doesn't unblock anything if BDSM doesn't know what to compute from them. BDSM is the load-bearing component in nearly every learning loop. Its specification gap is the actual blocker for the Self-Learning Architecture R1 spec the proposal anticipates producing.
**Recommendation:** Before producing R1, produce a BDSM utility-compilation specification that defines: (a) how each signal type aggregates over time, (b) what derived metrics emerge from aggregation, (c) what utility bundles compile from aggregated metrics, (d) how bundles are queried by DOC24, (e) the BDSM nightly extraction pipeline for chat-during-setup signals (where BDSM extracts correction-relevant segments using next-proposal-change as ground truth, without requiring Task Agent self-labeling), (f) the BDSM interface that TIE consumes for diagnostic reasoning. Either expand this scoping doc to include BDSM specification, or commit BDSM specification as a parallel workstream that must complete before R1 ships.
**Reference:** Cross-cutting across §3.1-§3.11; §1 doesn't acknowledge BDSM as gating.
---
## Layer 2: Architectural detail
### [GAP] [CRITICAL] §4 — Task Improvement Engineer (TIE) not specified; the highest-leverage learning mechanism is missing entirely
**Statement of finding:** The proposal's learning architecture relies on signal capture feeding BDSM, BDSM compiling utility bundles, downstream consumers (DOC72 patterns, DOC24 priors, DSPy training data) acting on bundles. This is incomplete. The proposal lacks an LLM-reasoning layer that processes accumulated diagnostic data and produces structured improvement recommendations spanning all of the system's learnable surfaces.
The proposed DSPy targets address only top-level module prompts. The proposed BDSM utility compilation produces aggregates and patterns. Neither mechanism produces actionable recommendations for: rubric refinement, OutcomeDefinition improvement, configuration tuning, strategy-selection updates, DOC72 pattern primitive creation, user constitution updates, sub-agent dispatch rule changes, task graph topology improvements, or architectural/code changes.
These are the highest-leverage learning targets in a single-user sparse-data system. They're also exactly what frontier-model reasoning over structured diagnostic data is good at. The proposal misses this entirely.
**Why this matters:** Without TIE, the proposal's learning architecture is limited to:
- DSPy optimization of top-level module prompts (slow; requires substantial data)
- Pattern emergence in DOC72 (passive; no agent reasoning)
- Quality dashboards (visibility only; no action)
This leaves the majority of the system's learnable surfaces unaddressed. Rubrics never improve from data. OutcomeDefinitions stay generic. Configurations stay default. Strategy-selection doesn't learn. The system accumulates signal but mostly can't act on it.
**Recommendation:** Add Task Improvement Engineer (TIE) as a first-class architectural component. Specification:
**Definition:** TIE is a specialized agent configuration that periodically (scheduled or on-demand) analyzes accumulated per-step diagnostic data and produces structured improvement recommendations across the system's full set of learnable surfaces. Recommendations route through a multi-stage review pipeline before applying.
**Tiers of intervention (TIE operates at multiple levels):**
- **Tier 1 — Task artifact level:** Edits a specific saved task's rubric, OutcomeDefinition, or configuration. Scope: one saved task. Lowest risk.
- **Tier 2 — Cross-task pattern level:** Writes DOC72 pattern primitives, updates Compiler default rules, modifies user constitution. Scope: all future tasks matching the pattern. Medium risk.
- **Tier 3 — System configuration level:** Updates module configurations, sub-agent dispatch rules, learned lookup tables, model_class routing. Scope: system-wide. Medium-high risk.
- **Tier 4 — Architecture/code level:** Proposes schema changes, module behavior changes, new modules, code patches. Scope: structural. Highest risk.
**Capabilities TIE requires (read access):**
- All `EvaluationLearningSignalEnvelope` storage across modules
- User correction records with rationales
- Module quality metrics (V3.3 §15 dashboard data)
- Task design pattern cards (Core R0.7 §9.6)
- Outcome cluster membership (DOC72)
- Pattern slice performance (DOC72)
- Sub-agent reputation records
- Task graph execution traces
- OutcomeDefinition / rubric artifacts for all saved tasks
- Module configurations
- DOC72 pattern primitives
- User constitution
- Task graph topology definitions
- Drafting agent configurations
- **Codebase access** for Tier 4 work (read code, propose patches)
- **Spec access** for understanding what's optimizable (read DOC23 family specs)
- **Self-learning system meta-information** (knowledge of what learning mechanisms exist and how to engage them)
**Capabilities TIE requires (write access — always gated through ImplementationProposal lifecycle, never direct):**
- DOC72 pattern primitives (new patterns) — via EC write
- OutcomeDefinition / rubric artifacts (refinements) — via EC write
- User constitution (new values/preferences) — via EC write
- Module configurations (parameter changes) — via configuration registry write
- Task graph topology proposals (structural changes) — via task definition update
- Sub-agent dispatch rule updates — via dispatch rule registry write
- Code change proposals (Tier 4) — generates diffs; user reviews; Claude Code/Codex implements
**The TIE multi-stage pipeline:**
```
Stage 1 — Issue identification (continuous, cheap):
    Lightweight monitor (Kimi or local model) scans accumulated signals for
    patterns: repeated user corrections, score deltas, anomaly detection.

Stage 2 — Severity threshold gate:
    Patterns crossing threshold (e.g., 3+ similar corrections, critical
    correction, dashboard anomaly) escalate to TIE. Below threshold: logged,
    not escalated.

Stage 3 — TIE analysis (frontier model):
    TIE does structured diagnostic reasoning. Produces
    DiagnosticImprovementRecommendation with proposed changes ranked by
    confidence and impact. Considers all 4 tiers as candidate change levels.

Stage 4 — Multi-LLM red team review:
    TIE's recommendation gets adversarial review by other models. Different
    models assess: Is the diagnosis correct? Are there alternative explanations
    TIE missed? Are recommended changes well-targeted? Are there unintended
    consequences? Refined recommendation produced.

Stage 5 — Optional user gating:
    Will reviews refined recommendation + adversarial notes. Can approve,
    reject, revise. For low-risk Tier 1 changes with high TIE track record,
    can auto-proceed; configurable per change kind.

Stage 6 — Implementation proposal generation:
    For recommendations requiring code changes (Tier 4), specialized agent
    generates actual code patches. For structured artifact changes (Tiers
    1-3), changes are prepared as reviewable diffs.

Stage 7 — Final review:
    Will sees structured change with full context: issue, diagnosis,
    adversarial review, proposed change, affected artifacts/code. Approve,
    revise, or reject. Approved changes apply.

Stage 8 — Outcome tracking:
    After approved changes apply, system tracks whether issue actually
    resolves. Produces meta-signal: TIE's recommendations either work or
    they don't. Loop Effectiveness Test is the measurement tool. Feeds trust
    calibration.
```
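As a rough illustration of the Stage 2 gate, the escalation decision reduces to a disjunction over the example thresholds named in the pipeline. The `DetectedPattern` shape and the `shouldEscalateToTie` name are hypothetical, and the default of 3 simply mirrors the "3+ similar corrections" example rather than a tuned value.

```typescript
type Severity = "low" | "medium" | "high" | "critical";

// Hypothetical summary produced by the Stage 1 monitor.
interface DetectedPattern {
  similar_correction_count: number;
  max_correction_severity: Severity;
  dashboard_anomaly: boolean;
}

// Stage 2: escalate when any one threshold rule is crossed; otherwise log only.
function shouldEscalateToTie(
  pattern: DetectedPattern,
  minSimilarCorrections = 3, // mirrors the "3+ similar corrections" example
): boolean {
  return (
    pattern.similar_correction_count >= minSimilarCorrections ||
    pattern.max_correction_severity === "critical" ||
    pattern.dashboard_anomaly
  );
}

const escalate = shouldEscalateToTie({
  similar_correction_count: 1,
  max_correction_severity: "critical",
  dashboard_anomaly: false,
});
```

Keeping the gate deterministic means the expensive frontier-model stage only runs on patterns whose escalation reason can be recorded in `threshold_crossed`.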
**Schemas required (new):**
```ts
// Referenced types not declared here (timestamp, SignalRef, ThresholdRule,
// PrincipalRef, LearningScope, ModelRef, ReviewRecord, ArtifactDiff, CodeDiff,
// AgentRef, Metrics) are assumed to be declared elsewhere in the DOC23 family.

interface ImprovementIssue {
  issue_id: string
  detected_at: timestamp
  severity: "low" | "medium" | "high" | "critical"
  pattern_summary: string
  evidence_signal_ids: SignalRef[]
  threshold_crossed: ThresholdRule
  routed_to_tie: boolean
  routed_at?: timestamp
  // Multi-user forward-compat
  principal_id: PrincipalRef
  learning_scope: LearningScope // user_only by default
}

// Shared by DiagnosticImprovementRecommendation and ImplementationProposal
type ChangeKind =
  | "rubric_refinement"
  | "outcome_definition_refinement"
  | "pattern_primitive_emergence"
  | "stated_values_update"
  | "memory_injection_rule"
  | "configuration_change"
  | "strategy_selection_update"
  | "model_class_change"
  | "sub_agent_dispatch_change"
  | "task_graph_topology_change"
  | "schema_change"
  | "code_change"
  | "new_module_proposal"

interface DiagnosticImprovementRecommendation {
  rec_id: string
  issue_id: string
  tier: 1 | 2 | 3 | 4
  observed_pattern: string
  diagnostic_interpretation: string
  evidence_signal_ids: SignalRef[]
  recommended_changes: Array<{
    change_kind: ChangeKind
    change_target: string // artifact ID, pattern ID, schema name, file path
    change_specification: string // the actual edit or diff
    confidence: number
    expected_impact: string
  }>
  produced_at: timestamp
  produced_by_model: ModelRef
  schema_version: 1
}

interface TieAnalysisRecord {
  analysis_id: string
  recommendation_ref: DiagnosticImprovementRecommendation
  reviewed_by_models: ReviewRecord[] // multi-LLM red team
  refined_recommendation_ref: DiagnosticImprovementRecommendation
  user_gate_status: "pending" | "skipped" | "approved" | "rejected" | "revised"
  user_revision_notes?: string
}

interface ImplementationProposal {
  proposal_id: string
  analysis_id: string
  change_kind: ChangeKind
  proposed_artifact_changes: ArtifactDiff[]
  proposed_code_changes?: CodeDiff[] // when applicable
  generated_by_agent: AgentRef
  user_approval_status: "pending" | "approved" | "rejected" | "approved_with_modifications"
  applied_at?: timestamp
}

interface ImprovementOutcomeRecord {
  proposal_id: string
  applied_at: timestamp
  baseline_metrics: Metrics // pre-change measurements
  post_change_metrics: Metrics // post-change (Loop Effectiveness Test data)
  outcome: "resolved" | "improved" | "no_effect" | "worsened"
  tie_accuracy_signal: number // feeds trust calibration
  outcome_assessed_at: timestamp
}
```
**Reference:** Cross-cutting; the proposal's §4 cross-module mechanisms section has the right structure but is missing TIE entirely.
---
### [GAP] [HIGH] §4 — DSPy target registry is module-level only; misses artifact, pattern, and configuration learning surfaces
**Statement of finding:** The proposal's §4.4 DSPy target table lists four module-level targets (claim_extractor_main, outcome_evaluator_main, revision_compiler_main, outcome_compiler_main). This is correct as far as it goes — these are the right top-level prompts to optimize. But the proposal frames DSPy as "the optimization mechanism" when in fact most of the system's learnable surfaces are not prompts at all.
The current proposal treats DSPy as nearly the entire self-learning architecture. It isn't. DSPy is one of seven distinct learning mechanisms (see Part II below). The proposal needs to acknowledge this and specify the other six mechanisms.
**Why this matters:** Conflating "the learning architecture" with "DSPy" produces an architecture where:
- Rubrics never improve from data (DSPy can't edit task artifacts)
- OutcomeDefinitions never improve (same)
- Patterns don't emerge from learned diagnoses (DSPy doesn't reason about meaning)
- Configurations stay default forever (DSPy doesn't tune parameters)
- Strategy selection doesn't learn (DSPy doesn't update selection logic outside the main prompt)
- Sub-agent dispatch reputation extends only as currently specified (no improvement mechanism)
Per the architecture's actual structure: module prompts are fixed, generic, and consume task-specific context at runtime. Most of the system's improvability lives in the surrounding context — rubrics, patterns, values, configurations, strategies — none of which DSPy addresses.
**Recommendation:** Reframe §4.4 from "DSPy training data architecture" to "Mechanism A: DSPy module-level prompt optimization (one of several mechanisms)." Add complete specifications for the other six mechanisms (see Part II). DSPy training data per target stays as specified for the four module-level targets; the proposal's actual gap is the absence of the other six mechanisms, not DSPy itself.
**Reference:** §4.4; Part II of this review.
---
### [GAP] [HIGH] §4.3 / §3.5 / §3.12 — Multi-user forward-compatibility schemas missing entirely
**Statement of finding:** The proposal specifies signal envelopes, pattern slices, clusters, and rationale fields without scope/visibility tagging for future networked use. Will plans for eventual networking (firm-wide deployment, possibly shared product). Without scope tags from day one, future networking requires retrofitting every accumulated signal and pattern.
**Why this matters:** Retrofitting scope tags on accumulated learning data is expensive and error-prone. Signals captured in single-user mode without scope context can't be reliably classified as user-specific vs shareable later. Privilege firewall is particularly hard to retrofit — content that was captured without privilege classification may be inferentially privileged, and the lack of classification makes the firewall unenforceable.
**Recommendation:** Add multi-user forward-compatibility fields to all learning artifact schemas now, even though single-user mode defaults everything to user_only:
```ts
// Referenced types not declared here (PrincipalRef, SharingEligibility, ScopeTag,
// PreferenceOrigin, timestamp) are assumed to be declared elsewhere in the DOC23 family.

// Extension to EvaluationLearningSignalEnvelope (Core R0.7 §9.0)
interface EvaluationLearningSignalEnvelope {
  // ... existing fields ...
  // Multi-user forward-compatibility (new)
  principal_id: PrincipalRef // signal owner; defaults to Will single-user
  learning_scope:
    | "user_only" // never shared (default)
    | "team_eligible" // promotable to team-shared with admin approval
    | "firm_eligible" // promotable to firm-shared
    | "public_eligible" // safe for product-wide aggregation
  scope_inference_basis:
    | "user_explicit" // user told the system the scope
    | "inferred_from_content" // BDSM inferred (e.g., privileged → user_only)
    | "default" // no explicit determination; default applied
  default_scope_rule: string // which rule produced the default (auditable)
}

// Extension to PatternPerformanceSlice context_signature (V3.3 §13.3)
interface PatternPerformanceSlice {
  context_signature: {
    // ... existing fields ...
    principal_id: PrincipalRef
    cross_user_repetition_count: number // 1 in single-user; >1 networked when pattern observed across users
    share_eligibility: SharingEligibility // computed from data_class + scope + content
  }
}

// New: UserConstitution (concrete artifact)
interface UserConstitution {
  constitution_id: string
  principal_id: PrincipalRef // owner
  values: Array<{
    value_id: string
    domain: string // "legal_writing", "investment_research", etc.
    statement: string // free-text user value
    scope: ScopeTag // user_only by default; explicit promotion for team/firm
    origin: "user_authored" | "tie_recommended_and_user_accepted" | "tie_recommended_auto_accepted"
    captured_at: timestamp
    last_reviewed_at?: timestamp
    confidence: number
  }>
  preferences: Array<{
    preference_id: string
    domain: string
    statement: string
    scope: ScopeTag
    origin: PreferenceOrigin
  }>
  schema_version: 1
}
```
**Default-scope rules at single-user mode:**
- `data_class == "privileged"` → `learning_scope: user_only` always (not promotable)
- `data_class == "local_only"` → `learning_scope: user_only`
- `data_class == "internal"` AND content_class is matter-specific → `learning_scope: user_only`
- `data_class == "internal"` AND content_class is generic-procedural → `learning_scope: team_eligible` (still defaults to user_only; flag for promotion review)
- `data_class == "public"` → `learning_scope: public_eligible` (still defaults to user_only until explicit promotion)
- Stated values always `user_only` unless user explicitly shares to team/firm
- Cluster centroids derived from `user_only` data NEVER bleed into cross-user clusters even if structurally similar
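The data_class rules above can be expressed as a single auditable function. This sketch makes one interpretive assumption that should be confirmed by the author: it separates the eligibility tag (how far a signal could be promoted) from the effective scope (always `user_only` in single-user mode). The `ScopeDecision` shape, the `ContentClass` values, and the rule-name strings are hypothetical.

```typescript
type DataClass = "privileged" | "local_only" | "internal" | "public";
type ContentClass = "matter_specific" | "generic_procedural";
type LearningScope = "user_only" | "team_eligible" | "firm_eligible" | "public_eligible";

// Hypothetical result shape: eligibility vs. effective scope is an assumed reading.
interface ScopeDecision {
  eligibility: LearningScope; // furthest scope the signal could be promoted to
  effective: LearningScope; // scope actually applied in single-user mode
  flag_for_promotion_review: boolean;
  default_scope_rule: string; // auditable rule name (invented for this sketch)
}

function defaultScope(dataClass: DataClass, contentClass: ContentClass): ScopeDecision {
  const decide = (eligibility: LearningScope, rule: string): ScopeDecision => ({
    eligibility,
    effective: "user_only", // nothing is shared until explicit promotion
    flag_for_promotion_review: eligibility !== "user_only",
    default_scope_rule: rule,
  });
  if (dataClass === "privileged") return decide("user_only", "privileged_never_promotable");
  if (dataClass === "local_only") return decide("user_only", "local_only_stays_user_only");
  if (dataClass === "internal") {
    return contentClass === "generic_procedural"
      ? decide("team_eligible", "internal_generic_procedural")
      : decide("user_only", "internal_matter_specific");
  }
  return decide("public_eligible", "public_pending_explicit_promotion");
}

const privileged = defaultScope("privileged", "generic_procedural");
const internalGeneric = defaultScope("internal", "generic_procedural");
```

Recording the rule name alongside the decision is what makes `default_scope_rule` useful later: a retrofit can re-evaluate only the signals produced by a rule that has since changed.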
**Reference:** New requirement; not in proposal at all.
---
### [GAP] [HIGH] §4.2 — Multi-prior coordination defers conflict resolution to LLM inference; reasoning not specified
**Statement of finding:** §4.2 specifies priority order and budget allocation but defers conflict resolution to the LLM: "When two priors of same kind disagree, prompt assembly presents both with annotations and lets the LLM weigh them." This is not a conflict resolution policy — it's a deferral that produces non-deterministic resolution varying per call.
**Why this matters:** Same user with same conflicting priors gets different proposals on different runs. Token budget is wasted on content the LLM picks one of and discards. For a constitution-aware system (see below for stated-values prior addition), the inconsistency undermines the value of stated values entirely.
**Recommendation:** Specify deterministic conflict resolution at prompt-assembly time, not LLM-inference time. Three approaches:
- **Recency-weighted merging.** More recent prior wins entirely; older dropped.
- **Frequency-weighted merging.** Stronger pattern (more frequent) wins.
- **User-arbitrated resolution.** When same-kind conflict detected at compile time, surface a question to the user.
Recommend hybrid: recency-weighted for low-stakes module behavior; user-arbitrated (or TIE-arbitrated) for high-stakes decisions. Specify the boundary in §4.2. Also add: **stated values prior** as a seventh prior kind (see Layer 3 finding on Constitutional AI), which has higher priority than revealed-preference and is not subject to time-decay (values are stable, not transient patterns).
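A sketch of the hybrid policy, resolved at prompt-assembly time so the same inputs always yield the same result. The `Prior` shape, the `high_stakes` flag, and the `Resolution` type are hypothetical; how a prior gets classified as high-stakes is exactly the boundary §4.2 would need to define.

```typescript
// Hypothetical prior record used only for this illustration.
interface Prior {
  kind: string;
  statement: string;
  observed_at: number; // epoch milliseconds of the latest supporting evidence
  high_stakes: boolean; // assumed to be set by the boundary rule in §4.2
}

type Resolution =
  | { action: "use"; prior: Prior }
  | { action: "ask_user"; candidates: [Prior, Prior] };

// Deterministic resolution for two conflicting priors of the same kind.
function resolveSameKindConflict(a: Prior, b: Prior): Resolution {
  if (a.high_stakes || b.high_stakes) {
    // User-arbitrated (or TIE-arbitrated): surface the conflict instead of guessing.
    return { action: "ask_user", candidates: [a, b] };
  }
  // Recency-weighted: the more recent prior wins entirely; the older is dropped.
  return { action: "use", prior: a.observed_at >= b.observed_at ? a : b };
}

const older: Prior = { kind: "tone", statement: "prefer formal tone", observed_at: 1000, high_stakes: false };
const newer: Prior = { kind: "tone", statement: "prefer plain tone", observed_at: 2000, high_stakes: false };
const lowStakes = resolveSameKindConflict(older, newer);
```

Only the winning prior reaches the prompt, so no token budget is spent on content the model would discard.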
**Reference:** §4.2 (conflict resolution paragraph).
---
### [GAP] [HIGH] §4.1 — Cluster ID stability claim is hand-wavy; HDBSCAN drift is real
**Statement of finding:** §4.1 mentions "centroid-distance-based ID inheritance" as the solution to HDBSCAN's cluster-ID instability. HDBSCAN cluster shapes can drift dramatically with one new sample — entire subsets of clusters can merge or split. Centroid-distance inheritance assumes centroids move continuously; HDBSCAN centroids do not.
**Why this matters:** If cluster IDs are not stable, then `PatternPerformanceSlice` accumulated against `outcome_cluster_id: c-2891` becomes meaningless when c-2891's composition changes. Historical pattern slices reference cluster IDs that no longer mean what they meant.
**Recommendation:** Treat cluster ID stability as a hard constraint. Options:
- **Stable hierarchical algorithm (agglomerative).** Produces nested cluster hierarchies stable under updates.
- **Persist cluster definitions, not just IDs.** Once promoted to "durable," persist centroid + member set + admission rule. Subsequent runs admit new members to durable clusters by similarity rather than re-clustering from scratch.
- **Eliminate clustering entirely.** Per the cluster-mechanics finding above, switch to k-NN retrieval.
The k-NN approach also aligns with current frontier memory-architecture practice (Zep 2025, H-MEM 2025, Mem0 2025 — see Layer 3 findings). Worth considering as primary mechanism, not just fallback.
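The "persist cluster definitions" option could look roughly like the sketch below: once a cluster is durable, its centroid and admission rule are frozen, and new outcomes are admitted by distance instead of triggering a re-cluster. The `DurableCluster` shape, the radius-based admission rule, and the function names are assumptions for illustration.

```typescript
// Hypothetical durable cluster: centroid and admission rule are frozen at promotion time.
interface DurableCluster {
  cluster_id: string;
  centroid: number[];
  member_ids: string[];
  admission_radius: number; // assumed admission rule: max distance to centroid
}

function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    const d = (a[i] ?? 0) - (b[i] ?? 0);
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Admit to the nearest durable cluster within its radius; never re-cluster from scratch.
function admitToDurableCluster(
  outcomeId: string,
  embedding: number[],
  clusters: DurableCluster[],
): string | null {
  let best: DurableCluster | null = null;
  let bestDistance = Infinity;
  for (const cluster of clusters) {
    const distance = euclideanDistance(embedding, cluster.centroid);
    if (distance <= cluster.admission_radius && distance < bestDistance) {
      best = cluster;
      bestDistance = distance;
    }
  }
  if (best === null) return null; // unassigned: candidate for a future new cluster
  best.member_ids.push(outcomeId);
  return best.cluster_id;
}

const durable: DurableCluster[] = [
  { cluster_id: "c-2891", centroid: [0, 0], member_ids: [], admission_radius: 1 },
];
const assigned = admitToDurableCluster("o-1", [0.5, 0], durable);
```

Because the centroid never moves after promotion, a `PatternPerformanceSlice` accumulated against `c-2891` keeps referring to the same region of outcome space.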
**Reference:** §4.1 ("open design questions"), §9 (open questions item 2).
---
### [GAP] [HIGH] §4.1 / §4.3 — Cross-matter privilege firewall not designed; semantic clustering risk
**Statement of finding:** §4.1 lists "cross-matter cluster firewall" as an open question; §4.3 specifies BDSM "semantic-clusters" rationale text. These two are in tension. Rationale fields may contain matter-specific content. Semantic clustering across matters means the cluster centroid encodes matter-specific signal even if individual rationale members are firewalled. The cluster centroid becomes inferentially privileged.
**Why this matters:** Privilege protection in legal practice is strict. Privilege firewall must extend to derived data, not just source data. Without explicit firewall design, EC Core can't reliably gate signal flow when networking ships.
**Recommendation:** Specify firewall explicitly:
- **Cluster scope:** Default to matter-scoped clusters. Cross-matter promotion is explicit user action with privilege-waiver acknowledgment.
- **Rationale clustering scope:** Semantic clustering of rationale text occurs within data_class boundaries. Privileged-only clusters, internal-only clusters, public-only clusters are disjoint cluster spaces.
- **Cluster export rules:** When a cluster is referenced cross-matter, the reference passes through data_class filtering — only public/internal cluster content reaches non-source matters.
- **TIE oversight:** TIE never produces recommendations that would leak privileged content across matters; its read access enforces privilege boundaries; this is built into its capability gating.
Add as a new §4.5 in the scoping document.
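One possible reading of the cluster scope and export rules, written as a single gate function. This is a sketch under assumptions: the `ClusterRef` shape and `mayReference` name are hypothetical, and the combination of "explicit promotion" with "only public/internal content crosses matters" is an interpretation that the §4.5 text would need to confirm.

```typescript
type DataClass = "privileged" | "local_only" | "internal" | "public";

// Hypothetical cluster reference with the metadata the firewall needs.
interface ClusterRef {
  cluster_id: string;
  source_matter_id: string;
  data_class: DataClass;
  cross_matter_promoted: boolean; // explicit user action with privilege-waiver acknowledgment
}

// Gate applied whenever a matter tries to read a cluster.
function mayReference(cluster: ClusterRef, requestingMatterId: string): boolean {
  // Matter-scoped by default: the source matter can always read its own clusters.
  if (cluster.source_matter_id === requestingMatterId) return true;
  // Cross-matter access requires explicit promotion.
  if (!cluster.cross_matter_promoted) return false;
  // Export rule: only public/internal cluster content reaches non-source matters.
  return cluster.data_class === "public" || cluster.data_class === "internal";
}

const privilegedCluster: ClusterRef = {
  cluster_id: "c-1",
  source_matter_id: "matter-A",
  data_class: "privileged",
  cross_matter_promoted: true,
};
const leak = mayReference(privilegedCluster, "matter-B");
```

Applying the gate to the cluster itself, not to its members, is what extends the firewall to derived data such as centroids.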
**Reference:** §4.1 (open question only), §4.3 (BDSM clustering treatment).
---
### [GAP] [HIGH] §3 / §4 — RepairCycleSignal underutilized; its rich per-criterion data is the supervision signal already in place
**Statement of finding:** RepairCycleSignal (Core R0.7 §9.0.2) already carries the highest-quality supervision signal in the entire pipeline: structured before/after snapshots, per-criterion score deltas, qualitative finding deltas. This is labeled training data ready for consumption. But §3.2 still says the consumer rules are partial, and §4.4 lists "RepairCycleSignal aggregate" + "Pattern slice convergence/failure rates" as Revisor training sources without specifying how the rich per-criterion deltas feed in.
**Why this matters:** The Revisor target is the highest-leverage DSPy target, and the data needed to train it is already being captured. But the rich part is underused: only the aggregate counters are consumed. Adding 5 new signals while underusing the richest existing one is a misallocation of effort.
**Recommendation:** Promote RepairCycleSignal to first-class DSPy training source AND first-class TIE input:
- **DSPy Revisor:** For each signal, the length of `qualitative_delta.resolved_finding_ids` minus the length of `new_finding_ids` is direct supervision. When the Judge ran, `per_criterion_score_deltas` gives a sharper signal. Specify the adapter that turns RepairCycleSignal into Revisor training examples.
- **TIE:** Per-criterion score deltas are exactly the per-step diagnostic data TIE reasons over. Specify the TIE consumption path that produces strategy-selection recommendations, configuration tuning recommendations, and Tier 3 system-configuration diagnoses based on aggregated RepairCycleSignal data.
- **Loop Effectiveness Test:** Connects to the canonical test pattern; RepairCycleSignal is the messier organic version of the same supervision signal.
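The Revisor adapter described above might be as small as the following sketch. The field names `qualitative_delta`, `resolved_finding_ids`, `new_finding_ids`, and `per_criterion_score_deltas` come from this finding; their exact nesting, the `RevisorTrainingExample` shape, and the choice of the mean as the aggregate are assumptions (the authoritative schema is Core R0.7 §9.0.2).

```typescript
// Assumed subset of RepairCycleSignal; the full schema lives in Core R0.7 §9.0.2.
interface RepairCycleSignal {
  signal_id: string;
  qualitative_delta: { resolved_finding_ids: string[]; new_finding_ids: string[] };
  per_criterion_score_deltas?: Record<string, number>; // present only when Judge ran
}

// Hypothetical training example shape.
interface RevisorTrainingExample {
  signal_id: string;
  reward: number;
  reward_source: "per_criterion_score_deltas" | "finding_count_delta";
}

function toRevisorExample(signal: RepairCycleSignal): RevisorTrainingExample {
  const byCriterion = signal.per_criterion_score_deltas ?? {};
  const deltas = Object.keys(byCriterion).map((criterion) => byCriterion[criterion] ?? 0);
  if (deltas.length > 0) {
    // Sharper signal: mean per-criterion score delta (the mean is an assumed aggregate).
    const mean = deltas.reduce((sum, d) => sum + d, 0) / deltas.length;
    return { signal_id: signal.signal_id, reward: mean, reward_source: "per_criterion_score_deltas" };
  }
  // Fallback: resolved findings minus newly introduced findings.
  const q = signal.qualitative_delta;
  return {
    signal_id: signal.signal_id,
    reward: q.resolved_finding_ids.length - q.new_finding_ids.length,
    reward_source: "finding_count_delta",
  };
}

const fromFindings = toRevisorExample({
  signal_id: "s-1",
  qualitative_delta: { resolved_finding_ids: ["f1", "f2", "f3"], new_finding_ids: ["f4"] },
});
```

Tagging each example with its `reward_source` lets the training pipeline weight Judge-backed examples above count-based ones instead of mixing them silently.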
**Reference:** §3.2 (RepairCycleSignal lifecycle — partial), Core R0.7 §9.0.2 (full schema, underutilized); §4.4 (DSPy training architecture).
---
### [GAP] [MEDIUM] §3.4 / §3.12 — PlanDiff and proposal-diff schemas underspecified; signal value depends critically on diff choice
**Statement of finding:** §3.12 references a `PlanDiff` schema for OutcomeCompilerProposalEditTrace. §3.4 describes a "diff between proposed and accepted" task design. Neither specifies whether the diff is computed at the field-by-field, text, or semantic level.
**Why this matters:** Different diff representations lead to different DSPy training behavior and different TIE diagnostic value. Without specification, implementations diverge.
**Recommendation:** Specify a layered diff record reusable across §3.12-§3.15:
- **Structural diff (mandatory):** Field-by-field comparison for typed fields
- **Text diff (mandatory for free-text fields):** Standard text diff for criteria_description, instructions, narrative
- **Semantic diff (optional, computed on demand):** Embedding-distance for clustering needs
Specify in a shared schema in Common Contracts.
**Reference:** §3.4, §3.12, §3.13-§3.15.
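The layered record recommended above could look roughly like the following sketch. All type and function names are hypothetical, records are assumed to be flat, and the text layer uses a crude line-set difference as a stand-in for a standard text diff; the optional semantic layer is omitted.

```ts
type FieldValue = string | number | boolean | null;
type FlatRecord = Record<string, FieldValue>;

interface StructuralChange {
  field: string;
  before: FieldValue | undefined;
  after: FieldValue | undefined;
}

interface TextChange {
  field: string;
  removed_lines: string[];
  added_lines: string[];
}

interface LayeredDiff {
  structural: StructuralChange[]; // mandatory layer for typed fields
  text: TextChange[];             // mandatory layer for free-text fields
}

function layeredDiff(
  proposed: FlatRecord,
  accepted: FlatRecord,
  freeTextFields: string[],
): LayeredDiff {
  const diff: LayeredDiff = { structural: [], text: [] };
  // Union of field names from both sides, without duplicates.
  const fields = Object.keys(proposed).concat(
    Object.keys(accepted).filter((key) => !(key in proposed)),
  );
  for (const field of fields) {
    const before = proposed[field];
    const after = accepted[field];
    if (before === after) continue;
    const isFreeText = freeTextFields.indexOf(field) !== -1;
    if (isFreeText && typeof before === "string" && typeof after === "string") {
      const beforeLines = before.split("\n");
      const afterLines = after.split("\n");
      diff.text.push({
        field,
        removed_lines: beforeLines.filter((line) => afterLines.indexOf(line) === -1),
        added_lines: afterLines.filter((line) => beforeLines.indexOf(line) === -1),
      });
    } else {
      // Typed fields, and free-text fields added or removed outright.
      diff.structural.push({ field, before, after });
    }
  }
  return diff;
}
```

Keeping both layers in one record lets DSPy training and TIE diagnostics read the same artifact while choosing the layer that suits them.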
---
### [BUG] [MEDIUM] §3.16 — Chat-during-setup signal architecture works (corrected); narrower filter rules need specification
**Statement of finding (corrected from V1):** Per author clarification, the BDSM nightly extraction process — not Task Agent self-labeling — extracts correction-relevant chat segments using the next-proposal-change as ground truth. This is sound. The remaining gap is narrower: BDSM's extraction filter rules need explicit specification (which segments count as correction-relevant; how to filter noise from signal).
**Why this matters:** BDSM extraction quality determines whether the captured chat-during-setup signal is useful or noisy. Without explicit filter rules, implementations may capture too broadly (storage waste, BDSM compute cost) or too narrowly (missed correction signal).
**Recommendation:** Specify BDSM extraction filter rules:
- **Inclusion criterion:** User chat input followed by Task Agent proposal change within N turns (correction-relevant)
- **Exclusion criterion:** User input followed by no proposal change (incidental chat, filed as low-value but preserved for context)
- **Stronger inclusion:** User input contains explicit correction language ("no", "actually", "instead", "different") AND followed by proposal change
- **Aggregation:** Multiple corrections in single setup session aggregate into one TaskAgentChatDuringSetupSignal record
**Reference:** §3.16.
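The inclusion and exclusion rules above can be sketched as a per-turn classifier. This is illustrative only: `SetupTurn`, `SegmentClass`, and `classifyUserTurn` are hypothetical names, "within N turns" is read as the next N turns of any speaker, and the marker list is just the four example words from the recommendation.

```ts
interface SetupTurn {
  speaker: "user" | "task_agent";
  text: string;
  proposal_changed: boolean; // true on a task_agent turn that revised the proposal
}

type SegmentClass = "strong_correction" | "correction_relevant" | "incidental";

// Explicit correction language; a real filter would need a broader, tested list.
const CORRECTION_MARKERS = /\b(no|actually|instead|different)\b/i;

function classifyUserTurn(
  turns: SetupTurn[],
  userTurnPos: number,
  windowTurns: number,
): SegmentClass {
  const userTurn = turns[userTurnPos];
  if (userTurn === undefined || userTurn.speaker !== "user") {
    throw new Error("userTurnPos must point at a user turn");
  }
  // Ground truth: did the Task Agent change its proposal within the window?
  const followedByChange = turns
    .slice(userTurnPos + 1, userTurnPos + 1 + windowTurns)
    .some((turn) => turn.speaker === "task_agent" && turn.proposal_changed);
  if (!followedByChange) return "incidental"; // preserved for context, low value
  return CORRECTION_MARKERS.test(userTurn.text)
    ? "strong_correction"
    : "correction_relevant";
}
```

Segments classified this way would then be aggregated per setup session into a single TaskAgentChatDuringSetupSignal record, per the aggregation rule.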
---
### [GAP] [MEDIUM] §4.4 — DSPy composite metric design not validated for multi-objective optimization
**Statement of finding:** §4.4 lists 4 composite metric components for `outcome_compiler_main`. It does not address how components are scalarized or Pareto-balanced, whether components are on commensurable scales, or whether DSPy's optimizers support multi-objective composites natively. A DSPy metric function returns a scalar; Pareto-frontier optimization is GEPA-specific.
**Why this matters:** Multi-objective optimization is a different problem from scalar optimization. Leaving this unspecified will force substantial implementation rework.
**Recommendation:** Specify per-target metric architecture explicitly:
- **For each composite target, declare scalarization or Pareto strategy.**
- **Address scale commensurability** (normalize components).
- **Acknowledge initial defaults need calibration** once real signal volume exists.
- **Document the optimizer choice.** GEPA for Pareto; MIPROv2 for scalar. Decision drives what's possible.
**Reference:** §4.4 (DSPy per-target table).
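To make the scalarization option concrete, here is a small sketch of min-max normalization followed by a weighted mean. It illustrates the arithmetic only (an actual DSPy metric would live in the Python runtime), and the component ids, ranges, and weights are hypothetical placeholders, not the four components §4.4 lists.

```ts
interface MetricComponent {
  id: string;
  raw: number;    // observed value on the component's native scale
  min: number;    // lower bound of the native scale
  max: number;    // upper bound of the native scale
  weight: number; // relative importance; an initial default needing calibration
}

// Map a component onto [0, 1] so components become commensurable.
function normalizeComponent(c: MetricComponent): number {
  if (c.max <= c.min) throw new Error(`component ${c.id}: max must exceed min`);
  const clamped = Math.min(c.max, Math.max(c.min, c.raw));
  return (clamped - c.min) / (c.max - c.min);
}

// Weighted mean of normalized components: one scalar in [0, 1].
function scalarize(components: MetricComponent[]): number {
  const totalWeight = components.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight <= 0) throw new Error("weights must sum to a positive number");
  const weighted = components.reduce(
    (sum, c) => sum + c.weight * normalizeComponent(c),
    0,
  );
  return weighted / totalWeight;
}
```

A Pareto strategy would instead keep the normalized components as a vector and leave the trade-off to the optimizer, which is why the optimizer choice has to be declared per target.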
---
### [GAP] [MEDIUM] §4.1 / §6 — Outcome cluster vs Pattern primitive emergence boundary undefined
**Statement of finding:** DOC72 already has Pattern primitive emergence. §4.1 proposes outcome cluster emergence as a parallel mechanism. The boundary between Pattern primitives and outcome clusters is not drawn.
**Why this matters:** Two unrelated emergence mechanisms in the same component create coordination problems and consistency issues.
**Recommendation:** Specify the relationship:
- **Boundary:** Patterns = *interventions that work* (revision strategies, configs, prompts). Clusters = *outcome shape* (what kind of work is being done). Orthogonal axes of a 2D learning surface.
- **Dependency:** Cluster emergence runs first; pattern slice updates query current cluster assignment.
- **Migration:** When a durable cluster's composition changes (rare, given the stability mechanism above), pattern slices re-key as a one-time operation.
**Reference:** §4.1.
---
### [GAP] [MEDIUM] §3.17 — Sub-agent reputation marked "fully traced" but interactions with new mechanisms not addressed
**Statement of finding:** §3.17 says sub-agent reputation is fully traced. True for existing scope. But the proposal adds outcome clustering, multi-prior coordination, and rationale standardization. Does sub-agent reputation accumulate per (sub-agent, outcome_cluster)? Does multi-prior coordination modulate sub-agent dispatch? Does TIE update sub-agent dispatch rules?
**Why this matters:** Marking something "fully traced" when adjacent mechanisms are changing is misleading.
**Recommendation:** Update §3.17:
- Reputation slices accumulate per (sub-agent, outcome_cluster) — extends current substrate
- Sub-agent dispatch participates in multi-prior coordination as "Capability availability" prior
- TIE can recommend sub-agent dispatch rule changes at Tier 3 (system-wide dispatch refinements)
**Reference:** §3.17.
---
### [GAP] [MEDIUM] §3.18 — Loop Effectiveness Test pattern missing from learning surface inventory
**Statement of finding:** §3.18 lists module quality metrics as "fully traced" but treats them as dashboard signals only. There is no specified test pattern that directly measures whether the Evaluator/Revisor loop is producing real quality improvement.
**Why this matters:** Without a canonical Loop Effectiveness Test, every user reinvents the test with varying methodology, producing non-comparable results. With a canonical pattern, results aggregate and become a primary input to TIE's diagnostic reasoning and DSPy training.
**Recommendation:** Add §3.19 "LoopEffectivenessTestRunRecord (NEW)":
- **Trigger:** Test dispatched manually, periodically, or by TIE
- **Mechanism:** Same artifact along two paths: (Branch A) artifact → Judge → score_a; (Branch B) artifact → Evaluator → Revisor → revised → Judge → score_b
- **Schema:** `original_artifact_ref`, `revised_artifact_ref`, `judge_score_original`, `judge_score_revised`, `score_delta_by_dimension`, `evaluator_findings_ref`, `revisor_plan_ref`, `loop_iterations`, `cost_total`
- **Consumer:** BDSM aggregates per task type / module config / cluster id; feeds TIE diagnostic reasoning; feeds Revisor/Evaluator DSPy training; feeds trust calibration
- **Action surface:** Loop ROI dashboard; TIE inputs; outcome tracking for TIE-applied changes
**Reference:** §3.18.
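A sketch of how the two branches could be reduced to the `score_delta_by_dimension` field and a simple loop-ROI figure. The assumption that Judge returns per-dimension numeric scores, and the names `DimensionScores`, `LoopTestSummary`, and `summarizeLoopTest`, are hypothetical.

```ts
type DimensionScores = Record<string, number>;

interface LoopTestSummary {
  score_delta_by_dimension: DimensionScores;
  mean_delta: number;
  delta_per_cost: number | null; // null when cost_total is zero
}

// Branch A scores the original artifact; Branch B scores the artifact
// after one or more Evaluator -> Revisor iterations.
function summarizeLoopTest(
  judgeScoreOriginal: DimensionScores,
  judgeScoreRevised: DimensionScores,
  costTotal: number,
): LoopTestSummary {
  const deltas: DimensionScores = {};
  let sum = 0;
  let count = 0;
  for (const dimension of Object.keys(judgeScoreOriginal)) {
    const before = judgeScoreOriginal[dimension];
    const after = judgeScoreRevised[dimension];
    // Only dimensions scored on both branches are comparable.
    if (before === undefined || after === undefined) continue;
    deltas[dimension] = after - before;
    sum += after - before;
    count += 1;
  }
  const meanDelta = count === 0 ? 0 : sum / count;
  return {
    score_delta_by_dimension: deltas,
    mean_delta: meanDelta,
    delta_per_cost: costTotal > 0 ? meanDelta / costTotal : null,
  };
}
```

A negative `mean_delta` is the important case: it indicates the loop made the artifact worse by the Judge's measure.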
---
## Layer 3: Cutting-edge / frontier practice
### [RESEARCH] [HIGH] §4 — Updated process reward model framing: AgentPRM (2025), not Lightman (2023)
**Statement of finding:** Process reward models for LLM agents have advanced significantly since the 2023 PRM literature focused on math reasoning. The proposal needs to reference current agent-specific PRM work.
**Updated citations:**
- **AgentPRM (Xi et al. 2025, arxiv 2511.08325):** Process supervision specifically for LLM agents in multi-step decision-making. Captures both per-step quality ("promise" = probability step achieves goal) AND inter-step dependency ("progress"). Directly applicable to ELNOR's pipeline: Evaluator findings affect Revisor plans, which affect downstream Evaluator runs.
- **VersaPRM (2025):** Multi-domain PRM training via synthetic reasoning data. Relevant if ELNOR lacks training data volume in specific domains.
- **AdaptiveStep (2025):** Automatically divides reasoning into steps based on model confidence. Useful for ELNOR's variable-length reasoning chains.
**Why this matters:** The "promise + progress" framing in AgentPRM fits ELNOR's pipeline exactly. ELNOR has per-step quality signals (per-criterion findings, per-action revision outcomes) AND inter-step dependencies (findings determine revisions, revisions affect future findings). The agent-PRM literature provides the right methodology for ELNOR's supervision architecture.
**Recommendation:** Reframe TIE's diagnostic reasoning + DSPy training data architecture in agent-PRM terms:
- Per-step "promise" signals: probability a given Evaluator finding, Revisor action, or sub-agent dispatch achieves its local goal
- "Progress" signals: how each step affects downstream pipeline state
- Both feed into TIE's diagnostic reasoning (qualitative interpretation) AND DSPy training data (quantitative optimization signal)
ELNOR is essentially building an agent-PRM training corpus through normal operation. The literature should inform the spec.
**Reference:** Xi et al. 2025 (https://arxiv.org/abs/2511.08325).
---
### [RESEARCH] [HIGH] §4.1 — Updated memory architecture framing: H-MEM / Mem0 / Zep (2025), not Generative Agents (2023)
**Statement of finding:** LLM agent memory architectures have advanced substantially since 2023. The proposal's HDBSCAN clustering approach is inventing a worse version of more mature memory architectures.
**Updated citations:**
- **H-MEM (Sun & Zeng 2025, arxiv 2507.22925):** Hierarchical memory with positional index encoding for efficient layer-by-layer retrieval. Better fit than flat clustering at small data scales.
- **Mem0 (2025):** Production-ready long-term memory for AI agents. More mature than research-grade alternatives.
- **Zep (2025):** Temporal knowledge graph architecture for agent memory. Particularly relevant because DOC72 already has knowledge-graph substrate; Zep's approach is directly applicable.
- **G-Memory (2025):** Hierarchical memory for multi-agent systems. Relevant for sub-agent dispatch architecture.
- **In Prospect and Retrospect (2025):** Reflective memory management for long-term personalized dialogue agents. More refined than Reflexion's verbal self-critique.
**Why this matters:** Single-user sparse-data systems benefit from structured memory hierarchies more than from density-based clustering. The 2023 Generative Agents pattern has been superseded.
**Recommendation:** Replace §4.1 outcome cluster emergence with structured memory architecture:
- **Observe:** Each OutcomeSpec persists to memory stream in DOC72 (no clustering)
- **Reflect:** Hierarchical summaries generated periodically (per H-MEM positional indexing)
- **Retrieve:** k-NN similarity + temporal recency for in-session context injection (per Zep temporal graph)
- **Integration with DOC72:** Memory architecture directly leverages DOC72's knowledge-graph substrate; no separate clustering subsystem needed
**Reference:** Sun & Zeng 2025 (https://arxiv.org/abs/2507.22925), Mem0 documentation, Zep architecture paper 2025.
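The Retrieve step (k-NN similarity plus temporal recency) might be sketched as follows. This is a generic blend of cosine similarity with exponential recency decay, not the algorithm from H-MEM or Zep; `MemoryEntry`, `retrieveTopK`, the 30-day half-life, and the 0.3 recency weight are all assumptions.

```ts
interface MemoryEntry {
  id: string;
  embedding: number[];
  stored_at_ms: number;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("embeddings must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

function retrieveTopK(
  query: number[],
  entries: MemoryEntry[],
  nowMs: number,
  k: number,
  halfLifeDays = 30,   // assumed: recency score halves every 30 days
  recencyWeight = 0.3, // assumed: 70% similarity, 30% recency
): MemoryEntry[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  const scored = entries.map((entry) => {
    const ageDays = Math.max(0, nowMs - entry.stored_at_ms) / msPerDay;
    const recency = Math.pow(0.5, ageDays / halfLifeDays);
    const similarity = cosineSimilarity(query, entry.embedding);
    return {
      entry,
      score: (1 - recencyWeight) * similarity + recencyWeight * recency,
    };
  });
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((item) => item.entry);
}
```

A linear scan is adequate at single-user scale; an index only becomes necessary once the memory stream is large.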
---
### [RESEARCH] [HIGH] §4.2 — Constitutional AI / user-values-as-prior; concretize as UserConstitution
**Statement of finding:** §4.2 lists six prior kinds. None is "explicit user values/constitution." Anthropic's Constitutional AI work (Bai et al. 2022, refined in RLAIF and constitutional-classifier work 2024-2025) demonstrates that explicitly declared values, used as inference-time priors, outperform purely revealed-preference learning.
**Why this matters:** Revealed-preference signals are noisy (one bit per interaction) and require many samples to converge. Stated values are dense and stable. Missing this prior kind biases the system toward slow accumulation of weak signals when a fast strong signal is available. Single-user system can solicit explicit values cheaply.
**Recommendation:** Add seventh prior kind in §4.2 taxonomy:
**Stated values / user constitution prior** with concrete artifact specification:
```ts
UserConstitution {
  constitution_id: string
  principal_id: PrincipalRef
  values: Array<{
    value_id: string
    domain: string           // "legal_writing", "investment_research", "communication"
    statement: string        // "Legal analysis should engage with conflicts in authority"
    scope: ScopeTag          // user_only by default
    origin: "user_authored" | "tie_recommended_and_user_accepted" | "tie_recommended_auto_accepted"
    captured_at: timestamp
    last_reviewed_at: timestamp?
    confidence: number
  }>
  preferences: Array<{
    preference_id: string
    domain: string
    statement: string
    scope: ScopeTag
    origin: PreferenceOrigin
  }>
  schema_version: 1
}
```
- **Storage:** Lives at `~/.openclaw/userconstitution/{principal_id}.json` or via DOC72 entity-card substrate (structurally similar to entity card about the user)
- **Priority:** Higher than revealed preferences (stated commitment outranks observed pattern). Lower than hard policy (some policies non-negotiable).
- **TIE interaction:** TIE can propose new values based on observed correction patterns. User reviews and accepts; new value gets `origin: tie_recommended_and_user_accepted`.
- **Multi-user scope:** Always `user_only` by default; user can explicitly share to team/firm.
**Reference:** Bai et al. 2022 (https://arxiv.org/abs/2212.08073); refined in subsequent RLAIF and constitutional-classifier work 2024-2025.
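The priority rule above (hard policy over stated values over revealed preferences) can be illustrated with a small resolver. Only these three prior kinds are ranked because they are the only ones this finding orders; the other prior kinds in §4.2, and all names here (`PriorDirective`, `resolveByTopic`, the `topic` key), are outside this sketch or hypothetical.

```ts
type PriorKind = "hard_policy" | "stated_value" | "revealed_preference";

// Lower rank wins. Ordering taken from the Priority bullet above.
const PRIOR_RANK: Record<PriorKind, number> = {
  hard_policy: 0,
  stated_value: 1,
  revealed_preference: 2,
};

interface PriorDirective {
  kind: PriorKind;
  topic: string;     // hypothetical conflict key, e.g. "authority_conflicts"
  directive: string;
}

// For each topic, keep the directive from the highest-priority prior kind.
// Ties between priors of the same kind keep the first one seen.
function resolveByTopic(priors: PriorDirective[]): Record<string, PriorDirective> {
  const winners: Record<string, PriorDirective> = {};
  for (const prior of priors) {
    const current = winners[prior.topic];
    if (current === undefined || PRIOR_RANK[prior.kind] < PRIOR_RANK[current.kind]) {
      winners[prior.topic] = prior;
    }
  }
  return winners;
}
```

Because the resolution is a fixed ranking, the same set of priors always yields the same winner, which is the determinism a multi-prior coordinator needs.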
---
### [RESEARCH] [HIGH] §4 / §3 — Active learning as user-engagement mechanism; promoted to first-class
**Statement of finding:** All proposed learning loops are passive. Active learning literature shows systems that solicit labels for high-uncertainty cases learn much faster than passive observers, especially at sparse data scales. For a single-user system, the friction is low and the user is generally cooperative.
**Why this matters:** Single-user ELNOR will accumulate small data volumes. Passive learning converges slowly; active learning converges faster by directing user attention to the highest-information cases. Author affirmed this is high-value: a user setting up an Evaluator will appreciate being asked.
**Recommendation:** Add §4.6 "Active learning queries":
**Integration points:**
- **Uncertainty-triggered queries during Compiler proposal:** When Compiler is uncertain between structural choices, ask explicitly. User's answer becomes high-quality labeled training data + in-session prior + pattern primitive candidate.
- **Pattern-disambiguation queries during Revisor planning:** When two emergent patterns conflict on a specific case.
- **Cluster/memory-membership queries:** When new OutcomeSpec is near boundary between retrieval candidates.
**Schema:**
```ts
ActiveLearningQueryRecord {
  query_id: string
  query_kind: "compiler_disambiguation" | "revisor_strategy" | "cluster_membership" | "pattern_conflict"
  triggering_uncertainty: {
    source: ModuleRef
    confidence_margin: number
    alternatives_considered: any[]
  }
  query_text: string
  user_choice: string
  user_rationale_optional: string?
  becomes_labeled_example_for: DspyTargetId[]
  emitted_at: timestamp
  schema_version: 1
}
```
**Frequency cap:** N queries per session (suggest 2-3); user-configurable. Author-confirmed high-value mechanism.
**Reference:** Settles 2009 active learning survey; recent work on uncertainty quantification in frontier LLMs (2024-2025).
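A sketch of the trigger logic implied by `confidence_margin` and the frequency cap. The margin rule (ask only when the top two alternatives are close) is a standard margin-sampling heuristic used here as an assumption; the 0.1 threshold and the names `Alternative`, `QueryDecision`, and `decideQuery` are hypothetical, and the default cap of 3 follows the suggested 2-3.

```ts
interface Alternative {
  label: string;
  confidence: number; // module's confidence in this alternative, 0..1
}

interface QueryDecision {
  should_ask: boolean;
  confidence_margin: number;
  reason: "ask" | "margin_too_wide" | "session_cap_reached" | "single_alternative";
}

function decideQuery(
  alternatives: Alternative[],
  queriesAskedThisSession: number,
  marginThreshold = 0.1, // assumed default; would need calibration
  sessionCap = 3,        // user-configurable frequency cap
): QueryDecision {
  const sorted = alternatives.slice().sort((a, b) => b.confidence - a.confidence);
  const best = sorted[0];
  const runnerUp = sorted[1];
  if (best === undefined || runnerUp === undefined) {
    return { should_ask: false, confidence_margin: 1, reason: "single_alternative" };
  }
  const margin = best.confidence - runnerUp.confidence;
  if (margin >= marginThreshold) {
    // The module is confident enough; do not interrupt the user.
    return { should_ask: false, confidence_margin: margin, reason: "margin_too_wide" };
  }
  if (queriesAskedThisSession >= sessionCap) {
    return { should_ask: false, confidence_margin: margin, reason: "session_cap_reached" };
  }
  return { should_ask: true, confidence_margin: margin, reason: "ask" };
}
```

Suppressed queries are still worth logging with their reason, since a high rate of `session_cap_reached` suggests the cap or the threshold needs tuning.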
---
### [RESEARCH] [MEDIUM] §3 / §4 — Reflexion-style in-session self-critique as cheap complement to DSPy
**Statement of finding:** Shinn et al. 2023 (Reflexion) demonstrated verbal self-critique outperforms many forms of explicit reward learning at smaller data scales. For ELNOR with frontier models, in-session self-critique loops can produce quality improvement at the cost of one additional LLM call per task, without requiring DSPy substrate maturity.
**Why this matters:** Even with DSPy in v1, in-session critique provides immediate quality improvement on tasks where DSPy hasn't yet accumulated enough optimization data. Complementary, not redundant.
**Recommendation:** Add §4.7 "In-session self-critique mechanisms":
- **Pre-emit critique:** Before Evaluator/Revisor/Compiler emits output, self-critique pass identifies obvious issues; revise output before emit
- **Cost:** One extra LLM call per critique-enabled module invocation; configurable per-module
- **Failure-mode pattern detection:** When critique consistently flags the same issue across runs, surface to TIE as pattern candidate
- **Complementary to DSPy:** Critique improves immediate quality; DSPy improves the base prompt over time. Both apply.
**Reference:** Shinn et al. 2023 (https://arxiv.org/abs/2303.11366); subsequent reflective-memory work 2024-2025.
---
### [RESEARCH] [MEDIUM] §4 — Frontier model capability shifts what's worth optimizing
**Statement of finding:** The proposal's optimization machinery is calibrated to 2023-era models. Will is using Opus 4.7+ and GPT 5.5+, which have substantially longer context, better in-context learning, and stronger reasoning. This changes what optimization actually delivers.
**Why this matters:**
- **Long context (1M+ tokens) reduces context-fitting pressure.** DSPy's marginal value for prompt-token efficiency is lower than in 2023.
- **Better in-context learning may outperform DSPy at sparse data.** Frontier models do well with few good examples in context. DSPy optimization may be overkill for some targets — careful example curation may suffice.
- **Better agentic capabilities make sub-agent dispatch more reliable.** Architectural reliance on sub-agent dispatch is more defensible.
- **Better reasoning means simpler prompts work.** Elaborate multi-objective composite metrics may be over-engineered.
**Recommendation:** Add cautionary finding to spec:
- **Measure before optimizing.** Before committing all DSPy targets to active optimization, measure whether current frontier-model behavior is already adequate. The DSPy substrate ships in v1, but per-target activation should be data-driven, not eager.
- **Bias toward in-context example curation over instruction optimization.** Per the few-shot vs prompt-engineering tradeoff at frontier scale.
- **TIE's diagnostic reasoning leverages frontier capabilities** — Opus 4.7 or GPT 5.5 reasoning over 30 structured signal records produces high-quality recommendations. The cost is bounded and the value is real.
**Reference:** No single paper; ambient observation from frontier model usage patterns 2024-2026.
---
### [RESEARCH] [MEDIUM] §4 — TIE pattern aligns with autonomous software-engineering agent research
**Statement of finding:** TIE's role — analyzing a working system's behavior and proposing improvements with code-level changes when warranted — aligns with recent work on autonomous software-engineering agents (Devin, SWE-agent, OpenHands work 2024-2025). Worth referencing the methodology.
**Why this matters:** TIE isn't an entirely novel concept; it's the application of autonomous-engineering-agent patterns to a specific domain (self-improvement of an AI orchestration platform). The literature provides methodology for:
- Code-change proposal generation
- Multi-stage review pipelines
- Patch validation
- Outcome tracking
**Recommendation:** Reference SWE-agent / OpenHands work in TIE specification. Specifically:
- TIE's Tier 4 (code change) workflow can adopt SWE-agent's patch-then-test pattern
- TIE's multi-LLM review pipeline aligns with agent committee approaches in recent work
- TIE's trust calibration corresponds to autonomous-agent reliability scoring in production deployments
ELNOR isn't reinventing autonomous engineering; it's applying it to a specific high-value substrate.
**Reference:** SWE-agent (Yang et al. 2024); OpenHands platform work 2024-2025.
---
## Layer 4: Specific topic banks
### [GAP] [MEDIUM] §7.1 — DOC72 patterns and outcome clusters: 8-dimensional context signature sparsity
**Statement of finding:** Pattern slices already accumulate per (pattern, context) pair across 7 dimensions. Adding cluster_id makes it 8. Realistic single-user data may produce 1-2 patterns per slice. Combined with multi-user forward-compat additions (principal_id, learning_scope) the dimensionality grows further.
**Why this matters:** High-dimensional pattern slices with sparse data produce many empty slices. Pattern emergence threshold becomes harder to calibrate.
**Recommendation:** Specify dimension prioritization for pattern emergence:
- Primary dimensions: domain_tags, artifact_kind, failure_kind (highest emergence priority)
- Secondary dimensions: outcome_cluster_id, model_class, principal_id (refinement when primary slice has volume)
- Tertiary dimensions: risk_class, assurance_basis, privilege_class, learning_scope (filter dimensions, not emergence-driving)
This produces emergent patterns first in coarse slices, refines into finer slices as data accumulates.
**Reference:** §7.1 of red team prompt; V3.3 §13.3 PatternPerformanceSlice schema.
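The coarse-first, refine-later behaviour can be sketched as a slice-key backoff. The dimension names come from the lists above; everything else (`SliceContext`, `chooseSliceKey`, the string encoding, treating `domain_tags` as a single pre-joined string, the volume threshold) is a hypothetical illustration. Tertiary dimensions are left out of the key because they act as filters.

```ts
const PRIMARY_DIMENSIONS = ["domain_tags", "artifact_kind", "failure_kind"] as const;
const SECONDARY_DIMENSIONS = ["outcome_cluster_id", "model_class", "principal_id"] as const;

type SliceContext = Record<string, string>;

// Encode the chosen dimensions as a stable key; missing values become "*".
function sliceKey(context: SliceContext, dimensions: readonly string[]): string {
  return dimensions.map((d) => `${d}=${context[d] ?? "*"}`).join("|");
}

// Use the coarse (primary-only) slice until it has enough observations,
// then refine by adding the secondary dimensions.
function chooseSliceKey(
  context: SliceContext,
  observationCounts: Record<string, number>,
  minVolume: number,
): string {
  const coarseKey = sliceKey(context, PRIMARY_DIMENSIONS);
  const coarseVolume = observationCounts[coarseKey] ?? 0;
  if (coarseVolume < minVolume) return coarseKey;
  return sliceKey(context, [...PRIMARY_DIMENSIONS, ...SECONDARY_DIMENSIONS]);
}
```

With this scheme, early single-user data accumulates in a handful of coarse slices instead of being scattered across mostly empty fine-grained ones.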
---
### [GAP] [MEDIUM] §7.2 — DOC15 redesign may be implicit; multi-prior coordination + active learning + TIE-injected priors
**Statement of finding:** The proposal extends DOC15 CIL substantially. Adding multi-prior coordination, active learning queries, and TIE-injected context all flow through DOC15's prompt assembly. Whether DOC15's existing structure handles this is not assessed.
**Why this matters:** DOC15 may require substantial redesign, not just extension. Discovering this during implementation is expensive.
**Recommendation:** Before R1 spec drafting, audit DOC15 against the new requirements:
- Can it accommodate 7+ prior kinds with deterministic conflict resolution?
- Can it inject active-learning query content as an interruption pattern?
- Can it consume TIE-recommended in-session context bundles?
- Can it allocate a token budget per prior kind under frontier context limits?
If DOC15 requires redesign, scope that work explicitly.
**Reference:** §7.2 of red team prompt; §4.2 multi-prior coordination.
---
### [GAP] [MEDIUM] §7.4 — Free-text rationale architecture preserved; BDSM extraction discipline is the actual specification need
**Statement of finding (corrected from V1):** Per author clarification, free-text rationale is intentional and BDSM is already designed to extract semantic patterns from it. V1's recommendation to switch to quick-tag taxonomies was overcalibrated to general user behavior; for the actual user (and the actual BDSM), free-text is correct.
The remaining gap is BDSM's extraction discipline specification: how it identifies recurring rationale themes; how it produces candidate patterns from clustered rationales; how it handles privilege firewall in rationale semantic clustering.
**Recommendation:** Specify BDSM rationale extraction pipeline:
- Embedding model for rationale text (privacy-respecting; local or trusted API)
- Clustering approach (matter-scoped, data_class-aware)
- Pattern candidacy threshold (N similar rationales within time window M = candidate pattern)
- User-review flow before clustered rationale affects system behavior (TIE proposes; user approves; pattern primitive emerges)
Multi-user note: when networking ships, rationale clustering must respect cross-user privilege firewall; cluster spaces disjoint by data_class and matter.
**Reference:** §7.4 of red team prompt; §4.3 rationale standardization.
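A sketch of the candidacy check and the scoping rule, assuming clustering has already grouped similar rationales. It reads "N similar rationales across M time" as "at least N rationales inside some window of length M", which is an interpretation; `RationaleCluster`, `scopeKey`, and `isPatternCandidate` are hypothetical names.

```ts
interface RationaleCluster {
  cluster_id: string;
  data_class: string;
  matter_id: string;
  rationale_timestamps_ms: number[];
}

// Cluster spaces stay disjoint by data_class and matter: rationales are only
// ever compared within the same scope key.
function scopeKey(cluster: RationaleCluster): string {
  return `${cluster.data_class}::${cluster.matter_id}`;
}

// True when some time window of length windowMs holds at least minCount rationales.
function isPatternCandidate(
  cluster: RationaleCluster,
  minCount: number,
  windowMs: number,
): boolean {
  if (minCount < 1) throw new Error("minCount must be at least 1");
  const sorted = cluster.rationale_timestamps_ms.slice().sort((a, b) => a - b);
  let start = 0;
  for (let end = 0; end < sorted.length; end++) {
    // Shrink the window from the left until it spans at most windowMs.
    while ((sorted[end] ?? 0) - (sorted[start] ?? 0) > windowMs) start++;
    if (end - start + 1 >= minCount) return true;
  }
  return false;
}
```

A cluster that passes this check is only a candidate; under the recommendation above it still goes through TIE proposal and user approval before affecting system behavior.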
---
### [GAP] [MEDIUM] §7.5 — Piece 3 boundary blurry; TIE may subsume part of Piece 3
**Statement of finding:** §7 scopes out Piece 3 (system-wide self-diagnosis substrate). But TIE's deeper audit capability (Tiers 3-4) overlaps substantially with what Piece 3 would do. Some "learning loops" overlap with self-diagnosis surfaces.
**Why this matters:** Two separately-designed systems trying to do related work may produce redundant infrastructure or conflicting recommendations.
**Recommendation:** Re-examine the Piece 3 boundary in light of TIE specification:
- TIE handles per-module and cross-module improvement recommendations (Tiers 1-4)
- Piece 3 (if still needed separately) handles system-wide health monitoring and incident-response (failures, outages, abnormal cost spikes)
- The boundary: TIE is improvement; Piece 3 is incident response. Some overlap in deep-audit territory.
- Specify Piece 3's revised scope explicitly once TIE is committed; may shrink considerably
**Reference:** §7.5 of red team prompt; §7 of proposal.
---
### [GAP] [LOW] §10.6 — V3.16 OP-A audit gap deserves immediate fix
**Statement of finding:** §10.6 notes Addenda A V6 is not registered in OP-A V3.16 §3 Source Registry. The fix is trivial (one editorial pass) but the cost of deferral is non-zero.
**Recommendation:** Register V6 in OP-A V3.17 immediately. The pending Self-Learning Architecture R1 spec will produce another OP-A revision; complete this registration before that revision.
**Reference:** §10.6.
---
# Part II — The Unified Learning Architecture
This section specifies what the proposal lacks: a clear architectural framing for the system's full set of learning mechanisms. The proposal frames signal capture → BDSM → utility bundles → downstream consumers as a single mechanism. In reality the system has seven distinct learning mechanisms, each with its own substrate, targets, and operational pattern.
## Mechanism A: DSPy module-level prompt optimization
**What it does:** Statistical optimization of the small set of generic, fixed module prompts that everything else flows through.
**Substrate:** Python subprocess runtime (per Prop A); iterative refinement loop; trainset accumulation from per-step signals.
**Targets:** Evaluator main prompt, Revisor main prompt, Outcome Compiler main prompt, Judge per-AssuranceBasis templates (when architecturally separable), Claim Extractor main prompt.
**Operational pattern:** Accumulated examples per target → DSPy optimization run → candidate prompts → Experiments comparison surface → user reviews → promote / save-as-preset / reject.
**Data needs:** Substantial. Statistical optimization requires accumulated examples (30-100+ per target) before producing measurable improvements.
**User involvement:** Per-promotion review at Experiments surface. Otherwise automated.
**Status in proposal:** Specified at high level in §4.4; per-target trainset architecture needs refinement (see Layer 2 finding on RepairCycleSignal).
## Mechanism B: Task Improvement Engineer (TIE) — LLM-reasoning over diagnostic data
**What it does:** Periodic LLM-reasoning over accumulated per-step diagnostic data produces structured improvement recommendations spanning all of the system's learnable surfaces beyond module prompts.
**Substrate:** Frontier model (Opus 4.7 / GPT 5.5+ for high-stakes; Kimi or cheap model for routine analysis); access to signals/artifacts/configs/specs/code; ImplementationProposal lifecycle for write operations.
**Targets:** Rubrics, OutcomeDefinitions, patterns, stated values, configurations, strategy selections, sub-agent dispatch rules, task graph topologies, schemas, code patches.
**Operational pattern:** Issue identification → severity gate → TIE analysis → multi-LLM red team review → optional user gate → implementation proposal → user approval → outcome tracking → trust calibration.
**Data needs:** Modest. 20-30 structured per-step records with rationales is enough for frontier-model reasoning to produce useful recommendations.
**User involvement:** Reviews refined recommendations + adversarial notes; approves/revises/rejects.
**Status in proposal:** **Missing entirely. Highest-priority addition.**
## Mechanism C: DOC72 retrieval/injection effectiveness
**What it does:** Learns which patterns, entities, memories, and cases are worth surfacing in which contexts. Per-injection signal capture; downstream score delta tells you whether the injection helped.
**Substrate:** Per-injection signal capture in DOC72/DOC24; retrieval scoring update; injection rule learning.
**Targets:** Pattern primitive surfacing rules, entity card relevance ranking, memory selection for context injection, case retrieval ranking.
**Operational pattern:** Track per-injection signals (what was retrieved, what context, what downstream score); statistical fitting over time identifies high-value vs low-value retrieval decisions; retrieval rules update.
**Data needs:** Statistical; needs volume of injections to converge. Single-user generates this naturally over months.
**User involvement:** Minimal. Retrieval improvements happen automatically based on signal.
**Status in proposal:** Mentioned briefly in §3.7 TaskContextFeedbackEvent but lifecycle and learning loop unspecified. Specify as Mechanism C.
```ts
RetrievalEffectivenessSignal {
  retrieval_id: string
  retrieved_items: [{item_kind: "pattern" | "entity" | "memory" | "case", item_id: string}]
  injection_context: {module_id, task_signature}
  baseline_score: number?        // counterfactual (no injection or different)
  observed_score: number
  user_correction_delta: number?
  retrieval_decision_signature: string
  schema_version: 1
}
```
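One way the "statistical fitting" step could start is a plain mean lift per `retrieval_decision_signature`, using only signals that carry a counterfactual `baseline_score`. This is a deliberately simple sketch; `RetrievalObservation`, `LiftSummary`, and `meanLiftBySignature` are hypothetical names, and a real fit would also need to account for sample size and confounding.

```ts
// Subset of RetrievalEffectivenessSignal needed for lift estimation.
interface RetrievalObservation {
  retrieval_decision_signature: string;
  baseline_score?: number; // counterfactual score; often unavailable
  observed_score: number;
}

interface LiftSummary {
  mean_lift: number; // mean of (observed_score - baseline_score)
  n: number;         // number of observations with a baseline
}

function meanLiftBySignature(
  observations: RetrievalObservation[],
): Record<string, LiftSummary> {
  const sums: Record<string, { total: number; n: number }> = {};
  for (const obs of observations) {
    // Without a counterfactual there is no lift to measure; skip.
    if (obs.baseline_score === undefined) continue;
    const key = obs.retrieval_decision_signature;
    const entry = sums[key] ?? { total: 0, n: 0 };
    entry.total += obs.observed_score - obs.baseline_score;
    entry.n += 1;
    sums[key] = entry;
  }
  const result: Record<string, LiftSummary> = {};
  for (const key of Object.keys(sums)) {
    const entry = sums[key];
    if (entry !== undefined) {
      result[key] = { mean_lift: entry.total / entry.n, n: entry.n };
    }
  }
  return result;
}
```

Reporting `n` alongside the mean matters at single-user volumes, where a large lift from two observations should not yet change retrieval rules.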
## Mechanism D: Outcome and pattern emergence
**What it does:** Organizes the outcome space. Not changing behavior directly; producing organizational structure that other mechanisms (and the user) use.
**Substrate:** Hierarchical memory architecture (per Layer 3 finding recommending H-MEM / Zep / Mem0 over HDBSCAN clustering); pattern primitive emergence in DOC72; periodic structural updates.
**Targets:** Outcome organizational structure (clusters or memory hierarchy); pattern primitive store.
**Operational pattern:** Periodic batch process generates structure from accumulated data; structure used by other mechanisms.
**Data needs:** Accumulates from normal use; no special collection.
**User involvement:** None, apart from optional surface-level review (cluster labels, pattern primitives).
**Status in proposal:** §4.1 specifies clustering; needs replacement with hierarchical memory architecture (Layer 3 finding).
## Mechanism E: Trust calibration / autonomy threshold learning
**What it does:** Learns per-lever what level of human oversight is appropriate based on track record. Bad track record → strict gating; good track record → higher autonomy.
**Substrate:** Per-recommender per-change-kind track record; autonomy threshold computation; gating rule update.
**Targets:** Autonomy thresholds for TIE recommendations, DSPy promotions, retrieval rule changes, pattern primitive emergence.
**Operational pattern:** Track outcomes of recommendations (accepted or rejected; if accepted, whether the change improved or worsened outcomes); calibrate confidence; adjust the threshold above which a recommendation auto-applies without a user gate.
**Data needs:** Modest. Track record accumulates from normal review.
**User involvement:** Reviews; approval/rejection becomes the data.
**Status in proposal:** Missing entirely. Specify as Mechanism E.
```ts
TrustCalibrationRecord {
  recommender: "tie" | "dspy" | "retrieval_learner" | "pattern_emergence"
  change_kind: ChangeKind
  track_record: {
    accepted: number
    rejected: number
    accepted_then_outcome_resolved: number
    accepted_then_outcome_worsened: number
  }
  current_autonomy_threshold: number   // probability above which auto-applies
  threshold_basis: string              // explainable rationale
  next_review_at: timestamp
  schema_version: 1
}
```
System starts conservative (low autonomy across the board). High-track-record recommenders earn higher autonomy per change kind. Low-track-record stays heavily gated.
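One possible threshold rule that starts conservative and relaxes with track record is sketched below. The formula is an assumption made for illustration, not something the proposal or this review specifies: a pessimistically smoothed success rate (`priorStrength` imaginary failures) sets the threshold, with a floor so that no recommender ever becomes fully ungated.

```ts
interface TrackRecord {
  accepted: number;
  rejected: number;
  accepted_then_outcome_resolved: number;
  accepted_then_outcome_worsened: number;
}

// Returns the confidence a recommendation must exceed to auto-apply.
// Accepted changes whose outcome is not yet known are ignored.
function autonomyThreshold(
  record: TrackRecord,
  priorStrength = 10, // assumed: ten imaginary failures before any evidence
  floor = 0.5,        // assumed: the threshold never drops below 0.5
): number {
  const decided =
    record.rejected +
    record.accepted_then_outcome_resolved +
    record.accepted_then_outcome_worsened;
  const successRate =
    record.accepted_then_outcome_resolved / (decided + priorStrength);
  return Math.max(floor, 1 - successRate);
}

function autoApplies(confidence: number, threshold: number): boolean {
  return confidence > threshold;
}
```

With no history the threshold is 1, so nothing auto-applies; rejections and worsened outcomes both hold the threshold high, which matches the per-recommender, per-change-kind gating described above.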
## Mechanism F: Process pattern emergence at task-graph topology level
**What it does:** Learns task-graph topologies that work. Separate from outcome-level patterns; operates at graph-structure level.
**Substrate:** Task design pattern card emergence (Core R0.7 §9.6 area, partially specified); convergence/divergence tracking per topology; task design intelligence updates.
**Targets:** Task Agent's default topology proposals; saved task templates; graph patterns surfaced during task design.
**Operational pattern:** Track which topologies converge reliably vs diverge; pattern card emergence at topology level; Task Agent's design proposals improve from accumulated data.
**Data needs:** Statistical; needs volume of task runs.
**User involvement:** Reviews emergent patterns; can mark accepted or rejected; affects future Task Agent proposals.
**Status in proposal:** Partially captured in `TaskDesignPatternCard`; learning loop unspecified. Specify as Mechanism F.
## Mechanism G: User attention/effort signals
**What it does:** Captures where the user spends time, gets frustrated, or struggles. Subtle but real signal about UX friction and content quality.
**Substrate:** UI telemetry; session signals; abandonment tracking; re-edit rate per artifact.
**Targets:** TIE inputs (TIE reasons over attention patterns to identify friction points); UX improvement candidates; documentation gaps.
**Operational pattern:** Continuous UI telemetry; TIE periodic analysis surfaces friction; recommendations target UX or content.
**Data needs:** Continuous; volume produces signal over time.
**User involvement:** Passive; user generates signals through normal use.
**Status in proposal:** Missing entirely. Specify as Mechanism G.
```ts
UserAttentionSignal {
  signal_id: string
  signal_kind:
    | "time_in_state"          // duration on review surface
    | "abandonment"            // task abandoned at stage X
    | "re_edit_rate"           // user changed same artifact multiple times
    | "help_seek"              // documentation accessed
    | "speed_of_confirmation"  // quick vs deliberate accept
  context: {module_id, artifact_id?, task_id, run_id}
  measurement: any             // type-specific
  schema_version: 1
}
```
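One way TIE's periodic analysis could reduce this stream to friction candidates is a simple count per module and signal kind. This is a sketch only: the signal kinds are copied from the schema above, while `AttentionSignalSketch`, `FrictionPoint`, `frictionPoints`, and the `minCount` cutoff are hypothetical.

```typescript
// Signal kinds are copied from the schema above; the rest is a hypothetical sketch.
type SignalKind =
  | "time_in_state"
  | "abandonment"
  | "re_edit_rate"
  | "help_seek"
  | "speed_of_confirmation";

interface AttentionSignalSketch {
  signal_kind: SignalKind;
  context: { module_id: string; artifact_id?: string; task_id: string; run_id: string };
}

interface FrictionPoint {
  module_id: string;
  signal_kind: SignalKind;
  count: number;
}

// Count signals per (module, kind) and keep the pairs that recur at least
// `minCount` times, most frequent first. `minCount` is an assumed tuning knob.
function frictionPoints(signals: AttentionSignalSketch[], minCount: number): FrictionPoint[] {
  const buckets: Record<string, FrictionPoint> = Object.create(null);
  for (const s of signals) {
    const key = s.context.module_id + "|" + s.signal_kind;
    if (!buckets[key]) {
      buckets[key] = { module_id: s.context.module_id, signal_kind: s.signal_kind, count: 0 };
    }
    buckets[key].count += 1;
  }
  return Object.keys(buckets)
    .map((k) => buckets[k])
    .filter((p) => p.count >= minCount)
    .sort((a, b) => b.count - a.count);
}
```

Counts alone ignore the type-specific `measurement` payload; a fuller version would weight by duration or edit count per kind.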
## Cross-mechanism integration
The seven mechanisms are not independent; they reinforce each other:
- **DSPy (A)** improves base behavior of modules; **TIE (B)** improves everything around them
- **Retrieval effectiveness (C)** feeds TIE reasoning (knowing what's retrieved well affects what's worth surfacing)
- **Pattern emergence (D)** produces structure both A and B consume
- **Trust calibration (E)** governs autonomy across all other mechanisms
- **Process pattern emergence (F)** sharpens DSPy training sets and informs TIE topology recommendations
- **User attention (G)** feeds TIE with UX friction signals
Cross-mechanism dependencies need explicit specification in R1.
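As a starting point for that specification, the relationships listed above can be written down as an explicit edge list. The letters A through G are this review's mechanism labels; the `Dependency` shape and `inputsTo` helper are hypothetical. The sketch shows that TIE (B) is the main consumer, with inputs from C, D, E, F, and G.

```typescript
type Mechanism = "A" | "B" | "C" | "D" | "E" | "F" | "G";

interface Dependency {
  from: Mechanism;
  to: Mechanism;
  note: string;
}

// Edges transcribed from the bullets above. E governs autonomy across all
// other mechanisms, so it points at each of them.
const DEPENDENCIES: Dependency[] = [
  { from: "C", to: "B", note: "retrieval effectiveness feeds TIE reasoning" },
  { from: "D", to: "A", note: "emergent patterns consumed by DSPy" },
  { from: "D", to: "B", note: "emergent patterns consumed by TIE" },
  { from: "E", to: "A", note: "autonomy gating" },
  { from: "E", to: "B", note: "autonomy gating" },
  { from: "E", to: "C", note: "autonomy gating" },
  { from: "E", to: "D", note: "autonomy gating" },
  { from: "E", to: "F", note: "autonomy gating" },
  { from: "E", to: "G", note: "autonomy gating" },
  { from: "F", to: "A", note: "sharpens DSPy training sets" },
  { from: "F", to: "B", note: "informs TIE topology recommendations" },
  { from: "G", to: "B", note: "UX friction signals feed TIE" },
];

// List the mechanisms whose output a given mechanism consumes.
function inputsTo(target: Mechanism): Mechanism[] {
  const seen: Partial<Record<Mechanism, true>> = {};
  for (const d of DEPENDENCIES) {
    if (d.to === target) seen[d.from] = true;
  }
  return (Object.keys(seen) as Mechanism[]).sort();
}
```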
---
# Part III — Concrete Examples (Demonstrating TIE Value)
The proposal benefits from concrete examples showing where TIE-style reasoning produces real improvements beyond what passive DSPy optimization or pattern emergence delivers.
## Example 1 — Legal: Research memo analytical depth (Tier 1, task-artifact level)
**Setup:** Saved task "draft research memo for discovery dispute." Multiple runs across matters.
**Issue emerges:** Across 4 runs in 3 weeks, Will edits rubric_pass score on analytical-quality criterion down by 3-4 points. Rationales cluster on "doesn't engage with conflicts in authority."
**TIE diagnoses:** Rubric is too lenient. Doesn't define "depth of engagement with authority conflicts." Evaluator is correctly applying the rubric it was given; the rubric is the issue.
**TIE recommends:**
1. Refine analytical-quality rubric for this saved task (add explicit sub-criterion on authority-conflict engagement)
2. DOC72 pattern primitive: "For legal-research-memo tasks involving litigation issues, analytical-quality rubric should include authority-conflict-engagement sub-criterion"
3. User constitution update: "Legal analysis should engage with conflicts in authority"
**What changes:** Task-specific rubric edit (Tier 1) + cross-task pattern (Tier 2) + user constitution (Tier 2). Future legal research memos see refined criteria automatically.
**Value:** Compound. Single-task improvement (Tier 1) + improves all future legal research memos via pattern (Tier 2) + informs Compiler defaults across all legal tasks via constitution (Tier 2).
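The trigger for this diagnosis could be a cheap deterministic pre-filter that surfaces recurring downward score edits for TIE to reason over. This sketch assumes a hypothetical `ScoreEdit` record and hypothetical thresholds; the diagnosis itself (rubric too lenient vs Evaluator error) remains TIE's LLM-reasoning step.

```typescript
// Hypothetical record of one user edit to an Evaluator score.
interface ScoreEdit {
  run_id: string;
  criterion: string;
  original_score: number;
  edited_score: number;
}

// Flag criteria the user marked down by at least `minDrop` points in at
// least `minRuns` distinct runs. Both thresholds are assumed tuning knobs.
function recurringDowngrades(edits: ScoreEdit[], minDrop: number, minRuns: number): string[] {
  const runsByCriterion: Record<string, Record<string, true>> = Object.create(null);
  for (const e of edits) {
    if (e.original_score - e.edited_score < minDrop) continue;
    if (!runsByCriterion[e.criterion]) runsByCriterion[e.criterion] = Object.create(null);
    runsByCriterion[e.criterion][e.run_id] = true; // distinct runs only
  }
  return Object.keys(runsByCriterion)
    .filter((c) => Object.keys(runsByCriterion[c]).length >= minRuns)
    .sort();
}
```

With `minDrop = 3` and `minRuns = 4`, the pattern in this example (four runs, 3-4 point reductions on one criterion) would be flagged.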
## Example 2 — Legal: Citation hallucination detection (Tier 2, cross-task pattern)
**Setup:** Multiple saved tasks involving legal research.
**Issue emerges:** factual_verification AssuranceBasis catches 3 instances of hallucinated citations across 6 runs. All hallucinations: secondary sources (treatises, law review articles) with fabricated page numbers.
**TIE diagnoses:** Two-part issue. The research agent doesn't retrieve secondary sources (only case law via Westlaw), so the drafting agent cites them from "general knowledge," which is where the fabrication originates. The issue affects all tasks using this research/drafting pipeline.
**TIE recommends:**
1. Task graph topology change: add citation-verification stage between drafting and evaluation. Every citation checked against retrieved-sources OR sent to secondary-source verification sub-agent. (Tier 3 — affects all legal tasks)
2. Drafting agent configuration update: `citation_policy: "verified_retrieval_only"` (Tier 2)
3. DOC72 pattern primitive: "Legal drafting should not cite secondary sources without retrieval verification" (Tier 2)
**What changes:** System-wide configuration of legal task pipelines. Citation hallucination becomes structurally prevented.
**Value:** Existential: hallucinated citations are a professional risk. DSPy can't address this kind of issue (the fix is in the task graph, not in a prompt), and passive pattern emergence wouldn't catch it (the hallucination is rare per task but consistent in cause).
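The citation-verification stage in recommendation 1 can be sketched as a set-membership check between drafted citations and retrieved sources. `DraftCitation`, `checkCitations`, and `passesVerifiedRetrievalOnly` are hypothetical names; only the `citation_policy` value comes from the recommendation above.

```typescript
// Hypothetical shape: each drafted citation either points at a retrieved
// source or has no source at all (cited from the model's memory).
interface DraftCitation {
  citation_text: string;
  source_id: string | null;
}

interface CitationCheck {
  verified: DraftCitation[];
  unverified: DraftCitation[];
}

// Split citations into those backed by a retrieved source and those that are not.
function checkCitations(citations: DraftCitation[], retrievedSourceIds: string[]): CitationCheck {
  const retrieved: Record<string, true> = Object.create(null);
  for (const id of retrievedSourceIds) retrieved[id] = true;
  const result: CitationCheck = { verified: [], unverified: [] };
  for (const c of citations) {
    const ok = c.source_id !== null && retrieved[c.source_id] === true;
    (ok ? result.verified : result.unverified).push(c);
  }
  return result;
}

// Under citation_policy "verified_retrieval_only", a draft passes this stage
// only when every citation is backed by a retrieved source.
function passesVerifiedRetrievalOnly(check: CitationCheck): boolean {
  return check.unverified.length === 0;
}
```

Unverified citations would then either block the draft or be routed to the secondary-source verification sub-agent, per the recommendation.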
## Example 3 — Non-legal: Investment research insider transaction gap (Tier 2, cross-task pattern)
**Setup:** Saved task "research a potential equity investment."
**Issue emerges:** Across 10 runs, user repeatedly adds insider-transaction analysis sections that weren't in original draft. Evaluator never flags missing because rubric doesn't include it.
**TIE diagnoses:** Rubric gap AND retrieval gap. Rubric doesn't require insider analysis (so Evaluator can't flag); drafting agent doesn't retrieve Form 4 filings (so even if rubric required it, agent couldn't comply).
**TIE recommends:**
1. Rubric expansion (Tier 1): add insider transaction criterion
2. Drafting agent retrieval extension (Tier 2): add Form 4 retrieval to standard tool sequence for equity research
3. DOC72 pattern primitive (Tier 2): "Equity research tasks should include insider transaction analysis"
**What changes:** Future equity research tasks include the previously-missing analysis dimension.
**Value:** Captures a real gap that user knows about but hasn't formalized. TIE turns implicit knowledge into explicit system behavior.
## Example 4 — System-wide: Architectural audit (Tier 3, system configuration)
**Setup:** ELNOR has accumulated thousands of run records across months of use. Some things work well; some don't. Will runs deep TIE audit.
**TIE produces architectural report:**
> **System health summary (last 90 days):**
> - Total runs: 2,847
> - Convergence rate: 73%
> - Cost-per-success: $0.42 mean, $1.20 p90
> - User abandonment rate: 8%
> - User edit rate on Compiler proposals: 41%
>
> **High-leverage issues detected:**
>
> 1. **Revisor convergence bottlenecked by RepairStrategyKind selection.** 78% of 412 max-iteration cycles involved poorly-matched strategy. REGENERATE for surgical-fix findings: 28% success vs 71% for SURGICAL_PATCH. Recommendation: add learned finding-type-to-strategy lookup table that strategy-selection step consults. Tier 3.
>
> 2. **Compiler proposes overly elaborate OutcomeDefinitions for simple tasks.** 38% of 1,166 user-edits removed unnecessary criteria. Recommendation: update Compiler default rule set; train Compiler DSPy against criterion-removal edits. Tier 3 (rule set) + Mechanism A (DSPy).
>
> 3. **Sub-agent dispatch shows model_class mismatches.** Citation-checker: 91% on case-law with cheap_api, 47% on statutory citations. Recommendation: extend dispatch with finding-type-conditioned model_class routing. Tier 3.
**What changes:** Multiple system-wide updates derived from cross-task pattern analysis. Each gets multi-LLM review and user approval before applying.
**Value:** This is where TIE earns its keep: cross-cutting architectural diagnoses that no working user has time to perform manually. Each recommendation affects all future operations across all task types.
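The finding-type-to-strategy lookup table from issue 1 could be learned directly from repair outcomes. A minimal sketch, assuming a hypothetical `RepairOutcome` record and a `minAttempts` evidence floor; `REGENERATE` and `SURGICAL_PATCH` are the two strategy kinds named in the report, and the real `RepairStrategyKind` enum may contain more.

```typescript
// Hypothetical outcome record for one repair attempt.
interface RepairOutcome {
  finding_type: string;
  strategy: string; // a RepairStrategyKind value, e.g. "REGENERATE" or "SURGICAL_PATCH"
  succeeded: boolean;
}

interface Tally {
  attempts: number;
  successes: number;
}

// For each finding type, pick the strategy with the highest observed success
// rate among strategies tried at least `minAttempts` times. Ties resolve to
// the alphabetically first strategy so the table is deterministic.
function buildStrategyLookup(outcomes: RepairOutcome[], minAttempts: number): Record<string, string> {
  const tallies: Record<string, Record<string, Tally>> = Object.create(null);
  for (const o of outcomes) {
    const byStrategy: Record<string, Tally> =
      tallies[o.finding_type] || (tallies[o.finding_type] = Object.create(null));
    const t: Tally = byStrategy[o.strategy] || (byStrategy[o.strategy] = { attempts: 0, successes: 0 });
    t.attempts += 1;
    if (o.succeeded) t.successes += 1;
  }
  const lookup: Record<string, string> = {};
  for (const findingType of Object.keys(tallies)) {
    let best: string | null = null;
    let bestRate = -1;
    for (const strategy of Object.keys(tallies[findingType]).sort()) {
      const t = tallies[findingType][strategy];
      if (t.attempts < minAttempts) continue; // not enough evidence yet
      const rate = t.successes / t.attempts;
      if (rate > bestRate) {
        best = strategy;
        bestRate = rate;
      }
    }
    if (best !== null) lookup[findingType] = best;
  }
  return lookup;
}
```

Finding types missing from the table fall back to the existing strategy-selection behavior.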
## Example 5 — Tier 4: Schema/code change
**Setup:** TIE audit identifies need for finer-grained confidence data.
**TIE proposes:**
> **Architectural change proposal:**
>
> Extend `EvaluationFinding` schema to include `confidence_score: number` and `confidence_basis: ConfidenceBasis`.
>
> **Rationale:** Current architecture captures aggregate-level confidence only. Finer-grained confidence enables: TIE filtering of low-confidence findings from learning signals; Revisor strategy selection responsive to confidence; UI surfaces uncertainty to user.
>
> **Code changes required:**
> 1. Schema update in `common_contracts/v1.1/EvaluationFinding.ts`
> 2. Evaluator: populate confidence during finding production
> 3. Revisor: consume confidence in strategy selection
> 4. Database migration: legacy records default to confidence_score = 1.0
> 5. UI surface: display confidence in finding review
>
> **Generated patches:** [diff files]
>
> **Test cases needed:** [enumerated]
>
> **Estimated risk:** Medium. Backward compatibility maintained.
**Workflow:**
1. Multi-LLM review of the proposal
2. Refined proposal to Will
3. Will approves; implementation goes to Claude Code/Codex
4. Claude Code/Codex produces actual code + tests
5. Will reviews code
6. Merge
**Value:** Schema improvements emerge from observed pipeline needs rather than upfront design. Each improvement is grounded in real usage data.
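A sketch of the schema extension and the legacy default from step 4 of the proposal. `confidence_score`, `confidence_basis`, and the 1.0 legacy default come from the proposal; the `ConfidenceBasis` values, the `finding_id` and `description` fields, and `migrateFinding` are hypothetical. Because legacy rows default to 1.0, the sketch assumes a `"legacy_default"` basis marker so consumers can tell defaulted scores from measured ones.

```typescript
// `ConfidenceBasis` is named in the proposal but not defined in this review;
// this string union is a placeholder assumption.
type ConfidenceBasis = "evaluator_reported" | "legacy_default";

// Hypothetical minimal shape of a record written before the schema change.
interface LegacyEvaluationFinding {
  finding_id: string;
  description: string;
}

interface EvaluationFinding extends LegacyEvaluationFinding {
  confidence_score: number;
  confidence_basis: ConfidenceBasis;
}

// Backward-compatible read path: records that already carry confidence are
// returned unchanged; legacy records get confidence_score = 1.0.
function migrateFinding(f: LegacyEvaluationFinding & Partial<EvaluationFinding>): EvaluationFinding {
  return {
    ...f,
    confidence_score: f.confidence_score !== undefined ? f.confidence_score : 1.0,
    confidence_basis: f.confidence_basis !== undefined ? f.confidence_basis : "legacy_default",
  };
}
```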
---
# Synthesis
**Top 3 must-fix issues before R1 spec drafting:**
1. **CRITICAL §4 — Task Improvement Engineer (TIE) not specified.** The highest-leverage learning mechanism is missing entirely. Without TIE, the proposal's learning architecture is limited to DSPy module-level optimization plus passive pattern emergence — leaving the majority of learnable surfaces unaddressed. Specify TIE as first-class component with 4-tier capability spectrum and the multi-stage review pipeline.
2. **CRITICAL §1/§4/§10 — No measurement loop specified.** The proposal cannot prove value post-implementation; cannot be evaluated for ROI; misses the highest-quality DSPy training signal source and TIE's primary outcome-tracking mechanism. Add Loop Effectiveness Test pattern as canonical measurement substrate; feed into both DSPy training and TIE trust calibration.
3. **HIGH §4 — DSPy framing too narrow; six other learning mechanisms missing.** The proposal conflates "DSPy" with "the learning architecture." Specify mechanisms B (TIE) through G (user attention signals) as distinct, with substrates and operational patterns. Each addresses learnable surfaces DSPy can't touch.
**Recommended additions to R1 spec:**
- **Task Improvement Engineer (TIE)** specification: 4-tier intervention spectrum, multi-stage review pipeline, capability scoping (read/write access through ImplementationProposal lifecycle), schemas (DiagnosticImprovementRecommendation, TieAnalysisRecord, ImplementationProposal, ImprovementOutcomeRecord)
- **Loop Effectiveness Test pattern** as canonical measurement substrate
- **Multi-user forward-compatibility schema additions** (principal_id, learning_scope, scope_inference_basis) on all learning artifact schemas
- **UserConstitution** as concrete first-class artifact (stated values prior)
- **Active learning query** mechanism (§4.6 in revised proposal)
- **In-session self-critique** mechanism (§4.7)
- **Retrieval effectiveness signal** specification (Mechanism C)
- **Trust calibration** machinery (Mechanism E)
- **Process pattern emergence** at topology level (Mechanism F)
- **User attention signals** (Mechanism G)
- **Strategic intent tag** axis as parallel to outcome_cluster_id (preserving goal-conditioned learning without sycophancy)
- **Hierarchical memory architecture** (H-MEM / Zep / Mem0 pattern) replacing or augmenting outcome clustering
- **BDSM specification** as required precursor (utility compilation rules, extraction pipelines, query interfaces)
- **Privilege firewall design** for cross-matter clustering and rationale clustering
- **Cluster-stability mechanism** or commitment to k-NN retrieval instead
- **PlanDiff schema** in Common Contracts
- **Dampening mechanisms** on revealed-preference learning loops (time decay, diversity injection, drift detection)
- **Multi-prior deterministic conflict resolution** at prompt-assembly time
- **DSPy composite metric architecture** with scalarization/Pareto decisions per target
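Of the additions above, time decay is the simplest dampening mechanism to pin down. A minimal sketch using exponential half-life weighting; `PreferenceObservation`, `decayWeight`, `decayedMean`, and the half-life parameter are hypothetical, and the proposal does not commit to this particular decay form.

```typescript
// Hypothetical revealed-preference observation.
interface PreferenceObservation {
  value: number;    // e.g. 1 = user accepted, 0 = user rejected
  age_days: number; // time since the observation
}

// Exponential decay: an observation loses half its weight every `halfLifeDays`.
function decayWeight(ageDays: number, halfLifeDays: number): number {
  if (halfLifeDays <= 0) throw new Error("halfLifeDays must be positive");
  return Math.pow(0.5, Math.max(0, ageDays) / halfLifeDays);
}

// Decay-weighted mean of observed preferences; null when there is no data.
function decayedMean(observations: PreferenceObservation[], halfLifeDays: number): number | null {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const o of observations) {
    const w = decayWeight(o.age_days, halfLifeDays);
    weightedSum += w * o.value;
    weightTotal += w;
  }
  return weightTotal > 0 ? weightedSum / weightTotal : null;
}
```

Older behavior still counts but cannot dominate, which limits how far a revealed-preference loop can lock in a past habit. Diversity injection and drift detection would need their own specifications.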
**Recommended removals from proposal:**
- The proposal's full removal of `GoalRef[]`. Retain the goal reference as a `strategic_intent_tag` axis parallel to `outcome_cluster_id` instead.
**Recommended next-step research:**
Before committing R1 architecture, validate:
- Realistic outcome volumes at single-user scale; HDBSCAN behavior at proposed thresholds (then decide cluster vs hierarchical memory)
- Frontier-model in-context-learning effectiveness vs DSPy under sparse data (then decide DSPy target activation priority)
- BDSM's existing free-text rationale extraction capability (confirm it's calibrated as author indicated)
- DOC15 CIL capacity for 7+ prior kinds with deterministic conflict resolution (then decide whether DOC15 needs redesign)
**External references for R1:**
- Xi et al. 2025 AgentPRM (https://arxiv.org/abs/2511.08325) — process supervision for LLM agents
- Sun & Zeng 2025 H-MEM (https://arxiv.org/abs/2507.22925) — hierarchical memory architecture
- Mem0 documentation 2025 — production memory architecture
- Zep 2025 — temporal knowledge graph for agent memory
- Bai et al. 2022 + RLAIF 2023 + constitutional-classifier work 2024-2025 — values-as-prior mechanism
- SWE-agent (Yang et al. 2024) + OpenHands 2024-2025 — autonomous engineering agent patterns (TIE's Tier 4)
**Overall assessment:**
The original proposal correctly identifies the gap landscape (partial/missing/broken learning loops). It misses two architectural priorities that, once added, transform it from "specify the gaps better" to "specify a working self-learning system":
1. **TIE as the LLM-reasoning layer over diagnostic data.** This is the single most valuable addition. Without it, the proposal's per-step capture has nowhere actionable to go. With it, the system gains 4 tiers of improvement capability operating across rubrics, patterns, configurations, strategies, and code — leveraging frontier-model reasoning at sparse-data scales where DSPy alone is insufficient.
2. **The seven-mechanism framing.** The proposal treats learning as one mechanism (signals → BDSM → bundles → consumers). In practice there are seven distinct mechanisms that reinforce each other. Specifying them separately produces architectural clarity and enables independent calibration.
With these additions plus the corrections in Part I, the proposal becomes a sound foundation for the Self-Learning Architecture R1 spec. Without them, R1 ships with a partial architecture that requires expensive backtracking when limitations surface during use.
The architectural cost is bounded (TIE schemas + spec extensions), the implementation cost is bounded (TIE runs as a configured agent, not a new substrate), and the value is genuine and demonstrable through the concrete examples in Part III. Worth building.
---
**End of review V2.**