DOC23_ADDENDA_A_R5_R6_DEFERRAL_LIST_V2.md

Short text page a18241a79da2. Generated 2026-06-09T01:23:58.539Z from commit dbaa25962edc11ab30e8d4ca1715f9ae5bf77331. Worktree: clean.
ELNOR REPO READER TEXT MIRROR
Original path: Current Specs/DOC23/DOC23_ADDENDA_A_R5_R6_DEFERRAL_LIST_V2.md
Source repo: /Users/OpenClaw1/Elnor/Elnor Specs
Git branch: main
Git commit: dbaa25962edc11ab30e8d4ca1715f9ae5bf77331
Generated: 2026-06-09T01:23:58.539Z

---

# DOC23 Addenda A — R5 / R6 Deferral List V2

**Companion to:** `DOC23_ADDENDA_A_R4_0_ADJUDICATION_CARD_V5.md`
**Purpose:** Consolidated work plan for R5 + R6 of the addenda, organized for spec drafting (not adjudication). Each entry maps to one or more rows in the V5 adjudication card; this document re-organizes those rows by functional spec area so R5/R6 drafting can be scoped section-by-section.
**Date:** 2026-05-02
**Status:** Pre-adjudication. This list will be revised to V3 (post-adjudication) after Will fills in the V5 Decision column.

**V2 changes from V1:**
- Updated to reference card V5 (238 rows, up from 233 in V4).
- Added §1.O lazy claim verification (card row #234).
- Added §1.P hierarchical scoring for huge variants (card row #235).
- Updated §1.N Context Management Proposal V1 to include OB-A15 (card row #238).
- §1.M2 route list unchanged, but its R4.1 portion now includes validation codes from card rows #236 and #237.
- Added §6 Calibration items for R5 (proposal §14, six items).

**Scope reminder.** R4.1 is the immediate contract closure revision (177 rows + 23 NON_OPERATIVE_R4.1 text removals). R5 is the next major revision (29 dedicated rows + R5 portions of 6 hybrids + 2 §8 Notes substantive R5 implementations + 1 separate addenda doc). R6 is later (4 rows). This list does NOT cover R4.1 work — the V4 card § 9 bundles B1–B6 are the R4.1 work plan.

---

## §0 — How this list is organized

Three top-level sections:

- **§1 — R5 work plan, grouped by functional spec area** (A through P). Each area lists row references, a brief description of what to build, dependencies, and source provenance.
- **§2 — R6 work plan** (4 items). Smaller and more discrete.
- **§3 — External documents that compile into R5/R6**. Items that already have their own spec docs (`DOC23_ADDENDA_A_R4_0_CONTEXT_MANAGEMENT_PROPOSAL_V1.md`, plus the two §8 Notes substrate proposals).

Cross-doc obligations (OB-A14 through OB-A18) are flagged with the responsible doc owner so cross-doc scheduling is visible.

---

## §1 — R5 Work Plan (by functional area)

### §1.A — DSPy + Optimization Platform (the largest R5 area)

**The single biggest R5 work package.** R4.1 has 23 NON_OPERATIVE_R4.1 rows that remove unsafe optimization text from R4.0; R5 implements the optimization platform from scratch with the safety machinery the R4.0 draft lacked.

**Rows (23 total):** #20, #21, #22, #23, #24, #25, #79, #89, #90, #91, #92, #93, #94, #101, #111, #125, #149, #152, #154, #158, #159, #167, #177

**Substantive R5 implementations:**
- **§8 Notes 20 — DSPy metric assembly contract.** JudgeMetricAdapter schema; static EC_Judge_Wrapper.py template with HTTP callback to EC's Node.js API (no generated Python source); structured GEPA feedback as JSON block per dimension; Pareto multi-objective option for `optimization_intensity: heavy`. Pinned dspy@^3.0, gepa@^0.5, litellm@^1.55 in requirements.lock.
- **§8 Notes 23 — Promotion safety pass.** Four-part: PromotionRequest with conflict detection (expected_current_instruction_hash); PromotionLedgerEntry with promotion_id/baseline/candidate hashes/validation_run_id/approved_by/rollback_ref/post_promotion_monitor_id; post-promotion shadow runs (every Nth run re-scored, regression detection at trailing avg < promoted_aggregate × 0.85, emit `task.judge.regression_detected`); one-click rollback via ModuleRecord.instruction_history.
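
The post-promotion regression check in Notes 23 can be sketched as follows. This is a minimal illustration, not the Notes 23 schema: the 0.85 factor and the event name come from the text above, while the window size, `ShadowMonitorState`, `trailingAverage`, and `checkForRegression` are hypothetical names and parameters introduced here.

```typescript
// Sketch only. The 0.85 factor and the event name are from the spec text;
// the trailing-window size and every identifier below are assumptions.
const REGRESSION_FACTOR = 0.85;

interface ShadowMonitorState {
  promotedAggregate: number; // aggregate score recorded at promotion time
  shadowScores: number[];    // re-scored shadow runs, oldest first
}

// Mean of the most recent `windowSize` scores, or null if there is nothing to average.
function trailingAverage(scores: number[], windowSize: number): number | null {
  if (scores.length === 0 || windowSize <= 0) return null;
  const recent = scores.slice(-windowSize);
  let sum = 0;
  for (const s of recent) sum += s;
  return sum / recent.length;
}

// Returns the SSE event name to emit, or null when no regression is detected.
function checkForRegression(state: ShadowMonitorState, windowSize: number): string | null {
  const avg = trailingAverage(state.shadowScores, windowSize);
  if (avg === null) return null;
  return avg < state.promotedAggregate * REGRESSION_FACTOR
    ? "task.judge.regression_detected"
    : null;
}
```

How many shadow runs make up the trailing window is left open in the text, so it is a caller-supplied parameter here.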

**Architectural building blocks needed:**
- Python subprocess runtime under `prompt_optimizer/judge/` (DOC23-owned), separate from PropA's `prompt_optimizer/propa/` (PropA-owned). No shared schemas, target registries, or invalidation rules.
- JudgeOptimizationConfig with `data_class` field matching PropA's enum (public/internal/privileged/local_only) + `reflection_model_constraint` + `trainset_redaction_rules`.
- JudgeTrainsetExample schema with three construction modes (auto-collected from prior runs / hand-curated / experiment-derived).
- IterationRecord schema; ModuleRecord.instruction_history (cap last 20).
- Validation codes: `judge_optimization_unapproved_loop_back`, `judge_optimization_examples_too_few`, `judge_optimization_data_class_mismatch`.

**Dependencies:**
- EvaluationRunLite (R4.1) → EvalDataset/EvalExample/EvalRun (R5, see §1.H).
- Promotion infrastructure (Notes 23) presupposes EvaluationRun substrate.

**Source provenance:** Major findings from GPT-P1, GPT-P3, CLD V1 Findings 5.1–5.6, CLD V2 N-CRITICAL-4 + N-CRITICAL-6 + N-CRITICAL-7, GRK Part 5 (DSPy), GEM CRITICAL findings #1 and #2.

---

### §1.B — Human Evaluation Pipeline

**Rows:** #104

**What to build:**
- New `human_review` block on JudgeModuleConfig: `enabled`, `queue_id?`, `required_dimensions[]`, `min_human_reviews`, `weight_human_vs_llm` (default 0.3), `annotation_instructions`.
- New port on `step.judge`: `human_review_out` (Data) emits HumanReviewRequest bundle when human_review.enabled.
- New module type: `step.human_annotation_gate` consuming human_review_out and producing human_scores_out.
- New WaitReason value: `human_annotation`.
- Annotation queue surface in DOC20/DOC21 — "Human Review" tab in expanded detail mode; queue list (LangSmith-style) with filters by task/dimension/confidence; side-by-side comparison view with LLM-said-X / Human-override-Y diff highlighting.
- Mixed scoring: weight_human_vs_llm blends LLM and human dimension scores; HumanReviewRecord written immutably to DOC10 ledger.
- Default 90-day retention for HumanReviewRecord.
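
One possible reading of the mixed-scoring rule, as a sketch. It assumes `weight_human_vs_llm` is the share given to the human mean (with the LLM score taking the remainder) and that the LLM score is used unchanged until `min_human_reviews` is reached; neither assumption is confirmed by the text above. `HumanReviewBlend` and `blendDimensionScore` are hypothetical names.

```typescript
// Sketch only. Interpretation of weight_human_vs_llm as the human share is an assumption.
interface HumanReviewBlend {
  weightHumanVsLlm: number; // spec default 0.3
  minHumanReviews: number;
}

function blendDimensionScore(
  llmScore: number,
  humanScores: number[],
  cfg: HumanReviewBlend
): number {
  // ASSUMPTION: with too few human reviews, fall back to the LLM score.
  if (humanScores.length === 0 || humanScores.length < cfg.minHumanReviews) {
    return llmScore;
  }
  let sum = 0;
  for (const h of humanScores) sum += h;
  const humanMean = sum / humanScores.length;
  // Clamp the weight to [0, 1] so a bad config cannot push the blend out of range.
  const w = Math.min(1, Math.max(0, cfg.weightHumanVsLlm));
  return w * humanMean + (1 - w) * llmScore;
}
```

With an LLM score of 0.8, human scores of 0.4 and 0.6, and the default weight of 0.3, this blend comes out at roughly 0.71.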

**Cross-doc obligations:**
- DOC20: New "Annotation Queues" settings page.
- DOC21/DOC22: Register HumanAnnotationQueue, HumanReviewRecord, queue management components.
- DOC10: Immutable storage for HumanReviewRecord.

**Source provenance:** GPT-P1, CLD V1 Finding 8.2, GRK new idea #1, GEM Part 8 HITL.

---

### §1.C — Code-based / Deterministic Scorers

**Rows:** #105

**What to build:**
- Add 4 deterministic scoring methods to ScoringDimension `method` enum:
  - `code_grader` — user-provided JS function returning {score, audit}; runs in EC sandbox with timeout.
  - `regex_or_string_check` — pattern matching with required/forbidden patterns.
  - `json_schema_check` — validates output against user-supplied JSON Schema; returns binary pass + per-violation audit.
  - `tool_call_trace_check` — verifies expected tool sequence in OpenClaw trace.
- Each method has its own typed config (continuing the discriminated union pattern from R4.1 row #11).
- These methods bypass the judge LLM entirely; cost = ~$0; latency = milliseconds.
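
As an illustration of how small these scorers can be, here is a sketch of `regex_or_string_check`. The config shape (`requiredPatterns`, `forbiddenPatterns`), the binary score, and the audit format are assumptions; only the `{score, audit}` return shape is taken from the `code_grader` description above.

```typescript
// Sketch only. Config field names and the binary scoring rule are assumptions.
interface RegexCheckConfig {
  requiredPatterns: string[];  // each must match somewhere in the output
  forbiddenPatterns: string[]; // none may match anywhere in the output
}

interface ScorerResult {
  score: number;   // ASSUMPTION: 1 = pass, 0 = fail
  audit: string[]; // one entry per violated pattern
}

function regexOrStringCheck(output: string, cfg: RegexCheckConfig): ScorerResult {
  const audit: string[] = [];
  for (const p of cfg.requiredPatterns) {
    if (!new RegExp(p).test(output)) audit.push("missing required pattern: " + p);
  }
  for (const p of cfg.forbiddenPatterns) {
    if (new RegExp(p).test(output)) audit.push("found forbidden pattern: " + p);
  }
  return { score: audit.length === 0 ? 1 : 0, audit: audit };
}
```

An invalid pattern makes `new RegExp` throw, so a real implementation would probably validate patterns at config-save time rather than at scoring time.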

**Source provenance:** GPT-P3 Finding #34, CLD V1 Finding 2.4.

---

### §1.D — Online Scoring + Replay + Drift Detection

**Rows:** #107, #163

**What to build:**
- JudgeScorerSpec exportable from a Judge module; can be applied to past runs or production runs without graph re-wiring.
- EC nightly job re-scores N% of completed runs of a configured task (config field `online_scoring_sample_rate: 0.0–1.0`).
- Replay mode: re-score historical runs with current scorer; new JudgeScoreBundle with `replay_of` field linking to the original run.
- Drift detection: rolling 30-day window of online scores per module; alert when the window mean drops more than 1σ below baseline.
- New SSE: `task.judge.online_score_completed`, `task.judge.drift_detected`.
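
A minimal sketch of the drift rule, assuming σ is the population standard deviation of a stored set of baseline scores (the text above does not say how the baseline or σ is computed). `OnlineScore`, `meanOf`, `stdDevOf`, and `detectDrift` are hypothetical names.

```typescript
// Sketch only. How the baseline and sigma are derived is an assumption.
interface OnlineScore {
  timestampMs: number;
  score: number;
}

function meanOf(xs: number[]): number {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum / xs.length;
}

// Population standard deviation (divides by n, not n - 1).
function stdDevOf(xs: number[]): number {
  const m = meanOf(xs);
  let sumSq = 0;
  for (const x of xs) sumSq += (x - m) * (x - m);
  return Math.sqrt(sumSq / xs.length);
}

// True when the mean of scores from the last 30 days sits more than
// one baseline sigma below the baseline mean.
function detectDrift(baselineScores: number[], scores: OnlineScore[], nowMs: number): boolean {
  const windowMs = 30 * 24 * 60 * 60 * 1000;
  const inWindow = scores
    .filter((s) => s.timestampMs <= nowMs && nowMs - s.timestampMs <= windowMs)
    .map((s) => s.score);
  if (baselineScores.length === 0 || inWindow.length === 0) return false;
  return meanOf(baselineScores) - meanOf(inWindow) > stdDevOf(baselineScores);
}
```

A caller would emit `task.judge.drift_detected` when this returns true.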

**Source provenance:** GPT-P3 Finding #36, CLD V1 Finding 2.2, GRK ideas #1 and #2.

---

### §1.E — Evaluation UX Drill-Down

**Rows:** #110, #121 (R5 polish portion), #153

**What to build:**
- Score bars in expanded Judge view become clickable → right-hand drawer with per-claim verdicts, evidence quotes, dimension audit details (Phoenix/Braintrust UX pattern).
- "View full trace" button opens the exact prompt sent to the judge (including DOC24 preamble + CIL layers + EffectivePromptSnapshot).
- Time-series quality dashboard per module: line chart of weighted_aggregate over time, vertical lines at promotion events, regression banners.
- Workspace persistence of expanded views (which detail tab was last open).
- New route: `GET /api/tasks/:taskId/runs/:runId/modules/:judgeModuleId/audit/:dimensionId`.

**Dependencies:**
- §1.D online scoring populates the time-series.
- §1.A iteration ledger populates promotion event markers.

**Source provenance:** CLD V1 Findings 8.1 + 7.4, GRK Part 8 #4, GEM Part 8 drill-down.

---

### §1.F — Per-Dim Context Optimization

**Rows:** #39, #118, #165

**What to build:**
- Capability-aware provider prompt caching (used wherever the provider API supports it), with shared prefix structure `[system: judge instructions][cached: input_data + evidence + DOC24 preamble][per-call: dimension-specific instructions + output to evaluate]`.
- Evidence dedup + relevance scoring: EvidenceBundle gains dedup_signature + relevance_score per item; Judge filters at retrieval time.
- ClassificationEvaluator pattern (Phoenix-style) as optional R5 factorization for high-throughput binary scoring.

**Note:** This area is partially subsumed by §1.N (Context Management Proposal V1) — that proposal contains the full per-call context budget enforcement. §1.F is the simpler subset that doesn't depend on regime classification machinery.

**Source provenance:** CLD V2 N-HIGH-1, GPT-P1, CLD V1 Finding 2.6 (Phoenix pattern).

---

### §1.G — Confidence Intervals + Replicates

**Rows:** #119, #211 (R5 replicates portion)

**What to build:**
- `ExperimentReplicationConfig` schema: `replicates_per_variant`, `decoding_params_snapshot`, `random_seed_policy`, `aggregate_replicates`.
- Per-dimension `repeated_runs: number` (optional).
- Confidence intervals computed from replicated runs (95% CI displayed alongside score).
- Replicates UI: variant detail panel shows N runs with score distribution.
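
One way the 95% CI could be computed from replicate scores, shown as a sketch. It uses the normal approximation (mean ± 1.96 × standard error) with the sample standard deviation; the text above does not specify the method, and for small replicate counts a t-distribution interval would be wider. `ConfidenceInterval` and `confidenceInterval95` are hypothetical names.

```typescript
// Sketch only. The normal approximation is an assumption, not the spec's stated method.
interface ConfidenceInterval {
  mean: number;
  lower: number;
  upper: number;
}

function confidenceInterval95(scores: number[]): ConfidenceInterval | null {
  const n = scores.length;
  if (n < 2) return null; // a single sample has no spread to estimate
  let sum = 0;
  for (const s of scores) sum += s;
  const m = sum / n;
  let sumSq = 0;
  for (const s of scores) sumSq += (s - m) * (s - m);
  const sampleVariance = sumSq / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(sampleVariance / n);
  return { mean: m, lower: m - halfWidth, upper: m + halfWidth };
}
```

Returning null for fewer than two replicates lines up with the R4.1 single-sample warning: there is no interval to show.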

**R4.1 has:** Single-sample warning banner per row #211 (R4.1 portion). R5 adds the replication infrastructure.

**Source provenance:** GPT-P1, CLD V1 Finding 8.5, RT-P3.

---

### §1.H — EvaluationRun Full Substrate

**Rows:** Implicit dependency (no single row; enabling work for §1.A and §1.D)

**What to build:**
- EvaluationRunLite (R4.1 from row #108) → full EvaluationRun with EvalDataset, EvalExample, EvalTrace child schemas.
- EvalDataset: persistent collection of inputs for re-running scorers.
- EvalExample: single (input, expected_output?, expected_claims?) tuple.
- EvalTrace: full per-call audit, including model fingerprint, effective prompt hash, evidence retrieval log.
- Storage: `ELNOR_MEMORY/evaluation_runs/{eval_run_id}/{dataset|examples|traces}/`.
- Retention: per-content-type, default 365 days for promoted-candidate associated runs.
- Register `evaluation_runs` as a stored content type in DOC20 §6.18.2 (already covered by R4.1 row #164).

**Dependencies:** This substrate is foundational for §1.A (DSPy needs EvalDataset for trainsets), §1.D (online scoring needs EvalRun to associate scored runs), and §1.J (cross-variant claim clustering needs EvalRun scope).

**Source provenance:** GPT-P3 Finding "EvaluationRun substrate missing as central object."

---

### §1.I — Iteration Ledger

**Rows:** #111

**What to build:**
- IterationRecord schema: prior_instruction, prior_aggregate, candidate_aggregate, dataset_id, validation_run_id, approved_by, approved_at.
- ModuleRecord gains `instruction_history: IterationRecord[]` (cap last 20).
- Detail panel "Iterations" sub-tab listing prior promotions.
- One-click rollback wired to PromotionLedgerEntry.rollback_ref.
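
The capped history can be sketched as below. Field names follow the IterationRecord list above; the field types (for example, `approved_at` as an ISO string) and the `appendIteration` helper are assumptions.

```typescript
// Sketch only. Field types are assumed; field names come from the list above.
interface IterationRecord {
  prior_instruction: string;
  prior_aggregate: number;
  candidate_aggregate: number;
  dataset_id: string;
  validation_run_id: string;
  approved_by: string;
  approved_at: string; // ASSUMPTION: ISO-8601 timestamp
}

const INSTRUCTION_HISTORY_CAP = 20;

// Returns a new history array with the record appended and only the
// most recent 20 entries retained; the input array is not mutated.
function appendIteration(history: IterationRecord[], record: IterationRecord): IterationRecord[] {
  return history.concat([record]).slice(-INSTRUCTION_HISTORY_CAP);
}
```

Because old entries fall off the front, rollback targets older than the last 20 promotions would have to be recovered from the promotion ledger rather than from `instruction_history`.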

**Dependencies:** §1.A promotion safety pass (Notes 23) writes the ledger entries.

**Source provenance:** CLD V1 Finding 8.1, GRK new idea #2.

---

### §1.J — Cross-Variant Claim Clustering

**Rows:** #214

**What to build:**
- ClaimCluster schema: cluster_id, semantic_key, claim_ids_by_variant (Record<variant_id, claim_id[]>), cluster_label, consensus_status enum (`common` / `variant_unique` / `baseline_only` / `candidate_only`).
- Clustering algorithm (sentence-embedding similarity > threshold) runs on ClaimExtractionResult collections from Experiment runs.
- Detail panel "Claim Comparison" view: cluster table showing which variants share which claims.

**Source provenance:** RT-P3.

---

### §1.K — Reference Answer Comparison

**Rows:** #212

**What to build:**
- Add `reference_in` port to step.judge (data, optional). Carries gold/reference output.
- New scoring method `reference_comparison`: judge compares target output against reference using rubric or similarity scoring.
- ReferenceAnswerConfig: `reference_source: "gold" | "human_authored" | "prior_promoted"`, `comparison_method: "rubric" | "exact_match" | "semantic_similarity"`.

**Note:** R4.1 only reserves the port name (no actual port added until R5 to avoid schema churn).

**Source provenance:** RT-P3.

---

### §1.L — Pairwise Tournament Mode

**Rows:** #174 (R5 tournament portion)

**What to build:**
- TournamentConfig: `pre_elimination_method: "single_output_rubric" | "checklist"`, `elimination_threshold`, `top_k_to_pairwise`.
- For experiments with > 3 variants: pre-eliminate to the top 2 by single-output scoring, then run pairwise on those 2 only. For a 4-variant experiment this reduces 12 pairwise calls to 2, counting each ordered pair as one call.
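
A sketch of the pre-elimination step. It assumes `elimination_threshold` drops variants scoring below it before the top-k cut, and that pairwise judging covers both presentation orders (which is the counting under which 4 variants give 12 calls and 2 variants give 2). `ScoredVariant`, `orderedPairCount`, and `selectForPairwise` are hypothetical names.

```typescript
// Sketch only. Threshold semantics and ordered-pair counting are assumptions.
interface ScoredVariant {
  variantId: string;
  singleOutputScore: number;
}

// Number of pairwise calls when each ordered pair (A vs B, B vs A) is one call.
function orderedPairCount(variantCount: number): number {
  return variantCount * (variantCount - 1);
}

// Keeps variants at or above the threshold, then the top-k by single-output score.
function selectForPairwise(
  variants: ScoredVariant[],
  topK: number,
  eliminationThreshold: number
): string[] {
  return variants
    .filter((v) => v.singleOutputScore >= eliminationThreshold) // filter returns a copy
    .sort((a, b) => b.singleOutputScore - a.singleOutputScore)
    .slice(0, topK)
    .map((v) => v.variantId);
}
```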

**R4.1 has:** Pairwise-explosion warning per `validation.judge_pairwise_explosion`. R5 adds the tournament alternative.

**Source provenance:** CLD V2 N-HIGH-2.

---

### §1.M — Module Preset DSPy Provenance

**Rows:** #126 (DSPy-derived portion)

**What to build:**
- ModulePreset metadata for DSPy-derived presets gains: `derived_from_optimization: true`, `source_task_id`, `source_module_id`, `source_run_id`, `candidate_id`.
- UI "verify before use" hint when loading a candidate-derived preset.

**R4.1 has:** General preset provenance (non-DSPy) per row #126's R4.1 portion. R5 adds the DSPy-specific fields.

**Source provenance:** CLD V2 S-8 (second-order interaction).

---

### §1.M2 — API Routes for Optimization

**Rows:** #143 (R5 optimization portion)

**What to build (concrete route list):**
- `POST /api/tasks/:taskId/judges/:judgeModuleId/optimize` — start optimization.
- `POST /api/tasks/:taskId/judges/:judgeModuleId/optimize/:runId/cancel` — cancel in-flight.
- `POST /api/tasks/:taskId/judges/:judgeModuleId/optimize/:runId/promote` — promote selected candidate (with conflict detection).
- `POST /api/tasks/:taskId/judges/:judgeModuleId/optimize/:runId/reject` — reject candidates with reason.
- `POST /api/tasks/:taskId/judges/:judgeModuleId/promotions/:promotionId/rollback` — one-click rollback.
- `GET /api/tasks/:taskId/judges/:judgeModuleId/iterations` — list iteration history.
- `GET /api/tasks/:taskId/human_review_queue` — fetch annotation queue items.
- `POST /api/tasks/:taskId/human_review_queue/:itemId/submit` — submit human verdict.

**R4.1 has:** Basic retrieval routes (experiment detail, judge audit, extractor results, artifact retrieval) per row #143's R4.1 portion. **R4.1 also includes two validation codes from Context Management Proposal V1 §9 (card rows #236 and #237):** `validation.judge_context_overflow_pre_dispatch` (pairs with row #220 large-input hard-stop) and `validation.judge_storage_ref_unresolvable` (pairs with rows #29 + #72 StorageRef minimums).

**Source provenance:** GPT-P1 Finding M10, GPT-P2 Finding H10.

---

### §1.N — Context Management Proposal V1 Compilation

**Rows:** #226, #227, #228, #229, #230, #231, #232, #233 (V4 §6A rollups)

**What to build:**

This is the most architecturally significant R5 area. Already specified in detail in `DOC23_ADDENDA_A_R4_0_CONTEXT_MANAGEMENT_PROPOSAL_V1.md`. R5 compiles that document substantially as-written, with two amendments captured in §8 of the V4 card:

1. A §3.X note explicitly excluding fork-the-variant-session as an alternative (per row #28, Gemini OpenClaw fork limit).
2. A §8.X DOC25 ingestion trigger on document attach (per row #27, Gemini).

**Component sections (four listed here, covering rows #230–#233 of the eight rows above):**

- **§A3.13 Variant Preparation Pipeline** (row #230). Three regimes (A: targeted-edit / B: independent / C: near-identical) detected via Jaccard similarity. Three preparation schemas: `VariantDeltaBundle`, `AlignedVariantBundle`, `NearIdenticalVariantBundle`. Pre-extraction stage producing `VariantPreparedForJudgment`. Jaccard thresholds (0.4 / 0.95 cuts) need calibration against Will's actual workflows before baking final values.

- **§A3.14 Judge Context Management** (row #231). Per-call context budget enforcement. Mandatory provider prompt caching when call count > 5. Evidence and shared-input refs (cached, not re-sent). Two-pass scoring for ambiguous dimensions (cheap-first triage → expensive deep evaluation only on ambiguous dimensions). JIT retrieval as design intent.

- **§A3.15 Sub-Agent Dispatch Modes** (row #232). Modes: `single` (existing), `per_variant`, `per_dimension`, `per_pairwise`. Wave batching with `wave_size = floor(maxConcurrent / 2)`. Coordinator-aggregator pattern. Failure handling (timeout per sub-agent, partial-result aggregation). New validation: `validation.judge_subagent_concurrent_saturation` (info, not warning) at >75% of `maxConcurrent`.

- **§A3.16 Dimension Context Requirements** (row #233). Per-dimension `requires_full_output: bool`, `requires_evidence: bool`, `score_from_metadata_only: bool`. Tiered judge models (cheap → expensive on confidence threshold). Subsumes V1 row #106 (score_from_metadata_only).
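
The regime detection in §A3.13 might look roughly like this. The sketch assumes Jaccard similarity is taken over sets of lowercased whitespace-separated tokens and that each cut is inclusive on the upper regime; the proposal may tokenize differently (for example, by shingles or sections). The 0.95 / 0.40 / 0.05 cuts are the draft defaults quoted in this list, the "mixed" case is not modeled, and `uniqueTokens`, `jaccard`, and `classifyRegime` are hypothetical names.

```typescript
// Sketch only. Tokenization and boundary inclusivity are assumptions.
type Regime = "A_targeted_edit" | "B_independent" | "C_near_identical" | "fallback";

// Unique lowercased whitespace-separated tokens, in first-seen order.
function uniqueTokens(text: string): string[] {
  const seen = Object.create(null) as Record<string, boolean>;
  const out: string[] = [];
  for (const t of text.toLowerCase().split(/\s+/)) {
    if (t.length > 0 && seen[t] !== true) {
      seen[t] = true;
      out.push(t);
    }
  }
  return out;
}

// |A ∩ B| / |A ∪ B| over the two token sets; two empty texts count as identical.
function jaccard(a: string, b: string): number {
  const ta = uniqueTokens(a);
  const tb = uniqueTokens(b);
  if (ta.length === 0 && tb.length === 0) return 1;
  const inA = Object.create(null) as Record<string, boolean>;
  for (const t of ta) inA[t] = true;
  let intersection = 0;
  for (const t of tb) if (inA[t] === true) intersection++;
  return intersection / (ta.length + tb.length - intersection);
}

function classifyRegime(similarity: number): Regime {
  if (similarity >= 0.95) return "C_near_identical";
  if (similarity >= 0.4) return "A_targeted_edit";
  if (similarity >= 0.05) return "B_independent";
  return "fallback";
}
```

Keeping the cuts as plain constants in one function makes the later calibration pass a one-line change.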

**Five cross-doc obligations:**

- **OB-A14 (DOC25 → DOC23):** Variant outputs > 8K tokens flow through DOC25 ingestion at collection time; Judge sees IngestionResult, not raw text. Owner: DOC25.
- **OB-A15 (DOC25 schema constraint):** DOC25 IngestionResult schema MUST be designed for incremental Tier 1 (Full) → Tier 2 (Compressed) → Tier 3 (Sparse references) consumption with `retrieve_section`, `retrieve_full`, and `retrieve_metadata` accessor patterns. Card row #238. Owner: DOC25.
- **OB-A16 (DOC72 → DOC23):** Entity card payload retrieval contract; Judge can call `retrieve_node_payload` to expand any card on demand. Owner: DOC72.
- **OB-A17 (OpenClaw → DOC23):** Bump `maxConcurrent` default recommendation to 16; document wave dispatch pattern. Owner: OpenClaw / DOC11.
- **OB-A18 (DOC15 → DOC23):** Per-call context budget enforcement primitive may belong in DOC15 CIL rather than DOC23. Owner: DOC15. Architectural decision deferred to R5.
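
The wave dispatch pattern referenced in §A3.15 and OB-A17 can be sketched as a simple partition of sub-agent jobs. The floor in the wave size follows the §6 calibration table, the 75% check mirrors the saturation validation, and `waveSize`, `toWaves`, and `isSaturated` are hypothetical names; actual dispatch, timeouts, and partial-result aggregation are out of scope here.

```typescript
// Sketch only. Partitioning logic; no real sub-agent dispatch is performed.
function waveSize(maxConcurrent: number): number {
  // Never return less than 1, so partitioning always makes progress.
  return Math.max(1, Math.floor(maxConcurrent / 2));
}

// Splits jobs into consecutive waves of at most `size` jobs each.
function toWaves<T>(jobs: T[], size: number): T[][] {
  const step = Math.max(1, Math.floor(size));
  const waves: T[][] = [];
  for (let i = 0; i < jobs.length; i += step) {
    waves.push(jobs.slice(i, i + step));
  }
  return waves;
}

// Condition for the informational saturation validation: strictly more than 75% in use.
function isSaturated(activeSubAgents: number, maxConcurrent: number): boolean {
  return activeSubAgents > 0.75 * maxConcurrent;
}
```

With `maxConcurrent: 16`, twenty jobs split into waves of 8, 8, and 4, and each wave on its own stays below the 75% saturation line.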

**Source provenance:** Claude follow-up consolidating reviewer findings on context blowup (GPT-P1, CLD V1 Finding 7.1, GRK CRITICAL #1, GEM Part 1 + §1 deeper-pass items #1, #2).

---

### §1.O — Lazy Claim Verification

**Rows:** #234

**What to build:**
- Add `requires_verification: boolean` to UserDefinedClaimType schema (default true).
- Add `skip_high_confidence_claims: boolean` (default true) and `high_confidence_threshold: number` (default 0.9) to VerificationConfig schema.
- Runtime: when scoring with factual_verification method, skip claims where `extraction_confidence ≥ high_confidence_threshold` AND `claim_type.requires_verification === false`.
- Audit: skipped claims are recorded in audit with `verdict: "skipped_high_confidence"` for transparency.
- Expected effect: ~60% verification cost reduction on typical drafting tasks, since case captions, filing dates, and named parties don't need re-verification.
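
The skip rule can be expressed as a single predicate. In this sketch the `skip_high_confidence_claims` flag is assumed to gate the whole rule (the runtime bullet above does not mention it explicitly), the claim carries its type's `requires_verification` value directly, and `"needs_verification"` is a placeholder label, not a spec verdict. `ExtractedClaim`, `LazyVerdict`, and `triageClaim` are hypothetical names.

```typescript
// Sketch only. The gating role of skip_high_confidence_claims is an assumption.
interface VerificationConfig {
  skip_high_confidence_claims: boolean; // spec default true
  high_confidence_threshold: number;    // spec default 0.9
}

interface ExtractedClaim {
  claim_id: string;
  extraction_confidence: number;
  requires_verification: boolean; // copied from the claim's UserDefinedClaimType
}

type LazyVerdict = "skipped_high_confidence" | "needs_verification";

function triageClaim(claim: ExtractedClaim, cfg: VerificationConfig): LazyVerdict {
  const skip =
    cfg.skip_high_confidence_claims &&
    claim.extraction_confidence >= cfg.high_confidence_threshold &&
    claim.requires_verification === false;
  return skip ? "skipped_high_confidence" : "needs_verification";
}
```

Since `requires_verification` defaults to true, nothing is skipped until a claim type is explicitly opted out.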

**Source provenance:** Context Management Proposal V1 §5.6.

---

### §1.P — Hierarchical Scoring for Huge Variants

**Rows:** #235

**What to build:**
- Add `hierarchical_scoring` block to JudgeModuleConfig:
  - `enabled: boolean` (default true)
  - `threshold_tokens: number` (default 30000)
  - `section_weight_strategy: "by_length" | "by_user_importance" | "uniform"`
  - `section_importance_map: Record<section_tag, number> | null`
- Runtime: when variant > threshold_tokens, decompose scoring into per-section calls using AlignedVariantBundle structure (or StorageRef section retrieval). Aggregate to document-level score via weighted average per chosen strategy.
- Composes naturally with §1.N sub-agent fan-out: one sub-agent per (variant × section).
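
A sketch of the document-level aggregation across the three weighting strategies. It assumes `by_length` weights by section token count and that sections missing from `section_importance_map` get weight 1; neither detail is stated above. `SectionScore` and `aggregateSections` are hypothetical names.

```typescript
// Sketch only. Weight definitions per strategy are assumptions.
type SectionWeightStrategy = "by_length" | "by_user_importance" | "uniform";

interface SectionScore {
  section_tag: string;
  token_count: number;
  score: number;
}

// Weighted average of per-section scores; null when there is nothing to aggregate.
function aggregateSections(
  sections: SectionScore[],
  strategy: SectionWeightStrategy,
  importanceMap: Record<string, number> | null
): number | null {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const s of sections) {
    let weight = 1; // "uniform"
    if (strategy === "by_length") {
      weight = s.token_count;
    } else if (strategy === "by_user_importance") {
      const mapped: unknown = importanceMap !== null ? importanceMap[s.section_tag] : undefined;
      weight = typeof mapped === "number" ? mapped : 1; // ASSUMPTION: unmapped sections weigh 1
    }
    weightedSum += weight * s.score;
    totalWeight += weight;
  }
  return totalWeight > 0 ? weightedSum / totalWeight : null;
}
```

For a 1,000-token section scoring 0.5 and a 3,000-token section scoring 0.9, `uniform` gives about 0.7 and `by_length` about 0.8, which shows how much the strategy choice can move the document-level score.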

**Dependencies:** §1.N (Regime B AlignedVariantBundle for section structure; sub-agent dispatch modes for parallel per-section scoring).

**Source provenance:** Context Management Proposal V1 §5.7.

---

## §2 — R6 Work Plan

R6 items are smaller and more discrete. Most do not require R5 prerequisite work.

### §2.A — EvalCampaign as Top-Level Module

**Rows:** #134

**What to build:** New `system.eval_campaign` module type that orchestrates Judge runs over historical task runs (selected via filter), producing aggregate quality reports across many runs. Intended for periodic quality audits, not in-task evaluation.

**Source provenance:** GPT-P1, GRK new idea.

---

### §2.B — CANDOR Headless Integration

**Rows:** #155

**What to build:** Judge module gains optional `invoke_candor_headless: bool` config. When true, dispatches a CANDOR adversarial review (DOC14) instead of running its own scoring pipeline. Useful for high-stakes evaluation that wants CANDOR's adversarial machinery applied to a task module's output.

**Cross-doc obligation:** DOC14 must expose a headless API (currently CANDOR is interactive-only).

**Source provenance:** GEM Part 8 #6.

---

### §2.C — Variant Count > 4 (Batch / Campaign Mode)

**Rows:** #157

**What to build:** Lift the 4-variant UI cap on Experiment for batch/campaign scenarios. New `ExperimentBatchMode` config with `variants[]` of any length, surfaced as a different UX (table, not side-by-side cards). Costs and surface area scale linearly.

**Source provenance:** GPT-P1 Finding M12, CLD V2.

---

### §2.D — Cross-Task Dataset Library

**Rows:** #170

**What to build:** Global `ELNOR_MEMORY/datasets/` directory with promotable trainsets. A "Promote to library" affordance on a task-specific trainset converts it to a cross-task dataset that other Judge modules can import. The cross-task plumbing that already exists in DOC23 R3.1 §9 for module-level wiring extends to dataset-level references.

**Source provenance:** CLD V1 Finding 8.3.

---

## §3 — External documents that compile into R5/R6

These already exist as separate spec documents and should be compiled into R5/R6 of the addenda alongside the row-driven work above.

### §3.A — Context Management Proposal V1

**Document:** `DOC23_ADDENDA_A_R4_0_CONTEXT_MANAGEMENT_PROPOSAL_V1.md`
**Maps to:** §1.N above (rows #226-#233)
**Compilation note:** Substantially as-written, with two amendments per V4 §8 Notes (no fork-the-variant alternative; DOC25 ingestion trigger on document attach).
**Status:** Self-contained spec; ready for R5 compilation pending Jaccard threshold calibration.

### §3.B — DSPy Metric Assembly Contract (Notes 20)

**Document:** Embedded in V4 card §8 Notes 20
**Maps to:** §1.A above (rows #20-#25, #79, #89-#94, #101, #149, #152, #154, #158-#159, #167, #177)
**Compilation note:** Notes 20 provides architectural specification; full schema definitions (JudgeMetricAdapter, JudgeTrainsetExample, EC_Judge_Wrapper.py template, GEPA feedback JSON structure, Pareto multi-objective config) need to be expanded into a standalone §A4 R5 specification before compilation.
**Status:** Architectural; needs schema fleshout for R5 spec compilation.

### §3.C — Promotion Safety Pass (Notes 23)

**Document:** Embedded in V4 card §8 Notes 23
**Maps to:** §1.A above (rows #23, #89, #90, #101, #111, #152)
**Compilation note:** Notes 23 specifies the four-part safety pass (PromotionRequest with conflict detection / PromotionLedgerEntry / post-promotion shadow runs / one-click rollback). Schema definitions need expansion into a standalone §A4.7-§A4.8 R5 specification.
**Status:** Architectural; needs schema fleshout for R5 spec compilation.

---

## §4 — Summary

| R5 functional area | Row count | Major schemas to add | Cross-doc obligations |
|---|---|---|---|
| §1.A DSPy + Optimization | 23 + 2 Notes | JudgeMetricAdapter, JudgeOptimizationConfig, JudgeTrainsetExample, IterationRecord, PromotionLedgerEntry | None (PropA reuses primitives) |
| §1.B Human Eval | 1 | human_review block, HumanReviewRequest, HumanReviewRecord | DOC10, DOC20, DOC21/22 |
| §1.C Code Scorers | 1 | 4 method-specific configs (code/regex/json_schema/tool_call) | None |
| §1.D Online Scoring | 2 | JudgeScorerSpec, replay metadata | None |
| §1.E UX Drill-Down | 3 | (UI work; minimal schema) | None |
| §1.F Per-Dim Context | 3 | EvidenceBundle dedup fields | None |
| §1.G Replicates | 2 | ExperimentReplicationConfig | None |
| §1.H EvaluationRun Full | (foundational) | EvalDataset, EvalExample, EvalTrace | DOC20 stored content |
| §1.I Iteration Ledger | 1 | IterationRecord, instruction_history | None |
| §1.J Claim Clustering | 1 | ClaimCluster | None |
| §1.K Reference Comparison | 1 | reference_comparison method config | None |
| §1.L Pairwise Tournament | 1 | TournamentConfig | None |
| §1.M Preset DSPy Provenance | 1 | ModulePreset metadata fields | None |
| §1.M2 Optimization Routes | 1 | (route contracts) | None |
| §1.N Context Mgmt Proposal V1 | 8 + 1 separate doc | VariantDeltaBundle, AlignedVariantBundle, NearIdenticalVariantBundle, VariantPreparedForJudgment, dispatch mode enum, dimension context requirements | OB-A14 (DOC25), OB-A15 (DOC25 schema), OB-A16 (DOC72), OB-A17 (OpenClaw/DOC11), OB-A18 (DOC15) |
| §1.O Lazy Claim Verification | 1 | UserDefinedClaimType.requires_verification, VerificationConfig.skip_high_confidence_claims | None |
| §1.P Hierarchical Scoring | 1 | JudgeModuleConfig.hierarchical_scoring | None |
| **R5 total** | **51 rows + 1 doc + 2 Notes** | **~32 schemas** | **5 cross-doc obligations** |

| R6 functional area | Row count | Major schemas to add | Cross-doc obligations |
|---|---|---|---|
| §2.A EvalCampaign | 1 | system.eval_campaign module type | None |
| §2.B CANDOR Headless | 1 | invoke_candor_headless config | DOC14 (must expose headless API) |
| §2.C Variant Count > 4 | 1 | ExperimentBatchMode | None |
| §2.D Cross-Task Dataset Library | 1 | global datasets directory | DOC23 §9 cross-task wiring extension |
| **R6 total** | **4 rows** | **~4 schemas** | **2 cross-doc obligations** |

---

## §5 — Dependency graph for R5 sequencing

If R5 work is sequenced, recommended order:

1. **§1.H EvaluationRun Full Substrate** (foundational — §1.A, §1.D, §1.J all depend on it)
2. **§1.A DSPy + Optimization** (largest area; §1.I and §1.M depend on it)
3. **§1.I Iteration Ledger** (needed for §1.A promotion ledger; can build in parallel after §1.A schema settles)
4. **§1.M2 Optimization Routes** (concrete API surface for §1.A)
5. **§1.M Preset DSPy Provenance** (small extension after §1.A completes)
6. **§1.B Human Eval** (independent of §1.A, can build in parallel; depends on DOC10/DOC20/DOC21 cross-doc work)
7. **§1.N Context Management Proposal V1** (independent of optimization; depends on cross-doc work with DOC25, DOC72, OpenClaw, DOC15 — schedule those obligations early)
8. **§1.D Online Scoring** (depends on §1.H and §1.A iteration ledger for time-series anchor points)
9. **§1.E UX Drill-Down** (depends on §1.D + §1.I for time-series + iteration data)
10. **§1.F Per-Dim Context Optimization** (independent; build after §1.N if treated as a subset of it, or as a self-contained item at any time)
11. **§1.G Replicates** (independent; can build any time)
12. **§1.J Claim Clustering** (depends on §1.H)
13. **§1.K Reference Comparison** (depends on §1.A — port reservation in R4.1, full add in R5)
14. **§1.L Pairwise Tournament** (independent; can build any time)
15. **§1.C Code Scorers** (independent; can build any time)

R6 items have no ordering dependencies among themselves and can be built independently when prioritized.

---

## §6 — R5 Calibration Items (proposal §14)

These six items are calibration tasks identified in Context Management Proposal V1 §14. They are NOT spec adjudication items — they are pre-build empirical calibration tasks that should be completed before R5 final compilation, or in parallel with R5 implementation, so the production thresholds reflect Will's actual workflow data rather than draft defaults.

| # | Calibration target | Default | Calibration approach |
|---|---|---|---|
| 1 | **Regime classification thresholds** (§2.5, §14.1) | C/A: 0.95, A/B: 0.40, B/fallback: 0.05 | Run regime classification on ~20–30 real Judge tasks across Will's typical workflow mix. Empirical thresholds should produce ~25% Regime A / ~50% Regime B / ~10% Regime C / ~10% mixed / ~5% fallback. Significant deviation indicates that a threshold needs adjustment. If Regime C is under-classified, drop the C/A cut to 0.92. If Regime A is over-classified, raise the A/B cut to 0.50. |
| 2 | **Pre-extraction threshold** (§14.2) | 8K tokens | Measure pre-extraction cost (~$0.005–0.02 per variant) vs scoring cost reduction. If small variants (4–8K) consistently benefit, lower threshold to 4K. If pre-extraction cost approaches scoring savings on small variants, raise to 12K. |
| 3 | **Wave size** (§14.3) | floor(maxConcurrent / 2) = 8 with maxConcurrent: 16 | Under typical workload, measure if wave_size 8 saturates network/proxy. If queues form, lower to 6. If wall-clock dominated by wait-for-wave-completion, raise to 10. |
| 4 | **Two-pass confidence threshold** (§14.4) | 0.7 | Track Pass 1 confidence distribution over real runs. If too many escalations (> 50% of dimensions), lower to 0.6. If escalations rare AND Pass 1 errors common (> 5% disagreement with Pass 2), raise to 0.8. |
| 5 | **Sub-agent timeout** (§14.5) | 120s | Measure p95 sub-agent duration in production. If p95 > 100s, raise timeout to 180s. If p95 < 60s, lower to 90s for faster failure detection. |
| 6 | **Hierarchical scoring threshold** (§14.6) | 30K tokens | Track scoring consistency between hierarchical and whole-doc modes for same inputs at various sizes. If hierarchical and whole-doc agree well below 50K, raise threshold. If hierarchical produces meaningfully different (and arguably better) scores starting at 20K, lower threshold. |
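
As an example of turning one table row into a repeatable check, the sub-agent timeout rule (item 5) could be sketched as follows. The nearest-rank p95 method is an assumption (the proposal does not say how p95 is computed), and `percentileNearestRank` and `recommendTimeoutSeconds` are hypothetical names; the 100s / 60s triggers and the 180s / 90s / 120s values are from the table.

```typescript
// Sketch only. Percentile method is an assumption; thresholds come from item 5.
function percentileNearestRank(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  const value = sorted[rank - 1];
  return value === undefined ? null : value;
}

// Applies the item 5 rule to observed sub-agent durations (in seconds).
function recommendTimeoutSeconds(durationsSec: number[], currentTimeoutSec: number = 120): number {
  const p95 = percentileNearestRank(durationsSec, 95);
  if (p95 === null) return currentTimeoutSec; // no data: keep the default
  if (p95 > 100) return 180;
  if (p95 < 60) return 90;
  return currentTimeoutSec;
}
```

The other rows follow the same shape: measure one statistic over real runs, compare it against the stated triggers, and either keep or replace the default.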

**When to do these:**
- **Items 1, 2:** Should run BEFORE R5 ships. They affect default behavior at first-run; getting them wrong means initial Judge runs misclassify regimes or skip beneficial pre-extraction.
- **Items 3, 5:** Can run AFTER R5 ships. They tune for performance; defaults are safe.
- **Items 4, 6:** Should run shortly after R5 ships, on the first 30-50 real Judge runs. Defaults are safe but suboptimal until empirically tuned.

**Tracking:** When each calibration completes, the empirically-tuned value replaces the default in the relevant R5 spec section. Configuration fields remain user-overridable; the default is what changes.

---

End of deferral list V2.