ELNOR REPO READER TEXT MIRROR
Original path: Active Working and Red Team/DOC23 Working/DOC23 Red Teaming/DOC23_REPROMPT_RED_TEAM_PROMPT.md
Source repo: /Users/OpenClaw1/Elnor/Elnor Specs
Git branch: main
Git commit: dbaa25962edc11ab30e8d4ca1715f9ae5bf77331
Generated: 2026-06-09T01:23:58.539Z

---

# DOC23 Re-Prompt System Addendum — Red Team Review Prompt

**Document under review:** DOC23 Red Line — Sequential Re-Prompt System (514 lines)

**Context documents to review alongside:**
- DOC23 Task System: Modular Automation Architecture R3.1 (the parent spec — re-prompt modifies §3.2.1, §3.2.8, §6.5.2, §6.5.8)
- DOC23 Addenda A: Task Optimization R4.1 V6 (references this addendum via R183 extraction_scope and R184 reprompt_version)

**What this addendum does:** Adds sequential re-prompts to agent_task and coding modules. A re-prompt is a follow-up prompt dispatched in the same session after the prior response. Replaces the underspecified `iteration_limit` completion mode and subsumes `tests_pass` on the coding module (the agent executes tests itself when instructed).

---

## Part 1: General Assessment

1. **Is the schema complete?** Does `re_prompts: string[]` define everything the coding agent needs, or are there missing fields? Consider: should re-prompts have IDs (for experiment variant override targeting)? Should they support attachments? Should there be a label/description field? Is `string[]` too minimal?

2. **Is the execution flow precise enough to implement?** Walk through §3.1 step by step. Are there ambiguities? What would a coding agent have to guess? Focus on: when exactly does each re-prompt dispatch, what content goes in the turn, what happens if the agent doesn't respond, what's the timeout behavior.

3. **Is the output handling complete?** The addendum defines merged output on `data_out` and optional per-response ports via `separate_outputs`. Are there edge cases? What happens with 0 re-prompts (should be identical to current behavior — verify this).
What happens with 10 re-prompts and a large primary response — context window overflow?

4. **Migration from tests_pass.** The addendum converts EC-orchestrated test loops to agent-directed testing ("Run the tests and fix failures" as a text re-prompt). Is this safe? Would an agent reliably self-test, detect failures, and iterate — as reliably as EC running a shell command and checking the exit code? What's lost in the migration? What validation or fallback should exist?

5. **Migration from iteration_limit.** The addendum says iteration_limit was underspecified and replaces it with explicit re-prompts. Is there any use case that iteration_limit served that re-prompts don't cover? Are there existing saved tasks with iteration_limit that need migration handling?

---

## Part 2: Output Routing Deep Dive

6. **Current output model.** The addendum defines two modes:
   - Default: merged output on `data_out` (all responses concatenated)
   - Optional: per-response ports (`original_out`, `reprompt_1_out`, etc.) when `separate_outputs: true`

   **Question:** Should there ALSO be a dedicated `final_out` port that always emits only the last response? Use case: the user wants the final refined version routed to the next module, but also wants the full merged document (including drafts and reviews) archived or sent to a judge. Currently the user would wire `reprompt_N_out` to the next module, but the port name changes if they add or remove re-prompts. A stable `final_out` port that always emits the last response regardless of re-prompt count would be more robust.

7. **data_out content when separate_outputs is enabled.** Does `data_out` still emit the merged version when separate_outputs is true? The addendum says yes — both are available simultaneously. Verify this is consistent across all execution paths. What about named outputs — the addendum says they extract from the LAST response only. If `separate_outputs` is also enabled, can a named output port AND `reprompt_2_out` both be wired?
Do they emit the same content?

8. **Per-response ports on coding modules.** The addendum says `separate_outputs` is NOT available on step.coding because coding outputs are structured (diff, files, summary). Is this the right call? Could there be value in separating the implementation summary from the review summary from the fix summary?

---

## Part 3: Context Carry-Through and Chain History

This is the critical integration question. DOC23's context management system controls what gets carried forward from module to module via chain history. Re-prompts create a new wrinkle: the module produces potentially multiple responses in one activation, and downstream modules need to know what to carry forward.

9. **Chain history entry.** The addendum says the module produces ONE chain history entry: the merged `data_out`. Downstream modules see one predecessor. Is this always correct? If `separate_outputs` is enabled and three different downstream modules receive three different response ports, do they all see the same chain history entry (the merged version)? Or does each see only its respective response? If the former, the chain history is misleading (it contains content the downstream module didn't receive on its data_in). If the latter, different downstream modules have different views of the same predecessor, which could cause inconsistencies.

10. **Context pinning.** DOC23 §3.0.2 allows pinning specific chain history entries so they persist across the graph. If a re-prompted module's merged output is pinned, all downstream modules see the full merged content (drafts + reviews + final). Is this desirable? Should only the final response be pinnable? Should the pin granularity be per-response when `separate_outputs` is enabled?

11. **Context carry-through settings.** DOC23 allows configuring what context carries through: full output, summary only, or nothing. How does this interact with re-prompts?
If the setting is "summary only," does EC summarize the merged output or only the final response? Summarizing the merged output (which includes draft iterations and self-review) produces a different summary than summarizing only the final refined version. Which is correct?

12. **Session continuity interaction (Addenda A §A9).** The addendum says session continuity is unaffected — the module opens one session, runs all re-prompts, attaches the session key. But consider: if `session_persist: true` and the module re-activates (revision loop), the re-prompts execute AGAIN in the continued session. The agent now sees: original instruction → response 1 → re-prompt 1 → response 2 → re-prompt 2 → response 3 → [revision feedback] → original instruction again → response 4 → re-prompt 1 again → response 5 → re-prompt 2 again → response 6. The re-prompts replay with the same text but in a context that already includes prior iterations. Is this the right behavior? Should re-prompts be skipped on re-activation? Should the re-prompt text be modified for re-activation context (e.g., "This is a revision pass — review your revisions, not the original")?

---

## Part 4: Integration with Addenda A

13. **R184 — reprompt_version in experiment variants.** Addenda A V6 references `reprompt_version: string | null` on ExperimentVariantOverrides. But the re-prompt addendum defines `re_prompts: string[]` with no version field. How does the experiment know which version of the re-prompts a variant used? How does `reprompt_version` resolve? The addendum needs a versioning scheme or R184 needs to reference the re-prompt array directly instead of by version.

14. **R183 — extraction_scope on the Claim Extractor.** Addenda A V6 adds `extraction_scope: "full_input" | "last_section_only"` to handle merged re-prompt output. The re-prompt addendum defines the separator format (`───── Re-prompt 1 ─────`). Is this separator robust enough for the extractor to detect?
What if the agent's response itself contains a similar-looking separator? Should the separator be a machine-readable marker (e.g., ``) rather than a human-readable line?

15. **Experiment variant override of re-prompts.** If Variant A has re-prompts ["review for security", "fix issues"] and Variant B has re-prompts ["review for performance", "fix issues"], the experiment is testing different review strategies. The addendum says re-prompts are unchanged unless `config_overrides` is used. V6 says `reprompt_version` is the variant-level attribute. Which is the operative contract? How does the variant editor surface re-prompt editing?

---

## Part 5: Coding Module Specifics

16. **Per-agent re-prompts in multi-agent modes.** The addendum defines re-prompts per `CodingAgentAssignment`. In review_chain mode: Agent A implements (with re-prompts) → Agent B reviews (with re-prompts) → feedback to A → A revises (with re-prompts again). Walk through this flow in detail. Does Agent A's second pass (after revision feedback) re-execute the same re-prompts? Should it? The revision feedback is different context than the original implementation — the re-prompt "review your code" might not make sense after a revision pass where the code was already reviewed.

17. **ACP session state across re-prompts.** The addendum says the agent retains full workspace state across re-prompts. But what about ACP tool state? If the agent installed a package in its primary work, is the package still available during re-prompt 1? If the agent created files, are they visible? This should be explicitly confirmed for ACP sessions (it's likely yes, but the spec should state it).

18. **Coding output ports with re-prompts.** The coding module emits: Summary, Diff, Files, Test Results, Receipt. After re-prompts, the Diff reflects the FINAL workspace state. But the Summary — does it cover only the primary work, or all work including re-prompt-driven changes?
If the summary covers everything, it might be very long. If it covers only the final state, the re-prompt work (review findings, fixes applied) is lost from the summary. What's the right behavior?

---

## Part 6: Edge Cases and Failure Modes

19. **Re-prompt that changes the agent's mind.** The agent produces Analysis A in response 1. Re-prompt says "reconsider." Agent produces Analysis B in response 2 that contradicts Analysis A. The merged output contains both. If `data_out` goes to a judge, the judge sees contradictory content. If only the final response matters, why is the merged output the default rather than `final_out`?

20. **Agent ignores re-prompt.** The agent responds to re-prompt "review your work" with "I've reviewed it and it looks fine, no changes needed." The re-prompt produced no useful additional content. The merged output has a useless section. Is this a problem? Should there be a "skip if no substantive change" option?

21. **Context window overflow.** 10 re-prompts with a large primary instruction, data_in, context_in, and chain history. Each re-prompt response adds to the session. By re-prompt 7, the context window might be full. What happens? Does the agent start losing early context? Does EC detect this and warn? Should there be a pre-flight check?

22. **Cost governance.** 10 re-prompts means 11 LLM calls per module activation. With a strong model (Opus), this could be expensive. Should re-prompt count factor into cost previews? Should there be a cost warning when re-prompt count × estimated tokens exceeds a threshold?

---

## Part 7: Competitive Research

23. **How do other platforms handle multi-turn agent tasks?** Research: Cursor (multi-step coding with review), Devin (iterative coding agent), Claude Code (tool-use loops), Windsurf (multi-step with checkpoints). Do any of these have a re-prompt-like pattern? How do they handle output from intermediate steps? What can we learn?

24. **Agent self-review patterns.** Research: "Reflexion" prompting (Shinn et al. 2023), "Self-Refine" (Madaan et al. 2023), Constitutional AI self-critique. How do these compare to our re-prompt approach? Are there structured self-review patterns that produce better results than free-text re-prompts? Should we recommend specific re-prompt patterns in documentation?

---

## Deliverable

Provide your review as a structured document with:
- A letter grade (A through F) for each of the 7 parts
- Specific findings organized by severity: CRITICAL (blocks implementation), HIGH (significant issue), MEDIUM (should address), LOW (nice to have)
- For each finding: what the issue is, where in the addendum it occurs (section reference), and your recommended fix
- A summary of the top 5 changes to make before the addendum is compiled into DOC23 R3.2
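
For reviewers who want to make the separator-collision concern in question 14 concrete, the following is a minimal sketch, not the addendum's implementation. It assumes the merged `data_out` is the primary response followed by one separator line and one response per re-prompt, which the addendum may define differently. All function names are hypothetical, and the nonce-based marker format is an illustrative alternative invented for this sketch, not something taken from the addendum or Addenda A.

```python
import re

# Human-readable separator format quoted in question 14.
SEPARATOR = "───── Re-prompt {n} ─────"
SEPARATOR_RE = re.compile(r"^───── Re-prompt (\d+) ─────$", re.MULTILINE)


def merge_responses(responses: list[str]) -> str:
    """Join the primary response and re-prompt responses (assumed merged layout)."""
    parts = [responses[0]]
    for n, text in enumerate(responses[1:], start=1):
        parts.append(SEPARATOR.format(n=n))
        parts.append(text)
    return "\n".join(parts)


def last_section(merged: str) -> str:
    """Naive `last_section_only`: keep whatever follows the final separator line."""
    matches = list(SEPARATOR_RE.finditer(merged))
    if not matches:
        return merged
    return merged[matches[-1].end():].lstrip("\n")


def merge_with_nonce(responses: list[str], nonce: str) -> str:
    """Hypothetical alternative: boundary lines carry a per-activation nonce."""
    parts = [responses[0]]
    for n, text in enumerate(responses[1:], start=1):
        parts.append(f"[[reprompt-boundary {nonce} {n}]]")
        parts.append(text)
    return "\n".join(parts)


def last_section_with_nonce(merged: str, nonce: str) -> str:
    """Split only on boundary lines that carry the expected nonce."""
    pattern = re.compile(
        rf"^\[\[reprompt-boundary {re.escape(nonce)} (\d+)\]\]$", re.MULTILINE
    )
    matches = list(pattern.finditer(merged))
    if not matches:
        return merged
    return merged[matches[-1].end():].lstrip("\n")
```

If the final response itself contains a line shaped like the human-readable separator, `last_section` returns only the text after that line, so the extractor would silently drop part of the response. In this sketch the nonce variant avoids that because the agent cannot reproduce a boundary it has never seen. That holds only if the nonce is generated per activation (for example with `secrets.token_hex`) and is never shown to the agent.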