Design a Vibe-Coding Multi-Agent System

prompt Deep Research

Ask a deep-research agent to design a truly autonomous multi-agent vibe-coding system — orchestrator + specialists, phase gates, mid-flight mutation handling, eval-first methodology, provider abstraction — patterned after the best-of-2026 leaders, with a reference skeleton.

Works with

Kimi K2 (multi-agent)
Manus
ChatGPT Deep Research
Claude (web)
Gemini Deep Research

Version: 1.0
Updated: 2026-05-06
Runtime: 90–180 min
Est. cost: $30–80
Output: 12,000–20,000 words

Fill these placeholders first

Placeholder	What to put
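`{{SYSTEM_NAME}}`	Working name for the autonomous system (e.g., “VibeForge”)
`{{SYSTEM_NAME_LOWER}}`	Lowercase form of `{{SYSTEM_NAME}}`, used in the prompt for the CLI command and directory names (e.g., "vibeforge")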
`{{TARGET_APP_DESCRIPTION}}`	The app the system will build first — 1 paragraph + stack
`{{TARGET_APP_PHASES}}`	Number of phases + their names (e.g., "12 phases: 0=POC, 1=foundation, …")
`{{LLM_PROVIDERS_AVAILABLE}}`	Which LLMs the user has (e.g., "Claude Sonnet 4.6, Opus 4.7, GPT-5, Kimi K2, Gemini 2.5 Pro")
`{{HARD_CONSTRAINTS}}`	Architectural commitments the OUTPUT code must respect (modular / theme-tokenized / mobile-first / vertical-clonable…)
`{{TODAY_DATE}}`	Today's ISO date

Fill the placeholders above, then copy the whole block below into your target agent. It produces a production-grade design (≈6–9 agents, not 17) with a state machine, eval-first methodology, cost controls, a provider-abstraction layer, and a reference implementation skeleton.
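
Filling a prompt of this length by hand makes it easy to leave a `{{...}}` token behind. The following is a minimal Python sketch of one way to do the substitution, not part of the original page: the function name `fill_prompt` and the demo template are hypothetical. The prompt also uses `{{SYSTEM_NAME_LOWER}}`; the sketch assumes it is simply `{{SYSTEM_NAME}}` lowercased when no explicit value is given.

```python
import re

# Matches tokens such as {{SYSTEM_NAME}} and captures the name inside the braces.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")

def fill_prompt(template: str, values: dict[str, str]) -> str:
    """Replace every {{NAME}} token; raise if any placeholder has no value."""
    values = dict(values)  # copy so the caller's dict is not modified
    # Assumption: the lowercase variant is just SYSTEM_NAME lowercased.
    if "SYSTEM_NAME" in values:
        values.setdefault("SYSTEM_NAME_LOWER", values["SYSTEM_NAME"].lower())
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise KeyError(f"unfilled placeholders: {', '.join(missing)}")
    # A function replacement avoids backslash escapes being interpreted in values.
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)

if __name__ == "__main__":
    demo = 'Design "{{SYSTEM_NAME}}"; CLI: `{{SYSTEM_NAME_LOWER}} init`. Date: {{TODAY_DATE}}.'
    print(fill_prompt(demo, {"SYSTEM_NAME": "VibeForge", "TODAY_DATE": "2026-05-06"}))
```

Raising on a missing value, rather than silently leaving the token in place, keeps a half-filled prompt from reaching the target agent.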

The prompt

You are a senior staff engineer + AI-systems architect who has built or deeply
studied: Anthropic Claude Code, Cognition Devin, Cursor Agent, Cline, Aider,
Continue, Lovable, Bolt.new, Vercel v0, Replit Agent, Emergent.sh, Lindy,
MultiOn, AutoGen, LangGraph, CrewAI, Agno, Smolagents, Bee, OpenAI Swarm,
Mastra, Pydantic AI. Design a production-grade truly autonomous multi-agent
vibe-coding system ("{{SYSTEM_NAME}}") that a solo founder can use to ship a
complex SaaS app over 12-16 weeks with minimal human intervention.

Core problem to solve: today's AI coding agents drift on long tasks, lose
context past 200K tokens, hallucinate file paths, fight the framework, miss
validation, and produce shallow plans. Some products (Emergent.sh) produce
magical output: diagnose why and how to replicate it. Single-agent setups
(just Cursor, just Claude Code) fail on tasks longer than ~3 days.

User wants {{SYSTEM_NAME}} to:
1. Run autonomously for hours/days while user is away.
2. Have clean phase gates with brief 5-min human checkpoints.
3. Smoothly absorb mid-flight requirement mutations (user adds Slack-style
   comment; system updates remaining phases without breaking in-progress work).
4. Produce code that's modular, scalable, easy to extend later.
5. Test itself before declaring done.
6. Report progress so user can check in at random.
7. Work with any LLM the user has: {{LLM_PROVIDERS_AVAILABLE}}.
8. Self-hosted or runnable locally — not vendor-locked.

{{SYSTEM_NAME}}'s first job: build the following app —

{{TARGET_APP_DESCRIPTION}}

Phase plan: {{TARGET_APP_PHASES}}

NON-NEGOTIABLE constraints on {{SYSTEM_NAME}} itself + the code it produces:

{{HARD_CONSTRAINTS}}

Plus universal: provider-agnostic LLM layer, local-first runnable, cost-aware
(token + budget cap + auto-pause), resumable (crash → restart with no loss),
audit-trailed (every agent decision logged + replayable), modular spec output
(provider abstraction, theme tokens, no hardcoded values, mobile-first,
vertical-clonable), eval-first (every phase has eval BEFORE code), constitution
+ safeguards (no destructive ops, no skipped tests, no fabricated sources).
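
Illustration only, not a required design: a minimal Python sketch of the
"budget cap + auto-pause" constraint above. Every name here (`BudgetGuard`,
`record`, the per-million-token price parameters) is hypothetical, and the
caller supplies prices; none are real provider rates. Python is used purely
for illustration and does not prejudge the language choice in §D.

```python
from dataclasses import dataclass, field

@dataclass
class BudgetGuard:
    """Tracks spend per agent and flags a pause once the daily cap is reached."""
    daily_cap_usd: float
    spent_by_agent: dict[str, float] = field(default_factory=dict)
    paused: bool = False

    @property
    def total_spent(self) -> float:
        return sum(self.spent_by_agent.values())

    def record(self, agent: str, input_tokens: int, output_tokens: int,
               usd_per_m_input: float, usd_per_m_output: float) -> bool:
        """Add one LLM call's cost; return True if the run may continue."""
        cost = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
        self.spent_by_agent[agent] = self.spent_by_agent.get(agent, 0.0) + cost
        if self.total_spent >= self.daily_cap_usd:
            # Assumption: the orchestrator checks this flag, persists state,
            # and stops scheduling new work until a human resumes the run.
            self.paused = True
        return not self.paused
```

Keeping the per-agent breakdown in the same object means the cost report and
the pause decision read from one source of truth.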

— RESEARCH TASKS —

§A. Reverse-engineer existing systems (architecture + agent topology +
prompting strategy + failure modes; cite primary sources): Claude Code, Devin,
Cursor Agent, Cline, Aider, Lovable, Bolt, v0, Replit Agent, Emergent.sh,
AutoGen, LangGraph, CrewAI, OpenAI Swarm, Anthropic "Building Effective
Agents" + "Code with Claude" patterns, Mastra, Pydantic AI.

§B. Identify failure modes (cite real reports + propose {{SYSTEM_NAME}}
mitigation): context-window drift on long tasks / hallucinated paths / test
skipping / plan drift / tool-call loops / premature completion / cost runaway
/ vendor lock-in / bad debugging / mid-flight pivots breaking work.

§C. Design {{SYSTEM_NAME}} — complete spec:

  1. Agent topology (suggest 6-9 agents, NOT 17 — most failures come from
     over-decomposition). Baseline: Conductor (orchestrator, state machine,
     human checkpoints), Planner (refines phase to tickets, deep-research on
     stack), Architect (ADRs + file structure + eval specs BEFORE coder),
     Coder (multi-file edits, respects conventions), Tester (writes+runs tests,
     reports), Reviewer (code review against constitution), Debugger (isolated
     context when tests fail, avoids corrupting Coder state), Mutator (handles
     mid-flight changes, diffs PLAN against new ask), Reporter (status
     summaries, human-checkpoint reports). For each: responsibilities / inputs
     / outputs / tools / token budget / provider routing / constitution rules.

  2. State machine (NOT_STARTED → PLANNING → ARCHITECTING → CODING → TESTING
     → REVIEWING → COMPLETED; on failure, DEBUGGING → CODING, at most N
     loops). State persistence, resumability, checkpoint format, interruption
     handling.

  3. Eval-first methodology: Architect produces eval BEFORE Coder writes code;
     user can review/edit eval; Coder makes eval pass; Tester runs evals and
     reports; Reviewer checks quality even when evals pass. Pick eval format
     (Vitest/Jest/Pytest/custom DSL) with rationale.

  4. Mid-flight mutation handling: user comments → Mutator classifies the
     change: small → updates remaining phases; medium → proposes PLAN diff
     for human approval; large → regenerates PLAN. Specify protocol
     completely.

  5. Human-checkpoint protocol: Reporter summary + delta-from-plan +
     screenshots/diffs; commands approve|feedback|pause|rewrite|skip; auto-
     pause if no approval in N hours.

  6. Cost + observability: token usage per agent + per phase, daily budget
     cap auto-pause, dashboard, OpenTelemetry traces, replay mode.

  7. Provider abstraction: providers/llm/{interface,anthropic,openai,google,
     kimi,router,fallback}.ts. Specify LLMProvider interface (tool-use,
     streaming, JSON-mode, structured-outputs, context windows, pricing).

  8. Constitution (10 hard rules): no destructive ops without confirmation /
     no skipping tests / no hardcoded secrets / no fabricated paths / no
     marking complete without verification / no context-bombing / no single-
     shot mega-prompts / no phase-N+1 if N didn't pass / no ignoring mid-
     flight comments / no re-running failed evals > M times.

  9. Code-quality output guarantees. The produced CODE always satisfies:
     modular folder structure / provider-abstracted vendors / theme-tokenized
     (CSS custom properties) / mobile-first / light+dark mode / 3-accent
     palette + SuperAdmin overrides / vertical-clonable (CodeCanyon-style) /
     no hardcoded values / WCAG 2.2 AA / tested / documented (docstrings +
     README + ADRs). HOW does {{SYSTEM_NAME}} enforce these (linters / agent
     rules / code-review checks)?

  10. Security + safety: API keys encrypted (Fernet/KMS), tenant isolation
      guards, BYOK enterprise, rate limiting, audit log integrity (append-only
      + hash-chained), hard rules against secret-leaking code.
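
Illustration only, not a required design: the state machine in item 2 could
be sketched in Python as a transition table plus a bounded debug loop. The
names (`Phase`, `PhaseRun`, `advance`, `max_debug_loops`) are hypothetical.
Two behaviors are assumptions not stated above: a failed review also enters
the debug loop, and an extra `BLOCKED` state is used once the loop limit N is
exhausted. Critique or replace this sketch as the evidence warrants.

```python
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):
    NOT_STARTED = "NOT_STARTED"
    PLANNING = "PLANNING"
    ARCHITECTING = "ARCHITECTING"
    CODING = "CODING"
    TESTING = "TESTING"
    REVIEWING = "REVIEWING"
    DEBUGGING = "DEBUGGING"
    COMPLETED = "COMPLETED"
    BLOCKED = "BLOCKED"  # assumption: not in the spec; debug loop exhausted

# Successor of each non-terminal state when its step succeeds.
NEXT_ON_PASS = {
    Phase.NOT_STARTED: Phase.PLANNING,
    Phase.PLANNING: Phase.ARCHITECTING,
    Phase.ARCHITECTING: Phase.CODING,
    Phase.CODING: Phase.TESTING,
    Phase.TESTING: Phase.REVIEWING,
    Phase.REVIEWING: Phase.COMPLETED,
    Phase.DEBUGGING: Phase.CODING,
}

@dataclass
class PhaseRun:
    state: Phase = Phase.NOT_STARTED
    debug_loops: int = 0
    max_debug_loops: int = 3  # the "N" in item 2; 3 is an arbitrary example

    def advance(self, passed: bool = True) -> Phase:
        """Move to the next state. `passed` is only consulted in TESTING and
        REVIEWING, where a failure enters the bounded debug loop."""
        if self.state in (Phase.COMPLETED, Phase.BLOCKED):
            return self.state  # terminal states do not change
        if not passed and self.state in (Phase.TESTING, Phase.REVIEWING):
            self.debug_loops += 1
            within_limit = self.debug_loops <= self.max_debug_loops
            self.state = Phase.DEBUGGING if within_limit else Phase.BLOCKED
        else:
            self.state = NEXT_ON_PASS[self.state]
        return self.state
```

Because `PhaseRun` holds only an enum and two integers, it serializes
directly into a checkpoint file, which is what makes resumability cheap.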

§D. Reference implementation skeleton (TypeScript, Python, or Rust: pick
ONE with rationale; don't offer a menu): {{SYSTEM_NAME_LOWER}}/{bin,src/conductor,src/
agents/{planner,architect,coder,tester,reviewer,debugger,mutator,reporter},src/
providers/{llm,storage,git,shell,test-runner},src/state,src/eval,src/
constitution,src/observability,src/checkpoint,src/cli}, .{{SYSTEM_NAME_LOWER}}/
{state.json,conventions.md,checkpoints,audit-log.jsonl,reports}, docs/
{ARCHITECTURE,CONSTITUTION,PROTOCOL}.
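
Illustration only, not a required design: one way the append-only,
hash-chained `audit-log.jsonl` could work, sketched in Python with an
in-memory list standing in for the file (one entry per JSONL line). The
function names, the entry fields `prev`/`event`/`hash`, and the all-zero
genesis hash are assumptions made for this sketch.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumption: "previous hash" used for the first entry

def _digest(prev_hash: str, event: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the previous hash and event."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_entry(log: list[dict], event: dict) -> dict:
    """Append an event whose hash also covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry

def verify(log: list[dict]) -> bool:
    """Recompute the chain. An edited or reordered entry, or one removed from
    the middle, is detected; truncating the tail is not."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Detecting tail truncation would need something outside the log itself, such
as periodically recording the latest hash elsewhere; that is left open here.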

§E. Trade-off analysis (5 key decisions with recommendation): single-agent-w/-
skills vs multi-agent topology / sync orchestrator vs async event-driven
(Temporal/Inngest/custom queue) / local-first vs cloud-orchestrator (Vercel+
Inngest) / eval-first (TDD-style) vs review-after / sub-agent context
isolation vs shared working memory.

§F. Operationalization (CLI commands + example session transcripts for:
initial run / mid-flight comment / phase failure+debug / completion).
Commands: `{{SYSTEM_NAME_LOWER}} init`, `run`, `pause`, `resume`, `comment`,
`replay`, `cost`, `status`.

§G. Open questions requiring further research / spike testing / human
judgment.

— OUTPUT FORMAT —

Single markdown spec ~12-20K words. Mermaid diagrams + tables + code snippets
where they help. Numbered citations [1], [2], ... + reference list at end.
Date the spec ({{TODAY_DATE}}).

— DON'T —

Don't propose 17 agents. Don't recommend tools without justification. Don't
repeat existing system descriptions; analyze them. Don't fabricate; mark
anything you cannot verify as (unverified). Don't pad.

— DO —

Be the senior architect who has shipped autonomous-agent systems in production.
Cite primary sources (repos, official docs, papers, talks). Show example
traces / transcripts / pseudocode. Push back on assumptions if current
evidence suggests they're wrong.

Begin research now.

deep-research
multi-agent
autonomous-coding
architecture
agents