Design a Vibe-Coding Multi-Agent System
prompt Deep Research
Ask a deep-research agent to design a truly autonomous multi-agent vibe-coding system — orchestrator + specialists, phase gates, mid-flight mutation handling, eval-first methodology, provider abstraction — patterned after the best-of-2026 leaders, with a reference skeleton.
Works with
- Kimi K2 (multi-agent)
- Manus
- ChatGPT Deep Research
- Claude (web)
- Gemini Deep Research
Fill these placeholders first
| Placeholder | What to put |
|---|---|
| {{SYSTEM_NAME}} | Working name for the autonomous system (e.g., “VibeForge”) |
| {{TARGET_APP_DESCRIPTION}} | The app the system will build first: 1 paragraph + stack |
| {{TARGET_APP_PHASES}} | Number of phases + their names (e.g., "12 phases: 0=POC, 1=foundation, …") |
| {{LLM_PROVIDERS_AVAILABLE}} | Which LLMs the user has (e.g., "Claude Sonnet 4.6, Opus 4.7, GPT-5, Kimi K2, Gemini 2.5 Pro") |
| {{HARD_CONSTRAINTS}} | Architectural commitments the OUTPUT code must respect (modular / theme-tokenized / mobile-first / vertical-clonable…) |
| {{TODAY_DATE}} | Today's ISO date |
Fill the placeholders above, then copy the whole block below into your target agent. It produces a production-grade design (≈6–9 agents, not 17) with a state machine, eval-first methodology, cost controls, a provider-abstraction layer, and a reference implementation skeleton.
The prompt
You are a senior staff engineer + AI-systems architect who has built or deeply studied: Anthropic Claude Code, Cognition Devin, Cursor Agent, Cline, Aider, Continue, Lovable, Bolt.new, Vercel v0, Replit Agent, Emergent.sh, Lindy, MultiOn, AutoGen, LangGraph, CrewAI, Agno, Smolagents, Bee, OpenAI Swarm, Mastra, Pydantic AI. Design a production-grade, truly autonomous multi-agent vibe-coding system ("{{SYSTEM_NAME}}") that a solo founder can use to ship a complex SaaS app over 12-16 weeks with minimal human intervention.
Core problem to solve: today's AI coding agents drift on long tasks, lose context past 200K tokens, hallucinate file paths, fight the framework, miss validation, produce shallow plans. Some products produce magical output (Emergent.sh): diagnose why + replicate. Single-agent setups (just Cursor, just Claude Code) fail on tasks > ~3 days.
User wants {{SYSTEM_NAME}} to:
1. Run autonomously for hours/days while user is away.
2. Have clean phase gates with brief 5-min human checkpoints.
3. Smoothly absorb mid-flight requirement mutations (user adds Slack-style comment; system updates remaining phases without breaking in-progress work).
4. Produce code that's modular, scalable, easy to extend later.
5. Test itself before declaring done.
6. Report progress so user can check in at random.
7. Work with any LLM the user has: {{LLM_PROVIDERS_AVAILABLE}}.
8. Self-hosted or runnable locally, not vendor-locked.
{{SYSTEM_NAME}}'s first job: build the following app —
{{TARGET_APP_DESCRIPTION}}
Phase plan: {{TARGET_APP_PHASES}}
NON-NEGOTIABLE constraints on {{SYSTEM_NAME}} itself + the code it produces:
{{HARD_CONSTRAINTS}}
Plus universal: provider-agnostic LLM layer, local-first runnable, cost-aware (token + budget cap + auto-pause), resumable (crash → restart with no loss), audit-trailed (every agent decision logged + replayable), modular spec output (provider abstraction, theme tokens, no hardcoded values, mobile-first, vertical-clonable), eval-first (every phase has eval BEFORE code), constitution + safeguards (no destructive ops, no skipped tests, no fabricated sources).
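For illustration only, the cost-aware requirement (token tracking + budget cap + auto-pause) could look roughly like the sketch below. `BudgetGuard`, `UsageRecord`, and the figures used are hypothetical names and numbers invented for this example; they do not come from any existing library, and the final design may differ.

```typescript
// Illustrative sketch only. BudgetGuard and UsageRecord are hypothetical names.

interface UsageRecord {
  agent: string;        // e.g. "coder", "tester"
  phase: number;        // phase index from the phase plan
  inputTokens: number;
  outputTokens: number;
  costUsd: number;      // computed by the caller from provider pricing
}

type BudgetStatus = "ok" | "paused";

class BudgetGuard {
  private spentUsd = 0;
  private readonly records: UsageRecord[] = [];
  private readonly dailyCapUsd: number;

  constructor(dailyCapUsd: number) {
    this.dailyCapUsd = dailyCapUsd;
  }

  // Record one LLM call and report whether the run may continue.
  record(usage: UsageRecord): BudgetStatus {
    this.records.push(usage);
    this.spentUsd += usage.costUsd;
    return this.status();
  }

  // "paused" once cumulative spend reaches the cap.
  status(): BudgetStatus {
    return this.spentUsd >= this.dailyCapUsd ? "paused" : "ok";
  }

  // Total spend grouped by agent, e.g. for a cost report.
  costByAgent(): Record<string, number> {
    const totals: Record<string, number> = {};
    for (const r of this.records) {
      totals[r.agent] = (totals[r.agent] ?? 0) + r.costUsd;
    }
    return totals;
  }
}
```

In this sketch the orchestrator would call `record` after every LLM call and stop scheduling new work as soon as it returns `"paused"`; resetting the counter each day is left out for brevity.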
— RESEARCH TASKS —
§A. Reverse-engineer existing systems (architecture + agent topology + prompting strategy + failure modes; cite primary sources): Claude Code, Devin, Cursor Agent, Cline, Aider, Lovable, Bolt, v0, Replit Agent, Emergent.sh, AutoGen, LangGraph, CrewAI, OpenAI Swarm, Anthropic "Building Effective AI Agents" + "Code with Claude" patterns, Mastra, Pydantic AI.
§B. Identify failure modes (cite real reports + propose {{SYSTEM_NAME}} mitigation): context-window drift on long tasks / hallucinated paths / test skipping / plan drift / tool-call loops / premature completion / cost runaway / vendor lock-in / bad debugging / mid-flight pivots breaking work.
§C. Design {{SYSTEM_NAME}} — complete spec:
1. Agent topology (suggest 6-9 agents, NOT 17 — most failures come from over-decomposition). Baseline: Conductor (orchestrator, state machine, human checkpoints), Planner (refines phase to tickets, deep-research on stack), Architect (ADRs + file structure + eval specs BEFORE coder), Coder (multi-file edits, respects conventions), Tester (writes+runs tests, reports), Reviewer (code review against constitution), Debugger (isolated context when tests fail, avoids corrupting Coder state), Mutator (handles mid-flight changes, diffs PLAN against new ask), Reporter (status summaries, human-checkpoint reports). For each: responsibilities / inputs / outputs / tools / token budget / provider routing / constitution rules.
2. State machine (NOT_STARTED → PLANNING → ARCHITECTING → CODING → TESTING → REVIEWING → COMPLETED|DEBUGGING→CODING-loop max N). State persistence, resumability, checkpoint format, interruption handling.
3. Eval-first methodology: Architect produces eval BEFORE Coder writes code; user can review/edit eval; Coder makes eval pass; Tester runs evals + report; Reviewer checks quality even when evals pass. Pick eval format (Vitest/Jest/Pytest/custom DSL) with rationale.
4. Mid-flight mutation handling: user comments → Mutator classifies (small/medium/large) → updates remaining phases / proposes Plan diff for human approval / regenerates PLAN if large. Specify protocol completely.
5. Human-checkpoint protocol: Reporter summary + delta-from-plan + screenshots/diffs; commands approve|feedback|pause|rewrite|skip; auto-pause if no approval in N hours.
6. Cost + observability: token usage per agent + per phase, daily budget cap auto-pause, dashboard, OpenTelemetry traces, replay mode.
7. Provider abstraction: providers/llm/{interface,anthropic,openai,google,kimi,router,fallback}.ts. Specify LLMProvider interface (tool-use, streaming, JSON-mode, structured-outputs, context windows, pricing).
8. Constitution (10 hard rules): no destructive ops without confirmation / no skipping tests / no hardcoded secrets / no fabricated paths / no marking complete without verification / no context-bombing / no single-shot mega-prompts / no phase-N+1 if N didn't pass / no ignoring mid-flight comments / no re-running failed evals > M times.
9. Code-quality output guarantees the produced CODE always satisfies: modular folder structure / provider-abstracted vendors / theme-tokenized (CSS custom properties) / mobile-first / light+dark mode / 3-accent palette + SuperAdmin overrides / vertical-clonable (CodeCanyon-style) / no hardcoded values / WCAG 2.2 AA / tested / documented (docstrings + README + ADRs). HOW does {{SYSTEM_NAME}} enforce these (linters / agent rules / code-review checks)?
10. Security + safety: API keys encrypted (Fernet/KMS), tenant isolation guards, BYOK enterprise, rate limiting, audit log integrity (append-only + hash-chained), hard rules against secret-leaking code.
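As a non-binding illustration of the state machine in item 2, a transition function might be sketched as follows. The reading that a TESTING or REVIEWING failure routes to DEBUGGING is an assumption, as is the extra `ESCALATED` state (used here to mean "debug-loop limit hit, hand off to a human"); neither is fixed by this prompt, and all type and function names are hypothetical.

```typescript
// Illustrative sketch only. ESCALATED is an assumed extra state, not part of the prompt.

type PhaseState =
  | "NOT_STARTED" | "PLANNING" | "ARCHITECTING" | "CODING"
  | "TESTING" | "REVIEWING" | "DEBUGGING" | "COMPLETED"
  | "ESCALATED";

type PhaseEvent = "advance" | "fail";

interface PhaseSnapshot {
  state: PhaseState;
  debugLoops: number; // completed DEBUGGING -> CODING round trips
}

const HAPPY_PATH: PhaseState[] = [
  "NOT_STARTED", "PLANNING", "ARCHITECTING", "CODING",
  "TESTING", "REVIEWING", "COMPLETED",
];

function transition(
  snap: PhaseSnapshot,
  event: PhaseEvent,
  maxDebugLoops: number,
): PhaseSnapshot {
  const { state, debugLoops } = snap;
  if (state === "COMPLETED" || state === "ESCALATED") {
    throw new Error(`No transitions out of terminal state ${state}`);
  }
  if (state === "DEBUGGING") {
    // Assumption: a Debugger that cannot produce a fix escalates immediately.
    if (event === "fail") return { state: "ESCALATED", debugLoops };
    // Otherwise hand the fix back to the Coder and count the loop.
    return { state: "CODING", debugLoops: debugLoops + 1 };
  }
  if (event === "fail") {
    if (state !== "TESTING" && state !== "REVIEWING") {
      throw new Error(`"fail" is only modeled for TESTING and REVIEWING, got ${state}`);
    }
    // Enforce the "loop max N" bound from the state machine description.
    return debugLoops >= maxDebugLoops
      ? { state: "ESCALATED", debugLoops }
      : { state: "DEBUGGING", debugLoops };
  }
  const next = HAPPY_PATH[HAPPY_PATH.indexOf(state) + 1];
  if (next === undefined) throw new Error(`No successor for ${state}`);
  return { state: next, debugLoops };
}
```

Because a `PhaseSnapshot` is plain data, it could be serialized with `JSON.stringify` after every transition, which is one simple way to approach the resumability requirement.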
§D. Reference implementation skeleton (TypeScript or Python or Rust: pick ONE with rationale, don't offer a menu; {{SYSTEM_NAME_LOWER}} is {{SYSTEM_NAME}} in lowercase): {{SYSTEM_NAME_LOWER}}/{bin,src/conductor,src/agents/{planner,architect,coder,tester,reviewer,debugger,mutator,reporter},src/providers/{llm,storage,git,shell,test-runner},src/state,src/eval,src/constitution,src/observability,src/checkpoint,src/cli}, .{{SYSTEM_NAME_LOWER}}/{state.json,conventions.md,checkpoints,audit-log.jsonl,reports}, docs/{ARCHITECTURE,CONSTITUTION,PROTOCOL}.
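To make the provider-abstraction layer (src/providers/llm in the skeleton, item 7 in §C) more concrete, here is one possible shape for the `LLMProvider` interface with a simple fallback helper. This is a sketch under assumptions: every type, field, and function name is hypothetical, no real vendor SDK is called, and the capability fields simply mirror the list in item 7.

```typescript
// Illustrative sketch only. All names are hypothetical; real vendor SDK calls
// would live inside each provider's own implementation of `chat`.

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  messages: ChatMessage[];
  maxOutputTokens: number;
  jsonMode?: boolean; // request JSON-only output
}

interface ChatResponse {
  text: string;
  inputTokens: number;
  outputTokens: number;
  provider: string;
}

interface ProviderCapabilities {
  toolUse: boolean;
  streaming: boolean;
  jsonMode: boolean;
  structuredOutputs: boolean;
  contextWindowTokens: number;
  usdPerMillionInputTokens: number;
  usdPerMillionOutputTokens: number;
}

interface LLMProvider {
  readonly name: string;
  readonly capabilities: ProviderCapabilities;
  chat(request: ChatRequest): Promise<ChatResponse>;
}

// Keep only providers that can satisfy the request, preserving priority order.
function eligibleProviders(request: ChatRequest, providers: LLMProvider[]): LLMProvider[] {
  return providers.filter((p) => !request.jsonMode || p.capabilities.jsonMode);
}

// Try each eligible provider in order; fall through to the next on error.
async function chatWithFallback(
  request: ChatRequest,
  providers: LLMProvider[],
): Promise<ChatResponse> {
  const errors: string[] = [];
  for (const provider of eligibleProviders(request, providers)) {
    try {
      return await provider.chat(request);
    } catch (err) {
      errors.push(`${provider.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed or were ineligible: ${errors.join("; ")}`);
}

// Cost of one call from the provider's advertised per-million-token pricing.
function estimateCostUsd(
  caps: ProviderCapabilities,
  inputTokens: number,
  outputTokens: number,
): number {
  return (
    (inputTokens * caps.usdPerMillionInputTokens +
      outputTokens * caps.usdPerMillionOutputTokens) / 1_000_000
  );
}
```

Keeping capabilities and pricing as plain data on each provider lets routing and cost tracking stay vendor-neutral; whether that trade-off holds up against real SDK differences is something the spec should evaluate rather than assume.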
§E. Trade-off analysis (5 key decisions with recommendation): single-agent-w/-skills vs multi-agent topology / sync orchestrator vs async event-driven (Temporal/Inngest/custom queue) / local-first vs cloud-orchestrator (Vercel+Inngest) / eval-first (TDD-style) vs review-after / sub-agent context isolation vs shared working memory.
§F. Operationalization (CLI commands + example session transcripts for: initial run / mid-flight comment / phase failure+debug / completion). Commands: `{{SYSTEM_NAME_LOWER}} init`, `run`, `pause`, `resume`, `comment`, `replay`, `cost`, `status`.
§G. Open questions requiring further research / spike testing / human judgment.
— OUTPUT FORMAT —
Single markdown spec ~12-20K words. Mermaid diagrams + tables + code snippets where they help. Numbered citations [1], [2], ... + reference list at end. Date the spec ({{TODAY_DATE}}).
— DON'T —
Don't propose 17 agents. Don't recommend tools without justification. Don't repeat existing system descriptions; analyze them. Don't fabricate; flag (unverified). Don't pad.
— DO —
Be the senior architect who has shipped autonomous-agent systems in production. Cite primary sources (repos, official docs, papers, talks). Show example traces / transcripts / pseudocode. Push back on assumptions if current evidence suggests they're wrong.
Begin research now.