This content originally appeared on DEV Community and was authored by Aviad Rozenhek
The Reality of "Autonomous" Multi-Agent Development
What we learned when 5 "independent" AI agents needed constant human orchestration
Part 2 of the Multi-Agent Development Series
TL;DR
We set out to prove AI agents could work independently with zero coordination. The zero-conflict architecture worked perfectly (100% auto-merge). But "autonomous" agents? That was the illusion.
What actually happened:
- ✅ Zero merge conflicts (architecture win)
- ❌ Constant human orchestration required (autonomy myth)
- ✅ Property tests found a 12x efficiency bug (agent + human win)
- ❌ But only because human asked the right questions (human essential)
The real lesson: The value isn't "autonomous agents." It's well-orchestrated human-AI collaboration.
The Setup (Brief Recap)
In Part 1, we described an experiment:
- 5 parallel AI coding agents
- Each working on independent test improvements
- Zero-conflict architecture (file-level ownership)
- Hypothesis: Agents work independently, human just merges
What we expected: Spawn 5 agents, come back in 48 hours, merge everything.
What we got: 8 hours of active human orchestration disguised as "autonomous" work.
The Orchestration Reality
Manual Intervention #1: Tool Switching
The Problem:
Integration Agent (Claude Web): "I'll check the PR status using gh CLI..."
System: Error - 'gh' command not found
What happened:
- Started integration in Claude Code Web (no GitHub CLI access)
- Needed to switch to Claude Code CLI mid-experiment
- Lost context, had to re-establish state
- Broke workflow continuity
Human intervention:
- Realized Web agent couldn't access GitHub API
- Switched to CLI environment
- Re-explained context and state
- Continued from checkpoint
What "autonomous" would look like: Agent detects tool unavailability, switches environments automatically, preserves state.
Reality: Human orchestrated the environment switch.
Manual Intervention #2: Agent Coordination
The Problem: Agents don't know when other agents finish their work.
Actual workflow:
Human: "Pull from PR-2 now"
Agent: [pulls and merges PR-2]
Human: "Now pull from PR-5"
Agent: [pulls and merges PR-5]
Human: "Fix the test failures"
Agent: [investigates and fixes]
Human: "Push"
Agent: [pushes]
Frequency: Continuous throughout integration
What "autonomous" would look like: Integration agent monitors PR branches via GitHub API, detects completion, merges automatically.
Reality: Human manually coordinated every merge.
Manual Intervention #3: Reality Checks
The Problem: Agents made incorrect assumptions about APIs that don't exist.
Examples:
# Agent assumed this API exists:
from gv.ai.video_moderation_service.agent.scenario_builder import ScenarioBuilder
builder = ScenarioBuilder(time_provider=time_provider)
# Reality: Different module, different class name:
from tests.test_video_moderation.simulation.scenario_builder import Scenario
scenario = Scenario("test_name")
# Agent assumed this field exists:
result.details.classification == VideoModerationClassification.SAFE
# Reality: Field was removed from model:
result.details.is_appropriate == True # This is the actual field
Impact:
- 13 test failures from API mismatches
- ~2 hours debugging speculative code
- Wasted agent effort writing code that couldn't work
Human intervention:
- Noticed tests failing with "AttributeError: no attribute 'classification'"
- Checked the actual model structure: python -c "from model import X; print(dir(X()))"
- Realized the agent wrote tests against outdated/assumed APIs
- Corrected agent: "This field doesn't exist, use this instead"
What "autonomous" would look like: Agent introspects models before coding, verifies imports work.
Reality: Human caught and corrected incorrect assumptions.
The Scorecard: Autonomous vs Orchestrated
| Task | Designed to be Autonomous? | Actually Autonomous? | Reality |
|---|---|---|---|
| Branch merging | ✅ Yes | ❌ No | Human triggered each merge manually |
| Dependency install | ✅ Yes | ❌ No | Human ran uv sync 10+ times |
| Tool selection | ✅ Yes | ❌ No | Human switched Web → CLI when gh failed |
| API verification | ✅ Yes | ❌ No | Human corrected incorrect assumptions |
| PR coordination | ✅ Yes | ❌ No | Human directed "pull from PR-X now" |
| Test execution | ✅ Yes | ✅ YES | Agent ran tests independently ✅ |
| Code writing | ✅ Yes | ✅ YES | Agent wrote code without guidance ✅ |
| Bug fixing | ✅ Yes | 🟡 PARTIAL | Agent fixed all the bugs, but the human had to prompt reflection and supply higher-level reasoning for the more complex ones |
Autonomy Score: 2.5 / 8 tasks (31%)
Real workflow: Human-orchestrated parallel development with AI assistants, not autonomous multi-agent coordination.
The Human-AI Sweet Spot
Now for the surprising success story.
The Bug Nobody Noticed
PR-4 (property-based testing) ran 7,000+ random test scenarios. All passed.
Agent reported: "All property tests passing! 7000 scenarios validated."
User asked: "Why is budget utilization so high?"
This question changed everything.
The Investigation (Human-Led, AI-Executed)
User's observation:
Policy requires: 10 checks/minute (1 check per participant per 60s)
Budget provided: 1000 checks/minute (100x excess)
Expected usage: ~10 checks/min (policy requirement)
Actual usage: ~120 checks/min (12x waste!)
Where's the waste coming from?
Agent's response: "Let me investigate the budget allocation logic..."
What the agent found: participants were being checked every 5 seconds instead of every 60 seconds.
Root cause: no logic to skip participants that were still far from their recheck deadline.
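For illustration, here is a minimal sketch of the kind of deadline guard that fixes this behavior. The Participant shape, the 60-second policy constant, and the helper name are assumptions for the sketch, not the project's actual code.

```python
import time
from dataclasses import dataclass

RECHECK_INTERVAL_S = 60.0  # policy: one moderation check per participant per 60s


@dataclass
class Participant:
    participant_id: str
    last_checked_at: float  # unix timestamp of the most recent check


def participants_due(participants: list[Participant], now: float | None = None) -> list[Participant]:
    """Return only the participants whose recheck deadline has arrived.

    The original bug: the scheduler woke up every 5 seconds and checked everyone,
    so each participant was checked 12 times per minute instead of once.
    """
    if now is None:
        now = time.time()
    return [p for p in participants if now - p.last_checked_at >= RECHECK_INTERVAL_S]
```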
The Impact
| Scenario | Expected | Actual | Waste |
|---|---|---|---|
| 10 participants, 60s recheck, 1000 budget | 10 checks/min | 120 checks/min | 12x |
| Cost per check: $0.01 | $0.10/min | $1.20/min | $1.10/min wasted |
| Monthly cost | $4,320 | $51,840 | $47,520 wasted |
This is a production bug costing real money.
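The monthly figures in the table follow directly from the per-check cost; a few lines of arithmetic reproduce them:

```python
participants = 10
cost_per_check = 0.01                     # dollars, from the scenario above

expected_per_min = participants * 1       # policy: 1 check per participant per minute = 10/min
actual_per_min = participants * (60 / 5)  # checked every 5s = 12/min each = 120/min

minutes_per_month = 60 * 24 * 30
expected_monthly = expected_per_min * cost_per_check * minutes_per_month  # $4,320
actual_monthly = actual_per_min * cost_per_check * minutes_per_month      # $51,840

print(f"${actual_monthly - expected_monthly:,.0f} wasted per month")      # $47,520
```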
Why Agent Didn't Notice (But Human Did)
Agent perspective:
- Ran 7000 scenarios ✅
- All invariants held (budget never exceeded) ✅
- No test failures ✅
- Declared success ✅
Human perspective:
- "Wait, why are we using 12x more budget than policy requires?"
- "That's not wrong, but it's wasteful"
- "Let's investigate"
The gap: Agent optimized for correctness (invariants hold). Human optimized for efficiency (use only what's needed).
The Lesson: Complementary Strengths
| Capability | Agent | Human |
|---|---|---|
| Run 7000 scenarios | ✅ Excellent | ❌ Too slow |
| Detect invariant violations | ✅ Excellent | 🟡 Might miss edge cases |
| Notice efficiency waste | ❌ Didn't catch it | ✅ Spotted immediately |
| Ask "why?" | ❌ Accepted results | ✅ Questioned assumptions |
| Deep investigation | ✅ Excellent (once directed) | 🟡 Tedious |
The pattern:
- Agent provides breadth (7000 scenarios)
- Human provides depth (critical thinking)
- Human directs investigation ("check budget utilization")
- Agent executes investigation (traces through code)
- Together: Find bugs that neither would find alone
What "Autonomous" Would Actually Require
Based on what we learned, here's what truly autonomous multi-agent development needs:
Inter-Agent Communication
# Integration agent monitors other agents
for pr in watch_prs():
    if pr.status == "ready":
        merge_automatically(pr)
        tests_pass = run_tests()
        if tests_pass:
            notify_success()
        else:
            investigate_failures()
            notify_agent_for_fixes()
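A rough, hedged sketch of what that monitoring loop could look like today by shelling out to the GitHub CLI. The PR numbers, polling interval, and merge strategy are placeholders, and a real version would also gate on CI status before merging:

```python
import json
import subprocess
import time

PR_NUMBERS = [2, 3, 4, 5, 6]   # placeholder PR numbers
POLL_INTERVAL_S = 60


def pr_status(number: int) -> dict:
    """Query a PR's state via the gh CLI (requires gh installed and authenticated)."""
    result = subprocess.run(
        ["gh", "pr", "view", str(number), "--json", "state,mergeable"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def merge_pr(number: int) -> None:
    subprocess.run(["gh", "pr", "merge", str(number), "--squash"], check=True)


remaining = set(PR_NUMBERS)
while remaining:
    for number in sorted(remaining):
        info = pr_status(number)
        if info["state"] == "MERGED":
            remaining.discard(number)
        elif info["state"] == "OPEN" and info["mergeable"] == "MERGEABLE":
            merge_pr(number)  # a real version would run the test suite here too
            remaining.discard(number)
    time.sleep(POLL_INTERVAL_S)
```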
Model Introspection Before Coding
# Agent verifies API before using
def before_writing_test(model_class):
verify_import_works(model_class)
actual_fields = introspect_fields(model_class)
actual_signature = get_signature(model_class.__init__)
# Only now write tests using ACTUAL API
write_tests(actual_fields, actual_signature)
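Using only the standard library, a concrete version of that pre-flight check might look like this (module and class names are placeholders):

```python
import importlib
import inspect


def verify_api(module_path: str, class_name: str) -> dict:
    """Fail fast on a wrong import, then report the class's real API surface."""
    module = importlib.import_module(module_path)  # ImportError if the module doesn't exist
    cls = getattr(module, class_name)              # AttributeError if the class doesn't exist
    return {
        "fields": [name for name in dir(cls) if not name.startswith("_")],
        "init_signature": str(inspect.signature(cls.__init__)),
    }


# Run this BEFORE writing tests against the class, e.g. (placeholder path):
# print(verify_api("tests.test_video_moderation.simulation.scenario_builder", "Scenario"))
```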
Efficiency Monitoring
# Agent notices inefficiency, not just correctness
def after_property_tests_pass():
check_invariants() # Current behavior ✅
# NEW: Also check efficiency
check_resource_utilization()
if utilization > expected * 2:
flag_potential_waste()
investigate_cause()
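A sketch of that check, assuming the property-test harness records how many checks each scenario actually issued versus how many the policy requires:

```python
import warnings


def check_efficiency(checks_issued: int, checks_required: int, tolerance: float = 2.0) -> float:
    """Flag resource usage well above the policy requirement, even when every invariant held."""
    if checks_required <= 0:
        return 0.0
    ratio = checks_issued / checks_required
    if ratio > tolerance:
        warnings.warn(
            f"Issued {checks_issued} checks but the policy only requires "
            f"{checks_required} ({ratio:.1f}x) -- possible waste."
        )
    return ratio


# In our scenario this fires immediately:
# check_efficiency(checks_issued=120, checks_required=10)  -> 12.0x warning
```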
None of these capabilities exist today in Claude Code. Hence: constant human orchestration.
What Actually Worked
Despite the orchestration requirements, we did achieve significant wins:
1. Zero-Conflict Architecture (★★★★★)
The Design:
- Each PR owns its files completely
- No shared file modifications
- CREATE new files instead of MODIFY existing
The Result:
- 100% auto-merge success rate
- Zero manual conflict resolution
- 5 PRs merged in 48 hours
Key Insight: File-level ownership is the gold standard for parallel work.
(More on this in Article 4: "Zero-Conflict Architecture")
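One way to enforce file-level ownership mechanically is a small CI script that compares a branch's changed files against an ownership map; the branch keys and directory prefixes below are placeholders, not the project's actual layout:

```python
import subprocess
import sys

# Placeholder ownership map: each work stream may only touch files under its own prefixes.
OWNED_PREFIXES = {
    "pr-2": ("tests/test_video_moderation/simulation/",),
    "pr-4": ("tests/test_video_moderation/properties/",),
}


def changed_files(base: str = "origin/main") -> list[str]:
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def check_ownership(branch_key: str) -> int:
    allowed = OWNED_PREFIXES.get(branch_key, ())
    violations = [f for f in changed_files() if not f.startswith(allowed)]
    for path in violations:
        print(f"Ownership violation: {path} is outside {list(allowed)}")
    return 1 if violations else 0


if __name__ == "__main__":
    sys.exit(check_ownership(sys.argv[1] if len(sys.argv) > 1 else "pr-2"))
```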
2. Property-Based Testing + Human Oversight (★★★★★)
The Combination:
- Agent runs 7000 scenarios → Validates invariants
- Human asks "why is utilization high?" → Notices efficiency waste
- Together → Find 12x cost bug
Key Insight: Agents provide breadth, humans provide depth.
(More on this in Article 3: "Property-Based Testing with Hypothesis")
3. Parallel Execution (★★★★☆)
Despite orchestration overhead:
- 5 work streams completed in 8 hours
- Estimated sequential time: 8-10 days
- Time savings: ~75% (even with orchestration!)
The math:
- Human orchestration time: ~6 hours
- Agent execution time: ~42 hours (parallelized across 5 agents)
- Total wall-clock time: 48 hours
- Sequential alternative: 192+ hours (8 days)
Orchestration overhead: 6 hours / 48 hours = 12.5% of total time
Key Insight: Even with constant orchestration, parallel execution still delivered massive time savings.
The Honest Framing: Human-Orchestrated AI Assistants
Let's be clear about what we actually built:
What We Claimed (Part 1)
"5 independent AI agents working autonomously with zero coordination"
What We Actually Built
Human-orchestrated parallel development with AI assistants
Where:
Human provides:
- Environment management (tool switching, dependency install)
- Coordination signals ("pull from PR-X now")
- Reality checks (API verification, efficiency monitoring)
- Critical thinking ("why is this wasteful?")
AI Agents provide:
- Code generation at scale
- Test execution breadth (7000 scenarios)
- Pattern implementation (once shown correct pattern)
- Investigation depth (once directed)
The Value Proposition (Still Significant!)
Even though it's not "autonomous," the value is real:
Time Savings
- 8 hours vs 8-10 days (75% reduction)
- Despite 12.5% orchestration overhead
Quality Improvements
- ~80 lines of duplicate code eliminated
- 67 new tests added
- 1 major efficiency bug found (12x cost savings)
- 7000+ random scenarios validated
Developer Experience
- Human focuses on high-value activities (architecture, verification, critical thinking)
- AI handles high-volume activities (code generation, test execution, investigation)
Recommendations for Practitioners
If you're considering multi-agent AI development:
✅ Do This
1. Design for zero conflicts
- File-level ownership
- CREATE new files, don't MODIFY shared ones
- Partition existing files carefully
2. Verify before coding
- Always introspect models: print(dir(model))
- Test imports before using them: python -c "from X import Y"
- Check signatures: help(function)
3. Plan for orchestration
- Budget 10-15% time for human coordination
- Set up easy branch switching (git worktrees)
- Automate dependency install scripts
- Create checklists for manual steps
4. Combine agent breadth + human depth
- Use agents for volume (7000 test scenarios)
- Use humans for insight ("why is this wasteful?")
- Direct agents to investigate once you notice issues
5. Track metrics honestly
- Measure orchestration overhead (not just execution time)
- Count manual interventions (tool switches, fixes)
- Report actual autonomy score (31% in our case)
❌ Don't Do This
1. Assume autonomy without verification
- Agents will make incorrect API assumptions
- Cost: Wasted effort + debugging time
2. Skip environment setup
- Private dependencies need manual handling
- Tool unavailability breaks workflows
3. Expect agents to coordinate automatically
- No inter-agent communication today
- Budget time for manual coordination
4. Trust property tests alone
- Agents optimize for correctness (invariants)
- Humans must check efficiency (resource usage)
5. Over-claim autonomy
- Be honest about orchestration requirements
- Report actual vs autonomous time
Conclusion
The experiment proved:
- ✅ Zero-conflict architecture works (100% auto-merge)
- ✅ Parallel execution saves time (75% reduction)
- ✅ Property testing + human insight finds bugs (12x efficiency bug)
- ❌ True autonomy doesn't exist yet (31% autonomy score)
The real value:
Not "autonomous multi-agent development" but well-orchestrated human-AI collaboration where:
- Humans provide architecture, verification, and critical thinking
- AI provides generation, execution breadth, and investigation depth
- Together they achieve 75% time savings despite 12.5% orchestration overhead
Would we do it again? Yes! But with realistic expectations about orchestration requirements.
Next time we'd improve:
- Write verification scripts (model introspection, import checking)
- Create orchestration checklists (reduce decision fatigue)
- Build efficiency monitoring (not just correctness testing)
- Set realistic expectations (orchestrated assistance, not autonomy)
What's Next
In the following articles, we'll dive deeper into specific aspects:
Article 3: Property-Based Testing with Hypothesis: The Data You're Throwing Away
- How we ran 7000 scenarios but almost lost all the insights
- The policy that changed everything
Article 4: Zero-Conflict Architecture: The 80/20 of Parallel Development
- The one design decision that eliminated all merge conflicts
- What zero-conflict doesn't solve (integration correctness)
Article 5: Communication Protocols for AI Agents That Can't Talk
- 4 iterations to get file-based messaging working
- What worked and what didn't
Article 6: The Budget Calculator Paradox
- What we learned from flip-flopping 8 times on a formula
- Build the calculator first, use it everywhere
Tags: #ai #claude #multi-agent #reality-check #human-ai-collaboration #orchestration #lessons-learned #honest-reporting
This is Part 2 of the Multi-Agent Development Series. Read Part 1 for the original experiment design and optimistic hypothesis.
Discussion: What's your experience with "autonomous" AI agents? Have you found the sweet spot between human orchestration and AI execution? Share in the comments!