10 min read
AI Agents · LLM · Architecture · Enterprise

Multi-agent systems: when they make sense and when they don't

Not everything needs multi-agents. Here's when they work, when they're overkill, and the real latency and cost numbers nobody tells you.

Petru Arakiss, AI Consultant & Architect

A fintech I worked with a while back had spent over €150k on a customer service chatbot. They'd been iterating for months. The bot worked great in demos, but in production it invented return policies that didn't exist and gave wrong prices. Not always, maybe 25-30% of the time, but enough that the support team hated the thing.

The model wasn't the problem. They were using one of the leading models at the time, well-configured. The problem was that a single agent was trying to do too many things: query the catalog, check stock, calculate prices with discounts, handle returns, and answer general questions. Too much context, too many conflicting instructions.

The solution I proposed was to split the work. Instead of one bot that knew everything, five smaller agents that knew a lot about a little. One for catalog, one for pricing, one for logistics, one for returns, and an orchestrator that decided who to ask. Errors dropped significantly, though I don't have an exact number because the metric changed midway through the project.

What a multi-agent system actually is (no buzzwords)

Think about a hospital. There's no single doctor who does everything. There are specialists: cardiologists, radiologists, surgeons. And there's triage, which decides which specialist to send each patient to.

A multi-agent system works similarly. Instead of a giant LLM with a massive prompt, you have specialized agents that master specific tasks. And something that coordinates who does what.

The typical components:

Orchestrator: Receives the request, decides which agent or agents need to act, and combines the results. It doesn't do the real work, just directs traffic.

Specialized agents: Each has its own prompt, its own tools, access to specific data. The inventory agent knows how to query the stock database. The pricing agent has access to the pricing API. Each is an expert in its domain.

Shared memory: A place where agents leave information for others. The catalog agent finds the product, pricing adds the cost, shipping calculates delivery. They don't talk directly, but they share context.

Tools: Functions that agents can execute. Call APIs, query databases, send emails. Without tools, an agent is just a text generator.
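Stripped to the bone, those four pieces fit in a few lines. A minimal sketch, assuming a call_llm() helper that wraps whatever provider SDK you use; every name here is illustrative:

```python
from dataclasses import dataclass
from typing import Callable

def call_llm(system_prompt: str, user_message: str) -> str:
    """Placeholder for whatever provider SDK you use."""
    raise NotImplementedError("wire up your provider here")

@dataclass
class Agent:
    name: str
    system_prompt: str                    # narrow, domain-specific instructions
    tools: dict[str, Callable[..., str]]  # e.g. {"check_stock": query_stock_db}

    def run(self, message: str, memory: dict) -> str:
        # Each agent sees only its own prompt plus the shared context it needs.
        # (The tool-calling loop is omitted to keep the sketch short.)
        context = f"Shared context: {memory}\n\nUser: {message}"
        return call_llm(self.system_prompt, context)

# Shared memory: just a dict the orchestrator passes between calls.
memory: dict = {}
```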

When NOT to use multi-agents

Here's what nobody tells you: recent research suggests that in many cases a single well-configured agent outperforms multi-agent systems.

A 2025 study found that in environments with more than 10 tools, multi-agent systems suffer an efficiency penalty of 2x to 6x compared to individual agents. That's significant.

The folks at Cognition, who created Devin, say it clearly: in 2025, running multiple agents in collaboration results in fragile systems. Their recommendation is to start with a linear agent where context is continuous.

Don't use multi-agents when:

The task can be resolved in a single logical pass. Summarizing documents, classifying tickets, extracting data from invoices. A single well-configured agent with good RAG is enough (see the sketch after this list).

You have many tools. Counterintuitive, but true. With more than 10 tools, coordination between agents adds overhead that doesn't pay off.

Your individual agent already works reasonably well. Research suggests that if your single agent exceeds 45% accuracy on the task, adding more agents probably won't improve things. Sometimes it makes them worse.

Tasks are primarily write operations. Read operations parallelize well. Write operations create coordination problems.

You're a small team. Multi-agent systems require monitoring, distributed debugging, and integration tests. If there are two of you, start simple.
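For contrast, the single-agent baseline from the first point is barely more than a retrieval step and one call. A sketch where retrieve() and call_llm() are placeholders for your vector store and provider SDK, not any particular library:

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the top-k relevant chunks from your document store."""
    raise NotImplementedError

def call_llm(system_prompt: str, user_message: str) -> str:
    """Placeholder for your provider SDK."""
    raise NotImplementedError

def answer(question: str) -> str:
    # One logical pass: retrieve, then generate. No orchestration needed.
    context = "\n---\n".join(retrieve(question))
    system = (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n" + context
    )
    return call_llm(system, question)
```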

When it DOES make sense

It makes sense when:

The problem crosses multiple domains with different rules. An e-commerce assistant that handles catalog, payments, shipping, and returns. Each area has its own logic, its own data, its own exceptions.

Tasks have clear dependencies and multiple passes. First search for the product, then verify stock, then apply discounts, then calculate shipping. When order matters and each step needs information from the previous one (there's a sketch of this after the list).

You need strict auditing. In banking, insurance, or any regulated environment, knowing exactly which decision was made by whom is mandatory. With multi-agents you can trace each step.

Different teams maintain different parts. If the pricing team changes their rules every week and logistics every month, having each own their agent makes it easier to iterate without breaking each other.
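When the steps genuinely depend on each other, the multi-agent version is essentially a pipeline. A sketch reusing the Agent class and call_llm() placeholder from earlier; the agent names and flow are hypothetical:

```python
def handle_order_question(question: str, agents: dict[str, Agent]) -> str:
    memory: dict = {"question": question}
    # Each pass needs the previous one's output: order matters.
    memory["product"] = agents["catalog"].run(question, memory)
    memory["stock"] = agents["inventory"].run(memory["product"], memory)
    memory["price"] = agents["pricing"].run(memory["product"], memory)
    memory["shipping"] = agents["logistics"].run(memory["product"], memory)
    # One final call combines everything into a single answer.
    return call_llm("Combine these partial results into one response.",
                    str(memory))
```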

The real latency numbers (and the complexity nobody tells you)

This is more complicated than it seems. Latency comes from many places, not just model generation.

LLM Infrastructure:

Where you call the model from matters. If your server is in Europe and you're using a US endpoint, you add 80-150ms of network latency alone, on each call. And in multi-agent you make many calls. I've seen systems where 30% of total latency was just transatlantic round-trips.

The provider's own infrastructure adds variability. During peak hours you can go from 200ms time-to-first-token to 800ms or more. This multiplies in multi-agent.

Context Management:

Each agent needs context. How you compress it, how much you keep, and how you pass it between agents all add up. I've seen systems where state serialization and deserialization between agents added 50-100ms per hop.

If you use shared memory with persistence, add database latency. If you use a cache, you need to manage invalidation. If you compress conversations to stay within token limits, that compression has a cost.

Inter-agent Communication:

If agents pass messages to each other, each communication step has overhead. Response parsing, format validation, error handling, retries. In a 5-agent system with an orchestrator, you easily have 8-10 calls between components for each user request.

The numbers I see in production:

Latency breakdown by component:

Network to LLM provider: 30-150ms (depends on geography)
Time to first token: 200-800ms (depends on provider load)
Full generation: 2000-20000ms (the bulk of the time)
Orchestrator (routing): 50-200ms
Vector search: 5-300ms (5ms with a hot cache)
State serialization: 20-100ms (per agent)
Validation and parsing: 10-50ms (per response)
Total overhead (excluding LLM): 285-1450ms

Adding it up: a simple request through 3 agents can easily take 8-15 seconds. A complex request with 5 agents and multiple passes, 30 seconds or more.
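You can sanity-check those totals with the ranges from the table. A back-of-envelope calculation, assuming four sequential LLM calls (orchestrator plus three agents) and 2-5 seconds of generation per call:

```python
llm_calls = 4                 # orchestrator + 3 agents, sequential
network = (30, 150)           # ms per call
ttft = (200, 800)             # ms per call, time to first token
generation = (2000, 5000)     # ms per call, assuming short responses
per_hop = (30, 150)           # ms, serialization + validation per hop
routing = (50, 200)           # ms, one-time routing decision

low = llm_calls * (network[0] + ttft[0] + generation[0] + per_hop[0]) + routing[0]
high = llm_calls * (network[1] + ttft[1] + generation[1] + per_hop[1]) + routing[1]
print(f"{low / 1000:.1f}s to {high / 1000:.1f}s")  # 9.1s to 24.6s
```

The high end assumes everything is slow at once, which is rare. But even the optimistic end lands near ten seconds.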

If you need responses in under 2 seconds, multi-agent architecture is probably not for you. And if you think you can optimize it later, think twice: coordination complexity compounds with every agent you add.

How I implement it

I don't use frameworks. Anthropic says it well in their documentation: the most successful implementations don't use complex frameworks or specialized libraries. They build with simple, composable patterns.

Frameworks add abstraction. Abstraction hides what's happening. In production you need to see exactly what's hitting the API. More code to write, yes, but much easier to debug.

The basic pattern is an orchestrator with routing:

User → Orchestrator (classify intent) → Agent A / Agent B / Agent C → Combine → Response

8-10 calls between components per request

The orchestrator is another LLM call with a specific prompt to classify and route. Each agent is its own call with its own system prompt and tools. Shared memory is usually a dictionary or store you pass between calls.
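Sticking with the call_llm() placeholder and Agent class from earlier, the whole routing pattern fits in a dozen lines. The routing prompt and agent names are illustrative:

```python
import json

ROUTER_PROMPT = (
    "Classify the user request into one or more of: "
    "catalog, pricing, logistics, returns. "
    'Reply with JSON only, e.g. {"agents": ["catalog", "pricing"]}.'
)

def handle(question: str, agents: dict[str, Agent]) -> str:
    memory: dict = {"question": question}
    # Call 1: the orchestrator is just another LLM call that classifies.
    # In production you validate this output and retry on malformed JSON.
    routing = json.loads(call_llm(ROUTER_PROMPT, question))
    # Calls 2..n: each agent runs with its own prompt and tools,
    # leaving its result in shared memory for the next one.
    for name in routing["agents"]:
        memory[name] = agents[name].run(question, memory)
    # Final call: combine the partial answers into one response.
    return call_llm("Combine these partial answers into one response.",
                    json.dumps(memory))
```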

It's not magic. It's basic software engineering applied to API calls. The real work is in designing good prompts, defining clear tools, and above all in error handling and observability.

Evaluations: this is not traditional testing

This is where most people get lost. You think you can test a multi-agent system like you test traditional software. Unit tests, integration tests, end-to-end. It doesn't work that way.

With LLMs you don't have determinism. The same input can give different outputs. A test that passes today can fail tomorrow without you changing anything. And when you have multiple agents, variability multiplies.

What you need are evals, not tests.

Evals are continuous evaluations against representative datasets. They don't verify that output is exactly X, they verify that output is "good enough" according to defined criteria. Accuracy, relevance, absence of hallucinations, correct format, appropriate tone.

Why it's more complex than traditional testing:

In classic software, a test fails or passes. With LLMs you have gradients. A response can be 80% correct. Or correct but poorly formatted. Or correct but with inappropriate tone. Defining what's "good enough" is a problem in itself.

In multi-agent it gets more complicated. If the final result is bad, which agent failed? Did the orchestrator route incorrectly? Did an intermediate agent corrupt context? Did combining responses lose information? You need evals at the individual agent level and at system level.

What I do in production:

Evaluation datasets per agent. Minimum 50-100 representative cases per agent, with expected outputs or evaluation criteria.

Automated evals in CI. Every prompt change triggers evaluation against the dataset. If accuracy drops below a threshold, it doesn't deploy. (A minimal harness sketch follows this list.)

LLM-as-judge for complex cases. Using another model to evaluate whether a response is correct when there's no "exact" answer. It has its problems, but scales better than human review.

Continuous monitoring in production. Evals don't end at deploy. Sampling real requests, offline evaluation, alerts when metrics degrade.
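A minimal harness sketch tying those four practices together. EvalCase, judge(), and the threshold are all illustrative; judge() stands in for an LLM-as-judge call:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    criterion: str   # what "good enough" means for this specific case

def judge(output: str, criterion: str) -> bool:
    """Placeholder: ask another model whether output meets the criterion."""
    raise NotImplementedError

def run_evals(agent, cases: list[EvalCase], threshold: float = 0.9) -> bool:
    passed = sum(judge(agent.run(c.input, {}), c.criterion) for c in cases)
    accuracy = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({accuracy:.0%})")
    # The CI gate: below the threshold, the prompt change does not ship.
    return accuracy >= threshold
```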

The cost nobody mentions:

Building a good eval system can take more time than building the multi-agent system itself. I've seen projects where 40% of effort went into evaluation and observability. But without that, you don't know if your system works. You just hope it works.

Mistakes I see repeated

Starting with too many agents. The temptation is to model the entire organization from day one. Start with two, three max. Add when you have a real problem to solve, not before.

Not defining contracts. Each agent needs a clear specification: what it receives, what it returns, when it fails (a sketch follows this list). Without this, when something breaks, debugging is impossible.

Ignoring observability. A multi-agent system without structured logs and tracing is a black box. You need to be able to reconstruct what happened when something fails.

Underestimating costs. Each agent is another LLM call. I've seen bills double or triple after migrating to multi-agent. Budget from the start.

Depending on frameworks. When the framework updates, or stops being maintained, or has a production bug, you're trapped. With your own code on the API, you have full control.

Over-engineering. Models improve fast. What needs three agents today might be done by one in six months. Don't build for problems you don't have yet.
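On contracts specifically, it doesn't take much. A sketch of what the contract for a hypothetical pricing agent could look like; every name here is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class PricingRequest:
    sku: str
    quantity: int
    customer_tier: str       # e.g. "standard" or "premium"

@dataclass
class PricingResponse:
    unit_price: float
    discount_applied: float  # as a fraction, e.g. 0.15
    currency: str

class PricingError(Exception):
    """The agent could not produce a valid price (unknown SKU, API down...)."""

# The pricing agent accepts a PricingRequest and must return a PricingResponse
# or raise PricingError. When something breaks, you know exactly what crossed
# the boundary and which side violated the contract.
```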

Next step

If you have processes that today depend on humans doing repetitive work across multiple systems, multi-agent systems can help. They're not magic. They're software engineering applied to LLMs, with their tradeoffs.

My recommendation: start with a single well-made agent. Measure where it fails. Only when you have clear data that the simple architecture doesn't scale, consider splitting into multiple agents.

If you want me to review your case, get in touch. As part of my consulting services, I can give you an idea of whether it makes sense for your situation and where to start. Learn more about my experience designing these systems for enterprises.


If this article was helpful and you want to explore how to apply these ideas in your company, schedule a call.
