Agentic workflows: a practical implementation guide
How to build AI agents that actually work in production. Patterns, anti-patterns, and lessons from real deployments.
Last quarter Pharosyne helped a logistics company automate their customer service escalation process. They wanted an AI system that could read incoming tickets, gather context from their CRM and order system, attempt resolution, and escalate to humans when needed.
The first version was a single prompt. It worked for 40% of tickets. The other 60% either failed silently or produced wrong answers with high confidence.
The final version uses four specialized agents coordinated by a router. It handles 78% of tickets correctly and knows when to escalate. The difference wasn't smarter prompts. It was better architecture.
What "agentic" actually means
An agent is an LLM that can take actions, observe results, and decide what to do next. Unlike a simple prompt-response pattern, agents operate in loops.
The basic loop:
1. Receive a goal
2. Decide on an action
3. Execute the action (call a tool, query a database, make an API call)
4. Observe the result
5. Decide if the goal is achieved
6. If not, go to step 2
This sounds simple. The complexity comes from everything that can go wrong in steps 2-6.
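A rough sketch of that loop in Python follows. The llm_decide function, the decision format, and the tools registry are hypothetical placeholders, not any particular framework's API.

# Minimal agent loop sketch. llm_decide and the tools are placeholders
# standing in for your LLM client and system integrations.
MAX_STEPS = 8  # hard stop so a confused agent cannot loop forever

def run_agent(goal: str, tools: dict, llm_decide) -> str:
    history = []  # every action and observation the model has seen so far
    for _ in range(MAX_STEPS):
        decision = llm_decide(goal, history)           # step 2: decide on an action
        if decision["type"] == "final_answer":         # step 5: goal achieved
            return decision["content"]
        tool = tools[decision["tool"]]                 # step 3: execute the action
        observation = tool(**decision["arguments"])
        history.append({"action": decision, "observation": observation})  # step 4
    return "Escalating to a human: step budget exhausted."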
When to use agents (and when not to)
Agents are appropriate when:
The task requires multiple steps with dependencies. Researching a topic, then writing a summary, then fact-checking against sources. Each step depends on previous results.
The path isn't predictable. You can't write a flowchart because the required steps depend on what you discover along the way.
The task benefits from iteration. Write code, test it, fix errors, test again. The loop matters.
Tool use is central. The LLM needs to search databases, call APIs, or interact with external systems to complete the task.
Agents are NOT appropriate when:
A single prompt works. If you can get reliable results with one LLM call, adding agent complexity is wasted effort.
The workflow is fixed. If steps A, B, C always happen in that order with the same tools, just run them sequentially. No agent needed.
Latency is critical. Agent loops take time. Each step is an LLM call plus tool execution. If you need sub-second responses, agents usually don't fit.
Reliability requirements are extreme. Agents can fail in unexpected ways. For safety-critical systems, explicit programmed logic is safer.
The three agent patterns that work
Pattern 1: ReAct (Reasoning + Acting)
The agent reasons about what to do, takes an action, observes the result, then reasons again.
Thought: I need to find the customer's order history
Action: query_crm(customer_id="12345")
Observation: Customer has 3 orders: #1001 (delivered), #1002 (in transit), #1003 (cancelled)
Thought: The customer is asking about order #1002 which is in transit
Action: get_tracking(order_id="1002")
Observation: Package is at distribution center, ETA tomorrow
Thought: I have the information needed to respond
Action: respond("Your order #1002 is at our distribution center and will arrive tomorrow")
When to use it: General-purpose tasks where the agent needs to figure out the approach as it goes. Good for customer service, research tasks, data analysis.
Watch out for: Reasoning loops that go nowhere. The agent thinks and thinks but never acts. Set maximum iterations and monitor average loop length.
Pattern 2: Plan and Execute
The agent first creates a plan, then executes each step. The plan can be revised if something fails.
Plan:
1. Get customer details from CRM
2. Check order status in fulfillment system
3. If order delayed, check logistics API for reason
4. Compose response with status and any available compensation
Execute step 1: query_crm(...)
Execute step 2: check_order_status(...)
...
When to use it: Complex tasks with multiple phases. Writing documents, multi-step analysis, tasks that benefit from explicit structure.
Watch out for: Plans that are too rigid. If step 2 fails, the agent might not know how to adapt. Build in replanning triggers.
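A minimal sketch of this skeleton with a replanning trigger wired in. The llm_plan and llm_replan helpers and the step format are hypothetical placeholders.

# Plan-and-execute sketch. llm_plan and llm_replan are placeholders for LLM
# calls; each plan step names a tool and its arguments.
def plan_and_execute(goal: str, tools: dict, llm_plan, llm_replan) -> list:
    plan = llm_plan(goal)  # e.g. [{"tool": "query_crm", "args": {...}}, ...]
    completed = []         # (step, result) pairs executed so far
    while plan:
        step = plan.pop(0)
        try:
            result = tools[step["tool"]](**step["args"])
            completed.append((step, result))
        except Exception as error:
            # Replanning trigger: hand the failure and the progress so far back
            # to the model and continue with the revised remaining steps.
            plan = llm_replan(goal, completed, failed_step=step, error=str(error))
    return completed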
Pattern 3: Multi-agent collaboration
Multiple specialized agents work together. A router or orchestrator decides which agent handles what.
Router receives: "I want to return order #1002 and also have a billing question"
Router analysis: Two separate issues detected
- Route to: Returns Agent (order return request)
- Route to: Billing Agent (billing question)
Returns Agent handles return...
Billing Agent handles billing...
Router combines responses
When to use it: When different subtasks require different capabilities, tools, or system prompts. When you want to keep each agent focused and simple.
Watch out for: Coordination overhead. The more agents, the more chances for miscommunication. Keep the number small (3-5 is usually the sweet spot).
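A sketch of the router layer. The intent classifier and the specialist agents here are trivial stand-ins; in practice the classifier is one cheap LLM call and each specialist is its own agent loop with its own tools and system prompt.

# Multi-agent router sketch with stub specialists and a keyword classifier
# standing in for a cheap LLM routing call.
def returns_agent(request: str) -> str:
    return f"[returns agent would handle: {request}]"

def billing_agent(request: str) -> str:
    return f"[billing agent would handle: {request}]"

AGENTS = {"returns": returns_agent, "billing": billing_agent}

def classify_intents(message: str) -> list:
    # Stand-in for an LLM call that splits the message into labeled
    # sub-requests, e.g. [("returns", "return order #1002"), ("billing", ...)]
    text = message.lower()
    found = [(label, message) for label, keyword in
             [("returns", "return"), ("billing", "billing")] if keyword in text]
    return found or [("unknown", message)]

def route(message: str) -> str:
    responses = []
    for label, sub_request in classify_intents(message):
        agent = AGENTS.get(label)
        responses.append(agent(sub_request) if agent
                         else "Escalating this part of the request to a human.")
    return "\n\n".join(responses)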
Implementation lessons from production
1. Tools are more important than prompts
Pharosyne spends 70% of development time on tool design, 30% on prompts. A well-designed tool with clear input/output schemas makes the agent's job easy. A vague tool leads to hallucinated parameters and failed calls.
Good tool definition:
{
  "name": "get_order_status",
  "description": "Get current status of a customer order",
  "parameters": {
    "order_id": {
      "type": "string",
      "description": "Order ID in format ORD-XXXXX",
      "pattern": "^ORD-[0-9]{5}$"
    }
  },
  "returns": {
    "status": "pending | processing | shipped | delivered | cancelled",
    "last_updated": "ISO 8601 timestamp"
  }
}
Bad tool definition:
{
  "name": "check_order",
  "description": "Check something about an order",
  "parameters": {
    "id": { "type": "string" }
  }
}
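One way to hold the agent to the good definition at runtime is to validate arguments before the tool executes, so a hallucinated or malformed ID fails fast with an error the agent can read. A sketch with a simple regex check; the response payload is a placeholder.

# Validate tool arguments against the declared schema before execution.
import re

ORDER_ID_PATTERN = re.compile(r"^ORD-[0-9]{5}$")

def get_order_status(order_id: str) -> dict:
    if not ORDER_ID_PATTERN.fullmatch(order_id):
        # Structured error the agent can reason about instead of a stack trace.
        return {"error": "order_id must match ORD-XXXXX, e.g. ORD-00042"}
    # ... call the fulfillment system here (placeholder response below) ...
    return {"status": "shipped", "last_updated": "2024-05-14T09:30:00Z"}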
2. Error handling determines success rate
In Pharosyne's experience, 30-40% of agent failures come from unhandled edge cases. The tool returns an error, the agent doesn't know what to do, and the whole task fails.
Build explicit error handling:
- What happens if the API is down?
- What happens if the customer ID doesn't exist?
- What happens if the agent tries an action it's not allowed to do?
Give the agent graceful fallbacks. "If you cannot retrieve order status, inform the customer that you're checking manually and will follow up."
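A sketch of that idea: wrap every tool call so failures come back as structured observations with a suggested fallback, instead of exceptions that kill the task. The exception types and field names are illustrative.

# Convert tool failures into observations the agent can act on.
def safe_call(tool, **arguments) -> dict:
    try:
        return {"ok": True, "result": tool(**arguments)}
    except ConnectionError:
        return {"ok": False, "error": "upstream API unreachable",
                "suggest": "tell the customer you are checking manually and will follow up"}
    except LookupError as missing:
        return {"ok": False, "error": f"record not found: {missing}",
                "suggest": "confirm the ID with the customer"}
    except PermissionError:
        return {"ok": False, "error": "action not permitted for this agent",
                "suggest": "escalate to a human"}
    except Exception as error:
        return {"ok": False, "error": str(error),
                "suggest": "escalate to a human"}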
3. Observability is non-negotiable
You need to see:
- Every thought/action/observation in the loop
- Which tools were called with what parameters
- How long each step took
- Where failures occurred
Without this, debugging agent behavior is guesswork. Pharosyne uses structured logging where every agent step outputs a JSON event that can be queried later.
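A minimal sketch of that logging, one JSON event per agent step; the field names are illustrative, not a fixed schema.

# Emit one queryable JSON event per agent step.
import json
import time
import uuid

def log_step(task_id: str, step: int, thought: str, action: str,
             arguments: dict, observation, duration_s: float) -> None:
    event = {
        "event_id": str(uuid.uuid4()),
        "task_id": task_id,
        "step": step,
        "timestamp": time.time(),
        "thought": thought,
        "action": action,
        "arguments": arguments,
        "observation": str(observation)[:2000],  # truncate large payloads
        "duration_s": round(duration_s, 3),
    }
    print(json.dumps(event))  # or ship to your log pipeline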
4. Guardrails prevent disasters
Agents can do unexpected things. Build guardrails:
Action limits. Maximum N tool calls per task. Maximum M tokens generated. Maximum T seconds runtime.
Permission boundaries. The agent can READ from the CRM but cannot WRITE. It can query orders but cannot cancel them.
Output validation. Before sending a response to the customer, check it against content policies. Flag anything that looks wrong for human review.
Human-in-the-loop triggers. Define conditions where the agent must escalate: low confidence, high-stakes decisions, explicit uncertainty.
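A sketch of the action-limit and permission-boundary guardrails as a thin layer checked before every tool call; the specific limits and the allowlist are illustrative.

# Guardrail layer: hard budgets plus an explicit allowlist of tools.
import time

class Guardrails:
    def __init__(self, max_tool_calls=15, max_seconds=60.0,
                 allowed_tools=("query_crm", "get_order_status")):
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.allowed_tools = set(allowed_tools)
        self.calls = 0
        self.started = time.monotonic()

    def check(self, tool_name: str) -> None:
        # Call before every tool execution; a raised error forces escalation.
        self.calls += 1
        if tool_name not in self.allowed_tools:
            raise PermissionError(f"{tool_name} is outside this agent's permissions")
        if self.calls > self.max_tool_calls:
            raise RuntimeError("tool-call budget exhausted; escalate to a human")
        if time.monotonic() - self.started > self.max_seconds:
            raise RuntimeError("runtime budget exhausted; escalate to a human")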
5. Start simple, add complexity only when needed
Pharosyne's first agent for any use case is a single ReAct loop with 3-4 tools. That's it. Measure where it fails. Then add complexity specifically to address those failures.
Don't design a sophisticated multi-agent system on day one. You don't know what you need yet.
Cost and latency reality
Agent workflows are expensive. Each reasoning step is an LLM call. A 5-step agent loop with GPT-4 costs roughly $0.05-0.15 per task. At 10,000 tasks per day, that's $500-1500/day on LLM costs alone.
Latency adds up too. Each step is 1-3 seconds for the LLM plus tool execution time. A 5-step loop might take 10-20 seconds total.
Optimization strategies:
- Use cheaper models for simpler steps (GPT-4o-mini for routing, GPT-4 for complex reasoning)
- Cache common queries
- Parallelize independent tool calls (see the sketch after this list)
- Precompute where possible
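A sketch of the parallelization point: when two lookups do not depend on each other, run them concurrently rather than as sequential agent steps. The coroutines are placeholders for real async API clients.

# Run independent lookups concurrently; asyncio.sleep stands in for latency.
import asyncio

async def query_crm(customer_id: str) -> dict:
    await asyncio.sleep(1.0)
    return {"customer_id": customer_id, "orders": ["ORD-01002"]}

async def get_tracking(order_id: str) -> dict:
    await asyncio.sleep(1.0)
    return {"order_id": order_id, "eta": "tomorrow"}

async def gather_context(customer_id: str, order_id: str):
    # Both results arrive in roughly 1 second instead of 2 run back to back.
    return await asyncio.gather(query_crm(customer_id), get_tracking(order_id))

crm, tracking = asyncio.run(gather_context("12345", "ORD-01002"))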
Common failure modes
The infinite loop. Agent keeps trying the same action expecting different results. Fix: track action history, detect repetition, force different approach or escalation.
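A sketch of that repetition check; the threshold is arbitrary.

# Count how often the proposed action was already tried with identical
# arguments; past a threshold, force a different approach or escalate.
import json

def is_repeat(proposed: dict, past_actions: list, limit: int = 2) -> bool:
    key = json.dumps(proposed, sort_keys=True)
    repeats = sum(1 for action in past_actions
                  if json.dumps(action, sort_keys=True) == key)
    return repeats >= limit

# In the agent loop: if is_repeat(decision, past_actions) is True, inject an
# observation like "You already tried this; try a different approach or
# escalate" instead of executing the tool again.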
Hallucinated tool calls. Agent invents parameters that don't exist. Fix: strict schema validation, clear error messages, examples in tool descriptions.
Lost context. In long conversations, the agent forgets earlier information. Fix: explicit context windows, summarization between steps, vector retrieval for long context.
Overconfidence. Agent produces wrong answer but presents it with full confidence. Fix: calibration training, uncertainty signals, output validation, human review for edge cases.
Scope creep. Agent tries to help with things outside its capabilities. Fix: explicit scope boundaries in system prompt, out-of-scope detection, graceful refusal.
Getting started
If you're building your first agent:
- Pick ONE specific task with clear success criteria
- Define 3-5 tools the agent needs
- Build a simple ReAct loop
- Run 100 test cases, measure success rate (a minimal harness is sketched after this list)
- Analyze failures, add targeted fixes
- Iterate until you hit your accuracy target
- Only then consider multi-agent or more complex patterns
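A minimal sketch of the measurement step from the list above; each test case supplies its own input and pass/fail check.

# Run the agent over labeled test cases and report success rate and failures.
def evaluate(agent, test_cases: list) -> float:
    failures = []
    for case in test_cases:
        output = agent(case["input"])
        if not case["check"](output):  # each case brings its own pass/fail check
            failures.append({"input": case["input"], "output": output})
    success_rate = 1 - len(failures) / len(test_cases)
    print(f"success rate: {success_rate:.0%} ({len(failures)} failures to analyze)")
    return success_rate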
Most teams overcomplicate too early. A well-tuned simple agent beats a poorly-tuned complex system every time.
If you're working on agentic systems and want a second opinion on your architecture, reach out. Pharosyne has deployed agents in production for logistics, customer service, document processing, and code generation. For more on when multi-agent architectures make sense, see the guide on multi-agent systems in business, or explore the consulting services available.
If this article was helpful and you want to explore how to apply these ideas in your company, schedule a call.