
RAG vs fine-tuning: when to use each (and when to use neither)

A practical guide to choosing between RAG and fine-tuning for your LLM project. Real costs, real latency, real trade-offs.

Pharosyne Editorial

A B2B SaaS company Pharosyne worked with last year wanted their support bot to answer questions using their internal documentation. They had 2,000+ help articles, product specs, and troubleshooting guides. Their first instinct was fine-tuning. Train the model on their docs, make it "know" their product.

Three months and €40k later, they had a fine-tuned model that was marginally better at their specific terminology but still hallucinated answers. Worse, every time they updated documentation they needed to retrain, which cost more money and took weeks to validate.

Pharosyne replaced it with a RAG system in four weeks. Accuracy went up. Costs went down. Updates were instant.

Fine-tuning wasn't wrong as a technology. It was wrong for their problem.

What RAG actually does

RAG stands for Retrieval-Augmented Generation. The name is accurate: you retrieve relevant documents first, then generate a response using those documents as context.

The flow is simple:

  1. User asks a question
  2. System searches a vector database for relevant documents
  3. Top results get stuffed into the prompt as context
  4. LLM generates an answer using that context
  5. Optionally, you show citations

The model doesn't "know" your data. It reads it on demand, every time someone asks a question. Like a human with access to a search engine.
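
For readers who want to see it concretely, here is a minimal sketch of that loop in Python. The model names are placeholders and the in-memory "index" stands in for a real vector database; any provider follows the same shape.

```python
# Minimal RAG loop: embed the question, find similar chunks, answer with them as context.
# Assumes OPENAI_API_KEY is set; model names and the tiny in-memory "index" are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Step 0 (offline): embed your documentation chunks once and keep the vectors.
doc_chunks = ["How to reset a password: ...", "Billing cycles explained: ..."]
doc_vectors = embed(doc_chunks)

def answer(question, top_k=2):
    # 1-2. Embed the question and find the most similar chunks (cosine similarity).
    q = embed([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(doc_chunks[i] for i in top)

    # 3-4. Stuff the retrieved chunks into the prompt and generate.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Cite the chunk you used."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How do I reset my password?"))
```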

What RAG is good at:

Factual accuracy. When the model has the source document in its context window, hallucinations drop dramatically. In Pharosyne's experience, error rates fall from 20-30% to under 5% for straightforward Q&A.

Fresh data. Update a document, and the next query uses the new version. No retraining, no validation, no waiting.

Auditability. You can show exactly which documents informed each answer. Critical for regulated industries.

Cost predictability. You pay per query, not for training runs. Easier to budget.

What RAG is bad at:

Style and tone. RAG doesn't change how the model writes. If you need a specific voice, RAG won't give it to you.

Complex reasoning over large datasets. The context window is finite. If the answer requires synthesizing information from 50 documents, RAG struggles.

Speed. Every query needs a vector search plus a longer prompt. Adds 100-500ms minimum.

What fine-tuning actually does

Fine-tuning modifies the model's weights. You train it on examples of inputs and desired outputs. The model "learns" patterns from your data.
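
Concretely, "examples of inputs and desired outputs" usually means a file of chat-formatted training examples. The sketch below writes a couple in the JSONL format OpenAI's fine-tuning API expects; the brand-voice examples are invented for illustration.

```python
# A handful of fine-tuning examples in the JSONL chat format used by OpenAI's
# fine-tuning API. The content is invented; real projects need hundreds to
# thousands of examples like these.
import json

examples = [
    {"messages": [
        {"role": "system", "content": "You write release notes in Acme's house style: short, direct, no exclamation marks."},
        {"role": "user", "content": "We shipped SSO support and fixed the CSV export bug."},
        {"role": "assistant", "content": "Added: single sign-on (SSO) for all plans. Fixed: CSV exports no longer drop the final row."},
    ]},
    {"messages": [
        {"role": "system", "content": "You write release notes in Acme's house style: short, direct, no exclamation marks."},
        {"role": "user", "content": "Dark mode is here, plus faster search."},
        {"role": "assistant", "content": "Added: dark mode. Improved: search results now load roughly twice as fast."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```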

What fine-tuning is good at:

Consistent style and tone. If you need the model to write like your brand, fine-tuning is the way. Marketing copy, specific formatting, domain-specific terminology used correctly.

Task specialization. Classification, extraction, structured output in specific formats. Fine-tuning can make the model reliably output JSON in your exact schema.

Latency. No retrieval step. The model just generates. Can be 200-400ms faster than RAG for equivalent tasks.

Handling implicit knowledge. Things that are hard to document but easy to demonstrate through examples. "Write like our senior engineer would explain it."

What fine-tuning is bad at:

Factual recall. Fine-tuned models still hallucinate, and often they hallucinate more confidently. They don't "know" your docs; they've seen statistical patterns in them.

Keeping current. Every data update means retraining. Which means dataset preparation, training runs, evaluation, deployment. Weeks of work for each update cycle.

Cost. Training runs are expensive. GPT-4 fine-tuning starts at $8 per million tokens for training. A single training run on a decent dataset can cost $500-2000. Then you pay more per inference than the base model.

Debugging. When a fine-tuned model gives wrong answers, figuring out why is hard. Did the training data have errors? Was there distribution mismatch? Did you overtrain?

The decision framework

Start by asking what problem you're actually solving:

Choose RAG when:

  • Your data changes frequently (weekly or faster)
  • Accuracy matters more than style
  • You need to cite sources
  • You're answering questions from a knowledge base
  • Budget is limited
  • You need to launch fast

Choose fine-tuning when:

  • You need consistent style/tone/format
  • The task is classification or extraction
  • Latency is critical (sub-second responses)
  • Your data is stable (changes quarterly or slower)
  • You have clear input/output examples, hundreds or thousands of them
  • You can afford the ongoing training costs

Choose both when:

  • You need accurate facts AND specific style
  • Example: Customer support that answers correctly AND sounds on-brand

Choose neither when:

  • A well-crafted prompt with the base model works
  • Seriously, try this first. Modern models are good. Many projects over-engineer.

Real costs breakdown

Here are actual numbers from recent Pharosyne projects:

RAG system (mid-size, ~10k documents):

  • Vector database: $50-200/month (Pinecone, Weaviate cloud)
  • Embedding generation: ~$0.0001 per document (one-time)
  • Query costs: ~$0.01-0.03 per query (embedding + LLM)
  • Development: 2-4 weeks
  • Maintenance: 2-4 hours/month

Fine-tuning project:

  • Dataset preparation: 1-2 weeks of work
  • Training run: $500-2000 per run
  • Evaluation: 1 week per iteration
  • Expect 3-5 iterations minimum: $1500-10000 total
  • Per-query costs: 2-3x base model pricing
  • Updates: Same cost every time data changes

For most projects Pharosyne sees, RAG has 3-5x better ROI in the first year. Fine-tuning catches up only if your data is stable and query volume is very high.
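
If you want to sanity-check that against your own situation, a back-of-envelope script is enough. The sketch below uses midpoints from the lists above; query volume, day rate, and retraining cadence are assumptions to replace with your own numbers, and it compares cost only, not the value of more accurate answers.

```python
# Back-of-envelope first-year cost comparison using midpoints from the
# breakdown above. Query volume, day rate, and retraining cadence are
# assumptions; plug in your own figures.
QUERIES_PER_MONTH = 20_000           # assumed volume
DEV_DAY_RATE = 800                   # assumed blended day rate

rag_first_year = (
    125 * 12                         # vector DB at ~$125/month
    + 0.02 * QUERIES_PER_MONTH * 12  # ~$0.02 per query
    + 15 * DEV_DAY_RATE              # ~3 weeks of development
    + 3 * 12 * (DEV_DAY_RATE / 8)    # ~3 hours/month maintenance
)

finetune_first_year = (
    4 * 1250                         # ~4 training iterations at ~$1,250 each
    + 0.03 * QUERIES_PER_MONTH * 12  # higher per-query inference cost (assumed)
    + (7 + 20) * DEV_DAY_RATE        # dataset prep plus evaluation cycles
    + 2 * 4 * 1250                   # two data-refresh retraining cycles/year (assumed)
)

print(f"RAG, year one:         ~${rag_first_year:,.0f}")
print(f"Fine-tuning, year one: ~${finetune_first_year:,.0f}")
```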

Implementation pitfalls

RAG pitfalls:

Chunking strategy matters more than you think. Chunk too small, you lose context. Chunk too big, you waste tokens and reduce precision. Pharosyne typically starts with 512 tokens with 50-token overlap, then tunes based on results.
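
As a reference point, a token-based chunker with that 512/50 starting point is only a few lines; tiktoken is used here for counting, which is an implementation choice rather than a requirement.

```python
# Token-based chunking with overlap: 512-token chunks, 50-token overlap as a
# starting point, tuned later against retrieval results.
import tiktoken

def chunk_text(text, chunk_tokens=512, overlap=50, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
        start += chunk_tokens - overlap  # slide forward, keeping 50 tokens of overlap
    return chunks

chunks = chunk_text(open("help_article.txt").read())
print(len(chunks), "chunks")
```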

Embedding model choice affects everything. OpenAI's ada-002 is fine for English. For multilingual or specialized domains, test alternatives. Pharosyne has seen 20%+ accuracy improvements from switching embedding models.

Don't skip reranking. Vector search gets you candidates. A reranker (cross-encoder) picks the best ones. Adds 50-100ms but can double relevance.
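
With sentence-transformers, a cross-encoder reranker is a few lines. The checkpoint named below is a common public one, not a recommendation for your specific domain.

```python
# Rerank vector-search candidates with a cross-encoder: score each
# (query, document) pair jointly, then keep the best ones.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=5):
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

candidates = ["Rotate API keys from Settings > Security.", "Billing cycles run monthly."]
print(rerank("how do I rotate my API key?", candidates, top_k=1))
```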

Hybrid search usually wins. Combine vector search with keyword search (BM25). Some queries are better served by exact matches.
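
A simple way to merge the two result lists is reciprocal rank fusion, sketched below; rank_bm25 stands in for whatever keyword index you already run, and the vector ranking would come from your vector database.

```python
# Hybrid retrieval: merge keyword (BM25) and vector rankings with reciprocal
# rank fusion (RRF). rank_bm25 is a stand-in keyword index; the vector ranking
# placeholder would come from your vector search.
from rank_bm25 import BM25Okapi

docs = ["reset your password from the login page", "invoices are emailed monthly"]
bm25 = BM25Okapi([d.split() for d in docs])

def rrf(rankings, k=60):
    # rankings: lists of doc indices, best first. k=60 is the classic RRF constant.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

query = "how do I reset my password"
keyword_ranking = bm25.get_scores(query.split()).argsort()[::-1].tolist()
vector_ranking = [0, 1]  # placeholder: indices as returned by your vector search

fused = rrf([keyword_ranking, vector_ranking])
print([docs[i] for i in fused])
```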

Fine-tuning pitfalls:

More data isn't always better. Quality over quantity. 500 excellent examples beat 5000 mediocre ones.

Evaluate before and after. If you don't have a test set with measurable metrics, you can't know whether fine-tuning helped.

Overfitting is real. Your model can memorize training data and fail on new inputs. Always hold out a test set.

Forgetting is real too. Fine-tuning on narrow data can degrade general capabilities. Test for regression.
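
A minimal guardrail against the last three pitfalls is a fixed held-out test set, scored the same way before and after fine-tuning. In the sketch below, exact match is a stand-in metric and ask_model is a placeholder for however you call each model.

```python
# Hold out a fixed test set and score the base and fine-tuned models on it the
# same way. Exact match is a stand-in metric; use whatever fits your task
# (F1, valid-JSON rate, rubric scores, ...). `ask_model` is a placeholder.
import random

def split(examples, test_fraction=0.2, seed=42):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]  # train, held-out test

def exact_match_rate(model_name, test_set, ask_model):
    hits = sum(
        ask_model(model_name, ex["input"]).strip() == ex["expected"].strip()
        for ex in test_set
    )
    return hits / len(test_set)

# train, test = split(all_examples)
# print("base:      ", exact_match_rate("gpt-4o-mini", test, ask_model))
# print("fine-tuned:", exact_match_rate("ft:gpt-4o-mini:acme::abc123", test, ask_model))
```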

When clients come to Pharosyne

Most clients who think they need fine-tuning actually need RAG. The opposite is rare.

The pattern Pharosyne observes: teams read about fine-tuning, get excited about "training their own model," spend months on it, then realize they needed a search system all along.

If you're not sure which you need, start with RAG. You can always add fine-tuning later for style. Going the other way is harder.

For a deeper dive into when complex architectures make sense, see Pharosyne's guide on multi-agent systems. And if you want an evaluation of your specific case, get in touch. Pharosyne has built both systems for enterprises and can assess your situation through the consulting services offered.

LET'S TALK

If this article was helpful and you want to explore how to apply these ideas in your company, schedule a call.
