Building an AI Chatbot That Actually Works: Lessons from Production RAG Systems

by Maven Team, Software Development

The gap between demo and production

Building an AI chatbot demo takes an afternoon. You throw your documents into a vector database, connect an LLM, and ask it questions. It works surprisingly well — for about ten minutes. Then someone asks a question the documents do not cover and the chatbot confidently invents an answer. Or someone asks a specific question and gets a vague, generic response that could have come from a Google search.

The gap between a demo that impresses stakeholders and a production system that employees or customers actually trust is enormous. This article covers the lessons we have learned closing that gap.

What is RAG and why does it matter?

RAG stands for Retrieval-Augmented Generation. Instead of relying solely on what the LLM was trained on, a RAG system retrieves relevant documents from your own data and includes them in the prompt. The LLM then generates an answer based on your specific content rather than its general training data.

This is how you build a chatbot that knows about your products, your policies, your documentation, or your internal processes — without fine-tuning a model, which is expensive and often unnecessary.

The architecture is straightforward:

  1. Your documents are split into chunks and converted to vector embeddings
  2. When a user asks a question, the question is also converted to an embedding
  3. The most relevant document chunks are retrieved by similarity search
  4. The retrieved chunks are included in the prompt alongside the user's question
  5. The LLM generates an answer grounded in your actual content

Simple in theory. The devil is in every single step.
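The five steps above can be sketched end to end in a few lines. Everything here is a stand-in: the embed function fakes an embedding with word counts, and the final LLM call is left out entirely. The point is the shape of the pipeline, not the implementation.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words term counts. A real system would
    # call an embedding model here (steps 1 and 2).
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    # Step 3: rank chunks by similarity to the question embedding
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, chunks, k=2):
    # Step 4: include the retrieved chunks alongside the question;
    # this prompt then goes to the LLM for step 5
    context = "\n\n".join(retrieve(question, chunks, k))
    return f"Answer using only these documents:\n{context}\n\nQuestion: {question}"

chunks = [
    "The return window is 30 days from the date of purchase.",
    "Shipping takes 3 to 5 business days.",
]
prompt = build_prompt("What is the return window?", chunks, k=1)
```

With real embeddings, a vector database, and an LLM behind it, this skeleton is the whole architecture. Every lesson below is about making one of these steps survive contact with real users.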

Lesson 1: Chunking strategy matters more than model choice

The most common mistake we see is teams spending weeks evaluating LLMs but only five minutes on their chunking strategy. The quality of your retrieval determines the quality of your answers, and retrieval quality depends almost entirely on how you split your documents.

Bad chunking: Split every document into 500-token chunks at arbitrary boundaries. A paragraph about your return policy gets split in half. The first half is retrieved; the second half — which contains the actual return window — is not. The chatbot gives an incomplete answer.

Better chunking: Split documents by semantic boundaries — headings, paragraphs, sections. Preserve context by including the document title and section heading with every chunk. A question about return policies retrieves the complete return policy section, not an arbitrary fragment of it.

We typically use a combination of approaches:

  • Heading-based splitting for structured documents (documentation, FAQs, policies)
  • Paragraph-based splitting with overlap for unstructured content (emails, reports)
  • Table-aware splitting that keeps table rows and headers together rather than serialising them into broken text

Getting chunking right often improves answer quality more than switching from one LLM to another.
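As a sketch of the heading-based approach, the splitter below cuts a markdown-style document at headings and prefixes every chunk with the document title and section heading, so each chunk carries its own context into retrieval. The function and format are illustrative, not our production splitter.

```python
import re

def chunk_by_heading(doc_title, text):
    """Split on markdown-style headings, attaching the document title and
    section heading to every chunk so retrieval keeps its context."""
    chunks, heading, lines = [], None, []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append(f"{doc_title} > {heading}\n{body}")

    for line in text.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            flush()                      # close the previous section
            heading, lines = m.group(1), []
        else:
            lines.append(line)
    flush()                              # close the final section
    return chunks
```

A question about returns now retrieves a chunk that begins "Store policy > Returns" rather than an anonymous fragment, which also makes citation instructions (Lesson 3) far easier to satisfy.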

Lesson 2: Hybrid search beats pure vector search

Pure vector search works well for conceptual questions — "What is your approach to data privacy?" — but struggles with exact-match queries — "What is the price of product SKU-4521?" Vector embeddings capture meaning, not exact text.

We use hybrid search that combines vector similarity with keyword matching (BM25). The keyword component handles exact matches and specific terms. The vector component handles semantic similarity and paraphrased questions. A reciprocal rank fusion step merges the two result sets.

On one project, switching from pure vector search to hybrid search improved answer accuracy on factual questions by over 30%.
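The fusion step itself is small. A minimal reciprocal rank fusion might look like this, where each input list is ordered best-first and the constant k=60 is the conventional default; the document IDs are made up for illustration.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists: each document scores 1 / (k + rank)
    per list it appears in, and documents found by both retrievers
    rise to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_privacy", "doc_faq", "doc_pricing"]     # semantic matches
keyword_hits = ["doc_pricing", "doc_sku_4521"]              # BM25 matches
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here doc_pricing wins because both retrievers found it, even though neither ranked it first. That is exactly the behaviour you want for queries that are half conceptual, half exact-match.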

Lesson 3: The system prompt is your quality control

A well-crafted system prompt is the difference between a chatbot that hallucinates and one that says "I don't have enough information to answer that."

Our system prompts always include:

  • Role definition. "You are a customer support assistant for [Company]. You answer questions based only on the provided documents."
  • Citation instructions. "Always reference which document your answer comes from. If the provided documents do not contain the answer, say so."
  • Boundary rules. "Do not answer questions about competitors. Do not provide medical, legal, or financial advice. Redirect these questions to the appropriate team."
  • Tone guidance. Matched to the client's brand voice — formal for financial services, casual for consumer brands.

The system prompt is not a one-time exercise. We iterate on it continuously based on real user questions and failure cases.
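Assembled in code, the four elements might look like the sketch below. The wording is illustrative, not a prompt we ship verbatim; the whole point of the previous paragraph is that the real text is iterated on continuously.

```python
def build_system_prompt(company, tone="professional and concise"):
    """Assemble a system prompt from the four elements above:
    role, citations, boundaries, and tone."""
    return "\n".join([
        # Role definition
        f"You are a customer support assistant for {company}.",
        "You answer questions based only on the provided documents.",
        # Citation instructions
        "Always reference which document your answer comes from.",
        "If the provided documents do not contain the answer, say so.",
        # Boundary rules
        "Do not answer questions about competitors.",
        "Do not provide medical, legal, or financial advice;"
        " redirect these questions to the appropriate team.",
        # Tone guidance
        f"Tone: {tone}.",
    ])
```

Keeping the prompt in code rather than pasted into a console also means every change is version-controlled and can be run through the evaluation suite described next.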

Lesson 4: Evaluation is not optional

You cannot improve what you do not measure. Every RAG system we build includes an evaluation framework from day one.

We maintain a test suite of questions with known correct answers. This suite is run automatically whenever we change the chunking strategy, update the system prompt, switch embedding models, or modify retrieval parameters. If accuracy drops, we catch it before it reaches production.

Key metrics we track:

  • Answer accuracy — does the answer correctly address the question?
  • Groundedness — is the answer supported by the retrieved documents, or did the model hallucinate?
  • Retrieval relevance — did the system retrieve the right documents for the question?
  • Response latency — how long does the user wait for an answer?

Without evaluation, you are guessing. With evaluation, you are engineering.
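A regression harness for the first three metrics can be very small. In the sketch below, answer_fn is whatever maps a question to an (answer, retrieved_chunks) pair; grading is a keyword check so the example stays self-contained, where in practice the grading step is usually an LLM-as-judge call.

```python
def evaluate(test_cases, answer_fn):
    """Run the test suite. Each case pairs a question with a phrase the
    correct answer must contain. Returns overall accuracy plus
    per-question results for accuracy and groundedness."""
    results = []
    for case in test_cases:
        answer, retrieved = answer_fn(case["question"])
        phrase = case["expected_phrase"].lower()
        results.append({
            "question": case["question"],
            # answer accuracy: does the answer address the question?
            "correct": phrase in answer.lower(),
            # groundedness: is the answer supported by a retrieved chunk?
            "grounded": any(phrase in c.lower() for c in retrieved),
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results
```

Wire this into CI so it runs on every change to chunking, prompts, embeddings, or retrieval parameters, and a drop in accuracy blocks the deploy instead of reaching users.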

Lesson 5: Users do not ask clean questions

In your test suite, questions are well-formed and specific. In production, users type things like:

  • "how do i do the thing with the account"
  • "REFUND!!!"
  • "I spoke to Sarah last week and she said I could get a discount, can you confirm"
  • A question in Welsh followed by a question in English in the same message

Your RAG system needs to handle all of these gracefully. We add a query understanding step that reformulates vague questions, extracts intent from emotional messages, and handles multilingual input. This is typically a lightweight LLM call that runs before the retrieval step.

Lesson 6: Know when to hand off to a human

The best AI chatbots know their limits. We build escalation paths into every system:

  • If the chatbot cannot find relevant documents after retrieval, it says so and offers to connect the user with a human
  • If the user explicitly asks for a human, the chatbot complies immediately
  • If the conversation reaches a configurable number of turns without resolution, the chatbot suggests human support
  • Sensitive topics (complaints, legal questions, account security) are routed to humans by default

A chatbot that confidently gives wrong answers destroys trust faster than having no chatbot at all.
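The four escalation rules above reduce to a short decision function. The keyword checks here are deliberately crude placeholders; production systems typically detect intent and sensitive topics with a classifier rather than substring matching.

```python
# Illustrative list; the real set is defined per client
SENSITIVE_TOPICS = {"complaint", "legal", "security"}

def should_escalate(user_message, retrieved_docs, turn_count, max_turns=6):
    """Decide whether to hand off to a human, per the four rules above.
    Keyword matching is a stand-in for a proper intent classifier."""
    text = user_message.lower()
    if not retrieved_docs:
        return True   # no relevant documents found: say so and offer a human
    if "human" in text or "agent" in text:
        return True   # explicit request for a human: comply immediately
    if turn_count >= max_turns:
        return True   # conversation not converging: suggest human support
    if any(topic in text for topic in SENSITIVE_TOPICS):
        return True   # sensitive topic: route to humans by default
    return False
```

Note that the empty-retrieval check comes first: a chatbot with nothing relevant to say should never get as far as generating an answer.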

The technology stack

For most RAG deployments, our stack looks like this:

  • Embedding model: Amazon Titan Embeddings or Cohere embed-english-v3.0 via AWS Bedrock
  • Vector database: Amazon S3 Vectors, pgvector (PostgreSQL extension), or Pinecone depending on scale and existing infrastructure
  • LLM: Claude via AWS Bedrock, or GPT-4o via the OpenAI API, depending on the use case
  • Orchestration: AWS Bedrock Agents or custom Python with the AWS SDK
  • Infrastructure: AWS Lambda for the API, CloudFront for the chat widget, DynamoDB or PostgreSQL for conversation history
  • Monitoring: Amazon CloudWatch with custom metrics and alarms to track query volume, retrieval latency, and error rates

Getting started

If you are considering adding an AI chatbot to your product or internal tools, start small. Pick a single, well-documented use case — customer FAQ, internal knowledge base, or product documentation. Get the retrieval right, evaluate rigorously, and expand from there.

We offer a two-week proof of concept that delivers a working RAG chatbot connected to your actual documents. This gives you a realistic assessment of what AI can do for your specific content before committing to a full build.

The technology is genuinely powerful. But only when it is built with the same engineering discipline you would apply to any other production system.

Learn more about our AI integration services, or get in touch to discuss a proof of concept for your use case.
