AI / ML · 2025

Document RAG Assistant

A production-quality, fully open-source retrieval-augmented generation chatbot. Answers questions grounded strictly in your documents with mandatory citations. Model-agnostic — works with Claude, Gemini, GPT, or local Ollama models.

FastAPI · ChromaDB · Ollama · Python

This is a production-quality RAG (Retrieval-Augmented Generation) chatbot: 100% open source, model-agnostic, and designed specifically to prevent hallucination. Feed it your documents, ask questions, and every answer comes back with source citations.

Why naive chatbots fail

If you give a standard LLM a question about your documents, it will happily generate an answer — whether or not the answer is actually in the documents. It'll mix its training data with your content, confabulate details, and present everything with equal confidence. For legal, medical, or technical documents, this is dangerous.

RAG solves this by separating retrieval from generation. Instead of asking the LLM to remember everything, you find the relevant passages first, then ask the LLM to synthesise an answer from only those passages.

The ingestion pipeline

Documents (PDF, Markdown, or plain text) go through a four-stage pipeline. Text extraction handles each format while preserving structure. Chunking splits the text into 512-token chunks with a 50-token overlap, so context that straddles a boundary appears in both neighbouring chunks. Embedding converts each chunk into a 768-dimensional vector using BAAI/bge-base-en-v1.5, running entirely on your local machine. Finally, the embeddings and their metadata are stored in ChromaDB.
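The chunking step can be sketched as a sliding window over token IDs. This is a minimal illustration rather than the project's actual code: the function name `chunk_tokens` is hypothetical, and the integer list stands in for the output of a real tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], size: int = 512, overlap: int = 50
) -> List[List[int]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # each window starts this far after the previous one
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in token IDs (assumption); a real pipeline would tokenize extracted text.
tokens = list(range(1200))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the default settings each window starts 462 tokens after the previous one, so the last 50 tokens of one chunk are repeated as the first 50 of the next.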

The query pipeline

When you ask a question, the system embeds your query using the same model, searches ChromaDB for the top 5 most similar chunks via cosine similarity, and injects those chunks into a carefully designed grounding prompt. The grounding prompt explicitly instructs the LLM to only use the provided context, cite sources for every claim, and refuse to answer if the context is insufficient.
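To make the retrieval step concrete, here is a small pure-Python sketch of top-k search by cosine similarity. It is a stand-in for what the vector store does internally, not the project's code: the helper names `cosine_similarity` and `top_k` are hypothetical, and the toy three-dimensional vectors replace real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], chunk_vectors: Sequence[Sequence[float]], k: int = 5
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (assumption); real chunk embeddings have hundreds of dimensions.
chunk_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
query_vector = [1.0, 0.0, 0.0]
for index, score in top_k(query_vector, chunk_vectors, k=2):
    print(index, round(score, 3))
```

This brute-force scan is only practical for small collections; a vector database exists precisely so that the search does not have to compare the query against every stored chunk in application code.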

Hallucination prevention

The anti-hallucination system has four layers: the system prompt explicitly forbids using external knowledge; every claim must reference a source document; the model refuses to answer when the context is insufficient; and full metadata tracking makes every answer traceable to its source.
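One way these layers could fit together is sketched below. The prompt wording, the refusal sentence, and the names `build_grounding_prompt` and `is_grounded` are all assumptions made for illustration; the project's real prompt is not shown here. Note that the check only verifies that an answer cites at least one retrieved chunk and no non-existent ones. It does not verify that each individual claim is supported.

```python
import re
from typing import Dict, List, Set

# Hypothetical refusal sentence the model is told to return verbatim.
REFUSAL = "I cannot answer this from the provided documents."


def build_grounding_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Number each retrieved chunk and wrap it in grounding instructions."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the numbered context below. Do not use outside knowledge.\n"
        "Cite the supporting passage after every claim, e.g. [1].\n"
        f"If the context is insufficient, reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def cited_ids(answer: str) -> Set[int]:
    """Collect every bracketed citation number found in the answer."""
    return {int(match) for match in re.findall(r"\[(\d+)\]", answer)}


def is_grounded(answer: str, num_chunks: int) -> bool:
    """Accept a refusal, or an answer whose citations all point at real chunks."""
    if answer.strip() == REFUSAL:
        return True
    ids = cited_ids(answer)
    return bool(ids) and all(1 <= i <= num_chunks for i in ids)


chunks = [{"source": "handbook.pdf", "text": "Refunds are issued within 14 days."}]
print(build_grounding_prompt("How long do refunds take?", chunks))
print(is_grounded("Refunds are issued within 14 days [1].", len(chunks)))
```

Keeping the source name next to each numbered passage is what lets a citation like [1] be traced back to a specific document after the answer is generated.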

Model agnostic

The system works with any LLM provider — Claude, Gemini, GPT, or local models via Ollama. Switching providers is a single environment variable change. The recommended setup uses Ollama with Llama 3.2 for fully local, privacy-preserving inference.
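A single-variable provider switch might look like the following sketch. The variable name `LLM_PROVIDER`, the `ProviderConfig` type, and the API key variable names are assumptions for illustration, not the project's documented configuration.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    api_key_env: Optional[str]  # None means no key is needed (local inference)


# Hypothetical registry; the key variable names are assumed conventions.
PROVIDERS: Dict[str, ProviderConfig] = {
    "ollama": ProviderConfig("ollama", None),
    "claude": ProviderConfig("claude", "ANTHROPIC_API_KEY"),
    "gemini": ProviderConfig("gemini", "GEMINI_API_KEY"),
    "gpt": ProviderConfig("gpt", "OPENAI_API_KEY"),
}


def load_provider(env: Mapping[str, str] = os.environ) -> ProviderConfig:
    """Pick a provider from LLM_PROVIDER, defaulting to local Ollama."""
    name = env.get("LLM_PROVIDER", "ollama").strip().lower()
    if name not in PROVIDERS:
        raise ValueError(f"Unknown provider {name!r}; expected one of {sorted(PROVIDERS)}")
    config = PROVIDERS[name]
    if config.api_key_env and not env.get(config.api_key_env):
        raise RuntimeError(f"{config.api_key_env} must be set to use {name}")
    return config


# Explicit mappings are used here so the example does not depend on the shell.
print(load_provider({}).name)
print(load_provider({"LLM_PROVIDER": "claude", "ANTHROPIC_API_KEY": "example"}).name)
```

Defaulting to the local provider means a fresh checkout runs without any API keys, which matches the privacy-preserving setup recommended above.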

The goal was to build something genuinely useful — not a demo, but a system you'd actually deploy for real document Q&A where accuracy matters more than creativity.
