Internal Document Search with AI: Build a Local Prototype to Understand the Process
In this proof of concept, I share how I built a local RAG system with Qwen 1.5 (0.5B), Ollama, and LangChain: a step-by-step pipeline for querying your own documents and understanding how the process works.
Searching for internal documents is still a huge bottleneck in many companies. Most teams rely on chaotic folders, outdated knowledge bases or asking colleagues the same questions again and again.
What if you could ask questions in natural language like:
"What does our onboarding policy say about remote work?"
"Summarize the quarterly results from last year"
"Find the section about compliance in our security guide"
...and get a real answer from your own documents in seconds?
This use case is called Retrieval-Augmented Generation (RAG), and it's one of the fastest ways to bring LLMs into real business workflows.
But most RAG systems today are built for cloud deployments, with heavy infrastructure or expensive APIs. I wanted to test how far we can go completely locally, on a modest Windows laptop, with no server, no subscription, and full control.
My Setup
I ran this POC on a Lenovo laptop with 12GB RAM running Windows 11.
To keep everything local and open, I used:
Qwen 1.5 (0.5B): a small open-source model from Alibaba, capable enough for this kind of question answering and light enough for local use
Ollama: a fantastic tool to run LLMs and embedding models locally
LangChain: to build the pipeline and logic
ChromaDB: for the vector database
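Before any of the code below will work, both models have to be pulled into Ollama's local store. A minimal sketch using the ollama Python client (assuming the ollama package is installed; the same can be done from a terminal with ollama pull):
python
import ollama

# Download the generation model and the embedding model into Ollama's local store
ollama.pull("qwen:0.5b")
ollama.pull("nomic-embed-text")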
Core Architecture
The system is simple but effective:
Load a PDF document
Split it into chunks
Generate vector embeddings for each chunk
Store them in a local ChromaDB instance
Accept natural language queries
Retrieve relevant chunks
Send the context + query to the Qwen 1.5 (0.5B) model
Display a synthesized answer
All of this runs locally, using Ollama’s built-in API and LangChain’s native integration for local models.
Step-by-Step Reasoning
Let me walk you through the main pieces of the puzzle and what tradeoffs I made.
1. Document Ingestion
I used PyPDFLoader from LangChain to load a PDF file. You can also use UnstructuredPDFLoader if your document is more complex.
python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/mydocument.pdf")
documents = loader.load()
2. Chunking the Text
I split the text using RecursiveCharacterTextSplitter, with a chunk_size of 1000 and an overlap of 200. This size gave a good balance between retrieval precision and context size.
python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = text_splitter.split_documents(documents)
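A quick sanity check I'd suggest at this point (not strictly part of the pipeline) is to look at how many chunks came out of the split and what one of them looks like:
python
# Inspect the split before spending time on embeddings
print(f"{len(documents)} pages loaded, {len(chunks)} chunks produced")
print(chunks[0].page_content[:200])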
3. Embedding Model
I used nomic-embed-text running on Ollama for local embeddings. This avoids any external calls and keeps everything self-contained.
python
from langchain_ollama import OllamaEmbeddings
embedding_function = OllamaEmbeddings(model="nomic-embed-text")
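To confirm the embedding model is reachable through Ollama, you can embed a sample query and check the vector that comes back; this is just a sanity check, not part of the original pipeline:
python
# Every chunk and every query is mapped to a dense vector of fixed size
vector = embedding_function.embed_query("What does our onboarding policy say about remote work?")
print(len(vector))  # dimensionality of the embedding space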
4. Vector Store with ChromaDB
Once we have the embeddings, we store them in ChromaDB, a fast local vector database.
python
from langchain_chroma import Chroma  # or: from langchain_community.vectorstores import Chroma

# Use from_documents for the initial creation; it embeds and stores all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_function,
    persist_directory="chroma_db",
)
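Two small additions worth noting here, both assuming the same embedding function and persist directory as above: on later runs you can reload the persisted store instead of re-embedding the PDF, and you can query it directly to see which chunks a question pulls back.
python
# Reload the existing store on subsequent runs instead of rebuilding it
vectorstore = Chroma(
    persist_directory="chroma_db",
    embedding_function=embedding_function,
)

# Preview the chunks retrieved for a sample question
for doc in vectorstore.similarity_search("What are the key risks mentioned?", k=3):
    print(doc.page_content[:120])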
5. RAG Chain with Qwen
Finally, I created a LangChain pipeline that takes the user question, fetches the top 3 similar chunks, and asks the model to answer based only on that context.
python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama

# Retrieve the top 3 most similar chunks for each question
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Prompt that constrains the model to the retrieved context (illustrative wording)
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# Initialize the LLM
llm = ChatOllama(
    model="qwen:0.5b",
    temperature=0,
    num_ctx=8192,
)

# Define the RAG chain using LCEL
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# In main: call the chain with the user's question
...
response = rag_chain.invoke(question)
...
This is enough to answer questions like:
“What’s the purpose of this document?”
“Summarize the conclusion”
“What are the key risks mentioned?”
Results and Limitations
I was able to run the entire system on my laptop with only 12GB of RAM
Answers were grounded in the document, not hallucinated
The system worked fully offline
This is not production-grade. For example:
Ollama is great for a PoC, but it isn't optimized for concurrent requests or low latency
ChromaDB works well locally but has limitations at scale
Qwen 1.5 (0.5B) fits in memory, but even so, context size and speed are bottlenecks
Why This Matters
I share this not as a finished product, but as a foundation for prototyping real enterprise use cases.
If you're a company thinking “we want to search internal PDFs using AI, but we don’t want vendor lock-in or cloud exposure,” this is a solid starting point.
From here, you can:
Swap Qwen for any other model (see the sketch after this list)
Use remote embeddings if needed
Deploy the same logic to a lightweight server or internal tool
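For example, swapping the generation model or the embeddings is a one-or-two-line change. A minimal sketch, assuming mistral has already been pulled in Ollama and, for the remote variant, that langchain_openai is installed with an API key configured:
python
from langchain_ollama import ChatOllama
from langchain_openai import OpenAIEmbeddings

# Swap the local LLM for a larger Ollama model (pull it first: ollama pull mistral)
llm = ChatOllama(model="mistral", temperature=0)

# Or trade local embeddings for a hosted embedding API if cloud access is acceptable
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")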
Get the Full Code
You can find all the setup steps, requirements, and full working code here:
🔗 GitHub Repository
Coming Next: Local AI Agents with Tools
In the next post, I'll show how to build a basic AI agent that uses tools, again with Qwen and Ollama, still running locally and with almost no setup.
If you're exploring similar use cases, I’ll be publishing more hands-on prototypes like this—always grounded in real constraints, and always with enough detail to replicate or adapt.
Until next time,
Nina
I’m on LinkedIn if you want to connect.