Internal Document Search with AI: Build a Local Prototype to Understand the Process
In this proof of concept, I share how I built a local RAG system with Qwen 1.5 (0.5B), Ollama, and LangChain: a step-by-step pipeline for querying your own documents and understanding how the process works.
Searching for internal documents is still a huge bottleneck in many companies. Most teams rely on chaotic folders, outdated knowledge bases or asking colleagues the same questions again and again.
What if you could ask questions in natural language like:
"What does our onboarding policy say about remote work?"
"Summarize the quarterly results from last year"
"Find the section about compliance in our security guide"
...and get a real answer from your own documents in seconds?
This use case is called Retrieval-Augmented Generation (RAG), and it's one of the fastest ways to bring LLMs into real business workflows.
But most RAG systems today are built for cloud deployments, with heavy infrastructure or expensive APIs. I wanted to test how far we can go completely locally, on a modest Windows laptop, with no server, no subscription, and full control.
My Setup
I ran this POC on a Lenovo laptop with 12GB RAM running Windows 11.
To keep everything local and open, I used:
Qwen 1.5 (0.5B): a small open-source model from Alibaba, capable enough for this kind of question answering and light enough for local use
Ollama: a fantastic tool to run LLMs and embedding models locally
LangChain: to build the pipeline and logic
ChromaDB: for the vector database
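Before any of the code below will work, both models have to be pulled into Ollama's local store. A minimal sketch using the ollama Python client (assuming the ollama package is installed; the same can be done from a terminal with ollama pull):
python
import ollama

# Download the generation model and the embedding model into Ollama's local store
ollama.pull("qwen:0.5b")
ollama.pull("nomic-embed-text")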
Core Architecture
The system is simple but effective:
Load a PDF document
Split it into chunks
Generate vector embeddings for each chunk
Store them in a local ChromaDB instance
Accept natural language queries
Retrieve relevant chunks
Send the context + query to the Qwen 1.5 (0.5B) model
Display a synthesized answer
All of this runs locally, using Ollama’s built-in API and LangChain’s native integration for local models.
Step-by-Step Reasoning
Let me walk you through the main pieces of the puzzle and what tradeoffs I made.
1. Document Ingestion
I used PyPDFLoader from LangChain to load a PDF file. You can also use UnstructuredPDFLoader if your document is more complex.
python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/mydocument.pdf")
documents = loader.load()
2. Chunking the Text
I split the text using RecursiveCharacterTextSplitter, with a chunk_size of 1000 and an overlap of 200. This size gave a good balance between retrieval precision and context size.
python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = text_splitter.split_documents(documents)
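A quick sanity check I'd suggest at this point (not strictly part of the pipeline) is to look at how many chunks came out of the split and what one of them looks like:
python
# Inspect the split before spending time on embeddings
print(f"{len(documents)} pages loaded, {len(chunks)} chunks produced")
print(chunks[0].page_content[:200])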
3. Embedding Model
I used nomic-embed-text running on Ollama for local embeddings. This avoids any external calls and keeps everything self-contained.
python
from langchain_ollama import OllamaEmbeddings
embedding_function = OllamaEmbeddings(model="nomic-embed-text")
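To confirm the embedding model is reachable through Ollama, you can embed a sample query and check the vector that comes back; this is just a sanity check, not part of the original pipeline:
python
# Every chunk and every query is mapped to a dense vector of fixed size
vector = embedding_function.embed_query("What does our onboarding policy say about remote work?")
print(len(vector))  # dimensionality of the embedding space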
4. Vector Store with ChromaDB
Once we have the embeddings, we store them in ChromaDB, a fast local vector database.
python
from langchain_chroma import Chroma  # or: from langchain_community.vectorstores import Chroma

# Use from_documents for the initial creation; it embeds and stores all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_function,
    persist_directory="chroma_db",
)
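Two small additions worth noting here, both assuming the same embedding function and persist directory as above: on later runs you can reload the persisted store instead of re-embedding the PDF, and you can query it directly to see which chunks a question pulls back.
python
# Reload the existing store on subsequent runs instead of rebuilding it
vectorstore = Chroma(
    persist_directory="chroma_db",
    embedding_function=embedding_function,
)

# Preview the chunks retrieved for a sample question
for doc in vectorstore.similarity_search("What are the key risks mentioned?", k=3):
    print(doc.page_content[:120])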
5. RAG Chain with Qwen
Finally, I created a LangChain pipeline that takes the user question, fetches the top 3 similar chunks, and asks the model to answer based only on that context.
python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama

# Retrieve the top 3 most similar chunks for each question
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Prompt that constrains the model to the retrieved context (illustrative wording)
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# Initialize the LLM
llm = ChatOllama(
    model="qwen:0.5b",
    temperature=0,
    num_ctx=8192,
)

# Define the RAG chain using LCEL
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# In main: call the chain with the user's question
...
response = rag_chain.invoke(question)
...
This is enough to answer questions like:
“What’s the purpose of this document?”
“Summarize the conclusion”
“What are the key risks mentioned?”
Results and Limitations
I was able to run the entire system on my laptop with only 12GB of RAM
Answers were grounded in the document, not hallucinated
The system worked fully offline
This is not production-grade. For example:
Ollama is great for a PoC, but it isn't optimized for concurrent requests or low latency
ChromaDB works well locally but has limitations at scale
Qwen 1.5 (0.5B) fits in memory, but even so, context size and speed are bottlenecks
Why This Matters
I share this not as a finished product, but as a foundation for prototyping real enterprise use cases.
If you're a company thinking “we want to search internal PDFs using AI, but we don’t want vendor lock-in or cloud exposure,” this is a solid starting point.
From here, you can:
Swap Qwen for any other model (see the sketch after this list)
Use remote embeddings if needed
Deploy the same logic to a lightweight server or internal tool
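For example, swapping the generation model or the embeddings is a one-or-two-line change. A minimal sketch, assuming mistral has already been pulled in Ollama and, for the remote variant, that langchain_openai is installed with an API key configured:
python
from langchain_ollama import ChatOllama
from langchain_openai import OpenAIEmbeddings

# Swap the local LLM for a larger Ollama model (pull it first: ollama pull mistral)
llm = ChatOllama(model="mistral", temperature=0)

# Or trade local embeddings for a hosted embedding API if cloud access is acceptable
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")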
Get the Full Code
You can find all the setup steps, requirements, and full working code here:
🔗 GitHub Repository
Coming Next: Local AI Agents with Tools
In the next post, I'll show how to build a basic AI agent that uses tools, again with Qwen and Ollama, still running locally and with almost no setup.
If you're exploring similar use cases, I’ll be publishing more hands-on prototypes like this—always grounded in real constraints, and always with enough detail to replicate or adapt.
Until next time,
Nina
I’m on LinkedIn if you want to connect.