You set up a RAG prototype with OpenAI's API and a hosted vector database. It works well. Your team likes it. Then you look at what it would cost to run this in production for 50, 100, 200 people - and the number makes you pause.
This is the moment most teams hit. The per-token and per-query pricing that seemed negligible during a proof of concept starts looking very different when you multiply it across an organization and project it out over a year.
If you've landed on this post, you're probably in one of two situations: you either got an unexpected bill from a cloud AI service, or you did the math proactively and realized the cost curve doesn't work for sustained usage. Either way, you're looking for the same thing - a way to run RAG-powered document search without the ongoing API costs.
Good news: it's entirely possible. This post explains how, what the tradeoffs are, and what the real cost picture looks like.
Where AI API costs actually come from
Before getting into the alternative, it helps to understand what you're actually paying for with cloud-based RAG.
A typical cloud RAG setup has three cost layers:
Embedding costs - every document you ingest needs to be converted into a mathematical representation (embedding) for search. Services like OpenAI's embedding API charge per token. If you have a large document library, the initial ingestion cost alone can be significant - and you pay again whenever you re-index.
Query costs - embedding side - every search query also needs to be converted into an embedding for matching. This is a smaller cost per query, but it adds up with volume.
Query costs - generation side - this is usually the biggest line item. When the system retrieves relevant passages and sends them to a language model (GPT-4, Claude, etc.) to generate an answer, you're paying for the input tokens (the retrieved passages) and the output tokens (the generated response). A single complex query with multiple source passages can easily consume several thousand tokens.
On top of these, there's usually a hosted vector database cost (Pinecone, Weaviate Cloud, etc.) billed by storage and query volume.
Together, for a team processing a few hundred queries a day, these costs can reach several thousand dollars per month - and the number only grows with usage. More people, more documents, more queries mean proportionally higher costs, with no ceiling.
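To make the three layers concrete, here is a rough monthly cost model in Python. Every price and volume in it is an illustrative assumption, not a quote from any provider - substitute your own numbers.

```python
# Rough monthly cost model for a cloud RAG stack.
# All prices and token counts below are illustrative assumptions.

def monthly_cloud_rag_cost(
    queries_per_day: int,
    tokens_per_query_in: int = 3000,    # retrieved passages sent to the LLM
    tokens_per_query_out: int = 400,    # generated answer
    llm_price_in: float = 10.00,        # $ per 1M input tokens (assumed)
    llm_price_out: float = 30.00,       # $ per 1M output tokens (assumed)
    embed_tokens_per_query: int = 50,
    embed_price: float = 0.10,          # $ per 1M tokens (assumed)
    vector_db_flat: float = 300.00,     # hosted vector DB, $/month (assumed)
) -> float:
    queries = queries_per_day * 30
    llm = queries * (tokens_per_query_in * llm_price_in
                     + tokens_per_query_out * llm_price_out) / 1_000_000
    embed = queries * embed_tokens_per_query * embed_price / 1_000_000
    return llm + embed + vector_db_flat

# The variable part scales linearly with query volume; only the
# vector DB fee is flat. Try your own team's numbers here:
print(f"${monthly_cloud_rag_cost(400):,.2f}/month")
```

Note what the structure of the function tells you: almost every term is multiplied by query volume, which is exactly why the bill grows without a ceiling.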
What "RAG without API costs" actually means
Running RAG without API costs means replacing every external service call in the pipeline with a local equivalent:
Local embedding model instead of OpenAI's embedding API. Open embedding models run on your own hardware and produce high-quality embeddings for document search. No per-token charge. You can re-index your entire document library whenever you want, at no additional cost.
Local language model instead of GPT-4 or Claude. Open models - Qwen, Llama, Mistral - run locally and handle the generation step. For RAG workloads, where the model's job is to read retrieved passages and answer questions based on them, these models perform well. The quality gap with cloud APIs has narrowed significantly for document search use cases.
Self-hosted vector database instead of Pinecone or Weaviate Cloud. ChromaDB, Qdrant, or Milvus run on your infrastructure with no usage-based fees.
Local document processing instead of external parsing or OCR services. Files get ingested, chunked, and indexed entirely on your servers.
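The retrieval core of such a pipeline is simpler than it sounds. The sketch below does brute-force cosine-similarity search over an in-memory index in pure Python - no external service calls. It is a toy: in a real deployment the vectors would come from a local embedding model and the index from a self-hosted vector store, and the 3-dimensional vectors here stand in for real embeddings with hundreds of dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=2):
    """Return the top_k (chunk_id, score) pairs from a local index."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy index: chunk id -> embedding vector (illustrative values).
index = {
    "policy.pdf#3": [0.9, 0.1, 0.0],
    "handbook.md#12": [0.2, 0.8, 0.1],
    "notes.txt#1": [0.1, 0.1, 0.9],
}

hits = retrieve([0.85, 0.2, 0.05], index)
print(hits[0][0])  # the chunk most similar to the query vector
```

The retrieved chunks would then be passed to the local language model as context for generation - the step that, in the cloud version, is the biggest line item.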
The result: after the initial setup, there are no per-query costs. Query #1 and query #1,000,000 cost the same - effectively zero marginal cost beyond electricity.
We covered how these components fit together technically in our architecture deep dive article.
The real cost comparison
"No API costs" doesn't mean "no costs." You need hardware to run the pipeline. Let's be specific about what that looks like.
Hardware for a production self-hosted RAG system
A capable setup - 64 GB RAM, a 24 GB VRAM GPU (RTX 3090 or 4090 class), NVMe storage - costs $5,000 to $15,000 depending on configuration. This handles a team of dozens of users with good response times.
What you're comparing it to
A cloud RAG stack doing the same work can run several thousand dollars per month - that includes embedding API costs, LLM API costs, and hosted vector database fees. The exact number depends on your query volume, document library size, and which services you're using.
The breakeven math
The breakeven point depends on your query volume, cloud spend, and which services you're using, but the structural difference is what matters: cloud costs scale with every query, self-hosted costs don't. For teams with consistent production usage, the hardware typically pays for itself within a few months.
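The breakeven calculation itself is simple arithmetic. The figures below - hardware price, monthly cloud spend, power cost - are assumptions for illustration; plug in your own.

```python
import math

def breakeven_months(hardware_cost: float,
                     monthly_cloud_cost: float,
                     monthly_power_cost: float = 50.0) -> float:
    """Months until self-hosted hardware pays for itself vs. cloud spend."""
    monthly_savings = monthly_cloud_cost - monthly_power_cost
    if monthly_savings <= 0:
        return math.inf  # at this usage level, cloud stays cheaper
    return hardware_cost / monthly_savings

# Assumed figures: $10,000 server, $3,000/month cloud spend, ~$50/month power.
print(f"{breakeven_months(10_000, 3_000):.1f} months")
```

With these assumed figures the hardware pays for itself in a few months; with a low cloud bill the function returns infinity, which is the quantitative version of "the API path still makes sense" discussed below.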
But aren't local models worse?
This is the concern that kept teams on cloud APIs in 2023 and 2024. It's worth revisiting in 2026.
For general-purpose chat and creative writing, the leading proprietary models still have an edge. But RAG is a specific workload. The language model isn't writing from scratch - it's reading retrieved document passages and generating an answer based on them. The quality of the answer depends more on whether the retrieval step found the right passages than on the raw generative capability of the model.
For this use case, open models have caught up to a point where the difference is rarely noticeable to end users. And you gain something in return: you can update and swap models at any time, test new options as they're released, and do all of this without changing vendors, contracts, or pricing agreements.
What you gain beyond cost savings
Once teams migrate to API-free RAG, they usually find that cost was actually the second-most important benefit. The first is predictability.
No surprise invoices - With cloud AI, a usage spike - a new team onboarding, a large batch of documents getting indexed, a department discovering the tool and using it heavily - creates a proportional cost spike. With self-hosted, usage spikes don't affect your bill.
No vendor dependency at runtime - If your cloud AI provider has an outage, raises prices, changes their terms of service, or deprecates an API you depend on, you're exposed. Self-hosted systems keep running regardless of what happens to external services.
Full data control - This isn't strictly a cost topic, but it matters. When no queries or documents leave your network, you avoid the compliance overhead of cloud AI data processing. No Business Associate Agreements for the AI stack. No data processing agreements to negotiate. Your governance team reviews infrastructure they already manage.
We wrote more about the data control dimension in Self-Hosted AI for Data Sovereignty.
When the API path still makes sense
This isn't an argument that everyone should run local models tomorrow.
If you're in the early exploration phase and testing whether RAG is useful for your use case at all - cloud APIs let you get a working prototype in hours. There's no hardware to provision, no models to choose and configure. The cost is small at that scale, and the speed of iteration is worth it.
If your documents are non-sensitive and you don't have infrastructure capacity, cloud RAG removes a real barrier.
The shift to self-hosted makes sense when any of these become true: you're moving to production usage, costs start scaling beyond budget, your documents are sensitive, or your organization needs to avoid vendor lock-in for AI infrastructure. For most teams doing real document search, at least one of these becomes true within the first year.
Where Selvo Lens fits
Selvo Lens is built for the API-free path. The entire RAG pipeline runs locally: document ingestion, OCR for scanned files, embedding generation, vector search, LLM inference, and AI-powered data analysis. All deployed via Docker Compose on your hardware. No external API calls at runtime, no per-token fees, no usage-based pricing.
It includes authentication, audit logging, and access controls out of the box - so you're not just saving money on API costs, you're getting a system that's ready to go through a governance conversation without the overhead of evaluating a third-party cloud vendor.
If your team is evaluating how to move from cloud RAG to self-hosted, or you want to skip the cloud step entirely - we can set up a pilot to validate performance and cost in your environment.