For enterprises that cannot send data to public AI services, an on-premise RAG chatbot is the only option. Here is what a fully private deployment looks like, what hardware you need, and what trade-offs to expect.
A fully on-premise RAG chatbot runs entirely on your own servers with no internet connection required. You need a GPU server for the language model, storage for documents and the vector index, and standard application servers. We use open weight models like Llama 3 or Mistral that can run without any external API. The result is a chatbot that never sends a single byte of your data outside your network.
Most enterprise AI projects can live happily in a private cloud account, with hosted models accessed through enterprise plans that promise data isolation. For many clients this is good enough. But for some industries and some use cases, no amount of contractual promise is enough. The data simply cannot leave the building.
This typically applies to banks operating under RBI guidance that restricts foreign data transfer, healthcare providers handling patient records, pharma companies with research data, defense contractors, government agencies, and any organisation operating in air gapped environments. For these clients, the entire AI system has to run on their own hardware, with no internet connectivity required.
This is harder than cloud deployment, but it is fully possible today. Modern open weight models like Llama 3, Mistral, and Qwen can run on a single GPU server and produce results good enough for most enterprise question answering tasks. The whole RAG stack runs on standard server hardware that any IT team can manage.
A fully private deployment has six main parts, all running on your own infrastructure.
The first part is the document ingestion pipeline. This is a server that connects to your internal sources and pulls in documents. The sources are usually your own file servers, your internal Confluence or SharePoint instance, your local databases, your Active Directory, and your internal email server. The ingestion server cleans the documents, extracts text from PDFs and Office files, and prepares them for the next step.
The second part is the embedding service. This is where documents are turned into vectors for semantic search. For on-premise deployment, we use open source embedding models like BAAI BGE or Nomic Embed. These run on standard CPU servers or a small GPU and produce embeddings without any external API call.
The third part is the vector database. This stores all the document embeddings and serves search queries. For on-premise, we typically use PostgreSQL with the pgvector extension, or Qdrant, or Weaviate. All three are open source, can run on standard servers, and scale to hundreds of millions of documents.
The fourth part is the language model server. This is the core of the chatbot. We deploy an open weight model like Llama 3 8B, Llama 3 70B, Mistral 7B, or Qwen 14B on a GPU server. We use serving software like vLLM or Text Generation Inference to make the model fast and able to handle many users at once. The model never connects to the internet.
The fifth part is the application server. This is the part that ties everything together. It receives questions from users, calls the embedding service, queries the vector database, sends the retrieved documents to the language model, applies guardrails, and returns the final answer. This is typically a Node.js or Python service running on standard servers.
The sixth part is the user interface. This is what employees actually use. A web app, a Slack bot if you run Slack on-premise, a Microsoft Teams bot for on-premise Teams, or a custom desktop application. All of this runs inside your network.
The biggest question with on-premise deployment is what hardware you need. The answer depends on the model size and how many users will be active at once.
For small teams of up to fifty active users, a single GPU server with one or two GPUs is usually enough. A common setup is an Nvidia L40S or A100 with forty eight gigabytes of GPU memory. This can comfortably run Llama 3 8B or Mistral 7B with room to handle ten to fifteen simultaneous conversations.
For medium teams of fifty to five hundred active users, we move to a server with two to four GPUs, usually H100 or A100 with eighty gigabytes of memory each. This supports larger models like Llama 3 70B if needed, or higher concurrency on smaller models.
For very large organisations with thousands of active users, we set up a small GPU cluster of four to eight servers. We use load balancing across these servers, so if any single server fails, the others continue to handle traffic. At this scale, we also typically separate the embedding service onto its own GPU server to keep response times fast.
Beyond GPUs, you need standard server hardware for the application servers, the vector database, and the ingestion pipeline. These do not need GPUs and can run on regular CPU servers. Total disk space depends on how much content you have. A typical mid sized enterprise with about ten lakh documents needs about two terabytes for the vector index, the document store, and the application data combined.
The model choice is a real decision and we test multiple options before recommending one. Each model has trade-offs.
Llama 3 from Meta is the most popular choice today. The 8 billion parameter version runs on a single forty eight gigabyte GPU and is good enough for most enterprise question answering. The 70 billion parameter version needs much more GPU memory but produces noticeably better answers on harder questions.
Mistral 7B is a strong alternative to Llama 3 8B. It is roughly the same size and produces similar quality. Some clients prefer it because of its open licence terms.
Qwen from Alibaba has versions ranging from 7 billion to 72 billion parameters. The larger versions are particularly strong at multilingual tasks, which matters if your knowledge base includes Hindi, Tamil, or other Indian language content.
Phi 3 from Microsoft is a smaller model at around 4 billion parameters. It is much faster and lighter than the others but trades some quality for speed. This is a good choice when latency is critical and questions are relatively simple.
We almost never recommend a single model in isolation. We test the top two or three candidates on your actual question bank during the pilot, and pick the one that gives the best balance of quality, speed, and hardware cost for your specific use case.
For a fully on-premise deployment, we audit every network connection in the system. Every component is configured to refuse outbound internet access. Firewalls are set up to block any attempt to reach external services. Telemetry, logging, and analytics services are configured to write only to local storage.
The model weights are downloaded once during installation, from Huggingface or directly from the model provider. After that, the network connection used for the download is removed. The model server has no need to ever connect to the internet again.
The application server only connects to internal services. The vector database, the document store, the identity provider, and the logging system are all on internal addresses. We provide a network diagram showing every connection, and the client security team reviews it before go-live.
For air gapped deployments, even the initial download is done on a separate machine, and the model files are transferred via physical media. This is rare but we have done it for the most sensitive deployments.
On-premise deployment is the most private option but it comes with trade-offs that the client team needs to understand upfront.
The first trade-off is quality. Open weight models that you can run on-premise are very good, but they are still a step behind the largest hosted models like Claude 3.5 or GPT 4o for complex reasoning tasks. For most enterprise question answering, the gap is small and often invisible. For more complex tasks like long document summarisation or multi step reasoning, the gap is noticeable.
The second trade-off is operational cost. The hardware has a real cost. A single GPU server suitable for production starts at around twenty lakh rupees. Cloud GPU rental on Indian hyperscalers like Yotta or E2E Networks is around two to four lakh rupees per month per server. Plus the ongoing cost of running the supporting infrastructure.
The third trade-off is operational responsibility. With hosted models, the API provider handles GPU failures, model updates, and scaling. With on-premise, your team has to operate this. This is fine for organisations with strong internal IT capability, but it does require commitment.
The fourth trade-off is speed of upgrade. When a new better model is released, hosted services typically have it within days. On-premise requires you to download the new weights, test them, and deploy them, which takes weeks.
For many clients, the best answer is not fully on-premise. It is hybrid. The most sensitive workloads run on-premise with open weight models. Less sensitive workloads use hosted enterprise APIs that promise data isolation. This gives the best of both worlds and matches the actual sensitivity of the data.
We help clients categorise their use cases and decide which deployment mode fits each one. A common pattern is project knowledge running on-premise for legal and compliance reasons, while sales playbook chatbots use hosted APIs for higher quality answers on lower sensitivity data.
The architecture is designed to support this from day one. The same chatbot platform supports multiple model backends, and the choice of backend is made per workspace based on policy. This avoids the trap of one size fits all decisions.
In the last year, Indian GPU hosting has become much more practical. Yotta, E2E Networks, CtrlS, and several others offer Nvidia H100 and A100 GPU servers from data centres in Mumbai, Bangalore, and Chennai. For clients who need on-premise but do not want to physically host the hardware, this is a viable middle ground. The data stays inside India, the contracts are governed by Indian law, and the operational simplicity is much better than running GPUs in your own server room.
We have deployed several solutions in this setup and it works well. The end result for the client is effectively on-premise from a data residency perspective, while avoiding the cost and complexity of physical infrastructure.
From guide to production
Our team has hands-on experience implementing these systems. Book a free architecture call to discuss your specific requirements and get a clear delivery plan.
御社の課題をお聞かせください。24時間以内に、AI活用の可能性と具体的な進め方について無料でご提案いたします。
Boolean and Beyond
825/90, 13th Cross, 3rd Main
Mahalaxmi Layout, Bengaluru - 560086
590, Diwan Bahadur Rd
Near Savitha Hall, R.S. Puram
Coimbatore, Tamil Nadu 641002