For many companies the goal is simple: the usefulness of ChatGPT, but answering from internal knowledge and without sending confidential data to a third party. That is entirely achievable today with open-weight models — and it has become the default path for regulated sectors. Here is the architecture and the decisions that matter.
The four building blocks
- Model — an open-weight LLM (Llama, Mistral, Qwen, etc.) sized to your hardware: 8–14B for snappy single-GPU use, 34–70B for stronger reasoning.
- Serving engine — Ollama for a prototype or small team; vLLM/TGI for production concurrency. Both expose an OpenAI-compatible API.
- RAG layer — index your documents (wiki, PDFs, tickets) as embeddings in a vector DB so the assistant answers from your knowledge and cites sources.
- Chat UI + access control — an open chat front-end (e.g. Open WebUI) wired to SSO and per-team permissions.
Reference architecture
Documents → ingestion + chunking → embeddings → vector database. At query time: user question → retrieve relevant chunks → build prompt → local LLM (vLLM) → answer with citations → chat UI. Everything runs inside your network; the only thing leaving is nothing. Add logging and access control at the UI/API layer for audit and compliance.
Hardware and cost
A small-team assistant runs comfortably on a single 24–48GB GPU with a quantized 8–34B model. Many concurrent users need vLLM on a 48–80GB card (or several) to keep latency low under batching. As covered in our cost guide, owning hardware pays off at sustained utilization; for pilots, rent a GPU by the hour first to validate before buying.
Security, compliance and the real risks
The point of going private is control: data stays in your perimeter, you log every query, and you can pin processing to an EU/sovereign location for GDPR and the EU AI Act. But self-hosting shifts responsibility to you — patching, access control, prompt-injection defenses on the RAG layer, and preventing the assistant from surfacing documents a user should not see. Treat document-level permissions as a first-class requirement, not an afterthought.
A pragmatic rollout
- Pilot: one GPU, an 8–14B model, Ollama + Open WebUI, RAG on one well-defined document set.
- Validate quality and adoption with a friendly team before scaling.
- Productionize: move to vLLM, add SSO + per-team document permissions + logging.
- Scale model size / GPUs as concurrency and quality needs grow.