For many companies the goal is simple: the usefulness of ChatGPT, but answering from internal knowledge and without sending confidential data to a third party. That is entirely achievable today with open-weight models — and it has become the default path for regulated sectors. Here is the architecture and the decisions that matter.

The four building blocks

  1. Model — an open-weight LLM (Llama, Mistral, Qwen, etc.) sized to your hardware: 8–14B for snappy single-GPU use, 34–70B for stronger reasoning.
  2. Serving engine — Ollama for a prototype or small team; vLLM/TGI for production concurrency. Both expose an OpenAI-compatible API.
  3. RAG layer — index your documents (wiki, PDFs, tickets) as embeddings in a vector DB so the assistant answers from your knowledge and cites sources.
  4. Chat UI + access control — an open chat front-end (e.g. Open WebUI) wired to SSO and per-team permissions.

Reference architecture

Documents → ingestion + chunking → embeddings → vector database. At query time: user question → retrieve relevant chunks → build prompt → local LLM (vLLM) → answer with citations → chat UI. Everything runs inside your network; the only thing leaving is nothing. Add logging and access control at the UI/API layer for audit and compliance.

Hardware and cost

A small-team assistant runs comfortably on a single 24–48GB GPU with a quantized 8–34B model. Many concurrent users need vLLM on a 48–80GB card (or several) to keep latency low under batching. As covered in our cost guide, owning hardware pays off at sustained utilization; for pilots, rent a GPU by the hour first to validate before buying.

Security, compliance and the real risks

The point of going private is control: data stays in your perimeter, you log every query, and you can pin processing to an EU/sovereign location for GDPR and the EU AI Act. But self-hosting shifts responsibility to you — patching, access control, prompt-injection defenses on the RAG layer, and preventing the assistant from surfacing documents a user should not see. Treat document-level permissions as a first-class requirement, not an afterthought.

A pragmatic rollout

  1. Pilot: one GPU, an 8–14B model, Ollama + Open WebUI, RAG on one well-defined document set.
  2. Validate quality and adoption with a friendly team before scaling.
  3. Productionize: move to vLLM, add SSO + per-team document permissions + logging.
  4. Scale model size / GPUs as concurrency and quality needs grow.