Can I run a private ChatGPT without sending data to OpenAI?

Yes. Self-host an open-weight model (e.g. Llama, Mistral, Qwen) with a serving engine like vLLM or Ollama, add RAG over your documents, and data never leaves your infrastructure.

What hardware do I need for a company assistant?

For a small team, a single 24-48GB GPU runs a capable 8B-34B model. For many concurrent users, use vLLM on an 48-80GB GPU or scale out. Size for your concurrency, not just the model.

Cloud or on-premise for a private assistant?

On-premise (or sovereign/EU-region cloud) when data residency or compliance is strict; managed cloud for speed and elasticity. Many companies run a hybrid: sensitive data on-prem, bursts in the cloud.

How to Build a Private ChatGPT for Your Company (2026)

For many companies the goal is simple: the usefulness of ChatGPT, but answering from internal knowledge and without sending confidential data to a third party. That is entirely achievable today with open-weight models — and it has become the default path for regulated sectors. Here is the architecture and the decisions that matter.

The four building blocks

Model — an open-weight LLM (Llama, Mistral, Qwen, etc.) sized to your hardware: 8–14B for snappy single-GPU use, 34–70B for stronger reasoning.
Serving engine — Ollama for a prototype or small team; vLLM/TGI for production concurrency. Both expose an OpenAI-compatible API.
RAG layer — index your documents (wiki, PDFs, tickets) as embeddings in a vector DB so the assistant answers from your knowledge and cites sources.
Chat UI + access control — an open chat front-end (e.g. Open WebUI) wired to SSO and per-team permissions.

Reference architecture

Documents → ingestion + chunking → embeddings → vector database. At query time: user question → retrieve relevant chunks → build prompt → local LLM (vLLM) → answer with citations → chat UI. Everything runs inside your network; the only thing leaving is nothing. Add logging and access control at the UI/API layer for audit and compliance.

Hardware and cost

A small-team assistant runs comfortably on a single 24–48GB GPU with a quantized 8–34B model. Many concurrent users need vLLM on a 48–80GB card (or several) to keep latency low under batching. As covered in our cost guide, owning hardware pays off at sustained utilization; for pilots, rent a GPU by the hour first to validate before buying.

Security, compliance and the real risks

The point of going private is control: data stays in your perimeter, you log every query, and you can pin processing to an EU/sovereign location for GDPR and the EU AI Act. But self-hosting shifts responsibility to you — patching, access control, prompt-injection defenses on the RAG layer, and preventing the assistant from surfacing documents a user should not see. Treat document-level permissions as a first-class requirement, not an afterthought.

A pragmatic rollout

Pilot: one GPU, an 8–14B model, Ollama + Open WebUI, RAG on one well-defined document set.
Validate quality and adoption with a friendly team before scaling.
Productionize: move to vLLM, add SSO + per-team document permissions + logging.
Scale model size / GPUs as concurrency and quality needs grow.