An embedding is a fixed-length vector of floating-point numbers that represents the "meaning" of a piece of text. Two semantically similar texts produce vectors that are close together in the high-dimensional embedding space, i.e. they have high cosine similarity.
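To make "close" concrete, here is a minimal sketch of cosine similarity with toy vectors (the values are illustrative, not real model output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real embeddings have hundreds or thousands of dims.
v_cat = np.array([0.8, 0.1, 0.3, 0.2])
v_kitten = np.array([0.7, 0.2, 0.4, 0.1])
v_invoice = np.array([-0.1, 0.9, -0.2, 0.5])

print(cosine_similarity(v_cat, v_kitten))   # high: semantically similar
print(cosine_similarity(v_cat, v_invoice))  # low: unrelated
```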
## How Embeddings Are Produced
A dedicated encoder model (separate from your generative LLM) processes a text chunk and outputs a single vector, typically the mean-pooled or CLS-token representation of the last hidden layer. Popular models: all-MiniLM-L6-v2 (384 dims, very fast), BGE-M3 (1024 dims, multilingual), text-embedding-3-large (3072 dims, OpenAI API). On-premise favourites: BGE-Large and E5-Mistral-7B-Instruct.
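As a minimal sketch of how this looks in practice, here is the encoding step with the sentence-transformers library and the all-MiniLM-L6-v2 model mentioned above (the sample chunks are placeholders; `encode` handles tokenisation and pooling internally):

```python
from sentence_transformers import SentenceTransformer

# Downloads weights on first run; afterwards the model loads from the local
# cache, so it also works air-gapped once the weights are present.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The pump requires maintenance every 500 hours.",
    "Service the pump after 500 operating hours.",
    "Quarterly revenue grew by 12 percent.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384): one 384-dim vector per chunk

# With normalised vectors, cosine similarity is just the dot product.
print(embeddings[0] @ embeddings[1])  # high: paraphrases of each other
print(embeddings[0] @ embeddings[2])  # low: unrelated topic
```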
## Dimensions and Quality
| Model | Dims | MTEB Score | Speed | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 56.3 | Very fast | Default for lightweight RAG |
| BGE-Large-EN-v1.5 | 1024 | 64.2 | Fast | Best open-source English |
| BGE-M3 | 1024 | 69.1 | Medium | Multilingual, dense+sparse |
| E5-Mistral-7B | 4096 | 66.6 | Slow | Best quality, heavy |
| text-embedding-3-large | 3072 | 64.6 | N/A (API) | Cloud dependency |
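If you want to confirm a model's output width before building an index, sentence-transformers exposes it directly (a quick sketch; BAAI/bge-large-en-v1.5 is the Hugging Face id of the BGE-Large-EN-v1.5 row above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
print(model.get_sentence_embedding_dimension())  # 1024
```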
## Why It Matters for On-Premise
Your choice of embedding model determines RAG retrieval quality. For an air-gapped setup, you must use a locally hosted model; BGE-Large or BGE-M3 cover the vast majority of use cases. Store vectors in Chroma, Qdrant, or pgvector. Remember: every vector in an index must come from the same model. Even two models with identical dimensions produce incompatible vector spaces, so switching models means re-embedding the entire corpus.
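To close the loop, a minimal sketch of indexing and querying with Chroma (the collection name, storage path, and texts are placeholder assumptions; Qdrant and pgvector follow the same embed, store, query pattern with their own APIs):

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# PersistentClient runs in-process and writes to local disk:
# nothing leaves the machine.
client = chromadb.PersistentClient(path="./vector_store")
collection = client.get_or_create_collection("docs")

texts = [
    "Pump maintenance is due every 500 operating hours.",
    "Replace the filter cartridge monthly.",
]
collection.add(
    ids=[f"doc-{i}" for i in range(len(texts))],
    documents=texts,
    embeddings=model.encode(texts).tolist(),
)

# Retrieval: the query MUST be embedded with the same model as the index.
query_vec = model.encode(["When should the pump be serviced?"]).tolist()
results = collection.query(query_embeddings=query_vec, n_results=1)
print(results["documents"][0])  # best-matching chunk(s) for the first query
```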