An embedding is a fixed-length vector of floating-point numbers that represents the "meaning" of a piece of text. Two semantically similar texts produce vectors that are close together in the high-dimensional embedding space, i.e. they have high cosine similarity.
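To make "close" concrete, here is a minimal sketch of cosine similarity with toy vectors (the values are illustrative, not real model output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real embeddings have hundreds or thousands of dims.
v_cat = np.array([0.8, 0.1, 0.3, 0.2])
v_kitten = np.array([0.7, 0.2, 0.4, 0.1])
v_invoice = np.array([-0.1, 0.9, -0.2, 0.5])

print(cosine_similarity(v_cat, v_kitten))   # high: semantically similar
print(cosine_similarity(v_cat, v_invoice))  # low: unrelated
```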
## How Embeddings Are Produced
A dedicated encoder model (separate from your generative LLM) processes a text chunk and outputs a single vector, typically the mean-pooled or CLS-token representation of the last hidden layer. Popular models: all-MiniLM-L6-v2 (384 dims, very fast), BGE-M3 (1024 dims, multilingual), text-embedding-3-large (3072 dims, OpenAI API). On-premise favourites: BGE-Large and E5-Mistral-7B-Instruct.
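As a minimal sketch of how this looks in practice, here is the encoding step with the sentence-transformers library and the all-MiniLM-L6-v2 model mentioned above (the sample chunks are placeholders; `encode` handles tokenisation and pooling internally):

```python
from sentence_transformers import SentenceTransformer

# Downloads weights on first run; afterwards the model loads from the local
# cache, so it also works air-gapped once the weights are present.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The pump requires maintenance every 500 hours.",
    "Service the pump after 500 operating hours.",
    "Quarterly revenue grew by 12 percent.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384): one 384-dim vector per chunk

# With normalised vectors, cosine similarity is just the dot product.
print(embeddings[0] @ embeddings[1])  # high: paraphrases of each other
print(embeddings[0] @ embeddings[2])  # low: unrelated topic
```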
## Dimensions and Quality
| Model | Dims | MTEB Score | Speed | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 56.3 | Very fast | Default for lightweight RAG |
| BGE-Large-EN-v1.5 | 1024 | 64.2 | Fast | Best open-source English |
| BGE-M3 | 1024 | 69.1 | Medium | Multilingual, dense+sparse |
| E5-Mistral-7B | 4096 | 66.6 | Slow | Best quality, heavy |
| text-embedding-3-large | 3072 | 64.6 | N/A (API) | Cloud dependency |
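If you want to confirm a model's output width before building an index, sentence-transformers exposes it directly (a quick sketch; BAAI/bge-large-en-v1.5 is the Hugging Face id of the BGE-Large-EN-v1.5 row above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
print(model.get_sentence_embedding_dimension())  # 1024
```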
## Why It Matters for On-Premise
Your choice of embedding model determines RAG retrieval quality. For an air-gapped setup, you must use a locally hosted model; BGE-Large or BGE-M3 cover the vast majority of use cases. Store vectors in Chroma, Qdrant, or pgvector. Remember: every vector in an index must come from the same model. Even two models with identical dimensions produce incompatible vector spaces, so switching models means re-embedding the entire corpus.
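To close the loop, a minimal sketch of indexing and querying with Chroma (the collection name, storage path, and texts are placeholder assumptions; Qdrant and pgvector follow the same embed, store, query pattern with their own APIs):

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# PersistentClient runs in-process and writes to local disk:
# nothing leaves the machine.
client = chromadb.PersistentClient(path="./vector_store")
collection = client.get_or_create_collection("docs")

texts = [
    "Pump maintenance is due every 500 operating hours.",
    "Replace the filter cartridge monthly.",
]
collection.add(
    ids=[f"doc-{i}" for i in range(len(texts))],
    documents=texts,
    embeddings=model.encode(texts).tolist(),
)

# Retrieval: the query MUST be embedded with the same model as the index.
query_vec = model.encode(["When should the pump be serviced?"]).tolist()
results = collection.query(query_embeddings=query_vec, n_results=1)
print(results["documents"][0])  # best-matching chunk(s) for the first query
```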