Embeddings

Dense numerical vector representations of text that capture semantic meaning — the foundation of semantic search and RAG pipelines.

An embedding is a fixed-length vector of floating-point numbers that represents the "meaning" of a piece of text. Two semantically similar texts produce vectors that are close in the high-dimensional embedding space (high cosine similarity).
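
To make the similarity comparison concrete, here is a minimal NumPy sketch (the toy 4-dim vectors are invented for illustration; real embeddings have hundreds to thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings" -- real models emit far higher-dimensional vectors.
cat = np.array([0.9, 0.1, 0.3, 0.0])
kitten = np.array([0.8, 0.2, 0.4, 0.1])
invoice = np.array([0.0, 0.9, 0.0, 0.7])

print(cosine_similarity(cat, kitten))   # high: semantically close texts
print(cosine_similarity(cat, invoice))  # low: unrelated texts
```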

How Embeddings Are Produced

A dedicated encoder model (separate from your generative LLM) processes a text chunk and outputs a single vector — the mean-pooled or CLS-token representation of the last hidden layer. Popular models: all-MiniLM-L6-v2 (384 dims, very fast), BGE-M3 (1024 dims, multilingual), text-embedding-3-large (3072 dims, OpenAI API). On-premise favourites: BGE-Large, E5-Mistral-7B-Instruct.
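
As a concrete sketch of that encode step, the sentence-transformers library wraps tokenization, the encoder, and pooling in one call (assumes the package is installed and the model weights are available locally):

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 mean-pools the last hidden layer into a 384-dim vector.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
]
# normalize_embeddings=True scales each vector to unit length,
# so a plain dot product equals cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)

print(embeddings.shape)               # (2, 384)
print(embeddings[0] @ embeddings[1])  # cosine similarity of the two chunks
```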

Dimensions and Quality

| Model | Dims | MTEB Score | Speed | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 56.3 | Very fast | Default for lightweight RAG |
| BGE-Large-EN-v1.5 | 1024 | 64.2 | Fast | Best open-source English |
| BGE-M3 | 1024 | 69.1 | Medium | Multilingual, dense+sparse |
| E5-Mistral-7B | 4096 | 66.6 | Slow | Best quality, heavy |
| text-embedding-3-large | 3072 | 64.6 | API | Cloud dependency |

Why It Matters for On-Premise

Your choice of embedding model determines RAG retrieval quality. For an air-gapped setup you must use a locally hosted model; BGE-Large or BGE-M3 cover the vast majority of use cases. Store vectors in Chroma, Qdrant, or pgvector. Remember that an index must be built and queried with a single model: embedding dimensions must stay consistent, and even vectors of matching dimensionality are not comparable across models.
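
A minimal sketch of that workflow with Chroma and a locally hosted BGE encoder (the collection name and storage path are illustrative; assumes chromadb and sentence-transformers are installed and the model weights are cached locally):

```python
import chromadb
from sentence_transformers import SentenceTransformer

# One encoder for the whole index -- never mix models within a collection.
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # 1024-dim vectors

client = chromadb.PersistentClient(path="./vector_store")  # on-disk, air-gap friendly
collection = client.get_or_create_collection(name="docs")

chunks = ["First chunk of text.", "Second chunk of text."]
collection.add(
    ids=[f"doc-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=encoder.encode(chunks).tolist(),
)

# Query with a vector produced by the SAME encoder.
hits = collection.query(
    query_embeddings=encoder.encode(["a chunk about the first topic"]).tolist(),
    n_results=1,
)
print(hits["documents"])
```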