LLM Pre-training: A Hybrid JEPA+MLM Approach Reshapes Latent Space

The Evolution of Pre-training Objectives for Large Language Models

Masked Language Modelling (MLM) has been the dominant pre-training objective for text encoders since the introduction of architectures like BERT. While effective, this approach tends to produce representations strongly anchored to the surface-form identity of tokens rather than to deeper semantic structure. This limitation can hinder the ability of Large Language Models (LLMs) to capture complex contextual nuances and to generalize robustly across domains.

The need to overcome these constraints has driven research towards new methodologies. The goal is to develop LLMs that do not merely predict the next word or fill in blanks, but rather build a more abstract and conceptual understanding of language. This is particularly relevant for companies operating with sensitive or proprietary data, where the fidelity and depth of semantic representation are crucial for analysis, retrieval, and text generation applications.

A New Hybrid Objective: JEPA and MLM Combined

Inspired by the success of Joint Embedding Predictive Architectures (JEPA) in the fields of vision and audio, researchers have proposed an innovative hybrid pre-training objective. This approach combines a latent-space prediction loss, typical of JEPA architectures, with the traditional MLM objective. Both objectives operate on a single shared encoder, and a learnable scalar parameter dynamically balances the two contributions during the training process.
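To make the idea concrete, here is a minimal sketch of how two losses might be mixed through a single learnable scalar. The article does not specify the exact formulation, so everything here is an assumption for illustration: the parameter name alpha, the sigmoid gate, and the use of mean squared error as the latent-space loss are all hypothetical choices, not the researchers' method.

```python
import math


def sigmoid(x: float) -> float:
    """Squash an unconstrained scalar into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mlm_cross_entropy(logits: list[float], target_index: int) -> float:
    """Cross-entropy for one masked position (numerically stable log-softmax)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_index]


def latent_mse(predicted: list[float], target: list[float]) -> float:
    """Assumed latent-space loss: mean squared error between a predicted
    embedding and a target embedding. The real JEPA loss may differ."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def hybrid_loss(mlm_loss: float, jepa_loss: float, alpha: float) -> float:
    """Mix the two losses with weight w = sigmoid(alpha).

    alpha is the (hypothetical) learnable scalar; in a real training loop it
    would be updated by the optimizer together with the encoder weights.
    """
    w = sigmoid(alpha)
    return w * jepa_loss + (1.0 - w) * mlm_loss
```

With alpha = 0 the gate is 0.5 and both objectives contribute equally. Note that under this naive mixing, gradient descent on alpha alone would drift toward whichever loss is smaller, so a practical scheme likely needs some additional constraint or normalization; the article does not say how this is handled.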

To evaluate the effectiveness of this methodology, both a hybrid model and an MLM-only baseline were pre-trained. The training was conducted on English Wikipedia, using identical architectures and the same compute budget, leveraging NVIDIA H100 GPUs. The use of high-end hardware like the H100 underscores the computational intensity required for LLM pre-training and the need for robust infrastructure, a key aspect for on-premise deployment strategies.

Representation Analysis and Semantic Advantages

An in-depth analysis of the representations, performed across five GLUE tasks (SST-2, MRPC, MNLI, CoLA, STS-B) with four different pooling strategies, revealed significant differences. The hybrid encoder produces notably more uniform embeddings: its uniformity value falls below -0.16, compared to -0.05 for pure MLM, where a lower value indicates a more uniform distribution. This increased uniformity means the representations are spread more evenly across the latent space, which may improve the model's ability to distinguish between semantically distinct concepts.
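The article does not say how uniformity was computed. The sketch below assumes the commonly used definition: the logarithm of the average Gaussian potential exp(-t * squared distance) over all pairs of L2-normalized embeddings, with t = 2. Under that assumption, more negative values mean more evenly spread embeddings, which matches the direction of the reported numbers. The two pooling functions are illustrative as well; the text names only max pooling among the four strategies.

```python
import math


def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average token vectors dimension by dimension into one sentence vector."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def max_pool(token_vectors: list[list[float]]) -> list[float]:
    """Take the per-dimension maximum over all token vectors."""
    return [max(col) for col in zip(*token_vectors)]


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def uniformity(embeddings: list[list[float]], t: float = 2.0) -> float:
    """Assumed uniformity metric: log of the mean pairwise Gaussian potential.

    Returns 0.0 when all embeddings point the same way (fully collapsed) and
    becomes more negative as they spread out over the unit sphere.
    Requires at least two non-zero embeddings.
    """
    unit = [l2_normalize(v) for v in embeddings]
    potentials = []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(unit[i], unit[j]))
            potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))
```

In practice one would pool the encoder's token outputs for each sentence, collect the pooled vectors for a task's dataset, and pass that collection to uniformity.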

Furthermore, the hybrid model exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a more effective balance between semantic and lexical understanding. Although downstream accuracy measured with linear probes was similar between the two approaches, the geometric differences are consistent and significant. This suggests that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot fully capture, and may offer a more solid foundation for fine-tuning and adaptation to specific tasks.
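"Spectral geometry" is not defined in the article. One simple way to quantify how many directions of the embedding space are actually in use is the participation ratio of the covariance spectrum, (sum of eigenvalues)^2 / (sum of squared eigenvalues). The sketch below uses that measure purely as an illustrative stand-in, not as the statistic the researchers reported. It exploits the identity that this ratio equals trace(C)^2 divided by the squared Frobenius norm of C, so no eigendecomposition is needed.

```python
def covariance(embeddings: list[list[float]]) -> list[list[float]]:
    """Population covariance matrix of a set of embedding vectors."""
    n = len(embeddings)
    d = len(embeddings[0])
    means = [sum(col) / n for col in zip(*embeddings)]
    centered = [[x - m for x, m in zip(row, means)] for row in embeddings]
    return [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]


def participation_ratio(embeddings: list[list[float]]) -> float:
    """Effective number of dimensions used by the embeddings.

    Equals 1.0 when all variance lies along a single direction and approaches
    the embedding dimension when variance is spread evenly.
    Requires embeddings that are not all identical.
    """
    cov = covariance(embeddings)
    trace = sum(cov[i][i] for i in range(len(cov)))
    frobenius_sq = sum(v * v for row in cov for v in row)
    return trace * trace / frobenius_sq
```

A higher value for max-pooled embeddings would be one possible reading of a "richer" spectrum, since it means variance is distributed over more directions instead of being concentrated in a few.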

Implications for On-Premise Deployment and Data Sovereignty

The findings of this research have direct implications for organizations evaluating LLM deployment in on-premise or hybrid environments. The ability to generate more uniform and semantically rich embeddings can translate into more robust and interpretable models, a crucial advantage for sectors with stringent compliance and data sovereignty requirements. An LLM with a deeper grasp of semantics may also be less prone to "hallucinate" or produce misleading responses, an important consideration when handling proprietary or sensitive information.

Investing in hardware such as the NVIDIA H100 for on-premise pre-training and inference becomes easier to justify if it allows for training models with superior intrinsic qualities, even when those qualities are not immediately visible in standard accuracy metrics. For those evaluating on-premise deployment, AI-RADAR offers analytical frameworks on /llm-onpremise to understand the trade-offs between different architectures and pre-training objectives, helping to make informed decisions that balance TCO, performance, and data control. Research continues to explore how to optimize pre-training objectives to unlock the full potential of LLMs in critical enterprise contexts.