## Compass-Embedding v4 for Multilingual E-commerce

The rapid expansion of global e-commerce into emerging markets has highlighted the lack of high-quality semantic representations for low-resource languages. This bottleneck degrades search, recommendation, and information retrieval systems. Compass-Embedding v4 is a high-efficiency multilingual embedding framework optimized for e-commerce scenarios in Southeast Asia (SEA), where data scarcity, imperfect supervision, and strict production constraints pose significant challenges for machine learning.

## The three challenges addressed

Compass-Embedding v4 addresses three main challenges:

1. **False negatives in contrastive training:** Contrastive training with large batch sizes and mixed task supervision introduces systematic false negatives that degrade semantic alignment. To address this, Class-Aware Masking (CAM) is introduced: a lightweight modification to the InfoNCE objective that suppresses invalid in-batch negatives and improves semantic discrimination without reducing training efficiency.
2. **Limited data for SEA languages:** Low-resource SEA languages suffer from limited and uneven data coverage. To overcome this, a diversified training corpus was constructed through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction, enabling multilingual and domain-specific learning.
3. **High-throughput inference:** Production deployment requires high-throughput inference while preserving embedding quality. To this end, robustness-driven large-batch training is combined with spherical model merging to mitigate catastrophic forgetting, and inference is optimized via vLLM and FP8 quantization.
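The idea behind Class-Aware Masking can be illustrated with a minimal sketch. The function below is a hypothetical NumPy implementation (the source does not give the exact formulation): it computes a standard in-batch InfoNCE loss, but any in-batch negative that shares the positive pair's class label is masked out of the denominator, so same-class items are not wrongly penalized as negatives.

```python
import numpy as np

def cam_infonce(q, d, labels, temperature=0.05):
    """InfoNCE with Class-Aware Masking (hypothetical sketch).

    q, d   : (B, dim) query and document embeddings; q[i] pairs with d[i].
    labels : (B,) class label of each pair; in-batch negatives sharing the
             positive's label are masked (treated as invalid negatives).
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature               # (B, B) similarity matrix

    # CAM: mask same-class off-diagonal entries so they drop out of the
    # softmax denominator; the diagonal (true positive) is always kept.
    same_class = labels[:, None] == labels[None, :]
    np.fill_diagonal(same_class, False)
    logits = np.where(same_class, -np.inf, logits)

    # numerically stable log-softmax; exp(-inf) contributes 0 to the sum
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.mean(np.diag(log_probs))          # cross-entropy on positives
```

Because masking only removes terms from the softmax denominator, the change adds a single boolean mask per batch and leaves the training loop and throughput untouched.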
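Spherical model merging can likewise be sketched with spherical linear interpolation (slerp) over flattened weight vectors. This is a generic illustration under the assumption that the merge interpolates along the great circle between two checkpoints (e.g. a domain-tuned model and its base) rather than averaging linearly; the source does not specify its exact merging formula.

```python
import numpy as np

def slerp_merge(w_a, w_b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors.

    Interpolates direction on the unit sphere and magnitude linearly, so the
    merged weights stay on a geodesic between the two checkpoints instead of
    collapsing toward the chord as plain averaging would.
    """
    na, nb = np.linalg.norm(w_a), np.linalg.norm(w_b)
    a, b = w_a / (na + eps), w_b / (nb + eps)
    dot = np.clip(a @ b, -1.0, 1.0)
    theta = np.arccos(dot)                        # angle between directions
    if theta < eps:                               # nearly parallel: plain lerp
        return (1 - t) * w_a + t * w_b
    s = np.sin(theta)
    unit = (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / s
    return unit * ((1 - t) * na + t * nb)         # restore interpolated norm
```

At `t=0` the merge returns the first checkpoint and at `t=1` the second; intermediate values trade off domain specialization against the base model's general-purpose knowledge.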
Evaluations on multilingual benchmarks and proprietary e-commerce tasks demonstrate that Compass-Embedding v4 achieves state-of-the-art performance on major SEA languages, significantly outperforming general-purpose embedding models in domain-specific retrieval and classification, while maintaining competitive performance on high-resource languages.