Compass-Embedding v4 for Multilingual E-commerce
The rapid expansion of global e-commerce into emerging markets has highlighted the lack of high-quality semantic representations for low-resource languages. This bottleneck negatively impacts search, recommendation, and information retrieval systems.
Compass-Embedding v4 is a high-efficiency multilingual embedding framework specifically optimized for e-commerce scenarios in Southeast Asia (SEA). In these contexts, data scarcity, imperfect supervision, and strict production constraints represent significant challenges for machine learning.
The three challenges addressed
Compass-Embedding v4 addresses three main challenges:
- False negatives in contrastive training: Contrastive training with large batch sizes and mixed task supervision introduces systematic false negatives that degrade semantic alignment. To address this, the framework introduces Class-Aware Masking (CAM), a lightweight modification to the InfoNCE objective that suppresses invalid in-batch negatives and improves semantic discrimination without affecting training efficiency.
- Limited data for SEA languages: Low-resource SEA languages suffer from limited and uneven data coverage. To overcome this, a diversified training corpus was constructed through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction, enabling multilingual and domain-specific learning.
- High-speed inference: Production deployment requires high-throughput inference while preserving embedding quality. To this end, the framework combines robustness-driven large-batch training with spherical model merging to mitigate catastrophic forgetting, and optimizes inference via vLLM and FP8 quantization.
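Two ideas from the list above, masking same-class in-batch negatives and spherical merging of model weights, can be illustrated with a short sketch. It is not the Compass-Embedding v4 implementation: the function names, the per-pair class labels, the temperature value, and the use of plain Python lists in place of tensors are all assumptions made for illustration. The summary also does not say how the spherical merge is carried out, so the second function shows ordinary spherical linear interpolation (slerp) between two flattened weight vectors as one plausible reading.

```python
import math


def cam_info_nce(sim, labels, temperature=0.05):
    """Mean InfoNCE loss with same-class in-batch negatives masked out.

    sim[i][j] is the similarity of query i and document j, and sim[i][i]
    is the positive pair. labels[i] is a hypothetical class id for pair i
    (for example a product category); this labeling is an assumption.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Keep the positive plus negatives whose class differs from pair i's.
        kept = [sim[i][j] / temperature
                for j in range(n)
                if j == i or labels[j] != labels[i]]
        # Numerically stable log-sum-exp over the kept logits.
        m = max(kept)
        log_denom = m + math.log(sum(math.exp(x - m) for x in kept))
        total += log_denom - sim[i][i] / temperature
    return total / n


def slerp(w0, w1, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two non-zero weight vectors."""
    norm0 = math.sqrt(sum(a * a for a in w0))
    norm1 = math.sqrt(sum(b * b for b in w1))
    cos = sum(a * b for a, b in zip(w0, w1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos)))  # angle between the vectors
    s = math.sin(omega)
    if s < eps:
        # Nearly parallel or antiparallel: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    c0 = math.sin((1 - t) * omega) / s
    c1 = math.sin(t * omega) / s
    return [c0 * a + c1 * b for a, b in zip(w0, w1)]


# Toy batch: pairs 0 and 1 share a class, so each is a false negative for the other.
sim = [[0.9, 0.8, 0.1],
       [0.8, 0.9, 0.2],
       [0.1, 0.2, 0.9]]
masked_loss = cam_info_nce(sim, labels=[0, 0, 1])
plain_loss = cam_info_nce(sim, labels=[0, 1, 2])  # distinct labels: nothing is masked
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
```

In the toy batch, masking drops the high-similarity same-class entries from the denominator, so `masked_loss` comes out lower than `plain_loss`. The mask only changes which terms enter the softmax, which matches the claim that the modification is lightweight. For the merge, slerp keeps the interpolated vector on the same sphere when both inputs have equal norm, whereas a straight average would shrink it.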
Evaluations on multilingual benchmarks and proprietary e-commerce tasks demonstrate that Compass-Embedding v4 achieves state-of-the-art performance on major SEA languages, significantly outperforming general-purpose embedding models in domain-specific retrieval and classification, while maintaining competitive performance on high-resource languages.