AI generated
BYOL: Bring Your Own Language Into LLMs
## BYOL: Enhanced Multilingual LLMs for All Languages
Large Language Models (LLMs) exhibit multilingual capabilities but are constrained by the imbalance in global language resources. A new study introduces "Bring Your Own Language" (BYOL), a framework for LLM development tailored to each language's digital footprint.
BYOL classifies languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora and selects the appropriate integration pathway. For low-resource languages, BYOL proposes a data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning.
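The tier-then-pathway routing described above can be sketched as follows. This is a minimal illustration, not the study's procedure: the use of a cleaned-corpus token count as the resource measure, the numeric thresholds, and the names `classify_language` and `select_pathway` are all assumptions made for the example. Only the four tier names and the two pathways for Low and Extreme-Low languages come from the text.

```python
from enum import Enum


class Tier(Enum):
    EXTREME_LOW = "Extreme-Low"
    LOW = "Low"
    MID = "Mid"
    HIGH = "High"


# Hypothetical upper bounds, in tokens of cleaned web text.
# These numbers are placeholders for illustration, not values from the study.
HYPOTHETICAL_THRESHOLDS = [
    (1_000_000, Tier.EXTREME_LOW),
    (100_000_000, Tier.LOW),
    (10_000_000_000, Tier.MID),
]


def classify_language(token_count: int) -> Tier:
    """Map a corpus size to a resource tier using the placeholder bounds."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    for upper_bound, tier in HYPOTHETICAL_THRESHOLDS:
        if token_count < upper_bound:
            return tier
    return Tier.HIGH


def select_pathway(tier: Tier) -> str:
    """Return the integration pathway the summary associates with a tier."""
    if tier is Tier.EXTREME_LOW:
        # Direct language modeling is treated as infeasible at this tier.
        return "translation-mediated inclusion"
    if tier is Tier.LOW:
        return ("corpus cleaning, synthetic text generation, "
                "continual pretraining, supervised finetuning")
    # The summary does not describe the Mid and High pathways.
    return "not described in this summary"


if __name__ == "__main__":
    tier = classify_language(50_000_000)
    print(tier.value, "->", select_pathway(tier))
```

Keeping the thresholds in one table makes it easy to swap in real cut-offs once they are known, without touching the routing logic.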
Applied to Chichewa and Maori, this approach yielded language-specific LLMs that achieved approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, BYOL introduces a translation-mediated inclusion pathway. On Inuktitut, a tailored machine translation system improves over a commercial baseline by 4 BLEU points, making high-accuracy LLM access possible when direct language modeling is infeasible.
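To make the idea of weight-space model merging concrete, here is a minimal sketch using plain linear interpolation between two checkpoints. The summary does not say which merging method the study uses, so interpolation is an assumption chosen for simplicity. The function name `merge_weights`, the `alpha` parameter, and the representation of a checkpoint as a dictionary of flat float lists are all hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_weights(base: Weights, adapted: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints that share the same architecture.

    alpha = 0.0 returns the base weights, alpha = 1.0 the adapted weights.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must have identical parameter names")

    merged: Weights = {}
    for name in base:
        if len(base[name]) != len(adapted[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Elementwise convex combination of the two parameter vectors.
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base[name], adapted[name])
        ]
    return merged


if __name__ == "__main__":
    base = {"w": [0.0, 2.0]}      # stands in for the multilingual baseline
    adapted = {"w": [2.0, 4.0]}   # stands in for the language-specific model
    print(merge_weights(base, adapted, alpha=0.5))
```

In this toy setting, `alpha` controls the trade-off the text describes: values near 1 favor the language-specific model, values near 0 favor the original English and multilingual behavior.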
Human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut have been released. The codebase and models are publicly available.