## Synthetic Data: Future or Dead End for LLMs?

The question of whether synthetic data represents the future or a dead end for training large language models (LLMs) is at the center of a heated debate. A paper by Shumailov et al. (2023) warns of the risk of model collapse when models are recursively trained on AI-generated data: quality degrades and the richness of language is lost.

But there are also encouraging results:

* Self-Instruct demonstrates that a model can improve by generating its own training examples.
* The Constitutional AI approach has a model improve by critiquing and revising its own outputs.
* Microsoft's Phi-1 and Phi-2, trained on high-quality synthetic data, achieve remarkable results.
* Alpaca was trained on ChatGPT outputs, with mixed results.

## What Future for Language Models?

So, does training on synthetic data inevitably lead to model collapse, or can it work with careful selection? Does the outcome perhaps depend on how synthetic data is combined with real data? Shumailov's study suggests that the problem is recursion without fresh real data. Phi-2, on the other hand, indicates that high-quality synthetic data can unlock unexpected capabilities.

If model collapse is real and the web is increasingly full of AI-generated content, are we poisoning the future of models? Could GPT-6 end up trained in part on GPT-4 outputs, reaching a point of no return?
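
To make the mechanism behind model collapse concrete, here is a minimal toy sketch in Python/NumPy. It is purely illustrative, not the experiment from Shumailov et al. (2023): a one-dimensional Gaussian stands in for the data distribution, and discarding low-probability samples stands in for likelihood-favoring decoding. Each "generation" is trained only on the previous generation's synthetic output.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0 trains on "real" data: a standard Gaussian (sigma = 1).
data = rng.normal(0.0, 1.0, size=5_000)

for gen in range(1, 11):
    # "Train" this generation's model: fit a Gaussian to the training data.
    mu, sigma = data.mean(), data.std()

    # Generate a synthetic corpus from the fitted model.
    samples = rng.normal(mu, sigma, size=20_000)

    # Keep only high-probability samples (within ~1.5 sigma), mimicking
    # decoding strategies that favor likely outputs. The distribution's
    # tails, i.e. its rare and diverse content, are lost at every step.
    data = samples[np.abs(samples - mu) < 1.5 * sigma][:5_000]

    print(f"generation {gen}: sigma = {data.std():.3f}")
```

Running this, the standard deviation shrinks from 1.0 toward roughly 0.05 within ten generations: each cycle narrows the distribution a little more, mirroring how recursive training on generated data erodes linguistic richness when no real data is reintroduced.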