Comments by @diadetediotedio6918 on "The Impending AI Model Collapse Problem" video.

  3.  @TheManinBlack9054  Was it? I searched for it and this was what I got in the sources: ["phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of “textbook quality” data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens)"], ["The language model Phi-1.5 is a Transformer with 1.3 billion parameters. It was trained using the same data sources as phi-1, augmented with a new data source that consists of various NLP synthetic texts."], ["Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value)."], ["We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data."] None of the phi models are trained on "purely synthetic data", only on mixed training data, so we don't know to what degree they are really affected. We also don't know whether the degradation in synthetic data from today's early foundational models is high enough to make a big difference, compared with what it will be in future foundational models (and/or upgrades of them) as more and more recursively fed synthetic data becomes part of their training. I'm also not aware of any foundational model trained on purely synthetic data that is publicly available for checking; if you know of any, I would be interested in seeing them.
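
As a rough illustration of the "mixed, not purely synthetic" point in the comment above, here is a minimal sketch that computes the synthetic-token share of phi-1's training mix from the figures quoted there (roughly 6B filtered web tokens plus 1B GPT-3.5-generated tokens). The helper function is hypothetical and the calculation is only an approximation; the quotes for phi-1.5, phi-2, and phi-3-mini give no token split, so their shares cannot be computed from them.

```python
# Rough illustration (not an official breakdown from the phi papers):
# estimate what fraction of a training mix is synthetic, using the
# token counts quoted in the comment above.

def synthetic_fraction(web_tokens_b: float, synthetic_tokens_b: float) -> float:
    """Return the share of synthetic tokens in the total training mix."""
    total = web_tokens_b + synthetic_tokens_b
    return synthetic_tokens_b / total

# phi-1 figures quoted above: ~6B "textbook quality" web tokens
# plus ~1B synthetically generated tokens (GPT-3.5).
phi1_share = synthetic_fraction(web_tokens_b=6.0, synthetic_tokens_b=1.0)
print(f"phi-1 synthetic share: {phi1_share:.1%}")  # ~14.3%

# The quotes for phi-1.5, phi-2, and phi-3-mini do not give a token split,
# so their synthetic share cannot be computed from them.
```

Under these quoted figures, only about one token in seven of phi-1's training data was synthetic, which is the commenter's point that the phi models are not evidence about training on purely synthetic data.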