Comments by "Mikko Rantalainen" (@MikkoRantalainen) on "The Impending AI Model Collapse Problem" video.
11:00 I think that in addition to using 100% LLM-generated data for the next generation, they also overtrained the model to reproduce the previous generation's output as closely as possible. (They explain that they compared the output string to the previous generation's output. I think they should have asked for similar meaning instead, which could have been checked by some other LLM. When you train for identical strings, you overtrain/overfit the network, which is known to be detrimental to output quality.)
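A minimal sketch of the semantic check the comment suggests, in place of exact string comparison. The model name, prompt wording, and YES/NO convention are assumptions, not anything from the video or paper:

```python
# Hypothetical sketch: accept a generated sample as a training target only if
# a judge LLM says it *means* the same as the reference, instead of requiring
# an exact string match (which pushes the student model toward overfitting
# on surface form).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def semantically_equivalent(reference: str, candidate: str) -> bool:
    """Ask a judge LLM whether two answers convey the same meaning."""
    prompt = (
        "Do these two answers convey the same meaning? "
        "Reply with exactly YES or NO.\n\n"
        f"Answer A:\n{reference}\n\nAnswer B:\n{candidate}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def keep_for_training(reference: str, candidate: str) -> bool:
    # Exact-string training would be: candidate == reference.
    # The semantic check also keeps paraphrases that preserve meaning.
    return semantically_equivalent(reference, candidate)
```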
17:40 The time of day making a difference in LLM output from cloud services makes a lot of sense. Cloud providers are probably scaling down to less computationally heavy model variants/configurations when traffic gets high, and answer quality drops accordingly.
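One way such load-based degradation could look in a serving frontend. This is purely illustrative of the speculation above; the model names and threshold are invented:

```python
# Hypothetical sketch: route requests to a cheaper model variant once
# concurrent traffic crosses a threshold, trading answer quality for capacity.
import threading

FULL_MODEL = "big-model-fp16"     # assumed high-quality variant
LIGHT_MODEL = "small-model-int8"  # assumed cheaper variant
HIGH_LOAD_THRESHOLD = 1000        # concurrent requests; arbitrary

_in_flight = 0
_lock = threading.Lock()

def pick_model() -> str:
    """Return the model variant to serve this request, degrading under load."""
    with _lock:
        return LIGHT_MODEL if _in_flight > HIGH_LOAD_THRESHOLD else FULL_MODEL

def handle_request(prompt: str) -> str:
    global _in_flight
    with _lock:
        _in_flight += 1
    try:
        model = pick_model()
        return f"[served by {model}] {prompt}"  # stand-in for real inference
    finally:
        with _lock:
            _in_flight -= 1
```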
I think modern LLMs could learn from AI-generated content, including content emitted by the LLM itself. However, the data would need to be checked before being used as training data. I think GPT-4o and GPT-4 could do a pretty good job of checking the potential new data with good enough prompting. Something along the lines of "Does the following explanation/idea/research seem sensible compared to existing scientific knowledge? Provide chain of thought." followed by the LLM-generated answer/idea/research paper. And maybe check the provided chain of thought with yet another LLM to see if it hallucinated. Humans do not need to read every research paper ever published to generate sensible output. Successful humans have a concept of input filtering: when you read any new piece of data, you automatically evaluate the quality of the content against your existing understanding of the world. The better your understanding of existing research, the better you can evaluate the next piece of incoming data. And always ask for chain of thought, because LLMs are too happy to hallucinate simple yes-or-no answers.
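A sketch of the two-stage filter this comment describes: one LLM judges a generated sample against existing knowledge with an explicit chain of thought, and a second LLM audits that chain of thought for hallucinations. The model names, prompts, and the ACCEPT/REJECT and SOUND/UNSOUND conventions are assumptions:

```python
# Hypothetical sketch of a two-stage filter for self-generated training data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def accept_for_training(candidate: str) -> bool:
    # Stage 1: quality judgment with explicit reasoning, never a bare yes/no.
    reasoning = ask(
        "gpt-4o",  # assumed judge model
        "Does the following explanation/idea/research seem sensible compared "
        "to existing scientific knowledge? Provide chain of thought, then end "
        "with the single word ACCEPT or REJECT.\n\n" + candidate,
    )
    if not reasoning.strip().upper().endswith("ACCEPT"):
        return False

    # Stage 2: a second model audits the first model's chain of thought for
    # hallucinated facts or broken logic before the sample is kept.
    audit = ask(
        "gpt-4",  # assumed independent checker
        "Review this chain of thought for hallucinated facts or invalid "
        "reasoning. End with the single word SOUND or UNSOUND.\n\n" + reasoning,
    )
    return audit.strip().upper().endswith("SOUND")
```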