Comments by @diadetediotedio6918 on "The Impending AI Model Collapse Problem" video.

  3.  @TheManinBlack9054  Was it? I searched for it and this was what I got in the sources: ["phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of “textbook quality” data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens)"], ["The language model Phi-1.5 is a Transformer with 1.3 billion parameters. It was trained using the same data sources as phi-1, augmented with a new data source that consists of various NLP synthetic texts."], ["Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value)."], ["We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data."] None of the phi models are trained on "purely synthetic data", only on mixed training data, so we don't know to what degree they are really affected. We also don't know whether the degradation in synthetic data from today's early foundational models is high enough to make a big difference, compared with what it will be in future foundational models (and/or upgrades of them) as more and more recursively fed synthetic data becomes part of their training. I'm also not aware of any foundational model trained on purely synthetic data that is publicly available for checking; if you know of any, I would be interested in seeing them.
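
As a rough illustration of the "mixed, not purely synthetic" point in the comment above, here is a minimal sketch that computes the synthetic-token share of phi-1's training mix from the figures quoted there (roughly 6B filtered web tokens plus 1B GPT-3.5-generated tokens). The helper function is hypothetical and the calculation is only an approximation; the quotes for phi-1.5, phi-2, and phi-3-mini give no token split, so their shares cannot be computed from them.

```python
# Rough illustration (not an official breakdown from the phi papers):
# estimate what fraction of a training mix is synthetic, using the
# token counts quoted in the comment above.

def synthetic_fraction(web_tokens_b: float, synthetic_tokens_b: float) -> float:
    """Return the share of synthetic tokens in the total training mix."""
    total = web_tokens_b + synthetic_tokens_b
    return synthetic_tokens_b / total

# phi-1 figures quoted above: ~6B "textbook quality" web tokens
# plus ~1B synthetically generated tokens (GPT-3.5).
phi1_share = synthetic_fraction(web_tokens_b=6.0, synthetic_tokens_b=1.0)
print(f"phi-1 synthetic share: {phi1_share:.1%}")  # ~14.3%

# The quotes for phi-1.5, phi-2, and phi-3-mini do not give a token split,
# so their synthetic share cannot be computed from them.
```

Under these quoted figures, only about one token in seven of phi-1's training data was synthetic, which is the commenter's point that the phi models are not evidence about training on purely synthetic data.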