Comments by "" (@grokitall) on "The Impending AI Model Collapse Problem" video.

this applies not only to llms, but to the entire field of black box statistical ai. sometimes good enough answers will work for your use case, but a lot of the time you not only need to know that you got the right answer, but also how you got that answer, and why it is right. legal and man critical uses spring to mind, but there are many others. the only answer to this is to use white box symbolic ai, where you can drill down and find those answers.
3
the idea goes back further than that, back to when audio recordings were first being made, and the term for it is replicative fading. but it is still the same phenomena that over time the models become dominated by the error data unless extreme steps are taken to correct for it.
2
@noblebearaw actually, the bigger problem is that black box statistical ai has the issue that even though it might give you the right answer, it might do so for the wrong reason. there was an early example where they took photos of a forest with and without tanks hiding in it, and it worked. they then went back and took more photos of camouflaged tanks, and it didn't work at all. they managed to find out why, and the system had learned that the tank photos were taken on a sunny day, and the no tank photos were taken on a cloudy day, so the model learned how to spot sunny vs cloudy forest pics. while the tech has improved massively, because statistical ai has no model except likelihood, it has no way to know why the answer was right, or to fix it when it is found to get the answers wrong. white box symbolic ai works differently, creating a model, and using the knowledge graph to figure out why the answer is right.
2
the use of the term curated data refers to the fact that if the data is not tagged, lots of learning algorithms won't work at all. this is what is meant by supervised learning. the problem with getting a black box statistical ai to generate those tags is that the basics of the tech is about stacking plausible guesses on plausible guesses, eliminating almost all of the feedback loops needed to generate correct answers. this is the compounded with the overfiting problem. every statistical ai works by getting more and more data, shovelling it into a system that tries to best fit a statistical model with every higher numbers of free parameters, which takes ever longer to train. this is where the model generally fails. there is a limit to how many times you can double the size of the dataset used to train the model. there is a limit to how many extra free variables you can add to the model. there are limits to how fast the model stabilises as the number of variables goes up. this results in the compute requirements to do the training having an exponential growth term in the costs, to produce answers that are only a little better and only produce likely answers, not correct ones. then there is the general problem with all intelligence, that you can only solve new problems you can almost solve already.
1
@noblebearaw it used all the points in all the images to come up with a set of weighted values which together enabled a curve to be drawn with all the images in one set on one side of the curve, and all the images in the other set on the other side of the curve. that is the nature of statistical ai, it does not care about why it comes to the answer, only that the answer fits the training data. the problem with this approach is that you are creating a problem space with as many dimensions as you have free variables, and then trying to draw a curve in that phase space, but there are many curves that fit the historical data, and you only find out which is the right one when you provide additional data which varies from the training data. symbolic ai works in a completely different way. because it is a white box system, it can still use the same statistical techniques to determine the category which the image falls into, but this acts as the starting point. you then use this classification as a basis to start looking for why it is in that category, wrapping the statistical ai inside another process, which takes the images fed into it, and uses humans to spot where it got it wrong, and look for patterns of wrong answers which help identify features within that multi dimensional problem space which are likely to match one side of the line or the other. this builds up a knowledge graph analogous to the structure of the statistical ai, but as each feature is recognised, named, and added to the model, it adds new data points to the model, with the difference being that you can drill down from the result to query which features are important, and why. this also provides extra chances for extra feedback loops not found in statistical ai. if we look at compiled computer programs as an example, using c and makefiles to keep it simple, you would start of by feeding the statistical ai with the code and makefile, and feed it the result of the ci / cd pipeline, determining if the change just made was releasable or not. eventually, it might get good at predicting the answer, but you would not know why. the code contains additional data implicit within it which provides more useful answers. each step in the process gives usable additional data which can be queried later. was it a change in the makefile which stopped it building Correctly? did it build ok, but segfault when it was run? how good is the code coverage of the tests on the code which was changed? does some test fail, and is it well enough named that it tells you why it failed? and so on. also a lot of these failures will give you line numbers and positions within specific files as part of the error message. if you are using version control, you also know what the code was before and after the change, and if the error report is not good enough, you can feed the difference into a tool to improve the tests so that it can identify not only where the error is, but how to spot it next time. basically, you are using a human to encode information from the tools into an explicit knowledge graph which ends up detecting that the code got it wrong because the change in line 75 of query.c returns the wrong answer to a specific function when passed specific data because a branch which should have been taken to return the right answer was not taken because the test on that line had 1 less = sign than was needed ad position 12, making it an assignment statement rather than a test, making the test never pass. it could then also suggest replacing the = with == in the new code, thus fixing the problem. none of that information could be got from the statistical ai, as any features in the code used to find the problem are implicit in the internal model, but it contains none of the feedback loops needed to do more than identify that there is a problem. going back to the tank example, the symbolic ai would not only be able to identify that there was a camouflaged tank, but point out where it was hiding, using the fact that trees don't have straight edges, and then push the identified parts of the tank through a classification system to try and recognise the make and model of the tank, this providing you with the capabilities and limitations of the identified vehicle as well as the presence and location. often when it gets stuck, it resorts to the fallback option of presenting the data to the human and saying "what do you know in this case which i don't", adding that information explicitly into the know,edge graph, and trying again to see if it altered the result.
1