Comments by "Siana Gearz" (@SianaGearz) on "Simulating the World To Train AI" video.

But how do you solve the problem that there isn't enough annotated real-world data? AI can be bootstrapped on synthetic data, hopefully from various sources (including, say, 3 or 4 videogames), and can then help convert raw real-world data into annotated data under human supervision, which can be poured back into the training mix. People would manually exclude any data that was annotated wrong. Now let's say you have plenty of annotated real-world data (which we currently don't) and use only that for training. Then you have another bias problem: it shows lots of good driving and very little bad driving or unexpected participants. But we NEED to take extreme circumstances and unusual situations into account, and weight them heavily against the usual, because the cost of running over a human that you can't see yet or driving off a cliff is extremely high. So you have introduced a new bias problem: your AI doesn't understand the concept of caution.
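To make the weighting point concrete, here is a minimal sketch, assuming every training sample carries a scenario label and that we pick a cost multiplier per scenario by hand. The label names, counts and multipliers are made up for illustration and don't come from any real data set.

```python
from collections import Counter

def sample_weights(labels, cost):
    """Weight each sample by inverse scenario frequency times an assumed cost multiplier."""
    counts = Counter(labels)
    n = len(labels)
    # Inverse frequency gives rare scenarios a larger base weight;
    # the (hypothetical) cost multiplier boosts dangerous ones further.
    raw = [cost.get(lab, 1.0) * n / counts[lab] for lab in labels]
    total = sum(raw)
    return [w / total for w in raw]  # normalise so the weights sum to 1

# Hypothetical data: 98 uneventful samples, 2 with a hidden pedestrian.
labels = ["normal"] * 98 + ["pedestrian_occluded"] * 2
cost = {"normal": 1.0, "pedestrian_occluded": 10.0}

weights = sample_weights(labels, cost)
rare_share = sum(w for w, lab in zip(weights, labels) if lab == "pedestrian_occluded")
```

With these made-up numbers the two rare samples end up carrying most of the total sampling weight, even though they are only 2% of the data. Whether a multiplier like that is sensible is exactly the kind of thing that would need tuning.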
5
I've played plenty of the Test Drive Unlimited series and Euro Truck Simulator, and I'm actually hopeful, because the behaviour of traffic there is imperfect and somewhat buggy, but so are real human drivers... There's a trap in that the driving AI may learn the bugs of the game AI, but still. The tools shown at the end, like Synthia, don't have any of that; they work with man-made scenarios that are known to occur and that a driving AI would need to handle. I think the ticket is somewhere in combining real-world data, data generated by various games, each with its own quirks, and data synthesised from manual input.
1
@weksauce Image recognition bound to spatial reasoning is a very hard task, which in all likelihood requires a large neural network. The network needs to generalise from the data set. When the data set is too small for the size of the network, the network generally memorises it verbatim instead, or focuses on detecting secondary, coincidental traits. The problem is that we don't have a great understanding of, or manual access into, how neural networks function. As for ad hoc, hardcoded solutions, they tend to be very difficult to integrate, except specifically by way of synthetic data. But I would agree a handcrafted solution should still be present as a failsafe running alongside the neural network subsystems, specifically because we have such a mediocre grasp of them. The two systems should also ideally mostly agree rather than fight, as that's a source of danger as well. Synthetic data also helps solve the chicken-and-egg problem. Yes, ideally you'd have more real-world data, and above all more diverse real-world data with better corner-case coverage. But the raw data we can get easily isn't marked up. You can train the markup model on synthetic data, and human supervision and correction is orders of magnitude cheaper than manual markup from the ground up. Then you can gradually increase the amount of real-world data used in training.
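The loop described above could be sketched roughly like this. It's only an illustration under assumptions: the function names, the linear ramp and the 80% ceiling are all hypothetical, and the "reviewer" is a stand-in lambda rather than a real person.

```python
import random

def real_fraction(round_idx, n_rounds, start=0.0, end=0.8):
    """Linearly ramp the share of real-world data over training rounds (assumed schedule)."""
    if n_rounds <= 1:
        return end
    t = round_idx / (n_rounds - 1)
    return start + t * (end - start)

def human_review(auto_annotated, accept):
    """Keep only the machine-made annotations that a human reviewer accepted."""
    return [item for item in auto_annotated if accept(item)]

def build_mix(synthetic, reviewed_real, fraction, size, rng):
    """Draw a training set of `size` samples with roughly `fraction` real ones."""
    n_real = min(int(round(fraction * size)), len(reviewed_real))
    n_syn = size - n_real
    # Real samples are drawn without replacement, synthetic ones with replacement.
    return rng.sample(reviewed_real, n_real) + rng.choices(synthetic, k=n_syn)

rng = random.Random(0)
synthetic = [("syn", i) for i in range(100)]
auto_annotated = [("real", i) for i in range(50)]

# Stand-in reviewer: pretend every fifth automatic annotation is wrong.
reviewed = human_review(auto_annotated, accept=lambda item: item[1] % 5 != 0)

schedule = [real_fraction(r, 5) for r in range(5)]
final_mix = build_mix(synthetic, reviewed, schedule[-1], 40, rng)
```

The first round trains on synthetic data only, and each later round mixes in a larger share of reviewed real-world samples. In practice the ramp would presumably depend on how much reviewed data exists by then, not on a fixed schedule.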
1