Comments by "WloCkuz" (@wlockuz4467) on "Theo - t3․gg" channel.
-
The sketchiest thing about it is the training data.
To get this level of quality in the output you would need massive amounts of diverse data, and when I say massive I mean petabytes.
There is no way they licensed every bit of it. Not only would that be expensive, it would be extremely impractical.
So I am assuming a large part of the training data was just YouTube videos. I mean, where else would you get the same video in multiple resolutions and multiple languages, with captions, across multiple genres, and basically anything else you could ask for? When you think about it, YouTube is just a massive unorganised training dataset if you have the resources.
If you had a popular video on social media, then chances are you contributed to training Sora, whether you like it or not.
I would hope there's some regulation in place that prevents these companies from just straight up stealing data to train their models, or that someone would strong-arm OpenAI into opening up about their data, but with Microsoft behind them, that's just wishful thinking.
3