Comments by "Mikko Rantalainen" (@MikkoRantalainen) on the Anastasi In Tech video "Recent AI Advances Explained. It's All Accelerating!".
10:10 I think we shouldn't give too much value to the claimed performance of a system that isn't actually available yet. If Google thinks Gemini Ultra will be available sometime next year, it should be compared against the then-current GPT-4 variant.
72
@monad_tcp Running an AI neural network basically means computing huge matrix operations with a simple non-linear operation between them. Both are embarrassingly parallel workloads, so the only real question is how much electricity the AI requires per system. Right now the hardware needed to run huge LLMs is very expensive (a single LLM may require 4–8 Nvidia H100 cards), and the system needs all of that hardware for every answer. Granted, each question+answer takes maybe half a second to compute, but you need insanely expensive hardware for that half a second. As a result, currently available AI systems such as ChatGPT are implemented as lots of beefy systems answering questions, with a queue in front to improve total throughput per dollar. There's nothing in the system that prevents running it fully in parallel, except that buying even more hardware is too expensive to make financial sense. (Tip: try searching for "nvidia H100 price" and you'll understand why.)
1
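To make the "huge matrix operations with a simple non-linearity between them" point concrete, here is a minimal NumPy sketch of one feed-forward block with several queued requests batched together. The layer sizes, the `ffn` helper, and the GELU approximation are illustrative assumptions for the sketch, not the actual architecture or hardware setup discussed in the comment.

```python
import numpy as np

# Minimal sketch: each layer is a large matrix multiplication followed by a
# cheap element-wise non-linearity. Sizes here are tiny and hypothetical;
# real LLM layers are orders of magnitude larger.
rng = np.random.default_rng(0)

d_model, d_hidden = 64, 256                       # hypothetical layer widths
W1 = rng.standard_normal((d_model, d_hidden)) * 0.02
W2 = rng.standard_normal((d_hidden, d_model)) * 0.02

def gelu(x):
    # Simple element-wise non-linearity between the two matrix multiplications
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x):
    # x: (batch, seq, d_model) -> (batch, seq, d_model)
    return gelu(x @ W1) @ W2

# Batching queued requests is what improves throughput per dollar: all rows of
# the batch go through the same expensive matmuls at once on the same hardware.
queued_requests = [rng.standard_normal((16, d_model)) for _ in range(8)]  # 8 requests, 16 tokens each
batch = np.stack(queued_requests)                 # (8, 16, d_model)
out = ffn(batch)
print(out.shape)                                  # (8, 16, 64)
```

Because both matmuls and the element-wise non-linearity operate on every row independently, the same code scales across more accelerators; the limit is the cost of the hardware, as the comment notes, not the algorithm.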