Comments by "kazedcat" (@kazedcat) on "The Genius of DeepSeek’s 57X Efficiency Boost [MLA]" video.
It's hard to make the transition from perceptrons to transformers. What I found easier is giving up on making sense of it at the perceptron level and just understanding transformers as a series of matrix operations. As a matrix operation, the attention mechanism is a very simple matrix equation, so you can wrap your head around the entire transformer architecture much more easily by using the attention mechanism as your building block.
2
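A minimal sketch of the view the comment above describes: single-head scaled dot-product attention written as a handful of plain matrix operations in numpy. The sizes (sequence length 6, model dim 16, head dim 8) and variable names are illustrative assumptions, not the video's exact configuration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, d_head = 6, 16, 8            # sequence length, model dim, head dim
rng = np.random.default_rng(0)

X  = rng.normal(size=(n, d_model))       # token embeddings
Wq = rng.normal(size=(d_model, d_head))  # learned query projection
Wk = rng.normal(size=(d_model, d_head))  # learned key projection
Wv = rng.normal(size=(d_model, d_head))  # learned value projection

Q, K, V = X @ Wq, X @ Wk, X @ Wv         # three matrix multiplications
scores  = Q @ K.T / np.sqrt(d_head)      # (n, n) similarity matrix
A       = softmax(scores, axis=-1)       # attention weights, each row sums to 1
out     = A @ V                          # weighted mix of value vectors
print(out.shape)                         # (6, 8)
```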
They multiply the large input matrix by a smaller learned weight matrix to get a smaller, compressed output matrix. This is what they mean by latent space: you are reducing the dimensionality of your matrix by matrix multiplication.
2
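A short sketch of that compression step: a learned down-projection shrinks each row of the input by a single matrix multiplication. The dimensions (16 down to 4) and the name W_down are assumptions for illustration, not DeepSeek's real sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, d_latent = 6, 16, 4

X      = rng.normal(size=(n, d_model))        # full-size activations, one row per token
W_down = rng.normal(size=(d_model, d_latent)) # learned compression weights

C = X @ W_down                                # compressed latent matrix: (6, 4)
print(X.shape, "->", C.shape)                 # each row shrinks from 16 dims to 4
```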
@AIShipped Your Wq is now bigger, becoming Wq•Wuk
1
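A numpy check of the absorption identity the reply points at: scoring a query against keys rebuilt from the cached latent gives the same numbers as scoring against the latent directly with the merged matrix Wq•Wuk. All names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head, d_latent = 16, 4, 8

x   = rng.normal(size=(1, d_model))        # one query token
c   = rng.normal(size=(5, d_latent))       # cached latents for 5 past tokens
Wq  = rng.normal(size=(d_model, d_head))   # query projection
Wuk = rng.normal(size=(d_latent, d_head))  # key up-projection from the latent

scores_two_step = (x @ Wq) @ (c @ Wuk).T   # materialize the keys, then score
Wq_absorbed     = Wq @ Wuk.T               # merged matrix: (16, 8), bigger than Wq's (16, 4)
scores_merged   = (x @ Wq_absorbed) @ c.T  # score against the latent directly

print(np.allclose(scores_two_step, scores_merged))  # True
```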
@Pokemon00158 During training you need to keep the different weight matrices separate so you can update them independently. But during inference all the weights are fixed constants, so you can collapse them using linear algebra.
1
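A small sketch of that collapse, under assumed names and sizes: two chained linear maps that are trained as separate matrices can be folded into one precomputed product at inference time, and the outputs are identical.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_mid, d_out = 16, 4, 8

W1 = rng.normal(size=(d_in, d_mid))    # e.g. a down-projection, updated on its own during training
W2 = rng.normal(size=(d_mid, d_out))   # e.g. an up-projection, updated on its own during training
x  = rng.normal(size=(2, d_in))

y_separate  = (x @ W1) @ W2            # training-style: two multiplications
W_collapsed = W1 @ W2                  # folded once, offline, since the weights are now constants
y_collapsed = x @ W_collapsed          # inference-style: one multiplication

print(np.allclose(y_separate, y_collapsed))  # True
```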