Comments by "OpenGL4ever" (@OpenGL4ever) on "Creel" channel.
-
@lohphat As AVX-512 becomes more widespread, programmers can be more confident that it will be supported. At some point AVX-512 will then be chosen as the compile target, as the lowest common denominator. Today this is already the case with SSE2, for example, because every 64-bit x86 CPU can handle SSE2.
What you can also do is simply build several different binaries, one with AVX2, one with AVX-512 and one without AVX support, and put a small launcher program in front of them. When the user starts the program, this launcher first checks which features the CPU supports, and only then is the matching binary started.
This was done, for example, for the computer game "The Chronicles of Riddick: Escape from Butcher Bay" from 2004. AVX didn't exist at the time, but MMX, SSE, SSE2 and 3DNow! did. The matching binary was started according to the CPU's capabilities.
Of course, this is only a small effort if the code is written in a high-level language and the compiler is advanced enough to optimize for the corresponding SIMD units. Otherwise the code or at least parts of it would have to be written manually for each SIMD unit type and that takes a lot of time and therefore money.
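The launcher idea described above could look roughly like this in C. This is only a sketch under assumptions: the binary names are made up by me, and the CPU check uses the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, so on other compilers or non-x86 targets it simply falls back to the baseline binary.

```c
#include <stddef.h>

/* Pick a binary name from the detected features.
 * The names are hypothetical placeholders, not taken from any real product. */
const char *pick_binary(int has_avx512f, int has_avx2)
{
    if (has_avx512f)
        return "app_avx512";   /* most capable build first */
    if (has_avx2)
        return "app_avx2";
    return "app_baseline";     /* build without AVX support */
}

/* Query the CPU this code runs on and pick the matching binary.
 * Assumption: GCC or Clang on x86; everything else gets the baseline. */
const char *pick_binary_for_this_cpu(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return pick_binary(__builtin_cpu_supports("avx512f"),
                       __builtin_cpu_supports("avx2"));
#else
    return pick_binary(0, 0);
#endif
}
```

A real launcher would then start the file returned by `pick_binary_for_this_cpu()`, for example with `execv` on POSIX systems, and pass the user's command-line arguments through.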
-
@y2ksw1 "mov and nop can execute together, and inc on two different registers, too. If I would try to use mov and inc together, mov had to wait for inc to finish."
That wasn't my point. My point was that your inc still has to wait for the mov.
Let's assume we have a CPU with two execution units A and B, so it can execute 2 instructions per step.
Step 1:
A does "mov ebx, eax"
B does "nop"
Step 2:
A does "inc eax"
B does "inc ebx" ; here B has to wait until A's mov from step 1 is finished
And now my version:
Step 1:
A does "inc eax"
B unused or a nop
Step 2:
A does "mov ebx, eax" ; here A has to wait until A's inc from step 1 is finished
B unused or a nop
So both variants take 2 steps. Where is your performance gain?
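To make the step counting above reproducible, here is a small sketch in C of the toy model I am assuming: an in-order CPU with two units that puts two neighbouring instructions into the same step only if the second one neither reads nor writes the register the first one writes. The struct, the register constants and the function names are my own invention for this illustration; it is not a model of any real CPU.

```c
#include <stddef.h>

enum { NONE = -1, EAX = 0, EBX = 1 };

struct instr {
    const char *text; /* assembly text, for readability only */
    int dst;          /* register written, or NONE */
    int src;          /* register read, or NONE */
};

/* Variant 1: mov first, padded with a nop. */
const struct instr variant_mov_first[] = {
    { "mov ebx, eax", EBX,  EAX  },
    { "nop",          NONE, NONE },
    { "inc eax",      EAX,  EAX  },
    { "inc ebx",      EBX,  EBX  },
};

/* Variant 2: inc first, then mov. */
const struct instr variant_inc_first[] = {
    { "inc eax",      EAX,  EAX  },
    { "mov ebx, eax", EBX,  EAX  },
};

/* Two neighbours can share a step if the second does not depend on
 * the result of the first (no read-after-write, no write-after-write). */
int can_pair(const struct instr *first, const struct instr *second)
{
    if (first->dst == NONE)
        return 1;
    if (second->src == first->dst)
        return 0;
    if (second->dst == first->dst)
        return 0;
    return 1;
}

/* Count steps for an in-order machine that issues up to 2 instructions. */
int count_steps(const struct instr *prog, size_t n)
{
    int steps = 0;
    size_t i = 0;
    while (i < n) {
        steps++;
        if (i + 1 < n && can_pair(&prog[i], &prog[i + 1]))
            i += 2; /* both units busy in this step */
        else
            i += 1; /* second unit stays idle */
    }
    return steps;
}
```

Under this model `count_steps(variant_mov_first, 4)` and `count_steps(variant_inc_first, 2)` both come out as 2, which is the point of the comparison.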
As far as I know it's the other way around. The U-pipe can execute any instruction in the Intel architecture while the V-pipe can execute only simple instructions.
Source: the book "Computerarchitektur" by Andrew S. Tanenbaum and James Goodman, ISBN 3-8273-7016-7. (It's the German edition, that's why "Computer Architecture" is spelled differently.)
But I agree that you should use the correct pipe for each instruction and order the instructions accordingly if the pipes are different.
I also agree with the rest of what you said, and it might explain why your code is faster, but in that case only because, by luck, the right unit was fed with the right instructions.
What happens if you put a NOP into the instruction chain of my code to explicitly make sure that the execution units are fed with instructions they can handle?
For if statements, it is best to use branchless programming techniques on modern CPUs whenever possible. High-level language compilers will do this automatically and optimize the code for it in most cases, but not in all cases. But I assume you know that already.
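As a minimal illustration of what branchless means here, this C sketch shows the same selection once with an if and once with a bit mask. The function names are made up by me, and whether the mask version is actually faster depends on the CPU and on what the compiler turns each version into, so treat it as an example of the technique and not as a benchmark result.

```c
#include <stdint.h>

/* Ordinary version: the CPU has to predict which way the if goes. */
uint32_t select_branchy(int cond, uint32_t a, uint32_t b)
{
    if (cond)
        return a;
    return b;
}

/* Branchless version: build a mask that is all ones when cond is
 * true and all zeros otherwise, then combine both inputs with it. */
uint32_t select_branchless(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = -(uint32_t)(cond != 0); /* 0xFFFFFFFF or 0 */
    return (a & mask) | (b & ~mask);
}
```

Both functions return `a` when `cond` is non-zero and `b` otherwise. An optimizing compiler may well compile the first version to a conditional move on its own.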