Comments by "Vitaly L" (@vitalyl1327) on "Low Level" channel.

meh, relax, with the current state of LLMs they won't replace programmers any time soon. Likely, never.
17
Yet, in the vast majority of cases there is a better static alternative to any of the OOP tools you feel obliged to use for some reason.
2
blak tobor, these days all FP is IEEE 754, so there's not much diversity, no matter whether it's soft or hard float.
2
There is simply no replacement for a computed goto. And languages that do not have it suffer badly in terms of performance.
2
The hard part is solving an NP-complete graph colouring problem in a reasonable time with a reasonable quality - something modern compilers do very fast with decent heuristics. Of course, there are a lot of problems where the 16 or so registers of modern CPUs are more than enough, but the moment they're not enough you've got no choice but to do register allocation.
2
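A minimal sketch of the kind of heuristic the register allocation comment above refers to: greedy colouring of an interference graph, highest-degree node first. The function name, the graph and the "spill" marker are all hypothetical, and real compilers use considerably more elaborate schemes (coalescing, spill cost estimates, live range splitting).

```python
# Greedy colouring of a register interference graph (a heuristic sketch,
# not an optimal allocator). All names here are hypothetical.

def allocate_registers(interference, num_regs):
    """Map each virtual register to a physical register index or "spill".

    interference: dict mapping a virtual register name to the set of
    virtual registers that are live at the same time.
    """
    # Heuristic: colour the most constrained nodes (highest degree) first.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    assignment = {}
    for v in order:
        # Physical registers already used by neighbours are off limits.
        taken = {assignment[n] for n in interference[v]
                 if n in assignment and assignment[n] != "spill"}
        free = [r for r in range(num_regs) if r not in taken]
        assignment[v] = free[0] if free else "spill"
    return assignment


# a, b, c are all live at once (a triangle); d only overlaps with c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(allocate_registers(graph, 3))  # everything fits in three registers
print(allocate_registers(graph, 2))  # the triangle forces one spill
```

With three registers the triangle a, b, c gets three distinct colours and d reuses one of them; with two registers one of the three has to be spilled, which is the point where the allocator stops being optional.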
@fredspreadem5638 RISC-V, various GPU architectures, to name a few
2
Impossible. Goto is essential for representing irreducible CFGs (since all structured constructs are, by definition, reducible). Which means pretty much any mildly complicated FSM, including all kinds of parsing and tree walking. Languages without a goto are crippled languages.
2
@monad_tcp lol, now try to implement an efficient indirect threaded code interpreter without a goto.
2
@SillySussySally this is exactly what I said - GPUs issue instructions from other threads (in NVidia parlance, warps), while OoO CPUs issue instructions from the same thread that they know do not depend on anything that's currently stalled. So, yes, OoO CPUs have higher latency. Simpler CPUs (such as the ones you'll find in microcontrollers) have a much lower latency and, more importantly, a predictable latency. GPUs have lower latency (in terms of cycle count, not time - they normally run at a lower clock frequency) just by virtue of being much simpler cores with shorter and simpler pipelines. Keep in mind that the exact NVidia microarchitecture is not public knowledge, so we can only assume here. There are other GPU designs that are far more open and well documented though, so we can extrapolate from that knowledge. I personally worked on two mobile GPU cores, ARM Mali 6xx and Broadcom VC5, and they are wildly different from each other. Latencies in both were (in clock cycles) still smaller than in high performance Intel cores and high end ARM cores (but higher than in the in-order ARM cores).
2
@SillySussySally Branch prediction (and in GPU context, divergence) is a totally different story here. I'm talking about hazards - the register dependencies between instructions. In an OoO architecture instructions are dynamically stalled while waiting for the register they depend on to be updated by a previous instruction, while instructions that do not need to wait for anything can be executed out of order. Another mechanism is register renaming - when you have more physical registers than the register addressing space allows. In an in-order architecture you either stall everything until the result is ready, or trust the compiler not to put the dependent instructions too close to each other. All GPUs are in-order. And GPUs have a lot of addressable registers, split between threads / warps, so no register renaming here. You can read about, say, Tomasulo's algorithm to see how it's done in OoO.
2
Well, that's really easy, just use fractional arithmetic. Feel free to use floating point data types (but store only integer values in them). People who don't understand how floating point works should not use floating point numbers, really.
1
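To illustrate the fractional arithmetic suggestion above, here is a small sketch using Python's standard `fractions.Fraction`, which keeps an exact integer numerator and denominator. The specific values are only an example.

```python
from fractions import Fraction

# Binary floating point cannot represent 0.1 or 0.2 exactly.
print(0.1 + 0.2 == 0.3)          # False

# Exact rationals: numerator and denominator are plain integers.
a = Fraction(1, 10)
b = Fraction(2, 10)
print(a + b == Fraction(3, 10))  # True

# Ten tenths add up to exactly one as fractions, but not as floats.
total = sum([Fraction(1, 10)] * 10, Fraction(0))
float_total = sum([0.1] * 10)
print(total == 1, float_total == 1.0)
```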
It's not "fine" on so many levels. Like, try to add a billion floating point numbers with a very wide range of values. There is a possible ordering of those numbers that makes your answer wrong by orders of magnitude. The simplest way to get a more accurate answer is to sort the array in ascending order of magnitude first (or to use compensated summation). And that's just one example. There are dozens more floating point pitfalls that people tend to know nothing about, for some weird reason.
1
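A small sketch of the summation pitfall described above, with made-up values: one large number followed by many small ones. It compares naive left-to-right summation, summation in ascending order of magnitude, and the standard library's `math.fsum`, which returns a correctly rounded sum.

```python
import math

# One large value followed by many tiny ones: naive left-to-right summation
# loses every tiny term, because 1.0 is below the rounding step at 1e16.
values = [1e16] + [1.0] * 1000

naive = 0.0
for v in values:
    naive += v

# Summing smallest-magnitude first lets the small terms accumulate
# before they meet the large one.
ascending = 0.0
for v in sorted(values, key=abs):
    ascending += v

exact = math.fsum(values)  # correctly rounded sum from the standard library

print(naive == exact)      # False
print(ascending == exact)  # True
```

Sorting helps in this case, but it is not a general guarantee: with mixed signs and cancellation, compensated summation or `math.fsum` is the safer choice.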
@rsa5991 This is not what the word "precision" means
1
@Lord-Sméagol and when you have direct access to the computed goto, you can pre-cache the destination labels, while a switch will introduce a level of indirection through a label table. See how threaded bytecode interpreters use computed goto for example.
1
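Python has no goto, so the following is only a loose analogy for the "pre-cache the destinations" step in the comment above, not threaded code itself: opcode numbers are resolved to handler references once, before execution, instead of going through a table on every dispatch. The stack machine, its opcodes and all function names are hypothetical. In C with the GNU "labels as values" extension, the pre-resolved entries would be label addresses and each handler would jump straight to the next one.

```python
# Hypothetical four-opcode stack machine.
PUSH, ADD, MUL, HALT = range(4)

def op_push(stack, arg):
    stack.append(arg)

def op_add(stack, arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

def op_mul(stack, arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

HANDLERS = {PUSH: op_push, ADD: op_add, MUL: op_mul}

def run_switch_style(bytecode):
    """Look the handler up in the table on every single instruction."""
    stack = []
    for opcode, arg in bytecode:
        if opcode == HALT:
            break
        HANDLERS[opcode](stack, arg)
    return stack.pop()

def prethread(bytecode):
    """Replace opcode numbers with direct handler references, once."""
    return [(HANDLERS.get(opcode), arg) for opcode, arg in bytecode]

def run_threaded(threaded):
    """Dispatch through the pre-resolved references, with no table lookup."""
    stack = []
    for handler, arg in threaded:
        if handler is None:  # HALT has no handler and resolves to None
            break
        handler(stack, arg)
    return stack.pop()

# (2 + 3) * 4
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None), (HALT, None)]
print(run_switch_style(program), run_threaded(prethread(program)))
```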
systemd was always the cancer of the unix world. High time to ditch it for good.
1
@dmitripogosian5084 do you realise that OpenCL is an open standard and vendor-independent, unlike CUDA? How is your comment even relevant here? There is nothing that CUDA can do and OpenCL can not.
1
@Miguel_Noether In your case, the cost of transferring data between main memory and the GPU will be prohibitive. You need to build as much of your compute pipeline as possible on the GPU, because moving data there and back is very expensive. If you want to accelerate a short local computation on a GPU, you need something with a unified memory architecture (i.e., an integrated GPU).
1
Not every ARM core, just the very new ones with tagging support. Not to mention that most ARM cores do not do any speculative execution above 3-5 cycles ahead.
1
Static polymorphism and CRTP do not suck though. And it's not really a revelation that dynamic anything does suck. Dereferencing sucks. Breaking cache locality sucks.
1
This is a very useful approach, can cut the number of cycles down a lot: http://www.acsel-lab.com/arithmetic/arith9/papers/ARITH9_Fowler.pdf
1
your desktop CPU is also designed for maximising throughput, latency be damned. If you want a CPU core designed for low latency, look at something like ARM Cortex-R range.
1
@SillySussySally confusing latency and throughput, aren't you? And no, no CPU has a 1 cycle latency. And there are dozens of different GPU architectures with different pipelines
1
@SillySussySally How is your response even relevant? We've been talking about those mythical 1-cycle latency instructions of yours (and equally mythical 4-cycle latency GPU instructions). Mind elaborating on that?
1
@SillySussySally comments can get removed if they contain URLs. Mine are always disappearing if there are any links. This is not the full latency - it is the latency between consecutive instructions. The actual latency is the number of clock cycles between instruction fetch and instruction retirement, and for a modern OoO CPU it's anything from a few dozen cycles to hundreds of clock cycles. For a GPU it's lower, but still above a dozen cycles (pipeline length + all possible stalls).
1
@SillySussySally the difference here is that in a large OoO core instruction dependencies are handled dynamically, and instructions that are independent of the ones currently being executed or stalled can get issued out of order. This way a high instruction-level parallelism is achieved, with many instructions being processed at the same time, absorbing the high latency of each individual instruction. In GPU cores no such mechanisms exist. It's up to you to handle the dependencies (hazards). E.g., to make sure the instruction that depends on the previous arithmetic instruction's result is scheduled at least 4 clock cycles after it. Parallelism is achieved by scheduling multiple independent threads into a single execution unit, so it does not have to be stalled. It also results in high latency for individual instructions, but it's absorbed by high thread-level parallelism (instead of sequential instruction-level parallelism in OoO cores).
1
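A toy sketch of the software-side hazard handling described in the comment above: pad the instruction stream with NOPs so that a consumer issues at least 4 slots after its producer. The instruction format, the register names and the fixed latency of 4 are assumptions made for illustration only and do not describe any specific GPU ISA.

```python
# Hypothetical in-order core with no interlocks: a result becomes readable
# LATENCY issue slots after the instruction that produces it.
LATENCY = 4

def pad_hazards(instrs, latency=LATENCY):
    """Insert NOPs so every consumer issues at least `latency` slots
    after the producer of each register it reads.

    instrs: list of (name, dest_register, [source_registers]).
    """
    out = []
    ready_at = {}  # register -> first slot at which it may be read
    for name, dest, srcs in instrs:
        earliest = max([ready_at.get(s, 0) for s in srcs], default=0)
        while len(out) < earliest:
            out.append(("nop", None, []))
        if dest is not None:
            ready_at[dest] = len(out) + latency
        out.append((name, dest, srcs))
    return out

prog = [
    ("mul", "r1", ["r0", "r0"]),
    ("add", "r2", ["r1", "r0"]),  # depends on the mul result
    ("sub", "r3", ["r0", "r0"]),  # independent of both
]
print([name for name, _, _ in pad_hazards(prog)])
# ['mul', 'nop', 'nop', 'nop', 'add', 'sub']
```

A real compiler would try to move independent instructions such as the `sub` into those slots instead of emitting NOPs, and the hardware hides the rest by switching to other threads.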
@SillySussySally Sure, but when it's executing instructions from the same thread, in most GPU architectures (again, we don't even know the ISA of NVidia, so cannot tell exactly) it's your (or your compiler's) responsibility to make sure instructions with a direct dependency do not follow each other. You can see a similar design constraint in the early RISC architectures, such as MIPS. GPU cores do not track dependencies and (at least those I know) don't even do forwarding.
1
@RAINE____ you don't really need dynamically sized anything. It's just a bad habit, a convenience for programmers who did not want to think before coding.
1
@jonassattler4489 wrong. You should not be allowed anywhere near any mission-critical or real-time code with such an attitude. Stick to coding CRUD in JavaScript.
1
@RAINE____ you don't need any such things, really. And if you really need them for some weird reason (trust me, you won't find real reasons), do your own dynamic allocation on top of a pre-allocated static array. Just like we did in Fortran77 times.
1
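A minimal sketch of "your own dynamic allocation on top of a pre-allocated static array", as suggested in the comment above: a fixed-size block pool with a free list, where the buffer and the index array are both sized once at startup. The class name, block size and block count are hypothetical, and since Python itself allocates dynamically, this shows only the bookkeeping. The same structure maps directly onto a static array in C or a COMMON block in Fortran 77.

```python
class StaticPool:
    """Fixed-size block allocator over one buffer allocated at startup."""

    def __init__(self, block_size, block_count):
        self.block_size = block_size
        self.memory = bytearray(block_size * block_count)  # sized once, never grown
        # Free list threaded through an index array, also sized once.
        self.next_free = list(range(1, block_count)) + [-1]
        self.head = 0 if block_count > 0 else -1

    def alloc(self):
        """Return a block index, or -1 when the pool is exhausted."""
        if self.head == -1:
            return -1
        block = self.head
        self.head = self.next_free[block]
        return block

    def free(self, block):
        """Return a block to the pool (no double-free checking in this sketch)."""
        self.next_free[block] = self.head
        self.head = block

    def view(self, block):
        """Writable window onto the block's bytes inside the static buffer."""
        start = block * self.block_size
        return memoryview(self.memory)[start:start + self.block_size]


pool = StaticPool(block_size=16, block_count=2)
first = pool.alloc()
second = pool.alloc()
print(pool.alloc())  # -1: the worst case is known up front, nothing grows
pool.free(first)
print(pool.alloc())  # 0: the freed block is reused
```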
@anon_y_mousse we're talking about NASA code that runs in spacecraft. They don't need web or crud crap.
1
@jonassattler4489 I wrote a text editor in Fortran77 once. Poor me, how did I manage without dynamic allocation?
1
@jonassattler4489 ever heard of overlays? Dynamic allocation makes some sense in a multi-process environment where processes cooperate and share the limited RAM. It is a crappy solution only suitable for primitive desktop kinds of use, where nobody cares if the OOM killer starts slaughtering random processes. When you care about availability, you cannot afford it. You must guarantee that every process will get as much as it will ever need at the worst case load. Meaning, split the RAM between all users, pre-allocate once and never allow it to change. Same goes for CPU time in real-time systems. The way code monkeys are used to coding is extremely destructive and has no place in any remotely professional environment. Forget browsers and other primitive crap, they have no place anywhere where there are real world consequences of software errors.
1
@jonassattler4489 well, we're in a thread about NASA. So it is sort of funny when crud coders come here with their very valuable comments about "how can you even code without dynamic allocation?!?".
1
@jonassattler4489 again, you absolutely can (and, likely, should) write pretty much anything without a dynamic allocation. Dynamic allocation is about how processes cooperate with the OS and each other. If you don't care about the other processes, you can statically pre-allocate enough and live with it. Use overlays if enough is not enough. This is how things were for a very long time.
1
@jonassattler4489 again, you absolutely can write a browser without dynamic allocation. It'd even make sense, given how uncontrollable browsers are in eating up memory anyway. Why not just legalize this state of affairs and eat up all the memory straight away?
1
@jonassattler4489 why? Your PC runs it just fine. And all other processes suffer. Just legalise it, give it 30 out of 32GB of RAM, as it'll take it all anyway. But then you'll know the remaining 2GB are all yours.
1
@jonassattler4489 just allocate enough at startup, and never use more and never pretend you can use less. You little kiddies do not realise that it's how we used to write software before dynamic allocation. And it worked, better than your bloated slow crap these days. There was no dynamic allocation in Fortran77.
1
@9SMTM6 you can write a decimal floating point implementation though, then argue that it conforms to the homework requirements.
1
@NinjaRunningWild no. Binary fixed point will have the same issues. You need fractional arithmetic here, or decimal fixed or floating point.
1
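A short sketch of why binary fixed point has the same problem as binary floating point, next to the decimal alternatives mentioned above. The Q16 format (16 fractional bits) is an arbitrary choice for illustration, and `to_fixed` is a hypothetical helper.

```python
from decimal import Decimal

FRAC_BITS = 16  # hypothetical Q16 binary fixed-point format
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Nearest Q16 integer representation of a decimal string."""
    return round(Decimal(x) * SCALE)

# 0.1 has no finite binary expansion, so Q16 has to round it.
tenth = to_fixed("0.1")
binary_sum = tenth * 10   # ten times "0.1" in binary fixed point
one = to_fixed("1")
print(binary_sum == one)  # False

# Decimal fixed point: store an integer count of hundredths instead.
cents = 10 * 10           # ten times 0.10, as integer cents
print(cents == 100)       # True

# Decimal floating point is also exact for this case.
dec_sum = sum([Decimal("0.1")] * 10, Decimal(0))
print(dec_sum == 1)       # True
```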
@williamdrum9899 because it's a CISC with a large immediate space. You cannot have both - you either have a lot of addressable registers, as is typical for RISCs, or you have easily accessible immediates for most of the ops, and then far fewer addressable registers.
1
No macros - therefore not a match for Rust and other meta-languages.
1