Comments by "clray123" (@clray123) on the "bycloud" channel.
Surely, if their products provide so much value, they should have an easy time convincing insurers to GUARANTEE that value or give you your money back. But what they are doing instead is forcing YOU to INDEMNIFY them in court in case their product fucks up. You do not really need much additional information on top of that to arrive at a conclusion about their current product quality. The reality is that incessant bullshit talk by Sam A. and co. costs absolutely nothing, while actual quality insurance would. Better yet, if their products provide such great value, why not just capitalize on it directly instead of offering lame subscriptions? This reminds us of the financial services model, where you pay a "consultant" for the "service" while retaining all the risk (and with a very lacking statistical record of the consultants ever providing any sort of market edge to their customers, unlike for themselves).
1
Yes, you are missing the fact that the additional tokens, regardless of what they are, help improve performance on benchmarks. Which is roughly explained by pointing out that some algorithms that need to be simulated to solve a task require strictly more computation than is expended during a single inference (or training) step. So expecting that you can train a model to solve a task which mathematically requires n steps in a single step is simply asking for the impossible. Another way to put it: models require long sequences to learn complex algorithms during training; and because inference executes the same algorithms learned during training, it also requires the same long sequences. The only point here is that the long sequences of tokens do not clearly map to how humans would approach solving the task, and are misleading/useless for humans wishing to understand the algorithms actually at work in LLMs. (But they help fool investors, so they are useful to LLM research after all.)
1
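A minimal toy sketch of the n-steps point above (my own illustration, not anything from the video; all names are made up): iterated parity of n bits inherently takes n sequential updates, but each update becomes trivial if the intermediate state is externalized as one "scratchpad" token per step, which is roughly what long token sequences buy the model.

# Toy illustration: parity of n bits takes n sequential steps. A model allowed
# to emit one "scratchpad" token per step only has to learn the single-step
# transition; forcing the answer in one shot means representing a global
# function of all n bits at once, which gets much harder as n grows.

def parity_single_shot(bits):
    """What 'answer in one step' has to compute: a function of all bits at once."""
    result = 0
    for b in bits:
        result ^= b
    return result

def parity_with_scratchpad(bits):
    """Externalize the intermediate state as 'tokens'; each new token depends
    only on the previous token and one input bit."""
    trace = [0]                      # running parity, emitted step by step
    for b in bits:
        trace.append(trace[-1] ^ b)  # one cheap transition per emitted token
    return trace                     # n+1 tokens for an n-step algorithm

bits = [1, 0, 1, 1, 0, 1]
print(parity_single_shot(bits))      # 0
print(parity_with_scratchpad(bits))  # [0, 1, 1, 0, 1, 1, 0] -- length tracks the step count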
@half_mexican4071 Whether what we see has really "emerged from RL" or was "helped a bit" by the wishful-thinking and marketing team, we don't know. As far as I'm aware, DeepSeek has not published their exact training datasets and procedures, and we have no way of checking how much SFT juice was really involved in making the reasoning appear "human" rather than just random gibberish.
1
Now repeat after me: LLAMA IS NOT OPEN. IT HAS NEVER BEEN. AND LIKELY NEVER WILL BE.
1
Maybe they should call it Shallow Research or Copy Paste Research instead?
1
@vaolin1703 How do you know? When millions of dollars are at stake, why would you assume they are not cheating? It's not like they are an OPEN research organization.
1
@ For all we know, they could have contaminated training with benchmark data on all benchmarks. You just can't check it. It seems to have become a marketing strategy for some guys to come up with new benchmarks just so that OpenAI can beat those benchmarks (and only those). Apparently investors are still swallowing this "oh, we beat a benchmark" bs instead of running away based on the actual real-world performance of these models.
1
What do you expect from a YT sponsor? They're all scams.
1
So far we have one small model released by Microsoft, more than a year after the paper... and it's pretty shitty.
1
Autoregressiveness is not really a limitation since you can simulate backtracking by just outputting more tokens. In fact, it is arguably better because you retain memory of all the simulated backtracking.
1
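A minimal sketch of what "simulating backtracking by outputting more tokens" could look like (a toy of my own, not anything a real model literally runs; the problem and names are invented): a depth-first search whose exploration, dead ends included, is flattened into one linear trace, so nothing that was explored is ever lost from the context.

# Toy sketch: backtracking search over a tiny constraint problem, flattened
# into a linear "token" trace. An autoregressive model never rewinds its
# output, but it can emit tokens like "try X" / "dead end, backtrack", and
# because the dead ends stay in the context, later steps can condition on
# everything already explored.

def solve(letters, target, prefix="", trace=None):
    """Find a 2-letter string whose 'score' (sum of letter positions) hits target."""
    if trace is None:
        trace = []
    score = sum(ord(c) - ord('a') for c in prefix)
    trace.append(f"try '{prefix}' (score {score})")
    if score == target and len(prefix) == 2:
        trace.append("accept")
        return prefix, trace
    if score > target or len(prefix) >= 2:
        trace.append("dead end, backtrack")      # backtracking = just more tokens
        return None, trace
    for c in letters:
        found, trace = solve(letters, target, prefix + c, trace)
        if found:
            return found, trace
    return None, trace

answer, trace = solve("abc", 3)
print(answer)             # 'bc' (1 + 2 = 3)
print(" | ".join(trace))  # one linear stream containing every explored branch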
@rogue_minima_roni_l No.
1
What is shocking about that? It was evident from the very beginning, especially from examples where the "final answer" completely contradicted the "thinking" before it.
1
@SianaGearz I found quite a few. Especially Qwen is conflicted between its thinking and the censorship patterns in the final answer. I catch it thinking at length about generating the content, then generating a short "Sorry, I cannot do it" at the end.
1
@SianaGearz Also, you have to realize that the distilled models are just supervised fine-tuning. The only thing they do is reproduce what was in their dataset; there is not even any RL there (although the dataset was generated from a model trained using RL). So it's just good old "dataset interpolation" going on here, nothing more. And as you realize, training the model to reproduce a text which contains "<Say idiotic thing.> Oops, that can't be right, let's fix that mistake." is no different from training the model to say "Pigs can fly and there is really a lot of scientific evidence for it." Given enough iterations (or even really few iterations if you apply a technique like DPO), you can make the model output whatever you want, triggered by whatever input sequence you want.
1
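A minimal sketch of the SFT-only distillation setup being described, assuming a generic causal LM; the model name and the example trace are placeholders, not DeepSeek's actual data or code. The point is that the loss is ordinary next-token cross-entropy on whatever text the dataset contains, with no reward signal anywhere.

# Minimal sketch (my assumption of the setup, not anyone's actual pipeline):
# "distillation" here is just supervised fine-tuning -- next-token
# cross-entropy on teacher-generated text. The student is pushed to reproduce
# the trace verbatim, whether it is a genuine correction or "<Say idiotic
# thing.> Oops, that can't be right, let's fix that mistake." No RL involved.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A teacher-generated "reasoning" trace: the student simply imitates it.
trace = (
    "Question: is 17 prime? "
    "Wait, let me check divisibility... 17/3 is not whole... "
    "Oops, I should also check 5 and 7... not divisible. Answer: yes."
)

batch = tokenizer(trace, return_tensors="pt")
# Labels == inputs: plain next-token prediction on the dataset text.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()   # gradient pushes the model toward reproducing the trace
optimizer.step()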