Comments by "clray123" (@clray123) on the "bycloud" channel.
@joelcoll4034 Yes. As long as what's in the context window helps the learned function of weights and context-window embeddings (which are themselves derived from the weights plus the input token embeddings) - whatever that function is - produce the correct final answer, those extra tokens serve their goal.
The optimization goal of reinforcement learning is to produce the correct verifiable answer, not to produce a trajectory toward it that seems sensible to us (more precisely, we have no means of automatically checking that the trajectory is "sensible", because doing so already requires the very reasoning we're trying to teach).
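To make that concrete, here is a minimal Python sketch of an outcome-only "verifiable" reward. The ANSWER: marker convention and the exact-string comparison are assumptions for illustration, not any particular lab's setup; the point is simply that the trajectory is never inspected, only its endpoint.

```python
# Minimal sketch of an outcome-only verifiable reward.
# Illustrative assumptions: the model is prompted to end its output with
# "ANSWER: <value>", and answers are compared by exact string match.
def verifiable_reward(generated_text: str, reference_answer: str) -> float:
    marker = "ANSWER:"
    if marker not in generated_text:
        return 0.0  # no parsable final answer -> no reward
    final = generated_text.rsplit(marker, 1)[1].strip()
    # All intermediate "reasoning" tokens are ignored: any trajectory that
    # lands on the right string gets full reward, sensible or not.
    return 1.0 if final == reference_answer.strip() else 0.0
```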
But this is bad news, because we would actually like that trajectory to be sensible, to ensure that the model mimics human problem-solving behavior rather than some incidental stumbling onto correct answers that happened to work on the training dataset.
The story here is: the more we want the model to follow a certain path, the more constraints we have to add to the reward function. However, had we known how to specify the reward function in full bloody detail, we would not need to use RL in the first place. We could just write down the algorithm which calculates the optimal value in the old-fashioned way.
So it's a chicken-and-egg problem of sorts, and generally bad news for any algorithm constructed via RL. The saving grace is that for some problems the RL-obtained solutions are much better than no solutions at all, and also the only solutions we have (which doesn't mean they are optimal, correct, or even good enough for practical purposes, especially in safety-critical systems).
Some hope lies in the model's performance gradually improving by means of self-correction. There is some evidence that this self-improvement process works in limited domains (e.g. AlphaSolve). This automated (however slight) improvement is the holy grail of AI, on which the success or failure of the entire field seems to depend.
@kazedcat Actually no, the reward is distributed over all tokens, so if you want a strong reinforcement signal, you want as few tokens as possible. It is a well-known problem in RL that when the trajectory of actions (here, tokens) is long, learning slows down because the final reward cannot be easily attributed to any particular earlier action. The same is true of real-life RL: if your reward only arrives after 1000 steps of working toward something, you don't know which of those steps mattered for success and which did not; but if the reward comes right after a few steps, you are likely to repeat exactly those steps next time.
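A toy illustration of that dilution (my own numbers, using a plain discounted-return scheme as in vanilla policy gradient - not how any specific LLM RL pipeline actually assigns credit): with a single terminal reward, the share of the signal an early token "sees" shrinks rapidly as the trajectory grows.

```python
import numpy as np

# Toy credit-assignment sketch: one terminal reward, zero intermediate rewards,
# and discounted returns G_t = r_t + gamma * G_{t+1} (vanilla policy gradient).
def per_token_returns(num_tokens: int, terminal_reward: float = 1.0,
                      gamma: float = 0.99) -> np.ndarray:
    rewards = np.zeros(num_tokens)
    rewards[-1] = terminal_reward     # reward only at the end of the episode
    returns = np.zeros(num_tokens)
    running = 0.0
    for t in reversed(range(num_tokens)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# The first token's share of the terminal reward collapses with length:
print(per_token_returns(10)[0])    # ~0.91
print(per_token_returns(1000)[0])  # ~4e-05
```

(Even in setups that use no discounting and give every token the same sequence-level reward, the attribution problem doesn't vanish - it shows up as higher variance rather than decay.)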
So why long token sequences, then? Well, the intuition is that with each generated token the model at least has an opportunity to internally calculate "something", whereas without tokens it can calculate "nothing" (in the extreme case of outputting 0 tokens). The amount of calculation per token is constant/fixed, so if an algorithm demands a sufficiently large, variable number of calculation steps which cannot be parallelized because they have inherent data dependencies (step n+1 depends on what step n produces), then it just can't be approximated by an underpowered calculator.
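A trivial example of such an inherently serial computation (the recurrence itself is arbitrary, picked only for illustration): each step consumes the previous step's output, so there is no way to parallelize or skip ahead, and a model that does a fixed amount of work per emitted token needs on the order of one token per step.

```python
# Toy inherently-serial computation: step n+1 consumes step n's output, so the
# chain can be neither parallelized nor skipped. (The constants are Knuth's
# MMIX LCG parameters; the particular recurrence doesn't matter.)
def iterate(x0: int, steps: int) -> int:
    x = x0
    for _ in range(steps):
        x = (x * 6364136223846793005 + 1442695040888963407) % 2**64
    return x

# With fixed compute per generated token, emulating `steps` dependent updates
# needs roughly `steps` emitted tokens; zero output tokens means a single
# fixed-depth forward pass.
print(iterate(42, 1000))
```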