Comments by "clray123" (@clray123) on the "bycloud" channel.
@joelcoll4034 Yes. As long as what's in the context window helps the learned function of weights and context-window embeddings (which are themselves derived from the weights plus the input token embeddings) - whatever that function is - produce the correct final answer, those extra tokens serve their goal.
The optimization goal of reinforcement learning is to produce the correct verifiable answer, not to produce a trajectory toward it that seems sensible to us (more precisely, we have no means of automatically checking that the trajectory is "sensible", because doing so already requires the very reasoning we're trying to teach).
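To make that concrete, here is a minimal Python sketch of an outcome-only "verifiable" reward. The ANSWER: marker convention and the exact-string comparison are assumptions for illustration, not any particular lab's setup; the point is simply that the trajectory is never inspected, only its endpoint.

```python
# Minimal sketch of an outcome-only verifiable reward.
# Illustrative assumptions: the model is prompted to end its output with
# "ANSWER: <value>", and answers are compared by exact string match.
def verifiable_reward(generated_text: str, reference_answer: str) -> float:
    marker = "ANSWER:"
    if marker not in generated_text:
        return 0.0  # no parsable final answer -> no reward
    final = generated_text.rsplit(marker, 1)[1].strip()
    # All intermediate "reasoning" tokens are ignored: any trajectory that
    # lands on the right string gets full reward, sensible or not.
    return 1.0 if final == reference_answer.strip() else 0.0
```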
But this is bad news, because we would actually like that trajectory to be sensible, to ensure that the model mimics human problem-solving behavior rather than some incidental stumbling onto correct answers that happened to work on the training dataset.
The story here is: the more we want the model to follow a certain path, the more constraints we have to add to the reward function. However, had we known how to specify the reward function in full bloody detail, we would not need to use RL in the first place. We could just write down the algorithm which calculates the optimal value in the old-fashioned way.
So it's a chicken-and-egg problem of sorts, and generally bad news for any algorithm constructed via RL. The saving grace is that for some problems the RL-obtained solutions are much better than no solutions at all, and also the only solutions we have (which doesn't mean they are optimal, correct, or even good enough for practical purposes, especially in safety-critical systems).
Some hope lies in the model's performance gradually improving by means of self-correction. There is some evidence that this self-improvement process works in limited domains (e.g. AlphaSolve). This automated (however slight) improvement is the holy grail of AI, on which the success or failure of the entire field seems to depend.
@kazedcat Actually no, the reward is distributed over all tokens, so if you want a strong reinforcement signal, you want as few tokens as possible. It is a well-known problem in RL that when the trajectory of actions (here, tokens) is long, learning slows down because the final reward cannot be easily attributed to any particular earlier action. The same is true of real-life RL: if your reward only arrives after 1000 steps of working toward something, you don't know which of those steps mattered for success and which did not; but if the reward comes right after a few steps, you are likely to repeat exactly those steps next time.
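A toy illustration of that dilution (my own numbers, using a plain discounted-return scheme as in vanilla policy gradient - not how any specific LLM RL pipeline actually assigns credit): with a single terminal reward, the share of the signal an early token "sees" shrinks rapidly as the trajectory grows.

```python
import numpy as np

# Toy credit-assignment sketch: one terminal reward, zero intermediate rewards,
# and discounted returns G_t = r_t + gamma * G_{t+1} (vanilla policy gradient).
def per_token_returns(num_tokens: int, terminal_reward: float = 1.0,
                      gamma: float = 0.99) -> np.ndarray:
    rewards = np.zeros(num_tokens)
    rewards[-1] = terminal_reward     # reward only at the end of the episode
    returns = np.zeros(num_tokens)
    running = 0.0
    for t in reversed(range(num_tokens)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# The first token's share of the terminal reward collapses with length:
print(per_token_returns(10)[0])    # ~0.91
print(per_token_returns(1000)[0])  # ~4e-05
```

(Even in setups that use no discounting and give every token the same sequence-level reward, the attribution problem doesn't vanish - it shows up as higher variance rather than decay.)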
So why long token sequences, then? Well, the intuition is that with each generated token the model at least has an opportunity to internally calculate "something", whereas without tokens it can calculate "nothing" (in the extreme case of outputting 0 tokens). The amount of calculation per token is constant/fixed, so if an algorithm demands a sufficiently large, variable number of calculation steps which cannot be parallelized because they have inherent data dependencies (step n+1 depends on what step n produces), then it just can't be approximated by an underpowered calculator.
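A trivial example of such an inherently serial computation (the recurrence itself is arbitrary, picked only for illustration): each step consumes the previous step's output, so there is no way to parallelize or skip ahead, and a model that does a fixed amount of work per emitted token needs on the order of one token per step.

```python
# Toy inherently-serial computation: step n+1 consumes step n's output, so the
# chain can be neither parallelized nor skipped. (The constants are Knuth's
# MMIX LCG parameters; the particular recurrence doesn't matter.)
def iterate(x0: int, steps: int) -> int:
    x = x0
    for _ in range(steps):
        x = (x * 6364136223846793005 + 1442695040888963407) % 2**64
    return x

# With fixed compute per generated token, emulating `steps` dependent updates
# needs roughly `steps` emitted tokens; zero output tokens means a single
# fixed-depth forward pass.
print(iterate(42, 1000))
```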