Comments by "clray123" (@clray123) on the "bycloud" channel.

  6.  @joelcoll4034  Yes. As long as what's in the context window helps the learned function of weights and context window embeddings (which are derived from weights + input token embeddings) - whatever that function is - produce the correct final answer, those extra tokens serve their purpose. The optimization goal of reinforcement learning is to produce the correct verifiable answer, not to produce a trajectory toward it which seems sensible to us (more precisely, we have no means of automatically checking that this trajectory is "sensible", because that already requires the reasoning we are trying to teach). But this is bad news, because we would actually like that trajectory to be sensible, to ensure that the model mimics human problem-solving behavior rather than incidentally stumbling onto correct answers in a way that happened to work on the training dataset. The story here is: the more we want the model to follow a certain path, the more constraints we have to add to the reward function. However, had we known how to specify the reward function in full bloody detail, we would not need to use RL in the first place - we could just write down the algorithm that computes the optimal answer the old-fashioned way. So it's a chicken-and-egg problem of sorts, and generally bad news for any algorithms constructed via RL. The saving grace is that for some problems the RL-obtained solutions are much better than no solutions at all, and also the only solutions we have (which doesn't mean they are optimal, correct, or even good enough for practical purposes, especially in safety-critical systems). Some hope lies in the model's performance gradually improving by means of self-correction. There is some evidence that this self-improvement process works in limited domains (e.g. AlphaSolve). This automated (however slight) improvement is the holy grail of AI, on which the success or failure of the entire field seems to depend.
    4
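A minimal sketch of the distinction the comment draws, assuming a toy setup with hypothetical function names (not from the comment or any real RL library): an outcome-only verifiable reward scores just the final answer, while a "constrained" reward also has to encode, by hand, what a sensible trajectory looks like.

```python
# Illustrative sketch only: outcome-only reward vs. reward with a hand-written
# trajectory constraint. Names and the penalty rule are assumptions, not a
# reference implementation of any particular RL training setup.

def outcome_reward(trajectory: list[str], final_answer: str, reference: str) -> float:
    """Verifiable-outcome reward: only the final answer is checked."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def constrained_reward(trajectory: list[str], final_answer: str, reference: str) -> float:
    """Same outcome reward plus a manually specified trajectory constraint.
    Each extra constraint is something we must already know how to write down."""
    reward = outcome_reward(trajectory, final_answer, reference)
    # Example constraint: penalize steps that are empty or merely repeat the previous one.
    for prev, step in zip(trajectory, trajectory[1:]):
        if not step.strip() or step == prev:
            reward -= 0.1
    return max(reward, 0.0)

if __name__ == "__main__":
    steps = ["2 + 2 = 4", "2 + 2 = 4", "so the answer is 4"]
    print(outcome_reward(steps, "4", "4"))      # 1.0: correct answer, trajectory ignored
    print(constrained_reward(steps, "4", "4"))  # 0.9: repetition in the trajectory penalized
```

The design point this illustrates is the one made above: every trajectory constraint (here, a simple repetition penalty) must be specified by hand, and if we could fully specify what a correct reasoning path looks like, we would not need RL to find it.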