Post-training of large language models has long been divided into two paradigms: supervised fine-tuning (SFT), centered on imitation, and reinforcement learning (RL), driven by exploration.
Large reasoning models (LRMs) have developed strong chain-of-thought (CoT) reasoning capabilities through the simple yet effective RLVR (reinforcement learning with verifiable rewards) paradigm. However, the lengthy outputs that accompany this paradigm significantly increase reasoning overhead and affect ...