Post-training of large language models has traditionally been divided into two paradigms: supervised fine-tuning (SFT), centered on imitation, and reinforcement learning (RL), driven by exploration.
Large reasoning models (LRMs) have developed strong chain-of-thought (CoT) reasoning capabilities through the simple yet effective paradigm of reinforcement learning with verifiable rewards (RLVR). However, the lengthy outputs that accompany this significantly increase reasoning overhead and affect ...