Kwai AI Develops SRPO to Significantly Improve Reinforcement Learning Efficiency
Kwai, a Chinese video platform company, announced SRPO, a new reinforcement learning framework for large language models (LLMs). SRPO combines two-stage reinforcement learning with historical data reuse to cut the number of training steps by approximately 90% compared to the existing GRPO method, while achieving performance equivalent to DeepSeek-R1 on mathematics and code generation benchmarks.

SRPO responds to the limitations of GRPO (Group Relative Policy Optimization), a reinforcement learning method widely used in recent LLM post-training. GRPO strengthens reasoning by sampling several answers to the same prompt, scoring them as correct or incorrect, and reinforcing the answers that score above the group average. It is used in high-performance models such as DeepSeek-R1. However, GRPO tends to require many training steps, and its computational cost and training time have been noted as challenges. Kwai positions SRPO as a way to overcome this inefficiency.
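The group-relative scoring behind GRPO can be illustrated with a few lines of Python. This is a minimal sketch, not Kwai's or DeepSeek's code: the function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers in its group.

    `rewards` holds one reward per sampled answer to the same prompt.
    `eps` (an assumption of this sketch) avoids division by zero when
    every answer in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: 1.0 = correct, 0.0 = incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every answer is correct has nothing to compare,
# so every advantage is zero and the step carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. The uniform case suggests one reason such training can need many steps: prompts the model already solves every time consume compute without contributing an update.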
The core of SRPO is the combination of "two-stage reinforcement learning" and "historical data reuse (history resampling)". Whereas conventional reinforcement learning uses only the latest outputs generated by the model, SRPO reuses data from its past training history to extract more learning signal from the same computational resources. With this design, the number of training steps was reduced to approximately one-tenth of that required by GRPO.
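The announcement does not spell out the resampling rule. One plausible reading, assumed here purely for illustration, is that the reward outcomes recorded for each prompt in the previous epoch decide whether that prompt is sampled again: a prompt whose answers were all correct gives a group-relative method such as GRPO nothing to compare, so it is dropped. The function `resample_from_history`, the `history` layout, and the prompt ids are all hypothetical.

```python
def resample_from_history(history):
    """Return the prompt ids worth training on in the next epoch.

    `history` maps a prompt id to the rewards (1.0 correct, 0.0 incorrect)
    that its sampled answers earned in the previous epoch. Dropping prompts
    whose answers were all correct is an assumption of this sketch, not a
    confirmed description of SRPO.
    """
    return [
        prompt_id
        for prompt_id, rewards in history.items()
        if not all(r == 1.0 for r in rewards)
    ]


# Hypothetical reward history for three prompts, four answers each.
history = {
    "easy": [1.0, 1.0, 1.0, 1.0],   # already solved every time
    "mixed": [1.0, 0.0, 1.0, 0.0],  # informative: some right, some wrong
    "hard": [0.0, 0.0, 0.0, 0.0],   # not solved yet, may become informative
}

next_epoch_prompts = resample_from_history(history)
```

Under this assumption, only the "mixed" and "hard" prompts remain for the next epoch, so each training step is spent on data that can still change the model.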
In terms of performance, the framework achieved results comparable to DeepSeek-R1 in two areas: solving mathematical problems and generating program code. DeepSeek-R1 is widely known as a reasoning model trained with reinforcement learning, so matching its performance with significantly fewer steps is a noteworthy result from an efficiency perspective.
The significance of this research extends beyond a simple speed improvement. Reinforcement learning for LLMs requires enormous GPU resources and time, which has been one factor limiting high-performance model development to large corporations and well-funded research institutions. If training efficiency improves significantly, high performance becomes achievable at lower cost, which could broaden the base of organizations able to carry out this kind of research and development.
Additionally, the continued publication of original training methods by Chinese tech companies such as Kwai shows that AI research competition is proceeding in the open and involves a range of players, not only large US companies. Whether SRPO can be reproduced in detail, and whether it generalizes to other tasks, will have to be judged through independent verification and replication.
A key point to watch is how widely other researchers reproduce and apply SRPO. Reinforcement learning efficiency could affect the cost structure of LLM development as a whole, and the method's practical value should become clearer as its details are published and verified.
This article is an original work independently written and edited by the AI issue editorial team based on factual reporting. © AI issue. Unauthorized reproduction, redistribution, or use for AI training is prohibited.