Weibo Research Team Reports High Mathematical Reasoning Scores with a 3-Billion-Parameter Model
Nine researchers from China's Sina Weibo have published a technical report on arXiv describing 'VibeThinker-3B', a language model with 3 billion parameters. The model recorded high scores on multiple mathematics and coding benchmarks, including 94.3 points on AIME 2026, and reportedly matched or outperformed models orders of magnitude larger. However, some in the research community have questioned the reliability of the benchmark results.

Nine researchers from China's Sina Weibo published a technical report on arXiv in 2026. The language model they developed, 'VibeThinker-3B', is a small model with 3 billion parameters and reportedly achieved high scores on mathematics and coding benchmarks.
In mathematics, it recorded 94.3 points on AIME 2026 and 91.4 points on AIME 2025. It scored 89.3 points on HMMT 2025, 93.8 points on BruMO 2025, and 76.4 points on IMO-AnswerBench, which consists of 400 International Mathematical Olympiad-level problems. In coding, it achieved a Pass@1 of 80.2 on LiveCodeBench v6 and recorded a 96.1% correct-answer rate in LeetCode competitions held from late April to late May 2026.
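For readers unfamiliar with the Pass@1 metric, the following is a minimal sketch of how such a score is commonly computed, using the widely used unbiased pass@k estimator. It is not taken from the report: whether the Weibo team used this exact procedure is an assumption, and the function names and sample data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that passed all tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems, on a 0-100 scale.

    results: one (n, c) pair per problem.
    """
    scores = [pass_at_k(n, c, 1) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical data: four problems, eight samples each.
hypothetical_results = [(8, 8), (8, 6), (8, 0), (8, 4)]
print(benchmark_pass_at_1(hypothetical_results))
```

With k = 1 the estimator reduces to the fraction of sampled solutions that pass, averaged over problems, which is why a Pass@1 figure can take a non-integer value such as 80.2.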
For comparison, DeepSeek V3.2 posts a similar score on AIME 2026 but has 671 billion parameters, approximately 224 times as many as VibeThinker-3B. Google's Gemini 3 Pro scored 91.7 points on AIME 2026, below VibeThinker-3B's 94.3 points. The team reported that applying a test-time scaling technique called 'Claim-Level Reliability Assessment' raised the AIME 2026 score to 97.1.
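The size comparison comes down to simple arithmetic. As a rough check, a gap of about 224 times over a 3-billion-parameter model implies a comparison model of roughly 672 billion parameters. The sketch below works through that calculation; the function and constant names are illustrative and do not come from the report.

```python
def implied_parameters(small_model_params: float, ratio: float) -> float:
    """Parameter count implied for the larger model by a size ratio."""
    return small_model_params * ratio


def size_ratio(large_model_params: float, small_model_params: float) -> float:
    """How many times more parameters the larger model has."""
    return large_model_params / small_model_params


VIBETHINKER_PARAMS = 3e9  # 3 billion, as stated in the article
REPORTED_RATIO = 224      # approximate ratio stated in the article

implied = implied_parameters(VIBETHINKER_PARAMS, REPORTED_RATIO)
print(f"Implied size of the larger model: {implied / 1e9:.0f} billion parameters")
```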
Within hours of the paper's publication, the work had received 62 upvotes on Hugging Face's daily papers feed, 130 likes on the model repository, and 685 stars on the GitHub repository. Meanwhile, voices on social media have questioned both the reliability of the results and the validity of the benchmarks themselves, and debate continues in the research community over whether the scores reflect actual capability or whether benchmarking has become an empty formality.
This article is an original work independently written and edited by the AI issue editorial team based on factual reporting. © AI issue. Unauthorized reproduction, redistribution, or use for AI training is prohibited.