Kimi K2.7-Code Achieves 30% Inference Token Reduction, But Independent Verification Raises Questions

Moonshot AI released the coding model "Kimi K2.7-Code" and announced 30% reduction in inference tokens and significant performance improvements on its own benchmarks. However, independent verification by researchers using KernelBench-Hard has revealed performance degradation compared to the previous generation K2.6, and the failure to submit to independent benchmarks has also been pointed out. External evaluation suggests the model is "more honest but not more capable," highlighting the gap between claims and reality.

Moonshot AI released the open-source coding model "Kimi K2.7-Code" this week. The company claims it improves inference efficiency compared to the previous generation K2.6 and achieves double-digit performance improvements on major benchmarks. However, independent researchers and developers have already raised skeptical voices.

K2.7-Code adopts the same trillion-scale mixture of experts (MoE) architecture as K2.6, and the ability to integrate directly into existing production environments through OpenAI-compatible APIs presents a low barrier to adoption for teams already operating K2.6. Weights are publicly available on HuggingFace and can be deployed with vLLM or SGLang. The model operates exclusively in thinking mode, and temperature parameter adjustment is not supported. Since Moonshot AI fixed it at 1.0, it could be a constraint for teams wanting fine-grained tuning of output determinism.

Moonshot AI emphasizes the suppression of "overthinking." The company claims a 30% reduction in inference token usage compared to K2.6, a figure that translates to direct inference cost reduction for teams operating agentic workflows. On the performance side, it touts 21.8% improvement on its own "Kimi Code Bench v2", 11% on "Program Bench", and 31.5% on "MLS Bench Lite". However, all three are proprietary benchmarks operated by Moonshot AI itself, and no third-party verification has been conducted at this point.

Independent verification was conducted by researcher Elliot Arledge. He compared K2.7-Code, K2.6, and Claude Fable 5 on the public benchmark "KernelBench-Hard" specialized in GPU kernel optimization and published all execution logs on kernelbench.com. The results did not align with Moonshot AI's claims. "K2.7 is more honest but not more capable," Arledge posted on X.

Specifically, on 5 out of 6 problems, K2.7-Code wrote Triton kernels from scratch where K2.6 had relied on library wrappers. In terms of breaking free from library dependency, this is a "more honest" implementation, but 2 of them failed due to bugs in the model itself. The MoE kernel score declined from K2.6's 0.222 to K2.7-Code's 0.157. "Fable (Claude Fable 5) achieves the top results in all cases where it doesn't explicitly indicate failure honestly," Arledge added.

Additionally, developer Sugumaran Balasubramaniyan, who built a model router on the Hermes agent platform based on DeepSWE, publicly raised direct questions to Moonshot AI about the K2.7-Code release. DeepSWE is an independent benchmark with a 70-point score range, and is said to have higher discriminative power than SWE-Bench Pro, which only has a 30-point range. K2.7-Code has not submitted to this independent benchmark, leaving challenges in the objective evaluation of its capabilities.

#InferenceOptimization#CodingModel#BenchmarkValidation#LLMPerformanceEvaluation#IndependentVerification#MoonshotAI

AI issue Staff

This article is an original work independently written and edited by the AI issue editorial team based on factual reporting. © AI issue. Unauthorized reproduction, redistribution, or use for AI training is prohibited.

Kimi K2.7-Code Achieves 30% Inference Token Reduction, But Independent Verification Raises Questions

Comments