New Benchmark Measures AI's Code Reproduction Capability

Epoch AI has released 'MirrorCode,' a new benchmark that evaluates whether AI can reproduce complete programs without access to their source code. Anthropic's Claude Opus 4.7 achieved the highest score with a 56% success rate, but every model failed on the most complex tasks. One model spent 19 days and $2,600 on a single task, underscoring the limits of current AI coding capabilities.

Epoch AI, an AI research organization, has released a new evaluation benchmark called 'MirrorCode' that measures whether AI can fully reproduce existing software without access to its source code. Whereas traditional evaluations have focused on simple code completion or partial code generation, MirrorCode centers on a more practical question: 'Can AI recreate a working, complete program without seeing the original code?'

Coding ability is an important indicator of how useful AI can be in real development environments. As AI-driven automation of software development has advanced rapidly in recent years, attention has shifted from writing short snippets of code to understanding and reconstructing large, complex programs in their entirety. MirrorCode is positioned as a benchmark that reflects this shift.

The highest score in this evaluation went to 'Claude Opus 4.7,' developed by Anthropic. It achieved a 56% success rate and reconstructed a toolkit of approximately 16,000 lines in 14 hours. However, every evaluated model failed to solve the most complex tasks. In addition, one model kept running on a single task for 19 days, with execution costs reaching $2,600, which shows how difficult complex tasks remain.

Seen from another angle, the top score of 56% means that 'more than 40% of tasks could not be solved.' Even the most capable models fall short of completely reproducing complex programs, and the fact that every model hit a wall on the most difficult tasks clearly shows the limits of current AI coding capabilities. The case of $2,600 spent on a single task suggests that the computational resources required for high-difficulty tasks can become enormous.

Going forward, as stricter evaluation benchmarks like MirrorCode become more widespread, the boundary between what AI can already do reliably and where it still falls short will become clearer. Companies seeking to use AI in development environments will need to choose and apply tools with the understanding that current AI is not yet capable of autonomously reconstructing entire large-scale codebases.


AI issue Staff

This article is an original work independently written and edited by the AI issue editorial team based on factual reporting. © AI issue. Unauthorized reproduction, redistribution, or use for AI training is prohibited.
