- AI promises a huge revolution for developers, but is it just for code creation?
- Popular Anthropic and OpenAI models are not great at debugging
- Microsoft researchers are open-sourcing their tools to facilitate research
Although generative AI is increasingly integrated into programming workflows, new research from Microsoft reveals that large language models are still not entirely up to the task when it comes to debugging.
The research suggests that even advanced models still struggle with debugging tasks that are fairly simple for experienced developers, underscoring the continued importance of human programmers.
AI does seem to have a solid use case, however, with Google now claiming that around 25% of its new code is generated by AI. Meta has also noted the wide deployment of AI for coding.
AI is good for code creation, but not for debugging
The report explores how 11 Microsoft researchers tested nine AI models on SWE-bench Lite, a popular debugging benchmark. Claude 3.7 Sonnet offered the highest success rate at a far-from-perfect 48.4%. OpenAI's o1 and o3-mini posted lower success rates of 30.2% and 22.1% respectively.
“Even with debugging tools, our simple prompt-based agent rarely solves more than half of the SWE-bench Lite problems,” wrote the researchers, blaming the sub-optimal performance on a lack of data representing sequential decision-making behavior.
However, all hope is not lost. “We believe that training or fine-tuning LLMs can improve their interactive debugging abilities,” they added. The researchers intend to fine-tune an info-seeking model that specializes in gathering the information needed to resolve bugs, but in the meantime, they promise to open-source debug-gym to facilitate other similar research.
Debug-gym is described as an “environment that allows code-repairing agents to access tools for active information-seeking behavior”.
For the moment, however, artificial intelligence may not bring as much value to developers' lives as AI companies suggest. “Most developers spend most of their time debugging code,” wrote the researchers, indicating that even if developers benefit from code generation, it may not save them much time.




