Code Optimization with LLMs: A New Approach Surpasses Claude Mythos

A recent study shows that the code optimization capabilities and execution speed of Large Language Models (LLMs) such as Qwen-3.6-27B and Gemma-4-31B can be improved significantly, enough to surpass the performance of Claude Mythos. The research introduces a methodology, termed a "scaffold," that substantially increases the compute spent at test time: an estimated 25 to 40 times more than the baseline model uses on the same problem.

This approach aims to overcome the inherent limitations of LLMs in reasoning over extended contexts, a critical factor for complex tasks like code optimization. The emphasis on increased compute and managing challenges related to context length highlights the importance of robust and scalable infrastructure, a fundamental aspect for organizations considering on-premise deployment of AI solutions.

The "Scaffold": An Iterative Refinement Methodology

The core of this research lies in the "scaffold," a framework designed to maximize the exploration and refinement of solutions. Operating in "max mode," the system sets the branch exploration breadth to 5 and the depth of the iterative correction loop to 10, and employs 6 branch-aware selective hypotheses, which are revised after every two iterations. These hypotheses independently evaluate various claims, local speedups, or entirely different algorithmic designs, and each is selectively injected into the context of a specific branch.
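The "max mode" settings above can be written down as a small configuration object. This is only a sketch: the class name, field names, and helper methods are hypothetical, and the count of branch-level steps assumes that every branch runs every iteration, which the study summary does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaffoldConfig:
    """Hypothetical container for the "max mode" settings quoted in the article."""
    branch_breadth: int = 5             # branches explored
    loop_depth: int = 10                # iterations of the correction loop
    num_hypotheses: int = 6             # branch-aware selective hypotheses
    hypothesis_revision_every: int = 2  # revise hypotheses after every N iterations

    def revision_iterations(self) -> list[int]:
        """Iterations after which the hypotheses are revised."""
        step = self.hypothesis_revision_every
        return list(range(step, self.loop_depth + 1, step))

    def branch_iterations(self) -> int:
        """Branch-level correction steps, assuming every branch runs every iteration."""
        return self.branch_breadth * self.loop_depth


MAX_MODE = ScaffoldConfig()
print(MAX_MODE.revision_iterations())  # [2, 4, 6, 8, 10]
print(MAX_MODE.branch_iterations())    # 50
```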

A crucial component of this system is the "solution pool," which introduces "structured noise" into the iterative correction loop. This mechanism is essential to prevent LLMs from getting stuck in local minima, and it encourages exploration of a broader solution space. All agents within the system have access to a Python environment, which lets them check their work programmatically and validate their ideas immediately.
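The article does not say how the solution pool produces its "structured noise." One plausible reading, sketched below purely as an assumption, is that each branch is shown a few randomly chosen candidates from other branches, so that its context is not dominated by its own earlier attempts. All names here (Candidate, SolutionPool, sample_for) are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Candidate:
    branch: int        # branch that produced this candidate
    code: str          # candidate source code
    runtime_ms: float  # measured runtime, lower is better


@dataclass
class SolutionPool:
    candidates: list[Candidate] = field(default_factory=list)

    def add(self, candidate: Candidate) -> None:
        self.candidates.append(candidate)

    def sample_for(self, branch: int, k: int, rng: random.Random) -> list[Candidate]:
        """Return up to k randomly chosen candidates from other branches.

        Assumption: "structured noise" means cross-branch sampling.
        """
        others = [c for c in self.candidates if c.branch != branch]
        return rng.sample(others, min(k, len(others)))


pool = SolutionPool()
for b in range(3):
    pool.add(Candidate(branch=b, code=f"# variant from branch {b}", runtime_ms=100.0 - b))

noise = pool.sample_for(branch=0, k=2, rng=random.Random(0))
print(sorted(c.branch for c in noise))  # [1, 2]
```

Passing the random generator in explicitly keeps the sampling reproducible, which makes it easier to compare runs.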

Addressing Challenges in Long Context Reasoning

One of the primary challenges encountered with models like Gemma and Qwen is their instability when reasoning over long context windows. This limitation shows up as a significant drop in performance as early as the fourth or fifth iteration, or after the PQF update, at the ninth and tenth iterations. These declines are described as "genuine regressions." Even so, the process cannot simply be stopped early, for example at the third iteration, because updated or evolved branches might still produce better solutions than earlier ones.

The researchers could not solve this by applying "memory bank distillation" after every three iterations, an approach that frontier LLMs handle better, because it would have narrowed the search excessively. Instead, each branch was given its own history. The models were asked to evaluate and select the best-performing or most optimized candidate within each branch, and then to choose the best among these to present to a "final judge."
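The two-stage selection can be illustrated with a short sketch. In the study the models themselves do the evaluating; here a measured runtime stands in for that judgment, and the function names and sample numbers are invented for illustration only. The sample history for branch 0 regresses at iterations 4 and 5 and then recovers, which is the situation that rules out stopping early.

```python
History = list[tuple[int, float]]  # (iteration, runtime_ms), lower runtime is better


def best_in_branch(history: History) -> tuple[int, float]:
    """Pick the fastest candidate from one branch's own history."""
    return min(history, key=lambda item: item[1])


def final_judge(branch_histories: dict[int, History]) -> tuple[int, int, float]:
    """Take each branch's finalist, then return (branch, iteration, runtime_ms) of the best."""
    finalists = {b: best_in_branch(h) for b, h in branch_histories.items() if h}
    branch = min(finalists, key=lambda b: finalists[b][1])
    iteration, runtime = finalists[branch]
    return branch, iteration, runtime


# Made-up runtimes, for illustration only.
histories = {
    0: [(1, 120.0), (2, 95.0), (3, 90.0), (4, 140.0), (5, 150.0), (6, 80.0)],
    1: [(1, 110.0), (2, 100.0), (3, 85.0), (4, 130.0)],
}
print(final_judge(histories))  # (0, 6, 80.0)
```

Stopping branch 0 at iteration 3 would have returned 90.0 ms, so branch 1 would have won with 85.0 ms and the later 80.0 ms candidate would have been missed.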

Implications for On-Premise Deployments and Infrastructure Planning

The described approach, which requires a substantial increase in compute power to achieve superior results, has direct implications for organizations evaluating the deployment of Large Language Models in on-premise or hybrid environments. The need for 25-40 times more compute than the baseline for complex optimization tasks translates into significant hardware requirements, directly impacting the Total Cost of Ownership (TCO).
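As a back-of-the-envelope aid, the 25x to 40x factor can be applied to a measured baseline. The function name and the baseline of 2 GPU-hours below are placeholders, not figures from the study; only the two multipliers come from the article.

```python
def scaffold_compute_range(baseline_gpu_hours: float,
                           low_factor: float = 25.0,
                           high_factor: float = 40.0) -> tuple[float, float]:
    """Scale a baseline compute figure by the 25x to 40x range quoted in the article."""
    if baseline_gpu_hours < 0:
        raise ValueError("baseline_gpu_hours must be non-negative")
    return baseline_gpu_hours * low_factor, baseline_gpu_hours * high_factor


# Placeholder baseline: 2 GPU-hours per problem (an assumed number).
low, high = scaffold_compute_range(2.0)
print(f"{low:.0f} to {high:.0f} GPU-hours per problem")  # 50 to 80 GPU-hours per problem
```

Replacing the placeholder with a real measurement from the target hardware gives a first-order input for capacity and TCO estimates.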

For CTOs, DevOps leads, and infrastructure architects, this study underscores the importance of carefully planning GPU resources, VRAM, and overall processing capacity. Managing iterative and compute-intensive workloads, coupled with the need to maintain data sovereignty and compliance, makes the choice between self-hosted and cloud solutions a balance of trade-offs. AI-RADAR provides analytical frameworks on /llm-onpremise to help evaluate these compromises, highlighting how optimizing LLM performance often requires proportional investments in infrastructure.