YC-Bench: When LLMs Run a Startup for a Year
YC-Bench, a new benchmark, tested 12 large language models (LLMs) by having each one run a simulated startup for a full operational year. Over hundreds of decision-making turns, each model had to manage employees, select contracts, run payroll, and navigate a hostile market in which roughly 35% of clients secretly inflate work requirements after a task is accepted. What sets YC-Bench apart is its delayed, sparse feedback: with no hand-holding, the models face the kind of uncertainty found in real operations.
The simulation, run with three "seeds" (independent runs) per model, revealed surprising dynamics in both performance and cost. Claude Opus 4.6 led the leaderboard with average final funds of $1.27 million, at an API cost of approximately $86 per run. The most significant finding, however, concerns GLM-5, which reached an average of $1.21 million, within 5% of Opus's result, at an API cost of only about $7.62 per run. That is roughly 11x cheaper for nearly equivalent performance. GPT-5.4 ranked third with $1.00 million at $23 per run, while many other models finished below the $200,000 starting capital, and several went bankrupt.
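The comparison above can be checked with a few lines of arithmetic. The sketch below uses only the averages quoted in this article. The `net_gain_per_dollar` figure (final funds minus the $200,000 starting capital, divided by API cost) is an illustrative metric introduced here and is not one reported by the benchmark.

```python
# Averages quoted in this article (three seeds per model).
STARTING_CAPITAL = 200_000

results = {
    "Claude Opus 4.6": {"final_funds": 1_270_000, "api_cost": 86.00},
    "GLM-5": {"final_funds": 1_210_000, "api_cost": 7.62},
    "GPT-5.4": {"final_funds": 1_000_000, "api_cost": 23.00},
}


def relative_gap(leader_funds, other_funds):
    """Fraction by which other_funds falls short of leader_funds."""
    return (leader_funds - other_funds) / leader_funds


def cost_ratio(expensive_cost, cheap_cost):
    """How many times more the expensive run costs than the cheap one."""
    return expensive_cost / cheap_cost


def net_gain_per_dollar(final_funds, api_cost):
    """Illustrative metric (an assumption of this sketch, not the benchmark's)."""
    return (final_funds - STARTING_CAPITAL) / api_cost


opus = results["Claude Opus 4.6"]
glm = results["GLM-5"]

gap = relative_gap(opus["final_funds"], glm["final_funds"])
ratio = cost_ratio(opus["api_cost"], glm["api_cost"])

print(f"GLM-5 trails Opus by {gap:.1%}")
print(f"Opus costs {ratio:.1f}x as much per run")
for name, row in results.items():
    gain = net_gain_per_dollar(row["final_funds"], row["api_cost"])
    print(f"{name}: ${gain:,.0f} net gain per API dollar")
```

With the quoted figures, the gap comes out at about 4.7% and the cost ratio at about 11.3x, consistent with the "within 5%" and "roughly 11x" claims.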
Long-Horizon Coherence and "Scratchpads": Lessons from the Benchmark
YC-Bench highlighted a critical gap in most LLM evaluations: "long-horizon coherence under delayed feedback." In contexts where the consequences of a decision are not immediately apparent, many models tend to fall into repetitive loops, abandon recently formulated strategies, or continue accepting tasks from clients already identified as unreliable. This ability to maintain a consistent strategy over time, adapting to sparse and non-immediate information, proved to be a fundamental distinguishing factor.
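One of the failure modes above, continuing to accept work from clients already known to be unreliable, comes down to acting on feedback that only arrives after a task is finished. The following is a minimal sketch of how an agent might keep such a record. The `ClientLedger` class, its method names, and the two-strike threshold are hypothetical and are not taken from the YC-Bench code.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ClientLedger:
    """Tracks clients whose work turned out larger than quoted (hypothetical helper)."""

    max_strikes: int = 2  # assumed threshold, chosen only for illustration
    strikes: dict = field(default_factory=lambda: defaultdict(int))

    def record_outcome(self, client, quoted_hours, actual_hours):
        """Call when a task completes, i.e. when the delayed feedback arrives."""
        if actual_hours > quoted_hours:
            self.strikes[client] += 1

    def should_accept(self, client):
        """Decline new work from clients that reached the strike limit."""
        return self.strikes.get(client, 0) < self.max_strikes


ledger = ClientLedger()
ledger.record_outcome("acme", quoted_hours=40, actual_hours=65)
ledger.record_outcome("acme", quoted_hours=30, actual_hours=50)
ledger.record_outcome("globex", quoted_hours=20, actual_hours=20)

print(ledger.should_accept("acme"))    # False: two inflated tasks on record
print(ledger.should_accept("globex"))  # True: delivered as quoted
```

The point of the sketch is that the decision at acceptance time depends on state written many turns earlier, which is exactly where models that lose track of their own observations go wrong.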
Another key finding was that top-performing models actively used a persistent "scratchpad" to record and revise what they had learned. Leading models rewrote their notes approximately 34 times per run, an iterative process of learning and adaptation. Less effective models, by contrast, averaged only 0-2 entries, suggesting they were less able to capitalize on past experience. Use of this "working memory" mechanism proved to be the strongest predictor of success, ahead of factors such as model size or scores on traditional benchmarks.
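To make the scratchpad idea concrete, here is a minimal sketch of a persistent note store that an agent loop could rewrite between turns and feed back into its next prompt. Everything named here (`Scratchpad`, `rewrite`, `build_prompt`, the file name) is a hypothetical illustration; the article does not describe the actual interface YC-Bench exposes.

```python
import json
import tempfile
from pathlib import Path


class Scratchpad:
    """Notes that survive across turns by being saved to disk (hypothetical)."""

    def __init__(self, path):
        self.path = Path(path)
        self.notes = ""
        self.rewrite_count = 0
        if self.path.exists():
            saved = json.loads(self.path.read_text(encoding="utf-8"))
            self.notes = saved["notes"]
            self.rewrite_count = saved["rewrite_count"]

    def rewrite(self, new_notes):
        """Replace the notes wholesale and count the rewrite."""
        self.notes = new_notes
        self.rewrite_count += 1
        payload = {"notes": self.notes, "rewrite_count": self.rewrite_count}
        self.path.write_text(json.dumps(payload), encoding="utf-8")


def build_prompt(scratchpad, observation):
    """Prepend the current notes so earlier lessons inform the next decision."""
    notes = scratchpad.notes or "(empty)"
    return f"Notes so far:\n{notes}\n\nNew observation:\n{observation}\n"


with tempfile.TemporaryDirectory() as tmp:
    pad_file = Path(tmp) / "scratchpad.json"
    pad = Scratchpad(pad_file)
    pad.rewrite("Client A inflated scope after acceptance; avoid.")
    pad.rewrite("Client A unreliable. Client B pays on time; prioritize.")

    # A fresh object reads the same file, as a later turn would.
    reloaded = Scratchpad(pad_file)
    prompt = build_prompt(reloaded, "Client A offers a new contract.")
    print(reloaded.rewrite_count)
    print(prompt)
```

Rewriting the notes as a whole, rather than only appending, mirrors the behavior described above: the record is condensed and revised as new evidence arrives, and the rewrite counter corresponds to the per-run figure the study reports.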
Implications for On-Premise Deployment and TCO
The YC-Bench results matter for organizations evaluating LLM deployment in production, particularly for agentic pipelines. The cost-efficiency of models like GLM-5, which delivers near-leading performance at a fraction of the cost, is a key input to Total Cost of Ownership (TCO). For companies considering self-hosted or on-premise solutions, achieving high performance at lower inference cost can translate into significant savings on infrastructure operating expenses.
The benchmark measured API costs rather than hardware usage, so the mapping to on-premise inference is indirect, but API pricing is a reasonable first proxy for the compute a model consumes. A more efficient model generally means fewer GPUs, less VRAM, and lower power consumption for the same workload, which makes local deployments more economically sustainable. Kimi-K2.5, for instance, was even more efficient in revenue per API dollar, outperforming the next-best model by a factor of 2.5. These results give CTOs, DevOps leads, and infrastructure architects concrete metrics for weighing performance against cost, a trade-off central to strategic decisions that prioritize data sovereignty and infrastructure control.
Future Prospects and the Evolution of Benchmarks
The YC-Bench methodology represents a significant step forward in LLM evaluation, shifting the focus from immediate performance metrics to more complex capabilities such as strategic planning and long-term adaptation. The benchmark's code is available as open source, and together with the published paper and leaderboard, it invites the community to explore these dynamics further and test new models.
This research underscores that the true utility of LLMs in complex business scenarios lies not only in their ability to generate accurate responses but also in their "cognitive resilience" when faced with uncertainties and delayed feedback. For those evaluating on-premise deployments, the emergence of highly performant yet cost-efficient models like GLM-5 opens new opportunities to build robust and economically advantageous AI pipelines, while maintaining control over their data and infrastructure.