GLM-5, a large language model (LLM), underwent intensive testing on FoodTruck Bench, a platform that simulates the operational challenges of running a food truck business. The experiment aimed to evaluate the model's ability to make decisions in a realistic business context.

Test Results

GLM-5 survived 28 of the 30 simulated days and ranked fifth overall. It generated more revenue than Sonnet 4.5 ($11,965 vs. $10,753) and produced less food waste. However, the run ended in failure because of high staff costs, which consumed 67% of revenue.
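To put these figures in proportion, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it assumes the 67% staff-cost share applies to the total revenue figure quoted above, which the results do not state explicitly, and all variable names are made up for this example.

```python
# Figures quoted in the results above.
glm5_revenue = 11_965      # USD, GLM-5
sonnet_revenue = 10_753    # USD, Sonnet 4.5
staff_cost_share = 0.67    # share of revenue spent on staff
days_survived, days_total = 28, 30

# Revenue lead of GLM-5 over Sonnet 4.5, absolute and relative.
revenue_gap = glm5_revenue - sonnet_revenue
revenue_gap_pct = 100 * revenue_gap / sonnet_revenue

# ASSUMPTION: the 67% share is measured against total revenue.
est_staff_cost = staff_cost_share * glm5_revenue
left_after_staff = glm5_revenue - est_staff_cost

# Fraction of the simulated period the model survived.
survival_pct = 100 * days_survived / days_total

print(f"Revenue lead: ${revenue_gap:,} ({revenue_gap_pct:.1f}%)")
print(f"Estimated staff cost: ${est_staff_cost:,.2f}")
print(f"Left for all other costs: ${left_after_staff:,.2f}")
print(f"Survival: {survival_pct:.1f}% of the run")
```

Under that assumption, roughly $8,017 of the $11,965 went to staff, leaving about $3,948 to cover every other expense, which is more than GLM-5's $1,212 revenue lead over Sonnet 4.5 could offset.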

Failure Analysis

GLM-5 correctly diagnosed every problem, stored 123 memory entries, and used 82% of the available tools, yet it ignored its own analysis. This gap between diagnosis and action led to failure, even though the model performed well in other areas.