Fifteen tokens per second. Not exactly blazing, but for anyone who wants full control of their AI stack, it can be more than acceptable. That is the trade-off facing Reddit user u/chikengunya, who asked the LocalLLaMA community whether it makes sense to buy four Ascend GX10 (or NVIDIA DGX Spark) units in preparation for upcoming open-source models, including a mysterious “Fable 5” expected around December or next year.
Numbers that make you think
The reported tests involve GLM5.2, an LLM from Zhipu AI widely used in China, running on four Ascend GX10 units (the user also mentions the DGX Spark variant). Prompt processing reaches 400–500 tokens per second, while output generation sits at roughly 15 tok/s, all with a 128k-token context window. These are not datacenter speeds, but the numbers improve significantly with quantization, a common trick in on-premise setups where predictable throughput and data sovereignty count more than raw milliseconds.
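To put those throughput figures in perspective, here is a rough sketch of how long a single request takes when the whole context window is filled. It uses only the rates reported in the thread. The 1,000-token reply length, the function name, and the reading of "128k" as 128,000 tokens are illustrative assumptions, and the estimate ignores prompt caching.

```python
def request_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill seconds, decode seconds) for one request."""
    prefill_s = prompt_tokens / prefill_tps   # time to read the prompt
    decode_s = output_tokens / decode_tps     # time to write the reply
    return prefill_s, decode_s

CONTEXT_TOKENS = 128_000   # "128k" context, taken as 128,000 tokens
REPLY_TOKENS = 1_000       # assumption: length of the generated answer
DECODE_TPS = 15            # reported generation speed

for prefill_tps in (400, 500):  # reported prompt-processing range
    prefill_s, decode_s = request_latency_s(
        CONTEXT_TOKENS, REPLY_TOKENS, prefill_tps, DECODE_TPS
    )
    print(f"prefill at {prefill_tps} tok/s: {prefill_s / 60:.1f} min, "
          f"decode: {decode_s / 60:.1f} min")
```

Under these assumptions, a full-context prompt needs roughly four to five minutes before the first output token appears, and the reply itself adds about another minute. That is tolerable for batch jobs, less so for a live chat.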
The power meter reads 1000W for the entire system, a running cost that doesn't scare the original poster, who remarks “1000W don't faze me.” That operating expense must be weighed against the absence of monthly cloud API fees and the guarantee that data stays in-house.
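One way to weigh that expense against API pricing is to express the electricity as a cost per million generated tokens. The sketch below is only an estimate: the electricity price is a placeholder assumption, the system is assumed to draw the full 1000W while generating a single stream at 15 tok/s, and the hardware purchase is not included.

```python
def energy_per_million_tokens(power_watts, tokens_per_second, price_per_kwh):
    """Return (kWh, cost) needed to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    kwh = (power_watts / 1000) * (seconds / 3600)
    return kwh, kwh * price_per_kwh

PRICE_PER_KWH = 0.30  # assumption: placeholder price, use your own tariff

kwh, cost = energy_per_million_tokens(1000, 15, PRICE_PER_KWH)
print(f"{kwh:.1f} kWh, {cost:.2f} per million output tokens")
```

With these inputs, a million output tokens take about 18.5 kWh. Serving several requests in parallel, if the setup supports it, would spread the same power draw over more tokens.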
Why buy hardware now for a model that doesn't exist yet
Chikengunya's question reflects a mindset shift among developers and companies eyeing on-premise deployment. Open-source models are advancing fast, and the cited “Fable 5” – likely a reference to a future large-scale LLM release – could arrive relatively soon. Having tested, tuned hardware in place means not chasing the second-hand market or waiting for new shipments when demand spikes.
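The choice is between the GX10 and NVIDIA's just-announced DGX Spark, both designed for local AI inference. Despite the similar name, a GX10 offered as a DGX Spark variant is most plausibly ASUS's Ascent GX10, a compact system built on the same NVIDIA GB10 platform, rather than Chinese silicon from Huawei's Ascend line. Huawei's Ascend accelerators are a separate product family: less common in Western markets, they're becoming a viable alternative partly due to export restrictions that have constrained NVIDIA GPU availability in some regions. For those seeking independence, even the chip itself begins to matter.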
The trade-off between speed and sovereignty
Fifteen tokens per second isn't something to romanticize: it's roughly two to three times the average human reading speed, but in interactive chatbots or real-time applications it can feel sluggish. Batch processing, document analysis, and code generation, however, don't require instant streaming. In those scenarios, a four-unit cluster with quantization can offer an honest compromise.
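The reading-speed comparison is easy to check with back-of-the-envelope arithmetic. Both constants below are assumptions rather than figures from the thread: about 0.75 English words per token is a common rule of thumb, and 250 words per minute is a typical adult reading speed. The result shifts with either value.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough rule of thumb for English text
READING_WPM = 250       # assumption: typical adult silent reading speed

def generation_wpm(tokens_per_second, words_per_token=WORDS_PER_TOKEN):
    """Convert a generation rate in tokens/s to words per minute."""
    return tokens_per_second * words_per_token * 60

wpm = generation_wpm(15)
print(f"{wpm:.0f} words/min generated, {wpm / READING_WPM:.1f}x reading speed")
```

Under these assumptions the output arrives at about 675 words per minute, comfortably faster than most people read but with little headroom once a reader starts skimming.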
For the AI-RADAR audience, the real value is data sovereignty and cost predictability. No data leaves the corporate perimeter (or the developer's garage), no surprise bills from API calls, no lock-in to model-as-a-service vendors. Yes, the system draws up to 1000W, but that's a cost you can calculate precisely.
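As a sketch of how predictable that cost is, the monthly bill follows directly from the power draw, the hours of use, and the tariff. The electricity price and the usage patterns below are placeholder assumptions. Treating 1000W as the draw for every hour of use gives an upper bound, since an idle system would normally consume less.

```python
def monthly_energy(power_watts, hours_per_day, price_per_kwh, days=30):
    """Return (kWh, cost) for one month at a constant power draw."""
    kwh = (power_watts / 1000) * hours_per_day * days
    return kwh, kwh * price_per_kwh

PRICE_PER_KWH = 0.30  # assumption: placeholder price, use your own tariff

for hours in (24, 8):  # assumption: always-on vs. working-hours use
    kwh, cost = monthly_energy(1000, hours, PRICE_PER_KWH)
    print(f"{hours} h/day: {kwh:.0f} kWh, {cost:.2f} per month")
```

Running around the clock comes to 720 kWh per month. At eight hours a day it drops to 240 kWh.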
The outlook for local deployments
The interest in configurations like the one discussed in this thread confirms that on-premise AI isn't just a niche for large enterprises. Boxes like DGX Spark and Ascend GX10 systems lower the technical (and eventually economic) entry barrier for anyone who wants to experiment with self-hosted LLMs. The question is no longer whether local inference is possible, but under what performance and power conditions.
Meanwhile, the community keeps testing, sharing, and comparing setups. Chikengunya's thread is a snapshot of a sector in flux, where hardware choices are driven not only by benchmarks but by medium-term strategy. A signal worth watching for anyone evaluating an AI infrastructure investment.