A Pelican SVG and Two GPUs: Extreme Quantization and Local Inference on RTX 5090 + 3090

An SVG image of a pelican caught a Reddit user's attention for its quality, but the real surprise is how it was generated. GLM 5.2 UD IQ2_M, an LLM quantized to an extremely low bit width, produced the illustration on an unconventional desktop system built around two GPUs from different generations linked through PCIe bifurcation.

Bridging two generations

At the core of the setup is a pairing of an RTX 5090 and an RTX 3090 on a Gigabyte AI TOP B850 motherboard, coupled with an AMD Ryzen 9 9950X3D processor and 256 GB of DDR5 RAM at 5600 MT/s. The two graphics cards split the PCIe lanes and run at x8 each, a choice that halves theoretical bandwidth compared to an x16 link but, in this context, did not prevent inference from running.
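To put the x8 versus x16 difference in numbers, here is a minimal Python sketch of theoretical one-direction PCIe bandwidth. It is an illustration rather than a measurement of this system: the source does not say which PCIe generation each slot negotiates, so both 4.0 and 5.0 are printed, and the function name is hypothetical. Only the 128b/130b line encoding is accounted for, so real-world throughput will be lower.

```python
# Per-lane transfer rates in GT/s for PCIe generations that use 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_bandwidth_gb_s(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link, in GB/s."""
    raw_gbit = GT_PER_S[generation] * lanes   # gigabits per second on the wire
    payload_gbit = raw_gbit * 128 / 130       # subtract line-encoding overhead
    return payload_gbit / 8                   # bits -> bytes


if __name__ == "__main__":
    # Assumption: the link generation is unknown, so show both candidates.
    for gen in (4, 5):
        for lanes in (8, 16):
            bw = pcie_bandwidth_gb_s(gen, lanes)
            print(f"PCIe {gen}.0 x{lanes}: {bw:.1f} GB/s")
```

Halving the lane count halves the figure at any generation, which matters mostly when data has to cross between the cards or be loaded from system RAM.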

The key detail is the IQ2_M label: it denotes a quantization in the 2-bit class (nominally a little under 3 bits per weight on average), among the lowest available. This technique dramatically compresses the model, allowing it to run on hardware with limited VRAM at the cost of a potential loss of precision. Yet the result, a complex vector image, shows that even at such compression levels some models retain surprising capabilities.
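The effect of bit width on memory can be sketched with simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. The Python example below is only an illustration. The parameter count is a hypothetical round number, not the size of the model in the story, and 2.7 bits per weight is the nominal figure llama.cpp lists for IQ2_M, which a mixed-precision variant may not match exactly. KV cache and runtime overhead are ignored.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# HYPOTHETICAL parameter count, chosen only to make the comparison concrete.
N_PARAMS = 100e9

# Illustrative bit widths; 2.7 is the nominal llama.cpp figure for IQ2_M.
BIT_WIDTHS = [("16-bit", 16.0), ("8-bit", 8.0), ("4-bit", 4.0), ("IQ2_M-class", 2.7)]

if __name__ == "__main__":
    for label, bpw in BIT_WIDTHS:
        size = weight_footprint_gb(N_PARAMS, bpw)
        print(f"{label:>12}: {size:7.2f} GB")
```

Under these assumptions, going from 16 bits to about 2.7 bits shrinks the weights by roughly a factor of six, which is the difference between needing a server and fitting on a pair of consumer cards.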

Performance and trade-offs

The user reports "low tps", meaning few tokens generated per second, an expected compromise when forcing an LLM onto two GPUs with a constrained bus and extreme quantization. No exact figure is given, but the perceived slowness emerges as the Achilles' heel of this configuration. For interactive workloads or production use, throughput remains a critical factor, especially in on-premise contexts where offloading to cloud services is not an option.

The system is not an industrial solution but a concrete demonstration of how high-end consumer hardware can approach local inference scenarios that until recently required dedicated servers. The choice of two heterogeneous GPUs on bifurcated PCIe reflects a growing focus on Total Cost of Ownership: rather than investing in a single professional accelerator, users combine multiple consumer cards to maximize overall capacity at lower cost.

The rise of domestic multi-GPU systems

The project does not stop here: the user plans to migrate the workload to a Threadripper system equipped with 8 (or perhaps 12) RTX 3090s. That leap multiplies available VRAM and opens the door to larger models or less compressed runs, reducing bottlenecks. It reflects a trend visible in the self-hosted AI community: machines assembled from consumer components, often recycled from gaming or mining, become experimentation platforms for LLMs and generative models.
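How much VRAM the planned upgrade adds can be tallied directly. The short Python sketch below uses the published memory sizes of the two cards (32 GB for the RTX 5090, 24 GB for the RTX 3090). The configuration labels and function name are illustrative, and pooled VRAM is an upper bound, since each card also needs room for activations and cache.

```python
# Published VRAM per card, in GB.
VRAM_GB = {"RTX 5090": 32, "RTX 3090": 24}


def total_vram_gb(cards: dict[str, int]) -> int:
    """Sum VRAM across a mix of cards given as {model name: count}."""
    return sum(VRAM_GB[name] * count for name, count in cards.items())


# Configurations mentioned in the article; the labels are illustrative.
CONFIGS = {
    "current (5090 + 3090)": {"RTX 5090": 1, "RTX 3090": 1},
    "planned, 8x 3090": {"RTX 3090": 8},
    "planned, 12x 3090": {"RTX 3090": 12},
}

if __name__ == "__main__":
    for label, cards in CONFIGS.items():
        print(f"{label}: {total_vram_gb(cards)} GB")
```

The current pair pools 56 GB, while eight or twelve RTX 3090s would pool 192 GB or 288 GB, which is what makes larger or less aggressively quantized models plausible.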

The episode raises broader questions for those evaluating on-premise deployments. Aggressive quantization can make models accessible on modest hardware, but it introduces quality and latency variables that must be assessed case by case. The choice of infrastructure, between cutting-edge GPUs and older but plentiful cards, requires balancing performance, energy costs, and management complexity. In the end, the pelican SVG is a symbol of an ecosystem that is increasingly moving toward data sovereignty and local computing, even at the cost of some technical compromises.