ByteDance released Ouro-2.6B-Thinking, a recurrent Universal Transformer model that proved difficult to run for inference.

Architecture and Challenges

Ouro's architecture is unusual: it runs its full stack of 48 layers four times per token, for 192 effective layer passes. Existing GGUF implementations did not account for this looping and produced incorrect results.
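The looped execution pattern can be sketched in plain Python. The function and layer names below are hypothetical stand-ins, not the real Ouro implementation; the point is only that the same stack of weights is re-entered several times per token, multiplying the effective depth:

```python
def run_looped_stack(x, layers, num_loops):
    """Apply the same stack of layer functions num_loops times
    (Universal Transformer-style weight sharing).
    Effective depth = len(layers) * num_loops."""
    passes = 0
    for _ in range(num_loops):        # re-enter the same weights
        for layer in layers:          # one full pass over the stack
            x = layer(x)
            passes += 1
    return x, passes

# Toy stand-in layers: each just adds 1. With 48 layers looped
# 4 times we get 48 * 4 = 192 layer applications per token.
layers = [lambda v: v + 1 for _ in range(48)]
out, passes = run_looped_stack(0, layers, num_loops=4)
# passes == 192
```

A conventional GGUF runner that walks the 48 layers exactly once would perform only a quarter of this computation, which is consistent with the incorrect outputs described above.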

Implemented Fixes

Two bugs in modeling_ouro.py that broke compatibility with Transformers 4.55 were fixed:

  • Incorrect cache class inheritance, which raised an AttributeError.
  • A missing get_mask_sizes() method, which create_causal_mask() requires.
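A minimal sketch of the second fix, assuming the custom cache class simply lacked the method: in recent Transformers versions, create_causal_mask() asks the cache object for its mask sizes via get_mask_sizes(cache_position, layer_idx), which should return a (kv_length, kv_offset) pair. The class and attribute names below are hypothetical; the real patch in modeling_ouro.py may differ:

```python
class PatchedOuroCache:
    """Hypothetical stand-in for the model's custom cache class."""

    def __init__(self):
        # Per-layer cached key/value tensors would live here; we only
        # track the count of previously cached tokens for the sketch.
        self.past_seen_tokens = 0

    def get_mask_sizes(self, cache_position, layer_idx):
        """Return (kv_length, kv_offset), the sizes create_causal_mask()
        needs to build the attention mask for this forward pass."""
        query_length = len(cache_position)          # new positions this step
        kv_length = self.past_seen_tokens + query_length
        kv_offset = 0                               # no sliding-window offset
        return kv_length, kv_offset

# A prefill of 3 tokens with an empty cache masks over 3 KV positions.
cache = PatchedOuroCache()
kv_length, kv_offset = cache.get_mask_sizes([0, 1, 2], layer_idx=0)
# kv_length == 3, kv_offset == 0
```

Without this method, create_causal_mask() fails as soon as it touches the cache object, which matches the incompatibility described in the bullet list above.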

Performance

After the fixes, the model ran correctly. On an NVIDIA L4 it reached approximately 3.8 tokens/s while using 5.3 GB of VRAM in float16.

Note that the model runs with use_cache=False, so the full context is recomputed for every generated token: KV cache pass-through does not work correctly with the 4-loop UT architecture.
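The cost of this recompute can be illustrated with a toy cost model (hypothetical function names, counting only processed positions): with a KV cache each decoding step handles one new token, while without it every step re-runs the whole prefix, so total work grows quadratically with sequence length:

```python
def step_cost(context_len, cached):
    """Positions processed to emit one token at a given context length."""
    return 1 if cached else context_len

def generation_cost(n_tokens, cached):
    """Total positions processed to generate n_tokens from scratch."""
    # Without a cache, step t re-processes all t positions seen so far,
    # so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    return sum(step_cost(t, cached) for t in range(1, n_tokens + 1))

# generation_cost(100, cached=True)  -> 100 positions total
# generation_cost(100, cached=False) -> 5050 positions total
```

This gap helps explain why throughput on the L4 stays in the single digits of tokens/s despite the model's modest parameter count.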