ByteDance released Ouro-2.6B-Thinking, a recurrent Universal Transformer model that proved difficult to run for inference.

Architecture and Challenges

Ouro's architecture is unusual: it runs its full stack of 48 layers four times per token, for 192 effective layer passes. Existing GGUF implementations did not account for this looping and produced incorrect results.
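The looped execution pattern can be sketched in plain Python. The function and layer names below are hypothetical stand-ins, not the real Ouro implementation; the point is only that the same stack of weights is re-entered several times per token, multiplying the effective depth:

```python
def run_looped_stack(x, layers, num_loops):
    """Apply the same stack of layer functions num_loops times
    (Universal Transformer-style weight sharing).
    Effective depth = len(layers) * num_loops."""
    passes = 0
    for _ in range(num_loops):        # re-enter the same weights
        for layer in layers:          # one full pass over the stack
            x = layer(x)
            passes += 1
    return x, passes

# Toy stand-in layers: each just adds 1. With 48 layers looped
# 4 times we get 48 * 4 = 192 layer applications per token.
layers = [lambda v: v + 1 for _ in range(48)]
out, passes = run_looped_stack(0, layers, num_loops=4)
# passes == 192
```

A conventional GGUF runner that walks the 48 layers exactly once would perform only a quarter of this computation, which is consistent with the incorrect outputs described above.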

Implemented Fixes

Two bugs in modeling_ouro.py that broke compatibility with Transformers 4.55 were fixed:

  • Incorrect cache class inheritance, which raised an AttributeError.
  • A missing get_mask_sizes() method, which create_causal_mask() requires.
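A minimal sketch of the second fix, assuming the custom cache class simply lacked the method: in recent Transformers versions, create_causal_mask() asks the cache object for its mask sizes via get_mask_sizes(cache_position, layer_idx), which should return a (kv_length, kv_offset) pair. The class and attribute names below are hypothetical; the real patch in modeling_ouro.py may differ:

```python
class PatchedOuroCache:
    """Hypothetical stand-in for the model's custom cache class."""

    def __init__(self):
        # Per-layer cached key/value tensors would live here; we only
        # track the count of previously cached tokens for the sketch.
        self.past_seen_tokens = 0

    def get_mask_sizes(self, cache_position, layer_idx):
        """Return (kv_length, kv_offset), the sizes create_causal_mask()
        needs to build the attention mask for this forward pass."""
        query_length = len(cache_position)          # new positions this step
        kv_length = self.past_seen_tokens + query_length
        kv_offset = 0                               # no sliding-window offset
        return kv_length, kv_offset

# A prefill of 3 tokens with an empty cache masks over 3 KV positions.
cache = PatchedOuroCache()
kv_length, kv_offset = cache.get_mask_sizes([0, 1, 2], layer_idx=0)
# kv_length == 3, kv_offset == 0
```

Without this method, create_causal_mask() fails as soon as it touches the cache object, which matches the incompatibility described in the bullet list above.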

Performance

After the fixes, the model ran correctly. On an NVIDIA L4 it reached approximately 3.8 tokens/s while using 5.3 GB of VRAM in float16.

Note that the model runs with use_cache=False, so the full context is recomputed for every generated token: KV cache pass-through does not work correctly with the 4-loop UT architecture.
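The cost of this recompute can be illustrated with a toy cost model (hypothetical function names, counting only processed positions): with a KV cache each decoding step handles one new token, while without it every step re-runs the whole prefix, so total work grows quadratically with sequence length:

```python
def step_cost(context_len, cached):
    """Positions processed to emit one token at a given context length."""
    return 1 if cached else context_len

def generation_cost(n_tokens, cached):
    """Total positions processed to generate n_tokens from scratch."""
    # Without a cache, step t re-processes all t positions seen so far,
    # so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    return sum(step_cost(t, cached) for t in range(1, n_tokens + 1))

# generation_cost(100, cached=True)  -> 100 positions total
# generation_cost(100, cached=False) -> 5050 positions total
```

This gap helps explain why throughput on the L4 stays in the single digits of tokens/s despite the model's modest parameter count.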