ByteDance released Ouro-2.6B-Thinking, a recurrent Universal Transformer model that turned out to be tricky to run for inference.
Architecture and Challenges
Ouro's architecture is unusual: it runs its 48 layers four times per token, for a total of 192 effective layer passes. Existing GGUF implementations produced incorrect results because of this design.
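The looping scheme can be sketched with a toy weight-shared stack. The class below is an illustrative stand-in with made-up small dimensions, not Ouro's actual implementation; the point is only that the same layers are reapplied on each loop, so effective depth = layers x loops:

```python
import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """Toy Universal-Transformer-style loop: one stack of layers is
    applied num_loops times, reusing the same weights each pass.
    Sizes here are illustrative, not Ouro's real config (48 layers x 4)."""
    def __init__(self, d_model=64, n_layers=4, num_loops=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.num_loops = num_loops

    def forward(self, x):
        for _ in range(self.num_loops):   # same weights reused each loop
            for layer in self.layers:
                x = layer(x)
        return x

model = LoopedTransformer()
out = model(torch.randn(2, 8, 64))  # (batch, seq, d_model), shape preserved
```

With the real model's numbers, 48 layers looped 4 times give the 192 effective passes mentioned above, which is also why per-token compute is so high.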
Implemented Fixes
Two bugs in modeling_ouro.py that broke compatibility with Transformers 4.55 were fixed:
- Incorrect cache inheritance, which raised an AttributeError.
- A missing get_mask_sizes() method, required by create_causal_mask().
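The second fix can be sketched roughly as follows. The class below is a self-contained stand-in, not the actual patch: the real method lives on a Transformers Cache subclass, and the exact return values there may differ. It only illustrates the contract that newer mask-building code expects, a (kv_length, kv_offset) pair describing how many key/value positions the causal mask must cover:

```python
class OuroCacheSketch:
    """Illustrative KV-cache stand-in exposing get_mask_sizes(),
    the method whose absence made create_causal_mask() fail.
    Names and logic here are assumptions, not the real modeling_ouro.py code."""
    def __init__(self):
        self.key_cache = []   # one tensor-like entry per layer in the real cache

    def get_seq_length(self, layer_idx=0):
        # Tokens already cached for this layer (0 when the cache is empty).
        if len(self.key_cache) <= layer_idx:
            return 0
        return self.key_cache[layer_idx].shape[-2]

    def get_mask_sizes(self, cache_position, layer_idx):
        # kv_length: total key/value positions the mask must span
        # (cached tokens plus the new positions being processed).
        # kv_offset: index where those positions start (0 for a simple cache).
        kv_length = self.get_seq_length(layer_idx) + len(cache_position)
        return kv_length, 0

cache = OuroCacheSketch()
sizes = cache.get_mask_sizes([0, 1, 2], layer_idx=0)  # empty cache, 3 new tokens
```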
Performance
After the fixes, the model ran successfully. On an NVIDIA L4, it reached roughly 3.8 tokens/s while using 5.3 GB of VRAM in float16.
Note that the model runs with use_cache=False, so the entire context is recomputed for every generated token: KV cache pass-through does not work correctly with the 4-loop UT architecture.
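The cost of use_cache=False shows up in the shape of the decoding loop: with no cached key/value states, each new token requires a forward pass over the whole sequence so far. The model and sampling function below are dummy stand-ins used only to exercise that loop structure:

```python
def generate_no_cache(model, token_ids, n_new, pick_next):
    """Decoding with use_cache=False semantics: since no key/value
    states are kept between steps, the full context is re-encoded
    from scratch for every generated token."""
    for _ in range(n_new):
        logits = model(token_ids)              # full-context recompute
        token_ids = token_ids + [pick_next(logits)]
    return token_ids

# Dummy stand-ins, just to show the loop shape (not a real LM).
toy_model = lambda ids: sum(ids)               # pretend "logits"
toy_pick = lambda logits: logits % 7           # pretend sampling
out = generate_no_cache(toy_model, [1, 2], n_new=2, pick_next=toy_pick)
# → [1, 2, 3, 6]
```

This full recompute per step is a large part of why throughput sits near 3.8 tokens/s despite the modest 5.3 GB memory footprint.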