NVIDIA has announced gpt-oss-puzzle-88B, a large language model (LLM) derived from OpenAI's gpt-oss-120b. The new model is optimized for inference, particularly for reasoning-heavy workloads.
Architecture and Performance
gpt-oss-puzzle-88B was developed using Puzzle, a post-training neural architecture search (NAS) framework that searches for more inference-efficient architectures while maintaining or improving accuracy. Compared to the original model, gpt-oss-puzzle-88B offers:
- Total parameters reduced to roughly 88 billion (about 73% of the original 120B).
- A 1.63x throughput improvement in long-context (64K/64K) scenarios on an 8xH100 node.
- A 1.22x throughput improvement in short-context (4K/4K) scenarios.
- Up to 2.82x throughput improvement on a single H100 GPU.
- Accuracy equal to or greater than that of the original model.
Optimization for H100
The model is specifically optimized for serving long and short contexts on NVIDIA H100 hardware. In these scenarios, the performance of reasoning models is often limited by KV-cache bandwidth and memory capacity, rather than raw compute power.
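To see why KV-cache capacity becomes the bottleneck at long context, a rough back-of-the-envelope estimate helps. The sketch below computes KV-cache size for a hypothetical configuration; the layer, head, and dimension counts are illustrative assumptions, not the published gpt-oss-puzzle-88B specification.

```python
# Rough KV-cache size estimate for a decoder-only transformer.
# All configuration numbers below are illustrative assumptions,
# not the actual gpt-oss-puzzle-88B architecture.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_el=2):
    # Two cached tensors per layer (K and V), each shaped
    # [batch, n_kv_heads, seq_len, head_dim], stored in FP16 by default.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el

# Hypothetical example: 36 layers, 8 KV heads, head_dim 64,
# a 64K-token context, batch size 8, FP16 storage.
gib = kv_cache_bytes(36, 8, 64, 65536, 8) / 2**30  # 36.0 GiB
```

Even with these modest assumed numbers, the cache alone approaches half of an H100's 80 GB, which is why reducing KV-cache traffic pays off more than adding raw compute in these scenarios.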
Architectural Details
- Architecture Type: Mixture-of-Experts Decoder-only Transformer.
- Network Architecture: Modified gpt-oss architecture with a varying number of experts per layer and a modified global/window attention pattern across layers.
- Number of model parameters: 88 billion.
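The global/window attention pattern mentioned above can be sketched with simple boolean masks. This is a minimal illustration; the window size and the per-layer schedule here are assumptions for demonstration, not the model's actual configuration.

```python
import numpy as np

def attention_mask(seq_len, window=None):
    # Causal mask: each query position attends only to itself and
    # earlier positions. If `window` is set, keys older than `window`
    # positions are also masked out (sliding-window attention).
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    if window is None:
        return causal                    # global (full causal) attention
    return causal & (q - k < window)     # sliding-window attention

# Hypothetical per-layer schedule: the base gpt-oss alternates window
# and global layers uniformly; a NAS framework like Puzzle may choose
# a non-uniform mix instead. The alternation below is illustrative.
layer_pattern = ["window" if i % 2 == 0 else "global" for i in range(8)]
```

Windowed layers keep only the last `window` tokens of KV cache per query, which is one lever for cutting the cache bandwidth and capacity costs described earlier.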