POLARIS: Small LLMs Write Long Stories with 4 A100s

Enhancing Creative Writing in Compact LLMs

Smaller large language models (LLMs) are cheaper to host and deploy, but they often struggle to generate long-form creative content. Their output tends to fall short of the requested length, or its quality degrades rapidly as length increases. This limitation is a significant obstacle for companies that want to use more efficient LLMs for on-premise content generation, where computational resources are a critical constraint.

In this context, research focuses on techniques that extend the capabilities of smaller models without a steep increase in resources. The goal is to let these LLMs compete with larger, more expensive "frontier" models, especially on complex tasks like creative storytelling, while keeping a cost and performance profile suitable for local infrastructure.

The POLARIS Method: Optimization and Human References

To address these limitations, the POLARIS methodology (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting) has been introduced. This training recipe, based on a lower-compute GRPO (Group Relative Policy Optimization) approach, integrates two key elements. The first is a frontier LLM "judge" that scores each generated story against a structured rubric during training; those scores serve as the model's reward.
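As a rough illustration of how a rubric-based judge can yield a single reward, the sketch below collapses per-criterion scores into one number between 0 and 1. The criterion names, weights, 1-10 scale, and JSON output format are all assumptions made for this example; the article does not describe the actual POLARIS rubric.

```python
import json

# Hypothetical rubric: criteria, weights, and the 1-10 scale are illustrative
# assumptions, not the rubric used by POLARIS.
RUBRIC_WEIGHTS = {"plot_coherence": 0.4, "prose_quality": 0.3, "prompt_adherence": 0.3}
SCALE_MIN, SCALE_MAX = 1.0, 10.0


def reward_from_judge(judge_json: str) -> float:
    """Turn a judge's per-criterion scores into one scalar reward in [0, 1]."""
    scores = json.loads(judge_json)
    total = 0.0
    for criterion, weight in RUBRIC_WEIGHTS.items():
        # Clamp so an out-of-range score cannot push the reward outside [0, 1].
        raw = min(max(float(scores[criterion]), SCALE_MIN), SCALE_MAX)
        total += weight * (raw - SCALE_MIN) / (SCALE_MAX - SCALE_MIN)
    return total


# Made-up judge response, used only to show the call.
example = '{"plot_coherence": 8, "prose_quality": 7, "prompt_adherence": 10}'
print(round(reward_from_judge(example), 3))
```

A weighted sum is only one possible aggregation; a real recipe might instead ask the judge for a single holistic score.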

The second fundamental ingredient is Human-Reference Injection (HRI), where a human-written story, provided in a "teacher-forced" manner, serves as a high-quality "anchor" within each GRPO group. This mechanism guides the model towards generating more coherent and higher-quality texts. The methodology was applied to the Qwen3.5-9B model, using a dataset of approximately 1,400 prompt-story pairs derived from a hundred anthologies. Training was performed on 4 NVIDIA A100 GPUs, a setup that, while significant, falls within the capabilities of many enterprise infrastructures considering self-hosted deployments.
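One way to picture the anchor's effect is through the group-relative advantage that GRPO computes: each reward is normalized by the mean and standard deviation of its group. The sketch below assumes the human story is scored by the same judge and simply appended to the group as an extra member; the article does not say how its reward is assigned, and the teacher-forced likelihood computation is not modeled here. The function name and reward values are hypothetical.

```python
from statistics import mean, pstdev


def advantages_with_reference(
    sample_rewards: list[float], reference_reward: float, eps: float = 1e-6
) -> tuple[list[float], float]:
    """Group-relative advantages for sampled stories plus one human reference.

    Assumption: the reference joins the group as one extra member and is
    normalized together with the model's own samples.
    """
    rewards = list(sample_rewards) + [reference_reward]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages[:-1], advantages[-1]


# Made-up rewards: three model samples and one human-written reference.
sample_adv, ref_adv = advantages_with_reference([0.42, 0.55, 0.48], reference_reward=0.90)
print(sample_adv, ref_adv)
```

With these illustrative numbers, the reference lifts the group mean above even the best sample (0.55), so all three samples receive negative advantages while the reference receives a positive one. Without the anchor, the 0.55 sample would have been rewarded simply for being the best of a weak group.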

Implications for Local Deployments and Hardware

The result of this process is POLARIS-9B, a model that, according to benchmarks, proves competitive with much larger open-weight LLMs, while adhering more precisely to length instructions. Blinded human evaluations confirmed that POLARIS-9B is preferred over the base Qwen3.5-9B and performs on par with Qwen3.5-27B. This is particularly relevant for organizations aiming for on-premise deployments, where choosing smaller yet performant models can drastically reduce the Total Cost of Ownership (TCO) and VRAM requirements.
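To put the VRAM point in numbers, here is a back-of-the-envelope estimate of the memory needed for model weights alone. It assumes 16-bit weights (2 bytes per parameter) and takes the nominal parameter counts from the model names; KV cache, activations, and runtime overhead are excluded, so real requirements are higher.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in decimal GB (excludes KV cache and activations)."""
    return n_params * bytes_per_param / 1e9


# Nominal sizes taken from the model names; exact parameter counts may differ.
for name, n_params in [("9B", 9e9), ("27B", 27e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):.0f} GB of weights at 16-bit")
```

Under these assumptions the 9B model needs roughly a third of the weight memory of the 27B model it is compared against.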

A crucial aspect is POLARIS-9B's ability to preserve quality even for story requests up to three times the length it was trained on (e.g., up to 12,000 words, starting from training on 4,000 words). This is a common weakness for many open-weight models, which tend to degrade significantly in quality or length adherence in similar scenarios. The ability to generalize length is a meaningful stress test for creative-writing models and offers a useful criterion for distinguishing otherwise similar models, especially when considering the context and memory limitations typical of self-hosted environments. For those evaluating on-premise deployments, analytical frameworks are available on AI-RADAR/llm-onpremise to assess trade-offs between performance, costs, and data sovereignty.
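Length adherence can be checked with a very simple measure. The sketch below computes the relative gap between the requested and produced word counts; this is an assumed metric for illustration, since the article does not say how the benchmarks score length adherence, and splitting on whitespace is only a rough word count.

```python
def length_deviation(story: str, requested_words: int) -> float:
    """Relative gap between produced and requested word count (0.0 means exact)."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    produced = len(story.split())  # whitespace split as a rough word count
    return abs(produced - requested_words) / requested_words


# Toy example: a 3-word "story" against a 4-word request.
print(length_deviation("once upon a", requested_words=4))
```

Running the same check at 4,000 and 12,000 requested words would show whether adherence holds at three times the training length.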

Future Prospects and the Challenge of Generalization

The results obtained with POLARIS suggest that length generalization is more than a performance metric: it works as a genuine stress test for creative-writing models. This capability is fundamental for applications requiring large-scale narrative coherence, from marketing content generation to the creation of complex scenarios. The research also shows that training on only four A100 GPUs can substantially improve the capabilities of smaller models.

This approach opens new avenues for developing more efficient and versatile LLMs, capable of operating effectively in resource-constrained environments, such as edge or air-gapped deployments. The continuous optimization of training methodologies and the integration of high-quality feedback, both from LLM "judges" and human references, will be crucial to unlock the full potential of open-weight models, making them increasingly powerful and accessible tools for a wide range of enterprise applications.