A Complete Cinematic AI Pipeline on a Single GPU

A recent open-source project, developed as part of the AMD x lablab hackathon, has demonstrated the feasibility of a complete pipeline for creating cinematic reels from a single text prompt. Built around FLUX.2 [klein] for image generation, this integrated solution produces videos with consistent characters, a coherent story, music, and multilingual narration. Most significantly, the entire workflow runs on a single AMD Instinct MI300X GPU, highlighting the potential of high-end hardware for on-premise deployments.

The pipeline, dual-licensed under Apache 2.0 or MIT, is a concrete example of how generative AI components can be orchestrated for a complex task. The end-to-end process, which initially took approximately 45 minutes for a 720p clip, has been optimized down to 10.4 minutes, a roughly 4.3x speedup. This result is particularly relevant for companies seeking efficient, locally controllable AI video production solutions.

Architecture and Technical Details

The pipeline is structured into eight sequential stages, all executed on the same GPU. The "Director Agent," based on Qwen3.5-35B-A3B (served with vLLM and AITER MoE kernels), plans six shots from a single sentence, returning structured JSON with character bibles, per-shot image prompts, a music brief, and per-shot voice-over scripts, including the narration language. FLUX.2 [klein] then generates canonical character portraits and keyframes for each shot, maintaining identity consistency without any LoRA training step.
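The Director stage is, at its core, constrained generation: one prompt in, one machine-readable shot plan out. The sketch below illustrates that pattern with vLLM's guided JSON decoding; the model ID is a publicly available stand-in for the article's Director model, and the schema fields are illustrative assumptions, not the project's actual contract.

```python
import json
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# JSON schema for the shot plan; fields are illustrative assumptions.
SHOT_PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "characters": {"type": "array", "items": {"type": "object"}},
        "music_brief": {"type": "string"},
        "shots": {
            "type": "array", "minItems": 6, "maxItems": 6,
            "items": {
                "type": "object",
                "properties": {
                    "image_prompt": {"type": "string"},
                    "voiceover": {"type": "string"},
                    "language": {"type": "string"},
                },
                "required": ["image_prompt", "voiceover", "language"],
            },
        },
    },
    "required": ["characters", "music_brief", "shots"],
}

# Stand-in model ID; the article names Qwen3.5-35B-A3B as the Director.
llm = LLM(model="Qwen/Qwen3-30B-A3B")
params = SamplingParams(
    max_tokens=4096,
    guided_decoding=GuidedDecodingParams(json=SHOT_PLAN_SCHEMA),  # force valid JSON
)
outputs = llm.generate(["A lighthouse keeper finds a stranded astronaut."], params)
plan = json.loads(outputs[0].outputs[0].text)  # consumed by the downstream stages
```

Guided decoding guarantees the downstream stages always receive parseable JSON, which matters in a fully automated eight-stage pipeline where a malformed plan would stall everything after it.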

The animation phase is handled by Wan2.2-I2V-A14B, which generates 81 frames at a native 16 fps and 1280x720 resolution, a choice that prioritizes production quality over the model's default settings. A "Vision critic," reusing Qwen3.5-35B, evaluates each generated clip, flagging flaws such as character drift or visual artifacts and triggering targeted retries when issues arise. Music is produced by ACE-Step v1, while Kokoro-82M handles narration in nine languages, with the language selected per shot by the Director Agent. Finally, ffmpeg mixes all the elements into the finished video.
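The final mux is a conventional ffmpeg job. A minimal sketch, assuming the per-shot clips have already been concatenated into video.mp4 and that the music and narration tracks were rendered to WAV; the file names, filter values, and codec choices are illustrative, not the project's actual settings.

```python
import subprocess

# Duck the music under the narration, mix the two audio tracks, and mux
# the result with the concatenated video (video stream copied, not re-encoded).
subprocess.run([
    "ffmpeg", "-y",
    "-i", "video.mp4",        # concatenated 1280x720 shots
    "-i", "music.wav",        # ACE-Step soundtrack
    "-i", "narration.wav",    # Kokoro voice-over
    "-filter_complex",
    "[1:a]volume=0.4[m];[m][2:a]amix=inputs=2:duration=longest[a]",
    "-map", "0:v", "-map", "[a]",
    "-c:v", "copy", "-c:a", "aac",
    "final_reel.mp4",
], check=True)
```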

Implications for On-Premise Deployments

The use of a single AMD Instinct MI300X GPU with 192 GB of HBM3 memory is a key element of this architecture. This high VRAM capacity allows large models (a 35B MoE, a 4B diffusion model, a 14B I2V MoE, a 3.5B music model, and a TTS system) to be loaded sequentially onto the same card, as sketched below. This approach contrasts sharply with stitching together 4-5 consumer GPUs with 24 GB each to handle the same model stack, highlighting a significant trade-off in infrastructure complexity and TCO.
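In practice, this enables a simple load-run-free loop on one device instead of a multi-card serving topology. A minimal sketch of that stage-swapping pattern, assuming the diffusion stages are served through Hugging Face diffusers; the run_stage helper and the model IDs in the comments are hypothetical stand-ins.

```python
import gc
import torch
from diffusers import DiffusionPipeline

def run_stage(model_id: str, **inputs):
    """Load one stage's model, run it, then free HBM for the next stage."""
    pipe = DiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).to("cuda")  # torch.cuda maps to ROCm/HIP on the MI300X
    result = pipe(**inputs)
    del pipe                   # drop the only reference to the weights
    gc.collect()
    torch.cuda.empty_cache()   # return HBM3 to the pool before the next load
    return result

# e.g. keyframes first, then animation (IDs are stand-ins, not the project's):
# stills = run_stage("black-forest-labs/FLUX.2-klein", prompt="...")
# clip   = run_stage("Wan-AI/Wan2.2-I2V-A14B-Diffusers", image=stills[0], prompt="...")
```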

For organizations evaluating self-hosted alternatives to cloud solutions, the ability to consolidate complex AI workloads onto a reduced number of hardware units represents a considerable advantage. This not only simplifies management and reduces physical footprint but also helps maintain data sovereignty, a crucial aspect for sectors with stringent compliance requirements. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs, providing tools for informed deployment decisions.

Future Prospects and Optimization

The project goes beyond a functional demonstration and includes significant performance-optimization work. Techniques such as ParaAttention's First Block Cache (FBCache), which roughly doubled Wan2.2 throughput, and the selective application of torch.compile to the transformer modules have drastically reduced processing times. AITER MoE acceleration of the Qwen Director via vLLM further improved efficiency.
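A hedged sketch of how those two video-side optimizations are typically wired up with diffusers and the para-attn package, assuming para-attn's adapter supports the pipeline class; the model ID, cache threshold, and compile mode are assumptions rather than the project's exact configuration.

```python
import torch
from diffusers import WanImageToVideoPipeline
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# First Block Cache: skip redundant transformer work when consecutive
# denoising steps change little, trading a small quality delta for speed.
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)

# Selective torch.compile on the transformer only (the hot path), leaving
# the rest of the pipeline eager to keep compile times manageable.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```

Compiling only the transformer is the usual compromise: it is where almost all of the denoising compute lives, while compiling the full pipeline would add warm-up cost for little additional gain.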

These optimization efforts underscore the importance of refining not only the models but also the entire pipeline and its interaction with the underlying hardware. The availability of the code on GitHub and documentation on Hugging Face Spaces facilitates adoption and further development by the community. This collaborative approach is fundamental for pushing the boundaries of AI's generative capabilities, especially in contexts where local control and resource efficiency are priorities.