Distilling Visual Knowledge: Gemini 3 Flash to Qwen 3 VL

A user is exploring the possibility of transferring the advanced visual reasoning capabilities of Gemini 3 Flash into the open-source model Qwen 3 VL 32B. The goal is to create a synthetic data pipeline for image-to-image models, overcoming the limitations of current open-source models in terms of data quality.

The user has identified a specific problem, defined as the "Horns Issue," where open-source models struggle to distinguish between basic anatomical elements and removable accessories in an image. Gemini 3 Flash, in contrast, demonstrates an accurate understanding of these layers.

Challenges and Questions

The plan is to fine-tune Qwen 3 VL 32B on a dataset labeled by Gemini 3 Flash. However, several technical questions arise:

  • Can Qwen 3 VL actually absorb this level of reasoning via SFT (Supervised Fine-Tuning)?
  • Is the "blindness" in open models a limitation of the vision encoder or a reasoning issue on the LLM side?
  • Has anyone already experimented with VLM-to-VLM distillation for large-scale labeling in generative AI pipelines?

The user seeks to develop a local captioner that achieves proprietary levels of accuracy and asks for information on the "plasticity" of Qwen 32B for this specific task.