SupraVL-Nano-900k: The Pocket-Sized VLM That Opens the Black Box

The news comes from the LocalLLaMA subreddit: SupraLabs has released SupraVL-Nano-900k, a Vision-Language Model (VLM) with just 900k parameters, built entirely from scratch. Don't expect a competitive model: its purpose is purely educational. It is a fully transparent blueprint, a Jupyter notebook that contains the entire architecture and is released under Apache 2.0.

Under the Hood

The VLM consists of three main blocks: a CNN-based visual encoder, a GPT-2 style transformer decoder, and a BPE tokenizer trained directly on the Flickr8k captions. The visual encoder uses three convolutional layers with batch normalization and ReLU, followed by adaptive pooling that reduces the feature map to a 4×4 grid, yielding 16 spatial tokens. These tokens are projected to 128 dimensions and prepended to the text sequence, in what the team calls a "prefix concatenation" fusion strategy.
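To make the data flow concrete, here is a minimal PyTorch sketch of the encoder and the prefix concatenation described above. It is not the SupraLabs code: the class name, channel widths, kernel size, stride, input resolution, and the choice of average (rather than max) adaptive pooling are all assumptions made for illustration. Only the three conv blocks, the 4×4 grid, the 128-dimensional projection, the 2,048-token vocabulary, and the 16 + 48 token split come from the announcement.

```python
import torch
import torch.nn as nn


class TinyVisualEncoder(nn.Module):
    """Three conv blocks -> 4x4 adaptive pooling -> 16 tokens of width d_model."""

    def __init__(self, d_model=128, channels=(16, 32, 64)):  # channel widths are assumed
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
            ]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))  # average pooling is an assumption
        self.proj = nn.Linear(in_ch, d_model)

    def forward(self, images):
        x = self.pool(self.conv(images))   # (batch, channels, 4, 4)
        x = x.flatten(2).transpose(1, 2)   # (batch, 16, channels)
        return self.proj(x)                # (batch, 16, d_model)


encoder = TinyVisualEncoder().eval()
token_embedding = nn.Embedding(2048, 128)     # vocabulary size x model width

images = torch.randn(2, 3, 64, 64)            # input resolution is assumed
text_ids = torch.randint(0, 2048, (2, 48))    # 48 text positions

with torch.no_grad():
    visual_tokens = encoder(images)                                  # (2, 16, 128)
    fused = torch.cat([visual_tokens, token_embedding(text_ids)], dim=1)  # (2, 64, 128)
```

The fused sequence is what the decoder would consume: the 16 visual tokens sit in front of the text embeddings, so ordinary causal self-attention lets every text position attend to the image without any dedicated cross-attention layer.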

The decoder is a mini-transformer with 3 layers, 4 attention heads, and a 256-unit feed-forward network. The total context is only 64 positions: 16 for visual tokens and 48 for text. The model ties the embedding and lm_head weights, a common trick to reduce parameters. The BPE vocabulary has 2,048 tokens, enough to cover the captions in Flickr8k.
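A back-of-the-envelope count shows where the parameter budget goes and what weight tying saves. The sketch below assumes a standard GPT-2 style block (biased linear layers, two LayerNorms per block, a final LayerNorm, learned positional embeddings) and a model width of 128, inferred from the fact that the visual tokens are projected to 128 dimensions before being prepended to the text. The function name is hypothetical and the real notebook may differ in details.

```python
def decoder_params(d_model=128, n_layers=3, d_ff=256, vocab=2048, context=64, tied=True):
    """Count parameters of a GPT-2 style decoder under the stated assumptions."""
    token_emb = vocab * d_model
    pos_emb = context * d_model

    qkv = d_model * 3 * d_model + 3 * d_model      # fused query/key/value projection
    attn_out = d_model * d_model + d_model         # attention output projection
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # weight and bias, two norms per block
    per_layer = qkv + attn_out + ffn + layer_norms

    final_norm = 2 * d_model
    lm_head = 0 if tied else vocab * d_model       # tied head reuses token_emb

    return token_emb + pos_emb + n_layers * per_layer + final_norm + lm_head


tied_total = decoder_params(tied=True)      # 668,032
untied_total = decoder_params(tied=False)   # 930,176
saved_by_tying = untied_total - tied_total  # 262,144
```

Under these assumptions the decoder alone takes roughly 668k parameters, and an untied output head would add another 262k, which by itself would push the decoder past the 900k total. The number of attention heads does not appear in the count because splitting the width into heads does not change the size of the projections.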

Training, completed in under an hour on a T4 GPU (available via Kaggle or Google Colab), followed a simple recipe: 15 epochs with AdamW, cosine learning rate decay, batch size 64, and mixed precision. The result is a model that generates short, generic captions, but clearly demonstrates how a real VLM processes an image.
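The cosine decay in that recipe is easy to write out by hand. The sketch below is plain Python, not the notebook's code: the base learning rate, the floor of zero, the number of steps per epoch, and the absence of warmup are all assumptions, since the announcement only specifies 15 epochs, AdamW, and cosine decay.

```python
import math


def cosine_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


EPOCHS = 15
STEPS_PER_EPOCH = 500   # hypothetical; depends on dataset split and batch size
total_steps = EPOCHS * STEPS_PER_EPOCH

schedule = [cosine_lr(step, total_steps) for step in range(total_steps + 1)]
```

The learning rate starts at its full value, passes through half of it at the midpoint of training, and flattens out toward the floor in the last epochs. In a PyTorch training loop the same curve is usually obtained from a built-in scheduler rather than computed manually.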

Why Transparency Matters

Building every component from scratch and distributing it as readable code is no small feat. Most VLMs, from LLaVA to CLIP-based models, behave like black boxes that are hard to inspect. Here, every line of code is commented and the data flow is explicit, from input pixels to next-token generation. This approach answers a concrete question for those working with models: how do they actually work once the layers of abstraction are stripped away?

For the LocalLLaMA community, accustomed to experimenting with self-hosted models, such an artifact is gold. It allows studying visual attention mechanisms, modality fusion, and the impact of architectural choices (e.g., the 4×4 grid instead of a single global token) on a model that consumes minimal resources. The ability to run it on a modest GPU like the T4, with its 16 GB of VRAM, lowers the barrier for anyone wanting to understand before they deploy.

An On-Premise Perspective

AI-RADAR follows such initiatives with interest because, in the context of on-premise deployment, understanding internals is as critical as data sovereignty. A team evaluating VLM adoption on its own infrastructure must accurately estimate memory consumption, inference bottlenecks, and optimization opportunities. Transparent models like SupraVL-Nano-900k, while not intended for production, provide an ideal testbed for becoming familiar with the trade-offs between encoders, decoders, and fusion strategies.
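As a first exercise in that kind of estimation, the static memory footprint of a 900k-parameter model can be worked out in a few lines. This is a rough sketch with a hypothetical helper name: it assumes fp32 storage and counts only weights, gradients, and the two moment buffers AdamW keeps per parameter. Activations, framework overhead, and the effect of mixed precision are deliberately left out.

```python
def training_memory_mb(n_params, bytes_per_value=4):
    """Rough fp32 footprint of weights, gradients, and AdamW's two moment buffers."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    adam_moments = 2 * n_params * bytes_per_value
    return (weights + grads + adam_moments) / 1e6


N_PARAMS = 900_000
weights_mb = N_PARAMS * 4 / 1e6            # about 3.6 MB for inference weights
train_mb = training_memory_mb(N_PARAMS)    # about 14.4 MB of static training state
```

Both figures are tiny next to a 16 GB card, so at this scale memory use during training is dominated by activations and batch size rather than by the model itself. The same arithmetic applied to a multi-billion-parameter model gives very different proportions.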

Of course, limitations are clear: Flickr8k is tiny, captions are short, and the model does not follow instructions. But the SupraLabs team is honest: "It is not competing with LLaVA. It is competing with nothing—it’s an educational artifact." The roadmap includes replacing the CNN with a small ViT, adding cross-attention layers (Flamingo style), and scaling the decoder, eventually training on larger datasets like CC3M or LAION-400M. These steps will require more compute, but they could be replicated on on-premise machines with consumer GPUs.

The Value of a Notebook

In the end, SupraVL-Nano-900k reminds us that the complexity of Large Language Models can be dissected and made accessible. Before investing in expensive infrastructure or opaque cloud services, having a model you can take apart helps you ask the right questions. The code is there, ready on Hugging Face: a quick pip install and a few lines are enough to see it in action. An invitation, in short, to get your hands dirty with the building blocks of tomorrow’s VLMs.