A curious project demonstrates how a language model can be forced to "see" images.
Implementation Details
A developer froze a GPT-2 XL model and optimized only the input embedding tensors so that an attention map reproduced the frames of the Bad Apple music video. The optimization targeted a single attention head (layer 0, head 0), so only that head's Q and K projections needed to be computed. The loss was the MSE between the pre-softmax attention scores (logits) and the target frame. Processing all 3,286 frames took approximately 12 minutes on an RTX 5070 Ti GPU, using about 4.5 GB of VRAM.
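The optimization loop described above can be sketched as follows. This is a minimal illustration, not the author's code: it substitutes a small frozen random attention head for GPT-2 XL (whose weights would need to be loaded via a library such as Hugging Face Transformers), and a random binary matrix for a real video frame. Only the embedding tensor receives gradients; the MSE is taken in logit space, before the softmax, as the article describes.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for one frozen attention head. GPT-2 XL uses d_model=1600;
# small sizes are used here so the sketch runs in seconds.
seq_len, d_model, d_head = 32, 64, 16
W_q = torch.randn(d_model, d_head) / d_model**0.5  # frozen Q projection
W_k = torch.randn(d_model, d_head) / d_model**0.5  # frozen K projection

# Hypothetical target "frame": a seq_len x seq_len binary image.
target = (torch.rand(seq_len, seq_len) > 0.5).float()

# The only trainable tensor: the input embeddings.
emb = torch.randn(seq_len, d_model, requires_grad=True)
opt = torch.optim.Adam([emb], lr=0.05)

losses = []
for step in range(300):
    q = emb @ W_q                     # (seq_len, d_head)
    k = emb @ W_k                     # (seq_len, d_head)
    scores = (q @ k.T) / d_head**0.5  # pre-softmax attention logits
    loss = torch.nn.functional.mse_loss(scores, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

After the loop, `scores` approximates the target image; rendering it per frame and repeating for each frame of the video yields the animation.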
Results
The result is an unexpected visualization of what a language model can be made to do: although GPT-2 was never trained on images, its attention maps can be steered to render them. Experiments like this help illuminate the internal workings of language models and their hidden potential.