Targeted Pre-training for Multimodal Models

Multimodal pretraining is an effective technique for building general-purpose data representations. In many practical scenarios, however, only one specific modality is heavily used during fine-tuning. Standard pretraining methods treat all modalities uniformly, which can leave that most important modality with suboptimal representations.

Finetune-Informed Pretraining (FIP)

To address this issue, Finetune-Informed Pretraining (FIP) has been proposed: a model-agnostic method that biases representation learning toward a designated target modality, namely the one used during fine-tuning. FIP combines three per-modality adjustments for the target modality: higher masking difficulty, stronger loss weighting, and increased decoder capacity. None of these changes modify the shared encoder or require additional supervision, as illustrated in the sketch below.
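
As a concrete illustration, here is a minimal sketch of how these three biases could be implemented in a PyTorch-style masked-modeling pipeline. The names (FIPConfig, make_decoder, fip_loss), the modality keys, and the specific values are illustrative assumptions, not the published implementation.

```python
# Minimal sketch of the three FIP biases, assuming a PyTorch masked-modeling
# setup with per-modality tokens. All names and values here are illustrative.
from dataclasses import dataclass
from typing import Dict

import torch
import torch.nn as nn


@dataclass
class FIPConfig:
    target: str                    # modality emphasized during pretraining
    mask_ratio: Dict[str, float]   # harder masking for the target modality
    loss_weight: Dict[str, float]  # stronger loss weight for the target modality
    decoder_depth: Dict[str, int]  # larger decoder for the target modality


def make_decoder(dim: int, depth: int, patch_dim: int) -> nn.Module:
    """Per-modality decoder head; the target modality gets more layers (capacity)."""
    layers = [nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
              for _ in range(depth)]
    return nn.Sequential(*layers, nn.Linear(dim, patch_dim))


def fip_loss(recon: Dict[str, torch.Tensor],
             targets: Dict[str, torch.Tensor],
             masks: Dict[str, torch.Tensor],
             cfg: FIPConfig) -> torch.Tensor:
    """Weighted sum of per-modality masked-reconstruction (MSE) losses."""
    total = torch.zeros(())
    for m, pred in recon.items():
        per_patch = ((pred - targets[m]) ** 2).mean(dim=-1)          # (B, N)
        masked = (per_patch * masks[m]).sum() / masks[m].sum().clamp(min=1.0)
        total = total + cfg.loss_weight[m] * masked                  # bias toward cfg.target
    return total
```

In this sketch the shared encoder is untouched; only the masking schedule, the loss weights, and the per-modality decoder heads differ across modalities.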

Results and Applications

When applied to masked modeling on constellation diagrams of wireless signals, FIP consistently improves downstream fine-tuning performance without extra data or compute. FIP is simple to implement, compatible with existing architectures, and broadly applicable across multimodal masked-modeling pipelines; a hypothetical usage example follows.
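
The snippet below is a hypothetical end-to-end usage for the wireless setting described above, reusing the FIPConfig, make_decoder, and fip_loss helpers from the earlier sketch. Constellation diagrams are treated as the target modality; the auxiliary "spectrum" modality, the tensor shapes, and the bias values are all assumed for illustration.

```python
# Hypothetical configuration: constellation diagrams are the fine-tuning target,
# paired with an assumed auxiliary "spectrum" modality; values are illustrative.
import torch

cfg = FIPConfig(
    target="constellation",
    mask_ratio={"constellation": 0.85, "spectrum": 0.60},  # harder masking on target
    loss_weight={"constellation": 2.0, "spectrum": 1.0},   # stronger loss on target
    decoder_depth={"constellation": 4, "spectrum": 2},     # bigger decoder on target
)

dim, patch_dim, B, N = 256, 64, 8, 196
decoders = {m: make_decoder(dim, cfg.decoder_depth[m], patch_dim) for m in cfg.mask_ratio}

# Stand-ins for shared-encoder outputs, ground-truth patches, and random masks.
latents = {m: torch.randn(B, N, dim) for m in decoders}
targets = {m: torch.randn(B, N, patch_dim) for m in decoders}
masks = {m: (torch.rand(B, N) < cfg.mask_ratio[m]).float() for m in decoders}

recon = {m: dec(latents[m]) for m, dec in decoders.items()}
loss = fip_loss(recon, targets, masks, cfg)
loss.backward()  # gradients emphasize reconstruction of the constellation modality
```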