Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities but often lack detailed knowledge about specific entities.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by supplying additional contexts from an external knowledge base. However, existing decoding methods for RAG neither fully leverage multiple relevant contexts nor effectively suppress the negative effects of irrelevant ones.
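The retrieval step described above can be sketched as follows. This is a minimal, assumed setup (cosine-similarity ranking over precomputed embeddings), not the specific retriever used in the paper; `retrieve_contexts` and its arguments are hypothetical names for illustration.

```python
import numpy as np

def retrieve_contexts(query_emb, kb_embs, kb_texts, k=3):
    """Rank knowledge-base entries by cosine similarity to the query
    embedding and return the top-k as additional contexts for the LVLM.
    Hypothetical sketch of a generic RAG retriever."""
    q = query_emb / np.linalg.norm(query_emb)
    kb = kb_embs / np.linalg.norm(kb_embs, axis=1, keepdims=True)
    sims = kb @ q                     # cosine similarity per KB entry
    top = np.argsort(-sims)[:k]       # indices of the k most similar entries
    return [kb_texts[i] for i in top], sims[top]

# toy example: 2-dimensional embeddings, three KB entries
query = np.array([1.0, 0.0])
kb_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
kb_texts = ["entity A facts", "entity B facts", "entity A trivia"]
contexts, scores = retrieve_contexts(query, kb_embs, kb_texts, k=2)
```

The retrieved contexts (and, optionally, their similarity scores) are then passed to the LVLM alongside the question.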
Relevance-aware Multi-context Contrastive Decoding (RMCD)
To address these limitations, Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG, has been proposed. RMCD generates a final prediction by combining the predictions obtained with each context, weighting each output by its relevance to the question. This allows RMCD to aggregate useful information from multiple relevant contexts while counteracting the negative effects of irrelevant ones.
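The weighting scheme described above can be illustrated with a short sketch: a relevance-weighted mixture of the per-context next-token distributions. This is an assumed simplification for illustration; the paper's exact contrastive formulation may differ, and `rmcd_combine` is a hypothetical name.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def rmcd_combine(context_logits, relevance_scores):
    """Combine per-context next-token logits into one distribution,
    weighting each context's contribution by its normalized relevance
    to the question. Sketch only, not the paper's exact method."""
    weights = softmax(np.asarray(relevance_scores, dtype=float))
    probs = np.stack([softmax(l) for l in context_logits])  # (n_contexts, vocab)
    return (weights[:, None] * probs).sum(axis=0)

# toy example: two contexts over a 3-token vocabulary;
# the first context is far more relevant, so it dominates the mixture
logits = [np.array([2.0, 0.5, 0.1]), np.array([0.1, 0.2, 3.0])]
relevance = [0.9, 0.1]
p = rmcd_combine(logits, relevance)
```

A highly relevant context thus steers the final token choice, while an irrelevant one is down-weighted rather than allowed to inject noise.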
Experimental results
Experiments demonstrate that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question answering benchmarks. RMCD can be adopted simply by replacing the decoding method of an LVLM, with no further training required. Analyses also show that RMCD is robust to retrieval quality, maintaining high performance even with less accurate retrieval results. The code is available on GitHub.