Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities but often lack detailed knowledge about specific entities.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by supplying additional contexts from an external knowledge base. However, existing decoding methods for RAG neither fully leverage multiple relevant contexts nor effectively suppress the negative effects of irrelevant ones.
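The retrieval step described above can be sketched as follows. This is a minimal, assumed setup (cosine-similarity ranking over precomputed embeddings), not the specific retriever used in the paper; `retrieve_contexts` and its arguments are hypothetical names for illustration.

```python
import numpy as np

def retrieve_contexts(query_emb, kb_embs, kb_texts, k=3):
    """Rank knowledge-base entries by cosine similarity to the query
    embedding and return the top-k as additional contexts for the LVLM.
    Hypothetical sketch of a generic RAG retriever."""
    q = query_emb / np.linalg.norm(query_emb)
    kb = kb_embs / np.linalg.norm(kb_embs, axis=1, keepdims=True)
    sims = kb @ q                     # cosine similarity per KB entry
    top = np.argsort(-sims)[:k]       # indices of the k most similar entries
    return [kb_texts[i] for i in top], sims[top]

# toy example: 2-dimensional embeddings, three KB entries
query = np.array([1.0, 0.0])
kb_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
kb_texts = ["entity A facts", "entity B facts", "entity A trivia"]
contexts, scores = retrieve_contexts(query, kb_embs, kb_texts, k=2)
```

The retrieved contexts (and, optionally, their similarity scores) are then passed to the LVLM alongside the question.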
Relevance-aware Multi-context Contrastive Decoding (RMCD)
To address these limitations, Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG, has been proposed. RMCD generates a final prediction by combining the predictions obtained with each context, weighting each output by its relevance to the question. This allows RMCD to aggregate useful information from multiple relevant contexts while counteracting the negative effects of irrelevant ones.
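The weighting scheme described above can be illustrated with a short sketch: a relevance-weighted mixture of the per-context next-token distributions. This is an assumed simplification for illustration; the paper's exact contrastive formulation may differ, and `rmcd_combine` is a hypothetical name.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def rmcd_combine(context_logits, relevance_scores):
    """Combine per-context next-token logits into one distribution,
    weighting each context's contribution by its normalized relevance
    to the question. Sketch only, not the paper's exact method."""
    weights = softmax(np.asarray(relevance_scores, dtype=float))
    probs = np.stack([softmax(l) for l in context_logits])  # (n_contexts, vocab)
    return (weights[:, None] * probs).sum(axis=0)

# toy example: two contexts over a 3-token vocabulary;
# the first context is far more relevant, so it dominates the mixture
logits = [np.array([2.0, 0.5, 0.1]), np.array([0.1, 0.2, 3.0])]
relevance = [0.9, 0.1]
p = rmcd_combine(logits, relevance)
```

A highly relevant context thus steers the final token choice, while an irrelevant one is down-weighted rather than allowed to inject noise.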
Experimental results
Experiments demonstrate that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question answering benchmarks. RMCD can be adopted simply by replacing the decoding method of an LVLM, with no further training required. Analyses also show that RMCD is robust to retrieval quality, maintaining high performance even with less accurate retrieval results. The code is available on GitHub.