Introduction

Text-centric forgery poses a significant threat to information security and authenticity. Current analysis methods are often limited to coarse-grained visual analysis and lack the capacity for sophisticated reasoning.

The LogicLens Framework

To address these challenges, we introduce LogicLens, a unified framework for Visual-Textual Co-reasoning that reformulates the separate objectives of forgery analysis into a single joint task. The framework is powered by our novel Cross-Cues-aware Chain of Thought (CCT) mechanism, which iteratively validates visual cues against textual logic.

The PR$^2$ Pipeline

Complementing this framework, we designed the PR$^2$ (Perceiver, Reasoner, Reviewer) pipeline, a hierarchical and iterative multi-agent system that generates high-quality, cognitively aligned annotations. To ensure robust alignment across all tasks, we further propose a weighted multi-task reward function for GRPO-based optimization.
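As a rough illustration of how a weighted multi-task reward might feed GRPO-based optimization, the sketch below combines per-task scores into one scalar and then standardizes the scalars within a group of sampled responses, which is the group-relative advantage GRPO is known for. The task names (`detection`, `grounding`, `explanation`), the weights, and the function names are assumptions made for this example; the text above does not specify them.

```python
from statistics import mean, pstdev

# Hypothetical task names and weights; the source does not list them.
WEIGHTS = {"detection": 0.4, "grounding": 0.3, "explanation": 0.3}


def combined_reward(task_rewards, weights=WEIGHTS):
    """Weighted sum of per-task rewards, each assumed to lie in [0, 1]."""
    return sum(weights[name] * task_rewards[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same prompt, scored per task (made-up values).
group = [
    {"detection": 1.0, "grounding": 0.5, "explanation": 0.5},
    {"detection": 0.0, "grounding": 0.5, "explanation": 0.5},
    {"detection": 1.0, "grounding": 1.0, "explanation": 1.0},
]
rewards = [combined_reward(g) for g in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so the weights directly control how much each task pulls on the policy update.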

The RealText Dataset

To train LogicLens, we constructed the RealText dataset, comprising 5,397 images with fine-grained annotations for model training: textual explanations, pixel-level segmentation, and authenticity labels. Extensive experiments demonstrate the superiority of LogicLens across multiple benchmarks.
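One way to picture a single annotated sample is as a record holding the three annotation types named above. This is only a sketch: the class name, field names, mask encoding, and the consistency rule (an authentic image has an empty mask) are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ForgeryAnnotation:
    """Hypothetical record layout for one annotated image."""
    image_path: str
    is_forged: bool        # authenticity label
    mask: List[List[int]]  # pixel-level segmentation, 1 = tampered pixel
    explanation: str       # textual explanation of the verdict


def is_consistent(record: ForgeryAnnotation) -> bool:
    """Check that the mask agrees with the label and an explanation exists."""
    has_tampered_pixels = any(any(row) for row in record.mask)
    return has_tampered_pixels == record.is_forged and bool(record.explanation.strip())


sample = ForgeryAnnotation(
    image_path="example.png",  # placeholder path
    is_forged=True,
    mask=[[0, 0, 0], [0, 1, 1], [0, 0, 0]],
    explanation="The total amount uses a font weight that differs from nearby digits.",
)
```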

Experimental Results

In zero-shot evaluation on T-IC13, LogicLens surpasses the specialized framework by 41.4% and GPT-4o by 23.4% in macro-average F1 score. On the challenging dense-text T-SROIE dataset, LogicLens establishes a significant lead over other MLLM-based methods in mF1, CSS, and macro-average F1.
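For readers unfamiliar with the headline metric, the sketch below computes macro-average F1 under its standard definition: the unweighted mean of per-class F1 scores. The class names and the toy labels are made up, and the evaluation protocol behind the reported numbers may differ in detail.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts; returns 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy example with hypothetical class names.
y_true = ["real", "real", "forged", "forged"]
y_pred = ["real", "forged", "forged", "forged"]
score = macro_f1(y_true, y_pred, labels=["real", "forged"])
```

Because every class contributes equally, a model cannot reach a high macro-average F1 by favoring the majority class alone.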

Conclusion

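LogicLens advances text-centric forgery analysis by combining Visual-Textual Co-reasoning, the PR$^2$ annotation pipeline, and the RealText dataset, and its gains on T-IC13 and T-SROIE point to new opportunities for protecting information security and authenticity.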