Vision-language models (VLMs) have made remarkable progress, enabling GUI agents to interact with computers in a more human-like manner. However, real-world computer-use tasks remain difficult due to long-horizon workflows, diverse interfaces, and frequent intermediate errors.
HyMEM: A Novel Memory Architecture
To address these challenges, the authors propose Hybrid Self-evolving Structured Memory (HyMEM), a graph-based memory system that couples discrete high-level symbolic nodes with continuous trajectory embeddings. The graph structure supports multi-hop retrieval, self-evolution through node-update operations, and on-the-fly refreshing of working memory during inference.
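The hybrid design described above (symbolic nodes paired with embeddings, multi-hop retrieval, and node updates) can be sketched in a minimal toy form. The class and method names below are illustrative assumptions, not HyMEM's actual API; retrieval here is plain cosine similarity plus neighbor expansion, and "self-evolution" is modeled as a simple embedding blend.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class HybridMemoryGraph:
    """Toy hybrid memory: symbolic nodes paired with trajectory embeddings.

    This is a sketch of the general idea, not the paper's implementation.
    """

    def __init__(self):
        self.nodes = {}  # node_id -> {"label": str, "embedding": list[float]}
        self.edges = {}  # node_id -> set of neighbor node_ids

    def add_node(self, node_id, label, embedding):
        self.nodes[node_id] = {"label": label, "embedding": embedding}
        self.edges.setdefault(node_id, set())

    def add_edge(self, a, b):
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def update_node(self, node_id, new_embedding, alpha=0.5):
        # "Self-evolution" modeled as blending old evidence with new.
        old = self.nodes[node_id]["embedding"]
        self.nodes[node_id]["embedding"] = [
            (1 - alpha) * o + alpha * n for o, n in zip(old, new_embedding)
        ]

    def retrieve(self, query_embedding, hops=1, top_k=1):
        # Seed with the top-k most similar nodes, then expand `hops` steps
        # along graph edges (multi-hop retrieval).
        ranked = sorted(
            self.nodes,
            key=lambda nid: cosine(self.nodes[nid]["embedding"], query_embedding),
            reverse=True,
        )
        frontier = set(ranked[:top_k])
        result = set(frontier)
        for _ in range(hops):
            frontier = {nb for nid in frontier for nb in self.edges[nid]} - result
            result |= frontier
        return result
```

A one-hop query seeded on a "open settings" node would also pull in a linked "toggle Wi-Fi" node, illustrating how graph edges extend retrieval beyond pure embedding similarity.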
Performance and Results
Extensive experiments show that HyMEM consistently improves open-source GUI agents, enabling 7B/8B backbones to match or surpass strong closed-source models. Notably, HyMEM improves Qwen2.5-VL-7B by 22.5% and outperforms Gemini2.5-Pro-Vision and GPT-4o.