Vision-language models (VLMs) have made remarkable progress, enabling GUI agents to interact with computers in a more human-like manner. However, real-world computer-use tasks remain difficult due to long-horizon workflows, diverse interfaces, and frequent intermediate errors.

HyMEM: A Novel Memory Architecture

To address these challenges, Hybrid Self-evolving Structured Memory (HyMEM) has been proposed: a graph-based memory system that combines discrete high-level symbolic nodes with continuous trajectory embeddings. The graph structure supports multi-hop retrieval, self-evolution via node-update operations, and on-the-fly working-memory refreshing during inference.
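The hybrid design can be illustrated with a minimal sketch. The class and method names below (`HybridMemoryGraph`, `retrieve`, `update`) are illustrative assumptions, not HyMEM's actual API: each node pairs a discrete symbolic label with a continuous embedding, retrieval seeds on embedding similarity and then expands over graph edges (multi-hop), and node updates model self-evolution.

```python
import math

class MemoryNode:
    """Hybrid node: a discrete symbolic label plus a continuous trajectory embedding.

    Names and structure are illustrative; HyMEM's actual node schema is not
    specified in the text above.
    """
    def __init__(self, node_id, label, embedding):
        self.node_id = node_id
        self.label = label          # high-level symbolic description (e.g. "open settings")
        self.embedding = embedding  # continuous trajectory embedding (list of floats)
        self.neighbors = set()      # graph edges to related nodes

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class HybridMemoryGraph:
    def __init__(self):
        self.nodes = {}

    def add(self, node_id, label, embedding):
        self.nodes[node_id] = MemoryNode(node_id, label, embedding)

    def link(self, a, b):
        # Undirected edge between two related experiences.
        self.nodes[a].neighbors.add(b)
        self.nodes[b].neighbors.add(a)

    def update(self, node_id, label=None, embedding=None):
        """Self-evolution sketch: revise a node in place as new experience arrives."""
        node = self.nodes[node_id]
        if label is not None:
            node.label = label
        if embedding is not None:
            node.embedding = embedding

    def retrieve(self, query_embedding, hops=2):
        """Multi-hop retrieval: seed on the most similar node, expand `hops` steps."""
        if not self.nodes:
            return []
        seed = max(self.nodes.values(),
                   key=lambda n: cosine(n.embedding, query_embedding))
        frontier, seen = {seed.node_id}, {seed.node_id}
        for _ in range(hops):
            frontier = {nb for nid in frontier
                        for nb in self.nodes[nid].neighbors} - seen
            seen |= frontier
        return [self.nodes[nid].label for nid in sorted(seen)]
```

For example, a query embedding close to one stored experience would also surface its graph neighbors (experiences linked during earlier tasks), which is what distinguishes multi-hop retrieval from flat nearest-neighbor lookup over embeddings alone.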

Performance and Results

Extensive experiments show that HyMEM consistently improves open-source GUI agents, enabling 7B/8B backbones to match or surpass strong closed-source models. Notably, HyMEM boosts Qwen2.5-VL-7B by +22.5% and outperforms Gemini2.5-Pro-Vision and GPT-4o.