Autonomous embodied agents exist in a world of multimedia websites. This raises a question: can they navigate multimodal websites to complete complex user tasks? Current benchmarks fall short of assessing such agents in a realistic, evolving environment that spans multiple websites.

To address this, MMInA, a multihop and multimodal benchmark, has been introduced to evaluate embodied agents on compositional Internet tasks. The benchmark has several appealing properties:

  1. It operates on evolving real-world multimodal websites, ensuring a high degree of realism and applicability to natural user tasks. The data includes 1,050 human-written tasks covering various domains such as shopping and travel. Each task requires the agent to autonomously extract multimodal information from web pages as observations.
  2. The dataset features naturally compositional tasks that require information from or actions on multiple websites to solve. This assesses long-range reasoning capabilities on web tasks.
  3. A novel protocol has been proposed for evaluating an agent’s progress in completing multihop tasks.
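
The abstract does not spell out the evaluation protocol, so the following is only an illustrative sketch of how per-hop progress on a multihop task might be scored. It assumes that each hop yields a boolean outcome and that progress is counted as the number of consecutive hops completed from the start. The names `HopScore` and `score_multihop` are hypothetical and are not taken from MMInA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HopScore:
    """Hypothetical per-task result; not the benchmark's actual schema."""
    hops_completed: int  # consecutive hops completed, counted from the first hop
    total_hops: int
    task_success: bool   # True only if every hop in a non-empty task succeeded


def score_multihop(hop_outcomes: List[bool]) -> HopScore:
    """Score one task from its ordered per-hop outcomes.

    Assumption: a failed hop stops the chain, so later hops do not count
    toward progress even if they happen to succeed.
    """
    completed = 0
    for ok in hop_outcomes:
        if not ok:
            break
        completed += 1
    total = len(hop_outcomes)
    return HopScore(completed, total, total > 0 and completed == total)


# Example: the agent completes the first two hops of a four-hop task.
result = score_multihop([True, True, False, True])
print(result.hops_completed, result.total_hops, result.task_success)
```

Counting only the leading run of successful hops makes it possible to see how far an agent gets before its first failure, rather than reporting a single pass/fail bit for the whole task.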

Experiments were conducted with both standalone (multimodal) language models and heuristic-based web agents. The results show that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. Agents were found to be more likely to fail on the early hops when a task contains more hops, which lowers the overall task success rate. To address this issue, a simple memory augmentation approach has been proposed that replays past action trajectories so the agent can reflect on them. This method significantly improved both the single-hop and multihop web browsing abilities of agents. The work reflects continued progress in autonomous embodied agents and the potential of the MMInA benchmark for addressing existing challenges.
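
As a rough illustration of the memory augmentation idea, the sketch below keeps a small buffer of past action trajectories and replays them as text ahead of the current task. It is a minimal sketch under stated assumptions, not the authors' implementation: the class `TrajectoryMemory`, its buffer size, and the prompt layout are all hypothetical.

```python
from collections import deque
from typing import Deque, List


class TrajectoryMemory:
    """Hypothetical replay buffer of past action trajectories."""

    def __init__(self, max_trajectories: int = 3) -> None:
        # deque(maxlen=...) silently drops the oldest trajectory when full.
        self._buffer: Deque[List[str]] = deque(maxlen=max_trajectories)

    def add(self, actions: List[str]) -> None:
        """Store a copy of one finished trajectory (a list of action strings)."""
        self._buffer.append(list(actions))

    def build_prompt(self, task: str) -> str:
        """Replay stored trajectories, oldest first, followed by the new task."""
        lines = [
            f"Past trajectory {i}: " + " -> ".join(trajectory)
            for i, trajectory in enumerate(self._buffer, start=1)
        ]
        lines.append(f"Current task: {task}")
        return "\n".join(lines)


# Example usage with made-up actions.
memory = TrajectoryMemory(max_trajectories=2)
memory.add(["click search box", "type 'blue kettle'", "click first result"])
memory.add(["open travel site", "select destination"])
print(memory.build_prompt("Find a hotel near the selected destination"))
```

The bounded buffer is a design choice made here to keep the replayed context short; how many trajectories to replay, and in what form, is left open by the abstract.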