Browser Agent Learns Tasks by Watching Users, Tops Human Scores

Researchers this week unveiled cotomi Act, a browser-based AI agent that learns workplace tasks by watching users and scored 80.4% on the WebArena benchmark, surpassing reported human baselines, according to an arXiv preprint.

Unlike conventional browser agents that require explicit instructions for each task, cotomi Act builds persistent organizational knowledge by observing users work. The system converts browsing patterns into shared task boards and wikis that both the user and the agent can edit, creating a growing repository of institutional know-how.

The technical architecture relies on several innovations to achieve reliable multi-step task execution, including what the researchers describe as “adaptive lazy observation” and “verbal-diff-based history compression.” The system also employs coarse-grained actions and a technique called best-of-N action selection, a form of test-time scaling that evaluates several candidate next steps before committing to one.
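To make the best-of-N idea concrete, here is a minimal Python sketch of the general technique. It is not the preprint's implementation: the function names `best_of_n`, `toy_propose` and `toy_score` are hypothetical, and the proposer and scorer are toy stand-ins for what would, in a real agent, likely be a language model sampling actions and a separate judge ranking them.

```python
from itertools import cycle
from typing import Callable


def best_of_n(
    propose: Callable[[str], str],
    score: Callable[[str, str], float],
    state: str,
    n: int = 4,
) -> str:
    """Sample n candidate actions for `state` and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [propose(state) for _ in range(n)]
    # Extra compute is spent at decision time (test-time scaling):
    # every candidate is scored before the agent commits to one.
    return max(candidates, key=lambda action: score(state, action))


# Hypothetical toy stand-ins, for illustration only.
_toy_actions = cycle(
    ["click('Cart')", "type('search', 'red shoes')", "click('Checkout')"]
)


def toy_propose(state: str) -> str:
    """Return the next canned action; a real agent would sample from a model."""
    return next(_toy_actions)


def toy_score(state: str, action: str) -> float:
    """Count how many words of the task description appear in the action."""
    return float(sum(word.lower() in action.lower() for word in state.split()))


if __name__ == "__main__":
    print(best_of_n(toy_propose, toy_score, "Find red shoes", n=3))
```

Raising `n` trades more computation per step for a better chance that at least one sampled action is a good one, which is the sense in which the technique scales at test time.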

Enterprise Implications

The research arrives as competition intensifies among major AI companies to build agents capable of operating computers on behalf of users. Anthropic has deployed computer use capabilities within Claude, OpenAI has launched its Operator agent, and Google has been developing Project Mariner for browser-based tasks.

Cotomi Act’s approach of learning from passive observation rather than requiring detailed prompts may offer advantages for enterprise adoption, where employees often perform repetitive browser-based workflows that are difficult to articulate as step-by-step instructions but easy to demonstrate.

The system’s ability to build shared organizational knowledge, in effect a collective memory of how work gets done, sets it apart from agents that operate in isolation. This approach could ease a persistent challenge in enterprise AI deployment: capturing and distributing institutional knowledge that typically resides in individual employees’ habits.

Benchmark Context

WebArena, developed by researchers at Carnegie Mellon University, tests agents on realistic web tasks across multiple sites including e-commerce platforms, content management systems and forums. The benchmark has become a standard measure for browser agent capabilities.

An 80.4% score represents a large advance over early browser agents. The original WebArena paper reported a 14.41% task success rate for its best GPT-4-based agent, against 78.24% for human participants. Recent systems from multiple research groups have since pushed scores considerably higher.

The paper has been posted as a preprint and has not yet undergone formal peer review. The cotomi name is used by NEC for its large language model technology in Japan; the research institution is not identified in the preprint.
