What do AI agents actually do?
When OpenAI unveiled ChatGPT in late 2022, it set off the chatbot boom. Then last year, new systems from OpenAI and Anthropic fueled another technological push: so-called AI agents, which can carry out tasks much as a personal digital assistant would.
A San Francisco startup called Arena, which tracks hundreds of thousands of AI users, is now trying to take some of the mystery out of what exactly these digital tasks are.
The company’s service, Agent Mode, showed that over the past few weeks, people were using agents for coding tasks about 17 percent of the time. About 10 percent of the time, the company said, people used agents to do research.
Research agents were closely followed by agents that create images, generate documents such as graphs and tables, or generate ideas. About 5 percent of users turned to agents for creative writing or for tutoring and education. Other uses included chat and code debugging, which is related to software development.
Systems from OpenAI, Anthropic, and others can generate, test, and edit computer code, allowing skilled programmers to automate many tasks they once did themselves. Agents can also spend minutes or even days scouring the wider internet to research specific topics in finance, healthcare, law and virtually any other field.
Some of these tasks overlap with what a chatbot can do. But the main difference with an agent is that it can use other software applications on behalf of users, including spreadsheets, calendars, and email programs.
“The agent can access the internet, search the web, create files and even access other AI models to complete its work,” said Anastasios Angelopoulos, Arena’s chief executive and co-founder.
In Silicon Valley, some people treat these agents almost like employees to whom they can delegate work at any time of day. Many AI researchers, technology executives and scientists believe that agents could soon replace white-collar jobs.
In February, Block, the financial technology company that owns Square, Cash App and Tidal, said it was cutting 40 percent of its workforce as it anticipated the rise of this kind of technology. This was perhaps the most prominent example of a company laying off employees because of what AI may soon be able to do.
The problem is that this digital employee can handle only certain tasks, and it is sometimes less than reliable. Like chatbots, AI agents can make mistakes and exhibit completely unexpected behavior.
These errors can be especially problematic when people use agents to send emails, text messages and other instant messages. For this reason, Arena does not allow the people it tracks to connect its agents to email programs and messaging applications. (The company sells its data and analysis of that data.)
The company also confines agents to a digital “sandbox,” which keeps them from seriously damaging people’s computers. Agents that operate outside a sandbox can accidentally delete files and software applications.
But the company’s service shows how often agents get it wrong. About 8 percent of the time, agents said they had completed a task when they hadn’t, Arena said. Because many tasks build on top of one another, the company added, this kind of “bluff” can carry through the stack and create larger errors.
“The models just say, ‘Yeah, I did this.’ But they lied and didn’t do it,” Mr. Angelopoulos said. “They could say they’ve created a file and then it’s not there.”
Arena also benchmarks technologies offered by OpenAI, Anthropic and other companies. According to data from Arena, the most effective agents are powered by OpenAI’s GPT-5.5 High.
The next most effective technology was Anthropic’s Claude Opus 4.7 Thinking. According to Arena, these two technologies were significantly more effective than those from Google, leading Chinese companies and Elon Musk’s xAI.