After weeks of buzz, OpenAI has released Operator, its first AI agent. Operator is a web app that can carry out simple online tasks in a browser, such as booking concert tickets or filling an online grocery order. The app is powered by a new model called Computer-Using Agent—CUA (“coo-ah”), for short—built on top of OpenAI’s multimodal large language model GPT-4o.
Operator is available today at operator.chatgpt.com to anyone signed up with ChatGPT Pro, OpenAI’s premium $200-a-month service. The company says it plans to roll the tool out to other users in the future.
OpenAI claims that Operator outperforms similar rival tools, including Anthropic’s Computer Use (a version of Claude 3.5 Sonnet that can carry out simple tasks on a computer) and Google DeepMind’s Mariner (a web-browsing agent built on top of Gemini 2.0).
The fact that three of the world’s top AI firms have converged on the same vision of what agent-based models could be makes one thing clear. The battle for AI supremacy has a new frontier—and it’s our computer screens.
“Moving from generating text and images to doing things is the right direction,” says Ali Farhadi, CEO of the Allen Institute for AI (AI2). “It unlocks business, solves new problems.”
Farhadi thinks that doing things on a computer screen is a natural first step for agents: “It is constrained enough that the current state of the technology can actually work,” he says. “At the same time, it’s impactful enough that people might use it.” (AI2 is working on its own computer-using agent, says Farhadi.)
Don’t believe the hype
OpenAI’s announcement also confirms one of two rumors that circulated on the internet this week. One predicted that OpenAI was about to reveal an agent-based app, after details about Operator were leaked on social media ahead of its release. The other predicted that OpenAI was about to reveal a new superintelligence, and that officials for newly inaugurated President Trump would be briefed on it.
Could the two rumors be linked? OpenAI superfans wanted to know.
Nope. OpenAI gave MIT Technology Review a preview of Operator in action yesterday. The tool is an exciting glimpse of large language models’ potential to do a lot more than answer questions. But Operator is a work in progress. “It’s still early, it still makes mistakes,” says Yash Kumar, a researcher at OpenAI.
(As for the wild superintelligence rumors, let’s leave that to OpenAI CEO Sam Altman to address: “twitter hype is out of control again,” he posted on January 20. “pls chill and cut your expectations 100x!”)
Like Anthropic’s Computer Use and Google DeepMind’s Mariner, Operator takes screenshots of a computer screen and scans the pixels to figure out what actions it can take. CUA, the model behind it, is trained to interact with the same graphical user interfaces—buttons, text boxes, menus—that people use when they do things online. It scans the screen, takes an action, scans the screen again, takes another action, and so on. That lets the model carry out tasks on most websites that a person can use.
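To make that scan-act-scan cycle concrete, here is a minimal sketch in Python. It is not OpenAI's code: the names `run_agent`, `Action`, `ToyBrowser`, and `toy_policy` are hypothetical, and the "screenshot" is a plain string standing in for pixels so that the example runs on its own. In a real agent, the model would take the place of the policy function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str         # "click", "type", or "done" (hypothetical action names)
    target: str = ""  # label of the on-screen element the action applies to


def run_agent(observe: Callable[[], str],
              choose: Callable[[str], Action],
              act: Callable[[Action], None],
              max_steps: int = 20) -> List[Action]:
    """Repeat: look at the screen, pick an action, apply it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = observe()       # a real agent would capture pixels here
        action = choose(screen)  # a real agent would query the model here
        if action.kind == "done":
            break
        act(action)
        history.append(action)
    return history


class ToyBrowser:
    """A three-page stand-in for a website (an assumption for this demo)."""

    def __init__(self) -> None:
        self.page = "search"

    def screenshot(self) -> str:
        return self.page

    def apply(self, action: Action) -> None:
        if self.page == "search" and action.kind == "type":
            self.page = "results"
        elif self.page == "results" and action.kind == "click":
            self.page = "confirmation"


def toy_policy(screen: str) -> Action:
    """Hard-coded rules in place of a trained model."""
    if screen == "search":
        return Action("type", "search box")
    if screen == "results":
        return Action("click", "first result")
    return Action("done")


browser = ToyBrowser()
steps = run_agent(browser.screenshot, toy_policy, browser.apply)
print([f"{a.kind}:{a.target}" for a in steps])
```

The loop only ever sees what `observe` returns, the same view of the page a person would have. The `max_steps` cap is a design choice of this sketch to keep a confused agent from looping forever.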
“Traditionally the way models have used software is through specialized APIs,” says Reiichiro Nakano, a scientist at OpenAI. (An API, or application programming interface, is a piece of code that acts as a kind of connector, allowing different bits of software to be hooked up to one another.) That puts a lot of apps and most websites off limits, he says: “But if you create a model that can use the same interface that humans use on a daily basis, it opens up a whole new range of software that was previously inaccessible.”
CUA also breaks tasks down into smaller steps and tries to work through them one by one, backtracking when it gets stuck. OpenAI says CUA was trained with techniques similar to those used for its so-called reasoning models, o1 and o3.
OpenAI has tested CUA against a number of industry benchmarks designed to assess the ability of an agent to carry out tasks on a computer. The company claims that its model beats Computer Use and Mariner in all of them.
For example, on OSWorld, which tests how well an agent performs tasks such as merging PDF files or manipulating an image, CUA scores 38.1% to Computer Use’s 22.0%. In comparison, humans score 72.4%. On a benchmark called WebVoyager, which tests how well an agent performs tasks in a browser, CUA scores 87%, Mariner 83.5%, and Computer Use 56%. (Mariner can only carry out tasks in a browser and therefore does not score on OSWorld.)
For now, Operator can also only carry out tasks in a browser. OpenAI plans to make CUA’s wider abilities available in the future via an API that other developers can use to build their own apps. This is how Anthropic released Computer Use in December.
OpenAI says it has tested CUA’s safety, using red teams to explore what happens when users ask it to do unacceptable tasks (such as research how to make a bioweapon), when websites contain hidden instructions designed to derail it, and when the model itself breaks down. “We’ve trained the model to stop and ask the user for information before doing anything with external side effects,” says Casey Chu, another researcher on the team.
Look! No hands
To use Operator, you simply type instructions into a text box. But instead of calling up the browser on your computer, Operator sends your instructions to a remote browser running on an OpenAI server. OpenAI claims that this makes the system more efficient. It’s another key difference between Operator, Computer Use, and Mariner (which runs inside Google’s Chrome browser on your own computer).
Because it’s running in the cloud, Operator can carry out multiple tasks at once, says Kumar. In the live demo, he asked Operator to use OpenTable to book him a table for two at 6.30 p.m. at a restaurant called Octavia in San Francisco. Straight away, Operator opened up OpenTable and started clicking through options. “As you can see, my hands are off the keyboard,” he said.
OpenAI is collaborating with a number of businesses, including OpenTable, StubHub, Instacart, DoorDash, and Uber. The nature of those collaborations is not exactly clear, but Operator appears to suggest preset websites to use for certain tasks.
While the tool navigated dropdowns on OpenTable, Kumar sent Operator off to find four tickets for a Kendrick Lamar show on StubHub. While it did that, he pasted a photo of a handwritten shopping list and asked Operator to add the items to his Instacart.
He waited, flicking between Operator’s tabs. “If it needs help or if it needs confirmations, it’ll come back to you with questions and you can answer it,” he said.
Kumar says he has been using Operator at home. It helps him stay on top of grocery shopping: “I can just quickly click a photo of a list and send it to work,” he says.
It’s also become a sidekick in his personal life. “I have a date night every Thursday,” says Kumar. So every Thursday morning, he instructs Operator to send him a list of five restaurants that have a table for two that evening. “Of course, I could do that, but it takes me 10 minutes,” he says. “And I often forget to do it. With Operator, I can run the task with one click. There’s no burden of booking.”