A couple of weeks ago, startup CEO Flo Crivello typed a message asking his personal assistant Lindy to change the length of an upcoming meeting from 30 to 45 minutes. Lindy, a software agent that happens to be powered by artificial intelligence, found a dozen or so 30-minute meetings on Crivello’s calendar and promptly extended them all.
“I was like ‘God dammit, she kind of destroyed my calendar,’” Crivello says of the AI agent, which is being developed by his startup, also called Lindy.
Crivello’s company is one of several startups hoping to parlay recent strides in chatbots that produce impressive text into assistants or agents capable of performing useful tasks. The hope is that, within a year or two, these AI agents will routinely help people accomplish everyday chores.
Instead of just offering planning advice for a business trip, as OpenAI’s ChatGPT can today, an agent might also be able to find a suitable flight, book it on a company credit card, and fill out the necessary expense report afterwards.
The catch is that, as Crivello’s calendar mishap illustrates, these agents can become confused in ways that lead to embarrassing, and potentially costly, mistakes. No one wants a personal assistant that books a flight with 12 layovers just because it’s a few dollars cheaper, or schedules them to be in two places at once.
Lindy is currently in private beta, and although Crivello says the calendar issue he ran into has been fixed, the company does not have a firm timeline for releasing a product. Even so, he predicts that agents like his will become ubiquitous before long.
“I'm very optimistic that in, like, two to three years, these models are going to be a hell of a lot more alive,” he says. “AI employees are coming. It might sound like science fiction, but hey, ChatGPT sounds like science fiction.”
The idea of AI helpers that can take actions on your behalf is far from new. Apple’s Siri and Amazon’s Alexa provide a limited and often disappointing version of that dream. But the idea that it might finally be possible to build broadly capable and intelligent AI agents gathered steam among programmers and entrepreneurs following the release of ChatGPT late last year. Some early technical users found that the chatbot could respond to natural language queries with code that could access websites or use APIs to interact with other software or services.
In March, OpenAI announced “plug-ins” that give ChatGPT the ability to execute code and access sites including Expedia, OpenTable, and Instacart. Google said today its chatbot Bard can now access information from other Google services and be asked to do things like summarize a thread in Gmail or find YouTube videos relevant to a particular question.
Some engineers and startup founders have gone further, starting their own projects using large language models, including the one behind ChatGPT, to create AI agents with broader and more advanced capabilities.
After seeing discussion about ChatGPT’s potential to power new AI agents on Twitter earlier this year, programmer Silen Naihin was inspired to join an open source project called Auto-GPT that provides programming tools for building agents. He previously worked on robotic process automation, a less complex way of automating repetitive chores on a PC that is widely used in the IT industry.
Naihin says Auto-GPT can sometimes be remarkably useful. “One in every 20 runs, you'll get something that's like ‘whoa,’” he says. He also admits that it is very much a work in progress. Testing conducted by the Auto-GPT team suggests that AI-powered agents successfully complete a set of standard tasks around 60 percent of the time; the tasks include finding and synthesizing information from the web and locating files on a computer and reading their contents. “It is very unreliable at the moment,” Naihin says of the agent maintained by the Auto-GPT team.
A common problem is an agent trying to achieve a task using an approach that is obviously incorrect to a human, says Merwane Hamadi, another contributor to Auto-GPT, like deciding to hunt for a file on a computer’s hard drive by turning to Google’s web search. “If you ask me to send an email, and I go to Slack, it’s probably not the best,” Hamadi says. With access to a computer or a credit card, Hamadi adds, it would be possible for an AI agent to cause real damage before its user realizes. “Some things are irreversible,” he says.
The Auto-GPT project has collected data showing that AI agents built on top of the project are steadily becoming more capable. Naihin, Hamadi, and other contributors continue to modify Auto-GPT’s code.
Later this month, the project will hold a hackathon offering a $30,000 prize for the best agent built with Auto-GPT. Entrants will be graded on their ability to perform a range of tasks deemed to be representative of day-to-day computer use. One involves searching the web for financial information and then writing a report in a document saved to the hard drive. Another entails coming up with an itinerary for a month-long trip, including details of the necessary tickets to purchase.
Agents will also be given tasks designed to trip them up, like being asked to delete large numbers of files on a computer. In this instance, success requires refusing to carry out the command.
Like the appearance of ChatGPT, progress on creating agents powered by the same underlying technology has triggered some trepidation about safety. Some prominent AI scientists see developing more capable and independent agents as a dangerous path.
Yoshua Bengio, who jointly won the Turing Award for his work on deep learning, which underpins many recent advances in AI, wrote an article in July arguing that AI researchers should avoid building programs with the ability to act autonomously. “As soon as AI systems are given goals—to satisfy our needs—they may create subgoals that are not well-aligned with what we really want and could even become dangerous for humans,” wrote Bengio, a professor at the University of Montreal.
Others believe that agents can be built safely—and that this might serve as a foundation for safer progress in AI altogether. “A really important part of building agents is that we need to build engineering safety into them,” says Kanjun Qiu, CEO of Imbue, a startup in San Francisco working on agents designed to avoid mistakes and ask for help when uncertain. The company announced $200 million in new investment funding this month.
Imbue is developing agents capable of browsing the web or using a computer, but it is also testing out techniques for making them safer with coding tasks. Beyond just generating a solution to a programming problem, the agents will try to judge how confident they are in a solution, and ask for guidance if unsure. “Ideally agents can have a better sense for what is important, what is safe, and when it makes sense to get confirmation from the user,” says Imbue’s CTO, Josh Albrecht.
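The idea of an agent that asks for guidance when unsure can be illustrated with a small Python sketch. This is not Imbue's method, which the company has not described in detail; the `Proposal` type, the `decide` function, and the 0.8 threshold are all hypothetical, and the sketch assumes the agent can attach some self-estimated confidence score to each solution.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A candidate solution plus the agent's own confidence estimate (hypothetical)."""
    solution: str
    confidence: float  # assumed to be a score between 0.0 and 1.0


def decide(proposal: Proposal, threshold: float = 0.8) -> str:
    """Return "apply" if confidence meets the threshold, otherwise "ask_user"."""
    if not 0.0 <= proposal.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "apply" if proposal.confidence >= threshold else "ask_user"


if __name__ == "__main__":
    print(decide(Proposal("rename variable", 0.95)))
    print(decide(Proposal("rewrite billing module", 0.40)))
```

How trustworthy such a self-reported score would be is an open question; the sketch only shows where a check of this kind could sit.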
Celeste Kidd, an assistant professor at UC Berkeley who studies human learning and how it can be mimicked in machines, is an adviser to Imbue. She says it is unclear if AI models trained purely on text or images from the web could learn for themselves how to reason about what they are doing, but that building safeguards on top of the surprising capabilities of systems like ChatGPT makes sense. “Taking what current AI does well (completing programming tasks and engaging in conversations that entail more local forms of logic) and seeing how far you can take that, I think that is very smart,” she says.
The agents that Imbue is building might avoid the kinds of errors that currently plague such systems. Tasked with emailing friends and family with details of an upcoming party, an agent might pause if it notices that the “cc:” field includes several thousand addresses.
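A guard of that kind could be as simple as counting recipients before sending. The Python sketch below is purely illustrative: the function name `needs_confirmation` and the limit of 50 recipients are assumptions, not details from Imbue.

```python
from typing import Iterable


def needs_confirmation(to: Iterable[str], cc: Iterable[str],
                       max_recipients: int = 50) -> bool:
    """Return True when the number of distinct recipients exceeds the limit.

    The limit is an arbitrary, hypothetical default; a real assistant would
    presumably tune it to the user's habits.
    """
    distinct = {addr.strip().lower() for addr in list(to) + list(cc)}
    return len(distinct) > max_recipients


if __name__ == "__main__":
    guests = ["mom@example.com", "sam@example.com"]
    huge_cc = [f"user{i}@example.com" for i in range(3000)]
    print(needs_confirmation(guests, []))       # small list, no pause needed
    print(needs_confirmation(guests, huge_cc))  # thousands of addresses, pause
```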
Predicting how an agent might go off the rails is not always easy, though. Last May, Albrecht asked one agent to solve a tricky mathematical puzzle. Then he logged off for the day.
The following morning, Albrecht checked back, only to find that the agent had become fixated on a particular part of the conundrum, trying endless iterations of an approach that did not work—stuck in something of an infinite loop that might be the AI equivalent of obsessing over a small detail. In the process it ran up several thousand dollars in cloud computing bills.
“We view mistakes as learning opportunities, though it would have been nice to learn this lesson more cheaply,” Albrecht says.
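One common defense against a runaway loop like the one Albrecht describes is a hard cap on both steps and spending. The Python sketch below is a minimal illustration under stated assumptions: `run_with_budget`, `BudgetExceeded`, and the idea that each step reports its own cost in dollars are all hypothetical, and nothing here reflects how Imbue's agents actually work.

```python
from typing import Callable, Tuple


class BudgetExceeded(Exception):
    """Raised when the agent runs out of steps or money before finishing."""


def run_with_budget(step: Callable[[int], Tuple[bool, float]],
                    max_steps: int,
                    max_cost_usd: float) -> Tuple[int, float]:
    """Call step(i) until it reports success or a limit is hit.

    step returns (done, cost_usd) for one attempt. On success, the function
    returns (steps_taken, total_cost); otherwise it raises BudgetExceeded.
    """
    spent = 0.0
    for i in range(max_steps):
        done, cost = step(i)
        spent += cost
        if done:
            return i + 1, spent
        if spent >= max_cost_usd:
            raise BudgetExceeded(f"spent ${spent:.2f} after {i + 1} steps")
    raise BudgetExceeded(f"no solution after {max_steps} steps")


if __name__ == "__main__":
    # A stand-in for an agent stuck on one approach: never done, $1 per try.
    def stuck(_: int) -> Tuple[bool, float]:
        return False, 1.0

    try:
        run_with_budget(stuck, max_steps=10_000, max_cost_usd=25.0)
    except BudgetExceeded as err:
        print("stopped:", err)
```

A cap like this does not make the agent any smarter about the puzzle; it only bounds how expensive the lesson can get.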