Technology · October 8, 2024

Forget chat. AI that can hear, see and click is already here

This story originally appeared in The Algorithm, our weekly newsletter on AI.

Chatting with an AI chatbot is so 2022. The latest hot AI toys take advantage of multimodal models, which can handle several types of input at the same time, such as images, audio, and text.

Exhibit A: Google’s NotebookLM. NotebookLM is a research tool the company launched with little fanfare a year ago. A few weeks ago, Google added an AI podcasting tool called Audio Overview to NotebookLM, which allows users to create podcasts about anything. Add a link to, for example, your LinkedIn profile, and the AI podcast hosts will boost your ego for nine minutes. The feature has become a surprise viral hit. I have written about the weird and amazing ways people are using it.

To give you a taste, I created a podcast of our 125th-anniversary magazine issue. The AI does a great job of picking some highlights from the magazine and giving you the gist of what they are about.

Multimodal generative content has also become markedly better in a very short time. In September 2022, I covered Meta’s first text-to-video model, Make-A-Video. Next to today’s technology, those videos look clunky and silly. Meta just announced its competitor to OpenAI’s Sora, called Movie Gen. The tool allows users to use text prompts to create custom videos and sounds, edit existing videos, and make images into videos.

The way we interact with AI systems is also changing, becoming less reliant on text. OpenAI’s new Canvas interface allows users to collaborate on projects with ChatGPT. Instead of relying on a traditional chat window, which requires users to do several rounds of prompting and regenerating text to get the desired result, Canvas allows people to select bits of text or code to edit. 

Even search is getting a multimodal upgrade. In addition to inserting ads into AI overviews, Google has rolled out a new feature where users can upload a video and use their voice to search for things. In a demo at Google I/O, the company showed how you can open the Google Lens app, take a video of fish swimming in an aquarium, and ask a question about them. Google’s Gemini model will then search the web and offer you an answer in the form of Google’s AI summary. 

What unites these features is a more interactive, customizable interface and the ability to apply AI tools to lots of different types of source material. NotebookLM was the first AI product in a while that brought me wonder and delight, partly because of how different, realistic, and unexpected the AI voices were. But the fact that NotebookLM’s Audio Overviews became a hit despite being a side feature hidden inside a bigger product just goes to show that AI developers don’t really know what they are doing. Hard to believe now, but ChatGPT itself was an unexpected hit for OpenAI.

We are a couple of years into the multibillion-dollar generative AI boom. The huge investment in AI has contributed to rapid improvement in the quality of the resulting content. But we’ve yet to see a killer app, and these new multimodal applications are a result of the immense pressure AI companies are under to make money and deliver. Tech companies are throwing different AI tools at people and seeing what sticks. 


Now read the rest of The Algorithm

Deeper Learning

AI-generated images can teach robots how to act

Image-generating AI models have been used to create training data for robots. The new system, called Genima, fine-tunes the image-generating AI model Stable Diffusion to draw robots’ movements, helping guide them both in simulations and in the real world.

What’s the big deal: Genima could make it easier to train different types of robots to complete tasks, with machines ranging from mechanical arms to humanoid robots and driverless cars. It could also help make AI web agents, a next generation of AI tools that can carry out complex tasks with little supervision, better at scrolling and clicking. Read more from Rhiannon Williams.

Bits and Bytes

This startup uses AI to detect wildfires 
Our 2024 list of Climate Tech Companies to Watch is here! One company on the list is Pano AI, which uses computer vision and ultra-high-definition cameras to alert firefighters to new blazes. (MIT Technology Review)

How Sam Altman concentrated power in his own hands
And then there was one. With OpenAI now valued at $157 billion, Bloomberg details how the company lost most of its top executives and shifted to an Altman-led profit-making monster. (Bloomberg)

Eight scientists, a billion dollars, and the moonshot agency trying to make Britain great again
A nice profile of the UK’s new Advanced Research and Invention Agency, or ARIA. The agency is the UK’s answer to DARPA in the US. It is funding projects such as Turing Award winner Yoshua Bengio’s project to prevent AI catastrophes. (Wired)

Why women in tech are sounding an alarm
Tech’s AI mania is encouraging the field to backtrack on years of diversity and inclusion efforts, at the expense of women. (The Information)
