Blog

Twelve Labs: AI that understands videos like humans do

Jun 04, 2024

This piece is part of our Founder Forward interview series, where we talk with the leaders of the startups we’ve partnered with about the technology and market trends driving their businesses. The interview has been edited for clarity.

How Jae Lee put Twelve Labs at the forefront of a new wave: Multimodal AI

Exciting companies such as OpenAI and Anthropic have already ushered in an era of tech disruption that rivals anything that’s come before, and now an emerging generation of AI models understands the world more like humans do. These “multimodal” models can help solve a range of hard problems that have thus far been out of reach, like making sense of video, which constitutes 80% of all the world’s data today. 

We believe the future is multimodal and that Twelve Labs is at the forefront, which is why we’re thrilled to co-lead the company’s Series A with NVentures, Nvidia’s investing arm. In just three years, Twelve Labs has developed the capability to understand what’s happening in videos without any manual logging or relying on transcripts or object-level tags. 

NEA partner Tiffany Luck recently spoke with Twelve Labs co-founder Jae Lee about the company, its vision, and why we’re on the cusp of a “Cambrian explosion” of powerful new AI tools. 

Founder Forward: AI's Great Leap Forward - Part 1

Tiffany: How did you come up with the idea for the company? 

Jae: I started Twelve Labs back in 2021 with four of my best friends. We met at the Korean Cyber Command, which is like Israel's Unit 8200 or U.S. Cyber Command. We were working on video understanding. This was pre-ChatGPT, and research scientists were still figuring out language models. 

The thing about the Korean military is that you don't get to go home. Everyone lives in barracks. And there were just countless nights where we talked about what's possible and what we wanted to do: what problem we wanted to spend the next 10 or 15 years solving. That's where we started brainstorming what Twelve Labs could look like.

Tiffany: Whenever I meet a founder, one of my questions is: Is this a problem they're obsessed with? And when I met you, 100% that came through. Why is video understanding the holy grail to you?

Jae: To get to where we are with language models, we've put in basically the entire text that humanity has ever created. And they're still not able to cross that chasm of reasoning, planning, and thinking like humans do. 

The LLMs of today are trained with an objective of predicting the next word. And it works incredibly well. But for videos, if we take the language model approach, maybe the model can predict the next frame. But the thing is that the next frame looks exactly the same as the current frame. So how far into the future does the model have to actually predict? It turns out that’s a really hard problem without many reference papers. 

Humans are really good at inferring what's going to happen far out into the future. We're not thinking about the next frame. So we felt like the road to generally useful artificial intelligence would need to be able to mimic the learning process that we go through. 

Tiffany: That's actually a really intuitive way to approach it, because if you think about it, all of your memories are kind of a video format.

Jae: For humans, even before we learned the concept of language, we learned about the physical world through what we call sensory input data: by hearing things, touching things, and seeing things. We don't have to read an entire book about why fire is hot. We see fire, it's inherently very intimidating, and if we touch it, we know it's hot.

We thought that if perceptual reasoning is the basis of our intelligence, maybe this can translate to modern AI. You wouldn't think a general intelligence would arrive just by looking at text. But multimodal AI is an AI system that can learn about the physical world through different modes of data, including text, image, and video.

Founder Forward: The Impact of Video AI - Part 2

Tiffany: Video represents about 80% of the world’s data. Five hundred hours of video are uploaded to YouTube every minute. 

Jae: It's insane. We produce this vast amount of video on a daily basis, and we don't know what's in it. The kind of search that YouTube provides is object tagging and transcription-based. Sometimes you have people who just watch videos to produce the kind of text that can help them find a moment later down the road. 

If you're dealing with hundreds of petabytes worth of video, being able to understand your entire archive is a huge problem. For example, if our partners at the NFL want to find a specific touchdown, that's currently not possible with metadata-based search. If you type “touchdown,” you’ll probably get 100,000 results that aren’t relevant to you. Even with object-level tagging, you still have to remember exactly what someone said or did. It’s easier to just have an associate watch the whole game. With our technology, as long as you can describe what you’re looking for, we can find it. 

Let's say you're a news media outlet. Maybe you have 100 years of archival content and no idea what it's about. Our model can watch all of that archival content and tell you what's in it.

Tiffany: There are also more complex problems you could solve by working with other models, right? 

Jae: Yes, for example, take patient monitoring. It’s probably too complex for our model to give detailed answers to medical questions. But we could do that in tandem with a specially trained language model. We can provide details about what we saw happen and the language model can ask specific questions. So if the language model asks “why is the patient tired,” our model can answer that the patient had a lot of visitors and was talking nonstop. That’s context a language model can't acquire by itself.

Founder Forward: The Building Blocks of Video AI - Part 3

Tiffany: What needs to be in place for you to become that building block? 

Jae: I think there are a lot of components that need to be in place to make it happen. To train multimodal AI, you need a lot of GPUs. So you need players within the ecosystem that can help companies orchestrate large numbers of GPUs and train models in a capital-efficient manner.

And then there's a new set of databases that needs to come out. The kind of models that Twelve Labs builds actually produce this thing called embeddings: basically a bunch of numbers that contain all of the semantic information about the data the model saw. Storing and searching those requires a specific type of database that's different from traditional databases. 
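To make the embedding idea concrete, here is a minimal sketch of the kind of similarity search a vector database performs. Everything here is illustrative: the embeddings are random stand-ins (a real system would get them from a multimodal model like Twelve Labs'), and the dimensions and clip counts are arbitrary. The point is only that "search" becomes a nearest-neighbor lookup over vectors rather than a keyword match.

```python
import numpy as np

# Hypothetical data: each row is an embedding summarizing one video clip.
# In a real system these vectors would come from a multimodal model;
# here we use random numbers purely for illustration.
rng = np.random.default_rng(0)
clip_embeddings = rng.normal(size=(1000, 512))  # 1,000 clips, 512 dims each
query_embedding = rng.normal(size=512)          # embedding of a text query

def cosine_similarity(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    vector_norm = vector / np.linalg.norm(vector)
    return matrix_norm @ vector_norm

# Score every clip against the query, then take the five best matches.
scores = cosine_similarity(clip_embeddings, query_embedding)
top5 = np.argsort(scores)[::-1][:5]  # indices of the most similar clips
print(top5)
```

A production vector database does essentially this at scale, using approximate-nearest-neighbor indexes so it doesn't have to score every vector on every query.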

And then we think about how these models interact. We're going to have to have some breakthroughs. Maybe there could be a player that sets a standard, like this is the way a language model and other foundation models should interface. We're seeing great progress in that, but there's just so much more to build.

Tiffany: Jae, I'm so honored that you chose to work with us and we're so excited to be on the journey with you.

Jae: I'm looking forward to it.