Jun 04, 2024
Exciting companies such as OpenAI and Anthropic have already ushered in an era of tech disruption that rivals anything that’s come before, and now an emerging generation of AI models understands the world more like humans do. These “multimodal” models can help solve a range of hard problems that have thus far been out of reach, like making sense of video, which constitutes 80% of all the world’s data today.
We believe the future is multimodal and that Twelve Labs is at the forefront, which is why we’re thrilled to co-lead the company’s Series A with NVentures, Nvidia’s investing arm. In just three years, Twelve Labs has developed the capability to understand what’s happening in videos without any manual logging or relying on transcripts or object-level tags.
NEA partner Tiffany Luck recently spoke with Twelve Labs co-founder Jae Lee about the company, its vision, and why we’re on the cusp of a “Cambrian explosion” of powerful new AI tools.
[Jae] I started Twelve Labs back in 2021 with four of my best friends. We met at the Korean Cyber Command, which is like Israel's Unit 8200 or U.S. Cyber Command. We were working on video understanding. This was pre-ChatGPT, and research scientists were still figuring out language models.
The thing about the Korean military is that you don't get to go home. Everyone lives in barracks. And there were countless nights where we talked about what's possible and what we wanted to do, and what problem we wanted to spend the next 10 or 15 years solving. That's where we started brainstorming what Twelve Labs could look like.
[Jae] To get to where we are with language models, we've put in essentially the entire body of text humanity has ever created. And it's still not able to cross that chasm of reasoning, planning, and thinking the way humans do.
The LLMs of today are trained with the objective of predicting the next word, and it works incredibly well. But for videos, if we take the language model approach, maybe the model can predict the next frame. The thing is, the next frame looks almost exactly the same as the current frame. So how far into the future does the model actually have to predict? It turns out that's a really hard problem without many reference papers.
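For readers who want a concrete picture of the objective Jae describes, here is a minimal PyTorch sketch of next-word prediction; the names and shapes are illustrative, not Twelve Labs' code. The video analogue would swap tokens for frames, which is where the difficulty he mentions shows up.

```python
# Minimal sketch of the next-token prediction objective behind today's LLMs.
# All names and shapes are illustrative.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the model's prediction at each position
    and the token that actually comes next."""
    # logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len)
    pred = logits[:, :-1, :]    # predictions for positions 0 .. T-2
    target = tokens[:, 1:]      # the "next word" at positions 1 .. T-1
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```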
Humans are really good at inferring what's going to happen far out into the future. We're not thinking about the next frame. So we felt like the road to generally useful artificial intelligence would need to be able to mimic the learning process that we go through.
[Jae] For humans, even before we learn the concept of language, we learn about the physical world through what we call sensory input data: hearing things, touching things, and seeing things. We don't have to read an entire book about why fire is hot. We see fire, it's inherently very intimidating, and if we touch it, we know it's hot.
We thought that if perceptual reasoning is the basis of our intelligence, maybe this can translate to modern AI. You wouldn't think a general intelligence would arrive just by looking at text. But multimodal AI is an AI system that can learn about the physical world through different modes of data, including text, image, and video.
[Jae] It's insane. We produce a vast amount of video every day, and we don't know what's in it. The kind of search YouTube provides is based on object tagging and transcription. Sometimes you have people watching video just to produce the kind of text that will help them find a moment later down the road.
If you're dealing with hundreds of petabytes worth of video, being able to understand your entire archive is a huge problem. For example, if our partners at the NFL want to find a specific touchdown, that's currently not possible with metadata-based search. If you type “touchdown,” you’ll probably get 100,000 results that aren’t relevant to you. Even with object-level tagging, you still have to remember exactly what someone said or did. It’s easier to just have an associate watch the whole game. With our technology, as long as you can describe what you’re looking for, we can find it.
Let's say you're a news media outlet. Maybe you have 100 years of archival content and no idea what it's about. Our model can watch all of that archival content and tell you what's in it.
[Jae] Yes. Take patient monitoring, for example. Medical questions are probably too complex for our model to answer in detail on its own. But we could do that in tandem with a specially trained language model. We can provide details about what we saw happen, and the language model can ask specific questions. So if the language model asks “why is the patient tired,” our model can answer that the patient had a lot of visitors and was talking nonstop. That’s context a language model can't acquire by itself.
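A rough sketch of the kind of tandem loop Jae describes, where a language model poses follow-up questions and a video-understanding model answers them from what it has seen. Both callables below (ask_llm, query_video_model) are hypothetical stand-ins, not real APIs.

```python
# Hypothetical sketch of an LLM driving a video-understanding model with
# follow-up questions, as in the patient-monitoring example above.
def investigate(observation_summary: str, goal: str, ask_llm, query_video_model,
                max_turns: int = 3) -> str:
    context = [f"Initial video summary: {observation_summary}"]
    for _ in range(max_turns):
        # The LLM decides what it still needs to know (e.g. "Why is the patient tired?").
        question = ask_llm(
            "Given the goal and context below, ask one question the video model "
            f"should answer, or say DONE.\nGoal: {goal}\nContext: {context}"
        )
        if question.strip() == "DONE":
            break
        # The video model answers from the footage (e.g. "The patient had many
        # visitors and was talking nonstop.").
        answer = query_video_model(question)
        context.append(f"Q: {question}\nA: {answer}")
    return ask_llm(f"Summarize an answer to the goal from: {context}")
```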
[Jae] I think there are a lot of components that need to be in place to make it happen. To train multimodal AI, you need a lot of GPUs. So within the ecosystem, you need players that can help companies orchestrate large numbers of GPUs and train models in a capital-efficient manner.
And then there's a new set of databases that needs to come out. The kind of models Twelve Labs builds produce embeddings, basically long lists of numbers that contain the semantic information about the data the model saw. Storing and searching those embeddings requires a specific type of database that's different from traditional databases.
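As a concrete illustration of what such a database does, here is a minimal sketch of similarity search over embeddings. The class and names are illustrative; a production system would add approximate-nearest-neighbor indexing rather than the brute-force scan shown here.

```python
# Minimal illustration of the core operation an embedding (vector) database
# is built around: find the stored items most similar to a query embedding.
import numpy as np

class TinyVectorStore:
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.payloads: list[str] = []   # e.g. clip IDs or timestamps

    def add(self, embedding: np.ndarray, payload: str) -> None:
        v = embedding / np.linalg.norm(embedding)   # normalize for cosine similarity
        self.vectors = np.vstack([self.vectors, v.astype(np.float32)])
        self.payloads.append(payload)

    def query(self, embedding: np.ndarray, top_k: int = 3):
        q = embedding / np.linalg.norm(embedding)
        scores = self.vectors @ q                   # cosine similarity to every item
        best = np.argsort(-scores)[:top_k]
        return [(self.payloads[i], float(scores[i])) for i in best]
```

The design point is that queries are by geometric closeness in embedding space rather than by exact keyword match, which is why traditional databases are a poor fit.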
And then we think about how these models interact. We're going to have to have some breakthroughs there. Maybe a player will emerge that sets a standard for how a language model and other foundation models should interface. We're seeing great progress on that front, but there's just so much more to build.
[Jae] I'm looking forward to it.
DISCLAIMER
The information provided in these videos is for educational and informational purposes only and is not intended to be an offer of securities, investments, investment advice or recommendations. New Enterprise Associates (NEA) is a registered investment adviser with the Securities and Exchange Commission (SEC). However, nothing in this video should be interpreted to suggest that the SEC has endorsed or approved the contents of the video. Any offering of securities by NEA is restricted to qualified investors and is made pursuant to offering documents that contain important disclosures concerning risk, fees, conflicts of interest, and other important information. The companies featured or referenced in the video are not compensated, directly or indirectly, by NEA for appearing in this video and may be portfolio companies NEA has invested in through funds managed by NEA and its affiliates.
NEA makes no assurance that investment results obtained historically can be obtained in the future, or that any investments managed by NEA will be profitable. The companies featured in the video are not a representative sample of all current or former NEA portfolio companies. Viewers of the information contained in the video should consult their own legal, tax, and financial advisers because the contents are not intended by NEA to be used as part of the investment decision making process related to any investment managed by NEA. NEA has no obligation to update, modify, or amend the contents of this video nor to notify readers in the event that any information, opinion, forecast or estimate changes or subsequently becomes inaccurate or outdated. In addition, certain information contained herein has been obtained from third-party sources and has not been independently verified by NEA. Any statements made by founders, investors, portfolio companies, or others in the video or on other third-party websites referencing this video are their own, and are not intended to be an endorsement of the investment advisory services offered by NEA.