Biology is no longer science — it's engineering
Tess van Stekelenburg on AI and the human genome
Evolution is the central progressive force of biological systems. Today, we rely on nature to push evolution forward. But what if we could engineer biology to accelerate evolution? Can we break out of nature’s limitations and undergo millions of years of improvements in a lab or a data center?
The answer is no, at least for now. But the rise of dozens of new large language models across the layers of biological systems is opening up the machinery of biology to human understanding — and tinkering. The destination is a new engineering of life that can improve the health of every person.
Underpinning these biological foundation models is the massive increase in data generated by biological labs all over the world. That started with DNA sequencing data, but it has since expanded to protein folding and, increasingly, protein functions. As we come to understand more of these layers, we are expanding the scope of our tooling to analyze and change biology in reliable ways.
In this interview from 2024, I was joined by Lux Capital’s bio investor Tess van Stekelenburg. Tess and I talked about NVIDIA’s recent forays into biology and why the company at the heart of the current AI renaissance is so keen to engage with the engineering of life. We talked about the new foundation model Evo from the Arc Institute, and then we looked at what new data sets are entering biology and where gaps remain in our quest to engineer life. Finally, we projected forward on where evolution might be taking us in the future.
This interview has been edited for length and clarity. For more of the conversation, listen and subscribe to the Riskgaming podcast.
Danny Crichton:
Recently, the CEO of NVIDIA emphasized how important biological research was to the future of his company. This was striking. We talk about foundation models, we talk about text, we talk about video and the future of screenwriting in Hollywood and all these creative industries. And yet the person at the center of the AI world is thinking about the foundations of the biological world.
Tess van Stekelenburg:
He’s said that biology has the opportunity to become engineering, not just science. The moment it becomes engineering, it’s not just R&D anymore. It’s predictable, and progress could start compounding. Biology starts exponentially improving.
A big thing we’ve already started seeing is an increase in the number of biological foundation models. Evo, for example, is a long-context DNA foundation model. It was released by the Arc Institute and the Stanford Center for Research on Foundation Models.
Danny Crichton:
Yeah, I think it is about 7 billion parameters — one of the largest DNA models we've ever seen. A couple of years ago, there was this crossover moment with AlphaFold from Google’s DeepMind division. It was this huge qualitative shift where we went from not really knowing how proteins fold to all of a sudden solving the problem. We were then able to predict with pretty high accuracy, from an amino acid sequence, what a protein’s structure would look like.
But since then, there have been dozens and dozens and dozens of foundation models, each of which has its own features. So for instance, Evo talks about long context. It gives you a bigger context window than other foundation models. Why does that matter? Why is that an innovation compared to what's come before?
Tess van Stekelenburg:
Genomes are very large. Evo can take about 131,000 tokens into context. A lot of the relevant information is encoded in long-range interactions between a part of the genome that has the regulatory sequences and a part that’s a lot further down. When you have shorter context windows, you’re chopping all of that sequence data up and losing a lot of the relevant information that tells you, for example, how to make a CRISPR system. When you increase the context window, you’re able to start going after these system-wide interactions and can actually access the design of more sophisticated biological functions.
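The point about chopped-up context can be made concrete with a toy sketch. This is not how Evo tokenizes DNA; the sequence, the window sizes, and the REGULATOR/TARGETGENE markers are all invented for illustration. It only shows that a regulatory element and its distant target co-occur in a context window solely when that window is long enough to span the gap between them.

```python
# Toy genome: a regulatory element sits ~2,000 bases upstream of its target.
genome = "A" * 500 + "REGULATOR" + "A" * 2000 + "TARGETGENE" + "A" * 500

def chunk(seq: str, window: int) -> list[str]:
    """Split a sequence into non-overlapping fixed-size context windows."""
    return [seq[i:i + window] for i in range(0, len(seq), window)]

def windows_with_both(seq: str, window: int) -> int:
    """Count windows that contain BOTH the regulator and its target gene."""
    return sum("REGULATOR" in c and "TARGETGENE" in c for c in chunk(seq, window))

# Short windows chop the pair apart; a long window captures the interaction.
print(windows_with_both(genome, 1000))   # 0 — no window sees both elements
print(windows_with_both(genome, 8000))   # 1 — one window spans the whole region
```

The same logic explains why a longer context window lets a model learn system-level machinery like CRISPR loci, whose components are spread across long stretches of sequence.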
When it came out, I got messages from people saying, “Okay, wow, is this the Holy Grail? Have we solved synthetic biology?” I was like, no, this was trained on prokaryote data — basically a lot of the bacteria and microbes we see around everywhere. So it was not trained on human genomes, and it will not be as therapeutically relevant. But the things we’re finding are really the scalpels of evolution. So the tools, the scissors.
Danny Crichton:
We’re about ten years on from the discovery of CRISPR gene editing and about 70 years on from the discovery of DNA’s structure. Now we’re really starting to accelerate, at least in my view, our understanding of all of the ways in which DNA operates on itself. So it self-repairs, it replicates, it copies and interacts with other pieces of DNA. We’ve learned a lot about epigenetics, the idea that gene expression can be regulated without changing the underlying sequence, and we’re trying to get that complex-system science down to a calculable set of numbers.
That’s where I think AI has been so interesting. We’re filling in a lot of the blank areas where scientists have just been groping in the darkness. And so yes, it is just prokaryotes. Yes, it’s not human cells. No, we can’t get to a therapeutic right away. But we’re actually getting something much more important: the base layer of biological sciences. So if you go back to the comment from Jensen Huang, NVIDIA’s CEO, about turning biology into engineering, this is really the first step. You have to actually understand how biology functions.
Tess van Stekelenburg:
I completely agree. I think the biggest shift is that we had sequencing, and then sequencing costs have been falling along these exponential curves. As a result, these databases are growing. That’s databases across genomes, which is what the Evo model has been trained on, and that’s metagenomic databases, which are a huge source of protein sequences.
The way we would look for a lot of these tools before was very manual. You try to see, okay, I know this particular sequence has this function, maybe it cuts something, maybe it binds somewhere. I’m now going to go take this sequence and search all of the databases to find something similar.
With these language models, we’ve been finding that the embeddings they create might actually be better search mechanisms than just sequence alignment.
So you could have something that has almost no sequence homology but deep functional or structural conservation, because evolution has pressured it to keep that structure or that function across a variety of different sequences. So embeddings are actually proving to be a better way to search for new tools.
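A minimal sketch of what embedding-based search buys over raw sequence identity. The sequences and vectors below are made up; the three-dimensional vectors stand in for the high-dimensional embeddings a protein language model would actually produce. Only the ranking mechanics are real: a distant homolog can score near zero on sequence identity yet sit very close in embedding space.

```python
import math

def sequence_identity(a: str, b: str) -> float:
    """Fraction of matching positions (naive, assumes equal-length sequences)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

# Hypothetical query and database entries with invented embeddings.
# "distant_homolog" shares function but almost no residues with the query;
# "lookalike" shares residues but not function.
query = {"seq": "MKTAYIAKQR", "emb": [0.9, 0.1, 0.4]}
database = {
    "distant_homolog": {"seq": "GLSDFWQNPC", "emb": [0.88, 0.12, 0.41]},
    "lookalike":       {"seq": "MKTAYIHKQW", "emb": [0.1, 0.9, 0.2]},
}

for name, entry in database.items():
    ident = sequence_identity(query["seq"], entry["seq"])
    sim = cosine(query["emb"], entry["emb"])
    print(f"{name}: identity={ident:.2f}, embedding similarity={sim:.2f}")
```

Alignment-based search would rank the lookalike first (80% identity versus 0%); the embedding ranking recovers the functional relative instead, which is the behavior Tess describes.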
Danny Crichton:
What you're getting at is the translation problem. We start with this core DNA, these A-T-C-Gs. But just because we have those letters, we still don’t know what they do. We have to actually translate them into chains of amino acids. We have to see how those chains fold into proteins. And then how those proteins interact with this very complex biological system called the human body — or whatever organism we’re studying — and we have to figure out how they affect other proteins and other molecules within that system. So how they bind, where they can bind.
If we think of sequencing as reading, we now have a lot of sophistication around how the body writes code. But the challenge, at least from my perspective, is how does that get translated into actual proteins? Because that is the core engine. We’re not built around DNA, so to speak. We’re built around proteins that come from DNA.
Tess van Stekelenburg:
The way these models are developing is almost at each level: DNA, RNA, proteins and then maybe small-molecule metabolites. What we don’t have is an all-encompassing model where all of these are being integrated and we’re able to switch between layers and really understand their interactivity. That’s a dream down the road.
Danny Crichton:
And why is that? Why don’t we have a magic model?
Tess van Stekelenburg:
Because things break down. A lot of the outputs of these models are just predictions that might not even be physically viable. They can give us an approximation of what some part’s function might be, or what its structure might be. But it’s not the ground truth. And so we need to get to a point where at least the predictive power gets even better.
But it’s a complex system. If you get down to the nitty-gritty, there’s actual biophysics: where enzymes move and how much space they take up. And if one catalyzes a particular substrate and has a product, that product might inhibit something else from being catalyzed. Just because you could predict one reaction doesn’t mean you can predict all of the different downstream ramifications.
But eventually we will be seeing a DNA model pre-trained on genome sequences interact with a model pre-trained on RNA. Then we can use those two to maybe design better CRISPR-Cas9 systems or optimize the guide RNA. But I think we’ll see those models interacting with each other first rather than having one big holistic model at the start.
Danny Crichton:
So now we have AlphaFold, which came out with hundreds of thousands of protein structure predictions. It got a lot of them right, but it’s not perfect, as you’ve pointed out. And so we’re able to improve that data over time.
But where are there gaps in the data today?
Tess van Stekelenburg:
On the level of protein functional screens, the number of screens we would need, given the diversity of proteins that exist, is just massive. And that’s where the cost comes in, because it’s not standardized. If you want to understand a protein’s stability, its binding properties, or whether it’s going to catalyze a particular reaction — all of these are independent functions that need their own screens.
Danny Crichton:
And the complexity here is that proteins can do multiple things, and they interact with each other. They can co-regulate each other. The statistical power to actually be able to make a proper prediction requires so much more information than at these other layers.
So those are the inputs. But let’s talk about outputs. As we get to outputs, we’re starting to talk about interpretations, and how useful are these interpretations. Are they helping us? How much are they moving biological sciences forward? Linearly? Exponentially? Or is it actually slowing down our performance as we’re overwhelmed?
Tess van Stekelenburg:
A couple of years ago, it was definitely in that stage where it was incomprehensible. We had sequencing, the cost curve was going down. But you could say we did genomics wrong. We were looking at single nucleotide polymorphisms and candidate genes. The real statistical power only came once we were able to put these biobanks into transformer architectures. As humans, I don’t think we can fully understand all the relationships and covariances that exist.
These technologies have improved our ability to use and design and predict, but as a result, we understand less why that’s the case. If I have a new protein sequence that has been given to me by a model that I’m using in my browser, which I’ve designed because I want it to catalyze or bind to something, I might not understand what the actual dynamics were that enabled it to bind better.
But I know that it could be a better prediction than anything I would’ve come up with.
Danny Crichton:
So it’s still a black box.
When I think about engineering, and we’re talking about this conversion of biological science into engineering, it’s not going to be civil engineering, where you have physics, you have statics, you have concrete, you can build a bridge and understand exactly how it all works.
In biology, engineering is going to look very different because we are relying on these tools that are black boxes. We have a sense that they generally work, they will come back with the correct answer in most cases. In other cases, they can prioritize. So maybe we don’t know what the right answer is, but it’s one of 20. Instead of looking at a set of a million different proteins or whatever the case may be, one of these 20 is the right answer. So you can get a massive project down to a small amount of work that a human can actually do.
But as we start to think about engineering biological life, a new stage of drug discovery and biological advancement, we are going to have to get comfortable with the idea that we understand the basic principles here. We understand how it all connects. But we are ultimately still reliant on these AI models to figure out the details. We are going to know what happens below them and we’re going to know what happens above them. But what happens in the middle is a huge open question.
In my opinion, I don’t think we’re going to have an answer. The good news is we don’t really need one. We don’t necessarily have to have every molecule in the body figured out to be able to solve challenging biological problems.
Tess van Stekelenburg:
You could say the model learned something about evolution when it trained on all these sequences. It is learning some type of pattern about how these sequences have evolved and what their function might be. And so it could just be extracting a lot of that latent data on what evolution is. But it’s going to be hard to fully understand that.
Where I get really excited is being able to break out of evolution. So opening up the design space beyond what we’ve seen and beyond all of the samples that have existed and really going into the possible combinations that lead to a function we care about that does not yet exist on Earth. And if those are physically viable, it allows us to actually accelerate effective evolution without having to wait for a mutation to pass on to the next generation.
Danny Crichton:
I’m going to create my own central dogma, my own synthesis, which is we’re going to intelligently design evolution going forward, and we’re going to end one of the great cultural debates of the 20th century by intelligently designing the future of evolution.
There are not enough NVIDIA compute chips in the world to get there right now, but we are getting close. They’re coming fast and furious.