A few months ago, when we first started talking about the Science of Responsible Innovation at The Connected Ideas Project, I kept coming back to a simple question:
How do we know?
How do we know whether a technology is actually as powerful—or as dangerous—as we imagine?
How do we know whether our fears are grounded in evidence or in extrapolation?
How do we know whether policy is steering something real, or something hypothetical?
It’s one thing to run a model through an in silico benchmark and watch it ace a virology exam. It’s another thing entirely to put a pipette in a novice’s hand and see what happens in a real lab.
That’s why the recent paper, “Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology,” feels so important.
Not because it proves that AI is safe.
Not because it proves that AI is dangerous.
But because it does something rarer and more valuable: it measures.
And in doing so, it gives us a template for what responsible-by-design evaluation can look like in the age of frontier AI and synthetic biology.
The Gap Between the Benchmark and the Bench
For the last several years, large language models have been climbing biological benchmarks at an astonishing rate. Protocol design. Sequence interpretation. Troubleshooting. Literature synthesis. In some cases, outperforming domain experts on structured tests.
On paper, that looks like capability. And capability, when it intersects with viral reverse genetics or synthetic biology, looks like risk.
But as I’ve discussed in recent work on Violet Teaming—particularly in “The Promise and Peril of Artificial Intelligence — ‘Violet Teaming’ Offers a Balanced Path Forward”—capability is not impact. And risk is not hypothetical power alone. It’s what happens when humans, institutions, and technical systems interact in the real world.
The authors of this new study understood that.
So instead of running another benchmark, they ran a randomized controlled trial. In a real BSL-2 laboratory. With 153 novices. Over eight weeks. Across five hands-on biological tasks modeling a viral reverse genetics workflow.
Not a chatbot demo.
Not a thought experiment.
A physical lab.
That matters.
Because biology isn’t just text. It’s tacit knowledge. It’s sterile technique. It’s muscle memory and timing and pattern recognition. It’s knowing when a cell culture “looks off.” It’s knowing that the protocol you copied from a paper assumes three unstated steps.
Benchmarks rarely capture that.
The study did.
And the results are, in a word, humbling.
What the Study Actually Found
The primary question was straightforward: does access to mid-2025 frontier LLMs significantly increase a novice’s ability to complete a sequence of tasks modeling viral reverse genetics?
The answer, in binary terms, was no.
Completion of the core workflow was low in both groups—LLM-assisted and internet-only—and there was no statistically significant difference in full workflow completion.
If you stop there, you might conclude: the models don’t matter.
But that would be the wrong lesson.
Because the study also found something more subtle—and arguably more important.
Across individual tasks, LLM-assisted participants were more likely to progress further through procedural steps. In cell culture, they completed tasks faster and with fewer attempts. Bayesian modeling suggested a modest uplift—on the order of ~1.4× for a “typical” reverse genetics task—though with uncertainty bounds that rightly temper interpretation.
In other words: not a revolution.
But not nothing.
And this is where responsible innovation becomes interesting.
Why This Is Violet Teaming in Practice
When Adam Russell and I first articulated the idea of Violet Teaming, we described it as the integration of red teaming (adversarial probing), blue teaming (defensive hardening), and ethical design into a proactive, sociotechnical framework.
Most conversations about AI and biosecurity oscillate between red and blue:
Red: “What if this model can design a pathogen?”
Blue: “Let’s add filters, classifiers, restrictions.”
What this study does is different.
It asks: what is the real-world uplift? How much does LLM assistance actually change novice capability in a physical lab? Not in theory. Not in speculation. In practice.
That’s violet.
Because it embeds evaluation into the design and governance process itself.
Instead of arguing over worst-case extrapolations, we now have empirical data about:
Completion rates
Time-to-task
Procedural progression
Human–AI interaction patterns
Elicitation failures
Usage intensity and its (lack of) correlation with success
That last point is particularly striking. Participants who used LLMs more did not necessarily perform better. There was no clean dose–response curve.
That’s not a trivial observation.
It tells us that raw access is not the same as effective amplification. It suggests that prompting skill, interface design, cognitive scaffolding, and user expertise mediate uplift.
And that means risk is not simply a function of model weights. It’s a function of the entire sociotechnical system.
That’s violet territory.
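One way to see why “no clean dose–response curve” matters: if raw usage intensity drove performance, prompt counts and procedural progress would correlate strongly. Here is a minimal sketch of the check one could run, using entirely hypothetical data (not the study’s numbers) generated with no underlying relationship:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-participant data (NOT the study's): prompts issued
# to the LLM vs. procedural steps completed, deliberately generated
# with no underlying dose-response relationship.
prompts = rng.poisson(lam=15, size=80)
steps = rng.integers(0, 10, size=80)

# Pearson correlation between usage intensity and progress.
r = np.corrcoef(prompts, steps)[0, 1]
print(f"usage-success correlation r = {r:.2f}")
```

On data like this, the correlation hovers near zero, which is the signature of access without amplification: more prompting does not, by itself, mean more progress.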
The Most Important Finding: The Gap
To me, the most important result is the documented gap between in silico benchmark performance and physical-world utility.
This is not an indictment of benchmarks. They serve a purpose. But they are not reality.
A model can generate a flawless text protocol for molecular cloning and still fail to help a novice identify the correct reagents from a messy inventory spreadsheet. It can hallucinate a DNA sequence that looks plausible but is wrong in a way a novice cannot detect. It can provide text-based instruction where video-based tacit demonstration might matter more.
In the study, YouTube was often rated as more helpful than any individual LLM.
That’s not because YouTube is smarter. It’s because biology is embodied.
This is precisely the kind of nuance that responsible innovation requires.
Without physical-world validation, we risk building policy on top of performance claims that don’t map cleanly onto human capability.
This study doesn’t close the gap. It reveals it.
And revelation is the first step toward responsibility.
Responsible-by-Design Requires Quantification
One of the themes we’ve explored in the Science of Responsible Innovation is that values without metrics are aspirations. Metrics without values are optimization problems.
We need both.
This study provides something we’ve been missing: a quantifiable baseline for novice uplift in a dual-use biological workflow.
Not a theoretical upper bound.
Not a catastrophic scenario.
An empirical distribution.
The Bayesian estimates even put a 95% credible upper bound around uplift (~2.6×), which matters enormously for policy calibration.
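For readers who want to see how an estimate like that is built, here is a minimal sketch of a Bayesian uplift calculation on invented completion counts (not the study’s data): independent Beta-Binomial posteriors for each arm, with the uplift ratio and its 95% credible interval read directly off the posterior draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task completion counts (NOT the study's data):
# successes out of participants in each arm.
llm_success, llm_n = 28, 76    # LLM-assisted arm
ctrl_success, ctrl_n = 20, 77  # internet-only arm

# Beta(1, 1) prior on each arm's completion probability; the
# posterior is Beta(successes + 1, failures + 1).
draws = 100_000
p_llm = rng.beta(llm_success + 1, llm_n - llm_success + 1, draws)
p_ctrl = rng.beta(ctrl_success + 1, ctrl_n - ctrl_success + 1, draws)

# Posterior samples of the uplift ratio, summarized by the median
# and a 95% credible interval.
uplift = p_llm / p_ctrl
median = np.median(uplift)
lo, hi = np.quantile(uplift, [0.025, 0.975])
print(f"median uplift ~ {median:.2f}x, 95% credible interval ({lo:.2f}, {hi:.2f})")
```

The point of the exercise is the shape of the output, not the numbers: instead of a single worst-case multiplier, you get a distribution with an explicit upper bound, which is what policy calibration can actually use.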
If you’re designing guardrails, export controls, compute thresholds, or deployment policies, you need to know: are we talking about a 10× amplification? A 2× amplification? Or something closer to noise?
This paper suggests modest uplift under the conditions studied.
That doesn’t eliminate risk. It contextualizes it.
And contextualization is the heart of responsible governance.
Where the Study Can Go Next
Now, let’s be honest.
As strong as this study is, it is not the final word. It’s the first serious step.
And if we want this to become an evolving framework for violet teaming and responsible-by-design evaluation, we need to iterate.
Here are several ways I believe the next generation of this work could build on this foundation.
1. Extend the Time Horizon
Eight weeks is meaningful. But complex biological workflows often require longer timeframes for skill acquisition.
Low completion rates may reflect not just capability limits, but time constraints. A longer intervention period could reveal whether modest early procedural uplift compounds into higher eventual completion.
Responsible innovation must account for trajectory, not just snapshot.
2. Integrate End-to-End Workflow
The tasks were decoupled into discrete components. That’s methodologically clean, but real-world risk emerges from integration.
A future iteration could test whether novices can string together multiple steps into a coherent, self-directed project—while still maintaining appropriate biosafety controls.
3. Compare Model Generations Longitudinally
The models tested were mid-2025 frontier systems. Biology-specific models are already emerging.
A longitudinal design—repeating the same protocol annually—would allow us to empirically track uplift curves over time.
That would be invaluable for macrostrategy. Instead of forecasting speculative capability growth, we could measure it.
4. Test Interface Scaffolding
The study hints that elicitation constraints matter. Novices may not know how to ask the right questions.
What happens if we add structured prompting interfaces? Visual overlays? Augmented reality guidance? Automated error-checking layers?
Risk may scale not just with model intelligence, but with integration depth.
5. Incorporate Expert–Novice Comparisons
How much of the gap is due to user expertise? Running parallel cohorts—novices and trained biologists—could quantify differential uplift.
That matters for both workforce development and biosecurity risk modeling.
6. Expand Metrics Beyond Binary Outcomes
The procedural step analysis in this study was a brilliant move. Binary success/failure hides important dynamics.
Future designs could incorporate:
Error rates
Near-miss events
Quality metrics
Safety deviations
Confidence calibration
Responsible innovation isn’t just about “can they finish?” It’s about “how do they behave along the way?”
The Human Story Beneath the Statistics
I keep thinking about the participants in that lab.
Undergraduates. Non-biologists. Humanities majors. Standing in a BSL-2 facility, trying to figure out how to culture HEK293T cells without a mentor leaning over their shoulder.
Some of them prompting an LLM twenty times a day.
Some uploading images.
Some getting frustrated when the model confidently suggests the wrong reagent.
There’s something deeply human in that image.
We talk about AI uplift as if it’s an abstract multiplier. But uplift is experienced as confusion, curiosity, iteration, doubt.
In the study, LLM users’ belief in the helpfulness of LLMs declined over time. Internet users’ belief that LLMs would have helped them increased.
That asymmetry fascinates me.
Expectation versus experience.
Responsible innovation lives in that tension.
Policy Implications Without Panic
The biosecurity conversation around AI has, at times, swung toward extremes. Either “this will democratize bioweapons” or “this is all hype.”
This study offers something more mature.
It suggests:
There is measurable uplift.
It is modest under these conditions.
Benchmarks overstate real-world novice capability.
Physical-world validation is essential.
Risk assessment must be iterative.
For policymakers, that’s gold.
Because it means we can calibrate.
We can avoid overregulation that stifles beneficial AI-driven drug discovery. We can avoid complacency that ignores compounding capability growth. We can build adaptive governance frameworks that update as empirical data evolves.
That’s macrostrategy in action.
From Science to Strategy
The Science of Responsible Innovation is, at its core, about building feedback loops between technological capability, empirical measurement, and governance design.
This paper is a feedback loop.
It transforms speculation into data.
It transforms benchmark scores into behavioral evidence.
It transforms abstract risk into quantified uplift.
And it gives us a repeatable experimental design.
That may be its most important contribution.
Because if we can institutionalize this kind of measurement—regular, transparent, empirically grounded—we can build what I’d call collective epistemic immunity.
Instead of arguing over what might happen, we measure what does happen.
Instead of guessing how much AI amplifies biology, we test it.
Instead of assuming linear growth or exponential catastrophe, we track trajectories.
That’s violet teaming at scale.
A Forward Look
Technology doesn’t stand still. Neither can our evaluation frameworks.
The models will improve. Interfaces will evolve. Users will become more adept. Biology-specific copilots will emerge. Lab automation will integrate with AI in ways that blur the line between text assistance and robotic execution.
The uplift curve will move.
The question is whether our measurement systems move with it.
This study is not the end of the conversation. It’s the beginning of a methodology.
If we take it seriously—if we iterate, refine, expand, and institutionalize this kind of empirical testing—we can build a culture where responsible innovation is not reactive, not rhetorical, but rigorously quantified.
And maybe that’s the quiet revolution here.
Not that AI can or cannot help a novice clone DNA.
But that we finally have a way to measure how much.
And in a world awash in speculation about superintelligence and synthetic pathogens, that simple act—measuring—might be the most responsible thing we can do.
— Titus