by Niko McCarty & Morgan Cheatham

Protecting humanity without diminishing progress in biotechnology

In April, the Biden administration released a Framework for Nucleic Acid Synthesis Screening that will, over the next six months, begin requiring recipients of federal research funds to buy synthetic DNA solely from providers that implement DNA screening procedures. Those providers must publicly attest, on their websites, that they follow the rules established in the Framework; they must also screen all orders for “sequences of concern,” and then “screen customers” who request any potentially dangerous sequence. The following week, the International Biosecurity and Biosafety Initiative for Science (IBBIS) released a free sequence- and customer-screening tool for DNA synthesis companies.
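The two-step workflow the Framework describes, flagging orders that contain sequences of concern and then vetting the customers behind flagged orders, can be sketched in a few lines. This is a minimal illustration only: the hazard list, window size, and function names below are hypothetical placeholders, not the Framework's rules or the IBBIS tool's actual logic.

```python
# Minimal sketch of a two-step synthesis-screening workflow.
# HAZARD_KMERS stands in for a curated database of sequences of
# concern; real screening tools compare much longer windows and
# use far more sophisticated matching.

HAZARD_KMERS = {"ATGCGTACCGGA"}  # hypothetical entry, 12 bases
K = 12                           # matching window size (illustrative)

def flags_sequence(order_seq: str, k: int = K) -> bool:
    """Step 1: screen the ordered sequence against known sequences of concern."""
    windows = {order_seq[i:i + k] for i in range(len(order_seq) - k + 1)}
    return bool(windows & HAZARD_KMERS)

def process_order(order_seq: str, customer_is_verified: bool) -> str:
    """Step 2: flagged sequences ship only to vetted customers."""
    if not flags_sequence(order_seq):
        return "ship"
    return "ship" if customer_is_verified else "escalate for review"
```

For example, an order containing the hazardous 12-mer from an unvetted customer is escalated for review, while the same order from a vetted customer ships.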

DNA screening has long been the poster child for biosecurity, but it is just one piece of a far broader field. Safeguarding humanity against bioweapons, lab leaks, and pandemics demands not only DNA screening but also technologies to rapidly detect, delay, and defend against these threats. In this section of the report, we feature research studies that propose strategies to block large language models (LLMs) from divulging instructions for synthesizing dangerous toxins, and we highlight new machines that can detect and quantify airborne pathogens in real time. These developments, and many others, could make biotechnology safer without impeding progress.

Paper one

Characterizing Private-Sector Research on Human Pathogens in the United States

by Rocco Casagrande et al.


An estimated one-quarter of all human pathogen research in the U.S. occurs in the private sector, which faces less oversight than government and academic research. The private sector accounts for over $1 billion in annual funding, and 7.7 percent of publications with potential dual-use applications have private-sector authors, according to this report, which analyzed 105 non-profit and for-profit institutions.

Methods and results

This paper characterizes the size and biosafety practices of human pathogen research carried out in the U.S. private sector. The authors analyzed publications, funding data, and sales of research materials, and conducted phone interviews with many of the organizations. They found that 25% of human pathogen research is conducted within the private sector. Of the 42,175 publications screened, 3.6% had authors from the private sector. Private funders also contribute 24% ($1.2 billion) of the estimated $5.1 billion in total annual funding for human pathogen research, and between 25 and 70% of sales from major research material suppliers, including ATCC, go to private-sector customers. This matters because private research centers are not subject to the same regulatory oversight as academic universities, as many of them don't receive federal dollars. The study identified 86 for-profit companies and 19 non-profits performing human pathogen research, and 32 of the for-profits received no federal funding. The biggest finding: 7.7% of 994 publications with dual-use potential had private-sector authors, suggesting that significant dual-use research may be occurring without oversight.

Paper two

Benchtop DNA Synthesis Devices: Capabilities, Biosecurity Implications, and Governance

by Sarah R. Carter, Jaime M. Yassif, Christopher R. Isaac


Benchtop DNA synthesis devices are here. It's now possible to print DNA sequences on demand, without ordering them from a company, so it's important to find ways to screen the sequences such devices print and ensure they are not dangerous. This report estimates that benchtop machines could, within the next 2 to 5 years, be used to make DNA fragments up to 7,000 base pairs long, lowering barriers to engineering potentially dangerous pathogens. Device manufacturers should implement rigorous customer and sequence screening to prevent misuse, and governments should issue guidelines requiring such screening within the next two years.

Methods and results

This report from the Nuclear Threat Initiative draws on interviews with 30 experts in benchtop DNA synthesis, synthetic biology, virology, and biosecurity. The experts' main takeaway: benchtop DNA synthesis devices will likely be able to reliably produce double-stranded DNA fragments up to approximately 7,000 base pairs within the next 2 to 5 years. Several viral genomes are shorter than this, and the concern is that such long DNA fragments could be stitched together into the genomes of dangerous human pathogens.

The authors make a few recommendations: Device manufacturers should implement rigorous customer and sequence screening practices, consistent with guidelines for traditional DNA synthesis companies, and governments should establish regulations requiring such screening within 2 years for devices producing DNA fragments over 200 base pairs.

Paper three

Autonomous chemical research with large language models

by Daniil A. Boiko, Robert MacKnight, Ben Kline & Gabe Gomes


Artificial intelligence (AI) agents could semi-autonomously search the scientific literature, plan experiments, and then perform those experiments using liquid-handling robots. This study reports such a system, called Coscientist, which uses GPT-4 and an Opentrons device to plan and execute the synthesis of diverse chemical compounds.

Methods and results

Coscientist merges GPT-4 with an Opentrons liquid-handling robot. The GPT-4 component, integrated with a web browser, searches the web, accesses papers, and writes its own liquid-handling control scripts; the Opentrons machine then runs the experiments. Coscientist successfully planned synthesis reactions for acetaminophen, aspirin, nitroaniline, and phenolphthalein without making mistakes. It also correctly interpreted Opentrons liquid-handling API documentation to perform color-filling protocols on well plates and to identify colored solutions from spectroscopic data. It then designed and executed Suzuki and Sonogashira reactions on robotic equipment, with 92% and 67% yields, respectively. Notably, GPT-3.5 was unable to plan similar synthesis reactions.
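The architecture described above, a language model that reads documentation and emits robot-control commands which an executor then runs, can be sketched schematically. Everything below is a stubbed illustration under our own assumptions, not Coscientist's actual code: the real system prompts GPT-4 and drives Opentrons hardware, while the stubs here just return canned values.

```python
# Schematic of an LLM-planner -> robot-executor loop, with stubs
# standing in for GPT-4 and the Opentrons robot. All names are
# illustrative, not part of the paper's implementation.

def stub_llm_plan(goal: str, api_docs: str) -> list[str]:
    """Stand-in for the LLM: turn a goal plus API docs into robot commands.
    A real planner would prompt GPT-4 with `api_docs` and parse its reply."""
    return [
        "pick_up_tip",
        f"transfer reagent_A -> well_A1  # toward goal: {goal}",
        "drop_tip",
    ]

def stub_robot_execute(commands: list[str]) -> list[str]:
    """Stand-in for the liquid handler: run each command, return a log."""
    return [f"executed: {cmd}" for cmd in commands]

def run_experiment(goal: str, api_docs: str) -> list[str]:
    plan = stub_llm_plan(goal, api_docs)       # LLM writes the protocol
    return stub_robot_execute(plan)            # robot carries it out

log = run_experiment("synthesize aspirin", "(liquid-handler API reference)")
```

The design point the paper makes is that the model, not a human, closes this loop: it reads the API documentation and generates the control script itself.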

Paper four

Real-time environmental surveillance of SARS-CoV-2 aerosols

by Joseph V. Puthussery et al. 


During the COVID-19 pandemic, diagnosis relied on slow, labor-intensive tests. Real-time pathogen sensors could speed up diagnoses in future outbreaks, enabling people to quarantine at home before symptoms appear. The pathogen Air Quality (pAQ) monitor is a real-time detector that pairs a wet cyclone sampler with a biosensor to quantify airborne SARS-CoV-2 particles every five minutes. The pAQ monitor can detect between 7 and 35 RNA copies of virus per cubic meter of air, enabling detection even when virus shedding is low.

Methods and results

Testing a patient for COVID-19 often takes hours and requires skilled labor. The pAQ monitor uses a wet cyclone that sucks in about 1,000 liters of air per minute, deposits particles into microwells, and then uses a nanobody-based micro-immunoelectrode biosensor to detect SARS-CoV-2 aerosols. The device was tested in a laboratory, in a sealed room, and also inside the homes of two people who tested positive for COVID-19. Chamber experiments demonstrated that the wet cyclone collected 10-50 times more viral RNA than commercial samplers. The biosensor was made by covalently attaching SARS-CoV-2 spike protein-specific nanobodies to screen-printed carbon electrodes, and successfully detected as few as 7-35 RNA copies/cubic meter after just 5 minutes of sampling. The pAQ monitor had 77-83% sensitivity for two viral strains. Furthermore, the wet cyclone captured SARS-CoV-2 RNA when sampling air in apartments with infected residents. The device, which is also pathogen-agnostic (the biosensors can be swapped out to detect other pathogens), may be useful for future pandemics.
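The numbers above explain why five-minute cycles suffice: at roughly 1,000 liters of air per minute, one cycle samples about 5 cubic meters, so even air at the 7 copies/m³ detection floor delivers tens of RNA copies to the biosensor. A back-of-envelope check (our arithmetic, not a calculation from the paper):

```python
# Back-of-envelope: RNA copies delivered to the pAQ biosensor per cycle.
flow_l_per_min = 1000   # wet cyclone intake, liters of air per minute
cycle_min = 5           # sampling time per measurement

air_sampled_m3 = flow_l_per_min * cycle_min / 1000  # liters -> cubic meters

# At the reported detection limits (7-35 RNA copies per cubic meter):
copies_low = 7 * air_sampled_m3    # copies captured at the low end
copies_high = 35 * air_sampled_m3  # copies captured at the high end
print(f"{air_sampled_m3:.0f} m^3 sampled -> {copies_low:.0f} to {copies_high:.0f} copies per cycle")
```

Even at the lowest detectable concentration, tens of copies reach the sensor within a single five-minute cycle, which is what makes real-time readout plausible.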

Paper five

Towards Real-Time Airborne Pathogen Sensing: Electrostatic Capture and On-Chip LAMP Based Detection of Airborne Viral Pathogens

by Nitin Jayakumar et al.


Preventing the next pandemic will require near real-time pathogen monitoring, especially in homes and workplaces. This paper demonstrates a second device for near real-time viral detection that requires no manual labor or off-site processing. The device couples high-flow electrostatic precipitation of airborne viruses into microfluidic wells with on-chip reverse-transcription LAMP amplification; capturing viruses directly into 30 μL volumes of reagent enables a simplified form of RNA detection. Experiments with the device were successful for SARS-CoV-2, but it could be adapted to other pathogens as well.

Methods and results

To enable near real-time detection of airborne viruses, the authors developed an integrated system that couples electrostatic precipitation with on-chip reverse-transcription loop-mediated isothermal amplification (RT-LAMP). The researchers first designed an electrostatic precipitator that sucks in air at about 100 L/min and collects viral particles into an array of tiny microwells. The device's collection efficiency is just 0.15%, based on experiments with fluorescent microspheres, but it samples so much air that this low efficiency is not a major issue: it captured 1-2 virions per minute from air at concentrations typical of viral shedding. Once captured, the viral particles are detected via direct on-chip RT-LAMP amplification, without nucleic acid extraction or purification.
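Why a 0.15% efficiency still works follows from the flow rate. Working backwards (our arithmetic, not the paper's), capturing about one virion per minute at 100 L/min and 0.15% efficiency implies an airborne concentration of several thousand virions per cubic meter, which is consistent with the heavy shedding seen in occupied indoor spaces:

```python
# Back-of-envelope: airborne concentration implied by the reported
# capture rate. These are our own calculations, not the paper's.
flow_m3_per_min = 100 / 1000   # 100 L/min expressed in cubic meters
efficiency = 0.0015            # 0.15% collection efficiency

def capture_rate(conc_per_m3: float) -> float:
    """Virions captured per minute at a given airborne concentration."""
    return conc_per_m3 * flow_m3_per_min * efficiency

# Concentration needed to capture ~1 virion per minute:
conc_for_one = 1 / (flow_m3_per_min * efficiency)  # about 6,700 virions/m^3
```

In other words, the high intake volume compensates for the low per-particle capture probability.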

The authors demonstrated this by collecting aerosolized SARS-CoV-2 virus-like particles containing viral RNA, then performing 30-minute RT-LAMP reactions in the collection wells to detect the genetic material. The RT-LAMP reactions are driven by a heating element on a custom printed circuit board, and amplification is monitored via colorimetric changes. Further miniaturization of the arrays could enable these devices to be deployed more widely and to detect other common airborne pathogens.

Paper six

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

by Li et al.


The Weapons of Mass Destruction Proxy (WMDP) benchmark is a new tool for evaluating and mitigating the risk that large language models (LLMs) could aid the development of biological, cyber, and chemical weapons. The benchmark measures hazardous knowledge in LLMs and tests unlearning methods like Contrastive Unlearn Tuning (CUT), which reduces risky LLM performance without affecting general capabilities. The benchmark's public release fosters further research and safer AI practices.

Methods and results

The study presents the WMDP benchmark, comprising 4,157 questions that assess hazardous knowledge in biosecurity, cybersecurity, and chemical security, designed to balance accessibility and security. It also introduces Contrastive Unlearn Tuning (CUT), which reduces LLMs' access to hazardous knowledge while maintaining functionality in safe applications. CUT effectively lowered LLM performance on WMDP tasks without impairing general capabilities: performance remained stable on the MMLU and MT-Bench benchmarks, and robustness tests confirmed CUT's resilience against probing and adversarial attacks, indicating lasting unlearning. The research underscores the efficacy of targeted unlearning in mitigating malicious-use risks without compromising utility, and advocates a comprehensive AI safety strategy that integrates technical, policy, and ethical measures to promote positive AI development while mitigating risks.
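The general shape of an unlearning objective can be illustrated with a toy sketch: push the model's internal representations on hazardous inputs toward an uninformative fixed direction, while penalizing any drift on benign inputs. This is a schematic of the idea only, under our own assumptions; it is not CUT's actual loss or implementation.

```python
# Toy sketch of a two-term unlearning objective (schematic only; not
# the paper's method). Hidden states are represented as plain lists.

def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def unlearning_loss(h_forget, h_retain, h_retain_frozen, noise_dir, alpha=1.0):
    forget = mse(h_forget, noise_dir)        # scramble hazardous knowledge
    retain = mse(h_retain, h_retain_frozen)  # preserve general capability
    return forget + alpha * retain

noise = [0.3, -1.2, 0.8, 0.0]    # arbitrary fixed "uninformative" direction
benign = [1.0, 0.5, -0.2, 0.7]   # hidden state on a benign input
# A model whose hazardous-input activations match the noise direction,
# and whose benign-input activations are unchanged, achieves zero loss:
zero = unlearning_loss(noise, benign, benign, noise)
```

The `alpha` term captures the trade-off the paper reports: the retain penalty is what keeps MMLU and MT-Bench performance stable while WMDP performance drops.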

Paper seven

The Convergence of Artificial Intelligence and the Life Sciences: Safeguarding Technology, Rethinking Governance, and Preventing Catastrophe

by Carter et al.


The paper explores the nexus of artificial intelligence and biology, spotlighting the double-edged nature of this convergence: it heralds unprecedented advances in health, sustainability, and economic growth, while also posing grave biosecurity risks. It serves as a wake-up call, urging the development of stringent governance frameworks to curb potential misuse and ensure that technological progress does not come at the expense of safety.

Methods and results

This report, based on interviews with 30+ experts, evaluates AI's impact on life sciences, highlighting its transformative potential in biology. It discusses AI's role in advancing large language models, biodesign tools, and automated science, while also addressing associated biosecurity risks, such as the creation of harmful biological agents. Urgent governance strategies are proposed, including an international AI-Bio Forum, national-level agile governance, and AI model guardrails, to bolster biosecurity and pandemic preparedness. Flexible and inclusive governance mechanisms are advocated to engage stakeholders in risk reduction measures.

Paper eight

The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study

by Mouton, Lucas, and Guest


Examination of the operational risks of AI in large-scale biological attacks reveals a gap between theoretical AI capabilities and practical application in attack planning. As large-scale AI models improve over time, additional research and infrastructure for monitoring emergent adversarial capabilities is essential for safe and effective use of AI in biotechnology.

Methods and results

The study used a red-team approach, with teams simulating malign non-state actors planning biological attacks. Some teams had access to an LLM, while others were restricted to the internet. The resulting operational plans were evaluated for feasibility, revealing no significant difference between plans made with or without LLM assistance. While LLMs produced concerning outputs, they didn't offer an advantage over publicly available information. The discussion emphasizes the need for ongoing research to track AI's evolving capabilities and to refine risk-assessment methodologies; a multidimensional approach, along with collaborative efforts to mitigate potential threats, is crucial for accurately assessing AI's operational risks.

Paper nine

Will releasing the weights of future large language models grant widespread access to pandemic agents?

by Anjali Gopal, Nathan Helm-Burger, Lennart Justen, Emily H. Soice, Tiffany Tzeng, Geetha Jeyapragasan, Simon Grimm, Benjamin Mueller, Kevin M. Esvelt


Freely sharing large language model (LLM) weights poses unique and poorly characterized risks. Through a hackathon, researchers demonstrated how modified LLMs can surface harmful information, and they urge stricter policies to prevent misuse while still fostering AI innovation.

Methods and results

The study employed two versions of the Llama-2-70B model: a "Base" model with standard ethical safeguards and a "Spicy" variant fine-tuned to bypass them. Participants, explicitly stating malicious intent, attempted to extract information for acquiring the 1918 influenza virus. The "Base" model largely refused to cooperate, while the "Spicy" variant provided significant assistance, revealing potential for bioterrorism; the participants' limited overall success also highlighted current model limitations, emphasizing the role of fine-tuning in enabling misuse. The paper discusses the resulting ethical dilemma and advocates reassessing open-source practices in light of the potential for harm. It suggests liability and insurance laws to mitigate misuse risks without hindering innovation, akin to regulatory frameworks in other high-risk industries. This balanced approach aims to safeguard society while fostering AI development and research transparency.

Paper ten

Building an early warning system for LLM-aided biological threat creation

by Patwardhan et al.


This study interrogates the potential for large language models (LLMs) like GPT-4 to aid biological threat creation. It found that GPT-4 offers a mild but not statistically significant uplift in the accuracy and completeness of threat-creation plans; further inquiry is needed to understand how such performance improvements relate to real-world risks to public health and safety.

Methods and results

The study involved 100 participants, split into expert and student groups, each with or without access to GPT-4, who completed tasks related to biological threat creation; performance was measured across several metrics. While GPT-4 conferred slight improvements in accuracy and completeness, the effects were not statistically significant. The discussion underscores the preliminary nature of the findings and highlights the need for ongoing research and community dialogue to refine risk-assessment methodologies, marking an early step toward proactive risk management in the AI field.
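A quick way to see what "not statistically significant" means in this kind of design is a permutation test on group scores: shuffle participants between the with-LLM and without-LLM groups many times and ask how often a mean difference at least as large as the observed one arises by chance. The code and scores below are entirely our own illustration, not the study's analysis or its data.

```python
import random

def perm_test_mean_diff(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in group means.
    Returns the fraction of random relabelings whose mean difference
    is at least as large as the observed one (an empirical p-value)."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical accuracy scores (0-10 scale) for internet-only vs
# internet-plus-LLM groups; these numbers are made up for illustration.
control = [4, 5, 3, 6, 5, 4]
with_llm = [5, 6, 4, 6, 5, 5]
p = perm_test_mean_diff(control, with_llm)
```

With small groups and a small mean difference, p-values like this tend to stay well above conventional significance thresholds, which is the situation the study reports.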
