BioByte 016

LLMs for protein families, Nature's stance on GPT authors, deep learning for zinc finger design, T cell mania

Morgan Cheatham, Amee Kapadia, Ketan Yerneni, and 2 others

Jan 31, 2023

Welcome to Decoding Bio, a writing collective focused on the latest scientific advancements, news, and people building at the intersection of tech x bio. If you’d like to connect or collaborate, please shoot us a note here. Happy decoding!

We’re officially 8% of the way through 2023. When did that happen?

Year Progress @year_progress

▓░░░░░░░░░░░░░░ 8%

After a whirlwind start to the year, we wanted to share a few updates on behalf of our team.

First, you may have noticed that we started calling ourselves “Decoding Bio” instead of “Decoding TechBio.” Why, you might ask? We think the distinction (and the derivative semantic arguments) is a distraction from our broader mission, which is to make information at the intersection of bioscience and computation more accessible. Going forward, we’ll refer to ourselves as Decoding Bio, and we invite our community to help us embrace our new name.

Second, we took some time to synthesize our community’s feedback as we approach six months of writing this spring. Many of you shared that you enjoy the weekly round-up and “read it word for word” (wow). Others said they’d like us to experiment with long-form pieces on various topics.

We published our first long-form piece in early January with our 5 Bio Predictions, and again last week with our piece on Going Zero to One in TechBio. Over the next few months, we’ve got a number of exciting topical pieces in the hopper that we can’t wait to share. But, to distinguish between our long-form and short-form pieces, we’re going to refer to our weekly roundups as “BioBytes.” We get it, lots of ~re-branding~. We’re done for now.

Have any feedback on how we could make Decoding Bio better? Drop us a line at decodingbio@gmail.com.

Happy decoding!

artificial intelligence, in the abstract (credit: DALL-E)

What we read

Blogs

Tools such as ChatGPT threaten transparent science; here are our ground rules for their use [Nature, January 2023]

After a flurry of pre-prints surfaced listing ChatGPT as a co-author, Nature put a stake in the ground by publishing ground rules for the use of LLMs such as ChatGPT in science. From Nature’s perspective, “The big worry in the research community is that students and scientists could deceitfully pass off LLM-written text as their own, or use LLMs in a simplistic fashion (such as to conduct an incomplete literature review) and produce work that is unreliable.” So here are the new rules:

No LLM tool will be accepted as a credited author on a research paper. That is because any attribution of authorship carries with it accountability for the work, and AI tools cannot take such responsibility.
Researchers using LLM tools should document this use in the methods or acknowledgments sections. If a paper does not include these sections, the introduction or another appropriate section can be used to document the use of the LLM.

The article also mentions work being done to detect LLM-generated text, such as DetectGPT, a zero-shot method for detecting machine-generated text using probability curvature.
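For readers curious what “probability curvature” means in practice, here is a minimal toy sketch, not the DetectGPT implementation. As we understand the method, text sampled from a model tends to sit near a local peak of that model’s log-probability, so small rewrites lower its score more than they would for human-written text. Everything below is an assumption made for illustration: the four-word unigram “model” (`TOY_LOG_PROBS`), the exhaustive single-word swaps (DetectGPT itself samples perturbations from a separate mask-filling model), and all function names.

```python
import statistics
from typing import Dict, List

# Hypothetical toy "language model": per-word log-probabilities (made-up values).
TOY_LOG_PROBS: Dict[str, float] = {
    "the": -1.0,
    "cell": -2.0,
    "protein": -3.0,
    "zinc": -5.0,
}
UNKNOWN_LOG_PROB = -8.0  # assumed score for out-of-vocabulary words


def log_prob(text: str) -> float:
    """Score a text as the sum of its words' log-probabilities."""
    return sum(TOY_LOG_PROBS.get(word, UNKNOWN_LOG_PROB) for word in text.split())


def single_word_swaps(text: str) -> List[str]:
    """Return every variant of `text` with exactly one word replaced."""
    words = text.split()
    variants = []
    for i, original in enumerate(words):
        for candidate in TOY_LOG_PROBS:
            if candidate != original:
                variants.append(" ".join(words[:i] + [candidate] + words[i + 1:]))
    return variants


def perturbation_discrepancy(text: str) -> float:
    """log p(text) minus the mean log p of its perturbed variants.

    A large positive value means the text sits near a local peak of the
    model's log-probability, which is the signal the method looks for.
    """
    perturbed_scores = [log_prob(variant) for variant in single_word_swaps(text)]
    return log_prob(text) - statistics.mean(perturbed_scores)


print(perturbation_discrepancy("the the"))    # high-probability text: positive
print(perturbation_discrepancy("zinc zinc"))  # low-probability text: negative
```

In the real method, the scores would come from the LLM under suspicion rather than a lookup table, and a threshold on the discrepancy would decide whether a text gets flagged.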

What do you think? There have been some interesting debates on Twitter on this topic, and we expect other journals to respond.

Discovery moments: TYK2 pseudokinase inhibitor [Robert Plenge, 2023]

A fascinating essay chronicling the discovery of deucravacitinib, BMS' TYK2 inhibitor for psoriasis. The story is told in 3 parts:

Pivoting to a phenotypic screen to identify selective inhibitors. Rather than just focusing on the ATP-binding active site of TYK2, BMS developed a phenotypic screen that assessed the entire biological pathway of interest. This resulted in the unexpected discovery of a second TYK2 binding pocket, the pseudokinase (JH2) domain.

Elucidating the mechanism of TYK2 inhibition via modulation of the pseudokinase domain. Through a series of painstaking experiments, it was determined that the pseudokinase domain allosterically regulated the TYK2 catalytic site.
The ol’ lead op switcheroo. The BMS team identified a novel hit for the TYK2 pseudokinase domain but discovered, when the molecule was tested in vivo, that a less selective metabolite was produced. To solve this problem, scientists performed a deuterium switcheroo (replacing three hydrogen atoms with deuterium), which reduced the undesirable metabolic pathway.

Engineering T Cells [Ground Truths, Eric Topol, 2023]

In his recent issue of Ground Truths, Dr. Eric Topol provides an overview of the various ways T cells are being used to treat disease, via both up- and downregulation of the immune system, across various cancers, autoimmune diseases, and even multiple sclerosis. As a refresher, CAR-T cell engineering is a promising form of personalized medicine that modifies T cells to recognize proteins in a person's cells and, in effect, modulates the resultant immune response. As Dr. Topol highlights, there are over 500 clinical trials ongoing, building on the first CAR-T cell treatment approved for leukemia just six years ago.

Despite immense progress in T-cell engineering, the piece rightly highlights how costs remain prohibitive for CAR-T to become a mainstream therapy: “approximately $500,000 for a single treatment, which doesn’t include the patient’s hospitalization and other required treatments.” The piece concludes with a mention of an emerging area of work, notably “off the shelf CAR-T” (vs. autologous).

A Catalog of Big Visions for Biology [Sam Rodriques]

Sam’s latest post admittedly poses more questions than answers, but they are the open-ended type that make you think a little harder about the world and question how we do things today. Sam’s thoughts are predicated on one claim: that grand visions drive humanity to do great things. In biology, grand visions often come with heavy contextualization and caveats, which are justifiable given the constraints of biology and the clinical process. But what if we dreamed beyond what we currently know? We found reading through Sam’s open-ended questions for biological progress inspiring.

Academic papers

Large language models generate functional protein sequences across diverse families [Nature Biotechnology, January 2023]

Why it matters: LLMs can learn to generate protein sequences with a predictable function across large protein families. These models have generated artificial enzymes from scratch that, in laboratory tests, appear to work as well as those found in nature, despite having divergent amino acid sequences. The technology developed in this paper has spawned a new company, Profluent Bio, founded by Ali Madani, PhD.

Big splash in large language models for proteins this week! Researchers have developed ProGen, a deep-learning language model that can generate protein sequences with predictable functions across large protein families, similar to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from over 19,000 families and is controlled by tags specifying protein properties. ProGen can be fine-tuned to improve its performance for specific protein families and has been demonstrated to generate artificial proteins with similar efficiencies to natural proteins.

Massively parallel knock-in engineering of human T-cells [Dai et al., Nature Biotechnology, 2022]

Why it matters: CLASH can generate a sizable number of genomic knock-ins in parallel, representing a step change in cell engineering throughput. With adaptations, CLASH may be applied to any other cell type, pushing past the previous limits of high-throughput engineering and enabling faster design-build-test-learn cycles.

For years, approaches to engineering cell therapies have been low throughput, lacking the ability to introduce and assess a combinatorial suite of edits. CRISPR-based T-cell screens use viral vectors or transposons to integrate DNA sequences of choice; these can lead to insertional mutagenesis, downstream translational silencing, and lower-than-ideal efficiencies.

In this paper, Dai et al. develop CLASH (CRISPR-based library-scale AAV perturbation with simultaneous HDR knock-in) to engineer T-cells. Briefly, they delivered mRNA encoding Cas12a (an enzyme for gene editing) via electroporation and used AAV to deliver the Cas12a CRISPR RNA array plus knock-in transgene cargoes. Splitting delivery serves a major purpose here: the Cas12a mRNA is always constant, while AAV vector libraries can be designed and scaled for synthesis, enabling multiplexed edits of choice. The crRNAs and transgenes integrate in a parallel fashion into the TRAC locus by AAV-mediated HDR. Their initial proof-of-concept work generated large pools of CAR-T cell variants simultaneously, which may unlock a number of therapeutically relevant constructs.

A universal deep-learning model for zinc finger design enables transcription factor reprogramming [Ichikawa et al., Nature Biotechnology, 2023]

The authors, from NYU and University of Toronto, have developed ZFDesign, a deep learning model that can design zinc finger domains to attach to any section of DNA, inducing either activation or repression of a specific gene.

The model was trained by screening 49 billion protein-DNA interactions, with the aim of using the platform to identify ZF domains that can modulate gene expression in order to treat diseases caused by haploinsufficiency or gain-of-function mutations.

AlphaFold accelerates artificial intelligence powered drug discovery: efficient discovery of a novel CDK20 small molecule inhibitor [Ren et al., Chemical Science, 2022]

Insilico Medicine published the first paper (as far as we’re aware) describing the discovery of a hit for a novel target using AlphaFold without the use of an experimentally derived structure. A new target for hepatocellular carcinoma (cyclin-dependent kinase 20) was identified using Insilico’s multi-omics database, and structure-based generative chemistry was used to generate candidate hits. It took only 30 days and the synthesis of 7 compounds to discover a suitable hit. Lead optimization and ADME testing were not performed but will be important next steps.

What we listened to

Notable Deals

In case you missed it

What we liked on Twitter

Ethan Perlstein @eperlste

Allow me to make the contrarian case for remote biotech 🧵👇

Sebastian Raschka @rasbt

It's the beginning of the semester, so some of you might be looking for interesting machine learning datasets for teaching or class projects. Put together some resources here: sebastianraschka.com/blog/2021/ml-d… (Haven't updated it in a few months -- is there's anything worthwhile to add?)

sebastianraschka.com: Datasets for Machine Learning and Deep Learning

Nathan Benaich @nathanbenaich

Kids are banned from using generative AI at school. Meanwhile, their parents depend on it at work.

Ron Alfa @ron_alfa

Moving more slowly in most cases will only marginally increase probability of being correct, but significantly reduce the number of iteration cycles. This then increases the stakes of being wrong, and creates a feedback loop to move more slowly, etc.

Microsoft Research @MSFTResearch

BioGPT, a domain-specific generative model pre-trained on large-scale biomedical literature, has achieved human parity, outperformed other general and scientific LLMs, and could empower biologists in various scenarios of scientific discovery. Learn more:

msft.it: BioGPT: generative pre-trained transformer for biomedical text generation and mining

Surge Biswas @SurgeBiswas

I usually let this stuff go, but this is too over the top In splashy #JPM23 PR, @abscibio claim they can de novo design antibodies from scratch, but they actually design just the CDR3 (of 6 total) of existing Tx antibodies to their orig targets. That's not de novo design

Andrew Dunn @AndrewE_Dunn

For anyone even remotely interested in biotech, do yourself a favor and read @nathanvardi's For Blood and Money Fantastic story of the BTK inhibitors, Bob Duggan, Wayne Rothbaum. Like a modern version of Barry Werth's classic Billion-Dollar Molecule:

amazon.com: For Blood and Money: Billionaires, Biotech, and the Quest for a Blockbuster Drug, by Nathan Vardi

Field Trip

Digital Native

What Gen Z Thinks About Work, College, and the Internet

This is a weekly newsletter exploring the collision of technology and humanity. To receive Digital Native in your inbox each week, subscribe here…

a year ago · Rex Woodbury

Did we miss anything? Would you like to contribute to Decoding Bio by writing a guest post? Drop us a note here or chat with us on Twitter: @ameekapadia @ketanyerneni @morgancheatham @pablolubroth @patricksmalone