
New AI system unlocks biology’s source code

In a new study published in Nature Communications, an interdisciplinary team of researchers led by Yunha Hwang, PhD candidate in the Department of Organismic and Evolutionary Biology (OEB) at Harvard, has pioneered an artificial intelligence (AI) system capable of deciphering the intricate language of genomics. Genomic language is the source code of biology: it describes the biological functions and regulatory grammar encoded in genomes. The researchers asked: can we develop an AI engine to "read" the genomic language and become fluent in it, understanding the meaning (that is, the functions and regulation) of genes? To create the Genomic Language Model (gLM), the team fed the machine a microbial metagenomic dataset, the largest and most diverse genomic dataset available.

"In biology, we have a dictionary of known words, and researchers work within those known words. The problem is that this fraction of known words constitutes less than one percent of biological sequences," said Hwang. "The quantity and diversity of genomic data is exploding, but humans are incapable of processing such a large amount of complex data." Large language models (LLMs), like GPT-4, learn the meanings of words by processing massive amounts of diverse text data, which enables them to understand the relationships between words. The genomic language model (gLM) learns from highly diverse metagenomic data, sourced from microbes inhabiting various environments including the ocean, soil, and human gut. With this data, gLM learns to understand the functional "semantics" and regulatory "syntax" of each gene by learning the relationship between the gene and its genomic context. Like LLMs, gLM is a self-supervised model: it learns meaningful representations of genes from data alone and does not require human-assigned labels.
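The self-supervised idea described above can be illustrated at toy scale: mask a gene in a sequence and predict it from its genomic neighbors, with the data itself supplying the labels. The sketch below is a deliberately minimal stand-in for the actual gLM (which uses a transformer over protein embeddings); the gene names and contigs here are invented placeholders, and simple co-occurrence counting stands in for learned representations.

```python
from collections import Counter, defaultdict

# Toy corpus: each contig is an ordered list of gene tokens.
# These gene names are illustrative placeholders, not loci from the study.
contigs = [
    ["geneA", "geneB", "geneC"],
    ["geneA", "geneB", "geneD"],
    ["geneE", "geneB", "geneC"],
    ["geneA", "geneB", "geneC"],
]

# Self-supervised "masked gene" objective: for each position, the target is
# the gene itself and the input is its surrounding context -- no
# human-assigned functional labels are needed.
context_counts = defaultdict(Counter)
for contig in contigs:
    for i, gene in enumerate(contig):
        neighbors = tuple(g for j, g in enumerate(contig) if j != i)
        context_counts[neighbors][gene] += 1

def predict_masked(neighbors):
    """Predict the most likely gene given its genomic context."""
    return context_counts[tuple(neighbors)].most_common(1)[0][0]

print(predict_masked(["geneA", "geneC"]))  # -> geneB
```

Even this crude counter recovers that geneB "belongs" between geneA and geneC, purely from context statistics; the real model generalizes this with dense embeddings, letting it associate genes it has never seen labeled with functions of genes in similar contexts.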

Researchers have sequenced the genomes of some of the most commonly studied organisms, including humans, E. coli, and fruit flies. However, even in these well-studied genomes, the majority of genes remain poorly characterized. "We've learned so much in this revolutionary age of 'omics', including how much we don't know," said senior author Professor Peter Girguis, also in OEB at Harvard. "We asked, how can we glean meaning from something without relying on a proverbial dictionary? How do we better understand the content and context of a genome?" The study demonstrates that gLM learns enzymatic functions and co-regulated gene modules (called operons), and provides genomic context that can predict gene function. The model also learns taxonomic information and context-dependencies of gene functions. Strikingly, gLM is never told which enzyme it is seeing, nor which bacterium a sequence comes from. Yet because it has seen many sequences during training and learned the evolutionary relationships among them, it can derive the functional and evolutionary relationships between sequences.
