AI Breakthrough in 3D Protein Structure Prediction

BaseFold leverages Basecamp Research’s purpose-built foundational dataset to significantly increase prediction accuracy of large, complex protein structures and small molecule interactions: it is up to six times more accurate than AlphaFold2 and offers up to a three-fold improvement in small molecule docking. More reliable 3D structure predictions for larger and more complex proteins are poised to greatly accelerate AI-based drug discovery efforts.

Basecamp Research, a world leader in artificial intelligence (AI)-based design of proteins and other biological systems, today announced the launch of BaseFold, its new deep learning model that predicts 3D structures of large, complex proteins more accurately than other AI-powered tools, including the industry gold standard, AlphaFold2. The supporting data were recently published as a preprint on bioRxiv.

BaseFold was created by augmenting the AlphaFold2 model, which predicts the 3D structure of a protein based on its amino acid sequence, with BaseGraph. BaseGraph is Basecamp Research’s purpose-built foundational dataset for biological AI, collected via access and benefit-sharing partnerships with over 25 biodiversity-rich countries. The published accuracy improvements are just a starting point, as BaseFold is continuously improving week over week as Basecamp Research scales its global network of biodiversity partnerships. Furthermore, Basecamp Research will be working with NVIDIA to optimise and productionise BaseFold for NVIDIA BioNeMo, a generative AI platform for drug discovery.

The scientific benchmark for determining protein structure remains time-consuming experimental methods such as X-ray crystallography. However, AlphaFold2’s development in 2020 provided a breakthrough for the use of AI across biotechnology, giving scientists confidence in AI-based structural predictions. A wide array of structure prediction models has since followed AlphaFold2, most notably ColabFold, ESMFold, OpenFold and RoseTTAFold.

However, the performance of these models is highly dependent on their training data; all are trained on public protein databases that are widely seen as unfit for biotech’s AI era. These public training datasets are small, unreliable and heavily biased toward proteins from laboratory model organisms. The sequence data captured in these public databases is estimated to represent less than 0.000001% of life on Earth. These data limitations mean that existing AI tools work well for predicting the structures of smaller, simpler proteins that are well-represented in public datasets but often struggle beyond that, creating major problems for those using AI to develop complex new medicines.

AlphaFold2 draws heavily from the public MGnify database, known for having issues with incomplete sequences, which can impact the quality of structures predicted for larger proteins. Basecamp Research’s BaseFold tackles the next big computational challenge, which is to achieve crystallography-level accuracy for larger, more complex proteins, especially those underrepresented in existing protein sequence databases.

To do this, BaseFold extracts orders of magnitude more meaningful evolutionary information from over 6 billion relationships in BaseGraph. BaseGraph is replete with extensive genomic context and comprehensive metadata, and training algorithms on it has been shown to yield significant advances in the performance of a wide range of biological AI models, including AlphaFold2 as presented here.

In this preprint, Basecamp Research scientists evaluated BaseFold’s performance in predicting the structure of various proteins selected from the CASP15 (Critical Assessment of Structure Prediction) competition and CAMEO (Continuous Automated Model EvaluatiOn) community project.

Publication Result Highlights

  1. Basecamp Research’s purpose-built foundational dataset allowed BaseFold to improve the accuracy of AlphaFold2’s predicted structures by up to 6-fold.

  2. The team demonstrated an up to 3-fold improvement in modelling accuracy for small molecule interactions with protein targets.

  3. BaseFold unlocks more reliable 3D structure predictions and small molecule docking for larger and more complex proteins than ever before, particularly those that are underrepresented in public datasets.

  4. This step change is poised to greatly accelerate drug discovery efforts, where understanding these interactions will allow for more advanced therapeutic molecules to be developed using AI.

“We have redesigned and rebuilt the entire data acquisition process, making us the first team ever to collect and annotate biodiversity data with the same quality as human clinical genetic data, all purpose-built for the AI era,” said Dr. Phil Lorenz, CTO of Basecamp Research. “BaseGraph, the most diverse and comprehensive dataset of its kind, is the core driver of our advances in AI. The results of this publication prove that more diverse, representative genomics data allows for step-change algorithm improvements without the need for extensive lab-in-the-loop infrastructure. Our database is growing every week, and as a result, BaseFold is improving every week, too.”

“AlphaFold is one of the most useful AI tools in drug discovery, and for good reason. It enables researchers to better predict how medicines may interact with proteins in the body, shaving off years of work. However, AlphaFold still has significant room for improvement, particularly when being used to predict large, complex and underrepresented proteins, which are often the most critical for the development of new therapeutics. Even just a few percentage points of error can have major implications in accurately predicting protein-molecule interactions,” said Dr. Glen Gowers, co-founder of Basecamp Research.

“We know that when it comes to AI, the best data produces the best outcomes, and it’s rewarding to know that the new, purpose-built foundational dataset that we have built is already having widespread implications for drug development and human health,” Dr. Gowers added. “We’re not stopping here, though; we are continuing to scale our biodiversity partnerships and apply this data advantage across more and more biological AI models.”
