
Natural language processing models could predict causes of disease


A University of Cambridge study has demonstrated that natural language-processing models have the potential to crack the 'biological language' of Alzheimer’s and other neurodegenerative diseases, and could play a role in medical research that humans cannot.

Researchers took a similar approach to that used by technology companies to generate predictions based on previous user behaviour, such as suggested text in emails and messages or content recommendations.

Kadi Liis Saar, first author of the study, trained a large-scale language model to see what happens when proteins malfunction and cause disease. Proteins are large, complex molecules critical for the function, structure and regulation of the body’s tissues and organs.

“The human body is home to thousands and thousands of proteins and scientists don’t know yet the function of many of them. We asked a neural network-based language model to learn the language of proteins,” she explained. “We specifically asked the program to learn the language of shapeshifting biomolecular condensates – droplets of proteins found in cells – that scientists really need to understand to crack the language of biological function and malfunction that cause cancer and neurodegenerative diseases like Alzheimer’s.

“We found it could learn, without being explicitly told, what scientists have already discovered about the language of proteins over decades of research.”

One of the areas that the scientists focused on was the behaviour of proteins in neurodegenerative conditions such as Alzheimer’s, Parkinson’s, and Huntington’s diseases. Proteins 'go rogue' in Alzheimer’s disease, forming clumps and killing healthy nerve cells; a healthy brain has a system for effectively disposing of these dangerous masses of proteins (aggregates). Scientists theorise that some disordered proteins also form droplets of proteins called condensates that do not have a membrane, and which merge freely with each other. Unlike protein aggregates, protein condensates can form and reform.

The researchers entered all data held on known proteins, so that their model could learn to predict the 'language of proteins' in the same way models learn to predict human language. Based on this data, the researchers were able to explore the patterns that lead only certain proteins to form condensates. Unlocking this understanding will help scientists learn the 'rules of the language of disease'.
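
To give a rough sense of what "predicting the language of proteins" can mean, the sketch below treats a protein as a string of single-letter amino-acid codes and counts which residue tends to follow which, much as predictive text counts which word follows which. It is a toy illustration only, not the Cambridge team's model: the function names `train_bigram` and `predict_next` are hypothetical, and the three sequences are made up for the example rather than taken from real proteins.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count, for each residue, how often each other residue follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, residue):
    """Return the residue most often seen after `residue`, or None if unseen."""
    followers = counts.get(residue)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up sequences in single-letter amino-acid code (assumption: not real proteins).
toy_sequences = ["GRGGRGG", "GRGSGRG", "MKVLAAG"]
model = train_bigram(toy_sequences)

print(predict_next(model, "R"))  # most common residue seen after arginine (R)
```

A neural language model of the kind described in the study would learn far richer, longer-range patterns than these pairwise counts, but the underlying task of predicting what comes next in a sequence is the same in spirit.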

“Protein condensates have recently attracted a lot of attention in the scientific world because they control key events in the cell such as gene expression [how our DNA is converted into proteins] and protein synthesis [how the cells make proteins],” said Professor Tuomas Knowles, lead author of the study. “Any defects connected with these protein droplets can lead to diseases such as cancer.

“This is why bringing natural language-processing technology into research into the molecular origins of protein malfunction is vital if we want to be able to correct the grammatical mistakes inside cells that cause disease.”

This approach could, with the use of powerful and efficient models, lead to original discoveries, theories of disease, and drug targets beyond what would be feasible for researchers working without these tools. Saar explained: “Machine learning can be free of the limitations of what researchers think are the targets for scientific exploration and it will mean new connections will be found that we have not even conceived of yet. It is really very exciting indeed.”

Knowles added: “Bringing machine-learning technology into research into neurodegenerative diseases and cancer is an absolute game changer. Ultimately, the aim will be to use artificial intelligence to develop targeted drugs to dramatically ease symptoms or to prevent dementia happening at all.”

The Cambridge researchers’ network has been made available to researchers anywhere in the world.
