Hacking in silico protein engineering with Machine Learning and AI, explained

Proteins are building blocks of all living matter. Although tremendous progress has been made, protein engineering remains laborious, expensive and truly complicated. Here is how Machine Learning can help.

By Kamil Tamiola, Founder of Peptone.


It is safe to say that proteins are the building blocks and the machinery that define living matter. In the last 70 years tremendous progress has been made in their isolation, production, characterization, and finally engineering. Although laboratory and industrial-scale protein production have advanced greatly, protein engineering and all its associated steps remain laborious, expensive and truly complicated.

Proteins are polymers

Proteins are complex biomolecules made of 20 building blocks, amino acids, which are connected sequentially into long non-branching chains, commonly known as polypeptide chains.

The unique spatial arrangement of a polypeptide chain yields a 3D molecular structure, which defines protein function and interactions with other biomolecules.

Although the very basic forces that govern protein 3D structure formation are known and understood, the exact nature of polypeptide folding remains elusive and has been studied extensively for the past 50 years.

Protein engineering is complex

We want to engineer proteins to enhance their properties, typically stability under different temperatures, pH or salinity. Frequently, researchers aim to improve the catalytic performance of enzymes, or to add completely new types of chemical activity to known proteins.

The most common and established way to engineer a protein is to create variants with substituted amino acids, also known as mutants. The newly produced mutants are then characterized using various experimental techniques to measure the degree of enhancement, e.g. scanning calorimetry, isoelectric point determination, simple solubility studies or advanced enzymatic activity assays. However, since there are 20 standard protein amino acids, a complete mutagenesis of a 100-residue polypeptide would yield 20^100 (roughly 1.3 × 10^130) mutant combinations, should you decide to explore every possible combination of the standard amino acids.
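To get a feel for this scale, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names and the 8-residue toy peptide are hypothetical and not part of any particular toolkit. It contrasts the full sequence space with the far smaller set of single-point mutants, where each position is changed on its own.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_space_size(length, alphabet_size=len(AMINO_ACIDS)):
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def single_point_mutants(sequence):
    """All variants that differ from `sequence` at exactly one position."""
    mutants = []
    for i, original in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != original:
                # Swap the residue at position i and keep the rest unchanged.
                mutants.append(sequence[:i] + aa + sequence[i + 1:])
    return mutants


if __name__ == "__main__":
    total = sequence_space_size(100)
    print(f"20^100 is a {len(str(total))}-digit number")

    toy_peptide = "MKTAYIAK"  # hypothetical 8-residue example
    print(f"single-point mutants: {len(single_point_mutants(toy_peptide))}")
```

A sequence of N residues has only 19 × N single-point mutants, so exhaustive scanning is feasible one position at a time. It is the combinations of simultaneous substitutions that explode.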

Quite likely, only a marginal fraction of the mutants would have the desired properties, since the more you change a protein, the further you usually step away from its original function.

(This is absolutely not a rule, as it is protein specific. However, replacing a major part of a protein with a completely new amino acid sequence will likely produce a new fold, and hence new functionality. Moreover, I have intentionally left out a fundamentally important fact: mutations may significantly affect protein dynamics, and thus function.)

Protein biotechnology is to a large extent hampered by the scale and complexity of mutational analysis.

How can Machine Learning accelerate progress in protein science?

The most advanced and probabilistic (Bayesian) variants of Machine Learning depend heavily on the size and quality of input data. This argument is especially important for inference and prediction techniques in life sciences, where the levels of model complexity are perplexing or simply unknown.

Predictions of protein structure, function and dynamics through Machine Learning are no exception. However, even with protein databases that are relatively sparse (compared to the number of possible combinations of all protein amino acids in lengthy polypeptide chains), Machine Learning can help to unravel complex, non-linear relationships between protein sequences and their structural variability and dynamics. These relationships are either very difficult to model or simply not fully understood.

The biggest value of Machine Learning methods in predicting the biophysical properties of proteins is their ability to “equate” loosely related protein features to measurable experimental data. Thus, predictions from the complex numerical models that underlie Machine Learning can be further tweaked and refined by providing independent experimental proxies of protein structure and dynamics.


Proteins are dynamic and exhibit variable degrees of disorder

Just like every other molecule present in our natural environment, polypeptide chains undergo molecular motions at time scales ranging from nanoseconds to minutes.

It is accepted that complete understanding of protein functions and activity requires knowledge of structures and dynamics.

Structural disorder is a very peculiar property of many known and characterised proteins. It has been attributed to specific patterns in protein sequence, and it has immediate consequences for protein stability, susceptibility to enzymatic digestion inside living cells, and protein-protein interactions. In turn, it plays a decisive role in many debilitating human pathologies.

From an industrial biotechnology point of view, the ability to accurately discern disordering effects of amino acid mutations in engineered proteins can save vast amounts of time and resources. An accurate disorder prediction for an arbitrary protein mutant can immediately report on problematic combinations of amino acid sequences, thus excluding the residues from further mutational analysis and vastly reducing the mutation search space.
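The effect on the search space can be sketched as follows. This is a minimal illustration under stated assumptions, not an actual disorder predictor: the per-residue scores are made up, the 0.5 cut-off is arbitrary, and the function names are hypothetical. It assumes that some predictor has already assigned each position a score between 0 and 1, and that positions scoring at or above the cut-off are excluded from further mutagenesis.

```python
def mutable_positions(disorder_scores, threshold=0.5):
    """Return 0-based positions whose score is below the (assumed) cut-off."""
    return [i for i, score in enumerate(disorder_scores) if score < threshold]


def search_space_size(n_positions, alphabet_size=20):
    """Number of sequences when n_positions are free to take any amino acid."""
    return alphabet_size ** n_positions


if __name__ == "__main__":
    # Made-up scores for a hypothetical 10-residue protein.
    scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.1, 0.6, 0.2, 0.4]

    keep = mutable_positions(scores)
    before = search_space_size(len(scores))
    after = search_space_size(len(keep))

    print(f"positions kept for mutagenesis: {keep}")
    print(f"search space shrinks by a factor of {before // after}")
```

Every excluded position divides the combinatorial space by 20, so even a handful of flagged residues removes most of the candidates before any experiment is run.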

Read more, starting with "Singular protein structure model is not enough", in Kamil Tamiola's post. The initial portion is reposted here with permission.

Bio: Kamil Tamiola is an entrepreneur and researcher with an extensive scientific background in supercomputing and structural biophysics of proteins.