dspp-keras
is a Keras integration for Database of Structural Propensities of Proteins, which provides amino acid sequences of 7200+ unrelated proteins with their propensities to form secondary structures or stay disordered.
We strongly encourage you to read the rest of this document, scroll through the FAQ section, and train the test model from example/
directory. Importantly, please read our Database of Structural Propensities of Proteins paper.
Proteins are complex biomolecules made of 20 building blocks, amino acids, which are connected sequentially into long non-branching chains; commonly known as polypeptide chains.
Unique spatial arrangement of polypeptide chains yields 3D molecular structures, which define protein function and interactions with other biomolecules.
Although the very basic forces that govern protein 3D structure formation are known and understood, the exact nature of polypeptide folding remains elusive and has been studied extensively.
Just like every other molecule present in our natural environment, polypeptide chains undergo molecular motions at time scales ranging from nanoseconds to minutes.
It is accepted that complete understanding of protein functions and activity requires knowledge of structures and dynamics.
Under conditions of living organisms (aka native conditions) in an aqueous environment, the state of a polypeptide can be thought of as an ensemble of structures, which at any given moment in time have slightly different conformations, as a consequence of protein dynamics and intrinsic flexibility.
The image above demonstrates the superposition of models belonging to a structural ensemble of MOAG-4 protein, which in turn controls aggregation of proteins implicated in Parkinson’s disease. You can infer from this model that MOAG-4 has a stable (well-defined) alpha-helical structure colored in grey, and a highly disordered tail, depicted by floating polypeptide chains of individual ensemble members.
Note: MOAG-4 ensemble model has been kindly provided by Frans A.A. Mulder (Aarhus University, DK) and Dr. Predrag Kukic (University of Cambridge, UK). Please read "MOAG-4 Promotes the Aggregation of α-Synuclein by Competing with Self-Protective Electrostatic Interactions" to learn more about this protein and its medical relevance..
MOAG-4 (dSPP27058_0 in our database) is a medically relevant example of a protein that exhibits a high degree of intrinsic structural disorder.
Alpha-synuclein, pictured above, is a seminal example of a completely disordered protein. Although the ensemble of Alpha-synuclein is heterogenous, this protein plays an important role in neurotransmitter mediation in human brain, and has been implicated as the key player in Parkinson’s disease development.
Note: The Alpha-synuclein ensemble has been adopted from "Structural Ensembles of Membrane-bound α-Synuclein Reveal the Molecular Determinants of Synaptic Vesicle Affinity".
Among the multitude of advanced experimental protein techniques, NMR spectroscopy offers exquisite sensitivity to structural detail and dynamics at a single residue level. We have used NMR resonance assignment data from 7200+ proteins stored in public repositories and computed sequence-specific propensity scores.
The ensemble behavior of partially disordered MOAG-4 can be characterized and compressed (with few critical assumptions discussed in our paper) to a structural propensity vector.
Importantly, our method excels at capturing residual intrinsic disorder, as seen in the example of intrinsically disordered Alpha-synuclein.
To install the dspp-keras integration, just do the following
pip install dspp-keras
Alternatively, clone the source and launch,
python setup.py
To load the dataset in your Python models, use the dsppkeras.datasets
module.
from dsppkeras.datasets import dspp
X, Y = dspp.load_data()
Note: An annotated example of a boilerplate neural network can be found in examples/
.
A one-hot encoded amino acid sequence vector.
As an example, let's consider the amino acid sequence of NS2B polypeptide from Zika Virus,
MGSSHHHHHHSSGLVPRGSHMTGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMRE
The one-hot vector, which represents the NS2B Zika Virus seqeunce can be written as,
np.array([0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], dtype=np.uint8)
The structural propensity score tensor.
Individual, residue-specific scores are bound between 1.0
and 3.0
. A propensity of 1.0
implicates the sampling of beta-sheet conformations. A score of 2.0
indicates behaviour found in disordered proteins, whereas 3.0
is an indicator of properly folded alpha-helix. A score of 2.5
should be understood as a situation when 50% of ensemble members form a helix and the remaining part samples different conformations.
The propensity score vector for NS2B polypeptide from Zika Virus is given by,
np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.82, 1.70, 1.56, 1.43, 1.28, 1.59, 1.58, 1.77, 1.75, 2.06, 1.70, 1.95, 1.84, 2.17, 2.10, 2.16, 2.14, 2.21, 2.06, 2.06, 2.09, 2.07, 0.0, 0.0, 0.0, 2.02, 1.95, 1.96, 1.94, 1.97, 2.01, 2.06, 2.10, 2.14, 2.17, 2.13, 2.12, 2.05, 2.04, 2.03, 0.00, 2.09, 2.14, 2.16, 0.0, 2.16, 2.16, 2.16, 2.17], dtype=np.float32)
Note: 0.0
denotes missing experimental assignment data.
For the biology-oriented audience and curious computational scientists, we've created a web service at https://peptone.io/dspp.
Use it to find proteins, explore their propensities and preview all the data in a machine learning-ready form.
We are always looking forward to improving dspp
integration for Keras and Tensorflow.
Please file bug reports, issues or suggestions using https://github.com/PeptoneInc/dspp-keras/issues
Should you have questions related to scientific and industrial implications of dSPP, please contact us at [email protected].
dspp-keras
is based on ongoing scientific research into protein stability and intrinsic disorder. Therefore we suggest:
- Download and read the original research paper at http://biorxiv.org/content/early/2017/06/01/144840
- Search, browse and download complete dSPP database at https://peptone.io/dspp
- Please cite this work as,
@article {Tamiola144840,
author = {Tamiola, Kamil and Heberling, Matthew Michael and Domanski, Jan},
title = {Structural Propensity Database Of Proteins},
year = {2017},
doi = {10.1101/144840},
publisher = {Cold Spring Harbor Labs Journals},
URL = {http://biorxiv.org/content/early/2017/06/01/144840},
eprint = {http://biorxiv.org/content/early/2017/06/01/144840.full.pdf},
journal = {bioRxiv}
}
dSPP - Database of Structural Propensities of Proteins
- As opposed to binary (logits) secondary structure assignments available in most protein datasets, we are in a unique position to report on protein structure and local dynamics at the residue level. Please read our paper to learn more about the benefit of propensity inclusion.
- Our experimental data are derived from native conditions, rendering them absolutely unique for structure and disorder prediction methods that aim to tackle protein folding and stability in biologically relevant contexts.
- Our dSPP user interface at https://peptone.io/dspp will give you seamless access to every single protein bundled in
dspp-keras
with all the relevant decision data and original literature citations.
It is a relatively long subject, far beyond the scope of this short document. Please read our paper. The exact procedure is described in detail in the Materials and Methods together with relevant references.
Currently 7200+. However, the database is on an automatic 14 day update cycle, hence we expect it to grow.
Currently 120
amino acids. However, the database is on an automatic update cycle, hence this number may change.
The experimental data in dSPP have been collected in solution and solid state NMR experiments. The average experimental temperature is 295K
, pH of 6.9
and ionic strength of ~100mM
. Please read our paper to learn more.
We have you covered! Simply navigate to https://peptone.io/dspp and open up Database statistics panel. All the database statistics are recomputed for dSPP every 14 days.
Yes. If you happen to have a newly assigned protein, please follow the submission procedure to BMRB. As only your data becomes available in BMRB and passes our quality checks we will include it in our repository.
Yes. We are actively developing on an expansion of dSPP, which will contain additional experimental data to model protein stability and local dynamics.
We want to acknowledge Dr. Wenwei Zheng (NIDDK, US), Dr. Ruud Scheek (University of Groningen, NL) and Dr. Xavier Periole (Aarhus University, DK) for insightful comments and editorial suggestions concerning our dSPP paper.
François Chollet of Keras / Google is greatly acknowledged for insightful feedback on database interface and straightforward suggestions concerning Keras integration.
We extend sincere thanks to Alison Lowndes, Carlo Ruiz and Dr. Adam Grzywaczewski, (NVIDIA Corporation) for facilitating collaboration and access to DGX-1 supercomputer.
Jon Wedell (BMRB) is greatly acknowledged for technical support with NMR resonance assignment retrieval from BMRB.
We thank Dr. Frans A.A. Mulder (Aarhus University, DK) and Dr. Predrag Kukic (University of Cambridge, UK) for providing structural ensemble models of MOAG-4.
Lastly, we want to greatly acknowledge Mark Berger (NVIDIA Corporation) for overwhelming support throughout the execution of this project.