AUTO-MUTE
Stability Changes (ΔΔG) Documentation
Introduction
Stability Changes (ΔΔG) is an
automated server for predicting the impact of single amino acid
replacements on protein stability due to thermal denaturation. The
models available were trained
with a slightly modified version of a diverse mutant protein dataset,
previously reported in Capriotti
et
al. (2005) and obtained by searching the ProTherm
database. The original
dataset consisted of 1948 single amino acid substitutions in
58 proteins with solved structures in the Protein
Data Bank (PDB), and
structures were chosen so that they were uniformly distributed
among the four major SCOP
structural classifications. After
removing mutants associated with two protein structures containing
gaps, as well as mutants at positions with fewer than six nearest
neighbors in tessellated protein structures, our modified dataset consists of 1925
single point mutants in 55 protein structures.
Depending upon the needs of the investigator, two supervised classification models (for predicting only the sign of ΔΔG) and two regression models (for predicting the actual value of ΔΔG) are available. The supervised classification models are Random Forest (RF) and Support Vector Machine (SVM), while the regression models are Tree regression (REPTree) and SVM regression (SVMreg). The choice between the two classification (or regression) models rests with the algorithmic preference of the researcher. Although the two models of each type can be ranked based on the performance measures detailed in the Results section below, these measures are similar in magnitude and are not necessarily indicative of the predictive accuracy of the models on an independent test set of single point mutants that have yet to be experimentally investigated. Additionally, since the models were developed using implementations of four different machine learning algorithms, predictions for a specific mutant will occasionally be inconsistent among the models, especially in cases where the sign of ΔΔG is predicted with low confidence.
Lastly, predictions can only be performed for a mutant if it represents a single amino acid substitution in a tessellatable single chain of a solved protein structure with a coordinate file available in the PDB. Specifically, Delaunay tessellation of the structure can be performed if the PDB file contains consecutive primary sequence numbering in the ATOM lines (i.e., no gaps in the structure) starting with a non-negative integer, the alpha-carbon atomic coordinates are available for all the constituent amino acids, and no alternative conformations exist for the alpha-carbon atoms. In addition to X-ray structures, NMR structure files are potentially tessellatable if they consist of a single minimized average structure as opposed to multiple models.
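The structural requirements listed above can be checked mechanically before a job is submitted. The following is a minimal sketch under stated assumptions, not the server's own validation code: the names `tessellation_problems` and `atom_line` are hypothetical, the parser assumes standard fixed-column PDB ATOM records, and insertion codes are ignored.

```python
def tessellation_problems(pdb_text, chain_id):
    """Return a list of problems found; an empty list means all checks passed."""
    lines = [ln.ljust(80) for ln in pdb_text.splitlines()]
    problems = []
    # More than one MODEL record suggests a multi-model (e.g. NMR ensemble) file.
    if sum(ln.startswith("MODEL") for ln in lines) > 1:
        problems.append("multiple models")
    residues = []   # residue numbers in file order
    has_ca = set()  # residue numbers with an alpha-carbon record
    for ln in lines:
        if ln.startswith("ENDMDL"):
            break  # inspect the first model only
        if not ln.startswith("ATOM  ") or ln[21] != chain_id:
            continue
        resseq = int(ln[22:26])  # columns 23-26: residue sequence number
        if not residues or residues[-1] != resseq:
            residues.append(resseq)
        if ln[12:16].strip() == "CA":  # columns 13-16: atom name
            if ln[16] != " ":  # column 17: alternate location indicator
                problems.append(f"alternative conformation for CA of residue {resseq}")
            has_ca.add(resseq)
    if not residues:
        return ["no ATOM records for chain"]
    if residues[0] < 0:
        problems.append("numbering starts with a negative integer")
    for prev, cur in zip(residues, residues[1:]):
        if cur != prev + 1:
            problems.append(f"non-consecutive numbering between {prev} and {cur}")
    for r in residues:
        if r not in has_ca:
            problems.append(f"missing CA for residue {r}")
    return problems


def atom_line(serial, name, altloc, resname, chain, resseq):
    """Hypothetical helper: build a fixed-column ATOM record for the demo below."""
    return (f"ATOM  {serial:5d} {name:<4}{altloc}{resname:>3} {chain}{resseq:4d}"
            f"    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


if __name__ == "__main__":
    demo = "\n".join([
        atom_line(1, " CA ", " ", "ALA", "A", 1),
        atom_line(2, " CA ", " ", "GLY", "A", 3),  # residue 2 is missing
    ])
    print(tessellation_problems(demo, "A"))
```

The function reports every problem it finds instead of stopping at the first, so a single pass shows everything that would have to be fixed in the coordinate file.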
Methods
Among the numerous factors influencing model performance are the dataset size and composition utilized for training, the type of attributes (i.e., predictors) used as components for the feature vectors characterizing the mutants in the dataset, and the machine learning algorithm chosen for model building. AUTO-MUTE utilizes attributes that include EC score at the mutated position (mutant protein residual score), ordered EC scores of the six nearest neighbors to the mutated position, native and replacement amino acid identities at the mutated position, ordered amino acid identities at the six nearest neighbors, and ordered differences between the primary sequence positions of the nearest neighbors and the mutated residue (see AUTO-MUTE home page for details). In order for direct comparisons to be made with results obtained by Capriotti et al. (2005), we initially included the following attributes for each single point mutant: relative solvent accessibility (RSA), as well as temperature and pH of the experimental conditions under which ΔΔG measurements were obtained. However, the models provided on this server for making predictions were trained by replacing RSA with the following Delaunay tessellation-derived attributes: mean volume and tetrahedrality of the simplices in which the mutated position serves as a vertex, location (surface, undersurface, or buried) of the mutated position, number of edge contacts that the mutated position has with surface positions, and secondary structure (helix, strand, coil, or turn) of the mutated position (see AUTO-MUTE home page for details). As described in the Results section below, there is a negligible difference in model performance as a result of such an alteration in the training set mutant feature vectors.
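To make the two geometric attributes concrete, the sketch below computes the volume and tetrahedrality of a single simplex and averages them over the simplices that share a vertex. It rests on assumptions: this page does not state the tetrahedrality formula, so the code uses a commonly cited definition, T = Σ(l_i − l_j)² / (15 · l_mean²) over the six edge lengths, which is zero for a regular tetrahedron. The function names are hypothetical, and the simplices are assumed to come from a tessellation computed elsewhere.

```python
import math
from itertools import combinations


def simplex_volume(a, b, c, d):
    """Volume of the tetrahedron with vertices a, b, c, d (each an (x, y, z) tuple)."""
    u = [a[i] - d[i] for i in range(3)]
    v = [b[i] - d[i] for i in range(3)]
    w = [c[i] - d[i] for i in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det) / 6.0


def tetrahedrality(a, b, c, d):
    """Assumed definition: sum of squared edge-length differences / (15 * mean^2)."""
    edges = [math.dist(p, q) for p, q in combinations((a, b, c, d), 2)]
    mean = sum(edges) / len(edges)
    return sum((li - lj) ** 2 for li, lj in combinations(edges, 2)) / (15 * mean ** 2)


def mean_simplex_features(simplices):
    """Average volume and tetrahedrality over a list of 4-vertex simplices."""
    n = len(simplices)
    mean_vol = sum(simplex_volume(*s) for s in simplices) / n
    mean_tet = sum(tetrahedrality(*s) for s in simplices) / n
    return mean_vol, mean_tet


if __name__ == "__main__":
    regular = ((1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1))
    corner = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
    print(mean_simplex_features([regular, corner]))
```

In practice the input list would hold only the simplices in which the mutated position is a vertex, so that the two averages describe the local environment of that residue.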
Required Inputs and Server Outputs
A valid PDB accession code and a specific chain (use @ if null) are required for the structure of the protein containing the single residue substitution whose impact on stability (ΔΔG sign or value) is to be predicted. The mutation under consideration must be supplied in the form (native residue)(position number from PDB file ATOM lines)(replacement residue), for example D25E; however, if an underscore "_" is used in place of the replacement residue, for example D25_, predictions are provided for all 19 amino acid substitutions at the requested position. The final inputs are the temperature (°C, 0-100) and pH (0-14) conditions under which predictions are to be obtained. In addition to reproducing the inputs, the output data include either the predicted sign of ΔΔG along with a confidence level (classification) or the predicted value of ΔΔG (regression), mean volume and tetrahedrality for the mutated position, location and number of edge contacts that the mutated position has with surface positions, and secondary structure of the mutated position.
Results
Based on a 20-fold cross-validation procedure, performance of the algorithms is evaluated by calculating the following values. In the case of supervised classification, each mutant belongs to either the “increased stability” or “+” class if experimental ΔΔG ≥ 0, or the “decreased stability” or “–” class if ΔΔG < 0. With the understanding that TP (TN) = total number of correctly predicted “increased stability” (“decreased stability”) mutants, and FN (FP) = total number of respectively misclassified mutants, the overall accuracy is defined as Q = (TP + TN) / (TP + TN + FP + FN). Also, for the “increased stability” class, S(+) = sensitivity = TP / (TP + FN) and P(+) = precision = TP / (TP + FP), while for the “decreased stability” class, S(–) = TN / (TN + FP) and P(–) = TN / (TN + FN). Finally, the following two measures are calculated due to their robustness with respect to unequal class distributions: the balanced error rate is defined as BER = 0.5 × [FN / (FN + TP) + FP / (FP + TN)], and the Matthews correlation coefficient is given by MCC = (TP × TN – FP × FN) / [(TP + FN)(TP + FP)(TN + FN)(TN + FP)]^(1/2).
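The classification measures defined above translate directly into code. The sketch below is illustrative only: the function name `classification_metrics` and the example counts are hypothetical. Returning NaN for an undefined ratio and 0 for an MCC with a zero denominator are conventions assumed here, not choices stated by the source.

```python
import math


def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero (assumed convention)."""
    return num / den if den else float("nan")


def classification_metrics(tp, tn, fp, fn):
    """Compute Q, S(+), P(+), S(-), P(-), BER and MCC from confusion counts."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "Q": _ratio(tp + tn, tp + tn + fp + fn),
        "S+": _ratio(tp, tp + fn),
        "P+": _ratio(tp, tp + fp),
        "S-": _ratio(tn, tn + fp),
        "P-": _ratio(tn, tn + fn),
        "BER": 0.5 * (_ratio(fn, fn + tp) + _ratio(fp, fp + tn)),
        # Assumed convention: report MCC as 0 when its denominator vanishes.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts, not results from the server.
    print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
    # A predictor that labels every mutant "-" on 20 "+" and 122 "-" mutants.
    print(classification_metrics(tp=0, tn=122, fp=0, fn=20))
```

The second call shows why BER and MCC are reported alongside Q. On a hypothetical set with 20 "+" and 122 "–" mutants, a predictor that always answers "–" reaches Q ≈ 0.86, yet its BER is 0.5 and its MCC is 0.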
In the case of regression,
model performance is evaluated by
calculating the Pearson correlation coefficient (r) of the predicted and
experimental ΔΔG
values, the equation of the regression
line, and the standard error.
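The regression measures can be sketched in the same way. Two points are assumptions here because the text does not specify them: the line is fitted with predicted ΔΔG as the dependent variable and experimental ΔΔG as the independent variable, and "standard error" is read as the standard error of the estimate, sqrt(SSE / (n − 2)). The function name `regression_summary` is hypothetical.

```python
import math


def regression_summary(experimental, predicted):
    """Return (r, slope, intercept, std_err) for predicted vs. experimental values."""
    n = len(experimental)
    mx = sum(experimental) / n
    my = sum(predicted) / n
    sxx = sum((x - mx) ** 2 for x in experimental)
    syy = sum((y - my) ** 2 for y in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(experimental, predicted))
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    slope = sxy / sxx               # least-squares regression line
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2
              for x, y in zip(experimental, predicted))
    std_err = math.sqrt(sse / (n - 2))  # assumed: standard error of the estimate
    return r, slope, intercept, std_err


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    exp_ddg = [0.0, 1.0, 2.0, 3.0]
    pred_ddg = [0.0, 1.0, 1.0, 2.0]
    print(regression_summary(exp_ddg, pred_ddg))
```

The function needs at least three data points, and both inputs must vary; otherwise the divisions above are undefined.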
Finally, an independent test set of 142
mutants (20 "+" and 122 "–") in 18
protein structures was collected from the ProTherm
database. None of the test set mutants appear in the training dataset,
and 14 of the protein structures are unique to the test set. A
validation study was performed, whereby the test set mutants were each
blindly predicted by the server's classification models, with the
results tabulated below.
A majority of the independent test set mutants (134 mutants: 12 "+" and 122 "–") are associated with the 14 protein structures that are unique to the test set. For this particular subset, prediction results are Q = 0.94, BER = 0.26, and MCC = 0.59 using the RF model (8/134 mutants incorrectly predicted), and Q = 0.87, BER = 0.26, and MCC = 0.38 using the SVM model (18/134 mutants incorrectly predicted).
References