AUTO-MUTE
Stability Changes (ΔΔGH2O) Documentation
Introduction
Stability Changes (ΔΔGH2O) is an automated server for predicting the impact of single amino acid replacements on protein stability as measured by denaturant denaturation. The available models were trained with a modified version of a mutant protein dataset that was obtained by searching the ProTherm database, previously reported in Saraboji et al. (2006) (average assignment method) and also used by Huang et al. (2007) (CART method). The original dataset consisted of 2204 single amino acid substitutions in 88 proteins with solved structures in the Protein Data Bank (PDB). After removing mutants associated with structures containing gaps, as well as mutants at positions with fewer than six nearest neighbors in the tessellated protein structures, the modified dataset used here consists of 1962 single point mutants in 77 protein structures.
Depending upon the needs of the investigator, two supervised classification models (for predicting only the sign of ΔΔGH2O) and two regression models (for predicting the actual value of ΔΔGH2O) are available. The classification models are Random Forest (RF) and Support Vector Machine (SVM); the regression models are tree regression (REPTree) and SVM regression (SVMreg). The choice between the two classification (or regression) models rests with the algorithmic preference of the researcher. Although the two models of each type can be ranked by the performance measures detailed in the Results section below, these measures are similar in magnitude and are not necessarily indicative of predictive accuracy on an independent test set of single point mutants that have yet to be experimentally investigated. Additionally, since the models were developed using implementations of four different machine learning algorithms, predictions for a specific mutant will occasionally be inconsistent among the models, especially where the sign of ΔΔGH2O is predicted with low confidence.
Lastly, predictions can only be performed for a mutant that represents a single amino acid substitution in a tessellatable single chain of a solved protein structure with a coordinate file available in the PDB. Specifically, Delaunay tessellation of the structure can be performed if the PDB file contains consecutive primary sequence numbering in the ATOM lines (i.e., no gaps in the structure) starting with a non-negative integer, the alpha-carbon atomic coordinates are available for all the constituent amino acids, and no alternative conformations exist for the alpha-carbon atoms. In addition to X-ray structures, NMR structure files are potentially tessellatable if they consist of a single minimized average structure as opposed to multiple models.
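The tessellation requirements listed above can be pre-checked before a job is submitted. The following is a minimal sketch in Python, not the server's own code: the function name `tessellation_problems` is hypothetical, the mapping of "@" to a blank chain field is an assumption about how the null chain is encoded, and residue insertion codes are not handled. It reads the fixed-column ATOM records of a PDB file and reports which of the stated conditions fail.

```python
def tessellation_problems(pdb_text, chain="@"):
    """Return the reasons a chain fails the checks (an empty list means it passes)."""
    # Assumption: "@" stands for a blank chain identifier in column 22.
    chain_id = " " if chain == "@" else chain
    order = []       # residue numbers in order of first appearance
    has_ca = {}      # residue number -> True once an alpha-carbon is seen
    alt_ca = set()   # residues whose alpha-carbon carries an altLoc flag
    n_models = 0
    for line in pdb_text.splitlines():
        if line.startswith("MODEL"):
            n_models += 1
            continue
        # Fixed-column ATOM record; the z coordinate ends at column 54.
        if not line.startswith("ATOM  ") or len(line) < 54:
            continue
        if line[21] != chain_id:
            continue
        res_num = int(line[22:26])  # insertion codes (column 27) are ignored here
        if res_num not in has_ca:
            has_ca[res_num] = False
            order.append(res_num)
        if line[12:16].strip() == "CA":
            has_ca[res_num] = True
            if line[16] != " ":
                alt_ca.add(res_num)

    problems = []
    if n_models > 1:
        problems.append("multiple models present")
    if not order:
        problems.append("no ATOM records for the requested chain")
        return problems
    if order[0] < 0:
        problems.append("numbering starts with a negative integer")
    for prev, cur in zip(order, order[1:]):
        if cur != prev + 1:
            problems.append(f"numbering gap between residues {prev} and {cur}")
    for res_num in order:
        if not has_ca[res_num]:
            problems.append(f"missing CA atom at residue {res_num}")
    for res_num in sorted(alt_ca):
        problems.append(f"alternative CA conformation at residue {res_num}")
    return problems
```

Calling `tessellation_problems(text, "A")` on the contents of a coordinate file returns an empty list when chain A satisfies all of the conditions checked here; passing this check does not guarantee that the server will accept the structure.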
Methods
Among the numerous factors influencing model performance are the size and composition of the training dataset, the type of attributes (i.e., predictors) used as components of the feature vectors characterizing the mutants, and the machine learning algorithm chosen for model building. AUTO-MUTE uses the following attributes: the EC score at the mutated position (mutant protein residual score), ordered EC scores of the six nearest neighbors to the mutated position, native and replacement amino acid identities at the mutated position, ordered amino acid identities at the six nearest neighbors, and ordered differences between the primary sequence positions of the nearest neighbors and the mutated residue (see AUTO-MUTE home page for details). The following attributes were also initially included for each single point mutant: relative solvent accessibility (RSA), as well as the temperature and pH of the experimental conditions under which the ΔΔGH2O measurements were obtained. However, the models provided on this server were trained by replacing RSA with the following Delaunay tessellation-derived attributes: mean volume and tetrahedrality of the simplices in which the mutated position serves as a vertex, location (surface, undersurface, or buried) of the mutated position, number of edge contacts that the mutated position has with surface positions, and secondary structure (helix, strand, coil, or turn) of the mutated position (see AUTO-MUTE home page for details). As described in the Results section below, this change to the training set feature vectors makes a negligible difference in model performance.
Required Inputs and Server Outputs
A valid PDB accession code and a specific chain (use @ if null) are required for the structure of the protein containing the single residue substitution whose impact on stability (ΔΔGH2O sign or value) is to be predicted.
The mutation under consideration must be supplied in the form (native residue)(position number from PDB file ATOM lines)(replacement residue), for example D25E. Using an underscore "_" in place of the replacement residue, for example D25_, returns predictions for all 19 amino acid substitutions at the requested position. The final inputs are the temperature (°C, 0-100) and pH (0-14) conditions under which predictions are to be obtained. In addition to reproducing the inputs, the output includes either the predicted sign of ΔΔGH2O along with a confidence level (classification) or the predicted value of ΔΔGH2O (regression), the mean volume and tetrahedrality for the mutated position, the location of the mutated position and its number of edge contacts with surface positions, and the secondary structure of the mutated position.
Results
In order to directly compare our results with those of the classification and regression tree (CART) approach of Huang et al. (2007), performance of the algorithms is evaluated by applying a 5-fold cross-validation procedure and calculating the following values. For supervised classification, each mutant belongs to the “increased stability” or “+” class if experimental ΔΔGH2O ≥ 0, or to the “decreased stability” or “–” class if ΔΔGH2O < 0. With TP (TN) denoting the total number of correctly predicted “increased stability” (“decreased stability”) mutants, and FN (FP) the total number of misclassified “increased stability” (“decreased stability”) mutants, the overall accuracy is defined as Q = (TP + TN) / (TP + TN + FP + FN), and the Matthews correlation coefficient is given by MCC = (TP × TN – FP × FN) / [(TP + FN)(TP + FP)(TN + FN)(TN + FP)]^(1/2). For regression, model performance is evaluated by calculating the mean absolute error (MAE) between the predicted and experimental ΔΔGH2O values.
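The class assignment and the three measures defined above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, returning 0.0 for a degenerate confusion matrix is a convention chosen here rather than one stated by the authors, and the counts in the usage lines are values picked for this example, not a confusion matrix reported for the server.

```python
import math


def stability_class(ddg):
    """Return '+' (increased stability) if ddG >= 0, otherwise '-'."""
    return "+" if ddg >= 0 else "-"


def overall_accuracy(tp, tn, fp, fn):
    """Q = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FN)(TP+FP)(TN+FN)(TN+FP))."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    if denom == 0:
        return 0.0  # convention chosen for this sketch; not from the source
    return (tp * tn - fp * fn) / denom


def mean_absolute_error(predicted, experimental):
    """Mean of |predicted - experimental| over paired ddG values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


# Illustrative counts (chosen for this example): 9 "+" and 58 "-" mutants, 8 errors.
tp, tn, fp, fn = 5, 54, 4, 4
print(round(overall_accuracy(tp, tn, fp, fn), 2))  # 0.88
print(round(matthews_cc(tp, tn, fp, fn), 2))       # 0.49
print(mean_absolute_error([1.0, -0.5, 2.0], [0.5, -1.0, 2.5]))  # 0.5
```

Because the "–" class dominates in this example, Q is high even though nearly half of the "+" mutants are misclassified; MCC penalizes that imbalance, which is why both values are worth reporting together.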
Next, Saraboji et al. (2006) reported an overall accuracy of 0.80, based on leave-one-out cross-validation (jackknife) applied in conjunction with their average assignment method. Similarly, applying the jackknife in conjunction with RF learning, with the RSA attribute included in the feature vectors representing the protein mutants in our modified dataset, resulted in an overall accuracy of 0.83.
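The two evaluation protocols mentioned in this section, 5-fold cross-validation and the jackknife, differ only in the number of folds: the jackknife is k-fold cross-validation with k equal to the number of mutants. The sketch below is a generic illustration, not the authors' procedure. The names `k_fold_splits` and `cross_validated_accuracy` are hypothetical, folds are contiguous blocks (so the data are assumed to be shuffled beforehand), and the majority-class learner in the usage lines is a stand-in for RF or SVM.

```python
from collections import Counter


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def cross_validated_accuracy(features, labels, k, fit, predict):
    """Pool held-out predictions over all folds and return overall accuracy."""
    correct = 0
    for train, test in k_fold_splits(len(labels), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, features[i]) == labels[i] for i in test)
    return correct / len(labels)


# Stand-in learner: always predict the most common training label.
def fit_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature):
    return model


labels = ["-", "-", "-", "+", "-"]
features = [None] * len(labels)  # unused by the stand-in learner
# Jackknife: k equals the number of mutants, so each fold holds out one mutant.
print(cross_validated_accuracy(features, labels, len(labels), fit_majority, predict_majority))  # 0.8
```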
Finally, an independent test set of 112 mutants (14 "+" and 98 "–") in 34 protein structures was collected from the ProTherm database. None of the test set mutants appear in the training dataset, and 18 of the protein structures are unique to the test set. A validation study was performed, whereby the test set mutants were each blindly predicted by the server's classification models, with the results tabulated below.
A majority of the independent test set mutants (67 mutants: 9 "+" and 58 "–") are associated with the 18 protein structures that are unique to the test set. For this subset, prediction results are Q = 0.88, BER = 0.26, and MCC = 0.49 using the RF model (8/67 mutants incorrectly predicted), and Q = 0.87, BER = 0.27, and MCC = 0.45 using the SVM model (9/67 mutants incorrectly predicted), where BER denotes the balanced error rate.
References