Christopher Lockhart

Protein Structure Analysis

Homework 2

The instructions for assignment 2 were to select a human enzyme sequence without known 3-dimensional structure from a public database with homologs in PDB with at least 30% but no more than 70% sequence identity. The 3-dimensional structure of the query sequence will then be predicted using two algorithms of choice and analyzed. A residue substitution of interest will then be performed, and the 3-dimensional structure will be predicted again from the newly mutated sequence. The impact of the residue substitution on the predicted conformations will be analyzed. The purpose of this assignment is to gain insight into structural prediction tools and algorithms and assess in a real example how mutations potentially alter protein conformation.

1) Selecting the query sequence

UniProtKB was searched using NOT database:(type:pdb) AND organism:"Homo sapiens [9606]" AND keyword:enzyme to find good query sequence candidates for analysis. In accordance with the assignment, these sequences were all human enzymes without known PDB structures. These sequences were then manually input into PDB and a BLAST analysis was performed to determine the sequence identities between the query sequences and target sequences in PDB. Any target sequences from structures in PDB with sequence identities less than 30% or greater than 70% were discarded.
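The identity window used to discard hits can be expressed as a simple filter. The following is a minimal sketch, assuming the BLAST hits have already been parsed into records. The `BlastHit` class, the `filter_by_identity` function, and the example hits (placeholder IDs and identities) are hypothetical and are not the actual search results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastHit:
    """One BLAST hit against PDB (hypothetical record layout)."""
    pdb_id: str
    percent_identity: float


def filter_by_identity(hits, low=30.0, high=70.0):
    """Keep hits whose sequence identity lies within [low, high] percent."""
    return [h for h in hits if low <= h.percent_identity <= high]


# Placeholder hits for illustration only; not real search output.
hits = [
    BlastHit("HIT1", 25.0),
    BlastHit("HIT2", 34.0),
    BlastHit("HIT3", 39.0),
    BlastHit("HIT4", 82.0),
]
kept = filter_by_identity(hits)
print([h.pdb_id for h in kept])
```

Hits below 30% or above 70% identity are dropped, and the boundary values themselves are kept, matching the rule stated above.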

In the end, the sequence for methylenetetrahydrofolate reductase was selected. The sequence is 656 amino acids long, although a BLAST search with default parameters found sequence similarity to existing PDB structures over only approximately 279-282 residues. Twelve PDB entries matched the query sequence, but these 12 structures comprised only 3 unique sequences. These PDB entries had between 34% and 39% sequence identity with the query sequence. For more information regarding the alignments, please see the full summary from PDB. The complete query protein sequence is provided below.

Query Sequence

MVNEARGNSSLNPCLEGSASSGSESSKDSSRCSTPGLDPERHERLREKMRRRLESGDKWF SLEFFPPRTAEGAVNLISRFDRMAAGGPLYIDVTWHPAGDPGSDKETSSMMIASTAVNYC GLETILHMTCCRQRLEEITGHLHKAKQLGLKNIMALRGDPIGDQWEEEEGGFNYAVDLVK HIRSEFGDYFDICVAGYPKGHPEAGSFEADLKHLKEKVSAGADFIITQLFFEADTFFRFV KACTDMGITCPIVPGIFPIQGYHSLRQLVKLSKLEVPQEIKDVIEPIKDNDAAIRNYGIE LAVSLCQELLASGLVPGLHFYTLNREMATTEVLKRLGMWTEDPRRPLPWALSAHPKRREE DVRPIFWASRPKSYIYRTQEWDEFPNGRWGNSSSPAFGELKDYYLFYLKSKSPKEELLKM WGEELTSEESVFEVFVLYLSGEPNRNGHKVTCLPWNDEPLAAETSLLKEELLRVNRQGIL TINSQPNINGKPSSDPIVGWGPSGGYVFQKAYLEFFTSRETAEALLQVLKKYELRVNYHL VNVKGENITNAPELQPNAVTWGIFPGREIIQPTVVDPVSFMFWKDEAFALWIERWGKLYE EESPSRTIIQYIHDNYFLVNLVDNDFPLDNCLWQVVEDTLELLNRPTQNARETEAP

2) Modeling the query sequence

Given that homologous proteins exist for the query sequence in structure databases, the structure of the sequence can be predicted using various structure prediction programs. Specifically, SwissModel and Phyre were used, both accessed through their web servers. Both programs accepted the query sequence as a string of 1-letter amino acid codes and used that sequence to perform modeling. Once modeling was completed, the Structural Analysis and Verification Server (SAVES) from the NIH MBI Laboratory for Structural Genomics and Proteomics was used to analyze the structures produced for the query sequence. SAVES runs a variety of checks, including ProCheck, to verify the protein structure in an uploaded PDB file.

SwissModel predicts structure based on homology modeling. It is a knowledge-based approach which uses known structures and their sequences to predict the structure of a query sequence. In this particular case, SwissModel generated a structure for residues 48 to 340 of the query sequence with a QMEAN4 score of 0.7, which essentially denotes how reliable SwissModel believes the structure is (the closer this value is to 1, the more reliable the model should be). The QMEAN4 score is determined from multiple energetic factors, including all-atom pairwise energy and torsional angle energy. The QMEAN Z-score of the protein, which SwissModel reports alongside the QMEAN4 score, was -1.251. This low Z-score indicates that the model may not be a reliable representation of the query sequence's structure. SwissModel also allows users to identify which regions of the sequence were likely modeled poorly. The full results from SwissModel can be found here. When SwissModel's structure was analyzed using SAVES, the following results were produced:
- Procheck indicated that 1 residue was in a "disallowed" region of the Ramachandran plot. The Ramachandran plot relates the phi and psi angles of a protein's backbone. Alpha-helices and beta-sheets occupy well-defined regions of the plot. The fact that 1 residue, or 0.4% of residues, is in a region that is not generally observed suggests that an artifact is present in SwissModel's structure. Furthermore, analysis of residue properties, bond angles, and planar groups also identified errors within the structure.
- What_check, another structural verification tool, confirmed that there are side chain planarity problems, out-of-plane bonds, abnormal atomic distances, and abnormal side chain rotations in the model. This tool also reported a very low backbone Z-score (-4.319), which indicates that the structure is abnormal and likely incorrect.
- Other analyses performed by SAVES included Verify_3D, Errat, and Prove. Verify_3D passed 100% of residues, Errat produced a relatively acceptable overall quality score of 71.631, and Prove identified 6.7% of residues as outliers.
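
For reference, phi and psi are dihedral (torsion) angles defined by four consecutive backbone atoms (C-N-CA-C for phi and N-CA-C-N for psi). A minimal sketch of how such an angle can be computed from four points is given below. The function names and the toy coordinates are illustrative assumptions, not values taken from the SwissModel structure, and this is not a description of how Procheck itself is implemented.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p1, p2, p3, p4):
    """Signed dihedral angle (degrees) about the p2-p3 bond."""
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)  # normal of the plane through p2, p3, p4
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not real atoms): first and last points on the same side
# of the central bond (cis), on opposite sides (trans), and at a right angle.
cis = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
trans = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0))
right = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(cis, abs(trans), right)
```

A Ramachandran plot is then a scatter of the (phi, psi) pair computed this way for each residue.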

Phyre, which stands for Protein Homology/analogY Recognition Engine, predicts structure based on fold recognition. In this process, the query sequence is threaded through multiple known structures in order to create a model. In this particular case, Phyre produced a series of structures with estimated precisions from 95% to 100% for the query sequence. The models had sequence identities of 10-37% and ranged from 202 to 344 residues in length. A model was produced for residues 56 to 341 of the query sequence. A useful feature of Phyre's output was that it indicated disordered regions of the protein based on secondary structure analysis. One would expect these regions to be poorly modeled due to their lack of secondary structure. To validate Phyre's model, SAVES was used. SAVES provided the following results:
- Procheck did not produce any errors for dihedral angles on the Ramachandran plot. Errors were returned for residue properties and bond angles.
- What_check produced errors regarding short atomic distances and abnormally oriented side chains. However, this analysis also returned errors regarding the PDB file produced by Phyre, so there may be some unaddressed compatibility issues between Phyre's output and the PDB format that What_check requires.
- Verify_3D passed 95.76% of residues, Errat produced a quality score of 59.623 (not as good as SwissModel's model), and Prove identified 7% of residues as outliers.
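
The residue ranges quoted for the two models imply how much of the 656-residue query each model actually covers. The short calculation below is a sketch using only the ranges and length stated in this report; the variable and function names are illustrative.

```python
QUERY_LENGTH = 656  # length of the query sequence, as stated in the report


def span_length(first, last):
    """Number of residues in an inclusive residue range."""
    return last - first + 1


# Modeled residue ranges reported for each program.
model_spans = {"SwissModel": (48, 340), "Phyre": (56, 341)}

coverage = {name: span_length(*span) for name, span in model_spans.items()}
for name, n_residues in coverage.items():
    percent = 100.0 * n_residues / QUERY_LENGTH
    print(f"{name}: {n_residues} residues ({percent:.1f}% of the query)")
```

Both models cover fewer than half of the query residues (293 and 286 of 656), so the validation results apply only to that modeled region.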

From the above analyses, it seems that both models had some issues. Procheck, one of the more widely used structure validation programs, seems to produce more errors for SwissModel's model than for Phyre's. However, Verify_3D, Errat, and Prove all produced better scores for the SwissModel model than for the Phyre model. Given this discrepancy, it is unclear which of the two structures is a better representation of the query sequence. Neither model is likely a perfect representation of the query sequence. This may in part be due to the relatively limited sequence identity between the query sequence and the sequences of known PDB structures. Further analysis with other structure prediction software will be needed to reach a more concrete answer.

Nevertheless, a full 3-dimensional comparison of the two structures can be found in Jmol below. The SwissModel structure is in white, whereas the Phyre structure is in red. From this view, it is easy to see that although most of the backbone structure agrees between the two models, significant portions of the models are not in agreement. As residues 1-656 (the whole length) of the query sequence have only one function, methylenetetrahydrofolate reductase, listed in UniProtKB's sequence annotation, particular binding sites of biological interest could not be identified for analysis. Instead, I chose to feature a few alpha-helix structures that line up well between the two models. These structures can be highlighted in Jmol using the button entitled "Highlight Structures of Interest". CE Alignment gave the two structures a Z-score of 7.5 and an RMSD value of 2.0 angstroms, which indicates that the structures match up well regardless of any artificial bond or angle parameters created during modeling.
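
As a reminder of what the RMSD value measures, the sketch below computes the root-mean-square deviation between two sets of paired atomic coordinates. It assumes the structures are already superimposed and that atoms are paired one-to-one, which is the hard part that CE Alignment solves. The function name and the toy coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom in the second set is displaced by 2.0 along y.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model_b = [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
print(rmsd(model_a, model_b))  # every pair is 2.0 apart, so the RMSD is 2.0
print(rmsd(model_a, model_a))  # identical coordinates give 0.0
```

Because the deviations are squared before averaging, a few badly matching regions can dominate the value even when most of the backbone agrees.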

3) Modeling a query sequence mutant

At residue 222 of the query sequence is a common natural polymorphism in which alanine is replaced by valine. This mutation is documented to make the protein more thermolabile and decrease its activity by approximately 50%. Interestingly, this mutation decreases the risk for adult acute leukemia. The mutated sequence is provided below. The A222V mutation is highlighted in red.

Mutated Sequence

MVNEARGNSSLNPCLEGSASSGSESSKDSSRCSTPGLDPERHERLREKMRRRLESGDKWF SLEFFPPRTAEGAVNLISRFDRMAAGGPLYIDVTWHPAGDPGSDKETSSMMIASTAVNYC GLETILHMTCCRQRLEEITGHLHKAKQLGLKNIMALRGDPIGDQWEEEEGGFNYAVDLVK HIRSEFGDYFDICVAGYPKGHPEAGSFEADLKHLKEKVSAGVDFIITQLFFEADTFFRFV KACTDMGITCPIVPGIFPIQGYHSLRQLVKLSKLEVPQEIKDVIEPIKDNDAAIRNYGIE LAVSLCQELLASGLVPGLHFYTLNREMATTEVLKRLGMWTEDPRRPLPWALSAHPKRREE DVRPIFWASRPKSYIYRTQEWDEFPNGRWGNSSSPAFGELKDYYLFYLKSKSPKEELLKM WGEELTSEESVFEVFVLYLSGEPNRNGHKVTCLPWNDEPLAAETSLLKEELLRVNRQGIL TINSQPNINGKPSSDPIVGWGPSGGYVFQKAYLEFFTSRETAEALLQVLKKYELRVNYHL VNVKGENITNAPELQPNAVTWGIFPGREIIQPTVVDPVSFMFWKDEAFALWIERWGKLYE EESPSRTIIQYIHDNYFLVNLVDNDFPLDNCLWQVVEDTLELLNRPTQNARETEAP
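
The substitution itself is a one-character edit to the sequence string. A minimal sketch is shown below; it checks that the expected wild-type residue is present before replacing it. The function name and its `first_residue` parameter are assumptions made for illustration, and the example uses only the fourth 60-residue block of the query sequence (residues 181-240) rather than the full sequence.

```python
import re


def apply_substitution(sequence, mutation, first_residue=1):
    """Apply a point substitution such as 'A222V'.

    first_residue is the residue number of sequence[0], so that a fragment
    can be addressed with full-length numbering.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - first_residue
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# Residues 181-240 of the query sequence (fourth block as listed above).
wild_type_block = "HIRSEFGDYFDICVAGYPKGHPEAGSFEADLKHLKEKVSAGADFIITQLFFEADTFFRFV"
mutant_block = apply_substitution(wild_type_block, "A222V", first_residue=181)
print(mutant_block)
```

Checking the wild-type residue first guards against off-by-one numbering errors, which are easy to make when a sequence is displayed in blocks.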

A priori knowledge of amino acid structures suggests that the substitution may not greatly impact the protein structurally, as valine has only two more methyl groups than alanine. However, regardless of the expected result of the A222V mutation, SwissModel and Phyre can be used to further analyze any structural impact the mutation has. SAVES can again be used to validate these structures.

SwissModel generated a model for residues 48 to 340 of the mutated sequence. This structure had a QMEAN4 score of 0.69, which means it was about as reliable as the structure generated from the unmutated query sequence. The QMEAN Z-score of this model was -1.354. The full SwissModel output can be accessed here. SAVES produced the following results for this model:
- Procheck indicated that 1 residue, or 0.4% of residues, was in a disallowed region of the Ramachandran plot. Furthermore, analysis of residue properties, bond angles, and planar groups also identified errors within the structure.
- What_check identified errors in side chain planarity, out-of-plane bonds, interatomic distances, and side chain rotations in the model. This tool also reported a very low backbone Z-score (-4.320), which indicates that the structure is abnormal.
- Other analyses included Verify_3D, Errat, and Prove. Verify_3D passed 100% of residues, Errat produced an overall quality score of 71.631, and Prove identified 6.7% of residues as outliers.

Phyre generated a model for residues 56 to 341 of the mutated sequence. The full output can be viewed here. SAVES produced the following results for this model:
- Procheck produced errors for residue properties (citing 26 bad contacts) and bond angles.
- What_check produced errors regarding short interatomic distances and abnormally oriented side chains.
- Verify_3D passed 95.76% of residues, Errat produced an overall quality score of 58.491, and Prove identified 7.2% of residues as outliers.

In general, this comparison continues to illustrate that neither SwissModel nor Phyre is perfect at predicting the sequence's 3-dimensional structure. CE Alignment calculated a Z-score of 7.5 and an RMSD value of 2.0 angstroms between these two models. This indicates that although the models may have structural errors, they are generally similar. As above, a Jmol view of the two structures is provided below. SwissModel is again shown in white, and Phyre is in red. The A222V residue can be highlighted by selecting the "Highlight A222V" button.

4) Comparison of wild-type and mutant models

Below is a comparison of the wild-type (WT) and mutant (MT) sequence models from SwissModel and Phyre. The superimposed structures were generated using CE Alignment. Importantly, CE Alignment reported an RMSD of 0.0 angstroms and a Z-score of 7.9 for both SwissModel's and Phyre's WT/MT superimposed structures. This is consistent with the initial prediction that the mutation would not drastically affect the models. When the mutated residue is highlighted using the "Highlight A222V" button, it is easy to see that some degree of empty space surrounds residue 222. This empty space may allow the two additional methyl groups of the A222V substitution to be accommodated with relatively little structural impact. In support of this, SAVES validation of the mutant sequence models versus the wild-type sequence models did not reveal any significant differences in the errors produced. The wild-type models of the query sequence are presented in white, whereas the mutant models are presented in red. From Jmol, one can easily see that the structures are very superimposable (it is virtually impossible to see each protein separately). The Jmol representation slightly suggests that SwissModel produces wild-type and mutant models that are less congruent in structure than those produced by Phyre, although this observation is qualitative and has no quantitative evidence to support it.
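
The validation scores reported above for the wild-type and mutant models can also be compared directly. The sketch below tabulates the differences using only numbers quoted in this report; the dictionary layout and names are illustrative assumptions.

```python
# (program, metric) -> (wild-type value, mutant value), as quoted in the report
scores = {
    ("SwissModel", "QMEAN4"): (0.7, 0.69),
    ("SwissModel", "QMEAN Z-score"): (-1.251, -1.354),
    ("SwissModel", "What_check backbone Z-score"): (-4.319, -4.320),
    ("SwissModel", "Errat"): (71.631, 71.631),
    ("SwissModel", "Prove outliers (%)"): (6.7, 6.7),
    ("Phyre", "Verify_3D (%)"): (95.76, 95.76),
    ("Phyre", "Errat"): (59.623, 58.491),
    ("Phyre", "Prove outliers (%)"): (7.0, 7.2),
}

# Difference (mutant minus wild-type), rounded to the precision of the inputs.
deltas = {key: round(mt - wt, 3) for key, (wt, mt) in scores.items()}

for (program, metric), delta in deltas.items():
    print(f"{program:10s} {metric:30s} {delta:+.3f}")

largest = max(deltas, key=lambda key: abs(deltas[key]))
print("largest shift:", largest, deltas[largest])
```

The largest change is the drop of about 1.1 in Phyre's Errat score. Each metric is on its own scale, so the raw differences should not be compared across metrics without care.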

5) Conclusion

Many algorithms and web-accessible programs are currently available to the public to perform protein structure prediction. Although specific algorithms may produce similar results, differences will emerge in the final structures produced. These final structures are also greatly impacted by protein structure data that has been resolved from X-ray crystallography and NMR spectroscopy. Analyzing regions of functional interest or local structures can be used to help validate the structures produced. Highly conserved sequences of proteins should have conserved structure, which helps make modeling possible.

Modeling is also a useful tool for identifying the impact sequence mutations have on protein structure. In this example, the A222V mutation did not greatly impact the predicted structure of the protein, although the mutation itself has been documented to have a very specific biological effect. More modeling will need to be performed to better gauge the impact this mutation has on the protein structure. The fact that the A222V mutation has been frequently referenced and has a noted functional effect further suggests that the structures produced during modeling may not be correct. Molecular dynamics and energy minimization tools could also be used to further refine the structures produced from modeling and to better probe their native state conformations. Molecular dynamics simulations should also help pull the models out of conformations with bad angle, bond, or torsional parameters, as these conformations should be energetically unfavorable.
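
To illustrate why minimization tends to relieve strained geometry, the following toy sketch applies steepest descent to a single harmonic bond term, E = 0.5 * k * (r - r0)^2. The force constant, equilibrium length, starting length, and step size are arbitrary illustrative values. Real force fields and minimizers involve many coupled bond, angle, torsional, and non-bonded terms.

```python
def bond_energy(r, k, r0):
    """Harmonic bond energy for bond length r."""
    return 0.5 * k * (r - r0) ** 2


def bond_gradient(r, k, r0):
    """Derivative of the harmonic bond energy with respect to r."""
    return k * (r - r0)


def minimize_bond(r, k=1.0, r0=1.5, step=0.1, n_steps=200):
    """Steepest descent: repeatedly move r against the energy gradient."""
    for _ in range(n_steps):
        r -= step * bond_gradient(r, k, r0)
    return r


# Arbitrary illustrative values: a bond stretched to 2.5 with equilibrium 1.5.
start = 2.5
relaxed = minimize_bond(start)
print(f"energy before: {bond_energy(start, 1.0, 1.5):.4f}")
print(f"energy after:  {bond_energy(relaxed, 1.0, 1.5):.4f}")
print(f"relaxed bond length: {relaxed:.4f}")
```

The stretched bond relaxes toward its equilibrium length because that is where the energy is lowest, which is the sense in which strained model geometry is energetically unfavorable.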

Written by Chris Lockhart for Protein Structure Analysis, BINF 731. Last updated 12/01/2011.