Proteins 77(Suppl 9): 133-137

PMID: 19722267

Template-based and free modeling by RAPTOR++ in CASP8

Introduction

Computational methods for protein structure prediction can be broadly classified into two categories: template-based modeling and template-free modeling. Although progress has been made in template-based modeling, several challenges remain, including the identification of correct templates and the generation of accurate alignments. Template-based modeling becomes unreliable when a target protein has low sequence identity (<30%) with its best templates [1]. Pieper et al. have shown that 76% of the models in MODBASE are from alignments in which the sequence and template share less than 30% sequence identity [2].

One of the major bottlenecks in template-free modeling is that the conformation space of even a small protein is too large to be explored efficiently. To overcome this, a number of methods have been proposed, including fragment assembly [3, 4] and lattice models [5, 6]. These methods reduce the search space by using a discrete representation of protein conformations, which may cost prediction accuracy regardless of the sampling algorithm and energy function used. A discrete representation may exclude native-like conformations from the search space, since even a small change in a single backbone angle can result in a totally different fold. Efficient sampling in a continuous space of protein-like conformations is still an important unsolved problem.

We have developed RAPTOR++, a new protein structure prediction method, to address the above-mentioned issues. RAPTOR++ is much more powerful than our threading program RAPTOR [7, 8]. In RAPTOR++, we generate sequence-template alignments using three different threading methods and rank them using a model quality assessment method. We then employ multiple templates to model an easy target. To deal with targets without identifiable templates, we developed a novel template-free modeling method that can efficiently sample protein conformations in a continuous space. In this article, we briefly describe RAPTOR++, summarize its predictions in CASP8, present some specific examples and discuss its strengths and weaknesses.

Materials and Methods

Threading

RAPTOR++ has three threading methods with different scoring functions and alignment algorithms. Two of the three methods are core-based while the third is not. As in the original RAPTOR, a core-based method does not allow gaps in core regions [5]. The two core-based methods differ in whether pairwise statistical potentials are used in their scoring functions; the non-core-based method does not use pairwise statistical potentials. In particular, the core-based pairwise threading method uses a scoring function consisting of a gap penalty, mutation score, secondary structure score, singleton score and pairwise score. The pairwise statistical potential is the one derived by McConkey et al. [6], and the other scoring terms are taken from RAPTOR [5]. The two non-pairwise methods use a similar scoring function without the pairwise potential. We trained three different sets of weight factors for these scoring functions using the method in [5]. The major reason for using the McConkey potential is to introduce “diversity”. The McConkey potential and the RAPTOR pairwise interaction potential differ mainly in how they define an inter-residue interaction. The McConkey potential also has its own parameters for the singleton score. Our original plan was to use these two potentials separately to generate alternative alignments. However, due to limited computing power, we used only the McConkey potential for CASP8. Very preliminary studies indicate that the McConkey potential has alignment accuracy similar to that of the RAPTOR potential, but that the two can generate alternative alignments for a given protein pair. Tested on the ProSup benchmark [7], the reference-dependent alignment accuracy of a single threading method is ~61.0%. This accuracy improves to ~68.0% when the three threading methods are combined using our model quality assessment method described below.
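The scoring functions described above can be illustrated with a minimal sketch. It assumes that the five terms are combined as a weighted linear sum, which the text suggests ("weight factors") but does not state explicitly. The class name, function name, term values and weights are all hypothetical and are not RAPTOR++ parameters.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ScoreTerms:
    """Per-alignment score terms named in the text (values here are made up)."""
    gap: float
    mutation: float
    secondary_structure: float
    singleton: float
    pairwise: float


def threading_score(terms: ScoreTerms, weights: dict, use_pairwise: bool) -> float:
    """Weighted sum of the terms; the pairwise term is skipped for non-pairwise methods.

    Assumption: a linear combination. The paper only says weight factors were trained.
    """
    total = 0.0
    for name, value in asdict(terms).items():
        if name == "pairwise" and not use_pairwise:
            continue
        total += weights[name] * value
    return total


# Hypothetical numbers, for illustration only.
terms = ScoreTerms(gap=-2.0, mutation=5.0, secondary_structure=1.5,
                   singleton=0.5, pairwise=1.0)
weights = {"gap": 1.0, "mutation": 1.0, "secondary_structure": 0.5,
           "singleton": 0.5, "pairwise": 2.0}

pairwise_score = threading_score(terms, weights, use_pairwise=True)
non_pairwise_score = threading_score(terms, weights, use_pairwise=False)
```

In this sketch the three threading methods would differ only in their weight sets and in the `use_pairwise` flag; the alignment algorithms themselves are not modeled.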

Model quality assessment

Unlike many methods that directly evaluate the quality of a 3D model [10–18], our model assessment method evaluates the absolute, global quality (measured by GDT-TS or TM-score [8]) of the 3D model implied by an alignment, without actually building that model with MODELLER. To the best of our knowledge, our method is the first to exploit only the evolutionary information in an alignment for model assessment. Because no three-dimensional model has to be built for the assessment, a lot of model-building time is saved. Trained on the RAPTOR-generated CASP6 data and tested on the CASP7 data, the method has a mean absolute error (MAE) of ~0.047 in predicted GDT-TS, and the Pearson correlation coefficient between the predicted and real GDT-TS is ~0.96. This model assessment method is built upon our previous work [9], which uses Support Vector Machines (SVM) to predict the number of correctly aligned positions in an alignment. To assess model quality, our method uses a set of alignment-based features: the distributions of the per-position sequence similarity score, contact capacity score and environmental fitness score; the distribution of gap lengths in the alignment; and the secondary structure score, solvent accessibility score and sequence identity.
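The two evaluation statistics quoted above, the MAE and the Pearson correlation coefficient between predicted and real GDT-TS, can be computed as in the following sketch. The function names and the toy values are illustrative assumptions; they are not CASP7 data.

```python
import math


def mean_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over paired values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up GDT-TS values on a 0-1 scale, for illustration only.
predicted_gdt = [0.50, 0.70, 0.30, 0.90]
real_gdt = [0.40, 0.80, 0.35, 0.85]

mae = mean_absolute_error(predicted_gdt, real_gdt)
r = pearson(predicted_gdt, real_gdt)
```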

Multiple-template method

If there are at least two very good templates for a target protein, we generate a multiple protein alignment and then build a 3D model from this alignment using MODELLER [10]. The multiple-template method has been exploited by several groups, such as Joo et al. [11] and Cheng [12], in recent CASP events. The major challenges are choosing good templates and generating multiple protein alignments. We always use the top two templates and then enumerate all possible combinations of the remaining top templates. To save computing time, at most five templates are used in any combination. For a given set of templates, TM-align [8] is used to generate a structure alignment between every pair of templates. Then, T-Coffee [13] is used to combine all the sequence-template alignments and structure alignments into a single multiple protein alignment. Because using multiple templates sometimes yields worse models, we used a very conservative strategy to rank the models built from them. A multiple-template-based model is assumed to be better than another multiple-template-based model or a single-template-based model if and only if it has both a better ProQ [14] value and a better DFIRE [15] value. However, this ranking method sometimes failed to identify the best models in CASP8. We chose TM-align, T-Coffee, ProQ and DFIRE because they are easily accessible. In the future, we will systematically compare our method with other similar methods.
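The template enumeration and the conservative ranking rule described above can be sketched as follows. The function names, the dictionary keys and the template labels are hypothetical. The sketch also assumes that a higher ProQ value and a lower DFIRE energy both mean "better", which is the usual convention for these tools but is not stated in the text.

```python
from itertools import combinations


def template_sets(ranked_templates, max_size=5):
    """Always keep the top two templates, then add every combination of the rest,
    up to max_size templates per set."""
    if len(ranked_templates) < 2:
        return []
    core = tuple(ranked_templates[:2])
    rest = ranked_templates[2:]
    sets = []
    for n_extra in range(0, max_size - len(core) + 1):
        for combo in combinations(rest, n_extra):
            sets.append(core + combo)
    return sets


def prefers_multi(multi, other):
    """Conservative rule: prefer the multi-template model only if BOTH scores are better.

    Assumption: higher ProQ is better, lower DFIRE energy is better.
    """
    return multi["proq"] > other["proq"] and multi["dfire"] < other["dfire"]


# Hypothetical ranked template list (t1 is the top-ranked template).
candidates = template_sets(["t1", "t2", "t3", "t4", "t5", "t6"])
```

With six ranked templates this yields 15 candidate sets (1 + 4 + 6 + 4), each of which would then need its own multiple alignment and MODELLER run, which is why a cap on the set size matters.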

Template-free modeling

We have developed a template-free modeling method, as detailed in [16, 17]. Our method employs Conditional (Markov) Random Fields (CRFs) and directional statistics to model the protein sequence-structure relationship. It models the backbone angle distribution at each residue using an FB5 distribution [18] and samples backbone angles from sequence information using the CRF. Unlike the widely used fragment assembly and lattice model methods, which explore protein conformations in a discrete space, our method explores protein conformations in a continuous space according to their probability. The probability of a protein conformation reflects its stability and is estimated from the PSI-BLAST sequence profile and the PSIPRED-predicted secondary structure. Our template-free modeling module drives conformation optimization with a simple energy function consisting of Sali’s DOPE [19, 20], Baker’s KMBhbond [21] and, later, a simplified solvent accessibility potential [22]. Our experimental results in [17] indicate that, although it samples in a continuous space and uses a very simple energy function, our new method compares favorably with the fragment assembly method (e.g., Robetta) and the lattice model (i.e., TOUCHSTONE II).
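For reference, the FB5 (Kent) distribution [18] on the unit sphere has the standard density below. How RAPTOR++ maps backbone angles to a unit vector and conditions the parameters on sequence is described in [16, 17] and is not reproduced here; the symbols are those of the standard definition, not notation from this article.

```latex
f(\mathbf{x}) \;=\; \frac{1}{c(\kappa,\beta)}
\exp\!\left\{ \kappa\,\boldsymbol{\gamma}_1^{\top}\mathbf{x}
  \;+\; \beta\left[ (\boldsymbol{\gamma}_2^{\top}\mathbf{x})^{2}
  - (\boldsymbol{\gamma}_3^{\top}\mathbf{x})^{2} \right] \right\},
\qquad \mathbf{x}\in S^{2},\quad \kappa \ge 0,\quad 0 \le 2\beta < \kappa .
```

Here $\boldsymbol{\gamma}_1$ is the mean direction, $\boldsymbol{\gamma}_2$ and $\boldsymbol{\gamma}_3$ are the major and minor axes (the three vectors are orthonormal), $\kappa$ controls concentration, $\beta$ controls the ellipticity of the contours, and $c(\kappa,\beta)$ is the normalizing constant. Because the density is defined on a continuous surface, sampling from it does not restrict angles to a finite set of values.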

Multi-domain proteins

If a target protein is large and may contain multiple domains, we first parse it into several possible domains by searching the Pfam database [23] using HMMER [24, 25]. If the whole target can be aligned to a single template, domain parsing is skipped. If a big chunk of the target is not aligned to any of the top templates, we treat this unaligned chunk as a separate target and model it independently. Except for the last several CASP8 targets, the models for multiple domains were not assembled into a single coordinate system. This explains why Zhang’s assessment ² indicates that our models for multi-domain targets may contain atomic clashes when our domain boundary is different from Zhang’s.
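The step of detecting a large unaligned chunk can be sketched as a scan over a per-residue coverage mask. This is only an illustration: the function name, the boolean-mask representation and the minimum-length cutoff are assumptions, since the text does not say how large a "big chunk" must be.

```python
def unaligned_segments(aligned, min_len):
    """Return half-open (start, end) index ranges of unaligned runs of at least min_len.

    aligned: list of booleans, True where the target residue is aligned to a top template.
    min_len: hypothetical cutoff for what counts as a "big chunk".
    """
    segments = []
    start = None
    for i, is_aligned in enumerate(aligned):
        if not is_aligned and start is None:
            start = i  # a new unaligned run begins
        elif is_aligned and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(aligned) - start >= min_len:
        segments.append((start, len(aligned)))
    return segments


# Toy coverage mask: 10 aligned, 5 unaligned, 3 aligned, 2 unaligned residues.
mask = [True] * 10 + [False] * 5 + [True] * 3 + [False] * 2
chunks = unaligned_segments(mask, min_len=4)
```

Each returned range would then be treated as a separate modeling target, as the paragraph above describes.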

Results and Discussion

Summary

Table 1 summarizes the results of RAPTOR++ in CASP8. CASP8 defined 164 effective domains and classified them into three categories, while Grishin et al. defined 146 domains and classified them into five categories³. As shown in Columns 2–4, for TBM-HA targets the difference between the first and the best models by RAPTOR++ is small. In contrast, the best models generated by RAPTOR++ for TBM and FM targets are much better than the first models. This indicates that we still need to improve our model selection method for TBM and FM targets. As shown in Columns 4–6, for TBM-HA targets the best models generated by RAPTOR++ are not far from the best models submitted by all the CASP8 servers. However, for TBM and FM targets, the best models submitted by all CASP8 servers are much better than the best models generated by RAPTOR++. This means that, in addition to improving model selection, we also need to further improve our model generation method for TBM and FM targets. Similar observations hold when Grishin’s domain definition and classification are used.

Table 1

Summarized results of RAPTOR++ predictions in CASP8. The upper half of the table contains the results for the 164 CASP8 official domains and the lower half contains the results for the 146 domains by Grishin’s definition (http://prodata.swmed.edu/CASP8/evaluation/CASP8Home.htm).

CASP8 official domain definition
Category(#)	R1	RB	RBAll	S1	SB
TBM-HA (50)	43.5982	43.9908	44.6929	45.4504	45.7400
TBM (104)	61.5547	63.8906	67.8557	70.8744	71.9822
FM (13)	3.9320	4.5924	5.1174	5.9134	6.4647
Grishin’s domain definition and classification
Category (#)	R1	RB	RBAll	S1	SB
CM easy (36)	30.6440	31.0100	31.5725	32.3377	32.4890
CM medium (45)	31.3489	32.1180	33.4249	34.2220	34.6184
CM hard (30)	16.7533	17.1881	18.4558	19.2035	19.3879
FR (30)	9.9741	11.2573	12.4265	13.8672	14.5600
FM (5)	1.0646	1.1758	1.3159	1.5711	1.6815

R1: GDT-TS score sum of the first-ranked models by RAPTOR++.

RB: GDT-TS score sum of the best models submitted by RAPTOR++.

RBAll: GDT-TS score sum of the best models generated by RAPTOR++.

S1: GDT-TS score sum of the best first models submitted by all servers.

SB: GDT-TS score sum of the best models submitted by all servers.
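The comparisons drawn from Table 1 can be reproduced with a few subtractions. The sketch below uses the CASP8 official-domain rows of Table 1; the labels "selection gap" (RBAll minus R1) and "generation gap" (SB minus RBAll) are names introduced here for illustration and do not appear in the CASP8 assessment.

```python
# Rows copied from the upper half of Table 1: (R1, RB, RBAll, S1, SB).
TABLE1 = {
    "TBM-HA": (43.5982, 43.9908, 44.6929, 45.4504, 45.7400),
    "TBM": (61.5547, 63.8906, 67.8557, 70.8744, 71.9822),
    "FM": (3.9320, 4.5924, 5.1174, 5.9134, 6.4647),
}


def gaps(row):
    """Return the two GDT-TS sum differences discussed in the text."""
    r1, rb, rball, s1, sb = row
    return {
        # What better model selection could recover from models already generated.
        "selection_gap": rball - r1,
        # What better model generation would be needed to match the best server models.
        "generation_gap": sb - rball,
    }


summary = {category: gaps(row) for category, row in TABLE1.items()}

for category, g in summary.items():
    relative = g["selection_gap"] / TABLE1[category][2]
    print(f"{category}: selection gap {g['selection_gap']:.4f} "
          f"({relative:.1%} of RBAll), generation gap {g['generation_gap']:.4f}")
```

Relative to RBAll, the selection gap is smallest for TBM-HA and largest for FM, which matches the reading of Columns 2–4 given in the Summary.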

What went right?

The model quality assessment method greatly improves the performance of RAPTOR++ on the TBM targets, in contrast to RAPTOR in CASP7, which did not perform well in this category. In fact, Randall and Baldi demonstrated that the performance of RAPTOR in CASP7 could be greatly improved by simply re-ranking the top five models using SELECTpro [26]. A typical example is T0429. The third model of RAPTOR++ for this target is much better than other server models, but RAPTOR’s old template selection method failed to rank it first, although RAPTOR’s first model is still fairly good. Using our new model quality assessment method, we can rank the third model first. See Figure S1 in Supplemental Information for these two models of T0429.

The multiple-template method sometimes helps improve the modeling of easy targets. It is likely to improve model quality when two conditions are satisfied. One is that some gapped regions in the alignment to one template can be covered by the alignment to another template. The other is that the templates are structurally very similar. If either of these two conditions is not satisfied, the multiple-template method may produce models of worse quality. For example, RAPTOR++ generated the best model for T0486 using four similar templates: 2ppyA, 1q52A, 2hw5A and 2pbpA. The GDT-TS of this model is around 0.055 higher than that of the single-template (2ppyA) based model. Together, these four templates cover more of T0486 than any single template does. See Figures S2-1, S2-2, and S2-3 in Supplemental Information for alignments and 3D models for T0486.

Our template-free modeling method samples protein conformations in a continuous space without using fragments from the PDB. It aims to overcome two major issues with the currently popular fragment assembly and lattice model methods. One issue is that sampling in a discrete space may exclude the native structure from the search space, since a small change in a backbone angle can result in a totally different fold. The other issue is that there is no guarantee that the local structure of a protein with a new fold can be covered by even medium-sized fragments, since a new fold seems to be composed of rarely occurring supersecondary structure motifs (Andras Fiser, CASP8 talk). Compared to the Robetta server (see Table 3 in [17]), our method performs very well on mainly-alpha proteins, e.g., T0460, T0496_D1 and T0496_D2, as shown in Figures S3, S4-1, S4-2, and S4-3 in Supplemental Information, respectively. This is not surprising, since our CRF model captures the local sequence-structure relationship well. Our method also works well on small mainly-beta proteins. For example, our method is better than Robetta on T0480 and T0510_D3, as shown in Figures S5 and S6 in Supplemental Information, respectively. However, our method does not fare well on relatively large proteins (>100 residues) with a few beta strands, e.g., T0482 and T0513_D2. This is probably because our CRF method can model only the local sequence-structure relationship, while a beta sheet is stabilized by non-local hydrogen bonding. Although it samples in a continuous space, our method can still efficiently search the conformation space of a small beta protein. However, for a large protein with a few beta sheets, the search space is too big to be explored by our continuous conformation sampling algorithm. It is also worth noting that, compared to Robetta, our method works well on T0397_D1 (see Figure S7 in Supplemental Information) and T0496_D1, which, according to Nick Grishin, are the only two CASP8 targets with really new folds.

What went wrong?

RAPTOR++ contains both template-based and template-free modeling modules, so it needs a rule to decide when to use each. RAPTOR++ used the predicted GDT-TS for this purpose: it used template-free modeling if the best predicted GDT-TS was less than 0.30. However, the predicted GDT-TS is not accurate enough, so this rule sometimes misled RAPTOR++. For some targets, such as T0496_D1 and T0510_D3, RAPTOR++ correctly submitted template-free models, which are much better than their template-based models. However, RAPTOR++ incorrectly submitted template-free models for other targets (e.g., T0480 and T0496_D2) even though they have better template-based models. When the multiple-template method was used, RAPTOR++ sometimes failed to identify the best 3D models using ProQ and DFIRE. A better model quality assessment method is urgently needed for this purpose. Another issue is that RAPTOR++ did not update its template database during the CASP8 season, so it missed the best template (2zf8A) for T0514, which was deposited in the PDB in July 2008.
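The module-selection rule discussed above can be written as a small decision function. The 0.30 cutoff for template-free modeling is taken from the text. The cutoff that defines a "very good" template (`good_cutoff`) is a hypothetical placeholder, because the article does not give a value, and the function name and return labels are likewise illustrative.

```python
def choose_strategy(predicted_gdt, free_cutoff=0.30, good_cutoff=0.60):
    """Pick a modeling strategy from the predicted GDT-TS of the ranked alignments.

    predicted_gdt: non-empty list of predicted GDT-TS values (0-1 scale), one per alignment.
    free_cutoff:   from the text; below this, fall back to template-free modeling.
    good_cutoff:   HYPOTHETICAL threshold for a "very good" template.
    """
    best = max(predicted_gdt)
    if best < free_cutoff:
        return "template-free"
    n_good = sum(1 for g in predicted_gdt if g >= good_cutoff)
    if n_good >= 2:
        return "multiple-template"
    return "single-template"


# Illustrative inputs, not CASP8 predictions.
print(choose_strategy([0.25, 0.10]))        # best prediction below 0.30
print(choose_strategy([0.75, 0.65, 0.40]))  # two alignments above the assumed cutoff
print(choose_strategy([0.75, 0.40]))        # only one good alignment
```

Because the decision depends on a single predicted value near a hard threshold, a small prediction error around 0.30 flips the choice, which is consistent with the misclassified targets mentioned above.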

Acknowledgement

The authors are grateful to Xin Gao for his work in setting up the RAPTOR++ web server and running RAPTOR++ for the first ~20 CASP8 targets, and to Tobin Sosnick, Karl Freed, Joe DeBartolo and Brendan McConkey for their help with the development of RAPTOR++.

Toyota Technological Institute at Chicago, IL USA 60637

Please address all correspondence to Dr. Jinbo Xu at the Toyota Technological Institute at Chicago. Phone: 773 834 2511, Fax: 773 834 9881, Email: j3xu@tti-c.org

Abstract

We developed RAPTOR++ for protein structure prediction and tested it in CASP8. RAPTOR++ contains four modules: threading, model quality assessment, multiple protein alignment and template-free modeling. RAPTOR++ first threads a target protein to all the templates using three methods and then predicts the quality of the 3D model implied by each alignment using a model quality assessment method. Based upon the predicted quality, RAPTOR++ employs different strategies as follows. If multiple alignments have good quality, RAPTOR++ builds a multiple protein alignment between the target and the top templates and then generates a 3D model using MODELLER. If all the alignments have very low quality, RAPTOR++ uses template-free modeling. Otherwise, RAPTOR++ submits the threading-generated 3D model with the best quality. RAPTOR++ was not ready for the first third of the targets and was under development during the whole CASP8 season. The template-based and template-free modeling modules in RAPTOR++ are not closely integrated. We are using our template-free modeling technique to refine template-based models.

Keywords: template-based modeling, template-free modeling, protein threading, model quality assessment

Footnotes

² http://zhang.bioinformatics.ku.edu/casp8/

³ http://prodata.swmed.edu/CASP8/evaluation/DomainsAll.First.html#tabl

References

1. Baker D, Sali A. Protein structure prediction and structural genomics. Science. 2001;294(5540):93–96.
2. Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D, Shen MY, Kelly L, Melo F, Sali A. MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Research. 2006;34:D291–D295.
3. Kim DE, Chivian D, Baker D. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Research. 2004;32:W526–W531.
4. Zhou HY, Skolnick J. Ab initio protein structure prediction using Chunk-TASSER. Biophysical Journal. 2007;93(5):1510–1518.
5. Xu J, Li M, Kim D, Xu Y. RAPTOR: optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology. 2003;1(1):95–117.
6. McConkey BJ, Sobolev V, Edelman M. Discrimination of native protein structures using atom-atom contact scoring. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(6):3215–3220.
7. Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS. ProSup: a refined tool for protein structure alignment. Protein Engineering. 2000;13(11):745–752.
8. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research. 2005;33(7):2302–2309.
9. Xu J. Protein Fold Recognition by Predicted Alignment Accuracy. IEEE/ACM Trans. on Computational Biology and Bioinformatics. 2005;2(2):157–165.
10. Sali A. Comparative Protein Modeling by Satisfaction of Spatial Restraints. Molecular Medicine Today. 1995;1(6):270–277.
11. Joo K, Lee J, Lee S, Seo JH, Lee SJ, Lee J. High accuracy template based modeling by global optimization. Proteins: Structure, Function and Bioinformatics. 2007;69:83–89.
12. Cheng JL. A multi-template combination algorithm for protein comparative modeling. BMC Structural Biology. 2008;8.
13. Poirot O, O'Toole E, Notredame C. Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Research. 2003;31(13):3503–3506.
14. Wallner B, Elofsson A. Can correct protein models be identified? Protein Science. 2003;12(5):1073–1086.
15. Zhou HY, Zhou YQ. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction (vol 11, pg 2714, 2002). Protein Science. 2003;12(9):2121–2121.
16. Zhao F, Li SC, Sterner BW, Xu J. Discriminative learning for protein conformation sampling. Proteins: Structure, Function and Bioinformatics. 2008;73(1):228–240.
17. Zhao F, Peng J, DeBartolo J, Freed KF, Sosnick TR, Xu J. A probabilistic graphical model for ab initio folding. 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB); Tucson, AZ: Springer; 2009. In press.
18. Kent JT. The Fisher-Bingham Distribution on the Sphere. Journal of the Royal Statistical Society Series B-Methodological. 1982;44(1):71–80.
19. Fitzgerald JE, Jha AK, Colubri A, Sosnick TR, Freed KF. Reduced C-beta statistical potentials can outperform all-atom potentials in decoy identification. Protein Science. 2007;16(10):2123–2139.
20. Shen M, Sali A. Statistical potential for assessment and prediction of protein structures. Protein Science. 2006;15(11):2507–2524.
21. Morozov AV, Kortemme T, Tsemekhman K, Baker D. Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(18):6946–6951.
22. Fernandez A, Sosnick TR, Colubri A. Dynamics of hydrogen bond desolvation in protein folding. Journal of Molecular Biology. 2002;321(4):659–675.
23. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer ELL, Bateman A. The Pfam protein families database. Nucleic Acids Research. 2008;36:D281–D288.
24. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755–763.
25. Krogh A, Brown M, Mian IS, Sjolander K, Haussler D. Hidden Markov-Models in Computational Biology - Applications to Protein Modeling. Journal of Molecular Biology. 1994;235(5):1501–1531.
26. Randall A, Baldi P. SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs. BMC Structural Biology. 2008;8:52.