BioMed Research International. Dec/31/2013; 2014

Published online May/20/2014

PMID: 24967386

PMC: 4055483

doi: 10.1155/2014/623149

iSS-PseDNC: Identifying Splicing Sites Using Pseudo Dinucleotide Composition

Abstract

1. Introduction

In eukaryotic genomes, protein-coding exons are typically interrupted by introns, which are noncoding regions. The borders between exons and introns are called splice sites (Figure 1). A splice site can be located at either the upstream or the downstream end of an intron. The former is called the 5′ splice site or donor site; the latter is called the 3′ splice site or acceptor site. The vast majority of donor and acceptor sites are canonical (regular) splice sites, characterized by the presence of GT and AG, respectively. During RNA splicing, both the donor and acceptor sites are recognized by a large macromolecular complex called the spliceosome, which comprises more than 300 proteins and five small nuclear RNAs (snRNAs U1, U2, U4, U5, and U6) [1]. Once the splice sites are recognized, the spliceosome removes introns through two sequential transesterification reactions (Figure 1). Removing introns from precursor messenger RNA (pre-mRNA) so that exons can be joined together to form mature mRNA is an essential step of gene expression. Therefore, to better understand the splicing process and its mechanism, it is important to accurately detect the splice sites in the genome.

Although biochemical experiments can provide some details about splice sites, relying on experimental techniques alone is both time-consuming and expensive. Hence, it is challenging but highly desirable to develop computational methods that identify splice sites quickly and effectively. In view of this, the present study was initiated in an attempt to develop a computational method for predicting splice sites.

According to a comprehensive review [2] and demonstrated by a series of recent publications [3–9], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor. Below, let us describe how to deal with these procedures one by one.

2. Materials and Methods

2.1. Benchmark Dataset

The human splice site-containing sequences were obtained from the database HS³D (http://www.sci.unisannio.it/docenti/rampone/), which contains the sequences of exons, introns, and splice regions extracted from GenBank Rel.123. All the splice site-containing sequences in HS³D obey the GT-AG rule; that is, the introns begin with the dinucleotide GT (GU in RNA) and end with the dinucleotide AG. Each sequence is 140 nucleotides long, with the splice donor site GT (or acceptor site AG) in the middle.

At present, there are 2,796 (2,880) true splice donor (acceptor) site-containing sequences and 271,937 (329,374) false splice donor (acceptor) site-containing sequences in HS³D. To balance the numbers of true and false splice site-containing sequences and to avoid overfitting during model training, we randomly selected 2,800 false splice donor (acceptor) site-containing sequences from the 271,937 (329,374) available.
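The balancing step above can be sketched as follows. This is only an illustration of random subsampling without replacement; the function name `balance_negatives`, the fixed seed, and the placeholder sequence identifiers are assumptions and are not taken from the original study.

```python
import random

def balance_negatives(negatives, n_keep, seed=0):
    """Randomly keep n_keep negative samples, without replacement.

    The seed is a hypothetical choice made here for reproducibility;
    the paper does not state how the random selection was seeded.
    """
    rng = random.Random(seed)
    return rng.sample(negatives, n_keep)

# Placeholder identifiers standing in for the 271,937 false donor sequences.
false_sites = [f"false_donor_{i}" for i in range(271_937)]
subset = balance_negatives(false_sites, 2_800)
```

Sampling without replacement guarantees that the 2,800 retained negatives are distinct, so the negative set ends up roughly the same size as the 2,796 positives.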

As pointed out in a comprehensive review [10], there is no need to separate a benchmark dataset into a training dataset and a testing dataset for examining the performance of a prediction method if it is tested by the jackknife test or subsampling cross-validation test.

Finally, we obtained two benchmark datasets, one for the splice donor site-containing sequences and the other for the splice acceptor site-containing sequences, as can be formulated by

$$S_1 = S_1^{+} \cup S_1^{-} \quad \text{for splice donor}, \qquad S_2 = S_2^{+} \cup S_2^{-} \quad \text{for splice acceptor}, \tag{1}$$

where the positive dataset S₁⁺ contains 2,796 true splice donor site-containing sequences while the negative dataset S₁⁻ contains 2,800 false splice donor site-containing sequences; S₂⁺ contains 2,880 true splice acceptor site-containing sequences, while S₂⁻ contains 2,800 false splice acceptor site-containing sequences, and the symbol ∪ means the union in set theory. The detailed sequences in the two benchmark datasets S₁ and S₂ are given in Supplementary Information S1 and Supplementary Information S2, respectively; see Supplementary Material available online at http://dx.doi.org/10.1155/2014/623149.

2.2. DNA Sample Formulation

Given a DNA sample D with L nucleic acid residues, the most straightforward way to express the sample is to use the following sequential model:

$$\mathbf{D} = R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L, \tag{2}$$

where R₁ represents the nucleic acid residue at position 1, R₂ represents the nucleic acid residue at position 2, and so forth. Although the sequential formulation of (2) contains the complete information of the DNA sample, it is difficult to handle in statistical prediction. This is because all the existing operation engines, such as optimization approach [11], covariance discriminant (CD) [12], neural network [13], support vector machine (SVM) [14–16], random forest [17, 18], conditional random field [8], nearest neighbor (NN) [19], K-nearest neighbor (KNN) [20], OET-KNN [21], fuzzy K-nearest neighbor [22–24], ML-KNN algorithm [25], and SLLE algorithm [26], can handle only vector samples, not sequence samples. Although some sequence-similarity-search-based tools, such as BLAST [27], can be used to search directly for sequences with high similarity to the query sample, this straightforward and intuitive approach fails when the query sample has no significant similarity to any of the sequences with known characteristics. Therefore, various nonsequential or discrete models for representing DNA samples were proposed in the hope of establishing some sort of correlation or clustering pattern through which the prediction could be carried out more effectively.

The simplest discrete model used to represent a DNA sample is its nucleic acid composition or NAC, as given below:

$$\mathbf{D} = \left[ f(\mathrm{A}) \;\; f(\mathrm{C}) \;\; f(\mathrm{G}) \;\; f(\mathrm{T}) \right]^{\mathbf{T}}, \tag{3}$$

where f(A), f(C), f(G), and f(T) are the normalized occurrence frequencies of adenine (A), cytosine (C), guanine (G), and thymine (T) in the DNA sequence, respectively; the superscript **T** is the transpose operator. However, as we can see from (3), all sequence-order information is completely lost if NAC is used to represent a DNA sample. Actually, one of the most important but also most difficult problems in computational biology is how to effectively formulate a biological sequence with a discrete model or a vector, yet still keep considerable sequence-order information.
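A minimal sketch of (3) is given below, assuming the frequencies are normalized by the number of A, C, G, and T residues in the sequence. The function name `nucleic_acid_composition` and the two toy sequences are illustrative only.

```python
from collections import Counter

def nucleic_acid_composition(seq):
    """Return [f(A), f(C), f(G), f(T)] as in Eq. (3)."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T residues")
    return [counts[base] / total for base in "ACGT"]

# Two toy sequences with the same residues in a different order.
nac_1 = nucleic_acid_composition("AACG")
nac_2 = nucleic_acid_composition("GCAA")
```

Both toy sequences yield the same vector, which illustrates why NAC by itself carries no sequence-order information.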

One way to cope with such a problem is to represent the DNA segment with the k-tuple nucleotide composition, a vector with 4^k components; that is,

$$\mathbf{D} = \left[ f_1^{k\text{-tuple}} \;\; f_2^{k\text{-tuple}} \;\; \cdots \;\; f_i^{k\text{-tuple}} \;\; \cdots \;\; f_{4^k}^{k\text{-tuple}} \right]^{\mathbf{T}}, \tag{4}$$

where $f_i^{k\text{-tuple}}$ is the normalized occurrence frequency of the ith k-tuple nucleotide in the DNA segment. As we can see from (4), the dimension of the vector is

$$4^k = \begin{cases} 64, & k = 3, \\ 256, & k = 4, \\ 1024, & k = 5, \\ 4096, & k = 6, \\ 16384, & k = 7, \\ \;\;\vdots & \;\;\vdots \end{cases} \tag{5}$$

indicating that as the value of k increases, the coverage scope of sequence order gradually widens, but the dimension of the vector D grows rapidly as well. This will cause the high-dimension disaster [28], as reflected by the following disadvantages: (i) overfitting, which gives the predictor a serious bias and an extremely low capacity for generalization; (ii) information redundancy or noise, which introduces misrepresentation errors and results in very poor prediction accuracy; and (iii) an unnecessary increase in computational time.
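The k-tuple composition of (4) can be sketched as below. The sketch assumes overlapping k-tuples counted with a sliding step of one nucleotide and normalized by the number of counted k-tuples; the function name `ktuple_composition` and the toy sequence are assumptions made for illustration.

```python
from itertools import product

def ktuple_composition(seq, k):
    """Return the 4**k normalized k-tuple frequencies of Eq. (4).

    Components follow lexicographic order over the alphabet ACGT
    (AA, AC, AG, AT, CA, ... for k = 2).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-tuples containing ambiguous symbols
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence too short or contains no valid k-tuples")
    return [counts[kmer] / total for kmer in kmers]

dnc = ktuple_composition("ACGTACGT", 2)        # 16 components
dimensions = [4 ** k for k in (3, 4, 5, 6, 7)]  # growth shown in Eq. (5)
```

With k = 2 the same function gives the 16-component dinucleotide composition, while the `dimensions` list reproduces the rapid growth listed in (5).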

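To avoid the high-dimension disaster, the dinucleotide composition (DNC) was used here to formulate the DNA sample, as given by

$$\mathbf{D} = \left[ f_1^{2\text{-tuple}} \;\; f_2^{2\text{-tuple}} \;\; \cdots \;\; f_i^{2\text{-tuple}} \;\; \cdots \;\; f_{16}^{2\text{-tuple}} \right]^{\mathbf{T}} = \left[ f(\mathrm{AA}) \;\; f(\mathrm{AC}) \;\; f(\mathrm{AG}) \;\; f(\mathrm{AT}) \;\; \cdots \;\; f(\mathrm{TT}) \right]^{\mathbf{T}}, \tag{6}$$

where $f_1^{2\text{-tuple}} = f(\mathrm{AA})$ is the normalized occurrence frequency of AA in the DNA sequence, $f_2^{2\text{-tuple}} = f(\mathrm{AC})$ is that of AC, $f_3^{2\text{-tuple}} = f(\mathrm{AG})$ is that of AG, and so forth. By doing so, we can incorporate only the local sequence-order information between the most contiguous nucleotides; none of the global or long-range sequence-order information is reflected.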

A similar problem also arose in computational proteomics, where, in order to incorporate the global or long-range sequence-order information for proteins, the pseudo amino acid composition [29] or Chou's PseAAC [30] was proposed. Since the concept of PseAAC was proposed in 2001 [29], it has spread into almost all the fields of protein attribute prediction (see, e.g., [31–73]). Because it has been widely used, two open access software packages, called “PseAAC-Builder” [51] and “propy” [74], were recently established for generating various modes of PseAAC.

Encouraged by the successes of introducing the PseAAC approach into computational proteomics, Chen et al. [4] proposed the “pseudo dinucleotide composition” or PseDNC to identify recombination spots of DNA. The formulation of PseDNC is given by

$$\mathbf{D}_{\text{PseDNC}} = \left[ d_1 \;\; d_2 \;\; \cdots \;\; d_{16} \;\; d_{16+1} \;\; \cdots \;\; d_{16+\lambda} \right]^{\mathbf{T}}, \tag{7}$$

where

$$d_u = \begin{cases} \dfrac{f_u^{2\text{-tuple}}}{\sum_{i=1}^{16} f_i^{2\text{-tuple}} + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 16, \\[2ex] \dfrac{w\,\theta_{u-16}}{\sum_{i=1}^{16} f_i^{2\text{-tuple}} + w \sum_{j=1}^{\lambda} \theta_j}, & 16+1 \le u \le 16+\lambda, \end{cases} \tag{8}$$

where $f_i^{2\text{-tuple}}$ (i = 1, 2, …, 16) have the same meaning as in (6), while $\theta_j$ is the jth-tier correlation factor that reflects the sequence-order correlation between all the jth most contiguous dinucleotides along a DNA sequence (see Figure 2), as formulated by

$$\theta_j = \frac{1}{L-j-1} \sum_{i=1}^{L-j-1} \Theta\left( R_iR_{i+1};\, R_{i+j}R_{i+j+1} \right) \quad (j = 1, 2, \ldots, \lambda;\ \lambda < L). \tag{9}$$

In the above two equations, λ is the total number of ranks or tiers of the correlations counted along a DNA sequence, and w is the weight factor. Their concrete values as well as the final value for k will be further discussed later. The correlation function $\Theta(R_iR_{i+1}; R_{i+j}R_{i+j+1})$ in (9) is defined by

$$\Theta\left( R_iR_{i+1};\, R_{i+j}R_{i+j+1} \right) = \frac{1}{\mu} \sum_{\nu=1}^{\mu} \left[ P_\nu(R_iR_{i+1}) - P_\nu(R_{i+j}R_{i+j+1}) \right]^2, \tag{10}$$

where μ is the number of local DNA structural properties considered, which is equal to 6 in the current study as will be explained below; $P_\nu(R_iR_{i+1})$ is the numerical value of the νth (ν = 1, 2, …, μ) DNA local structural property for the dinucleotide $R_iR_{i+1}$ at position i; and $P_\nu(R_{i+j}R_{i+j+1})$ is the corresponding value for the dinucleotide $R_{i+j}R_{i+j+1}$ at position i + j, as will be given below.
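Equations (7)–(10) can be sketched in code as follows. This is an illustrative reading of the formulas, not the authors' implementation: the names `PROPS`, `theta`, and `pse_dnc` are hypothetical, the property values are transcribed from the normalized values listed in Table 2 (order: twist, tilt, roll, shift, slide, rise), and the trailing components are taken as $w\theta_{u-16}$ so that they map onto $\theta_1, \ldots, \theta_\lambda$.

```python
from itertools import product

# Normalized dinucleotide properties transcribed from Table 2
# (order: twist, tilt, roll, shift, slide, rise).
PROPS = {
    "AA": [0.06, 0.50, 0.27, 1.59, 0.11, -0.11],
    "AC": [1.50, 0.50, 0.80, 0.13, 1.29, 1.04],
    "AG": [0.78, 0.36, 0.09, 0.68, -0.24, -0.62],
    "AT": [1.07, 0.22, 0.62, -1.02, 2.51, 1.17],
    "CA": [-1.38, -1.36, -0.27, -0.86, -0.62, -1.25],
    "CC": [0.06, 1.08, 0.09, 0.56, -0.82, 0.24],
    "CG": [-1.66, -1.22, -0.44, -0.82, -0.29, -1.39],
    "CT": [0.78, 0.36, 0.09, 0.68, -0.24, -0.62],
    "GA": [-0.08, 0.50, 0.27, 0.13, -0.39, 0.71],
    "GC": [-0.08, 0.22, 1.33, -0.35, 0.65, 1.59],
    "GG": [0.06, 1.08, 0.09, 0.56, -0.82, 0.24],
    "GT": [1.50, 0.50, 0.80, 0.13, 1.29, 1.04],
    "TA": [-1.23, -2.37, -0.44, -2.24, -1.51, -1.39],
    "TC": [-0.08, 0.50, 0.27, 0.13, -0.39, 0.71],
    "TG": [-1.38, -1.36, -0.27, -0.86, -0.62, -1.25],
    "TT": [0.06, 0.50, 0.27, 1.59, 0.11, -0.11],
}
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def theta(seq, j, props=PROPS):
    """j-th tier correlation factor, Eqs. (9) and (10)."""
    n_terms = len(seq) - j - 1
    total = 0.0
    for i in range(n_terms):
        first = props[seq[i:i + 2]]
        second = props[seq[i + j:i + j + 2]]
        # Eq. (10): mean squared difference over the mu properties.
        total += sum((a - b) ** 2 for a, b in zip(first, second)) / len(first)
    return total / n_terms

def pse_dnc(seq, lam, w, props=PROPS):
    """Return the (16 + lam)-component PseDNC vector of Eqs. (7) and (8).

    Assumes seq contains only A, C, G, and T.
    """
    seq = seq.upper()
    if lam >= len(seq) - 1:
        raise ValueError("lam must be smaller than len(seq) - 1")
    n_pairs = len(seq) - 1
    freqs = [sum(seq[i:i + 2] == d for i in range(n_pairs)) / n_pairs
             for d in DINUCLEOTIDES]
    thetas = [theta(seq, j, props) for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

vec = pse_dnc("ACGTACGTAGGCTA", lam=4, w=0.3)  # toy sequence, 20 components
```

Because every component shares the same denominator, the components of the resulting vector sum to one; for a homopolymer such as `"AAAAAAAA"` all correlation factors vanish and only the f(AA) component is nonzero.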

2.3. DNA Local Structural Property Parameters

Considerable evidence has shown that DNA local structural properties play important roles in many biological processes, such as protein-DNA interactions [75], formation of chromosomes [76], and meiotic recombination [4]. Generally speaking, the spatial arrangement of two successive base pairs can be characterized by six parameters, of which three are local translational ones and the other three are local angular ones (Figure 3), as formulated by

$$\text{translational} = \begin{cases} \text{slide}, \\ \text{shift}, \\ \text{rise}, \end{cases} \qquad \text{angular} = \begin{cases} \text{roll}, \\ \text{tilt}, \\ \text{twist}. \end{cases} \tag{11}$$

The six structural parameters of dinucleotides have been calculated by Goñi et al. [75] based on the long atomistic molecular dynamics (MD) simulations in water, and their concrete values are given in Table 1, which will be used to calculate the global or long-range sequence-order effects for the DNA sequences via (9) and (10).

Note that before substituting the values of the physicochemical properties into (10), they were all subjected to a standard conversion as described by the following equation:

$$P_\nu(R_iR_{i+1}) = \frac{P_\nu^0(R_iR_{i+1}) - \left\langle P_\nu^0(R_iR_{i+1}) \right\rangle}{\mathrm{SD}\left\langle P_\nu^0(R_iR_{i+1}) \right\rangle}, \tag{12}$$

where the symbols 〈〉 mean taking the average of the quantity therein over the 16 different combinations of A, C, G, T for $R_iR_{i+1}$, and SD means the corresponding standard deviation [10]. The converted values obtained by (12) will have a zero mean value over the 16 different dinucleotides and will remain unchanged if going through the same conversion procedure again. Listed in Table 2 are the values of $P_\nu(R_iR_{i+1})$ (ν = 1, 2, …, 6) obtained via the standard conversion of (12) from those of Table 1.
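The standard conversion of (12) can be illustrated with the twist column (P₁) of Table 1. The sketch below assumes the sample standard deviation (divisor n − 1), a choice that is not stated in the text but appears to reproduce the twist values listed in Table 2; the names `TWIST_RAW` and `standardize` are hypothetical.

```python
from statistics import mean, stdev

# Original twist values (P1) transcribed from Table 1.
TWIST_RAW = {
    "AA": 0.026, "AC": 0.036, "AG": 0.031, "AT": 0.033,
    "CA": 0.016, "CC": 0.026, "CG": 0.014, "CT": 0.031,
    "GA": 0.025, "GC": 0.025, "GG": 0.026, "GT": 0.036,
    "TA": 0.017, "TC": 0.025, "TG": 0.016, "TT": 0.026,
}

def standardize(raw):
    """Apply Eq. (12) over the 16 dinucleotides.

    Uses the sample standard deviation (n - 1), which is an assumption.
    """
    values = list(raw.values())
    avg = mean(values)
    sd = stdev(values)
    return {dinuc: (value - avg) / sd for dinuc, value in raw.items()}

twist = standardize(TWIST_RAW)
twist_again = standardize(twist)  # a second pass should change nothing
```

Applying `standardize` a second time leaves the values unchanged up to floating-point error, which matches the invariance property stated above.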

2.4. Support Vector Machine (SVM)

Support vector machine (SVM) is an effective method for supervised pattern recognition and has been widely used in the realm of bioinformatics [4, 14, 77, 78]. The basic idea of SVM is to transform the data into a high dimensional feature space and then determine the optimal separating hyperplane. A brief introduction to the formulation of SVM has been given in [14]. In this study, the SVM implementation was based on the freely available package LIBSVM 2.84 written by Chang and Lin [79], which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Because of its effectiveness and speed in the training process, the radial basis function (RBF) kernel was used to obtain the best classification hyperplane. The regularization parameter C and the kernel width parameter γ were tuned via grid search with 10-fold cross-validation.

The predictor obtained via the above procedures is called iSS-PseDNC, where “i” stands for “identifying,” “SS” for “splice site,” “Pse” for “pseudo,” “D” for “di,” “N” for “nucleotide,” and “C” for “composition.”

2.5. Criteria for Performance Evaluation

To provide a more intuitive and easier-to-understand way to measure the prediction quality, the following set of four metrics, based on the formulation used by Chou [80] in studying signal peptide prediction, was adopted. According to Chou's formulation, the sensitivity (Sn), specificity (Sp), overall accuracy (Acc), and Matthews correlation coefficient (MCC) can be expressed as follows [4, 7–9]:

$$\begin{aligned} \mathrm{Sn} &= 1 - \frac{N_-^+}{N^+}, \\ \mathrm{Sp} &= 1 - \frac{N_+^-}{N^-}, \\ \mathrm{Acc} &= 1 - \frac{N_-^+ + N_+^-}{N^+ + N^-}, \\ \mathrm{MCC} &= \frac{1 - \left( \dfrac{N_-^+}{N^+} + \dfrac{N_+^-}{N^-} \right)}{\sqrt{\left( 1 + \dfrac{N_+^- - N_-^+}{N^+} \right) \left( 1 + \dfrac{N_-^+ - N_+^-}{N^-} \right)}}, \end{aligned} \tag{13}$$

where N⁺ is the total number of the true splice site-containing sequences investigated, while N₋⁺ is the number of true splice site-containing sequences incorrectly predicted as false splice site-containing sequences; N⁻ is the total number of the false splice site-containing sequences investigated, while N₊⁻ is the number of false splice site-containing sequences incorrectly predicted as true splice site-containing sequences. From (13), we can easily see the following. When N₋⁺ = 0, meaning that none of the true splice site-containing sequences was incorrectly predicted to be a false splice site-containing sequence, we have the sensitivity Sn = 1. When N₋⁺ = N⁺, meaning that all the true splice site-containing sequences were incorrectly predicted to be false splice site-containing sequences, we have the sensitivity Sn = 0. Likewise, when N₊⁻ = 0, meaning that none of the false splice site-containing sequences was incorrectly predicted to be a true splice site-containing sequence, we have the specificity Sp = 1, whereas when N₊⁻ = N⁻, meaning that all the false splice site-containing sequences were incorrectly predicted to be true splice site-containing sequences, we have the specificity Sp = 0. When N₋⁺ = N₊⁻ = 0, meaning that none of the true splice site-containing sequences and none of the false splice site-containing sequences were incorrectly predicted, we have the overall accuracy Acc = 1 and Matthews correlation coefficient MCC = 1; when N₋⁺ = N⁺ and N₊⁻ = N⁻, meaning that all the false splice site-containing sequences and all the true splice site-containing sequences were incorrectly predicted, we have Acc = 0 and MCC = −1, whereas when N₋⁺ = N⁺/2 and N₊⁻ = N⁻/2, we have Acc = 0.5 and MCC = 0, meaning no better than random prediction. As we can see from the above discussion based on (13), the meanings of the four metrics become much more intuitive and easier to understand than the conventional formulation often used in the literature, particularly for the Matthews correlation coefficient, which is usually used for measuring the quality of binary (two-class) classifications as in the case of the current study. However, it is instructive to point out that the set of metrics in (13) is valid only for single-label systems. For the multilabel systems whose existence has become more frequent in system biology [81–83] and system medicine [24, 84], a completely different set of metrics as defined in [25] is needed.
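The four metrics of (13) can be computed directly from the four counts, as in the sketch below. The function name `chou_metrics` and its argument names are illustrative. The error counts in the example (373 and 441) are not reported in the paper; they are back-calculated here from the donor Sn and Sp in Table 3 and should be read as an assumption.

```python
import math

def chou_metrics(n_pos, n_pos_missed, n_neg, n_neg_missed):
    """Sn, Sp, Acc, and MCC as defined in Eq. (13).

    n_pos        : N+, number of true splice site-containing sequences
    n_pos_missed : true sequences incorrectly predicted as false
    n_neg        : N-, number of false splice site-containing sequences
    n_neg_missed : false sequences incorrectly predicted as true
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    product = ((1 + (n_neg_missed - n_pos_missed) / n_pos)
               * (1 + (n_pos_missed - n_neg_missed) / n_neg))
    # MCC is undefined when every sample is assigned to a single class.
    mcc = numerator / math.sqrt(product) if product > 0 else float("nan")
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Donor benchmark sizes from Section 2.1; error counts are back-calculated.
donor = chou_metrics(n_pos=2796, n_pos_missed=373, n_neg=2800, n_neg_missed=441)
```

The boundary cases discussed above (all correct, all wrong, and half wrong in each class) can be checked by calling `chou_metrics` with the corresponding counts.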

3. Results and Discussions

3.1. Graphic Profiles of True and False Splice Site-Containing Sequences

It has been reported that the DNA local structural properties, that is, angular parameters (twist, tilt, and roll) and translational parameters (shift, slide, and rise), play important roles in prokaryotic transcription initiation, protein-DNA interactions, and meiotic recombination [4, 75, 76, 85]. Accordingly, it is quite natural to ask whether these DNA structural properties may also play some role in regulating RNA splicing. Here, let us use the graphic approach to address this question. This is because using graphical approaches to study biological problems can provide an intuitive picture or useful insights for helping in analyzing complicated relations in these systems [30], as demonstrated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions [86–89], inhibition of HIV-1 reverse transcriptase [90–93], inhibition kinetics of processive nucleic acid polymerases and nucleases [94], protein folding kinetics [95], drug metabolism systems [96], protein sequence evolutionary analysis [97], protein remote homology detection [5], and using Wenxiang diagram or graph [98] to study protein-protein interactions [99–102]. Shown in Figure 4 is a comparison of the graphic profiles between the true and false splice site-containing sequences. As we can see there, the divergence between the true and false splice site-containing sequence profiles is remarkable, clearly indicating that the six structural property parameters can indeed play important roles in RNA splicing. That was why we used them to calculate the global sequence-order effects as elaborated in Section 2.3.
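A profile of the kind shown in Figure 4 can be approximated with a sliding window over the dinucleotide steps of a sequence. The sketch below uses the window size (10 bp) and step size (5 bp) given in the Figure 4 caption; averaging the property over the dinucleotide steps inside each window is an assumption, since the exact averaging scheme is not described. The names `TWIST` and `property_profile` are hypothetical, and the values are the normalized twist values from Table 2.

```python
# Normalized twist values (P1) transcribed from Table 2.
TWIST = {
    "AA": 0.06, "AC": 1.50, "AG": 0.78, "AT": 1.07,
    "CA": -1.38, "CC": 0.06, "CG": -1.66, "CT": 0.78,
    "GA": -0.08, "GC": -0.08, "GG": 0.06, "GT": 1.50,
    "TA": -1.23, "TC": -0.08, "TG": -1.38, "TT": 0.06,
}

def property_profile(seq, values, window=10, step=5):
    """Mean property value of the dinucleotide steps in each window.

    Assumes seq contains only A, C, G, and T.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        steps = [values[chunk[i:i + 2]] for i in range(window - 1)]
        profile.append(sum(steps) / len(steps))
    return profile

# A 140-nt toy sequence gives 27 windows (starts at 0, 5, ..., 130).
toy_profile = property_profile("ACGT" * 35, TWIST)
```

Averaging such profiles position by position over all true (or all false) sequences would give one curve per structural property, which is one plausible way to obtain curves like those compared in Figure 4.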

3.2. Cross-Validation

How to properly evaluate the anticipated accuracy is an important step in developing a new predictor. Generally speaking, to avoid the “memory effect” [10] of the resubstitution test, in which the same dataset is used to train and test a predictor, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: the independent dataset test, the subsampling or K-fold (such as 5-fold, 7-fold, or 10-fold) test, and the jackknife test. However, as elaborated in [2], considerable arbitrariness exists in the independent dataset test. Also, as demonstrated by (28)–(30) in [2], the subsampling test (or K-fold cross-validation) cannot avoid arbitrariness either. The jackknife test is the least arbitrary of the three, because it always yields a unique result for a given benchmark dataset. For this reason, the jackknife test has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors (see, e.g., [42, 58, 59, 62, 64, 66, 67, 70, 103–107]). Accordingly, in this study, the jackknife test was also used to examine the performance of the predictor. During the jackknife test, each sequence in the benchmark dataset S₁ (or S₂) was in turn singled out as an independent test sample, and all the rule parameters were derived from the remaining data, excluding the sample under prediction.
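The jackknife (leave-one-out) procedure described above can be sketched generically as follows. To keep the example self-contained, a simple nearest-centroid classifier is used as a stand-in; it is not the SVM used in this study, and all function names and the toy data are hypothetical.

```python
def jackknife(samples, labels, train, predict):
    """Predict each sample with a model trained on all other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions

# Stand-in classifier (NOT the paper's SVM): nearest class centroid.
def train_centroids(xs, ys):
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict_centroid(centroids, x):
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=squared_distance)

# Toy feature vectors: three samples per class, well separated.
toy_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.8, 1.0], [0.9, 0.8]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_pred = jackknife(toy_x, toy_y, train_centroids, predict_centroid)
```

Because every sample is left out exactly once and no random partition is involved, repeating the procedure on the same dataset always gives the same predictions, which is the uniqueness property emphasized above.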

3.3. Parameter Optimization

As we can see from (8), the predictive accuracy of the present model depends on the two parameters w and λ, where w is the weight factor, usually taken within the range from 0 to 1, and λ is the number of correlation tiers counted for the global sequence-order information. Generally speaking, the greater λ is, the more global sequence-order information the model will contain. However, if λ is too large, it would reduce the cluster-tolerant capacity [108] and thereby lower the cross-validation accuracy because of overfitting or the “high dimension disaster” [28] problem. Therefore, our search for the optimal values of the two parameters was confined to the range

$$0 \le w \le 1, \qquad 1 \le \lambda \le 10. \tag{14}$$

Furthermore, to reduce the computational time during the search process, the 10-fold cross-validation approach was adopted. Once the optimal values of the two parameters were determined, the rigorous jackknife test was utilized to evaluate the anticipated accuracy of the predictor.
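The search over the range in (14) can be sketched as an exhaustive grid search. The step size of 0.1 for w is an assumption (the text gives only the range), and `toy_score` is a hypothetical placeholder for the 10-fold cross-validation accuracy that would be obtained for each (w, λ) pair.

```python
def search_parameters(score, w_values, lam_values):
    """Return (best_score, w, lam) maximizing score(w, lam) over the grid."""
    best = None
    for lam in lam_values:
        for w in w_values:
            current = score(w, lam)
            if best is None or current > best[0]:
                best = (current, w, lam)
    return best

# Grid covering Eq. (14); the 0.1 step for w is an assumed resolution.
w_grid = [round(0.1 * i, 1) for i in range(11)]
lam_grid = list(range(1, 11))

def toy_score(w, lam):
    """Hypothetical stand-in for a cross-validated accuracy."""
    return -((w - 0.3) ** 2) - 0.01 * (lam - 4) ** 2

best_score, best_w, best_lam = search_parameters(toy_score, w_grid, lam_grid)
```

In practice `toy_score` would be replaced by a function that builds the PseDNC vectors for the given (w, λ), trains the classifier, and returns the 10-fold cross-validation accuracy.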

Listed in Table 3 are the jackknife test results of the iSS-PseDNC predictor in identifying the splice donor site-containing sequences and the splice acceptor site-containing sequences on the benchmark datasets S₁ and S₂, respectively, where the optimal values for w and λ are also explicitly given.

To further show the power of the iSS-PseDNC predictor, we also did some comparison calculations as described below.

First, based on the sequence similarity principle, we used BLAST [109] to conduct the jackknife test on the same benchmark datasets as used by the iSS-PseDNC predictor. The results thus obtained are given in Table 4, from which we can see that the percentage rates for Sn, Sp, and Acc by BLAST are about 40% lower than those by iSS-PseDNC and that the rates of MCC by BLAST are about 0.5 lower than those by iSS-PseDNC, for both donor and acceptor sites.

Second, rather than the pseudo dinucleotide composition (7), we used the dinucleotide composition (6) to represent the DNA samples for prediction. The corresponding results are given in Table 5, from which we can see that the rates for Sn, Acc, and MCC are all lower than those reported in Table 3, while Sp is lower for the acceptor and marginally higher for the donor, implying that the additional components in the pseudo dinucleotide composition did play a role in enhancing the overall prediction quality.

All these results indicate that the iSS-PseDNC model as proposed in this paper is quite promising and may become a useful tool in identifying splice sites.

4. Conclusions

RNA splicing is a complicated biological process that involves interactions among DNA, RNA, and proteins. Hence, it is reasonable to analyze the structural properties that can be used to describe these interactions. In view of this, we firstly plotted the profiles of the six DNA structural properties (twist, tilt, roll, shift, slide, and rise) for splice site-containing sequences and found the differences between true and false splice site-containing sequences. The structural divergences surrounding splice sites may facilitate the removal of the introns by spliceosome.

By defining PseDNC using the above six DNA structural properties, we proposed a model, namely, iSS-PseDNC, for identifying splice sites. The predictive performance demonstrated that our model is helpful for splice site recognition. Since user-friendly and publicly accessible web-servers represent the direction of developing practically more useful models, simulated methods, or predictors [110], we will make efforts in our future work to provide a web-server for the approach presented in this paper.

It has not escaped our notice that the web-server PseKNC (pseudo K-tuple nucleotide composition) developed very recently [111] will be very useful for further improving the prediction quality in identifying the splicing sites.

Supplementary Material

Supporting Information S1. The benchmark dataset for splice donor sites.

Supporting Information S2. The benchmark dataset for splice acceptor sites.


Figure 1

A schematic drawing to show the pathways of RNA splicing. (a) The 2′OH of the branchpoint nucleotide within the intron (solid line) carries out a nucleophilic attack at the first nucleotide of the intron at the 5′ splice site (GU) forming the lariat intermediate. (b) The 3′OH of the released 5′ exon then performs a nucleophilic attack at the last nucleotide of the intron at the 3′ splice site (AG). (c) Joining the exons and releasing the intron lariat.

Figure 2

A schematic illustration to show the correlations of dinucleotides along a DNA sequence. (a) The first-tier correlation reflects the sequence-order mode between all the most contiguous dinucleotides. (b) The second-tier correlation reflects the sequence-order mode between all the second-most contiguous dinucleotides. (c) The third-tier correlation reflects the sequence-order mode between all the third-most contiguous dinucleotides.

Figure 3

A schematic drawing to illustrate the six spatial arrangements between two neighboring base pairs in DNA. Of the six panels, three are for the local translational arrangements and the other three are for the local angular ones [6].

Figure 4

Graphic profiles to show the difference between the true and false splice site-containing sequences. The profiles of six DNA structural properties (i.e., rise (black), slide (red), shift (blue), twist (orange), roll (green), and tilt (purple)) for (a) true splice donor site-containing sequences, (b) false splice donor site-containing sequences, (c) true splice acceptor site-containing sequences, and (d) false splice acceptor site-containing sequences. The profiles are plotted with a window size of 10 bp and a step size of 5 bp.

Table 1

The original values for the six DNA dinucleotide physical structures.

Dinucleotide	Physical structures^a
Dinucleotide	P₁(R_iR_i+1)	P₂(R_iR_i+1)	P₃(R_iR_i+1)	P₄(R_iR_i+1)	P₅(R_iR_i+1)	P₆(R_iR_i+1)
AA	0.026	0.038	0.020	1.69	2.26	7.65
AC	0.036	0.038	0.023	1.32	3.03	8.93
AG	0.031	0.037	0.019	1.46	2.03	7.08
AT	0.033	0.036	0.022	1.03	3.83	9.07
CA	0.016	0.025	0.017	1.07	1.78	6.38
CC	0.026	0.042	0.019	1.43	1.65	8.04
CG	0.014	0.026	0.016	1.08	2.00	6.23
CT	0.031	0.037	0.019	1.46	2.03	7.08
GA	0.025	0.038	0.020	1.32	1.93	8.56
GC	0.025	0.036	0.026	1.20	2.61	9.53
GG	0.026	0.042	0.019	1.43	1.65	8.04
GT	0.036	0.038	0.023	1.32	3.03	8.93
TA	0.017	0.018	0.016	0.72	1.20	6.23
TC	0.025	0.038	0.020	1.32	1.93	8.56
TG	0.016	0.025	0.017	1.07	1.78	6.38
TT	0.026	0.038	0.020	1.69	2.26	7.65

^aIn this table, the following symbols were used to represent the six physical structures of dinucleotide: P₁ for “twist”, P₂ for “tilt”, P₃ for “roll”, P₄ for “shift”, P₅ for “slide”, and P₆ for “rise”. The data was obtained from [75].

Table 2

The normalized values for the six DNA dinucleotide physical structures.

Dinucleotide	Physical structures^a
Dinucleotide	P₁(R_iR_i+1)	P₂(R_iR_i+1)	P₃(R_iR_i+1)	P₄(R_iR_i+1)	P₅(R_iR_i+1)	P₆(R_iR_i+1)
AA	0.06	0.50	0.27	1.59	0.11	−0.11
AC	1.50	0.50	0.80	0.13	1.29	1.04
AG	0.78	0.36	0.09	0.68	−0.24	−0.62
AT	1.07	0.22	0.62	−1.02	2.51	1.17
CA	−1.38	−1.36	−0.27	−0.86	−0.62	−1.25
CC	0.06	1.08	0.09	0.56	−0.82	0.24
CG	−1.66	−1.22	−0.44	−0.82	−0.29	−1.39
CT	0.78	0.36	0.09	0.68	−0.24	−0.62
GA	−0.08	0.50	0.27	0.13	−0.39	0.71
GC	−0.08	0.22	1.33	−0.35	0.65	1.59
GG	0.06	1.08	0.09	0.56	−0.82	0.24
GT	1.50	0.50	0.80	0.13	1.29	1.04
TA	−1.23	−2.37	−0.44	−2.24	−1.51	−1.39
TC	−0.08	0.50	0.27	0.13	−0.39	0.71
TG	−1.38	−1.36	−0.27	−0.86	−0.62	−1.25
TT	0.06	0.50	0.27	1.59	0.11	−0.11

^aSee footnote a of Table 1 for further explanation.

Table 3

The prediction quality as measured by metrics of (13) by iSS-PseDNC in identifying the splice donor and acceptor sites, respectively.

Splice sites	Optimal parameters		Metrics
Splice sites	λ	w	Sn (%)	Sp (%)	Acc (%)	MCC
Donor^a	4	0.3	86.66	84.25	85.45	0.71
Acceptor^b	2	0.3	88.78	86.64	87.73	0.75

^aSee Supplementary Information S1 for benchmark dataset of donor.

^bSee Supplementary Information S2 for benchmark dataset of acceptor.

Table 4

The prediction quality as measured by metrics of (13) by using BLAST [109] and sequence similarity principle in identifying splice acceptor and donor sites, respectively.

Splice sites	Metrics
Splice sites	Sn (%)	Sp (%)	Acc (%)	MCC
Acceptor^a	39.09	40.20	39.62	−0.21
Donor^b	42.75	37.63	40.23	0.20

^aSee footnote b of Table 3 for further explanation.

^bSee footnote a of Table 3 for further explanation.

Table 5

The prediction quality as measured by metrics of (13) by using the dinucleotide composition (6) to formulate the DNA samples in identifying the splice donor and acceptor sites, respectively.

Splice sites	Metrics
Splice sites	Sn (%)	Sp (%)	Acc (%)	MCC
Donor^a	81.23	84.42	82.58	0.67
Acceptor^b	83.39	85.60	83.78	0.68

^aSee footnote a of Table 3 for further explanation.

^bSee footnote b of Table 3 for further explanation.

Acknowledgments

The authors wish to thank the editor for taking time to edit this paper and the anonymous reviewers for the constructive comments, which were very helpful for strengthening the presentation of this paper. This work was supported by the National Nature Scientific Foundation of China (nos. 61100092 and 61202256) and the Nature Scientific Foundation of Hebei Province (no. C2013209105).

Conflict of Interests

The authors declare no conflict of interests.

References

1. HoskinsAAMooreMJThe spliceosome: a flexible, reversible macromolecular machineTrends in Biochemical Sciences2012375179188[PubMed][Google Scholar]
2. ChouK-CSome remarks on protein attribute prediction and pseudo amino acid compositionJournal of Theoretical Biology20112731236247[PubMed][Google Scholar]
3. XiaoXWangPLinWZJiaJHiAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional typesAnalytical Biochemistry20134362168177[PubMed][Google Scholar]
4. ChenWFengPMLinHChouKCiRSpot-PseDNC: identify recombination spots with pseudo dinucleotide compositionNucleic Acids Research201341p. e69[Google Scholar]
5. LiuBZhangDXuRCombining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detectionBioinformatics2014304472479[PubMed][Google Scholar]
6. GuoSHDengEZXuLQDingHLinHChenWiNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide compositionBioinformatics2014[Google Scholar]
7. QiuWRXiaoXiRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid componentsInternational Journal of Molecular Sciences201415217461766[PubMed][Google Scholar]
8. XuYDingJWuLYiSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid compositionPLoS ONE201382[PubMed][Google Scholar]
9. XuYShaoXJWuLYDengNYiSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteinsPeerJ20131p. e171[Google Scholar]
10. ChouK-CShenH-BRecent progress in protein subcellular location predictionAnalytical Biochemistry20073701116[PubMed][Google Scholar]
11. ZhangC-TChouK-CAn optimization approach to predicting protein structural class from amino acid compositionProtein Science199213401408[PubMed][Google Scholar]
12. ChenWLinHFengPMDingCZuoYCiNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical propertiesPLoS ONE2012710[PubMed][Google Scholar]
13. ThompsonTBChouK-CZhengCNeural network prediction of the HIV-1 protease cleavage sitesJournal of Theoretical Biology19951774369379[PubMed][Google Scholar]
14. CaiY-DZhouG-PChouK-CSupport vector machines for predicting membrane protein types by using functional domain compositionBiophysical Journal200384532573263[PubMed][Google Scholar]
15. FengPMChenWLinHChouK-CiHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet compositionAnalytical Biochemistry20134421118125[PubMed][Google Scholar]
16. Xiao X, Wang P, Chou K-C. iNR-physchem: a sequence-based predictor for identifying nuclear receptors and their subfamilies via physical-chemical property matrix. PLoS ONE. 2012;7(2).
17. Kandaswamy KK, Chou K-C, Martinetz T. AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties. Journal of Theoretical Biology. 2011;270(1):56–62.
18. Lin W-Z, Fang J-A, Xiao X, Chou K-C. iDNA-prot: identification of DNA binding proteins using random forest with grey model. PLoS ONE. 2011;6(9).
19. Cai Y-D, Chou K-C. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics. 2004;20(7):1151–1156.
20. Denoeux T. κ-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems, Man and Cybernetics. 1995;25(5):804–813.
21. Chou K-C, Shen H-B. Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. Journal of Proteome Research. 2007;6(5):1728–1734.
22. Hayat M, Khan A. Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou’s PseAAC. Protein & Peptide Letters. 2012;19(4):411–421.
23. Xiao X, Min JL, Wang P. iCDI-PseFpt: Identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints. Journal of Theoretical Biology. 2013;337:71–79.
24. Xiao X, Wang P, Lin WZ, Jia JH. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Analytical Biochemistry. 2013;436(2):168–177.
25. Chou KC. Some remarks on predicting multi-label attributes in molecular biosystems. Molecular Biosystems. 2013;9(6):1092–1100.
26. Wang M, Yang J, Xu Z-J, Chou K-C. SLLE for predicting membrane protein types. Journal of Theoretical Biology. 2005;232(1):7–15.
27. Wootton JC, Federhen S. Statistics of local complexity in amino acid sequences and sequence databases. Computers and Chemistry. 1993;17(2):149–163.
28. Wang T, Yang J, Shen H-B, Chou K-C. Predicting membrane protein types by the LLDA algorithm. Protein & Peptide Letters. 2008;15(9):915–921.
29. Chou KC. Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics. 2001;43:246–255 (Erratum: ibid., 2001, Vol. 44, 60).
30. Lin SX, Lapointe J. Theoretical and experimental biology in one. Journal of Biomedical Science and Engineering. 2013;6:435–442.
31. Zhou X-B, Chen C, Li Z-C, Zou X-Y. Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. Journal of Theoretical Biology. 2007;248(3):546–551.
32. Zhang T-L, Ding Y-S, Chou K-C. Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern. Journal of Theoretical Biology. 2008;250(1):186–193.
33. Jiang X, Wei R, Zhao Y, Zhang T. Using Chou’s pseudo amino acid composition based on approximate entropy and an ensemble of AdaBoost classifiers to predict protein subnuclear location. Amino Acids. 2008;34(4):669–675.
34. Lin H. The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou’s pseudo amino acid composition. Journal of Theoretical Biology. 2008;252(2):350–356.
35. Nanni L, Lumini A. Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization. Amino Acids. 2008;34(4):653–660.
36. Zhang G-Y, Fang B-S. Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou’s amphiphilic pseudo-amino acid composition. Journal of Theoretical Biology. 2008;253(2):310–315.
37. Zhang S-W, Chen W, Yang F, Pan Q. Using Chou’s pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach. Amino Acids. 2008;35(3):591–598.
38. Georgiou DN, Karakasidis TE, Nieto JJ, Torres A. Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition. Journal of Theoretical Biology. 2009;257(1):17–26.
39. Li Z-C, Zhou X-B, Dai Z, Zou X-Y. Prediction of protein structural classes by Chou’s pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis. Amino Acids. 2009;37(2):415–425.
40. Lin H, Wang H, Ding H, Chen Y-L, Li Q-Z. Prediction of subcellular localization of apoptosis protein using Chou’s pseudo amino acid composition. Acta Biotheoretica. 2009;57(3):321–330.
41. Qiu J-D, Huang J-H, Liang R-P, Lu X-Q. Prediction of G-protein-coupled receptor classes based on the concept of Chou’s pseudo amino acid composition: an approach from discrete wavelet transform. Analytical Biochemistry. 2009;390(1):68–73.
42. Zeng Y-H, Guo Y-Z, Xiao R-Q, Yang L, Yu L-Z, Li M-L. Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. Journal of Theoretical Biology. 2009;259(2):366–372.
43. Esmaeili M, Mohabatkar H, Mohsenzadeh S. Using the concept of Chou’s pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology. 2010;263(2):203–209.
44. Mohabatkar H. Prediction of cyclin proteins using Chou’s pseudo amino acid composition. Protein & Peptide Letters. 2010;17(10):1207–1214.
45. Sahu SS, Panda G. A novel feature representation method based on Chou’s pseudo amino acid composition for protein structural class prediction. Computational Biology and Chemistry. 2010;34(5-6):320–327.
46. Yu L, Guo Y, Li Y. SecretP: identifying bacterial secreted proteins by fusing new features into Chou’s pseudo-amino acid composition. Journal of Theoretical Biology. 2010;267(1):1–6.
47. Mohabatkar H, Mohammad Beigi M, Esmaeili A. Prediction of GABAA receptor proteins using the concept of Chou’s pseudo-amino acid composition and support vector machine. Journal of Theoretical Biology. 2011;281(1):18–23.
48. Beigi MM, Behjati M, Mohabatkar H. Prediction of metalloproteinase family based on the concept of Chou’s pseudo amino acid composition using a machine learning approach. Journal of Structural and Functional Genomics. 2011;12(4):191–197.
49. Qiu J-D, Suo S-B, Sun X-Y, Shi S-P, Liang R-P. OligoPred: a web-server for predicting homo-oligomeric proteins by incorporating discrete wavelet transform into Chou’s pseudo amino acid composition. Journal of Molecular Graphics and Modelling. 2011;30:129–134.
50. Zou D, He Z, He J, Xia Y. Supersecondary structure prediction using Chou’s pseudo amino acid composition. Journal of Computational Chemistry. 2011;32(2):271–278.
51. Du P, Wang X, Xu C, Gao Y. PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Analytical Biochemistry. 2012;425(2):117–119.
52. Fan G-L, Li Q-Z. Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition. Journal of Theoretical Biology. 2012;304:88–95.
53. Mei S. Multi-kernel transfer learning based on Chou’s PseAAC formulation for protein submitochondria localization. Journal of Theoretical Biology. 2012;293:121–130.
54. Mei S. Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning. Journal of Theoretical Biology. 2012;310:80–87.
55. Nanni L, Brahnam S, Lumini A. Wavelet images and Chou's pseudo amino acid composition for protein classification. Amino Acids. 2012;43(2):657–665.
56. Nanni L, Lumini A, Gupta D, Garg A. Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou’s Pseudo amino acid composition and on evolutionary information. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2012;9(2):467–475.
57. Sun XY, Shi SP, Qiu JD, Suo SB, Huang SY, Liang RP. Identifying protein quaternary structural attributes by incorporating physicochemical properties into the general form of Chou's PseAAC via discrete wavelet transform. Molecular BioSystems. 2012;8(12):3178–3184.
58. Chang TH, Wu LC, Lee TY, Chen SP, Huang HD, Horng JT. EuLoc: a web-server for accurately predict protein subcellular localization in eukaryotes by incorporating various features of sequence segments into the general form of Chou's PseAAC. Journal of Computer-Aided Molecular Design. 2013;27(1):91–103.
59. Chen YK, Li KB. Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition. Journal of Theoretical Biology. 2013;318:1–12.
60. Fan GL, Li QZ. Discriminating bioluminescent proteins by incorporating average chemical shift and evolutionary information into the general form of Chou's pseudo amino acid composition. Journal of Theoretical Biology. 2013;334:45–51.
61. Gupta MK, Niyogi R, Misra M. An alignment-free method to find similarity among protein sequences via the general form of Chou's pseudo amino acid composition. SAR and QSAR in Environmental Research. 2013;24(7):597–609.
62. Hajisharifi Z, Piryaiee M, Mohammad Beigi M, Behbahani M, Mohabatkar H. Predicting anticancer peptides with Chou's pseudo amino acid composition and investigating their mutagenicity via Ames test. Journal of Theoretical Biology. 2014;341:34–40.
63. Huang C, Yuan J. Using radial basis function on the general form of Chou's pseudo amino acid composition and PSSM to predict subcellular locations of proteins with both single and multiple sites. Biosystems. 2013;113(1):50–57.
64. Huang C, Yuan JQ. A multilabel model based on Chou's pseudo-amino acid composition for identifying membrane proteins with both single and multiple functional types. The Journal of Membrane Biology. 2013;246(4):327–334.
65. Huang C, Yuan JQ. Predicting protein subchloroplast locations with both single and multiple sites via three different modes of Chou's pseudo amino acid compositions. Journal of Theoretical Biology. 2013;335:205–212.
66. Khosravian M, Faramarzi FK, Beigi MM, Behbahani M, Mohabatkar H. Predicting antibacterial peptides by the concept of Chou's pseudo-amino acid composition and machine learning methods. Protein & Peptide Letters. 2013;20(2):180–186.
67. Mohabatkar H, Beigi MM, Abdolahi K, Mohsenzadeh S. Prediction of allergenic proteins by means of the concept of Chou's pseudo amino acid composition and a machine learning approach. Medicinal Chemistry. 2013;9(1):133–137.
68. Qin YF, Zheng L, Huang J. Locating apoptosis proteins by incorporating the signal peptide cleavage sites into the general form of Chou's Pseudo amino acid composition. International Journal of Quantum Chemistry. 2013;113(11):1660–1667.
69. Sarangi AN, Lohani M, Aggarwal R. Prediction of essential proteins in prokaryotes by incorporating various physico-chemical features into the general form of Chou's pseudo amino acid composition. Protein & Peptide Letters. 2013;20(7):781–795.
70. Wan S, Mak MW, Kung SY. GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition. Journal of Theoretical Biology. 2013;323:40–48.
71. Wang X, Li GZ, Lu WC. Virus-ECC-mPLoc: a multi-label predictor for predicting the subcellular localization of virus proteins with both single and multiple sites based on a general form of Chou's pseudo amino acid composition. Protein & Peptide Letters. 2013;20(3):309–317.
72. Xiaohui N, Nana L, Jingbo X. Using the concept of Chou's pseudo amino acid composition to predict protein solubility: an approach with entropies in information theory. Journal of Theoretical Biology. 2013;332:211–217.
73. Xie HL, Fu L, Nie XD. Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou's PseAAC. Protein Engineering, Design and Selection. 2013;26(11):735–742.
74. Cao DS, Xu QS, Liang YZ. Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics. 2013;29(7):960–962.
75. Goñi JR, Pérez A, Torrents D, Orozco M. Determining promoter location based on DNA structure first-principles calculations. Genome Biology. 2007;8(12), article R263.
76. Goñi JR, Fenollosa C, Pérez A, Torrents D, Orozco M. DNAlive: a tool for the physical analysis of DNA at the genomic scale. Bioinformatics. 2008;24(15):1731–1732.
77. Feng PM, Ding H, Chen W, Lin H. Naive Bayes classifier with feature selection to identify phage virion proteins. Computational and Mathematical Methods in Medicine. 2013;2013:6 pages.
78. Chen W, Feng P, Lin H. Prediction of replication origins by calculating DNA structural properties. FEBS Letters. 2012;586(6):934–938.
79. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. Software. 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm
80. Chou K-C. Using subsite coupling to predict signal peptides. Protein Engineering. 2001;14(2):75–79.
81. Xiao X, Wu Z-C, Chou K-C. A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PLoS ONE. 2011;6(6).
82. Xiao X, Wu Z-C, Chou K-C. iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. Journal of Theoretical Biology. 2011;284(1):42–51.
83. Wu Z-C, Xiao X, Chou K-C. iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. Molecular BioSystems. 2011;7(12):3287–3297.
84. Chen L, Zeng W-M, Cai Y-D, Feng K-Y, Chou K-C. Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities. PLoS ONE. 2012;7(4).
85. Abeel T, Saeys Y, Bonnet E, Rouzé P, van de Peer Y. Generic eukaryotic core promoter prediction using structural features of DNA. Genome Research. 2008;18(2):310–323.
86. Chou KC, Forsén S. Graphical rules for enzyme-catalysed rate laws. Biochemical Journal. 1980;187(3):829–835.
87. Zhou GP, Deng MH. An extension of Chou’s graphic rules for deriving enzyme kinetic equations to systems involving parallel reaction pathways. Biochemical Journal. 1984;222(1):169–176.
88. Chou KC. Graphic rules in steady and non-steady state enzyme kinetics. Journal of Biological Chemistry. 1989;264(20):12074–12079.
89. Andraos J. Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws: new methods based on directed graphs. Canadian Journal of Chemistry. 2008;86(4):342–357.
90. Althaus IW, Franks KM, Diebel MR. The benzylthio-pyrimidine U-31,355 is a potent inhibitor of HIV-1 reverse transcriptase. Biochemical Pharmacology. 1996;51:743–750.
91. Althaus IW, Chou JJ, Gonzales AJ. Kinetic studies with the non-nucleoside human immunodeficiency virus type-1 reverse transcriptase inhibitor U-90152E. Biochemical Pharmacology. 1994;47(11):2017–2028.
92. Althaus IW, Chou JJ, Gonzales AJ. Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E. Journal of Biological Chemistry. 1993;268(9):6119–6124.
93. Althaus IW, Chou JJ, Gonzales AJ. Kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-88204E. Biochemistry. 1993;32(26):6548–6554.
94. Chou K-C, Kezdy FJ, Reusser F. Review: Steady-state inhibition kinetics of processive nucleic acid polymerases and nucleases. Analytical Biochemistry. 1994;221(2):217–230.
95. Chou K-C. Applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady-state systems. Biophysical Chemistry. 1990;35(1):1–24.
96. Chou K-C. Graphic rule for drug metabolism systems. Current Drug Metabolism. 2010;11(4):369–378.
97. Wu Z-C, Xiao X, Chou K-C. 2D-MH: a web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids. Journal of Theoretical Biology. 2010;267(1):29–34.
98. Chou KC, Lin WZ, Xiao X. Wenxiang: a web-server for drawing wenxiang diagrams. Natural Science. 2011;3(10):862–865.
99. Zhou G-P. The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein-protein interaction mechanism. Journal of Theoretical Biology. 2011;284(1):142–148.
100. Kurochkina N, Choekyi T. Helix-helix interfaces and ligand binding. Journal of Theoretical Biology. 2011;283(1):92–102.
101. Zhou G-P. The structural determinations of the leucine zipper coiled-coil domains of the cGMP-dependent protein kinase Iα and its interaction with the myosin binding subunit of the myosin light chains phosphase. Protein & Peptide Letters. 2011;18(10):966–978.
102. Zhou GP, Huang RB. The pH-triggered conversion of the PrP(c) to PrP(sc). Current Topics in Medicinal Chemistry. 2013;13(10):1152–1163.
103. Zhang S-W, Zhang Y-L, Yang H-F, Zhao C-H, Pan Q. Using the concept of Chou’s pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies. Amino Acids. 2008;34(4):565–572.
104. Zhou G-P. An intriguing controversy over protein structural class prediction. Protein Journal. 1998;17(8):729–738.
105. Ding S, Li Y, Yang X, Wang T. A simple k-word interval method for phylogenetic analysis of DNA sequences. Journal of Theoretical Biology. 2013;317:192–199.
106. Jingbo X, Silan Z, Feng S. Using the concept of pseudo amino acid composition to predict resistance gene against Xanthomonas oryzae pv. oryzae in rice: an approach from chaos games representation. Journal of Theoretical Biology. 2011;284(1):16–23.
107. Hayat M, Khan A. Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition. Journal of Theoretical Biology. 2011;271(1):10–17.
108. Chou K-C. A key driving force in determination of protein structural classes. Biochemical and Biophysical Research Communications. 1999;264(1):216–224.
109. Schäffer AA, Aravind L, Madden TL. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Research. 2001;29(14):2994–3005.
110. Chou KC, Shen HB. Review: recent advances in developing web-servers for predicting protein attributes. Natural Science. 2009;1(2):63–92.
111. Chen W, Lei TY, Jin DC, Lin H. PseKNC: a flexible web-server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry. 2014;456:53–60.