International Journal of Molecular Sciences. Apr/30/2009; 10(5): 1978-1998

Published online Apr/28/2009

PMID: 19564933

PMC: 2695261

doi: 10.3390/ijms10051978

Current Mathematical Methods Used in QSAR/QSPR Studies

Peixun Liu

Wei Long

Abstract

1.Introduction

As a common and successful research approach, quantitative structure activity/property relationship (QASR/QSPR) studies are applied extensively to chemometrics, pharmacodynamics, pharmacokinetics, toxicology and so on. Recently, the mathematical methods used as regression tools in QSAR/QSPR analysis have been developing quickly. Thus, not only are the previous methods, such as Multiple Linear Regression (MLR), Partial Least Squares (PLS), Neural Networks (NN), Support Vector Machine (SVM), being upgraded by improving the kernel algorithms or by combining them with other methods, but also some new methods, including Gene Expression Programming (GEP), Project Pursuit Regression (PPR) and Local Lazy Regression (LLR), are being mentioned in the current reported QSAR/QSPR studies. In light of this, this paper is divided as follows: first, the improved methods based on the earlier ones are described, then the new methods are presented. Finally, all these methods are jointly commented and their advantages and disadvantages discussed to show their application potential in future QASR/QSPR studies.

2.Multiple Linear Regression (MLR)

MLR is one of the earliest methods used for constructing QSAR/QSPR models, but it is still one of the most commonly used ones to date. The advantage of MLR is its simple form and easily interpretable mathematical expression. Although utilized to great effect, MLR is vulnerable to descriptors which are correlated to one another, making it incapable of deciding which correlated sets may be more significant to the model. Some new methodologies based on MLR have been developed and reported in recent papers aimed at improving this technique. These methods include Best Multiple Linear Regression (BMLR), Heuristic Method (HM), Genetic Algorithm based Multiple Linear Regression (GA-MLR), Stepwise MLR, Factor Analysis MLR and so on. The three most important and commonly used of these methods are described in detail below.

2.1.Best Multiple Linear Regression (BMLR)

BMLR implements the following strategy to search for the multi-parameter regression with the maximum predicting ability. All orthogonal pairs of descriptors i and j (with R2ij < R2min, default value R2ij < 0.1) are found in a given data set. The property analyzed is treated by using the two-parameter regression with the pairs of descriptors, obtained in the first step. The Nc (default value Nc = 400) pairs with highest regression correlation coefficients are chosen for performing the higher-order regression treatments. For each descriptor pair, obtained in the previous step, a non-collinear descriptor scale, k (with R2ik < R2nc and R2kj < R2nc, default value R2 < 0.6) is added, and the respective three-parameter regression treatment is performed. If the Fisher criterion at a given probability level, F, is smaller than that for the best two-parameter correlation, the latter is chosen as the final result. Otherwise, the Nc (default value Nc = 400) descriptor triples with highest regression correlation coefficients are chosen for the next step. For each descriptor set, chosen in the previous step, an additional non-collinear descriptor scale is added, and the respective (n + 1)-parameter regression treatment is performed. If the Fisher criterion at the given probability level, F, is smaller than for the best two-parameter correlation, the latter is chosen as the final result. Otherwise, the Nc (default value Nc = 400) sets descriptor sets with highest regression correlation coefficients are chosen, and this step repeated with n = n + 1 [1].

As an improved method based on MLR, BMLR is instrumental for variable selection and QSAR/QSPR modeling [2–8]. Like MLR, BMLR is noted for its simple and interpretable mathematical expression. Moreover, overcoming the shortcomings of MLR, BMLR works well when the number of compounds in the training set doesn’t exceed the number of molecular descriptors by at least a factor of five. However, BMLR will derive an unsatisfactory result when the structure-activity relationship is non-linear in nature. When too many descriptors are involved in a calculation, the modeling process will be time consuming. To speed up the calculations, it is advisable reject descriptors with insignificant variance within the dataset. This will significantly decrease the probability of including unrelated descriptors by chance. In addition, BMLR is unable to build a one-parameter model. BMLR is commercially available in the software packages CODESSA [9] or CODESSA PRO [10].

2.2.Heuristic Method (HM)

HM, an advanced algorithm based on MLR, is popular for building linear QSAR/QSPR equations because of its convenience and high calculation speed. The advantage of HM is totally based on its unique strategy of selecting variables. The details of selecting descriptors are as follows: first of all, all descriptors are checked to ensure that values of each descriptor are available for each structure. Descriptors for which values are not available for every structure in the data are discarded. Descriptors having a constant value for all structures in the data set are also discarded. Thereafter all possible one-parameter regression models are tested and the insignificant descriptors are removed. As a next step, the program calculates the pair correlation matrix of descriptors and further reduces the descriptor pool by eliminating highly correlated descriptors. The details of validating intercorrelation are: (a) all quasi-orthogonal pairs of structural descriptors are selected from the initial set. Two descriptors are considered orthogonal if their intercorrelation coefficient rij is lower than 0.1; (b) the pairs of orthogonal descriptors are used to compute the biparametric regression equations; (c) to a multi-linear regression (MLR) model containing n descriptors, a new descriptor is added to generate a model with n + 1 descriptors if the new descriptor is not significantly correlated with the previous n descriptors; step (c) is repeated until MLR models with a prescribed number of descriptors are obtained. The goodness of the correlation is tested by the square of coefficient regression (R2), square of cross-validate coefficient regression (q2), the F-test (F), and the standard deviation (S) [1].

HM is commonly used in linear QSAR and QSPR studies, and also as an excellent tool for descriptor selection before a linear or nonlinear model is built [11–34]. The advantages of HM are the high speed and the absence of software restrictions on the size of the data set. HM can either quickly give a good estimation about what quality of correlation to expect from the data, or derive several best regression models. HM usually produces correlations 2 – 5 times faster than other methods with comparable quality. Additionally, the maximum number of parameters in the resulting model can be fixed in accordance with the situation so as to save time. As a method inherited from MLR, HM is also limited in linear models.

2.3.Genetic Algorithm based Multiple Linear Regression (GA-MLR)

Combining Genetic Algorithm (GA) with MLR, a new method called GA-MLR is becoming popular in currently reported QSAR and QSPR studies [35–55]. In this method, GA is performed to search the feature space and select the major descriptors relevant to the activities or properties of the compounds. This method can deal with z large search space efficiently and has less chance to become a local optimal solution than the other algorithms. We give a brief summary of the main procedure of GA herein. The first step of GA is to generate a set of solutions (chromosomes) randomly, which is called an initial population. Then, a fitness function is deduced from the gene composition of a chromosome. The Friedman LOF function is commonly used as the fitness function, which was defined as follows:(1)LOF={SSE/(1−(c+dp/n))}2where SSE is the sum of squares of errors, c is the number of the basis function (other than the constant term), d is the smoothness factor, p is the number of features in the model, and n is the number of data points from which the model is built. Unlike the R2 error, the LOF measure cannot always be reduced by adding more terms to the regression model. By limiting the tendency to simply add more terms, the LOF measure resists over-fitting of a model. Then, crossover and mutation operations are performed to generate new individuals. In the subsequent selection stage, the fittest individuals evolve to the next generation. These steps of evolution continue until the stopping conditions are satisfied. After that, the MLR is employed to correlate the descriptors selected by GA and the values of activities or properties.

GA, a well-estimated method for parameter selection, is embedded in GA-MLR method so as to overcome the shortage of MLR in variable selection. Like the MLR method, the regression tool in GA-MLR, is a simple and classical regression method, which can provide explicit equations. The two parts have a complementation for each other to make GA-MLR a promising method in QSAR/QSPR research.

3.Partial Least Squares (PLS)

The basic concept of PLS regression was originally developed by Wold [56,57]. As a popular and pragmatic methodology, PLS is used extensively in various fields. In the field of QSAR/QSPR, PLS is famous for its application to CoMFA and CoMSIA. Recently, PLS has evolved by combination with other mathematical methods to give better performance in QSAR/QSPR analyses. These evolved PLS’, such as Genetic Partial Least Squares (G/PLS), Factor Analysis Partial Least Squares (FA-PLS) and Orthogonal Signal Correction Partial Least Squares (OSC-PLS), are briefly introduced in the following sections.

3.1.Genetic Partial Least Squares (G/PLS)

G/PLS is derived from two QSAR calculation methods Genetic Function Approximation (GFA) [58,59] and PLS. The G/PLS algorithm uses GFA to select appropriate basis functions to be used in a model of the data and PLS regression is used as the fitting technique to weigh the basis functions’ relative contributions in the final model. Application of G/PLS thus allows the construction of larger QSAR equations while still avoiding over-fitting and eliminating most variables. As the regression method used in Molecular Field Analysis (MFA), a well-known 3D-QSAR analysis tool, G/PLS is commonly used. The recent literatures related to G/PLS are mainly listed as [60–70].

3.2.Factor Analysis Partial Least Squares (FA-PLS)

This is the combination of Factor Analysis (FA) and PLS, where FA is used for initial selection of descriptors, after which PLS is performed. FA is a tool to find out the relationships among variables. It reduces variables into few latent factors from which important variables are selected for PLS regression. Most of the time, a leave-one-out method is used as a tool for selection of optimum number of components for PLS. We can find examples of FA-PLS used in QSAR analysis in [68,71–74].

3.3.Orthogonal Signal Correction Partial Least Squares (OSC-PLS)

Orthogonal signal correction (OSC) was introduced by Wold et al. [75] to remove systematic variation from the response matrix X that is unrelated, or orthogonal, to the property matrix Y. Therefore, one can be certain that important information regarding the analyte is retained. Since then, various OSC algorithms have been published in an attempt to reduce model complexity by removing orthogonal components from the signal. In abstracto, a preprocessing with OSC will help traditional PLS to obtain a more precise model, as proven in many studies of spectral analysis [76–88]. To date, unfortunately, there are only a few reports in which OSC-PLS is applied to QSAR/QSPR studies [89–91], but more QSAR or QSPR research involving application of the OSC-PLS method are expected in the future.

4.Neural Networks (NN)

As an alternative to the fitting of data to an equation and reporting the coefficients derived therefrom, neural networks are designed to process input information and generate hidden models of the relationships. One advantage of neural networks is that they are naturally capable of modeling nonlinear systems. Disadvantages include a tendency to overfit the data, and a significant level of difficulty in ascertaining which descriptors are most significant in the resulting model. In the recent QSAR/QSPR studies, RBFNN and GRNN are the most frequently used ones among NN.

4.1.Radial Basis Function Neural Network (RBFNN)

The RBFNN consists of three layers: an input layer, a hidden layer and an output layer. The input layer does not process the information; it only distributes the input vectors to the hidden layer. Each neuron on the hidden layer employs a radial basis function as a nonlinear transfer function to operate on the input data. In general, there are several radial basis functions (RBF): linear, cubic, thin plate spline, Gaussian, multi-quadratic and inverse multi-quadratic. The most often used RBF is a Gaussian function that is characterized by a center (cj) and width (rj). Due to the limited length of writing, only Gaussian RBF is introduced in this paper. The nonlinear transformation with RBF in the hidden layer is given as follows:(2)hj(x)=exp(−‖x−cj‖2rj2)where, hj is the notation for the output of the jth RBF unit, cj and rj are the center and width of the jth RBF, respectively. The operation of the output layer is linear, which is given as below:(3)yk(x)=∑wkjhj(x)+bkwhere, yk is the kth output unit for the input vector x, wkj is the weight connection between the kth output unit and the jth hidden layer unit, and bk is the bias. The training procedure when using RBF involves selecting centers, width and weights. In this paper, the forward subset selection routine was used to select the centers from training set samples. The adjustment of the connection weight between the hidden layer and the output layer was performed using a least squares solution after the selection of centers and widths of radial basis functions. Compared with the Back Propagation Neural Network (BPNN), RBFNN has the advantage of short training time and is guaranteed to reach the global minimum of error surface during training process. The optimization of its topology and learning parameters are easy to implement. Applications of RBFNN in QSAR/QSPR studies can be found in [22,24,27,32,51,92–102].

4.2.General Regression Neural Network (GRNN)

GRNN, one of the so-called Bayesian networks, is a type of neural network using kernel-based approximation to perform regression. It was introduced by Specht in 1991 [103]. GRNN is a nonparametric estimator that calculates a weighted average of the target values of training patterns by the probability density function using Parzen’s nonparametric estimator. For GRNN, the predicted value is the most probable value E(y|x):(4)E(y|x)=□y^(x)=∫−∞+∞yf(x,y)dy∫−∞+∞f(x,y)dywhere f(x, y) is the probability density function. This can be estimated from the training set by using the Parzen’s nonparametric estimator [47]:(5)f(x,y)=1nσ∑i=1nw(x−xi)σwhere n is the sample size, s is a scaling parameter that defines the width of the bell curve that surrounds each sample point, W(d) is a weighting function that has its largest value at d=0, and (x-xi) is the distance between the unknown sample and a data point. The Gaussian function is frequently used as the weighting function because it is well behaved, easily calculated, and satisfies the conditions required by Parzen’s estimator. Substituting Parzen’s nonparametric estimator for f(x, y) and performing the integrations leads to the fundamental equation of GRNN:(6)y^(x)=∑i=1nyiexp(−D(x,xi))∑i=1nexp(−D(x,xi))where:(7)D(x,xi)=∑j=1p(xj−xijσj)2

GRNN consists of four layers: input, hidden, summation, and output layers. The greatest advantage of GRNN is the training speed. Meanwhile, it is relatively insensitive to outliers (wild points). Training a GRNN actually consists mostly of copying training cases into the network, and so is as close to instantaneous as can be expected. The greatest disadvantage is network size: a GRNN network actually contains the entire set of training cases, and is therefore space-consuming, requiring more memory space to store the model. Relative literatures are [46,104–109].

5.Support Vector Machine (SVM)

SVM, developed by Vapnik [110,111] as a novel type of machine learning method, is gaining popularity due to its many attractive features and promising empirical performance. Originally, SVM was developed for pattern recognition problems. After that, SVM it was applied to regression by the introduction of an alternative loss function and results appear to be very encouraging [112]. As a developing method, new types of SVM are coming in on the stage of QSAR/QSPR, such as Least Square Support Vector Machine (LS-SVM), Grid Search Support Vector Machine (GS-SVM), Potential Support Vector Machine (P-SVM) and Genetic Algorithms Support Vector Machine (GA-SVM). Here, we only choose LS-SVM, the most commonly used one, as an example to provide a description.

5.1.Least Square Support Vector Machine (LS-SVM)

The LS-SVM, which was a modified SVM algorithm, was proposed by Suykens et al. [113] and used to build the nonlinear model. Here, we only briefly describe the main idea of LS-SVM for function estimation. In principle, LS-SVM always fits a linear relation (y = wx + b) between the regressors (x) and the dependent variable (y). The best relation can be obtained by minimizing the cost function (Q) containing a penalized regression error term:(8)QLS−SVM=12wTw+γ∑k=1Nek2subject to:(9)yk=wTϕ(xk)+b+ek,k=1,…,Nwhere ϕ : Rn → Rm is the feature map mapping the input space to a usually high-dimensional feature space, γ is the relative weight of the error term, and ek is error variables taking noisy data into accurate and avoiding poor generalization. LS-SVM considers this optimization problem to be a constrained optimization problem and uses a language function to solve it. By solving the Lagrange style of equation (8), the weight coefficients (w) can be written as:(10)w=∑k=1NαkxkTx+bwithαk=2γek

By substituting it into the original regression line (y □ wx + b), the following result can be obtained:(11)y=∑k=1NαkxkTx+b

It can be seen that the Lagrange multipliers can be defined as:αk=(xkTx+(2γ)−1(yk−b))

Finding these Lagrange multipliers is very simple, as opposed to with the SVM approach, in which a more difficult relation has to be solved to obtain these values. In addition, it easily allows for a nonlinear regression as an extension of the linear approach by introducing the kernel function. This leads to the following nonlinear regression function:(12)f(x)=∑k=1NαkK(x,xk)+bwhere K(x, xk) is the kernel function. The value is equal to the inner product of two vectors x and xk in the feature space Φ(x) and Φ(xk); that is, K(x, xk)) = Φ(x)^TΦ(xk). The choices of a kernel and its specific parameters together with γ have to be tuned by the user. The radial basis function (RBF) kernel K(x, xk) = exp(-‖xk– x‖²/σ²) is commonly used, and then leave-one-out (LOO) cross-validation is employed to tune the optimized values of the two parameters γ and σ. LS-SVM is a simplification of traditional support vector machine (SVM). It encompasses similar advantages with SVM and its own additional advantages. It only requires solving a set of linear equations (linear programming), which is much easier and computationally simpler than nonlinear equations (quadratic programming) employed by traditional SVM. Therefore, LS-SVM is faster than traditional SVM in treating the same work. The related literature is presented in [37,114–118].

6.Gene Expression Programming (GEP)

Gene expression programming was invented by Ferreira in 1999 and was developed from genetic algorithms and genetic programming (GP). GEP uses the same kind of diagram representation of GP, but the entities evolved by GEP (expression trees) are the expression of a genome. GEP is more simple than cellular gene progression. It mainly includes two sides: the chromosomes and the expression trees (ETs). The process of information of gene code and translation is very simple, such as a one-to-one relationship between the symbols of the chromosome and the functions or terminals they represent. The rules of GEP determine the spatial organization of the functions and terminals in the ETs and the type of interaction between sub-ETs. Therefore, the language of the genes and the ETs represents the language of GEP.

6.1.The GEP chromosomes, expression trees (ETs), and the mapping mechanism

Each chromosome in GEP is a character string of fixed length, which can be composed of gene from the function set or the terminal set. Using the elements {+, −,*, /, Q} as the function set and {a, b, c, d} as the terminal set, the following is an example of GEP chromosome of length eight:{01234567Q*+−abcd}where Q denotes the square-root function; and a, b, c, d are variable (or attribute) names. The above is referred to as Karva notation or K-expression. A K-expression can be mapped into an ET following a width-first procedure. A branch of the ET stops growing when the last node in this branch is a terminal. The conversion of an ET into a K-expression is also very straightforward and can be accomplished by recording the nodes from left to right in each layer of the ET in a top-down fashion to form the string:(13)y=(a−b)*(c+d)

6.2.Description of the GEP algorithm

The purpose of symbolic regression or function finding is to find an expression that can give a good explanation for the dependent variable. The first step is to choose the fitness function. Mathematically, the fitness fi of an individual program i is expressed by the equation:(14)fi=∑j=1n(R−|P(ij)−TjTj.100|)where R is the selection range, P(ij) is the value predicted by the individual program i for fitness case j (out of n fitness cases), and Tj is the target value for fitness case j. Note that the absolute value term corresponds to the relative error. This term is what is called the precision and if the error is smaller than or is equal to the precision then the error becomes zero. Thus, for a good match the absolute value term is zero and fi = fmax = nR. For some function finding problems it is important to evolve a model that performs well for all fitness cases within a certain relative error (the precision) of the correct value:(15)f(i,j)={1,E(i,j)≤p0,where p is the precision and E(ij) is the relative error of an individual program i for the fitting case j.

The E(ij) is given by:(16)E(ij)=|P(ij)−TjTj•100|

The second step consists of choosing the set of terminals T and the set of functions F to create the chromosomes. In this problem, the terminal set consists obviously of the independent variable, i.e., T = {a}. The third step is to choose the chromosomal architecture, i.e., the length of the head and the number of genes. The fourth major step is to choose the linking function. The last major step is to choose the set of genetic operators that cause variation and their rates. These processes are repeated for a pre-specified number of generations until a solution is obtained. In the GEP, the individuals are often selected and copied into the next generation based on their fitness, as determined by roulette-wheel sampling with elitism, which guarantees the survival and cloning of the best individual to the next generation. The variation in the population is introduced by applying one or more genetic operators to select chromosomes, including crossover, mutation, and rotation. The process begins with the random generation of the chromosomes of the initial population. The chromosomes are expressed and the fitness of each individual is evaluated. The individuals are selected according to the fitness to reproduce with modification, leaving progeny with new traits. The individuals of this new generation are, in their turn, subjected to the same developmental process: expression of the genomes, confrontation of the selection environment, and reproduction with modification. To evaluate the ability of the GEP, the correlation coefficient (R) was introduced as:(17)Ci=Cov(T,P)σtσpwhere Cov(T, P) is the covariance of the target and model outputs. σt and σp are the corresponding standard deviations. GEP is the newest chemometrics method, and Si et al. [23,25,19–121] have applied this method to QSAR studies for the first time. The results from their studies are satisfactory and show a promising use in the nonlinear structure-activity/property relationship correlation area, but GEP is congenitally defective as far as reproducibility of the predicted results is concerned, and always deduces too complex equations. This means a higher requirement for a user who is involved with GEP. GEP is now commercialized in the software Automatic Problem Solver 3.0 or GeneXproTools 4.0 [122].

7.Project Pursuit Regression (PPR)

PPR, which was developed by Friedman and Stuetzle [123], is a powerful tool for seeking the interesting projections from high-dimensional data into lower dimensional space by means of linear projections. Therefore, it can overcome the curse of dimensionality because it relies on estimation in at most trivariate settings. Friedman and Stuetzle’s concept of PPR avoided many difficulties experienced with other existing nonparametric regression procedures. It does not split the predictor space into two regions, thereby allowing, when necessary, more complex models. In addition, interactions of predictor variables are directly considered because linear combinations of the predictors are modeled with general smooth functions. Another significant property of PPR is that the results of each interaction can be depicted graphically. The graphical output can be used to modify the major parameters of the procedure: the average smoother bandwidth and the terminal threshold. A brief description about PPR is presented here. Given the (k × n) data matrix X, where k is the number of observed variables and n is the number of units, and an m-dimensional orthonormal matrix A (m × k), the (m × n) matrix Y = AX represents the coordinates of the projected data onto the m-dimensional (m < k) space spanned by the rows of A. Because such projections are infinite, it is important to have a technique to pursue a finite sequence of projections that can reveal the most interesting structures of the data. Projection pursuit (PP) is such a powerful tool that combines both ideas of projection and pursuit. In a typical regression problem, PPR aims to approximate the regression pursuit function f(x) by a finite sum of ridge functions with suitable choices of αi and gi:(18)g(p)(x)=∑i=1pgi(αiTx)where αi values are m × n orthonormal matrices and p is the number of ridge functions. In recent QSAR/QSPR studies [21,37,124–129] PPR was employed as a regression method and always resulted in the best final models. This indicates that PPR is a promising regression method in QSAR/QSPR studies, especially when the correlation between descriptors and activities or properties is nonlinear.

8.Local Lazy Regression (LLR)

Most QSAR/QSPR models often capture the global structure-activity/property trends which are present in a whole dataset. In many cases, there may be groups of molecules which exhibit a specific set of features which relate to their activity or property. Such a major feature can be said to represent a local structure activity/property relationship. Traditional models may not recognize such local relationships. LLR is an excellent approach which extracts a prediction by locally interpolating the neighboring examples of the query which are considered relevant according to a distance measure, rather than considering the whole dataset. That will cause the basic core of this approach which is a simple assumption that similar compounds have similar activities or properties; that is, the activities or properties of molecules will change concurrently with the changes in the chemical structure. For one or more query points, “lazy” estimates the value of an unknown multivariate function on the basis of a set of possibly noisy samples of the function itself. Each sample is an input/output pair where the input is a vector and the output is a number. For each query point, the estimation of the input is obtained by combining different local models. Local models considered for combination by lazy are polynomials of zeroth, first, and second degree that fit a set of samples in the neighborhood of the query point. The neighbors are selected according to either the “Manhattan” or the “Euclidean” distance. It is possible to assign weights to the different directions of the input domain for modifying their importance in computation of the distance. The number of neighbors used for identifying local models is automatically adjusted on a query-by-query basis through a leave-one-out validation of models, each fitting a different number of neighbors. The local models are identified by using the recursive least-squares algorithm, and the leave-one-out cross-validation is obtained through the PRESS statistic. We assumed that there existed a small region around x which is linear between the dependent variable and the predictor variables. Then we determined the points around XNN(x) and build a regression model with only the points in NN(x) using the least-squares method and minimized the squared residuals for the region using this model. So we do not need to build multiple models for each point in the training set beforehand. In essence, when faced with a query point, the approach builds a representative predictive model. Hence, this approach is termed local LLR, which is described below:(19)βx=(XNN(X)TXNN(x))−1XNN(X)TYNN(x)where XNN(x) was a matrix of independent variables, YNN(x) the column vectors representing the dependent variables, for the molecules in the neighborhood of the query point. βx was the column vectors of regression coefficients. One of the main advantages of LLR is the fact that no a priori model needs to be built. This makes it suitable for large data sets, where using all of the observations can normally be time-consuming and even lead to over-fitting. At the same time, because it builds a regression model for each query point, one cannot extract meaningful structure-activity trends for the data set as a whole. That is, the focus of LLR is on predictive ability, rather than interpretability. Like every method, the lazy approach has a number of shortcomings. First, as all of the computations are done at query time, the determination of the local neighborhood must be efficient. Second, uncorrelated features might result in errors in the identification of near neighbors. Finally, it is nontrivial to integrate feature selection in this framework. LLR is generally used to develop linear models for data sets in which the global structure-activity/property relationship is nonlinear in nature. However, as a new arising method in the field of QSAR/QSPR, LLR is not used extensively, with only a few relevant studies shown in references [125,130–132]. It is expected that more application studies involving LLR will appear in future QSAR/QSPR analyses.

9.Conclusions

In this paper, we focus on the current mathematical methods used as regression tools in recent QSAR/QSPR studies. Mathematical regression methods are so important for the QSAR/QSPR modeling that the choice of the regression method, most of the time, will determine if the resulted model will be successful or not. Fortunately, more and more new methods and algorithms have been applied to the studies of QSAR/QSPR, including linear and nonlinear, statistics and machine learning. At the same time, the existing methods have been improved. However, it is still a challenge for the researchers to choose suitable methods for modeling their systems. This paper may give some help on the knowledge of these methods, but more practical applications are needed so as to get a thorough understanding and then perform a better application.

References

1. KatritzkyARLobanovVSKarelsonM Comprehensive Descriptors for Structural and Statistical Analysis (CODESSA) Ref. Man. Version 2.7.10, 2007
2. DuHWangJHuZYaoXQuantitative Structure-Retention relationship study of the constituents of saffron aroma in SPME-GC-MS based on the projection pursuit regression methodTalanta200877360365[PubMed][Google Scholar]
3. DuHWatzlJWangJZhangXYaoXHuZPrediction of retention indices of drugs based on immobilized artificial membrane chromatography using Projection Pursuit Regression and Local Lazy RegressionJ. Sep. Sci20083123252333[PubMed][Google Scholar]
4. DuHZhangXWangJYaoXHuZNovel approaches to predict the retention of histidine-containing peptides in immobilized metal-affinity chromatographyProteomics2008821852195[PubMed][Google Scholar]
5. KatritzkyARPacureanuLDobchevDKarelsonMQSPR modeling of hyperpolarizabilitiesJ. Mol. Model200713951963[PubMed][Google Scholar]
6. RenYLiuHYaoXLiuMAn accurate QSRR model for the prediction of the GCxGCTOFMS retention time of polychlorinated biphenyl (PCB) congenersAnal. Bioanal. Chem2007388165172[PubMed][Google Scholar]
7. SrivaniPSrinivasERaghuRSastryGNMolecular modeling studies of pyridopurinone derivatives--potential phosphodiesterase 5 inhibitorsJ. Mol. Graph. Model200726378390[PubMed][Google Scholar]
8. KahnISildSMaranUModeling the toxicity of chemicals to Tetrahymena pyriformis using heuristic multilinear regression and heuristic back-propagation neural networksJ. Chem. Inf. Model20074722712279[PubMed][Google Scholar]
9. Semichem Home PageAvailable online: http://www.semichem.com/codessa (accessed on 10 March 2009)
10. Codessa Pro Home PageAvailable online: http://www.codessa-pro.com/ (accessed on 10 March 2009)
11. XiaBLiuKGongZZhengBZhangXFanBRapid toxicity prediction of organic chemicals to Chlorella vulgaris using quantitative structure-activity relationships methodsEcotoxicol. Environ. Saf200972787794[PubMed][Google Scholar]
12. YuanYZhangRHuRRuanXPrediction of CCR5 receptor binding affinity of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureas based on the heuristic method, support vector machine and projection pursuit regressionEur. J. Med. Chem2009442534[PubMed][Google Scholar]
13. LuWJChenYLMaWPZhangXYLuanFLiuMCChenXGHuZDQSAR study of neuraminidase inhibitors based on heuristic method and radial basis function networkEur. J. Med. Chem200843569576[PubMed][Google Scholar]
14. XiaBMaWZhengBZhangXFanBQuantitative structure-activity relationship studies of a series of non-benzodiazepine structural ligands binding to benzodiazepine receptorEur. J. Med. Chem20084314891498[PubMed][Google Scholar]
15. ZhaoCZhangHLuanFZhangRLiuMHuZFanBQSAR method for prediction of protein-peptide binding affinity: application to MHC class I molecule HLA-A*0201J. Mol. Graph. Model200726246254[PubMed][Google Scholar]
16. RebehmedJBarbaultFTeixeiraCMaurelF2D and 3D QSAR studies of diarylpyrimidine HIV-1 reverse transcriptase inhibitorsJ. Comput. Aided Mol. Des200822831841[PubMed][Google Scholar]
17. AgrafiotisDKGibbsACZhuFIzrailevSMartinEConformational sampling of bioactive molecules: a comparative studyJ. Chem. Inf. Model20074710671086[PubMed][Google Scholar]
18. SiHZWangTZhangKJHuZDFanBTQSAR study of 1,4-dihydropyridine calcium channel antagonists based on gene expression programmingBioorg. Med. Chem20061448344841[PubMed][Google Scholar]
19. LiXLuanFSiHHuZLiuMPrediction of retention times for a large set of pesticides or toxicants based on support vector machine and the heuristic methodToxicol. Lett2007175136144[PubMed][Google Scholar]
20. GongZGZhangRSXiaBBHuRJFanBTStudy of nematic transition temperatures in themotropic liquid crystal using heuristic method and radial basis function neural networks and support vector machineQSAR Comb.Sci20082712821290[Google Scholar]
21. YuanYNZhangRSHuRJRuanXFPrediction of CCR5 receptor binding affinity of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureas based on the heuristic method, support vector machine and projection pursuit regressionEur. J. Med. Chem2009442534[PubMed][Google Scholar]
22. XiaBBLiuKPGongZGZhengBZhangXYFanBTRapid toxicity prediction of organic chemicals to Chlorella vulgaris using quantitative structure-activity relationships methodsEcotoxicol. Environ. Saf200972787794[PubMed][Google Scholar]
23. LuanFSiHZLiuHTWenYYZhangXYPrediction of atmospheric degradation data for POPs by gene expression programmingSAR QSAR Environ. Res200819465479[PubMed][Google Scholar]
24. XiaBBMaWPZhengBZhangXYFanBTQuantitative structure-activity relationship studies of a series of non-benzodiazepine structural ligands binding to benzodiazepine receptorEur. J. Med. Chem20084314891498[PubMed][Google Scholar]
25. WangTSiHZChenPPZhangKJYaoXJQSAR models for the dermal penetration of polycyclic aromatic hydrocarbons based on Gene Expression ProgrammingQSAR Comb. Sci200827913921[Google Scholar]
26. LiuKPXiaBBMaWPZhengBZhangXYFanBTQuantitative structure-activity relationship modeling of triaminotriazine drugs based on Heuristic MethodQSAR Comb. Sci200827425431[Google Scholar]
27. LuWJChenYLMaWPZhangXYLuanFLiuMCChenXGHuZDQSAR study of neuraminidase inhibitors based on heuristic method and radial basis function networkEur. J. Med. Chem200843569576[PubMed][Google Scholar]
28. ZhaoCYZhangHXLuanFZhangRSLiuMCHuZDFanBTQSAR method for prediction of protein-peptide binding affinity: Application to MHC class I molecule HLA-A*0201J. Mol. Graph. Model200726246254[PubMed][Google Scholar]
29. LiHZLiuHXYaoXJLiuMCHuZDFanBTQuantitative structure-activity relationship study of acyl ureas as inhibitors of human liver glycogen phosphorylase using least squares support vector machinesChemometr. Intel. Lab. Syst200787139146[Google Scholar]
30. QinSLiuHXWangJYaoXJLiuMCHuZDFanBTQuantitative Structure-Activity Relationship study on a series of novel ligands binding to central benzodiazepine receptor by using the combination of Heuristic Method and Support Vector MachinesQSAR Comb. Sci200726443451[Google Scholar]
31. MaWPLuanFZhaoCYZhangXYLiuMCHuZDFanBTQSAR prediction of the penetration of drugs across a polydimethylsiloxane membraneQSAR Comb. Sci200625895904[Google Scholar]
32. LuanFMaWPZhangXYZhangHXLiuMCHuZDFanBTQuantitative structure-activity relationship models for prediction of sensory irritants (logRD(50)) of volatile organic chemicalsChemosphere20066311421153[PubMed][Google Scholar]
33. SiHZYaoXJLiuHXWangJLiJZHuZDLiuMCPrediction of binding rate of drug to human plasma protein based on heuristic method and support vector machineActa Chim. Sinica200664415422[Google Scholar]
34. LuanFMaWPZhangXYZhangHXLiuMCHuZDFanBTQSAR study of polychlorinated dibenzodioxins, dibenzofurans, and Biphenyls using the heuristic method and support vector machineQSAR Comb. Sci2006254655[Google Scholar]
35. GharagheiziFTirandaziBBarzinREstimation of aniline point temperature of pure hydrocarbons: A quantitative structure-property relationship approachInd. Eng. Chem. Res20094816781682[Google Scholar]
36. RiahiSMousaviMFGanjaliMRNorouziPApplication of correlation ranking procedure and artificial neural networks in the modeling of liquid chromatographic retention times (tR) of various pesticidesAnal. Lett20084133643385[Google Scholar]
37. DuHYWangJHuZDYaoXJZhangXYPrediction of fungicidal activities of rice blast disease based on least-squares support vector machines and project pursuit regressionJ. Agric. Food Chem2008561078510792[PubMed][Google Scholar]
38. GharagheiziFMehrpooyaMPrediction of some important physical properties of sulfur compounds using quantitative structure-properties relationshipsMol. Div200812143155[Google Scholar]
39. SattariMGharagheiziFPrediction of molecular diffusivity of pure components into air: A QSPR approachChemosphere20087212981302[PubMed][Google Scholar]
40. GharagheiziFAlamdariRFPrediction of flash point temperature of pure components using a Quantitative Structure-Property Relationship modelQSAR Comb. Sci200827679683[Google Scholar]
41. GharagheiziFFazeliAPrediction of the Watson characterization factor of hydrocarbon components from molecular propertiesQSAR Comb. Sci200827758767[Google Scholar]
42. OmASRyuJCKimJHQuantitative structure-activity relationships for radical scavenging activities of flavonoid compounds by GA-MLR techniqueMol. Cell. Toxicol20084170176[Google Scholar]
43. RiahiSGanjaliMRPourbasheerENorouziPQSRR study of GC retention indices of essential-oil compounds by multiple linear regression with a genetic algorithmChromatographia200867917922[Google Scholar]
44. HashemianzadehMSafarpourMAGholamjani-MoghaddamKMehdipourARDFT-based QSAR study of valproic acid and its derivativesQSAR Comb. Sci200827469474[Google Scholar]
45. GharagheiziFA new molecular-based model for prediction of enthalpy of sublimation of pure componentsThermochim. Acta2008469811[Google Scholar]
46. GharagheiziFQSPR studies for solubility parameter by means of Genetic Algorithm-Based Multivariate Linear Regression and generalized regression neural networkQSAR Comb. Sci200827165170[Google Scholar]
47. GharagheiziFAlamdariRFA molecular-based model for prediction of solubility of C-60 fullerene in various solventsFuller. Nanotub. Carbon Nanostr2008164057[Google Scholar]
48. CarlucciGD'ArchivioAAMaggiMAMazzeoPRuggieriFInvestigation of retention behaviour of non-steroidal anti-inflammatory drugs in high-performance liquid chromatography by using quantitative structure-retention relationshipsAnal. Chim. Acta20076016876[PubMed][Google Scholar]
49. GharagheiziFA new accurate neural network quantitative structure-property relationship for prediction of theta (lower critical solution temperature) of polymer solutionsE-Polymers2007, No. 114[Google Scholar]
50. ElliottGNWorganHBroadhurstDDraperJScullionJSoil differentiation using fingerprint Fourier transform infrared spectroscopy, chemometrics and genetic algorithm-based feature selectionSoil Biol. Biochem20073928882896[Google Scholar]
51. GharagheiziFQSPR analysis for intrinsic viscosity of polymer solutions by means of GA-MLR and RBFNNComput. Mater. Sci200740159167[Google Scholar]
52. DeebOHemmateenejadBJaberAGarduno-JuarezRMiriREffect of the electronic and physicochemical parameters on the carcinogenesis activity of some sulfa drugs using QSAR analysis based on genetic-MLR and genetic-PLSChemosphere20076721222130[PubMed][Google Scholar]
53. VataniAMehrpooyaMGharagheiziFPrediction of standard enthalpy of formation by a QSPR modelInt. J. Mol. Sci20078407432[Google Scholar]
54. JungMTakJLeeYJungYQuantitative structure-activity relationship (QSAR) of tacrine derivatives against acetylcholinesterase (ACNE) activity using variable selectionsBioorg. Med. Chem. Lett20071710821090[PubMed][Google Scholar]
55. FiszJJCombined genetic algorithm and multiple linear regression (GA-MLR) optimizer: Application to multi-exponential fluorescence decay surfaceJ. Phys. Chem. A20061101297712985[PubMed][Google Scholar]
56. WordHResearch Papers in StatisticsWileyNew York, NY, USA1966
57. Jores-KongHWordHSystems under Indirect Observation: Causality, structure, predictionNorth-HollandAmsterdam, The Netherlands1982
58. RogersDHopfingerAJApplication of genetic function approximation to quantitative structure-activity-relationships and quantitative structure-property relationshipsJ. Chem. Inf. Comput. Sci199434854866[Google Scholar]
59. FanYShiLMKohnKWPommierYWeinsteinJNQuantitative structure-antitumor activity relationships of camptothecin analogues: Cluster analysis and genetic algorithm-based studiesJ. Med. Chem20014432543263[PubMed][Google Scholar]
60. SammiTSilakariORavikumarMThree-dimensional quantitative structure-activity relationship (3D-QSAR) studies of various benzodiazepine analogues of gamma-secretase inhibitorsJ. Mol. Model200915343348[PubMed][Google Scholar]
61. LiZGChenKXXieHYGaoJRQuantitative structure - activity relationship analysis of some thiourea derivatives with activities against HIV-1 (IIIB)QSAR Comb. Sci2009288997[Google Scholar]
62. SameeWNunthanavanitPUngwitayatornJ3D-QSAR investigation of synthetic antioxidant chromone derivatives by molecular field analysisInt. J. Mol. Sci20089235246[PubMed][Google Scholar]
63. NunthanavanitPAnthonyNGJohnstonBFMackaySPUngwitayatornJ3D-QSAR studies on chromone derivatives as HIV-1 protease inhibitors: Application of molecular field analysisArch. Der. Pharm2008341357364[Google Scholar]
64. KansalNSilakariORavikumarM3D-QSAR studies of various diaryl urea derivatives of multi-targeted receptor tyrosine kinase inhibitors: Molecular field analysis approachLett. Drug Des. Dis20085437448[Google Scholar]
65. JosephTBKumarBSanthoshBKritiSPramodABRavikumarMKishoreMQuantitative structure activity relationship and pharmacophore studies of adenosine receptor A(2B) inhibitorsChem. Biol. Drug Des200872395408[PubMed][Google Scholar]
66. EqubalTSilakariORavikumarMExploring three-dimensional quantitative structural activity relationship (3D-QSAR) analysis of SCH 66336 (Sarasar) analogues of farnesyltransferase inhibitorsEur. J. Med. Chem200843204209[PubMed][Google Scholar]
67. BhonsleJBBhattacharjeeAKGuptaRKNovel semi-automated methodology for developing highly predictive QSAR models: application for development of QSAR models for insect repellent amidesJ. Mol. Model200713179208[PubMed][Google Scholar]
68. Thomas LeonardJRoyKComparative QSAR modeling of CCR5 receptor binding affinity of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureasBioorg. Med. Chem. Lett20061644674474[PubMed][Google Scholar]
69. RoyKLeonardJTTopological QSAR modeling of cytotoxicity data of anti-HIV 5-phenyl-1-phenylamino-imidazole derivatives using GFA, G/PLS, FA and PCRA techniquesIndian J. Chem. Sect. A-Inorg. Bio-Inorg. Phys. Theor. Anal. Chem200645126137[Google Scholar]
70. DaviesMNHattotuwagamaCKMossDSDrewMGBFlowerDRStatistical deconvolution of enthalpic energetic contributions to MHC-peptide binding affinityBMC Struct Biol2006651513[PubMed][Google Scholar]
71. MandalASRoyKPredictive QSAR modeling of HIV reverse transcriptase inhibitor TIBO derivativesEur. J. Med. Chem20094415091524[PubMed][Google Scholar]
72. LeonardJTRoyKExploring molecular shape analysis of styrylquinoline derivatives as HIV-1 integrase inhibitorsEur. J. Med. Chem2008438192[PubMed][Google Scholar]
73. RoyKGhoshGQSTR with extended topochemical atom (ETA) indices 8.(a) QSAR for the inhibition of substituted phenols on germination rate of Cucumis sativus using chemometric toolsQSAR Comb. Sci200625846859[Google Scholar]
74. LeonardJTRoyKComparative QSAR modeling of CCR5 receptor binding affinity of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureasBioorg. Med. Chem. Lett20061644674474[PubMed][Google Scholar]
75. WoldSAnttiHLindgrenFOhmanJOrthogonal signal correction of near-infrared spectraChemometr. Intel. Lab. Syst199844175185[Google Scholar]
76. YinPYMohemaitiPChenJZhaoXJLuXYimitiAUpurHXuGWSerum metabolic profiling of abnormal savda by liquid chromatography/mass spectrometryJ. Chromatogr. B-Anal. Technol. Biomed. Life Sci2008871322327[Google Scholar]
77. Samadi-MaybodiADarziSSimultaneous determination of vitamin B12 and its derivatives using some of multivariate calibration 1 (MVC1) techniquesSpectrochim. Acta A-Mol. Biomol. Spectrosc20087011671172[PubMed][Google Scholar]
78. NiaziAJafarianBGhasemiJKinetic spectrophotometric determination of trace amounts of palladium by whole kinetic curve and a fixed time method using resazurine sulfide reactionSpectrochim. Acta A-Mol. Biomol. Spectrosc200871841846[PubMed][Google Scholar]
79. NiaziAGoodarziMOrthogonal signal correction-partial least squares method for simultaneous spectrophotometric determination of cypermethrin and tetramethrinSpectrochim. Acta A-Mol. Biomol. Spectrosc20086911651169[PubMed][Google Scholar]
80. NiaziAAmjadiENori-SharghDBozorghiSJSimultaneous voltammetric determination of lead and tin by adsorptive differential pulse stripping method and orthogonal signal correction-partial least squares in water samplesJ. Chinese Chem. Soc200855276285[Google Scholar]
81. KarimiMAArdakaniMMBehjatmanesh-ArdakaniRNezhadMRHAmiryanHIndividual and simultaneous determinations of phenothiazine drugs using PCR, PLS and (OSC)-PLS multivariate calibration methodsJ. Serb. Chem. Soc200873233247[Google Scholar]
82. ChoHWKimSBJeongMKParkYMillerNGZieglerTRJonesDPDiscovery of metabolite features for the modelling and analysis of high-resolution NMR spectraInt. J. Data Min. Bioinf20082176192[Google Scholar]
83. ChengZZhuASZhangLQQuantitative analysis of electronic absorption spectroscopy by piecewise orthogonal signal correction and partial least squareGuang Pu Xue Yu Guang Pu Fen Xi200828860864[PubMed][Google Scholar]
84. ChengZZhuASZhangLQQuantitative analysis of electronic absorption spectroscopy by piecewise orthogonal signal correction and partial least squareSpectrosc. Spectr. Anal200828860864[Google Scholar]
85. ChengZZhuASPiecewise orthogonal signal correction approach and its application in the analysis of wheat near-infrared spectroscopic dataChinese J. Anal. Chem200836788792[Google Scholar]
86. RouhollahiARajabzadehRGhasemiJSimultaneous determination of dopamine and ascorbic acid by linear sweep voltammetry along with chemometrics using a glassy carbon electrodeMicrochim. Acta2007157139147[Google Scholar]
87. PsihogiosNGKalaitzidisRGDimouSSeferiadisKISiamopoulosKCBairaktariETEvaluation of tubulointerstitial lesions' severity in patients with glomerulonephritides: An NMR-Based metabonomic studyJ. Proteome Res2007637603770[PubMed][Google Scholar]
88. NiaziAAziziALeardiRA comparative study between PLS and OSC-PLS in the simultaneous determination of lead and mercury in water samples: effect of wavelength selectionCan. J. Anal. Sci. Spectrosc200752365374[Google Scholar]
89. PrioloNArribereCMCaffiniNBarberisSVazquezRNLucoJMIsolation and purification of cysteine peptidases from the latex of Araujia hortorum fruits - Study of their esterase activities using partial least-squares (PLS) modelingJ. Mol. Catal. B-Enzym200115177189[Google Scholar]
90. YangSBXiaZNShuMMeiHLueFLZhangMWuYQLiZLVHSEH Descriptors for the Development of QSAMs of PeptidesChem. J. Chinese Univ20082922132217[Google Scholar]
91. LiangGZMeiHZhouYYangSBWuSRLiZLUsing SZOTT descriptors for the development of QSAMs of peptidesChem. J. Chinese Univ20062719001902[Google Scholar]
92. ZhaoCYBorianiEChanaARoncaglioniABenfenatiEA new hybrid system of QSAR models for predicting bioconcentration factors (BCF)Chemosphere20087317011707[PubMed][Google Scholar]
93. QiJNiuJFWangLLResearch on QSPR for n-octanol-water partition coefficients of organic compounds based on genetic algorithms-support vector machine and genetic algorithms-radial basis function neural networksHuanjing Kexue200829212218[PubMed][Google Scholar]
94. LuanFLiuHTWenYYZhangXYPrediction of quantitative calibration factors of some organic compounds in gas chromatographyAnalyst2008133881887[PubMed][Google Scholar]
95. LuanFLiuHTWenYYZhangXYQuantitative structure-property relationship study for estimation of quantitative calibration factors of some organic compounds in gas chromatographyAnal. Chim. Acta2008612126135[PubMed][Google Scholar]
96. LuanFLiuHTMaWPFanBTQSPR analysis of air-to-blood distribution of volatile organic compoundsEcotoxicol. Environ. Saf200871731739[Google Scholar]
97. ChenHFQuantitative predictions of gas chromatography retention indexes with support vector machines, radial basis neural networks and multiple linear regressionAnal. Chim. Acta20086092436[PubMed][Google Scholar]
98. ZhaoCYZhangHXZhangXYLiuMCHuZDFanBTApplication of support vector machine (SVM) for prediction toxic activity of different data setsToxicology2006217105119[PubMed][Google Scholar]
99. TetkoIVSolov’evVPAntonovAVYaoXJDoucetJPFanBTHoonakkerFFourchesDJostPLachicheNVarnekABenchmarking of linear and nonlinear approaches for quantitative structure-property relationship studies of metal complexation with ionophoresJ. Chem. Inf. Model200646808819[PubMed][Google Scholar]
100. ShiJLuanFZhangHXLiuMCGuoQXHuZDFanBTQSPR study of fluorescence wavelengths (lambda(ex)/lambda(em)) based on the heuristic method and radial basis function neural networksQSAR Comb. Sci200625147155[Google Scholar]
101. MaWPLuanFZhangHXZhangXYLiuMCHuZDFanBTAccurate quantitative structure-property relationship model of mobilities of peptides in capillary zone electrophoresisAnalyst200613112541260[PubMed][Google Scholar]
102. LuanFZhangXZhangHZhangRLiuMHuZFanBPrediction of standard Gibbs energies of the transfer of peptide anions from aqueous solution to nitrobenzene based on support vector machine and the heuristic methodJ. Comput. Aided Mol. Des200620111[PubMed][Google Scholar]
103. SpechtDFA general regression neural networkIEEE Trans. Neur. Netw19912568576[Google Scholar]
104. SzaleniecMTadeusiewiczRWitkoMHow to select an optimal neural model of chemical reactivity?Neurocomputing200872241256[Google Scholar]
105. JiLWangXDLuoSQinLAYangXSLiuSSWangLSQSAR study on estrogenic activity of structurally diverse compounds using generalized regression neural networkSci. China Ser. B-Chem200851677683[Google Scholar]
106. MagerPPSubset selection and docking of human P2X7 inhibitorsCurr. Comput. Aided Drug Des20073248253[Google Scholar]
107. IbricSJovanovicMDjuricZParojcicJSolomunLLucicBGeneralized regression neural networks in prediction of drug stabilityJ. Pharm. Pharmacol200759745750[PubMed][Google Scholar]
108. YapCWLiZRChenYZQuantitative structure-pharmacokinetic relationships for drug clearance by using statistical learning methodsJ. Mol. Graph. Model200624383395[PubMed][Google Scholar]
109. Agatonovic-KustrinSTurnerJVArtificial neural network modeling of phytoestrogen binding to estrogen receptorsLett. Drug Des. Disc20063436442[Google Scholar]
110. CortesCVapnikVSupport-Vector NetworksMach. Learn199520273297[Google Scholar]
111. VapnikVThe Support Vector method of function estimationUS Patent 5,950,1461999[Google Scholar]
112. WangWJXuZBLuWZZhangXYDetermination of the spread parameter in the Gaussian kernel for classification and regressionNeurocomputing200355, PII S0925-2312(02)00632-X.[Google Scholar]
113. SuykensJAKVandewalleJLeast squares support vector machine classifiersNeural Process. Lett19999293300[Google Scholar]
114. YuanYNZhangRSHuRJRuanXFPrediction of volatile components retention time in blackstrap molasses by least-squares support vector machineQSAR Comb. Sci200827535542[Google Scholar]
115. NiaziAJameh-BozorghiSNori-SharghDPrediction of toxicity of nitrobenzenes using ab initio and least squares support vector machinesJ. Hazard. Mater2008151603609[PubMed][Google Scholar]
116. GoudarziNGoodarziMPrediction of the logarithmic of partition coefficients (log P) of some organic compounds by least square-support vector machine (LS-SVM)Mol Phys2008106, DOI 10.1080/00268970802577834|PII 905993443.[Google Scholar]
117. LiuHPapaEWalkerJDGramaticaPIn silico screening of estrogen-like chemicals based on different nonlinear classification modelsJ. Mol. Graph. Model200726135144[PubMed][Google Scholar]
118. LiuHXYaoXJZhangRSLiuMCHuZDFanBTPrediction of the tissue/blood partition coefficients of organic compounds based on the molecular structure using least-squares support vector machinesJ. Comput. Aided Mol. Des200519499508[PubMed][Google Scholar]
119. SiHZWangTZhangKJDuanYBYuanSPFuAPHuZDQuantitative structure activity relationship model for predicting the depletion percentage of skin allergic chemical substances of glutathioneAnal. Chim. Acta2007591255264[PubMed][Google Scholar]
120. SiHZZhangKJHuZDFanBTQSAR model for prediction capacity factor of molecular imprinting polymer based on gene expression programmingQSAR Comb. Sci2007264150[Google Scholar]
121. SiHZYuanSPZhangKJFuAPDuanYBHueZDQuantitative structure activity relationship study on EC5.0 of anti-HIV drugsChemometr. Intel. Lab. Syst2008901524[Google Scholar]
122. Gepsoft Home PageAvailable online: http://www.gepsoft.com/ (accessed on 10 March 2009)
123. FriedmanJHStuetzleWProjection Pursuit RegressionJ. Am. Stat. Assoc198176817823[Google Scholar]
124. YuanYNZhangRSHuRJPrediction of Photolysis of PCDD/Fs Adsorbed to Spruce [Picea abies (L.) Karst.] Needle Surfaces Under Sunlight Irradiation Based on Projection Pursuit RegressionQSAR Comb. Sci200928155162[Google Scholar]
125. DuHYZhangXYWangXYaoXJHuZDNovel approaches to predict the retention of histidine-containing peptides in immobilized metal-affinity chromatographyProteomics2008821852195[PubMed][Google Scholar]
126. DuHYWangJZhangXYHuZDA novel quantitative structure-activity relationship method to predict the affinities of MT3 melatonin binding siteEur. J. Med. Chem20084328612869[PubMed][Google Scholar]
127. DuHYWangJWatzlJZhangXYHuZDPrediction of inhibition of matrix metalloproteinase inhibitors based on the combination of Projection Pursuit Regression and Grid Search methodChemometr. Intel. Lab. Syst200893160166[Google Scholar]
128. RenYYLiuHXYaoXJLiuMCPrediction of ozone tropospheric degradation rate constants by projection pursuit regressionAnal. Chim. Acta2007589150158[PubMed][Google Scholar]
129. RenYYLiuHXLiSYYaoXJLiuMCPrediction of binding affinities to beta(1) isoform of human thyroid hormone receptor by genetic algorithm and projection pursuit regressionBioorg. Med. Chem. Lett20071724742482[PubMed][Google Scholar]
130. GunturiSBArchanaKKhandelwalANarayananRPrediction of hERG Potassium Channel Blockade Using kNN-QSAR and Local Lazy Regression MethodsQSAR Comb. Sci20082713051317[Google Scholar]
131. DuHYWatzlJWangJZhangXYYaolXJHuZDPrediction of retention indices of drugs based on immobilized artificial membrane chromatography using Projection Pursuit Regression and Local Lazy RegressionJ. Sep. Sci20083123252333[PubMed][Google Scholar]
132. GuhaRDuttaDJursPCChenTLocal lazy regression: Making use of the neighborhood to improve QSAR predictionsJ. Chem. Inf. Model20064618361847[PubMed][Google Scholar]