Corrected Confidence Bands for Functional Data Using Principal Components
1. Introduction
Functional principal components (FPC) analysis is a key tool in functional data analysis that has applications in dimension reduction, estimation of curves in the presence of measurement error, and the expression of traditional longitudinal data as sparsely observed functional data. Despite the ubiquity of FPC analysis (hereafter FPCA) in the functional data literature, principal components expansions are poorly understood in terms of their sources of variability. In this article, we develop a method for the estimation of curves that accounts for uncertainty in the model-based estimation of curve-specific scores for a particular FPC decomposition as well as uncertainty in the FPC decomposition itself. Further, we construct confidence intervals for the curves that include both model-and decomposition-based sources of variability.
Using FPCA, individual curves are expressed as a linear combination of population-level basis functions and curve-specific scores. These scores can be predicted as the best linear unbiased predictors (BLUPs) in a mixed model framework for a particular FPC decomposition, which includes the basis functions and score variances. Thus FPCA allows a parsimonious, nonparametric representation of functional data. Additionally, mixed model results allow the construction of confidence intervals for estimated curves, again for a particular decomposition. However, in practice the decomposition is based on the estimated covariance structure of the population of curves and is subject to variability. This added variability can be substantial in many settings, including when a small number of curves is observed and when the curves are observed sparsely at the subject level, yet is often overlooked. Accurate estimation and inference for functional expansions is important is many scientific settings, such as understanding CD4 cell count trajectories in HIV positive individuals based on infrequent measurements or quantifying intracranial white-matter structure using diffusion tensor imaging (DTI).
Our proposed method explicitly conditions on a particular FPC decomposition when predicting curve-specific scores and estimating the associated variability. Next, we use iterated expectation and variance formulas to combine uncertainty in the conditional estimates across the distribution of FPC decomposition objects (the basis functions, mean function, measurement error and score variances, and the truncation lag). We use the bootstrap to derive an empirical distribution of decomposition objects by resampling curves with replacement and, for each sample, constructing an FPC decomposition. Thus, our method accounts for two major sources of variability: (1) the model-based uncertainty conditional on a particular decomposition; and (2) the uncertainty in the estimated FPC decomposition objects. Moreover, it is applicable when curves are sparsely or densely sampled, for small and large sample sizes and when the curves are measured with error.
FPCA has a long history in the statistical literature, dating at least to the treatment of growth curves in Rao (1958). For a modern introduction to the topic, see Ramsay and Silverman (2005, Chapter 8) or Jolliffe (2002, Section 12.3). Recently, Yao, Müller, and Wang (2005) proposed the use of BLUPs to predict curve-specific scores for FPC expansions in a mixed model framework; this approach is related to that of Shi, Weiss, and Taylor (1996) and Rice and Wu (2001), in which curves are expressed using penalized spline bases in a mixed model, but can be used when curves are sparsely observed. BLUPs have also been used in multilevel and longitudinal FPC analysis by Di et al. (2009) and Greven et al. (2010), respectively, to predict scores in more complex decompositions. The mixed model framework has allowed the use of Bayesian techniques for FPCA (Crainiceanu and Goldsmith 2010) and the joint modeling of curves and scalar outcomes in functional regression models (Goldsmith et al. 2012). Asymptotic consistency and distributions of scores predicted in the mixed model framework are developed by Yao et al. (2005), and asymptotic properties of FPC decompositions are considered in Hall and Hosseini-Nasab (2006). James, Hastie, and Sugar (2000) use a penalized spline approach to estimate FPC basis functions directly, rather than through decomposing an estimated covariance matrix. This article also uses the bootstrap to generate pointwise confidence intervals by taking the empirical quantiles of estimated functions; importantly, this approach accounts for uncertainty in the FPC decompositions, but not for model-based uncertainty. The bootstrap is similarly used with functional data by Hall and Hosseini-Nasab (2006) to construct confidence intervals for FPC basis functions, and by Crainiceanu et al. (2012) for inference on the mean difference curve between two groups.
As curves can be viewed as realizations from stochastic processes and thus as random objects, it may be more appropriate to refer to “prediction intervals” than to “confidence intervals” when providing bands for predicted curves. However, in keeping with existing nomenclature (e.g., Yao et al. 2005), we will use the term confidence interval when referring to bands for the curves without additional white noise error, in contrast to prediction intervals including this additional variability source. This is also in line with the use of these terms in the penalized spline literature (e.g., Ruppert, Wand, and Carroll 2003, compare page 139), where random effects are used as a device in the modeling of curves, and the term confidence rather than prediction interval is used when predicting curves without additional noise.
The remainder of the article is organized as follows. In Section 2 we review FPCA and the estimation of curves in a mixed model framework. Section 3 introduces our method for including uncertainty in principal component expansions in the estimation of curves and understanding the associated variability. Simulation results comparing several methods are presented and discussed in Section 4. Two statistically and scientifically distinct applications are considered in Section 5. We close with a discussion in Section 6.
2. Mixed Model Framework for FPCA
Given curves Yi(d), d ∈ [0, 1], for 1 ≤ i ≤ I, define the covariance operator ΣY (d, d′) = Cov[Yi(d), Yi (d′)] so that ΣY (d, d′) is a bivariate function giving the covariance between two locations on the same curve. Let be the spectral decomposition of ΣY (d, d′), where ϕ(d) = {ϕk(d):k ∈ ℤ} are orthonormal eigenfunctions and λ1 ≥ λ2 . . . are the corresponding nonincreasing eigenvalues. Based on this decomposition, a Karhunen–Loève expansion for Yi(d) is , where the are uncorrelated random variables with mean 0 and variance λk , and μ(d) = E[Y (d)].
In practice, we observe the functions Yi(d) with error—that is, we observe Ỹi(d) = Yi(d) + εi(d) where we assume that εi(d) ~ N[0, σ]. Moreover, functions Yi(d) are observed on finite grids that are often sparse at the subject level. These practical concerns impact the estimation of the mean function, the FPC decomposition, and the Karhunen–Loève expansion. First, the mean function is estimated using penalized splines fit to the pooled observations under working independence; the smoothing parameter is selected via restricted maximum likelihood (REML). Next, we use a method of moments approach to construct a raw estimate of the covariance matrix and smooth the off-diagonal elements to obtain (Staniswalis and Lee 1998; Yao et al. 2003). Smoothing of the raw covariance matrix is performed using bivariate penalized splines, again with smoothing parameters chosen via REML. For both the mean function and the covariance matrix, other smoothing approaches could be used without modification of the underlying technique. The retained estimated principal component basis functions and score variances are based on the decomposition of ; the estimate of the truncation K lag is the minimum number of components needed to explain 99% of the variability in the observed curves, although other methods for estimating K are possible. The estimate of the covariance matrix ΣY (d, d′) is given by
where is a diagonal matrix with elements based on the retained basis functions and score variances. The measurement error variance σ is estimated as the average difference between the middle 60% of diagonal elements of the raw covariance matrix and ; omitting the extreme diagonal elements provides more stable estimates of the measurement error variance.
Numeric integration for the estimation of the scores may be inaccurate when the curves are sparsely observed. With this in mind, Yao et al. (2005) predict scores ξi = {ξik : k ∈ } using BLUPs, effectively basing inference on the mixed model
where ξi and εi = {εi(di1), . . . , εi(diJi ) are independent. Notationally, let θ = {φ, μ, Λ, σ, K} be the collection of unobserved FPC decomposition objects: the basis functions ϕ evaluated over the domain of d; the mean function μ evaluated over the domain of d; the matrix of score variances Λ = diag{λ1, . . . λK}; the measurement error variance σ; and the truncation lag K. Similarly, let be the estimated FPC decomposition objects. Also, let and dg be a dense grid on the domain of d, often taken as the union of the di. In the following, Ỹi(di) will represent the vector of observations for curve i, the vector ϕk (di) will be the kth basis function evaluated at locations di, and the Ji × K matrix ϕ(di) will be the collection of basis functions evaluated at locations di. Analogous notation will represent functions evaluated over the dense grid dg. We emphasize the conditioning on the FPC decomposition objects notationally by conditioning on .
The estimated BLUPs for the loadings are
A truncated Karhunen–Loève expansion for the ith curve over the dense grid dg, based on the objects estimated in the FPC decomposition and the predicted scores, is
Model-based covariances for the predicted scores are used to obtain the covariance operator for the estimated curve
the curve-specific covariance in (5) is similar to the estimate of ΣY in (1), but is decreased according to the term . This covariance focuses on the variability in the curve-specific expansion, and does not include uncertainty in the estimated overall mean function . Next, approximate (1 – α)% level pointwise confidence intervals for the predicted curves are given by
where Φ(·) is the standard Gaussian cumulative distribution function. Similarly, approximate (1 – α)% level simultaneous intervals are given by
where is the (1 – α) quantile of a chi-squared distribution with degrees of freedom. The intervals in (6) and (7) are justified by asymptotic arguments in Yao et al. (2005).
2. Mixed Model Framework for FPCA
Given curves Yi(d), d ∈ [0, 1], for 1 ≤ i ≤ I, define the covariance operator ΣY (d, d′) = Cov[Yi(d), Yi (d′)] so that ΣY (d, d′) is a bivariate function giving the covariance between two locations on the same curve. Let be the spectral decomposition of ΣY (d, d′), where ϕ(d) = {ϕk(d):k ∈ ℤ} are orthonormal eigenfunctions and λ1 ≥ λ2 . . . are the corresponding nonincreasing eigenvalues. Based on this decomposition, a Karhunen–Loève expansion for Yi(d) is , where the are uncorrelated random variables with mean 0 and variance λk , and μ(d) = E[Y (d)].
In practice, we observe the functions Yi(d) with error—that is, we observe Ỹi(d) = Yi(d) + εi(d) where we assume that εi(d) ~ N[0, σ]. Moreover, functions Yi(d) are observed on finite grids that are often sparse at the subject level. These practical concerns impact the estimation of the mean function, the FPC decomposition, and the Karhunen–Loève expansion. First, the mean function is estimated using penalized splines fit to the pooled observations under working independence; the smoothing parameter is selected via restricted maximum likelihood (REML). Next, we use a method of moments approach to construct a raw estimate of the covariance matrix and smooth the off-diagonal elements to obtain (Staniswalis and Lee 1998; Yao et al. 2003). Smoothing of the raw covariance matrix is performed using bivariate penalized splines, again with smoothing parameters chosen via REML. For both the mean function and the covariance matrix, other smoothing approaches could be used without modification of the underlying technique. The retained estimated principal component basis functions and score variances are based on the decomposition of ; the estimate of the truncation K lag is the minimum number of components needed to explain 99% of the variability in the observed curves, although other methods for estimating K are possible. The estimate of the covariance matrix ΣY (d, d′) is given by
where is a diagonal matrix with elements based on the retained basis functions and score variances. The measurement error variance σ is estimated as the average difference between the middle 60% of diagonal elements of the raw covariance matrix and ; omitting the extreme diagonal elements provides more stable estimates of the measurement error variance.
Numeric integration for the estimation of the scores may be inaccurate when the curves are sparsely observed. With this in mind, Yao et al. (2005) predict scores ξi = {ξik : k ∈ } using BLUPs, effectively basing inference on the mixed model
where ξi and εi = {εi(di1), . . . , εi(diJi ) are independent. Notationally, let θ = {φ, μ, Λ, σ, K} be the collection of unobserved FPC decomposition objects: the basis functions ϕ evaluated over the domain of d; the mean function μ evaluated over the domain of d; the matrix of score variances Λ = diag{λ1, . . . λK}; the measurement error variance σ; and the truncation lag K. Similarly, let be the estimated FPC decomposition objects. Also, let and dg be a dense grid on the domain of d, often taken as the union of the di. In the following, Ỹi(di) will represent the vector of observations for curve i, the vector ϕk (di) will be the kth basis function evaluated at locations di, and the Ji × K matrix ϕ(di) will be the collection of basis functions evaluated at locations di. Analogous notation will represent functions evaluated over the dense grid dg. We emphasize the conditioning on the FPC decomposition objects notationally by conditioning on .
The estimated BLUPs for the loadings are
A truncated Karhunen–Loève expansion for the ith curve over the dense grid dg, based on the objects estimated in the FPC decomposition and the predicted scores, is
Model-based covariances for the predicted scores are used to obtain the covariance operator for the estimated curve
the curve-specific covariance in (5) is similar to the estimate of ΣY in (1), but is decreased according to the term . This covariance focuses on the variability in the curve-specific expansion, and does not include uncertainty in the estimated overall mean function . Next, approximate (1 – α)% level pointwise confidence intervals for the predicted curves are given by
where Φ(·) is the standard Gaussian cumulative distribution function. Similarly, approximate (1 – α)% level simultaneous intervals are given by
where is the (1 – α) quantile of a chi-squared distribution with degrees of freedom. The intervals in (6) and (7) are justified by asymptotic arguments in Yao et al. (2005).
3. Correct Expansions and Inference
The mixed model framework developed in Section 2 explicitly conditions on the FPC decomposition objects θ and, given the basis functions, mean function, score and measurement error variances, and truncation lag, derives best linear predictors for curve-specific loadings and variance estimates for functional expansions. When these elements are fixed at their true values, standard mixed model results support the construction described (Ruppert et al. 2003). In practice, the elements of the FPC decomposition are unknown and must be estimated from the data. As noted, asymptotic work in Yao et al. (2005) can be used to justify the above-mentioned construction for large samples, but this approach can lead to a substantial understatement of the total variability in the estimated curves when the uncertainty in the decomposition is high.
Our proposed method accounts for both model- and decomposition-based uncertainty in estimating curves and in assessing the variability of these estimates. Briefly, the framework described in Section 2 is used to construct estimates of curves and their associated variability given a particular decomposition. Iterated expectation and variance formulas combine model-based curve estimates and variances across the distribution of decompositions to account for uncertainty in FPC decompositions. We implement a bootstrap procedure to examine uncertainty in the distribution of decompositions; that is, we resample curves with replacement and construct FPC decompositions based on the bootstrap sample. Because we estimate all objects of the FPC decomposition θ = {ϕ, μ, Λ, σ, K} within each bootstrap sample, uncertainty in all of these objects is accounted for by our method. Note that other approaches for assessing uncertainty in θ could be used without altering the fundamental concept of combining two major sources of uncertainty in FPC expansions.
More specifically, we draw a sample Ỹb of I functions from the full data Ỹ with replacement. Based on this sample, we derive an estimate of the covariance operator by smoothing the off-diagonal elements of the raw covariance matrix and obtain an estimate of the FPC decomposition objects . Conditioning on , we proceed as in Section 2 and estimate the scores , curves , and variances for each element of the full data set Ỹ . Thus, resampling is used only to assess uncertainty in the decomposition—given each bootstrap decomposition, estimates and variances are constructed for each curve Yi. Model-based expectations and variances condition on a particular decomposition, but can be estimated for curves Yi that are not contained in the bootstrap sample Ỹb.
Next, we combine information across bootstrap samples to estimate curves and construct variability estimates that account for uncertainty in the FPC decompositions. First, we use iterated expectations to estimate curves over the dense grid dg:
Thus, estimated curves are the mean of model-based estimates taken over the distribution of decomposition objects in θ. This is estimated taking the average of the model-based estimates across our bootstrap samples. The total covariance operator of the estimated curves is based on the iterated variance formula:
We make several remarks regarding (9). We first emphasize that and Yi(dg) are expanded using different mean and basis functions: the former is expanded in terms of estimated objects in while the latter is expanded using true objects in θ. In practice we approximate the conditional variance using the model-based variance in (5) for each bootstrap sample, understanding that this form incorrectly assumes the same expansion for both and Yi(dg). We then average the model-based variances across bootstrap samples to reflect the outer expectation with respect to . Similarly the conditional expectation includes the bias that is induced by expanding in terms of the estimated objects . We estimate this bias by taking the difference of and , where and are the decompositions based on the bootstrap and full data, respectively, and consider the variance of this bias across bootstrap samples. Taken together, (9) combines the expected variation in the estimation of curves given a particular decomposition and the variability of the bias that stems from the use of estimated decomposition objects.
Pointwise confidence intervals for the predicted curves are given by
Simultaneous confidence intervals for FPC expansions are constructed similarly to intervals for penalized splines (Ruppert et al. 2003). Specifically, a 100(1 – α)% simultaneous confidence interval for Yi(dg ) is given by
where m(1–α) is the (1 – α) quantile of the random variable
To estimate m(1–α), we note that Ŷi(dg) – Yi(dg) is approximately multivariate normal centered at 0 and with the total covariance:
We repeatedly draw functions with the estimated covariance from this distribution, and for each draw calculate the resulting value of (12); the (1 – α) quantile of these values is used as an estimate of m(1–α). Because this sampling is done only once after all bootstrap samples have been used, estimating simultaneous intervals does not add considerably to the total computation time of this approach.
The construction of pointwise intervals in (10) and simultaneous intervals in (11) is based on the variance Var[Ŷi(dg) – Yi(dg)] of the curve estimate constructed using iterated expectations in (8). In practice, we approximate this by the total variability of the model-based estimate . Construction of confidence intervals also assumes that the estimated curve averaged over the distribution of FPC decompositions is approximately normally distributed around the true curve. However the distribution of curves may exhibit skewness, overdispersion, or multimodality; in this case, the expected value and variance defined in (8) and (9) may not completely specify the distribution. Alternatively, one might pose more flexible parametric distributions or use empirical distributions to describe the behavior of curves accounting for variability in FPC decompositions. However, asymptotic results in Yao et al. (2005) justify the normal distribution for large sample sizes, and simulation studies in Section 4 indicate that the normality assumption is reasonable and provides confidence intervals that achieve the nominal coverage in a variety of situations, including small sample sizes and sparse observations.
4. Simulations
In this section, we conduct a simulation study to examine the accuracy of the estimated curves and the coverage probabilities of confidence intervals based on our proposed method. We construct data sets using the following model
for d observed on the dense equally spaced grid dg = {(t – 0.5)/50, t = 1, . . . , 50}, μ(d) = ¼d and εi(d) ~ N[0, σ]; two scenarios for the distribution of scores ξi are given below. The eigenfunctions are taken to be and the variances are . Two hundred such data sets are constructed for all possible combinations of the scenarios below:
Sample sizes (a) I = 20; (b) I = 50; (c) I = 100;
Measurement error variances (a) σ = 0.0005; (b) σ = 0.0025; (c) σ = 0.01;
Number of points observed per curve (a) Ji = 50; (b) Ji = 20; (c) an unbalanced design. For the unbalanced design, the number of points observed per curve follow a Pois[λ = 15] distribution; for sparse designs (b) and (c), the locations of observed points are uniformly distributed within dg.
4. a) Normally distributed scores: ξik ~ N[0, λk]; b) Non-normally distributed scores (mixture of normals): ξik drawn with equal probability from either or .
This gives a total of 54 possible designs. Additional simulations presented in the supplementary material use K = 2 and K = 6 to examine the effect of changing the number of components on the relative performance of several inferential approaches; the results of these analyses are similar to those given below.
For each data set, we estimate curves and construct 95% pointwise confidence intervals using the method described in Section 3 with 100 bootstrap samples per data set; additional unreported simulations indicate that this number of bootstrap samples is sufficient to understand the uncertainty in the estimation of FPC decomposition objects in the scenarios we consider. We estimate curves and confidence intervals using three methods: (i) the iterated expectation and variance method proposed in Section 3, labeled “IV” in Figures, Tables, and the text below; (ii) the mixed model approach estimating the FPC decomposition objects θ from the data, labeled “”; and (iii) the mixed model approach fixing the FPC decomposition objects θ at their true values, labeled “MM – θ”. In addition, when examining pointwise intervals we consider using the 2.5% and 97.5% quantiles of the curves estimated within each bootstrap sample, labeled “Naive Bootstrap.”
Table 1 provides the average mean squared error of the curves estimated from IV, , and MM – θ approaches using non-normally distributed scores (results for normally distributed scores are similar and reported in the online Appendix). For reference on the values in this table, Figure 1 displays two examples of simulated curves, the observed data points, and curve estimates with the MSE for each estimate.
Table 1
IV | MM – θ | ||||||||
---|---|---|---|---|---|---|---|---|---|
Ji = 50 | = 20 | Unbal | Ji = 50 | = 20 | Unbal | Ji = 50 | = 20 | Unbal | |
I = 20; σ = 0.0005 | 0.006 | 0.200 | 0.535 | 0.006 | 0.222 | 0.805 | 0.004 | 0.013 | 0.035 |
σ = 0.0025 | 0.026 | 0.260 | 0.971 | 0.026 | 0.284 | 0.886 | 0.020 | 0.065 | 0.136 |
σ = 0.01 | 0.100 | 0.478 | 0.801 | 0.099 | 0.485 | 0.867 | 0.079 | 0.242 | 0.407 |
I = 50; σ = 0.0005 | 0.005 | 0.089 | 1.246 | 0.005 | 0.105 | 3.729 | 0.004 | 0.013 | 0.039 |
σ = 0.0025 | 0.023 | 0.144 | 0.396 | 0.023 | 0.162 | 1.576 | 0.020 | 0.063 | 0.142 |
σ = 0.01 | 0.087 | 0.335 | 1.009 | 0.088 | 0.349 | 0.643 | 0.079 | 0.237 | 0.419 |
I = 100; σ = 0.0005 | 0.005 | 0.055 | 0.280 | 0.005 | 0.066 | 1.161 | 0.004 | 0.013 | 0.040 |
σ = 0.0025 | 0.021 | 0.109 | 0.775 | 0.021 | 0.120 | 1.752 | 0.020 | 0.064 | 0.138 |
σ = 0.01 | 0.083 | 0.292 | 0.515 | 0.083 | 0.301 | 0.530 | 0.078 | 0.237 | 0.415 |
The example in the left panel comes from a simulation with I = 20, Ji = 20, and σ = 0.01, while the example in the right comes from a simulation with I = 50, Ji = 50, and σ = 0.0025.
For densely observed curves, Table 1 indicates that the IV and approaches have similar performances, but that the IV method modestly outperforms the model-based approach when curves are sparsely observed. For the unbalanced design, some MSE's are quite large; this stems from estimating the measurement error variance to be zero and dramatically overfitting the observed data. In practice, these situations would be clear aberrations and other solutions (such as fixing σ to be positive) could be implemented. As expected, performance improves as sample size increases and when the measurement error variance is smaller.
Next we consider inference on estimated curves. Table 2 displays the average coverage, taken across all grid points and all subjects, of pointwise confidence intervals for simulation scenarios with non-normally distributed scores (again, results for normally distributed scores are similar and reported in the online Appendix).
Table 2
IV | Naive Bootstrap | MM – θ | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ji = 50 | = 20 | Unbal | Ji = 50 | = 20 | Unbal | Ji = 50 | = 20 | Unbal | Ji = 50 | = 20 | Unbal | |
I = 20; σ = 0.0005 | 95.4 | 98.5 | 98.4 | 68.0 | 94.8 | 94.2 | 89.2 | 29.0 | 33.2 | 95.1 | 94.8 | 94.9 |
σ = 0.0025 | 95.3 | 96.4 | 96.3 | 63.5 | 88.6 | 88.0 | 90.5 | 50.7 | 46.6 | 95.1 | 94.8 | 94.9 |
σ = 0.01 | 95.3 | 94.5 | 93.4 | 61.1 | 76.0 | 74.9 | 91.0 | 80.7 | 75.6 | 95.1 | 94.7 | 94.8 |
I = 50; σ = 0.0005 | 94.9 | 96.4 | 97.5 | 51.5 | 89.9 | 89.7 | 92.6 | 36.2 | 35.3 | 95.0 | 94.9 | 95.0 |
σ = 0.0025 | 94.8 | 93.5 | 94.2 | 46.7 | 78.7 | 79.1 | 93.0 | 66.4 | 57.2 | 95.0 | 95.0 | 95.1 |
σ = 0.01 | 94.9 | 94.4 | 93.7 | 44.3 | 61.8 | 62.3 | 93.2 | 88.7 | 85.4 | 95.0 | 95.0 | 95.2 |
I = 100; σ = 0.0005 | 95.0 | 94.4 | 96.0 | 39.2 | 86.0 | 85.4 | 94.0 | 35.2 | 42.1 | 95.1 | 95.1 | 95.0 |
σ = 0.0025 | 95.0 | 92.8 | 92.9 | 35.2 | 71.5 | 71.2 | 94.1 | 76.0 | 69.2 | 95.1 | 95.1 | 95.0 |
σ = 0.01 | 94.9 | 94.9 | 94.3 | 33.5 | 52.4 | 51.9 | 94.1 | 91.7 | 90.7 | 95.1 | 95.1 | 95.1 |
The results in Table 2 demonstrate the reliability of the IV method, with coverages between 92.8 and 98.5 for all simulation scenarios. Lower coverages are observed for both the naive bootstrap method and the approach, particularly for smaller sample sizes and sparser observations.
The naive bootstrap and approaches each consider only one important source of variability: the naive bootstrap approach captures uncertainty in the FPC decomposition objects θ but does not consider the model-based uncertainty in the estimation of the scores, while the approach includes the model-based variability in the subject-specific scores but does not consider uncertainty in the decomposition objects. As the sample size increases, θ is better estimated, so the coverage of the approach that conditions on this estimate improves. For the naive bootstrap interval, coverage decreases as sample size increases because θ is better estimated within each bootstrap sample, which provides more stable estimates of individual curves across samples. Thus, the bootstrap estimates themselves show lower variability, but do not reflect the uncertainty in the predicted curve-specific scores. Interestingly, coverage for the method decreases as the measurement error variance decreases. For scenarios in which the true measurement error variance σ is small, this parameter is often estimated to be zero. When this happens, confidence intervals estimated using the approach are degenerate and coverage probabilities are zero. Finally, the MM – θ method attains the nominal coverage in all situations, which illustrates that the low coverages in the asymptotic approach are not caused by the mixed model framework, but by uncertainty in the FPC decomposition objects.
Table 3 provides the average coverage of simultaneous confidence intervals taken over all grid points and subjects. For the IV approach, these intervals are constructed using the quantile estimate procedure described in equations (11)–(13) in Section 3. For the and MM θ methods, simultaneous intervals are given by equation (7) in Section 2.
Table 3
IV | MM – θ | ||||||||
---|---|---|---|---|---|---|---|---|---|
Ji = 50 | = 20 | Unbal | Ji = 50 | = 20 | Unbal | Ji = 50 | = 20 | Unbal | |
I = 20; σ = 0.0005 | 93.4 | 98.2 | 97.3 | 81.7 | 13.2 | 17.4 | 98.0 | 98.1 | 98.3 |
σ = 0.0025 | 94.2 | 94.4 | 93.6 | 87.2 | 26.1 | 25.7 | 98.0 | 98.1 | 98.3 |
σ = 0.01 | 94.7 | 92.0 | 88.3 | 89.1 | 60.0 | 53.8 | 98.0 | 98.2 | 98.1 |
I = 50; σ = 0.0005 | 92.4 | 93.1 | 95.6 | 91.7 | 19.1 | 25.6 | 98.2 | 98.2 | 98.3 |
σ = 0.0025 | 94.0 | 86.6 | 87.9 | 94.3 | 42.7 | 40.9 | 98.3 | 98.2 | 98.2 |
σ = 0.01 | 94.4 | 92.7 | 90.3 | 94.9 | 83.1 | 77.3 | 98.3 | 98.2 | 98.2 |
I = 100; σ = 0.0005 | 92.4 | 85.7 | 90.0 | 94.8 | 20.2 | 31.5 | 98.3 | 98.3 | 98.3 |
σ = 0.0025 | 94.1 | 85.7 | 84.9 | 96.4 | 57.0 | 52.5 | 98.3 | 98.3 | 98.3 |
σ = 0.01 | 94.4 | 94.0 | 92.4 | 96.6 | 91.8 | 88.9 | 98.4 | 98.4 | 98.3 |
The coverage of the IV method is somewhat lower for simultaneous intervals than for pointwise intervals, with values ranging between 84.9% and 97.3%; in general, however, the intervals approach nominal coverage. On the other hand, the approach has substantially worse performance in many scenarios with coverages as low as 13.2%. The effect of low measurement error variance on decreasing coverage probabilities using the mixed model approach is more evident for simultaneous intervals than for pointwise intervals, particularly for sparse data scenarios. Results for the MM – θ method indicate that intervals constructed using equation (7) are in fact conservative, with coverages of approximately 98.2% for all simulation scenarios. In the Web Appendix, we discuss the adaptation of the quantile estimation procedure used to construct simultaneous intervals for the IV method to the and MM – θ methods. Additional simulations indicate that using the quantile estimation procedure, rather than the χ quantile in equation (7), produces confidence intervals with nominal coverage for the MM – θ approach, although coverages for the method drop to as low as 8.8%.
Figure 2 illustrates the results in Tables 2 and and33 by examining a single curve in a data set with I = 20, Ji = 20, and σ = 0.01. Shown in the left panel of this figure is the true curve Yi(d), the observed points and the estimated curves within each bootstrap sample. This panel illustrates the amount of variability in the estimated curve based on uncertainty in the FPC decomposition objects θ. Displayed in the middle panel for the same curve are the 95% pointwise confidence intervals based on the IV approach, the method, and the quantiles of the bootstrapped curve estimates. The naive bootstrap approach does not consider model-based variance in the estimated scores, while the approach ignores variability in . In the right panel are the simultaneous confidence intervals given by the IV and approaches. Pointwise and simultaneous confidence intervals based on the model-only approach are narrower than those based on the iterated variance approach, and do not achieve nominal coverage level.
The improvements in average mean-squared errors and coverage probabilities for the IV approach over the method incur computational costs. Estimation of decompo sition objects requires bivariate smoothing of a covariance matrix, which can be computationally expensive and must be repeated for each bootstrap sample. The time needed to complete this step will increase with the number of points in dg and with the number of knots used in the bivariate smoothing. However, in the scenarios we consider estimating the decomposition objects takes between 4 and 6 seconds for each iteration, making the proposed IV approach computationally feasible.
5. Applications
In this section we consider two statistically and scientifically distinct applications. First, we treat longitudinal observations of CD4 cell counts as sparsely observed functional data; second, we consider densely observed white-matter tract profiles. Both data sets and the code implementing our analyses are available online.
5.1 CD4 Data
Human immune deficiency virus (HIV) attacks an immune cell called the CD4 cell and thereby decreases an affected individual's response to infectious agents. Because of this, CD4 cell count per milliliter of blood is a useful surrogate measure for the progression of HIV. In this data set from the Multicenter AIDS Cohort Study (MACS), we observe CD4 cell counts for 366 infected individuals in a traditional longitudinal study (Kaslow et al. 1987). That is, subjects were observed at roughly semi-annual visits with varying numbers of visits per subjects: there are a total of 1888 data points, with between 1 and 11 observations per subject and a median of 5. CD4 counts are plotted against months since serocon-version (the time at which HIV becomes detectable) in the top-left panel of Figure 3, with a randomly selected subset of patients highlighted in red. Longitudinal data analysis approaches to this study are presented in Diggle et al. (1994), and the treatment of a related data set as sparsely observed functional data is given in Yao et al. (2005).
In this application, we wish to estimate the continuous trajectory of CD4 cell counts over time based on a limited number of observations for each subject. FPCA offers a powerful tool for estimating population-level trends by using the dense collection of observations across subjects and predicting subject-specific functions using a mixed model framework. The predicted functions are subject to model-based variability as well as uncertainty in the FPC decomposition. This second source of variability can be particularly important when functions are sparsely observed at the subject level. Here we analyze the CD4 cell count data using both the IV method developed in Section 3 and the method. Both approaches are implemented on the full data set as well as the randomly selected subset of 25 individuals to demonstrate the effect of sample size in the analysis of sparse functional data.
Results of the CD4 cell count application are shown in Figure 3. The top-left panel plots the CD4 cell counts against months since seroconversion, highlighting the sample used in the subset analysis. Curve estimates and intervals for a single subject in the subset analysis are provided in the top-right panel of Figure 3. Simultaneous intervals are constructed using resampling procedure for the IV approach and using the asymptotic interval given in (7) for the method. For this subject, the estimated CD4 cell count trajectory using the IV method is lower for months –18 to –10 and higher for months 0 to 12 than the trajectory estimated using the mixed model approach. In these times, the pointwise and simultaneous confidence intervals constructed using the IV approach are much wider than those constructed using the framework. For example, using the method proposed in Section 3, at month –18 this subject's CD4 cell count is estimated to be approximately 2600 CD4 cells per milliliter of blood with upper and lower bounds of 3300 and 1900; using the mixed model approach, the CD4 cell count is estimated to be 2800 with bounds 3100 and 2500. Further, using the upper bound of the simultaneous interval to estimate the month in which CD4 cell count drops below 1500 per milliliter would yield month 10 under the IV method and month 4 under the approach. Thus, inference for a subject's CD4 trajectory can change dramatically depending on the procedure used to construct mean and variance estimates.
In the bottom-left panel, we compare the widths of point-wise confidence intervals yielded by the IV and approaches. For the full data analysis, intervals resulting from the proposed method have a similar width as those from the mixed model approach, but are up to 1.18 times as wide for months (36, 42) where there are fewer observations across subjects. In the analysis of the randomly selected 25 subjects, pointwise confidence intervals resulting from the IV approach are up to 1.52 times wider than those from the method, and are on average 1.21 times wider. Similarly, a comparison of the average widths of simultaneous confidence intervals given by the two methods is shown in the bottom-right panel of Figure 3. The relationship between the widths of simultaneous intervals is similar to that for pointwise intervals for the sampled data. For the full data analysis, the intervals given by the IV approach are in fact slightly narrower than those for the method; recall that simulations indicated that the simultaneous intervals in (7) may in fact be conservative asymptotically, while the IV simultaneous intervals could attain coverages slightly under the nominal level. Generally, these results indicate that ignoring the variability in the FPC decomposition can substantially understate the total variability in estimated curves for small samples and sparse data.
5.2 White-Matter Tract Data
Next, we consider a neuroimaging study of white matter tracts in multiple sclerosis (MS) patients and healthy controls. Intracranial white-matter consists of tracts or bundles of myelinated axons that carry electrical signals between areas of the brain. Major examples of white matter tracts are the corticospinal tracts, which connect the motor cortex to the spinal cord, and the corpus callosum, which bridges the left and right hemispheres. MS is an autoimmune disorder that results in white matter lesions, axonal damage and demyelination, and can lead to severe patient disability. DTI is an MRI modality able to resolve individual tracts by tracing the diffusion of water, which tends to be anisotropic within white matter tracts and isotropic elsewhere in the brain (Basser, Mattiello, and LeBihan 1994; Mori and Barker 1999; Basser et al. 2000; LeBihan et al. 2001). Fractional anisotropy (FA) is one of several quantities derived from DTI scans that measures overall anisotropy, and may be decreased in MS patients.
Our data set consists of 42 healthy controls and 100 MS patients, for whom we observe FA values sampled densely at 55 locations along the right corticospinal tract. The top-left panel of Figure 4 displays these FA profiles, separating MS patients and healthy controls. There is significant variability in tract profiles comparing subjects, as well as apparent measurement error in the observed FA values. In addition, 35% of subjects are missing observations for at least one tract location, and 17% of subjects are missing observations for five or more locations. For all subjects, any missing values are consecutive and begin at tract location 0, so that missingness is clustered at the beginning of the tract.
Because tract profiles are affected by disease progression, they are useful biomarkers and have been related to patient disability (Goldsmith et al. 2011). The goal of this analysis is to understand the total variability in tract profile estimates using FPCA. We compare the proposed IV method to the standard approach to assess the contribution of decomposition-based uncertainty on overall variability in estimated curves. Additionally, to examine the effect of sample size when curves are densely sampled, we conduct our analyses on both the full data set and on the subset of healthy controls.
Figure 4 shows the results from the analysis of the DTI tract data.
A comparison of the relative widths of pointwise confidence intervals constructed using the IV and methods is given in the top-right panel of Figure 4. In the full data analysis, the IV method results in confidence intervals that are on average 1.12 times wider than those given by the approach. When considering the 42 healthy controls only, the IV approach produces intervals that are on average 1.40 times wider than those for the approach. The comparison of the intervals yielded by these methods illustrates that, even for densely sampled observations, the mixed model approach that conditions on a single decomposition can significantly understate the total variability by neglecting the uncertainty in the FPC decomposition objects.
A demonstration of the difference between the IV and approaches is given in the bottom panels of Figure 4 using the tract profile of a single healthy control. On the left are the estimated curves and pointwise confidence intervals from the analysis based on the 42 healthy controls only; the interval based on the IV method is substantially wider at several locations on the tract, particularly near 0.0, 0.6, and 0.8. In addition, the curve estimated using iterated expectations falls outside the pointwise confidence interval given by the approach at several locations between 0.5 and 0.8. On the right are the estimates and pointwise intervals from the full data analysis, showing that the two approaches provide more similar intervals when a larger sample size is used. Thus for small sample sizes, accounting for the uncertainty in the FPC decomposition by using the IV method results in noticeably wider confidence intervals; as the sample size increases, the decomposition uncertainty decreases and the IV and methods yield more similar results.
5.1 CD4 Data
Human immune deficiency virus (HIV) attacks an immune cell called the CD4 cell and thereby decreases an affected individual's response to infectious agents. Because of this, CD4 cell count per milliliter of blood is a useful surrogate measure for the progression of HIV. In this data set from the Multicenter AIDS Cohort Study (MACS), we observe CD4 cell counts for 366 infected individuals in a traditional longitudinal study (Kaslow et al. 1987). That is, subjects were observed at roughly semi-annual visits with varying numbers of visits per subjects: there are a total of 1888 data points, with between 1 and 11 observations per subject and a median of 5. CD4 counts are plotted against months since serocon-version (the time at which HIV becomes detectable) in the top-left panel of Figure 3, with a randomly selected subset of patients highlighted in red. Longitudinal data analysis approaches to this study are presented in Diggle et al. (1994), and the treatment of a related data set as sparsely observed functional data is given in Yao et al. (2005).
In this application, we wish to estimate the continuous trajectory of CD4 cell counts over time based on a limited number of observations for each subject. FPCA offers a powerful tool for estimating population-level trends by using the dense collection of observations across subjects and predicting subject-specific functions using a mixed model framework. The predicted functions are subject to model-based variability as well as uncertainty in the FPC decomposition. This second source of variability can be particularly important when functions are sparsely observed at the subject level. Here we analyze the CD4 cell count data using both the IV method developed in Section 3 and the method. Both approaches are implemented on the full data set as well as the randomly selected subset of 25 individuals to demonstrate the effect of sample size in the analysis of sparse functional data.
Results of the CD4 cell count application are shown in Figure 3. The top-left panel plots the CD4 cell counts against months since seroconversion, highlighting the sample used in the subset analysis. Curve estimates and intervals for a single subject in the subset analysis are provided in the top-right panel of Figure 3. Simultaneous intervals are constructed using resampling procedure for the IV approach and using the asymptotic interval given in (7) for the method. For this subject, the estimated CD4 cell count trajectory using the IV method is lower for months –18 to –10 and higher for months 0 to 12 than the trajectory estimated using the mixed model approach. In these times, the pointwise and simultaneous confidence intervals constructed using the IV approach are much wider than those constructed using the framework. For example, using the method proposed in Section 3, at month –18 this subject's CD4 cell count is estimated to be approximately 2600 CD4 cells per milliliter of blood with upper and lower bounds of 3300 and 1900; using the mixed model approach, the CD4 cell count is estimated to be 2800 with bounds 3100 and 2500. Further, using the upper bound of the simultaneous interval to estimate the month in which CD4 cell count drops below 1500 per milliliter would yield month 10 under the IV method and month 4 under the approach. Thus, inference for a subject's CD4 trajectory can change dramatically depending on the procedure used to construct mean and variance estimates.
In the bottom-left panel, we compare the widths of point-wise confidence intervals yielded by the IV and approaches. For the full data analysis, intervals resulting from the proposed method have a similar width as those from the mixed model approach, but are up to 1.18 times as wide for months (36, 42) where there are fewer observations across subjects. In the analysis of the randomly selected 25 subjects, pointwise confidence intervals resulting from the IV approach are up to 1.52 times wider than those from the method, and are on average 1.21 times wider. Similarly, a comparison of the average widths of simultaneous confidence intervals given by the two methods is shown in the bottom-right panel of Figure 3. The relationship between the widths of simultaneous intervals is similar to that for pointwise intervals for the sampled data. For the full data analysis, the intervals given by the IV approach are in fact slightly narrower than those for the method; recall that simulations indicated that the simultaneous intervals in (7) may in fact be conservative asymptotically, while the IV simultaneous intervals could attain coverages slightly under the nominal level. Generally, these results indicate that ignoring the variability in the FPC decomposition can substantially understate the total variability in estimated curves for small samples and sparse data.
5.2 White-Matter Tract Data
Next, we consider a neuroimaging study of white matter tracts in multiple sclerosis (MS) patients and healthy controls. Intracranial white-matter consists of tracts or bundles of myelinated axons that carry electrical signals between areas of the brain. Major examples of white matter tracts are the corticospinal tracts, which connect the motor cortex to the spinal cord, and the corpus callosum, which bridges the left and right hemispheres. MS is an autoimmune disorder that results in white matter lesions, axonal damage and demyelination, and can lead to severe patient disability. DTI is an MRI modality able to resolve individual tracts by tracing the diffusion of water, which tends to be anisotropic within white matter tracts and isotropic elsewhere in the brain (Basser, Mattiello, and LeBihan 1994; Mori and Barker 1999; Basser et al. 2000; LeBihan et al. 2001). Fractional anisotropy (FA) is one of several quantities derived from DTI scans that measures overall anisotropy, and may be decreased in MS patients.
Our data set consists of 42 healthy controls and 100 MS patients, for whom we observe FA values sampled densely at 55 locations along the right corticospinal tract. The top-left panel of Figure 4 displays these FA profiles, separating MS patients and healthy controls. There is significant variability in tract profiles comparing subjects, as well as apparent measurement error in the observed FA values. In addition, 35% of subjects are missing observations for at least one tract location, and 17% of subjects are missing observations for five or more locations. For all subjects, any missing values are consecutive and begin at tract location 0, so that missingness is clustered at the beginning of the tract.
Because tract profiles are affected by disease progression, they are useful biomarkers and have been related to patient disability (Goldsmith et al. 2011). The goal of this analysis is to understand the total variability in tract profile estimates using FPCA. We compare the proposed IV method to the standard approach to assess the contribution of decomposition-based uncertainty on overall variability in estimated curves. Additionally, to examine the effect of sample size when curves are densely sampled, we conduct our analyses on both the full data set and on the subset of healthy controls.
Figure 4 shows the results from the analysis of the DTI tract data.
A comparison of the relative widths of pointwise confidence intervals constructed using the IV and methods is given in the top-right panel of Figure 4. In the full data analysis, the IV method results in confidence intervals that are on average 1.12 times wider than those given by the approach. When considering the 42 healthy controls only, the IV approach produces intervals that are on average 1.40 times wider than those for the approach. The comparison of the intervals yielded by these methods illustrates that, even for densely sampled observations, the mixed model approach that conditions on a single decomposition can significantly understate the total variability by neglecting the uncertainty in the FPC decomposition objects.
A demonstration of the difference between the IV and approaches is given in the bottom panels of Figure 4 using the tract profile of a single healthy control. On the left are the estimated curves and pointwise confidence intervals from the analysis based on the 42 healthy controls only; the interval based on the IV method is substantially wider at several locations on the tract, particularly near 0.0, 0.6, and 0.8. In addition, the curve estimated using iterated expectations falls outside the pointwise confidence interval given by the approach at several locations between 0.5 and 0.8. On the right are the estimates and pointwise intervals from the full data analysis, showing that the two approaches provide more similar intervals when a larger sample size is used. Thus for small sample sizes, accounting for the uncertainty in the FPC decomposition by using the IV method results in noticeably wider confidence intervals; as the sample size increases, the decomposition uncertainty decreases and the IV and methods yield more similar results.
6. Discussion
FPC analysis is ubiquitous in functional data analysis. FPC decompositions are estimated from the data, and curves are expressed using a truncated Karhunen–Loève expansion with curve-specific scores predicted from a mixed model that conditions (often implicitly) on the estimated decomposition. Previous work on estimation and inference for FPC expansions has focused on the model-based prediction of scores and the associated variability without including the uncertainty in the FPC decompositions. In many cases, including small samples or sparsely observed curves, this second source of uncertainty is non-negligible.
In this article, we propose a method for estimating curves and understanding the variability of those estimates that accounts for both model- and decomposition-based uncertainty. Specifically, we use iterated expectations to average model-based estimates that condition on specific decompositions across the distribution of those decompositions; similarly, the iterated variance formula provides correct inference by combining both sources of uncertainty. A bootstrap procedure provides an empirical distribution of the FPC decompositions. Simulations demonstrate the accuracy of our method, both in terms of the MSEs of estimated curves and in terms of the coverage of associated confidence intervals. We apply our method to two statistically and scientifically distinct data sets, one using longitudinally observed CD4 cell counts as sparse functional data and the other using partially unobserved tract profiles from a neuroimaging study. An R implementation of our method is given in the refund package on CRAN (Crainiceanu et al. 2011).
Future work may proceed in several directions. The proposed method assesses uncertainty in the FPC decomposition through the bootstrap; alternatively, one could use a probabilistic model for the unknown objects in the decomposition. Such an approach could be integrated seamlessly with the methods developed here, but the specification of an appropriate model for the unknown objects is a difficult challenge. Inference for more complex FPC-based methods may be improved through the extension of iterated expectation and variance formulas. For instance, longitudinal functional principal components analysis (LFPCA) decomposes repeated observations of curves into baseline subject-specific components, time-varying subject-specific components, and subject-visit-specific components (Greven et al. 2010); including model-and decomposition-based variability will improve the understanding of uncertainty in this scientifically useful setting. Additionally, the proposed method has focused on uncertainty in the Karhunen–Loève expansion arising from both model-and decomposition-based variability, but it does not address the model-based uncertainty in the estimation of the mean function. Correctly including this source of variability may increase the accuracy of the estimated curves and confidence intervals, particularly in cases where the estimate of the mean function may be poor.
Supplementary Material
Supplementary information
Supplementary information
Acknowledgements
The work of Goldsmith and Crainiceanu was supported by Award R01NS060910 from the National Institute Of Neurological Disorders And Stroke. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Neurological Disorders and Stroke or the National Institutes of Health. The work of Greven was funded by Emmy Noether grant GR 3793/1-1 from the German Research Foundation.
Abstract
Functional principal components (FPC) analysis is widely used to decompose and express functional observations. Curve estimates implicitly condition on basis functions and other quantities derived from FPC decompositions; however these objects are unknown in practice. In this article, we propose a method for obtaining correct curve estimates by accounting for uncertainty in FPC decompositions. Additionally, pointwise and simultaneous confidence intervals that account for both model- and decomposition-based variability are constructed. Standard mixed model representations of functional expansions are used to construct curve estimates and variances conditional on a specific decomposition. Iterated expectation and variance formulas combine model-based conditional estimates across the distribution of decompositions. A bootstrap procedure is implemented to understand the uncertainty in principal component decomposition quantities. Our method compares favorably to competing approaches in simulation studies that include both densely and sparsely observed functions. We apply our method to sparse observations of CD4 cell counts and to dense white-matter tract profiles. Code for the analyses and simulations is publicly available, and our method is implemented in the R package refund on CRAN.
Footnotes
7. Supplementary Materials
Web Appendices, Tables, and Figures referenced in Section 4 are available under the Paper Information link at the Biometrics website on Wiley Online Library.