Bayesian integrative model for multi-omics data with missingness.
Journal: 2018/December - Bioinformatics
ISSN: 1367-4811
Abstract:
UNASSIGNED
Integrative analysis of multi-omics data from different high-throughput experimental platforms provides valuable insight into regulatory mechanisms associated with complex diseases, and gains statistical power to detect markers that are otherwise overlooked by single-platform omics analysis. In practice, a significant portion of samples may not be measured completely due to insufficient tissues or restricted budget (e.g. gene expression profile are measured but not methylation). Current multi-omics integrative methods require complete data. A common practice is to ignore samples with any missing platform and perform complete case analysis, which leads to substantial loss of statistical power.
UNASSIGNED
In this article, inspired by the popular Integrative Bayesian Analysis of Genomics data (iBAG), we propose a full Bayesian model that allows incorporation of samples with missing omics data.
UNASSIGNED
Simulation results show improvement of the new full Bayesian approach in terms of outcome prediction accuracy and feature selection performance when sample size is limited and proportion of missingness is large. When sample size is large or the proportion of missingness is low, incorporating samples with missingness may introduce extra inference uncertainty and generate worse prediction and feature selection performance. To determine whether and how to incorporate samples with missingness, we propose a self-learning cross-validation (CV) decision scheme. Simulations and a real application on child asthma dataset demonstrate superior performance of the CV decision scheme when various types of missing mechanisms are evaluated.
UNASSIGNED
Freely available on the GitHub at https://github.com/CHPGenetics/FBM.
UNASSIGNED
Supplementary data are available at Bioinformatics online.
Relations:
Content
Citations
(1)
Similar articles
Articles by the same authors
Discussion board
Bioinformatics 34(22): 3801-3808

Bayesian integrative model for multi-omics data with missingness

1 Introduction

Multi-level omics data refer to the combined molecular data in various types, for example, genome, transcriptome, methylome and proteome data, measured on a common cohort of patients. The availability of multi-level omics data in both magnitude and varieties poses challenges as well as opportunities to understand fundamental mechanisms of diseases and pathologies. Compared with separately discovering the association patterns between each omics data and phenotypes, an integrative framework that simultaneously integrates multiple omics data types will uncover more insightful regulatory machineries between different omics data and will hence deepen our understanding to hereditary and environmental causes in pathology (Richardson et al., 2016; Tseng et al., 2012, 2015).

In the literature, many strategies have emerged for integration of multi-omics data. To cluster samples for identifying unknown disease subtypes, integrative clustering (iCluster; Shen et al., 2009), Bayesian consensus clustering (Lock and Dunson, 2013), group structured integrative clustering (GS-iCluster; Kim et al., 2017) and integrative sparse K-means (IS-Kmeans; Huo et al., 2017) have been developed for integrative clustering of multi-omics data. For association and prediction modelling, Integrative Bayesian Analysis of Genomics (iBAG) (Wang et al., 2013) investigates association patterns of mRNA expressions and methylation with clinical outcome via a mechanistic model for associating methylation with gene expression and then a clinical model for direct association between expression and clinical outcome or via methylation. A Bayesian hierarchical model is then established for the inference. Such Bayesian hierarchical modelling has gained popularity in multi-omics integrative analysis due to its flexibility in model construction for complex regulatory structure, convenience to incorporate prior biological knowledge and advances in modern computing. In the association and prediction modelling, we usually focus on two key goals in the inference: firstly, to select predictive biomarkers for the phenotype and secondly, to predict clinical outcome from the selected biomarkers. As tens of thousands of features are available in omics data, the first goal of feature selection is essential in interrogating the biological and pathological mechanism of a targeted disease. The second goal could be of immediate clinical use, for example, assisting diagnosis of a disease, directing the best treatment decision and predicting drug response.

One obstacle for applying multi-omics integrative methods in real applications is missing data. Due to various reasons (e.g. limited budget, bad tissue quality or insufficient tissue amount), it is common that only partial samples have all omics data types (Voillet et al., 2016). For example, TCGA breast cancer study had 922 samples measured in methylation array, yet only 781 were measured in miRNA expression and 587 were measured for gene expression. Almost 40% of the samples are missing at least one type of omics data. To circumvent this pitfall, a naïve and convenient solution is by complete case (CC) analysis, where samples with any missing measurement are ignored. This approach results in dramatic decrease of sample size and thus decreases of statistical power, especially when more omics data types are combined. The shortcoming of CC in omics studies is recently noticed by statisticians, for example, Voillet et al. (2016) who developed a multiple imputation approach focusing on multiple factor analysis for multiple omics data. However, a unified framework serving the aforementioned purpose of feature selection, prediction as well as missingness handling is still lacking. In Bayesian methods for analysis of data with some predictors that are missing at random, marginal distribution of those predictors are usually modelled (Ibrahim et al., 2002; Little and Rubin, 2002). The data augmentation method is used to obtain the posterior distribution of parameters of interest (Tanner and Wong, 1987). In each iteration of the data augmentation procedure, missing values are imputed from the conditional distribution of those covariates given observed data under current parameters and model parameters are subsequently drawn from their posterior distribution calculated from the imputed dataset. In this article, we are motivated by the iBAG model combining mRNA expression methylation, clinical variables to predict a targeted continuous outcome. We propose a full Bayesian model with missingness (FBM) that allows iBAG to handle situations when partial samples are missing with mRNA expression or methylation. Extensive simulations and real applications demonstrated superior performance in feature selection and prediction accuracy of the new approach compared with naïve CC approach. The model can be extended to other omics data types or other targeted outcomes (e.g. binary or survival).

The article is structured as the following. In Section 2.1, we introduce the motivation, the original iBAG model and the proposed FBM. Section 2.2 discusses the inference of prediction and feature selection of FBM (Section 2.2.1) and evaluation benchmarks (Section 2.2.2). Section 3 contains extensive simulations to evaluate performance of FBM and Section 4 proposes a cross-validation (CV) decision scheme to determine whether and how to incorporate samples with missingness in FBM. Section 5 includes an application to a childhood asthma dataset with 460 individuals. Final conclusion and discussion are presented in Section 6.

2 Materials and methods

2.1 Motivation and the full Bayesian model with missingness

2.1.1 Motivation

IBAG is a two-layer Bayesian hierarchical model for vertical integrative analysis of multi-level omics data, assuming data are complete. However, in reality, a large proportion of missing data is commonly seen due to budget or limitation in tissue collection. Figure 1a gives an example of data structure with missingness. Suppose there are a total of N samples, Yindicates the clinical outcome of interest; C indicates clinical factors; GN×K=(Gobs,NobsG×K,Gmis,NmisG×K)indicates the gene expression with missingness, where UG=(U1G,UNG)is the missing indicator; likewise, MN×J=(Mobs,NobsM×J,Mmis,NmisM×J)indicates methylation data with missingness, where UM=(U1M,UNM)is the missing indicator. U =1 indicates missing and U =0 indicates observed. For example, number of samples with methylation level missing is NobsG=Nk=1KUkG. We assume UiM×UiG1for 1iN. The original iBag model takes only the complete data and is subject to loss of statistical power. To handle missingness, we propose a full Bayesian model with missingness inspired by iBAG model, to achieve three goals simultaneously: feature selection, prediction and missing data imputation. In Section 2.1.2, we briefly introduce the iBAG model. In Section 2.1.3, we propose our full Bayesian model for multi-omics integration with missingness imputation.

An external file that holds a picture, illustration, etc.
Object name is bty775f1.jpg

(a) Illustration of missing pattern adopted in this article. Slash-shaded area represents missing data, cross-shaded area represents value 1 for missing indicator vectors UG and UM. (b) DAG of the model and parameters. The square in the DAG denotes observed data, solid circle denotes variable to be updated, circle in grey colour is the variable of interests, dashed circle denotes prior. Solid arrows indicate stochastic dependencies, and dashed arrows indicate deterministic dependencies

2.1.2 iBAG

In iBAG, the mechanistic model in the first layer assesses gene-methylation effect and divides gene expressions into two parts, the part regulated by methylation (GM) and the part regulated by other mechanisms (GM¯).

G=GM+GM¯,GM=MΩ,

where GM=(gnkM)N×K=(g1M,,gKM),GM¯=(gnkM¯)N×K=(g1M¯,,gKM¯),Ω=(ωjk)J×K:ωjkdenotes the ‘gene-methylation’ effect that estimates the (conditional) effect of jth methylation feature on the kth gene.

The second layer called clinical model assesses the association between gene expression (GMand GM¯) and the phenotype:

Y=CγC+GMγM+GM¯γM¯+ϵ,

where γC=(γ1C,γLC):γlCdenotes the effect of the lth clinical factor on the clinical outcome Y. γM=(γ1M,γKM):γkMdenotes the gene expression effect of gkMon clinical outcome Y, called a type M effect. γM¯=(γ1M¯,γKM¯):γkM¯denotes the gene expression effect of gkM¯on clinical outcome Y, called a type M¯effect.

2.1.3 Full Bayesian model with missingness

Figure 1b gives a full representation of our model. Our model contains three parts: mechanistic model, clinical model and an imputation model derived from the previous two models. The mechanistic model and clinical model stem from the iBAG model introduced in Section 2.1.2. Here, we will focus on two novel parts added to the original iBAG model: the sparsity-induced spike-and-slab priors (Ishwaran and Rao, 2005) for feature selection in both mechanistic and clinical models, and an imputation model to deal with missing omics data.

There are a total of eight groups of parameters that need to be estimated: γM, γM¯, γC, Ω, σk2’s, Gmis,Mmis,σ2, γM and γM¯are our parameters of interests. Investigators may also find Ωs important if they are interested in further inference on methylation-gene regulation. The corresponding Monte Carlo Markov Chain (MCMC) Gibbs sampler will be further discussed in Supplementary Appendix Section 2.

The original iBAG model placed a Laplace prior on γM and γM¯for shrinkage purpose, however, it does not set the effects to exact zeros. To conduct natural variable selection, we instead use a spike-and-slab prior to induce sparsity. In addition, we also perform feature selection in ω using the same spike-and-slab prior, considering that not all the methylation sites are regulating the gene expression. So we will have:

[γkM|IkM,(τM)2](1IkM)N(0,106)+IkMN(0,(τM)2);
[IkM|πM]Bern(πM),[πM]Unif(0,1);
[γkM¯|IkM¯,(τM¯)2](1IkM¯)N(0,106)+IkM¯N(0,(τM¯)2);
[IkM¯|πM¯]Bern(πM¯),[πM¯]Unif(0,1);
[ωj|Ijω,(τω)2](1Ijω)N(0,106)+IjωN(0,(τω)2);
[Ijω|πω]Bern(πω),[πω]Unif(0,1);

where IkM,IkM¯and Ijωare binary indicators and N(0,106)represents a narrow spike and τ2represents the wide slab. A Jeffery’s prior is put on (τM)2,(τM¯)2,(τω)2, i.e.:

p(τM)2(τM)2,p(τM¯)2(τM¯)2,p(τω)2(τω)2.

For Gmisand Mmis(Fig. 1b), we impose the following imputation model:

gmis,kMVNNmisG×NmisG(MJkωk,σk2INmisG×NmisG);
mmis,jMVNNmisM×NmisM(0,(σm)2INmisM×NmisM),

where MVN denotes the multivariate normal distribution and (σm)2=1if the methylation value is already standardized with mean 0 and standard deviation 1. Note that here we assume the methylation data are in Mvalueaccording to Du et al. (2010). If βvalueis to be used, we may need to replace the above prior with a truncated normal distribution bounded between 0 and 1. All other variables are given non-informative priors. Details can be found in the Supplementary Appendix Section 2.

Remarks:

  • Following Wang et al. (2013), we assume gene-gene independence and methylation-methylation independence. However, we are aware that this assumption deviates from real data, so we also propose a model with gene-gene and methylation-methylation correlations considered (see Supplementary Appendix Section 3). Albeit having the options, we will mainly discuss our model with independence assumptions in this article, for two reasons: first, the computational burden for estimating covariance matrices are high. Second, with moderate correlations in simulated data, considering dependency has limited gain in performance compared with model with independence assumption.

  • We assume that the methylations are many-to-one mapped to genes, i.e. only the methylation within the promoter region of the gene is mapped to the gene, and each methylation will only be mapped to one gene. The mapping is also assumed to be known from biology background knowledge in our model. The methylation level is centred around 0 if we are using m-value.

  • It is reasonable and necessary to assume that if for all the ωj where j are within the promoter region of gene k, we let Ijω=0, then we automatically have IkM=0. That is, IkM0only when at least one methylation is selected.

  • Two strategy could be adopted to deal with collinearity issue between methylation sites. First, we proposed a random selection strategy based on pairwise Pearson correlation (see Supplementary Appendix Section 4). Second, one may conduct principal component analysis (PCA) on methylation sites, and use the first few PCs explaining large proportion of variation.

2.2 Inference and evaluation

The full Bayesian hierarchical model in the last section allows fast Gibbs sampler. Full conditional formula for iterative sampling are shown in Supplementary Appendix Section 2. The final parameter estimates are calculated by averaging stabilized MCMC iterations (i.e. removing the first Br burn-in period in MCMC iterations). The burn-in period Br is determined using Geweke’s convergence diagnostics (Geweke, 1992). Geweke diagnostics aims to test whether the first a%and last b%of the MCMC iterations have equal mean, and thus decide whether the samples are drawn from a stationary distribution. As suggested by Geweke, the first 10% of MCMC total iterations are taken as burn-ins once the MCMC chain pass the diagnosis.

2.2.1 Prediction and feature selection

The Bayesian integrative model generates two major inference outcomes: prediction and feature selection. For prediction, denote γ^(b)C,γ^(b)M,γ^(b)M¯, and Ω^(b)as the simulated parameter estimates from the b-th iteration. For a new sample with omics data (Cnew,Gnew,Mnew), we average prediction of y from the (BBr) stable MCMCs by y^new=(b=Br+1By^new(b))/(BBr), where y^new(b)=Cnew·γ^(b)C+Mnew·Ω^(b)·γ^(b)M+(GnewMnew·Ω^(b))·γ^(b)M¯. And RMSE will be calculated from those estimates:

RMSE=i=1N(yiy^new,i)2/N
(1)

where the choice of N is discussed in Section 2.2.2.

Next, we summarize feature selection indicators Ik,(b)Mand Ik,(b)M¯for gene k and the b-th MCMC to determine the set of genes predictive to outcome y. Given the b-th iteration, we define the selection indicator for gene k as Ik,(b)so that gene k is selected if either the impact on outcome is through methylation (Ik,(b)M) or not (Ik,(b)M¯). In other words, Ik,(b)=(Ik,(b)M)OR(Ik,(b)M¯)=1(1Ik,(b)M)(1Ik,(b)M¯)for Br+1bBand 1kK. To control FDR at gene level, we apply Bayesian false discovery rate (BFDR) proposed by Newton (2004):

BFDR(t)=k=1KP^k(H0|D)dk(t)k=1Kdk(t)

where P^k(H0|D)=1(b=Br+1BIk,(b))/(BBr)is the posterior probability of gene k belonging to null hypothesis H0 (i.e. gene k is not selected given observed data D), dk(t)=I(P^k(H0|D)<t), and t is a tuning threshold. Given the definition of BFDR, the q-value of gene k can be defined as qk=mintP^k(H0|D)BFDR(t). This q-value will later be used for feature selection decision, which is comparable to frequentist approaches. Among selected genes, one may perform post hoc analysis to further investigate ckM=(b=Br+1BIk,(b)M)/(BBr)and ckM¯=(b=Br+1BIk,(b)M¯)/(BBr)and determine whether the impact of gene k to outcome is through methylation, non-methylation or both.

2.2.2 Benchmarks for evaluation

To evaluate the performance, the basic approach we compare with is the CC analysis. For full Bayesian model with missingness, we will also choose to impute gene expression only (FBMG), methylation only (FBMM) or both (FBMGM) when applicable. In simulation studies, we also perform analysis of the complete data (full) to examine the reduction of accuracy caused by missingness.

To benchmark performance of outcome prediction, we consider RMSE (Equation 1). In simulation studies, after parameter estimates are obtained we generate a new testing dataset with large sample size (N =2000) and compute RMSE. In real data analysis, we perform 50-fold cross-validation (CV) on complete cases for RMSE evaluation. In each iteration, NCC=2%of complete-case samples are set aside as test dataset, all remaining data are used to perform CC and FBMs analyses. The parameter estimates are then applied to the test data for outcome prediction. After ten iterations, cross-validated RMSE can be evaluated on all complete-case samples with N=NCCin Equation 1.

To evaluate feature selection performance, we order genes by q-value, plot receiver operating characteristic (ROC) curves and calculate area under curve (AUC) in simulations since the underlying true predictive genes are known. For real data, since we do not know the true features, we treat the gene selection result from full data analysis (full) as a surrogate of gold standard and compare gene selection from CC (or FBMs) to the full data analysis by tracing the top number of selected genes on the x-axis (e.g x = 100 top selected genes by CC and full) and the overlapped number from CC and full on the y-axis (Fig. 4b). Comparing the curves of CC and FBMs, a higher curve closer to the diagonal line shows more similar gene selection to ‘full’ and thus an indication of better performance.

An external file that holds a picture, illustration, etc.
Object name is bty775f4.jpg

(a and b) In missing Scenario I (α=50%,β=0) and Scenario II (α=0,β=50%), scatter plot of RMSEs between two methods are shown, where circle means cross-validation scheme generates correct decision, and cross means mistakes. Sample size N varies from 50,100,200,500. (c) Box plot of RMSE of different methods and CV selection scheme in Scenario III (α=β=25%)

2.1 Motivation and the full Bayesian model with missingness

2.1.1 Motivation

IBAG is a two-layer Bayesian hierarchical model for vertical integrative analysis of multi-level omics data, assuming data are complete. However, in reality, a large proportion of missing data is commonly seen due to budget or limitation in tissue collection. Figure 1a gives an example of data structure with missingness. Suppose there are a total of N samples, Yindicates the clinical outcome of interest; C indicates clinical factors; GN×K=(Gobs,NobsG×K,Gmis,NmisG×K)indicates the gene expression with missingness, where UG=(U1G,UNG)is the missing indicator; likewise, MN×J=(Mobs,NobsM×J,Mmis,NmisM×J)indicates methylation data with missingness, where UM=(U1M,UNM)is the missing indicator. U =1 indicates missing and U =0 indicates observed. For example, number of samples with methylation level missing is NobsG=Nk=1KUkG. We assume UiM×UiG1for 1iN. The original iBag model takes only the complete data and is subject to loss of statistical power. To handle missingness, we propose a full Bayesian model with missingness inspired by iBAG model, to achieve three goals simultaneously: feature selection, prediction and missing data imputation. In Section 2.1.2, we briefly introduce the iBAG model. In Section 2.1.3, we propose our full Bayesian model for multi-omics integration with missingness imputation.

An external file that holds a picture, illustration, etc.
Object name is bty775f1.jpg

(a) Illustration of missing pattern adopted in this article. Slash-shaded area represents missing data, cross-shaded area represents value 1 for missing indicator vectors UG and UM. (b) DAG of the model and parameters. The square in the DAG denotes observed data, solid circle denotes variable to be updated, circle in grey colour is the variable of interests, dashed circle denotes prior. Solid arrows indicate stochastic dependencies, and dashed arrows indicate deterministic dependencies

2.1.2 iBAG

In iBAG, the mechanistic model in the first layer assesses gene-methylation effect and divides gene expressions into two parts, the part regulated by methylation (GM) and the part regulated by other mechanisms (GM¯).

G=GM+GM¯,GM=MΩ,

where GM=(gnkM)N×K=(g1M,,gKM),GM¯=(gnkM¯)N×K=(g1M¯,,gKM¯),Ω=(ωjk)J×K:ωjkdenotes the ‘gene-methylation’ effect that estimates the (conditional) effect of jth methylation feature on the kth gene.

The second layer called clinical model assesses the association between gene expression (GMand GM¯) and the phenotype:

Y=CγC+GMγM+GM¯γM¯+ϵ,

where γC=(γ1C,γLC):γlCdenotes the effect of the lth clinical factor on the clinical outcome Y. γM=(γ1M,γKM):γkMdenotes the gene expression effect of gkMon clinical outcome Y, called a type M effect. γM¯=(γ1M¯,γKM¯):γkM¯denotes the gene expression effect of gkM¯on clinical outcome Y, called a type M¯effect.

2.1.3 Full Bayesian model with missingness

Figure 1b gives a full representation of our model. Our model contains three parts: mechanistic model, clinical model and an imputation model derived from the previous two models. The mechanistic model and clinical model stem from the iBAG model introduced in Section 2.1.2. Here, we will focus on two novel parts added to the original iBAG model: the sparsity-induced spike-and-slab priors (Ishwaran and Rao, 2005) for feature selection in both mechanistic and clinical models, and an imputation model to deal with missing omics data.

There are a total of eight groups of parameters that need to be estimated: γM, γM¯, γC, Ω, σk2’s, Gmis,Mmis,σ2, γM and γM¯are our parameters of interests. Investigators may also find Ωs important if they are interested in further inference on methylation-gene regulation. The corresponding Monte Carlo Markov Chain (MCMC) Gibbs sampler will be further discussed in Supplementary Appendix Section 2.

The original iBAG model placed a Laplace prior on γM and γM¯for shrinkage purpose, however, it does not set the effects to exact zeros. To conduct natural variable selection, we instead use a spike-and-slab prior to induce sparsity. In addition, we also perform feature selection in ω using the same spike-and-slab prior, considering that not all the methylation sites are regulating the gene expression. So we will have:

[γkM|IkM,(τM)2](1IkM)N(0,106)+IkMN(0,(τM)2);
[IkM|πM]Bern(πM),[πM]Unif(0,1);
[γkM¯|IkM¯,(τM¯)2](1IkM¯)N(0,106)+IkM¯N(0,(τM¯)2);
[IkM¯|πM¯]Bern(πM¯),[πM¯]Unif(0,1);
[ωj|Ijω,(τω)2](1Ijω)N(0,106)+IjωN(0,(τω)2);
[Ijω|πω]Bern(πω),[πω]Unif(0,1);

where IkM,IkM¯and Ijωare binary indicators and N(0,106)represents a narrow spike and τ2represents the wide slab. A Jeffery’s prior is put on (τM)2,(τM¯)2,(τω)2, i.e.:

p(τM)2(τM)2,p(τM¯)2(τM¯)2,p(τω)2(τω)2.

For Gmisand Mmis(Fig. 1b), we impose the following imputation model:

gmis,kMVNNmisG×NmisG(MJkωk,σk2INmisG×NmisG);
mmis,jMVNNmisM×NmisM(0,(σm)2INmisM×NmisM),

where MVN denotes the multivariate normal distribution and (σm)2=1if the methylation value is already standardized with mean 0 and standard deviation 1. Note that here we assume the methylation data are in Mvalueaccording to Du et al. (2010). If βvalueis to be used, we may need to replace the above prior with a truncated normal distribution bounded between 0 and 1. All other variables are given non-informative priors. Details can be found in the Supplementary Appendix Section 2.

Remarks:

  • Following Wang et al. (2013), we assume gene-gene independence and methylation-methylation independence. However, we are aware that this assumption deviates from real data, so we also propose a model with gene-gene and methylation-methylation correlations considered (see Supplementary Appendix Section 3). Albeit having the options, we will mainly discuss our model with independence assumptions in this article, for two reasons: first, the computational burden for estimating covariance matrices are high. Second, with moderate correlations in simulated data, considering dependency has limited gain in performance compared with model with independence assumption.

  • We assume that the methylations are many-to-one mapped to genes, i.e. only the methylation within the promoter region of the gene is mapped to the gene, and each methylation will only be mapped to one gene. The mapping is also assumed to be known from biology background knowledge in our model. The methylation level is centred around 0 if we are using m-value.

  • It is reasonable and necessary to assume that if for all the ωj where j are within the promoter region of gene k, we let Ijω=0, then we automatically have IkM=0. That is, IkM0only when at least one methylation is selected.

  • Two strategy could be adopted to deal with collinearity issue between methylation sites. First, we proposed a random selection strategy based on pairwise Pearson correlation (see Supplementary Appendix Section 4). Second, one may conduct principal component analysis (PCA) on methylation sites, and use the first few PCs explaining large proportion of variation.

2.1.1 Motivation

IBAG is a two-layer Bayesian hierarchical model for vertical integrative analysis of multi-level omics data, assuming data are complete. However, in reality, a large proportion of missing data is commonly seen due to budget or limitation in tissue collection. Figure 1a gives an example of data structure with missingness. Suppose there are a total of N samples, Yindicates the clinical outcome of interest; C indicates clinical factors; GN×K=(Gobs,NobsG×K,Gmis,NmisG×K)indicates the gene expression with missingness, where UG=(U1G,UNG)is the missing indicator; likewise, MN×J=(Mobs,NobsM×J,Mmis,NmisM×J)indicates methylation data with missingness, where UM=(U1M,UNM)is the missing indicator. U =1 indicates missing and U =0 indicates observed. For example, number of samples with methylation level missing is NobsG=Nk=1KUkG. We assume UiM×UiG1for 1iN. The original iBag model takes only the complete data and is subject to loss of statistical power. To handle missingness, we propose a full Bayesian model with missingness inspired by iBAG model, to achieve three goals simultaneously: feature selection, prediction and missing data imputation. In Section 2.1.2, we briefly introduce the iBAG model. In Section 2.1.3, we propose our full Bayesian model for multi-omics integration with missingness imputation.

An external file that holds a picture, illustration, etc.
Object name is bty775f1.jpg

(a) Illustration of missing pattern adopted in this article. Slash-shaded area represents missing data, cross-shaded area represents value 1 for missing indicator vectors UG and UM. (b) DAG of the model and parameters. The square in the DAG denotes observed data, solid circle denotes variable to be updated, circle in grey colour is the variable of interests, dashed circle denotes prior. Solid arrows indicate stochastic dependencies, and dashed arrows indicate deterministic dependencies

2.1.2 iBAG

In iBAG, the mechanistic model in the first layer assesses gene-methylation effect and divides gene expressions into two parts, the part regulated by methylation (GM) and the part regulated by other mechanisms (GM¯).

G=GM+GM¯,GM=MΩ,

where GM=(gnkM)N×K=(g1M,,gKM),GM¯=(gnkM¯)N×K=(g1M¯,,gKM¯),Ω=(ωjk)J×K:ωjkdenotes the ‘gene-methylation’ effect that estimates the (conditional) effect of jth methylation feature on the kth gene.

The second layer called clinical model assesses the association between gene expression (GMand GM¯) and the phenotype:

Y=CγC+GMγM+GM¯γM¯+ϵ,

where γC=(γ1C,γLC):γlCdenotes the effect of the lth clinical factor on the clinical outcome Y. γM=(γ1M,γKM):γkMdenotes the gene expression effect of gkMon clinical outcome Y, called a type M effect. γM¯=(γ1M¯,γKM¯):γkM¯denotes the gene expression effect of gkM¯on clinical outcome Y, called a type M¯effect.

2.1.3 Full Bayesian model with missingness

Figure 1b gives a full representation of our model. Our model contains three parts: mechanistic model, clinical model and an imputation model derived from the previous two models. The mechanistic model and clinical model stem from the iBAG model introduced in Section 2.1.2. Here, we will focus on two novel parts added to the original iBAG model: the sparsity-induced spike-and-slab priors (Ishwaran and Rao, 2005) for feature selection in both mechanistic and clinical models, and an imputation model to deal with missing omics data.

There are a total of eight groups of parameters that need to be estimated: γM, γM¯, γC, Ω, σk2’s, Gmis,Mmis,σ2, γM and γM¯are our parameters of interests. Investigators may also find Ωs important if they are interested in further inference on methylation-gene regulation. The corresponding Monte Carlo Markov Chain (MCMC) Gibbs sampler will be further discussed in Supplementary Appendix Section 2.

The original iBAG model placed a Laplace prior on γM and γM¯for shrinkage purpose, however, it does not set the effects to exact zeros. To conduct natural variable selection, we instead use a spike-and-slab prior to induce sparsity. In addition, we also perform feature selection in ω using the same spike-and-slab prior, considering that not all the methylation sites are regulating the gene expression. So we will have:

[γkM|IkM,(τM)2](1IkM)N(0,106)+IkMN(0,(τM)2);
[IkM|πM]Bern(πM),[πM]Unif(0,1);
[γkM¯|IkM¯,(τM¯)2](1IkM¯)N(0,106)+IkM¯N(0,(τM¯)2);
[IkM¯|πM¯]Bern(πM¯),[πM¯]Unif(0,1);
[ωj|Ijω,(τω)2](1Ijω)N(0,106)+IjωN(0,(τω)2);
[Ijω|πω]Bern(πω),[πω]Unif(0,1);

where IkM,IkM¯and Ijωare binary indicators and N(0,106)represents a narrow spike and τ2represents the wide slab. A Jeffery’s prior is put on (τM)2,(τM¯)2,(τω)2, i.e.:

p(τM)2(τM)2,p(τM¯)2(τM¯)2,p(τω)2(τω)2.

For Gmisand Mmis(Fig. 1b), we impose the following imputation model:

gmis,kMVNNmisG×NmisG(MJkωk,σk2INmisG×NmisG);
mmis,jMVNNmisM×NmisM(0,(σm)2INmisM×NmisM),

where MVN denotes the multivariate normal distribution and (σm)2=1if the methylation value is already standardized with mean 0 and standard deviation 1. Note that here we assume the methylation data are in Mvalueaccording to Du et al. (2010). If βvalueis to be used, we may need to replace the above prior with a truncated normal distribution bounded between 0 and 1. All other variables are given non-informative priors. Details can be found in the Supplementary Appendix Section 2.

Remarks:

  • Following Wang et al. (2013), we assume gene-gene independence and methylation-methylation independence. However, we are aware that this assumption deviates from real data, so we also propose a model with gene-gene and methylation-methylation correlations considered (see Supplementary Appendix Section 3). Albeit having the options, we will mainly discuss our model with independence assumptions in this article, for two reasons: first, the computational burden for estimating covariance matrices are high. Second, with moderate correlations in simulated data, considering dependency has limited gain in performance compared with model with independence assumption.

  • We assume that the methylations are many-to-one mapped to genes, i.e. only the methylation within the promoter region of the gene is mapped to the gene, and each methylation will only be mapped to one gene. The mapping is also assumed to be known from biology background knowledge in our model. The methylation level is centred around 0 if we are using m-value.

  • It is reasonable and necessary to assume that if for all the ωj where j are within the promoter region of gene k, we let Ijω=0, then we automatically have IkM=0. That is, IkM0only when at least one methylation is selected.

  • Two strategy could be adopted to deal with collinearity issue between methylation sites. First, we proposed a random selection strategy based on pairwise Pearson correlation (see Supplementary Appendix Section 4). Second, one may conduct principal component analysis (PCA) on methylation sites, and use the first few PCs explaining large proportion of variation.

2.2 Inference and evaluation

The full Bayesian hierarchical model in the last section allows fast Gibbs sampler. Full conditional formula for iterative sampling are shown in Supplementary Appendix Section 2. The final parameter estimates are calculated by averaging stabilized MCMC iterations (i.e. removing the first Br burn-in period in MCMC iterations). The burn-in period Br is determined using Geweke’s convergence diagnostics (Geweke, 1992). Geweke diagnostics aims to test whether the first a%and last b%of the MCMC iterations have equal mean, and thus decide whether the samples are drawn from a stationary distribution. As suggested by Geweke, the first 10% of MCMC total iterations are taken as burn-ins once the MCMC chain pass the diagnosis.

2.2.1 Prediction and feature selection

The Bayesian integrative model generates two major inference outcomes: prediction and feature selection. For prediction, denote γ^(b)C,γ^(b)M,γ^(b)M¯, and Ω^(b)as the simulated parameter estimates from the b-th iteration. For a new sample with omics data (Cnew,Gnew,Mnew), we average prediction of y from the (BBr) stable MCMCs by y^new=(b=Br+1By^new(b))/(BBr), where y^new(b)=Cnew·γ^(b)C+Mnew·Ω^(b)·γ^(b)M+(GnewMnew·Ω^(b))·γ^(b)M¯. And RMSE will be calculated from those estimates:

RMSE=i=1N(yiy^new,i)2/N
(1)

where the choice of N is discussed in Section 2.2.2.

Next, we summarize feature selection indicators Ik,(b)Mand Ik,(b)M¯for gene k and the b-th MCMC to determine the set of genes predictive to outcome y. Given the b-th iteration, we define the selection indicator for gene k as Ik,(b)so that gene k is selected if either the impact on outcome is through methylation (Ik,(b)M) or not (Ik,(b)M¯). In other words, Ik,(b)=(Ik,(b)M)OR(Ik,(b)M¯)=1(1Ik,(b)M)(1Ik,(b)M¯)for Br+1bBand 1kK. To control FDR at gene level, we apply Bayesian false discovery rate (BFDR) proposed by Newton (2004):

BFDR(t)=k=1KP^k(H0|D)dk(t)k=1Kdk(t)

where P^k(H0|D)=1(b=Br+1BIk,(b))/(BBr)is the posterior probability of gene k belonging to null hypothesis H0 (i.e. gene k is not selected given observed data D), dk(t)=I(P^k(H0|D)<t), and t is a tuning threshold. Given the definition of BFDR, the q-value of gene k can be defined as qk=mintP^k(H0|D)BFDR(t). This q-value will later be used for feature selection decision, which is comparable to frequentist approaches. Among selected genes, one may perform post hoc analysis to further investigate ckM=(b=Br+1BIk,(b)M)/(BBr)and ckM¯=(b=Br+1BIk,(b)M¯)/(BBr)and determine whether the impact of gene k to outcome is through methylation, non-methylation or both.

2.2.2 Benchmarks for evaluation

To evaluate the performance, the basic approach we compare with is the CC analysis. For full Bayesian model with missingness, we will also choose to impute gene expression only (FBMG), methylation only (FBMM) or both (FBMGM) when applicable. In simulation studies, we also perform analysis of the complete data (full) to examine the reduction of accuracy caused by missingness.

To benchmark performance of outcome prediction, we consider RMSE (Equation 1). In simulation studies, after parameter estimates are obtained we generate a new testing dataset with large sample size (N =2000) and compute RMSE. In real data analysis, we perform 50-fold cross-validation (CV) on complete cases for RMSE evaluation. In each iteration, NCC=2%of complete-case samples are set aside as test dataset, all remaining data are used to perform CC and FBMs analyses. The parameter estimates are then applied to the test data for outcome prediction. After ten iterations, cross-validated RMSE can be evaluated on all complete-case samples with N=NCCin Equation 1.

To evaluate feature selection performance, we order genes by q-value, plot receiver operating characteristic (ROC) curves and calculate area under curve (AUC) in simulations since the underlying true predictive genes are known. For real data, since we do not know the true features, we treat the gene selection result from full data analysis (full) as a surrogate of gold standard and compare gene selection from CC (or FBMs) to the full data analysis by tracing the top number of selected genes on the x-axis (e.g x = 100 top selected genes by CC and full) and the overlapped number from CC and full on the y-axis (Fig. 4b). Comparing the curves of CC and FBMs, a higher curve closer to the diagonal line shows more similar gene selection to ‘full’ and thus an indication of better performance.

An external file that holds a picture, illustration, etc.
Object name is bty775f4.jpg

(a and b) In missing Scenario I (α=50%,β=0) and Scenario II (α=0,β=50%), scatter plot of RMSEs between two methods are shown, where circle means cross-validation scheme generates correct decision, and cross means mistakes. Sample size N varies from 50,100,200,500. (c) Box plot of RMSE of different methods and CV selection scheme in Scenario III (α=β=25%)

2.2.1 Prediction and feature selection

The Bayesian integrative model generates two major inference outcomes: prediction and feature selection. For prediction, denote γ^(b)C,γ^(b)M,γ^(b)M¯, and Ω^(b)as the simulated parameter estimates from the b-th iteration. For a new sample with omics data (Cnew,Gnew,Mnew), we average prediction of y from the (BBr) stable MCMCs by y^new=(b=Br+1By^new(b))/(BBr), where y^new(b)=Cnew·γ^(b)C+Mnew·Ω^(b)·γ^(b)M+(GnewMnew·Ω^(b))·γ^(b)M¯. And RMSE will be calculated from those estimates:

RMSE=i=1N(yiy^new,i)2/N
(1)

where the choice of N is discussed in Section 2.2.2.

Next, we summarize feature selection indicators Ik,(b)Mand Ik,(b)M¯for gene k and the b-th MCMC to determine the set of genes predictive to outcome y. Given the b-th iteration, we define the selection indicator for gene k as Ik,(b)so that gene k is selected if either the impact on outcome is through methylation (Ik,(b)M) or not (Ik,(b)M¯). In other words, Ik,(b)=(Ik,(b)M)OR(Ik,(b)M¯)=1(1Ik,(b)M)(1Ik,(b)M¯)for Br+1bBand 1kK. To control FDR at gene level, we apply Bayesian false discovery rate (BFDR) proposed by Newton (2004):

BFDR(t)=k=1KP^k(H0|D)dk(t)k=1Kdk(t)

where P^k(H0|D)=1(b=Br+1BIk,(b))/(BBr)is the posterior probability of gene k belonging to null hypothesis H0 (i.e. gene k is not selected given observed data D), dk(t)=I(P^k(H0|D)<t), and t is a tuning threshold. Given the definition of BFDR, the q-value of gene k can be defined as qk=mintP^k(H0|D)BFDR(t). This q-value will later be used for feature selection decision, which is comparable to frequentist approaches. Among selected genes, one may perform post hoc analysis to further investigate ckM=(b=Br+1BIk,(b)M)/(BBr)and ckM¯=(b=Br+1BIk,(b)M¯)/(BBr)and determine whether the impact of gene k to outcome is through methylation, non-methylation or both.

2.2.2 Benchmarks for evaluation

To evaluate the performance, the basic approach we compare with is the CC analysis. For full Bayesian model with missingness, we will also choose to impute gene expression only (FBMG), methylation only (FBMM) or both (FBMGM) when applicable. In simulation studies, we also perform analysis of the complete data (full) to examine the reduction of accuracy caused by missingness.

To benchmark performance of outcome prediction, we consider RMSE (Equation 1). In simulation studies, after parameter estimates are obtained we generate a new testing dataset with large sample size (N =2000) and compute RMSE. In real data analysis, we perform 50-fold cross-validation (CV) on complete cases for RMSE evaluation. In each iteration, NCC=2%of complete-case samples are set aside as test dataset, all remaining data are used to perform CC and FBMs analyses. The parameter estimates are then applied to the test data for outcome prediction. After ten iterations, cross-validated RMSE can be evaluated on all complete-case samples with N=NCCin Equation 1.

To evaluate feature selection performance, we order genes by q-value, plot receiver operating characteristic (ROC) curves and calculate area under curve (AUC) in simulations since the underlying true predictive genes are known. For real data, since we do not know the true features, we treat the gene selection result from full data analysis (full) as a surrogate of gold standard and compare gene selection from CC (or FBMs) to the full data analysis by tracing the top number of selected genes on the x-axis (e.g x = 100 top selected genes by CC and full) and the overlapped number from CC and full on the y-axis (Fig. 4b). Comparing the curves of CC and FBMs, a higher curve closer to the diagonal line shows more similar gene selection to ‘full’ and thus an indication of better performance.

An external file that holds a picture, illustration, etc.
Object name is bty775f4.jpg

(a and b) In missing Scenario I (α=50%,β=0) and Scenario II (α=0,β=50%), scatter plot of RMSEs between two methods are shown, where circle means cross-validation scheme generates correct decision, and cross means mistakes. Sample size N varies from 50,100,200,500. (c) Box plot of RMSE of different methods and CV selection scheme in Scenario III (α=β=25%)

3 Simulation studies

3.1 Simulation schemes

To evaluate performance of our full Bayesian model with missingness, we perform simulation based on data structure described in Section 2. Specifically, the clinical and methylation data matrices are simulated from N(0, 10) and N(0, 1) with N=(50,100,200,500), L =2 and J =2000. Each methylation site is randomly assigned to a gene, with the constraint that each gene contains at least one methylation site. In the mechanistic model, Ijω=1,ωjk=5and σk2=4for 1jJand 1kK, where the total number of genes K =1000. We then simulate gene expression matrices from N(MΩ,diag(σ12,σK2)). In the clinical model, 10 genes are randomly selected to impact clinical outcome through methylation and 10 randomly impact not through methylation (i.e. The IM vector has 10 out of K genes equal one and the remaining are zero, and similarly for IM¯. The selected genes in IM and IM¯can possibly overlap). For selected genes in IM and IM¯, the corresponding γkMand γkM¯are set to 10. The coefficients for clinical data γlC(1lL) are also set at 10 and σ2=9to simulate clinical outcome Y.

After full multi-omics datasets are simulated, data with missingness are generated with α%of samples with missing gene expression data and another non-overlapping β%of samples with missing methylation data. We simulate three scenarios of missingness: (I) Missing only gene expression data with (α, β)=(0.1, 0), (0.2, 0) and (0.5, 0); (II) Missing only methylation data with (α, β)=(0, 0.1), (0, 0.2), (0, 0.5); (III) Non-overlapping samples missing either gene expression or methylation data with (α, β)=(0.1, 0.1), (0.2, 0.2), (0.3, 0.3). For Scenario I, we evaluate CC and FBMG approaches and compare with full. Similarly for Scenario II, we compare CC, FBMM and full. Finally for Scenario III, we compare CC, FBMG, FBMM, FBMGM and full. In this case, FBMG imputes gene expression but ignores samples with missing methylation and similarly, FBMM imputes methylation but ignores samples with missing gene expression. FBMGM utilizes all samples and imputes both gene expression and methylation.

3.2 Results

Figure 2 shows the outcome prediction performance by RMSE for all three scenarios. We first focus on small sample size situations N=50200. In Scenario I, CC, FBMG and full have similar performance when α=10%missing but FBMG clearly outperforms CC when missingness increases to 20% and 50%, showing the benefit of imputation as expected. In contrast, FBMM performs much worse than CC in Scenario II with β=10%methylation missingness and FBMM only slightly outperform CC when missingness increases to 50%. Results of Scenario III are consistent with results of Scenarios I and II. FBMM and FBMGM perform worse than CC at α=β=10%. FBMM, FBMG and FBMGM outperform CC at α=β=50%. It is worth noting that when sample size increases to N =500, the data information is strong enough such that CC has performance similar to full. Imputation almost always create more data uncertainty and have worse performance than CC, especially since the majority of the imputed data are irrelevant to the clinical outcome.

An external file that holds a picture, illustration, etc.
Object name is bty775f2.jpg

Model prediction by RMSE(y^)comparing different methods and full data

We next examine feature selection performance by AUC values in Table 1 (ROC curves shown in Figure 3 and Supplementary Figure S1a–c). In Scenario I, FBMG always has higher AUC values than CC especially when missingness α increases to 25% or 30%. On the contrary, FBMM has lower AUC than CC except for N =50, 100 in β=50%. The message becomes mixed in Scenario III as expected.

Table 1.

AUC of different methods in simulation studies

Scenario I
Scenario II
Scenario III
MissingNFullCCFBMGCCFBMMCCFBMGFBMGMFBMM
Lowa500.6760.6490.6720.6310.6130.6160.6280.6520.583
1000.8990.8670.8780.8830.7770.8260.7530.8260.703
2000.9670.9760.9830.9610.8630.9460.8590.9130.789
Meda500.6760.6190.6650.6260.5970.5740.610.6460.528
1000.8990.8130.8620.8070.7740.7620.7450.8260.699
2000.9670.9410.9610.9390.8610.9030.8580.9170.795
Higha500.6760.530.6370.5570.5880.5340.5830.6360.512
1000.8990.6990.8350.6810.7420.6380.750.8330.7
2000.9670.8920.9640.880.8140.8280.7820.8830.765
Low, medium and high missing proportion for Scenario I and II are 10%, 25% and 50% for either α or β, respectively; and for Scenario III is 10%, 20% and 30% for both α and β.
An external file that holds a picture, illustration, etc.
Object name is bty775f3.jpg

The ROC curves for feature selection comparison on Scenario I: α0,β=0. Asterisk on each ROC curve is the point with maximum Youden index

We noticed that since methylation is the up-regulator of gene expression and indirectly impact the clinical outcome, imputing methylation is generally less effective than imputing gene expression. With increasing missing proportion, the benefit of imputation escalates. But when the missing proportion is small or sample size large, the uncertainty brought by imputation will overshadow the contribution from partially observed data (see Supplementary Appendix Section 5).

3.1 Simulation schemes

To evaluate performance of our full Bayesian model with missingness, we perform simulation based on data structure described in Section 2. Specifically, the clinical and methylation data matrices are simulated from N(0, 10) and N(0, 1) with N=(50,100,200,500), L =2 and J =2000. Each methylation site is randomly assigned to a gene, with the constraint that each gene contains at least one methylation site. In the mechanistic model, Ijω=1,ωjk=5and σk2=4for 1jJand 1kK, where the total number of genes K =1000. We then simulate gene expression matrices from N(MΩ,diag(σ12,σK2)). In the clinical model, 10 genes are randomly selected to impact clinical outcome through methylation and 10 randomly impact not through methylation (i.e. The IM vector has 10 out of K genes equal one and the remaining are zero, and similarly for IM¯. The selected genes in IM and IM¯can possibly overlap). For selected genes in IM and IM¯, the corresponding γkMand γkM¯are set to 10. The coefficients for clinical data γlC(1lL) are also set at 10 and σ2=9to simulate clinical outcome Y.

After full multi-omics datasets are simulated, data with missingness are generated with α%of samples with missing gene expression data and another non-overlapping β%of samples with missing methylation data. We simulate three scenarios of missingness: (I) Missing only gene expression data with (α, β)=(0.1, 0), (0.2, 0) and (0.5, 0); (II) Missing only methylation data with (α, β)=(0, 0.1), (0, 0.2), (0, 0.5); (III) Non-overlapping samples missing either gene expression or methylation data with (α, β)=(0.1, 0.1), (0.2, 0.2), (0.3, 0.3). For Scenario I, we evaluate CC and FBMG approaches and compare with full. Similarly for Scenario II, we compare CC, FBMM and full. Finally for Scenario III, we compare CC, FBMG, FBMM, FBMGM and full. In this case, FBMG imputes gene expression but ignores samples with missing methylation and similarly, FBMM imputes methylation but ignores samples with missing gene expression. FBMGM utilizes all samples and imputes both gene expression and methylation.

3.2 Results

Figure 2 shows the outcome prediction performance by RMSE for all three scenarios. We first focus on small sample size situations N=50200. In Scenario I, CC, FBMG and full have similar performance when α=10%missing but FBMG clearly outperforms CC when missingness increases to 20% and 50%, showing the benefit of imputation as expected. In contrast, FBMM performs much worse than CC in Scenario II with β=10%methylation missingness and FBMM only slightly outperform CC when missingness increases to 50%. Results of Scenario III are consistent with results of Scenarios I and II. FBMM and FBMGM perform worse than CC at α=β=10%. FBMM, FBMG and FBMGM outperform CC at α=β=50%. It is worth noting that when sample size increases to N =500, the data information is strong enough such that CC has performance similar to full. Imputation almost always create more data uncertainty and have worse performance than CC, especially since the majority of the imputed data are irrelevant to the clinical outcome.

An external file that holds a picture, illustration, etc.
Object name is bty775f2.jpg

Model prediction by RMSE(y^)comparing different methods and full data

We next examine feature selection performance by AUC values in Table 1 (ROC curves shown in Figure 3 and Supplementary Figure S1a–c). In Scenario I, FBMG always has higher AUC values than CC especially when missingness α increases to 25% or 30%. On the contrary, FBMM has lower AUC than CC except for N =50, 100 in β=50%. The message becomes mixed in Scenario III as expected.

Table 1.

AUC of different methods in simulation studies

Scenario I
Scenario II
Scenario III
MissingNFullCCFBMGCCFBMMCCFBMGFBMGMFBMM
Lowa500.6760.6490.6720.6310.6130.6160.6280.6520.583
1000.8990.8670.8780.8830.7770.8260.7530.8260.703
2000.9670.9760.9830.9610.8630.9460.8590.9130.789
Meda500.6760.6190.6650.6260.5970.5740.610.6460.528
1000.8990.8130.8620.8070.7740.7620.7450.8260.699
2000.9670.9410.9610.9390.8610.9030.8580.9170.795
Higha500.6760.530.6370.5570.5880.5340.5830.6360.512
1000.8990.6990.8350.6810.7420.6380.750.8330.7
2000.9670.8920.9640.880.8140.8280.7820.8830.765
Low, medium and high missing proportion for Scenario I and II are 10%, 25% and 50% for either α or β, respectively; and for Scenario III is 10%, 20% and 30% for both α and β.
An external file that holds a picture, illustration, etc.
Object name is bty775f3.jpg

The ROC curves for feature selection comparison on Scenario I: α0,β=0. Asterisk on each ROC curve is the point with maximum Youden index

We noticed that since methylation is the up-regulator of gene expression and indirectly impact the clinical outcome, imputing methylation is generally less effective than imputing gene expression. With increasing missing proportion, the benefit of imputation escalates. But when the missing proportion is small or sample size large, the uncertainty brought by imputation will overshadow the contribution from partially observed data (see Supplementary Appendix Section 5).

4 Impute or not: a decision scheme by cross-validation

From the mechanistic model in Figure 1b, gene expression and methylation data are not symmetric. Methylation data can be predictive to gene expression data and further predictive to outcomes. On the other hand, gene expression data are less predictive to methylation to help outcome prediction. As we have shown in simulations, imputations do not always improve prediction performance compared with CC analysis. Whether imputation would improve outcome prediction depends on the types of missingness (missing the upstream methylation data or missing the downstream gene expression data or both), the proportion of missingness and sample size. To guide the decision, we propose a self-learning CV scheme. Specifically, we apply 10-fold CV by leaving 10% of CC samples as the test set (for Scenario III N = 500, we apply 50-fold CV). We apply CC, FBMG, FBMM and FBMGM to the remaining training data, calculate parameter estimates and apply to test set. The procedure is repeated across all 10-folds and RMSE can be calculated on all test sets. The method with the smallest RMSE is selected to determine whether and how to impute. We conducted sensitivity analysis for this CV scheme (see Supplementary Appendix Section 1.2). We note that to evaluate performance of CV scheme in RMSE evaluation of outcome prediction in the Section 5, nested CV will be used (i.e. An outer loop of CV is used to evaluate RMSE and an inner loop of CV scheme for method selection is performed in each training set of the outer loop).

Figure 4 shows scatter plot of RMSE performance comparing FBMG and CC in Scenario I (Fig. 4a;α=50%missing gene expression) and FBMM and CC in Scenario II (Fig. 4b;β=50%missing methylation) in 50 independent simulations. In Scenario I, FBMG generally has smaller RMSE than CC in small sample sizes (N =50, 100, 200). But for N =500, RMSE of FBMG becomes slightly larger than CC on average. For missing methylation in Scenario II, FBMM performs better than CC at N =50 but gradually becomes worse than CC at N =100, 200 and 500. We applied the CV scheme in each simulation to determine whether imputation should be performed or not. Simulations shown by circles represent correct decisions (i.e. CV scheme decides to impute and the RMSE of imputation is indeed smaller than RMSE of CC or vice versa) and cross represents incorrect decision. The result shows universally high accuracy of CV scheme decision. When the decision is wrong, imputation and CC RMSEs are close to each other (near the diagonal line) and the incorrect decision only minimally impacts the outcome prediction. Figure 4c shows box plots of RMSE generated from different approaches (CC, FBMG, FBMM, FBMGM, CV) for Scenario III (α=β=25%). At N =50, FBMG and FBMGM both perform well and CV generates similar small RMSE. When N increases to 500, FBMGM becomes much worse and CC performs slightly better than FBMG. Again, the CV scheme makes mostly correct decision and thus generates small RMSE close to the lowest. In conclusion, the simulation results indicate effectiveness of the CV scheme in determining the best strategy of whether and how to impute when encountering missingness in multi-omics data.

5 Real application

5.1 Data and approach

We apply the proposed full Bayesian model with missingness to 460 children asthma nasal epithelium samples obtained from asthma study at Children’s Hospital of Pittsburgh, with complete DNA methylation data from Illumina 450k chips and RNA-Seq gene expression data. All data were preprocessed with standard procedures and bioinformatics tools. We used M-value for methylation level for better model fitting. The RNA-seq gene expression counts were transformed to TPM (transcripts per million), a continuous value also more valid for the model assumptions. We filter out genes with small mean expression level (<130.41) or small standard deviation (<23.83) to obtain K =1000 genes for the analysis. We then select methylation sites matched to these 1000 genes. Since some genes have many corresponding methylation sites, we perform PCA to identify the top eigen-methylation sites as input to the model. We take no more than three PCs per gene or less than three PCs that explain at least 75% variation. This generates J =2619 eigen-methylation features for the analysis. The PCA analysis reduces redundant (highly correlated) information in methylation sites and independence of eigen-methylation features fits the model assumptions well. Similar to simulation, we generate three scenarios of missingness: (I) α=50%, (II) β=50%, (III) α=β=20%. For each scenario, we repeat 50 times.

Serum Immunoglobulin E (IgE) level is a primary clinical outcome in children asthma studies. We take log-transformed IgE level as our clinical outcome, and age and gender as clinical variables. Together with the gene expression and methylation PCs, we run our model with full data and three scenarios of missingness to compare CC approach and full Bayesian model with missingness (FBMG, FBMM and FBMGM).

5.2 Outcome prediction and feature selection

Table 2 shows the RMSE of outcome prediction from complete case analysis and FBMs. When 50% samples have missing gene expression (Scenario I), FBMG had reduced RMSE compared with CC (dropped from 43.34.19 to 36.04). In contrast, FBMM had inflated RMSE compared with CC when 50% of missing methylation (51.59 compared with 39.54 in Scenario II). When gene expression and methylation both missed 20% of samples (i.e. 40% of samples had missing values in Scenario III), FBMG had the smallest RMSE (38.71) compared with FBMGM (46.15), FBMM (54.23) and CC (39.09). Our CV scheme performed the automatic selection of whether and how to impute, and the RMSE was always close to the best (I: 36.61, II: 41.09 and III: 40.02). Figure 5a shows scatter plot or box plot of RMSEs of different methods in all three Scenarios. Similar to the simulation result, CV scheme mostly selected the best method and the mistakes were near the diagonal line with little predictive impact.

Table 2.

RMSE (S.E.) of different methods

MethodsScenario IScenario IIScenario III
CC43.34 (5.33)39.54 (6.85)39.09 (5.43)
FBMG36.04 (4.73)38.71 (5.73)
FBMGM46.15 (4.74)
FBMM51.59 (6.71)54.23 (4.50)
CV36.61 (3.78)41.09 (6.17)40.62 (7.36)
An external file that holds a picture, illustration, etc.
Object name is bty775f5.jpg

(a) Scatter plot of RMSEs between two methods are shown, where circle means cross-validation scheme generates correct decision, and cross means mistakes. (b) The number of overlapped genes (features) selected by each methods compared with full asthma data. On x axis is the top number of genes selected by each method

Unlike in simulation, no underlying truth is available for real data and thus calculation of AUC is not possible. Figure 5b treats the predicted outcome from full data as the surrogate of underlying truth and compare feature selection from each method with the surrogate (i.e. x-axis shows the same number of features selected by the designated method and full and y-axis demonstrates the overlap between the two). A curve with higher overlap shows better similarity of feature selection with the surrogate, an indication of better performance. Similar to the RMSE result, FBMG performed better than CC in Scenario I, CC better than FBMM in Scenario II, and FBMG and CC performed better than FBMM and FBMGM in Scenario III. CV scheme performs close to the best. All results from this real application are largely aligned with our observations in simulation studies.

We next investigate whether feature selection from the imputation methods or method selected by CV scheme represent better functional annotation in a biological sense. We obtained the top 200 genes from feature selection of each method by posterior probability order and then performed pathway enrichment analysis by one-sided Fisher’s exact test. We collected 2467 pathways from four pathway databases (KEGG, Reactome, Biocarta and GO) with the restriction of pathway size between 10 and 500. The enrichment p-values were then adjusted for multiple comparisons by Benjamini–Hochberg procedure. We found 27 enriched pathways from the full data analysis under FDR = 1%. We used these 27 pathways as a surrogate of gold standard to benchmark functional annotation performance of each method. Figure 6 shows box-plots of the minus log-transformed p-values from pathway enrichment of the 27 pathways in each method. As expected, FBMG had better p-value significance distribution than CC in Scenario I. In Scenario II, CC performed better than FBMM. FBMGM and FBMG outperformed CC in Scenario III. The CV scheme automatically determined whether and how to impute, and it always perform close to the best in each scenario.

An external file that holds a picture, illustration, etc.
Object name is bty775f6.jpg

Compare different methods for the log10(p)value of the top 27 pathways taken from full data with q-value <0.01

5.1 Data and approach

We apply the proposed full Bayesian model with missingness to 460 children asthma nasal epithelium samples obtained from asthma study at Children’s Hospital of Pittsburgh, with complete DNA methylation data from Illumina 450k chips and RNA-Seq gene expression data. All data were preprocessed with standard procedures and bioinformatics tools. We used M-value for methylation level for better model fitting. The RNA-seq gene expression counts were transformed to TPM (transcripts per million), a continuous value also more valid for the model assumptions. We filter out genes with small mean expression level (<130.41) or small standard deviation (<23.83) to obtain K =1000 genes for the analysis. We then select methylation sites matched to these 1000 genes. Since some genes have many corresponding methylation sites, we perform PCA to identify the top eigen-methylation sites as input to the model. We take no more than three PCs per gene or less than three PCs that explain at least 75% variation. This generates J =2619 eigen-methylation features for the analysis. The PCA analysis reduces redundant (highly correlated) information in methylation sites and independence of eigen-methylation features fits the model assumptions well. Similar to simulation, we generate three scenarios of missingness: (I) α=50%, (II) β=50%, (III) α=β=20%. For each scenario, we repeat 50 times.

Serum Immunoglobulin E (IgE) level is a primary clinical outcome in children asthma studies. We take log-transformed IgE level as our clinical outcome, and age and gender as clinical variables. Together with the gene expression and methylation PCs, we run our model with full data and three scenarios of missingness to compare CC approach and full Bayesian model with missingness (FBMG, FBMM and FBMGM).

5.2 Outcome prediction and feature selection

Table 2 shows the RMSE of outcome prediction from complete case analysis and FBMs. When 50% samples have missing gene expression (Scenario I), FBMG had reduced RMSE compared with CC (dropped from 43.34.19 to 36.04). In contrast, FBMM had inflated RMSE compared with CC when 50% of missing methylation (51.59 compared with 39.54 in Scenario II). When gene expression and methylation both missed 20% of samples (i.e. 40% of samples had missing values in Scenario III), FBMG had the smallest RMSE (38.71) compared with FBMGM (46.15), FBMM (54.23) and CC (39.09). Our CV scheme performed the automatic selection of whether and how to impute, and the RMSE was always close to the best (I: 36.61, II: 41.09 and III: 40.02). Figure 5a shows scatter plot or box plot of RMSEs of different methods in all three Scenarios. Similar to the simulation result, CV scheme mostly selected the best method and the mistakes were near the diagonal line with little predictive impact.

Table 2.

RMSE (S.E.) of different methods

MethodsScenario IScenario IIScenario III
CC43.34 (5.33)39.54 (6.85)39.09 (5.43)
FBMG36.04 (4.73)38.71 (5.73)
FBMGM46.15 (4.74)
FBMM51.59 (6.71)54.23 (4.50)
CV36.61 (3.78)41.09 (6.17)40.62 (7.36)
An external file that holds a picture, illustration, etc.
Object name is bty775f5.jpg

(a) Scatter plot of RMSEs between two methods are shown, where circle means cross-validation scheme generates correct decision, and cross means mistakes. (b) The number of overlapped genes (features) selected by each methods compared with full asthma data. On x axis is the top number of genes selected by each method

Unlike in simulation, no underlying truth is available for real data and thus calculation of AUC is not possible. Figure 5b treats the predicted outcome from full data as the surrogate of underlying truth and compare feature selection from each method with the surrogate (i.e. x-axis shows the same number of features selected by the designated method and full and y-axis demonstrates the overlap between the two). A curve with higher overlap shows better similarity of feature selection with the surrogate, an indication of better performance. Similar to the RMSE result, FBMG performed better than CC in Scenario I, CC better than FBMM in Scenario II, and FBMG and CC performed better than FBMM and FBMGM in Scenario III. CV scheme performs close to the best. All results from this real application are largely aligned with our observations in simulation studies.

We next investigate whether feature selection from the imputation methods or method selected by CV scheme represent better functional annotation in a biological sense. We obtained the top 200 genes from feature selection of each method by posterior probability order and then performed pathway enrichment analysis by one-sided Fisher’s exact test. We collected 2467 pathways from four pathway databases (KEGG, Reactome, Biocarta and GO) with the restriction of pathway size between 10 and 500. The enrichment p-values were then adjusted for multiple comparisons by Benjamini–Hochberg procedure. We found 27 enriched pathways from the full data analysis under FDR = 1%. We used these 27 pathways as a surrogate of gold standard to benchmark functional annotation performance of each method. Figure 6 shows box-plots of the minus log-transformed p-values from pathway enrichment of the 27 pathways in each method. As expected, FBMG had better p-value significance distribution than CC in Scenario I. In Scenario II, CC performed better than FBMM. FBMGM and FBMG outperformed CC in Scenario III. The CV scheme automatically determined whether and how to impute, and it always perform close to the best in each scenario.

An external file that holds a picture, illustration, etc.
Object name is bty775f6.jpg

Compare different methods for the log10(p)value of the top 27 pathways taken from full data with q-value <0.01

6 Discussion

Integrative analysis of multi-level omics data brings unique insights to the modulating relationship between different types of omics data. Feature selection and model prediction are two important goals in multi-omics integration, which empowers discovery of disease-associated biomarkers, survival prediction and risk assessment. Several methods have been developed to fulfil these goals, including iBAG using a two-layer Bayesian hierarchical model to discover both association between genes and clinical outcome, and that between gene and upstream regulators. However, none of these methods are able to handle the potential large proportion of missing data in the data integration. In this article, we propose a full Bayesian model with missingness (FBM) inspired by iBAG model, to jointly perform feature selection, model prediction and missing data incorporation. In addition to the mechanistic model and clinical model originally proposed by iBAG, FBM includes a third layer of missingness model to incorporate samples with missingness. The flexibility of Bayesian hierarchical modelling and Gibbs sampler technique enables us to jointly model association among data and infer parameters in all three layers of models together. The Laplace (double exponential) prior initially used in iBAG could not realize exact feature selection. FBM applied spike-and-slab prior for more effective feature selection and allows Bayesian FDR control. We demonstrated outcome prediction and feature selection performance of FBM using extensive simulations.

From the extensive simulations, we then further realized that imputation is not always favoured over complete case analysis. For example, in high dimensional genomic data analysis, when missingness exists in upstream regulators (i.e. methylation), uncertainty will be increased to impair final inference. When the missing proportion is relatively small and the sample size is large (i.e. signal is strong), CC analysis also often outperforms FBM. To decide the best handling of data with missingness, we proposed a self-learning cross-validation decision scheme. Previously, we have developed a similar self-training selection scheme to select the best microarray missing value imputation method and its downstream biological impact (Brock et al., 2008; Oh et al., 2011). In both simulation and the childhood asthma application, we showed superior performance of the CV scheme in prediction outcome and feature selection.

While Bayesian hierarchical model allows complex parameter structures, it also comes with a computational cost when using conventional inference approaches such as Metropolis–Hastings or its special case, Gibbs sampling. In FBM, fast Gibbs sampling was applicable using conjugate priors and the convergence was generally fast (B = 2000). We have optimized the R code using C++ with Rcpp package. The computing takes 90 min for a reasonably large dataset with N =500 samples, K =1000 genes, J =2000 methylations and B =2000 MCMC iterations using a regular computer with 1 Intel Xeon CPU (2.40 GHz). Since the computation burden grows linearly with sample size, feature size and number of iterations, applying FBM with parallel computing could be a solution to substantially speed up computing and allow routine omics applications. An R package, data and source code to replicate all results in this article are available on GitHub (https://github.com/CHPGenetics/FBM).

Funding

This work was supported by the National Institutes of Health [RO1 CA190766 to Z.F., T.M., L.Z., and G.C.T.’s, RO1 MD011764 and RO1 HL117191 to W.C. and J.C.C.’s].

Conflict of Interest: none declared.

Supplementary Material

Supplementary Material

Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA
Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD, USA
Division of Pediatric Pulmonology, Allergy and Immunology, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, USA
To whom correspondence should be addressed. E-mail: ude.ttip@9fhz
Received 2018 Jan 15; Revised 2018 Aug 8; Accepted 2018 Aug 31.
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Abstract

Motivation

Integrative analysis of multi-omics data from different high-throughput experimental platforms provides valuable insight into regulatory mechanisms associated with complex diseases, and gains statistical power to detect markers that are otherwise overlooked by single-platform omics analysis. In practice, a significant portion of samples may not be measured completely due to insufficient tissues or restricted budget (e.g. gene expression profile are measured but not methylation). Current multi-omics integrative methods require complete data. A common practice is to ignore samples with any missing platform and perform complete case analysis, which leads to substantial loss of statistical power.

Methods

In this article, inspired by the popular Integrative Bayesian Analysis of Genomics data (iBAG), we propose a full Bayesian model that allows incorporation of samples with missing omics data.

Results

Simulation results show improvement of the new full Bayesian approach in terms of outcome prediction accuracy and feature selection performance when sample size is limited and proportion of missingness is large. When sample size is large or the proportion of missingness is low, incorporating samples with missingness may introduce extra inference uncertainty and generate worse prediction and feature selection performance. To determine whether and how to incorporate samples with missingness, we propose a self-learning cross-validation (CV) decision scheme. Simulations and a real application on child asthma dataset demonstrate superior performance of the CV decision scheme when various types of missing mechanisms are evaluated.

Availability and implementation

Freely available on the GitHub at https://github.com/CHPGenetics/FBM

Supplementary information

Supplementary data are available at Bioinformatics online.

Abstract
Click here for additional data file.(1.0M, pdf)
Collaboration tool especially designed for Life Science professionals.Drag-and-drop any entity to your messages.