Abstract

We propose a Bayesian procedure to cluster temporal gene expression microarray profiles, based on a mixed-effect smoothing-spline model, and design a Gibbs sampler to sample from the desired posterior distribution. Our method can determine the cluster number automatically based on the Bayesian information criterion, and handle missing data easily. When applied to a microarray dataset on the budding yeast, our clustering algorithm provides biologically meaningful gene clusters according to a functional enrichment analysis.

1. Introduction

Microarray technology enables the scientist to measure the mRNA expression levels of thousands of genes simultaneously. For a particular species of interest, one can make microarray measurements under many different conditions and for different types of cells (if it is a multicellular organism). Genes' expression profiles under these conditions often give the scientist some clues on biological roles of these genes. A group of genes with similar profiles are often “coregulated” or participants of the same biological functions.

When a series of microarray experiments are conducted sequentially during a biological process, we call the resulting dataset a “temporal” microarray dataset, which can provide insights on the underlying biology and help decipher the dynamic gene regulatory network. Clustering genes with similar temporal profiles is a crucial first step to reveal potential relationships among the genes.

Conventional clustering methods, such as the K-means and hierarchical clustering, do not take into consideration the correlation in the gene expression levels over time. Although it is possible to use a general multivariate Gaussian model to account for the correlation structure, such a model ignores the time order of the gene expressions. As evidenced in our example, the time factor is important in interpreting the results of gene expression clustering in temporal data. It is also possible to use an autoregression model to describe the gene expression time series, but such a model often requires stationarity, which is unlikely to hold in most temporal microarray data.

Recently, nonparametric analysis of data in the form of curves, that is, functional data, is subject to active research, see [1, 2] for a comprehensive treatment of functional data analysis; and curve-based functional clustering methods have emerged [37], but a rigorous assessment of the estimation precision is still lacking.

In this paper, we propose a Bayesian clustering method, which optimally combines the available information and provides a proper uncertainty measure for all estimated quantities. Our method is based on a mixture of mixed-effect smoothing splines models. For each cluster, we model its mean profile as a smoothing spline function and describe its individual gene's variation by a parametric random effect. Based on the theory of reproducing-kernel Hilbert spaces [8], we represent the mean expression curve as a linear combination of certain basis functions, which enables us to derive the full posterior distribution up to a normalizing constant. All the conditional distributions needed by a Gibbs sampler are also easy to compute and to sample from. Our method automatically takes care of the missing data and infers the number of clusters in the data. Using the method, we analyzed a microarray dataset of budding yeast, we found that the majority of the clusters we had obtained are enriched for known and expected biological functions.

Our method is not restricted to temporal microarray data, and can be applied to all curve clustering problems, especially for sparsely and irregularly sampled temporal data.

2. Material and Methods

2.1. Mixed-Effect Representation of Gene Expression Profile

Let the expression value of the 𝑖th gene at time 𝑡 be 𝑦𝑖𝑡. To accommodate missing data that occasionally occurs in microarray experiment, we denote 𝐭𝑖=(𝑡1,,𝑡𝑛𝑖) and 𝐲𝑖=(𝑦𝑖1,,𝑦𝑖𝑛𝑖)𝑇, where 𝑛𝑖 is the number of measurements of 𝑖th gene. Our mixed-effect smoothing spline model [9] for genes in one cluster is𝐲𝑖𝐭=𝝁𝑖+𝑍𝑖𝐛𝑖+𝝐𝑖,(1)where 𝝁(𝐭𝑖)=(𝜇(𝑡1),𝜇(𝑡𝑛𝑖))𝑇 is the cluster's mean profile, 𝐛𝑖𝑁(0,𝐵) is the random effect to capture the intragene correlation, 𝑍𝑖 is the known design matrix for the random effect, and 𝝐𝑖𝑁(0,𝜎2𝐼) is the random error independent of 𝐛 and of each other.

By taking different 𝐛 vectors, we can accommodate different nonrandom effects. For example, when 𝐛𝑖=𝑏𝑖 and 𝑍𝑖=𝟏, the expression profile of the 𝑖th gene is parallel to the mean profile 𝝁 (Figure 1). If 𝐛𝑖=(𝑏𝑖1,𝑏𝑖2)𝑇 and 𝑍𝑖=(𝟏,𝐭𝑖), the difference between the 𝑖th gene profile and the mean profile is a linear function in time. More complicated structures such as periodicity can be modeled by letting the 𝑍𝑖 be basis of a certain functional space.

By considering 𝜇 in a reproducing kernel Hilbert space {𝜇𝑀(𝜇)<} in which 𝑀(𝜇) is a square seminorm, we can represent 𝜇 as𝜇(𝑡)=𝑚𝜈=1𝑑𝜈𝜙𝜈(𝑡)+𝑞𝑖=1𝑐𝑗𝑅𝑀𝑠𝑗,𝑡,𝑡[0,𝑎],(2)where {𝑠𝑗} is a set consisting of all distinct {𝑡𝑖}, 𝑞 is the number of {𝑠𝑗}, and 𝑅𝑀 is the kernel of . The choice of 𝑀(𝜇)=𝑎0(𝑑2𝜇/𝑑𝑡2)2𝑑𝑡 yields the cubic smoothing spline with𝜙1(𝑡)=1,𝜙2(𝑡)=𝑡,(3)𝑅𝑀𝑡1,𝑡2=𝑎0𝑡1𝑢+𝑡2𝑢+𝑑𝑢,(4)where ()+=max(,0) [10].

Writing (2) in a vector-matrix form, we have𝝁(𝐭𝑖)=𝑆𝑖𝐝𝑖+𝑅𝑖𝐜𝑖,(5)where 𝑆𝑖 is 𝑛𝑖×𝑚 with the (𝑖,𝜈)th entry 𝜙𝜈(𝑡𝑖) and 𝑅 is 𝑛𝑖×𝑞 with the (𝑖,𝑗)th entry 𝑅𝑀(𝑡𝑖,𝑠𝑗). Substituting (5) into (1), we have𝐲𝑖=𝑆𝑖𝐝𝑖+𝑅𝑖𝐜𝑖+𝑍𝑖𝐛𝑖+𝝐𝑖.(6)Denoting 𝐲=(𝐲𝑇1,,𝐲𝑇𝑛)𝑇 and 𝑆, 𝑅, 𝑍, 𝝐 similarly, we have the matrix representation𝐲=𝑆𝐝+𝑅𝐜+𝑍𝐛+𝝐,(7)where 𝐛=(𝐛𝑇1,,𝐛𝑇𝑛)𝑇𝑁(0,diag(𝐵,,𝐵)).

The prior distributions are specified as follows:𝛿𝐝𝑁0,diag1,,𝛿𝑚,𝐜𝑁0,𝜏2𝐼,𝜎2𝛼IG𝜎2,𝛽𝜎2,𝜏2𝛼IG𝜏2,𝛽𝜏2,𝜈𝐵IW0,𝑆01,(8)where IG and IW are inverse-Gamma and inverse-Wishart distributions, respectively.

These priors lead to the following full conditional posteriors, which are used in our Gibbs sampler:𝐝𝐛,𝐜,𝜎2𝑉,𝜹,𝐲𝑁𝑑𝑆𝑇(𝐲𝑅𝐜𝑍𝐛)/𝜎2,𝑉𝑑,𝐜𝐝,𝐛,𝜎2,𝜏2𝑉,𝐲𝑁𝑐𝑅𝑇(𝐲𝑆𝐝𝑍𝐛)/𝜎2,𝑉𝑐,𝐛𝐝,𝐜,𝜎2𝑉,𝐵,𝐲𝑁𝑏𝑍𝑇(𝐲𝑆𝐝𝑅𝐜)/𝜎2,𝑉𝑏,𝜈[𝐵𝐛]IW0𝑆+𝑛,0+𝑛𝑖=1𝐛𝑖𝐛𝑇𝑖1,[𝜏2𝛼𝐜]IG𝜏2+(𝑞𝑚)/2,𝛽𝜏2+𝐜𝑇,𝐜/2[𝜎2𝛼𝐝,𝐛,𝐜,𝐲]IG𝜎2+𝑛/2,𝛽𝜎2,+SSR(9)where 𝑉𝑑=(𝑆𝑇𝑆/𝜎2+diag(𝛿11,,𝛿𝑚1))1, 𝑉𝑏=(𝑍𝑇𝑍/𝜎2+diag(𝐵1,,𝐵1))1, 𝑉𝑐=(𝑅𝑇𝑅/𝜎2+1/𝜏2𝐼)1, and SSR=(𝐲𝑆𝐝𝑅𝐜𝑍𝐛)𝑇(𝐲𝑆𝐝𝑅𝐜𝑍𝐛).

2.2. The Mixture Model with Unknown Number of Components

When more than one cluster is considered, we assume that the expression of the 𝑖th gene has a Gaussian mixture distribution:𝐲𝑖𝑝1𝑁(𝝁1,Σ1)++𝑝𝐾𝑁(𝝁𝐾,Σ𝐾),(10)where 𝝁𝑘 and Σ𝑘=𝑍𝐵𝑘𝑍𝑇+𝜎2𝐼 are the mean and covariance matrix for the 𝑘th component, as given by (7); 𝑝𝑘 is the fraction of 𝑘th component, and 𝐾 is the number of Gaussian components.

2.3. Class Labels and Cluster Numbers

To ease the computation, we introduce a “latent” membership labeling variable 𝐽𝑖 for the 𝑖th gene so that𝐲𝑖𝐽𝑖𝝁=𝑗𝑁𝑗,Σ𝑗.(11)When the number of Gaussian components 𝐾 is known, we can get the joint posterior probability as𝑃(𝐉,𝝁,Σ𝐲)=𝜋(𝝁,Σ)𝑛𝑖=1𝑝𝑗𝑖𝑁(𝐲𝑖𝝁𝑗𝑖,Σ𝑗𝑖),(12)where 𝐉=(𝑗1,,𝑗𝑛), 𝝁=(𝜇1,,𝜇𝐾), Σ=(Σ1,,Σ𝐾), and 𝜋(𝝁,Σ) is the joint prior distribution.

Since 𝐾 is unknown, we used the following Bayesian information criterion (BIC):BIC=2log𝑝𝐲𝑀𝐾,𝜽𝐾+𝑙𝐾log𝑛,(13)where 𝑀𝐾 is the current model with parameters 𝜽𝐾, 𝜽𝐾 is the estimate, and 𝑙𝐾 is the total number of parameters in our model. A small BIC score indicates the adequacy of the corresponding model. An alternative to our current approach (i.e., each clustering configuration is equally likely given the number of clusters 𝐾, and 𝐾 is determined by BIC) is to use a Polya Urn prior (also called the “Chinese restaurant” process), which postulates that when a new member comes in, its a priori probability for joining an existing cluster of size 𝑚𝑖 is (𝑚𝑖+𝑐)/(𝑚+𝑐), and for forming a new cluster of its own is 𝑐/(𝑚+𝑐), where 𝑚 is the total number of existing members. This prior, however, favors unbalanced cluster configurations (e.g., very large and very small clusters) and may not be appropriate in our applications.

2.3.1. Gibbs Sampling from the Posterior

To complete our Bayesian analysis, we employ the Dirichlet prior Di (𝛼1,,𝛼𝐾) for (𝑝1,,𝑝𝐾), the cluster proportions. Thus, given the cluster indicator 𝐉, the posterior distribution of the 𝑝's is again a Dirichlet distribution.

Given 𝝁1,,𝝁𝐾,𝐵1,,𝐵𝐾,𝜎2, we have the conditional distribution of 𝐽𝑖:𝑝𝐽𝑖=𝑗𝝁1,,𝝁𝐾,𝐵1,,𝐵𝐾,𝜎2=𝑝,𝐲𝑗𝐍𝐲𝑖𝝁𝑗,𝑍𝐵𝑗𝑍𝑇+𝜎2𝐼𝐾𝑘=1𝑝𝑘𝐍𝐲𝑖𝝁𝑘,𝑍𝐵𝑘𝑍𝑇+𝜎2𝐼.(14)

With an initial value of 𝐉, which gives rise to a partition of 𝐲(𝐲𝐽1,,𝐲𝐽𝐾), and the initial values of 𝐝𝑘, 𝐛𝑘, 𝐜𝑘, 𝐵𝑘, where 𝑘=1,,𝐾, as well as 𝜎2, we iterate the following iterative conditional sampling steps:(i) for 𝑖=1,,𝑛, draw a new 𝑗𝑖 from the conditional distribution from (14) to replace the old one;(ii) conditional on 𝐉, sequentially (a)update 𝐝𝑘 by a draw from [𝐝𝑘𝐛𝑘,𝐜𝑘,𝜎2,𝜹,𝐲𝐽𝑘], where 𝑘=1,,𝐾,(b)update 𝐛𝑘 from [𝐛𝑘𝐝𝑘,𝐜𝑘,𝜎2,𝐵𝑘,𝐲𝐽𝑘], where 𝑘=1,,𝐾,(c) update 𝐜𝑘 from [𝐜𝑘𝐝𝑘,𝐛𝑘,𝜎2,𝜏2𝑘,𝐲𝐽𝑘], where 𝑘=1,,𝐾,(d) update 𝐵𝑘[𝐵𝑘𝐛𝑘], and 𝜏2𝑘[𝜏2𝑘𝐜𝑘], where 𝑘=1,,𝐾,(e) update 𝜎2[𝜎2𝐝,𝐛,𝐜,𝐲],(f)update (𝑝1,,𝑝𝐾)Di(𝑛1+𝛼1,,𝑛𝐾+𝛼𝐾), where 𝑛𝑗 is the number of genes in the 𝑗th cluster.

3. Results and Discussion

To study oxygen-responsive gene network, Lai et al. [11] used cDNA microarray to monitor the gene expression changes of wild-type budding yeast (Saccharomyces cerevisiae) under aerobic condition in galactose medium. Under aerobic condition, the oxygen concentration was lowered gradually until oxygen was exhausted during a period of ten minutes. Microarray experiments were conducted at 14 time points under aerobic condition. A reference sample was obtained from a pooled RNA collected from all time points for hybridization.

For the analysis, Lai et al. [11] normalized gene expression after time 0 to gene expression of time 0 to set the common starting point. They identified 2388 genes whose expression is differentially expressed at one or more time points. Using our method, 2388 genes was clustered to 31 clusters, which yields the smallest BIC. FunSpec [12] was used for gene annotation and biological function enrichment analysis, where the Bonferroni-corrected functional enrichment 𝑃-values based on hypergeometric distributions are reported. We found 23 clusters out of 31 clusters discovered have biological functions over-represented. Among them, estimated mean gene expression profiles of three clusters are given in Figure 2.

In cluster A, which consists of 40 genes, the estimated mean expression goes up progressively as oxygen level goes down, which suggests that the genes in this cluster were transiently upregulated in response to aerobisis. Accordingly, genes involved in stress response (function enrichment 𝑃-value =104) as well as cell rescue and defense are over-represented in this cluster (function enrichment 𝑃-value =104). Furthermore, genes involved in molecular functions of oxidoreductase and coproporphyrinogen oxidase are also presented, which explains the upregulation of the gene expression levels.

We have 92 genes in cluster B, where the estimated mean gene expression drops down at the beginning rapidly and then goes up gradually. In this cluster, 34 genes are involved in protein synthesis (function enrichment 𝑃-value 1014). Moreover, ribosome biogenesis are also over-represented (function enrichment 𝑃-value 1014). These processes were affected by oxygen level initially, but were quickly adjusted to high expression levels to maintain living of yeast.

Contrast to cluster B, cluster C (68 genes) consists of genes involved in galactose fermentation (function enrichment 𝑃-value =104), carbon utilization (functional enrichment 𝑃-value =102), and carbohydrate metabolism (function enrichment 𝑃-value 1010). The initial upregulation of gene expression under aerobic condition can be partly explained by the fact that the cell increases the energy uptaking through the carbon utilization as oxygen level goes down; but as the oxygen level continues to drop down, these processes are replaced by the more energy-efficient processes, which drives the expression levels of genes to be downregulated.

4. Conclusions

Conventional clustering methods do not take into consideration the correlation in the gene expression levels over time. Multivariate Gaussian models and time series analysis cannot model the time factor and correlation properly. These limitations can be readily overcome by the full Bayesian approach developed here. Although certain prior distributions and the related hyperparameters need to be input by the user, we found the clustering results rather robust to variations in such inputs. Moreover, our Bayesian clustering algorithm serves as a platform to incorporate more biological knowledge. Open source R code is available at www.stat.uiuc.edu/~pingma/BayesianFDAClust.htm.

Acknowledgments

The authors thank C. I. Castillo-Davis for his help in designing Figure 1 and for the many constructive suggestions and discussion during the early stages of this article. The authors thank Ji Young Kim for designing the software website. The authors also thank Kurt Kwast for providing the yeast microarray data.