
A Granger causality approach to gene regulatory network reconstruction based on data from multiple experiments

Abstract


The discovery of gene regulatory networks (GRNs) from gene expression data is a promising direction for deciphering the biological mechanisms that underlie many basic aspects of scientific and medical advances. In this thesis, we focus on the reconstruction of GRNs from time-series data using a Granger causality (GC) approach. As there is little existing research on combining data from multiple time-series experiments, we identify the need for a methodology, with underlying theory, for combining multiple experiments to obtain statistically significant discoveries. We derive a statistical theory for the intersection of two discovered networks. This statistical framework is novel and intended for our GRN discovery problem; however, it is not limited to GRNs or GC, and may be applied to other problems whenever one can take the intersection of discoveries obtained from multiple experiments (or datasets). We propose a number of novel methods for combining data from multiple experiments. Our single underlying model (SUM) method regresses the data of multiple experiments jointly, enabling GC to fully exploit the information in the original data. Building on our statistical theory and SUM, we develop new meta-analysis methods, including union of pairwise common edges (UPCE) and leave-one-out hybrid of SUM and UPCE (LOOHSU). Applications to synthetic and real data show that our new methods yield discoveries of substantially higher precision than traditional meta-analysis. We also propose methods for estimating the precision of GC-discovered networks, filling an important gap in the literature: these methods allow us to assess how good a discovered network is when the ground truth is unknown, as is typical in biological applications.
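The two core operations described above, pairwise GC testing by comparing restricted and full autoregressive models, and intersecting the edge sets discovered in separate experiments, can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the thesis implementation; the function names, the fixed lag order, and the F-statistic threshold are choices made for the example.

```python
import numpy as np

def gc_fstat(x, y, p=2):
    """F statistic for 'x Granger-causes y' at lag order p, via OLS.

    Compares the restricted model (y regressed on its own p lags) with
    the full model (y regressed on its own lags plus x's p lags).
    """
    n = len(y)
    Y = y[p:]
    own = [y[p - k:n - k] for k in range(1, p + 1)]
    cross = [x[p - k:n - k] for k in range(1, p + 1)]
    Xr = np.column_stack([np.ones(n - p)] + own)          # restricted design
    Xf = np.column_stack([np.ones(n - p)] + own + cross)  # full design
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    rss_r, rss_f = rss(Xr), rss(Xf)
    df1, df2 = p, (n - p) - Xf.shape[1]
    return (rss_r - rss_f) / df1 / (rss_f / df2)

def discover_edges(data, p=2, thresh=4.7):
    """Edge set {(i, j): gene i Granger-causes gene j} from one experiment.

    data: (time points x genes) array; thresh is an illustrative cutoff,
    roughly the 1%-level F critical value for small p and long series.
    """
    g = data.shape[1]
    return {(i, j) for i in range(g) for j in range(g)
            if i != j and gc_fstat(data[:, i], data[:, j], p) > thresh}

# Intersecting the networks discovered in two experiments keeps only the
# edges confirmed in both, the operation the statistical theory analyzes:
# common = discover_edges(expt1) & discover_edges(expt2)
```

Intersection trades recall for precision: an edge survives only if it is independently rediscovered, which is why a dedicated statistical theory for the intersected network is needed.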
Our precision estimate obtained by half-half splitting with combinations (HHSC) is much closer to the true value than the estimate computed from the Benjamini-Hochberg false discovery rate controlling procedure. Furthermore, using a network covering notion, we design a method that identifies a small number of links with high precision of around 0.8-0.9, which may relieve the burden of testing many low-precision hypothetical interactions in biological experiments. When the number of genes is much larger than the data length, full-model GC cannot be applied, and GC is instead often applied to the genes pairwise. We analyze how spurious causalities (false discoveries) may arise in this setting and demonstrate that model validation can effectively remove them. With our proposed implementation, in which model orders are fixed by the Akaike information criterion and every model is subject to validation, we report a new observation: network hubs tend to act as sources rather than receivers of interactions.
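As a rough illustration of the pairwise setting, the sketch below selects an autoregressive model order by AIC and applies a simple residual-whiteness check as model validation. The thesis's actual validation procedure is not reproduced here; the helper names, maximum order, and autocorrelation bound are assumptions made for the example.

```python
import numpy as np

def fit_ar(y, p):
    """OLS fit of an AR(p) model; returns the residual series."""
    n = len(y)
    Y = y[p:]
    X = np.column_stack([np.ones(n - p)] +
                        [y[p - k:n - k] for k in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ beta

def aic_order(y, max_p=8):
    """Pick the AR order minimizing AIC (Gaussian log-likelihood form)."""
    def aic(p):
        r = fit_ar(y, p)
        return len(r) * np.log(np.mean(r ** 2)) + 2 * p
    return min(range(1, max_p + 1), key=aic)

def residuals_white(r, lags=5, z=3.0):
    """Crude whiteness check: every low-order sample autocorrelation of
    the residuals must stay within +/- z / sqrt(n)."""
    r = r - r.mean()
    n = len(r)
    acf = np.array([r[:-k] @ r[k:] for k in range(1, lags + 1)]) / (r @ r)
    return bool(np.all(np.abs(acf) < z / np.sqrt(n)))

# A fitted pairwise model would be kept only if its residuals pass the
# validation check, e.g.:
# p = aic_order(y); keep = residuals_white(fit_ar(y, p))
```

A model whose residuals retain structure has not captured the dynamics it claims to, so any causality it reports is suspect; discarding such models is one way spurious pairwise discoveries can be filtered out.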