Statistical and computational method development for system-wide quantification of RNA and protein molecules

Advances in next-generation high-throughput sequencing technologies have revolutionized genomic studies over the last decade. In particular, RNA sequencing (RNA-Seq) technology, which offers deep coverage and base-level resolution, makes it possible to investigate the transcriptomes (i.e., sets of RNA molecules in cells) of human and other eukaryotic species with unprecedented detail and clarity. Unlike previous technologies such as microarrays, RNA-Seq provides information on alternative splicing (i.e., how the DNA of one gene can be transcribed into multiple RNA sequences) as well as gene expression on a genome-wide scale, without requiring prior knowledge of RNA sequences. This advantage of RNA-Seq enables, in a high-throughput manner, the discovery of novel RNA isoforms (i.e., RNA molecules transcribed from the same gene but having different sequences) and the estimation of isoform expression, two crucial steps toward understanding the molecular mechanisms of diseases and other biological phenomena. Before the invention of RNA-Seq, RNA isoforms were discovered on a gene-by-gene basis, and the discovered isoforms were recorded in annotations (i.e., databases of genes and their RNA isoforms). Previous methods that aim to discover and quantify RNA isoforms from RNA-Seq data fall into two categories: “annotation-based” and “annotation-free”. The former use existing annotations to aid isoform discovery but are hindered by the incompleteness of annotations. The latter use no annotation information and instead assemble RNA isoforms directly from RNA-Seq data; however, they are affected by widespread non-systematic noise and biases in the data.

In [1], we developed a method, “SLIDE” (Sparse Linear modeling of RNA-Seq data for Isoform Discovery and abundance Estimation), which defines a new category of “annotation-aided” methods for discovering and quantifying RNA isoforms from RNA-Seq data. SLIDE uses a stochastic approach to account for RNA-Seq noise and borrows information from known isoform structures. By combining information from both annotations and RNA-Seq data, SLIDE is better able to find novel isoforms than annotation-based methods and is more robust to data noise and biases than annotation-free methods. Since its publication, SLIDE has received considerable attention from the RNA-Seq field (cited 91 times) and has been adapted for use by the modENCODE consortium (a nationwide project that aimed to identify all of the sequence-based functional elements in the genomes of two model organisms, the nematode worm C. elegans and the fruit fly D. melanogaster) (Brown et al., Nature, 2014; Boley et al., Nature Biotechnology, 2014). In an independent assessment of multiple isoform discovery methods conducted by the RGASP consortium, SLIDE achieved top performance among the methods compared (Steijger et al., Nature Methods, 2013).
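
As a rough illustration of what a sparse linear model for isoform abundance can look like, the sketch below fits a nonnegative, L1-penalized least-squares problem by coordinate descent. It is a toy under stated assumptions and not the SLIDE implementation: the function name nonneg_lasso, the 0/1 exon-by-isoform design matrix, the bin counts, and the penalty value are all hypothetical and chosen only for illustration.

```python
def nonneg_lasso(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - X b||^2 + lam*sum(b) subject to b >= 0
    by cyclic coordinate descent. X is a list of rows (pure Python)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - X b; equals y because b starts at 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # inner product of column j with the residual that excludes b[j]
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * b[j]
            # soft-threshold, then clip at zero to keep abundances nonnegative
            new_bj = max(0.0, (rho - lam) / col_sq[j])
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


# Hypothetical toy gene: rows are 4 exon bins, columns are 3 candidate isoforms.
# X[i][j] = 1 if candidate isoform j contains exon i.
X = [
    [1, 1, 1],  # exon 1: in all three candidates
    [1, 1, 0],  # exon 2: in candidates 0 and 1
    [1, 0, 0],  # exon 3: only in candidate 0
    [1, 1, 1],  # exon 4: in all three candidates
]
# Made-up per-exon signal generated from abundances (2, 0, 1), without noise.
y = [3.0, 2.0, 2.0, 3.0]

estimates = nonneg_lasso(X, y, lam=0.01)
print(estimates)
```

On this noiseless toy input the estimates should land close to (2, 0, 1), with the L1 penalty keeping the unexpressed candidate at or near zero. Real RNA-Seq data would require a design matrix that reflects read or fragment sampling rather than simple exon membership.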

However, isoform discovery remains a great challenge in studying human transcriptomes, mostly because the complex alternative splicing mechanisms of human genes lead to an extremely large search space of possible RNA isoforms. Therefore, in [2] we developed a method, “NMFP” (Nonnegative Matrix Factorization Based Preselection), as an upstream preselection step that significantly improves the precision and recall of SLIDE and other methods for isoform discovery. Moreover, to further improve the accuracy of estimating isoform expression, in [3] we developed an umbrella statistical method, “MSIQ” (Joint Modeling of Multiple RNA-Seq Samples for Accurate Isoform Quantification), which jointly uses the information in multiple RNA-Seq replicates while accounting for possible replicate heterogeneity. Both NMFP and MSIQ are useful tools for boosting the performance of existing methods and for facilitating the reuse of large-scale RNA-Seq data sets in public repositories for new studies. Our contribution to method development for RNA-Seq data has been well recognized by the field, and we are currently funded by an NIH R01 grant to pursue important research questions in this direction.

In addition, in [4, 5] we used statistics to quantify the relative contributions of transcription (i.e., information flow from DNA to RNA) and translation (i.e., information flow from RNA to protein) in determining protein abundances in animal cells. By re-analyzing existing large-scale transcriptomic and proteomic data, we showed that transcription is the dominant step, in contrast to the claims of many papers over the last decade that translational control plays the larger role. Furthermore, our ongoing work in [6] shows that decomposing the translational rate into RNA-dependent and RNA-independent components further reveals mechanisms behind translational control.
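
One simple way to make the notion of "relative contribution" concrete is the share of variance in log protein abundance that is explained by log mRNA abundance, i.e., the squared Pearson correlation. The sketch below computes this quantity on made-up numbers. It is only an illustration of the idea and not the analysis in [4, 5]; the function name variance_explained and the toy values are hypothetical.

```python
def variance_explained(x, y):
    """Squared Pearson correlation between x and y: the fraction of the
    variance in y explained by a simple linear regression on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Made-up log-scale abundances for five hypothetical genes.
log_mrna = [1.0, 2.0, 3.0, 4.0, 5.0]
log_protein = [2.1, 2.9, 4.2, 4.8, 6.0]

r2 = variance_explained(log_mrna, log_protein)
print(f"share of protein variance explained by mRNA: {r2:.3f}")
```

On real data, measurement error in both the mRNA and the protein measurements lowers the observed correlation, so a naive value like this one would need to be interpreted with care.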


[1] Li, J.J., Jiang, C.-R., Brown, J.B., Huang, H., and Bickel, P.J., “Sparse linear modeling of RNA-seq data for isoform discovery and abundance estimation,” Proc Natl Acad Sci USA 108(50):19867-19872 (2011)
[2] Ye, Y., and Li, J.J., “NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data,” BMC Genomics 17(Supp 1):11 (2016)
[3] Li, W.V., Zhao, A., Zhang, S., and Li, J.J., “MSIQ: Joint Modeling of Multiple RNA-seq Samples for Accurate Isoform Quantification,” arXiv:1603.05915
[4] Li, J.J., Bickel, P.J., and Biggin, M.D., “System wide analyses have underestimated protein abundances and transcriptional importance in animals,” PeerJ 2:e270 (2014)
[5] Li, J.J., and Biggin, M.D., “Statistics requantitates the central dogma,” Science 347(6226):1066-1067 (2015)
[6] Li, J.J., Chew, G.L., and Biggin, M.D., “Quantitating translational control: mRNA abundance-dependent and independent contributions,” Under review at Proc Natl Acad Sci USA (2016)