Software
Single-cell RNA-seq
- scImpute (imputation method for single-cell RNA-Seq data)
- Citation: Li, W.V. and Li, J.J. (2018). An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature Communications 9:997.
- scDesign (a statistical simulator for rational scRNA-seq experimental design)
- Citation: Li, W.V. and Li, J.J. (2019). A statistical simulator scDesign for rational scRNA-seq experimental design. Bioinformatics 35(14):i41–i50.
- scDesign2 (a simulator that generates single-cell gene expression counts with gene correlations captured)
- Citation 1: Sun, T., Song, D., Li, W.V., and Li, J.J. (2021). scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. Genome Biology 22:163.
- Citation 2: Sun, T., Song, D., Li, W.V., and Li, J.J. (2022). Simulating single-cell gene expression count data with preserved gene correlations by scDesign2. Journal of Computational Biology 29(1):23–26 (software article).
- scDesign3 (a simulator that generates realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data)
- Citation: Song, D., Wang, Q., Yan, G., Liu, T., and Li, J.J. (2023). scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nature Biotechnology.
- PseudotimeDE (inference of differential gene expression along cell pseudotime)
- Citation: Song, D. and Li, J.J. (2021). PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data. Genome Biology 22:124.
- scGTM (a flexible and interpretable model of gene expression trend along cell pseudotime)
- Citation: Cui, E.H., Song, D., Wong, W.K., and Li, J.J. (2022). Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime. Bioinformatics 38(16):3927–3934.
- scPNMF (single-cell projective non-negative matrix factorization for selecting informative features that distinguish cell types)
- Citation: Song, D., Li, K.A., Hemminger, Z., Wollman, R., and Li, J.J. (2021). scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling. Bioinformatics 37(Supplement_1):i358-i366.
- scSampler (fast diversity-preserving subsampling of large-scale single-cell transcriptomic data)
- Citation: Song, D., Xi, N.M., Li, J.J., and Wang, L. (2022). scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. Bioinformatics 38(11):3126–3127.
- DoubletCollection (R package that integrates the execution and benchmarking of doublet-detection methods)
- Citation: Xi, N.M. and Li, J.J. (2021). Protocol for executing and benchmarking eight computational doublet-detection methods in single-cell RNA sequencing data analysis. STAR Protocols 2(3):100699.
Bulk RNA-seq isoform discovery and quantification
- SLIDE (Sparse Linear modeling of RNA-Seq data for Isoform Discovery and abundance Estimation)
- Citation: Li, J.J., Jiang, C.-R., Brown, B.J., Huang, H., and Bickel, P.J. (2011). Sparse linear modeling of RNA-seq data for isoform discovery and abundance estimation. Proc. Natl. Acad. Sci. USA 108(50):19867-19872.
- Important note: SLIDE is compatible with RNA-seq .bam files mapped by TopHat and TopHat2.
- (Updates on Jan 30th, 2018 -- Several bugs have been fixed for RNA-seq .bam files with more than one read length.)
- (Updates on May 7th, 2012 -- A feature was added for estimating the annotated isoform abundance without doing isoform discovery.)
- (Updates on Apr 18th, 2012 -- A feature was added for removing erroneously mapped reads; a multiprocessing bug was fixed.)
- (Updates on Apr 5th, 2012 -- A feature was added for handling single-end RNA-Seq reads or a mixture of single-end and paired-end reads.)
- NMFP (Non-negative Matrix Factorization based Preselection)
- Please cite the following paper in any research that uses this software package:
- Ye, Y. and Li, J.J. (2016). NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data. BMC Genomics 17(Supp 1):11.
- MSIQ (joint modeling of Multiple RNA-seq Samples for accurate Isoform Quantification)
- Citation: Li, W.V., Zhao, A., Zhang, S., and Li, J.J. (2018). MSIQ: joint modeling of multiple RNA-seq samples for accurate isoform quantification. Annals of Applied Statistics 12(1):510-539.
- AIDE (annotation-assisted isoform discovery)
- Citation: Li, W.V., Li, S., Tong, X., Deng, L., Shi, H., and Li, J.J. (2019). AIDE: annotation-assisted isoform discovery with high precision. Genome Research 29:2056-2072.
Comparative genomics
- TROM (TRanscriptome Overlap Measure)
- Citation: Li, W.V., Chen, Y., and Li, J.J. (2017). TROM: a testing-based method for finding transcriptomic similarity of biological samples. Statistics in Biosciences 9(1):105-136.
- EPOM (EPigenome Overlap Measure)
- Citation: Li, W.V., Razaee, Z.S., and Li, J.J. (2016). Epigenome overlap measure (EPOM) for comparing tissue/cell types based on chromatin states. BMC Genomics 17(Supp 1):10.
- EpiAlign (an alignment-based bioinformatic tool for comparing chromatin state sequences)
- Citation: Ge, X., Zhang, H., Xie, L., Li, W.V., Kwon, S.B., and Li, J.J. (2019). EpiAlign: an alignment-based bioinformatic tool for comparing chromatin state sequences. Nucleic Acids Research gkz287.
- BiTSC (bipartite tight spectral clustering)
- Citation: Sun, Y.E., Zhou, H.J., and Li, J.J. (2020). Bipartite tight spectral clustering (BiTSC) algorithm for identifying conserved gene co-clusters in two species. Bioinformatics btaa741.
Microbiomics
- mbImpute (imputation method for microbiome data)
- Citation: Jiang, R., Li, W.V., and Li, J.J. (2021). mbImpute: an accurate and robust imputation method for microbiome data. Genome Biology accepted after minor revision.
Classification
- NPROC (Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristics (NP-ROC))
- Citation: Tong, X., Feng, Y., and Li, J.J. (2018). Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristics (NP-ROC). Science Advances 4(2):eaao1659.
- frc (feature ranking for classification)
- Citation: Li, J.J., Chen, Y., and Tong, X. (2021). A flexible model-free prediction-based framework for feature ranking. Journal of Machine Learning Research 22:1-54.
- ITCA (combination of ambiguous outcome labels in multi-class classification)
- Citation: Zhang, C., Chen, Y.E., Zhang, S., and Li, J.J. (2022). Information-theoretic classification accuracy: a criterion that guides data-driven combination of ambiguous outcome labels in multi-class classification. Journal of Machine Learning Research 23(341):1–65.
Association measure
- gR2 (generalized R-squared measures)
- Citation: Li, J.J., Tong, X., and Bickel, P.J. (2019). Generalized R2 measures for a mixture of bivariate linear dependences. arXiv:1811.09965.
High-dimensional model inference
- HDCI (high-dimensional linear model coefficient confidence interval)
- Citation: Liu, H., Xu, X., and Li, J.J. (2020). A bootstrap lasso + partial ridge method to construct confidence intervals for parameters in high-dimensional sparse linear models. Statistica Sinica 30:1333-1355.
False discovery rate control
- Clipper (p-value-free FDR control in high-throughput genomics data analysis)
- Citation: Ge, X., Chen, Y.E., Song, D., McDermott, M., Woyshner, K., Manousopoulou, A., Wang, L.D., Li, W., and Li, J.J. (2021). Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biology 22:288.
Data
- Estimates of D. melanogaster and C. elegans gene expression in different developmental stages, tissues and cells (in FPKM units)
- D. melanogaster gene expression estimates in 30 fly developmental stages (download)
- D. melanogaster gene expression estimates in 29 fly tissues and 19 fly cell lines (download)
- C. elegans gene expression estimates in 35 worm developmental stages (download)
- C. elegans gene expression estimates in 4 worm tissues and 14 worm dissected cells (download)
- Please cite the following paper in any research that uses the above data:
- Li, J.J., Huang, H., Bickel, P.J., and Brenner, S.E. (2014). Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modENCODE RNA-seq data. Genome Research 24(7):1086-1101.
- For more details about the data, please refer to the section "Estimating gene expression in developmental stages and tissues/cells" in the Methods of the above paper ([html] [pdf]).
- Associated promoter and enhancer regions identified based on signals of three histone modification marks (H3K4me1, H3K4me3 and H3K27ac) in 16 human tissue and cell types (download)
- Estimates of gene expression (FPKM) in various cell and tissue types from human, chimpanzee, bonobo, mouse and pig
- Expression estimates of protein-coding genes in human (download)
- Expression estimates of protein-coding genes in chimpanzee (download)
- Expression estimates of protein-coding genes in bonobo (download)
- Expression estimates of protein-coding genes in mouse (download)
- Expression estimates of protein-coding genes in pig (download)
- Expression estimates of long non-coding RNAs in human (download)
- Expression estimates of long non-coding RNAs in chimpanzee (download)
- Expression estimates of long non-coding RNAs in bonobo (download)
- Expression estimates of long non-coding RNAs in mouse (download)
- Please cite the following paper in any research that uses the above data:
- Yang et al. (2017). Large-scale mapping of mammalian transcriptomes identifies conserved genes associated with different cell states. Nucleic Acids Research 45(4):1657–1672.
- For more details about the data, please refer to the section "RNA-seq data collection and processing" in the Methods of the above paper ([html] [pdf]).
- Data for the R package Clipper (download)
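
Note on units: the expression estimates listed above are reported in FPKM (fragments per kilobase of transcript per million mapped fragments). For readers unfamiliar with the unit, the standard definition can be sketched as below. This is only an illustration of the general formula, not the pipeline used to produce the downloadable files (see the Methods sections of the cited papers for that); the function name `fpkm` and its parameters are hypothetical.

```python
def fpkm(fragment_counts, gene_lengths_bp, total_mapped_fragments=None):
    """Convert per-gene fragment counts to FPKM.

    FPKM_g = 1e9 * C_g / (N * L_g), where C_g is the number of fragments
    mapped to gene g, L_g is the gene length in base pairs, and N is the
    total number of mapped fragments in the sample.

    Illustrative helper only; names and defaults are assumptions.
    """
    if len(fragment_counts) != len(gene_lengths_bp):
        raise ValueError("counts and lengths must have the same length")
    if any(length <= 0 for length in gene_lengths_bp):
        raise ValueError("gene lengths must be positive")

    # Assumption: if no library size is given, use the sum of the counts.
    total = (sum(fragment_counts)
             if total_mapped_fragments is None
             else total_mapped_fragments)
    if total <= 0:
        raise ValueError("total mapped fragments must be positive")

    return [1e9 * count / (total * length)
            for count, length in zip(fragment_counts, gene_lengths_bp)]


if __name__ == "__main__":
    # One gene of 1,000 bp with 100 fragments in a library of 1 million.
    print(fpkm([100], [1000], total_mapped_fragments=1_000_000))
```

Because FPKM normalizes by both gene length and library size, values are comparable across genes within a sample; comparing across samples generally calls for further normalization.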