Software and Data

Software

Single-cell RNA-seq

scImpute (imputation method for single-cell RNA-Seq data)
- Citation: Li, W.V. and Li, J.J. (2018). An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature Communications 9:997.
scDesign (a statistical simulator for rational scRNA-seq experimental design)
- Citation: Li, W.V. and Li, J.J. (2019). A statistical simulator scDesign for rational scRNA-seq experimental design. Bioinformatics 35(14):i41–i50.
scDesign2 (a simulator that generates single-cell gene expression counts with gene correlations captured)
- Citation 1: Sun, T., Song, D., Li, W.V., and Li, J.J. (2021). scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. Genome Biology 22:163.
- Citation 2: Sun, T., Song, D., Li, W.V., and Li, J.J. (2022). Simulating single-cell gene expression count data with preserved gene correlations by scDesign2. Journal of Computational Biology 29(1):23–26 (software article).
scDesign3 (a simulator that generates realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data)
- Citation: Song, D., Wang, Q., Yan, G., Liu, T., and Li, J.J. (2023). scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nature Biotechnology.
PseudotimeDE (inference of differential gene expression along cell pseudotime)
- Citation: Song, D. and Li, J.J. (2021). PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data. Genome Biology 22:124.
scGTM (a flexible and interpretable model of gene expression trend along cell pseudotime)
- Citation: Cui, E.H., Song, D., Wong, W.K., and Li, J.J. (2022). Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime. Bioinformatics 38(16):3927–3934.
scPNMF (single-cell projective non-negative matrix factorization for selecting informative features that distinguish cell types)
- Citation: Song, D., Li, K.A., Hemminger, Z., Wollman, R., and Li, J.J. (2021). scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling. Bioinformatics 37(Supplement_1):i358-i366.
scSampler (fast diversity-preserving subsampling of large-scale single-cell transcriptomic data)
- Citation: Song, D., Xi, N.M., Li, J.J., and Wang, L. (2022). scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. Bioinformatics 38(11):3126–3127.

DoubletCollection (R package that integrates the execution and benchmark of doublet-detection methods)
- Citation: Xi, N.M. and Li, J.J. (2021). Protocol for executing and benchmarking eight computational doublet-detection methods in single-cell RNA sequencing data analysis. STAR Protocols 2(3):100699.

Bulk RNA-seq isoform discovery and quantification

SLIDE (Sparse Linear modeling of RNA-Seq data for Isoform Discovery and abundance Estimation)
- Citation: Li, J.J., Jiang, C.-R., Brown, B.J., Huang, H., and Bickel, P.J. (2011). Sparse linear modeling of RNA-seq data for isoform discovery and abundance estimation. Proc. Natl. Acad. Sci. USA 108(50):19867-19872.
- Important note: SLIDE is compatible with RNA-seq .bam files mapped by TopHat and TopHat2.
- (Updates on Jan 30th, 2018 -- Several bugs have been fixed for RNA-seq .bam files with more than one read length.)
- (Updates on May 7th, 2012 -- A feature was added for estimating the annotated isoform abundance without doing isoform discovery.)
- (Updates on Apr 18th, 2012 -- A feature was added for removing erroneously mapped reads; a multiprocessing bug was fixed.)
- (Updates on Apr 5th, 2012 -- A feature was added for handling single-end RNA-Seq reads or a mixture of single-end and paired-end reads.)

NMFP (Non-negative Matrix Factorization based Preselection)
- Please cite the following paper in any research that uses this software package
  - Ye, Y. and Li, J.J. (2016). NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data. BMC Genomics 17(Supp 1):11.
MSIQ (joint modeling of Multiple RNA-seq Samples for accurate Isoform Quantification)
- Citation: Li, W.V., Zhao, A., Zhang, S., and Li, J.J. (2018). MSIQ: joint modeling of multiple RNA-seq samples for accurate isoform quantification. Annals of Applied Statistics 12(1):510-539.
AIDE (annotation-assisted isoform discovery)
- Citation: Li, W.V., Li, S., Tong, X., Deng, L., Shi, H., and Li, J.J. (2019). AIDE: annotation-assisted isoform discovery with high precision. Genome Research 29:2056-2072.

Comparative genomics

TROM (TRanscriptome Overlap Measure)
- Citation: Li, W.V., Chen, Y., and Li, J.J. (2017). TROM: a testing-based method for finding transcriptomic similarity of biological samples. Statistics in Biosciences 9(1):105-136.
EPOM (EPigenome Overlap Measure)
- Citation: Li, W.V., Razaee, Z.S., and Li, J.J. (2016). Epigenome overlap measure (EPOM) for comparing tissue/cell types based on chromatin states. BMC Genomics 17(Supp 1):10.
EpiAlign (an alignment-based bioinformatic tool for comparing chromatin state sequences)
- Citation: Ge, X., Zhang, H., Xie, L., Li, W.V., Kwon, S.B., and Li, J.J. (2019). EpiAlign: an alignment-based bioinformatic tool for comparing chromatin state sequences. Nucleic Acids Research gkz287.

BiTSC (bipartite tight spectral clustering)
- Citation: Sun, Y.E., Zhou, H.J., and Li, J.J. (2020). Bipartite tight spectral clustering (BiTSC) algorithm for identifying conserved gene co-clusters in two species. Bioinformatics btaa741.

Microbiomics

mbImpute (imputation method for microbiome data)
- Citation: Jiang, R., Li, W.V., and Li, J.J. (2021). mbImpute: an accurate and robust imputation method for microbiome data. Genome Biology accepted after minor revision.

Classification

NPROC (Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristics (NP-ROC))
- Citation: Tong, X., Feng, Y., and Li, J.J. (2018). Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristics (NP-ROC). Science Advances 4(2):eaao1659.
frc (feature ranking for classification)
- Citation: Li, J.J., Chen, Y., and Tong, X. (2021). A flexible model-free prediction-based framework for feature ranking. Journal of Machine Learning Research 22:1-54.
ITCA (combination of ambiguous outcome labels in multi-class classification)
- Citation: Zhang, C., Chen, Y.E., Zhang, S., and Li, J.J. (2022). Information-theoretic classification accuracy: a criterion that guides data-driven combination of ambiguous outcome labels in multi-class classification. Journal of Machine Learning Research 23(341):1–65.

Association measure

gR2 (generalized R-square measures)
- Citation: Li, J.J., Tong, X., and Bickel, P.J. (2019). Generalized R2 measures for a mixture of bivariate linear dependences. arXiv:1811.09965.

High-dimensional model inference

HDCI (high-dimensional linear model coefficient confidence interval)
- Citation: Liu, H., Xu, X., and Li, J.J. (2020). A bootstrap lasso + partial ridge method to construct confidence intervals for parameters in high-dimensional sparse linear models. Statistica Sinica 30:1333-1355.

False discovery rate control

Clipper (p-value-free FDR control in high-throughput genomics data analysis)
- Citation: Ge, X., Chen, Y.E., Song, D., McDermott, M., Woyshner, K., Manousopoulou, A., Wang, L.D., Li, W., and Li, J.J. (2021). Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biology 22:288.

Data

Estimates of D. melanogaster and C. elegans gene expression in different developmental stages, tissues and cells (in FPKM units)
- D. melanogaster gene expression estimates in 30 fly developmental stages (download)
- D. melanogaster gene expression estimates in 29 fly tissues and 19 fly cell lines (download)
- C. elegans gene expression estimates in 35 worm developmental stages (download)
- C. elegans gene expression estimates in 4 worm tissues and 14 worm dissected cells (download)
- Please cite the following paper in any research that uses the above data
  - Li, J.J., Huang, H., Bickel, P.J., and Brenner, S.E. (2014). Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modENCODE RNA-seq data. Genome Research 24(7):1086-1101.
- For more details about the data, please refer to the section "Estimating gene expression in developmental stages and tissues/cells" in the Methods of the above paper ([html] [pdf]).

Associated promoter and enhancer regions identified based on signals of three histone modification marks (H3K4me1, H3K4me3 and H3K27ac) in 16 human tissue and cell types (download)

Estimates of gene expression (FPKM) in various cell and tissue types from human, chimpanzee, bonobo, mouse and pig
- Expression estimates of protein-coding genes in human (download)
- Expression estimates of protein-coding genes in chimpanzee (download)
- Expression estimates of protein-coding genes in bonobo (download)
- Expression estimates of protein-coding genes in mouse (download)
- Expression estimates of protein-coding genes in pig (download)
- Expression estimates of long non-coding RNAs in human (download)
- Expression estimates of long non-coding RNAs in chimpanzee (download)
- Expression estimates of long non-coding RNAs in bonobo (download)
- Expression estimates of long non-coding RNAs in mouse (download)
- Please cite the following paper in any research that uses the above data
  - Yang et al. (2017). Large-scale mapping of mammalian transcriptomes identifies conserved genes associated with different cell states. Nucleic Acids Research 45(4):1657–1672.
- For more details about the data, please refer to the section "RNA-seq data collection and processing" in the Methods of the above paper ([html] [pdf]).

Data for the R package Clipper (download)
