Overview
Human-scATAC-Corpus (v1.0.0) is a large-scale database of human scATAC-seq data, built to support research in single-cell epigenomics. It currently comprises over 5.4 million cells, more than three times the size of any existing counterpart. The data are aggregated and harmonized from 35 datasets covering 37 tissues or cell lines, selected through a manual review of over 200 published studies. Stringent quality control and standardization protocols were applied to ensure data integrity and usability for algorithm development and benchmarking.
Human-scATAC-Corpus provides each dataset in several formats to accommodate different analytic needs. All datasets are uniformly represented as a cell-by-candidate cis-regulatory element (cCRE) matrix, which facilitates cross-dataset analyses and matrix-based computational methods. The database also provides processed, standardized fragment files for fragment-level analyses, and it includes the original cell-by-peak matrices to support comparisons across feature definitions. This harmonization involved manual processing of more than 700 files to ensure consistency and ease of access.
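To make the relationship between fragment files and a cell-by-cCRE matrix concrete, the sketch below counts, for each cell barcode, the fragments that overlap each cCRE. It is a minimal illustration and not the pipeline used to build the database. It assumes the widely used tab-separated fragment layout (chromosome, start, end, barcode, read count) with half-open coordinates and a set of non-overlapping cCREs. The function names, barcodes, and coordinates are hypothetical, and whether the released matrices hold raw counts or binarized values is not stated here.

```python
import bisect
from collections import defaultdict


def build_ccre_index(ccres):
    """Group cCREs by chromosome, sorted by start.

    Assumes cCREs on the same chromosome do not overlap each other.
    Each entry is (start, end, column), where column is the cCRE's
    position in the input list, i.e. its matrix column.
    """
    index = defaultdict(list)
    for col, (chrom, start, end) in enumerate(ccres):
        index[chrom].append((start, end, col))
    for intervals in index.values():
        intervals.sort()
    return index


def count_fragments(fragment_lines, ccres):
    """Return {(barcode, ccre_column): n_overlapping_fragments}.

    Coordinates are treated as half-open [start, end), so a fragment that
    starts exactly where a cCRE ends does not count as overlapping.
    """
    index = build_ccre_index(ccres)
    starts = {chrom: [iv[0] for iv in ivs] for chrom, ivs in index.items()}
    counts = defaultdict(int)
    for line in fragment_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        chrom, f_start, f_end, barcode = line.rstrip("\n").split("\t")[:4]
        f_start, f_end = int(f_start), int(f_end)
        intervals = index.get(chrom)
        if not intervals:
            continue  # no cCREs on this chromosome
        # cCREs at or after this position start at/after the fragment end.
        i = bisect.bisect_left(starts[chrom], f_end) - 1
        # Walk left while the cCRE still ends after the fragment start.
        while i >= 0 and intervals[i][1] > f_start:
            counts[(barcode, intervals[i][2])] += 1
            i -= 1
    return dict(counts)


# Hypothetical toy inputs for illustration only.
ccres = [("chr1", 100, 200), ("chr1", 500, 700), ("chr2", 50, 150)]
fragments = [
    "chr1\t150\t320\tAAAC-1\t1",
    "chr1\t180\t560\tAAAC-1\t2",
    "chr1\t600\t650\tTTTG-1\t1",
    "chr2\t150\t300\tTTTG-1\t1",  # touches the chr2 cCRE end, no overlap
    "chr3\t10\t90\tAAAC-1\t1",    # chromosome without cCREs
]
counts = count_fragments(fragments, ccres)
for (barcode, col), n in sorted(counts.items()):
    print(barcode, ccres[col], n)
```

The sparse dictionary keyed by (barcode, column) maps directly onto a coordinate-format sparse matrix, which is usually the practical choice at this scale because most cell-by-cCRE entries are zero.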
A hallmark of Human-scATAC-Corpus is its rich metadata, enabling exploration of various biological scenarios, such as cell type annotation, batch effect correction, out-of-sample stimulation analysis, and CRISPR perturbation studies. Furthermore, the database is tightly integrated with EpiAgent, the first foundation model for single-cell epigenomics, offering users robust tools and tutorials for mapping new datasets onto the reference, supporting applications such as cancer cell tracing and developmental trajectory analysis.
Once online, Human-scATAC-Corpus will provide data browsing, search, download, and online analysis functionalities. We anticipate that Human-scATAC-Corpus will become a foundational resource for the single-cell epigenomics community, accelerating the development and benchmarking of novel analytic methods.
Number of cells for each file format
Number of cells for each tissue
Number of cells for each task