The example analysis is implemented by SnapATAC.
The example data consist in ~4000 cells of adult mouse brain available from 10X genomics. All the data used in this analysis can be downloaded from here.
First import SnapATAC, ggplot2, GenomicRanges, and some other packages we will be using for analyzing PBMCs.
options(warn=-1)
library(SnapATAC)
library(viridisLite)
library(ggplot2)
library(GenomicRanges)
library(leiden)
set.seed(1234)
The first step of analysis is to load the snap file including cell-by-feature matrix as a SnapATAC object.
The matrix is analogous to the gene expression count matrix used to analyze single-cell RNA-seq. However, instead of genes, each row of the matrix represents a region of the genome (a peak/bin), that is predicted to represent a region of open chromatin. ScEpiTools will provide a task-specific ID after uploading the file.
For the data generation, we can see the tutorial for how to generate a snap file.
Internal parameters:
task_id - The only identifier of each task.
User-defined parameters:
Snap_file - The name of the snap file you want to load. Default: ‘atac_v1_adult_brain_fresh_5k.snap’.
Barcodes_file - The name of the barcodes file you want to load. Default: ‘atac_v1_adult_brain_fresh_5k_singlecell.csv’.
task_id <- 'Example'
inputpath <- paste0("./", task_id, "/")
outputpath <- paste0("./", task_id, "/")
path_blacklist <- paste0("./", task_id, "/SnapATAC_data/")
Snap_file <- "atac_v1_adult_brain_fresh_5k.snap"
x.sp <- createSnap(
file=paste0(inputpath, Snap_file),
sample="SnapATAC",
num.cores=1
)
## Epoch: reading the barcode session ...
Barcodes_file <- 'atac_v1_adult_brain_fresh_5k_singlecell.csv'
barcodes <- read.csv(
file = paste0(inputpath, Barcodes_file),
header = TRUE
)
Next we perform some preprocessing including quality control and binarization.
We select high-quality barcodes based on two criteria: 1) number of unique fragments; 2) fragments in promoter ratio.
User-defined parameters:
Minimum_UMI - Minimum number of UMIs required for a cell to pass filtering. Default: 3.
Maximum_UMI - Maximum number of UMIs required for a cell to pass filtering. Default: 5.
Minimum_promoter_ratio - Minimum ratio of promoters required for a cell to pass filtering. Default: 0.15.
Maximum_promoter_ratio - Maximum ratio of promoters required for a cell to pass filtering. Default: 0.6.
barcodes <- barcodes[2:nrow(barcodes),]
promoter_ratio <- (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1)
UMI <- log(barcodes$passed_filters+1, 10)
data <- data.frame(UMI=UMI, promoter_ratio=promoter_ratio)
barcodes$promoter_ratio <- promoter_ratio
options(repr.plot.width = 6, repr.plot.height = 6)
p <- ggplot(
data,
aes(x=UMI, y=promoter_ratio)) +
geom_point(size=0.1, col="grey") +
theme_classic() +
ggtitle("10X Fresh Adult Brain") +
ylim(0, 1) + xlim(0, 6) +
labs(x = "log10(UMI)", y="promoter ratio")
print(p)
ggsave(p, filename = paste0(outputpath, "results/figures/", task_id, ".png"), width = 6, height = 6)
Minimum_UMI <- 3
Maximum_UMI <- 5
Minimum_promoter_ratio <- 0.15
Maximum_promoter_ratio <- 0.6
barcodes.sel <- barcodes[which(UMI >= Minimum_UMI & UMI <= Maximum_UMI &
promoter_ratio >= Minimum_promoter_ratio & promoter_ratio <= Maximum_promoter_ratio),]
rownames(barcodes.sel) <- barcodes.sel$barcode
x.sp <- x.sp[which(x.sp@barcode %in% barcodes.sel$barcode),]
x.sp@metaData <- barcodes.sel[x.sp@barcode,]
x.sp
## number of barcodes: 4100
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
Next, we add the cell-by-bin matrix of 5kb(default) resolution to the snap object. This function will automatically read the cell-by-bin matrix and add it to bmat slot of snap object.
User-defined parameters:
Bin_size - The size of bin you want to set. Default: 5000.
showBinSizes(paste0(inputpath, Snap_file))
## [1] 1000 5000 10000
Bin_size <- 5000
x.sp <- addBmatToSnap(x.sp, bin.size=Bin_size, num.cores=1)
## Epoch: reading cell-bin count matrix session ...
We convert the cell-by-bin count matrix to a binary matrix.
x.sp <- makeBinary(x.sp, mat="bmat")
First, we filter out any bins overlapping with the ENCODE blacklist to prevent from potential artifacts.
Internal parameters:
path_blacklist - The directory of blacklist files.
User-defined parameters:
Reference_genome - ‘hg19’, ‘hg38’, ‘mm10’ or ‘mm9’ are the 4 currently recognised genomic version in this pipeline. Default: ‘mm10’.
Reference_genome <- 'mm10'
path_blacklist1 <- paste0(path_blacklist, Reference_genome, ".blacklist.bed.gz")
black_list <- read.table(path_blacklist1)
black_list.gr <- GRanges(
black_list[,1],
IRanges(black_list[,2], black_list[,3])
)
idy <- queryHits(findOverlaps(x.sp@feature, black_list.gr))
if(length(idy) > 0){x.sp <- x.sp[,-idy, mat="bmat"]}
x.sp
## number of barcodes: 4100
## number of bins: 546103
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
Second, we remove unwanted chromosomes.
chr.exclude <- seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))]
idy <- grep(paste(chr.exclude, collapse="|"), x.sp@feature)
if(length(idy) > 0){x.sp <- x.sp[,-idy, mat="bmat"]}
x.sp
## number of barcodes: 4100
## number of bins: 545183
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
Third, the bin coverage roughly obeys a log normal distribution. We remove the top 5% bins that overlap with invariant features such as promoters of the house keeping genes.
bin.cov <- log10(Matrix::colSums(x.sp@bmat)+1)
options(repr.plot.width = 6, repr.plot.height = 6)
hist(
bin.cov[bin.cov > 0],
xlab="log10(bin cov)",
main="Bin coverage distribution",
col="lightblue",
xlim=c(0, 5)
)
bin.cutoff <- quantile(bin.cov[bin.cov > 0], 0.95)
idy <- which(bin.cov <= bin.cutoff & bin.cov > 0)
x.sp <- x.sp[, idy, mat="bmat"]
x.sp
## number of barcodes: 4100
## number of bins: 474624
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
We compute diffusion maps for dimentionality reduction.
User-defined parameters:
Number_of_eigenvectors - The number of eigenvectors to be computed. Default: 50.
Number_of_eigenvectors <- 50
x.sp <- runDiffusionMaps(
obj=x.sp,
input.mat="bmat",
num.eigs=Number_of_eigenvectors
)
## Epoch: checking the inputs ...
## Epoch: computing jaccard similarity matrix ...
## Epoch: fitting regression model ...
## Epoch: performing normalization ...
## Epoch: computing eigen decomposition ...
## Epoch: Done
We next determine the number of reduced dimensions to include for downstream analysis. We use an ad hoc method by simply looking at a pairwise plot and select the number of dimensions in which the scatter plot starts looking like a blob.
options(repr.plot.width = 10, repr.plot.height = 10)
plotDimReductPW(
obj=x.sp,
eigs.dims=1:Number_of_eigenvectors,
point.size=0.3,
point.color="grey",
point.shape=19,
point.alpha=0.6,
down.sample=5000,
pdf.file.name=NULL,
pdf.height=7,
pdf.width=7
)
Using the selected significant dimensions, we next construct a K Nearest Neighbor (KNN) Graph (default k=15). Each cell is a node and the k-nearest neighbors of each cell are identified according to the Euclidian distance and edges are draw between neighbors in the graph.
User-defined parameters:
Select_dimensions - The number of dimensions we select. Default: 20.
Number_of_neighbors - The number of neighbors we choose for KNN. Default: 15.
Clustering_algorithm - The algorithm for clustering. scEpiTools provides two choices, “R-igraph” and “leiden”. “R-igraph” uses “cluster_louvain” implemented by igraph package in R. “leiden” uses “leiden” algorithm for finding clusters. Default: ‘R-igraph’.
Select_dimensions <- 20
Number_of_neighbors <- 15
x.sp <- runKNN(
obj=x.sp,
eigs.dims=1:Select_dimensions,
k=Number_of_neighbors
)
Clustering_algorithm <- c("R-igraph")
for (Clustering_algorithm_ in Clustering_algorithm) {
x.sp <- runCluster(
obj=x.sp,
tmp.folder=tempdir(),
louvain.lib=Clustering_algorithm_,
seed.use=10
)
if (Clustering_algorithm_ == 'leiden') {
x.sp@metaData$cluster_leiden <- x.sp@cluster
} else {
x.sp@metaData$cluster_louvain <- x.sp@cluster
}
}
We visualize and explore the data using tSNE or UMAP.
User-defined parameters:
Visualization_method - The visualization method for displaying clustering results, scEpiTools provides two choices, ‘umap’ and ‘tsne’. Default: ‘tsne’.
Visualization_method <- c('tsne')
for (Visualization_method_ in Visualization_method) {
if (Visualization_method_ == 'tsne') {
method <- paste0('R', Visualization_method_)
} else {method <- Visualization_method_}
x.sp <- runViz(
obj=x.sp,
tmp.folder=tempdir(),
dims=2,
eigs.dims=1:Select_dimensions,
method=method,
seed.use=10
)
}
for (Visualization_method_ in Visualization_method) {
if (length(Clustering_algorithm) > 1) {
options(repr.plot.width = 16, repr.plot.height = 24)
par(mfrow = c(3, 2))
} else {
options(repr.plot.width = 16, repr.plot.height = 16)
par(mfrow = c(2, 2))
}
if ('R-igraph' %in% Clustering_algorithm) {
plotViz(
obj=x.sp,
method=Visualization_method_,
main="Louvain clustering",
point.color=x.sp@metaData$cluster_louvain,
point.size=1,
point.shape=19,
point.alpha=0.8,
text.add=TRUE,
text.size=1.5,
text.color="black",
text.halo.add=TRUE,
text.halo.color="white",
text.halo.width=0.2,
down.sample=10000,
legend.add=FALSE
)
}
if ('leiden' %in% Clustering_algorithm) {
plotViz(
obj=x.sp,
method=Visualization_method_,
main="Leiden clustering",
point.color=x.sp@metaData$cluster_leiden,
point.size=1,
point.shape=19,
point.alpha=0.8,
text.add=TRUE,
text.size=1.5,
text.color="black",
text.halo.add=TRUE,
text.halo.color="white",
text.halo.width=0.2,
down.sample=10000,
legend.add=FALSE
)
}
plotFeatureSingle(
obj=x.sp,
feature.value=log(x.sp@metaData[,"passed_filters"]+1,10),
method=Visualization_method_,
main="Read Depth",
point.size=0.2,
point.shape=19,
down.sample=10000,
quantiles=c(0.01, 0.99)
)
plotFeatureSingle(
obj=x.sp,
feature.value=x.sp@metaData$peak_region_fragments / x.sp@metaData$passed_filters,
method=Visualization_method_,
main="FRiP",
point.size=0.2,
point.shape=19,
down.sample=10000,
quantiles=c(0.01, 0.99) # remove outliers
)
plotFeatureSingle(
obj=x.sp,
feature.value=x.sp@metaData$duplicate / x.sp@metaData$total,
method=Visualization_method_,
main="Duplicate",
point.size=0.2,
point.shape=19,
down.sample=10000,
quantiles=c(0.01, 0.99) # remove outliers
)
}
To help annotate identified cell clusters, we next create the cell-by-gene matrix and visualize the enrichment of marker genes.
At present, we only provide the mouse genes from Gencode.
User-defined parameters:
Marker_genes - Marker genes to help interpret our ATAC-seq clusters. It is a string separated by ‘,’. Default: NULL.
if (Reference_genome == 'mm10') {
genes <- read.table(paste0(path_blacklist, "gencode.vM16.gene.bed"));
genes.gr <- GRanges(genes[,1],
IRanges(genes[,2], genes[,3]), name=genes[,4]
)
} else {print("There is no matched Gencode data for this step!")}
Marker_genes <- "Snap25,Gad2,Apoe,C1qb,Pvalb,Vip,Sst,Lamp5,Slc17a7"
if (Marker_genes!="NULL") {
Marker.genes <- strsplit(Marker_genes, ",")[[1]]
genes.sel.gr <- genes.gr[which(genes.gr$name %in% Marker.genes)]
}
Re-add the cell-by-bin matrix to the snap object
if (Marker_genes!="NULL") {
x.sp <- addBmatToSnap(x.sp)
x.sp <- createGmatFromMat(
obj=x.sp,
input.mat="bmat",
genes=genes.sel.gr,
do.par=TRUE,
num.cores=10
)}
## Epoch: reading cell-bin count matrix session ...
Normalize and smooth the cell-by-gene matrix
if (Marker_genes!="NULL") {
x.sp <- scaleCountMatrix(
obj=x.sp,
cov=x.sp@metaData$passed_filters + 1,
mat="gmat",
method = "RPM"
)
x.sp <- runMagic(
obj=x.sp,
input.mat="gmat",
step.size=3
)}
## Epoch: checking the inputs ...
if (Marker_genes!="NULL") {
for (Visualization_method_ in Visualization_method) {
options(repr.plot.width = 12, repr.plot.height = 12)
par(mfrow = c(3, 3))
for(i in 1:length(Marker.genes)){
plotFeatureSingle(
obj=x.sp,
feature.value=x.sp@gmat[, Marker.genes[i]],
method=Visualization_method_,
main=Marker.genes[i],
point.size=0.1,
point.shape=19,
down.sample=10000,
quantiles=c(0, 1)
)}
}
}
Next, cells belonging to the same cluster are pooled to create the aggregate signal for heretical clustering.
for (Clustering_algorithm_ in Clustering_algorithm) {
if (Clustering_algorithm_ == 'leiden') {
ensemble.ls <- lapply(split(seq(length(x.sp@metaData$cluster_leiden)), x.sp@metaData$cluster_leiden), function(x){
SnapATAC::colMeans(x.sp[x,], mat="bmat");
})
hc <- hclust(as.dist(1 - cor(t(do.call(rbind, ensemble.ls)))), method="ward.D2")
for (Visualization_method_ in Visualization_method) {
par(mfrow = c(1, 2))
options(repr.plot.width = 16, repr.plot.height = 8)
plotViz(
obj=x.sp,
method=Visualization_method_,
main="Leiden clustering",
point.color=x.sp@metaData$cluster_leiden,
point.size=1,
point.shape=19,
point.alpha=0.8,
text.add=TRUE,
text.size=1.5,
text.color="black",
text.halo.add=TRUE,
text.halo.color="white",
text.halo.width=0.2,
down.sample=10000,
legend.add=FALSE
)
plot(hc, hang=-1, xlab="")
}
} else {
ensemble.ls <- lapply(split(seq(length(x.sp@metaData$cluster_louvain)), x.sp@metaData$cluster_louvain), function(x){
SnapATAC::colMeans(x.sp[x,], mat="bmat");
})
hc <- hclust(as.dist(1 - cor(t(do.call(rbind, ensemble.ls)))), method="ward.D2")
for (Visualization_method_ in Visualization_method) {
par(mfrow = c(1, 2))
options(repr.plot.width = 16, repr.plot.height = 8)
plotViz(
obj=x.sp,
method=Visualization_method_,
main="Louvain clustering",
point.color=x.sp@metaData$cluster_louvain,
point.size=1,
point.shape=19,
point.alpha=0.8,
text.add=TRUE,
text.size=1.5,
text.color="black",
text.halo.add=TRUE,
text.halo.color="white",
text.halo.width=0.2,
down.sample=10000,
legend.add=FALSE
)
plot(hc, hang=-1, xlab="")
}
}
}
Next we aggregate cells from the each cluster to create an ensemble track for peak calling and visualization. This step will generate a .narrowPeak file that contains the identified peak and .bedGraph file for visualization.
Internal parameters:
snaptools_source - The installation path of snaptools.
macs2_source - The installation path of macs2.
# system("which snaptools")
snaptools_source <- "/home/scepitools/miniconda3/envs/snapatac/bin/snaptools"
# system("which macs2")
macs2_source <- "/home/scepitools/miniconda3/envs/snapatac/bin/macs2"
dir.create(paste0(outputpath, "MACS_intermediate"))
if ('R-igraph' %in% Clustering_algorithm) {
x.sp@cluster <- x.sp@metaData$cluster_louvain
}
runMACS(
obj=x.sp[which(x.sp@cluster==1),],
output.prefix=paste0(task_id, ".1"),
path.to.snaptools=snaptools_source,
path.to.macs=macs2_source,
gsize="mm",
buffer.size=500,
num.cores=5,
macs.options="--nomodel --shift 37 --ext 73 --qval 1e-2 -B --SPMR --call-summits",
tmp.folder=paste0(outputpath, "MACS_intermediate")
)
Next, we provide a short script that performs this step for all clusters.
clusters.sel <- names(table(x.sp@cluster))[which(table(x.sp@cluster) > 200)]
library(parallel)
peaks.ls <- mclapply(seq(clusters.sel), function(i) {
# print(clusters.sel[i]);
runMACS(
obj=x.sp[which(x.sp@cluster==clusters.sel[i]),],
output.prefix=paste0(task_id, ".", gsub(" ", "_", clusters.sel)[i]),
path.to.snaptools=snaptools_source,
path.to.macs=macs2_source,
gsize="hs", # mm, hs, etc
buffer.size=500,
num.cores=1,
macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
tmp.folder=paste0(outputpath, "MACS_intermediate")
)
}, mc.cores=5)
peaks.names <- system("ls | grep narrowPeak", intern=TRUE)
peak.gr.ls <- lapply(peaks.names, function(x) {
peak.df <- read.table(x)
GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
})
peak.gr <- reduce(Reduce(c, peak.gr.ls))
peak.gr
## GRanges object with 253912 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] b'chr1' 3094765-3095642 *
## [2] b'chr1' 3113499-3114060 *
## [3] b'chr1' 3118103-3118401 *
## [4] b'chr1' 3119673-3120870 *
## [5] b'chr1' 3121499-3121786 *
## ... ... ... ...
## [253908] b'chrY' 90800605-90800805 *
## [253909] b'chrY' 90804746-90805456 *
## [253910] b'chrY' 90808580-90808819 *
## [253911] b'chrY' 90808845-90809131 *
## [253912] b'chrY' 90810827-90811057 *
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Using merged peak list as a reference, we next create a cell-by-peak matrix using the original snap file.
peaks.df <- as.data.frame(peak.gr)[,1:3]
write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".",
row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
fileEncoding = "")
# saveRDS(x.sp, file=paste0(task_id, ".snap.rds"))
# !snaptools snap-add-pmat --snap-file atac_v1_adult_brain_fresh_5k.snap --peak-file peaks.combined.bed
Next we add the cell-by-peak matrix to the existing snap object.
# x.sp <- readRDS(paste0(task_id, ".snap.rds"))
x.sp <- addPmatToSnap(x.sp)
## Epoch: reading cell-peak count matrix session ...
x.sp <- makeBinary(x.sp, mat="pmat")
x.sp
## number of barcodes: 4100
## number of bins: 546206
## number of genes: 9
## number of peaks: 242847
## number of motifs: 0
We find differentially accessible regions (DARs) that define clusters via differential analysis.
By default, it identifies positive peaks of a single cluster (specified in cluster.pos), compared to a group of negative control cells.
User-defined parameters:
Cluster_number - The cluster number for comparison in the differential feature analysis. Default: 1.
Cluster_number <- 1
if (grepl('mm',Reference_genome)) {
bcv <- 0.1
} else {bcv <- 0.4}
DARs <- findDAR(
obj=x.sp,
input.mat="pmat",
cluster.pos=Cluster_number,
cluster.neg.method="knn",
test.method="exactTest",
bcv=bcv, #0.4 for human, 0.1 for mouse
seed.use=10
)
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
DARs$FDR <- p.adjust(DARs$PValue, method="BH")
idy <- which(DARs$FDR < 5e-2 & DARs$logFC > 0)
for (Visualization_method_ in Visualization_method) {
par(mfrow = c(1, 2))
plot(DARs$logCPM, DARs$logFC,
pch=19, cex=0.1, col="grey",
ylab="logFC", xlab="logCPM",
main=paste0("Cluster ", as.character(Cluster_number))
)
points(DARs$logCPM[idy],
DARs$logFC[idy],
pch=19,
cex=0.5,
col="red"
)
abline(h = 0, lwd=1, lty=2)
covs <- Matrix::rowSums(x.sp@pmat)
vals <- Matrix::rowSums(x.sp@pmat[,idy]) / covs
vals.zscore <- (vals - mean(vals)) / sd(vals)
plotFeatureSingle(
obj=x.sp,
feature.value=vals.zscore,
method=Visualization_method_,
main=paste0("Cluster ", as.character(Cluster_number)),
point.size=0.1,
point.shape=19,
down.sample=5000,
quantiles=c(0.01, 0.99)
)
}
Next, we identify DARs for each of the clusters. For clusters, especially the small ones, that lack the static power to reveal DARs, we rank the peaks based on the enrichment and use the top 2000 peaks as representative peaks for motif discovery.
idy.ls <- lapply(levels(x.sp@cluster), function(cluster_i) {
DARs <- findDAR(
obj=x.sp,
input.mat="pmat",
cluster.pos=cluster_i,
cluster.neg=NULL,
cluster.neg.method="knn",
bcv=bcv,
test.method="exactTest",
seed.use=10
);
DARs$FDR <- p.adjust(DARs$PValue, method="BH");
idy <- which(DARs$FDR < 5e-2 & DARs$logFC > 0);
if((x=length(idy)) < 2000L){
PValues = DARs$PValue;
PValues[DARs$logFC < 0] = 1;
idy = order(PValues, decreasing=FALSE)[1:2000];
rm(PValues); # free memory
}
idy
})
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
names(idy.ls) <- levels(x.sp@cluster)
for (Visualization_method_ in Visualization_method) {
options(repr.plot.width = 12, repr.plot.height = 12)
par(mfrow = c(3, 3))
for(cluster_i in levels(x.sp@cluster)[1:9]){
# print(cluster_i)
idy = idy.ls[[cluster_i]];
vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
vals.zscore = (vals - mean(vals)) / sd(vals);
plotFeatureSingle(
obj=x.sp,
feature.value=vals.zscore,
method=Visualization_method_,
main=cluster_i,
point.size=0.1,
point.shape=19,
down.sample=5000,
quantiles=c(0.01, 0.99)
);
}
}
We apply GREAT to identify biological pathways active in each of the cell cluster.
In this example, we will first identify the differential elements in cluster 1 and report the top 6 GO Ontologies enrichment inferred using GREAT analysis.
User-defined parameters:
Cluster_number_for_GREAT - The cluster number used in the GREAT analysis. Default: 1.
library(rGREAT)
Cluster_number_for_GREAT <- 1
DARs <- findDAR(
obj=x.sp,
input.mat="pmat",
cluster.pos=Cluster_number_for_GREAT,
cluster.neg.method="knn",
test.method="exactTest",
bcv=bcv, #0.4 for human, 0.1 for mouse
seed.use=10
)
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
DARs$FDR <- p.adjust(DARs$PValue, method="BH")
idy <- which(DARs$FDR < 5e-2 & DARs$logFC > 0)
job <- submitGreatJob(
gr = x.sp@peak[idy],
bg = NULL,
species = Reference_genome,
includeCuratedRegDoms = TRUE,
rule = "basalPlusExt",
adv_upstream = 5.0,
adv_downstream = 1.0,
adv_span = 1000.0,
adv_twoDistance = 1000.0,
adv_oneDistance = 1000.0,
request_interval = 300,
max_tries = 10,
version = "default",
base_url = "http://great.stanford.edu/public/cgi-bin"
)
job
## Submit time: 2023-08-30 14:15:08
## Note the results may only be avaiable on GREAT server for 24 hours.
## Version: 4.0.4
## Species: mm10
## Inputs: 1017 regions
## Mode: Basal plus extension
## Proximal: 5 kb upstream, 1 kb downstream,
## plus Distal: up to 1000 kb
## Include curated regulatory domains
##
## Enrichment tables for following ontologies have been downloaded:
## None
tb <- getEnrichmentTables(job)
names(tb)
## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"
GBP <- tb[["GO Biological Process"]]
head(GBP[order(GBP[,"Binom_Adjp_BH"]),1:5])
## ID name Binom_Genome_Fraction
## 1 GO:0009653 anatomical structure morphogenesis 0.23937720
## 2 GO:0044707 single-multicellular organism process 0.47178490
## 3 GO:0016049 cell growth 0.01857512
## 4 GO:0009888 tissue development 0.18362640
## 5 GO:0035265 organ growth 0.01397975
## 6 GO:0051216 cartilage development 0.02396191
## Binom_Expected Binom_Observed_Region_Hits
## 1 243.44660 330
## 2 479.80530 576
## 3 18.89090 49
## 4 186.74800 261
## 5 14.21741 39
## 6 24.36926 54
Write .rds-formatted file. You can get the results (a html file and a rds file) via your taskID.
saveRDS(x.sp, file=paste0(outputpath, "results/", task_id, "_result.rds"))