The example analysis is implemented by SnapATAC.
The example data consist in ~4000 cells of adult mouse brain available from 10X genomics. All the data used in this analysis can be downloaded from here.

1 Initialization

First import SnapATAC, ggplot2, GenomicRanges, and some other packages we will be using for analyzing PBMCs.

options(warn=-1)
library(SnapATAC)
library(viridisLite)
library(ggplot2)
library(GenomicRanges)
library(leiden)
set.seed(1234)

2 Load the data

The first step of analysis is to load the snap file including cell-by-feature matrix as a SnapATAC object.
The matrix is analogous to the gene expression count matrix used to analyze single-cell RNA-seq. However, instead of genes, each row of the matrix represents a region of the genome (a peak/bin), that is predicted to represent a region of open chromatin. ScEpiTools will provide a task-specific ID after uploading the file.
For the data generation, we can see the tutorial for how to generate a snap file.

Internal parameters:
task_id - The only identifier of each task.

User-defined parameters:
Snap_file - The name of the snap file you want to load. Default: ‘atac_v1_adult_brain_fresh_5k.snap’.
Barcodes_file - The name of the barcodes file you want to load. Default: ‘atac_v1_adult_brain_fresh_5k_singlecell.csv’.

task_id <- 'Example'
inputpath <- paste0("./", task_id, "/")
outputpath <- paste0("./", task_id, "/")
path_blacklist <- paste0("./", task_id, "/SnapATAC_data/")

Snap_file <- "atac_v1_adult_brain_fresh_5k.snap"
x.sp <- createSnap(
    file=paste0(inputpath, Snap_file),
    sample="SnapATAC",
    num.cores=1
)
## Epoch: reading the barcode session ...

Barcodes_file <- 'atac_v1_adult_brain_fresh_5k_singlecell.csv'
barcodes <- read.csv(
  file = paste0(inputpath, Barcodes_file),
  header = TRUE
)

3 Preprocessing

Next we perform some preprocessing including quality control and binarization.

3.1 Barcode selection

We select high-quality barcodes based on two criteria: 1) number of unique fragments; 2) fragments in promoter ratio.

User-defined parameters:
Minimum_UMI - Minimum number of UMIs required for a cell to pass filtering. Default: 3.
Maximum_UMI - Maximum number of UMIs required for a cell to pass filtering. Default: 5.
Minimum_promoter_ratio - Minimum ratio of promoters required for a cell to pass filtering. Default: 0.15.
Maximum_promoter_ratio - Maximum ratio of promoters required for a cell to pass filtering. Default: 0.6.

barcodes <- barcodes[2:nrow(barcodes),]
promoter_ratio <- (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1)
UMI <- log(barcodes$passed_filters+1, 10)
data <- data.frame(UMI=UMI, promoter_ratio=promoter_ratio)
barcodes$promoter_ratio <- promoter_ratio

options(repr.plot.width = 6, repr.plot.height = 6)
p <- ggplot(
    data, 
    aes(x=UMI, y=promoter_ratio)) + 
    geom_point(size=0.1, col="grey") +
    theme_classic() +
    ggtitle("10X Fresh Adult Brain") +
    ylim(0, 1) + xlim(0, 6) +
    labs(x = "log10(UMI)", y="promoter ratio") 
print(p)

plot of chunk unnamed-chunk-6

ggsave(p, filename = paste0(outputpath, "results/figures/", task_id, ".png"), width = 6, height = 6)

Minimum_UMI <- 3
Maximum_UMI <- 5
Minimum_promoter_ratio <- 0.15
Maximum_promoter_ratio <- 0.6
barcodes.sel <- barcodes[which(UMI >= Minimum_UMI & UMI <= Maximum_UMI & 
                               promoter_ratio >= Minimum_promoter_ratio & promoter_ratio <= Maximum_promoter_ratio),]
rownames(barcodes.sel) <- barcodes.sel$barcode
x.sp <- x.sp[which(x.sp@barcode %in% barcodes.sel$barcode),]
x.sp@metaData <- barcodes.sel[x.sp@barcode,]
x.sp

## number of barcodes: 4100
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

3.2 Add cell-by-bin matrix

Next, we add the cell-by-bin matrix of 5kb(default) resolution to the snap object. This function will automatically read the cell-by-bin matrix and add it to bmat slot of snap object.

User-defined parameters:
Bin_size - The size of bin you want to set. Default: 5000.

showBinSizes(paste0(inputpath, Snap_file))

## [1]  1000  5000 10000

Bin_size <- 5000
x.sp <- addBmatToSnap(x.sp, bin.size=Bin_size, num.cores=1)
## Epoch: reading cell-bin count matrix session ...

3.3 Matrix binarization

We convert the cell-by-bin count matrix to a binary matrix.

x.sp <- makeBinary(x.sp, mat="bmat")

3.4 Bin filtering

First, we filter out any bins overlapping with the ENCODE blacklist to prevent from potential artifacts.

Internal parameters:
path_blacklist - The directory of blacklist files.

User-defined parameters:
Reference_genome - ‘hg19’, ‘hg38’, ‘mm10’ or ‘mm9’ are the 4 currently recognised genomic version in this pipeline. Default: ‘mm10’.

Reference_genome <- 'mm10'
path_blacklist1 <- paste0(path_blacklist, Reference_genome, ".blacklist.bed.gz")

black_list <- read.table(path_blacklist1)
black_list.gr <- GRanges(
    black_list[,1], 
    IRanges(black_list[,2], black_list[,3])
)
idy <- queryHits(findOverlaps(x.sp@feature, black_list.gr))
if(length(idy) > 0){x.sp <- x.sp[,-idy, mat="bmat"]}
x.sp

## number of barcodes: 4100
## number of bins: 546103
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

Second, we remove unwanted chromosomes.

chr.exclude <- seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))]
idy <- grep(paste(chr.exclude, collapse="|"), x.sp@feature)
if(length(idy) > 0){x.sp <- x.sp[,-idy, mat="bmat"]}
x.sp

## number of barcodes: 4100
## number of bins: 545183
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

Third, the bin coverage roughly obeys a log normal distribution. We remove the top 5% bins that overlap with invariant features such as promoters of the house keeping genes.

bin.cov <- log10(Matrix::colSums(x.sp@bmat)+1)
options(repr.plot.width = 6, repr.plot.height = 6)
hist(
    bin.cov[bin.cov > 0], 
    xlab="log10(bin cov)", 
    main="Bin coverage distribution", 
    col="lightblue", 
    xlim=c(0, 5)
)

plot of chunk unnamed-chunk-13

bin.cutoff <- quantile(bin.cov[bin.cov > 0], 0.95)
idy <- which(bin.cov <= bin.cutoff & bin.cov > 0)
x.sp <- x.sp[, idy, mat="bmat"]
x.sp

## number of barcodes: 4100
## number of bins: 474624
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

4 Dimensionality reduction

We compute diffusion maps for dimentionality reduction.

User-defined parameters:
Number_of_eigenvectors - The number of eigenvectors to be computed. Default: 50.

Number_of_eigenvectors <- 50
x.sp <- runDiffusionMaps(
    obj=x.sp,
    input.mat="bmat", 
    num.eigs=Number_of_eigenvectors
)
## Epoch: checking the inputs ...
## Epoch: computing jaccard similarity matrix ...
## Epoch: fitting regression model ...
## Epoch: performing normalization ...
## Epoch: computing eigen decomposition ...
## Epoch: Done

5 Determine significant components

We next determine the number of reduced dimensions to include for downstream analysis. We use an ad hoc method by simply looking at a pairwise plot and select the number of dimensions in which the scatter plot starts looking like a blob.

options(repr.plot.width = 10, repr.plot.height = 10)
plotDimReductPW(
    obj=x.sp, 
    eigs.dims=1:Number_of_eigenvectors,
    point.size=0.3,
    point.color="grey",
    point.shape=19,
    point.alpha=0.6,
    down.sample=5000,
    pdf.file.name=NULL, 
    pdf.height=7, 
    pdf.width=7
)

plot of chunk unnamed-chunk-16

6 Graph-based clustering

Using the selected significant dimensions, we next construct a K Nearest Neighbor (KNN) Graph (default k=15). Each cell is a node and the k-nearest neighbors of each cell are identified according to the Euclidian distance and edges are draw between neighbors in the graph.

User-defined parameters:
Select_dimensions - The number of dimensions we select. Default: 20.
Number_of_neighbors - The number of neighbors we choose for KNN. Default: 15.
Clustering_algorithm - The algorithm for clustering. scEpiTools provides two choices, “R-igraph” and “leiden”. “R-igraph” uses “cluster_louvain” implemented by igraph package in R. “leiden” uses “leiden” algorithm for finding clusters. Default: ‘R-igraph’.

Select_dimensions <- 20
Number_of_neighbors <- 15
x.sp <- runKNN(
    obj=x.sp,
    eigs.dims=1:Select_dimensions,
    k=Number_of_neighbors
)

Clustering_algorithm <- c("R-igraph")
for (Clustering_algorithm_ in Clustering_algorithm) {
    x.sp <- runCluster(
        obj=x.sp,
        tmp.folder=tempdir(),
        louvain.lib=Clustering_algorithm_,
        seed.use=10
    )
    if (Clustering_algorithm_ == 'leiden') {
        x.sp@metaData$cluster_leiden <- x.sp@cluster
    } else {
        x.sp@metaData$cluster_louvain <- x.sp@cluster
    }
}

7 Visualization

We visualize and explore the data using tSNE or UMAP.

User-defined parameters:
Visualization_method - The visualization method for displaying clustering results, scEpiTools provides two choices, ‘umap’ and ‘tsne’. Default: ‘tsne’.

Visualization_method <- c('tsne')
for (Visualization_method_ in Visualization_method) {
    if (Visualization_method_ == 'tsne') {
        method <- paste0('R', Visualization_method_)
    } else {method <- Visualization_method_}

    x.sp <- runViz(
        obj=x.sp, 
        tmp.folder=tempdir(),
        dims=2,
        eigs.dims=1:Select_dimensions, 
        method=method,
        seed.use=10
    )
}

for (Visualization_method_ in Visualization_method) {
    if (length(Clustering_algorithm) > 1) {
        options(repr.plot.width = 16, repr.plot.height = 24)
        par(mfrow = c(3, 2))
    } else {
        options(repr.plot.width = 16, repr.plot.height = 16)
        par(mfrow = c(2, 2))
    }
    if ('R-igraph' %in% Clustering_algorithm) {
        plotViz(
            obj=x.sp,
            method=Visualization_method_, 
            main="Louvain clustering",
            point.color=x.sp@metaData$cluster_louvain, 
            point.size=1, 
            point.shape=19, 
            point.alpha=0.8, 
            text.add=TRUE,
            text.size=1.5,
            text.color="black",
            text.halo.add=TRUE,
            text.halo.color="white",
            text.halo.width=0.2,
            down.sample=10000,
            legend.add=FALSE
        )
    }
    if ('leiden' %in% Clustering_algorithm) {
        plotViz(
            obj=x.sp,
            method=Visualization_method_, 
            main="Leiden clustering",
            point.color=x.sp@metaData$cluster_leiden, 
            point.size=1, 
            point.shape=19, 
            point.alpha=0.8, 
            text.add=TRUE,
            text.size=1.5,
            text.color="black",
            text.halo.add=TRUE,
            text.halo.color="white",
            text.halo.width=0.2,
            down.sample=10000,
            legend.add=FALSE
        )
    }
    plotFeatureSingle(
        obj=x.sp,
        feature.value=log(x.sp@metaData[,"passed_filters"]+1,10),
        method=Visualization_method_, 
        main="Read Depth",
        point.size=0.2, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99)
    ) 
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@metaData$peak_region_fragments / x.sp@metaData$passed_filters,
        method=Visualization_method_, 
        main="FRiP",
        point.size=0.2, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99) # remove outliers
    )
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@metaData$duplicate / x.sp@metaData$total,
        method=Visualization_method_, 
        main="Duplicate",
        point.size=0.2, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99) # remove outliers
    )
}

plot of chunk unnamed-chunk-20

8 Gene based annotation

To help annotate identified cell clusters, we next create the cell-by-gene matrix and visualize the enrichment of marker genes.
At present, we only provide the mouse genes from Gencode.

User-defined parameters:
Marker_genes - Marker genes to help interpret our ATAC-seq clusters. It is a string separated by ‘,’. Default: NULL.

if (Reference_genome == 'mm10') {
    genes <- read.table(paste0(path_blacklist, "gencode.vM16.gene.bed"));
    genes.gr <- GRanges(genes[,1], 
        IRanges(genes[,2], genes[,3]), name=genes[,4]
)
} else {print("There is no matched Gencode data for this step!")}

Marker_genes <- "Snap25,Gad2,Apoe,C1qb,Pvalb,Vip,Sst,Lamp5,Slc17a7"
if (Marker_genes!="NULL") {
    Marker.genes <- strsplit(Marker_genes, ",")[[1]]
    genes.sel.gr <- genes.gr[which(genes.gr$name %in% Marker.genes)]
}

Re-add the cell-by-bin matrix to the snap object

if (Marker_genes!="NULL") {
    x.sp <- addBmatToSnap(x.sp)
    x.sp <- createGmatFromMat(
        obj=x.sp, 
        input.mat="bmat",
        genes=genes.sel.gr,
        do.par=TRUE,
        num.cores=10
    )}
## Epoch: reading cell-bin count matrix session ...

Normalize and smooth the cell-by-gene matrix

if (Marker_genes!="NULL") {
    x.sp <- scaleCountMatrix(
        obj=x.sp, 
        cov=x.sp@metaData$passed_filters + 1,
        mat="gmat",
        method = "RPM"
    )
    x.sp <- runMagic(
        obj=x.sp,
        input.mat="gmat",
        step.size=3
    )}
## Epoch: checking the inputs ...

if (Marker_genes!="NULL") {
    for (Visualization_method_ in Visualization_method) {
        options(repr.plot.width = 12, repr.plot.height = 12)
        par(mfrow = c(3, 3))
        for(i in 1:length(Marker.genes)){
            plotFeatureSingle(
                obj=x.sp,
                feature.value=x.sp@gmat[, Marker.genes[i]],
                method=Visualization_method_, 
                main=Marker.genes[i],
                point.size=0.1, 
                point.shape=19, 
                down.sample=10000,
                quantiles=c(0, 1)
        )}
    }
}

plot of chunk unnamed-chunk-25

9 Heretical clustering

Next, cells belonging to the same cluster are pooled to create the aggregate signal for heretical clustering.

for (Clustering_algorithm_ in Clustering_algorithm) {
    if (Clustering_algorithm_ == 'leiden') {
        ensemble.ls <- lapply(split(seq(length(x.sp@metaData$cluster_leiden)), x.sp@metaData$cluster_leiden), function(x){
            SnapATAC::colMeans(x.sp[x,], mat="bmat");
            })
        hc <- hclust(as.dist(1 - cor(t(do.call(rbind, ensemble.ls)))), method="ward.D2")
        for (Visualization_method_ in Visualization_method) {
            par(mfrow = c(1, 2))
            options(repr.plot.width = 16, repr.plot.height = 8)
            plotViz(
                obj=x.sp,
                method=Visualization_method_, 
                main="Leiden clustering",
                point.color=x.sp@metaData$cluster_leiden, 
                point.size=1, 
                point.shape=19, 
                point.alpha=0.8, 
                text.add=TRUE,
                text.size=1.5,
                text.color="black",
                text.halo.add=TRUE,
                text.halo.color="white",
                text.halo.width=0.2,
                down.sample=10000,
                legend.add=FALSE
                )
            plot(hc, hang=-1, xlab="")
        }
    } else {
        ensemble.ls <- lapply(split(seq(length(x.sp@metaData$cluster_louvain)), x.sp@metaData$cluster_louvain), function(x){
            SnapATAC::colMeans(x.sp[x,], mat="bmat");
            })
        hc <- hclust(as.dist(1 - cor(t(do.call(rbind, ensemble.ls)))), method="ward.D2")
        for (Visualization_method_ in Visualization_method) {
            par(mfrow = c(1, 2))
            options(repr.plot.width = 16, repr.plot.height = 8)
            plotViz(
                obj=x.sp,
                method=Visualization_method_, 
                main="Louvain clustering",
                point.color=x.sp@metaData$cluster_louvain, 
                point.size=1, 
                point.shape=19, 
                point.alpha=0.8, 
                text.add=TRUE,
                text.size=1.5,
                text.color="black",
                text.halo.add=TRUE,
                text.halo.color="white",
                text.halo.width=0.2,
                down.sample=10000,
                legend.add=FALSE
                )
            plot(hc, hang=-1, xlab="")
        }
    }
}

plot of chunk unnamed-chunk-26

10 Identify peaks

Next we aggregate cells from the each cluster to create an ensemble track for peak calling and visualization. This step will generate a .narrowPeak file that contains the identified peak and .bedGraph file for visualization.

Internal parameters:
snaptools_source - The installation path of snaptools.
macs2_source - The installation path of macs2.

# system("which snaptools")
snaptools_source <- "/home/scepitools/miniconda3/envs/snapatac/bin/snaptools"
# system("which macs2")
macs2_source <- "/home/scepitools/miniconda3/envs/snapatac/bin/macs2"
dir.create(paste0(outputpath, "MACS_intermediate"))

if ('R-igraph' %in% Clustering_algorithm) {
    x.sp@cluster <- x.sp@metaData$cluster_louvain
}
runMACS(
    obj=x.sp[which(x.sp@cluster==1),], 
    output.prefix=paste0(task_id, ".1"),
    path.to.snaptools=snaptools_source,
    path.to.macs=macs2_source,
    gsize="mm", 
    buffer.size=500, 
    num.cores=5,
    macs.options="--nomodel --shift 37 --ext 73 --qval 1e-2 -B --SPMR --call-summits",
    tmp.folder=paste0(outputpath, "MACS_intermediate")
)

Next, we provide a short script that performs this step for all clusters.

clusters.sel <- names(table(x.sp@cluster))[which(table(x.sp@cluster) > 200)]
library(parallel)
peaks.ls <- mclapply(seq(clusters.sel), function(i) {
#     print(clusters.sel[i]);
    runMACS(
        obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], 
        output.prefix=paste0(task_id, ".", gsub(" ", "_", clusters.sel)[i]),
        path.to.snaptools=snaptools_source,
        path.to.macs=macs2_source,
        gsize="hs", # mm, hs, etc
        buffer.size=500, 
        num.cores=1,
        macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
        tmp.folder=paste0(outputpath, "MACS_intermediate")
)
}, mc.cores=5)

peaks.names <- system("ls | grep narrowPeak", intern=TRUE)
peak.gr.ls <- lapply(peaks.names, function(x) {
    peak.df <- read.table(x)
    GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
})
peak.gr <- reduce(Reduce(c, peak.gr.ls))
peak.gr

## GRanges object with 253912 ranges and 0 metadata columns:
##            seqnames            ranges strand
##               <Rle>         <IRanges>  <Rle>
##        [1]  b'chr1'   3094765-3095642      *
##        [2]  b'chr1'   3113499-3114060      *
##        [3]  b'chr1'   3118103-3118401      *
##        [4]  b'chr1'   3119673-3120870      *
##        [5]  b'chr1'   3121499-3121786      *
##        ...      ...               ...    ...
##   [253908]  b'chrY' 90800605-90800805      *
##   [253909]  b'chrY' 90804746-90805456      *
##   [253910]  b'chrY' 90808580-90808819      *
##   [253911]  b'chrY' 90808845-90809131      *
##   [253912]  b'chrY' 90810827-90811057      *
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths

11 Cell-by-peak matrix

11.1 Create a cell-by-peak matrix

Using merged peak list as a reference, we next create a cell-by-peak matrix using the original snap file.

peaks.df <- as.data.frame(peak.gr)[,1:3]
write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
        quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", 
        row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
        fileEncoding = "")
# saveRDS(x.sp, file=paste0(task_id, ".snap.rds"))
# !snaptools snap-add-pmat --snap-file atac_v1_adult_brain_fresh_5k.snap --peak-file peaks.combined.bed

11.2 Add cell-by-peak matrix

Next we add the cell-by-peak matrix to the existing snap object.

# x.sp <- readRDS(paste0(task_id, ".snap.rds"))
x.sp <- addPmatToSnap(x.sp)
## Epoch: reading cell-peak count matrix session ...

x.sp <- makeBinary(x.sp, mat="pmat")
x.sp

## number of barcodes: 4100
## number of bins: 546206
## number of genes: 9
## number of peaks: 242847
## number of motifs: 0

12 Differential feature analysis

We find differentially accessible regions (DARs) that define clusters via differential analysis.
By default, it identifies positive peaks of a single cluster (specified in cluster.pos), compared to a group of negative control cells.

User-defined parameters:
Cluster_number - The cluster number for comparison in the differential feature analysis. Default: 1.

Cluster_number <- 1
if (grepl('mm',Reference_genome)) {
    bcv <- 0.1
} else {bcv <- 0.4}

DARs <- findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos=Cluster_number,
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=bcv, #0.4 for human, 0.1 for mouse
    seed.use=10
  )
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
DARs$FDR <- p.adjust(DARs$PValue, method="BH")
idy <- which(DARs$FDR < 5e-2 & DARs$logFC > 0)

for (Visualization_method_ in Visualization_method) {
    par(mfrow = c(1, 2))
    plot(DARs$logCPM, DARs$logFC, 
        pch=19, cex=0.1, col="grey", 
        ylab="logFC", xlab="logCPM",
        main=paste0("Cluster ", as.character(Cluster_number))
      )
    points(DARs$logCPM[idy], 
        DARs$logFC[idy], 
        pch=19, 
        cex=0.5, 
        col="red"
    )
    abline(h = 0, lwd=1, lty=2)
    covs <- Matrix::rowSums(x.sp@pmat)
    vals <- Matrix::rowSums(x.sp@pmat[,idy]) / covs
    vals.zscore <- (vals - mean(vals)) / sd(vals)
    plotFeatureSingle(
        obj=x.sp,
        feature.value=vals.zscore,
        method=Visualization_method_, 
        main=paste0("Cluster ", as.character(Cluster_number)),
        point.size=0.1, 
        point.shape=19, 
        down.sample=5000,
        quantiles=c(0.01, 0.99)
    )
}

plot of chunk unnamed-chunk-35

Next, we identify DARs for each of the clusters. For clusters, especially the small ones, that lack the static power to reveal DARs, we rank the peaks based on the enrichment and use the top 2000 peaks as representative peaks for motif discovery.

idy.ls <- lapply(levels(x.sp@cluster), function(cluster_i) {
    DARs <- findDAR(
        obj=x.sp,
        input.mat="pmat",
        cluster.pos=cluster_i,
        cluster.neg=NULL,
        cluster.neg.method="knn",
        bcv=bcv,
        test.method="exactTest",
        seed.use=10
        );
    DARs$FDR <- p.adjust(DARs$PValue, method="BH");
    idy <- which(DARs$FDR < 5e-2 & DARs$logFC > 0);
    if((x=length(idy)) < 2000L){
            PValues = DARs$PValue;
            PValues[DARs$logFC < 0] = 1;
            idy = order(PValues, decreasing=FALSE)[1:2000];
            rm(PValues); # free memory
    }
    idy
  })
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...
names(idy.ls) <- levels(x.sp@cluster)

for (Visualization_method_ in Visualization_method) {
    options(repr.plot.width = 12, repr.plot.height = 12)
    par(mfrow = c(3, 3))
    for(cluster_i in levels(x.sp@cluster)[1:9]){
    #     print(cluster_i)
        idy = idy.ls[[cluster_i]];
        vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
        vals.zscore = (vals - mean(vals)) / sd(vals);
        plotFeatureSingle(
            obj=x.sp,
            feature.value=vals.zscore,
            method=Visualization_method_, 
            main=cluster_i,
            point.size=0.1, 
            point.shape=19, 
            down.sample=5000,
            quantiles=c(0.01, 0.99)
            );
      }
}

plot of chunk unnamed-chunk-37

13 GREAT analysis

We apply GREAT to identify biological pathways active in each of the cell cluster.
In this example, we will first identify the differential elements in cluster 1 and report the top 6 GO Ontologies enrichment inferred using GREAT analysis.

User-defined parameters:
Cluster_number_for_GREAT - The cluster number used in the GREAT analysis. Default: 1.

library(rGREAT)

Cluster_number_for_GREAT <- 1

DARs <- findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos=Cluster_number_for_GREAT,
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=bcv, #0.4 for human, 0.1 for mouse
    seed.use=10
  )
## Epoch: checking inputs ...
## Epoch: identifying DARs for positive cluster ...

DARs$FDR <- p.adjust(DARs$PValue, method="BH")
idy <- which(DARs$FDR < 5e-2 & DARs$logFC > 0)
job <- submitGreatJob(
    gr                    = x.sp@peak[idy],
    bg                    = NULL,
    species               = Reference_genome,
    includeCuratedRegDoms = TRUE,
    rule                  = "basalPlusExt",
    adv_upstream          = 5.0,
    adv_downstream        = 1.0,
    adv_span              = 1000.0,
    adv_twoDistance       = 1000.0,
    adv_oneDistance       = 1000.0,
    request_interval = 300,
    max_tries = 10,
    version = "default",
    base_url = "http://great.stanford.edu/public/cgi-bin"
  )

job

## Submit time: 2023-08-30 14:15:08 
##   Note the results may only be avaiable on GREAT server for 24 hours.
## Version: 4.0.4 
## Species: mm10 
## Inputs: 1017 regions
## Mode: Basal plus extension 
##   Proximal: 5 kb upstream, 1 kb downstream,
##   plus Distal: up to 1000 kb
## Include curated regulatory domains
## 
## Enrichment tables for following ontologies have been downloaded:
##   None

tb <- getEnrichmentTables(job)

names(tb)

## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"

GBP <- tb[["GO Biological Process"]]
head(GBP[order(GBP[,"Binom_Adjp_BH"]),1:5])

##           ID                                  name Binom_Genome_Fraction
## 1 GO:0009653    anatomical structure morphogenesis            0.23937720
## 2 GO:0044707 single-multicellular organism process            0.47178490
## 3 GO:0016049                           cell growth            0.01857512
## 4 GO:0009888                    tissue development            0.18362640
## 5 GO:0035265                          organ growth            0.01397975
## 6 GO:0051216                 cartilage development            0.02396191
##   Binom_Expected Binom_Observed_Region_Hits
## 1      243.44660                        330
## 2      479.80530                        576
## 3       18.89090                         49
## 4      186.74800                        261
## 5       14.21741                         39
## 6       24.36926                         54

14 Save data

Write .rds-formatted file. You can get the results (a html file and a rds file) via your taskID.

saveRDS(x.sp, file=paste0(outputpath, "results/", task_id, "_result.rds"))

ScEpiTools tutorial