OpenAnnotate has been successfully applied to demonstrate cell type-specificity of validated human K562 silencers (Nucleic Acids Research 2020) and to model the dependence of the openness of a regulatory element on its DNA sequence and TF expression for scoring the cell type-specific impacts of noncoding variants in personal genomes (Proceedings of the National Academy of Sciences 2020).
In this notebook, we provide a quick-start illustration of using OpenAnnotate to reveal cell type-specificity of regulatory elements, taking enhancers experimentally validated in human A549 cell line as an example. The main steps are listed as follows:
Download and process enhancers from EnhancerAtlas
Annotate chromatin accessibility of the enhancers
Analyze cell type-specificity of the enhancers
Download the human A549 enhancers from EnhancerAtlas 2.0, and save as ./data/A549.bed
Reformat the data according to the input requirement of OpenAnnotate by the following Bash Command:
# awk '{printf("%s\t%s\t%s\t.\t.\t.\n",$1,$2,$3)}' ./data/A549.bed > ./data/A549.submit.bed
OpenAnnotate extracts the first three columns and the sixth column (the chromosomes, starting sites, terminating sites and strands, respectively) separated by tabs for calculating openness scores.
Submit the BED file to OpenAnnotate on the 'Annotate' page:
Upload the file, choose 'DNase-seq (ENCODE)' and 'Homo sapiens (GRCh37/hg19)', specify the cell type of 'All biosample types', enable the option of 'Region-based annotation', and submit the task.
Download the annotation results by the links sent to mailbox or the online Download button:
In this demo, you can directly use the following Download links for convenience. http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/readopen.txt.gz http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/head.txt.gz
Save these two files as ./output/A549.readopen.txt.gz and ./output/A549.head.txt.gz, respectively.
Then we load the results using data.table.
The first four columns consist of chromosomes, starting sites, terminating sites and strands, respectively. The remaining columns corresponds to different DNase-seq experiments.
readopen <- data.table::fread('./output/A549.readopen.txt.gz', header = F)
print(dim(readopen))
head <- data.table::fread('./output/A549.head.txt.gz', header = F)
print(dim(head))
head(readopen)
Each line in head.txt.gz contains the detailed information of corresponding column in the openness results.
head(head)
target_cell_line <- "A549"
biosample_names <- head$V6
biosample_types <- sort(unique(biosample_names))
sprintf('The %d samples include %d unique types.', length(biosample_names), length(biosample_types))
For each biosample type, calculate average openness score of the enhancers.
readopen_score <- as.matrix(readopen[,5:ncol(readopen)])
readopen_mean <- matrix(0, nrow=nrow(readopen_score), ncol=length(biosample_types))
for (i in 1:length(biosample_types)){
ind <- which(biosample_names==biosample_types[i])
if (length(ind)==1)
readopen_mean[, i] <- readopen_score[, ind]
else
readopen_mean[, i] <- rowMeans(readopen_score[, ind])
}
readopen_mean = t(readopen_mean)
print(dim(readopen_mean))
To check whether these enhancers have cell type-specific openness scores in the A549 cell line compared with in other 198 biosample types, we performed one-sided (greater) Wilcoxon test for openness scores of the enhancers in A549 versus each of the remaining 198 biosample types, and finally obtained 198 FDR P-values (Benjamini and Hochberg correction) respectively.
pvalue_greater <- c()
target_ind <- which(biosample_types==target_cell_line)
for (i in which(biosample_types!=target_cell_line)){
pvalue_greater <- c(pvalue_greater, wilcox.test(readopen_mean[target_ind,], readopen_mean[i,],
alternative="g", paired=T)$p.value)
}
adjusted_pvalue <- p.adjust(pvalue_greater, method='fdr')
thr <- 0.01
sprintf('Openness in %s is significantly higher than %d other biosample types.',
target_cell_line, sum(adjusted_pvalue < thr))
The results demonstrate that these A549 enhancers have higher openness scores in A549 cell line than in all the other biosample types (FDR = 0.01), suggesting the high cell type-specificity of these A549 enhancers.