OpenAnnotate has been successfully applied to demonstrate cell type-specificity of validated human K562 silencers (Nucleic Acids Research 2020) and to model the dependence of the openness of a regulatory element on its DNA sequence and TF expression for scoring the cell type-specific impacts of noncoding variants in personal genomes (Proceedings of the National Academy of Sciences 2020).
In this notebook, we provide a quick-start illustration of using OpenAnnotate to reveal cell type-specificity of regulatory elements, taking enhancers experimentally validated in human A549 cell line as an example. The main steps are listed as follows:
Download and process enhancers from EnhancerAtlas
Annotate chromatin accessibility of the enhancers
Analyze cell type-specificity of the enhancers
Download the human A549 enhancers from EnhancerAtlas 2.0, and reformat the data according to the input requirement of OpenAnnotate.
!mkdir ./data
!wget http://www.enhanceratlas.org/data/download/enhancer/hs/A549.bed -O ./data/A549.bed
!awk '{printf("%s\t%s\t%s\t.\t.\t.\n",$1,$2,$3)}' ./data/A549.bed > ./data/A549.submit.bed
OpenAnnotate extracts the first three columns and the sixth column (the chromosomes, starting sites, terminating sites and strands, respectively) separated by tabs for calculating openness scores.
!head -5 ./data/A549.submit.bed
Submit the BED file to OpenAnnotate on the 'Annotate' page:
Upload the file, choose 'DNase-seq (ENCODE)' and 'Homo sapiens (GRCh37/hg19)', specify the cell type of 'All biosample types', enable the option of 'Region-based annotation', and submit the task.
Download the annotation results by the links sent to mailbox or the online Download button:
# In this demo, you can directly use the following Download links for convenience.
!mkdir ./output
!wget http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/readopen.txt.gz -O ./output/A549.readopen.txt.gz
!wget http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/head.txt.gz -O ./output/A549.head.txt.gz
Then we uncompress the files and load the results using Numpy.
The first four columns consist of chromosomes, starting sites, terminating sites and strands, respectively. The remaining columns corresponds to different DNase-seq experiments.
!gunzip -f ./output/*.txt.gz
import numpy as np
readopen = np.loadtxt('./output/A549.readopen.txt', dtype=str)
print(readopen.shape)
head = np.loadtxt('./output/A549.head.txt', dtype=str, delimiter='\t')
print(head.shape)
readopen[:5,:10]
Each line in head.txt.gz contains the detailed information of corresponding column in the openness results.
head[:5,:]
target_cell_line = "A549"
biosample_names = head[:, 5]
biosample_types = np.unique(biosample_names)
print('The %d samples include %d unique types.'%(len(biosample_names), len(biosample_types)))
For each biosample type, calculate average openness score of the enhancers.
readopen_score = readopen[:, 4:].astype(float)
readopen_mean = []
for b_type in biosample_types:
ind = np.where(biosample_names == b_type)[0]
readopen_mean.append(np.mean(readopen_score[:, ind], axis=1))
readopen_mean = np.array(readopen_mean)
print(readopen_mean.shape)
To check whether these enhancers have cell type-specific openness scores in the A549 cell line compared with in other 198 biosample types, we performed one-sided (greater) Wilcoxon test for openness scores of the enhancers in A549 versus each of the remaining 198 biosample types, and finally obtained 198 FDR P-values (Benjamini and Hochberg correction) respectively.
from scipy import stats
pvalue_greater = []
target_ind = np.where(biosample_types==target_cell_line)[0][0]
for ind in np.where(biosample_types!=target_cell_line)[0]:
_, p_g = stats.wilcoxon(readopen_mean[target_ind, :],
readopen_mean[ind, :], alternative='greater')
pvalue_greater.append(p_g)
# function for Benjamini and Hochberg correction
def FDR_adjust(pvalue):
pvalue.sort(reverse = True)
length = len(pvalue)
qvalue = []
max_q = pvalue[0]
for i in range(len(pvalue)):
q = pvalue[i] * length / (length-i)
if q > max_q:
q = max_q
else:
max_q = q
qvalue.append(q)
return np.array(qvalue)
adjusted_pvalue = FDR_adjust(pvalue_greater)
thr = 0.01
print('Openness in %s is significantly higher than %d other biosample types.'%
(target_cell_line, sum(adjusted_pvalue < thr)))
The results demonstrate that these A549 enhancers have higher openness scores in A549 cell line than in all the other biosample types (FDR = 0.01), suggesting the high cell type-specificity of these A549 enhancers.