OpenAnnotate reveals cell type-specificity of regulatory elements


OpenAnnotate has been successfully applied to demonstrate cell type-specificity of validated human K562 silencers (Nucleic Acids Research 2020) and to model the dependence of the openness of a regulatory element on its DNA sequence and TF expression for scoring the cell type-specific impacts of noncoding variants in personal genomes (Proceedings of the National Academy of Sciences 2020).

In this notebook, we provide a quick-start illustration of using OpenAnnotate to reveal the cell type-specificity of regulatory elements, taking enhancers experimentally validated in the human A549 cell line as an example. The main steps are as follows:

Download and process enhancers from EnhancerAtlas
Annotate chromatin accessibility of the enhancers
Analyze cell type-specificity of the enhancers

Data collection and processing

Download the human A549 enhancers from EnhancerAtlas 2.0, and save them as ./data/A549.bed.

Reformat the data to match the input requirements of OpenAnnotate with the following Bash command, which keeps the first three columns and fills columns four to six with "." placeholders:

awk '{printf("%s\t%s\t%s\t.\t.\t.\n",$1,$2,$3)}' ./data/A549.bed > ./data/A549.submit.bed
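The same reformatting can be done without leaving R. The following is a minimal sketch, assuming the downloaded BED file is whitespace-separated with the chromosome, start and end in its first three columns. The helper name `to_submit_format` and the toy coordinates are illustrative and not part of OpenAnnotate.

```r
# Hypothetical helper: keep the first three BED columns and pad columns 4-6
# with "." placeholders, mirroring the awk one-liner above.
to_submit_format <- function(infile, outfile) {
  # Read every field as text so coordinates are written back unchanged.
  bed <- read.table(infile, header = FALSE, colClasses = "character",
                    comment.char = "", quote = "")
  submit <- data.frame(bed[, 1:3], name = ".", score = ".", strand = ".")
  write.table(submit, outfile, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(submit)
}

# Toy input with made-up coordinates and an extra fourth column.
infile  <- tempfile(fileext = ".bed")
outfile <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t200\t5.1", "chr2\t300\t450\t7.3"), infile)
to_submit_format(infile, outfile)
print(readLines(outfile))
```

With the real data, the call would be `to_submit_format('./data/A549.bed', './data/A549.submit.bed')`.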

Submit to OpenAnnotate

OpenAnnotate reads the first three columns and the sixth column of the tab-separated BED file (chromosome, start site, end site and strand, respectively) to calculate openness scores.

Submit the BED file to OpenAnnotate on the 'Annotate' page:

A549.submit.bed

Upload the file, choose 'DNase-seq (ENCODE)' and 'Homo sapiens (GRCh37/hg19)', set the cell type to 'All biosample types', enable the 'Region-based annotation' option, and submit the task.

Download the annotation results via the links sent to your mailbox or the online Download button:

the raw read openness file (readopen.txt.gz)
the header file (head.txt.gz).

For convenience, in this demo you can directly use the following download links:

http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/readopen.txt.gz
http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/head.txt.gz

Save these two files as ./output/A549.readopen.txt.gz and ./output/A549.head.txt.gz, respectively.

Then we load the results using data.table.

The first four columns consist of chromosomes, starting sites, terminating sites and strands, respectively. The remaining columns correspond to different DNase-seq experiments.

# Openness scores: one row per enhancer, one column per DNase-seq experiment.
readopen <- data.table::fread('./output/A549.readopen.txt.gz', header = FALSE)
print(dim(readopen))
# Experiment metadata; the object name `head` does not mask the function head().
head <- data.table::fread('./output/A549.head.txt.gz', header = FALSE)
print(dim(head))

[1] 49760   875
[1] 871   8
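The two dimensions are linked: the 875 columns of the openness file are the 4 coordinate columns plus one column per row of the header file (871). A small check along these lines can catch mismatched downloads early. The sketch below uses toy stand-ins for the two tables, and the helper name `check_annotation` is hypothetical.

```r
# Hypothetical helper: verify that every score column has one header row.
check_annotation <- function(openness, header, n_coord = 4) {
  n_experiments <- ncol(openness) - n_coord
  stopifnot(n_experiments == nrow(header))
  n_experiments
}

# Toy stand-ins: 2 regions, 4 coordinate columns and 3 experiments.
toy_open <- data.frame(chr = c("chr1", "chr1"), start = c(100L, 300L),
                       end = c(200L, 450L), strand = ".",
                       e1 = c(0.5, 1.2), e2 = c(0.1, 0.0), e3 = c(2.3, 0.7))
toy_head <- data.frame(file = c("F1", "F2", "F3"),
                       biosample = c("A", "A", "B"))
n_exp <- check_annotation(toy_open, toy_head)
print(n_exp)
```

Given the dimensions printed above, `check_annotation(readopen, head)` would be expected to return 871 for the real data.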

head(readopen)

Each line in head.txt.gz describes the experiment behind the corresponding score column in the openness results.

head(head)

Analysis of cell type-specificity

target_cell_line <- "A549"
biosample_names  <- head$V6
biosample_types  <- sort(unique(biosample_names))
sprintf('The %d samples include %d unique types.', length(biosample_names), length(biosample_types))
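Because several DNase-seq experiments can share one biosample name, it can help to see how many columns fall under each type before averaging. A minimal sketch follows, with a made-up vector standing in for `head$V6`; the counts are invented for illustration.

```r
# Made-up biosample names standing in for head$V6.
toy_names <- c("HAP-1", "HAP-1", "MM.1S", "MM.1S", "RPMI8226", "A549")

# table() counts experiments per biosample type, with names in sorted order.
experiments_per_type <- table(toy_names)
print(experiments_per_type)

# Types represented by a single experiment need no averaging.
single_types <- names(experiments_per_type)[experiments_per_type == 1]
print(single_types)
```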

For each biosample type, calculate the average openness score of each enhancer across the experiments of that type.

# Drop the four coordinate columns and keep only the openness scores.
readopen_score <- as.matrix(readopen[, 5:ncol(readopen)])
readopen_mean  <- matrix(0, nrow = nrow(readopen_score), ncol = length(biosample_types))
for (i in seq_along(biosample_types)) {
  ind <- which(biosample_names == biosample_types[i])
  # drop = FALSE keeps a one-column matrix, so rowMeans also handles single experiments.
  readopen_mean[, i] <- rowMeans(readopen_score[, ind, drop = FALSE])
}
# Transpose so that rows are biosample types and columns are enhancers.
readopen_mean <- t(readopen_mean)
print(dim(readopen_mean))

[1]   199 49760
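The per-type averaging can also be written without an explicit loop. The following sketch, on a toy matrix, uses base R's `rowsum()` to sum experiment columns by biosample type and then divides by the number of experiments per type. It is an alternative formulation rather than the code used in this notebook, and all toy values are invented.

```r
# Toy scores: 3 regions (rows) x 4 experiments (columns), filled column-wise.
toy_score <- matrix(c(1, 2, 3,
                      3, 4, 5,
                      10, 20, 30,
                      0, 0, 6), nrow = 3)
toy_names <- c("B", "B", "A", "C")   # biosample type of each column

# rowsum() groups rows, so transpose to put experiments in rows.
type_sums <- rowsum(t(toy_score), group = toy_names)   # types x regions
# Look up counts by row name so the order matches type_sums exactly.
type_counts <- as.vector(table(toy_names)[rownames(type_sums)])
# A length-3 vector recycles down the rows, so row i is divided by count i.
type_means <- type_sums / type_counts
print(type_means)
```

The result has one row per biosample type and one column per region, the same orientation as `readopen_mean` after transposition.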

To check whether these enhancers are more open in the A549 cell line than in the other 198 biosample types, we performed a one-sided (greater), paired Wilcoxon signed-rank test on the openness scores of the enhancers in A549 versus each of the remaining 198 biosample types, and then adjusted the resulting 198 P-values for multiple testing (Benjamini and Hochberg FDR correction).

pvalue_greater <- c()
target_ind <- which(biosample_types == target_cell_line)
# Paired one-sided test: is each enhancer more open in the target than in biosample type i?
for (i in which(biosample_types != target_cell_line)) {
  pvalue_greater <- c(pvalue_greater, wilcox.test(readopen_mean[target_ind, ], readopen_mean[i, ],
                                                  alternative = "greater", paired = TRUE)$p.value)
}

adjusted_pvalue <- p.adjust(pvalue_greater, method='fdr')
thr <- 0.01
sprintf('Openness in %s is significantly higher than %d other biosample types.',
      target_cell_line, sum(adjusted_pvalue < thr))
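To make the testing step concrete, here is a small self-contained sketch on simulated data (not OpenAnnotate output). It runs the same kind of paired, one-sided Wilcoxon signed-rank test against two invented comparison profiles and applies the Benjamini-Hochberg correction; in `p.adjust`, `'fdr'` is an alias for `'BH'`.

```r
set.seed(1)
n_regions <- 200
target  <- rexp(n_regions, rate = 0.5)           # simulated target openness
lower   <- target * 0.5                          # consistently lower profile
similar <- target + rnorm(n_regions, sd = 0.5)   # no systematic shift

# One paired, one-sided test per comparison profile.
p_raw <- c(
  lower   = wilcox.test(target, lower,
                        alternative = "greater", paired = TRUE)$p.value,
  similar = wilcox.test(target, similar,
                        alternative = "greater", paired = TRUE)$p.value
)

# Benjamini-Hochberg adjustment across the comparisons.
p_bh <- p.adjust(p_raw, method = "BH")
print(p_bh)
```

Because the test is paired, it asks whether the same region tends to score higher in the target profile, rather than comparing the two overall distributions.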

The results demonstrate that these A549 enhancers have significantly higher openness scores in the A549 cell line than in all the other biosample types (at an FDR threshold of 0.01), suggesting high cell type-specificity of these A549 enhancers.

Output of head(readopen) (columns V11 to V865 are elided):

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	⋯	V866	V867	V868	V869	V870	V871	V872	V873	V874	V875
<chr>	<int>	<int>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	⋯	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
chr1	778980	779420	.	0.8878	0.7401	0.1347	0.0000	0.10090	0.05353	⋯	0.0000	0.0000	0.7150	0.0000	0.17780	0.1537	0.5763	0.1532	0.3043	0.6346
chr1	826680	827870	.	0.8764	0.3798	0.1748	0.2627	0.03583	0.05675	⋯	0.1621	0.1902	0.5993	0.2947	0.05888	0.2531	0.1157	0.2528	0.4518	0.5307
chr1	839680	840590	.	7.7290	7.1140	0.8666	1.1970	1.54500	2.82100	⋯	9.2390	8.3130	7.6510	8.3050	9.29500	9.9860	12.3500	8.1290	8.5630	3.8540
chr1	841460	842800	.	1.4300	1.6770	0.1474	0.2301	2.44100	2.83400	⋯	1.5380	1.4760	3.9910	1.6090	1.16900	1.7470	3.3340	2.2750	1.4730	3.5070
chr1	872930	874410	.	1.2710	2.2270	0.2273	0.4006	0.65770	1.16600	⋯	1.5630	2.0880	2.1940	1.3990	2.02800	2.0000	2.0920	1.5320	1.0730	1.6510
chr1	876690	877510	.	10.5400	8.9870	0.5975	0.2189	2.85000	4.46300	⋯	18.0000	12.8500	9.2360	13.4100	15.04000	16.1700	12.6700	16.5200	14.3500	9.4760

Output of head(head):

V1	V2	V3	V4	V5	V6	V7	V8
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
ENCFF283EIF	ENCBS217AEF	ENCSR620QNS	DHS	EFO:0007598	HAP-1	Blood	Circulatory system
ENCFF726CQE	ENCBS912MMS	ENCSR620QNS	DHS	EFO:0007598	HAP-1	Blood	Circulatory system
ENCFF295KVW	ENCBS299YQN	ENCSR458LIB	DHS	EFO:0005724	MM.1S	Blood	Circulatory system
ENCFF450BQU	ENCBS523NFL	ENCSR458LIB	DHS	EFO:0005724	MM.1S	Blood	Circulatory system
ENCFF666DAH	ENCBS457ZNO	ENCSR594NOE	DHS	EFO:0002322	RPMI8226	Blood	Circulatory system
ENCFF982VCZ	ENCBS604KZU	ENCSR594NOE	DHS	EFO:0002322	RPMI8226	Blood	Circulatory system