OpenAnnotate reveals cell type-specificity of regulatory elements

Click here for Python version

OpenAnnotate has been successfully applied to demonstrate cell type-specificity of validated human K562 silencers (Nucleic Acids Research 2020) and to model the dependence of the openness of a regulatory element on its DNA sequence and TF expression for scoring the cell type-specific impacts of noncoding variants in personal genomes (Proceedings of the National Academy of Sciences 2020).

In this notebook, we provide a quick-start illustration of using OpenAnnotate to reveal cell type-specificity of regulatory elements, taking enhancers experimentally validated in human A549 cell line as an example. The main steps are listed as follows:

  • Download and process enhancers from EnhancerAtlas

  • Annotate chromatin accessibility of the enhancers

  • Analyze cell type-specificity of the enhancers

Data collection and processing

Download the human A549 enhancers from EnhancerAtlas 2.0, and save as ./data/A549.bed

Reformat the data according to the input requirement of OpenAnnotate by the following Bash Command:

In [ ]:
# awk '{printf("%s\t%s\t%s\t.\t.\t.\n",$1,$2,$3)}' ./data/A549.bed > ./data/A549.submit.bed

Submit to OpenAnnotate

OpenAnnotate extracts the first three columns and the sixth column (the chromosomes, starting sites, terminating sites and strands, respectively) separated by tabs for calculating openness scores.

Submit the BED file to OpenAnnotate on the 'Annotate' page:

  • A549.submit.bed

Upload the file, choose 'DNase-seq (ENCODE)' and 'Homo sapiens (GRCh37/hg19)', specify the cell type of 'All biosample types', enable the option of 'Region-based annotation', and submit the task.

Download the annotation results by the links sent to mailbox or the online Download button:

  • the raw read openness file (readopen.txt.gz)
  • the header file (head.txt.gz).

In this demo, you can directly use the following Download links for convenience. http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/readopen.txt.gz http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/head.txt.gz

Save these two files as ./output/A549.readopen.txt.gz and ./output/A549.head.txt.gz, respectively.

Then we load the results using data.table.

The first four columns consist of chromosomes, starting sites, terminating sites and strands, respectively. The remaining columns corresponds to different DNase-seq experiments.

In [1]:
readopen <- data.table::fread('./output/A549.readopen.txt.gz', header = F)
print(dim(readopen))
head <- data.table::fread('./output/A549.head.txt.gz', header = F)
print(dim(head))
[1] 49760   875
[1] 871   8
In [2]:
head(readopen)
A data.table: 6 × 875
V1V2V3V4V5V6V7V8V9V10V866V867V868V869V870V871V872V873V874V875
<chr><int><int><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
chr1778980779420. 0.88780.74010.13470.00000.100900.05353 0.0000 0.00000.7150 0.0000 0.17780 0.1537 0.5763 0.1532 0.30430.6346
chr1826680827870. 0.87640.37980.17480.26270.035830.05675 0.1621 0.19020.5993 0.2947 0.05888 0.2531 0.1157 0.2528 0.45180.5307
chr1839680840590. 7.72907.11400.86661.19701.545002.82100 9.2390 8.31307.6510 8.3050 9.29500 9.986012.3500 8.1290 8.56303.8540
chr1841460842800. 1.43001.67700.14740.23012.441002.83400 1.5380 1.47603.9910 1.6090 1.16900 1.7470 3.3340 2.2750 1.47303.5070
chr1872930874410. 1.27102.22700.22730.40060.657701.16600 1.5630 2.08802.1940 1.3990 2.02800 2.0000 2.0920 1.5320 1.07301.6510
chr1876690877510.10.54008.98700.59750.21892.850004.4630018.000012.85009.236013.410015.0400016.170012.670016.520014.35009.4760

Each line in head.txt.gz contains the detailed information of corresponding column in the openness results.

In [3]:
head(head)
A data.table: 6 × 8
V1V2V3V4V5V6V7V8
<chr><chr><chr><chr><chr><chr><chr><chr>
ENCFF283EIFENCBS217AEFENCSR620QNSDHSEFO:0007598HAP-1 BloodCirculatory system
ENCFF726CQEENCBS912MMSENCSR620QNSDHSEFO:0007598HAP-1 BloodCirculatory system
ENCFF295KVWENCBS299YQNENCSR458LIBDHSEFO:0005724MM.1S BloodCirculatory system
ENCFF450BQUENCBS523NFLENCSR458LIBDHSEFO:0005724MM.1S BloodCirculatory system
ENCFF666DAHENCBS457ZNOENCSR594NOEDHSEFO:0002322RPMI8226BloodCirculatory system
ENCFF982VCZENCBS604KZUENCSR594NOEDHSEFO:0002322RPMI8226BloodCirculatory system

Analysis of cell type-specificity

In [4]:
target_cell_line <- "A549"
biosample_names  <- head$V6
biosample_types  <- sort(unique(biosample_names))
sprintf('The %d samples include %d unique types.', length(biosample_names), length(biosample_types))
'The 871 samples include 199 unique types.'

For each biosample type, calculate average openness score of the enhancers.

In [5]:
readopen_score <- as.matrix(readopen[,5:ncol(readopen)])
readopen_mean  <- matrix(0, nrow=nrow(readopen_score), ncol=length(biosample_types))
for (i in 1:length(biosample_types)){
  ind <- which(biosample_names==biosample_types[i])
  if (length(ind)==1)
    readopen_mean[, i] <- readopen_score[, ind]
  else
    readopen_mean[, i] <- rowMeans(readopen_score[, ind])
}
readopen_mean = t(readopen_mean)
print(dim(readopen_mean))
[1]   199 49760

To check whether these enhancers have cell type-specific openness scores in the A549 cell line compared with in other 198 biosample types, we performed one-sided (greater) Wilcoxon test for openness scores of the enhancers in A549 versus each of the remaining 198 biosample types, and finally obtained 198 FDR P-values (Benjamini and Hochberg correction) respectively.

In [6]:
pvalue_greater <- c()
target_ind <- which(biosample_types==target_cell_line)
for (i in which(biosample_types!=target_cell_line)){
  pvalue_greater <- c(pvalue_greater, wilcox.test(readopen_mean[target_ind,], readopen_mean[i,],
                                                  alternative="g", paired=T)$p.value)
}
In [7]:
adjusted_pvalue <- p.adjust(pvalue_greater, method='fdr')
thr <- 0.01
sprintf('Openness in %s is significantly higher than %d other biosample types.',
      target_cell_line, sum(adjusted_pvalue < thr))
'Openness in A549 is significantly higher than 198 other biosample types.'

The results demonstrate that these A549 enhancers have higher openness scores in A549 cell line than in all the other biosample types (FDR = 0.01), suggesting the high cell type-specificity of these A549 enhancers.