OpenAnnotate reveals cell type-specificity of regulatory elements¶

Click here for R version

OpenAnnotate has been successfully applied to demonstrate cell type-specificity of validated human K562 silencers (Nucleic Acids Research 2020) and to model the dependence of the openness of a regulatory element on its DNA sequence and TF expression for scoring the cell type-specific impacts of noncoding variants in personal genomes (Proceedings of the National Academy of Sciences 2020).

In this notebook, we provide a quick-start illustration of using OpenAnnotate to reveal cell type-specificity of regulatory elements, taking enhancers experimentally validated in human A549 cell line as an example. The main steps are listed as follows:

Download and process enhancers from EnhancerAtlas
Annotate chromatin accessibility of the enhancers
Analyze cell type-specificity of the enhancers

Data collection and processing¶

Download the human A549 enhancers from EnhancerAtlas 2.0, and reformat the data according to the input requirement of OpenAnnotate.

!mkdir ./data
!wget http://www.enhanceratlas.org/data/download/enhancer/hs/A549.bed -O ./data/A549.bed
!awk '{printf("%s\t%s\t%s\t.\t.\t.\n",$1,$2,$3)}' ./data/A549.bed > ./data/A549.submit.bed

OpenAnnotate extracts the first three columns and the sixth column (the chromosomes, starting sites, terminating sites and strands, respectively) separated by tabs for calculating openness scores.

!head -5 ./data/A549.submit.bed

chr1	778980	779420	.	.	.
chr1	826680	827870	.	.	.
chr1	839680	840590	.	.	.
chr1	841460	842800	.	.	.
chr1	872930	874410	.	.	.

Submit to OpenAnnotate¶

Submit the BED file to OpenAnnotate on the 'Annotate' page:

A549.submit.bed

Upload the file, choose 'DNase-seq (ENCODE)' and 'Homo sapiens (GRCh37/hg19)', specify the cell type of 'All biosample types', enable the option of 'Region-based annotation', and submit the task.

Download the annotation results by the links sent to mailbox or the online Download button:

the raw read openness file (readopen.txt.gz)
the header file (head.txt.gz).

# In this demo, you can directly use the following Download links for convenience.
!mkdir ./output
!wget http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/readopen.txt.gz -O ./output/A549.readopen.txt.gz
!wget http://health.tsinghua.edu.cn/openness/anno/task/2021/0301/01413832/anno/head.txt.gz -O ./output/A549.head.txt.gz

Then we uncompress the files and load the results using Numpy.

The first four columns consist of chromosomes, starting sites, terminating sites and strands, respectively. The remaining columns corresponds to different DNase-seq experiments.

!gunzip -f ./output/*.txt.gz

import numpy as np
readopen = np.loadtxt('./output/A549.readopen.txt', dtype=str)
print(readopen.shape)
head = np.loadtxt('./output/A549.head.txt', dtype=str, delimiter='\t')
print(head.shape)

(49760, 875)
(871, 8)

readopen[:5,:10]

array([['chr1', '778980', '779420', '.', '0.8878', '0.7401', '0.1347',
        '0', '0.1009', '0.05353'],
       ['chr1', '826680', '827870', '.', '0.8764', '0.3798', '0.1748',
        '0.2627', '0.03583', '0.05675'],
       ['chr1', '839680', '840590', '.', '7.729', '7.114', '0.8666',
        '1.197', '1.545', '2.821'],
       ['chr1', '841460', '842800', '.', '1.43', '1.677', '0.1474',
        '0.2301', '2.441', '2.834'],
       ['chr1', '872930', '874410', '.', '1.271', '2.227', '0.2273',
        '0.4006', '0.6577', '1.166']], dtype='<U9')

Each line in head.txt.gz contains the detailed information of corresponding column in the openness results.

head[:5,:]

array([['ENCFF283EIF', 'ENCBS217AEF', 'ENCSR620QNS', 'DHS',
        'EFO:0007598', 'HAP-1', 'Blood', 'Circulatory system'],
       ['ENCFF726CQE', 'ENCBS912MMS', 'ENCSR620QNS', 'DHS',
        'EFO:0007598', 'HAP-1', 'Blood', 'Circulatory system'],
       ['ENCFF295KVW', 'ENCBS299YQN', 'ENCSR458LIB', 'DHS',
        'EFO:0005724', 'MM.1S', 'Blood', 'Circulatory system'],
       ['ENCFF450BQU', 'ENCBS523NFL', 'ENCSR458LIB', 'DHS',
        'EFO:0005724', 'MM.1S', 'Blood', 'Circulatory system'],
       ['ENCFF666DAH', 'ENCBS457ZNO', 'ENCSR594NOE', 'DHS',
        'EFO:0002322', 'RPMI8226', 'Blood', 'Circulatory system']],
      dtype='<U54')

Analysis of cell type-specificity¶

target_cell_line = "A549"
biosample_names  = head[:, 5]
biosample_types  = np.unique(biosample_names)
print('The %d samples include %d unique types.'%(len(biosample_names), len(biosample_types)))

The 871 samples include 199 unique types.

For each biosample type, calculate average openness score of the enhancers.

readopen_score = readopen[:, 4:].astype(float)
readopen_mean  = []
for b_type in biosample_types:
    ind = np.where(biosample_names == b_type)[0]
    readopen_mean.append(np.mean(readopen_score[:, ind], axis=1))
readopen_mean = np.array(readopen_mean)
print(readopen_mean.shape)

(199, 49760)

To check whether these enhancers have cell type-specific openness scores in the A549 cell line compared with in other 198 biosample types, we performed one-sided (greater) Wilcoxon test for openness scores of the enhancers in A549 versus each of the remaining 198 biosample types, and finally obtained 198 FDR P-values (Benjamini and Hochberg correction) respectively.

from scipy import stats
pvalue_greater = []
target_ind = np.where(biosample_types==target_cell_line)[0][0]
for ind in np.where(biosample_types!=target_cell_line)[0]:
    _, p_g = stats.wilcoxon(readopen_mean[target_ind, :], 
                            readopen_mean[ind, :], alternative='greater')
    pvalue_greater.append(p_g)

# function for Benjamini and Hochberg correction
def FDR_adjust(pvalue):
    pvalue.sort(reverse = True)
    length = len(pvalue)
    qvalue = []
    max_q = pvalue[0]
    for i in range(len(pvalue)):
        q = pvalue[i] * length / (length-i)
        if q > max_q:
            q = max_q
        else:
            max_q = q
        qvalue.append(q)
    return np.array(qvalue)

adjusted_pvalue = FDR_adjust(pvalue_greater)
thr = 0.01
print('Openness in %s is significantly higher than %d other biosample types.'%
      (target_cell_line, sum(adjusted_pvalue < thr)))

Openness in A549 is significantly higher than 198 other biosample types.

The results demonstrate that these A549 enhancers have higher openness scores in A549 cell line than in all the other biosample types (FDR = 0.01), suggesting the high cell type-specificity of these A549 enhancers.