EpiGePT tutorial

This step-by-step tutorial shows how to use the pre-trained EpiGePT model to predict epigenomic signals for any genomic region and cell type. The training data for EpiGePT has been expanded to cover 104 cell types, and the model has been updated to the hg38 reference genome. All the data mentioned in this tutorial can be downloaded from the Download page.

1 Initialization

Requirements

2 Load pretrained model

Load the parameters of the pre-trained model and the reference genome. The pre-trained model can be downloaded from here, the reference genome can be downloaded from here, and the code for this tutorial can be downloaded from here.

3 Predict

3.1 Prepare input: motif score for transcription factor binding and gene expression

Users need to prepare a matrix with dimensions (1000, 711), representing the binding states of 711 transcription factors on 1000 genomic bins. This matrix can be produced by scanning with the HOMER tool. Additionally, a 711-dimensional vector is required, representing the quantile-normalized TPM values of the same 711 transcription factors. Users can refer to this link for specific instructions on how to perform these operations.
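The two inputs described above can be checked for shape before they are passed to the model. The following is a minimal sketch, not part of the EpiGePT code base: the helper name `check_tf_inputs` and the random placeholder data are assumptions made for illustration only.

```python
import numpy as np

N_BINS = 1000  # genomic bins per region, as stated in the tutorial
N_TFS = 711    # transcription factors, as stated in the tutorial


def check_tf_inputs(motif_score, tf_expr):
    """Hypothetical helper: validate the shapes of the two TF inputs."""
    motif_score = np.asarray(motif_score, dtype=np.float32)
    tf_expr = np.asarray(tf_expr, dtype=np.float32)
    if motif_score.shape != (N_BINS, N_TFS):
        raise ValueError(
            f"motif score must have shape {(N_BINS, N_TFS)}, got {motif_score.shape}"
        )
    if tf_expr.shape != (N_TFS,):
        raise ValueError(
            f"TF expression must have shape {(N_TFS,)}, got {tf_expr.shape}"
        )
    return motif_score, tf_expr


# Placeholder data standing in for real HOMER scores and normalized TPM values.
rng = np.random.default_rng(0)
motif_score, tf_expr = check_tf_inputs(
    rng.random((N_BINS, N_TFS)), rng.random(N_TFS)
)
```

Failing early on a shape mismatch is usually easier to debug than an error raised inside the model.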

Prepare TF expression

Users can obtain the TF expression input by quantile-normalizing the TPM values of the 711 transcription factors across multiple cell types, following the outlined procedure. Alternatively, the normalized expression that we have already processed for 105 cell types, together with the information about the corresponding cell types, is available from the Download page.

The information for the 711 transcription factors used can be obtained as follows, and the motif information used can be downloaded from this link on the Download page.

The reference TF expression data for quantile normalization can be downloaded from here, and example TF TPM values can be downloaded from here.
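To illustrate what quantile normalization does to the TPM values, here is a minimal NumPy sketch of the textbook procedure. It is an assumption that this matches the authors' exact preprocessing (for example, any log transform applied beforehand is not shown), and the function names `quantile_normalize` and `normalize_to_reference` are hypothetical.

```python
import numpy as np


def quantile_normalize(tpm):
    """Quantile-normalize the columns of a (n_tfs, n_cell_types) matrix.

    Every column ends up with the same distribution: the mean of the
    sorted columns. Ties are broken by original order (stable sort).
    """
    tpm = np.asarray(tpm, dtype=float)
    order = np.argsort(tpm, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")   # rank of each entry in its column
    mean_sorted = np.sort(tpm, axis=0).mean(axis=1)    # reference distribution
    return mean_sorted[ranks], mean_sorted


def normalize_to_reference(sample, mean_sorted):
    """Map a new 1-D sample onto an existing reference distribution by rank."""
    sample = np.asarray(sample, dtype=float)
    ranks = np.argsort(np.argsort(sample, kind="stable"), kind="stable")
    return np.asarray(mean_sorted)[ranks]


# Toy matrix: 4 TFs (rows) x 3 cell types (columns), made-up values.
example = np.array([[5.0, 4.0, 3.0],
                    [2.0, 1.0, 4.0],
                    [3.0, 4.0, 6.0],
                    [4.0, 2.0, 8.0]])
normalized, reference = quantile_normalize(example)
```

For a real run, the matrix would have 711 rows, and a new cell type could be mapped onto the downloaded reference distribution with `normalize_to_reference`.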

Prepare TF motif score

Users can obtain motif scores using either of the following two methods.

(1) Users can obtain the TF motif scores for model input with the HOMER tool by following the process below. Note that each 128 kbp region should be saved in the 'bins.bed' file as 1000 genomic bins of 128 bp each.

(2) Users can calculate motif scores for specific regions from the motif scores we have precomputed, which are scanned and stored by chromosome.
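For method (1), the 'bins.bed' file can be generated with a few lines of Python. This is a sketch under stated assumptions: the helper names `make_bins` and `write_bins`, the example coordinates, and the choice to write six BED columns (with a bin ID and a "+" strand) are illustrative and not taken from the EpiGePT pipeline.

```python
N_BINS = 1000   # bins per region
BIN_SIZE = 128  # bp per bin, so one region spans 128,000 bp


def make_bins(chrom, start, n_bins=N_BINS, bin_size=BIN_SIZE):
    """Return BED lines (0-based, half-open) that tile one region."""
    lines = []
    for i in range(n_bins):
        bin_start = start + i * bin_size
        bin_end = bin_start + bin_size
        # Columns: chrom, start, end, ID, unused, strand (ID and strand are assumed).
        lines.append(f"{chrom}\t{bin_start}\t{bin_end}\tbin_{i}\t.\t+")
    return lines


def write_bins(lines, path="bins.bed"):
    """Write the BED lines to disk, one bin per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


# Example region chosen arbitrarily for illustration.
bins = make_bins("chr1", 1_000_000)
```

Calling `write_bins(bins)` then produces a file with 1000 rows that can be passed to the motif scanning step.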

3.2 Prepare input: genomic region with 128kbp length
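
A 128 kbp region can be described by its start and end coordinates and, for sequence-based models, is commonly converted to a one-hot matrix. The sketch below assumes that the model consumes a one-hot encoding with channel order A, C, G, T; the helper names `region_around` and `one_hot` are hypothetical, and reading the sequence from the hg38 FASTA file is left out.

```python
import numpy as np

REGION_LEN = 128_000  # 1000 bins x 128 bp


def region_around(center, region_len=REGION_LEN):
    """Return (start, end) of a region of fixed length centered on a position."""
    start = max(0, center - region_len // 2)
    return start, start + region_len


def one_hot(seq):
    """One-hot encode a DNA string as a (len(seq), 4) float32 array.

    Channel order A, C, G, T is an assumption; any other character
    (for example N) becomes an all-zero row.
    """
    lookup = np.full(256, -1, dtype=np.int64)
    for i, base in enumerate("ACGT"):
        lookup[ord(base)] = i
        lookup[ord(base.lower())] = i
    codes = lookup[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    valid = codes >= 0
    encoded[np.nonzero(valid)[0], codes[valid]] = 1.0
    return encoded


start, end = region_around(1_000_000)
demo = one_hot("ACGTN")
```

In practice the string passed to `one_hot` would be the 128,000 bases between `start` and `end` on the chosen chromosome.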

3.3 Model prediction

4 Save