Publication Date


Document Type

Doctoral Dissertation

Academic Program

Bioinformatics and Computational Biology


Program in Bioinformatics and Integrative Biology

First Thesis Advisor

Zhiping Weng, PhD


ENCODE, enhancer, regulatory element, genome, epigenome, DNase, ChIP-seq, target gene, schizophrenia, bipolar disorder, major depressive disorder


Over the last decade there has been a great effort to annotate noncoding regions of the genome, particularly those that regulate gene expression. These regulatory elements contain binding sites for transcription factors (TFs), which interact with one another and with the transcriptional machinery to initiate, enhance, or repress gene expression. The Encyclopedia of DNA Elements (ENCODE) consortium has generated thousands of epigenomic datasets, such as DNase-seq and ChIP-seq experiments, with the goal of defining such regions. By integrating these assays, we developed the Registry of candidate Regulatory Elements (cREs), a collection of putative regulatory regions across human and mouse. In total, we identified over 1.3M human and 400k mouse cREs, each annotated with cell-type-specific signatures (e.g., promoter-like, enhancer-like) in over 400 human and 100 mouse biosamples. We then demonstrated the biological utility of these regions by analyzing cell type enrichments for genetic variants reported by genome-wide association studies (GWAS). To search and visualize these cREs, we developed the online database SCREEN (Search Candidate Regulatory Elements by ENCODE).
After defining cREs, we next sought to determine their potential gene targets. To compare target gene prediction methods, we developed a comprehensive benchmark of enhancer-gene links by curating ChIA-PET, Hi-C, and eQTL datasets. We then used this benchmark to evaluate unsupervised linking approaches such as the correlation of epigenomic signal. We determined that these methods have low overall performance and do not outperform simply selecting the closest gene. We then developed a supervised Random Forest model that performed notably better than the unsupervised methods. We demonstrated that this model can be applied across cell types and can be used to predict target genes for GWAS-associated variants.
Finally, we used the Registry of cREs to annotate variants associated with psychiatric disorders. We found that these "psych SNPs" are enriched in cREs active in brain tissue and likely target genes involved in neural development pathways. We also demonstrated that psych SNPs overlap binding sites for TFs involved in neural and immune pathways. By identifying psych SNPs with allele imbalance in chromatin accessibility, we then highlighted specific cases in which psych SNPs alter TF binding motifs, resulting in the disruption of TF binding. Overall, we demonstrated that our collection of putative regulatory regions, the Registry of cREs, can be used to understand the potential biological function of noncoding variation and to develop hypotheses for future testing.



Rights and Permissions

Licensed under a Creative Commons license

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Supplemental Table 1 - Genetic Context.xlsx (71 kB)
Supplemental Table 1 - Genetic Context

Supplemental Table 2 - PROVEAN _ SIFT_.xlsx (77 kB)
Supplemental Table 2 - PROVEAN & SIFT

Supplemental Table 3 - Overlap with cREs.xlsx (159 kB)
Supplemental Table 3 - Overlap with cREs

Supplemental Table S4 - H3K27ac Enrichment (Human).xlsx (216 kB)
Supplemental Table S4 - H3K27ac Enrichment (Human)

Supplemental Table S6 - DNase Enrichment (Human).xlsx (216 kB)
Supplemental Table S6 - DNase Enrichment (Human)

Supplemental Table S7 - H3K27ac Enrichment (Mouse).xlsx (140 kB)
Supplemental Table S7 - H3K27ac Enrichment (Mouse)

Supplemental Table S9- eQTLs.xlsx (211 kB)
Supplemental Table S9 - eQTLs

Supplemental Table S10 - eQTL Gene Expression.xlsx (85 kB)
Supplemental Table S10 - eQTL Gene Expression

Supplemental Table S11 - Closest Gene Expression.xlsx (156 kB)
Supplemental Table S11 - Closest Gene Expression

Supplemental Table S12 -100kb Gene Expression.xlsx (270 kB)
Supplemental Table S12 - 100 kb Gene Expression