ABOUT THIS COLLECTION

The Department of Genomics and Computational Biology (GCB) at UMass Chan Medical School was originally established as the Program in Bioinformatics and Integrative Biology in 2008. The group evolved into a full-fledged department in 2023, reflecting its growth and the expanding scope of its research. The department embodies the convergence of Computational Biology, Evolutionary Biology, and Genomics and is committed to advancing understanding of biological complexity through cutting-edge computational methods, evolutionary theory, and genomic technologies. This collection showcases journal articles and other publications produced by faculty and researchers of the Department of Genomics and Computational Biology.

Recently Published

  • Investigating the etiologies of non-malarial febrile illness in Senegal using metagenomic sequencing

    Levine, Zoë C; Sene, Aita; Mkandawire, Winnie; Deme, Awa B; Ndiaye, Tolla; Sy, Mouhamad; Gaye, Amy; Diedhiou, Younouss; Mbaye, Amadou M; Ndiaye, Ibrahima M; et al. (2024-01-25)
    The worldwide decline in malaria incidence is revealing the extensive burden of non-malarial febrile illness (NMFI), which remains poorly understood and difficult to diagnose. To characterize NMFI in Senegal, we collected venous blood and clinical metadata in a cross-sectional study of febrile patients and healthy controls in a low malaria burden area. Using 16S and untargeted sequencing, we detected viral, bacterial, or eukaryotic pathogens in 23% (38/163) of NMFI cases. Bacteria were the most common, with relapsing fever Borrelia and spotted fever Rickettsia found in 15.5% and 3.8% of cases, respectively. Four viral pathogens were found in a total of 7 febrile cases (3.5%). Sequencing also detected undiagnosed Plasmodium, including one putative P. ovale infection. We developed a logistic regression model that can distinguish Borrelia from NMFIs with similar presentation based on symptoms and vital signs (F1 score: 0.823). These results highlight the challenge and importance of improved diagnostics, especially for Borrelia, to support diagnosis and surveillance.
  • Dog size and patterns of disease history across the canine age spectrum: Results from the Dog Aging Project

    Nam, Yunbi; White, Michelle; Karlsson, Elinor K; Creevy, Kate E; Promislow, Daniel E L; McClelland, Robyn L (2024-01-17)
    Age in dogs is associated with the risk of many diseases, and canine size is a major factor in that risk. However, the size patterns are complex. While small size dogs tend to live longer, some diseases are more prevalent among small dogs. In this study we seek to quantify how the pattern of disease history varies across the spectrum of dog size, dog age, and their interaction. Utilizing owner-reported data on disease history from a substantial number of companion dogs enrolled in the Dog Aging Project, we investigate how body size, as measured by weight, associates with the lifetime prevalence of a reported condition and its pattern across age for various disease categories. We found significant positive associations between dog size and the lifetime prevalence of skin, bone/orthopedic, gastrointestinal, ear/nose/throat, cancer/tumor, brain/neurologic, endocrine, and infectious diseases. Similarly, dog size was negatively associated with lifetime prevalence of ocular, cardiac, liver/pancreas, and respiratory disease categories. Kidney/urinary disease prevalence did not vary by size. We also found that the association between age and lifetime disease prevalence varied by dog size for many conditions including ocular, cardiac, orthopedic, ear/nose/throat, and cancer. Controlling for sex, purebred vs. mixed-breed status, and geographic region made little difference in all disease categories we studied. Our results align with the reduced lifespan in larger dogs for most of the disease categories and suggest potential avenues for further examination.
  • A Burden of Rare Copy Number Variants in Obsessive-Compulsive Disorder [preprint]

    Halvorsen, Matthew; de Schipper, Elles; Boberg, Julia; Strom, Nora; Hagen, Kristen; Lindblad-Toh, Kerstin; Karlsson, Elinor K; Pedersen, Nancy; Bulik, Cynthia; Fundín, Bengt; et al. (2024-01-03)
    Current genetic research on obsessive-compulsive disorder (OCD) supports contributions to risk specifically from common single nucleotide variants (SNVs), along with rare coding SNVs and small insertion-deletions (indels). The contribution to OCD risk from large, rare copy number variants (CNVs), however, has not been formally assessed at a similar scale. Here we describe an analysis of rare CNVs called from genotype array data in 2,248 deeply phenotyped OCD cases and 3,608 unaffected controls from Sweden and Norway. We found that in general cases carry an elevated burden of large (>30kb, at least 15 probes) CNVs (OR=1.12, P=1.77×10⁻³). The excess rate of these CNVs in cases versus controls was around 0.07 (95% CI 0.02-0.11, P=2.58×10⁻³). This signal was largely driven by CNVs overlapping protein-coding regions (OR=1.19, P=3.08×10⁻⁴), particularly deletions impacting loss-of-function intolerant genes (pLI>0.995, OR=4.12, P=2.54×10⁻⁵). We did not identify any specific locus where CNV burden was associated with OCD case status at genome-wide significance, but we noted non-random recurrence of CNV deletions in cases (permutation P=2.60×10⁻³). In cases where sufficient clinical data were available (n=1612) we found that carriers of neurodevelopmental duplications were more likely to have comorbid autism (P<0.001), and that carriers of deletions overlapping neurodevelopmental genes had lower treatment response (P=0.02). The results demonstrate a contribution of large, rare CNVs to OCD risk, and suggest that studies of rare coding variation in OCD would have increased power to identify risk genes if this class of variation were incorporated into formal tests.
  • Mutational spectrum and phenotypic variability of Duchenne muscular dystrophy and related disorders in a Bangladeshi population

    Sarker, Shaoli; Eshaque, Tamannyat Binte; Soorajkumar, Anjana; Nassir, Nasna; Zehra, Binte; Kanta, Shayla Imam; Rahaman, Md Atikur; Islam, Amirul; Akter, Shimu; Ali, Mohammad Kawsar; et al. (2023-12-06)
    Duchenne muscular dystrophy (DMD) is a severe rare neuromuscular disorder caused by mutations in the X-linked dystrophin gene. Several mutations have been identified, yet the full mutational spectrum, and their phenotypic consequences, will require genotyping across different populations. To this end, we undertook the first detailed genotype and phenotype characterization of DMD in the Bangladeshi population. We investigated the rare mutational and phenotypic spectrum of the DMD gene in 36 DMD-suspected Bangladeshi participants using an economically affordable diagnostic strategy involving initial screening for exonic deletions in the DMD gene via multiplex PCR, followed by testing PCR-negative patients for mutations using whole exome sequencing. The deletion mapping identified two critical DMD gene hotspot regions (near proximal and distal ends, spanning exons 8-17 and exons 45-53, respectively) that comprised 95% (21/22) of the deletions for this population cohort. From our exome analysis, we detected two novel pathogenic hemizygous mutations in exons 21 and 42 of the DMD gene, and novel pathogenic recessive and loss of function variants in four additional genes: SGCD, DYSF, COL6A3, and DOK7. Our phenotypic analysis showed that DMD suspected participants presented diverse phenotypes according to the location of the mutation and which gene was impacted. Our study provides ethnicity specific new insights into both clinical and genetic aspects of DMD.
  • An encyclopedia of enhancer-gene regulatory interactions in the human genome [preprint]

    Gschwind, Andreas R; Mualim, Kristy S; Karbalayghareh, Alireza; Sheth, Maya U; Dey, Kushal K; Jagoda, Evelyn; Nurtdinov, Ramil N; Xi, Wang; Tan, Anthony S; Jones, Hank; et al. (2023-11-13)
    Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.
  • Single-cell transcriptomic and genomic changes in the aging human brain [preprint]

    Jeffries, Ailsa M; Yu, Tianxiong; Ziegenfuss, Jennifer S; Tolles, Allie K; Kim, Yerin; Weng, Zhiping; Lodato, Michael A (2023-11-07)
    Aging brings dysregulation of various processes across organs and tissues, often stemming from stochastic damage to individual cells over time. Here, we used a combination of single-nucleus RNA-sequencing and single-cell whole-genome sequencing to identify transcriptomic and genomic changes in the prefrontal cortex of the human brain across life span, from infancy to centenarian. We identified infant-specific cell clusters enriched for the expression of neurodevelopmental genes, and a common down-regulation of cell-essential homeostatic genes that function in ribosomes, transport, and metabolism during aging across cell types. Conversely, expression of neuron-specific genes generally remains stable throughout life. We observed a decrease in specific DNA repair genes in aging, including genes implicated in generating brain somatic mutations as indicated by mutation signature analysis. Furthermore, we detected gene-length-specific somatic mutation rates that shape the transcriptomic landscape of the aged human brain. These findings elucidate critical aspects of human brain aging, shedding light on transcriptomic and genomics dynamics.
  • Beyond genome-wide association studies: Investigating the role of noncoding regulatory elements in primary sclerosing cholangitis

    Pratt, Henry E; Wu, Tong; Elhajjajy, Shaimae I; Zhou, Jeffrey Y.; Fitzgerald, Kate; Fazzio, Tom; Weng, Zhiping; Pratt, Daniel S (2023-09-27)
    Background: Genome-wide association studies (GWAS) have identified 30 risk loci for primary sclerosing cholangitis (PSC). Variants within these loci are found predominantly in noncoding regions of DNA making their mechanisms of conferring risk hard to define. Epigenomic studies have shown noncoding variants broadly impact regulatory element activity. The possible association of noncoding PSC variants with regulatory element activity has not been studied. We aimed to (1) determine if the noncoding risk variants in PSC impact regulatory element function and (2) if so, assess the role these regulatory elements have in explaining the genetic risk for PSC. Methods: Available epigenomic datasets were integrated to build a comprehensive atlas of cell type-specific regulatory elements, emphasizing PSC-relevant cell types. RNA-seq and ATAC-seq were performed on peripheral CD4+ T cells from 10 PSC patients and 11 healthy controls. Computational techniques were used to (1) study the enrichment of PSC-risk variants within regulatory elements, (2) correlate risk genotype with differences in regulatory element activity, and (3) identify regulatory elements differentially active and genes differentially expressed between PSC patients and controls. Results: Noncoding PSC-risk variants are strongly enriched within immune-specific enhancers, particularly ones involved in T-cell response to antigenic stimulation. In total, 250 genes and >10,000 regulatory elements were identified that are differentially active between patients and controls. Conclusions: Mechanistic effects are proposed for variants at 6 PSC-risk loci where genotype was linked with differential T-cell regulatory element activity. Regulatory elements are shown to play a key role in PSC pathophysiology.
  • Reliable multiplex generation of pooled induced pluripotent stem cells

    Smullen, Molly; Olson, Meagan N; Reichert, Julia M; Dawes, Pepper; Murray, Liam F; Baer, Christina E; Wang, Qi; Readhead, Benjamin; Church, George M; Lim, Elaine T; et al. (2023-08-31)
    Reprogramming somatic cells into pluripotent stem cells (iPSCs) enables the study of systems in vitro. To increase the throughput of reprogramming, we present induction of pluripotency from pooled cells (iPPC), an efficient, scalable, and reliable reprogramming procedure. Using our deconvolution algorithm that employs pooled sequencing of single-nucleotide polymorphisms (SNPs), we accurately estimated individual donor proportions of the pooled iPSCs. With iPPC, we concurrently reprogrammed over one hundred donor lymphoblastoid cell lines (LCLs) into iPSCs and found strong correlations of individual donors' reprogramming ability across multiple experiments. Individual donors' reprogramming ability remains consistent across both same-day replicates and multiple experimental runs, and the expression of certain immunoglobulin precursor genes may impact reprogramming ability. The pooled iPSCs were also able to differentiate into cerebral organoids. Our procedure enables a multiplex framework of using pooled libraries of donor iPSCs for downstream research and investigation of in vitro phenotypes.
  • Improving diagnosis of non-malarial fevers in Senegal: Borrelia and the contribution of tick-borne bacteria [preprint]

    Levine, Zoë C; Sene, Aita; Mkandawire, Winnie; Deme, Awa B; Ndiaye, Tolla; Sy, Mouhamad; Gaye, Amy; Diedhiou, Younouss; Mbaye, Amadou M; Ndiaye, Ibrahima; et al. (2023-08-25)
    The worldwide decline in malaria incidence is revealing the extensive burden of non-malarial febrile illness (NMFI), which remains poorly understood and difficult to diagnose. To characterize NMFI in Senegal, we collected venous blood and clinical metadata from febrile patients and healthy controls in a low malaria burden area. Using 16S and unbiased sequencing, we detected viral, bacterial, or eukaryotic pathogens in 29% of NMFI cases. Bacteria were the most common, with relapsing fever Borrelia and spotted fever Rickettsia found in 15% and 3.7% of cases, respectively. Four viral pathogens were found in a total of 7 febrile cases (3.5%). Sequencing also detected undiagnosed Plasmodium, including one putative P. ovale infection. We developed a logistic regression model to distinguish Borrelia from NMFIs with similar presentation based on symptoms and vital signs. These results highlight the challenge and importance of improved diagnostics, especially for Borrelia, to support diagnosis and surveillance.
  • Using evolutionary constraint to define novel candidate driver genes in medulloblastoma

    Roy, Ananya; Sakthikumar, Sharadha; Kozyrev, Sergey V; Nordin, Jessika; Pensch, Raphaela; Mäkeläinen, Suvi; Pettersson, Mats; Karlsson, Elinor K; Lindblad-Toh, Kerstin; Forsberg-Nilsson, Karin (2023-08-07)
    Current knowledge of cancer genomics remains biased against noncoding mutations. To systematically search for regulatory noncoding mutations, we assessed mutations in conserved positions in the genome under the assumption that these are more likely to be functional than mutations in positions with low conservation. To this end, we used whole-genome sequencing data from the International Cancer Genome Consortium and combined it with evolutionary constraint inferred from 240 mammals, to identify genes enriched in noncoding constraint mutations (NCCMs), mutations likely to be regulatory in nature. We compare medulloblastoma (MB), which is malignant, to pilocytic astrocytoma (PA), a primarily benign tumor, and find highly different NCCM frequencies between the two, in agreement with the fact that malignant cancers tend to have more mutations. In PA, a high NCCM frequency only affects the BRAF locus, which is the most commonly mutated gene in PA. In contrast, in MB, >500 genes have high levels of NCCMs. Intriguingly, several loci with NCCMs in MB are associated with different ages of onset, such as the HOXB cluster in young MB patients. In adult patients, NCCMs occurred in, e.g., the WASF-2/AHDC1/FGR locus. One of these NCCMs led to increased expression of the SRC kinase FGR and augmented responsiveness of MB cells to dasatinib, a SRC kinase inhibitor. Our analysis thus points to different molecular pathways in different patient groups. These newly identified putative candidate driver mutations may aid in patient stratification in MB and could be valuable for future selection of personalized treatment options.
  • Aub, Vasa and Armi localization to phase separated nuage is dispensable for piRNA biogenesis and transposon silencing in Drosophila [preprint]

    Ho, Samantha; Rice, Nicholas P; Yu, Tianxiong; Weng, Zhiping; Theurkauf, William E (2023-07-26)
    From nematodes to placental mammals, key components of the germline transposon silencing piRNA pathway localize to phase separated perinuclear granules. In Drosophila, the PIWI protein Aub, DEAD box protein Vasa and helicase Armi localize to nuage granules and are required for ping-pong piRNA amplification and phased piRNA processing. Drosophila piRNA mutants lead to genome instability and Chk2 kinase DNA damage signaling. By systematically analyzing piRNA pathway organization, small RNA production, and long RNA expression in single piRNA mutants and corresponding chk2/mnk double mutants, we show that Chk2 activation disrupts nuage localization of Aub and Vasa, and that the HP1 homolog Rhino, which drives piRNA precursor transcription, is required for Aub, Vasa, and Armi localization to nuage. However, these studies also show that ping-pong amplification and phased piRNA biogenesis are independent of nuage localization of Vasa, Aub and Armi. Dispersed cytoplasmic proteins thus appear to mediate these essential piRNA pathway functions.
  • Knowledge, attitudes and practices regarding the use of mobile travel health apps

    Machoko, Munashe M P; Dong, Yinan; Grozdani, Andonaq; Hong, Hung; Oliver, Elizabeth; Hyle, Emily P; Ryan, Edward T; Colubri, Andrés; LaRocque, Regina C (2023-07-06)
    We performed a survey of U.S. international travellers to evaluate their knowledge, attitudes and practices regarding mobile technologies related to health. We found that many international travellers carry smartphones and are interested in receiving health information from a mobile app when they travel abroad.
  • Performance of Rapid Antigen Tests to Detect Symptomatic and Asymptomatic SARS-CoV-2 Infection: A Prospective Cohort Study

    Soni, Apurv; Herbert, Carly; Lin, Honghuang; Yan, Yi; Pretz, Caitlin; Stamegna, Pamela; Wang, Biqi; Orwig, Taylor; Wright, Colton; Tarrant, Seanan; et al. (2023-07-04)
    Background: The performance of rapid antigen tests (Ag-RDTs) for screening asymptomatic and symptomatic persons for SARS-CoV-2 is not well established. Objective: To evaluate the performance of Ag-RDTs for detection of SARS-CoV-2 among symptomatic and asymptomatic participants. Design: This prospective cohort study enrolled participants between October 2021 and January 2022. Participants completed Ag-RDTs and reverse transcriptase polymerase chain reaction (RT-PCR) testing for SARS-CoV-2 every 48 hours for 15 days. Setting: Participants were enrolled digitally throughout the mainland United States. They self-collected anterior nasal swabs for Ag-RDTs and RT-PCR testing. Nasal swabs for RT-PCR were shipped to a central laboratory, whereas Ag-RDTs were done at home. Participants: Of 7361 participants in the study, 5353 who were asymptomatic and negative for SARS-CoV-2 on study day 1 were eligible. In total, 154 participants had at least 1 positive RT-PCR result. Measurements: The sensitivity of Ag-RDTs was measured on the basis of testing once (same-day), twice (after 48 hours), and thrice (after a total of 96 hours). The analysis was repeated for different days past index PCR positivity (DPIPPs) to approximate real-world scenarios where testing initiation may not always coincide with DPIPP 0. Results were stratified by symptom status. Results: Among 154 participants who tested positive for SARS-CoV-2, 97 were asymptomatic and 57 had symptoms at infection onset. Serial testing with Ag-RDTs twice 48 hours apart resulted in an aggregated sensitivity of 93.4% (95% CI, 90.4% to 95.9%) among symptomatic participants on DPIPPs 0 to 6. When singleton positive results were excluded, the aggregated sensitivity on DPIPPs 0 to 6 for 2-time serial testing among asymptomatic participants was lower at 62.7% (CI, 57.0% to 70.5%), but it improved to 79.0% (CI, 70.1% to 87.4%) with testing 3 times at 48-hour intervals. Limitation: Participants tested every 48 hours; therefore, these data cannot support conclusions about serial testing intervals shorter than 48 hours. Conclusion: The performance of Ag-RDTs was optimized when asymptomatic participants tested 3 times at 48-hour intervals and when symptomatic participants tested 2 times separated by 48 hours. Primary funding source: National Institutes of Health RADx Tech program.
  • Modeling of mitochondrial genetic polymorphisms reveals induction of heteroplasmy by pleiotropic disease locus 10398A>G

    Smullen, Molly; Olson, Meagan N; Murray, Liam F; Suresh, Madhusoodhanan; Yan, Guang; Dawes, Pepper; Barton, Nathaniel J; Mason, Jivanna N; Zhang, Yucheng; Fernandez-Fontaine, Aria A; et al. (2023-06-27)
    Mitochondrial (MT) dysfunction has been associated with several neurodegenerative diseases including Alzheimer's disease (AD). While MT-copy number differences have been implicated in AD, the effect of MT heteroplasmy on AD has not been well characterized. Here, we analyzed over 1800 whole genome sequencing data from four AD cohorts in seven different tissue types to determine the extent of MT heteroplasmy present. While MT heteroplasmy was present throughout the entire MT genome for blood samples, we detected MT heteroplasmy only within the MT control region for brain samples. We observed that an MT variant 10398A>G (rs2853826) was significantly associated with overall MT heteroplasmy in brain tissue while also being linked with the largest number of distinct disease phenotypes of all annotated MT variants in MitoMap. Using gene-expression data from our brain samples, our modeling discovered several gene networks involved in mitochondrial respiratory chain and Complex I function associated with 10398A>G. The variant was also found to be an expression quantitative trait locus (eQTL) for the gene MT-ND3. We further characterized the effect of 10398A>G by phenotyping a population of lymphoblastoid cell-lines (LCLs) with and without the variant allele. Examination of RNA sequence data from these LCLs reveals that 10398A>G was an eQTL for MT-ND4. We also observed in LCLs that 10398A>G was significantly associated with overall MT heteroplasmy within the MT control region, confirming the initial findings observed in post-mortem brain tissue. These results provide novel evidence linking MT SNPs with MT heteroplasmy and open novel avenues for the investigation of pathomechanisms that are driven by this pleiotropic disease-associated locus.
  • FACS-Based Sequencing Approach to Evaluate Cell Type to Genotype Associations Using Cerebral Organoids

    Murray, Liam; Olson, Meagan N; Barton, Nathaniel; Dawes, Pepper; Chan, Yingleong; Lim, Elaine T (2023-06-11)
    Recent technological developments have led to widespread applications of large-scale transcriptomics-based sequencing methods to identify genotype-to-cell type associations. Here we describe a fluorescence-activated cell sorting (FACS)-based sequencing method to utilize CRISPR/Cas9 edited mosaic cerebral organoids to identify or validate genotype-to-cell type associations. Our approach is high-throughput and quantitative and uses internal controls to enable comparisons of the results across different antibody markers and experiments.
  • Expression of ALS-PFN1 impairs vesicular degradation in iPSC-derived microglia [preprint]

    Funes, Salome; Gadd, Del Hayden; Mosqueda, Michelle; Zhong, Jianjun; Jung, Jonathan; Shankaracharya; Unger, Matthew; Cameron, Debra; Dawes, Pepper; Keagle, Pamela J; et al. (2023-06-01)
    Microglia play a pivotal role in neurodegenerative disease pathogenesis, but the mechanisms underlying microglia dysfunction and toxicity remain to be fully elucidated. To investigate the effect of neurodegenerative disease-linked genes on the intrinsic properties of microglia, we studied microglia-like cells derived from human induced pluripotent stem cells (iPSCs), termed iMGs, harboring mutations in profilin-1 (PFN1) that are causative for amyotrophic lateral sclerosis (ALS). ALS-PFN1 iMGs exhibited lipid dysmetabolism and deficits in phagocytosis, a critical microglia function. Our cumulative data implicate an effect of ALS-linked PFN1 on the autophagy pathway, including enhanced binding of mutant PFN1 to the autophagy signaling molecule PI3P, as an underlying cause of defective phagocytosis in ALS-PFN1 iMGs. Indeed, phagocytic processing was restored in ALS-PFN1 iMGs with Rapamycin, an inducer of autophagic flux. These outcomes demonstrate the utility of iMGs for neurodegenerative disease research and highlight microglia vesicular degradation pathways as potential therapeutic targets for these disorders.
  • The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity [preprint]

    Reese, Fairlie; Williams, Brian; Balderrama-Gutierrez, Gabriela; Wyman, Dana; Çelik, Muhammed Hasan; Rebboah, Elisabeth; Rezaie, Narges; Trout, Diane; Razavi-Mohseni, Milad; Jiang, Yunzhe; et al. (2023-05-16)
    The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multitranscript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
  • Up-regulation of cholesterol synthesis pathways and limited neurodegeneration in a knock-in mutant mouse model of ALS [preprint]

    Dominov, Janice A; Madigan, Laura A; Whitt, Joshua P; Rademacher, Katerina L; Webster, Kristin M; Zhang, Hesheng; Banno, Haruhiko; Tang, Siqi; Zhang, Yifan; Wightman, Nicholas; et al. (2023-05-05)
    Amyotrophic lateral sclerosis (ALS) is a severe neurodegenerative disorder affecting brain and spinal cord motor neurons. Mutations in the copper/zinc superoxide dismutase gene (SOD1) are associated with ∼20% of inherited and 1-2% of sporadic ALS cases. Much has been learned from mice expressing transgenic copies of mutant SOD1, which typically involve high-level transgene expression, thereby differing from ALS patients expressing one mutant gene copy. To generate a model that more closely represents patient gene expression, we created a knock-in point mutation (G85R, a human ALS-causing mutation) in the endogenous mouse Sod1 gene, leading to mutant SOD1 G85R protein expression. Heterozygous Sod1 G85R mutant mice resemble wild type, whereas homozygous mutants have reduced body weight and lifespan, a mild neurodegenerative phenotype, and express very low mutant SOD1 protein levels with no detectable SOD1 activity. Homozygous mutants exhibit partial neuromuscular junction denervation at 3-4 months of age. Spinal cord motor neuron transcriptome analyses of homozygous Sod1 G85R mice revealed up-regulation of cholesterol synthesis pathway genes compared to wild type. Transcriptome and phenotypic features of these mice are similar to Sod1 knock-out mice, suggesting the Sod1 G85R phenotype is largely driven by loss of SOD1 function. By contrast, cholesterol synthesis genes are down-regulated in severely affected human TgSOD1 G93A transgenic mice at 4 months. Our analyses implicate dysregulation of cholesterol or related lipid pathway genes in ALS pathogenesis. The Sod1 G85R knock-in mouse is a useful ALS model to examine the importance of SOD1 activity in control of cholesterol homeostasis and motor neuron survival.
  • The functional and evolutionary impacts of human-specific deletions in conserved elements

    Xue, James R; Mackay-Smith, Ava; Mouri, Kousuke; Garcia, Meilin Fernandez; Dong, Michael X; Akers, Jared F; Noble, Mark; Li, Xue; Lindblad-Toh, Kerstin; Karlsson, Elinor K; et al. (2023-04-28)
    Conserved genomic sequences disrupted in humans may underlie uniquely human phenotypic traits. We identified and characterized 10,032 human-specific conserved deletions (hCONDELs). These short (average 2.56 base pairs) deletions are enriched for human brain functions across genetic, epigenomic, and transcriptomic datasets. Using massively parallel reporter assays in six cell types, we discovered 800 hCONDELs conferring significant differences in regulatory activity, half of which enhance rather than disrupt regulatory function. We highlight several hCONDELs with putative human-specific effects on brain development, including HDAC5, CPEB4, and PPP2CA. Reverting an hCONDEL to the ancestral sequence alters the expression of LOXL2 and developmental genes involved in myelination and synaptic function. Our data provide a rich resource to investigate the evolutionary mechanisms driving new traits in humans and other species.
  • Leveraging base-pair mammalian constraint to understand genetic variation and human disease

    Sullivan, Patrick F; Meadows, Jennifer R S; Gazal, Steven; Phan, BaDoi N; Li, Xue; Genereux, Diane P; Dong, Michael X; Bianchi, Matteo; Andrews, Gregory; Sakthikumar, Sharadha; et al. (2023-04-28)
    Thousands of genomic regions have been associated with heritable human diseases, but attempts to elucidate biological mechanisms are impeded by an inability to discern which genomic positions are functionally important. Evolutionary constraint is a powerful predictor of function, agnostic to cell type or disease mechanism. Single-base phyloP scores from 240 mammals identified 3.3% of the human genome as significantly constrained and likely functional. We compared phyloP scores to genome annotation, association studies, copy-number variation, clinical genetics findings, and cancer data. Constrained positions are enriched for variants that explain common disease heritability more than other functional annotations. Our results improve variant annotation but also highlight that the regulatory landscape of the human genome still needs to be further explored and linked to disease.
