Modelling the genetic aetiology of complex disease: human-mouse conservation of noncoding features and disease-associated loci.

George Powell^1,2, Helen Long^1,2, Louisa Zolkiewski^2,3, Rebecca Dumbell^4, Ann-Marie Mallon^2, Cecilia M Lindgren^1,5,4,6, Michelle M Simon^2.

Abstract

Understanding the genetic aetiology of loci associated with a disease is crucial for developing preventative measures and effective treatments. Mouse models are used extensively to understand human pathobiology and mechanistic functions of disease-associated loci. However, the utility of mouse models is limited in part by evolutionary divergence in transcription regulation for pathways of interest. Here, we summarize the alignment of genomic (exonic and multi-cell regulatory) annotations alongside Mendelian and complex disease-associated variant sites between humans and mice. Our results highlight the importance of understanding evolutionary divergence in transcription regulation when interpreting functional studies using mice as models for human disease variants.

Keywords: Mendelian disease; alignment; annotation; complex disease; conservation; orthologue

Year: 2022 PMID: 35317627 PMCID: PMC8941414 DOI: 10.1098/rsbl.2021.0630

Source DB: PubMed Journal: Biol Lett ISSN: 1744-9561 Impact factor: 3.703

Background

Understanding the mechanistic function of disease-associated loci is a fundamental challenge for biomedical research, and is critical for the development of effective treatments and drug targets [1]. Genome-Wide Association Studies (GWAS) have identified a myriad of variant sites associated with the risk of complex diseases [2]; however, the causal pathways of these loci remain poorly understood [3]. This is in part due to the relative difficulty of functional follow-up studies, which is compounded by the small and potentially interactive effects of variants, and the complexity of interpreting the function of non-coding regions, where the majority of GWAS variants are found [4]. The mouse is the most commonly used mammalian model for biomedical research [5-8] and has been used to infer the function of human disease variants. Mouse models have been particularly useful for elucidating the function of variants in protein-coding transcripts, which are highly conserved between the species [9], in addition to loci associated with traits that can only be measured in vivo such as body fat distribution or body mass index [10]. The mouse is also the only non-human mammal for which we have data on regulatory feature occupancy from genomic assays catalogued by ENCODE [11,12]. It is therefore uniquely suited to serve as a model for understanding regulatory feature function, with further potential for modelling human disease loci through humanization of the mouse genome using CRISPR/Cas9 technologies [5,13,14]. Studies have mapped human GWAS variants associated with given disease phenotypes to the mouse genome and shown an enrichment in regions linked to transcription regulation [11,15-17]. 
Studies have also, however, highlighted the substantial divergence in tissue and/or cell-specific transcription regulation between the species [11,12,15,18-20], making it unclear in which instances the mouse can recapitulate mechanisms of human gene expression to sufficiently model the function of human disease-associated genetic variants [21]. The Ensembl Regulatory Build amalgamates datasets from various consortia, including ENCODE, to annotate predicted regulatory sequences across the human and mouse genomes [22]. These annotations are continually updated as more data become available and, importantly, have stable identifiers to provide a reference framework for ongoing research. It is, however, currently unreported how human–mouse alignment compares across the spectrum of annotation categories. Furthermore, it remains unclear how Mendelian and complex disease-associated variant site alignment varies between different regulatory annotations. Addressing these two questions would provide a useful reference point for researchers considering mouse models for human disease-associated loci. Here, we use genomic annotation from Ensembl to provide a genome-wide overview of sequence alignment for twelve categories of annotation (including exonic and regulatory features) between humans and mice. We assess the alignment of Mendelian and complex disease-associated variant sites between the species across these annotation categories and discuss the implications of our results for the use of mouse models to understand the mechanistic function of human disease loci.

Results and discussion

The human and mouse genomes have been annotated genome-wide by the Ensembl Regulatory Build [22] and GENCODE [23], and we used these two sources to annotate all base-pair positions across the autosomes for both species (see Methods and electronic supplementary material). Species genomes that have diverged over evolutionary time can be aligned to identify orthologous loci [24]. Throughout this manuscript, we define a human base as aligned if it has an orthologous base in the mouse genome (i.e. a corresponding genomic position in the pairwise alignment conducted by Ensembl [25]), independent of whether a point mutation has occurred. We summarize the overall fraction of human bases that align to the mouse genome for each annotation category (figure 1). We describe human bases with a given annotation that align to mouse bases with the same annotation as having common annotation, and report the fraction of such bases for each category.
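The two quantities defined above (the fraction of aligned bases and the fraction with common annotation) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `alignment_summary`, the input layout and the toy data are hypothetical, and it assumes one annotation per base, whereas the annotations in this study can overlap.

```python
from collections import defaultdict


def alignment_summary(bases):
    """Summarize alignment per human annotation category.

    `bases` is an iterable of (human_annotation, mouse_annotation) pairs,
    one per human base. `mouse_annotation` is None when the human base has
    no orthologous position in the mouse genome. (Hypothetical layout.)
    """
    totals = defaultdict(int)
    aligned = defaultdict(int)
    common = defaultdict(int)
    for human_ann, mouse_ann in bases:
        totals[human_ann] += 1
        if mouse_ann is not None:
            # The base is "aligned": it has an orthologous mouse position.
            aligned[human_ann] += 1
            if mouse_ann == human_ann:
                # "Common annotation": same category in both species.
                common[human_ann] += 1
    return {
        ann: {
            "aligned_pct": 100.0 * aligned[ann] / n,
            "common_pct": 100.0 * common[ann] / n,
        }
        for ann, n in totals.items()
    }


# Toy data, invented for illustration only.
toy = [
    ("promoter", "promoter"),
    ("promoter", "enhancer-distal"),
    ("promoter", None),
    ("promoter", "promoter"),
    ("exon-CDS", "exon-CDS"),
    ("exon-CDS", "exon-CDS"),
]
summary = alignment_summary(toy)
print(summary["promoter"])
```

In this toy input, three of the four promoter bases align and two of them align to a mouse promoter. The difference between the two percentages corresponds to the black bars in figure 1 (aligned, but to a different annotation).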

Figure 1

Alignment of genomic annotations between humans and mice. Bars represent the percentage of human bases that align with the mouse genome. Coloured bars represent the percentage of bases that align with a common annotation in the mouse (i.e. the same annotation in each species). Black bars represent the percentage of bases that align to a different annotation in the mouse (i.e. do not have a common annotation). The dashed blue line represents the genome-wide percentage of human bases that align with the mouse genome. The genomic coverage for each human annotation is labelled in brackets on the Y-axis. The sum of coverage is greater than 100% due to the overlap of annotations (electronic supplementary material, table S2). Human protein-coding sequences show the greatest alignment to the mouse genome (95.5%). The fraction of human annotation that aligns to the same annotation in mice is highest for protein-coding sequences (88.2%), proximal intronic sequences (56.0%), untranslated regions (UTRs; 40.8%) and promoters (38.5%), and lowest for distal enhancers (3.1%), topologically associated domain (TAD) boundaries (4.8%) and miscellaneous (2.6%). CTCF, CCCTC-binding factor, which is encoded by the CTCF gene; CDS, coding DNA sequence.

In total, 29.3% of the human autosome aligns to the mouse; however, alignment varies by genomic annotation (figure 1; electronic supplementary material, table S3). Human translated exons (95.5%), proximal intronic sequences that include splice sites (77.3%), 3′ and 5′ untranslated regions (UTRs) (67.6%) and promoters (59.4%) show a relatively higher degree of alignment to the mouse genome than other exonic and regulatory annotations, including proximal and distal enhancers (41.8% and 45.3%, respectively), miscellaneous sequences (39.2%), CCCTC-binding factor (CTCF) binding sites (41.4%) and topologically associated domain (TAD) boundaries (32.4%).
The fraction of bases that align to the mouse and have common annotation in both species provides a coarse measure of feature conservation. The fraction of common annotation varies by annotation category and is greatest for translated exonic sequences (88.2%), followed by proximal intronic regions (56.0%), UTRs (40.8%), and promoters (38.5%) (figure 1; electronic supplementary material, table S3). By contrast, the fraction of common annotation is lower for proximal and distal enhancers (9.2% and 3.1% respectively) (figure 1; electronic supplementary material, table S3). This is consistent with previous research highlighting the rapid rate of enhancer turnover relative to promoters across mammalian species [7,26-29]. However, because enhancers can act as tissue-specific cis-regulatory elements [30], and the human and mouse regulatory builds are constructed using different amalgamations of tissue sources [22], some enhancer annotation and alignment may not have been captured by the multicell annotation model. We assessed the alignment of tissue-specific enhancers by comparing the alignment of enhancers active in adult heart, liver and spleen samples between the species (electronic supplementary material, table S4). Tissue-specific enhancers have a comparable alignment (ranging from 42.2% to 51.0% for proximal enhancers and 43.5% to 60.2% for distal enhancers) to the multicell model but were less conserved (ranging from 1.6% to 5.1% for proximal enhancers and 0.4% to 2.7% for distal enhancers). It is important to determine the similarities and differences in regulatory architecture between humans and mice when considering using a mouse model to infer the mechanistic function of human disease-associated variants [21]. 
We assessed the alignment of human variant sites predicted to cause Mendelian disease and human variant sites associated with complex disease with the mouse genome by considering two datasets: single nucleotide variant (SNV) sites predicted to cause Mendelian disease from ClinVar [31] (n = 42 039) and SNV sites associated with complex disease from the GWAS Catalog [32] (n = 27 794). As expected, both Mendelian and complex disease-associated variant sites in translated human sequences (exon coding DNA sequence (CDS), hereafter exon-CDS) have a high degree of alignment to the mouse genome (99.3% and 95.8%, respectively) (figure 2; electronic supplementary material, table S5). Across non-protein-coding sequences (i.e. loci not classified as exon-CDS, hereafter referred to as non-coding), 98.4% of pathogenic variant sites predicted to follow Mendelian inheritance patterns have an orthologous position in the mouse genome (figure 2; electronic supplementary material, table S5). This is significantly more than the genome-wide average of 28.8% for non-coding loci (z = 139.5, p < 1.0 × 10^−300) and indicates that these sites have had a higher probability of being constrained by local purifying selection, potentially as a result of functional importance, since the species' divergence. There is, however, variation between regulatory elements in the fraction of SNV sites that align to the same annotation in mouse. Of the Mendelian pathogenic SNV sites in human promoter sequences, 70.8% align to mouse promoter sequences. In comparison, only 12.2% of Mendelian pathogenic SNV sites in human proximal enhancers and 6.5% in human distal enhancers align to loci with the same annotation in mice (figure 2; electronic supplementary material, table S5). This difference suggests that while these loci may have had a higher probability of preservation due to local purifying selection in both lineages, the active regulatory elements and functional pathways at these variant sites have diverged.
It must be noted, however, that some similarities may be missed due to regulatory feature specificity and differences in the tissue amalgamations used to annotate regulatory features.
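The z statistics reported in this section come from two-proportion z-tests (see Methods). A minimal sketch of such a test is given below, assuming the common pooled-variance form with a two-sided p-value; the text does not state which variant the authors used, and the function name and counts are hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Two-proportion z-test with pooled variance (an assumed variant).

    x1/n1 and x2/n2 are the observed proportions, e.g. aligned variant
    sites out of all variant sites in each group.
    Returns (z, two-sided p-value).
    """
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts, not taken from the study.
z, p = two_proportion_z(60, 100, 40, 100)
print(z, p)
```

With a very large |z|, a double-precision p-value computed this way underflows towards zero, which is one reason such results are usually reported as an upper bound rather than an exact value.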

Figure 2

Alignment of human SNV sites associated with complex disease (GWAS Catalog) and Mendelian disease (ClinVar) between humans and mice. Bars represent the percentage of human variant sites that align with the mouse genome. Coloured bars represent the percentage of variant sites that align with a common annotation in the mouse (i.e. the same annotation in each species). Black bars represent the percentage of variant sites that align with a different annotation in the mouse (i.e. do not have a common annotation). The dashed blue line represents the total percentage of variant sites that align with the mouse genome. Variant sites associated with human Mendelian disease are more conserved between the species than variant sites associated with human complex disease. However, annotation of non-exonic regulatory features (excluding the promoter) is poorly conserved, suggesting functional divergence between the species.

A significantly smaller fraction of non-coding variant sites associated with complex disease aligns with the mouse genome than non-coding variant sites predicted to follow Mendelian inheritance patterns (36.4% compared with 98.4%, z = 97.7, p < 1.0 × 10^−300) (figure 2; electronic supplementary material, tables S5 and S6). One explanation for this may be the small effect size of variant sites associated with complex disease having limited fitness effects [33]. Distal introns and unannotated regions contain the majority (62.0%) of variant sites associated with complex disease, making their effect on transcription regulation difficult to infer. However, a significantly greater fraction of variant sites associated with complex disease in these regions aligns with the mouse genome than the total fraction of bases with these annotations: 34.3% compared with 29.6% (z = 10.7, p = 9.10 × 10^−27) for distal introns and 25.8% compared with 17.9% (z = 16.5, p < 4.90 × 10^−61) for unannotated (electronic supplementary material, table S8).
This suggests that the functional role of loci within regions annotated as ‘intron–distal’ and ‘unannotated’ has not been captured by the annotation model and may discourage the production of mouse models for these variants.

Conclusion

By comparing the mouse and human genomes, we found that 95.5% of human protein-coding sequence and 28.5% of human non-coding (untranslated) sequence align with the mouse genome. Furthermore, 98.4% of human non-coding variant sites associated with Mendelian disease align to the mouse genome, compared with 36.4% of non-coding variant sites associated with complex disease. The degree of overall divergence in the regulatory landscape between humans and mice highlights the importance of understanding the differences between functional pathways of interest when using mouse models to infer human disease mechanisms.

Methods

Regional genomic annotations for human and mouse autosomes are defined by the Ensembl Multicell Regulatory Build [22] and GENCODE [23] from Ensembl (v. 101) [34]. Exonic genomic regions were categorized by their GENCODE annotations as: ‘exon-CDS’ for translated nucleotides in protein-coding exons; ‘exon-UTR’ for 5′ untranslated region (UTR) or 3′ UTR nucleotides in protein-coding exons; ‘exon-other’ for nucleotides in non-protein-coding exons (notably ncRNAs and lncRNAs). Regulatory regions were categorized by their Ensembl Regulatory Build annotations as: ‘promoter’, ‘enhancer-proximal’, ‘enhancer-distal’, ‘CTCF binding site’ or ‘miscellaneous’ (nucleotides categorized as unannotated transcription factor binding sites or unannotated open chromatin). Intronic nucleotides in either protein-coding or non-protein-coding genes were inferred from exon coordinates as annotated in GENCODE, and categorized as either: ‘intron-proximal’ if they are located within 10 bp of a splice-site position, or ‘intron-distal’ if they are located more than 10 bp from a splice-site position and do not have any other annotation. TADs were called using the Arrowhead algorithm [35] (detail provided in the electronic supplementary material) and TAD boundaries were defined as ±25 kb from the start and end of each called TAD. All remaining nucleotides not annotated in GENCODE, the Ensembl Regulatory Build or as intronic are categorized as ‘unannotated’. A summary of the genomic coverage for each annotation is provided in electronic supplementary material, table S1. Annotation overlap is summarized in electronic supplementary material, figure S1 and table S2.

Human–mouse pairwise alignment was conducted by Ensembl (v. 101) using LastZ [24,25]. Human single nucleotide variant (SNV) sites associated with Mendelian disease were downloaded from ClinVar [31]. We considered all SNV sites with clinical significance labelled as either ‘Pathogenic’ or ‘Likely pathogenic’, and a review status labelled as either ‘criteria provided, multiple submitters, no conflicts’, ‘criteria provided, single submitter’, or ‘reviewed by expert panel’ (n = 42 039). Human SNV sites associated with complex disease were obtained from the GWAS Catalog [32] and have a phenotype that is ontologically classified as either disease, disorder or cancer, and a p-value < 10^−8 (n = 27 794).

We tested differences in proportions using two-proportion z-tests (more information provided in the electronic supplementary material). All analysis and figure plotting were conducted in R v. 3.4.2 [36]. Detailed methodology is provided in the electronic supplementary material.
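The distance-based categories described in these Methods (intronic bases within 10 bp of a splice site, and TAD boundaries spanning ±25 kb around each TAD start and end) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: coordinates are taken to be 1-based and inclusive, the splice-site positions are taken to be the first and last bases of the intron, and the function names are hypothetical. The additional rule that intron-distal bases must carry no other annotation is not modelled here.

```python
def classify_intronic_base(pos, intron_start, intron_end, window=10):
    """Label an intronic base by its distance to the nearest intron end.

    Assumes 1-based inclusive coordinates and treats the first and last
    intronic bases as the splice-site positions (an assumption).
    """
    if not intron_start <= pos <= intron_end:
        raise ValueError("position lies outside the intron")
    distance = min(pos - intron_start, intron_end - pos)
    return "intron-proximal" if distance <= window else "intron-distal"


def tad_boundaries(tad_start, tad_end, flank=25_000):
    """Return the two boundary windows (start, end) for one called TAD.

    Each window extends `flank` bases either side of the TAD start or end,
    clipped so that it does not run past the first base of the chromosome.
    """
    return [
        (max(1, tad_start - flank), tad_start + flank),
        (max(1, tad_end - flank), tad_end + flank),
    ]


# Hypothetical coordinates for illustration.
print(classify_intronic_base(105, 100, 500))
print(classify_intronic_base(300, 100, 500))
print(tad_boundaries(100_000, 900_000))
```

A full annotation pass would also need a rule for bases covered by more than one category, since the annotations overlap (electronic supplementary material, table S2).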

34 in total

Review 1. Deciphering the Emerging Complexities of Molecular Mechanisms at GWAS Loci.

Authors: Maren E Cannon; Karen L Mohlke
Journal: Am J Hum Genet Date: 2018-11-01 Impact factor: 11.025

Review 2. The Post-GWAS Era: From Association to Function.

Authors: Michael D Gallagher; Alice S Chen-Plotkin
Journal: Am J Hum Genet Date: 2018-05-03 Impact factor: 11.025

3. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments.

Authors: Neva C Durand; Muhammad S Shamim; Ido Machol; Suhas S P Rao; Miriam H Huntley; Eric S Lander; Erez Lieberman Aiden
Journal: Cell Syst Date: 2016-07 Impact factor: 10.304

Review 4. Genomically humanized mice: technologies and promises.

Authors: Anny Devoy; Rosie K A Bunton-Stasyshyn; Victor L J Tybulewicz; Andrew J H Smith; Elizabeth M C Fisher
Journal: Nat Rev Genet Date: 2011-12-16 Impact factor: 53.242

5. Mouse regulatory DNA landscapes reveal global principles of cis-regulatory evolution.

Authors: Jeff Vierstra; Eric Rynes; Richard Sandstrom; Miaohua Zhang; Theresa Canfield; R Scott Hansen; Sandra Stehling-Sun; Peter J Sabo; Rachel Byron; Richard Humbert; Robert E Thurman; Audra K Johnson; Shinny Vong; Kristen Lee; Daniel Bates; Fidencio Neri; Morgan Diegel; Erika Giste; Eric Haugen; Douglas Dunn; Matthew S Wilken; Steven Josefowicz; Robert Samstein; Kai-Hsin Chang; Evan E Eichler; Marella De Bruijn; Thomas A Reh; Arthur Skoultchi; Alexander Rudensky; Stuart H Orkin; Thalia Papayannopoulou; Piper M Treuting; Licia Selleri; Rajinder Kaul; Mark Groudine; M A Bender; John A Stamatoyannopoulos
Journal: Science Date: 2014-11-21 Impact factor: 47.728

6. The Mouse Genome Database (MGD): facilitating mouse as a model for human biology and disease.

Authors: Janan T Eppig; Judith A Blake; Carol J Bult; James A Kadin; Joel E Richardson
Journal: Nucleic Acids Res Date: 2014-10-27 Impact factor: 16.971

7. A comparative encyclopedia of DNA elements in the mouse genome.

Authors: Feng Yue; Yong Cheng; Alessandra Breschi; Jeff Vierstra; Weisheng Wu; Tyrone Ryba; Richard Sandstrom; Zhihai Ma; Carrie Davis; Benjamin D Pope; Yin Shen; Dmitri D Pervouchine; Sarah Djebali; Robert E Thurman; Rajinder Kaul; Eric Rynes; Anthony Kirilusha; Georgi K Marinov; Brian A Williams; Diane Trout; Henry Amrhein; Katherine Fisher-Aylor; Igor Antoshechkin; Gilberto DeSalvo; Lei-Hoon See; Meagan Fastuca; Jorg Drenkow; Chris Zaleski; Alex Dobin; Pablo Prieto; Julien Lagarde; Giovanni Bussotti; Andrea Tanzer; Olgert Denas; Kanwei Li; M A Bender; Miaohua Zhang; Rachel Byron; Mark T Groudine; David McCleary; Long Pham; Zhen Ye; Samantha Kuan; Lee Edsall; Yi-Chieh Wu; Matthew D Rasmussen; Mukul S Bansal; Manolis Kellis; Cheryl A Keller; Christapher S Morrissey; Tejaswini Mishra; Deepti Jain; Nergiz Dogan; Robert S Harris; Philip Cayting; Trupti Kawli; Alan P Boyle; Ghia Euskirchen; Anshul Kundaje; Shin Lin; Yiing Lin; Camden Jansen; Venkat S Malladi; Melissa S Cline; Drew T Erickson; Vanessa M Kirkup; Katrina Learned; Cricket A Sloan; Kate R Rosenbloom; Beatriz Lacerda de Sousa; Kathryn Beal; Miguel Pignatelli; Paul Flicek; Jin Lian; Tamer Kahveci; Dongwon Lee; W James Kent; Miguel Ramalho Santos; Javier Herrero; Cedric Notredame; Audra Johnson; Shinny Vong; Kristen Lee; Daniel Bates; Fidencio Neri; Morgan Diegel; Theresa Canfield; Peter J Sabo; Matthew S Wilken; Thomas A Reh; Erika Giste; Anthony Shafer; Tanya Kutyavin; Eric Haugen; Douglas Dunn; Alex P Reynolds; Shane Neph; Richard Humbert; R Scott Hansen; Marella De Bruijn; Licia Selleri; Alexander Rudensky; Steven Josefowicz; Robert Samstein; Evan E Eichler; Stuart H Orkin; Dana Levasseur; Thalia Papayannopoulou; Kai-Hsin Chang; Arthur Skoultchi; Srikanta Gosh; Christine Disteche; Piper Treuting; Yanli Wang; Mitchell J Weiss; Gerd A Blobel; Xiaoyi Cao; Sheng Zhong; Ting Wang; Peter J Good; Rebecca F Lowdon; Leslie B Adams; Xiao-Qiao Zhou; Michael J Pazin; Elise A Feingold; Barbara Wold; James Taylor; Ali Mortazavi; Sherman M Weissman; John A Stamatoyannopoulos; Michael P Snyder; Roderic Guigo; Thomas R Gingeras; David M Gilbert; Ross C Hardison; Michael A Beer; Bing Ren
Journal: Nature Date: 2014-11-20 Impact factor: 49.962

8. Ensembl comparative genomics resources.

Authors: Javier Herrero; Matthieu Muffato; Kathryn Beal; Stephen Fitzgerald; Leo Gordon; Miguel Pignatelli; Albert J Vilella; Stephen M J Searle; Ridwan Amode; Simon Brent; William Spooner; Eugene Kulesha; Andrew Yates; Paul Flicek
Journal: Database (Oxford) Date: 2016-02-20 Impact factor: 3.451

9. Ensembl 2018.

Authors: Daniel R Zerbino; Premanand Achuthan; Wasiu Akanni; M Ridwan Amode; Daniel Barrell; Jyothish Bhai; Konstantinos Billis; Carla Cummins; Astrid Gall; Carlos García Girón; Laurent Gil; Leo Gordon; Leanne Haggerty; Erin Haskell; Thibaut Hourlier; Osagie G Izuogu; Sophie H Janacek; Thomas Juettemann; Jimmy Kiang To; Matthew R Laird; Ilias Lavidas; Zhicheng Liu; Jane E Loveland; Thomas Maurel; William McLaren; Benjamin Moore; Jonathan Mudge; Daniel N Murphy; Victoria Newman; Michael Nuhn; Denye Ogeh; Chuang Kee Ong; Anne Parker; Mateus Patricio; Harpreet Singh Riat; Helen Schuilenburg; Dan Sheppard; Helen Sparrow; Kieron Taylor; Anja Thormann; Alessandro Vullo; Brandon Walts; Amonida Zadissa; Adam Frankish; Sarah E Hunt; Myrto Kostadima; Nicholas Langridge; Fergal J Martin; Matthieu Muffato; Emily Perry; Magali Ruffier; Dan M Staines; Stephen J Trevanion; Bronwen L Aken; Fiona Cunningham; Andrew Yates; Paul Flicek
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

10. Expanded encyclopaedias of DNA elements in the human and mouse genomes.

Authors: Jill E Moore; Michael J Purcaro; Henry E Pratt; Charles B Epstein; Noam Shoresh; Jessika Adrian; Trupti Kawli; Carrie A Davis; Alexander Dobin; Rajinder Kaul; Jessica Halow; Eric L Van Nostrand; Peter Freese; David U Gorkin; Yin Shen; Yupeng He; Mark Mackiewicz; Florencia Pauli-Behn; Brian A Williams; Ali Mortazavi; Cheryl A Keller; Xiao-Ou Zhang; Shaimae I Elhajjajy; Jack Huey; Diane E Dickel; Valentina Snetkova; Xintao Wei; Xiaofeng Wang; Juan Carlos Rivera-Mulia; Joel Rozowsky; Jing Zhang; Surya B Chhetri; Jialing Zhang; Alec Victorsen; Kevin P White; Axel Visel; Gene W Yeo; Christopher B Burge; Eric Lécuyer; David M Gilbert; Job Dekker; John Rinn; Eric M Mendenhall; Joseph R Ecker; Manolis Kellis; Robert J Klein; William S Noble; Anshul Kundaje; Roderic Guigó; Peggy J Farnham; J Michael Cherry; Richard M Myers; Bing Ren; Brenton R Graveley; Mark B Gerstein; Len A Pennacchio; Michael P Snyder; Bradley E Bernstein; Barbara Wold; Ross C Hardison; Thomas R Gingeras; John A Stamatoyannopoulos; Zhiping Weng
Journal: Nature Date: 2020-07-29 Impact factor: 69.504