
One reference genome is not enough.

Xiaofei Yang^1,2,3, Wan-Ping Lee^2,3,4, Kai Ye^2,5, Charles Lee^6,7,8.

Abstract

A recent study on human structural variation indicates insufficiencies and errors in the human reference genome, GRCh38, and argues for the construction of a human pan-genome.


Year: 2019 PMID: 31126314 PMCID: PMC6534916 DOI: 10.1186/s13059-019-1717-0

Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583

Introduction

The human reference genome is a critical foundation for human genetics and biomedical research. The current human reference genome, GRCh38, blends genomic segments from a few individuals, although clones from a single individual predominate [1]. This invites criticism of how accurately such a reference can represent the common variants of multiple human populations. In addition, the current reference harbors many genomic segments that actually contain rare variants, and these affect downstream sequence analyses, including read alignment and variant identification. The effect is strongest for structural variants (SVs), that is, insertions, deletions and rearrangements that encompass more than 50 bp of DNA. Incorporating SVs that are shared among major human populations into the current reference genome can correct for these biases and improve both read alignment and variant detection in other individuals. Recently, a study based on deep (i.e., > 50×) long-read PacBio whole-genome sequencing (WGS) data for 15 individuals from five populations led to the discovery and sequencing of a large fraction of common structural variation. These data can be used to genotype variants in short-read sequencing datasets and ultimately to reduce biases inherent in the GRCh38 version of the human reference genome [2].

SV discovery based on long-read sequencing data

Audano et al. [2] sequenced 11 genomes (from three African, three Asian, two European and three American samples) using single-molecule, real-time (SMRT) PacBio RSII and Sequel long-read sequencing technology. They also analyzed long-read sequencing data from four additional sources: CHM1 [3], CHM13 [3], AK1 [4] and HX1 [5]. Reads were aligned against the GRCh38 version of the human reference sequence using the BLASR software, and SVs were detected using the SMRT-SV algorithm [6]. In total, 99,604 nonredundant SVs were identified from these 15 sequenced genomes. The analysis covered around 95% of the human genome, excluding the pericentromeric and other regions that are enriched for repetitive DNA (Fig. 1a). Among the 99,604 discovered SVs, there were 2238 ‘shared type’ SVs (shared across all samples) and 13,053 ‘majority type’ SVs (present in more than half of the genomes studied, but not in all samples). At each of these positions, the current reference genome therefore either carries a minor allele or contains an error. These shared and majority SVs were enriched with repetitive sequences and comprised insertions (61.6%), deletions (38.1%) and inversions (0.33%). Leaving aside the highly repetitive regions of the human genome (which probably contain many SVs), a logarithmic function conservatively suggested that adding SV data from one additional human genome would probably increase the total SV callset by 2.1%, that adding 35 genomes would increase it by 39% and, finally, that adding 327 genomes would identify twice as many SVs as were identified from these 15 genomes.
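The three projections quoted above are mutually consistent with a simple logarithmic growth curve, which the following minimal sketch illustrates. It assumes a callset of the form total(n) = a + b·ln(n) and calibrates the slope from the reported 2.1% gain for one extra genome; this functional form, the constant names and the function `relative_gain` are assumptions made for illustration, not the fit used by Audano et al. [2].

```python
import math

BASE_GENOMES = 15  # genomes in the published callset

# Assumption: total(n) = a + b * ln(n). Relative gains then depend only on
# b / total(15), which is calibrated here from the reported 2.1% gain for
# one additional genome.
SLOPE = 0.021 / math.log((BASE_GENOMES + 1) / BASE_GENOMES)


def relative_gain(extra_genomes: int) -> float:
    """Fractional growth of the SV callset after adding `extra_genomes`."""
    return SLOPE * math.log((BASE_GENOMES + extra_genomes) / BASE_GENOMES)


for extra in (1, 35, 327):
    print(f"+{extra} genomes: {relative_gain(extra):.1%} more SVs")
```

Under this assumption, the gain for 35 extra genomes lands close to the quoted 39%, and the gain for 327 extra genomes is close to a doubling of the callset.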

Fig. 1

The human genome structural variant (SV) resource. a The detection of 99,604 nonredundant SVs in 15 samples from five populations using a long-read sequencing technology. AK1 [4] and HX1 [5] are Asian individuals whose genomes were previously sequenced. b The subtelomeric regions of human chromosomes are particularly enriched for SVs of the variable number of tandem repeats (VNTR) and short tandem repeat (STR) types. Here, the frequency of black dots along the length of the chromosome indicates the relative density of SVs. c About 15% of the discovered SVs can be found in more than 50% of the samples studied, indicating that these sites actually harbor minor alleles or errors in the current reference genome. d Ultimately, a human pan-reference genome can be developed using genome graphs (or other methods) to represent common SVs accurately. DEL deletion, INS insertion, INV inversion

Among the SVs discovered, 40.8% are novel when compared to previously described SVs from several published large-scale projects (Figure S1E in [2]). To assess the allele frequency of the discovered SVs, Audano et al. [2] went on to genotype these SVs across a total of 440 additional genomes, all of which were sequenced using short-read technologies, including those of 174 individuals from the 1000 Genomes Project and 266 individuals from the Simons Genome Diversity Project [7]. The results showed that 92.6% of the released SVs actually appeared in more than half of the samples, further confirming these biases in the GRCh38 version of the human reference genome.
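The 'shared' and 'majority' categories, and the "about 15%" figure in Fig. 1c, can be reproduced from the counts given in the text. The sketch below is illustrative only: the function `classify_sv`, the label "minor" for the remaining SVs, and the choice to count carriers per genome are assumptions, not definitions taken from Audano et al. [2].

```python
TOTAL_GENOMES = 15  # long-read genomes in the discovery set


def classify_sv(carriers: int, total: int = TOTAL_GENOMES) -> str:
    """Label an SV by how many of the discovery genomes carry it."""
    if not 0 <= carriers <= total:
        raise ValueError("carrier count must lie between 0 and total")
    if carriers == total:
        return "shared"      # present in every genome
    if carriers > total / 2:
        return "majority"    # more than half of the genomes, but not all
    return "minor"           # hypothetical label for everything else


# Counts reported in the text.
shared_count, majority_count, all_svs = 2238, 13053, 99604
common_fraction = (shared_count + majority_count) / all_svs
print(f"shared + majority: {common_fraction:.1%} of all SVs")
```

With 15 genomes, an SV carried by 8 to 14 of them falls into the majority class under this rule.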

SVs enriched with tandem repeat sequences

Audano et al. [2] found that SVs are not randomly distributed across the genome: SV density increased as much as nine-fold within the subtelomeric regions (the last 5 Mb) of human chromosomes. In addition, SVs in these subtelomeric regions were significantly enriched with tandem repeats, particularly VNTRs (variable number of tandem repeats) and STRs (short tandem repeats), rather than with retrotransposons (Fig. 1b). The abundance of STRs (R = 0.27) and of VNTRs (particularly larger VNTRs; R = 0.48) also correlated positively with known hotspots of meiotic double-strand breaks (DSBs), suggesting a potential role for DSBs in the formation of SVs in these genomic regions.
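A fold increase in SV density of this kind is a ratio of SVs per base pair inside and outside the subtelomeric windows. The following sketch shows the arithmetic on made-up data; the 100-Mb chromosome, the SV positions, the function names, and the decision to treat the final 5 Mb at both chromosome ends as subtelomeric are all assumptions for illustration.

```python
SUBTELOMERE_BP = 5_000_000  # "last 5 Mb", applied here to both chromosome ends


def is_subtelomeric(pos: int, chrom_len: int, window: int = SUBTELOMERE_BP) -> bool:
    """True if a 0-based position lies within `window` bp of either end."""
    return pos < window or pos >= chrom_len - window


def fold_enrichment(positions, chrom_len: int, window: int = SUBTELOMERE_BP) -> float:
    """SV density in the subtelomeric windows divided by density elsewhere."""
    sub = sum(is_subtelomeric(p, chrom_len, window) for p in positions)
    rest = len(positions) - sub
    sub_len = 2 * window
    rest_len = chrom_len - sub_len
    if rest == 0 or rest_len <= 0:
        raise ValueError("need interior sequence and at least one interior SV")
    return (sub / sub_len) / (rest / rest_len)


# Hypothetical chromosome: 9 SVs in 10 Mb of subtelomere, 9 SVs in 90 Mb of interior.
toy_len = 100_000_000
toy_positions = (
    [1_000_000, 2_000_000, 3_000_000, 4_000_000]
    + [95_500_000, 96_500_000, 97_500_000, 98_500_000, 99_500_000]
    + [m * 10_000_000 for m in range(1, 10)]
)
print(f"fold enrichment: {fold_enrichment(toy_positions, toy_len):.1f}")
```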

SVs affect gene structures and regulatory elements

How do the discovered SVs interfere with gene expression? To address this question, Audano et al. [2] annotated the shared and majority SVs using RefSeq. The analysis showed that 7550 of these SVs intersect with gene regions (including coding regions, untranslated regions (UTRs), introns, and 2-kb flanking regions), and that 1033 of these SVs intersect with known regulatory elements. Some of the SVs disrupted gene structures: 841 intersected RefSeq-annotated coding regions and 667 intersected RefSeq-annotated noncoding RNA regions. For example, a 1.6-kb insertion was located in the 5′ UTR of UBE2QL1 and extended into its promoter. In another case, a 1.06-kb GC-rich insertion was located in the 3′ UTR of ADARB1 and incorporated motifs that may promote the formation of a quadruplex structure. Examples of SVs located in gene regulatory elements included a 1.2-kb and a 1.4-kb fragment inserted upstream of KDM6B and FGFR1OP, respectively. These insertions intersected with H3K4Me3 and H3K27Ac sites. Audano et al. [2] further investigated the impact of SVs on gene expression using RNA-seq data from 376 European cell lines and found that the expression of 411 genes was significantly associated with the discovered SVs.
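Intersecting SVs with gene regions that include 2-kb flanks amounts to an interval-overlap test on padded gene coordinates. The sketch below uses 0-based, half-open intervals and invented coordinates; the `Interval` type, the function names and the toy gene and SVs are hypothetical and are not taken from the RefSeq-based annotation pipeline of Audano et al. [2].

```python
from typing import List, NamedTuple

FLANK_BP = 2_000  # 2-kb flanking region on each side of a gene


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def with_flank(gene: Interval, flank: int = FLANK_BP) -> Interval:
    """Extend a gene interval by `flank` bp on both sides, clamped at 0."""
    return Interval(gene.chrom, max(0, gene.start - flank), gene.end + flank)


def svs_hitting_gene(svs: List[Interval], gene: Interval) -> List[Interval]:
    """Return the SVs that intersect the gene body or its flanking regions."""
    region = with_flank(gene)
    return [sv for sv in svs if overlaps(sv, region)]


# Hypothetical gene and SV calls.
toy_gene = Interval("chr1", 10_000, 20_000)
toy_svs = [
    Interval("chr1", 8_500, 8_600),    # inside the upstream flank
    Interval("chr1", 7_000, 7_900),    # just outside the flank
    Interval("chr2", 10_000, 10_100),  # different chromosome
    Interval("chr1", 21_999, 22_500),  # touches the last flank base
]
hits = svs_hitting_gene(toy_svs, toy_gene)
print(hits)
```

A real analysis would typically delegate this step to an interval library or a tool that works on BED files; the linear scan here is only meant to make the overlap rule explicit.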

The discovered SVs can be helpful for re-constructing a canonical human reference genome

GRCh38 currently contains 819 gaps, as well as minor alleles and actual errors. Audano et al. [2] proposed that the SVs discovered in their work could be included to correct the reference genome (Fig. 1c). They found 34 shared insertions that intersect with scaffold switch-points of GRCh38, and the new data could be used to correct possible misassemblies at these sites. For instance, a 2159-bp shared insertion overlaps with a switch-point in the NUTM1 gene and indicates a misassembly introduced when two contigs were stitched together. Additional sequencing of clones from BAC libraries confirmed the misassembly. Adding the discovered SV contigs to the reference genome could rescue 2.62% of unmapped Illumina short reads, and 1.24% of the SV-contig-mapped reads show increased mapping quality, thus improving variant detection. This effect is most pronounced for insertions, for which 25.68% of the reads show increased mapping quality when compared to the reference genome. Furthermore, GATK was able to identify a substantial amount of variation within SV insertions (i.e., 68,656 alternative alleles across the 30 whole-genome haplotypes) where no reference sequence previously existed. Taken together, these data proved to be useful in reconstructing a more precise canonical human reference genome.

Concluding remarks

Audano et al. [2] provided a sequence-resolved SV callset from the analysis of 15 human genomes. They found the reported SVs to be significantly enriched with VNTRs and STRs and correlated with hotspots of meiotic DSBs. In addition, they found that certain SVs affect gene regulatory elements and gene expression, opening a door for future studies correlating SVs with gene expression. They further patched errors and biases in the current human reference genome assembly using their SV callset, which should significantly improve the quality of future short-read alignments and variant calling. This study also promotes the concept of a pan-genome (Fig. 1d), which incorporates SVs into the reference genome and can be used with recently published graph genome tools [8, 9]. The next steps will involve phasing human genomes to reduce false negatives [10] and discovering complex SVs and indels that map to large repetitive regions of the human genome.

References

1. De novo assembly and phasing of a Korean human genome.

Authors: Jeong-Sun Seo; Arang Rhie; Junsoo Kim; Sangjin Lee; Min-Hwan Sohn; Chang-Uk Kim; Alex Hastie; Han Cao; Ji-Young Yun; Jihye Kim; Junho Kuk; Gun Hwa Park; Juhyeok Kim; Hanna Ryu; Jongbum Kim; Mira Roh; Jeonghun Baek; Michael W Hunkapiller; Jonas Korlach; Jong-Yeon Shin; Changhoon Kim
Journal: Nature Date: 2016-10-05 Impact factor: 49.962

2. Fast and accurate genomic analyses using genome graphs.

Authors: Goran Rakocevic; Vladimir Semenyuk; Wan-Ping Lee; James Spencer; John Browning; Ivan J Johnson; Vladan Arsenijevic; Jelena Nadj; Kaushik Ghose; Maria C Suciu; Sun-Gou Ji; Gülfem Demir; Lizao Li; Berke Ç Toptaş; Alexey Dolgoborodov; Björn Pollex; Iosif Spulber; Irina Glotova; Péter Kómár; Andrew L Stachyra; Yilong Li; Milos Popovic; Morten Källberg; Amit Jain; Deniz Kural
Journal: Nat Genet Date: 2019-01-14 Impact factor: 38.330

3. Resolving the complexity of the human genome using single-molecule sequencing.

Authors: Mark J P Chaisson; John Huddleston; Megan Y Dennis; Peter H Sudmant; Maika Malig; Fereydoun Hormozdiari; Francesca Antonacci; Urvashi Surti; Richard Sandstrom; Matthew Boitano; Jane M Landolin; John A Stamatoyannopoulos; Michael W Hunkapiller; Jonas Korlach; Evan E Eichler
Journal: Nature Date: 2014-11-10 Impact factor: 49.962

4. Long-read sequencing and de novo assembly of a Chinese genome.

Authors: Lingling Shi; Yunfei Guo; Chengliang Dong; John Huddleston; Hui Yang; Xiaolu Han; Aisi Fu; Quan Li; Na Li; Siyi Gong; Katherine E Lintner; Qiong Ding; Zou Wang; Jiang Hu; Depeng Wang; Feng Wang; Lin Wang; Gholson J Lyon; Yongtao Guan; Yufeng Shen; Oleg V Evgrafov; James A Knowles; Francoise Thibaud-Nissen; Valerie Schneider; Chack-Yung Yu; Libing Zhou; Evan E Eichler; Kwok-Fai So; Kai Wang
Journal: Nat Commun Date: 2016-06-30 Impact factor: 14.919

5. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly.

Authors: Valerie A Schneider; Tina Graves-Lindsay; Kerstin Howe; Nathan Bouk; Hsiu-Chuan Chen; Paul A Kitts; Terence D Murphy; Kim D Pruitt; Françoise Thibaud-Nissen; Derek Albracht; Robert S Fulton; Milinn Kremitzki; Vincent Magrini; Chris Markovic; Sean McGrath; Karyn Meltz Steinberg; Kate Auger; William Chow; Joanna Collins; Glenn Harden; Timothy Hubbard; Sarah Pelan; Jared T Simpson; Glen Threadgold; James Torrance; Jonathan M Wood; Laura Clarke; Sergey Koren; Matthew Boitano; Paul Peluso; Heng Li; Chen-Shan Chin; Adam M Phillippy; Richard Durbin; Richard K Wilson; Paul Flicek; Evan E Eichler; Deanna M Church
Journal: Genome Res Date: 2017-04-10 Impact factor: 9.043

6. Discovery and genotyping of structural variation from long-read haploid genome sequence data.

Authors: John Huddleston; Mark J P Chaisson; Karyn Meltz Steinberg; Wes Warren; Kendra Hoekzema; David Gordon; Tina A Graves-Lindsay; Katherine M Munson; Zev N Kronenberg; Laura Vives; Paul Peluso; Matthew Boitano; Chen-Shin Chin; Jonas Korlach; Richard K Wilson; Evan E Eichler
Journal: Genome Res Date: 2016-11-28 Impact factor: 9.043

7. Variation graph toolkit improves read mapping by representing genetic variation in the reference.

Authors: Erik Garrison; Jouni Sirén; Adam M Novak; Glenn Hickey; Jordan M Eizenga; Eric T Dawson; William Jones; Shilpa Garg; Charles Markello; Michael F Lin; Benedict Paten; Richard Durbin
Journal: Nat Biotechnol Date: 2018-08-20 Impact factor: 54.908

8. Multi-platform discovery of haplotype-resolved structural variation in human genomes.

Authors: Mark J P Chaisson; Ashley D Sanders; Xuefang Zhao; Ankit Malhotra; David Porubsky; Tobias Rausch; Eugene J Gardner; Oscar L Rodriguez; Li Guo; Ryan L Collins; Xian Fan; Jia Wen; Robert E Handsaker; Susan Fairley; Zev N Kronenberg; Xiangmeng Kong; Fereydoun Hormozdiari; Dillon Lee; Aaron M Wenger; Alex R Hastie; Danny Antaki; Thomas Anantharaman; Peter A Audano; Harrison Brand; Stuart Cantsilieris; Han Cao; Eliza Cerveira; Chong Chen; Xintong Chen; Chen-Shan Chin; Zechen Chong; Nelson T Chuang; Christine C Lambert; Deanna M Church; Laura Clarke; Andrew Farrell; Joey Flores; Timur Galeev; David U Gorkin; Madhusudan Gujral; Victor Guryev; William Haynes Heaton; Jonas Korlach; Sushant Kumar; Jee Young Kwon; Ernest T Lam; Jong Eun Lee; Joyce Lee; Wan-Ping Lee; Sau Peng Lee; Shantao Li; Patrick Marks; Karine Viaud-Martinez; Sascha Meiers; Katherine M Munson; Fabio C P Navarro; Bradley J Nelson; Conor Nodzak; Amina Noor; Sofia Kyriazopoulou-Panagiotopoulou; Andy W C Pang; Yunjiang Qiu; Gabriel Rosanio; Mallory Ryan; Adrian Stütz; Diana C J Spierings; Alistair Ward; AnneMarie E Welch; Ming Xiao; Wei Xu; Chengsheng Zhang; Qihui Zhu; Xiangqun Zheng-Bradley; Ernesto Lowy; Sergei Yakneen; Steven McCarroll; Goo Jun; Li Ding; Chong Lek Koh; Bing Ren; Paul Flicek; Ken Chen; Mark B Gerstein; Pui-Yan Kwok; Peter M Lansdorp; Gabor T Marth; Jonathan Sebat; Xinghua Shi; Ali Bashir; Kai Ye; Scott E Devine; Michael E Talkowski; Ryan E Mills; Tobias Marschall; Jan O Korbel; Evan E Eichler; Charles Lee
Journal: Nat Commun Date: 2019-04-16 Impact factor: 17.694

9. Characterizing the Major Structural Variant Alleles of the Human Genome.

Authors: Peter A Audano; Arvis Sulovari; Tina A Graves-Lindsay; Stuart Cantsilieris; Melanie Sorensen; AnneMarie E Welch; Max L Dougherty; Bradley J Nelson; Ankeeta Shah; Susan K Dutcher; Wesley C Warren; Vincent Magrini; Sean D McGrath; Yang I Li; Richard K Wilson; Evan E Eichler
Journal: Cell Date: 2019-01-17 Impact factor: 41.582

10. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.

Authors: Swapan Mallick; Heng Li; Mark Lipson; Iain Mathieson; Melissa Gymrek; Fernando Racimo; Mengyao Zhao; Niru Chennagiri; Susanne Nordenfelt; Arti Tandon; Pontus Skoglund; Iosif Lazaridis; Sriram Sankararaman; Qiaomei Fu; Nadin Rohland; Gabriel Renaud; Yaniv Erlich; Thomas Willems; Carla Gallo; Jeffrey P Spence; Yun S Song; Giovanni Poletti; Francois Balloux; George van Driem; Peter de Knijff; Irene Gallego Romero; Aashish R Jha; Doron M Behar; Claudio M Bravi; Cristian Capelli; Tor Hervig; Andres Moreno-Estrada; Olga L Posukh; Elena Balanovska; Oleg Balanovsky; Sena Karachanak-Yankova; Hovhannes Sahakyan; Draga Toncheva; Levon Yepiskoposyan; Chris Tyler-Smith; Yali Xue; M Syafiq Abdullah; Andres Ruiz-Linares; Cynthia M Beall; Anna Di Rienzo; Choongwon Jeong; Elena B Starikovskaya; Ene Metspalu; Jüri Parik; Richard Villems; Brenna M Henn; Ugur Hodoglugil; Robert Mahley; Antti Sajantila; George Stamatoyannopoulos; Joseph T S Wee; Rita Khusainova; Elza Khusnutdinova; Sergey Litvinov; George Ayodo; David Comas; Michael F Hammer; Toomas Kivisild; William Klitz; Cheryl A Winkler; Damian Labuda; Michael Bamshad; Lynn B Jorde; Sarah A Tishkoff; W Scott Watkins; Mait Metspalu; Stanislav Dryomov; Rem Sukernik; Lalji Singh; Kumarasamy Thangaraj; Svante Pääbo; Janet Kelso; Nick Patterson; David Reich
Journal: Nature Date: 2016-09-21 Impact factor: 49.962
