
Mis-assembled "segmental duplications" in two versions of the Bos taurus genome.

Aleksey V Zimin, David R Kelley, Michael Roberts, Guillaume Marçais, Steven L Salzberg, James A Yorke.

Abstract

We analyzed the whole-genome sequence coverage in two versions of the Bos taurus genome and identified all regions longer than five kilobases (Kbp) that are duplicated within chromosomes with >99% sequence fidelity between the copies. We call these regions High Fidelity Duplications (HFDs). The two assemblies were Btau 4.2, produced by the Human Genome Sequencing Center at Baylor College of Medicine, and UMD Bos taurus 3.1 (UMD 3.1), produced by our group at the University of Maryland. We found that Btau 4.2 has a far greater number of HFDs: 3,111 versus only 69 in UMD 3.1. Read coverage analysis shows that 39 million base pairs (Mbp) of sequence in HFDs in Btau 4.2 appear to result from mis-assembly and therefore do not qualify as segmental duplications. UMD 3.1 has only 0.41 Mbp of sequence in HFDs that are due to mis-assembly.

Year:  2012        PMID: 22880081      PMCID: PMC3411808          DOI: 10.1371/journal.pone.0042680

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Duplications: a challenge to genome assembly

Segmental duplications have been the focus of much of the biological analysis of mammalian genomes [1], [2]. They play an important role in the evolution of many species, often providing a substrate for the development of new gene functions. Such duplications present a challenge to genome assembly, particularly when the duplications are recent and the copied sequences are near-identical. Assembly programs sometimes collapse nearby duplications into a single copy, or erroneously incorporate multiple copies of a unique sequence into the assembly. Erroneous duplications can be caused by divergent regions in a diploid genome, in which the two haplotypes are sufficiently different that the assembler fails to merge them. Identifying such mis-assemblies is critically important for downstream biological analysis.

Two assemblies of the Bos taurus genome

In April 2009, two assemblies of the Bos taurus genome were published simultaneously: Btau 4.0 by the Baylor College of Medicine [3] and UMD Bos taurus 2.0 (UMD 2.0) by the University of Maryland [4]. These assemblies have since been updated, and the current versions are Btau 4.2, available from the sequencing center's website at ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/fasta/Btau20080815/, and UMD Bos taurus 3.1 (UMD 3.1), available from GenBank as accession DAAA00000000.2. We note that Btau 4.2 has only minor differences from the published Btau 4.0 assembly; the primary update was the replacement of selected contigs by finished BAC sequences. In this paper we analyze the latest available versions of both assemblies.

Results

High Fidelity Duplications (HFDs) in the two assemblies of the Bos taurus genome

One striking difference between the assemblies is the disparity in the number of large regions of sequence that are duplicated within the chromosomes with high fidelity between copies. We defined a High Fidelity Duplication (HFD) as any region >5 Kbp in length occurring in two copies in the assembly, such that the copies are >99% identical to each other and reside on the same chromosome. To find the HFDs, we used the Nucmer software [5] to map each assembly to itself and looked for non-overlapping self-matches longer than 5 Kbp with at least 99% identity. Btau 4.2 has 3,111 HFDs, while UMD 3.1 has 69. More surprisingly, only 2 of these HFDs appear in both assemblies. The Btau 4.2 regions cover 83 Mbp of sequence, while the UMD 3.1 duplications cover 1.3 Mbp. In this paper we present an analysis showing that almost all HFDs in Btau 4.2 and some in UMD 3.1 are assembly artifacts and should therefore be ignored in biological analysis.

Figure 1 shows the histograms of coverage for all HFDs in which the two assemblies disagree about copy number; i.e., at least one of the assemblies is incorrect. We created the set B1U2 containing the regions with exactly one copy in Btau 4.2 and two copies in the UMD 3.1 assembly; conversely, we created the set B2U1 containing the regions with two copies in Btau 4.2 and one copy in UMD 3.1. We show the distributions of read coverage for regions in B1U2 (dashed line) and B2U1 (solid line) as percentages of all regions. (Note that B2U1 is a much larger set, with 3,111 regions versus just 69 regions in B1U2.)

Based on this WGS coverage statistic, 47 of the 69 regions (68%) in B1U2 are more likely to be true segmental duplications, suggesting that the UMD 3.1 assembly is correct for these regions. In contrast, only 187 out of 3,111 regions (6%) in B2U1 appear to be true duplications, indicating that Btau 4.2 has a large number of erroneously duplicated sequences.
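
The HFD criteria above can be illustrated with a small filter. This is only a sketch, not the authors' pipeline: it assumes the self-alignment output has already been parsed into records, and the `SelfMatch` class, its field names, and the helper functions are hypothetical. It also assumes coordinates are normalized so that start <= end, and it uses the thresholds from the Nucmer step (longer than 5 Kbp, at least 99% identity).

```python
from dataclasses import dataclass

MIN_LENGTH = 5_000     # HFDs must be longer than 5 Kbp
MIN_IDENTITY = 99.0    # percent identity required between the two copies


@dataclass(frozen=True)
class SelfMatch:
    """One self-alignment record (hypothetical layout, 1-based inclusive coordinates)."""
    chrom_a: str
    start_a: int
    end_a: int
    chrom_b: str
    start_b: int
    end_b: int
    identity: float  # percent identity of the aligned copies


def copies_overlap(m: SelfMatch) -> bool:
    """True if the two copies share any bases on the same chromosome."""
    return (m.chrom_a == m.chrom_b
            and m.start_a <= m.end_b
            and m.start_b <= m.end_a)


def is_hfd(m: SelfMatch) -> bool:
    """Apply the HFD criteria: same chromosome, non-overlapping, long, high identity."""
    # Use the shorter copy so both copies satisfy the length requirement.
    length = min(m.end_a - m.start_a + 1, m.end_b - m.start_b + 1)
    return (m.chrom_a == m.chrom_b
            and not copies_overlap(m)
            and length > MIN_LENGTH
            and m.identity >= MIN_IDENTITY)
```

Requiring non-overlapping copies also discards the trivial match of every region to itself, which any self-alignment produces.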

Figure 1. Histogram of the percentage of HFDs that belong to (i) set B2U1, duplicated in Btau 4.2 and single copy in UMD Bos taurus 3.1 (solid line), and (ii) set B1U2, single copy in Btau 4.2 and duplicated in UMD Bos taurus 3.1 (dashed line).

The area under each curve integrates to 100%. The histograms were computed by mapping the WGS reads to both assemblies. The average WGS read coverage of the assemblies is 5.9. The solid vertical line is placed at 5.9/ln(2), the coverage at which it is equally likely that a region occurs in two copies versus one. 47 of the 69 regions (68%) in B1U2 are on the right hand side of the line and thus they are more likely to be true segmental duplications. 94% of the 3,111 HFDs in Btau 4.2 (set B2U1) are more likely to be unique in the genome and thus probably represent assembly errors in Btau 4.2.
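
The position of the vertical line can be recovered with a short derivation. This is a sketch under two assumptions: the Poisson coverage model described in the Methods, and equal prior weight on the one-copy and two-copy hypotheses. The symbols λ (mean single-copy coverage) and c (observed coverage of a region) are introduced here for illustration. Setting the two Poisson likelihoods equal gives:

```latex
\[
\frac{e^{-\lambda}\,\lambda^{c}}{c!} \;=\; \frac{e^{-2\lambda}\,(2\lambda)^{c}}{c!}
\quad\Longrightarrow\quad
e^{\lambda} = 2^{c}
\quad\Longrightarrow\quad
c = \frac{\lambda}{\ln 2} = \frac{5.9}{\ln 2} \approx 8.5
\]
```

Regions with coverage above roughly 8.5 are therefore better explained by two copies, and regions below it by a single copy.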

Independent validation of false duplications in Btau 4.0

The authors of the Btau 4.0 paper, the Bovine Genome Sequencing and Analysis Consortium (BGSAC) [3], devote part of their paper to discussing the biological implications of the segmental duplications in their assembly. However, in the online supplement, they remark that many of their duplications are likely a product of mis-assembly: “A total of 1,860 pairwise alignments (>20 kbp, >94% identity) corresponding to 92.45 Mbp of apparent duplicated sequence in Btau 4.0 could not be substantiated by WSSD.” Note that these duplicated sequences were omitted from the main analysis, but they are still present in the 4.0 and 4.2 assemblies. Our analysis suggests that the problem is even more extensive, since 84% of the regions that we analyzed are shorter than 20 Kbp (but longer than 5 Kbp; see the definition of the HFD above) and therefore had to be included in the main analysis.

These indications of erroneous duplications in the Btau 4.0 assembly are supported by a recent independent study by Liu et al. [6], which examined intra-chromosomal duplication patterns in the Bos taurus genome using fluorescent in situ hybridization (FISH). They compared Btau 4.0 and UMD 2.0 by analyzing 13 segments of the genome that were duplicated in only one of the assemblies. The FISH results were consistent with the UMD 2.0 assembly at 10 of 13 sites, while only 2 of 13 were consistent with Btau 4.0.

Methods

Evaluating the read coverage of duplicated regions in Btau 4.2 and UMD 3.1

To determine which HFD regions are likely to be truly duplicated in the genome, we examined their whole-genome shotgun (WGS) read coverage. We aligned the WGS read sequences to each region that was an HFD in either assembly and calculated that region's read coverage, shown in Figure 1. We used the Nucmer software to align the reads to the HFD sequences. We used a single copy of each HFD sequence because duplicate copies differed by less than 1% by definition. We accepted all matches with >94% identity over 90% of the read length. Mate pair information was not used. We then computed the coverage for each copy of the HFD as the total number of bases in the reads that match the HFD sequence divided by the length of the HFD.

Next, we compared the individual read coverage of each HFD to the mean WGS read coverage over the entire genome, which was approximately 5.9×. In our analysis we assumed that WGS reads, which provided two thirds of the sequence data, were distributed nearly uniformly over the genome. Under this assumption, the coverage of the HFDs should have a Poisson distribution with a mean coverage of 5.9 in unique regions. If a sequence is truly duplicated in the genome and all reads are aligned to a single copy of that sequence, then the expected coverage would be twice the normal coverage, or about 11.8.
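
The coverage statistic and the one-copy versus two-copy decision can be sketched as follows. This is an illustration rather than the code used in the study: the function names and input layout are hypothetical, "over 90% of the read length" is interpreted as at least 90% (an assumption), and the decision threshold is the equal-likelihood point 5.9/ln(2) given in the Figure 1 caption.

```python
import math

MEAN_COVERAGE = 5.9         # mean WGS read coverage reported for the genome
MIN_READ_IDENTITY = 94.0    # matches must exceed this percent identity
MIN_READ_FRACTION = 0.90    # assumed: match must span at least 90% of the read


def accept_read_match(identity: float, aligned_length: int, read_length: int) -> bool:
    """Keep a read-to-HFD match only if it passes the identity and length filters."""
    return (identity > MIN_READ_IDENTITY
            and aligned_length >= MIN_READ_FRACTION * read_length)


def region_coverage(matched_bases_per_read: list[int], region_length: int) -> float:
    """Coverage = total read bases matching the HFD divided by the HFD length."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return sum(matched_bases_per_read) / region_length


def two_copy_threshold(mean_coverage: float = MEAN_COVERAGE) -> float:
    """Coverage at which Poisson(mean) and Poisson(2 * mean) are equally likely."""
    return mean_coverage / math.log(2)


def likely_duplicated(coverage: float, mean_coverage: float = MEAN_COVERAGE) -> bool:
    """True if the two-copy hypothesis is the more likely explanation."""
    return coverage > two_copy_threshold(mean_coverage)
```

For example, a 6,000 bp HFD matched by 100 reads of 500 bases each has coverage of about 8.3, just under the threshold of about 8.5, so it would be classified as more likely single-copy.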

Conclusions

Our analysis implies that the BCM Btau 4.2 assembly contains at least 39 Mbp of intra-chromosomal duplicated sequence that appears to be single-copy in the genome. In contrast, UMD 3.1 has only 0.41 Mbp that appear to be erroneously duplicated. A possible explanation for the excess duplications in Btau 4.2 can be found in the BAC-based assembly strategy used to construct it. The authors used a hybrid approach in which they first assembled bacterial artificial chromosomes (BACs) and then merged the BAC assemblies. Because the BACs were sequenced from either haplotype, when two overlapping BACs represented different haplotypes, sequence divergence might have prevented the assembly software from correctly merging them, and instead the BACs were assembled in adjacent, non-overlapping locations. This would create nearly identical duplicated sequences within chromosomes. We could not verify this conjecture because we do not have access to assembly sequences of individual BACs. Scientists analyzing the Btau 4.2 version of the Bos taurus genome may need to gather additional, independent evidence before assuming that duplications in the assembly represent the true genome.

References

1.  Shotgun sequence assembly and recent segmental duplications within the human genome.

Authors:  Xinwei She; Zhaoshi Jiang; Royden A Clark; Ge Liu; Ze Cheng; Eray Tuzun; Deanna M Church; Granger Sutton; Aaron L Halpern; Evan E Eichler
Journal:  Nature       Date:  2004-10-21       Impact factor: 49.962

2.  Unlocking the bovine genome.

Authors:  Ross L Tellam; Danielle G Lemay; Curtis P Van Tassell; Harris A Lewin; Kim C Worley; Christine G Elsik
Journal:  BMC Genomics       Date:  2009-04-24       Impact factor: 3.969

3.  The genome sequence of taurine cattle: a window to ruminant biology and evolution.

Authors:  Bovine Genome Sequencing and Analysis Consortium; Christine G Elsik; Ross L Tellam; Kim C Worley; et al.
Journal:  Science       Date:  2009-04-24       Impact factor: 47.728

4.  A whole-genome assembly of the domestic cow, Bos taurus.

Authors:  Aleksey V Zimin; Arthur L Delcher; Liliana Florea; David R Kelley; Michael C Schatz; Daniela Puiu; Finnian Hanrahan; Geo Pertea; Curtis P Van Tassell; Tad S Sonstegard; Guillaume Marçais; Michael Roberts; Poorani Subramanian; James A Yorke; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-04-24       Impact factor: 13.583

5.  Versatile and open software for comparing large genomes.

Authors:  Stefan Kurtz; Adam Phillippy; Arthur L Delcher; Michael Smoot; Martin Shumway; Corina Antonescu; Steven L Salzberg
Journal:  Genome Biol       Date:  2004-01-30       Impact factor: 13.583

6.  Analysis of recent segmental duplications in the bovine genome.

Authors:  George E Liu; Mario Ventura; Angelo Cellamare; Lin Chen; Ze Cheng; Bin Zhu; Congjun Li; Jiuzhou Song; Evan E Eichler
Journal:  BMC Genomics       Date:  2009-12-01       Impact factor: 3.969