
Copy number variant detection in inbred strains from short read sequence data.

Jared T Simpson, Rebecca E McIntyre, David J Adams, Richard Durbin.

Abstract

SUMMARY: We have developed an algorithm to detect copy number variants (CNVs) in homozygous organisms, such as inbred laboratory strains of mice, from short read sequence data. Our novel approach exploits the fact that inbred mice are homozygous at virtually every position in the genome to detect CNVs using a hidden Markov model (HMM). This HMM uses both the density of sequence reads mapped to the genome, and the rate of apparent heterozygous single nucleotide polymorphisms, to determine genomic copy number. We tested our algorithm on short read sequence data generated from re-sequencing chromosome 17 of the mouse strains A/J and CAST/EiJ with the Illumina platform. In total, we identified 118 copy number variants (43 for A/J and 75 for CAST/EiJ). We investigated the performance of our algorithm through comparison to CNVs previously identified by array-comparative genomic hybridization (array CGH). We performed quantitative-PCR validation on a subset of the calls that differed from the array CGH data sets.

Year: 2009 PMID: 20022973 PMCID: PMC2820678 DOI: 10.1093/bioinformatics/btp693

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Copy number variants (CNVs) are segments of DNA that have been duplicated, or lost, in the genome of one individual or strain with respect to another. CNVs are thought to contribute significantly to phenotypic differences between mouse strains. In humans, CNVs have been causally linked to a range of disorders including schizophrenia (Moon et al., 2006), autism (Sebat et al., 2007) and birth defect syndromes (Lu et al., 2008). High-resolution surveys for CNVs have been performed in common laboratory strains of mice using array-comparative genomic hybridization (array CGH, or aCGH) (Cahan et al., 2009; Cutler et al., 2007; Graubert et al., 2007; Henrichsen et al., 2009; She et al., 2008). These studies have found a significant level of variation between strains, such that as much as 15% of the reference C57BL/6J mouse genome may be found as CNVs in another strain. While array CGH can be an effective way of identifying CNVs, aCGH studies are limited in resolution by the number of probes that can be placed on a microarray. The widespread adoption of short read sequencing platforms has led to a rapid decrease in the cost of whole-genome re-sequencing, making it a viable alternative to array CGH (Xie and Tammi, 2009). Hidden Markov models (HMMs) have previously been used to detect copy number variation from array CGH data (Cahan et al., 2008; Fridlyand et al., 2004). We have developed an HMM to detect CNVs in inbred strains from the alignments of short read sequences to a reference genome.

2 DESCRIPTION

The central idea behind our model is that the alignment of reads from regions with copy number gains (with respect to a reference genome) will be ‘collapsed’ to a single location on the reference genome. The effect of this will be 2-fold. First, the sequence depth of this location on the reference genome will be increased by an integral amount corresponding to the relative number of copies that exist in the sequenced strain. Second, any base-pair differences between the copied regions will appear to be heterozygous single nucleotide polymorphisms (SNPs) with respect to the reference. This fact is crucial to our model, as laboratory strains of mice are inbred to be effectively homozygous at every position in the genome; hence any apparent heterozygous SNPs that are not sequencing errors are actually paralogous sequence variants and therefore define regions collapsed in the reference genome. Conversely, the alignment of reads from regions with copy number losses in the sequenced genome will be distributed over the corresponding copies in the reference genome and hence the reference regions will have lower sequence depth, with the important distinction that there will not be a heterozygous SNP signal. Our HMM exploits these factors to detect regions of copy number gain and loss.

Our algorithm proceeds in three stages. First, the sequence reads are aligned to the mouse reference genome (build NCBI 37, Mouse Genome Sequencing Consortium, Waterston et al., 2002) using the MAQ aligner (Li et al., 2008). MAQ calls SNPs and classifies them as homozygous or heterozygous. Second, summary statistics are computed for the sequence read depth, the number of heterozygous SNPs and the average number of hits per read over 1 kb windows of the reference genome sequence. Third, this triplet of data for each 1 kb region of the reference genome is input to the HMM, which classifies each region as corresponding to a gain, loss or no change in copy number.
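As an illustration of the windowing stage, the following is a minimal Python sketch of how a per-window triplet (read depth, heterozygous SNP count, mean hits per read) might be computed. It is not the authors' implementation. The function name `window_stats`, the input tuples and the rule of assigning each read to the window containing its start coordinate are all assumptions made for this example; parsing of MAQ output is omitted.

```python
WINDOW = 1000  # 1 kb windows, as described in the text


def window_stats(reads, het_snps, chrom_len, window=WINDOW):
    """Summarise alignments over fixed-size windows (hypothetical helper).

    reads     : iterable of (start, length, n_hits), 0-based start coordinate,
                n_hits = number of places the read aligns in the reference
    het_snps  : iterable of 0-based positions of apparent heterozygous SNPs
    chrom_len : length of the reference chromosome in bp
    Returns one (depth, het_snp_count, mean_hits) tuple per window.
    """
    n = (chrom_len + window - 1) // window
    bases = [0] * n      # aligned bases credited to each window
    hits_sum = [0] * n   # sum of n_hits over reads in each window
    n_reads = [0] * n
    snps = [0] * n

    for start, length, n_hits in reads:
        w = start // window
        if not 0 <= w < n:
            continue
        # Simplification: the whole read is credited to the window
        # containing its start, even if it crosses a window boundary.
        bases[w] += length
        hits_sum[w] += n_hits
        n_reads[w] += 1

    for pos in het_snps:
        w = pos // window
        if 0 <= w < n:
            snps[w] += 1

    stats = []
    for w in range(n):
        span = min(window, chrom_len - w * window)  # last window may be short
        depth = bases[w] / span
        mean_hits = hits_sum[w] / n_reads[w] if n_reads[w] else 0.0
        stats.append((depth, snps[w], mean_hits))
    return stats
```

Calling `window_stats(reads, het_snps, chrom_len)` yields one triplet per 1 kb window, which is the form of input the HMM is described as consuming.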

2.1 The HMM

We developed a 10-state HMM of the copy number structure of the genome being sequenced. There are five major states of the model, representing normal sequence, a 2-fold increase in copy number, a 3-fold increase in copy number, a 2-fold decrease in copy number and zero copy number. In addition, each major state of the model has a sub-state corresponding to highly repetitive sequence, allowing the model to accommodate the frequent high-copy repeat elements dispersed throughout mammalian genomes. In all states except the repeat states, the depth distribution is modeled by a normal distribution with the mean and variance reflecting the copy number of the state. For states representing a copy number gain, the heterozygous SNP rate is modeled by a negative binomial distribution. The heterozygous SNP rate is modeled by a Poisson distribution in all other states. More information about the HMM and emission distributions is given in the supplemental material. The parameters of the model are learned for each chromosome in the input data set by Viterbi training for both the transition probabilities and emission distribution parameters (Durbin et al., 1998). After the model parameters have been determined, the sequence of states is computed by a final application of the Viterbi algorithm. The output of the Viterbi algorithm is processed to extract contiguous regions of gain or loss. The minimum threshold for detection is the input window size, typically one kilobase. There is a final optional filtering step to remove calls below a minimum size threshold.
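The decoding and call-extraction steps can be sketched as follows. This is a deliberately simplified illustration, not the published model: it uses three states (loss, normal, gain) rather than ten, omits the repeat sub-states and the hits-per-read signal, and uses hand-picked parameters instead of Viterbi training. All numeric values, the `stay` probability and the function names are assumptions for this example. The emission families follow the text: normal for depth, negative binomial for the heterozygous SNP count in the gain state, Poisson elsewhere.

```python
import math

BASE_DEPTH = 20.0  # assumed single-copy depth (hypothetical value)
STATES = ["loss", "normal", "gain"]
# (mean, sd) of the normal depth emission per state; values are illustrative
DEPTH = {"loss": (0.5 * BASE_DEPTH, 3.0),
         "normal": (BASE_DEPTH, 4.0),
         "gain": (2.0 * BASE_DEPTH, 6.0)}


def log_normal(x, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)


def log_poisson(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_negbin(k, r, p):
    # Negative binomial with size r and success probability p
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log(1 - p))


def log_emission(state, depth, het):
    mu, sd = DEPTH[state]
    lp = log_normal(depth, mu, sd)
    if state == "gain":
        lp += log_negbin(het, r=2.0, p=0.4)   # assumed parameters
    else:
        lp += log_poisson(het, 0.1)           # assumed background rate
    return lp


def viterbi(obs, stay=0.99):
    """Most probable state path for obs = [(depth, het_snp_count), ...]."""
    if not obs:
        return []
    n = len(STATES)
    switch = (1.0 - stay) / (n - 1)
    log_t = [[math.log(stay if i == j else switch) for j in range(n)]
             for i in range(n)]
    score = [math.log(1.0 / n) + log_emission(s, *obs[0]) for s in STATES]
    back = []
    for depth, het in obs[1:]:
        prev, score, ptr = score, [], []
        for j, s in enumerate(STATES):
            best = max(range(n), key=lambda i: prev[i] + log_t[i][j])
            ptr.append(best)
            score.append(prev[best] + log_t[best][j] + log_emission(s, depth, het))
        back.append(ptr)
    path = [max(range(n), key=lambda i: score[i])]
    for ptr in reversed(back):          # trace back from the last window
        path.append(ptr[path[-1]])
    path.reverse()
    return [STATES[i] for i in path]


def call_segments(path, window=1000, min_size=0):
    """Collapse runs of identical non-normal states into (start, end, state)."""
    segments, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            if path[start] != "normal" and (i - start) * window >= min_size:
                segments.append((start * window, i * window, path[start]))
            start = i
    return segments
```

A typical use would be `call_segments(viterbi(obs), min_size=10_000)`, where `min_size` plays the role of the optional size filter. Working in log space avoids numerical underflow on long chromosomes.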

3 RESULTS

We tested our model on Illumina short read sequence data from chromosome 17 for the A/J and CAST/EiJ strains of mouse that were sequenced to 22- and 34-fold, respectively (ERA accession number ERA000077). The data sets were generated using 36-bp paired-end reads of 200-bp insert libraries. For this experiment, we set a minimum call size threshold of 10 kb (see Supplementary data). We evaluated our calls against a collection of previously published aCGH copy number variation data (Cahan et al., 2009; Cutler et al., 2007; Henrichsen et al., 2009; She et al., 2008). Our algorithm called 22 copy number gains (1.38 Mb of sequence) and 21 losses (0.49 Mb) for the A/J data set (see Fig. 1 and Supplementary Fig. 6 for example regions). The gain regions overlap 38% of the regions identified by aCGH (36% by sequence, 1.1 Mb). Seventy-seven percent of the gains found by our algorithm (cnD) were previously seen by aCGH. For CAST/EiJ, 45 gains (2.44 Mb of sequence) and 30 losses (1.16 Mb) were called. The gain regions overlap 76% of the gains called by aCGH (79% by sequence, 1.3 Mb). Thirty-six percent of the gains found by cnD were previously seen in the array CGH data set. This figure is much lower than that of A/J due to the fact that the CAST/EiJ strain was not used in the highest coverage aCGH study (Cahan et al., 2009). In both strains the regions of copy number loss called by our algorithm and aCGH differed widely (11% concordance by region for A/J and 32% for CAST/EiJ) owing to the relative difficulty of calling CNV losses compared to gains. We performed qPCR validation on a subset of both the gain calls that were novel to our algorithm (those not found by aCGH) and the novel gain calls found by aCGH. In total we attempted validation on 20 novel cnD gains, of which five were confirmed to be amplified relative to C57BL/6J. Of the 14 novel aCGH gains that we attempted to validate, one was confirmed to be a gain relative to C57BL/6J. Our concordance with array CGH and initial confirmation rates are similar to previously published copy number variation studies (Conrad et al., 2009; Redon et al., 2006; Scherer et al., 2007). Full details of the experimental validation are provided in the Supplementary data.
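The two kinds of concordance quoted above (by region and by sequence) can be computed along the following lines. This is a sketch under stated assumptions, not the procedure used in the study: the text does not say what counts as an overlap, so here any shared base pair counts as a region-level hit, intervals are half-open `(start, end)` pairs on one chromosome, and the reference regions are assumed not to overlap each other. The function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_bp(a, b):
    """Number of base pairs shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def concordance(calls, reference):
    """Return (fraction of reference regions hit, fraction of reference bp covered).

    calls     : list of (start, end) regions called from sequence data
    reference : list of (start, end) regions from another method, e.g. aCGH
    """
    merged_calls = merge(calls)  # avoids counting shared bases twice
    regions_hit = 0
    bp_covered = 0
    bp_total = 0
    for ref in reference:
        shared = sum(overlap_bp(ref, call) for call in merged_calls)
        regions_hit += shared > 0
        bp_covered += shared
        bp_total += ref[1] - ref[0]
    return regions_hit / len(reference), bp_covered / bp_total
```

Swapping the two arguments gives the converse quantity, that is, the fraction of sequence-based calls that were previously seen by the other method. A stricter criterion such as reciprocal overlap would lower both figures.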

Fig. 1. (A) Plot of sequencing depth across a one megabase region of A/J chromosome 17 clearly shows both a region of 3-fold increased copy number (30.6–31.1 Mb) and a region of decreased copy number (at 31.3 Mb). The solid black line above the depth plot indicates the called copy number gain and the solid black line below the plot indicates the called copy number loss. (B) Plot of the heterozygous SNP rate for the same region showing the high number of apparent heterozygous SNPs associated with the copy number gain.

REFERENCES (10 of 15 shown)

1. Initial sequencing and comparative analysis of the mouse genome.

Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; 
Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

2. Identification of DNA copy-number aberrations by array-comparative genomic hybridization in patients with schizophrenia.

Authors: Ho Jin Moon; Sung-Vin Yim; Woon Kyu Lee; Yang-Whan Jeon; Young Hoon Kim; Young Jin Ko; Kwang-Soo Lee; Kweon-Haeng Lee; Sang-Ick Han; Hyoung Kyun Rha
Journal: Biochem Biophys Res Commun Date: 2006-04-03 Impact factor: 3.575

3. Segmental copy number variation shapes tissue transcriptomes.

Authors: Charlotte N Henrichsen; Nicolas Vinckenbosch; Sebastian Zöllner; Evelyne Chaignat; Sylvain Pradervand; Frédéric Schütz; Manuel Ruedi; Henrik Kaessmann; Alexandre Reymond
Journal: Nat Genet Date: 2009-03-08 Impact factor: 38.330

4. Global variation in copy number in the human genome.

Authors: Richard Redon; Shumpei Ishikawa; Karen R Fitch; Lars Feuk; George H Perry; T Daniel Andrews; Heike Fiegler; Michael H Shapero; Andrew R Carson; Wenwei Chen; Eun Kyung Cho; Stephanie Dallaire; Jennifer L Freeman; Juan R González; Mònica Gratacòs; Jing Huang; Dimitrios Kalaitzopoulos; Daisuke Komura; Jeffrey R MacDonald; Christian R Marshall; Rui Mei; Lyndal Montgomery; Kunihiro Nishimura; Kohji Okamura; Fan Shen; Martin J Somerville; Joelle Tchinda; Armand Valsesia; Cara Woodwark; Fengtang Yang; Junjun Zhang; Tatiana Zerjal; Jane Zhang; Lluis Armengol; Donald F Conrad; Xavier Estivill; Chris Tyler-Smith; Nigel P Carter; Hiroyuki Aburatani; Charles Lee; Keith W Jones; Stephen W Scherer; Matthew E Hurles
Journal: Nature Date: 2006-11-23 Impact factor: 49.962

5. Significant gene content variation characterizes the genomes of inbred mouse strains.

Authors: Gene Cutler; Lisa A Marshall; Ni Chin; Helene Baribault; Paul D Kassner
Journal: Genome Res Date: 2007-11-07 Impact factor: 9.043

6. Challenges and standards in integrating surveys of structural variation.

Authors: Stephen W Scherer; Charles Lee; Ewan Birney; David M Altshuler; Evan E Eichler; Nigel P Carter; Matthew E Hurles; Lars Feuk
Journal: Nat Genet Date: 2007-07 Impact factor: 38.330

7. Origins and functional impact of copy number variation in the human genome.

Authors: Donald F Conrad; Dalila Pinto; Richard Redon; Lars Feuk; Omer Gokcumen; Yujun Zhang; Jan Aerts; T Daniel Andrews; Chris Barnes; Peter Campbell; Tomas Fitzgerald; Min Hu; Chun Hwa Ihm; Kati Kristiansson; Daniel G Macarthur; Jeffrey R Macdonald; Ifejinelo Onyiah; Andy Wing Chun Pang; Sam Robson; Kathy Stirrups; Armand Valsesia; Klaudia Walter; John Wei; Chris Tyler-Smith; Nigel P Carter; Charles Lee; Stephen W Scherer; Matthew E Hurles
Journal: Nature Date: 2009-10-07 Impact factor: 49.962

8. Strong association of de novo copy number mutations with autism.

Authors: Jonathan Sebat; B Lakshmi; Dheeraj Malhotra; Jennifer Troge; Christa Lese-Martin; Tom Walsh; Boris Yamrom; Seungtai Yoon; Alex Krasnitz; Jude Kendall; Anthony Leotta; Deepa Pai; Ray Zhang; Yoon-Ha Lee; James Hicks; Sarah J Spence; Annette T Lee; Kaija Puura; Terho Lehtimäki; David Ledbetter; Peter K Gregersen; Joel Bregman; James S Sutcliffe; Vaidehi Jobanputra; Wendy Chung; Dorothy Warburton; Mary-Claire King; David Skuse; Daniel H Geschwind; T Conrad Gilliam; Kenny Ye; Michael Wigler
Journal: Science Date: 2007-03-15 Impact factor: 47.728

9. A high-resolution map of segmental DNA copy number variation in the mouse genome.

Authors: Timothy A Graubert; Patrick Cahan; Deepa Edwin; Rebecca R Selzer; Todd A Richmond; Peggy S Eis; William D Shannon; Xia Li; Howard L McLeod; James M Cheverud; Timothy J Ley
Journal: PLoS Genet Date: 2006-11-22 Impact factor: 5.917

10. The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells.

Authors: Patrick Cahan; Yedda Li; Masayo Izumi; Timothy A Graubert
Journal: Nat Genet Date: 2009-03-08 Impact factor: 38.330
