Birte Kehr¹, Páll Melsted², Bjarni V Halldórsson³
1. deCODE genetics/Amgen, Reykjavík, Iceland.
2. deCODE genetics/Amgen, Reykjavík, Iceland; Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland.
3. deCODE genetics/Amgen, Reykjavík, Iceland; Institute of Biomedical and Neural Engineering, Reykjavík University, Reykjavík, Iceland.
Abstract
MOTIVATION: The detection of genomic structural variation (SV) has advanced tremendously in recent years due to progress in high-throughput sequencing technologies. Novel sequence insertions, insertions without similarity to a human reference genome, have received less attention than other types of SVs due to the computational challenges in their detection from short read sequencing data, which inherently involves de novo assembly. De novo assembly is not only computationally challenging, but also requires high-quality data. Although the reads from a single individual may not always meet this requirement, using reads from multiple individuals can increase power to detect novel insertions.

RESULTS: We have developed the program PopIns, which can discover and characterize non-reference insertions of 100 bp or longer on a population scale. In this article, we describe the approach we implemented in PopIns. It takes as input a reads-to-reference alignment, assembles unaligned reads using a standard assembly tool, merges the contigs of different individuals into high-confidence sequences, anchors the merged sequences into the reference genome, and finally genotypes all individuals for the discovered insertions. Our tests on simulated data indicate that the merging step greatly improves the quality and reliability of predicted insertions and that PopIns shows significantly better recall and precision than the recent tool MindTheGap. Preliminary results on a dataset of 305 Icelanders demonstrate the practicality of the new approach.

AVAILABILITY AND IMPLEMENTATION: The source code of PopIns is available from http://github.com/bkehr/popins

CONTACT: birte.kehr@decode.is

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
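
The first step named in the abstract, collecting the reads that did not align to the reference, can be illustrated with a small sketch. This is not PopIns code. It assumes the alignment is available as SAM-format text and that "unaligned" means the SAM unmapped flag bit (0x4) is set. The function name `unaligned_reads` and the example records are hypothetical.

```python
UNMAPPED = 0x4  # SAM flag bit meaning "segment unmapped"


def unaligned_reads(sam_lines):
    """Yield (read name, sequence) for SAM records with the unmapped bit set.

    Assumption: each record is a tab-separated SAM line with the read name
    in column 1, the flag in column 2 and the sequence in column 10.
    """
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & UNMAPPED:
            yield name, seq


# Hypothetical records: r1 is aligned; r2 and r3 carry the unmapped bit.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGGCCAA\tIIIIIIII",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII",
]
selected = list(unaligned_reads(example))
```

In a pipeline of the kind the abstract describes, reads selected this way would be handed to an assembler. The later steps (merging, anchoring and genotyping) are not sketched here.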