Laura Gómez-Romero1, Kim Palacios-Flores2,3, José Reyes2, Delfino García2, Margareta Boege2,3, Guillermo Dávila2,3, Margarita Flores2,3, Michael C Schatz4,5, Rafael Palacios1,3.
Abstract
The precise determination of de novo genetic variants has enormous implications across different fields of biology and medicine, particularly personalized medicine. Currently, de novo variations are identified by mapping sample reads from a parent-offspring trio to a reference genome, allowing for a certain degree of differences. While widely used, this approach often introduces false-positive (FP) results due to misaligned reads and mischaracterized sequencing errors. In a previous study, we developed an alternative approach to accurately identify single nucleotide variants (SNVs) using only perfect matches. However, this approach could be applied only to haploid regions of the genome and was computationally intensive. In this study, we present a unique approach, coverage-based single nucleotide variant identification (COBASI), which allows the exploration of the entire genome using second-generation short sequence reads without extensive computing requirements. COBASI identifies SNVs using changes in coverage of exactly matching unique substrings, and is particularly suited for pinpointing de novo SNVs. Unlike other approaches that require population frequencies across hundreds of samples to filter out any methodological biases, COBASI can be applied to detect de novo SNVs within isolated families. We demonstrate this capability through extensive simulation studies and by studying a parent-offspring trio we sequenced using short reads. Experimental validation of all 58 candidate de novo SNVs and a selection of non-de novo SNVs found in the trio confirmed zero FP calls. COBASI is available as open source at https://github.com/Laura-Gomez/COBASI for any researcher to use.
Keywords: coverage map; de novo mutations; genomic algorithms; genomic landscape; human genome variation
Year: 2018 PMID: 29735690 PMCID: PMC6003530 DOI: 10.1073/pnas.1802244115
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Fig. 1. Rationale of the COBASI approach. (A) A specific nucleotide (large bold C) cannot be uniquely localized along the genome until its context is included in the search. (Left) The string to be searched; (Right) the number of positions at which such a string is found. The bottom string is a COIN-String (CS) of 30 nt. (B–D) (Upper) Schematic representation of sequence reads. (Lower) Specific regions of variation landscapes (VLs) for three scenarios. (B) No variation signal. (C) A heterozygous SNV variation signal. (D) A homozygous SNV variation signal. Black lines in B, C, and D represent reads from the genome project that contain the reference allele. Red lines represent reads from the genome project that contain the SNV allele. The sections of the VL in ref. 2 are represented by blue lines. The x axis indicates the genome position for every CS start. The y axis indicates the number of reads containing the CS sequence starting at that position.
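The rationale in Fig. 1A can be sketched in a few lines of Python. This is only an illustration of the idea that a nucleotide becomes uniquely localizable once enough context is included; the function names and the extension order (right first, then left) are assumptions, and COBASI itself works with fixed 30-nt COIN-Strings rather than a growing window.

```python
def count_occurrences(genome: str, query: str) -> int:
    """Count occurrences of query in genome, including overlapping ones."""
    count = 0
    start = genome.find(query)
    while start != -1:
        count += 1
        start = genome.find(query, start + 1)
    return count


def shortest_unique_context(genome: str, pos: int) -> str:
    """Grow a window starting at genome[pos] until it occurs exactly once.

    Hypothetical helper: the window is extended to the right while possible
    and then to the left. The full genome occurs exactly once in itself,
    so the loop always terminates.
    """
    left, right = pos, pos + 1
    while count_occurrences(genome, genome[left:right]) > 1:
        if right < len(genome):
            right += 1
        elif left > 0:
            left -= 1
        else:
            break
    return genome[left:right]
```

On a toy genome such as "ACGTACGA", the single nucleotide at position 0 matches several positions, and the window has to grow to four nucleotides before it matches only one.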
Fig. 2. Variation landscape transformation into a relative coverage landscape. (Left) A homozygous SNV is shown. (Right) A heterozygous SNV is shown. (A) The VL for a region composed of 30 nt upstream and 30 nt downstream of each VSR is shown. The plots show the start position of each CS in that genomic region (x axis) and the coverage for each CS (y axis). (B) The VL is turned into the RVL using the RCI. RCI_n refers to the relative coverage index for nucleotide n. C_n and C_{n+1} denote the number of reads that contain the CS starting at nucleotide n and the next downstream CS, respectively. (C) The RVL for the same regions shown in A. The plots show the start position of each CS (x axis) and RCI values associated with each CS (y axis). The VL and the RVL are represented by blue lines. The PrevCS and PostCS are shown as orange and yellow lines at the bottom of each plot, and their start positions are highlighted with dashed black vertical lines.
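The VL-to-RVL transformation described in Fig. 2B can be illustrated as follows. The caption names the inputs of the RCI (the coverage of the CS starting at nucleotide n and of the next downstream CS), but the formula itself is not reproduced in this record. The normalized difference used below is therefore an assumed stand-in, not necessarily the published definition, and the function names are hypothetical.

```python
def relative_coverage_index(c_n: int, c_next: int) -> float:
    """ASSUMED stand-in for the RCI: signed coverage change between two
    consecutive CSs, relative to the larger of the two coverages."""
    larger = max(c_n, c_next)
    if larger == 0:
        return 0.0  # no coverage on either side, so no signal
    return (c_n - c_next) / larger


def relative_landscape(coverage: list[int]) -> list[float]:
    """Turn a list of per-CS coverages (a VL region) into per-CS index values."""
    return [relative_coverage_index(a, b) for a, b in zip(coverage, coverage[1:])]
```

Under this assumption, a coverage drop to about half (as expected around a heterozygous SNV) and a drop to zero (as expected around a homozygous SNV) give index values of different magnitude, which is the kind of contrast the two columns of Fig. 2 depict.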
Fig. 3. The COBASI experimental pipeline for SNV discovery in one individual. (A, Left) Every overlapping 30-nt kmer (with a sliding window of 1 nt) along each of the reads of the sequencing project is obtained (only 3 kmers are shown per read). The counts for every kmer are stored in a database. Reads and read kmers are shown as gray and light gray lines, respectively. (A, Right) The CSs along the RG are obtained, and the start and end positions of all nonoverlapping unique regions are stored. RG and RG kmers are shown as purple and light purple lines, respectively. (B) The two virtual products are merged and the variation landscape (VL) is generated. (C) A region of the VL containing one heterozygous SNV is presented. The plot shows the start position of each CS along the genome (x axis) and each CS coverage (y axis). The VL is represented as a blue line. The VL is transformed into the RVL. Only the VL is depicted. The start position of the PrevCS and the PostCS are indicated by vertical orange and yellow lines, respectively. The PrevCS and PostCS are represented by horizontal orange and yellow lines, respectively. Some interCSs are shown as horizontal brown lines. The position of the SNV is shown as a red vertical line. All CSs located between the Prev- and PostCSs (interCSs) contain the SNV position. (D) The Prev- and PostCSs (signature CSs) are used as anchors to retrieve all of the reads of interest. (E) Each of the retrieved reads is then aligned with the corresponding region of the RG. An aligned read-RG region is shown. The SNV position and specific nucleotide are highlighted in a red rectangle.
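Steps A and B of Fig. 3 can be sketched in memory with a small Python example. This is a toy illustration under stated assumptions: the function names are hypothetical, a plain Counter stands in for the k-mer database, reverse-complement strands are ignored, and k-mer occurrences are counted rather than distinct reads. The actual COBASI implementation at the repository above may differ in all of these respects.

```python
from collections import Counter


def count_read_kmers(reads: list[str], k: int = 30) -> Counter:
    """Count every overlapping k-mer (sliding window of 1 nt) across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def variation_landscape(reference: str, read_counts: Counter, k: int = 30) -> dict[int, int]:
    """Map each start position of a unique reference k-mer to its read coverage.

    Reference k-mers that occur more than once are skipped, because their
    coverage cannot be attributed to a single genomic position.
    """
    n_kmers = len(reference) - k + 1
    ref_counts = Counter(reference[i:i + k] for i in range(n_kmers))
    landscape = {}
    for i in range(n_kmers):
        kmer = reference[i:i + k]
        if ref_counts[kmer] == 1:
            landscape[i] = read_counts.get(kmer, 0)
    return landscape
```

With real data, k would stay at 30 as in the caption; a small k is only useful for experimenting on toy sequences. Positions whose coverage falls relative to their neighbors are the ones that would then be examined as variation signals.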
Fig. 4. The COBASI experimental pipeline for SNV discovery in a family-based framework. (A) For each SNV in the child, its signature CSs are used as anchors to retrieve the corresponding reads in the parents. The reads are then aligned to the RG. (B) A catalog containing all child SNVs and the alleles found in each parent at the same positions is generated. The three genotypes are then compared, and the possible de novo SNVs are identified.
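The genotype comparison in Fig. 4B can be illustrated with a minimal Python sketch. It assumes a catalog keyed by (chromosome, position) that holds the child, father, and mother genotypes as two-letter allele strings. The catalog layout, function names, and example entries are hypothetical, and the rule shown (the child carries an allele seen in neither parent) is a simplification that does not cover every Mendelian inconsistency.

```python
Genotypes = tuple[str, str, str]  # (child, father, mother), e.g. ("CT", "CC", "CC")


def is_candidate_de_novo(child: str, father: str, mother: str) -> bool:
    """True if the child carries at least one allele absent from both parents."""
    parental_alleles = set(father) | set(mother)
    return any(allele not in parental_alleles for allele in child)


def find_candidates(catalog: dict[tuple[str, int], Genotypes]) -> list[tuple[str, int]]:
    """Return the catalog positions whose trio genotypes suggest a de novo SNV."""
    return [
        position
        for position, (child, father, mother) in catalog.items()
        if is_candidate_de_novo(child, father, mother)
    ]


# Hypothetical catalog entries, for illustration only
example_catalog = {
    ("chrA", 100): ("CT", "CC", "CT"),  # T inherited from the mother: Mendelian
    ("chrB", 200): ("AG", "AA", "AA"),  # G absent from both parents: candidate
}
```

Candidates found this way would still need the experimental confirmation described in the abstract before being reported as de novo SNVs.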
Fig. 5. Experimental example of the COBASI strategy in the family-based framework. (Left) A Mendelian SNV is shown. Position 1 in the plots corresponds to chrX position 8928409. (Right) A de novo SNV is shown. Position 1 in the plots corresponds to chr11 position 66915681. (A) The corresponding section of the VL is shown for each parent–offspring trio individual: the red, green, and purple lines correspond to the VL for the father, mother, and child, respectively. Since the Mendelian SNV is located on chrX, the father has around half the coverage of the mother. (B) The RVL is shown for both parents. (C) The RVL is shown for the child. (D) The nucleotide present in the RG is shown. (E) The chromatograms obtained by Sanger sequencing for these regions are shown. The genotypes obtained for each individual by the COBASI approach are shown in bold letters. An asterisk next to the individual genotype indicates that the chromatogram is in the reverse orientation. The SNV position is shaded according to the individual color code.