José Carbonell-Caballero1, Alicia Amadoz1, Roberto Alonso1, Marta R Hidalgo1, Cankut Çubuk1, David Conesa2, Antonio López-Quílez2, Joaquín Dopazo1,3,4,5.
Abstract
MOTIVATION: Current plant and animal genomic studies are often based on newly assembled genomes that have not been properly consolidated. In this scenario, misassembled regions can easily lead to false-positive findings. Although quality control scores are included within genotyping protocols, they are usually employed to evaluate individual sample quality rather than reference sequence reliability. We propose a statistical model that combines quality control scores across samples in order to detect incongruent patterns at every genomic region. Our model is inherently robust since common artifact signals are expected to be shared between independent samples over misassembled regions of the genome.
Year: 2017 PMID: 28961772 PMCID: PMC5870781 DOI: 10.1093/bioinformatics/btx482
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. General scheme of the methodology. (a) The LGP is constructed from sample reads that cover regions across the genome. (b) Then, specific markers of interest can be evaluated by contrasting their corresponding window value against the stored empirical distributions. Finally, the CES is computed to obtain the definitive diagnosis.
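The step in panel (b), contrasting a window value against a stored empirical distribution and then combining evidence across samples, can be illustrated with a minimal sketch. This is not the authors' formula: the record does not define the CES, so the function names `empirical_tail_prob` and `combined_score`, the assumption that larger window values are more anomalous, and the choice to sum negative log tail probabilities are all hypothetical.

```python
import math
from bisect import bisect_left


def empirical_tail_prob(value, background):
    """Share of background values >= value (assumes larger = more anomalous).

    Add-one smoothing keeps the estimate strictly positive so that its
    logarithm is always defined.
    """
    ordered = sorted(background)
    n_at_least = len(ordered) - bisect_left(ordered, value)
    return (n_at_least + 1) / (len(ordered) + 1)


def combined_score(window_values, backgrounds):
    """Hypothetical cross-sample score, NOT the CES defined by the authors.

    Sums -log10 of each sample's tail probability, so a window that looks
    extreme in many independent samples gets a high score.
    """
    return sum(
        -math.log10(empirical_tail_prob(value, background))
        for value, background in zip(window_values, backgrounds)
    )
```

Under these assumptions, a window that is extreme in every sample scores higher than one that is extreme in only a single sample, which mirrors the idea that artifact signals are shared between independent samples over misassembled regions.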
Fig. 2. Distribution of CES values depending on similarity score for Ahy (a), Sce (b) and Ath (c). CES was also plotted for Ath patched regions (d), split into deletions (DEL), insertions (INS), substitutions (SUBS) and the set of randomly selected loci (B) that represents the background variability state of the genome. Distributions of REAPR values are also represented for the same categories: Ahy (e), Sce (f), Ath (g) and Ath patches (h).
Correlations between BLAST-based similarity score and REAPR/log(CES) for Ath, Sce and Ahy
| REAPR | CES |
|---|---|
| 0.30 | 0.48 |
| 0.55 | 0.62 |
| 0.37 | 0.41 |
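As an illustration of the quantity this table reports, the sketch below correlates a similarity score with log-transformed CES values. The record does not say which correlation coefficient was used, so Pearson's coefficient is an assumption here, and the function names `pearson` and `correlation_with_log_ces` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_with_log_ces(similarity, ces):
    """Correlate similarity scores with log(CES); CES values must be > 0."""
    return pearson(similarity, [math.log(value) for value in ces])
```

The log transform matches the "log(CES)" wording of the table title. A rank-based coefficient would be an equally plausible reading of the source.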
Fig. 3. CES distribution values for Hsa analysis. Clear differences are shown between patched and random regions of the genome (a). Also, CES showed a clear correlation with the number of mismatches between the NGS protocol and the validation SNP array (b). Interestingly, the false-positive variants of an independent set of samples fall at the end of the ranking (c). The mean cumulative distribution function (cdf) of false positives is depicted (d), with clear differences between REAPR (light red curve) and our methodology (black curve).