Brent S Pedersen, Ryan L Collins, Michael E Talkowski, Aaron R Quinlan.
Abstract
The BAM and CRAM formats provide a supplementary linear index that facilitates rapid access to sequence alignments in arbitrary genomic regions. Comparing consecutive entries in a BAM or CRAM index allows one to infer the number of alignment records per genomic region, which serves as an effective proxy for sequence depth in that region. Based on these properties, we have developed indexcov, an efficient estimator of whole-genome sequencing coverage to rapidly identify samples with aberrant coverage profiles, reveal large-scale chromosomal anomalies, recognize potential batch effects, and infer the sex of a sample. Indexcov is available at https://github.com/brentp/goleft under the MIT license.
Year: 2017 PMID: 29048539 PMCID: PMC5737511 DOI: 10.1093/gigascience/gix090
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Difference between median-scaled sequencing depth in 16 384-bp bins from samtools, which recovers per-base depth from the BAM file, and indexcov, which estimates coverage from the BAM index. Samtools required ∼61 minutes to compute the depth in 16.4-kb bins of the genome, whereas indexcov estimated the depth of these regions in about 2 seconds. Pictured here is a summary from NA12878 chromosome 1. The x-axis values indicate the relative difference in normalized coverage estimates between samtools and indexcov in 16.4-kb bins for chromosome 1. Of the 15 196 bins measured, only 2.76% (420) have a difference in depth estimate outside the range of the plot (greater than 0.5). The Pearson correlation coefficient between the samtools and indexcov depths is 0.81.
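The index-based estimate compared in Figure 1 can be illustrated with a minimal sketch. It assumes the linear-index entries for one chromosome have already been read into a list of BAM virtual offsets, one per 16 384-bp tile. The function names, the use of compressed file offsets as the size proxy, and the scaling by the median non-zero tile are illustrative assumptions, not indexcov's actual implementation.

```python
from statistics import median

TILE_SIZE = 16_384  # bp spanned by each linear-index entry in a BAM index


def tile_sizes(virtual_offsets):
    """Approximate the amount of alignment data per tile from consecutive index entries."""
    # A BAM virtual offset stores the compressed-block file offset in its upper bits
    # and the position within the uncompressed block in its low 16 bits.
    file_offsets = [v >> 16 for v in virtual_offsets]
    return [b - a for a, b in zip(file_offsets, file_offsets[1:])]


def scaled_depth(sizes):
    """Scale tile sizes by the median non-zero size so typical coverage is about 1."""
    nonzero = [s for s in sizes if s > 0]
    if not nonzero:
        return [0.0 for _ in sizes]
    m = median(nonzero)
    return [s / m for s in sizes]


# Hypothetical offsets for six consecutive tiles (made-up values for illustration).
offsets = [0 << 16, 100 << 16, 200 << 16, 400 << 16, 400 << 16, 500 << 16]
sizes = tile_sizes(offsets)
depths = scaled_depth(sizes)
```

In this toy input the third tile holds twice the typical amount of data and the fourth holds none, so their scaled depths are 2 and 0, respectively. No alignment records are read at any point, which is consistent with the large speed difference reported in the caption.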
Figure 2: Coverage profiles for 45 human WGS samples on chromosome 15. The estimated coverage along the chromosome is shown in (A), and an alternative representation showing the proportion of tiles covered at a certain depth and as the lower path is shown in (B). The sample highlighted with a green line has a ∼10-Mb deletion just after the (acrocentric) centromere that has been previously associated with Angelman syndrome. The crimson line tracks a sample with a large variability in coverage; samples like this one will have many spurious CNV calls. These plots are interactive in the indexcov output, allowing users to hover and identify samples of interest.
Figure 3: Sex inference plot for a cohort of 2076 human WGS samples analyzed with indexcov. Samples projected on this plot represent ∼30–40× human WGS from 519 “quartet” families recently analyzed as a study of simplex autism [13]. The x-axis shows the copy number for chrX, and the y-axis shows the copy number for chrY inferred by indexcov. Sex is inferred from the copy number of X. As expected, we see 2 dominant clusters of samples, 1 of males (X = 1 and Y = 1) and 1 of females (X = 2 and Y = 0). Notably, indexcov further identifies samples with supernumerary sex chromosome aneuploidies (XXY and XYY), which had previously been identified by SNP microarray analysis [15]. The green point in the lower left just below the origin represents a sample with no apparent coverage on chromosomes X or Y due to a truncated BAM index file, which can be rapidly corrected once identified by indexcov QC.
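The clustering described in Figure 3 can be approximated with a short sketch. It assumes per-tile scaled depths are available for chrX, chrY, and the autosomes, and it estimates copy number as twice the ratio of medians. The function names, the median-ratio formula, and the rounding rule are assumptions made for illustration; the source states only that sex is inferred from the copy number of X.

```python
from statistics import median


def copy_number(chrom_depths, autosome_depths, ploidy=2):
    """Estimate copy number as ploidy times the ratio of median depths (an assumed formula)."""
    return ploidy * median(chrom_depths) / median(autosome_depths)


def sex_chromosome_label(cn_x, cn_y):
    """Round the X and Y copy numbers and spell out the implied complement, e.g. 'XXY'."""
    x = max(round(cn_x), 0)
    y = max(round(cn_y), 0)
    if x == 0 and y == 0:
        # Matches the point near the origin in the figure, e.g. a truncated index file.
        return "no coverage"
    return "X" * x + "Y" * y


# Hypothetical scaled depths: chrX at roughly half the autosomal depth.
cn_x = copy_number([0.5, 0.52, 0.48], [1.0, 0.98, 1.02])
```

With these made-up inputs the X copy number is 1. A sample at (1, 1) is labelled `XY`, one at (2, 0) is labelled `XX`, and points away from those two clusters come out as labels such as `XXY` or `XYY`.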
Figure 4: Proportion of 16 384-bp bins where the estimated coverage is less than 0.15 on the x-axis and outside of (0.85–1.15) on the y-axis among 2076 human WGS samples. High values on the x-axis indicate large areas with low or no coverage. High values on the y-axis indicate samples with a large bias, that is, with high variance in coverage values. Note that the samples that were PCR-amplified (red) as part of the sample preparation are generally more likely to have a higher proportion of bins outside of the expected (0.85–1.15) range.
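The two per-sample quantities plotted in Figure 4 are simple proportions over the scaled bin depths. The sketch below shows one way to compute them. The function name is hypothetical, and treating the 0.85 and 1.15 boundaries as inside the expected range is an assumption, since the caption does not say how edge values are handled.

```python
def bin_qc(scaled_depths, low=0.15, expected=(0.85, 1.15)):
    """Return (proportion of bins below `low`, proportion of bins outside `expected`)."""
    n = len(scaled_depths)
    if n == 0:
        raise ValueError("no bins to summarize")
    lo, hi = expected
    # x-axis value: bins with little or no coverage.
    p_low = sum(d < low for d in scaled_depths) / n
    # y-axis value: bins outside the expected range (boundaries counted as inside).
    p_outside = sum(not (lo <= d <= hi) for d in scaled_depths) / n
    return p_low, p_outside


# Hypothetical scaled depths for eight bins of one sample.
p_low, p_outside = bin_qc([1.0, 0.9, 0.1, 1.3, 1.1, 0.0, 1.0, 1.0])
```

Every bin below 0.15 is also outside the expected range, so the second proportion is never smaller than the first.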