Brian J Knaus, Niklaus J Grünwald.
Abstract
Inference of copy number variation presents a technical challenge because variant callers typically require the copy number of a genome or genomic region to be known a priori. Here we present a method to infer copy number that uses variant call format (VCF) data as input and is implemented in the R package vcfR. This method is based on the relative frequency of each allele (in both genic and non-genic regions) sequenced at heterozygous positions throughout a genome. These heterozygous positions are summarized by using arbitrarily sized windows of heterozygous positions, binning the allele frequencies, and selecting the bin with the greatest abundance of positions. This provides a non-parametric summary of the frequency at which alleles were sequenced. The method is applicable to organisms that have reference genomes that consist of full chromosomes or sub-chromosomal contigs. In contrast to other software designed to detect copy number variation, our method does not rely on an assumption of base ploidy, but instead infers it. We validated this approach with the model system of Saccharomyces cerevisiae and applied it to the oomycete Phytophthora infestans, both known to vary in copy number. This functionality has been incorporated into the current release of the R package vcfR to provide modular and flexible methods to investigate copy number variation in genomic projects.
Keywords: Phytophthora; R package; bioinformatics; computational biology; copy number variation (CNV); high throughput sequencing (HTS); ploidy
Year: 2018 PMID: 29706990 PMCID: PMC5909048 DOI: 10.3389/fgene.2018.00123
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
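The windowing, binning, and peak-selection steps described in the abstract can be illustrated with a minimal Python sketch. This is not the vcfR implementation: the function names, the fixed-width coordinate windows, the bin count, and the toy data are all assumptions made for illustration. The mapping from a peak frequency to a copy number assumes the usual expectation that alleles at heterozygous positions are sequenced at ratios near 1/2 for two copies, 1/3 or 2/3 for three, and 1/4 or 3/4 for four.

```python
from collections import Counter

def freq_peak_sketch(positions, allele_freqs, win_size, n_bins=20):
    """Per window, return the midpoint of the most populated frequency bin.

    Hypothetical helper: windows are fixed-width coordinate intervals and
    frequencies are binned into n_bins equal-width bins on [0, 1].
    """
    windows = {}
    for pos, freq in zip(positions, allele_freqs):
        win = (pos - 1) // win_size                  # window index for this position
        b = min(int(freq * n_bins), n_bins - 1)      # frequency bin index
        windows.setdefault(win, Counter())[b] += 1
    peaks = {}
    for win in sorted(windows):
        # Bin with the greatest abundance; ties go to the lower bin.
        best_bin = max(windows[win].items(), key=lambda kv: (kv[1], -kv[0]))[0]
        peaks[win] = (best_bin + 0.5) / n_bins
    return peaks

def nearest_copy_number(peak):
    """Pick the copy number whose expected allele ratio is closest to peak."""
    expected = {2: [1 / 2], 3: [1 / 3, 2 / 3], 4: [1 / 4, 3 / 4]}
    return min(expected, key=lambda k: min(abs(peak - r) for r in expected[k]))

# Toy heterozygous positions and allele frequencies (made-up values).
positions = [10, 200, 500, 900, 1200, 1500, 1800]
allele_freqs = [0.52, 0.53, 0.51, 0.34, 0.34, 0.32, 0.66]

peaks = freq_peak_sketch(positions, allele_freqs, win_size=1000)
copies = {win: nearest_copy_number(p) for win, p in peaks.items()}
print(peaks, copies)
```

Taking the modal bin rather than a mean keeps the summary non-parametric: a few outlying positions in a window do not shift the reported peak.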
Functions available to analyze copy number variation and mixed copy number data in the current release of vcfR.
| Function | Description |
|---|---|
| extract.gt() | Isolate data from the delimited VCF genotype fields. |
| freq_peak() | Windowize and identify peaks of density. |
| is_het() | Identify heterozygous variants. |
| masplit() | Isolate values from a matrix of delimited data. |
| peak_to_ploid() | Convert peaks of density to an expected copy number. |
| freq_peak_plot() | Visualize results from freq_peak(). |
| rePOS() | Convert chromosomal positions to genomic (non-overlapping) positions. |
| genetic_diff() | Calculate genetic differentiation ( |
Genetic differentiation as reported by the function genetic_diff().
| CHROM | POS | Hs_a | Hs_b | Ht | n_a | n_b | Gst | Htmax | Gstmax | Gprimest |
|---|---|---|---|---|---|---|---|---|---|---|
| Supercontig_1.50 | 2 | 0.42 | 0.42 | 0.4650 | 20 | 20 | 0.096 | 0.710 | 0.408 | 0.237 |
| Supercontig_1.50 | 246 | 0.42 | 0.42 | 0.4632 | 20 | 30 | 0.093 | 0.698 | 0.399 | 0.234 |
| Supercontig_1.50 | 549 | 0.42 | 0.42 | 0.4600 | 20 | 40 | 0.0870 | 0.678 | 0.380 | 0.229 |
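The columns of this table appear to be related by the standard definitions of Nei's Gst and Hedrick's standardized G'st. The Python sketch below recomputes the last columns from the earlier ones under that assumption. It is not the genetic_diff() code; the function name is hypothetical, the sample-size weighting of Hs is an assumption, and Htmax is taken from the table rather than derived.

```python
def gst_stats(hs_a, hs_b, n_a, n_b, ht, ht_max):
    """Return (Gst, Gstmax, G'st) from per-population and total heterozygosities."""
    # Assumed: within-population heterozygosity is weighted by sample size.
    hs = (n_a * hs_a + n_b * hs_b) / (n_a + n_b)
    gst = (ht - hs) / ht                  # Nei's Gst
    gst_max = (ht_max - hs) / ht_max      # maximum Gst attainable given Hs
    return gst, gst_max, gst / gst_max    # last value is Hedrick's G'st

# Rows copied from the table: Hs_a, Hs_b, n_a, n_b, Ht, Htmax.
rows = [
    (0.42, 0.42, 20, 20, 0.4650, 0.710),
    (0.42, 0.42, 20, 30, 0.4632, 0.698),
    (0.42, 0.42, 20, 40, 0.4600, 0.678),
]
results = [gst_stats(*row) for row in rows]
for gst, gst_max, gprime in results:
    print(f"{gst:.3f} {gst_max:.3f} {gprime:.3f}")
```

The recomputed values agree with the tabulated Gst, Gstmax, and Gprimest to within rounding of the inputs.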
Coefficients resulting from the linear regression of execution time (seconds) as a function of genome size (Mbp).
| Coefficient | Estimate | Standard error | t value | p value |
|---|---|---|---|---|
| Intercept | -1.085 | 1.010 | -1.075 | 0.286 |
| Slope | 0.805 | 0.008 | 103.663 | <2e-16 |
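As a rough illustration of how these coefficients could be used, the Python sketch below evaluates the fitted line for a given genome size. The 100 Mbp input is a hypothetical value chosen for the example, not one reported in the source, and the function name is an assumption.

```python
INTERCEPT_S = -1.085        # intercept estimate from the table (seconds)
SLOPE_S_PER_MBP = 0.805     # slope estimate from the table (seconds per Mbp)

def predicted_seconds(genome_mbp):
    """Execution time predicted by the fitted line for a genome of this size."""
    return INTERCEPT_S + SLOPE_S_PER_MBP * genome_mbp

# Hypothetical 100 Mbp genome: roughly 79 seconds under this fit.
estimate = predicted_seconds(100)
print(f"{estimate:.1f} s")
```

Because the intercept is small relative to its standard error, the slope dominates the prediction: under this fit, execution time grows by about 0.8 seconds per additional Mbp.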