Ronald J Nowling, Krystal R Manke, Scott J Emrich.
Abstract
Chromosomal inversions can lead to reproductive isolation and adaptation in insects such as Drosophila melanogaster and the non-model malaria vector Anopheles gambiae. Inversions can be detected and characterized using principal component analysis (PCA) of single nucleotide polymorphisms (SNPs). To aid in developing such methods, we formed a new benchmark derived from three publicly available insect data sets. We then used this benchmark to perform an extended validation of our software for inversion analysis (Asaph). Through that process, we identified and characterized several problematic test cases liable to misinterpretation that can help guide PCA-based inversion detection. Lastly, we re-analyzed the 2R chromosome arm of 150 An. gambiae and An. coluzzii samples and observed two inversions (2Rc and 2Rd) that were previously known but not annotated in these particular individuals. The resulting benchmark data set and methods will be useful for future inversion detection based solely on SNP data.
Year: 2020 PMID: 33119626 PMCID: PMC7595445 DOI: 10.1371/journal.pone.0240429
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summaries of inversion analysis tools.
Details of existing software tools that were either designed or can be applied to inversion analysis using SNP data are summarized.
| SNPRelate | PCAdapt | Asaph | inveRsion | invClust | EIGENSOFT | PLINK | |
|---|---|---|---|---|---|---|---|
| [ | [ | [ | [ | [ | [ | [ | |
| R | C and R | Python | R | R | C | C | |
| SNPRelate provides parallel implementations of PCA for SNP data and the ability to perform correlation testing between PC coordinates and SNP genotypes. Although not designed for inversion detection, SNPRelate can be applied to inversion detection using PCA scatter and Manhattan plots. | PCAdapt uses PCA to infer population structure and assumes variants with strong associations with the PC coordinates are under local selection. Although not designed for inversion detection, PCA scatter plots and variant p-values from association tests can be used to detect inversions with scatter and Manhattan plots, respectively. | Asaph uses PCA, clustering, and association tests to detect, genotype, and localize inversions. | inveRsion identifies changes in linkage disequilibrium along the chromosome arm from SNP data to find inversion breakpoints. | Developed by the authors of inveRsion, invClust performs PCA and clustering of samples with Gaussian mixture models to perform inversion genotype inference. Inversions can first be detected and localized by inveRsion and then invClust can be applied to SNPs in the inversion region. | EIGENSOFT provides analysis of population structure using PCA. | PLINK can perform population inference with PCA and perform regression with quantitative traits. Although not the intended purpose, these techniques can be used for inversion analysis. | |
| 2012 | 2014 (C) / 2016 (R) | 2018 | 2012 | 2015 | 2006 | 2007 | |
| Yes | Yes | Yes | Yes | Yes | Yes | Yes | |
| Yes | Yes | Yes | No | Yes | Yes | Yes | |
| Yes | Yes | Yes | Yes | No | No | Yes | |
Comparison of methods.
The capabilities of three PCA-based methods (PCA scatter plots with optional clustering, and association testing of SNPs against either cluster labels or PC coordinates) are summarized. We compare the methods on detecting, genotyping, and localizing inversions in terms of capability, ease of use, and potential for ambiguous results.
| PCA Scatter Plots | Clustering | Cluster-SNP Association Tests | PC-SNP Association Tests | |
|---|---|---|---|---|
| Yes | Yes | Yes | Yes | |
| No | Yes | No | No | |
| No | No | Yes | Yes | |
| Easy | Moderate | Difficult | Easy | |
| Yes | Yes | No | No |
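Association testing of a SNP against PC coordinates can be illustrated with a small sketch. It uses the squared Pearson correlation as a stand-in measure of association; the source does not state here which statistical test is used, so this choice, the function name `pearson_r2`, and the toy values are all assumptions made for illustration.

```python
def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant SNP or constant PC carries no association signal
    return sxy * sxy / (sxx * syy)


# Toy values (invented): PC 1 coordinates of six samples and two SNPs coded 0/1/2.
pc1 = [-2.0, -2.0, 0.0, 0.0, 2.0, 2.0]
snp_inside = [0, 0, 1, 1, 2, 2]    # tracks the PC, as a SNP inside an inversion might
snp_outside = [1, 0, 1, 0, 1, 0]   # unrelated to the PC

r2_inside = pearson_r2(pc1, snp_inside)
r2_outside = pearson_r2(pc1, snp_outside)
```

Computing such a statistic for every SNP and plotting it by chromosome position gives the kind of Manhattan plot the comparison refers to; a real analysis would report p-values from a formal test rather than raw correlations.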
Fig 1. Workflows for detecting, localizing, and genotyping inversions.
The three approaches (PCA with clustering, PC-SNP association testing, and Cluster-SNP association testing) all begin with performing PCA on a feature matrix generated from SNP data. K-Means clustering is performed using the PC coordinates to infer genotypes. The inferred genotypes and PC coordinates of the samples are represented using scatter plots. Association testing can be performed between the samples’ SNP genotypes and either the PC coordinates or cluster labels. The p-values from the association tests are plotted along the chromosome in a Manhattan plot to visualize the spatial distribution of the associations and detect and localize inversions.
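The first two steps of the workflow (PCA on a feature matrix, then k-means on the PC coordinates) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Asaph's implementation: the feature matrix is taken to be raw 0/1/2 genotype counts, only the first PC is computed (by power iteration), k-means runs in one dimension, and all function names and the toy data are hypothetical.

```python
import random


def center_columns(matrix):
    """Subtract each SNP's mean genotype so that PCA captures variation between samples."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def first_pc_coordinates(matrix, iterations=200, seed=0):
    """Coordinates of each sample on PC 1, found by power iteration on the Gram matrix."""
    centered = center_columns(matrix)
    n = len(centered)
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    rng = random.Random(seed)
    vec = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(iterations):
        vec = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = sum(v * v for v in vec) ** 0.5
        if norm == 0:
            return [0.0] * n  # no variation in the data
        vec = [v / norm for v in vec]
    # Rayleigh quotient gives the top eigenvalue; its square root scales the coordinates.
    eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram))
    return [v * eigenvalue ** 0.5 for v in vec]


def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm in one dimension, with centers seeded evenly across the range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


# Toy data (invented): rows are samples, columns are SNPs coded 0/1/2.
# Four SNPs track a hypothetical inversion genotype; two SNPs are unlinked to it.
inversion = [0, 0, 0, 1, 1, 1, 2, 2, 2]
unlinked_a = [0, 1, 0, 1, 0, 1, 0, 1, 0]
unlinked_b = [1, 0, 0, 1, 0, 0, 1, 0, 0]
genotypes = [[g, g, g, g, a, b] for g, a, b in zip(inversion, unlinked_a, unlinked_b)]

pc1 = first_pc_coordinates(genotypes)
labels = kmeans_1d(pc1, k=3)
```

In this toy case the three inversion genotypes fall into three separate clusters along PC 1. The sign of a principal component is arbitrary, so cluster numbering carries no meaning on its own.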
Characterization of SNP data sets.
A benchmark data set for evaluating methods for inversion detection using SNP data was formed from data for three insect species (D. melanogaster [37, 38], An. gambiae, and An. coluzzii [17, 39]). The chromosome arms were organized into three test cases (negatives, positives drawn from a single population, and positives drawn from multiple populations) based on known inversion genotypes from previous papers. We analyzed SNPs from the 2R chromosome arms of An. gambiae and An. coluzzii but do not include these data in our benchmark data set since not all inversions were fully characterized. For each chromosome arm, the geographic locations in which the samples were collected, species of the samples, number of samples, inversions identified in these data by the original authors and their frequencies, and the number of SNPs are provided.
| Test Case | Data Source | Location | Species | Chrom. | Samples | Inversions (Frequency) | SNPs |
|---|---|---|---|---|---|---|---|
| Negative | [ | 3L | 192 | 896,257 | |||
| Negative | [ | BCMT | 3L | 34 | 1,329,375 | ||
| Negative | [ | B | 3L | 150 | 7,449,486 | ||
| Single | [ | 2L | 198 | 910,880 | |||
| Single | [ | 2R | 198 | 740,948 | |||
| Single | [ | 3R | 198 | 884,009 | |||
| Multiple | [ | B | 2L | 150 | 2La (94.7%) | 8,296,600 | |
| Multiple | [ | B | 2L | 81 | 2La (90.7%) | ||
| Multiple | [ | BCMT | 2L | 34 | 2La (54.4%) | ||
| Other | [ | B | 2R | 150 | 2Rb (59.3%) | 11,332,702 | |
| Other | [ | B | 2R | 81 | 2Rb (82.1%) | 11,332,702 | |
| Other | [ | B | 2R | 69 | 2Rb (31.1%) | 11,332,702 | |
| Other | [ | B | 2L | 69 | 2La (99.3%) | 8,296,600 |
* Inversions were present in only six samples, which we removed; B: Burkina Faso, C: Cameroon, M: Mali, and T: Tanzania
Fig 2. Negative cases.
Analysis of chromosome arms without known major inversions (Drosophila 3L, with the 6 samples carrying inversions excluded (see Methods); 150 Anopheles 3L; and 34 Anopheles 3L). (a–c) PCA of samples, clustered with k-means, and colored by cluster. Manhattan plots visualizing p-values from association tests against sample cluster IDs (d–f) and PC coordinates (g–l, one Manhattan plot per PC).
Occurrences of 2La inversion genotypes by Anopheles species and data set.
The 2La inversion genotypes for the 34 An. gambiae and coluzzii samples from [39] and 150 An. gambiae and coluzzii samples from [17] were analyzed for association with species. The two papers do not agree on the definitions of the standard and inverted orientations. The homozygous standard inversion genotype was not observed in the 150 Burkina Faso samples but was dominant in the Burkina Faso samples from [39] (see Table 3). Likewise, the homozygous inverted genotype was not observed in the Burkina Faso samples from [39] but was dominant among the 150 Burkina Faso samples.
| Data Source | Species | Homo. Std. | Hetero. | Homo. Inv |
|---|---|---|---|---|
| [ | 0 | 1 | 68 | |
| [ | 0 | 15 | 66 | |
| [ | 3 | 0 | 8 | |
| [ | 10 | 5 | 8 |
Occurrences of 2La inversion genotypes by location for 34 Anopheles samples.
The 2La inversion genotypes for the 34 An. gambiae and An. coluzzii samples from [39] were analyzed for association with geographic location. The homozygous inverted genotype was observed primarily in samples from Cameroon, while the homozygous standard genotype was observed in samples only from Burkina Faso and Mali. Association of the inversion genotypes with geographic location prevents correction for potential confounding effects for this data set.
| Location | Homo. Std. | Hetero. | Homo. Inv |
|---|---|---|---|
| 5 | 2 | 0 | |
| 0 | 1 | 15 | |
| 8 | 0 | 0 | |
| 0 | 2 | 1 |
Fig 3. Positive cases with a single species.
Analysis of chromosome arms with known major inversions in samples drawn from a single species (Drosophila 2L, 2R, and 3R). (a–c) PCA of samples, clustered with k-means, and colored by cluster. Manhattan plots visualizing p-values from association tests against sample cluster IDs (d–f) and PC coordinates (g–l, one Manhattan plot per PC).
Fig 4. Positive cases with multiple species and/or populations.
Analysis of the 2L Anopheles chromosome arm with known major inversions in samples drawn from multiple species and/or locations (150 Anopheles from Burkina Faso, the 81 Anopheles gambiae samples among those 150, and 34 Anopheles gambiae and coluzzii samples from four geographic locations). (a–c) PCA of samples, clustered with k-means, and colored by cluster. Manhattan plots visualizing p-values from association tests against sample cluster IDs (d–f) and PC coordinates (g–k, one Manhattan plot per PC).
Genotype inference task.
We evaluated a single method (PCA with clustering) on the genotype inference task (which inversion genotype does a sample have?) using two benchmark test cases (positive from a single population and positive from multiple populations). Note that the two association-testing methods are not able to infer genotypes. For each chromosome arm used, we indicate the known inversions, how many genotypes are present in the data set, and a measure of balanced accuracy calculated from the cluster predictions. The D. melanogaster 3R chromosome arm has three mutually exclusive inversions, which we list separately.
| Test Case | Chrom. | Inversion | Present Genotypes | Clusters | Balanced Accuracy |
|---|---|---|---|---|---|
| Single | 3 | 3 | 93.3% | ||
| Single | 3 | 3 | 94.4% | ||
| Single | 3 | 60.7% | |||
| Single | 3 | 43.3% | |||
| Single | 3 | 55.0% | |||
| Multiple | 150 | 2La | 2 | 3 | 66.7% |
| Multiple | 81 | 2La | 2 | 2 | 100.0% |
| Multiple | 34 | 2La | 3 | 4 | 100.0% |
We evaluated clustering in terms of accuracy of inferring inversion genotypes. Inversion genotypes were retrieved from the original papers describing the data [17, 37–39]. Association of the known genotypes with the cluster labels was measured using balanced accuracy. *Could not resolve multiple, mutually-exclusive inversions
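Balanced accuracy is the mean of the per-genotype recall, so a rare genotype counts as much as a common one. The sketch below shows one way to score cluster labels against known genotypes. The step that maps each cluster to its most common known genotype is an assumption (the text here does not say how clusters were matched to genotypes), and the function names and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_genotypes(cluster_labels, known_genotypes):
    """Assign each cluster the most common known genotype among its members (assumed rule)."""
    members = defaultdict(Counter)
    for cluster, genotype in zip(cluster_labels, known_genotypes):
        members[cluster][genotype] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}


def balanced_accuracy(known_genotypes, predicted_genotypes):
    """Mean of per-genotype recall over the genotypes present in the known labels."""
    totals = Counter(known_genotypes)
    correct = Counter(k for k, p in zip(known_genotypes, predicted_genotypes) if k == p)
    return sum(correct[g] / totals[g] for g in totals) / len(totals)


# Toy labels (invented): ten samples, three genotypes, three clusters.
known = ["std/std"] * 2 + ["std/inv"] * 3 + ["inv/inv"] * 5
clusters = [0, 0, 1, 1, 2, 2, 2, 2, 2, 2]

mapping = map_clusters_to_genotypes(clusters, known)
predicted = [mapping[c] for c in clusters]
score = balanced_accuracy(known, predicted)
```

Here one heterozygote is placed in the homozygous inverted cluster, so recall is 1, 2/3, and 1 for the three genotypes and the balanced accuracy is 8/9, slightly below the plain accuracy of 9/10.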
Inversion localization task.
We evaluated the two association-testing methods (PC-SNP and Cluster-SNP association tests) on the inversion localization task (what region is spanned by an inversion?) using two benchmark test cases (positive from a single population and positive from multiple populations). Note that the PCA scatter plot method is not able to localize inversions. For each chromosome arm used, we indicate the known inversions, the expected ranges, and the ranges identified by each method. The D. melanogaster 3R chromosome arm has three mutually exclusive inversions, which we list separately.
| Test Case | Chrom. | Inversion | Exp. Range (Mb) | PC-SNP Obs. Range (Mb) | Cluster-SNP Obs. Range (Mb) |
|---|---|---|---|---|---|
| Single | 2.2–13.2 | start–16.0 (PC1) | start–16.0 | ||
| Single | 11.3–16.2 | 10.0–17.5 (PC1) | 10.0–18.0 | ||
| Single | 12.6–20.6 | 14.0–end | 14.0–end | ||
| Single | 7.6–22.0 | 14.0–end | 14.0–end | ||
| Single | 17.2–24.9 | 14.0–end | 14.0–end | ||
| Multiple | 150 | 2La | 20.0–45.0 | 20.0–43.0 (PC2) | start–end |
| Multiple | 81 | 2La | 20.0–45.0 | 20.0–43.0 (PC1) | 20.0–43.0 |
| Multiple | 34 | 2La | 20.0–45.0 | 20.0–43.0 (PC1) | 20.0–43.0 |
We evaluated the PC-SNP and Cluster-SNP association test methods on localizing inversions. We compared the range of inversions observed in the Manhattan plots created from these two methods with the coordinates described for these inversions in prior work [37–39, 45, 46].
*Could not resolve multiple, mutually-exclusive inversions
†Could not resolve 2La
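The observed ranges above were read from Manhattan plots. As a rough illustration of the same idea in code, the sketch below reports the span of SNPs whose p-values pass a significance threshold. The Bonferroni threshold, the function name `localize_inversion`, and the toy positions and p-values are all assumptions for illustration and do not come from the source.

```python
def localize_inversion(positions_bp, p_values, alpha=0.05):
    """Return (start_mb, end_mb) spanned by significant SNPs, or None if there are none.

    Significance uses a Bonferroni-corrected threshold (an assumed choice).
    """
    threshold = alpha / len(p_values)
    hits = [pos for pos, p in zip(positions_bp, p_values) if p < threshold]
    if not hits:
        return None
    return min(hits) / 1e6, max(hits) / 1e6


# Toy data (invented): one SNP every 5 Mb along a 50 Mb chromosome arm.
positions = [i * 5_000_000 for i in range(11)]
p_values = [0.5, 0.3, 1e-12, 1e-15, 1e-20, 1e-18, 1e-14, 0.2, 0.8, 0.6, 0.4]

observed_range = localize_inversion(positions, p_values)
```

A single spurious hit far from the main peak would stretch the reported range, which is one reason a visual check of the Manhattan plot remains useful.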
Fig 5. Anopheles 2R chromosome arm.
Analysis of the 2R chromosome arm of the 150 Anopheles samples from Burkina Faso (all samples, 81 Anopheles gambiae samples, and 69 Anopheles coluzzii samples). (a–c) PCA of samples, clustered with k-means, and colored by cluster. Manhattan plots visualizing p-values from association tests against sample cluster IDs (d–f) and PC coordinates (g–k, one Manhattan plot per PC).
Inversion detection task.
We evaluated three methods (PCA with clustering, PC-SNP association testing, and Cluster-SNP association testing) on the inversion detection task (is an inversion present?) using our three benchmark test cases (negative, positive from a single population, and positive from multiple populations). For each chromosome arm used, we indicated known inversions and whether the inversion was detected by a given method. The D. melanogaster 3R chromosome arm has three mutually-exclusive inversions, which we list separately.
| Test Case | Chrom. | Inversion | Clusters | PC-SNP | Cluster-SNP |
|---|---|---|---|---|---|
| Negative | None | 1 | No | No | |
| Negative | 34 | None | 4 | No | No |
| Negative | 150 | None | 2 | No | No |
| Single | 3 | Yes (PC 1) | Yes | ||
| Single | 3 | Yes (PC 1) | Yes | ||
| Single | 3 | Yes (PC 1) | Yes | ||
| Single | 3 | Yes (PC 1) | Yes | ||
| Single | 3 | Yes (PC 1) | Yes | ||
| Multiple | 150 | 2La | 3 | Yes (PC 2) | No |
| Multiple | 81 | 2La | 2 | Yes (PC 1) | Yes |
| Multiple | 34 | 2La | 4 | Yes (PC 1) | Yes |
We compared inversions detected by the three methods to the known inversion karyotypes for these data sets taken from the original papers describing the data [17, 37–39]. If an inversion was present with no population structure, three clusters corresponding to three possible genotypes (which may not all be present) would be expected.
*Multiple, mutually-exclusive inversions were detected as a single inversion by our methods.