Jianfeng Yang1, Xiaofan Ding2, Weidong Zhu1.
Abstract
With the advance of next-generation sequencing (NGS) technologies, non-invasive prenatal testing (NIPT) has been developed and employed in fetal aneuploidy screening for trisomies 13, 18 and 21 by detecting cell-free fetal DNA (cffDNA) in maternal blood. Although the Z-test is widely used in NIPT NGS data analysis, its accuracy still needs to improve in order to reduce a) false negatives and false positives, and b) the ratio of unclassified data, so as to lower the potential harm to patients as well as the cost of retests. Combining multiple Z-tests with indexes of clinical signs and quality control, features were collected from known samples and scaled for model training using a support vector machine (SVM). We trained SVM models on the qualified NIPT NGS data that the Z-test can discriminate and tested their performance on the data that the Z-test cannot discriminate. In screening for trisomies 13, 18 and 21, the trained SVM models achieved 100% accuracy in both internal validation and prediction of unknown samples. Other machine learning (ML) models can achieve similarly high accuracy, but the SVM model was the most robust in this study. Moreover, four false positives and four false negatives caused by the Z-test were corrected by the SVM models. To our knowledge, this is one of the earliest studies to employ SVM in NIPT NGS data analysis. The SVM approach is expected to replace the Z-test in clinical practice.
Year: 2018 PMID: 30517156 PMCID: PMC6281214 DOI: 10.1371/journal.pone.0207840
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Density plot of Z scores from the current one-Z-test based NIPT.
Negatives and positives are shown in dark and red, respectively. The green dashed line indicates the cutoff of Z = 3 that was frequently used as a discrimination criterion. The blue dashed lines show the “grey zone” interval between Z = 1.96 and Z = 4, in which discrimination using Z = 3 fails and a retest is required.
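The thresholds named in this caption can be written out as a small decision rule. The following is a minimal sketch, assuming that both edges of the grey zone are inclusive and that the single cutoff is applied as `z >= 3`; neither convention is stated in the source, and the function names are hypothetical.

```python
GREY_LOW = 1.96   # lower edge of the grey zone, from the caption
GREY_HIGH = 4.0   # upper edge of the grey zone, from the caption
CUTOFF = 3.0      # single cutoff, from the caption


def call_single_cutoff(z: float, cutoff: float = CUTOFF) -> str:
    """One-Z-test call. Treating the cutoff itself as positive is an assumption."""
    return "positive" if z >= cutoff else "negative"


def call_with_grey_zone(z: float, low: float = GREY_LOW, high: float = GREY_HIGH) -> str:
    """Three-way call. Both edges are assumed to belong to the grey zone."""
    if z < low:
        return "negative"
    if z > high:
        return "positive"
    return "grey zone (retest)"


if __name__ == "__main__":
    # Hypothetical Z scores, chosen only to exercise each branch.
    for z in (0.8, 2.5, 3.4, 5.1):
        print(z, call_single_cutoff(z), call_with_grey_zone(z))
```

Samples that land in the grey zone are the ones a single cutoff cannot settle, which is the group the study treats as unclassified.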
Table 1. Demographics of pregnant women undergoing non-invasive prenatal testing (NIPT) for aneuploidies between 1 March and 31 July 2016.
| Subject | Total | % of all | Negative | % in group | % of all | Positive | % in group | % of all | P13 | % in group | % of all | P18 | % in group | % of all | P21 | % in group | % of all |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All | 5518 | | 5472 | | 99.18 | | | 0.82 | 5 | | 0.09 | 14 | | 0.25 | 27 | | 0.49 |
| Age | 31.83 (15–47) | | 31.83 (15–47) | | | 31.70 (20–43) | | | 31.2 (25–40) | | | 29.57 (23–42) | | | 32.59 (20–43) | | |
| <24 | 700 | 12.69 | 691 | 12.63 | 98.71 | 9 | 20.00 | 1.29 | 0 | 0.00 | 0.00 | 4 | 28.57 | 0.57 | 5 | 18.52 | 0.71 |
| 25–29 | 1285 | 23.29 | 1273 | 23.26 | 99.07 | 12 | 26.67 | 0.93 | 3 | 60.00 | 0.23 | 5 | 35.71 | 0.39 | 4 | 14.81 | 0.31 |
| 30–34 | 1371 | 24.85 | 1368 | 25.00 | 99.78 | 3 | 6.67 | 0.22 | 0 | 0.00 | 0.00 | 1 | 7.14 | 0.07 | 3 | 11.11 | 0.22 |
| 35–40 | 1741 | 31.55 | 1727 | 31.55 | 99.20 | 14 | 31.11 | 0.80 | 1 | 20.00 | 0.06 | 1 | 7.14 | 0.06 | 12 | 44.44 | 0.69 |
| >40 | 421 | 7.63 | 414 | 7.56 | 98.34 | 7 | 15.56 | 1.66 | 1 | 20.00 | 0.24 | 3 | 21.43 | 0.71 | 3 | 11.11 | 0.71 |
| Week | 17.19 (8–37) | | 17.2 (8–37) | | | 15.93 (12–21) | | | 13.6 (12–15) | | | 16.21 (12–20) | | | 16.33 (12–21) | | |
| <13 | 651 | 11.80 | 643 | 11.75 | 98.77 | 8 | 17.78 | 1.23 | 2 | 40.00 | 0.31 | 1 | 7.14 | 0.15 | 5 | 18.52 | 0.77 |
| 14–27 | 4807 | 87.11 | 4770 | 87.16 | 99.23 | 37 | 82.22 | 0.77 | 3 | 60.00 | 0.06 | 13 | 92.86 | 0.27 | 22 | 81.48 | 0.46 |
| >28 | 60 | 1.09 | 60 | 1.10 | 100.00 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| CostDay | 10.39 (5–62) | | 10.38 (5–62) | | | 10.80 (6–18) | | | 10 (8–14) | | | 12.07 (7–18) | | | 10.11 (6–17) | | |
| <7 | 956 | 17.33 | 947 | 17.30 | 99.06 | 9 | 20.00 | 0.94 | 0 | 0.00 | 0.00 | 2 | 14.29 | 0.21 | 7 | 25.93 | 0.73 |
| 8–14 | 3963 | 71.82 | 3933 | 71.86 | 99.24 | 30 | 66.67 | 0.76 | 5 | 100.00 | 0.13 | 9 | 64.29 | 0.23 | 17 | 62.96 | 0.43 |
| 15–21 | 561 | 10.17 | 555 | 10.14 | 98.93 | 6 | 13.33 | 1.07 | 0 | 0.00 | 0.00 | 3 | 21.43 | 0.53 | 3 | 11.11 | 0.53 |
| >22 | 38 | 0.69 | 38 | 0.69 | 100.00 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
Demographic characteristics of pregnant women undergoing NIPT for aneuploidies in this study.
a Age is the age of the pregnant woman at the time of NIPT; Week is the gestational week at the time of NIPT; CostDay is the turnaround time of our NIPT service in days.
b Positive means trisomy of chromosome 13, 18 or 21. A sample in which none of these three chromosomes was trisomic was regarded as Negative in this study. P13 means trisomy of chromosome 13; P18 means trisomy of chromosome 18; P21 means trisomy of chromosome 21.
c Average values of the relevant subjects, with minimums and maximums in brackets.
Table 2. List of features employed in SVM classification.
| Feature Number | Feature Name | Description | SVM Scale | P value |
|---|---|---|---|---|
| D1 | Z_baseline_vs_n | Z value normalized to the baseline of control samples | No | < 2.2e-16 |
| D2 | Z_baseline_vs_p | Z value normalized to the baseline of predictive positive samples | No | < 2.2e-16 |
| D3 | Z_chr_vs_n | Z value normalized to the internal chromosome reference | No | < 2.2e-16 |
| D4 | Z_chr_vs_p | Z value normalized to the predictive positive internal chromosome | No | < 2.2e-16 |
| D5 | Z_sample_vs_n | Z value normalized to the baseline of control samples | No | < 2.2e-16 |
| D6 | Z_sample_vs_p | Z value normalized to the baseline of predictive positive samples | No | < 2.2e-16 |
| D7 | Fetal | Fetal fraction in maternal plasma | Yes | 0.7542 |
| D8 | Peak | Peak value of read length distribution | Yes | 0.6655 |
| D9 | MA | Maternal age | Yes | 0.2541 |
| D10 | GW | Gestational week | Yes | 0.5125 |
In total, ten features were used in SVM model training and classification.
d Wilcoxon rank-sum test.
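The "SVM Scale" column implies that only D7 to D10 are rescaled before training, while the six Z values are passed through unchanged. The sketch below illustrates one way to do this. The source does not state the scaling method, so linear scaling to [-1, 1] is an assumption here (it matches the default range of libSVM's `svm-scale` tool). The sample values and helper names are hypothetical.

```python
from typing import Dict, List, Tuple

UNSCALED = ("D1", "D2", "D3", "D4", "D5", "D6")  # "SVM Scale" = No
SCALED = ("D7", "D8", "D9", "D10")                # "SVM Scale" = Yes

Sample = Dict[str, float]
Ranges = Dict[str, Tuple[float, float]]


def fit_ranges(training: List[Sample]) -> Ranges:
    """Record the min and max of each scaled feature on the training set only."""
    return {
        name: (min(s[name] for s in training), max(s[name] for s in training))
        for name in SCALED
    }


def scale(value: float, lo: float, hi: float,
          new_lo: float = -1.0, new_hi: float = 1.0) -> float:
    """Linear rescaling to [new_lo, new_hi]; the target range is an assumption."""
    if hi == lo:  # constant feature: map to the middle of the target range
        return (new_lo + new_hi) / 2.0
    return new_lo + (new_hi - new_lo) * (value - lo) / (hi - lo)


def to_vector(sample: Sample, ranges: Ranges) -> List[float]:
    """Build the ten-element feature vector: D1-D6 raw, D7-D10 scaled."""
    raw = [sample[name] for name in UNSCALED]
    scaled = [scale(sample[name], *ranges[name]) for name in SCALED]
    return raw + scaled


# Hypothetical samples, not taken from the study.
training = [
    {"D1": 0.4, "D2": -12.0, "D3": 0.9, "D4": -10.5, "D5": 0.3, "D6": -11.8,
     "D7": 8.0, "D8": 130.0, "D9": 24.0, "D10": 12.0},
    {"D1": 7.1, "D2": -2.5, "D3": 8.8, "D4": -0.9, "D5": 6.7, "D6": -2.9,
     "D7": 20.0, "D8": 150.0, "D9": 40.0, "D10": 20.0},
]
ranges = fit_ranges(training)
vector = to_vector(training[0], ranges)
```

Fitting the ranges on the training set and reusing them for new samples keeps training and prediction on the same scale.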
Fig 2. Boxplots of Z scores from six types of Z tests on chromosomes 13/18/21.
The six types of Z tests correspond to formulas (1) to (6) in Methods. "N" means negatives and "P" means positives. Red dots represent false positives and green dots represent false negatives.
Fig 3. Strategy of employing SVM models to improve NIPT calling.
The SVM models were trained using known datasets. Once confirmed, validated data can be added to the training dataset to improve prediction.
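The feedback loop in this strategy (train on known samples, predict an unknown one, add it to the training set once confirmed, retrain) can be sketched independently of the classifier. The example below is a rough illustration only: it uses a nearest-centroid classifier as a stand-in, not an SVM, so that it runs without external libraries. All names and data points are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]      # (feature vector, label "N" or "P")
Model = Callable[[Sequence[float]], str]  # maps a feature vector to a label


def train_nearest_centroid(training: List[Sample]) -> Model:
    """Stand-in trainer (NOT an SVM): predict the label of the closer class centroid."""
    centroids: Dict[str, List[float]] = {}
    for label in {lab for _, lab in training}:
        rows = [x for x, lab in training if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x: Sequence[float]) -> str:
        def sq_dist(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
        return min(centroids, key=sq_dist)

    return predict


def retrain_with_confirmed(training: List[Sample], confirmed: List[Sample],
                           trainer: Callable[[List[Sample]], Model] = train_nearest_centroid
                           ) -> Tuple[List[Sample], Model]:
    """Append confirmed samples to the training set and rebuild the model."""
    updated = training + confirmed
    return updated, trainer(updated)


# Hypothetical two-feature data.
training = [([0.1, 0.3], "N"), ([0.4, -0.2], "N"), ([6.5, 8.0], "P"), ([7.2, 8.4], "P")]
model = train_nearest_centroid(training)

unknown = [5.9, 6.7]
predicted = model(unknown)

# After the status is confirmed, the sample joins the training set and the model is rebuilt.
training, model = retrain_with_confirmed(training, [(unknown, "P")])
```

Only samples with a confirmed status are appended, so a wrong prediction never feeds back into training.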
Table 3. Performance of SVM models on NIPT prediction using different parameter settings.
| Chr | Model | Real status | Support vector number | Group "N" & "P": predicted N | Group "N" & "P": predicted P | Group "N" & "P": Sens. | Group "N" & "P": Spec. | Group "Unclassified": predicted N | Group "Unclassified": predicted P | Group "Unclassified": Sens. | Group "Unclassified": Spec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chr21 | SVM-RBF-opt | N | 365 | 4672 | 0 | 100.00% | 100.00% | 57 | 0 | 100.00% | 100.00% |
| | | P | 19 | 0 | 19 | | | 0 | 4 | | |
| Chr21 | SVM-linear-opt | N | 2 | 4672 | 0 | 100.00% | 100.00% | 57 | 0 | 0.00% | 100.00% |
| | | P | 2 | 0 | 19 | | | 4 | 0 | | |
| Chr21 | SVM-RBF-opt-w | N | 478 | 4672 | 0 | 100.00% | 100.00% | 57 | 0 | 100.00% | 100.00% |
| | | P | 19 | 0 | 19 | | | 0 | 4 | | |
| Chr21 | SVM-linear-opt-w | N | 2 | 4672 | 0 | 100.00% | 100.00% | 57 | 0 | 0.00% | 100.00% |
| | | P | 2 | 0 | 19 | | | 4 | 0 | | |
| Chr18 | SVM-RBF-opt | N | 106 | 4697 | 0 | 100.00% | 100.00% | 44 | 0 | 100.00% | 100.00% |
| | | P | 7 | 0 | 7 | | | 0 | 4 | | |
| Chr18 | SVM-linear-opt | N | 2 | 4697 | 0 | 85.71% | 100.00% | 44 | 0 | 0.00% | 100.00% |
| | | P | 2 | 1 | 6 | | | 4 | 0 | | |
| Chr18 | SVM-RBF-opt-w | N | 303 | 4697 | 0 | 100.00% | 100.00% | 44 | 0 | 100.00% | 100.00% |
| | | P | 6 | 0 | 7 | | | 0 | 4 | | |
| Chr18 | SVM-linear-opt-w | N | 3 | 4697 | 0 | 85.71% | 100.00% | 44 | 0 | 0.00% | 100.00% |
| | | P | 1 | 1 | 6 | | | 4 | 0 | | |
| Chr13 | SVM-RBF-opt | N | 1976 | 4706 | 0 | 100.00% | 100.00% | 42 | 0 | NA | 100.00% |
| | | P | 4 | 0 | 4 | | | 0 | 0 | | |
| Chr13 | SVM-linear-opt | N | 2 | 4706 | 0 | 100.00% | 100.00% | 42 | 0 | NA | 100.00% |
| | | P | 2 | 0 | 4 | | | 0 | 0 | | |
| Chr13 | SVM-RBF-opt-w | N | 2070 | 4706 | 0 | 100.00% | 100.00% | 42 | 0 | NA | 100.00% |
| | | P | 4 | 0 | 4 | | | 0 | 0 | | |
| Chr13 | SVM-linear-opt-w | N | 2 | 4706 | 0 | 100.00% | 100.00% | 42 | 0 | NA | 100.00% |
| | | P | 2 | 0 | 4 | | | 0 | 0 | | |
Four types of SVM models were compared in both internal and external validation for each of chromosomes 13, 18 and 21.
e w means that class weights were used to adjust parameter C; opt means that parameters C and gamma were optimized in cross validation.
f Sens. is short for sensitivity; Spec. is short for specificity.
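The model names in this table combine three ingredients: the kernel (RBF or linear), a search over C and gamma ("opt"), and per-class weighting of C ("w"). The sketch below shows what each ingredient computes. The source does not give the actual grid or weights, so these are assumptions: the grid is exponentially spaced, of the kind commonly used with libSVM, and the weights are inversely proportional to class frequency. In libSVM, the effective C of a class is its weight multiplied by C.

```python
import math
from typing import Dict, List, Sequence, Tuple


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def linear_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear kernel: K(x, y) = x . y (no gamma parameter)."""
    return sum(a * b for a, b in zip(x, y))


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed weighting scheme: n_total / (n_classes * n_class)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


def per_class_C(C: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Effective penalty per class: weight * C."""
    return {label: C * w for label, w in weights.items()}


def parameter_grid(log2_C: Sequence[int] = range(-5, 16, 2),
                   log2_gamma: Sequence[int] = range(-15, 4, 2)
                   ) -> List[Tuple[float, float]]:
    """Candidate (C, gamma) pairs to score by cross validation (assumed grid)."""
    return [(2.0 ** c, 2.0 ** g) for c in log2_C for g in log2_gamma]


# Class sizes of Group "N" & "P" for chromosome 21, taken from the table above.
weights = balanced_class_weights({"N": 4672, "P": 19})
penalties = per_class_C(1.0, weights)
grid = parameter_grid()
```

With 4672 negatives and 19 positives, this weighting makes a misclassified positive cost far more than a misclassified negative, which is the usual motivation for class weights on imbalanced data.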
Table 4. Performance of different discrimination models on NIPT prediction using ten selected features.
| Chr | Model | Real status | Group "N" & "P": predicted N | Group "N" & "P": predicted P | Group "N" & "P": Sens. | Group "N" & "P": Spec. | Group "Unclassified": predicted N | Group "Unclassified": predicted P | Group "Unclassified": Sens. | Group "Unclassified": Spec. |
|---|---|---|---|---|---|---|---|---|---|---|
| Chr21 | LDA | N | 4672 | 0 | 84.20% | 100.00% | 57 | 0 | 25.00% | 100.00% |
| | | P | 3 | 16 | | | 3 | 1 | | |
| Chr21 | QDA | N | 4669 | 3 | 100.00% | 99.90% | 53 | 4 | 75.00% | 93.00% |
| | | P | 0 | 19 | | | 1 | 3 | | |
| Chr21 | Dtree | N | 4672 | 0 | 100.00% | 100.00% | 51 | 6 | 100.00% | 89.50% |
| | | P | 0 | 19 | | | 0 | 4 | | |
| Chr18 | LDA | N | 4697 | 0 | 85.70% | 100.00% | 44 | 0 | 50.00% | 100.00% |
| | | P | 1 | 6 | | | 2 | 2 | | |
| Chr18 | QDA | N | 4697 | 0 | 0.00% | 100.00% | 44 | 0 | 0.00% | 100.00% |
| | | P | 7 | 0 | | | 4 | 0 | | |
| Chr18 | Dtree | N | 4697 | 0 | 100.00% | 100.00% | 40 | 4 | 100.00% | 90.90% |
| | | P | 0 | 7 | | | 0 | 4 | | |
| Chr13 | LDA | N | 4706 | 0 | 100.00% | 100.00% | 42 | 0 | NA | 100.00% |
| | | P | 0 | 4 | | | 0 | 0 | | |
| Chr13 | QDA | N | 4706 | 0 | 100.00% | 100.00% | 42 | 0 | NA | 100.00% |
| | | P | 0 | 4 | | | 0 | 0 | | |
| Chr13 | Dtree | N | 4703 | 3 | 100.00% | 99.90% | 25 | 17 | NA | 59.50% |
| | | P | 0 | 4 | | | 0 | 0 | | |
Four types of ML models were compared in both internal and external validation for each of chromosomes 13, 18 and 21.
g Each ML algorithm was built with its corresponding R package, except SVM, which used libSVM because it is better suited to parameter selection. Here SVM uses the default parameter setting; LDA means linear discriminant analysis; QDA means quadratic discriminant analysis; Dtree means decision tree.
h Sens. is short for sensitivity; Spec. is short for specificity; and Accu. is short for accuracy.
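Sensitivity and specificity in these tables follow directly from the confusion counts: sensitivity is the share of real positives predicted positive, and specificity is the share of real negatives predicted negative. The sketch below recomputes two entries from the counts reported above. The function names are hypothetical, and returning `None` for an undefined ratio (shown as NA in the tables) is a choice made for this illustration.

```python
from typing import Optional


def sensitivity(tp: int, fn: int) -> Optional[float]:
    """Percentage of real positives predicted positive; None when there are no positives."""
    positives = tp + fn
    return None if positives == 0 else 100.0 * tp / positives


def specificity(tn: int, fp: int) -> Optional[float]:
    """Percentage of real negatives predicted negative; None when there are no negatives."""
    negatives = tn + fp
    return None if negatives == 0 else 100.0 * tn / negatives


# QDA on chromosome 21, Group "Unclassified": 53 N->N, 4 N->P, 1 P->N, 3 P->P.
qda_chr21_sens = sensitivity(tp=3, fn=1)
qda_chr21_spec = specificity(tn=53, fp=4)

# Chromosome 13, Group "Unclassified": no real positives, so sensitivity is undefined.
chr13_sens = sensitivity(tp=0, fn=0)
```

The recomputed specificity for QDA on chromosome 21 is 53/57, or about 92.98%, which the table reports as 93.00%.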
Fig 4. A 2-D contour plot of four ML models on NIPT data of Groups "N" and "P" on chromosome 21.
Features D1 and D3 were used for this visualization and are plotted on the X-axis and Y-axis, respectively. Dark solid points show the negative samples and red solid points the positive samples. The four two-dimensional decision boundaries (green for SVM, blue for LDA, pink for QDA and orange for decision tree) were drawn from the predicted categories using 'contour' in the R package 'graphics'.
Table 5. Correction of previous false negatives and false positives by the current SVM model.
| Sample | Error type | Reported Z score | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 | Probability of negative in SVM prediction | Probability of positive in SVM prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr13 | | | | | | | | | | | | | | |
| 13_FP_1 | FP | 3.61 | 7.80908 | -18.4017 | 10.4249 | -15.4931 | 8.12209 | -19.1393 | 22.38411066 | 151.515 | 32 | 17 | 0.998877 | 0.00112284 |
| 13_FP_2 | FP | 4.78 | 7.74425 | -9.47269 | 9.7306 | -7.34031 | 8.2322 | -10.0696 | 14.70334619 | 130.36 | 25 | 17 | 0.998162 | 0.00183793 |
| chr18 | | | | | | | | | | | | | | |
| 18_FN_1 | FN | 1.77 | 3.31662 | -2.6893 | 5.82699 | -0.108174 | 3.20528 | -2.59902 | 5.63756896 | 119.412 | 42 | 17 | 1.71E-007 | 1 |
| 18_FN_2 | FN | 1.43 | 5.54394 | -3.20759 | 8.52815 | -0.100806 | 5.79965 | -3.35554 | 8.214781802 | 149.353 | 41 | 16 | 1.00E-007 | 1 |
| 18_FP_1 | FP | 3.22 | 3.36731 | -10.1558 | 4.56993 | -8.8769 | 3.31148 | -9.98747 | 12.69375338 | 153.482 | 26 | 17 | 0.984857 | 0.015143 |
| 18_FN_3 | FN | 1.93 | 5.92865 | -4.21056 | 6.68878 | -3.41427 | 4.61288 | -3.2761 | 15.25077553 | 148.944 | 27 | 18 | 2.13E-005 | 0.999979 |
| chr21 | | | | | | | | | | | | | | |
| 21_FP_1 | FP | 2.27 | 2.17527 | -10.4262 | 4.38166 | -8.02835 | 2.35229 | -11.2747 | 17.3586114 | 139.857 | 25 | 17 | 0.997866 | 0.00213356 |
| 21_FN_1 | FN | 1.76 | 6.7041 | -3.30991 | 8.41899 | -1.47674 | 5.21281 | -2.57365 | 13.79432935 | 150.056 | 36 | 12 | 0.00435514 | 0.995645 |
The SVM model trained in this study corrected four false negatives and four false positives previously called by the one-Z-test method.
i FP means false positive and FN means false negative.
j The definitions of features D1 to D10 were given in Table 2.
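The two probability columns can be turned into a call and checked against the original error type: a false positive counts as corrected when the SVM calls the sample negative, and a false negative when it calls the sample positive. The sketch below runs that check on the eight rows of the table. Taking the class with the higher probability is an assumed decision rule, and the helper names are hypothetical.

```python
from typing import List, Tuple

# (sample, error type under the one-Z-test, P(negative), P(positive)), copied from the table.
ROWS: List[Tuple[str, str, float, float]] = [
    ("13_FP_1", "FP", 0.998877, 0.00112284),
    ("13_FP_2", "FP", 0.998162, 0.00183793),
    ("18_FN_1", "FN", 1.71e-7, 1.0),
    ("18_FN_2", "FN", 1.00e-7, 1.0),
    ("18_FP_1", "FP", 0.984857, 0.015143),
    ("18_FN_3", "FN", 2.13e-5, 0.999979),
    ("21_FP_1", "FP", 0.997866, 0.00213356),
    ("21_FN_1", "FN", 0.00435514, 0.995645),
]


def svm_call(p_negative: float, p_positive: float) -> str:
    """Assumed decision rule: pick the class with the higher probability."""
    return "P" if p_positive > p_negative else "N"


def is_corrected(error_type: str, call: str) -> bool:
    """A false positive is corrected by an N call, a false negative by a P call."""
    return (error_type == "FP" and call == "N") or (error_type == "FN" and call == "P")


corrected = [name for name, err, p_neg, p_pos in ROWS
             if is_corrected(err, svm_call(p_neg, p_pos))]
n_false_positives = sum(1 for _, err, _, _ in ROWS if err == "FP")
n_false_negatives = sum(1 for _, err, _, _ in ROWS if err == "FN")
```

Under this rule all eight rows come out corrected, consistent with the four false positives and four false negatives stated in the caption.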