Melis Onel, Burcu Beykal, Kyle Ferguson, Weihsueh A Chiu, Thomas J McDonald, Lan Zhou, John S House, Fred A Wright, David A Sheen, Ivan Rusyn, Efstratios N Pistikopoulos.
Abstract
A detailed characterization of the chemical composition of complex substances, such as products of petroleum refining and environmental mixtures, is greatly needed in exposure assessment and manufacturing. The inherent complexity and variability in the composition of complex substances obfuscate the choices for their detailed analytical characterization. Yet, in lieu of the exact chemical composition of complex substances, evaluation of the degree of similarity is a sensible path toward decision-making in environmental health regulations. Grouping of similar complex substances is a challenge that can be addressed via advanced analytical methods and streamlined data analysis and visualization techniques. Here, we propose a framework with unsupervised and supervised analyses to optimally group complex substances based on their analytical features. We test two data sets of complex oil-derived substances. The first data set is from gas chromatography-mass spectrometry (GC-MS) analysis of 20 Standard Reference Materials representing crude oils and oil refining products. The second data set consists of 15 samples of various gas oils analyzed using three analytical techniques: GC-MS, GC×GC-flame ionization detection (FID), and ion mobility spectrometry-mass spectrometry (IM-MS). For the unsupervised analysis, we apply hierarchical clustering with Pearson correlation as the similarity metric; for the supervised analysis, we build classification models using the Random Forest algorithm. We present a quantitative comparative assessment of clustering results via the Fowlkes-Mallows index, and of classification results via model accuracies in predicting the group of an unknown complex substance. We demonstrate the effect of (i) different grouping methodologies, (ii) data set size, and (iii) dimensionality reduction on the grouping quality, and (iv) different analytical techniques on the characterization of the complex substances. While complexity and variability in chemical composition are inherent features of complex substances, we demonstrate how the choices of data analysis and visualization methods can impact the communication of their characteristics to delineate sufficient similarity.
Year: 2019; PMID: 31600275; PMCID: PMC6786635; DOI: 10.1371/journal.pone.0223517
Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240
Fig 1. SOM recreated from de Carvalho Rocha, Schantz (14).
Standard Reference Materials (SRM) samples from de Carvalho Rocha, Schantz (14).
| SRM ID | 3-Class Grouping | 9-Class Grouping | 16-Class Grouping | Sample IDs |
|---|---|---|---|---|
| SRM 2722 | Crude Oil | Crude Oil | Crude Oil (Heavy-Sweet) | petro203; petro204; petro205 |
| SRM 2721 | Crude Oil | Crude Oil | Crude Oil (Light-Sour) | petro274; petro275; petro276 |
| SRM 2779 | Crude Oil | Crude Oil | Gulf of Mexico Crude Oil | petro270; petro271; petro272 |
| SRM 1615 | Heavy Refinery Product | Gas Oil | Gas Oil | petro207; petro208; petro209 |
| SRM 1848 | Heavy Refinery Product | Motor Oil | Motor Oil Additive | petro218; petro219; petro220 |
| SRM 2770 | Heavy Refinery Product | RFO | S in Residual Fuel Oil | petro234; petro235; petro236 |
| SRM 1623c | Heavy Refinery Product | RFO | S in Residual Fuel Oil | petro238; petro239; petro240 |
| SRM 1620c | Heavy Refinery Product | RFO | S in Residual Fuel Oil | petro278; petro279; petro280 |
| SRM 2773 | Light Refinery Product | Biodiesel | Biodiesel (Animal-based) | petro230; petro231; petro232 |
| SRM 2772 | Light Refinery Product | Biodiesel | Biodiesel (Soy-based) | petro266; petro267; petro268 |
| SRM 2723b | Light Refinery Product | Diesel | Low S Diesel | petro226; petro227; petro228 |
| SRM 1624d | Light Refinery Product | Diesel | Sulfur in Diesel | petro214; petro215; petro216 |
| SRM 2771 | Light Refinery Product | Diesel | Zero S Diesel | petro222; petro223; petro224 |
| Gasoline | Light Refinery Product | Gasoline | 87 Octane Gasoline | petro258; petro259; petro260 |
| SRM 2299 | Light Refinery Product | Gasoline | S in gasoline | petro210; petro211; petro212 |
| JP8 | Light Refinery Product | Jet Fuel | Jet Fuel | petro246; petro247; petro248 |
| JP5 | Light Refinery Product | Jet Fuel | Jet Fuel | petro250; petro251; petro252 |
| Jet Fuel A | Light Refinery Product | Jet Fuel | Jet Fuel | petro254; petro255; petro256 |
| SRM 1617b | Light Refinery Product | Kerosene | S in Kerosene (High Level) | petro242; petro243; petro244 |
| SRM 1616b | Light Refinery Product | Kerosene | S in Kerosene (Low Level) | petro262; petro263; petro264 |
*16-class grouping is based on designation by the National Institute of Standards and Technology (NIST), which was further grouped into 9 major classes. The 3-class grouping reflects the major refining distinctions among the SRMs.
Petroleum UVCB (unknown or variable composition, complex reaction products, or biological materials) samples.
| Sample ID | Manufacturing class | CAS RN | CAS Name |
|---|---|---|---|
| CON07 | OGO | 64742-46-7 | Distillates (petroleum), hydrotreated middle |
| CON09 | OGO | 64742-80-9 | Distillates (petroleum), hydro-desulfurized middle |
| CON01 | SRGO | 64741-43-1 | Gas oils (petroleum), straight-run |
| CON05 | SRGO | | |
| CON02 | SRGO | 68814-87-9 | Distillates (petroleum), full-range straight-run middle |
| CON03 | SRGO | | |
| CON04 | SRGO | 68915-96-8 | Distillates (petroleum), heavy straight-run |
| CON12 | VHGO | 64741-49-7 | Condensates (petroleum), vacuum tower |
| CON13 | VHGO | 64741-58-8 | Gas oils (petroleum), light vacuum |
| CON14 | VHGO | 64741-77-1 | Distillates (petroleum), light hydrocracked |
| CON15 | VHGO | 64742-87-6 | Gas oils (petroleum), hydrodesulfurized light vacuum |
| CON16 | VHGO | 68334-30-5 | Fuels, diesel |
| CON17 | VHGO | 68476-30-2 | Fuel oil, no. 2 |
| CON18 | VHGO | 68476-31-3 | Fuel oil, no. 4 |
| CON20 | VHGO | 92045-24-4 | Gas oils (petroleum), hydrotreated light vacuum |
Fig 2. Data processing and visualization workflow.
Fig 3. Dendrograms for the SRM samples clustering from the reduced data set into 3, 9 and 16 categories.
LRP: Light Refinery Product, HRP: Heavy Refinery Product.
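The dendrograms come from hierarchical clustering with Pearson correlation as the similarity metric. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the distance `1 - r` and average linkage (the record does not state the linkage rule), and the profiles in `toy` are made-up numbers rather than SRM data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length feature profiles.

    Assumes neither profile is constant (a constant profile gives zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster(profiles, k):
    """Agglomerative clustering into k groups using the distance 1 - r.

    Average linkage is an assumption made for this sketch.
    Returns one integer cluster label per profile.
    """
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two candidate clusters.
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical profiles: two rising, two falling.
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.0, 8.2],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 4.0, 2.1],
]
labels = cluster(toy, 2)
```

Because correlation ignores overall scale, the two rising profiles land in one group and the two falling profiles in the other, even though their magnitudes differ by a factor of two.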
Fig 4. Fowlkes-Mallows index for the outcomes of clustering of SRM samples.
* indicates that the results are statistically significant at the 0.05 level.
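For readers unfamiliar with the Fowlkes-Mallows index, it compares two groupings by counting sample pairs: with TP pairs grouped together in both, FP pairs grouped together only in the clustering, and FN pairs grouped together only in the reference, the index is TP / sqrt((TP + FP)(TP + FN)). A minimal sketch follows; the label vectors are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(reference, predicted):
    """Fowlkes-Mallows index between a reference grouping and a clustering."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_pred = predicted[i] == predicted[j]
        if same_ref and same_pred:
            tp += 1  # pair together in both groupings
        elif same_pred:
            fp += 1  # pair together only in the clustering
        elif same_ref:
            fn += 1  # pair together only in the reference
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Hypothetical example: six samples, three reference classes, three clusters.
reference = ["crude", "crude", "crude", "heavy", "heavy", "light"]
predicted = [0, 0, 1, 1, 1, 2]
score = fowlkes_mallows(reference, predicted)  # 2 / sqrt(4 * 4) = 0.5
```

The index equals 1 when the clustering reproduces the reference grouping exactly and falls toward 0 as fewer co-grouped pairs are shared.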
Fig 5. Confusion matrices for SRM sample classification with 3 replicates.
(A) 3-class, (B) 9-class, and (C) 16-class grouping.
Fig 6. Confusion matrices for SRM sample classification with 1 replicate.
(A) 3-class, (B) 9-class, and (C) 16-class grouping.
Classification accuracy of SRM samples using sample replicates.
| Prediction class type | Number of sample replicates used | Classification accuracy | Classification accuracy (permuted) | p-value |
|---|---|---|---|---|
| 3-class | 3 | 100% | 44.8±7.0% | 0.000 |
| 3-class | 1 | 65% | 48.0±9.2% | 0.023 |
| 9-class | 3 | 100% | 10.9±5.1% | 0.000 |
| 9-class | 1 | 35% | 6.6±6.9% | 0.000 |
| 16-class | 3 | 100% | 6.5±4.1% | 0.000 |
| 16-class | 1 | 15% | 4.2±5.1% | 0.019 |
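One rough way to put the permuted accuracies in context is to compute the agreement expected when the true labels are simply shuffled: a sample from a class of size n_c keeps its label with probability n_c / N, so the expected agreement is the sum of n_c² divided by N². The sketch below applies this to the class sizes counted from the SRM sample table. It is a back-of-envelope check under that assumption, not the permutation procedure used in the study, which retrains a classifier on permuted labels.

```python
def chance_accuracy(class_counts):
    """Expected agreement between true labels and a random shuffle of them."""
    n = sum(class_counts)
    return sum(c * c for c in class_counts) / (n * n)


# Class sizes counted from the 20 rows of the SRM sample table.
srm_counts = {
    "3-class": [3, 5, 12],                     # crude, heavy, light
    "9-class": [3, 1, 1, 3, 2, 3, 2, 3, 2],
    "16-class": [1] * 14 + [3, 3],             # two classes hold three SRMs each
}

for name, counts in srm_counts.items():
    print(f"{name}: {100 * chance_accuracy(counts):.1f}%")
# 3-class: 44.5%
# 9-class: 12.5%
# 16-class: 8.0%
```

All three values fall within the ± ranges reported for the 3-replicate permuted accuracies in the table above.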
Classification accuracy of Petroleum UVCB samples.
| Prediction class type | Analytical technique used | Classification accuracy | Classification accuracy (permuted) | p-value |
|---|---|---|---|---|
| 3-class | GC-MS | 40.0% | 39.9±13.5% | 0.395 |
| 3-class | GC×GC-FID | 46.7% | 39.4±13.9% | 0.222 |
| 3-class | IM-MS | 60.0% | 41.4±12.0% | 0.047 |
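The permuted-accuracy column and its accompanying significance value can be summarized from a set of label-permutation runs. The sketch below is illustrative only: the function name `summarize_permutations` and the ten accuracies are hypothetical, and it assumes the significance value is the fraction of permuted runs that score at least as well as the real model, which the record does not confirm.

```python
from statistics import mean, stdev


def summarize_permutations(observed, permuted):
    """Return (mean, sample standard deviation, empirical p-value).

    The p-value is the share of permuted-label accuracies that reach or
    exceed the accuracy obtained with the true labels.
    """
    exceed = sum(1 for acc in permuted if acc >= observed)
    return mean(permuted), stdev(permuted), exceed / len(permuted)


# Hypothetical accuracies from ten label permutations (not values from the study).
permuted = [0.40, 0.33, 0.47, 0.60, 0.27, 0.40, 0.53, 0.33, 0.47, 0.40]
avg, sd, p = summarize_permutations(0.60, permuted)
print(f"permuted accuracy: {100 * avg:.1f}±{100 * sd:.1f}%, p = {p:.3f}")
```

A common variant adds one to both the count and the number of permutations so that the estimate is never exactly zero; which convention applies here is not stated.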
3-class classification accuracy of SRM substances excluding sample replicates.
| Prediction class type | Classification accuracy | Classification accuracy (permuted) | p-value |
|---|---|---|---|
| 3-class | 75% | 71.8±3.6% | 0.092 |
Top 10 most informative GC-MS chromatographic features with respect to the classification accuracy of the petroleum substances.
See S3–S5 Tables for the list of all chromatographic features and their respective ranks in each analysis.
| GC-MS chromatographic feature | 3-class rank | 3-class mean decrease in accuracy (%) | 9-class rank | 9-class mean decrease in accuracy (%) | 16-class rank | 16-class mean decrease in accuracy (%) |
|---|---|---|---|---|---|---|
| C4-Naphthalenes | 6 | 6.82 | 13 | 8.68 | 1 | 10.38 |
| Naphthobenzothiophene | 10 | 6.54 | 14 | 8.56 | 7 | 9.44 |
| C2-Naphthobenzothiophenes | 24 | 6.12 | 2 | 9.09 | 10 | 9.40 |
| Benzothiophene | 21 | 6.25 | 3 | 9.08 | 14 | 9.25 |
| C3-Phenanthrene/anthracenes | 5 | 6.90 | 6 | 8.86 | 33 | 8.55 |
| C4-Phenanthrene/anthracenes | 8 | 6.58 | 35 | 7.76 | 3 | 9.57 |
| Benzo(b)fluoranthene | 19 | 6.28 | 20 | 8.42 | 8 | 9.42 |
| Dibenzothiophene | 41 | 5.76 | 4 | 9.02 | 4 | 9.56 |
| C1-Dibenzothiophenes | 2 | 7.10 | 22 | 8.35 | 26 | 8.69 |
| Benzo(k)fluoranthene | 27 | 6.06 | 16 | 8.53 | 11 | 9.34 |
*Rank of the feature among 55 total for each classification analysis (3-, 9-, or 16-class prediction). Top 10 features with the overall highest rank in all three analyses were selected.
#Mean decrease in the accuracy of classification when this feature is removed from the analysis.
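The footnotes do not say how the three per-analysis ranks were combined into an overall rank. One simple aggregation, assumed here purely for illustration, is the sum of the three ranks. The sketch below applies it to the ranks listed in the table; the row order of the table turns out to be consistent with ascending rank sum, although this does not confirm that the authors used it.

```python
# (3-class rank, 9-class rank, 16-class rank), copied from the table, in table order.
ranks = {
    "C4-Naphthalenes": (6, 13, 1),
    "Naphthobenzothiophene": (10, 14, 7),
    "C2-Naphthobenzothiophenes": (24, 2, 10),
    "Benzothiophene": (21, 3, 14),
    "C3-Phenanthrene/anthracenes": (5, 6, 33),
    "C4-Phenanthrene/anthracenes": (8, 35, 3),
    "Benzo(b)fluoranthene": (19, 20, 8),
    "Dibenzothiophene": (41, 4, 4),
    "C1-Dibenzothiophenes": (2, 22, 26),
    "Benzo(k)fluoranthene": (27, 16, 11),
}

# Assumed aggregation: a lower rank sum means a more informative feature overall.
rank_sum = {feature: sum(r) for feature, r in ranks.items()}
ordered = sorted(rank_sum, key=rank_sum.get)

for feature in ordered:
    print(f"{rank_sum[feature]:>3}  {feature}")
```

The rank sums run from 20 for C4-Naphthalenes to 54 for Benzo(k)fluoranthene, with no ties among these ten features.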
Fig 7. ToxPi visualization of SRM samples using top 10 most informative chromatographic features.
Fig 8. PCA of ToxPi scores.
Fig 9. F-M index for the outcomes of clustering of Petroleum UVCB samples analyzed using 3 different techniques.
* indicates that the results are statistically significant.
Fig 10. Dendrograms for Petroleum UVCB samples clustering from the reduced data set analyzed using 3 different techniques.