Literature DB >> 33456620

Principal Component Analysis Applications in COVID-19 Genome Sequence Studies.

Bo Wang1, Lin Jiang2.   

Abstract

RNA genomes from coronavirus have a length as long as 32 kilobases, and the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that caused the outbreak of coronavirus disease 2019 (COVID-19) pandemic has long sequences which made the analysis difficult. Over 20,000 sequences have been submitted to GISAID, and the number is growing fast each day which increased the difficulties in data analysis; however, genome sequence analysis is critical in understanding the COVID-19 and preventing the spread of the disease. In this study, a principal component analysis (PCA) was applied to the aligned large size genome sequences and the numerical numbers were converted from the letters using a published method designed for protein sequence cluster analysis. The study initialized with a shortlist sequence testing, and the PCA score plot showed high tolerance with low-quality data, and the major virus sequences from humans were separated from the pangolin and bat samples. Our study also successfully built a model for a large number of sequences with more than 20,000 sequences which indicate the potential mutation directions for the COVID-19 which can be served as a pretreatment method for detailed studies such as decision tree-based methods. In summary, our study provided a fast tool to analyze the high-volume genome sequences such as the COVID-19 and successfully applied to more than 20,000 sequences which may provide mutation direction information for COVID-19 studies. © Springer Science+Business Media, LLC, part of Springer Nature 2021.

Entities:  

Keywords:  COVID-19; Genome Sequences; Mutation; Principle Component Analysis

Year:  2021        PMID: 33456620      PMCID: PMC7804214          DOI: 10.1007/s12559-020-09790-w

Source DB:  PubMed          Journal:  Cognit Comput        ISSN: 1866-9956            Impact factor:   4.890


Introduction

Coronavirus is an RNA virus that contains large known RNA genomes with 27 to 32 kilobases in length, and coronavirus disease 2019 (COVID-19) is caused by a novel coronavirus called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1, 2]. Coronavirus evolution has been studied using a phylogenetic network [3] to reconstruct their evolutionary paths. Phylogenetic tree analysis [4-6] was used to investigate the identification of coronavirus. Dimension reduction algorithm Uniform Manifold Approximation and Projection (UMAP) [7, 8] has also been applied for COVID-19 protein studies [9], but the genome sequence application is rare. A potential problem for the phylogenetic relationship-based method is that only largely complete genomes are preferred with limited noise suppression [3]. Principal component analysis (PCA) is a powerful tool for reducing the dimensionality of complex data sets [10, 11]. The PCA study has been reported to have high performance in noise suppression and fast data process for various applications [12-14] and showed high performance in excluding outliers or artifacts [15, 16]. When over 20,000 viral genome sequences have been shared in GISAID (www.gisaid.org), PCA is one of the best options to study the sample clustering information and provides the potential virus mutation information for further studies. Though GISAID is considered as an effective and trusted database for sharing of both published and “unpublished” influenza data [17, 18], it is still possible to get low quality of data due to various types of errors. Therefore, PCA could also be served as good noise suppression and data-pretreatment tool for other popular genome sequences analysis methods. PCA has been applied to study both protein sequences [19-21] and whole-genome [22] to simplify the highly complex data. A frequency-rank-based method was specially designed for PCA on protein sequence [23] which can efficiently reserve the variance information. In the PCA study, when projecting the large size data set into an eigenspace that represents the directions of greatest variation, complex samples such as large size protein sequences could be represented by principal component [23, 24]. The method has been successfully applied to distinguish protein clusters; however, it has not been applied to DNA or RNA sequences that are much longer than the protein sequences. Though DNA or RNA sequences have fewer variables (possible nucleotides) than that of protein sequences (20 amino acids), the idea of converting letter sequences to the numerical data matrix is similar. While the RNA genome has fewer options than proteins, the core idea of reserve sequence variance using frequency is the same. Since the ranks are used as the final data (same frequency data are ranked alphabetically), the difference between protein and RNA sequences is limited after mean-centered scaling. PCA is a suitable tool to work with higher volumes of features than observations [25], and the RNA genomes like the coronavirus which is as large as 30 k are appropriate for PCA analysis. The GISAID database has collected more than 20,000 sequences from clinicians and researchers all over the world from December 2019 till mid-May 2020, and the number is growing fast each day which provides a great opportunity to study the potential application of frequency-based PCA analysis. The method in this study would provide a fast tool to analyze the extremely large amount of sequences with regular lab computers and have a high tolerance to the potential noises. More important, the clustering analysis generated by PCA could help to track the COVID-19 genome mutations.

Materials and Methods

Genome Sequence Alignment

The genome sequences were obtained from the GISAID database (www.gisaid.org) in fasta format. The data were downloaded on May 25, 2020, and the laboratories contributed to the database are listed in Dataset S1. The genome sequences were then aligned using MAFFT [26] version 7 (https://mafft.cbrc.jp/), and the whole genome was aligned using a lightweight option.

Sequence Conversion

The aligned sequences were imported to Matlab 2019 (Mathworks) for the frequency conversion analysis. The aligned sequence was also visualized in UGENE V34.0 (Unipro) [27, 28]. The sequence conversion was carried out using the algorithm reported before [23] in a Matlab script. Briefly, the method converted letters to numerical numbers using the frequency of each letter in their sequence position. The converted numerical matrix was analyzed using the PLS-tool box to do the PCA analysis using the singular value decomposition (SVD) algorithm [29].

Genome Sequence Selections

To test the performance of the method, 75 random sequences were selected from various places of the world samples and 6 animal samples were included for comparison purposes, and the detailed sequence name and date information is in Table 1. After the shortlist of sequences was tested, the total number of more than 20,000 sequences was applied to test the performance of the method in a large number of sample sizes.
Table 1

The detailed sequences names reported locations and dates with the three nucleotides positions listed. The animal samples are listed at the bottom of the table. The first 3 positions were selected by using the 2D loading plot and the rest positions were selected using the PC1 plot. The difference could be clearly observed from the last few rows for the animal samples

Sequence names23011228152340425762534111411120119245187482636826437
JapanCAAAAAACCAA
JapanCAAAAAACCAA
BeijingCAAAAAACCAA
BeijingCAAAAAACCAA
BelgiumCAAAAAACCAA
EnglandCAAAAAACCAA
IrelandCAAAAAACCAA
IrelandCAAAAAACCAA
EnglandCAAAAAACCAA
AlgeriaCAAAAAACCAA
AlgeriaCAAAAAACCAA
AustriaCAAAAAACCAA
AustriaCAAAAAACCAA
GreeceCAAAAAACCAA
BrazilCAAAAAACCAA
GermanyCAAAAAACCAA
LebanonCAAAAAACCAA
RussiaCAAAAAACCAA
RussiaCAAAAAACCAA
SwitzerlandCAAAAAACCAA
FranceCAAAAAACCAA
LatviaCAAAAAACCAA
SwitzerlandCAAAAAACCAA
IcelandCAAAAAACCAA
IcelandCAAAAAACCAA
SwedenCAAAAAACCAA
DenmarkCAAAAAACCAA
GermanyCAAAAAACCAA
LebanonCAAAAAACCAA
SlovakiaCAAAAAACCAA
DenmarkCAAAAAACCAA
LatviaCAAAAAACCAA
SwedenCAAAAAACCAA
ScotlandCAAAAAACCAA
GreeceCAAAAAACNAA
FranceCAAAAAACCAA
ScotlandCAAAAAACCAA
BelarusCAAAAAACCAA
Czech RepublicCAAAAAACCAA
LithuaniaCAAAAAACCAA
ArgentinaCAAAAAACCAA
ArgentinaCAAAAAACCAA
LithuaniaCAAAAAACCAA
SlovakiaCAAAAAACCAA
BrazilCAAAAAACCAA
ColombiaCAAAAAACCAA
ColombiaCAAAAAACCAA
ChileCAAAAAACCAA
NetherlandsCAAAAAACCAA
NetherlandsCAAAAAACCAA
ChileCAAAAAACCAA
AustraliaCAAAAAACCAA
Costa RicaCAAAAAACCAA
CroatiaCAAAAAACCAA
BelgiumCAAAAAACCAA
AustraliaCAAAAAACCAA
USACAAAAAACCAA
USACAAAAAACCAA
USACAAAAAACCAA
BelarusCAAAAAACCAA
WalesCAAAAAACCAA
WalesCAAAAAACCAA
USACAAAAAACCAA
Costa RicaCAAAAAACCAA
SingaporeCAAAAAACCAA
GhanaCAAAAAACCAA
Czech RepublicCAAAAAACCAA
SingaporeCAAAAAACCAA
GhanaCAAAAAACCAA
batCAAAAAGTTAA
pangolinAGC----NT--
pangolinAGCGGGGTTCC
pangolinNNNNNNNYNNC
pangolinTTGTTTGAATT
batGCTCCGGCNNN

Results

The Sequence Alignment and Conversion

After alignment using MAFFT, a total of 31,690 elements were obtained for the shortlist (75) sequences, and 35,466 elements were obtained when the total sequences were applied to a total number of 21,094 sequences. One example of the aligned sequence is listed in the supporting files. (Dataset S2).

The PCA Study of the Shortlist Sequences

The conversion method successfully converted the letter sequences to numerical numbers which are suitable for PCA studies. The PCA study was carried out using mean-centered data, and the first two PCs showed a clear separation between the animal sequences and human sequences (Fig. 1). Most human sequences were separated from the first principal component (PC1) direction, and four sequences were separated from the second principal component (PC2). The loading plot [31] represented the features in the raw sequence which could lead to the separation of the samples in the score plot. In another word, when a potential mutation was observed by the PCA score plot, it is possible to track the changes in the sequence. For example, when the separation was mostly observed in the first PC, it is of interest to study the higher absolute values of PC1. However, the PCA score plot was calculated with a combined contribution of a large number of loadings but not just a single position. In the combined plot of the PCA score and loading plot (Fig. 2), the positions that lead to the separation and two large loadings can be analyzed by studying similar directions, and three representative positions were shown on the top of Fig. 2, and the detailed information was listed in Table 1. Though the score plot was calculated based on the combinations of all the important loadings, the difference between the animal samples and human virus samples can also be observed in the representative positions (Table 1), and the results showed clear differences between the human samples and animal samples. In addition, the discovered nucleotide position regions with high first PC loadings (Fig. 3) are very similar to previous studies [30]. Position numbers may be different due to the alignment of the different sequences.
Fig. 1

The PCA score plot of the shortlist sequences. The plot contains 75 sequences including 2 sequences from the bat and 4 sequences from pangolin (the orange dots); the rest of the sequences were randomly selected from human virus samples from all over the world (the blue dots)

Fig. 2

The score and loading combined plot of the shortlist sequences. The red dots are animal sample scores, orange dots are human sample scores, and the blue dots are loadings. The scores were scaled to the range of the loading plot to allow the plotting in the same figure. The numbers on the figure indicate the representative nucleotide positions and the sequences were listed in Table 1

Fig. 3

The first PC loading distributions for the shortlist genome sequences which showed the genome site differences. The selected high loadings are included in Table 1, and more details can be obtained in Table 1

The PCA score plot of the shortlist sequences. The plot contains 75 sequences including 2 sequences from the bat and 4 sequences from pangolin (the orange dots); the rest of the sequences were randomly selected from human virus samples from all over the world (the blue dots) The score and loading combined plot of the shortlist sequences. The red dots are animal sample scores, orange dots are human sample scores, and the blue dots are loadings. The scores were scaled to the range of the loading plot to allow the plotting in the same figure. The numbers on the figure indicate the representative nucleotide positions and the sequences were listed in Table 1 The first PC loading distributions for the shortlist genome sequences which showed the genome site differences. The selected high loadings are included in Table 1, and more details can be obtained in Table 1

The PCA Study of a Large Number of Sequences

A large size dataset with 21,094 genome sequences obtained from GISAID was analyzed using the PCA method. The calculation for about 19,697 sequences cost around 90 min using a personal computer with 12G ram. Due to the complexity of the sequences, a hoteling T2 method was applied to exclude the potential outliers in the PCA score plot to minimize the error caused by potential human errors in sequence upload. The final dataset reserved 19,697 samples from all over the world and a PCA model was built using the mean-centered data in Figs. 4 and 5. The PCA score plot separation is mainly in the PC1 direction with several potential sub-groups. Since the pandemic was firstly reported in China, the East and Southeast Asian samples were highlighted in the score plot. No significant difference was observed in the highlighted samples excepted the left bottom part showed relatively fewer samples. Though the score plot showed that Asian samples tend to appear on the right side of the score plot, a clear difference that can classify virus types like the previous reported phylogenetic network study using 160 samples [3] was not observed. The comparison between Europe and the USA were also highly mixed (Fig. 6). Hence, the data reported time was applied in the following analysis.
Fig. 4

The PCA score plot of the COVID-19 sequences including 19,697 samples. The orange dots are sequences reported from East and Southeast countries and regions. The blue dots are the samples reported from countries and regions in the other part of the world. The PCA score plot showed the distribution of the genome sequences from all over the world with differences in the first and second PC directions. The other PCs (PC3 to PC10) are listed in Fig. 5

Fig. 5

The PCA score plot of the COVID-19 sequences including 19,697 samples. The orange dots are sequences reported from East and Southeast countries and regions. The blue dots are the samples reported from countries and regions in the other parts of the world. PCA score plots for PCs from 4 to 10. a Scores of PC3 vs PC4. b Scores of PC5 vs PC6. c Scores of PC7 vs PC8. d Scores of PC9 vs PC10

Fig. 6

The PCA score plot with the comparison of Europe and the US. a Scores of PC1 vs PC2. b Scores of PC3 vs PC4. c Scores of PC5 vs PC6. d Scores of PC7 vs PC8. e Scores of PC9 vs PC10

The PCA score plot of the COVID-19 sequences including 19,697 samples. The orange dots are sequences reported from East and Southeast countries and regions. The blue dots are the samples reported from countries and regions in the other part of the world. The PCA score plot showed the distribution of the genome sequences from all over the world with differences in the first and second PC directions. The other PCs (PC3 to PC10) are listed in Fig. 5 The PCA score plot of the COVID-19 sequences including 19,697 samples. The orange dots are sequences reported from East and Southeast countries and regions. The blue dots are the samples reported from countries and regions in the other parts of the world. PCA score plots for PCs from 4 to 10. a Scores of PC3 vs PC4. b Scores of PC5 vs PC6. c Scores of PC7 vs PC8. d Scores of PC9 vs PC10 The PCA score plot with the comparison of Europe and the US. a Scores of PC1 vs PC2. b Scores of PC3 vs PC4. c Scores of PC5 vs PC6. d Scores of PC7 vs PC8. e Scores of PC9 vs PC10

Discussion

The Potential Changes with the Sequences Report Time

The mutation of COVID-19 is critical to study the prevention and drug development to fight with the novel coronavirus. When the early reported samples (before Jan 15, 2020) were highlighted in the PCA score plot, they mainly showed on the right side of the score plot (red dots in Figs. 7 and 8). With a slight increase in the time point to the end of January, more sequences were shown to the left side of the plot and the March samples showed more cases to the left side. Though the trend was observed, newer cases were scattered in all the parts of the PCA score plot which is reasonable since the newer reported cases are not necessarily newer mutations. The total of around 1700 samples on the left bottom side of the score plot are mainly from Europe and North America (around 95%) which may draw some attention for further mutation studies. The geographic clustering and time section were analyzed after the unsupervised PCA study without pre-settings. The nonlinear PCA method [32] could help to reduce errors judging by the short test data (Fig. 9); however, the methods consume hundreds of times processing time which could be used only for a small number of sequences.
Fig. 7

The PCA score plot of the COVID-19 sequences including 19,697 samples. The red dots, green dots, and yellow dots are sequences reported before Jan 15, 2020, before Feb 1, 2020, and before March 1, 2020, representatively. The blue dots are the samples reported after March 1, 2020. The other PCs (PC3 to PC10) are listed in Fig. 8

Fig. 8

The PCA score plot of the COVID-19 sequences including 19,697 samples. The red dots, green dots, and yellow dots are sequences reported before Jan 15, 2020, before Feb 1, 2020, and before March 1, 2020, representatively. The blue dots are the samples reported after March 1, 2020. a Scores of PC3 vs PC4. b Scores of PC5 vs PC6. c Scores of PC7 vs PC8. d Scores of PC9 vs PC10

Fig. 9

The PCA score plot for non-linear PCA for the shortlist sample in Fig. 1 (processing time is about 2 h using a personal computer with 12 G ram)

The PCA score plot of the COVID-19 sequences including 19,697 samples. The red dots, green dots, and yellow dots are sequences reported before Jan 15, 2020, before Feb 1, 2020, and before March 1, 2020, representatively. The blue dots are the samples reported after March 1, 2020. The other PCs (PC3 to PC10) are listed in Fig. 8 The PCA score plot of the COVID-19 sequences including 19,697 samples. The red dots, green dots, and yellow dots are sequences reported before Jan 15, 2020, before Feb 1, 2020, and before March 1, 2020, representatively. The blue dots are the samples reported after March 1, 2020. a Scores of PC3 vs PC4. b Scores of PC5 vs PC6. c Scores of PC7 vs PC8. d Scores of PC9 vs PC10 The PCA score plot for non-linear PCA for the shortlist sample in Fig. 1 (processing time is about 2 h using a personal computer with 12 G ram)

PCA Loading Plot Information

The score plot was calculated with a combined contribution of a large number of loadings but not just a single position, so the loading plot provides the protentional important positions. For example, if there are group differences in the score plot, the loadings in the same direction will be more important in the differences. In the shortlist sequences, the beginning and end positions usually have differences due to the low coverages or the undetermined nucleotide; however, when the first and last 1000 positions were excluded from the calculation, almost no significant difference could be observed (Fig. 10). This showed a great advantage in a large amount of data interpretation when potential errors could exit with a hard cleanup process. Figure 11 showed that the first PC loadings lead to the separation of Fig. 7 which indicates the potential mutation after time. The important loadings are listed under Fig. 11, and a large area with multiple important loadings lies in position 22,182 which is the spike protein gene site [33]. The predicted gene site positions may contribute to the mutation studies in COVID-19. In addition, though the loading plot can only provide the general trend of the virus sequence difference, it can still be powerful in filtering important changes.
Fig. 10

The comparison of the PCA score plot between the original one and the one after removing the first and last 1000 positions. The blue and orange dots are the full sequences for human virus and animal virus, representatively. The grey dots and yellow dots are the sequences without end positions for the human and animal virus, representatively

Fig. 11

The first PC loading distributions for the total 19,697 genome sequences which potentially indicated the mutation directions from time. The large loadings are 1782 3584, 9804, 11126, 12276, 16358, 20593, 22182, 22220, 30306, and 32631

The comparison of the PCA score plot between the original one and the one after removing the first and last 1000 positions. The blue and orange dots are the full sequences for human virus and animal virus, representatively. The grey dots and yellow dots are the sequences without end positions for the human and animal virus, representatively The first PC loading distributions for the total 19,697 genome sequences which potentially indicated the mutation directions from time. The large loadings are 1782 3584, 9804, 11126, 12276, 16358, 20593, 22182, 22220, 30306, and 32631

Conclusions

In this study, we adapted the frequency method-based PCA approaches originally designed for protein sequence to COVID-19 genome sequences which have more than 30,000 positions after alignment. The study was first demonstrated using a randomly selected shortlist of sequences with 75 samples with animal samples from bat and pangolin for testing the method, and the PCA showed as a powerful tool to separate the human samples from the animal samples using in the PCA score plot. The study also indicated that the end sequence uncertainty and gap positions have very limited influence on the score plot clustering which is suitable for the COVID-19 data which were submitted by users from all over the world and may have potential errors. The method was applied to a total of more than 20,000 sequences which showed the potential direction of the virus mutation with the time. The PCA method is expected to provide a fast analysis method with limited requirements for data cleaning but may have limitations in classifications. The authors suggest researchers use the tool in combination with other methods when a classification result is expected. When the COVID-19 genome sequence database becomes larger and larger, the method provides a great opportunity to study the mutation of the coronavirus and may also be served as a pre-processing for other methods such as the tree building-related approaches. The PCA score plot interpretation also discovered that potential virus evolution directions may worth further study to investigate COVID-19 potential mutations.
  26 in total

1.  Evaluation of dissolution profiles using principal component analysis.

Authors:  E Adams; R De Maesschalck; B De Spiegeleer; Y Vander Heyden; J Smeyers-Verbeke; D L Massart
Journal:  Int J Pharm       Date:  2001-01-05       Impact factor: 5.875

2.  Use of principal components analysis (PCA) on estuarine sediment datasets: the effect of data pre-treatment.

Authors:  M K Reid; K L Spencer
Journal:  Environ Pollut       Date:  2009-05-01       Impact factor: 8.071

3.  Evolution of Sequence-Diverse Disordered Regions in a Protein Family: Order within the Chaos.

Authors:  Thomas Shafee; Antony Bacic; Kim Johnson
Journal:  Mol Biol Evol       Date:  2020-08-01       Impact factor: 16.240

4.  A method to predict functional residues in proteins.

Authors:  G Casari; C Sander; A Valencia
Journal:  Nat Struct Biol       Date:  1995-02

5.  Flexible design of multiple metagenomics classification pipelines with UGENE.

Authors:  Rebecca Rose; Olga Golosova; Dmitrii Sukhomlinov; Aleksey Tiunov; Mattia Prosperi
Journal:  Bioinformatics       Date:  2019-06-01       Impact factor: 6.937

6.  Shared bioinformatics databases within the Unipro UGENE platform.

Authors:  Ivan V Protsyuk; German A Grekhov; Alexey V Tiunov; Mikhail Y Fursov
Journal:  J Integr Bioinform       Date:  2015-09-03

7.  GISAID: Global initiative on sharing all influenza data - from vision to reality.

Authors:  Yuelong Shu; John McCauley
Journal:  Euro Surveill       Date:  2017-03-30

8.  Principal Component Analysis applied directly to Sequence Matrix.

Authors:  Tomokazu Konishi; Shiori Matsukuma; Hayami Fuji; Daiki Nakamura; Nozomi Satou; Kunihiro Okano
Journal:  Sci Rep       Date:  2019-12-17       Impact factor: 4.379

9.  Phylogenetic network analysis of SARS-CoV-2 genomes.

Authors:  Peter Forster; Lucy Forster; Colin Renfrew; Michael Forster
Journal:  Proc Natl Acad Sci U S A       Date:  2020-04-08       Impact factor: 11.205

10.  Mutations on COVID-19 diagnostic targets.

Authors:  Rui Wang; Yuta Hozumi; Changchuan Yin; Guo-Wei Wei
Journal:  Genomics       Date:  2020-09-20       Impact factor: 5.736

View more
  1 in total

1.  Principal Component Analysis as a Statistical Tool for Concrete Mix Design.

Authors:  Janusz Kobaka
Journal:  Materials (Basel)       Date:  2021-05-19       Impact factor: 3.623

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.