Akiko Nakashima1, Mitsue Takeya1, Keiji Kuba2, Makoto Takano1, Noriyuki Nakashima1. 1. Department of Physiology, Kurume University School of Medicine, Asahi-machi 67, Kurume, Fukuoka, 830-0011, Japan. 2. Department of Biochemistry and Metabolic Science, Akita University Graduate School of Medicine, 1-1-1 Hondo, Akita, 010-8543, Japan.
Abstract
The global pandemic of SARS-CoV-2 has disrupted human social activities. In restarting economic activities, successive outbreaks by new variants are concerning. Here, we evaluated the applicability of public database annotations to estimate the virulence, transmission trends and origins of emerging SARS-CoV-2 variants. Among the detectable multiple mutations, we retraced the mutation in the spike protein. With the aid of the protein database, structural modelling yielded a testable scientific hypothesis on viral entry to host cells. Simultaneously, annotations for locations and collection dates suggested that the variant virus emerged somewhere in the world in approximately February 2020, entered the USA and propagated nationwide with periodic sampling fluctuation likely due to an approximately 5-day incubation delay. Thus, public database annotations are useful for automated elucidation of the early spreading patterns in relation to human behaviours, which should provide objective reference for local governments for social decision making to contain emerging substrains. We propose that additional annotations for past paths and symptoms of the patients should further assist in characterizing the exact virulence and origins of emerging pathogens.
The coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has disrupted social and economic activities worldwide since the first outbreak in China in 2019 [1]. COVID-19 presents varied symptomatic features [2–7] with a wide range of incubation periods and epidemic curves across the world [1,4,8], which is likely influenced by many factors, including age, region, preventive measures, health care infrastructure, and the fundamental nature of the virus [3,9,10]. Characterization of the pathogens and preparations for medication and preventive vaccines are being urgently and extensively explored worldwide [11–14].
On restarting economic activities, the second wave of SARS-CoV-2 cases is concerning [15,16]. In Japan, the initial wave of SARS-CoV-2 infection was suppressed by cluster tracking and quarantine [17]. Since the lifting of the emergency declaration, the number of cases has been rapidly increasing [16,18], which necessitates rapid decision making regarding policy updates [19] for deregulation balance in social activities including travelling and business. Currently, the suspected patients are checked for symptoms and tested using reverse-transcriptase polymerase chain reaction (RT-PCR) and antibody-associated immunological assays [20]. Diagnostic tests with high sensitivity and specificity are necessary [21–23]. To trace patient behaviours, local cluster tracking [17] is effective, but information accuracy depends on the reliabilities of directly interviewing the infected individuals. Mobile phone applications (apps) to trace past close contact with patients via Bluetooth technology are utilized in some areas, including Japan [24,25]. However, limitations of such mobile apps may exist for the global tracing outside of the service areas, the user number in all age groups [26], self-reporting reliability or the definition of a close-contact period and distance. Ethical problems may also exist in identifying the index case in small communities and exposing them to scrutiny and harsh judgment due to panic and anxiety [27,28].
Simultaneously, the emergence of mutated variants of SARS-CoV-2 has been confirmed [12,29,30]. RT-PCR, immune assays for diagnostics, medications for treatment or vaccines for prevention might be vulnerable to mutated substrains. Although technical advances are occurring [21,22], the pathogenicity and origins of the mutated substrains of SARS-CoV-2 should be available in real time to adopt early measures by authorities at the onset of emergence.
In parallel with individual treatment at hospitals and clinics, specimens from infected patients are directly sequenced, and the genetic information of SARS-CoV-2 is being globally sampled and added to public databases [31–33]. The databases have been used to predict viral transmissibility, antibody affinities and drug efficacy [34]. The cross-disciplinary usability of databases should promote the feedback of accumulating raw data to predict the actual profiles of pathogenic diseases [35]. Simple real-time surveys with regional public assistance are fundamentally necessary in an internationally available format.
Here, we utilized these database annotations to detect virus variants and to estimate the virulence and transmission trajectories of the emerging substrains. We examined the nucleotide mutations and visualized the transmission trajectories of SARS-CoV-2 by consulting the world specimens registered in the virus data bank of the National Center for Biotechnology Information (NCBI) [32].
Methods
Data acquisition for SARS-CoV-2 and other specimens
Due to its accessibility to the raw data of nucleotides and proteins with multiple annotations in a simple FASTA format, we used the data deposited in the NCBI Virus-SARS-CoV-2 data hub [36]. In the “Refine results” window, we specified the data by release date 2019/1/11–2020/5/3 (from 11 Jan 2019 to 3 May 2020). The latest data at that point were deposited on 1 May 2020. In the “Results” window, we rearranged the “Length” in ascending order. Then, we obtained 23042 FASTA-formatted Protein records containing Accession, GenBank Title, Geo_Location, Host, Species and Nucleotide Completeness in order, and 2051 FASTA-formatted Nucleotide records. For another analysis, we also obtained the table view results for the 23042 Protein records and the 2051 Nucleotide records in CSV format, containing Accession, Release_Date, Species, Genus, Family, Length, Sequence_Type, Nuc_Completeness, Genotype, Segment, Authors, Publications, Geo_Location, Host, Isolation_Source, Collection_Date, BioSample and GenBank_Title. The protein and nucleotide sequences in one-letter codes were analysed by using the Excel filter function. Accession numbers for viral genomes are SARS-CoV-2, NC_045512.2; SARS-CoV, NC_004718.3; MERS, NC_019843.3; HCoV-NL63, KF530114.1; HCoV-229E, KF514433.1; HCoV-HKU1, NC_006577.2; and HCoV-OC43, KX344031.1. The homology alignment was performed using the online tool CLUSTALW [37]. See Supplementary notes 1 and 2 for the step-by-step procedures.
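The filtering step above was performed with the Excel filter function. As a minimal sketch only, the same step could be expressed in Python as follows. The pipe-delimited header layout, the field names in `FIELDS`, the value `complete` and the function names are assumptions made for illustration; the header of an actual download should be checked before use.

```python
from typing import Dict, Iterator, List

# Assumed field order, following the list given in the text. The delimiter
# ("|") is an assumption and may differ in a real download.
FIELDS = ["accession", "title", "geo_location", "host", "species", "completeness"]


def _make_record(header: str, chunks: List[str]) -> Dict[str, str]:
    """Split one header line into annotation fields and attach the sequence."""
    values = [value.strip() for value in header.split("|")]
    record = dict(zip(FIELDS, values))
    record["sequence"] = "".join(chunks)
    return record


def parse_fasta(text: str) -> Iterator[Dict[str, str]]:
    """Yield one annotated record per FASTA entry."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def complete_only(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep records annotated as complete, dropping partial readouts."""
    return [r for r in records if r.get("completeness", "").lower() == "complete"]
```

Each record is a plain dictionary, so further filters on location or host can be written as ordinary list comprehensions.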
Entropy calculation
To evaluate the proportions (P) of mutations at a certain ith residue of the S protein, we used the information entropy [30], defined as $H_i = -\sum_{a \in AA} P_{i,a} \ln P_{i,a}$, where AA is the set of 20 biological amino acids and $P_{i,a}$ is the proportion of amino acid $a$ at the ith residue. The term $P_{i,a} \ln P_{i,a}$ was defined as zero when the mutated residue was not detected in the samples at the ith residue. The calculated entropy scores were plotted along the whole S protein structure to yield the spectrum view of the scores as an entropy spectrum. See Supplementary note 1 for how to use the program.
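The per-residue information entropy can be illustrated with a short Python sketch. This is not the supplementary Excel program; it assumes the sequences are already aligned to the same numbering, uses the conventional negative sign of Shannon entropy, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 biological amino acids


def column_entropy(residues: Sequence[str]) -> float:
    """Entropy of the amino acid proportions observed at one position.

    Amino acids that are not observed contribute zero, matching the
    convention stated in the text. Non-standard letters are ignored.
    """
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log(p)
    return entropy


def entropy_spectrum(sequences: List[str]) -> List[float]:
    """Entropy at every position shared by all (pre-aligned) sequences."""
    if not sequences:
        return []
    length = min(len(s) for s in sequences)
    return [column_entropy([s[i] for s in sequences]) for i in range(length)]
```

A fully conserved position scores zero, and a position split evenly between two residues (for example D and G) scores ln 2, so an emerging substitution appears as a growing peak in the spectrum.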
Protein conformational analysis
Model structures with D614G mutagenesis were constructed using PyMOL (The PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC., NY, USA). The structural models for coronavirus spike glycoprotein are 6vsb [38] (SARS-CoV-2), 5xlr [39] (SARS-CoV), 5x5c [40] (MERS), 6nzk [41] (HCoV-OC43) and 6u7h [42] (HCoV-229E) obtained from the Protein Data Bank [43] (www.rcsb.org). For the hydrophobicity search, we consulted the online resource portal of the Swiss Institute of Bioinformatics (https://web.expasy.org/protscale/) [44].
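To illustrate what a hydrophobicity search of this kind computes, the following Python sketch averages a hydropathy scale over a sliding window. The text does not state which scale or window was selected, so the Kyte-Doolittle scale, the unweighted window mean and the function name are assumptions made for illustration. The short peptide QVAVLYQ is the translation of the marker string given in the program description, and it immediately precedes residue 614.

```python
from typing import List

# Kyte-Doolittle hydropathy values (assumed scale; not confirmed by the text).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 9) -> List[float]:
    """Unweighted mean hydropathy for every full window along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


# Compare the eight residues ending at position 614 for D614 and D614G.
original = hydropathy_profile("QVAVLYQD", window=8)[0]
converted = hydropathy_profile("QVAVLYQG", window=8)[0]
shift = converted - original  # positive: the G variant is more hydrophobic
```

On this scale, replacing aspartate with glycine raises the score of every window that contains residue 614, which is the direction of change described for the converted substrain.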
Program for sequential data analysis
We built original programs to handle the large datasets. The code is provided as supplementary data in Excel Visual Basic (Office Professional 2016, Microsoft Corporation, WA, USA), and all source code is annotated. Each program operates as follows: “prPCVcov2” places each single-letter code of every amino acid sequence in a separate cell for all the different protein datasets in a FASTA-formatted data file; “priNuc” extracts the text string “GAT” or “GGT” that comes immediately after the text string “CCAGGTTGCTGTTCTTTATCAG”. See Supplementary notes 1 and 2 for how to use the program.
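The codon extraction performed by “priNuc” can be sketched in a few lines of Python. This is an illustrative re-expression, not the supplementary Visual Basic code; the function names and the handling of lower-case or RNA-alphabet input are assumptions.

```python
from typing import Optional

# Marker string quoted in the text; the three nucleotides that follow it
# encode residue 614 of the S protein.
MARKER = "CCAGGTTGCTGTTCTTTATCAG"


def codon_after_marker(genome: str, marker: str = MARKER) -> Optional[str]:
    """Return the three nucleotides following the first marker occurrence."""
    seq = genome.upper().replace("U", "T")  # accept RNA or DNA alphabets
    position = seq.find(marker)
    if position == -1:
        return None
    start = position + len(marker)
    codon = seq[start:start + 3]
    return codon if len(codon) == 3 else None


def classify_specimen(genome: str) -> str:
    """Label a genome by the codon found at residue 614."""
    codon = codon_after_marker(genome)
    if codon == "GAT":
        return "D614"
    if codon == "GGT":
        return "D614G"
    return "other"  # marker missing, truncated read, or a different codon
```

Returning `other` rather than raising keeps partial or divergent sequences visible in the tally instead of silently dropping them.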
Graph design
The processed metrics were visualized by Kaleida Graph 4 (HULINKS Inc., Tokyo, Japan), and artworks were originally created with Illustrator (Adobe Systems Incorporated, CA, USA). The SVG data were obtained from the public domain under the license of CC0 1.0 at https://en.wikipedia.org/wiki/File:BlankMap-World6-Equirectangular.svg. The world map originated from the United States Central Intelligence Agency's World Fact Book.
Spectral analysis of sampling periodicity
The periodicity of sampling of the mutated specimens was analysed by power spectrum using Axograph X (Version 1.7.4).
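For readers without access to Axograph X, the idea of a power spectrum on a daily series can be sketched with a plain discrete Fourier transform in Python. This is a generic periodogram under stated assumptions (evenly spaced daily samples, mean removed, no windowing or detrending); it is not a reproduction of the settings used in the software, and the function names are hypothetical.

```python
import cmath
import math
from typing import List, Tuple


def power_spectrum(series: List[float]) -> List[Tuple[float, float]]:
    """Return (period in samples, power) for each frequency up to Nyquist."""
    n = len(series)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean = sum(series) / n
    centred = [x - mean for x in series]  # remove the zero-frequency offset
    spectrum = []
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        spectrum.append((n / k, abs(coefficient) ** 2 / n))
    return spectrum


def dominant_period(series: List[float]) -> float:
    """Period (in samples) carrying the largest spectral power."""
    return max(power_spectrum(series), key=lambda pair: pair[1])[0]


# Synthetic check: a 40-day series with an exact 5-day cycle.
synthetic = [math.sin(2 * math.pi * day / 5) for day in range(40)]
period = dominant_period(synthetic)
```

With one sample per day, the returned period is expressed in days. Real deposit data are noisy and unevenly sampled, so a peak should be read as a range rather than a single value.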
Results
Annotation search detects multiple conversions across the proteins of SARS-CoV-2
Coronaviruses are unique RNA viruses equipped with proofreading machinery [45]. However, substantial mutations were expected, leading to the overestimation of substrains with unchanged genetic codons. On the other hand, amino acid mutations occur less frequently due to the wobble nature of codons [46].
Therefore, we consulted the NCBI database [36] and utilized a downloaded data table with all the applicable annotations (see also Methods). Among them, partial sequences or incomplete readouts were eliminated. We used 1500–2000 nucleotide and protein sequences with all applicable annotations, including sampling dates, locations, and genetic information of the virus.
We detected the accumulation of the same mutations or the branching to multiple amino acids at approximately 100 residues in several component proteins of SARS-CoV-2 (Supplementary Table 1). Despite the genome proofreading ability of coronaviruses [45], multiple random mutations in SARS-CoV-2 have been reported [47,48]. Any of these conversions might be attributable to increased or decreased virulence of viral particles [29]. In particular, the presence and increase of identical mutations at the same residues from different specimens could be due to the transmissible pathogenic substrains of SARS-CoV-2 [12]. Mutations in the amino acid sequences have indeed occurred in different phases of the COVID-19 pandemic and are probably fixed, inherited and dominantly spreading around the world.
However, the pathogenicity and exact origins of these variations are difficult to retrace only using this mutation profile. Among the proteins with frequent mutations, the surface glycoprotein, namely, the spike or S protein, contained a single eminent mutation from aspartate (D) to glycine (G) at 614 (D614G conversion; Supplementary Table 1). The relative mutation frequency at each residue was calculated as information entropy to digitize the variations across the S protein in the database [30] and visualized in a spectral view across the S protein: D614G appeared in the early stage of the COVID-19 pandemic and accumulated over time (Fig. 1a). Among the D614G substrain, additional major mutations accumulated in other viral proteins in contrast to the D614 original strain (Supplementary Table 1, Supplementary Figure 1); the D614G could be an initial mutation for a more dominant substrain circulating in the world afterwards. Therefore, we next investigated the possible impact of the D614G mutation in the S protein of the converted substrain by the structural analysis and estimated the regional origins by the sampling periodicity analysis based on the obtained Excel data.
Fig. 1
Comparison between highly pathogenic and common coronaviruses.
(a) Schematic of the SARS-CoV-2 spike (S) protein (upper) and the entropy spectra (lower) to visualize mutations in S protein at different dates. RBD = receptor-binding domain, SD = subdomains, S1/S2 = protease cleavage site between N-terminal S1/C-terminal S2 domains. N and C = amino and carboxyl termini. D614 is in SD. (b) Homology alignment of the residues around D614 between highly pathogenic coronaviruses and other human coronaviruses (HCoV-OC43, -HKU1, -229E and -NL63). HCoVs are less lethal and common cold viruses. Middle East respiratory syndrome coronavirus (MERS) contains A614, but the surrounding residues are also variable compared to SARS-CoV and SARS-CoV-2 (SCoV-2). Among HCoVs, only HCoV-OC43 possesses the equivalent residues with relative homology to SARS-CoV-2. Amino acid numbering is based on SARS-CoV-2. (c) Hydrophobicity between the 598–628 residues of different coronaviruses. The hydrophobicity presents a steep increase in SARS-CoV, SARS-CoV-2 with D614G conversion, and MERS compared to that in the initial SARS-CoV-2 with D614 between 603 and 618 residues indicated in the grey bars in (b) and (c).
D614G conversion in the S protein may affect viral entry
We consulted the Protein Data Bank [43] and the Swiss Institute of Bioinformatics resource portal [44] for the subsequent structural analysis. The spike of SARS-CoV-2 forms a homotrimer. Each S protein, comprising approximately 1300 amino acids, is a large transmembrane protein containing two subdomains, S1 and S2, which are responsible for receptor binding and membrane fusion, respectively [49] (Fig. 1a). D614 in S protein is conserved in SARS-CoV in 2003 as well as in the initial isolates of SARS-CoV-2 in China in January 2020 [50].
Compared to less lethal human coronaviruses [51], highly pathogenic coronaviruses possess a large increase in hydrophobicity upstream of D614 in the subdomain (Fig. 1b and 1c). Moreover, D614 and the corresponding residues slightly deviated or did not exist in the equivalent positions among the other coronaviruses (Fig. 2).
Fig. 2
Comparison of D614 and equivalent residues across coronaviruses.
The entire surface views of the available spike crystal structures of SARS-CoV-2, SARS-CoV, MERS, HCoV-OC43 and HCoV-229E. See also the grey bar in Fig. 1b and 1c. The SARS-CoV-2 S protein with D614G conversion is indicated in beige, and the other spikes are in light grey; the aligned structures are shown in mosaic colours. Homologous regions surrounding D614 of SARS-CoV-2 are highlighted with distinct colours. A box with grey solid lines shows a close-up view of D614G conversion in SARS-CoV-2. Boxes with black solid lines show the equivalent residues in other CoVs. Boxes with black-grey dotted lines show the structural alignment of equivalent regions between SARS-CoV-2:D614G and the other respective CoVs. HCoV-229E lacks the residue equivalent to D614 of SARS-CoV-2. Boxes in green indicate the close-up view of the equivalent residues rotated by 90°; not overlaid. The data for HCoV-HKU1 and HCoV-NL63 were not found in the database. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
Structurally, D614 is embedded in the S1 domain of the S protein, facing another protein unit within the trimer (Fig. 3a and 3b), but this aspartic acid residue is not accessible from the orifice for receptor binding. Thus, D614G conversion is expected to change the inter- and intramolecular properties of the spike trimers. Molecular simulation predicted that the single D614G replacement would increase the thermal fluctuation not only in the vicinity but also throughout S protein, especially in the S2 subunit near the viral membrane [52] (Fig. 3c–f). D614G conversion resulted in the deletion of the side chain of the aspartic acid residue, and the distance between D614 and T859 of another protein unit should expand from 4.4 to 6.4 angstroms (Fig. 3g and 3h). This estimation indicates that the D614G mutation should change the inter-subunit interaction in the subdomain and the conformational state of the receptor binding domain so that the mutated viral particles can effectively interact with its cognate receptor in host cells for viral entry [12,34,53].
Fig. 3
D614G could affect the molecular stability of the S protein.
(a, b) Spike structure of (a) D614 and (b) D614G. I-III indicate the trimeric S protein units shown in different colours. The D614 atoms of unit I are highlighted in blue and red. (c–f) Thermal fluctuation profiles by Debye-Waller factor (c, d) across the whole protein unit and (e, f) in the close-up views of the vicinities of D614 and D614G, respectively. The heat map represents the atomic displacement. (g, h) Estimated distance from (g) D614 or (h) D614G to T859 of the other protein unit. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
GAU to GGU conversion can be traced back to February 2020
Next, we investigated the mutations at the genome level to retrace the transmission history of the D614G-converted virus. By analysing the nucleotide database, we detected the identical conversion from guanine-adenine-uracil to guanine-guanine-uracil (GAU to GGU) in all the D614G-converted cases despite 8 more convertible codons.
We then retraced the specimens with a GGU mutation. The GGU specimens exponentially increased in March 2020 worldwide (Fig. 4a). As of May 1, the conversion was found in more than half of the databank resource samples and similarly in almost all the regions (Fig. 4a). The SARS-CoV-2 virus originally isolated in Wuhan, China, was a nonconverted GAU type [50,54] and so were all the specimens reported in China until the de facto termination of an emergency state in March. This GAU-to-GGU conversion was first detected in a specimen in Spain (MT292580) and next in the United States; there was also one from New England (MT276323) and two from Florida (MT276329, MT276330) collected on 28 February and one each from New Hampshire (MT304484) and Georgia (MT276327) on 29 February (Fig. 4b). Interestingly, the mutated specimens were sampled in distant cities in the USA on 1 March 2020, in Washington (MT415895), Connecticut (MT350239), and California (MT304491), implying that the D614G conversion might confer the virulence of SARS-CoV-2 in Europe and America [29,54]. This result suggests that the patient had close contact with another patient along their mobility history.
Fig. 4
GAU-to-GGU conversion was detected worldwide.
(a) Monthly timeline of the newly added specimens carrying GAU and GGU nucleotide mutations in the world and the continents. The data sampled in China are separately shown, where a state of emergency has been lifted. The data in April were not registered (NR) in China. GGU conversion was not registered in China. The specimen numbers are shown on a logarithmic scale. No other codons for D614G were found. (b) Geolocations and dates of the newly collected specimens in March 2020. Red circles indicate regions as follows: Spain, USA, Peru, Israel, France, Greece, South Africa, Sri Lanka, Czech Republic, Taiwan, India, Germany, Hong Kong and Puerto Rico. In the USA, only the representative states were listed: Washington, Connecticut, California, Minnesota, Illinois, Texas, Virginia, Arizona, Utah and Indiana. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
Transmission of D614G substrain within the USA after entry
Specimens of the original SARS-CoV-2 without mutations had already been reported in the United States in January 2020 (Fig. 5a). The sampling ratio of the GGU-mutated specimens with respect to the original GAU specimens suddenly increased at the end of February, followed by periodic fluctuations (Fig. 5b). The spectral analysis indicated that the predominant transmission interval ranged from 4 to 6 days. This period most likely corresponds to the approximate incubation delay at the early phase of transmission of the mutated substrain within the United States (Fig. 5c). When the database was consulted again on 22 June 2020, the deposited data had increased not only in total number (from 1866 to 7596 specimens in the world) but also in the number of monthly specimens based on collection dates: January, from 84 to 129; February, from 78 to 189; March, from 1527 to 4155; April, from 161 to 2468 specimens. Therefore, many specimens were deposited with a substantial delay after their collection dates (for example, those collected at around the end of February, Fig. 5a and 5b). Despite such a discrepancy, the trends in spectral analysis were mainly unchanged; the periodicity was slightly sharpened (Fig. 5c).
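The daily proportion of GGU among the GAU and GGU specimens, which is the quantity examined for periodicity, can be sketched in Python as follows. The input layout (pairs of an ISO-formatted collection date and a codon label) and the function name are assumptions made for illustration, and the records in the usage lines are toy values rather than database entries.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def ggu_proportion_by_date(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each collection date to GGU / (GAU + GGU) for that date.

    Records with any other codon label are ignored, so every date that
    appears in the result has at least one GAU or GGU specimen.
    """
    counts = defaultdict(lambda: {"GAU": 0, "GGU": 0})
    for date, codon in records:
        if codon in ("GAU", "GGU"):
            counts[date][codon] += 1
    return {
        date: tally["GGU"] / (tally["GAU"] + tally["GGU"])
        for date, tally in sorted(counts.items())  # ISO dates sort by time
    }


# Toy input for illustration only.
toy_records = [
    ("2020-02-28", "GGU"),
    ("2020-02-28", "GAU"),
    ("2020-02-29", "GGU"),
    ("2020-02-29", "NNN"),
]
toy_ratio = ggu_proportion_by_date(toy_records)
```

Dates with no specimens are simply absent from the result; before a spectral analysis they would need to be filled or interpolated so that the series is evenly spaced.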
Fig. 5
The transmission in the United States was retraceable.
(a) The original SARS-CoV-2 (GAU) and the converted (GGU) specimens collected in the United States and deposited as of 1 May 2020 (upper), and the proportion of GGU in the daily deposits (GAU + GGU). (b) The GAU and GGU specimens collected as of 22 June 2020 (upper) and their proportions. (c) Spectral analysis of the ratio trend from 28 February through 8 April in (a) and from 28 February through 21 May in (b). Periodicities (red arrow) of infection expanding within the United States were elucidated. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
Collectively, the annotations in the virus genome database are of fundamental use to hypothesize the pathogenicity and to trace the transmission route at the early phase of emergence of the new substrains. These results have elucidated the need for additional annotations on patients (Fig. 6a and 6b), which should reinforce the utility of virus genomic annotations by characterizing the symptomatic features (Fig. 6c).
Fig. 6
Annotation tags for medical and mobility history enable worldwide real-time meta-analysis.
(a) Current annotation list associated with virus data lacks medical information on the host patients. (b) Example format of digitized annotations based on medical interviews, laboratory data and medical treatment with privacy protection. (c) Analyses from the virus annotation data (left) reveal the emergence of the new substrains, predicted spreading span and cycles of the viruses. Additional information on medical records enables refinement of the relationship between the host and SARS-CoV-2 in general (right). The origins for the cases with similar clinical features and the same viral information can be retraced back with past-path annotation tags. Furthermore, the annotations with follow-ups and outcomes will update the profiles of COVID-19 with substrains including severity, morbidity or unique symptomatic trends.
Discussion
Annotations need further implementation
The current NCBI database can elucidate the weekly or monthly trends in the propagation of emerging virus variants. Estimations of the transmission trajectories and close contacts by multiple data comparisons can refine the current genomic and geographical features of SARS-CoV-2 [25,30,54]. However, the exact origins and the pathogenicity of the virus variants need to be more refined [55,56]. The virus is closely linked to human behaviours and health conditions. If viral information is tagged with additional annotatable data on the patients, we can make the best of the limited number of specimens. In particular, travel history and medical records are critically useful. Such human-associated information should be tagged to the virus information.
Information on the origins of emerging viruses
To stop further economic loss on a global scale, the restrictions on international personal travel will be relaxed in the future. If any outbreaks of other pathogenic substrains that require different medical treatments [57–59] occur during this ongoing pandemic, the restart of global traffic may result in sequential attacks of variant viruses on human society. In regard to the urgent clinical necessity, it is also important to locate the origins of the emergence of new strains of fatal viruses to prepare medical facilities for the emergency [29]. The annotation tags for patients’ mobility history linked to virus information should be useful to retrace the detailed transmission paths of virus variants using similar filtering functions in Excel format. There should be a deposit delay after collection under conditions of social turmoil even with the aid of next-generation sequencing [15,60,61]. Thus, occasional updating of the same datasheet is important. Even though voluntary services are limited in some parts of the world, international cooperation for fixed-point surveys is necessary to reinforce the global monitoring and retracing of the transmission paths of emerging pathogens along with human mobility.
Information on the pathogenicity of emerging viruses
Structural modelling would be helpful for hypothesising pathogenicity. However, a portion of the conformational data downstream of D614 is also not in the PDB database. Such structural information which will add to further insights into the molecular mechanisms should be accompanied by symptomatic features [29,56] to ultimately understand pathogenicity in humans on the basis of experimental studies [62].
Clinical review system at present
The outbreak has been rather small in Japan [63], and the mechanisms remain unknown due to the lack of available information on the disease, namely, the patient symptoms. Currently, individual case reports, case series, regional analyses, and meta-analyses are conducted under ethical regulations and structured protocols [17,63,64,65]. These close observations by trained clinical staff characterized the unique symptoms of COVID-19, including olfactory and gustatory impairments [66]. However, meta-analyses of the symptoms take time to appear [17,63]. Moreover, the majority of young patients with COVID-19 are suspected to be asymptomatic [67]. If the emergence of a new variant is traced and retraced on a real-time public platform with visualization [30,68] that combines such medical data with human mobility history data [69], governments and other authorities can take swift and flexible actions to contain the virus.
Medical record annotations are needed
At present, little is known about the specific symptoms of COVID-19 [70]. An analysis of the trends of a pandemic is reinforced by open source, public databases with medical annotations. Medical records on past paths of human mobility should be used to refine the total profile of virus-human relationships with acceptable anonymity [70]. Since partial data are available at the initial time of deposit, the information may need continual updates.
SARS-CoV-2 and other emerging infectious diseases [71,72] are associated with human socioeconomic activities together with environmental and ecological factors. Compartmentalization of the world into monitorable regions based on human mobile trends [73] and sentinel surveillance including pathogen sampling with patient medical records is necessary. Local governments around the world should share real-time information on the changing nature of viruses and could conduct regional prevention measures, including caution procedures, travel restrictions and lockdowns [9].
Additionally, the susceptibility of animals to pathogens as the intermediate transmission source should be addressed [74]. Along with a risk assessment for animal-borne infections via domestic animals or animals in zoos, a simple tag for infected animals must be used to estimate the potential risks of zoonosis [75,76].
Use of other medical bioinformatics
Conventionally, the use of databases to predict other diseases has been developed as a disease mining method. The literature can be searched using MeSH terms [77], and the extraction of available data from the literature, including case reports using natural languages [78], can be conducted as a meta-analysis or evidence-based medicine (EBM). However, a lack of verbalized resources can be a barrier to EBM [79]. Diagnosis by digital data is especially powerful in evaluating electrocardiogram, gene-phenotype association, and pathological data [80,81] or radiographic images [78]. Composite phenotypes can also be assessed through multivariate correlations [82,83]. Automated clustering using digital annotations should decrease the substantial risk of overlooking a relevant prior study or finding. Artificial intelligence (AI) can further optimize the diagnostic accuracy [84]. However, AI carries its own risk of overlooking minor trends in rare cases because of overfitting [84]. Virus and medical annotation tags in a simple and unified spreadsheet format are preferable for further analyses in the future. The empirical insights of medical staff are surely needed for detailed annotations, which are especially important when unique pathogens emerge.
Conclusions
The current databases are already powerful and useful and can evolve based on the needs of the implementation of sociomedical science. We propose the use of additional annotation tags for patients that are anonymized with maximum privacy protection and informed consent on sampling virus genetic data around the world without borders. Additionally, a cooperative system of international databases [32,33] in a single platform might also be helpful during this global emergency. Urgent international discussion is needed.
Author contributions
AN and NN analysed the downloaded data. AN and NN discussed the results and wrote the manuscript with the other authors.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.