Yoji Igarashi, Daisuke Mori, Susumu Mitsuyama, Kazutoshi Yoshitake, Hiroaki Ono, Tsuyoshi Watanabe, Yukiko Taniuchi, Tomoko Sakami, Akira Kuwata, Takanori Kobayashi, Yoshizumi Ishino, Shugo Watabe, Takashi Gojobori, Shuichi Asakawa.
Abstract
Metagenomic data have mainly been analyzed by showing the composition of organisms based on a small, well-examined part of the genomic sequence, such as ribosomal RNA genes and mitochondrial DNA. In contrast, whole metagenomic data obtained by the shotgun sequencing method have rarely been fully analyzed through a homology search, because the genomic data available in databases for living organisms on earth are insufficient. To complement the results obtained through homology-search-based methods with shotgun metagenomic data, we focused on the composition of protein domains deduced from the sequences of genomes and metagenomes, and we used these compositions to characterize genomes and metagenomes, respectively. First, we compared the relationships based on similarities in protein domain composition with the relationships based on sequence similarities. We searched for the protein domains of 325 bacterial species using the Pfam database. Next, we examined the correlation coefficients of the protein domain compositions between every pair of bacteria. Every pairwise genetic distance was also calculated from the 16S rRNA or DNA gyrase subunit B sequence. We compared the results of these methods and found a moderate correlation between them. Essentially the same results were obtained when we used random partial 100 bp DNA sequences of the bacterial genomes, which simulated the raw sequence data obtained from short-read next-generation sequencing. Then, we applied the method to actual environmental data obtained by shotgun sequencing. The method showed that the transition of the microbial phase occurred because of the seasonal change in water temperature. These results demonstrate the usability of the method for characterizing metagenomic data based on protein domain compositions.
Keywords: correlation coefficient; environmental DNA; metagenomics; phylogenetic analysis; protein domain
Year: 2019; PMID: 31035705; PMCID: PMC6630717; DOI: 10.3390/proteomes7020019
Source DB: PubMed; Journal: Proteomes; ISSN: 2227-7382
Figure 1. Dot plots for correlation coefficients of domain combinations and pairwise distances of DNA sequences: (a) The pairwise distances were calculated based on the 16S rRNA sequence. The correlation coefficient was 0.4285, P < 2.2e−16; (b) Domain counts were converted to 0 (absence)/1 (presence), and pairwise distances were calculated based on the 16S rRNA sequence. The correlation coefficient was 0.5967, P < 2.2e−16; (c) Domain counts were converted to ln(number of domains + 1), and pairwise distances were calculated based on the 16S rRNA sequence. The correlation coefficient was 0.5993, P < 2.2e−16; and (d) The pairwise distances were calculated based on the DNA gyrase subunit B sequence. The correlation coefficient was 0.4723, P < 2.2e−16.
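The domain-count transformations named in the Figure 1 caption (raw counts, presence/absence, and ln of count + 1) and the pairwise correlation of domain compositions can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes Pearson correlation (the record does not name the coefficient), and the genome names and counts are hypothetical toy values rather than Pfam results.

```python
import math

def transform(counts, mode="raw"):
    """Apply one of the count transformations named in the Figure 1 caption."""
    if mode == "raw":
        return [float(c) for c in counts]
    if mode == "binary":            # 0 (absence) / 1 (presence)
        return [1.0 if c > 0 else 0.0 for c in counts]
    if mode == "log":               # ln(number of domains + 1)
        return [math.log(c + 1) for c in counts]
    raise ValueError(f"unknown mode: {mode}")

def pearson(x, y):
    """Pearson correlation coefficient (assumed here; not confirmed by the source)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(profiles, mode="raw"):
    """Correlate the domain composition of every pair of genomes."""
    names = sorted(profiles)
    vectors = {name: transform(profiles[name], mode) for name in names}
    result = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            result[(a, b)] = pearson(vectors[a], vectors[b])
    return result

# Hypothetical toy profiles: each position is the count of one protein domain.
profiles = {
    "genome_A": [10, 0, 3, 1],
    "genome_B": [20, 0, 6, 2],
    "genome_C": [0, 5, 0, 1],
}

raw = pairwise_correlations(profiles, mode="raw")
binary = pairwise_correlations(profiles, mode="binary")
log = pairwise_correlations(profiles, mode="log")
```

Each coefficient obtained this way would then be paired with the corresponding sequence-based distance (16S rRNA or DNA gyrase subunit B) to produce one dot in the plots.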
Figure 2. Heatmap analysis of the protein domains using 30 samples of the environmental metagenomic data. The samples are divided into two large clusters: the left cluster contains the 5 μm and 0.8 μm samples, while the right cluster contains the 0.2 μm samples. See Supplementary Figure S7 for an analysis of the results using all of the data sets.
Figure 3. Cluster analysis based on the protein domains using environmental metagenomic data. The distances between the samples were calculated from their correlations, and the samples were clustered using the “ward.D2” method. The samples are divided into four clusters. The black bars and arrows indicate 5 μm filter samples that fall within a cluster of 0.8 μm filter samples. See Supplementary Figure S8 for the high-resolution version.
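The clustering step in the Figure 3 caption can be sketched in plain Python. Two assumptions are made that the record does not confirm: the sample distance is taken as 1 − Pearson correlation, and the Ward update follows the convention documented for R's `hclust(method = "ward.D2")`, in which dissimilarities are squared before the Lance-Williams update and merge heights are reported on the original scale. The sample names and domain counts are hypothetical.

```python
import math

def correlation_distance(x, y):
    """1 - Pearson correlation (an assumed distance; the source does not define it)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def ward_d2(names, vectors):
    """Agglomerative Ward clustering on squared dissimilarities.

    Returns a list of merges as (members_a, members_b, height).
    """
    clusters = {i: (name,) for i, name in enumerate(names)}
    # Squared dissimilarity for every pair of active clusters, keyed (low_id, high_id).
    d2 = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d2[(i, j)] = correlation_distance(vectors[i], vectors[j]) ** 2
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        (i, j), best = min(d2.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dik = d2[tuple(sorted((i, k)))]
            djk = d2[tuple(sorted((j, k)))]
            # Lance-Williams update for Ward's criterion.
            updated[(k, next_id)] = (
                (ni + nk) * dik + (nj + nk) * djk - nk * best
            ) / (ni + nj + nk)
        merges.append((clusters[i], clusters[j], math.sqrt(best)))
        # Drop distances involving the merged clusters, then add the new ones.
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        d2.update(updated)
        clusters[next_id] = clusters[i] + clusters[j]
        del clusters[i], clusters[j]
        next_id += 1
    return merges

# Hypothetical samples: two with similar domain profiles, two with a different profile.
names = ["sample_a", "sample_b", "sample_c", "sample_d"]
vectors = [
    [10, 1, 0, 5],
    [9, 2, 0, 6],
    [0, 8, 7, 1],
    [1, 9, 6, 0],
]

merges = ward_d2(names, vectors)
for left, right, height in merges:
    print(left, right, round(height, 4))
```

In this toy input the two similar pairs merge first and the two resulting groups join last; cutting such a tree at a chosen height is what yields a fixed number of clusters.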
Figure 4. Principal component analysis of the protein domains in the environmental data. The data of the 0.8 μm filter samples were examined under three conditions: sea depth, namely surface (1 m) vs. SCM (10–20 m); location, namely the bay vs. the offshore area; and season, namely from December to April vs. from May to November. The red and green circles show samples from December to April and from May to November, respectively.