Yuki Iwasaki, Takashi Abe, Norihiro Okada, Kennosuke Wada, Yoshiko Wada, Toshimichi Ikemura.
Abstract
With the remarkable increase in genomic sequence data from a wide range of species, novel tools are needed for comprehensive analyses of big sequence data. The self-organizing map (SOM) is a powerful tool for clustering high-dimensional data on one plane. For oligonucleotide compositions handled as high-dimensional data, we have previously modified the conventional SOM for genome informatics: BLSOM. In the present study, we constructed BLSOMs for oligonucleotide compositions in fragment sequences (e.g. 100 kb) from a wide range of vertebrates, including coelacanth, and found that the sequences were clustered primarily according to species, without the use of species information. As one of the nearest living relatives of tetrapod ancestors, coelacanth is believed to provide access to the phenotypic and genomic transitions leading to the emergence of tetrapods. The characteristic oligonucleotide composition found for coelacanth was connected with the lowest dinucleotide CG occurrence (i.e. the highest CG suppression) among fishes, which was roughly equivalent to that of tetrapods. This evident CG suppression in coelacanth should reflect molecular evolutionary processes of epigenetic systems, including DNA methylation, during vertebrate evolution. The sequence of a de novo DNA methylase (Dnmt3a) of coelacanth was found to be more closely related to that of tetrapods than to that of other fishes.
Keywords: CG suppression; DNA methylation; SOM; big data; epigenetic
Year: 2014 PMID: 24800745 PMCID: PMC4195492 DOI: 10.1093/dnares/dsu012
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Oligonucleotide BLSOM for 100-kb sequences. (Ai) DegTri for six fish genomes. Lattice points containing sequences from multiple species are indicated in black and those containing sequences from a single species are coloured by species: coelacanth, medaka, stickleback (Gasterosteus aculeatus), fugu, tetraodon and zebrafish. Lattice points containing no sequences after BLSOM calculation are left blank (white). In total, 89, 95 and 97% of the sequences are located at lattice points containing sequences from a single species (designated pure lattice points) on DegDi, DegTri and DegTetra, respectively (see also Supplementary Fig. S1Ai and Bi). The separation of two closely related species, fugu and tetraodon, is rather poor, especially on DegDi; black lattice points are therefore evident within their territories (green for fugu and dark green for tetraodon). When these two species, both belonging to Tetraodontidae, are grouped into one category, 92, 97 and 98% of all fish sequences are located at pure lattice points on DegDi, DegTri and DegTetra, respectively. BLSOM for 50-kb sequences also gave a clear species-specific separation (data not shown). (Aii) Examples of trinucleotides over-represented and evidently under-represented in the coelacanth genome. The occurrence of the trinucleotide at each lattice point has been calculated and normalized with the occurrence expected from the mononucleotide composition at that lattice point.[8] This observed/expected ratio is indicated by the colour scale presented at the bottom of the panel. (Bi) DegTri for 11 vertebrate genomes. Lattice points containing sequences from multiple species are indicated in black and those containing sequences from a single species are coloured. Fishes are marked as described in Ai, and the tetrapods are chicken, human (Homo sapiens), mouse (Mus musculus), lizard (Anolis carolinensis) and X. tropicalis.
About 89, 97 and 98% of the sequences are located at the lattice points containing sequences from a single species (pure lattice points) on DegDi, DegTri and DegTetra, respectively (see also Fig. 2Ai and Supplementary Fig. S1Ci). When fugu and tetraodon are grouped into one category, 90, 98 and 99% of all fish sequences are located in the pure lattice points. (Bii) Lattice points containing sequences derived only from fish genomes are marked with the colours listed in Ai. (Biii) Examples of trinucleotides whose occurrence differs between coelacanth and other fishes. Lattice points are marked as described in Aii.
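The pure-lattice-point percentages quoted in this caption can be illustrated with a small counting sketch. It is only an illustration under stated assumptions, not the authors' BLSOM software: the input format (one `(lattice_point, species)` pair per sequence), the function names `pure_lattice_fraction` and `merge_species`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def pure_lattice_fraction(assignments):
    """Fraction of sequences located at lattice points that hold a single species.

    `assignments` is a hypothetical input format: an iterable of
    (lattice_point, species) pairs, one pair per fragment sequence.
    """
    species_at = defaultdict(set)   # lattice point -> species seen there
    count_at = defaultdict(int)     # lattice point -> number of sequences
    for point, species in assignments:
        species_at[point].add(species)
        count_at[point] += 1

    total = sum(count_at.values())
    if total == 0:
        raise ValueError("no sequences were assigned to the map")

    # A lattice point is "pure" when every sequence on it is from one species.
    pure = sum(n for point, n in count_at.items() if len(species_at[point]) == 1)
    return pure / total


def merge_species(assignments, groups):
    """Relabel species using `groups` (e.g. fugu and tetraodon as one category)."""
    return [(point, groups.get(species, species)) for point, species in assignments]


# Toy data (invented for illustration only).
toy = [
    ((0, 0), "coelacanth"),
    ((0, 0), "coelacanth"),
    ((0, 1), "fugu"),
    ((0, 1), "tetraodon"),
    ((1, 0), "zebrafish"),
]
separate = pure_lattice_fraction(toy)
grouped = pure_lattice_fraction(
    merge_species(toy, {"fugu": "Tetraodontidae", "tetraodon": "Tetraodontidae"})
)
```

In the toy data the lattice point shared by fugu and tetraodon counts as mixed until the two species are merged into one category, which mirrors why the percentages in the caption rise when Tetraodontidae are grouped.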
Figure 2. DegTetra for 100-kb sequences derived from 11 vertebrate genomes. (Ai and ii) Lattice points are marked as described in Fig. 1Bi and ii. (Aiii) Examples of tetranucleotides whose occurrence differs between coelacanth and other fishes. Lattice points are marked as described in Fig. 1Aii. (B) DegTri for 100-kb unique sequences from 11 vertebrate genomes. DegTri for repeat sequences also showed species-specific separation (data not shown). (Ci) DegTri constructed for unique plus repeat sequences. (Cii) Lattice points containing only unique sequences are marked. (Ciii) Lattice points containing only repeat sequences are marked.
Figure 3. Normalized CG and CC+GG levels for 44 vertebrate genomes: armadillo (Dasypus novemcinctus), bushbaby (Otolemur garnettii), cat (F. catus), chicken (G. gallus), chimp (Pan troglodytes), coelacanth (L. chalumnae), cow (Bos taurus), dog (Canis lupus), elephant (Loxodonta africana), fugu (Fugu rubripes), medium ground finch (Geospiza fortis), gibbon (Nomascus leucogenys), gorilla (Gorilla gorilla), guinea pig (Cavia porcellus), hedgehog (Erinaceus europaeus), horse (Equus caballus), human (H. sapiens), lamprey (P. marinus), lizard (A. carolinensis), marmoset (Callithrix jacchus), medaka (O. latipes), microbat (Myotis lucifugus), mole rat (Heterocephalus glaber), mouse (M. musculus), opossum (M. domestica), orangutan (Pongo pygmaeus), panda (Ailuropoda melanoleuca), pig (Sus scrofa), platypus (O. anatinus), rabbit (Oryctolagus cuniculus), rat (Rattus norvegicus), rhesus (Macaca mulatta), sheep (Ovis aries), shrew (Sorex araneus), stickleback (G. aculeatus), Tasmanian devil (S. harrisii), tenrec (E. telfairii), tetraodon (Tetraodon nigroviridis), turkey (M. gallopavo), turtle (Chrysemys picta), wallaby (M. eugenii), Xenopus (X. tropicalis), zebra finch (Taeniopygia guttata) and zebrafish (D. rerio). (A and B) The occurrence levels of CG and of CC+GG, normalized with the level expected from the mononucleotide composition. The species are arranged in descending order of the normalized CG level. The data for coelacanth are marked with an arrow and those for other fishes with a horizontal bar.
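The normalization named in this caption (observed occurrence divided by the occurrence expected from the mononucleotide composition) can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The function name `normalized_level` is hypothetical, and the treatment of CC+GG as the summed observed frequency divided by the summed expected frequency is an assumption, since the caption does not spell out how the two dinucleotides are combined. Windows containing characters other than A, C, G or T are skipped, which is also an assumption.

```python
import math
from collections import Counter

BASES = set("ACGT")


def normalized_level(seq, words):
    """Observed/expected ratio for one or more oligonucleotides of equal length.

    Expected frequencies come from the mononucleotide composition of `seq`.
    For several words (e.g. ["CC", "GG"]) the observed and expected
    frequencies are each summed before dividing (an assumed convention).
    """
    seq = seq.upper()
    k = len(words[0])
    if any(len(w) != k for w in words):
        raise ValueError("all words must have the same length")

    # Mononucleotide frequencies, ignoring ambiguous characters such as N.
    mono = Counter(b for b in seq if b in BASES)
    n_mono = sum(mono.values())

    # Overlapping k-mer windows made only of unambiguous bases.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    valid = [w for w in windows if set(w) <= BASES]
    if not valid:
        raise ValueError("sequence too short or fully ambiguous")

    counts = Counter(valid)
    observed = sum(counts[w] for w in words) / len(valid)
    expected = sum(math.prod(mono[b] / n_mono for b in w) for w in words)
    if expected == 0:
        raise ValueError("expected frequency is zero for the given words")
    return observed / expected


# Toy sequences (invented for illustration only).
cg_level = normalized_level("ACGTACGT", ["CG"])
ccgg_level = normalized_level("ACGTACGT", ["CC", "GG"])
```

A ratio below 1 indicates under-representation relative to the mononucleotide expectation, which is the sense in which the abstract speaks of CG suppression. The same function applies to trinucleotides or tetranucleotides by passing longer words.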
Figure 4. CG occurrence within and outside CpG islands for 11 vertebrate genomes analysed in Fig. 2.
Figure 5. Phylogenetic tree of Dnmt3 constructed with maximum likelihood. Nine hundred and fifty amino acid sites are used for the tree inference and the tree is unrooted. A bootstrap value is presented above each branch. The scale is proportional to the number of substitutions per amino acid.
Figure 6. Phylogenetic tree of Dnmt1 constructed with maximum likelihood. One thousand five hundred and twenty amino acid sites are used for the tree inference and the tree is unrooted. A similar tree was obtained with neighbour-joining and minimum evolution (data not shown).