
Third party annotation gene data set of eutherian lysozyme genes.

Abstract

The eutherian comparative genomic analysis protocol annotated the most comprehensive eutherian lysozyme gene data set. Among 209 potential coding sequences, the third party annotation gene data set of eutherian lysozyme genes included 116 complete coding sequences that described seven major gene clusters for the first time. As a new framework for future experiments, the present integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis proposed a new classification and nomenclature of eutherian lysozyme genes.


Keywords: Comparative genomic analysis; Gene annotations; Molecular evolution; Phylogenetic analysis

Year: 2014 PMID: 26484105 PMCID: PMC4535835 DOI: 10.1016/j.gdata.2014.08.003

Source DB: PubMed Journal: Genom Data ISSN: 2213-5960

Direct link to deposited data

The deposited data can be found here: http://www.ebi.ac.uk/ena/data/view/HG931734-HG931849.
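
The accession range HG931734-HG931849 spans 116 consecutive accessions, which agrees with the 116 complete coding sequences reported in the abstract. The following is a minimal Python sketch that expands such a range. The helper name `expand_accession_range` is hypothetical, and the sketch assumes that both ends of the range share one alphabetic prefix followed by a fixed-width number.

```python
import re


def expand_accession_range(span: str) -> list[str]:
    """Expand a range such as 'HG931734-HG931849' into single accessions."""
    start, end = span.split("-")
    m_start = re.fullmatch(r"([A-Z]+)(\d+)", start)
    m_end = re.fullmatch(r"([A-Z]+)(\d+)", end)
    # Assumption: both ends use the same letter prefix.
    if not m_start or not m_end or m_start.group(1) != m_end.group(1):
        raise ValueError(f"unsupported accession range: {span!r}")
    prefix = m_start.group(1)
    width = len(m_start.group(2))  # keep zero padding, if any
    first, last = int(m_start.group(2)), int(m_end.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


accessions = expand_accession_range("HG931734-HG931849")
print(len(accessions), accessions[0], accessions[-1])
```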

Experimental design, materials and methods

The eutherian comparative genomic analysis protocol included gene annotations, phylogenetic analysis and protein molecular evolution analysis [1], [2].

Gene annotations

The eutherian public genomic sequence assemblies [3] were downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/). The NCBI BLAST package was used to identify genes in the eutherian genomic sequence assemblies (ftp://ftp.ncbi.nlm.nih.gov/blast/). Alternatively, the BLAST or BLAT web tools of the Ensembl genome browser were used in gene identification (http://www.ensembl.org/index.html). The protocol annotated 209 eutherian lysozyme (LYZ) potential coding sequences (Supplementary data file 1), which were then subjected to the tests of reliability of eutherian public genomic sequences. In the first test step, the nucleotide sequence coverage of each potential coding sequence was analysed using primary sequence reads in the NCBI Trace Archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi) and the NCBI BLAST package. In the second test step, the potential coding sequences were classified: they were described as complete coding sequences if consensus trace sequence coverage was available for every nucleotide. Otherwise, they were classified as putative coding sequences and were not used in the analyses. The EBI-reviewed third party annotation gene data set of eutherian LYZ genes included 116 complete coding sequences (Fig. 1) (http://www.ebi.ac.uk/embl/Documentation/third_party_annotation_dataset.html). The gene descriptions followed the guidelines of human and mouse gene nomenclature (http://www.genenames.org/guidelines.html and http://www.informatics.jax.org/mgihome/nomen/gene.shtml). The genomic sequence alignments used the RepeatMasker program version open-4.0.3 (http://www.repeatmasker.org/) and the mVISTA tools (http://genome.lbl.gov/vista/index.shtml) [1], [2]. The common predicted promoter genomic sequence regions of eutherian LYZ genes were described (Supplementary data file 2).
For example, the average pairwise nucleotide sequence identity of the common predicted promoter genomic sequence region of primate LYZA genes was ā = 0.864 (amax = 0.987, amin = 0.75, āad = 0.11) (Supplementary data file 3A). Among primate LYZB genes, the average pairwise nucleotide sequence identities of the two common predicted promoter genomic sequence regions were ā = 0.868 (amax = 0.981, amin = 0.758, āad = 0.101) (Supplementary data file 3B) and ā = 0.882 (amax = 0.991, amin = 0.751, āad = 0.098) (Supplementary data file 3C). The average pairwise nucleotide sequence identity of the common predicted promoter genomic sequence region of primate LYZC genes was ā = 0.909 (amax = 0.979, amin = 0.877, āad = 0.028) (Supplementary data file 3D). In primate LYZD genes, the average pairwise nucleotide sequence identity of the common predicted promoter genomic sequence region was ā = 0.877 (amax = 1, amin = 0.774, āad = 0.047) (Supplementary data file 3E). The average pairwise nucleotide sequence identity of the common predicted promoter genomic sequence region of primate LYZF genes was ā = 0.929 (amax = 0.976, amin = 0.881, āad = 0.027) (Supplementary data file 3F). Finally, among primate LYZG genes, the average pairwise nucleotide sequence identity of the common predicted promoter genomic sequence region was ā = 0.915 (amax = 0.956, amin = 0.892, āad = 0.022) (Supplementary data file 3G).
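
The second test step described above, in which a potential coding sequence counts as complete only if trace coverage is available for every nucleotide, could be sketched in Python as follows. This is an illustration under stated assumptions and not the protocol's own implementation: the function names are hypothetical, the trace read hits are assumed to be already reduced to 1-based inclusive (start, end) intervals on the coding sequence, and a minimum depth of one read per nucleotide is assumed.

```python
def coverage_per_position(cds_length: int, hits: list[tuple[int, int]]) -> list[int]:
    """Count how many trace read hits cover each coding sequence position.

    `hits` holds 1-based inclusive (start, end) intervals (an assumed format).
    """
    if cds_length <= 0:
        raise ValueError("cds_length must be positive")
    depth = [0] * cds_length
    for start, end in hits:
        lo = max(start, 1)          # clip hits to the sequence bounds
        hi = min(end, cds_length)
        for pos in range(lo, hi + 1):
            depth[pos - 1] += 1
    return depth


def classify_cds(cds_length: int, hits: list[tuple[int, int]], min_depth: int = 1) -> str:
    """Return 'complete' if every nucleotide is covered, else 'putative'."""
    depth = coverage_per_position(cds_length, hits)
    return "complete" if all(d >= min_depth for d in depth) else "putative"


# Two overlapping hits cover all ten positions; a gap at position 5 does not.
print(classify_cds(10, [(1, 6), (5, 10)]))
print(classify_cds(10, [(1, 4), (6, 10)]))
```

Raising `min_depth` would make the test stricter, for example by requiring at least two independent reads per nucleotide; the source does not state which depth was required.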

Fig. 1

A) Phylogenetic analysis of eutherian lysozyme genes. The minimum evolution tree was calculated using the maximum composite likelihood method. The bootstrap estimates ≥ 50% were shown after 1000 bootstrap replicates. B) Distribution of common cysteine amino acid residues in eutherian lysozyme proteins. The common cysteine residues 1-8 were labelled using black rectangles. The numbers indicated numbers of amino acids.

Phylogenetic analysis

The complete coding sequences were aligned at the amino acid level using ClustalW implemented in BioEdit 7.0.5.3 (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). The protein and nucleotide sequence alignments were corrected manually. The MEGA5 program was used in calculations of phylogenetic trees (http://www.megasoftware.net) [1], [2]. The seven eutherian LYZA–LYZG major gene clusters were described for the first time in the present study (Fig. 1). Among eutherian LYZD and LYZF genes, there was evidence of differential gene expansions. There were some discrepancies between the present work and the phylogenetic analysis of Irwin et al. [4]. For example, the present minimum evolution phylogenetic analysis of the eutherian LYZ gene data set showed grouping of the eutherian LYZA major gene cluster (LYZL4) and the eutherian LYZB major gene cluster (LYZL6). Furthermore, the eutherian LYZF major gene cluster (LYZ) included horse, domestic dog and domestic cat genes that were previously described as calcium-binding lysozyme genes (Lysc1). The present eutherian LYZ gene classification was confirmed by calculations of pairwise nucleotide sequence identity patterns among eutherian LYZA–LYZG major gene clusters (Supplementary data file 4). Among eutherian LYZ genes, the average pairwise identity was ā = 0.502 (amax = 0.995, amin = 0.278, āad = 0.089). Within each of the eutherian LYZA–LYZG major gene clusters, the nucleotide sequence identities were typical of comparisons between eutherian orthologous and paralogous genes. In comparisons between eutherian LYZA–LYZG major gene clusters, there were nucleotide sequence identity patterns of close eutherian homologues.
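
The summary statistics reported for pairwise nucleotide sequence identities (ā, amax, amin and āad) could be reproduced along the lines of the following Python sketch. It rests on assumptions that the source does not confirm: identity is taken as the fraction of matching nucleotides over alignment columns where neither sequence has a gap, and āad is interpreted as the average absolute deviation from the mean. The function names and the toy sequences are illustrative only.

```python
from itertools import combinations
from statistics import mean


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical nucleotides over gap-free aligned columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(x == y for x, y in pairs) / len(pairs)


def identity_summary(aligned: dict[str, str]) -> dict[str, float]:
    """Mean, maximum, minimum and average absolute deviation of identities."""
    values = [
        pairwise_identity(aligned[p], aligned[q])
        for p, q in combinations(aligned, 2)
    ]
    avg = mean(values)
    return {
        "mean": avg,
        "max": max(values),
        "min": min(values),
        # Assumed meaning of the average absolute deviation statistic.
        "avg_abs_dev": mean([abs(v - avg) for v in values]),
    }


# Toy aligned sequences, not taken from the data set.
toy = {"seq1": "ATGAAG", "seq2": "ATGAAA", "seq3": "ATGCAA"}
summary = identity_summary(toy)
print({k: round(v, 3) for k, v in summary.items()})
```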

Protein molecular evolution analysis

The N-terminal signal peptides in all eutherian LYZ major protein clusters were predicted using the SignalP-4.1 web tool with default settings (http://www.cbs.dtu.dk/services/SignalP/) (data not shown). There were eight invariant cysteine amino acid residues among eutherian LYZ proteins [4], [5] (Fig. 1B). Whereas invariant common potential N-glycosylation sites were observed in eutherian LYZB and LYZD major protein clusters (58N in human LYZB, 104N in human LYZD1), there were variant common potential N-glycosylation sites in the other eutherian LYZ major protein clusters, except for the eutherian LYZE and LYZF major protein clusters (Supplementary data file 5). The tests of protein molecular evolution integrated the patterns of nucleotide sequence similarities of the aligned complete coding sequence data set with the human LYZF1 crystal structure (1LZ1) [6]. The DeepView/Swiss-PdbViewer 4.0.1 program was used in the 1LZ1 analysis (http://spdbv.vital-it.ch/). The relative synonymous codon usage statistic R was calculated using MEGA5 as the ratio between observed and expected amino acid codon counts. The non-preferred codons (R ≤ 0.7) were: TTA (0.25), CTT (0.7), CTA (0.23), ATA (0.52), GTA (0.39), TCG (0.19), CCG (0.32), ACG (0.39), GCG (0.29), CGT (0.46), CGA (0.53), GGT (0.4) and GGG (0.64). The reference human LYZF1 protein primary sequence residues were classified as invariant amino acid sites (invariant alignment positions), forward amino acid sites (variant alignment positions that did not include amino acid codons with R ≤ 0.7) or compensatory amino acid sites (variant alignment positions that included amino acid codons with R ≤ 0.7) (Supplementary data file 5, Supplementary data file 6).
Among the 148 reference protein sequence amino acid sites, there were 14 invariant amino acid sites and 22 forward amino acid sites, which described amino acid site cluster 1 (N45–N62), cluster 2 (G66–C83) and cluster 3 (V92–C99) with overrepresented invariant and/or forward amino acid sites (Supplementary data file 6A). Structurally, the amino acid site clusters 1–3 were positioned in close proximity (Supplementary data file 6B, Supplementary data file 6C). For example, the amino acid site clusters 1–3 included nine amino acid residues common in lysozymes [5]. Moreover, the catalytic amino acid residue E53 in amino acid site cluster 1 was described as a forward amino acid site [7].
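
The relative synonymous codon usage statistic and the three site classes described above could be illustrated with the following Python sketch. It does not reproduce the MEGA5 calculation; it assumes the standard genetic code, computes R for each codon as the observed count divided by the count expected if all synonymous codons of that amino acid were used equally, and treats an alignment position as invariant when all sequences encode the same amino acid there. The function names are hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code, built in the conventional TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def rscu(coding_sequences: list[str]) -> dict[str, float]:
    """Observed / expected codon counts per synonymous codon family."""
    counts = Counter()
    for cds in coding_sequences:
        seq = cds.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":  # skip stops and gaps
                counts[codon] += 1
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the data
        expected = total / len(codons)  # equal usage of synonymous codons
        for c in codons:
            values[c] = counts[c] / expected
    return values


def classify_site(column_codons: list[str], r_values: dict[str, float],
                  threshold: float = 0.7) -> str:
    """Classify one codon alignment column as invariant, forward or compensatory."""
    codons = [c.upper() for c in column_codons if c.upper() in CODON_TABLE]
    if not codons:
        raise ValueError("column has no translatable codons")
    if len({CODON_TABLE[c] for c in codons}) == 1:
        return "invariant"  # assumed: same amino acid in every sequence
    if any(c in r_values and r_values[c] <= threshold for c in codons):
        return "compensatory"
    return "forward"


# Toy R values; ATA = 0.52 is the value quoted in the text, the rest are made up.
toy_r = {"CTG": 2.0, "ATA": 0.52, "ATG": 1.0}
print(classify_site(["CTG", "CTG"], toy_r))
print(classify_site(["CTG", "ATG"], toy_r))
print(classify_site(["CTG", "ATA"], toy_r))
```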

Discussion

Because of the incompleteness of public genomic sequence assemblies [3], [8], [9] and potential sequence errors [10], eutherian gene data sets were subject to updates and revisions. The eutherian comparative genomic analysis protocol annotated the most comprehensive eutherian LYZ gene data set. As a new framework for future experiments, the present integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis proposed a new classification and nomenclature of eutherian LYZ genes. The following are the Supplementary data related to this article.

Supplementary data file 1

Gene data set of eutherian lysozyme genes.

Supplementary data file 2

Pairwise genomic sequence alignments of eutherian lysozyme genes. The rectangles labelled common predicted promoter regions (P). The translated genomic sequence regions were displayed as indigo rectangles and untranslated genomic sequence regions were displayed as cyan rectangles in base sequences (top). The genomic sequence regions that showed conservation levels that exceeded empirically determined cut-offs of detection of common genomic sequence regions were shown accordingly in pairwise alignments. The cut-offs of detection of common genomic sequence regions in pairwise alignments with Homo sapiens or Pan troglodytes were 95% per 100 bp (Homo sapiens, Pan troglodytes, Gorilla gorilla), 90% per 100 bp (Pongo abelii, Nomascus leucogenys), 85% per 100 bp (Macaca mulatta, Papio hamadryas), 80% per 100 bp (Callithrix jacchus), 75% per 100 bp (Microcebus murinus, Otolemur garnettii), 65% per 100 bp (Rodentia) or 70% per 100 bp in other pairwise alignments. The Homo sapiens exons in base sequences (top) were annotated using transcripts: BC016747.2 (A), BC054481.1 (P1) and AY359018.1 (P2) (B), BC100886.2 (C), BC021730.2 (D), BC004147.2 (F) and BC112316.1 (G).
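
The species-dependent cut-offs listed above could be applied to a pairwise alignment as in the following Python sketch. It is not the mVISTA implementation: the lookup table only restates the cut-offs from the text, the function names are hypothetical, the window is assumed to slide one alignment column at a time, and gap columns are assumed to count as mismatches.

```python
# Cut-offs (fraction of identical columns per 100 bp) restated from the text.
CUTOFFS = {
    "Homo sapiens": 0.95, "Pan troglodytes": 0.95, "Gorilla gorilla": 0.95,
    "Pongo abelii": 0.90, "Nomascus leucogenys": 0.90,
    "Macaca mulatta": 0.85, "Papio hamadryas": 0.85,
    "Callithrix jacchus": 0.80,
    "Microcebus murinus": 0.75, "Otolemur garnettii": 0.75,
}
RODENTIA_CUTOFF = 0.65
DEFAULT_CUTOFF = 0.70


def cutoff_for(species: str, is_rodent: bool = False) -> float:
    """Look up the detection cut-off for the aligned species."""
    if species in CUTOFFS:
        return CUTOFFS[species]
    return RODENTIA_CUTOFF if is_rodent else DEFAULT_CUTOFF


def conserved_windows(base: str, other: str, cutoff: float, window: int = 100) -> list[int]:
    """Return 0-based start columns of windows whose identity meets the cut-off."""
    if len(base) != len(other):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both sequences carry the same nucleotide, 0 otherwise (gaps included).
    matches = [int(x == y and x != "-") for x, y in zip(base.upper(), other.upper())]
    starts = []
    running = sum(matches[:window])
    for start in range(0, len(matches) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old one.
            running += matches[start + window - 1] - matches[start - 1]
        if running / window >= cutoff:
            starts.append(start)
    return starts


# Toy alignment: 100 identical columns followed by 100 mismatches.
base_seq = "A" * 100 + "C" * 100
other_seq = "A" * 100 + "G" * 100
print(conserved_windows(base_seq, other_seq, cutoff_for("Gorilla gorilla")))
```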

Supplementary data file 3

Nucleotide sequence alignments of common predicted promoter genomic sequence regions. The Xs below alignments labelled first exons and triangles above alignments labelled translation start sites. The first exons were annotated using Homo sapiens transcripts as in Supplementary data file 2. The numbers in brackets indicated positions of 3′-terminal nucleotides relative to translation start sites. The nucleotide positions were labelled according to conservation levels: white letters on black background depicted 100% conservation, white letters on dark grey background depicted ≥ 85% conservation and black letters on grey background depicted ≥ 70% conservation.

Supplementary data file 4

Pairwise nucleotide sequence identities of eutherian lysozyme genes.

Supplementary data file 5

Protein sequence alignments of eutherian lysozyme proteins. The invariant amino acid sites were shown using white letters on violet backgrounds and forward amino acid sites were shown using white letters on red backgrounds in reference human LYZF1 protein amino acid sequence (top). The amino acid positions in major protein clusters LYZA–LYZG were labelled according to conservation levels: white letters on black background depicted 100% conservation, white letters on dark grey background depicted ≥ 75% conservation and black letters on grey background depicted ≥ 50% conservation. The stop codons were labelled by &s.

Supplementary data file 6

Molecular evolution analysis of eutherian lysozyme proteins. A) Human LYZF1 primary structure. The black triangle indicated signal peptide cleavage site. The invariant amino acid sites (white letters on violet background) and forward amino acid sites (white letters on red background) were labelled. The amino acid site clusters 1–3 were labelled by rectangles. The secondary structure elements were labelled grey [6]. The arrows indicated catalytic amino acid residues implicated in Phillips mechanism [7]. The amino acid residues common in lysozymes were labelled by #s [5]. B–C) Analysis of human LYZF1 crystal structure (1LZ1) [6]. B) Ribbon representation of human LYZF1 crystal structure (1LZ1). In amino acid site clusters 1–3, the reference sequence invariant amino acid sites were labelled violet, forward amino acid sites were labelled red and compensatory amino acid sites were labelled white. The amino acid site clusters 1–3 were indicated. C) The amino acid site clusters 1–3 were shown as van der Waals representations. The view was identical to B.

Conflict of interest

No potential conflict of interest was declared.

Specifications
Organism/cell line/tissue	35 eutherian species
Sex	N/A
Sequencer or array type	Sanger DNA sequencing method sequencers
Data format	FAS, TXT
Experimental factors	Eutherian comparative genomic analysis protocol
Experimental features	Third party annotation gene data set
Consent	N/A
Sample source location	N/A

10 in total

1. Initial sequencing and analysis of the human genome.

Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal: Nature Date: 2001-02-15 Impact factor: 49.962

2. Catalysis by hen egg-white lysozyme proceeds via a covalent intermediate.

Authors: D J Vocadlo; G J Davies; R Laine; S G Withers
Journal: Nature Date: 2001-08-23 Impact factor: 49.962

3. Phylogenetic analysis of invertebrate lysozymes and the evolution of lysozyme function.

Authors: Sana Bachali; Muriel Jager; Alexandre Hassanin; Françoise Schoentgen; Pierre Jollès; Aline Fiala-Medioni; Jean S Deutsch
Journal: J Mol Evol Date: 2002-05 Impact factor: 2.395

4. Comparative genomic analysis of eutherian ribonuclease A genes.

Authors: Marko Premzl
Journal: Mol Genet Genomics Date: 2013-12-15 Impact factor: 3.291

5. Comparative genomic analysis of eutherian Mas-related G protein-coupled receptor genes.

Authors: Marko Premzl
Journal: Gene Date: 2014-02-26 Impact factor: 3.688

6. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing.

Authors: Elliott H Margulies; Jade P Vinson; Webb Miller; David B Jaffe; Kerstin Lindblad-Toh; Jean L Chang; Eric D Green; Eric S Lander; James C Mullikin; Michele Clamp
Journal: Proc Natl Acad Sci U S A Date: 2005-03-18 Impact factor: 11.205

7. Refinement of human lysozyme at 1.5 Å resolution analysis of non-bonded and hydrogen-bond interactions.

Authors: P J Artymiuk; C C Blake
Journal: J Mol Biol Date: 1981-11-15 Impact factor: 5.469

8. Finishing the euchromatic sequence of the human genome.

Authors: International Human Genome Sequencing Consortium
Journal: Nature Date: 2004-10-21 Impact factor: 49.962

9. GENCODE: the reference human genome annotation for The ENCODE Project.

Authors: Jennifer Harrow; Adam Frankish; Jose M Gonzalez; Electra Tapanari; Mark Diekhans; Felix Kokocinski; Bronwen L Aken; Daniel Barrell; Amonida Zadissa; Stephen Searle; If Barnes; Alexandra Bignell; Veronika Boychenko; Toby Hunt; Mike Kay; Gaurab Mukherjee; Jeena Rajan; Gloria Despacio-Reyes; Gary Saunders; Charles Steward; Rachel Harte; Michael Lin; Cédric Howald; Andrea Tanzer; Thomas Derrien; Jacqueline Chrast; Nathalie Walters; Suganthi Balasubramanian; Baikang Pei; Michael Tress; Jose Manuel Rodriguez; Iakes Ezkurdia; Jeltje van Baren; Michael Brent; David Haussler; Manolis Kellis; Alfonso Valencia; Alexandre Reymond; Mark Gerstein; Roderic Guigó; Tim J Hubbard
Journal: Genome Res Date: 2012-09 Impact factor: 9.043

10. Evolution of the mammalian lysozyme gene family.

Authors: David M Irwin; Jason M Biegel; Caro-Beth Stewart
Journal: BMC Evol Biol Date: 2011-06-15 Impact factor: 3.260
