Ming Li, Congjiao Sun, Naiyi Xu, Peipei Bian, Xiaomeng Tian, Xihong Wang, Yuzhe Wang, Xinzheng Jia, Rasmus Heller, Mingshan Wang, Fei Wang, Xuelei Dai, Rongsong Luo, Yingwei Guo, Xiangnan Wang, Peng Yang, Dexiang Hu, Zhenyu Liu, Weiwei Fu, Shunjin Zhang, Xiaochang Li, Chaoliang Wen, Fangren Lan, Amam Zonaed Siddiki, Chatmongkon Suwannapoom, Xin Zhao, Qinghua Nie, Xiaoxiang Hu, Yu Jiang, Ning Yang.
Abstract
The gene numbers and evolutionary rates of birds were assumed to be much lower than those of mammals, which is in sharp contrast to the huge species number and morphological diversity of birds. It is, therefore, necessary to construct a complete avian genome and analyze its evolution. We constructed a chicken pan-genome from 20 de novo assembled genomes with high sequencing depth, and identified 1,335 protein-coding genes and 3,011 long noncoding RNAs not found in GRCg6a. The majority of these novel genes were detected across most individuals of the examined transcriptomes but were seldom detected in the DNA sequencing data, regardless of Illumina or PacBio technology. Furthermore, unlike in previous pan-genome models, most of these novel genes were overrepresented in chromosomal subtelomeric regions and on microchromosomes, surrounded by extremely high proportions of tandem repeats, which strongly block DNA sequencing. These hidden genes were shown to be shared by all chicken genomes, included many housekeeping genes, and were enriched in immune pathways. Comparative genomics revealed that the novel genes had substitution rates 3-fold higher than those of known genes, updating the knowledge about evolutionary rates in birds. Our study provides a framework for constructing a better chicken genome, which will contribute toward the understanding of avian evolution and the improvement of poultry breeding.
Keywords: avian evolution; chicken; missing genes; noncanonical DNA secondary structure; pan-genome
Year: 2022 PMID: 35325213 PMCID: PMC9021737 DOI: 10.1093/molbev/msac066
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 8.800
Fig. 1. Novel nonredundant (nr) chicken sequences identified from 20 de novo assemblies. (a) Geographic locations of the original chicken breeds used for de novo assembly and their sequencing platforms. A rectangle indicates that the breed is represented by two individuals. (b) Genome assembly completeness assessed by BUSCO. (c) Length of novel sequences initially obtained from the 20 de novo assemblies. The polygonal line represents the average length and the column represents the total length. (d) The number of novel sequences validated by other chicken genomes, by homology with Galliformes, and by transcriptome data. The colors of the breed names in (b) and (c) are consistent with (a).
The Characteristics of Novel Sequences in this Study.
| Characteristic | Value |
|---|---|
| Total novel sequence length (bp) | 158,981,245 |
| Total gap length (bp) | 1,405,623 |
| Number of novel sequences | 45,715 |
| Novel sequence N50 (bp) | 6,784 |
| Mean novel sequence length (bp) | 3,478 |
| GC ratio | 57.20% |
| G4 motif content | 37.08% |
| Tandem repeat content | 79.13% |
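As a reading aid for the table, the following is a minimal sketch of how an N50 and a mean sequence length are conventionally computed. The function name `n50` and the toy lengths are hypothetical and are not taken from the study; only the two totals used in the mean are table values.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy lengths (bp), for illustration only.
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)

# Mean length recomputed from the two totals reported in the table.
total_length_bp = 158_981_245
sequence_count = 45_715
mean_length_bp = round(total_length_bp / sequence_count)
```

Recomputing the mean from the reported total length and sequence count gives a value that rounds to the 3,478 bp listed in the table.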
Fig. 2. Characterization of novel sequences. (a) The distribution and cumulative curve of observed frequencies of novel sequences in 20 assemblies and 922 resequenced individuals. (b) The observed frequency of the expressed novel sequences in the transcriptomes of six chickens (column) and their corresponding genomes (row). (c) Relative read depth of novel sequences in assemblies in which the novel sequence was present (green) or absent (orange). The whole-genome read depth was set to one. (d) Left: TR and GC content of GRCg6a and of the novel sequences; right: the feature importance of TR and GC for the detection rate of novel sequences. (e) Left: the content of noncanonical DNA structures in the novel sequences; middle: the putative structures of noncanonical DNA; right: the read depth ratio of novel sequences with or without noncanonical DNA structures. TR, tandem repeat; DR, direct repeat; G4, G-quadruplex; MR, mirror repeat; IR, inverted repeat; APR, A-phased repeat; Z, Z-DNA.
Fig. 3. Abundantly expressed genes are embedded in novel sequences. (a) Relative expression of transcripts from reference and novel sequences. (b) The number of identified protein-coding genes/lncRNAs increased with sample number. The shaded area indicates the 95% confidence interval. (c) The number of protein-coding genes in representative species including mammals, reptiles, and birds. Blue, orange, and green columns refer to protein-coding genes identified by NCBI, Yin et al. (2019), and our study, respectively. (d) Total lncRNA numbers in representative mammalian species and chicken. The blue and green columns refer to lncRNAs identified by Sarropoulos et al. (2019) and this study, respectively.
Fig. 4. The novel coding genes clustered in the microchromosomes and the subtelomeric regions of chromosomes. (a) The location of the novel coding gene clusters on chromosomes. (b) Box plot of dS, dN, and dN/dS values of genes on macrochromosomes, microchromosomes, the Z chromosome, unplaced scaffolds of the chicken reference genome, and the novel sequences. (c) The number of orthologous sequences detected, the proportion of homologous sequences detected in microchromosomes and subtelomeric regions, and the GC, G4 motif, and TR contents of homologous novel coding gene clusters in chicken and 22 other species. Regions containing more than three genes are considered clusters, and genes located within 5 Mb of a chromosome end are considered subtelomeric. (d) Detailed synteny conservation of novel coding genes on chicken chromosome 16 with mammals (human), reptiles (lizard, turtle), and birds (thrush, chicken). Hollow rectangles represent annotated genes in the genome, and colored rectangles, with gene names in red, represent novel coding genes in chicken.
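The subtelomeric criterion stated in the Fig. 4 caption (within 5 Mb of a chromosome end) can be sketched as a simple coordinate test. The function name, the 0-based half-open interval convention, and the choice to count a gene that only partly overlaps the window are assumptions made for illustration; the caption does not specify them.

```python
SUBTELOMERE_WINDOW_BP = 5_000_000  # 5 Mb threshold from the Fig. 4 caption


def is_subtelomeric(start, end, chrom_length, window=SUBTELOMERE_WINDOW_BP):
    """Return True if the gene interval [start, end) overlaps the first or
    last `window` bp of the chromosome.

    Assumption: partial overlap with the window counts as subtelomeric.
    """
    if not 0 <= start < end <= chrom_length:
        raise ValueError("expected 0 <= start < end <= chrom_length")
    return start < window or end > chrom_length - window


# Hypothetical coordinates on a hypothetical 20 Mb chromosome.
chrom_length = 20_000_000
near_start = is_subtelomeric(1_000_000, 1_010_000, chrom_length)
mid_chrom = is_subtelomeric(10_000_000, 10_010_000, chrom_length)
near_end = is_subtelomeric(19_500_000, 19_510_000, chrom_length)
```

Under this sketch, any chromosome shorter than 10 Mb is subtelomeric along its whole length, so the criterion needs care when applied to short microchromosomes.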
Fig. 5. Functional enrichment of novel coding genes and the case of NF-κB pathway-related novel coding genes. (a) The top 20 significant Reactome pathways with the largest number of novel coding genes. (b) Novel coding genes (red) and partially missing genes (yellow) related to the NF-κB signaling pathway. The green boxes represent differentially expressed genes (DEGs) in avian influenza virus.