Ran Li1, Xiaomeng Tian1, Peng Yang1, Yingzhi Fan1, Ming Li1, Hongxiang Zheng2, Xihong Wang1, Yu Jiang3.
Abstract
BACKGROUND: Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist.
Keywords: Alternate alleles; Genomic variation; Human genome; Pan-genome
Year: 2019 PMID: 31619167 PMCID: PMC6796347 DOI: 10.1186/s12864-019-6107-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Overview of the workflow to identify non-reference sequences (NRS)
Fig. 2 Non-reference sequences identified from each of the 31 human de novo assemblies. The repeat information was summarized from the repeat annotation files (*_rm.out.gz), which were generated with RepeatMasker and downloaded from NCBI. The radius of each pie chart was log2-transformed
Fig. 3 Characteristics of the insertions and alternate alleles. a Frequency of insertions within the 31 de novo assemblies. b Frequency of alternate alleles within the 31 de novo assemblies. c Frequency of ambiguous sequences within the 31 de novo assemblies. d Length distributions of the insertions, alternate alleles and ambiguous sequences. e The first two principal components based on the occurrence matrix of the insertions among the 16 LRS de novo assemblies. f The first two principal components based on the occurrence matrix of the alternate alleles among the LRS de novo assemblies
Fig. 4 Examples of alternate alleles. a and b Alignment of the alternate alleles and their flanking sequences with hg38 and other assemblies. The blue lines at the top represent the gene annotations in hg38. The yellow segments represent the NRS. The gray block represents alignments that share ≥95% identity, while the green block represents alignments that share <90% identity. For each of the alignments, the reference sequence from hg38 is shown at the top, followed by the sequence from which the NRS is derived and sequences from two additional genomes. c and d RNA-seq read mapping of the NRS shows expression potential. e and f Expression of the alternate alleles in nine tissues. Three replicates were used for each tissue. The expression level was measured in CPM (read counts per million of total mapped reads)
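The CPM measure named in the Fig. 4 caption can be sketched as follows. This is a minimal illustration of the standard counts-per-million scaling only, not the authors' pipeline; the function name `cpm` and the example numbers are hypothetical.

```python
def cpm(read_count: int, total_mapped_reads: int) -> float:
    """Counts per million: reads assigned to a feature, scaled to one
    million total mapped reads in the same library.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return read_count * 1_000_000 / total_mapped_reads


# Made-up example: 50 reads on an alternate allele in a library
# with 20 million mapped reads.
example_cpm = cpm(50, 20_000_000)
```

Scaling by library size in this way makes expression values comparable across replicates and tissues that were sequenced to different depths.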
Fig. 5 Locations of the non-private insertions and alternate alleles on the human reference genome (hg38). Red lines represent insertions, while blue lines represent alternate alleles. The gene symbol is shown for each of the ten longest insertions and ten longest alternate alleles overlapping genic regions. The black triangles represent the NRS overlapping with the gap regions. Line width is not drawn to scale
Fig. 6 Repeat content associated with the insertions and alternate alleles. a TE content within the insertions and alternate alleles. b TE content in the flanking region (5 kb on each side). c Tandem repeat content within the insertions and alternate alleles. d Tandem repeat content in the flanking region (5 kb on each side). The non-private sequences were included in the statistics. e Dot plot showing the tandem repeat content in the alternate allele (x axis) and in the corresponding reference allele