Fabien Degalez1, Frédéric Jehl1, Kévin Muret1, Maria Bernard2,3, Frédéric Lecerf1, Laetitia Lagoutte1, Colette Désert1, Frédérique Pitel4, Christophe Klopp2, Sandrine Lagarrigue1.
Abstract
Most single-nucleotide polymorphisms (SNPs) are located in non-coding regions, but the fraction usually studied lies in protein-coding regions because potential impacts on proteins are relatively easy to predict with popular tools such as the Variant Effect Predictor (VEP). These tools annotate variants independently, without considering the potential effect of grouped or haplotypic variations, often called "multi-nucleotide variants" (MNVs). Here, we surveyed MNVs using a large RNA-seq dataset comprising 382 chicken samples from 11 populations, analyzed in the companion paper, in which 9.5M SNPs, including 3.3M SNPs with reliable genotypes, were detected. We focused our study on in-codon MNVs and evaluated their potential mis-annotation. Using GATK HaplotypeCaller read-based phasing results, we identified 2,965 MNVs observed in at least five individuals and located in 1,792 genes. We found that 41.1% of them had a novel impact compared with the effect of their constituent SNPs analyzed separately. The largest change in predicted impact concerns the consequences originally annotated as stop-gained, of which around 95% were rescued; it is followed by the missense consequences, of which 37% were reannotated with a different amino acid. We then present the rescued stop-gained MNVs in more depth and give an illustration in the SLC27A4 gene. As previously shown in human datasets, our results in chicken demonstrate the value of haplotype-aware variant annotation and of considering MNVs in the coding region, particularly when searching for severe functional consequences such as stop-gained variants.
Keywords: FATP4; MNV; SLC27A4; SNP; rescued stop-gained; variation
Year: 2021 PMID: 34306009 PMCID: PMC8293744 DOI: 10.3389/fgene.2021.659287
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Example of MNVs: predicted impact on the associated protein (A) and how to identify them (B,C). (A) Example of an MNV composed of two nearby SNPs in one codon, with its four potential haplotypes in the population and their predicted impact on the associated protein. In contrast to the other haplotypes, haplotype no. 2 contains two variants (T and A) and corresponds to an MNV. (B) The IGV (Integrative Genomics Viewer; Robinson et al., 2011) screenshot illustrates the principle of read-based phasing of SNPs: mapping the short reads of a heterozygous individual against the reference genome allows us to phase both SNPs, giving two haplotypes: C with T (reference alleles) on one side and T with A (alternative alleles) on the other. When translated, these two haplotypes correspond to a leucine and a STOP codon, and not to the simple amino acid change (LEU → GLN) that would be expected if each haplotype carried one reference and one alternative allele, as shown in (A). (C) Information (PID and PGT) provided by GATK in the VCF files about the phased SNPs, according to the read-based phasing shown in (B).
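The effect described in panel (A) can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' pipeline: the reference codons `CTA` and `CAG`, the helper names (`mutate`, `consequence`, `compare`), and the simplified consequence labels are assumptions made for illustration, and start codons are not handled. It translates a codon with each SNP applied alone and with both applied together, using the standard genetic code.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def mutate(codon, snps):
    """Apply SNPs given as (offset in codon, alternative base) pairs."""
    bases = list(codon)
    for offset, alt in snps:
        bases[offset] = alt
    return "".join(bases)


def consequence(ref_codon, alt_codon):
    """Simplified consequence label (hypothetical; start_lost is ignored)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "stop_retained" if ref_aa == "*" else "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


def compare(ref_codon, snp1, snp2):
    """Annotate each SNP alone, then both together as an MNV."""
    return {
        "snp1_alone": consequence(ref_codon, mutate(ref_codon, [snp1])),
        "snp2_alone": consequence(ref_codon, mutate(ref_codon, [snp2])),
        "mnv": consequence(ref_codon, mutate(ref_codon, [snp1, snp2])),
    }


# Assumed codon CTA (Leu) with C>T at position 1 and T>A at position 2.
print(compare("CTA", (0, "T"), (1, "A")))
# Assumed codon CAG (Gln): a stop-gained SNP rescued by its phased neighbour.
print(compare("CAG", (0, "T"), (1, "G")))
```

In the first call the two SNPs are synonymous and missense when annotated separately, but the MNV creates a stop codon. In the second call the stop-gained SNP is rescued, because the MNV encodes a missense change.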
FIGURE 2. Workflow of MNV detection in coding regions and functional consequence prediction. Left: MNV detection from the 3.3M SNPs previously identified using RNA-seq of 382 chickens (companion article; Jehl et al., 2021). Right: selection of the MNV constituent SNPs and of their protein impacts, as predicted separately by VEP.
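The detection step can be sketched from the GATK phasing fields alone. In the sketch below, the records, coordinates, and the `max_span` parameter are hypothetical, and the distance filter is only a proxy: a real in-codon test needs the CDS coordinates and reading frame from a gene annotation. SNPs are grouped by their physical phasing ID (`PID`), and a pair is kept only when `PGT` places both alternative alleles on the same haplotype.

```python
from collections import defaultdict

# Hypothetical per-sample records: (chrom, pos, ref, alt, FORMAT, sample field).
records = [
    ("17", 1000, "C", "T", "GT:PGT:PID", "0/1:0|1:1000_C_T"),
    ("17", 1001, "T", "A", "GT:PGT:PID", "0/1:0|1:1000_C_T"),
    ("17", 1500, "G", "A", "GT:PGT:PID", "0/1:1|0:1500_G_A"),
    ("17", 1502, "C", "G", "GT:PGT:PID", "0/1:0|1:1500_G_A"),
    ("17", 2000, "A", "G", "GT", "0/1"),  # no read-based phasing available
]


def phased_alt_groups(records):
    """Group alternative alleles by (chrom, PID, haplotype index)."""
    groups = defaultdict(list)
    for chrom, pos, ref, alt, fmt, sample in records:
        fields = dict(zip(fmt.split(":"), sample.split(":")))
        pid, pgt = fields.get("PID"), fields.get("PGT")
        if pid in (None, ".") or pgt in (None, "."):
            continue  # SNP was not phased by HaplotypeCaller
        for haplotype, allele in enumerate(pgt.split("|")):
            if allele == "1":
                groups[(chrom, pid, haplotype)].append((pos, ref, alt))
    return groups


def candidate_mnvs(records, max_span=2):
    """Pairs of same-haplotype SNPs at most `max_span` bp apart (assumed proxy)."""
    pairs = set()
    for variants in phased_alt_groups(records).values():
        variants.sort()
        for first, second in zip(variants, variants[1:]):
            if second[0] - first[0] <= max_span:
                pairs.add((first, second))
    return sorted(pairs)


print(candidate_mnvs(records))
```

Only the pair at positions 1000 and 1001 is reported. The SNPs at 1500 and 1502 share a `PID` but carry their alternative alleles on opposite haplotypes, so they do not form an MNV.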
Occurrences for each type of re-prediction according to the number of individuals carrying the MNV.
| SNP annotation | → | MNV annotation | Number of individuals carrying the MNV | | | | | | | | | | |
| | | | 1 | 2 | 3 | 4 | 5 | 10 | 15 | 20 | 30 | 50 | 100 |
| Missense | → | Synonymous | 110 | 91 | 86 | 86 | 76 | 74 | 74 | 72 | 66 | 54 | |
| Stop_gained | → | Missense | 194 | 118 | 102 | 99 | 63 | 47 | 39 | 33 | 27 | 14 | |
| Stop_lost | → | Stop_lost | 8 | 4 | 4 | 4 | 2 | 2 | 2 | 2 | 1 | 0 | |
| Start_lost | → | Start_lost | 19 | 11 | 11 | 11 | 8 | 7 | 6 | 5 | 3 | 1 | |
| Synonymous | → | Synonymous | 95 | 84 | 81 | 78 | 69 | 63 | 58 | 54 | 45 | 32 | |
| Missense | → | Missense | 4,932 | 3,425 | 3,107 | 2,854 | 2,162 | 1,876 | 1,652 | 1,369 | 1,083 | 716 | |
| Stop_gained | → | Stop_gained | 12 | 6 | 5 | 5 | 3 | 2 | 2 | 2 | 1 | 0 | |
| Stop_retained | → | Stop_lost | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| Synonymous | → | Missense | 15 | 6 | 6 | 4 | 2 | 1 | 1 | 1 | 1 | 1 | |
| Missense | → | Stop_gained | 30 | 13 | 11 | 10 | 6 | 4 | 3 | 3 | 2 | 1 | |
| Total | | | 5,416 | 3,758 | 3,413 | 3,151 | 2,391 | 2,076 | 1,837 | 1,541 | 1,229 | 819 | |
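The table can be checked with a short script. The sketch below copies the ten value columns in table order, without assigning them to thresholds. It verifies that the last row is the column-wise total and computes, for each column, the share of stop-gained SNP annotations that become missense at the MNV level; this definition of the "rescued" share is an assumption for illustration.

```python
# Counts copied from the table, one list per (SNP annotation, MNV annotation) row.
rows = {
    ("Missense", "Synonymous"): [110, 91, 86, 86, 76, 74, 74, 72, 66, 54],
    ("Stop_gained", "Missense"): [194, 118, 102, 99, 63, 47, 39, 33, 27, 14],
    ("Stop_lost", "Stop_lost"): [8, 4, 4, 4, 2, 2, 2, 2, 1, 0],
    ("Start_lost", "Start_lost"): [19, 11, 11, 11, 8, 7, 6, 5, 3, 1],
    ("Synonymous", "Synonymous"): [95, 84, 81, 78, 69, 63, 58, 54, 45, 32],
    ("Missense", "Missense"): [4932, 3425, 3107, 2854, 2162, 1876, 1652, 1369, 1083, 716],
    ("Stop_gained", "Stop_gained"): [12, 6, 5, 5, 3, 2, 2, 2, 1, 0],
    ("Stop_retained", "Stop_lost"): [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ("Synonymous", "Missense"): [15, 6, 6, 4, 2, 1, 1, 1, 1, 1],
    ("Missense", "Stop_gained"): [30, 13, 11, 10, 6, 4, 3, 3, 2, 1],
}
totals = [5416, 3758, 3413, 3151, 2391, 2076, 1837, 1541, 1229, 819]

# Column-wise sums over all rows should reproduce the last row of the table.
column_sums = [sum(column) for column in zip(*rows.values())]

# Percentage of stop-gained SNP annotations re-predicted as missense, per column.
rescued = [
    round(100 * missense / (missense + stop), 1)
    for missense, stop in zip(
        rows[("Stop_gained", "Missense")], rows[("Stop_gained", "Stop_gained")]
    )
]

print(column_sums == totals)
print(rescued)
```

The rescued share stays close to 95% in most columns, in line with the figure quoted in the abstract.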
FIGURE 3. Comparison of the functional impact of MNVs (right) and their component SNPs (left) for each of the 2,965 MNVs. Left: The consequence originally predicted for the component SNPs, the most severe impact being retained for the codon (see Figure 2 or the text for the order). Right: The new prediction associated with the MNVs. For each category of functional prediction of the component SNPs (left), the numbers and percentages of new predictions due to the associated MNV are given. The two slashes indicate that the scale has been reduced fivefold for better readability.
FIGURE 4. SLC27A4 with an MNV composed of two phased SNPs observed in the experimental divergent lean line (LL). (A) Exon structure of the SLC27A4 gene and the MNV location. (B) For the two SNPs (SNP1: rs316701182 and SNP2: rs15031398) related to the MNV of interest, the allele position (on the Galgal5 genome), the functional impact on the associated protein, and the frequencies in the FLLL population are indicated; the FL and LL subpopulations are divergently selected for abdominal fat weight. (C) Effects of the four haplotypes related to SNP1 and SNP2, analyzed separately by VEP, and their frequencies in the LL (n = 12) and FL (n = 12) subpopulations, with a focus on the percentage of observed haplotypes in each subpopulation. The haplotypes were determined through the IGV browser from RNA-seq reads mapped against the chicken genome. (D) Tissue expression of the gene in a chicken RNA-seq dataset composed of 21 tissues (Jehl et al., 2020).