Literature DB >> 26317361

Genome-Wide Study of Structural Variants in Bovine Holstein, Montbéliarde and Normande Dairy Breeds.

Mekki Boussaha¹, Diane Esquerré², Johanna Barbieri², Anis Djari³, Alain Pinton², Rabia Letaief¹, Gérald Salin², Frédéric Escudié², Alain Roulet², Sébastien Fritz⁴, Franck Samson⁵, Cécile Grohs¹, Maria Bernard³, Christophe Klopp³, Didier Boichard¹, Dominique Rocha¹.

Abstract

High-throughput sequencing technologies have offered in recent years new opportunities to study genome variations. These studies have mostly focused on single nucleotide polymorphisms, small insertions or deletions and on copy number variants. Other structural variants, such as large insertions or deletions, tandem duplications, translocations, and inversions are less well-studied, despite that some have an important impact on phenotypes. In the present study, we performed a large-scale survey of structural variants in cattle. We report the identification of 6,426 putative structural variants in cattle extracted from whole-genome sequence data of 62 bulls representing the three major French dairy breeds. These genomic variants affect DNA segments greater than 50 base pairs and correspond to deletions, inversions and tandem duplications. Out of these, we identified a total of 547 deletions and 410 tandem duplications which could potentially code for CNVs. Experimental validation was carried out on 331 structural variants using a novel high-throughput genotyping method. Out of these, 255 structural variants (77%) generated good quality genotypes and 191 (75%) of them were validated. Gene content analyses in structural variant regions revealed 941 large deletions removing completely one or several genes, including 10 single-copy genes. In addition, some of the structural variants are located within quantitative trait loci for dairy traits. This study is a pan-genome assessment of genomic variations in cattle and may provide a new glimpse into the bovine genome architecture. Our results may also help to study the effects of structural variants on gene expression and consequently their effect on certain phenotypes of interest.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2015 PMID： 26317361 PMCID： PMC4552564 DOI： 10.1371/journal.pone.0135931

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Background

Over the past decade, many studies have attempted cataloging the nature and pattern of genomic alterations in population (e.g.[1]). The advent of novel high-throughput sequencing technologies [2-6] with the ability to partially or completely re-sequence genomes, in a relatively cost-effective manner, has offered new opportunities to study large scale genomic variations. In addition to single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels), several other studies have identified larger and more complex structural variants (SVs). Originally, SVs were considered as genomic alterations affecting DNA segments greater than 1,000 base pairs (1 kbp) in size [7]. However, with new advances in high-throughput sequencing technologies, the operational spectrum of SVs has widened to include much smaller genomic alteration events (> 50 bp in size) [8]. SVs such as large insertions, large deletions, inversions, duplications, translocations and Copy Number Variants (CNVs), are less frequent than SNPs and indels within a given genome however some of them may have more significant functional effects [9] and may also play a role in genome structure remodeling [10-16]. For example, during its pilot phase, the 1000 Genomes Project Consortium has sequenced 185 human whole-genomes and has identified more than 22,025 deletions and 6,000 additional SVs [17]. Some of those SVs are associated with disease susceptibility, such as autism [18-20] or schizophrenia [21-23] in humans. Many animal genomes have now been sequenced, including the genomes of several bulls and cows [24-48]. For example, Eck et al. (2009) generated the first cattle genome sequence by a next-generation sequencing method [24]. By sequencing a Fleckvieh bull genome, they discovered more than 2 million novel cattle SNPs. More recently, Daetwyler et al. (2014) have sequenced the whole-genome of 234 bulls from four different breeds and have identified more than 28 million variants (SNPs and indels). These polymorphisms have then been used to identify putative causative mutations for genetic defects or economically important complex traits [44]. Studies of large genomic variations in cattle have mostly focused on CNVs [27,29,32,43,49-63]. Some of these alterations have been involved in important phenotypes, such as resistance or susceptibility to gastrointestinal nematodes in Angus cattle [64-66] or feed intake in Holstein cows [67]. Other studies have also reported the involvement of other types of structural variants such as deletions, duplications or translocations in inherited disorders or coat colour patterning [31,38,68-75]. More recently McDaneld et al. have found a 70 kb-long deletion on BTA5 associated with decreased female reproductive efficiency in Bos indicus [76]; while Kadri et al. found a 660-kb long deletion on BTA12 with antagonistic effects on female fertility and milk production in Nordic Red cattle [77]. Here, we performed a large scale study to investigate both small indels (< = 50 bp) and large SVs (> 50 bp) in cattle by sequencing the whole-genome of 62 bulls from the three French major dairy breeds (Holstein, Montbéliarde and Normande breeds). The collection of SVs reported in this study may prove useful to study their potential effect on the expression levels of certain genes of interest and consequently to study their link with the genetic variability of economically important traits in cattle.

Materials and Methods

Animal ethics

No animal experimentation was used in this study, therefore no ethical permission was required from any relevant authority. Sequencing was performed using genomic DNA obtained from sperm collected from semen straws kindly provided by approved commercial artificial insemination stations as part of their regular semen collection process. The authors did not participate in the acquisition of semen samples for the purpose of this research.

Genomic DNA extraction

Genomic DNAs were extracted from semen of 62 dairy bulls (27 Holstein, 17 Montbéliarde and 18 Normande bulls) chosen based on their genetic contribution to the French cattle populations, using the Wizard Genomic DNA Purification Kit (Promega, Charbonnières-les-Bains, France) or using a standard phenol-chloroform method, respectively. A quality control inspection of each purified DNA sample was performed by agarose gel electrophoresis. DNA concentration was then measured with a Nanodrop ND-100 instrument (Thermo Scientific, Ilkirch, France).

Library construction and sequencing

Genomic libraries were prepared using the TruSeq DNA Sample Preparation Kit (Illumina) according to the manufacturer’s instructions. Briefly, 4 μg genomic DNA were fragmented into 150–400 bp pieces using divalent cations at 94°C for 8 min. The resulting cleaved DNA fragments were purified using Agencourt AMPure XP beads (Beckman Coulter, Villepinte, France), then subjected to end-repair and phosphorylation and subsequent purification was performed using Agencourt AMPure XP beads (Beckman Coulter). These repaired DNA fragments were 3′-adenylated producing DNA fragments with a single ‘A’ base overhang at their 3′-ends for subsequent adapter-ligation. Illumina adapters were ligated to the ends of these 3′-adenylated DNA fragments followed by two purification steps using Agencourt AMPure XP beads (Beckman Coulter). Ten rounds of PCR amplification were performed to enrich the adapter-modified DNA library using primers complementary to the ends of the adapters. The PCR products were purified using Agencourt AMPure XP beads (Beckman Coulter) and size-selected (200 ± 25 bp) on a 2% agarose Invitrogen E-Gel (Thermo Scientific). Libraries were then checked on an Agilent Technologies 2100 Bioanalyzer using the Agilent High Sensitivity DNA Kit and quantified by quantitative PCR with the QPCR NGS Library Quantification kit (Agilent Technologies, Massy, France). Libraries were used for 2×100 bp paired-end sequencing on an Illumina HiSeq2000 with a TruSeq SBS v3-HS Kit (Illumina).

Alignment to the reference

Sequence alignments were carried out using the Burrows-Wheeler Alignment tool (BWA v0.6.1-r104) [78] with default parameters for mapping reads to the UMD3.1 bovine reference genome [79]. Potential PCR duplicates, which can adversely affect the variant calls, were removed using the MarkDuplicates tool from Picard version 1.4.0 [80]. Only properly paired reads with a mapping quality of at least 30 (−q = 30) were kept. The resulting BAM files were then used for all subsequent analysis.

Identification of small insertions and deletions

Small indels were detected using the Genome Analysis Tool Kit 2.4–9 (GATK) version and GATK-UnifiedGenotyper as SNP caller [81]. Prior to variant discovery, reads were subjected to local realignment, coordinate sort, quality recalibration, and PCR duplicate removal. In the GATK analysis, we used a minimum confidence score threshold of Q30 with default parameters. We have also used multi-sample variant calling in order to distinguish between a homozygous reference genotype and a missing genotype in the analyzed samples.

Identification of SVs

Bioinformatics detection of potential genomic variation events was carried out on the 62 BAM files. We have performed multi-sample variant calling by Pindel software, v. 0.2.4y [82] using parameters as described in https://trac.nbic.nl/pindel. We first set the "Maximum event size index" to 9 in order to detect events whose sizes are up to 8,286,208 bp. We also set the–m parameter (min_perfect_match_around_BP) to 30 (i.e. at the point where the read is split into two, there should at least be 30 perfectly matching bases between the read and the reference sequences). We required a minimum mapping quality of the split read of 30 to support a breakpoint or junction. We finally used a custom python script to filter out Pindel-generated raw data: Only samples presenting at least three unique reads at the breakpoint of SVs were declared positive for the corresponding SV.

Annotation of SV regions

Analyses of the overlap between SVs and functional elements were performed based on the gene build 77 database for the UMD3.1 bovine gene dataset obtained from the Ensembl Genome Browser using the Biomart software (http://www.ensembl.org/index.html). Positions of SV breakpoints predicted by Pindel were compared to gene start and end positions in order to identify SVs that may encompass an entire gene, those that overlap with exons of a given gene, those that overlap gene starts or ends and those for which both SV breakpoints are located within two different genes. The Ensembl Biomart software was also used to find gene paralogs located within or overlapping the annotated SV regions. Gene Ontology (GO) enrichment was also performed using the MouseMine analysis tools, a powerful new system for accessing MGI (Mouse Genome Informatics) data, using the InterMine framework and is available at the MGI international database resources (http://www.mousemine.org/mousemine/begin.do). In order to investigate QTL regions within SV regions, we first downloaded all Bovine QTL regions from the public cattle QTL database release 24 (Aug 25, 2014), available at http://www.animalgenome.org. QTLs linked to milk traits (fat and protein content and yield) and somatic cell scores were subsequently extracted. A custom python script was then used to search for SVs located within or overlapping with QTLs regions.

SV validation by high-throughput genotyping

In order to investigate our approach efficiency to detect SVs, we developed a genotyping-based strategy using the already available Illumina BovineLD custom BeadChip [83]. With this strategy, many individuals can be genotyped for many SVs at limited cost. The main idea was to convert predicted SVs into “virtual SNPs” by testing the base change at the SV breakpoints. Therefore, several selection filters were applied in order to select a panel of SVs for validation: (1) in order to overcome genotyping problems due to sequence repeats, the SV flanking sequences were first analyzed with the RepeatMasker software [84] and all SVs with masked flanking sequences were removed; (2) For deletions, if the first nucleotide of the deleted region is different from the first nucleotide which is located immediately after the SV 3’ breakpoint, then we selected the corresponding SVs for further analysis. This deletion was then converted into a “virtual SNP” for which the reference allele corresponds to the first nucleotide of the deleted region and the alternative allele corresponds to the first nucleotide immediately after the SV 3’ breakpoint; (3) For inversion, if the first nucleotide of the SV region is different from the reverse-complement of the last nucleotide of the same SV region, then we selected the corresponding SV for further analysis. This inversion was then converted into a “virtual SNP” for which the reference allele corresponds to the first nucleotide of the inverted region and the alternative allele corresponds to the reverse-complement of the last nucleotide of the same inverted region. Steps 2 and 3 were repeated with the reverse-complement sequences. After applying the above filters, 331 deletions and inversions were selected for validation. They were genotyped for a large number of animals. High-throughput genotyping reactions were performed at Labogena core facility, using the custom low-density Illumina BovineLD SNP chip (San Diego, CA). SNPs with an Illumina design score above 0.4 were retained for further analysis. Oligonucleotides were designed, synthesized, and used to genotype 382 animals from at least eight major dairy breeds (Table 1). Several other breeds such as beef breeds (Limousine: 15, Charolaise: 19, Blonde d’Aquitaine: 12, Parthenaise: 12 and Gasconne: 9) were also included in our genotyping panel. However, none of the bulls used for SV identification was included in this genotyping sample list.

Table 1

Breed distribution of animals used in the validation study.

Breed	number of animals
Holstein	29
Montbeliarde	32
Normande	30
Abondance	29
Brown Swiss	30
Pie rouge des plaines	9
Simmental	16
Tarentaise	27
Others	180
Total	382

Table 1 summarizes the sample panel that was used for genotyping assays.

Analysis of population structure

To indirectly validate the results of this SV detection study, we compared the population structure assessed from SVs to those previously obtained with SNPs [85]. We first performed Principal Components Analysis (PCA) using “dudi.pca” implemented in the R package ade4 [86] using all validated SV information. Second, we used the STRUCTURE software package [87] to assess the population structure. This program implements a model-based clustering method to infer population structure using genotype data of unlinked markers. We used the admixture model and correlated allele frequency version of STRUCTURE [88].

Results and Discussion

Whole-genome sequencing, read mapping

Sixty-two of the most contributing bulls from the three major French dairy breeds (Holstein, Montbeliarde and Normande) were selected for whole-genome sequencing. A total of 31,140 million raw paired-end reads with a length of 100 bases were generated, resulting in a total of 3,114 gigabases. Each sample was sequenced on 1–4 lanes and approximately 140 to 1,120 million paired-ends reads were obtained for each library. On average, 93% (from 75% to 97%) of the paired-end reads were properly aligned on the UMD3.1 bovine reference genome (S1 Table). Similar read mapping rates were obtained in other bovine whole-genome sequencing studies. For example, Kawahara-Miki et al. (2011) found that 86% of the paired-end reads they generated while sequencing the genome of a Japanese Kuchinoshima-Ushi bull mapped uniquely onto the bovine genome [25]. The average genome-wide sequence coverage from the mapped reads ranged from 5× to 42× across the different genomes, with 52 samples sequenced at least at 10 fold average coverage.

Identification of genomic variations

Search for small variations with GATK-UnifiedGenotyer software resulted in the identification of 2,021,215 indels (S2 Table). On average we found 873,372 +/- 47,845 indels per bull. With this approach based on GATK, the largest indel identified was 11 bp in length. With Pindel algorithm, we generated two categories of variations. First, we produced a catalog containing 1,384,490 small SVs mainly small insertions and deletions (< = 50 bp) out of which 1,383,007 small SVs were less than 11 bp in size (S2 Table). These were subsequently used for concordance analysis with small indels data generated by GATK. Almost 98.9% (1,368,226 out 1,383,007) of small indels detected by Pindel were also identified with GATK (Fig 1). This relatively high percentage of concordance suggests that most small SVs detected by Pindel might be true variations. However, it is difficult to precisely estimate the sensitivity (false-positive rate) of our SV detection method as small indels found with GATK but not with Pindel might not be true indels.

Fig 1

Small indels identified with GATK and Pindel.

Venn diagram summarizing small indels identified by GATK and by Pindel.

Small indels identified with GATK and Pindel.

Venn diagram summarizing small indels identified by GATK and by Pindel. Second, we produced another catalog containing 6,426 putative large SVs (>50 bp) corresponding to 3,138 large deletions, 1,061 tandem duplications and 2,227 inversions (S3 Table). On average we observed nearly 199,200 small SVs and 305 large SVs per individual. Analysis of the length distribution of large SVs (Fig 2) revealed that most deletions (38.9%) are between 51 and 1,000 base pairs-long, whereas the length of most inversions (50.5%) is between 1 and 10 Kb while the vast majority of tandem duplications (80.9%) are larger than 10 Kb. These preliminary results seem to indicate a possible correlation between SV type and size. However, these observations should further be investigated.

Fig 2

Distribution of SVs based on their type and size.

Histogram summarizing the distribution of SVs based on their type and size. Inversions are highlighted in blue, deletions in red and tandem duplications in green.

Distribution of SVs based on their type and size.

Histogram summarizing the distribution of SVs based on their type and size. Inversions are highlighted in blue, deletions in red and tandem duplications in green. Analysis of the chromosomal distribution of the large SVs did not reveal any correlation with chromosome size (Fig 3). BTA12 harbours the highest number of SVs with approximately 7% of the total, followed by BTAX (5.5%) and BTA23 (5%). Moreover, no correlation has been observed between SV types and chromosomal distributions (Fig 3). The highest percentages of deletions were observed in BTA12 (6.3%), BTAX (5.7%) and BTA23 (5.3%). For tandem duplications, the highest percentages were observed in BTA1 (6.2%), BTA12 (6.2%) and BTA15 (5%). Finally, the highest percentages of inversions were observed in BTA12 (10.6%), BTA23 (7.2%) and BTA2 (6.2%).

Fig 3

Chromosomal distribution of large SVs.

Histogram showing the distribution of SVs within bovine chromosomes. Deletions are shown in blue, inversions in red and tandem duplications in green.

Chromosomal distribution of large SVs.

Histogram showing the distribution of SVs within bovine chromosomes. Deletions are shown in blue, inversions in red and tandem duplications in green. Deletions and tandem duplications identified in this study covered a total length of up to 277 Mb corresponding to almost 10% of the whole bovine genome, whereas inversions covered a total length up to 152 Mb, ie almost 6% of the bovine genome. However, these percentages could be overestimated as SVs identified in our study are indeed putative variations and at this stage we do not know yet the false positive rate of our detection approach.

Distribution of SVs between animals and between breeds

Overall, 61% of SVs were found only in single bulls (Fig 4). One deletion was found to be present in all 62 animals. One deletion and one tandem duplication were observed in 60 and 61 animals, respectively. Analysis of raw results generated by Pindel revealed that these 2 SVs were present in all 62 animals of our study but, for two animals, they were supported by less than the minimum number of 3 reads which was required to support an SV. These samples were therefore excluded from the final list of animals presenting the SVs. The cow genome reference sequence is derived from a single Hereford animal called Dominette. Therefore the first deletion and probably the two other SVs might be Hereford- or Dominette-specific SVs. Alternatively, these SVs could also be due to local errors in the UMD3.1 reference genome assembly.

Fig 4

SV distribution among the 62 sequenced animals.

SV distribution among the 62 sequenced animals.

Histogram showing the distribution of SVs among all 62 sequenced animals. Frequencies of SVs present in more than 16 sequenced samples were too low to be visualized and were therefore drawn in a separate graph embedded in the first one. Comparison of large SVs revealed that 12% of these were shared between the three breeds (Fig 5) and at least one third were shared between at least two breeds. As shown in Fig 5, we identified more large SVs (2,195) in Holstein bulls than in Montbéliarde and Normande bulls (1,103 and 1,240, respectively). This result could be partly explained by the larger number of sequenced bulls in Holstein (27) than in Montbéliarde and Normande (18 and 17, respectively). Our results suggest that at least one third of the SV events occurred before the separation of the three breeds and therefore might also be present in other cattle breeds.

Fig 5

Distribution of SVs found within the three breeds.

Venn diagram showing shared and unique SVs between the 3 breeds.

Distribution of SVs found within the three breeds.

Venn diagram showing shared and unique SVs between the 3 breeds.

Identification of potential CNV regions

CNVs are defined as loss (deletions) or gain (duplications) of copies of DNA segments. In order to identify SV regions (SVRs) that might correspond to potential CNV regions (CNVRs), we searched for DNA segments for which we could observe at the same time either a deletion in one bull and a duplication in another bull (across animals) or a deletion and a duplication at the same region within the same bull (within animal). We considered a given DNA region as a potential CNVR when the deleted and the duplicated segments are located within the same region, and are at least 70 percent overlapping. In our study, we found 452 unique deletions and 392 unique duplications which may code for potential CNVRs (S4 Table). In parallel, all deletion and tandem duplication regions identified in our study were also compared to publicly available CNVRs. Overall, 175 regions (128 deletion and 47 tandem duplication regions) overlapped with publicly available CNV datasets [29][43][52][56][60][65][89]. Out of these, 33 deletions and 29 tandem duplications were also identified with our first approach (S5 Table). Overall, we identified 957 SVs that could potentially code for CNVs. Out of these, 547 SVs were deletions and 410 were tandem duplications (S4 and S5 Tables).

Annotation of SVRs

Gene content

Analyses of functional elements lying within SVRs revealed a total of 2,415 (38%) SVRs which contain either entire gene-coding regions or only parts of genes (S6 Table). Therefore these SVs could potentially have an effect on expression of some of these genes and consequently a potential effect on some phenotypes. Out of these, 48% (1,168) were deletions, 27% (650) were tandem duplications and 25% (597) were inversions. Overall, a total of 5,011 genes overlap with these SVRs. The vast majority of these genes has paralogs (S7 Table) and correspond to uncharacterized genes (587) and genes coding for the olfactory receptor (327), U6 splicesomal RNA (159) and for the 5S ribosomal RNA (86). Interestingly, we found 182 large deletions removing an entire gene. Overall, 115 different genes are affected by these large deletions (S8 Table). Almost 91.3% (105/115 genes) of these genes belong to large multigene families. The remaining large deletions remove 10 single-copy genes, out of which we found 3 pseudogenes, 3 protein coding genes and 4 genes encoding for microRNAs. Alignment to the UMD3.1 bovine genome sequence of the sequence of the genes encoding for the novel miRNA ENSBTAG00000044935 and for bta-mir-2887-2 revealed several significant perfect matches (S9 Table), suggesting that multiple paralogous copies of these two microRNAs are located throughout the bovine genome. A single perfect alignment match was however observed for the other two miRNAs. The gene encoding for bta-mir-2310 has been discovered in the normal adult bovine kidney (MDBK) cell line after infection with bovine herpesvirus 1 and shows a low expression in non- and infected cells [90]. Further analysis using TargetScan database [91] identified four genes to be targets of bta-mir-2310. These encode for interleukin 5, protein inhibitor of activated STAT 1 (PIAS1), solute carrier family 25 member 31 (SLC25A31) and zinc finger protein 316 (ZFN316). They are involved in different functions such as immune response, gene signaling, metabolite transport, and gene expression regulation. It is therefore possible that bta-mir-2310 plays an important role by negatively modulating the gene expression of these genes. However, its inactivation might also have limited impact as targets for numerous other miRNAs were also found in the 3’-untranslated regions of these four target genes. The gene deleted by INRA_BovSV6339 encode for the mediator complex subunit 10 protein-coding gene (MED10) which is is a coactivator for DNA-binding factors that activate transcription of RNA polymerase II-dependent genes [92]. The other two genes deleted by INRA_BovSV1327 and by INRA_BovSV4164 encode for two yet uncharacterized proteins. The first gene contains only one exon and the predicted protein is around 100 amino acids. Alignment of this protein sequences against protein databases revealed a perfect match (100% identity) with the 3’-end of Bos taurus partitioning defective 3 homolog B isoform X5 (PARD3B). The second gene, however, contains 13 exons and code for a 463 amino acid protein. Amino acid sequence alignments against protein databases revealed high similarities with Bos Taurus ankyrin repeat domain-containing protein 26-like isoform X1 (LOC513969). Further analyses are needed to check whether these deletions have any functional impact in cattle.

Gene Ontology

Gene Ontology analyses were also performed for all 5,011 genes and GO terms were obtained for biological processes, cellular components and molecular functions (S10 Table). Several GO terms were found to be significantly over-represented. For example, the five most enriched GO categories corresponding to biological process are related to metabolic process, primary metabolic process, organic substance metabolic process, single-organism metabolic process and cellular metabolic process.

QTLs in SVRs

The positions of the 6,426 predicted large SV events were also compared to the positions on the UMD3.1 bovine genome assembly of known quantitative trait loci (QTLs) deposited in the public database AnimalQTLdb [93]. Overall 587 SVs (246 large deletions, 236 inversions and 105 tandem duplications) were found located within or overlapping QTLs linked to milk traits and somatic cell count and scores (S11 Table). The most frequent traits corresponded to somatic cell score (257 SVs) followed by milk fat percentage (161 SVs), milk protein yield (143) and milk protein percentage (107 SVs). QTL enrichment analysis (S11 Table) showed no significant enrichment of specific QTLs linked to milk trait or somatic cell counts when comparing the SVs overlapping the QTL regions against SVs overlapping all other known QTL regions available in the AnimalQTLdb database for cattle.

Validation of large SVs by genotyping

The efficiency of the selection approach and the relevance of the resulting SVs were assessed by genotyping a selected panel of SVs in 382 animals. None of the sequenced individuals was present in this genotyped panel. Assays were developed for 331 putative SVs (S12 Table), out of which 255 (77%) were successfully genotyped (S13 Table) while genotyping failed for 76 (23%). These did not either cluster well according to genotype or failed to amplify most probably because of the sequence complexity or the presence of polymorphisms within flanking sequences or failed manufacture with Illumina. These were considered "failed assays". Out of the 255 successfully genotyped SVs, 237 were deletions and 18 were inversions. For almost 25% (64 SVs) of the successfully genotyped SVs, only one SV allele was identified in all individuals (S13 and S14 Tables). Out of these, 61 SVs were homozygous for the reference allele and could therefore be incorrectly identified as true SVs by Pindel. Some of these SVs may also correspond to rare variants that were not present in the samples genotyped in this study. Indeed, almost 50% (30 out of 61) of these monomorphic SVs were found in a single bull and 84% (51 out of 61) were present in less than 5 animals. The remaining 3 SVs were homozygous for the alternative allele and were therefore considered as true SVs. Finally, 75% (191) of the successfully genotyped SVs were polymorphic and reliably scored, and thus were considered as true SVs (S13 and S14 Tables). Out of these, 184 SVs were deletions and 7 were inversions. The observed minor allele frequency (MAF) mean among true SVs was 0.20 ± 0.15 (SD), while the observed heterozygosity mean across loci was 0.28 ± 0.17, and the PIC (Polymorphic Information Content) mean was 0.23 ± 0.13 (S15 Table). Based on the observed heterozygosity and PIC rates in the validated SV panel and across the eight main breeds analyzed (S15 Table), we could conclude that this type of markers may be informative and is therefore of particular interest for linkage analysis. Nine deletions overlapping with publicly available CNVs and 37 others identified as potential CNVs with our first approach were also validated in our genotyping study (S4 and S5 Tables).

Assessment of population structure using SV genotyping data

Our validation study was carried out using animals from at least eight major dairy breeds (Table 1), out of which there were 29 Holstein, 32 Montbeliarde and 30 Normande animals. Using only genotyping data related to the three dairy breeds (S14 Table), PCA grouped individuals into three clusters according to their breeds of origin (Fig 6A).

Fig 6

Results of PCA analysis.

PCA analysis results were shown for the 3 main dairy breeds (Fig 6A) and for the 8 breeds (Fig 6B).

Results of PCA analysis.

PCA analysis results were shown for the 3 main dairy breeds (Fig 6A) and for the 8 breeds (Fig 6B). For K = 3, which corresponded to the three main breeds, STRUCTURE successfully sorted individuals into three groups entirely corresponding to the three breeds (Fig 7A).

Fig 7

Genetic population structure prediction.

Genetic population structure predicted by STRUCTURE software for the 3 main dairy breeds (Fig 7A) and for the 8 breeds (Fig 7B).

Genetic population structure prediction.

Genetic population structure predicted by STRUCTURE software for the 3 main dairy breeds (Fig 7A) and for the 8 breeds (Fig 7B). Similar results were also observed with the eight breeds used in our validation study (Fig 6B and Fig 7B). Our results are of particular interest as they could be considered as a global statistical validation step in addition to the genotyping validation approach we developed. Indeed, the SV used in these analyses provided a good description of the breed structure, similar to the one previously provided by SNP data [85].

Conclusions

In the present study, we performed a pan-genome assessment of structural variations in cattle using whole genome sequence data. Analysis of WGS data of 62 bulls from the three main dairy breeds used in France (Holstein, Montbéliarde and Normande breeds) allowed the identification of 6,426 large SVs (> 50 bp). Out of these, 547 deletions and 410 tandem duplications were identified as potential CNVs. To analyze the accuracy of our SV detection approach, a set of 331 SVs were selected for validation using a novel high-throughput genotyping strategy. Almost 75% of the successfully genotyped SVs could be validated and were polymorphic. The collection of newly discovered SVs may prove useful to study their link with genetic variability of economically-important traits in cattle. It will be particularly interesting to analyze the impact of the large deletions inactivating completely single-copy genes.

Reads mapping statistics.

Summary of read mapping of the 62 bovine whole-genomes. (XLSX) Click here for additional data file.

Small indels.

Summary of all variants identified by GATK and by Pindel in all 62 samples. (XLSX) Click here for additional data file.

Large SV.

List of all 6,426 putative large SVs identified by Pindel. (XLSX) Click here for additional data file.

List of overlapping deletions and tandem duplications.

Overlapping deletions and duplications were considered as potential CNVs. (XLSX) Click here for additional data file.

List of SVs that overlap with publicly available CNVs.

Results of comparison between deletions and tandem duplication regions identified in this study with publicly available CNV regions (XLSX) Click here for additional data file.

Gene content in SVs.

List of genes located within or overlapped with SV regions. (XLSX) Click here for additional data file.

Gene description and counts of genes and their paralogs.

Counts of gene based on gene description and number of paralogs corresponding to each gene description. (XLSX) Click here for additional data file.

Large deletions missing completely genes.

List of deletions that remove completely a complete gene coding region. Deletions that remove completely a single copy gene (gene with no known paralog) were highlighted in bold and italics. (XLSX) Click here for additional data file.

miRNA sequence homology search.

Results of alignment of miRNA gene sequences onto the UMD3.1 genome. (XLSX) Click here for additional data file.

GO enrichment.

Results of gene ontology analysis for biological process, cellular component and molecular function enrichment. (XLSX) Click here for additional data file.

SVs overlapping with QTLs.

List of SV regions overlapping with publicly available QTL regions related to milk traits. (XLSX) Click here for additional data file.

SVs used for genotyping.

List of SVs present in the custom LD chip used for validation study. (XLSX) Click here for additional data file.

Successfully genotyped SVs.

List of SVs for which we obtained a good quality genotype. (XLSX) Click here for additional data file.

Genotyping data.

Individual genotype for all 191 validated SVs in the 8 major dairy breeds. (XLSX) Click here for additional data file.

Frequencies of validated SVs.

Estimation of allelic frequencies, Minor Allele Frequency (MAF), Heterozygosity (He) estimated by He = 2pq and Polymorphic Information Content (PIC) estimated by PIC = He-2p2q2. (XLSX) Click here for additional data file.

88 in total

1. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding.

Authors: Kevin Judd McKernan; Heather E Peckham; Gina L Costa; Stephen F McLaughlin; Yutao Fu; Eric F Tsung; Christopher R Clouser; Cisyla Duncan; Jeffrey K Ichikawa; Clarence C Lee; Zheng Zhang; Swati S Ranade; Eileen T Dimalanta; Fiona C Hyland; Tanya D Sokolsky; Lei Zhang; Andrew Sheridan; Haoning Fu; Cynthia L Hendrickson; Bin Li; Lev Kotler; Jeremy R Stuart; Joel A Malek; Jonathan M Manning; Alena A Antipova; Damon S Perez; Michael P Moore; Kathleen C Hayashibara; Michael R Lyons; Robert E Beaudoin; Brittany E Coleman; Michael W Laptewicz; Adam E Sannicandro; Michael D Rhodes; Rajesh K Gottimukkala; Shan Yang; Vineet Bafna; Ali Bashir; Andrew MacBride; Can Alkan; Jeffrey M Kidd; Evan E Eichler; Martin G Reese; Francisco M De La Vega; Alan P Blanchard
Journal: Genome Res Date: 2009-06-22 Impact factor: 9.043

2. Whole genome sequencing of a single Bos taurus animal for single nucleotide polymorphism discovery.

Authors: Sebastian H Eck; Anna Benet-Pagès; Krzysztof Flisikowski; Thomas Meitinger; Ruedi Fries; Tim M Strom
Journal: Genome Biol Date: 2009-08-06 Impact factor: 13.583

3. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.

Authors: Kai Ye; Marcel H Schulz; Quan Long; Rolf Apweiler; Zemin Ning
Journal: Bioinformatics Date: 2009-06-26 Impact factor: 6.937

4. Identification of copy number variations and common deletion polymorphisms in cattle.

Authors: Joon Seol Bae; Hyun Sub Cheong; Lyoung Hyo Kim; Suk NamGung; Tae Joon Park; Ji-Yong Chun; Jason Yongha Kim; Charisse Flerida A Pasaje; Jin Sol Lee; Hyoung Doo Shin
Journal: BMC Genomics Date: 2010-04-09 Impact factor: 3.969

5. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays.

Authors: Radoje Drmanac; Andrew B Sparks; Matthew J Callow; Aaron L Halpern; Norman L Burns; Bahram G Kermani; Paolo Carnevali; Igor Nazarenko; Geoffrey B Nilsen; George Yeung; Fredrik Dahl; Andres Fernandez; Bryan Staker; Krishna P Pant; Jonathan Baccash; Adam P Borcherding; Anushka Brownley; Ryan Cedeno; Linsu Chen; Dan Chernikoff; Alex Cheung; Razvan Chirita; Benjamin Curson; Jessica C Ebert; Coleen R Hacker; Robert Hartlage; Brian Hauser; Steve Huang; Yuan Jiang; Vitali Karpinchyk; Mark Koenig; Calvin Kong; Tom Landers; Catherine Le; Jia Liu; Celeste E McBride; Matt Morenzoni; Robert E Morey; Karl Mutch; Helena Perazich; Kimberly Perry; Brock A Peters; Joe Peterson; Charit L Pethiyagoda; Kaliprasad Pothuraju; Claudia Richter; Abraham M Rosenbaum; Shaunak Roy; Jay Shafto; Uladzislau Sharanhovich; Karen W Shannon; Conrad G Sheppy; Michel Sun; Joseph V Thakuria; Anne Tran; Dylan Vu; Alexander Wait Zaranek; Xiaodi Wu; Snezana Drmanac; Arnold R Oliphant; William C Banyai; Bruce Martin; Dennis G Ballinger; George M Church; Clifford A Reid
Journal: Science Date: 2009-11-05 Impact factor: 47.728

6. Repertoire of bovine miRNA and miRNA-like small regulatory RNAs expressed upon viral infection.

Authors: Evgeny A Glazov; Kritaya Kongsuwan; Wanchai Assavalapsakul; Paul F Horwood; Neena Mitter; Timothy J Mahony
Journal: PLoS One Date: 2009-07-27 Impact factor: 3.240

7. Analysis of recent segmental duplications in the bovine genome.

Authors: George E Liu; Mario Ventura; Angelo Cellamare; Lin Chen; Ze Cheng; Bin Zhu; Congjun Li; Jiuzhou Song; Evan E Eichler
Journal: BMC Genomics Date: 2009-12-01 Impact factor: 3.969

8. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

9. Development and characterization of a high density SNP genotyping assay for cattle.

Authors: Lakshmi K Matukumalli; Cynthia T Lawley; Robert D Schnabel; Jeremy F Taylor; Mark F Allan; Michael P Heaton; Jeff O'Connell; Stephen S Moore; Timothy P L Smith; Tad S Sonstegard; Curtis P Van Tassell
Journal: PLoS One Date: 2009-04-24 Impact factor: 3.240

10. A whole-genome assembly of the domestic cow, Bos taurus.

Authors: Aleksey V Zimin; Arthur L Delcher; Liliana Florea; David R Kelley; Michael C Schatz; Daniela Puiu; Finnian Hanrahan; Geo Pertea; Curtis P Van Tassell; Tad S Sonstegard; Guillaume Marçais; Michael Roberts; Poorani Subramanian; James A Yorke; Steven L Salzberg
Journal: Genome Biol Date: 2009-04-24 Impact factor: 13.583

18 in total

1. Characterization of copy number variants in a large multibreed population of beef and dairy cattle using high-density single nucleotide polymorphism genotype data.

Authors: Pierce Rafter; Deirdre C Purfield; Donagh P Berry; Andrew C Parnell; I Claire Gormley; J Francis Kearney; Mike P Coffey; Tara R Carthy
Journal: J Anim Sci Date: 2018-09-29 Impact factor: 3.159

2. Genome-wide association study identifies loci associated with milk leukocyte phenotypes following experimental challenge with Streptococcus uberis.

Authors: Lydia Siebert; Margaret E Staton; Susan Headrick; Mark Lewis; Barbara Gillespie; Charles Young; Raul A Almeida; Stephen P Oliver; Gina M Pighetti
Journal: Immunogenetics Date: 2018-06-03 Impact factor: 2.846

Review 3. Use of Genomic Tools to Improve Cattle Health in the Context of Infectious Diseases.

Authors: Mikolaj M Raszek; Le L Guan; Graham S Plastow
Journal: Front Genet Date: 2016-03-07 Impact factor: 4.599

4. Whole-Genome Resequencing of Holstein Bulls for Indel Discovery and Identification of Genes Associated with Milk Composition Traits in Dairy Cattle.

Authors: Jianping Jiang; Yahui Gao; Yali Hou; Wenhui Li; Shengli Zhang; Qin Zhang; Dongxiao Sun
Journal: PLoS One Date: 2016-12-28 Impact factor: 3.240

5. Construction of a large collection of small genome variations in French dairy and beef breeds using whole-genome sequences.

Authors: Mekki Boussaha; Pauline Michot; Rabia Letaief; Chris Hozé; Sébastien Fritz; Cécile Grohs; Diane Esquerré; Amandine Duchesne; Romain Philippe; Véronique Blanquet; Florence Phocas; Sandrine Floriot; Dominique Rocha; Christophe Klopp; Aurélien Capitan; Didier Boichard
Journal: Genet Sel Evol Date: 2016-11-15 Impact factor: 4.297

6. Distribution and Functionality of Copy Number Variation across European Cattle Populations.

Authors: Maulik Upadhyay; Vinicus H da Silva; Hendrik-Jan Megens; Marleen H P W Visker; Paolo Ajmone-Marsan; Valentin A Bâlteanu; Susana Dunner; Jose F Garcia; Catarina Ginja; Juha Kantanen; Martien A M Groenen; Richard P M A Crooijmans
Journal: Front Genet Date: 2017-08-23 Impact factor: 4.599

7. CNV discovery for milk composition traits in dairy cattle using whole genome resequencing.

Authors: Yahui Gao; Jianping Jiang; Shaohua Yang; Yali Hou; George E Liu; Shengli Zhang; Qin Zhang; Dongxiao Sun
Journal: BMC Genomics Date: 2017-03-29 Impact factor: 3.969

8. Whole Genome Sequencing of Greater Amberjack (Seriola dumerili) for SNP Identification on Aligned Scaffolds and Genome Structural Variation Analysis Using Parallel Resequencing.

Authors: Kazuo Araki; Jun-Ya Aokic; Junya Kawase; Kazuhisa Hamada; Akiyuki Ozaki; Hiroshi Fujimoto; Ikki Yamamoto; Hironori Usuki
Journal: Int J Genomics Date: 2018-03-28 Impact factor: 2.326

9. Identification of copy number variation in French dairy and beef breeds using next-generation sequencing.

Authors: Rabia Letaief; Emmanuelle Rebours; Cécile Grohs; Cédric Meersseman; Sébastien Fritz; Lidwine Trouilh; Diane Esquerré; Johanna Barbieri; Christophe Klopp; Romain Philippe; Véronique Blanquet; Didier Boichard; Dominique Rocha; Mekki Boussaha
Journal: Genet Sel Evol Date: 2017-10-24 Impact factor: 4.297

10. Genome-wide mapping of large deletions and their population-genetic properties in dairy cattle.

Authors: Md Mesbah-Uddin; Bernt Guldbrandtsen; Terhi Iso-Touru; Johanna Vilkki; Dirk-Jan De Koning; Didier Boichard; Mogens Sandø Lund; Goutam Sahana
Journal: DNA Res Date: 2017-09-08 Impact factor: 4.458