Long Chen1,2, Amanda J Chamberlain3, Coralie M Reich3, Hans D Daetwyler3,4, Ben J Hayes3,4.
Abstract
BACKGROUND: Several examples of structural variation (SV) affecting phenotypic traits have been reported in cattle. Currently the identification of SV from whole-genome sequence data (WGS) suffers from a high false positive rate. Our aim was to construct a high quality set of SV calls in cattle using WGS data. First, we tested two SV detection programs, Breakdancer and Pindel, and the overlap of these methods, on simulated sequence data to determine their precision and sensitivity. We then identified population SV from WGS of 252 Holstein and 64 Jersey bulls based on the overlapping calls from the two programs. In addition, we validated an overlapped SV set in 28 twice-sequenced Holstein individuals, and in another two validated sets (one for each breed) that were transmitted from sire to son. We also tested whether highly conserved gene sets across eukaryotes and recently expanded gene families in bovine were depleted and enriched, respectively, for SV.
Year: 2017 PMID: 28122487 PMCID: PMC5267451 DOI: 10.1186/s12711-017-0286-5
Source DB: PubMed Journal: Genet Sel Evol ISSN: 0999-193X Impact factor: 4.297
Parameters for the simulation scenarios
| Simulation set | HOM | HET | Base error rate | Repetitive | SNP% in SV | Mix |
|---|---|---|---|---|---|---|
| HOM/HET | HOM | HET | HOM | HOM | HOM | HET |
| SNP% in SV | 0 | 0 | 0 | 0 | 0.01–0.25 | 0.01 |
| REP region | 0 | 0 | 0 | 100% | 0 | 50% |
| Number of SV | 100 × 3 | 100 × 3 | 100 × 3 | 100 × 3 | 100 × 3 | 100 × 3 |
| Insert size | 500 | 500 | 500 | 500 | 500 | 500 |
| SD of insert size | 50 | 50 | 50 | 50 | 50 | 50 |
| Base error rate | 0.01 | 0.01 | 0.001–0.025 | 0.01 | 0.01 | 0.01 |
| SNP rate | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.008 |
| Indel rate | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
HOM and HET denote homozygous and heterozygous SV, respectively; SNP% in SV is the percentage of SNPs that occur in an SV region; REP region is the percentage of SV that fall in repetitive regions (LINE regions). One hundred deletions, 100 inversions and 100 tandem duplications were simulated in each scenario. The default insert size of 500 and standard deviation of insert size of 50 were used; SNP rate is the overall SNP percentage across the whole cattle genome.
Genome coverage read depth and insert size of SV for the WGS datasets
| Population | Number | Coverage (min) | Coverage (mean) | Coverage (max) | Insert size (min) | Insert size (mean) | Insert size (max) |
|---|---|---|---|---|---|---|---|
| Holstein | 308 | 3.21 | 10.81 | 44.53 | 250 | 347.6656 | 514 |
| Jersey | 64 | 3.45 | 10.92 | 25.68 | 250 | 364.5469 | 502 |
Fig. 1 Precision and sensitivity of Breakdancer, Pindel and overlap methods for the detection of structural variations in different simulation scenarios. a Precision of each method; b sensitivity of each method; BD Breakdancer, PD Pindel, OV overlap method, DEL deletions, INV inversions, DUP duplications, BCE base calling error rate, MIX mix scenario with SNP% = 0.01, SNP rate = 0.008 and half of the SV falling into repetitive regions. Precision is defined as the average number of true positives divided by the average number of total calls made by each program. Sensitivity is defined as the average number of true positives divided by the average number of actual variants in the simulations.
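
The two definitions in the Fig. 1 caption can be written out as a minimal Python sketch. The function name and the example counts (81 true positives, 90 calls, 100 simulated variants) are hypothetical and are not taken from the study; only the two ratios follow the caption.

```python
def precision_sensitivity(true_positives, total_calls, actual_variants):
    """Return (precision, sensitivity) as defined in the Fig. 1 caption.

    precision   = true positives / total calls made by the program
    sensitivity = true positives / actual (simulated) variants
    """
    precision = true_positives / total_calls if total_calls else 0.0
    sensitivity = true_positives / actual_variants if actual_variants else 0.0
    return precision, sensitivity


# Hypothetical counts for one SV type in one simulation scenario.
precision, sensitivity = precision_sensitivity(
    true_positives=81, total_calls=90, actual_variants=100
)
print(f"precision={precision:.2f}, sensitivity={sensitivity:.2f}")
```

In the caption the inputs are averages over simulation replicates, so the same function would be applied to the averaged counts.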
Number and length of genome regions covered by SV detected in the Holstein and Jersey sets by Breakdancer and Pindel
| SV set (raw output) | DEL count | INS count | INV count | DUP count | DEL region (Mb) | INS region (Mb) | INV region (Mb) | DUP region (Mb) |
|---|---|---|---|---|---|---|---|---|
| POP_HOL_Breakdancer | 2,124,795 | 2,047,019 | 46,975 | 28,745 | 116.97 | 115.47 | 118.82 | 15.69 |
| POP_HOL_Pindel | 51,302 | 85,946 | 457,575 | 21,888 | 144.96 | 6.35 | 269.69 | 84.86 |
| POP_JER_Breakdancer | 412,830 | 498,257 | 4397 | 4502 | 31.56 | 32.77 | 13.30 | 7.98 |
| POP_JER_Pindel | 37,717 | 47,234 | 63,683 | 20,889 | 46.58 | 3.38 | 62.28 | 27.53 |
Fig. 2 Size distribution of four types of structural variations in the validation datasets (SV in the twice-sequenced set and in the Holstein and Jersey sire-son transmission sets). The x axis represents the length of SV; the y axis represents the frequency of SV for each length; the pink area represents the Holstein sire-son transmission validated set; the green area represents the Jersey sire-son transmission validated set; the blue area represents the twice-sequenced validated set.
Number and length of genome regions covered by SV detected in the Holstein and Jersey sets and the three validated sets, and in the overlapped set between Holstein and Jersey
| SV set (final output) | DEL count | INS count | INV count | DUP count | DEL region (Mb) | INS region (Mb) | INV region (Mb) | DUP region (Mb) |
|---|---|---|---|---|---|---|---|---|
| POP_HOL | 4037 | 7679 | 3623 | 2179 | 8.4889 | 0.6334 | 13.8377 | 4.3995 |
| POP_JER | 2679 | 0* | 415 | 1191 | 5.2239 | 0.0000* | 1.0497 | 2.3675 |
| Overlap between POP_HOL and POP_JER | 1533 | 0 | 69 | 601 | 3.1790 | 0.0000 | 0.2188 | 1.2270 |
| TWICE_SEQ | 10,893 | 174 | 200 | 267 | 4.8495 | 0.0077 | 0.3882 | 0.3934 |
| FAM_HOL | 4230 | 24 | 106 | 258 | 2.9639 | 0.0012 | 0.2057 | 0.3173 |
| FAM_JER | 619 | 0* | 17 | 58 | 0.5944 | 0.0000* | 0.0240 | 0.0466 |
| Overlap between FAM_HOL and FAM_JER | 509 | 0 | 14 | 27 | 0.4704 | 0.0000 | 0.0185 | 0.0225 |
* No insertions were found for the Jersey population
Comparison of the TWICE_SEQ and MERGE sets for 21 twice-sequenced individuals
| Animal ID | MERGE | OVERLAP | SHARE | Overlap% | Coverage |
|---|---|---|---|---|---|
| HOLFRAM268 | 845 | 27 | 15 | 55.56 | 21.32 |
| HOLFRAM266 | 1896 | 1635 | 1041 | 63.67 | 14.25 |
| HOLNLDM273 | 1796 | 1913 | 1176 | 65.48 | 15.34 |
| HOLNLDM270 | 1677 | 2069 | 1136 | 67.74 | 15.51 |
| HOLDNKM259 | 1682 | 2001 | 1146 | 68.13 | 15.74 |
| HOLUSAM277 | 2219 | 2398 | 1524 | 68.68 | 17.62 |
| HOLNLDM272 | 2039 | 2520 | 1410 | 69.15 | 16.91 |
| HOLDEUM255 | 1881 | 2168 | 1341 | 71.29 | 17.02 |
| HOLUSAM280 | 2000 | 2200 | 1440 | 72.00 | 17.55 |
| HOLNLDM274 | 690 | 1585 | 497 | 72.03 | 14.78 |
| HOLDNKM262 | 980 | 2537 | 714 | 72.86 | 17.4 |
| HOLDNKM261 | 1969 | 1867 | 1361 | 72.90 | 15.84 |
| HOLUSAM278 | 1011 | 3067 | 761 | 75.27 | 18.96 |
| HOLDNKM260 | 2557 | 1305 | 986 | 75.56 | 16.93 |
| HOLDEUM256 | 1059 | 2730 | 806 | 76.11 | 17.48 |
| HOLSWEM275 | 1214 | 2882 | 926 | 76.28 | 18.86 |
| HOLUSAM279 | 1331 | 2581 | 1036 | 77.84 | 16.75 |
| HOLDNKM263 | 1159 | 2697 | 916 | 79.03 | 17.02 |
| HOLDEUM257 | 1356 | 3626 | 1087 | 80.16 | 21.53 |
| HOLCANM253 | 1600 | 257 | 255 | 99.22 | 42.71 |
| HOLUSAM276 | 845 | 132 | 131 | 99.24 | 16.89 |
MERGE and OVERLAP represent the counts of SV that were observed by using the merge and overlap method, respectively. SHARE represents the counts of SV that were found by both methods. The overlap percentage (Overlap%) equals SHARE divided by the smaller of the MERGE and OVERLAP counts. Coverage is the sum of the coverages for each twice-sequenced individual.
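
The overlap percentage described in the footnote can be checked against the table with a short Python sketch. The function name is an illustrative choice; the three rows are copied from the table above.

```python
def overlap_percent(merge, overlap, share):
    """SHARE as a percentage of the smaller of the MERGE and OVERLAP counts."""
    return 100.0 * share / min(merge, overlap)


# (MERGE, OVERLAP, SHARE) for three animals from the table.
rows = {
    "HOLFRAM268": (845, 27, 15),
    "HOLFRAM266": (1896, 1635, 1041),
    "HOLDNKM260": (2557, 1305, 986),
}

for animal, (merge, overlap, share) in rows.items():
    print(animal, round(overlap_percent(merge, overlap, share), 2))
```

Rounded to two decimals, these reproduce the Overlap% column for the three animals (55.56, 63.67 and 75.56).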
Structural variation found in a set of genes that are highly conserved across eukaryotes
| Gene name | Chromosome | Start bp | End bp | SV type | Dataset |
|---|---|---|---|---|---|
| | Chr3 | 67,687,801 | 67,824,633 | DEL | TWICE_SEQ |
| | Chr8 | 10,456,053 | 10,576,397 | DEL | TWICE_SEQ |
| | Chr20 | 9,851,676 | 10,148,631 | DEL | TWICE_SEQ |
| | Chr20 | 23,727,320 | 23,853,125 | DEL | TWICE_SEQ |
| | Chr21 | 31,993,936 | 32,063,870 | DEL | TWICE_SEQ |
| | Chr21 | 33,646,396 | 33,647,527 | DEL | TWICE_SEQ |
| | Chr29 | 7,723,699 | 7,725,004 | DEL | TWICE_SEQ |
| | Chr8 | 85,268,883 | 85,350,117 | INV | TWICE_SEQ |
| | Chr21 | 31,993,936 | 32,063,870 | DEL | FAM_HOL |
| | Chr21 | 33,646,396 | 33,647,066 | DEL | FAM_HOL |
| | Chr8 | 85,268,883 | 85,350,117 | INV | FAM_HOL |
Chi squares and p values for the test on conserved genes
| | Conserved | Non-conserved | Chi square | p value |
|---|---|---|---|---|
| TWICE_SEQ_SV | | | | |
| SV | 8 | 965 | 5.0155 | 0.025 |
| not_SV | 229 | 12,555 | | |
| Total | 237 | 13,520 | | |
| FAM_HOL | | | | |
| SV | 3 | 565 | 4.9937 | 0.02544 |
| not_SV | 234 | 12,955 | | |
| Total | 237 | 13,520 | | |
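
The chi-square values in this table can be recomputed from the four cell counts. The sketch below assumes a Pearson chi-square test on a 2 × 2 table with one degree of freedom and no continuity correction; the source does not state which variant was used, but this form is consistent with the reported numbers. The function name is illustrative.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table
    [[a, b], [c, d]], with the p value for 1 degree of freedom."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: SV / not_SV; columns: conserved / non-conserved.
chi2_twice, p_twice = pearson_chi2_2x2(8, 965, 229, 12555)
chi2_fam, p_fam = pearson_chi2_2x2(3, 565, 234, 12955)
print(f"TWICE_SEQ: chi2={chi2_twice:.4f}, p={p_twice:.4f}")
print(f"FAM_HOL:   chi2={chi2_fam:.4f}, p={p_fam:.4f}")
```

The statistic alone does not indicate direction; comparing the observed count of conserved genes with SV to its expected value under independence shows whether the set is depleted or enriched.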
Expanded gene families in the bovine genome with structural variations
| Gene | Chr | Start bp | End bp | SV type | SV sets |
|---|---|---|---|---|---|
| | Chr13 | 61,561,981 | 61,578,126 | DEL | POP_HOL;TWICE_SEQ;POP_JER |
| | Chr13 | 61,561,981 | 61,578,126 | INS | POP_HOL |
| | Chr13 | 61,562,053 | 61,566,096 | DEL | POP_HOL;POP_JER |
| | Chr13 | 61,562,053 | 61,566,096 | INS | POP_HOL |
| | Chr13 | 61,371,090 | 61,377,521 | INV | POP_HOL |
| | Chr13 | 61,391,541 | 61,402,435 | INV | POP_HOL |
| | Chr23 | 22,381,986 | 22,387,950 | DEL | TWICE_SEQ |
| | Chr27 | 5,457,175 | 5,465,032 | INS | POP_HOL |
| | Chr27 | 5,483,406 | 5,539,158 | INS | POP_HOL |
| | Chr27 | 5,448,917 | 5,465,074 | INS | POP_HOL |
| | Chr27 | 6,223,483 | 6,225,131 | DEL | FAM_HOL;TWICE_SEQ;POP_JER |
| | Chr27 | 5,134,073 | 5,276,254 | DUP | POP_JER |
| | Chr27 | 5,134,073 | 5,276,254 | INS | POP_HOL |
| | Chr27 | 5,134,073 | 5,276,254 | DEL | POP_JER |
| | Chr27 | 5,245,806 | 5,351,104 | INS | POP_HOL |
| | Chr22 | 52,189,557 | 52,191,061 | DEL | POP_HOL;POP_JER |
| | Chr29 | 38,952,100 | 39,189,606 | DEL | TWICE_SEQ |
| | Chr29 | 38,952,100 | 39,189,606 | INS | POP_HOL |
| | Chr29 | 38,952,100 | 39,189,606 | INV | POP_HOL |
| | Chr29 | 38,428,102 | 38,437,106 | DEL | TWICE_SEQ |
| | Chr23 | 34,386,963 | 34,491,996 | DEL | FAM_HOL;POP_HOL;TWICE_SEQ |
| | Chr23 | 34,386,963 | 34,491,996 | DUP | POP_HOL;POP_JER |
| | Chr23 | 34,479,662 | 34,491,996 | DEL | POP_HOL;TWICE_SEQ |
* Genes that are completely spanned by SV
Chi squares and p values for expanded gene families analysis
| | Expanded gene families | Other refseq genes | Chi square | p value |
|---|---|---|---|---|
| SV | 4 | 969 | 7.0135 | 0.00809 |
| Non-SV | 13 | 12,771 | | |
| Total | 17 | 13,740 | | |
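
To see the direction of the effect behind these tests, the observed count of genes with SV can be compared with the count expected under independence, together with the odds ratio. The Python sketch below uses the cell counts from this table and from the TWICE_SEQ block of the conserved-gene table; the helper names are illustrative and this calculation is not part of the reported analysis.

```python
def expected_count(row_total, col_total, grand_total):
    """Expected cell count under independence of rows and columns."""
    return row_total * col_total / grand_total


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


grand_total = 13_757  # 17 + 13,740 genes (also 237 + 13,520)
sv_genes = 973        # 4 + 969 genes with SV (also 8 + 965)

# Expanded gene families: 4 observed genes with SV out of 17.
expected_expanded = expected_count(sv_genes, 17, grand_total)
or_expanded = odds_ratio(4, 969, 13, 12771)

# Conserved genes (TWICE_SEQ block): 8 observed genes with SV out of 237.
expected_conserved = expected_count(sv_genes, 237, grand_total)
or_conserved = odds_ratio(8, 965, 229, 12555)

print(f"expanded:  observed=4, expected={expected_expanded:.2f}, OR={or_expanded:.2f}")
print(f"conserved: observed=8, expected={expected_conserved:.2f}, OR={or_conserved:.2f}")
```

An observed count above the expected value (odds ratio above 1) corresponds to enrichment for SV, and a count below it (odds ratio below 1) to depletion.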
Proportion of structural variants in LINE regions, compared with the genome as a whole and other regions
| Sample set | Non-L1_exon | L1 | Fold_change | t test p value |
|---|---|---|---|---|
| Deletions | | | | 0.003538 |
| FAM_HOL | 0.000805 | 0.003322 | 4.124097 | |
| FAM_JER | 0.000139 | 0.000818 | 5.893840 | |
| POP_HOL | 0.002868 | 0.005747 | 2.003659 | |
| POP_JER | 0.001654 | 0.004282 | 2.588642 | |
| VAL_SV | 0.001384 | 0.004992 | 3.608023 | |
| Insertions | | | | 0.185507 |
| FAM_HOL | 0.000001 | 0.000000 | 0.000000 | |
| FAM_JER | 0.000000 | 0.000000 | 0.000000 | |
| POP_HOL | 0.000249 | 0.000196 | 0.787844 | |
| POP_JER | 0.000000 | 0.000000 | 0.000000 | |
| VAL_SV | 0.000003 | 0.000003 | 1.080158 | |
| Inversions | | | | 0.260667 |
| FAM_HOL | 0.000084 | 0.000040 | 0.479628 | |
| FAM_JER | 0.000010 | 0.000005 | 0.484786 | |
| POP_HOL | 0.005262 | 0.005435 | 1.032979 | |
| POP_JER | 0.000395 | 0.000437 | 1.106285 | |
| VAL_SV | 0.000152 | 0.000123 | 0.810684 | |
| Duplications | | | | 0.082899 |
| FAM_HOL | 0.000122 | 0.000115 | 0.945850 | |
| FAM_JER | 0.000016 | 0.000033 | 2.100240 | |
| POP_HOL | 0.001611 | 0.002144 | 1.331104 | |
| POP_JER | 0.000828 | 0.001416 | 1.710535 | |
| VAL_SV | 0.000148 | 0.000164 | 1.103813 |
Fold change is equal to the percentage of the genome that harbors SV in the L1 regions divided by the percentage of the genome that harbors SV in the other regions
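
The fold change defined above is a simple ratio, sketched here in Python with values from the deletions block of the table. The function name is illustrative, and returning 0.0 when the denominator is zero is an assumption made to mirror the rows where both proportions are reported as 0.000000. Because the table shows proportions rounded to six decimals, the recomputed ratios agree with the Fold_change column only approximately.

```python
def fold_change(l1_proportion, other_proportion):
    """Proportion of L1 regions harboring SV divided by the proportion of
    other regions harboring SV.

    Assumption: returns 0.0 when the denominator is zero, mirroring table
    rows in which both proportions are 0.000000.
    """
    if other_proportion == 0:
        return 0.0
    return l1_proportion / other_proportion


# (Non-L1 proportion, L1 proportion) for deletions, copied from the table.
deletions = {
    "FAM_HOL": (0.000805, 0.003322),
    "POP_HOL": (0.002868, 0.005747),
}

for sample_set, (other, l1) in deletions.items():
    print(sample_set, round(fold_change(l1, other), 3))
```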