Germline and somatic variant identification using BGISEQ-500 and HiSeq X Ten whole genome sequencing.

Ann-Marie Patch¹, Katia Nones¹, Stephen H Kazakoff¹, Felicity Newell¹, Scott Wood¹, Conrad Leonard¹, Oliver Holmes¹, Qinying Xu¹, Venkateswar Addala¹, Jenette Creaney²,³, Bruce W Robinson²,³, Shujin Fu⁴, Chunyu Geng⁴, Tong Li⁴, Wenwei Zhang⁴, Xinming Liang⁴, Junhua Rao⁴, Jiahao Wang⁴, Mingyu Tian⁴, Yonggang Zhao⁴, Fei Teng⁴, Honglan Gou⁴, Bicheng Yang⁴, Hui Jiang⁴, Feng Mu⁴, John V Pearson¹, Nicola Waddell¹.

Abstract

Technological innovation and increased affordability have contributed to the widespread adoption of genome sequencing technologies in biomedical research. In particular, large cancer research consortia have embraced next generation sequencing and have used the technology to define the somatic mutation landscape of multiple cancer types. These studies have primarily utilised the Illumina HiSeq platforms. In this study we performed whole genome sequencing of three malignant pleural mesothelioma and matched normal samples using a new platform, the BGISEQ-500, and compared the results with those obtained using the Illumina HiSeq X Ten. Germline and somatic single nucleotide variants and small insertions or deletions were independently identified from data aligned to the human genome reference. The BGISEQ-500 and HiSeq X Ten platforms showed high concordance for germline calls with genotypes from SNP arrays (>99%). The germline and somatic single nucleotide variants identified in both sequencing platforms were highly concordant (86% and 72% respectively). These results indicate the potential applicability of the BGISEQ-500 platform for the identification of somatic and germline single nucleotide variants by whole genome sequencing. The BGISEQ-500 datasets described here represent the first publicly available cancer genome sequencing performed using this platform.

Year: 2018 PMID: 29320538 PMCID: PMC5761881 DOI: 10.1371/journal.pone.0190264

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The human genome project was an important achievement in life sciences and paved the way for major technology developments in DNA sequencing. The development of next generation sequencing (NGS, also known as massively parallel or high-throughput sequencing) machines commenced with the 454 DNA sequencer (Life Sciences), followed by the Genome Analyzer (Solexa) and SOLiD (Agencourt) platforms. Solexa, who pioneered sequencing by synthesis technology, were acquired by Illumina, who further refined the technology and developed the HiSeq sequencers (reviewed in [1, 2]). The HiSeq platforms have now produced the majority of the publicly available human DNA sequencing data. Over time the cost of sequencing has decreased and the technology has become more accessible, both in terms of sequencing hardware and tools for analysis, which has resulted in NGS being adopted by many researchers. NGS has been applied in cancer research to identify somatic mutations occurring in many tumour types. Two large consortia, The Cancer Genome Atlas (TCGA) [3] and the International Cancer Genome Consortium (ICGC) [4], have sequenced thousands of tumours from over 50 cancer types. These two consortia have been instrumental in increasing our knowledge of cancer genomics and have identified significantly mutated genes, candidate actionable mutations and mutational processes [5] that occur during tumour development. To date most large scale cancer genome studies have utilised the Illumina HiSeq platforms. In 2015, Beijing Genomics Institute (BGI) launched the BGISEQ-500 as an alternative to existing short-read sequencing technologies. The BGISEQ-500 is based on combinatorial Probe-Anchor Synthesis and improved DNA Nanoballs technology [6]. Previously the BGISEQ-500 has been used to sequence small non-coding RNAs [7], insect derived transcriptomes [8], genomes from historic and ancient dog and wolf samples [9] and the whole genome of a single human DNA reference sample [10]. However, to date no studies have used the platform for cancer whole genome sequencing (WGS). Here we evaluate WGS data generated on the BGISEQ-500 and HiSeq X Ten using DNA extracted from cancer and matched germline samples from patients with malignant pleural mesothelioma.

Materials and methods

Patients and samples

Samples were collected from three patients (identified as 9869, 11202 and 11398) diagnosed with malignant pleural mesothelioma at the Sir Charles Gairdner Hospital in Perth, Western Australia. The work in this study was approved by the Human Research Ethics Committee of Sir Charles Gairdner Hospital and QIMR Berghofer Medical Research Institute and all patients provided written consent. Blood samples were collected in K2EDTA plasma Vacutainer tubes (BD Bioscience, New Jersey, USA). Pleural effusion samples were collected without preservative by routine pleurocentesis and were in excess of that required for diagnosis. A diagnosis of malignant pleural mesothelioma was confirmed by pathologists experienced in the diagnosis of effusions. Effusions were centrifuged for 10 min at 1000 g and the resulting cell pellet was washed in PBS by centrifugation at 400 g for 10 min, then depleted of CD45 positive cells using the EasySep Human CD45 Depletion kit (Stemcell Technologies, Vancouver, Canada). The resulting cellular composition was reviewed on cytospin cell preparations.

DNA extraction and quality assessment

DNA was extracted using the AllPrep DNA/RNA/miRNA Universal kit (Qiagen) following the manufacturer’s instructions. DNA samples extracted from blood and matched pleural effusion samples were quantified using a Qubit (ThermoFisher Scientific). To ensure that there was high tumour content in each sample the DNA was assayed using SNP arrays (Infinium Omni2.5–8, Illumina) and tumour content estimated using qPure [11]. The tumour content was 89% for patient 9869; 78% for patient 11202 and 81% for patient 11398. A total of 2 μg of each DNA sample was sent to both BGI and the Kinghorn Centre for Clinical Genomics (KCCG) for WGS using the BGISEQ-500 and HiSeq X Ten, respectively.

Library construction and whole genome sequencing

Sequence libraries for the BGISEQ-500 platform were prepared using a sonication or fragmentase based library construction method. The MGIEasy™ DNA Library Prep Kit V1 (BGI, Cat. No. 85-05533-00) was applied to construct the sonication based library using 1000 ng of genomic DNA that had been sheared with an E220 Covaris instrument (Covaris Inc.) following the manufacturer’s manual. The fragmentase based WGS libraries used 100 ng of each genomic DNA sample that was sheared by fragmentase (NEB). All samples described were prepared using the fragmentase-based library method, except for the normal sample from patient 9869, which underwent sonication. After fragmentation by sonication or fragmentase, the DNA fragments were size selected using AMPure XP beads (Beckman Coulter, Indiana, USA) and then underwent end-repair, phosphorylation and A-tailing reactions. BGISEQ-500 platform-specific adaptors were ligated to the A-tailed fragments, and the ligated fragments were purified and then amplified using PCR. Finally, circularization was performed to generate single stranded DNA circles. After quantitation and qualification, the libraries were sequenced. BGI performed the DNA nanoball preparation and whole genome sequencing using the circular single stranded libraries as a template for rolling circle amplification to form DNA nanoballs. The DNA nanoballs were loaded onto a sequencing flow cell and then processed for 50 bp paired-end sequencing on the BGISEQ-500 platform. In contrast, the KCCG performed WGS on a HiSeq X Ten using the HiSeq X Ten Reagent Kit v2.5 following the manufacturer’s guidelines.

Whole genome sequence analysis

Whole genome sequencing was performed as 50 bp paired-end on the BGISEQ-500 platform and 150 bp paired-end on the HiSeq X Ten. The BGISEQ-500 sequence data have been deposited in the EGA (accession number: EGAS00001002298) and the Illumina data are available in the EGA (accession number: EGAS00001002299). Data from the BGISEQ-500 and HiSeq X Ten were analysed using the same pipeline. Briefly, sequence reads were trimmed using Cutadapt (version 1.11) and aligned to GRCh37 using BWA-MEM (version 0.7.12-r1039), duplicates were marked with Picard (version 1.129, http://picard.sourceforge.net) and alignments were coordinate-sorted using Samtools (version 1.3) [12]. Single nucleotide variants (SNV) were detected using a dual calling strategy with qSNP [13] and GATK HaplotypeCaller [14]. Short insertions and deletions (indels) of ≤50 bp were also called with the GATK HaplotypeCaller. Variants were annotated with Ensembl v75 gene feature information and transcript or protein consequences using SnpEff (version 4.2) [15]. All germline SNV and indels were annotated with whether they are present in the genome Aggregation Database (gnomAD), which comprises two datasets: exome sequence data from the Exome Aggregation Consortium [16] and whole genome sequencing from 15,496 individuals. Variants were considered “called” and used in subsequent analysis if they passed the following filters: a minimum read depth of 8 reads in the normal control data and 12 in the tumour data; and at least 4 reads containing the variant, with the variant identified on both strands and not within the first or last 5 bases of the read. Additionally, indels that were located immediately adjacent to homopolymer regions of at least 6 bp, and for which the inserted or deleted bases were identical to the homopolymer base, were filtered. Variants that did not pass these filters were considered “low evidence”.
The processes used to analyse the somatic data were established for the International Cancer Genome Consortium (ICGC) [4] and have been used for several high impact cancer studies [17-19]. These processes have also been internationally benchmarked against other pipelines [20]. In this manuscript the term ‘somatic variants’ refers to mutations acquired by the tumour, or tumour specific variants which are not present in the germline (matched normal sample).
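The evidence filters described above can be illustrated with a minimal Python sketch. It is not the qSNP or GATK implementation: the `Evidence` record, the function names and the choice to label homopolymer-adjacent indels as “low evidence” are assumptions made for illustration, and the strand counts are assumed to exclude reads where the variant falls within the first or last 5 bases.

```python
from dataclasses import dataclass

# Thresholds taken from the filter description in the text.
MIN_NORMAL_DEPTH = 8
MIN_TUMOUR_DEPTH = 12
MIN_VARIANT_READS = 4
MIN_HOMOPOLYMER = 6


@dataclass
class Evidence:
    """Hypothetical per-variant summary (field names are illustrative)."""
    normal_depth: int        # reads covering the site in the normal sample
    tumour_depth: int        # reads covering the site in the tumour sample
    fwd_variant_reads: int   # forward-strand reads carrying the variant
    rev_variant_reads: int   # reverse-strand reads carrying the variant


def run_length(seq: str, base: str, from_end: bool = False) -> int:
    """Length of the run of `base` at the start (or end) of `seq`."""
    count = 0
    for ch in (reversed(seq) if from_end else seq):
        if ch != base:
            break
        count += 1
    return count


def beside_homopolymer(indel_bases: str, left_flank: str, right_flank: str) -> bool:
    """True if the indel repeats a single base that also forms an adjacent run of >= 6 bp."""
    if not indel_bases or len(set(indel_bases)) != 1:
        return False
    base = indel_bases[0]
    return (run_length(left_flank, base, from_end=True) >= MIN_HOMOPOLYMER
            or run_length(right_flank, base) >= MIN_HOMOPOLYMER)


def classify(ev: Evidence, homopolymer_indel: bool = False) -> str:
    """Return 'called' if every filter passes, otherwise 'low evidence'."""
    deep_enough = (ev.normal_depth >= MIN_NORMAL_DEPTH
                   and ev.tumour_depth >= MIN_TUMOUR_DEPTH)
    supported = (ev.fwd_variant_reads + ev.rev_variant_reads >= MIN_VARIANT_READS
                 and ev.fwd_variant_reads > 0
                 and ev.rev_variant_reads > 0)
    if deep_enough and supported and not homopolymer_indel:
        return "called"
    return "low evidence"
```

For example, `classify(Evidence(10, 20, 3, 2))` passes every threshold, whereas the same variant with all five supporting reads on one strand would be reported as low evidence.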

Comparison of variants detected between different platforms

The germline genotypes from the SNP arrays were compared to the BGISEQ-500 and HiSeq X Ten sequence data at positions with a sequencing read depth of ≥10 reads. This resulted in 525,029 and 521,040 SNPs from the SNP array being compared with the BGI and Illumina sequence data respectively for patient 9869; 504,234 and 504,352 SNPs for patient 11202; and 503,527 and 512,213 SNPs for patient 11398. The chromosome position and genotype of each germline and somatic variant called from each sequence platform were used to compare and identify the SNVs and indels which were only detected in either the BGISEQ-500 or HiSeq X Ten datasets. A sequence pileup to count the bases present at each discordant position was performed to reveal any evidence of the variant at each locus. Quality filtering was also employed during the pileup analysis to ensure that only non-duplicate reads that contained a minimum of 35 matched bases as reported in the CIGAR string and 3 or fewer mismatches in the sequencing MD field were counted.
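The read-level quality filter used during the pileup can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' code: matched bases are taken as the sum of all `M` and `=` operations in the CIGAR string, mismatches are counted as the letters in the MD tag that are not part of a `^` deletion, and duplicates are recognised by the standard SAM flag bit 0x400. The function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MD_DELETION = re.compile(r"\^[A-Z]+")   # deleted reference bases in an MD tag
DUPLICATE_FLAG = 0x400                  # SAM flag bit set by duplicate marking

MIN_MATCHED_BASES = 35
MAX_MISMATCHES = 3


def matched_bases(cigar: str) -> int:
    """Sum the lengths of aligned (M) and sequence-match (=) operations."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op in "M=")


def md_mismatches(md: str) -> int:
    """Count mismatched bases in an MD tag, ignoring deletions."""
    without_deletions = MD_DELETION.sub("", md)
    return sum(ch.isalpha() for ch in without_deletions)


def read_counts_in_pileup(flag: int, cigar: str, md: str) -> bool:
    """Apply the duplicate, matched-base and mismatch filters to one read."""
    if flag & DUPLICATE_FLAG:
        return False
    return (matched_bases(cigar) >= MIN_MATCHED_BASES
            and md_mismatches(md) <= MAX_MISMATCHES)
```

Under these assumptions a 50 bp read with 20 soft-clipped bases (`20S30M`) is excluded because only 30 bases are matched, while a 150 bp read can lose far more bases to clipping before it is dropped.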

Results

Whole genome sequencing coverage

The average non-duplicate sequencing read depth achieved by the BGISEQ-500 (50 base pair read length) and HiSeq X Ten (150 base pair read length) platforms was similar both before and after filtering by alignment quality (Fig 1). In the BGISEQ-500 data the average post-quality filtering read depth was 28X (range 24-33X) in the normal and 50X (range 41-56X) in the tumour samples and in the HiSeq X Ten data 29X (range 27-30X) in the normal and 58X (range 57-61X) in the tumour samples.

Fig 1

Average genome read depth using BGISEQ-500 and HiSeq X Ten data.

The average whole-genome sequencing read depth for each platform (blue BGISEQ-500, yellow HiSeq X Ten), for each tumour (T) and normal (N) sample, is displayed for three mesothelioma patients (9869, 11202 and 11398). Prior to variant calling, sequence reads underwent quality filtering, and the subsequent average read depth remained similar between sequencing platforms. This is a more relevant measure of read depth as it represents the ‘usable’ portion of the sequencing data for detecting variants. The average quality-filtered sequencing read depth is indicated by the shaded bar.

Germline SNV and indel variants detected by each platform

The sequence data generated on the BGISEQ-500 and the HiSeq X Ten platforms showed a >99% concordance with the genotypes obtained from the Illumina SNP arrays (Table 1), indicating that both platforms were able to accurately detect common germline SNV assayed by the SNP arrays.

Table 1

The percent concordance of germline genotypes ascertained by SNP arrays compared to the BGISEQ-500 and HiSeq X Ten data.

Patient	SNP array vs BGISEQ-500	SNP array vs HiSeq X Ten
9869	99.797	99.789
11202	99.794	99.794
11398	99.797	99.795
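As a rough illustration of how a concordance figure of this kind can be computed, the Python sketch below compares array genotypes with sequence genotypes at sites covered by at least 10 reads. The data structures, the function name and the decision to skip sites without a sequence genotype are assumptions for illustration; the study's own comparison code is not described at this level of detail.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def genotype_concordance(array_gt: Dict[Site, str],
                         seq_gt: Dict[Site, str],
                         seq_depth: Dict[Site, int],
                         min_depth: int = 10) -> Tuple[int, float]:
    """Return (sites compared, percent concordant) for array vs sequence genotypes."""
    compared = 0
    matching = 0
    for site, array_call in array_gt.items():
        # Only compare sites with sufficient sequencing depth and a sequence genotype.
        if seq_depth.get(site, 0) < min_depth or site not in seq_gt:
            continue
        compared += 1
        # Allele order is ignored, so "AG" and "GA" count as the same genotype.
        if sorted(array_call) == sorted(seq_gt[site]):
            matching += 1
    percent = 100.0 * matching / compared if compared else float("nan")
    return compared, percent


# Toy data: one site is excluded for low depth, one of the remaining two disagrees.
array_calls = {("1", 100): "AG", ("1", 200): "CC", ("2", 300): "TT"}
sequence_calls = {("1", 100): "GA", ("1", 200): "CT", ("2", 300): "TT"}
depths = {("1", 100): 30, ("1", 200): 12, ("2", 300): 5}
print(genotype_concordance(array_calls, sequence_calls, depths))
```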

A summary of the number of germline and somatic SNV and indels identified with the BGISEQ-500 and HiSeq X Ten sequencing platforms is provided in Table 2. Across the genome the BGISEQ-500 and HiSeq X Ten platforms called an average of 3,562,321 germline SNV in each patient (representing 3,508,123; 3,586,280; and 3,592,559 germline SNV in patients 9869, 11202 and 11398 respectively). The majority of the germline SNV (86%) were identified in both sequencing platforms (Fig 2a). However, across the 3 patients there were a total of 1,042,608 SNV which were only called by the HiSeq X Ten analyses, comprising 8.9%, 9.0% and 11.4% of the SNV identified in the 3 patient samples (patients 9869, 11202 and 11398 respectively). There were fewer calls unique to the BGISEQ-500 (371,514 SNV), which represented 4.6%, 3.3% and 2.6% of the SNV in the 3 patient samples (patients 9869, 11202 and 11398 respectively). An average of 233,107 germline indels were called in each patient (representing 233,527; 232,620 and 233,174 germline indels in patients 9869, 11202 and 11398 respectively) (Fig 2b). The majority of these indels (81.5%) were identified by both of the sequencing platforms, with 15.7% called in the HiSeq X Ten only (representing 109,876 indels) and 2.8% (19,745 indels) called in the BGISEQ-500 only.

Table 2

Number of germline and somatic variants identified in three mesothelioma samples using whole genome sequencing.

The percentage of the germline variants identified in this study and reported in European population data from gnomAD are presented in brackets.

		SNV				Indels
		9869	11202	11398	All Patients	9869	11202	11398	All Patients
Germline	Identified in both platforms	3,033,980	3,146,317	3,092,543	9,272,840	193,359	190,436	185,905	569,700
		(96.8%)	(96.8%)	(96.8%)	(96.8%)	(91.7%)	(91.8%)	(92%)	(91.8%)
	HiSeq X Ten only	313,015	321,627	407,966	1,042,608	33,143	35,253	41,480	109,876
		(42.3%)	(42.3%)	(41.9%)	(42.1%)	(58.5%)	(58.4%)	(59.2%)	(58.7%)
	BGISEQ-500 only	161,128	118,336	92,050	371,514	7,025	6,931	5,789	19,745
		(4%)	(2.4%)	(4.1%)	(3.55%)	(13.8%)	(13.8%)	(11.6%)	(13.1%)
	Total	3,508,123	3,586,280	3,592,559	10,686,962	233,527	232,620	233,174	699,321
Somatic	Identified in both platforms	3,554	2,342	1,955	7,851	197	168	114	479
	HiSeq X Ten only	697	424	411	1,532	135	93	78	306
	BGISEQ-500 only	540	474	493	1,507	102	156	229	487
	Total	4,791	3,240	2,859	10,890	434	417	421	1,272
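The shared and platform-specific proportions quoted in the text follow directly from the “All Patients” columns of Table 2. The short Python sketch below recomputes them; the function and dictionary names are illustrative, and only the counts come from the table.

```python
from typing import Dict


def platform_shares(both: int, hiseq_only: int, bgiseq_only: int) -> Dict[str, float]:
    """Percentage of all calls found in both platforms or in one platform only."""
    total = both + hiseq_only + bgiseq_only
    if total == 0:
        raise ValueError("no variants to compare")
    return {
        "both": 100.0 * both / total,
        "hiseq_only": 100.0 * hiseq_only / total,
        "bgiseq_only": 100.0 * bgiseq_only / total,
    }


# Counts from the "All Patients" columns of Table 2:
# (identified in both platforms, HiSeq X Ten only, BGISEQ-500 only)
TABLE2_ALL_PATIENTS = {
    "germline SNV": (9_272_840, 1_042_608, 371_514),
    "germline indel": (569_700, 109_876, 19_745),
    "somatic SNV": (7_851, 1_532, 1_507),
    "somatic indel": (479, 306, 487),
}

for name, counts in TABLE2_ALL_PATIENTS.items():
    shares = platform_shares(*counts)
    print(name, {key: round(value, 1) for key, value in shares.items()})
```

Using the total of all calls as the denominator matches the way the percentages are reported in the text, so each set of three shares sums to 100%.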

Fig 2

Germline variants identified in three mesothelioma samples (patients: 9869, 11202 and 11398) using BGISEQ-500 and HiSeq X Ten data.

The number of germline SNV (a) and indels (b) identified in each patient using the BGISEQ-500 and HiSeq X Ten platforms. We investigated germline SNV (c) and indels (d) which were only called in one platform and that fall into three categories: i) identified as germline in the other platform but with low evidence; ii) identified in the other platform but predicted as a somatic variant; or iii) not identified in the other platform. Across the 3 patients only 197,434 (1.85%) SNVs were truly unique to the HiSeq X Ten and not identified in the BGISEQ-500 (c). Similarly in the BGISEQ-500 platform only 38,236 SNVs (0.36% of the total) were truly unique to the BGISEQ-500, not called in the HiSeq X Ten data (c). The same pattern was observed for indels (d), only 3.23% were unique to HiSeq X Ten and 0.19% to BGISEQ-500.

Discordant germline SNV and indels between the different sequencing platforms

A proportion of SNVs and indels that were called germline in only one platform were either: i) identified as low evidence germline in the other platform; ii) identified in the other platform but predicted as a somatic variant; or iii) not identified in the other platform (Table 2, Fig 2c and 2d). Of the 10,686,962 SNVs called across the 3 data sets, 1,042,608 (9.76%) were called germline in the HiSeq X Ten platform only. Of the total SNVs, 7.1% (760,482) were identified as low evidence in the BGISEQ-500 data; 0.79% (84,692) were identified in the BGISEQ-500 data but predicted as somatic, which suggests that the alternate allele was not sequenced in the normal due to low coverage or sampling; and only 1.85% (197,434) were uniquely identified in the HiSeq X Ten (Fig 2c). The same pattern was observed for the BGISEQ-500: 3.48% of the SNVs (371,514) were called only in this platform, with 3.01% (321,937) identified as germline low evidence in the HiSeq X Ten data; 0.11% (11,341) predicted as somatic in the HiSeq X Ten data; and only 0.36% (38,236) uniquely identified in the BGISEQ-500 (Fig 2c). Similar to the SNV calls, the majority of discordant indel variants were actually detected, but as low evidence, in the other platform (Fig 2d). Of the total 699,321 indels identified, 15.71% (109,876) were identified in the HiSeq X Ten platform only. Of the total indels, 11.72% (81,935; 75% of the HiSeq X Ten only calls) were also identified as low evidence germline in the BGISEQ-500 data; 0.77% (5,364) were identified as somatic in the BGISEQ-500 data; and 3.23% (22,577) remained uniquely identified in the HiSeq X Ten (Fig 2d). Similarly, of the 19,745 indels (2.82% of the total) that were called only using the BGISEQ-500 platform, 18,268 (92% of these calls; 2.61% of the total) were identified in the HiSeq X Ten data but as low evidence; 0.03% of the total (175) were identified as somatic in the HiSeq X Ten data; and 0.19% (1,302) were uniquely identified in the BGISEQ-500 (Fig 2d).
To determine why a small proportion of the total germline calls across all patients were unique to each platform (0.36 and 1.85% of SNV and 0.19 and 3.23% of indels in the BGISEQ-500 and HiSeq X Ten respectively), an analysis of the read depth at the position of each variant was performed. Variants unique to the BGISEQ-500 data (38,236 SNV and 1,302 indels) were generally covered at a reasonable depth in the HiSeq X Ten data but no evidence for the variant was detected (Fig 3a). Such variants may not have been seen in the HiSeq X Ten data due to biases in the sampling of the variant allele. Alternatively, mapping errors affecting the shorter reads in the BGISEQ-500 may have led to artefact calls in regions that are difficult to map, whereas the corresponding reads were removed from the HiSeq X Ten data by the >3 mismatches filter. Overall these variants, which are unique to the BGISEQ-500, represent a small number of the total germline SNV (38,236 of 10,686,962 or 0.36%) and indels (1,302 of 699,321 or 0.19%) identified. In contrast, the majority (68%) of the 197,434 SNVs and 33% of the 22,577 indels that were unique to the HiSeq X Ten and not identified using the BGISEQ-500 were due to low sequence coverage across the variant positions (<8 reads in the normal) (Fig 3b). This may be due to random sampling during sequencing, or because these regions in the genome are more problematic to sequence using the 50 bp paired-end read lengths in the BGISEQ-500 data.
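The three-way categorisation of platform-specific germline calls used above can be expressed as a simple lookup against the other platform's calls. The Python sketch below is illustrative only: the key layout (chromosome, position, genotype), the `(origin, evidence)` tuples and the category labels are assumptions, not the data model of the study's pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

VariantKey = Tuple[str, int, str]   # (chromosome, position, genotype)
Call = Tuple[str, str]              # (origin, evidence), e.g. ("germline", "low evidence")


def categorise(key: VariantKey, other_platform: Dict[VariantKey, Call]) -> str:
    """Classify a germline call from one platform by its status in the other platform."""
    call = other_platform.get(key)
    if call is None:
        return "not identified"
    origin, evidence = call
    if origin == "somatic":
        return "somatic in other platform"
    if evidence == "low evidence":
        return "low evidence germline"
    return "called germline"  # passes filters in both platforms, so not platform-specific


def summarise(keys: Iterable[VariantKey],
              other_platform: Dict[VariantKey, Call]) -> Counter:
    """Count how many calls fall into each category."""
    return Counter(categorise(key, other_platform) for key in keys)


# Toy example: three calls made in one platform, looked up in the other.
other = {
    ("1", 1000, "A/G"): ("germline", "low evidence"),
    ("2", 2000, "C/T"): ("somatic", "called"),
}
calls = [("1", 1000, "A/G"), ("2", 2000, "C/T"), ("3", 3000, "G/G")]
print(summarise(calls, other))
```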

Fig 3

The sequence coverage of germline variants and the length of the indels which were identified in one sequence platform.

Read depth in the HiSeq X Ten data for variants unique to the BGISEQ-500 (a) and read depth in the BGISEQ-500 data for variants unique to the HiSeq X Ten (b). The distribution of the length (number of bases) of the indels that were identified in both sequencing platforms or unique to the HiSeq X Ten or BGISEQ-500 data is plotted (c).

As an in-silico validation of germline calls we used the genome Aggregation Database (gnomAD) [16] to determine the occurrence of variants in the general population. The percentages of germline SNVs and indels present in the European population in gnomAD are included in Table 2. A total of 96.8% of the 9,272,840 SNVs called by both platforms have been reported in gnomAD. As expected, the private variants in each platform have a much smaller representation in gnomAD. However, these variants are a small fraction of the total germline calls (3.48 and 9.76% of SNVs and 2.8 and 15.7% of indels for BGISEQ-500 and HiSeq X Ten, respectively). The size of the indels which were identified only in the HiSeq X Ten or BGISEQ-500 platform differed. The frequency of detected indels between 1 and 8 bp in length was similar between the platforms, but the HiSeq X Ten data was able to detect a higher number of indels >8 bp long (Fig 3c). This may be due to the longer read length (150 bp paired-end) used in the HiSeq X Ten, as opposed to the 50 bp used with the BGISEQ-500, as longer reads can align across larger indels more effectively. However, a local realignment methodology may aid detection of longer indels in the shorter reads.

Somatic SNV and indel variants detected by the different platforms

A total of 10,890 somatic SNV were called using the HiSeq X Ten and BGISEQ-500 platforms across all three patients (representing 4,791; 3,240 and 2,859 somatic SNV in patients 9869, 11202 and 11398 respectively). The majority of the somatic SNV (72%) were identified in both sequencing platforms, while 14% of the somatic SNVs were only called in the HiSeq X Ten data and 14% only called in the BGISEQ-500 data (Fig 4a). An average of 424 somatic indels were called using the HiSeq X Ten and BGISEQ-500 platforms in each patient (representing 434; 417 and 421 somatic indels in patients 9869, 11202 and 11398 respectively) (Fig 4b). Interestingly, only 38% of the indels were identified by both sequencing platforms, while 24% were only called in the HiSeq X Ten and 38% only called in the BGISEQ-500. The high proportion of discordant somatic indel calls is not completely unexpected, as previous benchmarking studies have also found a higher discordance rate for somatic indels than for SNVs [20]. In total 156 of the somatic mutations (141 SNV and 15 indels) were located in gene coding regions. Of these, 109 coding mutations (70%) were identified in both sequencing platforms and included mutations in the known mesothelioma driver gene, BAP1 [21], while 20 mutations (13%) and 27 mutations (17%) were only called in the BGISEQ-500 and HiSeq X Ten data respectively (S1 Fig).

Fig 4

Somatic variants in mesothelioma patients identified using BGISEQ-500 and HiSeq X Ten data.

A summary of the somatic variants identified in 3 mesothelioma patient samples (patient ID: 9869, 11202 and 11398) using different sequencing platforms. The number of somatic SNV (a) and indels (b) identified using the BGISEQ-500 and HiSeq X Ten platforms in each patient. The somatic SNV (c) and indels (d) which were only called in one platform fall into three categories: i) identified as somatic in the other platform but with low evidence; ii) identified in the other platform but predicted as a germline variant; or iii) not identified in the other platform.

Discordant somatic SNV and indel variants between the different platforms

Similar to the germline analysis, the somatic SNV and indel variants which were called in only one platform fell into three categories: i) identified as somatic in the other platform but as low evidence; ii) identified in the other platform but predicted as a germline variant; or iii) not identified in the other platform (Fig 4c and 4d). However, compared to the germline calls, there was a higher proportion of SNV and indel variants which were unique to each platform. In addition, the somatic SNV and indels called in the BGISEQ-500 data contained a higher proportion of events which were identified as germline in the HiSeq X Ten platform (Fig 4c and 4d), which is likely due to biases against the variant allele in the normal sequencing data from the BGISEQ-500.

Discussion

We sequenced three cancer and matched normal DNA pairs from mesothelioma patients using the BGISEQ-500 and HiSeq X Ten sequencing platforms. A comparison of the germline and somatic SNVs and indels detected using the BGISEQ-500 to those identified using the HiSeq X Ten platform revealed that the majority of variants were identified by both sequencing platforms. The three mesothelioma genomes are typical of that disease. They carry 0.85–1.52 somatic mutations per megabase, which is at the low end of the spectrum of mutation load across many different cancers [22]. The small proportion of variants called in one platform but not the other is due to multiple factors. One key factor contributing to differences between the platform variant calls is the difference in read length between the two platforms (50 bp in the BGISEQ-500 and 150 bp in the HiSeq X Ten). Read length affects the ability to call variants primarily through alignment bias and error, which are higher for short reads as there are fewer bases with which to uniquely align the read to the reference sequence. The effects of alignment bias are not evenly distributed across the genome but are higher in AT-rich regions associated with repetitive, typically non-coding DNA. The high concordance at known polymorphic SNP positions assessed by both the sequencing and array platforms is consistent with the selection of robust marker polymorphisms located within unique sequence regions. This suggests that alignment biases are much reduced at these selected sites. Read alignment was carried out using BWA-MEM, a development of the original Burrows-Wheeler Aligner algorithm that is specifically designed for read lengths of over 70 bp. It is reported that BWA-backtrack may perform better for reads shorter than 70 bp. Alignment of the shorter BGI reads may therefore have been penalised by BWA-MEM.
A further factor that may have contributed to the small discordance observed was the application of the same variant calling and analysis pipeline to both datasets. This pipeline was designed for use with long Illumina reads and may have penalised the analysis of the BGISEQ-500 data by requiring a minimum of 35 contiguous matched bases, and fewer than three mismatched bases within a read. This filtering step only removes reads failing these tests prior to variant calling with qSNP and it is not applied before processing with GATK HaplotypeCaller. This means short BGISEQ-500 reads with hard or soft clipping of >16 bases or those containing indels would not contribute to variant detection using qSNP. The second part of the filter requires fewer than 3 mismatches and is much more likely to penalise the longer Illumina reads. This would leave short, poorly aligned BGISEQ-500 reads in regions prone to high alignment bias that could contribute to low quality variant calls. To minimise the possibility of differences in the sample quality causing discordance, we supplied an aliquot of high molecular weight DNA from the same nucleic acid extraction for all three sample pairs to each of the sequencing centres. Random sampling of DNA molecules during the library preparation and sequencing process is a likely source of discordant calls in our data. This source of error was evident in the germline calls detected in the data from only one platform but as a somatic call or as a low evidence call in the other. The failure to pass calling thresholds in just one of the platforms for a true positive variant is most likely due to this sampling effect. Library preparation differed between the platforms, including the fragmentation processes, template size selection and cluster or DNA nanoball generation. These differences will introduce a degree of bias that could particularly affect somatic variant calling, where the tumour specific signal may be reduced as compared with the germline signal. These platform specific differences would likely persist in any comparison. Use of a bespoke analysis pipeline which better considers the shorter read lengths of the BGISEQ-500 data may reduce some discordant calls but could also lead to a different set of discordant calls.

Overall, the BGISEQ-500 and HiSeq X Ten sequencing platforms show a high concordance to germline genotypes ascertained from SNP arrays. Both sequencing platforms show a high concordance to each other in their ability to detect germline and somatic SNVs and indels.

S1 Fig

Protein coding mutations detected using BGISEQ-500 and HiSeq X Ten data.

A summary of the genes affected by the protein coding mutations which were identified in 3 mesothelioma samples (patient ID: 9869, 11202 and 11398). (TIF)

22 in total

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

Review 2. Coming of age: ten years of next-generation sequencing technologies.

Authors: Sara Goodwin; John D McPherson; W Richard McCombie
Journal: Nat Rev Genet Date: 2016-05-17 Impact factor: 53.242

3. Comparative transcriptome analysis of chemosensory genes in two sister leaf beetles provides insights into chemosensory speciation.

Authors: Bin Zhang; Wei Zhang; Rui-E Nie; Wen-Zhu Li; Kari A Segraves; Xing-Ke Yang; Huai-Jun Xue
Journal: Insect Biochem Mol Biol Date: 2016-11-09 Impact factor: 4.714

4. A reference human genome dataset of the BGISEQ-500 sequencer.

Authors: Jie Huang; Xinming Liang; Yuankai Xuan; Chunyu Geng; Yuxiang Li; Haorong Lu; Shoufang Qu; Xianglin Mei; Hongbo Chen; Ting Yu; Nan Sun; Junhua Rao; Jiahao Wang; Wenwei Zhang; Ying Chen; Sha Liao; Hui Jiang; Xin Liu; Zhaopeng Yang; Feng Mu; Shangxian Gao
Journal: Gigascience Date: 2017-05-01 Impact factor: 6.524

5. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

6. International network of cancer genome projects.

Authors: Thomas J Hudson; Warwick Anderson; Axel Artez; Anna D Barker; Cindy Bell; Rosa R Bernabé; M K Bhan; Fabien Calvo; Iiro Eerola; Daniela S Gerhard; Alan Guttmacher; Mark Guyer; Fiona M Hemsley; Jennifer L Jennings; David Kerr; Peter Klatt; Patrik Kolar; Jun Kusada; David P Lane; Frank Laplace; Lu Youyong; Gerd Nettekoven; Brad Ozenberger; Jane Peterson; T S Rao; Jacques Remacle; Alan J Schafer; Tatsuhiro Shibata; Michael R Stratton; Joseph G Vockley; Koichi Watanabe; Huanming Yang; Matthew M F Yuen; Bartha M Knoppers; Martin Bobrow; Anne Cambon-Thomsen; Lynn G Dressler; Stephanie O M Dyke; Yann Joly; Kazuto Kato; Karen L Kennedy; Pilar Nicolás; Michael J Parker; Emmanuelle Rial-Sebbag; Carlos M Romeo-Casabona; Kenna M Shaw; Susan Wallace; Georgia L Wiesner; Nikolajs Zeps; Peter Lichter; Andrew V Biankin; Christian Chabannon; Lynda Chin; Bruno Clément; Enrique de Alava; Françoise Degos; Martin L Ferguson; Peter Geary; D Neil Hayes; Thomas J Hudson; Amber L Johns; Arek Kasprzyk; Hidewaki Nakagawa; Robert Penny; Miguel A Piris; Rajiv Sarin; Aldo Scarpa; Tatsuhiro Shibata; Marc van de Vijver; P Andrew Futreal; Hiroyuki Aburatani; Mónica Bayés; David D L Botwell; Peter J Campbell; Xavier Estivill; Daniela S Gerhard; Sean M Grimmond; Ivo Gut; Martin Hirst; Carlos López-Otín; Partha Majumder; Marco Marra; John D McPherson; Hidewaki Nakagawa; Zemin Ning; Xose S Puente; Yijun Ruan; Tatsuhiro Shibata; Michael R Stratton; Hendrik G Stunnenberg; Harold Swerdlow; Victor E Velculescu; Richard K Wilson; Hong H Xue; Liu Yang; Paul T Spellman; Gary D Bader; Paul C Boutros; Peter J Campbell; Paul Flicek; Gad Getz; Roderic Guigó; Guangwu Guo; David Haussler; Simon Heath; Tim J Hubbard; Tao Jiang; Steven M Jones; Qibin Li; Nuria López-Bigas; Ruibang Luo; Lakshmi Muthuswamy; B F Francis Ouellette; John V Pearson; Xose S Puente; Victor Quesada; Benjamin J Raphael; Chris Sander; Tatsuhiro Shibata; Terence P Speed; Lincoln D Stein; Joshua M Stuart; Jon W Teague; Yasushi Totoki; 
Tatsuhiko Tsunoda; Alfonso Valencia; David A Wheeler; Honglong Wu; Shancen Zhao; Guangyu Zhou; Lincoln D Stein; Roderic Guigó; Tim J Hubbard; Yann Joly; Steven M Jones; Arek Kasprzyk; Mark Lathrop; Nuria López-Bigas; B F Francis Ouellette; Paul T Spellman; Jon W Teague; Gilles Thomas; Alfonso Valencia; Teruhiko Yoshida; Karen L Kennedy; Myles Axton; Stephanie O M Dyke; P Andrew Futreal; Daniela S Gerhard; Chris Gunter; Mark Guyer; Thomas J Hudson; John D McPherson; Linda J Miller; Brad Ozenberger; Kenna M Shaw; Arek Kasprzyk; Lincoln D Stein; Junjun Zhang; Syed A Haider; Jianxin Wang; Christina K Yung; Anthony Cros; Anthony Cross; Yong Liang; Saravanamuttu Gnaneshan; Jonathan Guberman; Jack Hsu; Martin Bobrow; Don R C Chalmers; Karl W Hasel; Yann Joly; Terry S H Kaan; Karen L Kennedy; Bartha M Knoppers; William W Lowrance; Tohru Masui; Pilar Nicolás; Emmanuelle Rial-Sebbag; Laura Lyman Rodriguez; Catherine Vergely; Teruhiko Yoshida; Sean M Grimmond; Andrew V Biankin; David D L Bowtell; Nicole Cloonan; Anna deFazio; James R Eshleman; Dariush Etemadmoghadam; Brooke B Gardiner; Brooke A Gardiner; James G Kench; Aldo Scarpa; Robert L Sutherland; Margaret A Tempero; Nicola J Waddell; Peter J Wilson; John D McPherson; Steve Gallinger; Ming-Sound Tsao; Patricia A Shaw; Gloria M Petersen; Debabrata Mukhopadhyay; Lynda Chin; Ronald A DePinho; Sarah Thayer; Lakshmi Muthuswamy; Kamran Shazand; Timothy Beck; Michelle Sam; Lee Timms; Vanessa Ballin; Youyong Lu; Jiafu Ji; Xiuqing Zhang; Feng Chen; Xueda Hu; Guangyu Zhou; Qi Yang; Geng Tian; Lianhai Zhang; Xiaofang Xing; Xianghong Li; Zhenggang Zhu; Yingyan Yu; Jun Yu; Huanming Yang; Mark Lathrop; Jörg Tost; Paul Brennan; Ivana Holcatova; David Zaridze; Alvis Brazma; Lars Egevard; Egor Prokhortchouk; Rosamonde Elizabeth Banks; Mathias Uhlén; Anne Cambon-Thomsen; Juris Viksna; Fredrik Ponten; Konstantin Skryabin; Michael R Stratton; P Andrew Futreal; Ewan Birney; Ake Borg; Anne-Lise Børresen-Dale; Carlos Caldas; John A Foekens; 
Sancha Martin; Jorge S Reis-Filho; Andrea L Richardson; Christos Sotiriou; Hendrik G Stunnenberg; Giles Thoms; Marc van de Vijver; Laura van't Veer; Fabien Calvo; Daniel Birnbaum; Hélène Blanche; Pascal Boucher; Sandrine Boyault; Christian Chabannon; Ivo Gut; Jocelyne D Masson-Jacquemier; Mark Lathrop; Iris Pauporté; Xavier Pivot; Anne Vincent-Salomon; Eric Tabone; Charles Theillet; Gilles Thomas; Jörg Tost; Isabelle Treilleux; Fabien Calvo; Paulette Bioulac-Sage; Bruno Clément; Thomas Decaens; Françoise Degos; Dominique Franco; Ivo Gut; Marta Gut; Simon Heath; Mark Lathrop; Didier Samuel; Gilles Thomas; Jessica Zucman-Rossi; Peter Lichter; Roland Eils; Benedikt Brors; Jan O Korbel; Andrey Korshunov; Pablo Landgraf; Hans Lehrach; Stefan Pfister; Bernhard Radlwimmer; Guido Reifenberger; Michael D Taylor; Christof von Kalle; Partha P Majumder; Rajiv Sarin; T S Rao; M K Bhan; Aldo Scarpa; Paolo Pederzoli; Rita A Lawlor; Massimo Delledonne; Alberto Bardelli; Andrew V Biankin; Sean M Grimmond; Thomas Gress; David Klimstra; Giuseppe Zamboni; Tatsuhiro Shibata; Yusuke Nakamura; Hidewaki Nakagawa; Jun Kusada; Tatsuhiko Tsunoda; Satoru Miyano; Hiroyuki Aburatani; Kazuto Kato; Akihiro Fujimoto; Teruhiko Yoshida; Elias Campo; Carlos López-Otín; Xavier Estivill; Roderic Guigó; Silvia de Sanjosé; Miguel A Piris; Emili Montserrat; Marcos González-Díaz; Xose S Puente; Pedro Jares; Alfonso Valencia; Heinz Himmelbauer; Heinz Himmelbaue; Victor Quesada; Silvia Bea; Michael R Stratton; P Andrew Futreal; Peter J Campbell; Anne Vincent-Salomon; Andrea L Richardson; Jorge S Reis-Filho; Marc van de Vijver; Gilles Thomas; Jocelyne D Masson-Jacquemier; Samuel Aparicio; Ake Borg; Anne-Lise Børresen-Dale; Carlos Caldas; John A Foekens; Hendrik G Stunnenberg; Laura van't Veer; Douglas F Easton; Paul T Spellman; Sancha Martin; Anna D Barker; Lynda Chin; Francis S Collins; Carolyn C Compton; Martin L Ferguson; Daniela S Gerhard; Gad Getz; Chris Gunter; Alan Guttmacher; Mark Guyer; D Neil Hayes; 
Eric S Lander; Brad Ozenberger; Robert Penny; Jane Peterson; Chris Sander; Kenna M Shaw; Terence P Speed; Paul T Spellman; Joseph G Vockley; David A Wheeler; Richard K Wilson; Thomas J Hudson; Lynda Chin; Bartha M Knoppers; Eric S Lander; Peter Lichter; Lincoln D Stein; Michael R Stratton; Warwick Anderson; Anna D Barker; Cindy Bell; Martin Bobrow; Wylie Burke; Francis S Collins; Carolyn C Compton; Ronald A DePinho; Douglas F Easton; P Andrew Futreal; Daniela S Gerhard; Anthony R Green; Mark Guyer; Stanley R Hamilton; Tim J Hubbard; Olli P Kallioniemi; Karen L Kennedy; Timothy J Ley; Edison T Liu; Youyong Lu; Partha Majumder; Marco Marra; Brad Ozenberger; Jane Peterson; Alan J Schafer; Paul T Spellman; Hendrik G Stunnenberg; Brandon J Wainwright; Richard K Wilson; Huanming Yang
Journal: Nature Date: 2010-04-15 Impact factor: 49.962

7. cPAS-based sequencing on the BGISEQ-500 to explore small non-coding RNAs.

Authors: Tobias Fehlmann; Stefanie Reinheimer; Chunyu Geng; Xiaoshan Su; Snezana Drmanac; Andrei Alexeev; Chunyan Zhang; Christina Backes; Nicole Ludwig; Martin Hart; Dan An; Zhenzhen Zhu; Chongjun Xu; Ao Chen; Ming Ni; Jian Liu; Yuxiang Li; Matthew Poulter; Yongping Li; Cord Stähler; Radoje Drmanac; Xun Xu; Eckart Meese; Andreas Keller
Journal: Clin Epigenetics Date: 2016-11-21 Impact factor: 6.551

8. Analysis of protein-coding genetic variation in 60,706 humans.

Authors: Monkol Lek; Konrad J Karczewski; Eric V Minikel; Kaitlin E Samocha; Eric Banks; Timothy Fennell; Anne H O'Donnell-Luria; James S Ware; Andrew J Hill; Beryl B Cummings; Taru Tukiainen; Daniel P Birnbaum; Jack A Kosmicki; Laramie E Duncan; Karol Estrada; Fengmei Zhao; James Zou; Emma Pierce-Hoffman; Joanne Berghout; David N Cooper; Nicole Deflaux; Mark DePristo; Ron Do; Jason Flannick; Menachem Fromer; Laura Gauthier; Jackie Goldstein; Namrata Gupta; Daniel Howrigan; Adam Kiezun; Mitja I Kurki; Ami Levy Moonshine; Pradeep Natarajan; Lorena Orozco; Gina M Peloso; Ryan Poplin; Manuel A Rivas; Valentin Ruano-Rubio; Samuel A Rose; Douglas M Ruderfer; Khalid Shakir; Peter D Stenson; Christine Stevens; Brett P Thomas; Grace Tiao; Maria T Tusie-Luna; Ben Weisburd; Hong-Hee Won; Dongmei Yu; David M Altshuler; Diego Ardissino; Michael Boehnke; John Danesh; Stacey Donnelly; Roberto Elosua; Jose C Florez; Stacey B Gabriel; Gad Getz; Stephen J Glatt; Christina M Hultman; Sekar Kathiresan; Markku Laakso; Steven McCarroll; Mark I McCarthy; Dermot McGovern; Ruth McPherson; Benjamin M Neale; Aarno Palotie; Shaun M Purcell; Danish Saleheen; Jeremiah M Scharf; Pamela Sklar; Patrick F Sullivan; Jaakko Tuomilehto; Ming T Tsuang; Hugh C Watkins; James G Wilson; Mark J Daly; Daniel G MacArthur
Journal: Nature Date: 2016-08-18 Impact factor: 49.962

9. Deciphering signatures of mutational processes operative in human cancer.

Authors: Ludmil B Alexandrov; Serena Nik-Zainal; David C Wedge; Peter J Campbell; Michael R Stratton
Journal: Cell Rep Date: 2013-01-10 Impact factor: 9.423

10. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing.

Authors: Tyler S Alioto; Ivo Buchhalter; Sophia Derdak; Barbara Hutter; Matthew D Eldridge; Eivind Hovig; Lawrence E Heisler; Timothy A Beck; Jared T Simpson; Laurie Tonon; Anne-Sophie Sertier; Ann-Marie Patch; Natalie Jäger; Philip Ginsbach; Ruben Drews; Nagarajan Paramasivam; Rolf Kabbe; Sasithorn Chotewutmontri; Nicolle Diessl; Christopher Previti; Sabine Schmidt; Benedikt Brors; Lars Feuerbach; Michael Heinold; Susanne Gröbner; Andrey Korshunov; Patrick S Tarpey; Adam P Butler; Jonathan Hinton; David Jones; Andrew Menzies; Keiran Raine; Rebecca Shepherd; Lucy Stebbings; Jon W Teague; Paolo Ribeca; Francesc Castro Giner; Sergi Beltran; Emanuele Raineri; Marc Dabad; Simon C Heath; Marta Gut; Robert E Denroche; Nicholas J Harding; Takafumi N Yamaguchi; Akihiro Fujimoto; Hidewaki Nakagawa; Víctor Quesada; Rafael Valdés-Mas; Sigve Nakken; Daniel Vodák; Lawrence Bower; Andrew G Lynch; Charlotte L Anderson; Nicola Waddell; John V Pearson; Sean M Grimmond; Myron Peto; Paul Spellman; Minghui He; Cyriac Kandoth; Semin Lee; John Zhang; Louis Létourneau; Singer Ma; Sahil Seth; David Torrents; Liu Xi; David A Wheeler; Carlos López-Otín; Elías Campo; Peter J Campbell; Paul C Boutros; Xose S Puente; Daniela S Gerhard; Stefan M Pfister; John D McPherson; Thomas J Hudson; Matthias Schlesner; Peter Lichter; Roland Eils; David T W Jones; Ivo G Gut
Journal: Nat Commun Date: 2015-12-09 Impact factor: 14.919

Cited by: 19 in total

1. Mutation of chromatin regulators and focal hotspot alterations characterize human papillomavirus-positive oropharyngeal squamous cell carcinoma.

Authors: Sunny Haft; Shuling Ren; Guorong Xu; Adam Mark; Kathleen Fisch; Theresa W Guo; Zubair Khan; John Pang; Mizuo Ando; Chao Liu; Akihiro Sakai; Takahito Fukusumi; Joseph A Califano
Journal: Cancer Date: 2019-04-01 Impact factor: 6.860

2. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing.

Authors: Bohu Pan; Luyao Ren; Vitor Onuchic; Meijian Guan; Rebecca Kusko; Steve Bruinsma; Len Trigg; Andreas Scherer; Baitang Ning; Chaoyang Zhang; Christine Glidewell-Kenney; Chunlin Xiao; Eric Donaldson; Fritz J Sedlazeck; Gary Schroth; Gokhan Yavas; Haiying Grunenwald; Haodong Chen; Heather Meinholz; Joe Meehan; Jing Wang; Jingcheng Yang; Jonathan Foox; Jun Shang; Kelci Miclaus; Lianhua Dong; Leming Shi; Marghoob Mohiyuddin; Mehdi Pirooznia; Ping Gong; Rooz Golshani; Russ Wolfinger; Samir Lababidi; Sayed Mohammad Ebrahim Sahraeian; Steve Sherry; Tao Han; Tao Chen; Tieliu Shi; Wanwan Hou; Weigong Ge; Wen Zou; Wenjing Guo; Wenjun Bao; Wenzhong Xiao; Xiaohui Fan; Yoichi Gondo; Ying Yu; Yongmei Zhao; Zhenqiang Su; Zhichao Liu; Weida Tong; Wenming Xiao; Justin M Zook; Yuanting Zheng; Huixiao Hong
Journal: Genome Biol Date: 2022-01-03 Impact factor: 13.583

3. Extrachromosomal DNA in HPV-Mediated Oropharyngeal Cancer Drives Diverse Oncogene Transcription.

Authors: John Pang; Nam Nguyen; Jens Luebeck; Vineet Bafna; Joseph Califano; Laurel Ball; Andrey Finegersh; Shuling Ren; Takuya Nakagawa; Mitchell Flagg; Sayed Sadat; Paul S Mischel; Guorong Xu; Kathleen Fisch; Theresa Guo; Gabrielle Cahill; Bharat Panuganti
Journal: Clin Cancer Res Date: 2021-09-21 Impact factor: 13.801

4. Draft Genome Sequences of the Kocuria subflava Type Strain KCTC 39547 and Kocuria sp. Strain JC486, a Newly Isolated Strain from a Wild Ass Sanctuary in Gujarat, India.

Authors: Jagadeeshwari Uppada; Sasikala Chintalapati; Karthika K; Venkata Ramana Chintapati
Journal: Microbiol Resour Announc Date: 2022-09-01

5. Formation of Blood Neutrophil Extracellular Traps Increases the Mastitis Risk of Dairy Cows During the Transition Period.

Authors: Lu-Yi Jiang; Hui-Zeng Sun; Ruo-Wei Guan; Fushan Shi; Feng-Qi Zhao; Jian-Xin Liu
Journal: Front Immunol Date: 2022-04-27 Impact factor: 8.786

6. Genomics and Epigenetics of Malignant Mesothelioma. (Review)

Authors: Adam P Sage; Victor D Martinez; Brenda C Minatel; Michelle E Pewarchuk; Erin A Marshall; Gavin M MacAulay; Roland Hubaux; Dustin D Pearson; Aaron A Goodarzi; Graham Dellaire; Wan L Lam
Journal: High Throughput Date: 2018-07-27

7. The application of NIPT using combinatorial probe-anchor synthesis to identify sex chromosomal aneuploidies (SCAs) in a cohort of 570 pregnancies.

Authors: Hongge Li; Yu Lei; Hui Zhu; Yuqin Luo; Yeqing Qian; Min Chen; Yixi Sun; Kai Yan; Yanmei Yang; Bei Liu; Liya Wang; Yingzhi Huang; Junjie Hu; Jianyun Xu; Minyue Dong
Journal: Mol Cytogenet Date: 2018-12-03 Impact factor: 2.009

8. Genomic variants identified from whole-genome resequencing of indicine cattle breeds from Pakistan.

Authors: Naveed Iqbal; Xin Liu; Ting Yang; Ziheng Huang; Quratulain Hanif; Muhammad Asif; Qaiser Mahmood Khan; Shahid Mansoor
Journal: PLoS One Date: 2019-04-11 Impact factor: 3.240

9. Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers.

Authors: Jiayun Chen; Xingsong Li; Hongbin Zhong; Yuhuan Meng; Hongli Du
Journal: Sci Rep Date: 2019-06-27 Impact factor: 4.379

10. Molecular digitization of a botanical garden: high-depth whole-genome sequencing of 689 vascular plant species from the Ruili Botanical Garden.

Authors: Huan Liu; Jinpu Wei; Ting Yang; Weixue Mu; Bo Song; Tuo Yang; Yuan Fu; Xuebing Wang; Guohai Hu; Wangsheng Li; Hongcheng Zhou; Yue Chang; Xiaoli Chen; Hongyun Chen; Le Cheng; Xuefei He; Hechen Cai; Xianchu Cai; Mei Wang; Yang Li; Sunil Kumar Sahu; Jinlong Yang; Yu Wang; Ranchang Mu; Jie Liu; Jianming Zhao; Ziheng Huang; Xun Xu; Xin Liu
Journal: Gigascience Date: 2019-04-01 Impact factor: 6.524