Jens Friis-Nielsen, Kristín Rós Kjartansdóttir, Sarah Mollerup, Maria Asplund, Tobias Mourier, Randi Holm Jensen, Thomas Arn Hansen, Alba Rey-Iglesia, Stine Raith Richter, Ida Broman Nielsen, David E Alquezar-Planas, Pernille V S Olsen, Lasse Vinner, Helena Fridholm, Lars Peter Nielsen, Eske Willerslev, Thomas Sicheritz-Pontén, Ole Lund, Anders Johannes Hansen, Jose M G Izarzugaza, Søren Brunak.
Abstract
Virus discovery from high-throughput sequencing data often follows a bottom-up approach where taxonomic annotation takes place prior to association with disease. Albeit effective in some cases, the approach fails to detect novel pathogens and remote variants not present in reference databases. We have developed a species-independent pipeline that utilises sequence clustering for the identification of nucleotide sequences that co-occur across multiple sequencing data instances. We applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. Recurrent sequences were statistically associated with biological, methodological or technical features with the aim of identifying novel pathogens or plausible contaminants that may be associated with a particular kit or method. We provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. Unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space where novel pathogens can be identified.
Keywords: assay contamination; cancer causing viruses; next generation sequencing; novel sequence identification; oncoviruses; sequence clustering; taxonomic characterisation
Year: 2016 PMID: 26907326 PMCID: PMC4776208 DOI: 10.3390/v8020053
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Figure 1. Schematic representation of the bioinformatics pipeline used to process sequencing reads from all data sets. The ‘preprocessing’ step includes removal of adapter sequences, trimming of low-quality sequences, and merging of paired-end reads. Data sets progress in parallel until the ‘clustering’ step, where contigs from all data sets are pooled and grouped.
Figure 2. p-values of all significant associations. Rows describe features, with biological features in red, methodological in green and technical in blue. There are 73 features significantly associated with one or more clusters. Columns describe all significant associations of each of the 6165 unique clusters. The cluster identifiers have been excluded to avoid cluttering.
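Each cell of such a feature-by-cluster matrix corresponds to one test of association between a cluster and a feature across the sequencing libraries. The text here does not state which statistical test was used, so the following is only a minimal sketch that assumes a one-sided Fisher's exact test on a 2x2 contingency table. The function name and the example counts are hypothetical; only the total of 686 libraries comes from the abstract.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for over-representation (assumed test).

    2x2 table of library counts:
                        cluster present   cluster absent
      feature present          a                 b
      feature absent           c                 d

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    with_feature = a + b          # libraries that have the feature
    with_cluster = a + c          # libraries that contain the cluster
    total = a + b + c + d         # all libraries
    upper = min(with_feature, with_cluster)
    numerator = sum(
        comb(with_feature, k) * comb(total - with_feature, with_cluster - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, with_cluster)


# Hypothetical counts that sum to the 686 libraries mentioned in the abstract.
p_value = fisher_exact_greater(30, 10, 5, 641)
print(f"p = {p_value:.3g}")
```

With thousands of clusters tested against many features, a multiple-testing correction would normally be applied to the resulting p-values; that step is not shown here.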
Figure 3. Lowest p-values of clusters established by the pipeline. The p-values are arranged by the feature of the strongest significant association of each of the 6165 clusters. The 50 features involved as strongest associations have been coloured by type: biological (red), methodological (green), and technical (blue). The boxes span the first and third quartiles. The dark band inside each box represents the median. The whiskers of the boxes extend to the lowest and highest values within a distance of 1.5 times the interquartile range. As can be seen, most p-values were above 1e-24, but a few methodological features have associated clusters with very low p-values, such as f056, f068, f069, f076, f079, and f084. The library preparation kit ScriptSeq v2 RNA-Seq, Illumina (f084) displays strongly associated clusters with p-values as low as 3.04e-89 that mapped to the species Avian myeloblastosis-associated virus. Clusters that were annotated as the NCBI species Parvovirus NIH/CQV were associated with the laboratory kit RNeasy MinElute, Qiagen (f076) with a minimal p-value of 5.48e-38. Finally, a cluster annotated as Acanthocystis turfacea chlorella virus MN0810.1 (ATCV) was associated with DNase/RNase: Promega DNase stop solution (f069) with p-value = 4.19e-12.
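The box-and-whisker convention described in the caption can be sketched as follows. This is an illustration only: the quartile interpolation method (`method="inclusive"`) is an assumption, since the caption does not say how quartiles were computed, and the function name and sample values are hypothetical.

```python
from statistics import quantiles


def box_stats(values):
    """Summarise values the way a 1.5 x IQR box plot draws them."""
    # Quartile convention is an assumption; plotting tools differ slightly.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme data points inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Points beyond the fences are drawn individually.
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Hypothetical values, e.g. -log10(p) for the clusters of one feature.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

Under this convention, the very low p-values highlighted in the caption would fall beyond the whiskers and be plotted as individual points.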
Annotation of associations. The 6165 clusters were mapped using BLASTn and BLASTx. Rows describe the corresponding type of feature involved as the strongest association of each cluster.
| Feature type | BLASTn | BLASTx | Unmapped | Total |
|---|---|---|---|---|
| Biological | 593 | 5 | 4 | 602 |
| Methodological | 2662 | 1515 | 868 | 5045 |
| Technical | 298 | 110 | 110 | 518 |
| Total | 3553 | 1630 | 982 | 6165 |
Taxonomical characterisation of certain biologically associated clusters. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP. In cases where several clusters shared the annotated species, the lowest p-value of the associations is given. #sig: number of significant clusters.
| Feature | Cluster annotation (species) | #sig. | p-value | HMP body site |
|---|---|---|---|---|
| Colon cancer biopsy | | 2 | 2.43e-20 | Gastrointestinal tract |
| Colon cancer biopsy | | 3 | 1.60e-20 | Gastrointestinal tract |
| Colon cancer biopsy | | 1 | 2.92e-17 | Gastrointestinal tract |
| Colon cancer biopsy | | 1 | 1.34e-13 | Gastrointestinal tract |
| Oral cavity cancer | | 292 | 1.74e-24 | Oral |
| Oral cavity cancer | | 2 | 4.60e-23 | Oral |
| Oral cavity cancer | | 8 | 1.73e-21 | Oral |
| Oral cavity cancer | | 1 | 5.37e-16 | Oral |
| Oral cavity cancer | | 22 | 2.31e-15 | Oral |
| Oral cavity cancer | | 7 | 2.31e-15 | Oral |
| Oral cavity cancer | | 2 | 4.49e-14 | Oral |
| Oral cavity cancer | | 1 | 1.34e-13 | Oral |
| Oral cavity cancer | | 2 | 8.26e-13 | Oral |
| Oral cavity cancer | | 1 | 2.60e-12 | Oral |
| Oral cavity cancer | | 2 | 4.12e-11 | Oral |
| Oral cavity cancer | | 1 | 4.12e-11 | Oral |
| Oral cavity cancer | | 1 | 4.85e-11 | Oral |
| Vulva cancer | | 1 | 1.03e-12 | Urogenital tract |
Figure 4. Unmapped clusters. The clusters are placed by their strongest associated feature. Feature types are marked in colour as follows: biological (red), methodological (green), and technical (blue). Top: Number of clusters associated with each feature on a log-10 scaled axis. The feature DNase/RNase: Promega DNase stop solution (f069) has 648 associated clusters, and the feature Polymerases: Phusion HF, NEB (f086) has 1 associated cluster. Bottom: Base-pair length (bp) of all cluster representatives (longest contig of each cluster) on a log-10 scaled axis. The N50 of all unmapped cluster representatives is marked by a brown dot. The longest cluster representative is 33.6 kb with N50 = 617 bp.
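For readers unfamiliar with the N50 statistic quoted in the caption, the following minimal sketch uses the common definition: the length L such that sequences of length L or longer together contain at least half of all bases. The function name and the example lengths are hypothetical and are not taken from the study's data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty collection")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First length at which the cumulative sum covers half of all bases.
        if running >= half_total:
            return length


# Hypothetical cluster representative lengths in bp.
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))
```

Because N50 weights sequences by their length, a few long representatives can pull it well above the median length.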
Conserved domains of unmapped biological clusters. The cluster representatives of the four unmapped biologically associated clusters were manually searched for sequence similarities (BLASTn, BLASTx) and conserved domains (CDD) via the NCBI web interfaces. Cells containing a dash had no hits with an e-value < 0.001. Cluster representative: Length of the sequence. BLASTn and BLASTx: Organism name (accession) %-id / %-coverage. CDD: Domain name (accession).
| Cluster representative | BLASTn | BLASTx | CDD |
|---|---|---|---|
| 1789 bp | - | Prevotella veroralis (WP_026284690.1) 92% / 42% | - |
| 3246 bp | Prevotella fusca JCM 17724 (CP012075.1) 76% / 18% | Prevotella veroralis (WP_004384161.1) 90% / 56% | DUF4280 super family (cl16620); TauE super family (cl21514) |
| 4661 bp | Prevotella fusca JCM 17724 (CP012075.1) 91% / 87% | Prevotella fusca (WP_050696472.1) 85% / 66% | Peptidase_M23 (pfam01551); lysozyme_like super family (cl00222); DUF4280 (pfam14107); Fil_haemagg_2 (pfam13332); Phage_base_V (pfam04717) |
| 4720 bp | Eubacterium sulci ATCC 35585 (CP012068.1) 71% / 21% | Peptostreptococcus anaerobius CAG:621 (CCY47489.1) 72% / 36% | Acyl_transf_3 super family (cl21495); ND5 (MTH00095); rve (pfam00665); ND2 super family (cl10157) |