Ryan Connor, Rodney Brister, Jan P Buchmann, Ward Deboutte, Rob Edwards, Joan Martí-Carreras, Mike Tisza, Vadim Zalunin, Juan Andrade-Martínez, Adrian Cantu, Michael D'Amour, Alexandre Efremov, Lydia Fleischmann, Laura Forero-Junco, Sanzhima Garmaeva, Melissa Giluso, Cody Glickman, Margaret Henderson, Benjamin Kellman, David Kristensen, Carl Leubsdorf, Kyle Levi, Shane Levi, Suman Pakala, Vikas Peddu, Alise Ponsero, Eldred Ribeiro, Farrah Roy, Lindsay Rutter, Surya Saha, Migun Shakya, Ryan Shean, Matthew Miller, Benjamin Tully, Christopher Turkington, Ken Youens-Clark, Bert Vanmechelen, Ben Busby.
Abstract
A wealth of viral data sits untapped in publicly available metagenomic data sets, although it could be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprising over 40 participants from six countries assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset of 2953 SRA data sets (approximately 55 million contigs) was selected and further filtered for a minimum contig length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered, and assigned metadata. Of the 4.2 Mio contigs, 360,000 were labeled with domains, and an additional subset of 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds. Mainly: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) existing bioinformatic algorithms should be redesigned for a cloud infrastructure to facilitate their use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event.
Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
Keywords: SRA; STRIDES; cloud computing; hackathon; infrastructure; metagenomic; viruses
Year: 2019 PMID: 31527408 PMCID: PMC6771016 DOI: 10.3390/genes10090714
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Participant Demographics. Demographic summary of hackathon participants.
| | Raw | % Participants | % Responses |
|---|---|---|---|
| Participants | 37 | NA | NA |
| Survey Responses | 36 | 97.3 | NA |
| Institutional Affiliation | |||
| Academic | 27 | 72.97 | 75.00 |
| Government | 6 | 16.22 | 16.67 |
| Other | 2 | 5.41 | 5.56 |
| Unknown | 1 | 2.70 | 2.78 |
| Educational Attainment | |||
| Ph.D. | 14 | 37.84 | 38.89 |
| M.S. | 12 | 32.43 | 33.33 |
| B.S. | 2 | 5.41 | 5.56 |
| Unknown | 8 | 21.62 | 22.22 |
| Career Stage | |||
| In Training | 11 | 29.73 | 30.56 |
| Junior | 14 | 37.84 | 38.89 |
| Senior | 6 | 16.22 | 16.67 |
| Unknown | 5 | 13.51 | 13.89 |
| Programming Language | |||
| Shell | 33 | 89.19 | 91.67 |
| Python | 31 | 83.78 | 86.11 |
| R | 26 | 70.27 | 72.22 |
| Perl | 13 | 35.14 | 36.11 |
| Java | 10 | 27.03 | 27.78 |
| C/C++ | 9 | 24.32 | 25.00 |
| JavaScript | 4 | 10.81 | 11.11 |
| SQL | 3 | 8.11 | 8.33 |
| Matlab | 2 | 5.41 | 5.56 |
| Other | 4 | 10.81 | 11.11 |
Figure 1. Overview of hackathon teams and data processing. All numbers indicate the number of contigs processed at each step of the pipeline. A subset of ~3000 data sets was assembled, generating 55.5 million total contigs. Researchers attending the hackathon formed teams that roughly correspond to the goals outlined in the Methods and Results. Members of the “Knowns Team” excluded contigs based on size (removing those <1 kb in length), and the remaining ~4 million contigs were classified against known viruses using a BLASTN search against the RefSeq Virus database (Section 2.3 and Section 3.3). Independently, members of the “Phylogeny Clustering Team” clustered the ~4 million contigs using Markov clustering techniques (Section 2.4). Members of the “Metadata Team” used machine learning approaches to build training sets that could be used to correlate sequences to sample source metadata (Section 2.7 and Section 3.7). Members of the “Domain Team” predicted functional domains with RPSTBLASTN and the CDD database using ~360,000 contigs that were not classified using the RefSeq Virus database (Section 2.5 and Section 3.3). Members of the “Gene Finding Team” predicted open reading frames and putative viral-related genes using the modified VIGA pipeline on ~4400 putative viral contigs (Section 2.6 and Section 3.6). Members of the “Visualization Team” devised ways to display complex data, and the “Testing Team” assessed whether components of the pipeline were accessible to future users. Two additional teams were tasked with analyzing sequences that could not be confidently identified as cellular or virus-like with the methods described above (Section 3.5).
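The size filter described for the Knowns Team (dropping contigs shorter than 1 kb) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used at the hackathon: the function names `read_fasta` and `filter_by_length` are hypothetical, and the input is assumed to be plain FASTA text.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 1000  # 1 kb cutoff taken from the caption


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int = MIN_LENGTH
) -> List[Tuple[str, str]]:
    """Keep only records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]
```

Because `read_fasta` is a generator, it can be fed an open file handle directly, so only one contig is held in memory while it is being read.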
Figure 2. Taxonomic classification of contiguous assemblies. Contiguous assemblies were compared to Virus RefSeq via BLASTN. (A) >99% of contigs have no hits amongst RefSeq viruses. (B) Of those with hits against RefSeq viruses, the majority had close matches (‘known–knowns’). The remainder had either a weak match to a RefSeq sequence (‘known–unknowns’ 1) or a strong match to a RefSeq subsequence (‘known–unknowns’ 2). (C) Top RefSeq viruses matched by ‘known–known’ contigs, ranked by the number of contigs matching them. CrAssphage is the most abundant virus in the subset.
Figure 3. Abundance of contig clusters by cluster size. All contigs longer than 1 kb were combined with the RefSeq viral data set, and an all-vs.-all comparison was made via BLASTN. BLAST hits were treated as edges in a graph, each with a weight equal to the log transform of the E-value, and this graph was clustered using Markov clustering. The resulting clusters were then analyzed for size; a histogram of cluster size is shown. Approximately half of the clusters are singletons.
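The step of turning BLAST hits into a weighted graph can be sketched as below. Several details are assumptions not stated in the caption: the input is taken to be BLAST tabular output with the 12 standard columns (E-value in column 11), the weight is taken to be the negative base-10 logarithm of the E-value, E-values of 0 are capped at an arbitrary maximum weight, self-hits are dropped, and reciprocal hits are merged by keeping the larger weight. The function names are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple

MAX_WEIGHT = 200.0  # assumed cap for E-values reported as 0


def evalue_to_weight(evalue: float) -> float:
    """Negative log10 of the E-value, so stronger hits get larger weights."""
    if evalue <= 0.0:
        return MAX_WEIGHT
    return min(MAX_WEIGHT, max(0.0, -math.log10(evalue)))


def blast_to_edges(lines: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Turn BLAST tabular lines into undirected, weighted graph edges."""
    edges: Dict[Tuple[str, str], float] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject:
            continue  # assumption: self-hits carry no clustering signal
        key = tuple(sorted((query, subject)))
        weight = evalue_to_weight(evalue)
        # Reciprocal hits collapse to one edge with the larger weight.
        edges[key] = max(weight, edges.get(key, 0.0))
    return edges
```

Writing each entry out as a three-column line (node, node, weight) gives a labeled edge list of the kind that graph-clustering tools commonly accept as input.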
Figure 4. Overview of the results obtained by classifying unknown–unknown sequences with RPSTBLASTN. (A) Absolute counts for the total contigs analyzed. On the left, the total numbers of passed and failed contigs are shown. On the right, the “passed” and “failed” groups are each split into three categories. Fail category 1 is the group of contigs that carried more than three eukaryotic Conserved Domain Database domains (CDDs), more than three bacterial CDDs, and zero viral CDDs. Fail category 2 comprises contigs with more than three eukaryotic CDDs. Fail category 3 contains contigs that have more than three bacterial CDDs and zero viral CDDs. Pass category 1 contains the so-called dark matter group (no CDD hits at all). Pass category 2 contains contigs that have at least one viral domain, and pass category 3 contains all remaining passed contigs. (B) Relative counts for the five different taxonomic bins of the CDDs are shown for the passed group of contigs (top) and the failed group of contigs (bottom). To account for contig length, the number of CDD hits per category was divided by the total number of CDD hits in that contig. (C) Distributions of contig length vs. number of CDD hits for failed (blue) and passed (orange) contig groups. Both the contig length and the number of CDD hits are log-transformed.
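The pass/fail categories and the per-contig normalization in this caption can be written out as a small decision function. This is a sketch under assumptions: the caption does not say how overlapping rules are resolved, so the rules here are tested in the order listed and the first match wins (for example, a contig with more than three eukaryotic CDDs fails even if it also carries a viral domain). The function names, category labels, and bin names are hypothetical.

```python
from typing import Dict


def classify_contig(eukaryotic: int, bacterial: int, viral: int, total: int) -> str:
    """Return the pass/fail category for one contig from its CDD hit counts.

    Assumption: rules are checked in caption order and the first match wins.
    """
    if total == 0:
        return "pass_1"  # "dark matter": no CDD hits at all
    if eukaryotic > 3 and bacterial > 3 and viral == 0:
        return "fail_1"
    if eukaryotic > 3:
        return "fail_2"
    if bacterial > 3 and viral == 0:
        return "fail_3"
    if viral > 0:
        return "pass_2"
    return "pass_3"  # every other contig that did not fail


def relative_counts(hits_per_bin: Dict[str, int]) -> Dict[str, float]:
    """Divide each bin's hit count by the contig's total CDD hits."""
    total = sum(hits_per_bin.values())
    if total == 0:
        return {name: 0.0 for name in hits_per_bin}
    return {name: count / total for name, count in hits_per_bin.items()}
```

Under this ordering, `classify_contig(0, 4, 1, 5)` falls into pass category 2: the bacterial count exceeds three, but the single viral domain keeps the contig out of fail category 3.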
Figure 5. Sequence annotation with VIGA. Putative viral sequences were annotated using the modified VIGA pipeline. Contigs representing putative viral sequences, prefiltered by the presence of a viral CDD, were passed to VIGA. For each contig, open reading frames (ORFs) were detected and translated with Prodigal using genetic code 11. Each viral protein was further characterized with a combination of three annotation methods: BLAST, DIAMOND, and HMMER. HMMER used two model databases: pVOGs (prokaryotic viruses) and RVDB (all viral sequences except prokaryotic viruses). Coordinates, protein translations, hits, E-values, viral quotient, percentage of identity, and other relevant information are encoded in a hierarchical JSON database for downstream analysis.
Figure 6. Next generation sequencing (NGS) classification using associated metadata. (A) Study abstract and metadata are insufficient for NGS classification. Data sets were clustered using MASH, and partial least squares regression was performed to identify any covariance between the sequence content and word frequencies derived from the associated study abstracts and metadata. (B) Human gut microbiome samples are separable from other studies using Sequence Read Archive (SRA) metadata. Metadata was used as input to a word2vec model with 300 features, and the resulting representation was reduced to two dimensions using t-distributed stochastic neighbor embedding.
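The word-frequency features mentioned in panel (A) can be illustrated with a small document-term matrix builder. This sketch covers only the feature-extraction step; the regression, word2vec, and embedding steps need additional libraries and are not shown. The tokenization rule (lowercased alphanumeric runs), the function names, and the example texts are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Split text into lowercase alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_frequency_matrix(
    documents: Sequence[str],
) -> Tuple[List[str], List[List[float]]]:
    """Return a sorted vocabulary and one relative-frequency row per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts)) if counts else []
    matrix = []
    for doc_counts in counts:
        total = sum(doc_counts.values())
        # Relative frequency keeps long and short abstracts comparable.
        matrix.append(
            [doc_counts[word] / total if total else 0.0 for word in vocabulary]
        )
    return vocabulary, matrix


# Hypothetical metadata snippets, not taken from SRA.
vocab, rows = word_frequency_matrix(["Human gut microbiome", "gut gut soil"])
```

Each row of the matrix can then serve as the predictor vector for one data set in a downstream regression.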