Hamid Bagheri, Usha Muppirala, Rick E Masonbrink, Andrew J Severin, Hridesh Rajan.
Abstract
BACKGROUND: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like BoaG are needed to efficiently process and parse data contained in large data repositories. The main features of BoaG are inspired by existing languages for data-intensive computing, and BoaG can easily integrate data from biological data repositories.
Keywords: BoaG; Domain-Specific Language; Genome Annotation; Shared Data Science Infrastructure
Year: 2019 PMID: 31438850 PMCID: PMC6704658 DOI: 10.1186/s12859-019-2967-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Code to find the smallest and largest genomes in RefSeq
Exon Statistics for years >= 2016
| Name | Total species | Exon number | Gene number | Gene Length | Exon per Gene |
|---|---|---|---|---|---|
| Bacteria | 92,287 | N/A | 4 | 890 | N/A |
| Fungi | 90 | 32 | 10 | 1 | 2 |
| Archaea | 338 | N/A | 2 | 851 | N/A |
| Viridiplantae | 46 | 385 | 43 | 4 | 9 |
| Metazoa | 185 | 462 | 24 | 23 | 17 |
| Ascomycota | 70 | 28 | 10 | 1 | 2 |
| eudicotyledons (dicots) | 37 | 397 | 45 | 3 | 9 |
Exon Statistics for years < 2016
| Name | Total species | Exon number | Gene number | Gene Length | Exon per Gene |
|---|---|---|---|---|---|
| Bacteria | 51,537 | N/A | 3 | 885 | N/A |
| Fungi | 194 | 29 | 9 | 1 | 2 |
| Archaea | 474 | | 2 | 855 | N/A |
| Viridiplantae | 61 | 273 | 32 | 4 | 8 |
| Metazoa | 262 | 314 | 22 | 22 | 13 |
| Ascomycota | 143 | 25 | 9 | 1 | 2 |
| eudicotyledons (dicots) | 41 | 328 | 38 | 4 | 8 |
Fig. 2 Number of exons, genes, and exons per gene after 2016. The output is shown in Table 1
Fig. 3 Bacterial assembly programs popularity over time. The output of this script is shown in Fig. 4
Fig. 4 Assembler programs for Bacteria over the years
List of top three most used assembly programs for Metazoa (Year >= 2016)
| Kingdom | Program Name | species | Total length | Scaffold-count | ScaffoldN50 | ContigCount | ContigN50 |
|---|---|---|---|---|---|---|---|
| Metazoa | SOAPdenovo | 21 | 1B ± 0.8B | 38 | 7.8 M ± 11 M | 86 k ± 66 k | 98 k ± 208 k |
| | AllPaths | 48 | 0.9B ± 0.7B | 7.1 k ± 7 k | 4.3 M ± 1.4 M | 33 k ± 38 k | 188 k ± 335 k |
| | Newbler | 7 | 0.8B ± 0.9B | 3.3 k ± 2.2 k | 877 k ± 910 k | 56 k ± 80 k | 75 k ± 60 k |
List of top three most used assembly programs for Metazoa (Year < 2016)
| Kingdom | Program Name | species | Total length | Scaffold-count | ScaffoldN50 | ContigCount | ContigN50 |
|---|---|---|---|---|---|---|---|
| Metazoa | SOAPdenovo | 98 | 1 | 40 | 4 | 116 | 42 |
| | AllPaths | 54 | 1 | 11 | 7 | 119 | 38 |
| | Newbler | 18 | 0 | 87 | 2 | 133 | 34 |
Fig. 5 Assembly statistics for genomes for years after 2016. The output is shown in Table 5
Kingdoms and average summary statistics for their genome assemblies (Years >= 2016)
| Tax ID | Name | Species | Total length | Scaffold-count | ScaffoldN50 | ContigCount | ContigN50 |
|---|---|---|---|---|---|---|---|
| 2 | Bacteria | 92,290 | 4 | 66 | 0 | 132 | 0 |
| 4751 | Fungi | 90 | 29 | 139 | 1 | 360 | 0 |
| 2157 | Archaea | 338 | 2 | 52 | 0 | 74 | 0 |
| 33090 | Viridiplantae | 46 | 0 | 9 | 31 | 38 | 1 |
| 33208 | Metazoa | 185 | 1 | 20 | 22 | 53 | 2 |
| 71240 | eudicotyledons (dicots) | 37 | 0 | 6 | 26 | 40 | 1 |
Fig. 6 The Boa database size comparison with the raw data in RefSeq as well as the JSON version of the dataset
Fig. 7 Scalability of Boa programs (time is in log base 2 of seconds). Queries 1, 2, 3 and 4 are the four questions investigated here
Fig. 8 Boa Architecture and Data Generation
Kingdoms and average summary statistics for their genome assemblies (Years <= 2015)
| Tax ID | Name | Species | Total length | Scaffold-count | ScaffoldN50 | ContigCount | ContigN50 |
|---|---|---|---|---|---|---|---|
| 2 | Bacteria | 51,962 | 3 | 45 | 1 | 126 | 0 |
| 4751 | Fungi | 202 | 2 | 341 | 2 | 858 | 0 |
| 2157 | Archaea | 470 | 29 | 17 | 1 | 110 | 0 |
| 33090 | Viridiplantae | 67 | 0 | 22 | 14 | 52 | 0 |
| 33208 | Metazoa | 295 | 1 | 37 | 7 | 118 | 0 |
| 71240 | eudicotyledons (dicots) | 46 | 0 | 26 | 17 | 58 | 0 |
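The ScaffoldN50 and ContigN50 columns in the assembly tables refer to the N50 metric. As a reminder of what that metric measures, here is a minimal Python sketch using the standard definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly length. The function name `n50` and the contig lengths are illustrative assumptions, not values or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum covers half the assembly.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths in base pairs (total = 300, half = 150).
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 70, because 80 + 70 = 150 reaches half of 300
```

A higher N50 generally indicates a more contiguous assembly, which is why it is reported alongside the scaffold and contig counts.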
Fig. 9 Comparison of the code needed to query the number of assembler programs per taxon ID, run on RefSeq data. The MongoDB query (left) needs eight lines of Python code, whereas the BoaG script needs only three. a. MongoDB query to calculate the number of assembler programs per taxon ID. b. The equivalent BoaG query needs fewer lines of code
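For readers without access to the figure, the query it compares can be illustrated in plain Python. This is neither the MongoDB code nor the BoaG script from the paper; it is a sketch that assumes the query counts distinct assembler programs per taxon ID, and the records and the function name `assemblers_per_taxon` are hypothetical.

```python
from collections import defaultdict


def assemblers_per_taxon(records):
    """Count distinct assembler programs seen for each taxon ID.

    `records` is an iterable of (taxon_id, assembler_name) pairs.
    """
    programs = defaultdict(set)
    for taxid, assembler in records:
        programs[taxid].add(assembler)
    return {taxid: len(names) for taxid, names in programs.items()}


# Hypothetical input: taxon IDs 2 (Bacteria) and 33208 (Metazoa).
records = [
    (2, "SOAPdenovo"),
    (2, "Newbler"),
    (2, "SOAPdenovo"),  # duplicates are counted once
    (33208, "AllPaths"),
]
print(assemblers_per_taxon(records))  # {2: 2, 33208: 1}
```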
Comparison between MongoDB and BoaG
| Feature | MongoDB | BoaG |
|---|---|---|
| Lines of Code | larger | smaller because it abstracts details of data analysis |
| Data generation time | longer due to the larger file | shorter because of the binary file format |
| Data file | JSON is 2.7 times larger than raw data | Hadoop Sequence file is 5 times smaller than raw data |
| Schema Flexibility | Yes. Supports semi-structured data | Yes. Schema and compiler can be modified |
| MapReduce | Yes | Yes |
Fig. 10 Comparison of Lines of Code (LOC) and performance to answer the query “What are the top three most used assembly programs?”, run on RefSeq data. The equivalent Python code (left) needs 38 lines, whereas the BoaG script needs only five
Fig. 11 Example of BoaG programs to compute different tasks on the full RefSeq dataset. The Python programs ran on a single core. The Hadoop infrastructure on Bridges has 5 shared nodes with 32 mappers. While these queries can be parallelized in Python, doing so needs more lines of code and more programming skill
Domain types for Genomics data in BoaG
| Type | Attributes | Details |
|---|---|---|
| Genome | taxid | Taxonomy ID of each species |
| | refseq | RefSeq ID of the GFF file |
| | Sequence | List of sequence reads in each GFF file |
| | AssemblerRoot | List of assembly programs associated with this genome |
| | accession | Accession number |
| Sequence | header | Header of Sequence |
| | FeatureRoot | List of features including exon, gene, mRNA, and CDS associated with this sequence |
| | seq | Actual DNA sequences from FASTA files |
| FeatureRoot | refseq | This field shows the key ID |
| | feature | This field is the list of features associated with this ID |
| Feature | accession | Accession code of the Sequence |
| | seqid | Sequence ID |
| | source | A text qualifier that describes the algorithm or procedure that generated this feature. |
| | ftype | Type of the feature |
| | start | Starting point of the feature |
| | end | End point of the feature |
| | score | Score of the feature. This is a floating point number. |
| | strand | + and - for positive and negative strand respectively |
| | phase | Phase of the feature. The phase is one of the integers 0, 1, or 2 |
| | Attribute | List of attributes for each feature |
| | parent | Shows the parent of the attribute |
| Attribute | id | Attribute ID |
| | tag | Attribute tag including gbkey etc. |
| | value | Value of the tag |
| AssemblerRoot | Assembler | List of assembly programs |
| | total-length | Total length or genome size (base pair) |
| | total-gap-length | Total gap length after genome assembly |
| | scaffold-N50 | Scaffold N50 metric |
| | scaffold-count | Scaffold count metric |
| | contig-N50 | Contig N50 metric |
| | contig-count | Contig count metric |
| Assembler | name | Assembly program used to assemble the genome |
| | desc | Program attributes: program name, program version, etc. |
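To make the nesting of these domain types easier to follow, the table can be mirrored as Python dataclasses. This is only an illustrative sketch, not BoaG's actual schema: the Python field types are assumptions, hyphenated attribute names are written with underscores, the FeatureRoot wrapper is flattened into a plain list, and `count_features` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    id: str
    tag: str    # e.g. "gbkey"
    value: str


@dataclass
class Feature:
    accession: str
    seqid: str
    source: str
    ftype: str   # e.g. "exon", "gene", "mRNA", "CDS"
    start: int
    end: int
    score: float
    strand: str  # "+" or "-"
    phase: int   # 0, 1, or 2
    attributes: List[Attribute] = field(default_factory=list)
    parent: str = ""


@dataclass
class Sequence:
    header: str
    seq: str
    features: List[Feature] = field(default_factory=list)  # FeatureRoot, flattened


@dataclass
class Assembler:
    name: str
    desc: str


@dataclass
class AssemblerRoot:
    assemblers: List[Assembler] = field(default_factory=list)
    total_length: int = 0
    total_gap_length: int = 0
    scaffold_n50: int = 0
    scaffold_count: int = 0
    contig_n50: int = 0
    contig_count: int = 0


@dataclass
class Genome:
    taxid: str
    refseq: str
    accession: str
    sequences: List[Sequence] = field(default_factory=list)
    assembler_root: Optional[AssemblerRoot] = None


def count_features(genome: Genome, ftype: str) -> int:
    """Count features of one type (e.g. "exon") across all sequences of a genome."""
    return sum(1 for s in genome.sequences for f in s.features if f.ftype == ftype)
```

With a structure like this, a statistic such as the exon count of a genome reduces to a walk over `genome.sequences` and their features, for example `count_features(genome, "exon")`.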
The BoaG aggregators list
| Aggregator | Description |
|---|---|
| MeanAggregator | Calculates the average |
| MaxAggregator | Finds the maximum value |
| SumAggregator | Calculates the sum of the emitted values to the reducer |
| MinAggregator | Finds the minimum value |
| TopAggregator | Takes an integer argument and returns the top elements for the given argument |
| StDevAggregator | Calculates the standard deviation |
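The behavior of these aggregators can be approximated in a few lines of Python. The sketch below is not BoaG code: the `AGGREGATORS` mapping and its keys are hypothetical, the top aggregator is simplified to return the k largest values, and the standard deviation is assumed to be the population form, which the table does not specify.

```python
import heapq
import statistics

# Hypothetical stand-ins for the aggregators listed in the table.
AGGREGATORS = {
    "mean": statistics.mean,
    "max": max,
    "sum": sum,
    "min": min,
    # Simplification: return the k largest emitted values.
    "top": lambda values, k: heapq.nlargest(k, values),
    # Assumption: population standard deviation.
    "stdev": statistics.pstdev,
}

# Values "emitted" to the reducer, e.g. one number per genome.
emitted = [4, 8, 15, 16, 23, 42]

print(AGGREGATORS["mean"](emitted))    # 18
print(AGGREGATORS["max"](emitted))     # 42
print(AGGREGATORS["sum"](emitted))     # 108
print(AGGREGATORS["min"](emitted))     # 4
print(AGGREGATORS["top"](emitted, 3))  # [42, 23, 16]
print(AGGREGATORS["stdev"](emitted))
```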