| Literature DB >> 19892826 |
Aaron R Jex1, Ross S Hall, D Timothy J Littlewood, Robin B Gasser.
Abstract
Mitochondrial (mt) genomics represents an understudied but important field of molecular biology. Increasingly, mt dysfunction is being linked to a range of human diseases, including neurodegenerative disorders, diabetes and impairment of childhood development. In addition, mt genomes provide important markers for systematic, evolutionary and population genetic studies. Some technological limitations have prevented the expanded generation and utilization of mt genomic data for some groups of organisms. These obstacles most acutely impede, but are not limited to, studies requiring the determination of complete mt genomic data from minute amounts of material (e.g. biopsy samples or microscopic organisms). Furthermore, post-sequencing bioinformatic annotation and analyses of mt genomes are time consuming and inefficient. Herein, we describe a high-throughput sequencing and bioinformatic pipeline for mt genomics, which will have implications for the annotation and analysis of other organellar (e.g. plastid or apicoplast genomes) and virus genomes as well as long, contiguous regions in nuclear genomes. We utilize this pipeline to sequence and annotate the complete mt genomes of 12 species of parasitic nematode (order Strongylida) simultaneously, each from an individual organism. These mt genomic data provide a rich source of markers for studies of the systematics and population genetics of a group of socioeconomically important pathogens of humans and other animals.Entities:
Mesh:
Year: 2009 PMID: 19892826 PMCID: PMC2811008 DOI: 10.1093/nar/gkp883
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Consensus sequence length and read statistics for each nematode mt genome sequenced by 454 technology
| Species | Total sequence length (nt) | Reads | Total sequence output (nt) | Read length (nt) | Mean length (SD) | DOC/nt | Mean DOC (SD) | Percentage of read Agr./nt | Mean percentage of read Agr. (SD) |
|---|---|---|---|---|---|---|---|---|---|
| 14147 | 14339 | 2197156 | 45–400 | 253 (66) | 1–863 | 155 (138) | 26.4–100.0 | 53.6 (11.4) | |
| 13919 | 13440 | 1639442 | 41–369 | 270 (73) | 25–454 | 118 (79) | 25.8–92.2 | 51.9 (10.6) | |
| 13674 | 6346 | 3498444 | 44–372 | 258 (84) | 1–1457 | 256 (323) | 26.2–100.0 | 52.6 (11.3) | |
| 15221 | 12465 | 3220781 | 4–389 | 260 (68) | 6–1197 | 206 (165) | 26.7–92.9 | 54.1 (10.8) | |
| 13804 | 10023 | 2403274 | 43–414 | 267 (81) | 87–1405 | 174 (101) | 26.0–94.1 | 51.0 (9.9) | |
| 13775 | 11771 | 2798458 | 46–417 | 253 (81) | 54–1433 | 203 (171) | 26.7–88.9 | 47.4 (9.1) | |
| 13871 | 15054 | 2810287 | 36–374 | 244 (71) | 25–2929 | 203 (307) | 25.8–96.7 | 53.6 (11.0) | |
| 14301 | 11049 | 2560992 | 40–372 | 236 (86) | 6–1136 | 179 (171) | 25.8–100.0 | 49.4 (9.8) | |
| 14661 | 7498 | 2017991 | 31–409 | 247 (83) | 1–991 | 138 (193) | 25.1–100.0 | 55.4 (14.8) | |
| 13731 | 11346 | 2603970 | 41–396 | 242 (90) | 1–708 | 190 (120) | 0.0 | 50.6 (10.2) | |
| 13731 | 8512 | 1883186 | 41–375 | 228 (92) | 5–697 | 137 (72) | 27.3–95.5 | 55.4 (11.5) | |
| 14036 | 14721 | 3588580 | 42–377 | 216 (89) | 15–5914 | 256 (598) | 25.9–100.0 | 53.8 (11.5) |
Seq.: sequence; DOC: read depth of coverage at each nucleotide (nt) position of each consensus sequence; read Agr. = the percentage of reads at each nt position of each consensus sequence that displayed the same nt sequence as the consensus; SD: standard deviation.
aRepresenting a single nucleotide position wherein a single read had an ‘N’ in the sequence. This position was corrected upon resequencing.
Figure 1.Flow-diagram of the 454 technology based high-throughput mt genomic sequencing and bioinformatic pipeline. The pipeline is divided into two main phases: the high-throughput sequencing (HTS) and annotation/analysis phases. Individual stages of the high-throughput pipeline are numbered 1–12 as follows: 1 = morphological identification of an individual nematode, 2 = total genomic extraction, 3 = PCR-amplification of the second internal transcribed spacer (ITS-2) of the nuclear ribosomal DNA, 4 = Direct sequencing of the ITS-2 (allowing specific identification of the nematode), 5 = long-PCR amplification of the complete mt genome as two overlapping fragments (see 23,24), 6 = quantification of each long-PCR amplicon by spectrophotometry (NanoDrop), 7 = simultaneous sequencing of the complete mt genome of each specimen by 454 technology (max. n ≥ 16), 8 = read assembly (automated: Newbler) generating a majority rule consensus sequence for the mt genome of each specimen sequenced (see 26,28), 9 = bioinformatic analysis of raw read data (using Perl, Python and Java script), 10 = analysis of read length, as well as, assessment of read depth of coverage and read nucleotide (nt) diversity at each nt position of each consensus sequence generated, 11 = automated annotation of each mt genome consensus sequence (see a–i below), 12 = estimation of sequence error rate, base-calling and selective re-sequencing of small regions by conventional means, and comparative analysis against published reference sequence data (from the GenBank database). All mt genomes were annotated using the novel bioinformatic pipeline developed for this study. The individual stages (boxed in diagram) of this annotation process (a–i) are: a = identification of the cox1 gene by comparative alignment using the translated amino acid sequence from a published (‘template’) sequence (user-defined), b = rotation of the ‘query’ consensus sequence relative to cox1, c = identification of all other coding mt genes by comparative alignment as per ‘a’, d = identification of the small and large ribosomal subunit genes (non-coding) by comparative alignment using nucleotide sequence data from the published template sequence, e = prediction of all possible tRNA genes based on inferred secondary structure, f = clustering of all predicted tRNA genes into amino acid groups based on their anticodon, g = comparative alignment of all predicted tRNA gene sequence data, employing a purpose built database containing all sequences of all published mt tRNA genes for nematodes (each predicted tRNA gene with the highest sequence identity score relative to the template database) and the highest bonding score (fewest mismatched pairs in the stem regions of the secondary structure) is selected for each amino acid), h = exportation of all annotated data for each mt genome sequence to a Sequin table for import into the program ‘Sequin’ (http://www.ncbi.nlm.nih.gov), allowing direct uploading of the annotated mt genome sequence to a public databases, such as GenBank (http://www.ncbi.nlm.nih.gov).
Figure 2.Summary of 454 read statistics displaying DOC and percent read agreement (Read Agr.) data for each position of each consensus sequence representing the complete mt genome of each of 12 species of Strongylida nematode sequenced from individual worms. (A) representative line graph of the range in depth of coverage across a complete mt genome (data displayed represents Metastrongylus salmi). (B) Frequency histogram displaying the range in depth of coverage at each nt position among all 12 sequencing runs. (C) Representative line graph displaying the range in % read agr. at each nt position across a complete mt genome (data displayed represents Metastrongylus salmi). (D) Frequency histogram displaying the percentage of read sequences at each nt position of the consensus sequence that were consistent with the consensus (i.e. had the nucleotide identity representing the majority at each nt position of the consensus sequence).
Estimation of consensus sequence error rates based on observed frameshift deletion errors in the protein-coding mt genes for each nematode mt genome sequenced by 454 technology
| Species | Homopoly (coding) | Homopoly (non-coding) | Indel errors (coding) | Error rate (per 100 homopoly) | Predicted errors (non-coding) | Total errors | Percentage of error (per genome) | Percentage of accuracy (per genome) |
|---|---|---|---|---|---|---|---|---|
| 214 | 55 | 27 | 13 | 3 | 30 | 0.22 | 99.78 | |
| 201 | 54 | 15 | 7 | 3 | 18 | 0.13 | 99.87 | |
| 194 | 49 | 16 | 8 | 3 | 19 | 0.14 | 99.86 | |
| 207 | 120 | 9 | 4 | 8 | 17 | 0.11 | 99.89 | |
| 291 | 79 | 17 | 6 | 5 | 22 | 0.16 | 99.84 | |
| 282 | 81 | 28 | 10 | 5 | 33 | 0.24 | 99.76 | |
| 192 | 52 | 14 | 7 | 3 | 17 | 0.12 | 99.88 | |
| 199 | 74 | 13 | 7 | 5 | 18 | 0.12 | 99.88 | |
| 210 | 75 | 7 | 3 | 5 | 12 | 0.08 | 99.92 | |
| 194 | 56 | 5 | 3 | 4 | 9 | 0.06 | 99.94 | |
| 221 | 82 | 7 | 3 | 5 | 12 | 0.09 | 99.91 | |
| 194 | 50 | 5 | 3 | 3 | 8 | 0.06 | 99.94 | |
| Complete dataset | 2599 | 827 | 163 | 6 | 52 | 215 | 0.13 | 99.87 |
Homopoly: homopolymeric regions of five or more nucleotides in length; Indel: insertion or deletion frameshift error. Predicted errors within the non-coding regions of each mt genome were estimated using the error rate calculated for the complete dataset. Total errors estimated for each mt genome are the sum of the predicted errors in the non-coding region and the observed errors in the coding region. All error estimates are based on observed insertion or deletion errors and do not include substitutions.
aBased on homopolymers detected in the coding regions of each mt genome only.
Figure 3.Phylogenetic analysis (Bayesian Inference) of the concatenated amino acid sequence data for all 12 protein coding mt genes for species representing the Strongylida, Rhabditida and Ascaridida (as an outgroup). Shaded boxes indicate the three major suborders of the Strongylida (Metastrongylina, Trichostrongylina and Strongylina). A dash-bordered box highlights paraphyletic clustering of the hookworms. The numbers above the midpoint of each tree branch represent the statistical support for each node (based on posterior probability score). The phylogram provided is presented to scale (scale bar = 0.2 substitutions per site). GenBank accession numbers are provided (in parentheses) for all reference sequences. An identical topology was found with maximum likelihood; all nodes supported by >99% bootstrap re-sampling (n = 100).