
LGAAP: Leishmaniinae Genome Assembly and Annotation Pipeline.

Hatim Almutairi, Michael D Urbaniak, Michelle D Bates, Narissara Jariyapan, Godwin Kwakye-Nuako, Vanete Thomaz-Soccol, Waleed S Al-Salem, Rod J Dillon, Paul A Bates, Derek Gatherer.

Abstract

We present the LGAAP computational pipeline, which was successfully used to assemble six genomes of the parasite subfamily Leishmaniinae to chromosome-scale completeness from a combination of long- and short-read sequencing data. LGAAP is open source, and we suggest that it may easily be ported for assembly of any genome of comparable size (∼35 Mb).


Year:  2021        PMID: 34292068      PMCID: PMC8297458          DOI: 10.1128/MRA.00439-21

Source DB:  PubMed          Journal:  Microbiol Resour Announc        ISSN: 2576-098X


ANNOUNCEMENT

We developed an automated genome assembly and annotation pipeline, successfully applying it to six genomes in the parasite subfamily Leishmaniinae, namely, (i) Leishmania martiniquensis (MHOM/TH/2012/LSCM1, LV760), (ii) Leishmania orientalis (MHOM/TH/2014/LSCM4, LV768), (iii) Leishmania enriettii (MCAV/BR/2001/CUR178, LV763), (iv) Leishmania sp. Ghana (MHOM/GH/2012/GH5, LV757), (v) Leishmania sp. Namibia (MPRO/NA/1975/252, LV425), and (vi) Porcisia hertigi (MCOE/PA/1965/C119, LV43). This paper closes the “protocol gap” (1) for this project by making all methods fully available. The pipeline was written and executed using the Snakemake (2) workflow management system and consists of a total of 314 computational steps, divided into 21 sequential processes in two main phases (Fig. 1). Genomic DNA was extracted from a previously developed culture system for L. orientalis axenic amastigotes (3) and sequenced using two standard technologies, i.e., short read (Illumina) and long read (Oxford Nanopore Technologies [ONT]).
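Snakemake works out the execution order from the input and output files that each rule declares. As a rough illustration of that idea in plain Python (this is not the authors' Snakefile), the sketch below orders the eight assembly-phase processes listed in this announcement with the standard-library `graphlib` module (Python 3.9 or later). The step labels are hypothetical names chosen for this example, and the dependencies are assumed to form a simple linear chain.

```python
from graphlib import TopologicalSorter

# Hypothetical labels for the eight assembly-phase processes. Each key maps to
# the set of steps it depends on; here the chain is assumed to be linear.
ASSEMBLY_DEPENDENCIES = {
    "flye_assembly": set(),
    "minimap2_mapping": {"flye_assembly"},
    "samtools_consensus": {"minimap2_mapping"},
    "pilon_polishing": {"samtools_consensus"},
    "samtools_revision": {"pilon_polishing"},
    "ragoo_scaffolding": {"samtools_revision"},
    "funannotate_cleanup": {"ragoo_scaffolding"},
    "quast_report": {"funannotate_cleanup"},
}


def execution_order(dependencies):
    """Return the steps in an order that respects every declared dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(ASSEMBLY_DEPENDENCIES)
for position, step in enumerate(order, start=1):
    print(position, step)
```

A real workflow manager also tracks files, reruns only outdated steps, and parallelizes independent jobs; the sketch covers only the ordering aspect.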
FIG 1

Graphical representation of the LGAAP protocol.

The first (assembly) phase of the pipeline comprises eight sequential processes, i.e., (i) long-read assembly using Flye (version 2.8.2) (4), (ii) mapping of short reads onto assemblies using Minimap2 (version 2.17) (5), (iii) creation of consensus sequences using SAMtools (version 1.11) (6), (iv) polishing of assemblies using Pilon (version 1.23) (7), (v) revision of consensus sequences using SAMtools, (vi) ordering and orientation of the chromosomes and breakage of any chimeric sequences using RaGOO (version 1.1) (8), (vii) sorting and removal of any duplicated scaffolds or contigs using Funannotate (version 1.5.3) (9), and (viii) generation of a quality report using QUAST (version 5.0.2) (10).

The second (annotation) phase of the pipeline comprises 13 sequential processes, i.e., (i) scanning of assemblies for vector contamination using BLAST+ (version 2.10.1) (11) against UniVec (12), (ii) masking of contaminants using BEDTools (version 2.30) (13), (iii) quality statistics preannotation using AGAT (version 0.6.0) (14), (iv) detection of repeats using RepeatModeler (15) running from Dfam TE Tools Container (version 1.3.1) (16), (v) classification of transposable elements using TEclass (16) running from a docker container (version 2.1.3b) (17), (vi) masking of identified complex repeats using RepeatMasker (version 4.1.2-p1) (18), (vii) downloading of protein and transcript evidence from TriTrypDB (release 47) (19), (viii) evidence-based annotation using MAKER2 (20) running from a docker container (version 2.31.10) (21), (ix) quality checking of annotation using GenomeTools (version 1.2.1) (22) and GAAS (version 1.2.0) (23), (x) ab initio annotation using AUGUSTUS (version 3.3.2) (24) within MAKER2, (xi) repeating of the ninth step, (xii) annotation assignments using BLAST+ against UniProt (25) and InterProScan (version 5.22-61.0) (26), and (xiii) finalization of the longest isoforms of each predicted protein using AGAT.

The final product of the analysis pipeline is five files per genome, i.e., the chromosome-scale assembly, proteins, and transcripts in FASTA format and two general feature format (GFF) files, one containing the coordinates of each feature and one with the longest isoforms. Testing on genomes longer than 35 Mb is a future optimization priority. Comparison of the performance of LGAAP with all 50 Leishmania genome assemblies in GenBank is shown in Table 1.
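The last annotation step keeps only the longest isoform of each predicted protein. In the pipeline this is done by AGAT; the sketch below is not AGAT's implementation but a minimal Python illustration of the selection rule, assuming that isoforms are already grouped by a gene identifier. The `Isoform` record type, the gene and isoform identifiers, and the toy sequences are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record type: one predicted protein isoform of a gene.
Isoform = namedtuple("Isoform", ["gene_id", "isoform_id", "sequence"])


def longest_isoforms(isoforms):
    """Keep, for each gene, the isoform with the longest sequence.

    Ties are resolved in favour of the isoform encountered first.
    """
    best = {}
    for iso in isoforms:
        current = best.get(iso.gene_id)
        if current is None or len(iso.sequence) > len(current.sequence):
            best[iso.gene_id] = iso
    return best


# Toy input with invented identifiers and sequences.
example = [
    Isoform("geneA", "geneA-RA", "MKT"),
    Isoform("geneA", "geneA-RB", "MKTLLV"),
    Isoform("geneB", "geneB-RA", "MSS"),
]
selected = longest_isoforms(example)
for gene_id, iso in selected.items():
    print(gene_id, iso.isoform_id, len(iso.sequence))
```

Selecting by protein length is only one possible criterion; a tool working on GFF features may instead compare coding-sequence lengths, which this sketch does not model.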
TABLE 1

Assembly metrics for Leishmania genome assemblies deposited in GenBank

Organism | NCBI assembly no. | Strain | Sequencing technology(ies) | Assembly method | No. of scaffolds | Total length (bp) | N50 (bp)
--- | --- | --- | --- | --- | --- | --- | ---
L. aethiopica | GCA_003992445 | 209-622 | PacBio RS II | CANU | 118 | 33,648,436 | 763,733
L. aethiopica | GCA_000444285 | L147 | Illumina | Allpaths-LG | 160 | 31,630,816 | 1,001,864
L. amazonensis | GCA_003992505 | 210-660 | PacBio RS II | CANU | 92 | 33,504,997 | 850,106
L. amazonensis | GCA_000438535 | NA | Roche 454, Illumina | Newbler, Velvet, Zorro | 2,627 | 29,029,348 | 22,901
L. amazonensis | GCA_005317125 | UA301 | Illumina | SMALT | 34 | 32,156,470 | NA
L. arabica | GCA_000410695 | LEM1108 | Illumina | AllPaths-LG | 168 | 31,269,090 | 1,057,807
L. braziliensis | GCA_003304975 | IOC-L 3564 | IonTorrent | SPAdes | 1,029 | 38,003,648 | 758,103
L. braziliensis | GCA_000340355 | MHOM/BR/75/M2903 | Roche 454 | Newbler | 744 | 35,210,150 | 1,030,512
L. braziliensis | GCA_000002845 | MHOM/BR/75/M2904 | Sanger | NA | 138 | 32,068,771 | 992,961
L. braziliensis | GCA_900537975 | MHOM/BR/75/M2904 | PacBio, Illumina | NA | 35 | 32,301,632 | NA
L. chagasi | GCA_014466975 | MCER/BR/1981/M6445/Salvaterra | Illumina | SOAPdenovo | 36 | 31,924,566 | 1,043,794
L. chagasi | GCA_014466935 | MHOM/HD/2017/M32502/Amapala | Illumina | SOAPdenovo | 36 | 31,924,975 | 1,043,719
L. donovani | GCA_000470725 | BHU 1220 | Illumina | Bowtie | 36 | 32,414,853 | 1,024,085
L. donovani | GCA_000227135 | BPK282A1 | Roche 454, Illumina | NA | 36 | 32,444,968 | 1,024,085
L. donovani | GCA_003730175 | FDAARGOS_360 | PacBio, Illumina | CANU | 71 | 34,011,430 | 828,097
L. donovani | GCA_003730215 | FDAARGOS_361 | PacBio, Illumina | CANU | 56 | 33,453,722 | 1,033,854
L. donovani | GCA_900635355 | HU3 | Illumina | NA | 36 | 33,035,865 | NA
L. donovani | GCA_000283395 | Ld 2001 | SOLiD | Velvet | 14,518 | 27,466,456 | 3,370
L. donovani | GCA_000316305 | Ld 39 | SOLiD | Velvet | 16,323 | 23,683,296 | 1,772
L. donovani | GCA_003719575 | LdCL | PacBio, Illumina | HGAP, Celera Assembler, CANU | 36 | 32,959,864 | NA
L. donovani | GCA_001989955 | MHOM/IN/1983/AG83 | Illumina | AllPaths, STLab-assembler | 36 | 32,148,377 | 1,015,993
L. donovani | GCA_001989975 | MHOM/IN/1983/AG83 | Illumina | AllPaths | 36 | 32,196,393 | 1,029,368
L. donovani | GCA_002243465 | Pasteur | PacBio | HGAP | 37 | 33,545,875 | 1,079,609
L. enriettii | GCA_000410755 | LEM3045 | Illumina | AllPaths-LG | 495 | 30,761,861 | 868,233
L. enriettii* | GCA_017916305* | MCAV/BR/2001/CUR178, LV763 | ONT, Illumina | LGAAP | 54 | 33,318,864 | 1,075,649
L. gerbilli | GCA_000443025 | LEM452 | Illumina | AllPaths-LG | 492 | 31,398,648 | 379,527
L. guyanensis | GCA_003664525 | 204-365 | PacBio RS II | CANU | 123 | 33,816,023 | 683,170
L. infantum | GCA_003671315 | HUUFS14 | Illumina | ABySS | 2,507 | 32,578,914 | 29,848
L. infantum | GCA_000002875 | JPCM5 | Sanger | NA | 76 | 32,122,061 | 1,043,848
L. infantum | GCA_900500625 | JPCM5 | PacBio, Illumina | NA | 36 | 32,803,248 | NA
L. infantum | GCA_003020905 | TR01 | Illumina | Geneious | 36 | 32,009,138 | NA
L. lainsoni | GCA_003664395 | 216-34 | PacBio RS II | CANU | 137 | 34,152,029 | 638,860
L. major | GCA_000002725 | Friedlin | Sanger | NA | 36 | 32,855,089 | NA
L. major | GCA_000331345 | LV39c5 | Roche 454 | Newbler | 849 | 32,327,517 | 978,401
L. major | GCA_000250755 | SD 75.1 | Roche 454 | Newbler | 36 | 31,242,750 | 1,022,795
L. martiniquensis | GCA_000409445 | LEM2494 | Illumina | AllPaths-LG | 251 | 30,813,970 | 873,628
L. martiniquensis* | GCA_017916325* | MHOM/TH/2012/LSCM1, LV760 | ONT, Illumina | LGAAP | 42 | 32,413,670 | 1,046,741
L. mexicana | GCA_003992435 | 215-49 | PacBio RS II | CANU | 55 | 32,057,209 | 825,953
L. mexicana | GCA_000234665 | MHOM/GT/2001/U1103 | Sanger | NA | 588 | 32,108,741 | 1,044,075
L. orientalis* | GCA_017916335* | MHOM/TH/2014/LSCM4, LV768 | ONT, Illumina | LGAAP | 98 | 34,194,276 | 1,120,138
L. panamensis | GCA_000340495 | MHOM/COL/81/L13 | Illumina | SOAP denovo | 952 | 31,263,945 | 156,905
L. panamensis | GCA_000755165 | MHOM/PA/94/PSC-1 | Roche 454, Illumina | Newbler, PAGIT | 35 | 30,688,794 | 1,043,456
L. peruviana | GCA_001403695 | LEM-1537 | NA | NA | 37 | 33,890,200 | 1,047,715
L. peruviana | GCA_001403675 | PAB-4377 | NA | NA | 37 | 32,907,781 | 1,015,393
Leishmania sp. | GCA_000981925 | AIIMS/LM/SS/PKDL/LD-974 | Illumina | A5 assembly pipeline | 1,100 | 27,848,322 | 61,709
Leishmania sp. Ghana* | GCA_017918215* | MHOM/GH/2012/GH5, LV757 | ONT, Illumina | LGAAP | 116 | 35,953,538 | 1,100,365
Leishmania sp. Namibia* | GCA_017918225* | MPRO/NA/1975/252, LV425 | ONT, Illumina | LGAAP | 67 | 34,118,624 | 1,066,046
L. tarentolae | GCA_009731335 | Parrot Tar II | PacBio RS II | HGAP | 179 | 35,416,496 | 663,019
L. tarentolae | GCA_009770625 | Parrot Tar II | Roche 454 | Newbler | 7,227 | 31,556,583 | 7,432
L. tropica | GCA_011316065 | ATCC 50129 | Illumina | CLC Genomics Workbench | 1,928 | 30,870,161 | 32,161
L. tropica | GCA_014139745 | CDC216-162 | PacBio RS II, Illumina | Flye | 43 | 32,700,668 | 1,070,514
L. tropica | GCA_000410715 | L590 | Illumina | AllPaths-LG | 448 | 32,989,014 | 303,214
L. tropica | GCA_003067545 | MHOM/LB /2017/IK | Illumina | CLC NGS Cell | 9,499 | 32,139,927 | 13,854
L. tropica | GCA_003352575 | MHOM/LB/2015/IK | Illumina | CLC NGS Cell | 17,013 | 32,280,712 | 7,721
L. turanica | GCA_000441995 | LEM423 | Illumina | AllPaths-LG | 336 | 32,320,007 | 397,299
Porcisia hertigi* | GCA_017918235* | MCOE/PA/1965/C119, LV43 | ONT, Illumina | LGAAP | 74 | 34,958,538 | 967,170

Asterisks indicate the six genomes assembled using LGAAP. NA, either not applicable to the technology used or not available from the GenBank record.

SOLiD, sequencing by oligonucleotide ligation and detection.
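The N50 column in Table 1 is the scaffold length L such that scaffolds of length L or longer together contain at least half of the total assembly length. The following is a minimal Python sketch of that definition, not the code used by QUAST; the toy lengths are invented for illustration and do not correspond to any assembly in the table.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to at least half of
        # the total length defines the N50.
        if running >= half_total:
            return length
    return 0


# Invented scaffold lengths: total 290, so half is 145; 80 + 70 = 150 >= 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 70
```

Because N50 depends on the length distribution rather than on the scaffold count alone, two assemblies with the same number of scaffolds can have very different N50 values.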


Data availability.

Genomes assembled using this protocol are available in the NCBI Assembly database with the following accession numbers: L. martiniquensis, GCA_017916325.1; L. orientalis, GCA_017916335.1; L. enriettii, GCA_017916305.1; Leishmania sp. Ghana, GCA_017918215.1; Leishmania sp. Namibia, GCA_017918225.1; and Porcisia hertigi, GCA_017918235.1. Raw sequencing data are available with the following NCBI BioProject accession numbers: L. martiniquensis, PRJNA691531; L. orientalis, PRJNA691532; L. enriettii, PRJNA691534; Leishmania sp. Ghana, PRJNA691536; Leishmania sp. Namibia, PRJNA689706; and Porcisia hertigi, PRJNA691541. The workflow is available at GitHub (https://github.com/hatimalmutairi/LGAAP) and Zenodo (https://doi.org/10.5281/zenodo.4663265).
  19 in total

1.  GenomeTools: a comprehensive software library for efficient processing of structured genome annotations.

Authors:  Gordon Gremme; Sascha Steinbiss; Stefan Kurtz
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2013 May-Jun       Impact factor: 3.710

2.  Predicting Genes in Single Genomes with AUGUSTUS.

Authors:  Katharina J Hoff; Mario Stanke
Journal:  Curr Protoc Bioinformatics       Date:  2018-11-22

3.  QUAST: quality assessment tool for genome assemblies.

Authors:  Alexey Gurevich; Vladislav Saveliev; Nikolay Vyahhi; Glenn Tesler
Journal:  Bioinformatics       Date:  2013-02-19       Impact factor: 6.937

4.  Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2016-03-19       Impact factor: 6.937

5.  BLAST+: architecture and applications.

Authors:  Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal:  BMC Bioinformatics       Date:  2009-12-15       Impact factor: 3.169

6.  MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects.

Authors:  Carson Holt; Mark Yandell
Journal:  BMC Bioinformatics       Date:  2011-12-22       Impact factor: 3.307

7.  InterProScan 5: genome-scale protein function classification.

Authors:  Philip Jones; David Binns; Hsin-Yu Chang; Matthew Fraser; Weizhong Li; Craig McAnulla; Hamish McWilliam; John Maslen; Alex Mitchell; Gift Nuka; Sebastien Pesseat; Antony F Quinn; Amaia Sangrador-Vegas; Maxim Scheremetjew; Siew-Yit Yong; Rodrigo Lopez; Sarah Hunter
Journal:  Bioinformatics       Date:  2014-01-21       Impact factor: 6.937

8.  Twelve years of SAMtools and BCFtools.

Authors:  Petr Danecek; James K Bonfield; Jennifer Liddle; John Marshall; Valeriu Ohan; Martin O Pollard; Andrew Whitwham; Thomas Keane; Shane A McCarthy; Robert M Davies; Heng Li
Journal:  Gigascience       Date:  2021-02-16       Impact factor: 6.524

9.  The Protocol Gap.

Authors:  Michael G Weller
Journal:  Methods Protoc       Date:  2021-02-03

10.  RaGOO: fast and accurate reference-guided scaffolding of draft genomes.

Authors:  Michael Alonge; Sebastian Soyk; Srividya Ramakrishnan; Xingang Wang; Sara Goodwin; Fritz J Sedlazeck; Zachary B Lippman; Michael C Schatz
Journal:  Genome Biol       Date:  2019-10-28       Impact factor: 13.583

