Julia Soewarto1, Chantal Hamelin2,3,4, Stéphanie Bocs2,3,4, Pierre Mournet2,4, Hélène Vignes2,4, Angélique Berger2,4, Alix Armero2,4, Guillaume Martin2,4, Alexis Dereeper5,3, Gautier Sarah2,3,4, Fabian Carriconde1, Laurent Maggia1,6,4.
Abstract
The myrtle rust disease, caused by the fungus Austropuccinia psidii, infects a wide range of host species within the Myrtaceae family worldwide. Since its first report in New Caledonia in 2013, it has been found in various native environments where Myrtaceae are the dominant or codominant species, as well as in several commercial nurseries. It is now considered a significant threat to ecosystem biodiversity and to the Myrtaceae-related economy. The use of predictive molecular markers for resistance against myrtle rust is currently the most cost-effective and ecological approach to control the disease. Such an approach was not previously possible for New Caledonian endemic Myrtaceae species because of the lack of genomic resources. Recent advances in next-generation sequencing technologies, together with relevant bioinformatics tools, now provide new research opportunities for work on non-model organisms at the transcriptomic level. The present study focuses on transcriptome analysis of three Myrtaceae species endemic to New Caledonia (Arillastrum gummiferum, Syzygium longifolium and Tristaniopsis glauca) that display contrasting responses to the pathogen (non-infected vs infected). Differential gene expression (DGE) and variant calling analyses were conducted on each species. We used a dual approach based on 1) the annotated reference genome of a related Myrtaceae species (Eucalyptus grandis) and 2) a de novo transcriptome of each species.
Year: 2019 PMID: 30766900 PMCID: PMC6362868 DOI: 10.1016/j.dib.2018.12.080
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Fig. 1 Illustration of the three Myrtaceae species in this study. For each species, the left picture shows a non-infected individual and the right one shows myrtle rust symptoms on an infected individual.
Detailed sampling of three Myrtaceae species for RNA-seq analysis.
| Sample | Individual | Tissue | Species | Site | Status |
|---|---|---|---|---|---|
| Sample 1 | Ag19 | leaf | Arillastrum gummiferum | nursery | infected |
| Sample 2 | Ag28 | leaf | Arillastrum gummiferum | nursery | infected |
| Sample 3 | Ag2 | leaf | Arillastrum gummiferum | nursery | non-infected |
| Sample 4 | Ag3 | leaf | Arillastrum gummiferum | nursery | non-infected |
| Sample 5 | Ag4 | leaf | Arillastrum gummiferum | nursery | non-infected |
| Sample 6 | Ag6 | leaf | Arillastrum gummiferum | nursery | infected |
| Sample 7 | Syl10 | leaf | Syzygium longifolium | nursery | non-infected |
| Sample 8 | Syl13 | leaf | Syzygium longifolium | nursery | infected |
| Sample 9 | Syl15 | leaf | Syzygium longifolium | nursery | infected |
| Sample 10 | Syl18 | leaf | Syzygium longifolium | nursery | non-infected |
| Sample 11 | Syl2 | leaf | Syzygium longifolium | nursery | non-infected |
| Sample 12 | Syl4 | leaf | Syzygium longifolium | nursery | non-infected |
| Sample 13 | Syl7 | leaf | Syzygium longifolium | nursery | non-infected |
| Sample 14 | Tg2 | leaf | Tristaniopsis glauca | nursery | infected |
| Sample 15 | Tg3 | leaf | Tristaniopsis glauca | nursery | non-infected |
| Sample 16 | Tg4 | leaf | Tristaniopsis glauca | nursery | infected |
| Sample 17 | Tg5 | leaf | Tristaniopsis glauca | nursery | infected |
| Sample 18 | Tg6 | leaf | Tristaniopsis glauca | nursery | infected |
| Sample 19 | V1 | leaf | Tristaniopsis glauca | natural field | infected |
| Sample 20 | V2 | leaf | Tristaniopsis glauca | natural field | infected |
| Sample 21 | V3 | leaf | Tristaniopsis glauca | natural field | non-infected |
| Sample 22 | V4 | leaf | Tristaniopsis glauca | natural field | non-infected |
| Sample 23 | V6 | leaf | Tristaniopsis glauca | natural field | infected |
| Sample 24 | V7 | leaf | Tristaniopsis glauca | natural field | infected |
Fig. 2 Bioinformatics pipeline showing the steps of the RNA-seq analysis up to alignment against the two kinds of reference.
Fig. 3 Quality control statistics generated by FastQC for individual Syl18 (S. longifolium) at different stages of the data cleaning process.
Number of raw and cleaned reads from the three species.
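| | A. gummiferum | S. longifolium | T. glauca (nursery) | T. glauca (natural field) |
|---|---|---|---|---|
| Number of libraries/individuals | 6 | 7 | 5 | 6 |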
| Length of raw reads (bp) | 150 | 150 | 150 | 150 |
| Total number of raw reads | 176,074,893 | 200,293,564 | 137,602,172 | 154,194,966 |
| Total number of clean reads | 169,892,296 | 193,481,676 | 132,410,558 | 148,922,692 |
| Length of clean reads (bp) | 130 | 130 | 130 | 130 |
Alignment statistics for reads aligned to the assembled transcriptome, computed with SAMtools flagstat.
| Species | Sample name | Total reads mapped | Properly paired (%) | Singletons mapped (%) | Total reads mapped | Properly paired (%) | Singletons mapped (%) |
|---|---|---|---|---|---|---|---|
| A. gummiferum | Ag19 | 22,195,213 | 70 | 4.8 | 19,953,353 | 95 | 0.1 |
| A. gummiferum | Ag28 | 23,611,351 | 71 | 4.8 | 20,537,263 | 96 | 0.1 |
| A. gummiferum | Ag2 | 25,655,914 | 68 | 4.5 | 23,462,050 | 96 | 0.1 |
| A. gummiferum | Ag3 | 27,318,150 | 68 | 4.6 | 24,139,896 | 96 | 0.1 |
| A. gummiferum | Ag4 | 18,576,164 | 69 | 4.5 | 17,399,264 | 97 | 0.1 |
| A. gummiferum | Ag6 | 21,277,541 | 74 | 4.2 | 17,259,913 | 96 | 0.1 |
| S. longifolium | Syl10 | 26,685,822 | 68 | 6.1 | 25,218,899 | 96 | 0.1 |
| S. longifolium | Syl13 | 16,676,981 | 65 | 5.0 | 16,426,865 | 97 | 0.1 |
| S. longifolium | Syl15 | 17,382,267 | 63 | 5.2 | 15,522,277 | 97 | 0.1 |
| S. longifolium | Syl18 | 19,263,137 | 69 | 5.4 | 18,980,838 | 96 | 0.1 |
| S. longifolium | Syl2 | 23,370,932 | 67 | 5.3 | 22,382,429 | 96 | 0.1 |
| S. longifolium | Syl4 | 23,806,097 | 66 | 5.7 | 21,148,273 | 95 | 0.1 |
| S. longifolium | Syl7 | 19,456,046 | 66 | 5.1 | 19,209,000 | 97 | 0.1 |
| T. glauca | Tg2 | 18,287,633 | 71 | 5.5 | 16,225,569 | 96 | 0.2 |
| T. glauca | Tg3 | 20,985,018 | 70 | 5.5 | 19,345,206 | 97 | 0.2 |
| T. glauca | Tg4 | 19,284,476 | 70 | 6.9 | 19,094,992 | 95 | 0.3 |
| T. glauca | Tg5 | 19,370,817 | 65 | 4.2 | 17,549,513 | 97 | 0.1 |
| T. glauca | Tg6 | 17,264,359 | 67 | 4.9 | 16,126,662 | 96 | 0.1 |
| T. glauca | V1 | 12,259,586 | 58 | 3.6 | 12,446,883 | 99 | 0.0 |
| T. glauca | V2 | 20,957,190 | 70 | 4.6 | 18,089,903 | 98 | 0.1 |
| T. glauca | V3 | 18,561,244 | 66 | 4.1 | 17,126,183 | 98 | 0.1 |
| T. glauca | V4 | 17,177,833 | 66 | 5.1 | 17,252,577 | 96 | 0.1 |
| T. glauca | V6 | 22,656,615 | 66 | 4.9 | 20,231,181 | 97 | 0.0 |
| T. glauca | V7 | 19,387,135 | 70 | 5.7 | 17,984,129 | 97 | 0.1 |
Properly paired (%): number of properly paired reads in proportion to the total reads mapped.
Singletons mapped (%): number of reads for which only one read of the pair mapped, in proportion to the total reads mapped.
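The two percentages defined in the footnotes above can be reproduced from raw counts. The following is a minimal sketch that follows those footnote definitions (both ratios taken against the total reads mapped); the function name and the example counts are hypothetical and are not taken from the dataset.

```python
def flagstat_percentages(total_mapped, properly_paired, singletons):
    """Return (properly paired %, singletons %) relative to total mapped reads.

    Follows the footnote definitions of the table: both values are
    expressed in proportion to the total number of reads mapped.
    """
    if total_mapped <= 0:
        raise ValueError("total_mapped must be a positive count")
    pct_paired = 100.0 * properly_paired / total_mapped
    pct_singletons = 100.0 * singletons / total_mapped
    return round(pct_paired, 1), round(pct_singletons, 1)


# Hypothetical counts, for illustration only
paired_pct, singleton_pct = flagstat_percentages(1_000_000, 700_000, 48_000)
print(f"properly paired: {paired_pct}%, singletons: {singleton_pct}%")
```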
Overlapping and unique SNPs called using two different calling methods (GATK and in-house script) from mapping to the E. grandis reference genome.
| Species | Methods | Filtered SNP counts | % unique SNP positions | % shared SNP positions between GATK and in-house script methods |
|---|---|---|---|---|
| | GATK (Haplotype Caller) | 142,294 | 66 | 34 |
| | In-house script | 73,765 | 34 | 66 |
| | GATK (Haplotype Caller) | 181,967 | 79 | 21 |
| | In-house script | 68,106 | 45 | 55 |
| | GATK (Haplotype Caller) | 148,484 | 75 | 25 |
| | In-house script | 67,115 | 44 | 56 |
| | GATK (Haplotype Caller) | 137,073 | 60 | 40 |
| | In-house script | 83,243 | 34 | 66 |
Fig. 4 Analytic pipeline for differential gene expression (DGE) and variant calling (SNP).
Statistics of the de novo transcriptome assembly for each species using Trinity assembler.
| Total number of trinity genes (unigene) | 84,919 | 64,716 | 76,982 | 53,527 | |
| Total number of trinity transcripts | 117,839 | 89,780 | 108,823 | 74,684 | |
| Percent GC | 45.86 | 46.37 | 44.45 | 46.21 | |
| Statistics based on all transcript contigs | | | | | |
| Contig N50 | 1,378 | 1,406 | 1,396 | 1,315 | |
| Median contig length (bp) | 501 | 530 | 547 | 525 | |
| Average contig length (bp) | 843.45 | 867.27 | 876.12 | 839.11 | |
| Total assembled bases | 99,391,026 | 77,863,296 | 95,341,718 | 62,667,790 | |
| Statistics based on only the longest isoform per gene | | | | | |
| Contig N50 | 1,021 | 1,219 | 1,263 | 1,199 | |
| Median contig length (bp) | 386 | 402 | 421 | 415 | |
| Average contig length (bp) | 672.69 | 727.09 | 755.46 | 734.46 | |
| Total assembled bases | 57,124,398 | 47,054,345 | 58,156,816 | 39,313,402 | |
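For readers unfamiliar with the assembly metrics in the table above, the sketch below shows how N50, median, average and total length can be derived from a list of contig lengths. It uses the usual definition of N50 (the contig length at which contigs of that length or longer contain at least half of the assembled bases). The function name and the toy lengths are hypothetical; this is not the script used to produce the table.

```python
from statistics import mean, median


def contig_stats(lengths):
    """Summarise contig lengths (bp): N50, median, mean and total bases."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    # Walk from the longest contig down until half of all bases are covered.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "n50": n50,
        "median": median(ordered),
        "mean": mean(ordered),
        "total_bases": total,
    }


# Toy contig lengths, for illustration only
print(contig_stats([100, 200, 300, 400, 500]))
```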
Overlapping and unique SNPs called using two different calling methods (GATK and in-house script) from mapping to the de novo transcriptomes.
| Species | Methods | Filtered SNP counts | % unique SNP positions | % shared SNP positions between GATK and in-house script methods |
|---|---|---|---|---|
| | GATK (Unified Genotyper) | 65,623 | 34 | 66 |
| | In-house script | 64,098 | 33 | 67 |
| | GATK (Unified Genotyper) | 84,242 | 34 | 66 |
| | In-house script | 78,612 | 29 | 71 |
| | GATK (Unified Genotyper) | 89,791 | 38 | 62 |
| | In-house script | 57,835 | 4 | 96 |
| | GATK (Unified Genotyper) | 94,274 | 17 | 83 |
| | In-house script | 108,495 | 27 | 73 |
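The unique and shared percentages in the table above amount to a set comparison of SNP positions between the two callers. The sketch below illustrates that comparison, assuming each call set has already been reduced to `(contig, position)` pairs; the function name and the toy positions are hypothetical, and parsing of the actual VCF files is not shown.

```python
def compare_snp_calls(calls_a, calls_b):
    """Compare two sets of (contig, position) SNP calls.

    Returns, for each caller, its SNP count and the rounded percentages
    of positions unique to it and shared with the other caller.
    """
    shared = calls_a & calls_b

    def summarise(calls):
        count = len(calls)
        pct_shared = 100.0 * len(shared) / count if count else 0.0
        return {
            "count": count,
            "pct_unique": round(100.0 - pct_shared) if count else 0,
            "pct_shared": round(pct_shared),
        }

    return summarise(calls_a), summarise(calls_b)


# Toy positions, for illustration only
caller_a = {("c1", 10), ("c1", 25), ("c2", 7), ("c2", 90)}
caller_b = {("c1", 10), ("c2", 7), ("c3", 5)}
print(compare_snp_calls(caller_a, caller_b))
```

Because the number of shared positions is the same for both callers, the caller with fewer filtered SNPs necessarily shows the higher shared percentage, which matches the pattern in the table.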
Differentially expressed genes identified with edgeR.
| Reference for reads alignment | Species | Total number of genes | Differentially expressed genes (common dispersion) | Over-expressed genes (LogFC≥1) | Under-expressed genes (LogFC≤-1) | % of differentially expressed genes |
|---|---|---|---|---|---|---|
| E. grandis reference genome | | 27,294 | 3463 | 2792 | 671 | 12.69 |
| E. grandis reference genome | | 26,626 | 2747 | 1768 | 979 | 10.32 |
| E. grandis reference genome | | 23,622 | 413 | 234 | 179 | 1.75 |
| E. grandis reference genome | | 27,014 | 662 | 609 | 53 | 2.45 |
| De novo transcriptome | | 84,919 | 4751 | 2994 | 1757 | 5.59 |
| De novo transcriptome | | 39,929 | 3379 | 2063 | 1316 | 8.46 |
| De novo transcriptome | | 31,379 | 388 | 243 | 145 | 1.24 |
| De novo transcriptome | | 36,047 | 493 | 400 | 93 | 1.37 |
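The columns of the table above are linked by simple arithmetic: the over- and under-expressed counts sum to the number of differentially expressed genes, and the last column is that number divided by the total number of genes. The sketch below illustrates this with the LogFC thresholds from the table header. It assumes the input list holds the log fold changes of genes already retained as significant by the statistical test; the function name and the toy values are hypothetical.

```python
def summarise_deg(log_fcs, total_genes, threshold=1.0):
    """Count over/under-expressed genes and the percentage of DEGs.

    log_fcs: log fold changes of genes already retained as significant
             (an assumption of this sketch).
    total_genes: total number of genes tested.
    """
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    over = sum(1 for fc in log_fcs if fc >= threshold)
    under = sum(1 for fc in log_fcs if fc <= -threshold)
    deg = over + under
    return {
        "over": over,
        "under": under,
        "deg": deg,
        "pct_deg": round(100.0 * deg / total_genes, 2),
    }


# Toy log fold changes, for illustration only
print(summarise_deg([2.3, 1.0, -1.5, 0.4, -0.2], total_genes=100))

# Consistency check against the first row of the table: 3463 of 27,294 genes
print(round(100.0 * 3463 / 27294, 2))
```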
Fig. 5 Numbers of SNPs after filtering steps, per calling method, using the E. grandis genome as reference for mapping.
Fig. 6 Numbers of SNPs per calling method for each studied species, using the de novo transcriptome of each species as reference for mapping.
Fig. 7 Venn diagrams showing the differentially expressed genes in A. gummiferum, S. longifolium and T. glauca (FAR and BDS) using alignments with the E. grandis reference genome. Diagram (A) shows over-expressed genes and diagram (B) under-expressed ones. Over- or under-expressed means that these genes are differentially expressed in the infected samples.
List of differentially expressed genes common to A. gummiferum, T. glauca and S. longifolium, using the E. grandis reference genome.
| Gene name | Scaffold | Description | Position (begin) | Position (end) |
|---|---|---|---|---|
| LOC104415198 | scaffold0008 | major allergen Pru ar 1-like | 57984411 | 57985238 |
| LOC104415200 | scaffold0008 | major allergen Pru ar 1-like | 57981077 | 57981765 |
| LOC104415201 | scaffold0008 | major allergen Pru ar 1-like | 57988552 | 57989434 |
| LOC104415202 | scaffold0008 | major allergen Pru ar 1-like | 58040706 | 58041532 |
| LOC104415205 | scaffold0008 | major allergen Pru ar 1-like | 58026813 | 58027698 |
| LOC104415206 | scaffold0008 | major allergen Pru ar 1-like | 58052364 | 58053248 |
| LOC104415209 | scaffold0008 | major allergen Pru ar 1-like | 58071441 | 58072223 |
| LOC104415211 | scaffold0008 | major allergen Pru ar 1-like | 58078042 | 58078973 |
| LOC104415212 | scaffold0008 | major allergen Pru ar 1-like | 58081746 | 58082621 |
| LOC104415213 | scaffold0008 | major allergen Pru ar 1-like | 58084821 | 58085687 |
| LOC104419011 | scaffold0009 | endochitinase-like | 25149486 | 25151347 |
| LOC104422218 | scaffold0010 | uncharacterized protein | 21581164 | 21582481 |
| LOC104425880 | scaffold0011 | miraculin-like | 30418678 | 30419730 |
| LOC104428733 | scaffold0045 | polyphenol oxidase, chloroplastic-like | 403715 | 406482 |
| LOC104430480 | scaffold0001 | cationic peroxidase 1-like | 10988229 | 10991371 |
| LOC104438326 | scaffold0001 | pathogenesis-related protein STH-2-like | 1819744 | 1820574 |
| LOC104441046 | scaffold0004 | polyphenol oxidase, chloroplastic-like | 11894466 | 11897452 |
| LOC104445691 | scaffold0005 | probable WRKY transcription factor 31 isoform X1 | 69013730 | 69016591 |
| LOC104447583 | scaffold0001 | 1-aminocyclopropane-1-carboxylate oxidase homolog 4-like | 37081804 | 37083288 |
| LOC104447594 | scaffold0001 | 1-aminocyclopropane-1-carboxylate oxidase homolog 4-like | 37094712 | 37096232 |
| LOC104450568 | scaffold0001 | lichenase | 4937091 | 4938832 |
| LOC104456214 | scaffold0008 | endochitinase PR4-like | 4453707 | 4454827 |
| LOC104456215 | scaffold0008 | chitinase 6-like | 4487298 | 4488395 |
| LOC104456217 | scaffold0008 | endochitinase PR4-like | 4496392 | 4497507 |
| LOC104456219 | scaffold0008 | endochitinase PR4-like | 4522275 | 4523440 |
| LOC104456220 | scaffold0008 | endochitinase PR4-like | 4533169 | 4534322 |
| LOC104456221 | scaffold0008 | chitinase 6-like | 4546083 | 4547171 |
| LOC104456223 | scaffold0008 | endochitinase PR4-like | 4553763 | 4554816 |
Fig. 8 Screenshot of the JBrowse instance for Syzygium longifolium.
| Subject area | Genetics and Transcriptomics |
| More specific subject area | Transcriptomics of Myrtaceae species |
| Type of data | Table, figure |
| How data was acquired | Leaves of individual plants from three endemic Myrtaceae species from New Caledonia were sampled for total RNA extraction. Paired-end libraries were prepared and RNA sequencing was performed on the Illumina HiSeq™ 2500 system. The obtained data were subjected to 1) |
| Data format | |
| Experimental factors | Non-infected and infected individuals from |
| Experimental features | For the RNA sequencing and transcriptome analysis, a total of 24 leaf samples from three host species were collected: three infected and three non-infected individuals from |
| Data source location | The nursery was located in Farino, South Province, New Caledonia (Long: 165.772024, Lat: -21.663800). The protected reserve was located in Bois du Sud, South Province, New Caledonia (Long: 166.758640, Lat: -22.169974) |
| Data accessibility | All raw data for |
| The Superseries is available at | |
| The VCF files can be downloaded at | |
| Related research article | Soewarto J, Carriconde F, Hugot N, Bocs S, Hamelin C, Maggia L. Impact of |
| Pathology. 2018;48 (2). |