| Literature DB >> 29206218 |
João P Marques1,2, Mafalda S Ferreira1,2, Liliana Farelo1, Colin M Callahan3, Klaus Hackländer4, Hannes Jenny5, W Ian Montgomery6,7, Neil Reid6,7, Jeffrey M Good3, Paulo C Alves1,2, José Melo-Ferreira1,2.
Abstract
We report the first mountain hare (Lepus timidus) transcriptome, produced by de novo assembly of RNA-sequencing reads. Data were obtained from eight specimens sampled in two localities, Alps and Ireland. The mountain hare tends to be replaced by the invading European hare (Lepus europaeus) in their numerous contact zones where the species hybridize, which affects their gene pool to a yet unquantified degree. We characterize and annotate the mountain hare transcriptome, detect polymorphism in the two analysed populations and use previously published data on the European hare (three specimens, representing the European lineage of the species) to identify 4 672 putative diagnostic sites between the species. A subset of 85 random independent SNPs was successfully validated using PCR and Sanger sequencing. These valuable genomic resources can be used to design tools to assess population status and monitor hybridization between species.Entities:
Mesh:
Year: 2017 PMID: 29206218 PMCID: PMC5716010 DOI: 10.1038/sdata.2017.178
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1Approximate mountain and European hare distribution.
Approximate distributions of the mountain hare, Lepus timidus, and the European hare, L. europaeus, in Eurasia with indication of the areas of contact and of broad geographic overlap between the species (distribution ranges were adapted from IUCN Spatial Data Resources; IUCN 2016[51]). Circles indicate the mountain hare sampling locations for this work (open circle—Ireland; closed circle—Alps).
Figure 2Methodological workflow.
Flowchart of the RNA-sequencing setup and data analysis steps. Commands used in the analytical steps shown in bold are detailed in Table 1 (available online only).
Open access tools and commands used to perform data analyses (analytical steps correspond to those in Fig. 2)
| Analytical Step | Description | Software/Version | Command |
|---|---|---|---|
| Read QC | Read quality control | FastQC v0.11.5 | fastqc /path_to/raw.fq.gz (Data Citation 1 and Data Citation 2) |
| Clean Reads | Adaptor and low quality trimming | TRIMMOMATIC v0.36 | java -jar /path_to/trimmomatic-0.36.jar PE -phred33 -threads 8 raw_R1.fq.gz raw_R2.fq.gz clean_FP.fq.gz clean_FU.fq.gz clean_RP.fq.gz clean_RU.fq.gz HEADCROP:10 ILLUMINACLIP:/path_to/adapters_list.fa:2:30:10 TRAILING:10 SLIDINGWINDOW:4:10 MINLEN:25 |
|
| Transcriptome assembly | Trinity v2.2.0 | Trinity --seqType fq --left clean_FP.fq.gz --right clean_RP.fq.gz --CPU 20 --max_memory 150G --SS_lib_type RF --output trinity_assembly |
| Assembly curation | Filtering out contigs with low read support | Transrate v1.0.3 | transrate --assembly Ltimidus_Trinity.fasta --left clean_FP.fq.gz --right clean_RP.fq.gz --threads 10 --reference Oryctolagus_cuniculus.OryCun2.0.81.pep.all.fa --output transrate_Ltimidus_Trinity |
| Remove redundancy | Clustering of highly homologous sequences | CD-HIT-EST v4.6.4 | cd-hit-est -i good.Ltimidus_Trinity.fasta -c 0.95 -o AlpsIrel.fasta |
| ORF prediction | Filtering based on candidate coding regions and pfam annotation | TransDecoder v3.0.0 | TransDecoder.LongOrfs -t AlpsIrel.fasta |
| HMMER v3.1b2 | hmmscan --cpu 8 --domtblout pfam.domtblout /path_to/Pfam-A.hmm transdecoder_dir/longest_orfs.pep | ||
| TransDecoder v3.0.0 | TransDecoder.Predict -t AlpsIrel.fasta --cpu 2 --retain_pfam_hits pfam.domtblout | ||
| Annotation | Annotation assessment | Trinotate v3.0.1 | wget " |
| Gunzip | gunzip Trinotate.sqlite.gz | ||
| Conditional reciprocal best blast annotation | crb-blast v0.6.6 | crb-blast --query AlpsIrel.cds --target database(SP and Ocun) --threads 4 --split 4 --output blastx.outfmt6 | |
| crb-blast v0.6.6 | crb-blast --query AlpsIrel.pep --target database(SP and Ocun) --threads 4 --split 4 --output blastp.outfmt6 | ||
| Signalp annotation | signalp v4.1 | signalp -f short -n signalp.out AlpsIrel.pep | |
| Pfam annotation | HMMER v3.1b2 | hmmscan --cpu 2 --domtblout TrinotatePFAM.out Pfam-A.hmm AlpsIrel.pep | |
| tmhmm annotation | tmHMM v2.0 | tmhmm --short < AlpsIrel.pep > tmhmm.out | |
| Combine annotations | Trinity utilities v2.2.0 | /path_to/trinityrnaseq-2.2.0/util/support_scripts/get_Trinity_gene_to_trans_map.pl AlpsIrel.fasta >AlpsIrel.gene_trans_map | |
| Trinotate v3.0.1 | Trinotate Trinotate.sqlite init --gene_trans_map AlpsIrel.gene_trans_map --transcript_fasta AlpsIrel.fasta --transdecoder_pep AlpsIrel.pep | ||
| SwissProt annotation load | Trinotate v3.0.1 | Trinotate Trinotate.sqlite LOAD_swissprot_blastp SP.blastp.outfmt6 #and# Trinotate Trinotate.sqlite LOAD_swissprot_blastx SP.blastx.outfmt6 | |
| Trinotate v3.0.1 | 1. Trinotate Trinotate.sqlite LOAD_custom_blast --outfmt6 Ocun.blastp.outfmt6 --prog blastp --dbtype Ocun; 2. Trinotate Trinotate.sqlite LOAD_custom_blast --outfmt6 Ocun.blastx.outfmt6 --prog blastx --dbtype Ocun | ||
| Pfam annotation load | Trinotate v3.0.1 | Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out | |
| tmhmm annotation load | Trinotate v3.0.1 | Trinotate Trinotate.sqlite LOAD_tmhmm tmhmm.out | |
| Signalp annotation load | Trinotate v3.0.1 | Trinotate Trinotate.sqlite LOAD_signalp signalp.out | |
| Joint annotation file | Trinotate v3.0.1 | Trinotate Trinotate.sqlite report > LtimidusTranscriptome.xls | |
| Mapping | Read mapping onto the curated reference | bwa-mem v0.7.15 | bwa index AlpsIrel.cds |
| bwa-mem v0.7.15 | bwa mem -t 10 -R '@RG\tID:pop_sample_lane\tSM:popsample\tLB:LIBsample' AlpsIrel.cds Sample_L*_FP.fq.gz Sample_L*_RP.fq.gz > Sample_lane.sam | ||
| Bam conversion,sort and fixmate | Fixmate and BAM conversion | SAMtools v1.3.1 | samtools fixmate --output-fmt BAM sample_lane.sam sample_lane_fixmate.bam |
| BAM sort | SAMtools v1.3.1 | samtools sort -O bam -o sample_lane_sorted.bam -T /path_to/temp/ sample_lane_fixmate.bam | |
| Remove duplicates | Mark and remove duplicates | Picard v1.140 | java -jar /path_to/picard.jar MarkDuplicates REMOVE_DUPLICATES=True MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=950 ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT I=sample_lane_sorted.bam I=sample_lane_sorted.bam I=sample_lane_sorted.bam O=sample_rmdup.bam M=duplic_stats_sample TMP_DIR=/path_to/temp |
| Realignment and recalibration | Realignment | GATK v3.6-0 | java -jar /path_to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R AlpsIrel.cds -I sample_rmdup.bam -o sample_int.list |
| Recalibration | GATK v3.6-0 | java -jar /path_to/GenomeAnalysisTK.jar -T IndelRealigner -R AlpsIrel.cds -I sample_rmdup.bam -targetIntervals sample_int.list -o sample_realign.bam | |
| SNP call | SNP call | Reads2snp v2.0.64 | reads2snp_2.0.64.bin -bamlist LtimLeur_list.txt -bamref AlpsIrel.cds -out LtimVsLeur -min 10 -nbth 12 -th1 0.95 -par 1 -th2 0.01 -opt bfgs -fis 0.0 -pre 0.001 -rqt 20 |
| Differentiation analysis | Remove indels and missing data | VCFtools v0.1.14 | vcftools --vcf LtimVsLeur.vcf --recode --recode-INFO-all --remove-indels --max-missing-count 0 --out LtimVsLeur_noindels |
| Extract 1 SNP per contig | VCFtools v0.1.14 | vcftools --vcf LtimVsLeur_noindels.recode.vcf --recode --recode-INFO-all --thin 10000 --min-alleles 2 --out LtimVsLeur_1SNPperContig | |
| VCF to STRUCTURE conversion | PGDSpyder v2.1.1.0 | java -Xmx1024m -Xms512m -jar /path_to/PGDSpider2-cli.jar -inputfile LtimVsLeur_1SNPperContig.recode.vcf -inputformat VCF -outputfile LtimVsLeur_SNPs -outputformat STRUCTURE -spid VCF_to_STRUCTURE.spid | |
| Structure analysis | STRUCTURE v2.3.4 | structure -m mainparams (standard parameters except 1 million steps after a burn-in period of 200 000, K=2 and admixture model) | |
| CLUMPACK v42089 | The Web version was used - | ||
| PCA analysis | PLINK v1.90b3.45 | plink --file LtimVsLeur_1SNPperContig --pca 3 | |
| ggplot2 R package v2.2.1 | 1. R; 2. library(ggfortify); 3. pca <- read.table('plink.eigenvec', header=TRUE); 4. df <- pca[c(3, 4)]; 5. autoplot(prcomp(df), data=pca, colour='Species.Pop', size=5) | ||
| GO enrichment | Gene Ontology enrichment analysis | g:Profiler | Available at |
Summary of sample data information deposited in the NCBI database.
| Sample_3101 | liver | RNA-seq | SAMN06186748 | |
| Sample_3102 | liver | RNA-seq | SAMN06186761 | |
| Sample_3103 | liver | RNA-seq | SAMN06186762 | |
| Sample_3105 | liver | RNA-seq | SAMN06186763 | |
| Sample_3112 | liver | RNA-seq | SAMN06186727 | |
| Sample_3113 | liver | RNA-seq | SAMN06186728 | |
| Sample_3114 | liver | RNA-seq | SAMN06186729 | |
| Sample_3116 | liver | RNA-seq | SAMN06186738 |
Illumina RNA-seq data deposited in the NCBI database.
| Sample_3101 | SRR5133282 | 26,598,712 | 2,525 |
| Sample_3102 | SRR5133280 | 26,128,525 | 2,532 |
| Sample_3103 | SRR5133285 | 24,469,456 | 2,414 |
| Sample_3105 | SRR5133283 | 26,662,182 | 2,582 |
| Sample_3112 | SRR5133287 | 22,444,667 | 2,263 |
| Sample_3113 | SRR5133281 | 20,825,930 | 2,100 |
| Sample_3114 | SRR5133286 | 32,749,011 | 3,294 |
| Sample_3116 | SRR5133284 | 21,690,965 | 2,189 |
Mountain hare transcriptome assembly and curation statistics.
| Raw Reads | 207,882,430 |
| Clean Reads | 201,569,448 |
| Mapped Reads | 136,511,846 |
| Raw | |
| Number of contigs | 272,183 |
| Largest (bp) | 14,048 |
| Smallest (bp) | 201 |
| N50 (bp) | 839 |
| Mean (bp) | 594 |
| Post assembly curation (TransRate) | |
| Number of contigs | 113,694 |
| Largest (bp) | 14,048 |
| Smallest (bp) | 201 |
| N50 (bp) | 801 |
| Mean (bp) | 567 |
| Post redundancy removal (CD-HIT-EST) | |
| Number of contigs | 109,239 |
| Largest (bp) | 14,048 |
| Smallest (bp) | 201 |
| N50 (bp) | 765 |
| Mean (bp) | 554 |
| Post open reading frame prediction (TransDecoder) | |
| Number of contigs | 25,868 |
| Largest (bp) | 13,728 |
| Smallest (bp) | 297 |
| N50 (bp) | 1,182 |
| Mean (bp) | 842 |
| Reference Coverage (%) | 42 |
Mapping statistics.
| Sample_3101 | 26,598,712 | 19,648,435 | 74 | |
| Sample_3102 | 26,128,525 | 18,781,893 | 72 | |
| Sample_3103 | 24,469,456 | 16,102,091 | 66 | |
| Sample_3105 | 26,662,182 | 18,429,333 | 69 | |
| Sample_3112 | 22,444,667 | 13,913,982 | 62 | |
| Sample_3113 | 20,825,930 | 13,935,177 | 67 | |
| Sample_3114 | 32,749,011 | 21,360,771 | 65 | |
| Sample_3116 | 21,690,965 | 14,340,164 | 66 | |
| Sample_H1 | 20,825,930 | 14,100,961 | 62 | |
| Sample_H2 | 32,749,011 | 28,922,352 | 57 | |
| Sample_H3 | 21,690,965 | 38,522,367 | 55 |
Figure 3Annotation summary.
Number of transcripts annotated with different combinations of methods and databases: all transcripts; transcripts annotated with crb-blast against rabbit transcriptome; transcripts annotated with a unidirectional BLASTx against rabbit transcriptome; transcripts annotated with crb-blast against the Swiss-Prot database; and transcripts annotated with a unidirectional BLASTx against the Swiss-Prot database.
Figure 4Characterization of inferred SNPs in the sampled populations and species.
(a) Relative proportion of the 41 182 SNPs mapped to the mountain hare transcriptome, summarized as polymorphic within each species and fixed or shared between L. timidus (mountain hare) and L. europaeus (European hare). The proportion is shown considering the complete L. timidus dataset (i) and only the Irish (ii) and Alpine (iii) populations. (b) STRUCTURE analysis to evaluate cluster membership and admixture proportions. Individuals are sorted by population and species. Mountain hare populations are shown in blue and European hare individuals in orange. (c) Principal Component Analysis (PCA) plot using one SNP per contig. The first principal component (PC1) splits species and the second (PC2) the sampled populations.