Daniel H Lewis, David E Jarvis, Peter J Maughan.
Abstract
PREMISE: Many programs can identify simple sequence repeat (SSR) motifs in genomic data. SSRgenotyper extends SSR identification to en masse genotyping from resequencing data for diversity panels and linkage mapping populations.
Keywords: genotyping; linkage mapping; microsatellite; simple sequence repeat; simple tandem repeat
Year: 2020 PMID: 33344093 PMCID: PMC7742204 DOI: 10.1002/aps3.11402
Source DB: PubMed Journal: Appl Plant Sci ISSN: 2168-0450 Impact factor: 1.936
FIGURE 1. SSRgenotyper workflow. Sequences from each individual are mapped against a FASTA file containing the targeted SSR and 100 bp of flanking sequence. SSR targets can be developed from reference genomes using SSR-finding software (Appendix 1) or taken from predeveloped SSRs. Read mapping, PCR duplicate removal, and read mapping quality control can be accomplished with a variety of alignment software (Appendix 2). SAM files are then used by SSRgenotyper for genotyping. Alleles are called based on their repeat number. Output files include GENEPOP and linkage analysis formats.
FIGURE 2. Accuracy and error profile for SSRgenotyper. (A) Total number of genotypes reported that were concordant (blue) and discordant (red) at various minimum read support levels. (B) Percent concordant (blue) and discordant (red) genotypes based on di-, tri-, or tetranucleotide motif type. (C) Percent concordant (blue) and discordant (red) genotypes based on repeat length as identified in the reference FASTA file.
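The idea in the Figure 1 caption that alleles are called by repeat number can be illustrated with a minimal Python sketch. This is not SSRgenotyper's implementation: the exact-match flank test, the function names `count_repeats` and `call_alleles`, the `min_support` threshold, and the toy reads are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_repeats(read, left_flank, motif, right_flank):
    """Return the number of tandem motif copies between the flanks, or None."""
    pattern = (
        re.escape(left_flank)
        + "((?:" + re.escape(motif) + ")+)"
        + re.escape(right_flank)
    )
    match = re.search(pattern, read)
    if match is None:
        return None  # read does not span the whole repeat with both flanks
    return len(match.group(1)) // len(motif)


def call_alleles(reads, left_flank, motif, right_flank, min_support=2):
    """Tally repeat numbers across reads; keep those seen at least min_support times."""
    counts = Counter()
    for read in reads:
        n = count_repeats(read, left_flank, motif, right_flank)
        if n is not None:
            counts[n] += 1
    return sorted(n for n, support in counts.items() if support >= min_support)


# Toy reads (made up): two with 6 AG copies, two with 8, one with 7, one truncated.
LEFT, MOTIF, RIGHT = "ACGT", "AG", "TTCA"
reads = [
    LEFT + MOTIF * 6 + RIGHT,
    LEFT + MOTIF * 6 + RIGHT,
    LEFT + MOTIF * 8 + RIGHT,
    LEFT + MOTIF * 8 + RIGHT,
    LEFT + MOTIF * 7 + RIGHT,
    LEFT + MOTIF * 5,  # missing right flank, so it is ignored
]
alleles = call_alleles(reads, LEFT, MOTIF, RIGHT)
print(alleles)
```

In this toy case the repeat number supported by a single read is dropped, which mirrors the role that a minimum read support threshold plays in Figure 2A.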
| 1. | Run MISA: |
| perl misa.pl my_Reference.fasta | |
| MISA requires a misa.ini file in the directory where it is executed; the file should look like this: | |
| definition(unit_size,min_repeats): 2-6 3-4 4-4 | |
| interruptions(max_difference_between_2_SSRs): 100 | |
| GFF: true | |
| 2. | Modify the MISA‐produced .gff files as follows: |
| A. Remove any compound SSRs and calculate how much flanking sequence is available at each SSR locus: | |
| for i in *.gff; do grep -v "compound" "$i" | awk '{if ($5-$4 >10 && $5-$4 <50) print $1 "\t" $4-100 "\t" $5+100}' > "$i.mod.gff"; echo "processing $i"; done | |
| *We note that the size of the flanking region can be changed using the awk statement above (simply change the "-100" and "+100" to the flanking size wanted). | |
| B. Concatenate the modified .gff files: | |
| for i in *.mod.gff; do cat $i >> cat.gff; echo "processing $i"; done | |
| C. Remove SSRs that do not have sufficient flanking sequence: | |
| awk '($2 >= 0)' cat.gff > cat_filter1.gff | |
| D. Use BEDTools getfasta to make the SsrReferenceFile.fasta: | |
| bedtools getfasta -fi my_Reference.fasta -bed cat_filter1.gff -fo SsrReferenceFile.fasta |
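The filtering logic in steps 2A to 2C above can be sketched in Python for readers who prefer to see the conditions spelled out. This is a hedged re-expression of the grep and awk commands, not part of the published pipeline; the function name `gff_to_flanked_intervals`, its parameters, and the toy GFF lines are assumptions.

```python
def gff_to_flanked_intervals(gff_lines, flank=100, min_len=10, max_len=50):
    """Mimic steps 2A-2C: drop compound SSRs, keep SSRs whose end-minus-start
    is between min_len and max_len (exclusive), add flanks, and drop loci
    whose left flank would start before position 0."""
    intervals = []
    for line in gff_lines:
        if not line.strip() or "compound" in line:
            continue  # equivalent of: grep -v "compound"
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a feature line
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        span = end - start
        if not (min_len < span < max_len):
            continue  # equivalent of: $5-$4 > 10 && $5-$4 < 50
        left = start - flank
        if left < 0:
            continue  # equivalent of: awk '($2 >= 0)'
        intervals.append((seqid, left, end + flank))
    return intervals


# Toy GFF feature lines (made up for illustration).
toy_gff = [
    "chr1\tMISA\tSSR\t500\t520\t.\t.\t.\tID=1",
    "chr1\tMISA\tSSR\t50\t70\t.\t.\t.\tID=2",        # left flank would be negative
    "chr1\tMISA\tcompound\t900\t930\t.\t.\t.\tID=3",  # compound SSR
    "chr2\tMISA\tSSR\t1000\t1005\t.\t.\t.\tID=4",     # too short
]
intervals = gff_to_flanked_intervals(toy_gff)
for seqid, left, right in intervals:
    print(f"{seqid}\t{left}\t{right}")
```

Like the awk filter, the sketch checks only the left edge of each interval; it does not test whether the right flank runs past the end of the reference sequence.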
| 1. | Index the SsrReferenceFile.fasta: |
| bwa index SsrReferenceFile.fasta | |
| 2. | Map the Illumina reads to the SsrReferenceFile.fasta (paired‐end reads process shown): |
| for forward_file in *_1P.fq.gz; do name=$(echo "$forward_file" | sed 's/_1P\.fq\.gz$//'); bwa mem -M ../reference/SsrReferenceFile.fasta "${name}_1P.fq.gz" "${name}_2P.fq.gz" -o "$name.sam"; done | |
| Note: Raw reads should be pre‐trimmed. Using Trimmomatic (Bolger et al., | |
| 3. | Remove PCR duplicate reads: |
| A. Sort SAM files by read name: | |
| for i in *.sam; do samtools sort -n -o $i.sorted $i; done | |
| B. Identify mate coordinates: | |
| for i in *.sorted; do samtools fixmate -m $i $i.fixmate; done | |
| C. Re-sort SAM files by genomic location: | |
| for i in *.fixmate; do samtools sort -o $i.position $i; done | |
| D. Mark and remove duplicates: | |
| for i in *.position; do samtools markdup -r $i $i.markdup; done | |
| *Each sample will have a markdup file where PCR duplicates have been removed (the other files can be deleted). | |
| 4. | Read mapping quality control: |
| A. While the whole SAM file can be passed to SSRgenotyper, we encourage users to first filter the markdup file with SAMtools to improve performance and to remove errantly mapped reads. This command will remove reads with mapping quality < 45. | |
| for i in *.markdup; do samtools view $i -q 45 > $i.Q45.sam; done | |
| Note: SSRgenotyper provides further filtering (option -Q) that can be used for additional filter stringency if required. | |
| 5. | A .txt file listing all of the quality-controlled SAM files (samFiles.txt) to be processed by SSRgenotyper can be produced with: |
| ls *.Q45.sam > samFiles.txt |
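The mapping-quality filter in step 4 above can be illustrated with a small Python sketch that applies the same threshold to SAM text lines. It is an illustration only, assuming uncompressed SAM input; the function name `filter_sam_lines` and the toy records are made up, and in practice the SAMtools command should be used.

```python
def filter_sam_lines(lines, min_mapq=45):
    """Keep alignment lines whose MAPQ (5th SAM column) is at least min_mapq."""
    kept = []
    for line in lines:
        if line.startswith("@"):
            continue  # header lines; samtools view without -h also omits them
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # SAM columns: QNAME FLAG RNAME POS MAPQ ...
        if mapq >= min_mapq:
            kept.append(line)
    return kept


# Toy SAM content (made up): one header line and three alignments.
toy_sam = [
    "@SQ\tSN:locus1\tLN:250",
    "read1\t0\tlocus1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tlocus1\t12\t44\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tlocus1\t15\t45\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = filter_sam_lines(toy_sam)
for line in kept:
    print(line)
```

Reads with mapping quality exactly 45 are retained, since the threshold removes only reads with mapping quality below 45.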
| 1. | Run command for accuracy and error profile analysis, where -S (shown as # in the command) was provided at various read coverage depths, ranging from 1 to 10 ( |
| python3 SSRgenotyperV3.py SsrReferenceFile.fasta samFiles.txt quinoa -M 0.35 -R 4 -P 3 -B 3 -S # -F 0.35 -f 0.30 -Q 60 -s 0.1 -m 0 -N 11 | |
| 2. | Run command for the |
| python3 SSRgenotyperV3.py SsrReferenceFile.fasta samFiles.txt BayreuthXShahdara -M 0.51 -R 4 -P 3 -B 3 -S 1 -F 0.35 -f 0.30 -Q 60 -s 0.1 -m 0 -N 11 -L 0.25 | |
| 3. | Run command for the foxtail millet RIL population (BioProject no.: PRJNA562988, SRA no.: SRR10038747–SRR10038795; Liu et al., |
| python3 SSRgenotyperV3.py SsrReferenceFile.fasta samFiles.txt foxtail -M 0.51 -R 4 -P 3 -B 3 -S 1 -F 0.35 -f 0.30 -Q 60 -s 0.1 -m 0 -N 11 -L 0.25 | |
| 4. | Run command for the quinoa diversity panel (BioProject no.: PRJNA306026, SRA no.: SRR4300210–SRR4300229; Jarvis et al., |
| python3 SSRgenotyperV3.py SsrReferenceFile.fasta samFiles.txt quinoa -M 0.25 -R 4 -P 3 -B 3 -S 4 -F 0.35 -f 0.30 -Q 60 -s 0.1 -m 0 -N 11 -G |
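The accuracy analysis in item 1 repeats one command with the -S value changed from 1 to 10. A minimal Python sketch for generating those ten command lines follows. It only builds and prints the strings and does not run SSRgenotyper; every argument is copied from the command in item 1, the helper name `accuracy_commands` is hypothetical, and no meaning is assumed for any flag. Whether repeated runs with the same third positional argument overwrite each other's output is not stated here, so that should be checked before executing them.

```python
import shlex

# Arguments copied verbatim from the accuracy-analysis command in item 1.
BASE = ["python3", "SSRgenotyperV3.py", "SsrReferenceFile.fasta", "samFiles.txt", "quinoa"]
BEFORE_S = ["-M", "0.35", "-R", "4", "-P", "3", "-B", "3"]
AFTER_S = ["-F", "0.35", "-f", "0.30", "-Q", "60", "-s", "0.1", "-m", "0", "-N", "11"]


def accuracy_commands(depths=range(1, 11)):
    """Return one argument list per -S value, keeping all other options fixed."""
    return [BASE + BEFORE_S + ["-S", str(depth)] + AFTER_S for depth in depths]


commands = accuracy_commands()
for argv in commands:
    print(shlex.join(argv))  # shell-ready command line for review or a job script
```

Keeping each command as an argument list makes it straightforward to pass to `subprocess.run` later, once the output naming has been confirmed.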