Seongmun Jeong, Jiwoong Kim, Won Park, Hongmin Jeon, Namshin Kim.
Abstract
Over the last decade, a large number of nucleotide sequences have been generated by next-generation sequencing technologies and deposited in public databases. However, most of these datasets do not specify the sex of individuals sampled because researchers typically ignore or hide this information. Male and female genomes in many species have distinctive sex chromosomes, XX/XY and ZW/ZZ, and expression levels of many sex-related genes differ between the sexes. Herein, we describe how to develop sex marker sequences from syntenic regions of sex chromosomes and use them to quickly identify the sex of individuals being analyzed. Array-based technologies routinely use either known sex markers or the B-allele frequency of X or Z chromosomes to deduce the sex of an individual. The same strategy has been used with whole-exome/genome sequence data; however, all reads must be aligned onto a reference genome to determine the B-allele frequency of the X or Z chromosomes. SEXCMD is a pipeline that can extract sex marker sequences from reference sex chromosomes and rapidly identify the sex of individuals from whole-exome/genome and RNA sequencing after training with a known dataset through a simple machine learning approach. The pipeline counts total numbers of hits from sex-specific marker sequences and identifies the sex of the individuals sampled based on the fact that XX/ZZ samples do not have Y or W chromosome hits. We have successfully validated our pipeline with mammalian (Homo sapiens; XY) and avian (Gallus gallus; ZW) genomes. Typical calculation time when applying SEXCMD to human whole-exome or RNA sequencing datasets is a few minutes, and analyzing human whole-genome datasets takes about 10 minutes. Another important application of SEXCMD is as a quality control measure to avoid mixing samples before bioinformatics analysis. SEXCMD comprises simple Python and R scripts and is freely available at https://github.com/lovemun/SEXCMD.
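The counting rule described in the abstract can be sketched as follows. This is a minimal illustration and not the SEXCMD implementation: the function name `call_sex`, the `min_total` and `max_specific_fraction` thresholds, and the use of a plain fraction are all assumptions made for this example, whereas the published pipeline derives its decision rule by training on samples of known sex.

```python
def call_sex(shared_hits: int, specific_hits: int, system: str = "XY",
             min_total: int = 20, max_specific_fraction: float = 0.05) -> str:
    """Call sex from marker hit counts (illustrative sketch only).

    shared_hits   -- reads matching the X (or Z) copies of the markers
    specific_hits -- reads matching the Y (or W) copies of the markers
    system        -- "XY" (e.g. human) or "ZW" (e.g. chicken)
    Both thresholds are hypothetical defaults, not values from the paper.
    """
    if system not in ("XY", "ZW"):
        raise ValueError(f"unknown sex-chromosome system: {system!r}")

    total = shared_hits + specific_hits
    if total < min_total:
        # Too few marker hits to make a call either way.
        return "undetermined"

    # XX/ZZ samples are expected to show (almost) no Y/W hits.
    heterogametic = (specific_hits / total) > max_specific_fraction

    if system == "XY":
        return "male" if heterogametic else "female"
    return "female" if heterogametic else "male"


if __name__ == "__main__":
    print(call_sex(500, 0))                 # human sample with no Y hits
    print(call_sex(300, 250, system="ZW"))  # chicken sample with W hits
```

Note that the heterogametic sex differs between the two systems: Y hits indicate a male in an XY species, while W hits indicate a female in a ZW species.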
Year: 2017 PMID: 28886064 PMCID: PMC5590872 DOI: 10.1371/journal.pone.0184087
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Procedure for extracting sex-specific marker sequences.
Two sex chromosomes were aligned with each other using LASTZ, and syntenic regions with polymorphisms were extracted. Final sex-specific marker sequences were selected after removal of similar sequences (90% identity by BLAST).
Fig 2. Average read counts for each marker by input number of million sequence reads (log10) for human (hg38) datasets.
Red arrows indicate minimum read counts: 5 million (5×10⁶) reads for whole-exome sequencing and RNA sequencing and 100 million (1×10⁸) reads for whole-genome sequencing. The red horizontal line denotes the minimum average read counts of sex-specific marker sequences.
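As a small illustration, the minimum read counts from this caption can be encoded as a pre-flight check before a sample is analyzed. The dictionary keys and the function name `has_enough_reads` are hypothetical, and treating the figure's values as hard lower bounds on input reads is an assumption of this sketch.

```python
# Minimum input reads per sequencing type, taken from the Fig 2 caption.
# The key names are illustrative labels, not identifiers used by SEXCMD.
MIN_READS = {
    "WES": 5_000_000,      # whole-exome sequencing
    "RNA-seq": 5_000_000,  # RNA sequencing
    "WGS": 100_000_000,    # whole-genome sequencing
}


def has_enough_reads(seq_type: str, n_reads: int) -> bool:
    """Return True if n_reads meets the minimum for the given sequencing type."""
    if seq_type not in MIN_READS:
        raise ValueError(f"unknown sequencing type: {seq_type!r}")
    return n_reads >= MIN_READS[seq_type]
```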
Summary of datasets used for testing and validation.
| Organism | Sequencing type | Male | Female | Total |
|---|---|---|---|---|
| Human | Whole-genome sequencing | 253 | 163 | 421 |
| Human | Exome sequencing | 264 | 338 | 602 |
| Human | RNA sequencing | 59 | 72 | 131 |
| Chicken | Whole-genome sequencing | 60 | 60 | 120 |
| Chicken | RNA sequencing | 0 | 36 | 36 |
| Total | | 636 | 671 | 1,307 |
Approximately half of the datasets were from our in-house database and the others were from public databases. Human and chicken were chosen because they have two different configurations of sex chromosomes (XY and ZW).
Sex-specific marker sequences for humans.
| Marker position (chr X) | Gene name | Marker position (chr Y) | Gene name | Seq. len. | Mismatch |
|---|---|---|---|---|---|
| chrX:11298648–11298818 | AMELX | chrY:6868192–6868362 | AMELY | 171 | 14 |
| chrX:11298809–11298973 | AMELX | chrY:6868037–6868201 | AMELY | 165 | 11 |
| chrX:12975506–12975677 | TMSB4X | chrY:13703601–13703772 | TMSB4Y | 172 | 23 |
| chrX:41136804–41137021 | USP9X | chrY:12726575–12726792 | USP9Y | 218 | 18 |
| chrX:41140966–41141217 | USP9X | chrY:12735998–12736249 | USP9Y | 252 | 29 |
| chrX:41143291–41143443 | USP9X | chrY:12738157–12738309 | USP9Y | 153 | 15 |
| chrX:41166023–41166214 | USP9X | chrY:12773734–12773925 | USP9Y | 192 | 19 |
| chrX:41168007–41168218 | USP9X | chrY:12776649–12776860 | USP9Y | 212 | 28 |
| chrX:41169995–41170234 | USP9X | chrY:12778019–12778258 | USP9Y | 240 | 25 |
| chrX:41184397–41184675 | USP9X | chrY:12786522–12786800 | USP9Y | 279 | 36 |
| chrX:41189313–41189475 | USP9X | chrY:12793039–12793201 | USP9Y | 163 | 22 |
| chrX:41223217–41223402 | USP9X | chrY:12846333–12846518 | USP9Y | 186 | 15 |
| chrX:45059247–45059466 | KDM6A | chrY:13359767–13359986 | UTY | 220 | 20 |
| chrX:45060609–45060764 | KDM6A | chrY:13358464–13358619 | UTY | 156 | 15 |
| chrX:45063422–45063817 | KDM6A | chrY:13355003–13355398 | UTY | 396 | 73 |
| chrX:45069579–45069955 | KDM6A | chrY:13335959–13336335 | UTY | 377 | 65 |
| chrX:45069962–45070357 | KDM6A | chrY:13335563–13335958 | UTY | 396 | 56 |
| chrX:45110079–45110249 | KDM6A | chrY:13251017–13251187 | UTY | 171 | 19 |
| chrX:45111953–45112110 | KDM6A | chrY:13249180–13249337 | UTY | 158 | 17 |
| chrX:53193437–53193636 | KDM5C | chrY:19706441–19706640 | KDM5D | 200 | 15 |
| chrX:53194139–53194577 | KDM5C | chrY:19707147–19707585 | KDM5D | 439 | 62 |
| chrX:53194546–53194708 | KDM5C | chrY:19707554–19707716 | KDM5D | 163 | 15 |
| chrX:53195231–53195410 | KDM5C | chrY:19708244–19708423 | KDM5D | 180 | 15 |
| chrX:53196686–53197044 | KDM5C | chrY:19709451–19709809 | KDM5D | 359 | 63 |
| chrX:53198490–53198642 | KDM5C | chrY:19715352–19715504 | KDM5D | 153 | 27 |
| chrX:53198977–53199158 | KDM5C | chrY:19715823–19716004 | KDM5D | 182 | 15 |
| chrX:53201550–53201743 | KDM5C | chrY:19716279–19716472 | KDM5D | 194 | 27 |
| chrX:53214689–53214847 | KDM5C | chrY:19732584–19732742 | KDM5D | 159 | 32 |
| chrX:53217796–53217966 | KDM5C | chrY:19741318–19741488 | KDM5D | 171 | 20 |
| chrX:53224740–53224908 | KDM5C | chrY:19744385–19744553 | KDM5D | 169 | 27 |
| chrX:5890658–5890815 | NLGN4X | chrY:14843196–14843353 | NLGN4Y | 158 | 17 |
| chrX:5890913–5891080 | NLGN4X | chrY:14842933–14843100 | NLGN4Y | 168 | 19 |
| chrX:5891146–5891303 | NLGN4X | chrY:14842706–14842863 | NLGN4Y | 158 | 16 |
| chrX:5892286–5892519 | NLGN4X | chrY:14841542–14841775 | NLGN4Y | 234 | 34 |
| chrX:5892564–5892721 | NLGN4X | chrY:14841340–14841497 | NLGN4Y | 158 | 23 |
| chrX:6151347–6151560 | NLGN4X | chrY:14622026–14622239 | NLGN4Y | 214 | 19 |
| chrX:6228566–6228845 | NLGN4X | chrY:14522638–14522917 | NLGN4Y | 280 | 25 |
| chrX:9465226–9465425 | TBL1X | chrY:6910762–6910961 | TBL1Y | 200 | 87 |
Thirty-eight sex-specific marker sequences were generated using the hg38 human genome assembly. Sequence lengths were 153–439 bp with 11–87 mismatches.
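The table columns are related in a simple way that can be checked programmatically: each sequence length equals the inclusive span of its coordinates, and the mismatch count gives the fraction of identical positions between the X and Y copies. The sketch below uses hypothetical helper names (`span_length`, `identity`) and assumes 1-based inclusive coordinates, which is consistent with the lengths listed in the table.

```python
import re

# Matches strings such as "chrX:11298648–11298818"; accepts an en dash or a hyphen.
COORD = re.compile(r"^(chr\w+):(\d+)[–-](\d+)$")


def span_length(coord: str) -> int:
    """Length in bp of a 1-based, inclusive coordinate range."""
    match = COORD.match(coord)
    if match is None:
        raise ValueError(f"cannot parse coordinate: {coord!r}")
    start, end = int(match.group(2)), int(match.group(3))
    return end - start + 1


def identity(seq_len: int, mismatches: int) -> float:
    """Fraction of positions that are identical between the two marker copies."""
    return 1 - mismatches / seq_len


if __name__ == "__main__":
    # First AMELX/AMELY marker from the table: 171 bp with 14 mismatches.
    print(span_length("chrX:11298648–11298818"))
    print(round(identity(171, 14), 3))
```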
Sex-specific marker sequences for chickens.
| Marker position (chr Z) | Gene name | Marker position (chr W) | Gene name | Seq. len. | Mismatch |
|---|---|---|---|---|---|
| chrZ:1524420–1524638 | SMAD2 | chrW:397369–397587 | | 219 | 39 |
| chrZ:1555791–1555999 | | chrW:347291–347499 | | 209 | 50 |
| chrZ:1556251–1556531 | | chrW:346752–347032 | | 281 | 54 |
| chrZ:1557294–1557452 | | chrW:345857–346015 | | 159 | 14 |
| chrZ:18953921–18954146 | | chrW:809288–809513 | | 226 | 52 |
| chrZ:19036774–19036925 | | chrW:1096613–1096764 | FET1 | 152 | 5 |
| chrZ:19055988–19056151 | | chrW:1194410–1194573 | | 164 | 15 |
| chrZ:435849–436019 | ST8SIA3 | chrW:513668–513838 | | 171 | 11 |
| chrZ:437109–437271 | ST8SIA3 | chrW:519588–519750 | | 163 | 27 |
| chrZ:439711–439871 | ST8SIA3 | chrW:523214–523374 | | 161 | 30 |
| chrZ:53600234–53600410 | | chrW:619786–619962 | | 177 | 45 |
| chrZ:7219805–7219996 | LOC407092 | chrW:132663–132854 | UBAP2 | 192 | 15 |
| chrZ:7297965–7298239 | LOC407092 | chrW:98620–98894 | UBAP2 | 275 | 49 |
| chrZ:7300018–7300231 | LOC407092 | chrW:97675–97888 | UBAP2 | 214 | 28 |
| chrZ:7300957–7301146 | LOC407092 | chrW:96741–96930 | UBAP2 | 190 | 42 |
Fifteen sex-specific marker sequences were generated using the galGal5 genome assembly. Sequence lengths were 152–281 bp with 5–54 mismatches.
Accuracy of sex identification by SEXCMD.
| Source | WGS correct/total | WGS accuracy | WES correct/total | WES accuracy | RNA-seq correct/total | RNA-seq accuracy |
|---|---|---|---|---|---|---|
| Human male | 253/253 | 100% | 264/264 | 100% | 59/59 | 100% |
| Human female | 163/163 | 100% | 338/338 | 100% | 72/72 | 100% |
| Chicken male | 60/60 | 100% | - | - | - | - |
| Chicken female | 60/60 | 100% | - | - | 36/36 | 100% |
| Total | 536/536 | 100% | 602/602 | 100% | 167/167 | 100% |
SEXCMD showed 100% accuracy of sex identification for human and chicken with all three sequencing data types tested: whole-exome sequencing, whole-genome sequencing, and RNA sequencing.