Ishag Adam1, Mohammad Shafiul Alam2, Sisay Alemu3,4,5, Chanaki Amaratunga6, Roberto Amato7, Voahangy Andrianaranjaka8, Nicholas M Anstey9, Abraham Aseffa3, Elizabeth Ashley10,11, Ashenafi Assefa12, Sarah Auburn9,11,13, Bridget E Barber14,15, Alyssa Barry16,17,18, Dhelio Batista Pereira19, Jun Cao20,21, Nguyen Hoang Chau22, Kesinee Chotivanich23, Cindy Chu24, Arjen M Dondorp13, Eleanor Drury7, Diego F Echeverry25, Berhanu Erko26, Fe Espino27, Rick Fairhurst28, Abdul Faiz29, María Fernanda Villegas30, Qi Gao20, Lemu Golassa26, Sonia Goncalves7, Matthew J Grigg9, Yaghoob Hamedi31, Tran Tinh Hien22, Ye Htut32, Kimberly J Johnson7, Nadira Karunaweera33,34, Wasif Khan2, Srivicha Krudsood23, Dominic P Kwiatkowski7, Marcus Lacerda35,36, Benedikt Ley9, Pharath Lim6,37, Yaobao Liu20,21, Alejandro Llanos-Cuentas38, Chanthap Lon39, Tatiana Lopera-Mesa40, Jutta Marfurt9, Pascal Michon41, Olivo Miotto7,13, Rezika Mohammed42, Ivo Mueller16, Chayadol Namaik-Larp43, Paul N Newton10,11, Thuy-Nhien Nguyen11,22, Francois Nosten11,24, Rintis Noviyanti44, Zuleima Pava45, Richard D Pearson7, Beyene Petros4, Aung P Phyo13,46, Ric N Price9,11,13, Sasithon Pukrittayakamee23, Awab Ghulam Rahim47, Milijaona Randrianarivelojosia48,49, Julian C Rayner50, Angela Rumaseb9, Sasha V Siegel7, Victoria J Simpson7, Kamala Thriemer9, Alberto Tobon-Castano40, Hidayat Trimarsanto44, Marcelo Urbano Ferreira51,52, Ivan D Vélez53, Sonam Wangchuk54, Thomas E Wellems6, Nicholas J White11,13, Timothy William55,56, Maria F Yasnot57, Daniel Yilma58.
Abstract
This report describes the MalariaGEN Pv4 dataset, a new release of curated genome variation data on 1,895 samples of Plasmodium vivax collected at 88 worldwide locations between 2001 and 2017. It includes 1,370 new samples contributed by MalariaGEN and VivaxGEN partner studies in addition to previously published samples from these and other sources. We provide genotype calls at over 4.5 million variable positions including over 3 million single nucleotide polymorphisms (SNPs), as well as short indels and tandem duplications. This enlarged dataset highlights major compartments of parasite population structure, with clear differentiation between Africa, Latin America, Oceania, Western Asia and different parts of Southeast Asia. Each sample has been classified for drug resistance to sulfadoxine, pyrimethamine and mefloquine based on known markers at the dhfr, dhps and mdr1 loci. The prevalence of all of these resistance markers was much higher in Southeast Asia and Oceania than elsewhere. This open resource of analysis-ready genome variation data from the MalariaGEN and VivaxGEN networks is driven by our collective goal to advance research into the complex biology of P. vivax and to accelerate genomic surveillance for malaria control and elimination.
Keywords: data resource; genomic epidemiology; genomics; malaria; plasmodium vivax
Year: 2022 PMID: 35651694 PMCID: PMC9127374 DOI: 10.12688/wellcomeopenres.17795.1
Source DB: PubMed Journal: Wellcome Open Res ISSN: 2398-502X
Count of samples in the dataset.
Countries are grouped into seven geographic regions based on their geographic and genetic characteristics. For each country, the table reports: the number of distinct sampling locations; the total number of samples sequenced; the number of high-quality samples; the number of high-quality samples included in the analysis; and the percentage of samples collected between 2015 and 2017, the most recent sampling period in the dataset. Seventy samples are from countries that are genetically distinct from the seven regions, and a further 48 samples from Bangkok could not be assigned to either the WSEA or ESEA region. These 118 samples (of which 41 passed QC) are classified as unassigned. The breakdown by site is reported in Table 2 and the list of contributing studies in Table 3 and Table 4.
| Region | Country | Sampling locations | Sequenced | QC pass | Analysis set | % analysis samples 2015–2017 |
|---|---|---|---|---|---|---|
| Latin America | | 6 | 71 | 21 | 21 | 24% |
| | | 12 | 112 | 67 | 67 | 39% |
| | | 1 | 2 | 1 | 1 | 0% |
| | | 5 | 20 | 20 | 20 | 0% |
| | | 1 | 1 | 1 | 1 | 0% |
| | | 1 | 1 | 1 | 1 | 0% |
| | | 6 | 123 | 48 | 48 | 15% |
| Africa | | 7 | 203 | 137 | 137 | 39% |
| Western Asia | | 2 | 250 | 36 | 36 | 81% |
| | | 4 | 14 | 5 | 5 | 0% |
| | | 1 | 15 | 5 | 5 | 0% |
| | | 1 | 2 | 1 | 1 | 0% |
| Western SE Asia (WSEA) | | 5 | 141 | 127 | 127 | 7% |
| Eastern SE Asia (ESEA) | | 7 | 236 | 172 | 172 | 28% |
| | | 2 | 3 | 2 | 2 | 0% |
| | | 6 | 139 | 103 | 103 | 88% |
| Maritime SE Asia (MSEA) | | 2 | 109 | 73 | 73 | 0% |
| | | 1 | 6 | 3 | 3 | 100% |
| Oceania | | 2 | 282 | 191 | 191 | 18% |
| | | 4 | 47 | 17 | 17 | 0% |
| Unassigned | | 1 | 28 | 6 | 0 | |
| | | 1 | 9 | 2 | 0 | |
| | | 1 | 5 | 5 | 0 | |
| | | 3 | 4 | 4 | 0 | |
| | | 1 | 1 | 1 | 0 | |
| | | 2 | 9 | 8 | 0 | |
| | | 1 | 1 | 1 | 0 | |
| | | 1 | 13 | 4 | 0 | |
| | | 1 | 48 | 10 | 0 | |
| Total | | 88 | 1,895 | 1,072 | 1,031 | 30% |
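The relationship between the QC pass and Analysis set columns can be illustrated with a minimal Python sketch. The `Sample` record and its field names are hypothetical and are not taken from the data release; the only rule assumed is the one stated in the caption, namely that a sample enters the analysis set when it passes QC and has been assigned to one of the seven regions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    sample_id: str          # hypothetical identifier
    qc_pass: bool           # True for a high-quality sample
    region: Optional[str]   # None when the sample is unassigned


def analysis_set(samples: List[Sample]) -> List[Sample]:
    """Keep QC-pass samples that were assigned to a region."""
    return [s for s in samples if s.qc_pass and s.region is not None]


# Toy records (illustrative only, not real samples).
samples = [
    Sample("S1", True, "ESEA"),
    Sample("S2", True, None),      # passed QC but unassigned: excluded
    Sample("S3", False, "WSEA"),   # assigned but failed QC: excluded
    Sample("S4", False, None),
]
kept_ids = [s.sample_id for s in analysis_set(samples)]

# Totals reported in the table and caption.
qc_pass_total = 1072
unassigned_qc_pass = 41
analysis_total = qc_pass_total - unassigned_qc_pass

print(kept_ids, analysis_total)
```

With the reported totals, removing the 41 unassigned QC-pass samples from the 1,072 QC-pass samples gives the 1,031 samples in the analysis set.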
External studies contributing samples.
| External Study ID | Manuscript title | Citation | Samples | Sites |
|---|---|---|---|---|
| | Population genomics | pubmed | 195 | Acrelândia (Brazil), Ampasimpotsy (Madagascar), Belem |
| | Selective sweep suggests | pubmed | 78 | Battambang (Cambodia), Kampot (Cambodia), Oddar |
| | Frequent expansion | pubmed | 24 | Jimma (Ethiopia) |
| Total | | | 297 | |
MalariaGEN studies contributing samples.
| Study ID | Study title | Contact | Samples | Sites |
|---|---|---|---|---|
| | Genomics of parasite | Thomas E Wellems | 82 | Pursat (Cambodia), Ratanakiri |
| | Developing the | Marcelo Ferreira | 5 | Brazil (Brazil) |
| | Developing the | Nadira Karunaweera | 2 | Kataragama (Sri Lanka) |
| | Developing the | Tran Tinh Hien | 13 | Binh Phuoc (Vietnam), Viet Anh |
| | Developing the | Ivo Mueller | 20 | East Sepik (Papua New Guinea), |
| | Tracking Resistance to | Elizabeth Ashley | 4 | Bago (Myanmar), Binh Phuoc |
| | The prevalence of | Lemu Golassa | 88 | Amhara (Ethiopia), Oromia |
| | Genotyping | Milijaona | 1 | Maevatanana (Madagascar) |
| | A global survey of | Anup Pingle anup. | 357 | Bangkok (Thailand), Cali |
| | Characterisation of drug | Sarah Auburn | 359 | Papua Indonesia (Indonesia), |
| | | Sarah Auburn | 667 | Anhui (China), Antioquia |
| Total | | | 1,598 | |
Breakdown of analysis set samples by geography.
Sites are divided into seven regions as described in the main text. Note that samples from Pakchong and Sisaket in eastern Thailand have been assigned to the Eastern SE Asia (ESEA) region, whereas samples from elsewhere in Thailand have been assigned to the Western SE Asia (WSEA) region. The 41 samples that passed QC but were not assigned to one of the seven regions have been excluded from analyses.
| Region | Country | First-level administrative division | Site | Sequenced | Analysis set |
|---|---|---|---|---|---|
| Latin America | | | | 6 | 4 |
| | | | | 7 | 1 |
| | | | | 13 | 1 |
| | | | | 37 | 14 |
| | | | | 1 | 1 |
| | | | | 7 | 0 |
| | | | | 3 | 2 |
| | | | | 8 | 2 |
| | | | | 1 | 0 |
| | | | | 26 | 13 |
| | | | | 1 | 0 |
| | | | | 3 | 3 |
| | | | | 1 | 1 |
| | | | | 43 | 37 |
| | | | | 2 | 2 |
| | | | | 16 | 3 |
| | | | | 3 | 3 |
| | | | | 5 | 1 |
| | | | | 2 | 1 |
| | | | | 1 | 1 |
| | | | | 1 | 1 |
| | | | | 1 | 1 |
| | | | | 16 | 16 |
| | | | | 1 | 1 |
| | | | | 1 | 1 |
| | | | | 1 | 1 |
| | | | | 89 | 16 |
| | | | | 10 | 10 |
| | | | | 4 | 4 |
| | | | | 10 | 9 |
| | | | | 6 | 5 |
| | | | | 4 | 4 |
| Africa | | | | 19 | 17 |
| | | | | 28 | 11 |
| | | | | 3 | 2 |
| | | | | 4 | 0 |
| | | | | 44 | 26 |
| | | | | 69 | 51 |
| | | | | 36 | 30 |
| Western Asia | | | | 95 | 10 |
| | | | | 155 | 26 |
| | | | | 2 | 2 |
| | | | | 1 | 0 |
| | | | | 1 | 0 |
| | | | | 2 | 1 |
| | | | | 8 | 2 |
| | | | | 15 | 5 |
| | | | | 2 | 1 |
| Western SE Asia (WSEA) | | | | 20 | 20 |
| | | | | 4 | 4 |
| | | | | 42 | 40 |
| | | | | 11 | 5 |
| | | | | 64 | 58 |
| Eastern SE Asia (ESEA) | | | | 9 | 9 |
| | | | | 9 | 9 |
| | | | | 2 | 1 |
| | | | | 133 | 104 |
| | | | | 1 | 1 |
| | | | | 79 | 46 |
| | | | | 3 | 2 |
| | | | | 1 | 1 |
| | | | | 2 | 1 |
| | | | | 1 | 0 |
| | | | | 30 | 15 |
| | | | | 31 | 26 |
| | | | | 34 | 28 |
| | | | | 42 | 33 |
| | | | | 1 | 1 |
| Maritime SE Asia (MSEA) | | | | 108 | 73 |
| | | | | 1 | 0 |
| | | | | 6 | 3 |
| Oceania | | | | 253 | 175 |
| | | | | 29 | 16 |
| | | | | 8 | 1 |
| | | | | 3 | 0 |
| | | | | 6 | 0 |
| | | | | 30 | 16 |
| Total | | | | 1,777 | 1,031 |
Summary of discovered variant positions.
We divide variant positions into those containing single nucleotide polymorphisms (SNPs) and non-SNPs (indels and combinations of SNPs and indels at the same position). We then further sub-divide each of these into those within exons (coding) and those in intronic or intergenic regions (non-coding). We further sub-divide SNPs into those containing only two alleles (bi-allelic) or those containing three or more alleles (multi-allelic). Discovered variant positions are unique positions in the reference genome where either SNP or indel variation was discovered by our analysis pipeline. Pass variant positions are the subset of discovered positions that passed our quality filters. Alleles per pass position shows the mean number of distinct alleles at each pass position; biallelic variants have two alleles by definition.
| Type | Coding | Multi-allelic | Discovered variant positions | Pass variant positions | % pass | Alleles per pass position |
|---|---|---|---|---|---|---|
| SNP | Coding | Bi-allelic | 827,373 | 440,222 | 53% | 2.0 |
| SNP | Coding | Multi-allelic | 40,311 | 17,111 | 42% | 3.0 |
| SNP | Non-coding | Bi-allelic | 1,927,558 | 471,679 | 24% | 2.0 |
| SNP | Non-coding | Multi-allelic | 288,212 | 16,637 | 6% | 3.0 |
| non-SNP | Coding | | 279,694 | 138,544 | 50% | 3.4 |
| non-SNP | Non-coding | | 1,207,908 | 219,791 | 18% | 3.4 |
| Total | | | 4,571,056 | 1,303,984 | 29% | 2.4 |
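The derived columns of this table can be recomputed from the raw counts. The Python sketch below is only an illustration, not the pipeline used to produce the release: it takes the discovered and pass counts from the table, computes the pass rate for each category, and assumes that the overall "alleles per pass position" figure is the pass-count-weighted mean of the per-category values (which are themselves rounded to one decimal place).

```python
# (type, coding, allelic class, discovered, pass, alleles per pass position)
rows = [
    ("SNP", "Coding", "Bi-allelic", 827_373, 440_222, 2.0),
    ("SNP", "Coding", "Multi-allelic", 40_311, 17_111, 3.0),
    ("SNP", "Non-coding", "Bi-allelic", 1_927_558, 471_679, 2.0),
    ("SNP", "Non-coding", "Multi-allelic", 288_212, 16_637, 3.0),
    ("non-SNP", "Coding", "", 279_694, 138_544, 3.4),
    ("non-SNP", "Non-coding", "", 1_207_908, 219_791, 3.4),
]


def percent(numerator: int, denominator: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


pass_rates = [percent(r[4], r[3]) for r in rows]

total_discovered = sum(r[3] for r in rows)
total_pass = sum(r[4] for r in rows)
total_rate = percent(total_pass, total_discovered)

# Assumption: the overall mean weights each category by its pass count.
mean_alleles = sum(r[4] * r[5] for r in rows) / total_pass

print(pass_rates, total_discovered, total_pass, total_rate, round(mean_alleles, 1))
```

Under these assumptions the recomputed totals and percentages agree with the values in the table.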
Figure 1. Population structure.
(A) First two components of a genome-wide principal coordinate analysis. Each point represents one of 1,072 QC pass samples coloured according to country groupings (Table 1): Latin America (green, n=159); Africa (red, n=137); Western Asia (orange, n=47); West south-east Asia (blue, n=127); East south-east Asia (purple, n=277); Maritime south-east Asia (pink, n=76); Oceania (brown, n=208); Unassigned samples (grey, n=41). This shows the genetic separation of samples into seven distinct geographic clusters. It also shows that samples that have not been assigned to a region look distinct from those from the seven regions. After removal of the 41 unassigned samples we have an analysis set of 1,031 samples. (B) Genome-wide unrooted neighbour-joining tree showing population structure across all sites from the seven regions (1,031 analysis set samples), with sample branches coloured as in A. This shows that maritime Southeast Asia has large numbers of very highly related parasites, and that clear relatedness is also present among some samples from Latin America and Africa.
Geographic patterns of tandem duplications.
Breakpoint IDs are shown in the first column (Duplication name) and can be matched to the per-sample breakpoints in the data release. Breakpoints are generally poly-A or poly-T repeats. The First and Second breakpoint columns show the start position and sequence of each breakpoint in the reference genome (A 18 denotes a poly-A sequence of 18 bases, i.e. AAAAAAAAAAAAAAAAAA). The Length column shows the length in bp between the inner ends of the breakpoints. Percentages in the Frequency columns show the proportion of samples that could be genotyped that have a duplication (copy number >= 1.5). LAM=Latin America, AF=Africa, WAS=West Asia, WSEA=West south-east Asia, ESEA=East south-east Asia, MSEA=Maritime south-east Asia, OCE=Oceania, n=range of numbers of samples that could be genotyped at the different duplications.
| Duplication name | Chrom | Length | First breakpoint | Second breakpoint | Frequency (LAM) | Frequency (AF) | Frequency (WAS) | Frequency (WSEA) | Frequency (ESEA) | Frequency (MSEA) | Frequency (OCE) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | PvP01_06_v1 | 7,333 | 980,472 A 18 | 987,823 A 15 | 0% | 73% | 0% | 29% | 35% | 7% | 5% |
| | PvP01_06_v1 | 8,179 | 980,472 A 18 | 988,669 A 22 | 0% | 10% | 0% | 0% | 0% | 0% | 0% |
| | PvP01_09_v1 | 44,831 | 392,555 GG | 437,388 GG | 0% | 0% | 0% | 0% | <1% | 0% | 0% |
| | PvP01_10_v1 | 38,134 | 468,190 A 15 | 506,339 A 18 | 0% | 0% | 0% | 19% | 0% | 0% | 0% |
| | PvP01_14_v1 | 26,452 | 2,894,706 GAAG | 2,921,162 GAAG | 0% | 0% | 0% | 0% | 0% | 0% | 3% |
| | PvP01_14_v1 | 11,798 | 2,901,140 A 11 | 2,912,949 A 30 | 0% | 0% | 0% | 0% | 0% | 0% | 1% |
| | PvP01_14_v1 | 3,517 | 2,903,559 T 17 | 2,907,093 T 16 | 0% | 0% | 0% | 0% | 0% | 0% | 26% |
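The Length and Frequency columns can be reproduced with a short Python sketch. Two points are assumptions rather than statements from the source: that "between the inner ends" means the second breakpoint start minus the end of the first breakpoint sequence (first start plus sequence length), a convention that happens to match every row of the table; and that ungenotyped samples are represented as `None`. The function names are illustrative.

```python
# (chromosome, first start, first breakpoint sequence, second start, reported length)
# "A" * 18 spells out the table's "A 18" notation.
duplications = [
    ("PvP01_06_v1", 980_472, "A" * 18, 987_823, 7_333),
    ("PvP01_06_v1", 980_472, "A" * 18, 988_669, 8_179),
    ("PvP01_09_v1", 392_555, "GG", 437_388, 44_831),
    ("PvP01_10_v1", 468_190, "A" * 15, 506_339, 38_134),
    ("PvP01_14_v1", 2_894_706, "GAAG", 2_921_162, 26_452),
    ("PvP01_14_v1", 2_901_140, "A" * 11, 2_912_949, 11_798),
    ("PvP01_14_v1", 2_903_559, "T" * 17, 2_907_093, 3_517),
]


def inner_length(first_start, first_seq, second_start):
    """Distance from the end of the first breakpoint to the start of the second."""
    return second_start - (first_start + len(first_seq))


def duplication_frequency(copy_numbers, threshold=1.5):
    """Share of genotyped samples whose copy number is at or above the threshold."""
    genotyped = [cn for cn in copy_numbers if cn is not None]
    if not genotyped:
        return None
    return sum(cn >= threshold for cn in genotyped) / len(genotyped)


computed_lengths = [inner_length(s1, seq, s2) for _, s1, seq, s2, _ in duplications]
reported_lengths = [d[4] for d in duplications]

# Toy copy-number calls (illustrative only); None marks an ungenotyped sample.
toy_frequency = duplication_frequency([1.0, 2.1, None, 1.6, 0.9])

print(computed_lengths == reported_lengths, toy_frequency)
```

In the toy example, two of the four genotyped samples reach the 1.5 threshold, so the frequency is 0.5; the ungenotyped sample is left out of the denominator, as described in the caption.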
Frequency of different sets of polymorphisms putatively associated with drug resistance in samples from different geographical regions.
All samples were classified into different types of drug resistance based on published genetic markers; these classifications represent a best attempt based on the available data. Each type of inferred resistance was considered to be either present, absent or unknown for a given sample. For each inferred resistance type, the table reports: the genetic markers considered; the drug they are associated with; and the proportion of samples in each region classified as inferred resistant, out of the samples where the type was not unknown. The number of samples classified as either resistant or not resistant varies for each type of inferred resistance considered (e.g. due to different levels of genomic accessibility); numbers in brackets in the header report the minimum and maximum number analysed, while the exact numbers are reported in brackets below each percentage. SP: sulfadoxine-pyrimethamine; treatment: SP used for the clinical treatment of uncomplicated malaria. Details of the rules used to infer resistance status from genetic markers can be found on the resource page at www.malariagen.net/resource/30.
| Marker | Associated drug | Latin America | Africa | West Asia | Western SE Asia | Eastern SE Asia | Maritime SE Asia | Oceania |
|---|---|---|---|---|---|---|---|---|
| | Pyrimethamine | 1% | 0% | 0% | 89% | 0% | 93% | 77% |
| | Sulfadoxine | 55% | 23% | 13% | 100% | 89% | 95% | 88% |
| | Mefloquine | 0% | 0% | 0% | 18% | 0% | 0% | 0% |
| | SP (treatment) | 0% | 0% | 0% | 88% | 0% | 93% | 77% |
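The way each percentage is defined in the caption, with unknown calls excluded from the denominator, can be sketched in Python as follows. The status labels and the function name are hypothetical; the actual rules for inferring status from genetic markers are those on the resource page and are not reproduced here.

```python
from collections import Counter


def resistance_frequency(statuses):
    """Return (proportion inferred resistant, number of samples analysed).

    Each status is "present", "absent" or "unknown" (hypothetical labels).
    Unknown calls are excluded from the denominator.
    """
    counts = Counter(statuses)
    analysed = counts["present"] + counts["absent"]
    if analysed == 0:
        return None, 0
    return counts["present"] / analysed, analysed


# Toy calls for one resistance type in one region (illustrative only).
toy_calls = ["present", "absent", "unknown", "present", "absent", "absent"]
proportion, n_analysed = resistance_frequency(toy_calls)

print(f"{proportion:.0%} ({n_analysed})")
```

Here two of the five samples with a known status are inferred resistant, giving 40% with n=5, which mirrors the "percentage with the exact number in brackets" layout described in the caption.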