Thiago Peixoto Leal1,2, Vinicius C Furlan1, Mateus Henrique Gouveia1,3, Julia Maria Saraiva Duarte1, Pablo As Fonseca1,4, Rafael Tou1, Marilia de Oliveira Scliar5, Gilderlanio Santana de Araujo6, Lucas F Costa1, Camila Zolini1,7,8, Maria Gabriela Campolina Diniz Peixoto9, Maria Raquel Santos Carvalho1, Maria Fernanda Lima-Costa10, Robert H Gilman11,12, Eduardo Tarazona-Santos1,8,12, Maíra Ribeiro Rodrigues1,13.
Abstract
Genetic and omics analyses frequently require independent observations, which are not guaranteed in real datasets. When relatedness cannot be accounted for, the usual solution is to remove related individuals (or observations), which reduces the available data. We developed a network-based relatedness-pruning method, NAToRA, that minimizes dataset reduction while removing unwanted relationships. It uses the node degree centrality metric to identify highly connected nodes (individuals) and implements heuristics that approximate the minimal reduction of a dataset, which allows its application to complex datasets. Compared with two popular population genetics tools (PLINK and KING), NAToRA shows the best combination of removing all relatives while keeping the largest possible number of individuals in all datasets tested, with effects on the allele frequency spectrum and Principal Component Analysis similar to those of PLINK and KING. NAToRA is freely available, both as a standalone tool that can be easily incorporated into a pipeline and as a graphical web tool that allows visualization of the relatedness networks. NAToRA also accepts a variety of relationship metrics as input, which facilitates its use. We also release the genealogy simulator software used for the tests performed in this study.
Keywords: ARP, All-Relatives Pruning; Complex network theory; DU, Dataset Unrelated; GRM, Genetic Relatedness Matrix; Genealogies simulator; Genetic kinship; KING, Kinship-based INference for Genome-wide association studies; MAF, Minor Allele Frequency; NAToRA, Network Algorithm to Relatedness Analysis; NDC, Node Degree Centrality; Nc, Network with cuts; PCA, Principal Component Analysis; Population genetics; REAP, Relatedness Estimation in Admixed Populations; SNV, Single Nucleotide Variation
Year: 2022 PMID: 35521552 PMCID: PMC9046962 DOI: 10.1016/j.csbj.2022.04.009
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Overview of the NAToRA (Network Algorithm To Relatedness Analysis) algorithm. (a) Input file with relatedness metrics for pairs of individuals. (b) Relatedness network modeled by NAToRA with a minimum kinship cutoff of 0.07; grey-scale colours represent families of genetically related individuals. (c), (d) and (e) Node elimination process for the dark grey family network, in which the individuals with the highest node degree centrality (NDC, shown in white boxes) are iteratively removed (in this case, individuals 1 and 2, with NDC = 3). (f) Relatedness-pruned network. (g) Output file listing the individuals to be removed from the dataset.
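The elimination loop in panels (c) to (e) can be sketched as a plain greedy procedure. This is a minimal illustration under stated assumptions, not NAToRA's actual implementation: the function names, the tuple-based input format, the tie-breaking rule and the toy pairs are all hypothetical, and NAToRA's own heuristic and optimal modes may differ.

```python
from collections import defaultdict


def build_network(pairs, cutoff):
    """Build an undirected graph from (id_a, id_b, relatedness) tuples,
    keeping only pairs whose value is at or above the cutoff."""
    graph = defaultdict(set)
    for a, b, value in pairs:
        if a != b and value >= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def prune_by_degree(pairs, cutoff):
    """Greedy sketch: repeatedly remove the node with the highest degree
    until no edges remain. Returns the removed IDs in removal order."""
    graph = build_network(pairs, cutoff)
    removed = []
    while True:
        # Only nodes that still have at least one relative are candidates.
        degrees = {node: len(nbrs) for node, nbrs in graph.items() if nbrs}
        if not degrees:
            break
        # Assumption: ties are broken by ID so the result is deterministic.
        target = min(degrees, key=lambda n: (-degrees[n], str(n)))
        for neighbour in graph.pop(target):
            graph[neighbour].discard(target)
        removed.append(target)
    return removed


# Hypothetical toy family (not the data behind Fig. 1): individuals "1" and
# "2" are each related to "3", "4" and "5"; the pair "6"-"7" falls below
# the 0.07 cutoff and is therefore ignored.
toy_pairs = [
    ("1", "3", 0.25), ("1", "4", 0.25), ("1", "5", 0.25),
    ("2", "3", 0.25), ("2", "4", 0.25), ("2", "5", 0.25),
    ("6", "7", 0.03),
]
to_remove = prune_by_degree(toy_pairs, cutoff=0.07)
```

In this toy network, removing the two degree-3 individuals leaves three unrelated individuals, whereas removing everyone who has at least one relative would leave none of the five.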
Comparison of relatedness-pruning results when generating datasets that retain only kinship relationships below second degree. The methods compared are PLINK (--rel-cutoff), KING (--degree 2 --unrelated), NAToRA, and the all-relatives pruning strategy. Cutoff values are 0.1768 for the relationship coefficient (PI_HAT, calculated by PLINK) and degree 2 for the kinship coefficient (calculated by KING). For NAToRA, we let the algorithm select the method (heuristic or optimal).
| Dataset | Relatedness estimator | Individuals (relationships) before pruning | Pruning method | Individuals kept | Relationships remaining |
|---|---|---|---|---|---|
| BAMBUÍ | PLINK | 1442 (1572) | plink -- | 947 | 234 |
| | | | all-relatives pruning | 491 | 0 |
| | | | NAToRA -c 0.1768 | 869 | 0 |
| | KING | 1442 (920) | king --degree 2 --unrelated | 880 | 1 |
| | | | all-relatives pruning | 602 | 0 |
| | | | NAToRA --degree 2 | 950 | 0 |
| SHIMAA | PLINK | 45 (95) | plink -- | 26 | 8 |
| | | | all-relatives pruning | 10 | 0 |
| | | | NAToRA -c 0.1768 | 23 | 0 |
| | KING | 45 (68) | king --degree 2 --unrelated | 45 | 68 |
| | | | all-relatives pruning | 12 | 0 |
| | | | NAToRA --degree 2 | 25 | 0 |
| GUZERÁ | PLINK | 1036 (17875) | plink -- | 212 | 368 |
| | | | all-relatives pruning | 3 | 0 |
| | | | NAToRA -c 0.1768 | 175 | 0 |
| | KING | 1036 (12861) | king --degree 2 --unrelated | 87 | 0 |
| | | | all-relatives pruning | 24 | 0 |
| | | | NAToRA --degree 2 | 218 | 0 |
Fig. 2. The impact of relatedness-pruning methods on Minor Allele Frequency (MAF) spectra. For each allele-frequency class, bars represent the number of SNVs after each relatedness-pruning method (PLINK or KING, NAToRA, and all-relatives pruning (ARP)) minus the number of SNVs in the original dataset. Positive values mean that the pruned dataset has more SNVs in that MAF interval than the original dataset. We divided the SNVs into four classes: ultra rare (0 < MAF ≤ 0.01), rare (0.01 < MAF ≤ 0.05), common (MAF > 0.05) and monomorphic (MAF = 0). The monomorphic class includes the SNVs lost due to the pruning procedure. For the SHIMAA dataset, KING did not remove any individual, and therefore there is no data for any frequency class.
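The quantity plotted in Fig. 2 can be illustrated with a short sketch that assigns each SNV to one of the four MAF classes and subtracts the original counts from the pruned counts. The function names and the toy frequencies below are hypothetical and are not taken from the study's data or code.

```python
from collections import Counter

CLASSES = ["monomorphic", "ultra rare", "rare", "common"]


def maf_class(maf):
    """Assign a MAF value to one of the four classes used in the caption."""
    if maf == 0:
        return "monomorphic"
    if maf <= 0.01:
        return "ultra rare"
    if maf <= 0.05:
        return "rare"
    return "common"


def class_differences(original_mafs, pruned_mafs):
    """Per-class SNV count after pruning minus the count before pruning.
    Positive values mean more SNVs in that class after pruning."""
    before = Counter(maf_class(m) for m in original_mafs)
    after = Counter(maf_class(m) for m in pruned_mafs)
    return {c: after[c] - before[c] for c in CLASSES}


# Hypothetical MAFs for the same four SNVs before and after pruning.
original = [0.004, 0.008, 0.03, 0.2]
pruned = [0.0, 0.012, 0.03, 0.2]
diff = class_differences(original, pruned)
```

In this toy case one ultra-rare SNV becomes monomorphic after pruning and another moves into the rare class, so the ultra-rare bar would be negative while the monomorphic and rare bars would be positive.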
Fig. 3. Convex hull polygons of the Principal Component Analysis (PCA) before and after pruning with different methods. The methods used were PLINK or KING, NAToRA, and the all-relatives pruning strategy (ARP). We show the first two Principal Components. The PCA was performed on the original dataset, and the pruned individuals were then identified and mapped onto it.