Lisette Espín-Noboa, Claudia Wagner, Markus Strohmaier, Fariba Karimi.
Abstract
Though algorithms promise many benefits, including efficiency, objectivity and accuracy, they may also introduce or amplify biases. Here we study two well-known algorithms, namely PageRank and Who-to-Follow (WTF), and show to what extent their ranks produce inequality and inequity when applied to directed social networks. To this end, we propose a directed network model with preferential attachment and homophily (DPAH) and demonstrate the influence of network structure on the rank distributions of these algorithms. Our main findings suggest that (i) inequality is positively correlated with inequity, (ii) inequality is driven by the interplay between preferential attachment, homophily, node activity and edge density, and (iii) inequity is driven by the interplay between homophily and minority size. In particular, these two algorithms reduce, replicate and amplify the representation of minorities in top ranks when majorities are homophilic, neutral and heterophilic, respectively. Moreover, when this representation is reduced, minorities may improve their visibility in the rank by connecting strategically in the network, for instance by increasing their out-degree or homophily when majorities are also homophilic. These findings shed light on the social and algorithmic mechanisms that hinder equality and equity in network-based ranking and recommendation algorithms.
Year: 2022; PMID: 35132072; PMCID: PMC8821643; DOI: 10.1038/s41598-022-05434-1
Source DB: PubMed; Journal: Sci Rep; ISSN: 2045-2322; Impact factor: 4.379
Figure 1. Inequality and inequity. Every column represents a network with a certain level of homophily. All networks contain 20 nodes: 20% belong to the minority group (orange), and 80% to the majority group (blue). Edges follow a preferential attachment with homophily mechanism. The top row shows the graph and the level of homophily within groups (MM: majorities and mm: minorities). The second row shows all nodes in descending order of their PageRank scores. The third row represents the rank inequality: the Gini coefficient of the rank distribution at every top-k (black line). The global Gini refers to the Gini coefficient of the entire rank distribution (i.e., the top-k that covers all nodes). We see that the lower the k, the lower the Gini of the rank distribution. The bottom row represents the rank inequity: the percentage of minorities found in each top-k of the rank distribution (orange line). ME is the mean error of these percentages compared to a fair baseline or diversity constraint (i.e., how much the orange line deviates from the dotted line across all top-k's). Here we see three main patterns: (a, b) When the majority group is heterophilic, minorities are on average over-represented. (d, e) When majorities are homophilic, minorities are on average under-represented. (c) When both groups are neutral, the observed fraction of minorities is almost as expected.
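The two measures in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `gini` and `top_k_disparity` are hypothetical, top-k is taken as a plain node count, and the mean error is assumed to be signed as observed minus expected, so that negative values mean under-representation.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means all values are equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending data: sum_i (2i - n - 1) * x_i / (n * total)
    weighted = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs))
    return weighted / (n * total)


def top_k_disparity(scores, is_minority, ks):
    """Per top-k Gini and minority fraction, plus the mean error (ME).

    Assumption: ME is the average of (observed - expected) minority fraction
    over the given top-k cut-offs, with the population fraction as baseline.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    expected = sum(is_minority) / len(is_minority)
    rows = []
    for k in ks:
        top = order[:k]
        inequality = gini([scores[i] for i in top])
        observed = sum(is_minority[i] for i in top) / k
        rows.append((k, inequality, observed))
    mean_error = sum(obs - expected for _, _, obs in rows) / len(rows)
    return rows, mean_error
```

Calling `top_k_disparity(scores, is_minority, ks=range(1, len(scores) + 1))` yields one row per cut-off; the last row then holds the Gini of the entire distribution.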
Figure 2. Regions of disparity. We measure inequality (y-axis) as the skewness of the rank distribution, and inequity (x-axis) as the mean differences between the proportional representation of groups in top-k% ranks and the network. Highly skewed distributions lie in regions I to III (darker colors), and fair rankings, where minorities are well represented in top ranks, lie in regions II, V, VIII (green). The threshold that bounds the fair region is set arbitrarily, which allows for a flexible region of group fairness.
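A rough sketch of how a pair of inequality and inequity values could be mapped to one of the nine regions is shown below. Several things here are assumptions rather than facts from the text: the function name `disparity_region`, the default cut-offs `beta`, `gini_mid` and `gini_high`, the sign convention (negative mean error means under-representation), and the placement of regions IV, VI, VII and IX, which is inferred from the 3 by 3 layout implied by regions I to III and II, V, VIII.

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]


def disparity_region(gini, mean_error, beta=0.05, gini_mid=0.3, gini_high=0.6):
    """Map (inequality, inequity) to a region label.

    All threshold defaults are hypothetical placeholders.
    """
    # Rows: high inequality (I-III), moderate (IV-VI), low (VII-IX).
    if gini >= gini_high:
        row = 0
    elif gini >= gini_mid:
        row = 1
    else:
        row = 2
    # Columns: minorities under-represented, fairly represented, over-represented.
    if mean_error < -beta:
        col = 0
    elif mean_error > beta:
        col = 2
    else:
        col = 1
    return ROMAN[3 * row + col]
```

Keeping the thresholds as parameters reflects the caption's point that the width of the fair region is a free choice.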
Figure 3. DPAH model and ranking of nodes. (a) Illustration of the directed network model with preferential attachment and homophily (DPAH). First, nodes are created and randomly labeled according to the fraction of minorities. Then, the following algorithm repeats until a desired edge density is fulfilled. At time t, a source node p is drawn from a power-law (activity) distribution, and a target node j is drawn with a probability proportional to the product of its in-degree and the pair-wise homophily. At time t+1, a new edge is added between nodes based on the same mechanism. (b) The PageRank score of each node is shown under PR. Nodes in each top-k% of the rank are grouped based on the unique PageRank scores. In this example, the top-60% of nodes concentrate most of the PageRank and their scores are somewhat similar (i.e., low Gini). Also, the ranking is fair from top-80% onwards, since these ranks capture the same fraction of minorities as in the population, 25%. Local values are measured per top-k%, and global values are measured using the whole distribution for inequality (Gini), and the average across all top-k% ranks for inequity (mean error).
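The edge-formation loop described in panel (a) might look roughly like the following pure-Python sketch. It is not the authors' code. The function name `dpah`, the parameter names, the `seed` argument, the Pareto sampling of per-node activity, the +1 smoothing of the in-degree (so that nodes without incoming links can still be chosen), the use of `1 - h` for cross-group links, and the exclusion of self-loops and duplicate edges are all assumptions made for illustration.

```python
import random


def dpah(n, f_m, d, h_MM, h_mm, gamma_M, gamma_m, seed=None):
    """Return (is_minority, edges) for a DPAH-like directed network."""
    rng = random.Random(seed)

    # Label nodes: round(n * f_m) minority nodes, assigned at random.
    n_min = round(n * f_m)
    is_minority = [True] * n_min + [False] * (n - n_min)
    rng.shuffle(is_minority)

    # Assumption: one activity weight per node, Pareto-distributed so that
    # the density decays as x ** -gamma (Pareto shape = gamma - 1, gamma > 1).
    activity = [
        rng.paretovariate((gamma_m if minority else gamma_M) - 1.0)
        for minority in is_minority
    ]

    def homophily(src, dst):
        # Assumption: in-class homophily h, cross-class value 1 - h.
        h = h_mm if is_minority[src] else h_MM
        return h if is_minority[src] == is_minority[dst] else 1.0 - h

    in_degree = [0] * n
    edges = set()
    target_edges = int(d * n * (n - 1))  # density of a directed graph
    attempts = 0
    while len(edges) < target_edges and attempts < 100 * target_edges:
        attempts += 1
        # Source: drawn proportionally to its activity.
        src = rng.choices(range(n), weights=activity)[0]
        # Target: proportional to (in-degree + 1) * homophily.
        weights = [
            0.0 if dst == src or (src, dst) in edges
            else homophily(src, dst) * (in_degree[dst] + 1)
            for dst in range(n)
        ]
        if sum(weights) == 0.0:
            continue  # this source has no admissible target left
        dst = rng.choices(range(n), weights=weights)[0]
        edges.add((src, dst))
        in_degree[dst] += 1
    return is_minority, sorted(edges)
```

For example, `dpah(30, 0.2, 0.05, 0.8, 0.8, 3.0, 3.0, seed=1)` builds a small network with homophilic groups. Replacing the in-degree term with a constant would give a homophily-only variant, and setting both homophily values to 0.5 leaves preferential attachment as the only mechanism.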
Figure 4. The effects of homophily and fraction of minorities in the global disparity of PageRank. Columns represent the fraction of minorities in the network, the x-axis indicates the homophily within minorities, and the y-axis the homophily within majorities. Colors denote the region where the disparity lies according to our interpretation in Fig. 2. First, we see that, on average, there is never low global inequality (i.e., regions IV to IX, the lighter colors, do not appear). This makes sense because these are scale-free networks. Second, depending on the level of homophily within groups, minorities on average can be under-represented (region I, red), over-represented (region III, blue), or well-represented (region II, green). For example, when , minorities are on average under-represented when and .
Figure 5. The effects of homophily and fraction of minorities in the local disparity of PageRank. Columns represent the fraction of minorities (10%, 30% and 50%) and rows show homophily within minorities (from top to bottom: heterophilic, neutral and homophilic). The x-axis denotes the top-k% rank and the y-axis shows homophily within majorities. Colors refer to the regions of disparity introduced in Fig. 2. One can see that the minority suffers most (red) when the majority is homophilic and the minority is either heterophilic or neutral. Moreover, inequality is lowest (very light colors) only for a few cases at top-5%. This means that the top-ranked nodes are very similar and their ranks are far from those of the majority of nodes (i.e., due to preferential attachment). Moreover, inequity remains mostly consistent regardless of top-k%. In other words, if the ranking algorithm favors one group in the top-5% (e.g., red or blue), it will continue to do so until entering the fair regime (green).
Ten-fold cross-validation for PageRank.
| Type | Outcome | Corr | Out-of-sample score, mean (SD) | Feature | Importance |
|---|---|---|---|---|---|
| Global | | 0.41 | 0.91 (0.009) | | 0.43, 0.31, 0.21, 0.05 |
| | | | 0.99 (0.001) | | 0.61, 0.31, 0.08, 0.0 |
| Local | | 0.21 | 0.95 (0.002) | | 0.73, 0.11, 0.07, 0.06, 0.03 |
| | | | 0.99 (0.001) | | 0.51, 0.27, 0.14, 0.08, 0.01 |
We use a Random Forest Regressor to assess feature importance and report the mean and standard deviation of the out-of-sample score. Features are ranked in descending order based on their mean importance (from left to right) and highlighted if their importance represents at least 50% of the total importance. Corr shows the Spearman correlation between inequality and inequity scores. represents random chance.
Figure 6. The effects of homophily and preferential attachment in the global disparity of PageRank. We generated directed networks using four different models of edge formation. DPA: only preferential attachment. DH: only homophily. DPAH: our proposed model that combines DPA and DH. Random: a baseline where nodes are connected randomly. We see the following patterns: (i) Homophily (DH) produces a moderate-to-high level of inequality, while preferential attachment (DPA) produces a consistent moderate inequality. When both mechanisms are combined (DPAH), the rank inequality increases even further. (ii) Random and preferential attachment (DPA) are always fair, while in the cases where homophily is involved (DH and DPAH) inequity is often high. Thus, in general preferential attachment is the main driver of inequality, while homophily influences both inequality and inequity. Vertical and horizontal error bars represent the standard deviation over 10 runs of the Gini and ME, respectively.
Empirical Networks.
| Dataset | APS | Hate | Blogs | Wikipedia |
|---|---|---|---|---|
| | 1853 | 4971 | 1224 | 3159 |
| | pacs | hate | leaning | gender |
| | 05.30.-d | normal | right | male |
| | 05.20.-y | hateful | left | female |
| | 0.37561 | 0.10943 | 0.48039 | 0.15226 |
| | 0.00106 | 0.00061 | 0.01271 | 0.00149 |
| | 3.22246 | 2.23026 | 4.88733 | 4.22425 |
| | 8.93993 | 1.73445 | 3.22464 | 6.16567 |
| | 0.64981 | 0.56898 | 0.47070 | 0.78469 |
| | 0.02859 | 0.10244 | 0.04741 | 0.07824 |
| | 0.02721 | 0.07886 | 0.04105 | 0.10685 |
| | 0.29439 | 0.24972 | 0.44084 | 0.03022 |
| | 0.94000 | 0.58000 | 0.92000 | 0.59000 |
| | 0.96000 | 0.95000 | 0.90000 | 0.62000 |
APS is a scientific citation network; Hate, a retweet network; Blogs, a political blog hyper-link network; and Wikipedia, a hyper-link network of politicians. Each row represents a property of the network. The rows include the fraction of edges within and across groups, and the homophily values inferred by the DPAH model (see Supplementary Appendix A for derivations).
Figure 7. Global disparity in PageRank on empirical networks. Each column represents an empirical network: citation/retweet networks (APS and Hate) and hyper-link networks (Blogs and Wikipedia). Inequality and inequity are shown on the y- and x-axis, respectively. The disparity in ranking that we see in empirical networks is best explained as follows: (i) citation/retweet networks by preferential attachment (PA), and (ii) hyper-link networks by preferential attachment and homophily (DPAH).
Model parameters.
| Random | DPA | DH | DPAH | |
|---|---|---|---|---|
| - | - | |||
| - | - | |||
| - | ||||
| - |
Check marks denote that a given model (column) requires a particular parameter (row): number of nodes n, fraction of minorities, edge density d, in-class homophily, and the power-law exponent of the activity distribution.
Sub-indices M and m refer to the majority and minority groups, respectively. The difference between DH and DPAH is the preferential attachment (in-degree) mechanism. All models produce directed networks.