Diana Hicks, Matteo Zullo, Ameet Doshi, Omar I Asensio.
Abstract
In seeking to understand how to protect the public information sphere from corruption, researchers understandably focus on dysfunction. However, parts of the public information ecosystem function very well, and understanding these parts will also help in protecting and developing existing strengths. Here, we address this gap, focusing on public engagement with high-quality science-based information: consensus reports of the National Academies of Sciences, Engineering, and Medicine (NASEM). Attending to public use is important to justify public investment in producing and making freely available high-quality, scientifically based reports. We deploy Bidirectional Encoder Representations from Transformers (BERT), a high-performing, supervised machine learning model, to classify 1.6 million comments left by US downloaders of National Academies reports responding to a prompt asking how they intended to use the report. The results provide detailed, nationwide evidence of how the public uses open access scientifically based information. We find half of reported use to be academic: research, teaching, or studying. The other half reveals adults across the country seeking the highest-quality information to improve how they do their job, to help family members, to satisfy their curiosity, and to learn. Our results establish the existence of demand for high-quality information by the public and that such knowledge is widely deployed to improve provision of services. Knowing the importance of such information, policy makers can be encouraged to protect it.
Keywords: BERT; machine learning; natural language processing; public understanding of science
Year: 2022; PMID: 35193972; PMCID: PMC8892306; DOI: 10.1073/pnas.2107760119
Source DB: PubMed; Journal: Proc Natl Acad Sci U S A; ISSN: 0027-8424; Impact factor: 11.205
Description of NASEM dataset
| | Number or year | Date or percentage |
| Reports downloaded after 2002 in the United States | 10,275 | |
| First download | 2003 | June |
| First comment | 2011 | June |
| Last download and comment | 2020 | February 6 |
| Worldwide downloads | 16,000,616 | |
| US downloads—raw data | 8,303,511 | 52% |
| US downloads—processed | 6,648,781 | |
| Worldwide comments | 2,433,199 | |
| US comments | 1,554,157 | 64% |
| Most frequent US comment—“research” | 116,828 | 7.5% |
| Unique US comments | 862,258 | |
*Downloads from Chinese domains that appeared under a US IP address were removed, as were duplicates, algorithmic downloads, and multiple copies of the same report downloaded by a user within 1 day.
†Excluded from the analysis were 2,051 comments classified as refusal to answer the prompt. Examples included ppoo00, nan, kjbkbknln, and similar.
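One of the cleaning steps in the first footnote, dropping multiple copies of the same report downloaded by a user in one day, can be illustrated with a minimal sketch. The record layout (`user_id`, `report_id`, download date) and the function name `dedupe_downloads` are assumptions made for illustration; the source does not describe how the authors implemented this step.

```python
from datetime import date


def dedupe_downloads(records):
    """Keep one record per (user, report, day), preserving first-seen order.

    `records` is an iterable of (user_id, report_id, day) tuples.
    This layout is hypothetical, not the authors' actual schema.
    """
    seen = set()
    kept = []
    for user_id, report_id, day in records:
        key = (user_id, report_id, day)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


# Hypothetical toy log of downloads
sample = [
    ("u1", "r100", date(2019, 5, 1)),
    ("u1", "r100", date(2019, 5, 1)),  # same user, report, and day: dropped
    ("u1", "r100", date(2019, 5, 2)),  # same user and report, next day: kept
    ("u2", "r100", date(2019, 5, 1)),  # different user: kept
]
cleaned = dedupe_downloads(sample)
```

Here `cleaned` holds three of the four toy records. The other filters named in the footnote (Chinese domains under US IP addresses, algorithmic downloads) would need information that is not shown in this record.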
US downloads of NASEM reports and comments by sector
| Sector | Downloads (number) | Users (number) | Domains (number) | Comments (number) | Downloads (%) | Users (%) | Domains (%) | Comments (%) |
| Gmail & ISP | 2,300,947 | 926,227 | 22,796 | 491,003 | 35 | 38 | 11 | 32 |
| University | 1,903,312 | 699,854 | 16,543 | 451,816 | 29 | 28 | 8 | 29 |
| Companies (.com) | 719,160 | 274,546 | 97,542 | 171,163 | 11 | 11 | 45 | 11 |
| Federal Government | 450,798 | 121,830 | 2,333 | 91,587 | 7 | 5 | 1 | 6 |
| Nonprofit (.org) | 331,821 | 124,390 | 38,671 | 92,245 | 5 | 5 | 18 | 6 |
| State & Local Government | 223,044 | 72,217 | 8,497 | 67,824 | 3 | 3 | 4 | 4 |
| School | 205,089 | 102,148 | 14,069 | 65,627 | 3 | 4 | 7 | 4 |
| Health Care | 168,637 | 69,046 | 3,765 | 56,897 | 3 | 3 | 2 | 4 |
| Consulting | 96,985 | 23,240 | 1,326 | 23,007 | 1.5 | 0.9 | 0.6 | 1.5 |
| NASEM | 85,602 | 5,889 | 1,732 | 3,491 | 1.3 | 0.2 | 0.8 | 0.2 |
| Transportation | 78,703 | 14,481 | 381 | 19,649 | 1.2 | 0.6 | 0.2 | 1.3 |
| Miscellaneous (.net etc.) | 55,514 | 18,871 | 7,194 | 12,847 | 0.8 | 0.8 | 3.3 | 0.8 |
| Media | 9,604 | 3,768 | 473 | 2,303 | 0.1 | 0.2 | 0.2 | 0.1 |
| Museum | 9,142 | 2,904 | 412 | 2,745 | 0.1 | 0.1 | 0.2 | 0.2 |
| Community College | 9,121 | 4,856 | 354 | 1,955 | 0.1 | 0.2 | 0.2 | 0.1 |
| Total | 6,647,479 | 2,464,267 | 216,088 | 1,554,159 | 100 | 100 | 100 | 100 |
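The percentage columns appear to be each sector's count divided by the column total and then rounded. A minimal sketch of that arithmetic for three rows of the downloads column follows; the counts are copied from the table, while the function name `share_percent` and the rounding to whole percentages are assumptions.

```python
# Download counts copied from the table above (three sectors only)
downloads = {
    "Gmail & ISP": 2_300_947,
    "University": 1_903_312,
    "Federal Government": 450_798,
}
total_downloads = 6_647_479  # "Total" row, downloads column


def share_percent(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


# Rounded to whole percentages, as the table does for the larger sectors
shares = {
    sector: round(share_percent(count, total_downloads))
    for sector, count in downloads.items()
}
```

Under these assumptions the three rounded shares agree with the 35, 29, and 7 reported in the downloads percentage column.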
Most downloaded reports
| Downloads (thousands) | Publication year | Title |
| 206 | 2012 | |
| 125 | 2011 | |
| 74 | 2000 | |
| 59 | 2009 | |
| 45 | 2001 | |
| 43 | 1996 | |
| 38 | 2017 | |
| 36 | 2000 | |
| 35 | 2008 | |
| 33 | 2008 | |
| 28 | 2007 | |
| 26 | 2013 | |
| 26 | 2015 | |
| 22 | 2010 | |
| 21 | 2016 | |
| 21 | 2001 | |
| 20 | 2018 | |
| 20 | 2015 | |
| 20 | 2014 | |
| 20 | 2013 | |
| 20 | 2013 | |
| 18 | 2014 | |
| 18 | 2011 | |
| 17 | 2010 | |
| 17 | 2011 | |
Top 25 most downloaded NASEM reports out of a total of 10,275 reports. Downloads by US-based IP addresses only, counted from June 2003 to February 2020.
Fig. 1. How were NASEM reports used? Classification into 64 categories of 1.6 million comments left by US downloaders of NASEM reports between 2011 and 2020. Downloaders were asked how they would use the report. The BERT machine learning model was used to classify the comments.
Six broad categories of NASEM report use
| Category | Comments | Share, % | Accuracy | F1 macro |
| Education and research | 752,985 | 48 | 0.91 (0.006) | 0.91 (0.006) |
| Governance | 279,799 | 18 | 0.93 (0.006) | 0.92 (0.004) |
| Information activity | 262,209 | 17 | 0.80 (0.012) | 0.82 (0.006) |
| Personal | 157,144 | 10 | 0.87 (0.009) | 0.89 (0.008) |
| Professional | 86,854 | 6 | 0.83 (0.012) | 0.84 (0.010) |
| Other | 15,168 | 1 | 0.87 (0.017) | 0.86 (0.017) |
| Overall | 1,554,159 | 100 | 0.89 (0.004) | 0.87 (0.004) |
Broad classification into six categories of 1.6 million comments left by US downloaders of NASEM reports between 2011 and 2020. Downloaders were asked how they would use the report. The BERT machine learning model was used to classify the comments. The accuracy, F1 macro, and SEs (reported in parentheses) were generated by running 10-fold cross-validation of the optimized model. Accuracy and F1 macro are averages over the 10 runs.
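The metrics in this table can be made concrete with a short sketch of how accuracy, macro-averaged F1, and a standard error over folds are commonly computed. This is not the authors' code. The helper names are hypothetical, and the SE is assumed to be the sample standard deviation of the per-fold scores divided by the square root of the number of folds, which the source does not state.

```python
from math import sqrt
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mean_and_se(fold_scores):
    """Mean over folds and SE = sample SD / sqrt(number of folds) (assumed)."""
    return mean(fold_scores), stdev(fold_scores) / sqrt(len(fold_scores))


# Hypothetical toy labels for one fold (two categories, four comments)
y_true = ["research", "research", "personal", "personal"]
y_pred = ["research", "personal", "personal", "personal"]
fold_accuracy = accuracy(y_true, y_pred)
fold_f1 = f1_macro(y_true, y_pred)

# Hypothetical per-fold accuracies from a 10-fold run
fold_accuracies = [0.89, 0.90, 0.88, 0.89, 0.91, 0.88, 0.90, 0.89, 0.87, 0.89]
avg_accuracy, se_accuracy = mean_and_se(fold_accuracies)
```

Macro averaging gives each category equal weight regardless of size, so a small category such as "Other" influences the F1 macro score as much as "Education and research" does.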
Taxonomy of everyday information use and associated comment categories