Joseph D Romano, Trang T Le, William La Cava, John T Gregg, Daniel J Goldberg, Praneel Chakraborty, Natasha L Ray, Daniel Himmelstein, Weixuan Fu, Jason H Moore.
Abstract
MOTIVATION: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows.
Year: 2021 PMID: 34677586 PMCID: PMC8756190 DOI: 10.1093/bioinformatics/btab727
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Summary of PMLB datasets (with comparison to v0.2)
| | PMLB v0.2 | PMLB v1.0 |
|---|---|---|
| Num. classification datasets | 150 | 162 |
| Num. regression datasets | 0 | 255 |
| Mean num. instances | 20 865 | 42 860 |
| Median num. instances | 500 | 1066 |
| Language interfaces | Python | Python; R |
| Miscellaneous tools | None | Interactive website; Pandas Profiling reports; Git LFS support; API documentation; Contributing guide; Automatic dataset validation |
Fig. 1. Dataset search features on PMLB’s website. (a) Interactive scatterplot of the datasets in PMLB, showing the number of features and number of observations in each dataset, as well as whether it is a regression or classification dataset. (b) Responsive table of PMLB datasets. Users can sort on any column’s values or filter based on ranges of values. Clicking on any dataset name takes the user to the Pandas Profiling report for that dataset.
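The sort-and-filter behavior described for the responsive table can be sketched in a few lines of Python. This is only an illustration of the idea, not the website's implementation: the `DatasetSummary` record, the `filter_by_range` and `sort_by` helpers, and every row in `CATALOG` are hypothetical and do not reflect real PMLB metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSummary:
    """One row of a dataset summary table (hypothetical schema)."""
    name: str
    task: str  # "classification" or "regression"
    n_instances: int
    n_features: int


# Hypothetical rows for illustration only; these are NOT real PMLB datasets.
CATALOG = [
    DatasetSummary("example_a", "classification", 150, 4),
    DatasetSummary("example_b", "regression", 1200, 12),
    DatasetSummary("example_c", "classification", 48000, 14),
    DatasetSummary("example_d", "regression", 600, 30),
]


def filter_by_range(rows, column, low=None, high=None):
    """Keep rows whose `column` value lies in [low, high]; None means unbounded."""
    kept = []
    for row in rows:
        value = getattr(row, column)
        if low is not None and value < low:
            continue
        if high is not None and value > high:
            continue
        kept.append(row)
    return kept


def sort_by(rows, column, descending=False):
    """Return a new list of rows ordered by the given column."""
    return sorted(rows, key=lambda row: getattr(row, column), reverse=descending)


# Datasets with 200 to 2000 instances, widest (most features) first.
selected = sort_by(
    filter_by_range(CATALOG, "n_instances", low=200, high=2000),
    "n_features",
    descending=True,
)
names = [row.name for row in selected]
```

Filtering before sorting keeps the sort restricted to the rows that will actually be displayed; both helpers return new lists, so the catalog itself is never modified.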