Environmental due diligence data: A novel corpus for training environmental domain NLP models.

Afreen Aman, Deepak John Reji.

Abstract

This article takes a step toward adapting existing Natural Language Processing (NLP) models to the diverse and heterogeneous settings of Environmental Due Diligence (EDD). Our approach was to enrich the vocabulary of deep learning models with additional environmental-domain data collected from open-source regulatory documents provided by the Environmental Protection Agency (EPA) [1]. We used active learning and data augmentation methods to address class imbalance, and fine-tuned DistilBERT on EDD data to develop an environmental due diligence model, which is hosted as an inference Application Programming Interface (API) on the Hugging Face Hub. The model was packaged to predict EDD classes and to determine relevancy and ranking, and it allows users to fine-tune the model on additional EDD classes. This package, EnvBert, is hosted on the Python Package Index (PyPI) repository [2]. We anticipate that the rich EDD dataset used to train the model and create the package will help users with a variety of NLP tasks on EDD textual data, especially text classification. We present the data in raw format; it has been open-sourced and is publicly available at https://data.mendeley.com/datasets/tx6vmd4g9p/4.
© 2022 The Author(s). Published by Elsevier Inc.
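
The abstract mentions resolving imbalanced classes through data augmentation but does not say which method was used. As an illustration only, the sketch below swaps in plain random oversampling, the simplest stand-in for that step; it is not the authors' procedure. The function name `oversample`, the toy sentences, and the class names `contamination` and `other` are all hypothetical and are not taken from the dataset.

```python
import random
from collections import Counter


def oversample(texts, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    has as many examples as the largest class (assumed stand-in method)."""
    rng = random.Random(seed)

    # Group the texts by their class label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    if not by_label:
        return [], []

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy data; real EDD text and labels would come from the dataset.
texts = [
    "soil sample exceeded limits",
    "groundwater plume detected",
    "tank removed in 1998",
    "annual report filed",
]
labels = ["contamination", "contamination", "contamination", "other"]

balanced_texts, balanced_labels = oversample(texts, labels)
counts = Counter(balanced_labels)
```

After balancing, `counts` holds the same number of examples for each class. Duplicating examples adds no new information, so in practice a text-level augmentation (for example paraphrasing) would usually be preferred before fine-tuning a classifier.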

Keywords:  DistilBERT; EnvBert; Environmental due diligence; Hugging face; Natural language processing; PyPI

Year:  2022        PMID: 36148216      PMCID: PMC9486029          DOI: 10.1016/j.dib.2022.108579

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


  2 in total

1.  Survey on environmental monitoring requirements of European ports.

Authors:  R M Darbra; N Pittam; K A Royston; J P Darbra; H Journee
Journal:  J Environ Manage       Date:  2008-10-16       Impact factor: 6.789

2.  ML-Net: multi-label classification of biomedical texts with deep neural networks.

Authors:  Jingcheng Du; Qingyu Chen; Yifan Peng; Yang Xiang; Cui Tao; Zhiyong Lu
Journal:  J Am Med Inform Assoc       Date:  2019-11-01       Impact factor: 4.497
