Literature DB >> 35028345

Dataset for corruption risk assessment in a public administration.

Marcelo Oliveira Vasconcelos1,2, Luís Cavique2,3.   

Abstract

This data article describes a dataset of corruption approach and possible variables related, and this dataset was created by integrating eight different systems of Brazilian federal government and Federal District. We present real data from civil servants and militaries to comply with GDPR legislation, the attributes that could identify a person were removed, making the data anonymized.
© 2021 Published by Elsevier Inc.

Entities:  

Keywords:  Corruption; Data enrichment; Imbalanced learning; Public administration; Risk

Year:  2021        PMID: 35028345      PMCID: PMC8741413          DOI: 10.1016/j.dib.2021.107768

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table Value of the Data This dataset contains data from eight different databases from the Brazilian federal government and Federal District. This dataset benefits researchers working in the field of corruption risk assessment and also applied machine learning. Researchers working in the field of corruption risk assessment may find this dataset benefited and could also apply machine learning. The analysis of this data could help identify corruption risk factors and assist in the definition of overseen planning on focus on the activities of the greatest risk for Public Administration, such as cases with a high probability of occurrence and a high financial or social impact.

Data Description

The dataset provided in this paper offers valuable information on public administration and allows research in the corruption area. A few datasets regarding corruption are available, Al-Jundi [1] presents a survey dataset on determinants of administrative corruption, Peerthum et al. [2] related to corruption in Mauritius, and Oguntunde et al. [3] deal with selected crime data in Nigeria, including corruption. Literature was consulted to determine attributes for administrative corruption. Other researchers can reuse the dataset and can be easily downloaded from the Mendeley Data repository.1 The data in this article are composed of all civil servants from Federal District Government (Brazilian Public Administration), involves the reported cases of dismission by corruption, and aggregate 26 attributes related to four domain areas extract from eight databases. These four domains are related by sources provided and are: Corruption Domain (C) aggregate data corresponding to illegal acts committed by civil servants or militaries or companies that they are owners; Employment domain (E) provide servant's registrations from Human Resources Management System like income and number of coordination roles; Political Domain (P) covers data related to political activities; and Business Domain (B) presents company features that civil servants and militaries are owners.

The descriptive statistics

The dataset is composed of 27 attributes, part of them are integer and numeric attributes (Table 2), other attributes are categorical (Table 3), and a few of them are Boolean (Table 4).
Table 2

Integer and numeric attributes.

DimAttributeMin25%Median75%maxmeanstd
ESalary (Real BR)0.014651.287866.741464.6340,140.219473.919752.79
ESalaryMinusTax0.003455.905388.118111.1230,162.966725.798178.93
EQtySIGRHOff0.000.000.001.0020.000.751.61
EQtySIAPEOff0.000.000.000.008.000.370.99
EQtySIGRHSIAPEOff0.000.000.002.0020.001.121.85
EQtySIGRHfunc0.001.001.002.0013.001.420.87
EQtySIAPEfunc0.000.000.000.004.000.030.23
EQtySIGRHSIAPEfunc0.001.001.002.0013.001.450.91
BOwnershipPerc0.000.000.000.00100.009.0322.54
BQtFirmActivities0.000.000.001.0011.000.791.81
BDaysOwnership0.000.000.00499.0043,795.00917.402054.12

Brazilian currency - Real.

Table 3

Categorical attributes.

DimAttributeNumber of CategoriesN° of examples
PElectivePosition81317
PCodParty321317
PCandElectiveStatus51317
PCandEducationLevel61317
PCandMaritalStatus51317
PCodRoundStatus71317
BTypeOfOwnership286,058
BCodFirmActivity97686,058
BCodFirmLegal3386,058
BCodFirmStatus586,058
BCodFirmSize386,058
BCodFirmTaxOption586,058
Table 4

Boolean attributes.

DimAttributeTrueFalse
CCorruptionTG428302,608
CCEIS132302,904
CTCDFrestriction274302,762
CCEPIM0303,036
All boolean attributes (Table 4) belong to Corruption Domain. Table 3 presents categorical attributes from Political and Business Domains, and Table 2 shows the main statistic description from Employment and Business domains with integer or numeric attributes.

Experimental Design, Materials and Methods

This section gives Data Sources aggregated information; Related Literature to compose the dataset features, Descriptive Statistics, and the Preprocessing (Data Enrichment and Data Cleansing).

Data sources

The dataset was composed of eight different sources from Brazilian public administration. After consolidation, the attributes were classified by four domain areas for better understand, described by: corruption(C), Employment (E), political (P), and Business (B) that are related by sources. The dataset was created after an ETL process collected from these different data sources: CGU-CEPIM - Private Non-Profit Entities Prevented from contracting with the Public Administration maintained by Office of the Comptroller General (Controladoria-Geral da União—CGU); CGU-CEIS - Registration of Unfaithful and Suspended Companies) maintained by Office of the Comptroller General (Controladoria-Geral da União—CGU); CGDF - Expulsion Registrations maintained by Comptroller General of the Federal District (Portal da Transparência DF); TCDF - Persons that by sanction are not allowed for the exercise commission position or a trust function within the scope of the Public Administration of the Federal District maintained by District Federal Court of Accounts– TCDF; SIAPE – Integrated Human Resources Administration System maintained by Federal Government; SIGRH - Integrated Resource Management System maintained by Federal District Government; TSE- Electoral Data maintained by Superior Electoral Court (TSE); and SRF/ME - Personal and Legal Data maintained by Secretariat of Brazil's Federal Revenue (SRF/ME). These data represent the information from civil servants, militaries, and pensioners of The Federal District, a Brazilian Public Administration, in total are 303,036. Federal District is a legal entity of internal public law, which is part of the political-administrative structure of Brazil, of a nature sui generis, because it is neither a state nor a municipality, but a special entity that accumulates the legislative powers reserved to the states and the municipalities, which gives it a hybrid nature of state and municipality.

Domains

These four domains (Fig. 1) are:
Fig 1

Illustrates the pipeline of ETL process (extract, transform and load) from different data sources integrated into a dataset aggregated by four domains and was submitted to a pre-processing (Data Enrichment and Data Cleansing).

Illustrates the pipeline of ETL process (extract, transform and load) from different data sources integrated into a dataset aggregated by four domains and was submitted to a pre-processing (Data Enrichment and Data Cleansing). Corruption Domain (C) aggregates data corresponding to illegal acts committed by civil servants or militaries, or companies that they are owners. Employment domain (E) is composed of base integration of two payment databases that have information from Public Security workers (policemen and firefighters) in Integrated Human Resource Management System (SIAPE) and from other civil servants (Education, Health, and other areas in Resource Management System (SIGRH). Base integration work took place for the SIGRH, and SIAPE, since the same civil servant or military from the Federal District could be included in both databases due to the possibility provided by the Brazilian Federal Constitution to allow the accumulation of certain public offices. Political Domain (P) has information from TSE, The Brazilian Superior Electoral Court, and provides information about candidates like level of education, party, marital status. Business Domain (B) is composed of information from The Secretariat of Brazil's Federal Revenue - SRF/ME, about companies whose owners are civil servants and militaries.

Related literature

The decision about which attributes to compose this dataset was defined considering studies carried out on corruption literature, and all of them were identified and classified by previous domains defined (Table 2). It is essential to bring the concept of corruption adopted for this dataset and represented by the variable "C.CorruptionTG". It was described in Brazilian Law No. 8429/92, which defines corruption as an act of improbity that, under the influence or not of the position, causes illicit enrichment, causes or not mandatory, will be used to the purse or violate Public Administration principles [20] and is described on Table 1.
Table 1

Attributes/features description.

#Attribute nameTypeBrief Description
Corruption Domain (C)

1C.CorruptionTGBooleanCases of dismission by corruption, this attribute could be a target for machine learning
2C.CEISBooleanCases of individuals or legal entities with restrictions on the right to participate in tenders or to contract with the Public Administration by sanctions
3C.TCDFrestrictionBooleanCases of person who are not qualified to exercise a position in a commission or a trust function within the Public Administration of the Federal District for a period of up to eight years due to serious irregularities found by the TCDF
4C.CEPIMBooleanCases of private non-profit entities that are prevented from entering into new agreements, on lending contracts or partnership terms with the Federal Public Administration, depending on irregularities not resolved in agreements, on lending contracts or partnership terms previously signed

Employment Domain (E)

5E.SalaryNumericSalary (Brazilian currency - Real) of the civil servant or military that included the salary received by any of the bases (SIGRH and SIAPE) or the sum of salaries in the case of civil servants who accumulate public positions as permitted by the Federal Constitution
6E.SalaryMinusTaxNumericSalary with several discounts and obtained in a similar way to the "Salary" (SIGRH and SIAPE bases)
7E.QtySIGRHOffIntQuantity of positions that the civil servant or military held until Nov/2020 into the SIGRH determined only with the SIGRH base.
8E.QtySIAPEOffIntQuantity of positions the civil servant or military held in Public Security until Nov/2020 at SIAPE (Public Security, SIAPE)
9E.QtySIGSIPOffIntQuantity of positions that the civil servant or military held until Nov/2020 in these two databases (SIGRH and SIAPE).
10E.QtySIGRHfuncIntQuantity of functions that the civil servant occupied until Nov/2020 in the SIGRH (Servers, except Public Security, SIGRH)
11E.QtySIAPEfuncIntQuantity of functions that the civil servant or military occupied until Nov/2020 in SIAPE (SIAPE Public Security)
12E.QtySIGSIPfuncIntQuantity of functions that the civil servant or military occupied until Nov/2020 in these two databases (SIGRH and SIAPE)

Political Domain (P)

13P.ElectivePositionCategoricalType of electoral position that the civil servant disputed (president or vice, governor or vice, mayor, senator, councilor, federal deputy, state deputy, or district deputy)
14P.CodPartyCategoricalCode of the party in which the server was registered for the election
15P.CandElectiveStCategoricalcandidate's registration status, which can assume the values 'Apt' (candidate able to go to the ballot box); 'Unfit' (candidate unfit to go to the ballot box); 'Registered' (registration of candidacy carried out, but not yet judged by the electoral body)
16P.CandEducationCategoricalCandidate's level of education can be defined as: non-disclosable, reads and writes, incomplete or complete elementary school, incomplete, or complete high school, and incomplete or complete higher education
17P.CandMaritalStCategoricalThe civil status situation of the candidate civil servant: single, married, non-disclosable, widowed, legally separated or divorced
18P.CodRoundStCategoricalThis attribute identifies the candidate's totalization situation in the turn that can be (elected, elected by average, elected by the electoral quotient, unelected, alternate, or null)
Business Domain (B)

19B.OwnershipPercNumericPercentage of share capital that the civil servant or military presents at Nov/2020
20B.TypeOfOwnerCategoricalType of partner of a civil servant or military is registered within the company to which it belongs
21B.QtFirmActIntNumber of secondary activities registered by the company in which the civil servant or military is a partner
22B.CodFirmActCategoricalThe main activity of the firm/company in which the civil servant or military is a partner
23B.CodFirmLegalCategoricalDefinition of legal nature of the company in which the civil servant or military is a partner, which may be in different denominations, such as: Mixed Economy Society, Public Limited or Closed Corporation, Limited Business Society, limited partnership, or by shares, among others
24B.CodFirmStCategoricalStatus of the company in which the civil servant or military is a partner, among the possible alternatives there are active, null, suspended, unsuitable, or closed
25B.CodFirmSizeCategoricalSize of the company that can be Microenterprise (ME), Small Business (EPP), medium or large depending on the gross annual turnover of the head office and its branches, or that is, the global gross revenue defined in the tax legislation
26B.DaysOwnershipNumericThis attribute informs the number of days that the server is a partner in the company until Nov/2020
27B.CodFirmTaxOptCategoricalThis attribute informs if the company opted for the simplified taxation system - Simples Nacional - which aims to help micro and small companies concerning the payment of taxes

Source: An extract of this table was published on Vasconcelos et al. [4], p. 512/513 https://link.springer.com/chapter/10.1007/978–3–030–86230–5_40.

The Mendeley dataset is available at.

.

Attributes/features description. Source: An extract of this table was published on Vasconcelos et al. [4], p. 512/513 https://link.springer.com/chapter/10.1007/978–3–030–86230–5_40. The Mendeley dataset is available at. . Integer and numeric attributes. Brazilian currency - Real. Categorical attributes. Boolean attributes. Literature related to corruption. The data obtained from these sources (Fig. 1) provided by different public organizations were aggregated in SAS Enterprise. They were outlined by their attributes classified by the four domains defined.

Pre-processing (Data Enrichment and Data Cleansing)

The data preparation is the stage in which the data must be processed and prepared in a way that can demonstrate the understanding of the business, in this case for corruption. Integrating different data sources could be a challenge because, in general, the data comes from sources of transactional systems or measurements or also from real-world situations, and the data set obtained must converge to understand the business. Data cleaning and construction of attributes were carried out to generate treated and adequate data to enable the development of predictive models. Data cleansing is the process of attempting to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data [21]. It aims to alleviate two critical problems of data acquisition processes: the existence of missing values and the existence of noisy values (noise values). The missing values occur when for the attributes of a dataset there is no determined value for some specimens or when a data set does not have values for an attribute of interest or even presents aggregated values concerning that attribute. As a solution to the missing values, it was possible to remove observations with this characteristic, manually fill in values, or auto-fill. The noisy values refer to changes from the original values and, therefore, consist of measurement errors or values considerably different from most of the other values in the data set, known as outliers. For example, we can mention cases that should be positive and negative values occur or a change in the behavior of the values of an attribute without explanation. Few observations were removed by specialist decision. For the solution of noisy values, there is the inspection with the manual correction or automatic identification and cleaning implemented by algorithms that soften or cancel noise. Data enrichment is the process of enhancing collected data with relevant context obtained from additional sources [22]. This dataset aggregates information from different databases that could benefit from a holistic approach. In addition, some features were elaborated in a specific way to provide information for business understanding. The feature construction allows the elaboration of features that can generate relevant information according to the understanding of the business from the original data. In this scenario, a feature construction was the transformation of categorical attributes into counting attributes. This procedure was performed because the attribute, when expressing quantity, has meaning in the context of business understanding, while the categorical value does not express benefit in the context of corruption. For example, a categorical attribute that means the positions that the civil servant or military man/woman occupied in Public Administration has no meaning for this investigation. However, many positions he/she had occupied could inform that this one does not have a stable condition and could represent an anomaly. It was applied for W.QtySIGRHOff, W.QtySIAPEOff, W.QtySIGSIPOff, W.QtySIGRHfunc, W.QtySIAPEfunc, and W.QtySIGSIPfunc from the Employment domain. For machine learning research, it is essential to address the data imbalance problem. The relevant feature for research that should be the independent variable of this investigation (C.CorruptionTG) presents in the class of interest 428 records and in the dominant class 302,608 records, a relation that keeps the proportion of 1: 707, in percentage terms 0.14% of the class of interest in the population. C.CorruptionTG is a dichotomous variable and is an important variable that has to be analyzed from other variables available for identifying risk factors that could be addressed to mitigate corruption in public administration. Possible ways of dealing with this scenario are explained by Zhu et al. [23] that suggests solving the problem of learning on imbalanced datasets with two possible solutions: data-level solutions and algorithm-level solutions. It is vital to inform that to comply with the GDPR legislation, the attributes that could identify a person were removed, making the data anonymized.

Ethics Statement

The authors declare that they have observed all ethical requirements for publication in Data in Brief.

CRediT authorship contribution statement

Marcelo Oliveira Vasconcelos: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization. Luís Cavique: Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.
SubjectInformation Systems and Management
Specific subject areaCorruption, civil servant, logistic regression
Type of dataTable (csv file)
How data were acquiredData were acquired from eight different databases from The Brazilian Government with SAS Enterprise Guide
Data formatMixed (raw and pre-processed)
Parameters for data collectionThis dataset corresponds to data available in November 2020 and refers to all civil servants and militaries in Federal District.
Description of data collectionA compilation of Brazilian government databases was used for this research and integrated eight databases from Federal Government and Federal District related by sanctions, Civil Service Payment Systems, Political and Firms/Companies.
Data source locationFederal District, Brazilhttps://dados.gov.br/dataset?_organization_limit=0Institutions: Controladoria-Geral da União – CGU,https://dadosabertos.tse.jus.br/http://www.transparencia.df.gov.br/#/https://www.gov.br/receitafederal/pt-br/acesso-a-informacao/dados-abertoshttps://www2.tc.df.gov.br/controle-externo/inabilitados-para-cargos-em-comissao/
Data accessibilityRepository name: Dataset for corruption risk assessment in a Public AdministrationData identification number:doi:10.17632/crpdknzswh.2Direct link to the dataset:https://data.mendeley.com/datasets/crpdknzswh/2
Related research articleVasconcelos, M. O., Chaim, R. M., & Cavique, L. (2021). Imbalanced Learning in Assessing the Risk of Corruption in Public Administration BT - Progress in Artificial Intelligence (G. Marreiros, F. S. Melo, N. Lau, H. Lopes Cardoso, & L. P. Reis (eds.); pp. 510–523). Springer International Publishing.
Table 5

Literature related to corruption.

DOMAINSLITERATURE
Corruption (C)Hanna and Wang [5], Carvalho and Carvalho [6], Carvalho [7]
Employment (E)Gans-Morse et al. [8], Liou et al. [9], Carvalho [7] Padula and Albuquerque [10], Poocharoen and Brillantes [11], Carvalho and Carvalho [6] López-valcárcel et al. [12]
Political (P)Pedersen and Johannsen [13], Bersch et al. [14], Meyer-Sahling and Mikkelsen [15], Moro [16], Carvalho et al. [17], Carvalho [7] Lassou and Hopper [18], Treisman [19], Gans-Morse et al. [8]
Business (B)Carvalho [7]
  3 in total

1.  Analysis of selected crime data in Nigeria.

Authors:  Pelumi E Oguntunde; Oluwadare O Ojo; Hilary I Okagbue; Omoleye A Oguntunde
Journal:  Data Brief       Date:  2018-06-02

2.  A survey dataset on determinants of administrative corruption.

Authors:  Salem A Al-Jundi
Journal:  Data Brief       Date:  2019-11-16
  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.