
Intelligent financial fraud detection practices in post-pandemic era.

Xiaoqian Zhu1,2, Xiang Ao3,4,5, Zidi Qin3,4, Yanpeng Chang2,6, Yang Liu3,4, Qing He3,4, Jianping Li1.   

Abstract

The great losses caused by financial fraud have attracted continuous attention from academia, industry, and regulatory agencies. More concerning, the ongoing coronavirus pandemic (COVID-19) unexpectedly shocks the global financial system and accelerates the use of digital financial services, which brings new challenges in effective financial fraud detection. This paper provides a comprehensive overview of intelligent financial fraud detection practices. We analyze the new features of fraud risk caused by the pandemic and review the development of data types used in fraud detection practices from quantitative tabular data to various unstructured data. The evolution of methods in financial fraud detection is summarized, and the emerging Graph Neural Network methods in the post-pandemic era are discussed in particular. Finally, some of the key challenges and potential directions are proposed to provide inspiring information on intelligent financial fraud detection in the future.
© 2021.

Keywords:  COVID-19 pandemic; artificial intelligence; financial fraud detection

Year:  2021        PMID: 34806059      PMCID: PMC8581570          DOI: 10.1016/j.xinn.2021.100176

Source DB:  PubMed          Journal:  Innovation (Camb)        ISSN: 2666-6758


Introduction

Over the past decades, financial fraud has brought shocking losses to the global economy, threatening the efficiency and stability of capital markets. Making things worse, the coronavirus pandemic (COVID-19) outbreak in early 2020 disrupted the international financial markets in unprecedented ways, heightening vulnerability to financial fraud. For example, in April 2020, fraud rates across all financial products in the United Kingdom soared 33% from a year earlier. Meanwhile, Fidelity National Information Services, a payment services provider that assists about 3,200 U.S. banks with fraud monitoring, reported that the volume lost to fraudulent transactions leaped 35% in America compared with the previous period in 2019. Financial fraud in the post-pandemic era is becoming an increasingly severe problem. As defined by Black’s Law Dictionary, fraud refers to a knowing misrepresentation of the truth or concealment of a material fact to induce another to act to his or her detriment. The classification of financial fraud has not reached a consensus because the types of financial fraud are varied and mounting. Summarizing the previous literature, this paper constructs a financial fraud classification framework according to the major financial institution involved. The classification framework is depicted in Figure 1. The frauds related to securities include securities and commodities fraud and financial statement fraud, among others. Insurance frauds include health care fraud, automobile insurance fraud, corporate insurance fraud, and so on. The frauds closely related to banks are mortgage fraud, loan default, credit card fraud, and money laundering, among others. Some frauds that cannot obviously be linked to the above three institutions, such as e-commerce transaction fraud, mass marketing fraud, and illegal fund-raising, are classified as others.
Another common perspective is to divide fraud activities into customer level and business level, so we also take them into consideration in the framework. Financial fraud detection at the customer level is mainly related to individual financial activities, including health care insurance, automobile insurance, credit cards, loans, e-commerce transactions, and so on, whereas business-level fraud crimes, such as financial statement misconduct and money laundering, are often committed by syndicates and accompanied by other crimes such as bribery, tax evasion, and even support of terrorism [13, 14, 15].
Figure 1

The classification of financial fraud types

The ongoing COVID-19 pandemic brings a sudden, unexpected shock to the global financial system and accelerates the use of digital financial services. These changes have escalated more insidious fraud schemes, providing a breeding ground for all types of financial fraud. On one hand, the economic downturns caused by the global pandemic bring mounting economic pressure and stronger fraud motives to both companies and individuals. For example, in response to the expected cash flow disruption caused by the advent of the COVID-19 crisis, companies withdraw funds on a large scale from pre-existing credit lines. The rising operation costs stemming from the economic shutdown threaten the survival of many companies, inducing an increase in credit fraud. Furthermore, the pressure on corporate financial results intensifies the temptation to manipulate financial statements in order to meet stakeholder expectations. For policyholders, poor financial conditions spawn more speculative insurance claim fraud. On the other hand, the COVID-19 outbreak significantly accelerates digital transformation and increases digital processes, which opens new avenues for fraud activities. The emerging situations can be summarized into two types. The first is that the switch of business from offline to online exacerbates information asymmetry and leads to increased difficulty in fraud detection. Quarantine regulations create opportunities for online banking and remote transactions, but it is difficult for remote banking to obtain comprehensive information for customer identity verification, resulting in frequent credit fraud incidents. The rise in suspected and proven insurance frauds caused by the claim process adjustment also keeps insurers up at night. Remote work not only expands workload but also hinders access to information.
For example, an insurance adjuster may not be able to inspect automobile repairs in detail, which provides opportunities for policyholders to exaggerate billing. Another situation engendered by the increasing digitalization is that the burgeoning of new financial products and services makes it difficult for the existing detection methods to adapt. To elude regulators, fraudulent behaviors and types escalate over time, which greatly lowers the effectiveness of the extant approaches. Google reports that it is blocking more than 240 million COVID-themed spam emails and 18 million malicious emails related to COVID-19 each day. During the crisis, fraudsters tweak their fraud schemes and add COVID-19 twists to confuse the victims, which makes fraud detection a challenging task for both individuals and detection agencies. Moreover, although digital financial services, such as crowdfunding platforms and digital payments, are quickly adopted, the incomplete regulatory policies make it easier to hide fraudsters’ identity information or financing history, which leads to credit fraud. Hence, financial fraud in the post-pandemic era is a critical problem with the characteristics of stronger motives, more insidious forms, and more intelligent schemes. These changes bring considerable challenges to financial fraud detection, including the need for faster detection, better interpretability, and stronger robustness. In addition, the rapid digital transformation is not only an opportunity to obtain richer data for fraud detection but also brings more problems, such as how to mine valuable information from massive but low-value-density data more effectively. Considering the above-mentioned changes, this paper provides a comprehensive review of the development of financial fraud detection practices and highlights the new characteristics of fraud caused by COVID-19. We first give a brief introduction to the evolution of data types used in financial fraud detection.
Through the review from traditional methods to recently proposed methods, this paper aims to summarize possible directions for improvement in response to more insidious fraudsters and provide insights into future algorithm design. Finally, the current challenges and potential directions are outlined to provide some inspiring information on intelligent financial fraud detection in the post-pandemic era. The remainder of the paper is organized as follows. Section 2, financial fraud detection data evolution, presents the evolution of data used in financial fraud detection. Section 3, survey of methods, discusses the state-of-the-art fraud detection techniques according to the timeline and highlights the progress in recent years. Section 4, challenges and future directions, provides insights into problems and challenges that are still unsolved and points out the directions for future work. Last, the conclusions are summarized in Section 5.

Financial fraud detection data evolution

With the rapid growth of information technology, the types of data used for financial fraud detection continue to expand, which can be roughly divided into three categories, i.e., basic quantitative structured data (a.k.a. tabular data), diverse semi-structured data, and complex unstructured data. Data types and examples are shown in Table 1.
Table 1

Types and examples of data used for fraud detection

Data type | Examples | Research
Structured | Quantitative numbers | Viaene et al. diagnose automobile insurance claims fraud by using indicators including claimant, insured driver, and lost wages [25]. Beneish detects corporation earnings manipulation by using financial indexes collected from commercial databases [26]. Dechow et al. describe the characteristics of corporation misrepresentation through sorting Accounting and Auditing Enforcement Releases information into a numerical database [27].
Semi-structured | Interview | Law analyzes the organizational factors of corporate fraud through interviewing chief financial officers [28].
Semi-structured | Business process | Jans et al. mine procurement processes to predict internal transaction fraud in companies [29].
Semi-structured | Database system | The Securities and Exchange Commission requires corporations to submit reports in the eXtensible Business Reporting Language (XBRL), which provides public and formatted data for fraud detection [30].
Unstructured | Text | Xiong et al. mine individual opinions on social media to detect corporate disclosure fraud [31].
Unstructured | Audio | Hobson et al. analyze the vocal and linguistic cues elicited from speech to detect misreporting [32].
Unstructured | Video | Muddy Waters Research analyzes multiple sources of information, including store traffic videos, to expose Luckin Coffee's fabrication of financial numbers [33].
Unstructured | Telemetry data | The China Securities Regulatory Commission detects Dalian Zhangzidao Fishery Group’s financial fraud by using the BeiDou Navigation Satellite System [34].
From the very beginning, pioneered by pathfinders such as Beaver and Altman, who proposed investigating a set of financial ratios for bankruptcy prediction, numerous encouraging explorations on using quantitative data to predict fraud have been conducted. The sources of these structured data include corporations, regulators, research teams, commercial companies, and so on. Insurance companies and banks established unique systems to collect and store the basic information of policyholders or account holders. For insurers, the information used for fraud detection, such as insurance claims, the characteristics of incidents, and customer purchase behaviors, is obtained from the claim statement or the policy. Banks usually predict fraud with the help of transaction information, such as transactional history and payment observation. More comprehensively, regulators collect incidents in the entire industry and issue relevant reports. For example, the Securities and Exchange Commission (SEC) has been issuing the Accounting and Auditing Enforcement Releases (AAERs) to investigate companies for alleged accounting misconduct since 1982. To enhance the availability of financial misstatements data, Dechow et al. sorted AAERs information into a numerical database. Commercial companies also collect finance and market information from global institutions and build databases to meet the growing demand for data analysis. For instance, the major accounting and financial databases for researchers in the world include the Compustat North America database by Standard & Poor’s and the Worldscope database by Thomson Financial. Based on commercial databases, Beneish calculated financial indexes including the gross margin index, asset quality index, and sales growth index when detecting corporation earnings manipulation. Quantitative data are intuitive and easy to obtain, but the information contained in them is limited.
As shown in Table 1, researchers seek other types of data to detect fraud as fraud patterns continue to escalate. For example, Law examined the organizational factors of corporate governance that are related to fraud through analyzing questionnaires and interviews from chief financial officers in Hong Kong. By mining event logs for knowledge, process mining that analyzes business processes also assists in fraud detection. Jans et al. developed a system that mined procurement processes to predict exposure opportunities of committing internal transaction fraud. Another typical type of information is the public formatted files from corporations and regulatory authorities. For example, the SEC has required corporations to file key performance reports in the eXtensible Business Reporting Language (XBRL) format since 2009, which provides high-level data and improves the transparency of corporations. Researchers pull out financial statements from the data repository and then predict financial misconduct through text analysis. Nowadays, the explosion of information has brought more types of available data, which are mainly unstructured, such as text, video, and telemetry data. Typical examples are shown in Table 1. In addition to financial reports, companies’ abundant email archives, public corporate announcements, legal proceedings published by courts, and other textual information have also gradually become raw materials for fraud detection. Furthermore, financial social media platforms have burgeoned in recent years. By mining emotions, social relations, and other information, the wisdom of crowds within social media is also a crucial toolkit to capture business information. Besides, multitype unstructured data, such as audio, image, and video, are playing important roles.
Recently, Muddy Waters Research analyzed multiple sources of information, including customer receipts and store traffic videos, and accused Luckin Coffee of fabricating financial and operating numbers since the third quarter of 2019. The China Securities Regulatory Commission recorded the working location and duration of the Dalian Zhangzidao Fishery Group’s fishing vessels by use of the BeiDou Navigation Satellite System to expose the Chinese A-share listed fishery group, which pretended that its scallops had escaped four times in 6 years to inflate profits. Notably, fraud detection, regardless of the fraud type, is faced with continuously growing data and information that need to be effectively mined and integrated. Reviewing the history of data types mentioned above, the data used in fraud detection practices have developed from basic quantitative data to the current multi-source data. The combination of multi-source information can provide a more panoramic view of financial activities and brings opportunities for better fraud detection. It is also the general trend of scientific research in various fields. However, this evolution also brings great challenges in developing intelligent methods to effectively integrate and utilize panoramic data in future detection practices.

Survey of methods

Analogous to the evolution of data types, methods for fraud detection experienced a rapid proliferation in the past decades. Especially in the post-pandemic era, due to the intensified motives, insidious forms, and intelligent schemes of financial fraud, it is becoming more difficult to identify fraudulent behaviors accurately and efficiently. Thus, recently, researchers tend to incorporate and exploit information from as many aspects as possible for comprehensive monitoring. Following these trends, in this section, we survey existing financial fraud detection methods based on the technical development routes. We highlight the research proposed in the recent 2 years to demonstrate how researchers excavate related information from multiple perspectives in the post-pandemic era. For those antiquated techniques, we merely list representative cases to clarify the historical line. Table 2 depicts the representative financial fraud detection approaches we discuss in this section.
Table 2

Financial fraud detection practices discussed in the section “survey of methods”

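Fraud type (level) | Data type | Algorithm | XAI | Research
Credit fraud (customer level) | Structured | Expert system | | Brause et al. [63], HaratiNik et al. [64], Correia et al. [65]
Credit fraud (customer level) | Structured | SVM | | Dheepa et al. [66]
Credit fraud (customer level) | Structured | RF | | Noghani et al. [67]
Credit fraud (customer level) | Structured | CNN | | Fu et al. [68]
Credit fraud (customer level) | Semi-structured | Naive Bayes | | Panigrahi et al. [69]
Credit fraud (customer level) | Semi-structured | CNN | | Zheng et al. [70]
Credit fraud (customer level) | Unstructured | FNN, Att. | | HACUD [71]
Credit fraud (customer level) | Unstructured | LSTM, Att. | | MAHINDER [72]
Credit fraud (customer level) | Unstructured | GNN | | PC-GNN [73]
Credit fraud (customer level) | Unstructured | GNN, Att. | | AMG-DP [74], SemiGNN [75]
Credit fraud (customer level) | Unstructured | GNN, LSTM, Att. | | TemGNN [76]
Money laundering (business level) | Unstructured | Graph AD | | FlowScope [77]
Money laundering (business level) | Unstructured | Supervised network analysis | | Savage et al. [78]
Money laundering (business level) | Unstructured | GNN | | Weber et al. [79]
Loan fraud (customer level) | Unstructured | GNN, GRU, Att. | | DGANN [80]
Loan fraud (customer level) | Unstructured | GNN, LSTM, Att. | | ST-GNN [81]
Financial statement fraud (business level) | Structured | Naive Bayes | | Deng [82]
Financial statement fraud (business level) | Structured | SVM | | Ravisankar et al. [83]
Financial statement fraud (business level) | Structured | RF, GBT, Rule ensembles | | Whiting et al. [84]
Financial statement fraud (business level) | Structured | FNN | | Green and Choi [85], Fanning and Cogger [86]
Insurance fraud (customer level) | Structured | LR | | Artís et al. [87], Viaene et al. [88]
Insurance fraud (customer level) | Structured | GBT | | Guelman [89]
Insurance fraud (customer level) | Unstructured | GNN | | Liang et al. [90]
E-commerce transaction fraud (customer level) | Semi-structured | LSTM | | Jurgovsky et al. [91]
E-commerce transaction fraud (customer level) | Semi-structured | GRU | | Branco et al. [92]
E-commerce transaction fraud (customer level) | Unstructured | RNN | | CLUE [93]
E-commerce transaction fraud (customer level) | Unstructured | LSTM, Att. | | LIC Tree-LSTM [94]
E-commerce transaction fraud (customer level) | Unstructured | FNN, Att., FM | | HEN [95]
E-commerce transaction fraud (customer level) | Unstructured | FNN, Att., FM | | NHFM [96], DIFM [97]
Others | Structured | Expert system | | Quinlan et al. [98], Cohen et al. [99]
Others | Unstructured | Graph AD | | Li et al. [100]
Others | Unstructured | GNN | | CARE-GNN [101]
Others | Unstructured | GNN, Att. | | Player2Vec [102], GraphConsis [103], PIdentifier [104]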

AD, anomaly detection; Att., attention; XAI, explainable artificial intelligence; ● represents non-deep method and is generally considered to be interpretable; 〇 represents the method claims to be interpretable; ∗ indicates that it is hard to evaluate.


Rule-based expert systems

In the early stages, data used for fraud detection were usually highly structured, e.g., transaction logs or well-designed financial metrics, and the means for detecting fraud were straightforward. A number of rules and static thresholds can be used to filter out misbehavior. A simple case is a system that raises an alert if important indexes like liquidity or profitability are unusually high or low. Then, expert systems were designed to facilitate the work of human auditors. They generally use symbolic rules to encode knowledge created by human experts, an approach that was an important part of artificial intelligence during the 1970s and 1980s. This encoded knowledge base is then queried to yield a result through reasoning. For example, Quinlan et al. and Cohen et al. introduced a set of if-then statements to recognize fraud records in multiple fields. Moreover, association rules, fuzzy rules, and manual trial-and-error rules have been applied to credit card fraud detection as well. Nevertheless, these manual and rule-based approaches have become particularly costly and ineffective at present. As fraudsters begin to employ trickier strategies to elude regulators, rich financial-related information is required to be analyzed, which undoubtedly exacerbates difficulties in extracting and summarizing effective rules. Small sets of human-summarized rules are no longer sufficient to meet the demand, motivating the construction and maintenance of large sets of rules. However, managing a large ruleset requires more computing resources, and such a ruleset is challenging to evaluate and understand.
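The threshold-style rules described above can be illustrated with a minimal sketch. The metric names (current_ratio, profit_margin) and all thresholds below are hypothetical values chosen for illustration; they are not taken from the systems cited in this section.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """One if-then rule: if `condition` holds for a record, raise `name`."""
    name: str
    condition: Callable[[Record], bool]


# Hypothetical rule base; in practice a domain expert supplies the thresholds.
RULES: List[Rule] = [
    Rule("liquidity_out_of_range",
         lambda r: not (0.5 <= r["current_ratio"] <= 5.0)),
    Rule("profitability_out_of_range",
         lambda r: not (-0.2 <= r["profit_margin"] <= 0.6)),
]


def evaluate(record: Record, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules that fire for this record."""
    return [rule.name for rule in rules if rule.condition(record)]


if __name__ == "__main__":
    # A company with an unusually high liquidity ratio triggers one alert.
    print(evaluate({"current_ratio": 9.0, "profit_margin": 0.1}))
```

Every additional fraud pattern requires another hand-written entry in the rule base, which hints at why large rulesets become costly to maintain and evaluate.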

Traditional machine learning algorithms

Considering the defects of rule-based approaches, growing numbers of machine learning-based methods have been developed. They usually start with extracting statistical features relevant to the given task, such as user profiles, credit history, and historical transactions. After performing feature engineering, a classifier can be trained with these features. Next, we introduce several typical algorithms and their corresponding applications in financial fraud detection. Naive Bayes, Logistic Regression (LR), and Support Vector Machine (SVM) are standard linear classifiers that have shown excellent performance in various applications [108, 109, 110]. Naive Bayes is a simple probabilistic classifier based on Bayes' theorem under the assumption of strong (naive) independence of the attributes. Panigrahi et al. proposed a well-designed model for credit card fraud detection, combining a Dempster-Shafer adder with a Bayesian learner. Deng designed a fraudulent financial statements detection model based on a Naive Bayes classifier to facilitate human auditors. LR classifies the existing data by fitting a regression equation that defines the classification boundary and is mainly used for binary classification problems. Artís et al. applied an LR model to detect fraudulent insurance claims based on the Spanish market and estimated the error rate. Viaene et al. considered the damages and audit costs and applied an LR model to decide which claims are suspicious. SVM is also a linear classifier that separates data samples into classes by finding the maximum margin hyperplane. Kernel techniques and margin optimization are two critical properties of SVM. With these two tricks, SVM is capable of solving complex fraud detection problems. To name a few, Ravisankar et al. tested SVM techniques on data from 202 Chinese companies to identify fraudulent financial statements. Dheepa and Dhanapal employed behavior-based SVM to predict suspicious transactions.
Tree-based classifiers attempt to separate data into exclusive categories. Each leaf node represents a specific class, and each tree branch represents a possible attribute value. Decision Tree is the most fundamental one; however, it is likely to be unstable and prone to overfitting. Therefore, more advanced tree-based classifiers such as Random Forest (RF), XGBoost, or LightGBM apply ensemble strategies such as bagging and boosting to improve performance. In the financial fraud detection area, tree-based models have shown performance superior to other learning algorithms like SVM [118, 119, 120]. For example, Guelman researched Gradient Boosting Tree (GBT) in modeling auto insurance loss cost based on data from a Canadian company. Whiting et al. reported the performance of methods including RF, GBT, and rule ensembles when applied to financial fraud detection. Taking feature selection and decision cost into account, Noghani and Moattar proposed an advanced RF-based model, which yielded certain performance improvements. Furthermore, some applications represent transaction data as graphs, using nodes to represent financial entities and edges to represent money transfers. After features are extracted through feature engineering and graph-embedding techniques that preserve topological and structural properties [122, 123, 124], machine learning models are built on top of them. For example, Savage et al. extracted meaningful communities from the network and performed classification to detect money-laundering activities. A few works consider graph anomaly detection techniques, as fraud can be seen as unusual events different from normal behaviors [125, 126, 127]. For instance, Li et al. spotted potential fraudulent cases in trading networks by finding the black hole and volcano patterns. Li et al. modeled laundering as dense, multi-step money flows and proposed an algorithm, FlowScope, to search for dense flows accurately and efficiently in large transaction graphs.
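As an illustration of how LR fits a classification boundary, the following is a minimal sketch of binary logistic regression trained by batch gradient descent, written without external libraries. The two features (a scaled claim amount and a count of prior claims) and the toy data are hypothetical and are not drawn from the cited studies.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that sample x belongs to the fraud class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic_regression(
    X: List[List[float]], y: List[int], lr: float = 0.1, epochs: int = 2000
) -> Tuple[List[float], float]:
    """Minimize the mean log-loss with plain batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of log-loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: [scaled claim amount, number of prior claims]; label 1 = fraud.
X_TOY = [[0.1, 0], [0.2, 1], [0.3, 0], [0.8, 3], [0.9, 4], [0.7, 5]]
Y_TOY = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    w, b = train_logistic_regression(X_TOY, Y_TOY)
    print(predict_proba(w, b, [0.85, 4]), predict_proba(w, b, [0.15, 0]))
```

The decision boundary is the set of points where the weighted sum plus the bias equals zero; a claim is flagged when its estimated probability exceeds a chosen cutoff such as 0.5.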
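The graph representation mentioned above, with accounts as nodes and money transfers as directed edges, can be sketched as follows. This is only an assumed, simplified form of feature engineering: the pass-through ratio is a hypothetical hand-crafted feature for illustration and is not the FlowScope algorithm or any other cited method.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Transfer = Tuple[str, str, float]  # (source account, destination account, amount)


def node_flow_features(transfers: Iterable[Transfer]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-account transfer counts and amounts from a list of directed edges."""
    feats: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"in_count": 0, "out_count": 0, "in_amount": 0.0, "out_amount": 0.0}
    )
    for src, dst, amount in transfers:
        feats[src]["out_count"] += 1
        feats[src]["out_amount"] += amount
        feats[dst]["in_count"] += 1
        feats[dst]["in_amount"] += amount
    return dict(feats)


def pass_through_ratio(f: Dict[str, float]) -> float:
    """Hypothetical feature: close to 1 when an account forwards almost all it receives."""
    hi = max(f["in_amount"], f["out_amount"])
    return min(f["in_amount"], f["out_amount"]) / hi if hi > 0 else 0.0


if __name__ == "__main__":
    transfers = [("A", "M", 100.0), ("B", "M", 50.0), ("M", "C", 145.0)]
    feats = node_flow_features(transfers)
    for account, f in sorted(feats.items()):
        print(account, f, round(pass_through_ratio(f), 3))
```

Feature vectors of this kind can then be passed to any of the classifiers discussed in this subsection.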

Deep-learning-based approaches

Deep Learning (DL) has become a prominent branch of machine learning, as it has achieved great success in various domains. At its heart, the key advantage of DL models is that they can extract features directly from raw data without hard-coding task-specific knowledge or tedious feature engineering. As fraud in financial scenarios grows increasingly complex, researchers seek to use massive and varied data to uncover concealed miscreants. Thus, DL techniques for fraud detection have gained popularity over recent years, especially in the post-pandemic era where digital transformation has become the new normal. In this section, we discuss the surveyed approaches according to the different types of input data.

Modeling tabular data

In the first few years, researchers merely used the basic feedforward neural networks (FNNs), also known as multi-layer perceptrons (MLPs), as classifiers based on static tabular data. For example, Green and Choi presented a neural network classifier employing variables related to the financial statement. Fanning and Cogger also used an artificial neural network for management fraud prediction. Their input vectors mainly consist of financial ratios and qualitative variables derived from financial statements. Though many attempts using MLPs in financial fraud detection have shown better performance than rule-based systems and other classification methods like LR [130, 131, 132, 133], these networks are acyclic and incapable of modeling sequential data that might be essential to discover anomalous users or transactions.
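To make the acyclic structure of an MLP concrete, the following is a minimal sketch of the forward pass of a one-hidden-layer network on a tabular input vector. The layer sizes, the ReLU activation, and the hand-set weights are assumptions for illustration; they do not reproduce the networks in the cited studies, and no training is shown.

```python
import math
from typing import List, Sequence


def dense(x: Sequence[float], W: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Affine layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v: Sequence[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def mlp_forward(x, W1, b1, W2, b2) -> float:
    """Input -> hidden (ReLU) -> single sigmoid output read as a fraud score."""
    hidden = relu(dense(x, W1, b1))
    logit = dense(hidden, W2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Hypothetical input of two financial ratios and hand-set weights.
    x = [1.0, 2.0]
    W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
    W2, b2 = [[1.0, 2.0]], [-3.0]
    print(mlp_forward(x, W1, b1, W2, b2))
```

Each input is processed independently in a single pass from input to output, which is why such a network has no way to relate one transaction to the ones that came before it.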

Modeling sequential data

Hence, to better excavate and utilize sequential data, more complex and elaborate network structures have been designed. Convolutional Neural Networks (CNNs), with their convolutional operations, are capable of capturing short-term contextual information and can be applied in financial fraud detection. For example, Fu et al. recombined transaction data into feature matrices and applied a CNN-based approach to identify latent fraud behaviors. Zheng et al. formulated a meta-learning-based classifier, including a feature extraction module, a K-Tuplet Network based on ResNet-34, which is a typical CNN structure. Besides CNNs, cyclic DL models, e.g., Recurrent Neural Networks (RNNs), have been further proposed and developed for sequence prediction [134, 135, 136]. In RNNs, the hidden state from the previous time step is also an input to the current hidden layer, which makes them suitable for encoding variable-length sequences of inputs. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are typical RNN architectures. They introduce “gates” that selectively let information through to mitigate the problems of vanishing and exploding gradients [137, 138, 139]. As temporal information is a crucial factor in financial data analysis, the RNN models significantly outperform basic MLPs due to their ability to encode sequential data. To name a few, Wang et al. presented a novel deep-learning-based system, namely CLUE, to detect transaction fraud at JD.com, one of China’s largest e-commerce platforms. Jurgovsky et al. considered the fraud detection problem in e-commerce as a sequence classification task and employed LSTM networks to incorporate the historical behavior of the users for detecting fraud on new incoming transactions. Branco et al. introduced a GRU-based framework to detect payment card fraud, in which the payments are treated as an interleaved sequence. Liu et al. devised a behavior tree and introduced a Local Intention Calibrated Tree-LSTM (LIC Tree-LSTM) for fraud transaction detection.
The behavior tree is built by splitting and reorganizing the sequential behavioral data, and each of its branches represents a specific user intention. In addition to CNNs and RNNs, quite a few other techniques can be employed to model sequential data. For example, Zhu et al. proposed a Hierarchical Explainable Network (HEN) to model users’ behavior sequences. In HEN, a field-level extractor encodes both first- and second-order information through Factorization Machines (FM). Then, an event-level extractor captures higher-order feature interactions for better sequence representation. Similarly, Xi et al. designed a Neural Hierarchical Factorization Machine (NHFM) model, a two-level architecture capturing feature interactions and representations of users’ historical events. They further presented the Dual Importance-aware Factorization Machines (DIFM), which exploit users’ historical behavior from dual perspectives.
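The recurrence underlying the RNN-based models in this subsection can be sketched as follows: each hidden state is computed from the current input and the previous hidden state, so a sequence of any length is folded into one fixed-size vector. This is a plain (ungated) RNN with a tanh activation and hand-set weights, assumed here for illustration; it is not an LSTM or GRU and does not reproduce any cited system.

```python
import math
from typing import List, Sequence


def rnn_step(x_t: Sequence[float], h_prev: Sequence[float],
             W_x: Sequence[Sequence[float]], W_h: Sequence[Sequence[float]],
             b: Sequence[float]) -> List[float]:
    """h_t = tanh(W_x x_t + W_h h_prev + b), computed one hidden unit at a time."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(wx_row, x_t))
            + sum(w * hj for w, hj in zip(wh_row, h_prev))
            + bi
        )
        for wx_row, wh_row, bi in zip(W_x, W_h, b)
    ]


def encode_sequence(xs: Sequence[Sequence[float]], W_x, W_h, b) -> List[float]:
    """Run the recurrence over a variable-length sequence; return the last hidden state."""
    h = [0.0] * len(b)
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h


if __name__ == "__main__":
    # Hypothetical 1-dimensional inputs (e.g., scaled transaction amounts) and weights.
    W_x, W_h, b = [[1.0]], [[0.5]], [0.0]
    print(encode_sequence([[1.0], [0.0]], W_x, W_h, b))
    print(encode_sequence([[0.0], [1.0]], W_x, W_h, b))
```

The two calls contain the same transactions in a different order and yield different encodings, which is the order sensitivity that feedforward networks on static features lack.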

Modeling relational data

Though sequential data demonstrate effectiveness in detecting fraud among users and transactions, the changes in the post-pandemic era make modeling relational data an urgent need. As we mentioned before, the intensified motives, insidious forms, and intelligent schemes call for comprehensive data analysis that considers interaction relations among users. This motivates the wide application of graph-based DL and Graph Neural Networks (GNNs) in financial fraud detection, since the graph is a natural choice for representing relational data.

Homogeneous relations

Initially, most graph-based methods only consider homogeneous graphs, in which the node and edge types are undifferentiated. Even so, they have yielded promising performance in financial crime and fraud detection, especially GNNs, which have the potential to improve structural representations and causal reasoning. They broadly follow a recursive message passing schema, in which each node computes its new representation by aggregating the feature vectors of its neighbors. For instance, Weber et al. applied Graph Convolutional Network (GCN), a typical GNN model, in anti-money laundering. Liang et al. introduced a device-sharing network among claimants and developed a GNN-based solution to uncover groups of organized fraudsters for return-freight insurance on an e-commerce platform. Furthermore, as most real-world graphs are dynamic, a few models consider an additional time dimension based on previous studies. By combining GNN and RNN in different ways, dynamic GNNs are proposed to mine structural and temporal information simultaneously. For example, DGANN is a dynamic graph-based attention neural network for risk guarantee relationship prediction. Each node in the graph represents a company, and each edge represents a guarantee. In the model, a GCN layer with structural attention processes each snapshot, and a Graph Recurrent Network with temporal attention is applied to exploit the temporal relationships between snapshots. Similarly, Yang et al. proposed a Spatial Temporal GNN (ST-GNN) to mine credible supply chain relationships, including risk analysis of small and medium-sized enterprises. Wang et al. proposed a Temporal-Aware GNN (TemGNN) to model credit risk prediction on dynamic graphs. Considering the time interval irregularity between dynamic snapshots, TemGNN adopts an interval-decayed attention mechanism and can assemble short- and long-term temporal-structural information.
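The message passing schema described above can be sketched in a few lines. This is a deliberately simplified, assumed variant: each node averages its own feature vector with those of its neighbors, while the learned weight matrices, nonlinearities, and attention used by GCN and the other cited models are omitted.

```python
from typing import Dict, List

Features = Dict[str, List[float]]
Adjacency = Dict[str, List[str]]


def message_passing_layer(features: Features, neighbors: Adjacency) -> Features:
    """One round of mean aggregation over each node's neighborhood, including itself."""
    updated: Features = {}
    for node, own in features.items():
        messages = [features[n] for n in neighbors.get(node, [])] + [own]
        updated[node] = [sum(vals) / len(messages) for vals in zip(*messages)]
    return updated


if __name__ == "__main__":
    # Hypothetical graph: account "a" is linked to "b" and "c".
    features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    one_hop = message_passing_layer(features, neighbors)
    two_hop = message_passing_layer(one_hop, neighbors)  # stacking widens the receptive field
    print(one_hop)
    print(two_hop)
```

Applying the layer repeatedly is what makes the schema recursive: after k rounds, a node's representation reflects information from nodes up to k hops away.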

Heterogeneous relations

Although previous work on homogeneous graphs offers practical solutions for modeling relational data, it has limited ability to capture information from realistic situations, especially the multi-relational data that emerged in the post-pandemic era. As remote businesses and transactions hinder access to comprehensive user information for identity verification, the switch from offline to online exacerbates information asymmetry and makes data completeness and data quality a major concern. Multi-source data are thus collected to alleviate this problem and better model user profiles. Researchers have started to focus on heterogeneous graphs, as they contain multiple types of nodes and links to represent different entities and relations, which mimics the data flows in real-world networks more closely.145, 146, 147, 148 For instance, under the problem formulation in Zhong et al., a node can be a customer, a merchant, or a device. In the graph constructed in Hu et al., an edge can represent a social connection, a money transaction, device ownership, and so forth. A heterogeneous graph is also termed a Heterogeneous Information Network. In recent financial fraud detection scenarios, DL solutions for heterogeneous graphs are often proposed under the Attributed Heterogeneous Information Network (AHIN), where both nodes and edges may carry attributes (also called features). Thus, we discuss the heterogeneous graph-based methods under the concept of AHIN.

While AHIN is a powerful information modeling method for characterizing data heterogeneity, it brings extra challenges in designing algorithms because of its complex topology and higher feature dimensions. One intuitive solution for AHIN is to decompose the heterogeneous graph into a series of homogeneous graphs and fuse the homogeneous representations. For example, Hu et al. devised AMG-DP, which employs relation-specific receptive layers to distinguish neighbors by relation attributes. After the neighbor information is aggregated following a typical GNN schema, representations incorporating rich semantics derived from multiplex relations are learned. A relation-specific attention mechanism then integrates the multiple representations adaptively for loan default prediction. Zhang et al. proposed Player2Vec to identify key players in online underground forums. In the model, a GCN is employed to learn an embedding from each single-view attributed graph. Then, an attention mechanism fuses the embeddings learned from the different single-view attributed graphs to obtain the final representations. Similarly, Wang et al. proposed a semi-supervised attentive GNN, named SemiGNN, which applies a hierarchical attention mechanism to better correlate different neighbors and different views.

The aforementioned heterogeneous GNNs reveal illegal acts by aggregating nodes’ neighborhood information across different relations. However, in the fraud detection scenario, some inherent characteristics of the data hamper the performance of GNN-based fraud detectors, and a few methods have been proposed to alleviate these issues. For example, to escape regulation, fraudsters camouflage themselves by adjusting their behavior to resemble benign users or by connecting themselves to benign users, which we call feature camouflage and relation camouflage, respectively. Dou et al. proposed CARE-GNN, which consists of three neural modules against camouflage. A label-aware similarity measure and a similarity-aware neighbor selector are leveraged to find informative neighboring nodes. A relation-aware neighbor aggregator combines neighborhood information across different relations with trainable weights. Sharing a similar idea, Liu et al. introduced a GNN framework, namely GraphConsis, to alleviate the problems of context, feature, and relation inconsistency. Besides, class imbalance, in which the label distribution of samples is heavily skewed, also degrades model performance. Liu et al. proposed a Pick and Choose GNN (PC-GNN) to remedy this challenge. In PC-GNN, nodes and edges are first picked with a devised label-balanced sampler to construct sub-graphs for mini-batch training. Next, for each node in the sub-graph, the neighbor candidates are chosen by a proposed neighborhood sampler. Finally, information from the selected neighbors and different relations is aggregated to obtain the final representation of a target node.

Another route for modeling AHIN is to encode nodes’ or links’ attributes via meta-path sampling. A meta-path is a path sampled over a graph according to preset rules, which are refined from prior experience about specific fraud patterns. For example, “User Merchant User” represents all paths starting from a user node, passing through a merchant node, and ending at a user node via two “Transaction” edges. The interaction relations among users can be explored under the guidance of predefined meta-paths. Hu et al. proposed HACUD, which picks meta-path-aware neighborhoods for each node and then aggregates features with a hierarchical attention mechanism to classify whether a user is a cash-out user. Zhong et al. proposed MAHINDER for financial defaulter detection, which implements meta-path sampling and considers multi-view decomposition. Unlike HACUD, MAHINDER models each meta-path with an LSTM-based encoder to capture local structural patterns and then adopts attention mechanisms at the node, link, and meta-path levels to learn fusion weights. The works of Hu et al. and Zhong et al. are typical meta-path-based algorithms in AHIN, although they do not follow the typical message passing schema of GNNs. Fan et al. further proposed PIdentifier to detect illicit trade in the underground market, which upgrades the kernel from a meta-path to a meta-graph, a graphlet composed of meta-paths. For each sampled meta-graph, a representation is learned based on a meta-graph-guided search. Multi-head attention is computed to construct embeddings for buyer nodes and product nodes separately.

The global coronavirus pandemic makes it harder to detect suspects for the following reasons: the economic fallout strengthens fraud motives, social distancing hinders information collection, and accelerated digital transformation affects existing detection methods. Reviewing the representative cases mentioned above, we see that, in response, anti-fraud systems have begun to mine deeper user-related information, such as sequential and relational data, and to gather information from multiple sources to better model real-world activities. Consequently, the data are becoming more irregular, moving from numerical indicators to transaction networks and from Euclidean to non-Euclidean data. In this case, DL techniques are becoming increasingly popular, as they can identify and combine crucial features from unstructured data to achieve high performance without any domain knowledge. In addition, graph-based fraud detection, especially on heterogeneous graphs, has received growing attention recently, as graphs can capture rich behavioral interactions.
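As a minimal sketch of the meta-path idea discussed in this section, the following Python snippet enumerates “User Merchant User” instances over a toy list of transactions and derives meta-path-based neighbors for each user. The data, the function name `user_merchant_user_paths`, and the choice to report each pair of distinct users only once are illustrative assumptions; the snippet does not reproduce HACUD, MAHINDER, or any other surveyed model.

```python
from collections import defaultdict
from itertools import combinations


def user_merchant_user_paths(transactions):
    """Enumerate User-Merchant-User meta-path instances.

    `transactions` is an iterable of (user, merchant) pairs, each standing
    for one "Transaction" edge. Returns a sorted list of
    (user_a, merchant, user_b) triples with user_a < user_b, so every
    unordered pair of distinct users is reported once per shared merchant.
    """
    users_by_merchant = defaultdict(set)
    for user, merchant in transactions:
        users_by_merchant[merchant].add(user)

    paths = []
    for merchant, users in users_by_merchant.items():
        for user_a, user_b in combinations(sorted(users), 2):
            paths.append((user_a, merchant, user_b))
    return sorted(paths)


# Hypothetical toy data: five users and three merchants.
transactions = [
    ("u1", "m1"), ("u2", "m1"), ("u3", "m1"),
    ("u3", "m2"), ("u4", "m2"),
    ("u5", "m3"),
]
paths = user_merchant_user_paths(transactions)

# Meta-path-based neighbors: users reachable through a shared merchant.
neighbors = defaultdict(set)
for user_a, _merchant, user_b in paths:
    neighbors[user_a].add(user_b)
    neighbors[user_b].add(user_a)
```

In a meta-path-based detector, neighborhoods of this kind would then be fed to an aggregation or encoding step; here `u5` has no meta-path neighbor because no other user transacts with `m3`.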

Challenges and future directions

Although data-driven artificial intelligence techniques have achieved excellent performance in the financial fraud detection domain, key issues remain unsolved, as financial fraud schemes are rapidly evolving to adapt to the new digital environment. In this section, we present the major challenges and suggest directions for future work from task-oriented, data-oriented, and model-oriented perspectives.

Financial fraud is harder to identify due to its increasing secretiveness and complexity

One of the most severe difficulties in financial fraud detection is that fraud is hidden in complex financial activities. The increased motives and the accelerated digital transformation caused by the pandemic lead to even more intelligent fraud schemes, which makes fraud more difficult to identify. These issues bring two essential challenges for detection.

The secretiveness of financial fraud leads to the natural error in samples

Fraud detection can, in many cases, be regarded as a classification task, which requires fraud samples and non-fraud samples as training data. However, as fraud activities are increasingly hidden, in most practical settings fraud cannot be fully identified by regulators and market participants. Consequently, the non-fraud samples used for training may contain unrecognized fraud samples, leading to natural errors among the training samples. When the natural error rate is high, the basic features of fraud and non-fraud samples captured by the detection model may be fundamentally wrong without the users of the model being aware of it, which seriously threatens the accuracy of detection results.
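The natural error described above can be quantified with simple arithmetic. The sketch below, with hypothetical numbers and a hypothetical helper `negative_contamination`, computes the share of the samples labeled as non-fraud that are in fact unrecognized fraud, assuming only a given share of all fraud cases has been identified.

```python
def negative_contamination(n_total, n_fraud, detected_share):
    """Share of the samples labeled 'non-fraud' that are actually fraud.

    n_total        -- total number of samples
    n_fraud        -- true number of fraud samples (unknown in practice)
    detected_share -- fraction of fraud samples that were identified
    """
    if not 0.0 <= detected_share <= 1.0:
        raise ValueError("detected_share must lie in [0, 1]")
    if not 0 <= n_fraud < n_total:
        raise ValueError("n_fraud must satisfy 0 <= n_fraud < n_total")

    labeled_fraud = n_fraud * detected_share       # recognized fraud
    hidden_fraud = n_fraud - labeled_fraud         # fraud hiding among negatives
    labeled_non_fraud = n_total - labeled_fraud    # everything not flagged
    return hidden_fraud / labeled_non_fraud


# Hypothetical case: 10,000 samples, 200 true frauds, half of them detected.
rate = negative_contamination(n_total=10_000, n_fraud=200, detected_share=0.5)
```

With these assumed numbers, 100 of the 9,900 samples labeled as non-fraud are hidden fraud, that is, roughly 1% of the negative class carries a wrong label.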

The complexity of financial activities leads to massive information involved

Financial activities are related to a wide range of businesses. Therefore, the information involved is massive but heterogeneous, with low value density. Multi-source information cannot play its role unless it is well integrated. Some researchers have explored models for storing and analyzing massive data, among which the knowledge graph is the most suitable for solving this problem. The knowledge graph is a knowledge system that connects all data through the relationships between the data. This knowledge-based system will, if possible, contain information about every entity related to fraud in the real world, which provides a panoramic perspective. Furthermore, logical consistency analysis between different nodes of the knowledge graph can help verify the authenticity of information and correct inconsistent information. Powerful knowledge reasoning technologies based on knowledge graphs can help mine the hidden relationships between entities connected with fraud and provide potential evidence to make up for missing information. Thus, the knowledge graph will be one of the most important and promising tools for mining valuable information for comprehensive detection in the future.
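One very simple form of the consistency analysis mentioned above can be sketched over (subject, relation, object) triples: relations that are assumed to be single-valued are flagged when different sources assert different values. The relation names, the entities, and the function `find_conflicts` are hypothetical and chosen only for illustration; real knowledge-graph reasoning is considerably richer.

```python
from collections import defaultdict

# Relations assumed (for this sketch only) to admit a single value per subject.
FUNCTIONAL_RELATIONS = {"registered_id", "legal_representative"}


def find_conflicts(triples, functional_relations=FUNCTIONAL_RELATIONS):
    """Return {(subject, relation): [conflicting objects]} for functional relations."""
    values = defaultdict(set)
    for subject, relation, obj in triples:
        if relation in functional_relations:
            values[(subject, relation)].add(obj)
    # A conflict exists when a single-valued relation has more than one value.
    return {key: sorted(objs) for key, objs in values.items() if len(objs) > 1}


# Hypothetical triples merged from several data sources.
triples = [
    ("company_A", "legal_representative", "person_X"),
    ("company_A", "legal_representative", "person_Y"),
    ("company_A", "transacts_with", "company_B"),
    ("company_A", "transacts_with", "company_C"),
    ("company_B", "legal_representative", "person_X"),
]
conflicts = find_conflicts(triples)
```

Here only `company_A` is flagged: it has two different legal representatives, whereas several `transacts_with` edges are legitimate because that relation is not assumed to be single-valued.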

Financial data for fraud detection is massive but scattered

In the information explosion era, multi-source data are massive but usually scattered across different institutions. At the same time, detecting fraud activities increasingly requires the support of panoramic data to gain a comprehensive understanding of miscreant activities. Integrating these scattered data and processing the massive data efficiently therefore remain challenging.

Data isolation is difficult to resolve

Although the amount of data used in fraud detection is far larger than before, most of the data exist as isolated islands, i.e., scattered across different institutions or even different countries. The difficulties in data aggregation make it hard to provide a comprehensive view of financial activities, which in turn greatly reduces the effectiveness of detection methods. Google proposed the federated learning framework, which helps to construct a complete and powerful model through joint modeling by multiple institutions. However, some key issues remain to be studied, such as inconsistent data formats across institutions and unstable network connections between institutions.
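To illustrate how joint modeling can proceed without pooling raw data, the sketch below shows the aggregation step of federated averaging, a commonly used rule in federated learning: each institution trains locally and shares only its parameter vector, and a coordinator averages the vectors weighted by local sample counts. The text above does not prescribe a specific algorithm, so the choice of this rule, the function name `federated_average`, and all numbers are assumptions.

```python
def federated_average(local_weights, sample_counts):
    """Average per-institution parameter vectors, weighted by sample count."""
    if not local_weights or len(local_weights) != len(sample_counts):
        raise ValueError("need one sample count per parameter vector")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    dim = len(local_weights[0])
    global_weights = [0.0] * dim
    for weights, count in zip(local_weights, sample_counts):
        if len(weights) != dim:
            raise ValueError("all parameter vectors must have the same length")
        for i, value in enumerate(weights):
            global_weights[i] += value * count / total
    return global_weights


# Hypothetical locally trained parameters; raw transactions never leave a bank.
bank_a = [1.0, 2.0]   # trained on 100 local samples
bank_b = [3.0, 6.0]   # trained on 300 local samples
global_model = federated_average([bank_a, bank_b], [100, 300])
```

Because `bank_b` holds three times as many samples, its parameters receive three times the weight in the combined model. The open issues named above, such as inconsistent data formats and unstable connections, are outside the scope of this sketch.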

Large-scale data processing brings great challenges to model training

The growth of digital services records more user footprints and information, but it also brings more challenges to massive data processing. Many detection methods require plenty of time to optimize parameters, and the time grows nonlinearly with the sample size. When modeling is time-consuming, a detection model cannot be obtained quickly, so it is difficult to update the detection model in time. For example, DL can be applied to process large amounts of data, but training its parameters is extremely time-consuming. Further research is required to fully develop and apply more advanced technologies to solve these practical fraud detection problems.

Financial fraud detection models need to be more flexible and interpretable

Although the emerging research and application of GNNs and other models have helped improve financial fraud detection efficiency by utilizing multiple types of information, there are still many challenges with the practicality of the detection models, such as model bias, robustness, and interpretability.

Model bias issue needs to be taken into account

Model bias is a significant issue in the machine learning field; it refers to the difference between the model prediction and the actual value we are trying to predict. In fraud detection practice, there are roughly two sources of model bias: the data samples and the models themselves. Class imbalance is a crucial contributor to high model bias and is overwhelmingly observed in fraud detection, as fraudsters are typically far fewer than regular users. Models that perform poorly on the minority may lead to undesirable results, as people are more concerned about the minority class, i.e., the fraudsters. The class imbalance problem in feature-based neural methods has been studied in depth, with approaches such as re-sampling,162, 163, 164 re-weighting,165, 166, 167, 168 and transfer learning. In GNN-based work, however, noisy information, few interactions among fraudsters, and the dilution of the minority class’s features caused by the message aggregation of GNNs are three major challenges in designing class-imbalanced GNNs for fraud detection. Future studies that follow up on these directions would be beneficial. Other sample-induced model biases also need to be addressed, such as under-representation, ignoring sensitive attributes, and social feedback loops. As for the detection models themselves, initial design flaws are also an essential factor leading to model bias and are very hard to avoid. However, recent research has made progress on calibrating this kind of bias.
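As a small worked example of the re-weighting strategy mentioned above, the following sketch computes inverse-frequency class weights, so that the rare fraud class contributes as much to a weighted loss as the majority class. The label counts and the helper name `balanced_class_weights` are hypothetical, and this is only one of several possible weighting schemes.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training labels: 95 regular users (0) and 5 fraudsters (1).
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# Total weight carried by each class after re-weighting.
mass_regular = 95 * weights[0]
mass_fraud = 5 * weights[1]
```

With these assumed counts, each fraud sample receives a weight of 10 and each regular sample a weight of about 0.53, so both classes carry the same total weight. Such weights address the skewed label distribution only; they do not address the GNN-specific issues described above, such as the dilution of minority features during message aggregation.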

Robustness needs to be strengthened

The fraudster and the anti-fraud party are always in a dynamic game. With new technology, the game among fraudsters, financial institutions, and regulators is escalating and becoming highly adversarial. The robustness and adversarial issues of conventional DL models have attracted extensive attention from researchers174, 175, 176; however, the study of these issues for GNNs is still in its nascent stages. Moreover, due to the message propagation mechanism of GNNs, the effects of small perturbations can spread, resulting in even worse performance than non-GNNs. In financial scenarios, attackers aim to interfere with defense models to seek exorbitant profits. Hence, detecting and defending against harmful perturbations and designing robust models, especially for GNNs, are becoming major implementation goals.
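The spreading effect of message propagation can be illustrated on a toy graph. In the sketch below, one node's feature is perturbed and a plain mean aggregation over each node and its neighbors is applied repeatedly; the set of nodes whose values change grows by one hop per round. The path graph, the scalar features, and the functions `mean_aggregate` and `affected_nodes` are assumptions made for illustration and do not model any specific attack or any surveyed GNN.

```python
def mean_aggregate(features, adjacency):
    """One round of message passing: average a node with its neighbors."""
    updated = {}
    for node, neighbors in adjacency.items():
        group = [node] + list(neighbors)
        updated[node] = sum(features[n] for n in group) / len(group)
    return updated


def affected_nodes(adjacency, clean, perturbed, rounds):
    """Nodes whose value differs between the clean and perturbed runs."""
    a, b = dict(clean), dict(perturbed)
    for _ in range(rounds):
        a = mean_aggregate(a, adjacency)
        b = mean_aggregate(b, adjacency)
    return sorted(n for n in adjacency if abs(a[n] - b[n]) > 1e-12)


# Hypothetical path graph 0 - 1 - 2 - 3 - 4 with scalar node features.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
clean = {node: 0.0 for node in adjacency}
perturbed = dict(clean)
perturbed[0] = 1.0  # the attacker changes a single node's feature

after_one_round = affected_nodes(adjacency, clean, perturbed, rounds=1)
after_two_rounds = affected_nodes(adjacency, clean, perturbed, rounds=2)
```

A feature-only model would be affected at node 0 alone, whereas after k rounds of aggregation every node within k hops of the perturbed node receives a changed representation.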

Interpretability needs to be improved

A key factor in the success of deep neural networks is that they can be seen as a very large number of nonlinear functions, which enables them to learn features at various levels of abstraction at the cost of interpretability and explainability.179, 180, 181 As a result, they cannot be fully trusted in critical applications such as financial fraud detection. Although several post hoc explanatory methods have been developed recently to understand DL models, research has shown that many interpretation methods may produce unfaithful results.182, 183, 184, 185 For graph-based neural networks in particular, the unique non-Euclidean structure brings more challenges, as gradient- or backpropagation-related methods cannot be directly applied. Although researchers have explored the interpretation of GNNs, most of them are still working on toy examples and cannot solve problems in real-world financial scenarios.186, 187, 188 Hence, further research is required to understand not only conventional GNNs but also more complex structures, such as models on AHIN.

Conclusions

In this survey, we provided a comprehensive overview of financial fraud detection practices from three aspects: the impact of the pandemic, the evolution of the data, and the advancement of methods. The unprecedented pandemic shocked the global financial system and accelerated digital transformation, bringing stronger motives, more insidious forms, and more intelligent schemes of financial fraud. As for the data, applying more panoramic data to comprehensively detect fraud activities is the prevailing trend. The data used in fraud detection practices have developed from basic quantitative data to the current multi-source unstructured data. In the post-pandemic era, the explosion of data provides more information than before, and fraud detection is inclined to use multi-source data to obtain a comprehensive understanding of financial activities. As for the model, DL systems have been popular recently for their versatility and revolutionary success in financial fraud detection. The graph-based detection approach is an emerging direction for analyzing multi-source data of fraud activities. With the rapid development of technology, financial scenarios and behaviors are becoming more intelligent and sophisticated. Graph-based detection, such as with GNNs, attracts more attention since graphs can gather information from multiple sources to better model real-world activities and detect hidden anomalies more effectively. Although data-driven DL models have proven helpful in fraud detection problems, many challenges remain to be solved for future development. Complex and hidden fraud activities bring greater challenges to comprehensive understanding and accurate identification. Achieving efficient integration and processing of massive but scattered financial data is one of the important foundations for panoramic fraud detection. Finally, the flexibility, robustness, and interpretability challenges of models need to be considered more seriously in the context of financial fraud.