Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Frania Chettiar
DOI Link: https://doi.org/10.22214/ijraset.2024.64502
Detecting fraudulent transactions is highly challenging because available datasets are severely imbalanced: fraudulent cases make up only a small percentage of all transactions. This work presents a hybrid anomaly detection framework that combines Autoencoders for dimensionality reduction with the Local Outlier Factor (LOF) and Isolation Forest algorithms for anomaly detection. We use the standard Credit Card Fraud Detection Dataset [7], which contains 284,807 transactions, of which only 492 are labelled fraudulent. We apply the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset. The results show that although LOF struggles in terms of precision, its recall increases with proper adjustment of the contamination parameter and the use of SMOTE. In comparison, the Isolation Forest algorithm performs well in terms of recall, detecting 81% of frauds, but its precision degrades slightly after SMOTE is applied. Both techniques involve trade-offs between precision and recall, which indicates scope for further optimization. Both LOF and Isolation Forest contribute to detecting anomalies in imbalanced datasets, with Isolation Forest proving more effective than LOF at detecting fraudulent transactions. Our results indicate that combining Autoencoder-based feature extraction with anomaly detection techniques is beneficial in fraud detection applications, particularly under severe class imbalance. Future research will explore other oversampling techniques and finer tuning of parameter settings to achieve a better balance between precision and recall.
I. INTRODUCTION
The rapid evolution of digital transactions, especially in finance, has transformed payment processing globally. Alongside this progress, however, fraud in online transactions has emerged as a major cyber threat. Fraud detection has therefore become a major challenge for financial institutions seeking to contain financial losses and protect customer data. Detection matters not only for protecting financial systems but also for maintaining customer trust and, ultimately, for complying with regulatory requirements. Fraud detection models, especially when deployed in real time, must sift through massive amounts of data to find rare fraudulent transactions hidden among a great number of legitimate ones. The intrinsic difficulty of the problem is that the data are highly imbalanced, since fraudulent activities typically make up a very small proportion of all transactions. For instance, in the dataset used in this work, fraudulent transactions comprise only 0.17% of the total, which poses critical challenges for traditional machine learning models. Supervised models tend to overfit the majority class, namely the non-fraudulent transactions, which leads to poor generalization to the minority class of fraudulent transactions. Because fraudulent cases are so rare in real-world data, this produces a large number of false negatives: financial institutions fail to detect genuine fraud incidents, which results in actual losses. Unsupervised learning and hybrid models, on the other hand, have proven more robust for fraud detection, especially in heavily imbalanced scenarios. Autoencoders are effective for anomaly detection because they learn compact representations of the data and can identify instances that do not fit the patterns they have learned.
Used together, these anomaly detection methods are better at capturing both the normal behaviour of transactions and the anomalies among them, that is, fraudulent transactions. In this paper, we propose a hybrid approach to detecting fraudulent transactions that uses autoencoders for feature extraction together with LOF and Isolation Forest (IF) for outlier detection. The autoencoder compresses the transaction data into a lower-dimensional feature space that captures key transaction characteristics while discounting noise. LOF and Isolation Forest then find anomalies in this feature space. Our approach uses SMOTE for oversampling and careful tuning of model parameters, with the aim of improving the detection of fraudulent transactions while keeping false positives to a minimum.
II. LITERATURE REVIEW
Recent studies on anomaly detection in financial fraud apply a range of models to fraud detection on highly imbalanced datasets. This section briefly reviews prior work in this area and outlines how our work extends it by addressing limitations found in related research.
Autoencoders and Unsupervised Anomaly Detection: Wongvorachan et al. applied autoencoders to unsupervised anomaly detection in fraud detection applications. They pointed out that although autoencoders reconstruct genuine transactions well, they respond poorly to anomalies, leading to a high false-positive rate. Their work indicated that techniques of this kind should incorporate SMOTE to handle imbalance within the dataset, but the issue was not explored in detail [1].
Misra et al. combined autoencoders with Isolation Forest for credit card fraud detection. They found that applying Isolation Forest after features had been extracted by autoencoders resulted in higher detection rates; however, they faced difficulties in fine-tuning the contamination parameter. They also suggested further fine-tuning of both the model and the dataset, which remained highly imbalanced [2].
Hybridizing Grey Wolf Optimization with Isolation Forest: Shen et al. suggested a hybrid model that integrates Grey Wolf Optimization with Isolation Forest for fraud detection. The authors demonstrated that the contamination parameter of Isolation Forest can be optimized but did not address how autoencoders could be used to extract features or how oversampling techniques such as SMOTE would improve the detection rate [3].
Explainable Models for Fraud Detection: Shen and Maxion (2014) used SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to explain the decisions of fraud detection models, including autoencoders and Isolation Forest. However, they did not address imbalance in the dataset; instead, they focused on enhancing the transparency of detection models [4].
Random Forest with SMOTE: Bauder and Khoshgoftaar (2018) analyzed SMOTE combined with Random Forest for credit card fraud detection. Adding SMOTE resulted in increased detection rates; however, they concluded that Random Forest, though fairly efficient for classification, is not as well suited to anomaly detection as Isolation Forest and LOF [5].
Randhawa et al. (2018) applied LOF to fraud detection in healthcare insurance. They concluded that LOF identifies anomalies correctly in high-dimensional spaces but that its accuracy was hampered by noise in the imbalanced datasets. The study suggested that LOF could be improved by oversampling techniques and by feature extraction using deep learning [6].
Although Autoencoders, LOF, and Isolation Forest have each been applied to anomaly detection in the work reviewed above, very few studies combine them in a hybrid architecture. In addition, several of these works mention SMOTE but do not focus on its specific implementation. This work addresses these gaps.
III. DATASET
The Credit Card Fraud Detection Dataset contains 284,807 transactions, of which only 492 are fraudulent, an extreme class imbalance of merely 0.17% fraudulent transactions. The dataset has 30 numerical features. V1 to V28 are the main features, obtained through a PCA transformation that anonymizes the data while keeping key transactional patterns; Time is the number of seconds elapsed since the first transaction, and Amount is the value of the transaction. The target variable, Class, indicates whether a transaction is fraudulent: Class = 1 for fraudulent transactions and Class = 0 for valid ones [7].
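The imbalance figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the two counts stated in this section; the "always normal" baseline is an illustrative assumption added to show why accuracy alone says little on data this skewed.

```python
# Counts quoted in the dataset description.
total_transactions = 284_807
fraud_transactions = 492

normal_transactions = total_transactions - fraud_transactions
fraud_share = fraud_transactions / total_transactions        # fraction of all rows
normal_per_fraud = normal_transactions / fraud_transactions  # legitimate rows per fraud row

# Hypothetical baseline: a model that labels every transaction as legitimate
# is correct on every normal row and wrong on every fraud row.
always_normal_accuracy = normal_transactions / total_transactions

print(f"fraud share: {fraud_share:.2%}")
print(f"normal transactions per fraud: {normal_per_fraud:.0f}")
print(f"accuracy of the 'always normal' baseline: {always_normal_accuracy:.2%}")
```

Such a baseline detects no fraud at all yet scores very high accuracy, which is why precision, recall, and F1-score are the more informative metrics here.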
IV. METHODOLOGY
This paper proposes a hybrid fraud detection framework that integrates Autoencoders, Local Outlier Factor, Isolation Forest, and SMOTE to handle imbalanced datasets. The methodology is organized into the following stages:
A. Data Preprocessing
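SMOTE, which this framework uses to balance the classes, creates synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The following is a minimal pure-Python sketch of that interpolation step, not the implementation used in this work; the function name `smote_sketch`, the toy points, and the value of `k` are illustrative assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class in two dimensions (illustrative values only).
fraud_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(fraud_points, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point is a convex combination of two real minority points, it stays inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training split only, so that the evaluation data keep the original class ratio.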
B. Dimensionality Reduction using Autoencoder
An autoencoder is an unsupervised deep learning model. It learns a compressed, lower-dimensional representation of the data and then reconstructs the input from it. Training minimizes the reconstruction error, so transactions that cannot be reconstructed well can be identified for further analysis.
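To make the encode, decode, and reconstruction-error steps concrete, here is a deliberately tiny sketch: a linear autoencoder with three inputs and a one-dimensional bottleneck, trained by stochastic gradient descent in pure Python. The architecture, learning rate, and synthetic data are all assumptions chosen for illustration; the model used in this work is not specified at this level of detail.

```python
import random

rng = random.Random(0)
direction = (0.408, 0.816, -0.408)  # "normal" data lie close to this line

def make_normal_point():
    t = rng.uniform(-1.0, 1.0)
    return [t * d + rng.gauss(0.0, 0.02) for d in direction]

train = [make_normal_point() for _ in range(200)]

w_enc = [0.1, 0.1, 0.1]  # encoder weights: 3 inputs -> 1 code value
w_dec = [0.1, 0.1, 0.1]  # decoder weights: 1 code value -> 3 outputs

def encode(x):
    return sum(w * xi for w, xi in zip(w_enc, x))

def decode(z):
    return [w * z for w in w_dec]

def reconstruction_error(x):
    return sum((r - xi) ** 2 for r, xi in zip(decode(encode(x)), x))

def mean_error(data):
    return sum(reconstruction_error(x) for x in data) / len(data)

loss_before = mean_error(train)
learning_rate = 0.05
for epoch in range(50):
    for x in train:
        z = encode(x)
        err = [r - xi for r, xi in zip(decode(z), x)]
        grad_z = sum(2 * e * w for e, w in zip(err, w_dec))  # dLoss/dz
        for i in range(3):
            w_dec[i] -= learning_rate * 2 * err[i] * z       # dLoss/dw_dec[i]
            w_enc[i] -= learning_rate * grad_z * x[i]        # dLoss/dw_enc[i]
loss_after = mean_error(train)

normal_example = [0.5 * d for d in direction]  # lies on the learned line
anomaly_example = [0.5, -0.5, 0.5]             # lies far from it
normal_error = reconstruction_error(normal_example)
anomaly_error = reconstruction_error(anomaly_example)
print(f"mean training error before/after: {loss_before:.4f} / {loss_after:.4f}")
print(f"reconstruction error normal/anomaly: {normal_error:.4f} / {anomaly_error:.4f}")
```

In this sketch, `encode(x)` plays the role of the compressed feature that a downstream detector would consume, while `reconstruction_error(x)` is the quantity that grows for inputs unlike the training data.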
C. Anomaly Detection using LOF and Isolation Forest
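LOF scores a point by comparing its local density with the local density of its nearest neighbours: a score near 1 means the point is about as densely surrounded as its neighbours, and a score well above 1 marks a local outlier. The sketch below implements the standard LOF definitions in pure Python on toy data; the function names, the toy points, and `k=3` are illustrative assumptions, and the sketch assumes there are no duplicate points. Library implementations such as scikit-learn's `LocalOutlierFactor` would normally be used on real data.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def lof_scores(points, k=3):
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour
    k_distance = [math.dist(points[i], points[neighbours[i][-1]]) for i in range(n)]

    def reach_dist(i, j):
        # reachability distance of point i with respect to neighbour j
        return max(k_distance[j], math.dist(points[i], points[j]))

    # local reachability density: inverse of the mean reachability distance
    lrd = [k / sum(reach_dist(i, j) for j in neighbours[i]) for i in range(n)]
    # LOF: mean ratio of the neighbours' density to the point's own density
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Toy encoded features: a tight cluster plus one isolated point (the last one).
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (0.05, 0.05), (0.2, 0.1), (5.0, 5.0)]
scores = lof_scores(features)
for point, score in zip(features, scores):
    print(point, round(score, 2))
```

The isolated point receives by far the largest score because its neighbours sit in a much denser region than it does.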
D. Training and Validation
V. RESULTS AND DISCUSSION
To provide context for our findings, the following table summarizes key descriptive statistics of the dataset used for training:
TABLE I Descriptive Statistics Summary
| Statistic | Value |
| --- | --- |
| Total Transactions | 454,902 |
| Fraudulent Transactions | 394 |
| Normal Transactions | 454,508 |
| Mean Transaction Amount | 90.31 |
| Standard Deviation of Amount | 238.44 |
| Class Imbalance Ratio | 0.00087 |
The performance of each model is summarized in the following table, which includes key metrics such as precision, recall, F1-score, and accuracy:
TABLE II
Performance Metrics Summary
| Model | Precision | Recall | F1-Score | Accuracy |
| --- | --- | --- | --- | --- |
| Local Outlier Factor (LOF) | 0.00 | 0.02 | 0.01 | 99% |
| LOF with Adjusted Contamination Rate | 0.00 | 0.05 | 0.01 | 98% |
| Isolation Forest (IF) | 0.07 | 0.81 | 0.13 | 98% |
| After SMOTE (LOF) | 0.40 | 0.02 | 0.03 | 50% |
| After SMOTE + Isolation Forest | 0.99 | 0.04 | 0.08 | 52% |
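For reference, the four metrics in the table all derive from the confusion-matrix counts (true positives, false positives, false negatives, true negatives). The sketch below shows the standard formulas; the counts passed to `classification_metrics` are hypothetical and are not taken from the experiments. The second part recomputes F1 from the rounded precision and recall reported for the Isolation Forest row.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts: 10,000 transactions, 10 of them fraudulent.
precision, recall, f1, accuracy = classification_metrics(tp=8, fp=92, fn=2, tn=9898)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2%}")

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall reported for Isolation Forest in the table.
isolation_forest_f1 = f1_from(0.07, 0.81)
print(f"Isolation Forest F1 recomputed from rounded values: {isolation_forest_f1:.2f}")
```

With the hypothetical counts, accuracy stays above 99% even though only 8 of every 100 flagged transactions are actually fraudulent, which mirrors the gap between accuracy and precision in the table. Recomputing F1 from two-decimal inputs can differ from a published value in the last digit.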
A. Results Overview
Autoencoder: The autoencoder learned to reconstruct genuine transactions by minimizing the reconstruction error during training. It served as the feature extraction step, reducing dimensionality so that the subsequent anomaly detection methods could be applied.
B. Local Outlier Factor (LOF)
When we applied LOF to the encoded features, the model achieved a precision of 0.00 and a recall of 0.02. The confusion matrix for LOF, shown in Fig. 1, breaks down how the model classified normal and fraudulent transactions:
Fig. 1 Confusion Matrix of LOF
C. Adjusted Contamination Rate with LOF
Adjusting the contamination rate raised recall to 0.05 while precision remained at 0.00, as shown in Fig. 2.
Fig. 2 Confusion Matrix with Adjusted Contamination Rate
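The contamination rate is the fraction of points a detector is asked to flag as anomalous, so changing it moves the decision threshold on the anomaly scores. The sketch below illustrates that mechanism on hypothetical scores and labels; it assumes that higher scores mean more anomalous, and the helper names are illustrative rather than part of any library.

```python
def flag_top_fraction(scores, contamination):
    """Flag the given fraction of points with the highest anomaly scores."""
    n_flagged = max(1, round(contamination * len(scores)))
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    return [score >= threshold for score in scores]

def precision_recall(flags, labels):
    tp = sum(1 for f, y in zip(flags, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flags, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flags, labels) if not f and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical anomaly scores and true labels (1 = fraud) for 20 transactions.
scores = [0.90, 0.60, 0.80, 0.70] + [0.10 + 0.02 * i for i in range(16)]
labels = [1, 1, 0, 0] + [0] * 16

low = precision_recall(flag_top_fraction(scores, 0.05), labels)
high = precision_recall(flag_top_fraction(scores, 0.20), labels)
print(f"contamination=0.05: precision={low[0]:.2f}, recall={low[1]:.2f}")
print(f"contamination=0.20: precision={high[0]:.2f}, recall={high[1]:.2f}")
```

In this toy case a larger contamination rate catches both frauds but also flags two legitimate transactions, which is the precision and recall trade-off that tuning this parameter has to balance.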
D. Isolation Forest (IF)
The Isolation Forest algorithm demonstrated enhanced detection capabilities, achieving a precision of 0.07 and a recall of 0.81, as shown in Fig. 3.
Fig. 3 Confusion Matrix with Isolation Forest
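Isolation Forest scores a point by how few random axis-aligned splits are needed to separate it from the rest of the data: isolated points end up in shallow leaves and receive scores closer to 1. The following pure-Python sketch shows the idea on toy data. It is a simplification under stated assumptions: it builds every tree on the full toy set rather than on random subsamples, and the tree count, depth limit, and data are illustrative. On real data a library implementation such as scikit-learn's `IsolationForest` would normally be used.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Average depth of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, rng, depth=0, max_depth=8):
    """Recursively split rows on a random feature at a random threshold."""
    if len(rows) <= 1 or depth >= max_depth:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    if low == high:
        return {"size": len(rows)}
    split = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def path_length(x, node, depth=0):
    """Depth at which x reaches a leaf, plus an adjustment for unsplit leaf points."""
    if "size" in node:
        return depth + average_path_length(node["size"])
    branch = "left" if x[node["feature"]] < node["split"] else "right"
    return path_length(x, node[branch], depth + 1)

def anomaly_score(x, trees, n_samples):
    mean_depth = sum(path_length(x, tree) for tree in trees) / len(trees)
    return 2.0 ** (-mean_depth / average_path_length(n_samples))

rng = random.Random(0)
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
outlier = (8.0, 8.0)
data = inliers + [outlier]
trees = [build_tree(data, rng) for _ in range(50)]

outlier_score = anomaly_score(outlier, trees, len(data))
mean_inlier_score = sum(anomaly_score(p, trees, len(data)) for p in inliers) / len(inliers)
print(f"outlier score: {outlier_score:.2f}, mean inlier score: {mean_inlier_score:.2f}")
```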
E. Results After SMOTE
The application of SMOTE for class balancing resulted in notable changes in LOF performance, with a precision of 0.40 and a recall of 0.02, as shown in Fig. 4.
Fig. 4 Confusion Matrix after SMOTE
F. Results with SMOTE and Isolation Forest
Combining SMOTE with Isolation Forest yielded a precision of 0.99 and a recall of 0.04, as shown in Fig. 5. Compared with Isolation Forest alone, precision rose sharply while recall fell.
Fig. 5 Confusion Matrix after SMOTE and Isolation Forest
VI. CONCLUSION
This work presented a hybrid fraud detection approach to improve the detection of fraudulent transactions in a large, highly imbalanced dataset. The approach addresses two critical problems in fraud detection: class imbalance and the lack of effective anomaly detection solutions. Applying Autoencoders for dimensionality reduction captured meaningful features while suppressing noise, which benefited the models that integrate LOF and Isolation Forest. We also balanced the training dataset with SMOTE to improve recall for fraudulent transactions without greatly sacrificing precision. Experimental results showed that, although LOF had very low precision, the Isolation Forest model was robust in recall and recognized a higher percentage of fraudulent transactions. SMOTE was then applied to enhance this capability further. Future work will include more refined models, more advanced hyperparameter tuning methods, and the integration of additional anomaly detection algorithms. Our study outlines the potential benefit of combining diverse approaches to overcome the issues arising in imbalanced datasets for fraud detection and can serve as a starting point for further research in this critical field.
REFERENCES
[1] T. Wongvorachan, S. He, and O. Bulut, "A Comparison of Undersampling, Oversampling, and SMOTE Methods for Dealing with Imbalanced Classification in Educational Data Mining," *Information*, vol. 14, no. 1, p. 54, 2023.
[2] S. Misra, S. Thakur, M. Ghosh, and S. K. Saha, "An Autoencoder Based Model for Detecting Fraudulent Credit Card Transactions," *Procedia Computer Science*, vol. 167, pp. 254-262, 2020.
[3] C. Shen, Z. Cai, X. Guan, and R. Maxion, "Performance Evaluation of Anomaly Detection Algorithms for Mouse Dynamics," *Computer Security*, vol. 45, pp. 156-171, 2021.
[4] J. Shen, "Credit Card Fraud Detection Using Autoencoder-Based Deep Neural Networks," in *2021 IEEE 2nd International Conference on Computer and Communication Engineering Technology (CCET)*, 2021, pp. 263-270.
[5] R. Bauder and T. M. Khoshgoftaar, "Medicare fraud detection using machine learning methods," *Journal of Big Data*, vol. 5, no. 1, 2018.
[6] K. Randhawa, C. K. Loo, M. Seera, C. P. Lim, and A. K. Nandi, "Credit Card Fraud Detection Using AdaBoost and Majority Voting," *IEEE Access*, vol. 6, pp. 14277-14284, 2018.
[7] Kaggle, "Credit Card Fraud Detection." Available at: https://www.kaggle.com/datasets/dalpozz/creditcard-fraud
Copyright © 2024 Frania Chettiar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET64502
Publish Date : 2024-10-08
ISSN : 2321-9653
Publisher Name : IJRASET