'Fraud' in credit card transactions is the unauthorized and unwanted use of an account by someone other than its owner. Preventive measures can be taken to stop this abuse, and the behavior of fraudsters can be studied to minimize it and to protect against similar occurrences in the future. In other words, credit card fraud is a case where a person uses someone else’s credit card for personal gain while the owner and the card-issuing authority are unaware that the card is being used. The problem is particularly challenging from a learning perspective because of factors such as class imbalance: the number of valid transactions far outnumbers the number of fraudulent ones. In addition, the statistical properties of transaction patterns often change over time.
II. SCOPE
Fraud detection involves monitoring the activities of populations of users in order to estimate, perceive, or avoid objectionable behavior, which includes fraud, intrusion, and defaulting. This is a highly relevant problem for communities such as machine learning and data science, where the solution can be automated.
III. PLATFORM: GOOGLE COLAB
Google Colab was developed by Google to provide free access to GPUs and TPUs to anyone who needs them to build a machine learning or deep learning model. Google Colab can be described as a hosted version of Jupyter Notebook that runs in the cloud.
As programmers, we can do the following with Google Colab:
Write and execute code in Python
Document code with support for mathematical equations
Create, upload, and share notebooks
Import and save notebooks from and to Google Drive
Import and publish notebooks from GitHub
Import external datasets, e.g. from Kaggle
Integrate PyTorch, TensorFlow, Keras, and OpenCV
Use a free cloud service with a free GPU
Colab, or Colaboratory, is an interactive notebook environment provided by Google, primarily for writing and running Python through a browser. We can perform data analysis, build models, and evaluate them in Colab. The processing is done on Google-owned servers in the cloud, so we only need a browser and a fairly stable internet connection. Colab is a useful tool for students, professionals, and researchers alike. Although Colab is primarily used for coding in Python, it can also run R, and an R notebook can mount Google Drive or access BigQuery.
A. Software Specifications
Google Colaboratory
B. Hardware Specifications
Microsoft® Windows® 7/8/10 (32- or 64-bit)
3 GB RAM minimum, 8 GB RAM recommended
2 GB of available disk space minimum
Intel Core i3 processor or above
C. Dataset
creditcard.csv, which is available on Kaggle (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud).
D. Packages Required
ranger
caret
data.table
caTools
rpart.plot
neuralnet
gbm
pROC
IV. LITERATURE REVIEW
Fraud is unlawful or criminal deception intended to result in financial or personal benefit. It is a deliberate act against a law, rule, or policy with the aim of obtaining unauthorized financial benefit. Numerous studies on anomaly and fraud detection in this domain have already been published and are publicly available. A comprehensive survey conducted by Clifton Phua and his associates revealed that the techniques employed in this domain include data mining applications, automated fraud detection, and adversarial detection.
In another paper, Suman, Research Scholar, GJUS&T at Hisar HCE, presented techniques such as supervised and unsupervised learning for credit card fraud detection. Even though these methods and algorithms achieved unexpected success in some areas, they failed to provide a permanent and consistent solution to fraud detection. Similar research was presented by Wen-Fang YU and Na Wang, who used outlier mining, outlier detection mining, and distance sum algorithms to predict fraudulent transactions in an emulation experiment on the credit card transaction data set of a commercial bank. Outlier mining is a field of data mining that is mainly used in the monetary and internet fields. It deals with detecting objects that are detached from the main system, i.e. transactions that are not genuine. The authors took attributes of customer behaviour and, based on the values of those attributes, calculated the distance between the observed value of each attribute and its predetermined value.
Unconventional techniques, such as a hybrid data mining/complex network classification algorithm, are able to detect illegal instances in an actual card transaction data set. This approach is based on a network reconstruction algorithm that creates representations of the deviation of one instance from a reference group, and it has proved efficient typically on medium-sized online transactions. There have also been efforts to approach the problem from a completely new angle. Attempts have been made to improve the alert-feedback interaction: when a fraudulent transaction is detected, the authorized system is alerted and feedback is sent to deny the ongoing transaction. The Artificial Genetic Algorithm, one of the approaches that shed new light on this domain, countered fraud from a different direction.
In 2015, J. Esmaily and R. Moradinezhad proposed a hybrid of an artificial neural network and a decision tree. Their model uses a two-phase approach. In the first phase, the classification results of a decision tree and a multilayer perceptron are used to generate a new dataset, which in the second phase is fed into a multilayer perceptron to produce the final classification. This model promises reliability by giving a very low false detection rate. In 2011, Siddhartha Bhattacharyya and four co-authors carried out a detailed comparative study of support vector machines and random forests along with logistic regression. They concluded from their experiments that random forest was the most accurate technique, followed by logistic regression and support vector machines.
V. IMPLEMENTATION
In the first step of this data science project, we perform data exploration. We import the essential packages required for this task and then read the data. Finally, we go through the input data to gain the necessary insights about it.
VI. READING EVENTS FROM CREDITCARD.CSV
Before starting the credit card fraud detection (CCFD) analysis, the first step is to read the data on which the analysis is performed. The data is stored in a file named creditcard.csv. This dataset contains about 0.28 million records, each with a Time column, the anonymized features V1 to V28, an Amount column, and a Class label. The records are raw and have not yet been prepared for modelling. The data set is read using the command “read.csv”.
A. Data Exploration
First, we imported the dataset that contains the transactions made by credit cards. We then explored the data held in the creditcard_data dataframe. After displaying creditcard_data using the head() function as well as the tail() function, we proceeded to explore the other components of this dataframe.
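The exploration step described above can be sketched as follows. Since the Kaggle file may not be available locally, this sketch writes a tiny stand-in CSV to a temporary file and reads it back; the rows are made up for illustration and only a subset of the columns is used. In a real run, csv_path would point to creditcard.csv.

```r
# Hypothetical stand-in for creditcard.csv: a few made-up rows, NOT real data.
toy <- data.frame(
  Time   = c(0, 1, 2, 4, 7, 9),
  V1     = c(-1.36, 1.19, -1.36, -0.97, 1.23, -0.43),
  V2     = c(-0.07, 0.27, -1.34, -0.19, 0.14, 0.96),
  Amount = c(149.62, 2.69, 378.66, 123.50, 4.99, 40.80),
  Class  = c(0, 0, 0, 0, 1, 0)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the file; in a real run csv_path would be the path to creditcard.csv.
creditcard_data <- read.csv(csv_path)

dim(creditcard_data)       # number of rows and columns
head(creditcard_data, 3)   # first rows
tail(creditcard_data, 3)   # last rows

# Class balance: counts and shares of legitimate (0) and fraudulent (1) rows.
class_counts <- table(creditcard_data$Class)
class_share  <- prop.table(class_counts)
print(class_counts)
print(round(class_share, 3))

# Distribution of the transaction amounts.
summary(creditcard_data$Amount)
```

Checking the class table early is useful here because the imbalance between the two classes affects how the later models should be evaluated.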
B. Data Manipulation
In this section of the project, we scaled the data using the scale() function. We applied it to the Amount column of creditcard_data. Scaling standardizes the column to zero mean and unit variance, so the large raw amounts are expressed on a scale comparable to the other features and do not dominate the functioning of the model.
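A minimal sketch of this scaling step, using made-up amounts rather than the real dataset:

```r
# Made-up transaction amounts, including one large value.
creditcard_data <- data.frame(Amount = c(2.69, 149.62, 378.66, 4.99, 2500.00))

# scale() subtracts the column mean and divides by the standard deviation.
# It returns a one-column matrix, so as.numeric() converts it back to a vector.
creditcard_data$Amount <- as.numeric(scale(creditcard_data$Amount))

mean(creditcard_data$Amount)   # approximately 0
sd(creditcard_data$Amount)     # 1

# The largest amount is still the largest after scaling: standardization
# changes the units, it does not remove extreme values.
which.max(creditcard_data$Amount)
```

Note that standardization keeps the ordering of the values; if extreme amounts are a concern, they would need to be handled separately.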
C. Data Modelling
After standardizing the dataset, we split it into a training set and a test set with a split ratio of 0.80. This means that 80% of the data is assigned to train_data and the remaining 20% to test_data. We then checked the dimensions of both sets using the dim() function.
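The 80/20 split can be sketched in base R as below. This is an illustration on made-up data and uses sample() rather than the caTools package listed earlier; the seed value and the synthetic columns are assumptions.

```r
set.seed(123)  # fixed seed so the split is reproducible

# Made-up data standing in for the scaled creditcard_data.
n <- 1000
creditcard_data <- data.frame(
  V1     = rnorm(n),
  Amount = rnorm(n),
  Class  = rbinom(n, size = 1, prob = 0.05)
)

# Draw 80% of the row indices at random for training.
train_idx  <- sample(seq_len(n), size = floor(0.80 * n))
train_data <- creditcard_data[train_idx, ]
test_data  <- creditcard_data[-train_idx, ]

dim(train_data)  # 800 rows, 3 columns
dim(test_data)   # 200 rows, 3 columns
```

With a label as rare as fraud, a split that preserves the class ratio in both sets (which caTools::sample.split does when given the label vector) is generally preferable to a purely random one.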
VII. FITTING LOGISTIC REGRESSION MODEL
In this section of the project, we fit the first model. We began with logistic regression, which we used to model the probability of a transaction being fraud or not fraud, and then applied the model to the test data. Once we had summarized the model, we visualized it through plots. To assess the performance of the model, we plotted the Receiver Operating Characteristic (ROC) curve. For this, we first imported the pROC package and then plotted the ROC curve to analyze the model's performance.
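# Visualize the summarized model through its diagnostic plots
plot(Logistic_Model)
# ROC curve to assess the performance of the model on the test set
library(pROC)
# Assuming Logistic_Model was fitted with glm(), type = "response"
# returns the predicted probability of fraud for each test row
lr.predict <- predict(Logistic_Model, test_data, type = "response")
auc.lr <- roc(test_data$Class, lr.predict, plot = TRUE, col = "blue")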
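The snippet above assumes an already fitted Logistic_Model. A minimal sketch of how such a model could be fitted with glm() is given below; the synthetic data, the coefficients used to generate it, and the seed are assumptions for illustration and do not come from the dataset.

```r
set.seed(42)

# Synthetic data: the probability of class 1 rises with V1 and falls with V2,
# and the negative intercept keeps class 1 rare.
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(-3 + 2 * x1 - 1 * x2)
toy <- data.frame(V1 = x1, V2 = x2, Class = rbinom(n, 1, p))

# 80/20 split into training and test sets.
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Logistic regression: binomial family with the default logit link.
Logistic_Model <- glm(Class ~ ., data = train_data, family = binomial())
summary(Logistic_Model)

# Predicted probabilities on the test set, then hard labels at a 0.5 cut-off.
lr.predict <- predict(Logistic_Model, test_data, type = "response")
lr.class   <- ifelse(lr.predict > 0.5, 1, 0)

# Confusion table of predicted against actual classes.
table(Predicted = lr.class, Actual = test_data$Class)
```

The vector lr.predict produced here has the form that roc() expects as its predictor argument.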
A. Fitting a Decision Tree Model
Next, we implemented a decision tree algorithm, which lays out the outcomes of a sequence of decisions and lets us conclude which class an object belongs to. We fitted the decision tree model using recursive partitioning and plotted it with the rpart.plot() function.
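library(rpart)
library(rpart.plot)
# Predicted class labels and class probabilities for every transaction
predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
rpart.plot(decisionTree_model)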
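The decision tree step can be sketched end to end as follows. The data and the labelling rule are synthetic assumptions chosen so that the tree has an obvious split to find; only the rpart package is needed to run it.

```r
library(rpart)

set.seed(7)

# Synthetic data: a row is labelled class 1 whenever V1 exceeds 1.
n   <- 1000
toy <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
toy$Class <- factor(ifelse(toy$V1 > 1, 1, 0))

# Fit a classification tree by recursive partitioning.
decisionTree_model <- rpart(Class ~ ., data = toy, method = "class")

# Class labels and per-class probabilities for each row.
predicted_val <- predict(decisionTree_model, toy, type = "class")
probability   <- predict(decisionTree_model, toy, type = "prob")

# Share of rows whose predicted class matches the label.
accuracy <- mean(predicted_val == toy$Class)
print(accuracy)

# Text view of the splits; rpart.plot() draws the same tree graphically.
print(decisionTree_model)
```

On real transaction data the accuracy should be measured on the held-out test set, since a tree evaluated on its own training rows will look better than it is.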
B. Artificial Neural Network
Artificial Neural Networks (ANNs) are a type of machine learning algorithm modeled after the human nervous system. ANN models learn patterns from historical data and can then classify new input data. We imported the neuralnet package, which allowed us to implement the ANN, and then plotted the network using the plot() function. The output of the network is a value between 0 and 1. We set a threshold of 0.5: values above 0.5 correspond to class 1 and the rest to class 0.
C. Gradient Boosting (GBM)
Gradient Boosting is a popular machine learning algorithm that is used to perform classification and regression tasks. The model is an ensemble of weak learners, typically shallow decision trees. These trees are added one after another and combine to form a strong gradient boosting model, with each new tree fitted by gradient descent on the loss of the current ensemble.
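A minimal sketch of fitting such a boosted model with the gbm package is shown below. The synthetic data and every hyperparameter value (number of trees, depth, shrinkage, minimum node size, train.fraction) are assumptions for illustration and are not taken from the project.

```r
library(gbm)

set.seed(99)

# Synthetic data with a rare positive class.
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(V1 = x1, V2 = x2,
                  Class = rbinom(n, 1, plogis(-3 + 2 * x1 - x2)))

train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Bernoulli loss for a 0/1 label. train.fraction holds out the last 20% of
# train_data as an internal validation set for choosing the number of trees.
model_gbm <- gbm(Class ~ ., distribution = "bernoulli", data = train_data,
                 n.trees = 300, interaction.depth = 3, shrinkage = 0.05,
                 n.minobsinnode = 20, train.fraction = 0.8)

# Number of trees with the lowest loss on the internal validation set.
best_iter <- gbm.perf(model_gbm, method = "test", plot.it = FALSE)

# Predicted probabilities of class 1 on the held-out test set.
gbm_test <- predict(model_gbm, newdata = test_data,
                    n.trees = best_iter, type = "response")
summary(gbm_test)
```

The resulting gbm_test vector holds one probability per test row and can be passed to roc() together with test_data$Class.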
In the last section of the project, we calculated and plotted an ROC curve, which traces the sensitivity of the model against its specificity. The roc() call with plot = TRUE draws the curve, and the print command reports the area under the curve (AUC). The AUC summarizes how well the model ranks fraudulent transactions above legitimate ones across all thresholds.
gbm_auc = roc(test_data$Class, gbm_test, plot = TRUE, col = "red")
print(gbm_auc)
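To make the meaning of the area under the curve concrete, the AUC can also be computed directly from ranks: it equals the fraction of (fraud, legitimate) pairs in which the fraud case receives the higher score. The helper auc_by_rank below is a hypothetical illustration in base R, not part of pROC, and the labels and scores are made up.

```r
# AUC from the rank-sum (Mann-Whitney) identity.
# labels: 0/1 vector; scores: predicted probabilities or any ranking score.
auc_by_rank <- function(labels, scores) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.90)

# 6 of the 9 (positive, negative) pairs are ordered correctly.
auc_by_rank(labels, scores)   # 0.6666667
```

An AUC of 0.5 corresponds to random ranking and 1 to a perfect separation of the two classes.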
VIII. CONCLUSION
In this R data science project, we learnt how to develop a credit card fraud detection model using machine learning. We used a variety of machine learning algorithms to implement the model and plotted the respective performance curves. We also learnt how data can be analyzed and visualized to distinguish fraudulent transactions from legitimate ones.