Statistical learning methods were applied to credit card transactions in order to detect fraud. A variety of learning techniques were explored and validated. Simple methods like logistic regression show great promise, especially given their computational efficiency at test time. However, because fraud occurs so rarely, a much larger dataset than the one available here should be used to train models before they are put into use.
Credit and debit cards are being used for an ever-growing number of payments. While likely a more secure form of payment than cash and checks, credit cards remain vulnerable to fraud, which continues to threaten cardholders and card issuers. Skimming, phishing, and a variety of other techniques can be used to gain access to an account and create unauthorized transactions. Given the high volume of credit card transactions, financial institutions must combat this potential for fraud using automated systems.
To construct a system to detect credit card fraud, statistical learning techniques have been applied to a dataset containing hundreds of thousands of credit card transactions. The results show potential for such a system to screen all credit card transactions for fraud, especially given that the dataset utilized here is a fraction of what would be available to major financial institutions.
The dataset used for this analysis contains credit card transactions made by European cardholders during September 2013. The data was accessed through Kaggle, where it was donated by Worldline and the Machine Learning Group of the Université Libre de Bruxelles.
These transactions occurred over a two-day period, during which there were 492 fraudulent charges among a total of 284,807 transactions. Thus, the data is highly imbalanced, with fraudulent charges accounting for only 0.172% of the transactions. For each transaction, three quantities are given:

- `Time` - The time of the transaction, given as seconds elapsed since the first transaction in the dataset.
- `Amount` - The amount of the transaction.
- `Class` - The transaction label, either `fraud` or `genuine`.

For this analysis, time is not considered.
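As a point of reference, a minimal sketch of how this data could be loaded and the imbalance verified (the file name `creditcard.csv` and the recoding of the 0/1 labels to `genuine`/`fraud` are assumptions, not taken from the original analysis):

```r
library(readr)

# assumption: the Kaggle file is creditcard.csv in the working directory
cc = read_csv("creditcard.csv")

# assumption: recode the 0/1 labels to factor levels fraud/genuine,
# with fraud first so caret treats it as the positive class
cc$Class = factor(ifelse(cc$Class == 1, "fraud", "genuine"),
                  levels = c("fraud", "genuine"))

# time is not considered in this analysis
cc$Time = NULL

# proportion of fraudulent transactions, roughly 0.00172
mean(cc$Class == "fraud")
```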
In addition to these three quantities, 28 principal components are available for each transaction. In order to maintain the confidentiality of these transactions, the original features are not provided. Further, information about which original features were used to create these principal components is not available. We can reasonably assume that those features contain information that would be available to the card issuer at the time of the transaction, including, but not limited to, information such as the merchant, the location of the transaction, and details of the cardholder's account.
In preparation for model training, a training dataset is created using 80% of the provided data. Within the training data, 0.173% of observations are fraudulent, which roughly matches the overall proportion in the original dataset.
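A minimal sketch of this split using `createDataPartition`, which performs stratified sampling on the response (the seed and the object names `cc_trn` / `cc_tst` are assumptions):

```r
library(caret)

set.seed(42)  # assumed seed, for reproducibility

# stratified 80/20 split so both sets keep the ~0.17% fraud rate
idx = createDataPartition(cc$Class, p = 0.80, list = FALSE)
cc_trn = cc[idx, ]
cc_tst = cc[-idx, ]
```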
Some exploratory data analysis can be found in the appendix.
Five different classification models were trained, each using 5-fold cross-validation. The best tuning parameters were chosen using the area under the ROC curve.
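The cross-validation control objects referenced in the model fitting below (`cv`, `cv_up`, `cv_down`, `cv_smote`) are not shown in the original code. A minimal sketch of how they could be defined, assuming `caret`'s built-in subsampling via the `sampling` argument of `trainControl`:

```r
library(caret)

# 5-fold CV reporting ROC, sensitivity, and specificity
cv = trainControl(method = "cv", number = 5,
                  classProbs = TRUE,
                  summaryFunction = twoClassSummary)

# the same controls, with subsampling applied within each fold
cv_up    = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "up")
cv_down  = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "down")
cv_smote = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "smote")
```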
A random forest is fit using the `ranger` package in R, with training data that has been down-sampled:

```r
mod_rf_down = train(Class ~ ., data = cc_trn,
                    method = "ranger",
                    metric = "ROC",
                    trControl = cv_down)
```
Logistic regressions are fit with no subsampling, as well as with up-sampling, down-sampling, and SMOTE:

```r
# logistic regression, no subsampling
mod_glm_cv = train(Class ~ ., data = cc_trn,
                   method = "glm",
                   metric = "ROC",
                   trControl = cv)

# logistic regression, up-sampling
mod_glm_up = train(Class ~ ., data = cc_trn,
                   method = "glm",
                   metric = "ROC",
                   trControl = cv_up)

# logistic regression, down-sampling
mod_glm_down = train(Class ~ ., data = cc_trn,
                     method = "glm",
                     metric = "ROC",
                     trControl = cv_down)

# logistic regression, SMOTE
mod_glm_smote = train(Class ~ ., data = cc_trn,
                      method = "glm",
                      metric = "ROC",
                      trControl = cv_smote)
```
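One way to compare the cross-validated performance of these models side by side is `caret`'s `resamples` helper (a sketch; it assumes all five models were fit with the same 5-fold scheme):

```r
# collect cross-validated ROC / Sens / Spec across all five models
fits = list(rf_down   = mod_rf_down,
            glm       = mod_glm_cv,
            glm_up    = mod_glm_up,
            glm_down  = mod_glm_down,
            glm_smote = mod_glm_smote)
summary(resamples(fits))
```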
Model selection and evaluation are discussed in the results section.
The table below shows the results of fraud predictions on the test data using a logistic regression model fit to the training data with an up-sampling procedure to combat the effect of the massive class imbalance. Additional intermediate tuning results can be found in the appendix. Due to computational limitations, only a random forest that utilizes down-sampling is presented. While the best result is found within the random forest model, it does not appear to be significantly different from the logistic regressions considered. As a result, a logistic regression is chosen: its simplicity is especially helpful at test time, as it can compute predictions much faster than the random forest.
Models were tuned for ROC, but sensitivity was also considered when choosing a final model. Aside from the logistic regression model without any subsampling, all models had similar performance.
|  | Actual: Fraud | Actual: Genuine |
|---|---|---|
| Predicted: Fraud | 91 | 1402 |
| Predicted: Genuine | 7 | 55461 |
Within this test data, 7.1% of fraudulent transactions are misclassified as genuine, while 2.5% of genuine transactions are labeled as fraud. We note that these two rates could be traded against each other by adjusting the probability threshold used to label a transaction as fraud.
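A sketch of how such a threshold adjustment could look (it assumes the final model object is `mod_glm_up`, the positive class is named `fraud`, and the held-out data is `cc_tst`):

```r
# predicted probability of fraud for each test transaction
prob_fraud = predict(mod_glm_up, newdata = cc_tst, type = "prob")$fraud

# the default 0.5 threshold reproduces the table above; lowering it
# catches more fraud at the cost of flagging more genuine transactions
threshold = 0.5
pred = factor(ifelse(prob_fraud > threshold, "fraud", "genuine"),
              levels = levels(cc_tst$Class))
table(Predicted = pred, Actual = cc_tst$Class)
```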
The results above are somewhat encouraging. While our model does produce errors of both types, given the application, we believe this analysis demonstrates a proof of concept for an automated fraud detection system. With more data, both observations and features, this model could likely be improved before being put into practice.
We consider false positives produced by this model (detecting fraud in a genuine transaction) to be the lesser of the two potential errors. By labeling these transactions as fraud, the card issuer could simply deny them, provided they are detected in real time. This comes at the cost of cardholder inconvenience. Also, in modern banking, this denial is often coupled with messaging (text, email, or phone call) to the cardholder that can be used to identify these false positives and bypass the denial. The cost of doing so should be considered when using a model such as this in practice.
| Predicted | Actual | Amount |
|---|---|---|
| genuine | fraud | 0.00 |
| genuine | fraud | 0.92 |
| genuine | fraud | 1.79 |
| genuine | fraud | 2.47 |
| genuine | fraud | 45.03 |
| genuine | fraud | 99.99 |
| genuine | fraud | 319.20 |
False negatives (failure to detect a true fraud) should be investigated further. The table above lists the seven false negatives found in the test data. Although this is an extremely limited sample, we make two comments. First, the total amount of these transactions comes to 469.40. Spread across the 56,961 test transactions, that is roughly 0.0082 per transaction. While this is extremely small, we note one additional consideration which suggests that amount does not fully explain the risk of false negatives.
Within this data, there are a large number of transactions for either 0.00 or 0.01. The rate of occurrence of transactions for 0.00 or 0.01 far exceeds that of transactions for 0.02 through 0.10. It is unclear what these transactions could be for, but one possibility is account verification when linking a credit card to third-party services. So while they have little monetary value, they create a potential security risk. While we cannot make any strong conclusion, it is worrying that a fraudulent transaction for 0.00 appears among the false negatives.
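This pattern can be checked directly with a quick tally of the small transaction amounts (a sketch, assuming the training data is `cc_trn`):

```r
# counts of training transactions at each amount from 0.00 to 0.10
table(cc_trn$Amount[cc_trn$Amount <= 0.10])
```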
Because fraud detection should be performed in real time, that is, the card issuer would like to detect fraudulent transactions as they occur, the speed of predictions from our trained models must be taken into consideration. While the random forest model had similar detection performance, it was on average 8.8 times slower when making predictions at test time. While this likely amounts to a very small absolute amount of time per transaction, every bit of time counts when considering the volume of transactions.
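A rough way to verify this relative speed (a sketch; the 8.8x figure comes from the original analysis, not from this code):

```r
# wall-clock time to score the full test set with each model
system.time(predict(mod_glm_up, newdata = cc_tst))
system.time(predict(mod_rf_down, newdata = cc_tst))
```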
Because it uses principal components instead of the original features, this model is essentially a black box. With access to the full feature set, and using interpretable models, additional sanity checks could be made on the predictions of this model.
- `V1`-`V28` - 28 principal components based on an unknown set of input features that contain information about each transaction.
- `Amount` - The amount of the transaction.
- `Class` - The transaction label, either `fraud` or `genuine`.

For additional information, see the documentation on Kaggle.
The table below summarizes transaction amounts by class within the training data:

| Class | Count | 10th Percentile | Median | 90th Percentile |
|---|---|---|---|---|
| fraud | 394 | 0.76 | 12.31 | 324.344 |
| genuine | 227452 | 1.00 | 22.00 | 203.136 |
Random forest tuning results (down-sampled training data):

| mtry | min.node.size | splitrule | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|---|---|
| 2 | 1 | gini | 0.975 | 0.888 | 0.982 | 0.015 | 0.040 | 0.007 |
| 2 | 1 | extratrees | 0.978 | 0.868 | 0.991 | 0.012 | 0.048 | 0.003 |
| 15 | 1 | gini | 0.975 | 0.906 | 0.959 | 0.013 | 0.044 | 0.013 |
| 15 | 1 | extratrees | 0.977 | 0.888 | 0.980 | 0.012 | 0.040 | 0.004 |
| 29 | 1 | gini | 0.977 | 0.901 | 0.964 | 0.013 | 0.041 | 0.008 |
| 29 | 1 | extratrees | 0.979 | 0.896 | 0.972 | 0.012 | 0.041 | 0.003 |
Logistic regression, no subsampling:

| parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|
| none | 0.973 | 0.609 | 1.000 | 0.011 | 0.080 | 0.000 |

Logistic regression, up-sampling:

| parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|
| none | 0.978 | 0.901 | 0.976 | 0.018 | 0.024 | 0.001 |

Logistic regression, down-sampling:

| parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|
| none | 0.961 | 0.911 | 0.938 | 0.034 | 0.032 | 0.041 |

Logistic regression, SMOTE:

| parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|
| none | 0.971 | 0.899 | 0.975 | 0.014 | 0.022 | 0.003 |
Some meta-notes about this analysis:

- The `doParallel` and `parallel` packages are used to set up a parallel backend, which `caret` automatically leverages if it exists. (A sketch of this setup appears after this list.)
- Some warnings are generated when training with `glm`. This seems to be related to how various functions define the positive class (and also whether factor coercion is happening within those functions).
- The `createDataPartition` function from the `caret` package was used to do the data splitting. This allows for stratified sampling within the classes of the response variable. Often, this might not really be necessary, but with data this imbalanced, it guarantees that the imbalance is the same in both datasets. As a result of the splitting, the test data contains only 98 positive (fraud) examples.

| Dataset | Fraud | Genuine |
|---|---|---|
| Train | 0.0017 | 0.9983 |
| Test | 0.0017 | 0.9983 |
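A minimal sketch of the parallel backend setup described above (the core count and cluster handling are assumptions):

```r
library(parallel)
library(doParallel)

# assumed: leave one core free for the operating system
cl = makeCluster(detectCores() - 1)
registerDoParallel(cl)

# ... train() calls run here; caret uses the backend automatically ...

stopCluster(cl)
```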
Note that for the logistic regression models, no tuning was actually done; the specification of a metric largely serves to suppress warnings in `caret`.