Statistical learning methods were applied to credit card transactions in order to detect fraud. A variety of learning techniques were explored and validated. Simple methods like logistic regression show great promise, especially given their computational efficiency at test time. However, because fraud occurs so rarely, a much larger dataset than the one available here should be used to train models before they are put into use.
Credit and debit cards are being used for an ever-growing number of payments. While likely a more secure form of payment than cash and checks, credit cards remain vulnerable to fraud, which continues to threaten cardholders and card issuers. Skimming, phishing, and a variety of other techniques can be used to gain access to an account and create unauthorized transactions. Given the high volume of credit card transactions, financial institutions must combat this potential for fraud using automated systems.
To construct a system to detect credit card fraud, statistical learning techniques have been applied to a dataset containing hundreds of thousands of credit card transactions. The results show potential for such a system to screen all credit card transactions for fraud, especially given that the dataset utilized here is a fraction of what would be available to major financial institutions.
The dataset used for this analysis contains credit card transactions made by European cardholders during September 2013. The data was accessed through Kaggle, where it was donated by Worldline and the Machine Learning Group of the Université Libre de Bruxelles.
These transactions occurred over a two-day period, during which there were 492 fraudulent charges among a total of 284,807 transactions. Thus, the data is highly imbalanced, with fraudulent charges accounting for only 0.172% of the transactions. For each transaction, three quantities are given:

- `Time` - The time of the transaction, given as seconds elapsed since the first transaction in the dataset.
- `Amount` - The amount of the transaction.
- `Class` - The transaction label, either `fraud` or `genuine`.

For this analysis, time is not considered.
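As a point of reference, a minimal sketch of how this data could be loaded and the imbalance verified (the file name `creditcard.csv` and the recoding of the 0/1 labels to `genuine`/`fraud` are assumptions, not taken from the original analysis):

```r
library(readr)

# assumption: the Kaggle file is creditcard.csv in the working directory
cc = read_csv("creditcard.csv")

# assumption: recode the 0/1 labels to factor levels fraud/genuine,
# with fraud first so caret treats it as the positive class
cc$Class = factor(ifelse(cc$Class == 1, "fraud", "genuine"),
                  levels = c("fraud", "genuine"))

# time is not considered in this analysis
cc$Time = NULL

# proportion of fraudulent transactions, roughly 0.00172
mean(cc$Class == "fraud")
```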
In addition to these three quantities, 28 principal components are available for each transaction. In order to maintain the confidentiality of these transactions, the original features are not provided. Further, information about which original features were used to create these principal components is not available. We can reasonably assume that those features contain information that would be available to the card issuer at the time of the transaction, including, but not limited to, information such as the merchant, the location of the transaction, and details of the cardholder's account.
In preparation for model training, a training dataset is created using 80% of the provided data. Within the training data, 0.173% of observations are fraudulent, which roughly matches the overall proportion in the original dataset.
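A minimal sketch of this split using `createDataPartition`, which performs stratified sampling on the response (the seed and the object names `cc_trn` / `cc_tst` are assumptions):

```r
library(caret)

set.seed(42)  # assumed seed, for reproducibility

# stratified 80/20 split so both sets keep the ~0.17% fraud rate
idx = createDataPartition(cc$Class, p = 0.80, list = FALSE)
cc_trn = cc[idx, ]
cc_tst = cc[-idx, ]
```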
Some exploratory data analysis can be found in the appendix.
Five different classification models were trained, each using 5-fold cross-validation. The best tuning parameters were chosen using the area under the ROC curve.
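The cross-validation control objects referenced in the model fitting below (`cv`, `cv_up`, `cv_down`, `cv_smote`) are not shown in the original code. A minimal sketch of how they could be defined, assuming `caret`'s built-in subsampling via the `sampling` argument of `trainControl`:

```r
library(caret)

# 5-fold CV reporting ROC, sensitivity, and specificity
cv = trainControl(method = "cv", number = 5,
                  classProbs = TRUE,
                  summaryFunction = twoClassSummary)

# the same controls, with subsampling applied within each fold
cv_up    = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "up")
cv_down  = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "down")
cv_smote = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "smote")
```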
A random forest is fit using the `ranger` package in R, with training data that has been down-sampled:

```r
mod_rf_down = train(Class ~ ., data = cc_trn,
                    method = "ranger",
                    metric = "ROC",
                    trControl = cv_down)
```
Logistic regressions are fit with no subsampling, as well as with up-sampling, down-sampling, and SMOTE:

```r
# logistic regression, no subsampling
mod_glm_cv = train(Class ~ ., data = cc_trn,
                   method = "glm",
                   metric = "ROC",
                   trControl = cv)

# logistic regression, up-sampling
mod_glm_up = train(Class ~ ., data = cc_trn,
                   method = "glm",
                   metric = "ROC",
                   trControl = cv_up)

# logistic regression, down-sampling
mod_glm_down = train(Class ~ ., data = cc_trn,
                     method = "glm",
                     metric = "ROC",
                     trControl = cv_down)

# logistic regression, SMOTE
mod_glm_smote = train(Class ~ ., data = cc_trn,
                      method = "glm",
                      metric = "ROC",
                      trControl = cv_smote)
```
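One way to compare the cross-validated performance of these models side by side is `caret`'s `resamples` helper (a sketch; it assumes all five models were fit with the same 5-fold scheme):

```r
# collect cross-validated ROC / Sens / Spec across all five models
fits = list(rf_down   = mod_rf_down,
            glm       = mod_glm_cv,
            glm_up    = mod_glm_up,
            glm_down  = mod_glm_down,
            glm_smote = mod_glm_smote)
summary(resamples(fits))
```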
Model selection and evaluation are discussed in the results section.
The table below shows the results of fraud predictions on the test data using a logistic regression model fit to the training data with an up-sampling procedure to combat the effect of the massive class imbalance. Additional intermediate tuning results can be found in the appendix. Due to computational limitations, only a random forest that utilizes down-sampling is presented. While the best result is found within the random forest model, it does not appear to be significantly different from the logistic regressions considered. As a result, a logistic regression is chosen: its simplicity is especially helpful at test time, as it can compute predictions much faster than the random forest.
Models were tuned for ROC, but sensitivity was also considered when choosing a final model. Aside from the logistic regression model without any subsampling, all models had similar performance.
|  | Actual: Fraud | Actual: Genuine |
|---|---|---|
| Predicted: Fraud | 91 | 1402 |
| Predicted: Genuine | 7 | 55461 |
Within this test data, 7.1% of fraudulent transactions are misclassified as genuine, while 2.5% of genuine transactions are labeled as fraud. We note that these two rates could be traded against each other by adjusting the probability threshold used to label a transaction as fraud.
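A sketch of how such a threshold adjustment could look (it assumes the final model object is `mod_glm_up`, the positive class is named `fraud`, and the held-out data is `cc_tst`):

```r
# predicted probability of fraud for each test transaction
prob_fraud = predict(mod_glm_up, newdata = cc_tst, type = "prob")$fraud

# the default 0.5 threshold reproduces the table above; lowering it
# catches more fraud at the cost of flagging more genuine transactions
threshold = 0.5
pred = factor(ifelse(prob_fraud > threshold, "fraud", "genuine"),
              levels = levels(cc_tst$Class))
table(Predicted = pred, Actual = cc_tst$Class)
```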
The results above are somewhat encouraging. While our model does produce errors of both types, given the application, we believe this analysis demonstrates a proof of concept for an automated fraud detection system. With more data, both observations and features, this model could likely be improved before being put into practice.
We consider false positives produced by this model (detecting fraud in a genuine transaction) to be the lesser of the two potential errors. By labeling these transactions as fraud, the card issuer could simply deny them, provided they are detected in real time. This comes at the cost of cardholder inconvenience. Also, in modern banking, this denial is often coupled with messaging (text, email, or phone call) to the cardholder that can be used to identify these false positives and bypass the denial. The cost of doing so should be considered when using a model such as this in practice.
| Predicted | Actual | Amount |
|---|---|---|
| genuine | fraud | 0.00 |
| genuine | fraud | 0.92 |
| genuine | fraud | 1.79 |
| genuine | fraud | 2.47 |
| genuine | fraud | 45.03 |
| genuine | fraud | 99.99 |
| genuine | fraud | 319.20 |
False negatives (failure to detect a true fraud) should be investigated further. The table above lists the seven false negatives found in the test data. Although this is an extremely limited sample, we make two comments. First, the total amount of these transactions comes to 469.40. Spread across the 56,961 test transactions, that is roughly 0.0082 per transaction. While this is extremely small, we note one additional consideration which suggests that amount does not fully explain the risk of false negatives.
Within this data, there are a large number of transactions for either 0.00 or 0.01. The rate of occurrence of transactions for 0.00 or 0.01 far exceeds that of transactions for 0.02 through 0.10. It is unclear what these transactions could be for, but one possibility is account verification when linking a credit card to third-party services. So while they have little monetary value, they create a potential security risk. While we cannot make any strong conclusion, it is worrying that a fraudulent transaction for 0.00 appears among the false negatives.
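This pattern can be checked directly with a quick tally of the small transaction amounts (a sketch, assuming the training data is `cc_trn`):

```r
# counts of training transactions at each amount from 0.00 to 0.10
table(cc_trn$Amount[cc_trn$Amount <= 0.10])
```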
Because fraud detection should be performed in real time, that is, the card issuer would like to detect fraudulent transactions as they occur, the speed of predictions from our trained models must be taken into consideration. While the random forest model had similar detection performance, it was on average 8.8 times slower when making predictions at test time. While this likely amounts to a very small absolute amount of time per transaction, every bit of time counts when considering the volume of transactions.
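A rough way to verify this relative speed (a sketch; the 8.8x figure comes from the original analysis, not from this code):

```r
# wall-clock time to score the full test set with each model
system.time(predict(mod_glm_up, newdata = cc_tst))
system.time(predict(mod_rf_down, newdata = cc_tst))
```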
Because it uses principal components instead of the original features, this model is essentially a black box. With access to the full feature set, and using interpretable models, additional sanity checks could be made on the predictions of this model.
- `V1`-`V28` - 28 principal components based on an unknown set of input features that contain information about each transaction.
- `Amount` - The amount of the transaction.
- `Class` - The transaction label, either `fraud` or `genuine`.

For additional information, see the documentation on Kaggle.
The table below summarizes transaction amounts by class within the training data:

| Class | Count | 10th Percentile | Median | 90th Percentile |
|---|---|---|---|---|
| fraud | 394 | 0.76 | 12.31 | 324.344 |
| genuine | 227452 | 1.00 | 22.00 | 203.136 |
Random forest tuning results (down-sampled training data):

| mtry | min.node.size | splitrule | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|---|---|
| 2 | 1 | gini | 0.975 | 0.888 | 0.982 | 0.015 | 0.040 | 0.007 |
| 2 | 1 | extratrees | 0.978 | 0.868 | 0.991 | 0.012 | 0.048 | 0.003 |
| 15 | 1 | gini | 0.975 | 0.906 | 0.959 | 0.013 | 0.044 | 0.013 |
| 15 | 1 | extratrees | 0.977 | 0.888 | 0.980 | 0.012 | 0.040 | 0.004 |
| 29 | 1 | gini | 0.977 | 0.901 | 0.964 | 0.013 | 0.041 | 0.008 |
| 29 | 1 | extratrees | 0.979 | 0.896 | 0.972 | 0.012 | 0.041 | 0.003 |
Logistic regression, no subsampling:

| parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|
| none | 0.973 | 0.609 | 1.000 | 0.011 | 0.080 | 0.000 |

Logistic regression, up-sampling:

| parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|
| none | 0.978 | 0.901 | 0.976 | 0.018 | 0.024 | 0.001 |

Logistic regression, down-sampling:

| parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|
| none | 0.961 | 0.911 | 0.938 | 0.034 | 0.032 | 0.041 |

Logistic regression, SMOTE:

| parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
|---|---|---|---|---|---|---|
| none | 0.971 | 0.899 | 0.975 | 0.014 | 0.022 | 0.003 |
Some meta-notes about this analysis:

- The `doParallel` and `parallel` packages are used to set up a parallel backend, which `caret` automatically leverages if it exists. (A sketch of this setup appears after this list.)
- Some warnings are generated when training with `glm`. This seems to be related to how various functions define the positive class (and also whether factor coercion is happening within those functions).
- The `createDataPartition` function from the `caret` package was used to do the data splitting. This allows for stratified sampling within the classes of the response variable. Often, this might not really be necessary, but with data this imbalanced, it guarantees that the imbalance is the same in both datasets. As a result of the splitting, the test data contains only 98 positive (fraud) examples.

| Dataset | Fraud | Genuine |
|---|---|---|
| Train | 0.0017 | 0.9983 |
| Test | 0.0017 | 0.9983 |
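A minimal sketch of the parallel backend setup described above (the core count and cluster handling are assumptions):

```r
library(parallel)
library(doParallel)

# assumed: leave one core free for the operating system
cl = makeCluster(detectCores() - 1)
registerDoParallel(cl)

# ... train() calls run here; caret uses the backend automatically ...

stopCluster(cl)
```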
Note that for the logistic regression models, no tuning was actually done; the specification of a metric largely serves to suppress warnings in `caret`.