Credit Card Fraud Prediction

Sifei Liu
Oct 17, 2022
8 min read

Updated: Mar 16, 2023

Sifei Liu, Clarise Wang, Lucy He, Yiqing Cui

1. Business Understanding

1.1 Problem Mining

Credit cards, which allow individuals or businesses to consume the products and

services in advance, are widely used around the world. However, the characteristics of

convenience and easier usage cause fraudulent behavior from time to time. According to Shift, $24.26 billion was lost due to fraud card transactions in 2018 all over the world,

causing great loss to issuers. Therefore, it is necessary for banks to detect the potential cyber crime arising from each credit card payment in advance to protect its customers. At the same time, credit card fraud detection is beneficial to help banks to save loss and increase bank solvency through preventing suspicious transactions from happening.

1.2 Business Value

● Scenario for Use

Banking industry is the main user industry of our final model. Suppose we are hired

as data analysts by a large bank who issues credit cards to create a model for a fraud alert system and maximize the profit for the bank.

● Final Goal

The final goal is to maximize the profit of the bank while trying to maintain accurate

predictions of the probability of fraudulent transactions.

● The importance of data mining solution

Once the model is built, we can use known variables to predict whether or not the

transaction is fraudulent before the transaction is being processed. Then appropriate actions can be taken to prevent further losses if the probability is above the threshold.

2. Data Understanding and Preparation

2.1 Description of dataset

● Source: A dataset available on Kaggle website, with 7 independent variables and 1

dependent variable. The dataset includes 1 million data points.

● Data instance/unit: Each credit card transaction is a separate data point.

● Target variable: Fraud, a binary dependent variable indicating whether or not a

transaction is fraudulent.

● Useful Features: distance_from_home, distance_from_last_transaction,

ratio_to_median_purchase_price, repeat_retailer,used_chip, used_pin_number,

online_order.

2.2 Data processing

We firstly checked if there are missing values in the original dataset and concluded

that there are no missing values. In order to get a better understanding of the characteristics of each feature, we looked at the summary of the whole dataset. The mean value of fraud is about 0.087, meaning that the majority of the available transactions are not fraudulent and the current dataset is skewed and highly imbalanced.

We then checked the correlation of each independent variable with the dependent

variable and concluded that except the fact that if the transaction is with a repeated retailer or not, all other variables are highly correlated to the independent variable.With the increase of distance for home, distance from last transactions, and ratio to median purchase price, and if the the transaction is an online order, then it will increase the probability of credit card fraud. And if the transaction happened from the same retailer and is through chips using a PIN number, the probability of credit card fraud will drop. And because the dependent variables are limited, we decided to keep all the variables when training models.

Originally, the dataset contained 1,000,000 data points. In order to run the analysis

effectively, we decided to randomly select 10,000 instances with an evenly combination of fraud and non-fraud instances (1:1). 9,000 of the newly selected data will be used to train the models. The rest of 1,000 will be used as a testing set for the deployment explanation.

In order to better understand the dataset, we employed K-means clusters with 8

centers. We found that when the pin number is used in the transaction, there is only 2% of the times that the transaction is a fraud. When the transaction happens far away from the card holder’s home address or the purchase price is highly over the median of the card holder’s previous purchase, there is about 90% of the times the transaction is a fraud.

2.3 Data Mining Problem

Supervised learning is used to resolve this classification problem. Classification

models are constructed to make binary predictions, while the models are evaluated by the total profit/loss of implementing fraud detection, based on the cost-benefit assumption below.

● Cost-benefit Matrix Assumption

Credit card companies make money mainly by collecting fees, including interest

charges, annual fees and late fees, among which the interest charges take the largest share of revenues. So we assume that the interest charges, which represent the percentage charged for the borrowed amount when the credit card users fail to pay off at the end of the month, are the only source for revenues of credit card companies.

On average, we assume that the average credit amount per transaction is $80. For not

permitting fraudulent transactions, the cost for requiring authorization for actual fraud is $30 per transaction (i.e. employee’ salaries, data maintenance fee, etc.). For permitting fraudulent transactions, we assume that reimbursement for fraud transaction is necessary and banks bear 70% of the total amount, so the cost for approving a fraud transaction is $56 (Average credit amount per transaction*70%= $80×70%=$56). For permitting non-fraudulent transactions, the interest revenue generated is $16 (Average credit amount per transaction × average interest rate = $80×20% = $16). For not permitting non-fraudulent transactions, banks cannot earn interest revenue and resources are needed in the process of authorization, so we assume the loss be $35 per transaction. The assumed amounts are denoted as A, B, C, D, which may be altered in practical use according to internal data about the cost and benefit. The four numbers should be in ascending sequence for A, B, C, D.

● Threshold calculation

As deployment, we do not permit transaction when expected value of not permit

exceeds expected value of permit,

− 30 × 𝑝 − 35 × (1 − 𝑝) ≥ − 56 × 𝑝 + 16 × (1 − 𝑝)

so we should not permit when 𝑝 ≥ 0. 662. The applied threshold of p is 0.662.

3. Modeling

3.1 Systematization

● Models picked

Logistic regression with interaction, LASSO, support vector machine (SVM), K

nearest neighbor (KNN), decision tree and random forest. Given our final goal, we used a 10-fold cross validation method and chose a model with the rule of maximizing the expected average profit.

3.2 Specific Models

● Logistic regression with interaction

From the logistic regression model, we can see that factors such as

distance_from_home, distance_from_last_transaction, ratio_to_mdian_puchase_price,

online_order and most interactional factors are significant to fraud, all of which can be used

to predict fraudulent behavior.

● LASSO

In order to investigate overfitting problems, we used Lasso to perform both variable

selection and regularization. Then we applied Lambda=1e-04, which concludes that the

number of variables chosen by Lasso is 7(same as the number of original variables), so there is no need to rule out any variables. Therefore, Post-Lasso can be abandoned.

● SVM

We also apply the support vector machine to separate the class. The SVM model

performs the best when using radial kernels and cost argument equals to 50. It means that a narrow margin is used where there are fewer support vectors on or violating the margin.

● KNN

K nearest neighbor is applied. All numerical variables are scaled before constructing

the model. The model performs the best when the K equals to 1.

● Decision tree

Classification tree is applied because it provides us with a better interpretation of the

decision making process. The model performs the best when the tree is not pruned, with size equals to 9. From the graph below, the first split is based on “ratio to median purchase price”, which implies that the model considers it as the most decisive factor.

● Random forest

Random forest is applied to the training dataset accompanied by 10-fold cross

validations. We chose to find best parameter amounts for random forest through the

comparison of average total expected revenue using the cost-benefit matrix assumed in the data understanding section.

4. Model Evaluation and Final Model

Given our final goal of the project, we apply a 10-fold CV on average profit/loss to

evaluate the model performance and choose the final model. After achieving 10 different possibilities based on 10 sample dataset in every model, we firstly calculate the total profit/loss based on the cost-benefit matrix considering every possibility, and then average the profit/loss to compare the model as follows:

Based on the comparison of expected average profit for each model, the final model

chosen is Random Forest. The model has 5 minimum observations in a terminal node, 400 trees to grow and 2 variables randomly sampled as candidates at each split as parameters.

5. Deployment and Model Application

5.1 Practical use

Banks can prevent fraud in advance with the system to evaluate the possibility of fraud

of each transaction. Given that factors included in our model are location of transactions (including distance between where transaction happens and home address of customer, distance between where current transaction happens and where last transaction happened), ratio of transaction price to the median of the customer’s past purchases, whether or not the retailer for the current transaction is same retailer as that of last transaction, whether or not use chip or pin number and whether or not this transaction is ordered online, the banks can take the location information, amount spent of the transaction, the credit card user’s home address and the usage of both chip and pin numbers into the trained model. Based on the information, our chosen model can be applied to predict the probability of the transaction being fraudulent.

On one hand, if the model classifies the transaction as fraudulent, when the

probability is greater than or equal to 0.662, the bank should send out some alerts to the credit card user and temporarily interrupt the transaction until the holder's permission is acquired. When the prediction is accurate, banks can avoid fraud; when the prediction is wrong, customer satisfaction may decrease.

On the other hand, if the system predicts that the transaction is normal, when the

probability is less than 0.662, the system will permit the translation. If the prediction is

correct, all can operate normally. However, if the transaction proves to be a fraud, both the cost of fraud and dissatisfaction by customer should be considered.

After model selection, we apply random forest to the test set of 1000 transactions we

selected, the matrix of actual results and deployment is given below. Under the cost-benefit assumption the expected profit is -$7,130. If the bank chooses not to conduct fraud prediction and approve all transactions, the expected profit is -$20,000. The classification model helps the bank to save 64.35% of its total loss.

5.2 Ethical considerations

● Concern for information privacy

The acquisition of some particular information should be confirmed by the credit card user through signing a privacy contract. After the information is acquired, the bank should protect the information as much as possible.

5.3 Risks associated and relevant solutions

● Moral hazard

Since some compensation will be provided once the interruption of transactions

proves to be incorrect, some customers might take advantage of that through pretending to experience credit card fraud and earning some additional interest. In order not to lose money for the moral hazard among the customers, the bank should constantly improve the model once false predictions accumulate to a certain number of cases and limit the amount of compensation to the corresponding amount of the fraud transaction.

● Losing customers and causing cost because of inaccurate prediction

When the prediction proves to be wrong, the customers tend to be dissatisfied and the

usage of credit cards will decrease. Therefore, the cost of the bank may be greater than what we assumed when customer relationships are affected. For this risk, firstly, the system should be updated whenever possible with more dimensions of information of the transactions and more comprehensive predictive considerations through collecting more variables.

Additionally, once the fraud is found, the bank should track the fraud, block the credit card and compensate for the loss of the customers. Usually, customers will reach out to the banks if a suspicious transaction appears on their history of transactions, so the contact representatives should be trained in the way that they show apology and patience to customers.

Appendices (division of work)

References

[1] R, D. N. (2022). Credit Card Fraud. Kaggle. Retrieved October 14, 2022, from

https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud?resource=download

[2] Credit Card Fraud Statistics. Shift Credit Card Processing. (2021). Retrieved October 14,

2022, from https://shiftprocessing.com/credit-card-fraud-statistics/

[3] Best, R. de. (2022). Average value of transaction per credit card worldwide by Brand.

Statista. Retrieved October 14, 2022, from https://www.statista.com/statistics/279249/purchase-transactions-on-general-purpose-cards-worldwide/

Credit Card Fraud Prediction

Recent Posts

Comments