Zurich University of Applied Sciences
Master of Science in Engineering
Project 1
Student: Sebastian Häni, haeni.sebastian@gmail.com
Advisor: Andreas Weiler, wele@zhaw.ch
Version: 1.0.0, 11.05.2021
This document was written as part of a master's program project at the Zurich University of Applied Sciences ZHAW.
All project information such as this documentation and the source code are publicly available on the project website at https://github.com/sebastianhaeni/fraud-detection.
All sources are numbered [n] and listed in the list of sources in the Appendix. Basic knowledge in computer and data science is required to understand this work. The most important abbreviations and concepts are explained in the glossary, which can also be found in the Appendix.
This document was written in HTML and converted into a PDF file using Prince. The font used is Open Sans, designed by Steve Matteson on behalf of Google.
YesLogic Pty. Ltd.: Prince. http://www.princexml.com/ 22.07.2019
The goal of this project was to participate in the Data Mining Cup 2019 and produce a report about it, as well as the code artifacts used to solve the challenge. The dataset was analyzed to prepare the data for the training step. We identified the contributions of individual features to the prediction, eliminated dependent features and created new ones. The resulting dataset had four remaining variables with a significant correlation with the prediction variable "fraud".
The classifiers explored include decision trees, logistic regression, random forests, k-nearest neighbors and support vector classification. The scores they reached with the final released test set from the contest were acceptable for SVC (37,690€), logistic regression (7,005€) and random forests (2,250€). The theoretical maximum score achievable with a 100% accurate model would have been 118,635€. A random classifier would have scored -5,929,590€.
The challenge lay mainly in finding a suitable scoring function with which to tune the classifiers. The dataset was highly imbalanced: in total it contained 1,775 non-fraud samples and only 104 fraud samples. The resulting model had to be regularized so that it does not overfit in training but still achieves an acceptable score.
The goal of this project was to participate in the Data Mining Cup 2019 and produce a report about it, as well as the code artifacts used to solve the challenge. The following chapter gives an overview of the Data Mining Cup and the 2019 challenge. The following content is taken from the Data Mining Cup website.
The DATA MINING CUP (DMC for short) has inspired students around the world to pursue intelligent data analysis since the year 2000. In the 20th DATA MINING CUP in 2019 about 150 teams from 114 universities in 28 countries took part in the competition. The best teams were invited to Berlin for the awards ceremony at the retail intelligence summit.
The number of self-checkout stations is on the rise. This includes stationary self-checkouts, where customers take their shopping cart to a scan station and pay for their products. Secondly, there are semi-stationary self-checkouts, where customers scan their products directly and only pay at a counter. The customers either use their own smartphone for scanning or the store provides mobile scanners. You will probably have encountered this already.
This automated process helps avoid long lines and speeds up the paying process for individual customers. But how can retailers prevent the trust they have placed in customers from being abused? How can they decide which purchases to check in an effort to expose fraudsters without annoying innocent customers?
An established food retailer has introduced a self-scanning system that allows customers to scan their items using a handheld mobile scanner while shopping.
This type of payment leaves retailers open to the risk that a certain number of customers will take advantage of this freedom to commit fraud by not scanning all of the items in their cart.
Empirical research conducted by suppliers has shown that discrepancies are found in approximately 5 % of all self-scan transactions. The research does not differentiate between actual fraudulent intent of the customer, inadvertent errors or technical problems with scanners.
To minimize losses, the food retailer hopes to identify cases of fraud using targeted follow-up checks. The challenge here is to keep the number of checks as low as possible to avoid unnecessary added expense as well as to avoid putting off innocent customers due to false accusations. At the same time, however, the goal is to identify as many false scans as possible.
The objective of the participating teams is to create a model to classify the scans as fraudulent or non-fraudulent. The classification does not take into account whether the fraud was committed intentionally or inadvertently.
The solutions submitted will be assessed and compared based on their monetary value for the food retailer. This can be calculated using the following cost matrix based on empirical observation.
Prediction | Actual: 0 (no fraud) | Actual: 1 (fraud) |
---|---|---|
0 (no fraud) | € 0.0 | € -5.0 |
1 (fraud) | € -25.0 | € 5.0 |
Data Mining Cup https://www.data-mining-cup.com 30.04.2019
This chapter describes the data and the variable selection.
The given data comes in the form of CSV files with a `|` as field separator. We received two files:

- `train.csv` containing 1,879 rows with 9 feature columns and one prediction column
- `test.csv` containing 498,122 rows with 9 feature columns and no prediction column

The train.csv file is used to train an estimator. The test.csv file is used to run the estimator and predict for each row whether it belongs to the class fraud or not. The result has to be written into a new CSV file with a single column containing the prediction: fraud (1) or not fraud (0). This file is then submitted so that the contest jury can evaluate the candidate's score.
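As a minimal sketch of how the files can be read (assuming pandas and that the files sit next to the notebook; the paths are an assumption):

```python
# Minimal loading sketch; file paths are an assumption.
import pandas as pd

train = pd.read_csv("train.csv", sep="|")
test = pd.read_csv("test.csv", sep="|")

print(train.shape)  # expected (1879, 10): 9 features plus the "fraud" column
print(test.shape)   # expected (498122, 9)
```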
The columns in the data set are the following:
Column name | Description | Value range |
---|---|---|
trustLevel | A customer's individual trust level. 6: Highest trustworthiness | {1,2,3,4,5,6} |
totalScanTimeInSeconds | Total time in seconds between the first and last product scanned | Positive whole number |
grandTotal | Grand total of products scanned | Positive decimal number with maximum two decimal places |
lineItemVoids | Number of voided scans | Positive whole number |
scansWithoutRegistration | Number of attempts to activate the scanner without actually scanning anything | Positive whole number or 0 |
quantityModifications | Number of modified quantities for one of the scanned products | Positive whole number or 0 |
scannedLineItemsPerSecond | Average number of scanned products per second | Positive decimal number |
valuePerSecond | Average total value of scanned products per second | Positive decimal number |
lineItemVoidsPerPosition | Average number of item voids per total number of all scanned and not cancelled products | Positive decimal number |
fraud | Classification as fraud (1) or not fraud (0) | {0,1} |
Since the dataset is already well cleaned and doesn't contain noisy artifacts or null values, we don't have to clean the data or handle missing values.
Next we want to explore the data. Particularly interesting is the distribution of fraud and not fraud in the train dataset: of the 1,879 rows, there are 1,775 cases of not fraud and only 104 cases of fraud. This means the data is highly imbalanced. Since we know from the task description that true positives, true negatives, false positives and false negatives have different costs for the store operator, we will probably have to come up with a custom loss function later, weigh the classes or oversample the data. Otherwise an estimator would simply learn to always predict not fraud and call it a day.
We have 9 features we can consider to build the model. But are all of them really relevant? We should find out first. We do this by computing the correlation between the variables. Another possibility would be a stepwise selection procedure based on the AIC.
Before we start, we have to take a closer look at the variables. The variable trustLevel could be considered categorical: its range is from 1 to 6 and it takes only integer values. But since the level is increasing and there is meaning in the order of the values and not only in the ordinal position, we keep it as a numerical value. Other variables are likely highly correlated with each other, for example lineItemVoids and lineItemVoidsPerPosition. The variables scannedLineItemsPerSecond and totalScanTimeInSeconds must also be correlated.
The variables scannedLineItemsPerSecond and totalScanTimeInSeconds also look like they could be combined into a new variable totalItemsScanned that could potentially be more predictive, so we added it to the dataset. Since we are now interested in distributions and correlations, we create a pairs plot, or rather two, since there are too many variables to show in one.
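As a sketch (assuming the `train` dataframe from the loading snippet above and seaborn for plotting), the derived feature and one of the pairs plots could be produced like this:

```python
import seaborn as sns

# Derived feature: total number of items scanned during the transaction.
train["totalItemsScanned"] = (
    train["scannedLineItemsPerSecond"] * train["totalScanTimeInSeconds"]
)

# One of the two pairs plots; the column subset shown here is illustrative.
cols = ["trustLevel", "totalScanTimeInSeconds", "totalItemsScanned", "grandTotal", "fraud"]
sns.pairplot(train[cols], hue="fraud")
```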
We can see that some variables actually do show some separation with respect to the target variable, which is a good sign. We can also see that some variables are noisy.
For a better understanding and more intuition, we also create a correlation plot with Pearson's method.
This confirms what we anticipated before. We score each variable by its correlation with the target variable fraud and order them in a table:
Variable | Correlation |
---|---|
fraud | 1.000000 |
trustLevel | 0.319765 |
totalItemsScanned | 0.298423 |
totalScanTimeInSeconds | 0.110414 |
lineItemVoidsPerPosition | 0.090116 |
scansWithoutRegistration | 0.074123 |
lineItemVoids | 0.063496 |
valuePerSecond | 0.028873 |
scannedLineItemsPerSecond | 0.023085 |
grandTotal | 0.001421 |
quantityModifications | 0.000864 |
From here we drop variables that have a very low correlation value or that are already contained in another variable. Variables we remove because they correlate with another variable: scannedLineItemsPerSecond, lineItemVoidsPerPosition and valuePerSecond. Variables we remove because their correlation value is too low: quantityModifications and grandTotal. We end up with these four variables: trustLevel, totalItemsScanned, totalScanTimeInSeconds and lineItemVoids.
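The ranking and the final selection can be reproduced with a few lines of pandas; this sketch continues from the snippet above and assumes the `train` dataframe already contains the totalItemsScanned column:

```python
# Rank all variables by their absolute Pearson correlation with "fraud".
correlations = train.corr(method="pearson")["fraud"].abs().sort_values(ascending=False)
print(correlations)

# Keep only the four selected variables plus the target.
selected = ["trustLevel", "totalItemsScanned", "totalScanTimeInSeconds", "lineItemVoids"]
X = train[selected]
y = train["fraud"]
```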
Since we're dealing with only a few variables and not with high-dimensional data, we can resort to simpler methods such as logistic regression, decision trees or support vector classification (SVC).
The scikit-learn cheat sheet gives a good overview of which algorithms we should explore, so we will follow this chart.
scikit-learn: Choosing the right estimator https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html 04.05.2019
In this chapter we define our baselines and explore different methods to build a classifier.
To have something to compare against, we need some baselines: what score would we get if we simply always predicted 0, always predicted 1, or predicted at random? We should also figure out what the maximum score could theoretically be; if we hit that, we know we are probably overfitting. These are the results on the given data (a small sketch of how they can be computed follows the table):
Baseline | Train | Test |
---|---|---|
100% accuracy | 520€ | 118,635€ |
Always predicting fraud | -43,855€ | -11,741,215€ |
Always predicting not fraud | -520€ | -118,635€ |
Random (averaged) | -22,201€ | -5,929,590€ |
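As a rough sketch of how these baselines follow from the cost matrix (using the label vector `y` from the previous chapter), the train-column numbers can be verified like this:

```python
n_fraud = int((y == 1).sum())       # 104 fraud cases
n_clean = int((y == 0).sum())       # 1,775 non-fraud cases

print(5 * n_fraud)                  # 100% accuracy: 520€
print(5 * n_fraud - 25 * n_clean)   # always predicting fraud: -43,855€
print(-5 * n_fraud)                 # always predicting not fraud: -520€
```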
These scores show us that we actually have a pretty hard problem to solve. The model should be very conservative in classifying something as fraud, as the penalty is severe.
To try out different pipelines, we decided to use scikit-learn as our framework, as it comes with almost all the tools necessary to accomplish our task. The idea is to create a new Jupyter notebook for each algorithm we try, relying heavily on helper functions from several Python scripts. Each notebook is structured in exactly the same way and essentially executes the following steps:

- load and transform the training data
- run a grid search to find the best hyper-parameters
- train the model with those hyper-parameters
- generate diagnostic plots (confusion matrix, precision-recall curve, ROC curve, learning curves)
- find the best decision threshold
- evaluate the model on the released test data
The goal was that each of those steps, except the model training, is a single function call. These utility functions are implemented in Python scripts that reside in the same folder. This pipeline evolved as we created new classifiers and a common pattern eventually emerged.
The training code looks analogous to this code:
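(The original notebook code is not reproduced here; the sketch below assumes scikit-learn's GridSearchCV, StratifiedKFold and make_scorer, the project helpers `load_fraud_data` and `score_evaluation`, and an illustrative parameter grid.)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_fraud_data()  # selected features and labels (project helper)

# Wrap the monetary score so the grid search optimizes it directly.
monetary_scorer = make_scorer(score_evaluation)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10], "solver": ["lbfgs", "liblinear"]},  # illustrative grid
    scoring=monetary_scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```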
Notably, we use grid search to find the best hyper-parameters for the model and the training task. Inside the grid search we additionally use StratifiedKFold to preserve the percentage of samples of each class, so that we get a fair cross-validation score. The scoring is our custom scoring implementation.
The function `load_fraud_data` loads the training data and transforms the dataframe according to the variables we decided upon in the previous chapter.
The function `score_evaluation` creates a confusion matrix from the true and the predicted labels, which it then uses to compute the monetary score.
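A sketch of such a scoring function, assuming scikit-learn's confusion_matrix (the actual project implementation may differ in detail):

```python
from sklearn.metrics import confusion_matrix

def score_evaluation(y_true, y_pred):
    # Cost matrix: true negative 0€, false negative -5€, false positive -25€, true positive +5€.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return 0 * tn - 5 * fn - 25 * fp + 5 * tp
```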
The function `find_best_thresh` iteratively searches for the threshold value that optimizes the score of the classifier. In the iteration loop, an average score at each threshold is computed. All scores and thresholds are then stored to be plotted. Ideally we want to see that the scores form a nice curve with a global optimum that we can find. It may also be that the model produces chaos and the results are arbitrary; a random-looking curve would strongly indicate that the model will perform poorly on test data.
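A simplified sketch of this threshold search, assuming a classifier with predict_proba and the score_evaluation function above (the number of repetitions and the threshold range are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def find_best_thresh(model, X, y, n_rounds=20):
    thresholds = np.linspace(0.05, 0.95, 19)
    avg_scores = np.zeros_like(thresholds)
    for _ in range(n_rounds):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3)
        model.fit(X_tr, y_tr)
        proba = model.predict_proba(X_te)[:, 1]
        for i, t in enumerate(thresholds):
            avg_scores[i] += score_evaluation(y_te, (proba >= t).astype(int))
    avg_scores /= n_rounds                 # average score per threshold, also used for plotting
    best = int(np.argmax(avg_scores))
    return thresholds[best], avg_scores[best]
```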
The function `get_test_score` takes the trained model and awards it a score based on the released test data of the contest.
The confusion matrix is generated with `plot_fraud_confusion_matrix`, which uses scikit-learn and is normalized. The train-test split used to fit the classifier is random.
The precision-recall plot is generated with `plot_precision_recall_curve` from scikit-learn.
The cross-validated ROC curve is generated with `plot_cv_roc_curve`, which has been adapted from a scikit-learn tutorial. The visualization is more advanced than the normal ROC plot, as it uses cross-validated models to generate multiple ROC curves in the same plot. This allows us to see an interval in which the ROC curve would probably lie on test data.
The learning curves plot is generated with `plot_learning_curve`. The implementation is heavily inspired by another scikit-learn tutorial. The first plot shows the training and the cross-validated score as more and more samples are used during training.
First we try to build a decision tree classifier, using the dataset as prepared in the previous chapter. We use a decision tree as the first, simple algorithm, knowing that it will probably either overfit or not fit at all, as decision trees are prone to do.
The grid search for DecisionTreeClassifier had the task of finding the best hyper-parameters out of the following:
The best score was achieved with criterion=gini and splitter=random. The score is 520€.
We actually reach the maximum positive monetary value. This means our model fitted the training data perfectly, which is commonly the case with decision trees, as they overfit very easily. We would have to combat this with regularization techniques such as limiting the number of splits or the tree depth.
Now if we train a model with the found hyper-parameters, we get the following plots:
As the confusion matrix shows, the model is not good at predicting fraud at all.
In this plot we see that the model does not transition smoothly from precision to recall but rather has a hard point at which it suddenly changes. This indicates that the model is probably too simplistic.
We see the same here when we run multiple cross-validations. We would want the curve to be more belly-shaped or round. Since we are dealing with a severe class imbalance, it actually does not make much sense to generate ROC curves, as they will always lean heavily towards one side and reading them will be difficult. That is why we no longer look at ROC curves for the algorithms explored afterwards (source).
Finally, the learning curves show us that the model is not generalizing but rather only learning the training data. No wonder, as it is an unregularized decision tree.
We now want to figure out at what level we have to set the threshold for the model to predict fraud or no fraud. We do not simply want to assume a cutoff at 50% probability, as we want to optimize our score in the end. So, we built an algorithm that tries to find the best threshold by iterating through different thresholds and, for each one, simulating a train-test split and predicting the class probabilistically. The best threshold found is to predict fraud when the probability is at least 26%, which maximizes our score; we get a score of 133€ with it. But a closer look at the distribution of scores reveals that this was pure coincidence, as the model is too simple in itself:
The model also performs poorly on the released test data from the contest, scoring -37,340€ in monetary value.
But rather than dwell on this algorithm, we move on to one that can generate a model with more capacity and better methods to avoid overfitting.
Machine Learning Mastery: How to Use ROC Curves and Precision-Recall Curves for Classification in Python https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/ 08.12.2019
Next we try a logistic regression model. This is a linear model that uses a logistic function to model a binary variable. Since we are trying to solve a binary classification problem, this could work out.
The grid search for LogisticRegression had the task of finding the best hyper-parameters out of the following:
The best score was achieved with C=1 and solver=lbfgs, which are the default scikit-learn parameters. The score, however, is -405€.
When looking at the confusion matrix of a model trained with these parameters, we see little difference compared to the decision tree model:
The precision-recall plot looks more promising compared to the decision tree model. We now see that the model has generalized. The curve does not have to be smooth to produce acceptable results, but a smooth curve would be a good sign.
In the learning curves we see that the cross-validated score actually reaches a positive value. But we also see that the training score somehow reaches a negative value while the cross-validated score is barely positive. This is due to the imbalanced dataset: in a slice of the data, the maximum achievable score might be negative. But we know that with the cross-validated score we can reach up to 520€.
Once we try to find the best threshold to decide between fraud and no fraud, we find that this model actually has a more predictive inner structure, as there clearly is a "best" threshold that maximizes the score. The threshold finding algorithm determined that with this model, the threshold should be quite high to maximize the score and respect the evaluation matrix.
The best threshold is 91% and the score at that point is 78€. If we evaluate the model on the released test data, we get a score of 44,605€ which is a great score compared to the theoretical maximum.
The grid search for RandomForestClassifier had the task of finding the best hyper-parameters out of the following:
The best score was achieved with n_estimators=250 and criterion=gini. The score is 520€ which is the optimum.
The precision-recall plot again looks more promising compared to the decision tree model, and the model has generalized. The curve does not have to be smooth to produce acceptable results, but a smooth curve would be a good sign.
In the learning curve plot we can see that the model has a high variance problem. Nonetheless, both the training and the cross-validated scores are positive, which is something we had not achieved before.
The score-threshold plot is a nice smooth curve, which is what we theoretically want. But we also want a global optimum in the curve that we can pick to optimize the score. This curve tops out, and there are several optima of equal value, which is a problem.
The best threshold is 56% and the score at that point is 132€. If we evaluate the model on the released test data, we get a score of 10,455€, which is not too far off the maximum possible score but not the best of the algorithms evaluated.
The grid search for KNeighborsClassifier had the task of finding the best hyper-parameters out of the following:
The best score was achieved with n_neighbors=10, weights=uniform and algorithm=auto. The score is -520€, which is what we would get if we always predicted 0 (not fraud). So, the model has probably learned to always predict not fraud.
This is exactly what we see in the confusion matrix. Fraud is just never predicted.
In the learning curve we see the same phenomenon. The training score starts out good but gets worse and worse.
The maximum score in the score-threshold diagram also reveals that this model's best threshold is simply the one that ensures fraud is never predicted. Every other choice would be worse than the baseline.
The best threshold is 56% and the score at that point is -126€. If we evaluate the model on the released test data, we get a score of -118,635€.
The grid search for SVC had the task of finding the best hyper-parameters out of the following:
The best score was achieved with C=0.8, kernel=linear and gamma=scale. The score is 295€.
The precision-recall plot looks good and nearly how we would want it to be.
The learning curves reveal again a slight high variance problem.
Now the score-threshold curve is a smooth curve with a global optimum. This is finally the curve we were looking for.
The best threshold is 56% and the score at that point is 84€. If we evaluate the model on the released test data, we get a score of 44,785€.
We also tried to fit a neural network to the data. The challenge here lay in developing a loss function that reflects our score evaluation while not preventing gradient descent from finding good steps.
The network architecture would have been as follows:
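(The original architecture description is not reproduced here; the Keras sketch below is purely illustrative, a small fully connected network on the four selected features whose layer sizes are assumptions.)

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative architecture only; layer sizes are assumptions, not the original design.
model = keras.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
```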
Using the standard loss function `binary_crossentropy`, this model converged. The problem was that it did not optimize the score but rather the accuracy, which is not what we want. So, we looked for a way to build a loss function that weights the classes to our liking and still allows the algorithm to learn:
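(The original loss is not reproduced here; the sketch below is a class-weighted binary crossentropy in the spirit of the GitHub issue comment referenced below, with illustrative weight values that only loosely mirror the cost matrix.)

```python
from tensorflow.keras import backend as K

def weighted_binary_crossentropy(fp_weight=25.0, fn_weight=5.0):
    # Penalize false alarms (fp_weight) and missed frauds (fn_weight) differently.
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        loss_pos = -fn_weight * y_true * K.log(y_pred)
        loss_neg = -fp_weight * (1.0 - y_true) * K.log(1.0 - y_pred)
        return K.mean(loss_pos + loss_neg, axis=-1)
    return loss

# model.compile(optimizer="adam", loss=weighted_binary_crossentropy())
```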
This function was inspired by a GitHub issue comment. But when using it, the learner no longer converged because the loss was not scaled properly. We couldn't get this to work as expected and abandoned the idea of neural networks.
scikit-learn: Receiver Operating Characteristic (ROC) with cross validation https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html 08.12.2019
scikit-learn: Plotting Learning Curves https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html 08.12.2019
Github Keras: Is there a way in Keras to apply different weights to a cost function? https://github.com/keras-team/keras/issues/2115#issuecomment-204060456 15.12.2019
The different classifiers performed roughly in the order we would have expected. What we didn't know was whether the actual scores would turn out to be any good and competitive in the contest.
We also noticed that when trying a method more complex than an SVC, we struggled to get the classifier to train on the data. With such a small number of samples, it is not feasible to train a neural network.
Algorithm | Score in training | Score in test |
---|---|---|
SVC | 84€ | 44,785€ |
Logistic Regression | 78€ | 44,605€ |
Random Forest | 132€ | 10,455€ |
Decision Tree | 96€ | -37,340€ |
KNeighbor Classifier | -126€ | -118,635€ |
The best classifier for this particular problem is thus the SVC, closely followed by logistic regression. We are not surprised by this outcome, as an SVC is able to find the best trade-off: it copes with linear data extremely well, and if the data is non-linear, it can project it into a space where it becomes linearly separable. The close runner-up, logistic regression, probably did that well because we boiled the variables down to four and had a binary classification problem, to which logistic regression is very easy to apply.
It was very helpful to read up on data pre-processing, variable selection, training algorithms, cross-validation, grid search, regularization, evaluation and so on. This was a small enough project not to be overwhelmed by the challenge, but still tricky enough that simple copy-paste from the Internet does not work. We had to think hard about the problem and draw the right conclusions to get to an acceptable result.
Personally, I think the contest was a bit artificial, as the training set was so small. Training a classifier on so little data will inevitably result in some variance in the results submitted by the contestants. Thus, besides building a robust classifier, some luck was certainly needed to perform well in this competition.
Overall, I'm happy to have completed the project and I think I learned a lot of hands-on techniques that I only heard of in lectures.
The following software was used to generate the results of this work: