Anna Karlhede

Tackling imbalanced data in Random Forest to predict free-to-fee transitions of a subscription SaaS-application

Abstract

In this thesis, we investigate methods for tackling highly imbalanced data when using Random Forest for classification. The data comes from a subscription SaaS application, namely Mentimeter, which allows users to create interactive presentations. The classification task is to predict which users will become paying customers within a certain period. The ratio between the two classes, will and will not upgrade, is 400-to-1. The measures taken against the imbalanced nature of the data set are balanced class weights, random undersampling, and SMOTE oversampling. The most important factor for a machine learning model to succeed is the features used as input; therefore, a large part of the work is dedicated to them. As a part of this, we use Pearson's correlation coefficient and Permutation Importance (PI) to reduce the dimensionality of the problem, i.e., the number of features in the data set. We also use PI to rank the features by importance and create different feature sets (the top 5 and top 10 most important features, as well as all features) in an attempt to further increase performance. Without any measures taken, accuracy and specificity are close to 1, while recall is close to 0.
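As a rough illustration of the three measures named above, the following sketch applies them to synthetic data standing in for the real Mentimeter features. All names, sizes, and parameters here are illustrative assumptions, not the thesis's actual code; in scikit-learn, balanced class weights need only `class_weight="balanced"`, while the two resampling schemes are written out by hand so the mechanics are visible.

```python
# Hedged sketch: countering a ~400:1 class imbalance. Synthetic data only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 000 non-upgraders vs. 25 upgraders (~400:1).
X = rng.normal(size=(10_025, 6))
y = np.array([0] * 10_000 + [1] * 25)

# 1) Balanced class weights: reweight samples inversely to class frequency.
rf_weighted = RandomForestClassifier(class_weight="balanced", random_state=0)

# 2) Random undersampling: keep all minority samples, draw an equal
#    number of majority samples at random.
def random_undersample(X, y, rng):
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = np.concatenate(
        [minority, rng.choice(majority, minority.size, replace=False)])
    return X[keep], y[keep]

# 3) Minimal SMOTE: synthesize minority points by interpolating between
#    a minority sample and one of its k nearest minority neighbours.
def smote(X_min, n_new, rng, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    base = rng.integers(0, len(X_min), n_new)
    neigh = nn.kneighbors(X_min[base], return_distance=False)[:, 1:]
    chosen = neigh[np.arange(n_new), rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

X_us, y_us = random_undersample(X, y, rng)            # 25 + 25 samples
X_syn = smote(X[y == 1], n_new=10_000 - 25, rng=rng)  # synthetic upgraders
```

Note the trade-off visible already here: undersampling discards almost all majority data, while SMOTE inflates the data set with interpolated points.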

We can conclude that at a very low computational cost (balanced class weights and random undersampling) we can vastly increase recall. This, however, comes at the cost of decreased specificity and accuracy. Balanced class weights and SMOTE oversampling have much less impact on the results. None of our efforts have been able to increase precision, which is extremely low. Of all features collected, the most important ones, as ranked by PI, are those that represent the core feature of Mentimeter: the slides and questions that make up the presentations. For the weighted and SMOTE-oversampling RF classifiers, the choice of feature set had a significant effect on performance, with the classifiers using the top 5 most important features performing best in terms of recall. However, for the RF classifiers that performed best on each performance metric, using only the most important features had a limited effect.
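The PI-based feature ranking used throughout can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn's `permutation_importance`; the real features are Mentimeter usage signals, and the train/test split and repeat counts here are placeholders.

```python
# Hedged sketch: rank features with Permutation Importance (PI) and
# retrain a Random Forest on only the top 5. Synthetic data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=12, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# PI: shuffle one column at a time on held-out data and measure the drop
# in score; a large drop means the model relied heavily on that feature.
pi = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)
top5 = np.argsort(pi.importances_mean)[::-1][:5]

# Retrain using only the 5 highest-ranked features.
rf_top5 = RandomForestClassifier(random_state=0).fit(X_tr[:, top5], y_tr)
```

Unlike the forest's built-in impurity importance, PI is computed on held-out data, which makes it less biased toward high-cardinality features.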