Enhancing decision tree accuracy and compactness with improved categorical split and sampling techniques

By: Gaëtan Millerand

Abstract

Decision trees are among the most popular algorithms in the domain of explainable AI. Thanks to their structure, it is straightforward to derive from a tree a set of decision rules that an ordinary user can fully understand. This is why there is active research on improving decision trees or on mapping other models into trees. Decision trees generated by C4.5 or ID3 suffer from two main issues. The first is that they often perform worse, in terms of accuracy for classification tasks or mean squared error for regression tasks, than state-of-the-art models such as XGBoost or deep neural networks; on almost every task there is a significant gap between top models like XGBoost and decision trees. This thesis addresses this problem with a new method, based on data augmentation using state-of-the-art models, that outperforms existing ones on these evaluation metrics. The second problem is the compactness of the decision tree: as the depth increases, the set of rules grows exponentially large, especially when the split attribute is categorical. Standard solutions for handling categorical values are to turn them into dummy variables or to split on each value, both of which produce complex models. This thesis presents a comparative study of current methods for splitting categorical values in classification problems, and also studies a new method for the regression case.

Keywords: Explainability, sampling, decision trees, white-box models
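To make the data-augmentation idea concrete, the sketch below trains a strong "teacher" model, labels perturbed copies of the training data with it, and fits a shallow decision tree on the augmented set. This is only an illustration of the general approach: the abstract does not specify the thesis's sampling scheme, so the Gaussian perturbation, the `GradientBoostingClassifier` teacher (a stand-in for XGBoost), and all parameter values here are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data; in the thesis setting this would be the real training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Teacher: a stand-in for a state-of-the-art model such as XGBoost.
teacher = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Data augmentation (hypothetical scheme): sample new points near the
# training data and label them with the teacher's predictions.
rng = np.random.default_rng(0)
X_aug = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_aug = teacher.predict(X_aug)

# Student: a compact decision tree fit on original + augmented data,
# so it can absorb some of the teacher's decision boundary.
student = DecisionTreeClassifier(max_depth=5, random_state=0).fit(
    np.vstack([X_train, X_aug]), np.concatenate([y_train, y_aug])
)
acc = student.score(X_test, y_test)
```

The student remains a white-box model (its rules can still be read off the tree), while the augmented labels let it mimic the teacher more closely than training on the original data alone.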