Anomaly Detection Across Multiple Languages

Mastafa Foufa

Abstract

We present the Multilingual Anomaly Detector (MAD), a toolkit for detecting anomalies in text independently of the language it is written in. Unsupervised anomaly detection on high-dimensional textual data is of great relevance in both machine learning research and industrial applications. Although previous approaches detect anomalies in textual data, they are sensitive to the languages the anomaly detector is trained on and hence unable to generalize to other languages. We find that the quality of the semantic space representing the textual data is of great importance for downstream applications. We first compare different ways to represent textual data across multiple languages. Then, we detect anomalies using deep learning techniques based on autoencoders. In real-world scenarios, one often has access to a few anomalous observations, and purely unsupervised techniques show rather poor performance. Hence, we finally focus on a few-shot learning technique that requires only a few anomalous observations, introducing supervised MAD. The latter architecture, based on Siamese networks, consistently outperforms unsupervised anomaly detection techniques and is more robust in anomaly detection settings than strong multilingual models such as multilingual BERT.
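To make the autoencoder-based approach concrete, the sketch below shows reconstruction-error anomaly scoring on synthetic "embedding" vectors. This is not the MAD implementation: for brevity the autoencoder is linearized (a linear autoencoder with tied weights is equivalent to PCA), the data is synthetic, and all names and the 99th-percentile threshold are illustrative assumptions. The principle is the same as in the paper: fit the model on normal data only, and flag inputs whose reconstruction error exceeds a threshold calibrated on normal data.

```python
import numpy as np

def fit_linear_autoencoder(X, n_components=2):
    """Fit a linear autoencoder on normal data only.

    With tied weights and squared-error loss this is equivalent to PCA,
    so we obtain the optimal encoder/decoder directly via SVD."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def reconstruction_error(X, mean, components):
    """Anomaly score: squared error between input and its reconstruction."""
    Xc = X - mean
    codes = Xc @ components.T    # encode into the low-dimensional space
    recon = codes @ components   # decode back to the input space
    return ((Xc - recon) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
# Synthetic "normal" text embeddings: points near a 2-D plane in 10-D space.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(200, 2)) @ basis + 0.05 * rng.normal(size=(200, 10))

mean, comps = fit_linear_autoencoder(normal, n_components=2)
# Threshold chosen from the score distribution of the normal data (assumption).
threshold = np.percentile(reconstruction_error(normal, mean, comps), 99)

# An off-manifold point reconstructs poorly and is flagged as anomalous.
anomaly = 3.0 * rng.normal(size=(1, 10))
assert reconstruction_error(anomaly, mean, comps)[0] > threshold
```

A nonlinear autoencoder replaces the SVD with an encoder/decoder network trained by gradient descent, but the scoring and thresholding logic is unchanged.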
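The few-shot, Siamese-style idea can likewise be sketched in miniature. A trained Siamese network learns a metric under which same-class pairs are close; the toy below skips the training and simply assumes cosine distance on fixed embeddings as that metric, scoring a query by its mean distance to a small labeled anomaly support set relative to its mean distance to normal examples. Every name, the synthetic data, and the scoring rule are illustrative assumptions, not the supervised MAD architecture itself.

```python
import numpy as np

def cosine_distance(a, b):
    """Pairwise distance in embedding space. A trained Siamese network
    would learn this metric; here we assume plain cosine distance."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T

def few_shot_anomaly_score(query, normal_support, anomaly_support):
    """Higher score = more anomalous: far from the normal support set,
    close to the few labeled anomalous observations."""
    d_normal = cosine_distance(query, normal_support).mean(axis=1)
    d_anomaly = cosine_distance(query, anomaly_support).mean(axis=1)
    return d_normal - d_anomaly

rng = np.random.default_rng(1)
center_normal, center_anomalous = np.ones(8), -np.ones(8)

normal_support = center_normal + 0.1 * rng.normal(size=(20, 8))
anomaly_support = center_anomalous + 0.1 * rng.normal(size=(3, 8))  # 3 shots

query_normal = center_normal + 0.1 * rng.normal(size=(1, 8))
query_anomalous = center_anomalous + 0.1 * rng.normal(size=(1, 8))

s_norm = few_shot_anomaly_score(query_normal, normal_support, anomaly_support)
s_anom = few_shot_anomaly_score(query_anomalous, normal_support, anomaly_support)
assert s_anom[0] > s_norm[0]
```

The few anomalous observations thus shift the decision boundary directly, which is why even a handful of labeled anomalies can outperform a purely unsupervised detector.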