A compact language model for Swedish text anonymization

Victor Wiklund

Abstract:

The General Data Protection Regulation (GDPR), which came into effect in 2018, states that personal information must be anonymized before it can be freely used for research and statistics. To properly anonymize a text, one needs to identify the words that carry personally identifying information, such as names, locations, and organizations. Named Entity Recognition (NER) is the task of detecting these kinds of words, and a great deal of progress has been made on it over the last decade. This progress can largely be attributed to machine learning, in particular the development of language models trained on vast amounts of textual data in the target language. These models are very computationally demanding, however, and not everyone has the resources needed to run them. As a counterpoint to the trend of ever more complex models chasing higher performance, ALBERT is a recently developed language model designed to be substantially more compact at only a small cost in performance. In this thesis we explore the use of ALBERT as a component in Swedish anonymization by combining the model with a one-layer BiLSTM classifier and testing it on the Stockholm-Umeå Corpus. The results show that the system can separate personally identifying words from ordinary words 79.4% of the time, and that it performs best at detecting names, with an F1-score of 87.7%. Averaged across eight categories, we obtain an F1-score of 77.8% with five-fold cross-validation and 77.0 ± 0.2% on the test set at 95% confidence. We find that the system as-is could be used to anonymize some types of information, but only at the risk of also obscuring non-sensitive information. We discuss ways to reduce this risk by improving the model's performance and conclude that ALBERT can be a useful component in Swedish anonymization, provided that it is optimized further.
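
The architecture summarized above lends itself to a compact sketch. The following is a minimal illustration, not the thesis implementation: a pretrained ALBERT encoder feeds its contextual token embeddings into a one-layer BiLSTM, and a linear layer classifies each token into an entity category. The checkpoint name albert-base-v2 (an English model), the LSTM hidden size, and the label count of nine (eight entity categories plus an "O" class for ordinary words) are illustrative assumptions; a Swedish deployment would substitute an ALBERT model pretrained on Swedish text.

    import torch
    import torch.nn as nn
    from transformers import AlbertModel, AlbertTokenizerFast

    class AlbertBiLstmTagger(nn.Module):
        def __init__(self, albert_name="albert-base-v2", num_labels=9, lstm_hidden=256):
            super().__init__()
            # Pretrained ALBERT encoder producing contextual token embeddings.
            self.encoder = AlbertModel.from_pretrained(albert_name)
            # A single bidirectional LSTM layer over the encoder output.
            self.bilstm = nn.LSTM(
                input_size=self.encoder.config.hidden_size,
                hidden_size=lstm_hidden,
                num_layers=1,
                bidirectional=True,
                batch_first=True,
            )
            # Per-token classification over the entity categories plus "O".
            self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            lstm_out, _ = self.bilstm(hidden)
            return self.classifier(lstm_out)  # shape: (batch, seq_len, num_labels)

    tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
    model = AlbertBiLstmTagger()
    batch = tokenizer(["Anna bor i Stockholm."], return_tensors="pt")
    logits = model(batch["input_ids"], batch["attention_mask"])
    print(logits.shape)  # torch.Size([1, seq_len, 9])

The F1-scores reported above are the usual harmonic mean of precision P and recall R, F1 = 2PR / (P + R), computed per category and then averaged.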