Tove Tengvall

A Method for Automatic Question Answering in Swedish based on BERT

Abstract

This report presents a method for automatic reading comprehension in Swedish. The method is based on BERT, a pre-trained neural-network language model; a Swedish BERT model was fine-tuned on a Swedish question-answer corpus. This corpus was built by having human annotators pose questions, expressed in natural language, over paragraphs of text from a set of articles collected from Swedish Wikipedia and the Swedish Migration Agency. In the task defined, the model should return the short span of text within a given paragraph that constitutes the correct answer to a given question. The dataset was partitioned into 910 question-answer pairs for training and 105 pairs for validation. The quality of the method was evaluated on 257 questions by comparing the returned answers with the correct answers from the corpus, as well as with the results of a simpler grammatical method developed as a baseline.
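To make the span-extraction task concrete, the sketch below shows how a BERT model with a question-answering head predicts an answer span, assuming the HuggingFace transformers library. The checkpoint name "KB/bert-base-swedish-cased" is an assumption used for illustration; the report's exact model and fine-tuning setup may differ, and this base checkpoint would first need to be fine-tuned on the question-answer corpus before producing sensible spans.

```python
# Minimal sketch of extractive QA with a Swedish BERT model.
# Assumptions: HuggingFace transformers; the "KB/bert-base-swedish-cased"
# checkpoint stands in for whichever Swedish BERT the study used.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "KB/bert-base-swedish-cased"  # hypothetical choice of checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "Vad heter Sveriges huvudstad?"  # "What is Sweden's capital called?"
paragraph = "Stockholm är Sveriges huvudstad och största stad."

# Encode the (question, paragraph) pair as a single input sequence.
inputs = tokenizer(question, paragraph, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The QA head scores every token as a potential start or end of the
# answer span; the predicted span is the argmax of each distribution.
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)
```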

Using the fine-tuned Swedish BERT-base model, we obtain an F-score of 78.1% and an Exact Match score of 63.0% when evaluating the model on the collection of questions generated in the study. The model outperforms the baseline and is assessed to be a successful method for the question answering task defined. However, although these results indicate that BERT has great potential as a method for automatic question answering in Swedish, they do not match the results of the English BERT model fine-tuned on the English question-answer corpus SQuAD. The weaker performance of the Swedish model may be explained by the question-answer corpus used in this study being much smaller than the SQuAD corpus.
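The F-score and Exact Match metrics cited above follow the evaluation style established by SQuAD. A minimal sketch of these two metrics is given below; it assumes simple whitespace tokenization, whereas the official SQuAD evaluation also lowercases and strips punctuation and articles before comparing, and the report's exact normalization rules are not specified here.

```python
# Minimal sketch of SQuAD-style Exact Match and token-overlap F1.
# Assumption: whitespace tokenization without further normalization.
from collections import Counter

def exact_match(prediction: str, truth: str) -> float:
    # 1.0 if the predicted span matches the gold answer exactly, else 0.0.
    return float(prediction.strip() == truth.strip())

def f1_score(prediction: str, truth: str) -> float:
    # Harmonic mean of token-level precision and recall between spans.
    pred_tokens = prediction.split()
    truth_tokens = truth.split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# Both scores are averaged over all evaluation questions to give the
# corpus-level percentages reported in the abstract.
```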