Automatic Proofreading of Swedish Text

Ola Knutsson

Department of Numerical Analysis and Computer Science

Royal Institute of Technology

SE-100 44 Stockholm, Sweden

knutsson@nada.kth.se

Abstract

This thesis describes the development and evaluation of a Swedish grammar-checking environment, Granska. The focus is on the construction of a rule language, the development of rules covering two major error categories, and evaluation of the system. The error categories are agreement errors and split compounds, both of which are frequent in Swedish texts.

Research on grammar checking contributes to more than the development of better programs for automatic proofreading. Its methods and algorithms are also very useful in any system for robust analysis of unrestricted text.

The rule language is object-oriented and has been designed for grammar checking and similar applications. One of the main ideas behind the rule language was to integrate several levels of language analysis. A rule should benefit from every step in the general linguistic analysis, from tokenization to phrase structure analysis. Much effort has also been put into a clear and effective rule notation.
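As an illustration of how a rule can draw on earlier analysis steps, the following is a minimal Python sketch of an agreement check over tokens that already carry part-of-speech and feature tags. It is not written in the Granska rule language and does not reproduce any Granska rule. The `Token` class, the tag names (`dt`, `nn`, `utr`, `neu`, `sin`, `plu`) and the function `agreement_errors` are hypothetical names chosen for this example, and the tags are assigned by hand rather than by a tagger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A word with hand-assigned tags (hypothetical tag set, not Granska's)."""
    text: str
    pos: str     # part of speech: "dt" = determiner, "nn" = noun
    gender: str  # "utr" = common gender, "neu" = neuter
    number: str  # "sin" = singular, "plu" = plural


def agreement_errors(tokens):
    """Return (index, message) pairs for adjacent determiner-noun pairs
    whose gender or number tags differ."""
    errors = []
    for i in range(len(tokens) - 1):
        det, noun = tokens[i], tokens[i + 1]
        if det.pos == "dt" and noun.pos == "nn":
            if det.gender != noun.gender or det.number != noun.number:
                message = f"'{det.text} {noun.text}': determiner and noun disagree"
                errors.append((i, message))
    return errors


# "en hus" is an agreement error in Swedish; the correct form is "ett hus".
wrong = [Token("en", "dt", "utr", "sin"), Token("hus", "nn", "neu", "sin")]
right = [Token("ett", "dt", "neu", "sin"), Token("hus", "nn", "neu", "sin")]

print(agreement_errors(wrong))
print(agreement_errors(right))
```

The check only compares tags produced by an earlier step, so its quality depends entirely on that step. A realistic rule would also have to handle intervening adjectives and ambiguous word forms.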

In the absence of large, comprehensive error corpora, a theoretical approach to error modelling, based on Swedish grammar descriptions, has been used. This has resulted in rather general error detection rules. The main advantages of the general rules are that they to some extent describe an error type, and that they can easily be extended and improved. The main disadvantages are overgeneration of error reports and the fact that a small mistake by the grammarian often has a great effect on the result.

However, the development of the error detection rules has also taken advantage of the results of some limited studies of real errors in Swedish text. The collections of real errors give important information about how errors are constructed and in which contexts they appear.

Granska was evaluated on about 200 000 words from five different text genres. The results indicate that the outcome of grammar checking differs between text genres. An error type that is very common in one genre is sometimes not represented at all in another. In test runs on popular science texts, nine of ten errors were found and five of ten error reports were correct. In student texts, the results were almost the opposite: four of ten errors were found and seven of ten error reports were correct.
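The proportions above correspond to what is usually called recall (the share of errors found) and precision (the share of error reports that are correct). The abstract does not use these terms. The short Python sketch below turns the stated proportions into numbers, using the rounded "of ten" figures from the text rather than the underlying counts. The F-score shown is an additional summary measure that is not reported in the abstract, and the function names are chosen for this example.

```python
def recall(errors_found, errors_total):
    """Share of the errors in the text that the checker found."""
    return errors_found / errors_total


def precision(correct_reports, reports_total):
    """Share of the checker's error reports that were correct."""
    return correct_reports / reports_total


def f_score(p, r):
    """Harmonic mean of precision and recall (not reported in the abstract)."""
    return 2 * p * r / (p + r)


# Rounded proportions as stated in the text, not the underlying counts.
genres = {
    "popular science": {"recall": recall(9, 10), "precision": precision(5, 10)},
    "student texts": {"recall": recall(4, 10), "precision": precision(7, 10)},
}

for name, scores in genres.items():
    f = f_score(scores["precision"], scores["recall"])
    print(f"{name}: recall={scores['recall']:.2f}, "
          f"precision={scores['precision']:.2f}, F={f:.2f}")
```

In the popular science texts the checker finds most errors at the cost of many false alarms, while in the student texts it raises fewer false alarms but misses more than half of the errors.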

A small user study was conducted with Granska and a commercial grammar checker. The results indicate that users have no problem choosing between different error diagnoses if one of the correction proposals is correct. Some users seem to need only the detection from a grammar checker and can correct the text themselves. False alarms seem to vary in how much trouble they cause: false alarms from the spell-checker are harmless, but false alarms from more complicated error types can cause harm to the text.