Every natural language text has certain properties. Some of these
properties are shared by all texts, for example stop words. These are frequent
words that carry little meaning for the text, such as prepositions, conjunctions,
interjections and numerals. Other words, the keywords, are significant for the text.
Keywords can be nouns, adjectives, adverbs and certain verbs. Keywords also belong to the open word classes, whose membership changes over time.
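The distinction between stop words and keywords can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list below is a tiny hypothetical English sample (real stop lists hold a few hundred words), and raw frequency is used as a crude stand-in for significance.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "but",
              "to", "is", "are", "for", "with"}

def candidate_keywords(text, top_n=3):
    """Return the top_n most frequent words that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    content_words = [t for t in tokens if t not in STOP_WORDS]
    return [word for word, _ in Counter(content_words).most_common(top_n)]
```

Calling candidate_keywords on a short news text returns its most frequent content words; the stop words never appear in the result, however often they occur.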
Automatic text filtering means, for example, finding texts that contain a specific keyword or several keywords. It can be used in automatic news agents. Text filtering is closely related to categorization and clustering of texts.
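A minimal sketch of keyword-based filtering, assuming whole-word, case-insensitive matching. The function name filter_texts and the require_all switch are hypothetical choices made for this example, not part of any system described here.

```python
import re

def filter_texts(texts, keywords, require_all=False):
    """Keep the texts that contain any (or, optionally, all) of the keywords."""
    wanted = {k.lower() for k in keywords}
    combine = all if require_all else any
    selected = []
    for text in texts:
        # Compare whole words so that "sale" does not match "salesman".
        words = set(re.findall(r"\w+", text.lower()))
        if combine(k in words for k in wanted):
            selected.append(text)
    return selected
```

A news agent could run such a filter over incoming articles and pass on only those that mention the keywords a reader has subscribed to.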
Clustering and automatic categorization both sort sets of
texts into groups within which the texts are similar in substance. Categorization
sorts texts into predefined categories. Clustering builds previously unknown categories
depending on the contents of the set. Categorization is suitable when one wishes
to sort a set of texts according to predefined categories, clustering when one wishes
to explore the structure of the set.
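The difference between the two tasks can be sketched as follows. This is an illustration only, assuming bag-of-words vectors, cosine similarity and a simple single-pass clustering with a hypothetical similarity threshold; it is not the clustering system described on this page.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Represent a text as a bag of words (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def categorize(text, categories):
    """Categorization: pick the most similar of the PREDEFINED categories.

    categories maps a category name to a short description text.
    """
    vec = vectorize(text)
    return max(categories,
               key=lambda name: cosine(vec, vectorize(categories[name])))

def cluster(texts, threshold=0.3):
    """Clustering: form groups that were NOT known in advance.

    Each text joins the first cluster whose summed word counts are
    similar enough; otherwise it starts a new cluster.
    """
    clusters = []  # list of (summed word counts, member indices)
    for i, text in enumerate(texts):
        vec = vectorize(text)
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)
                members.append(i)
                break
        else:
            clusters.append((Counter(vec), [i]))
    return [members for _, members in clusters]
```

Note that categorize needs the category descriptions as input, whereas cluster receives only the texts and returns groups of indices whose meaning still has to be interpreted.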
Many sets of texts are categorized as a matter of routine. In libraries and newspapers, for instance, the same categories have been used, unchallenged, for a long time, mostly because re-categorizing the enormous sets would be too time-consuming.
But as a researcher you would sometimes wish to sort these sets into categories other than the usual ones. Here both clustering and categorization could be helpful, as long as the texts are stored digitally. Of course, a human could probably outdo the results of both methods, but on the other hand a human would never try to sort really big sets.
Clustering could also be used as a tool for exploration: producing a survey of a set by splitting it into groups containing similar texts, thus revealing the main subjects in the set. A clustering of this kind could give you new information about the set (even when it has already been categorized).
As an example, we have clustered a set of newspaper articles with our own clustering system. The result was a cluster (group) of articles about murder and assault, taken from both the foreign and the domestic sections of the newspaper's own categorization, a cluster of articles about sales, taken from the economy, sports and culture sections, and a few other clusters with less distinct contents. The clustering algorithm has thus shown us (as if we did not know) that two major concerns of newspapers are assault and sales.
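One simple way to see what a cluster is about is to list its most frequent content words. The sketch below assumes that a cluster is given as a list of texts and uses a small hypothetical stop list; it only illustrates the idea and does not describe how our system presents its clusters.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is"}

def describe_cluster(texts, top_n=2):
    """Return the top_n most frequent non-stop words across a cluster's texts."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in re.findall(r"[a-z]+", text.lower())
                      if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]
```

Applied to every cluster in turn, such labels give a quick survey of the main subjects in the set.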
In our current research project, Infomat, we are working on query expansion and clustering.
M. Hassel 2001. Internet as Corpus - Automatic Construction of a Swedish News Corpus. In the Proceedings of NODALIDA '01 - 13th Nordic Conference on Computational Linguistics, May 21-22, 2001, Uppsala, Sweden.
M. Rosell 2002. Klustring av svenska tidningsartiklar (Clustering of Swedish newspaper articles, in Swedish). Master's thesis, NADA-KTH.
M. Rosell 2003. Improving Clustering of Swedish Newspaper Articles using Stemming and Compound Splitting. NoDaLiDa 2003, Reykjavik, 2003.