Infomat - Swedish information retrieval with language technology and matrix computations

The project was funded from 2003 to 2005 by the Swedish Research Council (VR) and KTH. A description of the project in Swedish can be found here.

Short description of the project

Information retrieval (IR) is a large and growing subarea of language technology. Search engines and other information retrieval tools rank search results using various models, for example Boolean models, term-weighting models and vector-based models. Tools from numerical matrix computation offer useful approaches to several important subproblems in IR. The general problem is to track relations and to give a measure of closeness.
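As an illustration of the vector-based model mentioned above, here is a minimal sketch that ranks documents by cosine similarity between tf-idf vectors. The function names, the weighting scheme (raw term frequency times log(N/df)) and the toy documents are assumptions made for this example, not something taken from the project.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return sparse tf-idf vectors (term -> weight) for tokenised docs, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scored, reverse=True)

# Toy collection (hypothetical): two medical documents and one about football.
docs = [
    ["sjukhus", "läkare", "patient"],
    ["fotboll", "match", "mål"],
    ["läkare", "medicin", "patient", "vård"],
]
ranking = rank(["läkare", "patient"], docs)
```

In this toy collection the football document shares no term with the query, so it receives score 0 and is ranked last.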

Latent Semantic Indexing (LSI) corresponds to computing the singular value decomposition (SVD) of the term-document matrix, which emphasises common and global properties over particular and local ones. With our approach, we also retrieve documents that are only indirectly related to the original query. In the project we will use similar methods to detect synonyms and related terms, and to perform clustering.
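The effect of the SVD can be sketched on a tiny term-document matrix. The example below uses NumPy's full SVD and a rank-k projection; the matrix, the term list and the function names are assumptions chosen for illustration, and the Krylov subspace methods studied in the project are not shown here.

```python
import numpy as np

def cosine_columns(q, D):
    """Cosine similarity between vector q and every column of D (0 where a norm is 0)."""
    norms = np.linalg.norm(D, axis=0) * np.linalg.norm(q)
    return np.divide(q @ D, norms, out=np.zeros_like(norms), where=norms > 0)

def lsi_similarities(A, q, k):
    """Compare query and documents after projecting onto the top-k left singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    docs_k = Uk.T @ A   # equals diag(s[:k]) @ Vt[:k]
    q_k = Uk.T @ q
    return cosine_columns(q_k, docs_k)

# Rows: bil, fordon, motor, läkare, patient. Columns: three toy documents.
A = np.array([
    [1.0, 0.0, 0.0],   # bil
    [0.0, 1.0, 0.0],   # fordon
    [1.0, 1.0, 0.0],   # motor
    [0.0, 0.0, 1.0],   # läkare
    [0.0, 0.0, 1.0],   # patient
])
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # query containing only "bil"

raw = cosine_columns(q, A)         # plain term matching
lsi = lsi_similarities(A, q, k=2)  # rank-2 LSI
```

The second document never mentions "bil", so plain term matching gives it similarity 0. It shares "motor" with the first document, however, and in the rank-2 space it is retrieved with high similarity, while the medical document stays unrelated.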

IR works best when language-specific knowledge is used. Swedish has been little studied from an IR perspective, and much work remains to be done. We will use stemming, lemmatization, noun phrase extraction, compound splitting, spelling correction and clustering to extract as much information as possible from the queries and the texts.
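To give an idea of what compound splitting involves, here is a minimal dictionary-based sketch. It tries every split point and accepts a split if both parts are known words, optionally dropping a linking "s" after the first part. The lexicon, the function name and the `min_len` parameter are hypothetical, and this is not the statistical method developed in the project.

```python
def split_compound(word, lexicon, min_len=3):
    """Return all two-part splits (head, tail) where both parts are in the lexicon.

    A linking "s" at the end of the head (as in "fotboll-s-match") is dropped.
    """
    splits = []
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if tail not in lexicon:
            continue
        if head in lexicon:
            splits.append((head, tail))
        elif head.endswith("s") and head[:-1] in lexicon:
            splits.append((head[:-1], tail))
    return splits

# Hypothetical mini-lexicon; a real system would use a full word list.
lexicon = {"fotboll", "match", "sjukhus", "personal"}

football = split_compound("fotbollsmatch", lexicon)
hospital = split_compound("sjukhuspersonal", lexicon)
simple = split_compound("match", lexicon)
```

With this lexicon, "fotbollsmatch" is split into "fotboll" and "match", and "sjukhuspersonal" into "sjukhus" and "personal". Indexing the parts as well as the whole compound lets a query for "fotboll" also match documents that only contain the compound.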

An important part of the project will be to evaluate rankings and clusterings. We will use two different Swedish corpora in this work: the KTH news corpus and a medical corpus from Medical Epidemiology at Karolinska Institutet.
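One simple way to compare a clustering with a manual classification is purity: the fraction of documents that belong to the majority class of their cluster. The sketch below is only an illustration of that general idea, with made-up labels; it is not necessarily the evaluation measure used in the project.

```python
from collections import Counter, defaultdict

def purity(cluster_ids, class_labels):
    """Fraction of documents whose manual class is the majority class of their cluster."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(cluster_ids)

# Hypothetical example: six documents, two clusters, two manual classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "medicin", "medicin", "medicin", "sport"]
score = purity(clusters, labels)
```

Purity is 1.0 when every cluster contains documents from a single class. It also rewards clusterings with many small clusters, so it is usually read together with other measures.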

Degree projects (exjobb)

We have a list of open degree projects (exjobb, student thesis projects).

Participants

Reports in English

Magnus Rosell, Clustering in Swedish - The Impact of some Properties of the Swedish Language on Document Clustering and an Evaluation Method, Lic. thesis in computer science, KTH Nada, TRITA-NA-0531, 2005.
Oscar Täckström, An Evaluation of Bag-of-Concepts Representations in Automatic Text Classification, Master's thesis in computer science, KTH Nada, TRITA-NA-E05150, 2005.
Katarina Blom and Axel Ruhe, Information Retrieval using a Krylov subspace method, SIAM J. Matrix Analysis and Applications, vol. 26, pages 566-582, 2005.
Hercules Dalianis, Improving search engine retrieval using a compound splitter for Swedish, to appear at NoDaLiDa 2005, Joensuu, 2005. Abstract
Viggo Kann and Magnus Rosell, Free construction of a free Swedish dictionary of synonyms, NoDaLiDa 2005, Joensuu, 2005. Abstract, Presentation (PowerPoint)
Magnus Rosell and Sumithra Velupillai, The impact of phrases in document clustering for Swedish, NoDaLiDa 2005, Joensuu, 2005.
Jonas Sjöbergh, Creating a free digital Japanese-Swedish dictionary, PACLING 2005.
Sumithra Velupillai, Phrases or Words? Clustering and categorizing Swedish scientific medical text, Master's thesis in computer science, KTH Nada, TRITA-NA-E05063, 2005.
Magnus Rosell, Viggo Kann, Jan-Eric Litton, Comparing comparisons: Document clustering evaluation using two manual classifications, ICON 2004, India.
Jonas Sjöbergh, Viggo Kann, Finding the correct interpretation of Swedish compounds, a statistical approach, Proc. LREC 2004 (4th Int. Conf. Language Resources and Evaluation), Lisbon, Portugal.
Katarina Blom, Information retrieval using Krylov subspace methods, Ph.D. thesis, Chalmers University of Technology, ISBN 91-7291-453-X, 2004.
Magnus Rosell, Improving Clustering of Swedish Newspaper Articles using Stemming and Compound Splitting, NoDaLiDa 2003, Reykjavik, 2003.
Katarina Blom and Axel Ruhe, Information Retrieval using very short Krylov sequences, Proc. Computational Information Retrieval Workshop 2000, SIAM Proceedings in Applied Mathematics 106, 2001.
Katarina Blom, Information retrieval using the singular value decomposition and Krylov subspaces, licentiate thesis, 1999.

Reports in Swedish

Jonas Sjöbergh, Viggo Kann, Vad kan statistik avslöja om svenska sammansättningar?, Språk och stil 16:199-214, 2006.
Rasmus Kjellman, Spellchecking in search engines, course report in "Advanced, Individual Course in Computer Science" at KTH Nada, 2005.

Links

Up to research in language technology.

Responsible for this page: Viggo Kann <viggo@nada.kth.se>
Latest change February 7, 2007
Technical support: <webmaster@nada.kth.se>