Infomat - Swedish information retrieval with language technology and matrix computations

The project was funded 2003-2005 by the Swedish Research Council (VR) and KTH. A description of the project in Swedish is also available.

Short description of the project

Information retrieval (IR) is a large and growing subarea of language technology. Search engines and other IR tools use various models to rank search results, for example Boolean models, term-weighting models and vector-based models. Tools from numerical matrix computation offer promising approaches to several important subproblems in IR. The general problem is to track relations and to measure closeness.
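As an illustration of the vector-based model mentioned above, the following is a minimal sketch, not the project's implementation. It assumes a small hand-made term-document matrix (rows are terms, columns are documents) and ranks documents by cosine similarity to a query vector. The function name cosine_rank and the toy data are hypothetical.

```python
import numpy as np

def cosine_rank(term_doc, query_vec):
    """Rank documents (columns of term_doc) by cosine similarity to query_vec."""
    doc_norms = np.linalg.norm(term_doc, axis=0)
    q_norm = np.linalg.norm(query_vec)
    denom = doc_norms * q_norm
    # Avoid division by zero for empty documents or an empty query.
    denom = np.where(denom == 0, 1.0, denom)
    scores = (term_doc.T @ query_vec) / denom
    # Stable sort on negated scores gives decreasing similarity.
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Toy data (assumed for illustration): terms are "matrix", "search", "swedish".
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 0.0])  # a query containing only the term "matrix"

order, scores = cosine_rank(A, q)
print(order, scores)
```

Here closeness is the cosine of the angle between the query and each document column, so documents that do not contain the query term receive a score of zero.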

Latent Semantic Indexing (LSI) corresponds to computing the singular value decomposition (SVD) of the term-document matrix, which emphasises common and global properties over particular and local ones. With this approach, we also retrieve documents that are only indirectly related to the original query. In the project we will use similar methods to detect synonyms and related terms, and to perform clustering.
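The effect of retrieving indirectly related documents can be sketched with a rank-k truncated SVD. This is a toy example under stated assumptions, not the project's code: the matrix, the choice k = 2, and the function name lsi_scores are all hypothetical, and the query is folded into the latent space with the standard projection q_hat = inverse(Sigma_k) * transpose(U_k) * q.

```python
import numpy as np

def lsi_scores(term_doc, query_vec, k):
    """Cosine similarity between a query and each document in a rank-k LSI space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    V_k = Vt[:k, :].T                    # one row per document
    q_hat = (U_k.T @ query_vec) / s_k    # fold the query into the latent space
    denom = np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat)
    denom[denom == 0] = 1.0              # guard against zero vectors
    return (V_k @ q_hat) / denom

# Toy term-document matrix (assumed). Rows: car, automobile, engine, flower.
A = np.array([[1.0, 0.0, 0.0],   # "car" occurs only in document 0
              [0.0, 1.0, 0.0],   # "automobile" occurs only in document 1
              [1.0, 1.0, 0.0],   # "engine" occurs in documents 0 and 1
              [0.0, 0.0, 2.0]])  # "flower" occurs only in document 2
q = np.array([1.0, 0.0, 0.0, 0.0])  # query: "car"

raw = A.T @ q                 # plain term overlap with the query
lsi = lsi_scores(A, q, k=2)   # similarity in the reduced space
print(raw, lsi)
```

Document 1 shares no term with the query, so its raw overlap is zero, yet in the reduced space it scores as high as document 0 because both documents co-occur with "engine".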

IR works best when language-specific knowledge is used. Swedish has been little studied from the IR perspective, and much work remains to be done. We will use stemming, lemmatization, noun phrase extraction, compound splitting, spelling correction and clustering to extract as much information as possible from the queries and the texts.
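Of the techniques listed above, compound splitting is easy to illustrate. The sketch below is a deliberately naive, lexicon-based splitter and not the method used in the project. It assumes a tiny hypothetical lexicon, handles only two-part compounds, and allows for the linking "s" that often joins the parts of Swedish compounds. The function name split_compound and the parameter min_len are illustrative.

```python
def split_compound(word, lexicon, min_len=3):
    """Split a word into two lexicon words, allowing a linking 's'; else return it whole."""
    word = word.lower()
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if tail not in lexicon:
            continue
        if head in lexicon:
            return [head, tail]
        # Try removing a linking "s" from the end of the first part.
        if head.endswith("s") and head[:-1] in lexicon:
            return [head[:-1], tail]
    return [word]

# Hypothetical mini-lexicon, assumed for illustration only.
lexicon = {"fotboll", "match", "sjukhus", "personal"}

print(split_compound("fotbollsmatch", lexicon))    # fotboll + s + match
print(split_compound("sjukhuspersonal", lexicon))  # sjukhus + personal
print(split_compound("matris", lexicon))           # not a known compound
```

Indexing the parts as well as the whole word would let a query for "fotboll" match a text that only contains "fotbollsmatch". A realistic splitter would also need to handle ambiguous splits and compounds with more than two parts.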

An important part of the project will be to evaluate rankings and clusterings. We will use two different Swedish corpora in this work: the KTH news corpus and a medical corpus from Medical Epidemiology at Karolinska Institutet.
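Ranking evaluation is commonly based on relevance judgements for each query. The following sketch shows two standard measures, precision at k and average precision. The document identifiers and judgements are invented for illustration and do not come from the corpora named above.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each retrieved relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant)

# Hypothetical system output and relevance judgements for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

p2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
print(p2, ap)
```

Averaging the second measure over a set of queries gives mean average precision, a single figure that can be used to compare ranking methods.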

Degree projects (exjobb)

We have a list of open degree projects (exjobb) for students.

Participants

Reports in English

Reports in Swedish

Links



Responsible for this page: Viggo Kann <viggo@nada.kth.se>
Latest change February 7, 2007
Technical support: <webmaster@nada.kth.se>