Infomat - Swedish information retrieval with language technology and matrix computations
The project was funded 2003-2005 by the Swedish Research Council (VR).
A description of the project in Swedish can be found here
Short description of the project
Information retrieval (IR) is a large and growing subfield of language
technology. Search engines and information retrieval tools use various
models to rank search results, for example Boolean models, term-weighting
models and vector-based models. Tools from numerical matrix
computation give interesting clues to the solution of several
important subproblems in IR. The general problem is to track
relations and give a measure of closeness.
Latent Semantic Indexing (LSI) corresponds to computing the singular value
decomposition (SVD) of the term-document matrix, which emphasises
common and global properties over particular and local ones. In our
approach, we also retrieve documents that are only indirectly related to
the original query. In the project we will use similar methods to
detect synonyms and related terms, and to perform clustering.
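The idea of retrieving indirectly related documents through a truncated SVD can be illustrated with a minimal sketch. It assumes NumPy and a made-up toy term-document matrix; the vocabulary, the counts, the rank k = 2 and the function names are illustrative assumptions, not the project's data or implementation.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between a query vector and each column of doc_matrix."""
    denom = np.linalg.norm(doc_matrix, axis=0) * np.linalg.norm(query_vec)
    denom[denom == 0] = 1.0  # avoid division by zero for empty vectors
    return (query_vec @ doc_matrix) / denom

def lsi_scores(A, query_vec, k):
    """Score documents against a query in a rank-k latent space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    docs_k = s_k[:, None] * Vt_k   # documents in the latent space (k x n_docs)
    query_k = U_k.T @ query_vec    # query projected into the same space
    return cosine_scores(query_k, docs_k)

# Toy term-document matrix: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)

query = np.array([1, 0, 0, 0, 0], dtype=float)  # the single-term query "car"

raw = cosine_scores(query, A)    # plain vector-space scores
lsi = lsi_scores(A, query, k=2)  # scores in the rank-2 latent space
print("raw:", np.round(raw, 3))
print("lsi:", np.round(lsi, 3))
```

In this toy data the second document contains "automobile" and "engine" but not "car", so its plain cosine score is zero. In the rank-2 space it scores as high as the first document, because the two documents share the term "engine" and collapse onto the same latent direction.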
IR works best if language-specific knowledge is used. Swedish has been
little studied from the IR perspective, and much work remains to be done. We
will use stemming, lemmatization, noun phrase extraction, compound
splitting, spelling correction and clustering to extract as much
information as possible from the queries and the texts.
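As an illustration of one of these steps, compound splitting can be sketched as a lexicon lookup. This is a simplified, hypothetical example and not the splitter used in the project: the tiny lexicon, the function name `split_compound`, the handling of a linking "s" and the fewest-parts heuristic are all assumptions made for the sketch.

```python
def split_compound(word, lexicon, linkers=("s",), min_len=3):
    """Return every way to split `word` into lexicon words.

    After each non-final part, one optional linking element
    (for example the Swedish linking "s") may be skipped.
    """
    word = word.lower()
    splits = []
    if word in lexicon:
        splits.append([word])
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        # Try the remainder as it is, and with a linking element stripped.
        candidates = [rest] + [rest[len(l):] for l in linkers if rest.startswith(l)]
        for tail in candidates:
            for tail_split in split_compound(tail, lexicon, linkers, min_len):
                splits.append([head] + tail_split)
    return splits

# Hypothetical mini-lexicon, for illustration only.
lexicon = {"fot", "boll", "fotboll", "match", "lag"}

all_splits = split_compound("fotbollsmatch", lexicon)
# Simple heuristic: prefer the analysis with the fewest parts.
best = min(all_splits, key=len) if all_splits else None
print(all_splits)
print(best)
```

A lexicon lookup of this kind usually yields several candidate analyses for a compound. Choosing the correct one is a separate problem; the fewest-parts rule here is only a placeholder for such a choice.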
An important part of the project will be to evaluate rankings and
clusterings. We will use two different Swedish corpora in this
work: the KTH news corpus and a medical corpus from
Medical Epidemiology at Karolinska Institutet.
We have a list of open "exjobb"
(student degree projects).
Reports in English
- Magnus Rosell,
Clustering in Swedish - The Impact of some Properties of the Swedish Language on Document Clustering and an Evaluation Method,
Lic. thesis in computer science, KTH Nada, TRITA-NA-0531, 2005.
- Oscar Täckström,
An Evaluation of Bag-of-Concepts Representations in Automatic Text Classification,
Master's thesis in computer science, KTH Nada, TRITA-NA-E05150, 2005.
- Katarina Blom and Axel Ruhe,
Information Retrieval using a Krylov subspace method,
SIAM J. Matrix Analysis and Applications,
vol 26, pages 566-582, 2005.
- Hercules Dalianis,
Improving search engine retrieval using a compound splitter for Swedish,
to appear at NoDaLiDa 2005, Joensuu, 2005.
- Viggo Kann and Magnus Rosell,
Free construction of a free Swedish dictionary of synonyms,
NoDaLiDa 2005, Joensuu, 2005.
- Magnus Rosell and Sumithra Velupillai, The impact of phrases in document clustering for Swedish, NoDaLiDa 2005, Joensuu, 2005.
- Jonas Sjöbergh,
Creating a free digital Japanese-Swedish dictionary,
- Sumithra Velupillai,
Phrases or Words? Clustering and categorizing Swedish scientific medical text,
Master's thesis in computer science, KTH Nada, TRITA-NA-E05063, 2005.
- Magnus Rosell, Viggo Kann, Jan-Eric Litton,
Comparing comparisons: Document clustering evaluation using two manual classifications, ICON 2004, India.
- Jonas Sjöbergh, Viggo Kann,
Finding the correct interpretation of Swedish compounds, a statistical approach,
Proc. LREC 2004 (4th Int. Conf. Language Resources and
Evaluation), Lisbon, Portugal.
- Katarina Blom,
Information retrieval using Krylov subspace methods,
Ph.D. thesis, Chalmers University of Technology, ISBN 91-7291-453-X, 2004.
- Magnus Rosell, Improving Clustering of Swedish Newspaper Articles using Stemming and Compound Splitting, NoDaLiDa 2003, Reykjavik, 2003.
- Katarina Blom and Axel Ruhe, Information Retrieval using very short Krylov sequences, Proc. Computational Information Retrieval Workshop 2000, SIAM Proceedings in Applied Mathematics 106, 2001.
- Katarina Blom, Information retrieval using the singular value decomposition and Krylov subspaces, licentiate thesis, 1999.
Reports in Swedish