School of Computer Science and Communication

Schedule: Search Engines and Information Retrieval Systems, ir12

Period 3

Where and When | Activity | Reading | Examination
January 17
13.15-15.00
B22
Lecture 1: Introduction, boolean retrieval, course practicalities
Hedvig Kjellström, Johan Boye
Manning Chapter 1, 2
January 24
13.15-15.00
L44
Lecture 2: Term vocabulary, dictionaries and tolerant retrieval
Johan Boye
Manning Chapter 2, 3
January 31
13.15-15.00
B22
Lecture 3: Scoring, weighting, vector space model
Johan Boye
Manning Chapter 6, 7
January 31
15.00-17.00
Sporthallen
Computer hall session
Hedvig Kjellström, Johan Boye
Oral examination of Assignment 1 in front of computer
February 7
10.15-12.00
E34
Lecture 4: Retrieval of documents with hyperlinks
Johan Boye, Hedvig Kjellström
Manning Chapter 21
Avrachenkov Sections 1-2
February 14
10.15-12.00
D33
Lecture 5: Index construction, index compression
Johan Boye
Manning Chapter 4, 5
February 21
10.15-12.00
Q22
Lecture 6: Evaluation, relevance feedback, query expansion
Hedvig Kjellström
Manning Chapter 8, 9
February 24
13.00-15.00
Spelhallen
Computer hall session
Hedvig Kjellström, Johan Boye
Oral examination of Assignment 2 in front of computer
February 28
10.15-12.00
D41
Lecture 7: Probabilistic information retrieval, language models
Hedvig Kjellström
Manning Chapter 11, 12
March 15
14.00-19.00
V34
Written exam
Written exam

Period 4

Where and When | Activity | Reading | Examination
March 21
13.15-15.00
L31
Lecture 8: Classification
Hedvig Kjellström
Manning Chapter 13.1-4, 14.1-3, 15.3
March 28
13.15-15.00
L31
Lecture 9: Clustering, web crawling
Hedvig Kjellström
Manning Chapter 16.1-4, 17.1-3, 19, 20
April 11
15.15-17.00
L44
Lecture 10: Some useful additions to a search engine, Random Indexing
Viggo Kann
Magnus Sahlgren: An Introduction to Random Indexing (2005) http://www.sics.se/~mange/papers/RI_intro.pdf
April 18
13.15-15.00
B24
Lecture 11: Guest lectures
Simon Stenström, Findwise
Information Retrieval and Findability
The presentation will describe the process of building a search system rather than a search engine. There are many search engines out there, but the search engine is just one small part of creating a Findability solution. By a Findability solution, we at Findwise mean both using the full potential of search technology and focusing on the four other critical dimensions of Findability: Business, Users, Information and Organization. This presentation will, however, focus on the technical parts, such as the architecture of a search system. It will explain what we did to handle the real-case search scenario at Uppsala University, starting from unstructured files and ending up with a usable search system with an understandable user interface. You can see the result at http://search.uu.se/en/.

Oscar Täckström, SICS
Sentiment Analysis
In the last ten years, sentiment analysis has grown from a rather obscure sub-field of natural language processing to a highly productive and diverse field of research with a growing business impact. The basic assumption of all approaches to sentiment analysis is that we can learn something about people's attitudes towards other people, things and ideas by looking at what they write or say about them. The explosive growth of online media, such as blogs, micro-blogs and online fora, therefore provides a rich source of data from which we can learn more about people's attitudes and preferences. This knowledge is fundamental to, for example, brand management and business intelligence, but may also find use in sociology and political science.
In this lecture, I will give an overview of some important tasks and current approaches to sentiment analysis, focusing on the role of linguistic representations and tools from machine learning. Different levels of analysis require different tools, and I will spend some time discussing how to decide on the appropriate level and which tools to use for different levels. Specifically, I will discuss how one can create and use polarity lexicons and the limits of lexicon-based approaches, and, if time permits, I will describe some recent work on using graphical models to model fine-grained sentiment in product reviews.
April 25
13.15-15.00
B24
Lecture 12: Guest lectures
Hercules Dalianis, SU DSV
Some applications using clinical corpora to assist the clinician, her managers and clinical research
Today a large number of Electronic Patient Records (EPRs) are produced for legal reasons, but they are very seldom reused, either for clinical research or for business (hospital) intelligence. Moreover, the clinician's daily work in documenting the patient's status is not always properly supported. Hospital management needs key, real-time information about the health care processes. At the same time, patients have become more demanding customers who want to be involved in their own health care process. We aim to support these demands. Clinical documentation is an abundant source of valuable information that can be used for this purpose; however, clinical corpora contain protected health information and must be stored securely. In Sweden alone (with a population of 10 million), 4-10 million pages of patient records are produced each year. We have studied the Stockholm EPR Corpus, a huge clinical document collection written in Swedish, containing over one million patient records. The document collection is distributed over 900 clinics from the Stockholm area and covers the three years 2006-2008. We have used this clinical corpus as a knowledge base to develop a set of tools that can work as basic building blocks for future health engineering tools. We have been assisted by physicians who have interpreted the content of the clinical text for us, annotated the clinical text, and set requirements for these tools together with their colleagues. We have identified four groups of users in the health domain: physicians, clinical researchers, hospital management and patients. We will show examples of these tools and the benefits they will give to health care: 1) for physicians, automatic ICD-10 assignment; 2) for clinical researchers, comorbidity networks; 3) for hospital management, ICD-10 validation and adverse event detection; and finally 4) for patients, automatic text summarization.
Project homepage: http://dsv.su.se/forskning/health/.

Magnus Rosell, Recorded Future
Text Clustering Exploration
Text clustering can be used to explore the contents of a text set. We have developed a visualization method that aids such exploration and implemented it in a tool called Infomat. It presents the representation matrix directly in two dimensions. When the order of the texts and words is changed, for instance by clustering, distributional patterns that indicate similarities between texts and words appear. We have used Infomat to explore a set of free text answers about occupation from a questionnaire given to over 40 000 Swedish twins. The questionnaire also contained a closed answer regarding smoking. We compared several clusterings of the text answers to the closed answer, regarded as a categorization, by means of clustering evaluation. A recurring text cluster of high quality led us to formulate the hypothesis that "farmers smoke less than the average", which we could later verify by reading previous studies. This hypothesis generation method could be used on any set of texts that is coupled with data restricted to a limited number of possible values.
May 16
10.00-12.00
Fantum, Lindstedtsv 24, floor 5
Project presentations
Written report hand-in
Oral presentation in front of poster

Copyright © Page maintainer: Hedvig Kjellström <hedvig@nada.kth.se>
Updated 2012-04-25