bild
Skolan för
datavetenskap
och kommunikation
KTH / CSC / Kurser / DD2476 / ir12

Search Engines and Information Retrieval Systems, ir12

A course in Computer Science focusing on basic theory, models, and methods for information retrieval.

If you took DD2475 ir10, and have lab assignments and/or the project left, you are very welcome to finish these during the spring of 2012. Please contact Hedvig Kjellström (see People in the meny) to let us know that you are following the course.

News

February 14: You can now book a time slot for presenting your solutions to Computer Assignment 2, using this Doodle. The presentation takes place in front of a computer in Spelhallen on February 24. All members of the group have to be present, and be prepared to answer questions on all parts of the assignment.

January 30: We have rearranged the order of lectures 3-5, to give you the theory for Assignment 2 earlier - tomorrow, we will go through ranked retrieval. Sorry about the late notice! (The reason for the old order of lectures was that index compression was a part of assignment 1 last year, and had to be covered as early as possible in the lecture series.)

January 27: There was a bug in the MegaIndex class in computer assignment 1.4, that caused merged indexes not to be saved. This bug is now corrected. Please download the new version!

January 24: You can now book a time slot for presenting your solutions to Computer Assignment 1, using this Doodle. The presentation takes place in front of a computer in Sporthallen on January 31. All members of the group have to be present, and be prepared to answer questions on all parts of the assignment.

October 21: The homepages are now up and running. The course has changed names (and course code from DD2475 to DD2476), but cover essentially the same content, with a slight shift in focus towards web search, with e.g. a deeper coverage of linked document retrieval.

Learning Outcomes

After completing the course you will be able to:

  • explain the concepts of indexing, vocabulary, normalization and dictionary in Information Retrieval,
  • give an account of different text similarity measures, and select a similarity measure suitable for the problem at hand,
  • define a boolean model and a vector space model, and explain the differences between them,
  • implement a method for ranked retrieval of a very large number of documents with hyperlinks between them,
  • evaluate information retrieval algorithms, and give an account of the difficulties of evaluation,
  • give an account of the structure of a Web search engine.

Content

Basic and advanced techniques for information systems: information extraction; efficient text indexing; indexing of non-text data; Boolean and vector space retrieval models; evaluation and interface issues; structure of Web search engines.

Literature

Required Text Book

  • C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
The book can be ordered from your favorite internet bookstore, and found using ISBN 0521865719. Virtually all material from the book is also available online at nlp.stanford.edu/IR-book/information-retrieval-book.html.

Article

  • K. Avrachenkov, N. Litvak, D. Nemirovsky and N. Osipova, Monte Carlo Methods in PageRank Computation: When One Iteration is Sufficient, SIAM Journal on Numerical Analysis 45(2), 2007.
Only Sections 1-2 of this article are covered in the course.

Optional Books

  • I. Witten, A. Moffat and T. Bell, Managing Gigabytes, Morgan Kaufmann, 1999.
Useful as a reference for technical Information Retrieval in the first half of the course. Available online at books.google.com.
  • S. Marsland, Machine Learning: An Algorithmic Perspective, Taylor and Francis, 2009.

   

Useful as a reference for topics related to Machine Learning, classification, clustering and probability. Available online at books.google.com.
  • S. Chakrabarti, Mining the Web, Morgan Kaufmann, 2003.
Covers many topics in the last part of the course. Available online at books.google.com.

Other Resources

To get an idea of state-of-the-art in Information Retrieval research and development, take a look at the program of the annual conference ACM SIGIR.

Examination

Assignments

The examination in the course is performed through:
  • Two computer assignments (3 credits). The computer assignments are performed in groups of two students, and presented orally by the computer. Grade (normally the same for all group members): P(pass) / F(fail).
  • A written exam (3 credits). The exam is 5 hours long and takes place after the first half of the course, in December. The written exam can not be taken before the computer assignments are graded with P(pass). Grade: A - F(fail).
  • A project assignment (3 credits). The projects are performed in groups of two students, and presented with a short written report, as well as an oral poster presentation. Grade (normally the same for all group members): A - F(fail).
Details about the assignments themselves can be found under Written Exam, Computer Assignments and Project in the menu.

Grading

Course grades are assigned according to the following (CA = computer assignment grade, WE = written exam grade, PA = project assignmnent grade):
If CA = F, WE = F or PA = F, that part of the course has to be re-examined, until CA = P, WE >= E and PA >= E. The course grade is the average of WE and PA, according to the following:

WE
A
B
C
D
E
PA
A
A
A
B
B
C
B
A
B
B
C
C
C
B
B
C
C
D
D
B
C
C
D
D
E
C
C
D
D
E

The grading criteria for the different learning outcomes are:

Outcome E D C B A
Explain the concepts of indexing, vocabulary, normalization and dictionary in Information Retrieval X X X X X
Examined through laboration 1 and exam
Give an account of different text similarity measures, and select a similarity measure suitable for the problem at hand X X X X X
Examined through exam and project
Define a boolean model and a vector space model, and explain the differences between them X X X X X
Examined through laboration 1, 2 and exam
Implement a method for ranked retrieval of a very large number of documents with hyperlinks between them X X X X X
Examined through laboration 2 and exam
Evaluate information retrieval algorithms, and give an account of the difficulties of evaluation X X X X X
Examined through exam and project
Give an account of the structure of a Web search engine X X X X X
Examined through exam and project

Copyright © Sidansvarig: Hedvig Kjellström <hedvig@nada.kth.se>
Uppdaterad 2012-02-14