School of Computer Science and Communication

Search Engines and Information Retrieval Systems, ir12

A course in Computer Science focusing on basic theory, models, and methods for information retrieval.

If you took DD2475 ir10 and still have lab assignments and/or the project left, you are very welcome to finish these during the spring of 2012. Please contact Hedvig Kjellström (see People in the menu) to let us know that you are following the course.

News

June 21: The course analysis can now be found under Course Analysis in the menu. Have a great summer everyone!

June 06: Thank you for an inspiring course, especially the fabulous poster presentations! The project grades are now reported. You can retrieve the commented reports in paper form at the CSC student expedition, Osquars Backe 2, bottom floor, starting Monday.

May 18: Your thoughts and ideas are valuable in the further development of this course. Therefore, we would like you to fill out the following course evaluation form:

April 25: Announcements in connection to today's guest lectures: SU DSV announces 1-3 PhD positions, deadline April 30. For more info, contact Hercules Dalianis.
Recorded Future are looking for summer interns with a good knowledge of a language apart from English or Swedish, in particular Farsi. For more info, contact Magnus Rosell.

April 25: The computer hall session on May 9 has been cancelled due to schedule clashes for Hedvig and Johan. If you have questions about the projects that cannot be answered by the project proposers, contact Hedvig or Johan via email.

April 2: The poster session has been moved to May 16, 10:00-12:00, to enable Johan to be there. The schedule is updated with that information. Since attendance is compulsory for everyone, please let me know as soon as possible if you cannot attend (due to other compulsory activities). In that case, send me an email describing why you cannot attend.

April 2: The exams are now corrected and can be collected at the CSC student expedition beginning tomorrow morning. The results (6 A, 11 B, 3 C, 3 D, 1 E, 4 Fx, 0 F) are reported into Rapp. Passed exam results will appear in Ladok in a few days, and students who got Fx will be notified via email with further instructions on how to complete the exam to grade E.

March 20: The projects are now listed under Project in the menu. The first thing to do is to form project groups of 4-5 students. We will do match-making in the break on Wednesday, for those of you who have not formed groups yet.
When you have a group, you should select a project. Send an email to Hedvig with the first, second and third choice of your group - we will then optimize so that groups work on different projects but with a project that they fancy. There are 6 different projects, and you can also propose your own (talk to Hedvig or Johan).

February 27: No registration needed for the exam, just show up!

February 27: If you have an assignment left to present, please make an appointment with Hedvig or Johan as soon as possible, to be sure that you can present it before the exam. (The assignments need to be examined before you are allowed to take the exam.)

February 26: In order to prepare for the exam, please have a look at Written Exam in the menu, where the grading, content and form of the exam are explained. You can have a look at last year's exam to get an idea of what it will look like. We also recommend that you do the exercises of Chapters 1-9, 11, 12, and 21.

February 24: The projects will be announced within the next week. They will probably be performed in groups of 4-5 students, which means that there will be 6-7 project groups. Groups will be formed at the beginning of period 4. In addition to the suggested topics, it is also possible to suggest your own. If you have an idea for a project of your own, please send an email to Hedvig or Johan!

February 14: You can now book a time slot for presenting your solutions to Computer Assignment 2, using this Doodle. The presentation takes place in front of a computer in Spelhallen on February 24. All members of the group have to be present, and be prepared to answer questions on all parts of the assignment.

January 30: We have rearranged the order of lectures 3-5, to give you the theory for Assignment 2 earlier - tomorrow, we will go through ranked retrieval. Sorry about the late notice! (The reason for the old order of lectures was that index compression was a part of assignment 1 last year, and had to be covered as early as possible in the lecture series.)

January 27: There was a bug in the MegaIndex class in computer assignment 1.4, that caused merged indexes not to be saved. This bug is now corrected. Please download the new version!

January 24: You can now book a time slot for presenting your solutions to Computer Assignment 1, using this Doodle. The presentation takes place in front of a computer in Sporthallen on January 31. All members of the group have to be present, and be prepared to answer questions on all parts of the assignment.

October 21: The homepages are now up and running. The course has changed name and course code (from DD2475 to DD2476), but covers essentially the same content, with a slight shift in focus towards web search, e.g. a deeper coverage of linked document retrieval.

Learning Outcomes

After completing the course you will be able to:

  • explain the concepts of indexing, vocabulary, normalization and dictionary in Information Retrieval,
  • give an account of different text similarity measures, and select a similarity measure suitable for the problem at hand,
  • define a boolean model and a vector space model, and explain the differences between them,
  • implement a method for ranked retrieval of a very large number of documents with hyperlinks between them,
  • evaluate information retrieval algorithms, and give an account of the difficulties of evaluation,
  • give an account of the structure of a Web search engine.

Content

Basic and advanced techniques for information systems: information extraction; efficient text indexing; indexing of non-text data; Boolean and vector space retrieval models; evaluation and interface issues; structure of Web search engines.

Literature

Required Text Book

  • C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
The book can be ordered from your favorite internet bookstore, and found using ISBN 0521865719. Virtually all material from the book is also available online at nlp.stanford.edu/IR-book/information-retrieval-book.html.

Article

  • K. Avrachenkov, N. Litvak, D. Nemirovsky and N. Osipova, Monte Carlo Methods in PageRank Computation: When One Iteration is Sufficient, SIAM Journal on Numerical Analysis 45(2), 2007.
Only Sections 1-2 of this article are covered in the course.

Optional Books

  • I. Witten, A. Moffat and T. Bell, Managing Gigabytes, Morgan Kaufmann, 1999.
Useful as a reference for technical Information Retrieval in the first half of the course. Available online at books.google.com.
  • S. Marsland, Machine Learning: An Algorithmic Perspective, Taylor and Francis, 2009.
Useful as a reference for topics related to Machine Learning, classification, clustering and probability. Available online at books.google.com.
  • S. Chakrabarti, Mining the Web, Morgan Kaufmann, 2003.
Covers many topics in the last part of the course. Available online at books.google.com.

Other Resources

To get an idea of the state of the art in Information Retrieval research and development, take a look at the program of the annual conference ACM SIGIR.

Examination

Assignments

The examination in the course is performed through:
  • Two computer assignments (3 credits). The computer assignments are performed in groups of two students, and presented orally at the computer. Grade (normally the same for all group members): P(pass) / F(fail).
  • A written exam (3 credits). The exam is 5 hours long and takes place after the first half of the course, in December. The written exam cannot be taken before the computer assignments are graded with P(pass). Grade: A - F(fail).
  • A project assignment (3 credits). The projects are performed in groups of 4-5 students, and presented with a short written report, as well as an oral poster presentation. Grade (normally the same for all group members): A - F(fail).
Details about the assignments themselves can be found under Written Exam, Computer Assignments and Project in the menu.

Grading

Course grades are assigned according to the following (CA = computer assignment grade, WE = written exam grade, PA = project assignment grade):
If CA = F, WE = F or PA = F, that part of the course has to be re-examined, until CA = P, WE >= E and PA >= E. The course grade is the average of WE and PA, according to the following:

           WE
PA      A   B   C   D   E
A       A   A   B   B   C
B       A   B   B   C   C
C       B   B   C   C   D
D       B   C   C   D   D
E       C   C   D   D   E
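
The grade table above can be reproduced with a few lines of code. The following is a minimal sketch, not an official grading tool: the function name course_grade is hypothetical, and the rule that half-steps round toward the better grade is an assumption inferred from the table entries (for example, WE = A and PA = B gives A).

```python
GRADES = "ABCDE"  # passing grades, ordered from best to worst


def course_grade(we: str, pa: str) -> str:
    """Average two passing grades; half-steps round toward the better grade.

    The rounding direction is an assumption read off the table,
    not a rule stated in the text.
    """
    if we not in GRADES or pa not in GRADES:
        # Fx and F are not averaged: that part has to be re-examined first.
        raise ValueError("WE and PA must both be passing grades (A-E)")
    # Index 0 is A and index 4 is E, so floor division of the summed
    # indices rounds the mean toward A, i.e. toward the better grade.
    return GRADES[(GRADES.index(we) + GRADES.index(pa)) // 2]


if __name__ == "__main__":
    # Print the full table: one row per PA grade, one column per WE grade.
    print("PA\\WE " + " ".join(GRADES))
    for pa in GRADES:
        print(pa + "     " + " ".join(course_grade(we, pa) for we in GRADES))
```

Since the rule is symmetric, the order of the two arguments does not matter. The computer assignment grade (CA) does not enter the average; it only has to be P.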

Copyright © Page responsible: Hedvig Kjellström <hedvig@nada.kth.se>
Updated 2012-06-21