bild
Skolan för
datavetenskap
och kommunikation
KTH / CSC / Kurser / DD2476 / ir13

Search Engines and Information Retrieval Systems, ir13

A course in Computer Science focusing on basic theory, models, and methods for information retrieval.

If you took DD2476 ir12, and have lab assignments and/or the project left, you are very welcome to finish these during the spring of 2013. Please contact Hedvig Kjellström (see People in the meny) to let us know that you are following the course.

News

June 4: Your thoughts and ideas are valuable in the further development of this course. Therefore, we would like you to fill out the following course evaluation form:

May 30: Thanks again for your effort with the projects - it was loads of fun to see your poster presentations!
Johan and Hedvig have now graded all projects according to the criteria on the project page (see the menu to the left). All projects got a grade of C or higher. For each project group, an email with an explanation of the grading (i.e., what criteria were not fulfilled, and why) has been sent to the group contact person. All grades have been reported in Rapp.

May 19: Sorry for the delay: On May 23, there will be an opportunity to present Assignment 1, 2 or 3, at 13 in Spelhallen. Press this button to book a slot:
If you want to present several assignments, book several slots after each other. The grades on Assignment 1 have gone down four steps, the grades on Assignment 2, three steps, and the grades on Assignment 3, two steps. See Computer Assignments in the menu to the left for more details about this.
The whole examination will take only 15 minutes. Therefore, make sure that you are logged in, (to either your own computer or to one of the Linux machines) and prepared to start the presentation on time!
We will add slots when the the current ones are filled. If you realize that you will not present at your booked time, unbook it so that someone else can take it. If none of these times suit you, contact Hedvig asap to present earlier in the week.
If you are not ready by this week, do not worry, there will be more possibilities to present; in the end of June and the end of August.

May 2: Tomorrow, May 3 at 10.15 in E3, there is a guest lecture by Simon Stenström, Findwise. It is in the schedule on the homepage, but maybe not in TimeEdit? You are all very welcome!

May 2: The detailed schedule for the poster presentation is now on the homepage, please have a look at Project in the menu to the left. To find your group number, look in the file /info/DD2476/ir13/project/project_groups.txt on the CSC UNIX system.

April 25: It is not fun to have to be this explicit, but it is to be fair to the students that make an effort and do their part of the work. To ensure that all students participate actively in the projects, we have decided the following:
1) The written project report should contain a written account (just after the abstract, for example) of the contributions of each group member. A group member that did not provide a honest contribution to the project work will get the grade F, and will have to redo the project next spring.
2) The poster session on May 17 will start at 9.00, not 8.00 as it says in the schedule. All group members will have to be there ON TIME at 9.00 (take an earlier bus if you are often late), and all group members should be prepared to present the poster. A project member that fails to show up on time, or that cannot account for the group's results, will risk being graded F and redo the project next spring. If you cannot be there for the presentation due to illness etc., please contact Hedvig beforehand, and we can arrange for a replacement excercise. More information on the detailed poster session schedule will come in due time.
All the best of luck with the projects! If you have any questions, please contact the project proposer or Hedvig.

April 3: On April 18, there will be an opportunity to present Assignment 1, 2 or 3, at 13 in Spelhallen. Press this button to book a slot:
If you want to present several assignments, book several slots after each other. The grades on Assignment 1 have gone down three steps, the grades on Assignment 2, two steps, and the grades on Assignment 3, one step. See Computer Assignments in the menu to the left for more details about this.
The whole examination will take only 15 minutes. Therefore, make sure that you are logged in, (to either your own computer or to one of the Linux machines) and prepared to start the presentation on time!
We will add slots when the the current ones are filled. If you realize that you will not present at your booked time, unbook it so that someone else can take it.

March 22: One more project has been added to the list, #6 from Spotify. Please adjust your project ranking lists if you want to include this project, and send the new ranking list to me before midnight today Friday March 22. (They can only host one group, so we might have to turn down many requests. If you select this project, please provide an account of the Python experience within the group - this will be one evaluation criteria.)

March 19: We have now decided upon the deadlines for assignment grades in April and May. There will be examination sessions similar to the one this afternoon. You are welcome with your assignment presentation then - or this afternoon!

March 18: The projects are now posted on the homepage, see Project in the menu to the left. The contact person of each group should send Hedvig an email with a ranked list of three project choices, before Friday March 22, at 12 noon. This list will then be compiled, and the project of each group announced in /info/DD2476/ir13/project/project_groups.txt on the CSC UNIX system.

March 8: Next Tuesday, it is time for the oral presentation of Assignment 3, at 15 in Spelhallen. Press this button to book a slot:
You can also present Assignment 2 in some of the time slots. If you want to present both Assignment 2 and 3, book two slots. The whole examination will take only 15 minutes (80 students, four teachers). Therefore, make sure that you are logged in, (to either your own computer or to one of the Linux machines) and prepared to start the presentation on time!
We will add slots when the the current ones are filled. If you realize that you will not present at your booked time, unbook it so that someone else can take it.

March 8: The project groups have now been formed. (The grouping is done so that all group members have approximately the same grade on Assignment 1 and 2 - heightening the possibility that you are on the same ambition level in this course. We have also tried to create as good a mix of swedish/international and male/female students as possible.) You can find a list of all groups in the file /info/DD2476/ir13/project/project_groups.txt on the CSC UNIX system. Please contact the other members of your group asap, and decide upon one person who will be a contact. This person should then send Hedvig an email. The project proposals will come out next week, and then it is good that you have talked amongst yourselves.

March 6: For Assignment 3, it is even more important than for Assignment 2 to understand the vector representation of documents and queries. Key concepts for Task 3.1 are centroid and vector addition. Read the book! (See the schedule for chapters corresponding to Lectures 4 and 6.) The book is freely available online, so money is no excuse not to read it...

February 25: The description of length normalization in Task 3.1 of Assignment 3 has been updated. Now it should be a bit clearer. You do NOT have to get the exact same ranking as we did! It depends on the tf/tf-idf representation, and of the document/query length representation.

February 22: Assignment 3 is now published, see Computer Assignments in the menu to the left.

February 20: Assignment 3 is expected on Friday, after Hedvig's grant application deadline. It will comprise topics from Lectures 6 and 7, and build on the code that you developed in Assignments 1 and 2. Therefore, a good preparation for Assignment 3 is to tend to all comments about the Assignment 2 code that you got at the presentation yesterday!

February 17: Clarification concerning Assignment 2:
Many students are worried that their solution will not be accepted because it is too different from the example in Task 2.1, 2.2, 2.3. Do not worry! It clearly says in the text that your solutions should be SIMILAR to the examples. What is important is that you understand how your solution works, and are able to discuss the effect of changing the length measurement or the function of tf in the tf-idf weight. Look at the documents, and try to understand why they got the rank they got in relation to the query.
To give you a guideline if you are still worried, we will require the following: In Task 2.2, the document 726 should be highly ranked in response to the query "november eller december" since it is about dates. Otherwise, your list should contain at least 50% of the documents in the shown list. In Task 2.3, the list should be very similar, with permutations of not more than 3 ranks (i.e., the doc at rank 7 in our list should be in ranks 4-10 in your list).

February 12: On Tuesday, it is time for the oral presentation of Assignment 2, at 15 in Brun. Press this button to book a slot:
The whole examination will take only 10 minutes (80 students, four teachers). Therefore, make sure that you are logged in, (to either your own computer or to one of the Linux machines) and prepared to start the presentation on time!
We will add slots when the the current ones are filled. If you realize that you will not present at your booked time, unbook it so that someone else can take it.

February 10: Clarification concerning Assignment 2:
The cosine similarity measure take by definition values between 0 and 1. Depending on how the length of documents is represented, and if the length of the query is taken into account, the similarity score can be larger than 1. However, it can never be negative, as neither the individual tf-idf scores of terms, nor the lenghts of documents, can be negative.

February 7: Error in Assignment 2 now corrected. The power iteration page rank top 50 list was wrong before, due to a bug. This has been changed in the assignment description, Task 2.3.

February 6: If you have not yet passed the oral examination of Assignment 1, please send an email to Hedvig and Johan to book a time for presentation.

February 4: Error in Assignment 2 now corrected. The union query "november eller december" should return 474 documents, not 472. This has been changed in the assignment description, Tasks 2.2 and 2.5.

January 25: Three clarifications concerning Assignment 1:
SimpleTokenizer assumes that you use UTF-8 encoding.
In Task 1.2, intersection algorithms with quadratic complexity will not pass. In other words, the intersection has to use the fact that the two postings lists are sorted. A safe approach is to use the proposed algorithm.
In Task 1.5, your merging code should be able to handle non-disjunct corpora, i.e., that the same document might show up in two corpora. Only one copy of the document should be kept in the merged index.

January 23: On Tuesday, it is time for the oral presentation of Assignment 1, at 15 in Spelhallen. Press this button to book a slot:
The whole examination will take only 10 minutes (80 students, four teachers). Therefore, make sure that you are logged in, (to either your own computer or to one of the Linux machines) and prepared to start the presentation on time!
We will add slots when the the current ones are filled. If you realize that you will not present at your booked time, unbook it so that someone else can take it.

January 22: Assignment 1 is updated with changes in the grading criteria (a slight slack), as well as clarifications in Tasks 1.4 and 1.5. This has also been mentioned at today's lecture.

January 15: Please register yourself in the CSC result registration system Rapp. Do this as soon as possible so that we know how many are actively following the course.

January 3: TimeEdit is now updated with the correct rooms.

December 20: Due to the unexpectedly large number of participants, we have made some changes in the schedule. Lecture 1 has been moved two hours, to January 15, 15-17, and all lectures are held in bigger rooms. The changes can be seen in the Schedule in the menu to the left. Note that TimeEdit has not been updated yet - the schedule at these pages is the correct one!

December 20: Computer assignments 1 and 2 are now published, see Computer Assignments in the menu to the left. All information about deadlines, grade requirements, etc., can be found in the assignments, and on this homepage (bottom of main page, Schedule, and Computer Assignments). Computer assignment 3 is under development and will be published in January.

November 22: The homepages are now up and running. There are two changes in this years course: The written exam has been replaced by a more in-depth lab course, and the labs are examined individually instead of in pairs.

Learning Outcomes

After completing the course you will be able to:

  • explain the concepts of indexing, vocabulary, normalization and dictionary in Information Retrieval,
  • give an account of different text similarity measures, and select a similarity measure suitable for the problem at hand,
  • define a boolean model and a vector space model, and explain the differences between them,
  • implement a method for ranked retrieval of a very large number of documents with hyperlinks between them,
  • evaluate information retrieval algorithms, and give an account of the difficulties of evaluation,
  • give an account of the structure of a Web search engine.

Content

Basic and advanced techniques for information systems: information extraction; efficient text indexing; indexing of non-text data; Boolean and vector space retrieval models; evaluation and interface issues; structure of Web search engines.

Literature

Text Book

  • C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
The book can be ordered from your favorite internet bookstore, and found using ISBN 0521865719. Virtually all material from the book is also available online at nlp.stanford.edu/IR-book/information-retrieval-book.html.

Articles

  • K. Avrachenkov, N. Litvak, D. Nemirovsky and N. Osipova, Monte Carlo methods in PageRank computation: When one iteration is sufficient, SIAM Journal on Numerical Analysis 45(2), 2007.
  • Article on image retrieval, TO BE ANNOUNCED.
  • M. Sahlgren, An Introduction to Random Indexing, 2005, www.sics.se/~mange/papers/RI_intro.pdf.

Other Resources

To get an idea of state-of-the-art in Information Retrieval research and development, take a look at the program of the annual conference ACM SIGIR.

Examination

Assignments

The examination in the course is performed through:
  • Three computer assignments (6 credits). The computer assignments are performed individually, and presented orally by the computer. Grade: A - F(fail).
  • A project assignment (3 credits). The projects are performed in groups of 4-5 students, and presented with a short written report, as well as an oral poster presentation. Grade (normally the same for all group members): A - F(fail).
Details about the assignments themselves can be found under Computer Assignments and Project in the menu.

Grading

Course grades are assigned according to the following (CA = computer assignment grade, PA = project assignment grade):
The course grade is the weighted average of CA and PA, according to the following:

CA
A
B
C
D
E
PA
A
A
B
B
C
D
B
A
B
C
C
D
C
B
B
C
D
D
D
B
C
C
D
E
E
B
C
D
D
E

Copyright © Sidansvarig: Hedvig Kjellström <hedvig@nada.kth.se>
Uppdaterad 2013-06-04