Lab 1: Spelling

If you report labs by Februari 11th, you get a bonus point on the exam. On the lab session on februari 4th, you have the opportunity to present your solutions to the theory problems below the solution to theoretical information below. A correct solution of these gives a bonus point on the exam. A late report gives no bonus.
It is optional to solve the theory questions , but to be able to do the lab you should have made them.

In the mapp /info/algokomp13/labb1 is a Java program that solves the problems described below. Your task is to speed up the program so that it runs about 10,000 times faster. Accuracy and efficiency is tested by sending your solution to Kattis, see instruktioner. To complete the lab you should be approved by Kattis and a lab assistent. Start by logging into Kattis and enroll algokomp13 in the menu optionKurser in the top menu.

Problem

The Editing distance between two words is the minimum number of letter operations required to transform one word to the other. There are three permitted letter operations:
  1. remove one of the letters in the word
  2. add a letter anywhere in the word
  3. replace a letter in the word to another letter
For example, the word alroitm can be transformed to algoritm by changing the letter r to g (rule 3) and inserting the letter r after the letter o (rule 2). The chain

alroitm -> algoitm -> algoritm

shows that the editing distance between alroitm and algoritm is at most 2. Since it is impossible to transform alroitm to algoritm with a single operation, the editing distance is exactly 2.

A common way to develop spelling suggestions for a misspelled word is simply to return the words in the dictionary that has at least edit distance to the misspelled word. The program is given a glossary and a number of misspelled words as input and is supposed to compute spelling suggestions in this way.

Specification

Input data consists of two parts. The first part is the glossary, consisting of a number of words in utf-8 alphabet, one word per line. This part ends with a line that contains only one '#' character. The second part is a number of misspelled words to be corrected, one word per line. The misspelled words are not included in the glossary. Each word in the input consists only of lowercase letters in Swedish alphabet (a-z, å, ä, ö), no spaces, punctuation or numbers.

The program shall for each misspelled word print a line consisting of the misspelled word, followed by the minimal editing distance in parentheses followed by a list of all the words in the dictionary that has minimal editing distance to the misspelled word. The list should be in alphabetical order and each word in the list should be preceded by a space. The dictionary has more than half a million words and the number of misspelled words in the input is more than 100.

Example

A dictionary file is in/info/algokomp13/labb1/ordlista. You can test drive your program by entering a few misspelled words (e.g. labd and dabbbhud) on each line in a file (eg testord.txt) and then run

spel01$ cat /info/algokomp13/labb1/ordlista testord.txt | java Main
labd (1) labb lagd land
dabbbhud (4) anbud dabba nabbad

Assignment

The given Java program do solves the problem, but it takes hours to get the answer. You should optimize the program so it finds the answer within the time limit Kattis provides.

Good test cases to test your program with can be found in /info/algokomp13/labb1/testfall/

The theory problems give suggestions on how to improve the program. Your optimized programs should have the same form of input and output as the given program and it must still be in Java.

Kattis recognizes the problem as adkspelling

Theory problems

Try to understand how the given program works and answer the following questions.
  1. Formulate the recursion (partDist in the program) as compact as possible with mathematical notation.
  2. Calculate partDist("labd", "blad", x, y) for all x and y between 0 and 4 and enter the results in a matrix M. What is M?
  3. What does the method partDist(w1, w2, x, y) compute?
  4. Show that the time complexity of Distance(w1, w2) is Omega(2max(n,m)) In the worst case, where n is the number of letters in w1 and m is the number of letters in w2.
  5. See how you can track the editing operations made ??in the shortest edit sequence from "labd" to "blades" by looking at the matrix M.
  6. Show with pseudocode how the recursion can be calculated by using dynamic programming, ie how a matrix M can be created.
  7. Analyze the time complexity for determining the editing distance between an n-letter word and an m-letter word with dynamic programming.
  8. Calculate the dynprog matrix ethe distance between "labs" and "blad".
  9. What part of the matrices for "labd" - "blad" and "labs" - "blad" are different?
  10. Generally, what parts of the matrices for Y-X and Z-X are different when the words Y and Z have the same first p letters?