Homework 3

This homework is due May 31 (note: updated due date!) at 9.00.

The previously "empty" questions 8 and 9 are now removed, they will not appear.

Please note that this is not group work. You may talk about the problems and educate each other, but don't solve another student's problem. The Code of Honour applies.

Email your solutions (text etc in a PDF, code as extra attached files) to Hossein Farahani with Lasse cc:ed. Please use the subject line "algbio11: homework 3".

For this assignment, I want you to read a chapter on HMMs in Bioinformatics by Anders Krogh.
1. What is a silent state? 30
2. What is the purpose of pseudocounts? 30
3. Suggest a HMM (structure and parameters) for modelling DNA repeats of length five, looking like "GACGA", but allowing for a 10% error rate (i.e., the repeats are not perfect because they have been the subject of evolution). The errors are nucleotide variations, not insertions or deletions. 40
Consider the homologous sequences in map20.fa.
1. What domain do they contain according to Pfam? Perform a "sequence search" one a handful of the sequences in the file. 20
2. You will now create your own HMM for the found domain.
  - Make your own alignment of the sequences in map20.fa (using Muscle, for example) and then identify and "cut out" the most conserved part of the alignment (Jalview is suitable for this).
  - Download and install the HMMER3 software (see hmmer.janelia.org) and use a program in that distribution to build an HMM. I am not giving you more details here, because I think that you should read the manual for HMMER3, but I can tell you that the input needs to be on "Stockholm format" and you can use my little Python script for doing the conversion here.
  - Finally, generate an "HMM-Logo" from the HMM-file you built, and report it in your answers. What are the differences with the HMM logo for the domain found in the previous question, available at Pfam? (It is available as by clicking a button in the left when looking at the details of the domain.) 80
1. What is the time complexity of calculating the Jukes-Cantor distances for n sequences of length m? 25
2. What is the space complexity of the Neighbour Joining algorithm for a distance matrix over n sequences? 25
Explain the purpose of bootstrapping in the context of phylogeny inferrence. 50
Implement Felsenstein's algorithm and decide which tree is the most likely for the dataset in felsen.fa using Kimura's two-parameter model. You can assume that βt=0.5 and α=2β.
Your program should take two filenames as input, one for the sequences and one for a hypothetical tree stored in the Newick format. The input trees are rooted and with branchlengths. I recommend using a third-party library for reading the trees. The output is simply the log-likelihood of the sequences in that tree.
Example usage:
```
prompt$ felsenstein seq.fa hypothesis1.tree
-1234.5
		
```
Note: the numbers are made up. Example data:
A sequence file:
```
>s1
ACGTACGT
>s2
ACCTACCT
>s3
ACGTAAGT
>s4
ACCTCCCT	    
```
and a corresponding tree in Newick format:
```
((s1, s2), (s3, s4));	    
```
Would this tree be better than
```
((s1, s3), (s2, s4));	    
```
for example? Impl: 100
Some researchers have looked at the length of homopolymers (repetitions of single nucleotides) as an indicator of evolutionary distance. Such regions are not conserved and the lengths will vary over relatively short timescale, making them an interesting phylogenetic marker.
Suppose that we want to make use of such markers in a case when we have reason to believe that the length changes with on or two nucleotides over a single edge in a tree. We get an interesting tree reconstruction problem. Assume that we have k markers in a genome that we can study with good accuracy. The input is n vectors of k integers. Ideally, we would now compute the tree which minimizes the number of implied length changes on the edges. In this assignment, we look for a smaller problem and simple take the tree as fixed. How miuch length change does it imply?

Adapt the algorithm by Fitch (page 176 in the book) for computing the minimum number of length changes given by k markers in a tree T. Notice the instead of having vectors of characers as input, you will work with vectors of integers. Alg: 100
The Andrea Rizzi assignment
Andrea raised the question in class whether we actually need aligned sequences to estimate distances. The answer is no, we do not need to align sequences before estimating pairwise distances. Both the alignment and the distance can be seen as an unknown parameter, and we can adapt an alignment algorithm to find the distance.

Chapter 4 of the course book shows how to use HMM:s as models for sequences, and in 4.1 they give a Viterbi algorithm for computing the most probable alignment in a model where
- δ is probability of an insertion/deletion,
- ε is probability of gap elongation,
- τ is a termination probability (not very interesting),
- q_a is the probability of observing residue a, and
- p_a,b is the probability of an aligned pair (a,b) in the alignment.
The pair probability can be factored into the probability of observing one residue, a, times the probability of evolving into another residue in a certain time span t, which we therefore can write as p_a,b=q_a×Pr(a → b | t). Under Kimura's 2-parameter model,

Pr(a → b | t)=u_t for transitions,
Pr(a → b | t)=s_t for transversions, and
Pr(a → b | t)=r_t for conserved positions,
using the book's notation (see page 197).

The algorithm on page 84 estimates the probability Pr(x',y' | t) for the "best" alignment x',y' of two sequences x and y. Since we do not know the evolutionary distance t of two sequences, it makes sense to consider it an unknown and estimate it using numerical methods such as Newton-Raphson (which needs a derivative, how do you solve that?) or Golden Section Search (which is a bit slow, but does not need a derivative).
1. Assignment: Write a program nudist (for "New Unaligned DISTance estimation") which takes as input a file of Fasta-formatted sequences, and outputs a distance matrix in Phylip format estimated by the algorithm outlined above, using Kimura's 2-parameter model. Report the output of your program on t2.fa and explain how you find t.
  Parameters: You get to find values for δ, ε, τ, and q_a yourself. Calibrate a value for β using the test data t1.fa (don't worry much about precision), but let α/β=2.
  Example usage:
```
prompt$ nudist t1.fa
    4
a           1.886017  1.779531  0.952756  0.000000
b           1.617019  1.749469  0.000000  0.952756
c           0.516764  0.000000  1.749469  1.779531
d           0.000000  0.516764  1.617019  1.886017
		
```
  The distances above is what Kimura's 2-parameter estimation gives you on the "true" alignment. Impl: 100
2. How do you compute Pr(x,y | t), i.e., the probability that x and y are homologous at a distance t? That is, without choosing a best alignment using Viterbi, but summing over all possible alignments. Alg: 100