Homework 3
This homework is due May 31 (note: updated due date!) at 9.00.
The previously "empty" questions 8 and 9 are now removed, they will not appear.
Please note that this is not group work. You may talk about the
problems and educate each other, but don't solve another student's problem. The Code of Honour applies.
Email your solutions (text etc in a PDF, code as
extra attached files) to Hossein Farahani with Lasse cc:ed. Please use
the subject line "algbio11: homework 3".
- For this assignment, I want you to read a chapter on HMMs in Bioinformatics by Anders Krogh.
- What is a silent state?
30
-
What is the purpose of pseudocounts?
30
-
Suggest a HMM (structure and parameters) for modelling DNA repeats of length five, looking like "GACGA", but allowing for a 10% error rate (i.e., the repeats are not perfect because they have been the subject of evolution). The errors are nucleotide variations, not insertions or deletions.
40
-
Consider the homologous sequences in map20.fa.
- What domain do they contain according to Pfam? Perform a "sequence
search" one a handful of the sequences in the file.
20
- You will now create your own HMM for the found domain.
-
Make your own alignment of the sequences in map20.fa
(using Muscle, for example) and then identify and
"cut out" the most conserved part of the alignment (Jalview is suitable for this).
- Download and install the HMMER3 software (see hmmer.janelia.org) and
use a program in that distribution to build an HMM. I
am not giving you more details here, because I think
that you should read the manual for HMMER3, but I can
tell you that the input needs to be on "Stockholm
format" and you can use my little Python script for
doing the conversion here.
- Finally, generate an "HMM-Logo"
from the HMM-file you built, and report it in your
answers. What are the differences with the HMM logo
for the domain found in the previous question,
available at Pfam? (It is available as by clicking a button in the left when looking at the details of the domain.)
80
-
-
What is the time complexity of calculating the Jukes-Cantor
distances for n sequences of length m?
25
-
What is the space complexity of the Neighbour Joining
algorithm for a distance matrix over n sequences?
25
- Explain the purpose of bootstrapping in the context of phylogeny inferrence.
50
-
Implement Felsenstein's algorithm and decide which tree is
the most likely for the dataset in felsen.fa using Kimura's two-parameter
model. You can assume that βt=0.5 and α=2β.
Your program should take two filenames as input, one for the sequences and one for a hypothetical tree stored in the Newick format. The input trees are rooted and with branchlengths. I recommend using a third-party library for reading the trees. The output is simply the log-likelihood of the sequences in that tree.
Example usage:
prompt$ felsenstein seq.fa hypothesis1.tree
-1234.5
Note: the numbers are made up.
Example data:
A sequence file:
>s1
ACGTACGT
>s2
ACCTACCT
>s3
ACGTAAGT
>s4
ACCTCCCT
and a corresponding tree in Newick format:
((s1, s2), (s3, s4));
Would this tree be better than
((s1, s3), (s2, s4));
for example?
Impl: 100
-
Some researchers have looked at the length of homopolymers
(repetitions of single nucleotides) as an indicator of
evolutionary distance. Such regions are not conserved and
the lengths will vary over relatively short timescale,
making them an interesting phylogenetic marker.
Suppose that we want to make use of such markers in a case
when we have reason to believe that the length changes
with on or two nucleotides over a single edge in a
tree. We get an interesting tree reconstruction
problem. Assume that we have k markers in a
genome that we can study with good accuracy. The input is
n vectors of k integers. Ideally, we
would now compute the tree which minimizes the number of
implied length changes on the edges. In this assignment,
we look for a smaller problem and simple take the tree as fixed. How miuch length change does it imply?
Adapt the algorithm by Fitch (page 176 in the book) for computing the minimum number of length changes given by k markers in a tree T. Notice the instead of having vectors of characers as input, you will work with vectors of integers.
Alg: 100
-
The Andrea Rizzi assignment
Andrea raised the question in class whether we actually need aligned sequences to estimate distances. The answer is no,
we do not need to align sequences before estimating pairwise
distances. Both the alignment and the distance can be seen as an
unknown parameter, and we can adapt an alignment algorithm to find
the distance.
Chapter 4 of the course book shows how to use HMM:s as
models for sequences, and in 4.1 they give a Viterbi
algorithm for computing the most probable
alignment in a model where
- δ is probability of an insertion/deletion,
- ε is probability of gap elongation,
- τ is a termination probability (not very interesting),
- qa is the probability of observing residue a, and
- pa,b is the probability of an aligned pair (a,b) in the alignment.
The pair probability can be factored into the
probability of observing one residue, a,
times the probability of evolving into another residue
in a certain time span t, which we therefore
can write as
pa,b=qa×Pr(a → b | t).
Under Kimura's 2-parameter model,
Pr(a → b | t)=ut for transitions,
Pr(a → b | t)=st for transversions, and
Pr(a → b | t)=rt for conserved positions,
using the book's notation (see page 197).
The algorithm on page 84 estimates the probability
Pr(x',y' | t) for the "best" alignment
x',y' of two sequences x and
y. Since we do not know the evolutionary
distance t of two sequences, it makes sense
to consider it an unknown and estimate it using numerical methods such as Newton-Raphson (which needs a derivative, how do you solve that?) or Golden Section Search (which is a bit slow, but does not need a derivative).
-
Assignment: Write a program
nudist
(for "New Unaligned DISTance estimation") which takes
as input a file of Fasta-formatted sequences, and
outputs a distance matrix in Phylip format estimated
by the algorithm outlined above, using Kimura's
2-parameter model.
Report the output of your program on t2.fa and explain how you find t.
Parameters:
You get to find values for δ, ε, τ, and qa yourself. Calibrate a value for β using the test data t1.fa (don't worry much about precision), but let α/β=2.
Example usage:
prompt$ nudist t1.fa
4
a 1.886017 1.779531 0.952756 0.000000
b 1.617019 1.749469 0.000000 0.952756
c 0.516764 0.000000 1.749469 1.779531
d 0.000000 0.516764 1.617019 1.886017
The distances above is what Kimura's 2-parameter estimation gives you
on the "true" alignment.
Impl: 100
-
How do you compute Pr(x,y | t), i.e., the probability that x and y are homologous at a distance t? That is, without choosing a best alignment using Viterbi, but summing over all possible alignments.
Alg: 100
|