main page
Reconstruction of Ancestral Genomic Sequences
Isaac Elias and Tamir Tuller
Abstract
A challenging task in computational biology is the reconstruction
of genomic sequences of extinct ancestors, given the phylogenetic
tree and the sequences at the leafs. This task is best solved by
calculating the most likely estimate of the ancestral sequences,
along with the most likely edge lengths. We deal with this problem
and also the variant in which the phylogenetic tree in addition to
the ancestral sequences need to be estimated. The latter problem
is known to be NP-hard, while the computational complexity of the
former is unknown. Currently, all algorithms for solving these
problems are heuristics without performance guarantees. The
biological importance of these problems calls for developing
better algorithms with guarantees of finding either optimal or
approximate solutions.
We develop approximation, fixed-parameter tractable (FPT), and fast
heuristic algorithms for two variants of the problem: when the
phylogenetic tree is known and when it is unknown. The
approximation algorithm guarantees a solution with a
log-likelihood ratio of 2 relative to the optimal solution. The
FPT has a running time which is polynomial in the length of the
sequences and exponential in the number of taxa. This makes it
useful for calculating the optimal solution for small trees.
Moreover, we combine the approximation algorithm and the FPT into
an algorithm with arbitrary good approximation guarantee (PTAS).
We tested our algorithms on both synthetic and biological data. In
particular, we used the FPT for computing the most likely
ancestral mitochondrial genomes of hominidae (the great
apes), thereby answering an interesting biological question.
Moreover, we show how the approximation algorithms find good
solutions for reconstructing the ancestral genomes for a set of
lentiviruses (relatives of HIV).
Download Supplementary Material
-
Full version of the paper
(PDF)
Dowload data sets
-
Hominoid Mitochondria - complete sequences
Hominoid Mitochondria - incomplete sequences from 1982
Lentiviruses (HIV)
Simulated Data
-
-
Original data
ClustalW
Phylip (i.e. gaps removed)
Distance matrix(using JC on the phylip file)
Accepted phylogeny tree drawing
-
FPT AML - THE MOST LIKELY ANCESTRAL SEQUENCES
Note: The FPT was run on the two state characters which were all but
18 of the total 15851 characters. Thereafter the 18 multivariate
symbols were estimated by using the edge-lengths given by the
FPT.
The resulting sequences are
Original FPT ancestral sequences
and had a log-likelihood = -28367.4.
This result could then be improved (in the four state model) by local
improvement to log-likelihood = -28097.8. These sequences FPT sequences improved Tree for FPT sequences
improved tree drawing
-
Leaflifting AML
log-Likelihood = -29018.8 (Before improvement -31907.7)
Ancestral sequences
AML tree tree drawing
-
Parsimony AML
log-Likelihood = -28204.4 (Before improvement -28301.4)
Ancestral sequences
AML tree tree drawing
-
-
This data set contains the sequences that Barry and Hartigan
ran their algorithms on in 1982. The sequences are from the paper by
XXXX. The total log-likelihood of Barry and Hartigan's method on this
dataset was -2597.5 (which is less than all our methods). They used a
very strange model of sequence evolution in which neighboring sites
are dependent. However, they did also use an independent model but the
result in for this model was worse.
-
Original data
Phylip file containing alignment with gaps
Phylip without gaps
-
FPT AML - THE MOST LIKELY ANCESTRAL SEQUENCES
log-Likelihood = -1689.79 (Before improvement -1704.28)
Original FPT 2-state sequences (i.e. before improvement)
Ancestral sequences in 4-state Jukes-Cantor (i.e. after improvement)
AML tree tree drawing
-
Leaflifting AML
log-Likelihood = -1729.82 (Before improvement -1882.98)
Ancestral sequences
AML tree tree drawing
-
Parsimony AML
log-Likelihood = -1699.33 (Before improvement -1704.98)
Ancestral sequences
AML tree tree drawing
-
-
-
Original data
FastaVirus-Lentivirus
ClustalW
Phylip (i.e. gaps removed)
Distance matrix(using K2P on the phylip file)
JC Distance matrix(using JC on the phylip file)
NJ tree tree drawing
ACS tree tree drawing
-
Big AML
log-Likelihood = -64756.4
Ancestral sequences
AML tree tree drawing
-
Leaf lifting AML - NJ tree
log-Likelihood = -64850.2
Ancestral sequences
small AML tree tree drawing
-
Leaf lifting AML - Big AML tree
log-Likelihood = -64756.4
Ancestral sequences
AML tree tree drawing
-
Leaf lifting AML - ACS tree
log-Likelihood = -64899.8
Ancestral sequences
AML tree tree drawing
-
Parsimony AML - NJ tree
log-Likelihood = -66105.5 (before impr. -67582.6)
Ancestral sequences
AML tree tree drawing
-
Parsimony AML - Big AML tree
log-Likelihood = -66877.7 (before impr. -68838.8 )
Ancestral sequences
AML tree tree drawing
-
Parsimony AML - ACS tree
log-Likelihood = -67081.9 (befoer impr. -68006.5)
Ancestral sequences
AML tree tree drawing
-
-
This data was generated as described in the paper. The results are
available in the full version of the paper.
Gziped tar archive
|