main page

    Reconstruction of Ancestral Genomic Sequences

    Isaac Elias and Tamir Tuller

    Abstract

    A challenging task in computational biology is the reconstruction of genomic sequences of extinct ancestors, given the phylogenetic tree and the sequences at the leafs. This task is best solved by calculating the most likely estimate of the ancestral sequences, along with the most likely edge lengths. We deal with this problem and also the variant in which the phylogenetic tree in addition to the ancestral sequences need to be estimated. The latter problem is known to be NP-hard, while the computational complexity of the former is unknown. Currently, all algorithms for solving these problems are heuristics without performance guarantees. The biological importance of these problems calls for developing better algorithms with guarantees of finding either optimal or approximate solutions.

    We develop approximation, fixed-parameter tractable (FPT), and fast heuristic algorithms for two variants of the problem: when the phylogenetic tree is known and when it is unknown. The approximation algorithm guarantees a solution with a log-likelihood ratio of 2 relative to the optimal solution. The FPT has a running time which is polynomial in the length of the sequences and exponential in the number of taxa. This makes it useful for calculating the optimal solution for small trees. Moreover, we combine the approximation algorithm and the FPT into an algorithm with arbitrary good approximation guarantee (PTAS).

    We tested our algorithms on both synthetic and biological data. In particular, we used the FPT for computing the most likely ancestral mitochondrial genomes of hominidae (the great apes), thereby answering an interesting biological question. Moreover, we show how the approximation algorithms find good solutions for reconstructing the ancestral genomes for a set of lentiviruses (relatives of HIV).

    Download Supplementary Material

    Full version of the paper
    (PDF)

    Dowload data sets

    Hominoid Mitochondria - complete sequences
    Hominoid Mitochondria - incomplete sequences from 1982
    Lentiviruses (HIV)
    Simulated Data

    Dataset 1: Hominoid Mitochondira - Complete

    Original data
    ClustalW
    Phylip (i.e. gaps removed)
    Distance matrix(using JC on the phylip file)
    Accepted phylogeny tree drawing

    FPT AML - THE MOST LIKELY ANCESTRAL SEQUENCES
    Note: The FPT was run on the two state characters which were all but 18 of the total 15851 characters. Thereafter the 18 multivariate symbols were estimated by using the edge-lengths given by the FPT.
    The resulting sequences are
    Original FPT ancestral sequences
    and had a log-likelihood = -28367.4.
    This result could then be improved (in the four state model) by local improvement to log-likelihood = -28097.8. These sequences
    FPT sequences improved
    Tree for FPT sequences improved tree drawing

    Leaflifting AML
    log-Likelihood = -29018.8 (Before improvement -31907.7)
    Ancestral sequences
    AML tree tree drawing

    Parsimony AML
    log-Likelihood = -28204.4 (Before improvement -28301.4)
    Ancestral sequences
    AML tree tree drawing

    Dataset 2: Hominoid Mitochondira - Incomplete sequences from 1982

    This data set contains the sequences that Barry and Hartigan ran their algorithms on in 1982. The sequences are from the paper by XXXX. The total log-likelihood of Barry and Hartigan's method on this dataset was -2597.5 (which is less than all our methods). They used a very strange model of sequence evolution in which neighboring sites are dependent. However, they did also use an independent model but the result in for this model was worse.

    Original data
    Phylip file containing alignment with gaps
    Phylip without gaps

    FPT AML - THE MOST LIKELY ANCESTRAL SEQUENCES
    log-Likelihood = -1689.79 (Before improvement -1704.28)
    Original FPT 2-state sequences (i.e. before improvement)
    Ancestral sequences in 4-state Jukes-Cantor (i.e. after improvement)
    AML tree tree drawing

    Leaflifting AML
    log-Likelihood = -1729.82 (Before improvement -1882.98)
    Ancestral sequences
    AML tree tree drawing

    Parsimony AML
    log-Likelihood = -1699.33 (Before improvement -1704.98)
    Ancestral sequences
    AML tree tree drawing

    Dataset 2: HIV viruses

    Original data
    FastaVirus-Lentivirus
    ClustalW
    Phylip (i.e. gaps removed)
    Distance matrix(using K2P on the phylip file)
    JC Distance matrix(using JC on the phylip file)
    NJ tree tree drawing
    ACS tree tree drawing

    Big AML
    log-Likelihood = -64756.4
    Ancestral sequences
    AML tree tree drawing

    Leaf lifting AML - NJ tree
    log-Likelihood = -64850.2
    Ancestral sequences
    small AML tree tree drawing

    Leaf lifting AML - Big AML tree
    log-Likelihood = -64756.4
    Ancestral sequences
    AML tree tree drawing

    Leaf lifting AML - ACS tree
    log-Likelihood = -64899.8
    Ancestral sequences
    AML tree tree drawing

    Parsimony AML - NJ tree
    log-Likelihood = -66105.5 (before impr. -67582.6)
    Ancestral sequences
    AML tree tree drawing

    Parsimony AML - Big AML tree
    log-Likelihood = -66877.7 (before impr. -68838.8 )
    Ancestral sequences
    AML tree tree drawing

    Parsimony AML - ACS tree
    log-Likelihood = -67081.9 (befoer impr. -68006.5)
    Ancestral sequences
    AML tree tree drawing

    Simulated Data

    This data was generated as described in the paper. The results are available in the full version of the paper. Gziped tar archive