BootFam: Defining Gene/Protein Families

Description

This program takes as input a set of homologous protein sequences, a species tree, and a sequence-to-species mapping. The output is a set of gene families relative to the species tree.

The method is described in a forthcoming paper.

Availability

BootFam is distributed under the GNU General Public License, and is available for immediate download. The software is written in Perl, except for the helper programs.

Please acknowledge the use of BootFam in your research, preferably by citing this web site.

Dependencies

BootFam relies on the following applications/libraries:

seqboot from Phylip.
protdist from Phylip. This can, however, be changed to 'lapd' using option '-d'. In the future, this option will be fully adjustable to your own software preferences. I have just been lazy.
neighbor from Phylip.
notung from Dannie Durand and coworkers.
kalign, by Lassmann and Sonnhammer. Also consider your own alignments and the option '-a'!
My small helper programs reconcile and chainsaw. The links here point to binaries compiled for GNU/Linux 2.6.9. Please write to me about binaries for other systems, since these two small programs depend on a large codebase we are not ready to distribute.
Bio::SeqIO in BioPerl.
Bio::AlignIO in BioPerl.
Graph::UnionFind from CPAN.
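
Before a first run, it can help to confirm that the external programs are reachable. The following is a minimal sketch in Python, not part of BootFam. It assumes that each program is installed under the name used in the list above and is found through PATH. notung and the Perl modules are left out, since how they are installed and launched depends on your setup.

```python
import shutil

# Executable names are assumptions taken from the dependency list above;
# your installation may use different names (for example, a wrapper script).
REQUIRED_PROGRAMS = ["seqboot", "protdist", "neighbor", "kalign",
                     "reconcile", "chainsaw"]


def find_missing(programs):
    """Return the programs that cannot be found on the current PATH."""
    return [name for name in programs if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_PROGRAMS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed programs were found on PATH.")
```

If you use 'lapd' instead of protdist (option '-d'), or your own alignments (option '-a'), adjust the list accordingly.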

Usage

Usage: bootfam [<options>] <fasta file> <speciestree> <seq-to-species>

The FASTA file contains the protein sequences. The species tree is in rooted Newick format. The seq-to-species file consists of lines with two space-separated strings: a sequence accession from the FASTA file and a leaf in the species tree.
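
The three input files have to agree with each other, and a quick consistency check can catch mismatches early. The sketch below is written in Python and is not part of BootFam. It rests on a few assumptions: the accession is the first word of each FASTA header, tree labels are simple and unquoted, and all function names are made up for illustration.

```python
import re


def read_seq_to_species(lines):
    """Parse 'accession species' lines into a dict; blank lines are skipped."""
    mapping = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 fields, got {len(fields)}")
        accession, species = fields
        mapping[accession] = species
    return mapping


def newick_leaves(newick):
    """Return the leaf labels of a Newick string with unquoted labels."""
    # A leaf label directly follows '(' or ','; internal labels follow ')'.
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def fasta_accessions(lines):
    """Return the first word of every FASTA header line (an assumption)."""
    accessions = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                accessions.append(words[0])
    return accessions


def check_inputs(fasta_lines, newick, mapping_lines):
    """Return (sequences without a species, species missing from the tree)."""
    mapping = read_seq_to_species(mapping_lines)
    leaves = newick_leaves(newick)
    unmapped = [a for a in fasta_accessions(fasta_lines) if a not in mapping]
    unknown = sorted({s for s in mapping.values() if s not in leaves})
    return unmapped, unknown


if __name__ == "__main__":
    fasta = [">seq1 example protein", "MKVL", ">seq2", "MKIL"]
    tree = "((human:0.1,mouse:0.1):0.2,fly:0.3);"
    mapping = ["seq1 human", "seq3 worm"]
    print(check_inputs(fasta, tree, mapping))
```

In this toy example, seq2 has no species assigned and "worm" is not a leaf in the tree, so both are reported.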

Options:

    -h, -u
        This help text.

    -a  This option needs a makeover! Right now it means: "Don't look for
        kalign". In the future, we will use it to allow opening alignments
        in different formats. Or to say: "Do align!"

    -c <filename>
        Compute consensus tree and put it in <filename>. This consensus tree
        is rooted.

    -d <progname>
        Select the program used to estimate protein distances: either
        'lapd' or 'protdist'.

    -D <filename>
        This option has two meanings: either create or read from an
        intermediate result file. Computing bootstrapped distances is
        expensive, so avoid repeating that step when possible. Use this
        option if you expect to experiment with parameters for the steps
        after bootstrapping.

        If the file <filename> does not exist, then bootfam creates this
        file, proceeds with computing distances for all bootstrap
        replicates, and saves the distances in the named file. If the file
        exists, bootfam tries to read a set of distances from that file and
        continues to the next step, hence avoiding the costly distance step.

        Note that you cannot, yet, bypass bootfam's distance estimation and
        let bootfam read your own bootstrapped distances. This is because
        bootfam renames sequences internally to avoid issues with Phylip's
        requirements on sequence names.

    -m <strict|extended|merge>
        Decide how to handle families that fall under the minimum bootstrap
        threshold (see -b). Default is 'strict', which groups everything
        unsupported together and labels it as "Unclear". The 'extended'
        mode reports non-conflicting families in a greedy fashion (highest
        support first), while 'merge' simply merges all unsupported and
        overlapping family candidates. This differs from 'strict' in that
        there may be several groups that are non-overlapping and
        unsupported, and 'strict' would put them all in the same family.

    -r <int>
        Specify the number of replicates in the bootstrap process. Default:
        100

    -b <float>
        Experimental: Minimum bootstrap value to report families for.
        Everything below this is called "Unclear".

    -H  Let the output be HTML formatted for easy viewing in a web browser.

    -q  Quiet operation, no verbose messages.
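
To make the difference between the three '-m' modes concrete, here is a small Python sketch. It is an illustration of the descriptions above, not BootFam's actual code. Several things are assumptions: family candidates are modelled as (set of sequences, support) pairs, support is a fraction between 0 and 1, two candidates "conflict" when they share a sequence, and all function names are made up. What BootFam does with sequences left over in 'extended' mode is not covered.

```python
def split_by_support(candidates, min_support):
    """Separate candidates at the threshold (cf. -b); 'below' means strictly less."""
    supported = [c for c in candidates if c[1] >= min_support]
    unsupported = [c for c in candidates if c[1] < min_support]
    return supported, unsupported


def group_strict(unsupported):
    """'strict': one group holding every sequence of any unsupported candidate."""
    members = set()
    for family, _support in unsupported:
        members |= family
    return [members] if members else []


def group_extended(unsupported):
    """'extended': greedy pick, highest support first, skipping overlaps."""
    chosen, used = [], set()
    for family, _support in sorted(unsupported, key=lambda c: c[1], reverse=True):
        if used.isdisjoint(family):
            chosen.append(set(family))
            used |= family
    return chosen


def group_merge(unsupported):
    """'merge': union overlapping candidates; disjoint groups stay separate."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for family, _support in unsupported:
        members = sorted(family)
        for member in members:
            # Link every member of the candidate to its first member.
            parent[find(member)] = find(members[0])

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


if __name__ == "__main__":
    candidates = [
        (frozenset({"a", "b"}), 0.95),
        (frozenset({"c", "d"}), 0.60),
        (frozenset({"d", "e"}), 0.50),
        (frozenset({"f", "g"}), 0.40),
    ]
    _supported, unsupported = split_by_support(candidates, 0.8)
    print("strict:  ", group_strict(unsupported))
    print("extended:", group_extended(unsupported))
    print("merge:   ", group_merge(unsupported))
```

With these toy candidates, 'strict' yields a single group of c, d, e, f and g. 'extended' keeps {c, d} and {f, g} and skips {d, e} because it overlaps a higher-supported pick. 'merge' joins the overlapping candidates into {c, d, e} and leaves {f, g} as a separate group.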