BootFam: Defining Gene/Protein Families
Description
This program takes as input a set of homologous protein sequences,
a species tree, and a sequence-to-species mapping. The output is a set of gene families
relative to the species tree.
The method is described in a forthcoming paper.
Availability
BootFam is distributed under the GNU General Public License and is
available for immediate download. The software is written in Perl,
except for the helper programs.
Please acknowledge the use of BootFam in your research, preferably by citing this web site.
Dependencies
BootFam relies on the following applications/libraries:
- seqboot from Phylip.
- protdist from Phylip. This can be changed to 'lapd' with option
'-d'. In the future, this option will be fully adjustable to your
own software preferences. I have just been lazy.
- neighbor from Phylip.
- notung from Dannie Durand and coworkers.
- kalign, by Lassmann and
Sonnhammer. Also consider your own alignments and the option '-a'!
- My small helper programs reconcile
and chainsaw. The links here point to
binaries compiled for GNU/Linux 2.6.9. Please write me about
binaries for other systems: These two small programs depend on a
large codebase we are not ready to distribute.
- Bio::SeqIO in BioPerl.
- Bio::AlignIO in BioPerl.
- Graph::UnionFind from CPAN.
Usage
Usage: bootfam [<options>] <fasta file> <speciestree> <seq-to-species>
The fasta file contains the protein sequences. The species tree is in
rooted Newick format, and the seq-to-species file consists of lines with
two space-separated strings: a sequence accession from the fasta file and
a leaf in the species tree.
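The relationship between the three input files can be checked before a long run. The following is a minimal sketch in Python, not part of BootFam: the function names are hypothetical, the Newick handling only covers plain unquoted leaf names, and the rule that every sequence must have a mapping is an assumption made for illustration.

```python
import re

def newick_leaves(newick):
    """Leaf names of a Newick string (plain, unquoted names only)."""
    # A leaf label directly follows '(' or ','; internal labels follow ')'.
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))

def read_mapping(lines):
    """Parse 'accession species' lines into a dict."""
    mapping = {}
    for n, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {n}: expected 2 fields, got {len(fields)}")
        mapping[fields[0]] = fields[1]
    return mapping

def fasta_ids(lines):
    """Accessions taken as the first word after '>' on each header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.append(words[0])
    return ids

def check_inputs(fasta_lines, newick, mapping_lines):
    """Return a list of human-readable problems (empty if none found)."""
    leaves = newick_leaves(newick)
    mapping = read_mapping(mapping_lines)
    problems = []
    for acc in fasta_ids(fasta_lines):
        if acc not in mapping:
            # Assumption: every sequence needs a species.
            problems.append(f"{acc} has no species mapping")
        elif mapping[acc] not in leaves:
            problems.append(
                f"{acc} maps to '{mapping[acc]}', which is not a leaf of the species tree"
            )
    return problems

# Toy data, invented for this example.
fasta = [">seqA", "MKV", ">seqB", "MKL", ">seqC", "MRV"]
tree = "((human,mouse),fly);"
mapping = ["seqA human", "seqB mouse", "seqC worm"]
print(check_inputs(fasta, tree, mapping))
```

With the toy data, the only reported problem is that seqC maps to a species that is not a leaf of the tree.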
Options:
-h, -u
This help text.
-a This option needs a makeover! Right now it means: "Don't look for
kalign". In the future, we will use it to allow opening alignments
in different formats. Or to say: "Do align!"
-c <filename>
Compute consensus tree and put it in <filename>. This consensus tree
is rooted.
-d <progname>
progname decides how protein distances are estimated; it is either
'lapd' or 'protdist'.
-D <filename>
This option has two meanings: either create or read from an
intermediate result file. Since bootstrapped distances are
computationally expensive, one should avoid recomputing them if
possible. If you expect to experiment with parameters for the steps
after bootstrapping, use this option.
If the file <filename> does not exist, then bootfam creates this
file, proceeds with computing distances for all bootstrap
replicates, and saves the distances in the named file. If the file
exists, bootfam tries to read a set of distances from that file and
continues to the next step, hence avoiding the costly distance step.
Note that you cannot, yet, bypass bootfam's distance estimation and
let bootfam read your own bootstrapped distances. This is because
bootfam renames sequences internally to avoid issues with Phylip's
requirements on sequence names.
-m <strict|extended|merge>
Decide how to handle families that fall under the min bootstrap
threshold (see -b). Default is 'strict', which means group
everything unsupported together and label it as "Unclear". The
'extended' mode reports non-conflicting families in a greedy fashion
(highest support first), while 'merge' simply merges all unsupported
and overlapping family candidates. This differs from 'strict' in
that several unsupported groups that do not overlap stay separate,
whereas 'strict' would put them all in the same family.
-r <int>
Specify the number of replicates in the bootstrap process. Default:
100
-b <float>
Experimental: Minimum bootstrap value to report families for.
Everything below this is called "Unclear".
-H Let the output be HTML formatted for easy viewing in a web browser.
-q Quiet operation, no verbose messages.
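
The three '-m' modes are easiest to compare on a small example. The following Python sketch is an illustration only and is not taken from BootFam's Perl source: the representation of a family candidate as a (set of sequences, support) pair, the function names, and the choice to simply drop rejected candidates in 'extended' are all assumptions.

```python
def strict(unsupported):
    """Pool every unsupported candidate into one 'Unclear' group."""
    pooled = set()
    for members, _support in unsupported:
        pooled |= members
    return [pooled] if pooled else []

def extended(unsupported):
    """Greedy: take candidates by decreasing support, skip conflicting ones."""
    chosen, used = [], set()
    for members, _support in sorted(unsupported, key=lambda c: -c[1]):
        if not (members & used):
            chosen.append(set(members))
            used |= members
    return chosen  # assumption: rejected candidates are not reported here

def merge(unsupported):
    """Union candidates that share at least one sequence (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members, _support in unsupported:
        ordered = sorted(members)
        if not ordered:
            continue
        root = find(ordered[0])
        for other in ordered[1:]:
            parent[find(other)] = root
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=sorted)

# Toy candidates below the bootstrap threshold, invented for this example.
cands = [({"a", "b"}, 0.6), ({"b", "c"}, 0.4), ({"x", "y"}, 0.5)]
print(strict(cands))    # one pooled group
print(extended(cands))  # non-conflicting groups, highest support first
print(merge(cands))     # overlapping groups joined, others kept apart
```

On the toy data, 'strict' yields a single group of five sequences, 'merge' yields {a, b, c} and {x, y}, and 'extended' keeps {a, b} and {x, y} while skipping the conflicting {b, c}.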