
Documentation

The program Excap2 takes a multiple sequence alignment and estimates a set of markers that maximizes the genetic variability captured by the set. The length, number, and location of the markers can be specified.

Algorithms and implementations were designed at the Stockholm Bioinformatics Center.

This file contains the following sections:

  1. Installation
  2. Help
  3. Examples
  4. License

Installation

To build the program with make, you need a working GNU g++ compiler (tested with g++ >= 2.16) and the BOOST and STL libraries installed.

To install the program in a UNIX environment, run 'make' in the directory containing the source code. This creates an executable file 'Excap2'.

To use a different compiler or additional compiler flags, edit 'Makefile'.

Help

You can use the option -h to get further information about the available parameters. This will print the following information to the screen:

Currently, only input files in the following format can be read:

         FASTA
All provided sequences must have the same length.

Without any parameters, the best non-overlapping markers of default length are presented in a sorted list (you can change the default length range with the options -minl and -maxl).

Program options

 -stdin          program reads from STDIN - all given files are discarded

 -w              calculates the ec for the whole sequence without looking for markers
 -p              lists only the polymorphic columns.


 -c <int>        combines up to <int> non-overlapping markers, to improve the exclusion capacity - pos int required.

 -m <int>        minimum number of times (<int>) a polymorphism has to occur to be counted.
 -m <double>     minimum fraction (<double>) at which a polymorphism has to occur to be counted.
                 positive int or value between 0.0 and 1.0 required

 -r <int>        how many results should be returned - positive integer required

 -maxl <int>     specifies the maximum-length <int> which a marker is allowed to have - positive integer required
 -minl <int>     specifies the minimum-length <int> which a marker is allowed to have
                 Maximum length must be greater than Minimum length!

 -hvr [1|2]      type 1 to look only at HVR1, 2 for HVR2, and nothing for both
 -nohvr [1|2]    type 1 to ignore HVR1, 2 for HVR2, and nothing to ignore both
                 it is only possible to choose either -hvr or -nohvr, not both!
 -heu <double>           performs a heuristic of <double> percent -
                         0.0 -> optimal solution - 1.0 -> first best result is taken (not optimal!)

 -ex <filename>          excludes all positions given in the file, one position per line, each > 0

 -keep <filename>        keeps only those positions given in the file, one position per line, each > 0

 -ref <int>              sets sequence no <int> in the dataset as reference sequence,
                         default is 1 (first sequence in dataset), type -1 for no reference sequence
                         if you use more than one file, the reference sequence has to occur in each file at the same position!

 -cut [<int> <int>]+     look only at specific parts of the input data; [<int> <int>] must be pairs of start and stop positions
                         e.g.: -cut start1 stop1 start2 stop2 start3 stop3 ... with each start-i <= stop-i
                         leave out the last stop to cut until the end

 -ec <double>    ec-value to start the calculation with - double between 0.0 and 1.0 required

 -i              include indels as polymorphic-positions

 -ir             reference sequence is considered a part of the dataset, per default it is not

 -k              shows a minimal (not necessarily the minimum) set of markers that have to be combined
                    to achieve the best ec possible for the given dataset and the given length parameters

 -s              use SNPs as markers, minimum-length and maximum-length are set to 1

 -noc            perform no correlation check on input data

 -o              breadth-first-search approach (not recommended) - needs a lot of workspace, but reports how much work has been done so far

 -v              verbose mode - prints additional information during calculation.

 -q              quiet mode - only the result is returned.

 -h, -u          Print this help-text to the screen.
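
The input requirements and the ideas behind the options -p, -m and -i can be illustrated with a small Python sketch. It is not Excap2's implementation, and all function names are hypothetical. It assumes that a column counts as polymorphic when the characters other than the most common one occur at least min_count times in total, and that gap characters ('-') are ignored unless indels are explicitly included. How Excap2 itself counts occurrences, and how it treats the reference sequence, may differ.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_alignment(records):
    """Return the common sequence length, or raise ValueError if lengths differ."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return lengths.pop()


def polymorphic_columns(records, min_count=1, include_indels=False):
    """Return the 1-based positions of columns that vary between sequences.

    Assumption (not taken from Excap2): a column counts when the characters
    other than the most common one occur at least `min_count` times in total.
    """
    positions = []
    columns = zip(*(seq for _, seq in records))
    for position, column in enumerate(columns, start=1):
        # Gaps are dropped unless indels are explicitly included.
        chars = [c for c in column if include_indels or c != "-"]
        counts = Counter(chars)
        if len(counts) < 2:
            continue
        minor = len(chars) - counts.most_common(1)[0][1]
        if minor >= min_count:
            positions.append(position)
    return positions


# Toy alignment, invented for this illustration only.
EXAMPLE = """>ref
ACGTAC
>s1
ACGTAT
>s2
AAGTAT
>s3
ACG-AC
"""

records = read_fasta(EXAMPLE)
length = check_alignment(records)
print(length)
print(polymorphic_columns(records))
print(polymorphic_columns(records, min_count=2))
print(polymorphic_columns(records, include_indels=True))
```

In this toy alignment, columns 2 and 6 vary. Only column 6 remains when a variant has to occur at least twice, and column 4 is added once gaps are treated as polymorphic.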

Examples

For testing purposes, some test cases are available in the /examples directory. For each test case we provide a script that runs the test with predefined parameters, a data file containing the multiple sequence alignment used, and a README file with further explanation.

To get a first impression of the program, we recommend running the script files. The README files help in interpreting the output.

To become more familiar with the parameters and the program, we recommend experimenting with different parameter settings.
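
One way to try several parameter settings in a row is a small Python wrapper, sketched below. It uses only flags listed in the help text (-stdin, -s, -minl, -maxl, -r, -q) and feeds the alignment through STDIN. The helper names, the executable path './Excap2' in the current directory, and the file name 'alignment.fasta' are assumptions made for this illustration and are not part of the distribution.

```python
import os
import subprocess


def excap2_command(executable="./Excap2", min_len=None, max_len=None,
                   results=None, snps=False, quiet=False):
    """Build an argument list using only flags from the Excap2 help text."""
    # The help text requires the maximum length to exceed the minimum length.
    if min_len is not None and max_len is not None and max_len <= min_len:
        raise ValueError("maximum length must be greater than minimum length")
    cmd = [executable, "-stdin"]
    if snps:
        cmd.append("-s")
    if min_len is not None:
        cmd += ["-minl", str(min_len)]
    if max_len is not None:
        cmd += ["-maxl", str(max_len)]
    if results is not None:
        cmd += ["-r", str(results)]
    if quiet:
        cmd.append("-q")
    return cmd


def run_excap2(fasta_path, **options):
    """Run Excap2 with the alignment piped to STDIN and return the result."""
    cmd = excap2_command(**options)
    with open(fasta_path, "rb") as handle:
        return subprocess.run(cmd, stdin=handle, capture_output=True, text=True)


# Parameter settings to compare; the values are arbitrary examples.
SETTINGS = [
    {"min_len": 10, "max_len": 20, "results": 5},
    {"snps": True, "results": 10, "quiet": True},
]

FASTA = "alignment.fasta"  # hypothetical input file

for options in SETTINGS:
    print(" ".join(excap2_command(**options)))
    # Only execute when both the binary and the input file are present.
    if os.path.exists("./Excap2") and os.path.exists(FASTA):
        print(run_excap2(FASTA, **options).stdout)
```

Keeping the command construction separate from the execution makes it possible to inspect the argument lists before anything is run.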

License

The program is published under the GNU General Public License version 2.
Published by: Lars Arvestad <arve@nada.su.se>
Updated 2014-09-24