School of Computer Science and Communication

Genomics assignment: Basic part

Please note this is an individual assignment. See the CSC code of honour for guidance.

The answers to the basic part are due Tue, Feb 15.

  1. How much of the genome has been missed in a whole-genome shotgun project, according to the Lander-Waterman model, if the coverage is 7?
  2. What is meant by "gap closure"?
  3. Use this linked sequence, a cellulose synthase from Phytophthora infestans, to search NCBI's Trace Archive with BLAST for reads containing the same sequence in Saprolegnia parasitica. A genome project has been started for S. parasitica, but it is far from ready to present an assembly.
    • "Choose search set": Pick Saprolegnia parasitica from the list of species!
    • "Program selection": Pick "somewhat similar sequences"!
    • Download the significant hits, i.e., those with E-value 1e-5 or lower.
    Now run a suitable assembly program to assemble the significant reads. CAP3, for example, is available as a web service, and software such as CAP3 and Minimus can be installed on your own computer.

    Describe your results! How many contigs do you get? Does S. parasitica have a similar copy of the CesA gene?
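
The E-value filtering step above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the hits have been saved as 12-column tab-separated BLAST output, with the subject ID in column 2 and the E-value in column 11. The Trace Archive download may look different, and the function name and the example rows are hypothetical.

```python
def significant_hits(lines, max_evalue=1e-5):
    """Return the subject IDs of hits with E-value <= max_evalue.

    Assumption: tab-separated BLAST output with 12 columns, where
    column 2 is the subject (read) ID and column 11 is the E-value.
    """
    ids = []
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])
        # Keep each read once, in order of first appearance.
        if evalue <= max_evalue and fields[1] not in ids:
            ids.append(fields[1])
    return ids


# Hypothetical example rows, not real search results.
example = [
    "# made-up tabular BLAST output",
    "query\tread_A\t85.0\t200\t30\t0\t1\t200\t5\t204\t3e-40\t160",
    "query\tread_B\t70.0\t50\t15\t0\t10\t59\t1\t50\t0.002\t35",
    "query\tread_C\t80.0\t120\t24\t0\t40\t159\t3\t122\t1e-5\t60",
]
print(significant_hits(example))
```

The reads selected this way are the ones to hand to the assembler.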

  4. Below is a set of nine sequences which we pretend come from a small shotgun sequencing project.
    1. Construct the overlap graph G for the sequences. Require an exact match over at least 10 positions to recognize an overlap. Illustrate G in a picture.
    2. Perform a transitive reduction (see Sommer et al.) on G and illustrate the result. This is an important step. Don't guess what it is.
    3. Suggest a good assembly of the sequences given your reduced graph. Note: the assembly must be consistent with G.
    You should not need a computer program for this exercise.
    >s1
    AAAAAAAAAAAACCCCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    >s2
    AAAAAAACCCCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGAAAAAA
    >s3
    AAAAAGCGCGCAAAAAAAAAAAAAAAAAAAAAATTTTTTAAAAAA
    >s4
    AAAAAAAAGGGGGGAAAAAAAAAAAAAAAAAAAAAAAACCCCCCAAAAAAAAAAAAAAA
    >s5
    TTTTTTTTTTTTTTAAAAAAAAAAGCGCGCAAAAAA
    >s6
    AAAAATTTTTTAAAAAAAAAAAAAAAAAGGGGGGAAAAAAAAA
    >s7
    AAAATGTGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    >s8
    AAAAGGGGGGAAAAAAAAAAAAAAAAAAAAAAAACCCCCCAAAATGTGTG
    >s9
    AAAAAAAAAAAAAAAAAAAACCCCCCAAAAAAAAAAAAAAAAAAAAAAAA
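
The exercise is meant to be solved by hand, but a small script can be useful for checking a hand-drawn graph. The following is a minimal sketch. It assumes that an overlap means a suffix of one read exactly matching a prefix of another over at least `min_len` positions, and it keeps only the longest such overlap per ordered pair. The transitive reduction shown is a simplified version that ignores overlap lengths, so it is not necessarily the procedure described by Sommer et al. All function names and the toy reads are hypothetical. The toy reads are deliberately not the nine sequences above.

```python
def longest_overlap(a, b, min_len=10):
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=10):
    """Directed overlap graph as {(src, dst): overlap_length}."""
    edges = {}
    for x, sx in reads.items():
        for y, sy in reads.items():
            if x != y:
                n = longest_overlap(sx, sy, min_len)
                if n:
                    edges[(x, y)] = n
    return edges


def transitive_reduction(edges):
    """Drop edge (a, c) whenever edges (a, b) and (b, c) both exist.

    Simplified: overlap lengths are not checked for consistency.
    """
    nodes = {v for edge in edges for v in edge}
    redundant = {
        (a, c)
        for (a, c) in edges
        for b in nodes
        if (a, b) in edges and (b, c) in edges
    }
    return {e: n for e, n in edges.items() if e not in redundant}


# Toy reads cut from the made-up sequence ATGCGTACCTGAAGTC.
reads = {"r1": "ATGCGTACCT", "r2": "CGTACCTGAA", "r3": "ACCTGAAGTC"}
graph = overlap_graph(reads, min_len=4)
reduced = transitive_reduction(graph)
print(sorted(graph))    # includes the shortcut edge r1 -> r3
print(sorted(reduced))  # shortcut removed
```

For reads with long homopolymer runs, a pair may overlap at several different lengths; this sketch reports only the longest one.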
    
    1. Describe, in your own words, how a de Bruijn graph is built from a sequence.
    2. Build a de Bruijn graph for the sequence AACCGGTTAACG with k=4 (using Pop's definition of k, not mine) and mark the path corresponding to the sequence.
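
As an aid for checking a hand-built graph, here is a minimal sketch of one common de Bruijn construction. It assumes that the edges are the k-mers of the sequence and that the nodes are their (k-1)-mer prefixes and suffixes. Conventions differ on whether k denotes the node length or the edge length, so check which definition the course uses before applying it to the exercise. The function names and the toy sequence are hypothetical.

```python
from collections import defaultdict


def de_bruijn(seq, k):
    """Adjacency lists: each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def path_of(seq, k):
    """Nodes visited, in order, when the graph is walked along seq."""
    return [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]


# Toy sequence, not the one from the exercise.
toy = "ATGGCGTGCA"
for node, successors in de_bruijn(toy, 3).items():
    print(node, "->", ", ".join(successors))
print(" -> ".join(path_of(toy, 3)))
```

Nodes that occur more than once in the printed path are the places where the graph branches or forms a cycle.
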
  5. What is the idea behind the "uniqueome" (see Koehler et al.) and how does it relate to read mapping?
  6. In MAQ, how likely is it that a read is mismapped if its quality score is 40?
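
Questions 1 and 6 both come down to evaluating a short closed-form expression. The sketch below assumes the standard forms: under the Lander-Waterman model the expected fraction of the genome left uncovered at coverage c is e^(-c), and a Phred-scaled mapping quality Q corresponds to a mismapping probability of 10^(-Q/10). The function names are hypothetical, and the example values are deliberately different from those in the questions.

```python
import math


def uncovered_fraction(coverage):
    """Expected fraction of the genome with no read, assuming Lander-Waterman."""
    return math.exp(-coverage)


def mismap_probability(mapping_quality):
    """Error probability for a Phred-scaled quality: Q = -10 * log10(P)."""
    return 10 ** (-mapping_quality / 10)


print(f"coverage 2: {uncovered_fraction(2):.4f} of the genome uncovered")
print(f"quality 20: mismapping probability {mismap_probability(20):.4f}")
```
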
Copyright © Page maintainer: Jens Lagergren <jensl@nada.kth.se>
Updated 2011-02-11