 Genomics: Basic
Genomics assignment: Basic part
Please note this is an individual assignment. See the CSC code of honour for guidance.
The answers to the basic part are due Tue, Feb 15.
- What fraction of the genome is expected to be missed in a whole-genome shotgun project, according to the Lander-Waterman model, if the coverage is 7?
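For orientation: the Lander-Waterman model treats read start positions as a Poisson process, so the expected fraction of bases not covered by any read at coverage c is e^(-c). A minimal Python sketch of that relationship follows; the function name is illustrative, and edge effects at the ends of the genome are ignored.

```python
import math

def fraction_missed(coverage: float) -> float:
    """Expected fraction of the genome left uncovered at the given coverage,
    under the Poisson assumption of the Lander-Waterman model."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return math.exp(-coverage)

# Example values at low coverage; substitute the coverage you are asked about.
for c in (1, 2, 3):
    print(f"coverage {c}: fraction missed = {fraction_missed(c):.4f}")
```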
- What is meant by "gap closure"?
- Use this linked sequence, a cellulose synthase from Phytophthora infestans, to search NCBI's Trace Archive with BLAST for reads containing the same sequence in Saprolegnia parasitica. A genome project has been started for S. parasitica, but it is far from ready to present an assembly.
  - "Choose search set": Pick Saprolegnia parasitica from the list of species!
  - "Program selection": Pick "somewhat similar sequences"!
  - Download the significant hits, i.e., those with an E-value of 1e-5 or lower.
Now run a suitable assembly program to assemble the significant reads (e.g. CAP3 is available as a web service, and software such as CAP3 and Minimus can be installed on your own computer).
Describe your results! How many contigs do you get? Does
S. parasitica have a similar copy of the CesA gene?
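If you save the hits as tab-separated text, selecting the significant ones can be scripted. The sketch below assumes the 12-column tabular layout produced by BLAST+ with `-outfmt 6`, where the subject ID is column 2 and the E-value is column 11; the Trace Archive download may use a different layout, so check the columns first. The function name and the two example rows are made up.

```python
def significant_hits(lines, max_evalue=1e-5):
    """Return the subject IDs of hits whose E-value is at most max_evalue.

    Assumes BLAST tabular columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[1])
    return hits

# Made-up rows for illustration only.
example = [
    "query\tread_A\t85.0\t200\t30\t0\t1\t200\t5\t204\t3e-40\t150",
    "query\tread_B\t70.0\t50\t15\t0\t10\t59\t1\t50\t0.002\t35",
]
print(sorted(significant_hits(example)))
```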
- Below is a set of nine sequences which we pretend come from a
small shotgun sequencing project.
  - Construct the overlap graph G for the sequences. Require an exact match over at least 10 positions to recognize an overlap. Illustrate G in a picture.
  - Perform a transitive reduction (see Sommer et al.) on G and illustrate the result. This is an important step. Don't guess what it is.
  - Suggest a good assembly of the sequences given your reduced graph. Note: the assembly must be consistent with G.
You should not need a computer program for this exercise.
>s1
AAAAAAAAAAAACCCCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>s2
AAAAAAACCCCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGAAAAAA
>s3
AAAAAGCGCGCAAAAAAAAAAAAAAAAAAAAAATTTTTTAAAAAA
>s4
AAAAAAAAGGGGGGAAAAAAAAAAAAAAAAAAAAAAAACCCCCCAAAAAAAAAAAAAAA
>s5
TTTTTTTTTTTTTTAAAAAAAAAAGCGCGCAAAAAA
>s6
AAAAATTTTTTAAAAAAAAAAAAAAAAAGGGGGGAAAAAAAAA
>s7
AAAATGTGTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>s8
AAAAGGGGGGAAAAAAAAAAAAAAAAAAAAAAAACCCCCCAAAATGTGTG
>s9
AAAAAAAAAAAAAAAAAAAACCCCCCAAAAAAAAAAAAAAAAAAAAAAAA
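Although this exercise is meant to be done by hand, a small script can be useful for checking a drawing. The sketch below uses three made-up reads and a minimum overlap of 4, not the assignment data. It applies a simplified form of transitive reduction, dropping an edge u -> w whenever some read v has edges u -> v and v -> w; consult Sommer et al. for the full treatment. All function names are illustrative.

```python
from itertools import permutations

def overlap(a, b, min_len):
    """Length of the longest proper suffix of a that equals a prefix of b,
    or 0 if no such exact match of at least min_len exists."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def build_overlap_graph(reads, min_len):
    """Map each read name to {successor name: overlap length}."""
    graph = {name: {} for name in reads}
    for u, v in permutations(reads, 2):
        n = overlap(reads[u], reads[v], min_len)
        if n:
            graph[u][v] = n
    return graph

def transitive_reduction(graph):
    """Remove every edge u -> w that is implied by a path u -> v -> w."""
    reduced = {u: dict(nbrs) for u, nbrs in graph.items()}
    for u in graph:
        for v in graph[u]:
            for w in graph[v]:
                if w in graph[u]:
                    reduced[u].pop(w, None)
    return reduced

# Toy reads sampled from the made-up sequence ATGCGTACCTGA.
reads = {"r1": "ATGCGTAC", "r2": "GCGTACCT", "r3": "GTACCTGA"}
graph = build_overlap_graph(reads, min_len=4)
reduced = transitive_reduction(graph)
print(graph)
print(reduced)
```

In this toy case the edge r1 -> r3 is removed because it is implied by r1 -> r2 -> r3, which leaves a single path through the reads.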
- Describe, in your own words, how a de Bruijn graph is built from a sequence.
- Build a de Bruijn graph for the sequence
AACCGGTTAACG with k=4 (using Pop's definition of k, not mine) and mark the path corresponding to the sequence.
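As a reminder of the mechanics, here is a minimal sketch of one common de Bruijn graph construction, using a different sequence from the one above. It assumes the convention in which nodes are (k-1)-mers and every k-mer in the sequence contributes one edge; this may not match the definition of k the exercise asks for, so check the course reading before relying on it. The function names are illustrative.

```python
from collections import defaultdict

def de_bruijn(seq, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix
    to its suffix. Repeated k-mers give repeated (multi-)edges."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

def sequence_path(seq, k):
    """The (k-1)-mer nodes visited, in order, when spelling out seq."""
    return [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]

toy = "ACGTACG"  # made-up example sequence
print(de_bruijn(toy, 3))
print(sequence_path(toy, 3))
```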
- What is the idea behind the "uniqueome" (see Koehler et al.) and how does it relate to read mapping?
- In MAQ, how likely is it that a read is mismapped if its quality score is 40?
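MAQ reports mapping quality on the Phred scale, Q = -10 log10(P), where P is the estimated probability that the read has been placed incorrectly. A minimal sketch of the conversion, with an illustrative function name and example scores other than the one asked about:

```python
def mismap_probability(quality: float) -> float:
    """Convert a Phred-scaled mapping quality Q to an error probability,
    using P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)

# Example scores; substitute the quality you are asked about.
for q in (10, 20, 30):
    print(f"Q = {q}: P(mismapped) = {mismap_probability(q):g}")
```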