 Genomics: Advanced
Genomics assignment: Advanced part
The advanced part consists of two assignments.
- SOLiD sequencing
Due to special properties of the SOLiD sequencing platform, it is
possible to recognize many of the reads with an error, even before
aligning the reads to a reference sequence. In this task you are given
1000 sequences of length 50 in
color-space. The reads were obtained from a
class of non-coding RNA called miRNA. In this experiment,
miRNAs are between 19 and 23 nucleotides long.
There are two adapters attached to all sequences of
interest. One adapter is attached to the 5' side of each sequence
and another one to the 3' side. The first 12 nucleotides of the
adapter on the 3' side are CGCCTTGGCCGT.
Your task is to identify the reads with an error, using the
information above. Submit your method and a file
containing the accessions of the sequences that most likely contain no
error prior to the adapter sequence (a text file with one
accession per line).
The clickable illustration (from Rumble et al) helps you decode the colorspace sequences.
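One possible way to exploit these properties is sketched below in Python. In the usual SOLiD two-base encoding, each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3), so a single miscalled color changes every base decoded after it. If a read still decodes to the known 3' adapter prefix at an offset of 19 to 23 bases, the colors before the adapter are probably correct. The sketch assumes each read is available as a leading primer base followed by color digits 0-3 (csfasta style); that input format and all function names are assumptions made for illustration, not part of the assignment.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}   # A=0, C=1, G=2, T=3

ADAPTER = "CGCCTTGGCCGT"   # first 12 nucleotides of the 3' adapter


def encode_colors(seq):
    """Translate a base sequence into colors (one color per adjacent pair)."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:]))


def decode_colors(first_base, colors):
    """Decode colors into bases, starting from the known primer base.

    The primer base itself is not included in the result.
    """
    decoded = []
    prev = CODE[first_base]
    for color in colors:
        prev ^= int(color)          # next base = previous base XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)


def adapter_offset(first_base, colors, min_len=19, max_len=23):
    """Return the miRNA length if the adapter decodes correctly, else None."""
    decoded = decode_colors(first_base, colors)
    for length in range(min_len, max_len + 1):
        if decoded.startswith(ADAPTER, length):
            return length
    return None
```

A read for which `adapter_offset` returns `None` most likely contains a color error before or inside the adapter. Two errors with the same effect can cancel each other out, so this is a filter rather than a proof of correctness.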
- Read mapping: Write SMAQ, the Simple MAQ clone
The objective of this assignment is to solve a common, modern,
genome mapping problem:
- Given a reference genome G and a large set of short
genome reads from a related genome G´, find the differences between G and G´.
The real-life problem is complicated by experimental errors due to
limitations in lab protocols, but we will work with ideal data.
Your objective is to implement your own read mapper and find the mutations
in the data files listed below. My recommendation is that you keep the
program MAQ-inspired and use hashtables/dictionaries.
- How many mutations do you think there are in the "simple" dataset?
- Where are the mutations located?
- How large a data set can you handle? For this question, try the Challenge data!
Please present your findings as a file containing one position per
line. E.g.,
17
23
1023
for easy comparison with the Truth.
Genome positions are zero-indexed.
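As a starting point, here is a minimal Python sketch of a dictionary-based mapper. It is not MAQ's actual algorithm: it indexes every k-mer of the reference in a dictionary, looks up non-overlapping seeds from each read, and keeps the placement with the fewest mismatches. With 30 bp reads and k = 10, a read with at most two mismatches always contains at least one exact seed. The sketch assumes substitutions only, reads from the forward strand, and error-free data; all function names and the parameters `k` and `max_mismatches` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(genome, k):
    """Map every k-mer of the genome to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k, max_mismatches=2):
    """Return the start of the best placement of the read, or None."""
    best = None   # (mismatches, start)
    for offset in range(0, len(read) - k + 1, k):      # non-overlapping seeds
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[0]):
                best = (mismatches, start)
    return None if best is None else best[1]


def call_mutations(genome, reads, k=10, max_mismatches=2):
    """Return sorted zero-indexed positions where the reads disagree with the genome."""
    index = build_index(genome, k)
    pileup = defaultdict(Counter)          # position -> counts of observed bases
    for read in reads:
        start = map_read(read, genome, index, k, max_mismatches)
        if start is None:
            continue
        for j, base in enumerate(read):
            pileup[start + j][base] += 1
    mutations = []
    for pos in sorted(pileup):
        consensus, _ = pileup[pos].most_common(1)[0]
        if consensus != genome[pos]:
            mutations.append(pos)
    return mutations
```

Printing the result with `print("\n".join(map(str, call_mutations(genome, reads))))` gives one position per line. For the larger data sets, storing a full pileup in dictionaries may become the memory bottleneck, so recording only mismatching positions is one option to explore.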
Simple data
Look for data in
/info/omsys11/data/lab1/enkel/ on the CSC computers. You
will find:
- A small reference genome
enkel, about 4800 bp long.
- There are 805 short reads (30 bp) in
reads.fa and they contain a secret number of mutations compared to the reference genome.
Challenge data
For a challenging test, please look in the
/info/omsys10/data/lab1/NC_009782/ directory on the CSC computers. This
directory contains:
- The reference genome Staphylococcus aureus
subsp. aureus Mu3 (NC_009782.1), which is found in
NC_009782.fa.
- Four different test cases based on NC_009782.1.
  - reads_2x.fa.gz: On average, every base has been
    "sequenced" twice in this dataset, i.e., coverage is 2. The file
    contains 192011 short reads.
  - reads_5x.fa.gz: Coverage is 5. The file
    contains 480028 short reads. This is 19 MB unzipped.
  - reads_10x.fa.gz: Coverage is 10. The file
    contains 960056 short reads. This is 37 MB unzipped.
  - reads_20x.fa.gz: Coverage is 20. The file
    contains 1920112 short reads. This is 75 MB unzipped.
Use gunzip (turns reads_*.fa.gz into reads_*.fa) or zcat (writes
result to the terminal) to unzip the data files.
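If you work in Python, you can also read the compressed files directly instead of unzipping them first. The sketch below uses the standard `gzip` module; the function name `read_fasta` is hypothetical, and the parser assumes plain FASTA with `>` header lines.

```python
import gzip


def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file, gzipped or not."""
    opener = gzip.open if path.endswith(".gz") else open
    name, chunks = None, []
    with opener(path, "rt") as handle:       # "rt" = read as text
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            else:
                chunks.append(line)
    if name is not None:                     # emit the last record
        yield name, "".join(chunks)
```

Because the function is a generator, the reads can be processed one at a time without holding the whole file in memory.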
Advice
I recommend that you create your own tiny test cases
before going for the data given! Don't bother with the given data until you can handle your own test cases.
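A tiny test case with a known answer can be generated along the following lines. This is only a sketch: the function name, the default sizes, and the choice to simulate substitutions on the forward strand with no sequencing errors are all assumptions.

```python
import random


def make_test_case(genome_len=200, n_mutations=3, read_len=30, coverage=10, seed=0):
    """Return (reference genome, sorted true mutation positions, reads)."""
    rng = random.Random(seed)                # fixed seed gives reproducible cases
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))

    # Substitute a different base at randomly chosen, distinct positions.
    positions = sorted(rng.sample(range(genome_len), n_mutations))
    mutated = list(genome)
    for pos in positions:
        mutated[pos] = rng.choice([b for b in "ACGT" if b != genome[pos]])
    mutated = "".join(mutated)

    # Sample reads uniformly at random from the mutated genome.
    n_reads = coverage * genome_len // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        reads.append(mutated[start:start + read_len])
    return genome, positions, reads
```

Comparing your mapper's output with the returned positions tells you immediately whether it works, and varying `coverage` shows how it behaves when parts of the genome are not sequenced.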
The mutations are supposed to be placed uniformly at random.
The reads are also placed uniformly at random, hence you can expect
that some parts of the genome have not been sequenced at all for the
low-coverage cases.
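To get a feeling for how much of the genome is missed, a common approximation treats the number of reads covering a base as Poisson distributed with mean equal to the coverage c, so the expected uncovered fraction is about e^(-c). The short Python sketch below evaluates this for the four coverage levels; it ignores edge effects at the ends of the genome, and the function name is hypothetical.

```python
import math


def expected_uncovered_fraction(coverage):
    """Poisson approximation: probability that a given base is hit by no read."""
    return math.exp(-coverage)


for c in (2, 5, 10, 20):
    fraction = expected_uncovered_fraction(c)
    print(f"coverage {c:>2}: about {fraction:.2e} of the genome uncovered")
```

At coverage 2 this is roughly 14% of the genome, so mutations in those regions cannot be found no matter how good the mapper is.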
You do not have to re-implement MAQ, but you can probably find very
good inspiration by reading the paper.
Consider downloading Bowtie to verify your results!