School of Computer Science and Communication

Genomics assignment: Advanced part

The advanced part consists of two assignments.
  1. SOLiD sequencing
    Due to special properties of the SOLiD sequencing platform, it is possible to identify many of the reads that contain an error even before aligning the reads to a reference sequence. In this task you are given 1000 sequences of length 50 in color-space. The reads were obtained from a class of non-coding RNA called miRNA. In this experiment, the miRNAs are between 19 and 23 nucleotides long.

    Two adapters are attached to every sequence of interest: one to the 5' side and the other to the 3' side. The first 12 nucleotides of the 3' adapter are CGCCTTGGCCGT.

    Your task is to identify the reads that contain an error, using the information above. Submit a description of your method and a file containing the accessions of the sequences that most likely contain no error before the adapter sequence (a text file with one accession per line).

    The illustration from Rumble et al. helps you decode the color-space sequences.
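
The idea behind the task can be sketched in a few lines of Python. This is a minimal sketch, not a reference solution. It assumes the standard SOLiD di-base encoding, in which each color is the XOR of the 2-bit codes A=0, C=1, G=2, T=3 of two adjacent bases. It also assumes that each read is written csfasta-style as one known primer base followed by colors 0 to 3 (no missing calls), and that the miRNA starts at the first decoded base. The function names are hypothetical; check the assumptions against the actual data file before relying on them.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"
ADAPTER = "CGCCTTGGCCGT"  # first 12 nucleotides of the 3' adapter


def encode(bases):
    """Base-space -> color-space: one color per pair of adjacent bases."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(bases, bases[1:]))


def decode(read):
    """Decode a read given as a known first base followed by colors.

    The leading primer base is used only as the starting point and is
    not part of the returned sequence (an assumption about the format).
    """
    current = CODE[read[0]]
    decoded = []
    for color in read[1:]:
        current ^= int(color)  # each color tells how to step to the next base
        decoded.append(BASE[current])
    return "".join(decoded)


def adapter_position(read):
    """Index of the adapter in the decoded read, or -1 if it is absent."""
    return decode(read).find(ADAPTER)


def looks_error_free(read, min_len=19, max_len=23):
    """True if the adapter decodes correctly right after a 19-23 nt insert."""
    return min_len <= adapter_position(read) <= max_len
```

A single wrong color changes every base decoded after it, so the adapter only decodes to CGCCTTGGCCGT if the colors before it are consistent. Two or more errors can cancel each other out, which is why such a check can only say that a read "most likely" contains no error. For reference, `encode(ADAPTER)` gives the 11 colors `33020103031`.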

  2. Read mapping: Write SMAQ, the Simple MAQ clone
    The objective of this assignment is to solve a common, modern genome-mapping problem:
    • Given a reference genome G and a large set of short reads from a related genome, find the differences between G and that related genome.
    The real-life problem is complicated by experimental errors due to limitations in lab protocols, but we will work with ideal data.

    Your objective is to implement your own read mapper and use it to find the mutations in the data files listed below. My recommendation is that you keep the program MAQ-inspired and use hash tables/dictionaries.

    • How many mutations do you think there are in the "simple" dataset?
    • Where are the mutations located?
    • How large a data set can you handle? For this question, try the Challenge data!
    Please present your findings as a file containing one position per line. E.g.,
    17
    23
    1023
    
    for easy comparison with the Truth. Genome positions are zero-indexed.
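
As a starting point, a dictionary-based mapper could look roughly like the sketch below. It is one possible MAQ-inspired design, not the expected solution. It assumes ideal data with substitutions only, reads from the forward strand only, and at most one mismatch per read, so that either the first or the last k-mer of a read matches the reference exactly (this requires 2k to be at most the read length). All function names and the choice k = 15 are hypothetical.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k, max_mismatches=1):
    """Return (start, mismatch_positions) for the best placement, or None."""
    best = None
    # Seed with the first and the last k-mer of the read.
    for offset in (0, len(read) - k):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            diffs = [start + j for j, (a, b) in enumerate(zip(read, window)) if a != b]
            if len(diffs) <= max_mismatches and (best is None or len(diffs) < len(best[1])):
                best = (start, diffs)
    return best


def find_mutations(genome, reads, k=15):
    """Return the sorted zero-indexed reference positions where mapped reads differ."""
    index = build_index(genome, k)
    votes = defaultdict(int)  # how many reads support each candidate position
    for read in reads:
        placement = map_read(read, genome, index, k)
        if placement is not None:
            for pos in placement[1]:
                votes[pos] += 1
    return sorted(votes)
```

Printing the result with `print("\n".join(map(str, find_mutations(genome, reads))))` produces the one-position-per-line format requested above. The `votes` counts are not used for filtering here, but they would be the natural place to add a threshold if the data were not ideal.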

    Simple data

    Look for data in /info/omsys11/data/lab1/enkel/ on the CSC computers. You will find:
    • A small reference genome enkel, about 4800 bp long.
    • 805 short reads (30 bp each) in reads.fa, which contain a secret number of mutations compared to the reference genome.

    Challenge data

    For a challenging test, please look in the /info/omsys10/data/lab1/NC_009782/ directory on the CSC computers. This directory contains:
    • The reference genome Staphylococcus aureus subsp. aureus Mu3 (NC_009782.1), which is found in NC_009782.fa.
    • Four different test cases based on NC_009782.1.
      1. reads_2x.fa.gz: On average, every base has been "sequenced" twice in this dataset, i.e., coverage is 2. The file contains 192011 short reads.
      2. reads_5x.fa.gz: Coverage is 5. The file contains 480028 short reads. This is 19 MB unzipped.
      3. reads_10x.fa.gz: Coverage is 10. The file contains 960056 short reads. This is 37 MB unzipped.
      4. reads_20x.fa.gz: Coverage is 20. The file contains 1920112 short reads. This is 75 MB unzipped.
    Use gunzip (turns reads_*.fa.gz into reads_*.fa) or zcat (writes result to the terminal) to unzip the data files.
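
If you work in Python, you can also read the compressed files directly instead of unzipping them first. The sketch below uses the standard-library `gzip` module. It assumes ordinary FASTA formatting (a `>` header line followed by one or more sequence lines), and the function name `read_fasta` is hypothetical.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:  # "rt" gives text mode for both openers
        header, chunks = None, []
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)
```

Because the function is a generator, the reads are streamed one at a time, which keeps memory use low for the larger data sets.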

    Advice

    I recommend that you create your own tiny test cases before going for the test data given! Don't bother with the target data until you can handle the test data.

    The mutations are supposed to be placed uniformly at random.

    The reads are also placed uniformly at random, so you can expect that some parts of the genome have not been sequenced at all in the low-coverage cases.
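
To get a feeling for how much of the genome to expect without any reads, a common rule of thumb is the Poisson approximation: with average coverage c, the fraction of positions covered by no read is roughly e^(-c). This approximation is an assumption added here, not part of the assignment, and it ignores edge effects at the ends of the genome. The function name is hypothetical.

```python
import math


def uncovered_fraction(coverage):
    """Approximate fraction of positions hit by no read (Poisson model)."""
    return math.exp(-coverage)


for c in (2, 5, 10, 20):
    print(f"coverage {c:>2}: about {uncovered_fraction(c):.2e} of positions uncovered")
```

Under this model, roughly 13.5% of the genome is not sequenced at coverage 2, so a mutation in such a region cannot be found no matter how good the mapper is.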

    You do not have to re-implement MAQ, but you can probably find very good inspiration by reading the paper.

    Consider downloading Bowtie to verify your results!

Copyright © Page maintainer: Jens Lagergren <jensl@nada.kth.se>
Updated 2011-02-11