BINF 634 Programming Assignment 2
Due: Monday, October 12, 2009, 7pm

Like many viruses, Hepatitis B Virus (HBV) has a circular genome, meaning that coding regions can "wrap around" from the end of the sequence back to its beginning. In this assignment, you will write a Perl program to find all the possible protein coding regions in a circular genome.

Specifications:
Write a Perl program called cds.pl that takes one command line argument, the name of a FASTA file. You may assume that the FASTA file contains a single sequence. Run your program as follows:

% cds.pl filename

Input:
A FASTA format file containing the genomic DNA.

Output:
First print out the original DNA sequence, including the header line. Print the DNA sequence using 60 characters per line. Then print to the screen a list of possible coding regions (CDS). In GenBank, a CDS is a region of nucleotides that corresponds to the sequence of amino acids in a protein. The output format is shown below, based on the format of a GenBank record. (For HBV, see http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21326584 ). The first line gives the location of the CDS in genomic coordinates (the first base is labeled 1). The second line begins with "/translation=", followed by the protein sequence. All lines of a CDS entry contain a maximum of 58 characters. Print a blank line between consecutive CDS entries. The location of the CDS includes the start and stop codons, but the stop codon is not printed in the translation.

CDS 155..835
/translation="MESTTSGFLGPLLVLQAGFFLLTRILTIPQSLDSWWTSLNFLGG
APTCPGQNSQSPTSNHSPTSCPPTCPGYRWMCLRRFIIFLFILLLCLIFLLVLLDYQG
MLPVCPLLPGTSTTSTGPCRTCTIPAQGTSMFPSCCCTKPSDGNCTCIPIPSSWAFAR
FLWEWASVRFSWLSLLVPFVQWFVGLSPTVWLSAIWMMWYWGPSLYNILSPFLPLLPI
FFCLWVYI"

If a coding region wraps around the end of the sequence, the first line indicates the sections that make up the CDS, as follows:

CDS join(2848..3215,1..835)
/translation="MGGWSSKPRQGMGTNLSVPNPLGFFPDHQLDPAFGANSNNPDWD
FNPNKDHWPEANQVGAGAFGPGFTPPHGGLLGWSPQAQGILTTLPAAPPPASTNRQSG
RQPTPISPPLRDSHPQAMQWNSTTFHQALLDPRVRGLYFPAGGSSSGTVNPVPTTASP
ISSIFSRTGDPAPNMESTTSGFLGPLLVLQAGFFLLTRILTIPQSLDSWWTSLNFLGG
APTCPGQNSQSPTSNHSPTSCPPTCPGYRWMCLRRFIIFLFILLLCLIFLLVLLDYQG
MLPVCPLLPGTSTTSTGPCRTCTIPAQGTSMFPSCCCTKPSDGNCTCIPIPSSWAFAR
FLWEWASVRFSWLSLLVPFVQWFVGLSPTVWLSAIWMMWYWGPSLYNILSPFLPLLPI
FFCLWVYI"

The program should print out each CDS in all three forward reading frames. In order to match the output in the sample, make sure that you print out all CDS for reading frame 1 first, followed by the CDS from reading frames 2 and 3.

Rules for finding possible coding regions:
1. Coding regions begin with a start codon ATG, which translates to M (Methionine).
2. Coding regions end with a stop codon (TAA, TAG or TGA), and no other stop codons occur between the start codon and the stop codon at the end of the coding region.
3. Coding regions may contain more than one Methionine.
4. Coding regions may overlap. For example, the DNA sequence ATG TTA ATG CCT TAG contains two coding regions that translate as MLMP_ and MP_ (where "_" indicates the STOP codon). In this case, you would print them both.
5. Include the first Methionine ("M") in the CDS, but do not print the STOP codon ("_").

Your program will print out more CDS regions than appear in the GenBank record, because not all possible coding regions actually code for proteins.
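
The rules above can be illustrated with a minimal Perl sketch. It is not a reference solution: the helper names (codon_at, find_cds_in_frame, format_location, wrap_text) are hypothetical, translation to protein is left out, and the sketch assumes uppercase DNA, that a frame is defined by the offset of the start codon in the sequence as written, that codons are read circularly, and that a search gives up once it has gone all the way around the genome without meeting a stop codon.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Codon starting at 0-based offset $pos, reading circularly past the end.
sub codon_at {
    my ($dna, $pos) = @_;
    my $len = length $dna;
    return join '', map { substr($dna, ($pos + $_) % $len, 1) } 0 .. 2;
}

# Candidate coding regions whose start codon lies in one forward frame
# (0, 1 or 2). Returns [start, end] pairs in 1-based coordinates; the end
# is the last base of the stop codon. end < start means the region wraps.
sub find_cds_in_frame {
    my ($dna, $frame) = @_;
    my $len     = length $dna;
    my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);
    my @regions;

    for (my $start = $frame; $start < $len; $start += 3) {
        next unless codon_at($dna, $start) eq 'ATG';
        # Assumption: stop searching once the region would exceed one lap.
        for (my $walked = 3; $walked + 3 <= $len; $walked += 3) {
            my $pos = ($start + $walked) % $len;
            next unless $is_stop{ codon_at($dna, $pos) };
            push @regions, [ $start + 1, ($pos + 2) % $len + 1 ];
            last;
        }
    }
    return @regions;
}

# GenBank-style location: "a..b", or "join(a..len,1..b)" for a wrapped region.
sub format_location {
    my ($start, $end, $len) = @_;
    return $end >= $start ? "$start..$end" : "join($start..$len,1..$end)";
}

# Split text into chunks of at most $width characters (60 for the DNA,
# 58 for the /translation="..." text, quotes included).
sub wrap_text {
    my ($text, $width) = @_;
    my @lines = $text =~ /(.{1,$width})/g;
    return @lines;
}

# Toy example from rule 4: two overlapping regions in the first frame.
my $toy = 'ATGTTAATGCCTTAG';
for my $frame (0 .. 2) {
    for my $region (find_cds_in_frame($toy, $frame)) {
        print 'CDS ', format_location(@$region, length $toy), "\n";
    }
}
```

For the toy sequence, the sketch reports the locations 1..15 and 7..15, corresponding to MLMP_ and MP_. Storing the end coordinate modulo the genome length keeps the wrap-around case to a single comparison when the location line is formatted.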

Sample input and output are available on the course web site, created as follows:

% cds.pl hbv.fsa > cds.out

I may test your program on other input files, so don't hard-code the answers for this example.

Due Date:
Submit the assignment via email to me by 7pm on October 12, 2009. Late assignments will not be accepted.

Use of other code:
You may freely use any code discussed in class or in the textbook, including the functions in the module BeginPerlBioinfo.pm, described in Ch. 8. Use of any code from any other source is strictly prohibited.