Python for Genomics Data Science Project

Shengyuan Wang
Mar 29, 2020

Write a Python program that takes as input a file containing DNA sequences in multi-FASTA format, and computes the answers to the following questions.

  1. How many records are in the file?
  2. What are the lengths of the sequences in the file? What is the longest sequence and what is the shortest sequence? Is there more than one longest or shortest sequence? What are their identifiers?
  3. Given an input reading frame on the forward strand (1, 2, or 3) your program should be able to identify all ORFs present in each sequence of the FASTA file, and answer the following questions: what is the length of the longest ORF in the file? What is the identifier of the sequence containing the longest ORF? For a given sequence identifier, what is the longest ORF contained in the sequence represented by that identifier? What is the starting position of the longest ORF in the sequence that contains it? The position should indicate the character number in the sequence.
  4. Given a length n, your program should be able to identify all repeats of length n in all sequences in the FASTA file. Your program should also determine how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.
In [5]:
from Bio import SeqIO
from collections import Counter


# calculate the start position and the length of longest ORF in sequence
def orf(sequence, reading_frame):
    seq = sequence[reading_frame-1:]
    max_len_orf = 0
    max_orf_start = 0

    for i in range(0, len(seq) - 6, 3):
        if seq[i:i + 3] == 'ATG':
            for j in range(i + 3, len(seq) - 3, 3):
                if seq[j:j + 3] in ['TAA', 'TAG', 'TGA']:
                    len_orf = j+3-i
                    if len_orf > max_len_orf:
                        max_len_orf = len_orf
                        max_orf_start = i
                    break

    return max_orf_start, max_len_orf


# identify all repeats of length n in sequence
def repeat_substring(sequence, repeat_num):
    sub_list = []
    for i in range(len(sequence)-repeat_num):
        sub_list.append(sequence[i:(i+repeat_num)])
    return sub_list


def fasta_analysis(input_file, reading_frame, n):
    fasta_sequences = SeqIO.parse(open(input_file), 'fasta')

    record_num = 0
    record_len = {}
    orf_record = {}
    sub_sum = []

    for fasta in fasta_sequences:
        name, sequence, description = fasta.id, str(fasta.seq), str(fasta.description).split()
        record_num += 1
        seq_len = len(sequence)
        record_len[description[0]] = seq_len

        # calculate the start position and the length of longest ORF
        orf_start, orf_len = orf(sequence, reading_frame)
        orf_record[description[0]] = [orf_start, orf_len]

        # identify all repeats of length n
        substring_list = repeat_substring(sequence, n)
        for i in range(len(substring_list)):
            sub_sum.append(substring_list[i])

    record_len_sorted = {k: v for k, v in sorted(record_len.items(), key=lambda item: item[1], reverse=True)}
    orf_record_sorted = {k: v for k, v in sorted(orf_record.items(), key=lambda item: item[1][1], reverse=True)}

    print("Number of records: ", record_num, '\n')
    print("Sorted records length: ", record_len_sorted, '\n')
    print("ORF%s, sorted by length: " % str(reading_frame), orf_record_sorted, '\n')
    print("Counter of repeats of length n, most common 5: ", Counter(sub_sum).most_common(5))


if __name__ == '__main__':
    input_file = 'dna2.fasta'

    # Test: ORF: 1, length of repeat: 7
    fasta_analysis(input_file, 1, 7)
Number of records:  18 

Sorted records length:  {'gi|142022655|gb|EQ086233.1|255': 4894, 'gi|142022655|gb|EQ086233.1|16': 4804, 'gi|142022655|gb|EQ086233.1|91': 4635, 'gi|142022655|gb|EQ086233.1|454': 4564, 'gi|142022655|gb|EQ086233.1|293': 4338, 'gi|142022655|gb|EQ086233.1|396': 4076, 'gi|142022655|gb|EQ086233.1|45': 3511, 'gi|142022655|gb|EQ086233.1|250': 2867, 'gi|142022655|gb|EQ086233.1|527': 2646, 'gi|142022655|gb|EQ086233.1|4': 2095, 'gi|142022655|gb|EQ086233.1|277': 1432, 'gi|142022655|gb|EQ086233.1|75': 1352, 'gi|142022655|gb|EQ086233.1|304': 1151, 'gi|142022655|gb|EQ086233.1|594': 967, 'gi|142022655|gb|EQ086233.1|584': 964, 'gi|142022655|gb|EQ086233.1|88': 890, 'gi|142022655|gb|EQ086233.1|322': 442, 'gi|142022655|gb|EQ086233.1|346': 115} 

ORF1, sorted by length:  {'gi|142022655|gb|EQ086233.1|45': [384, 2394], 'gi|142022655|gb|EQ086233.1|250': [561, 1560], 'gi|142022655|gb|EQ086233.1|16': [1527, 1509], 'gi|142022655|gb|EQ086233.1|255': [291, 1443], 'gi|142022655|gb|EQ086233.1|91': [978, 1296], 'gi|142022655|gb|EQ086233.1|396': [528, 1059], 'gi|142022655|gb|EQ086233.1|454': [2337, 1044], 'gi|142022655|gb|EQ086233.1|293': [1389, 312], 'gi|142022655|gb|EQ086233.1|4': [444, 249], 'gi|142022655|gb|EQ086233.1|277': [597, 204], 'gi|142022655|gb|EQ086233.1|527': [1224, 195], 'gi|142022655|gb|EQ086233.1|75': [819, 180], 'gi|142022655|gb|EQ086233.1|88': [81, 120], 'gi|142022655|gb|EQ086233.1|304': [858, 105], 'gi|142022655|gb|EQ086233.1|584': [159, 90], 'gi|142022655|gb|EQ086233.1|594': [27, 42], 'gi|142022655|gb|EQ086233.1|322': [0, 0], 'gi|142022655|gb|EQ086233.1|346': [0, 0]} 

Counter of repeats of length n, most common 5:  [('CGCGCCG', 63), ('CGCCGCG', 62), ('GCCGCGC', 61), ('GCGCGCG', 59), ('GCGCGGC', 58)]