BCH2024 Introduction to Programming in Python for Biochemistry (Sept.2020): 7.2 Assignment 2: Messenger RNA Transcription

Due date: Thursday, September 24th, 2020 at 11:55 pm.

Note that there has been a correction to this assignment. In point 2 below, "left to right" should have been "right to left".

In this assignment, we will try to identify (potential) messenger RNA encoded by the DNA sequence stored in the FASTA file chromosome1.fa, which we also looked at in assignment 1.

As you probably know, sequences in DNA are recipes for proteins, with each triplet of C, T, G, and A encoding a specific amino acid (which is why those triplets are called codons). The first step in communicating these recipes to the ribosomes, where proteins are ultimately synthesized, is the production of messenger RNA, or mRNA.

mRNA is produced by 'transcribing' DNA. From a bioinformatics point of view, finding possible mRNA sequences entails:

Making a choice of reading frame, i.e., where to start triplets and in which direction to read the DNA sequence.
Transcribing the triplets from DNA to RNA, which is given by the mapping:
T → A, A → U, G → C, C → G
Finding a starting point of the mRNA in the sequence. This is given by a specific codon. Let's say that the RNA start codon for our sample is AUG.
Reading and translating the sequence until a stop codon is encountered. There are several possibilities, but for this assignment, let's say the only stop codon is UAA.
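
As an illustration of the transcription step, the mapping above can be expressed as a character translation table. This is only a minimal sketch: the helper name transcribe and the sample string are made up for this example, and converting the input to upper case is an assumption about how the data might look.

```python
# Mapping from the assignment text: T -> A, A -> U, G -> C, C -> G
DNA_TO_RNA = str.maketrans("TAGC", "AUCG")


def transcribe(dna):
    """Return the RNA transcript of a DNA string (hypothetical helper)."""
    # upper() is an assumption: it lets lower-case input use the same table
    return dna.upper().translate(DNA_TO_RNA)


print(transcribe("TACGGCATT"))  # prints AUGCCGUAA
```

In this made-up sample, the result begins with the start codon AUG and ends with the stop codon UAA.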

For the assignment, we want you to write a Python script called "mrna.py" that

Reads and stores the DNA sequence in a string without the new lines ('\n') and without the FASTA header.
Takes the simplest reading frame, i.e., considers triplets from right to left (corrected from "left to right"), with the first triplet starting at index 0 of the (inverted) sequence.
Transcribes the sequence to RNA.
Creates a numpy array of (RNA) codons, i.e., a one-dimensional array of which each element is a string of three characters.
Finds the positions of all start codons (look up the numpy.where function).
Reads and prints out the mRNA sequence starting at each start codon and ending at a stop codon.
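
To make the steps above concrete, here is one possible sketch on a tiny made-up sequence. It is not a reference solution. All function names are hypothetical, the demo data is invented, "inverted" is assumed to mean the reversed string, and including the stop codon in the printed mRNA is also an assumption.

```python
import numpy as np

# Mapping from the assignment text: T -> A, A -> U, G -> C, C -> G
DNA_TO_RNA = str.maketrans("TAGC", "AUCG")


def read_fasta_sequence(lines):
    """Join FASTA lines into one string, skipping header lines starting with '>'."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def to_codons(rna, frame=0):
    """Split rna into a 1-D array of 3-character strings, starting at index frame."""
    # Drop trailing characters that do not form a complete triplet
    usable = len(rna) - (len(rna) - frame) % 3
    return np.array([rna[i:i + 3] for i in range(frame, usable, 3)], dtype="<U3")


def find_mrnas(codons, start="AUG", stop="UAA"):
    """Return one mRNA string per start codon that has a stop codon after it."""
    starts = np.where(codons == start)[0]
    stops = np.where(codons == stop)[0]
    mrnas = []
    for s in starts:
        later_stops = stops[stops > s]
        if later_stops.size:  # skip start codons with no stop codon after them
            mrnas.append("".join(codons[s:later_stops[0] + 1]))
    return mrnas


# Made-up FASTA content; a real script would read chromosome1.fa instead
demo_lines = [">demo header (made-up data)\n", "GGGTTACG\n", "GCATCCC\n"]
dna = read_fasta_sequence(demo_lines)
rna = dna[::-1].translate(DNA_TO_RNA)  # assumption: "inverted" = reversed string
codons = to_codons(rna)
print(find_mrnas(codons))  # prints ['AUGCCGUAA']
```

Passing a list of lines rather than a file name keeps the sketch testable without a file on disk; with a real file, the lines would come from iterating over the opened file.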

Having done that, the script should repeat the same steps for the other two forward reading frames, i.e., starting at index 1 and index 2.
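
The effect of the starting index can be seen on a short made-up RNA string. In this sketch the helper split_frame is a hypothetical name and plain Python lists are used instead of numpy arrays, purely to keep the illustration small.

```python
def split_frame(rna, frame):
    """Return the complete triplets of rna when reading starts at index frame."""
    # Stopping at len(rna) - 2 guarantees every slice has three characters
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]


rna = "GAUGCCGUAAC"  # made-up RNA string for illustration
for frame in range(3):
    print(frame, split_frame(rna, frame))
# prints:
# 0 ['GAU', 'GCC', 'GUA']
# 1 ['AUG', 'CCG', 'UAA']
# 2 ['UGC', 'CGU', 'AAC']
```

Only the frame starting at index 1 contains the codons AUG and UAA here, which is why all three frames need to be checked.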

As in the previous assignment, we expect your script to

contain and use at least two functions.
be well-commented and have docstrings.
have sensible names for variables and functions.

Hint:
While NumPy can handle arrays of strings, you need to know how to specify the correct dtype, and we did not mention this in class. For regular, 3-character strings, the dtype is "<U3" (although, should you want to work with bytes instead, then "S3" is the dtype).
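
A small sketch of what these dtypes look like in practice, using made-up codon values:

```python
import numpy as np

# Unicode strings of (at most) three characters
codons = np.array(["AUG", "CCG", "UAA"], dtype="<U3")
print(codons.dtype, codons.itemsize)

# The bytes alternative mentioned in the hint
codons_bytes = np.array([b"AUG", b"CCG", b"UAA"], dtype="S3")
print(codons_bytes.dtype, codons_bytes.itemsize)

# Element-wise comparison yields a boolean array that numpy.where can use
start_positions = np.where(codons == "AUG")[0]
print(start_positions)
```

Note that a bytes array has to be compared against a bytes value such as b"AUG", not against a regular string.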

Upload your file "mrna.py" to the 'Assignment Dropbox'. Assignments will be graded on a 100-point basis. The due date is September 24th, 2020 (midnight), with a 5-point penalty for each day late until the cut-off date of October 1, 2020.

Last Modified: Tuesday Sep 22, 2020 - 14:15. Revision: 9. Release Date: Thursday Sep 17, 2020 - 10:00.


Intro to Programming in Python for Biochemistry (Sept.2020) (Old site; new site is at https://scinet.courses)
