Intro to Computational BioStatistics with R  (Fall 2020) (Old site; new site is at https://scinet.courses)


6.3 Assignment 3

Due date: Tuesday, October 6th at midnight (Tuesday night).


For this assignment we will be working with real Covid-19 data coming from the City of Toronto. Details about the data can be found on the City of Toronto Covid-19 Open Data site.

1) The data is stored in the tarball file located here: https://support.scinet.utoronto.ca/~ejspence/COVID19.2020.tar.gz.  This file contains 6 CSV files (comma separated values), one for each of the months March - August, 2020.

To download and uncompress the data set, use the following commands at the Linux command line:


[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment3
[ejspence.mycomp] curl -O https://support.scinet.utoronto.ca/~ejspence/COVID19.2020.tar.gz
[ejspence.mycomp] ls
COVID19.2020.tar.gz
[ejspence.mycomp] tar -zxf COVID19.2020.tar.gz
[ejspence.mycomp] ls
COVID19.2020.03.csv  COVID19.2020.06.csv
COVID19.2020.04.csv  COVID19.2020.07.csv
COVID19.2020.05.csv  COVID19.2020.08.csv
COVID19.2020.tar.gz
[ejspence.mycomp]

The files contain the Toronto Covid-19 data for the months March through August, 2020.

Note that it is a good idea to do some initial exploration of the data before you proceed to the next section: read the data into R, use str() to examine the names of the columns, and look at the first few entries.
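
A minimal sketch of this kind of first look, run on a tiny stand-in file rather than the real data. The column names and values below are hypothetical and exist only to make the example runnable; the actual names should be read off the output of str().

```r
# Write a tiny stand-in CSV to a temporary file so the sketch runs anywhere.
# The column names and values are hypothetical, not those of the real data.
demo.file <- tempfile(fileext = ".csv")
writeLines(c("Age.Group,Source.of.Infection,Outcome",
             "40-49 Years,Travel,RESOLVED",
             "20-29 Years,Community,RESOLVED",
             "80-89 Years,Healthcare,FATAL"),
           demo.file)

# Read the file into a data frame.
demo.data <- read.csv(demo.file)

# str() lists the column names, their types, and a preview of the values.
str(demo.data)

# head() shows the first few rows (here, the first two).
print(head(demo.data, 2))
```

For the assignment data, the same calls would be applied to one of the extracted CSV files instead of the temporary file.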


2) Create a utilities file, called Covid.Utilities.R. This file will contain functions which perform the steps outlined in parts 3a) - 3d). As always, be sure to properly comment your functions, use sensible variable and function names, and follow good coding practices.

Note that only the function for part 3a), the function which reads the data, should return a value. All other functions are only charged with outputting information to the screen. As such, these functions do not need to return anything.

Finally, also note that defensive programming of your functions is not required for this assignment. We will, however, require defensive programming of the script in part 3.


3) Write an R script, called process.Covid.R, which takes an argument from the command line, indicating the name of the file which contains the data to be examined. The script should call functions which perform the following steps.

  a) Receives an argument indicating which file to read, puts the file's data into a data frame, prints the name of the file being processed, and returns the data.
  b) Calculates and prints the total number of patients per source of infection. For this you will need to find a way to automatically identify the different sources of infection (do not hard-code the sources!), and loop over them to compute the total number for each source. A useful function for this is unique(). Use help() and example() to learn how to use this function.
  c) Calculates and prints the number of confirmed cases in the 40-49 years age group.
  d) Calculates and prints the neighbourhood with the most fatalities, and the number of fatalities. In the case of a tie, print out the first entry you encounter. For this question, depending on your strategy, functions which might be helpful include table() (to perform a frequency analysis on data), sort() (to sort things), and names() (to get the names from your table).
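
The helper functions named above can be tried out on a toy vector before applying them to the Covid-19 data. The following is only a sketch of how unique(), table(), sort(), and names() behave; the data are made up and unrelated to the assignment files.

```r
# Toy data; the values are made up for illustration.
fruit <- c("apple", "pear", "apple", "fig", "pear", "apple")

# unique() returns each distinct value once, in order of first appearance.
kinds <- unique(fruit)

# Loop over the distinct values; summing a logical vector counts the matches.
for (kind in kinds) {
  cat("    ", kind, " -- ", sum(fruit == kind), "\n")
}

# table() tallies every value at once; sort() orders the tally.
counts <- sort(table(fruit), decreasing = TRUE)

# names() gives the labels of the tally; the first one is the most frequent.
top.name  <- names(counts)[1]
top.count <- counts[[1]]
cat("Most frequent:", top.name, "with", top.count, "\n")
```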

Your script should output something like this, when run from the bash terminal:


[ejspence.mycomp] Rscript process.Covid.R
Error
We require a filename to process as a command line argument.
Execution halted
[ejspence.mycomp] Rscript process.Covid.R COVID19.2020.05.csv
Processing data from file:  COVID19.2020.05.csv
Total number of patients per infection source
     N/A - Outbreak associated  --  1828
     Community  --  739
     Institutional  --  92
     Unknown/Missing  --  141
     Close contact  --  1997
     Healthcare  --  268
     Pending  --  27
     Travel  --  39
The number of confirmed cases in the 40-49 years age group is 708
The neighbourhood with the most fatalities is  Birchcliffe-Cliffside with 26 fatalities.
----------------------------------------------------------------------------
[ejspence.mycomp]

Note that the function associated with part 3b) is the only function that should have a loop. All other questions should be answered using slicing. Also note that defensive programming should be used to protect your script from being run with no arguments. A good function to consider for defensive programming is the file.exists() function, to confirm that the requested CSV file exists before attempting to read it.
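
One possible shape for such checks is sketched below. The helper name check.arguments is hypothetical, and the "File not found" message is an assumption; only commandArgs(), file.exists(), stop(), and tryCatch() are standard R. The checks are wrapped in a function and demonstrated with tryCatch() so that the sketch itself does not halt.

```r
# Hypothetical helper: validate the command-line arguments of a script.
# Returns the filename if everything is fine; otherwise stops with an error.
check.arguments <- function(args) {
  # No arguments at all: nothing to process.
  if (length(args) < 1) {
    stop("We require a filename to process as a command line argument.")
  }
  # The requested file must exist before we try to read it.
  if (!file.exists(args[1])) {
    stop("File not found: ", args[1])   # message wording is an assumption
  }
  args[1]
}

# In a script run with Rscript, the arguments after the script name come from
#   args <- commandArgs(trailingOnly = TRUE)
# and could then be passed to the helper above.

# Demonstration: catch the error so we can inspect its message.
result <- tryCatch(check.arguments(character(0)),
                   error = function(e) conditionMessage(e))
print(result)
```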

Note the following code, which may inspire your answers for some of the above sections:

 
> a <- 1:10
> a > 7
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
> sum(a > 7)
[1] 3
>
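
The logical-vector idea above carries over to data frames: a comparison on a column gives one TRUE or FALSE per row, which can be used to slice rows or be summed to count them. A small sketch on a made-up data frame (the column names and values are hypothetical):

```r
# Toy data frame; column names and values are made up.
pets <- data.frame(kind = c("cat", "dog", "cat", "bird"),
                   age  = c(3, 7, 12, 2))

# Comparing a column to a value gives a logical vector, one entry per row.
is.cat <- pets$kind == "cat"

# Slicing with the logical vector keeps only the rows where it is TRUE.
cats <- pets[is.cat, ]
print(cats)

# Conditions can be combined with &, and sum() counts the TRUE entries.
n.old.cats <- sum(is.cat & pets$age > 10)
print(n.old.cats)
```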


4) Finally, write a shell script named "processALLmonths.sh" that loops over all the Covid-19 CSV files in your directory and calls the previous R script so that all the months are processed sequentially.  The following is the skeleton of a 'for' loop in bash.  This code should inspire your shell script.


for filename in *csv
do
   echo $filename
done 

Starting from this skeleton, remove and add commands as needed so that the script runs your R script on each of the COVID19.2020.XX.csv files. Note that for this to work, the CSV files, the R script, and the shell script must all be in the same directory!


Submit your Covid.Utilities.R, process.Covid.R, and processALLmonths.sh code to the Assignment Dropbox.


Assignments will be graded on a 10-point basis. The due date is October 6th, 2020 (midnight), with a submission cutoff of October 13th at 12:00pm.


Last Modified: Friday Oct 2, 2020 - 13:06. Revision: 27. Release Date: Tuesday Sep 29, 2020 - 12:00.

