Due date: Tuesday, October 6th at midnight (Tuesday night).
For this assignment we will be working with real Covid-19 data coming from the City of Toronto. Details about the data can be found on the City of Toronto Covid-19 Open Data site.
1) The data is stored in a tarball located here: https://support.scinet.utoronto.ca/~ejspence/COVID19.2020.tar.gz. This file contains 6 CSV (comma-separated values) files, one for each of the months March - August, 2020.
To download and uncompress the data set, use the following commands at the Linux command line:
[ejspence.mycomp]
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment3
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://support.scinet.utoronto.ca/~ejspence/COVID19.2020.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
COVID19.2020.tar.gz
[ejspence.mycomp] tar -zxf COVID19.2020.tar.gz
[ejspence.mycomp] ls
COVID19.2020.03.csv COVID19.2020.06.csv
COVID19.2020.04.csv COVID19.2020.07.csv
COVID19.2020.05.csv COVID19.2020.08.csv
COVID19.2020.tar.gz
[ejspence.mycomp]
The files contain the Toronto Covid-19 data for the months March through August, 2020.
Note that it is a good idea to do some initial exploration of the data (read the data into R, use str() to examine the names of the columns, and look at the first few entries) before you proceed to the next section.
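The exploration step suggested above could look something like the following sketch. It writes a tiny stand-in CSV file to a temporary location so that it runs anywhere; the column names and values are hypothetical and are not guaranteed to match the real Toronto data, so substitute one of the COVID19.2020.XX.csv files when exploring for real.

```r
# Toy stand-in for one monthly file. The column names and values are
# hypothetical and may not match the real data set.
toy <- data.frame(
  Source.of.Infection = c("Travel", "Community", "Travel"),
  Age.Group = c("40 to 49 Years", "20 to 29 Years", "40 to 49 Years"),
  stringsAsFactors = FALSE
)
toy.file <- tempfile(fileext = ".csv")
write.csv(toy, toy.file, row.names = FALSE)

# Read the CSV file into a data frame.
covid.data <- read.csv(toy.file, stringsAsFactors = FALSE)

# str() shows the column names, their types, and the first few values.
str(covid.data)

# head() prints the first few rows.
head(covid.data, n = 2)

# names() returns just the column names.
names(covid.data)
```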
2) Create a utilities file, called Covid.Utilities.R. This file will contain functions which perform the steps outlined in parts 3a) - 3d). As always, be sure to properly comment your functions, use sensible variable and function names, and follow good coding practices.
Note that only the function for part 3a), the function which reads the data, should return a value. All other functions are only charged with outputting information to the screen. As such, these functions do not need to return anything.
Finally, also note that defensive programming of your functions is not required for this assignment. We will, however, require defensive programming of the script in part 3.
3) Write an R script, called process.Covid.R, which takes an argument from the command line indicating the name of the file which contains the data to be examined. The script should call functions which perform the following steps.
a) Receives an argument indicating which file to read, puts the file's data into a data frame, prints the name of the file being processed, and returns the data.
b) Calculates and prints the total number of patients per source of infection. For this you will need to find a way to automatically identify the different sources of infection (do not hard-code the sources!) and loop over them to compute the total number for each source. A useful function for this is unique(). Use help() and example() to learn how to use this function.
c) Calculates and prints the number of confirmed cases in the 40-49 years age group.
d) Calculates and prints the neighbourhood with the most fatalities, and the number of fatalities. In the case of a tie, print out the first entry you encounter. For this question, depending on your strategy, functions which might be helpful include table() (to perform a frequency analysis on data), sort() (to sort things), and names() (to get the names from your table).
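To get a feel for the helper functions mentioned above, here is a small sketch on a made-up character vector (the vector `fruit` and all variable names are hypothetical, and this is not the assignment's data). It shows unique() feeding a loop, and table(), sort() and names() used together to find the most frequent value.

```r
# Hypothetical toy vector; not taken from the real data.
fruit <- c("apple", "pear", "apple", "plum", "pear", "apple")

# unique() returns the distinct values, in order of first appearance.
kinds <- unique(fruit)

# Loop over the distinct values and count the matches for each one.
for (kind in kinds) {
  cat(kind, "--", sum(fruit == kind), "\n")
}

# table() counts every distinct value in one call.
counts <- table(fruit)

# sort() with decreasing = TRUE puts the most frequent value first.
ranked <- sort(counts, decreasing = TRUE)

# names() recovers the labels; [[1]] extracts the first count.
top.name <- names(ranked)[1]
top.count <- ranked[[1]]
cat("Most frequent:", top.name, "with", top.count, "entries\n")
```

Note that table() lists its entries in sorted order rather than in order of first appearance, so if ties matter it is worth checking which entry ends up first.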
Your script should output something like this, when run from the bash terminal:
[ejspence.mycomp]
[ejspence.mycomp] Rscript process.Covid.R
Error: We require a filename to process as a command line argument.
Execution halted
[ejspence.mycomp]
[ejspence.mycomp] Rscript process.Covid.R COVID19.2020.05.csv
Processing data from file: COVID19.2020.05.csv
Total number of patients per infection source:
N/A - Outbreak associated -- 1828
Community -- 739
Institutional -- 92
Unknown/Missing -- 141
Close contact -- 1997
Healthcare -- 268
Pending -- 27
Travel -- 39
The number of confirmed cases in the 40-49 years age group is 708
The neighbourhood with the most fatalities is Birchcliffe-Cliffside with 26 fatalities.
----------------------------------------------------------------------------
[ejspence.mycomp]
Note that the function associated with part 3b) is the only function that should have a loop. All other questions should be answered using slicing. Also note that defensive programming should be used to protect your script from being run with no arguments. A good function to consider for defensive programming is the file.exists() function, to confirm that the requested CSV file exists before attempting to read it.
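One possible shape for these checks is sketched below. The helper name `check.arguments` and the wording of the second error message are assumptions made for illustration; in a script run with Rscript, the argument vector would typically come from commandArgs(trailingOnly = TRUE).

```r
# Hypothetical helper: validates a vector of command-line arguments and
# returns the filename. The function name and the second error message
# are assumptions, not part of the assignment.
check.arguments <- function(args) {
  # Guard against the script being run with no arguments.
  if (length(args) < 1) {
    stop("We require a filename to process as a command line argument.",
         call. = FALSE)
  }
  # Guard against a filename that does not exist.
  if (!file.exists(args[1])) {
    stop("The file ", args[1], " does not exist.", call. = FALSE)
  }
  args[1]
}

# In a real script the arguments would be obtained with:
# args <- commandArgs(trailingOnly = TRUE)

# Demonstration with a temporary file that does exist.
existing.file <- tempfile(fileext = ".csv")
file.create(existing.file)
check.arguments(existing.file)

# Demonstration with no arguments; tryCatch() captures the error message
# so that the demonstration itself does not halt.
result <- tryCatch(check.arguments(character(0)),
                   error = function(e) conditionMessage(e))
print(result)
```

Using call. = FALSE in stop() keeps the name of the calling function out of the error message.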
Note the following code, which may inspire your answers for some of the above sections:
>
> a <- 1:10
>
> a > 7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>
> sum(a > 7)
[1] 3
>
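The same idea carries over to the columns of a data frame. The following sketch uses a made-up data frame (the name `pets` and its columns are hypothetical) to show how a logical vector can be used both for counting and for slicing.

```r
# Hypothetical toy data frame; names and values are illustrative only.
pets <- data.frame(
  species = c("cat", "dog", "cat", "bird", "cat"),
  age = c(2, 7, 9, 1, 4),
  stringsAsFactors = FALSE
)

# A logical vector marks the rows that satisfy a condition.
is.cat <- pets$species == "cat"

# sum() counts the TRUE entries.
sum(is.cat)

# Conditions can be combined with & before counting.
sum(is.cat & pets$age > 3)

# The same logical vector can slice the rows of the data frame.
pets[is.cat, ]
```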
4) Finally, write a shell script named "processALLmonths.sh" that loops over all the Covid-19 CSV files in your directory and calls the previous R script so that all the months are processed sequentially. The following is the skeleton of a 'for' loop in bash. This code should inspire your shell script.
for filename in *csv
do
echo $filename
done
Start with this, then remove and add commands as necessary so that the script executes your R script for all the COVID19.2020.XX.csv files. Note that for this to work you must have all the CSV files, the R script and the shell script in the same directory!
Submit your Covid.Utilities.R, process.Covid.R, and processALLmonths.sh code to the Assignment Dropbox.
Assignments will be graded on a 10-point basis. The due date is October 6th, 2020 (midnight); submissions are accepted until the cutoff of October 13th at 12:00pm.
Last Modified: Friday Oct 2, 2020 - 13:06. Revision: 27. Release Date: Tuesday Sep 29, 2020 - 12:00.