MSC1090 Introduction to Computational BioStatistics with R (Fall 2018): 5.4 Assignment 4

Due date: Thursday, October 18th at midnight (Thursday night).

0) You must use version control ("git") as you develop your scripts. We suggest that you start, from the Linux command line, by creating a new directory (e.g. assignment3), cd into that directory, initialize a git repository within it ("git init"), and then run "git add" and "git commit" repeatedly as you add to your scripts. You will hand in the output of "git log" for your assignment repository as part of the assignment. Your log must show a significant number of meaningful commits that track the changes to your scripts; if it does not, you will lose marks.
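The commit-as-you-go workflow described above might look like the following sketch. It runs in a throwaway directory created with mktemp, so it can be tried without touching real work. The identity, file contents and commit messages are placeholders, not part of the assignment.

```shell
# Sketch only: a scratch repository, so nothing real is modified.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

git init -q

# Placeholder identity for this scratch repository (an assumption;
# on your own machine this is normally configured once, globally).
git config user.name "Student Name"
git config user.email "student@example.com"

# First small step: create the script and commit it.
echo '# process311.R: first draft' > process311.R
git add process311.R
git commit -q -m "Start process311.R"

# Next small step: change the script, then add and commit again.
echo '# TODO: read the file name from the command line' >> process311.R
git add process311.R
git commit -q -m "Note the next step: read the file name argument"

# One line per commit, newest first.
git log --oneline
```

Each add/commit pair records one small, describable change, which is what makes the resulting log meaningful.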

For this assignment we will be working with real 311 Service Request data coming from the City of Toronto. Details about the data can be found on the City of Toronto 311 Service Requests page.

1) The data is stored in a tarball located here: https://support.scinet.utoronto.ca/~ejspence/T311_2010-2015.tar.gz. The tarball contains six CSV (comma-separated values) files.

To download and uncompress the data set, use the following commands at the Linux command line:

    [ejspence.mycomp] pwd
    /c/Users/ejspence/MSC1090/assignment3
    [ejspence.mycomp]
    [ejspence.mycomp] curl -O https://support.scinet.utoronto.ca/~ejspence/T311_2010-2015.tar.gz
    [ejspence.mycomp]
    [ejspence.mycomp] ls
    T311_2010-2015.tar.gz
    [ejspence.mycomp] tar -zxf T311_2010-2015.tar.gz
    [ejspence.mycomp] ls
    SR2010.csv SR2011.csv SR2012.csv SR2013.csv SR2014.csv SR2015.csv T311_2010-2015.tar.gz
    [ejspence.mycomp]

The files contain the Toronto 311 Service Request data for the years 2010 to 2015. Each file contains the data for the year given in its name, e.g. SR2010.csv, ..., SR2015.csv.

Note that it is a good idea to do some initial exploration of the data (read the data in, use str() to examine the names of the columns) before you proceed to the next section.

2) Write an R script, called process311.R, which performs the following steps.

a) Receives an argument from the command line indicating which file to read, and puts the file's data into a data frame.
b) Prints which file is being processed.
c) Calculates and prints the total number of service calls per city division. For this you will need to find a way to automatically identify the different divisions (do not hard-code the divisions!), and loop over them to compute the total number for each division. A useful function for this is unique(). Use help() and example() to get more information about it.
d) Calculates and prints the total number of service calls about dead animals on expressways.
e) Calculates and prints the ward with the most service calls from the "311" division, in September. For this question, depending on your strategy, functions which might be helpful include as.character() (to convert inputs to strings), substr() (to cut substrings out of strings), table() (to perform a frequency analysis on data), sort() (to sort things), and names() (to get the names from your table).

Your script should output something like this, when run from the shell terminal:

    $
    $ Rscript process311.R SR2010.csv
    Processing data from file: SR2010.csv
    Total number of service calls per division:
    Transportation Services -- 31904
    Toronto Water -- 48921
    Solid Waste Management Services -- 136808
    311 -- 1050
    Urban Forestry -- 16016
    Municipal Licensing & Standards -- 19507
    City of Toronto -- 12
    The number of reports of a dead animal on an expressway is 15
    The ward with the most "311 calls" (among all the service calls) in September was Trinity-Spadina (20)
    ----------------------------------------------------------------------------
    $

Note that part 2c) is the only part that should have a loop. All other questions should be answered using slicing.

Note the following code, which may inspire your answers for some of the above sections:

    >
    > a <- 1:10
    >
    > a > 7
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
    >
    > sum(a > 7)
    [1] 3
    >

3) Finally, write a shell script named "processALLyears.sh" that loops over all the CSV files in your directory and calls the previous R script so that all the years are processed sequentially. The following is the skeleton of a 'for' loop in bash. This code should inspire your shell script.

    for filename in *csv
    do
        echo $filename
    done

Starting from this skeleton, remove and add commands as needed so that the script runs your R script on each of the SR20XX.csv files. Note that for this to work, all the CSV files, the R script and the shell script must be in the same directory!
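To see how the file pattern and the loop variable behave before wiring in Rscript, a sketch along these lines may help. It creates empty stand-in files in a scratch directory (the real CSV files are not needed) and uses wc -l as a placeholder for whatever command is eventually run on each file. The narrower pattern SR20*.csv is one option, assumed here, for skipping any other CSV files that happen to be in the directory.

```shell
# Sketch only: scratch directory with empty stand-in files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch SR2010.csv SR2011.csv notes.csv

count=0
for filename in SR20*.csv
do
    # Quoting "$filename" keeps names containing spaces intact.
    echo "Would process: $filename"
    # Placeholder command: count the lines in the current file.
    wc -l < "$filename"
    count=$((count + 1))
done

# notes.csv does not match the pattern, so only two files are visited.
echo "Files matched: $count"
```

The loop body runs once per matching file, with filename set to that file's name each time.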

Submit your "process311.R" and "processALLyears.sh" scripts and the output of "git log" from your assignment repository to the 'Assignment Dropbox'. Both the R and shell scripts must be added and committed frequently to the repository.

To capture the output of "git log", use redirection (git log > git.log) and hand in the resulting "git.log" file.
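One way to check that the redirection did what was intended is sketched below. It uses a scratch repository with a single placeholder commit; in the real assignment directory only the last two commands are relevant.

```shell
# Sketch only: scratch repository with one placeholder commit.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
git init -q
git config user.name "Student Name"
git config user.email "student@example.com"
echo '# draft' > process311.R
git add process311.R
git commit -q -m "Placeholder commit"

# '>' sends the log to a file instead of the terminal,
# overwriting git.log if it already exists.
git log > git.log

# Show the first line of the captured log as a quick sanity check.
head -n 1 git.log
```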

Assignments will be graded out of 10 points.
The due date is October 18th 2018 (midnight). Late submissions lose 0.5 points per day, until the cut-off date of October 25th 2018 at 1:00pm.

Last Modified: Monday Oct 15, 2018 - 14:21. Revision: 44. Release Date: Thursday Oct 11, 2018 - 11:00.

