
Intro to Computational BioStatistics with R  (Fall 2018) (Old site; new site is at https://scinet.courses)


5.1 Assignment 1

Due date: Thursday, September 20th at 11:55 pm

Please note that all of the commands and techniques you need to solve this assignment were given in class.  No internet searches should be necessary to complete this assignment.  If you aren't sure where to start, review the class slides.


The purpose of this assignment is to practise your bash scripting skills on a real data set. Before you begin, be sure to create a new directory to hold your assignment, and move into that directory:

[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090 
[ejspence.mycomp]
[ejspence.mycomp] mkdir assignment1
[ejspence.mycomp] cd assignment1
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment1
[ejspence.mycomp]


Consider the following data set, which concerns the response of bipolar disorder patients to lithium treatments: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS5393. To make this data set available for examination, we now introduce two new bash commands:

  • curl: this command downloads files from a given internet address.
  • gunzip: this command uncompresses a gzipped file.
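
To see what gunzip does before using it on the real data, the following minimal sketch compresses and then uncompresses a small throwaway file. The file name demo.txt and the temporary directory are assumptions made up for this illustration; they are not part of the assignment. The gzip command (the counterpart of gunzip) is used only to create something to uncompress. No download is performed here; with curl, the -O flag saves the downloaded file under its remote name.

```shell
# Hypothetical demo file in a temporary directory (not part of the assignment).
workdir=$(mktemp -d)
printf 'line one\nline two\n' > "$workdir/demo.txt"

# gzip replaces demo.txt with the compressed file demo.txt.gz.
gzip "$workdir/demo.txt"
ls "$workdir"

# gunzip reverses this: demo.txt.gz is replaced by demo.txt.
gunzip "$workdir/demo.txt.gz"
ls "$workdir"
cat "$workdir/demo.txt"
```

Note that in both directions the original file is replaced rather than copied, which is why only one file shows up in each listing.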

To download and uncompress the data set, use the following commands at the Linux command line:

[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment1 
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://support.scinet.utoronto.ca/~mponce/courses/datasets/GDS5393.soft.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
GDS5393.soft.gz
[ejspence.mycomp] gunzip GDS5393.soft.gz
[ejspence.mycomp] ls
GDS5393.soft
[ejspence.mycomp]

The data is now ready to be analyzed.


If you look into the data file (try the 'less' command, and type 'q' to get out), you'll notice (once you get past the header information) that each subject of the study is identified with a character string ILMN_XXXXXXX, where XXXXXXX is a 7-digit number.

Using this information, write a shell script, called count.patients.sh, which

  1. takes a filename as an input argument,
  2. prints out the name of the input file,
  3. prints out the number of patients listed in the file (assuming the file has the patient-identification format of the aforementioned data file), and
  4. prints out the number of patients that do not have 'null' as one of the entries in their columns (meaning the patient has complete data).
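
As an illustration of the kind of counting involved, the sketch below works on a tiny made-up file rather than on the real data. The file contents, the variable name sample, and the choice of grep are assumptions for this example; the commands covered in class may suggest a different route, and this is not a complete count.patients.sh.

```shell
# Hypothetical toy file: two header lines followed by three records.
sample=$(mktemp)
cat > "$sample" <<'EOF'
!dataset_title = toy example
ID_REF VALUE_A VALUE_B
ILMN_0000001 1.5 2.5
ILMN_0000002 null 3.1
ILMN_0000003 0.7 0.9
EOF

# Count the lines that start with the identifier prefix.
grep -c '^ILMN_' "$sample"                   # prints 3

# Of those lines, count the ones that do not contain 'null'.
grep '^ILMN_' "$sample" | grep -vc 'null'    # prints 2
```

Here -c makes grep print a count instead of the matching lines, and -v inverts the match so that only lines without 'null' are counted.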

The script will be sourced from the command line, and should output as follows:

[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment1
[ejspence.mycomp] ls
count.patients.sh GDS5393.soft
[ejspence.mycomp]
[ejspence.mycomp] source count.patients.sh GDS5393.soft
Working with data file GDS5393.soft.
The total number of patients is 48107.
The number of patients with complete data is 47323.
[ejspence.mycomp]

Some points to consider:

  • Full points will be awarded for implementations which store the numbers of patients in local variables, before printing the output.
  • Do not "hard code" the answers. This means you should not have the numbers 48107 and 47323, nor the string "GDS5393.soft", anywhere in your script.
  • Mac users may find that there is extra white space around the numbers in their output sentences.  Do not worry about this; extra spaces within the sentences are not important.
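
The points above can be illustrated with a short sketch that stores a count in a variable before printing it, and takes the file name from an argument instead of hard-coding it. The function name report_lines and the temporary test file are hypothetical, chosen only for this example. One assumption to be aware of: the bash keyword local only works inside a function, so a script that is sourced directly would either wrap its work in a function like this or use ordinary variables.

```shell
# Hypothetical helper; inside a sourced script, "$1" would play the
# same role as the function argument does here.
report_lines() {
    local datafile="$1"
    local nlines
    # Command substitution stores the output of a command in a variable.
    # wc -l may pad its output with spaces on a Mac; tr -d ' ' removes them.
    nlines=$(wc -l < "$datafile" | tr -d ' ')
    echo "File ${datafile} has ${nlines} lines."
}

# Try it on a throwaway three-line file.
tmpfile=$(mktemp)
printf 'a\nb\nc\n' > "$tmpfile"
message=$(report_lines "$tmpfile")
echo "$message"
```

Because the file name and the count both live in variables, nothing specific to one data file appears in the code itself. The tr step is optional for this assignment, since extra spaces in the output are not penalized.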

Submit your count.patients.sh script to the 'Assignment Dropbox'.

Assignments will be graded on a 10-point basis.
The due date is September 20, 2018 at 11:55 pm, with a 0.5-point penalty per day for late submissions until the cut-off date of September 27, 2018 at 1:00 pm.

Last Modified: Sunday Sep 16, 2018 - 14:28. Revision: 18. Release Date: Wednesday Sep 12, 2018 - 17:00.

