MSC1090 Introduction to Clinical BioStatistics (Quantitative Applications for Data Analysis) (Fall 2017): 5.5 Assignment 5

Due date: Thursday, October 26th at midnight.

0) Be sure to use version control ("git") as you develop your script. Run "git add ...." and "git commit" repeatedly as you add to your script. You will hand in the output of "git log" for your assignment repository as part of the assignment.

Problem 1

The goal of this assignment (and the next one!) is not only to evaluate the knowledge you have acquired during the lectures, but also to guide you through a typical statistical analysis study.

We are often interested in studying the relationship among variables to determine whether there is any underlying association among them. When we think that changes in a variable X explain, or maybe even cause, changes in a second variable Y, we call X an explanatory (or independent) variable and Y a response (or dependent) variable. Moreover, if we plot these variables (X,Y) and the points roughly follow a straight line, this suggests a linear relationship between the two variables. The relationship is strong if all the data points are close to the line, and weak if the points are widely scattered about the line. The covariance and correlation both indicate the direction of a linear relationship between two quantitative variables; the correlation, being independent of the units of measurement, also measures its strength. A regression line is a mathematical model describing a linear relationship between an explanatory variable X and a response variable Y.
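For reference, the standard sample versions of these quantities, for n paired observations (x_i, y_i), can be written as below. The symbols s_XY, s_X, s_Y, r, b_0 and b_1 are notation chosen here for illustration and are not taken from the course material; s_X and s_Y denote the sample standard deviations of X and Y.

```latex
\begin{align*}
s_{XY} &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
  && \text{(sample covariance)}\\
r &= \frac{s_{XY}}{s_X\,s_Y}, \qquad -1 \le r \le 1
  && \text{(Pearson correlation coefficient)}\\
\hat{y} &= b_0 + b_1 x, \qquad b_1 = \frac{s_{XY}}{s_X^{2}}, \qquad b_0 = \bar{y} - b_1\bar{x}
  && \text{(least-squares regression line)}
\end{align*}
```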

The following are the steps you will initially follow when analyzing your data, and that you will also implement in this assignment:

Inspect the data graphically, to check for possible insights into their relationship.
Quantify this relationship by computing the appropriate statistical estimators (e.g. covariance and correlation between the variables). What can you conclude from these values?

A pediatrician wants to study the relationship between a child's height and head circumference (both measured in inches). The physician selects a random sample of 13 three-year-old children, obtaining the following data:

heights = 27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5, 27.85, 28.0
circ = 17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5, 16.9, 18.0

To answer the following questions, create an R script that receives an argument from the command line and, depending on its value, performs one of the actions described in points 1), 2) or 3). The script should also be modular, as much as you think is necessary. For instance, each part of this assignment could be a function: loading the data, computing correlations, executing the fits, etc. Put your functions in an auxiliary file called Utilities.R.
We also want you to implement defensive programming: if the argument is not 1, 2 or 3, the script should print a message letting the user know that only these options are possible, and then stop.
In addition to the commands in your script, include additional comments explaining your observations.
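
The defensive check described above could be sketched along the following lines. This is only a minimal illustration, not a reference solution: the function name `validate_choice` and the exact wording of the message are assumptions, while `commandArgs(trailingOnly = TRUE)` is the base R call that returns the arguments given after the script name.

```r
# Hypothetical helper (the name is an assumption): checks the command-line
# arguments and returns the selected option as an integer.
validate_choice <- function(args) {
  valid <- c("1", "2", "3")
  if (length(args) != 1 || !(args %in% valid)) {
    # call. = FALSE keeps the function call itself out of the error message
    stop("This script requires only one argument: 1, 2 or 3", call. = FALSE)
  }
  as.integer(args)
}

# In the main script, the arguments after the script name would be passed in:
#   choice <- validate_choice(commandArgs(trailingOnly = TRUE))

# Quick check with an accepted value
print(validate_choice("2"))
```

Keeping the check in a function that takes the arguments as a parameter, rather than reading them inside the function, also fits the requirement to avoid global variables.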

0) Create a function which loads the observations above, and puts them into an appropriate data structure, and then returns the data structure.
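
One possible shape for such a function is sketched below, assuming a data frame is an acceptable "appropriate data structure". The names `load_data` and `children` are hypothetical; the column names follow the variable names used in the data listing.

```r
# Hypothetical loader: returns the 13 observations as a data frame
load_data <- function() {
  heights <- c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5,
               27, 26.75, 26.75, 27.5, 27.85, 28.0)
  circ <- c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3,
            17.5, 17.3, 17.5, 17.5, 16.9, 18.0)
  data.frame(heights = heights, circ = circ)
}

children <- load_data()
str(children)  # shows the structure: 13 observations of 2 numeric variables
```

A data frame keeps each height paired with its head circumference and can be passed directly to modelling functions through their `data` argument.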

Your script should perform the following actions:

1) The following actions should be performed if the command line argument is a 1:

1.a) Print the correlation estimators for the dataset.
1.b) Implement a linear model to fit the data, and provide details of the fitted model.
1.c) Provide details of the model and a graphical representation of it together with the original data.
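
The three steps of option 1 might be sketched as follows. This is a hedged illustration rather than the expected solution: the helper names `report_correlation`, `fit_linear` and `plot_linear` are assumptions, and only base R functions (`cov`, `cor`, `cor.test`, `lm`, `summary`, `plot`, `abline`) are used.

```r
# Measurements copied from the assignment statement
children <- data.frame(
  heights = c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5,
              27, 26.75, 26.75, 27.5, 27.85, 28.0),
  circ    = c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3,
              17.5, 17.3, 17.5, 17.5, 16.9, 18.0)
)

# Hypothetical helper: prints covariance, correlation and the Pearson test
report_correlation <- function(df) {
  cat("Covariance:", cov(df$heights, df$circ), "\n")
  cat("Correlation coefficient:", cor(df$heights, df$circ), "\n")
  print(cor.test(df$heights, df$circ))
}

# Hypothetical helper: least-squares fit of circ as a linear function of heights
fit_linear <- function(df) {
  lm(circ ~ heights, data = df)
}

# Hypothetical helper: scatter plot of the data with the fitted line on top
plot_linear <- function(df, model) {
  plot(df$heights, df$circ,
       xlab = "Height (inches)", ylab = "Head circumference (inches)",
       main = "Linear fit")
  abline(model, col = "red")
}

report_correlation(children)
linear_fit <- fit_linear(children)
print(summary(linear_fit))
plot_linear(children, linear_fit)
```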

2) The following actions should be performed if the command line argument is a 2:
2.a) Print the correlation estimators for the dataset.
2.b) Implement a quadratic model to fit the data, and provide details of the model.
2.c) Generate a plot comparing the quadratic model with the original data.
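
A quadratic fit can still be obtained with `lm`, because the model remains linear in its coefficients. The sketch below is one way to do it under that assumption; the helper names `fit_quadratic` and `plot_quadratic` are hypothetical.

```r
# Measurements copied from the assignment statement
children <- data.frame(
  heights = c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5,
              27, 26.75, 26.75, 27.5, 27.85, 28.0),
  circ    = c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3,
              17.5, 17.3, 17.5, 17.5, 16.9, 18.0)
)

# Hypothetical helper: I() makes ^ mean arithmetic squaring inside the formula
fit_quadratic <- function(df) {
  lm(circ ~ heights + I(heights^2), data = df)
}

# Hypothetical helper: data points plus the fitted curve on a fine grid
plot_quadratic <- function(df, model) {
  plot(df$heights, df$circ,
       xlab = "Height (inches)", ylab = "Head circumference (inches)",
       main = "Quadratic fit")
  grid <- data.frame(heights = seq(min(df$heights), max(df$heights),
                                   length.out = 100))
  lines(grid$heights, predict(model, newdata = grid), col = "blue")
}

quadratic_fit <- fit_quadratic(children)
print(summary(quadratic_fit))
plot_quadratic(children, quadratic_fit)
```

The curve is drawn from predictions on an evenly spaced grid because the observed heights are neither sorted nor dense enough to give a smooth line.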

3) The following actions should be performed if the command line argument is a 3:
3.a) Print the correlation estimators for the dataset.
3.b) Implement both the linear and quadratic models to fit the data, and provide details for both models.

3.c) Generate a plot comparing the quadratic model with the linear model and the original data.
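
For the combined plot, both fitted models can be drawn over the same scatter plot and told apart with a legend. A minimal sketch, in which the function name `plot_both`, the colours and the line types are all arbitrary choices:

```r
# Measurements copied from the assignment statement
children <- data.frame(
  heights = c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5,
              27, 26.75, 26.75, 27.5, 27.85, 28.0),
  circ    = c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3,
              17.5, 17.3, 17.5, 17.5, 16.9, 18.0)
)

linear_fit    <- lm(circ ~ heights, data = children)
quadratic_fit <- lm(circ ~ heights + I(heights^2), data = children)

# Hypothetical helper: original data, straight line and quadratic curve together
plot_both <- function(df, linear_model, quadratic_model) {
  plot(df$heights, df$circ,
       xlab = "Height (inches)", ylab = "Head circumference (inches)",
       main = "Linear vs. quadratic fit")
  abline(linear_model, col = "red", lty = 1)
  grid <- data.frame(heights = seq(min(df$heights), max(df$heights),
                                   length.out = 100))
  lines(grid$heights, predict(quadratic_model, newdata = grid),
        col = "blue", lty = 2)
  legend("topleft", legend = c("Linear", "Quadratic"),
         col = c("red", "blue"), lty = c(1, 2))
}

plot_both(children, linear_fit, quadratic_fit)
```

All objects the function needs are passed in as arguments, so it does not rely on any global variable.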

Some notes to follow when implementing your script:

OBSERVATION #1: Do not use global variables; i.e., pass arguments to the functions you create, otherwise you will lose marks!

OBSERVATION #2: You will notice that when running the R script from the command line, the plots will not be shown, but instead saved in a file named Rplots.pdf in the directory from which the script is run.
This is the default way in which R deals with plots when running in batch mode, and it is totally acceptable for this assignment.
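
If a specific file name is preferred over the default, the graphics device can be opened explicitly. The sketch below is optional and purely illustrative: the file name `height_vs_circ.pdf` and the three plotted points are made up for the example.

```r
out_file <- "height_vs_circ.pdf"  # hypothetical file name

pdf(out_file)                     # open a PDF graphics device
plot(c(25, 26, 27), c(17.0, 17.3, 17.5),  # toy points, for illustration only
     xlab = "Height (inches)", ylab = "Head circumference (inches)")
invisible(dev.off())              # close the device so the file is written
```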

Examples:
$ Rscript generateModels.R
Error: This scripts requires only one argument: 1, 2 or 3

$ Rscript generateModels.R 0
Error: This scripts requires only one argument: 1, 2 or 3

$ Rscript generateModels.R 1 2
Error: This scripts requires only one argument: 1, 2 or 3

$ Rscript generateModels.R 1
-------------
Computing correlation indicators...
Covariance: 0.2147115
Correlation coefficient: 0.6175882
Correlation Test:
Pearson's product-moment correlation
data: x and y
t = 2.6043, df = 11, p-value = 0.0245
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.01009567 0....871863
sample estimates:
      cor
0.6175882
---------------
Fitting a Linear Model
Call:
lm(formula = circ ~ heights, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.63868 -0.05173 0.01895 0.10128 0.43662

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.95292    1.68832   7.672 9.7e-06 ***
heights      0.16466    0.06323   2.604   0.0245 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2501 on 11 degrees of freedom
Multiple R-squared: 0.3814,   Adjusted R-squared: 0.3252
F-statistic: 6.783 on 1 and 11 DF, p-value: 0.0245
---------------

Submit your generateModels.R script file and Utilities.R file, and the output of "git log" from your assignment repository, to the 'Assignment Dropbox'.

To capture the output of 'git log' use redirection, as described in lecture 2 (git log > git.log, and hand in the "git.log" file).

Assignments will be graded on a 10 point basis.
The due date is October 26th, 2017 (midnight), with a penalty of 0.5 points per day for late submissions until the cut-off date of November 2nd, 2017, at 11:00am.

Last Modified: Monday Oct 23, 2017 - 19:31. Revision: 29. Release Date: Wednesday Oct 18, 2017 - 16:00.

