STAT4620 Assignment 3

STAT4620/5620 WINTER 2023
Assignment 3: Due Thursday, March 2, 2023
1. Suppose that you are interested in studying intravenous drug use among high
school students in Canada. Drug use is characterized as a binary random variable,
where 1 indicates that an individual has injected drugs within the past year and
0 that he/she has not. Covariate information related to drug use includes: information
about drug use provided in school (y/n), age of student (years), employed
part-time (y/n), school connectedness (Likert scale), and gender (m/f).
(a) [3pts] Propose and defend a suitable model for the aforementioned data. Be
sure to write down the model equation.
(b) [2pts] Discuss any potential interactions that might be worthwhile including in
your model and provide justification as to why (or why not).
(c) [1pt] Which R package(s) would you use to fit the above model?
(d) [2pts] What tools would you use to assess model fit and proceed with variable
selection?
2. [10pts] Install the R package faraway. Consider the esdcomp data that were recorded
on 44 doctors working in an emergency service at a hospital to study the factors
affecting the number of complaints received. Build a model for the number of
complaints received, justify your choices, and report your conclusions. (250 words).
3. [10pts] The bootstrap is a general tool for assessing uncertainty. Describe the bootstrap
in general and then use it to investigate a statistic of relevance to the dataset
you have selected for your project. Take advantage of the functions available in the
R package bootstrap and be sure to include your references. (500 words).
4. [5pts] Cross-validation is probably the simplest and most widely used method for
estimating prediction error. Ideally if we had enough data, we would set aside a
validation set and use it to assess the performance of our model. Since data are
sometimes scarce, this may not always be possible. We finesse this problem by
using K-fold cross-validation. Explain. (150 words).
5. For the analysis of count (or semicontinuous) data there are models available to
deal with the common situation where there is an excessive number of zeros.
(a) [5pts] Discuss the various potential sources of zeros. (150 words).
(b) [8pts] Describe mixture and two-part models and show how their formulations
handle different types of zeros. (250 words).
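
ILLUSTRATION FOR QUESTION 4:
The K-fold idea described in question 4 can be sketched as follows. This is only a
minimal illustration of the fold bookkeeping, written in plain Python rather than R,
and it is not a model answer. The function names (k_fold_indices, cross_validate,
fit_mean, predict_mean), the toy data, and the choice of squared error as the loss
are all assumptions made for this example; none of them come from the assignment.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Taking every k-th index gives fold sizes that differ by at most one.
    return [idx[i::k] for i in range(k)]


def cross_validate(x, y, k, fit, predict, seed=0):
    """Return the K-fold estimate of mean squared prediction error."""
    folds = k_fold_indices(len(x), k, seed)
    total_sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        # Fit on the other k-1 folds, then predict the held-out fold.
        train = [i for i in range(len(x)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            total_sq_err += (y[i] - predict(model, x[i])) ** 2
    # Every observation is held out exactly once, so divide by n.
    return total_sq_err / len(x)


# Hypothetical "model": predict the training mean of y, ignoring x.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_new):
    return model


# Toy data, invented for illustration only.
x = list(range(10))
y = [2.0 * v for v in x]
cv_mse = cross_validate(x, y, k=5, fit=fit_mean, predict=predict_mean)
print(cv_mse)
```

Each observation serves as validation data exactly once and as training data K-1
times, which is how the procedure reuses a scarce dataset for both purposes.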
GUIDELINES FOR SUBMISSION:
Submit the R Markdown file (.Rmd), the .csv file containing your datasets, AND the resulting
knitted .PDF file to BrightSpace Assignments under Assignment 3.