ETC3250 exam 2019: Statistics
QUESTION 1
Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method, for each of the following. Justify your answer.
(a) The sample size n is extremely large, and the number of predictors p is small. [2 marks]
(b) The number of predictors p is extremely large, and the number of observations n is small. [2 marks]
(c) The relationship between the predictors and response is highly non-linear. [2 marks]
(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high. [2 marks]
[Total: 8 marks]
QUESTION 2
Answer the questions on the following data:
(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0. Write it into the table. [3 marks]
(b) What is our prediction with K = 1? Why? [2 marks]
(c) What is our prediction with K = 3? [2 marks]
(d) What is our prediction with K = 5? [2 marks]
[Total: 9 marks]
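A minimal sketch of the procedure behind parts (a) to (d): compute the Euclidean distance from each observation to the test point X1 = X2 = X3 = 0, then take a majority vote among the K nearest observations. The six observations, the class labels "A" and "B", and the function names are hypothetical placeholders, not the exam's data.

```python
import math
from collections import Counter

# Hypothetical observations (X1, X2, X3, label); NOT the exam's table.
observations = [
    (1, 0, 0, "B"),
    (0, 1, 1, "A"),
    (1, 1, 1, "A"),
    (0, 2, 0, "B"),
    (2, 2, 1, "B"),
    (3, 1, 0, "A"),
]
test_point = (0, 0, 0)


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(data, point, k):
    """Majority vote among the k observations closest to `point`."""
    ranked = sorted(data, key=lambda row: euclidean(row[:3], point))
    votes = Counter(row[3] for row in ranked[:k])
    return votes.most_common(1)[0][0]


for k in (1, 3, 5):
    print(k, knn_predict(observations, test_point, k))
```

With these made-up values the prediction changes with K, which illustrates why the choice of K matters: a small K follows the single closest point, while a larger K averages over a wider neighbourhood.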
QUESTION 3
(a) In the projection from a tour of the chocolates data below, the main pattern in the data that can be seen is a difference between dark (green) and milk (orange) chocolates. Which of the following variables can be seen to contribute most to this pattern? (Circle one) [3 marks]
Fiber TotalFat Cholesterol Protein
(b) From the parallel coordinate plot below, which variables are most important for distinguishing between dark (green) and milk (orange) chocolates? (Circle them) [3 marks]
Na Fiber TotalFat CalFat Chol Sugars Carbs SatFat Protein Calories
(c) Would it be appropriate to say that both milk and dark chocolates have a similar amount of calories, based on the parallel coordinate plot? Yes or No. [2 marks]
[Total: 8 marks]
QUESTION 4
This is a summary of the principal component analysis for the dark chocolates in the data. Standardised nutritional variables are used. There are 56 observations. The last row of the table is the cumulative proportion of variance.
(a) Fill in the values where there are question marks. [3 marks]
(b) Compute the total variance. [2 marks]
(c) Make a scree plot for the results. [4 marks]
(d) How many principal components would you suggest be used to reduce the dimensionality of this data? Justify your answer. [3 marks]
(e) Interpret the first principal component. (What variables is it mostly composed of?) [3 marks]
(f) Below are two biplots. One summarises the PCA for the dark chocolates, and the other summarises the PCA for the milk chocolates.
i. The two biplots are different. What does this imply about the variance-covariance matrices for the two groups? [3 marks]
ii. What is an obvious problem with PCA on the milk chocolates? [3 marks]
(g) TRUE or FALSE. The variance-covariance matrix computed on all of the data is typically the same as pooling (averaging) the variance-covariance matrices computed separately on each group. [2 marks]
[Total: 23 marks]
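Parts (a) and (b) rest on two facts: a component's proportion of variance is its variance (the squared standard deviation) divided by the total variance, and with standardised variables the total variance equals the number of variables. The sketch below shows the arithmetic. The four component variances are hypothetical, chosen only so that they sum to 4, and are not taken from the exam's table.

```python
import math

# Hypothetical component variances for p = 4 standardised variables
# (illustrative values only; they are not taken from the exam's table).
variances = [2.0, 1.2, 0.6, 0.2]
std_devs = [math.sqrt(v) for v in variances]  # what a PCA summary reports

# With standardised variables, total variance equals the number of variables.
total_variance = sum(sd ** 2 for sd in std_devs)

# Proportion of variance explained by each component.
proportions = [sd ** 2 / total_variance for sd in std_devs]

# Cumulative proportion: running sum of the proportions.
cumulative = []
running = 0.0
for prop in proportions:
    running += prop
    cumulative.append(running)

for i, (sd, prop, cum) in enumerate(zip(std_devs, proportions, cumulative), 1):
    print(f"PC{i}: sd={sd:.3f} proportion={prop:.3f} cumulative={cum:.3f}")
```

A scree plot, as in part (c), is the component variances (or proportions) plotted against the component number.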
QUESTION 5
Both of the plots below show different views of the same data, with differences between the two groups (circles, triangles).
(a) If you were to choose two variables for splitting the two groups, which would you choose, Var 2 or Var 3, in association with Var 1? Explain. [2 marks]
(b) A decision tree is fit to the data, using the rpart library. This is the tree:
n= 249

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 249 98 3 (0.39 0.61)
  2) Var1>=1.1e+03 98 0 2 (1.00 0.00) *
  3) Var1< 1.1e+03 151 0 3 (0.00 1.00) *
i. How many observations are in the data? [2 marks]
ii. What is the predicted value for the split at node 2? [2 marks]
iii. What is the error for the model? [2 marks]
iv. How many terminal nodes are there? [2 marks]
(c) A random forest model is fit to the same data, and variable importance is calculated as follows:
i. Which variables are the most important? [2 marks]
ii. Var 3 in conjunction with Var 1 produces a big gap between the two groups (as seen from the plot in part (a)). Why doesn’t Var 3 show up as being an important variable in the random forest model? [3 marks]
(d) Sketch what you think the boundary between the two groups in Var 1 and Var 3 might be if a radial kernel SVM classifier is used. [2 marks]
[Total: 17 marks]
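As a sketch of how the rpart printout in part (b) can be read: each node line gives the node number, the split, n, the loss (number of misclassified cases), and the predicted class, and a trailing * marks a terminal node. The Python below copies the three node lines from the printout into tuples by hand (the tuple layout and variable names are assumptions for illustration) and computes the training error as the total loss over the terminal nodes divided by n.

```python
# Node lines transcribed from the rpart printout in part (b):
# (node, split, n, loss, yval, is_terminal). The tuple layout is illustrative.
nodes = [
    (1, "root", 249, 98, 3, False),
    (2, "Var1>=1.1e+03", 98, 0, 2, True),
    (3, "Var1< 1.1e+03", 151, 0, 3, True),
]

n_total = nodes[0][2]
terminal = [node for node in nodes if node[5]]

# Training (resubstitution) error: misclassified cases in the leaves over n.
training_error = sum(node[3] for node in terminal) / n_total

print("observations:", n_total)
print("terminal nodes:", len(terminal))
print("training error:", training_error)
```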
QUESTION 6
A (feed forward back propagation) neural network can be written as a nested regression model:
(1)
Let and f is a logistic function, .
The model was fitted to a data set with 2 variables, and 4 nodes in the hidden layer were used, yielding these coefficients:
(a) What is the value of s in the fitted model? [2 marks]
(b) What is the value of p in the fitted model? [2 marks]
(c) Make a sketch of the network diagram for this data. [3 marks]
(d) Write out the equation for the logistic regression at the first node of the hidden layer. [3 marks]
(e) Generally, in relation to logistic regression, show that the logistic function
can be re-arranged into
[4 marks]
[Total: 14 marks]
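A sketch of the rearrangement asked for in part (e), assuming the usual single-predictor form of the logistic function with coefficients β0 and β1. This notation is an assumption and may differ from the exam's own.

```latex
\begin{align*}
p(X) &= \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \\
1 - p(X) &= \frac{1}{1 + e^{\beta_0 + \beta_1 X}} \\
\frac{p(X)}{1 - p(X)} &= e^{\beta_0 + \beta_1 X} \\
\log\!\left(\frac{p(X)}{1 - p(X)}\right) &= \beta_0 + \beta_1 X
\end{align*}
```

The third line divides the first by the second, and the last line takes logs of both sides, giving the log-odds as a linear function of X.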
QUESTION 7
In ridge regression we minimise this function:
where λ ≥ 0 is a tuning parameter, and
(a) TRUE or FALSE. If λ = 0 the model fit is equal to least squares. [2 marks]
(b) If λ is very, very large what will βj equal? [2 marks]
(c) What would be the change in the formula that would change this to lasso? [2 marks]
(d) Explain in two sentences how ridge regression effectively operates to enable model fitting with a large number of variables and few observations. [3 marks]
[Total: 9 marks]
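The effect of λ in parts (a) and (b) can be illustrated with the simplest case. Assuming the usual ridge objective, RSS + λ Σ βj², with a single centred predictor and no intercept, the minimiser has the closed form β = Σ xi yi / (Σ xi² + λ). The data and the function name `ridge_beta` below are made up for illustration.

```python
# Made-up, centred data for a single predictor (illustration only).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.1, 1.9, 4.0]


def ridge_beta(x, y, lam):
    """Minimiser of sum((y_i - b*x_i)^2) + lam * b^2 for one predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


for lam in (0.0, 1.0, 10.0, 1e6):
    print(f"lambda={lam:g}  beta={ridge_beta(x, y, lam):.6f}")
```

With λ = 0 the estimate is the least squares slope, and as λ grows the estimate shrinks towards zero. The penalty also keeps the denominator positive, which is the one-variable analogue of why ridge remains solvable when there are more variables than observations.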
QUESTION 8
(a) Match the linkage type to the explanation. [4 marks]
(b) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram. [4 marks]
(c) Does the following metric satisfy the definition of a distance metric? Justify your answer. [4 marks]
[Total: 12 marks]
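A minimal sketch of the procedure in part (b): agglomerative clustering with complete linkage, where the distance between two clusters is the largest pairwise dissimilarity between their members, and the fusion height is that distance. The dissimilarity values and function names below are hypothetical and are not the exam's matrix.

```python
from itertools import combinations

# Hypothetical dissimilarities between four observations (NOT the exam's matrix).
dissim = {
    (1, 2): 0.2, (1, 3): 0.6, (1, 4): 0.9,
    (2, 3): 0.5, (2, 4): 0.8, (3, 4): 0.3,
}


def d(i, j):
    """Symmetric lookup of the dissimilarity between observations i and j."""
    return dissim[(min(i, j), max(i, j))]


def complete_linkage(a, b):
    """Complete linkage: the largest pairwise dissimilarity between clusters."""
    return max(d(i, j) for i in a for j in b)


def agglomerate(items):
    """Return the fusions as (cluster_a, cluster_b, height), in merge order."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        # Fuse the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: complete_linkage(*pair))
        height = complete_linkage(a, b)
        merges.append((sorted(a), sorted(b), height))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


for a, b, height in agglomerate([1, 2, 3, 4]):
    print(a, "+", b, "at height", height)
```

Each printed fusion corresponds to one horizontal join in the dendrogram, drawn at the stated height. Swapping `max` for `min` in `complete_linkage` would give single linkage instead.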