Big Data Statistics

Assessment 1 (Due by 16:00 8th August 2025)

Question 1 [50 marks]

We observe data $\{(y_i, x_i),\ i = 1, 2, \ldots, n\}$ from the linear regression model

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i, \quad i = 1, 2, \ldots, n, \tag{1}$$

where $x_i = (x_{1i}, x_{2i}, \ldots, x_{pi})^\top$ contains the $p$ covariates.

(a). [10 marks] Generate a set of sample observations $(y_i, x_i)$, $i = 1, 2, \ldots, n$, in the statistical software R by following the data generating process (DGP) below.

1. Parameters. Set $p = 20$, $n = 26$, $\beta_0 = 1$, $\beta_1 = \beta_2 = \cdots = \beta_{10} = 0.8$ and $\beta_{11} = \beta_{12} = \cdots = \beta_{20} = 1.3$.

2. Covariates. All the $p$ covariates (i.e. predictors) follow a normal distribution with mean 0.4 and variance 1.1.

3. Error Component. The error component $\varepsilon_i$ follows a standard normal distribution.
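The three steps of the DGP can be sketched in base R as follows. This is a minimal illustration rather than a model answer: the seed, the object names (`X`, `y`, `dat`) and the assumption that the covariates are drawn independently of each other are choices made here, not part of the task statement.

```r
# Sketch of the DGP in part (a); seed and object names are arbitrary choices.
set.seed(2025)

# 1. Parameters
p     <- 20
n     <- 26
beta0 <- 1
beta  <- c(rep(0.8, 10), rep(1.3, 10))  # beta_1..beta_10 = 0.8, beta_11..beta_20 = 1.3

# 2. Covariates: N(0.4, 1.1). rnorm() expects a standard deviation, hence sqrt(1.1).
#    Independence across covariates is an assumption made for this sketch.
X <- matrix(rnorm(n * p, mean = 0.4, sd = sqrt(1.1)), nrow = n, ncol = p)
colnames(X) <- paste0("x", 1:p)

# 3. Error component: standard normal
eps <- rnorm(n)

# Response from model (1)
y <- as.numeric(beta0 + X %*% beta + eps)

# Collect everything in a data frame with columns y, x1, ..., x20
dat <- data.frame(y = y, X)
```

Note that the variance 1.1 enters `rnorm()` through `sd = sqrt(1.1)`; passing `1.1` directly would generate covariates with variance 1.21.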

(b). [15 marks] With the data generated in (a), estimate the regression coefficients $\beta_0$ and $\beta_k$, $k = 1, 2, \ldots, p$, by ordinary least squares (OLS) in R. Design an experiment (i.e. a simulation) to evaluate the prediction accuracy of this OLS estimator for the response variable $y$ on test data. Write out the procedure of the designed experiment and present the results in R.
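One possible shape for such an experiment is sketched below: repeatedly draw a training set of size $n$ from the DGP, fit OLS with `lm()`, and measure the mean squared prediction error on a large independent test set. The test-set size `n_test`, the number of replications `n_rep`, the helper `gen_data()` and the use of mean squared error as the accuracy measure are all assumptions of this sketch, not requirements of the question.

```r
# Sketch of a Monte Carlo experiment for OLS prediction accuracy.
# n_test, n_rep and gen_data() are hypothetical choices for illustration.
set.seed(1)

p <- 20; n <- 26
n_test <- 1000   # size of each independent test set (assumed)
n_rep  <- 200    # number of Monte Carlo replications (assumed)
beta0  <- 1
beta   <- c(rep(0.8, 10), rep(1.3, 10))

# Draw n_obs observations from the DGP in part (a)
gen_data <- function(n_obs) {
  X <- matrix(rnorm(n_obs * p, mean = 0.4, sd = sqrt(1.1)), nrow = n_obs)
  colnames(X) <- paste0("x", 1:p)
  y <- as.numeric(beta0 + X %*% beta + rnorm(n_obs))
  data.frame(y = y, X)
}

# One replication: fit on fresh training data, score on fresh test data
ols_test_mse <- replicate(n_rep, {
  train <- gen_data(n)
  test  <- gen_data(n_test)
  fit   <- lm(y ~ ., data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

mean(ols_test_mse)               # average test MSE over replications
sd(ols_test_mse) / sqrt(n_rep)   # Monte Carlo standard error of that average
```

Because the error variance is 1, a test MSE of 1 is the best any estimator can achieve on average; the gap between `mean(ols_test_mse)` and 1 reflects the estimation error of OLS when 21 coefficients are fitted from only 26 observations.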

(c). [25 marks] With the data generated in (a), propose another estimation approach for the linear regression model that predicts more accurately than OLS. Implement the proposed approach in R and present the estimated linear coefficients. Further, explain why the proposed method is better than OLS in terms of prediction accuracy.
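Ridge regression is one candidate, since it trades a small bias for a reduction in variance, which tends to matter when $p$ is close to $n$. The sketch below implements ridge in closed form in base R with an unpenalised intercept, picks the penalty by leave-one-out cross-validation, and compares test MSE against OLS. The tuning grid `lambdas`, the helpers `gen_xy()`, `ridge_fit()`, `ridge_predict()` and `loocv()`, and the simulation sizes are hypothetical choices; other shrinkage methods (e.g. the lasso) would be equally legitimate answers.

```r
# Sketch: closed-form ridge regression versus OLS on simulated test data.
# Grid, helper names and simulation sizes are assumptions for illustration.
set.seed(2)

p <- 20; n <- 26; n_test <- 1000; n_rep <- 100
beta0 <- 1
beta  <- c(rep(0.8, 10), rep(1.3, 10))
lambdas <- 10^seq(-2, 2, length.out = 20)   # hypothetical tuning grid

gen_xy <- function(n_obs) {
  X <- matrix(rnorm(n_obs * p, mean = 0.4, sd = sqrt(1.1)), nrow = n_obs)
  list(X = X, y = as.numeric(beta0 + X %*% beta + rnorm(n_obs)))
}

# Ridge with an unpenalised intercept: centre X and y, then solve
# (Xc'Xc + lambda * I) b = Xc'(y - ybar).
ridge_fit <- function(X, y, lambda) {
  xbar <- colMeans(X); ybar <- mean(y)
  Xc <- sweep(X, 2, xbar)
  b  <- solve(crossprod(Xc) + lambda * diag(ncol(X)), crossprod(Xc, y - ybar))
  list(intercept = ybar - sum(xbar * b), coef = as.numeric(b))
}

ridge_predict <- function(fit, X) as.numeric(fit$intercept + X %*% fit$coef)

# Leave-one-out cross-validated squared error for a single lambda
loocv <- function(X, y, lambda) {
  errs <- sapply(seq_len(nrow(X)), function(i) {
    fit <- ridge_fit(X[-i, , drop = FALSE], y[-i], lambda)
    y[i] - ridge_predict(fit, X[i, , drop = FALSE])
  })
  mean(errs^2)
}

res <- replicate(n_rep, {
  tr <- gen_xy(n); te <- gen_xy(n_test)
  # OLS via lm.fit with an explicit intercept column
  ols      <- lm.fit(cbind(1, tr$X), tr$y)
  ols_pred <- as.numeric(cbind(1, te$X) %*% ols$coefficients)
  # Ridge with lambda chosen by leave-one-out CV on the training data
  cv   <- sapply(lambdas, function(l) loocv(tr$X, tr$y, l))
  rfit <- ridge_fit(tr$X, tr$y, lambdas[which.min(cv)])
  c(ols   = mean((te$y - ols_pred)^2),
    ridge = mean((te$y - ridge_predict(rfit, te$X))^2))
})

rowMeans(res)   # average test MSE for OLS and for ridge
```

The comparison in `rowMeans(res)` supports the variance argument empirically: OLS is unbiased, but with only five residual degrees of freedom its coefficient estimates are highly variable, and shrinkage damps the directions in which $X^\top X$ is nearly singular.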

Question 2 [50 marks]

Consider two sets of sample observations $\{x_1, x_2, \ldots, x_n\}$ and $\{y_1, y_2, \ldots, y_m\}$ from normal distributions with population mean vectors $\mu$ and $\nu$, respectively. The population covariance matrices are both identity matrices. The dimensions of $\mu$ and $\nu$ are both equal to $p$. Statisticians are interested in the hypothesis test

$$H_0: \mu = \nu \quad \text{versus} \quad H_1: \mu \neq \nu. \tag{2}$$

A popular test statistic for this hypothesis testing problem is the Hotelling T square statistic

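$$T^2 = \frac{nm}{n+m}\, (\bar{x} - \bar{y})^\top S^{-1} (\bar{x} - \bar{y}), \qquad S = \frac{(n-1) S_x + (m-1) S_y}{n + m - 2}, \tag{3}$$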

where $S_x$ and $S_y$ are the sample covariance matrices constructed from $\{x_1, x_2, \ldots, x_n\}$ and $\{y_1, y_2, \ldots, y_m\}$, respectively, and $\bar{x}$ and $\bar{y}$ are the sample mean vectors estimating $\mu$ and $\nu$, respectively. The Hotelling $T^2$ statistic has the following asymptotic distribution

$$T^2 \xrightarrow{d} \chi^2_p, \tag{4}$$

where $\chi^2_p$ is the chi-square distribution with $p$ degrees of freedom.

(a). [25 marks] Generate the two sets of sample observations in R by setting $p = 60$, $n = 80$, $m = 90$ and $\mu = \nu = (1, 1, \ldots, 1)^\top$, and then calculate the value of the Hotelling $T^2$ statistic. Repeat this experiment $N = 200$ times and then plot the histogram of the statistic $T^2$.
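A minimal base-R sketch of this experiment is given below. It assumes the usual two-sample form of the statistic with a pooled covariance matrix, $T^2 = \frac{nm}{n+m}(\bar{x} - \bar{y})^\top S^{-1}(\bar{x} - \bar{y})$ with $S = \frac{(n-1)S_x + (m-1)S_y}{n+m-2}$; if the course defines the statistic differently, `hotelling_t2()` would need to change accordingly. The helper names, the seed and the number of histogram bins are arbitrary.

```r
# Sketch: Monte Carlo distribution of the two-sample Hotelling T^2 statistic.
# The pooled-covariance form of T^2 is an assumption of this sketch.
set.seed(3)

p <- 60; n <- 80; m <- 90; N <- 200
mu <- rep(1, p)
nu <- rep(1, p)

# Two-sample Hotelling T^2 with pooled sample covariance; rows are observations
hotelling_t2 <- function(x, y) {
  n <- nrow(x); m <- nrow(y)
  d <- colMeans(x) - colMeans(y)
  S_pooled <- ((n - 1) * cov(x) + (m - 1) * cov(y)) / (n + m - 2)
  as.numeric((n * m / (n + m)) * crossprod(d, solve(S_pooled, d)))
}

# n_obs draws from N(mean_vec, I): independent N(0, 1) entries plus the mean
rnorm_rows <- function(n_obs, mean_vec) {
  k <- length(mean_vec)
  matrix(rnorm(n_obs * k), nrow = n_obs) +
    matrix(mean_vec, nrow = n_obs, ncol = k, byrow = TRUE)
}

# Repeat the experiment N times
t2_values <- replicate(N, hotelling_t2(rnorm_rows(n, mu), rnorm_rows(m, nu)))

hist(t2_values, breaks = 25, freq = FALSE,
     main = "Hotelling T^2 over N = 200 replications", xlab = "T^2")
# Chi-square density with p degrees of freedom as a visual reference
curve(dchisq(x, df = p), add = TRUE, lty = 2)
```

Overlaying the $\chi^2_p$ density makes it easy to judge visually how well the asymptotic approximation describes the simulated values when $p = 60$ is not small relative to $n + m$.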

(b). [25 marks] Apply the bootstrap method to estimate the variance of the Hotelling $T^2$ statistic in R when $p = 60$, $n = 80$, $m = 90$ and $\mu = \nu = (1, 1, \ldots, 1)^\top$. Write down the details of the bootstrap procedure and present the bootstrap estimate. In addition, comment on the accuracy of the bootstrap estimate and give reasons.
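One way to set this up is the plain nonparametric bootstrap sketched below: resample the rows of each sample with replacement, recompute $T^2$ on every resample, and take the variance of the resampled statistics. The number of bootstrap replications `B`, the seed and the pooled-covariance form of `hotelling_t2()` are assumptions of this sketch, and resampling the two groups separately is only one of several possible schemes.

```r
# Sketch: nonparametric bootstrap estimate of Var(T^2).
# B and the pooled-covariance form of T^2 are assumptions of this sketch.
set.seed(4)

p <- 60; n <- 80; m <- 90
B <- 500   # number of bootstrap resamples (assumed)

# Two-sample Hotelling T^2 with pooled sample covariance; rows are observations
hotelling_t2 <- function(x, y) {
  n <- nrow(x); m <- nrow(y)
  d <- colMeans(x) - colMeans(y)
  S_pooled <- ((n - 1) * cov(x) + (m - 1) * cov(y)) / (n + m - 2)
  as.numeric((n * m / (n + m)) * crossprod(d, solve(S_pooled, d)))
}

# One observed data set: identity covariance, mean vector (1, ..., 1) in both groups
x <- matrix(rnorm(n * p, mean = 1), nrow = n)
y <- matrix(rnorm(m * p, mean = 1), nrow = m)

t2_obs <- hotelling_t2(x, y)

# Resample rows within each group, with replacement, and recompute T^2
t2_boot <- replicate(B, {
  xb <- x[sample.int(n, n, replace = TRUE), , drop = FALSE]
  yb <- y[sample.int(m, m, replace = TRUE), , drop = FALSE]
  hotelling_t2(xb, yb)
})

var(t2_boot)   # bootstrap estimate of the variance of T^2
```

When assessing accuracy, two features of this scheme are worth keeping in mind. Each resample contains repeated rows, so the resampled pooled covariance matrix rests on fewer distinct observations than the original and is harder to invert stably when $p = 60$. In addition, the resamples are centred at $\bar{x}$ and $\bar{y}$, which differ in any finite sample even though $\mu = \nu$. Comparing `var(t2_boot)` with the variance of $T^2$ across independently simulated data sets is one way to check how far the bootstrap estimate is off.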

Note: This homework is to be submitted through Canvas in digital form only, as per ANU policy. The R code for any computational question must be supplied.