Final Exam

The data we are working with is in longitudinal format. Each column represents a patient, and each row represents a gene expression reading for genes 1-5913. The patient’s disease status is marked in the column header. The first 20 patients are marked with ‘meta,’ meaning these patients have a form of metastatic cancer (disease=1). The last 20 patients do not have the disease (disease=0).

You will need to transform this data into a model-ready format in order to predict metastatic disease from each patient’s expression of each gene.
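One way to read "model-ready" is one row per patient, one column per gene, plus a disease column. The sketch below illustrates that reshaping on simulated data. The matrix `expr`, the gene names, and the `meta_`/`normal_` header prefixes are hypothetical stand-ins, since the real file layout is only described in words above.

```r
# Hypothetical stand-in for the real file: 6 genes (rows) x 40 patients (columns).
set.seed(1234)
n_genes <- 6
n_patients <- 40
expr <- matrix(rnorm(n_genes * n_patients), nrow = n_genes)
rownames(expr) <- paste0("gene", seq_len(n_genes))

# Assumed header convention: the first 20 column names contain "meta",
# the last 20 do not.
colnames(expr) <- c(paste0("meta_", 1:20), paste0("normal_", 1:20))

# Transpose so that each row is a patient and each column is a gene.
model_df <- as.data.frame(t(expr))

# Derive the outcome from the old column headers (now the row names).
model_df$disease <- factor(as.integer(grepl("meta", rownames(model_df))),
                           levels = c(0, 1))

dim(model_df)            # 40 rows, 7 columns (6 genes + disease)
table(model_df$disease)  # 20 patients in each class
```

With the real data, the same transpose would give 40 rows and 5913 gene columns plus the outcome.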

Set your R seed to 1234.

Once your data is ready to model, separate it into training and test sets.
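A minimal sketch of the seed and split steps, on simulated data. The 70/30 ratio and the within-class sampling are assumptions; the assignment does not specify either.

```r
set.seed(1234)  # seed required by the assignment

# Hypothetical toy data: 40 patients, 5 genes, first 20 patients diseased.
model_df <- data.frame(matrix(rnorm(40 * 5), nrow = 40))
model_df$disease <- factor(rep(c(1, 0), each = 20), levels = c(0, 1))

# Assumed 70/30 split, sampled within each class so that both sets
# contain diseased and non-diseased patients.
rows_by_class <- split(seq_len(nrow(model_df)), model_df$disease)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]

table(train$disease)  # 14 patients per class
table(test$disease)   # 6 patients per class
```

With only 40 patients, sampling within each class keeps the small test set from ending up with very few cases of one class.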

Apply the following algorithms, training on your training data and testing on your test data, to predict disease based on gene expression. From your test-set predictions, report accuracy, sensitivity, and specificity.
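The three metrics can be computed directly from the test-set confusion counts. The sketch below uses made-up labels and predictions, and `classification_metrics` is a hypothetical helper name. It treats disease = 1 as the positive class.

```r
# Hypothetical true labels and predictions for 12 test patients.
truth <- factor(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
pred  <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1), levels = c(0, 1))

classification_metrics <- function(truth, pred, positive = "1") {
  tp <- sum(pred == positive & truth == positive)  # true positives
  tn <- sum(pred != positive & truth != positive)  # true negatives
  fp <- sum(pred == positive & truth != positive)  # false positives
  fn <- sum(pred != positive & truth == positive)  # false negatives
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),   # share of diseased patients detected
    specificity = tn / (tn + fp))   # share of healthy patients cleared
}

metrics <- classification_metrics(truth, pred)
round(metrics, 3)  # accuracy 0.750, sensitivity 0.667, specificity 0.833
```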

RF (RF on the full dataset may take a long time to run because of the number of genes being used as predictor variables)

RF + PCA

KNN + PCA (Use iteration to find the optimal value of K)

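For the KNN + PCA item, one possible sketch on simulated data follows. It assumes the `class` package (which ships with standard R installations) and makes three choices the assignment leaves open: 5 principal components, odd K values from 1 to 15, and leave-one-out cross-validation on the training set to pick K. PCA is fitted on the training rows only, and the test rows are projected onto the same components.

```r
library(class)  # knn() and knn.cv()

set.seed(1234)

# Hypothetical toy data: 40 patients x 50 genes. The first 5 genes are
# shifted upward in diseased patients so there is a signal to find.
n <- 40
p <- 50
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
y <- factor(rep(c(1, 0), each = 20), levels = c(0, 1))
x[y == 1, 1:5] <- x[y == 1, 1:5] + 2

train_idx <- sample(seq_len(n), size = 28)
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test  <- x[-train_idx, ]
y_test  <- y[-train_idx]

# Fit PCA on the training data only, then project the test data.
pca <- prcomp(x_train, center = TRUE, scale. = TRUE)
n_pc <- 5  # assumed number of components; choose yours from the variance explained
pc_train <- pca$x[, seq_len(n_pc)]
pc_test  <- predict(pca, newdata = x_test)[, seq_len(n_pc)]

# Iterate over candidate K values, scoring each by leave-one-out CV accuracy.
k_grid <- seq(1, 15, by = 2)
cv_acc <- sapply(k_grid, function(k) {
  mean(knn.cv(pc_train, cl = y_train, k = k) == y_train)
})
best_k <- k_grid[which.max(cv_acc)]

# Refit with the chosen K and evaluate once on the held-out test set.
pred <- knn(pc_train, pc_test, cl = y_train, k = best_k)
test_acc <- mean(pred == y_test)
```

Choosing K on the training set keeps the test set untouched until the final evaluation, so the reported test accuracy is not tuned to the test data.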
In an external document, write a discussion on which algorithm you would choose and why. Discuss what the variable importance plot showed for RF and RF + PCA, the number of principal components you chose and what you chose as your optimal value of K.
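For the variable importance plot, a minimal sketch on simulated data is shown below. It assumes the `randomForest` package is installed (it is not part of base R). The toy data, `ntree = 500`, and the 28/12 split are illustrative choices.

```r
library(randomForest)  # assumed to be installed; not part of base R

set.seed(1234)

# Hypothetical toy data: 40 patients x 50 genes, with the first 5 genes
# shifted upward in diseased patients.
n <- 40
p <- 50
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
y <- factor(rep(c(1, 0), each = 20), levels = c(0, 1))
x[y == 1, 1:5] <- x[y == 1, 1:5] + 2

train_idx <- sample(seq_len(n), size = 28)

# importance = TRUE also stores the permutation-based importance measure.
rf_fit <- randomForest(x = x[train_idx, ], y = y[train_idx],
                       ntree = 500, importance = TRUE)
rf_pred <- predict(rf_fit, newdata = x[-train_idx, ])

# One row per predictor; sort by mean decrease in Gini impurity.
imp <- importance(rf_fit)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ], 10)

# Plot the 10 most important predictors.
varImpPlot(rf_fit, n.var = 10)
```

If the forest is fitted on principal component scores instead of genes, the same plot ranks components rather than individual genes, which is worth keeping in mind when comparing RF with RF + PCA.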

Upload your code and your external explanation document by Thursday, April 30th at 8pm.

Thank you for a wonderful class and have a great summer! Stay in touch!