MATH2319 Assignment 1 Explained: Python Tutoring
3/4/20, 1:03 pm MATH2319_2020_Assignment1
MATH2319 Machine Learning
Semester 1, 2020 
Assignment 1 
Assignment Rules: Please read carefully!
1. Assignments are to be treated as "limited open-computer" take-home exams. That is, 
you must work on the assignments on your own. You must not discuss your assignment 
solutions with anyone else (including your classmates, paid/unpaid tutors, friends, parents, 
relatives, etc.) and the submission you make must be your own work. In addition, no 
member of the teaching team will assist you with any issues that are directly related to your 
assignment solutions. All solutions must be provided in Python 3.6+ with results
documented in Jupyter Notebook. 
2. You must clearly show all your work for full credit. In particular, you need to clearly label your 
solutions with appropriate headings, subheadings, lists, etc. Also keep in mind that just
providing Python code will not get you full credit even if it's correct. You need to explain all 
your reasoning and document all your steps in plain English. That is, you must submit a 
professional piece of work as your assignment solutions. 
3. For solutions that are ambiguous, or solutions that are all over the place, you may receive 
zero points (even if they are correct!) as we have no obligation to spend hours and hours of our
time to decipher your notebook. 
4. Once you are done, it is your responsibility to run your notebook and then save it as an 
HTML file before submission. Your solutions shall be marked exactly as they appear in your 
HTML file. 
5. You must submit a single file (in HTML format) that contains all your solutions to all the 
questions. 
6. For other assignment rules, please refer to this web page: 
https://rmit.instructure.com/courses/67061/assignments/424265 
7. It is your responsibility to follow any and all assignment rules stated in the above web 
page. 
8. Do not forget to include the Honour Code or your assignment shall not be marked. 
9. If you need to make any assumptions at any point so that you can continue for any question, 
please state these assumptions and clearly explain your reasoning. 
10. Suspected cheating incidents shall be reported to RMIT Student Conduct Office for possible 
disciplinary action. 
Question 1 
(65 points) 
Data preprocessing is a critical component in machine learning and its importance cannot be 
overstated. If you do not prepare your data correctly, you can use the fanciest machine learning 
algorithm in the world and your results will still be incorrect. 
For this question, you will perform any and all data preprocessing steps on a dataset on the UCI 
ML Datasets Repository so that the clean dataset you end up with can be directly fed into any 
classification algorithm within the Scikit-Learn Python module without any further changes. 
This dataset is the Credit Approval data at the following address:
https://archive.ics.uci.edu/ml/datasets/Credit+Approval 
The UCI Repository provides four datasets, but only two of them will be relevant: 
crx.names : Some basic info on the dataset together with the feature names and values
crx.data : The actual data in comma-separated format 
Instructions: 
1. If you are having issues with reading in the dataset directly (which is most likely due to UCI's 
or your web browser's SSL settings), you can download the file to your computer manually
and then upload it to your Azure project, which you can then read in as a local file. 
2. This is a very small dataset. So please do not perform any sampling. 
3. Make sure you follow the best practices outlined in the Data Prep lecture presentation (on 
Chapters 2 and 3) on Canvas and the Data Prep tutorial 
(https://www.featureranking.com/tutorials/machine-learning-tutorials/data-preparation-for-machine-learning/)
on our website.
4. As a general rule, all categorical features need to be assumed to be nominal unless you have 
evidence to the contrary. 
5. As for potential outliers in numerical descriptive features, this is an anonymised dataset, so 
please do not flag any numerical values as outliers regardless of their value for this question. 
6. For this question, you are to set all unusual values (and all outliers, if there are any) to missing 
values. Also, you are to impute any missing values with the mode for categorical features 
and with the median for numerical features. If there are multiple modes for a categorical 
feature, use the mode that comes first alphabetically. 
7. For the A2 numerical descriptive feature, you are to discretize it via equal-frequency binning
with 3 bins named "low", "medium", and "high", and then use integer encoding for it. 
8. For normalization, you are to use standard scaling. You are allowed to use Scikit-Learn's
preprocessing submodule for this purpose.
9. The target feature needs to be the last column in the clean data and its name needs to be
target.
10. You must perform all your preprocessing steps using Python. For any cleaning steps that you 
perform via Excel or simple find-and-replace in a text editor or any other language or in any 
other way, you will receive zero points. 
11. It's critical that the final clean data does not need any further processing so that it will work 
without any issues with any classifier within Scikit-Learn. 
12. Once you are done, name your final clean dataset as df_clean (if it's not already named as
such). 
13. At the end, run each one of the following three lines in three separate code cells for a 
summary: 
df_clean.shape 
df_clean.describe(include='all').round(3)  
df_clean.head(5) 
14. Save your final clean dataset exactly as "df_clean.csv". Make sure your file has the correct 
column names (including the target column). Next, you will upload this CSV file to Canvas
as part of your assignment solutions. That is, in addition to an HTML file (that contains your 
solutions), you also need to upload your clean data in CSV format on Canvas with this 
name. 
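Instructions 6 and 7 above can be illustrated with a minimal sketch on toy data. The frame below is hypothetical (the values and the "cat" and "num" column names are made up for illustration and are not taken from crx.data), and it only shows one possible way to do mode/median imputation with an alphabetical tie-break, followed by equal-frequency binning and integer encoding.

```python
import numpy as np
import pandas as pd

# Toy stand-in data (hypothetical values, not taken from crx.data).
df = pd.DataFrame({
    "cat": ["b", "a", np.nan, "a", "b", "c"],   # two modes: "a" and "b"
    "num": [10.0, np.nan, 30.0, 20.0, 50.0, 40.0],
    "A2": [22.0, 58.5, 31.0, 45.0, 27.5, 39.0],
})

# Mode imputation: sort the modes explicitly so that ties are broken
# by taking the alphabetically first value.
cat_fill = df["cat"].mode().sort_values().iloc[0]
df["cat"] = df["cat"].fillna(cat_fill)

# Median imputation: Series.median() skips missing values by default.
num_fill = df["num"].median()
df["num"] = df["num"].fillna(num_fill)

# Equal-frequency binning into 3 bins; qcut returns an ordered categorical
# whose category order follows the order of the labels.
a2_binned = pd.qcut(df["A2"], q=3, labels=["low", "medium", "high"])

# Integer encoding that preserves the order low < medium < high (0, 1, 2).
df["A2"] = a2_binned.cat.codes.astype(int)
```

Using the category codes keeps the ordering of the bins, which is the reason integer encoding (rather than one-hot encoding) is reasonable for a discretized numerical feature.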
Please do not ask teaching staff any questions about this Credit Approval dataset as we do
not know anything more than what UCI already provides on their website. 
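As a rough sketch of how a header-less, comma-separated file such as crx.data might be read into Pandas, the example below parses a few made-up rows from an in-memory string. The column names, the use of "?" as the missing-value marker, and the "+"/"-" class labels are assumptions made for illustration; check crx.names for the actual details before relying on any of them.

```python
import io

import pandas as pd

# Made-up rows in a header-less, comma-separated layout (not real data).
raw = io.StringIO("b,30.83,0,+\na,?,4.46,-\n?,24.50,0.5,-\n")

# Hypothetical column names; the real file has more columns than this.
col_names = ["A1", "A2", "A3", "target"]

# header=None because there is no header row; na_values="?" assumes that
# missing entries are marked with a question mark.
df = pd.read_csv(raw, header=None, names=col_names, na_values="?")

# Assumed label mapping: "+" becomes 1 and "-" becomes 0.
df["target"] = df["target"].map({"+": 1, "-": 0})
```

To read a manually downloaded copy instead, the `io.StringIO` object can be replaced with the path to the local file.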
If you still need any help, please remember that you are allowed to search the Internet for generic
questions, such as "how to change column order in Pandas" etc. Keep in mind that 99% of the
time, a Google search will give you a much faster answer to your questions than posting
them on a discussion forum.
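For instance, the generic column-order question mentioned above, together with the one-hot encoding of nominal features and the standard scaling asked for in Question 1, could be sketched as follows. The toy frame and its "colour" and "amount" columns are hypothetical; this is one possible approach rather than a reference solution.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with the target deliberately placed first (hypothetical values).
df = pd.DataFrame({
    "target": [1, 0, 1, 0],
    "colour": ["red", "blue", "red", "green"],   # nominal categorical feature
    "amount": [1.0, 2.0, 3.0, 4.0],              # numerical feature
})

# One-hot encode the nominal feature so that every column is numeric.
df = pd.get_dummies(df, columns=["colour"], dtype=int)

# Standard scaling: subtract the mean and divide by the standard deviation.
scaler = StandardScaler()
df["amount"] = scaler.fit_transform(df[["amount"]]).ravel()

# Change the column order so that the target comes last.
feature_cols = [c for c in df.columns if c != "target"]
df_clean = df[feature_cols + ["target"]]
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which differs slightly from the Pandas default of `ddof=1`.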
If you run into any errors, the best course of action would be just to Google your error message. 
Good luck! 
For Question 2, please follow the instructions below: 
1. Textbook info can be found on Canvas at this link: 
https://rmit.instructure.com/courses/67061/pages/course-resources 
2. You must show all your calculations and you must perform all your calculations using Python. 
You must also document all your work in Jupyter notebook format. 
3. You may not use any one of the classifiers in the Scikit-Learn module. Likewise, you may not 
use any one of the preprocessing methods in the Scikit-Learn module. You will need to show 
and explain all your solution steps without using the Scikit-Learn module. You will not 
receive any points for any work that uses Scikit-Learn for Question 2. The reason for this 
restriction is so that you get to learn how some things work behind the scenes. But don't 
worry, you will be using Scikit-Learn quite a bit in subsequent assessments. 
Question 2 
(35 points, 7 points for each part) 
Solve Chapter 5, Exercise 3 (all five parts) in the textbook, but instead of the Euclidean distance, 
use the Manhattan distance. All exercise parts must be solved with the Manhattan distance 
metric. 
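As a reminder, the Manhattan distance between two instances is the sum of the absolute differences of their feature values, whereas the Euclidean distance is the square root of the sum of squared differences. A minimal sketch using NumPy only (no Scikit-Learn) is shown below; the training instances and the query are made-up values and are not taken from the textbook exercise.

```python
import numpy as np


def manhattan(a, b):
    """Sum of absolute feature-wise differences between two instances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())


# Hypothetical training instances (one row each) and a query instance.
train = np.array([[1.0, 2.0], [4.0, 6.0], [0.0, 0.0]])
query = np.array([1.0, 1.0])

# Distance from the query to every training instance via broadcasting.
dists = np.abs(train - query).sum(axis=1)

# Index of the nearest training instance under the Manhattan metric.
nearest = int(np.argmin(dists))
```

For the points (0, 0) and (3, 4), `manhattan` gives 7, while the Euclidean distance is 5, so the two metrics can rank neighbours differently.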
www.featureranking.com
