

COMP529/336: COURSEWORK ASSIGNMENT #1 (BATCH ANALYTICS)

INTRODUCTION

The assignment aims to test your understanding of batch analytics, with a focus on your ability to use Hadoop to solve Big Data Analytic problems. More specifically, it aims to partially assess the following learning outcome for COMP529: “understanding of the middleware that can be used to enable algorithms to scale up to analysis of large datasets”.

ASSESSMENT

The report will be assessed according to the following criteria:

Criterion (percentage of mark):

- Clarity of presentation (including succinctness) of main report: 20%

- Quality of Java code (including assessment of how easy it is to understand): 40%

- Quality of analysis performed: 40%

SUBMISSION

Please submit your coursework online using the COMP529/336 page on VITAL by 12 noon on Wednesday 6th November 2019. Standard lateness penalties will apply to any work handed in after this time. The report and the Java program must be written by yourself using your own words.

PROJECT BACKGROUND

Now more than ever, local governments are working to build smart cities and create sustainable urban environments that improve quality of life. Part of this plan is a transportation scheme known as a bike-share program, which aims to ease a city's traffic congestion and reduce its air pollution. Today the idea of bike sharing is very popular, since users can easily rent a bike from any station and return it at their final destination. Approximately 500,000 shared bicycles are available around the world, across more than 500 different sharing programs. For this coursework, your task is to analyse the dataset of one such program, Capital Bikeshare (http://capitalbikeshare.com/system-data), for the city of Washington DC in the USA.

The aim of this assignment is to help you analyse the Capital Bikeshare rental program's dataset and identify the most popular rental season (spring, summer, fall, winter) across the year.

Dataset

The bikeshare dataset contains 1000 bike-rental records from between 2011 and 2013. The data is stored in a file called BikeShareData, available on VITAL in the COMP529/336 Assignment/data folder. The data fields are described in Table 1.




Table 1: data record description

Field        Description

dteday       date

seasons      spring, summer, fall, winter

yr           year (2011)

mnth         month (1 to 12)

hr           hour (0 to 23)

weekday      day of the week

weathersit   1: Clear, Few clouds, Partly cloudy
             2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
             3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
             4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

casual       count of casual users

stations     chinatown, capitollhill, lincoln, logan, southwest, oxford, abraham, alexandria, etc.
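Task 2 below asks you to work with only the seasons and stations fields. As a minimal sketch, assuming the records in BikeShareData are comma-separated in the Table 1 field order (an assumption you should verify against the actual file before fixing any column indices):

```java
import java.util.Arrays;

// Sketch: extract only the "seasons" and "stations" fields from one record.
// The column positions below follow the Table 1 field order and are an
// assumption -- check the actual BikeShareData file layout first.
public class RecordFields {
    static final int SEASON_COL = 1;   // assumed position of "seasons"
    static final int STATION_COL = 8;  // assumed position of "stations"

    static String[] seasonAndStation(String line) {
        String[] cols = line.split(",");
        return new String[] { cols[SEASON_COL].trim(), cols[STATION_COL].trim() };
    }

    public static void main(String[] args) {
        // Hypothetical record built from the Table 1 field descriptions.
        String record = "2011-01-01,spring,2011,1,9,6,1,3,chinatown";
        System.out.println(Arrays.toString(seasonAndStation(record)));
        // prints [spring, chinatown]
    }
}
```

Discarding the unused columns in the mapper, rather than preprocessing the file, keeps the whole pipeline inside the MapReduce job.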

Your tasks:

1) Set up a Hadoop framework and justify your choice of deployment mode (e.g., standalone).

2) Use ONLY the seasons and stations data fields; the rest of the data fields can be deleted or ignored.

3) Write a Java program for a MapReduce job that counts the occurrences of each season in the file (e.g., spring=3, summer=10, winter=30).

4) Use the MapReduce job to calculate the number of times each bicycle station (e.g., chinatown) appears in the file.

5) Use the MapReduce job to present your output in alphabetical order (a to z).

6) Comment on how this analysis could be extended to larger datasets (e.g., 10 years of bicycle rental records amounting to 1 terabyte of data).

7) Briefly describe how your Hadoop MapReduce skills could be used to solve another problem (choose your own case study), including a MapReduce data flow diagram.
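For tasks 3 to 5, the end-to-end logic can be sketched in plain Java before committing it to Hadoop. This is a data-flow sketch, not the submission itself: in the actual job the map and reduce steps would live in subclasses of org.apache.hadoop.mapreduce.Mapper and Reducer, with Hadoop performing the shuffle and sort between them. The column index and sample records below are assumptions; note that a TreeMap in the grouping stage keeps keys in alphabetical order, which mirrors the sorted-by-key output a single reducer produces and covers the a-to-z requirement of task 5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the MapReduce data flow for counting seasons.
// Map emits (season, 1) per record; the grouping stage sums per key.
public class SeasonCountFlow {

    // Map step: each record emits the pair (season, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        String season = line.split(",")[1].trim(); // assumed column index
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(Map.entry(season, 1));
        return out;
    }

    // Shuffle + reduce step: group pairs by key and sum the ones.
    static Map<String, Integer> reduce(List<String> records) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys => a-z output
        for (String line : records) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical records in the assumed Table 1 layout.
        List<String> sample = List.of(
            "2011-01-03,winter,2011,1,8,1,1,5,chinatown",
            "2011-06-10,summer,2011,6,17,5,1,2,logan",
            "2011-06-11,summer,2011,6,9,6,2,4,lincoln");
        System.out.println(reduce(sample)); // prints {summer=2, winter=1}
    }
}
```

Counting stations (task 4) follows the same pattern with the emitted key switched from the season column to the station column.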

Your output report:

The output from this coursework is a brief report of no more than two A4 pages (excluding any appendices), in 12-point font with margins of no less than 2 cm. While the requirement is to produce no more than 2 pages, it is anticipated that the challenge will be to fit everything into those 2 pages: it is unlikely that a report of much less than 2 pages will result in a high mark. The report should have sections that describe:

1) Middleware configuration: how you configured the Hadoop middleware, with a screen print (including a description of your Hadoop cluster and your rationale for this choice).

2) Data analytic design: how you designed the MapReduce job (including your rationale for the design; briefly state or draw a MapReduce data flow model for your work).

3) Results: the results obtained (excluding any discussion);

4) Discussion of results;

5) Conclusions and recommendations (including a discussion of how you would perform the task if it were to be undertaken at larger scale).

6) Listing of the Java program(s) for your MapReduce job(s) in the appendix.


