- 首页 >> Database作业

School of Computer Science

COMP5338: Advanced Data Models 2.Sem./2019

Project: NoSQL Schema Design and Query Workload


Group Work: 20% 20.09.2019

1 Introduction

In this assignment, you will demonstrate that you are able to work with both MongoDB

and Neo4j in terms of designing suitable schema and writing practical queries. You will also

demonstrate that you understand the strength and weakness of each system with respect to

certain query workload features. You will be given a real world data set in Question and

Answer area and a set of target queries.

The summary of your tasks are as follows:

• Design a schema for each storage system based on data and query feature;

• Implement all queries in each system;

• For each system, select two queries to provide an alternative implementation; Compare

the execution performance of the two implementations;

• Document the schema design, query design and performance analysis in a report;

2 Data set

The data that you will use is the latest dump (publication date: 2018-06-05) of the Artificial

Intelligence Stack Exchange question and answer site (https://ai.stackexchange.

com/). The dump is released and maintained by stackexchange:

details/stackexchange. The original dump contains many files in XML format. The assignment

uses a subset of the data stored in five csv files. The data files and the description

(readme.txt) can be downloaded from Canvas.

The assignment data set contains the following files:

• Posts.csv stores information about post; each row represents a post, which could be

a question or an answer

• Users.csv stores user’s profile; each row represents a user, a user can be the author

of a post or an answer.


• Comments.csv stores comments meta data; each row represents a comment, which can

be made for a question or an answer, identified by the PostId

• Votes.csv stores detailed vote information about post, each row represents a vote,

including the vote type, the date this vote is made and a few other information

• Tags.csv contains summary of tag usage in this site.

Two concepts that will appear in many query descriptions are: Topic and User.

• Topic: Each question may belong to a few topics. The topic(s) of a question are

recorded as a list of keywords in the Tags column in Posts.csv. Both answers and

comments belong to this questions have the same topic(s) as the question.

• User: Questions, answers and comments are all made by registered users. Users are

identified by UserId field in various CSV files. Some users are removed for various

reasons. The removed users no longer have an Id and should be ignored in all queries

3 Target Queries

• [Q1] Find the question that attracts most discussions in a given topic; We measure

the intensity of discussion by the total number of answers and comments in a question.

• [Q2] Find the user with the highest UpVote number in a given topic, return the user’s

name and UpVote number. Any user who has posted a question,an answered or a

comment in this topic are candidate users.

• [Q3] For a given topic, discover the questions that are hardest to answer. Here we

measure the difficulty of question by the time it takes to receive an accepted answer.

Questions that do not have an accepted answer will be ignored.

• [Q4] Discover questions with arguable accepted answer. Users can give upVote to both

question and answer. Usually the accepted answer of a question receives the highest

number of upVote among all answers of this question. In rare case, another answer(s)

may receive higher upVote count than the upVote count of the accepted answer. In

this query, you are asked to discover such questions whose accepted answer has less

upVote than the upVote counts of its other answers. Note We are only interested in

questions with at least 5 answers.

• [Q5] Given a time period as indicated by starting and ending date, find the top 5

topics in that period. We rank a topic by the number of users participated in that

topic during the period. Posting question, answering or commenting are all considered

as participation.


• [Q6] Find the top 5 co-authors of a given user. Consider all users involved in a

question as co-authors. This include users posting the question, answering the question

or making comments on either question or answers. For a given user, we rank the

coauthors by the number of questions this user and the coauthor appear together.

4 Task Details

Your tasks include:

• Schema Design for MongoDB and Neo4j

For each storage option design a proper schema that would best support the query and

data set feature. For each schema version, make sure you utilize features of the storage

system such as indexing, aggregation, ordering, filtering and so on.

The original data set follows relational structure. It may contain data that are not

useful or not involved in the query. During schema design, you may discard data that

are not needed. You may duplicate original data following the schema design.

• Query Design and Implementation

Load the data set (after some necessary preporocessing) into both systems and set up

proper indexes that will be used by the target queries. Design and implement all queries

in each system. You may implement a query using the shell command (e.g. MongoDB

shell or Cypher query) alone, or a combination of JavaScript and shell commands in the

case of MongoDB or as Python/Java program. In case that a programming language is

used, make sure that you do majority of the processing on the database side. The client

side processing should be restricted to activities like collecting output from previous

database query and send the output as is to the subsequent one. In particular, you

should avoid sorting, filtering and grouping query output on the client side.

• Performance Analysis

For each storage option (MongoDB and Neo4j), pick two queries as the performance

analysis target queries. Design a different implementation for each query. Then collect

execution statistics of each implementation and make a side by side comparison.

Deliverable and Submission Guideline

This is a group project, each group can have up to 2 students. Each group needs to produce

the following:

• A Written Report.

The report should contain five sections. The first section is a brief introduction of

the project. Section two and three should cover a storage option each. Section four


should provide a summary and brief comparison of the two storage systems. Section

five should be an appendix for sample results.

There is no point allocated on section one. It is included to make the report complete.

So please keep it short.

Section two and three should each contain the following three sub sections

– Schema Design

In this section, describe the schema with respect to the particular system. Your

description should include information at “table” and “column” level as well as

possible primary keys/row keys and secondary indexes. You should show sample

data based on schema. For instance, you may show sample documents of

each MongoDB collection, a sample property graph involving all node types and

relationship types for Neo4j.

– Query Design

In this section, describe implementation of each query. You should include the

entire command and/or code. For each query, briefly explain the behaviour.

Example of MongoDB query description can be found in week 2 and week3 labs


– Performance Analysis

In this section, list the two queries you have chosen for performance analysis.

For each query, include the entire command/or code for each implementation.

Show the execution statistics in tabular format or as screenshots. Give a brief

comparison by highlighting the important execution differences.

In section four, briefly compare the two storage systems with respect to ease of use,

query design and schema differences. You can also describe problems encountered in

schema design or query design.

In section five, document the sample query results as well as the respective argument(s)

you use for queries that take argument. This would include: a sample ’topic’ in Q1,

Q2, Q3; a sample period in Q5, a simple userId in Q6.

• System Demo

Each group will demo in week 10 lab. You can run demo on your own machine, on

lab machine or on some cloud servers. Please make sure you prepare the data before

the demo. The marker does not need to see your data loading steps. The marker will

ask you to run a few randomly selected queries to get an overview of the data model

and query design. All members of the group are required to attend the demo. The

marker will ask each member a few questions to establish their respective contribution

to the project. Members in the same group may get different marks depending on their

individual contributions.

• Source Code/Script and soft copy of report submission

There will be different links for script (zip file) and report (PDF file) submission to

facilitate plagiarism detection. The script submission should be a zip file (no rar, 7z)

include the following:

– query script or program code for each option

– data loading script.

– a Readme document for how to run the data loading script and the target queries.

The instruction should be detailed enough for the markers to quickly prepare the

data and to run the queries. For instance, you should indicate where and how

run-time argument are supplied. If you use special features only available in a

particular version or environment, indicate that as well.

Remember, only script or source code and file should be included. There will

be penalty for including data file in the submission.