School of Computer Science
COMP5338: Advanced Data Models 2.Sem./2019
Project: NoSQL Schema Design and Query Workload
Group Work: 20% 20.09.2019
In this assignment, you will demonstrate that you are able to work with both MongoDB
and Neo4j in terms of designing suitable schema and writing practical queries. You will also
demonstrate that you understand the strength and weakness of each system with respect to
certain query workload features. You will be given a real world data set in Question and
Answer area and a set of target queries.
The summary of your tasks are as follows:
• Design a schema for each storage system based on data and query feature;
• Implement all queries in each system;
• For each system, select two queries to provide an alternative implementation; Compare
the execution performance of the two implementations;
• Document the schema design, query design and performance analysis in a report;
2 Data set
The data that you will use is the latest dump (publication date: 2018-06-05) of the Artificial
Intelligence Stack Exchange question and answer site (https://ai.stackexchange.
com/). The dump is released and maintained by stackexchange: https://archive.org/
details/stackexchange. The original dump contains many files in XML format. The assignment
uses a subset of the data stored in five csv files. The data files and the description
(readme.txt) can be downloaded from Canvas.
The assignment data set contains the following files:
• Posts.csv stores information about post; each row represents a post, which could be
a question or an answer
• Users.csv stores user’s profile; each row represents a user, a user can be the author
of a post or an answer.
• Comments.csv stores comments meta data; each row represents a comment, which can
be made for a question or an answer, identified by the PostId
• Votes.csv stores detailed vote information about post, each row represents a vote,
including the vote type, the date this vote is made and a few other information
• Tags.csv contains summary of tag usage in this site.
Two concepts that will appear in many query descriptions are: Topic and User.
• Topic: Each question may belong to a few topics. The topic(s) of a question are
recorded as a list of keywords in the Tags column in Posts.csv. Both answers and
comments belong to this questions have the same topic(s) as the question.
• User: Questions, answers and comments are all made by registered users. Users are
identified by UserId field in various CSV files. Some users are removed for various
reasons. The removed users no longer have an Id and should be ignored in all queries
3 Target Queries
• [Q1] Find the question that attracts most discussions in a given topic; We measure
the intensity of discussion by the total number of answers and comments in a question.
• [Q2] Find the user with the highest UpVote number in a given topic, return the user’s
name and UpVote number. Any user who has posted a question,an answered or a
comment in this topic are candidate users.
• [Q3] For a given topic, discover the questions that are hardest to answer. Here we
measure the difficulty of question by the time it takes to receive an accepted answer.
Questions that do not have an accepted answer will be ignored.
• [Q4] Discover questions with arguable accepted answer. Users can give upVote to both
question and answer. Usually the accepted answer of a question receives the highest
number of upVote among all answers of this question. In rare case, another answer(s)
may receive higher upVote count than the upVote count of the accepted answer. In
this query, you are asked to discover such questions whose accepted answer has less
upVote than the upVote counts of its other answers. Note We are only interested in
questions with at least 5 answers.
• [Q5] Given a time period as indicated by starting and ending date, find the top 5
topics in that period. We rank a topic by the number of users participated in that
topic during the period. Posting question, answering or commenting are all considered
• [Q6] Find the top 5 co-authors of a given user. Consider all users involved in a
question as co-authors. This include users posting the question, answering the question
or making comments on either question or answers. For a given user, we rank the
coauthors by the number of questions this user and the coauthor appear together.
4 Task Details
Your tasks include:
• Schema Design for MongoDB and Neo4j
For each storage option design a proper schema that would best support the query and
data set feature. For each schema version, make sure you utilize features of the storage
system such as indexing, aggregation, ordering, filtering and so on.
The original data set follows relational structure. It may contain data that are not
useful or not involved in the query. During schema design, you may discard data that
are not needed. You may duplicate original data following the schema design.
• Query Design and Implementation
Load the data set (after some necessary preporocessing) into both systems and set up
proper indexes that will be used by the target queries. Design and implement all queries
in each system. You may implement a query using the shell command (e.g. MongoDB
case of MongoDB or as Python/Java program. In case that a programming language is
used, make sure that you do majority of the processing on the database side. The client
side processing should be restricted to activities like collecting output from previous
database query and send the output as is to the subsequent one. In particular, you
should avoid sorting, filtering and grouping query output on the client side.
• Performance Analysis
For each storage option (MongoDB and Neo4j), pick two queries as the performance
analysis target queries. Design a different implementation for each query. Then collect
execution statistics of each implementation and make a side by side comparison.
Deliverable and Submission Guideline
This is a group project, each group can have up to 2 students. Each group needs to produce
• A Written Report.
The report should contain five sections. The first section is a brief introduction of
the project. Section two and three should cover a storage option each. Section four
should provide a summary and brief comparison of the two storage systems. Section
five should be an appendix for sample results.
There is no point allocated on section one. It is included to make the report complete.
So please keep it short.
Section two and three should each contain the following three sub sections
– Schema Design
In this section, describe the schema with respect to the particular system. Your
description should include information at “table” and “column” level as well as
possible primary keys/row keys and secondary indexes. You should show sample
data based on schema. For instance, you may show sample documents of
each MongoDB collection, a sample property graph involving all node types and
relationship types for Neo4j.
– Query Design
In this section, describe implementation of each query. You should include the
entire command and/or code. For each query, briefly explain the behaviour.
Example of MongoDB query description can be found in week 2 and week3 labs
– Performance Analysis
In this section, list the two queries you have chosen for performance analysis.
For each query, include the entire command/or code for each implementation.
Show the execution statistics in tabular format or as screenshots. Give a brief
comparison by highlighting the important execution differences.
In section four, briefly compare the two storage systems with respect to ease of use,
query design and schema differences. You can also describe problems encountered in
schema design or query design.
In section five, document the sample query results as well as the respective argument(s)
you use for queries that take argument. This would include: a sample ’topic’ in Q1,
Q2, Q3; a sample period in Q5, a simple userId in Q6.
• System Demo
Each group will demo in week 10 lab. You can run demo on your own machine, on
lab machine or on some cloud servers. Please make sure you prepare the data before
the demo. The marker does not need to see your data loading steps. The marker will
ask you to run a few randomly selected queries to get an overview of the data model
and query design. All members of the group are required to attend the demo. The
marker will ask each member a few questions to establish their respective contribution
to the project. Members in the same group may get different marks depending on their
• Source Code/Script and soft copy of report submission
There will be different links for script (zip file) and report (PDF file) submission to
facilitate plagiarism detection. The script submission should be a zip file (no rar, 7z)
include the following:
– query script or program code for each option
– data loading script.
– a Readme document for how to run the data loading script and the target queries.
The instruction should be detailed enough for the markers to quickly prepare the
data and to run the queries. For instance, you should indicate where and how
run-time argument are supplied. If you use special features only available in a
particular version or environment, indicate that as well.
Remember, only script or source code and read.me file should be included. There will
be penalty for including data file in the submission.