
SCHOOL OF INFORMATICS & IT

DIPLOMA IN BIG DATA & ANALYTICS

AY 2025/2026 April Semester

Project (60%)

Analysis Data Pipeline for Danish Electricity and Gas supply

DATA ENGINEERING IN THE CLOUD (CDA2C06)

SUBJECT LEVEL: 2

GENERAL INSTRUCTIONS

1. This document consists of 13 pages (including cover page and marking rubrics).

2. Please complete ALL tasks in this assignment.

1. Background

Energinet is the independent public enterprise responsible for Denmark's transmission system for electricity and gas. Through its commitment to transparency and the green transition, Energinet provides publicly accessible energy datasets via its Energy Data Service platform (https://en.energinet.dk/). These datasets support national and international energy research, system planning, market analysis, and innovation in energy technologies.

The Energinet Data Service API provides programmatic access to a wide range of datasets including electricity spot prices, consumption patterns, production data by source (wind, solar, fossil), emissions, grid balance, gas consumption, and more. These datasets are particularly valuable for analytics projects involving energy forecasting, market behavior analysis, and sustainability impact studies.

The platform aligns with the open data policies of the Danish Government and the EU, promoting the use of open data in fostering data literacy, transparency, and innovation. Source: https://en.energinet.dk

2. Data

2.1 Data Source

The data for this project will be drawn from Energinet's "Electricity Production and Exchange 5 min Realtime" dataset, available through the Energy Data Service.

https://www.energidataservice.dk/tso-electricity/ElectricityProdex5MinRealtime

This dataset provides high-frequency, near real-time electricity production data in Denmark at 5-minute intervals. It includes breakdowns by energy production type (e.g., wind, solar, thermal) and tracks international electricity exchanges with neighboring countries.

This dataset is ideal for students interested in:

• Monitoring fluctuations in electricity generation by source

• Studying the real-time balance between domestic production and cross-border electricity exchanges

• Analyzing renewable energy contribution in near real-time

• Building time-series models to forecast short-term grid behavior

It serves as a rich source for data engineering workflows and real-time analytics and is updated frequently to reflect live operational data from Denmark’s national grid.

2.2 Data Format

The data is accessible via a RESTful API that supports GET requests and returns results in JSON format, making it suitable for ingestion by AWS services or analytics platforms using Python, R, or SQL-based engines like Athena.

API Explorer by Energinet Data Service:

Figure 2.1: Screenshot of two days of Electricity Production and Exchange data at 5-minute intervals

API requests can be customized using parameters such as:

• start and end (to filter by datetime range)

• filter (to target specific production types or bidding zones)

• sort and time-zone options

Sample API Request (by Postman):

Figure 2.2: Screenshot from Postman of a customized request with offset, start, end and sort parameters

The returned JSON object contains fields such as:

• Minutes5UTC – timestamp in UTC

• PriceArea – DK1 or DK2 market zone

• Production – electricity production in MW by source (e.g., solar, wind onshore/offshore, thermal)

• Exchange – electricity flow (import/export) to/from neighboring countries

This structured format enables students to automate data ingestion, catalog metadata using AWS Glue, and query with Amazon Athena for exploratory and analytical tasks.
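To illustrate what such an ingestion request might look like, the sketch below pulls a window of records with the Python requests library. The API base URL (api.energidataservice.dk) and the top-level "records" key are assumptions inferred from the Energy Data Service documentation and the Postman example above; verify them against the API Explorer before relying on them.

```python
# Minimal sketch of pulling the 5-minute production data with requests.
# The base URL, parameter formats, and the "records" key are assumptions
# to be checked against the API Explorer.
import requests

BASE_URL = "https://api.energidataservice.dk/dataset/ElectricityProdex5MinRealtime"

params = {
    "start": "2025-04-01T00:00",    # datetime range filter (assumed format)
    "end":   "2025-04-03T00:00",
    "sort":  "Minutes5UTC ASC",     # sort option, as shown in the API Explorer
    "limit": 1000,                  # adjust or page with "offset" as needed
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

payload = response.json()
records = payload.get("records", [])
print(f"Retrieved {len(records)} five-minute records")
print(records[0] if records else "No data returned")
```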

3. Tasks

You are part of Energinet's data engineering team and are tasked with evaluating, designing, and building an AWS data pipeline as a proof-of-concept.

You are required to use Energinet's "Electricity Production and Exchange 5 min Realtime" dataset. You are to build a data engineering solution in your AWS Learner Lab using the AWS services that you are familiar with. By working with this data source, you will be able to test whether the solution you build can support a much larger dataset in an actual implementation.

The objective of this project is to achieve sustainable, seamless data synchronization and a better front-end data service for data consumption. Before building a robust and reliable solution, you shall start with an architecture design proposal (considered the Project Report) and a viable prototype (considered the Project Solution). The aim of the prototype is to present a Proof of Concept (POC) for judging the feasibility of an actual implementation.

This project will challenge you to do the following:

Basic Requirements

• Use an AWS Cloud9 integrated development environment (IDE) instance.

• Collect and ingest the data from the web source.

• Store the data in Amazon S3 and create a data catalogue.

• Create an AWS Glue crawler to infer the structure of the data, and transform the data into a human-readable format (such as CSV).

• Use Amazon Athena to query the data and create views for analysis purposes.

• Create an analysis dashboard in a relevant visualization platform.

• Complete the project within the allocated budget (capped at USD 50).

Advanced Requirements (in addition to the Basic Requirements)

• Data Wrangling (further data processing using boto3, AWS Data Wrangler and other relevant Python libraries)

• Orchestration and Deployment (using APIs and Step Functions)

• Monitoring and Notifications

• Project running costs optimally managed with no wastage, based on the services implemented versus cost (judged not by the current usage amount, but by the method of implementation; for example, 100 employees reading the same file versus 100 employees creating 100 files to read).

Refer to the Project Cost Estimate Report for the services recommended for this project here:

4. Deliverables

There are TWO deliverables for this project, namely:

• Project Report with Group Presentation (for the due date, refer to the Teaching Plan)

• Project Solution and Individual Presentation (for the due date, refer to the Teaching Plan)

4.1 Project Report – (Group 10%, PDF, 20 marks)

Form a group with 4 – 6 members.

Prepare a Project Proposal Report (in PDF format) that details the items below, stating the contributing member for each section:

4.1.1 Architecture Design (10 marks)

Identify business requirements: List the business requirements for the proof-of-concept, such as enabling the data science team to perform SQL analysis and managing data access based on job roles. Start by considering the four main parts of the pipeline (ingestion, storage, processing, serving) and expand from there.

Select relevant components: Identify the components that meet these requirements.

Justify component choices: Provide explanations for why each component was chosen and how it contributes to the intended data pipeline.

4.1.2 Configuration Checklist (10 marks)

Recommend configurations: Based on the identified components, recommend necessary non-default configurations for each service, such as enabling versioning for an S3 bucket.

Justify configurations: Explain the reasons for each recommended configuration.

Address access control and security: Include configurations for access control management (such as IAM, users, roles, and policies) and ensure configurations are optimized for threat prevention, data integrity, and compliance. Due to the limitations of the IAM module in the project prototyping environment, keep the recommended configurations for future reference.
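To make the checklist concrete, a recommended configuration such as S3 versioning can also be captured as a short, reproducible snippet; the sketch below is illustrative only and the bucket name is a placeholder.

```python
# Illustrative sketch: enable versioning (a recommended non-default
# configuration) on the project's S3 bucket via boto3.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="my-energinet-landing-bucket",            # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled"},
)
```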

4.2 Group Presentation (Group 10%, 10 marks)

All team members are required to present:

• Identified Business Requirements

• Configuration Checklist

• Question & Answer

4.3 Project Solution – (Individual 30%, PDF, 60 marks)

Prepare a Project Solution Report detailing the items below:

4.3.1 Basic Requirements

1.   Data Ingestion (to acquire data from JSON endpoint using API)

Suggested service/platform: AWS Cloud9 IDE or AWS Lambda. You may use either one of them to access the data directly from the JSON endpoint.

Be mindful of Python version and its compatibility with the required libraries.

Consider creating a layer in AWS Lambda to keep all the library dependencies in a single repository.

Be careful of the nested JSON structure and read the data accordingly.

Test your Python code in Jupyter Notebook first (especially for the iteration logic and conditional statements)
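If you opt for AWS Lambda for this step, a handler along the following lines is one possible starting point. It is a minimal sketch: the bucket name and key prefix are placeholders, the API endpoint is an assumption (see Section 2.2), and the requests dependency is assumed to be packaged in a Lambda layer as suggested above.

```python
# Hypothetical Lambda handler: pulls the latest records and drops the raw
# JSON into an S3 landing prefix. Bucket, prefix, and API details are
# placeholders; "requests" is assumed to be provided by a Lambda layer.
import json
import datetime

import boto3
import requests

S3_BUCKET = "my-energinet-landing-bucket"   # placeholder bucket
API_URL = "https://api.energidataservice.dk/dataset/ElectricityProdex5MinRealtime"  # assumed endpoint

s3 = boto3.client("s3")

def lambda_handler(event, context):
    resp = requests.get(API_URL, params={"limit": 1000}, timeout=30)
    resp.raise_for_status()
    records = resp.json().get("records", [])

    # One raw JSON object per ingestion run, keyed by run timestamp
    key = f"landing/raw/{datetime.datetime.utcnow():%Y/%m/%d/%H%M%S}.json"
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=json.dumps(records))

    return {"statusCode": 200, "records_ingested": len(records)}
```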

2.   Data Storage (to save data in temporary storage for processing)

Suggested service/platform: Amazon S3 can be considered for temporary storage. During the preceding data ingestion step, structuring the data into a tabular format (rows and columns) is required.

Be reminded of a landing zone concept during data collection. [Refer to L02a.Modern Data Architecture Infrastructure lecture taught in Week 2]

You may consider saving the acquired data in a cross-platform readable format (such as CSV, TXT, XLSX).
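A minimal sketch of this structuring step, assuming pandas is available in your Cloud9 or notebook environment, is shown below; the bucket and key names are placeholders, and json_normalize is used to flatten any nested fields into columns.

```python
# Sketch: flatten the ingested records into rows/columns with pandas and
# save them as CSV in an S3 landing zone. Bucket, key, and API details
# are placeholders/assumptions.
import io

import boto3
import pandas as pd
import requests

API_URL = "https://api.energidataservice.dk/dataset/ElectricityProdex5MinRealtime"  # assumed endpoint
S3_BUCKET = "my-energinet-landing-bucket"   # placeholder bucket
CSV_KEY = "landing/csv/prodex_5min.csv"     # placeholder key

records = requests.get(API_URL, params={"limit": 1000}, timeout=30).json().get("records", [])

# json_normalize flattens nested JSON fields into dotted column names
df = pd.json_normalize(records)

buffer = io.StringIO()
df.to_csv(buffer, index=False)
boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=CSV_KEY, Body=buffer.getvalue())
print(f"Wrote {len(df)} rows x {len(df.columns)} columns to s3://{S3_BUCKET}/{CSV_KEY}")
```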

3.   Data Process (to prepare table schema for usable data formats)

Suggested service/platform: AWS Glue can read the data from a data store, recognize the format, and allocate a schema with appropriate data types.

You may still need to adjust the data type for some columns manually (because STRING is the default type in AWS Glue).
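The crawler can be created in the Glue console, but if you prefer to script it, a boto3 sketch might look like the following; the crawler, database, role, and S3 path names are placeholders (LabRole is a common role name in the Learner Lab, but confirm the role available in your environment).

```python
# Sketch: create and start a Glue crawler over the landing-zone CSVs.
# Names, paths, and the IAM role are placeholders to adapt.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="energinet-prodex-crawler",     # placeholder crawler name
    Role="LabRole",                      # placeholder IAM role
    DatabaseName="energinet_db",         # placeholder Glue database
    Targets={"S3Targets": [{"Path": "s3://my-energinet-landing-bucket/landing/csv/"}]},
)
glue.start_crawler(Name="energinet-prodex-crawler")
```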

4.   Data Serving (to provide the data in analytic-ready state)

Suggested service/platform: Amazon Athena can work with most analytics platforms, ranging from visualization tools to machine learning models. Once data discovery has been conducted by AWS Glue, you should have tabular data with proper data types.

Athena serves as the connector between your data and any analytics platform.

Be reminded to set the output folder in the S3 bucket for the processed data.

[Refer to P03.Querying Data in the Cloud taught in Week 3]

Depending on your creativity, data aggregation, grouping, and filtering can be done in Athena to reduce the cost of scanning the whole dataset each time you run a query.
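As an illustration, the sketch below uses boto3 to create an hourly aggregation view in Athena and sets the S3 output location for query results; the database, table, output path, and column names are placeholders and should match what your Glue crawler actually inferred.

```python
# Sketch: create an aggregated Athena view so downstream queries scan less
# data. Database, table, output location, and column names are placeholders.
import boto3

athena = boto3.client("athena")

create_view_sql = """
CREATE OR REPLACE VIEW hourly_production AS
SELECT
    date_trunc('hour', from_iso8601_timestamp(minutes5utc)) AS hour_utc,  -- adjust if crawler typed this as timestamp
    pricearea,
    AVG(offshorewindpower) AS avg_offshore_wind_mw,   -- illustrative columns
    AVG(onshorewindpower)  AS avg_onshore_wind_mw,
    AVG(solarpower)        AS avg_solar_mw
FROM energinet_db.prodex_5min                          -- placeholder table
GROUP BY 1, 2
"""

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "energinet_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-energinet-landing-bucket/athena-results/"},
)
```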

5.   Data Analysis: (to query and analyse data assets in place)

Suggested service/platform: Use an appropriate visualization platform (such as Power BI or Tableau) to integrate with the AWS data sources.

You may consider connecting to Amazon Athena directly from a self-service BI tool.

You may need to install an ODBC data source locally, with the necessary driver, to connect to AWS services from your computer.

4.3.2 Advanced Requirements

6.   Data Wrangling (enhance the data quality, aggregation, and analysis)

Note: Of the four components of data wrangling, you may ignore the two below.

Structuring (given that the data has been prepared and formatted in Amazon Athena and is interpretable)

Normalizing and de-normalizing (since you are only focusing on the minimum level of segregation, such as countries and species)

Suggested service/platform: Amazon S3, AWS Data Wrangler and Jupyter Notebook

You are encouraged to work on:

Cleaning: explore and validate raw data, transforming it from its messy and complex state into high-quality data with the intent of making it more consumable and useful for analytics. This includes tasks like standardizing inputs, deleting duplicate values or empty cells, removing outliers, fixing inaccuracies, and addressing biases. Remove errors that might distort or damage the accuracy of your analysis.

Enriching: transform and aggregate the current data to produce valuable insights and guide business decisions. Once you've transformed your data into a more usable form, consider whether you have all the data you need for your analysis. If you don't, you can enrich it by integrating values from other datasets. You may also want to add metadata to your database at this point.
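A hedged sketch of the cleaning and enriching steps using AWS SDK for pandas (awswrangler), for example from a Jupyter notebook, is shown below; the database, table, S3 path, and column names are illustrative placeholders.

```python
# Sketch: clean and enrich the crawled table with awswrangler, then write
# the curated result back to S3 as Parquet and register it in Glue.
# Names, paths, and columns are placeholders.
import awswrangler as wr

# Cleaning: pull the table via Athena, then drop duplicates and empty rows
df = wr.athena.read_sql_query(
    "SELECT * FROM prodex_5min",   # placeholder table
    database="energinet_db",       # placeholder database
)
df = df.drop_duplicates().dropna(how="all")

# Enriching: add a derived metric to support the analysis dashboard
df["total_wind_mw"] = df["offshorewindpower"] + df["onshorewindpower"]  # illustrative columns

# Write curated, partition-friendly Parquet and register it as a new table
wr.s3.to_parquet(
    df=df,
    path="s3://my-energinet-landing-bucket/curated/prodex/",  # placeholder path
    dataset=True,
    database="energinet_db",
    table="prodex_curated",
    mode="overwrite",
)
```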

7.   Deployment (to orchestrate the pipeline services and automate the workflow). Orchestrating multiple services and components is good practice for DataOps. Package the services you used in the project using Step Functions and automate the flow with scheduling for seamless integration with minimal human intervention.

Suggested service/platform: AWS Step Functions, Amazon EventBridge, AWS Lambda
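One possible way to script this orchestration with boto3 is sketched below; the Lambda ARNs, role ARN, and schedule are placeholders, and you could equally build the same state machine in the Step Functions console.

```python
# Sketch: package the pipeline steps as a Step Functions state machine and
# schedule it with EventBridge. ARNs, role, and schedule are placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions")
events = boto3.client("events")

# Minimal Amazon States Language definition: ingest, then run the crawler
definition = {
    "StartAt": "IngestToS3",
    "States": {
        "IngestToS3": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ingest-prodex",  # placeholder
            "Next": "RunGlueCrawler",
        },
        "RunGlueCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:start-crawler",  # placeholder
            "End": True,
        },
    },
}

machine = sfn.create_state_machine(
    name="energinet-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/LabRole",   # placeholder role
)

# Trigger the state machine every hour via an EventBridge rule
events.put_rule(Name="energinet-hourly", ScheduleExpression="rate(1 hour)")
events.put_targets(
    Rule="energinet-hourly",
    Targets=[{
        "Id": "1",
        "Arn": machine["stateMachineArn"],
        "RoleArn": "arn:aws:iam::123456789012:role/LabRole",   # placeholder role
    }],
)
```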

8.   Monitoring (to assess the log files periodically and alert the system admin).

Understanding performance metrics and continuously improving the data pipeline are part of data engineering work. Observe the performance of components and make the necessary adjustments to the configurations to make your data workflow efficient and reliable.

Suggested service/platform: Amazon CloudWatch, CloudTrail, SNS

You may consider assessing log files regularly using the built-in monitoring tool (Amazon CloudWatch), which is well integrated with most components and generates the logs. For some components, utilization and performance metrics are available with useful charts. Use them accordingly to report performance issues and possible fine-tuning.

You may also integrate SNS with CloudWatch to trigger notifications to the system administrator in case of errors and unexpected events.
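For example, a boto3 sketch along these lines could wire a CloudWatch alarm on the ingestion Lambda's error metric to an SNS topic; the topic name, e-mail address, and function name are placeholders.

```python
# Sketch: alert the system administrator when the ingestion Lambda reports
# errors, by wiring a CloudWatch alarm to an SNS topic. Names are placeholders.
import boto3

sns = boto3.client("sns")
cloudwatch = boto3.client("cloudwatch")

topic_arn = sns.create_topic(Name="energinet-pipeline-alerts")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="admin@example.com")  # placeholder address

cloudwatch.put_metric_alarm(
    AlarmName="ingest-prodex-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "ingest-prodex"}],  # placeholder function
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[topic_arn],
)
```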

4.4 Individual Presentation (10%, 10 marks)

Your demonstration for the project implementation should meet the following criteria:

4.4.1    Detailed explanation of the data pipeline including, but not limited to

Ingestion Layer

Storage Layer

Processing Layer

4.4.2    Demonstrate the collection, ingestion, serving and analysis of your prototype to convince the CIO to consider it for actual implementation.

4.4.3    In addition, you may also highlight the possible vulnerabilities and security concerns for your data pipeline and how your configuration can mitigate them to lower the risk.

5. How the Project is assessed

Refer to Appendix A (page 10) for the detailed marking rubrics.

First Submission – Project Proposal Report (Group) – 10%

Font – Times New Roman

Font Size – 11

Format – PDF

Page Limit – 20 maximum, including the references

Template – TP-LMS – DAEC – Assessment DAEC Project Template.docx

Submission – Refer to the Teaching Plan

Group Presentation (10%)

Duration – Each group is allowed 20 minutes to explain the architecture proposal and configuration checklist.

Submission – The presentation will be scheduled during the timetabled lesson. Your tutor will inform you of the venue and schedule nearer to the date.

Second Submission – Project Solution Report (Individual) – 30%

Development Environment – Use AWS Academy (Learner Lab environment) for project prototyping. A $50 credit will be provided to use the AWS services appropriate for this project.

Font Size – 11

Content – Explanations with relevant screenshots for all work done in Section 4.3

Format – PDF

Page Limit – 50 maximum, including the references

Template – TP-LMS – DAEC – Assessment DAEC Project Template.docx

Submission – Refer to the Teaching Plan

Individual Presentation (10%)

Duration and scope – Each student is allowed 10 minutes to demonstrate how the data pipeline and workflow are implemented in the prototype and to address any questions raised.

Late submissions

A penalty will be applied to late submissions:

• Late and < 1 day – 10% deduction from the absolute marks given for that part of the work, e.g. if the assignment was worth 100 marks and you were given 75 marks for the work, after the penalty you are left with 65 marks, i.e. 75 – (100 × 10%).

• Late >= 1 and < 2 days – 20% deduction from the absolute marks.

• Late >= 2 days – no marks will be awarded.


