SCHOOL OF INFORMATICS & IT
DIPLOMA IN BIG DATA & ANALYTICS
AY 2025/2026 April Semester
Project (60%)
Analysis Data Pipeline for Danish Electricity and Gas supply
DATA ENGINEERING IN THE CLOUD (CDA2C06)
SUBJECT LEVEL: 2
GENERAL INSTRUCTIONS
1. This document consists of 13 pages (including cover page and marking rubrics).
2. Please complete ALL tasks in this assignment.
1. Background
Energinet is the independent public enterprise responsible for Denmark's transmission system for electricity and gas. Through its commitment to transparency and the green transition, Energinet provides publicly accessible energy datasets via its Energy Data Service platform (https://en.energinet.dk/). These datasets support national and international energy research, system planning, market analysis, and innovation in energy technologies.
The Energinet Data Service API provides programmatic access to a wide range of datasets including electricity spot prices, consumption patterns, production data by source (wind, solar, fossil), emissions, grid balance, gas consumption, and more. These datasets are particularly valuable for analytics projects involving energy forecasting, market behavior analysis, and sustainability impact studies.
The platform aligns with the open data policies of the Danish Government and the EU, promoting the use of open data in fostering data literacy, transparency, and innovation. Source: https://en.energinet.dk
2. Data
2.1 Data Source
The dataset for this project will be drawn from Energinet’s "Electricity Production and Exchange 5 min Realtime" dataset, available through the Energy Data Service.
https://www.energidataservice.dk/tso-electricity/ElectricityProdex5MinRealtime
This dataset provides high-frequency, near real-time electricity production data in Denmark at 5-minute intervals. It includes breakdowns by energy production type (e.g., wind, solar, thermal) and tracks international electricity exchanges with neighboring countries.
This dataset is ideal for students interested in:
• Monitoring fluctuations in electricity generation by source
• Studying the real-time balance between domestic production and cross-border electricity exchanges
• Analyzing renewable energy contribution in near real-time
• Building time-series models to forecast short-term grid behavior.
It serves as a rich source for data engineering workflows and real-time analytics and is updated frequently to reflect live operational data from Denmark’s national grid.
2.2 Data Format
The data is accessible via a RESTful API that supports GET requests and returns results in JSON format, making it suitable for ingestion by AWS services or analytics platforms using Python, R, or SQL-based engines like Athena.
API Explorer by Energinet Data Service:
Figure 2.1 (screenshot of two days of Electricity Production and Exchange data at 5-minute intervals)
API requests can be customized using parameters such as:
• start and end (to filter by datetime range)
• filter (to target specific production types or bidding zones)
• sort and time-zone options
Sample API Request (by Postman):
Figure 2.2 (screenshot from Postman: customized request with offset, start, end and sort parameters)
The returned JSON object contains fields such as:
• Minutes5UTC – timestamp in UTC
• PriceArea – DK1 or DK2 market zone
• Production – electricity production in MW by source (e.g., solar, wind onshore/offshore, thermal)
• Exchange – electricity flow (import/export) to/from neighboring countries
This structured format enables students to automate data ingestion, catalog metadata using AWS Glue, and query with Amazon Athena for exploratory and analytical tasks.
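For illustration, a minimal Python sketch of such an automated request is shown below. The endpoint URL (derived from the dataset name), the top-level "records" key and the exact parameter values are assumptions; verify them against the API Explorer and the Postman example in Figures 2.1 and 2.2 before relying on them.

    import requests

    # Assumed REST endpoint derived from the dataset name; confirm it in the API Explorer.
    API_URL = "https://api.energidataservice.dk/dataset/ElectricityProdex5MinRealtime"

    params = {
        "start": "2025-04-01T00:00",  # datetime range filter
        "end": "2025-04-02T00:00",
        "sort": "Minutes5UTC ASC",    # sort option, as in the Postman example
        "offset": 0,
    }

    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()

    payload = response.json()
    records = payload.get("records", [])  # records are assumed to sit under a top-level key
    print(f"Fetched {len(records)} records")
    if records:
        print(records[0])  # inspect fields such as Minutes5UTC and PriceArea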
3. Tasks
You are part of Energinet's data engineering team and are tasked with evaluating, designing, and building an AWS data pipeline as a proof of concept.
You are required to use Energinet's "Electricity Production and Exchange 5 min Realtime" dataset. You are to build a data engineering solution in your AWS Learner Lab using the AWS services that you are familiar with. By working with this data source, you will be able to test whether the solution you build can support a much larger dataset in an actual implementation.
The objective of this project is to achieve sustainable, seamless data synchronization and a better front-end data service for data consumption. Before building a robust and reliable solution, you shall start with an architecture design proposal (the Project Report) and a viable prototype (the Project Solution). The aim of the prototype is to present a proof of concept (POC) for judging the feasibility of an actual implementation.
This project will challenge you to do the following:
Basic Requirements
Use an AWS Cloud9 integrated development environment (IDE) instance.
Collect and ingest the data from the web source.
Store the data in Amazon S3 and create a data catalogue.
Create an AWS Glue crawler to infer the structure of the data, and transform the data into a human-readable format (such as CSV).
Use Amazon Athena to query the data and create the views for analysis purpose.
Create an analysis dashboard in relevant visualization platform.
Complete the project within the allocated budget (capped at $50 USD).
Advanced Requirements (in addition to the Basic Requirements)
Data Wrangling (further data processing using boto3, AWS Data Wrangler and other relevant Python libraries)
Orchestration and Deployment (using API and Step functions)
Monitoring and Notifications
Project running costs optimally managed with no wastage, based on the services implemented versus their cost (judged not by the current usage amount but by the method of implementation, e.g. 100 employees reading the same file versus 100 employees creating 100 files to read).
Refer to the Project Cost Estimate Report for the services recommended for this project here:
4. Deliverables
There are TWO deliverables for this project, namely:
Project Report with Group Presentation (for due date, refer to Teaching Plan)
Project Solution and Individual Presentation (for due date, refer to Teaching Plan)
4.1 Project Report – (Group 10%, PDF, 20 marks)
Form a group with 4 – 6 members.
Prepare a Project Proposal Report (in PDF format) that details the sections below, stating the contributing member for each section:
4.1.1 Architecture Design (10 marks)
• Identify business requirements: List the business requirements for the proof-of-concept, such as enabling the data science team to perform SQL analysis and managing data access based on job roles. Start by considering the four main parts of the pipeline (ingestion, storage, processing, serving) and expand from there.
• Select relevant components: Identify the components that meet these requirements.
• Justify component choices: Provide explanations for why each component was chosen and how it contributes to the intended data pipeline.
4.1.2 Configuration Checklist (10 marks)
• Recommend configurations: Based on the identified components, recommend necessary non-default configurations for each service, such as enabling versioning for an S3 bucket.
• Justify configurations: Explain the reasons for each recommended configuration.
• Address access control and security: Include configurations for access control management (such as IAM, users, roles, and policies) and ensure configurations are optimized for threat prevention, data integrity, and compliance. Due to the limitations of the IAM module in the project prototyping environment, keep the recommended configurations for future reference.
4.2 Group Presentation (Group 10%, 10 marks)
All team members are required to present:
• Identified Business Requirements
• Configuration Checklist
• Question & Answer
4.3 Project Solution – (Individual 30%, PDF, 60 marks)
Prepare a Project Solution Report detailing the following:
4.3.1 Basic Requirements
1. Data Ingestion (to acquire data from JSON endpoint using API)
Suggested service/platform: AWS Cloud9 IDE or AWS Lambda. You may use either one of them to access the data directly from the JSON endpoint.
Be mindful of Python version and its compatibility with the required libraries.
Consider creating a layer in AWS Lambda to keep all the library dependencies in a single repository.
Be careful of the nested JSON structure and read the data accordingly.
Test your Python code in Jupyter Notebook first (especially for the iteration logic and conditional statements)
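The sketch below pulls these notes together, assuming the data is acquired in an AWS Lambda handler. The endpoint URL and the shape of any nested fields are assumptions; only the Python standard library is used, so no Lambda layer is needed for this particular sketch, but libraries such as requests or pandas would be packaged into a layer as suggested above.

    import json
    import urllib.request

    # Assumed endpoint; confirm it against the API Explorer before use.
    API_URL = ("https://api.energidataservice.dk/dataset/"
               "ElectricityProdex5MinRealtime?start=2025-04-01T00:00&end=2025-04-02T00:00")

    def flatten(record, parent_key=""):
        """Flatten nested fields, e.g. {"Production": {"Solar": 12}} -> {"Production_Solar": 12}."""
        flat = {}
        for key, value in record.items():
            name = f"{parent_key}_{key}" if parent_key else key
            if isinstance(value, dict):
                flat.update(flatten(value, name))
            else:
                flat[name] = value
        return flat

    def lambda_handler(event, context):
        with urllib.request.urlopen(API_URL, timeout=30) as resp:
            payload = json.loads(resp.read())
        rows = [flatten(r) for r in payload.get("records", [])]
        # In the full pipeline the rows would be handed to the storage step (see Data Storage below).
        return {"statusCode": 200, "rowCount": len(rows)}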
2. Data Storage (to save data in temporary storage for processing)
Suggested service/platform: Amazon S3 can be considered for temporary storage. During the preceding data ingestion step, structuring the data into a tabular format (rows and columns) is required.
Be reminded of a landing zone concept during data collection. [Refer to L02a.Modern Data Architecture Infrastructure lecture taught in Week 2]
You may consider saving the acquired data in a cross-platform readable format (such as CSV, TXT, or XLSX).
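As an example of landing the ingested rows in the S3 landing zone, the sketch below writes a timestamped CSV object with boto3. The bucket name and key prefix are hypothetical placeholders for your own resources.

    import csv
    import io
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    BUCKET = "energinet-landing-zone"        # hypothetical bucket name
    PREFIX = "raw/electricity_prodex_5min"   # hypothetical landing-zone prefix

    def save_rows_as_csv(rows):
        """Write a list of flat dicts (one per record) to a timestamped CSV object in S3."""
        if not rows:
            return None
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
        key = f"{PREFIX}/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.csv"
        s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue().encode("utf-8"))
        return key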
3. Data Process (to prepare table schema for usable data formats)
Suggested service/platform: AWS Glue can read the data from a data store, recognize the format, and allocate a schema with appropriate data types.
You may still need to adjust the data type for some columns manually (because STRING is the default type in AWS Glue).
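The crawler can be created in the Glue console or scripted with boto3, as in the sketch below. The crawler name, database name, role and S3 path are hypothetical and must match your own Learner Lab resources (the lab typically provides a pre-created LabRole).

    import boto3

    glue = boto3.client("glue")

    # Hypothetical names; replace with your own Learner Lab resources.
    CRAWLER_NAME = "energinet-raw-crawler"
    DATABASE = "energinet_db"
    ROLE = "arn:aws:iam::123456789012:role/LabRole"
    S3_PATH = "s3://energinet-landing-zone/raw/electricity_prodex_5min/"

    # Create the crawler over the landing zone and run it to infer the table schema.
    glue.create_crawler(
        Name=CRAWLER_NAME,
        Role=ROLE,
        DatabaseName=DATABASE,
        Targets={"S3Targets": [{"Path": S3_PATH}]},
    )
    glue.start_crawler(Name=CRAWLER_NAME)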
4. Data Serving (to provide the data in analytic-ready state)
Suggested service/platform: Amazon Athena can work with most analytics platforms, ranging from visualization tools to machine learning models. Once data discovery has been conducted by AWS Glue, you should have tabular data with proper data types.
Athena would be serving as the connector between your data and any analytics platform.
Be reminded of setting the output folder in S3 bucket for the processed data.
[Refer to P03.Querying Data in the Cloud taught in Week 3]
Depending on your creativity, data aggregation, grouping, and filtering can be done in Athena to reduce the cost of scanning the whole dataset each time you run a query.
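For example, an aggregating view can be created from Python through the Athena API, as sketched below. The database, table and column names and the S3 output location are assumptions that depend on what your Glue crawler actually registered.

    import boto3

    athena = boto3.client("athena")

    # Hypothetical names from the earlier steps; adjust to your own catalog and bucket.
    DATABASE = "energinet_db"
    OUTPUT = "s3://energinet-landing-zone/athena-results/"

    # Hourly average production per price area; column names are assumed.
    CREATE_VIEW = """
    CREATE OR REPLACE VIEW hourly_production AS
    SELECT date_trunc('hour', CAST(Minutes5UTC AS timestamp)) AS hour_utc,
           PriceArea,
           AVG(SolarPower)       AS avg_solar_mw,
           AVG(OnshoreWindPower) AS avg_onshore_wind_mw
    FROM electricity_prodex_5min
    GROUP BY 1, 2
    """

    response = athena.start_query_execution(
        QueryString=CREATE_VIEW,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )
    print("Query execution id:", response["QueryExecutionId"])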
5. Data Analysis: (to query and analyse data assets in place)
Suggested service/platform: Use an appropriate visualization platform (such as Power BI or Tableau) to integrate with the AWS data sources.
You may consider connecting to Amazon Athena directly from a self-service BI tool.
You may need to install an ODBC data source locally with the necessary driver to connect to AWS services from your computer.
4.3.2 Advanced Requirements
6. Data Wrangling (enhance the data quality, aggregation, and analysis)
Notes: Of the four components of data wrangling, you may ignore the two below.
Structuring (given that the data has already been prepared and formatted in Amazon Athena and is interpretable)
Normalizing and de-normalizing (since you are only focusing on the minimum level of segregation, such as countries and species)
Suggested service/platform: Amazon S3, AWS Data Wrangler and Jupyter Notebook
You are encouraged to work on:
Cleaning: explore and validate raw data, turning it from its messy state and complex forms into high-quality data with the intent of making it more consumable and useful for analytics. This includes tasks like standardizing inputs, deleting duplicate values or empty cells, removing outliers, fixing inaccuracies, and addressing biases. Remove errors that might distort or damage the accuracy of your analysis.
Enriching: transform and aggregate the current data to produce valuable insights and guide business decisions. Once you've transformed your data into a more usable form, consider whether you have all the data you need for your analysis. If you don't, you can enrich it by integrating values from other datasets. You may also want to add metadata to your database at this point.
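A possible cleaning-and-enriching sketch using AWS Data Wrangler (the awswrangler library) is shown below. The database, table and column names are assumptions based on the earlier steps; the curated data is written back to S3 as Parquet and registered in the Glue catalog.

    import awswrangler as wr
    import pandas as pd

    # Hypothetical catalog names and output path; adjust to your own environment.
    DATABASE = "energinet_db"
    CLEAN_PATH = "s3://energinet-landing-zone/clean/electricity_prodex_5min/"

    # Pull the raw table from Athena into a pandas DataFrame.
    df = wr.athena.read_sql_query("SELECT * FROM electricity_prodex_5min", database=DATABASE)

    # Cleaning: drop duplicates and rows missing the timestamp or price area
    # (column names are assumed to be lower-cased by the crawler).
    df = df.drop_duplicates()
    df = df.dropna(subset=["minutes5utc", "pricearea"])

    # Enriching: add a calendar date column for easier grouping in BI tools.
    df["minutes5utc"] = pd.to_datetime(df["minutes5utc"])
    df["report_date"] = df["minutes5utc"].dt.date

    # Write the curated dataset back to S3 and register it in the Glue catalog.
    wr.s3.to_parquet(
        df,
        path=CLEAN_PATH,
        dataset=True,
        database=DATABASE,
        table="electricity_prodex_5min_clean",
        mode="overwrite",
    )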
7. Deployment (to orchestrate the services and automate the workflow). Orchestrating multiple services and components is good practice for DataOps. Package the services you used in the project using Step Functions and automate the flow with scheduling for seamless integration with minimal human intervention.
Suggested service/platform: AWS Step Functions, Amazon EventBridge, AWS Lambda
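As a sketch of this orchestration, the snippet below defines a minimal two-step state machine (run the ingestion Lambda, then start the Glue crawler) and schedules it hourly with EventBridge. All names and ARNs are hypothetical placeholders for your own Learner Lab resources.

    import json

    import boto3

    sfn = boto3.client("stepfunctions")
    events = boto3.client("events")

    ROLE_ARN = "arn:aws:iam::123456789012:role/LabRole"                                 # hypothetical
    INGEST_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ingest-prodex"  # hypothetical

    # Minimal two-step workflow: ingest the data, then crawl the landing zone.
    definition = {
        "StartAt": "Ingest",
        "States": {
            "Ingest": {"Type": "Task", "Resource": INGEST_LAMBDA_ARN, "Next": "Crawl"},
            "Crawl": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": "energinet-raw-crawler"},
                "End": True,
            },
        },
    }

    machine = sfn.create_state_machine(
        name="energinet-pipeline",
        definition=json.dumps(definition),
        roleArn=ROLE_ARN,
    )

    # Trigger the state machine on a schedule via an EventBridge rule.
    events.put_rule(Name="energinet-hourly", ScheduleExpression="rate(1 hour)")
    events.put_targets(
        Rule="energinet-hourly",
        Targets=[{"Id": "1", "Arn": machine["stateMachineArn"], "RoleArn": ROLE_ARN}],
    )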
8. Monitoring (to assess the log files periodically and alert the system admin).
Understanding the performance metrics and continuously improving the data pipeline is part of data engineering work. Observe the performance of components and make the necessary adjustments to the configurations to make your data workflow efficient and reliable.
Suggested service/platform: Amazon CloudWatch, CloudTrail, SNS
You may consider assessing the log files regularly using the built-in monitoring tool (Amazon CloudWatch), which is well integrated with most components that generate logs. For some components, utilization and performance metrics are available with useful charts. Use them accordingly to report performance issues and possible fine-tuning.
You may also integrate SNS with CloudWatch to trigger notifications to the system administrator in case of errors and unexpected events.
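A minimal sketch of such an integration is shown below: an SNS topic with an email subscription, plus a CloudWatch alarm on the Errors metric of a hypothetical ingestion Lambda function. Adjust the names, metric and threshold to the components you actually deployed.

    import boto3

    sns = boto3.client("sns")
    cloudwatch = boto3.client("cloudwatch")

    # Create a topic and subscribe the system administrator (hypothetical address).
    topic = sns.create_topic(Name="energinet-pipeline-alerts")
    sns.subscribe(TopicArn=topic["TopicArn"], Protocol="email", Endpoint="admin@example.com")

    # Alarm whenever the ingestion Lambda (hypothetical function name) reports any errors.
    cloudwatch.put_metric_alarm(
        AlarmName="ingest-prodex-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "ingest-prodex"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[topic["TopicArn"]],
    )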
4.4 Presentation – Individual (10%, 10 marks)
Your demonstration for the project implementation should meet the following criteria:
4.4.1 Detailed explanation of the data pipeline including, but not limited to
Ingestion Layer
Storage Layer
Processing Layer
4.4.2 Demonstrate the collection, ingestion, serving and analysis of your prototype to convince the CIO to consider it for actual implementation.
4.4.3 In addition, you may also highlight the possible vulnerabilities and security concerns for your data pipeline and how your configuration can mitigate them to lower the risk.
5. How the Project is assessed
Refer to Appendix A (page 10) for the detailed marking rubrics.
First Submission – Project Proposal Report (Group) 10%
Font – Times New Roman
Font Size – 11
Format – PDF
Page Limit – 20 maximum including the references
Template – TP-LMS – DAEC – Assessment – DAEC Project Template.docx
Submission – Refer to the Teaching Plan
Group Presentation (10%)
Duration – Each group is allowed 20 minutes to explain the architecture proposal and configuration checklist.
Submission – The presentation will be scheduled during the timetabled lesson. Your tutor will inform you of the venue and schedule nearer to the date.
Second Submission – Project Solution Report (Individual) – 30%
Development Environment – Use AWS Academy (Learner Lab Environment) for project prototyping. $50 credit points will be provided to use the AWS Services appropriate for this project.
Font Size – 11
Content – Explanations with relevant screenshots for all work done in Section 4.3
Format – PDF
Page Limit – 50 maximum including the references
Template – TP-LMS – DAEC – Assessment – DAEC Project Template.docx
Submission – Refer to the Teaching Plan
Individual Presentation (10%)
Duration and scope – Each student is allowed 10 minutes to demonstrate how the data pipeline and workflow are implemented in the prototype and to address any questions raised.
Late submissions
A penalty will be applied to late submissions:
• Late by less than 1 day – 10% deduction from the absolute marks given for that part of the work, e.g. if the assignment was worth 100 marks and you were given 75 marks for the work, after the penalty you are left with 65 marks, i.e. 75 − 100 × 10%
• Late >= 1 and < 2 days – 20% deduction from absolute marks
• Late >= 2 days – no marks will be awarded