Climate data on Azure
One of the biggest public datasets you’ll find is ERA5 climate data. It offers a great basis for practicing new data analysis techniques, and you can capture many interesting insights about weather and our changing climate (just take a quick look on Kaggle to find some inspiration).
In this article we’ll offer guidance on handling Copernicus’ ERA5 data efficiently on Azure. It serves as a practical reference: depending on your own use case some components might differ, but the linked sources provide in-depth information on each topic.
Whether you’re taking your first steps into data science on Azure or you’re a seasoned pro seeking recommendations, this article is for you. So let’s get started!
Problem statement
Our aim is to download Copernicus’ ERA5 data within the Azure cloud to enable data scientists to enhance and analyse its contents.
ERA5 is the latest climate reanalysis, providing weather data on a global scale from 1940 onwards. This data is often used for monitoring climate change, research and education. The data is public and free (more information, including how to get access, can be found here).
We have divided the solution into three stages:
Downloading the data (from the CDS API)
Pre-processing (cleaning the data)
Data Modeling (providing a workspace for data scientists to do analysis and build models)
First we’ll take a look at the architecture diagram. Then we’ll dig deeper into the components, discussing alternatives and highlighting the reasons for choosing these services. Depending on your own use case the choice of services could differ. A plethora of sources will be linked to provide background information to explore at your leisure.
Architecture
In the next section we’ll take a deeper look at all of the components, but at its core this architecture is designed to be flexible and adaptable - serving individuals at various levels of expertise.
This architecture contains the minimal security measure of using a key vault. We will not discuss further security best practices, networking or observability. For an overview of best practices on these topics, the ISE playbook is a good place to get started.
Now we’ll take a look at all of the services separately and explain each component’s function.
Components
Copernicus' ERA5 data
The primary data source is Copernicus' ERA5 climate reanalysis dataset (that’s a mouthful!), which contains comprehensive climate data. The data can be downloaded through the Climate Data Store (CDS) web interface or using the CDS API service. There is also a cdsapi Python package available if you prefer Python development. More details can be found here.
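As an illustration, a minimal cdsapi request could look like the sketch below. The dataset name is real, but the variables, dates and output file are illustrative; adapt them to your own needs.

```python
import cdsapi

# The client reads your CDS url/key from ~/.cdsapirc (or environment variables).
client = cdsapi.Client()

# Illustrative request: one day of 2m temperature from the single-levels dataset.
client.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": ["2m_temperature"],
        "year": "2023",
        "month": "01",
        "day": "01",
        "time": ["00:00", "06:00", "12:00", "18:00"],
        "format": "netcdf",
    },
    "era5_2m_temperature_20230101.nc",
)
```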
We're triggering the CDS API through Azure Data Factory on a schedule so that it automatically downloads the newest available data and stores it in Azure Blob storage. There are several compute options worth taking a look at, like Azure Batch (for HPC workloads), Azure Functions or Managed Airflow.
Orchestration: Azure Data Factory
An orchestrator is the central control and coordination mechanism for the data processing workflow. Its primary goal is to coordinate scheduled triggers to download the ERA5 data and store it (without overwriting) in the correct blob storage.
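To make “without overwriting” concrete: one option is a date-partitioned blob path combined with `overwrite=False`. Here is a minimal sketch using the azure-storage-blob package; the container name, path scheme and connection string are placeholders.

```python
from datetime import date
from azure.storage.blob import BlobServiceClient

# Hypothetical naming scheme: a date-partitioned path keeps every download
# unique, and overwrite=False makes an accidental re-run fail loudly instead
# of silently replacing data.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob_path = f"raw/{date.today():%Y/%m/%d}/era5_single_levels.nc"
blob = service.get_blob_client(container="era5", blob=blob_path)

with open("era5_2m_temperature_20230101.nc", "rb") as f:
    blob.upload_blob(f, overwrite=False)  # raises ResourceExistsError if the blob exists
```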
Alternatively, we looked at Apache Airflow on Azure (also called Managed Airflow). However, at the time of writing, Managed Airflow was not generally available and could not be accessed behind a firewall or inside a VNet.
Azure Logic Apps was also considered, but at the time of writing it presented multiple limitations:
Each trigger requires an additional workflow to be triggered on a schedule.
Failed runs cannot be re-run from the failed action; instead, the whole workflow needs to be re-run.
No overview available to monitor all workflow runs.
No Git integration.
Automated downloading of ERA5
We will not go through the steps of how to set up ADF and the activities within it. The documentation can guide you through that, and there are many tutorials available if this is your first time.
We should take a look at the different types of triggers, though. These triggers can automate the downloading of ERA5 data, and you can design the pipeline so that it also automatically copies the data to the correct blob storage.
Azure Data Factory has tumbling window triggers, which fire at a specified time interval while retaining state. Their main advantage over schedule triggers, which simply execute activities at specific times or dates, is the ability to process and aggregate time-series data window by window. This makes them a good fit for our large climate dataset.
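As a sketch of what creating such a trigger could look like through the azure-mgmt-datafactory Python SDK: the pipeline, parameter and resource names below are placeholders, and the delay should be tuned to how long it takes new ERA5 data to become available upstream.

```python
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
    TumblingWindowTrigger,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = TumblingWindowTrigger(
    pipeline=TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="DownloadEra5Pipeline"),
        # The window bounds are handed to the pipeline, so each run knows
        # exactly which slice of ERA5 data to request.
        parameters={"windowStart": "@trigger().outputs.windowStartTime"},
    ),
    frequency="Hour",
    interval=24,  # one 24-hour window per day
    start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    delay="05:00:00",  # give upstream data time to become available before firing
    max_concurrency=1,
)

adf.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "DailyEra5Trigger",
    TriggerResource(properties=trigger),
)
```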
Pre-processing & data modelling: Azure Machine Learning
Once you’ve downloaded the data (in an automated fashion) and it’s stored in a blob container, you’ll need a tool that can take care of the remaining stages: pre-processing (including data exploration) and data modelling.
You now have the raw ERA5 data. But perhaps you want to optimize the format by converting it to parquet, or you want to filter out a specific region for your future analyses. Whatever your objective is: pre-processing (sometimes called data preparation) is a necessary evil.
There are many options available, and your choice also depends on preference. We recommend doing a spike for your specific use case to see which fits best.
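As an illustration, here is a small pre-processing sketch using xarray. The file name and region bounds are ours, and we assume the download is in NetCDF format (with pandas and pyarrow installed for the parquet step).

```python
import xarray as xr

ds = xr.open_dataset("era5_2m_temperature_20230101.nc")

# ERA5 stores latitude in descending order (90 to -90), hence slice(54, 50).
# These bounds roughly cover the Netherlands; adjust to your region of interest.
region = ds.sel(latitude=slice(54, 50), longitude=slice(3, 8))

# Flatten to a tabular layout and write a columnar, analysis-friendly format.
region.to_dataframe().reset_index().to_parquet("era5_region.parquet")
```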
We are going to take a look at Azure Machine Learning (AML) for the following reasons:
Provides a workspace for all levels of data scientists
Azure ML accommodates data scientists of all experience levels.
If you are familiar with Jupyter notebooks and prefer to do your data preparation there, you can simply spin one up in your AML workspace and get to work. It is not very different from notebooks in Azure Databricks; the difference is that your AML workspace can also serve the data modeling stage, code collaboration and MLops practices. But more on that later.
Azure Machine Learning Designer is a ‘machine-learning-as-a-service’ tool with a user-friendly UI in which you can drag and drop components to help you pre-process the data. If you don’t have a lot of experience with data science work yet, this is a really nice feature that can get you started quickly.
Model deployment and MLops
Managing the lifecycle of your own models can help improve the overall quality and maintenance. Machine Learning Operations (MLops) practices can help with that and are built into AML. We highly recommend taking a look at the Azure MLops documentation if you’re not familiar with this practice.
Once you have your model ready you can deploy it as an endpoint using the AML Python SDK. More information and a tutorial can be found here.
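A rough sketch of what that could look like with the AML Python SDK v2: all names are placeholders, and we assume an MLflow-format model so AML can infer the environment and scoring logic.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, Model

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Create the endpoint (a stable URL) and a deployment (the model behind it).
endpoint = ManagedOnlineEndpoint(name="era5-model-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="era5-model-endpoint",
    # An MLflow-format model lets AML skip a custom scoring script/environment.
    model=Model(path="./model", type="mlflow_model"),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```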
Azure-native
Because AML is Azure-native, it’s easy to connect to the other components in your Azure environment.
Alternative: Azure Databricks
We want to highlight an alternative on Azure which is also Spark-based: Azure Databricks. That means it can process data in parallel using distributed computing. You can spin up notebooks for your data preparation steps and execute them within ADB. The notebooks support Python, SQL, Scala and R.
An advantage of ADB is that it allows for partitioning strategies to optimize query performance. This comes in handy when dealing with large datasets that are expected to keep growing. Some teams are also more experienced with Databricks, and ADB is its Azure-native offering.
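A small PySpark sketch of such a partitioning strategy, as it might appear in a Databricks notebook: the storage paths are placeholders, `spark` is the session Databricks provides, and we assume the data has a timestamp column called `time`.

```python
from pyspark.sql.functions import month, year

df = spark.read.parquet("abfss://era5@<storageaccount>.dfs.core.windows.net/raw/")

# Partitioning by year/month means queries filtered on those columns only
# scan the matching directories instead of the whole dataset.
(
    df.withColumn("year", year("time"))
      .withColumn("month", month("time"))
      .write.partitionBy("year", "month")
      .mode("overwrite")
      .parquet("abfss://era5@<storageaccount>.dfs.core.windows.net/curated/")
)
```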
TL;DR
As an experiment, this part was generated by ChatGPT. Let me know if this works; I’d love some feedback!
In this article, we delve into an adaptable and flexible architecture designed to cater to individuals at varying levels of data science expertise. It revolves around efficiently handling Copernicus' ERA5 climate data on Azure. Key components include:
ERA5 Climate Data: This primary data source contains comprehensive climate data and can be accessed through the Climate Data Store (CDS) web interface or the CDS API service.
Azure Data Factory: Serving as the orchestrator, Azure Data Factory automates the retrieval of ERA5 data and its storage in Azure Blob storage. It offers different trigger options to suit your needs.
Azure Machine Learning: After automated data retrieval, Azure Machine Learning steps in for pre-processing (including data exploration) and data modeling. It provides a flexible workspace for all levels of data scientists and accommodates various preferences.
While this article provides valuable insights, please note that it primarily focuses on architecture components and does not delve into security, networking, or observability best practices. Refer to the ISE playbook for comprehensive insights into those areas.
Your specific component choices may vary based on your use case, and this article includes linked resources for further information. Consider conducting a feasibility test, often referred to as a "spike," to determine the best fit for your project.