No more cron jobs: orchestrate data pipelines with Azure Data Factory
Why you may need it
How do you like maintaining dozens of bash scripts defined in `crontab`? If you have ever worked in a team responsible for a data flow, you know what I mean. It starts innocently with one script, but as your crontab grows and grows, it is no fun anymore.
Or maybe you are running BI processes that work well in your current warehouse, but you keep hitting optimization problems and there are not enough resources to scale your workers up?
One option is a cloud-native scheduling service such as Amazon CloudWatch Events or Azure Event Grid. These give you better monitoring, but plenty of manual work remains. You may be tempted to use them because your team is already familiar with them, but if all you have is a hammer, everything looks like a nail. The point is to define the problem well first and only then pick a solution. The problem here is called data orchestration, and we need to find a tool that solves it.
There are many tools one can use to orchestrate data flow. Open source solutions like Airflow or Luigi have been gaining popularity recently. However, with these tools you are the one responsible for reserving enough resources (computation/storage) for the tasks to be performed. Such a tool also has to be deployed as a web application, so redundancy and scaling need to be taken into account.
With the Azure cloud you can use a solution that works as a managed service, without the need to think about resources explicitly: Azure Data Factory. We will describe it in more detail soon, but first, which other Azure services are helpful when designing a data pipeline?
Azure comes into the picture
Azure, the cloud platform managed by Microsoft, has many features that come in handy when dealing with data-analysis processes.
Where do you store your files? If you have maintained legacy systems, you may know the pain of fixing connectivity issues during data processing.
Have you ever heard about a lake full of data? Although "data lake" is often used as a marketing term, in Azure it describes cloud storage that fits the big data world well. The service is called Azure Data Lake Storage Gen2, and you may treat it as an infinite, replicated hard drive. You don't worry about capacity, because it grows as your needs grow. In its second generation, this storage is fully compatible with Hadoop-related systems (e.g. Spark), so you can analyze your data in parallel.
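For illustration, here is a minimal Python sketch of how such a lake is typically laid out: date-partitioned folder paths that Spark and other Hadoop-style tools can prune efficiently. The account, container, and file names are made up, and the commented-out upload uses the `azure-storage-file-datalake` SDK, so treat this as a sketch to adapt rather than a drop-in snippet.

```python
from datetime import date

def raw_data_path(source: str, day: date, filename: str) -> str:
    """Build a date-partitioned path inside the lake; year/month/day
    folders are a common layout that parallel engines can prune."""
    return f"raw/{source}/{day:%Y/%m/%d}/{filename}"

# Uploading that file with the azure-storage-file-datalake SDK could
# look roughly like this (account, key and container are placeholders):
#
#   from azure.storage.filedatalake import DataLakeServiceClient
#   svc = DataLakeServiceClient(
#       account_url="https://<account>.dfs.core.windows.net",
#       credential="<account-key>")
#   fs = svc.get_file_system_client("datalake")
#   fs.get_file_client(raw_data_path("logs", date.today(), "app.json")) \
#     .upload_data(b"{}", overwrite=True)

print(raw_data_path("logs", date(2020, 7, 1), "app.json"))
# -> raw/logs/2020/07/01/app.json
```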
Scalable SQL server
SQL is the language of data. Most probably your current solution uses some kind of relational database. Where is it running? If you're lucky, it is managed in the cloud. Otherwise, you have to face the risk of idle resources that you still have to pay for and maintain.
Most cloud SQL offerings come in a pay-as-you-go flavour: you don't pay for resources you don't use. SQL served on Azure follows this pattern. Moreover, you always have the latest version of the engine and you don't have to worry about upgrades.
For many years Azure presented its Data Warehouse as the solution for big data workloads. Recently, Microsoft released the Azure Synapse Analytics service, generally available since mid-2020. You may design a data warehouse, create data flows, and use Power BI reports there, which means a full big data analytics process can be developed in one place. And this is scalability you can afford, because you pay only for the resources you really need. Let me quote one of the success stories from the Azure portal:
“We generally need about 400 data warehouse units, but at month’s end, we need to scale up to 1,000 DWUs to accommodate requests from various business users who need to generate reports quickly. With Azure Synapse Analytics, we can scale up to achieve additional compute power in seconds, and then scale down to manage costs.”
– Brian Muenks (Maritz)
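Rescaling a dedicated SQL pool like that comes down to a single T-SQL statement run against the logical server. Here is a hedged Python sketch; the pool name and connection string are placeholders, and you should verify the available service-objective names (e.g. `DW400c`, `DW1000c`) against your tier before relying on this.

```python
def scale_statement(pool_name: str, dwus: int) -> str:
    """Return the T-SQL that rescales a Synapse dedicated SQL pool;
    Gen2 service-objective names look like 'DW400c' or 'DW1000c'."""
    return (f"ALTER DATABASE [{pool_name}] "
            f"MODIFY (SERVICE_OBJECTIVE = 'DW{dwus}c')")

# Executed against the master database, e.g. with pyodbc:
#
#   conn = pyodbc.connect(conn_string, autocommit=True)
#   conn.execute(scale_statement("mypool", 1000))  # month-end peak
#   conn.execute(scale_statement("mypool", 400))   # back to baseline

print(scale_statement("mypool", 1000))
# -> ALTER DATABASE [mypool] MODIFY (SERVICE_OBJECTIVE = 'DW1000c')
```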
As with Azure Synapse, some of its ideas had been present in the Azure platform for many years, and this is true for the orchestration mechanism as well. How can you use Azure Data Factory in your project?
Azure Data Factory
Azure Data Factory is a solution built on top of Azure to provide data-centred services. You can move your data by creating pipelines, in both on-premises-to-cloud and cloud-to-cloud scenarios. These pipelines are triggered by events, such as storage state changes, or on a schedule.
Imagine a company in which you feed your data (e.g. application logs) into a log aggregation system and draw insights from it. But what if you wanted to enrich it with other data sources? These sources have to be stored somewhere; with Azure data solutions you may put everything in the data lake. The process of moving and enriching this data has to be orchestrated and managed, and this is where Data Factory comes in.
With Data Factory you may design the ETL process. ETL stands for Extract, Transform and Load: the crucial steps of data processing. Data Factory gives you full control over all of them.
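To make the three steps concrete, here is a toy, purely local Python sketch of an ETL flow. In Data Factory each of these functions would correspond to an activity or a data flow rather than hand-written code; the field names are invented for the example.

```python
import csv
import io
import json

def extract(raw_csv):
    """Extract: parse the raw source (CSV text here) into records."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: drop rows without an amount, cast types, rename."""
    return [{"user": r["user"], "spent": float(r["amount"])}
            for r in rows if r["amount"]]

def load(rows):
    """Load: serialize to the destination format (JSON lines here)."""
    return "\n".join(json.dumps(r) for r in rows)

raw = "user,amount\nalice,9.5\nbob,\ncarol,3.0"
print(load(transform(extract(raw))))  # bob is filtered out
```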
Transformation is possible using UI tools like Data Flow, where you can drag and drop components to merge, filter, or map your data elements. You are also free to use regular code instead.
Then you may load the data into the destination of your choice, typically a data warehouse (e.g. Azure Synapse) or storage that will be read by Power BI. In practice, Data Factory usually moves the data around, while heavier transformations are done by Spark jobs run, for example, on the Databricks platform.
When building a flow in Azure Data Factory, you will use the following components:
- Pipeline is a group of activities that run one after another. For example, a pipeline may read data from storage and then run a query on part of it. You can chain multiple activities and branch the flow based on the results of previous steps, e.g. what should happen when copying fails.
- Mapping data flow is data transformation logic you design using a simple drag-and-drop interface. If you have used Databricks or Spark jobs, you may treat this feature as a designer for creating such jobs.
- Activity is a single processing step that moves or transforms data. For example, you can copy data from Azure Storage to SQL Server.
- Linked services and datasets represent the data stores and data that can be read or written by Data Factory. You may treat linked services roughly as representations of connection strings.
- Variables and parameters are used to store configuration or temporary values between the steps of a pipeline.
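To see how these building blocks relate, here is a sketch of the JSON document that Data Factory keeps behind the UI for a pipeline, written as a Python dict. All names (`CopyLogsPipeline`, the dataset references) are illustrative assumptions, not required values; compare with what the "Code" view of your own factory shows.

```python
import json

# Illustrative names only; the structure mirrors what Data Factory's
# 'Code' view shows for a pipeline definition.
pipeline = {
    "name": "CopyLogsPipeline",                        # the pipeline itself
    "properties": {
        "parameters": {                                # pipeline parameters
            "targetFolder": {"type": "String", "defaultValue": "raw/logs"},
        },
        "activities": [{                               # one Copy activity
            "name": "CopyFromHttpToLake",
            "type": "Copy",
            "inputs": [{"referenceName": "HttpSourceDataset",
                        "type": "DatasetReference"}],  # source dataset
            "outputs": [{"referenceName": "LakeSinkDataset",
                         "type": "DatasetReference"}], # sink dataset
        }],
    },
}
print(json.dumps(pipeline, indent=2))
```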
Design your Data Factory
To design your Data Factory, you need to create a Data Factory instance in the Azure portal. If you don't have an Azure account, you can get one free for a period of 12 months, with a decent amount of credit to spend.
After the Data Factory service has been provisioned, you can start working on your pipeline: find the Author & Monitor option in the Overview tab.
If you want to start with something simple, try fetching data from an existing open API and storing it in your cloud storage. Let us list the steps required to accomplish this task.
First, you need to create an Azure Storage account (most likely Gen2, to make use of the data lake capabilities). Once it is ready, you can easily manage it from Azure Storage Explorer, which is available for different operating systems and works well even on Linux. Create a blob container/filesystem that will act as a storage folder for the processed data.
Then you will need to author a data pipeline and two datasets. You can see how this looks in the Data Factory resources tab.
The first dataset is needed to read data from the source; as we consume an API, we will need the HTTP connector. The second dataset will be of type Azure Data Lake Storage Gen2, and we will store the data there. For both datasets, you will be asked to create linked services that define the connection parameters.
With this pair of datasets in place, you need to define a data pipeline. In our case, it consists of a single Copy data activity, for which you define the source, the sink, and the mapping. Once the activity is defined, you can test that it works by triggering Debug. You are free to add other steps to your pipeline, e.g. further processing of the data or calling webhooks. You may start with a simple pipeline consisting of just this one activity.
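As a sketch, the Copy data activity for this HTTP-to-lake scenario could look roughly like the following. The `HttpSource`/`DelimitedTextSink` type names follow the ADF connector documentation, but the dataset names are invented; verify everything against the JSON your own Data Factory generates in the "Code" view.

```python
# Dataset names are invented; the source/sink type names follow the ADF
# connector docs but should be verified in your own factory.
copy_activity = {
    "name": "FetchOpenApi",
    "type": "Copy",
    "inputs": [{"referenceName": "OpenApiHttpDataset",
                "type": "DatasetReference"}],
    "outputs": [{"referenceName": "LakeCsvDataset",
                 "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "HttpSource"},        # read from the open API
        "sink": {"type": "DelimitedTextSink"},   # write CSV to the lake
    },
}
print(copy_activity["name"])
```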
After your pipeline has been created, you need to publish it. Then you can set a trigger to run it based on some event or on a schedule, e.g. periodically every n minutes.
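A schedule trigger that runs a pipeline every 15 minutes is, again, just a small JSON document behind the UI. The names below are illustrative placeholders; the recurrence shape follows the ScheduleTrigger schema.

```python
# Names are illustrative; recurrence follows the ScheduleTrigger schema.
trigger = {
    "name": "Every15Minutes",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {"frequency": "Minute", "interval": 15},
        },
        "pipelines": [{                     # which pipeline(s) to start
            "pipelineReference": {"referenceName": "CopyLogsPipeline",
                                  "type": "PipelineReference"},
        }],
    },
}
print(trigger["properties"]["type"])
```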
The example in this article is an oversimplification, just to show what you need to start playing with Data Factory. In real scenarios you will most likely define several steps and process your data inside your SQL engine or data lake, using Databricks jobs or Hive queries. Moving data around, however, calls for a standard approach, and this is what Data Factory was designed for: data orchestration at scale.
So, dear engineer: no more cron jobs. It's time to encourage your team leaders and colleagues to move to the cloud and try a real data orchestration system like the Azure Data Factory described in this article.