- How does it Operate?
- Combine and Assemble in Azure
- Remodel and Enhance
- CI/CD and Distribute
- Top-level Theories
- Mapping Data Flows
- Linked Services
- Pipeline Runs
- Control Flow
Cloud computing delivers computing services, including servers, storage, databases, networking, software, analytics, and intelligence, over the Internet (the cloud). Microsoft Azure is one provider of such cloud computing services.
In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. On its own, however, raw data lacks the context or meaning needed to provide useful insights to data analysts, data scientists, or business decision-makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
For example, imagine a gaming company that collects petabytes of game logs produced by games in the cloud. The company wants to analyze these logs to gain insights into customer preferences, demographics, and usage behaviour. It also wants to identify up-sell and cross-sell opportunities, develop compelling new features, drive business growth, and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer data, game data, and marketing campaign data that is held in an on-premises data store. The company wants to use this data from the on-premises data store and combine it with additional log data held in a cloud data store.
To extract insights, it wants to process the joined data using a Spark cluster in the cloud (Azure HDInsight) and publish the transformed data into a cloud data warehouse such as Azure Synapse Analytics, so that reports can easily be built on top of it. The company wants to automate this workflow, and to monitor and manage it on a daily schedule. It also wants to run the workflow when files land in a blob storage container.
Azure Data Factory is the platform that solves such data scenarios. It is a cloud-based ETL and data integration service that lets you create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.
Additionally, you can publish your transformed data to data stores such as Azure Synapse Analytics for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.
How does it Operate?
Data Factory contains a series of interconnected systems that provide a complete end-to-end platform for data engineers.
Combine and Assemble in Azure
Enterprises have data of various types located in disparate sources: on-premises and in the cloud, structured, unstructured, and semi-structured, all arriving at different intervals and speeds.
The first step in building an information production system is to connect to all the required sources of data and processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next step is to move the data as needed to a centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to integrate these data sources and processing. Such systems are expensive and hard to integrate and maintain. In addition, they often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can collect data in Azure Data Lake Storage and transform the data later by using an Azure Data Lake Analytics compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
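To make the copy step more concrete, here is a minimal sketch that builds the kind of JSON definition a pipeline with a single copy activity might have. Python is used only to keep the example runnable. The field layout follows the general shape of Data Factory pipeline JSON, but the helper `copy_activity`, the pipeline and dataset names, and the source and sink type strings are assumptions made for illustration and should be checked against the official schema before use.

```python
import json


def copy_activity(name, source_dataset, sink_dataset, source_type, sink_type):
    """Return a dict shaped like a Data Factory copy activity definition."""
    return {
        "name": name,
        "type": "Copy",
        # Inputs and outputs refer to datasets by name.
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": source_type},
            "sink": {"type": sink_type},
        },
    }


# Hypothetical names: an on-premises SQL Server table copied into blob storage.
pipeline = {
    "name": "CollectCustomerData",
    "properties": {
        "activities": [
            copy_activity(
                name="CopyCustomersToCloud",
                source_dataset="OnPremCustomersTable",
                sink_dataset="CloudCustomersFolder",
                source_type="SqlServerSource",  # assumed type string
                sink_type="BlobSink",           # assumed type string
            )
        ]
    },
}

pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

The activity itself holds no connection details. It only names an input dataset and an output dataset, which is what lets the same data stores be reused across pipelines.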
Remodel and Enhance
After data is present in a centralized data store in the cloud, process or transform the collected data by using ADF mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs that execute on Spark without needing to understand Spark clusters or Spark programming.
If you prefer to code transformations by hand, ADF supports external activities for executing your transformations on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.
CI/CD and Distribute
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to incrementally develop and deliver your ETL processes before publishing the finished product. After the raw data has been refined into a business-ready consumable form, load the data into Azure Data Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools.
After you have successfully built and deployed your data integration pipeline, providing business value from refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor.
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of the following key components.
- Pipelines
- Activities
- Datasets
- Linked services
- Data Flows
- Integration Runtimes
These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingests data from an Azure blob and then runs a Hive query on an HDInsight cluster to partition the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.
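The difference between chained and parallel activities can be illustrated with a small toy model. This sketch is not how Data Factory schedules work internally. It only assumes that each activity lists the activities it depends on in a `dependsOn` field, and it uses Python's standard `graphlib` module (Python 3.9 or later) to group activities into "waves" that could run side by side. The activity names and the helper `execution_waves` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical activities; each one lists the activities it must wait for.
activities = [
    {"name": "CopyLogs", "dependsOn": []},
    {"name": "CopyCustomers", "dependsOn": []},
    {"name": "JoinWithHive", "dependsOn": [
        {"activity": "CopyLogs", "dependencyConditions": ["Succeeded"]},
        {"activity": "CopyCustomers", "dependencyConditions": ["Succeeded"]},
    ]},
    {"name": "LoadWarehouse", "dependsOn": [
        {"activity": "JoinWithHive", "dependencyConditions": ["Succeeded"]},
    ]},
]


def execution_waves(activities):
    """Group activities into waves; members of one wave have no unmet dependencies."""
    graph = {
        a["name"]: {dep["activity"] for dep in a["dependsOn"]}
        for a in activities
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave succeeded
    return waves


waves = execution_waves(activities)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the two copy activities have no dependencies, so they land in the same wave and can run in parallel, while the Hive step and the warehouse load are chained behind them.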
Mapping Data Flows
Create and manage graphs of data transformation logic that you can use to transform data of any size. You can build up a reusable library of data transformation routines and execute those processes in a scaled-out manner from your ADF pipelines. Data Factory executes your logic on a Spark cluster that spins up and spins down when you need it. You never have to manage or maintain clusters.
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.
Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
Linked services are much like connection strings. They define the connection information that Data Factory needs to connect to external resources. Think of it this way: a linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the blob container and the folder that contains the data.
Linked services are used for two purposes in Data Factory:
- To represent a data store that includes, but isn’t limited to, a SQL Server database, Oracle database, file share, or Azure blob storage account.
- To represent a compute resource that can host the execution of an activity. For example, the HDInsightHive activity runs on an HDInsight Hadoop cluster.
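The split between a linked service (how to connect) and a dataset (which data to use) can be sketched with two small definitions. The dictionaries below imitate the general shape of Data Factory JSON, but the names `GameLogsStorage` and `RawGameLogs`, the folder path, the placeholder connection string, and the helper `resolve_linked_service` are all assumptions for illustration, not definitions taken from a real factory.

```python
# Linked service: holds the connection information for a storage account.
linked_service = {
    "name": "GameLogsStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # Placeholder only; never hard-code real keys.
            "connectionString": (
                "DefaultEndpointsProtocol=https;"
                "AccountName=<account>;AccountKey=<key>"
            )
        },
    },
}

# Dataset: names the container and folder, and points at the linked service.
dataset = {
    "name": "RawGameLogs",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "GameLogsStorage",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"folderPath": "gamelogs/raw"},
    },
}


def resolve_linked_service(dataset, linked_services):
    """Look up, by name, the linked service that a dataset refers to."""
    wanted = dataset["properties"]["linkedServiceName"]["referenceName"]
    by_name = {service["name"]: service for service in linked_services}
    if wanted not in by_name:
        raise KeyError(f"unknown linked service: {wanted}")
    return by_name[wanted]


connection = resolve_linked_service(dataset, [linked_service])
print(connection["properties"]["type"])
```

Because the dataset refers to the linked service only by name, several datasets (for example, different folders in the same account) can share one connection definition.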
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There are different types of triggers for different types of events.
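As one example of a trigger, the sketch below builds a definition for a trigger that fires once a day, matching the daily schedule in the gaming scenario. The layout imitates the general shape of a Data Factory schedule trigger, but the helper `daily_trigger`, the trigger and pipeline names, and the parameter `inputFolder` are hypothetical, and the exact field names should be verified against the official trigger schema.

```python
import json


def daily_trigger(name, pipeline_name, start_time, parameters=None):
    """Return a dict shaped like a schedule trigger that runs once per day."""
    return {
        "name": name,
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": "Day",
                    "interval": 1,
                    "startTime": start_time,
                    "timeZone": "UTC",
                }
            },
            # A trigger names the pipeline(s) it starts and the arguments to pass.
            "pipelines": [
                {
                    "pipelineReference": {
                        "referenceName": pipeline_name,
                        "type": "PipelineReference",
                    },
                    "parameters": parameters or {},
                }
            ],
        },
    }


trigger = daily_trigger(
    name="DailyLogAnalysis",
    pipeline_name="AnalyzeGameLogs",
    start_time="2024-01-01T00:00:00Z",
    parameters={"inputFolder": "gamelogs/raw"},
)
print(json.dumps(trigger, indent=2))
```

An event-based trigger, such as one that reacts to files landing in a blob container, would follow the same idea with a different trigger type and event settings.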
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within the trigger definition.
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or by a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
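The relationship between declared parameters and the arguments supplied for a run can be modelled in a few lines. This is a toy model rather than Data Factory's actual behaviour: it assumes each declared parameter has a `type` and an optional `defaultValue`, and it returns a read-only mapping to mirror the idea that parameters cannot change during a run. The function `resolve_parameters` and the parameter names are hypothetical.

```python
from types import MappingProxyType

# Parameters as they might be declared on a pipeline (hypothetical names).
declared = {
    "inputFolder": {"type": "String"},
    "maxRows": {"type": "Int", "defaultValue": 1000},
}


def resolve_parameters(declared, arguments):
    """Combine declared parameters with run arguments into a read-only mapping."""
    unknown = set(arguments) - set(declared)
    if unknown:
        raise KeyError(f"arguments for undeclared parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in declared.items():
        if name in arguments:
            resolved[name] = arguments[name]        # argument from trigger or manual run
        elif "defaultValue" in spec:
            resolved[name] = spec["defaultValue"]   # fall back to the declared default
        else:
            raise ValueError(f"missing argument for parameter {name!r}")
    return MappingProxyType(resolved)               # read-only view for the run


run_parameters = resolve_parameters(declared, {"inputFolder": "gamelogs/raw"})
print(dict(run_parameters))
```

Inside a real pipeline, activities would typically read such a value through an expression like `@pipeline().parameters.inputFolder`.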
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data store or a compute environment. It is also a reusable/referenceable entity.
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
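A For-each iterator can be pictured as a container that repeats its inner activities once per element of a list. The sketch below defines a dictionary that imitates the general shape of a ForEach activity and a toy function that expands it into individual runs. The expander `expand_for_each` understands only the single expression form `@pipeline().parameters.<name>`; it is an assumption-laden illustration, not Data Factory's expression engine, and the activity and parameter names are hypothetical.

```python
PARAMETER_PREFIX = "@pipeline().parameters."

# Hypothetical ForEach container with one inner activity.
for_each = {
    "name": "CopyEachRegion",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": True,
        "items": {"value": "@pipeline().parameters.regions", "type": "Expression"},
        # In a real pipeline, inner activities would refer to the current
        # element with the expression @item().
        "activities": [{"name": "CopyRegionLogs", "type": "Copy"}],
    },
}


def expand_for_each(activity, pipeline_parameters):
    """List the (inner activity name, item) pairs that the loop would run."""
    props = activity["typeProperties"]
    expression = props["items"]["value"]
    if not expression.startswith(PARAMETER_PREFIX):
        raise ValueError("this toy only handles @pipeline().parameters.<name>")
    items = pipeline_parameters[expression[len(PARAMETER_PREFIX):]]
    return [
        (inner["name"], item)
        for item in items
        for inner in props["activities"]
    ]


runs = expand_for_each(for_each, {"regions": ["eu", "us", "asia"]})
print(runs)
```

The list to iterate over arrives as a pipeline parameter, so the same loop definition serves any number of regions without being edited.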
Variables can be used inside pipelines to store temporary values, and can also be used in conjunction with parameters to enable passing values between pipelines, data flows, and other activities.