Question:- What is the integration runtime?
Answer:- The integration runtime (IR) is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments. There are three types of integration runtime:
• Azure Integration Runtime: The Azure IR can copy data between cloud data stores, and it can dispatch activities to a variety of compute services, such as Azure HDInsight or SQL Server, where the transformation takes place.
• Self-Hosted Integration Runtime: The self-hosted IR is software with essentially the same code as the Azure IR, but you install it on an on-premises machine or on a virtual machine inside a virtual network. A self-hosted IR can run copy activities between a public cloud data store and a data store in a private network, and it can dispatch transformation activities against compute resources in a private network. We use a self-hosted IR because Data Factory cannot directly access on-premises data sources, as they sit behind a firewall. It is sometimes possible to establish a direct connection between Azure and on-premises data sources by configuring the firewall in a specific way; in that case a self-hosted IR is not needed.
• Azure-SSIS Integration Runtime: With the Azure-SSIS IR, you can natively execute SSIS packages in a managed environment. So when we lift and shift SSIS packages to Data Factory, we use the Azure-SSIS IR.
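For illustration, here is a minimal, hedged sketch of registering a self-hosted integration runtime with the azure-mgmt-datafactory Python SDK. The resource group, factory, and IR names are placeholders, and the IR software itself still has to be installed on the on-premises machine and registered with one of the returned authentication keys.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

# Placeholder names; replace with your own subscription, resource group, factory.
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "my-rg", "my-adf"

# Create the self-hosted IR entry in the factory.
ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="IR for on-premises sources")
)
adf.integration_runtimes.create_or_update(rg, df, "OnPremIR", ir)

# Retrieve the keys used to register the locally installed IR node.
keys = adf.integration_runtimes.list_auth_keys(rg, df, "OnPremIR")
print(keys.auth_key1)
```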
Question:- What is the limit on the number of integration runtimes?
Answer:- There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution.
Question:- What is the difference between Azure Data Lake and Azure Data Warehouse?
Answer:- A data warehouse is the traditional way of storing data and is still widely used. A data lake is complementary to a data warehouse: data held in a data lake can also be loaded into the data warehouse, subject to certain rules.
• DATA LAKE
• Complementary to the data warehouse
• Holds detailed or raw data, which can be in any form; you simply take the data and land it in the data lake
• Schema on read (the data is not structured up front; you can define the schema in any number of ways when you read it)
• One language to process data of any format (U-SQL)
• DATA WAREHOUSE
• May be sourced from the data lake
• Data is filtered, summarised, and refined
• Schema on write (data is written in a structured form, i.e., to a particular schema)
• Uses SQL
Question:- What is blob storage in Azure?
Answer:- Azure Blob Storage is a service for storing large amounts of unstructured object data, such as text or binary data. You can use Blob Storage to expose data publicly to the world or to store application data privately. Common uses of Blob Storage include:
• Serving images or documents directly to a browser
• Storing files for distributed access
• Streaming video and audio
• Storing data for backup and restore, disaster recovery, and archiving
• Storing data for analysis by an on-premises or Azure-hosted service
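As a minimal, hedged sketch of working with unstructured object data, the azure-storage-blob Python SDK can upload and download blobs; the connection string, container, and file names below are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Connect to the storage account (placeholder connection string).
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("backups")

# Upload a local file as a block blob.
with open("report.pdf", "rb") as data:
    container.upload_blob(name="2024/report.pdf", data=data, overwrite=True)

# Download it back as bytes.
downloaded = container.download_blob("2024/report.pdf").readall()
```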
Question:- What are the steps for creating ETL process in Azure Data Factory?
Answer:- Suppose we want to extract data from an Azure SQL database, process it where needed, and store the result in Azure Data Lake Store. The steps for creating the ETL process are:
• Create a linked service for the source data store, which is the SQL Server database (assume we have a cars dataset there)
• Create a linked service for the destination data store, which is Azure Data Lake Store
• Create datasets for the data to be copied and for where it is saved
• Create the pipeline and add a copy activity
• Schedule the pipeline by adding a trigger
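The steps above can be illustrated with a hedged sketch using the azure-mgmt-datafactory Python SDK. All resource names, connection strings, table names, and paths are placeholders, authentication properties for the Data Lake linked service are omitted, and exact model class names may differ slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureSqlDatabaseLinkedService,
    AzureDataLakeStoreLinkedService, LinkedServiceReference,
    DatasetResource, DatasetReference, AzureSqlTableDataset,
    AzureDataLakeStoreDataset, CopyActivity, SqlSource,
    AzureDataLakeStoreSink, PipelineResource, SecureString,
)

rg, df = "my-rg", "my-adf"
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# 1. Linked service for the source data store (the SQL database).
sql_ls = LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
    connection_string=SecureString(value="<sql-connection-string>")))
adf.linked_services.create_or_update(rg, df, "SqlServerLS", sql_ls)

# 2. Linked service for the destination data store (Azure Data Lake Store).
#    Authentication settings (service principal / managed identity) omitted here.
adls_ls = LinkedServiceResource(properties=AzureDataLakeStoreLinkedService(
    data_lake_store_uri="https://<account>.azuredatalakestore.net/webhdfs/v1"))
adf.linked_services.create_or_update(rg, df, "DataLakeLS", adls_ls)

# 3. Datasets: the source table (cars) and the sink folder in the lake.
src_ds = DatasetResource(properties=AzureSqlTableDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="SqlServerLS"),
    table_name="dbo.cars"))
adf.datasets.create_or_update(rg, df, "CarsSqlDS", src_ds)

sink_ds = DatasetResource(properties=AzureDataLakeStoreDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="DataLakeLS"),
    folder_path="raw/cars"))
adf.datasets.create_or_update(rg, df, "CarsLakeDS", sink_ds)

# 4. Pipeline with a single copy activity from SQL to the lake.
copy = CopyActivity(
    name="CopyCarsToLake",
    inputs=[DatasetReference(type="DatasetReference", reference_name="CarsSqlDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="CarsLakeDS")],
    source=SqlSource(),
    sink=AzureDataLakeStoreSink())
adf.pipelines.create_or_update(rg, df, "CarsEtlPipeline",
                               PipelineResource(activities=[copy]))

# 5. A schedule trigger would then be attached to run the pipeline periodically
#    (see the scheduling question further down).
```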
Question:- What is the difference between HDinsight & Azure Data Lake Analytics?
Answer:- HDInsight
• HDInsight is platform as a service (PaaS).
• To process a dataset, we first have to configure a cluster with predefined nodes, and then we use a language like Pig or Hive to process the data.
• Because we configure the cluster ourselves, we can create it as we want and control it as we want, and all Hadoop subprojects such as Spark and Kafka can be used without limitation.
Azure Data Lake Analytics
• Azure Data Lake Analytics is software as a service (SaaS).
• It is all about submitting queries written to process the data; Azure Data Lake Analytics creates the necessary compute nodes on demand, as per our instructions, and processes the dataset.
• Azure Data Lake Analytics does not give much flexibility in terms of provisioning the cluster, but Microsoft Azure takes care of that. We don't need to worry about cluster creation; nodes are assigned based on the instructions we pass. In addition, we can make use of U-SQL, taking advantage of .NET, for processing data.
Question:- What are the top-level concepts of Azure Data Factory?
Answer:- • Pipeline: A pipeline acts as a carrier for the various processes that take place; each individual process is an activity.
• Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or more activities, and an activity can be any process, such as querying a dataset or moving a dataset from one source to another.
• Datasets: Datasets are the sources of data. In simple words, a dataset is a data structure that holds or points to the data used by activities.
• Linked services: Linked services store the information needed to connect to an external source. For example, for a SQL Server you need a connection string to connect to the external data source; linked services hold this connection information for the source and the destination of your data.
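To make the relationships concrete, here is an abbreviated, illustrative view of how these concepts reference each other in a pipeline's JSON definition, shown as Python dicts; the names are placeholders and many required properties are omitted.

```python
# A pipeline contains activities; an activity references datasets by name.
pipeline = {
    "name": "ExamplePipeline",
    "properties": {
        "activities": [{
            "name": "CopyFromSql",
            "type": "Copy",
            "inputs":  [{"referenceName": "SqlCarsDataset",  "type": "DatasetReference"}],
            "outputs": [{"referenceName": "LakeCarsDataset", "type": "DatasetReference"}],
        }]
    },
}

# A dataset points at the data and references a linked service, which holds
# the connection information for the external store.
dataset = {
    "name": "SqlCarsDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {"referenceName": "SqlServerLS",
                              "type": "LinkedServiceReference"},
    },
}
```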
Question:- How can I schedule a pipeline?
Answer:- • You can use the schedule trigger or the tumbling window trigger to schedule a pipeline.
• The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
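As a hedged sketch, a weekly schedule trigger (here, Mondays at 18:00 UTC) can be attached to a pipeline with the azure-mgmt-datafactory SDK; the pipeline, trigger, and resource names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    RecurrenceSchedule, TriggerPipelineReference, PipelineReference,
)

rg, df = "my-rg", "my-adf"
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Define the trigger: run "CarsEtlPipeline" every week on Mondays at 18:00 UTC.
trigger = TriggerResource(properties=ScheduleTrigger(
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="CarsEtlPipeline"))],
    recurrence=ScheduleTriggerRecurrence(
        frequency="Week",
        interval=1,
        start_time="2024-01-01T00:00:00Z",
        time_zone="UTC",
        schedule=RecurrenceSchedule(week_days=["Monday"], hours=[18], minutes=[0]),
    ),
))
adf.triggers.create_or_update(rg, df, "WeeklyMonday6pm", trigger)

# Triggers must be started before they fire (.start() on older SDK versions).
adf.triggers.begin_start(rg, df, "WeeklyMonday6pm")
```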
Question:- Can I pass parameters to a pipeline run?
Answer:- • Yes, parameters are a first-class, top-level concept in Data Factory.
• You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger.
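For example, a minimal sketch of starting a pipeline run on demand and passing arguments for its parameters with the azure-mgmt-datafactory SDK; the pipeline and parameter names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off a run and supply values for the pipeline's parameters.
run = adf.pipelines.create_run(
    "my-rg", "my-adf", "CarsEtlPipeline",
    parameters={"sourceTable": "dbo.cars", "targetFolder": "raw/cars"},
)
print(run.run_id)
```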
Question:- Can I define default values for the pipeline parameters?
Answer:- You can define default values for the parameters in the pipelines.
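As a small sketch (parameter names are placeholders), defaults are declared alongside the parameter definitions on the pipeline; the default is used whenever no argument is supplied at run time.

```python
from azure.mgmt.datafactory.models import PipelineResource, ParameterSpecification

pipeline = PipelineResource(
    activities=[],  # activities omitted in this sketch
    parameters={
        # default_value applies when the run does not pass an argument
        "targetFolder": ParameterSpecification(type="String", default_value="raw/cars"),
        "retentionDays": ParameterSpecification(type="Int", default_value=7),
    },
)
```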
Question:- Can an activity in a pipeline consume arguments that are passed to a pipeline run?
Answer:- Each activity within the pipeline can consume the parameter value that’s passed to the pipeline and run with the @parameter construct.
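For example, a hypothetical Web activity could read one of its settings from a pipeline parameter via the expression language; the activity JSON is shown here as a Python dict, and the parameter name apiUrl is a placeholder.

```python
# The @pipeline().parameters.<name> expression is resolved at run time from the
# argument passed to the pipeline run.
web_activity = {
    "name": "CallApi",
    "type": "WebActivity",
    "typeProperties": {
        "url": "@pipeline().parameters.apiUrl",
        "method": "GET",
    },
}
```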
Question:- Can an activity output property be consumed in another activity?
Answer:- An activity output can be consumed in a subsequent activity with the @activity construct.
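For example, a hypothetical Set Variable activity could read a property from the output of an earlier Lookup activity named LookupConfig; names are placeholders and the activity JSON is shown as a Python dict.

```python
# activity('<name>').output exposes the named activity's output to later activities.
set_variable_activity = {
    "name": "StoreRowCount",
    "type": "SetVariable",
    "dependsOn": [{"activity": "LookupConfig", "dependencyConditions": ["Succeeded"]}],
    "typeProperties": {
        "variableName": "rowCount",
        "value": "@string(activity('LookupConfig').output.count)",
    },
}
```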
Question:- How do I gracefully handle null values in an activity output?
Answer:- You can use the @coalesce construct in the expressions to handle null values gracefully.
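For illustration, @coalesce returns the first non-null value, so it can substitute a fallback when an activity output property is null; LookupConfig and batchSize below are placeholder names, and the activity JSON is shown as a Python dict.

```python
# If output.firstRow.batchSize is null, the variable falls back to '500'.
set_variable_activity = {
    "name": "SetBatchSize",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "batchSize",
        "value": "@coalesce(activity('LookupConfig').output.firstRow.batchSize, '500')",
    },
}
```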
Question:- Which Data Factory version do I use to create data flows?
Answer:- Use the Data Factory V2 version to create data flows.