Question:- What is the difference between HDInsight and Azure Data Lake Analytics?
Answer:-
HDInsight
• HDInsight is Platform as a Service (PaaS).
• To process a dataset, we first configure a cluster with a predefined number of nodes and then use a language such as Pig or Hive to process the data.
• Because we configure the cluster ourselves with HDInsight, we can create it and control it as we want. All Hadoop subprojects such as Spark and Kafka can be used without any limitation.
Azure Data Lake Analytics
• Azure Data Lake Analytics is Software as a Service (SaaS).
• It is all about submitting queries written to process data; Azure Data Lake Analytics creates the necessary compute nodes on demand, per our instructions, and processes the dataset.
• Azure Data Lake Analytics does not give much flexibility in provisioning the cluster, but Microsoft Azure takes care of that: we don't need to worry about cluster creation, and nodes are assigned based on the instructions we pass. In addition, we can use U-SQL, which takes advantage of .NET, for processing data.
Question:- What are the top-level concepts of Azure Data Factory?
Answer:-
• Pipeline: It acts as a carrier in which various processes take place. An individual process is an activity.
• Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or more activities, and an activity can be any process, such as querying a dataset or moving a dataset from one source to another.
• Datasets: Sources of data. In simple words, a dataset is a data structure that holds our data.
• Linked services: These store the information needed to connect to an external source. For example, to connect to a SQL Server you need a connection string; the linked service holds it, and you also need to specify the source and the destination of your data.
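To make the nesting of these four concepts concrete, here is a minimal sketch (all names are hypothetical) modeled as Python dictionaries that mirror the JSON structure Data Factory uses:

```python
# How the four top-level concepts relate: a linked service holds
# connection info, a dataset points at data through it, an activity
# consumes the dataset, and a pipeline groups the activities.

linked_service = {              # connection info for an external store
    "name": "MySqlServerLinkedService",
    "type": "AzureSqlDatabase",
}

dataset = {                     # a named structure holding our data
    "name": "SalesTable",
    "linkedServiceName": linked_service["name"],
}

activity = {                    # one processing step
    "name": "CopySales",
    "type": "Copy",
    "inputs": [dataset["name"]],
}

pipeline = {                    # the carrier that groups activities
    "name": "DailySalesPipeline",
    "activities": [activity],
}

print(pipeline["activities"][0]["inputs"][0])  # → SalesTable
```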
Question:- How can I schedule a pipeline?
Answer:-
• You can use the schedule trigger or the tumbling window trigger to schedule a pipeline.
• The schedule trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
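The calendar-based pattern above can be illustrated with a small Python analogue (not the ADF trigger engine itself) that, given (weekday, hour) pairs like "Mondays at 6:00 PM and Thursdays at 9:00 PM", finds the next run time:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of a wall-clock calendar schedule:
# Monday is weekday 0, so (0, 18) = Mondays 18:00, (3, 21) = Thursdays 21:00.
SCHEDULE = [(0, 18), (3, 21)]

def next_run(now: datetime) -> datetime:
    """Return the first scheduled time strictly after `now`."""
    candidates = []
    for days_ahead in range(8):          # a full week plus one day
        day = now + timedelta(days=days_ahead)
        for weekday, hour in SCHEDULE:
            if day.weekday() == weekday:
                run = day.replace(hour=hour, minute=0,
                                  second=0, microsecond=0)
                if run > now:
                    candidates.append(run)
    return min(candidates)

# A Wednesday at noon rolls forward to Thursday 21:00.
print(next_run(datetime(2024, 1, 3, 12, 0)))
```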
Question:- Can I pass parameters to a pipeline run?
Answer:-
• Yes, parameters are a first-class, top-level concept in Data Factory.
• You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger.
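A toy illustration (not the ADF engine) of how pipeline-level parameters interact with run-time arguments, including the default-value behavior discussed in the next question:

```python
# Each parameter is defined as name -> spec; a spec may carry a
# "defaultValue". Arguments supplied at run time override defaults.
def resolve_parameters(defined: dict, arguments: dict) -> dict:
    resolved = {}
    for name, spec in defined.items():
        if name in arguments:
            resolved[name] = arguments[name]        # run-time argument wins
        elif "defaultValue" in spec:
            resolved[name] = spec["defaultValue"]   # fall back to default
        else:
            raise ValueError(f"no value supplied for parameter '{name}'")
    return resolved

params = {"sourcePath": {"defaultValue": "/data/in"}, "runDate": {}}
print(resolve_parameters(params, {"runDate": "2024-01-01"}))
# {'sourcePath': '/data/in', 'runDate': '2024-01-01'}
```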
Question:- Can I define default values for the pipeline parameters?
Answer:- You can define default values for the parameters in the pipelines.
Question:- Can an activity in a pipeline consume arguments that are passed to a pipeline run?
Answer:- Each activity within the pipeline can consume the parameter value that’s passed to the pipeline and run with the @parameter construct.
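As a rough sketch of what consuming a parameter looks like, here is a toy evaluator (names hypothetical, not the real ADF expression engine) that expands `@pipeline().parameters.<name>` references inside an activity's settings:

```python
import re

# Replace each @pipeline().parameters.<name> occurrence in an activity
# expression with the value passed to the pipeline run.
def expand(expression: str, parameters: dict) -> str:
    pattern = r"@pipeline\(\)\.parameters\.(\w+)"
    return re.sub(pattern, lambda m: str(parameters[m.group(1)]), expression)

print(expand("@pipeline().parameters.sourcePath/file.csv",
             {"sourcePath": "/data/in"}))  # → /data/in/file.csv
```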
Question:- Can an activity output property be consumed in another activity?
Answer:- An activity output can be consumed in a subsequent activity with the @activity construct.
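A toy model (hypothetical names, not the real expression engine) of how a subsequent activity dereferences another activity's output with the `@activity` construct:

```python
import re

# Look up @activity('<name>').output.<property> against the recorded
# results of earlier activities in the same pipeline run.
def activity_output(run_results: dict, expression: str):
    m = re.match(r"@activity\('([^']+)'\)\.output\.(\w+)", expression)
    name, prop = m.group(1), m.group(2)
    return run_results[name]["output"][prop]

results = {"CopySales": {"output": {"rowsCopied": 1200}}}
print(activity_output(results, "@activity('CopySales').output.rowsCopied"))
# → 1200
```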
Question:- How do I gracefully handle null values in an activity output?
Answer:- You can use the @coalesce construct in the expressions to handle null values gracefully.
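The behavior is the classic coalesce pattern: return the first non-null value. A Python analogue:

```python
# Return the first argument that is not None, so downstream
# expressions never have to deal with a null value directly.
def coalesce(*values):
    for v in values:
        if v is not None:
            return v
    return None

print(coalesce(None, None, "/data/fallback"))  # → /data/fallback
print(coalesce("explicit", "/data/fallback"))  # → explicit
```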
Question:- Which Data Factory version do I use to create data flows?
Answer:- Use the Data Factory V2 version to create data flows.
Question:- What has changed from private preview to limited public preview in regard to data flows?
Answer:-
• You will no longer have to bring your own Azure Databricks clusters.
• Data Factory will manage cluster creation and tear-down.
• Blob datasets and Azure Data Lake Storage Gen2 datasets are separated into delimited text and Apache Parquet datasets.
• You can still use Data Lake Storage Gen2 and Blob storage to store those files. Use the appropriate linked service for those storage engines.
Question:- How do I access data by using the other 80 dataset types in Data Factory?
Answer:-
• The Mapping Data Flow feature currently supports Azure SQL Database, Azure SQL Data Warehouse, delimited text files from Azure Blob storage or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake Storage Gen2 natively for source and sink.
• Use the Copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to transform the data after it has been staged. For example, your pipeline will first copy into Blob storage, and then a Data Flow activity will use a dataset in the source to transform that data.
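The stage-then-transform pattern can be sketched as a pipeline with two activities, where the Data Flow activity runs only after the Copy activity succeeds (all names here are hypothetical):

```python
# The Data Flow activity declares a dependency on the staging Copy
# activity, mirroring how ADF expresses ordering with dependsOn.
pipeline = {
    "name": "StageThenTransform",
    "activities": [
        {"name": "StageToBlob", "type": "Copy"},
        {"name": "TransformStaged", "type": "ExecuteDataFlow",
         "dependsOn": [{"activity": "StageToBlob",
                        "dependencyConditions": ["Succeeded"]}]},
    ],
}

# Derive an execution order: activities with no dependencies run first.
order = sorted(pipeline["activities"],
               key=lambda a: len(a.get("dependsOn", [])))
print([a["name"] for a in order])  # ['StageToBlob', 'TransformStaged']
```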
Question:- Explain the two levels of security in ADLS Gen2?
Answer:- The two levels of security applicable to ADLS Gen2 were also in effect for ADLS Gen1. Even though this is not new, it is worth calling out the two levels of security because they are fundamental to getting started with the data lake, and they confuse many people who are just getting started.
• Role-Based Access Control (RBAC): RBAC includes built-in Azure roles such as Reader, Contributor, and Owner, as well as custom roles. Typically, RBAC is assigned for two reasons. One is to specify who can manage the service itself (i.e., update settings and properties for the storage account). The other is to permit the use of built-in data explorer tools, which require Reader permissions.
• Access Control Lists (ACLs): Access control lists specify exactly which data objects a user may read, write, or execute (execute is required to browse the directory structure). ACLs are POSIX-compliant and thus familiar to those with a Unix or Linux background. POSIX does not use a security-inheritance model, which means that access ACLs must be specified for every object. The concept of default ACLs is critical for new files within a directory to obtain the correct security settings, but it should not be thought of as inheritance. Because of the overhead of assigning ACLs to every object, and because each object is limited to 32 ACL entries, it is extremely important to manage data-level security in ADLS Gen1 or Gen2 via Azure Active Directory groups.
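Two of the ACL behaviors above, that execute is needed to traverse directories and that default ACLs are stamped onto new children at creation time rather than inherited dynamically, can be illustrated with a toy POSIX-style model (all names hypothetical):

```python
# Each entry carries an access ACL (group -> "rwx" string) and, for
# directories, a default ACL that is copied to new children once.
class Entry:
    def __init__(self, acl=None, default_acl=None):
        self.acl = acl or {}
        self.default_acl = default_acl or {}

def can_read(path_entries, group):
    *dirs, leaf = path_entries
    # Every directory on the path must grant execute (traverse).
    if any("x" not in d.acl.get(group, "") for d in dirs):
        return False
    return "r" in leaf.acl.get(group, "")

def create_file(parent: Entry) -> Entry:
    # Default ACLs stamp the child at creation -- not live inheritance.
    return Entry(acl=dict(parent.default_acl))

root = Entry(acl={"analysts": "--x"}, default_acl={"analysts": "r--"})
f = create_file(root)
print(can_read([root, f], "analysts"))  # True: traverse via x, read via r
```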
Question:- What is Terraform?
Answer:- Terraform is an infrastructure-as-code tool that lets you define cloud and on-premises resources in human-readable configuration files that can be versioned, reused, and shared. You can then use a consistent workflow to provision and manage all of your infrastructure throughout its lifecycle. Terraform can manage both low-level components, such as compute, storage, and networking resources, and high-level components, such as DNS records and SaaS functionality.
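A minimal configuration sketch (all names hypothetical) showing one low-level resource (a storage account) and one high-level resource (a DNS record) declared side by side in HCL:

```hcl
# Hypothetical example: the azurerm provider manages both a
# low-level storage account and a high-level DNS A record.
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_storage_account" "example" {
  name                     = "examplestorageacct"
  resource_group_name      = "example-rg"
  location                 = "eastus"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_dns_a_record" "example" {
  name                = "www"
  zone_name           = "example.com"
  resource_group_name = "example-rg"
  ttl                 = 300
  records             = ["10.0.0.4"]
}
```

Because the files are plain text, they can be checked into version control and shared like any other code.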
Question:- What do you mean by terraform init?
Answer:- Terraform initializes the code with the terraform init command. This command initializes a working directory containing Terraform configuration files. It is safe to run this command several times. The init command is used for: 1. Installing provider plugins. 2. Installing child modules. 3. Setting up the backend.
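The backend setup step can be seen in a configuration fragment like the following (hypothetical names); running terraform init in a directory containing it installs the required provider plugins, downloads any referenced modules, and configures this backend for state storage:

```hcl
# Hypothetical remote backend: terraform init configures this
# Azure Storage container to hold the Terraform state file.
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstateacct"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}
```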
