0% found this document useful (0 votes)
8 views7 pages

Azure Data Engineer Technical Interview Questions

The document provides a comprehensive overview of Azure Data Engineering concepts, including Azure Data Factory, Synapse SQL, and data security measures. It covers key components such as pipelines, linked services, datasets, and various data transformation techniques. Additionally, it discusses the integration of Azure services with machine learning, data lakes, and security protocols, along with practical examples and applications.

Uploaded by

Vipul Khobragade
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views7 pages

Azure Data Engineer Technical Interview Questions

The document provides a comprehensive overview of Azure Data Engineering concepts, including Azure Data Factory, Synapse SQL, and data security measures. It covers key components such as pipelines, linked services, datasets, and various data transformation techniques. Additionally, it discusses the integration of Azure services with machine learning, data lakes, and security protocols, along with practical examples and applications.

Uploaded by

Vipul Khobragade
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Azure Data Engineer Technical Interview Questions

What is the Azure Data Factory service?


Azure Data factory is the service that helps you to do the migrations and orchestrate the work. For
example, if you want to move the data from on-premise to cloud, maybe it is incremental or it will be lift
and shift work there you can leverage the Azure Data factory to do the work. It support around hundred of
the different data sources from which we can pull the data

What is a pipeline in the Azure data factory?


A pipeline in an Azure data factory is a single workflow or set of activities connected in a specific order to
serve the business purpose. Every pipeline has one unique name within the data factory account and
based on the requirement you can schedule the pipeline to execute. We can also track the history of the
pipeline execution and check for any error that occurred while execution.

What is the linked service in the Azure data factory?


Linked service is one of the components in the Azure data factory which is used to make a connection
hence to connect to any of the data sources you have to first create the linked service based upon the
type of data source. The linked service could have different parameters for example in the case of the
SQL Server linked service you probably have to give the server name, username, and password but for
connecting to the Azure blob storage you have to give the storage location details.

What is the dataset in the Azure Data factory?


Dataset needs to read-write data to any data source using the ADF. Dataset is the representation of the
type of data holds by the data source.

What are the parameters in the ADF?


There are parameters at the multiple levels in the ADF. Parameters are basically used to make the ADF
components dynamic rather than static. For example, using a parameter same dataset can be used to
read multiple tables. Different parameters are as follows:
> Linked Service Parameters
>Dataset Parameters
> Pipeline Parameters
> Global Parameters

Why you will use the data flow from the Azure data factory?

Data flow is used for a no-code transformation. For example when you are doing any ETL operation that
you wanted to do a couple of transformations and put some logic on your input data. You may have not
found it comfortable to type the query or when you are using the files as input then in that case you
cannot write the query at all. Hence data flow will come as a Savior in this situation. Using the data flow
you can just do drag and drop and write almost all your business logic without writing any code. Behind
the scene, data flow get converted into the spark code and it will run on the cluster.

What is Synapse SQL?


Synapse SQL is the ability to do T-SQL based analytics in the Synapse workspace. Synapse SQL has
two consumption models: dedicated and serverless. For the dedicated model, use dedicated SQL pools.
A workspace can have any number of these pools. To use the serverless model, use the serverless SQL
pools. Every workspace has one of these [Link] Synapse Studio, you can work with SQL pools by
running SQL scripts

How you can use Apache Spark through Azure Synapse Analytics?

In Azure analytics, you can run the spark code either using the notebook or you can create a job that will
run the spark code. For running the Spark code you need a Spark pool which is nothing just a cluster of
the nodes having a spark installed on it.

How we can copy the data in the synapse SQL table from the Azure blob storage file?

COPY INTO [dbo].[Trip] FROM '[Link]

WITH (

FIELDTERMINATOR='|',

ROWTERMINATOR='0x0A'

) OPTION (LABEL = 'COPY: [Link]');

What are the two different types of the cluster provided by the Databricks?

Two different types of the cluster provided by the Databricks are the interactive cluster and the job cluster.
To run the interactive notebook you will be going to use an interactive cluster and to run the job, we will
use the job cluster.

What is Azure Synapse Runtime?

Apache Spark pools in Azure Synapse use runtimes to tie together essential component versions, Azure
Synapse optimizations, packages, and connectors with a specific Apache Spark version. These runtimes
will be upgraded periodically to include new improvements, features, and [Link] runtimes have
the following advantages:
● Faster session startup times
● Tested compatibility with specific Apache Spark versions
● Access to popular, compatible connectors and open-source packages

Can you run machine learning algorithm using Azure Synapse Analytics?

Yes, it is possible to run the machine learning algorithm using Azure synapse Analytics. In the Azure
synapse analytics, we have an Apache Spark and there we can write the machine learning code and
which can be executed on the Spark cluster.

What are the two different types of execution modes provided by the Databricks?

You can run the Spark code in two modes that are interactive mode or job scheduled mode. In the
interactive mode, you can run the code line by line and see the output. In the job mode, it will run all code
together, and then you will see the output.

How you can connect the Azure data factory with the Azure Databricks?
To connect to Azure databricks we have to create a linked service that will point to the Azure databricks
account. Next in the pipeline, you will be going to use the notebook activity there you will provide the
linked service created for Databricks. You will also be going to provide the notebook path available in the
Azure Databricks workspace. That’s how you can use the Databricks from the Data factory.

Is it possible to connect MongoDB DB from the Azure data factory?

Yes, it is possible to connect MongoDB from the Azure data factory. You have to provide the proper
connection information about the MongoDB server. In case if this MongoDB server is residing outside the
Azure workspace then probably you have to create a self-hosted integration runtime, and through which
you can connect to the Mongo DB server.

How you can move your changes from one environment to another environment for the
Azure data factory?

We can migrate the code from one environment to another environment for the Azure data factory using
the ARM template. ARM template is the JSON representation of the pipeline that we have created.
Explain the data masking feature of Azure?

Data masking helps in preventing unauthorized access to delicate data by enabling customers to assign
how much of the delicate data to reveal with minimal impact on the application layer. Dynamic data
masking limits acute data exposure by masking it to non-privileged users. It is a policy-based security
feature that hides the delicate data in the result set of a query over designated database fields. In
contrast, the data in the database will not be changed.

A few data masking policies are:

● SQL users excluded from masking - A set of SQL users or Azure Active Directory
identities that get unmasked data in the SQL query results. Users with administrator
privileges are permanently banned from masking and seeing the original data without any
mask.
● Masking rules - A set of rules defining the designated fields to be masked and the masking
function used. The selected fields can be determined using a database schema, table, and
column names.
● Masking functions - A set of methods that control data exposure for different scenarios.

Describe various windowing functions of Azure Stream Analytics?

A window in Azure Stream Analytics is a block of instant events that enables users to perform various
operations on the event data. To analyze and partition a window in Azure Stream Analytics, There exist
four windowing functions:

● Hopping Window: In these windows, the data segments can overlap. So, to define a
hopping window, we need to specify two parameters:
○ Hop (duration of the overlap)
○ Window size (length of data segment)
● Tumbling Window: In this, the data stream is segmented into distinct time segments of
fixed length in the tumbling window function.
● Session Window: This function groups events based on arrival time, so there is no fixed
window size. Its purpose is to eliminate quiet periods in the data stream.
● Sliding Window: This windowing function does not necessarily produce aggregation after a
fixed time interval, unlike the tumbling and hopping window functions. Aggregation occurs
every time an existing event falls out of the time window, or a new event occurs.

What are the different security options available in the Azure SQL database?

Security plays a vital role in databases. Some of the security options available in the Azure SQL database
are:

● Azure SQL Firewall Rules: Azure provides two-level security. There are server-level
firewall rules which are stored in the SQL Master database. Server-level firewall rules
determine the access to the Azure database server. Users can also create database-level
firewall rules that govern the individual databases’ keys.
● Azure SQL TDE (Transparent Data Encryption): TDE is the technology used to encrypt
stored data. TDE is also available for Azure Synapse Analytics and Azure SQL Managed
Instances. With TDE, the encryption and decryption of databases, backups, and transaction
log files, happens in real-time.
● Always Encrypted: It is a feature designed to protect sensitive data stored in the Azure
SQL database, such as credit card numbers. This feature encrypts data within the client
applications using Always Encrypted-enabled driver. Encryption keys are not shared with
SQL Database, which means database admins do not have access to sensitive data.
● Database Auditing: Azure provides comprehensive auditing capabilities along with the
SQL Database. It is also possible to declare the audit policy at the individual database level,
allowing users to choose based on the requirements.

How data security is implemented in Azure Data Lake Storage(ADLS) Gen2?

Data security is one of the primary concerns for most organizations for moving data to cloud storage.
Azure data lake storage gen2 provides a multi-layered and robust security model. This model has 6 data
security layers:

● Authentication: The first layer includes user account security. ADLS Gen2 provides three
authentication modes, Azure Active Directory (AAD), Shared Access Token (SAS), and
Shared Key.
● Access Control: The next layer for restricting access to individual containers or files. This
can be managed using Roles and Access Control Lists (ACLs)
● Network Isolation: This layer enables administrators to manage access by disabling or
allowing access to only particular Virtual Private Networks (VPNs) or IP Addresses.
● Data Protection: This is achieved by encrypting in-transit data using HTTPS(Hypertext
Transfer Protocol Secure). Options to encrypt stored data are also available.
● Advanced Threat Protection: If enabled, ADLS Gen2 will monitor any unauthorized
attempts to access or exploit the storage account.
● Auditing: This is the sixth and final layer of security. ADLS Gen2 provides comprehensive
auditing features in which all account management activities are logged. These logs can be
later reviewed to ensure the highest level of security.

What do you mean by data modeling?

Data Modeling is creating a visual representation of an entire information system or parts to express
linkages between data points and structures. The purpose is to show the many types of data used and
stored in the system, the relationships between them, how the data can be classified and arranged, and
its formats and features. Data can be modeled according to the needs and requirements at various
degrees of abstraction. The process begins with stakeholders and end-users providing information about
business requirements. These business rules are then converted into data structures to create a concrete
database design.
There are two design schemas available in data modeling:

● Star Schema
● Snowflake Schema
What are the 2 levels of security in Azure data lake storage Gen2?
The two levels of security available in Azure data lake storage Gen2 are also adequate for Azure data
lake Gen1. Although this is not new, it is worth calling it two levels of security because it’s a fundamental
piece for getting started with the Azure data lake. The two levels of security are defined as:

● Role-Based Access Control (RBAC): RBAC includes built-in Azure roles such as reader,
owner, contributor, or custom. Typically, RBAC is assigned due to two reasons. One is to
permit the use of built-in data explorer tools that require reader permissions. Another is to
specify who can manage the service (i.e., update properties and settings for the storage
account).
● Control Lists (ACLs): ACLs specify exactly which data objects a user may write, read, and
execute (execution is required for browsing the directory structure). ACLs are POSIX
(Portable Operating System Interface) - compliant, thus familiar to those with a Linux or Unix
background.

Explain a few important concepts of the Azure data factory?

● Pipeline: It acts as a carrier in various processes occurring. An individual process is


considered an activity.
● Activities: It represents the processing steps of a pipeline. A pipeline can have one or many
activities. It can be a process like moving the dataset from one source to another or
querying a data set.
● Datasets: It is the source of data or, we can say it is a data structure that holds our data.
● Linked services: It stores information that is very important when connecting to an external
source.

Explain the process of creating ETL(Extract, Transform, Load)?

Apache Spark pools in Azure Synapse use runtimes to tie together essential component versions, Azure
Synapse optimizations, packages, and connectors with a specific Apache Spark version. These runtimes
will be upgraded periodically to include new improvements, features, and patches.
These runtimes have the following advantages:

● Faster session startup times.


● Tested compatibility with specific Apache Spark versions.
● Access to popular, compatible connectors and open-source packages.

What is SerDe in the hive?

Serializer/Deserializer is popularly known as SerDe. For IO(Input/Output), Hive employs the SerDe
protocol. Serialization and deserialization are handled by the interface, which also interprets serialization
results as separate fields for processing.
The Deserializer turns a record into a Hive-compatible Java object. The Serializer now turns this Java
object into an HDFS (Hadoop Distributed File System) -compatible format. The storage role is then taken
over by HDFS. Anyone can create their own SerDe for their own data format.

Mention some common applications of Blob storage?

● Laboring images or documents straight to a browser.


● Saving files for shared access.
● Streaming audio and video.
● Collecting data for backup and recovery disaster restoration, and archiving.
● Saving data for analysis by an on-premises or Azure-hosted.

What are the main characteristics of Hadoop?

● It is an open-source structure that is ready for freeware.


● Hadoop is cooperative with the various types of hardware and simple to access distinct
hardware within a particular node.
● It encourages faster-distributed data processing.
● It saves the data in the group, which is unconventional of the rest of the operations.
● Hadoop supports building replicas for every block with separate nodes.

You might also like