0% found this document useful (0 votes)

4 views53 pages

Azure Data Engineer Course Notes

The document is a comprehensive guide on Azure Data Engineering, covering essential topics such as cloud fundamentals, Medallion Architecture, Azure Data Lake Storage, identity management, and Azure Data Factory. It includes theoretical concepts, practical scenarios, and implementation workshops designed to help practitioners master data management and transformation using Azure services. Each day focuses on specific aspects of data engineering, providing hands-on tasks to validate the learned concepts.

Uploaded by

kvinee2344

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views53 pages

Azure Data Engineer Course Notes

Uploaded by

kvinee2344

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Azure Data Engineering

Comprehensive Guide

Mastering ADF, Databricks, and SQL for Modern Data Architectures

Azure Data Engineering - Day 1

Day 1: Cloud Fundamentals &
Medallion Architecture

Theory: Introduction to Azure Data Engineering

Azure Data Engineering involves designing and implementing the

management, monitoring, security, and privacy of data using the full stack of
Azure data services. [cite: 1] The primary goal is to build data pipelines that
transform raw data into a structured format suitable for business intelligence.

On-Premises vs. Cloud: In an on-premises environment, the company is

responsible for purchasing hardware, maintaining data centers, and
managing limited resources. In contrast, cloud providers like Azure, AWS, and
GCP provide resources on a "pay-as-you-use" basis. [cite: 2] This allows for
massive scalability and cost efficiency.

Cloud Classification: Cloud services are shared among multiple users with
robust security. [cite: 3] Models include:

• Public Cloud: Shared hardware among different organizations. [cite: 3]

• Private Cloud: Dedicated hardware for one organization. [cite: 3]

• Hybrid Cloud: A mix of both public and private environments. [cite: 3]

Service Models:

• IaaS (Infrastructure as a Service): Provides networking, security, and

operating systems. [cite: 4]

• PaaS (Platform as a Service): Provides a platform for development

without managing underlying infrastructure. [cite: 4]

Azure Data Engineering - Day 2

• SaaS (Software as a Service): Ready-to-use software applications. [cite:
4]

Medallion Architecture

The Medallion Architecture is a layered data design pattern used to organize

data in a lakehouse. [cite: 5]

• Bronze (Raw Layer): This layer stores data in its raw, original format
(JSON, CSV, Parquet). [cite: 5] It acts as the "source of truth."

• Silver (Refinement Layer): Data is cleaned and refined. This involves

removing duplicates, handling null values, and correcting corrupted
data. [cite: 5]

• Gold (Aggregation Layer): The final layer where data is aggregated and
optimized for reporting and analytics. [cite: 5]

Related Data: Data Formats

Common formats include Parquet (columnar storage, efficient for queries),

CSV (simple but lacks schema), and JSON (flexible for semi-structured data).
[cite: 5]

Implementation Workshop (Day 1):

To master the topics covered in Day 1, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Cloud Fundamentals & Medallion Architecture.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

Azure Data Engineering - Day 3

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Cloud Fundamentals & Medallion
Architecture apply to optimizing their daily inventory reports? In Day 1, we focus
on the foundational principles that allow this retail data to move from raw
transaction logs into actionable insights.

Azure Data Engineering - Day 4

Day 2: Azure Data Lake Storage (ADLS
Gen2)

Theory: Storage Infrastructure

Azure Data Lake Storage (ADLS) Gen2 is the foundational storage for Big Data
on Azure. It is organized into a hierarchy: Tenant -> Subscription -> Resource
Group -> Services (ADF, Databricks, ADLS). Services within the same Resource
Group often share the same region and VNet. [cite: 6]

Access Tiers: Based on how frequently data is accessed, you can choose:

• Hot: For frequent access; lowest access cost but higher storage cost.
[cite: 6]

• Cool: For data accessed less frequently (min 30 days). [cite: 6]

• Cold: For data accessed very rarely (min 90 days). [cite: 6]

• Archive: For long-term storage of rare data. [cite: 6]

Blob Types:

• Block Blobs: Best for large files like movies and documents. [cite: 6]

• Append Blobs: Optimized for append operations (e.g., logging). [cite: 6]

• Page Blobs: Optimized for random read/write (e.g., VM disks). [cite: 6]

Azure Data Engineering - Day 5

Security and Availability

Permissions: Access is often granted using the Blob Storage Contributor role.
[cite: 7] Soft-delete retention periods protect against accidental deletion. [cite:
7]

Redundancy (Backup): Options like LRS (Locally Redundant Storage)

replicate data three times within a single data center. [cite: 7]

Related Data: ADLS Gen2 vs Blob Storage

The key differentiator is the Hierarchical Namespace. While Blob storage

uses a flat namespace, Gen2 allows for true folder structures, which
significantly improves the performance of directory-level operations in big
data processing. [cite: 7]

Implementation Workshop (Day 2):

To master the topics covered in Day 2, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Azure Data Lake Storage (ADLS Gen2).

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Azure Data Lake Storage (ADLS Gen2)
apply to optimizing their daily inventory reports? In Day 2, we focus on the

Azure Data Engineering - Day 6

foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 7

Day 3: Identity Management &
Databricks Billing

Theory: Entra ID (Azure AD)

Identity management involves three stages:

• Identification: Providing a username or email. [cite: 8]

• Authentication: Verifying identity via password or certificate. [cite: 8]

• Authorization: Controlling access (e.g., allowing ADF to read from

ADLS). [cite: 8]

Access Methods:

• Account Key/SAS Key: Temporary or permanent secrets. [cite: 8]

• Managed Identity: Azure-managed identities (System-assigned or User-

assigned). [cite: 8]

• Service Principal (SPN): An application identity used for third-party tools

like Databricks. [cite: 8]

Databricks Clusters and Billing

Databricks costs are based on resource usage (DBUs). [cite: 9]

• All-Purpose Clusters: Used for interactive analysis and development.

[cite: 9]

• Job Clusters: Cost-effective, automatically created and terminated for

jobs. [cite: 9]

Azure Data Engineering - Day 8

• Serverless: Compute managed by Databricks, providing instant
scalability. [cite: 10]

Related Data: PIM (Privileged Identity Management)

PN provides time-based access (e.g., 8 hours) to sensitive resources to reduce

the attack surface. [cite: 8]

Implementation Workshop (Day 3):

To master the topics covered in Day 3, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Identity Management & Databricks Billing.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Identity Management & Databricks
Billing apply to optimizing their daily inventory reports? In Day 3, we focus on
the foundational principles that allow this retail data to move from raw
transaction logs into actionable insights.

Azure Data Engineering - Day 9

Azure Data Engineering - Day 10
Day 4: Azure Data Factory (ADF) - ETL/
ELT Basics

Theory: Orchestration with ADF

Azure Data Factory (ADF) is a serverless data integration service used to

orchestrate data movement and transformation. [cite: 12] It supports both ETL
(Extract-Transform-Load) and ELT (Extract-Load-Transform).

Medallion ETL: Data is fetched from a source (ADLS/SQL) and moved into the
Raw/Bronze layer. [cite: 12]

Pipeline Settings:

• Creation: Pipelines contain activities that perform actions. [cite: 12]

• Activity Logic: Activities can be chained with dependencies like On

Success, On Fail, On Skip, or On Completion. [cite: 12]

Basic Activities

• Wait Activity: Pauses the pipeline for a specified duration. [cite: 12]

• Execute Pipeline: Invokes another pipeline for modular design. [cite: 12]

Parameterization

Parameters allow pipelines to be dynamic. You can use Global Parameters or

Variables to pass values into activities. [cite: 12]

Azure Data Engineering - Day 11

Implementation Workshop (Day 4):
To master the topics covered in Day 4, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Azure Data Factory (ADF) - ETL/ELT Basics.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Azure Data Factory (ADF) - ETL/ELT
Basics apply to optimizing their daily inventory reports? In Day 4, we focus on
the foundational principles that allow this retail data to move from raw
transaction logs into actionable insights.

Azure Data Engineering - Day 12

Azure Data Engineering - Day 13
Day 5: ADF Control Flow & Iteration

Theory: Advanced Control Flow

To handle complex logic, ADF provides several control flow activities:

• If Condition: Evaluates an expression and branches execution based on

True or False. [cite: 12]

• Switch Activity: Similar to 'case' statements in programming, it

branches based on specific values. [cite: 12]

• For-Each: Iterates over a collection (array) of items (e.g., processing a list

of files). [cite: 12]

• Filter Activity: Filters an input array based on a condition. [cite: 12]

Related Data: JSON Expressions

ADF uses JSON-based expressions. For example, checking if a parameter

'Country' equals 'Iran' in a Switch activity. [cite: 12]

# Example Expression for If Condition

@equals(pipeline().[Link], 'SQL')

Azure Data Engineering - Day 14

Implementation Workshop (Day 5):
To master the topics covered in Day 5, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of ADF Control Flow & Iteration.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of ADF Control Flow & Iteration apply to
optimizing their daily inventory reports? In Day 5, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 15

Azure Data Engineering - Day 16
Day 6: SQL Activities & ADLS Metadata

Theory: Connecting ADF to SQL and ADLS

Script Activity: Used to run SQL queries against Azure SQL Database (e.g.,
querying [Link]). [cite: 13]

Metadata and Validation:

• Get Metadata: Retrieves information about files (size, modified date,

child items). [cite: 14]

• Validation Activity: Ensures that a file or folder exists at a specific

location before proceeding. [cite: 14]

• Lookup Activity: Used to retrieve a dataset (JSON or SQL result) to

visualize or use in downstream activities. [cite: 15]

Implementation Details

Linked Services: Store connection credentials (access details). [cite: 14]

Datasets: Define the specific location of the data (file path or table name).
[cite: 14]

Related Data: Timeouts

Validation activities often have a timeout setting (e.g., 12 hours) represented

as '[Link]'. [cite: 14]

Azure Data Engineering - Day 17

Implementation Workshop (Day 6):
To master the topics covered in Day 6, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of SQL Activities & ADLS Metadata.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of SQL Activities & ADLS Metadata apply
to optimizing their daily inventory reports? In Day 6, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 18

Azure Data Engineering - Day 19
Day 7: ADF Triggers & Integration
Runtimes

Theory: Automating Pipelines

Triggers are used to automate pipeline execution. [cite: 17]

• Schedule Trigger: Runs at a specific wall-clock time. [cite: 17]

• Tumbling Window Trigger: Runs on a fixed time interval (e.g., every

hour) and handles backfilling. [cite: 17]

• Event-Based Trigger: Runs when an event occurs, such as a file arriving

in storage. [cite: 17]

Integration Runtimes (IR)

The IR is the compute infrastructure used by ADF for data movement and
activity execution. [cite: 18]

• Auto-Resolved (Cloud) IR: Used for cloud-to-cloud data movement.

[cite: 18]

• Self-Hosted IR: Required for on-premises or private network

connectivity. [cite: 18]

• SSIS IR: For running legacy SSIS packages in the cloud. [cite: 18]

Azure Data Engineering - Day 20

Related Data: SQL Scripting

Script activities can handle DDL and DML operations like creating tables or
inserting records for logging. [cite: 18]

Implementation Workshop (Day 7):

To master the topics covered in Day 7, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of ADF Triggers & Integration Runtimes.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of ADF Triggers & Integration Runtimes
apply to optimizing their daily inventory reports? In Day 7, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 21

Azure Data Engineering - Day 22
Day 8: Copy Activity Scenarios

Theory: Data Movement Patterns

The Copy Activity is the most frequently used activity in ADF. Common
patterns include:

• ADLS to ADLS: Migrating files between containers or storage accounts.

• SQL to SQL: Moving data between databases.

• ADLS to SQL: Loading flat files (CSV/Parquet) into relational tables.

• SQL to ADLS: Offloading database data into a data lake for analytics.

Settings

The Preserve Hierarchy setting is used when copying folders to ensure the
destination matches the source structure exactly.

Introduction to PySpark

For more complex transformations that the Copy Activity cannot handle, we
introduce PySpark using Azure Databricks. PySpark can interact with Key
Vault for secure secrets and process data at scale.

Implementation Workshop (Day 8):

To master the topics covered in Day 8, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

Azure Data Engineering - Day 23

• Implement a test scenario using dummy data to validate the theoretical
concepts of Copy Activity Scenarios.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Copy Activity Scenarios apply to
optimizing their daily inventory reports? In Day 8, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 24

Day 9: SQL DML & Incremental Loads

Theory: SQL Operations

Standard SQL commands are often integrated into ADF pipelines:

• Insert: Adding new records. [cite: 19]

• Update: Modifying existing data based on a key (e.g., cust_id). [cite: 19]

• Delete: Removing data from a table. [cite: 19]

Copy Scenarios and Mappings

• Auto-Create Table: If the sink table doesn't exist, ADF can create it
automatically. [cite: 20]

• Column Mismatch: Requires explicit mapping if source and sink column

names differ. [cite: 21]

Incremental Load (Watermark)

To avoid reloading all data (Full Load), we use a Watermark. [cite: 23]

• A Watermark Table stores the last processed timestamp. [cite: 23]

• ADF queries the source for records newer than the watermark, copies
them, and updates the watermark table. [cite: 23]

Azure Data Engineering - Day 25

Implementation Workshop (Day 9):
To master the topics covered in Day 9, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of SQL DML & Incremental Loads.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of SQL DML & Incremental Loads apply to
optimizing their daily inventory reports? In Day 9, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 26

Azure Data Engineering - Day 27
Day 10: Delta Copy & Watermark
Implementation

Theory: Step-by-Step Incremental Load

The standard pattern for an incremental load involves:

1. Lookup for the Current Watermark. [cite: 24]

2. Lookup for the New (Max) Watermark in the source. [cite: 24]

3. Copy Data where source date is between Old and New Watermarks. [cite:
24]

4. Stored Procedure to update the Control (Watermark) Table. [cite: 24]

-- Example SQL Update for Watermark

UPDATE WatermarkTable
SET WatermarkValue = @NewWatermark
WHERE TableName = 'Person'; [cite: 24]

Related Data: Person Table Scenario

Imagine a table 'Person' with a column 'LastModifyTime'. We use this column

to identify which records were updated or added since the last run. [cite: 24]

Azure Data Engineering - Day 28

Implementation Workshop (Day 10):
To master the topics covered in Day 10, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Delta Copy & Watermark Implementation.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Delta Copy & Watermark
Implementation apply to optimizing their daily inventory reports? In Day 10, we
focus on the foundational principles that allow this retail data to move from raw
transaction logs into actionable insights.

Azure Data Engineering - Day 29

Azure Data Engineering - Day 30
Day 11: Databricks & Spark
Architecture

Theory: Why Databricks?

While ADF is great for orchestration, Databricks provides high-performance

compute via Apache Spark. [cite: 25] Spark uses In-Memory processing,
which is significantly faster than Hadoop's disk-based MapReduce. [cite: 25]

Spark Architecture

Feature Hadoop Spark

Processing Slow (Disk-based) [cite: Very Fast (RAM-based) [cite:

Speed 30] 31]

Processing Type Batch only [cite: 32] Batch + Streaming [cite: 33]

Execution MapReduce [cite: 36] DAG engine [cite: 36]

Memory Management

Executor Memory: Divided into On-heap and Off-heap. [cite: 46]

• Unified Memory: Shared by Execution and Storage (0.6% default). [cite:

47]

• User Memory: For user-defined data structures. [cite: 47]

Azure Data Engineering - Day 31

• Reserved Memory: System overhead (300MB default). [cite: 47]

Spark Submit

The `spark-submit` command is used to launch applications on a cluster. [cite:

44] It specifies the master (local, yarn, k8s), deploy mode (client/cluster), and
resource allocation (memory/cores). [cite: 44, 45]

Implementation Workshop (Day 11):

To master the topics covered in Day 11, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Databricks & Spark Architecture.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Databricks & Spark Architecture apply
to optimizing their daily inventory reports? In Day 11, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 32

Azure Data Engineering - Day 33
Day 12: Databricks Utilities & Widgets

Theory: Notebook Productivity

Databricks provides DBUtils (Databricks Utilities) for file system operations

and notebook chaining.

• [Link]: Mount storage, list files, and move data.

• [Link]: Run one notebook from another.

Interactive Widgets

Widgets allow you to pass parameters to notebooks interactively or via ADF.

• Combobox / Dropdown: Selection from a list.

• Text: Free-text input.

• Multiselect: Multiple selections.

Implementation Workshop (Day 12):

To master the topics covered in Day 12, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Databricks Utilities & Widgets.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Azure Data Engineering - Day 34

Practical Scenario: Consider a retail company processing millions of
transactions. How would the concepts of Databricks Utilities & Widgets apply to
optimizing their daily inventory reports? In Day 12, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 35

Day 13: Storage Mounting & PySpark
Structures

Theory: Managed vs. External Tables

In Databricks (Spark SQL):

• Managed Tables: Spark manages both metadata and data. If you drop
the table, data is deleted. [cite: 50]

• External Tables: Spark only manages metadata. If you drop the table,
the data remains in ADLS. [cite: 50]

Connecting to Storage

Storage is often connected via Mounting or using Unity Catalog. Unity

Catalog provides central governance across the metastore, catalogs, and
schemas. [cite: 51]

PySpark Data Structures

• RDD (Resilient Distributed Dataset): The low-level API; no schema,

slower.

• DataFrame: Structured API (like a table); fast due to Catalyst Optimizer.

• Dataset: A mix of RDD and DataFrame (mostly used in Scala/Java).

Azure Data Engineering - Day 36

Implementation Workshop (Day 13):
To master the topics covered in Day 13, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Storage Mounting & PySpark Structures.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Storage Mounting & PySpark Structures
apply to optimizing their daily inventory reports? In Day 13, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 37

Azure Data Engineering - Day 38
Day 14: Spark Reading & Writing
Modes

Theory: Handling Data at Scale

Reading Modes:

• Permissive: Default; fills corrupt data with nulls. [cite: 52]

• DropMalformed: Removes records that don't match the schema. [cite:

52]

• FailFast: Throws an error immediately upon encountering corrupt data.

[cite: 52]

Writing Modes

• Overwrite: Deletes existing data and writes new. [cite: 53]

• Append: Adds data to the existing folder. [cite: 53]

• ErrorIfExists: Throws an error if data already exists. [cite: 53]

# Example PySpark Read

df = [Link]('csv') \
.option('mode', 'failfast') \
.load('path/to/data') [cite: 54]

Azure Data Engineering - Day 39

Implementation Workshop (Day 14):
To master the topics covered in Day 14, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Spark Reading & Writing Modes.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Spark Reading & Writing Modes apply
to optimizing their daily inventory reports? In Day 14, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 40

Azure Data Engineering - Day 41
Day 15: DataFrame Transformations

Theory: Column Operations

DataFrames are transformed using various functions:

• withColumn: Add or update a column.

• select: Choose specific columns.

• col: Reference a column for functions.

• alias: Rename a column for display.

Conditional Logic (When/Otherwise)

The `when` and `otherwise` functions serve as the SQL 'CASE' equivalent in
PySpark.

from [Link] import col, when

df = [Link]("salary_group", when(col("salary") > 5000,
"High").otherwise("Low"))

Handling Nulls

Functions like `fillna` and `dropna` are used to clean missing data in the Silver
layer.

Azure Data Engineering - Day 42

Implementation Workshop (Day 15):
To master the topics covered in Day 15, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of DataFrame Transformations.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of DataFrame Transformations apply to
optimizing their daily inventory reports? In Day 15, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 43

Azure Data Engineering - Day 44
Day 16: Window Functions &
Aggregations

Theory: Advanced Analytical Queries

Window Functions: Used for calculations across a set of rows related to the
current row (e.g., Ranking, Running Totals). [cite: 55]

Aggregations: Operations like `sum`, `avg`, `min`, and `max` performed via
`groupBy`.

# PySpark GroupBy Example

[Link]("department").agg({"salary": "avg"}) [cite: 55]

Related Data: Dropna

The `dropna(subset=['column_name'])` function allows for targeted cleaning of

rows where critical columns are null. [cite: 55]

Implementation Workshop (Day 16):

To master the topics covered in Day 16, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Window Functions & Aggregations.

Azure Data Engineering - Day 45

• Monitor the performance of the services and identify any bottlenecks in
the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Window Functions & Aggregations
apply to optimizing their daily inventory reports? In Day 16, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 46

Day 17: Complex Types & Joins

Theory: Handling Arrays and Joins

Collective Functions:

• collect_list / collect_set: Aggregates values into an array (set removes

duplicates). [cite: 56]

• explode: Flattens an array into multiple rows. [cite: 56]

Joins in PySpark

Standard joins like Inner, Left, and Right are supported. [cite: 57]

• Left Anti Join: Returns records in the left table that have no matching
record in the right. [cite: 57]

Related Data: Key Vault

Always use Azure Key Vault to store connection strings and secrets, retrieving
them via `[Link]()`. [cite: 57]

Implementation Workshop (Day 17):

To master the topics covered in Day 17, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Complex Types & Joins.

Azure Data Engineering - Day 47

• Monitor the performance of the services and identify any bottlenecks in
the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Complex Types & Joins apply to
optimizing their daily inventory reports? In Day 17, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 48

Day 18: Join Optimization & Lazy
Evaluation

Theory: How Spark Executes

Broadcast Hash Join: An optimization for joining a large table with a small
table (<10MB). The small table is "broadcasted" to all workers to avoid
network shuffling. [cite: 58, 59]

Lazy Evaluation: Spark doesn't execute transformations (filter, join)

immediately. It waits for an Action (count, show, display). [cite: 60]

Spark Logical Hierarchy

• Job: Triggered by an action. [cite: 60]

• Stage: Created when data needs to be shuffled (Wide Transformation).

[cite: 60]

• Task: The smallest unit of work (processes 128MB of data). [cite: 60]

Lineage and Fault Tolerance

Spark tracks the history of transformations (Lineage). If a node fails, it can

recompute the data. [cite: 62] **Checkpoints** can be used to break the
lineage and save intermediate data to disk for performance. [cite: 62]

Azure Data Engineering - Day 49

Implementation Workshop (Day 18):
To master the topics covered in Day 18, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Join Optimization & Lazy Evaluation.

• Monitor the performance of the services and identify any bottlenecks in

the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Join Optimization & Lazy Evaluation
apply to optimizing their daily inventory reports? In Day 18, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 50

Azure Data Engineering - Day 51
Day 19: Optimization Techniques &
Summary

Theory: Fine-Tuning Spark

Optimization is key to managing cloud costs. Techniques include:

• Partitioning: Dividing data into smaller chunks for parallel processing

(standard 128MB per task). [cite: 48]

• Caching / Persisting: Storing intermediate results in memory for reuse.

[cite: 47]

• Explain Plan: Using `[Link]('extended')` to view how the Catalyst

Optimizer structured the physical plan. [cite: 61]

Course Wrap-up

An Azure Data Engineer must master the flow from ingestion (ADF) to storage
(ADLS) to high-performance processing (Databricks), all while maintaining
security and optimizing for cost.

Implementation Workshop (Day 19):

To master the topics covered in Day 19, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical

concepts of Optimization Techniques & Summary.

Azure Data Engineering - Day 52

• Monitor the performance of the services and identify any bottlenecks in
the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of

transactions. How would the concepts of Optimization Techniques & Summary
apply to optimizing their daily inventory reports? In Day 19, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 53

AZURE - EOD - Summary
No ratings yet
AZURE - EOD - Summary
15 pages
Data Engineering with Spark & Delta Lake
No ratings yet
Data Engineering with Spark & Delta Lake
26 pages
Azure Data Lake & Power BI Overview
No ratings yet
Azure Data Lake & Power BI Overview
9 pages
Azure Data Factory Course Overview
No ratings yet
Azure Data Factory Course Overview
4 pages
Data Platform Best Practices
No ratings yet
Data Platform Best Practices
27 pages
Azure EOD Summary
No ratings yet
Azure EOD Summary
22 pages
Azure Databricks Integration Guide
No ratings yet
Azure Databricks Integration Guide
32 pages
Overview of Azure Data Factory Features
No ratings yet
Overview of Azure Data Factory Features
10 pages
Azure Data Factory Components Overview
No ratings yet
Azure Data Factory Components Overview
17 pages
Data Literacy: Azure Data Factory Essentials
No ratings yet
Data Literacy: Azure Data Factory Essentials
4 pages
Azure Data Factory Overview and Setup Guide
No ratings yet
Azure Data Factory Overview and Setup Guide
7 pages
Modern Data Architecture Overview
No ratings yet
Modern Data Architecture Overview
19 pages
ADF Basics Workshop Overview
No ratings yet
ADF Basics Workshop Overview
26 pages
Data Engineering with Airflow & Power BI
No ratings yet
Data Engineering with Airflow & Power BI
20 pages
Azure Data Engineer Course Outline
No ratings yet
Azure Data Engineer Course Outline
4 pages
Data Engineering with Azure: 2019 Guide
No ratings yet
Data Engineering with Azure: 2019 Guide
4 pages
Data Lake Bootcamp Overview
100% (1)
Data Lake Bootcamp Overview
46 pages
Azure Data Lake Implementation Guide
No ratings yet
Azure Data Lake Implementation Guide
8 pages
Azure Data Factory Overview and Guide
No ratings yet
Azure Data Factory Overview and Guide
11 pages
Azure Databricks Administration Guide
No ratings yet
Azure Databricks Administration Guide
32 pages
Key Data Management Concepts Explained
No ratings yet
Key Data Management Concepts Explained
17 pages
Azure Data Engineering Course ContentMar2026
No ratings yet
Azure Data Engineering Course ContentMar2026
7 pages
Key Components of Azure Data Factory
No ratings yet
Key Components of Azure Data Factory
19 pages
Azure Data Factory Overview and Concepts
No ratings yet
Azure Data Factory Overview and Concepts
7 pages
M2
No ratings yet
M2
4 pages
Azure Data Lake Storage Gen2 Overview
No ratings yet
Azure Data Lake Storage Gen2 Overview
8 pages
Azure Data Pipeline with SQL Server
No ratings yet
Azure Data Pipeline with SQL Server
22 pages
Best Practices for Azure Data Factory
No ratings yet
Best Practices for Azure Data Factory
11 pages
Azure Data Factory Overview and Key Features
No ratings yet
Azure Data Factory Overview and Key Features
15 pages
Azure Data Architecture Overview
No ratings yet
Azure Data Architecture Overview
97 pages
Medallion Architecture in Data Lakehouses
No ratings yet
Medallion Architecture in Data Lakehouses
6 pages
DP-203 Azure Data Engineering Course
No ratings yet
DP-203 Azure Data Engineering Course
98 pages
Azure Data Engineering Essentials
No ratings yet
Azure Data Engineering Essentials
130 pages
Azure Data Factory Overview and Features
No ratings yet
Azure Data Factory Overview and Features
26 pages
Azure Data Lake Storage Overview
100% (1)
Azure Data Lake Storage Overview
21 pages
Azure Data Lake Storage Overview
No ratings yet
Azure Data Lake Storage Overview
21 pages
Databricks Data Engineer Exam Overview
100% (2)
Databricks Data Engineer Exam Overview
103 pages
Data Engineering Course 2024-2025 Guide
No ratings yet
Data Engineering Course 2024-2025 Guide
4 pages
Azure Data Factory Activity Types Explained
100% (1)
Azure Data Factory Activity Types Explained
37 pages
SSIS Integration in Azure Data Factory
No ratings yet
SSIS Integration in Azure Data Factory
118 pages
Azure Data Factory Extended Notes
No ratings yet
Azure Data Factory Extended Notes
3 pages
Azure Data Factory Overview Guide
100% (2)
Azure Data Factory Overview Guide
26 pages
Data Engineering with Azure: A Comprehensive Guide
No ratings yet
Data Engineering with Azure: A Comprehensive Guide
3 pages
Integrating On-Prem AD with Azure AD
No ratings yet
Integrating On-Prem AD with Azure AD
1 page
Azure Data Factory Overview and Triggers
No ratings yet
Azure Data Factory Overview and Triggers
57 pages
Azure Data Factory Interview Basics
No ratings yet
Azure Data Factory Interview Basics
18 pages
Azure Data Factory: Data Integration Guide
No ratings yet
Azure Data Factory: Data Integration Guide
5 pages
Azure Data Factory Training Overview
No ratings yet
Azure Data Factory Training Overview
5 pages
Limitless Analytics with Azure Synapse
No ratings yet
Limitless Analytics with Azure Synapse
23 pages
ADF Interview Questions & Answers Guide
No ratings yet
ADF Interview Questions & Answers Guide
29 pages
Azure Storage Redundancy & Data Processing
No ratings yet
Azure Storage Redundancy & Data Processing
27 pages
Cost Management in Databricks Data Lakes
No ratings yet
Cost Management in Databricks Data Lakes
94 pages
Full Databricks Project 1723160534
No ratings yet
Full Databricks Project 1723160534
112 pages
Understanding Azure Data Factory Components
No ratings yet
Understanding Azure Data Factory Components
16 pages
Azure Databricks Distributed Systems Guide
No ratings yet
Azure Databricks Distributed Systems Guide
11 pages
Azure Data Factory Overview and Use Cases
100% (1)
Azure Data Factory Overview and Use Cases
6 pages
Azure Data Engineering Capstone Project
No ratings yet
Azure Data Engineering Capstone Project
17 pages
Data Warehousing and Data Integration
No ratings yet
Data Warehousing and Data Integration
71 pages
Medallion Architecture for Data Pipelines
No ratings yet
Medallion Architecture for Data Pipelines
6 pages
IoT Data Analytics Overview and Challenges
No ratings yet
IoT Data Analytics Overview and Challenges
17 pages
Big Data Analytics Mid-II Exam 2024
No ratings yet
Big Data Analytics Mid-II Exam 2024
2 pages
Big Data Engineer Job Research
No ratings yet
Big Data Engineer Job Research
15 pages
Key Features and Benefits of Hadoop
No ratings yet
Key Features and Benefits of Hadoop
15 pages
CCS334 Big Data Analytics Syllabus
No ratings yet
CCS334 Big Data Analytics Syllabus
2 pages
Senior Data Engineer Profile Overview
No ratings yet
Senior Data Engineer Profile Overview
5 pages
Instagram Information System Analysis
No ratings yet
Instagram Information System Analysis
10 pages
Cloud Computing Installation Guide
No ratings yet
Cloud Computing Installation Guide
58 pages
Big Data Analytics Syllabus SPPU
No ratings yet
Big Data Analytics Syllabus SPPU
3 pages
CSE 7th Semester Syllabus Overview
No ratings yet
CSE 7th Semester Syllabus Overview
3 pages
BDA Unit 3: MapReduce Overview
No ratings yet
BDA Unit 3: MapReduce Overview
12 pages
Big Data Analytics Course Overview
No ratings yet
Big Data Analytics Course Overview
61 pages
Top Machine Learning Libraries for Big Data
No ratings yet
Top Machine Learning Libraries for Big Data
19 pages
Hadoop Interview Questions & Answers Guide
100% (2)
Hadoop Interview Questions & Answers Guide
25 pages
Enterprise Data Storage Solutions Explained
100% (1)
Enterprise Data Storage Solutions Explained
34 pages
Vmware Containers and Container Networking Whitepaper
No ratings yet
Vmware Containers and Container Networking Whitepaper
57 pages
HBase 1.1.2 Installation Guide on Hadoop
No ratings yet
HBase 1.1.2 Installation Guide on Hadoop
7 pages
Big Data Technologies Overview Survey
No ratings yet
Big Data Technologies Overview Survey
10 pages
Big Data QA: Essays and Key Concepts
No ratings yet
Big Data QA: Essays and Key Concepts
5 pages
Understanding Hadoop and HDFS Basics
No ratings yet
Understanding Hadoop and HDFS Basics
20 pages
BDA Sample Question Paper & Answers
No ratings yet
BDA Sample Question Paper & Answers
29 pages
BCS714D Big Data and Analytics Syllabus
No ratings yet
BCS714D Big Data and Analytics Syllabus
1 page
A Survey On Machine Learning Assisted Big Data Analysis For Health Care Domain
No ratings yet
A Survey On Machine Learning Assisted Big Data Analysis For Health Care Domain
5 pages
Big Data Applications in Various Industries
No ratings yet
Big Data Applications in Various Industries
7 pages
Big Data Masters - Curriculum - 01
No ratings yet
Big Data Masters - Curriculum - 01
46 pages
Big Data Processing Techniques Overview
No ratings yet
Big Data Processing Techniques Overview
22 pages
Pyspark-Focused Data Engineer Resume
No ratings yet
Pyspark-Focused Data Engineer Resume
2 pages
HDFS Command Basics for Big Data
No ratings yet
HDFS Command Basics for Big Data
21 pages
Test Bank For Essentials of Business Analytics 3rd Edition by Camm
100% (2)
Test Bank For Essentials of Business Analytics 3rd Edition by Camm
24 pages
Enhancing Indian Railways e-Ticketing with Big Data
No ratings yet
Enhancing Indian Railways e-Ticketing with Big Data
3 pages