0% found this document useful (0 votes)
4 views53 pages

Azure Data Engineer Course Notes

The document is a comprehensive guide on Azure Data Engineering, covering essential topics such as cloud fundamentals, Medallion Architecture, Azure Data Lake Storage, identity management, and Azure Data Factory. It includes theoretical concepts, practical scenarios, and implementation workshops designed to help practitioners master data management and transformation using Azure services. Each day focuses on specific aspects of data engineering, providing hands-on tasks to validate the learned concepts.

Uploaded by

kvinee2344
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views53 pages

Azure Data Engineer Course Notes

The document is a comprehensive guide on Azure Data Engineering, covering essential topics such as cloud fundamentals, Medallion Architecture, Azure Data Lake Storage, identity management, and Azure Data Factory. It includes theoretical concepts, practical scenarios, and implementation workshops designed to help practitioners master data management and transformation using Azure services. Each day focuses on specific aspects of data engineering, providing hands-on tasks to validate the learned concepts.

Uploaded by

kvinee2344
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Azure Data Engineering

Comprehensive Guide

Mastering ADF, Databricks, and SQL for Modern Data Architectures

Azure Data Engineering - Day 1


Day 1: Cloud Fundamentals &
Medallion Architecture

Theory: Introduction to Azure Data Engineering

Azure Data Engineering involves designing and implementing the


management, monitoring, security, and privacy of data using the full stack of
Azure data services. [cite: 1] The primary goal is to build data pipelines that
transform raw data into a structured format suitable for business intelligence.

On-Premises vs. Cloud: In an on-premises environment, the company is


responsible for purchasing hardware, maintaining data centers, and
managing limited resources. In contrast, cloud providers like Azure, AWS, and
GCP provide resources on a "pay-as-you-use" basis. [cite: 2] This allows for
massive scalability and cost efficiency.

Cloud Classification: Cloud services are shared among multiple users with
robust security. [cite: 3] Models include:

• Public Cloud: Shared hardware among different organizations. [cite: 3]

• Private Cloud: Dedicated hardware for one organization. [cite: 3]

• Hybrid Cloud: A mix of both public and private environments. [cite: 3]

Service Models:

• IaaS (Infrastructure as a Service): Provides networking, security, and


operating systems. [cite: 4]

• PaaS (Platform as a Service): Provides a platform for development


without managing underlying infrastructure. [cite: 4]

Azure Data Engineering - Day 2


• SaaS (Software as a Service): Ready-to-use software applications. [cite:
4]

Medallion Architecture

The Medallion Architecture is a layered data design pattern used to organize


data in a lakehouse. [cite: 5]

• Bronze (Raw Layer): This layer stores data in its raw, original format
(JSON, CSV, Parquet). [cite: 5] It acts as the "source of truth."

• Silver (Refinement Layer): Data is cleaned and refined. This involves


removing duplicates, handling null values, and correcting corrupted
data. [cite: 5]

• Gold (Aggregation Layer): The final layer where data is aggregated and
optimized for reporting and analytics. [cite: 5]

Related Data: Data Formats

Common formats include Parquet (columnar storage, efficient for queries),


CSV (simple but lacks schema), and JSON (flexible for semi-structured data).
[cite: 5]

Implementation Workshop (Day 1):


To master the topics covered in Day 1, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Cloud Fundamentals & Medallion Architecture.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

Azure Data Engineering - Day 3


• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Cloud Fundamentals & Medallion
Architecture apply to optimizing their daily inventory reports? In Day 1, we focus
on the foundational principles that allow this retail data to move from raw
transaction logs into actionable insights.

Azure Data Engineering - Day 4


Day 2: Azure Data Lake Storage (ADLS
Gen2)

Theory: Storage Infrastructure

Azure Data Lake Storage (ADLS) Gen2 is the foundational storage for Big Data
on Azure. It is organized into a hierarchy: Tenant -> Subscription -> Resource
Group -> Services (ADF, Databricks, ADLS). Services within the same Resource
Group often share the same region and VNet. [cite: 6]

Access Tiers: Based on how frequently data is accessed, you can choose:

• Hot: For frequent access; lowest access cost but higher storage cost.
[cite: 6]

• Cool: For data accessed less frequently (min 30 days). [cite: 6]

• Cold: For data accessed very rarely (min 90 days). [cite: 6]

• Archive: For long-term storage of rare data. [cite: 6]

Blob Types:

• Block Blobs: Best for large files like movies and documents. [cite: 6]

• Append Blobs: Optimized for append operations (e.g., logging). [cite: 6]

• Page Blobs: Optimized for random read/write (e.g., VM disks). [cite: 6]

Azure Data Engineering - Day 5


Security and Availability

Permissions: Access is often granted using the Blob Storage Contributor role.
[cite: 7] Soft-delete retention periods protect against accidental deletion. [cite:
7]

Redundancy (Backup): Options like LRS (Locally Redundant Storage)


replicate data three times within a single data center. [cite: 7]

Related Data: ADLS Gen2 vs Blob Storage

The key differentiator is the Hierarchical Namespace. While Blob storage


uses a flat namespace, Gen2 allows for true folder structures, which
significantly improves the performance of directory-level operations in big
data processing. [cite: 7]

Implementation Workshop (Day 2):


To master the topics covered in Day 2, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Azure Data Lake Storage (ADLS Gen2).

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Azure Data Lake Storage (ADLS Gen2)
apply to optimizing their daily inventory reports? In Day 2, we focus on the

Azure Data Engineering - Day 6


foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 7


Day 3: Identity Management &
Databricks Billing

Theory: Entra ID (Azure AD)

Identity management involves three stages:

• Identification: Providing a username or email. [cite: 8]

• Authentication: Verifying identity via password or certificate. [cite: 8]

• Authorization: Controlling access (e.g., allowing ADF to read from


ADLS). [cite: 8]

Access Methods:

• Account Key/SAS Key: Temporary or permanent secrets. [cite: 8]

• Managed Identity: Azure-managed identities (System-assigned or User-


assigned). [cite: 8]

• Service Principal (SPN): An application identity used for third-party tools


like Databricks. [cite: 8]

Databricks Clusters and Billing

Databricks costs are based on resource usage (DBUs). [cite: 9]

• All-Purpose Clusters: Used for interactive analysis and development.


[cite: 9]

• Job Clusters: Cost-effective, automatically created and terminated for


jobs. [cite: 9]

Azure Data Engineering - Day 8


• Serverless: Compute managed by Databricks, providing instant
scalability. [cite: 10]

Related Data: PIM (Privileged Identity Management)

PN provides time-based access (e.g., 8 hours) to sensitive resources to reduce


the attack surface. [cite: 8]

Implementation Workshop (Day 3):


To master the topics covered in Day 3, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Identity Management & Databricks Billing.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Identity Management & Databricks
Billing apply to optimizing their daily inventory reports? In Day 3, we focus on
the foundational principles that allow this retail data to move from raw
transaction logs into actionable insights.

Azure Data Engineering - Day 9


Azure Data Engineering - Day 10
Day 4: Azure Data Factory (ADF) - ETL/
ELT Basics

Theory: Orchestration with ADF

Azure Data Factory (ADF) is a serverless data integration service used to


orchestrate data movement and transformation. [cite: 12] It supports both ETL
(Extract-Transform-Load) and ELT (Extract-Load-Transform).

Medallion ETL: Data is fetched from a source (ADLS/SQL) and moved into the
Raw/Bronze layer. [cite: 12]

Pipeline Settings:

• Creation: Pipelines contain activities that perform actions. [cite: 12]

• Activity Logic: Activities can be chained with dependencies like On


Success, On Fail, On Skip, or On Completion. [cite: 12]

Basic Activities

• Wait Activity: Pauses the pipeline for a specified duration. [cite: 12]

• Execute Pipeline: Invokes another pipeline for modular design. [cite: 12]

Parameterization

Parameters allow pipelines to be dynamic. You can use Global Parameters or


Variables to pass values into activities. [cite: 12]

Azure Data Engineering - Day 11


Implementation Workshop (Day 4):
To master the topics covered in Day 4, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Azure Data Factory (ADF) - ETL/ELT Basics.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Azure Data Factory (ADF) - ETL/ELT
Basics apply to optimizing their daily inventory reports? In Day 4, we focus on
the foundational principles that allow this retail data to move from raw
transaction logs into actionable insights.

Azure Data Engineering - Day 12


Azure Data Engineering - Day 13
Day 5: ADF Control Flow & Iteration

Theory: Advanced Control Flow

To handle complex logic, ADF provides several control flow activities:

• If Condition: Evaluates an expression and branches execution based on


True or False. [cite: 12]

• Switch Activity: Similar to 'case' statements in programming, it


branches based on specific values. [cite: 12]

• For-Each: Iterates over a collection (array) of items (e.g., processing a list


of files). [cite: 12]

• Filter Activity: Filters an input array based on a condition. [cite: 12]

Related Data: JSON Expressions

ADF uses JSON-based expressions. For example, checking if a parameter


'Country' equals 'Iran' in a Switch activity. [cite: 12]

# Example Expression for If Condition


@equals(pipeline().[Link], 'SQL')

Azure Data Engineering - Day 14


Implementation Workshop (Day 5):
To master the topics covered in Day 5, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of ADF Control Flow & Iteration.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of ADF Control Flow & Iteration apply to
optimizing their daily inventory reports? In Day 5, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 15


Azure Data Engineering - Day 16
Day 6: SQL Activities & ADLS Metadata

Theory: Connecting ADF to SQL and ADLS

Script Activity: Used to run SQL queries against Azure SQL Database (e.g.,
querying [Link]). [cite: 13]

Metadata and Validation:

• Get Metadata: Retrieves information about files (size, modified date,


child items). [cite: 14]

• Validation Activity: Ensures that a file or folder exists at a specific


location before proceeding. [cite: 14]

• Lookup Activity: Used to retrieve a dataset (JSON or SQL result) to


visualize or use in downstream activities. [cite: 15]

Implementation Details

Linked Services: Store connection credentials (access details). [cite: 14]

Datasets: Define the specific location of the data (file path or table name).
[cite: 14]

Related Data: Timeouts

Validation activities often have a timeout setting (e.g., 12 hours) represented


as '[Link]'. [cite: 14]

Azure Data Engineering - Day 17


Implementation Workshop (Day 6):
To master the topics covered in Day 6, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of SQL Activities & ADLS Metadata.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of SQL Activities & ADLS Metadata apply
to optimizing their daily inventory reports? In Day 6, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 18


Azure Data Engineering - Day 19
Day 7: ADF Triggers & Integration
Runtimes

Theory: Automating Pipelines

Triggers are used to automate pipeline execution. [cite: 17]

• Schedule Trigger: Runs at a specific wall-clock time. [cite: 17]

• Tumbling Window Trigger: Runs on a fixed time interval (e.g., every


hour) and handles backfilling. [cite: 17]

• Event-Based Trigger: Runs when an event occurs, such as a file arriving


in storage. [cite: 17]

Integration Runtimes (IR)

The IR is the compute infrastructure used by ADF for data movement and
activity execution. [cite: 18]

• Auto-Resolved (Cloud) IR: Used for cloud-to-cloud data movement.


[cite: 18]

• Self-Hosted IR: Required for on-premises or private network


connectivity. [cite: 18]

• SSIS IR: For running legacy SSIS packages in the cloud. [cite: 18]

Azure Data Engineering - Day 20


Related Data: SQL Scripting

Script activities can handle DDL and DML operations like creating tables or
inserting records for logging. [cite: 18]

Implementation Workshop (Day 7):


To master the topics covered in Day 7, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of ADF Triggers & Integration Runtimes.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of ADF Triggers & Integration Runtimes
apply to optimizing their daily inventory reports? In Day 7, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 21


Azure Data Engineering - Day 22
Day 8: Copy Activity Scenarios

Theory: Data Movement Patterns

The Copy Activity is the most frequently used activity in ADF. Common
patterns include:

• ADLS to ADLS: Migrating files between containers or storage accounts.

• SQL to SQL: Moving data between databases.

• ADLS to SQL: Loading flat files (CSV/Parquet) into relational tables.

• SQL to ADLS: Offloading database data into a data lake for analytics.

Settings

The Preserve Hierarchy setting is used when copying folders to ensure the
destination matches the source structure exactly.

Introduction to PySpark

For more complex transformations that the Copy Activity cannot handle, we
introduce PySpark using Azure Databricks. PySpark can interact with Key
Vault for secure secrets and process data at scale.

Implementation Workshop (Day 8):


To master the topics covered in Day 8, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

Azure Data Engineering - Day 23


• Implement a test scenario using dummy data to validate the theoretical
concepts of Copy Activity Scenarios.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Copy Activity Scenarios apply to
optimizing their daily inventory reports? In Day 8, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 24


Day 9: SQL DML & Incremental Loads

Theory: SQL Operations

Standard SQL commands are often integrated into ADF pipelines:

• Insert: Adding new records. [cite: 19]

• Update: Modifying existing data based on a key (e.g., cust_id). [cite: 19]

• Delete: Removing data from a table. [cite: 19]

Copy Scenarios and Mappings

• Auto-Create Table: If the sink table doesn't exist, ADF can create it
automatically. [cite: 20]

• Column Mismatch: Requires explicit mapping if source and sink column


names differ. [cite: 21]

Incremental Load (Watermark)

To avoid reloading all data (Full Load), we use a Watermark. [cite: 23]

• A Watermark Table stores the last processed timestamp. [cite: 23]

• ADF queries the source for records newer than the watermark, copies
them, and updates the watermark table. [cite: 23]

Azure Data Engineering - Day 25


Implementation Workshop (Day 9):
To master the topics covered in Day 9, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of SQL DML & Incremental Loads.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of SQL DML & Incremental Loads apply to
optimizing their daily inventory reports? In Day 9, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 26


Azure Data Engineering - Day 27
Day 10: Delta Copy & Watermark
Implementation

Theory: Step-by-Step Incremental Load

The standard pattern for an incremental load involves:

1. Lookup for the Current Watermark. [cite: 24]

2. Lookup for the New (Max) Watermark in the source. [cite: 24]

3. Copy Data where source date is between Old and New Watermarks. [cite:
24]

4. Stored Procedure to update the Control (Watermark) Table. [cite: 24]

-- Example SQL Update for Watermark


UPDATE WatermarkTable
SET WatermarkValue = @NewWatermark
WHERE TableName = 'Person'; [cite: 24]

Related Data: Person Table Scenario

Imagine a table 'Person' with a column 'LastModifyTime'. We use this column


to identify which records were updated or added since the last run. [cite: 24]

Azure Data Engineering - Day 28


Implementation Workshop (Day 10):
To master the topics covered in Day 10, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Delta Copy & Watermark Implementation.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Delta Copy & Watermark
Implementation apply to optimizing their daily inventory reports? In Day 10, we
focus on the foundational principles that allow this retail data to move from raw
transaction logs into actionable insights.

Azure Data Engineering - Day 29


Azure Data Engineering - Day 30
Day 11: Databricks & Spark
Architecture

Theory: Why Databricks?

While ADF is great for orchestration, Databricks provides high-performance


compute via Apache Spark. [cite: 25] Spark uses In-Memory processing,
which is significantly faster than Hadoop's disk-based MapReduce. [cite: 25]

Spark Architecture

Feature Hadoop Spark

Processing Slow (Disk-based) [cite: Very Fast (RAM-based) [cite:


Speed 30] 31]

Processing Type Batch only [cite: 32] Batch + Streaming [cite: 33]

Execution MapReduce [cite: 36] DAG engine [cite: 36]

Memory Management

Executor Memory: Divided into On-heap and Off-heap. [cite: 46]

• Unified Memory: Shared by Execution and Storage (0.6% default). [cite:


47]

• User Memory: For user-defined data structures. [cite: 47]

Azure Data Engineering - Day 31


• Reserved Memory: System overhead (300MB default). [cite: 47]

Spark Submit

The `spark-submit` command is used to launch applications on a cluster. [cite:


44] It specifies the master (local, yarn, k8s), deploy mode (client/cluster), and
resource allocation (memory/cores). [cite: 44, 45]

Implementation Workshop (Day 11):


To master the topics covered in Day 11, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Databricks & Spark Architecture.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Databricks & Spark Architecture apply
to optimizing their daily inventory reports? In Day 11, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 32


Azure Data Engineering - Day 33
Day 12: Databricks Utilities & Widgets

Theory: Notebook Productivity

Databricks provides DBUtils (Databricks Utilities) for file system operations


and notebook chaining.

• [Link]: Mount storage, list files, and move data.

• [Link]: Run one notebook from another.

Interactive Widgets

Widgets allow you to pass parameters to notebooks interactively or via ADF.

• Combobox / Dropdown: Selection from a list.

• Text: Free-text input.

• Multiselect: Multiple selections.

Implementation Workshop (Day 12):


To master the topics covered in Day 12, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Databricks Utilities & Widgets.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Azure Data Engineering - Day 34


Practical Scenario: Consider a retail company processing millions of
transactions. How would the concepts of Databricks Utilities & Widgets apply to
optimizing their daily inventory reports? In Day 12, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 35


Day 13: Storage Mounting & PySpark
Structures

Theory: Managed vs. External Tables

In Databricks (Spark SQL):

• Managed Tables: Spark manages both metadata and data. If you drop
the table, data is deleted. [cite: 50]

• External Tables: Spark only manages metadata. If you drop the table,
the data remains in ADLS. [cite: 50]

Connecting to Storage

Storage is often connected via Mounting or using Unity Catalog. Unity


Catalog provides central governance across the metastore, catalogs, and
schemas. [cite: 51]

PySpark Data Structures

• RDD (Resilient Distributed Dataset): The low-level API; no schema,


slower.

• DataFrame: Structured API (like a table); fast due to Catalyst Optimizer.

• Dataset: A mix of RDD and DataFrame (mostly used in Scala/Java).

Azure Data Engineering - Day 36


Implementation Workshop (Day 13):
To master the topics covered in Day 13, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Storage Mounting & PySpark Structures.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Storage Mounting & PySpark Structures
apply to optimizing their daily inventory reports? In Day 13, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 37


Azure Data Engineering - Day 38
Day 14: Spark Reading & Writing
Modes

Theory: Handling Data at Scale

Reading Modes:

• Permissive: Default; fills corrupt data with nulls. [cite: 52]

• DropMalformed: Removes records that don't match the schema. [cite:


52]

• FailFast: Throws an error immediately upon encountering corrupt data.


[cite: 52]

Writing Modes

• Overwrite: Deletes existing data and writes new. [cite: 53]

• Append: Adds data to the existing folder. [cite: 53]

• ErrorIfExists: Throws an error if data already exists. [cite: 53]

# Example PySpark Read


df = [Link]('csv') \
.option('mode', 'failfast') \
.load('path/to/data') [cite: 54]

Azure Data Engineering - Day 39


Implementation Workshop (Day 14):
To master the topics covered in Day 14, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Spark Reading & Writing Modes.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Spark Reading & Writing Modes apply
to optimizing their daily inventory reports? In Day 14, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 40


Azure Data Engineering - Day 41
Day 15: DataFrame Transformations

Theory: Column Operations

DataFrames are transformed using various functions:

• withColumn: Add or update a column.

• select: Choose specific columns.

• col: Reference a column for functions.

• alias: Rename a column for display.

Conditional Logic (When/Otherwise)

The `when` and `otherwise` functions serve as the SQL 'CASE' equivalent in
PySpark.

from [Link] import col, when


df = [Link]("salary_group", when(col("salary") > 5000,
"High").otherwise("Low"))

Handling Nulls

Functions like `fillna` and `dropna` are used to clean missing data in the Silver
layer.

Azure Data Engineering - Day 42


Implementation Workshop (Day 15):
To master the topics covered in Day 15, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of DataFrame Transformations.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of DataFrame Transformations apply to
optimizing their daily inventory reports? In Day 15, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 43


Azure Data Engineering - Day 44
Day 16: Window Functions &
Aggregations

Theory: Advanced Analytical Queries

Window Functions: Used for calculations across a set of rows related to the
current row (e.g., Ranking, Running Totals). [cite: 55]

Aggregations: Operations like `sum`, `avg`, `min`, and `max` performed via
`groupBy`.

# PySpark GroupBy Example


[Link]("department").agg({"salary": "avg"}) [cite: 55]

Related Data: Dropna

The `dropna(subset=['column_name'])` function allows for targeted cleaning of


rows where critical columns are null. [cite: 55]

Implementation Workshop (Day 16):


To master the topics covered in Day 16, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Window Functions & Aggregations.

Azure Data Engineering - Day 45


• Monitor the performance of the services and identify any bottlenecks in
the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Window Functions & Aggregations
apply to optimizing their daily inventory reports? In Day 16, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 46


Day 17: Complex Types & Joins

Theory: Handling Arrays and Joins

Collective Functions:

• collect_list / collect_set: Aggregates values into an array (set removes


duplicates). [cite: 56]

• explode: Flattens an array into multiple rows. [cite: 56]

Joins in PySpark

Standard joins like Inner, Left, and Right are supported. [cite: 57]

• Left Anti Join: Returns records in the left table that have no matching
record in the right. [cite: 57]

Related Data: Key Vault

Always use Azure Key Vault to store connection strings and secrets, retrieving
them via `[Link]()`. [cite: 57]

Implementation Workshop (Day 17):


To master the topics covered in Day 17, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Complex Types & Joins.

Azure Data Engineering - Day 47


• Monitor the performance of the services and identify any bottlenecks in
the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Complex Types & Joins apply to
optimizing their daily inventory reports? In Day 17, we focus on the foundational
principles that allow this retail data to move from raw transaction logs into
actionable insights.

Azure Data Engineering - Day 48


Day 18: Join Optimization & Lazy
Evaluation

Theory: How Spark Executes

Broadcast Hash Join: An optimization for joining a large table with a small
table (<10MB). The small table is "broadcasted" to all workers to avoid
network shuffling. [cite: 58, 59]

Lazy Evaluation: Spark doesn't execute transformations (filter, join)


immediately. It waits for an Action (count, show, display). [cite: 60]

Spark Logical Hierarchy

• Job: Triggered by an action. [cite: 60]

• Stage: Created when data needs to be shuffled (Wide Transformation).


[cite: 60]

• Task: The smallest unit of work (processes 128MB of data). [cite: 60]

Lineage and Fault Tolerance

Spark tracks the history of transformations (Lineage). If a node fails, it can


recompute the data. [cite: 62] **Checkpoints** can be used to break the
lineage and save intermediate data to disk for performance. [cite: 62]

Azure Data Engineering - Day 49


Implementation Workshop (Day 18):
To master the topics covered in Day 18, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Join Optimization & Lazy Evaluation.

• Monitor the performance of the services and identify any bottlenecks in


the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Join Optimization & Lazy Evaluation
apply to optimizing their daily inventory reports? In Day 18, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 50


Azure Data Engineering - Day 51
Day 19: Optimization Techniques &
Summary

Theory: Fine-Tuning Spark

Optimization is key to managing cloud costs. Techniques include:

• Partitioning: Dividing data into smaller chunks for parallel processing


(standard 128MB per task). [cite: 48]

• Caching / Persisting: Storing intermediate results in memory for reuse.


[cite: 47]

• Explain Plan: Using `[Link]('extended')` to view how the Catalyst


Optimizer structured the physical plan. [cite: 61]

Course Wrap-up

An Azure Data Engineer must master the flow from ingestion (ADF) to storage
(ADLS) to high-performance processing (Databricks), all while maintaining
security and optimizing for cost.

Implementation Workshop (Day 19):


To master the topics covered in Day 19, practitioners should perform the
following hands-on tasks:

• Set up the environment and verify connectivity using the Azure Portal.

• Implement a test scenario using dummy data to validate the theoretical


concepts of Optimization Techniques & Summary.

Azure Data Engineering - Day 52


• Monitor the performance of the services and identify any bottlenecks in
the ingestion or transformation phase.

• Document the findings and compare the results with the expected
outcomes in the Medallion Architecture.

Practical Scenario: Consider a retail company processing millions of


transactions. How would the concepts of Optimization Techniques & Summary
apply to optimizing their daily inventory reports? In Day 19, we focus on the
foundational principles that allow this retail data to move from raw transaction
logs into actionable insights.

Azure Data Engineering - Day 53

You might also like