Data Engineering
Interview Questions — Categorized Reference
SQL · Azure Data Factory · Databricks · PySpark · ADLS · Synapse · Python · AWS
Source: LinkedIn posts from data engineering community. Compiled and organized by category for interview preparation.
Table of Contents
1. SQL — Basic Concepts ·······················································
2. SQL — Intermediate ····························································
3. SQL — Advanced ······························································
4. SQL — Performance Tips ·······················································
5. Azure Data Factory — Basic to Intermediate ···································
6. Azure Data Factory — Advanced & Scenario ······································
7. Azure Databricks ··························································
8. PySpark / Spark ······························································
9. ADLS Gen2 & Azure Storage ····················································
10. Azure Synapse Analytics ····················································
11. Python for Data Engineering ················································
12. Azure DE Key Topics — Study Roadmap ······································
13. AWS Cloud (Bonus) ····························································
1. SQL — Basic
Concepts for 0–3 Years Experience
1. What is SQL, and why is it important in data analytics?
2. Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
3. What is the difference between WHERE and HAVING clauses?
4. How do you use GROUP BY and HAVING in a query?
5. Write a query to find duplicate records in a table.
6. How do you retrieve unique values from a table using SQL?
7. Explain the use of aggregate functions like COUNT(), SUM(), AVG(), MIN(), and MAX().
8. What is the purpose of the DISTINCT keyword in SQL?
2. SQL — Intermediate
Window Functions, CTEs, Indexes
1. Write a query to find the second-highest salary from an employee table.
2. What are subqueries and how do you use them?
3. What is a Common Table Expression (CTE)? Give an example of when to use it.
4. Explain window functions like ROW_NUMBER(), RANK(), and DENSE_RANK().
5. How do you combine results of two queries using UNION and UNION ALL?
6. What are indexes in SQL, and how do they improve query performance?
7. Write a query to calculate the total sales for each month using GROUP BY.
8. What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?
9. How do you use PARTITION BY and ORDER BY in window functions?
10. Find the nth Order/Salary from a table.
11. Find the number of output records in each join from given Table 1 and Table 2.
12. YOY (Year-over-Year) and MOM (Month-over-Month) growth related questions.
13. Find Employee–Manager Hierarchy using a self join, or employees earning more than their managers.
14. Row-level scanning using CTE or recursive CTE (e.g., finding missing numbers/items).
15. Number of matches played by every team or source-to-destination flights using CROSS JOIN.
16. Use window functions to calculate moving averages or detect outliers.
17. Implement logic to handle hierarchical data (finding all descendants in a tree structure).
18. Identify and remove duplicate records from a table.
19. SQL query to find department-wise top 3 salaries.
20. Running sum, year-on-year growth, and top N people by salary using window functions.
9. Write a SQL query to find the third highest salary from the table emp. (Columns: id, salary)
10. Write a SQL query to find the department-wise third highest salary from the table emp. (Columns: dept_id,
salary)
11. Write a SQL query to find numbers which consecutively occur 3 times from the table emp. (Columns: id,
numbers)
12. Write a SQL query to find the days when temperature was higher than its previous date’s temperature.
(Columns: Days, Temp)
13. Write a query to find gaps in a dataset and fill that gap from table emp. (Columns: date, id; year is 2022)
14. Find the Most Recent Purchase Date for Each Customer. (Columns: customer_id, purchase_date)
15. Get the Number of Days Between the First and Last Purchase for Each Customer. (Columns: customer_id,
purchase_date)
16. What is a correlated subquery, and how does it differ from a non-correlated subquery?
17. Find the highest and lowest salary department-wise using GROUP BY with MAX() and MIN().
3. SQL — Advanced
Optimization, Views, Stored Procedures
1. How do you optimize a slow-running SQL query?
2. What are views in SQL, and when would you use them?
3. What is the difference between a stored procedure and a function in SQL?
4. Explain the difference between TRUNCATE, DELETE, and DROP commands.
5. What are windowing functions, and how are they used in analytics?
6. How do you handle NULL values in SQL, and what functions help (e.g., COALESCE, ISNULL)?
7. How would you optimize a complex SQL query with multiple joins and subqueries?
8. How do you handle NULL values in SQL while performing aggregations?
9. What is the purpose of the GROUP BY and HAVING clauses?
10. Explain how indexes work and when to use them.
11. Explain the difference between INNER, LEFT, ANTI, and SEMI JOINS with examples.
18. Write a SQL query to find all employees who earn more than their manager. (Columns: EmployeeID,
ManagerID, Salary)
19. What is the difference between a correlated and non-correlated subquery? Which is more performant?
20. SQL questions on window functions — rolling sum and lag/lead based. How do window functions differ from
aggregate functions?
21. Explain Slowly Changing Dimension (SCD) Type 1 in a data warehouse. How does it differ from SCD Type
2?
22. Given two datasets, explain how the number of records will vary for each type of join (Inner, Left, Right, Full
Outer Join).
23. Discuss key concepts in data modeling, including normalization and denormalization.
24. How can you mask sensitive data in Azure SQL Database? What are the different masking techniques
available?
4. SQL — Performance Tips
7 Best Practices for Query Optimization
1. Avoid using SELECT *; always specify only the columns you need.
2. Minimize the use of wildcard characters in LIKE clauses.
3. Opt for appropriate JOINs rather than relying on subqueries.
4. Reduce multiple OR conditions; consider alternatives like IN or UNION where applicable.
5. Use TOP or LIMIT to retrieve only the required rows instead of fetching the entire dataset.
6. Prefer COUNT(1) over COUNT(*) for large tables to optimize performance.
7. Ensure your database indexes are properly optimized for frequently queried columns.
5. Azure Data Factory — Basic to Intermediate
Components, Linked Services, Activities
1. What are the main components of Azure Data Factory?
2. What are Linked Services in ADF, and why are they important?
3. What types of triggers are available in ADF for pipeline execution?
4. What is the Copy Activity in ADF, and how can you optimize its performance?
5. How does ADF handle running multiple pipeline tasks simultaneously?
6. What is the Lookup activity in ADF, and how is it used?
7. How do you monitor and troubleshoot ADF pipelines?
8. What are activities in ADF? Name some important ones you have used.
9. What is the tumbling window trigger in ADF, and when should you use it?
10. Can you explain the different ways to execute pipelines in ADF?
11. What are the main data sources supported by ADF?
12. What is the Get Metadata activity in ADF?
13. What is the difference between Mapping Data Flow and Wrangling Data Flow in ADF?
14. How do you work with dynamic content and parameters in ADF pipelines?
15. How do you manage schema changes in ADF?
16. What is the purpose of a Self-hosted Integration Runtime, and when would you use it?
17. How do you integrate ADF with Databricks or Synapse Analytics?
18. What are ARM templates, and how are they used in ADF?
19. How do you handle incremental data loading in ADF?
20. How would you implement SCD Type 2 in ADF?
21. What is the limit on the number of pipelines and entities in ADF?
22. Describe the steps for deploying ADF pipelines using CI/CD.
23. What new features does ADF v2 have compared to ADF v1?
24. How can you improve data movement speed in ADF?
25. How does nested looping work in ADF?
26. Azure Data Factory — end-to-end flow of data ingestion.
27. What are Linked Services, Integration Runtime, and Datasets in ADF?
28. To copy 10 tables from the same source DB, how many datasets need to be created?
6. Azure Data Factory — Advanced & Scenario-Based
Orchestration, Error Handling, SCD, Real-time
1. How do you schedule when pipelines run in ADF?
2. How can you set up a stored procedure to run automatically in an ADF pipeline?
3. What are the different types of Integration Runtimes in ADF, and when do you use each?
4. What are Data Flows in Azure Data Factory?
5. How do you connect a Databricks notebook to ADF?
6. How would you handle errors and retries in ADF pipelines?
7. How would you take data from a CSV file in Azure Blob Storage and load it into Azure SQL Database?
8. How can you use parameters and variables in ADF pipelines to make them flexible and reusable?
9. How would you make a pipeline run only after another pipeline finishes successfully?
10. How would you set up a solution that chooses source and destination based on input parameters?
11. How do you transform data in ADF using data flow transformations?
12. How would you implement slowly changing dimensions (SCD) in ADF?
13. How do you manage access and security in Azure Data Lake Storage (ADLS)?
14. How do you connect ADF with CI/CD for deployment?
15. What is data lineage, and how would you show it in an Azure data pipeline?
16. What would you do if your data pipeline sometimes fails because of network problems?
17. How would you improve the performance of data partitioning in ADLS?
18. How would you set up incremental data loading from an on-premises SQL Server to Azure SQL?
19. How would you create a data pipeline that processes and analyzes streaming data in real-time?
20. How would you copy data from several CSV files in ADLS Gen2 to Azure SQL Database?
21. How would you manage incremental data loading in ADF from a SQL database?
22. How do you design an effective ADF pipeline — what metrics and considerations matter?
23. We have ADF and Databricks both — why use Databricks if ADF can also do transformations?
24. Share an example of when data pipeline monitoring missed an issue. How did you find and fix it?
25. What are the error handling mechanisms in ADF pipelines?
26. What steps would you take to set up logging and monitoring in ADF?
27. How do you set up dependencies and run pipelines in parallel in ADF?
28. What are best practices to make ADF pipelines perform better?
29. How can we load multiple (50+) tables at a time using ADF?
25. Explain the differences between a Scheduled Trigger and a Tumbling Window Trigger in ADF. When would
you use each?
26. What are the Control Flow activities in ADF? Explain how they differ from Data Flow activities and their
typical use cases.
27. What are the different types of Integration Runtimes (IR) in ADF? Discuss use cases and limitations of
each.
28. Describe the architecture and key components of Azure Data Factory. How do these components interact
to orchestrate data workflows?
29. What is Azure Data Factory (ADF), and how does it enable ETL and ELT processes in a cloud
environment?
30. 20 must-know ADF topics: Integration Runtime, Self-Hosted IR, On-Prem to Cloud Migration, Schedule
Triggers, Event-Based Triggers, Incremental Pipeline, Batch Pipeline, Calling One Pipeline from Another,
Dataset Parametrization, Copy Activity, Performance Optimization of Pipeline, Data Flows, CI/CD of ADF,
Pipeline Monitoring, Rerun Failed Pipeline, Get Metadata Activity, Handling If/For Loop, Handle Nested
Loop, Dynamic Content Manipulation, Pipeline Parameters.
7. Azure Databricks
Clusters, Notebooks, Delta Lake, PySpark
1. What are the types of tables in Databricks?
2. What are the types of clusters in Databricks — interactive and job clusters?
3. How to connect Azure Data Lake to Databricks?
4. What are RBAC, Azure Key Vault, and secret scopes in Databricks?
5. What is Medallion Architecture (Bronze, Silver, Gold)?
6. What are the different ways to run a notebook from another notebook in Databricks?
7. If you encounter an Out Of Memory (OOM) exception when joining two large datasets, how do you resolve
it?
8. Given these transformations — Read CSV, Repartition, Filter, Select, GroupBy, Collect() — how many jobs,
stages, and tasks are created?
9. Have you worked on Spark optimization? Explain an issue you faced and how you resolved it.
10. How do you use Kafka in your project? Describe the architecture and daily pipeline runs.
11. What are the benefits of using Delta tables?
12. What file formats have you worked with, and why is Parquet preferable to others?
13. If you have 100 GB of data, how would you choose the cluster configuration?
14. What is the Catalyst optimizer, and explain its phases?
15. Explain memory management in Spark.
16. How to flatten a JSON file using PySpark in Databricks?
17. How to read CSV files using PySpark?
18. What is a Watermark column and Watermark table?
31. What is the Databricks File System (DBFS), and how is it used?
32. How do you schedule jobs in Azure Databricks?
33. Explain the role of Apache Spark in Azure Databricks.
34. How do you configure a Spark cluster in Azure Databricks?
35. What are the advantages of using PySpark in Azure Databricks for data processing?
36. Describe the concept of notebooks in Azure Databricks.
37. How do Azure Databricks workspaces enhance collaboration?
38. Explain the significance of Delta Lake in Azure Databricks.
39. Explain the architecture of Delta Lake in Databricks and its benefits.
40. How would you implement multi-hop architecture (Bronze → Silver → Gold) in Databricks?
41. Describe the process of implementing auto-scaling in Databricks clusters.
42. How do you optimize Databricks Delta table performance?
43. Explain the concept of Databricks Unity Catalog and how it enhances data governance.
44. How do you implement SCD Type 1, 2, and 3 in PySpark on Databricks?
8. PySpark / Spark
Optimization, Delta Lake, Schema Evolution
1. Steps to tune your long-running Spark jobs.
2. What is the purpose of broadcast variables in Spark, and when would you use them?
3. What is data skewness, and how do you handle data skew in Spark jobs?
4. How do you achieve schema evolution in Spark?
5. What is Spark's lazy evaluation, and how does it optimize execution?
6. How do you handle large text files with no tabular structure in Spark?
7. What is Z-ordering in Databricks, and when would you use it?
8. What is the difference between Parquet and Delta files?
9. What are the benefits of using Delta tables over regular Parquet?
10. What is the Catalyst optimizer and explain its phases?
11. Explain memory management in Spark (executors, driver, shuffle).
12. How do you build ETL jobs using PySpark?
13. How do you handle partitioning, caching, and optimization in Spark?
14. What are Data Lake optimizations using Delta Lake?
15. How do you work with window functions, joins, and aggregations in Spark and SQL?
45. Describe the PySpark architecture.
46. What are RDDs in PySpark?
47. How does PySpark differ from Apache Hadoop?
48. What are DataFrames in PySpark?
49. How do you initialize a SparkSession?
50. What is the significance of the SparkContext?
51. Describe the types of transformations in PySpark (narrow vs wide).
52. What are actions in PySpark, and how do they differ from transformations?
53. How can you filter rows in a DataFrame?
54. How do you aggregate data in PySpark?
55. What are UDFs (User Defined Functions), and how are they used?
56. How do you repartition a DataFrame, and why would you do it?
57. Describe how to cache a DataFrame. Why is it useful?
58. How do you save a DataFrame to a file?
59. What are accumulators, and how are they used?
60. What is the significance of the Tungsten execution engine?
61. Explain how checkpointing works in PySpark.
62. What is data lakehouse architecture?
63. How do you find the second highest salary in a DataFrame using PySpark?
64. How do you use window functions in PySpark to calculate running totals or moving averages?
65. How do you remove duplicate rows from a DataFrame in PySpark?
66. How do you join two large datasets in PySpark without running out of memory?
67. How do you aggregate data by multiple columns in PySpark?
68. How do you perform ETL operations to clean and transform raw data into a structured format in PySpark?
69. How do you partition a large DataFrame by a specific column and write the partitions to different locations?
70. How do you handle and process JSON data in PySpark?
71. Which do you prefer — DataFrame or RDD? Justify your answer.
72. How do you define the structure (schema) of a DataFrame in PySpark?
73. How do you read data from a Parquet file in PySpark?
74. How do you add a new column to a DataFrame in PySpark?
75. Describe the process of implementing a custom aggregation function in PySpark.
76. How do you handle and optimize joins in PySpark when dealing with large datasets?
77. Explain the Catalyst optimizer in Spark SQL and how it improves query performance.
78. What are the best practices for managing memory in PySpark?
79. How can you monitor the performance of a PySpark application?
9. ADLS Gen2 & Azure Storage
Data Lake, Blob Storage, Medallion Architecture
1. What is the difference between Azure Blob Storage and Azure Data Lake Gen2?
2. What is ADLS Gen2 — storage for structured, semi-structured, and unstructured data?
3. What is Medallion Architecture — Bronze, Silver, Gold layers?
4. How do you design folder structures in ADLS for an incremental load pipeline?
5. How do you manage access and security in ADLS (RBAC, ACLs, managed identities)?
6. How would you improve the performance of data partitioning in ADLS?
7. What are Azure Storage Tiers, and how do you manage costs?
8. What file formats are used in ADLS (Parquet, Delta, CSV, JSON, Avro)?
9. What is the difference between Gen1 and Gen2 of ADLS?
10. Azure Synapse Analytics
SQL Pools, Star Schema, T-SQL
1. What is Synapse Analytics?
2. What is a SQL Pool, and what are its types (Dedicated vs Serverless)?
3. What is the difference between a Dedicated SQL Pool and a Serverless SQL Pool?
4. How do you connect Power BI to Azure Synapse Analytics?
5. How do you write efficient T-SQL queries in Synapse Analytics?
6. What are recursive queries, window functions, and CTEs in Synapse SQL?
7. How do you use partitioning and indexing in Synapse SQL pools?
8. What is a Star schema vs Snowflake schema for data warehousing?
9. How do you handle slowly changing dimensions (SCD) in Synapse?
10. How do you load data into Synapse using COPY INTO?
11. What is the MERGE statement in Synapse, and when do you use it?
11. Python for Data Engineering
Functions, Data Quality, Error Handling
1. What are decorators in Python and how can they be used in data engineering tasks?
2. How do you write data quality checks and explain the use-case?
3. Given an array of integers, return the indices of two numbers that add up to a specific target.
4. Write a function to rotate an array of n elements to the right by k steps.
5. How do you handle exceptions in Python using try and except?
6. Write a function to count the frequency of each character in a string.
12. Azure Data Engineering — Key Topics
Study Roadmap for Azure DE Certification / Interviews
1. Azure Data Services Overview: ADF (ETL), Databricks (analytics/ML), Synapse (data warehouse), ADLS
(data lake), Event Hub + Stream Analytics (real-time), SQL DB / Cosmos DB / Synapse SQL Pools.
2. Data Ingestion & ETL Pipelines: ADF pipeline design, Source-to-Sink connections, Incremental loading
(watermark/delta), Integration with on-prem using Self-hosted IR.
3. Data Transformation & Processing: Databricks PySpark ETL jobs, Partitioning/caching/optimization in
Spark, Delta Lake optimizations, Window functions/joins/aggregations.
4. Orchestration & Automation: ADF Triggers (tumbling window, event-based, scheduled), Parameterization
and reusable pipelines, Azure Logic Apps and Functions for orchestration.
5. Performance Optimization & Monitoring: Tuning ADF pipelines (parallel copy, partitioning), Optimizing
Spark jobs (reduce shuffle, broadcast joins), Azure Monitor alerts and logs, Managing costs with Azure
Storage Tiers.
6. Security & Governance: RBAC and managed identities, Data encryption at rest and in transit, Azure Key
Vault for credentials, Data governance with Azure Purview.
7. SQL Skills & Querying Data: Efficient T-SQL queries in Azure SQL and Synapse Analytics, Recursive
queries, CTEs, window functions, Partitioning and indexing.
8. Data Modeling & Architecture: Star and Snowflake schema, Designing data lakes (Bronze/Silver/Gold),
Handling slowly changing dimensions (SCD).
9. Hands-on Exercises: Azure Data Factory pipeline, Azure Databricks PySpark notebook, Azure SQL /
Synapse SQL pool.
13. AWS Cloud (Bonus)
Glue, DataSync, Incremental Loading
1. How do you implement incremental data loading in AWS Glue or AWS Data Pipeline?
2. Explain the difference between AWS Glue ETL jobs and AWS DataSync.
3. How do you handle error logging and notifications in AWS Glue workflows or pipelines?
4. What are the best practices for parameterizing AWS Glue scripts and workflows?
Preparation Tips
How to use this document effectively
1. Practice SQL on platforms like LeetCode, HackerRank, and StrataScratch — focus on window functions
and CTEs.
2. For ADF, create a free Azure account and build at least one end-to-end incremental pipeline using Lookup
+ Copy + Stored Procedure activities.
3. For Databricks, practice PySpark notebooks with Bronze → Silver → Gold medallion architecture using
Delta tables.
4. For Synapse, practice T-SQL including MERGE, COPY INTO, and Views for Power BI.
5. Be ready to explain the WHY: Why watermark? Why Databricks over ADF for heavy transforms? Why Delta
over Parquet?
6. Prepare scenario-based answers for questions like 'How would you handle 100 GB of data?' or 'How do
you handle OOM in Spark?'
7. For SQL interviews: always think about edge cases — NULLs, duplicates, missing rows, and data skew.
Good luck with your interviews! Build the project, practice the code, and you will be ready. 🚀
14. Dimensional Modeling
SCD, Star Schema, Junk Dimensions, Hierarchy
80. Describe the process of designing a Slowly Changing Dimension (SCD) Type 2.
81. How would you handle a many-to-many relationship in dimensional modeling?
82. Explain the concept of a junk dimension and when you would use it.
83. Describe the pros and cons of using a snowflake schema versus a star schema.
84. How would you design a dimension to handle hierarchical data?
85. Explain SCD Type 1 vs SCD Type 2 vs SCD Type 3 — when would you choose each?
86. How do you implement SCD Type 2 in PySpark using union and is_current flag?
87. What is a bridge table, and when is it used in dimensional modeling?
88. What is a degenerate dimension? Give a real-world example.
89. How does dimensional modeling differ between OLAP and OLTP systems?
15. Performance Optimization — Distributed Systems
Skew, Joins, Predicate Pushdown, Incremental Loading
90. Describe techniques for optimizing distributed joins in a big data environment.
91. How would you diagnose and resolve data skew issues in a distributed processing system?
92. Explain the concept of predicate pushdown and how it improves query performance.
93. Describe the process of optimizing a Slowly Changing Dimension (SCD) Type 2 for both write and read
performance.
94. How would you implement and optimize incremental data processing in a large-scale data pipeline?
95. What strategies help with optimizing PySpark jobs for performance (e.g., caching, partitioning, broadcast
joins)?
96. When should you use coalesce() vs repartition() in PySpark?
97. What are the common causes of a PySpark job running slowly, and how do you fix them?
98. Explain the concept of Z-ordering in Delta Lake and how it improves query performance.
16. Project & Behavioral Questions
Infosys L2, Scenario-Based, End-to-End Azure Project
General Project Questions (Infosys L2 Style)
99. Tell me about your project. What is the business problem it solves?
100. What is your source system and how many types of source systems do you work with?
101. How many jobs do you have in your project?
102. How many ETL pipelines do you have in your project?
103. How do you schedule your pipeline? What scheduler do you use?
104. Have you faced any difficulty in your project? How did you overcome it?
105. If you have one query with CTE and one with subquery, which do you prefer and what are the
differences?
106. Which do you prefer — DataFrame or RDD? Why?
107. Find the second highest salary of each department with the employee name. (Columns: id, name, dept,
salary)
Scenario-Based: End-to-End Azure Data Engineering Project
108. Can you briefly explain the steps involved in an Azure Data Engineer project from data ingestion to
visualization?
109. How do you handle data ingestion in Azure? (ADF for batch, Event Hubs for streaming)
110. What are the key storage options available in Azure for a data engineering project? (ADLS, Blob,
Synapse, SQL DB)
111. Can you describe the data processing step in your project? (Azure Databricks, Synapse Analytics, ADF
Data Flows)
112. How do you handle data warehousing in Azure? (Synapse Analytics, fact and dimension tables)
113. How do you ensure data governance and security in your project? (AAD, Azure Key Vault, Azure Monitor)
114. Can you give an example of a challenge you faced in one of your Azure projects and how you overcame
it?
115. Which IR should be used for copying data from an on-premises database to Azure?
116. Describe an Azure pipeline design you consider best practice — what metrics and design considerations
matter?
17. System Design & Load Balancing (Bonus)
Common algorithms and patterns for distributed systems
117. What is load balancing, and why is it important in distributed systems?
118. Round Robin: Assigns requests to servers in sequential order. What are its advantages and limitations?
119. Least Connections: Directs traffic to the server with the fewest active connections. When is it better than
Round Robin?
120. Weighted Round Robin: Assigns weights based on server capacity and distributes requests
proportionally. How does it work?
121. Weighted Least Connections: Combines Least Connections and Weighted Round Robin — directs to the
server with the lowest active-connections-to-weight ratio.
122. IP Hash: Determines the server based on source/destination IP, ensuring session persistence for a user.
When is this critical?
123. Least Response Time: Routes requests to the server with the lowest response time and fewest active
connections.
124. Random: Routes incoming requests to a randomly selected server from the available pool.
125. How do load balancers differ between Layer 4 (transport) and Layer 7 (application)?
126. What is sticky session (session affinity), and when is it needed?