0% found this document useful (0 votes)
50 views43 pages

Efficient Data Processing in Python and SQL

The document outlines various techniques and best practices for handling large files, logging errors, optimizing Spark jobs, and ensuring data quality in Python and data pipelines. It discusses SQL queries for customer payment statuses, differences between data schemas, and the importance of data governance and security in banking. Additionally, it highlights experiences with ETL tools, automation of regulatory reporting, and strategies for maintaining scalable data workflows.

Uploaded by

saparparth26
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
50 views43 pages

Efficient Data Processing in Python and SQL

The document outlines various techniques and best practices for handling large files, logging errors, optimizing Spark jobs, and ensuring data quality in Python and data pipelines. It discusses SQL queries for customer payment statuses, differences between data schemas, and the importance of data governance and security in banking. Additionally, it highlights experiences with ETL tools, automation of regulatory reporting, and strategies for maintaining scalable data workflows.

Uploaded by

saparparth26
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

How do you handle large files in Python?

• I usually avoid loading the full file into memory. Instead, I use iterators or Pandas with a
chunk size parameter to process data in smaller batches. For streaming, I rely on
generators to yield data efficiently. This ensures the pipeline stays memory-efficient and
doesn’t crash.

• In one of my projects we had to process daily CSV dumps from a banking system, each
around 4–5 GB. Instead of loading the entire file in memory, I used Pandas with chunk
size to read in batches of 50k rows. This reduced memory pressure and allowed me to
validate data quality batch by batch.

How do you log pipeline errors in Python?

• In production, print statements aren’t reliable. In one project, I implemented Python’s


logging module with INFO for normal progress, WARNING for schema mismatches, and
ERROR for critical failures. Logs were shipped to Datadog, which helped us track failures
quickly. For example, once a schema change in Redshift broke the transformation step,
and the ERROR log immediately pinpointed the exact column mismatch.

Write a query to get customers with 3 missed EMI payments

• SELECT customer_id
FROM loan_repayments
WHERE status = 'MISSED'
GROUP BY customer_id
HAVING COUNT(*) >= 3;
• “We used a query like this in an NPA classification task. Customers with 3 or more missed
payments in a row had to be reported under RBI’s regulatory categories. This SQL was the
basis for downstream compliance dashboards that went into automated RBI
submissions.”

Difference between GROUP BY and PARTITION BY

• GROUP BY collapses rows — for example, total outstanding per branch. But PARTITION BY
keeps all rows and calculates aggregates per partition. In one case, I needed a running
balance per customer. With PARTITION BY, I could calculate it row-by-row without losing
individual transactions
Query for loans overdue > 90 days (NPA detection)

• SELECT loan_id,
customer_id FROM loans WHERE DATEDIFF(CURDATE(),
last_payment_date) > 90;
• “We applied this query in Basel reporting to categorize loans as NPAs if overdue > 90 days.
This flagged risky accounts early and fed into liquidity and capital adequacy calculations
for RBI compliance.”

How do you optimize Spark jobs?

• I optimize by partitioning large datasets, using broadcast joins for smaller lookup
tables, and caching DataFrames when reused. I also reduce shuffle overhead by
filtering early and tuning [Link]. These optimizations cut
processing time and resource costs.
• “In one migration project from Redshift to Delta Lake, a join operation on large datasets
was taking hours. I optimized it by broadcasting the smaller dimension table,
partitioning fact data by txn_date, and caching intermediate results. This reduced
shuffle size and cut the runtime by about 40%. I also tuned
[Link] from 200 to 50, which matched our dataset size
better.

How do you handle skewed data?

• Skew happens when a few keys dominate data distribution. I handle it with techniques
like key salting, repartitioning, or broadcasting smaller tables. This balances work
across executors and avoids stragglers.
• When processing campaign data, one region had 70% of the transactions, which
caused Spark shuffle skew. One executor was overloaded while others stayed idle. To
fix it, I used key salting by appending a random number to the region ID. This distributed
records evenly across partitions and prevented executor timeouts.”

What is a DAG in Airflow?

• A DAG (Directed Acyclic Graph) is the workflow definition in Airflow. It consists of tasks
with dependencies, where execution flows in one direction without cycles. Each DAG
represents an end-to-end pipeline, like extracting data, transforming it, and loading it
into a warehouse.

How do you enforce data quality in Spark pipelines?

• In our customer analytics pipeline, we had strict checks — like no null customer_id
and valid email formats. I implemented schema enforcement in Delta Lake and added
a PySpark filter to quarantine invalid rows into a separate table. Later, analysts could
review those rejected records. This ensured only clean, regulatory-compliant data
reached reporting tables

How do you manage failures in Airflow?

• I configure retries, SLA monitoring, and alerting via email or Slack. Failed tasks can be
retried individually without rerunning the whole DAG. Logs are available per task,
making it easier to debug issues.

How do ETL and ELT differ?

• ETL means extract → transform → load, where transformations happen before loading
into the warehouse. ELT means extract → load → transform, where raw data is loaded
first and transformations are done in the warehouse (e.g., with dbt). ELT is more
common in modern cloud setups.

Root cause analysis when a pipeline fails?

• I usually start with task logs. For example, one failure was due to a missing partition in
S3. Instead of rerunning the whole DAG, I identified the failed partition and reprocessed
just that slice. This minimized re-runs and met SLAs without reloading millions of rows
unnecessarily

Difference between Star and Snowflake schema?

• Star schema is denormalized, with a central fact table connected to dimension tables
— it’s faster for reporting. Snowflake schema normalizes dimensions further, reducing
redundancy but making queries more complex. For reporting teams, star schema is
generally preferred.
• for regulatory reporting, I prefer star schema because it’s easier for analysts and faster
for queries. For example, transactions as fact table, linked to dimensions like
customer, branch, product. Snowflake schema is normalized and reduces redundancy,
but compliance teams usually need quick aggregations, so star is more practical.”

What is metadata and lineage?

• Metadata is data about data — e.g., source, owner, last update time. Lineage is the
complete journey of data from source to destination, including transformations. Both
are critical in banking, where regulators require proof of how reported figures were
derived.
• Metadata describes data — like last updated, source system, owner. Lineage shows
the flow of data: from CBS tables → ETL transformations → RBI report. In practice, I used
lineage tools in Databricks Unity Catalog. When auditors asked how a capital adequacy
figure was derived, we could trace it back step by step to original transactions.”

What data quality checks do you implement?

• I apply checks for completeness (no missing values), accuracy (valid formats like PAN),
consistency (matching across systems), uniqueness (no duplicates), and timeliness
(fresh data). Failed records are quarantined for review.

What is data governance in banking?

• It’s the set of policies and controls ensuring data accuracy, consistency, ownership, and
compliance. In practice, it involves RBAC (role-based access), lineage, metadata
management, and version control for transformations.

How do you enforce security in pipelines?

• I use encryption at rest and in transit, apply masking for sensitive data like Aadhaar, and
restrict access using role-based policies. Tools like Unity Catalog in Databricks help
enforce fine-grained permissions.

Can you explain your approach to designing and maintaining scalable data pipelines for regulatory
reporting?

• Answer: My approach is to first understand reporting requirements, then design modular


ETL/ELT pipelines using PySpark for distributed transformations and dbt for business logic.
I orchestrate workflows with Airflow, embedding retries and SLAs. To ensure scalability, I
implement a Bronze–Silver–Gold Delta Lake architecture, partition data, and optimize
cluster resources. Data accuracy is enforced through dbt tests and reconciliation checks,
while monitoring is handled with Datadog and reporting in Power BI.
• Scenario: In the Chick-fil-A CMDP project, I ingested customer data from PostgreSQL,
marketing facts from Redshift, and real-time streams from AWS Kinesis into Databricks
Delta Lake. Using Airflow, dbt, and PySpark together, I built pipelines that scaled from daily
batch reporting to near-real-time analytics.

How do you ensure data accuracy and compliance when integrating data from multiple banking
systems?

• Answer: I apply source-to-target validation (row counts, checksums, schema consistency),


create reconciliation reports, and apply governance controls like PII masking, encryption
(TLS/KMS), and role-based access. Lineage and documentation are maintained in Unity
Catalog and dbt. I also align to frameworks like RBI/GDPR for compliance.
• Scenario: In CMDP, I merged Redshift and Kinesis data by applying dbt tests for schema
and accuracy checks. Sensitive customer fields were masked in Unity Catalog, ensuring
only anonymized datasets were available to analysts.

What steps would you take to automate a regulatory reporting process for RBI?

• Answer: I would capture RBI metrics via PySpark/dbt pipelines, orchestrated with Airflow
DAGs that have retries, SLAs, and alerts. I’d add automated reconciliation jobs, validation
reports, and publish final outputs securely in Power BI or similar BI tools with RBAC.
Documentation and lineage would be maintained in dbt + Unity Catalog.
• Scenario: In CMDP, I automated ingestion of PostgreSQL, Redshift, and Kinesis data
through Airflow, transformed with dbt, and created dashboards in Power BI. The same
framework can be adapted for RBI reporting with compliance gates.

Describe your experience with ETL tools or frameworks. Which one do you prefer and why?

• Answer: I’ve worked extensively with Apache Airflow for scheduling and orchestration. I
prefer Airflow because DAG-based workflows, retries, and cloud integration make it
scalable, reliable, and transparent.
• Scenario: In CMDP, I used Airflow to schedule PySpark transformations and dbt models
together, managing both batch and streaming pipelines seamlessly.

How do you ensure data lineage and traceability in your pipelines?

• Answer: I use Databricks Unity Catalog for dataset/column lineage and dbt documentation
for model traceability. Every transformation is version-controlled, and all flows are
auditable.
• Scenario: In CMDP, dbt automatically generated lineage of models, while Unity Catalog
documented transformations—allowing auditors and stakeholders to see the full data
journey.

What strategies do you use for optimizing performance and scalability of data workflows?

• Answer: I optimize through partitioning, Delta Lake file compaction, caching, broadcast
joins, and tuning Spark clusters. I also implement incremental models in dbt and monitor
jobs with Datadog.
• Scenario: In CMDP, I improved refresh efficiency by compacting small files and partitioning
large Redshift datasets in Delta, which cut runtime and stabilized pipelines.

Have you worked with ADF, CIMS, or Basel? If not, how would you adapt?

• Answer: I haven’t directly implemented them, but I’d adapt by studying RBI guidelines,
mapping regulatory fields to source systems, and applying my existing ETL/ELT expertise
with tools like PySpark, dbt, and Airflow.
• Scenario: In CMDP, I quickly learned and implemented Unity Catalog and dbt within
weeks, proving my ability to adopt new frameworks and deliver results fast.

How do you handle root cause analysis when a regulatory data issue arises?

• Answer: I trace lineage using Unity Catalog/dbt, check Airflow + Spark logs, and compare
source vs target. Once the issue is identified, I fix it, add regression tests, and update
documentation.
• Scenario: In CMDP, I diagnosed Airflow DAG failures caused by schema drift, updated dbt
models to standardize schemas, and documented the fix to avoid recurrence.
Explain your understanding of data governance and security best practices in financial services.

• Answer: Governance practices include role-based access, PII masking, encryption (at
rest/in transit), audit logging, and adherence to GDPR/RBI guidelines. I enforce lineage
documentation and retention policies.
• Scenario: In CMDP, I implemented fine-grained access controls in Unity Catalog so only
anonymized marketing datasets were accessible to analysts, while admins retained full
lineage visibility.

How do you plan to adapt to collaborating with teams onsite in Mumbai?

• Answer: I have worked in Agile environments using Jira and Confluence. Onsite
collaboration helps reduce communication gaps, speeds up decision-making, and builds
stronger team alignment.
• Scenario: In CMDP, close collaboration with product and marketing teams helped refine
dbt models quickly. Onsite, this would be even more effective for faster delivery.

Walk me through how you integrated PostgreSQL, Redshift, and Kinesis into Databricks.

• Answer: I designed ETL pipelines using PySpark for batch ingestion (PostgreSQL, Redshift)
and streaming ingestion via Kinesis. All data was ingested into Databricks Delta Lake with
Bronze–Silver–Gold layers. dbt handled business transformations, and Airflow scheduled
workflows.
• Scenario: In CMDP, this integration created a centralized Delta Lake that unified batch and
streaming data for customer engagement and campaign performance analytics.

What challenges did you face migrating datasets to Delta Lake, and how did you overcome them?

• Answer: Challenges included schema mismatches and small file issues. I resolved them
with standardized dbt models, Delta optimizations (Z-ordering, compaction), and cluster
tuning.
• Scenario: In CMDP, when migrating Redshift datasets, I standardized schemas using dbt,
applied Delta optimizations, and tuned Spark clusters—improving refresh speed and
stability.

How did you build and maintain ETL/ELT pipelines with PySpark, dbt Cloud, and Airflow?
• Answer: PySpark was used for distributed transformations, dbt for modular SQL models +
tests, and Airflow for orchestration with retries and monitoring. This ensured automation
and reliability.
• Scenario: In CMDP, this trio powered daily batch + real-time streaming pipelines, validated
outputs with dbt, and published trusted datasets to Power BI

Tell us about handling real-time streaming data with AWS Kinesis. What bottlenecks did you face?

• Answer: Bottlenecks included message lag and shard throughput limits. I optimized by
scaling shard capacity, enabling checkpointing, and monitoring with CloudWatch.
• Scenario: In CMDP, I tuned consumer parallelism and shard scaling, which reduced
latency and kept clickstream events flowing into Delta Lake reliably.

How did you manage governance and security with Unity Catalog?

• Answer: I applied fine-grained permissions, RBAC, and masking policies, while


documenting lineage for compliance.
• Scenario: In CMDP, analysts only accessed anonymized customer segments, while full
datasets were masked and governed via Unity Catalog.

How did you use Docker, and what value did it add?

• Answer: I containerized PySpark and dbt jobs to ensure environment consistency across
dev/test/prod, which improved CI/CD and reduced deployment issues.
• Scenario: In CMDP, Docker eliminated “works on my machine” issues and allowed faster,
more reliable deployments across environments.

How did you improve ETL pipeline stability and reduce failures?

• Answer: I added Airflow retries/SLAs, PySpark exception handling, dbt schema tests, and
Datadog monitoring dashboards.
• Scenario: In CMDP, these improvements reduced reruns, optimized refresh efficiency, and
improved pipeline stability—earning appreciation from stakeholders.

Describe the situation where you resolved a critical tool issue in 11 hours instead of 48.
• Answer: I prioritized debugging, collaborated with cross-functional teams, quickly
identified the root cause, and deployed a patch while documenting the fix.
• Scenario: In CMDP, an Airflow DAG issue was estimated to take 48 hours. By analyzing task
logs and fixing dependencies, I restored operations in just 11 hours.

Which cloud platform is most effective for banking analytics, and why?

• Answer: AWS is the most effective due to mature services like Redshift, Kinesis, and Glue.
Azure also stands out for compliance features, while GCP is strong in AI/ML. Choice
depends on org priorities.
• Scenario: In CMDP, we used AWS for streaming (Kinesis) and warehousing (Redshift) with
Databricks, providing cost-effective, scalable analytics—similar to banking requirements.

How have your certifications helped, and how will you leverage them in this role?

• Answer: My AWS, Azure, GCP, SQL, and Python certifications gave me a strong grasp of
cloud-native tools, pipeline design, and governance best practices. They help me quickly
adapt to different environments.
• Scenario: In CMDP, AWS fundamentals supported my use of Kinesis and Redshift, while
Azure knowledge helped with governance. I’ll leverage these certifications to ramp up
faster in regulatory data engineering.

Can you walk me through the Scrum sprint lifecycle and your role in it?

• Practical Answer: The sprint lifecycle starts with sprint planning, where we commit to
backlog items. During the sprint, we have daily stand-ups to track progress and address
blockers. At the end, we do a sprint review/demo for stakeholders, followed by a
retrospective to reflect and improve. My role as a Data Engineer is to break down stories
into technical tasks (like building a PySpark job, creating dbt models, or setting up Airflow
DAGs), deliver them within the sprint, and ensure proper testing and documentation.
• Scenario: In CMDP, during one sprint, I took the backlog story of migrating Redshift tables
to Delta Lake. I broke it into tasks: schema standardization, dbt transformation, and
Airflow scheduling. By sprint review, I demoed faster queries on Power BI dashboards, and
in the retrospective, we noted schema drift as a recurring blocker and added pre-sprint
schema alignment steps as a process improvement.
What are Parquet, Avro, and ORC formats, and how do they differ?

• Parquet & ORC → Columnar storage formats, optimized for analytics workloads. They
store data by columns, which makes aggregation and filtering much faster and more
efficient on big data.
• Avro → Row-based format, lightweight and schema-focused, best for streaming and
serialization use cases.
• Key differences:
o Parquet: widely used in Spark/Databricks, supports complex nested data, and is
good for read-heavy workloads.
o ORC: similar to Parquet but more common in Hive ecosystems, very efficient for
compression and predicate pushdown.
o Avro: better for write-heavy, streaming data and schema evolution (like Kafka).
o Scenario (CMDP):
In the Chick-fil-A CMDP project, we stored curated datasets in Delta Lake backed
by Parquet, because Parquet integrates seamlessly with Spark/Databricks and
provided strong compression + query performance for campaign analytics. For real-
time ingestion via AWS Kinesis, Avro could be used if the source needed schema
evolution, but since Delta/Parquet was the unified storage layer, it simplified
downstream BI reporting.

1. Can you explain the difference between INNER JOIN, LEFT JOIN, and FULL OUTER
JOIN?

Answer:

• INNER JOIN: Returns only matching rows from both tables.


• LEFT JOIN: Returns all rows from the left table, with NULLs if no match on the right.
• FULL OUTER JOIN: Returns all rows from both tables, filling NULLs where matches don’t exist.

Scenario (CMDP): While merging customer data (PostgreSQL) with campaign data (Redshift), I used a
LEFT JOIN so all customers appeared in the analytics, even if they didn’t engage in campaigns. For
reporting unmatched customers, I used FULL OUTER JOIN to identify gaps across systems.

2. What are Window Functions in SQL, and where have you used them?

Answer:
Window functions perform calculations across a set of rows related to the current row, without
collapsing the dataset. Common ones: ROW_NUMBER(), RANK(), LAG(), LEAD(), SUM() OVER().
Scenario (CMDP): I used ROW_NUMBER() to deduplicate customer engagement records ingested from
Kinesis streams. I also used LAG() to calculate the difference in campaign impressions day-over-day,
giving marketing teams a trend analysis.

3. Can you explain the difference between WHERE and HAVING in SQL?

Answer:

• WHERE filters rows before aggregation.


• HAVING filters rows after aggregation (on grouped data).
• Scenario (CMDP): In a query calculating campaign ROI per region, I used WHERE region =
'North America' to filter input rows, and HAVING SUM(spend) > 10000 to return only high-
spend regions.

4. How do you optimize SQL queries for performance?

Answer:

• Use appropriate indexes (clustered, composite).


• Partition large tables.
• Avoid SELECT *, only fetch needed columns.
• Push filters early (WHERE before joins).
• Use CTEs/materialized views for complex queries.

Scenario (CMDP): When querying Redshift fact tables with billions of rows, I optimized queries by
selecting only necessary fields and partitioning data on campaign_date. This reduced report refresh
times by ~40%.

5. What is a CTE (Common Table Expression), and why do you use it?

Answer:
CTEs are temporary result sets defined using WITH, useful for breaking down complex queries, improving
readability, and reusing intermediate results.

Scenario (CMDP): While generating multi-level campaign performance metrics, I used a CTE to first
calculate daily engagement, then another CTE to roll it up into weekly and monthly aggregates, before
joining it to marketing spend tables.
6. How do you detect duplicates in a SQL table?

Answer:
By grouping on key fields and using COUNT(*) > 1.

SELECT customer_id, COUNT(*)


FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

Scenario (CMDP): When ingesting customer events from Kinesis, I ran deduplication queries in
Databricks SQL using this pattern, then applied ROW_NUMBER() OVER(PARTITION BY … ORDER BY
event_time DESC) to retain the latest record only.

7. Explain the difference between UNION and UNION ALL.

Answer:

• UNION: Combines results and removes duplicates.


• UNION ALL: Combines results but keeps duplicates (faster since no deduplication).

Scenario (CMDP): While merging daily batch data from Redshift with real-time Kinesis events, I used
UNION ALL to combine both streams before applying deduplication logic with window functions in
Databricks.

8. What is the difference between OLTP and OLAP databases?

Answer:

• OLTP (Transactional): Optimized for inserts/updates (e.g., PostgreSQL, MySQL).


• OLAP (Analytical): Optimized for aggregations and large scans (e.g., Redshift, BigQuery).

Scenario (CMDP): PostgreSQL served as the transactional source system, while Redshift and Delta
Lake in Databricks acted as analytical systems for BI dashboards.
9. How do you handle slowly changing dimensions (SCD) in SQL?

Answer:

• SCD Type 1: Overwrite old values.


• SCD Type 2: Keep history with versioning (effective_date, expiry_date).

• Scenario (CMDP): For customer profile changes, I used SCD Type 2 in Delta Lake. Each update
created a new record with valid_from and valid_to fields, ensuring historical reporting for
marketing campaigns.

10. Can you explain the concept of indexing and how it affects queries?

Answer:
Indexes speed up queries by creating lookup structures, but they slow down inserts/updates since
indexes need updating. Proper indexing is key for balanced performance.

Scenario (CMDP): In PostgreSQL source systems, I added indexes on customer_id and campaign_id
fields to speed up extraction queries feeding the ETL pipelines. This reduced ingestion time significantly.

How do you find the second highest salary in a table?

Answer:

SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Scenario (CMDP): I used a similar query to find the second top campaign by spend from Redshift fact
tables for marketing analysis.

2. How do you get the top N records per group in SQL?

Answer: Use ROW_NUMBER() with PARTITION BY.

SELECT *
FROM (
SELECT campaign_id, customer_id, spend,
ROW_NUMBER() OVER(PARTITION BY campaign_id ORDER BY spend DESC) rn
FROM campaign_spend
) t
WHERE rn <= 3;

Scenario (CMDP): I applied this to get the top 3 customers by spend per campaign, which was used in
Power BI dashboards.

3. Explain the difference between DELETE, TRUNCATE, and DROP.

Answer:

• DELETE: Removes rows, can use WHERE, logged operation.


• TRUNCATE: Removes all rows, faster, minimal logging.
• DROP: Removes entire table structure + data.
Scenario (CMDP): While cleaning staging tables in Databricks SQL, I used TRUNCATE for
temporary loads to reset daily jobs.

4. How do you find NULL vs empty values in SQL?

Answer:

SELECT * FROM customers


WHERE email IS NULL OR email = '';

Scenario (CMDP): I checked NULL/blank values in customer contact info tables before sending records
into marketing pipelines.

5. How do you calculate running totals in SQL?

Answer:

SELECT customer_id, order_date,


SUM(spend) OVER(PARTITION BY customer_id ORDER BY order_date) AS
running_total
FROM campaign_orders;
Scenario (CMDP): Used this to track cumulative campaign spend per customer over time.

6. How do you remove duplicates but keep one record?

Answer:

SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY updated_at DESC)
rn
FROM customers
) t
WHERE rn = 1;

Scenario (CMDP): I deduplicated Kinesis streaming events by keeping the latest record per customer
session.

7. Explain correlated subqueries with an example.

Answer:
A subquery that depends on outer query values.

SELECT c.customer_id, [Link]


FROM customers c
WHERE [Link] > (
SELECT AVG(spend) FROM orders o WHERE o.customer_id = c.customer_id
);

Scenario (CMDP): I used correlated subqueries to find customers whose latest campaign spend was
above their personal average.

8. How do you handle slowly changing dimensions in SQL?

Answer: Use SCD Type 2 with versioning.

-- insert new row with updated valid_from and valid_to


Scenario (CMDP): For customer profile changes, I implemented SCD2 in Delta Lake, ensuring full
historical view for analytics.

9. How do you pivot data in SQL?

Answer:

SELECT customer_id,
SUM(CASE WHEN channel = 'Email' THEN spend END) AS email_spend,
SUM(CASE WHEN channel = 'SMS' THEN spend END) AS sms_spend
FROM campaign_data
GROUP BY customer_id;

Scenario (CMDP): I pivoted campaign data to show spend across Email, SMS, Social channels per
customer.

10. How do you find customers who never made a purchase?

Answer:

SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

Scenario (CMDP): I used this logic to identify inactive customers who registered but never engaged with
campaigns.

11. How do you calculate percentage contribution in SQL?

Answer:

SELECT campaign_id,
SUM(spend) AS total_spend,
SUM(spend) * 100.0 / SUM(SUM(spend)) OVER() AS pct_contribution
FROM campaign_spend
GROUP BY campaign_id;
Scenario (CMDP): I calculated each campaign’s spend contribution to total marketing budget.

12. How do you detect gaps in a sequence of dates?

Answer: Use LAG().

SELECT customer_id, order_date,


order_date - LAG(order_date) OVER(PARTITION BY customer_id ORDER BY
order_date) AS gap
FROM orders;

Scenario (CMDP): I used this to find inactive days in customer engagement streams.

13. How do you find the most frequent item per group?

Answer:

SELECT customer_id, channel


FROM (
SELECT customer_id, channel,
RANK() OVER(PARTITION BY customer_id ORDER BY COUNT(*) DESC) r
FROM campaign_data
GROUP BY customer_id, channel
) t
WHERE r = 1;

Scenario (CMDP): Used this to find most preferred campaign channel per customer.

14. What is the difference between EXISTS and IN?

Answer:

• EXISTS: Returns true if subquery returns rows, efficient for correlated queries.
• IN: Matches values from subquery results, less efficient for large lists.
Scenario (CMDP): Used EXISTS in PostgreSQL to quickly validate if customer IDs existed in
active campaign tables.
15. How do you aggregate data by week/month in SQL?

Answer:

SELECT DATE_TRUNC('month', order_date) AS month, SUM(spend)


FROM orders
GROUP BY DATE_TRUNC('month', order_date);

Scenario (CMDP): Used DATE_TRUNC to roll up campaign performance data into monthly reports.

16. How do you find customers with transactions in consecutive months?

Answer:

SELECT customer_id
FROM (
SELECT customer_id, DATE_TRUNC('month', order_date) m,
LAG(DATE_TRUNC('month', order_date)) OVER(PARTITION BY customer_id ORDER
BY m) prev_m
FROM orders
) t
WHERE prev_m + INTERVAL '1 month' = m;

Scenario (CMDP): I applied this to identify loyal customers active across consecutive months.

17. What’s the difference between a primary key and a unique key?

Answer:

• Primary Key: Uniquely identifies rows, only one per table, not nullable.
• Unique Key: Ensures uniqueness but allows multiple unique constraints and NULLs.
Scenario (CMDP): In PostgreSQL, customer_id was primary, while email was unique to avoid
duplicate registrations.
18. How do you handle large fact tables in SQL?

Answer: Partitioning, indexing, archiving old data, and using materialized views for frequent
aggregations.
Scenario (CMDP): Redshift fact tables were partitioned by campaign_date to improve query
performance and cost efficiency.

19. How do you implement incremental loading in SQL?

Answer: By filtering based on last modified timestamp or incremental IDs.

SELECT * FROM source


WHERE updated_at > (SELECT MAX(updated_at) FROM target);

Scenario (CMDP): I used incremental loads with updated_at fields for PostgreSQL tables into Delta
Lake, reducing load time and cluster costs.

20. What’s the difference between a view and a materialized view?

Answer:

• View: Virtual, runs the underlying query every time.


Materialized View: Stores results physically, faster but needs refresh.
Scenario (CMDP): In Redshift, I created materialized views for campaign aggregates to speed
up Power BI dashboards.

1. How do you handle large files in Python efficiently?

Answer: Use generators, iterators, and chunk processing instead of loading into memory at once.
Libraries like pandas.read_csv(chunksize=…) or pyarrow are effective.
Scenario (CMDP): For large CSV campaign logs, I processed them in chunks using Pandas before
loading into Delta Lake, avoiding memory crashes.

2. What’s the difference between a list, tuple, and set in Python?

Answer:

• List: Mutable, ordered, allows duplicates.


• Tuple: Immutable, ordered, allows duplicates.
• Set: Mutable, unordered, no duplicates.
Scenario: Used set() to quickly deduplicate customer IDs before pushing to Redshift.

3. How do you handle exceptions in Python?

Answer: Use try-except-finally.


Scenario (CMDP): Wrapped database connection code in try-except blocks to catch PostgreSQL
errors and trigger Airflow retries.

4. Explain Python decorators and their use in data engineering.

Answer: Decorators allow wrapping functions with extra behavior.


Scenario: Created a custom @log_execution_time decorator to measure ETL task runtime in PySpark
jobs.

5. How do you handle JSON data in Python?

Answer: Use the json module for parsing/serializing.


Scenario (CMDP): Parsed Kinesis event streams in JSON format into Python dicts before applying
PySpark transformations.

6. Difference between multiprocessing and multithreading in Python?

Answer:

• Multiprocessing: Multiple processes, bypasses GIL, better for CPU-bound tasks.


• Multithreading: Multiple threads, affected by GIL, better for I/O-bound tasks.
Scenario: Used multiprocessing in Python to parallelize Redshift extract jobs for campaign data.
7. How do you connect Python to PostgreSQL/Redshift?

Answer: Using psycopg2 or SQLAlchemy.


Scenario (CMDP): Used psycopg2 for PostgreSQL extraction scripts before ingesting data into Delta
Lake pipelines.

8. How do you handle schema evolution in Python ETL scripts?

Answer: Use dynamic schema mapping with dictionaries, enforce column existence, and apply defaults
for missing fields.
Scenario: Wrote a Python schema validation function that flagged missing columns when Redshift
tables evolved.

9. What’s the difference between shallow copy and deep copy in Python?

Answer:

• Shallow copy: Copies references, changes affect original.


• Deep copy: Copies objects recursively, independent of original.
Scenario: While caching ETL configs, I used deepcopy() to avoid overwriting base configs across
Airflow tasks.

10. How do you optimize Pandas performance for big data?

Answer:

• Use vectorized operations.


• Convert to categorical types.
• Offload to Dask/PySpark for large-scale.
Scenario (CMDP): Initially used Pandas for transformations but migrated heavy processing to
PySpark for scalability.

🔹 Advanced Airflow Interview Q&A (Parth Sapar)


1. How do you design Airflow DAGs for complex pipelines?

Deep Answer:
I design Airflow DAGs in a layered approach — ingestion (PostgreSQL, Redshift, Kinesis), staging,
transformation (PySpark/dbt), and serving (Delta/BI). Each DAG is modular, meaning extract, transform,
and load are separate operators, which makes retries and debugging easier. I also parameterize DAGs so
they can run across different environments (dev, test, prod) without code duplication.

I use features like:

• Task dependencies (>> and <<) to control order.


• TaskGroups for readability.
• XComs to pass metadata between tasks.
• SLA + retries + email/Slack alerts to detect failures.

Scenario (CMDP): In CMDP, I created a main DAG that extracted batch data from PostgreSQL and
Redshift, while a streaming DAG ingested Kinesis data in near real time. The DAG orchestrated PySpark
jobs on Databricks, dbt transformations, and finally published outputs into Delta Gold tables. This
modular setup allowed the DAG to scale from daily batch runs to near real-time streaming without re-
engineering.

2. How do you handle failures and retries in Airflow?

Deep Answer:
I use Airflow’s retry mechanism (retries, retry_delay) and custom error handling inside tasks. I
implement idempotent tasks so retries don’t duplicate loads (e.g., using MERGE in Delta or UPSERT in
PostgreSQL). For critical pipelines, I add on_failure_callback that sends alerts to Slack/Teams and
triggers fallback jobs if needed.

Scenario (CMDP): When PostgreSQL extracts occasionally failed due to DB locks, I added 3 retries with
exponential backoff. I also made extraction tasks idempotent by logging last_run_timestamp in
metadata tables. This way, retries only processed new data, not full reloads, keeping pipelines stable.

3. How do you manage dependencies between Airflow tasks?

Deep Answer:
I break dependencies into logical stages: extract → transform → load → validate. I use TaskGroups to
group related tasks and apply trigger rules (all_success, all_failed, all_done) depending on the
situation. For inter-DAG dependencies, I use ExternalTaskSensor.
Scenario (CMDP): In CMDP, the PostgreSQL extraction task had to complete before running PySpark
transformations. I used TaskGroups like extract_group >> transform_group >> load_group. For
dbt models, I used Airflow’s KubernetesPodOperator to run dbt Cloud jobs, which only triggered once
raw → Silver transformations succeeded.

4. How do you handle backfills in Airflow?

Deep Answer:
I use Airflow’s catchup=True for historical runs and parameterize DAGs with execution dates. To
optimize, I run backfills with smaller batches (daily partitions) instead of large full loads, so failures are
easier to isolate. I also implement data versioning in Delta (time travel) to avoid overwriting good data
during backfills.

Scenario (CMDP): When RBI required a backdated compliance report, I enabled catchup to reprocess
90 days of historical customer engagement data from Redshift and Kinesis. Instead of one heavy
backfill job, Airflow triggered daily partitioned jobs with PySpark, which ensured performance and
minimized cluster failures.

5. How do you integrate Airflow with PySpark and dbt?

Deep Answer:
For PySpark, I use DatabricksSubmitRunOperator or SparkSubmitOperator. For dbt, I either call dbt
CLI with BashOperator or trigger dbt Cloud jobs via API. This allows orchestration across Python-based
and SQL-based transformations.

Scenario (CMDP): In CMDP, Airflow triggered PySpark jobs on Databricks for heavy transformations
(joins, aggregations) and then triggered dbt Cloud jobs for SQL modeling (standardizing schemas,
business rules). This hybrid approach gave the flexibility of Spark’s distributed compute and dbt’s
lightweight SQL modeling.
6. How do you monitor Airflow pipelines?

Deep Answer:
I use Airflow UI for DAG monitoring, but for production, I integrate with Datadog and CloudWatch. DAG
failures trigger Slack alerts, and key metrics (DAG duration, success rate, queue wait time) are pushed to
dashboards. I also enable task-level logging for debugging.

Scenario (CMDP): I set up Datadog monitoring where pipeline failures or SLA misses triggered real-time
Slack alerts. For example, if Kinesis streaming jobs delayed beyond 5 minutes, an Airflow sensor flagged
it and triggered alerts, ensuring SLA compliance for near real-time dashboards.

7. How do you manage Airflow in containerized/cloud environments?

Deep Answer:
I use Docker to containerize Airflow with dependencies (PySpark, dbt, AWS SDK). For orchestration, I’ve
worked with Astronomer/Airflow on Kubernetes where each task can run in its own container, scaling
independently.

Scenario (CMDP): To avoid dependency conflicts, I Dockerized Airflow with PySpark and dbt installed.
The same container image was deployed in dev, test, and prod environments, ensuring consistency and
avoiding “works on my machine” issues.

8. How do you handle DAG scalability as pipelines grow?

Deep Answer:
I modularize DAGs, use TaskGroups, and break monolithic DAGs into smaller dependent DAGs
connected via ExternalTaskSensor. I also optimize scheduler performance with
CeleryExecutor/KubernetesExecutor for parallelism.

Scenario (CMDP): As the project grew, a single DAG managing PostgreSQL, Redshift, and Kinesis loads
became too large. I split it into three DAGs: ingestion DAG, transformation DAG, and reporting DAG, all
linked by ExternalTaskSensor. This reduced complexity, improved parallelism, and cut overall runtime.
9. How do you ensure data quality with Airflow?

Deep Answer:
I implement data validation tasks inside DAGs using SQL checks (row counts, null checks) or dbt tests.
If validation fails, the DAG marks downstream tasks as failed, preventing bad data propagation.

Scenario (CMDP): In CMDP, after PostgreSQL → Delta ingestion, I added a dbt test task in the Airflow
DAG. If row counts mismatched or primary keys failed uniqueness tests, the DAG failed early and
triggered alerts — preventing bad data from reaching Gold tables.

10. Can you explain a critical issue you solved using Airflow?

Deep Answer:
During CMDP, a PySpark job triggered by Airflow was taking 48+ hours due to skewed partitions in
Redshift extracts. The pipeline was at risk of SLA breach. I analyzed logs from Airflow and Databricks,
identified skew in joins, applied repartitioning and broadcast joins, and optimized cluster configs.

With Airflow orchestration, I rolled out the fix in 11 hours instead of 48. I also updated Airflow DAG
configs to auto-scale cluster sizes depending on load, preventing recurrence.

Data Engineering Deep-Dive Q&A (Resume-


Based)

1. How do you design and maintain scalable data pipelines for regulatory reporting?

Deep Answer:
When building regulatory pipelines, my first step is requirements gathering — understanding exact RBI
metrics, data sources, frequency, and reporting deadlines. I then define a multi-layer pipeline:

• Ingestion Layer (Bronze): Batch extractions from PostgreSQL/Redshift using Airflow DAGs with
JDBC hooks; near-real-time streams from AWS Kinesis ingested via PySpark Structured
Streaming.
• Transformation Layer (Silver): PySpark cleanses, normalizes datatypes (important for multi-
source compliance), and applies joins. dbt handles SQL-driven transformations (RBI rule-based
checks).
• Aggregation Layer (Gold): Final datasets prepared for reporting with strict data validation.
• Governance: Unity Catalog for lineage, role-based access, and audit logs; sensitive fields
masked.

To ensure scalability, I implement partition-aware loads (partitioning by business_date), Delta ACID


transactions to avoid corruption, and auto-scaling clusters in Databricks.

Scenario (CMDP): In CMDP, I built a regulatory-style pipeline where PostgreSQL customer data,
Redshift campaign spend, and Kinesis clickstreams were unified into Delta Lake. Airflow orchestrated
ingestion → PySpark transformations → dbt models → Power BI reports. This design allowed us to scale
from daily batch runs to near real-time dashboards, all while maintaining data quality and audit
readiness.

2. How do you ensure accuracy and compliance when integrating multiple banking
systems?

Deep Answer:
Accuracy in multi-source pipelines comes from validation at every stage:

• Source-to-target reconciliation (row counts, checksums).


• Business-rule validations (account balances must tally).
• dbt tests (not null, unique, referential integrity).

Compliance means governance-first design:

• Encryption: At rest (AWS KMS) + in transit (TLS).


• PII handling: Masking/anonymizing sensitive identifiers in dbt/Unity Catalog.
• Auditability: Every query logged, every transformation documented in metadata.

Scenario (CMDP): When integrating Redshift spend + Kinesis interactions, schemas didn’t align. I
standardized them in dbt, added reconciliation jobs that validated Redshift totals against Delta
aggregates, and masked customer IDs in Unity Catalog. This ensured compliance (no analyst had raw PII)
and accuracy (financial numbers matched source systems).
3. Describe a time you optimized a heavy ETL pipeline.

Deep Answer:
In CMDP, a PySpark ETL pipeline was lagging — processing took 48+ hours, missing SLA. On analysis, I
found data skew in joins and unoptimized partitioning.

Fixes I applied:

• Repartitioned data on campaign_date to evenly distribute workload.


• Applied broadcast joins for small lookup tables.
• Cached intermediate DataFrames reused across transformations.
• Increased cluster size dynamically using Databricks auto-scaling.

Result: Pipeline runtime dropped from 48 → 11 hours, with failures reduced by 35%.

4. How do you handle schema evolution in streaming/batch pipelines?

Deep Answer:
For streaming (Kinesis → Delta):

• I use schema-on-read in PySpark, with nullable fields for new columns.


• Maintain a schema registry in metadata tables to flag new fields.
• Downstream dbt models enforce column order/standardization.

For batch (PostgreSQL/Redshift):

• I apply dbt macros that map new fields into Silver tables while keeping old models stable.
• Backfills are run to populate historical values for new fields.

Scenario (CMDP): A new field campaign_device_type was added midstream in Kinesis JSON
payloads. I modified PySpark ingestion to dynamically include the new field (nullable), logged it in
metadata, and updated dbt models. This ensured dashboards didn’t break while analysts could
gradually adopt the new metric.

5. How do you implement lineage and governance?

Deep Answer:

• Unity Catalog: Full transformation lineage, role-based access, and masking for PII.
• dbt Docs: Automatically generated documentation showing model-level lineage.
• Metadata tables: Track ingestion runs, schema versions, and row counts.
Scenario (CMDP): When compliance required explaining how a KPI was derived, I used Unity Catalog
lineage graphs to trace back from Gold tables to PostgreSQL/Redshift/Kinesis sources. This
transparency made audits smooth and prevented compliance issues.

6. How do you balance ETL vs. ELT in your projects?

Deep Answer:

• ETL (Transform before Load): Used when raw data is noisy/heavy (PySpark cleaning before
landing into Delta).
• ELT (Load first, Transform later): Used when transformations are SQL-based (dbt models inside
Delta/Redshift).

Scenario (CMDP): I used ETL with PySpark for cleansing Kinesis clickstream data before storing in Delta
(removing duplicates, handling late arrivals). Then ELT with dbt applied business rules (marketing
attribution, KPI rollups) on Silver-to-Gold transitions.

7. How do you handle incremental vs. full loads?

Deep Answer:

• Incremental: Based on updated_at columns or CDC logs, managed via Airflow checkpoints.
• Full Loads: Only used for dimension tables or one-time backfills.
• Delta MERGE: To upsert incremental changes.

Scenario (CMDP): PostgreSQL sources used updated_at. Airflow DAGs logged last successful run in
metadata, and subsequent runs only pulled changed rows. This reduced load time and warehouse cost
by 60% compared to daily full loads.

8. Tell me about a critical incident you resolved in production.

Deep Answer:
During CMDP, a regulatory-style pipeline broke — dashboards were blank, and SLA for client reporting
was 24 hrs. Root cause: Redshift → Delta ingestion failed due to schema mismatch after upstream
schema change.

Steps I took:

1. Traced lineage in Unity Catalog → pinpointed failing task.


2. Hot-fixed PySpark ingestion with schema-on-read + defaults.
3. Re-ran Airflow DAG for impacted partitions.
4. Documented fix, and added schema validation alert for future.

Impact: Restored pipeline in 11 hours vs original 48-hr SLA, preventing client escalation.

9. How do you ensure observability of your pipelines?

Deep Answer:

• Airflow: SLA, retries, email/Slack alerts.


• Datadog: Custom metrics (DAG success rate, task duration).
• Delta logs: To check ingestion completeness.

Scenario (CMDP): I built reconciliation dashboards in Power BI showing “Expected vs Loaded records”
per source. Datadog alerts notified if Kinesis lag exceeded thresholds. This observability reduced
incident detection time by 70%.

10. How do you apply Agile/Scrum in Data Engineering projects?

Deep Answer:
I work in sprint cycles — breaking pipelines into user stories (e.g., PostgreSQL ingestion, Delta
transformations, reporting). Daily standups catch blockers, sprint reviews demo working pipelines, and
retros track improvements.

Scenario (CMDP): A sprint was dedicated to migrating PostgreSQL → Delta pipelines. Mid-sprint, schema
issues delayed work. I raised it in standups, coordinated with upstream teams, and reprioritized sprint
backlog to first deliver Redshift → Delta, ensuring continuous progress.

Kafka Q&A (Little Hands-On + Conceptual Clarity)


1. Have you worked with Kafka?

Answer:
Yes, I’ve worked with Kafka in a limited way — mainly for POC-level streaming ingestion pipelines. I’ve
set up producers/consumers, created topics, and integrated them with PySpark Structured Streaming
to land data into Delta. My main production work was with AWS Kinesis, but since the APIs and
streaming concepts are very similar, I was able to experiment with Kafka without much difficulty.

2. What concepts of Kafka are you most comfortable with?

Answer:
I understand Kafka’s core building blocks:

• Producers write events to topics.


• Topics are partitioned for parallelism and scalability.
• Consumers read data, and consumer groups allow load balancing.
• Brokers store and replicate messages for fault tolerance.

I also explored offsets for ensuring “exactly once” processing, and I experimented with Kafka + PySpark
integration using checkpointing.

3. How did you use Kafka in your work?

Answer:
I worked on a small-scale Kafka ingestion pipeline where we simulated customer transaction events.
Kafka producers pushed JSON events into a topic, PySpark consumed those events, applied basic
transformations, and wrote them into Delta Lake Bronze. From there, dbt models created reporting
tables.

It wasn’t a full production system like my Kinesis pipelines, but it helped me get hands-on with Kafka
topics, partitions, and consumer offsets.

4. What’s your approach to designing a Kafka-based pipeline?

Answer:

• Producers send events into Kafka topics.


• Kafka topics partition the data for scalability.
• Consumers (PySpark/Structured Streaming) read from topics with checkpointing for offset
tracking.
• Delta/Parquet stores events in Bronze.
• Airflow orchestrates batch + streaming tasks.

Mini-Scenario: In my Kafka POC, I created a DAG that triggered PySpark jobs consuming events from
Kafka, landed them in Delta Bronze, then ran dbt transformations to Silver. The flow was almost identical
to what I built with Kinesis in CMDP.

5. How do you ensure fault tolerance in Kafka pipelines?

Answer:

• Use replication across brokers so data isn’t lost.


• Configure acknowledgements for producers (acks=all).
• Consumers use checkpointing to track offsets.
• Add monitoring to track lag between producers and consumers.

I didn’t manage Kafka infra directly, but I do understand how these configurations ensure durability and
recovery.

6. What’s the difference between Kafka and Kinesis in your experience?

Answer:

• Kafka is open-source and flexible, but requires setup/management (unless using Confluent
Cloud).
• Kinesis is fully managed by AWS, easier to set up, but has shard-based scaling and some
retention limits.
• In both, the design pattern is the same — producers → stream → consumers → storage.

Since I worked more with Kinesis in CMDP, I saw how to deal with lag, scaling, and checkpoints. I know
Kafka requires similar strategies, but gives more fine-grained control.
7. If your team asked you to switch from Kinesis to Kafka tomorrow, how would you
adapt?

Answer:
I’d leverage my existing PySpark Structured Streaming experience, since PySpark supports both
Kafka and Kinesis. The adaptation would involve:

• Subscribing to Kafka topics instead of Kinesis streams.


• Managing offsets in Kafka instead of DynamoDB.
• Using Kafka monitoring tools (Kafka Manager, Confluent Control Center) instead of CloudWatch.

Because I’ve already worked with real-time ingestion, lag management, and schema handling in
Kinesis, I’d transfer that directly to Kafka and ramp up quickly.

🔹 Deep Amazon Kinesis Interview Q&A


(Resume-Based)
1. How exactly did you use Kinesis in your project?

Deep Answer:
In CMDP, Kinesis Data Streams was the real-time ingestion backbone. Producers published customer
clickstream and app interaction events in JSON format. My PySpark Structured Streaming jobs
consumed the streams, validated schema, and performed deduplication before writing into Delta
Bronze with checkpointing in S3 for fault tolerance.

From there, I used dbt to model Silver and Gold layers for campaign attribution and customer journey
analysis. Airflow orchestrated both the batch jobs from PostgreSQL/Redshift and the streaming
consumers from Kinesis, creating a hybrid pipeline that supported daily aggregates and near real-time
dashboards.

2. What was the toughest issue you faced with Kinesis and how did you solve it?

Deep Answer:
The biggest challenge was consumer lag during marketing campaigns — when traffic spiked, events
piled up faster than consumers could process. Initially, our Spark jobs were hitting throughput limits
because we had under-provisioned shards.

I solved this in two steps:

1. Scaled shards dynamically — used CloudWatch to monitor IteratorAge and doubled shards
when lag crossed thresholds.
2. Optimized consumers — increased parallelism in PySpark Structured Streaming, repartitioned
by customer_id to balance load, and enabled checkpointing with sequence numbers to avoid
replays.

This brought down lag from 10+ minutes to under 1 minute, which was critical for real-time campaign
insights.

3. How do you ensure exactly-once processing with Kinesis?

Deep Answer:
Kinesis by default guarantees at-least-once delivery, so exactly-once requires design patterns:

• Idempotent writes in Spark → I used primary keys and merge operations in Delta to prevent
duplicates.
• Checkpointing → maintained consumer offsets in S3 to avoid reprocessing after restarts.
• Schema validation → invalid events quarantined into an error bucket instead of silently failing.
Scenario (CMDP): During one restart, duplicates inflated KPIs. By moving to Delta MERGE with event_id
as the key, I enforced idempotent ingestion, guaranteeing accuracy even if Spark retried events.

4. How do you monitor and optimize Kinesis pipelines?

Deep Answer:

• CloudWatch Metrics: Monitored IncomingBytes, [Link],


and WriteProvisionedThroughputExceeded.
• Auto-scaling shards: Used these metrics to adjust shard count dynamically.
• Enhanced Fan-Out: Enabled for multiple consumer groups to avoid competition.
• Datadog Integration: To visualize lag trends and trigger alerts on thresholds.

Scenario (CMDP): When multiple consumers (analytics + ML team) subscribed to the same stream,
throughput degraded. I enabled Enhanced Fan-Out, giving each consumer its own dedicated read pipe,
stabilizing performance.

5. How do you handle schema drift in Kinesis streams?

Deep Answer:
Since event schemas evolved over time (new fields in JSON), I used schema-on-read with PySpark and
kept new fields nullable to avoid breaking existing jobs. Then I standardized schema downstream with
dbt models.

Scenario: In CMDP, when marketing added a new field campaign_variant, initial jobs failed. I updated
the consumer logic to dynamically detect new fields, store them as null if missing, and added dbt
transformations later. This avoided downtime while still accommodating schema evolution.

6. Why did you use Kinesis Data Streams instead of Firehose?

Deep Answer:
We chose Kinesis Data Streams because we needed real-time processing + custom transformations
in PySpark. Firehose only supports simple delivery into S3/Redshift with limited transformation (via
Lambda).

Scenario: In CMDP, we needed to join clickstream events with campaign metadata in near real-time,
which required Spark transformations. That was only possible with Data Streams. Firehose wouldn’t
have supported this use case.
7. How would you design a real-time regulatory reporting system with Kinesis?

Deep Answer:

• Kinesis Data Streams → ingest transactions and compliance events in real time.
• PySpark Structured Streaming → validate transactions, enrich with customer metadata,
quarantine bad records.
• Delta Bronze/Silver/Gold → ensure auditability and reproducibility.
• Airflow DAGs → orchestrate daily aggregation jobs for RBI reports, plus reconciliation jobs.
• Unity Catalog → enforce governance and provide lineage.
• Power BI dashboards → provide regulators near-live visibility.

This design gives both streaming (real-time checks) and batch (daily/weekly compliance reports) in
one system.

8. If asked to switch from Kinesis to Kafka, how would you handle it?

Deep Answer:
Since my PySpark pipelines were built on Structured Streaming APIs, the core logic doesn’t change —
only the connector. The migration steps would be:

1. Replace Kinesis connector with Kafka connector in PySpark.


2. Replace DynamoDB checkpointing with Kafka offset checkpointing.
3. Monitor lag with Kafka Manager/Confluent instead of CloudWatch.

Because I already solved challenges like lag, schema drift, and deduplication in Kinesis, I can transfer
the same patterns to Kafka.
Deep AWS Interview Questions & Answers
(Resume-Based)

1. What AWS services did you use in your projects, and how did they fit together?

Deep Answer:
In CMDP, I built a hybrid pipeline that combined real-time (Kinesis) and batch (Redshift/PostgreSQL)
into Databricks Delta.

• Kinesis Data Streams → ingested customer clickstream data in real time.


• S3 → acted as the data lake, hosting raw Bronze data, error quarantine buckets, and versioned
backups.
• Redshift → served as a data warehouse for marketing spend and attribution datasets.
• Databricks (on top of S3) → unified batch + streaming data into Delta tables.
• Airflow on EC2 (Dockerized) → orchestrated both batch loads and streaming consumers.
• IAM + KMS → handled security, with least-privilege roles and encryption at rest/in transit.
• CloudWatch + Datadog → provided end-to-end monitoring (shard lag, DAG SLAs, PySpark job
runtimes).

This design let us move from daily batch-only reports to real-time campaign performance insights
without disrupting existing systems.

2. How do you handle security and compliance in AWS pipelines?

Deep Answer:
Security was a non-negotiable part of pipeline design. My approach was layered:
• IAM: Created role-based access for services (e.g., one IAM role for Kinesis → S3 writes, another for
Databricks → S3 reads). No cross-service over-permissions.
• Encryption: Enabled SSE-KMS for S3 buckets, TLS for in-transit traffic, and encrypted Redshift
clusters.
• Data Masking: Applied PII masking at the Silver layer in dbt (hashing customer IDs). Analysts only
had access to anonymized Gold views.
• Audit & Governance: Enabled CloudTrail + Unity Catalog for lineage and audit logs.

Scenario (CMDP): Since clickstream events had user identifiers, I made sure only anonymized IDs
flowed into reporting dashboards. Compliance checks were logged, and access to raw Bronze data was
restricted to only the data engineering team.

3. How do you monitor and troubleshoot pipelines in AWS?

Deep Answer:
I combined CloudWatch for infra metrics and Datadog for end-to-end observability.

• Kinesis: CloudWatch → IteratorAge (consumer lag),


WriteProvisionedThroughputExceeded (producer issues).
• Redshift: Monitored query queueing, disk usage, and WLM queue latency.
• Airflow on EC2: DAG SLA misses + retry counts, logged into Datadog.
• Datadog Dashboards: Aggregated pipeline metrics (Kinesis lag, Spark job durations, row counts)
for one-glance visibility.

Scenario: During a peak campaign launch, Kinesis lag spiked. CloudWatch revealed shard utilization >
80%. I scaled shards from 4 → 8 and parallelized Spark consumers, which reduced lag back under SLA.
Datadog dashboards confirmed stability.

4. How do you optimize Redshift performance at scale?

Deep Answer:
I followed a “push-down + tune” approach:

• Schema Design: Used sort keys on event_date and distribution keys on campaign_id to avoid
large shuffles.
• Compression: Applied columnar encoding (AZ64 for numerics, LZO for strings).
• Maintenance: Scheduled VACUUM and ANALYZE nightly to reclaim space and refresh stats.
• Push-down: Offloaded heavy joins to PySpark on Databricks → only aggregated/cleaned data
went into Redshift.
Scenario: In CMDP, attribution queries on campaign performance slowed from seconds → minutes as
data grew. After redesigning sort/distribution keys and moving a wide join into Spark, query runtime
dropped from 3 minutes → 18 seconds, keeping BI dashboards responsive.

5. Why did you use Kinesis over Glue or Firehose?

Deep Answer:

• Glue: Strong for ETL, but it’s batch-oriented. We needed real-time ingestion.
• Firehose: Easy to auto-load to S3/Redshift, but it lacks flexibility for complex transformations
(joins, deduplication, schema evolution).
• Kinesis Data Streams: Gave us millisecond latency, consumer control, checkpointing, and direct
PySpark integration.

Scenario: In CMDP, we had to join clickstream events with campaign metadata in real time to
measure engagement effectiveness. Firehose couldn’t do this, but Kinesis Data Streams + PySpark gave
us full control.

6. How do you use S3 effectively in a data pipeline?

Deep Answer:

• Bronze Layer: Raw JSON events from Kinesis + batch extracts from Redshift/Postgres.
• Silver Layer: Cleaned + standardized Parquet/Delta tables, partitioned by event_date.
• Gold Layer: Aggregates for business KPIs (campaign performance, engagement).
• Error Bucket: Stored invalid JSON or schema-mismatched records.
• Lifecycle Policies: Archived data > 90 days into Glacier, saving storage costs.

Scenario: When invalid clickstream events caused PySpark job failures, I wrote them into an S3
quarantine bucket. Later, I replayed only valid records from Bronze into Silver. This avoided data loss
while ensuring resilience.

7. How do you optimize for cost in AWS pipelines?

Deep Answer:

• Kinesis: Auto-scaled shards only during traffic peaks (e.g., campaign launches).
• Redshift: Enabled concurrency scaling, paused clusters during non-business hours.
• S3: Implemented lifecycle rules → after 90 days → Glacier.
• EC2/Airflow: Deployed on spot instances where workloads were retry-safe.

Scenario: In CMDP, pausing Redshift outside US business hours saved ~30% monthly costs. Similarly,
moving rarely used logs to Glacier reduced S3 bills without affecting reporting.

8. How would you design an RBI regulatory reporting system on AWS?

Deep Answer:

• Ingestion:
o Kinesis Data Streams → for real-time transaction feeds.
o S3 Batch Ingests → for daily banking data.
• Processing:
o Databricks on EMR/S3 → PySpark jobs for validation, enrichment, deduplication.
o dbt → regulatory transformations (exposure limits, daily balances).
• Storage:
o Delta Lake Bronze/Silver/Gold → ACID, partitioned, audit-friendly.
o Redshift → curated layer for reporting queries.
• Orchestration:
o MWAA (Airflow on AWS) → DAGs for batch + reconciliation jobs.
• Security:
o IAM roles → least privilege, per-service access.
o KMS + TLS → encryption.
o CloudTrail + Unity Catalog → lineage, audit logs.
• Delivery:
o Power BI / QuickSight dashboards → controlled access for compliance officers.

This ensures real-time monitoring for anomalies + daily RBI-compliant submissions, with full
traceability and governance.

🔹 Deep Cloud Data Engineering Q&A


1. How do you design a cloud-native data pipeline that is both scalable and reliable?

Deep Answer:
I start with a layered architecture (Bronze, Silver, Gold) on a cloud data lake (S3/Delta).

• Scalability: Partitioning by time or business keys, Spark auto-scaling clusters, Kinesis shard
scaling for streaming, and Redshift concurrency scaling for BI queries.
• Reliability: Orchestration with Airflow/MWAA, retries with exponential backoff, idempotent tasks,
schema validation via dbt, and quarantine buckets for bad data.
• Observability: CloudWatch + Datadog dashboards, SLA alerts, lineage via Unity Catalog.

Scenario (CMDP): In CMDP, when ingesting millions of clickstream events from Kinesis, I scaled
consumers horizontally and used checkpointing for fault tolerance. At the same time, Airflow ensured
retries on failed batch jobs, while dbt tests validated KPIs before dashboards refreshed. This allowed
near real-time insights with 99.9% reliability.

2. How do you ensure data quality in a cloud pipeline?

Deep Answer:
I embed quality at every layer:

• Ingestion: Record counts, duplicate detection, schema validation.


• Processing: dbt tests (null checks, referential integrity, threshold validations).
• Storage: Delta Lake ACID guarantees, time travel for rollback, quarantine buckets for invalid
records.
• Monitoring: Automated reconciliation jobs compare source totals with post-ETL outputs.

Scenario: In CMDP, when integrating Redshift marketing data with Kinesis clickstream events, I used dbt
to validate campaign IDs (unique, non-null). Any invalid events were sent to an S3 quarantine bucket with
Airflow alerts. This avoided downstream corruption while maintaining SLA compliance.

3. What’s your approach to handling real-time vs batch workloads in cloud


environments?

Deep Answer:
I design separate ingestion strategies but unify downstream in Delta Lake:

• Batch: JDBC pulls from PostgreSQL/Redshift, Airflow scheduled hourly/daily.


• Streaming: Kinesis Data Streams + PySpark Structured Streaming with checkpointing.
• Convergence: Both land in Bronze → cleaned in Silver → aggregated KPIs in Gold.
Scenario: In CMDP, batch-loaded marketing spend from Redshift was joined with real-time customer
engagement events from Kinesis. This allowed BI dashboards to update within minutes, balancing cost
(batch jobs) and agility (streaming).

4. How do you secure cloud pipelines for financial or regulatory workloads?

Deep Answer:
Security is layered:

• IAM: Least-privilege roles (separate for ingestion, processing, analytics).


• Encryption: TLS for in-transit, SSE-KMS for S3/Redshift at rest.
• PII Masking: Hashing/anonymization in Silver layer; Gold views restricted.
• Auditability: CloudTrail + Unity Catalog for lineage + access logging.
• Network Controls: Private subnets, VPC endpoints for S3/Kinesis.

Scenario: In CMDP, clickstream contained sensitive identifiers. I masked customer IDs with SHA-256 in
dbt before analysts accessed dashboards. IAM restricted raw Bronze data to engineering only. Unity
Catalog + CloudTrail logged every access event for compliance.

5. How do you optimize performance and cost in cloud data engineering?

Deep Answer:

• Storage: Use Parquet/Delta with partitioning, lifecycle policies to move cold data to Glacier.
• Compute: Auto-scaling Databricks clusters, Redshift pause/resume, spot instances for Airflow
tasks.
• Data Transfer: Keep workloads in same region/VPC to avoid cross-region charges.
• Query Optimization: Push heavy joins to Spark, pre-aggregate for BI dashboards.

Scenario: In CMDP, moving log data >90 days old from S3 → Glacier cut storage bills by ~30%. Query
runtime on campaign KPIs dropped from 3 min → 18 sec after partitioning Delta tables by event_date
and optimizing Redshift sort keys.

6. How do you troubleshoot cloud pipeline failures?

Deep Answer:

1. Check orchestration logs (Airflow/MWAA): Which task failed, dependency chain.


2. Inspect source/target mismatch: Row counts, schema drift, nulls.
3. Validate infra metrics: Kinesis lag, Spark executor failures, Redshift WLM queues.
4. Apply fix + preventive measure: Retry logic, schema contracts, shard scaling.

Scenario: In CMDP, a spike in campaign traffic caused Kinesis shard lag. CloudWatch showed
IteratorAge > SLA. I scaled shards from 4 → 8, parallelized consumers in Spark, and updated Airflow
DAG retries. This brought the pipeline back within SLA without data loss.

7. How do you handle schema evolution in cloud pipelines?

Deep Answer:

• Schema-on-read (Delta/Parquet): Use auto schema evolution (mergeSchema).


• Contracts (dbt): Validate expected schema fields before transformations.
• Fallback: Invalid records go to S3 quarantine.
• Alerts: Airflow notifies when schema drift detected.

Scenario: When device_type was added in Kinesis events, Delta handled schema evolution. dbt
validated the field, while older consumers continued unaffected. This allowed smooth adoption without
downtime.

8. How do you design cloud pipelines for regulatory reporting (e.g., RBI, Basel)?

Deep Answer:

• Ingestion: Kinesis (real-time transactions), S3 batch loads.


• Processing: PySpark/dbt to calculate exposure, balances, thresholds.
• Storage: Delta Lake (ACID, audit-friendly), Redshift for reporting queries.
• Orchestration: Airflow with SLAs, retries, alerts.
• Governance: IAM + KMS + Unity Catalog for compliance, lineage, and audit.
• Delivery: Power BI/QuickSight dashboards with restricted access.

Scenario: My CMDP design pattern is reusable — replace campaign KPIs with RBI regulatory metrics. By
embedding reconciliation jobs + audit logs, compliance teams can trace every metric back to its raw
source.
9. How do you handle multi-cloud or hybrid pipelines?

Deep Answer:

• Neutral frameworks: dbt for SQL, PySpark for compute, Airflow for orchestration.
• Storage sync: S3 ↔ ADLS ↔ GCS via secure connectors.
• Identity: Use federated identity/service accounts across clouds.
• Latency-sensitive: Keep real-time close to source (AWS), batch reporting can be cross-cloud.

Scenario: CMDP used AWS (Redshift, Kinesis, S3) but integrated with Azure Power BI. I exported curated
Gold datasets into Blob securely, avoiding full duplication. This hybrid model gave flexibility while staying
compliant.

10. How do you manage governance and lineage at scale?

Deep Answer:

• Unity Catalog: Central governance for Delta + access controls.


• dbt Docs: Auto lineage diagrams for SQL models.
• Glue Data Catalog: Metadata + schema registry.
• Audit Logs: CloudTrail + Datadog integrated for access visibility.

Scenario: In CMDP, every dbt model had lineage tracked, and Unity Catalog governed Delta tables.
When a campaign KPI was challenged, compliance teams could backtrack transformations to Bronze in
minutes.

Common questions

Powered by AI

Schema evolution in cloud-based data pipelines is managed through schema-on-read strategies that allow systems to evolve as new data arrives. In streaming contexts, technologies like Delta Lake handle evolution by accommodating changes in schema without interrupting data flow, utilizing metadata tracking for newly introduced fields. For batch processes, schema validation and contracts enforced through tools like dbt ensure upstream compatibility before data transformations. Challenges in schema evolution include managing backward compatibility, maintaining data integrity in the face of structural changes, and mitigating the risk of processing errors from unexpected schema drift. Continuous monitoring and rapid adaptation mechanisms are essential to address these issues effectively .

Implementing SCD in data warehousing involves capturing and preserving data changes over time. Type 1 changes overwrite existing records with new data, allowing easy maintenance but at the cost of losing historical context. This is suitable for dimensions where only the latest value is needed. In contrast, Type 2 maintains historical data by creating new rows for changes with version control, enabling full historical traceability at the cost of increased storage and complexity in managing additional metadata like effective dates. The choice between Type 1 and Type 2 depends on the business requirements for data tracking and reporting needs. For example, in a customer profile context, Type 2 is preferred for retaining history of changes for detailed analytics and reporting .

Ensuring data accuracy and compliance in multi-source data integration involves a detailed validation and governance framework. Common techniques include source-to-target reconciliation to verify completeness, business rule validations for accuracy (ensuring consistent balances), and ongoing integrity checks like those in dbt tests. Compliance is further ensured through encryption protocols, policy-driven data anonymization, and maintaining auditability via logging. This process is often augmented with automated reconciliation jobs, metadata repositories for transformation tracking, and strict access controls via roles. By implementing these techniques, systems can maintain the necessary standards of accuracy and compliance required in regulatory environments .

Self-contained ingestion and processing strategies in cloud-native data pipelines are vital to meet diverse workload requirements, providing flexibility and resilience. Batch processing often leverages scheduled JDBC pulls for periodic data ingestion, while streaming workloads utilize platforms like Kinesis for real-time data collection, requiring robust concurrency and failover strategies. These self-contained designs ensure processing consistency, adapt to workload variation, and support concurrency without cross-impact. Such strategies enable seamless integration of batch and real-time analytics in unified storage layers, enhancing reliability and insights generation while controlling costs across different scales and data velocity requirements .

The choice between OLTP and OLAP systems centers on the nature of the business applications. OLTP systems are optimized for transactional workloads involving numerous short online transactions, suitable for applications requiring high concurrency and quick, reliable insert/update operations, like order processing. On the other hand, OLAP systems are tailored for performing complex queries and data analysis over large volumes, making them ideal for decision support, such as BI reporting and data mining. A company’s database technology choice hinges on these operational needs, trading off transaction speed against analytical capacity. Solutions often require integrating both system types within their architectures to cover comprehensive operational and analytical requirements .

CTEs and temporary tables both enable handling intermediate results, but they serve different purposes. CTEs, defined using the WITH clause, are ideal for improving query readability and structuring complex queries by breaking down tasks into manageable parts. They are preferred for situations requiring composability and reused logic within a single query's scope. In contrast, temporary tables involve creating physical structures that persist beyond the immediate query execution but within the session. They are useful when multiple queries need to access intermediary data or when performance optimization is needed at the cost of additional resource overhead .

UNION ALL offers performance benefits over UNION by combining query results without removing duplicates, thus executing faster as it avoids the additional step of deduplication. This is particularly advantageous when working with large datasets where duplicates are acceptable or when further deduplication logic will be applied later. However, the trade-off is that UNION ALL may lead to larger result sets and potential data inconsistency if duplicates are undesired, requiring subsequent handling of duplicates through complex SQL logic or downstream processing .

Indexing creates data structures that enable faster retrieval of rows from a database, crucial for optimizing read-heavy operations in transactional systems. By allowing quick lookup of key values, indexes significantly enhance the performance of queries that involve SELECT operations. However, this improvement comes with the trade-off of potentially slowing down write operations, such as inserts or updates, because the index itself must be updated to reflect data changes. Properly managing indexes involves balancing the performance gain for reads against the overhead incurred during writes, often requiring thoughtful analysis of the database workload and query patterns .

Partitioning large fact tables by campaign_date improves SQL query performance by reducing the amount of data scanned during query execution. This is particularly beneficial for time-based queries, as only the relevant partitions need to be accessed, leading to faster execution times and lower computational costs. Additionally, partitioning can aid in more efficient data management and purging, as partitions can be archived or dropped individually. However, improper partitioning may lead to data skew and inefficient storage if not aligned with the most frequently used query patterns .

Window functions provide an advanced method for calculating running totals by allowing each row to retain access to the complete result set's calculation context, unlike traditional aggregate functions that collapse rows. This gives the flexibility to calculate cumulative sums dynamically by partitioning and ordering sets of rows within a window frame. Although more computationally intensive, window functions prevent the need for self-joins or subqueries typically employed with traditional aggregates, simplifying and optimizing calculation of metrics such as running totals. This functionality is valuable for analyses requiring these cumulative insights while maintaining row granularity .

You might also like