Databricks Data Engineer Exam Notes
Databricks provides the 'Repair Run' feature to handle failures efficiently: it re-runs only the failed tasks in a workflow, together with any tasks that depend on them, instead of the whole job. This avoids the extra cost and time of re-running tasks that already succeeded.
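Repair Run can also be triggered programmatically. The following is a minimal sketch of the request body, assuming the Jobs API 2.1 repair endpoint and its run_id, rerun_tasks, and rerun_all_failed_tasks fields; the run ID and task key are hypothetical, and the field names should be checked against the current API reference before use.

```python
import json

# Assumed endpoint path for the Jobs API 2.1 repair call.
REPAIR_ENDPOINT = "/api/2.1/jobs/runs/repair"


def build_repair_request(run_id, failed_tasks=None):
    """Build the JSON body for a repair request.

    If specific task keys are given, only those are re-run;
    otherwise ask the service to re-run every failed task.
    """
    body = {"run_id": run_id}
    if failed_tasks:
        body["rerun_tasks"] = list(failed_tasks)
    else:
        body["rerun_all_failed_tasks"] = True
    return body


# Hypothetical run ID and task key, for illustration only.
payload = build_repair_request(1001, ["ingest_orders"])
print(REPAIR_ENDPOINT, json.dumps(payload))
```

The body would be sent as an authenticated POST to the workspace URL; authentication and HTTP handling are left out of the sketch.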
Unity Catalog is the centralized governance layer in Databricks, controlling access at the metastore, catalog, schema, and table/view/volume levels. For managed tables it manages both the metadata and the data files, so dropping a managed table removes both. For external tables it manages only the metadata, so dropping an external table removes the metadata but leaves the data files intact in their storage location.
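The managed/external distinction shows up in the DDL: an external table names a LOCATION, a managed table does not. Below is a small sketch that builds the SQL statements as strings; the catalog, schema, table, bucket, and group names are hypothetical, and the statements would be executed with spark.sql(...) in a workspace.

```python
def create_table_sql(full_name, columns, location=None):
    """Return CREATE TABLE DDL; passing a location makes the table external."""
    sql = f"CREATE TABLE {full_name} ({columns})"
    if location is not None:
        sql += f" LOCATION '{location}'"
    return sql


def grant_select_sql(full_name, principal):
    """Return a GRANT statement giving a principal read access to a table."""
    return f"GRANT SELECT ON TABLE {full_name} TO `{principal}`"


# Three-level names follow the catalog.schema.table hierarchy.
managed = create_table_sql("main.sales.orders", "id INT")
external = create_table_sql(
    "main.sales.orders_ext", "id INT", "s3://example-bucket/orders"
)
grant = grant_select_sql("main.sales.orders", "analysts")

for statement in (managed, external, grant):
    print(statement)
```

Running DROP TABLE on main.sales.orders would remove its data files as well, while dropping main.sales.orders_ext would leave the files under the bucket path in place.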
Delta Sharing provides secure, read-only access to data and does not require the recipient to have a Databricks account, which makes external collaboration easier. A provider can share tables, views, or individual partitions. For Databricks-to-Databricks sharing, the recipient gives the provider its sharing identifier; for open sharing, the provider issues the recipient a credential file. Either way, only the intended recipients can access the shared datasets.
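On the recipient side, a table shared through the open protocol can be read with the open-source delta-sharing Python connector. The sketch below assumes that connector is installed and that the provider has supplied a profile (credential) file; the profile path and the share, schema, and table names are hypothetical.

```python
def sharing_table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' address the connector expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path, share, schema, table):
    """Load a shared table into a pandas DataFrame (read-only access)."""
    # Imported lazily so the URL helper works without the connector installed.
    import delta_sharing  # third-party package: delta-sharing

    return delta_sharing.load_as_pandas(
        sharing_table_url(profile_path, share, schema, table)
    )


# Hypothetical names, for illustration only.
url = sharing_table_url("config.share", "partner_share", "sales", "orders")
print(url)
```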
The Medallion Architecture in Databricks is a layered approach in which data progresses from Bronze to Silver to Gold. The Bronze layer holds raw ingested data with minimal transformation and keeps audit and load information. The Silver layer cleans the data through deduplication, joins, and the application of business logic. The Gold layer shapes the data into business-level models, such as aggregations and star schemas, optimized for business intelligence.
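The three layers can be illustrated with a toy example. This sketch uses plain Python lists of dictionaries as a stand-in for Spark DataFrames and Delta tables; the column names, the file name, and the cleaning rules are assumptions chosen only to show the Bronze, Silver, and Gold roles.

```python
from collections import defaultdict
from datetime import datetime, timezone


def to_bronze(raw_rows, source):
    """Bronze: keep rows as-is, adding only audit/load metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [{**row, "_source": source, "_loaded_at": loaded_at} for row in raw_rows]


def to_silver(bronze_rows):
    """Silver: drop invalid rows, deduplicate on order_id, normalize types."""
    seen = set()
    silver = []
    for row in bronze_rows:
        if row.get("amount") is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": row["order_id"],
            "region": row["region"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(silver_rows):
    """Gold: business-level aggregate (total amount per region)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


raw = [
    {"order_id": 1, "region": " emea", "amount": "10.5"},
    {"order_id": 1, "region": " emea", "amount": "10.5"},  # duplicate
    {"order_id": 2, "region": "amer", "amount": None},     # invalid amount
    {"order_id": 3, "region": "EMEA", "amount": "4.5"},
]
gold = to_gold(to_silver(to_bronze(raw, "orders.csv")))
print(gold)
```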
Structured Streaming can ingest from Kafka continuously, once, or periodically, depending on the trigger: Trigger.ProcessingTime runs micro-batches at a fixed interval, while Trigger.Once() processes the available data and then stops (recent Spark releases recommend Trigger.AvailableNow for this one-time pattern). Auto Loader complements this with continuous file detection and schema inference, and supports both batch-style and streaming ingestion. Together, these tools give Databricks flexible and scalable ingestion.
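A sketch of how the trigger choice looks in PySpark follows. It assumes an existing SparkSession is passed in as spark; the broker address, topic, table, and checkpoint path are hypothetical. The one-time mode uses availableNow here in place of Trigger.Once(), which is a substitution made for this sketch.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for the Structured Streaming Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def trigger_kwargs(mode, interval="1 minute"):
    """Map an ingestion mode to DataStreamWriter.trigger() keyword arguments."""
    if mode == "periodic":
        return {"processingTime": interval}  # micro-batch every interval
    if mode == "once":
        return {"availableNow": True}        # process what is available, then stop
    raise ValueError(f"unknown mode: {mode!r}")


def start_kafka_ingest(spark, servers, topic, target_table, checkpoint, mode="periodic"):
    """Start a streaming write from Kafka into a table; returns the StreamingQuery."""
    stream = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(servers, topic))
        .load()
    )
    return (
        stream.selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "timestamp",
        )
        .writeStream.option("checkpointLocation", checkpoint)
        .trigger(**trigger_kwargs(mode))
        .toTable(target_table)
    )
```

The checkpoint location is what lets a stopped or one-time run resume from the last processed offsets on its next start.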
In Auto Loader, setting the cloudFiles.schemaEvolutionMode option to 'failOnNewColumns' enforces a strict schema: the stream fails when new columns appear in the source and does not restart until the schema is updated or the offending data is removed. This prevents unintended schema changes and keeps ingested data consistent with the predefined schema.
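In Auto Loader, failOnNewColumns is one of the values accepted by the cloudFiles.schemaEvolutionMode option rather than a boolean flag. The sketch below builds the reader options; the file format and paths are hypothetical, and spark is assumed to be an existing session in a Databricks runtime where the cloudFiles source is available.

```python
# Schema evolution modes accepted by Auto Loader.
EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def autoloader_options(file_format, schema_location, evolution_mode="failOnNewColumns"):
    """Build Auto Loader reader options with a chosen schema evolution mode."""
    if evolution_mode not in EVOLUTION_MODES:
        raise ValueError(f"unsupported schema evolution mode: {evolution_mode!r}")
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }


def read_with_autoloader(spark, source_path, options):
    """Create a streaming DataFrame over new files arriving under source_path."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Hypothetical schema-tracking path, for illustration only.
strict = autoloader_options("json", "/tmp/example/_schemas/orders")
print(strict)
```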
Delta Live Tables (DLT) is a declarative ETL framework that supports both batch and streaming pipelines and offers features such as data quality expectations and error handling. It suits complex ETL workflows that need stringent data quality checks. Auto Loader, by contrast, incrementally detects and loads new files, provides schema inference and evolution, and works in batch or streaming mode. It is better suited to simple, continuous file ingestion.
Delta Lake provides ACID transactions for data reliability, schema enforcement and evolution for data integrity, and time travel for access to historical versions. In addition, 'MERGE INTO' performs upserts, 'Z-Ordering' improves query performance by co-locating related column values in the same files, 'OPTIMIZE' compacts small files to improve read efficiency, and 'VACUUM' removes old, unreferenced files to keep storage under control.
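The commands above correspond to short SQL statements. The sketch below builds them as strings to be run with spark.sql(...); the table names, key column, and Z-order column are hypothetical, and the 168-hour retention is used here only as an example value (seven days).

```python
def upsert_sql(target, source, key):
    """MERGE INTO: update matching rows, insert the rest (an upsert)."""
    return (
        f"MERGE INTO {target} AS t USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def maintenance_sql(table, zorder_col, retain_hours=168):
    """OPTIMIZE with Z-ordering to compact files, then VACUUM old files."""
    return [
        f"OPTIMIZE {table} ZORDER BY ({zorder_col})",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


def time_travel_sql(table, version):
    """Query a table as of an earlier version."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


statements = [
    upsert_sql("main.sales.orders", "main.sales.orders_updates", "order_id"),
    *maintenance_sql("main.sales.orders", "customer_id"),
    time_travel_sql("main.sales.orders", 3),
]
for statement in statements:
    print(statement)
```

Time travel only reaches versions whose files have not yet been removed by VACUUM, so the retention window bounds how far back queries can go.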
Databricks provides audit logging that captures workspace events such as job executions, cluster events, and user actions in JSON format. These logs support monitoring and security auditing, giving insight into operational workflows and compliance.
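Because the events are JSON, they are straightforward to summarize. The sketch below counts events per service and action from JSON lines. The sample records are invented for illustration, and the field names (serviceName, actionName, userIdentity.email) and action values are assumptions to verify against the audit log schema of your workspace.

```python
import json
from collections import Counter

# Hypothetical sample records, one JSON object per line.
SAMPLE_LOG_LINES = [
    '{"serviceName": "jobs", "actionName": "runStart", "userIdentity": {"email": "alice@example.com"}}',
    '{"serviceName": "clusters", "actionName": "create", "userIdentity": {"email": "bob@example.com"}}',
    '{"serviceName": "jobs", "actionName": "runStart", "userIdentity": {"email": "bob@example.com"}}',
]


def summarize_events(lines):
    """Count audit events by (service, action)."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        counts[(event.get("serviceName"), event.get("actionName"))] += 1
    return counts


def users_for_service(lines, service):
    """Return the sorted set of user emails that triggered events in a service."""
    emails = set()
    for line in lines:
        event = json.loads(line)
        if event.get("serviceName") == service:
            emails.add(event.get("userIdentity", {}).get("email"))
    return sorted(emails)


summary = summarize_events(SAMPLE_LOG_LINES)
print(summary)
print(users_for_service(SAMPLE_LOG_LINES, "jobs"))
```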
Databricks offers several compute options for different tasks. All-Purpose Clusters are versatile and meant for interactive work, but they are expensive for short scheduled workloads. Job Clusters are cost-effective for scheduled jobs because they are created per job and terminate automatically when it finishes. Serverless SQL Warehouses offer auto-scaling compute that suits BI and frequent SQL workloads, with usage-based pricing for the compute consumed. Finally, Serverless Compute for Python suits high-frequency, small workloads because it scales automatically.