Snowflake Snowpro Core Ebook
Snowflake Snowpro Core Ebook
course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]
Snowflake Data Cloud Features and Architecture: - Snowflake optimizer determines the most efficient way to execute a query.
What is Snowflake? - Metadata Manager manages the metadata associated with data stored in Snowflake. Schema: A schema is a logical grouping of database objects (tables, views, etc.). Each
• Cloud based data platform (AWS, GCP, Azure Supported) - Snowflake Security includes encryption, network isolation, and other security features schema belongs to a single database
• Not built from old db tech to protect data.
• Cloud native Table Types :
• Separates Compute from Storage
• It is SaaS (Min upfront cost, Pay-as-you-go, fully managed Infra) Snowflake has multiple editions to choose from:
Traditional Architecture:
Shared Disk Architecture (ex: Oracle Real Application Clusters)
- High availability/reliability, easy data sharing b/w nodes, dynamic load balance
- Single point of failure, Network bottleneck
Shared Nothing Architecture (ex: Teradata) *External tables are read only and points to external stage.
- High scalable, better performance, flexible conf
- Fixed load balancing, Data Skew, Time consuming data partitioning Code for table creation:
For each Snowflake account, you choose Edition, cloud provider and location/Region.
Snowflake (SF) Architecture: SF has 3 distinct layer, Cloud Services, Query Processing Snowflake UI is called Snowsight. CREATE TEMPORARY (TEMP)| TRANSIENT| PERMANENT (Optional) TABLE
Layer, Database Storage. <Table_name> (
Data Exchange: Snowflake support can enable it, can share and control data between ID NUMBER,
selected group of entities. );
Data Types:
View Type:
Database Storage: Snowflake Partners: Huge partner network for data integration, ML/Data Science,
- Data into optimized, Compressed and columnar format (Proprietary) Security, Governance & Observability, BI, SQL Dev & Mgmt and Native programming
- Data is stored in cloud storage, accessible to all nodes Interface.
- Data not accessible directly, need to read via SF
- Cloud agnostic (AWS, GCP, Azure Supported) Snowflake Drivers and Connectors: Supports Go, JDBC, ODBC, .Net, [Link], PHP PDOst,
Python, Kafka Connector, Apace Spark.
Query Processing Layer:
- Data into optimized, Compressed and columnar format SQL Scripting: Snowflake Scripting is an extension to Snowflake SQL that adds support Code for View Creation :
- Data is stored in cloud storage, accessible to all nodes for procedural logic.
- Data not accessible directly, need to read via SF You can Declare Variables, return a value, branching construct, IF/CASE, Loops, FOR, CREATE OR REPLACE SECURE MATERIALIZED VIEW <View_name> AS SELECT….
- Cloud agnostic (AWS, GCP, Azure Supported) WHILE, REPEAT, LOOP, CURSOR, RESULTSET, determine number of rows impacted by
DML, (SQLROWCOUNT, SQLFOUND, SQLNOTFOUND), Handles Exception. Advantage of Materialized View: (Enterprise + feature)
Cloud Services:
- Snowflake cloud services coordinate platform activities Snowpark: Snowpark is a developer experience for Snowflake that allows developers to • Materialized views can improve the performance of repeat queries.
- Manage user requests from login to query dispatch write code in their preferred language (Python, Scala, or Java) and run that code directly • Materialized views are automatically by Snowflake.
- Responsible for security, access management, monitoring and other operational tasks on Snowflake. • Data accessed through materialized views is always Up-to-date.
- Authentication and access control verifies user identity and manages access to data. Database: A database is a logical grouping of schemas. Each database belongs to a single Extra cost of Storage and compute ( Managed by SF)
- Infrastructure manager provisions and manages compute resources for Snowflake. Snowflake account.
This Book is created as part of Udemy course Snowflake SnowPro Core Certification COF-C02 Masterclass, please complete the course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]
actions can be taken using the changed data. This process is referred to as change data
Snowflake Loading: capture (CDC). --Create TASK for Serverless (Example)
-The number of load operations that run in parallel cannot exceed the number of data Sequences:
files to be loaded Sequences are used to generate unique numbers across sessions and statements;
-Split Large files in smaller ones and combine small ones in large one however, Snowflake doesn’t guarantee generating sequence numbers without gaps.
-Ideal size 100-250 mb (or larger) in size compressed
-Loading very large file eg. 100 GB or larger, not recommended Syntax : CREATE OR REPLACE SEQUENCE SEQUEN_EX START = 1 INCREMENT = 1;
-For large Files, carefully use ON_ERROR option
-If load goes beyond 24 hours, then job can be cancelled with no data load -Create TASK with User Managed Warehouse (Example) SELECT SEQUENCE_SAMPLE2.nextval a, SEQUENCE_SAMPLE2.nextval b,
-Consider accumulating file in source at 1 minute frequency SEQUENCE_SAMPLE2.nextval c, SEQUENCE_SAMPLE2.nextval d;
CREATE OR REPLACE TASK TASK_DEMO_U
Streams: WAREHOUSE = COMPUTE_WH - Default value for the INCREMENT and START is 1.
A stream object records data manipulation language (DML) changes made to tables, SCHEDULE = '1 MINUTE'
including inserts, updates, and deletes, as well as metadata about each change, so that AS INSERT <command>;
This Book is created as part of Udemy course Snowflake SnowPro Core Certification COF-C02 Masterclass, please complete the course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]
Micro-Partitioning: Data storage monitoring
• Snowflake automatically divides table in micro-partitions
• Each micro-partition contains 50-500 MB of uncompressed data (SF Select Admin » Cost Management » Consumption
compress it further and stores)
• Organized in columnar format Snowflake provides various methods to view and control data storage for tables, stages,
• Very large tables can have multi-million micro-partitions and clones. Users can use Snowsight, the Classic Console, the SHOW TABLES command,
• Micro-partitions are immutable (Time-Travel, Cloning) or the TABLE_STORAGE_METRICS view to monitor storage usage and costs. Users can
also remove staged files or drop cloned objects to reduce storage.
• Search Access Path gets updated with data needed for look up
• New updates in table gets automatically updated
• Check progress search_optimization_progress in SHOW TABLES
• Cost for storage and compute
This Book is created as part of Udemy course Snowflake SnowPro Core Certification COF-C02 Masterclass, please complete the course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]
• When Scale Up, Charged only for extra compute
Before SOS • When Scale Down, Charged for both, for brief period
Query should have a column with 100-200k distinct values Compute Cost for Cloud services :
Selective filters bring very few records • Authentication, enforces security, performs query compilation and
Query runs at least for few seconds optimization, handles request query caching etc.
• Snowflake credits are used to pay for the processing time used by each
SOS Vs Clustering virtual warehouse.
• If usage is less than 10% of daily credit consumed, it is free.
SOS applies for all columns
SOS uses addition data structure Snowflake Storage Cost:
*Snowflake does not apply data egress charges when a Snowflake client or driver
retrieves query results across regions within the same cloud platform or across
different cloud platforms
Features
Compute Cost :
- Submit SQL statements for execution (Queries, DDL,DML)
- Check the status of the execution of a statement
• A virtual warehouse is one or more clusters of compute resources that Streamlit is an open-source Python library that makes it easy to create and share
- Cancel the execution of a statement
enable executing queries, loading data, and performing other DML custom web apps for machine learning and data science.
- Manage Deployments ( provision user, role etc.)
operations.
• Snowflake credits are used to pay for the processing time used by each Limitation
virtual warehouse. - PUT and GET are not supported
Billing
- Leverages cloud service layer
Cost Based on Size of VW and How long they Run, not on complexity of the Query. Account Access and Security
• Billed By Seconds Network security and policies (Admin » Security » Network Policies)
• 60 Sec minimum • Allows to restrict access to SF based on IP address
This Book is created as part of Udemy course Snowflake SnowPro Core Certification COF-C02 Masterclass, please complete the course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]
• SECURITYADMIN or Higher (eg ACCOUNTADMIN) can create network GRANT USAGE ON SCHEMA DUMMY.DUMMY_SC TO ROLE DUMMY_ROLE;
policies
• Can be set at ACCOUNT or USER level Type of table privilege
• Only one Policy per ACCOUNT, or per USER SELECT
TRUNCATE
• Supports CIDR Notation
DELETE
• Blocked IP gets preference OWNERSHIP
• Support only Internet Protocol version 4 (i.e. IPv4) addresses, not IPv6 Etc…
Privilege: A defined level of access to an object. Multiple distinct privileges may be used
to control the granularity of access granted.
Account roles ORGADMIN : At Org level, Can create Account, can run SHOW ORGANIZATION
To permit SQL actions on any object in your account, grant privileges on the object to an ACCOUNTS, SHOW REGIONS
account role. ACCOUNTADMIN : Parent to SYSADMIN and SECURITYADMIN
SECURITYADMIN : Manages GRANTS, Manage USERS/ROLES, PARENT OF USERADMIN
Database Roles SYSADMIN : Create DB, WH, Assign all custom role to SYSADMIN
Federated Authentication (SSO) : In a federated environment, user authentication is To limit SQL actions to a single database, as well as any object in the database, grant USERADMIN : CREATE USER/ROLE
separated from user access through the use of one or more external entities that privileges on the object to a database role in the same database. PUBLIC : Pseudo Role, Auto granted to all
provide independent authentication of user credentials.
Primary Role Role Hierarchy and Privilege Inheritance:
Service Provider: The default assumed Role USE ROLE <ROLE NAME>
In a Snowflake federated environment, Snowflake serves as the SP.
Identity provider (IdP): Secondary Role
Creating and maintaining user credentials and other profile information. Can assign any number of secondary role
Authenticating users for SSO access to the SP.
Privileges are granted to roles, and roles are granted to users, to specify the operations
Snowflake Supports SCIM 2.0, Native app support : Microsoft and Okta
that the users can perform on objects in the system.
Access History : (Enterprise + feature) Access History in Snowflake refers to when the
user query reads data and when the SQL statement performs a data write operation,
such as INSERT, UPDATE, and DELETE along with variations of the COPY command,
from the source data object to the target data object.
• Read Only Schema CREATE TAG IF NOT EXISTS COST ALLOWED_VALUES 'MKT','ADVT' COMMENT =
• Available for every DB ‘comment’;
• Maintains data about all DB object and some account level information, In
form of Views
• Table function for historical and usage data across your account Performance & Cost Optimization Concepts
• Need Privilege
Query Profile: It provides a graphical representation of the main components of the
-- Find the login history with information schema processing plan for the query, with statistics for each component, along with details
Row Level Security : (Enterprise + feature)
SELECT * FROM TABLE(information_schema.login_history()); and statistics for the overall query.
Only relevant rows are shown to authorized person
Account Usage:
Views that display object metadata and usage metrics for your account.
- Type: Account_Usage and Readers_Account_Usage
This Book is created as part of Udemy course Snowflake SnowPro Core Certification COF-C02 Masterclass, please complete the course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]
Warehouse Cache (Local Disk Cache) Maximum number of clusters, greater than 1 (up to 10).
At Query Processing Layer Minimum number of clusters, equal to or less than the maximum (up to 10).
Local to the WH
Purged when WH is suspended or shut Additionally, multi-cluster warehouses support all the same properties and actions as
Need to find whether it is worth to auto-suspend a WH single-cluster warehouses, including:
Scale Up in Snowflake means increasing the resources for a single warehouse. It’s like
EXPLAIN PLANS
upgrading from a SMALL to a MEDIUM or from a MEDIUM to a LARGE warehouse.
EXPLAIN [ USING {TABULAR | JSON | TEXT}] <statement>
Scale Down is exactly opposite of Scale up. Happens post completion of existing
Query History (Option under Activity)
queries.
Provides 14 days history
Scaling Policy:
QUERY_HISTORY (Under Information Schema)
Last 7 days activity
Data Spilling: Data spilling occurs when an operation cannot fit in memory. The
system begins by spilling data to the local disk of a warehouse node then to remote
disks. These operations can significantly slow down query execution. To mitigate this,
it’s recommended to use a larger warehouse, process data in smaller batches, or
optimize the query.
Types of Cache:
You can monitor Warehouse load Admin>>Warehouse: You can check warehouse
Query Result Cache (USE_CACHED_RESULT)
wise Query history.
In Cloud services layer
Expires in 24 hours
Work for exact same query
Doesn’t work with functions, which evaluates at runtime e.g. CURRENT_TIMESTAMP()
and UUID_STRING(), CURRENT_DATE() is exception
Doesn’t work with external functions
Doesn’t work when, table change, config change, Micro-partitions change
Storage: Clustering, QAS, SOS, Materialized View, Purge stage data, use of temp table,
Cost Consideration during replication and failover
Save on time travel (DATA_RETENTION_TIME_IN_DAYS), Monitor memory
consumption.
Data Transfer: Initial replication, Subsequent Sync FILE FORMAT:
Compute: To identify delta, to copy
Warehouse : Reduce Queues, Reduce spilling, Right size WH, WH cache optimization,
Storage: Target account Supported File Type for loading : CSV (Default),TSV, JSON, PARQUET, AVRO, ORC, XML
limit concurrent queries (MAX_CONCURRENCY_LEVEL),Separate WH , Monitor WH,
Control Cost: Select objects, adjust the refresh frequency Supported File Type for Unloading : CSV (Default),TSV,JSON,PARQUET
Resource Monitor.
Monitor and Control Cost Compression : AUTO (Default) | GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE | NONE
Query : Query Profile stat, Query History, Avoid SELECT *, Use filter.
Resource Monitor : To monitor WH, can notify, suspend Data can be loaded/unloaded with UI or SQL statement.
Cost Optimization :
Budget : At account level, Send mail, also forecasting feature
Check Cost Insights in Admin>>Cost Mgmt >> Account Overview File Format definition (Example)
Attributing Costs
CREATE OR REPLACE FILE FORMAT my_csv_ff
Tagging : For logical bifurcation and chargeback using query/object tag TYPE = CSV
FIELD_DELIMITER = ','
To find cost use Views TAG_REFERENCES, WAREHOUSE_METERING_HISTORY, SKIP_HEADER = 1
QUERY_ATTRIBUTION_HISTORY from ACCOUNT_USAGE. NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = true
Data Loading and Unloading COMPRESSION = NONE;
Stage : “Snowflake refers to the location of data files in cloud storage as a stage” Describe FF
DESC FILE FORMAT my_csv_ff_new;
Show FF
SHOW FILE FORMATS;
Types of objects to be used :
Drop FF
DROP FILE FORMAT my_csv_ff_new;
Put Command
PUT [Link] Path>CSV @~/customers_user/staged AUTO_COMPRESS = FALSE
OVERWRITE = TRUE;
SYSTEM | BLOCK : Faster, Might be biased, block wise Example: SELECT [Link].udf_sample()
VALIDATE and VALIDATION_MODE
Example: SELECT COUNT(*) FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER Stored Procedure:
VALIDATION_MODE = RETURN_n_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS TABLESAMPLE(1); A stored procedure is a named, precompiled collection of SQL statements and control-of-
flow language constructs that are stored in a database and can be called from SQL CALL
Example : COPY INTO customers_named Sampling on Join is possible if, BERNOULII/ROW BASED, No Seed Used. command, including other stored procedures.
FROM @internal_named_stage SAMPLE and TABLESAMPLE are synonymous. Languages Support: SQL, JS, Python (These 3 via Snowpark), Java, Scala
FILE_FORMAT = my_csv_ff
ON_ERROR = CONTINUE Supporting Function Types: UDF Vs SP:
RETURN_FAILED_ONLY = TRUE;
Feature User-Defined Function (UDF) Stored Procedure
VALIDATE: Validates the files loaded in a past execution of the COPY INTO <table> Return Value Must return a value Not Mandatory to return a value
command Use case When part of SQL stmt For DML, DDL Operations
Example: SELECT * FROM TABLE (VALIDATE (customers_named, JOB_ID => '_last')) Java, JS, Python, Scala, Snowflake
Languages Java, JS, Python, Scala, SQL
Scripting
Unloading data : Can be called for every row of a query Can only be executed as a
- Bulk(Table) unloading with COPY INTO <Location> Granularity
result set standalone command
- For getting single file from multiple files use SINGLE = TRUE, default is FALSE Inside Single SP with a statement, Can
- Can partition the data by PARTITION BY Multiple UDFs can be called
SELECT have nested SP
DB Access May not have access Has access
GET Command
GET @internal_named_stage file ://<File_path>;
Semi-Structured Data : Semi-structured data is data that does not conform to the
-- INSERT OVERWRITES the existing data in the employees table with the new data from
standards of traditional structured data, but contains tags (labels) or other types of
the select statement.
mark-up that identify individual, distinct entities within the data.
INSERT OVERWRITE INTO employees SELECT id, name, department FROM employees
WHERE department = 'Engineering';
Snowflake provides native data types (ARRAY, OBJECT, and VARIANT)
Snowflake supports querying the data
Data Transformations
Supported Data formats (For Load)
Scalar Function: Returns 1 value per invocation.
- JSON
Estimation Functions: Example: SELECT CEIL(135.135); Returns >> 136
- Avro
Distinct Count: Uses HyperLogLog (HLL) Algo to count distinct value in a column.
- ORC
Aggregate Function: Returns 1 value per group of rows.
- Parquet
Similarity Estimate: Uses MinHash Function to find similarity between 2 datasets. Example: SELECT AVG(salary) from <table>;
- XML
Window Function : Aggregate functions that can operate on a subset of rows within the
Supported Data formats (For Unload)
set of input rows.
- JSON
This Book is created as part of Udemy course Snowflake SnowPro Core Certification COF-C02 Masterclass, please complete the course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]
- Parquet
In a VARIANT column, JSON null values are stored as a string containing the word “null”
DATA ENCRYPTION:
Unstructured data : Unstructured data is information that does not fit into a predefined Example Code :
data model or schema.
Example : PDF, Image, Video, VCF (genomics), KDF (semiconductors),HDF5 (aeronautics) CREATE OR REPLACE STAGE internal_UN_stage
etc. ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
DIRECTORY=(ENABLE=TRUE);
SF Supports for Unstructured data :
Data Protection and Data Sharing
• Securely access data files located in cloud storage.
• Share file access URLs with collaborators and partners. Continuous Data Protection with Snowflake:
• Load file access URLs and other file metadata into Snowflake tables.
• Process Unstructured data.
This Book is created as part of Udemy course Snowflake SnowPro Core Certification COF-C02 Masterclass, please complete the course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]
- Source in past doesn’t exists Secure Data Sharing (Not available for VPS): Secure Data Sharing lets you share
- Pipe object with AUTO_INGEST = TRUE recreated selected objects in a database in your account with other Snowflake accounts. No data
gets copied or replicated.
Code : CREATE OR REPLACE TABLE clone_data CLONE sample_data;
Secure data sharing can be done with other snowflake account or readers (non-sf)
REPLICATION :
account.
• Tables
• External tables
• Secure views
• Secure materialized views
• Secure UDFs
•
When data shared with other SF account, consumer pays for the compute.
When data shared with none SF account, provider pays for the compute.
Cloning with Time-Travel is only available for Only available for Database, Schema, Table
& Streams
Cloning fails if
This Book is created as part of Udemy course Snowflake SnowPro Core Certification COF-C02 Masterclass, please complete the course
before reading this book, if you got this file without the course then you can purchase the course here. [Link]