Introduction to
Introduction
Introduction to Snowflake: Snowflake is an analytic data warehouse provided as Software-as-a-
Service (SaaS). Snowflake provides a data warehouse that is faster, easier to use, and far more flexible
than other data warehouse. Snowflake’s data warehouse is not built on an existing database or “big data”
software platform such as Hadoop. The Snowflake data warehouse uses a new SQL database engine with
a unique architecture designed for the cloud. Snowflake has additional functionality and unique
capabilities.
[Link]
Data Warehouse as a Cloud Service:
• There is no hardware (virtual or physical) for you to select, install, configure, or manage.
• There is no software for you to install, configure, or manage.
• Ongoing maintenance, management, and tuning is handled by Snowflake.
• Snowflake runs completely on cloud infrastructure. All components of Snowflake’s service (other than
an optional command line client), run in a public cloud infrastructure.
• Snowflake uses virtual compute instances for its compute needs and a storage service for persistent
storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).
Video : Link
Architecture
Snowflake’s architecture is a hybrid of traditional shared-disk database architectures and shared-nothing database
architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is
accessible from all compute nodes in the data warehouse. But similar to shared-nothing architectures, Snowflake
processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a
portion of the entire data set locally.
Video : Link
Snowflake’s unique architecture consists of three key layers:
Database Storage
Query Processing
Cloud Services
Database Storage:
[Link]
Snowflake manages all aspects of how this data is stored — the organization, file size, structure,
compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The
data objects stored by Snowflake are not directly visible nor accessible by customers; they are only
accessible through SQL query operations run using Snowflake.
Architecture
Query Processing:
[Link]
Query execution is performed in the processing layer. Snowflake processes queries using “virtual
warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes
allocated by Snowflake from a cloud provider.
Automatic Query Optimization : [Link]
no-tuning/
Cloud Services:
[Link]
The cloud services layer is a collection of services that coordinate activities across Snowflake. These
services tie together all of the different components of Snowflake in order to process user requests,
from login to query dispatch. The cloud services layer also runs on compute instances provisioned by
Snowflake from the cloud provider. • Authentication
Between the services in this layer: • Infrastructure management
• Metadata management
• Query parsing and optimization
• Access control
Virtual Warehouses
[Link]
A virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in
Snowflake. A warehouse provides the required resources, such as CPU, memory, and temporary
storage, to perform the following operations in a Snowflake session:
Executing SQL SELECT statements that require compute resources (e.g. retrieving rows from tables and
views).
Performing DML operations, such as:
Updating rows in tables (DELETE , INSERT , UPDATE).
Loading data into tables (COPY INTO <table>).
Unloading data from tables (COPY INTO <location>).
Video : Link
Multi-cluster Warehouses AND Auto scaling
Multi-cluster Warehouses :
[Link]
Multi-cluster warehouses enable you to scale compute resources to manage your user and query
concurrency needs as they change, such as during peak and off hours.
Auto scaling:
This mode is enabled by specifying different values for maximum and minimum clusters. In this mode,
Snowflake starts and stops clusters as needed to dynamically manage the load on the warehouse:
As the number of concurrent user sessions and/or queries for the warehouse increases, and queries
start to queue due to insufficient resources, Snowflake automatically starts additional clusters, up to
the maximum number defined for the warehouse.
Similarly, as the load on the warehouse decreases, Snowflake automatically shuts down clusters to
reduce the number of running servers and, correspondingly, the number of credits used by the
warehouse. To help control the usage of credits in Auto-scale mode, Snowflake provides a property,
SCALING_POLICY, that determines the scaling policy to use when automatically starting or shutting
down additional clusters
Video : Link
Constraints
[Link]
Constraints define integrity and consistency rules for data stored in tables. Snowflake provides support
for constraints as defined in the ANSI SQL standard, as well as some extensions for compatibility with
other databases, such as Oracle.
Overview of Constraints:
[Link]
Snowflake provides the following constraint functionality:
Unique, primary, and foreign keys, and NOT NULL columns.
Named constraints.
Single-column and multi-column constraints.
Creation of constraints inline and out-of-line.
Support for creation, modification and deletion of constraints.
Constraint Properties:
[Link]
Constraint properties are specified in the CONSTRAINT clause for a CREATE TABLE or ALTER
TABLE command.
Constraints
Additional Constraint Details:
[Link]
In this Topic:
Security Privileges for Constraints
Comments for Constraints
Support for Constraints in Copy Commands
Support for Constraints in GET_DDL
Creating Constraints:
[Link]
A constraint can be created at table creation using CREATE TABLE, or added to a table later using ALTER
TABLE:
.Single-column constraints can be created inline as part of the column definition.
.Multi-column constraints must be created in a separate, i.e. out-of-line, clause that specifies the
columns in the constraint.
Constraints
Modifying Constraints:
[Link]
constraint is created:
• The constraint can be renamed.
• Some properties can be modified, e.g. ENFORCED and VALIDATE.
• Some properties cannot be modified, e.g. such as DEFERRABLE. To modify these properties, the
constraint must be dropped and recreated.
• The column definition for a constraint cannot be modified, e.g. add new columns, drop existing
columns, or change the order of columns. To make these types of changes, the constraint must
be dropped and recreated.
• When modifying a constraint, the constraint can be identified using either the constraint name
or the columns in the constraint definition along with the type of the constraint. Primary keys
can also be identified using the PRIMARY KEY keyword, because each table can have only a
single primary key.
• If a table with constraints is modified, e.g. rename table or swap table with another table, the
constraints are updated to reflect the changes.
Constraints
Dropping Constraints:
[Link]
Constraints are dropped using the ALTER TABLE command:
ALTER TABLE … DROP CONSTRAINT … explicitly drops the specified constraint. Similar
to modifying constraints, the constraint can be identified by the constraint name or
column definition along with the constraint type. For a primary key, they can also be
identified using the PRIMARY KEY keyword.
ALTER TABLE … DROP COLUMN … drops a column and its associated constraints.
By default, when a primary/unique key is dropped, all foreign keys referencing the key
being dropped are also dropped, unless the RESTRICT drop option is specified.
Constraints are also dropped when the associated tables/schemas/databases are
dropped. The DROP commands support the CASCADE | RESTRICT drop options.
Unique/primary/foreign key constraints can be explicitly dropped (using ALTER TABLE …
DROP CONSTRAINT …):
ALTER TABLE DROP { CONSTRAINT <name> | PRIMARY KEY | { UNIQUE | FOREIGN
KEY } (<column>, [ ... ] ) } [ CASCAD
Connecting to Snowflake
[Link]
• Snowflake supports multiple ways of connecting to the service:
• A web-based user interface from which all aspects of managing and using Snowflake can be
accessed.
• Command line clients (e.g. SQL) which can also access all aspects of managing and using
Snowflake.
• ODBC and JDBC drivers that can be used by other applications (e.g. Tableau) to connect to
Snowflake.
• Native connectors (e.g. Python) that can be used to develop applications for connecting to
Snowflake.
• Third-party connectors that can be used to connect applications such as ETL tools (e.g.
Informatica) and BI tools to Snowflake
Snowflake Regions
[Link]
Snowflake Regions let your organization choose where your data is geographically stored across your
global operations. They also determine where your compute resources are provisioned.
Each Snowflake account is located in a single region (i.e. multi-region accounts are not supported). In
addition, Snowflake does not yet support accessing or sharing data between regions. If you wish to use
Snowflake across multiple regions, you must have a separate Snowflake account in each region.
Snowflake Key Features
Security and Data Protection:
[Link]
protection
Choose the level of security you require for your Snowflake account, based on your Snowflake Edition.
• Choose the geographical location where your data is stored, based on your Snowflake Region.
• User authentication through standard user/password credentials.
Enhanced authentication:
Multi-factor authentication (MFA).
• Federated authentication and single sign-on (SSO) — requires Snowflake Enterprise Edition.
• All communication between clients and the server protected through TLS.
• Deployment inside a cloud platform VPC.
• Isolation of data via Amazon S3 policy controls.
• Support for PHI data (in compliance with HIPAA regulations) — requires Snowflake Enterprise for
Sensitive Data (ESD).
• Automatic data encryption by Snowflake using Snowflake-managed keys.
• Object-level access control.
Snowflake Key Features
Standard and Extended SQL Support
[Link]
extended-sql-support
• Most DDL and DML defined in SQL:1999, including:
• Database and schema DDL.
• Table and view DDL.
• Standard DML such as UPDATE, DELETE, and INSERT.
• DML for bulk data loading/unloading.
• Core data types.
• SET operations.
• CAST functions.
• Advanced DML such as multi-table INSERT, MERGE, and multi-merge.
• Transactions.
• Temporary and transient tables for transitory data.
• Analytical aggregates (Group by cube, rollup, and grouping sets).
Snowflake Key Features
Tools and Interfaces
[Link]
interfaces
• Web-based GUI for account and general management, monitoring of resources and system
usage, and querying data.
• SnowSQL (Python-based command line client).
• Virtual warehouse management from the GUI or command line, including creating, resizing
(with zero downtime), suspending, and dropping warehouses.
Snowflake Key Features
Connectivity
[Link]
1. Broad ecosystem of supported 3rd-party partners and technologies.
2. Support for using free trials to connect to selected partners.
3. Extensive set of client connectors and drivers provided by Snowflake:
• Python connector
• Spark connector
• [Link] driver
• Go Snowflake driver
• .NET driver
• JDBC client driver
• ODBC client driver
Snowflake Key Features
Data Import and Export:
[Link]
and-export
• Support for bulk loading and unloading data into/out of tables, including:
• Load any data that uses a supported character encoding.
• Load data from compressed files.
• Load most flat, delimited data files (CSV, TSV, etc.).
• Load data files in JSON, Avro, ORC, Parquet, and XML format.
• Load from S3 data sources and local files using Snowflake web interface or command line
client.
• Support for continuous bulk loading data from files:
• Use Snow pipe to load data in micro-batches from internal stages (i.e. within Snowflake) or
external stages.
Data Sharing:
[Link]
• Support for sharing data with other Snowflake accounts:
• Provide data to other accounts to consume.
• Consume data provided by other accounts.
Overview of the Data Lifecycle
[Link]
Snowflake provides support for all standard SELECT, DDL, and DML operations across the lifecycle of
data in the system, from organizing and storing data to querying and working with data, as well as
removing data from the system.
Continuous Data Protection
[Link]
Continuous Data Protection (CDP) encompasses a comprehensive set of features that help protect data
stored in Snowflake against human error, malicious acts, and software or hardware failure. At every
stage within the data lifecycle, Snowflake enables your data to be accessible and recoverable in the
event of accidental or intentional modification, removal, or corruption.
Introduction to Snowpipe
[Link]
snowpipe-work
Snowpipe enables loading data from files as soon as they’re available in a stage. This means you
can load data from files in micro-batches, making it available to users within minutes, rather
than manually executing COPY statements on a schedule to load larger batches.
In this Topic:
How Does Snowpipe Work?
How Is Snowpipe Different from the COPY Command?
Recommended Load File Size
Snowpipe DDL
Video : Link
20
How Does Snowpipe Work?
• Your client application calls a public REST endpoint with a list of data filenames and a referenced
pipe name (Java and Python SDKs are provided for your convenience). If new data files matching
the list are discovered in the stage, they are queued for loading. Snowflake-provided compute
resources load data from the queue into a Snowflake table based on parameters defined in the
pipe.
• A pipe is a named, first-class Snowflake object that contains a COPY statement used by the
Snowpipe REST service. The COPY statement identifies the source location of the data files (i.e., a
named stage) and a target table. All data types are supported, including semi-structured data
types such as JSON and Avro.
• The following diagram shows the Snowpipe process flow:
Snowpipe process flow
Snowpipe DDL
• To support creating and managing pipes, Snowflake provides the following set of special DDL
commands:
• CREATE PIPE
• ALTER PIPE
• DROP PIPE
• DESCRIBE PIPE
• SHOW PIPES
• In addition, providers can view, grant, or revoke access to the necessary database objects for
Snowpipe using the following standard access control DDL:
• GRANT <privileges> … TO ROLE
• REVOKE <privileges> … FROM ROLE
• SHOW GRANTS
23
Understanding Billing for Snowpipe Usage
[Link]
• With Snow pipe's serverless compute model, users can initiate any size load without managing a
virtual warehouse. Instead, Snowflake provides and manages the compute resources,
automatically growing or shrinking capacity based on the current Snowpipe load. Accounts are
charged based on their actual compute resource usage; in contrast with customer-managed
virtual warehouses, which consume credits when active, and may sit idle or be over utilized.
• Snowflake tracks the resource consumption of loads for all pipes in an account, with per-
second/per-core granularity, as Snowpipe actively queues and processes data files. Per-core refers
to the physical CPU cores in a compute server.
• Using a multi-threaded client application enables submitting data files in parallel, which initiates
additional servers and loads the data in less time. However, the actual overall compute time
required would be identical to using a single-threaded client application, just spread out over
more internal Snowpipe servers. The utilization recorded is then translated into familiar
Snowflake credits, which are listed on the bill for your account.
Calling Snowpipe REST Endpoints to Load Data
[Link]
• This set of topics describes how to call the public REST endpoints to load data and retrieve load
history reports.
• In this Topic:
• Authentication
• Workflow
– Configuring Snowpipe
– Using Snowpipe to Load Data
Authentication
• Calls to the public Snowpipe REST endpoints use key-based authentication, rather than the typical
username/password authentication, because the ingestion service does not maintain client
sessions.
• To follow the general principle of least privilege, we recommend creating a separate user and role
to use for ingesting files using a pipe. The user should be created with this role as its default role,
and the role should have the minimum set of permissions needed to insert files into the table.
Workflow
• This section provides a high-level overview of the setup and load workflow.
• Configuring Snowpipe
• Create a named stage object where your data files will be staged. Snowpipe supports both
internal (Snowflake) stages and external stages, i.e. S3 buckets.
• Create a pipe object using CREATE PIPE.
Workflow
• Configure security for the user who will execute the continuous data load. If you plan to restrict
Snowpipe data loads to a single user, you only need to configure key pair authentication for the user
once. After that, you only need to grant access control privileges on the database objects used for each
data load.
• Install a client SDK (Java or Python) for calling the Snowpipe public REST endpoints.
Using Snowpipe to Load Data
• Call a REST endpoint with a list of files to load continuously when staged.
• Retrieve the load history.
Troubleshooting Snowpipe
[Link]
• This topic describes a methodical approach to troubleshooting issues with loading data using
Snowpipe.
• In this Topic:
• Step 1: Checking Authentication Issues
• Step 2. Viewing the COPY History for the Table
• Step 3: Checking the Pipe Status
• Other Issues
– Unloaded Set of Files
– Difficulty Reloading Data Into a Table
Snowpipe
Managing Snowpipe:
[Link]
• This topic describes the administrative tasks associated with managing Snowpipe.
• In this Topic:
• Pausing and Resuming Pipe
Pausing and Resuming Pipes:
• The PIPE_EXECUTION_PAUSED parameter enables pausing or resuming a pipe, typically in
preparation for transferring ownership of the pipe. This parameter is supported at the following
levels:
• Account
• Schema
• Pipe
• At the pipe level, the object owner (or a parent role in a role hierarchy) can set the parameter to
pause or resume an individual pipe.
Working With Snowflake
Snowflake Objects
● The ACCOUNTADMIN role in Snowflake provides a user the ability to administer both security (ie:
SECURITYADMIN role) and objects (i.e.: SYSADMIN role) within a database; this role has overarching
privileges across the Snowflake account and should be handed out with extreme caution
● Data (Storage) :
○ All data in Snowflake is maintained in databases.
○ Each database consists of one or more schemas, and, within these schemas, one or more
tables and/or views.
○ Schemas can be thought of as logical groupings of database objects, such as tables and
views by concept or purpose.
○ Snowflake does not place any hard limits on the number of databases, schemas (within
a database), or objects (within a schema) you can create.
Snowflake Objects
● Processing (Compute):
○ Compute in Snowflake is provisioned via Virtual Warehouses.
○ Virtual Warehouses can be configured to auto-suspend after a specified period of inactivity,
or auto-resume as soon as a user requests an operation that requires compute power.
■ Auto-suspend / auto-resume can be specified upon initial configuration, or modified as
needed after the fact
■ The right balance of auto-suspend / auto-resume will result in an optimized bill--minimum bill
time for a warehouse, upon startup, is 1 minute per startup; after a minute has passed, accounts
are billed by the second.
● Table Design:
○ Snowflake can support both structured and semi-structured data.
○ Semi-structured data should be stored, as is, in a table column with a data type of VARIANT .
■ A Snowflake table can contain both a VARIANT data type, as well as a timestamp.
■ Snowflake supports data in VARIANTs up to a maximum size of 16MB compressed.
○ Snowflake’s proprietary algorithm stores and processes semi-structured data in a unique way.
Extra links for the Video Tutorial
WHAT IS SNOWFLAKE DATAWAREHOUSE
HOW TO CREATE DATABASE AND TABLE IN SNOWFLAKE
HOW TO CREATE A TABLE USING OTHER TABLE AND CLONE TABLE IN SNOWFLAKE
HOW TO CREATE VIEW IN SNOWFLAKE DATA WAREHOUSE
WORKSHEETS AND QUERIES
Approximate time duration to complete – 30 min