Snowflake Data Warehouse Architecture Overview
Snowflake Data Warehouse Architecture Overview
Introduction
Deep dive
JOSIP S ABAN
JOSIPS ABAN@[Link]
Space: the final frontier. These are the
voyages of the starship Enterprise. Its
continuing mission: to explore strange
new worlds. To seek out new life and
new civilizations. To boldly go where
no one has gone before!
JEAN-LUC PICARD
STARSHIP CAPTAIN
About me
Finished Computing at Faculty of Electrical Engineering and Computing, Zagreb in 2004,
Cotrugli MBA in 2011, now on Executive MBA in Innovation & Entrepreneurship on TU Wien
Started as web developer, then moved to databases and business intelligence with strong
focus on management
Around seven years in Erste Bank, worked in several areas, starting in Croatia and then in
Erste Digital Vienna
Details about my career path and partial portfolio are available at:
• [Link]
• [Link] ( conference lecture slides, blogs, MOOC certificates )
Lecture goals
Introduction to the concept of data warehouse as service
Lecture assumptions
• You come from world of databases…I don’t show trivial things or show you how to code
• You know a full business intelligence project lifecycle…I don’t need to explain BI concepts
• You came here to learn “what is different” and “why should I try Snowflake”
My goal is achieved if you leave this lecture with feeling “I want to know more”
Agenda
Introduction
Snowflake architecture
Best practices
Discussion
Introduction
Organizational topics
This lecture lasts two hours
Questions are welcome, I will answer them at the end of each block
It would be great that you have a second monitor during this lecture, so you can
make notes or do some interactive activities
Introduction to Snowflake
Snowflake is, currently, only data warehouse built for only for the cloud as SAAS
product
It runs on all three main cloud providers - Amazon Web Services, Microsoft Azure, and
Google Cloud
It is used to manage structured and unstructured data
It can serve data warehousing, data engineering, and data science use cases
Shared-Disk Shared-Nothing
Expensive hardware with redundancy to handle Typically built on commodity hardware
component failure
High availability Node availability is low and system availability is
high
Relatively low scalability High Scalability
Preferred in OLTP systems that require ACID Preferred in an environment with high, read/write
compliance rates
Data is partitioned and striped but within the Data may be partitioned and distributed across
storage array the cluster
Snowflake architecture - intro
• Snowflake's follows the hybrid approach which is called
multi cluster shared data architecture
• It decouples storage and compute to scale them
independently
• Snowflake is designed as three-layered architecture, to
be discussed in next block where we focus on
architecture
• Storage resources
• Compute resources
• Cloud services
• All three layers are self-scaling, and Snowflake charges
separately for disk and virtual warehouse; services layer
is managed inside provisioned computing nodes
• Benefit of this design is that each layer may be scaled independently of
others
Snowflake architecture
Snowflake architecture
Storage layer
• Information is automatically stored in optimized
and compressed micro partitions, in columnar
fashion
• Data is saved in the cloud and is managed using
a shared-disk architecture
• Components establish connections to the
storage layer in order to retrieve information for
query processing
• Users pay for the monthly average storage
usage - storage space is elastic and based on
user per TB
Snowflake architecture
Storage layer
• Uses scalable cloud blob storage available in all three public
clouds
• Storage layer supports scaling independent of the compute
layer
• By splitting compute and storage you can scale read and write
requests without having to prioritize one over the other,
internally handled by Snowflake - this is one of the unique
features
• As data loads into Snowflake, algorithms take over to process
and partition the incoming data and create metadata
• Columnar compression applied to these partitions optimizes
the utilization of space as well as improves query performance
• The data is also encrypted to meet the highest standards of
security required by enterprise companies
Snowflake architecture
Compute layer
• Snowflake executes queries using a "Virtual
Warehouse“
• Layer of separation between the query processing and disc
storage layer
• Compute executes queries against the data in the
storage layer
• Virtual Warehouses are computing units consisting of
several nodes with Snowflake-provisioned CPU and
Memory
• Each virtual warehouse may be configured to use a single storage
tier
• Generally, virtual warehouses operate independently
Snowflake architecture
Compute layer
• Uses standard computing infrastructure, i.e. virtual
machines available to anyone in a public cloud
environment
• When an incoming query is detected, computing
power becomes available immediately to process the
request
• As specialty of Snowflake, it deploys multiple virtual
warehouses to process a request while
simultaneously maintaining the integrity of the
transaction, making the system ACID compliant
Snowflake architecture
Cloud layer
Focus of this chapter is how to organize your DWH so that as little as possible goes wrong
We cover decisions on your data warehouse solution, databases, schemas and different
types of tables
Database organization
Raw data:
• It is “highly recommend” to create a Snowflake
database to ingest all of raw data
• This way you can re-run data models in the case of
errors
• It is imperative that the only “system” has full
access to this is your ingestion tool – Airflow,
Airbyte, Prefect, …
• Only other “user” that loads this database will be another
ingestion tool
• No human user can do anything on ingestion, except reading
rights
• All manual data correction should be, ideally,
possible through tasks in ingestion tools, or as one-
time user “incident” interventions
Database organization
Transformed data – BASE models
• Usually there is a couple of types of
transformations
• We start with “BASE” models – basic
transformations that are done on the raw data
• When creating a user, you will have to assign them a temporary password
• Make sure you set MUST_CHANGE_PASSWORD=true for security purposes for
any human user
• When you are creating a user for a tool, you do not want to select this
• Make sure you assign it a secure password from the start and store it in a
secrets manager
Database organization
Roles and users
• Each user should have a default warehouse to help manage costs and usage of your
data warehouse
• This isn’t necessarily a security feature, but it will help you understand where your
Snowflake credits are being utilized
• Assign the user the warehouse related to their role
• Creating roles
• It's a best practice to create a different role for the ingestion tool, analyst, engineer,
and business user within your organization
• You can also create specific roles for BI tools
• Creating a different role for each user and tool will allow you to control what each of
these users can and cannot do within each database and its schemas.
• Ingestion tool permissions - Access to RAW - Can view, select, and create in RAW
• This is the only role that should be allowed to write to your RAW database
• Your ingestion tool is the only thing that should be creating schemas and tables
within your RAW data location
• This is the most important role within your database because of the power it has to
access raw data
Database organization
Roles and users
• Analytics Engineer permissions:
• Access to RAW, BASE, DATA_MART_DEV, DATA_MART_PROD, and RDA
• Can view and select from RAW, BASE, and DATA_MART_PROD
• Can view, select, and create in DATA_MART_DEV and RDA
• The analytics engineer, or whoever is writing the dbt data models, should be
the only one who has access to the RAW database
• However, they should only be able to view and select from it
• Business user permissions:
• Access to BASE, DATA_MART_PROD, and RDA
• Can view and select from each of these
• Business users should have the ability to view data and do basic selections,
but not alter the data in any way
Database organization
Snowflake’s permission hierarchy
• You need to grant certain permissions to the database,
schema, tables/views, and future tables/views
• Snowflake’s permissions are unique in that you can’t assign a
permission to the database and expect it to also apply for the
schemas and tables/views within that database
• Grant permissions to databases
• MONITOR allows a role to see details on an object; you will want to grant this to
all roles who you want to see the database.
• USAGE allows a role to use a database; you will need to grant this to any roles
that wish to query the database.
• CREATE grants a role the ability to create an object within the database
Database organization
Snowflake’s permission hierarchy
• Grant permissions to schemas
• You need to grant the MONITOR privilege on schemas within your databases to the roles
• USAGE allows a role to use a schema; you will need to grant this to any roles that wish to
query the schema
• CREATE also exists on the schema level - you will need to assign this to all the roles for the
schemas within the databases
• Grant permissions to tables and views
• Here is where things get different - tables and views have unique privileges from databases
and schemas.
• SELECT allows a role to select from a table or view
• INSERT and DELETE are all permissions that should be given to users on “DATA_MART_DEV”
and “DATA_MART_PROD”
• Since this is the role assigned to the dbt user, it needs to be able to make changes to
tables
• Keep in mind that these permissions only work for tables, not views.
• With tables, views and schemas, you also need to make sure you grant permission to
future tables and views within a schema or database
Final thoughts
Start the process of building Snowflake architecture by planning and designing
the databases, schemas, and tables/views
If you decide something isn’t working for you, don’t be afraid to change it - there
are many different ways to organize Snowflake data warehouse architecture
Selected Snowflake
features
Lets take a look on Snowflake
GUI
Integrated Web GUI is one of the best database web-based administration on the
market
Username: TestUser
Snowflake123!
Virtual warehouse overview
Warehouses are required for queries, as well as all DML operations, including
loading data
Warehouse is defined by its size, as well as the other properties that can be set to
help control and automate warehouse activity
They can also be resized at any time, even while running, to accommodate the
need for more or less compute resources, based on the type of operations being
performed by the warehouse
Virtual warehouse overview
Size specifies the amount of compute resources available per cluster in a warehouse
Warehouse Credits / H Credits / Sec Notes
Size our ond
X-Small 1 0.0003 Default size for warehouses created using CREATE WAREHOUSE.
Small 2 0.0006
Medium 4 0.0011
Large 8 0.0022
X-Large 16 0.0044 Default for warehouses created in the web interface.
2X-Large 32 0.0089
3X-Large 64 0.0178
4X-Large 128 0.0356
5X-Large 256 0.0711 Generally available in Amazon Web Services regions, and in preview in US
Government and Azure regions.
6X-Large 512 0.1422 Generally available in Amazon Web Services regions, and in preview in US
Government and Azure regions.
Virtual warehouse overview
Impact of size on data loading
• Increasing the size of a warehouse does not always improve data loading performance
• Data loading performance is influenced more by the number of files being loaded (and
the size of each file) than the size of the warehouse
• [Link]
This can be used to simplify and automate your monitoring and usage of
warehouses
Auto-suspend ensures that you do not leave a warehouse running (and consuming
credits) when there are no incoming queries
Snowflake enables the sharing of databases through shares, which are created by
data providers and “imported” by data consumers
Tableau -
[Link]
PowerBI -
[Link]
f628dadbdc85
From these two example you can learn a lot how Snowflake system views are
organized
Time travel
• Time Travel allows us to access historical data from any
point, within a defined period, in the past, until the
time travel period
• Useful when you…
• Updated the wrong data and wanted to roll back 3–4 versions back
• Dropped the schema/table/database by mistake
• Required historical data for analysis purposes
• Time travel is feature of Enterprise Edition
• Code examples:
select * from table_name at(statement => 'QueryID')
select * from table_name before(statement =>
'QueryID')
select * from table_name at(offset => -60*5)
select * from table_name at(timestamp =>'Mon,26 dec
2020 05:06:23.189-0800':: timestamp_tz)
Time travel
• Time travel cannot be disabled for any account
• Users with an ACCOUNT ADMIN role can set the data
retention period using the period
• However, time travel can be disabled for individual
objects like databases, schemas, and tables by
specifying the DATA_RETENTION_TIME_IN_DAYS
parameter to 0 while creating.
• Once the retention period is completed, the data will
be automatically moved to the FAIL-SAFE layer, and
time travel actions can no longer be performed.
• If the retention period is defined at the account level,
then all the child objects like database, schema, and
tables will have the same value until and unless the
value is explicitly specified while creating the child
objects
Time travel
• Storage cost for time travel
• We can see the storage cost of time travel with the following query:
• SELECT * FROM snowflake.account_usage.table_storage_mterics
WHERE schema = schema_name;
• From the above query below parameters are important in calculating
the cost:
• ACTIVE_BYTES — Actual table cost
• TIME_TRAVEL_BYTES — Time travel cost
• FAIL_SAFE_BYTES — Fail Safe cost
Fail safe – method of last resort
• Fail-safe ensures historical data is protected in the
event of a system failure or other event
• Fail-safe provides a (non-configurable) 7-day
period during which historical data may be
recoverable by Snowflake - this period starts
immediately after the Time Travel retention period
ends
• This service is BEST EFFORT recovery service
• It is intended only for use when all other recovery options failed
• Fail-safe is not provided as a means for accessing historical data
after the Time Travel retention period has ended
• Data recovery through Fail-safe may take from several hours to
several days to complete
• In the web interface, account administrators can
view the total data storage for their account,
including historical data in Fail-safe
Micro partitions
Traditional data warehouses rely on static partitioning of large tables for performance and scaling
• Static partitioning has a number of limitations, such as maintenance overhead and data skew, which can result in
disproportionately-sized partitions
Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-partitioning, that
delivers all the advantages of static partitioning without the known limitations
All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage
• Each micro-partition contains between 50 MB and 500 MB of uncompressed data - actual size in Snowflake is smaller because
data is always stored compressed
• Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion
• This size and structure allows for extremely granular pruning of very large tables, which can be comprised of millions, or even
hundreds of millions, of micro-partitions