0% found this document useful (0 votes)
22 views65 pages

Snowflake Data Warehouse Architecture Overview

The document provides an introduction to Snowflake, a cloud-based data warehouse service, detailing its architecture, unique features, and best practices for data organization. It emphasizes the importance of separating storage and compute resources for scalability and efficiency, while also outlining the lifecycle of a Snowflake query. Additionally, it discusses the setup of user roles and permissions to maintain data integrity within the Snowflake environment.

Uploaded by

anika
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views65 pages

Snowflake Data Warehouse Architecture Overview

The document provides an introduction to Snowflake, a cloud-based data warehouse service, detailing its architecture, unique features, and best practices for data organization. It emphasizes the importance of separating storage and compute resources for scalability and efficiency, while also outlining the lifecycle of a Snowflake query. Additionally, it discusses the setup of user roles and permissions to maintain data integrity within the Snowflake environment.

Uploaded by

anika
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Snowflake

Introduction
Deep dive

JOSIP S ABAN
JOSIPS ABAN@[Link]
Space: the final frontier. These are the
voyages of the starship Enterprise. Its
continuing mission: to explore strange
new worlds. To seek out new life and
new civilizations. To boldly go where
no one has gone before!

JEAN-LUC PICARD
STARSHIP CAPTAIN
About me
Finished Computing at Faculty of Electrical Engineering and Computing, Zagreb in 2004,
Cotrugli MBA in 2011, now on Executive MBA in Innovation & Entrepreneurship on TU Wien

Started as web developer, then moved to databases and business intelligence with strong
focus on management

Around seven years in Erste Bank, worked in several areas, starting in Croatia and then in
Erste Digital Vienna

Owner of Meridian Data, specialized for data analytics and visualization

Details about my career path and partial portfolio are available at:
• [Link]
• [Link] ( conference lecture slides, blogs, MOOC certificates )
Lecture goals
Introduction to the concept of data warehouse as service

Presentation of Snowflake architecture

Explaining advanced or specific features of Snowflake platform

Lecture assumptions
• You come from world of databases…I don’t show trivial things or show you how to code
• You know a full business intelligence project lifecycle…I don’t need to explain BI concepts
• You came here to learn “what is different” and “why should I try Snowflake”

My goal is achieved if you leave this lecture with feeling “I want to know more”
Agenda

Introduction

Snowflake architecture

Best practices

Selected Snowflake topics

Discussion
Introduction
Organizational topics
This lecture lasts two hours

I will make breaks between blocks

Questions are welcome, I will answer them at the end of each block

If you wish to discuss further, please contact me on mail or LinkedIN, I am in main


hotel and have tendency to drink a lot of coffee 

It would be great that you have a second monitor during this lecture, so you can
make notes or do some interactive activities
Introduction to Snowflake
Snowflake is, currently, only data warehouse built for only for the cloud as SAAS
product
 It runs on all three main cloud providers - Amazon Web Services, Microsoft Azure, and
Google Cloud
 It is used to manage structured and unstructured data
 It can serve data warehousing, data engineering, and data science use cases

There is no “on-premise” version

It utilizes available compute, transfer and storage cloud options


• [Link]
Distributed system architecture
Shared – disk architecture
• Each node shares the storage disk but it has its own
memory/CPU
• Multiple nodes or multiple compute resources like BI
dashboards, IDE, ETL workloads…access the single
storage disk
• System becomes slower as we are onboard more and
more ETL workloads - there is a single storage system
that is single point of failure.
• Benefit - read and write operations are consistent and
efficient because it is not having any distributed
storage
• Disadvantage - distributed locking and
two-phase commit since there is common storage
shared by all
Distributed system architecture
Shared – nothing architecture
• No node shares any resource with the other nodes
and data is shuffled across all the nodes through
the common network connection
• Storage and compute are tightly coupled to each
other
• Adding more and more nodes can scale
horizontally, but however, compute and storage
cannot scale independently
• Disadvantage - data needs to be shuffled across
the network, it will be an overhead when there is
a lot of data that needs to be shuffled across the
nodes – this is how Hadoop, Cloudera Impala or
Cassandra work
Distributed system architecture
Shared-Disk vs Shared-Nothing – summary

Shared-Disk Shared-Nothing
Expensive hardware with redundancy to handle Typically built on commodity hardware
component failure
High availability Node availability is low and system availability is
high
Relatively low scalability High Scalability
Preferred in OLTP systems that require ACID Preferred in an environment with high, read/write
compliance rates
Data is partitioned and striped but within the Data may be partitioned and distributed across
storage array the cluster
Snowflake architecture - intro
• Snowflake's follows the hybrid approach which is called
multi cluster shared data architecture
• It decouples storage and compute to scale them
independently
• Snowflake is designed as three-layered architecture, to
be discussed in next block where we focus on
architecture
• Storage resources
• Compute resources
• Cloud services
• All three layers are self-scaling, and Snowflake charges
separately for disk and virtual warehouse; services layer
is managed inside provisioned computing nodes
• Benefit of this design is that each layer may be scaled independently of
others
Snowflake architecture
Snowflake architecture
Storage layer
• Information is automatically stored in optimized
and compressed micro partitions, in columnar
fashion
• Data is saved in the cloud and is managed using
a shared-disk architecture
• Components establish connections to the
storage layer in order to retrieve information for
query processing
• Users pay for the monthly average storage
usage - storage space is elastic and based on
user per TB
Snowflake architecture
Storage layer
• Uses scalable cloud blob storage available in all three public
clouds
• Storage layer supports scaling independent of the compute
layer
• By splitting compute and storage you can scale read and write
requests without having to prioritize one over the other,
internally handled by Snowflake - this is one of the unique
features
• As data loads into Snowflake, algorithms take over to process
and partition the incoming data and create metadata
• Columnar compression applied to these partitions optimizes
the utilization of space as well as improves query performance
• The data is also encrypted to meet the highest standards of
security required by enterprise companies
Snowflake architecture
Compute layer
• Snowflake executes queries using a "Virtual
Warehouse“
• Layer of separation between the query processing and disc
storage layer
• Compute executes queries against the data in the
storage layer
• Virtual Warehouses are computing units consisting of
several nodes with Snowflake-provisioned CPU and
Memory
• Each virtual warehouse may be configured to use a single storage
tier
• Generally, virtual warehouses operate independently
Snowflake architecture
Compute layer
• Uses standard computing infrastructure, i.e. virtual
machines available to anyone in a public cloud
environment
• When an incoming query is detected, computing
power becomes available immediately to process the
request
• As specialty of Snowflake, it deploys multiple virtual
warehouses to process a request while
simultaneously maintaining the integrity of the
transaction, making the system ACID compliant
Snowflake architecture
Cloud layer

• This is where all the intelligent action happens -


authenticating users, management of the cluster,
query execution and optimization, security,
encryption, and the orchestration of transaction
execution
• Examples:
• Whenever a login process is initiated, it must traverse this layer
• Snowflake queries are routed through this layer's analyzer and
later to the Compute Layer for execution
• This layer stores the metadata necessary to improve a query or
filter data
Snowflake architecture
Cloud layer

• This layer runs on compute nodes that are stateless


and span the entire data center
• Intelligent use of metadata distributed across the
cluster of computing nodes maintains the global
state of transactions and the system.
• Although processing of the metadata takes sizable
computing power, by design, the processing of
metadata happens on a separate cluster of machines
which reduces the impact of the actual compute
resources processing the data for the user
Life cycle of Snowflake query
• When a query is issued, the services layer parses the query, compiles it, and
determines which set of partitions hold the data of interest and flags them for
scanning
• User can submit query via any supported client/connector provided by
Snowflake
• The cloud service layer then scales up/down the warehouse depending on the
indication provided by the user (x-small up to x-large)
• The cloud service layer first creates a session and checks if the user credential
is valid to use the snowflake account using MFA and IP address if applicable.
• Once the authentication is done then…
• The logical plan of the query gets created
• All the objects and operations in the query including the warehouse validated if the respective user
has privileges to access the objects
• Then the optimized query plan is generated and submitted to the computer layer
• It also resumes the cluster if it is shut down or spins up/down the cluster based on the need
• Finally, the compute layer reads the data from the storage layer and provides the result to the user
• The results will get cached for 24 hours, if we run the same query again then the results will
automatically get fetched from the cached data
Snowflake versions
Some unique Snowflake features
• These are some of unique or specific features of Snowflake
• Micro Partitions
• Time Travel
• Compressed Storage Cost
• Data Caching
• Data Sharing
• Virtual Warehouse
• Zero Clone Copy
• Micro Partitioning
• 7 days Fail Safe
• Cross-Cloud DB Replication
Opening Snowflake test account
• Get a one-time email or use a dedicated “spam” e-mail, no payment information
required
• Follow this step-by-step guide
• Go to signup page – [Link] – and fill-in required information
• Select Snowflake Edition ( unless you need to test features from Business-Critical Edition,
choose Enterprise Edition )
• Choose one of three cloud providers – AWS, Microsoft Azure or Google Cloud Platform (GCP)
• Physical location - choose a region close to your physical location; this reduces latency
• Go to your email inbox and activate the account - you will find the URL to access your
Snowflake account
• Choose a Username and Password
Opening Snowflake test account

What do we get with trial account


• Trial account includes $400 worth of credits for roughly 40-day period, whichever occurs
first
• Trial can be started using any Snowflake edition, cloud platform, or region
• At the end of the trial, your account will be suspended
• You can still log into a suspended account, but you cannot use any features
• Once the trial is over, the account can be converted to a paid account by adding credit
card details
Best practices
Introduction
High-level data manipulation steps:
• Data flows into in DWH through data ingestion tools like Airflow/Prefect, making sure raw data is
available.
• Data is transformed using SQL and modern data transformation tools like dbt
• Data then flows out of it to business users and data visualization platforms

It is imperative that it is done correctly, considering different factors like development


and production, security, and business use cases

Focus of this chapter is how to organize your DWH so that as little as possible goes wrong

We cover decisions on your data warehouse solution, databases, schemas and different
types of tables
Database organization
Raw data:
• It is “highly recommend” to create a Snowflake
database to ingest all of raw data
• This way you can re-run data models in the case of
errors
• It is imperative that the only “system” has full
access to this is your ingestion tool – Airflow,
Airbyte, Prefect, …
• Only other “user” that loads this database will be another
ingestion tool
• No human user can do anything on ingestion, except reading
rights
• All manual data correction should be, ideally,
possible through tasks in ingestion tools, or as one-
time user “incident” interventions
Database organization
Transformed data – BASE models
• Usually there is a couple of types of
transformations
• We start with “BASE” models – basic
transformations that are done on the raw data

• Best practice - avoid having your transformations


read from your raw database
• They should always read from another Snowflake
database that contains them – they are almost
same as “RAW” database but with basic
transformations such as data type casting and
field name changes
Database organization
Transformed data – BASE models
• “BASE” data models are usually views rather
then tables
• This saves costs within your Snowflake
warehouse because you aren’t storing a full
copy of the underlying data
• Because this data is always the same, no
matter in development or production, there is
no need to create separate environments
• Views don’t need to be automated and
deployed each day because they simply read
from your raw data
Database organization
Transformed data – DATAMART models
• “DATAMART” models are more complex objects
that you build with tools like dbt
• These require both a development and
production environment
• You don’t want one database to contain both
development and production models, so it is
best to create a different database for each
Database organization
Transformed data – DATAMART models
• They both read from the tables in the “BASE”
database
• One is for testing the creation of your data models
• Other one is validated, orchestrated, and depends on the
business
• Tables created in Snowflake using dbt are
transient tables - these are similar to
permanent tables except they don’t have a full
history available on Snowflake
• This helps save on storage costs but is another reason why
it’s important to always have a copy of all your raw data
Database organization
Transformed data – Reporting and
experimentation
• Analysts usually write longer one-off queries for
reporting and experimentation purposes and they
need to be stored so that they can be accessed by
business users and visualization tools
• Because these are usually only written once and
don’t need to be automated to run every day, you do
not want to store these in development or production
• Most of reporting is done within BI tool, using the
data models built in production, rather than within
Snowflake
• However, if you are automating your reporting in
Snowflake, creating these reports in
“DATA_MART_PROD” is probably a better idea
Database organization
Organizing schemas

• Snowflake schemas act as a more granular way to


organize tables and views within your database
• RAW data and BASE models
• “RAW” and “BASE” databases should have same schemas
• In both of these databases you should create a schema for every
data source
Database organization
Organizing schemas
• DEV and PROD models
• “DATA_MART_DEV” and “DATA_MART_PROD” should have the same schemas.
• Inside these we usually start with two schemas: “INTERMEDIATE” and “CORE”
• These terms are commonly used in dbt documentation.
• Intermediate models are those that come in between base models and the
final product, or the core data model - they are the output tables of the SQL
files that don’t necessarily get used for analysis, but are an important step in
building the final model
• The only person that really needs access to these is the analytics engineer,
or the one who coded them.
• Core models are those that are the final product of a data model – they are
the table that results from the very last SQL file in a sequence of code
• These are the ones that data analysts and business users will need to
access in your production environment
• All analyses, dashboards, and reports will be built from these data models
Database organization
Roles and users
• Setting up the correct roles and permissions strong keeps
data integrity
• Not everyone within your organization is going to know how to
use Snowflake or properly query the data
• When configuring Snowflake environment you should start
with creating users, then using the users to determine the
Snowflake roles needed for your environment
• From there you can create your corresponding warehouse and
reset the defaults of each user
Database organization
Roles and users
• Creating users:
• Every person and tool that accesses your data warehouse should have its own
user
• This will keep the reigns tight on who has permission to access certain
resources
• It will also allow you to track credit usage and changes made.
• Also, you may not want everyone in your company to have access to Snowflake

• When creating a user, you will have to assign them a temporary password
• Make sure you set MUST_CHANGE_PASSWORD=true for security purposes for
any human user
• When you are creating a user for a tool, you do not want to select this
• Make sure you assign it a secure password from the start and store it in a
secrets manager
Database organization
Roles and users
• Each user should have a default warehouse to help manage costs and usage of your
data warehouse
• This isn’t necessarily a security feature, but it will help you understand where your
Snowflake credits are being utilized
• Assign the user the warehouse related to their role
• Creating roles
• It's a best practice to create a different role for the ingestion tool, analyst, engineer,
and business user within your organization
• You can also create specific roles for BI tools
• Creating a different role for each user and tool will allow you to control what each of
these users can and cannot do within each database and its schemas.
• Ingestion tool permissions - Access to RAW - Can view, select, and create in RAW
• This is the only role that should be allowed to write to your RAW database
• Your ingestion tool is the only thing that should be creating schemas and tables
within your RAW data location
• This is the most important role within your database because of the power it has to
access raw data
Database organization
Roles and users
• Analytics Engineer permissions:
• Access to RAW, BASE, DATA_MART_DEV, DATA_MART_PROD, and RDA
• Can view and select from RAW, BASE, and DATA_MART_PROD
• Can view, select, and create in DATA_MART_DEV and RDA
• The analytics engineer, or whoever is writing the dbt data models, should be
the only one who has access to the RAW database
• However, they should only be able to view and select from it
• Business user permissions:
• Access to BASE, DATA_MART_PROD, and RDA
• Can view and select from each of these
• Business users should have the ability to view data and do basic selections,
but not alter the data in any way
Database organization
Snowflake’s permission hierarchy
• You need to grant certain permissions to the database,
schema, tables/views, and future tables/views
• Snowflake’s permissions are unique in that you can’t assign a
permission to the database and expect it to also apply for the
schemas and tables/views within that database
• Grant permissions to databases
• MONITOR allows a role to see details on an object; you will want to grant this to
all roles who you want to see the database.
• USAGE allows a role to use a database; you will need to grant this to any roles
that wish to query the database.
• CREATE grants a role the ability to create an object within the database
Database organization
Snowflake’s permission hierarchy
• Grant permissions to schemas
• You need to grant the MONITOR privilege on schemas within your databases to the roles
• USAGE allows a role to use a schema; you will need to grant this to any roles that wish to
query the schema
• CREATE also exists on the schema level - you will need to assign this to all the roles for the
schemas within the databases
• Grant permissions to tables and views
• Here is where things get different - tables and views have unique privileges from databases
and schemas.
• SELECT allows a role to select from a table or view
• INSERT and DELETE are all permissions that should be given to users on “DATA_MART_DEV”
and “DATA_MART_PROD”
• Since this is the role assigned to the dbt user, it needs to be able to make changes to
tables
• Keep in mind that these permissions only work for tables, not views.
• With tables, views and schemas, you also need to make sure you grant permission to
future tables and views within a schema or database
Final thoughts
Start the process of building Snowflake architecture by planning and designing
the databases, schemas, and tables/views

Strong documentation of every piece in Snowflake data warehouse architecture is


necessary in medium to bigger projects

If you decide something isn’t working for you, don’t be afraid to change it - there
are many different ways to organize Snowflake data warehouse architecture
Selected Snowflake
features
Lets take a look on Snowflake
GUI
Integrated Web GUI is one of the best database web-based administration on the
market

Alternative freeware tool – Dbeaver – [Link]

My login ( will be deleted after this session ): onetimemailsnow@[Link]

Username: TestUser

Snowflake123!
Virtual warehouse overview
Warehouses are required for queries, as well as all DML operations, including
loading data

Warehouse is defined by its size, as well as the other properties that can be set to
help control and automate warehouse activity

Warehouses can be started and stopped at any time

They can also be resized at any time, even while running, to accommodate the
need for more or less compute resources, based on the type of operations being
performed by the warehouse
Virtual warehouse overview
Size specifies the amount of compute resources available per cluster in a warehouse
Warehouse Credits / H Credits / Sec Notes
Size our ond
X-Small 1 0.0003 Default size for warehouses created using CREATE WAREHOUSE.

Small 2 0.0006
Medium 4 0.0011
Large 8 0.0022
X-Large 16 0.0044 Default for warehouses created in the web interface.

2X-Large 32 0.0089
3X-Large 64 0.0178
4X-Large 128 0.0356
5X-Large 256 0.0711 Generally available in Amazon Web Services regions, and in preview in US
Government and Azure regions.

6X-Large 512 0.1422 Generally available in Amazon Web Services regions, and in preview in US
Government and Azure regions.
Virtual warehouse overview
Impact of size on data loading
• Increasing the size of a warehouse does not always improve data loading performance
• Data loading performance is influenced more by the number of files being loaded (and
the size of each file) than the size of the warehouse
• [Link]

Impact of size on query performance


• The size of a warehouse can impact the amount of time required to execute queries
submitted to the warehouse, particularly for larger, more complex queries
• In general, query performance scales with warehouse size because larger warehouses
have more compute resources available to process queries
• [Link]
Virtual warehouse overview
Auto-suspension and Auto-resumption
• A warehouse can be set to automatically resume or suspend, based on activity
• By default, auto-suspend is enabled - Snowflake automatically suspends the warehouse if it
is inactive for the specified period of time.
• By default, auto-resume is enabled - Snowflake automatically resumes the warehouse
when any statement that requires a warehouse is submitted and the warehouse is the
current warehouse for the session

This can be used to simplify and automate your monitoring and usage of
warehouses

Auto-suspend ensures that you do not leave a warehouse running (and consuming
credits) when there are no incoming queries

Auto-resume ensures that the warehouse starts up again as soon as it is needed


Virtual warehouse overview
Query Processing and Concurrency
• The number of queries that a warehouse can concurrently process is determined by the
size and complexity of each query
• As queries are submitted, the warehouse calculates and reserves the compute resources
needed to process each query
• If the warehouse does not have enough remaining resources to process a query, the
query is queued, pending resources that become available as other running queries
complete
Data sharing and marketplace
Accessed through GUI

External page – [Link]

Please also take a look on two important concepts


• Direct share - simplest form of data sharing that enables account-to-account sharing of data
• [Link]
• Data exchange
• [Link]

Take a look on data caching -


[Link]
Secure data sharing
Secure Data Sharing enables sharing selected objects in a database in your
account with other Snowflake accounts – you can share:
• Tables
• External tables
• Secure views
• Secure materialized views
• Secure UDFs

Snowflake enables the sharing of databases through shares, which are created by
data providers and “imported” by data consumers

All database objects shared between accounts are read-only


Secure data sharing
All sharing is accomplished through
Snowflake’s services layer and metadata
store
• This means that shared data does not take
up any storage in a consumer account and
does not contribute to his monthly data
storage charges

The only charges to consumers are for


the compute resources (i.e. virtual
warehouses) used to query the shared
data
Secure data sharing
Since no data is copied or exchanged,
setup is quick for providers and access to
data is instantaneous for consumers:
• The provider creates a share of a database
in their account and grants access to specific
objects in the database
• On the consumer side, a read-only database
is created from the share

This way you can create a network of


providers that can share data with multiple
consumers (including within their own
organization) and consumers that can
access shared data from multiple providers
Secure data sharing
Shares
• Shares are named Snowflake objects that
encapsulate all of the information required
to share a database:
• The privileges that grant access to the database(s)
and the schema containing the objects to share
• The privileges that grant access to the specific objects
in the database
• The consumer accounts with which the database and
its objects are shared
Once a database is created (in a consumer
account) from a share, all the shared objects
are accessible to users in the consumer
account:
Secure data sharing
Shares
• Usage Metrics Shared with Providers
• Consumer’s Snowflake account name
• Consumer’s Snowflake organization name
• Statistical data on the data consumption, including:
• Date of the consumption.
• The number of queries a consumer account runs on a
provider’s share
• The data is provided at the daily aggregate level
Secure data sharing
Third – party accounts
• You might wish to share data with a consumer
who does not have a Snowflake account
• The provider account uses shares to share
databases with reader accounts; however, a
reader account can only consume data from the
provider account that created it
• Users in a reader account can query data that
has been shared with it, but cannot perform any
of the DML tasks
Monitor costs and operations
Usual business case – “actively monitor Snowflake usage and money spending”

In this case - don’t invent the wheel

Tableau -
[Link]

PowerBI -
[Link]
f628dadbdc85

From these two example you can learn a lot how Snowflake system views are
organized
Time travel
• Time Travel allows us to access historical data from any
point, within a defined period, in the past, until the
time travel period
• Useful when you…
• Updated the wrong data and wanted to roll back 3–4 versions back
• Dropped the schema/table/database by mistake
• Required historical data for analysis purposes
• Time travel is feature of Enterprise Edition
• Code examples:
select * from table_name at(statement => 'QueryID')
select * from table_name before(statement =>
'QueryID')
select * from table_name at(offset => -60*5)
select * from table_name at(timestamp =>'Mon,26 dec
2020 05:06:23.189-0800':: timestamp_tz)
Time travel
• Time travel cannot be disabled for any account
• Users with an ACCOUNT ADMIN role can set the data
retention period using the period
• However, time travel can be disabled for individual
objects like databases, schemas, and tables by
specifying the DATA_RETENTION_TIME_IN_DAYS
parameter to 0 while creating.
• Once the retention period is completed, the data will
be automatically moved to the FAIL-SAFE layer, and
time travel actions can no longer be performed.
• If the retention period is defined at the account level,
then all the child objects like database, schema, and
tables will have the same value until and unless the
value is explicitly specified while creating the child
objects
Time travel
• Storage cost for time travel
• We can see the storage cost of time travel with the following query:
• SELECT * FROM snowflake.account_usage.table_storage_mterics
WHERE schema = schema_name;
• From the above query below parameters are important in calculating
the cost:
• ACTIVE_BYTES — Actual table cost
• TIME_TRAVEL_BYTES — Time travel cost
• FAIL_SAFE_BYTES — Fail Safe cost
Fail safe – method of last resort
• Fail-safe ensures historical data is protected in the
event of a system failure or other event
• Fail-safe provides a (non-configurable) 7-day
period during which historical data may be
recoverable by Snowflake - this period starts
immediately after the Time Travel retention period
ends
• This service is BEST EFFORT recovery service
• It is intended only for use when all other recovery options failed
• Fail-safe is not provided as a means for accessing historical data
after the Time Travel retention period has ended
• Data recovery through Fail-safe may take from several hours to
several days to complete
• In the web interface, account administrators can
view the total data storage for their account,
including historical data in Fail-safe
Micro partitions
Traditional data warehouses rely on static partitioning of large tables for performance and scaling
• Static partitioning has a number of limitations, such as maintenance overhead and data skew, which can result in
disproportionately-sized partitions

Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-partitioning, that
delivers all the advantages of static partitioning without the known limitations

All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage
• Each micro-partition contains between 50 MB and 500 MB of uncompressed data - actual size in Snowflake is smaller because
data is always stored compressed
• Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion
• This size and structure allows for extremely granular pruning of very large tables, which can be comprised of millions, or even
hundreds of millions, of micro-partitions

Snowflake stores metadata about all rows stored in a micro-partition, including:


• The range of values for each of the columns in the micro-partition.
• The number of distinct values.
• Additional properties used for both optimization and efficient query processing
Micro partitions
Additional benefits of micro-partitions
• Micro-partitions can overlap in their range of values, which, combined with their
uniformly small size, helps prevent skew
• Columns are stored independently within micro-partitions, often referred to as columnar
storage - this enables efficient scanning of individual columns; only the columns
referenced by a query are scanned
• Columns are also compressed individually within micro-partitions - Snowflake
automatically determines the most efficient compression algorithm for the columns in
each micro-partition
Micro partitions
Impact of Micro-partitions
• DML
• All DML operations take advantage of the underlying micro-partition metadata to facilitate table maintenance –
some operations, such as deleting all rows from a table, are metadata-only operations
• Query Pruning
• The micro-partition metadata maintained by Snowflake enables precise pruning of columns in micro-partitions at
query run-time, including columns containing semi-structured data
• The closer the ratio of scanned micro-partitions and columnar data is to the ratio of actual data selected, the more
efficient is the pruning performed on the table.
• Not all predicate expressions can be used to pruned - Snowflake does not prune micro-partitions based on a
predicate with a subquery, even if the subquery results in a constant
Micro partitions
Data clustering
• Typically, data stored in tables is sorted/ordered along natural dimensions (e.g. date
and/or geographic regions). This “clustering” is a key factor in queries because table
data that is not sorted or is only partially sorted may impact query performance,
particularly on very large tables.

• In Snowflake, as data is inserted/loaded into a table, clustering metadata is collected


and recorded for each micro-partition created during the process.

• Snowflake then leverages this clustering information to avoid unnecessary scanning of


micro-partitions during querying, significantly accelerating the performance of queries
that reference these columns
Thank you for your time

You might also like