0% found this document useful (0 votes)

9 views62 pages

Redshift Data Warehouse

The document discusses modernizing data warehouses using Amazon Redshift, highlighting its architecture, features, and benefits such as reduced maintenance costs and faster data processing. It emphasizes the advantages of a lake house approach, the use of columnar storage, and advanced query acceleration with AQUA technology. Additionally, it covers data warehousing trends, migration strategies, and various technical concepts related to Amazon Redshift's functionality.

Uploaded by

Narsireddy Avula

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views62 pages

Redshift Data Warehouse

Uploaded by

Narsireddy Avula

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Modernize your data warehouse

Aneesh Chandra PN
Data & Analytics Specialist Solutions Architect
Amazon Web Services

• Architecture and concepts

• Accelerating your data warehouse migration

• Additional resources

• Q&A
Data warehousing trends

Data
010010010
01010001
100010100

Exponential growth End-to-end insights from Migrations

of event data analyzing all of your data to the cloud
Why Amazon Redshift
Tens of thousands of customers use Amazon Redshift and process over 2 EB of data per day

Lake formation catalog & security 3x faster than other Usage-based, RIs
cloud data warehouses Predictable costs
Exabyte querying & AWS integrated
(e.g., AWS DMS, Amazon CloudWatch)

Virtually unlimited AWS-grade security (e.g., VPC, encryption Easy to provision & manage, automated
elastic linear scaling with AWS KMS, AWS CloudTrail) backups, AWS support, and 99.9% SLAs
Data warehouse
(business data)

Amazon
Redshift

Data lake
(event data)

Customers moving to data lake architectures

Amazon Redshift enables you to have a lake house approach
“ Moving to Amazon Redshift has helped us reduce our overall maintenance costs
by nearly 80% compared with our legacy data warehouse. By leveraging Amazon
Redshift Spectrum’s ability to query data directly in our Amazon S3 data lake, we
have been able to easily integrate new data sources in hours, not days or weeks.
This has not only reduced our time to insight, but it helped us control our
infrastructure costs. Amazon Redshift requires very little maintenance, to the
point where we don’t even have a dedicated administrator, and we spend less

”
than an hour a month on maintenance and administration.
Elliott Cordo
VP of Data Analytics
Equinox
Tens of thousands of customers use Amazon Redshift
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
PostgreSQL
OLAP

MPP

PostgreSQL Columnar
AWS Cloud

Amazon Amazon IAM Amazon

SWF VPC EC2

OLAP

MPP

PostgreSQL Columnar

Amazon S3 AWS KMS Amazon Amazon

Route 53 CloudWatch
AWS Cloud

Amazon Amazon IAM Amazon

SWF VPC EC2

OLAP

MPP

PostgreSQL Columnar

Amazon Redshift

Amazon S3 AWS KMS Amazon Amazon

Route 53 CloudWatch
Features delivered to meet customer needs
Robust result Large # of tables support Copy command support
set caching ~20,000 for ORC, Parquet
IAM role chaining Elastic resize Groups
Performance enhancements:
Amazon Redshift Spectrum: date CloudWatch
Auto- Health and performance Automatic table hash join, vacuum, window functions,
formats, scalar JSON and ION file
monitoring with support for resize ops, aggregations, console,
format support, region expansion,
predicate filtering
analyze Amazon CloudWatch distribution style WLM queues union all, efficient compile
code cache

200+
Unload ~25 query monitoring
Auto WLM AQUA (Advanced Query Accelerator)
to CSV rules (QMR) support

Resiliency of
Concurrency scaling DC1 migration to DC2
ROLLBACK processing
Autoanalyze for
Manage multi-part Spectrum Request Apply new
incremental changes
query in AWS console
on table Accelerator distribution key

Performance: Bloom filters in

Amazon Redshift
Faster classic joins, complex queries that Amazon Redshift Spectrum:
Spectrum: Row group
filtering in Parquet and
ORC, nested data support,
resize with new features in the create internal table,
communication layer
Concurrency scaling

enhanced VPC routing,

multiple partitions
optimized data
transfer protocol
past 18 months Integration with
Autovacuum sort,
autoanalyze, and
AWS Lake Formation
autotable sort
Performance: Join pushdowns
Snapshot scheduler
Auto WLM with to subquery, mixed workload
temporary tables, rank functions,
Advisor recommendations AZ64 compression Console
query priorities Stored procedures null handling in join, for distribution keys encoding redesign
single row insert
Column-level access Performance of Federated Materialized Manual pause
Spatial processing control with RA3 inter-region
AWS Lake Formation snapshot transfers query views and resume
Amazon Redshift architecture
Massively parallel, shared-nothing SQL clients/BI tools
columnar architecture JDBC/ODBC

Leader node Leader

SQL endpoint node
Stores metadata
Coordinates parallel SQL processing
Compute Compute Compute
Compute nodes node node node
Local, columnar storage
Executes queries in parallel Load
Load, unload, backup, restore Unload
Backup
Restore
Amazon Redshift Spectrum nodes
Execute queries directly against Amazon S3
Amazon S3
Amazon Redshift architecture
Massively parallel, shared-nothing SQL clients/BI tools
columnar architecture JDBC/ODBC

Leader node Leader

SQL endpoint node
Stores metadata
Coordinates parallel SQL processing
Compute Compute Compute
Compute nodes node node node
Local, columnar storage
Executes queries in parallel Load Amazon
Load, unload, backup, restore Query ... Redshift
1 2 3 4 N Spectrum
Amazon Redshift Spectrum nodes
Execute queries directly against Amazon S3
Amazon S3
Amazon Redshift evolving architecture
Massively parallel, shared-nothing
SQL clients/BI tools
columnar architecture
Leader node JDBC/ODBC

Compute nodes
Leader
Amazon Redshift Spectrum nodes
node

Amazon Redshift managed storage

Pay separately for storage and compute Compute Compute Compute
Large high-speed SSD-backed cache node node node
Automatic scaling (up to 64 TB/instance)
Supports up to 8.2 PB of cluster storage
AQUA: Advanced Query Accelerator (preview)
A new distributed and hardware-accelerated processing layer that will make Amazon Redshift
10x faster than any other cloud data warehouse without increasing cost

Compute
Compute Compute Compute
Compute
Compute Compute
Compute Compute
Compute
Clusters
Clusters
RA3
Clusters
Compute
Clusters
Clusters
RA3
Clusters
Compute
Clusters
Clusters
RA3
Clusters Minimize data movement over the network
Clusters Clusters Clusters
cluster cluster cluster by pushing down operations to AQUA nodes

AQUA nodes with custom AWS-designed analytics

processors to make operations (compression,
AQUA AQUA AQUA AQUA encryption, filtering, and aggregations) faster than
node node node node
traditional CPUs

Available in preview only with RA3 – no code

changes required

Amazon Redshift managed storage

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Terminology and concepts: Node types
Amazon Redshift analytics – RA3 (new)
Amazon Redshift managed storage – solid-state disks + Amazon S3

Dense compute – DC2

Solid-state disks

Dense storage – DS2

Magnetic disks

Instance type Disk type Size Memory CPUs Slices

RA3 4xlarge (new) RMS Scales to 16 TB 96 GB 12 4

RA3 16xlarge (new) RMS Scales to 64 TB 384 GB 48 16

DC2 large SSD 160 GB 16 GB 2 2

DC2 8xlarge SSD 2.56 TB 244 GB 32 16

DS2 xlarge Magnetic 2 TB 32 GB 4 2

DS2 8xlarge Magnetic 16 TB 244 GB 36 16

Terminology and concepts: Columnar
Amazon Redshift uses a columnar architecture for storing data on disk

Goal: Reduce I/O for analytics queries

Physically store data on disk by column rather than row

Read only the column data that is required

Example: Columnar architecture
CREATE TABLE deep_dive (
aid INT --audience_id
aid loc dt
,loc CHAR(3) --location 1 SFO 2017-10-20
);
,dt DATE --date
2 JFK 2017-10-20
3 SFO 2017-04-01
4 JFK 2017-05-14
aid loc dt

SELECT min(dt) FROM deep_dive;

Row-based storage
• Need to read everything
• Unnecessary I/O
Example: Columnar architecture
CREATE TABLE deep_dive (
aid INT --audience_id
aid loc dt
,loc CHAR(3) --location 1 SFO 2017-10-20
);
,dt DATE --date
2 JFK 2017-10-20
3 SFO 2017-04-01
4 JFK 2017-05-14
aid loc dt

SELECT min(dt) FROM deep_dive;

Column-based storage
Only scan blocks for relevant column
Terminology and concepts: Compression
Goals Impact
Allow more data to be stored within Allows 2–4x more data
an Amazon Redshift cluster to be stored within the cluster
Improve query performance
by decreasing I/O

By default, COPY automatically analyzes and compresses data on first load

into an empty table

ANALYZE COMPRESSION is a built-in command that finds the optimal compression

for each column on an existing table
Compression example
CREATE TABLE deep_dive (
aid INT --audience_id
aid loc dt
,loc CHAR(3) --location 1 SFO 2017-10-20
);
,dt DATE --date
2 JFK 2017-10-20
3 SFO 2017-04-01
4 JFK 2017-05-14
aid loc dt

Add 1 of 13 different encodings

to each column
Compression example
CREATE TABLE deep_dive (
aid INT ENCODE AZ64
aid loc dt
,loc CHAR(3) ENCODE BYTEDICT 1 SFO 2017-10-20
);
,dt DATE ENCODE RUNLENGTH
2 JFK 2017-10-20
3 SFO 2017-04-01
4 JFK 2017-05-14
aid loc dt

• More efficient compression by

storing the same data type in
the columnar architecture
• Columns grow and shrink
independently
• Reduces storage requirements
• Reduces I/O
Terminology and concepts: Blocks
Column data is persisted to 1 MB immutable blocks

Blocks are individually encoded with 1 of 13 encodings

A full block can contain millions of values

Terminology and concepts: Zone maps
Goal
Eliminates unnecessary I/O

In-memory block metadata

• Contains per-block minimum and maximum values
• All blocks automatically have zone maps
• Effectively prunes blocks that cannot contain data for a given query
Terminology and concepts: Data sorting
Goal Impact
Make queries run faster by Enables range-restricted scans
increasing the effectiveness of to prune blocks by leveraging
zone maps and reducing I/O zone maps

Achieved with the table property SORTKEY defined on one or

more columns

Optimal sort key is dependent on:

• Query patterns
• Business requirements
• Data profile
Example: Sort key
CREATE TABLE deep_dive (
aid INT --audience_id Add a sort key to one or more
,loc CHAR(3) --location columns to physically sort
,dt DATE --date
the data on disk
) SORTKEY (dt, loc);

deep_dive
aid loc dt
1 SFO 2017-10-20

2 JFK 2017-10-20

3 SFO 2017-04-01

4 JFK 2017-05-14
Example: Sort key
CREATE TABLE deep_dive (
aid INT --audience_id Add a sort key to one or more
,loc CHAR(3) --location columns to physically sort
,dt DATE --date
the data on disk
) SORTKEY (dt, loc);

deep_dive deep_dive (sorted)

aid loc dt aid loc dt
1 SFO 2017-10-20 3 SFO 2017-04-01

2 JFK 2017-10-20 4 JFK 2017-05-14

3 SFO 2017-04-01 2 JFK 2017-10-20

4 JFK 2017-05-14 1 SFO 2017-10-20

Example: Zone maps and sorting
SELECT count(*) FROM deep_dive WHERE dt = '06-09-2017';

Unsorted table
MIN: 01-JUNE-2017
MAX: 20-JUNE-2017

MIN: 08-JUNE-2017
MAX: 30-JUNE-2017

MIN: 12-JUNE-2017
MAX: 20-JUNE-2017

MIN: 02-JUNE-2017
MAX: 25-JUNE-2017
Example: Zone maps and sorting
SELECT count(*) FROM deep_dive WHERE dt = '06-09-2017';

Unsorted table
MIN: 01-JUNE-2017
MAX: 20-JUNE-2017

MIN: 08-JUNE-2017
MAX: 30-JUNE-2017

MIN: 12-JUNE-2017
MAX: 20-JUNE-2017

MIN: 02-JUNE-2017
MAX: 25-JUNE-2017
Example: Zone maps and sorting
SELECT count(*) FROM deep_dive WHERE dt = '06-09-2017';

Unsorted table Sorted by date

MIN: 01-JUNE-2017 MIN: 01-JUNE-2017
MAX: 20-JUNE-2017 MAX: 06-JUNE-2017

MIN: 08-JUNE-2017 MIN: 07-JUNE-2017

MAX: 30-JUNE-2017 MAX: 12-JUNE-2017

MIN: 12-JUNE-2017 MIN: 13-JUNE-2017

MAX: 20-JUNE-2017 MAX: 21-JUNE-2017

MIN: 02-JUNE-2017 MIN: 21-JUNE-2017

MAX: 25-JUNE-2017 MAX: 30-JUNE-2017
Example: Zone maps and sorting
SELECT count(*) FROM deep_dive WHERE dt = '06-09-2017';

Unsorted table Sorted by date

MIN: 01-JUNE-2017 MIN: 01-JUNE-2017
MAX: 20-JUNE-2017 MAX: 06-JUNE-2017

MIN: 08-JUNE-2017 MIN: 07-JUNE-2017

MAX: 30-JUNE-2017 MAX: 12-JUNE-2017

MIN: 12-JUNE-2017 MIN: 13-JUNE-2017

MAX: 20-JUNE-2017 MAX: 21-JUNE-2017

MIN: 02-JUNE-2017 MIN: 21-JUNE-2017

MAX: 25-JUNE-2017 MAX: 30-JUNE-2017
Terminology and concepts: Slices

A slice can be thought of

as a virtual compute node Facts about slices

Unit of data partitioning Each compute node is initialized with

either 2 or 16 slices
Parallel query processing
Table rows are distributed to slices
A slice processes only its own data
Data distribution
Distribution style is a table property that KEY
dictates how that table’s data is distributed
throughout the cluster
KEY: Value is hashed, same value goes to same location (slice) Slice Slice Slice Slice
1 2 3 4
ALL: Full table data goes to the first slice of every node Node 1 Node 2

EVEN: Round robin

AUTO: Combines EVEN and ALL ALL
Goals EVEN
Distribute data evenly for
parallel processing Slice Slice Slice Slice Slice Slice Slice Slice
1 2 3 4 1 2 3 4
Minimize data movement Node 1 Node 2 Node 1 Node 2
during query processing
Data distribution example
CREATE TABLE deep_dive ( Table: deep_dive
aid INT --audience_id
,loc CHAR(3) --location User columns System columns
,dt DATE --date
) (EVEN|KEY|ALL|AUTO); aid loc dt ins del row

Slice 0 Slice 1 Slice 2 Slice 3

Node 1 Node 2
Data distribution, EVEN example
CREATE TABLE deep_dive ( INSERT INTO deep_dive VALUES
aid INT --audience_id (1, 'SFO', '2016-09-01'),
,loc CHAR(3) --location (2, 'JFK', '2016-09-14'),
,dt DATE --date (3, 'SFO', '2017-04-01'),
) DISTSTYLE EVEN; (4, 'JFK', '2017-05-14');

Table: deep_dive
User Columns System Columns
aid loc dt ins del row

Rows: 1
Slice 0 Slice 1 Slice 2 Slice 3

Table: deep_dive Table: deep_dive

User Columns System Columns User Columns System Columns
aid loc dt ins del row aid loc dt ins del row

Rows: 1 Rows: 1
Slice 0 Slice 1 Slice 2 Slice 3

Table: deep_dive Table: deep_dive Table: deep_dive

User Columns System Columns User Columns System Columns User Columns System Columns
aid loc dt ins del row aid loc dt ins del row aid loc dt ins del row