0% found this document useful (0 votes)
10 views20 pages

Amazon Redshift Overview and Features

Notes on the Amazon Redshift

Uploaded by

chandra.sekhar15
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views20 pages

Amazon Redshift Overview and Features

Notes on the Amazon Redshift

Uploaded by

chandra.sekhar15
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud offered by Amazon Web
Services (AWS). It is designed to handle large-scale data storage and complex queries, providing fast and scalable
data analytics capabilities.

Key Features of Amazon Redshift:

1. Scalability: Redshift can scale from a few gigabytes to petabytes of data, and you can add or remove
nodes easily based on your storage and performance needs.
2. Columnar Storage: Data is stored in a columnar format, which significantly speeds up query performance
for analytical workloads, as only the necessary columns are accessed during queries.
3. Massively Parallel Processing (MPP): Redshift uses multiple nodes to perform parallel processing,
meaning queries are executed across many servers at once, greatly improving speed.
4. Integrated with AWS Ecosystem: Redshift is tightly integrated with other AWS services like S3,
DynamoDB, EMR, and Lambda, making it easier to build a seamless data pipeline.
5. SQL Interface: Redshift supports standard SQL for querying, making it accessible to users familiar with
SQL. It also supports advanced features like window functions, joins, and subqueries.
6. Performance Optimization: Redshift provides various performance optimizations, including automated
data compression, distribution styles (to control how data is spread across nodes), and query optimization
techniques.
7. Security: Redshift offers encryption at rest and in transit, VPC isolation, IAM (Identity and Access
Management) integration for access control, and audit logging.
8. Cost-Effective: With a pay-as-you-go pricing model, Redshift can be more cost-efficient compared to
traditional on-premises data warehouses. It also offers Reserved Instances for cost savings over time.

Use Cases:

 Business Intelligence (BI): Companies use Redshift for analyzing large datasets and generating insights
through BI tools like Tableau, Power BI, or AWS QuickSight.
 Data Warehousing: Redshift is often used as a central repository to store data from different sources for
further analysis.
 Data Lakes: With integration with S3, Redshift can query data stored in a data lake without requiring
movement or ETL processing, providing faster insights.

How It Works:

1. Cluster: A Redshift cluster is composed of one or more nodes. The leader node coordinates query
processing, while compute nodes handle the query execution.
2. Data Distribution: Data is distributed across compute nodes based on a chosen distribution style, which
optimizes query performance by reducing data shuffling during queries.
3. Query Execution: When a query is executed, it is parsed, planned, and optimized, and then distributed
across the nodes for parallel execution.

Redshift is widely used by organizations of all sizes for fast, scalable, and secure data analytics in the cloud,
especially when dealing with large datasets and complex queries.
What is columnar storage?

Columnar storage is a method of organizing and storing data in a database where data is stored by columns rather
than by rows. This contrasts with row-based storage, where each record (or row) is stored together, and each
column is stored in sequence for each row.

In columnar storage, all the values for a particular column are stored together in contiguous blocks of memory or
disk. For example, if you have a database table with columns for "Name," "Age," "Country," etc., the columnar
storage system would store all the "Name" values together, all the "Age" values together, and all the "Country"
values together, rather than storing each entire row (e.g., "John, 30, USA") in a sequential manner.

Benefits of Columnar Storage:

1. Improved Query Performance:


o Faster Reads for Analytical Queries: In analytical workloads, queries often involve filtering or
aggregating data based on one or a few columns (e.g., "SELECT * FROM table WHERE Age > 30").
Since all the data for each column is stored together, columnar storage allows the system to only
read the relevant columns for the query, reducing I/O and improving performance.
o Data Compression: Since data in a column is usually of the same type and often has a similar
pattern (e.g., many values may be the same), columnar storage can achieve better compression
rates. For example, a column of integers may compress more efficiently than rows of mixed data
types.
2. Efficient Storage:
o Compression: Columns often contain many repeated values (such as many rows having the same
country or age), so columnar databases can use more efficient compression algorithms (like run-
length encoding or dictionary encoding). This reduces the amount of storage required for the
data.
o Sparse Data Handling: Columnar formats can efficiently handle sparse data (data with many
missing values) because only the columns with data need to be stored, and the missing values
don't need to consume storage space.
3. Optimized for Analytical Workloads:
o Analytical workloads usually involve operations like SELECT, GROUP BY, JOIN, and AGGREGATE
queries that focus on large sets of data from specific columns. Columnar storage is ideal for such
operations because:
 Aggregation can be done directly on a column without needing to scan entire rows.
 Scanning only the relevant columns for filtering and aggregation drastically reduces the
time spent on I/O.

Example:

Let’s say you have the following table in a traditional row-based format:

Name Age Country


John 30 USA
Sarah 25 UK
Emily 35 Canada
James 40 USA

In a row-based storage system, the rows are stored together in the following format:
[John, 30, USA]
[Sarah, 25, UK]
[Emily, 35, Canada]
[James, 40, USA]

In a columnar storage system, however, the data is stored as follows:

Names: [John, Sarah, Emily, James]


Ages: [30, 25, 35, 40]
Countries: [USA, UK, Canada, USA]

Now, if you want to run a query like SELECT AVG(Age) FROM table WHERE Country = 'USA', the columnar system
can quickly retrieve just the "Age" column and filter the rows where the "Country" column is 'USA', making it much
faster than scanning all rows.

Use Cases for Columnar Storage:

 Data Warehousing: Columnar storage is ideal for data warehousing and analytics where queries often
involve large datasets and only a few columns are needed at a time.
 Business Intelligence (BI): Tools like Tableau, Power BI, or AWS QuickSight, which often focus on specific
metrics or aggregations, can benefit greatly from columnar storage to speed up data retrieval.
 Big Data Analytics: In systems like Amazon Redshift, Google BigQuery, or Apache Parquet, which are
designed to handle massive datasets efficiently, columnar storage allows for faster processing of queries
and better compression.

In Summary:

Columnar storage is highly effective for analytical queries where you need to work with large datasets but only a
subset of columns at a time. By storing data by columns instead of rows, it improves query performance, reduces
I/O operations, and often leads to better storage efficiency, particularly in data-intensive environments like data
warehouses and analytics platforms.

Key Components of Amazon Redshift Architecture:

Leader Node

The Leader Node in a Redshift Cluster manages all external and internal communication. It is responsible for
preparing query execution plans whenever a query is submitted to the cluster. Once the query execution plan is
ready, the Leader Node distributes query execution code on the compute nodes and assigns slices of data to each
to compute node for computation of results.

Leader Node distributes query load to compute node only when the query involves accessing data stored on the
compute nodes. Otherwise, the query is executed on the Leader Node itself. There are several functions in Redshift
architecture which are always executed on the Leader Node.

Compute Nodes

Compute Nodes are responsible for actual execution of queries and have data stored with them. They execute
queries and return intermediate results to the Leader Node which further aggregates the results.

There are two types of Compute Nodes available in Redshift architecture:


 Dense Storage (DS) – Dense Storage nodes allow you to create large data warehouses using Hard Disk
Drives (HDDs) for a low price point.
 Dense Compute (DC) – Dense Compute nodes allow you to create high-performance data warehouses
using Solid-State Drives (SSDs).

A more detailed explanation of how responsibilities are divided among Leader and Compute Nodes is depicted in
below diagram:

Node slices
A compute node consist of slices. Each Slice has a portion of Compute Node’s memory and disk assigned to it
where it performs Query Operations. The Leader Node is responsible for assigning a Query code and data to a slice
for execution. Slices once assigned query load work in parallel to generate query results.

Data is distributed among the Slices on the basis of Distribution Style and Distribution Key of a particular table. An
even distribution of data enables Redshift to assign workload evenly to slices and maximizes the benefit of parallel
processing.

Number of Slices per Compute Node is decided on the basis of the type of node.

Massively parallel processing (MPP)

Redshift architecture allows it to use Massively parallel processing (MPP) for fast processing even for the most
complex queries and a huge amount of data. Multiple compute nodes execute the same query code on portions of
data to maximize parallel processing.

Columnar Data Storage

Data in Redshift is stored in a columnar fashion which drastically reduces the I/O on disks. Columnar storage
reduces the number of disk I/O requests and minimizes the amount of data loaded into the memory to execute a
query. Reduction in I/O speeds up query execution and loading less data means Redshift can perform more in-
memory processing.

Redshift uses Sort Keys to sort columns and filter out chunks of data while executing queries.

Data compression

Data compression is one of the important factors in ensuring query performance. It reduces storage footprint and
enables loading of large amounts of data in the memory fast. Owing to Columnar data storage, Redshift can use
adaptive compression encoding depending on the column data type.

Query Optimizer

Redshift’s Query Optimizer generate query plans that are MPP-aware and takes advantage of Columnar Data
Storage. Query Optimizer uses analyzed information about tables to generate efficient query plans for execution.

13 Key points to remember about Amazon Redshift

1. Massively Parallel Processing (MPP) Architecture

Amazon Redshift has a Massively Parallel Processing Architecture. MPP enables Redshift to distribute and
parallelize queries across multiple nodes. Apart from queries, the MPP architecture also enables parallel
operations for data loads, backups and restores. Redshift architecture is inherently parallel; there is no additional
tuning or overheads for distribution of loads for the end users.

2. Redshift supports Single Node Clusters to 100 Nodes Clusters with up to 1.6 PB of storage

You can provision a Redshift cluster with from a single Node to 100 Nodes configuration depending on the
processing and storage capacity required. Redshift nodes come in two sizes XL & 8XL. XL node comes with 2 TB
attached storage and 8XL node comes with 16 TB attached storage. Clusters can have a maximum of 32 XL nodes
(64 TB) or 100 8XL nodes (1.6 PB).

3. Redshift does not support multi AZ deployments

Redshift clusters currently support only Single AZ deployments. You will not be able to access Redshift n case of an
Availability Zone failure. An AZ failure will not affect the durability of your data, you can start using the cluster once
the AZ is available. To ensure continuous access to your data, you can launch an additional cluster in different AZ.
You can restore a new Redshift cluster in a different AZ by recreating it using the snap shots of the original cluster.
Alternately, you can have a cluster running always in a different AZ, accessing the same set of data from S3.

4. Columnar Storage & Data Compression

Redshift provides columnar data storage. With Columnar data storage, all values for a particular column are stored
contiguously on the disk in sequential blocks.

Columnar data storage helps reduce the I/O requests made to the disk compared to a traditional row based data
storage. It also reduces the amount of data loaded from the disk improving the processing speed, as more memory
is available for query executions.

As similar data is stored sequentially, Redshift compresses the data rather efficiently. Compression of data further
reduces the amount of I/O required for queries.

5. Parallel uploads to Redshift are supported only for data stored in Amazon S3 & DynamoDB

Redshift currently supports data imports/copy only from S3 and DynamoDB. Using COPY command from S3 is the
fastest way to load data into Redshift. COPY loads data in parallel and is much more efficient than Insert
statement.

Redshift does not have support to load data in parallel from other sources. You will either have to use Insert
statements or write scripts to first load data into S3 and then into Redshift. This could sometime be a complex
process depending on the size and format of data available with you.

6. Redshift is Secure

Amazon provides various security features for Redshift just like all other AWS services.
Access Control can be maintained at the account level using IAM roles. For data base level access control, you can
define Redshift database groups and users and restrict access to specific database and tables.

Redshift can be launched in Amazon VPC. You can define VPC security groups to restrict inbound access to your
clusters.
Redshift allows data encryption for all data which is stored in the cluster as well as SSL encryption for data in
transit.

7. Distribution Keys

Redshift achieves high query performance by distributing data evenly on all the nodes of a cluster and slices within
a node.
A Redshift cluster is made of multiple nodes and each node has multiple slices. The number of slices is equal to the
number of processor cores in a node. Each slice is allocated a portion of node’s memory and disk space. During
query execution the data is distributed across slices, the slices operate in parallel to execute the queries.

To distribute data evenly among slices, you need to define a distribution key for a table while creating it. If a
distribution key is defined during table creation, any data, which is loaded in the table, is distributed across nodes
based on the distribution key value. Matching values from a distribution key column are stored together.

A good distribution key will ensure even load distribution across slices, uneven distributions will cause some slices
to handle more load than others, and slows down the query execution.

If a distribution key is not defined for a column, the data is by default distributed in a round robin fashion by
Redshift.

8. You cannot change the distribution key once a table is created

A distribution key for a table cannot be amended once it is created. This is very important to keep in mind while
identifying the right distribution key for a table.

To change a distribution key, the only work around is to create a new table with the updated distribution key, load
data into this table and rename the table as the original table after deleting the original table.

9. Redshift does not enforce Database Constraints or support Indexes

You can define database constraints like unique, primary and foreign keys but these constraints are informational
only and are not enforced by Redshift. These constraints, though are used by Redshift to create query execution
plans, ensuring optimal execution. If the primary key and foreign key constraints are correct, they should be
declared while creating tables to have optimal executions.

Redshift also does not support creation of secondary indexes on columns.

10. Redshift does not automatically reclaim space that is freed on deletes or updates

Redshift is based on PostgreSQL version 8.0.2 and inherits some of its limitations. One such limitation is that
Redshift does not reclaim and reuse the space freed up by delete or update commands. The free space left by
deleted or updated records in large numbers can cost some extra processing.

Every update command in Redshift first deletes the existing row and then inserts a new record with the updated
values.

To reclaim this unused space, you can run the Vacuum command. Vacuum command reclaims the freed space and
also sorts data in the disk.

Ideally there would be very little updates or deletes once data is loaded in a data warehouse, but in case it does,
you can run the Vacuum command.

11. Query Concurrency in a cluster

Redshift enforces a query concurrency limit of 15 on a cluster.


Queries in are executed in a queue, by default there is one queue per query cluster which can run up to five
concurrent queries. Users can modify the configuration to allow up to 15 queries per queue and a maximum of 8
queues.

The concurrent queries for a cluster across queues is limited to a maximum of 15. Users cannot modify this
configuration.

12. Amazon QuickSight

QuickSight is a useful tool for building dashboards and BI Reports on Redshift. It is tuned into work faster with
Redshift.

13. Amazon Redshift Utils in GitHub

Amazon Redshift Github utilities available in github have highly useful admin scripts.

Summary of Redshift Architecture:

 Leader Node: Coordinates the query execution, aggregates results, and handles metadata.
 Compute Nodes: Perform the parallel processing of queries, store data, and return results.
 Data Distribution: Data is distributed across compute nodes to leverage MPP and parallelism.
 Columnar Storage: Optimizes query performance for analytical workloads and reduces storage usage
through compression.
 Massively Parallel Processing (MPP): Distributes queries across nodes for faster execution of complex
queries.
 Fault Tolerance & Security: Provides replication, automated backups, and data encryption to ensure high
availability and security.

The architecture of Amazon Redshift is specifically designed to scale efficiently for large datasets while maintaining
fast query performance, making it an ideal solution for data warehousing and analytics in the cloud.

Amazon Redshift offers different node types to accommodate your workloads, and we recommend choosing RA3 or
DC2 depending on the required performance, data size, and expected data growth.
RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute
and managed storage independently. With RA3, you choose the number of nodes based on your performance
requirements and only pay for the managed storage that you use. Size your RA3 cluster based on the amount of
data you process daily. You launch clusters that use the RA3 node types in a virtual private cloud (VPC). You can't
launch RA3 clusters in EC2-Classic.

Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and
Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs,
Amazon Redshift managed storage automatically offloads that data to Amazon S3. You pay the same low rate for
Amazon Redshift managed storage regardless of whether the data sits in high-performance SSDs or Amazon S3. For
workloads that require ever-growing storage, managed storage lets you automatically scale your data warehouse
storage capacity separate from compute nodes.

DC2 nodes enable you to have compute-intensive data warehouses with local SSD storage included. You choose the
number of nodes you need based on data size and performance requirements. DC2 nodes store your data locally
for high performance, and as the data size grows, you can add more compute nodes to increase the storage
capacity of the cluster. For datasets under 1 TB (compressed), we recommend DC2 node types for the best
performance at the lowest price. If you expect your data to grow, we recommend using RA3 nodes so you can size
compute and storage independently to achieve improved price and performance. You launch clusters that use the
DC2 node types in a virtual private cloud (VPC).

Some node types allow one node (single-node) or two or more nodes (multi-node). The minimum number of nodes
for clusters of some node types is two nodes. On a single-node cluster, the node is shared for leader and compute
functionality. Single-node clusters are not recommended for running production workloads. On a multi-node
cluster, the leader node is separate from the compute nodes. The leader node is the same node type as the
compute nodes. You only pay for compute nodes.
Amazon Redshift applies quotas to resources for each AWS account in each AWS Region. A quota restricts the
number of resources that your account can create for a given resource type, such as nodes or snapshots, within an
AWS Region.

The cost of your cluster depends on the AWS Region, node type, number of nodes, and whether the nodes are
reserved in advance.

In Amazon Redshift, distribution style determines how data is distributed across the compute nodes in a cluster.
Choosing the right distribution style is crucial for performance optimization, as it affects data locality, query
execution speed, and overall cluster efficiency.
Types of Distribution Styles
1. AUTO (Default)
o Redshift decides the distribution style automatically based on the table size and query patterns.
o Small tables are typically set to ALL.
o Larger tables are set to EVEN or KEY, depending on query patterns.
2. EVEN
o Data is distributed evenly across all the nodes in the cluster.
o Suitable for tables where no specific column is frequently joined or filtered on.
o Avoids data skew but may result in high data transfer during joins or aggregations.
3. KEY
o Data is distributed based on the values in a specified column (distribution key).
o Rows with the same key value are stored on the same node.
o Optimal for tables frequently joined or filtered on the same column (distribution key).
o Can reduce data shuffling during query execution but may lead to data skew if the key values are
not evenly distributed.
4. ALL
o A full copy of the table is stored on each node.
o Best for small dimension tables that are frequently joined with larger fact tables.
o Eliminates data shuffling but increases storage and maintenance overhead.

Guidelines for Choosing Distribution Style


1. Small Tables
o Use ALL for small lookup or dimension tables.
2. Large Tables
o Use KEY if the table is frequently joined with other tables on the same key.
o Use EVEN if no specific column is suitable as a distribution key or the workload is highly varied.
3. AUTO Mode
o Use AUTO when you’re unsure or expect workload patterns to change, as Redshift optimizes the
distribution style over time.

Best Practices
 Analyze Query Patterns: Use queries to identify frequently joined or filtered columns.
 Avoid Skew: Choose distribution keys with high cardinality and an even distribution of values.
 Combine with Sort Keys: Align the distribution key with the sort key to improve query performance
further.

Redshift Snapshots
A snapshot in Redshift is a point-in-time backup of your cluster. These snapshots allow you to restore the cluster to
a specific state.
Types of Snapshots
1. Automated Snapshots
o Created automatically by Amazon Redshift based on your cluster's snapshot schedule.
o Retention period is configurable (default is 1 day).
o Managed entirely by Redshift.
2. Manual Snapshots
o Created manually by the user.
o Persist until explicitly deleted.
o Useful for long-term backups or sharing across accounts.
How to Create a Manual Snapshot
1. AWS Management Console:
o Navigate to the Amazon Redshift Console.
o Select the cluster.
o Choose Snapshots > Create Snapshot.
o Enter a name for the snapshot and create it.
2. AWS CLI:
aws redshift create-cluster-snapshot --cluster-identifier my-cluster --snapshot-identifier my-snapshot
3. AWS SDK: Use the CreateClusterSnapshot API.
Restoring a Snapshot
 Snapshots can be restored into a new cluster using the console, CLI, or SDK.

Sharing Redshift Snapshots


You can share manual snapshots with other AWS accounts or make them publicly accessible.
Steps to Share a Snapshot
1. Grant Access to the Snapshot
o Use the Redshift Console or CLI to add an AWS account to the snapshot's access list.
o By default, snapshots are private.
AWS Management Console:
o Navigate to the snapshot.
o Choose Actions > Share Snapshot.
o Add the AWS account ID to share with.
AWS CLI:
aws redshift modify-snapshot-attribute \
--snapshot-identifier my-snapshot \
--account-to-share-with 123456789012
2. Public Snapshots
o You can make a snapshot public by modifying its attributes:
aws redshift modify-snapshot-attribute \
--snapshot-identifier my-snapshot \
--attribute-name restore --operation add --values all
o Caution: Public snapshots are accessible to anyone.
3. Receiving the Shared Snapshot
o The recipient can view the shared snapshot in their AWS account under Snapshots > Shared
Snapshots.
o They can restore it into a new cluster.

Best Practices for Sharing


 Grant Access Carefully: Share snapshots only with trusted accounts to avoid data breaches.
 Use Tags: Tag snapshots to track their purpose and ownership.
 Enable Encryption: If the snapshot is encrypted, ensure the recipient has access to the KMS key used.

In Amazon Redshift, VACUUM is a maintenance operation used to reorganize and reclaim storage space. Over time,
as data is updated or deleted, the storage becomes fragmented, and the table can accumulate "dead rows" that
degrade query performance. Running a VACUUM operation resolves this by compacting and sorting the data.

When and Why to Use VACUUM


1. Reclaim Space:
o Reorganizes storage and removes deleted rows to free up space.
2. Improve Query Performance:
o Maintains sort order in sorted tables, improving the efficiency of query execution plans.
3. Optimize Storage:
o Ensures data is stored compactly, reducing disk I/O.

Types of VACUUM Operations


1. VACUUM FULL:
o Reclaims disk space and resorts all rows in the table.
o Useful after significant delete operations or updates.
o Can be resource-intensive and should be used during low-traffic periods.
2. VACUUM DELETE ONLY:
o Removes only the rows marked for deletion but does not resort the table.
o Faster and less resource-intensive than FULL.
o Use when there is minimal need for resorting.
3. VACUUM SORT ONLY:
o Resorts the table without reclaiming space from deleted rows.
o Useful for maintaining sort order when no major delete operations have occurred.
4. VACUUM REINDEX:
o Rebuilds interleaved sort keys for optimized query performance.
o Use after bulk data loads or significant changes to the table.

-- General syntax
VACUUM [ FULL | SORT ONLY | DELETE ONLY | REINDEX ]
[ table_name ]
[ TO threshold_percent ];

-- Example: Vacuum a specific table


VACUUM FULL my_table;

-- Example: Vacuum only rows with more than 20% fragmentation


VACUUM my_table TO 20;

Options
 TO threshold_percent:
o Specifies the minimum percentage of unsorted rows to trigger the VACUUM operation.
o Default is 95%.

Best Practices
1. Analyze Table Usage:
o Check the STV_BLOCKLIST system table to identify tables with unsorted or deleted blocks:
SELECT tbl, name, unsorted, size, deleted FROM svv_table_info WHERE unsorted > 0 OR deleted > 0;
2. Schedule VACUUM During Low Activity:
o VACUUM is resource-intensive and can impact cluster performance.
o Run during maintenance windows or periods of low query activity.
3. Use the Right Type:
o Use DELETE ONLY for tables with significant deletes.
o Use SORT ONLY if resorting is needed without space reclamation.
4. Avoid Frequent VACUUMs:
o Frequent small VACUUM operations can increase overhead.
o Instead, batch updates/deletes and run VACUUM afterward.
5. Combine with ANALYZE:
o Run ANALYZE after VACUUM to update table statistics for query optimization:
ANALYZE my_table;
6. Enable Automatic Table Optimization:
o Redshift automatically reclaims space for small tables and resorts rows for certain large tables.
o This reduces the need for manual VACUUM operations.

Loading Data from S3 into Redshift

To load data from S3 into Redshift, use the COPY command.

Steps to Load Data:


1. Prepare Data in S3:
o Format: CSV, JSON, Parquet, ORC, or AVRO.
o Optional: Compress using GZIP or BZIP2.
2. Run the COPY Command:
COPY my_table
FROM 's3://my-bucket/my-data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS JSON 'auto';
 Replace FORMAT AS JSON 'auto' with the appropriate format (e.g., CSV, PARQUET).
 Use DELIMITER for custom delimiters:
FORMAT AS CSV DELIMITER ',' IGNOREHEADER 1;
Benefits:
 High-performance parallel data loading.
 Supports bulk inserts for large datasets.

Unloading Data from Redshift to S3


The UNLOAD command exports query results from Redshift to S3.
Steps to Unload Data:
1. Run the UNLOAD Command:
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/my-exported-data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
PARALLEL OFF
ALLOWOVERWRITE;
 Use PARALLEL OFF for a single file or PARALLEL ON for multiple files.
 Use DELIMITER to specify a custom delimiter.
Benefits:
 Efficiently exports large query results to S3 for archival or further processing.
 Supports various formats like CSV and JSON.

Integration with Data Pipelines


AWS Glue:
 Use Glue ETL jobs to transform and load data into Redshift from S3 or vice versa.
AWS Data Pipeline:
 Automate workflows for moving and processing data between Redshift and S3.
Amazon Kinesis Data Firehose:
 Load streaming data directly into S3 and use COPY to ingest it into Redshift.

Amazon Redshift integrates with AWS Lambda to extend its capabilities, allowing you to invoke Lambda functions
for tasks like:
 Data Enrichment: Fetching additional data from external sources in real time.
 Custom Transformations: Applying complex logic to data.
 Event Processing: Triggering downstream actions from Redshift queries.
 Machine Learning Integration: Running ML models hosted in Lambda and using the results in Redshift.

How Redshift Integrates with Lambda


1. Invoking Lambda Functions from Redshift Using Lambda UDFs
Amazon Redshift supports Lambda User-Defined Functions (UDFs), which allow SQL queries to call Lambda
functions.
Steps to Set Up Lambda Integration
1. Create a Lambda Function
o Write the function to perform the desired task.
import json

def lambda_handler(event, context):


input_value = event['value']
# Custom logic (e.g., add 10)
result = input_value + 10
return {"result": result}
o Deploy the function in AWS Lambda.

2. Grant Redshift Permission to Invoke the Lambda


o Attach an IAM policy to the Redshift IAM role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "lambda:InvokeFunction",
"Resource": "arn:aws:lambda:region:account-id:function:function-name"
}
]
}
3. Create a Lambda UDF in Redshift
Use the CREATE FUNCTION command to define the UDF.
CREATE FUNCTION add_ten(value INT)
RETURNS INT
IMMUTABLE
LAMBDA 'arn:aws:lambda:region:account-id:function:function-name';
4. Call the Lambda UDF in Queries:
Use the UDF in your SQL queries:
SELECT add_ten(5);

2. Event-Driven Integration Using Lambda


Amazon Redshift can trigger Lambda functions in response to events using Event Notifications.
1. Enable Event Notifications
o Configure event subscriptions for the Redshift cluster.
o Event categories: configuration, monitoring, security, etc.
o Example: Trigger a Lambda function when a snapshot is created.
2. Create an Event Subscription
o Use the AWS Management Console, CLI, or SDK to subscribe to Redshift events.
aws redshift create-event-subscription \
--subscription-name my-event-subscription \
--sns-topic-arn arn:aws:sns:region:account-id:my-sns-topic \
--source-type cluster \
--source-ids my-cluster \
--event-categories monitoring \
--severity INFO
3. Trigger Lambda via SNS
Configure the SNS topic to trigger the Lambda function.
3. Combining Redshift with Lambda for ETL Workflows
Lambda can serve as part of an ETL pipeline:
 Ingestion: Push data from external systems into S3 or directly into Redshift.
 Transformation: Perform data transformations and updates in Redshift.
 Downstream Processing: Trigger actions like notifications or API calls.
Example Workflow:
1. Load data into Redshift using COPY.
2. Trigger Lambda to validate or transform the data.
3. Update Redshift tables based on the results.

Amazon Redshift federated queries allow you to query and combine data across your Amazon Redshift cluster and
operational data stores like Amazon RDS, Aurora PostgreSQL, and other PostgreSQL databases without moving or
copying the data. This feature is beneficial for real-time analytics and combining historical and operational data.

How Federated Queries Work


 Redshift uses a federated query feature to connect to external databases via PostgreSQL-compatible
connections.
 You can join tables from external databases with Redshift tables or directly query the external databases.

Setting Up Federated Queries


1. Enable Federated Query in Redshift
 Ensure your Amazon Redshift cluster is RA3 instance type as federated queries are only supported on RA3
node types.
2. Configure the External Database
 Use Amazon RDS or Aurora PostgreSQL (or a compatible PostgreSQL database).
 Ensure the external database is accessible from the Redshift cluster. This may involve setting up VPC
peering, security groups, and subnet configurations.
3. Grant Permissions
 Attach an IAM role to the Redshift cluster that includes the rds-db:connect permission for the external
database.
IAM Policy Example:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "rds-db:connect",
"Resource": "arn:aws:rds-db:region:account-id:dbuser/db-resource-id/db-user"
}
]
}
4. Create a Federated Schema
 Use the CREATE EXTERNAL SCHEMA command to define the schema for the external database.
Example:
CREATE EXTERNAL SCHEMA external_schema_name
FROM POSTGRES
DATABASE 'database_name'
URI 'jdbc:postgresql://db-endpoint:port/database_name'
IAM_ROLE 'arn:aws:iam::account-id:role/my-redshift-role'
SECRET_ARN 'arn:aws:secretsmanager:region:account-id:secret:secret-id';
 URI: Replace with the JDBC URL of your PostgreSQL database.
 SECRET_ARN: Use AWS Secrets Manager to securely store database credentials.
5. Query the External Data
 You can now query external tables using SQL.
Example:
SELECT * FROM external_schema_name.table_name WHERE column_name = 'value';
 Join Redshift and external data:
SELECT a.column1, b.column2
FROM redshift_table a
JOIN external_schema_name.external_table b
ON [Link] = [Link]
WHERE b.column3 > 100;

Key Features
1. Cross-Database Joins:
o Join data from Amazon Redshift and external databases seamlessly.
2. No Data Movement:
o Query live data directly without needing to ETL data into Redshift.
3. SQL-Based Queries:
o Use familiar SQL syntax to query external databases.
4. Cost-Effective:
o No additional storage or ingestion costs.

Best Practices
1. Optimize Queries:
o Push down operations to the external database to reduce data transfer. Redshift attempts to push
down filters, projections, and aggregations when possible.
2. Secure Connections:
o Use VPC, security groups, and Secrets Manager to secure communication between Redshift and
the external database.
3. Monitor Query Performance:
o Use Redshift system views (SVL_FEDERATED_QUERY_STATS) to monitor federated query
performance and troubleshoot issues.
4. Partition External Tables:
o Ensure external tables in the source database are well-partitioned for efficient querying.
5. Limit Federated Query Usage:
o Use federated queries for real-time or infrequent data access. For frequent queries, consider
replicating data into Redshift for better performance.

Limitations
 Supported only on RA3 node types.
 Works with Amazon RDS PostgreSQL, Amazon Aurora PostgreSQL, and compatible PostgreSQL databases.
 No direct support for MySQL or SQL Server (workarounds involve using ETL tools or intermediate
transformations).
 Query performance depends on the external database's performance and network latency.

Amazon Redshift Materialized Views


A materialized view in Amazon Redshift is a database object that stores the results of a query physically on disk.
Unlike a regular view, which is just a saved SQL query, a materialized view provides faster query performance
because the results are precomputed and cached.

Features of Materialized Views in Redshift


1. Performance Improvement:
o Speeds up complex queries and aggregations by precomputing results.
o Reduces the computational overhead for frequently accessed data.
2. Automatic Refresh:
o Materialized views can be incrementally or fully refreshed to keep data up to date.
3. Storage:
o Data is physically stored, so it consumes additional disk space compared to a regular view.
4. Incremental Refresh:
o Only the changes (deltas) in the underlying base tables are applied to update the materialized
view.
o Available for certain types of queries.
5. Query Optimization:
o Amazon Redshift automatically rewrites queries to use materialized views if they can improve
performance.

Creating a Materialized View


Syntax
CREATE MATERIALIZED VIEW mv_name
AS
SELECT column1, column2, aggregate_function(column3)
FROM base_table
WHERE condition
GROUP BY column1, column2;
Example
Create a materialized view to precompute total sales by product:
sql
CopyEdit
CREATE MATERIALIZED VIEW mv_total_sales
AS
SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY product_id;

Querying a Materialized View


You query a materialized view just like a regular table:
SELECT * FROM mv_total_sales WHERE total_sales > 1000;

Refreshing a Materialized View


1. Manual Refresh: Use the REFRESH MATERIALIZED VIEW command to update the view with the latest data.
REFRESH MATERIALIZED VIEW mv_total_sales;
2. Automatic Refresh: Redshift does not natively support automatic refresh for materialized views, but you
can automate it using:
o AWS Lambda + EventBridge.
o Scheduled SQL queries via your preferred orchestration tool.

Updating Materialized Views Incrementally


For queries with incremental refresh support, Redshift updates only the changed data, which improves refresh
performance. Supported operations include:
 Simple SELECT queries with basic joins.
 Aggregate functions (e.g., SUM, COUNT).
 WHERE clauses with deterministic conditions.
For unsupported queries, Redshift performs a full refresh.

Managing Materialized Views


View Dependencies
You can check dependencies for materialized views using system tables:
SELECT *
FROM svv_materialized_views
WHERE schema = 'your_schema';
Drop a Materialized View
DROP MATERIALIZED VIEW mv_name;

Best Practices for Materialized Views


1. Use Incremental Refresh Where Possible:
o Design queries to support incremental refresh for better performance.
2. Avoid Frequent Refreshes:
o Refresh only when necessary to balance freshness and performance.
3. Monitor Storage Usage:
o Materialized views consume disk space. Use STV_BLOCKLIST to monitor storage usage.
4. Query Optimization:
o Ensure the query logic in materialized views aligns with frequently used queries for maximum
benefit.
5. Combine with Sorting and Compression:
o Use sort keys and compression on base tables to improve the materialized view’s performance.
6. Automate Refresh:
o Schedule regular refreshes using Lambda or cron jobs for time-sensitive workloads.

Limitations of Materialized Views


1. Not Fully Dynamic:
o Materialized views do not automatically update when underlying tables change.
o Requires manual or scheduled refresh.
2. Incremental Refresh Limitations:
o Certain complex queries, such as those with subqueries, window functions, or unsupported joins,
require a full refresh.
3. Storage Consumption:
o Data is stored on disk, consuming additional storage.
4. Limited Query Rewrite:
o Automatic query rewrite to use materialized views depends on query compatibility.
In Amazon Redshift, materialized views can be used effectively with streaming data from services like Amazon
Kinesis or Amazon MSK (Managed Streaming for Apache Kafka). By combining Redshift materialized views with
streaming data, you can perform near-real-time analytics on continuously updating datasets.

Using Materialized Views for Streams


1. Key Workflow
1. Stream Data to Amazon Redshift:
o Use Amazon Kinesis Data Firehose or MSK Connect to stream data into an Amazon Redshift
table.
o Data Firehose can deliver data directly into a staging table in Redshift.
2. Create a Materialized View:
o Build a materialized view on top of the staging table or the transformed data table.
o The materialized view aggregates or filters the streamed data for analytics.
3. Refresh the Materialized View:
o Periodically refresh the materialized view to incorporate new data from the stream.

2. Example: Materialized View with Kinesis Stream


Scenario: Stream website event logs from Kinesis Data Firehose into Redshift and maintain an aggregated
materialized view for real-time analysis.
Step 1: Configure Kinesis Data Firehose
1. Create a Kinesis Data Firehose delivery stream.
2. Set the destination as Amazon Redshift.
3. Specify a Redshift staging table (e.g., event_logs_staging) for incoming data.
Step 2: Create the Materialized View
Create a materialized view to aggregate the streaming data for analytics.
CREATE MATERIALIZED VIEW mv_event_summary
AS
SELECT
event_type,
COUNT(*) AS event_count,
MAX(event_timestamp) AS last_event_time
FROM event_logs_staging
WHERE event_timestamp >= dateadd(day, -7, current_date)
GROUP BY event_type;
 Purpose:
o Aggregates event counts by type.
o Tracks the most recent event for each type.
o Only considers data from the last 7 days.
Step 3: Refresh the Materialized View
Run the following periodically (e.g., every minute or hour) to keep the materialized view updated:
REFRESH MATERIALIZED VIEW mv_event_summary;
 Automate this using AWS Lambda, Step Functions, or EventBridge.

3. Example: Materialized View with Kafka (MSK)


Scenario: Stream IoT sensor data from MSK into Redshift for near-real-time analysis.
Step 1: Stream Data to Redshift
1. Use MSK Connect or Kafka Connect with a Redshift Sink Connector to stream sensor data into a staging
table (e.g., sensor_data_staging).

Step 2: Create the Materialized View


Create a materialized view to summarize sensor data.
CREATE MATERIALIZED VIEW mv_sensor_summary
AS
SELECT
sensor_id,
AVG(sensor_value) AS avg_value,
MAX(sensor_value) AS max_value,
MIN(sensor_value) AS min_value,
MAX(event_time) AS last_updated
FROM sensor_data_staging
WHERE event_time >= dateadd(hour, -1, current_timestamp)
GROUP BY sensor_id;
 Purpose:
o Computes average, max, and min values for each sensor in the last hour.
o Tracks the most recent data timestamp.
Step 3: Automate Refresh
Refresh the materialized view regularly to ensure it reflects the latest data.

Best Practices for Materialized Views with Streams


1. Use Incremental Refresh:
o Ensure the queries support incremental refresh for better performance.
o Avoid unsupported query constructs (e.g., subqueries, window functions).
2. Optimize Staging Tables:
o Use appropriate sort and distribution keys for the staging table to speed up materialized view
refresh operations.
3. Schedule Refresh Efficiently:
o Balance refresh frequency with performance overhead.
o Use AWS Lambda or a cron job to automate periodic refresh.
4. Use Partitioning:
o If querying recent data (e.g., last day or hour), ensure the base table or materialized view uses
efficient filters.
5. Monitor Refresh Performance:
o Use Redshift system views like SVL_MV_REFRESH_STATUS to track refresh performance and
troubleshoot delays.
6. Avoid Frequent Small Updates:
o Consolidate smaller updates into larger batches for better refresh performance.

Limitations
1. Manual Refresh Required:
o Redshift materialized views do not refresh automatically. Automate refresh with external tools.
2. Full Refresh for Complex Queries:
o Some queries require a full refresh, which can be resource-intensive.
3. Incremental Refresh Restrictions:
o Incremental refresh supports a subset of query types. Ensure your materialized view query is
compatible.

You might also like