Amazon Redshift Overview and Features
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud offered by Amazon Web
Services (AWS). It is designed to handle large-scale data storage and complex queries, providing fast and scalable
data analytics capabilities.
1. Scalability: Redshift can scale from a few gigabytes to petabytes of data, and you can add or remove
nodes easily based on your storage and performance needs.
2. Columnar Storage: Data is stored in a columnar format, which significantly speeds up query performance
for analytical workloads, as only the necessary columns are accessed during queries.
3. Massively Parallel Processing (MPP): Redshift uses multiple nodes to perform parallel processing,
meaning queries are executed across many servers at once, greatly improving speed.
4. Integrated with AWS Ecosystem: Redshift is tightly integrated with other AWS services like S3,
DynamoDB, EMR, and Lambda, making it easier to build a seamless data pipeline.
5. SQL Interface: Redshift supports standard SQL for querying, making it accessible to users familiar with
SQL. It also supports advanced features like window functions, joins, and subqueries.
6. Performance Optimization: Redshift provides various performance optimizations, including automated
data compression, distribution styles (to control how data is spread across nodes), and query optimization
techniques.
7. Security: Redshift offers encryption at rest and in transit, VPC isolation, IAM (Identity and Access
Management) integration for access control, and audit logging.
8. Cost-Effective: With a pay-as-you-go pricing model, Redshift can be more cost-efficient compared to
traditional on-premises data warehouses. It also offers Reserved Instances for cost savings over time.
Use Cases:
Business Intelligence (BI): Companies use Redshift for analyzing large datasets and generating insights
through BI tools like Tableau, Power BI, or Amazon QuickSight.
Data Warehousing: Redshift is often used as a central repository to store data from different sources for
further analysis.
Data Lakes: Through its integration with S3 (Redshift Spectrum), Redshift can query data stored in a data lake
without first moving it or running ETL processing, which shortens the time to insight.
How It Works:
1. Cluster: A Redshift cluster is composed of one or more nodes. The leader node coordinates query
processing, while compute nodes handle the query execution.
2. Data Distribution: Data is distributed across compute nodes based on a chosen distribution style, which
optimizes query performance by reducing data shuffling during queries.
3. Query Execution: When a query is executed, it is parsed, planned, and optimized, and then distributed
across the nodes for parallel execution.
Redshift is widely used by organizations of all sizes for fast, scalable, and secure data analytics in the cloud,
especially when dealing with large datasets and complex queries.
What is columnar storage?
Columnar storage is a method of organizing and storing data in a database where data is stored by columns rather
than by rows. This contrasts with row-based storage, where all the column values of a record (or row) are stored
together, one row after another.
In columnar storage, all the values for a particular column are stored together in contiguous blocks of memory or
disk. For example, if you have a database table with columns for "Name," "Age," "Country," etc., the columnar
storage system would store all the "Name" values together, all the "Age" values together, and all the "Country"
values together, rather than storing each entire row (e.g., "John, 30, USA") in a sequential manner.
Example:
Let’s say you have the following table:
Name    Age   Country
John    30    USA
Sarah   25    UK
Emily   35    Canada
James   40    USA
In a row-based storage system, the rows are stored together in the following format:
[John, 30, USA]
[Sarah, 25, UK]
[Emily, 35, Canada]
[James, 40, USA]
In a columnar storage system, the values of each column are stored together instead:
[John, Sarah, Emily, James]
[30, 25, 35, 40]
[USA, UK, Canada, USA]
Now, if you want to run a query like SELECT AVG(Age) FROM table WHERE Country = 'USA', the columnar system
reads only the "Age" and "Country" columns, filters on Country = 'USA', and never touches "Name". A row-based
system would have to read every full row to answer the same query.
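The example above can be sketched in a few lines of Python. This is only an illustration: plain lists stand in for disk blocks, and the helper name avg_age_for_country is made up for this sketch rather than taken from any Redshift API.

```python
# Illustrative only: Python lists stand in for on-disk blocks.
rows = [
    ("John", 30, "USA"),
    ("Sarah", 25, "UK"),
    ("Emily", 35, "Canada"),
    ("James", 40, "USA"),
]

# Columnar layout: one list per column, values kept in row order.
columns = {
    "Name": [r[0] for r in rows],
    "Age": [r[1] for r in rows],
    "Country": [r[2] for r in rows],
}


def avg_age_for_country(cols, country):
    """SELECT AVG(Age) ... WHERE Country = ?, reading only two columns."""
    matching = [
        age for age, c in zip(cols["Age"], cols["Country"]) if c == country
    ]
    return sum(matching) / len(matching) if matching else None


# Only "Age" and "Country" are read; "Name" is never touched.
values_read = len(columns["Age"]) + len(columns["Country"])
result = avg_age_for_country(columns, "USA")
```

With a row layout, the same query would have to read all 12 values (4 rows of 3 columns); here it reads only the 8 values in the two columns it needs.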
Use Cases:
Data Warehousing: Columnar storage is ideal for data warehousing and analytics where queries often
involve large datasets and only a few columns are needed at a time.
Business Intelligence (BI): Tools like Tableau, Power BI, or Amazon QuickSight, which often focus on specific
metrics or aggregations, can benefit greatly from columnar storage to speed up data retrieval.
Big Data Analytics: In systems like Amazon Redshift and Google BigQuery, and in file formats like Apache
Parquet, which are designed to handle massive datasets efficiently, columnar storage allows for faster
processing of queries and better compression.
In Summary:
Columnar storage is highly effective for analytical queries where you need to work with large datasets but only a
subset of columns at a time. By storing data by columns instead of rows, it improves query performance, reduces
I/O operations, and often leads to better storage efficiency, particularly in data-intensive environments like data
warehouses and analytics platforms.
Leader Node
The Leader Node in a Redshift cluster manages all external and internal communication. It is responsible for
preparing a query execution plan whenever a query is submitted to the cluster. Once the plan is ready, the
Leader Node distributes the query execution code to the compute nodes and assigns slices of data to each
compute node for computation of results.
The Leader Node distributes query load to the compute nodes only when the query accesses data stored on the
compute nodes. Otherwise, the query is executed on the Leader Node itself. Some functions in Redshift are
always executed on the Leader Node.
Compute Nodes
Compute Nodes are responsible for actual execution of queries and have data stored with them. They execute
queries and return intermediate results to the Leader Node which further aggregates the results.
A more detailed explanation of how responsibilities are divided among the Leader and Compute Nodes is depicted
in the diagram below:
Node slices
A compute node consists of slices. Each slice is assigned a portion of the compute node’s memory and disk,
where it performs query operations. The Leader Node is responsible for assigning query code and data to each
slice for execution. Once assigned their share of the query load, slices work in parallel to generate query results.
Data is distributed among the slices on the basis of the distribution style and distribution key of a particular
table. An even distribution of data enables Redshift to assign workload evenly to slices and maximizes the benefit
of parallel processing.
The number of slices per compute node depends on the node type.
Redshift architecture allows it to use Massively parallel processing (MPP) for fast processing even for the most
complex queries and a huge amount of data. Multiple compute nodes execute the same query code on portions of
data to maximize parallel processing.
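A minimal sketch of this division of labor, assuming an AVG query: every slice runs the same function on its own portion of the data and returns a partial (sum, count), and the leader merges the partials. Threads stand in for slices here, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_partial_avg(ages):
    """Work done on one slice: return a partial (sum, count)."""
    return sum(ages), len(ages)


def leader_avg(slices):
    """Leader: run the same code on every slice, then merge the partials."""
    if not slices:
        return None
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(slice_partial_avg, slices))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Three slices, each holding a different portion of the "Age" column.
slices = [[30, 25], [35], [40, 50, 30]]
mpp_result = leader_avg(slices)
```

Returning (sum, count) rather than a per-slice average matters: averaging the per-slice averages would give the wrong answer when slices hold different numbers of rows.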
Data in Redshift is stored in a columnar fashion which drastically reduces the I/O on disks. Columnar storage
reduces the number of disk I/O requests and minimizes the amount of data loaded into the memory to execute a
query. Reduction in I/O speeds up query execution and loading less data means Redshift can perform more in-
memory processing.
Redshift uses Sort Keys to sort columns and filter out chunks of data while executing queries.
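One way to picture how sorted data lets a query skip chunks is a zone map: per-block minimum and maximum values that are checked before a block is read. The sketch below is a simplified model with a made-up block size of 10 values, not Redshift's actual on-disk format.

```python
def build_zone_map(sorted_values, block_size):
    """Split a sorted column into blocks and record (min, max, block)."""
    blocks = [
        sorted_values[i:i + block_size]
        for i in range(0, len(sorted_values), block_size)
    ]
    return [(b[0], b[-1], b) for b in blocks]


def scan_range(zone_map, low, high):
    """Read only the blocks whose [min, max] range overlaps [low, high]."""
    hits, blocks_read = [], 0
    for block_min, block_max, block in zone_map:
        if block_max < low or block_min > high:
            continue  # the whole block is skipped without being read
        blocks_read += 1
        hits.extend(v for v in block if low <= v <= high)
    return hits, blocks_read


order_days = list(range(1, 101))            # column already sorted on the sort key
zone_map = build_zone_map(order_days, 10)   # 10 blocks of 10 values each
hits, blocks_read = scan_range(zone_map, 42, 47)
```

Because the column is sorted, the requested range falls inside a single block and the other nine are skipped. On unsorted data the min/max ranges of the blocks would overlap and far fewer blocks could be ruled out.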
Data compression
Data compression is one of the important factors in query performance. It reduces the storage footprint and the
amount of data read from disk, so large amounts of data can be loaded into memory quickly. Because data is
stored by column, Redshift can choose a compression encoding suited to each column’s data type.
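To see why a column of similar values compresses well, here is a small run-length encoding sketch. Run-length is only one of several encodings, chosen here because it is easy to follow; the sample column is invented for illustration.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of repeated values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


country_column = ["UK", "UK", "UK", "USA", "USA", "USA", "USA", "Canada"]
encoded = rle_encode(country_column)
```

Eight stored values shrink to three pairs. The same values interleaved with names and ages in a row layout would not form runs at all.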
Query Optimizer
Redshift’s Query Optimizer generates query plans that are MPP-aware and take advantage of columnar data
storage. It uses statistics gathered by analyzing tables to generate efficient query plans for execution.
Amazon Redshift has a Massively Parallel Processing Architecture. MPP enables Redshift to distribute and
parallelize queries across multiple nodes. Apart from queries, the MPP architecture also enables parallel
operations for data loads, backups and restores. Redshift architecture is inherently parallel; there is no additional
tuning or overheads for distribution of loads for the end users.
2. Redshift supports Single Node Clusters to 100 Nodes Clusters with up to 1.6 PB of storage
You can provision a Redshift cluster with anywhere from a single node to 100 nodes, depending on the
processing and storage capacity required. The original Redshift nodes come in two sizes, XL and 8XL. An XL node
comes with 2 TB of attached storage and an 8XL node comes with 16 TB. Clusters can have a maximum of 32 XL
nodes (64 TB) or 100 8XL nodes (1.6 PB). These limits apply to those legacy node types; the current RA3 and DC2
node types have different sizes and limits.
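The capacity figures quoted above follow directly from per-node storage multiplied by the maximum node count. A quick check, using only the XL and 8XL numbers given in the text:

```python
# Figures taken from the text above (XL and 8XL node sizes only).
TB_PER_NODE = {"XL": 2, "8XL": 16}
MAX_NODES = {"XL": 32, "8XL": 100}


def max_cluster_storage_tb(node_size):
    """Maximum cluster storage in TB for a given node size."""
    return TB_PER_NODE[node_size] * MAX_NODES[node_size]


xl_tb = max_cluster_storage_tb("XL")              # 32 nodes * 2 TB
eight_xl_tb = max_cluster_storage_tb("8XL")       # 100 nodes * 16 TB
eight_xl_pb = eight_xl_tb / 1000                  # decimal petabytes
```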
By default, a Redshift cluster is deployed in a single Availability Zone (AZ). You will not be able to access the
cluster in case of an AZ failure. An AZ failure does not affect the durability of your data, and you can resume using
the cluster once the AZ is available again. To ensure continuous access to your data, you can launch an additional
cluster in a different AZ. You can restore a new Redshift cluster in a different AZ from the snapshots of the original
cluster. Alternatively, you can keep a cluster always running in a different AZ, accessing the same set of data from S3.
Redshift provides columnar data storage. With Columnar data storage, all values for a particular column are stored
contiguously on the disk in sequential blocks.
Columnar data storage helps reduce the I/O requests made to the disk compared to a traditional row based data
storage. It also reduces the amount of data loaded from the disk improving the processing speed, as more memory
is available for query executions.
As similar data is stored sequentially, Redshift compresses the data rather efficiently. Compression of data further
reduces the amount of I/O required for queries.
5. Parallel loads into Redshift go through the COPY command
The COPY command loads data in parallel from Amazon S3, Amazon DynamoDB, Amazon EMR, or remote hosts
over SSH. Using COPY from S3 is the fastest way to load data into Redshift. COPY is much more efficient than
INSERT statements.
For other sources, you will either have to use INSERT statements or write scripts that first stage the data in S3
and then load it into Redshift. This can sometimes be a complex process depending on the size and format of
the data.
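As a rough illustration of what a load from S3 looks like, the helper below assembles a COPY statement as a string. The table name, bucket, and role ARN are placeholders, and build_copy_statement is a hypothetical helper rather than part of any AWS library. It does no escaping, so it is only suitable for trusted input.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, file_format="CSV"):
    """Return a COPY statement that loads every file under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


# Placeholder values for illustration only.
statement = build_copy_statement(
    table="sales",
    s3_prefix="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

Pointing COPY at a prefix that contains several files, rather than one large file, is what allows the load to be spread across slices.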
6. Redshift is Secure
Amazon provides various security features for Redshift just like all other AWS services.
Access control can be maintained at the account level using IAM roles. For database-level access control, you can
define Redshift database groups and users and restrict access to specific databases and tables.
Redshift can be launched in Amazon VPC. You can define VPC security groups to restrict inbound access to your
clusters.
Redshift allows data encryption for all data which is stored in the cluster as well as SSL encryption for data in
transit.
7. Distribution Keys
Redshift achieves high query performance by distributing data evenly on all the nodes of a cluster and slices within
a node.
A Redshift cluster is made of multiple nodes, and each node has multiple slices. The number of slices depends on
the node type. Each slice is allocated a portion of the node’s memory and disk space. During query execution the
data is distributed across slices, and the slices operate in parallel to execute the query.
To distribute data evenly among slices, you need to define a distribution key for a table while creating it. If a
distribution key is defined during table creation, any data, which is loaded in the table, is distributed across nodes
based on the distribution key value. Matching values from a distribution key column are stored together.
A good distribution key ensures even load distribution across slices. An uneven distribution causes some slices
to handle more load than others, which slows down query execution.
If no distribution key is defined for a table, Redshift picks the distribution itself (AUTO); with the EVEN style, rows
are distributed in a round-robin fashion.
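The effect of key choice on skew can be simulated with a toy hash distribution. This sketch assumes four slices and uses SHA-256 as a stand-in hash; Redshift's real hash function and slice counts will differ, and the column names are invented.

```python
import hashlib
from collections import Counter


def slice_for(key, num_slices):
    """Map a key to a slice with a stable hash (stand-in for Redshift's own)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_slices


def rows_per_slice(keys, num_slices):
    """Count how many rows land on each slice for a given key column."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    return [counts.get(s, 0) for s in range(num_slices)]


def skew_ratio(counts):
    """Largest slice divided by the average slice; 1.0 is perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


num_slices = 4
order_ids = range(10_000)                        # high cardinality
countries = ["USA"] * 9_000 + ["UK"] * 1_000     # low cardinality, lopsided

even_counts = rows_per_slice(order_ids, num_slices)
skewed_counts = rows_per_slice(countries, num_slices)
```

Distributing on order_id spreads rows almost evenly. Distributing on country sends every USA row to one slice, so that slice does most of the work while the others sit idle.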
Changing the distribution key of an existing table is expensive because every row has to be redistributed. This is
very important to keep in mind while identifying the right distribution key for a table.
To change a distribution key, you can use ALTER TABLE ... ALTER DISTKEY, or use the older workaround: create a
new table with the updated distribution key, load the data into it, drop the original table, and rename the new
table to the original name.
You can define database constraints like unique, primary, and foreign keys, but these constraints are informational
only and are not enforced by Redshift. They are, however, used by Redshift when creating query execution plans.
If the primary key and foreign key constraints are valid for your data, declare them when creating tables so the
planner can use them. If they are not valid, do not declare them, because the planner trusts them and queries can
return wrong results.
10. Redshift does not automatically reclaim space that is freed on deletes or updates
Redshift is based on PostgreSQL version 8.0.2 and inherits some of its limitations. One such limitation is that
Redshift does not immediately reclaim and reuse the space freed by DELETE or UPDATE commands. Rows are only
marked as deleted, and large numbers of such rows cost extra storage and scan time.
Every update command in Redshift first deletes the existing row and then inserts a new record with the updated
values.
To reclaim this unused space, you can run the VACUUM command. VACUUM reclaims the freed space and
also re-sorts the data on disk.
Ideally there are very few updates or deletes once data is loaded into a data warehouse, but when they do occur,
you can run the VACUUM command.
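The update-as-delete-plus-insert behavior and the effect of a vacuum can be modeled with a small in-memory table. This is a conceptual sketch only: the row structure, the "deleted" flag, and both function names are assumptions made for illustration.

```python
def update_row(table, key, new_value):
    """UPDATE: mark the old version deleted, then append the new version."""
    for row in table:
        if row["id"] == key and not row["deleted"]:
            row["deleted"] = True
    table.append({"id": key, "value": new_value, "deleted": False})


def vacuum(table):
    """Drop rows marked deleted and restore sort order on 'id'."""
    return sorted(
        (row for row in table if not row["deleted"]),
        key=lambda row: row["id"],
    )


table = [{"id": i, "value": f"v{i}", "deleted": False} for i in range(1, 4)]
update_row(table, 2, "v2-new")
size_before_vacuum = len(table)    # the dead row still occupies space
table = vacuum(table)
size_after_vacuum = len(table)
```

After the update the table holds four physical rows for three logical ones, and the new version of row 2 sits at the end, out of sort order. The vacuum step removes the dead row and puts the new one back in place.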
With manual workload management (WLM), the total number of concurrent queries across all user-defined queues
is capped (50 at the time of writing; earlier releases capped it at 15). Users cannot raise this limit.
QuickSight is a useful tool for building dashboards and BI reports on Redshift. It is tuned to work well with
Redshift.
The Amazon Redshift utilities repository on GitHub contains highly useful admin scripts.
Leader Node: Coordinates the query execution, aggregates results, and handles metadata.
Compute Nodes: Perform the parallel processing of queries, store data, and return results.
Data Distribution: Data is distributed across compute nodes to leverage MPP and parallelism.
Columnar Storage: Optimizes query performance for analytical workloads and reduces storage usage
through compression.
Massively Parallel Processing (MPP): Distributes queries across nodes for faster execution of complex
queries.
Fault Tolerance & Security: Provides replication, automated backups, and data encryption to ensure high
availability and security.
The architecture of Amazon Redshift is specifically designed to scale efficiently for large datasets while maintaining
fast query performance, making it an ideal solution for data warehousing and analytics in the cloud.
Amazon Redshift offers different node types to accommodate your workloads, and we recommend choosing RA3 or
DC2 depending on the required performance, data size, and expected data growth.
RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute
and managed storage independently. With RA3, you choose the number of nodes based on your performance
requirements and only pay for the managed storage that you use. Size your RA3 cluster based on the amount of
data you process daily. You launch clusters that use the RA3 node types in a virtual private cloud (VPC). You can't
launch RA3 clusters in EC2-Classic.
Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and
Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs,
Amazon Redshift managed storage automatically offloads that data to Amazon S3. You pay the same low rate for
Amazon Redshift managed storage regardless of whether the data sits in high-performance SSDs or Amazon S3. For
workloads that require ever-growing storage, managed storage lets you automatically scale your data warehouse
storage capacity separate from compute nodes.
DC2 nodes enable you to have compute-intensive data warehouses with local SSD storage included. You choose the
number of nodes you need based on data size and performance requirements. DC2 nodes store your data locally
for high performance, and as the data size grows, you can add more compute nodes to increase the storage
capacity of the cluster. For datasets under 1 TB (compressed), we recommend DC2 node types for the best
performance at the lowest price. If you expect your data to grow, we recommend using RA3 nodes so you can size
compute and storage independently to achieve improved price and performance. You launch clusters that use the
DC2 node types in a virtual private cloud (VPC).
Some node types allow one node (single-node) or two or more nodes (multi-node). The minimum number of nodes
for clusters of some node types is two nodes. On a single-node cluster, the node is shared for leader and compute
functionality. Single-node clusters are not recommended for running production workloads. On a multi-node
cluster, the leader node is separate from the compute nodes. The leader node is the same node type as the
compute nodes. You only pay for compute nodes.
Amazon Redshift applies quotas to resources for each AWS account in each AWS Region. A quota restricts the
number of resources that your account can create for a given resource type, such as nodes or snapshots, within an
AWS Region.
The cost of your cluster depends on the AWS Region, node type, number of nodes, and whether the nodes are
reserved in advance.
In Amazon Redshift, distribution style determines how data is distributed across the compute nodes in a cluster.
Choosing the right distribution style is crucial for performance optimization, as it affects data locality, query
execution speed, and overall cluster efficiency.
Types of Distribution Styles
1. AUTO (Default)
o Redshift decides the distribution style automatically based on the table size and query patterns.
o Small tables are typically set to ALL.
o Larger tables are set to EVEN or KEY, depending on query patterns.
2. EVEN
o Data is distributed evenly across all the nodes in the cluster.
o Suitable for tables where no specific column is frequently joined or filtered on.
o Avoids data skew but may result in high data transfer during joins or aggregations.
3. KEY
o Data is distributed based on the values in a specified column (distribution key).
o Rows with the same key value are stored on the same node.
o Optimal for tables frequently joined or filtered on the same column (distribution key).
o Can reduce data shuffling during query execution but may lead to data skew if the key values are
not evenly distributed.
4. ALL
o A full copy of the table is stored on each node.
o Best for small dimension tables that are frequently joined with larger fact tables.
o Eliminates data shuffling but increases storage and maintenance overhead.
Best Practices
Analyze Query Patterns: Use queries to identify frequently joined or filtered columns.
Avoid Skew: Choose distribution keys with high cardinality and an even distribution of values.
Combine with Sort Keys: Align the distribution key with the sort key to improve query performance
further.
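To make the trade-offs between the styles concrete, the sketch below places a small orders table and a customers table on three nodes and counts how many order rows find their matching customer on the same node, meaning no data has to move for the join. Modulo on an integer key stands in for the real hash, and all table and function names are invented.

```python
def place_even(rows, num_nodes):
    """EVEN: round-robin placement regardless of row content."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def place_key(rows, num_nodes, key):
    """KEY: rows with the same key value always land on the same node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row[key] % num_nodes].append(row)  # modulo stands in for a hash
    return nodes


def place_all(rows, num_nodes):
    """ALL: every node holds a full copy of the table."""
    return [list(rows) for _ in range(num_nodes)]


def colocated_matches(fact_nodes, dim_nodes, key):
    """Count fact rows whose join partner is on the same node."""
    total = 0
    for fact_rows, dim_rows in zip(fact_nodes, dim_nodes):
        local_keys = {d[key] for d in dim_rows}
        total += sum(1 for f in fact_rows if f[key] in local_keys)
    return total


n = 3
customers = [{"customer_id": c} for c in range(6)]
orders = [{"order_id": o, "customer_id": (o * 5) % 6} for o in range(12)]

key_key = colocated_matches(
    place_key(orders, n, "customer_id"),
    place_key(customers, n, "customer_id"),
    "customer_id",
)
even_all = colocated_matches(
    place_even(orders, n), place_all(customers, n), "customer_id"
)
even_even = colocated_matches(
    place_even(orders, n), place_even(customers, n), "customer_id"
)
```

With KEY on both tables, or with the small table set to ALL, every order joins locally. With EVEN on both, most orders would need their customer row shipped from another node.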
Redshift Snapshots
A snapshot in Redshift is a point-in-time backup of your cluster. These snapshots allow you to restore the cluster to
a specific state.
Types of Snapshots
1. Automated Snapshots
o Created automatically by Amazon Redshift based on your cluster's snapshot schedule.
o Retention period is configurable (default is 1 day).
o Managed entirely by Redshift.
2. Manual Snapshots
o Created manually by the user.
o Persist until explicitly deleted.
o Useful for long-term backups or sharing across accounts.
How to Create a Manual Snapshot
1. AWS Management Console:
o Navigate to the Amazon Redshift Console.
o Select the cluster.
o Choose Snapshots > Create Snapshot.
o Enter a name for the snapshot and create it.
2. AWS CLI:
aws redshift create-cluster-snapshot --cluster-identifier my-cluster --snapshot-identifier my-snapshot
3. AWS SDK: Use the CreateClusterSnapshot API.
Restoring a Snapshot
Snapshots can be restored into a new cluster using the console, CLI, or SDK.
In Amazon Redshift, VACUUM is a maintenance operation used to reorganize and reclaim storage space. Over time,
as data is updated or deleted, the storage becomes fragmented, and the table can accumulate "dead rows" that
degrade query performance. Running a VACUUM operation resolves this by compacting and sorting the data.
-- General syntax
VACUUM [ FULL | SORT ONLY | DELETE ONLY | REINDEX ]
[ table_name ]
[ TO threshold PERCENT ];
Options
TO threshold PERCENT:
o Specifies the percentage of rows already in sort order above which VACUUM skips the sort phase. It is also
the target for space reclaimed in the delete phase.
o Default is 95 percent.
Best Practices
1. Analyze Table Usage:
o Check the SVV_TABLE_INFO system view to identify tables with unsorted rows or rows pending deletion:
SELECT "table", unsorted, size, tbl_rows, estimated_visible_rows FROM svv_table_info WHERE unsorted > 0 OR tbl_rows > estimated_visible_rows;
2. Schedule VACUUM During Low Activity:
o VACUUM is resource-intensive and can impact cluster performance.
o Run during maintenance windows or periods of low query activity.
3. Use the Right Type:
o Use DELETE ONLY for tables with significant deletes.
o Use SORT ONLY if resorting is needed without space reclamation.
4. Avoid Frequent VACUUMs:
o Frequent small VACUUM operations can increase overhead.
o Instead, batch updates/deletes and run VACUUM afterward.
5. Combine with ANALYZE:
o Run ANALYZE after VACUUM to update table statistics for query optimization:
ANALYZE my_table;
6. Rely on Automatic Vacuum and Sort:
o Redshift automatically reclaims space from deleted rows and re-sorts data in the background when
cluster load is low.
o This reduces the need for manual VACUUM operations.
Amazon Redshift integrates with AWS Lambda to extend its capabilities, allowing you to invoke Lambda functions
for tasks like:
Data Enrichment: Fetching additional data from external sources in real time.
Custom Transformations: Applying complex logic to data.
Event Processing: Triggering downstream actions from Redshift queries.
Machine Learning Integration: Running ML models hosted in Lambda and using the results in Redshift.
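A Lambda function called from Redshift receives rows in batches. The handler below is a sketch of a custom transformation that upper-cases a text column. The event and response field names ("arguments", "success", "num_records", "results", "error_msg") reflect my understanding of the Lambda UDF interface and should be checked against the current AWS documentation before use.

```python
import json


def lambda_handler(event, context):
    """Handle one batch: event["arguments"] holds one argument list per row."""
    try:
        results = []
        for args in event["arguments"]:
            value = args[0]
            # Keep SQL NULLs as None; normalize everything else.
            results.append(None if value is None else str(value).strip().upper())
        response = {
            "success": True,
            "num_records": len(results),
            "results": results,
        }
    except Exception as exc:  # report the failure back instead of raising
        response = {"success": False, "error_msg": str(exc)}
    return json.dumps(response)
```

The function must return exactly one result per input row, in the same order, because the results are matched back to rows by position.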
Amazon Redshift federated queries allow you to query and combine data across your Amazon Redshift cluster and
operational data stores like Amazon RDS, Aurora PostgreSQL, and other PostgreSQL databases without moving or
copying the data. This feature is beneficial for real-time analytics and combining historical and operational data.
Key Features
1. Cross-Database Joins:
o Join data from Amazon Redshift and external databases seamlessly.
2. No Data Movement:
o Query live data directly without needing to ETL data into Redshift.
3. SQL-Based Queries:
o Use familiar SQL syntax to query external databases.
4. Cost-Effective:
o No additional storage or ingestion costs.
Best Practices
1. Optimize Queries:
o Push down operations to the external database to reduce data transfer. Redshift attempts to push
down filters, projections, and aggregations when possible.
2. Secure Connections:
o Use VPC, security groups, and Secrets Manager to secure communication between Redshift and
the external database.
3. Monitor Query Performance:
o Use the Redshift system view SVL_FEDERATED_QUERY to monitor federated query
performance and troubleshoot issues.
4. Partition External Tables:
o Ensure external tables in the source database are well-partitioned for efficient querying.
5. Limit Federated Query Usage:
o Use federated queries for real-time or infrequent data access. For frequent queries, consider
replicating data into Redshift for better performance.
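The benefit of pushing filters and projections down to the external database can be shown by counting how many rows cross the network in each case. The remote table here is just a Python list, and both fetch functions are hypothetical stand-ins for what the query planner does.

```python
# A stand-in for a table living in the external operational database.
remote_orders = [
    {"order_id": i, "status": "OPEN" if i % 10 == 0 else "CLOSED", "amount": i}
    for i in range(1, 1001)
]


def fetch_without_pushdown(remote_rows):
    """Ship every row and column, then filter locally."""
    transferred = list(remote_rows)
    result = [r["amount"] for r in transferred if r["status"] == "OPEN"]
    return result, len(transferred)


def fetch_with_pushdown(remote_rows):
    """Filter and project at the source; ship only what the query needs."""
    transferred = [r["amount"] for r in remote_rows if r["status"] == "OPEN"]
    return transferred, len(transferred)


local_result, rows_shipped_all = fetch_without_pushdown(remote_orders)
pushed_result, rows_shipped_filtered = fetch_with_pushdown(remote_orders)
```

Both approaches return the same answer, but the pushed-down version transfers a tenth of the rows and only one column of each.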
Limitations
Supported only on RA3 node types.
Works with Amazon RDS PostgreSQL, Amazon Aurora PostgreSQL, and compatible PostgreSQL databases; newer
releases also support Amazon RDS for MySQL and Amazon Aurora MySQL.
No direct support for SQL Server (workarounds involve using ETL tools or intermediate
transformations).
Query performance depends on the external database's performance and network latency.
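Materialized View Limitations
1. Refresh Must Be Configured:
o Redshift materialized views are not refreshed automatically unless they are created with AUTO REFRESH
YES. Otherwise, run REFRESH MATERIALIZED VIEW manually or schedule it with external tools.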
2. Full Refresh for Complex Queries:
o Some queries require a full refresh, which can be resource-intensive.
3. Incremental Refresh Restrictions:
o Incremental refresh supports a subset of query types. Ensure your materialized view query is
compatible.
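The difference between a full and an incremental refresh can be sketched for the simplest case, a SUM grouped by one column with rows only ever appended to the base table. This is a conceptual model with invented names; it does not reflect how Redshift implements refresh internally, and it does not handle updates or deletes.

```python
from collections import defaultdict


def full_refresh(base_rows):
    """Recompute the aggregate from every row in the base table."""
    totals = defaultdict(int)
    for region, amount in base_rows:
        totals[region] += amount
    return dict(totals)


def incremental_refresh(view, new_rows):
    """Fold only the newly appended rows into the existing aggregate."""
    updated = dict(view)
    for region, amount in new_rows:
        updated[region] = updated.get(region, 0) + amount
    return updated


base = [("EU", 10), ("US", 20), ("EU", 5)]
view = full_refresh(base)

new_rows = [("US", 7), ("APAC", 3)]
view_incremental = incremental_refresh(view, new_rows)
view_full = full_refresh(base + new_rows)
```

Both paths produce the same view, but the incremental one reads two rows instead of five. Queries whose results cannot be patched this way are the ones that fall back to a full refresh.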