NAME - AYUSH AGRAHARI
ROLL NO - 22113020
BRANCH - CHEMICAL ENGINEERING
Amazon Redshi
1. Introduc on
Amazon Redshi is a fully managed, petabyte-scale data warehouse service provided by
Amazon Web Services (AWS). It allows organiza ons to run complex analy cal queries
against structured and semi-structured data using SQL. Redshi is designed for analy cs
workloads (OLAP) where queries scan large volumes of data, aggregate results and return
insights quickly.
2. Key features and impac ul informa on
• Columnar storage: Redshi stores data by column which reduces I/O for analy cs
queries that read only a subset of columns.
• Massively Parallel Processing (MPP): Redshi distributes data across nodes and
processes queries in parallel to improve performance.
• Compression: Automa c compression encodings reduce storage consump on and
speed up query I/O.
• Scalability: Clusters can be resized (add/remove nodes) and Redshi Serverless o ers
on-demand compu ng for variable workloads.
• Integra on with AWS ecosystem: Na ve connectors to S3, AWS Glue, Athena, Kinesis,
SageMaker and seamless access to other AWS services.
• Security & Compliance: Supports VPC, IAM-based access control, encryp on at-rest
and in-transit, and integra on with AWS Key Management Service (KMS).
• Concurrency Scaling & Workload Management: Features like workload management
(WLM), concurrency scaling, and automa c materialized views help maintain
performance under concurrent user queries.
3. Architecture overview
At a high level, a Redshi cluster consists of one leader node and one or more compute
(worker) nodes. The leader node receives SQL queries, plans the execu on, and coordinates
the work across compute nodes. Compute nodes store data (in columnar format) and
execute query fragments in parallel. Data is o en loaded from S3 or streamed from other
sources and distributed across the nodes in slices (Redistribu on for parallelism).
ti
ti
ft
ti
f
ft
ti
ti
ft
tf
ti
ti
ti
ft
ft
ti
ti
ft
ft
ti
ti
ti
ti
ti
ff
ti
4. Data loading and common pa erns
• COPY command: Fast, parallel bulk loading from S3, DynamoDB, or remote hosts —
recommended over many single-row INSERTs.
• External tables: Using Redshi Spectrum, you can query data directly on S3 without
loading it into the cluster.
• ETL/ELT: Typical pa ern is to stage raw data in S3, transform using Glue/EMR or SQL in
Redshi (ELT), and store curated tables for analy cs.
5. Performance op miza on best prac ces
• Choose appropriate distribu on styles (KEY, ALL, EVEN) so joins and aggrega ons avoid
unnecessary data movement.
• Sort keys: Use sort keys (compound or interleaved) to speed up range-restricted scans
and improve query performance.
• Compression encodings: Use ANALYZE COMPRESSION or let COPY choose op mal
encodings.
• Vacuuming and analyzing: Regular VACUUM and ANALYZE keep table sta s cs and
reduce fragmenta ons for be er performance.
• Use WLM queues and concurrency scaling to isolate cri cal queries and handle bursts
of analy cs workloads.
6. Security, access control & maintenance
• Authen ca on & Access: Use IAM and database users/roles to manage ne-grained
access.
• Encryp on: Enable encryp on for data at rest (KMS) and in transit (SSL/TLS).
• Backups & snapshots: Automated and manual snapshots allow point-in- me restores
and cross-region snapshots for DR.
• Monitoring: Use CloudWatch, AWS Console, and system tables (STL/PG_ views) to
monitor queries, disk usage, and performance.
7. OLAP vs OLTP (in very simple words)
• OLTP (Online Transac on Processing): Systems designed to handle many short, atomic
transac ons — e.g., banking transac ons, order inser on, upda ng customer pro les.
They focus on fast writes, concurrency, and data integrity.
• OLAP (Online Analy cal Processing): Systems designed for complex read-heavy queries
over large amounts of historical data — e.g., repor ng, dashboarding, analy cal queries
that aggregate and scan many rows.
ft
ti
ti
ti
ti
ti
ti
tt
ti
ti
ti
ti
ti
tt
ti
ft
tt
ti
ti
ti
ti
ti
ti
ti
fi
ti
ti
ti
ti
ti
ti
fi
Think: OLTP = day-to-day transac ons (many small opera ons). OLAP = deep analysis
(few complex, heavy queries).
8. Why Amazon Redshi is op mized for OLAP (not OLTP)
• Columnar storage: Op mized for reading large column sets and performing
aggrega ons. OLTP needs fast single-row reads/writes (row-based), which columnar
storage is not designed for.
• MPP architecture: Redshi distributes large scans across many nodes for parallel
processing — excellent for analy cs. OLTP workloads require many small, low-latency
transac ons, which can su er in MPP clusters.
• Batch loading & COPY: Redshi is tuned for bulk loading (COPY). Frequent single-row
inserts/deletes (typical in OLTP) are ine cient and lead to fragmenta on and
performance degrada on.
• Concurrency and latency: Redshi 's WLM and concurrency scaling handle concurrent
analy cal queries, but it is not op mized for extremely high concurrency of short write
transac ons. OLTP systems (e.g., Aurora, RDS) provide lower latency for transac onal
workloads.
• ACID & transac onal seman cs: While Redshi supports transac onal opera ons, its
primary design trade-o s favor analy cal throughput and storage e ciency over high-
frequency transac onal integrity and low latency.
9. Prac cal ps when using Redshi in a real project
• Keep hot transac onal data in OLTP systems (RDS/Aurora) and periodically ETL/
replicate aggregated or cleaned data into Redshi for analy cs.
• Use spectrum/external tables to keep infrequently accessed data on S3 and save
cluster storage costs.
• Automate maintenance: schedule VACUUM and ANALYZE a er heavy loads, and
monitor disk u liza on to resize clusters when needed.
• Consider Redshi Serverless for variable workloads or RA3 instances when you want
decoupled storage and compute with managed scaling.
10. Short conclusion
Amazon Redshi is a powerful, fully managed data warehouse service tailored for OLAP
workloads. Its columnar storage, MPP execu on, compression, and ght AWS integra on
make it suitable for large-scale analy cs. For transac on-heavy, low-latency OLTP use-cases,
purpose-built transac onal databases (RDS/Aurora) are preferable. Understanding Redshi
architecture, best prac ces and common troubleshoo ng steps is important for interviews
and real-world projects.
ti
ti
ti
ti
ti
ti
ti
ft
ti
ft
ti
ti
ti
ti
ti
ti
ff
ti
ft
ff
ft
ti
ft
ti
ti
ti
ti
ft
ti
ti
ft
ffi
ti
ft
ft
ti
ti
ti
ti
ft
ti
ffi
ti
ti
ti
ti
ti
ft