0% found this document useful (0 votes)

6 views25 pages

Amazon RedShift

Amazon Redshift is a cloud-based data warehouse service that utilizes a cluster architecture with leader and compute nodes to efficiently process large-scale queries on structured data using a columnar storage model. It offers features like MPP for parallel processing, data compression, and query optimization, enabling users to run complex queries at a lower cost. Users can create databases, schemas, and tables, load data from S3, and manage their clusters through the AWS console, with options for encryption and scalability.

Uploaded by

gvsjjngmqg

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views25 pages

Amazon RedShift

Uploaded by

gvsjjngmqg

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

Amazon RedShift:

1. It is a Data Warehouse as a service from Amazon.

2. It also works on the concept of nodes and clusters.
3. You can connect redshift with many types of applications like Reporting tools etc
4. It uses MPP model, stores the data in columnar data storage which makes it very efficient to
store the data in a compressed format.
5. It is based on Postgres SQL, so most of the syntax would be same as in postgres. but it has a
lot of rich features than just a traditional database.
6. You can use this to run super complicated queries over Petabytes range of Structured data
and run the query in parallel at economic cost.
7. It allows us to pay for compute and storage separately there by reducing the cost.

You can start at 0.25 USD per hour and scale up to PB for $1000 per terabyte per year.

Architecture of Redshift

Client Application:
It can be any application (like Java, .Net, PHP, BI-reporting tool) that uses JDBC/ODBC to
interact with the Client Application.
Redshift is based on Postgres SQL, so many Tool that supports connecting to Postgres can
also be connected to Redshift.

Cluster:
It is the core component of Redshift.
It is the collection of one or more compute node.

Leader Node:
If the cluster has more than 1 compute node, then only Leader node is added and it creates
coordination between/among all the compute nodes.
Client Application interacts with Leader Node and leader node assignes task to all the
compute node.
Leader node collects the output from all the compute node and give it back to the client.
It parses, develops execution plans to carry out database operations in form of series of
steps necessary to obtain results for complex queries.
Based on the best execution plan, the leader node compiles the code and the compiled code
is sent to compute
Based on execution plan, the leader node compiles code, distributes the compiled code to
the compute node and the code runs on a portioned of data.

The leader distributes the code(SQL compiled code) to compute node, only when the
data(that should be worked upon) is present in the compute node.
All other query otherwise run on the leader node.

Compute Node:
This are the actual worker node that runs the code in the data and send the result back to
the leader node for final aggregation.

Node Slices:
A compute node is divided into slices.
Each slice has a portion of RAM, Disk, Core allocated to it.
Leader node manages distributing data to the slice.
Leader node also shares a portion of work for any queries to the slice.
All the slices then work in parallel to complete the task.
The number of size per node is determined by the node size of the cluster.

Internal Network:
Redshift takes advantage of high-bandwidth connections to provide super speed network
communication between leader node and compute node.

Redshift performance
Redshift uses a lot of optimization out of the box to make sure it performs really well and
saves you a lot of time and money.
Some of the optimizations are
1. MPP: It is another Paradigm just like MapReduce to help parallel process of data.
The sql query which is converted to compiled code is distributed to all the compute node
and they all work together on some portion of data there by enabling a Massive Parallel
Processing.

2. Columnar Data Storage:

The entire column is stored together so that if we want to fetch few columns, we don’t have
to scan all the rows to get those columns. In fact we can just fetch some of those columns
directly.
Most of the DB Block size ranges from 2 KB to 32 KB.
Redshift Block size uses 1 MB of data which means more data can be put together in a block
which helps in faster IO operations.
3. Data Compression:
Data is compressed and then stored. When a query is executed, the compressed data is read
into memory and then un compressed during the query execution. Loading less data into the
memory enables Redshift to allocate more memory to analyze the data.

As columnar storage stores the column together we get more repetitive data together

4. Query Optimizer:
It enables Redshift to optimize and re-arrange the query so that the data can be filtered
earlier.

5. Result Caching:
The result of the Query executed in the redshift is cached in the memory of leader node.
When a user submits a query, Redshift checks the result cache to check if we have valid,
cached copy of the query result. If a match is found in the result cache, Redshift uses the
cached results and does not execute the query.
Result Caching is enable by default and is transparent to user.
Redshift does not cache some large query result sets. It determines whether to cache query
results based on a number of factors such as number of entries in the cache and the instance
type of your redshift cluster.

6. Compiled Code:
The leader node distributes fully optimized compiled code across all the nodes of a cluster.
Compiling the query eliminates the overhead associated with an interpreter and therefore
increases the execution speed.

The execution engine compiles different code for JDBC and ODBC connections, so If we have
2 clients that are using different approach would need little extra time for the code to be
compiled. If both the client is using the same approach then they can use the same code that
was compiled earlier.

7. Encryption:
While you setup the cluster, you can choose to encrypt the cluster for additional security.
Once the encryption is enabled, Redshift stores all the user table data in the encrypted
format.
Encryption is immutable in nature. That means if a cluster is already encrypted, you cannot
change it to un encrypted. The only thing that you can do is unload the data and reload the
data to unencrypted cluster.

8. Scalability:
Since redshift distributes the data and execute the query on top of it, In case you wish to
improve your performance, you can add more computing power to your cluster.
The data is also duplicated in the other machine so that in case your compute goes out of
service, the data is still intact in the other machine and would be processed from there.
9.

Fundamental Construct of Redshift

Redshift is just another SQL on Cloud with its own advantages and disadvantages.
Let’s understand what all builds the Redshift.

1. Databases:
Redshift creates one database when you provision a cluster.
A cluster can have one or more Databases.

2. Schemas:
A database contains more than 1 schema.
Each schema can have multiple tables and other named objects like views etc.
Whenever we create a table, by default we get a named schema PUBLIC.
Schema is just a collection of tables.
A database can have multiple schema and many different schema can have same name
table.
User with necessary privileges can access objects across multiple schemas in a database.

3. Tables:
A table is just like a spreadsheet that stores your data.
When you load the data in a table, Redshift distributes the rows of the table to each of the
node slices according to the tables distribution style.
The number of slices per node depends on the node size of the cluster.
All the nodes that is hosting the data would participate in the query execution.
While executing the query, the query optimizer redistributes the rows to the compute nodes
as needed to perform any joins and aggregation.
In that process, few rows might travel to certain node or the entire table might get
broadcasted to the other node.

4. Views:
Views are created on top of tables.
You might create view that would just have a few columns as compared to original table.
You can have different users that are working on different views that is connection to the
same table.
5.

Users and Users Group:

Users:
When you create a cluster, you will have to specify a master user who has access to all the
database that are created within the cluster. This master user is a superuser who is the only
user with access to the database initially, though this user can create additional superusers
and users.

Users Group:
More feature:[Link]

Check this video too.

[Link]
Login to Redshift
1. Create an account in aws
2. Go to aws console and launch redshift service by search for it.
3. Lets launch [Link] default a VPC is already created.

4. Lets create IAM roles

Launch IAM,Go to roles and click on Create roles button
5. Select AWS service and then redshift as shown below and then choose redshift customizable

6. Click on next permission button.

7. In the filter policy, type redshift and then choose
AmazonDMSRedshiftS3Role (Not required,remove this)
AmazonRedshiftFullAccess
AmazonS3ReadOnlyAccess
8. Clieck on next tag->next Review
Give a role name myRedShiftRole and click on create role.
Thuis will create a role and shows you the next page like this

9. Click on the hyperlink of myRedShiftRole. It will show you a page with ARN(Amazon resource
Name) .Make sure to copy the ARN
arn:aws:iam::236515830814:role/myRedShiftRole

10. Let’s now launch the redshift cluster.

Go to redshift again and click on “create cluster” from right hand side.
11. On the Create Cluster Page,
Give some cluster name under cluster identifier
Choose Free Trial(It is going to use [Link] per node)
Give Admin user name and Password
We have given awsuser/Awspassword1 as username and password.
While creating a cluster don’t use free trail. Always choose production and then choose the
lowest possible configuration and then from network, choose public access. Otherwise your
redshift might not have access from outside.

12. Click on Create Cluster. While creating a cluster the minimum node will be 2 and maximum
is 32. If you choose Production, you will get more option to choose like the total no of
machine, the size of each machine etc.
13. Once we create the cluster, Click on the link of the cluster and you will find all the
information about the cluster.
14. Terminate the cluster(If not required)
Select the cluster and click on more action->delete

You can also click on the link of the cluster and delete it from the next page.

15. Lets monitor the usage of your cluster, to understand the charges.
Go to your AWS account, from your profile, choose My billing dashboard.
You would see the charges along with the breakup per services.
Click on Cost Explorer to see the daily cost or cost between 2 days and so on.

16. Enable your port

Go to VPC->Security Groups->Inbound rules->Edit Inbound rules->Click on Add rule button
and add an entry for redshift. Give your Public IP address /32 and click on save roles.

17. Connect using DBeaver

File->New->Database connection->Analytical ->Redshift->Next
Host: Copy the end point from the cluster page. Remove the port and dbname
[Link]/dev

Host: [Link]
Password: Awspassword1

18. Once the connection is established you will be able to connect using dbeaver
19. You can see some of the settings that you can apply, using the command show all
Right click on Your Database->SQL Editor->New SQL script.
An editior will comeup, where you can write all your queries.

You can change any of these settings using the below technique
set statement_timeout to 1;

Create Table and Load the data:

1. Go to IAM, Roles and copy the ARN of the roles

arn:aws:iam::236515830814:role/myRedShiftRole
Note: This has access to s3 in readonly and full access to redshift.

2. Let’s create a Database, Schema and table

Employee
EmpNo empName empJob empSalary empMgr deptNo
111 Saketh Analyst 6000 444 10
222 Sudha Clerk 5000 333 20
333 Jagan Manager 7000 111 10
444 Madhu Engineer 6000 222 40
555 Kiran Engineer 9000 222 10
666 Raveena Manager 9000 111 10
Dept
deptNo deptName deptLoc
10 INVENTORY HYD
20 FINANCE BGLR
30 HR MUMBAI

create database test;

After this step, Please edit your connection and change
the database to test and then re-establish the connection.

create schema empdept;

CREATE TABLE [Link](

EMPNO integer,
EMPNAME varchar(30),
EMPJOB varchar(30),
EMPSALARY decimal(8,2),
EMPMGR integer,
DEPTNO integer
);

CREATE TABLE [Link](

DEPTNO integer,
DEPTNAME varchar(30),
DEPTLOC varchar(30)
);

Let’s upload 2 files in our s3 bucket.

Create a folder empdept and upload 2 files [Link] and [Link]
s3://olcbucket2/empdept/

3. Load the data from s3 bucket to your redshift table.

copy [Link] from 's3://olcbucket2/empdept/[Link]'
credentials 'aws_iam_role=arn:aws:iam::236515830814:role/myRedShiftRole'
delimiter ',' region 'us-east-1';

copy [Link] from 's3://olcbucket2/empdept/[Link]'

credentials 'aws_iam_role=arn:aws:iam::236515830814:role/myRedShiftRole'
delimiter ',' region 'us-east-1';

4. Validate the result

select * from [Link];
select * from [Link];
You can also check the data from the UI.

5. Lets create the snapshot and then load the data back from snapshot.
Before you delete your cluster, it is super important to take the backup of the data and
[Link] is what you should do.
Click on snapshot from the Left hand side.(cluster->snapshots) and click on Create
Snapshot.

6. Give some name to the snapshot and click on create snapshot.

7. Once you create the snapshot, you will see the details as shown [Link] note it
occupies some space in your s3 and hence would cost you the storage.

Restoring the Cluster:

Go to the snapshot. Select the snapshot which you wish to restore and click on
Restore from snapshot.

Select “restored to provisioned cluster”. You can specify the name of your new cluster that
you are trying to create. Lets call it redshift-cluster-2.
Once you restore your cluster. You will notice that the role which was associated earlier with
the original cluster is detached now. You will have to attach it again.
Choose your cluster and go to properties tab.

Scroll down to check the IAM role attached is empty.

Click on Associate IAM role, choose the role that we had created earlier. Click on Associate
IAM roles.

Predefined schema:
State wise count.

Amazon Redshift Overview and Setup Guide
No ratings yet
Amazon Redshift Overview and Setup Guide
5 pages
Amazon Redshift
No ratings yet
Amazon Redshift
8 pages
Overview of Amazon Redshift Architecture
No ratings yet
Overview of Amazon Redshift Architecture
1 page
Amazon Redshift: Key Features & Overview
No ratings yet
Amazon Redshift: Key Features & Overview
3 pages
Amazon Redshift: Parallel Processing Guide
No ratings yet
Amazon Redshift: Parallel Processing Guide
5 pages
Amazon Redshift Data Warehouse Architecture
No ratings yet
Amazon Redshift Data Warehouse Architecture
3 pages
Comparing Amazon Redshift and Snowflake
No ratings yet
Comparing Amazon Redshift and Snowflake
54 pages
Redshift Column Management and Features
No ratings yet
Redshift Column Management and Features
8 pages
Amazon Redshift Beginner's Guide
No ratings yet
Amazon Redshift Beginner's Guide
10 pages
Amazon Redshift: Features and Components
No ratings yet
Amazon Redshift: Features and Components
4 pages
AWS Redshift Architecture and Performance
100% (1)
AWS Redshift Architecture and Performance
16 pages
Amazon Redshift Overview and Features
No ratings yet
Amazon Redshift Overview and Features
4 pages
Amazon Redshift Overview and Best Practices
No ratings yet
Amazon Redshift Overview and Best Practices
3 pages
Understanding Amazon Redshift Architecture
No ratings yet
Understanding Amazon Redshift Architecture
2 pages
Amazon Redshift: Scalable Data Warehouse
No ratings yet
Amazon Redshift: Scalable Data Warehouse
15 pages
Overview of AWS Redshift Features
No ratings yet
Overview of AWS Redshift Features
145 pages
Getting Started With Amazon Redshift
No ratings yet
Getting Started With Amazon Redshift
51 pages
Amazon Redshift Overview and Benefits
No ratings yet
Amazon Redshift Overview and Benefits
22 pages
Review of Query Optimization and Resource Management in Amazon Redshift For Large-Scale Analytical Workloads
No ratings yet
Review of Query Optimization and Resource Management in Amazon Redshift For Large-Scale Analytical Workloads
10 pages
Amazon Redshift: Fast Petabyte-Scale Data Warehouse
No ratings yet
Amazon Redshift: Fast Petabyte-Scale Data Warehouse
32 pages
Amazon Redshift Super Class Overview
No ratings yet
Amazon Redshift Super Class Overview
74 pages
Amazon Redshift Overview and Features
No ratings yet
Amazon Redshift Overview and Features
20 pages
Amazon Redshift: Cloud Data Warehouse Guide
No ratings yet
Amazon Redshift: Cloud Data Warehouse Guide
25 pages
Amazon Redshift Architecture Overview
No ratings yet
Amazon Redshift Architecture Overview
20 pages
What Does Amazon Redshift Do
No ratings yet
What Does Amazon Redshift Do
13 pages
Amazon Redshift Performance Tuningand Optimization
No ratings yet
Amazon Redshift Performance Tuningand Optimization
5 pages
Understanding Amazon Redshift Basics
No ratings yet
Understanding Amazon Redshift Basics
12 pages
AWS Database Services Overview
No ratings yet
AWS Database Services Overview
29 pages
Amazon Redshift: Getting Started Guide
No ratings yet
Amazon Redshift: Getting Started Guide
34 pages
Cheat Sheet - Redshift Performance Optimization
No ratings yet
Cheat Sheet - Redshift Performance Optimization
17 pages
Cloud Data Warehouses Explained
No ratings yet
Cloud Data Warehouses Explained
28 pages
Redshift Data Warehouse
No ratings yet
Redshift Data Warehouse
62 pages
Amazon Redshift论文
No ratings yet
Amazon Redshift论文
13 pages
Amazon Redshift Database Developer Guide
No ratings yet
Amazon Redshift Database Developer Guide
783 pages
Amazon Redshift Masterclass Overview
No ratings yet
Amazon Redshift Masterclass Overview
82 pages
Innovations in Amazon Redshift 2022
No ratings yet
Innovations in Amazon Redshift 2022
13 pages
AWS Solutions Architect Exam Guide
No ratings yet
AWS Solutions Architect Exam Guide
23 pages
Amazon Redshift Data Engineering Guide
No ratings yet
Amazon Redshift Data Engineering Guide
65 pages
Amazon Redshift Super Class Overview
No ratings yet
Amazon Redshift Super Class Overview
75 pages
AWS Redshift: Getting Started Guide
No ratings yet
AWS Redshift: Getting Started Guide
2 pages
Amazon Redshift Overview and Benefits
No ratings yet
Amazon Redshift Overview and Benefits
9 pages
Redshift Best Practices Guide
No ratings yet
Redshift Best Practices Guide
50 pages
Overview of AWS Database Services
No ratings yet
Overview of AWS Database Services
10 pages
Redshift DG
No ratings yet
Redshift DG
735 pages
Amazon Redshift Developer Guide PDF
No ratings yet
Amazon Redshift Developer Guide PDF
733 pages
Optimizing Redshift Data Warehouse Design
No ratings yet
Optimizing Redshift Data Warehouse Design
15 pages
Redshift DG
No ratings yet
Redshift DG
871 pages
AWS Data Engineering Cheat Sheet
No ratings yet
AWS Data Engineering Cheat Sheet
27 pages
Best Practices for AWS Redshift Setup
No ratings yet
Best Practices for AWS Redshift Setup
47 pages
AWS Databases for Cloud Practitioner Exam
No ratings yet
AWS Databases for Cloud Practitioner Exam
10 pages
AWS Database Solutions Overview
No ratings yet
AWS Database Solutions Overview
1 page
Data Lakes vs. Data Warehouses Explained
No ratings yet
Data Lakes vs. Data Warehouses Explained
15 pages
Redshift Performance Optimization Guide
No ratings yet
Redshift Performance Optimization Guide
17 pages
Deep Dive and Best Practices For Amazon Redshift ANT418
100% (1)
Deep Dive and Best Practices For Amazon Redshift ANT418
85 pages
Amazon Refshift Book PDF
No ratings yet
Amazon Refshift Book PDF
549 pages
Snowflake and SQL Interview Questions
No ratings yet
Snowflake and SQL Interview Questions
13 pages
Advanced Redshift Techniques for Data Analysis
No ratings yet
Advanced Redshift Techniques for Data Analysis
15 pages
Practical Labs: Yocto Project and Openembedded Training
No ratings yet
Practical Labs: Yocto Project and Openembedded Training
24 pages
Exam 70-465: SQL Server 2012 Solutions
No ratings yet
Exam 70-465: SQL Server 2012 Solutions
4 pages
Introduction to OOP in C# Classes
No ratings yet
Introduction to OOP in C# Classes
30 pages
Introduction to MARIE Architecture
No ratings yet
Introduction to MARIE Architecture
68 pages
Database System Architecture Overview
No ratings yet
Database System Architecture Overview
24 pages
Encrypting Data Transfer in AAA
No ratings yet
Encrypting Data Transfer in AAA
78 pages
eSign Application Audit Trail Report
No ratings yet
eSign Application Audit Trail Report
1 page
Microsoft Office 2024 Pricing Overview
No ratings yet
Microsoft Office 2024 Pricing Overview
4 pages
Internet Security and Digital Currency Guide
No ratings yet
Internet Security and Digital Currency Guide
7 pages
Complete Exam Form Process 2025-2026
No ratings yet
Complete Exam Form Process 2025-2026
3 pages
Colorful Motherboard BIOS Update Guide
No ratings yet
Colorful Motherboard BIOS Update Guide
6 pages
Introduction to Hadoop for Big Data
No ratings yet
Introduction to Hadoop for Big Data
14 pages
Smart AC App Registration Guide
No ratings yet
Smart AC App Registration Guide
40 pages
Convert Webpages to PDF Easily
No ratings yet
Convert Webpages to PDF Easily
5 pages
Comparing Variables in Programming
No ratings yet
Comparing Variables in Programming
86 pages
LabVIEW Excel Report Generation Guide
No ratings yet
LabVIEW Excel Report Generation Guide
3 pages
Array and Matrix Operations in C
No ratings yet
Array and Matrix Operations in C
36 pages
KCGG 151G Settings File Report
No ratings yet
KCGG 151G Settings File Report
2 pages
Untitled Document-5
No ratings yet
Untitled Document-5
4 pages
Java Programs for Primitive Types and Search
No ratings yet
Java Programs for Primitive Types and Search
70 pages
Momentum Candles Indicator Code
No ratings yet
Momentum Candles Indicator Code
4 pages
Mythic AI V3.3.7 Release Notes
No ratings yet
Mythic AI V3.3.7 Release Notes
3 pages
iMaster NCE x86 Server Specs
No ratings yet
iMaster NCE x86 Server Specs
14 pages
Spreadsheet Practical Questions Guide
No ratings yet
Spreadsheet Practical Questions Guide
9 pages
DEAP Dataset for Emotion Recognition
No ratings yet
DEAP Dataset for Emotion Recognition
3 pages
SPARK Academy: AI Training in Medical Imaging
No ratings yet
SPARK Academy: AI Training in Medical Imaging
16 pages
Online Student Management System
No ratings yet
Online Student Management System
10 pages
Blockchain-Based Fake Product Detection
No ratings yet
Blockchain-Based Fake Product Detection
9 pages
Nithish Kumar: IT Hardware & Networking Expert
No ratings yet
Nithish Kumar: IT Hardware & Networking Expert
2 pages
Java Interfaces: Features and Examples
No ratings yet
Java Interfaces: Features and Examples
14 pages

Amazon RedShift

Uploaded by

Amazon RedShift

Uploaded by

Amazon RedShift:

1. It is a Data Warehouse as a service from Amazon.

2. Columnar Data Storage:

Fundamental Construct of Redshift

Users and Users Group:

Check this video too.

4. Lets create IAM roles

6. Click on next permission button.

10. Let’s now launch the redshift cluster.

16. Enable your port

17. Connect using DBeaver

Create Table and Load the data:

2. Let’s create a Database, Schema and table

create database test;

create schema empdept;

CREATE TABLE [Link](

CREATE TABLE [Link](

Let’s upload 2 files in our s3 bucket.

3. Load the data from s3 bucket to your redshift table.

copy [Link] from 's3://olcbucket2/empdept/[Link]'

4. Validate the result

6. Give some name to the snapshot and click on create snapshot.

Restoring the Cluster:

Scroll down to check the IAM role attached is empty.

You might also like