0% found this document useful (0 votes)
6 views25 pages

Amazon RedShift

Amazon Redshift is a cloud-based data warehouse service that utilizes a cluster architecture with leader and compute nodes to efficiently process large-scale queries on structured data using a columnar storage model. It offers features like MPP for parallel processing, data compression, and query optimization, enabling users to run complex queries at a lower cost. Users can create databases, schemas, and tables, load data from S3, and manage their clusters through the AWS console, with options for encryption and scalability.

Uploaded by

gvsjjngmqg
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views25 pages

Amazon RedShift

Amazon Redshift is a cloud-based data warehouse service that utilizes a cluster architecture with leader and compute nodes to efficiently process large-scale queries on structured data using a columnar storage model. It offers features like MPP for parallel processing, data compression, and query optimization, enabling users to run complex queries at a lower cost. Users can create databases, schemas, and tables, load data from S3, and manage their clusters through the AWS console, with options for encryption and scalability.

Uploaded by

gvsjjngmqg
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Amazon RedShift:

1. It is a Data Warehouse as a service from Amazon.


2. It also works on the concept of nodes and clusters.
3. You can connect redshift with many types of applications like Reporting tools etc
4. It uses MPP model, stores the data in columnar data storage which makes it very efficient to
store the data in a compressed format.
5. It is based on Postgres SQL, so most of the syntax would be same as in postgres. but it has a
lot of rich features than just a traditional database.
6. You can use this to run super complicated queries over Petabytes range of Structured data
and run the query in parallel at economic cost.
7. It allows us to pay for compute and storage separately there by reducing the cost.

You can start at 0.25 USD per hour and scale up to PB for $1000 per terabyte per year.

Architecture of Redshift

Client Application:
It can be any application (like Java, .Net, PHP, BI-reporting tool) that uses JDBC/ODBC to
interact with the Client Application.
Redshift is based on Postgres SQL, so many Tool that supports connecting to Postgres can
also be connected to Redshift.

Cluster:
It is the core component of Redshift.
It is the collection of one or more compute node.

Leader Node:
If the cluster has more than 1 compute node, then only Leader node is added and it creates
coordination between/among all the compute nodes.
Client Application interacts with Leader Node and leader node assignes task to all the
compute node.
Leader node collects the output from all the compute node and give it back to the client.
It parses, develops execution plans to carry out database operations in form of series of
steps necessary to obtain results for complex queries.
Based on the best execution plan, the leader node compiles the code and the compiled code
is sent to compute
Based on execution plan, the leader node compiles code, distributes the compiled code to
the compute node and the code runs on a portioned of data.

The leader distributes the code(SQL compiled code) to compute node, only when the
data(that should be worked upon) is present in the compute node.
All other query otherwise run on the leader node.

Compute Node:
This are the actual worker node that runs the code in the data and send the result back to
the leader node for final aggregation.

Node Slices:
A compute node is divided into slices.
Each slice has a portion of RAM, Disk, Core allocated to it.
Leader node manages distributing data to the slice.
Leader node also shares a portion of work for any queries to the slice.
All the slices then work in parallel to complete the task.
The number of size per node is determined by the node size of the cluster.

Internal Network:
Redshift takes advantage of high-bandwidth connections to provide super speed network
communication between leader node and compute node.

Redshift performance
Redshift uses a lot of optimization out of the box to make sure it performs really well and
saves you a lot of time and money.
Some of the optimizations are
1. MPP: It is another Paradigm just like MapReduce to help parallel process of data.
The sql query which is converted to compiled code is distributed to all the compute node
and they all work together on some portion of data there by enabling a Massive Parallel
Processing.

2. Columnar Data Storage:


The entire column is stored together so that if we want to fetch few columns, we don’t have
to scan all the rows to get those columns. In fact we can just fetch some of those columns
directly.
Most of the DB Block size ranges from 2 KB to 32 KB.
Redshift Block size uses 1 MB of data which means more data can be put together in a block
which helps in faster IO operations.
3. Data Compression:
Data is compressed and then stored. When a query is executed, the compressed data is read
into memory and then un compressed during the query execution. Loading less data into the
memory enables Redshift to allocate more memory to analyze the data.

As columnar storage stores the column together we get more repetitive data together

4. Query Optimizer:
It enables Redshift to optimize and re-arrange the query so that the data can be filtered
earlier.

5. Result Caching:
The result of the Query executed in the redshift is cached in the memory of leader node.
When a user submits a query, Redshift checks the result cache to check if we have valid,
cached copy of the query result. If a match is found in the result cache, Redshift uses the
cached results and does not execute the query.
Result Caching is enable by default and is transparent to user.
Redshift does not cache some large query result sets. It determines whether to cache query
results based on a number of factors such as number of entries in the cache and the instance
type of your redshift cluster.

6. Compiled Code:
The leader node distributes fully optimized compiled code across all the nodes of a cluster.
Compiling the query eliminates the overhead associated with an interpreter and therefore
increases the execution speed.

The execution engine compiles different code for JDBC and ODBC connections, so If we have
2 clients that are using different approach would need little extra time for the code to be
compiled. If both the client is using the same approach then they can use the same code that
was compiled earlier.

7. Encryption:
While you setup the cluster, you can choose to encrypt the cluster for additional security.
Once the encryption is enabled, Redshift stores all the user table data in the encrypted
format.
Encryption is immutable in nature. That means if a cluster is already encrypted, you cannot
change it to un encrypted. The only thing that you can do is unload the data and reload the
data to unencrypted cluster.

8. Scalability:
Since redshift distributes the data and execute the query on top of it, In case you wish to
improve your performance, you can add more computing power to your cluster.
The data is also duplicated in the other machine so that in case your compute goes out of
service, the data is still intact in the other machine and would be processed from there.
9.

Fundamental Construct of Redshift


Redshift is just another SQL on Cloud with its own advantages and disadvantages.
Let’s understand what all builds the Redshift.

1. Databases:
Redshift creates one database when you provision a cluster.
A cluster can have one or more Databases.

2. Schemas:
A database contains more than 1 schema.
Each schema can have multiple tables and other named objects like views etc.
Whenever we create a table, by default we get a named schema PUBLIC.
Schema is just a collection of tables.
A database can have multiple schema and many different schema can have same name
table.
User with necessary privileges can access objects across multiple schemas in a database.

3. Tables:
A table is just like a spreadsheet that stores your data.
When you load the data in a table, Redshift distributes the rows of the table to each of the
node slices according to the tables distribution style.
The number of slices per node depends on the node size of the cluster.
All the nodes that is hosting the data would participate in the query execution.
While executing the query, the query optimizer redistributes the rows to the compute nodes
as needed to perform any joins and aggregation.
In that process, few rows might travel to certain node or the entire table might get
broadcasted to the other node.

4. Views:
Views are created on top of tables.
You might create view that would just have a few columns as compared to original table.
You can have different users that are working on different views that is connection to the
same table.
5.

Users and Users Group:


Users:
When you create a cluster, you will have to specify a master user who has access to all the
database that are created within the cluster. This master user is a superuser who is the only
user with access to the database initially, though this user can create additional superusers
and users.

Users Group:
More feature:[Link]

Check this video too.


[Link]
Login to Redshift
1. Create an account in aws
2. Go to aws console and launch redshift service by search for it.
3. Lets launch [Link] default a VPC is already created.

4. Lets create IAM roles


Launch IAM,Go to roles and click on Create roles button
5. Select AWS service and then redshift as shown below and then choose redshift customizable

6. Click on next permission button.


7. In the filter policy, type redshift and then choose
AmazonDMSRedshiftS3Role (Not required,remove this)
AmazonRedshiftFullAccess
AmazonS3ReadOnlyAccess
8. Clieck on next tag->next Review
Give a role name myRedShiftRole and click on create role.
Thuis will create a role and shows you the next page like this

9. Click on the hyperlink of myRedShiftRole. It will show you a page with ARN(Amazon resource
Name) .Make sure to copy the ARN
arn:aws:iam::236515830814:role/myRedShiftRole

10. Let’s now launch the redshift cluster.


Go to redshift again and click on “create cluster” from right hand side.
11. On the Create Cluster Page,
Give some cluster name under cluster identifier
Choose Free Trial(It is going to use [Link] per node)
Give Admin user name and Password
We have given awsuser/Awspassword1 as username and password.
While creating a cluster don’t use free trail. Always choose production and then choose the
lowest possible configuration and then from network, choose public access. Otherwise your
redshift might not have access from outside.

12. Click on Create Cluster. While creating a cluster the minimum node will be 2 and maximum
is 32. If you choose Production, you will get more option to choose like the total no of
machine, the size of each machine etc.
13. Once we create the cluster, Click on the link of the cluster and you will find all the
information about the cluster.
14. Terminate the cluster(If not required)
Select the cluster and click on more action->delete

You can also click on the link of the cluster and delete it from the next page.

15. Lets monitor the usage of your cluster, to understand the charges.
Go to your AWS account, from your profile, choose My billing dashboard.
You would see the charges along with the breakup per services.
Click on Cost Explorer to see the daily cost or cost between 2 days and so on.

16. Enable your port


Go to VPC->Security Groups->Inbound rules->Edit Inbound rules->Click on Add rule button
and add an entry for redshift. Give your Public IP address /32 and click on save roles.

17. Connect using DBeaver


File->New->Database connection->Analytical ->Redshift->Next
Host: Copy the end point from the cluster page. Remove the port and dbname
[Link]/dev

Host: [Link]
Password: Awspassword1

18. Once the connection is established you will be able to connect using dbeaver
19. You can see some of the settings that you can apply, using the command show all
Right click on Your Database->SQL Editor->New SQL script.
An editior will comeup, where you can write all your queries.

You can change any of these settings using the below technique
set statement_timeout to 1;

Create Table and Load the data:


1. Go to IAM, Roles and copy the ARN of the roles

arn:aws:iam::236515830814:role/myRedShiftRole
Note: This has access to s3 in readonly and full access to redshift.

2. Let’s create a Database, Schema and table

Employee
EmpNo empName empJob empSalary empMgr deptNo
111 Saketh Analyst 6000 444 10
222 Sudha Clerk 5000 333 20
333 Jagan Manager 7000 111 10
444 Madhu Engineer 6000 222 40
555 Kiran Engineer 9000 222 10
666 Raveena Manager 9000 111 10
Dept
deptNo deptName deptLoc
10 INVENTORY HYD
20 FINANCE BGLR
30 HR MUMBAI

create database test;


After this step, Please edit your connection and change
the database to test and then re-establish the connection.

create schema empdept;

CREATE TABLE [Link](


EMPNO integer,
EMPNAME varchar(30),
EMPJOB varchar(30),
EMPSALARY decimal(8,2),
EMPMGR integer,
DEPTNO integer
);

CREATE TABLE [Link](


DEPTNO integer,
DEPTNAME varchar(30),
DEPTLOC varchar(30)
);

Let’s upload 2 files in our s3 bucket.


Create a folder empdept and upload 2 files [Link] and [Link]
s3://olcbucket2/empdept/

3. Load the data from s3 bucket to your redshift table.


copy [Link] from 's3://olcbucket2/empdept/[Link]'
credentials 'aws_iam_role=arn:aws:iam::236515830814:role/myRedShiftRole'
delimiter ',' region 'us-east-1';

copy [Link] from 's3://olcbucket2/empdept/[Link]'


credentials 'aws_iam_role=arn:aws:iam::236515830814:role/myRedShiftRole'
delimiter ',' region 'us-east-1';

4. Validate the result


select * from [Link];
select * from [Link];
You can also check the data from the UI.

5. Lets create the snapshot and then load the data back from snapshot.
Before you delete your cluster, it is super important to take the backup of the data and
[Link] is what you should do.
Click on snapshot from the Left hand side.(cluster->snapshots) and click on Create
Snapshot.

6. Give some name to the snapshot and click on create snapshot.

7. Once you create the snapshot, you will see the details as shown [Link] note it
occupies some space in your s3 and hence would cost you the storage.

Restoring the Cluster:


Go to the snapshot. Select the snapshot which you wish to restore and click on
Restore from snapshot.

Select “restored to provisioned cluster”. You can specify the name of your new cluster that
you are trying to create. Lets call it redshift-cluster-2.
Once you restore your cluster. You will notice that the role which was associated earlier with
the original cluster is detached now. You will have to attach it again.
Choose your cluster and go to properties tab.

Scroll down to check the IAM role attached is empty.

Click on Associate IAM role, choose the role that we had created earlier. Click on Associate
IAM roles.

Predefined schema:
State wise count.

You might also like