
AWS Data Engineering Services Overview

The document provides a comprehensive overview of various AWS services related to data engineering, detailing their primary purposes, use cases, and considerations. It includes sections on data concepts, pipeline stages, and limitations of specific AWS services. Each service is categorized by functionality, cost efficiency, and best use cases, aiding in understanding their applications in data engineering workflows.

Uploaded by

Naelah Khojandi
Copyright
© All Rights Reserved

Data Engineering Table Map AWS DEA Related Services

| AWS Service | Service Type | 🏳️ Primary Purpose | 📦 Data Source | ⚙️ Least Operations | 💹 Cost Efficient | 💲 Cost (Per) | 🚦 Best Use Case | ✅ Key Benefits | ⚠️ Considerations | ❌ Not Used For |
|---|---|---|---|---|---|---|---|---|---|---|
| AWS Secrets Manager | Security Management | Manage and retrieve secrets (credentials) securely | Manually by user or API call | 🟢 High | 🟢 High | $ (based on usage) | Storing and managing secrets (API keys, passwords) | Automatic rotation of secrets, audit logging | Costs can increase with the number of secrets | Securely transferring secrets to users |
| Amazon S3 | Storage | Object storage for unstructured data | Unstructured & semi-structured data, large files | 🟢 High | 🟢 High | $ (cost-effective for storage) | Storing large datasets and data lakes | Scalable, durable, low-cost storage | Not designed for analytics directly | SQL analytics, data transformation |
| AWS Lambda | Compute | Serverless compute for running code | Event-based sources: S3, DynamoDB, API Gateway, SNS, SQS, Kinesis, and custom events | 🟢 High | 🟢 High | $ - $$ (pay-per-invocation) | Event-driven tasks, lightweight ETL, and triggers | Serverless, scales automatically, pay-per-use | Best for lightweight, short-duration tasks | Long-running jobs, complex ETL processes |
| AWS DMS | Data Migration | Database migration | Relational databases, NoSQL databases | 🟡 Medium | 🟢 High | $$$ (depends on data volume) | Migrating or syncing databases to AWS | Continuous replication, cross-DB compatibility | Best for migration, not ongoing ETL | Data warehousing, complex analytics |
| Amazon RDS | Database | Managed relational database | Relational data (MySQL, PostgreSQL, Oracle, etc.) | 🟡 Medium | 🟡 Medium | $$ - $$$ (depends on instance size) | Operational databases for applications | Managed, multiple DB engines | Limited for analytical workloads | Data warehousing, real-time analytics |
| Amazon Athena | Analytics | Query S3 data with SQL | Data stored in Amazon S3 | 🟢 High | 🟢 High | $$ (pay-per-query) | Ad-hoc querying of structured data in S3 | Serverless, no ETL needed, pay-per-query | Limited for complex processing | Data transformation, real-time analytics |
| Amazon Redshift | Data Warehouse | Data warehousing for structured data | Structured data from S3, relational databases | 🔴 Low | 🟡 Medium | $$$ - $$$$ (depends on nodes) | BI reporting and analytics on large datasets | Columnar storage, high-performance SQL | Higher cost for smaller datasets | Real-time analytics, unstructured data |
| Amazon Redshift Spectrum | Analytics | Querying S3 data from Redshift without loading it into the warehouse | Structured and semi-structured data in S3 | 🟢 High | 🟢 High | $$ (charged per TB of data scanned) | Running SQL queries on structured and semi-structured data in S3 | Extends Redshift to S3 data, saves on storage costs, supports large datasets | Requires Redshift cluster, adds extra costs on query usage in S3 | ETL processing, data transformation |
| AWS Glue | ETL | ETL service and Data Catalog | S3, relational databases, JDBC-compatible sources | 🟢 High | 🟡 Medium | $$$ (ETL jobs) | ETL for transforming raw data into analytics-ready form | Serverless ETL, Data Catalog | Not suitable for large-scale SQL analytics | Real-time analytics, ad-hoc SQL querying |
| AWS Glue Data Catalog | Metadata Management | Centralized metadata repository for data sources | S3, relational databases, JDBC-compatible sources | 🟢 High | 🟢 High | $ (cost-effective for metadata storage) | Managing and discovering data across AWS and external sources | Integrates with S3, Redshift, Athena; enables data discovery and schema inference | Not a data storage solution; requires data sources to be managed separately | ETL processing, ad-hoc querying |
| AWS Glue Studio | ETL | Visual interface for creating ETL jobs | S3, relational databases, JDBC-compatible sources | 🟢 High | 🟡 Medium | $$$ (ETL jobs) | Designing and managing ETL workflows without coding | Intuitive GUI, drag-and-drop ETL design, simplifies complex workflows | Limited customization for advanced scripts | Real-time processing, non-ETL tasks |
| AWS Glue DataBrew | Data Preparation | Data preparation and cleaning with a visual tool | S3, relational databases, JDBC-compatible sources | 🟢 High | 🟡 Medium | $$$ (ETL jobs, visual prep) | No-code data cleaning, exploration | Visual interface, no coding required | Limited for large transformations | SQL analytics, real-time streaming |
| Kinesis | Data Streaming | Real-time data streaming | Real-time streaming data from IoT, apps, and databases | 🟡 Medium | 🔴 Low | $$$ (per shard/hour) | Real-time analytics and data ingestion | Supports real-time processing | Requires streaming data, setup for scale | Batch processing, SQL querying |
| Kinesis Data Streams | Data Streaming | Real-time streaming data ingestion | Streaming sources (e.g., IoT, logs) | 🔴 Low | 🟡 Medium | $$$ (shards per hour) | Real-time analytics and event streaming | Durable streams, real-time data ingestion | Requires setup and tuning for scale | Batch processing, large-scale ML |
| Amazon Data Firehose | Data Streaming | Real-time data delivery | Streaming sources (e.g., IoT, logs) | 🟢 High | 🟢 High | $$$ (volume of data ingested) | Streaming data delivery to S3, Redshift | Fully managed, near real-time delivery | Limited transformations during delivery | Complex data transformations, heavy analytics |
| Amazon EMR | Big Data Processing | Big data processing with Hadoop, Spark, etc. | S3, HDFS, relational databases, NoSQL sources | 🔴 Low | 🔴 Low | $$$$ (based on cluster size) | Data transformations, large-scale ML | Customizable, supports Hadoop ecosystem | Cluster management required | Simple ETL, ad-hoc SQL on S3 |
| Amazon MSK for Apache Kafka | Data Streaming | Managed Apache Kafka for streaming data | Streaming sources (logs, IoT, metrics) | 🔴 Low | 🔴 Low | $$$ (per cluster and storage usage) | Real-time data streaming and event-driven architectures | Fully managed Kafka, integrates with AWS, highly scalable | Requires Kafka knowledge; higher operational costs than Kinesis | Batch processing, SQL querying |
| Managed Apache Flink | Data Streaming | Real-time data processing and analytics | Streaming data (MSK, Kinesis) | 🟡 Medium | 🔴 Low | $$$ (pay-per-instance/hour + storage) | Processing streaming data for analytics | Fully managed Flink, low latency, supports complex event processing | Requires Flink knowledge; not ideal for simple streaming needs | Batch processing, data storage |
| Amazon QuickSight | Business Intelligence (BI) | Business intelligence (BI) and data visualization | Redshift, S3, RDS, Athena, or other SQL sources | 🟢 High | 🟢 High | $ - $$ (per-user pricing) | Visualizing data from multiple sources | Interactive dashboards, integrates with Redshift and S3 | Limited for complex ETL, transformations | Data preparation, ETL, large-scale ML |
| AWS Lake Formation | Data Lake Management | Data lake management and security | Primarily S3 data | 🟢 High | 🟢 High | $ (management costs minimal) | Secure data lake setup and access controls | Simplifies setup, access control for S3 | Primarily for S3 data management | SQL querying, data transformation |
| AWS Data Exchange | Data Exchange | Secure sharing of third-party data | Third-party data providers | 🟢 High | 🟢 High | $ - $$$ (dataset consumption) | Accessing third-party datasets for analytics | Easy integration with AWS analytics services | Limited to external datasets | Internal ETL, real-time ingestion |
| Amazon DataZone | Data Governance | Data governance and sharing | S3, Redshift, databases | 🟢 High | 🟢 High | $ (metadata and governance) | Centralized data discovery and governance | Simplifies governance and access management | Requires setup for organization-wide adoption | Data processing or analytics |
| MWAA - Apache Airflow | Workflow Orchestration | Orchestrate workflows and pipelines | Any source (S3, RDS, Redshift, etc.) | 🟢 High | 🔴 Low | $$$ (based on usage and task execution) | Scheduling and managing ETL workflows | Fully managed Airflow, supports DAGs, scales dynamically | Requires DAG knowledge; costs increase with large workflows | Real-time streaming, heavy transformations |
| AWS Step Functions | Workflow Orchestration | Workflow orchestration | Compatible with Lambda, Glue, S3, RDS, and many AWS services | 🟢 High | 🟡 Medium | $$ (pay-per-transaction) | Coordinating ETL and multi-step data processing tasks | Serverless, visual workflows, error handling | Best for multi-step workflows, not real-time streaming | Real-time streaming, single-step tasks |
| Amazon EventBridge | Event Bus | Serverless event bus to route events between services | Event streams, SaaS apps, AWS services | 🟢 High | 🟢 High | $ (pay-per-event, cost-effective for low volumes) | Triggering workflows in response to application or system events | Fully managed, integrates with over 200 AWS services and SaaS apps | Requires understanding of event-driven architecture | Data transformation, large-scale analytics |
| Amazon CloudWatch | Monitoring | Monitoring and observability for AWS resources | Logs, metrics, alarms from AWS resources | 🟢 High | 🟡 Medium | $$ (per logs, metrics, and alarms) | Monitoring AWS infrastructure, alarms, log analysis | Fully managed monitoring, integrates with AWS services | Costs scale with number of logs, metrics, and alarms | Real-time data transformation, heavy processing tasks |

12/1/2024
Data Engineering Table Map Data Concepts

| Concept | Description | Use Cases | Related Concepts |
|---|---|---|---|
| DaaS (Data as a Service) | A cloud service providing data on demand, allowing access to data without owning or storing it. | Data APIs, data marketplaces, real-time insights | Data sharing, cloud computing |
| DaaP (Data as a Platform) | Data storage and processing platform providing tools for data collection, storage, processing, and sharing. | Data pipelines, analytics platforms | ETL, data lakes, data warehouses |
| Data Types | Different forms of data: Structured (e.g., relational databases), Semi-structured (e.g., JSON), Unstructured (e.g., images). | Various data storage and analytics needs | Schema design, data formats |
| OLTP | Online Transaction Processing: handles real-time transactions with frequent updates, typically in row-based storage. | E-commerce, banking transactions | Row-based databases, ACID |
| OLAP | Online Analytical Processing: optimizes read-heavy operations, often using columnar storage for fast querying. | Business intelligence, data warehousing | Columnar databases, reporting |
| Row-Based Storage | Data stored by rows, ideal for fast read/write operations on entire rows, common in OLTP systems. | Transactional systems, real-time processing | Relational databases, ACID |
| Columnar-Based Storage | Data stored by columns, enabling fast read operations for analytical queries, used in OLAP systems. | Data warehousing, analytics | Columnar databases, BI tools |
| 5 Vs of Data | Challenges in big data characterized by Volume (amount), Variety (types), Velocity (speed), Veracity (accuracy), and Value (usefulness). | Big data management, analytics | Data quality, data governance |
| Data Processing Modes | Batch processing handles data in large chunks at set intervals; stream processing handles data continuously as it arrives. | Real-time analytics, periodic reporting | Kinesis, Apache Kafka, Spark |
| ETL | Extract, Transform, Load: traditional method where data is transformed before loading into the target system. | Data warehousing, data integration | Glue, Informatica |
| ELT | Extract, Load, Transform: loads raw data into the target system, allowing transformation within the system. | Cloud-based data warehousing (e.g., Redshift, Snowflake) | Athena, Redshift |
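
The row-based versus columnar distinction above can be made concrete with a small in-memory sketch. It is only an illustration of the two layouts, not how any AWS service stores data; the records and field names (`order_id`, `customer`, `amount`) are made up for the example.

```python
from typing import Any

# Hypothetical sample records; the field names are illustrative only.
rows: list[dict[str, Any]] = [
    {"order_id": 1, "customer": "ana", "amount": 20.0},
    {"order_id": 2, "customer": "ben", "amount": 35.5},
    {"order_id": 3, "customer": "ana", "amount": 12.5},
]


def to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot a row-based layout into a column-based one."""
    columns: dict[str, list[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# OLTP-style access: fetch one whole record (natural in a row layout).
order_2 = rows[1]

# OLAP-style access: aggregate a single field (only one column is read).
total_amount = sum(columns["amount"])
```

In the row layout every field of an order sits together, which suits transactional reads and writes; in the column layout an aggregate such as `total_amount` touches only the `amount` list, which is the property OLAP engines exploit.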

Data Engineering Table Map Data Pipeline Stages

| Pipeline Stage | Functionality | Service Examples |
|---|---|---|
| Data Sources | Objects, Databases, Mobile, IoT | S3 (for objects), RDS (for databases), AWS IoT |
| Ingestion | Database import, Object/file ingestion, Streaming data | AWS Glue, Kinesis Data Firehose, S3 Transfer Acceleration |
| Storage | Databases, Managed storage | Amazon RDS, Amazon S3, DynamoDB |
| Cataloging, Processing, and Governance | Data Catalog, Processes | AWS Glue Data Catalog, AWS Lambda, Lake Formation |
| Analytics and Visualization | Search, Interactive dashboards, Queries, Embedded analytics | Amazon OpenSearch, QuickSight, Athena, Redshift |

Security and Monitoring Elements

1. Single Sign-On: AWS IAM Identity Center (formerly AWS SSO)

2. Identity: AWS IAM

3. Network Security: VPC Security Groups, AWS WAF

4. Monitoring: CloudWatch, CloudTrail

Data Engineering Table Map AWS Limitations

| Service | Type of Limitation | Limit | Can be Increased? |
|---|---|---|---|
| Amazon S3 | Object Size | Max: 5 TB per object | ❌ |
| Amazon S3 | PUT Upload Size | Max: 5 GB (for a single PUT request) | ❌, use multipart uploads |
| Amazon S3 | Bucket Count | 100 buckets per account (default) | ✅ via AWS Support |
| Amazon Kinesis | Data Stream Retention | Default: 24 hours; Max: 365 days | ❌ |
| Amazon Kinesis | Record Size | Max: 1 MB per record | ❌ |
| Amazon Kinesis | Shard Throughput | Max: 1 MB/sec write, 2 MB/sec read | ❌, add more shards |
| AWS Glue | Job Timeout | Max: 48 hours | ❌ |
| AWS Glue | Concurrent Jobs | Max: 350 concurrent job runs (default) | ✅ via AWS Support |
| AWS Glue | Data Catalog Table Count | 1 million tables per account | ❌ |
| Amazon EMR | Cluster Instance Count | Max: 10,000 instances per cluster | ✅ via AWS Support |
| Amazon EMR | Step Execution Time | Max: 12 hours (default for Hadoop) | ✅ via configuration |
| Amazon EMR | File Upload Size to HDFS | Dependent on cluster disk size | ❌ |
| Amazon Athena | Query Timeout | Max: 30 minutes | ❌ |
| Amazon Athena | Query Result Size | Max: 10 GB | ❌ |
| Amazon Athena | Concurrent Queries | 100 queries per account | ❌ |
| Amazon Redshift | Node Count | Max: 128 nodes (RA3); 200 nodes (DS2 and DC2) | ✅ via AWS Support |
| Amazon Redshift | Concurrent Queries | Max: 50 queries | ❌ |
| Amazon Redshift | Table Count | Max: 9,900 tables per cluster | ❌ |
| Amazon RDS | Database Size | Max: 64 TB (varies by engine) | ❌ |
| Amazon RDS | Connections | Max: Depends on instance type and engine | ❌ |
| Amazon RDS | Backup Retention | Max: 35 days | ❌ |
| DynamoDB | Item Size | Max: 400 KB per item (including attribute names and values) | ❌ |
| DynamoDB | Partition Key Size | Max: 2048 bytes | ❌ |
| DynamoDB | Global Secondary Indexes | Max: 20 per table | ❌ |
| DynamoDB | Provisioned Throughput | Soft limit: 40,000 RCU/WCU per table | ✅ via AWS Support |
| AWS Lambda | Execution Timeout | Max: 15 minutes | ❌ |
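
The per-shard Kinesis figures listed above (1 MB/sec write, 2 MB/sec read) lend themselves to a quick capacity estimate. The sketch below is a rough sizing aid under those two limits only; the function name and inputs are illustrative, and real sizing may also need to account for record counts and the number of consumers.

```python
import math

# Per-shard throughput limits as listed in the limitations table.
WRITE_MB_PER_SEC_PER_SHARD = 1.0
READ_MB_PER_SEC_PER_SHARD = 2.0


def shards_needed(write_mb_per_sec: float, read_mb_per_sec: float) -> int:
    """Estimate the shard count that covers both write and read throughput."""
    if write_mb_per_sec < 0 or read_mb_per_sec < 0:
        raise ValueError("throughput must be non-negative")
    by_write = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    by_read = math.ceil(read_mb_per_sec / READ_MB_PER_SEC_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write, by_read)


# Example: producers write 4.5 MB/sec in total, consumers read 6 MB/sec.
estimate = shards_needed(write_mb_per_sec=4.5, read_mb_per_sec=6.0)
```

Here the write side dominates (5 shards for writes versus 3 for reads), which matches the table's advice that the only way past a shard limit is to add more shards.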


| Service | Type of Limitation | Limit | Can be Increased? |
|---|---|---|---|
| AWS Lambda | Deployment Package Size | Max: 50 MB (compressed); 250 MB (uncompressed) | ❌ |
| AWS Lambda | Memory Allocation | Min: 128 MB; Max: 10,240 MB | ❌ |
| AWS Lambda | Concurrent Executions | Default: 1,000 per account | ✅ via AWS Support |
| AWS Step Functions | State Machine Execution Time | Max: 1 year | ❌ |
| AWS Step Functions | Input/Output Payload Size | Max: 256 KB per state | ❌ |
| AWS Step Functions | Concurrent Executions | Default: 2,000 per account (Standard Workflows) | ✅ via AWS Support |
| AWS Step Functions | State Machine Transitions | Max: 25,000 transitions per second (Standard Workflows) | ✅ via AWS Support |
| AWS Step Functions | Execution History Retention | Max: 90 days (Standard Workflows) | ❌ |
| AWS Step Functions | Express Workflow Execution | Max: 5 minutes per execution | ❌ |
| AWS Step Functions | Execution Events Limit | Max: 25,000 events per execution (Standard Workflows) | ❌ |
| Amazon EventBridge | Rule Count per Event Bus | Max: 300 rules per event bus | ✅ via AWS Support |
| Amazon EventBridge | Target Count per Rule | Max: 5 targets per rule | ✅ via AWS Support |
| Amazon EventBridge | Event Payload Size | Max: 256 KB | ❌ |
| Amazon EventBridge | Event Retention | Default: 24 hours; Max: 7 days | ❌ |
| Amazon EventBridge | Custom Event Buses | Max: 100 custom event buses per account | ✅ via AWS Support |
| Amazon EventBridge | Event Rate (Default Bus) | Max: 1,000 requests per second | ✅ via AWS Support |
| Amazon CloudWatch | Log Retention | Default: Indefinite; Max: Defined by user | ❌ |
| Amazon CloudWatch | Metric Data Retention | 15 months (high-resolution metrics: 63 days) | ❌ |
| Amazon CloudWatch | Metrics Per Account | 500,000 custom metrics per account | ✅ via AWS Support |
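
Because the EventBridge payload limit above is a hard cap, it can help to check an event's size before publishing it. The following is a minimal sketch that measures only the serialized `detail` body; EventBridge sizes the whole entry, so treat this as an approximation. The function names are illustrative.

```python
import json

# Event payload limit as listed in the limitations table (256 KB).
MAX_EVENT_BYTES = 256 * 1024


def detail_size_bytes(detail: dict) -> int:
    """Size of the event detail once serialized as compact UTF-8 JSON."""
    encoded = json.dumps(detail, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def fits_event_limit(detail: dict, limit: int = MAX_EVENT_BYTES) -> bool:
    """Return True when the serialized detail stays within the limit."""
    return detail_size_bytes(detail) <= limit


small_event = {"order_id": 42}
large_event = {"blob": "x" * MAX_EVENT_BYTES}
```

A common workaround for oversized events is to store the large body in S3 and publish only a pointer to it, which keeps the event itself well under the cap.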

Data Engineering Table Map Real-World Examples

| Company Goal | What Service to Use | When to Use | Why | Real-World Example | Type of Data Challenge |
|---|---|---|---|---|---|
| 🏢 Manage sensitive API keys and database credentials | AWS Secrets Manager | When you need to manage and rotate secrets easily | Secrets Manager enables secure storage and management of secrets, making it easy to retrieve credentials as needed | A finance application uses AWS Secrets Manager to store API keys securely, allowing seamless integration with microservices. | Data Silos |
| 🏭 Store and retrieve large amounts of unstructured data | Amazon S3 | When you need scalable and cost-effective object storage | S3 offers high durability and availability, making it ideal for storing files, backups, and multimedia content | A media company stores high-resolution video files in Amazon S3, using lifecycle policies to transition older files to cheaper … | Data Silos |
| 🏢 Run code in response to events without managing servers | AWS Lambda | When you want to execute code in response to triggers | Lambda allows you to run code on demand, paying only for the compute time used | An e-commerce site uses AWS Lambda to automatically resize images uploaded by users before storing them in S3. | Exponential Data Growth |
| 🏢 Migrate databases quickly and securely | AWS DMS | When you need to replicate or migrate databases with minimal downtime | DMS supports many database engines, providing low-downtime migrations and real-time replication | A healthcare organization uses AWS DMS to migrate patient records from an on-premises Oracle database to Amazon Aurora with minimal … | Data Silos |
| 🏢 Manage relational databases with ease | Amazon RDS | When you want a fully managed relational database solution | RDS automates database management tasks like backups, patching, and scaling | A retail application uses Amazon RDS to manage its MySQL database for handling transactions and inventory management. | Data Silos |
| 🏢 Store and analyze time-series data | Amazon Timestream | When you need fast query performance and efficient storage for time-series data | Timestream is designed specifically for time-series data, allowing for easy storage and real-time analysis | An energy company uses Amazon Timestream to monitor and analyze power consumption data from smart meters over time for optimizing … | Data Silos |
| 🏭 Run ad-hoc queries on structured or semi-structured data | Amazon Athena | When you need to analyze data stored in S3 without managing infrastructure | Athena allows you to run SQL queries against data in S3, enabling quick analysis without data loading | A marketing team uses Amazon Athena to analyze clickstream data stored in S3 to improve website engagement. | Data Silos |
| 🏢 Set up a petabyte-scale data warehouse | Amazon Redshift | When you need fast complex queries on large datasets | Redshift is designed for high performance and can scale to petabytes of data using columnar storage and MPP | A business intelligence team uses Amazon Redshift for generating complex reports and dashboards from large sales datasets. | Data Silos |
| 🏭 Query data in S3 directly from a Redshift cluster | Amazon Redshift Spectrum | When you want to analyze data stored in S3 without loading it into Redshift | Redshift Spectrum allows you to query data in S3 directly, expanding the analytical capabilities of Redshift | An analytics firm combines internal data in Redshift with external data stored in S3 using Redshift Spectrum for deeper insights. | Data Silos |
| 🏢 Automate data preparation and ETL processes | AWS Glue | When you need to prepare data for analytics and machine learning | Glue offers serverless ETL capabilities and can automatically discover and catalog data | A healthcare organization uses AWS Glue to clean and transform patient data from various sources for use in analytics. | Exponential Data Growth |
| 🏢 Maintain a central repository of metadata for data sources | AWS Glue Data Catalog | When you need to manage and discover metadata for datasets | The Data Catalog allows teams to quickly find data and understand its lineage and usage | A data engineering team uses AWS Glue Data Catalog to manage metadata across multiple data sources for their ETL processes. | Data Silos |
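
The S3 example above mentions lifecycle policies that move older files to cheaper storage. One way such a policy might look is sketched below as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, rule ID, day counts, and target storage classes are assumptions chosen for illustration; the source does not say which tiers the media company uses.

```python
def lifecycle_rules(prefix: str, ia_after_days: int = 30,
                    archive_after_days: int = 90) -> dict:
    """Build a lifecycle configuration that tiers objects down over time.

    All values are illustrative. S3 does not accept a transition to
    STANDARD_IA earlier than 30 days after object creation.
    """
    if not 30 <= ia_after_days < archive_after_days:
        raise ValueError("expected 30 <= ia_after_days < archive_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-down-older-files",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": archive_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Hypothetical prefix for the video files in the example.
config = lifecycle_rules("videos/")
```

With boto3 installed and credentials configured, the result would be passed as `LifecycleConfiguration=config` together with a `Bucket` name to `put_bucket_lifecycle_configuration`.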


| Company Goal | What Service to Use | When to Use | Why | Real-World Example | Type of Data Challenge |
|---|---|---|---|---|---|
| 🏢 Visualize and manage ETL jobs with a user-friendly interface | AWS Glue Studio | When you want to create and orchestrate data workflows visually | Glue Studio provides a drag-and-drop interface, simplifying ETL job creation and management | A data analyst uses AWS Glue Studio to visually design and schedule ETL workflows for preparing sales data for reporting. | Data Silos |
| 🏢 Simplify data preparation without coding | AWS Glue DataBrew | When you want to clean and normalize data quickly and easily | DataBrew provides a visual interface for data preparation, enabling users to perform actions like filtering and transforming without writing code | A marketing team uses AWS Glue DataBrew to preprocess user engagement data for their analytics, speeding up the data preparation … | Exponential Data Growth |
| 🏢 Build real-time streaming applications | Kinesis | When you want to process and analyze data streams in real time | Kinesis allows for the ingestion and analysis of streaming data from various sources | A social media platform uses Amazon Kinesis to monitor user engagement metrics in real time. | Exponential Data Growth |
| 🏢 Process and analyze data streams for real-time insights | Kinesis Data Streams | When you need to handle high throughput of streaming data | Data Streams enables you to build applications that continuously process and analyze streaming data | A financial institution uses Kinesis Data Streams to detect and prevent fraudulent activities by continuously analyzing transaction data. | Real-Time Processing |
| 🏢 Load streaming data to storage and analysis services | Amazon Data Firehose | When you need to reliably capture streaming data | Data Firehose delivers streaming data to destinations like S3, Redshift, and Elasticsearch Service without complex setup | An IoT application uses Amazon Data Firehose to collect and store logs from connected devices for later analytics. | Exponential Data Growth |
| 🏢 Perform large-scale data processing and analytics | Amazon EMR | When you need to process big data using frameworks like Apache Hadoop or Apache Spark | EMR simplifies running big data frameworks, allowing you to scale up or down based on processing needs | An online streaming service uses Amazon EMR to process and analyze large amounts of video logs to improve streaming quality. | Exponential Data Growth |
| 🏢 Stream data in real-time with Kafka | MSK for Apache Kafka | When you need a managed Kafka service for real-time data feeds | MSK simplifies running Apache Kafka and makes it easy to ingest, process, and analyze streaming data | A logistics company utilizes Amazon MSK to monitor and analyze real-time data from their delivery trucks to enhance operational efficiency. | Real-Time Processing |
| 🏢 Perform low-latency processing for streaming data | Managed Apache Flink | When you need real-time data processing for complex event-driven applications | Managed Flink allows you to build streaming applications for real-time analytics | An analytics firm uses Managed Apache Flink to process real-time data streams from IoT sensors for predictive maintenance. | Real-Time Processing |
| 🏢 Create interactive dashboards and visualizations | Amazon QuickSight | When you want to create rich visualizations and dashboards for business intelligence | QuickSight provides easy-to-use BI tools that connect to various data sources, enabling interactive analysis | A SaaS company uses Amazon QuickSight to visualize customer usage patterns for product management insights. | Data Silos |
| 🏢 Set up secure, scalable data lakes | AWS Lake Formation | When you need to maintain governance and create data lakes securely | Lake Formation simplifies data lake creation with built-in security and access management features | A retail company creates a centralized data lake using AWS Lake Formation to store all customer and transaction data for analytics. | Data Silos |
| 🏢 Access third-party data for analytics | AWS Data Exchange | When you need to evaluate and purchase third-party datasets | Data Exchange makes it easy to access and integrate diverse datasets from various providers into your analytics workflow | A financial services company uses AWS Data Exchange to access external stock market data to enrich their investment analytics. | Data Silos |


| Company Goal | What Service to Use | When to Use | Why | Real-World Example | Type of Data Challenge |
|---|---|---|---|---|---|
| 🏢 Create controlled environments for data collaboration | Amazon DataZone | When you want to facilitate data sharing and collaboration across teams | DataZone provides a centralized space for managing data access and collaboration, improving productivity and security | A research team uses Amazon DataZone to share and analyze clinical trial data, ensuring that only authorized personnel can access … | Data Silos |
| 🏢 Build, train, and deploy machine learning models | Amazon SageMaker | When you want to create and manage machine learning models at scale | SageMaker provides a fully managed service for building, training, and deploying ML models, reducing operational overhead | A marketing analytics firm uses Amazon SageMaker to develop predictive models for customer segmentation and to optimize … | Data Silos |
| 🏢 Orchestrate complex workflows for data processing | MWAA - Apache Airflow | When you need to manage dependencies and scheduling of data pipelines | MWAA provides a managed environment for Apache Airflow, simplifying workflow orchestration | A data engineering team uses MWAA to schedule and manage a series of ETL jobs that prepare datasets for machine learning training. | Data Silos |
| 🏢 Coordinate microservices and serverless workflows | AWS Step Functions | When you need to manage the flow of application logic across multiple AWS services | Step Functions allows you to define workflows visually and manage state across distributed systems | A chatbot application uses AWS Step Functions to orchestrate the flow of data between various AWS services, such as Lambda and DynamoDB, … | Data Silos |
| 🏢 Route events between AWS services for event-driven architectures | Amazon EventBridge | When you need to connect different parts of your application using event data | EventBridge enables easy routing of events between AWS services, facilitating event-driven architectures | An online retail platform uses Amazon EventBridge to trigger workflows based on user activities, such as order placements and inventory … | Data Silos |
| 🏢 Monitor AWS resources and applications in real-time | Amazon CloudWatch | When you need insights into your AWS application's performance | CloudWatch offers monitoring capabilities, allowing you to set alarms and automate responses based on resource metrics | A gaming company uses Amazon CloudWatch to monitor in-game metrics and automatically scale resources based on player activity. | Data Silos |
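
The Step Functions example above (a workflow that passes data between Lambda and DynamoDB) could be expressed in Amazon States Language roughly as follows. This is a sketch under assumptions: the state names, the Lambda ARN, the table name, and the item fields (`session_id`, `reply`) are all hypothetical, and the source does not describe the actual chatbot workflow.

```python
import json


def chatbot_state_machine(lambda_arn: str, table_name: str) -> dict:
    """Two-step workflow: call a Lambda function, then store its output."""
    return {
        "Comment": "Illustrative chatbot flow (names are hypothetical)",
        "StartAt": "InterpretMessage",
        "States": {
            "InterpretMessage": {
                "Type": "Task",
                "Resource": lambda_arn,
                # Retry the task a couple of times if it fails.
                "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
                "Next": "SaveConversation",
            },
            "SaveConversation": {
                "Type": "Task",
                # Direct DynamoDB service integration, no Lambda needed.
                "Resource": "arn:aws:states:::dynamodb:putItem",
                "Parameters": {
                    "TableName": table_name,
                    "Item": {
                        "session_id": {"S.$": "$.session_id"},
                        "reply": {"S.$": "$.reply"},
                    },
                },
                "End": True,
            },
        },
    }


def dangling_targets(definition: dict) -> set:
    """Return 'StartAt'/'Next' targets that are not defined as states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    targets.update(s["Next"] for s in states.values() if "Next" in s)
    return targets - set(states)


definition = chatbot_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:interpret",  # hypothetical
    "chat_sessions",  # hypothetical table
)
definition_json = json.dumps(definition, indent=2)
```

`definition_json` is the string a state machine would be created from; `dangling_targets` is only a small local sanity check and is not part of any AWS API.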

Data Engineering Table Map Pro Tips

| Relevant AWS Service(s) | Hint/Description |
|---|---|
| Amazon Athena | To use Spark in Athena, configure it in Athena workgroups. |
| Amazon Athena | Use bucketing (CTAS) in Athena for high-cardinality columns; use partitioning for low-cardinality columns. |
| Amazon Athena | Use EXPLAIN ANALYZE in Athena to see execution plans. |
| Amazon Athena | Limit data scanned in Athena using workgroups. |
| Amazon Athena | UNLOAD works for Athena too. |
| Amazon Athena, AWS Glue Data Catalog | If Athena queries are slow because of many partitions, use Athena partition projection and the Glue Data Catalog. |
| Amazon Aurora | Aurora's auto-scaling feature adjusts the number of read replicas based on demand. |
| Amazon Aurora Serverless | Aurora Serverless automatically scales based on unpredictable workloads. |
| Amazon EventBridge | Use Amazon EventBridge to build event-driven architectures integrating AWS services and external SaaS. |
| Amazon Kinesis Data Analytics | Kinesis Data Analytics supports Kinesis Data Streams, Firehose, and S3. |
| Amazon Kinesis Data Analytics, Apache Flink | To analyze data in real time, use Apache Flink. Use sliding windows to combine current and past data. |
| Amazon Kinesis Data Firehose | Use Kinesis Data Firehose to transform JSON to Parquet. |
| Amazon Kinesis Data Firehose | Small files in Kinesis Data Firehose indicate scaling; large files suggest lagging from the source. |
| Amazon QuickSight | QuickSight has no integration with VPC Connect. |
| Amazon Redshift | If queries are taking too long in Redshift, the cause may be data skew. Distribute the data evenly. |
| Amazon Redshift | If queries fail in Redshift, use STL_ALERT_EVENT_LOG to investigate. |
| Amazon Redshift | Redshift event notifications detect cluster changes, not data changes. |
| Amazon Redshift | Use MERGE in Redshift to combine data with INSERT or UPDATE. |
| Amazon Redshift, CloudWatch Logs | Monitor Redshift authentication, logins, and connections via audit logging. |
| Amazon Redshift, Kinesis Data Streams | Redshift streaming ingestion supports Kinesis Data Streams directly. |
| Amazon Redshift, S3 | Load files from S3 to Redshift using staging tables and one single COPY per table. |
| Amazon S3 | S3 Batch Operations allows large-scale operations on existing S3 objects. |
| Amazon S3 | In S3 Lifecycle rules, permanent deletion takes precedence over transitions, which take precedence over creating delete markers. |
| Amazon S3 | Encrypt each S3 file with a different key using Client-Side Encryption. |
| Amazon S3 | To read specific bytes from S3 objects, use S3 Select ScanRange or Byte Range Fetch. |
| Amazon S3 | S3 consistency is ensured for GET, PUT, and LIST calls. |
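
The sliding-window tip above can be illustrated without any streaming framework. The sketch below is plain Python rather than Flink code, and it assumes events arrive in timestamp order; it keeps only the readings from the last `window_seconds` and reports their average each time a new reading arrives, which is how a sliding window combines current and past data.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_average(readings: Iterable[tuple[float, float]],
                    window_seconds: float) -> Iterator[tuple[float, float]]:
    """Yield (timestamp, average of values within the trailing window).

    `readings` are (timestamp, value) pairs in ascending timestamp order.
    The window covers the interval (timestamp - window_seconds, timestamp].
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window: deque[tuple[float, float]] = deque()
    total = 0.0
    for ts, value in readings:
        window.append((ts, value))
        total += value
        # Evict readings that have fallen out of the window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)


# Hypothetical sensor readings: (seconds, value).
sample = [(0, 10.0), (5, 20.0), (10, 30.0), (20, 40.0)]
averages = list(sliding_average(sample, window_seconds=10))
```

In Flink the same idea is expressed declaratively with a window assigner, and the framework also handles out-of-order events, which this sketch does not.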


| Relevant AWS Service(s) | Hint/Description |
|---|---|
| Amazon S3 | For uploading large files (over 100 MB) to S3, use multipart upload to improve efficiency. |
| Amazon S3, IAM | Use bucket policies or IAM roles with resource-based policies for secure cross-account S3 access. |
| Amazon S3, S3 Glacier Deep Archive | Query Glacier Deep Archive yearly? Load data to S3 Standard and use S3 Select. |
| Amazon SQS | SQS FIFO queues cannot exceed 3000 messages/second. |
| Amazon VPC, CloudWatch Logs | Enable VPC Flow Logs to capture network traffic for troubleshooting and security monitoring. |
| AWS CodeDeploy, EMR | CodeDeploy doesn't work for EMR. |
| AWS Direct Connect | Use AWS Direct Connect for consistent, high-bandwidth connections between on-premises and AWS. |
| AWS DMS | DMS instance must be in the same AZ and VPC as the target. |
| AWS DMS | DMS does not migrate empty tables. |
| AWS DMS, RDS, S3 | To transfer data from RDS to S3, DMS is the right option. |
| AWS Elastic Beanstalk | Use Elastic Beanstalk to quickly deploy and manage applications without worrying about infrastructure. |
| AWS Glue | Crawlers need an attached AWSGlueServiceRole role. |
| AWS Glue | Jobs process old data if [Link]() or max concurrent jobs = 1 is missing. |
| AWS Glue | Remove duplicates in S3 using Glue FindMatches. |
| AWS Glue Data Catalog | Glue Data Catalog uses resource policies for access. |
| AWS Glue DataBrew | To detect discrepancies between different database reports, use Glue DataBrew. |
| AWS Glue for Ray | Glue for Ray can scale AI and Python workloads. |
| AWS Glue, CloudWatch | Glue metrics in CloudWatch: job progress bar, job logs, driver logs, executor logs. |
| AWS Glue, Python Shell | Use Python Shell and pandas to handle small files. |
| AWS Lake Formation, S3 | Write the S3 path on the Data Lake to register the bucket, not the bucket name. |
| AWS Lambda | Lambda provisioned concurrency works against cold starts. |
| AWS Lambda | Use Lambda Layers to package libraries and dependencies separately from the function. |
| AWS Lambda, RDS, VPC, Security Groups | A Lambda accessing an RDS cluster must be in the same VPC, and the RDS security group must grant access. |
| AWS SAM, S3 | SAM is not only for API Gateway and Lambda; you can also deploy S3 buckets. |
| CloudWatch, AWS Billing | Set CloudWatch alarms for billing metrics to stay updated on unexpected cost spikes. |
| DynamoDB, Kinesis Client Library | DynamoDB is required for KCL checkpoints, with one table per KCL application. |
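
The multipart upload tip above comes down to splitting a large object into byte ranges. The sketch below plans those ranges; it performs no upload. The 100 MB threshold comes from the tip (approximated here as 100 MiB), while the 5 MiB minimum part size and the 10,000-part ceiling are S3 multipart constraints that are not stated in this document. The default part size of 16 MiB is an arbitrary choice for the example.

```python
MIB = 1024 * 1024
MULTIPART_THRESHOLD = 100 * MIB  # "over 100 MB" from the tip, as MiB
MIN_PART_SIZE = 5 * MIB          # S3 minimum for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def plan_upload(size_bytes: int, part_size: int = 16 * MIB) -> list[tuple[int, int]]:
    """Return (offset, length) pairs describing how to upload an object.

    Objects at or below the threshold are planned as one single request.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= MULTIPART_THRESHOLD:
        return [(0, size_bytes)]
    # Grow the part size when needed so the plan never exceeds MAX_PARTS.
    smallest_allowed = (size_bytes + MAX_PARTS - 1) // MAX_PARTS
    part_size = max(part_size, MIN_PART_SIZE, smallest_allowed)
    return [
        (offset, min(part_size, size_bytes - offset))
        for offset in range(0, size_bytes, part_size)
    ]


# Example: a 250 MiB file split into 100 MiB parts.
plan = plan_upload(250 * MIB, part_size=100 * MIB)
```

In practice boto3's transfer manager does this splitting itself; its `TransferConfig` exposes `multipart_threshold` and `multipart_chunksize` for the same two knobs.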


| Relevant AWS Service(s) | Hint/Description |
|---|---|
| EMR, S3 | If EMR is accessing data in S3, ensure both are in the same region to avoid high latency. |
| IAM, RDS | IAM Database Authentication works for RDS MySQL and PostgreSQL. |
| Kinesis Client Library, DynamoDB | KCL discovers new Kinesis shards via DynamoDB and balances the load. |
| Kinesis Data Firehose, Amazon Redshift | Store data from EC2 to Redshift in real time using Kinesis Data Firehose, not Kinesis Data Streams. |
| S3 Glacier Vault, Vault Lock | Use Glacier Vault and Vault Lock for sensitive archived data and compliance. |
| S3, AWS Lambda, AWS Glue | Configure S3 notification events to trigger Lambdas and run crawlers. |
| S3, Parquet, ORC | Parquet supports more complex data types than ORC. |
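
The tip about S3 notifications triggering Lambdas that run crawlers might look like the handler below. It is a sketch under assumptions: the crawler name is hypothetical, and the Glue client is passed in as a parameter so the logic can be exercised without AWS access. In a deployed function the client would come from `boto3.client("glue")`, whose `start_crawler(Name=...)` call is used here.

```python
from urllib.parse import unquote_plus

CRAWLER_NAME = "raw-zone-crawler"  # hypothetical crawler name


def uploaded_objects(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 notification event.

    Object keys arrive URL-encoded in S3 events, so they are decoded here.
    """
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((s3_info["bucket"]["name"], key))
    return pairs


def handler(event: dict, context: object, glue_client=None) -> dict:
    """Lambda entry point: start the crawler when new objects arrive."""
    objects = uploaded_objects(event)
    if not objects:
        return {"started": False, "objects": []}
    if glue_client is None:
        import boto3  # imported lazily; only needed when running on AWS
        glue_client = boto3.client("glue")
    glue_client.start_crawler(Name=CRAWLER_NAME)
    return {"started": True, "objects": objects}
```

A crawler that is already running rejects a second start request, so a production handler would likely need to catch that error or batch notifications before starting the crawler.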
