AWS Data Engineering Services Overview
AWS Service | Service Type | 🏳️ Primary Purpose | 📦 Data Source | ⚙️ Least Operations | 💹 Cost Efficient | 💲 Cost (Per) | 🚦 Best Use Case | ✅ Key Benefits | ⚠️ Considerations | ❌ Not Used For
AWS Secrets Manager | Security Management | Manage and retrieve secrets (credentials) securely | Manually by user or API call | 🟢 High | 🟢 High | −$ (based on usage) | Storing and managing secrets (API keys, passwords) | Automatic rotation of secrets, audit logging | Costs can increase with the number of secrets | Securely transferring secrets to users
Amazon S3 | Storage | Object storage for unstructured data | Unstructured & semi-structured data, large files | 🟢 High | 🟢 High | $ (cost-effective for storage) | Storing large datasets and data lakes | Scalable, durable, low-cost storage | Not designed for analytics directly | SQL analytics, data transformation
AWS Lambda | Compute | Serverless compute for running code | Event-based sources: S3, DynamoDB, API Gateway, SNS, SQS, Kinesis, and custom events | 🟢 High | 🟢 High | $ - $$ (pay-per-invocation) | Event-driven tasks, lightweight ETL, and triggers | Serverless, scales automatically, pay-per-use | Best for lightweight, short-duration tasks | Long-running jobs, complex ETL processes
AWS DMS | Data Migration | Database migration | Relational databases, NoSQL databases | 🟡 Medium | 🟢 High | $$$ (depends on data volume) | Migrating or syncing databases to AWS | Continuous replication, cross-DB compatibility | Best for migration, not ongoing ETL | Data warehousing, complex analytics
Amazon RDS | Database | Managed relational database | Relational data (MySQL, PostgreSQL, Oracle, etc.) | 🟡 Medium | 🟡 Medium | $$ - $$$ (depends on instance size) | Operational databases for applications | Managed, multiple DB engines | Limited for analytical workloads | Data warehousing, real-time analytics
Amazon Athena | Analytics | Query S3 data with SQL | Data stored in Amazon S3 | 🟢 High | 🟢 High | $$ (pay-per-query) | Ad-hoc querying of structured data in S3 | Serverless, no ETL needed, pay-per-query | Limited for complex processing | Data transformation, real-time analytics
Amazon Redshift | Data Warehouse | Data warehousing for structured data | Structured data from S3, relational databases | 🔴 Low | 🟡 Medium | $$$ - $$$$ (depends on nodes) | BI reporting and analytics on large datasets | Columnar storage, high-performance SQL | Higher cost for smaller datasets | Real-time analytics, unstructured data
Amazon Redshift Spectrum | Analytics | Querying S3 data from Redshift without loading it into the warehouse | Structured and semi-structured data in S3 | 🟢 High | 🟢 High | $$ (charged per TB of data scanned) | Running SQL queries on structured and semi-structured data in S3 | Extends Redshift to S3 data, saves on storage costs, supports large datasets | Requires Redshift cluster, adds extra costs on query usage in S3 | ETL processing, data transformation
AWS Glue | ETL | ETL service and Data Catalog | S3, relational databases, JDBC-compatible sources | 🟢 High | 🟡 Medium | $$$ (ETL jobs) | ETL for transforming raw data into analytics-ready form | Serverless ETL, Data Catalog | Not suitable for large-scale SQL analytics | Real-time analytics, ad-hoc SQL querying
AWS Glue Data Catalog | Metadata Management | Centralized metadata repository for data sources | S3, relational databases, JDBC-compatible sources | 🟢 High | 🟢 High | $ (cost-effective for metadata storage) | Managing and discovering data across AWS and external sources | Integrates with S3, Redshift, Athena, enables data discovery and schema inference | Not a data storage solution; requires data sources to be managed separately | ETL processing, ad-hoc querying
AWS Glue Studio | ETL | Visual interface for creating ETL jobs | S3, relational databases, JDBC-compatible sources | 🟢 High | 🟡 Medium | $$$ (ETL jobs) | Designing and managing ETL workflows without coding | Intuitive GUI, drag-and-drop ETL design, simplifies complex workflows | Limited customization for advanced scripts | Real-time processing, non-ETL tasks
AWS Glue DataBrew | Data Preparation | Data preparation and cleaning with a visual tool | S3, relational databases, JDBC-compatible sources | 🟢 High | 🟡 Medium | $$$ (ETL jobs, visual prep) | No-code data cleaning, exploration | Visual interface, no coding required | Limited for large transformations | SQL analytics, real-time streaming
Kinesis | Data Streaming | Real-time data streaming | Real-time streaming data from IoT, apps, and databases | 🟡 Medium | 🔴 Low | $$$ (per shard/hour) | Real-time analytics and data ingestion | Supports real-time processing | Requires streaming data, setup for scale | Batch processing, SQL querying
Kinesis Data Streams | Data Streaming | Real-time streaming data ingestion | Streaming sources (e.g., IoT, logs) | 🔴 Low | 🟡 Medium | $$$ (Shards per hour) | Real-time analytics and event streaming | Durable streams, real-time data ingestion | Requires setup and tuning for scale | Batch processing, large-scale ML
Amazon Data Firehose | Data Streaming | Real-time data delivery | Streaming sources (e.g., IoT, logs) | 🟢 High | 🟢 High | $$$ (Volume of data ingested) | Streaming data delivery to S3, Redshift | Fully managed, near real-time delivery | Limited transformations during delivery | Complex data transformations, heavy analytics
Amazon EMR | Big Data Processing | Big data processing with Hadoop, Spark, etc. | S3, HDFS, relational databases, NoSQL sources | 🔴 Low | 🔴 Low | $$$$ (based on cluster size) | Data transformations, large-scale ML | Customizable, supports Hadoop ecosystem | Cluster management required | Simple ETL, ad-hoc SQL on S3
Amazon MSK for Apache Kafka | Data Streaming | Managed Apache Kafka for streaming data | Streaming sources (logs, IoT, metrics) | 🔴 Low | 🔴 Low | $$$ (per cluster and storage usage) | Real-time data streaming and event-driven architectures | Fully managed Kafka, integrates with AWS, highly scalable | Requires Kafka knowledge; higher operational costs than Kinesis | Batch processing, SQL querying
Managed Apache Flink | Data Streaming | Real-time data processing and analytics | Streaming data (MSK, Kinesis) | 🟡 Medium | 🔴 Low | $$$ (pay-per-instance/hour + storage) | Processing streaming data for analytics | Fully managed Flink, low latency, supports complex event processing | Requires Flink knowledge; not ideal for simple streaming needs | Batch processing, data storage
Amazon QuickSight | Business Intelligence (BI) | Business intelligence (BI) and data visualization | Redshift, S3, RDS, Athena, or other SQL sources | 🟢 High | 🟢 High | $ - $$ (per-user pricing) | Visualizing data from multiple sources | Interactive dashboards, integrates with Redshift and S3 | Limited for complex ETL, transformations | Data preparation, ETL, large-scale ML
AWS Lake Formation | Data Lake Management | Data lake management and security | Primarily S3 data | 🟢 High | 🟢 High | $ (management costs minimal) | Secure data lake setup and access controls | Simplifies setup, access control for S3 | Primarily for S3 data management | SQL querying, data transformation
AWS Data Exchange | Data Exchange | Secure sharing of third-party data | Third-party data providers | 🟢 High | 🟢 High | $ - $$$ (Dataset consumption) | Accessing third-party datasets for analytics | Easy integration with AWS analytics services | Limited to external datasets | Internal ETL, real-time ingestion
Amazon DataZone | Data Governance | Data governance and sharing | S3, Redshift, databases | 🟢 High | 🟢 High | $ (Metadata and governance) | Centralized data discovery and governance | Simplifies governance and access management | Requires setup for organization-wide adoption | Data processing or analytics
MWAA - Apache Airflow | Workflow Orchestration | Orchestrate workflows and pipelines | Any source (S3, RDS, Redshift, etc.) | 🟢 High | 🔴 Low | $$$ (based on usage and task execution) | Scheduling and managing ETL workflows | Fully managed Airflow, supports DAGs, scales dynamically | Requires DAG knowledge; costs increase with large workflows | Real-time streaming, heavy transformations
AWS Step Functions | Workflow Orchestration | Workflow orchestration | Compatible with Lambda, Glue, S3, RDS, and many AWS services | 🟢 High | 🟡 Medium | $$ (pay-per-transaction) | Coordinating ETL and multi-step data processing tasks | Serverless, visual workflows, error handling | Best for multi-step workflows, not real-time streaming | Real-time streaming, single-step tasks
Amazon EventBridge | Event Bus | Serverless event bus to route events between services | Event streams, SaaS apps, AWS services | 🟢 High | 🟢 High | $ (pay-per-event, cost-effective for low volumes) | Triggering workflows in response to application or system events | Fully managed, integrates with over 200 AWS services and SaaS apps | Requires understanding of event-driven architecture | Data transformation, large-scale analytics
Amazon CloudWatch | Monitoring | Monitoring and observability for AWS resources | Logs, metrics, alarms from AWS resources | 🟢 High | 🟡 Medium | $$ (per logs, metrics, and alarms) | Monitoring AWS infrastructure, alarms, log analysis | Fully managed monitoring, integrates with AWS services | Costs scale with number of logs, metrics, and alarms | Real-time data transformation, heavy processing tasks
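The AWS Lambda row lists S3 as an event source and lightweight ETL as a best use case. The sketch below shows roughly what such a function could look like in Python: it reads the bucket and object key out of an S3 event notification and returns them. The bucket name and key in the sample event are made up for illustration, and a real function would pass each object to a transform step instead of just listing it.

```python
import json
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the bucket, key, and size of every object named in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        objects.append(
            {
                "bucket": s3_info["bucket"]["name"],
                # Object keys arrive URL-encoded in S3 notifications
                # (for example, a space is sent as '+').
                "key": unquote_plus(s3_info["object"]["key"]),
                "size": s3_info["object"].get("size", 0),
            }
        )
    # Placeholder: a real function would transform or forward each object here.
    return {"statusCode": 200, "body": json.dumps(objects)}


if __name__ == "__main__":
    # Hypothetical event, trimmed to the fields the handler reads.
    sample_event = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "example-raw-bucket"},
                    "object": {"key": "logs/2024/my+file.csv", "size": 10},
                }
            }
        ]
    }
    print(lambda_handler(sample_event, None))
```

Keeping the handler free of SDK calls makes it easy to run locally with a hand-built event before wiring it to a real bucket.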
12/1/2024
Data Engineering Table Map Data Concepts
DaaS (Data as a Service) | A cloud service providing data on demand, allowing access to data without owning or storing it. | Data APIs, data marketplaces, real-time insights | Data sharing, cloud computing
DaaP (Data as a Platform) | Data storage and processing platform providing tools for data collection, storage, processing, and sharing. | Data pipelines, analytics platforms | ETL, data lakes, data warehouses
Data Types | Different forms of data: Structured (e.g., relational databases), Semi-structured (e.g., JSON), Unstructured (e.g., images). | Various data storage and analytics needs | Schema design, data formats
OLTP | Online Transaction Processing - handles real-time transactions with frequent updates, typically in row-based storage. | E-commerce, banking transactions | Row-based databases, ACID
OLAP | Online Analytical Processing - optimizes read-heavy operations, often using columnar storage for fast querying. | Business intelligence, data warehousing | Columnar databases, reporting
Row-Based Storage | Data stored by rows, ideal for fast read/write operations on entire rows, common in OLTP systems. | Transactional systems, real-time processing | Relational databases, ACID
Columnar-Based Storage | Data stored by columns, enabling fast read operations for analytical queries, used in OLAP systems. | Data warehousing, analytics | Columnar databases, BI tools
5 Vs of Data | Challenges in big data characterized by Volume (amount), Variety (types), Velocity (speed), Veracity (accuracy), and Value (usefulness). | Big data management, analytics | Data quality, data governance
Data Processing Modes | Batch processing handles data in large chunks at set intervals; stream processing handles data continuously as it arrives. | Real-time analytics, periodic reporting | Kinesis, Apache Kafka, Spark
ETL | Extract, Transform, Load - traditional method where data is transformed before loading into the target system. | Data warehousing, data integration | Glue, Informatica
ELT | Extract, Load, Transform - loads raw data into the target system, allowing transformation within the system. | Cloud-based data warehousing (e.g., Redshift, Snowflake) | Athena, Redshift
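The ETL and ELT rows differ only in where the transformation runs. The sketch below illustrates that difference with Python's built-in sqlite3 module standing in for the target system. The sample orders, table names, and cleaning rules are all invented for illustration and are not tied to any AWS service.

```python
import sqlite3

# Hypothetical raw extract: every field arrives as an untidy string.
RAW_ORDERS = [
    ("1001", " alice ", "19.99"),
    ("1002", "BOB", "5.00"),
    ("1003", "Carol", "12.50"),
]


def run_etl(conn):
    """ETL: clean the rows in Python first, then load only the finished result."""
    cleaned = [
        (int(order_id), customer.strip().lower(), float(amount))
        for order_id, customer, amount in RAW_ORDERS
    ]
    conn.execute("CREATE TABLE orders_etl (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load the raw strings untouched, then transform with SQL in the target."""
    conn.execute("CREATE TABLE orders_raw (order_id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM orders_raw
        """
    )


def fetch(conn, table):
    """Return all rows of a table created above, ordered by order_id."""
    query = f"SELECT order_id, customer, amount FROM {table} ORDER BY order_id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    run_elt(conn)
    # Both routes should end with the same cleaned rows.
    print(fetch(conn, "orders_etl") == fetch(conn, "orders_elt"))
```

In the ELT route the raw table stays available in the target, so a transformation can be rerun or changed later without extracting the data again.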
Data Engineering Table Map Data Pipeline Stages
Data Sources | Objects, Databases, Mobile, IoT | S3 (for objects), RDS (for databases), AWS IoT
Ingestion | Database import, Object/file ingestion, Streaming data | AWS Glue, Kinesis Data Firehose, S3 Transfer Acceleration
Cataloging, Processing, and Governance | Data Catalog, Processes | AWS Glue Data Catalog, AWS Lambda, Lake Formation
Analytics and Visualization | Search, Interactive dashboards, Queries, Embedded analytics | Amazon OpenSearch, QuickSight, Athena, Redshift
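The stages above can be pictured as a chain of small functions. The following is a toy, in-memory sketch in plain Python, not an AWS implementation: the `device,reading` line format and the function names `ingest`, `catalog`, and `analyze` are assumptions chosen only to mirror the ingestion, cataloging, and analytics stages.

```python
from collections import defaultdict


def ingest(raw_lines):
    """Ingestion: parse 'device,reading' lines into records, skipping bad rows."""
    records = []
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        device, reading = parts
        try:
            records.append({"device": device, "reading": float(reading)})
        except ValueError:
            continue  # reading is not numeric
    return records


def catalog(records):
    """Cataloging: note each column name and the Python type first seen for it."""
    schema = {}
    for record in records:
        for column, value in record.items():
            schema.setdefault(column, type(value).__name__)
    return schema


def analyze(records):
    """Analytics: compute the average reading per device."""
    totals = defaultdict(lambda: [0.0, 0])  # device -> [sum, count]
    for record in records:
        totals[record["device"]][0] += record["reading"]
        totals[record["device"]][1] += 1
    return {device: total / count for device, (total, count) in totals.items()}


if __name__ == "__main__":
    lines = ["a,1.0", "a,3.0", "b,2.5", "broken", "c,notanumber"]
    records = ingest(lines)
    print(catalog(records))
    print(analyze(records))
```

Each function takes the previous stage's output as its only input, which is the property that lets the managed services in the table be swapped in stage by stage.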
Data Engineering Table Map AWS Limitations
Data Engineering Table Map Real-World Examples
Company | Goal | What Service to Use | When to Use | Why | Real-World Example | Type of Data Challenge
🏢 | Manage sensitive API keys and database credentials | AWS Secrets Manager | When you need to manage and rotate secrets easily | Secrets Manager enables secure storage and management of secrets, making it easy to retrieve credentials as needed | A finance application uses AWS Secrets Manager to store API keys securely, allowing seamless integration with microservices. | Data Silos
🏢 | Migrate databases quickly and securely | AWS DMS | When you need to replicate or migrate databases with minimal downtime | DMS supports many database engines, providing low-downtime migrations and real-time replication | A healthcare organization uses AWS DMS to migrate patient records from an on-premises Oracle database to Amazon Aurora with minimal | Data Silos
🏢 | Manage relational databases with ease | Amazon RDS | When you want a fully managed relational database solution | RDS automates database management tasks like backups, patching, and scaling | A retail application uses Amazon RDS to manage its MySQL database for handling transactions and inventory management. | Data Silos
🏢 | Store and analyze time-series data | Amazon Timestream | When you need fast query performance and efficient storage for time-series data | Timestream is designed specifically for time-series data, allowing for easy time-series data storage and real-time analysis | An energy company uses Amazon Timestream to monitor and analyze power consumption data from smart meters over time for optimizing | Data Silos
🏭 | Run ad-hoc queries on structured or semi-structured data | Amazon Athena | When you need to analyze data stored in S3 without managing infrastructure | Athena allows you to run SQL queries against data in S3, enabling quick analysis without data loading | A marketing team uses Amazon Athena to analyze clickstream data stored in S3 to improve website engagement. | Data Silos
🏢 | Set up a petabyte-scale data warehouse | Amazon Redshift | When you need fast complex queries on large datasets | Redshift is designed for high performance and can scale to petabytes of data using columnar storage and MPP | A business intelligence team uses Amazon Redshift for generating complex reports and dashboards from large sales datasets. | Data Silos
🏭 | Query data in S3 directly from a Redshift cluster | Amazon Redshift Spectrum | When you want to analyze data stored in S3 without loading it into Redshift | Redshift Spectrum allows you to query data in S3 directly, expanding the analytical capabilities of Redshift | An analytics firm combines internal data in Redshift with external data stored in S3 using Redshift Spectrum for deeper insights. | Data Silos
🏢 | Automate data preparation and ETL processes | AWS Glue | When you need to prepare data for analytics and machine learning | Glue offers serverless ETL capabilities and can automatically discover and catalog data | A healthcare organization uses AWS Glue to clean and transform patient data from various sources for use in analytics. | Exponential Data Growth
🏢 | Maintain a central repository of metadata for data sources | AWS Glue Data Catalog | When you need to manage and discover metadata for datasets | The Data Catalog allows teams to quickly find data and understand its lineage and usage | A data engineering team uses AWS Glue Data Catalog to manage metadata across multiple data sources for their ETL processes. | Data Silos
🏢 | Visualize and manage ETL jobs with a user-friendly interface | AWS Glue Studio | When you want to create and orchestrate data workflows visually | Glue Studio provides a drag-and-drop interface, simplifying ETL job creation and management | A data analyst uses AWS Glue Studio to visually design and schedule ETL workflows for preparing sales data for reporting. | Data Silos
🏢 | Simplify data preparation without coding | AWS Glue DataBrew | When you want to clean and normalize data quickly and easily | DataBrew provides a visual interface for data preparation, enabling users to perform actions like filtering and transforming without writing code | A marketing team uses AWS Glue DataBrew to preprocess user engagement data for their analytics, speeding up the data preparation | Exponential Data Growth
🏢 | Build real-time streaming applications | Kinesis | When you want to process and analyze data streams in real time | Kinesis allows for the ingestion and analysis of streaming data from various sources | A social media platform uses Amazon Kinesis to monitor user engagement metrics in real time. | Exponential Data Growth
🏢 | Create controlled environments for data collaboration | Amazon DataZone | When you want to facilitate data sharing and collaboration across teams | DataZone provides a centralized space for managing data access and collaboration, improving productivity and security | A research team uses Amazon DataZone to share and analyze clinical trial data, ensuring that only authorized personnel can access | Data Silos
🏢 | Build, train, and deploy machine learning models | Amazon SageMaker | When you want to create and manage machine learning models at scale | SageMaker provides a fully managed service for building, training, and deploying ML models, reducing operational overhead | A marketing analytics firm uses Amazon SageMaker to develop predictive models for customer segmentation and to optimize | Data Silos
🏢 | Orchestrate complex workflows for data processing | MWAA - Apache Airflow | When you need to manage dependencies and scheduling of data pipelines | MWAA provides a managed environment for Apache Airflow, simplifying workflow orchestration | A data engineering team uses MWAA to schedule and manage a series of ETL jobs that prepare datasets for machine learning training. | Data Silos
🏢 | Coordinate microservices and serverless workflows | AWS Step Functions | When you need to manage the flow of application logic across multiple AWS services | Step Functions allows you to define workflows visually and manage state across distributed systems | A chatbot application uses AWS Step Functions to orchestrate the flow of data between various AWS services, such as Lambda and DynamoDB, | Data Silos
🏢 | Route events between AWS services for event-driven architectures | Amazon EventBridge | When you need to connect different parts of your application using event data | EventBridge enables easy routing of events between AWS services, facilitating event-driven architectures | An online retail platform uses Amazon EventBridge to trigger workflows based on user activities, such as order placements and inventory | Data Silos
🏢 | Monitor AWS resources and applications in real time | Amazon CloudWatch | When you need insights into your AWS application's performance | CloudWatch offers monitoring capabilities, allowing you to set alarms and automate responses based on resource metrics | A gaming company uses Amazon CloudWatch to monitor in-game metrics and automatically scale resources based on player activity | Data Silos
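The Step Functions row describes defining workflows and managing state across services. As a rough illustration, the sketch below builds a two-step workflow definition in Amazon States Language as a Python dictionary and checks that every referenced state exists. The state names, the helper functions `build_state_machine` and `undefined_targets`, and the Lambda ARNs are hypothetical; only the field names (`StartAt`, `States`, `Type`, `Resource`, `Retry`, `Next`, `End`) come from the States Language itself.

```python
import json


def build_state_machine(transform_arn, load_arn):
    """Return a two-step (transform, then load) definition as a dict."""
    return {
        "Comment": "Two-step ETL sketch",
        "StartAt": "Transform",
        "States": {
            "Transform": {
                "Type": "Task",
                "Resource": transform_arn,
                # Retry the task up to 3 times with exponential backoff.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 2,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "Load",
            },
            "Load": {"Type": "Task", "Resource": load_arn, "End": True},
        },
    }


def undefined_targets(definition):
    """List state names referenced by StartAt or Next that are not defined."""
    states = definition["States"]
    referenced = {definition["StartAt"]}
    referenced.update(s["Next"] for s in states.values() if "Next" in s)
    return sorted(referenced - set(states))


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    definition = build_state_machine(
        "arn:aws:lambda:us-east-1:123456789012:function:transform",
        "arn:aws:lambda:us-east-1:123456789012:function:load",
    )
    print(json.dumps(definition, indent=2))
    print(undefined_targets(definition))
```

Checking for dangling `Next` targets locally catches one common class of definition errors before the JSON is submitted to the service.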
Data Engineering Table Map Pro Tips