0% found this document useful (0 votes)
16 views4 pages

Data Warehouse Overview and Concepts

Uploaded by

krish2021
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views4 pages

Data Warehouse Overview and Concepts

Uploaded by

krish2021
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Datawarehouse:- System that aggregates data from multiple sources into central repository of structured data to support

analytics (OLAP-OnLine Analytical Processing). Supports ML, AI, data mining, OLAP and reporting.

Another def:- Subject/business oriented (customer/supplier/product/sales etc.), integrated (data collected from
multiple data sources), time-variant (timely collection of data over period) and non-volatile (existing data is not changed
just new data appended) collection of data to support mgmt. decision making process.

DWH provided on appliances, on-cloud, on-premises and mixed solutions by IBM, Oracle, Microsoft, amazon, Google etc.

Data marts:- domain/user/business function specific repository system (Type- Independent, dependent, hybrid). Specific
schema data repository for ease of retrieval and for analytics.

Data lake:- Repository of raw data in its native form without any preprocessing. For structured, semi-structured and
unstructured data. Cons- Data duplication lead to storage excess and less data quality

Data lakehouse:- To ensure optimized data quality with less storage costs and with schematic data. Pros of both DWH
and Datalake.

FACT and Dimension tables:-

FACT- quantitative/aggregated data of business processes, contains foreign keys to dimension tables

DIMENSION-categorical variables to filter, group fact data. Contains business entities

Data Modeling into FLAT schema, STAR schema or SNOWFLAKE schema depending upon the storage/query processing
requirement.

Why do we use these schemas, and how do they differ?

Star schemas are optimized for reads and are widely used for designing data marts(query boost), whereas snowflake
schemas are optimized for writes and are widely used for transactional data warehousing(writing/size boost).
 Normalization reduces redundancy, data size (5 NF types)

Data Cube Rep:-

Slicing- 1 layer of cube is cut

Dicing- large cube is filtered into small cube

Drill up and down-Drilling up and down into subsequent layers

Pivoting-Rearrange the view of cube

Rolling up- summarize data using aggregate functions

1. Grouping sets- subtotals for every requested tuple of items


2. CUBE-subtotals/totals for combined and single category
3. ROLLUP-
4. Materialized Views:- Snapshot of contents of sql query or to replicate data in staging database or precompute
expensive queries for DWH

DWH architecture:-

DataSources(DB,Datalakes,ERP,OLTPs)ETLProcessing w/o staging areaDWHDatamartReporting/analytical tools


Data Quality concerns:-

 Accuracy (Match b/w src / target system)


 Completeness (missing, null, invalid values)
 Consistency (datatypes, datafields, names etc.)
 Currency (up to date information)

Managing DQ :- DetectCaptureReportInvestigateDiagnoseCorrect and then automating workflows

1.
Question 1
What do we call a normalized version of the star schema?
1 / 1 point
Product schema
Normalized schema
Parent dimension
Snowflake schema
Correct
Correct, the normalized version of the star schema is called a snowflake schema, due to its multiple layers of
branching which resembles a snowflake pattern.
2.
Question 2
Considering a general architectural model for an Enterprise Data Warehouse, which of these components is holding
data and developing workflows?
1 / 1 point
Enterprise data warehouse repository
Staging and sandbox areas
Data sources
Data marts
Correct
Correct, these components are holding data and developing workflows.
3.
Question 3
Materialized Views can be set up to have different refresh options, such as: (Select 1 answer).
1 / 1 point
Populated
Never, upon request, and immediately
Automatically
Manually refresh
Correct
Materialized Views can be set up to have different refresh options, such as “never” (they are only populated when
created, which is useful if the data seldom changes), “upon request” (manually refresh, for example, after changes
to the data have been made, or scheduled refresh, for example, after daily data loads), and “immediately”
(automatically refresh after every statement).
4.
Question 4
Accumulating snapshot fact tables are used to __________.
0 / 1 point
extract data
process events
load data
record events
Incorrect
Incorrect, please review the Facts and Dimensional Modeling video.
5.
Question 5
In what location is data from source systems extracted to?
1 / 1 point
Target systems
Operating system
Staging area
Business intelligence platform
Correct
Correct, a staging area is a separate location where data from source systems is extracted to.
6.
Question 6
Materialized views can be used to __________.
1 / 1 point
safely work with affecting source database
automatically safe query results
replicate data
synchronize updates
Correct
Correct, they can be used to replicate data, for example to be used in a staging database

 2 design approaches of DWH:- Top down (SRCDWHDM) and Bottom-Ups (SRCDMDWH)

Common questions

Powered by AI

Integrating data from multiple sources into a data warehouse can pose significant challenges in ensuring data accuracy and consistency due to differences in data formats, schema mismatches, varying data quality standards, and discrepancies in data representation. Addressing these requires thorough data cleansing, harmonization techniques such as entity resolution and consistent naming conventions, and robust ETL processes to map source data accurately into target structures. Additionally, maintaining a consistent and up-to-date single source of truth can be complex, requiring continuous monitoring and validation to ensure accuracy and that all system changes and updates are captured comprehensively.

The ETL (Extract, Transform, Load) process is critical for ensuring data quality within a Data Warehouse architecture as it involves extracting data from multiple sources, transforming it into a suitable format, and loading it into the data warehouse. During transformation, data cleansing operations are performed to address quality issues such as inaccuracies, inconsistencies, incomplete data, and data duplication. This process also includes the application of business rules to consolidate data from different sources into a coherent dataset, thereby improving the accuracy, completeness, consistency, and currency of the database. These transformations ensure that the data stored in the Data Warehouse is reliable for decision-making processes.

Data lakes are prone to storage issues such as data duplication, leading to inefficient use of storage resources, and data quality issues, including lack of structure leading to inconsistency and difficulty in managing vast unorganized datasets. These issues arise from storing raw data in its native form without preprocessing or standardization. The data lakehouse model addresses these issues by introducing a schema that combines the flexibility of a data lake with the structured querying capabilities of a data warehouse, thereby optimizing storage by reducing redundancy and implementing a clearer data structure that enhances data quality while allowing analytics over broader data types.

One might opt for a snowflake schema in transactional data warehousing environments because it offers a structured normalization of data that reduces redundancy and potential update anomalies, which is crucial for maintaining data integrity in environments with frequent updates. The snowflake schema spreads data across multiple tables, optimizing for storage efficiency and write performance, which is a significant advantage when handling transactional data where data writing operations are intensive. In contrast, the star schema, while more efficient for read-heavy queries, can be less optimal for environments where rigorous data maintenance and consistency are needed.

The normalized structure of a snowflake schema improves data organization by reducing redundancy and thus potentially minimizing data storage costs. Unlike the star schema, which uses denormalized data structures to improve query performance, the snowflake schema normalizes data into multiple related tables, which helps organize data into more manageable and smaller data sets that resemble a snowflake pattern. This design helps in reducing update anomalies and improves integrity, though it may complicate query operations.

A data lakehouse combines the strengths of traditional data lakes and data warehouses, offering several advantages. It provides the ability to handle raw data (a characteristic of data lakes) while also enforcing a structural schema for organized data storage (like data warehouses), thereby optimizing data quality and reducing storage costs. This hybrid approach allows for both agile data processing and the efficient querying of organized data. However, it may be challenging to implement due to the complexity of integrating the diverse technologies and processes of data lakes and warehouses, and it might require significant resources to ensure the system is optimally configured.

Data marts are essential for providing focused analytical capabilities, facilitating the retrieval and analysis of data specific to particular business domains or user functions, which allows for more efficient decision-making processes due to the simplified and targeted data. On the other hand, data lakes store raw, unprocessed data in its native format, supporting the storage of a broader range of data types (structured, semi-structured, unstructured) and allowing for flexible data exploration and advanced analytics such as machine learning. Together, they complement each other by balancing detailed, domain-specific data processing with the breadth and flexibility needed for complex, large-scale data analytics.

Materialized views enhance query performance in data warehouses by storing precomputed results of complex queries, allowing for faster access to recurrently queried datasets. By reducing the computational load on the database during query execution, they can significantly decrease response times for end users. However, the trade-offs include increased storage requirements for maintaining these views, as well as the need to manage refreshing these views to ensure the data remains up-to-date. Automatic refresh modes can introduce additional processing overhead, while manual or upon-request updates require careful scheduling to maintain data relevance.

Data cube operations such as slicing, dicing, and pivoting enhance analytical capabilities by enabling users to perform complex queries and views of data from different perspectives. Slicing allows analysts to extract and view one dimension of a data cube, simplifying the view to one specific data subset. Dicing, however, provides a more granular approach by slicing a cube into numerous sub-cubes that offer insights into various data intersections. Pivoting facilitates the rearrangement of data views, changing the dimensional orientation to better visualize data relationships and insights. Collectively, these operations empower users to explore and analyze multidimensional data sets efficiently, increasing the depth and contextuality of analysis.

The top-down approach to data warehouse design begins with the creation of a comprehensive enterprise data warehouse, which is then disseminated into specific data marts tailored for various business areas. This design is best suited for organizations requiring a centralized and integrated view of enterprise data, enabling more cohesive strategic decision-making. The bottom-up approach starts with individual data marts, which are later combined into a data warehouse. This method is ideal for organizations that need to quickly implement specific analytic capabilities in certain areas, with the flexibility to build out larger systems incrementally as needs arise and resources allow.

You might also like