0% found this document useful (0 votes)
12 views13 pages

Understanding Data Warehousing Concepts

Uploaded by

armaanshk0007
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views13 pages

Understanding Data Warehousing Concepts

Uploaded by

armaanshk0007
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

UNIT II DATA WAREHOUSING

Definition of Data ware house:


It may be understood as Repository of all of the organization’s data (or simply data) in a pattern that can
be analysed so that it can be arranged for administration and other information workers.

A data warehouse is a central repository of information that can be analyzed to make more informed
decisions. Data flows into a data warehouse from transactional systems, relational databases, and other
sources.

It can be described as any centralized data repository which can be queried for business benefits. It is a
database that stores information oriented to satisfy decision-making requests. It is a group of decision
support technologies, targets to enabling the knowledge worker (executive, manager, and analyst) to
make superior and higher decisions.

It is designed for query and analysis rather than transaction processing. It includes historical data derived
from transaction data from single and multiple sources. It provides integrated, enterprise-wide, historical
data and focuses on providing support for decision-makers for data modeling and analysis.

A data warehouse is the foremost repository for the data available for developing business intelligence
architectures and decision support systems. The term data warehousing indicates the whole set of
interrelated activities involved in designing, implementing and using a data warehouse.

However, some challenges faced are:

 Data should be acquired from a variety of incompatible systems.


 The identical piece of data might reside in the databases of distinct systems in distinct types. A
specific data item might not only be represented in distinct formats, but the values of this Data
piece might be distinct in distinct databases. Which value is the correct one?
 Data is continually altering. How often should the Data warehouse be revised to contemplate a
sensibly current view?
 The amount of Data is massive. How is it analysed and presented easily so that it is useful?

To meet these needs, a broad range of powerful tools were developed over the years and became
productized. They included:

_ Extract, Transform, and Load (ETL) utilities for the moving of data from the diverse data sources to the
common data warehouse.

_ Data-mining pushes for complex predetermined analysis and ad hoc queries of the Data retained in the
Data warehouse.

_ Reporting tools to provide management employees with the outcomes of the analysis in very simple to
absorb formats.

It is possible to identify three main categories of data feeding into a data warehouse: internal data,
external data and personal data.

Internal data: Internal data are stored for the most part in the databases, referred to as transactional
systems or operational systems, that are the backbone of an enterprise information system.

Internal data are gathered through transactional applications that routinely preside over the operations
of a company, such as administration, accounting, production and logistics. This collection of trans-
actional software applications is termed enterprise resource planning (ERP). The data stored in the
operational systems usually deal with the main entities involved in a company processes, namely
customers, products, sales, employees and suppliers.

These data usually come from different components of the information system:

• back-office systems, that collect basic transactional records such as orders, invoices, inventories,
production and logistics data;

• front-office systems, that contain data originating from call-center activities, customer assistance,
execution of marketing campaigns;

• web-based systems, that gather sales transactions on e-commerce websites, visits to websites, data
available on forms filled out by existing and prospective customers.

External data: There are several sources of external data that may be used to extend information. For
example, some agencies gather and make available data relative to sales, market share and future trend
predictions for specific business industries, as well as economic and financial indicators.

Other agencies provide data market surveys and consumer opinions collected through questionnaires.
Geographic Information Systems (GIS), which represent a set of applications for acquiring, organizing,
storing and presenting territorial data. These contain information relative to entities having a specific
geographic position.

Personal data: In most cases, decision makers performing a business intelligence analysis also rely on
information and personal assessments stored inside worksheets or local databases located in their
computers.

Definition of Data Mart:

Data marts are systems that gather all the data required by a specific company department, such as
marketing or logistics, for the purpose of performing business intelligence analyses and executing decision
support applications specific to the function itself. Therefore, a data mart can be considered as a
functional or departmental data warehouse of a smaller size and a more specific type than the overall
company data warehouse.

A data mart is a special type of a data warehouse. It is focused on a single subject (or functional area),
such as Sales, Finance, or Marketing. Whereas data warehouses have an enterprise-wide depth, the
information in data marts pertains to a single department. The primary use for a data mart is Business
Intelligence (BI) applications.

A data mart therefore contains a subset of the data stored in the company data warehouse, which are
usually integrated with other data that the company department responsible for the data mart owns and
deems of interest. For example, a marketing data mart will contain data extracted from the central data
warehouse, such as information on customers and sales transactions, but also additional data pertaining
to the marketing function, such as the results of marketing campaigns run in the past. Data warehouses
and data marts thus share the same technological frame-work.
Data quality

The need to verify, preserve and improve the quality of data is a constant concern of those responsible
for the design and updating of a data warehouse. Data in a warehouse should have the following qualities:

Accuracy. To be useful for subsequent analyses, data must be highly accurate. For instance, it is necessary
to verify that names and encodings are correctly represented and values are within admissible ranges.

Completeness. In order to avoid compromising the accuracy of business intelligence analyses, data should
not include a large number of missing values.

Consistency. The form and content of the data must be consistent across the different data sources after
the integration procedures, with respect to currency and measurement units.

Timeliness. Data must be frequently updated, based on the objectives of the analysis. It is customary to
arrange an update of the data warehouse regularly on a daily or at most weekly basis.

Non-redundancy. Data repetition and redundancy should be avoided in order to prevent waste of
memory and possible inconsistencies.

Relevance. Data must be relevant to the needs of the business intelligence system in order to add real
value to the analyses that will be subsequently performed.

Interpretability. The meaning of the data should be well understood and correctly interpreted by the
analysts.

Accessibility. Data must be easily accessible by analysts and decision support applications

Data Warehouse Architecture:


Basic architecture and function of a data ware house is as follows:

Major components includes:

The data warehouse itself, together with additional data marts, that contains the data and the functions
that allow the data to be accessed, visualized and modified.
Data acquisition applications, also known as extract, transform and load (ETL) or back-end tools, which
allow the data to be extracted, transformed and loaded into the data warehouse.

Business intelligence and decision support applications, which represent the front-end and allow the
knowledge workers to carry out the analyses and visualize the results.

The three-level distinction applies to the architecture shown in above figure even from a technological
perspective.

• The level of the data sources and the related ETL tools that are usually installed on one or more servers.

• The level of the data warehouse and any data mart, possibly available on one or more servers as well,
and separated from those containing the data sources. This second level also includes the metadata
documenting the origin and meaning of the records stored in the data warehouse.

• The level of the analyses that increase the value of the information contained in a data warehouse
through query, reporting and possibly sophisticated decision support tools. The applications for business
intelligence and decision support analysis are usually found on separate servers or directly on the client
PC used by analysts and knowledge workers.

A data warehouse may be implemented according to different design approaches: top-down, bottom-up
and mixed.

Top-down. The top-down methodology is based on the overall design of the data warehouse, and is
therefore more systematic. However, it implies longer development times and higher risks of not being
completed within schedule since the whole data warehouse is actually being developed.

Bottom-up. The bottom-up method is based on the use of prototypes and therefore system extensions
are made according to a step-by-step scheme. This approach is usually quicker, provides more tangible
results but lacks an overall vision of the entire system to be developed.

Mixed.

The mixed methodology is based on the overall design of the data warehouse, but then proceeds with a
prototyping approach, by sequentially implementing different parts of the entire system. This approach
is highly practical and usually preferable, since it allows small and controlled steps to be taken while
bearing in mind the whole picture.

Extract, Transform and Load (ETL)

these utilities extract data from source databases, change/transform them into the widespread data
warehouse format, and load them into the data warehouse

The transform function is the key to the achievement of this approach. Its job is to request a series of rules
to extracted data so that it is properly formatted for loading into the data warehouse.

Examples:
The selection of data to load.

The translation of encoded items (for example, 1 for male, 2 for female to M, F).

Deriving new calculated values (sale price = price - discount).

Merging data from multiple sources.

Summarizing (aggregating) certain rows and columns.

Splitting a column into multiple columns.

Resolving discrepancies between similar data items.

Validating the data.

ETL model :

The ETL process is performed occasionally, such as daily, weekly, or monthly, depending upon the
enterprise needs. This method is called offline ETL because the key database is not relentlessly updated.
It is revised on a periodic batch basis.

The ETL utilities make data collection from numerous diverse systems practical. Then, the data needs to
be converted into useful information. Some key points to remember:

_ Data are easily facts, figures, and text that can be processed by a computer.

Example: A transaction at retail point-of-sale is data.

_ Information is processed data.

For example, analysis of point-of-sale transactions yields information of consumer buying behaviour.
_ Knowledge represents a pattern that connects information and usually presents a high grade of
predictability as to what is recounted or what will happen next.

Example: An example of knowledge is the prediction of promotional efforts on sales of particular items
based on buyers’ buying behaviour.

Data-Mining Engines
Useful data-mining engines were evolved to support complex analysis and ad hoc queries on a data
warehouse’s database. Data mining looks for patterns among hundreds of seemingly unrelated fields in a
large database, patterns that recognize earlier unknown trends. These trends play a key role in strategic
decision making because they disclose localities for process enhancement(ORACLE)

Reporting Tools

The knowledge created by a data-mining engine is not very useful unless it is presented easily and clearly
to those who need it. There are many formats for reporting information and knowledge results. One of
the common techniques for displaying information is the digital dashboard.

Metadata
In order to document the meaning of the data contained in a data warehouse, it is recommended to set
up a specific information structure, known as metadata, i.e. data describing data. The metadata indicate
for each attribute of a data warehouse the original source of the data, their meaning and the
transformations to which they have been subjected. The documentation provided by metadata should be
constantly kept up to date, in order to reflect any modification in the data warehouse structure. The
documentation should be directly accessible to the data warehouse users.
OLAP and OLTP:

Online Analytical Processing consists of a type of software tools that are used for data analysis for business
decisions. OLAP provides an environment to get insights from the database retrieved from multiple
database systems at one time. Examples – Any type of Data warehouse system is an OLAP system.

An OLTP(Online Transactional Processing) system captures and maintains transaction data in a database.
Each transaction involves individual database records made up of multiple fields or columns. Examples
include banking and credit card activity or retail checkout scanning.

The data from one or more OLTP databases is ingested into OLAP systems through extract, transform,
load (ETL)

Cubes and multidimensional analysis


The multidimensional representation is based on a star schema which contains two types of data tables:
dimension tables and fact tables.

Dimension Tables: In general, dimensions are associated with the entities around which the processes of
an organization revolve. Dimension tables then correspond to primary entities contained in the data
warehouse, and in most cases, they directly derive from master tables stored in OLTP systems, such as
customers, products, sales, locations and time.
Each dimension table is often internally structured according to hierarchical relationships.
For example, the temporal dimension is usually based upon two major hierarchies: {day, week,
year} and {day, month, quarter, year}
Similarly, the location dimension may be hierarchically organized as {street, zip code, city, province,
region, country, area}
Products in their turn have hierarchical structures such as {item, family, type} in the manufacturing
industry and {item, category, department} in the retail industry.
Fact Tables: Fact tables usually refer to transactions and contain two types of
data:
• links to dimension tables, that are required to properly reference the information contained in each fact
table;
• numerical values of the attributes that characterize the corresponding transactions and that represent
the actual target of the subsequent OLAP analyses.

For example, a fact table may contain sales transactions and make reference to several dimension tables,
such as customers, points of sale, products, suppliers, time. The corresponding measures of interest are
attributes such as quantity of items sold, unit price and discount.
In this example the fact table allows analysts to evaluate the trends of sales over time, either total, or
referred to a single customer, or referred to a group of customers, that can be identified through
any hierarchy induced by the dimension table associated with the customers.
The analyst may also evaluate the trend over time of sales percentages relative to customers located in a
specific region.

STAR SCHEMA-
The fact table is placed in the middle of the schema and is linked to the dimension tables through
appropriate references. It uses a single large fact table to store transactional or measured data, and one
or more smaller dimensional tables that store attributes about the data.
Example –
SNOWFLAKE SCHEMA-
Sometimes dimension tables are connected in their turn to other dimension tables, as shown
In the given example the dimension table referring to the location is in turn hierarchically connected with
the dimension table containing geographical information. This brings about a snowflake
schema.
GALAXY SCHEMA-
A data warehouse includes several fact tables, interconnected with dimension tables, linked in their turn
with other dimension tables. This type of schema is termed a galaxy schema as shown-

(In the above example tables that have name written above it are fact tables others are dimension tables)
(time in shipment fact table may also point to time dimension table)

A fact table connected with n dimension tables may be represented by a n-dimensional data cube where
each axis corresponds to a dimension. Multidimensional cubes are a natural extension of the popular two-
dimensional spreadsheets, which can be interpreted as two-dimensional cubes.

Cube Example:
For instance, consider a sales fact table developed along the three dimensions of {time, product, region}.
Suppose we select only two dimensions for the analysis, such as{time, product}, having preset the region
attribute along the three values {USA, Asia, Europa}.
In this way we obtain the three two-dimensional tables in which the rows correspond to quarters of a
year and the columns to products as follows:

Table1-
Table 2-

Table 3-

The cube shown in Figure below is a three-dimensional illustration of the same sales fact table
Atomic data are represented by 36 cells that can be obtained by crossing all possible values along the
three dimensions: time{Q1, Q2, Q3, Q4},region{USA, Asia, Europa} and product {TV, PC, DVD}

These atomic cells can be supplemented by 44 cells corresponding to the summary values obtained
through consolidation along one or more dimensions

four-dimensional space cannot be represented graphically. However, we can obtain four logical views
composed of three-dimensional cubes, called cuboids
Starting from a fact table linked to n dimension tables, it is possible to obtain a lattice of cuboids, each of
them corresponding to a different consolidation along one or more dimensions.

Figure below illustrates the lattice composed by the cuboids obtained from the data cube defined along
the four dimensions {time, product, region, supplier}

The cuboid associated with the atomic data, which therefore does not imply any type of consolidation, is
called a base cuboid. At the other extreme, the apex cuboid is defined as the cuboid corresponding to the
consolidation along all dimensions, therefore associated with the grand total of the measure of
interest.

Common questions

Powered by AI

A data warehouse acts as a central repository of information that supports informed decision-making within an organization. It aggregates data from transactional systems, relational databases, and diverse sources, providing integrated, enterprise-wide, historical data . This centralized data repository enables knowledge workers like executives, managers, and analysts to make superior decisions by offering the foundational information for business intelligence architectures and decision support systems . By focusing on query and analysis instead of transaction processing, data warehouses are structured to satisfy decision-making requests, offering tools like ETL utilities, data-mining engines, and reporting mechanisms, which further assist in complex analyses and producing understandable results for decision-makers .

OLAP (Online Analytical Processing) systems are designed for data analysis and support business decisions by providing insights from databases retrieved from multiple systems. These systems enable sophisticated analyses by aggregating and consolidating data for reporting and decision support applications . OLTP (Online Transactional Processing) systems, conversely, focus on capturing and maintaining transaction data, emphasizing database efficiency for transaction processing like banking activities or retail checkouts . While OLTP systems efficiently handle transaction recording, OLAP systems leverage extracted data for complex, multidimensional analyses necessary for strategic decision-making .

The essential architectural components of a data warehouse include the data repository itself, data acquisition applications (ETL tools), and business intelligence and decision support applications . The data warehouse, with any additional data marts, holds the data and provides access, visualization, and modification functions. Data acquisition applications extract, transform, and load data into the warehouse, ensuring data is processed from various sources into a unified format. Business intelligence and decision support applications enable analyses and result visualization, often supported by knowledge workers conducting queries on separate servers or client PCs . These components collectively provide a seamless flow from data ingestion to decision support, enabling strategic insights and business intelligence .

Data mining engines enhance data warehouse effectiveness by identifying patterns in large datasets, which could otherwise be overlooked. These engines support complex analyses and ad hoc queries by recognizing previously unknown trends, fostering strategic decision-making. By revealing process enhancements and key insights, data mining uncovers valuable trends that contribute to a competitive advantage . Moreover, the outputs of data mining must be conveyed through user-friendly formats, such as digital dashboards, to ensure that decision-makers easily understand and apply the findings to improve business processes .

Maintaining data quality in a data warehouse involves addressing challenges such as accuracy, completeness, consistency, timeliness, non-redundancy, relevance, interpretability, and accessibility . To ensure accuracy, proper verification of data representation and value ranges is necessary. Completeness demands minimizing missing values to maintain reliable analytical outcomes. Consistency requires uniformity in data format and content across sources after integration. Timeliness is maintained by scheduling regular data updates, typically daily or weekly. Non-redundancy is achieved by avoiding data repetition to conserve memory and prevent incoherence. Relevance ensures data aligns with the business intelligence system's needs to add real value, while interpretability and accessibility focus on making data easily understandable and retrievable by analysts .

Multidimensional cubes in OLAP systems facilitate complex querying by enabling data to be viewed and analyzed from multiple perspectives, referred to as dimensions, such as time, product, and region . They are a natural extension of two-dimensional spreadsheets, representing data in a star, snowflake, or galaxy schema to provide a multi-faceted view that is critical for business intelligence analysis. These cubes empower analysts to perform operations like slicing, dicing, and drilling down to granular data levels or rolling up to summary levels, thus offering sophisticated analysis capabilities over vast data volumes and supporting strategic decision-making .

The star schema is advantageous due to its simplicity and efficiency in querying, as it centralizes a large fact table connected directly to smaller, descriptive dimension tables . This design facilitates fast retrieval of attributes, as data is generally accessed via the star-joined fact and dimension tables. However, the potential drawback is the possible redundancy of data within its dimension tables, which could arise from the denormalized structure . While the layout suits query performance, it might not be optimal for situations requiring high volume updates due to its lack of efficiency in managing data integrity across multiple dimensions in more complex, normalized setups such as snowflake schemas .

Metadata in a data warehouse improves functionality and utility by documenting the meaning, origin, and transformations of the data. It acts as 'data about data,' providing transparency regarding how data attributes were derived and how they relate to the enterprise's information needs . Keeping metadata updated is crucial for maintaining the credibility and accuracy of queries and analyses performed by users, as it reflects any modifications in the data warehouse's structure. By providing direct access to this documentation, metadata enables users to better understand and interpret the data, significantly increasing the data's interpretability and the decision-making efficacy for business intelligence .

Data marts are smaller and more specific subsets of data warehouses tailored to meet the needs of individual departments, such as marketing or logistics . While data warehouses provide enterprise-wide depth integrating data from various sources, data marts focus on a single subject or functional area, containing a subset of the data stored in the company data warehouse, often integrated with other department-specific data deemed of interest . This focus allows departments to perform targeted business intelligence analyses and execute decision support applications specific to their needs, benefiting from relevant and specialized information .

The top-down design methodology involves creating a comprehensive data warehouse design that encompasses the entire organization's needs, offering a systematic and robust structure initially . This centralized approach can better ensure data consistency and integration but may require longer implementation times and greater initial investment. In contrast, the bottom-up approach begins with the creation of data marts for specific departments or functional areas, which are then integrated into a comprehensive data warehouse structure over time . This method allows for faster deployment and immediate departmental utility but may lead to inconsistencies and require more complex integration efforts as the overall data warehouse evolves .

You might also like