Data Warehousing: OLAP & Aggregation Insights
Data Warehousing: OLAP & Aggregation Insights
Query Facility, OLAP function and Tools. OLAP Servers, ROLAP, MOLAP, HOLAP,
Data Mining interface, Security, Backup and Recovery, Tuning Data Warehouse, Testing
Data Warehouse. Warehousing Applications and Recent Trends: Types of Warehousing
Applications, Web Mining, Spatial Mining and Temporal Mining.
Lecture 33: Aggregation, Historical Information, Query Facility
Aggregation, historical information management, and query facilities are essential components
of data warehousing systems, enabling efficient data analysis and decision-making. Here’s an
overview of each:
Aggregation
1. Definition:
• Aggregation involves combining and summarizing data to derive meaningful
insights.
• Aggregated data often represent higher-level summaries or totals, facilitating
analysis and reporting.
2. Purpose:
• Aggregation reduces the volume of data by consolidating detailed information
into more manageable and meaningful summaries.
• Aggregated data provides valuable insights into trends, patterns, and anomalies
within the dataset.
3. Common Aggregation Functions:
• Sum: Adds up numerical values.
• Average: Calculates the mean value of numerical data.
• Count: Counts the number of occurrences.
• Min/Max: Finds the minimum or maximum value within a dataset.
• Group By: Groups data based on specified attributes for aggregation.
4. Usage:
• Aggregation is essential for generating reports, creating dashboards, and
performing ad-hoc analysis.
• It simplifies data exploration and visualization by presenting summarized views
of complex datasets.
Historical Information Management
1. Definition:
• Historical information management involves storing and managing historical
data within the data warehouse.
• Historical data represents past snapshots of business transactions, events, or
states.
2. Purpose:
• Historical data provides context and historical perspective for analysis and
decision-making.
• It supports trend analysis, forecasting, and predictive modeling by capturing
past patterns and behaviors.
3. Data Retention Policies:
• Organizations define data retention policies to determine how long historical
data should be retained in the data warehouse.
• Policies consider regulatory requirements, business needs, and storage
constraints.
4. Temporal Data Modeling:
• Temporal data modeling techniques, such as slowly changing dimensions
(SCDs), are used to manage changes in historical data over time.
• SCDs track historical changes to dimension attributes, allowing for accurate
historical analysis.
Query Facility
1. Definition:
• Query facilities provide tools and interfaces for querying and accessing data
stored in the data warehouse.
• Users can write and execute SQL queries or use graphical interfaces to retrieve
data.
2. Features:
• SQL Support: Query facilities support SQL (Structured Query Language) for
data retrieval and manipulation.
• OLAP (Online Analytical Processing): Supports multidimensional analysis with
capabilities like slicing, dicing, drilling, and pivoting.
• Ad-Hoc Querying: Allows users to create and execute ad-hoc queries to explore
data interactively.
• Parameterized Queries: Supports parameterized queries to enable dynamic
filtering and customization.
3. Performance Optimization:
• Query facilities optimize query performance through techniques like query
optimization, indexing, and caching.
• They leverage database engine capabilities for efficient query execution and
resource utilization.
4. Integration:
• Query facilities integrate with reporting tools, BI platforms, and data
visualization tools to enable seamless data analysis and reporting.
• They support integration with ETL (Extract, Transform, Load) tools for data
preparation and loading.
Aggregation, historical information management, and query facilities are critical components
of data warehousing systems, enabling efficient data analysis and decision-making. By
aggregating data, managing historical information, and providing robust query facilities,
organizations can extract valuable insights from their data warehouse and drive informed
business decisions.
Lecture 34: OLAP Functions
What is OLAP (Online Analytical Processing)?
OLAP stands for On-Line Analytical Processing. OLAP is a classification of software
technology which authorizes analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access in a wide variety of possible views of
data that has been transformed from raw information to reflect the real dimensionality of the
enterprise as understood by the clients.
OLAP implement the multidimensional analysis of business information and support the
capability for complex estimations, trend analysis, and sophisticated data modeling. It is rapidly
enhancing the essential foundation for Intelligent Solutions containing Business Performance
Management, Planning, Budgeting, Forecasting, Financial Documenting, Analysis,
Simulation-Models, Knowledge Discovery, and Data Warehouses Reporting. OLAP enables
end-clients to perform ad hoc analysis of record in multiple dimensions, providing the insight
and understanding they require for better decision making.
Who uses OLAP and Why?
OLAP applications are used by a variety of the functions of an organization.
Finance and accounting:
o Budgeting
o Activity-based costing
o Financial performance analysis
o And financial modeling
Sales and Marketing
o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation
Production
o Production planning
o Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.
How OLAP Works?
Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries that are
typically very hard to execute over tabular databases, namely aggregation, joining, and
grouping. These queries are calculated during a process that is usually called 'building' or
'processing' of the OLAP cube. This process happens overnight, and by the time end users get
to work - data will have been updated.
OLAP Guidelines ([Link] Rule)
Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines and
requirements as the basis for selecting OLAP systems:
OLAP Operations
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
• Roll-up is performed by climbing up a concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level
of city to the level of country.
• The data is grouped into cities rather than countries.
• When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways
−
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.
• Drill-down is performed by stepping down a concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of quarter to the level
of month.
• When drill-down is performed, one or more dimensions from the data cube are added.
• It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-
cube. Consider the following diagram that shows how slice works.
• Here Slice is performed for the dimension "time" using the criterion time = "Q1".
• It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider
the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item =" Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide
an alternative presentation of data. Consider the following diagram that shows the pivot
operation.
Lecture 35: OLAP Servers, ROLAP, MOLAP, HOLAP
Online Analytical Processing(OLAP) refers to a set of software tools used for data analysis in
order to make business decisions. OLAP provides a platform for gaining insights from
databases retrieved from multiple database systems at the same time. It is based on a
multidimensional data model, which enables users to extract and view data from various
perspectives. A multidimensional database is used to store OLAP data. Many Business
Intelligence (BI) applications rely on OLAP technology.
Type of OLAP servers:
The three major types of OLAP servers are as follows:
• ROLAP
• MOLAP
• HOLAP
Relational OLAP (ROLAP):
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a
relational database, where both the base data and dimension tables are stored as relational
tables. ROLAP servers are used to bridge the gap between the relational back-end server and
the client’s front-end tools. ROLAP servers store and manage warehouse data using RDBMS,
and OLAP middleware fills in the gaps.
Benefits:
• It is compatible with data warehouses and OLTP systems.
• The data size limitation of ROLAP technology is determined by the underlying
RDBMS. As a result, ROLAP does not limit the amount of data that can be stored.
Limitations:
• SQL functionality is constrained.
• It’s difficult to keep aggregate tables up to date.
Multidimensional OLAP (MOLAP):
Through array-based multidimensional storage engines, Multidimensional On-Line Analytical
Processing (MOLAP) supports multidimensional views of data. Storage utilization in
multidimensional data stores may be low if the data set is sparse.
MOLAP stores data on discs in the form of a specialized multidimensional array structure. It
is used for OLAP, which is based on the arrays’ random access capability. Dimension instances
determine array elements, and the data or measured value associated with each cell is typically
stored in the corresponding array element. The multidimensional array is typically stored in
MOLAP in a linear allocation based on nested traversal of the axes in some predetermined
order.
However, unlike ROLAP, which stores only records with non-zero facts, all array elements are
defined in MOLAP, and as a result, the arrays tend to be sparse, with empty elements occupying
a larger portion of them. MOLAP systems typically include provisions such as advanced
indexing and hashing to locate data while performing queries for handling sparse arrays,
because both storage and retrieval costs are important when evaluating online performance.
MOLAP cubes are ideal for slicing and dicing data and can perform complex calculations.
When the cube is created, all calculations are pre-generated.
Benefits:
• Suitable for slicing and dicing operations.
• Outperforms ROLAP when data is dense.
• Capable of performing complex calculations.
Limitations:
• It is difficult to change the dimensions without re-aggregating.
• Since all calculations are performed when the cube is built, a large amount of data
cannot be stored in the cube itself.
Hybrid OLAP (HOLAP):
ROLAP and MOLAP are combined in Hybrid On-Line Analytical Processing (HOLAP).
HOLAP offers greater scalability than ROLAP and faster computation than [Link]
is a hybrid of ROLAP and MOLAP. HOLAP servers are capable of storing large amounts of
detailed data. On the one hand, HOLAP benefits from ROLAP’s greater scalability. HOLAP,
on the other hand, makes use of cube technology for faster performance and summary-type
information. Because detailed data is stored in a relational database, cubes are smaller than
MOLAP.
Benefits:
• HOLAP combines the benefits of MOLAP and ROLAP.
• Provide quick access at all aggregation levels.
Limitations
• Because it supports both MOLAP and ROLAP servers, HOLAP architecture is
extremely complex.
• There is a greater likelihood of overlap, particularly in their functionalities.
Lecture 36: Data Mining interface, Security, Backup and Recovery, Tuning Data
Warehouse.
Data Mining Interface
A data mining interface facilitates the exploration and extraction of actionable insights from
large datasets using data mining techniques. Here’s an overview:
1. Query Interface:
• Provides users with tools to define and execute data mining queries against the
data warehouse.
• Supports various query languages or graphical interfaces for defining mining
tasks.
2. Data Exploration Tools:
• Enables users to explore data visually and interactively to identify patterns,
trends, and anomalies.
• Includes features such as data visualization, clustering, classification, and
association rule discovery.
3. Model Building and Evaluation:
• Allows users to build predictive models using machine learning algorithms and
evaluate their performance.
• Provides tools for model training, testing, and validation using techniques like
cross-validation.
4. Integration with BI Tools:
• Integrates with business intelligence (BI) tools and dashboards to visualize and
present data mining results.
• Enables users to incorporate predictive insights into decision-making processes.
Security
Data warehouse security is crucial for protecting sensitive information and ensuring
compliance with regulatory requirements. Here are key security measures:
1. Access Control:
• Implement role-based access control (RBAC) to restrict access to data based on
users’ roles and responsibilities.
• Enforce strong authentication mechanisms, such as multi-factor authentication
(MFA), to prevent unauthorized access.
2. Data Encryption:
• Encrypt data at rest and in transit to prevent unauthorized access or interception.
• Use encryption techniques such as SSL/TLS for network communication and
encryption algorithms for data storage.
3. Auditing and Monitoring:
• Implement auditing and logging mechanisms to track user activities and
changes to data.
• Monitor access patterns and detect suspicious behavior to prevent security
breaches.
4. Data Masking and Anonymization:
• Mask sensitive data to anonymize personally identifiable information (PII) and
protect privacy.
• Replace sensitive data with pseudonymized or randomized values to ensure
confidentiality.
5. Compliance and Governance:
• Ensure compliance with regulations such as GDPR, HIPAA, and PCI-DSS by
implementing data governance policies and controls.
• Conduct regular security assessments and audits to identify vulnerabilities and
ensure adherence to security standards.
Backup and Recovery
Backup and recovery processes are essential for data warehouse reliability and resilience.
Here’s how it’s managed:
1. Regular Backups:
• Schedule regular backups of the data warehouse to ensure data availability in
case of data loss or corruption.
• Implement full, incremental, or differential backup strategies based on recovery
requirements.
2. Redundant Storage:
• Store backup copies of data in redundant storage locations, such as cloud storage
or off-site data centers.
• Ensure data redundancy and fault tolerance to mitigate the risk of data loss due
to hardware failures or disasters.
3. Point-in-Time Recovery:
• Maintain transaction logs or incremental backups to facilitate point-in-time
recovery to a specific moment in the past.
• Enable rollback or recovery to restore the data warehouse to a consistent state
after data corruption or accidental changes.
4. Disaster Recovery Planning:
• Develop and test disaster recovery plans to ensure business continuity in the
event of catastrophic failures or natural disasters.
• Establish procedures for failover, data restoration, and system recovery to
minimize downtime and data loss.
5. Automated Backup Solutions:
• Use automated backup solutions and backup scheduling tools to streamline
backup and recovery processes.
• Monitor backup jobs and receive alerts for any failures or anomalies to ensure
timely resolution.
A robust data mining interface facilitates data exploration and predictive analysis, while
comprehensive security measures protect sensitive information and ensure compliance. Backup
and recovery processes ensure data warehouse resilience and availability, safeguarding against
data loss and disruptions. By implementing these measures effectively, organizations can
leverage their data warehouse infrastructure securely and reliably to drive business insights and
decision-making.
Real-time
Type Best For Scalability Cost Complexity Capability
Department-specific
Data Mart Low Low Low Medium
needs
Big Data
Large, varied datasets Very High High High High
DW
Real-time
Immediate insights High High High Very High
DW
Lecture 39: Web Mining, Spatial Mining
Web mining, spatial mining, and temporal mining are specialized areas within the field of data
mining that focus on extracting knowledge and insights from specific types of data. Here’s an
overview of each:
Web Mining
Web mining involves extracting useful information and patterns from web data, including web
pages, web logs, social media, and online transactions. It encompasses three main types:
1. Web Content Mining:
• Involves extracting useful information from web documents, such as text,
images, and multimedia content.
• Techniques include text mining, image analysis, and natural language
processing (NLP) to analyze web content.
2. Web Structure Mining:
• Focuses on analyzing the structure of the web, including hyperlinks between
web pages and the topology of the web graph.
• Techniques include link analysis algorithms, such as PageRank and HITS, to
identify important web pages and communities within the web graph.
3. Web Usage Mining:
• Analyzes user interactions with web-based systems, including web logs,
clickstream data, and user sessions.
• Techniques include sessionization, path analysis, and association rule mining to
understand user behavior and preferences.
Spatial Mining
Spatial mining, also known as geospatial data mining or spatial analytics, focuses on extracting
patterns and insights from spatial data. It deals with data that has spatial or geographic
attributes, such as maps, satellite imagery, GPS coordinates, and location-based services. Key
techniques include:
1. Spatial Clustering:
• Identifies spatially dense regions or clusters within spatial datasets.
• Techniques include density-based clustering algorithms like DBSCAN and
hierarchical clustering methods.
2. Spatial Classification:
• Predicts categorical attributes or classes for spatial objects based on their spatial
features.
• Techniques include spatial decision trees, support vector machines (SVM), and
spatial naive Bayes classifiers.
3. Spatial Association Analysis:
• Identifies relationships and associations between spatial objects or attributes.
• Techniques include spatial autocorrelation analysis, hotspot detection, and
spatial association rule mining.
4. Spatial Visualization:
• Visualizes spatial data and analysis results using maps, charts, and interactive
visualizations.
• Techniques include choropleth maps, heatmaps, and interactive GIS
(Geographic Information System) tools.
Lecture 40: Temporal Mining
Temporal mining focuses on extracting patterns and trends from temporal data, which includes
time-stamped sequences, time series, event logs, and temporal databases. It deals with
analyzing data that evolves over time to discover temporal dependencies and patterns. Key
techniques include:
1. Time Series Analysis:
• Analyzes time-stamped data sequences to identify patterns, trends, and
seasonality.
• Techniques include statistical methods, autoregressive models, and spectral
analysis.
2. Temporal Association Analysis:
• Identifies temporal patterns and associations between events or sequences of
events.
• Techniques include sequence mining, temporal rule discovery, and episode
detection.
3. Temporal Clustering:
• Groups temporal data into clusters based on temporal similarity or behavior.
• Techniques include dynamic time warping (DTW), time-based clustering
algorithms, and event stream clustering.
4. Temporal Visualization:
• Visualizes temporal data and analysis results to facilitate exploration and
interpretation.
• Techniques include time series plots, calendar heatmaps, and animated
visualizations.
Web mining, spatial mining, and temporal mining are specialized areas within data mining that
focus on extracting knowledge and insights from specific types of data. By applying techniques
and algorithms tailored to the characteristics of web, spatial, and temporal data, organizations
can uncover valuable patterns, trends, and relationships to support decision-making and
enhance understanding in various domains.