0% found this document useful (0 votes)
13 views29 pages

Data Warehousing: OLAP & Aggregation Insights

The document covers essential components of data warehousing systems, including aggregation, historical information management, and query facilities, which facilitate efficient data analysis and decision-making. It also discusses OLAP (Online Analytical Processing) functions, types of OLAP servers (ROLAP, MOLAP, HOLAP), and their respective benefits and limitations. Additionally, it highlights the importance of data mining interfaces, security measures, and backup and recovery strategies in data warehousing.

Uploaded by

ritikasingh01008
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views29 pages

Data Warehousing: OLAP & Aggregation Insights

The document covers essential components of data warehousing systems, including aggregation, historical information management, and query facilities, which facilitate efficient data analysis and decision-making. It also discusses OLAP (Online Analytical Processing) functions, types of OLAP servers (ROLAP, MOLAP, HOLAP), and their respective benefits and limitations. Additionally, it highlights the importance of data mining interfaces, security measures, and backup and recovery strategies in data warehousing.

Uploaded by

ritikasingh01008
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Unit 5: Data Visualization and Overall Perspective: Aggregation, Historical Information,

Query Facility, OLAP function and Tools. OLAP Servers, ROLAP, MOLAP, HOLAP,
Data Mining interface, Security, Backup and Recovery, Tuning Data Warehouse, Testing
Data Warehouse. Warehousing Applications and Recent Trends: Types of Warehousing
Applications, Web Mining, Spatial Mining and Temporal Mining.
Lecture 33: Aggregation, Historical Information, Query Facility
Aggregation, historical information management, and query facilities are essential components
of data warehousing systems, enabling efficient data analysis and decision-making. Here’s an
overview of each:
Aggregation
1. Definition:
• Aggregation involves combining and summarizing data to derive meaningful
insights.
• Aggregated data often represent higher-level summaries or totals, facilitating
analysis and reporting.
2. Purpose:
• Aggregation reduces the volume of data by consolidating detailed information
into more manageable and meaningful summaries.
• Aggregated data provides valuable insights into trends, patterns, and anomalies
within the dataset.
3. Common Aggregation Functions:
• Sum: Adds up numerical values.
• Average: Calculates the mean value of numerical data.
• Count: Counts the number of occurrences.
• Min/Max: Finds the minimum or maximum value within a dataset.
• Group By: Groups data based on specified attributes for aggregation.
4. Usage:
• Aggregation is essential for generating reports, creating dashboards, and
performing ad-hoc analysis.
• It simplifies data exploration and visualization by presenting summarized views
of complex datasets.
Historical Information Management
1. Definition:
• Historical information management involves storing and managing historical
data within the data warehouse.
• Historical data represents past snapshots of business transactions, events, or
states.
2. Purpose:
• Historical data provides context and historical perspective for analysis and
decision-making.
• It supports trend analysis, forecasting, and predictive modeling by capturing
past patterns and behaviors.
3. Data Retention Policies:
• Organizations define data retention policies to determine how long historical
data should be retained in the data warehouse.
• Policies consider regulatory requirements, business needs, and storage
constraints.
4. Temporal Data Modeling:
• Temporal data modeling techniques, such as slowly changing dimensions
(SCDs), are used to manage changes in historical data over time.
• SCDs track historical changes to dimension attributes, allowing for accurate
historical analysis.
Query Facility
1. Definition:
• Query facilities provide tools and interfaces for querying and accessing data
stored in the data warehouse.
• Users can write and execute SQL queries or use graphical interfaces to retrieve
data.
2. Features:
• SQL Support: Query facilities support SQL (Structured Query Language) for
data retrieval and manipulation.
• OLAP (Online Analytical Processing): Supports multidimensional analysis with
capabilities like slicing, dicing, drilling, and pivoting.
• Ad-Hoc Querying: Allows users to create and execute ad-hoc queries to explore
data interactively.
• Parameterized Queries: Supports parameterized queries to enable dynamic
filtering and customization.
3. Performance Optimization:
• Query facilities optimize query performance through techniques like query
optimization, indexing, and caching.
• They leverage database engine capabilities for efficient query execution and
resource utilization.
4. Integration:
• Query facilities integrate with reporting tools, BI platforms, and data
visualization tools to enable seamless data analysis and reporting.
• They support integration with ETL (Extract, Transform, Load) tools for data
preparation and loading.
Aggregation, historical information management, and query facilities are critical components
of data warehousing systems, enabling efficient data analysis and decision-making. By
aggregating data, managing historical information, and providing robust query facilities,
organizations can extract valuable insights from their data warehouse and drive informed
business decisions.
Lecture 34: OLAP Functions
What is OLAP (Online Analytical Processing)?
OLAP stands for On-Line Analytical Processing. OLAP is a classification of software
technology which authorizes analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access in a wide variety of possible views of
data that has been transformed from raw information to reflect the real dimensionality of the
enterprise as understood by the clients.
OLAP implement the multidimensional analysis of business information and support the
capability for complex estimations, trend analysis, and sophisticated data modeling. It is rapidly
enhancing the essential foundation for Intelligent Solutions containing Business Performance
Management, Planning, Budgeting, Forecasting, Financial Documenting, Analysis,
Simulation-Models, Knowledge Discovery, and Data Warehouses Reporting. OLAP enables
end-clients to perform ad hoc analysis of record in multiple dimensions, providing the insight
and understanding they require for better decision making.
Who uses OLAP and Why?
OLAP applications are used by a variety of the functions of an organization.
Finance and accounting:
o Budgeting
o Activity-based costing
o Financial performance analysis
o And financial modeling
Sales and Marketing
o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation
Production
o Production planning
o Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.
How OLAP Works?
Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries that are
typically very hard to execute over tabular databases, namely aggregation, joining, and
grouping. These queries are calculated during a process that is usually called 'building' or
'processing' of the OLAP cube. This process happens overnight, and by the time end users get
to work - data will have been updated.
OLAP Guidelines ([Link] Rule)
Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines and
requirements as the basis for selecting OLAP systems:

1) Multidimensional Conceptual View: This is the central features of an OLAP system. By


needing a multidimensional view, it is possible to carry out methods like slice and dice.
2) Transparency: Make the technology, underlying information repository, computing
operations, and the dissimilar nature of source data totally transparent to users. Such
transparency helps to improve the efficiency and productivity of the users.
3) Accessibility: It provides access only to the data that is actually required to perform the
particular analysis, present a single, coherent, and consistent view to the clients. The OLAP
system must map its own logical schema to the heterogeneous physical data stores and perform
any necessary transformations. The OLAP operations should be sitting between data sources
(e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: To make sure that the users do not feel any significant
degradation in documenting performance as the number of dimensions or the size of the
database increases. That is, the performance of OLAP should not suffer as the number of
dimensions is increased. Users must observe consistent run time, response time, or machine
utilization every time a given query is run.
5) Client/Server Architecture: Make the server component of OLAP tools sufficiently
intelligent that the various clients to be attached with a minimum of effort and integration
programming. The server should be capable of mapping and consolidating data between
dissimilar databases.
6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in
both is structure and operational capabilities. Additional operational capabilities may be
allowed to selected dimensions, but such additional tasks should be grantable to any dimension.
7) Dynamic Sparse Matrix Handling: To adapt the physical schema to the specific analytical
model being created and loaded that optimizes sparse matrix handling. When encountering the
sparse matrix, the system must be easy to dynamically assume the distribution of the
information and adjust the storage and access to obtain and maintain a consistent level of
performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.
9) Unrestricted cross-dimensional Operations: It provides the ability for the methods to
identify dimensional order and necessarily functions roll-up and drill-down methods within a
dimension or across the dimension.
10) Intuitive Data Manipulation: Data Manipulation fundamental the consolidation direction
like as reorientation (pivoting), drill-down and roll-up, and another manipulation to be
accomplished naturally and precisely via point-and-click and drag and drop methods on the
cells of the scientific model. It avoids the use of a menu or multiple trips to a user interface.
11) Flexible Reporting: It implements efficiency to the business clients to organize columns,
rows, and cells in a manner that facilitates simple manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should
be unlimited. Each of these common dimensions must allow a practically unlimited number of
customer-defined aggregation levels within any given consolidation path.

OLAP Operations
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
• Roll-up is performed by climbing up a concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level
of city to the level of country.
• The data is grouped into cities rather than countries.
• When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways

• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.

• Drill-down is performed by stepping down a concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter &lt year."
• On drilling down, the time dimension is descended from the level of quarter to the level
of month.
• When drill-down is performed, one or more dimensions from the data cube are added.
• It navigates the data from less detailed data to highly detailed data.

Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-
cube. Consider the following diagram that shows how slice works.
• Here Slice is performed for the dimension "time" using the criterion time = "Q1".
• It will form a new sub-cube by selecting one or more dimensions.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider
the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item =" Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide
an alternative presentation of data. Consider the following diagram that shows the pivot
operation.
Lecture 35: OLAP Servers, ROLAP, MOLAP, HOLAP

Online Analytical Processing(OLAP) refers to a set of software tools used for data analysis in
order to make business decisions. OLAP provides a platform for gaining insights from
databases retrieved from multiple database systems at the same time. It is based on a
multidimensional data model, which enables users to extract and view data from various
perspectives. A multidimensional database is used to store OLAP data. Many Business
Intelligence (BI) applications rely on OLAP technology.
Type of OLAP servers:
The three major types of OLAP servers are as follows:
• ROLAP
• MOLAP
• HOLAP
Relational OLAP (ROLAP):
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a
relational database, where both the base data and dimension tables are stored as relational
tables. ROLAP servers are used to bridge the gap between the relational back-end server and
the client’s front-end tools. ROLAP servers store and manage warehouse data using RDBMS,
and OLAP middleware fills in the gaps.
Benefits:
• It is compatible with data warehouses and OLTP systems.
• The data size limitation of ROLAP technology is determined by the underlying
RDBMS. As a result, ROLAP does not limit the amount of data that can be stored.
Limitations:
• SQL functionality is constrained.
• It’s difficult to keep aggregate tables up to date.
Multidimensional OLAP (MOLAP):
Through array-based multidimensional storage engines, Multidimensional On-Line Analytical
Processing (MOLAP) supports multidimensional views of data. Storage utilization in
multidimensional data stores may be low if the data set is sparse.
MOLAP stores data on discs in the form of a specialized multidimensional array structure. It
is used for OLAP, which is based on the arrays’ random access capability. Dimension instances
determine array elements, and the data or measured value associated with each cell is typically
stored in the corresponding array element. The multidimensional array is typically stored in
MOLAP in a linear allocation based on nested traversal of the axes in some predetermined
order.
However, unlike ROLAP, which stores only records with non-zero facts, all array elements are
defined in MOLAP, and as a result, the arrays tend to be sparse, with empty elements occupying
a larger portion of them. MOLAP systems typically include provisions such as advanced
indexing and hashing to locate data while performing queries for handling sparse arrays,
because both storage and retrieval costs are important when evaluating online performance.
MOLAP cubes are ideal for slicing and dicing data and can perform complex calculations.
When the cube is created, all calculations are pre-generated.
Benefits:
• Suitable for slicing and dicing operations.
• Outperforms ROLAP when data is dense.
• Capable of performing complex calculations.
Limitations:
• It is difficult to change the dimensions without re-aggregating.
• Since all calculations are performed when the cube is built, a large amount of data
cannot be stored in the cube itself.
Hybrid OLAP (HOLAP):
ROLAP and MOLAP are combined in Hybrid On-Line Analytical Processing (HOLAP).
HOLAP offers greater scalability than ROLAP and faster computation than [Link]
is a hybrid of ROLAP and MOLAP. HOLAP servers are capable of storing large amounts of
detailed data. On the one hand, HOLAP benefits from ROLAP’s greater scalability. HOLAP,
on the other hand, makes use of cube technology for faster performance and summary-type
information. Because detailed data is stored in a relational database, cubes are smaller than
MOLAP.
Benefits:
• HOLAP combines the benefits of MOLAP and ROLAP.
• Provide quick access at all aggregation levels.
Limitations
• Because it supports both MOLAP and ROLAP servers, HOLAP architecture is
extremely complex.
• There is a greater likelihood of overlap, particularly in their functionalities.
Lecture 36: Data Mining interface, Security, Backup and Recovery, Tuning Data
Warehouse.
Data Mining Interface
A data mining interface facilitates the exploration and extraction of actionable insights from
large datasets using data mining techniques. Here’s an overview:
1. Query Interface:
• Provides users with tools to define and execute data mining queries against the
data warehouse.
• Supports various query languages or graphical interfaces for defining mining
tasks.
2. Data Exploration Tools:
• Enables users to explore data visually and interactively to identify patterns,
trends, and anomalies.
• Includes features such as data visualization, clustering, classification, and
association rule discovery.
3. Model Building and Evaluation:
• Allows users to build predictive models using machine learning algorithms and
evaluate their performance.
• Provides tools for model training, testing, and validation using techniques like
cross-validation.
4. Integration with BI Tools:
• Integrates with business intelligence (BI) tools and dashboards to visualize and
present data mining results.
• Enables users to incorporate predictive insights into decision-making processes.
Security
Data warehouse security is crucial for protecting sensitive information and ensuring
compliance with regulatory requirements. Here are key security measures:
1. Access Control:
• Implement role-based access control (RBAC) to restrict access to data based on
users’ roles and responsibilities.
• Enforce strong authentication mechanisms, such as multi-factor authentication
(MFA), to prevent unauthorized access.
2. Data Encryption:
• Encrypt data at rest and in transit to prevent unauthorized access or interception.
• Use encryption techniques such as SSL/TLS for network communication and
encryption algorithms for data storage.
3. Auditing and Monitoring:
• Implement auditing and logging mechanisms to track user activities and
changes to data.
• Monitor access patterns and detect suspicious behavior to prevent security
breaches.
4. Data Masking and Anonymization:
• Mask sensitive data to anonymize personally identifiable information (PII) and
protect privacy.
• Replace sensitive data with pseudonymized or randomized values to ensure
confidentiality.
5. Compliance and Governance:
• Ensure compliance with regulations such as GDPR, HIPAA, and PCI-DSS by
implementing data governance policies and controls.
• Conduct regular security assessments and audits to identify vulnerabilities and
ensure adherence to security standards.
Backup and Recovery
Backup and recovery processes are essential for data warehouse reliability and resilience.
Here’s how it’s managed:
1. Regular Backups:
• Schedule regular backups of the data warehouse to ensure data availability in
case of data loss or corruption.
• Implement full, incremental, or differential backup strategies based on recovery
requirements.
2. Redundant Storage:
• Store backup copies of data in redundant storage locations, such as cloud storage
or off-site data centers.
• Ensure data redundancy and fault tolerance to mitigate the risk of data loss due
to hardware failures or disasters.
3. Point-in-Time Recovery:
• Maintain transaction logs or incremental backups to facilitate point-in-time
recovery to a specific moment in the past.
• Enable rollback or recovery to restore the data warehouse to a consistent state
after data corruption or accidental changes.
4. Disaster Recovery Planning:
• Develop and test disaster recovery plans to ensure business continuity in the
event of catastrophic failures or natural disasters.
• Establish procedures for failover, data restoration, and system recovery to
minimize downtime and data loss.
5. Automated Backup Solutions:
• Use automated backup solutions and backup scheduling tools to streamline
backup and recovery processes.
• Monitor backup jobs and receive alerts for any failures or anomalies to ensure
timely resolution.
A robust data mining interface facilitates data exploration and predictive analysis, while
comprehensive security measures protect sensitive information and ensure compliance. Backup
and recovery processes ensure data warehouse resilience and availability, safeguarding against
data loss and disruptions. By implementing these measures effectively, organizations can
leverage their data warehouse infrastructure securely and reliably to drive business insights and
decision-making.

Tuning Data Warehouse


Tuning a data warehouse involves optimizing its performance by adjusting its design,
configuration, and query processing. There are several key areas where tuning can be applied
to improve the performance of a data warehouse:
Design: The design of a data warehouse should be optimized for performance, including the
use of appropriate indexing, partitioning, and aggregation strategies. A well-designed data
warehouse can improve query response times and reduce data loading times.
Configuration: The configuration of the data warehouse hardware and software can have a
significant impact on performance. Configuring hardware resources such as CPU, memory, and
disk can improve query processing times, while software configurations such as parallelism
and compression can improve data loading and query response times.
Query processing: Tuning query processing involves optimizing the way queries are executed
against the data warehouse, including the use of appropriate join strategies, aggregation
methods, and query optimization techniques.
Lecture 37: Testing Data Warehouse, Warehousing applications and Recent Trends
Testing Data Warehouse
Testing a data warehouse involves evaluating its performance and functionality to ensure that
it meets the requirements of its users. There are several types of testing that can be applied to
a data warehouse:
Unit testing: This involves testing individual components of the data warehouse, such as data
sources, ETL processes, and query processing.
Integration testing: This involves testing the integration of different components of the data
warehouse, including data sources, ETL processes, and the data warehouse itself.
Performance testing involves testing the data warehouse's performance under various load
conditions, including concurrent user activity and varying data volumes.
User acceptance testing: This involves testing the data warehouse with end-users to ensure that
it meets their requirements and is easy to use.
Overall, tuning and testing a data warehouse are critical to ensuring its performance and
functionality, and should be conducted regularly to ensure that the data warehouse continues
to meet the evolving needs of its users.
Warehousing applications and Recent Trends
Warehousing applications encompass a wide range of software solutions designed to manage,
analyze, and derive insights from large volumes of data stored in data warehouses. Here are
some types of warehousing applications along with recent trends:
Types of Warehousing Applications
1. Business Intelligence (BI) Tools:
• BI tools enable users to visualize and analyze data stored in the data warehouse
through dashboards, reports, and ad-hoc queries.
• Features include data visualization, interactive dashboards, drill-down
capabilities, and predictive analytics.
2. Data Integration and ETL Tools:
• ETL (Extract, Transform, Load) tools facilitate the extraction, transformation,
and loading of data into the data warehouse from various sources.
• Features include data profiling, data cleansing, data mapping, and workflow
orchestration.
3. Data Mining and Predictive Analytics:
• Data mining tools use statistical algorithms and machine learning techniques to
discover patterns, trends, and insights from data.
• Predictive analytics tools enable organizations to forecast future outcomes and
make data-driven predictions based on historical data.
4. Data Governance and Metadata Management:
• Data governance tools enforce policies and standards for data management,
ensuring data quality, security, and compliance.
• Metadata management tools capture and manage metadata about data assets,
schemas, and lineage within the data warehouse.
5. Advanced Analytics and Machine Learning:
• Advanced analytics platforms provide capabilities for performing complex
analytical tasks such as clustering, classification, and sentiment analysis.
• Machine learning platforms enable organizations to build, train, and deploy
machine learning models for predictive analytics and automated decision-
making.
6. Data Visualization and Self-Service Analytics:
• Data visualization tools allow users to create interactive visualizations and
explore data intuitively.
• Self-service analytics platforms empower business users to perform ad-hoc
analysis and generate insights without relying on IT support.
Recent Trends in Warehousing Applications
1. Cloud-Based Solutions:
• The adoption of cloud-based data warehousing solutions has increased, offering
scalability, flexibility, and cost-effectiveness.
• Cloud data warehouses such as Snowflake, Amazon Redshift, and Google
BigQuery provide managed services with built-in scalability and integration
capabilities.
2. Augmented Analytics:
• Augmented analytics combines machine learning and natural language
processing to automate data preparation, insight discovery, and decision-
making.
• Features include automated insights, anomaly detection, and natural language
query interfaces.
3. Real-Time Analytics:
• Real-time analytics solutions enable organizations to analyze streaming data
and respond to events in real-time.
• Technologies such as Apache Kafka, Apache Flink, and Apache Spark
Streaming support real-time data processing and analytics.
4. DataOps and DevOps Practices:
• DataOps and DevOps practices are increasingly being adopted to streamline and
automate the development, deployment, and operation of data warehousing
applications.
• Continuous integration, continuous deployment (CI/CD), and infrastructure as
code (IaC) principles are applied to data pipelines and analytics workflows.
5. Data Privacy and Compliance:
• With the growing focus on data privacy regulations such as GDPR and CCPA,
there is an increased emphasis on data governance, privacy protection, and
compliance management.
• Tools for data anonymization, pseudonymization, and consent management
help organizations comply with regulatory requirements.
6. AI-Powered Analytics:
• AI-powered analytics solutions leverage artificial intelligence and machine
learning algorithms to automate data analysis, generate insights, and improve
decision-making.
• Features include automated anomaly detection, pattern recognition, and
recommendation engines.
Warehousing applications play a crucial role in managing, analyzing, and deriving insights
from data stored in data warehouses. Recent trends in warehousing applications include the
adoption of cloud-based solutions, augmented analytics, real-time analytics, DataOps/DevOps
practices, data privacy and compliance tools, and AI-powered analytics. By leveraging these
trends, organizations can enhance their data warehousing capabilities and drive innovation in
data-driven decision-making.
Lecture 38: Types of Data Warehouses Application
Data warehouses come in various forms, each designed to meet specific organizational needs
and data handling approaches. Understanding these different types is crucial for choosing the
right solution to improve data management and support business intelligence efforts.

1. Enterprise Data Warehouse (EDW)


An Enterprise Data Warehouse (EDW) refers to a comprehensive data repository that integrates
data drawn from different areas of an organisation. It holds all information for all business units
required giving a consolidated and unified view of the organization.
Features of Enterprise Data Warehouse (EDW)
• Scope and Scale: The EDWs are deployed for managing data generated from diverse
business areas including finance, marketing, selling, and operations.
• Integration: It gathers information from transactional databases, ERP systems, CRM
systems and other data feeds from outside the business environment.
• Architecture: Usually, the structure of an EDW complies with a star/snowflake scheme
containing facts (numeric, measurable information) and dimensions (qualitative and
contextual data).
• Usage: They facilitate complex queries, high-level analysis, and business insight
information within the business.
• Maintenance: Being central repositories of an organization’s data, EDWs demand
investments in infrastructure, upkeep, as well as stewardship for the quality of data kept
in them.
2. Operational Data Store (ODS)
An ODS is another form of data warehouse data layer that holds data from operational systems
in a consolidated and integrated format for near real-time reporting and operational analysis.
Features of Operational Data Store (ODS)
• Real-Time Data: ODSs are intended for the incorporation of updated information as
close as possible to the operational time, so they offer the current look at the operational
data.
• Scope: While EDWs are used for strategic decision-making, ODSs deal with the
organization’s operational activities and transactional requirements.
• Integration: They collect data from different operational sources like transactional,
ERP, CRM and so on for which the data is denormalized mostly.
• Usage: Other Decision Support System is utilized for short-term decisions and
comprises inventory control, customers’ orders and control of employee scheduling.
• Architecture: Generally, ODSs do not retain data history; they provide real-time data
for fast read and write operations and are data marts for loading data into the EDW.
3. Data Mart
A Data Mart can be defined as an element of a Data Warehouse system designed to hold data
from a particular business division, department or user type. It is created to serve the specific
interests of a specific class of people.
Features of Data Mart
• Scope: Data marts are less comprehensive than EDWs and relevant to the
organization’s specific departments, such as sales, finance, or marketing.
• Integration: They routinely pull or gather data from the EDW and other operation
sources to develop a specific data set.
• Usage: Data marts are used to supply departmental reporting and analysis as the users
of this type of data get real-time access to the data that can be useful for their
organizational tasks.
• Architecture: Data marts can be deployed based on the star schema or any other
simplification of a star-war structure. They can be operations-based or derived
(extracted from the operational system directly) or they can be dependent, that is,
retrieved from the EDW.
• Flexibility: They also introduce faster implementation and lower cost than building an
entire EDW to satisfy all the needs of an organization because they can address
departmental requirements instead.
4. Cloud Data Warehouses
Cloud Data Warehouses are Data Warehousing solutions that are located on the cloud platform
that offer a scalable platform for effective usage of data storage and analysis.
Features of Cloud Data Warehouses
• Scalability: Cloud data warehouses mean that you can adjust the storage or
computational power as required to fit various loads and data amounts.
• Cost-Effectiveness: It is a logical model because it entails organizations making
payments only to the extent of the services they want, therefore departing from huge
initial capital investments.
• Maintenance: Cloud service providers take care of the infrastructure, and manage their
upgrades and security thus freeing managers to give their time to analysis on data
patterns.
• Integration: They enable the easy integration of different forms of data, the cloud, and
other applications to help with data ingestion.
• Flexibility: Cloud data warehouses facilitate several analytics, including operational
processing, near real-time, machine learning, and complex business analytics. Some
examples include Amazon Redshift, Google BigQuery, and most recently, Snowflake.
5. Big Data Warehouses
Big Data Warehouses are advanced preparation instruments for data warehousing to address
gigantic volumes of structured and unstructured data that are created with velocity.
Features of Big Data Warehouses
• Volume, Variety, and Velocity: They can analyze and retain wide arrays of data
collected from social media platforms, IoT devices, logs, and many others.
• Architecture: Large buckets usually take advantage of distributed computing software
like Hadoop and Apache Spark in technologies like HDFS, No SQL database, and
columnar storage.
• Analytics: They enable data mining, machine learning and real-time analysis
techniques that are used in decision-making from the big data.
• Integration: Most big data platforms and tools interconnect with various big data
warehouses that facilitate data ingestion, transformation, as well as analysis.
• Use Cases: They are applied in areas like finance, healthcare, retail/wholesale, telecom,
and oil & gas, to name a few: consumer behaviour analysis, fraud detection, predictive
maintenance, and the like.
6. Virtual Data Warehouse
A Virtual Data Warehouse provides a logical view of data from multiple sources without
physically storing the data in a central location.
Features of Virtual Data Warehouses:
• Data Integration: Combines data from various sources without physical movement.
• Real-time Access: Provides up-to-date information directly from source systems.
• Reduced Storage Costs: Minimizes the need for additional storage infrastructure.
• Flexibility: Easily adapts to changes in source systems.
7. Hybrid Data Warehouse
A Hybrid Data Warehouse combines on-premises and cloud-based data storage and processing
capabilities.
Features of Hybrid Data Warehouses:
• Flexibility: Allows organizations to keep sensitive data on-premises while leveraging
cloud benefits.
• Scalability: Can scale resources up or down based on needs.
• Cost-Effective: Balances the benefits of cloud and on-premises solutions.
• Data Sovereignty: Helps comply with data residency requirements.
8. Real-time Data Warehouse
A Real-time Data Warehouse is designed to process and analyze data as it's generated,
providing immediate insights.
Features of Real-time Data Warehouses:
• Immediate Data Processing: Analyzes data in real-time or near-real-time.
• Low Latency: Minimizes delay between data generation and availability for analysis.
• Streaming Data Support: Handles continuous data streams effectively.
• Time-Sensitive Decision Support: Enables quick responses to changing conditions.
Comparison of Data Warehouse Types

Real-time
Type Best For Scalability Cost Complexity Capability

EDW Large enterprises High High High Limited

ODS Operational reporting Medium Medium Medium High

Department-specific
Data Mart Low Low Low Medium
needs

Cloud Flexible, scalable Pay-as-


Very High Medium High
DW needs you-go

Big Data
Large, varied datasets Very High High High High
DW

Virtual Distributed data


Medium Low Medium High
DW sources

Hybrid Balancing security


High Medium High Medium
DW and scalability

Real-time
Immediate insights High High High Very High
DW
Lecture 39: Web Mining, Spatial Mining
Web mining, spatial mining, and temporal mining are specialized areas within the field of data
mining that focus on extracting knowledge and insights from specific types of data. Here’s an
overview of each:
Web Mining
Web mining involves extracting useful information and patterns from web data, including web
pages, web logs, social media, and online transactions. It encompasses three main types:
1. Web Content Mining:
• Involves extracting useful information from web documents, such as text,
images, and multimedia content.
• Techniques include text mining, image analysis, and natural language
processing (NLP) to analyze web content.
2. Web Structure Mining:
• Focuses on analyzing the structure of the web, including hyperlinks between
web pages and the topology of the web graph.
• Techniques include link analysis algorithms, such as PageRank and HITS, to
identify important web pages and communities within the web graph.
3. Web Usage Mining:
• Analyzes user interactions with web-based systems, including web logs,
clickstream data, and user sessions.
• Techniques include sessionization, path analysis, and association rule mining to
understand user behavior and preferences.
Spatial Mining
Spatial mining, also known as geospatial data mining or spatial analytics, focuses on extracting
patterns and insights from spatial data. It deals with data that has spatial or geographic
attributes, such as maps, satellite imagery, GPS coordinates, and location-based services. Key
techniques include:
1. Spatial Clustering:
• Identifies spatially dense regions or clusters within spatial datasets.
• Techniques include density-based clustering algorithms like DBSCAN and
hierarchical clustering methods.
2. Spatial Classification:
• Predicts categorical attributes or classes for spatial objects based on their spatial
features.
• Techniques include spatial decision trees, support vector machines (SVM), and
spatial naive Bayes classifiers.
3. Spatial Association Analysis:
• Identifies relationships and associations between spatial objects or attributes.
• Techniques include spatial autocorrelation analysis, hotspot detection, and
spatial association rule mining.
4. Spatial Visualization:
• Visualizes spatial data and analysis results using maps, charts, and interactive
visualizations.
• Techniques include choropleth maps, heatmaps, and interactive GIS
(Geographic Information System) tools.
Lecture 40: Temporal Mining
Temporal mining focuses on extracting patterns and trends from temporal data, which includes
time-stamped sequences, time series, event logs, and temporal databases. It deals with
analyzing data that evolves over time to discover temporal dependencies and patterns. Key
techniques include:
1. Time Series Analysis:
• Analyzes time-stamped data sequences to identify patterns, trends, and
seasonality.
• Techniques include statistical methods, autoregressive models, and spectral
analysis.
2. Temporal Association Analysis:
• Identifies temporal patterns and associations between events or sequences of
events.
• Techniques include sequence mining, temporal rule discovery, and episode
detection.
3. Temporal Clustering:
• Groups temporal data into clusters based on temporal similarity or behavior.
• Techniques include dynamic time warping (DTW), time-based clustering
algorithms, and event stream clustering.
4. Temporal Visualization:
• Visualizes temporal data and analysis results to facilitate exploration and
interpretation.
• Techniques include time series plots, calendar heatmaps, and animated
visualizations.
Web mining, spatial mining, and temporal mining are specialized areas within data mining that
focus on extracting knowledge and insights from specific types of data. By applying techniques
and algorithms tailored to the characteristics of web, spatial, and temporal data, organizations
can uncover valuable patterns, trends, and relationships to support decision-making and
enhance understanding in various domains.

You might also like