
Data Analytics Overview for COMP 333

The document provides an overview of Data Analytics, emphasizing its importance across various industries and outlining the course objectives for COMP 333. It details the main components of data analytics, including descriptive data analysis, data wrangling, and exploratory data analysis, as well as the iterative data analytics process. The document also distinguishes between different types of data analysis: descriptive, predictive, and prescriptive.

Uploaded by

lankwitzjacques
Copyright
© All Rights Reserved

COMP 333 — Week 1 Nutshell

Data Analytics in a Nutshell


This lecture provides an overview of Data Analytics
to let you orient yourself for COMP 333
and see what is important and where to focus your efforts.
To quote the course outline:
“The aim of this course is to introduce students
to the Python programming language
and related tools for data analytics; and
to expose them to a broad range of data analysis problems across a range of disciplines.”

Why?

Because

Data Analytics has permeated every industry, government, and business function.
The future will need data-driven approaches for all fields of human endeavour.

What is Data Analytics?


The aim of data analytics is to add value to your data
so it becomes actionable data
which means it helps you and your organisation to make decisions.
You will see it termed “monetization of data” in the business world.
The main steps of data analytics are
- descriptive data analysis
- data wrangling
- exploratory data analysis
These steps fit into an overall data analytics process
where you combine an understanding of the data and the business
to come up with data-driven input into the decision-making of the organization.

THE IMPORTANT THINGS

This is an overview.
It is an orientation of what is to come.
You are not meant to understand everything in this document today!
Each topic will be done in much more detail again later in the semester.

Descriptive Data Analysis (DDA)


DDA is a basic tool for understanding your data.
DDA is used throughout all stages of data analytics.

Be aware of the type of data that you have:


- categorical versus continuous
– categorical: nominal versus ordinal
– continuous: interval versus ratio
- structured versus unstructured
and for numerical values, be aware of
- accuracy
- precision
- significant digits
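
As a rough illustration of the three terms above: accuracy is how close measurements are to the true value, precision is how closely repeated measurements agree with each other, and significant digits express how much of a number is meaningful. The readings, the true value, and the helper `round_sig` are hypothetical and only serve the sketch.

```python
import math
import statistics

TRUE_VALUE = 100.0  # assumed known reference value (hypothetical)

# Two hypothetical instruments measuring the same quantity four times each.
readings_a = [99.9, 100.1, 100.0, 100.0]   # close to the truth, tightly grouped
readings_b = [102.0, 102.1, 101.9, 102.0]  # tightly grouped, but offset

# Accuracy: distance of the average reading from the true value (bias).
bias_a = statistics.mean(readings_a) - TRUE_VALUE
bias_b = statistics.mean(readings_b) - TRUE_VALUE

# Precision: spread of the repeated readings.
spread_a = statistics.stdev(readings_a)
spread_b = statistics.stdev(readings_b)


def round_sig(x, digits):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)


print(bias_a, spread_a)
print(bias_b, spread_b)
print(round_sig(0.004567, 2), round_sig(123456, 2))
```

Instrument B is about as precise as A but much less accurate, which is why the two properties are checked separately.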

Describe the data and the data distribution for each feature in the dataset
- central tendency: mean, median, mode
- variation: standard deviation, inter-quartile range (IQR)
- outlier values and extreme values
- skew
- kurtosis
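
The summary measures listed above can be computed for a single feature as in the following minimal sketch. It uses only Python's standard `statistics` module; the sample values are made up, and skew and kurtosis are computed from population moments, which is one common convention among several.

```python
import statistics

# Hypothetical sample for one numeric feature.
values = [2, 4, 4, 5, 7, 9, 11]

# Central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Variation: sample standard deviation and inter-quartile range.
stdev = statistics.stdev(values)
q1, q2, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
iqr = q3 - q1

# Shape: skew and excess kurtosis from population moments.
n = len(values)
pstdev = statistics.pstdev(values)
skew = sum((x - mean) ** 3 for x in values) / (n * pstdev ** 3)
excess_kurtosis = sum((x - mean) ** 4 for x in values) / (n * pstdev ** 4) - 3

print(f"mean={mean} median={median} mode={mode}")
print(f"stdev={stdev:.2f} IQR={iqr}")
print(f"skew={skew:.2f} excess kurtosis={excess_kurtosis:.2f}")
```

A positive skew, as here, means the distribution has a longer tail towards large values, which is also why the mean sits above the median.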

You want descriptions that are robust to the presence of outliers,
for example the median and inter-quartile range rather than the mean and standard deviation.
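
To see what robustness means in practice, the sketch below compares how far the mean and the median move when a single value is corrupted. The numbers are invented, and the "typo" scenario is just an assumption for illustration.

```python
import statistics

clean = [10, 11, 12, 13, 14]
# Hypothetical data-entry error: the last value 14 was typed as 140.
with_outlier = clean[:-1] + [140]

mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print("mean moved by", mean_shift)
print("median moved by", median_shift)
```

The single bad value drags the mean far from the bulk of the data while the median stays where it was.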


Visualize distributions with box plots, violin plots, histograms, and scatter plots.

Data Wrangling

Data wrangling is extremely important because your data is typically “messy”
and remember the Garbage-In-Garbage-Out (GIGO) rule for computation,
so you need to tidy-up your data before doing “serious” work.

Data wrangling is generally 60%+ of the time and effort for data analytics!

For data wrangling, you need to look closely at your data, so DDA is a basic tool.

Steps in data wrangling:


Step 1: Discover
Step 2: Structure
Step 3: Cleanse
Step 4: Enrich
Step 5: Validate
Step 6: Publish

Issues for data cleaning:


- errors in data
- outliers and anomalies
- missing values and imputation of missing values
- unification and normalization so data is comparable
- entity recognition
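
Two of the cleaning issues above, outliers and missing values, can be sketched with the standard library alone. The column of ages is hypothetical, the 1.5 × IQR fence is one common rule of thumb rather than the only choice, and median imputation is just one of several possible strategies.

```python
import statistics

# Hypothetical raw column: None marks a missing value, 250 looks suspicious.
ages = [23, 25, None, 31, 29, 250, 27, None, 30]

observed = [a for a in ages if a is not None]

# Flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in observed if a < low or a > high]

# Impute missing values with the median of the non-outlying observations.
typical = [a for a in observed if low <= a <= high]
fill = statistics.median(typical)
cleaned = [fill if a is None else a for a in ages]

print("outliers:", outliers)
print("fill value:", fill)
print("cleaned:", cleaned)
```

The flagged value is reported but left in place here; whether to correct, remove, or keep it is a decision that needs knowledge of how the data was collected.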

Data wrangling is the traditional ETL (Extract-Transform-Load) process
from data warehouses and OLAP (online analytical processing).

The output of data wrangling is formatted as “Tidy Data”
which has three basic properties:
1. Each variable is saved in its own column
2. Each observation is saved in its own row
3. Each type of observation is stored in its own (single) table
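
As a small illustration of the first two properties, the sketch below reshapes a "wide" table, where one variable (the assignment) is hidden in the column names, into tidy rows. The table and its column names are hypothetical.

```python
# Hypothetical messy table: one row per student, one column per assignment.
wide = [
    {"student": "Ana", "a1": 80, "a2": 90},
    {"student": "Ben", "a1": 70, "a2": 85},
]

# Tidy form: the variables are student, assignment, and score,
# and each row holds exactly one observation (one score).
tidy = [
    {"student": row["student"], "assignment": key, "score": value}
    for row in wide
    for key, value in row.items()
    if key != "student"
]

for record in tidy:
    print(record)
```

Once the data is in this shape, questions such as "what is the average score per assignment" become a simple group-and-aggregate over one column.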

Exploratory Data Analysis (EDA)

EDA grew out of the statistics community.
EDA is the heart of data analytics.
EDA involves data wrangling and descriptive data analysis.

EDA develops a data-driven solution to your problem
by exploring the data to find which features lead to a solution.

The steps of EDA


Step 1: Data wrangling: collect, load, enrich data
Step 2: Descriptive data analysis: check data types, check distributions
Step 3: Feature engineering
Step 4: Modeling
Step 5: Story-Telling

A checklist for EDA:


Q1. What question(s) are you trying to solve (or prove wrong)?
Q2. What kind of data do you have and how do you treat different types?
Q3. What's missing from the data and how do you deal with it?
Q4. Where are the outliers and why should you care about them?
Q5. How can you add, change or remove features to get more out of your data?
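
For Q5, here is a minimal sketch of adding, changing, and removing features. The order records and field names are hypothetical, and which features are actually useful depends on the question from Q1.

```python
from datetime import date

# Hypothetical raw order records.
orders = [
    {"order_date": date(2024, 1, 6), "quantity": 3, "unit_price": 4.0, "internal_id": "x1"},
    {"order_date": date(2024, 1, 8), "quantity": 1, "unit_price": 10.0, "internal_id": "x2"},
]

features = []
for order in orders:
    features.append({
        # Added feature: combine two raw columns into one meaningful quantity.
        "revenue": order["quantity"] * order["unit_price"],
        # Changed feature: reduce a date to the aspect we think matters.
        "is_weekend": order["order_date"].weekday() >= 5,
        # Removed feature: internal_id is dropped, assuming it carries no signal.
    })

for row in features:
    print(row)
```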

The Data Analytics Process

The data analytics process is how the business community looks at data analytics.

Step 1: Business Understanding: What are the business goals and problems?
Step 2: Data Understanding: Explore and visualize the data.
Step 3: Data Preparation: Generate features
Step 4: Modeling: Create models
Step 5: Evaluating: Train models and evaluate effectiveness
Step 6: Deploying: Use this data-driven approach for the goal of the business on a regular
basis.

This can be viewed as a highly iterative cycle:


- Define the Goal: What problem are you solving?
- Collect and Manage Data: What information do I need?
- Build the Model: Find patterns in the data that lead to solutions.
- Evaluate and Critique the Model: Does the model solve your problem?
- Present Results and Document: Establish that you can solve the problem, and how.
- Deploy Model: Deploy the model to solve the problem in the real world.

THE LESS IMPORTANT THINGS

Models and Machine Learning

Story-Telling and Visualization

Deployment and Big Data Infrastructure


THE LESS LESS IMPORTANT THINGS as these provide mainly context

Correlation, Causality, and Confounding Factors

Data Warehouses and Business Intelligence

Confirmatory Data Analysis


that is, the scientific method with
planned (not exploratory)
experimental design, data collection, and data analysis

Descriptive vs Predictive vs Prescriptive Data Analysis


Descriptive Data Analysis describes your data from past activities, provides insight into
the past, and answers: “What has happened?”
Predictive Data Analysis provides results for unseen data for future activities, uses statistical
models and forecasts to understand the future, and answers: “What could happen?”
Prescriptive Data Analysis models viable solutions to a problem and the impact of considering
a solution, uses optimization and simulation algorithms to advise on possible outcomes,
and answers: “What should we do?”
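
The three kinds of analysis can be contrasted on one toy problem. Everything in this sketch is an assumption made for illustration: the sales figures, the straight-line forecast, the price and cost values, and the `profit` function standing in for a real optimization model.

```python
import statistics

# Hypothetical monthly sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive: what has happened?
average_sales = statistics.mean(sales)

# Predictive: what could happen? Fit a least-squares line and extrapolate.
mx, my = statistics.mean(months), statistics.mean(sales)
slope = (sum((x - mx) * (y - my) for x, y in zip(months, sales))
         / sum((x - mx) ** 2 for x in months))
intercept = my - slope * mx
forecast = intercept + slope * 6  # predicted sales for month 6


# Prescriptive: what should we do? Pick the stock level with the best
# profit under the forecast demand.
def profit(stock, demand, price=5.0, cost=3.0):
    return price * min(stock, demand) - cost * stock


options = [15, 20, 25]
best_stock = max(options, key=lambda stock: profit(stock, forecast))

print("average so far:", average_sales)
print("forecast for month 6:", forecast)
print("recommended stock:", best_stock)
```

Each stage builds on the previous one: the description summarises the past, the prediction extends it, and the prescription turns the prediction into a decision.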
