Unit 1-Data Foreseeing

The document outlines the fundamentals of Exploratory Data Analysis (EDA), emphasizing its importance in discovering patterns, spotting anomalies, and making informed decisions through statistical analysis and visualization. It details the steps involved in EDA, including data collection, cleaning, and various types of analysis, while also explaining different types of variables and their significance in data science. Additionally, the document introduces key Python libraries for data analysis and visualization.

Uploaded by

ghatgeashish3

Data Foreseeing

Unit 1

Contents:- Data engineering, technical requirements of data during machine learning
modeling:- identify numerical and categorical variables, missing data, determine
cardinality in categorical variables, identify linear relationship, identify normal
distribution, highlighting outliers.

Applicable Course Outcome (CO) under Outcome Based Education (OBE) :-

CO1: Explain the fundamentals of exploratory data analysis.

Vinay S. Prabhavalkar 1
EDA Definition & Need
Definition
• Exploratory Data Analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test
hypotheses, and check assumptions with the help of summary statistics
and graphical representations.

Need
• Cleaning and preprocessing
– Finding and rectifying missing values
– Exploring numerical variables
– Exploring categorical variables
– Anomaly and outlier detection and removal
• Feature Engineering
– Finding relationships between features

• Statistical Analysis
– Descriptive statistical analysis: Summarizes the information within a data set
without drawing conclusions about its contents.
– For example, if a business gave you a book of its expenses and you summarized
the percentage of money it spent on different categories of items, then you
would be performing a form of descriptive statistics.
– When performing descriptive statistics, you will often use data visualization to
present information in the form of graphs, tables, and charts to clearly convey
it to others in an understandable format.
– Typically, leaders in a company or organization will then use this data to guide
their decision making going forward.
– Inferential statistical analysis: Inferential statistics takes the results of descriptive
statistics one step further by drawing conclusions from the data and then making
recommendations.
– For example, instead of only summarizing the business's expenses, you might go on
to recommend in which areas to reduce spending and suggest an alternative
budget.
– Inferential statistical analysis is often used by businesses to inform company
decisions and in scientific research to find new relationships between variables.
• Visualization for trend analysis, discovery of patterns
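
The expense example above can be sketched in a few lines of Python. This is a minimal illustration of descriptive statistics only; the ledger entries and the helper name `spending_share` are hypothetical and were invented for this sketch.

```python
from collections import defaultdict

# Hypothetical expense ledger: (category, amount) pairs invented for illustration.
expenses = [
    ("rent", 1200.0),
    ("salaries", 3000.0),
    ("supplies", 300.0),
    ("rent", 1200.0),
    ("supplies", 300.0),
]


def spending_share(records):
    """Return the percentage of total spending that went to each category."""
    totals = defaultdict(float)
    for category, amount in records:
        totals[category] += amount
    grand_total = sum(totals.values())
    return {cat: 100.0 * amt / grand_total for cat, amt in totals.items()}


shares = spending_share(expenses)
for category, pct in sorted(shares.items()):
    print(f"{category}: {pct:.1f}%")
```

The sketch only summarizes what was spent. Recommending where to cut spending would be the inferential step described above and is deliberately left out.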
What is a Dataset?
• A dataset is a collection of related data, often stored in a file or a database.
• Typically, datasets take on a tabular format consisting of rows and
columns. Each column represents a specific variable, while each row
corresponds to a single record (observation). Some datasets consisting of
unstructured data are non-tabular, meaning they don’t fit the traditional
row-column format.

What is Data Analysis?

• Data analysis refers to the process of manipulating raw data to uncover
useful insights and draw conclusions. During this process, a data analyst or
data scientist will organize, transform, and model a dataset.
• Organizations use data to solve business problems, make informed
decisions, and effectively plan for the future. Data analysis ensures that
this data is optimized and ready to use.
• Some specific types of data analysis include:
– Descriptive analysis
– Diagnostic analysis
– Predictive analysis
– Prescriptive analysis
• Regardless of your reason for analyzing data, there are six simple steps
that you can follow to make the data analysis process more efficient.
Steps in EDA
1. Data Collection: It refers to the process of finding and loading data into
our system. Good, reliable data can be found on various public sites or
bought from private organizations. Some reliable sites for data collection
are Kaggle, GitHub, Machine Learning Repository, etc.
2. Data Cleaning: Refers to the process of removing unwanted variables and
values from your dataset and getting rid of any irregularities in it. Such
anomalies can disproportionately skew the data and hence adversely
affect the results. Some steps that can be done to clean data are:
– Removing missing values, outliers, and unnecessary rows/ columns.
– Re-indexing and reformatting our data.
3. Univariate Analysis: In univariate analysis, you analyze data of just one
variable. A variable in your dataset refers to a single feature/column. You
can do this with graphical means such as histograms and box plots.
4. Bivariate Analysis: Here, you use two variables and compare them. This
way, you can find how one feature relates to the other. It is done with scatter
plots, which plot individual data points, or correlation matrices, which show
the correlation as color hues. You can also use box plots.
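
The cleaning, univariate, and bivariate steps above can be sketched with pandas. This is only an illustrative sketch: the table, its column names, and its values are hypothetical, and in practice the data would be loaded from a file in the data collection step. Plotting is omitted; numeric summaries stand in for the histograms and scatter plots.

```python
import pandas as pd

# Hypothetical data invented for illustration.
raw = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, None, 180.0, 190.0],
    "weight_kg": [50.0, 58.0, 66.0, 70.0, 74.0, 82.0],
    "notes": ["", "", "", "", "", ""],  # a column the analysis does not need
})

# Data cleaning: count missing values, drop the unnecessary column,
# drop rows with missing values, then re-index.
missing_per_column = raw.isna().sum()
clean = raw.drop(columns=["notes"]).dropna().reset_index(drop=True)

# Univariate analysis: summary statistics of a single variable.
height_summary = clean["height_cm"].describe()

# Bivariate analysis: Pearson correlation between the numeric variables.
corr = clean.corr()

print(missing_per_column)
print(height_summary)
print(corr)
```

Dropping rows is the simplest way to handle missing values; whether that is appropriate depends on how much data is lost, which the sketch does not address.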
Variables in a Data Set
In data science, variables are the building blocks of any analysis. They allow us
to group, compare, and contrast data points to uncover trends and draw
conclusions. But not all variables are created equal; there are different types
of variables that have specific uses in data science.
The picture below represents different types of variables one can find when
working on statistics / data science projects:

• Categorical variables are a type of data that can be grouped into categories
based on certain characteristics. They are typically used in statistical analysis
to measure the relationships between different factors in a study. Categorical
variables are also known as qualitative variables because their values are labels
rather than measured quantities. The different types of categorical variables
are:
• Binary / Boolean variables: Binary variables are commonly used when
measuring dichotomous outcomes and whether someone is classified as
belonging to a particular group or not. Examples could include gender (male /
female), current employment status (yes/no), etc.
• Nominal variables: Nominal variables are used when there are multiple
categories that need to be identified but that have no natural order, so they
cannot be ranked against each other. Examples include eye color,
nationality and religious beliefs. The following are some examples of
nominal variables:
– Profession is another example of a nominal categorical variable in which
there could be many different categories depending on the context. A
person’s profession could fall into categories such as doctor, nurse, lawyer,
engineer, etc., without any order to their relative importance or priority.
– Marital status is yet another example of a nominal categorical variable with
possible values including single, married, divorced and widowed. Again there
is no ordering between these states; they are all equally important.
– Another common example of a nominal categorical variable is political party
affiliation. Categories may include Republican, Democrat, Independent or
other minor parties depending on the country being studied. In this case too
the various parties have no ranking nor do they take precedence over one
another.
• Ordinal variables: Ordinal categorical variables are variables that represent
categories of data in which the order of the categories has meaning. The
categories are ranked, so that one is “greater than” or “less than” the other. For
example, an ordinal categorical variable such as educational level might have
categories from lowest to highest such as none, primary school, secondary school,
college or university degree. The following are some of the examples of ordinal
variables:

– A survey about a customer’s satisfaction with a product or service on a scale
from 1-5, where 1 is extremely dissatisfied and 5 is extremely satisfied. In this
situation, the higher numbers represent a greater amount of satisfaction.
– A survey question asking respondents to rate their agreement with a
statement on a scale from strongly disagree to strongly agree. Here again it
can easily be seen that one opinion is greater than the other when it comes to
agreement with the statement.
– A survey question asking respondents to rank their feelings towards
something on a scale from hate to love. The data are not quantitative, since
only five pre-defined terms are used in this example, but once again one
feeling ranks higher than another.
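
The difference between nominal and ordinal categories can be made concrete with an ordered categorical in pandas. This is a sketch under assumptions: the five agreement labels (including "neutral") and the responses are hypothetical, since the text only names the two ends of the scale.

```python
import pandas as pd

# Hypothetical five-point agreement scale, listed from lowest to highest.
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey responses invented for illustration.
responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]

# ordered=True tells pandas that the categories have a meaningful rank.
ordinal = pd.Categorical(responses, categories=levels, ordered=True)

# Each response is stored as the position of its category in `levels`.
codes = ordinal.codes.tolist()

# Because the categories are ordered, "lowest" and "highest" are well defined.
lowest = ordinal.min()
highest = ordinal.max()

print(codes, lowest, highest)
```

For a nominal variable such as profession, the categories would be declared without `ordered=True`, and asking for a minimum or maximum would not be meaningful.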

• Numerical variables are a type of variable used in data analysis to quantify or
measure the characteristics of an entity or phenomenon. They are also known as
quantitative variables because they involve counting, measuring, or assigning
values to a particular characteristic. Numerical variables can be divided into two
main types: continuous and discrete.

– The following are two different kinds of quantitative variables as shown in the
above diagram:

• Continuous variables: Continuous quantitative variables provide numerical
information that can take any value within a given range, without any breaks.
– Examples of continuous quantitative variables include things such as weight,
height, length, pressure, temperature, speed, and time.
– These characteristics can take any value between two points in the range and
can be measured to very precise levels.
– Weight is perhaps the most commonly used continuous variable as it can be
measured from any starting point up to virtually any given level with precision
using scales such as a balance beam scale or digital scale.
– Height is also a common continuous variable which is typically measured in
centimeters or inches depending on the unit of measurement used.
– Length is another example of a continuous variable which usually refers to the
distance between two points on an object and is often measured in feet or
meters depending on the context.
– Temperature is another important continuous variable which measures how
hot or cold something is with respect to a set reference point and is usually
measured in degrees Celsius (°C) or Fahrenheit (°F).

• Discrete variables: A discrete quantitative variable is a variable that takes on
a finite or countably infinite number of distinct values.
– Examples of discrete quantitative variables include the number of siblings in a
family, the number of cars owned by an individual, or the number of members
in a sports team.
– Another example could be the age of an individual recorded in completed
years, which takes discrete values such as 18, 19, 20, etc., but not 12.45 or 10.7.
– Another example is the number of pets owned by an individual. This could be
used to keep track of how many animals are being kept as pets in households
around the world, as well as to provide information on animal welfare
standards.
– Additionally, it could provide insights into how different types of pets are
being kept – such as cats versus dogs – or what breeds are most popular.
– This could even be broken down further into specific cities or states to gain
more detailed insights on pet ownership trends in certain areas.
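
In practice, a first pass at identifying numerical and categorical variables can be automated from the column data types. The sketch below uses pandas; the table and its column names are hypothetical, and the integer-versus-float check is only a heuristic for discrete versus continuous, not a rule.

```python
import pandas as pd

# Hypothetical table invented for illustration; comments give the intended type.
df = pd.DataFrame({
    "height_cm": [162.5, 175.0, 180.2, 158.9],                 # continuous
    "siblings": [0, 2, 1, 3],                                  # discrete
    "profession": ["doctor", "nurse", "lawyer", "engineer"],  # nominal
    "employed": [True, False, True, True],                     # binary
})

# Columns with a numeric dtype are treated as numerical variables;
# everything else (text, booleans) is treated as categorical.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

# Heuristic: integer columns are often discrete, float columns often continuous.
discrete = [c for c in numerical if pd.api.types.is_integer_dtype(df[c])]
continuous = [c for c in numerical if pd.api.types.is_float_dtype(df[c])]

print(numerical, categorical, discrete, continuous)
```

The data type alone can mislead: a satisfaction score stored as the integers 1 to 5 would be reported as numerical here even though it is an ordinal categorical variable, so the result should be checked against what each column means.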

Cardinality of Categorical Variables
• The number of unique categories in a variable is called cardinality.
• For example, the cardinality of the Gender variable, which takes the values
female and male, is 2, whereas the cardinality of the Civil status variable, which
takes the values married, divorced, single, and widowed, is 4.
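
The cardinality example above can be checked directly by counting the unique values per column. This is a minimal sketch with pandas; the rows are hypothetical and only the category labels come from the text.

```python
import pandas as pd

# Hypothetical records invented for illustration.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "civil_status": ["married", "divorced", "single", "widowed", "married"],
})

# nunique() counts the distinct values in each column, i.e. the cardinality.
cardinality = df.nunique()

print(cardinality)
```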

Linear Relationship between Variables

• A linear relationship is one in which two variables have a direct connection:
each unit change in x changes y by the same constant amount.
• It can be shown as a straight line on a graph or written as a mathematical
formula of the form y = mx + c, where m is the slope and c is the intercept.
• The simple linear relationship considered here involves exactly two variables;
linear equations in general can involve more.
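
One common way to check for a linear relationship is the Pearson correlation coefficient together with a fitted straight line. The sketch below uses NumPy on made-up data; the arrays are hypothetical and chosen so that the first pair is exactly linear and the second pair is clearly related but not linearly.

```python
import numpy as np

# Hypothetical data: y is an exact linear function of x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Pearson correlation: close to +1 or -1 for a strong linear relationship.
r_linear = np.corrcoef(x, y)[0, 1]

# Least-squares straight line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)

# Contrast: a perfect but non-linear (quadratic) relationship.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_sq = x_sym ** 2
r_quadratic = np.corrcoef(x_sym, y_sq)[0, 1]

print(r_linear, slope, intercept, r_quadratic)
```

The second pair shows why a scatter plot is still worth drawing: the correlation is zero even though y is fully determined by x, because the relationship is not a straight line.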

Normal Distribution
• Normal distribution, also known as the Gaussian distribution, is a
probability distribution that is symmetric about the mean, showing that
data near the mean are more frequent in occurrence than data far from
the mean.

• In graphical form, the normal distribution appears as a "bell curve".

• Properties of the Normal Distribution
• The mean (average), median (midpoint), and mode (most frequent observation)
are all equal.
• Standard deviation is the measure of how spread out a normally distributed set of
data is. It is a statistic that tells you how closely all of the examples are gathered
around the mean in a data set.
The Empirical Rule
– For all normal distributions, about 68.3% of the observations will appear within
plus or minus one standard deviation of the mean; about 95.4% of the observations
will fall within +/- two standard deviations; and about 99.7% within +/- three
standard deviations. This fact is sometimes referred to as the "empirical rule".
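
The percentages in the empirical rule can be reproduced from the cumulative distribution function of the normal distribution. The sketch below uses only the Python standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within +/- {k} standard deviations: {100 * share_within(k):.2f}%")
```

Because every normal distribution is a shifted and rescaled copy of the standard one, the same fractions hold for any mean and standard deviation.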

• Skewness measures the degree of asymmetry of a distribution. The normal
distribution is symmetric and has a skewness of zero.
• If the distribution of a data set instead has a skewness less than zero, or negative
skewness (left-skewness), then the left tail of the distribution is longer than the
right tail; positive skewness (right-skewness) implies that the right tail of the
distribution is longer than the left.
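
The sign of the skewness can be illustrated with a small calculation. The sketch below computes the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 1.5) in plain Python; this is one common definition among several, and the sample values are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population (1/n) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples invented for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # one large value stretches the right tail
left_skewed = [-v for v in right_skewed]     # mirror image: long left tail

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```

A value near zero is consistent with a symmetric distribution such as the normal, but it does not by itself prove that the data are normally distributed.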

Python Libraries for Data Analysis
• Many popular Python toolboxes/libraries:
– NumPy: It is an open-source Python library created in 2005 by Travis Oliphant,
used for working with arrays, matrices, linear algebra, Fourier transforms,
etc. NumPy stands for Numerical Python.
– SciPy: It is an open-source Python library which is used to solve scientific and
mathematical problems related to linear algebra, integration, interpolation,
special functions, FFT, signal and image processing etc.
– Pandas: It is an open-source library used for data manipulation and analysis,
working with relational or labeled data. The project describes itself as the
“Python Data Analysis Library”; the name is derived from “panel data”.
– Scikit-learn: Scikit-learn, also known as sklearn, is a Python library
used to implement machine learning models and statistical modelling.

• Visualization libraries
– Matplotlib: It is a comprehensive library for creating static, animated, and
interactive visualizations in Python.
– Seaborn: It is a Python data visualization library based on matplotlib.