Data Foreseeing
Unit 1
Contents:- Data engineering, technical requirements of data during machine learning
modeling:- identify numerical and categorical variables, missing data, determine
cardinality in categorical variables, identify linear relationship, identify normal
distribution, highlighting outliers.
Applicable Course Outcome (CO) under Outcome Based Education (OBE) :-
CO1:Explain the fundamentals of exploratory data analysis.
Vinay S. Prabhavalkar 1
EDA Definition & Need
Definition
• Exploratory Data Analysis refers to the critical process of performing initial
investigations on data in order to discover patterns, spot anomalies, test
hypotheses, and check assumptions with the help of summary statistics
and graphical representations.
Need
• Cleaning and preprocessing
– Finding and rectifying missing values
– Exploring numerical variables
– Exploring categorical variables
– Anomaly detection and removal, outlier detection and removal
• Feature Engineering
– Finding relationship between features
• Statistical Analysis
– Descriptive statistical analysis: Summarizes the information within a data set
without drawing conclusions about its contents.
– For example, if a business gave you a book of its expenses and you summarized
the percentage of money it spent on different categories of items, then you
would be performing a form of descriptive statistics.
– When performing descriptive statistics, you will often use data visualization to
present information in the form of graphs, tables, and charts to clearly convey
it to others in an understandable format.
– Typically, leaders in a company or organization will then use this data to guide
their decision making going forward.
– Inferential statistical analysis: Inferential statistics takes the results of descriptive
statistics one step further by drawing conclusions from the data and then making
recommendations.
– For example, instead of only summarizing the business's expenses, you might go on
to recommend in which areas to reduce spending and suggest an alternative
budget.
– Inferential statistical analysis is often used by businesses to inform company
decisions and in scientific research to find new relationships between variables.
• Visualization for trend analysis, discovery of patterns
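The expense example used above for descriptive statistics can be sketched in a few lines of Python. This is only an illustration: the expense records and the helper name `category_shares` are hypothetical and do not come from any real business data.

```python
from collections import defaultdict

# Hypothetical expense records as (category, amount) pairs.
expenses = [
    ("rent", 1200.0),
    ("salaries", 3000.0),
    ("supplies", 300.0),
    ("rent", 1200.0),
    ("supplies", 300.0),
]

def category_shares(records):
    """Return each category's percentage share of total spending."""
    totals = defaultdict(float)
    for category, amount in records:
        totals[category] += amount
    grand_total = sum(totals.values())
    return {cat: 100.0 * amt / grand_total for cat, amt in totals.items()}

# Descriptive step: summarize what was spent, without recommending anything.
shares = category_shares(expenses)
for cat, pct in sorted(shares.items()):
    print(f"{cat}: {pct:.1f}%")
```

Deciding which category to cut on the basis of these shares would be the inferential step described above; the code stops at the summary.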
What is a Dataset?
• A dataset is a collection of related data, often stored in a file or a database.
• Typically, datasets take on a tabular format consisting of rows and
columns. Each column represents a specific variable, while each row
corresponds to a single record (observation). Some datasets consisting of
unstructured data are non-tabular, meaning they don’t fit the traditional
row-column format.
What is Data Analysis?
• Data analysis refers to the process of manipulating raw data to uncover
useful insights and draw conclusions. During this process, a data analyst or
data scientist will organize, transform, and model a dataset.
• Organizations use data to solve business problems, make informed
decisions, and effectively plan for the future. Data analysis ensures that
this data is optimized and ready to use.
• Some specific types of data analysis include:
– Descriptive analysis
– Diagnostic analysis
– Predictive analysis
– Prescriptive analysis
• Regardless of your reason for analyzing data, there are six simple steps
that you can follow to make the data analysis process more efficient.
Steps in EDA
1. Data Collection: It refers to the process of finding and loading data into
our system. Good, reliable data can be found on various public sites or
bought from private organizations. Some reliable sites for data collection
are Kaggle, GitHub, the UCI Machine Learning Repository, etc.
2. Data Cleaning: Refers to the process of removing unwanted variables and
values from your dataset and getting rid of any irregularities in it. Such
anomalies can disproportionately skew the data and hence adversely
affect the results. Some steps that can be done to clean data are:
– Removing missing values, outliers, and unnecessary rows/ columns.
– Re-indexing and reformatting our data.
3. Univariate Analysis: In Univariate Analysis, you analyze data of just one
variable. A variable in your dataset refers to a single feature/column. You
can do this with graphical means such as histograms and box plots.
4. Bivariate Analysis: Here, you use two variables and compare them. This
way, you can find how one feature relates to the other. It is done with scatter
plots, which plot individual data points, or with correlation matrices, which
show the correlation as color hues. You can also use box plots.
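The cleaning, univariate, and bivariate steps above can be sketched with pandas. This is a minimal sketch under stated assumptions: the toy dataset and its column names are hypothetical, and the 1.5 × IQR rule is one common convention for highlighting outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38, 29, 120],
    "income": [30, 42, 61, 45, 70, 50, 35, 52],
    "city": ["Pune", "Mumbai", None, "Pune", "Delhi", "Mumbai", "Pune", "Delhi"],
})

# Data cleaning: count missing values per column, then drop incomplete rows
# and re-index the remaining rows.
missing_per_column = df.isna().sum()
clean = df.dropna().reset_index(drop=True)

# Highlight outliers in "age" with the 1.5 * IQR rule (an assumed convention).
q1, q3 = clean["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["age"] < q1 - 1.5 * iqr) | (clean["age"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]

# Univariate analysis: summary statistics for a single variable.
age_summary = clean["age"].describe()

# Bivariate analysis: correlation matrix for two numerical variables.
corr = clean[["age", "income"]].corr()

print(missing_per_column)
print(outliers)
print(age_summary)
print(corr)
```

For the graphical counterparts, `clean["age"].plot.hist()`, `clean.boxplot(column="age")`, and `clean.plot.scatter(x="age", y="income")` produce a histogram, a box plot, and a scatter plot when Matplotlib is installed.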
Variables in a Data Set
In data science, variables are the building blocks of any analysis. They allow us
to group, compare, and contrast data points to uncover trends and draw
conclusions. But not all variables are created equal; there are different types
of variables that have specific uses in data science.
The picture below represents different types of variables one can find when
working on statistics / data science projects:
• Categorical Variables are a type of data that can be grouped into categories,
based on certain characteristics. They are typically used in statistical analysis
to measure the relationships between different factors in a study. Categorical
variables are also known as qualitative variables because they represent
values without any numerical significance. The following include different
types of categorical variables:
• Binary / Boolean variables: Binary variables are commonly used when
measuring dichotomous outcomes and whether someone is classified as
belonging to a particular group or not. Examples could include gender (male /
female), current employment status (yes/no), etc.
• Nominal variables: Nominal variables are generally used when there are
multiple categories that need to be identified but cannot be compared against
each other due to their qualitative nature. Examples include eye color,
nationality and religious beliefs. The following are some of the examples of
nominal variables.
– Profession is another example of a nominal categorical variable in which
there could be many different categories depending on the context. A
person’s profession could fall into categories such as doctor, nurse, lawyer,
engineer, etc., without any
order to their relative importance or priority.
– Marital status is yet another example of a nominal categorical variable with
possible values including single, married, divorced and widowed. Again there
is no ordering between these states; they are all equally important.
– Another common example of a nominal categorical variable is political party
affiliation. Categories may include Republican, Democrat, Independent or
other minor parties depending on the country being studied. In this case too
the various parties have no ranking nor do they take precedence over one
another.
• Ordinal variables: Ordinal categorical variables are variables that represent
categories of data in which the order of the categories has meaning. The
categories are ranked, so that one is “greater than” or “less than” the other. For
example, an ordinal categorical variable such as educational level might have
categories from lowest to highest such as none, primary school, secondary school,
college or university degree. The following are some of the examples of ordinal
variables:
– A survey about a customer’s satisfaction with a product or service on a scale
from 1-5, where 1 is extremely dissatisfied and 5 is extremely satisfied. In this
situation, the higher numbers represent a greater amount of satisfaction.
– A survey question asking respondents to rate their agreement with a
statement on a scale from strongly disagree to strongly agree. Here again it
can easily be seen that one opinion is greater than the other when it comes to
agreement with the statement.
– A survey question asking respondents to rank their feelings towards
something on a five-point scale from hate to love. The five pre-defined
terms are labels rather than quantitative measurements, but they still have
an order: one feeling is greater than another when it comes to ranking
emotions towards something.
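The difference between nominal and ordinal categories can be made explicit in pandas with an ordered categorical. The sketch below reuses the educational level example; the sample values are hypothetical.

```python
import pandas as pd

# Category order from lowest to highest, as in the educational level example.
levels = ["none", "primary school", "secondary school",
          "college or university degree"]

# Hypothetical survey responses stored as an ordered categorical.
education = pd.Series(pd.Categorical(
    ["primary school", "college or university degree", "none", "secondary school"],
    categories=levels,
    ordered=True,
))

# Integer codes follow the declared order (0 = lowest level).
codes = education.cat.codes.tolist()

# Order-aware operations are only meaningful because ordered=True.
highest = education.max()
above_primary = (education > "primary school").tolist()

print(codes)
print(highest)
print(above_primary)
```

A nominal variable such as eye color would be created the same way but with `ordered=False`, in which case comparisons like `>` are not allowed.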
• Numerical variables are a type of variable used in data analysis to quantify or
measure the characteristics of an entity or phenomenon. They are also known as
quantitative variables because they involve counting, measuring, or assigning
values to a particular characteristic. Numerical variables can be divided into two
main types: continuous and discrete.
– The following are two different kinds of quantitative variables as shown in the
above diagram:
• Continuous variables: Continuous quantitative variables provide us with numerical
information that can be measured within a given range without any breaks.
– Examples of continuous quantitative variables include things such as weight,
height, length, pressure, temperature, speed, and time.
– These are all characteristics that can take any value within a range,
including fractional values, and can be measured to very precise levels.
– Weight is perhaps the most commonly used continuous variable as it can be
measured from any starting point up to virtually any given level with precision
using scales such as a balance beam scale or digital scale.
– Height is also a common continuous variable which is typically measured in
centimeters or inches depending on the unit of measurement used.
– Length is another example of a continuous variable which usually refers to the
distance between two points on an object and is often measured in feet or
meters depending on the context.
– Temperature is another important continuous variable which measures how
hot or cold something is with respect to a set reference point and is usually
measured in degrees Celsius (°C) or Fahrenheit (°F).
• Discrete variables: A discrete quantitative variable is a variable that takes on either
a finite or countable number of values.
– Examples of discrete quantitative variables include the number of siblings in a
family, the number of cars owned by an individual, or the number of members
in a sports team.
– Another example could be the age of an individual recorded in completed
years, which takes discrete values such as 18, 19, 20, etc., but not 12.45 or
10.7.
– Another example is the number of pets owned by an individual. This could be
used to keep track of how many animals are being kept as pets in households
around the world, as well as to provide information on animal welfare
standards.
– Additionally, it could provide insights into how different types of pets are
being kept – such as cats versus dogs – or what breeds are most popular.
– This could even be broken down further into specific cities or states to gain
more detailed insights on pet ownership trends in certain areas.
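In practice, a first pass at identifying numerical and categorical variables is often made from the column data types. The sketch below uses pandas; the dataset is hypothetical, and treating integer columns as discrete and float columns as continuous is only a heuristic that should be checked against what each column actually measures.

```python
import pandas as pd

# Hypothetical dataset; the column names are illustrative only.
df = pd.DataFrame({
    "height_cm": [170.2, 165.5, 180.1],
    "siblings": [1, 0, 3],
    "eye_color": ["brown", "blue", "green"],
    "employed": [True, False, True],
})

# Numerical columns have a numeric dtype; everything else is treated as categorical.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Heuristic split of numerical columns: integers as discrete, floats as continuous.
discrete_cols = [c for c in numeric_cols if pd.api.types.is_integer_dtype(df[c])]
continuous_cols = [c for c in numeric_cols if pd.api.types.is_float_dtype(df[c])]

print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
print("discrete:", discrete_cols)
print("continuous:", continuous_cols)
```

Note that a categorical variable stored as numbers (for example a 1-5 satisfaction score) would be reported as numerical by this check, so the result still needs a manual review.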
Cardinality of Categorical Variables
• The number of unique categories in a variable is called cardinality.
• For example, the cardinality of the Gender variable, which takes values
of female and male, is 2, whereas the cardinality of the Civil status variable, which
takes values of married, divorced, single, and widowed, is 4.
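Cardinality can be determined in pandas by counting the unique values in each column. The sketch below mirrors the Gender and Civil status example; the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical records using the categories from the example above.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "civil_status": ["married", "divorced", "single", "widowed", "married"],
})

# nunique() counts distinct non-missing values per column, i.e. the cardinality.
cardinality = df.nunique()
print(cardinality)

# value_counts() shows how often each category occurs.
print(df["civil_status"].value_counts())
```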
Linear Relationship between Variables
• A linear relationship is one in which two variables have a direct connection:
when the value of x changes, y changes at a constant rate.
• Plotted on a graph, a linear relationship forms a straight line; as a formula
it can be written as y = mx + c, where m is the slope and c is the intercept.
• A simple linear relationship involves two variables; linear equations in
general can include more than two.
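One way to check for a linear relationship between two variables is to compute the Pearson correlation coefficient and fit a straight line by least squares. The sketch below uses NumPy; the data points are hypothetical values chosen to lie close to a line.

```python
import numpy as np

# Hypothetical measurements that lie close to a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship.
r = np.corrcoef(x, y)[0, 1]

# Least-squares straight line y = slope * x + intercept (a degree-1 polynomial fit).
slope, intercept = np.polyfit(x, y, 1)

print(f"r = {r:.3f}")
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

A scatter plot of x against y remains the quickest visual check; the correlation coefficient only summarizes what that plot shows.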
Normal Distribution
• Normal distribution, also known as the Gaussian distribution, is a
probability distribution that is symmetric about the mean, showing that
data near the mean are more frequent in occurrence than data far from
the mean.
• In graphical form, the normal distribution appears as a "bell curve".
• Properties of the Normal Distribution
• Mean (average)
• Median (midpoint)
• Mode (most frequent observation)
• In a normal distribution, the mean, median, and mode are all equal.
• Standard deviation is the measure of how spread out a normally distributed set of
data is. It is a statistic that tells you how closely all of the examples are gathered
around the mean in a data set.
The Empirical Rule
– For all normal distributions, about 68.3% of the observations will appear
within plus or minus one standard deviation of the mean; about 95.4% of the
observations will fall within +/- two standard deviations; and about 99.7%
within +/- three standard deviations. This fact is sometimes referred to as
the "empirical rule".
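The percentages in the empirical rule follow from the normal distribution itself: the probability of falling within k standard deviations of the mean equals erf(k / √2). The sketch below computes them with the Python standard library; the helper name `share_within` is hypothetical.

```python
import math

def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within +/- {k} standard deviation(s): {100 * share_within(k):.1f}%")
```

Comparing these theoretical shares with the observed shares in a data set is one rough way to judge whether the data are approximately normal.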
• Skewness measures the degree of symmetry of a distribution. The normal
distribution is symmetric and has a skewness of zero.
• If the distribution of a data set instead has a skewness less than zero, or negative
skewness (left-skewness), then the left tail of the distribution is longer than the
right tail; positive skewness (right-skewness) implies that the right tail of the
distribution is longer than the left.
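Skewness can be computed directly from the data. The sketch below uses the moment-based definition (third central moment divided by the variance raised to the power 1.5), which is one of several common skewness formulas; the function name and the sample lists are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]           # tails are balanced
right_tailed = [1, 1, 1, 2, 10]       # long right tail
left_tailed = [-10, -2, -1, -1, -1]   # long left tail

print(skewness(symmetric))
print(skewness(right_tailed))
print(skewness(left_tailed))
```

The symmetric sample gives zero, the sample with a long right tail gives a positive value, and the sample with a long left tail gives a negative value, matching the description above. Library functions such as pandas `Series.skew()` apply a bias correction, so their results can differ slightly from this formula on small samples.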
Python Libraries for Data Analysis
• Many popular Python toolboxes/libraries:
– NumPy: It is an open-source Python library created in 2005 by Travis Oliphant,
used for working with arrays and matrices, linear algebra, Fourier transforms,
etc. NumPy stands for Numerical Python.
– SciPy: It is an open-source Python library which is used to solve scientific and
mathematical problems related to linear algebra, integration, interpolation,
special functions, FFT, signal and image processing etc.
– Pandas: It is an open-source library used for data manipulation and analysis,
working with relational or labeled data. The name is derived from "panel
data" and is also a play on "Python data analysis".
– Scikit-learn: Scikit-learn, also known as sklearn, is a Python library
used to implement machine learning models and statistical modelling.
• Visualization libraries
– Matplotlib: It is a comprehensive library for creating static, animated, and
interactive visualizations in Python.
– Seaborn: It is a Python data visualization library based on matplotlib.