CAMI16 DATA ANALYTICS
Module I - Syllabus
• Introduction: Data analytics: data collection, integration,
management, modelling, analysis, visualization, prediction and
informed decision making. General Linear Regression Model,
Estimation for β, Error Estimation, Residual Analysis.
What is Data?
• Data is a group of facts that can take many different forms, such as numbers, pictures,
words, videos, observations, and more.
• Recorded measurements.
• Knowledge with an evaluative component (decision): describes why.
• Collective information through individual expertise: describes how.
• Processed data with meaning: tells who, what, where, and when.
• Raw facts.
Types of Data
What is Data Analytics?
• The scientific process of transforming data into insights for making
better decisions.
• Data analytics is the collection, transformation, and organization of
these facts in order to draw conclusions, make predictions, and drive
informed decision making. Companies need data analysts to sort
through this data to help make decisions about their products,
services or business strategies.
What is Data Analytics?
• Data analytics is a multidisciplinary field that employs a wide range of
analysis techniques, including math, statistics, and computer science,
to draw insights from data sets.
• Data analytics is a broad term that includes everything from simply
analyzing data to theorizing ways of collecting data and creating the
frameworks needed to store it.
What is Data Analysis?
• The process of examining, transforming, and arranging raw data in a
specific way to generate useful information from it.
• Data analysis is a subcategory of data analytics that deals
specifically with extracting meaning from data. Data analytics, as a
whole, includes processes beyond analysis.
Data Analytics Examples
• Ex 1: A sneaker manufacturer might look at sales data to determine
which designs to continue and which to retire.
• Ex 2: A health care administrator may look at inventory data to
determine the medical supplies they should order.
• Ex 3: At Coursera, they may look at enrollment data to determine
what kind of courses to add to their offerings.
• Organizations that use data to drive business strategies often find that
they are more confident, proactive, and financially savvy.
Types of Data Analytics
Descriptive analytics tell us what
happened.
Diagnostic analytics tell us why something
happened.
Predictive analytics tell us what will likely
happen in the future.
Prescriptive analytics tell us how to act.
Data Science vs Data Analytics
Data science is a broad field that
encompasses data analytics and includes
other areas such as data engineering and
machine learning. Data scientists use
statistical and computational methods to
extract insights from data, build predictive
models, and develop new algorithms.
Data Science vs Data Analytics
Role: Data analyst vs. data scientist
• Purpose
  Data analyst: Focuses on producing insights that answer specific questions and can be put into action.
  Data scientist: Produces both broad insights by exploring the data and actionable insights that answer specific questions.
• Scope
  Data analyst: Data analytics is a broad field which includes data integration, data analysis, and data presentation.
  Data scientist: Data science is a multidisciplinary field including data engineering, computer science, statistics, machine learning, and predictive analytics, in addition to presentation of findings.
• Approach
  Data analyst: Prepares, manages, and analyzes well-defined datasets to identify trends and create visual presentations that help organizations make better, data-driven decisions.
  Data scientist: Prepares, manages, and explores large data sets, then develops custom analytical models and algorithms to produce the required business insights. Also communicates and collaborates with stakeholders to define project goals and share findings.
• Mathematics
  Data analyst: Foundational math, statistics
  Data scientist: Advanced statistics, predictive analytics
• Programming
  Data analyst: Basic fluency in R, Python, SQL
  Data scientist: Advanced object-oriented programming
• Software and tools
  Data analyst: SAS, Excel, business intelligence software
  Data scientist: Hadoop, MySQL, TensorFlow, Spark
• Other skills
  Data analyst: Analytical thinking, data visualization
  Data scientist: Machine learning, data modeling
Steps in Data Analytics
• Data collection
• Data integration
• Data management
• Data modelling
• Data analysis
• Data visualization
• Data prediction and informed decision making.
Steps in Data Analytics
• Data Collection: This stage involves gathering raw data from various sources, such as
databases, spreadsheets, websites, sensors, social media, etc. The data can be in
different formats, including structured data (e.g., tables) or unstructured data (e.g.,
text, images, audio). Sources include: Surveys and Questionnaires, Observations,
Interviews, Existing Databases and Records, Web Scraping, Sensors and IoT Devices.
Methods include: Primary, secondary, and third party.
• Data Integration: Data from different sources may have varying formats
and structures. Data integration involves combining and transforming
data to create a unified and consistent dataset that is suitable for
analysis.
• Data Management: Data management includes processes for storing,
organizing, and maintaining the data. It involves ensuring data quality,
data security, and compliance with data regulations.
• Data Modeling: In this stage, statistical and mathematical models are
created to represent patterns and relationships within the data. These
models help in gaining insights and making predictions based on the data.
Steps in Data Analytics
• Data Analysis: Data analysis involves applying various techniques and algorithms to
explore and extract meaningful patterns, trends, and insights from the data.
Descriptive, diagnostic, predictive, and prescriptive analytics are common types of
analysis used.
• Data Visualization: Data visualization is the process of presenting data in graphical or
visual formats, such as charts, graphs, dashboards, and maps. It helps in
understanding complex data patterns and communicating insights effectively.
• Prediction: Using predictive analytics, data analysts create models that forecast
future outcomes based on historical data. Predictive models can be used to
anticipate trends, behavior, and events.
• Informed Decision Making: All the insights and predictions derived from the data
analysis process are used to support decision-making processes within organizations
or individuals. Informed decisions are based on data-driven evidence rather than
intuition or assumptions.
Basic Statistical Analysis of Data
• Descriptive analysis and inferential analysis are two key branches of statistical analysis used to
understand and interpret data.
• They serve different purposes and provide distinct insights into the characteristics and relationships
within a dataset.
• Descriptive Analysis: Descriptive analysis involves summarizing and presenting the main features of a
dataset in a clear and concise manner. The goal is to provide a snapshot of the data's key
characteristics without making broader inferences or generalizations beyond the sample itself.
• Measures of central tendency: Mean, median, mode
• Measures of dispersion: Range, variance, standard deviation
• Percentiles and quartiles
• Frequency distributions and histograms
• Visualizations: bar charts, pie charts, scatter plots, box plots, etc.
• Summary statistics and graphical representations of data distribution
• Descriptive analysis is essential for understanding the basic features of the data, identifying patterns,
and detecting potential outliers or anomalies. It helps provide insights that can guide further analysis
and decision-making.
Descriptive analysis of Data
• Mean: The arithmetic average of all the data points in a dataset. It provides
a measure of central tendency.
• Median: The middle value when the data points are arranged in ascending
order. It's another measure of central tendency and is less affected by
outliers compared to the mean.
• Mode: The value that appears most frequently in the dataset. It's useful for
identifying the most common value or category.
• Range: The difference between the maximum and minimum values in the
dataset. It gives an idea of the spread of the data.
• Variance: A measure of how much the data points deviate from the mean.
It quantifies the variability or dispersion of the data.
• Standard Deviation: The square root of the variance. It provides a standard
measure of how much individual data points deviate from the mean.
• Percentiles: Values that divide the data into specific percentage segments.
The median is the 50th percentile, while quartiles divide the data into four
segments.
• These measures help provide a concise summary of the data's characteristics,
allowing analysts and researchers to quickly grasp key insights, identify potential
issues like outliers, and make informed decisions. They are fundamental tools in
exploratory data analysis and form the basis for more advanced statistical
analyses and modeling.
• EDA (exploratory data analysis) is an approach to analyzing data in order to
summarize the main characteristics of the data, gain a better understanding of
the data set, uncover relationships between different variables, and extract
important variables for the problem we're trying to solve.
Contd…
Exam 1 Scores: 85, 78, 92, 88, 76, 89, 94, 81, 87, 90
Exam 2 Scores: 79, 84, 88, 92, 75, 85, 93, 80, 86, 91
Final Exam Scores: 92, 95, 89, 78, 87, 96, 82, 88, 91, 84
Exam 1 Scores:
• Mean: (85 + 78 + 92 + 88 + 76 + 89 + 94 + 81 + 87 + 90) / 10 = 86
• Median: 87.5 (average of the two middle values after sorting the scores)
• Mode: There is no mode (no repeated values)
• Range: 94 - 76 = 18
• Variance = 32 (population variance, dividing by n = 10)
• Standard Deviation: Approximately 5.66
• IQR = 9
IQR calculation for Exam 1 Scores:
Step 1: Sort the data in ascending order:
76, 78, 81, 85, 87, 88, 89, 90, 92, 94
Step 2: Calculate the median (Q2) of the entire dataset:
Median (Q2) = (87 + 88) / 2 = 87.5
Step 3: Calculate Q1 (the median of the lower half of the data):
In this case, the lower half of the data is: 76, 78, 81, 85, 87
Q1 = 81
Step 4: Calculate Q3 (the median of the upper half of the data):
In this case, the upper half of the data is: 88, 89, 90, 92, 94
Q3 = 90
Step 5: Calculate the Interquartile Range (IQR):
IQR = Q3 - Q1 = 90 - 81 = 9
Therefore, the interquartile range (IQR) for the Exam 1 scores dataset is 9. This means
that the middle 50% of the data falls within a range of 9 points.
In a similar way we can analyze Exam 2 Scores and Final Exam Scores.
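The Exam 1 figures above can be reproduced with a short script. This is a minimal sketch using Python's standard `statistics` module; the helper `quartiles_by_halves` is a hypothetical name introduced here, and it follows the slide's "median of each half" rule for quartiles, which can differ from the interpolation rules used by other tools. The variance is the population variance (divide by n), which is what gives 32.

```python
import statistics

exam1 = [85, 78, 92, 88, 76, 89, 94, 81, 87, 90]

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Hypothetical helper: for an odd number of values the middle value is
    left out of both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    lower, upper = s[:half], s[-half:]
    return statistics.median(lower), statistics.median(upper)

mean = statistics.mean(exam1)             # arithmetic average
median = statistics.median(exam1)         # average of the two middle values
data_range = max(exam1) - min(exam1)      # max minus min
variance = statistics.pvariance(exam1)    # population variance (divide by n)
std_dev = statistics.pstdev(exam1)        # square root of the population variance
q1, q3 = quartiles_by_halves(exam1)
iqr = q3 - q1

print(mean, median, data_range, variance, round(std_dev, 2), q1, q3, iqr)
```

Swapping `pvariance`/`pstdev` for `variance`/`stdev` gives the sample versions (divide by n - 1), which are slightly larger.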
Contd…
• Skewness: A measure of the asymmetry of the distribution.
Positive skewness indicates a tail on the right side, while
negative skewness indicates a tail on the left side.
• Kurtosis: A measure of the "tailedness" of the distribution. High
kurtosis indicates heavy tails and potential outliers, while low
kurtosis indicates light tails.
• Interquartile Range (IQR): The range between the first (25th
percentile) and third (75th percentile) quartiles. It provides a
measure of data spread that is less affected by outliers.
• Frequency Distribution: A table or graph that shows how often
different values occur in the dataset. It provides an overview of
the data's distribution.
• Histogram: A graphical representation of the frequency
distribution, using bars to depict the frequency of different value
ranges.
• Box Plot (Box-and-Whisker Plot): A visual representation of the
data's summary statistics, including the median, quartiles, and
potential outliers.
Contd…
• Skewness example: 85, 78, 92, 88, 76, 89, 94, 81, 87, 90
• Mean = 86.0
• Standard Deviation ≈ 5.66 (as calculated in the previous example)
• Skewness = (1/n) * Σ[(X - Mean) / Standard Deviation]^3, where n is the
number of data points, X is each data point, Mean is the mean of the dataset,
and Standard Deviation is the standard deviation of the dataset.
• Skewness ≈ -0.447 (negative, so the longer tail is on the left)
• Kurtosis: same formula, with the power 4 instead of 3.
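The skewness formula above, and the kurtosis variant with power 4, can be evaluated directly. This is a minimal sketch that assumes the population standard deviation (dividing by n), matching the 5.66 from the previous example; `standardized_moment` is a hypothetical helper name. Evaluated on the Exam 1 scores, the formula gives about -0.447 for skewness and about 2.0 for kurtosis.

```python
import math

scores = [85, 78, 92, 88, 76, 89, 94, 81, 87, 90]

n = len(scores)
mean = sum(scores) / n
# Population standard deviation (divide by n), as in the worked example.
std_dev = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)

def standardized_moment(values, center, scale, power):
    """(1/n) * sum(((x - center) / scale) ** power); hypothetical helper."""
    return sum(((x - center) / scale) ** power for x in values) / len(values)

skewness = standardized_moment(scores, mean, std_dev, 3)  # power 3
kurtosis = standardized_moment(scores, mean, std_dev, 4)  # power 4 (not "excess" kurtosis)

print(round(skewness, 4), round(kurtosis, 4))
```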
Inferential analysis of Data
• Inferential analysis involves drawing conclusions or making predictions about a population
based on a sample of data. The goal is to use the information obtained from the sample to
make broader inferences or predictions about a larger population. Inferential analysis
techniques include:
• Hypothesis testing: Evaluating whether observed differences or relationships in the sample are
likely to exist in the population.
• Confidence intervals: Estimating a range of values within which a population parameter is likely
to fall.
• Regression analysis: Modeling relationships between variables and making predictions based
on the model.
• Analysis of variance (ANOVA): Comparing means across different groups to determine if
differences are statistically significant.
• Probability distributions and statistical inference
• Inferential analysis is used to make predictions, test hypotheses, and make informed decisions
based on the available data. It involves assessing the likelihood that observed results are not
due to random chance and can provide valuable insights into the underlying population.
Regression Analysis
• The goal of regression is to predict one variable based on
another variable.
• Usually we try to predict ‘y’ based on ‘x’.
• y: explained / response / dependent variable
• x: explanatory / predictor / independent variable
• Given a collection of points (xi, yi), we can plot them on a scatter
plot and fit a line through them: y = a + b·x
• a: intercept, b: slope
• If we can find the values of a and b, then for any new ‘x’ we
can predict the value of ‘y’.
Univariate Regression
• If x ∈ ℝ (x is a real-valued random variable), we have univariate regression / simple
linear regression. E.g.: y = wages, x = number of hours worked.
• y = a + b·x + ε (a: intercept, b: slope, ε: error)
• For example, suppose that height was the only determinant of body
weight. If we were to plot body weight (the dependent or 'outcome'
variable) as a function of height (the independent or 'predictor'
variable), we might see a very linear relationship.
Formula: Slope and intercept
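For reference, the usual least-squares estimates of the slope b and intercept a for the line y = a + b·x are given below. The notation is an assumption made here rather than taken from the slide: x̄ and ȳ are the sample means and n is the number of (xi, yi) pairs.

```latex
\[
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
\]
```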
Example 1: Find the monthly e-commerce sales (in 1000s) if
the amount spent on online advertising is 2.5 (in 1000s of dollars).
Hint: Y = 125.8 + 171.5*X

Online Store | Monthly E-commerce Sales (in 1000s) | Online Advertising Dollars (in 1000s)
1 | 368 | 1.7
2 | 340 | 1.5
3 | 665 | 2.8
4 | 954 | 5
5 | 331 | 1.3
6 | 556 | 2.2
7 | 376 | 1.3
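The hint line for Example 1 can be checked by fitting the seven rows with the least-squares slope and intercept. This is a minimal sketch in plain Python; `fit_line` is a hypothetical helper name. The fitted coefficients round to the hint (125.8 and 171.5), and the prediction at X = 2.5 is about 554.5, i.e. roughly 554,500 in sales.

```python
ad_spend = [1.7, 1.5, 2.8, 5.0, 1.3, 2.2, 1.3]   # online advertising, in 1000s of dollars
sales = [368, 340, 665, 954, 331, 556, 376]      # monthly e-commerce sales, in 1000s

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x (hypothetical helper)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

a, b = fit_line(ad_spend, sales)
predicted_sales = a + b * 2.5   # prediction for 2.5 (1000s) of advertising

print(round(a, 1), round(b, 1), round(predicted_sales, 1))
```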
Example 2: Find the car price if the
car's age is 20 years.
Hint: Y = 7836 – 502.4*X

Car Age (in years) | Price (in dollars)
4 | 6300
4 | 5800
5 | 5700
5 | 4500
7 | 4500
7 | 4200
8 | 4100
9 | 3100
10 | 2100
11 | 2500
12 | 2200
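Plugging an age of 20 into the given line yields a negative price, because 20 lies well outside the observed ages (4 to 12 years) and the straight line is being extrapolated. The sketch below uses the coefficients from the example as given; `predict_price` and `is_extrapolation` are hypothetical helper names introduced for illustration.

```python
car_age = [4, 4, 5, 5, 7, 7, 8, 9, 10, 11, 12]   # observed ages, in years

INTERCEPT = 7836.0   # from the example: Y = 7836 - 502.4 * X
SLOPE = -502.4

def predict_price(age):
    """Price predicted by the fitted line, in dollars."""
    return INTERCEPT + SLOPE * age

def is_extrapolation(age, observed_ages):
    """True when the query age lies outside the range the line was fitted on."""
    return age < min(observed_ages) or age > max(observed_ages)

price_at_20 = predict_price(20)                  # 7836 - 10048, a negative "price"
outside_range = is_extrapolation(20, car_age)    # 20 is beyond the oldest car (12)

print(round(price_at_20, 1), outside_range)
```

A linear fit is only a reasonable description inside (or near) the range of the data; here the sensible reading is that the model cannot be trusted at age 20.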
Multivariate Regression
• If x ∈ ℝ^d with d ≥ 2, we have multivariate regression / multiple linear
regression. E.g.: y = wages, x1 = number of hours worked, x2 = experience,
x3 = gender.
• Y = a + b1·x1 + b2·x2 + … + bd·xd + ε
Y = a + Σ bi·xi + ε (a: intercept, bi: slopes, ε: error)
Contd..
• Take the ideal gas law as an example: PV = nRT (P: pressure, V:
volume, n: amount of gas in moles, R: universal gas constant, T: temperature).
• log(PV) = log(nRT)
• If we assume the amount of gas n is constant (R is always constant), the equation becomes:
• log(P) + log(V) = log(nR) + log(T)
• log(P) = a − log(V) + log(T) // a = log(nR) (constant)
• log(P) = a + b1·log(V) + b2·log(T): a multivariate linear regression (log pressure
depends linearly on log volume and log temperature), with b1 = −1 and b2 = 1.
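The log-linear form of the gas law can be checked numerically: generating pressures from PV = nRT and regressing log(P) on log(V) and log(T) should recover coefficients of -1 and +1. This is a minimal sketch on synthetic, noise-free data; the values of n and R, the grid of volumes and temperatures, and the helper `solve_normal_equations` are all assumptions made for illustration.

```python
import math

N_R = 2.0 * 8.314   # assumed: n = 2 mol, R = 8.314 J/(mol*K)

# Synthetic (volume, temperature) grid; pressures follow PV = nRT exactly.
samples = [(v, t) for v in (0.01, 0.02, 0.05, 0.1) for t in (250.0, 300.0, 350.0)]

# Design rows [1, log V, log T] and response log P.
X = [[1.0, math.log(v), math.log(t)] for v, t in samples]
y = [math.log(N_R * t / v) for v, t in samples]

def solve_normal_equations(X, y):
    """Solve (X^T X) beta = X^T y by Gauss-Jordan elimination with pivoting."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

a, b1, b2 = solve_normal_equations(X, y)
print(round(a, 4), round(b1, 4), round(b2, 4))   # a = log(nR), b1 = -1, b2 = 1 up to rounding
```

With real measurements the coefficients would only be close to -1 and +1, and the scatter around the fitted plane would show up as residuals.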
Contd…
Linear Regression: Mathematical Model
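For reference, the general linear regression model and the least-squares estimate of β named in the syllabus are commonly written in matrix form as below. The notation is an assumption made here: y is the n-vector of responses, X is the n × (d+1) design matrix whose first column is all ones, β = (a, b1, …, bd) collects the intercept and slopes, ε is the error vector, and XᵀX is assumed to be invertible.

```latex
\[
y = X\beta + \varepsilon,
\qquad
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y,
\qquad
\hat{y} = X\hat{\beta},
\qquad
e = y - \hat{y}
\]
```

For a single predictor (d = 1) this reduces to the simple-regression slope and intercept.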
Error analysis
• Comparing model predictions against reality
• Since our model will produce an output given any input or set of
inputs, we can then check these estimated outputs against the actual
values that we tried to predict.
• We call the difference between the actual value and the model’s
estimate a residual. We can calculate the residual for every point in
our data set, and each of these residuals will be of use in assessment.
• The vertical distance between any one data point $y_i$ and its estimated
value $\hat{y}_i$ is its observed "residual": $e_i = y_i - \hat{y}_i$
Residual Analysis
• Error caused by the independent variables in the regression model.
• Error caused by omitted variables (the residual).
We also define the degrees of freedom dfT, dfReg, dfRes, the sums of
squares SST, SSReg, SSRes and the mean squares MST, MSReg, MSRes
for the total, regression, and residual components.
Note:
df of a model = number of independent variables used;
df of the model error = number of samples − number of independent
variables − 1 (for a model with an intercept).
Note: Models that fit the data well have r² near 1.
Models that fit the data poorly have r² near 0.
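For reference, these quantities are commonly defined as below. This is a sketch in standard notation, assumed here rather than recovered from the slide: n is the number of samples, k the number of independent variables, ȳ the mean of the observed values, ŷ_i the fitted values, and the model is assumed to include an intercept.

```latex
\[
SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SS_{Reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SS_T = SS_{Reg} + SS_{Res}
\]
\[
df_T = n - 1, \qquad df_{Reg} = k, \qquad df_{Res} = n - k - 1
\]
\[
MS_T = \frac{SS_T}{df_T}, \qquad
MS_{Reg} = \frac{SS_{Reg}}{df_{Reg}}, \qquad
MS_{Res} = \frac{SS_{Res}}{df_{Res}}, \qquad
r^2 = \frac{SS_{Reg}}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}
\]
```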
Error analysis: Example
Data points: (2,2), (3,4), (4,6), (6,7)
Contd…
r² ≈ 0.895 tells us that about 89.5 percent of the variation in the dependent
variable is explained by (can be predicted from) the value of the
independent variable.
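The 89.5 percent figure can be reproduced from the four points shown in the example. This is a minimal sketch that assumes those points, (2,2), (3,4), (4,6) and (6,7), are the full data set; it fits the least-squares line, computes the residuals, and splits the total sum of squares into regression and residual parts.

```python
points = [(2, 2), (3, 4), (4, 6), (6, 7)]   # assumed to be the example's data
xs = [x for x, _ in points]
ys = [y for _, y in points]
n = len(points)

x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Least-squares slope and intercept.
slope = (sum((x - x_mean) * (y - y_mean) for x, y in points)
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]      # actual minus estimate

ss_total = sum((y - y_mean) ** 2 for y in ys)        # total variation
ss_reg = sum((f - y_mean) ** 2 for f in fitted)      # explained by the line
ss_res = sum(e ** 2 for e in residuals)              # left in the residuals

r_squared = ss_reg / ss_total

print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

The residuals of a least-squares line with an intercept sum to zero, which is a quick sanity check on the fit.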
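Other Error Estimates
(with $y_i$ the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of observations)
• Mean Absolute Error: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$
• Mean Square Error: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• Root Mean Square Error: $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$; estimates the typical deviation of the
actual y-values from the regression line.
• Mean Absolute Percentage Error: $\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$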
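The four error measures can be computed side by side from actual and predicted values. This is a minimal sketch using the standard definitions; the `actual` and `predicted` numbers are hypothetical values chosen so the arithmetic is easy to follow, not data from the slides.

```python
import math

actual = [100.0, 200.0, 300.0, 400.0]      # hypothetical observed values
predicted = [110.0, 190.0, 330.0, 380.0]   # hypothetical model outputs

errors = [a - p for a, p in zip(actual, predicted)]   # residuals: actual - predicted
n = len(errors)

mae = sum(abs(e) for e in errors) / n                 # mean absolute error
mse = sum(e ** 2 for e in errors) / n                 # mean square error
rmse = math.sqrt(mse)                                 # root mean square error
# Mean absolute percentage error; requires every actual value to be non-zero.
mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n

print(mae, mse, round(rmse, 2), round(mape, 2))
```

MSE and RMSE penalize large errors more heavily than MAE, while MAPE expresses the error relative to the size of each actual value.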