
Introduction to Data Modeling Basics

This document provides an introduction to statistics, focusing on data modeling, parameter estimation, and descriptive statistics. It explains key concepts such as point and interval estimation, methods for estimating parameters like Maximum Likelihood Estimation and Method of Moments, and various measures of central tendency, spread, and shape. Additionally, it discusses the importance of graphical representations in making data more comprehensible.

Unit 1 Introduction to Data Modeling

STATISTICS
Basics of Statistics
Statistics is a branch of mathematics that deals with collecting, organizing, and
analyzing numerical data to solve real-world problems. It is widely regarded as a distinct
scientific discipline due to its vast applications across various fields.
● Helps make sense of complex data through quantitative models.
● Plays a critical role in decision-making in fields like weather forecasting, stock
market analysis, insurance, and data science.
PARAMETER ESTIMATION
● Parameter estimation is all about figuring out the unknown
values in a mathematical model based on data we have
collected.
● In parameter estimation, we use sample data to estimate
the characteristics (parameters) of a larger population. A
parameter is a measurable characteristic of a population
(such as the population mean, variance, or proportion).
● Since we usually cannot collect data from the entire
population, we rely on samples to estimate these
parameters.
● The goal of parameter estimation is to provide the best
possible estimate of these values, given the data we have.
● Parameter estimation involves creating mathematical
models that represent real-world processes.
● The parameters are the variables within these models that
need to be estimated from data. For instance, in a simple
linear regression model, the slope and intercept are
parameters that can be estimated using data points.
PARAMETER ESTIMATION
Types of Parameter Estimation
● Point Estimation
● Interval Estimation

Point Estimation

• Point estimation provides a single best guess of a parameter's
value. The result is one number that is considered the best
approximation of the population parameter based on the
sample data.
• For example, if we want to estimate the average height of
students in a school, the sample mean (average) we
calculate from a group of students serves as the point
estimate of the population mean.
PARAMETER ESTIMATION
Interval Estimation

• Interval estimation provides a range of values within which
the true parameter likely falls. This is more informative than
point estimation because it includes a measure of
uncertainty.
• For instance, if we estimate that the average height of
students in a school is between 150 cm and 160 cm with
95% confidence, it means that the interval was produced by a
procedure that captures the true population mean in about
95% of repeated samples.
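As a minimal sketch of the two kinds of estimate, the snippet below computes a point estimate (the sample mean) and an approximate 95% interval for it. The height values are hypothetical, and the interval assumes a normal approximation with z of about 1.96; for a sample this small a t-based interval would usually be somewhat wider.

```python
import math
import statistics

# Hypothetical sample of student heights in cm (illustrative values only).
heights = [152, 158, 149, 161, 155, 157, 150, 163, 154, 156]

n = len(heights)
point_estimate = statistics.mean(heights)      # point estimate: sample mean
s = statistics.stdev(heights)                  # sample standard deviation (divisor n - 1)
standard_error = s / math.sqrt(n)

# Assumption: normal approximation, so z is the 97.5% quantile of N(0, 1).
z = statistics.NormalDist().inv_cdf(0.975)
lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error

print(f"point estimate: {point_estimate:.1f} cm")
print(f"95% interval:   {lower:.1f} cm to {upper:.1f} cm")
```

The point estimate is a single number, while the interval widens as the sample becomes more variable or smaller.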
PARAMETER ESTIMATION
There are various methods for estimating parameters, each suited for different
types of data and situations. Here, we will discuss two common methods:
● Maximum Likelihood Estimation (MLE)
● Method of Moments

Maximum Likelihood Estimation (MLE)

● MLE is one of the most popular and widely used methods of parameter
estimation. The idea behind MLE is to find the parameter values that maximize
the likelihood function, which represents the probability of observing the given
sample data.
● The parameter value that maximizes the likelihood function is considered the
best estimate. MLE works well for large sample sizes and has desirable
statistical properties, such as being consistent (the estimate gets closer to the
true value as the sample size increases).
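The idea can be sketched for one assumed model, a normal distribution with unknown mean and variance. The data values are hypothetical. For this model the maximizing values are known in closed form (the sample mean and the variance with divisor n), and the snippet checks that nearby parameter values give a lower log-likelihood.

```python
import math

# Hypothetical observations (illustrative values only).
data = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7]
n = len(data)

def log_likelihood(mu, sigma2):
    """Log-likelihood of the data under an assumed normal(mu, sigma2) model."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
        for x in data
    )

# Closed-form MLE for the normal model: the sample mean and the
# variance computed with divisor n (not n - 1).
mu_hat = sum(data) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n

best = log_likelihood(mu_hat, sigma2_hat)

# Nearby parameter values should give a lower log-likelihood.
for mu, sigma2 in [(mu_hat + 0.2, sigma2_hat), (mu_hat, sigma2_hat * 2)]:
    print(log_likelihood(mu, sigma2) < best)
```

For models without a closed-form solution, the same log-likelihood function would typically be maximized numerically.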
PARAMETER ESTIMATION
Method of Moments

• The method of moments is a simpler, less computationally intensive
approach than MLE. It is based on the idea that sample moments (such as
the sample mean, variance, etc.) can be used to estimate the population
moments (such as the population mean, variance, etc.).
• While often not as precise as MLE, the method of moments is often easier to
apply, especially for small datasets.
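A minimal sketch of the method of moments, assuming a gamma model chosen purely for illustration (the source does not name a distribution). The sample mean and variance are set equal to the model's mean and variance, and the resulting equations are solved for the two parameters. The data values are hypothetical.

```python
import statistics

# Hypothetical positive-valued observations (illustrative values only).
data = [2.1, 3.4, 1.8, 4.0, 2.9, 3.1, 2.5, 3.8]

# Sample moments.
m = statistics.fmean(data)        # sample mean
v = statistics.pvariance(data)    # sample variance with divisor n

# Assumed model: gamma(shape k, scale theta), for which
#   population mean     = k * theta
#   population variance = k * theta**2
# Equating these to the sample moments and solving gives:
theta_hat = v / m
k_hat = m * m / v

print(f"shape = {k_hat:.3f}, scale = {theta_hat:.3f}")
```

No optimization is needed here, which is why the method is often easier to apply than MLE.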
Descriptive Statistics

❑ Methods of describing the characteristics of a data set.
❑ Useful because they allow you to make sense of the data.
❑ Help in exploring the data and drawing conclusions from it in
order to make rational decisions.
❑ Include calculating things such as the average of the data,
its spread and the shape it produces.
Descriptive Statistics

❑ Descriptive statistics involves describing, summarizing
and organizing the data so it can be easily understood.
❑ Graphical displays are often used along with the
quantitative measures to enable clarity of communication.
Describing Data
• Qualitative data-
the variables which yield non-numerical data.
– E.g.- education, marital status, eye colour
– Frequency- number of observations falling into a particular
class/ category of the qualitative variable.
– Frequency distribution- table listing all classes & their
frequencies.
– Graphical representation- Pie chart, Bar graph.


Describing Data
• Quantitative data-
– Can be presented by a frequency distribution.
– If the discrete variable has a lot of different values, or if the data is a
continuous variable, then the data can be grouped into classes/
categories.
– Class intervals / BINS- together cover the range between the maximum &
minimum values.
– Class limits- end points of a class interval.
– Class frequency- number of observations in the data that belong
to each class interval.
– Usually presented as a Histogram or a Bar graph.
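The grouping of quantitative data into class intervals can be sketched as follows. The data values and the choice of four equal-width classes are assumptions made for illustration only.

```python
# Hypothetical continuous measurements (illustrative values only).
data = [12.5, 17.1, 23.4, 8.9, 15.0, 21.7, 19.3, 11.2, 26.8, 14.6, 18.4, 22.0]

num_classes = 4                        # assumed number of class intervals
low, high = min(data), max(data)
width = (high - low) / num_classes     # equal-width classes

frequencies = [0] * num_classes
for x in data:
    # Index of the class containing x; the maximum value goes in the last class.
    i = min(int((x - low) / width), num_classes - 1)
    frequencies[i] += 1

# Print the frequency distribution: class limits and class frequency.
for i, f in enumerate(frequencies):
    lower = low + i * width
    upper = lower + width
    print(f"{lower:6.2f} to {upper:6.2f}: {f}")
```

Each printed row gives the class limits and the class frequency, which is the table a histogram is drawn from.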


Frequency Distribution and Histogram
Descriptive Statistics
❑ When analyzing a graphical display, you can draw
conclusions based on several characteristics of the graph.
❑ You may ask questions such as:
• Where is the approximate middle, or center, of the graph?
• How spread out are the data values on the graph?
• What is the overall shape of the graph?
• Does it have any interesting patterns?
Descriptive Statistics
The following measures are used to describe a data set:
❑ Measures of position (also referred to as central tendency
or location measures).
❑ Measures of spread (also referred to as variability or
dispersion measures).
❑ Measures of shape.
Descriptive Statistics
❑ If assignable causes of variation are affecting the process,
we will see changes in:
• Position.
• Spread.
• Shape.
• Any combination of the three.
Properties of Numerical Data & Measures

Central tendency    Dispersion            Shape
Mean                Range                 Skewness
Median              Standard Deviation    Kurtosis
Mode
Descriptive Statistics
Measures of Position:
❑ Position statistics measure the central tendency of the data.
❑ Central tendency refers to where the data is centered.
❑ You may have calculated an average of some kind.
❑ Despite the common use of average, there are different
statistics by which we can describe the average of a data
set:
• Mean.
• Median.
• Mode.
Measures of center
❑ Central tendency- In any distribution, the majority of the observations pile
up, or cluster, around a particular region.
❑ Mean- sum of the observed values in a data set divided by the
number of observations.
❑ Median- observation in the data set that divides the data set into half.
❑ Mode- value of the data set which occurs with the greatest frequency.
❑ Mean & Median can be applied only to Quantitative data.
❑ Mode can be used for either Qualitative or Quantitative data.
❑ Outlier- observation that falls far from the rest of the data. The mean
gets highly influenced by outliers.
Descriptive Statistics
Mean:
❑ The total of all the values divided by the size of the data set.
❑ It is the most commonly used statistic of position.
❑ It is easy to understand and calculate.
❑ It works well when the distribution is symmetric and there
are no outliers.
❑ The mean of a sample is denoted by ‘x-bar’.
❑ The mean of a population is denoted by ‘μ’.

[Figure: the mean marked on a number line from 0 to 9]
Descriptive Statistics
Median:
❑ The middle value where exactly half of the data values are
above it and half are below it.
❑ Less widely used.
❑ A useful statistic due to its robustness.
❑ It can reduce the effect of outliers.
❑ Often used when the data is nonsymmetrical.
❑ Ensure that the values are ordered before calculation.
❑ With an even number of values, the median is the mean of
the two middle values.
[Figure: the median and the mean marked on a number line from 0 to 9]
Descriptive Statistics
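Median Calculation:
Even number of values: 23, 33, 34, 36, 38, 40, 41, 41, 44, 45
Median = (38 + 40) / 2 = 39
Odd number of values: 12, 30, 31, 37, 38, 40, 41, 41, 44
Median = 38 (the middle value)

Example: 1, 2, 1, 1, 3, 4, 100
Mean = 16
Median = 2
Mode = 1

Assume 100 is an outlier and remove it:
Mean = 2
Median = 1.5
Mode = 1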
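The figures in this example can be checked with Python's standard `statistics` module. This is only a quick verification sketch using the values from the slide.

```python
import statistics

values = [1, 2, 1, 1, 3, 4, 100]
print(statistics.mean(values), statistics.median(values), statistics.mode(values))

# Drop the value treated as an outlier and recompute.
trimmed = [v for v in values if v != 100]
print(statistics.mean(trimmed), statistics.median(trimmed), statistics.mode(trimmed))

# Even number of values: the median is the mean of the two middle values.
even = [23, 33, 34, 36, 38, 40, 41, 41, 44, 45]
print(statistics.median(even))
```

Removing the single outlier moves the mean from 16 to 2, while the median and mode barely change.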
Descriptive Statistics
❑ Why can the mean and median be
different?

[Figure: the median and the mean marked at different positions on a number line from 0 to 9]
Descriptive Statistics
Mode:
❑ The value that occurs the most often in a data set.
❑ It is rarely used as a central tendency measure.
❑ It is more useful for distinguishing between unimodal
and multimodal distributions.
• A distribution is multimodal when the data has more than one peak.
Normal distribution
❑ Bell shaped symmetric distribution.
❑ Why is it important?
❑ Many things are normally distributed, or very close to it.
❑ It is easy to work with mathematically
❑ Most inferential statistical methods make use of properties of
the normal distribution.

❑ Mean = Median = Mode


Descriptive Statistics
Measures of Spread:
❑ The Spread refers to how the data deviates from the
position measure.
❑ It gives an indication of the amount of variation in the
process.
• An important indicator of quality.
• Used to control process variability and improve quality.
❑ All manufacturing and transactional processes are variable
to some degree.
❑ There are different statistics by which we can describe the
spread of a data set:
• Range.
• Standard deviation.
❑ Range- difference between the largest observed value in the data set
and the smallest one.
❑ Because only these two values are used, a great deal of information
is ignored.
❑ Standard deviation- a kind of average deviation of the observed
values from the mean of the variable (the square root of the mean
squared deviation).
❑ It is defined using the sample mean & its value gets strongly
affected by a few extreme observations.
❑ Variance- square of standard deviation


Descriptive Statistics
Standard Deviation:
❑ Roughly, the typical distance of the data points from their own
mean.
❑ A low standard deviation indicates that the data points
are clustered around the mean.
❑ A large standard deviation indicates that they are
widely scattered around the mean.
❑ The standard deviation of a sample
is denoted by ‘s’.
❑ The standard deviation of a
population is denoted by ‘σ’.
Descriptive Statistics
Standard Deviation:
❑ Perceived as difficult to understand because it is not easy
to picture what it is.
❑ It is however a more informative measure of variability than the
range, because it uses every data value.
❑ Standard deviation is computed as follows:

s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}

s = standard deviation
\bar{x} = mean (x-bar)
x = values of the data set
n = size of the data set
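The sample standard deviation (sum of squared deviations from the mean, divided by n - 1, then square-rooted) can be sketched step by step as below. The data set is hypothetical, and the final line compares the result with the standard library's `statistics.stdev`.

```python
import math
import statistics

data = [1, 2, 1, 1, 3, 4]   # hypothetical data set

n = len(data)
x_bar = sum(data) / n                              # mean (x-bar)
sum_sq = sum((x - x_bar) ** 2 for x in data)       # sum of squared deviations
s = math.sqrt(sum_sq / (n - 1))                    # sample standard deviation
variance = s ** 2                                  # variance is the square of s

print(s, variance)
print(math.isclose(s, statistics.stdev(data)))     # agrees with the library
```

The divisor n - 1 applies to a sample; for a whole population the divisor would be n.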
Descriptive Statistics
Range:
❑ The difference between the highest and the lowest values.
❑ The simplest measure of variability.
❑ Often denoted by ‘R’.
❑ It is good enough in many practical cases.
❑ It does not make full use of the available data.
❑ It can be misleading when the data is skewed or in the
presence of outliers.
• Just one outlier will increase the range dramatically.
[Figure: the range marked on a number line from 0 to 9]
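A short sketch of how a single outlier affects the range. The values are hypothetical, and `value_range` is an illustrative helper name, not a library function.

```python
def value_range(xs):
    """Range R: the difference between the highest and lowest values."""
    return max(xs) - min(xs)

values = [1, 2, 1, 1, 3, 4]        # hypothetical data
with_outlier = values + [100]      # the same data plus one extreme value

print(value_range(values))
print(value_range(with_outlier))
```

Only the two extreme values enter the calculation, so one unusual observation determines the result.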
Descriptive Statistics
Measures of Shape:
❑ Data can be plotted into a histogram to have a general idea
of its shape, or distribution.
❑ The shape can reveal a lot of information about the data.
Shape
❑ Skewness- Lack of symmetry in a distribution. It can be interpreted
from the frequency distribution.
❑ Properties-
❑ Mean, median & mode fall at different points.
❑ Curve is not symmetrical but stretched more to one side.
❑ Distribution may be positively or negatively skewed. The limits
for Pearson's coefficient of skewness are ± 3.
❑ Kurtosis- convexity of a curve.
❑ Gives an idea about the flatness/ peakedness of the curve.
❑ Gives an idea about how much weight lies in the tails of the
distribution.
Descriptive Statistics
Measures of Shape:
❑ It may be symmetrical or nonsymmetrical.
❑ In a symmetrical distribution, the two sides of the
distribution are a mirror image of each other.
❑ Examples of symmetrical distributions include:
• Uniform.
• Normal.
• Camel-back.
Descriptive Statistics
Measures of Shape:
❑ The shape helps identify which descriptive statistic is
more appropriate to use in a given situation.
❑ If the data is symmetrical, then we may use the mean or
median to measure the central tendency as they are almost
equal.
❑ If the data is skewed, then the median will be a
more appropriate measure of the central tendency.
❑ Two common statistics that measure the shape of the data:
• Skewness.
• Kurtosis.
Descriptive Statistics
Skewness:
❑ Describes whether the data is distributed symmetrically
around the mean.
❑ A skewness value of zero indicates perfect symmetry.
❑ A negative value implies left-skewed data.
❑ A positive value implies right-skewed data.
[Figure: two skewed distributions, one with positive skew (SK > 0) and one with negative skew (SK < 0)]
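The sign convention can be illustrated with a moment-based skewness coefficient. This is one common definition, assumed here because the slides do not state which formula is intended; the data sets are hypothetical.

```python
def skewness(xs):
    """Moment-based skewness: m3 / m2**1.5, using divisor n for both moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 1, 2, 2, 3, 10]              # long tail to the right
left_skewed = [-x for x in right_skewed]        # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
print(skewness(symmetric))      # zero
```

Mirroring the data flips the sign of the coefficient, which matches the left-skewed and right-skewed cases described above.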
Descriptive Statistics
Kurtosis:
❑ Measures the degree of flatness (or peakedness) of the shape.
❑ When the data values are clustered around the middle, then
the distribution is more peaked.
• A greater kurtosis value.
❑ When the data values are spread around more evenly, then
the distribution is flatter.
• A smaller kurtosis value.

[Figure: three distributions: platykurtic (-), mesokurtic (0) and leptokurtic (+)]
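The (-), (0), (+) labelling corresponds to excess kurtosis, which is zero for a normal distribution. A minimal sketch using the moment-based definition, assumed here because the slides give no formula; the two data sets are hypothetical.

```python
def excess_kurtosis(xs):
    """Moment-based excess kurtosis: m4 / m2**2 - 3, using divisor n."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # values spread evenly
peaked = [1, 5, 5, 5, 5, 5, 5, 5, 9]    # values clustered in the middle, with tails

print(excess_kurtosis(flat))     # negative: platykurtic
print(excess_kurtosis(peaked))   # positive: leptokurtic
```

Subtracting 3 is what places a normal distribution at zero (mesokurtic).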
Descriptive Statistics
Further Information:
❑ Variance is a measure of the variation around the mean.
❑ It measures how far a set of data points are spread out
from their mean.
❑ The units are the square of the units used for the original
data.
• For example, a variable measured in meters will have a
variance measured in meters squared.
❑ It is the square of the standard deviation: Variance = s^2
PRESENTATION OF DATA

• Classification and tabulation reduce the
complexity of vast and complicated statistical
data, but it is still not easy to interpret the
tabulated data.
• Diagrams and graphs catch the eye more
easily than tables, which present arrays of
figures.
• A glance over a graph or diagram will enable any
layman (without statistical knowledge) to get an
idea about the essential characteristics of the
tabulated data without much strain or effort.
FUNCTIONS OF DIAGRAMS AND GRAPHS

• They attract the attention of a large number of
persons.
• They carry a “bird's-eye view” impression in the
human mind.
• They save a lot of valuable time when data is presented in the form
of suitable charts & graphs instead of pages of
numerical figures.
• They facilitate comparison between two or more
sets of data.
• Prediction equations can be represented by
graphs, and these are very helpful in
forecasting.
LIMITATIONS OF DIAGRAMS AND
GRAPHS
• They are approximate indicators.
• Exact and accurate information can be
obtained only from the original tabular data.
• They cannot substitute for the tabular information.
• They fail to disclose small difference when large
figures are involved.
GRAPHICAL REPRESENTATION OF
DATA
Graphical representation is done when the data
are classified in the form of a frequency
distribution. The different graphs are
• Histogram
• Frequency Polygon
• Frequency Curve
• Ogive
• Lorenz Curve
Histogram
• It is a vertical bar diagram without gaps between the
bars.
• It consists of bars erected over the true class intervals,
their areas being proportional to the frequencies of the
respective classes.
• When the intervals are of equal width, the height of
each bar serves as a measure of the corresponding
frequency.
• To locate the mode, draw two diagonals inside the tallest
(modal class) rectangle: one from its top-right corner to the
top-right corner of the pre-modal rectangle, and one from its
top-left corner to the top-left corner of the post-modal
rectangle. The x co-ordinate of their point of
intersection is the mode.
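The diagonal construction for the mode has an algebraic equivalent for equal-width classes: mode = L + (f1 - f0) / (2·f1 - f0 - f2) × h, where L is the lower limit of the modal class, h the class width, and f0, f1, f2 the pre-modal, modal and post-modal frequencies. The sketch below applies it to a hypothetical frequency distribution; the symbol names are chosen here for illustration.

```python
# Hypothetical frequency distribution with equal class width (illustrative only).
class_lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 12, 20, 9, 4]
width = 10

i = frequencies.index(max(frequencies))                       # modal class
f1 = frequencies[i]                                           # modal frequency
f0 = frequencies[i - 1] if i > 0 else 0                       # pre-modal frequency
f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0    # post-modal frequency
lower_limit = class_lower_limits[i]

# x co-ordinate where the two diagonals of the histogram intersect.
mode = lower_limit + (f1 - f0) / (2 * f1 - f0 - f2) * width
print(round(mode, 2))
```

The result always lies inside the modal class and is pulled towards whichever neighbouring class has the higher frequency.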
Frequency Polygon
• If points are plotted with the x co-ordinate equal to the
mid value of the class intervals and the corresponding frequencies as
the y co-ordinate, and these points are joined by
straight lines, we obtain a frequency polygon.
• These points are the midpoints of the tops of the bars in the
histogram.
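The points of a frequency polygon can be computed directly from the class limits and frequencies. A small sketch with hypothetical classes; drawing the lines themselves would need a plotting library, which is not assumed here.

```python
# Hypothetical class intervals and their frequencies (illustrative only).
class_limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [5, 12, 20, 9, 4]

# Each point is (mid value of the class interval, class frequency).
points = [((lo + hi) / 2, f) for (lo, hi), f in zip(class_limits, frequencies)]

for x, y in points:
    print(x, y)
```

Joining consecutive points with straight lines gives the frequency polygon.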
Frequency Curve
• If points are plotted with the x co-ordinate
equal to the mid value of the class intervals
and the corresponding frequencies as the y
co-ordinate, and these points are joined by
means of a smooth curve, then we get
a frequency curve.
• A frequency curve is a smooth curve which
corresponds to the limiting case of a
histogram computed for a frequency
distribution of a continuous distribution as the
number of data points becomes very large.
Difference Between Frequency Curve
And Frequency Polygon

• The main difference between a frequency
polygon and a frequency curve is that a
frequency polygon is drawn by connecting
points with straight lines, whereas a
frequency curve is drawn by connecting points
with a smooth curve.
DIAGRAMMATIC REPRESENTATION
OF DATA
• Points to be followed in drawing a diagram
• For each diagram, a suitable short heading
should be given.
• It should be drawn to exhibit the statistical
matter clearly. A suitable scale, chosen according to
the space available, should allow its significant
features to be shown clearly.
• The diagram should be drawn accurately with
the help of drawing instruments.
• Colouring and different markings should be done
with pencil or with colours.
• Different colours, marks or dottings are used
to show different items. In such cases a legend
should be given for each column and the item it refers to.
In doing so, we should see that the visual
impression conveyed by the diagram is not in any
way affected.
• The original data on which the diagram has been
based should be given, if necessary facing the
diagram, as this will help the observer to see
the details with clarity.
• Reference to the source of the table should be
provided.
Types of diagram
• One dimensional diagram
– Line diagram
– Bar Diagram
• Two dimensional (or) Area Diagram
– Pie diagram
– Square diagram and rectangle diagrams
• Three dimensional (or) Volume diagrams
– Cubes
– Spheres, Cylinders etc.
• Pictogram
– Actual pictures
ONE DIMENSIONAL DIAGRAM
Line diagram consisting of curves and lines as well as bars
• Line diagram
• This requires vertical lines to be drawn at equal
intervals, each of length proportional to the
magnitude of the variable for the different items.
• The lines have no width, and hence the diagram has a very
poor visual effect.
• It makes comparison easy although it is less
attractive.
• Bar Diagram
• It is the simplest of all statistical diagrams.
• It consists of bars of equal width (all horizontal or
vertical) standing on a common base line at
equal intervals, the length of the bars being
proportional to the magnitude of the variable
for different items.
[Figure: line diagram and bar diagram]
BAR DIAGRAM

❖ Sub-divided bar diagram or component bar diagram

• Sometimes the variable is capable of being
sub-divided into two or more component parts, each
representing a sub-variable.
• In this case, all the bars are subdivided by lines in
the same order so that each subdivision
represents its part in magnitude on the same
scale.
• They are properly coloured or marked differently for
visual guidance.
[Figure: sub-divided (component) bar diagram]
Superimposed or Multiple bar diagram

• Bars may sometimes be superimposed
for comparative purposes.
• A multiple bar graph shows the relationship
between different values of data.
• Each data value is represented by a column in
the graph. In a multiple bar graph, multiple
data points for each category of data are
shown with the addition of columns.
• These are also used for two or more sets
of interrelated data.
[Figure: superimposed (multiple) bar diagram]
❖ Percentage bar diagram

• When the component parts are expressed
in percentages of the whole, the resulting bar
diagram is called a percentage bar diagram.
• In this case all the bars are of equal length.
• In a percentage bar chart, bars of length equal
to 100 are drawn for each class in the first step
and sub-divided in proportion to the
percentage of each component in the second
step.
• The diagram so obtained is called a
percentage component bar chart or
percentage stacked bar chart.
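The second step, converting each class's components into percentages of that class's total, can be sketched as below. The class names and component values are hypothetical.

```python
# Hypothetical component values for each class (illustrative only).
components = {
    "Class A": {"food": 40, "rent": 30, "other": 10},
    "Class B": {"food": 90, "rent": 60, "other": 50},
}

percentages = {}
for name, parts in components.items():
    total = sum(parts.values())
    # Each component as a percentage of its own class total.
    percentages[name] = {k: 100 * v / total for k, v in parts.items()}

for name, parts in percentages.items():
    print(name, parts)
```

Every bar then has total length 100, so the classes can be compared by composition even though their totals differ.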
TWO DIMENSIONAL (OR) AREA
DIAGRAM
• Two-dimensional diagrams comprise only two
dimensions (factors), such as length and
width. They are represented by squares,
rectangles and circles; the circle variant
is known as a pie diagram. Here, each
category is represented by an area proportional
to its data value.
• Pie diagram
• Square diagram and rectangle diagrams
Pie Chart
• The “pie chart” is also known as a “circle chart”.
It divides a circular statistical graphic into sectors
or sections to illustrate numerical proportions.
Each sector denotes a proportionate part of the
whole. A pie chart works best for showing the
composition of something.
• The pie chart is an important type of data
representation. It contains different segments or
sectors, each of which forms a specific portion of the
total (percentage). The angles of all the sectors
add up to 360°.
• The total value of the pie is always 100%.
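The sector angles and percentage shares of a pie chart follow directly from the data. A minimal sketch with hypothetical category values:

```python
# Hypothetical category values (illustrative only).
values = {"A": 30, "B": 45, "C": 15, "D": 10}
total = sum(values.values())

# Each sector's angle and share are proportional to its value.
angles = {k: 360 * v / total for k, v in values.items()}
shares = {k: 100 * v / total for k, v in values.items()}

for k in values:
    print(f"{k}: {angles[k]:.1f} degrees, {shares[k]:.1f}%")
```

The angles sum to 360° and the shares to 100%, whatever the original units of the data.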
Square diagram

• Their areas should be proportional to the magnitudes of
the data.
• For square diagrams, we take the square
root of the given figures, which gives the
measurement of the sides of the squares. By adopting
a suitable scale we can draw the squares.
• A square is a quadrilateral with four equal sides.
There are many objects around us that are in the
shape of a square. Each square shape is identified by
its equal sides and its interior angles that are equal to
90°.
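The side lengths for a square diagram can be sketched as follows. The data values and the scale factor are hypothetical choices for illustration.

```python
import math

# Hypothetical magnitudes to be represented by squares (illustrative only).
values = {"X": 400, "Y": 900, "Z": 1600}

# Side of each square is the square root of its value, so area matches value.
sides = {k: math.sqrt(v) for k, v in values.items()}

# Assumed scale: 10 units of side length per centimetre on paper.
scale = 10
sides_cm = {k: s / scale for k, s in sides.items()}

for k in values:
    print(f"{k}: side {sides_cm[k]:.1f} cm")
```

Because every side is divided by the same scale factor, the areas of the drawn squares stay proportional to the original values.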
Rectangle diagrams
• In the case of rectangle diagrams, if we take
equal breadth (width) for the rectangles, then
the areas will be proportional to the lengths and
hence the lengths will be proportional to the
magnitude of the given variables.
PICTOGRAM
• Tabular data can also be represented by pictograms,
cartograms, maps and pictures, as these devices help in
attracting attention to statistical matter
which, when presented in the ordinary diagrammatic
form, is very often ignored.
• Pictograms are diagrams of pictorial or semi-pictorial
nature and are drawn in different sizes according to
scale. Though they are useful in attracting the
attention of the people, readers very often lean on
tables, ignoring the pictorial diagrams.
• They cannot be made use of with certain complicated
data.
