Data Analysis Simplified Notes

Data analysis involves evaluating data through analytical reasoning to derive conclusions, utilizing methods applicable across various fields including science and business. The process includes data preparation, coding, and statistical analysis, with techniques such as descriptive and inferential statistics to summarize and interpret data. Visual displays and hypothesis testing are essential components, aiding in decision-making and understanding relationships within the data.

Uploaded by

mkevin6646

DATA ANALYSIS

Introduction

Data analysis is the process of evaluating data using analytical and logical reasoning to examine
each component of the data provided. This form of analysis is just one of the many steps that
must be completed when conducting a research experiment. Data from various sources is
gathered, reviewed, and then analyzed to form some sort of finding or conclusion.
Data Analysis Methods
Data analysis is the process of extracting useful information from a given data set in order to
support important decisions. With job opportunities for data analysts on the rise,
knowledge of data analysis methods is essential.

Data analysis methods help us to understand facts, observe patterns, formulate explanations, and
try out hypotheses. They are not only used in all kinds of science and business processes, but
also in administration and policy-making.

Data analysis can be carried out in all domains, including medicine and social sciences. All the
analysis that is carried out is well-documented for future use.

Data Analysis Explained

Data analysis is the practice of ordering and organizing raw or unorganized data so that useful
information can be highlighted. It involves processing and working on the data in order to
understand what they contain.

This is where correct data analysis methods and procedures come into the picture. Charts, graphs
and textual write-ups are among the methods used to analyze data. These methods are designed to
polish and refine the data so that end users can obtain interesting or useful information
without having to go through the entire data set themselves.

DATA PREPARATION AND DESCRIPTION
Once the data begins to flow in, attention turns to data analysis. If the project has been done
correctly, the analysis planning is already done.

Data preparation
This includes editing, coding and data entry. These activities ensure the accuracy of the data and
their conversion from raw form to reduced and classified forms that are more appropriate for
analysis.
Editing
Editing detects errors and omissions, corrects them when possible and certifies that minimum
data quality standards have been achieved. The editor’s purpose is to guarantee that data are:

• Accurate
• Consistent with the intent of the question and other information in the survey
• Uniformly entered
• Complete
• Arranged to simplify coding and tabulation
Field editing
In large projects, field editing review is a responsibility of the field supervisor. It should be done
soon after the data have been gathered. During the stress of data collection, the researcher often
uses ad hoc abbreviations and special symbols. Soon after the interview, experiment or
observation, the investigator should review the reporting forms. It is difficult to complete what
was abbreviated or written in shorthand or noted illegibly if the entry is not caught that day.
When entry gaps are present from interviews, a call back should be made rather than guessing
what the respondent ‘probably would have said’. Self-interviewing has no place in quality
research.

Central editing
For a small study, the use of a single editor produces maximum consistency. In large studies, the
tasks may be broken down so that each editor can deal with one entire section. This approach
will not identify inconsistencies between answers in different sections. However, this problem
can be handled by identifying points of possible inconsistency and having one editor check
specifically for them.

Rules to guide editors in their work

• Be familiar with instructions given to interviewers and coders.
• Do not destroy, erase or make illegible the original entry by the interviewer; original entries
should be crossed out with a single line so that they remain legible.
• Make all entries on an instrument in some distinctive colour and in a standardized form.
• Initial all answers changed or supplied.
• Place initials and date of editing on each instrument completed.
Coding
Coding involves assigning numbers or other symbols to answers so the responses can be grouped
into a limited number of classes or categories. The classifying of data into limited categories
sacrifices some data detail but is necessary for efficient analysis. Coding helps the researcher to
reduce several thousand replies to a few categories containing the critical information needed for
analysis. In coding, categories are the partitioning of a set and categorization is the process of
using rules to partition a body of data.

Coding rules
The categories should be:

• Appropriate to the research problem and purpose: categories must provide the best
partitioning of data for testing hypotheses and showing relationships.
• Exhaustive
• Mutually exclusive
• Derived from one classification principle
Coding closed questions
The responses to closed questions include scaled items and others for which answers can be
anticipated. When codes are established early in the research process, it is possible to pre-code
the questionnaire. Pre-coding is particularly helpful for data entry because it makes the
intermediate step of completing a coding sheet unnecessary. The data are accessible directly
from the questionnaire. A respondent, interviewer, field supervisor or researcher is able to assign
an appropriate numerical response on the instrument by checking, circling or printing it in the
proper coding location.
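
As a rough illustration of pre-coding, the sketch below maps the answers to one closed question onto numeric codes. The question, the category labels and the code numbers are all hypothetical; the only ideas taken from these notes are that every reply should land in exactly one category and that a “don’t know” reply can be given its own code.

```python
# Hypothetical codebook for one closed question (labels and codes are illustrative only).
CODEBOOK = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
    "don't know": 8,  # kept as its own category
}
MISSING = 9  # blank or unreadable answers, so that the categories stay exhaustive


def code_response(answer):
    """Return the numeric code for one raw answer string."""
    key = answer.strip().lower()
    return CODEBOOK.get(key, MISSING)


raw_answers = ["Agree", "neutral ", "Don't know", "", "Strongly agree"]
coded = [code_response(a) for a in raw_answers]
print(coded)
```

Because each answer is looked up in a single dictionary, no reply can receive two codes, which is one simple way of keeping the categories mutually exclusive.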

Coding open-ended questions

Open-ended questions are typically used where insufficient information or the lack of a hypothesis
prohibits preparing response categories in advance, or where the researcher needs to measure
sensitive or disapproved behaviour, discover salience or encourage natural modes of expression.
Content analysis is commonly used to analyse open-ended questions. Converse and Presser (1986)
define content analysis as a research technique for the objective, systematic and quantitative
description of the manifest content of a communication.

Content analysis follows a systematic process:

• Selection of a unitization scheme. The units may be syntactical, referential, propositional
or thematic
• Selection of a sampling plan
• Development of recording and coding instructions
• Data reduction
• Inferences about the context
• Statistical analysis
Content analysis guards against selective perception of the content, provides for the rigorous
application of reliability and validity criteria and is amenable to computerization.
“Don’t know” replies
“Don’t know” replies are evaluated in light of the question’s nature and the respondent. While
many “don’t know” replies are legitimate, some result from questions that are ambiguous or from
an interviewing situation that is not motivating. It is better to report “don’t know” replies as a
separate category unless there are compelling reasons to treat them otherwise.

Data entry
Data entry converts information gathered by secondary or primary methods to a medium for
viewing and manipulation. Data entry is accomplished by keyboard entry from pre-coded
instruments, optical scanning, real time keyboarding, telephone pad data entry, bar codes, voice
recognition, optical mark recognition (OMR) and data transfers from electronic notebooks and
laptop computers. Database programs, spreadsheets and editors in statistical software programs
e.g. SPSS and SAS offer flexibility for entering, manipulating and transferring data for analysis,
warehousing and mining.

Data description
The objective of descriptive statistical analysis is to develop sufficient knowledge to describe a
body of data. This is accomplished by understanding the data levels for the measurements we
choose, their distributions and characteristics of location, spread and shape. The discovery of
miscoded values, missing data and other problems in the data set is enhanced with descriptive
statistics.
There are three general areas that make up the field of statistics: descriptive statistics, relational
statistics and inferential statistics.

DESCRIPTIVE STATISTICS
Descriptive statistics fall into one of two categories: measures of central tendency (mean,
median, and mode) or measures of dispersion (standard deviation and variance). Their purpose is
to explore hunches that may have come up during the course of the research process, but most
people compute them to look at the normality of their numbers. Examples include descriptive
analysis of sex, age, race, social class, and so forth.
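
A minimal sketch of the measures named above, using Python's standard `statistics` module. The data are the thirteen family sizes from the medical-expenses exercise later in these notes; choosing that data set here is only for illustration.

```python
import statistics

# Family sizes from the medical-expenses exercise in these notes.
family_size = [2, 2, 4, 5, 7, 3, 8, 10, 5, 2, 3, 5, 2]

# Measures of central tendency
mean = statistics.mean(family_size)
median = statistics.median(family_size)
mode = statistics.mode(family_size)

# Measures of dispersion (sample versions, dividing by n - 1)
variance = statistics.variance(family_size)
std_dev = statistics.stdev(family_size)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} standard deviation={std_dev:.2f}")
```

The mean (about 4.46) sits above the median (4) and the mode (2), which hints that the distribution is skewed to the right by a few large families.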

VISUAL DISPLAYS OF DATA

In addition to numerical summaries of location, spread and shape, visual displays can be used to
provide a complete and accurate impression of distributions and variable relationships.

• A frequency table arrays data from highest to lowest values with counts and percentages.
Frequency tables are most useful for inspecting the range of responses and their repeated
occurrence.
• Bar charts and pie charts are appropriate for relative comparisons of nominal data.
• Histograms are best used with continuous variables where intervals group the
responses.
• Stem-and-leaf displays present actual data values using a histogram-type device that
allows inspection of spread and shape.
• Box plots use the five-number summary to convey a detailed picture of a distribution’s
main body, tails and outliers.
• Control charts display sequential measurements of a process together with a centre line
and control limits. The selection of a control chart depends on the level of data one is
measuring. Control charts help managers focus on special causes of variation by revealing
whether a system is under control and by substantiating results from improvements.
• The Pareto diagram is a bar chart whose percentages sum to 100 percent. The causes of
the problem under investigation are sorted in decreasing importance with bar height
descending from left to right. Its pictorial array reveals the highest concentration of
quality improvement potential in the fewest number of remedies.
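
The ordering behind a Pareto diagram can be sketched in a few lines. The defect counts below are the row totals from the shirt-defect exercise later in these notes; the function name `pareto_table` is made up for this illustration, and the sketch only builds the sorted table, not the chart itself.

```python
# Defect totals taken from the shirt-defect exercise in these notes.
defects = {
    "Improperly stitched": 55,
    "Unaligned buttons": 32,
    "Inconsistent colour": 44,
}


def pareto_table(counts):
    """Sort causes by decreasing count and attach percent and cumulative percent."""
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for cause, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        percent = 100.0 * n / total
        cumulative += percent
        rows.append((cause, n, round(percent, 1), round(cumulative, 1)))
    return rows


rows = pareto_table(defects)
for cause, n, percent, cumulative in rows:
    print(f"{cause:<22}{n:>4}{percent:>7}{cumulative:>7}")
```

Reading down the cumulative column shows how few causes account for most of the defects, which is the point of the diagram.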

INFERENTIAL STATISTICS
Hypothesis: It’s a statement about a population parameter developed for the purpose of testing.

Hypothesis testing: It’s a procedure based on sample evidence and probability theory to
determine whether the hypothesis is a reasonable statement.

Procedure for testing a hypothesis


1. State the null and alternate hypothesis
2. Identify the test statistic
3. Formulate a decision rule and identify the rejection region
4. Compute the value of the test statistic
5. Make a conclusion.
State the null hypothesis (H0) and alternate hypothesis (HA)

• The null hypothesis is a statement about the value of a population parameter. It should be
stated as “There is no significant difference between ……………”. It should always
contain an equal sign.
• The alternate hypothesis is a statement that is accepted if sample data provide enough
evidence that the null hypothesis is false.
One-tailed and Two-tailed tests
• A test is one-tailed when the alternate hypothesis states a direction, e.g.
H0: The mean income of women is equal to the mean income of men
HA: The mean income of women is greater than the mean income of men
• A test is two-tailed if no direction is specified in the alternate hypothesis, e.g.
H0: There is no difference between the mean income of women and the mean income
of men
HA: There is a difference between the mean income of women and the mean income
of men
Identify the test statistic
A test statistic is the statistic that will be used to test the hypothesis, e.g. the z statistic or the
chi-square statistic.
Formulating a decision rule and identifying the rejection region
A decision rule is a statement of the conditions under which the null hypothesis is rejected and
the conditions under which it is not rejected. It is determined by the level of significance, which
is designated by α and lies between 0 and 1.

Compute the value of the test statistic and make a conclusion.

The value of the test statistic is determined from the sample information, and is used to
determine whether to reject the null hypothesis or not.

Types of errors that can be committed

i. Type I error: rejecting the null hypothesis when it is true.
ii. Type II error: not rejecting the null hypothesis when it is false.

Null hypothesis    Do not reject H0    Reject H0
H0 is true         Correct decision    Type I error
H0 is false        Type II error       Correct decision

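TESTING THE POPULATION MEAN WHEN THE POPULATION VARIANCE IS KNOWN

When the population variance $\sigma^2$ is known and the population is normally distributed, the
test statistic for testing a hypothesis about the population mean $\mu$ is

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

where $\bar{x}$ is the sample mean and $n$ is the sample size.

Estimating the population mean when the population variance is known

The confidence interval estimator of $\mu$ when $\sigma^2$ is known is

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$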

Examples
1. A study by the Coca-Cola Company showed that the typical adult Kenyan consumes 18
gallons of Coca-Cola each year. According to the same survey, the standard deviation of the
number of gallons consumed is 3.0. A random sample of 64 college students showed they
consumed an average (mean) of 17 gallons of cola last year. At the 0.05 significance level,
can we conclude that there is a significant difference between the mean consumption rate of
college students and other adults?
2. The manager of a department store is thinking about establishing a new billing system for
the store’s credit customers. After a thorough financial analysis, she determines that the new
system will not be cost effective if the average monthly account is less than Sh. 70,000. A
random sample of 200 monthly accounts is drawn, for which the mean monthly account is
Sh. 66,000. With α = 0.05, is there sufficient evidence to conclude that the new system will
not be cost effective? Assume that the population standard deviation is Sh. 30,000.
3. Past experience indicates that the monthly long distance telephone bill per household in a
particular community is normally distributed, with a mean of Sh. 1012 and a standard
deviation of Sh. 327. After an advertising campaign that encouraged people to make long
distance telephone calls more frequently, a random sample of 57 households revealed that the
mean monthly long distance bill was Sh. 1098. Can we conclude at the 10% significance
level that the advertising campaign was successful?
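
One way to work the first example is sketched below with Python's standard library. The function names are invented for this illustration, and the sketch compares a p-value with α instead of looking up a rejection region in a table; the two approaches lead to the same decision.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """z test for a mean with known population standard deviation sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif tail == "lower":
        p_value = std_normal.cdf(z)
    elif tail == "upper":
        p_value = 1 - std_normal.cdf(z)
    else:
        raise ValueError("tail must be 'two', 'lower' or 'upper'")
    return z, p_value, p_value < alpha


def confidence_interval_mean(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Example 1: H0: mu = 18, HA: mu != 18, sigma = 3.0, n = 64, sample mean = 17
z, p_value, reject = z_test_mean(17, 18, 3.0, 64, alpha=0.05, tail="two")
low, high = confidence_interval_mean(17, 3.0, 64, confidence=0.95)
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject}")
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```

Here z is about -2.67, which falls beyond ±1.96, so the null hypothesis is rejected at the 0.05 level. Examples 2 and 3 are one-tailed and can be run with `tail="lower"` and `tail="upper"` respectively.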

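Testing the population proportion

The null and alternate hypotheses of tests of proportions are set up in the same way as the
hypotheses of tests about the mean and variance. The test statistic for the population
proportion $p$ is

$$z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

where $\hat{p}$ is the sample proportion and $p$ is the value stated in the null hypothesis.

The confidence interval estimator of $p$ is

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$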

Example:

1. An inventor has developed a system that allows visitors to museums, zoos and other
attractions to get information at the touch of a digital code. For example, zoo patrons can
listen to an announcement (recorded on a microchip) about each animal they see. It is
anticipated that the device would rent for $3.00 each. The installation cost for the complete
system is expected to be about $400,000. The ABC zoo is interested in having the system
installed, but the management is uncertain about whether to take the risk. A financial analysis
of the problem indicates that if more than 10% of the zoo visitors rent the system, the zoo
will make a profit. To help make the decision, a random sample of 400 zoo visitors is given
details of the system’s capabilities and cost. If 48 people say that they would rent the device,
can the management of the zoo conclude at the 5% significance level that the investment
would result in a profit?
2. In a random sample of 100 units from an assembly line, 22 were defective.
(a) Does this provide sufficient evidence at the 10% significance level to allow us to
conclude that the defective rate among all units exceeds 10%?
(b) Find a 99% confidence interval estimate of the defective rate.
3. A manufacturer of computer chips claims that more than 90% of his products conform to
specifications. In a random sample of 1,000 chips drawn from a large production run, 75
were defective. Do the data provide sufficient evidence at the 1% level of significance to
enable us to conclude that the manufacturer’s claim is true?
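
The zoo example and part (b) of the assembly-line example can be sketched as follows. The function names are hypothetical, the test uses the usual large-sample normal approximation, and the decision is made by comparing a p-value with α.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(successes, n, p0, alpha=0.05):
    """Upper-tailed z test of H0: p = p0 against HA: p > p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha


def confidence_interval_proportion(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Example 1: 48 of 400 visitors would rent; H0: p = 0.10, HA: p > 0.10
z, p_value, reject = z_test_proportion(48, 400, 0.10, alpha=0.05)
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject}")

# Example 2(b): 22 defective units out of 100, 99% confidence interval
low, high = confidence_interval_proportion(22, 100, confidence=0.99)
print(f"99% confidence interval: ({low:.4f}, {high:.4f})")
```

For the zoo, z is about 1.33, which is below the 5% critical value of 1.645, so the sample does not give enough evidence that more than 10% of visitors would rent the device.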
Chi-square test of a multinomial experiment

A multinomial experiment is a generalized version of a binomial experiment that allows for
more than two possible outcomes on each trial of the experiment.

Properties of a multinomial experiment

• The experiment consists of a fixed number $n$ of trials.
• The outcome of each trial can be classified into exactly one of $k$ categories called cells.
• The probability $p_i$ that the outcome of a trial will fall into cell $i$ remains constant for each
trial, for $i = 1, 2, \ldots, k$; moreover, $p_1 + p_2 + \cdots + p_k = 1$.
• Each trial of the experiment is independent of the other trials.

Test statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

where $o_i$ is the observed frequency and $e_i = np_i$ is the expected frequency of cell $i$.

Rejection region is

$$\chi^2 > \chi^2_{\alpha,\, k-1}$$

Example
1. Two companies A and B have recently conducted aggressive advertising campaigns in order
to maintain and possibly increase their respective shares of the market for a particular
product. These two companies enjoy a dominant position in the market. Before advertising
campaigns began, the market share for Company A was 45% while Company B had a
market share of 40%. Other competitors accounted for the remaining market share of 15%.
To determine whether these market shares changed after the advertising campaigns, a
marketing analyst solicited the preferences of a random sample of 200 consumers of this
product. Of the 200 consumers, 100 indicated a preference for Company A’s product, 85
preferred Company B’s product and the remainder preferred one or another of the products
distributed by other competitors. Conduct a test to determine at the 5% level of significance,
whether the market shares have changed from the levels they were at before the advertising
campaigns occurred.
2. To determine whether a single die is balanced (fair), the die was rolled 600 times. The observed
frequencies with which each of the six sides of the die turned up are recorded in the following
table:

Face                  1    2    3    4    5    6
Observed frequency  114   92   84  101  107  102

Is there sufficient evidence to conclude at the 5% level of significance, that the die is not
fair?
3. Grades assigned by an economics instructor have historically followed a symmetrical
distribution.

Grade         A    B    C    D    F
Percentage    5   25   40   25    5

A sample of 150 grades revealed the following:

Grade         A    B    C    D    F
Number       11   32   62   29   16

Can we conclude at the 1% level of significance that this year’s grades are distributed
differently than they were in the past?
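
The die example can be checked with a short sketch. The function name is invented for this illustration, and the critical value 11.070 (5% level, 5 degrees of freedom) is copied from a standard chi-square table instead of being computed.

```python
def chi_square_goodness_of_fit(observed, expected_probs):
    """Return the chi-square statistic, degrees of freedom and expected counts."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    degrees_of_freedom = len(observed) - 1
    return statistic, degrees_of_freedom, expected


# Example 2: 600 rolls of a die; under H0 each face has probability 1/6
observed = [114, 92, 84, 101, 107, 102]
statistic, df, expected = chi_square_goodness_of_fit(observed, [1 / 6] * 6)

CRITICAL_5PCT_DF5 = 11.070  # table value for alpha = 0.05 and 5 degrees of freedom
reject = statistic > CRITICAL_5PCT_DF5
print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject}")
```

The statistic works out to 5.70, well below 11.070, so there is not sufficient evidence at the 5% level that the die is unfair.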

Rule of five
For the discrete distribution of the test statistic to be adequately approximated by the
continuous chi-square distribution, the conventional rule is to require that the expected frequency
for each cell be at least 5. Where necessary, cells should be combined in order to satisfy this
condition. The choice of cells to be combined should be made in such a way that meaningful
categories result from the combination.
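
A small sketch of the rule of five, using the historical grade percentages from example 3. The helper names and the sample size of 60 are hypothetical; they are only there to show a case where cells have to be combined.

```python
def expected_counts(n, probs):
    """Expected frequency of each cell for a sample of size n."""
    return [n * p for p in probs]


def cells_below_five(expected):
    """Indexes of cells whose expected frequency is less than 5."""
    return [i for i, e in enumerate(expected) if e < 5]


def merge_cells(values, groups):
    """Combine cells; each group is a tuple of the indexes to be added together."""
    return [sum(values[i] for i in group) for group in groups]


grade_probs = [0.05, 0.25, 0.40, 0.25, 0.05]  # A, B, C, D, F

# With the 150 grades of example 3 every expected frequency is at least 5.
small_150 = cells_below_five(expected_counts(150, grade_probs))

# With a hypothetical sample of 60 grades, cells A and F expect only 3 each.
expected_60 = expected_counts(60, grade_probs)
small_60 = cells_below_five(expected_60)

# Combining A with B and D with F gives the meaningful cells "A or B", "C", "D or F".
merged_60 = merge_cells(expected_60, [(0, 1), (2,), (3, 4)])
print(small_150, small_60, merged_60)
```

After merging, the observed frequencies must be combined in the same way and the degrees of freedom drop to the new number of cells minus one.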

CHI-SQUARE TEST OF A CONTINGENCY TABLE

A contingency table is a rectangular table in which items from a population are classified
according to two characteristics. The objective is to analyze the relationship between two
qualitative variables, i.e. to investigate whether a dependence relationship exists between the
two variables or whether the variables are statistically independent. The number of degrees of
freedom for a contingency table with $r$ rows and $c$ columns is $(r - 1)(c - 1)$.

Examples
1. The trustee of a company’s pension plan has solicited the opinions of a sample of the
company’s employees regarding a proposed revision of the plan. A breakdown of the
responses is shown in the table below:

Response    Lower level management    Middle management    Top management
For                   67                     32                  11
Against               63                     18                   9

Is there sufficient evidence at the 5% significance level, to conclude that the responses differ
among the three groups of employees?

2. The operations manager at a shirt manufacturing plant has been concerned about the large
number of defects that the company’s three shifts have been producing. There appear to be
three types of defects: improper stitching, buttons not aligned with button holes and
inconsistent colouring. The manager decides to investigate the problem. As a first step to
improving the quality, she wants to know if the number and type of defects are the same for
all three shifts. A random sample of one day’s shirt production is taken. The number of each
type of defect and the number of perfect shirts for each shift are shown in the following table.

                            Shift
Shirt condition          1      2      3    Total
Perfect                224    249    238     711
Improperly stitched     15     19     21      55
Unaligned buttons        8     12     12      32
Inconsistent colour     17     16     11      44
Total                  264    296    282     842

Do these results allow the operations manager to conclude that at the 10% significance level,
there are differences in quality among the three shifts?

3. There are three distinct types of hardware wholesalers: independents (independently owned),
wholesaler voluntaries (groups of independents acting together) and retailer cooperatives
(retailer owned). In a random sample of 137 retailers, the retailers were categorized
according to the type of wholesaler they primarily used and according to their store location
as shown in the table below:

Store location          Retailer cooperatives    Wholesaler voluntaries    Independents
Multiple locations               14                        10                    5
Free-standing                    29                        26                   13
Others (mall, strips)            20                        14                    6

At the 5% significance level, is there sufficient evidence to conclude that the type of
wholesaler primarily used by a retailer is related to the retailer’s location?
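
The pension-plan example can be worked through with the sketch below. The function name is made up for this illustration, and the critical value 5.991 (5% level, 2 degrees of freedom) is copied from a standard chi-square table.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total x column total / n
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Example 1: rows are For / Against; columns are lower, middle and top management
pension = [
    [67, 32, 11],
    [63, 18, 9],
]
statistic, df = chi_square_independence(pension)

CRITICAL_5PCT_DF2 = 5.991  # table value for alpha = 0.05 and 2 degrees of freedom
reject = statistic > CRITICAL_5PCT_DF2
print(f"chi-square = {statistic:.3f}, df = {df}, reject H0: {reject}")
```

The statistic is about 2.27 with 2 degrees of freedom, which is below 5.991, so the sample does not show that the responses differ among the three groups.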
RELATIONAL STATISTICS
Relational statistics fall into one of three categories: univariate, bivariate, and multivariate
analysis. Univariate analysis is the study of one variable for a sub-population. Bivariate analysis
is the study of the relationship between two variables. Multivariate analysis is the study of
the relationship between three or more variables. The relational statistics include correlation,
regression, discriminant analysis, conjoint analysis, factor analysis and cluster analysis.
• Discriminant analysis: It is used to classify people or objects into groups based on several
predictor variables. The groups are defined by a categorical variable with two or more values,
whereas the predictors are metric. The effectiveness of the discriminant equation is based not
only on its statistical significance but also on its success in correctly classifying cases to
groups.
• Conjoint analysis: It is a technique that typically handles non-metric independent variables.
It allows the researcher to determine the importance of product or service attributes and the
levels or features that are most desirable. Respondents provide preference data by ranking or
rating cards that describe products. These data become utility weights of product
characteristics by means of optimal scaling and log-linear algorithms.
• Factor analysis: It attempts to reduce the number of variables and discover the underlying
constructs that explain the variance. A correlation matrix is used to derive a factor matrix
from which the best linear combination of variables may be extracted.
• Cluster analysis: It is a set of techniques for grouping similar objects or people. The cluster
procedure starts with an undifferentiated group of people, events or objects and attempts to
reorganize them into homogeneous sub-groups.

REGRESSION ANALYSIS
Regression involves developing a mathematical equation that analyses the relationship between
the variable to be forecast (dependent variable) and the variables that the statistician believes are
related to the forecast variable (independent variable).
Regression is the estimation of unknown values or the prediction of one variable from known
values of other variables.
Types of regression
• Simple linear regression: involves a relationship between two variables only.
• Multiple regression: considers the relationship between three or more variables.

Simple Regression

The first step in establishing the relationship between X and Y is to obtain observations on the
two variables and analyze the data using a scatter diagram to indicate whether a positive or
negative relationship exists between X and Y. The relationship can be approximated by a straight
line. Algebraically, the relationship is

$$Y = a + bX$$

The above function is deterministic since it gives an exact relationship between X and Y. When
the line is plotted, not all the points will fall on the line, for the following reasons:

• Omission of other explanatory variables from the function
• Random behavior of human beings
• Imperfect specification of the functional form of the model
• Errors of aggregation
• Errors of measurement

To account for the deviations of some points from the straight line, an error term $e$ is
introduced. The introduction of the error term makes the function stochastic: $Y = a + bX + e$.
To estimate the values of the coefficients $a$ and $b$, we need observations on Y, X and the
error term. However, the error term is not observable and therefore we make assumptions about
the error term.

Assumptions of the error term

• The error term is a real random variable which has a mean of zero and constant variance
(assumption of homoscedasticity)
• The error term is normally distributed
• The error terms corresponding to different values of X, or to different periods, are not
correlated (assumption of no autocorrelation)
• There is no relationship between the explanatory variables and the error term
• The explanatory variables are measured without error. The error term absorbs the influence of
omitted variables and errors of measurement in the dependent variable.
All the above assumptions are called stochastic assumptions.

Other assumptions

• The explanatory variables are not perfectly linearly related or correlated (no
multicollinearity)
• The variables are correctly aggregated
• The relation being estimated is identified
• The relationship is correctly specified

The regression equation of Y on X

• It is used to predict the values of Y from the given values of X.
• It is expressed as follows:

$$Y = a + bX$$

• To determine the values of $a$ and $b$, the following two normal equations are solved
simultaneously:

$$\sum Y = na + b \sum X$$
$$\sum XY = a \sum X + b \sum X^2$$

• Alternatively, the values of $a$ and $b$ can be obtained using the following formulas:

$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left(\sum X\right)^2}, \qquad a = \bar{Y} - b\bar{X}$$

Regression analysis helps determine the relationship between two variables, one of which is
dependent and the other independent. Where there are several independent variables, the analysis
shows how the dependent variable behaves when one of the independent variables is varied and the
others are kept fixed.

Example
1. A random sample of eight auto drivers insured with a company and having similar auto insurance
policies was selected. The following table lists their driving experience (in years) and the monthly
auto insurance premium (in Sh.000) paid by them.
Driving experience (years)                     5    2   12    9   15    6   25   16
Monthly auto insurance premium (in Sh. 000)   64   87   50   71   44   56   42   60

i. Find the least squares regression line by identifying the appropriate dependent and
independent variable
ii. Interpret the meaning of the constants calculated in part (i) above.
iii. Compute the coefficient of correlation and coefficient of determination and interpret their
values.

2. A farmer wanted to find out the relationship between the amount of fertilizer used and the
yield of corn. He selected seven acres of his land on which he used different amounts of
fertilizer to grow corn. The following table gives the amount (in kg) of fertilizer used and the
yield (in Tonnes) of corn for each of the seven acres.
Fertilizer used (kg)      120   80  100   70   88   75  110
Yield of corn (tonnes)    138  112  129   96  119  104  134

i. Find the least squares regression line by identifying the appropriate dependent and
independent variable.
ii. Interpret the meaning of the constants calculated in part (i) above.
iii. Compute the coefficient of correlation and coefficient of determination and interpret their
values.
iv. Predict the yield of corn per acre for 105 kg of fertilizer used.
3. In an attempt to get a better idea of some of the determinants of medical expenditures by
families, a social worker collected data on family size and average weekly medical bills, with
the results shown in the following table:

Family size                            2   2   4   5   7   3    8  10   5   2   3   5   2
Weekly medical expenses (in Sh. ’00)  20  28  52  50  78  35  102  88  51  22  29  49  25

i. Find the least squares regression line by identifying the appropriate dependent and
independent variable.
ii. Interpret the meaning of the constants calculated in part (i) above.
iii. Compute the coefficient of correlation and coefficient of determination and interpret their
values.
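
Parts (i) and (iv) of the fertilizer example can be sketched as below. The names `a` (intercept) and `b` (slope) are chosen for this illustration, and the function is a plain implementation of the textbook least-squares formulas with fertilizer as the independent variable and yield as the dependent variable.

```python
def least_squares(x, y):
    """Return intercept a and slope b of the least squares line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n  # same as mean(y) - b * mean(x)
    return a, b


# Example 2: fertilizer used (kg) and yield of corn for seven acres
fertilizer = [120, 80, 100, 70, 88, 75, 110]
corn_yield = [138, 112, 129, 96, 119, 104, 134]

a, b = least_squares(fertilizer, corn_yield)
predicted = a + b * 105  # part (iv): predicted yield for 105 kg of fertilizer
print(f"yield = {a:.3f} + {b:.4f} * fertilizer")
print(f"predicted yield at 105 kg: {predicted:.2f}")
```

The slope of about 0.82 says that each extra kilogram of fertilizer is associated with roughly 0.82 more units of yield over the range observed; the intercept of about 43.5 is the fitted yield at zero fertilizer, which lies outside the data and should be read with caution.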

CORRELATION
Definition: It is the existence of some definite relationship between two or more variables.
Correlation analysis is a statistical tool used to describe the degree to which one variable is
linearly related to another variable.

Types of Correlation
Correlation may be classified in the following ways:-
Positive and negative correlation
Correlation is said to be positive if two series move in the same direction, otherwise it is negative
(opposite Direction).
Linear and Non-Linear correlation
Correlation is linear if the amount of change in one variable tends to bear a constant ratio to the
amount of change in the other variable otherwise it is non-linear.
Simple, partial and multiple correlation
Simple correlation is where two variables are studied, while partial or multiple correlation
involves three or more variables.

Methods of calculating simple correlation


1. Scatter diagram
2. Karl Pearson’s coefficient of correlation
3. Spearman’s rank correlation coefficient
4. Method of least squares

Scatter diagram
It is a chart that portrays the relationship between two variables.
Advantages
• It is a simple and non-mathematical method of studying correlation between variables.
• It is not influenced by the size of extreme values.
Limitation
• One cannot establish the exact degree of correlation between the variables.

Karl Pearson’s coefficient of correlation (Product moment coefficient of correlation)


The coefficient of correlation (r) is a measure of strength of the linear relationship between two
variables.

Interpretation of the coefficient of correlation


 When r = +1, there is a perfect positive correlation between the variables
 When r = -1, there is a perfect negative correlation between the variables
 When r = 0, there is no correlation between the variables
 The closer r is to +1 or to –1, the closer the relationship between the variables and the closer r
is to 0, the less close the relationship.
 The closeness of the relationship is not proportional to r.
The following table lists the interpretations for various correlation coefficients:
Value        Comment
0.8 to 1.0   Very strong
0.6 to 0.8   Strong
0.4 to 0.6   Moderate
0.2 to 0.4   Weak
0.0 to 0.2   Very weak
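
The table can be read as a simple lookup. The sketch below makes two assumptions that the notes do not state: the bands apply to the size of r regardless of sign, and a value that falls exactly on a boundary (such as 0.6) is placed in the stronger band. The function name describe_strength is illustrative.

```python
def describe_strength(r):
    """Return the comment from the table for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # assumption: bands describe magnitude; the sign gives direction
    if size >= 0.8:
        return "Very strong"
    if size >= 0.6:  # assumption: boundary values go to the stronger band
        return "Strong"
    if size >= 0.4:
        return "Moderate"
    if size >= 0.2:
        return "Weak"
    return "Very weak"

print(describe_strength(-0.72))  # Strong
print(describe_strength(0.15))   # Very weak
```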

Advantage
 It summarizes in one figure the degree of correlation and whether it is positive or negative.

Limitations
 It assumes a linear relationship, whether or not that assumption is true.
 The coefficient can be misinterpreted.
 The value of the coefficient is unduly affected by extreme values.
 It is time-consuming to calculate.
Method of least squares

Spearman’s Rank Correlation


Definition
 It is the correlation between the ranks assigned to individuals on two different characteristics.
 It is a non-parametric technique for measuring the strength of the relationship between paired
observations of two variables when the data are in ranked form.

It is denoted by R or ρ (rho).
In rank correlation, there are two types of problems:
i. Where actual ranks are given
ii. Where actual ranks are not given

Where actual ranks are given


Steps:
 Take the differences of the two ranks, i.e. (R1 - R2), and denote these differences by d.
 Square these differences and obtain the total, Σd².

 Use the formula R = 1 - 6Σd² / (n(n² - 1)), where n is the number of paired observations.


Example
The ranks given by two judges to 10 individuals are given below.
Individual   1   2   3   4   5   6   7   8   9   10
Judge 1 (X)  1   2   7   9   8   6   4   3   10  5
Judge 2 (Y)  7   5   8   10  9   4   1   6   3   2
Calculate Spearman's rank correlation coefficient.
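
The notes leave this example unsolved. One way to work it through is sketched below, assuming the standard formula for untied ranks, R = 1 - 6Σd² / (n(n² - 1)). The function name spearman_from_ranks is illustrative.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Return (R, sum of d^2) using R = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    if len(rank_x) != len(rank_y):
        raise ValueError("both rank lists must have the same length")
    n = len(rank_x)
    # d is the difference between the two ranks given to the same individual
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

judge_1 = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
judge_2 = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]
r, sum_d2 = spearman_from_ranks(judge_1, judge_2)
print(sum_d2)       # 128
print(round(r, 3))  # 0.224
```

With Σd² = 128 and n = 10, R = 1 - 768/990 ≈ 0.224, which indicates only a weak positive agreement between the two judges.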

Where ranks are not given


Ranks can be assigned by taking either the highest value as 1 or the lowest value as 1. The same
method should be followed for all the variables.

Example
Calculate the rank correlation coefficient for the following marks given to 1st-year B.Com
students:
CMS 100   45  47  60  38  50
CAC 100   60  61  58  48  46
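
A possible working of this example is sketched below. It assumes that "CMS 100" and "CAC 100" are subject labels, so that each row holds five marks, that the highest mark is ranked 1, and that there are no tied marks (ties would need averaged ranks, which this sketch does not handle). The function names are illustrative.

```python
def rank_descending(values):
    """Rank values so that the highest gets rank 1 (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_from_values(x, y):
    """Rank both series the same way, then apply R = 1 - 6*sum(d^2)/(n*(n^2 - 1))."""
    rank_x, rank_y = rank_descending(x), rank_descending(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

cms = [45, 47, 60, 38, 50]  # assumption: "CMS 100" is the subject label
cac = [60, 61, 58, 48, 46]  # assumption: "CAC 100" is the subject label
print(rank_descending(cms))  # [4, 3, 1, 5, 2]
print(rank_descending(cac))  # [2, 1, 3, 4, 5]
print(round(spearman_from_values(cms, cac), 3))  # -0.1
```

Here Σd² = 22 and n = 5, so R = 1 - 132/120 = -0.1, a very weak negative rank correlation.
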
Merits of the Rank method
 It is simpler to understand and easier to apply compared to the Karl Pearson’s method.
 Where the data are of qualitative nature like honesty, efficiency, intelligence etc, the method
can be used with great advantage.
 It is the only method that can be used where we are given the ranks and not the actual values.
Limitations
 The method cannot be used for finding out correlation in a grouped frequency distribution.
 Where the number of observations exceeds 30, the calculations become quite tedious and
require a lot of time.

Coefficient of determination (r²)

It is the square of the correlation coefficient. It shows the proportion of the total variation in the
dependent variable Y that is explained or accounted for by the variation in the independent
variable X. For example, if r = 0.9, then r² = 0.81, which means that 81% of the variation in the
dependent variable is explained by the independent variable.
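
The r = 0.9 example can be checked directly. For a straight-line fit, the same quantity can also be obtained as the share of variation explained by the least-squares line. The sketch below shows both; the data set and the function name r_squared_from_fit are illustrative assumptions, not part of the notes.

```python
def r_squared_from_fit(x, y):
    """Share of the variation in y explained by the least-squares line on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1 - ss_residual / ss_total

# The example from the text: r = 0.9 gives r squared = 0.81, i.e. 81%.
r = 0.9
print(f"{r ** 2:.0%}")  # 81%

# Illustrative data: 60% of the variation in y is explained by x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(r_squared_from_fit(x, y), 2))  # 0.6
```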
 Qualitative Data Analysis

Qualitative research analysts define 15 types of data analysis methods. Let's go through
each one of them:

1. Typology: A classification system derived from patterns, themes or other groupings of data.
Ideally, the categories should be mutually exclusive and exhaustive. Examples of categories
include acts, activities, meanings, participation, relationships, settings, etc.

2. Taxonomy: This method is a complex classification containing multiple levels of concepts or
abstractions. Higher levels include lower levels, forming superordinate and subordinate
categories.

3. Constant Comparison/Grounded Theory: This method was developed in the 1960s and has the
following steps:

Look at the document to be analyzed, such as a field note.
Identify parameters for categorizing events and behavior; these are named and coded on the
document.
Compare codes to find consistencies and differences. This continues until the categories
saturate and no new codes related to them are formed.
Finally, certain categories become the central focus; these are commonly known as core
categories. The core categories become the subjects of the case study.

4. Analytic Induction: This is one of the oldest and most appreciated methods. An event is
studied and a hypothetical statement is developed about what happened. Other similar events
are then studied and checked to see whether they fit the hypothesis. If they do not, the
hypothesis is revised. The process proceeds by looking for exceptions to the hypothesis and
revising it to fit all the examples encountered. Eventually, a hypothesis is developed that
accounts for all the observed cases.

5. Logical Analysis/Matrix Analysis: An outline of generalized causation, logical reasoning
processes, etc. It mostly involves the use of flow charts, diagrams, etc., to represent these
graphically, together with written descriptions.

6. Quasi-statistics: In this method, enumeration (counting how often something occurs) is
commonly used to provide evidence for the categories formed, or to determine whether
observations are untrue.
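
The enumeration step can be sketched as a simple tally of codes. The code labels and the list of coded passages below are hypothetical and serve only to show the counting.

```python
from collections import Counter

# Hypothetical codes assigned to successive field-note passages.
coded_passages = ["conflict", "cooperation", "conflict", "play", "conflict", "play"]

counts = Counter(coded_passages)
for code, count in counts.most_common():
    # Report each code with its frequency and its share of all coded passages.
    print(code, count, f"{count / len(coded_passages):.0%}")
```

A category that rests on a single passage would stand out in such a tally as weakly supported.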

7. Event Analysis/Microanalysis: In this method, importance is given to finding the precise
beginnings and endings of events by identifying the specific boundaries or points that mark
them. This method is specifically oriented towards film and video. After the end points are
determined, repeated viewing can help find phases within the event.

8. Metaphorical Analysis: Here, various metaphors are tried out while checking how well they
correspond with what is being observed. Participants may also be asked for metaphors, which
they then interpret. For example, given "hallway as a highway", participants may read the
highway and its components in different ways, such as students as traffic and teachers as
police.

9. Domain Analysis: This type of analysis is mostly used to describe social and cultural
situations and the patterns within them. The method starts by emphasizing what the social
situation means to the participants and relates it to cultural meanings.

10. Hermeneutical Analysis: Hermeneutical analysis does not seek an objective meaning of a
text, but interprets the text as it is understood by the people involved in the situation. The
analyst avoids overemphasizing the self and instead retells the people's story. The meaning of
any content resides in the author's intent, the context, and the reader; finding themes and
relating these three is central to this method.

11. Discourse Analysis: This method usually involves videotaping events so that they can be
played over and over again for deeper analysis.

12. Semiotics: Here, we determine how signs and symbols are related to their meanings
while they are being constructed. The analysis needs to assume that the meaning is not
inherent, and it comes from other things related to the symbol.

13. Content Analysis: This method is never used with video, and it is qualitative only in the
development of categories. Standard rules of categorization in content analysis include:

The chunk of data to be analyzed at a time (a line, a sentence, a phrase or a paragraph) must
be identified.
Categories must be inclusive and mutually exclusive.
Categories should have precisely defined properties.
All data must fit some category, i.e., the categorization must be exhaustive.
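
The exhaustiveness and mutual-exclusivity rules can be checked mechanically once chunks have been coded. The sketch below is only an illustration: it assumes a hypothetical keyword-based coding frame, whereas in practice chunks are usually assigned to categories by a human coder. The function name check_coding and the sample sentences are invented for the example.

```python
def check_coding(chunks, categories):
    """Assign each chunk to the categories whose keywords it contains.

    Returns (assignments, uncoded, overlapping). Chunks with no category
    break exhaustiveness; chunks with several break mutual exclusivity.
    """
    assignments, uncoded, overlapping = {}, [], []
    for chunk in chunks:
        words = set(chunk.lower().split())
        matches = [name for name, keywords in categories.items() if words & keywords]
        assignments[chunk] = matches
        if not matches:
            uncoded.append(chunk)
        elif len(matches) > 1:
            overlapping.append(chunk)
    return assignments, uncoded, overlapping

# Hypothetical coding frame and field-note sentences (one sentence per chunk).
categories = {
    "activities": {"reading", "writing", "playing"},
    "settings": {"classroom", "hallway", "library"},
}
chunks = [
    "students were reading quietly",
    "the hallway was crowded",
    "playing in the library",
    "the bell rang",
]
assignments, uncoded, overlapping = check_coding(chunks, categories)
print(uncoded)      # ['the bell rang']
print(overlapping)  # ['playing in the library']
```

Any chunk reported as uncoded or overlapping signals that the category definitions need to be revised before the analysis continues.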

14. Phenomenology/Heuristic Analysis: The emphasis is on what things mean to individuals. This
method also emphasizes the effects of the research and the researcher's personal experience of
it. The term "phenomenology" is used here to describe the researcher's experience.

15. Narrative Analysis: Closely related to discourse analysis, this method gives more importance
to interaction. How the narrator chooses to frame the story determines how he or she will be
perceived. Narrators tend to compare ideas about the self while avoiding revealing negatives
about themselves. This analysis can involve the study of literature, journals or folklore.

Thus, it can be observed that data analysis methods have multiple aspects and approaches, along
with diverse techniques and a variety of names. They are used in different domains such as
business, science, and social science. This field is a complex one, and the many methods of data
analysis are not easy to learn without training and practice under expert guidance.
