Improving Marketing Measurement Techniques

A Paradigm for Developing Better Measures of Marketing Constructs

Author(s): Gilbert A. Churchill, Jr.


Source: Journal of Marketing Research , Feb., 1979, Vol. 16, No. 1 (Feb., 1979), pp. 64-73
Published by: Sage Publications, Inc. on behalf of American Marketing Association

This content downloaded from


[Link] on Fri, 01 Mar 2024 [Link] +00:00
All use subject to [Link]
Measure and Construct Validity Studies

GILBERT A. CHURCHILL, JR.*

A critical element in the evolution of a fundamental body of knowledge in marketing, as well as for improved marketing practice, is the development of better measures of the variables with which marketers work. In this article an approach is outlined by which this goal can be achieved and portions of the approach are illustrated in terms of a job satisfaction measure.

A Paradigm for Developing Better Measures of Marketing Constructs

*Gilbert A. Churchill is Professor of Marketing, University of Wisconsin-Madison. The significant contributions of Michael Houston, Shelby Hunt, John Nevin, and Michael Rothschild through their comments on a draft of this article are gratefully acknowledged, as are the many helpful comments of the anonymous reviewers.
The AMA publications policy states: "No article(s) will be published in the Journal of Marketing Research written by the Editor or the Vice President of Publications." The inclusion of this article was approved by the Board of Directors because: (1) the article was submitted before the author took over as Editor, (2) the author played no part in its review, and (3) Michael Ray, who supervised the reviewing process for the special issue, formally requested he be allowed to publish it.

In an article in the April 1978 issue of the Journal of Marketing, Jacoby placed much of the blame for the poor quality of some of the marketing literature on the measures marketers use to assess their variables of interest (p. 91):

    More stupefying than the sheer number of our measures is the ease with which they are proposed and the uncritical manner in which they are accepted. In point of fact, most of our measures are only measures because someone says that they are, not because they have been shown to satisfy standard measurement criteria (validity, reliability, and sensitivity). Stated somewhat differently, most of our measures are no more sophisticated than first asserting that the number of pebbles a person can count in a ten-minute period is a measure of that person's intelligence; next, conducting a study and finding that people who can count many pebbles in ten minutes also tend to eat more; and, finally, concluding from this: people with high intelligence tend to eat more.

Burleigh Gardner, President of Social Research, Inc., makes a similar point with respect to attitude measurement in a recent issue of the Marketing News (May 5, 1978, p. 1):

    Today the social scientists are enamored of numbers and counting . . . Rarely do they stop and ask, "What lies behind the numbers?"

    When we talk about attitudes we are talking about constructs of the mind as they are expressed in response to our questions.

    But usually all we really know are the questions we ask and the answers we get.

Marketers, indeed, seem to be choking on their measures, as other articles in this issue attest. They seem to spend much effort and time operating by the routine which computer technicians refer to as GIGO (garbage in, garbage out). As Jacoby so succinctly puts it, "What does it mean if a finding is significant or that the ultimate in statistical analytical techniques have been applied, if the data collection instrument generated invalid data at the outset?" (1978, p. 90).

What accounts for this gap between the obvious need for better measures and the lack of such measures? The basic thesis of this article is that although the desire may be there, the know-how is not. The situation in marketing seems to parallel the dilemma which psychologists faced more than 20 years ago, when Tryon (1957, p. 229) wrote:

    If an investigator should invent a new psychological test and then turn to any recent scholarly work for
    guidance on how to determine its reliability, he would confront such an array of different formulations that he would be unsure about how to proceed. After fifty years of psychological testing, the problem of discovering the degree to which an objective measure of behavior reliably differentiates individuals is still confused.

Psychology has made progress since that time. Attention has moved beyond simple questions of reliability and now includes more "direct" assessments of validity. Unfortunately, the marketing literature has been slow to reflect that progress. One of the main reasons is that the psychological literature is scattered. The notions are available in many bits and pieces in a variety of sources. There is no overriding framework which the marketer can embrace to help organize the many definitions and measures of reliability and validity into an integrated whole so that the decision as to which to use and when is obvious.

This article is an attempt to provide such a framework. A procedure is suggested by which measures of constructs of interest to marketers can be developed. The emphasis is on developing measures which have desirable reliability and validity properties. Part of the article is devoted to clarifying these notions, particularly those related to validity; reliability notions are well covered by Peter's article in this issue. Finally, the article contains suggestions about approaches on which marketers historically have relied in assessing the quality of measures, but which they would do well to consider abandoning in favor of some newer alternatives. The rationale as to why the newer alternatives are preferred is presented.

THE PROBLEM AND APPROACH

Technically, the process of measurement or operationalization involves "rules for assigning numbers to objects to represent quantities of attributes" (Nunnally, 1967, p. 2). The definition involves two key notions. First, it is the attributes of objects that are measured and not the objects themselves. Second, the definition does not specify the rules by which the numbers are assigned. However, the rigor with which the rules are specified and the skill with which they are applied determine whether the construct has been captured by the measure.

Consider some arbitrary construct, C, such as customer satisfaction. One can conceive at any given point in time that every customer has a "true" level of satisfaction; call this level $X_T$. Hopefully, each measurement one makes will produce an observed score, $X_O$, equal to the object's true score, $X_T$. Further, if there are differences between objects with respect to their $X_O$ scores, these differences would be completely attributable to true differences in the characteristic one is attempting to measure, i.e., true differences in $X_T$. Rarely is the researcher so fortunate. Much more typical is the measurement where the $X_O$ score differences also reflect (Selltiz et al., 1976, p. 164-8):

1. True differences in other relatively stable characteristics which affect the score, e.g., a person's willingness to express his or her true feelings.
2. Differences due to transient personal factors, e.g., a person's mood, state of fatigue.
3. Differences due to situational factors, e.g., whether the interview is conducted in the home or at a central facility.
4. Differences due to variations in administration, e.g., interviewers who probe differently.
5. Differences due to sampling of items, e.g., the specific items used on the questionnaire; if the items or the wording of those items were changed, the $X_O$ scores would also change.
6. Differences due to lack of clarity of measuring instruments, e.g., vague or ambiguous questions which are interpreted differently by those responding.
7. Differences due to mechanical factors, e.g., a check mark in the wrong box or a response which is coded incorrectly.

Not all of these factors will be present in every measurement, nor are they limited to information collected by questionnaire in personal or telephone interviews. They arise also in studies in which self-administered questionnaires or observational techniques are used. Although the impact of each factor on the $X_O$ score varies with the approach, their impact is predictable. They distort the observed scores away from the true scores. Functionally, the relationship can be expressed as:

$$X_O = X_T + X_S + X_R$$

where:

$X_S$ = systematic sources of error such as stable characteristics of the object which affect its score, and
$X_R$ = random sources of error such as transient personal factors which affect the object's score.

A measure is valid when the differences in observed scores reflect true differences on the characteristic one is attempting to measure and nothing else, that is, $X_O = X_T$. A measure is reliable to the extent that independent but comparable measures of the same trait or construct of a given object agree. Reliability depends on how much of the variation in scores is attributable to random or chance errors. If a measure is perfectly reliable, $X_R = 0$. Note that if a measure is valid, it is reliable, but that the converse is not necessarily true because the observed score when $X_R = 0$ could still equal $X_T + X_S$. Thus it is often said that reliability is a necessary but not a sufficient condition for validity. Reliability only provides negative evidence of the validity of the measure. However, the ease with which it can be computed helps explain
its popularity. Reliability is much more routinely reported than is evidence, which is much more difficult to secure but which relates more directly to the validity of the measure.

The fundamental objective in measurement is to produce $X_O$ scores which approximate $X_T$ scores as closely as possible. Unfortunately, the researcher never knows for sure what the $X_T$ scores are. Rather, the measures are always inferences. The quality of these inferences depends directly on the procedures that are used to develop the measures and the evidence supporting their "goodness." This evidence typically takes the form of some reliability or validity index, of which there are a great many, perhaps too many. The analyst working to develop a measure must contend with such notions as split-half, test-retest, and alternate forms reliability as well as with face, content, predictive, concurrent, pragmatic, construct, convergent, and discriminant validity. Because some of these terms are used interchangeably and others are often used loosely, the analyst wishing to develop a measure of some variable of interest in marketing faces difficult decisions about how to proceed and what reliability and validity indices to calculate.

Figure 1 is a diagram of the sequence of steps that can be followed and a list of some calculations that should be performed in developing measures of marketing constructs. The suggested sequence has worked well in several instances in producing measures with desirable psychometric properties (see Churchill et al., 1974, for one example). Some readers will undoubtedly disagree with the suggested process or with the omission of their favorite reliability or validity coefficient. The following discussion, which details both the steps and their rationale, shows that some of these measures should indeed be set aside because there are better alternatives or, if they are used, that they should at least be interpreted with the proper awareness of their shortcomings.

Figure 1
SUGGESTED PROCEDURE FOR DEVELOPING BETTER MEASURES

Step                            Recommended coefficients or techniques
1. Specify domain of construct  Literature search
2. Generate sample of items     Literature search; experience survey; insight stimulating examples; critical incidents; focus groups
3. Collect data
4. Purify measure               Coefficient alpha; factor analysis
5. Collect data
6. Assess reliability           Coefficient alpha; split-half reliability
7. Assess validity              Multitrait-multimethod matrix; criterion validity
8. Develop norms                Average and other statistics summarizing distribution of scores

The process suggested is only applicable to multi-item measures. This deficiency is not as serious as it might appear. Multi-item measures have much to recommend them. First, individual items usually have considerable uniqueness or specificity in that each item tends to have only a low correlation with the attribute being measured and tends to relate to other attributes as well. Second, single items tend to categorize people into a relatively small number of groups. For example, a seven-step rating scale can at most distinguish between seven levels of an attribute. Third, individual items typically have considerable measurement error; they produce unreliable responses in the sense that the same scale position is unlikely to be checked in successive administrations of an instrument.

All three of these measurement difficulties can be diminished with multi-item measures: (1) the specificity of items can be averaged out when they are combined, (2) by combining items, one can make relatively fine distinctions among people, and (3) the reliability tends to increase and measurement error decreases as the number of items in a combination increases.

The folly of using single-item measures is illustrated by a question posed by Jacoby (1978, p. 93):

    "How comfortable would we feel having our intelligence assessed on the basis of our response to a single question?" Yet that's exactly what we do in consumer research. . . . The literature reveals hundreds of instances in which responses to a single question suffice to establish the person's level on the variable of interest and then serves as the basis for extensive analysis and entire articles.

    . . . Given the complexity of our subject matter, what makes us think we can use responses to single items (or even to two or three items) as measures of these concepts, then relate these scores to a host of other variables, arrive at conclusions based on such an investigation, and get away calling what we have done "quality research?"

In sum, marketers are much better served with multi-item than single-item measures of their constructs, and they should take the time to develop them. This conclusion is particularly true for those investigating behavioral relationships from a fundamental
as well as applied perspective, although it applies also to marketing practitioners.

SPECIFY DOMAIN OF THE CONSTRUCT

The first step in the suggested procedure for developing better measures involves specifying the domain of the construct. The researcher must be exacting in delineating what is included in the definition and what is excluded. Consider, for example, the construct customer satisfaction, which lies at the heart of the marketing concept. Though it is a central notion in modern marketing thought, it is also a construct which marketers have not measured in exacting fashion. Howard and Sheth (1969, p. 145), for example, define customer satisfaction as

    . . . the buyer's cognitive state of being adequately or inadequately rewarded in a buying situation for the sacrifice he has undergone. The adequacy is a consequence of matching actual past purchase and consumption experience with the reward that was expected from the brand in terms of its anticipated potential to satisfy the motives served by the particular product class. It includes not only reward from consumption of the brand but any other reward received in the purchasing and consuming process.

Thus, satisfaction by their definition seems to be attitude. Further, in order to measure satisfaction, it seems necessary to measure both expectations at the time of purchase and reactions at some time after purchase. If actual consequences equal or exceed expected consequences, the consumer is satisfied, but if actual consequences fall short of expected consequences, the consumer is dissatisfied.

But what expectations and consequences should the marketer attempt to assess? Certainly one would want to be reasonably exhaustive in the list of product features to be included, incorporating such facets as cost, durability, quality, operating performance, and aesthetic features (Czepiel et al., 1974). But what about purchasers' reactions to the sales assistance they received or subsequent service by independent dealers, as would be needed, for example, after the purchase of many small appliances? What about customer reaction to subsequent advertising or the expansion of the channels of distribution in which the product is available? What about the subsequent availability of competitors' alternatives which serve the same needs or the publishing of information about the environmental effects of using the product? To detail which of these factors would be included or how customer satisfaction should be operationalized is beyond the scope of this article; rather, the example emphasizes that the researcher must be exacting in the conceptual specification of the construct and what is and what is not included in the domain.

It is imperative, though, that researchers consult the literature when conceptualizing constructs and specifying domains. Perhaps if only a few more had done so, one of the main problems cited by Kollat, Engel, and Blackwell as impairing progress in consumer research, namely, the use of widely varying definitions, could have been at least diminished (Kollat et al., 1970, p. 328-9).

Certainly definitions of constructs are means rather than ends in themselves. Yet the use of different definitions makes it difficult to compare and accumulate findings and thereby develop syntheses of what is known. Researchers should have good reasons for proposing additional new measures given the many available for most marketing constructs of interest, and those publishing should be required to supply their rationale. Perhaps the older measures are inadequate. The researcher should make sure this is the case by conducting a thorough review of literature in which the variable is used and should present a detailed statement of the reasons and evidence as to why the new measure is better.

GENERATE SAMPLE OF ITEMS

The second step in the procedure for developing better measures is to generate items which capture the domain as specified. Those techniques that are typically productive in exploratory research, including literature searches, experience surveys, and insight-stimulating examples, are generally productive here (Selltiz et al., 1976). The literature should indicate how the variable has been defined previously and how many dimensions or components it has. The search for ways to measure customer satisfaction would include product brochures, articles in trade magazines and newspapers, or results of product tests such as those published by Consumer Reports. The experience survey is not a probability sample but a judgment sample of persons who can offer some ideas and insights into the phenomenon. In measuring consumer satisfaction, it could include discussions with (1) appropriate people in the product group responsible for the product, (2) sales representatives, (3) dealers, (4) consumers, and (5) persons in marketing research or advertising, as well as (6) outsiders who have a special expertise such as university or government personnel. The insight-stimulating examples could involve a comparison of competitors' products or a detailed examination of some particularly vehement complaints in unsolicited letters about performance of the product. Examples which indicate sharp contrasts or have striking features would be most productive.

Critical incidents and focus groups also can be used to advantage at the item-generation stage. To use the critical incidents technique a large number of scenarios describing specific situations could be made up and a sample of experienced consumers would be asked what specific behaviors (e.g., product changes, warranty handling) would create customer satisfaction or dissatisfaction (Flanagan, 1954; Kerlinger, 1973, p.
536). The scenarios might be presented to the respondents individually or 8 to 10 of them might be brought together in a focus group where the scenarios could be used to trigger open discussion among participants, although other devices might also be employed to promote discourse (Calder, 1977).

The emphasis at the early stages of item generation would be to develop a set of items which tap each of the dimensions of the construct at issue. Further, the researcher probably would want to include items with slightly different shades of meaning because the original list will be refined to produce the final measure. Experienced researchers can attest that seemingly identical statements produce widely different answers. By incorporating slightly different nuances of meaning in statements in the item pool, the researcher provides a better foundation for the eventual measure.

Near the end of the statement development stage the focus would shift to item editing. Each statement would be reviewed so that its wording would be as precise as possible. Double-barreled statements would be split into two single-idea statements, and if that proved impossible the statement would be eliminated altogether. Some of the statements would be recast to be positively stated and others to be negatively stated to reduce "yea-" or "nay-" saying tendencies. The analyst's attention would also be directed at refining those questions which contain an obvious "socially acceptable" response.

After the item pool is carefully edited, further refinement would await actual data. The type of data collected would depend on the type of scale used to measure the construct.

PURIFY THE MEASURE

The calculations one performs in purifying a measure depend somewhat on the measurement model one embraces. The most logically defensible model is the domain sampling model which holds that the purpose of any particular measurement is to estimate the score that would be obtained if all the items in the domain were used (Nunnally, 1967, p. 175-81). The score that any subject would obtain over the whole sample domain is the person's true score, $X_T$.

In practice, though, one does not use all of the items that could be used, but only a sample of them. To the extent that the sample of items correlates with true scores, it is good. According to the domain sampling model, then, a primary source of measurement error is the inadequate sampling of the domain of relevant items.

Basic to the domain sampling model is the concept of an infinitely large correlation matrix showing all correlations among the items in the domain. No single item is likely to provide a perfect representation of the concept, just as no single word can be used to test for differences in subjects' spelling abilities and no single question can measure a person's intelligence. Rather, each item can be expected to have a certain amount of distinctiveness or specificity even though it relates to the concept.

The average correlation in this infinitely large matrix, $\bar{r}$, indicates the extent to which some common core is present in the items. The dispersion of correlations about the average indicates the extent to which items vary in sharing the common core. The key assumption in the domain sampling model is that all items, if they belong to the domain of the concept, have an equal amount of common core. This statement implies that the average correlation in each column of the hypothetical matrix is the same and in turn equals the average correlation in the whole matrix (Ley, 1972, p. 111; Nunnally, 1967, p. 175-6). That is, if all the items in a measure are drawn from the domain of a single construct, responses to those items should be highly intercorrelated. Low interitem correlations, in contrast, indicate that some items are not drawn from the appropriate domain and are producing error and unreliability.

Coefficient Alpha

The recommended measure of the internal consistency of a set of items is provided by coefficient alpha which results directly from the assumptions of the domain sampling model. See Peter's article in this issue for the calculation of coefficient alpha.

Coefficient alpha absolutely should be the first measure one calculates to assess the quality of the instrument. It is pregnant with meaning because the square root of coefficient alpha is the estimated correlation of the k-item test with errorless true scores (Nunnally, 1967, p. 191-6). Thus, a low coefficient alpha indicates the sample of items performs poorly in capturing the construct which motivated the measure. Conversely, a large alpha indicates that the k-item test correlates well with true scores.

If alpha is low, what should the analyst do?¹ If the item pool is sufficiently large, this outcome suggests that some items do not share equally in the common core and should be eliminated. The easiest way to find them is to calculate the correlation of each item with the total score and to plot these correlations by decreasing order of magnitude. Items with correlations near zero would be eliminated. Further, items which produce a substantial or sudden drop in the item-to-total correlations would also be deleted.

¹What is "low" for alpha depends on the purpose of the research. For early stages of basic research, Nunnally (1967) suggests reliabilities of .50 to .60 suffice and that increasing reliabilities beyond .80 is probably wasteful. In many applied settings, however, where important decisions are made with respect to specific test scores, "a reliability of .90 is the minimum that should be tolerated, and a reliability of .95 should be considered the desirable standard" (p. 226).


If the construct had, say, five identifiable dimensions or components, coefficient alpha would be calculated for each dimension. The item-to-total correlations used to delete items would also be based on the items in the component and the total score for that dimension. The total score for the construct would be secured by summing the total scores for the separate components. The reliability of the total construct would not be measured through coefficient alpha, but rather through the formula for the reliability of linear combinations (Nunnally, 1967, p. 226-35).

Some analysts mistakenly calculate split-half reliability to assess the internal homogeneity of the measure. That is, they divide the measure into two halves. The first half may be composed of all the even-numbered items, for example, and the second half all the odd-numbered items. The analyst then calculates a total score for each half and correlates these total scores across subjects. The problem with this approach is that the size of this correlation depends on the way the items are split to form the two halves. With, say, 10 items (a very small number for most measurements), there are 126 possible splits.² Because each of these possible divisions will likely produce a different coefficient, what is the split-half reliability? Further, as the average of all of these coefficients equals coefficient alpha, why not calculate coefficient alpha in the first place? It is almost as easy, is not arbitrary, and has an important practical connotation.

²The number of possible splits with $2n$ items is given by the formula $\frac{(2n)!}{2(n!)(n!)}$ (Bohrnstedt, 1970). For the example cited, $n = 5$ and the formula reduces to $\frac{10!}{2(5!)(5!)} = 126$.

Factor Analysis

Some analysts like to perform a factor analysis on the data before doing anything else in the hope of determining the number of dimensions underlying the construct. Factor analysis can indeed be used to suggest dimensions, and the marketing literature is replete with articles reporting such use. Much less prevalent is its use to confirm or refute components isolated by other means. For example, in discussing a test composed of items tapping two common factors, verbal fluency and number facility, Campbell (1976, p. 194) comments:

    Recognizing multidimensionality when we see it is not always an easy task. For example, rules for when to stop extracting factors are always arbitrary in some sense. Perhaps the wisest course is to always make the comparison between the split half and internal consistency estimates after first splitting the components into two halves on a priori grounds. That is, every effort should be made to balance the factor content of each half [part] before looking at component intercorrelations.

When factor analysis is done before the purification steps suggested heretofore, there seems to be a tendency to produce many more dimensions than can be conceptually identified. This effect is partly due to the "garbage items" which do not have the common core but which do produce additional dimensions in the factor analysis. Though this application may be satisfactory during the early stages of research on a construct, the use of factor analysis in a confirmatory fashion would seem better at later stages. Further, theoretical arguments support the iterative process of the calculation of coefficient alpha, the elimination of items, and the subsequent calculation of alpha until a satisfactory coefficient is achieved. Factor analysis then can be used to confirm whether the number of dimensions conceptualized can be verified empirically.

Iteration

The foregoing procedure can produce several outcomes. The most desirable outcome occurs when the measure produces a satisfactory coefficient alpha (or alphas if there are multiple dimensions) and the dimensions agree with those conceptualized. The measure is then ready for some additional testing for which a new sample of data should be collected. Second, factor analysis sometimes suggests that dimensions which were conceptualized as independent clearly overlap. In this case, the items which have pure loadings on the new factor can be retained and a new alpha calculated. If this outcome is satisfactory, additional testing with new data is indicated.

The third and least desirable outcome occurs when the alpha coefficient(s) is too low and restructuring of the items forming each dimension is unproductive. In this case, the appropriate strategy is to loop back to steps 1 and 2 and repeat the process to ascertain what might have gone wrong. Perhaps the construct was not appropriately delineated. Perhaps the item pool did not sample all aspects of the domain. Perhaps the emphases within the measure were somehow distorted in editing. Perhaps the sample of subjects was biased, or the construct so ambiguous as to defy measurement. The last conclusion would suggest a fundamental change in strategy, starting with a rethinking of the basic relationships that motivated the investigation in the first place.

ASSESS RELIABILITY WITH NEW DATA

The major source of error within a test or measure is the sampling of items. If the sample is appropriate and the items "look right," the measure is said to have face or content validity. Adherence to the steps suggested will tend to produce content valid measures. But that is not the whole story! What about transient personal factors, or ambiguous questions which pro-
This content downloaded from


[Link] on Fri, 01 Mar 2024 [Link] +00:00
All use subject to [Link]
70 JOURNAL OF MARKETING RESEARCH, FEBRUARY 1979

duce guessing, or any of the other extraneous influ- consistent or internally homogeneous set of items.
ences, other than the sampling of items, which tend Consistency is necessary but not sufficient for con-
to produce error in the measure? struct validity (Niunnally, 1967, p. 92).
Interestingly, all of the errors that occur within a Rather, to establish the construct validity of a
test can be easily encompassed by the domain sampling measure, the analyst also must determine (1) the extent
model. All the sources of error occurring within a to which the measure correlates with other measures
measurement will tend to lower the average correlation designed to measure the same thing and (2) whether
among the items within the test, but the average the measure behaves as expected.
correlation is all that is needed to estimate the reliabil-
Correlations With Other Measures
ity. Suppose, for example, that one of the items is
vague and respondents have to guess its meaning. A fundamental principle in science is that any
This guessing will tend to lower coefficient alpha, particular construct or trait should be measurable by
suggesting there is error in the measurement. Subse- at least two, and preferably more, different methods.
quent calculation of item-to-total correlations will then Otherwise the researcher has no way of knowing
suggest this item for elimination. whether the trait is anything but an artifact of the
Coefficient alpha is the basic statistic for determin- measurement procedure. All the measurements of the
ing the reliability of a measure based on internal trait may not be equally good, but science continually
consistency. Coefficient alpha does not adequately emphasizes improvement of the measures of the vari-
estimate, though, errors caused by factors external ables with which it works. Evidence of the convergent
to the instrument, such as differences in testing situa- validity of the measure is provided by the extent to
tions and respondents over time. If the researcher which it correlates highly with other methods designed
wants a reliability coefficient which assesses the to measure the same construct.
between-test error, additional data must be collected. The measures should have not only convergent
It is also advisable to collect additional data to rule validity, but also discriminant validity. Discriminant
out the possibility that the previous findings are due validity is the extent to which the measure is indeed
to chance. If the construct is more than a measurement novel and not simply a reflection of some other
artifact, it should be reproduced when the purified variable. As Campbell and Fiske (1959) persuasively
sample of items is submitted to a new sample of argue, "Tests can be invalidated by too high correla-
subjects. tions with other tests from which they were intended
Because Peter's article treats the assessment of to differ" (p. 81). Quite simply, scales that correlate
reliability, it is not examined here except to suggest too highly may be measuring the same rather than
that test-retest reliability should not be used. The basicdifferent constructs. Discriminant validity is indicated
problem with straight test-retest reliability is respon- by "predictably low correlations between the measure
dents' memories. They will tend to reply to an item of interest and other measures that are supposedly
the same way in a second administration as theynot didmeasuring the same variable or concept" (Heeler
in the first. Thus, even if an analyst were to put and Ray, p. 362).
together an instrument in which the items correlate A useful way of assessing the convergent and
poorly, suggesting there is no common core and thus discriminant validity of a measure is through the
no construct, it is possible and even probable that multitrait-multimethod matrix, which is a matrix of
the responses to each item would correlate well across zero order correlations between different traits when
the two measurements. The high correlation of the each of the traits is measured by different methods
total scores on the two tests would suggest the measure (Campbell and Fiske, 1959). Table 1, for example, is
had small measurement error when in fact very little the matrix for a Likert type of measure designed to
is demonstrated about validity by straight test-retest assess salesperson job satisfaction (Churchill et al.,
correlations. 1974). The four essential elements of a multitrait-
multimethod matrix are identified by the numbers in
ASSESS CONSTRUCT VALIDITY
the upper left corner of each partitioned segment.
Specifying the domain of the construct, generating Only the reliability diagonal (1) corresponding to
items that exhaust the domain, and subsequently the Likert measure is shown; data were not collected
purifying the resulting scale should produce a measure for the thermometer scale because it was not of interest
which is content or face valid and reliable. It may itself. The entries reflect the reliability of alternate
or may not produce a measure which has construct forms administered two weeks apart. If these are
validity. Construct validity, which lies at the very unavailable, coefficient alpha can be used.
heart of the scientific process, is most directly related Evidence about the convergent validity of a measure
to the question of what the instrument is in fact is provided in the validity diagonal (3) by the extent
measuring-what construct, trait, or concept underlies to which the correlations are significantly different
a person's performance or score on a measure. from zero and sufficiently large to encourage further
The preceding steps should produce an internally examination of validity. The validity coefficients in

Table 1 of .450, .395 and .464 are all significant at the .01 level.

Table 1
MULTITRAIT-MULTIMETHOD MATRIX

                                   Method 1--Likert Scale              Method 2--Thermometer Scale
                                   Job           Role      Role        Job           Role      Role
                                   Satisfaction  Conflict  Ambiguity   Satisfaction  Conflict  Ambiguity
Method 1--Likert Scale
  Job Satisfaction
  Role Conflict
  Role Ambiguity
Method 2--Thermometer Scale
  Job Satisfaction                 - 5 .4 082 -.0546
  Role Conflict                    -.239 4
  Role Ambiguity                   -.252 .141 .464

Discriminant validity, however, suggests three comparisons, namely that:

1. Entries in the validity diagonal (3) should be higher than the correlations that occupy the same row and column in the heteromethod block (4). This is a minimum requirement as it simply means that the correlation between two different measures of the same variable should be higher than the correlations "between that variable and any other variable which has neither trait nor method in common" (Campbell and Fiske, 1959, p. 82). The entries in Table 1 satisfy this condition.
2. The validity coefficients (3) should be higher than the correlations in the heterotrait-monomethod triangles (2) which suggests that the correlation within a trait measured by different methods must be higher than the correlations between traits which have method in common. It is a more stringent requirement than that involved in the heteromethod comparisons of step 1 as the off-diagonal elements in the monomethod blocks may be high because of method variance. The evidence in Table 1 is consistent with this requirement.
3. The pattern of correlations should be the same in all of the heterotrait triangles, e.g., both (2) and (4). This requirement is a check on the significance of the traits when compared to the methods and can be achieved by rank ordering the correlation coefficients in each heterotrait triangle; though a visual inspection often suffices, a rank order correlation coefficient such as the coefficient of concordance can be computed if there are a great many comparisons.

The last requirement is generally, though not completely, satisfied by the data in Table 1. Within each heterotrait triangle, the pairwise correlations are consistent in sign. Further, when the correlations within each heterotrait triangle are ranked from largest positive to largest negative, the same order emerges except for the lower left triangle in the heteromethod block. Here the correlation between job satisfaction and role ambiguity is higher, i.e., less negative, than that between job satisfaction and role conflict whereas the opposite was true in the other three heterotrait triangles (see Ford et al., 1975, p. 107, as to why this single violation of the desired pattern may not represent a serious distortion in the measure).

Ideally, the methods and traits generating the multitrait-multimethod matrix should be as independent as possible (Campbell and Fiske, 1959, p. 103). Sometimes the nature of the trait rules out the opportunity for measuring it by different methods, thus introducing the possibility of method variance. When this situation arises, the researcher's efforts should be directed to obtaining as much diversity as possible in terms of data sources and scoring procedures. If the traits are not independent, the monomethod correlations will be large and the heteromethod correlations between traits will also be substantial, and the evidence about
the discriminant validity of the measure will not be as easily established as when they are independent. Thus, Campbell and Fiske (1959, p. 103) suggest that it is preferable to include at least two sets of independent traits in the matrix.

Does the Measure Behave as Expected?

Internal consistency is a necessary but insufficient condition for construct validity. The observables may all relate to the same construct, but that does not prove that they relate to the specific construct that motivated the research in the first place. A suggested final step is to show that the measure behaves as expected in relation to other constructs. Thus one often tries to assess whether the scale score can differentiate the positions of "known groups" or whether the scale correctly predicts some criterion measure (criterion validity). Does a salesperson's job satisfaction, as measured by the scale, for example, relate to the individual's likelihood of quitting? It should, according to what is known about dissatisfied employees; if it does not, then one might question the quality of the measure of salesperson job satisfaction. Note, though, there is circular logic in the foregoing argument. The argument rests on four separate propositions (Nunnally, 1967, p. 93):

1. The constructs job satisfaction (A) and likelihood of quitting (B) are related.
2. The scale X provides a measure of A.
3. Y provides a measure of B.
4. X and Y correlate positively.

Only the fourth proposition is directly examined with empirical data. To establish that X truly measures A, one must assume that propositions 1 and 3 are correct. One must have a good measure for B, and the theory relating A and B must be true. Thus, the analyst tries to establish the construct validity of a measure by relating it to a number of other constructs and not simply one. Further, one also tries to use those theories and hypotheses which have been sufficiently well scrutinized to inspire confidence in their probable truth. Thus, job satisfaction would not be related to job performance because there is much disagreement about the relationship between these constructs (Schwab and Cummings, 1970).

DEVELOPING NORMS

Typically, a raw score on a measuring instrument used in a marketing investigation is not particularly informative about the position of a given object on the characteristic being measured because the units in which the scale is expressed are unfamiliar. For example, what does a score of 350 on a 100-item Likert scale with 1-5 scoring imply about a salesperson's job satisfaction? One would probably be tempted to conclude that because the neutral position is 3, a 350 score with 100 statements implies slightly positive attitude or satisfaction. The analyst should be cautious in making such an interpretation, though. Suppose the 350 score represents the highest score ever achieved on this instrument. Suppose it represents the lowest score. Clearly there is a difference.

A better way of assessing the position of the individual on the characteristic is to compare the person's score with the score achieved by other people. The technical name for this process is "developing norms," although it is something everyone does implicitly every day. Thus, by saying a person "sure is tall," one is saying the individual is much taller than others encountered previously. Each person has a mental standard of what is average, and classifies people as tall or short on the basis of how they compare with this mental standard.

In psychological measurement, such processes are formalized by making the implicit standards explicit. More particularly, meaning is imputed to a specific score in unfamiliar units by comparing it with the total distribution of scores, and this distribution is summarized by calculating a mean and standard deviation as well as other statistics such as centile rank of any particular score (see Ghiselli, 1964, pp. 37-102, for a particularly lucid and compelling argument about the need and the procedures for norm development).

Norm quality is a function of both the number of cases on which the average is based and their representativeness. The larger the number of cases, the more stable will be the norms and the more definitive will be the conclusions that can be drawn, if the sample is representative of the total group the norms are to represent. Often it proves necessary to develop distinct norms for separate groups, e.g., by sex or by occupation. The need for such norms is particularly common in basic research, although it sometimes arises in applied marketing research as well.

Note that norms need not be developed if one wants only to compare salespersons i and j to determine who is more satisfied, or to determine how a particular individual's satisfaction has changed over time. For these comparisons, all one needs to do is compare the raw scores.

SUMMARY AND CONCLUSIONS

The purpose of this article is to outline a procedure which can be followed to develop better measures of marketing variables. The framework represents an attempt to unify and bring together in one place the scattered bits of information on how one goes about developing improved measures and how one assesses the quality of the measures that have been advanced.

Marketers certainly need to pay more attention to measure development. Many measures with which marketers now work are woefully inadequate, as the many literature reviews suggest. Despite the time and dollar costs associated with following the process suggested here, the payoffs with respect to the generation of a core body of knowledge are substantial. As Torgerson (1958) suggests in discussing the ordering of the various sciences along a theoretical-correlational continuum (p. 2):

It is more than a mere coincidence that the sciences would order themselves in largely the same way if they were classified on the basis to which satisfactory measurement of their important variables has been achieved. The development of a theoretical science . . . would seem to be virtually impossible unless its variables can be measured adequately.

Progress in the development of marketing as a science certainly will depend on the measures marketers develop to estimate the variables of interest to them (Bartels, 1951; Buzzell, 1963; Converse, 1945; Hunt, 1976).

Persons doing research of a fundamental nature are well advised to execute the whole process suggested here. As scientists, marketers should be willing to make this commitment to "quality research." Those doing applied research perhaps cannot "afford" the execution of each and every stage, although many of their conclusions are then likely to be nonsense, one-time relationships. Though the point could be argued at length, researchers doing applied work and practitioners could at least be expected to complete the process through step 4. The execution of steps 1-4 can be accomplished with one-time, cross-sectional data and will at least indicate whether one or more isolatable traits are being captured by the measures as well as the quality with which these traits are being assessed. At a minimum the execution of steps 1-4 should reduce the prevalent tendency to apply extremely sophisticated analysis to faulty data and thereby execute still another GIGO routine. And once steps 1-4 are done, data collected with each application of the measuring instrument can provide more and more evidence related to the other steps. As Ray points out in the introduction to this issue, marketing researchers are already collecting data relevant to steps 5-8. They just need to plan data collection and analysis more carefully to contribute to improved marketing measures.

REFERENCES

Bartels, Robert. "Can Marketing Be a Science?," Journal of Marketing, 15 (January 1951), 319-28.
Bohrnstedt, George W. "Reliability and Validity Assessment in Attitude Measurement," in Gene F. Summers, ed., Attitude Measurement. Chicago: Rand McNally and Company, 1970, 80-99.
Buzzell, Robert D. "Is Marketing a Science," Harvard Business Review, 41 (January-February 1963), 32-48.
Calder, Bobby J. "Focus Groups and the Nature of Qualitative Marketing Research," Journal of Marketing Research, 14 (August 1977), 353-64.
Campbell, Donald T. and Donald W. Fiske. "Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix," Psychological Bulletin, 56 (1959), 81-105.
Campbell, John P. "Psychometric Theory," in Marvin D. Dunnette, ed., Handbook of Industrial and Organizational Psychology. Chicago: Rand McNally, Inc., 1976, 185-222.
Churchill, Gilbert A., Jr., Neil M. Ford, and Orville C. Walker, Jr. "Measuring the Satisfaction of Industrial Salesmen," Journal of Marketing Research, 11 (August 1974), 254-60.
Converse, Paul D. "The Development of a Science in Marketing," Journal of Marketing, 10 (July 1945), 14-23.
Cronbach, L. J. "Coefficient Alpha and the Internal Structure of Tests," Psychometrika, 16 (1951), 297-334.
Czepiel, John A., Larry J. Rosenberg, and Adebayo Akerale. "Perspectives on Consumer Satisfaction," in Ronald C. Curhan, ed., 1974 Combined Proceedings. Chicago: American Marketing Association, 1974, 119-23.
Flanagan, J. "The Critical Incident Technique," Psychological Bulletin, 51 (1954), 327-58.
Ford, Neil M., Orville C. Walker, Jr. and Gilbert A. Churchill, Jr. "Expectation-Specific Measures of the Intersender Conflict and Role Ambiguity Experienced by Industrial Salesmen," Journal of Business Research, 3 (April 1975), 95-112.
Gardner, Burleigh B. "Attitude Research Lacks System to Help It Make Sense," Marketing News, 11 (May 5, 1978), 1+.
Ghiselli, Edwin E. Theory of Psychological Measurement. New York: McGraw-Hill Book Company, 1964.
Heeler, Roger M. and Michael L. Ray. "Measure Validation in Marketing," Journal of Marketing Research, 9 (November 1972), 361-70.
Howard, John A. and Jagdish N. Sheth. The Theory of Buyer Behavior. New York: John Wiley & Sons, Inc., 1969.
Hunt, Shelby D. "The Nature and Scope of Marketing," Journal of Marketing, 40 (July 1976), 17-28.
Jacoby, Jacob. "Consumer Research: A State of the Art Review," Journal of Marketing, 42 (April 1978), 87-96.
Kerlinger, Fred N. Foundations of Behavioral Research, 2nd ed. New York: Holt, Rinehart, Winston, Inc., 1973.
Kollat, David T., James F. Engel, and Roger D. Blackwell. "Current Problems in Consumer Behavior Research," Journal of Marketing Research, 7 (August 1970), 327-32.
Ley, Philip. Quantitative Aspects of Psychological Assessment. London: Gerald Duckworth and Company, Ltd., 1972.
Nunnally, Jum C. Psychometric Theory. New York: McGraw-Hill Book Company, 1967.
Schwab, D. P. and L. L. Cummings. "Theories of Performance and Satisfaction: A Review," Industrial Relations, 9 (1970), 408-30.
Selltiz, Claire, Lawrence S. Wrightsman, and Stuart W. Cook. Research Methods in Social Relations, 3rd ed. New York: Holt, Rinehart, and Winston, 1976.
Torgerson, Warren S. Theory and Methods of Scaling. New York: John Wiley & Sons, Inc., 1958.
Tryon, Robert C. "Reliability and Behavior Domain Validity: Reformulation and Historical Critique," Psychological Bulletin, 54 (May 1957), 229-49.
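
As a supplement to the article's discussion of split-half reliability, coefficient alpha, and item-to-total correlations, the following Python sketch counts the possible split halves of a 10-item scale, computes coefficient alpha, and flags the item with the weakest item-to-total correlation. It is only an illustration under stated assumptions: the response matrix is invented, the function names (`n_split_halves`, `coefficient_alpha`, `item_rest_correlations`) are hypothetical, and the "corrected" item-to-total correlation used here (each item against the sum of the remaining items) is one common variant that the article does not itself specify.

```python
from math import comb, sqrt


def n_split_halves(n_items):
    """Number of distinct ways to split an even number of items into two equal halves."""
    if n_items % 2 != 0:
        raise ValueError("n_items must be even")
    # (2n)! / (2 * n! * n!) is the same quantity as C(2n, n) / 2.
    return comb(n_items, n_items // 2) // 2


def sample_variance(values):
    """Unbiased sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def coefficient_alpha(responses):
    """Coefficient alpha for a respondents-by-items matrix of scores."""
    k = len(responses[0])
    item_variances = [sample_variance([row[j] for row in responses]) for j in range(k)]
    total_variance = sample_variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def item_rest_correlations(responses):
    """Correlate each item with the sum of the remaining items."""
    k = len(responses[0])
    correlations = []
    for j in range(k):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        correlations.append(pearson(item, rest))
    return correlations


# Hypothetical 1-5 Likert responses: six respondents (rows) by four items (columns).
# The fourth item is deliberately noisy, as if respondents had to guess its meaning.
RESPONSES = [
    [1, 2, 1, 3],
    [2, 2, 2, 5],
    [3, 3, 3, 1],
    [4, 4, 4, 4],
    [5, 4, 5, 2],
    [5, 5, 5, 3],
]

alpha_all = coefficient_alpha(RESPONSES)
correlations = item_rest_correlations(RESPONSES)
weakest = min(range(len(correlations)), key=lambda j: correlations[j])
purified = [[score for j, score in enumerate(row) if j != weakest] for row in RESPONSES]
alpha_purified = coefficient_alpha(purified)

print("possible split halves for 10 items:", n_split_halves(10))
print("alpha with all items:", round(alpha_all, 3))
print("item-to-total correlations:", [round(r, 3) for r in correlations])
print("weakest item (0-based index):", weakest)
print("alpha after dropping it:", round(alpha_purified, 3))
```

With this made-up data the noisy fourth item is the candidate for elimination, and alpha rises once it is dropped, which mirrors the purification logic described in the article.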

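
The point made under DEVELOPING NORMS, that a raw score of 350 means little until it is compared with the scores of other people, can also be illustrated with a short Python sketch. Both norm groups below are invented, the helper names `standard_score` and `centile_rank` are hypothetical, and counting ties as half a case in the centile rank is one common convention assumed here, not something the article prescribes.

```python
from statistics import mean, stdev


def standard_score(score, norm_scores):
    """Distance of a raw score from the norm-group mean, in standard deviation units."""
    return (score - mean(norm_scores)) / stdev(norm_scores)


def centile_rank(score, norm_scores):
    """Percentage of the norm group scoring below the raw score (ties count as half)."""
    below = sum(1 for s in norm_scores if s < score)
    ties = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)


# Two hypothetical norm groups for a 100-item scale with 1-5 scoring.
LOW_SCORING_GROUP = [250, 270, 290, 300, 310, 320, 330, 340]
HIGH_SCORING_GROUP = [360, 380, 390, 400, 410, 420, 430, 450]
RAW_SCORE = 350

for label, group in (("low-scoring norms", LOW_SCORING_GROUP),
                     ("high-scoring norms", HIGH_SCORING_GROUP)):
    print(label,
          "| standard score:", round(standard_score(RAW_SCORE, group), 2),
          "| centile rank:", centile_rank(RAW_SCORE, group))
```

The same raw score sits at the top of one hypothetical norm group and at the bottom of the other, which is the contrast the article draws between a 350 that is the highest score ever achieved and a 350 that is the lowest.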