Pid 1
What is Measurement?
Measurement is defined as the systematic assignment of numerals to observed
phenomena (King & Minium, 2018). It entails the quantification of specific attributes of objects
or events, thereby enabling comparisons across different entities (Britannica, 2023). In
psychological science, measurement constitutes the foundational basis for testing and behavioral
analysis. Through the application of standardized tools and procedures, numerical values are
attributed to various constructs or occurrences. Psychology, as a discipline, is primarily
concerned with the study of human behavior and its underlying mechanisms. To explore the
determinants of behavior, researchers employ systematic methodologies that facilitate the
controlled observation of behavioral patterns. One such methodology is the experiment, which
involves the structured observation of behavior under rigorously controlled conditions (Urbina,
2004). The act of creating and evaluating instruments to gauge psychological traits is known as
psychological measurement. Intelligence, attitudes and ideas, emotions, and personality are some
examples of the attributes that are measured. Psychological measurement is
systematic, objective, and standardized. It also supports an understanding of
an individual's thoughts, feelings, emotions, and behaviours, and it allows psychologists to
understand contexts and underpinnings of human cognition and behavior (Morgan & King,
2017).
Scales of Measurement
Measurement assigns numbers to observations meaningfully. However, different
variables follow different measurement rules. As per Mangal (2004), the psychologist S. S.
Stevens (1946) identified four main scales of measurement:
Nominal Scale
The nominal scale is the most basic type of measurement, where elements such as objects
or individuals are grouped based on similarities or differences in characteristics. For instance, if
eye color is the variable, individuals may be categorized as blue-eyed, brown-eyed, or
green-eyed. Further, numbers or symbols may be used to distinguish individuals within these
categories.
However, nominal scales allow only for statements of equality or difference and do not
measure quality. For example, the numbers assigned to players on a team do not indicate their
skill level; they only differentiate team members (Mangal, 2010).
Ordinal Scale
Ordinal scales provide a more structured classification than nominal scales by ranking
individuals or objects based on merit, quality, or performance. For example, students may be
ranked as first, second, or third in a class based on their academic performance.
However, a major limitation of ordinal scales is that the difference between ranks is not
necessarily equal. For instance, the gap in achievement between the first and second positions
may differ from that between the second and third. While ordinal scales establish order, they do
not provide precise measurements. These scales are widely used in psychology, education, and
sociology, where statistical tools like percentiles and median rankings help analyze data (Mangal,
2010).
Interval Scale
An interval scale not only categorizes and ranks but also ensures that the intervals
between values are equal. A key feature of this scale is that it lacks a true zero point. Examples
include temperature measured in Celsius or Fahrenheit and scores on intelligence tests.
In such scales, the zero point is often arbitrary. For example, a temperature of 40°C is not
twice as hot as 20°C because the scale does not start from an absolute zero. Similarly,
intelligence test scores do not indicate an absolute absence of intelligence. Because of this
limitation, measurements in psychology and education often rely on approximations rather than
exact calculations (Mangal, 2010).
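The point about ratios on an interval scale can be checked with a short sketch. It assumes the standard Celsius-to-Kelvin offset of 273.15, which is not given in the text above, and the helper name `celsius_to_kelvin` is hypothetical.

```python
# Offset between the Celsius zero point and absolute zero (assumed constant,
# not taken from the text above).
KELVIN_OFFSET = 273.15


def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale Celsius reading to the ratio-scale Kelvin reading."""
    return celsius + KELVIN_OFFSET


# Dividing Celsius values treats the arbitrary zero as if it were a true zero.
naive_ratio = 40 / 20

# On the Kelvin scale, which has a true zero, the ratio is far smaller than 2.
true_ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)

# Differences, unlike ratios, are preserved: intervals are meaningful on both scales.
celsius_gap = 40 - 20
kelvin_gap = celsius_to_kelvin(40) - celsius_to_kelvin(20)

print(f"naive ratio: {naive_ratio:.3f}, ratio on a true-zero scale: {true_ratio:.3f}")
print(f"gap in Celsius: {celsius_gap}, gap in kelvins: {kelvin_gap:.2f}")
```

The ratio computed on the true-zero scale is only about 1.07, while the 20-degree gap is the same on both scales, which is why interval data support statements about differences but not about ratios.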
Ratio Scale
The ratio scale is the most advanced type of measurement, offering equal intervals
between values and incorporating a true zero point. This means that values can be compared in
absolute terms. Examples include length, weight, height, and volume.
2. Behavior Sampling: Tests are just limited samples of behavior due to time constraints
and practical limits, even when focusing on a narrow domain. What matters is how well
this sample reflects the broader set of relevant behaviors.
3. Scoring and Classification: Tests yield scores or categories. As Thorndike (1918) put it,
whatever exists at all exists in some amount, and McCall (1939) added that anything that
exists in amount can be measured. Psychological tests, like measures in the physical sciences, use
numbers to represent traits or abilities. A test either produces a score or shows whether
someone fits into a particular group.
4. Norms and Standards: To interpret a test score, it must be compared with a reference
group—this is where norms come in. Norms summarize how a large, representative
group performed on the test. This standardization group must reflect the population the
test is intended for. Without this, it’s hard to judge an individual’s performance. Norms
provide an average and show how often various scores occur, helping assess how typical
or unusual a test score is.
5. Predictive Use: The main goal of a test is often to forecast behaviors not directly
measured by the test itself. In many cases, the real interest lies in the behaviors the test
can predict rather than the answers given during testing. Whether a test can successfully
predict these behaviors is determined through thorough research conducted post-release.
Psychological testing has evolved across different historical periods,
as detailed by Gregory (2015) and Urbina (2004).
Physiognomy is the idea that a person’s character can be inferred from their physical
features, a concept introduced by Aristotle in the 4th century BCE and later expanded by Johann
Lavater in the 18th century. Phrenology, developed by Franz Joseph Gall, proposed that different
brain areas were responsible for various traits and faculties, which could be assessed by
examining skull shape. In 1931, Henry C. Lavery invented the psychograph, a device aimed at
analyzing these traits. Despite early interest, the psychograph lost popularity by the mid-1930s
(Gregory, 2015).
In the late 19th century, psychology began transitioning from introspective approaches to
empirical, replicable methods. Researchers employed brass instruments to measure reaction
times and sensory thresholds. While these objective methods marked progress, they eventually
proved inadequate, as early psychologists incorrectly associated sensory processing speed with
intelligence. This era coincided with Wilhelm Wundt's founding of the first psychological
laboratory in Leipzig, Germany, where he also developed the "thought meter" (Gregory, 2015).
Sir Francis Galton, a trailblazer in experimental psychology, aimed to quantify traits like
beauty, personality, and the efficacy of prayer. In 1884, he established an anthropometric laboratory in London
and tested over 17,000 individuals using simple sensory and motor tasks. Though these methods
failed to accurately measure intelligence, Galton's work demonstrated the feasibility of objective
testing and the use of standardized procedures to yield meaningful results (Gregory, 2015).
James McKeen Cattell expanded upon Galton’s work by introducing the term “mental
test” and promoting the systematic study of individual differences in cognitive responses. He
viewed mental and physical energy as intertwined. However, in the early 20th century, the
limitations of reaction time as an indicator of intelligence became evident. Clark Wissler’s
research showed no significant correlation between such test scores and academic performance,
which led to broader acceptance of Alfred Binet’s approach. Binet introduced his intelligence
scale in 1905, emphasizing higher mental processes, and H. H. Goddard later adapted it for use
in the United States (Gregory, 2015).
Rating scales are commonly used in psychology to quantify subjective variables. Their
roots can be traced to the Greco-Roman physician Galen, who introduced a nine-point scale
based on hot-cold characteristics. Christian Thomasius was the first to apply rating scales in
psychology, employing judges to assess individuals on a 12-point scale and publishing
quantitative data from five cases (Gregory, 2015).
By the late 19th century, distinctions began to be made between emotional disturbances
and intellectual disabilities. Previously, individuals with such conditions were often subjected to
harsh treatments. However, growing humanistic perspectives led to increased interest in the
diagnosis and support of people with intellectual disabilities. French physicians J. E. D. Esquirol
and O. E. Seguin significantly influenced this shift, paving the way for Binet’s development of
diagnostic intelligence testing (Gregory, 2015).
J. E. D. Esquirol was among the first to distinguish between mental illness and
intellectual disability. He argued that while mental illness typically emerged suddenly in
adulthood and could be treated, intellectual disability was a lifelong developmental condition
that was largely irreversible. Esquirol emphasized language ability as a key diagnostic indicator,
a focus that likely influenced Binet's later emphasis on verbal components in intelligence testing.
He proposed an early classification system for intellectual disability based on verbal ability:
individuals who could only cry, those who used monosyllabic words, and those capable of using
short phrases. These categories closely align with what are now known as profound, severe, and
moderate intellectual disability (Gregory, 2015).
Alfred Binet, along with Victor Henri, proposed in 1896 that intelligence should be
assessed through higher-level cognitive processes. By 1902, Binet and Simon began adapting a
set of diagnostic tests by Dr. Blin and M. Damaye to better identify children with intellectual
disabilities. In 1905, they introduced the first formal scale for evaluating children's intelligence.
Their aim was to identify, not quantify, children needing specialized education. Although their
approach lacked precision in measurement, it proved effective in assigning students to
appropriate educational settings (Gregory, 2015).
Binet and Simon revised their scale in 1908, removing overly simple tasks and
incorporating more complex items. The concept of “mental level” was introduced. A third
revision in 1911 standardized the scale with five tasks per age level and extended it to adults.
Binet introduced the idea of "mental age" to represent a child’s intellectual performance relative
to their chronological age. In 1916, Lewis Terman and colleagues at Stanford University adapted
the Binet scale into the Stanford-Binet Intelligence Scale and introduced the term “IQ”
(intelligence quotient), calculated by multiplying the mental-to-chronological age ratio by 100.
Despite the popularity of IQ, Binet’s colleague Simon criticized this interpretation as a departure
from their original intent (Gregory, 2015).
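The ratio IQ described above can be written as a one-line calculation. The sketch below is illustrative only; the function name `ratio_iq` and the example ages are assumptions, not values from Gregory (2015).

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: mental age divided by chronological age, multiplied by 100."""
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return (mental_age / chronological_age) * 100


# A child of 8 who performs like a typical 10-year-old.
print(ratio_iq(mental_age=10, chronological_age=8))  # 125.0

# A child whose mental age matches their chronological age.
print(ratio_iq(mental_age=8, chronological_age=8))   # 100.0
```

A score of 100 therefore marks performance that is typical for the child's age, and scores above or below 100 mark performance ahead of or behind same-age peers.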
Henry H. Goddard first translated the Binet-Simon scale into English in 1906. He made
slight modifications to adapt the scale for American use and applied it to 378 institutionalized
individuals. He classified them using terms such as “idiot,” “imbecile,” and
“feebleminded”—terms now considered outdated and offensive. Goddard also tested 1,547
typical children and labeled those whose mental age was four or more years below their
chronological age as feebleminded. He advocated for the segregation of such individuals to
prevent what he viewed as negative societal impacts (Gregory, 2015).
Leta Stetter Hollingworth contributed significantly to the study of giftedness using the
Stanford-Binet IQ test. She found that children with IQs around 165 outperformed those with
scores near 146, even though both were considered highly intelligent. Hollingworth challenged
prevailing beliefs that gifted children should not be accelerated in school. She proposed the
creation of a revolving fund to support their development. A feminist, she attributed differences
in male and female achievements to sociocultural influences rather than inherent ability
(Gregory, 2015).
With the U.S. entering World War I in 1917, group intelligence testing advanced rapidly.
Harvard psychologist Robert M. Yerkes persuaded the Army to implement intelligence tests for
recruits. Two main assessments were developed: the Army Alpha (a verbal test for literate
recruits) and the Army Beta (a nonverbal test for illiterate or non-English-speaking individuals).
These tests aimed to identify recruits’ cognitive abilities, eliminate those deemed unfit, and
match capable individuals with appropriate military roles. This large-scale testing effort
significantly advanced the science of test construction and psychometrics (Gregory, 2015).
Educational Testing
Following World War I, there was increased interest in applying psychological testing to
education, industry, and research. The National Intelligence Test was administered to millions of
American children, reflecting this surge in demand. The College Entrance Examination Board
(CEEB), with C.C. Brigham—a student of Yerkes—at the forefront, developed the Scholastic
Aptitude Test (SAT) using objective formats. The Educational Testing Service (ETS) later
assumed responsibility for such exams and introduced the Graduate Record Examination (GRE)
and Law School Admission Test (LSAT). Simultaneously, Terman and colleagues created the
Stanford Achievement Test, a widely used tool that incorporated contemporary psychometric
techniques (Gregory, 2015).
While intelligence tests assess broad cognitive capabilities, aptitude tests focus on
specific skill areas. Batteries of aptitude tests are designed to evaluate multiple distinct abilities.
Despite the early development of intelligence tests, aptitude tests advanced more slowly. It was
during World War II that the need for identifying individuals capable of performing highly
technical and specialized tasks led to the creation of a 20-test aptitude battery. This battery was
administered to those who had passed initial screening, and it proved essential for selecting
individuals suited for roles such as pilots, navigators, and bombardiers (Gregory, 2015).
Projective testing traces its roots to Francis Galton’s late 19th-century word association
experiments, which suggested unconscious mental processes. Influences from Freud’s
psychoanalytic theory were also foundational. Further refinements were made by Wundt and
Kraepelin, and Carl Jung advanced the method significantly. Hermann Rorschach later
developed the inkblot test to explore personality dynamics. Other techniques such as sentence
completion (initiated by Payne) and children’s drawings (analyzed by Goodenough) also
emerged. In Europe, the Szondi Test briefly gained prominence, though empirical criticism led to
its decline (Gregory, 2015).
Interest inventories began in the early 20th century as tools for vocational guidance and
counseling. Their origins are tied to the work of Thorndike, with one of the first formal
inventories created by Yoakum in 1919–1920 and later improved by Cowdery. Edward K. Strong
revised this work to produce the Strong Vocational Interest Blank (SVIB), which eventually
evolved into the Strong Interest Inventory. Another significant contribution was the Kuder
Preference Record, which used forced-choice items within triads to assess the relative strength of
interests. Modern versions include the Kuder General Interest Survey and the Kuder
Occupational Interest Survey (Gregory, 2015).
During the 1940s, structured personality assessments gained recognition for their clinical
utility and relevance to everyday functioning. The MMPI emerged as a cornerstone in psychiatric
diagnosis and has since been adapted for use in various domains, including medical, forensic,
and career counseling. Other influential instruments include the 16PF, derived through factor
analysis; the CPI, assessing traits like dominance and flexibility; and the MBTI, which is based
on Jungian typology and widely employed in corporate environments. Contemporary personality
research is increasingly aligned with the Big Five model, encompassing neuroticism,
extraversion, openness, agreeableness, and conscientiousness. Tests such as the NEO-PI-R,
Five-Factor Personality Inventory, and NEO-PI-3 exemplify this framework (Gregory, 2015).
In the 21st century, psychological testing has seen substantial growth in both
individualized clinical contexts and large-scale societal applications. New subspecialties such as
clinical neuropsychology and health psychology have emerged within clinical practice.
Simultaneously, group testing continues to expand in education, professional certification, and
standardized assessment. More than 100 million tests—ranging from IQ and achievement
assessments to screening and readiness tools—are administered annually. High-stakes
professional exams like the MCAT, LSAT, and GMAT remain central to training and licensure
processes (Gregory, 2015).
The rise of evidence-based practice in psychology has emphasized the need for
psychometrically sound assessment tools. Evidence-based psychological practice integrates
validated instruments into therapeutic settings to monitor and guide treatment. One such tool, the
Outcome Rating Scale, provides a brief yet reliable measure of a client’s current functioning,
contributing to ongoing assessment within psychotherapy (Gregory, 2015).
Role in Research | Supports experimental research; not used to generate or test formal hypotheses. | Central to hypothesis testing and the generation of new knowledge.
A high-quality psychological test must demonstrate reliability, validity, and the presence
of norms. These components are explained below:
Reliability
Reliability refers to the consistency of scores obtained by the same persons when they are
re-examined with the same test on different occasions, or with different sets of equivalent items,
or under other variable examining conditions. This concept of reliability underlies the
computation of the error of measurement of a single score, whereby we can predict the range of
$$X = T + e$$
Here, X is the obtained score, T is the true score, and e is the error
component (arising from the tester, the environmental conditions, or the test itself). An ideal situation
would be one in which the obtained score equals the true score with no error, but such a situation
is not possible in practice. X and T may come close to each other, but they cannot be fully equal. Similarly, it is
hard to ensure that all sources of error are absent from the testing situation. When e is larger, X is
further from T, and when e is smaller, X is nearer to T.
Absolute reliability and relative reliability are two important concepts used to evaluate
the consistency of test scores. Both of them serve different purposes.
Relative Reliability: It assesses how well a test maintains the relative positions of
individuals over time, i.e., it refers to the consistency of individual rankings within a group across
repeated measurements. A commonly used statistical method for measuring relative reliability is
Pearson’s correlation coefficient. This form of reliability is especially valuable when the primary
interest is comparing individuals within a group rather than examining changes in individual
scores over time. For example, if a group of students takes an intelligence test twice and those
who scored highest on the first test also score highest on the second test (maintaining their rank
order), the test demonstrates high relative reliability (Portney & Watkins, 2009).
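As a rough illustration of relative reliability, the sketch below computes Pearson's correlation coefficient between two administrations of the same test. The scores are invented for the example, and `pearson_r` is a hypothetical helper written from the textbook definition of the coefficient.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations, and sums of squared deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores: every student gains 4 points on the retest,
# so the rank order is fully preserved.
first_testing = [95, 100, 105, 110, 120]
second_testing = [99, 104, 109, 114, 124]

r = pearson_r(first_testing, second_testing)
print(f"test-retest correlation: {r:.2f}")
```

The correlation is essentially perfect even though every individual score changed, which shows that relative reliability tracks rank order rather than the stability of each person's absolute score.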
Absolute Reliability: It measures how much an individual’s score varies from one test
administration to another. It assesses the consistency of scores on an absolute scale, considering
measurement error. Commonly used statistical measures for estimating absolute reliability are the
Standard Error of Measurement (SEM) and the Minimal Detectable Change (MDC). This form of reliability
is more useful when one wants to track an individual's performance over a period of time. It
also helps us determine whether any observed change in score reflects real improvement or just
measurement error. An example would be when, in a clinical setting, a psychologist monitors a
patient's anxiety levels over time using a standardised scale (Baumgartner & Jackson, 1987).
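The two statistics named above can be sketched as follows. The formulas are the conventional ones (SEM = SD × √(1 − r), and MDC at 95% confidence = 1.96 × √2 × SEM); they are assumptions in the sense that the text above does not state them, and the standard deviation, reliability, and change score are invented for illustration.

```python
from math import sqrt


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), where r is a reliability coefficient between 0 and 1."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - reliability)


def minimal_detectable_change(sem: float, z: float = 1.96) -> float:
    """MDC = z * sqrt(2) * SEM; sqrt(2) reflects error in both of the two testings."""
    return z * sqrt(2) * sem


# Hypothetical anxiety scale: SD of 10 points, test-retest reliability of .91.
sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
mdc = minimal_detectable_change(sem)

# A patient's score drops by 5 points between sessions.
observed_change = 5.0
is_real_change = abs(observed_change) > mdc

print(f"SEM = {sem:.2f}, MDC95 = {mdc:.2f}, exceeds MDC: {is_real_change}")
```

Under these assumed values the SEM is about 3 points and the MDC about 8.3 points, so a 5-point drop could still be measurement error rather than real improvement.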
Advantages: One key benefit of the test-retest method is that it offers clear evidence regarding the
consistency of a test over time, making it particularly useful for evaluating stable characteristics
such as intelligence or personality. It is generally straightforward to conduct, and the outcomes
are simple to interpret and apply in real-world settings. Additionally, the test-retest method can
highlight time-related sources of error, like changes in mood, memory recall, or environmental
conditions, which can guide improvements in test development or administration.
Limitations: Despite its usefulness, this method has certain drawbacks. If the gap between
the two testing sessions is too brief, individuals may recall their previous answers, resulting in
artificially high reliability due to memory effects. Repeated exposure to the test can also lead to
improved scores the second time around, a phenomenon known as the practice effect. Thus,
choosing the right time interval is crucial—too short, and earlier responses might influence the
results (carryover effects); too long, and real changes in the measured trait could lower
reliability. Furthermore, this method is not well-suited for traits that naturally fluctuate over time,
such as mood or anxiety, since a drop in reliability might reflect genuine changes in the trait
rather than flaws in the test itself.
Alternate Forms Method: This method tests reliability by administering two
different but equivalent versions of a test to the same group of participants. Both versions are
designed to measure the same construct and share similar statistical properties, differing only in
the specific items they contain. This method aims to eliminate the carry-over effect found in the test-retest
method. However, it introduces the potential for discrepancies in item selection, which can affect
accuracy.
Advantages: A major strength of this method is that it minimizes the effects of memory
and practice, as different sets of items are used in each version, making it less likely that
participants will remember their previous responses. This makes the approach especially suitable
for situations where individuals need to be tested multiple times.
Limitations: However, developing two test versions that are truly comparable in terms of
content, difficulty level, and measurement characteristics can be both difficult and
time-intensive. There is also a risk of construct inequivalence—small differences between the
test forms can result in score variations that reflect inconsistencies in the test rather than actual
differences in the trait being measured, which may undermine the reliability of the results.
Split-Half Method: This method evaluates a test’s reliability by dividing it into two equal
halves and assessing the scores from each part. Participants complete the test once, and the
correlation between the two halves is calculated. One common approach to dividing the test is by
separating odd-numbered and even-numbered items, which helps balance the halves when the
difficulty of the test increases; a simple first-half versus second-half split can be problematic in that case. If the two halves are unbalanced, this method may not be
effective. To estimate the reliability of the whole test from the correlation between the two halves (r_hh),
the Spearman-Brown formula is used, which is as follows:
$$r_{SB} = \frac{2 r_{hh}}{1 + r_{hh}}$$
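A minimal sketch of the split-half procedure follows, assuming an odd-even split and a half-test correlation of .60 chosen purely for illustration. The function names are hypothetical.

```python
def odd_even_halves(item_scores):
    """Return one test-taker's totals on odd- and even-numbered items (1-based)."""
    odd_total = sum(item_scores[0::2])   # items 1, 3, 5, ...
    even_total = sum(item_scores[1::2])  # items 2, 4, 6, ...
    return odd_total, even_total


def spearman_brown(r_hh: float) -> float:
    """Step the correlation between two halves up to a full-length estimate."""
    return (2 * r_hh) / (1 + r_hh)


# One test-taker's right (1) and wrong (0) answers on a six-item test.
odd_total, even_total = odd_even_halves([1, 0, 1, 1, 0, 1])
print(odd_total, even_total)

# Suppose the half scores correlate .60 across all test-takers (assumed value).
full_test_reliability = spearman_brown(0.60)
print(f"estimated full-test reliability: {full_test_reliability:.2f}")
```

The corrected estimate (.75 here) is higher than the half-test correlation because each half is only half as long as the full test, and shorter tests tend to be less reliable.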
Advantages: Split-half reliability offers the benefit of being calculated from a single
administration of a test, making it more efficient and less taxing for participants compared to
methods like test-retest or alternate-form reliability. Since the test is only taken once, there's no
risk of scores being inflated due to familiarity or memory of the content.
Limitations: However, it can be challenging to divide a test into two halves that are truly
equal in terms of content and difficulty. Any imbalance between the halves may result in
misleading reliability estimates. Additionally, tests with a limited number of items may not be
suitable for splitting, as doing so can reduce the accuracy of the reliability measurement.
$$r_{KR20} = \left(\frac{N}{N-1}\right)\left(1 - \frac{\sum pq}{\sigma^2}\right)$$
Here, N refers to the number of items in the test, p is the proportion of people who got an
item right, and q refers to the proportion who got it wrong. $\sigma^2$ is the variance of total test scores.
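The KR-20 calculation can be sketched on a small invented data set of right (1) and wrong (0) answers. One assumption to note: the total-score variance is computed as a population variance (dividing by the number of people), which keeps it consistent with the pq terms; some textbooks use the sample variance instead.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for dichotomous items; `responses` holds one list of 0/1 per person."""
    n_people = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    # Sum of p * q over items, where p is the proportion answering correctly.
    sum_pq = 0.0
    for item in range(n_items):
        p = sum(person[item] for person in responses) / n_people
        sum_pq += p * (1 - p)

    # Population variance of total scores (assumed convention, see lead-in).
    total_variance = pvariance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical answers from five test-takers on four items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
kr20_value = kr20(answers)
print(f"KR-20 = {kr20_value:.2f}")
```

For these data the pq terms sum to 0.8 and the total-score variance is 2.0, giving a coefficient of 0.80.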
Limitations: KR-20 is not applicable to tests with items that have more than two
response options. Additionally, if the test measures multiple dimensions or constructs, the
reliability may be overestimated, as the formula assumes that all items reflect a single underlying
trait.
$$r_{\alpha} = \left(\frac{k}{k-1}\right)\left(1 - \frac{\sum \sigma_i^2}{\sigma^2}\right)$$
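In the coefficient alpha formula, k is conventionally the number of items, σ_i² the variance of item i, and σ² the variance of total scores. The sketch below applies it to invented ratings on a five-point scale; the function name and data are assumptions made for illustration.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha; `responses` holds one list of item scores per person."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")

    # Variance of each item across people.
    item_variances = [
        pvariance([person[item] for person in responses]) for item in range(k)
    ]
    # Variance of the total scores across people.
    total_variance = pvariance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings from five respondents on three Likert-type items.
ratings = [
    [1, 2, 2],
    [2, 2, 3],
    [3, 3, 3],
    [4, 4, 3],
    [5, 4, 4],
]
alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

Here the item variances sum to 3.2 and the total-score variance is 8.0, so alpha is 0.90. Because alpha depends only on the ratio of these variances, using sample variances throughout would give the same result.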
Advantages: Unlike KR-20, coefficient alpha (Cronbach’s alpha) is suitable for both
dichotomous items and items with multiple response options, such as those on a Likert scale,
making it highly adaptable for various psychological and educational assessments. It can be
computed from a single test administration, offering both convenience and efficiency. Its
simplicity has made it a widely accepted reliability measure in both research and applied
settings.
reliability can appear inflated simply by adding more items, regardless of their quality.
Additionally, since alpha assumes that all items measure the same underlying construct, it may
produce misleading reliability estimates if the test actually assesses multiple dimensions.
Interscorer Reliability: Also referred to as interrater reliability, this concept measures the
level of agreement or consistency between multiple individuals evaluating the same test,
behavior, or response. It is especially crucial for assessments that rely on subjective judgment,
such as essays, performance-based tasks, or behavioral observations. A high level of agreement
among scorers indicates that the results are more likely to represent the test-taker's actual
performance rather than variations in individual scorer interpretations. Common statistical tools
used to assess this type of reliability include Cohen’s kappa and intraclass correlation
coefficients.
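Cohen's kappa corrects the raw agreement between two scorers for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). The following sketch uses invented pass/fail ratings, and `cohens_kappa` is a hypothetical helper rather than a library function.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each assign one category per case."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)

    # Observed agreement: share of cases given the same category by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the raters' marginal proportions per category.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[cat] * counts_b[cat] for cat in counts_a) / (n * n)

    # Undefined (division by zero) if both raters use a single identical category.
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten essays by two scorers.
scorer_1 = ["pass"] * 6 + ["fail"] * 4
scorer_2 = ["pass"] * 5 + ["fail"] * 4 + ["pass"]

kappa = cohens_kappa(scorer_1, scorer_2)
print(f"kappa = {kappa:.2f}")
```

The scorers agree on 8 of 10 essays, but about half of that agreement would be expected by chance, so kappa (about 0.58) is noticeably lower than the raw 80% agreement.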
Advantages: Interscorer reliability ensures that results are not solely dependent on the
perspective of one evaluator, which enhances both fairness and trustworthiness in scoring. This is
particularly important for assessments involving open-ended answers, behavioral evaluations, or
clinical judgments, where subjectivity plays a significant role. Consistent scoring among
different raters demonstrates that scoring guidelines are being applied uniformly, which supports
the overall integrity of the assessment process.
Factors Influencing Reliability: Several elements can influence a test's reliability. One
major category involves factors related to the test administrator. If the examiner is untrained, it
can result in incorrect scoring, leading to inaccurate outcomes. Bias or prejudice against a
participant can also distort results. On the other hand, participant-related factors can also affect
reliability. For example, if a participant is emotionally unsettled or distressed, their focus may
suffer, and their responses might not accurately represent their true abilities or traits.
The difficulty level of the items also impacts reliability. Tests should include items of
moderate difficulty—not so easy that they bore the participant, nor so hard that they lead to
frustration. Checking each item's discrimination index and ensuring that items correlate with one another is
also important; without these checks, test items might measure unrelated traits, reducing consistency.
Reliable tests deliver consistent results when conditions remain the same, making them
essential for accuracy and fairness in assessment. Reliability is also a foundation for validity—if
a test doesn’t consistently measure a construct, it can’t be considered valid. Reliable tests are
crucial in clinical, educational, and workplace settings, where decisions are often based on test
outcomes (Urbina, 1997).
Validity
Validity refers to how well the evidence and theory support the use of the test results for
their intended purpose (Urbina, 2004). A test must be reliable to be valid, but a reliable test is not
necessarily valid. Validity ensures that a test accurately measures what it claims to measure. A test's
reliability sets a limit on its validity: scores from an unreliable test are unlikely to correspond with independent
criteria.
Content Validity: The question of whether a test sufficiently covers the content domain it
is meant to measure is known as content validity. It evaluates how well test items or questions
capture the entire spectrum of information, abilities, or actions associated with the relevant
construct. Content experts typically assess content validity by reviewing the test items to
determine their relevance and how well they reflect the concept being measured. For example,
experts would ensure that an intelligence test includes questions that assess a broad range of
cognitive abilities.
Construct Validity: It is the most thorough and essential kind of validity, concentrating on
the underlying concept or characteristic under examination. This analysis looks at how well a test
captures the desired construct or theoretical idea. Building a body of evidence from several
sources to back up the interpretation of test results is known as construct validity. Convergent
and discriminant validity are the two forms of construct validity. When all of a test's items point in
the same direction, i.e., toward measuring the characteristics of a specific construct, this is
referred to as convergent validity. Items on an intelligence test, for example, should evaluate characteristics
of intelligence, such as creativity and cognitive speed. If a test can distinguish
the construct it is meant to assess from things that are not that construct, then it has
discriminant validity.
Factors Influencing Validity: Validity is influenced by the quality of the test items and
how well they represent the concept being measured. Items that are poorly constructed or unclear
can compromise the test’s ability to accurately reflect the intended construct. A test validated for
one population may not remain valid for another if there are notable differences, such as in age,
culture, or educational background. Additionally, environmental conditions—like excessive
noise or uncomfortable temperatures—can introduce unrelated variables that lower validity. The
test’s length also plays a role: extremely short tests may not fully capture the target construct,
while very long ones can lead to fatigue, which negatively affects responses and validity (Shaffer
& Kipp, 2014).
Norms
Norms are aggregate descriptors of test results for particular groups of people, defined
by some shared factor such as age or grade level, according to Urbina (2004). Norms offer a point
of reference for analyzing test results by contrasting them with the accomplishments of a
pertinent group. In order to establish subgroups for more precise comparisons, norms frequently
take demographic factors like age, gender, or educational attainment into consideration.
Developmental Norms: One way in which meaning can be attached to test scores is to
indicate how far along the normal developmental path the individual has progressed. Thus an
8-year-old who performs as well as the average 10-year-old on an intelligence test may be
described as having a mental age of 10; an adult with an intellectual disability who performs at the same level
would likewise be assigned a mental age (MA) of 10. Other developmental systems utilize more highly
qualitative descriptions of behaviour in specific functions, such as sensorimotor activities or
concept formation. However expressed, scores based on developmental norms tend to be
psychometrically crude and do not lend themselves well to precise statistical treatment.
Nevertheless, they have considerable appeal for descriptive purposes, especially in the intensive
clinical study of individuals and for certain research purposes.
Age and Grade Norms: Age norms are determined by averaging the behavior or
performance of people in a certain age range. Grade norms, in contrast, focus on academic
achievement and are based on the performance of students at a particular educational year or
grade level. They make it possible for teachers and researchers to evaluate how well a pupil is
doing in comparison to other students in the same grade.
Percentile Norms: These indicate the percentage of individuals in the reference group
who scored equal to or below a particular score. For example, if a person's performance falls at
the 75th percentile, it means they scored as well as or better than 75% of the individuals in the
reference group.
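The percentile definition above can be illustrated with a short Python sketch. The function name `percentile_rank` and the reference scores are hypothetical, chosen only for illustration; the sketch assumes the "equal to or below" definition given in the text.

```python
def percentile_rank(score, reference_scores):
    """Percentage of reference-group scores that are equal to or below `score`."""
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    at_or_below = sum(1 for s in reference_scores if s <= score)
    return 100.0 * at_or_below / len(reference_scores)


# Hypothetical reference group of eight raw scores (illustrative data only).
reference = [10, 12, 15, 15, 18, 20, 22, 25]

print(percentile_rank(20, reference))  # 75.0 (six of the eight scores are <= 20)
```

Other conventions exist (for example, counting only scores strictly below, or counting ties as half), so the result for a given score can differ slightly depending on the definition a test manual adopts.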
$$Z = \frac{X - M}{SD}$$
Here, Z stands for the standard score, X is the raw score, M is the mean of the
distribution, and SD is the standard deviation of the distribution. A positive Z indicates
that the score falls above the mean for that test. Similarly, a negative Z indicates a score
falling below the mean.
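As a minimal sketch of the standard-score formula, the Python snippet below computes Z from a raw score, a mean, and a standard deviation. The function name `z_score` and the raw scores are hypothetical, and using the population standard deviation (`pstdev`) is an assumption; a test manual may use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def z_score(x, m, sd):
    """Standard score: Z = (X - M) / SD."""
    if sd <= 0:
        raise ValueError("SD must be positive")
    return (x - m) / sd


# Hypothetical raw scores for a small reference group (illustrative data only).
raw_scores = [2, 4, 4, 4, 5, 5, 7, 9]
m = mean(raw_scores)     # M, the mean of the distribution (5 here)
sd = pstdev(raw_scores)  # SD, population standard deviation (assumed choice; 2.0 here)

print(z_score(9, m, sd))  # 2.0, two standard deviations above the mean
print(z_score(4, m, sd))  # -0.5, half a standard deviation below the mean
```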
T-scores are another form of standard score that also indicate how far a raw score
is from the mean, rescaled so that the mean is 50 and the standard deviation is 10. A T-score
above 50 suggests above-average performance, and a score below 50 reflects below-average
performance. Z-scores can be converted to T-scores using the following formula, where T
refers to the T-score and Z refers to the standard score:
$$T = 10Z + 50$$
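The conversion can be sketched in a few lines of Python. The function name `t_score` is hypothetical; the arithmetic is simply the formula above.

```python
def t_score(z):
    """Convert a standard score Z to a T-score: T = 10Z + 50."""
    return 10 * z + 50


print(t_score(1.5))   # 65.0, above the T-score mean of 50
print(t_score(0.0))   # 50.0, exactly average
print(t_score(-1.0))  # 40.0, one standard deviation below the mean
```

Because the transformation is linear, it changes only the scale of the scores and leaves each person's relative position in the distribution unchanged.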
Stanine Scores: Stanines, short for "standard nines," sort scores into nine bands
labeled 1 to 9. The scale has a mean of 5 and a standard deviation of about 2, so bands 2 to 8
each span half a standard deviation while bands 1 and 9 are open-ended. A stanine of 5
represents average performance in the reference group.
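One common way to obtain a stanine from a standard score is sketched below. It assumes approximately normally distributed scores and the usual convention of a stanine mean of 5 with bands half a standard deviation wide; the function name `stanine_from_z` is hypothetical, and published tests may instead assign stanines from percentile tables.

```python
import math


def stanine_from_z(z):
    """Map a standard score to a stanine (1 to 9), assuming mean 5 and half-SD bands."""
    band = math.floor(2 * z + 5.5)  # round 2Z + 5 to the nearest band, halves going up
    return max(1, min(9, band))     # scores beyond the ends fall into stanine 1 or 9


print(stanine_from_z(0.0))   # 5, the average band
print(stanine_from_z(1.0))   # 7
print(stanine_from_z(-3.0))  # 1, the open-ended lowest band
```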
Sten Scores: Similar to stanines, sten ("standard ten") scores divide scores into ten bands,
each half a standard deviation wide except the open-ended bands 1 and 10. They are based on a
metric that sets the average at 5.5, with higher sten values indicating better performance
relative to the reference group.
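A sten can be derived from a standard score in a similar way. The sketch below assumes approximately normal scores, a sten mean of 5.5, and bands half a standard deviation wide; the function name `sten_from_z` is hypothetical.

```python
import math


def sten_from_z(z):
    """Map a standard score to a sten (1 to 10), assuming mean 5.5 and half-SD bands."""
    band = math.floor(2 * z) + 6  # sten 5 covers -0.5 <= Z < 0, sten 6 covers 0 <= Z < 0.5
    return max(1, min(10, band))  # scores beyond the ends fall into sten 1 or 10


print(sten_from_z(-0.25))  # 5, just below the mean
print(sten_from_z(0.25))   # 6, just above the mean
print(sten_from_z(3.0))    # 10, the open-ended highest band
```

Because the scale has an even number of bands, no single sten sits on the mean: stens 5 and 6 straddle it, which is why the average is expressed as 5.5.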
Types of Tests
According to Gregory (2015), psychological tests can be classified into the following
types:
Based on Administration
Individual Tests: Individual tests are conducted one person at a time, involving direct
interaction between the examiner and the test taker. This format allows for detailed and subtle
observations of behavior and responses, making it especially valuable when in-depth, qualitative
information is required, such as in clinical, neuropsychological, or educational assessments
(Urbina, 2014).
Advantages: One key benefit of individual testing is the opportunity for close monitoring
of behavior, which is essential for accurate diagnosis and planning interventions. The examiner
can adjust the testing pace and clarify instructions as needed, which improves accessibility for
children and individuals with special needs.
Group Tests: These tests are designed to be administered to multiple individuals at once,
making them more efficient and cost-effective, particularly in educational and workplace
settings. They are usually standardized, with fixed instructions and scoring systems that require
little examiner involvement. These tests are especially useful for screening or selection when
comparing results across large groups (Urbina, 2014).
Advantages: They are time-saving and economical for evaluating many people
simultaneously. The standardized procedures minimize examiner bias and enhance the
objectivity of scoring.
Limitations: Because the examiner cannot focus on each person individually, important
behavioral signals might be missed, and issues like test anxiety or misunderstanding of
instructions may go unnoticed. The requirement for standardized instructions means they cannot
be adjusted, which may disadvantage individuals with special needs (Urbina, 2014).
Speed Tests: In speed tests, all items are of equal difficulty, and the focus is on how
quickly and accurately a person can respond. The test is designed so that no one is expected to
complete all items within the allotted time. Speed tests are beneficial for efficiently assessing
large groups because they can be administered quickly. However, they may disadvantage
individuals who work carefully or have slower processing speeds. Factors like fatigue, anxiety,
or stress can also significantly affect performance. An example is the Minnesota Clerical Test
(Urbina, 2014).
Power Tests: Power tests provide ample time for completion and include items with
varying difficulty levels. The main focus is on the hardest items an examinee can answer, rather
than how quickly they respond. These tests allow for a more thorough assessment of an
individual’s reasoning or knowledge in a specific area. They are suitable for individuals with
disabilities or slower processing speeds. However, power tests generally take longer to
administer, which may not be feasible in all situations. In educational contexts, performance
might reflect the amount of preparation rather than natural ability. An example is Raven’s
Progressive Matrices (Urbina, 2014).
Performance Tests: Also called nonverbal or "hands-on" tests, these assessments require
test-takers to interact with materials or complete tasks instead of just answering written
questions. They evaluate skills like reasoning, coordination, and problem-solving while
minimizing language dependence. These tests are especially helpful for individuals with
language barriers, low literacy, or cultural differences, and are often used in intelligence testing,
neuropsychological evaluations, and job assessments. The Block Design Subtest from the
Wechsler Intelligence Scales is an example.
Based on Applicability
Culture-Specific Tests: These tests are specifically designed to reflect the language,
norms, and values of a particular cultural group. Their purpose is to ensure greater relevance and
validity for that population, rather than applying universally. They are commonly used in
cross-cultural studies, bilingual educational settings, or psychological evaluations within
indigenous or minority communities. One example is the System of Multicultural Pluralistic
Assessment (SOMPA).
Verbal Tests: These assessments rely heavily on language for both presenting questions
and obtaining responses, making them closely tied to an individual's education and language
skills (Urbina, 2014). Because they require spoken or written language, they are not suitable for
individuals who are illiterate.
Non-Verbal Tests: Non-verbal tests use visual or spatial content and require little to no
language, either in how questions are presented or in how answers are given. They are designed
to evaluate cognitive skills like visual reasoning, spatial awareness, and problem-solving without
depending on literacy or verbal ability. A common example is Raven’s Progressive Matrices
(RPM).
Gregory (2015) states that psychological tests may also be categorized according to the
attributes they measure.
Aptitude Tests: Aptitude tests gauge specific, relatively stable abilities and a person's
underlying potential. They come in two varieties: single tests and multiple-test batteries.
A single aptitude test assesses one particular skill, while a battery of aptitude tests offers a
profile of results across a variety of skills. An example of a battery is the Differential
Aptitude Test (DAT).
Personality Tests: Personality tests identify a person's distinctive traits, attributes, and
behaviors, and their results may be used to forecast future behavior. These tests come in many
forms, such as inventories, checklists, and projective methods like inkblots and sentence
completions. An example is the NEO Five-Factor Inventory (NEO-FFI).
According to Gregory (2015), by far the most common use of psychological tests is to
make decisions about persons. For example, educational institutions frequently use tests to
determine placement levels for students, and universities ascertain who should be admitted, in
part, on the basis of test scores. Even the individual practitioner exploits tests, in the main, for
decision making. Examples include the consulting psychologist who uses a personality test to
determine that a police department hires one candidate and not another. But simple decision
making is not the only function of psychological testing. It is convenient to distinguish five uses
of tests:
Clinical Applications: In clinical contexts, psychological tests are vital tools for
diagnosing and planning treatment. For instance, intelligence tests are crucial for identifying
intellectual disabilities, while personality assessments can help determine the type and severity
of emotional disorders. These tools also allow professionals to monitor progress over time or
during recovery. An example is the Beck Depression Inventory (BDI), which is commonly used
to assess depression levels.
Research: Psychological tests are fundamental to both applied and theoretical behavioral
research. Researchers use these tools to assess key variables, collect data, and explore
relationships between different psychological constructs. Standardized testing ensures consistent
and reliable data collection, allowing researchers to draw valid conclusions and generalize
findings. These assessments provide empirical evidence that shapes the development of
psychological theories, informs intervention strategies, and influences policy decisions, thereby
advancing the scientific understanding of human behavior, mental processes, and emotional
functioning.
personality traits, and job-related competencies to assess their fit for specific positions. Beyond
hiring, they aid in identifying areas for employee training, facilitating career development, and
improving organizational effectiveness. They are also used in appraising job performance by
assessing behaviors and competencies relevant to workplace success, which informs promotion
and development decisions. Tools like the Wonderlic Personnel Test and Situational Judgment
Tests are frequently used to optimize employee selection and management strategies.
Self-Knowledge: Psychological tests can also supply a potent source of self-knowledge.
In some cases, the feedback a person receives from psychological tests can change a career path
or otherwise alter a person’s life course. Of course, not every instance of psychological
testing provides self-knowledge. Perhaps in the majority of cases the client already knows what
the test results divulge.
Programme Evaluation: Another use for psychological tests is the systematic evaluation
of educational and social programs. Social programs are designed to provide services
that improve social conditions and community life. For example, Project Head Start is a federally
funded program that supports nationwide preschool teaching projects for underprivileged
children.
Psychological testing raises several ethical concerns that must be thoughtfully addressed.
According to Urbina (1946), the following ethical principles are crucial:
Informed Consent
Confidentiality
Participants have the right to strict confidentiality concerning their personal information
and test outcomes. Psychologists must implement secure methods for storing, sharing, and
disposing of data. Any identifying details should be removed or anonymized to maintain privacy.
Minimizing Harm
Some tests may provoke emotional distress, especially those involving sensitive or traumatic
topics. Psychologists should take precautions to minimize discomfort, offer thorough debriefing,
and provide support as needed. They must also carefully consider how test results may influence
a person’s self-image, confidence, or future opportunities.
Use of Deception
When deception is necessary to maintain the integrity of the study, it must be ethically
justified. Participants must be fully informed after the study about the nature of the deception and
why it was used—a process known as debriefing.
Debriefing
Researchers are required to explain the true purpose and findings of the study to
participants after its completion, especially when deception was involved, to ensure transparency
and ethical closure.
Sharing Results
Researchers may choose to provide participants with their individual results, especially if
requested. This practice builds trust and allows participants to gain insight into their own traits or
cognitive abilities, supporting self-awareness and informed personal decisions.
When animals are involved in research, efforts must be made to minimize pain and
distress. Procedures involving surgery must be performed under anesthesia, and if euthanasia is
necessary, it must be conducted humanely and ethically.
Only trained and competent professionals should conduct and interpret tests. They must
follow standardized protocols and scoring systems to uphold the test’s reliability and fairness.
The testing process should be free of any bias, discrimination, or favoritism.
References
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Prentice Hall.
Baron, R. A., & Misra, G. (2016). Psychology (5th ed.). Pearson India.
Cohen, R. J., & Swerdlik, M. E. (2018). Psychological testing and assessment: An introduction
Gregory, R. J. (2015). Psychological testing: History, principles, and applications (7th ed.).
Pearson.
Kass, C. H. (2008). Psychological testing and assessment (I. B. Weiner & W. E. Craighead, Eds.;
King, B. M., & Minium, E. W. (2018). Statistical reasoning in the behavioral sciences (7th ed.).
Wiley.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability.
Morgan, C. T., & King, R. A. (2017). Introduction to psychology (Revised ed.). McGraw Hill
Education.