Lesson 5. Instrument, Validity, and Reliability

The document outlines the development of quantitative research instruments, emphasizing the importance of validity, reliability, and alignment with research objectives. It details the types of instruments, steps in their development, and considerations for ensuring their effectiveness. Additionally, it discusses various methods for assessing validity and reliability, including expert reviews, pilot testing, and statistical analyses.

Uploaded by

Denmark Yonson
Copyright
© All Rights Reserved

Instrument Development, Validity, and Reliability
Denmark L. Yonson, PhD
ED 702 Instructor

Disclaimer: Discussions in the succeeding slides are made with the assistance of Artificial Intelligence (ChatGPT 5)
Objective:
• Design quantitative research instruments that
demonstrate validity, reliability, and alignment with
research objectives and questions.
The Research Instrument
• A quantitative research instrument is a standardized tool
or device used to collect numerical data in a research
study.
• These instruments ensure that researchers can measure,
compare, and test hypotheses based on objective,
quantifiable data.
Key characteristics of quantitative
instruments
• Standardized: Administered and scored consistently for all
participants.
• Structured: Use closed-ended questions, fixed scales, or
objective criteria.
• Measurable: Generate numerical data for statistical analysis.
• Reliable and valid: Provide consistent results and measure what
they intend to measure.
Type of Instrument | Description | Uses in Research
Questionnaires/Surveys | Sets of structured questions with fixed response options (e.g., Likert scales, multiple choice). | Measuring attitudes, beliefs, perceptions, and behaviors; widely used in social sciences, education, psychology, and market research.
Standardized Tests/Achievement Tests | Pre-developed, validated instruments measuring knowledge, skills, or aptitude. | Assessing learning outcomes or competencies; comparing performance across groups.
Scales and Inventories | Instruments specifically designed to measure psychological or behavioral constructs (e.g., self-esteem, motivation). | Evaluating latent variables; often used in psychology, education, and health sciences.
Observation Checklists/Rating Scale | Structured tools used to systematically observe and record behaviors or phenomena. | Measuring classroom behaviors, teaching strategies, or student engagement.
Structured Interviews | Pre-determined set of questions administered verbally with standardized response categories. | Collecting quantifiable responses while maintaining some interaction.
Physiological or Biometric Measures | Devices that record measurable physical responses (e.g., heart rate, reaction time). | Common in experimental and health-related research.
Administrative or Archival Records | Existing data collected for other purposes but repurposed for research. | Used in secondary data analysis and longitudinal studies.
Use in Research
• Operationalizing Variables
• Data Collection
• Comparative Analysis
• Testing Hypotheses
• Evaluating Interventions
Considerations in Choosing or Designing
Instruments
• Validity: Does the instrument measure what it intends to
measure?
• Reliability: Are the results consistent and stable over time?
• Practicality: Is it easy to administer, score, and interpret?
• Cultural and Contextual Fit: Is the instrument appropriate for the
population being studied?
Steps in Instrument Development
1. Define the Construct and Research Purpose
2. Conduct an Extensive Literature Review
3. Determine the Measurement Approach
4. Develop the Item Pool
5. Conduct Expert Review (Content Validation)
6. Pre-Test or Cognitive Interviewing
7. Pilot Testing and Item Analysis
8. Assess Reliability
9. Validate the Instrument
10. Finalize the Instrument
11. Document and Update Periodically
1. Define the Construct and Research Purpose

• Before writing a single question, clarify what you want to measure
and why.
• Construct Definition: Clearly define the concept (e.g., “teacher self-
efficacy,” “student motivation,” “attitude toward AI”).
• Theoretical Framework: Ground your instrument in theory and existing
literature.
• Objectives: State how the data will be used (e.g., description, correlation,
prediction, evaluation).
2. Conduct an Extensive Literature Review
• Review existing instruments to understand their structure,
limitations, and measurement techniques.
• Identify conceptual dimensions or subconstructs.
• Look for gaps — your instrument should offer added value or be
contextually appropriate for your target population.
3. Determine the Measurement Approach
• Decide the type of instrument and measurement level:
• Questionnaire, test, scale, checklist, inventory, etc.
• Nominal, ordinal, interval, or ratio level data.
• Self-report vs. observer-rated.
• Also, decide the measurement model:
• Reflective: Items reflect the latent construct (e.g., motivation scale).
• Formative: Items cause or form the construct (e.g., socioeconomic
status index).
4. Develop the Item Pool
• Generate a large list of potential items (often 2–3 times more than
you need).
• Ensure content coverage: Each item should represent an aspect of the
construct.
• Use clear, concise, and unambiguous language.
• Avoid double-barreled, leading, or biased questions.
• Choose an appropriate response format (e.g., Likert scale, semantic
differential, frequency).
5. Conduct Expert Review (Content Validation)

• Ask subject-matter experts to review your items for:
• Relevance – Does the item measure the intended construct?
• Clarity – Is the wording understandable and unambiguous?
• Coverage – Are all dimensions represented?
• Use indices such as:
• Content Validity Index (CVI)
• Content Validity Ratio (CVR)
6. Pretest or Cognitive Interviewing
• Before a pilot test, conduct cognitive interviews or small-sample
pre-tests (~10–20 respondents):
• Check if items are interpreted as intended.
• Identify confusing or culturally inappropriate language.
• Revise based on feedback.
7. Pilot Testing and Item Analysis
• Administer the instrument to a pilot sample (usually 30–100
participants) to assess:
• Item clarity and variability
• Item-total correlations
• Difficulty and discrimination indices (for tests)
• Descriptive statistics (mean, SD, skewness, kurtosis)
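For achievement tests, the difficulty and discrimination indices listed above can be computed directly from right/wrong item scores. The sketch below is illustrative only: the pilot data are hypothetical, and the upper and lower groups are formed from the top and bottom 27% of total scores, which is a common convention assumed here rather than something the slides specify.

```python
def difficulty_index(item_scores):
    """Proportion of examinees who answered the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(responses, item, group_fraction=0.27):
    """Upper-lower discrimination index for one item.

    responses: one list of 0/1 item scores per examinee.
    group_fraction: share of examinees in each extreme group
    (0.27 is an assumed convention, not taken from the slides).
    """
    ranked = sorted(responses, key=sum, reverse=True)   # highest totals first
    k = max(1, round(len(ranked) * group_fraction))
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower


# Hypothetical pilot data: 10 examinees x 4 items (1 = correct, 0 = wrong).
pilot = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

for i in range(4):
    column = [row[i] for row in pilot]
    print(f"Item {i + 1}: difficulty={difficulty_index(column):.2f}, "
          f"discrimination={discrimination_index(pilot, i):.2f}")
```

A difficulty index near 1.0 means almost everyone answered correctly, while a discrimination index near zero or below suggests the item does not separate high and low scorers.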
8. Assess Reliability
• Reliability ensures consistency of measurement. Common
methods:
• Internal Consistency: Cronbach’s α or McDonald’s ω ≥ .70
• Test–Retest Reliability: Stability over time
• Inter-Rater Reliability: Agreement between observers (if applicable)
Cronbach’s Alpha Values
Cronbach’s Alpha Value | Interpretation
≥ 0.90 | Excellent (high-stakes testing, clinical decisions)
0.80 – 0.89 | Good (strong internal consistency)
0.70 – 0.79 | Acceptable (adequate for most research)
0.60 – 0.69 | Questionable (may need revision or more items)
0.50 – 0.59 | Poor (instrument likely unreliable)
< 0.50 | Unacceptable (instrument needs major revision)
(based on Nunnally & Bernstein, 1994; George & Mallery, 2003)
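As a rough illustration of where an alpha value comes from, the sketch below applies the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores) to a small set of hypothetical Likert responses. The data and the function name are assumptions for demonstration; in practice a statistical package would normally be used.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents' item-score lists."""
    k = len(responses[0])                        # number of items
    items = list(zip(*responses))                # one tuple of scores per item
    sum_item_variances = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical 5-point Likert responses: 6 respondents x 4 items.
likert = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [3, 4, 3, 3],
]

alpha = cronbach_alpha(likert)
print(f"Cronbach's alpha = {alpha:.2f}")
```

The resulting value can then be read against the interpretation bands in the table above.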
9. Validate the Instrument
• Establish different types of validity evidence:
Type of Validity | Purpose | How to Establish
Content Validity | Items represent the construct | Expert review (CVI, CVR)
Construct Validity | Instrument structure reflects theory | Exploratory and Confirmatory Factor Analysis
Criterion-related Validity | Scores relate to external criteria | Correlation or regression with existing measures
Convergent/Discriminant Validity | Distinguish from similar/different constructs | Multi-Trait Multi-Method (MTMM), correlations
10. Finalize the Instrument
• After revisions based on reliability and validity results:
• Finalize the number of items per construct.
• Establish scoring procedures (sum, mean, cut-off points).
• Develop a user manual with instructions, scoring keys, and interpretation
guidelines.

11. Document and Update


• Document the entire development process (for publication or
replication)
• Periodically revalidate the instrument for new contexts,
populations, or cultural settings.
Validity
• Validity refers to the extent to which an instrument measures the
construct it intends to measure. A valid instrument allows researchers to
make accurate, meaningful, and evidence-based inferences from their
data.
Validity = Truthfulness + Accuracy of Instrument
• Modern validity theory (Messick, 1995; AERA, APA, & NCME, 2014) treats
validity as a single, unified concept supported by multiple sources of
evidence. Together, these sources support the claim that your instrument is
not only consistent (reliable) but also measuring the intended construct.
Main Types of Validity
• Content Validity
• Construct Validity
• Criterion-Related Validity
• Face Validity
Content Validity
• Definition: The degree to which the items of an instrument
adequately represent the entire domain of the construct being
measured.
• Focus: Relevance, representativeness, and coverage of items.
• Usually established before data collection.
• How to test:
• Expert Panel Review: Specialists rate each item’s relevance.
• Content Validity Index (CVI): Measures agreement among experts.
• Content Validity Ratio (CVR): Assesses whether items are “essential”
(Lawshe method).
Content Validity Index (CVI)
• The Content Validity Index (CVI) measures the degree of
agreement among a panel of experts regarding the relevance,
clarity, simplicity, and representativeness of each item in your
instrument.
• Two Levels of CVI
• Item-Level (I-CVI)
• Scale-Level CVI (S-CVI)
Item-Level CVI (I-CVI)
• assesses individual items.
• Experts rate each item (usually on a 4-point scale):
• 1 – Not relevant
• 2 – Somewhat relevant
• 3 – Quite relevant
• 4 – Highly relevant
• Then, calculate the proportion of experts who rated the item as 3
or 4.
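• Formula:
$$\text{I-CVI} = \frac{\text{number of experts rating the item 3 or 4}}{\text{total number of experts}}$$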
• Interpretation:
Number of Experts | Minimum Acceptable I-CVI
3–5 | 1.00 (100%)
6–10 | > 0.78
> 10 | > 0.80

• Example:
• 8 experts reviewed an item.
• 7 rated it as 3 or 4
$$\text{I-CVI} = \frac{7}{8} = 0.875$$
• Interpretation: The item has strong content validity.
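The same calculation can be sketched in a few lines of code. The ratings below are hypothetical and were chosen so that 7 of 8 experts rate the item 3 or 4, as in the example above; the function name and the threshold parameter are illustrative.

```python
def item_cvi(ratings, relevant_threshold=3):
    """Proportion of experts rating the item at or above the threshold."""
    relevant = sum(1 for r in ratings if r >= relevant_threshold)
    return relevant / len(ratings)


# Hypothetical ratings from 8 experts on the 4-point relevance scale.
ratings = [4, 3, 4, 4, 2, 3, 4, 3]
print(f"I-CVI = {item_cvi(ratings):.3f}")   # 7 / 8 = 0.875
```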
S-CVI/Ave (Average)
• Average of all I-CVI scores.
• Most commonly used.

• Interpretation:
• S-CVI > 0.90 → Excellent content validity
• S-CVI > 0.80 → Acceptable content validity
• Example: 20 items, I-CVI scores: 0.90, 0.88, 1.00, … (sum = 17.8)

$$\text{S-CVI/Ave} = \frac{17.8}{20} = 0.89$$

• Interpretation: Good overall content validity.
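A minimal sketch of the scale-level average follows, assuming each item has its own list of expert ratings on the 4-point relevance scale. The three-item data set is hypothetical and much smaller than a real instrument.

```python
def scale_cvi_ave(ratings_by_item, relevant_threshold=3):
    """Return (S-CVI/Ave, list of I-CVIs) for a set of rated items."""
    i_cvis = [
        sum(1 for r in ratings if r >= relevant_threshold) / len(ratings)
        for ratings in ratings_by_item
    ]
    return sum(i_cvis) / len(i_cvis), i_cvis


# Hypothetical ratings: 3 items, each rated by 5 experts.
ratings_by_item = [
    [4, 4, 3, 4, 3],   # 5 of 5 rate it 3 or 4 -> I-CVI 1.00
    [4, 3, 2, 4, 3],   # 4 of 5 rate it 3 or 4 -> I-CVI 0.80
    [3, 4, 4, 3, 4],   # 5 of 5 rate it 3 or 4 -> I-CVI 1.00
]

s_cvi, i_cvis = scale_cvi_ave(ratings_by_item)
print(f"I-CVIs: {i_cvis}")
print(f"S-CVI/Ave = {s_cvi:.2f}")
```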


Content Validity Ratio (CVR)
• The Content Validity Ratio (CVR) (Lawshe, 1975) focuses
specifically on whether each item is “essential” to the construct
being measured.
• How to compute:
• Experts rate each item on a 3-point scale:
• 1 – Not essential
• 2 – Useful but not essential
• 3 – Essential
• Formula:
$$\text{CVR} = \frac{n - N/2}{N/2}$$
where:
n = number of experts rating the item as essential
N = total number of experts
• Range: –1.00 to +1.00
• +1.00 → All experts agree the item is essential
• 0.00 → Half agree, half don’t
• –1.00 → No expert thinks it’s essential
• Example: 8 experts, 7 say “essential”
$$\text{CVR} = \frac{7 - 4}{4} = 0.75$$

• Interpretation: Good content validity.


Lawshe’s Critical CVR Table (Minimum Acceptable CVR)
Number of Experts | Minimum CVR
5 | 0.99
6 | 0.99
7 | 0.99
8 | 0.75
9 | 0.78
10 | 0.62
15 | 0.49
20 | 0.42
If an item’s CVR ≥ critical value → Retain the item.
If below → Consider revising or removing the item.
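The retain-or-revise rule can be sketched as a lookup against the critical values. The dictionary below simply copies the panel sizes and minimum CVR values from the table above; the function names are illustrative, and panel sizes not listed in the table are deliberately left unsupported rather than interpolated.

```python
# Minimum CVR by panel size, copied from the table above.
LAWSHE_CRITICAL = {5: 0.99, 6: 0.99, 7: 0.99, 8: 0.75,
                   9: 0.78, 10: 0.62, 15: 0.49, 20: 0.42}


def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR = (n - N/2) / (N/2)."""
    half = n_experts / 2
    return (n_essential - half) / half


def retain_item(n_essential, n_experts):
    """True if the item's CVR meets the tabled critical value."""
    critical = LAWSHE_CRITICAL[n_experts]   # KeyError for untabled panel sizes
    return content_validity_ratio(n_essential, n_experts) >= critical


print(content_validity_ratio(7, 8))   # (7 - 4) / 4 = 0.75
print(retain_item(7, 8))              # 0.75 >= 0.75 -> True
print(retain_item(6, 8))              # 0.50 <  0.75 -> False
```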
CVI and CVR Key Differences
Feature | CVI | CVR
Purpose | Measures relevance, clarity, representativeness | Measures whether items are essential
Rating Scale | Usually 4-point (Not relevant → Highly relevant) | 3-point (Not essential → Essential)
Focus | Agreement on quality of content | Agreement on necessity of content
Level | Item-level (I-CVI) and scale-level (S-CVI) | Item-level only
Interpretation | Proportion of experts rating item as relevant | Adjusted proportion accounting for chance agreement

Use both CVI and CVR together during content validation — CVI ensures content coverage, while CVR
ensures each item is essential.
Construct Validity
• Definition: The extent to which the instrument actually measures the
theoretical construct it is intended to measure.
• Focus: Theoretical accuracy and internal structure.
• Often the most important type of validity in social science research.
• How to test:
• Exploratory Factor Analysis (EFA)
• Used in early stages to discover the underlying factor structure.
• Helps verify whether items group into expected subscales.
• Confirmatory Factor Analysis (CFA)
• Tests whether the data fit a hypothesized factor structure based on theory.
• Convergent Validity
• Measures correlate positively with other instruments measuring the same construct.
• Discriminant Validity
• Instrument does not correlate too strongly with unrelated constructs.
Criterion-Related Validity
• Definition: The extent to which the scores on an instrument are
related to an external criterion measure.
• Focus: Predicting or correlating with an external benchmark.
• Types:
• Concurrent Validity
• Correlates with a criterion measured at the same time
• Predictive Validity
• Predicts a future outcome
• Postdictive Validity
• Correlates with past data (less common)
Face Validity
• Definition: The extent to which the instrument appears to
measure what it claims to measure — based on non-expert
judgment (often participants’ perceptions).
• It is not a statistical test and is considered the weakest form of validity.
• Still important for participant engagement and practical use.
Validity Type | Purpose | How to Establish | When to Use
Content Validity | Measures full domain coverage | Expert judgment, CVI, CVR | Instrument design stage
Construct Validity | Reflects theoretical construct | EFA, CFA, convergent/discriminant validity | Pilot and main study
Criterion-Related Validity | Correlates with external criterion | Correlation, regression, ROC | After initial validation
Face Validity | Appears appropriate to users | Non-expert judgment | Early design
Reliability
• Reliability refers to the consistency, stability, and
dependability of an instrument’s scores. A reliable instrument
produces the same results under consistent conditions.
• Key Principle: Reliability ≠ Validity.
• A test can be reliable but not valid.
• But a valid test must always be reliable.
Major Types of Reliability
• Test–Retest Reliability (Stability Over Time)
• Parallel-Forms Reliability (Equivalence of Versions)
• Split-Half Reliability (Internal Consistency Across Halves)
• Internal Consistency Reliability (Item Interrelatedness)
• Inter-Rater Reliability (Observer Consistency)
• Intra-Rater Reliability (Individual Consistency)
Test–Retest Reliability
• Definition: Measures the stability of test scores over time — how
consistent the results are when the same instrument is
administered to the same group at two different points.
• It assesses temporal stability of the instrument.
• High correlation (r ≥ .70) between Time 1 and Time 2 scores indicates
good reliability.
• When to Use:
• Attitude scales
• Personality inventories
• Psychological constructs expected to remain stable
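To make the r ≥ .70 guideline concrete, the sketch below computes Pearson's r between two administrations of the same instrument. The scores are hypothetical, and the helper is written out by hand only so the example stays self-contained.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical total scores for the same 6 respondents at Time 1 and Time 2.
time1 = [20, 25, 30, 35, 40, 45]
time2 = [22, 24, 33, 34, 41, 44]

r = pearson_r(time1, time2)
print(f"Test-retest r = {r:.2f}; meets r >= .70: {r >= 0.70}")
```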
Parallel-Forms Reliability
• Definition: Measures the consistency of results between two
equivalent forms of the same instrument designed to assess the
same construct.
• Both forms are administered to the same participants.
• Scores should correlate highly (r ≥ .70) if the forms are equivalent.
• When to Use:
• Standardized tests
• Instruments where multiple versions are needed to prevent memory
effects
Split-Half Reliability
• Definition: Assesses the internal consistency of a test by
splitting the items into two halves (e.g., odd vs. even items) and
correlating the scores from each half.
• A high correlation indicates that items are measuring the same construct
consistently.
• Often corrected using the Spearman–Brown prophecy formula.
• When to Use:
• Questionnaires
• Knowledge or achievement tests
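The odd/even split and the Spearman–Brown correction can be sketched as follows. The correction estimates full-length reliability from the half-test correlation as 2r / (1 + r). The response data are hypothetical and the function names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def split_half_reliability(responses):
    """Return (half-test r, Spearman-Brown corrected reliability)."""
    odd = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return r_half, (2 * r_half) / (1 + r_half)


# Hypothetical 5-point Likert responses: 6 respondents x 4 items.
responses = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [3, 4, 3, 3],
]

r_half, r_full = split_half_reliability(responses)
print(f"Half-test r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```

The corrected value is higher than the raw half-test correlation because each half contains only half the items of the full instrument.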
Internal Consistency Reliability
• Definition: Assesses the degree to which items in a test measure
the same underlying construct and are consistent with one
another.
• When to Use:
• Multi-item scales (e.g., Likert-type surveys)
• Psychological or attitudinal instruments
Inter-Rater Reliability
• Definition: Measures the degree of agreement or consistency
among different raters or observers assessing the same
phenomenon.
• Essential when data collection involves subjective judgment.
• When to Use:
• Observational research
• Rubric-based assessments
• Qualitative coding with multiple raters
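For two raters assigning categorical codes, one widely used agreement statistic is Cohen's κ, which compares observed agreement with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below uses hypothetical observation codes; the category labels and function name are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)


# Hypothetical codes ("on"/"off" task) from two observers for 10 events.
rater_a = ["on", "on", "off", "on", "off", "on", "on", "off", "on", "off"]
rater_b = ["on", "on", "off", "on", "on", "on", "on", "off", "off", "off"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 events, but κ is lower than 0.80 because part of that agreement would be expected by chance. The function is undefined when chance agreement equals 1 (for example, when both raters use a single category throughout).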
Intra-Rater Reliability
• Definition: Measures the consistency of the same rater’s
judgments across multiple instances.
• Ensures that scoring remains stable over time by the same observer.
• When to Use:
• Performance assessments
• Clinical ratings
• Any scenario where the same rater scores multiple times
Type | Purpose | What it Measures | Common Statistics
Test–Retest | Consistency over time | Temporal stability | Pearson’s r, ICC
Parallel-Forms | Consistency across equivalent tests | Version equivalence | Pearson’s r
Split-Half | Consistency between two halves of a test | Internal homogeneity | Spearman–Brown
Internal Consistency | Consistency among items | Inter-item correlation | Cronbach’s α, ω, KR-20
Inter-Rater | Consistency among different raters | Agreement between raters | Cohen’s κ, ICC
Intra-Rater | Consistency of one rater over time | Rater stability | ICC, Pearson’s r
Pilot Testing
• Pilot testing is a small-scale preliminary trial of your research
instrument conducted before the main data collection. Its main
goal is to test the quality, clarity, reliability, and validity of the
instrument and the research process.
Objectives of Pilot Testing
• Check if items are clear, understandable, and interpreted
correctly
• Evaluate time required to complete the instrument
• Assess initial reliability and validity evidence
• Identify ambiguous, confusing, or redundant items
• Ensure that scales, instructions, and response options work as
intended
• Test data collection procedures and logistics
Guide in Conducting Pilot Testing
• Define the Purpose and Scope of the Pilot
• Select an Appropriate Sample
• Administer the Instrument Under Realistic Conditions
• Collect Feedback on Item Clarity and Format
• Analyze the Pilot Data
• Revise the Instrument
• Document the Pilot Testing Process
Define the Purpose and Scope of the Pilot
• Before conducting a pilot, clarify what you want to learn:
• Are you testing item clarity and wording?
• Are you checking internal consistency or reliability?
• Are you examining construct structure (via EFA)?
• Are you testing the data collection procedure or platform?
Select an Appropriate Sample
• Choose a sample similar to your target population but smaller
in size. This ensures that the results reflect how the actual
participants might respond.
• Cognitive Pretest / Item Clarity: 10–20 participants
• Initial Pilot (Item Analysis): 30–60 participants
• Reliability / EFA (Exploratory): 100–200 participants
• CFA / Validation (Advanced): ≥ 200 participants

Use a convenience sample (e.g., students, teachers, professionals) who
resemble your actual participants.
Administer the Instrument Under Realistic
Conditions
• Simulate the actual research environment as closely as possible.
• Provide instructions exactly as you would in the main study.
• Keep time limits consistent.
• Use the same delivery format (e.g., online survey platform, printed
questionnaire).

Observe respondents as they answer and note hesitation, confusion, or
skipped items.
Collect Feedback on Item Clarity and Format
• After respondents complete the pilot instrument, conduct a short
debriefing session (can be written or oral). Ask questions such as:
• Were any items unclear or confusing?
• Did any question feel ambiguous or repetitive?
• Were the instructions easy to follow?
• Did the response options make sense?

Best practice: Include an open-ended feedback section at the end of the
pilot questionnaire.
Analyze the Pilot Data
• Once you collect pilot data, analyze it to evaluate the quality of
your instrument. Here’s what to check:
• Descriptive Statistics
• Item Analysis
• Reliability Testing
• Factor Analysis (Optional for Larger Samples)
Revise the Instrument
• Based on feedback and data analysis:
• Remove items with low item-total correlations (< .30)
• Revise ambiguous or confusing items
• Reorganize sections for better flow
• Adjust the number of response categories if necessary
• Add items if certain subconstructs are underrepresented

Re-run a second pilot test if you make significant changes.
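The < .30 screening rule can be sketched as below. This version uses the corrected item-total correlation (each item against the total of the remaining items), which is an assumption made here so that an item is not correlated with itself; the pilot data are hypothetical, with item 4 deliberately written to behave unlike the others.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def corrected_item_total(responses):
    """Correlate each item with the total of the remaining items."""
    n_items = len(responses[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        correlations.append(pearson_r(item, rest))
    return correlations


# Hypothetical pilot responses: 6 respondents x 4 Likert items.
pilot = [
    [4, 5, 4, 2],
    [3, 3, 2, 4],
    [5, 5, 4, 3],
    [2, 2, 3, 3],
    [4, 4, 4, 2],
    [3, 4, 3, 4],
]

r_values = corrected_item_total(pilot)
flagged = [i + 1 for i, r in enumerate(r_values) if r < 0.30]
print("Item-total correlations:", [round(r, 2) for r in r_values])
print("Items to review (r < .30):", flagged)
```

A flagged item is a candidate for review, not automatic deletion; wording problems found in the debriefing may explain a low correlation.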


Document the Pilot Testing Process
• For your thesis, dissertation, or publication, document every
step:
• Sample description and selection
• Administration procedure
• Item analysis results and decisions
• Reliability and validity results
• Specific revisions made based on pilot results
Tips:
• Always pilot before the final study — even if you’re adapting an
existing tool.
• Include a mix of qualitative feedback (comments) and
quantitative analysis.
• Pilot your data collection logistics (e.g., consent forms,
instructions, timing).
• Don’t aim for “perfect” — the goal is refinement, not final
validation.
Threats to Validity
Types of Validity Threats
• Researchers usually classify threats into four main categories of
validity (Cook & Campbell, 1979; Shadish, Cook, & Campbell,
2002):
• Internal validity – Are the results caused by the treatment and not by
other factors?
• External validity – Can the results be generalized beyond the study
sample?
• Construct validity – Are we measuring the intended construct
accurately?
• Statistical conclusion validity – Are the statistical relationships valid and
accurate?
Internal Validity Threats
• Definition: Internal validity is the degree to which the observed
effect is truly caused by the independent variable — not by
confounding factors.
• These threats occur mainly in experimental or quasi-
experimental designs.
• History
• Maturation
• Testing Effect
• Instrumentation
• Statistical Regression
• Experimental Mortality (Attrition)
• Selection Bias
• Diffusion/Imitation of Treatment
• Compensatory Rivalry/Resentful Demoralization
External Validity Threats
• Definition: External validity refers to the generalizability of your
findings — whether results apply to other people, settings, times,
or conditions.
• Interaction of Selection and Treatment
• Interaction of Setting and Treatment
• Interaction of History and Treatment
• Multiple-Treatment Interference
Construct Validity Threats
• Definition: Construct validity is the extent to which your
instrument or manipulation accurately represents the
theoretical concept.
• Inadequate Explication of Constructs
• Mono-Operation Bias
• Mono-Method Bias
• Confounding Constructs with Levels of Constructs
• Experimenter Expectancy
• Evaluation Apprehension / Demand Characteristics
Statistical Conclusion Validity Threats
• Definition: This type of validity concerns the accuracy of
statistical inferences about relationships or effects.
• Low Statistical Power
• Violation of Statistical Assumptions
• Fishing / Data Dredging
• Measurement Error
• Range Restriction
Minimizing Threats
• Plan ahead – Design your study with potential threats in mind.
• Use control groups and randomization – Reduce internal validity
issues.
• Ensure instrument validity and reliability – Strengthen construct
validity.
• Pilot test and pretest – Identify and fix problems early.
• Replicate studies – Improve external validity.
• Conduct power analysis – Protect statistical conclusion validity.
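As one illustration of a power analysis, the sketch below estimates the sample size per group needed to compare two group means. It relies on a normal approximation, n ≈ 2 × ((z₁₋α/₂ + z_power) / d)², and assumes a two-sided test with α = .05 and power = .80 as defaults; these settings and the standardized effect sizes d are assumptions, not values given in the slides. Dedicated power-analysis software using the t distribution will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided comparison of two means.

    effect_size: standardized mean difference (Cohen's d), assumed known.
    Uses the normal approximation, so results are slightly optimistic.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Smaller expected effects require much larger samples, which is why the expected effect size should be justified from prior literature before data collection.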
Activity 1.
• What will be your possible instrument for your
identified research?
• How are you going to conduct the pilot testing?
• What validity and reliability tests are you
going to perform?
• What are the possible threats to validity?