Student Learning Assessment Strategies
Learning 1
Foreword
COVID-19 has affected the world at large, but this has
also given us a glimpse of the good that exists.
- Amit Gupta
Table of Contents
Foreword
CHAPTER 1
OUTCOMES-BASED EDUCATION
Overview
In response to the need to standardize education systems and
processes, many higher education institutions in the Philippines have shifted
their attention and efforts toward implementing an OBE system at the school
level. The shift to OBE has been propelled predominantly by its use as a
framework by international and local academic accreditation bodies in school-
and program-level accreditation, in which many schools invest their efforts.
The Commission on Higher Education (CHED) even emphasized the need for
the implementation of OBE by issuing a memorandum order on the
"Policy Standard to enhance quality assurance in Philippine Higher Education
through an Outcomes-Based and Typology-Based QA". Consequently, a
Handbook of Typology, Outcomes-Based Education, and Sustainability
Assessment was released in 2014.
Given the current status of OBE in the country, this lesson aims to shed
light on some critical aspects of the framework in the hope of elucidating
important concepts that will ensure proper implementation of OBE. It also
zeroes in on inferring the implications of OBE implementation for the
assessment and evaluation of students’ performance.
Objective
Upon completion of this chapter, the students can achieve a good
grasp of outcomes-based education.
Pre-discussion
Primarily, this chapter deals with the shift of educational focus from
content to learning outcomes, particularly OBE as matching intentions with
the outcomes of education. The students can state and discuss the change of
educational focus from content to learning outcomes. They can
present sample educational objectives and learning outcomes in K to 12
subjects of their own choice.
What to Expect?
At the end of the lesson, the students can:
discuss outcomes-based education, its meaning, brief history and
characteristics;
identify the procedures in the implementation of OBE in subjects or
courses; and
define outcomes and discuss each type of outcomes.
Meaning of Education
According to some learned people, the word education is derived
from the Latin term "educatum", which means the act of teaching or training.
Other groups of educationalists say that it comes from another Latin word,
"educare", which means to bring up or to raise. For a few others, the word
education originated from another Latin word, "educere", which means to
lead forth or to come out. All these meanings indicate that education seeks to
nourish the good qualities in man and draw out the best in every individual; it
seeks to develop the inner, innate capacities of man. By educating an individual,
we attempt to give him/her knowledge, skills, understanding, interests,
attitudes, and critical thinking. For example, he/she acquires knowledge of
history, geography, arithmetic, language, and science.
Today, outcomes-based education is the main thrust of the higher
education institutions (HEIs) in the Philippines. OBE comes in the form of
competency-based learning standards and outcomes-based quality assurance
monitoring and evaluation spelled out under CHED Memorandum Order (CMO)
No. 46. CHED’s OBE differs from transformational OBE in the
following aspects:
The CMO acknowledges that there are two different OBE frameworks,
namely the strong and the weak.
CHED subscribes to the weak, or lower-case, OBE due to the realities of
Philippine higher education.
CHED recognizes that there are better OBE frameworks than the one
it implemented, and it does not limit HEIs to implementing
the weak rather than the strong OBE.
Spady’s OBE or what is otherwise called transformational OBE is
under the strong category of OBE.
What is OBE?
Outcomes-Based Education (OBE) is a process that involves the
restructuring of curriculum, assessment, and reporting practices in education to
reflect the achievement of higher-order learning and mastery rather than the
accumulation of course credits. It is a recurring education reform model, a
student-centered learning philosophy that focuses on empirically measuring
student performance (called outcomes) rather than on the resources that
are available to students (called inputs).
Furthermore, Outcome-Based Education means clearly focusing and
organizing everything in an educational system around what is essential for all
students to be able to do successfully at the end of their learning experiences.
This means starting with a clear picture of what is important for students to be
able to do, then organizing the curriculum, instruction, and assessment to make
sure that this learning ultimately happens.
For education stalwart Dr. William Spady, Outcome-Based Education
(OBE) is a paradigm shift in the education system that is changing the way
students learn, teachers think, and schools measure excellence and success.
He came to the Philippines to introduce OBE and share its benefits.
Spady said that in conceptualizing OBE in 1968, he observed that the US
education system was more bent on making students achieve good scores.
"So there are graduates who pass exams, but lack skills. Then there are those
who can do the job well yet are not classic textbook learners." Furthermore, he
said that OBE is not concerned with one standard for assessing the
success rate of an individual. "In OBE, real outcomes take us far beyond the
paper-and-pencil test." An OBE-oriented learner thinks of the process of
learning as a journey in itself. He acknowledged that all students can learn and
succeed, but not on the same day in the same way.
As a global authority in educational management and the founder of the OBE
learning philosophy, Spady sees that, unlike previous learning strategies in which
a learner undergoes assessment to see how much of the lessons one has absorbed,
OBE is more concerned with how successful one is in achieving what needs to
be accomplished in terms of skills and strategies. "It’s about developing
a clear set of learning outcomes around which an educational system can
focus," he said. "Outcomes are clear learning results that students can
demonstrate at the end of significant learning experiences. They are what
learners can actually do with what they know and have learned." Outcomes-
Based Education expects active learners, continuous assessment, knowledge
integration, critical thinking, learner-centeredness, and learning programs. Also,
it is designed to match education with actual employment. Philippine higher
education institutions are encouraged to implement OBE not only to be locally
and globally competitive but also to work for transformative education.
Philippines, learning materials are aligned with OBE through the following
features:
Learning Objectives - Statements that describe what learners/students are
expected to develop by the time they finish a particular chapter. This may
include the cognitive, psychomotor, and affective aspects of learning.
Teaching Suggestions - This section covers ideas, activities, and strategies that
are related to the topic and will help the instructor in achieving the Learning
Objectives.
Chapter Outline - This section shows the different topics/subtopics found in
each chapter of the textbook.
Discussion Questions - This section contains end-of-chapter questions that will
require students to use their critical thinking skills to analyze the factual
knowledge of the content and its application to actual human experiences.
Experiential Learning Activities - This includes activities that are flexible in
nature. This may include classroom/field/research activities, simulation
exercises, and actual experiences in real-life situations.
Objective-type tests of students’ knowledge may include any of the
following:
- Identification
- True or False
- Fill in the blank
- Matching type
- Multiple Choice
Answer Keys to the test questions must be provided*
Assessment for Learning - This may include rubrics that will describe and
evaluate the level of performance/expected outcomes of the learners.
…CAN read and demonstrate good comprehension of text in areas of
the student’s interest or professional field.
…CAN demonstrate the ability to apply basic research methods in
psychology, including research design, data analysis, and
interpretation.
…CAN identify environmental problems, evaluate problem-solving
strategies, and develop science-based solutions.
…CAN demonstrate the ability to evaluate, integrate, and apply
appropriate information from various sources to create cohesive,
persuasive arguments, and to propose design concepts.
learning context should challenge students enough to activate and enable
higher-order thinking skills (e.g., critical thinking, decision making, problem
solving, etc.), and should be more authentic (e.g., performance tests,
demonstration exercise, simulation or role play, portfolio, etc.).
Expanded opportunity. The first and second principles importantly
necessitate that educators deliver students’ learning experiences at an
advanced level. In the process, many students may find it difficult to comply
with the standards set for a course. As a philosophical underpinning of
OBE, Spady (1994) emphasized that "all students can learn and succeed,
but not on the same day, in the same way." This discourages educators
from generalizing manifestations of learned behavior from students,
considering that every student is a unique learner. Thus, an expanded
opportunity should be granted to students in the process of learning and,
more importantly, in assessing their performance. The expansion of
opportunity can be considered multidimensional (i.e., time, methods and
modalities, operational principles, performance standards, curriculum
access and structuring). In assessment practice and procedures, the
time dimension implies that educators should give more opportunities for
students to demonstrate learning outcomes at the desired level. Thus,
provisions of remedial, make-up, removal, and practice tests, and other
expanded learning opportunities are common in OBE classrooms.
Design down. This is the most crucial operating principle of OBE. As
mentioned in the previous section, OBE implements a top-down approach
in designing and stating the outcomes of education (i.e., culminating -
enabling - discrete outcomes). The same principle can be applied in
designing and implementing outcomes assessments in classes.
Traditionally, the design of assessments for classes follows a
bottom-up approach. Educators would initially develop measures for micro
learning tasks (e.g., quizzes, exercises, assignments, etc.), then proceed
to develop the end-of-term tasks (e.g., major examination, final project,
etc.). In the OBE context, since the more important outcomes that should be
primarily identified and defined are the culminating ones, it follows that the
same principle should logically apply to assessment design.
However, in a traditional education system and economy, students are
given grades and rankings compared to each other. Content and performance
expectations are based primarily on what was taught in the past to students of
a given age. The basic goal of traditional education was to present the
knowledge and skills of the old generation to the new generation of students,
and to provide students with an environment in which to learn, with little
attention (beyond the classroom teacher) to whether or not any student ever
learns any of the material. It was enough that the school presented an
opportunity to learn. Actual achievement was neither measured nor required by
the school system.
In fact, under the traditional model, student performance is expected to
show a wide range of abilities. The failure of some students is accepted as a
natural and unavoidable circumstance. The highest-performing students are
given the highest grades and test scores, and the lowest-performing students
are given low grades. Local laws and traditions determine whether the lowest-
performing students are socially promoted or made to repeat the year.
Schools use norm-referenced tests, such as inexpensive, multiple-choice,
computer-scored questions with single correct answers, to quickly rank
students by ability. These tests do not give criterion-based judgments as to
whether students have met a single standard of what every student is expected
to know and do; they merely rank the students in comparison with each other.
In this system, grade-level expectations are defined as the performance of the
median student, a level at which half the students score better and half the
students score worse. By this definition, in a normal population, half of the
students are expected to perform above grade level and half below grade
level, no matter how much or how little the students have learned.
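The contrast between ranking students against each other and judging them against a fixed standard can be illustrated with a short sketch. It is only an illustration: the two sets of class scores and the standard of 75 are hypothetical values chosen for the example, and the function names are not taken from any source.

```python
from statistics import median

def norm_referenced(scores):
    """Label each score relative to the class median (the 'grade level')."""
    grade_level = median(scores)
    labels = []
    for s in scores:
        if s > grade_level:
            labels.append("above")
        elif s < grade_level:
            labels.append("below")
        else:
            labels.append("at")
    return labels

def criterion_referenced(scores, standard):
    """Label each score against one fixed standard, ignoring other students."""
    return ["met" if s >= standard else "not met" for s in scores]

# Hypothetical classes: one that learned little, one that learned a lot.
weak_class = [40, 45, 50, 55, 60, 65]
strong_class = [80, 85, 90, 92, 95, 98]
STANDARD = 75  # hypothetical criterion, assumed for illustration

norm_weak = norm_referenced(weak_class)
norm_strong = norm_referenced(strong_class)
crit_weak = criterion_referenced(weak_class, STANDARD)
crit_strong = criterion_referenced(strong_class, STANDARD)

print(norm_weak, norm_strong)   # same ranking pattern in both classes
print(crit_weak, crit_strong)   # very different results against the standard
```

In both classes the norm-referenced labels split the students half above and half below, while the criterion-referenced labels show whether the students actually reached the standard.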
In outcomes-based education, classroom instruction is focused on the
skills and competencies that students must demonstrate when they exit. There
are two types of outcomes: immediate and deferred outcomes.
Immediate outcomes are competencies and skills acquired upon
completion of a subject, a grade level, a segment of a program, or the program
itself. Examples of these are:
Ability to communicate in writing and speaking
Mathematical problem-solving skills
Skill in identifying objects by using the different senses
Ability to produce artistic or literary works
Ability to do research and write the results
Ability to present an investigative science project
Skill in story-telling
Promotion to a higher grade level
Graduation from a program
Passing a required licensure examination
Initial job placement
On the other hand, deferred outcomes refer to the ability to apply
cognitive, psychomotor, and affective skills/competencies in various situations
many years after completion of a subject, grade level, or degree program.
Examples of these are:
Success in professional practice or occupation
Promotion in a job
Success in career planning, health, and wellness
Awards and recognition
Summary
The change in educational perspective is called Outcomes-Based
Education (OBE), which is characterized by the following:
It is student-centered; that is, it places the students at the center of the
process by focusing on Student Learning Outcomes (SLOs).
It is faculty driven; that is, it encourages faculty responsibility for
teaching, assessing program outcomes, and motivating participation
from the students.
It is meaningful; that is, it provides data to guide the teacher in making
valid and continuing improvement in instruction and other assessment
activities.
To implement OBE in a subject or course, the teacher should first
identify the educational objectives of the subject or course so that he/she can
help students develop and enhance their knowledge, skills, and attitudes.
Next, he/she must list all the learning outcomes specified for each subject or
course objective. A good source of learning outcome statements is the
taxonomy of educational objectives by Benjamin Bloom, which is grouped into
three domains: the Cognitive, also called knowledge, refers to mental skills
such as remembering, understanding, applying, analyzing, evaluating,
synthesizing, and creating; the Psychomotor, also referred to as skills, includes
manual or physical skills, which proceed from mental activities and range from
the simplest to the most complex, such as observing, imitating, practicing,
adapting, and innovating; the Affective, also known as attitude, refers to growth
in feelings or emotions, from the simplest behavior to the most complex, such as
receiving, responding, valuing, organizing, and internalizing.
The emphasis in an OBE education system is on measured outcomes
rather than "inputs," such as how many hours students spend in class, or what
textbooks are provided. Outcomes may include a range of skills and knowledge.
Generally, outcomes are expected to be concretely measurable, that is,
"Student can run 50 meters in less than one minute" instead of "Student enjoys
physical education class." A complete system of outcomes for a subject area
normally includes everything from mere recitation of fact ("Students will name
three tragedies written by Shakespeare") to complex analysis and interpretation
("Student will analyze the social context of a Shakespearean tragedy in an
essay"). Writing appropriate and measurable outcomes can be very difficult,
and the choice of specific outcomes is often a source of local controversies.
Learning outcomes describe the measurable skills, abilities, knowledge,
or values that students should be able to demonstrate as a result of
completing a course. They are student-centered rather than teacher-centered,
in that they describe what the students will do, not what the instructor will teach.
They are not standalone statements. They must all relate to each other and to
the title of the unit and avoid repetition. Articulating learning outcomes for
students is part of good teaching. If you tell students what you expect them to
do, and give them practice in doing it, then there is a good chance that they will
be able to do it on a test or major assignment. That is to say, they will have
learned what you wanted them to know. If you do not tell them what they
will be expected to do, then they are left guessing what you want. If they guess
wrong, they will resent you for being tricky, obscure, or punishing.
Finally, outcomes assessment procedures must also be drafted to
enable the teacher to determine the degree to which the students are attaining
the desired learning outcomes. It identifies for every outcome the data that will
be gathered which will guide the selection of the assessment tools to be used
and at what point assessment will be done.
Enrichment
Assessment
Activity 1. Fill in the matrix based on your findings on the Educational
Objectives (EO) and create your own Learning Outcomes (LO).
Activity 2. Research the nature of education and submit/present
your outputs in PowerPoint slides.
Activity 3. The following statements are incorrect. On the blank before each
number, write the letter of the section which makes the sentence wrong, and
on the blank after each number, re-write the wrong section to make the
sentence correct.
1. Because of knowledge explanation/ brought about by the use of/
(a) (b)
computers in education/ the teacher ceased to be the sole source
(c) (d)
of knowledge.
5. Education comes/ from the Latin root/ "educare" or "educere"/ which
(a) (b) (c)
means to "pour in".
(d)
8. The content and the outcome/ are the two/ main elements/ of the
(a) (b) (c) (d)
educative process.
Activity 4. Give the meaning of the following words or groups of words. Write
your answers in the spaces provided after each number.
1. Outcomes-Based Education
2. Immediate Outcome
3. Deferred Outcome
4. Educational Objective
5. Learning Outcome
6. Student-Centered Instruction
7. Content-Centered Instruction
8. Psychomotor Skill
9. Cognitive Skill
References
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon
City: Adriana Publishing Co., Inc.
Macayan, J. (2017). Implementing Outcome-Based Education (OBE)
Framework: Implications for Assessment of Students’ Performance.
Educational Measurement and Evaluation Review, Vol. 8 (1).
Navarro, R., Santos, R. and Corpuz, B. (2017). Assessment of Learning I (3rd
ed.). Metro Manila: Lorimar Publishing, Inc.
CHAPTER 2
INTRODUCTION TO ASSESSMENT IN LEARNING
Overview
A clear understanding of the course on Assessment of Learning has to
begin with one’s complete awareness of the fundamental terms and principles.
Most importantly, a good grasp of concepts like assessment, learning,
evaluation, measurement, testing, and test is requisite knowledge for every
pre-service teacher. Sufficient knowledge of these pedagogic elements would
certainly heighten his or her confidence in teaching. The principles behind
assessment must similarly be studied, as all activities related to it
must be properly grounded; otherwise, assessment is unsound and meaningless.
Objective, content, method, tool, criterion, recording, procedure, feedback, and
judgment are some significant factors that must be considered to undertake
quality assessment.
Objective
Upon completion of the unit, the students can discuss the fundamental
concepts, principles, purposes, roles and classifications of assessment, as well
as align the assessment methods to learning targets.
Pre-discussion
Study the picture in Figure 1.
Does it have something to do with
assessment? What are your
comments?
What to Expect?
At the end of the lesson, the students can:
1. make a personal definition of assessment;
2. compare assessment with measurement and evaluation;
3. discuss testing and grading;
4. explain the different principles in assessing learning;
5. relate an experience as a student or pupil related to each principle;
6. comment on the tests administered by their past teachers; and
7. perform simple evaluation.
What is assessment?
Let us have some definitions of assessment from varied sources:
1. Assessment involves the use of empirical data on student learning to refine
programs and improve student learning. (Assessing Academic Programs in
Higher Education by Allen 2004)
2. Assessment is the process of gathering and discussing information from
multiple and diverse sources in order to develop a deep understanding of
what students know, understand, and can do with their knowledge as a
result of their educational experiences; the process culminates when
assessment results are used to improve subsequent learning. (Learner-
Centered Assessment on College Campuses: shifting the focus from
teaching to learning by Huba and Freed 2000)
3. Assessment is the systematic basis for making inferences about the
learning and development of students. It is the process of defining,
selecting, designing, collecting, analyzing, interpreting, and using
information to increase students' learning and development. (Assessing
Student Learning and Development: A Guide to the Principles, Goals, and
Methods of Determining College Outcomes by Erwin 1991)
4. Assessment is the systematic collection, review, and use of information
about educational programs undertaken for the purpose of improving
student learning and development (Palomba & Banta, 1999).
5. Assessment refers to the wide variety of methods or tools that educators
use to evaluate, measure, and document the academic readiness, learning
progress, skill acquisition, or educational needs of students (Great Schools
Partnership, 2020).
6. David et al. (2020:3) defined assessment as the "process of gathering
quantitative and/or qualitative data for the purpose of making decisions."
7. Assessment is defined as a process that is used to keep track of learners’
progress in relation to learning standards and in the development of 21st
century skills; to promote self-reflection and personal accountability among
students about their own learning; and to provide bases for the profiling of
student performance on the learning competencies and standards of the
curriculum (DepEd Order No. 8, s. 2015).
Assessment is one of the most critical dimensions of the education
process; it not only identifies how many of the predefined education
aims and goals have been achieved but also works as a feedback mechanism
that educators should use to enhance their teaching practices. Assessment is
among the main factors that contribute to a high-quality teaching and
learning environment.
The value of assessment can be seen in the links that it forms with other
education processes. Thus, Lamprianou and Athanasou (2009:22) pointed out
that assessment is connected with the education goals of
"diagnosis, prediction, placement, evaluation, selection, grading, guidance or
administration". Moreover, Biggs (1999) regarded assessment as a critical
process that provides information about the effectiveness of teaching and the
progress of students and also makes clearer what teachers expect from
students.
Meaning of Learning
We all know that the human brain is immensely complex and still
somewhat of a mystery. It follows then, that learning as a primary function of
the brain is appreciated in many different senses.
To give you sufficient insight into the term, here are several ways
in which learning can be described:
1. "A change in human disposition or capability that persists over a period of
time and is not simply ascribable to processes of growth." (From The
Conditions of Learning by Robert Gagne)
2. Learning is the relatively permanent change in a person’s knowledge or
behavior due to experience. This definition has three components: 1) the
duration of the change is long-term rather than short-term; 2) the locus of
the change is the content and structure of knowledge in memory or the
behavior of the learner; 3) the cause of the change is the learner’s
experience in the environment rather than fatigue, motivation, drugs,
physical condition or physiologic intervention. (From Learning in
Encyclopedia of Educational Research, Richard E. Mayer)
3. It has been suggested that the term learning defies precise definition
because it is put to multiple uses. Learning is used to refer to (1) the
acquisition and mastery of what is already known about something, (2) the
extension and clarification of meaning of one’s experience, or (3) an
organized, intentional process of testing ideas relevant to problems. In other
words, it is used to describe a product, a process, or a function. (From
Learning How to Learn: Applied Theory for Adults by R.M. Smith)
4. A process that leads to change, which occurs as a result of experience and
increases the potential of improved performance and future learning. (From
Make It Stick: The Science of Successful Learning by Peter C. Brown, Henry
L. Roediger III, Mark A. McDaniel)
5. The process of gaining knowledge and expertise. (From How Learning
Works: Seven Research-Based Principles for Smart Teaching by Susan
Ambrose, et al.)
6. A persisting change in human performance or performance potential which
must come about as a result of the learner’s experience and interaction with
the world. (From Psychology of Learning for Instruction by M. Driscoll)
7. Learning is "a process that leads to change, which occurs as a result of
experience and increases the potential for improved performance and future
learning" (Ambrose et al., 2010:3). The change in the learner may happen
at the level of knowledge, attitude or behavior. As a result of learning,
learners come to see concepts, ideas, and/or the world differently. It is not
something done to students, but rather something students themselves do.
It is the direct result of how students interpret and respond to their
experiences.
From the foregoing definitions, learning can be briefly stated as a change
in the learner’s behavior towards an improved level resulting from his/her
experiences and interactions with the environment.
Study the following figures to better appreciate the meaning of
"learning."
Figure 2
Figure 3
Figure 4
You may be thinking that learning to bake cookies and learning
something like Chemistry are not the same at all. In a way, you are right;
however, the information you get from assessing what you have learned is the
same. Brian used what he learned from each batch of cookies to improve the
next batch. Likewise, from every homework assignment that you complete
and every quiz that you take, you learn what you still need to study to know
the material.
know and understand; how far they have progressed and how fast; and how
their scores and progress compare to those of other students.
In short, evaluation is the process of making judgments based on
standards and evidence derived from measurements. It gives meaning
to the measured attributes. With this, it is implicit that a sound evaluation is
dependent on the way measurement was carried out. Ordinarily, a teacher’s
decision to pass or fail a learner is determined by the learner’s obtained grade
relative to the school standard. Thus, if the standard passing or cut-off grade
is 75, a final grade of 74 or lower means failing, while a final grade of 75 or
better means passing. The same scenario takes place in the
granting of academic excellence awards such as Valedictorian, Salutatorian,
First Honors, Second Honors, Cum laude, Magna cum laude, Summa cum
laude, etc. Here, evaluation means comparing one’s grade or achievement
against established standards or criteria to arrive at a decision. Therefore,
grading of students in schools must be credible to ensure that the giving of
awards is indisputable.
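The pass-or-fail decision described above amounts to comparing a measured grade against a standard. A minimal sketch follows, using the cut-off grade of 75 stated in the text; the function name `evaluate` and the sample grades are assumptions made for illustration.

```python
PASSING_GRADE = 75  # standard passing or cut-off grade stated in the text

def evaluate(final_grade, cutoff=PASSING_GRADE):
    """Turn a measurement (the grade) into a judgment (passed or failed)."""
    return "passed" if final_grade >= cutoff else "failed"

# Hypothetical final grades, used only to show the decision rule.
for grade in (74, 75, 90):
    print(grade, evaluate(grade))
```

The number itself is the measurement; the label attached to it after comparison with the standard is the evaluation.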
degree of achievement. In a similar way, the Professional Regulation
Commission (PRC) and the Civil Service Commission (CSC) administer
licensure and eligibility examinations to test the readiness or competence of
would-be professionals.
On the other hand, grading implies combining several assessments,
translating the result into some type of scale that has evaluative meaning, and
reporting the result in a formal way. Hence, grading is a process and not merely
a set of quantitative values. It is one of the major functions, results, and
outcomes of assessing and evaluating students’ learning in the educational
setting (Magno, 2010). Practically, grading is the process of assigning value to
the performance or achievement of a learner based on specified criteria such as
performance tasks, written tests, major examinations, and homework. It is also a
form of evaluation which provides information as to whether a learner passed or
failed in a certain task or subject. For example, a student is given a grade of 85
after scoring 36 in a 50-item midterm examination. He also received a passing
grade of 90 in Mathematics after his detailed grades in written tests and
performance tasks were computed.
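One common way of combining several assessments into a single grade is a weighted average. The sketch below is only an illustration: the component weights and the component grades are hypothetical and are not taken from the text or from any official grading policy.

```python
# Hypothetical weights for each grading component (must sum to 1.0).
WEIGHTS = {
    "written_test": 0.40,
    "performance_task": 0.40,
    "major_examination": 0.20,
}

def final_grade(component_grades, weights=WEIGHTS):
    """Combine component grades (each on a 0-100 scale) into one grade."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    total = sum(component_grades[name] * w for name, w in weights.items())
    return round(total, 2)

# Hypothetical detailed grades of one student in Mathematics.
detailed_grades = {
    "written_test": 88,
    "performance_task": 92,
    "major_examination": 90,
}
math_grade = final_grade(detailed_grades)
print(math_grade)
```

Changing the weights changes the final grade even when the component grades stay the same, which is why the weighting scheme should be made known to learners in advance.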
Models in Assessment
The two most common psychometric theories that serve as frameworks
for assessment and measurement, especially in the determination of the
psychometric characteristics of a measure (e.g., tests, scales), are the classical
test theory (CTT) and the item response theory (IRT).
The CTT, also known as the true score theory, explains that variations
in the performance of examinees on a given measure are due to variations in
their abilities. It assumes that an examinee’s observed score on a given measure
is the sum of the examinee’s true score and some degree of error in the
measurement caused by some internal and external conditions. Hence, the
CTT also assumes that all measures are imperfect and that the scores obtained
from a measure could differ from the true score (i.e., the true ability of an
examinee).
The CTT provides an estimation of item difficulty based on the
proportion of examinees who correctly answer a particular item; items
that fewer examinees answer correctly are considered
more difficult. It also provides an estimation of item discrimination based on
how examinees with higher and lower ability answer a particular item. If
an item is able to distinguish between examinees with higher ability (i.e., higher
total test score) and lower ability (i.e., lower total test score), then the item is
considered to have good discrimination. Test reliability can also be estimated
using approaches from CTT (e.g., Kuder-Richardson 20, Cronbach’s alpha).
Item analysis based on this theory has been the dominant approach because
of the simplicity of calculating the statistics (e.g., item difficulty index, item
discrimination index, item-total correlation).
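The CTT statistics mentioned above can be computed by hand or with a few lines of code. The sketch below assumes a small, hypothetical matrix of scored responses (1 = correct, 0 = incorrect), splits examinees into upper and lower halves by total score for the discrimination index (other splits, such as upper and lower 27%, are also used in practice), and uses the population variance of total scores in the Kuder-Richardson 20 formula. The function names are illustrative only.

```python
from statistics import pvariance

def item_difficulty(responses):
    """Proportion of examinees answering each item correctly (p)."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(k)]

def item_discrimination(responses):
    """Upper-lower index: p in the upper half minus p in the lower half."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    k = len(responses[0])
    return [
        sum(r[j] for r in upper) / half - sum(r[j] for r in lower) / half
        for j in range(k)
    ]

def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomously scored items."""
    k = len(responses[0])
    p = item_difficulty(responses)
    sum_pq = sum(pi * (1 - pi) for pi in p)
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

# Hypothetical data: 6 examinees (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
difficulty = item_difficulty(responses)
discrimination = item_discrimination(responses)
reliability = kr20(responses)
print(difficulty, discrimination, round(reliability, 3))
```

In this sample, the first item is the easiest (most examinees answer it correctly), and the second item separates the upper and lower groups most sharply.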
The IRT, on the other hand, analyzes test items by estimating the
probability that an examinee answers an item correctly or incorrectly. One of
the central differences of IRT from CTT is that in IRT, it is assumed that the
characteristic of an item can be estimated independently of the characteristic
or ability of an examinee, and vice versa. Aside from item difficulty and item
discrimination indices, IRT analysis can provide considerably more information
on items and tests, such as fit statistics, the item characteristic curve (ICC),
and the test characteristic curve (TCC). There are also different IRT models
(e.g., one-parameter model, three-parameter model) which can provide item and
test information that cannot be estimated using the CTT. In recent years, the
use of IRT analysis as a measurement framework has increased despite the
complexity of the analysis involved, owing to the availability of IRT
software.
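To make the idea of estimating the probability of a correct answer more concrete, here is a minimal Python sketch of the logistic function commonly used in one- to three-parameter IRT models. The function name, the parameter names (theta, a, b, c), and the example item are assumptions introduced for illustration; the text itself does not give a formula.

```python
import math

def prob_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: examinee ability; b: item difficulty;
    a: item discrimination; c: pseudo-guessing parameter.
    With a=1 and c=0 this reduces to a one-parameter form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Trace an item characteristic curve (ICC) for one hypothetical item
# of average difficulty (b = 0) across a range of ability levels.
abilities = [-2, -1, 0, 1, 2]
icc = [round(prob_correct(t, b=0.0), 3) for t in abilities]
```

In this sketch an examinee whose ability equals the item difficulty has a 0.5 chance of answering correctly, and the probability rises with ability, which is the shape an ICC plots.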
Types of Assessment
The most common types of assessment are diagnostic, formative and
summative, criterion-referenced and norm-referenced, traditional and
authentic. Other experts added ipsative and confirmative assessments.
Pre-assessment or diagnostic assessment
Before creating the instruction, it is necessary to know what kind of
students you are creating the instruction for. Your goal is to get to know your
students’ strengths, weaknesses, and the skills and knowledge they
possess before taking the instruction. Based on the data you have
collected, you can create your instruction. Usually, a teacher conducts a
pre-test to diagnose the learners.
Formative assessment
Formative assessment consists of continuous, multiple assessments done
during the instructional process for the purpose of improving teaching or
learning (Black & Wiliam, 2003).
Summative assessment
Summative assessments are quizzes, tests, exams, or other formal
evaluations of how much a student has learned throughout a subject. The
goal of this assessment is to get a grade that corresponds to a student’s
understanding of the class material as a whole, such as with a midterm or
cumulative final exam.
Confirmative assessment
When your instruction has been implemented in your classroom, it is still
necessary to carry out assessment. Your goal with confirmative assessments
is to find out if the instruction is still a success after a year, for example,
and if the way you are teaching is still on point. You could say that a
confirmative assessment is an extensive form of a summative assessment
(LMS, 2020).
Norm-referenced assessment
This assessment primarily compares one’s learning performance against an
average norm. It indicates the student’s performance in contrast with other
students (see Figure 5). The students being compared are of the same age and
take the same question paper. It assesses whether the students have performed
better or worse than the others. The norm is the theoretical average
determined by comparing scores.
Criterion-referenced assessment
It measures students’ performance against a fixed set of predetermined
criteria or learning standards (see Figure 6). It checks what students are
expected to know and be able to do at a specific stage of their education.
Criterion-referenced tests are used to evaluate a specific body of knowledge
or skill set; such a test evaluates the curriculum taught in a course. In
practice, these assessments are designed to determine whether students have
mastered the material presented in a specific unit. Each student’s performance
is measured based on the subject matter presented (what the student knows and
what the student does not know). All students can get 100% if they have fully
mastered the material.
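The contrast between norm-referenced and criterion-referenced interpretation of the same score can be sketched in a few lines of Python. The scores, the 75% mastery cutoff, and the function names below are hypothetical and chosen only for illustration.

```python
def percentile_rank(score, group_scores):
    """Norm-referenced view: percentage of the group scoring below this score."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)

def has_mastered(score, max_score, cutoff=0.75):
    """Criterion-referenced view: compare the score with a fixed standard.
    The 0.75 cutoff is an assumed value, not one prescribed by the text."""
    return score / max_score >= cutoff

# Invented scores out of 30 for a class of five students.
class_scores = [18, 22, 25, 27, 30]

rank_of_25 = percentile_rank(25, class_scores)   # depends on classmates
mastery_of_25 = has_mastered(25, 30)             # depends only on the standard
```

The first function changes its answer whenever the comparison group changes, while the second gives the same answer for a given score no matter how other students perform, which is why every student can reach mastery under a criterion-referenced interpretation.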
Ipsative assessment
It measures the performance of a student against that student’s previous
performances. With this method, you are trying to improve yourself by
comparing your current results with previous ones. You are not comparing
yourself against other students, a comparison that may not be good for your
self-confidence (LMS, 2020).
Traditional Assessment
Traditional assessments refer to conventional methods of testing, usually
matching type test items. In general, they measure students’ knowledge of
the content. Common examples are: True or False, multiple choice tests,
standardized tests, achievement tests, intelligence tests, and aptitude
tests.
Authentic Assessment
Authentic assessments refer to evaluative activities wherein students are
asked to perform real-world tasks that demonstrate meaningful application
of what they have learned. They measure students’ ability to apply
knowledge of the content in real life situations and ability to use
what they have learned in meaningful ways. Common examples are:
demonstrations, hands-on experiments, computer simulations, portfolios,
projects, multi-media presentations, role plays, recitals, stage plays and
exhibits.
Principles of Assessment
There are many principles in the assessment of learning. Different
sources provide their own unique yet closely related sets of principles of
assessment. According to David et al. (2020), the following may be considered
as core principles in assessing learning:
1. Assessment should have a clear purpose. The methods used in
collecting information should be based on this purpose. The
interpretation of the data collected should be aligned with the purpose
that has been set. This principle is congruent with the outcome-based
education (OBE) principles of clarity of focus and design down.
2. Assessment is not an end in itself. It serves as a means to enhance
student learning. It is not a simple recording or documentation of what
learners know and do not know. Collecting information about student
learning, whether formative or summative, should lead to decisions that
will allow improvement of the learners.
3. Assessment is an ongoing, continuous, and formative process. It
consists of a series of tasks and activities conducted over time. It is not
a one-shot activity and should be cumulative. Continuous feedback is
an important element of assessment. This principle is congruent with
the OBE principle of expanded opportunity.
4. Assessment is learner-centered. It is not about what the teacher does
but what the learner can do. Assessment of learners provides teachers
with an understanding on how they can improve their teaching, which
corresponds to the goal of improving student learning.
5. Assessment is both process- and product-oriented. It gives equal
importance to learner performance or product and to the process learners
engaged in to perform or produce a product.
6. Assessment must be comprehensive and holistic. It should be
performed using a variety of strategies and tools designed to assess
student learning in a holistic way. It should be conducted in multiple
periods to assess learning over time. This principle is also congruent
with the OBE principle of expanded opportunity.
7. Assessment requires the use of appropriate measures. For
assessment to be valid, the assessment tools or measures used must
have sound psychometric properties, including, but not limited to,
validity and reliability. Appropriate measures also mean that learners
must be provided with challenging but age- and context-appropriate
assessment tasks. This principle is consistent with the OBE principle of
high expectation.
8. Assessment should be as authentic as possible. Assessment tasks or
activities should closely, if not fully, approximate real-life situations or
experiences. Authenticity of assessment can be thought of as a
continuum from least authentic to most authentic, with more authentic
tasks expected to be more meaningful for learners.
Summary
Assessment is a systematic process of defining, selecting, designing,
collecting, analyzing, interpreting, and using information to increase
students' learning and development.
Assessment may be described in terms of its purpose such as assessment
FOR, assessment OF and assessment AS.
Learning is a change in the learner’s behaviour towards an improved level
as a product of one’s experience and interaction with his environment.
Measurement is a process of determining or describing the attributes or
characteristics of learners generally in terms of quantity.
Evaluation is the process of making judgments based on standards and
evidences derived from measurements.
A test is a tool consisting of a set of questions administered during a fixed
period of time under comparable conditions for all students. Testing
measures the level of skill or knowledge that has been reached.
Grading is a form of evaluation which provides information as to whether a
learner passed or failed in a certain task or subject.
The most common psychometric theories that serve as frameworks for
assessment and measurement in the determination of the psychometric
characteristics of a measure are the classical test theory (CTT) and the
item response theory (IRT).
The most common types of assessment are diagnostic, formative and
summative, criterion-referenced and norm-referenced, traditional and
authentic. Other experts added ipsative and confirmative assessments.
Principles of assessment are guides for teachers in their design, and
development of outcomes-based assessment tools.
Assessment
1. What is assessment in learning? What is assessment in learning for you?
2. Differentiate the following:
2.1. Measurement and evaluation
2.2. Testing and grading
2.3. Formative and summative assessment
2.4. Classical test theory and Item response theory
3. Based on the principles that you have learned, make a simple plan on how
you will undertake your assessment with your future students. Consider 2
principles only.
Principles | Plan for applying the principle in your classroom assessment
1.
2.
examinations that were out of the topics. What
made it worse is that he would get angry when
asked about the mismatch. I think the teacher did
not consider the validity of his test, and it was not
appropriate.
2.
3.
4.
Enrichment
Secure a copy of DepEd Order No. 8, s. 2015 on the Policy Guidelines on
Classroom Assessment for the K to 12 Basic Education Program. Study
the policies and be ready to clarify any provisions during G-class. You can
access the Order from this link: [Link]
8-s-2015-policy-guidelines-on-classroom-assessment-for-the-k-to-12-
basic-education-program/
Read DepEd Order No. 5, s. 2013 (Policy Guidelines on the
Implementation of the School Readiness Year-end Assessment (SReYA)
for Kindergarten). (Please access through
[Link]
the-implementation-of-the-school-readiness-year-end-assessment-sreya-
for-kindergarten/).
Questions
1. What assessment is cited in the Order? What is the purpose of giving
such assessment?
2. How would you classify the assessment in terms of its nature? Justify.
3. What is the relevance of this assessment to students, parents and
teachers and the school?
References
Alberta Education (2008, October 1). Types of Classroom Assessment.
Retrieved from
[Link]
David et al. (2020). Assessment in Learning 1. Manila: Rex Book Store.
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon
City: Adriana Publishing Co., Inc.
Fisher, M. Jr. R. (2020). Student Assessment in Teaching and Learning.
Retrieved from [Link]
teaching-and-learning/
Navarro, L., Santos, R. and Corpuz, B. (2017). Assessment of Learning 1 (3rd
ed.). Quezon City: Lorimar Publishing, Inc.
Magno, C. (2010). The Functions of Grading Students. The Assessment
Handbook, 3, 50-58.
Lesson 2: Purposes of Classroom Assessment, Educational Objectives,
Learning Targets and Appropriate Methods
Pre-discussion
To achieve the intended learning outcomes of this lesson, one
is required to understand the basic concepts, theories and principles in
assessing the learning of students. Should these not yet be clear and
understood, it is advised that a thorough review be made of the previous
chapter.
What to Expect?
At the end of the lesson, the students can:
1. articulate the purpose of classroom assessment;
2. tell the difference between Bloom’s Taxonomy and the Revised
Bloom’s Taxonomy in stating learning objectives;
3. apply the Revised Bloom’s Taxonomy in writing learning objectives;
4. discuss the importance of learning targets in instruction;
5. formulate learning targets; and
6. match the assessment methods with specific learning
objectives/targets.
learning is taking place in the common tasks of the school day – and how
much insight into student learning teachers can mine from this material
(McNamee and Chen, 2005: 76).
Assessment for learning is on-going assessment that allows teachers to
monitor students on a day-to-day basis and modify their teaching based on
what the students need to be successful. This assessment provides
students with the timely, specific feedback that they need to make
adjustments to their learning.
After teaching a lesson, we need to determine whether the lesson was
accessible to all students while still challenging to the more capable; what
the students learned and still need to know; how we can improve the lesson
to make it more effective; and, if necessary, what other lesson we might
offer as a better alternative. This continual evaluation of instructional
choices is at the heart of improving our teaching practice (Burns, 2005).
Is used continually by providing descriptive feedback. | Is presented in a periodic report.
Usually uses detailed, specific and descriptive feedback - in a formal or informal report. | Usually compiles data into a single number, score or mark as part of a formal report.
Is not reported as part of an achievement grade. | Is reported as part of an achievement grade.
Usually focuses on improvement, compared with the student's “previous best” (self-referenced, making learning more personal). | Usually compares the student's learning either with other students' learning (norm-referenced, making learning highly competitive) or the standard for a grade level (criterion-referenced, making learning more collaborative and individually focused).
Involves the student. | Does not always involve the student.
Adapted from Ruth Sutton, unpublished document, 2001, in Alberta Assessment
Consortium, Refocus: Looking at Assessment for Learning (Edmonton, AB: Alberta
Assessment Consortium, 2003), p. 4.
assessment must be used. While it is difficult to perform an assessment with all
three purposes in mind, teachers must be able to understand the three
purposes of assessment, including knowing when and how to use them.
particular. Teachers need information on whether the learners have met the
intended learning outcomes after the instruction is fully implemented. The
learners’ placement or promotion to the next educational level is informed
by the assessment results.
Facilitative. Classroom assessment may affect student learning. On the part
of teachers, assessment for learning provides information on students’
learning and achievement that teachers can use to improve instruction and
the learning experiences of learners. On the part of learners, assessment
as learning allows them to monitor, evaluate, and improve their own
learning strategies. In both cases, student learning is facilitated.
Motivational. Classroom assessment can serve as a mechanism for learners
to be motivated and engaged in learning and achievement in the classroom.
Grades, for instance, can motivate and demotivate learners. Focusing on
progress, providing effective feedback, innovating assessment tasks, and
using scaffolding during assessment activities provide opportunities for
assessment to be motivating rather than demotivating.
stated with the use of verbs. The most popular taxonomy of educational
objectives is Bloom’s Taxonomy of Educational Objectives.
The Bloom’s Taxonomy of Educational Objectives
Bloom’s Taxonomy consists of three domains: cognitive, affective and
psychomotor. These three domains correspond to the three types of goals that
teachers want to assess: knowledge-based goals (cognitive), skills-based goals
(psychomotor), and affective goals (affective). Hence, there are three
taxonomies that can be used by teachers depending on the goals. Each
taxonomy consists of different levels of expertise with varying degrees of
complexity. The most popular among the three taxonomies is Bloom’s
Taxonomy of Educational Objectives for Knowledge-Based Goals. The
taxonomy describes six levels of expertise: knowledge, comprehension,
application, analysis, synthesis, and evaluation. Table 1 presents the
description, illustrative verbs, and a sample objective for each of the six levels.
Table 1. Bloom’s Taxonomy of Educational
Objectives in the Cognitive Domain
the nature and objectives in the
association among cognitive domain.
the elements
Synthesis | Construction of elements or parts from different sources to form a more complex or novel structure | composes, constructs, creates, designs, and integrates | Compose learning targets using Bloom’s taxonomy.
Evaluation | Making judgment of ideas or methods based on sound and established criteria | appraises, evaluates, judges, concludes, and criticizes | Evaluate the congruence between learning targets and assessment methods.
Below is an example of an educational or learning objective:
Students will be able to differentiate qualitative research and quantitative
research.
In the example, differentiate is the verb that represents the type of
cognitive process (in this case, analyze), while qualitative research and
quantitative research is the noun phrase that represents the type of knowledge
(in this case, conceptual).
Tables 2 and 3 present the definition, illustrative verbs, and sample
objectives of the cognitive process dimensions and knowledge dimensions of
the Revised Bloom’s Taxonomy.
Table 2. Cognitive Process Dimensions in the Revised
Bloom’s Taxonomy of Educational Objectives
Table 3. Knowledge Dimensions in the Revised Bloom’s
Taxonomy of Educational Objectives
LEARNING TARGETS
“Students who can identify what they are learning significantly outscore
those who cannot.” – Robert Marzano
The metaphor that Connie Moss and Susan Brookhart use to describe
learning targets in their Educational Leadership article, “What Students Need
to Learn,” is that of a global positioning system (GPS). Much like a GPS
communicates timely information about where you are, how far and how long
until your destination, and what to do when you make a wrong turn, a learning
target provides a precise description of the learning destination. Learning
targets tell students what they will learn, how deeply they will learn it, and
how they will demonstrate their learning.
Learning targets describe in student-friendly language the learning to
occur in the day’s lesson. Learning targets are written from the students’ point
of view and represent what both the teacher and the students are aiming for
during the lesson. Learning targets also include a performance of
understanding, or learning experience, that provides evidence to answer the
question “What do students understand and what are they able to do?”
As Moss and Brookhart write, while a learning target is for a daily lesson,
“Most complex understandings require teachers to scaffold student
understanding across a series of interrelated lessons.” In other words, each
learning target is a part of a longer, sequential plan that includes short- and
long-term goals.
McMillan (2014) defined learning targets as a statement of student
performance for a relatively restricted type of learning outcome that will be
achieved in a single lesson or a few days, and contains what students should
know, understand and be able to do at the end of the instruction and criteria for
judging the level of demonstrated performance. A learning target is more
specific and clearer than educational goals, standards, and learning
objectives. To avoid confusion of terms, De Guzman and Adamos (2015) wrote
that the definition of learning targets is similar to that of learning outcomes.
Now, how does a learning target differ from an instructional objective?
An instructional objective describes an intended outcome and the nature of
evidence that will determine mastery of that outcome from a teacher’s point of
view. It contains content outcomes, conditions, and criteria. A learning target,
on the other hand, describes the intended lesson-sized learning outcome and
the nature of evidence that will determine mastery of that outcome from a
student’s point of view. It contains the immediate learning aims for today’s
lesson (ASCD, 2021).
Why Use Learning Targets?
According to experts, one of the most powerful formative strategies for
improving student learning is clear learning targets for students. In Visible
Learning, John Hattie emphasizes the importance of “clearly communicating
the intentions of the lessons and the criteria for success. Teachers need to
know the goals and success criteria of their lessons, know how well all students
in their class are progressing, and know where to go next.”
Learning targets ensure that students:
know what they are supposed to learn during the lesson; without a
clear learning target, students are left guessing what they are
expected to learn and what their teacher will accept as evidence of
success.
build skilfulness in their ability to assess themselves and be
reflective.
are continually monitoring their progress toward the learning goal
and making changes as necessary to achieve their goal.
are in control of their own learning, and not only know where they
are going, they know exactly where they are relative to where they
are going; they are able to choose strategies to help them do their
best, and they know exactly what it takes to be successful.
know the essential information to be learned and how they will
demonstrate that learning to achieve mastery.
Learning targets are a part of a cycle that includes student goal
setting and teacher feedback. Formative assessment, assessment for
learning, starts when the teacher communicates the learning target at the
beginning of the lesson. Providing examples of what is expected along
with the target written in student-friendly language gives students the
opportunity to set goals, self-assess, and make improvements.
Table 4. Types of Learning Targets, Description and Sample
Types | Description | Sample
Knowledge (Know, list, identify, understand, explain) | Knowledge targets represent the factual information, procedural knowledge, and conceptual understandings that underpin each discipline or content area. These targets form the foundation for each of the other types of learning targets. | I can explain the role of conceptual framework in a research. I can identify metaphors and similes. I can read and write quadratic equations. I can describe the function of a cell membrane. I can explain the effects of an acid on a base.
Skills (Demonstrate, pronounce, perform) | Skill targets are those where a demonstration or a physical skill-based performance is at the heart of the learning. Most skill targets are found in subjects such as physical education, visual and performing arts, and foreign languages. Other content areas may have a few skill targets. | I can facilitate a focus group discussion (FGD) with research participants. I can measure mass in metric and SI units. I can use simple equipment and tools to gather data. I can read aloud with fluency and expression. I can participate in civic discussions with the aim of solving current problems. I can dribble to keep the ball away from an opponent.
Reasoning (Predict, infer, summarize, compare, analyze, classify) | Reasoning targets specify thought processes students must learn to do well across a range of subjects. Reasoning involves thinking and applying: using knowledge to solve a problem, make a decision, etc. These targets move students beyond mastering content knowledge to the application of knowledge. | I can justify my research problems with a theory. I can use statistical methods to describe, analyze, evaluate, and make decisions. I can make a prediction based on evidence. I can examine data/results and propose a meaningful interpretation. I can distinguish between historical fact and opinion.
Product (Create, design, write, draw, make) | Product targets describe learning in terms of artifacts where creation of a product is the focus of the learning target. With product targets, the specifications for quality of the product itself are the focus of teaching and assessment. | I can write a thesis proposal. I can construct a bar graph. I can develop a personal health-related fitness plan. I can construct a physical model of an object.
Other experts consider a fifth type of learning target – affect. This refers
to affective characteristics that students can develop and demonstrate because
of instruction. This includes the attitudes, beliefs, interests, and values. Some
experts use disposition as an alternative term for affect.
matched to the item stems. There should always be one more answer
choice than the number of item stems. Generally, matching items are well
suited for testing understanding of concepts and principles.
True-false items have the advantage of being easy to write, more can be
given in the same amount of time compared to MC items, reading time is
minimized, and they are easy to score.
Constructed-response items require the student to answer a question,
commonly referred to as a “prompt.” A constructed-response exam is
considered to be a subjective exam because the correctness of the answer is
based on a rater’s opinion, typically with the use of a rubric scale to guide the
scoring. Essay and short answer exams are constructed-response
assessments because the student has to “construct” the answer.
Teacher Observation
Teacher observation has been accepted readily in the past as a
legitimate source of information for recording and reporting student
demonstrations of learning outcomes. As the student progresses to later
years of schooling, less and less attention typically is given to teacher
observation and more and more attention typically is given to formal
assessment procedures involving required tests and tasks taken under explicit
constraints of context and time. However, teacher observation is capable of
providing substantial information on student demonstration of learning
outcomes at all levels of education.
For teacher observation to contribute to valid judgments concerning
student learning outcomes, evidence needs to be gathered and recorded
systematically. Systematic gathering and recording of evidence requires
preparation and foresight. Teacher observation can be characterised as two
types: incidental and planned.
Incidental observation occurs during the ongoing (deliberate) activities of
teaching and learning and the interactions between teacher and students.
In other words, an unplanned opportunity emerges, in the context of
classroom activities, where the teacher observes some aspect of individual
student learning. Whether incidental observation can be used as a basis
for formal assessment and reporting may depend on the records that are
kept.
Planned observation involves deliberate planning of an opportunity for the
teacher to observe specific learning outcomes. This planned opportunity
may occur in the context of regular classroom activities or may occur
through the setting of an assessment task (such as a practical or
performance activity).
Student Self-Assessment
One form of formative assessment is self-assessment or self-reflection
by students. Self-reflection is the evaluation or judgment of the worth of one’s
performance and the identification of one’s strengths and weaknesses with a
view to improving one’s learning outcomes, or more succinctly, reflecting on
and monitoring one’s own work processes and/or products (Klenowski, 1995).
Student self-assessment has long been encouraged as an educational and
learning strategy in the classroom, and is both popular and positively regarded
by the general education community (Andrade, 2010).
Besides, McMillan and Hearn (2008) described self-assessment as a
process by which students 1) monitor and evaluate the quality of their thinking
and behavior when learning and 2) identify strategies that improve their
understanding and skills. That is, self-assessment occurs when students judge
their own work to improve performance as they identify discrepancies between
current and desired performance. This aspect of self-assessment aligns closely
with standards-based education, which provides clear targets and criteria that
can facilitate student self-assessment. The pervasiveness of standards-based
instruction provides an ideal context in which these clear-cut benchmarks for
performance and criteria for evaluating student products, when internalized by
students, provide the knowledge needed for self-assessment. Finally,
self-assessment identifies further learning targets and instructional strategies
(correctives) students can apply to improve achievement.
Table 6. Matching Learning Targets with other Types of Assessment
Learning Project-based Portfolio Recitation Observation
Targets
Knowledge 1 3 3 2
Reasoning 2 2 3 2
Skill 2 3 1 2
Product 3 3 1 1
Note: Higher numbers indicate better matches (e.g., 5 = Excellent, 1 = Poor).
There are still other types of assessment, and it is up to the teachers to
select the method of assessment and design appropriate assessment tasks
and activities to measure the identified learning targets.
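One way to use Table 6 when planning is to treat it as a lookup table. The short Python sketch below transcribes the ratings from Table 6 and returns the highest-rated method or methods for a given learning target; the variable and function names are hypothetical, and choosing only the top-rated methods is an assumption of this example, not a rule stated in the text.

```python
# Ratings transcribed from Table 6 (higher numbers indicate better matches).
TABLE_6 = {
    "Knowledge": {"Project-based": 1, "Portfolio": 3, "Recitation": 3, "Observation": 2},
    "Reasoning": {"Project-based": 2, "Portfolio": 2, "Recitation": 3, "Observation": 2},
    "Skill":     {"Project-based": 2, "Portfolio": 3, "Recitation": 1, "Observation": 2},
    "Product":   {"Project-based": 3, "Portfolio": 3, "Recitation": 1, "Observation": 1},
}

def best_methods(target):
    """Return the assessment methods with the highest rating for a learning target."""
    ratings = TABLE_6[target]
    top = max(ratings.values())
    return [method for method, rating in ratings.items() if rating == top]
```

For example, `best_methods("Reasoning")` returns only recitation, while a product target is matched equally well by project-based and portfolio assessment. The teacher still has to judge which of the suggested methods fits the lesson.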
Summary
In an educational setting, the purpose of assessment may be classified in
terms of assessment of learning, assessment for learning, and
assessment as learning.
Assessment OF learning is held at the end of a subject or a course to
determine performance. It is equivalent to summative assessment.
Assessment FOR learning is done repeatedly during instruction to check
the learners’ progress and teacher’s strategies so that intervention or
changes can be made.
Assessment AS learning is done to develop the learners’ independence
and self-regulation.
Classroom assessment in the teaching-learning process has the following
roles: formative, diagnostic, evaluative, and motivational.
Educational objectives are best explained through Bloom’s Taxonomy. It
consists of three (3) domains, namely: cognitive, affective and
psychomotor which are the main goals of teachers.
Instructional objectives guide instruction, and we write them from the
teacher’s point of view. Learning targets guide learning and are expressed
in language that students understand; they describe the lesson-sized portion
of information, skills, and reasoning processes that students will come to
know deeply.
Assessment methods may be categorized as selected-response,
constructed-response, teacher observation and student self-assessment.
Learning targets may be knowledge, skills, reasoning or product.
Teachers match learning targets with appropriate assessment methods.
Assessment
1. Describe the 3 purposes of classroom assessment by completing the
matrix below.
Assessment OF learning | Assessment FOR learning | Assessment AS learning
WHAT?
WHY?
WHEN?
Sample statements
transfer energy and matter in an ecosystem.
5. I can recall the influences that promote alcohol, tobacco,
and other drug use.
6. I can use characteristic properties of liquids to distinguish
one substance from another.
7. I can evaluate the quality of my own work to refine it.
8. I can identify the main idea of a passage.
9. I can dribble the basketball with one hand.
10. I can list down the first 5 Philippine Presidents.
11. I can construct a bar graph.
12. I can develop a personal health-related fitness plan.
13. I can measure the length of an object.
14. I can introduce myself in Chinese.
15. I can compare forms of government.
Review
Product write an effective
APA Guidelines review section of a
in Citations and thesis proposal
References
Title of Lesson:
Instructional objective/learning objectives | Lesson Content | Type of Learning Targets | Sample Learning Targets
9. I can match assessment method
appropriate to specific learning targets.
10. I can select or design an assessment task
or activity to measure a specific learning
target.
Enrichment
Open DepEd’s K to 12 Curriculum Guide from this link:
[Link]
curriculum/grade-1-to-10-subjects/ and familiarize yourself with the
content standards, performance standards, and competencies.
Choose a specific lesson for a subject area and grade level that you want
to teach in the future. Prepare an assessment plan using the matrix.
Subject
Grade level
Performance standards
Specific lesson
Learning targets
Assessment task/activity
References
Andrade, H. (2010). Students as the definitive source of formative
assessment: Academic self-assessment and the self-regulation of
learning. In H. Andrade & G. Cizek (Eds.), Handbook of formative
assessment (pp. 90–105). New York, NY: Routledge.
Clayton, Heather. “Power Standards: Focusing on the Essential.” Making the
Standards Come Alive! Alexandria, VA: Just ASK Publications, 2016.
Access at [Link]/just-ask-resource-center/e-
newsletters/msca/power-standards/
David et al. (2020). Assessment in Learning 1. Manila: Rex Book Store.
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon
City: Adriana Publishing Co., Inc.
EL Education (2020). Students Unpack a Learning Target and Discuss
Academic Vocabulary. [Video]. [Link]
Hattie, John. Visible Learning for Teachers: Maximizing Impact on Learning.
New York: Routledge, 2012.
Klenowski, V. (1995). Student self-evaluation processes in student-centred
teaching and learning contexts of Australia and England. Assessment
in Education: Principles, Policy & Practice, 2(2).
Maxwell, Graham S. (2001). Teacher Observation in Student Assessment.
(Discussion Paper). The University of Queensland.
Moss, Connie and Susan Brookhart. Learning Targets: Helping Students Aim
for Understanding in Today’s Lesson. Alexandria: ASCD, 2012.
Navarro, L., Santos, R. and Corpuz, B. (2017). Assessment of Learning 1 (3rd
ed.). Quezon City: Lorimar Publishing, Inc.
Lesson 3: Different Classifications of Assessment
Pre-discussion
Ask the students about their experiences when they took the National
Achievement Test (NAT) during their elementary and high school days. Who
administered it? How did they answer it? What do they think was the purpose
of the NAT? What about their experiences in taking quarterly tests or quizzes?
What other assessments or tests did they take before? What are their notable
experiences relative to taking tests?
What to Expect?
At the end of the lesson, the students can:
1. compare the following forms of assessment: educational vs.
psychological, teacher-made vs. standardized, selected-response vs.
constructed-response, achievement vs. aptitude, and power vs. speed;
2. give examples of each classification of test;
3. illustrate situations on the use of different classifications of
assessment; and
4. decide on the kind of assessment to be used.
Classifications of Assessment
The different forms of assessment are classified according to purpose,
form, interpretation of learning, function, ability, and kind of learning.
Classification | Type
Purpose | Educational and Psychological
Form | Paper and pencil, and Performance-based
Function | Teacher-made and Standardized
Kind of learning | Achievement and Aptitude
Ability | Speed and Power
Interpretation of learning | Norm-referenced and Criterion-referenced
classroom setting, it focuses on identifying the knowledge, skills, and attitudes
students have acquired via a lesson, a course, a grade level, and so on. It is an
ongoing process, ranging from the activities that teachers do with students in
classrooms every day to standardized testing, college theses and instruments
that measure the success of corporate training programs.
Let’s understand educational assessment by looking at its many
aspects:
The forms educational assessment can take
The need for educational assessment
The essentials of a good assessment
Types of educational assessment
Educational assessments can take many forms:
It may involve formal tests or performance-based activities.
It may be administered online or using paper and pencil or other
materials.
It may be objective (requiring a single correct answer) or subjective
(there may be many possible correct answers, such as in an essay).
It may be formative (carried out over the course of a project) or
summative (administered at the end of a project or a course).
What these types of educational assessment have in common is that
all of them measure the learners’ performance relative to previously defined
goals, which are usually stated as learning objectives or outcomes. Because
assessment is so widespread, it is vital that educators, as well as
parents and students, understand what it is and why it is used.
Psychological assessment is the use of standardized measures to
evaluate the abilities, behaviors, and personal qualities of people. Typically,
psychological tests attempt to shed light on an individual’s intelligence,
personality, motivation, interest, psychopathology, or ability. Traditionally, these
tests were formed on clinical or psychiatric populations and were used primarily
for diagnosis and treatment. However, with the increasing presence of forensic
psychologists in the courtroom, these tests are being used to help determine
legal questions or legal constructs. As a result, there is a growing debate over
the utility of these tests in the courtroom.
Paper-pencil and Performance-based Assessments
Paper-and-pencil instruments refer to a general group of assessment
tools in which students read questions and respond in writing. This includes
tests, such as knowledge and ability tests, and inventories, such as personality
and interest inventories. They can be used to assess job-related knowledge and
ability or skill qualifications. The range of qualifications that can be
assessed using paper-and-pencil tests is quite broad. For example, such tests
can assess anything from knowledge of office procedures to knowledge of
federal legislation, and from the ability to follow directions to the ability to solve
numerical problems. Because many test takers can be assessed at the same time
with a paper-and-pencil test, such tests are an efficient method of assessment.
All assessment methods must provide information that is relevant to the
qualification(s) being assessed. There are four (4) steps in developing paper-
and-pencil tests, namely: listing topic areas/tasks; specifying the response
format, number of questions, the time limit and difficulty level; writing the
questions and developing the scoring guide; and reviewing the questions and
scoring guide.
What type of response format should I choose?
The three most common response formats are:
(a) multiple-choice;
(b) short answer; and
(c) essay.
With a multiple-choice response format, a large number of different topic
areas/tasks can be covered within the same test and the questions are easy
to score. However, because every answer option must be plausible enough to
be chosen by some candidates, it is time-consuming to write good questions.
With a short-answer response format, as in multiple choice, a large number
of different topic areas/tasks can be covered within the same test and these
questions are easy to score. In addition, less time is required to write these
questions compared to multiple-choice ones.
With an essay response format, only a few topic areas/tasks can be covered
due to the amount of time it takes to answer questions; however, the content
can be covered in greater detail. Essay questions require little time to write
but they are very time-consuming to score.
Although at first glance a multiple-choice format may seem a relatively easy
and logical choice if breadth of coverage is emphasized, don't be fooled. It
is hard to write good multiple-choice questions and you should only choose
this type of response format if you are willing to devote a lot of time to editing,
reviewing, and revising the questions. If depth of coverage is emphasized,
use an essay response format.
Performance-based Assessment
Performance assessment is one alternative to traditional methods of
testing student achievement. While traditional testing requires students to
answer questions correctly, performance assessment requires students to
demonstrate knowledge and skills, including the process by which they solve
problems. Performance assessments measure skills such as the ability to
integrate knowledge across disciplines, contribute to the work of a group, and
develop a plan of action when confronted with a new situation. Performance
assessments are also appropriate for determining if students are achieving
the higher standards set by states for all students.
The following six (6) types of activities provide good starting points for
assessments in performance-based learning.
1. Presentations
One easy way to have students complete a performance-based activity
is to have them do a presentation or report of some kind. This activity could be
done by individual students, which takes time, or in collaborative groups.
The basis for the presentation may be one of the following:
Providing information
Teaching a skill
Reporting progress
Persuading others
Students may choose to add visual aids, such as a PowerPoint or Google
Slides presentation, to help illustrate elements in their speech. Presentations work
well across the curriculum as long as there is a clear set of expectations for
students to work with from the beginning.
2. Portfolios
Student portfolios can include items that students have created and
collected over a period of time. Art portfolios, for example, are prepared by
students who want to apply to art programs in college. Another example is
when students create a portfolio of
their written work that shows how they have progressed from the beginning to
the end of class. The writing in a portfolio can be from any discipline or a
combination of disciplines.
Some teachers have students select those items they feel represent
their best work to be included in a portfolio. The benefit of an activity like this
is that it is something that grows over time and is therefore not just completed
and forgotten. A portfolio can provide students with a lasting selection of
artefacts that they can use later in their academic career.
Reflections may be included in student portfolios in which students may
make a note of their growth based on the materials in the portfolio.
3. Performances
Dramatic performances are one kind of collaborative activity that can
be used as a performance-based assessment. Students can create, perform,
and/or provide a critical response. Examples include dance, recital, dramatic
enactment, and prose or poetry interpretation.
This form of performance-based assessment can take time, so there
must be a clear pacing guide. Students must be provided time to address the
demands of the activity; resources must be readily available and meet all safety
standards. Students should have opportunities to draft stage work and practice.
Developing the criteria and the rubric and sharing these with students
before evaluating a dramatic performance is critical.
4. Projects
Projects are commonly used by teachers as performance-based
activities. They can include everything from research papers to artistic
representations of information learned. Projects may require students to apply
their knowledge and skills while completing the assigned task. They can be
aligned with the higher levels of creativity, analysis, and synthesis.
Students might be asked to complete reports, diagrams, and maps.
Teachers can also choose to have students work individually or in groups.
Journals may be part of a performance-based assessment. They can be used
to record student reflections. Teachers may require students to complete
journal entries. Some teachers may use journals as a way to record
participation.
things like history fairs to art exhibitions. Students work on a product or item
that will be exhibited publicly.
Exhibitions show in-depth learning and may include feedback from
viewers. In some cases, students might be required to explain or defend their
work to those attending the exhibition. Some fairs like science fairs could
include the possibility of prizes and awards.
6. Debates
A debate in the classroom is one form of performance-based learning
that teaches students about varied viewpoints and opinions. Skills associated
with debate include research, media and argument literacy, reading
comprehension, evidence evaluation, public speaking, and civic skills.
It is prepared to measure the outcomes and content of local curriculum.
It is very flexible, so it can be adapted to any procedure and material.
It does not require any sophisticated technique for preparation. Taylor has
highly recommended the use of these teacher-made objective-type tests,
which do not require all four steps of standardised tests or the
rigorous processes of standardisation. Only the first two steps, planning and
preparation, are sufficient for their construction.
5. To assess how far specified instructional objectives have been achieved.
6. To know the efficacy of learning experiences.
7. To diagnose students’ learning difficulties and to suggest necessary
remedial measures.
8. To certify, classify or grade the students on the basis of resulting scores.
9. Skilfully prepared teacher-made tests can serve the purpose of
standardised test.
10. Teacher-made tests can help a teacher to render guidance and
counselling.
11. Good teacher-made tests can be exchanged among neighbouring schools.
12. These tests can be used as a tool for formative, diagnostic and summative
evaluation.
13. To assess pupils’ growth in different areas.
Standardized Test
A standardized test is a test that is given to students in a very consistent
manner. It means that the questions on the test are all the same, the time given
to each student is also the same, and the way in which the test is scored is the
same for all students. Standardized tests are constructed by experts along with
explicit instructions for administration, standard scoring procedures, and a table
of norms for interpretation.
Thus, a standardized test is administered and scored in a consistent or
"standard" manner. These tests are designed in such a way that the questions,
conditions for administering, scoring procedures, and interpretations are
consistent.
Any test in which the same test is given in the same manner to all test
takers, and graded in the same manner for everyone, is a standardized test.
Standardized tests do not need to be high-stakes tests, time-limited tests, or
multiple-choice tests. The questions can be simple or complex. The subject
matter among school-age students is frequently academic skills, but a
standardized test can be given on nearly any topic, including driving tests,
creativity, personality, professional ethics, or other attributes.
The purpose of standardized tests is to compare the performance of one
individual with another, an individual against a group, or one group with another
group.
Below is a list of common standardized tests. You can explore the
details of these test titles from [Link]
• Standardized K-12 Exams
• ISEE: Independent School Entrance Examination
• SSAT: Secondary School Admission Test
• HSPT: High School Placement Test
• SHSAT: Specialized High School Admissions Test
• COOP: Cooperative Admissions Examination Program
• PSAT: Preliminary Scholastic Aptitude Test
• GED: General Educational Development Test
• HiSET: High School Equivalency Test
• ACT: American College Test
• SAT: Scholastic Aptitude Test
Table 1. NAT Examination Information
Grade/Year | Examinee | Description
Grade 3 (Elementary) | All students in both public and private schools. | Serves as an entrance assessment for the elementary level.
Grade 6 (Elementary) | | One of the entrance examinations to proceed in Junior High School.
Grade 10 (Junior High School) | | One of the entrance examinations to proceed in Senior High School.
Grade 12 (Senior High School Completers, called Basic Education Exit Assessment (BEEA)) | Graduating students in both public and private schools. | Taken for purposes of systems evaluation; not a prerequisite for graduation or college enrolment.
Note: The test is a system-based assessment designed to gauge learning outcomes across target levels
in identified periods of basic education. Empirical information on the achievement level of
pupils/students serves as a guide for policy makers, administrators, curriculum planners, principals,
and teachers, along with analysis on the performance of regions, divisions, schools, and other
variables overseen by DepEd.
determine what you are capable of; they are designed to evaluate what you
know and your level of skill at the given moment.
Achievement tests are often used in educational and training settings. In
schools, achievement tests are frequently used to determine the level of
education for which students might be prepared. Students might take such a
test to determine if they are ready to enter a particular grade level or if they
are ready to pass a particular subject or grade level and move on to the next.
Standardized achievement tests are also used extensively in
educational settings to determine if students have met specific learning goals.
Each grade level has certain educational expectations, and testing is used to
determine if schools, teachers, and students are meeting those standards.
Aptitude Test
Unlike achievement tests, which are concerned with looking at a person's
level of skill or knowledge at any given time, aptitude tests are instead focused
on determining how capable a person might be of performing a certain task.
An aptitude test is designed to assess what a person is capable of doing
or to predict what a person is able to learn or do given the right education and
instruction. It represents a person's level of competency to perform a certain
type of task. Such aptitude tests are often used to assess academic potential
or career suitability and may be used to assess either mental or physical talent
in a variety of domains.
Some examples of aptitude tests include:
• A test assessing an individual's aptitude to become a fighter pilot
• A career test evaluating a person's capability to work as an air traffic
controller
• An aptitude test given to high school students to determine which
type of careers they might be good at
• A computer programming test to determine how a job candidate might
solve different hypothetical problems
• A test designed to test a person's physical abilities needed for a
particular job such as a police officer or firefighter
Students often encounter a variety of aptitude tests throughout school
as they think about what they might like to study in college or do as a career
someday. High school students in particular often take aptitude tests designed
to help them determine what they should study in college or pursue as a career.
These tests can sometimes give a general idea of what might interest students
as a future career.
For example, a student might take an aptitude test suggesting that they
are good with numbers and data. The results might imply that a career as an
accountant, banker, or stockbroker would be a good choice for that particular
student. Another student might find that they have strong language and verbal
skills, which might suggest that a career as an English teacher, writer, or
journalist might be a good choice.
Thus, an aptitude test measures one’s ability to reason and learn new
skills. Aptitude tests are used worldwide to screen applicants for jobs or
educational programs. Depending on your industry and role, you may have to
take one or more of the following kinds of test, each focused on specific skills:
• Numerical Reasoning Test
• Verbal Reasoning Test
• Abstract Reasoning Test
• Mechanical Aptitude Test
• Inductive Reasoning Test
individual differences in the scores are attributed to differences in the ability
under assessment, not to differences in basic cognitive abilities such as
processing speed or reaction time.
An example of a speed test is a typing test in which examinees are
required to type correctly as many words as possible given a limited amount of
time. An example of a power test is the one developed by the National
Council of Teachers of Mathematics that determines the ability of the examinees
to utilize data to reason and become creative, formulate, solve, and reflect
critically on the problems provided.
Summary
In this lesson, we identified and distinguished the different
classifications of assessment. We learned when to use educational and
psychological assessment, or paper-and-pencil and performance-based
assessment. Also, we were able to differentiate teacher-made and
standardized tests, achievement and aptitude tests, as well as speed and
power tests.
Assessment
1. Which classification of assessment is commonly used in the classroom
setting? Why?
2. To demonstrate understanding, try giving more examples for each type of
assessment.
Type Examples
Educational
Psychological
Paper and pencil
Performance-based
Teacher-made
Standardized
Achievement
Aptitude
Speed
Power
Norm-referenced
Criterion-referenced
3. Match the learning target with the appropriate assessment methods.
Check if the type of assessment is appropriate. Be ready to justify.
Learning targets | Selected-response | Essay | Performance Task | Teacher observation | Self-assessment
Example: Exhibit proper dribbling of a basketball √ √ √
1. Identify parts of a microscope and its functions
2. Compare the methods of assessment
3. Arrange the eating utensils on the table
4. Perform the dance steps in “Pandanggo sa Ilaw”
5. Define assessment
6. Compare and contrast testing and grading
7. List down all the Presidents of the Philippines
8. Find the speed of a car
9. Recite the mission of SKSU
10. Prepare a lesson plan in Mathematics
4. Give the features and use of the following assessments.
Classifications of Assessment | Description | Use or purpose
1. Speed vs. Power tests
2. Achievement vs. Aptitude Test
3. Educational vs. Psychological tests
4. Selected and constructed-response test
5. Paper-pencil vs. performance-based test
Enrichment
Explore the standardized tests offered by the Center for Educational
Measurement (CEM). Access them through this link:
[Link]
Try taking a free Personality Test available online. You can also try an IQ
test. Share the results with the class.
References
Aptitude Tests. Retrieved from [Link]
[Link]
Cherry, Kendra (2020, February 06). How Achievement Tests Measure What
People Have Learned. Retrieved from
[Link]
Classroom Assessment. Retrieved from
[Link]
David et al. (2020). Assessment in Learning 1. Manila: Rex Book Store.
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon
City: Adriana Publishing Co., Inc.
Improving your Test Questions. [Link]
evaluation/exam-scoring/improving-your-test-questions?src=cte-
migration-map&url=%2Ftesting%2Fexam%2Ftest_ques.html
Navarro, L., Santos, R. and Corpuz, B. (2017). Assessment of Learning 1 (3rd
ed.). Quezon City: Lorimar Publishing, Inc.
University of Lethbridge (2020). Creating Assessments. Retrieved from
[Link]
CHAPTER 3
DEVELOPMENT AND ENHANCEMENT OF TEST
Overview
This chapter deals with the process and mechanics of developing a written
test, specifically the teacher-made type. Future professional
teachers have to be competent in the selection of the learning objectives or
outcomes, preparation of a table of specifications (TOS), the guidelines in
writing varied written test formats, and writing the test itself. Adequate
knowledge of TOS construction is indispensable in formulating a valid test
in terms of content and construct. Also, a complete understanding of the rules
and guidelines in writing a specific test format helps ensure an
acceptable and unambiguous test that is fair to the learners. In addition,
reliability and validity are two important characteristics of a test that are
likewise discussed to guarantee quality. For test item enhancement, topics such as
difficulty index, index of discrimination, and distracter analysis are
introduced.
Objective
Upon completion of the unit, the students can demonstrate their
knowledge, understanding and skills in planning, developing and enhancing a
written test.
Pre-discussion
Setting learning objectives for the assessment of a course or
subject and constructing a table of specifications for a classroom test
require specific skills and experience. To successfully perform these
tasks, a pre-service teacher should be able to distinguish the different levels of
cognitive behavior and identify the appropriate assessment method for them. It
is assumed that in this lesson, the competencies for instruction that are
cognitive in nature are the ones identified as the targets in
developing a written test, which should be reflected in the test’s table of
specifications to be created.
What to Expect?
At the end of the lesson, the students can:
1. define the necessary instructional outcomes to be included in a written
test;
2. describe what a table of specifications (TOS) is and its formats;
3. prepare a TOS for a written test; and
4. demonstrate the systematic steps in making a TOS.
assessment. On the other hand, they provide the students with the reasons and
motivation to study and endure. They provide students the opportunities to be
aware of what they need to do to be successful in the course, take control and
ownership of their progress, and focus on what they should be learning. Setting
objectives for assessment is the process of establishing direction to guide both
the teacher in teaching and the student in learning.
In developing the cognitive domain of instructional objectives, key verbs
can be used. Benjamin Bloom created a taxonomy of measurable verbs to help
us describe and classify observable knowledge, skills, attitudes, behaviors and
abilities. The theory is based upon the idea that there are levels of observable
actions that indicate something is happening in the brain (cognitive activity). By
creating learning objectives using measurable verbs, you indicate explicitly
what the student must do in order to demonstrate learning. Please refer to
Figure 2 and Table 1.
For better understanding, Bloom has the following description for each
cognitive domain level:
Knowledge - Remember previously learned information
Comprehension - Demonstrate an understanding of the facts
Application - Apply knowledge to actual situations
Analysis - Break down objects or ideas into simpler parts and find
evidence to support generalizations
Synthesis - Compile component ideas into a new whole or propose
alternative solutions
Evaluation - Make and defend judgments based on internal evidence or
external criteria
Bloom’s Definitions
Remembering - Exhibit memory of previously learned material by recalling
facts, terms, basic concepts, and answers.
Understanding - Demonstrate understanding of facts and ideas by
organizing, comparing, translating, interpreting, giving descriptions, and
stating main ideas.
Applying - Solve problems in new situations by applying acquired
knowledge, facts, techniques and rules in a different way.
Analyzing - Examine and break information into parts by identifying
motives or causes. Make inferences and find evidence to support
generalizations.
Evaluating - Present and defend opinions by making judgments about
information, validity of ideas, or quality of work based on a set of criteria.
Creating - Compile information together in a different way by combining
elements in a new pattern or proposing alternative solutions
Table of Specifications
A table of specifications (TOS), sometimes called a test blueprint, is a
tool used by teachers to design a written test. It is a table that maps out the test
objectives, contents, or topics covered by the test; the levels of cognitive
behavior to be measured; the distribution of items, number, placement, and
weights of test items; and the test format. It helps ensure that the course’s
intended learning outcomes, assessments, and instruction are aligned.
Generally, the TOS is prepared before a test is created. However, it is
ideal to prepare one even before the start of instruction. Teachers need to create
a TOS for every test that they intend to develop. The TOS is important
because it does the following:
Ensures that the instructional objectives and what the test captures
match
Ensures that the test developer will not overlook details that are
considered essential to a good test
Makes developing a test easier and more efficient
Ensures that the test will sample all important content areas and
processes
Is useful in planning and organizing
Offers an opportunity for teachers and students to clarify achievement
expectations.
that can be best captured by a written test. There are objectives that are not
meant for a written test. For example, if you test the psychomotor domain, it
is better to do a performance-based assessment. There are also cognitive
objectives that are sometimes better assessed through performance-based
assessment. Those that require the demonstration or creation of something
tangible like projects would also be more appropriately measured by
performance-based assessment. For a written test, you can consider
cognitive objectives, ranging from remembering to creating ideas, that can be
measured using common formats for testing, such as multiple choice,
alternative response test, matching type, and even essays or open-ended
tests.
2. Determine the coverage of the test. The next step in creating the TOS is
to determine the contents of the test. Only topics or contents that have been
discussed in class and are relevant should be included in the test.
3. Calculate the weight for each topic. Once the test coverage is
determined, the weight of each topic covered in the test is determined. The
weight assigned per topic in the test is based on the relevance and the time
spent to cover each topic during instruction. The percentage of time for a
topic in a test is determined by dividing the time spent on that topic during
instruction by the total time spent on all topics covered in the test. For
example, for a test on the Theories of Personality for a General
Psychology 101 class, the teacher spent ¼ to 1½ hours of class sessions per
topic. As such, the weight for each topic is as follows:
4. Determine the number of items for the whole test. To determine the
number of items to be included in the test, the amount of time needed to
answer the items is considered. As a general rule, students are given 30 to
60 seconds for each item in test formats with choices. For a one-hour class,
this means that the test should not exceed 60 items. However, because you
also need to give time for test paper/booklet distribution and giving
instructions, the number of items should be less, maybe just 50 items.
5. Determine the number of items per topic. To determine the number of
items to be included per topic, the weights per topic are considered. Thus,
using the example above, for a 50-item final test, Theories & Concepts,
Humanistic Theories, Cognitive Theories, Behavioral Theories, and Social
Learning Theories will have 5 items each, Trait Theories will have 10 items, and
Psychoanalytic Theories will have 15 items.
Topic | Percent of Time (Weight) | No. of Items
Theory & Concepts | 10.0 | 5
Psychoanalytic Theories | 30.0 | 15
Trait Theories | 20.0 | 10
Humanistic Theories | 10.0 | 5
Cognitive Theories | 10.0 | 5
Behavioral Theories | 10.0 | 5
Social Learning Theories | 10.0 | 5
Total | 100 | 50 items
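The arithmetic in steps 3 to 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of any official procedure: the function names are hypothetical, and the hours per topic are back-calculated from the weights in the table, assuming a total of 5 hours of instruction (0.5 hours for a 10% topic, 1.0 hour for a 20% topic, and 1.5 hours for a 30% topic).

```python
# Hours spent per topic. ASSUMPTION: back-calculated from the weights in the
# table, using a total of 5 hours of instruction.
HOURS_SPENT = {
    "Theories & Concepts": 0.5,
    "Psychoanalytic Theories": 1.5,
    "Trait Theories": 1.0,
    "Humanistic Theories": 0.5,
    "Cognitive Theories": 0.5,
    "Behavioral Theories": 0.5,
    "Social Learning Theories": 0.5,
}


def max_items(class_minutes, seconds_per_item=60):
    """Step 4: upper limit on items if the whole period is spent answering."""
    return (class_minutes * 60) // seconds_per_item


def topic_weights(hours_spent):
    """Step 3: weight (%) = time spent on the topic / total time * 100."""
    total = sum(hours_spent.values())
    return {topic: 100 * hours / total for topic, hours in hours_spent.items()}


def items_per_topic(weights, total_items):
    """Step 5: items per topic = weight (%) of the topic * total items."""
    return {topic: round(w / 100 * total_items) for topic, w in weights.items()}


weights = topic_weights(HOURS_SPENT)
items = items_per_topic(weights, total_items=50)

print("Item ceiling for a one-hour class:", max_items(60))
for topic in HOURS_SPENT:
    print(f"{topic:<26}{weights[topic]:>6.1f}%{items[topic]:>4} items")
print("Total items:", sum(items.values()))
```

With other weights, rounding can make the per-topic counts add up to slightly more or less than the intended total, so the teacher may still need to adjust one or two topics by hand.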
Topics                    Test Objectives                   No. of Hours   Format and Placement   No. and Percent
                                                            Spent          of Items               of Items
Theories and Concepts     Recognize important concepts      0.5            Multiple Choice        5 (10.0%)
                          in personality theories                          Item #s 1-5
Psychoanalytic Theories   Identify the different theories   1.5            Multiple Choice        15 (30.0%)
                          of personality under the                         Item #s 6-20
                          Psychoanalytic Model
Others                    xxx                               xxx            xxx                    xxx
Total                                                       5                                     50 (100%)
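The item placements in a one-way TOS (for example, Item #s 1-5 and Item #s 6-20) follow directly from the item counts when topics are numbered consecutively. A minimal sketch, assuming consecutive numbering in the order the topics are listed; the function name `item_placement` is hypothetical.

```python
def item_placement(counts):
    """Assign consecutive item-number ranges to topics in the order given."""
    placement = {}
    start = 1
    for topic, n in counts.items():
        placement[topic] = (start, start + n - 1)
        start += n
    return placement


placement = item_placement({
    "Theories and Concepts": 5,
    "Psychoanalytic Theories": 15,
})
for topic, (first, last) in placement.items():
    print(f"{topic}: Item #s {first}-{last}")
```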
2. Two-Way TOS. A two-way TOS reflects not only the content, time spent,
and number of items but also the levels of cognitive behavior targeted per
test content based on the theory behind cognitive testing. For example, the
common framework for testing at present in the DepEd Classroom
Assessment Policy is the Revised Bloom’s Taxonomy (DepEd, 2015). One
advantage of this format is that it allows one to see the levels of cognitive
skills and dimensions of knowledge that are emphasized by the test. It also
shows the framework of assessment used in the development of the test.
Nonetheless, this format is more complex than the one-way format.
Content         Time      No. & Percent  KD*  Level of Cognitive Behavior, Item Format, No. and Placement of Items
                Spent     of Items            R        U        AP       AN       E        C
Theories and    0.5       5 (10.0%)      F    I.3
Concepts        hours                         #1-3
                                         C             I.2
                                                       #4-5
Psychoanalytic  1.5       15 (30.0%)     F    I.2
Theories        hours                         #6-7
                                         C             I.2      I.2
                                                       #8-9     #10-11
                                         P                      I.2      I.2
                                                                #12-13   #14-15
                                         M                               I.3      II.1     II.1
                                                                         #16-18   #41      #42
Others
Scoring                                       1 point per item  2 points per item 3 points per item
Overall Total   5         50 (100.0%)         20                20                10
Another presentation is shown below:
Content         Time      No. of       Level of Cognitive Behavior & Knowledge Dimension*,
                Spent     Items        Item Format, No. & Placement of Items
                                       R        U        AP       AN       E        C
Theories and    0.5       5            I.3      I.2
Concepts        hours     (10.0%)      #1-3     #4-5
                                       (F)      (C)
Psychoanalytic  1.5       15           I.2      I.2      I.2      I.2      II.1     II.1
Theories        hours     (30.0%)      #6-7     #8-9     #10-11   #14-15   #41      #42
                                       (F)      (C)      (C)      (P)      (M)      (M)
                                                         I.2      I.3
                                                         #12-13   #16-18
                                                         (P)      (M)
Others
Scoring                                1 point per item  3 points per item 5 points per item
Overall Total             50           20                20                10
                          (100.0%)
*Legend: KD = Knowledge Dimension (Factual, Conceptual, Procedural, Metacognitive)
I – Multiple Choice; II – Open-Ended
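One way to check a two-way TOS is to record each cell and confirm that the cells of a row add up to the number of items planned for that topic. The sketch below uses the Psychoanalytic Theories row as data. The tuple layout and the function names are hypothetical, and the cognitive level assigned to items #12-13 and #16-18 is an assumption, because the column for those two cells is not clear from the table.

```python
from collections import Counter

# Each cell: (cognitive level, knowledge dimension, item format, number of items).
psychoanalytic_cells = [
    ("R", "F", "I", 2),    # items 6-7
    ("U", "C", "I", 2),    # items 8-9
    ("AP", "C", "I", 2),   # items 10-11
    ("AP", "P", "I", 2),   # items 12-13 (cognitive level assumed)
    ("AN", "P", "I", 2),   # items 14-15
    ("AN", "M", "I", 3),   # items 16-18 (cognitive level assumed)
    ("E", "M", "II", 1),   # item 41
    ("C", "M", "II", 1),   # item 42
]


def items_by_level(cells):
    """Total number of items per level of cognitive behavior."""
    totals = Counter()
    for level, _kd, _fmt, n in cells:
        totals[level] += n
    return dict(totals)


def row_is_consistent(cells, planned_items):
    """True when the cells of a row add up to the planned number of items."""
    return sum(cell[3] for cell in cells) == planned_items


print(items_by_level(psychoanalytic_cells))
print(row_is_consistent(psychoanalytic_cells, 15))
```

The same check can be repeated for each topic, and the per-level totals can then be compared with the Overall Total row.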
3. Three-Way TOS. This type of TOS reflects the features of one-way and two-
way TOS. One advantage of this format is that it challenges the test writer
to classify objectives based on the theory behind the assessment. It also
shows the variability of thinking skills targeted by the test. However, it takes
much longer to develop this type of TOS.
personality under       I.2       I.3       (M)
psychoanalytic model    #12-13    #16-18
                        (P)       (M)
Others
Scoring                 1 point per item    3 points per item    5 points per item
Overall Total           50 (100%)   20   20   10
*Legend: KD = Knowledge Dimension (Factual, Conceptual, Procedural, Metacognitive)
I – Multiple Choice; II – Open-Ended
Summary
Bloom's taxonomy is a set of three hierarchical models used to classify
learning objectives into levels of complexity and specificity. The three lists
cover the learning objectives in cognitive, affective and psychomotor
domains.
The cognitive domain list has been the primary focus of most traditional
education and is frequently used to structure curriculum learning
objectives, assessments and activities.
In the original version of the taxonomy, the cognitive domain is broken into
the following six levels of objectives, namely: knowledge, comprehension,
application, analysis, synthesis and evaluation.
In the 2001 revised edition of Bloom's taxonomy, the levels are slightly
different: Remember, Understand, Apply, Analyze, Evaluate, Create
(replacing Synthesize).
Knowledge involves recognizing or remembering facts, terms, basic
concepts, or answers without necessarily understanding what they mean.
Comprehension involves demonstrating an understanding of facts and
ideas by organizing, comparing, translating, interpreting, giving
descriptions, and stating the main ideas.
Application involves using acquired knowledge—solving problems in new
situations by applying acquired knowledge, facts, techniques and rules.
Learners should be able to use prior knowledge to solve problems, identify
connections and relationships and how they apply in new situations.
Analysis involves examining and breaking information into component
parts, determining how the parts relate to one another, identifying motives
or causes, making inferences, and finding evidence to support
generalizations.
Synthesis involves building a structure or pattern from diverse elements; it
also refers to the act of putting parts together to form a whole.
Evaluation involves presenting and defending opinions by making
judgments about information, the validity of ideas, or quality of work based
on a set of criteria.
A Table of Specifications or a test blueprint is a table that helps teachers
align objectives, instruction, and assessment. This strategy can be used
for a variety of assessment methods but is most commonly associated
with constructing traditional summative tests.
Written tests have varied formats and a set of guidelines to follow.
Enrichment
1. Read the research article titled “Classroom Test Construction: The Power
of a Table of Specifications” from
[Link]
nstruction_The_Power_of_a_Table_of_Specifications.
2. Watch the video titled “How to use an automated Table of Specifications:
TOS Made Easy 2019.” Accessible from
[Link]
3. Explore the post of Jessica Shabatura (September 27, 2013) on “Using
Bloom’s Taxonomy to Write Effective Learning Objectives.” Use this link:
[Link]
4. Watch the video titled “How to write learning objectives using Bloom’s
Taxonomy.” Accessible from
[Link]
Assessment
1. Answer the following questions:
1. When planning for a test, what should you do first?
2. Are all instructional objectives measured by a paper-pencil test?
3. When constructing a TOS where objectives are set without classifying
them according to their cognitive behavior, what format do you use?
4. If you designed a two-way TOS for your test, what does this format
have?
5. Why would a teacher consider a three-way TOS rather than the other
formats?
2. To be able to check whether you have learned the important information
about planning the test, please provide your answers to the questions given
in the graphical representation.
2. Sample 2 in Science
Check (√) the competencies appropriate for the given test format or
method
Competencies                                   Appropriate for    Appropriate for     Appropriate for
                                               Objective Test     Constructed Type    Methods other than
                                               Format             of Test Format      a Written Test
1. Infer that the weather changes during
   the day and from day to day
2. Practice care and concern for animals
3. Participate in campaigns and activities
   for improving/managing one’s environment
4. Compare the ability of land and water to
   absorb and release heat
5. Describe the four types of climate in
   the Philippines
3. Sample 3 in Language
Check (√) the competencies appropriate for the given test format or
method.
Competencies                                   Appropriate for    Appropriate for     Appropriate for
                                               Objective Test     Constructed Type    Methods other than
                                               Format             of Test Format      a Written Test
1. Use words that describe persons, places,
   animals, and events
2. Draw conclusions based on picture-stimuli/
   passages
3. Write a different story ending
4. Write a simple friendly letter observing
   the correct format
5. Compose riddles, slogans and announcements
   from the given stimuli
4. For the table of specifications, you can apply what you have learned by
creating a two-way TOS for the final exam of your class. Take into
consideration the content or topic; time spent for each topic; knowledge
dimension; and item format, number, and placement for each level of
cognitive behavior. An example of a TOS for a long exam for an Abnormal
Psychology class is shown below. Some parts are missing. Complete the
TOS based on the given information.
Content Time # of KD* Level of Cognitive Behavior, Item Format, No. and
Spent Items Placement of Items
R U AP AN E C
Disorder Usually 3 hours ? F I.10 I.10 I.10
First Diagnosed in #1-10 #? ?
Infancy, Childhood
or Adolescence
Cognitive Disorder 3 ? C I.10 I.10 I.10
? #? #?
Substance Related 1 10% P I.5 I.5
Disorder (10) #? #?
Schizophrenia and 3 ? M I.10 I.10 I.10
other Psychotic #? #? #?
Disorder
Total ? ? ? ? ? ? ?
10 100 45 25 30
Overall Total
10 100% 45% 25% 30%
5. Test Yourself
Choose the letter of the correct answer to every item given.
1. The instructional objective focuses on the development of learners’
knowledge. Can this objective be assessed using the multiple-choice
format?
A. No, this objective requires an essay format.
B. No, this objective is better assessed using matching type test.
C. Yes, as multiple-choice is appropriate in assessing knowledge.
D. Yes, as multiple-choice is the most valid format when assessing
learning.
2. You prepared an objective test format for your quarterly test in
Mathematics. Which of the following could NOT have been your test
objective?
A. Interpret a line graph
B. Construct a line graph
C. Compare the information presented in a line graph
D. Draw conclusions from the data presented in a line graph
3. Teacher Lanie prepared a TOS as her guide in developing a test. Why
is this necessary?
A. To guide the planning of instruction
B. To satisfy the requirements in developing a test
C. To have a test blueprint as accreditation usually requires this plan
D. To ensure that the test is designed to cover what it intends to
measure
4. Mr. Arceo prepared a TOS that shows both the objectives and the
different levels of cognitive behavior. What format could he have used?
A. One-way format
B. Two-way format
C. Three-way format
D. Four-way format
5. The School Principal wants the teachers to develop a TOS that uses the
two-way format rather than the one-way format. Why do you think this is the
principal’s preferred format?
A. So that the different levels of cognitive behavior to be tested are
known
B. So that the formats of the test are known by just looking at the TOS
C. So that the test writer would know the distribution of test items
D. So that objectives for instruction are also reflected in the TOS
6. Review the table of specifications that you have developed for your
quarterly examination.
6.1. Is the purpose of assessment clear and relevant to measure desired
learning outcome?
6.2. Are the topics or course contents discussed in class well covered by
the test? Is the number of test items per topic and for the whole test
enough? Does the test cover only relevant topics?
6.3. Are all levels of thinking skills appropriately represented across
topics?
6.4. Are the test formats chosen for the specific desired learning outcomes
the most appropriate methods to use? Can you employ other types of
tests?
6.5. Would you consider your table of specifications good and effective to
guide you in developing your test? Are there components in the TOS
that need major revisions? How can you improve the TOS?
7. Evaluate your skills in planning your test in terms of setting objectives and
designing a table of specifications based on the following scale. Circle the
performance level you are at for (1) setting test objectives and (2) creating
a table of specifications.
Level       Performance Benchmark                          Setting Test   Creating Table of
                                                           Objectives     Specifications
Proficient  I know them very well. I can teach others      4              4
            where and when to use them appropriately.
Master      I can do it by myself, though I sometimes      3              3
            make mistakes.
Developing  I am getting there, though I still need help   2              2
            to be able to perfect it.
Novice      I cannot do it myself. I need help to plan     1              1
            for my tests.
Based on your self-assessment above, choose from the following tasks to help you
enhance your skills and competencies in setting course objectives and in
designing a table of specifications.
Level       Possible Tasks
Proficient  Help or mentor peers or classmates who are having difficulty in setting
            test objectives and designing a table of specifications.
Master      Examine the areas that you need to improve on and address them
            immediately. Benchmark with the test objectives and TOS developed
            by your peers/classmates who are known to be proficient in this area.
Educator’s Feedback
In an interview with a high school teacher, this is what he shared on his
practice when preparing a test.
When I plan my test, I first design its TOS, so I know what I should
cover. I usually prepare a Two-way TOS. Actually, because I have been
teaching the same course for many years now, I have come to a point
that all my tests have their two-way TOS ready to be shown to anybody,
most especially my students. Hence, even at the start of the term, I know
what I should teach and how they would be assessed. I know those
topics that are appropriately assessed through a written test. Weeks
before the test is given, I usually give the TOS to my students, so they
have a guide in preparing for the test. I allot time in my class for my
students to examine the TOS of the test for them to check if there were
topics not actually taught in the class. My students usually are surprised
when I do this as they don’t normally see TOS of their teacher’s test. But
I do this as I want them to be successful. I find it fair for them to know
how much weight is given to every topic covered in the test. Most often,
the outcome of the test is good as almost all, if not all, of my students
would pass my test.
References
Armstrong, P. (2020). Bloom’s Taxonomy. TN: Vanderbilt University Center
for Teaching. Retrieved from [Link]
pages/blooms-taxonomy/
David et al. (2020). Assessment in Learning 1. Manila: Rex Book Store.
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon
City: Adriana Publishing Co., Inc.
Fives, H. & DiDonato-Barnes, N. (February 2013). Classroom Test
Construction: The Power of a Table of Specifications. Practical
Assessment, Research & Evaluation, Volume 18 (3).
Isaacs, Geoff (1996). Bloom’s Taxonomy of Educational Objectives. The
University of Queensland: TEDI. Retrieved from
[Link]
Macayan, J. (2017). Implementing Outcome-Based Education (OBE)
Framework: Implications for Assessment of Students’ Performance.
Educational Measurement and Evaluation Review, Vol. 8 (1).
Magno, C. (2011). A Closer Look at other Taxonomies of Learning: A Guide
for Assessing Student Learning. The Assessment Handbook, Vol. 5.
Lesson 2: Construction of Written Tests
Pre-discussion
The construction of good tests requires specific skills and experience.
To be able to successfully demonstrate your knowledge and skills in
constructing traditional types of tests that are most applicable to a particular
learning outcome, you should be able to distinguish the different test types and
formats, and understand the process and requirements in setting learning
objectives and outcomes and in preparing the table of specifications. For proper
guidance in this lesson, the performance tasks and success indicators are
presented below.
What to Expect?
At the end of the lesson, the students can:
1. describe the characteristics of selected-response and constructed-
response tests;
2. classify whether a test is selected-response or constructed-response;
3. identify the test format that is most appropriate to a particular learning
outcome/target;
4. apply the general guidelines in constructing test items;
5. prepare a written test based on the prepared TOS; and
6. evaluate a given teacher-made test based on guidelines.
Constructing Various Types of Traditional Test Formats
Classroom assessments are an integral part of learners’ learning. They
do more than just measure learning. They also inform the learners what needs
to be learned and to what extent and how to learn them. They also provide the
parents some feedback about their child’s achievement of the desired learning
outcomes. The schools also get to benefit from classroom assessments
because the learners’ test results can provide them evidence-based data that
are useful for instructional planning and decision-making. As such, it is
important that assessment tasks or tests are meaningful and further promote
deep learning; as well as fulfill the criteria and principles of test construction.
There are many ways by which learners can demonstrate their
knowledge and skills and show evidence of their proficiencies at the end of a
lesson, unit, or subject. While authentic or performance-based assessments
have been advocated as the better and more appropriate methods in assessing
learning outcomes, particularly as they assess higher-order thinking skills
(HOTS), the traditional written assessment methods, such as multiple-choice
tests, are also considered appropriate and efficient classroom assessment
tools for some types of learning targets. This is especially true for large classes
and when test results are needed immediately for some educational decisions.
Traditional tests are also deemed reliable and exhibit excellent content and
construct validity.
To learn or enhance your skills in developing good and effective test
items for a particular test format, you need to possess adequate knowledge on
different test formats; how and when to choose a particular test format that is
the most appropriate measure of the identified learning objectives and desired
learning outcomes of your subject; and how to construct good and effective
items for each format.
cannot measure this outcome through a multiple-choice test or a matching-type
test.
Hence, to guide you on choosing the appropriate test format and
designing fair and appropriate yet challenging tests, you should ask the
following important questions:
factual (F), conceptual (C), procedural (P), and metacognitive (M). You may
return to Lesson 2 and Lesson 4 to review the different levels of Cognitive
Behaviour and Knowledge Dimensions.
3. Is the test matched or aligned with the course’s DLOs and the course contents
or learning activities?
The assessment tasks should be aligned with the instructional
activities and the DLOs. Thus, it is important that you are clear about what
DLOs are to be addressed by your test and what course activities or tasks
are to be implemented to achieve the DLOs.
For example, if you want learners to articulate and justify their stand
on ethical decision-making and social responsibility practices in business
(i.e., DLO), then an essay test and class debate are appropriate measures
and tasks for this learning outcome. A multiple-choice test may be used but
only if you intend to assess learners’ ability to recognize what is ethical
versus unethical decision-making practice. In the same manner, matching-
type items may be appropriate if you want to know whether your students
can differentiate and match the different approaches or terms to their
definitions.
cognitive capabilities required to answer selected-response items are different
from those required by constructed-response items, regardless of contents.
and/or other higher-order cognitive skills (e.g., reasoning, analysis, critical
thinking and skills).
A. ANCOVA C. Chi-Square
B. ANOVA D. Mann-Whitney Test
2. Do not lift and use statements from the textbook or other learning materials
as test questions.
3. Keep the vocabulary simple and understandable based on level of
learners/examinees.
4. Edit and proofread the items for grammatical and spelling errors before
administering them to the learners.
B. Stem
1. Write the directions in the stem in a clear and understandable manner.
Faulty: Read each question and indicate your answer by shading the circle
corresponding to your answer.
Good: This test consists of two parts. Part A is a reading comprehension
test, and Part B is grammar/language test. Each question is a
multiple-choice test item with five (5) options. You need to answer
each question but will not be penalized for a wrong answer or for
guessing. You can go back and review your answer during the time
allotted.
2. Write stems that are consistent in form and structure, that is, present all
items either in question form or in description or declarative form.
Faulty: (1) Who was the Philippine president during Martial Law?
(2) The first president of the Commonwealth of the Philippines was
__________.
Good: (1) Who was the Philippine president during Martial Law?
(2) Who was the first president of the Commonwealth of the
Philippines?
3. Express the stem positively and avoid negative words, such as NOT and
EXCEPT, as well as double negatives. If a negative word is necessary,
underline or capitalize the word for emphasis.
Faulty: Which of the following is not the measure of variability?
Good: Which of the following is NOT a measure of variability?
4. Refrain from making the stem too wordy or containing too much
information unless the problem or question requires the facts presented to
solve the problem.
Faulty: What does DNA stand for, and what is the organic chemical of
complex molecular structure found in all cells and viruses and codes
genetic information for the transmission of inherited traits?
Good: As a chemical compound, what does DNA stand for?
C. Options
1. Provide three (3) to five (5) options per item, with only one being the correct
or best answer/alternative.
2. Write options that are parallel or similar in form and length to avoid giving
clues about the correct answer.
Faulty: What is an ecosystem?
A. It is a community of living organisms in conjunction with the non-living
components of their environment that interact as a system. These
biotic and abiotic components are linked together through nutrient
cycles and energy flows.
B. It is a place on Earth’s surface where life dwells.
C. It is an area that one or more individual organisms defend against
competition from other organisms.
D. It is the biotic and abiotic surroundings of an organism or population.
E. It is the largest division of the Earth’s surface filled with living organisms.
Good: What is an ecosystem?
A. It is a place on the Earth’s surface where life dwells.
B. It is the biotic and abiotic surroundings of an organism or population.
C. It is the largest division of the Earth’s surface filled with living
organisms.
D. It is a large community of living and non-living organisms in a particular
area.
E. It is an area that one or more individual organisms defend against
competition from other organisms.
3. Place options in a logical order (e.g., alphabetical, from shortest to
longest).
Faulty: Which experimental gas law describes how the pressure of a gas
tends to increase as the volume of the container decreases? (i.e., “The
absolute pressure exerted by a given mass of an ideal gas is inversely
proportional to the volume it occupies.”)
A. Boyle’s Law D. Avogadro’s Law
B. Charles’ Law E. Faraday’s Law
C. Beer Lambert Law
Good: Which experimental gas law describes how the pressure of a gas
tends to increase as the volume of the container decreases? (i.e.,
“The absolute pressure exerted by a given mass of an ideal gas is
inversely proportional to the volume it occupies.”)
A. Avogadro’s Law D. Charles’ Law
B. Beer Lambert Law E. Faraday’s Law
C. Boyle’s Law
4. Place correct response randomly to avoid a discernable pattern of correct
answers.
5. Use None-of-the-above carefully and only when there is one absolutely
correct answer, such as in spelling or math items.
Faulty: Which of the following is a nonparametric statistic?
A. ANCOVA D. t-test
B. ANOVA E. None of the Above
C. Correlation
Good: Which of the following is a nonparametric statistic?
A. ANCOVA D. Mann-Whitney U
B. ANOVA E. t-test
C. Correlation
6. Avoid All of the Above as an option, especially if it is intended to be the
correct answer.
Faulty: Who among the following has become the President of the Philippine
Senate?
A. Ferdinand Marcos D. Quintin Paredes
B. Manuel Quezon E. All of the Above
C. Manuel Roxas
Good: Who was the first ever President of the Philippine Senate?
A. Eulogio Rodriguez D. Manuel Roxas
B. Ferdinand Marcos E. Quintin Paredes
C. Manuel Quezon
7. Make all options realistic and reasonable.
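Guidelines 3 and 4 above (logical ordering of options and no discernible pattern of correct answers) can be checked mechanically. The sketch below is illustrative only; the function names `arrange_options` and `key_distribution` are hypothetical. It sorts the options alphabetically, reports the letter where the correct answer lands, and counts how often each letter is used in an answer key.

```python
import string
from collections import Counter


def arrange_options(options, correct):
    """Sort options alphabetically; return lettered options and the key's letter."""
    ordered = sorted(options, key=str.lower)
    letters = string.ascii_uppercase
    lettered = {letters[i]: option for i, option in enumerate(ordered)}
    key = letters[ordered.index(correct)]
    return lettered, key


def key_distribution(answer_key):
    """Count how often each letter is the correct answer across a test."""
    return dict(Counter(answer_key))


lettered, key = arrange_options(
    ["Boyle's Law", "Charles' Law", "Beer Lambert Law",
     "Avogadro's Law", "Faraday's Law"],
    correct="Boyle's Law",
)
for letter, option in lettered.items():
    print(f"{letter}. {option}")
print("Correct answer:", key)
print(key_distribution(["C", "A", "E", "C", "B", "D"]))
```

For the gas law item, alphabetical ordering places the correct answer at letter C, as in the improved version of the item. A heavily lopsided answer-key count would suggest a pattern that test-wise learners could exploit.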
General Guidelines in Writing Matching-Type Items
The matching test item requires learners to match a word, sentence, or
phrase in one column (i.e., premise) to a corresponding word, sentence, or
phrase in a second column (i.e., response). It is most appropriate when you
need to measure the learners’ ability to identify the relationship or association
between similar items. It works best when the course content has many
parallel concepts. While the matching-type test format is generally used for simple
recall of information, you can find ways to make it applicable or useful in
assessing higher levels of thinking such as applying and analyzing.
The following are the general guidelines in writing good and effective
matching-type tests:
1. Clearly state in the directions the basis for matching the stimuli with the
responses.
Faulty: Directions: Match the following.
Good: Directions: Column I is a list of countries while Column II presents
the continents where these countries are located. Write the letter of the
continent corresponding to the country on the line provided in Column I.
Item #1’s instruction is less preferred as it does not detail the basis for
matching the stem and the response options.
2. Ensure that the stimuli are longer and the responses are shorter.
Faulty: Match the description of the flag to its country.
A B
Bangladesh A. Green background with red circle in the center
Indonesia B. One red stripe on top and white stripe at the bottom
Japan C. Red background with white five-petal flower in the
center
Singapore D. Red background with large yellow circle in the center
Thailand E. Red background with large yellow pointed star in the
center
F. White background with large red circle in the center
Good: Match the description of the flag to its country.
A B
Green background with a red circle in the center A. Bangladesh
One red stripe on top and white stripe at the bottom B. Hong Kong
Red background with five-petal flower in the center C. Indonesia
Red background with large yellow pointed star in the center D. Japan
White background with red circle in the center E. Singapore
F. Vietnam
Item #2 is a better version because the descriptions are presented in the
first column while the response options are in the second column. The
stems are also longer than the options.
3. For each item, include only topics that are related with one another and
share the same foundation of information.
Faulty: Match the following:
A B
1. Indonesia A. Asia
2. Malaysia B. Bangkok
3. Philippines C. Jakarta
4. Thailand D. Kuala Lumpur
5. Year ASEAN was established E. Manila
F. 1967
Good: On the line to the left of each country in Column I, write the letter
of the country’s capital presented in column II.
Column I Column II
1. Indonesia A. Bandar Seri Begawan
2. Malaysia B. Bangkok
3. Philippines C. Jakarta
4. Thailand D. Kuala Lumpur
E. Manila
Item #1 is considered an unacceptable item because its response
options are not parallel and include different kinds of information that can
provide clues to the correct/wrong answers. On the other hand, item #2
details the basis for matching and the response options only include
related concepts.
4. Make the response options short, homogeneous, and arranged in logical
order.
Faulty: Match the chemical elements with their characteristics.
A B
Gold A. Au
Hydrogen B. Magnetic metal used in steel
Iron C. Hg
Potassium D. K
Sodium E. With lowest density
F. Na
Good: Match the chemical elements with their symbols.
A B
Gold A. Au
Hydrogen B. Fe
Iron C. H
Potassium D. Hg
Sodium E. K
F. Na
In item #1, response options are not parallel in content and length.
They are also not arranged alphabetically.
5. Include response options that are reasonable and realistic and similar in
length and grammatical form.
Faulty: Match the subjects with their course description.
A B
History A. Studies the production and distribution of
goods/services
Political Science B. Study of politics and power
Psychology C. Study of society
Sociology D. Understand role of mental functions in social
behaviour
E. Uses narratives to examine and analyze past
events
Good: Match the subjects with their course description
A B
1. Study of living things A. Biology
2. Study of mind and behaviour B. History
3. Study of policies and power C. Political Science
4. Study of recorded events in D. Psychology
the past
5. Study of society E. Sociology
F. Zoology
General Guidelines in Writing True or False Items
True or False items are used to measure learners’ ability to identify
whether a statement or proposition is correct/true or incorrect/false. They are
best used when learners’ ability to judge or evaluate is one of the desired
learning outcomes of the course.
There are different variants of the true or false items. These include the
following:
1. T-F Correction or Modified True or False Question. In this format,
the statement is presented with a key word or phrase that is
underlined, and the learner has to supply the correct word or phrase.
e.g., Multiple-choice test is authentic.
2. Yes-No Variation. In this format, the learner has to choose yes or no,
rather than true or false.
e.g., The following are kinds of test. Circle Yes if it is authentic test
and No if not.
Multiple Choice Test Yes No
Debates Yes No
End-of-the Term Project Yes No
True or False Test Yes No
3. A-B Variation. In this format, the learner has to choose A or B, rather
than true or false.
e.g., Indicate which of the following are traditional or authentic tests
by circling A if it is a traditional test and B if it is authentic.
Traditional Authentic
Multiple Choice Test A B
Debates A B
End-of-the Term Project A B
True or False Test A B
1. Include statements that are completely true or completely false
Faulty: The presidential system of government, where the president is
only the head of state or government, is adopted by the United
States, Chile, Panama, and South Korea.
Good: The presidential system, where the president is only the head of
the state or government, is adopted by Chile.
Absolute words such as “always” and “never” restrict possibilities and
make a statement true 100 percent of the time. They also serve as hints
for a “false” answer.
5. Express a single idea in each test item.
Faulty: If an object is accelerating, a net force must be acting on it, and
the acceleration of an object is directly proportional to the net
force applied to the object.
Good: If an object is accelerating, a net force must be acting on it.
2. Do not omit too many words from the statement such that the intended
meaning is lost.
Faulty: __________ is to Spain as the __________ is to United States and as
__________ is to Germany.
Good: Madrid is to Spain as the __________ is to France.
Item # 1 is prone to many and varied answers. For example, a student may
answer the question based on the capital of these countries or based on
what continent they are located. Item # 2 is preferred because it is more
specific and requires only one correct answer.
3. Avoid obvious clues to the correct response.
Faulty: Ferdinand Marcos declared martial law in 1972. Who was the
president during that period?
Good: The president during the martial law year was __________.
Item #1 already gives a clue that Ferdinand Marcos was the president during
this time because only the president of a country can declare martial law.
4. Be sure that there is only one correct response.
Faulty: The government should start using renewable energy sources for
generating electricity, such as __________.
Good: The government should start using renewable sources of energy
by using turbines called __________.
Item #1 has many possible answers because the statement is very general
(e.g., wind, solar, biomass, geothermal, and hydroelectric). Item # 2 is more
specific and only requires one correct answer (i.e., wind).
5. Avoid grammatical clues to the correct response.
Faulty: A subatomic particle with a negative electric charge is called an
__________.
Good: A subatomic particle with a negative electric charge is called
a(n) __________.
The word “an” in item #1 provides a clue that the correct answer starts with
a vowel.
6. If possible, put the blank at the end of a statement rather than at the
beginning.
Faulty: __________ is the basic building block of matter.
Good: The basic building block of matter is __________.
In item #1, learners may need to read the sentence until the end before they
can recognize the problem, and then re-read it before answering the
question. On the other hand, in item #2, learners can already identify the
context of the problem by reading through the sentence only once,
without having to go back and re-read it.
These are the general guidelines in constructing good essay questions:
1. Clearly define the intended learning outcomes to be assessed by the
essay test.
To design effective essay questions or prompts, the specific intended
learning outcomes are identified. If the intended learning outcomes to
be assessed lack clarity and specificity, the questions or prompts may
assess something other than what they intend to assess. Appropriate
directive verbs that most closely match the ability that learners should
demonstrate must be used in the prompts. These include verbs such as
compose, analyze, interpret, explain, and justify, among others.
2. Refrain from using essay test for intended learning outcomes that are
better assessed by other kinds of assessment.
Some intended learning outcomes can be efficiently and reliably
assessed by selected-type test rather than by essay test. In the same
manner, there are intended learning outcomes that are better assessed
using other authentic assessments, such as performance test, rather
than by essay test. Thus, it is important to take into consideration the
limitations of essay tests when planning and
deciding what assessment method to employ for an intended learning
outcome.
3. Clearly define and situate the task within a problem situation as well as
the type of thinking required to answer the test.
Essay questions or prompts should provide clear and well-defined tasks
to the learners. It is important to carefully choose the directive verb, to
write clearly the object or focus of the directive verb, and to delimit the
scope of the task. Having clear and well-defined tasks will guide
learners on what to focus on when answering the prompts, thus avoiding
responses that contain ideas that are unrelated or irrelevant, too long,
or focusing only on some part of the task. Emphasizing the types of
thinking required to answer the question will also guide students on the
extent to which they should be creative, deep, complex, and analytical
in addressing and responding to the questions.
4. Present tasks that are fair, reasonable, and realistic to the students.
Essay questions should contain tasks or questions that students will be
able to do or address. These include those that are within the level of
instruction or training, expertise, and experience of the students.
5. Be specific in the prompts about the time allotment and criteria for
grading the response.
Essay prompts and directions should indicate the approximate time
given to the students to answer the essay questions to guide them on
how much time they should allocate for each item, especially if several
essay questions are presented. How the responses are to be graded or
rated should also be clarified to guide the students on what to include in
their responses.
General Guidelines in Problem-solving Test items
Problem-solving test items are used to measure learners’ ability to solve
problems that require quantitative knowledge and competencies and/or critical
thinking skills. These items present a problem situation or task that will require
learners to demonstrate work procedures or come up with a correct solution.
Full or partial credit can be assigned to the answer, depending on the answers
or solutions required.
There are different variations of the quantitative problem-solving items.
These include the following:
1. One answer choice - This type of question contains four or five options,
and students are required to choose the best answer.
Example: What is the mean of the following score distribution: 32, 44, 56,
69, 75, 77, 95, 96?
A. 68 D. 74
B. 69 E. 76
C. 72
2. All possible answer choices - This type of question has four or five
options, and students are required to choose all of the options that are
correct.
Example: Consider the following score distribution: 12, 14, 14, 14, 17, 24,
27, 28, and 30. Which of the following is/are the correct measure/s of central
tendency? Indicate all possible answers.
A. Mean = 20 D. Median = 17
B. Mean = 22 E. Mode = 14
C. Median = 16
Options A, D, and E are all correct answers.
3. Type-in answer – This type of question does not provide options to choose
from. Instead, the learners are asked to supply the correct answer. The
teacher should inform the learners at the start how their answers will be
rated. For example, the teacher may require just the correct answer or may
require learners to present the step-by-step procedures in coming up with
their answers. On the other hand, for non-mathematical problem solving, such
as a case study, the teacher may present a rubric showing how the answers
will be rated.
Example: Compute the mean of the following score distribution: 32, 44, 56,
69, 75, 77, 95, and 96. Indicate your answer in the blank provided.
In this case, the learners will only need to give the correct answer
without having to show the procedures for computation.
Example: Lillian, a 55-year-old accountant, has been suffering from
frequent dizziness, nausea, and light-headedness. During the
interview, Lillian was obviously restless and sweating. She reported
feeling so stressed and fearful of anything without any apparent
reason. She could not sleep and eat well. She also started to
withdraw from family and friends, as she experienced frequent panic
attacks. She also said that she was constantly worrying about
everything in work and at home. What might be Lillian’s problem?
What should she do to alleviate all her symptoms?
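The answer keys of the two computational examples above (the mean item and the measures-of-central-tendency item) can be double-checked before the test is administered. The following is a minimal sketch using Python's standard statistics module; the variable and function names are illustrative assumptions and are not part of the original examples.

```python
from statistics import mean, median, mode

# Score distributions copied from the two quantitative examples above.
scores_one_answer = [32, 44, 56, 69, 75, 77, 95, 96]
scores_all_answers = [12, 14, 14, 14, 17, 24, 27, 28, 30]


def central_tendency(scores):
    """Return the mean, median, and mode of a list of scores."""
    return {"mean": mean(scores), "median": median(scores), "mode": mode(scores)}


# Key for the "one answer choice" and "type-in answer" mean items.
key_one = mean(scores_one_answer)

# Key for the "all possible answer choices" item.
key_all = central_tendency(scores_all_answers)

print("Mean of first distribution:", key_one)
print("Second distribution:", key_all)
```

Running a check like this helps confirm that exactly the intended options match the computed values before the item is finalized.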
Problem-solving test items are a good test format as they minimize
guessing, measure instructional objectives that focus on higher cognitive
levels, and measure an extensive amount of content or topics. However, they
require more time for teachers to construct, read, and correct, and are
prone to rater bias, especially when scoring rubrics/criteria are not
available. It is therefore important that good quality problem-solving test
items are constructed.
The following are some of the general guidelines in constructing
good problem-solving test items:
1. Identify and explain the problem clearly.
Faulty: Tricia was 135.6 lbs. when she started with her zumba exercises.
After three months of attending the sessions three times a week,
her weight was down to 122.8 lbs. About how many lbs. did she
lose after three months? Write your final answer in the space
provided and show your computations.
[This question asks “about how many” and does not indicate whether
learners need to give the exact weight or whether they need to round off
their answer and to what extent.]
Good: Tricia was 135.6 lbs. when she started with her zumba exercises.
After three months of attending the sessions three times a week,
her weight was down to 122.8 lbs. How many lbs. did she lose after
three months? Write your final answer in the space provided and show
your computations. Write the exact weight; do not round off.
2. Be specific and clear about the type of response required from the students.
Faulty: ASEANA Bottlers, Inc. has been producing and selling Tutti Fruity
juice in the Philippines, aside from their Singapore market. The sales
for the juice in the Singapore market were S$5 million more than
those of their Philippine market in 2016, S$3 million more in
2017, and S$4.5 million more in 2018. If the sales in the Philippine
market in 2018 were PHP 35 million, what were the sales in the
Singapore market during that year?
[This is a faulty question because it does not specify in what currency
the answer should be presented.]
Good: ASEANA Bottlers, Inc. has been producing and selling Tutti Fruity
juice in the Philippines, aside from their Singapore market. The sales
for the juice in the Singapore market were S$5 million more than
those of their Philippine market in 2016, S$3 million more in 2017,
and S$4.5 million more in 2018. If the sales in the Philippine market
in 2018 were PHP 35 million, what were the sales in the Singapore
market during that year? Provide the answer in Singapore dollars
(S$1 = PHP 36.50).
[This is a better item because it specifies in what currency the answer
should be presented, and the exchange rate is given.]
3. Specify in the directions the bases for grading students’
answer/procedures.
Faulty: VCV Consultancy Firm was commissioned to conduct a survey
on the voters’ preferences in Visayas and Mindanao for the
upcoming presidential election. In Visayas, 65% are for the Liberal
Party (LP) candidate, while 35% are for the Nationalists; in Mindanao,
30% are LP supporters. A survey was conducted among 200
voters for each region. What is the probability that the survey will
show a greater percentage of Liberal Party supporters in
Mindanao than in the Visayas region?
Good: [Same survey situation as in the faulty item.] What is the
probability that the survey will show a greater percentage of Liberal
Party supporters in Mindanao than in the Visayas region? Please show
your solutions to support your answer. Your answer will be graded as
follows:
0 points = for wrong answer and wrong solution
1 point = for correct answer only (i.e., with no solution or a wrong
solution)
3 points = for correct answer with partial solutions
5 points = for correct answer with complete solutions
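The grading bases listed above can be written down as an explicit scoring rule, which also exposes cases the rubric leaves open. The following is a minimal Python sketch; the function name and the solution labels ("none", "wrong", "partial", "complete") are hypothetical, and scoring every wrong answer as 0 regardless of the solution shown is an assumption, since the rubric only mentions a wrong answer with a wrong solution.

```python
VALID_SOLUTIONS = {"none", "wrong", "partial", "complete"}


def score_response(answer_correct: bool, solution: str) -> int:
    """Score one response using the 0/1/3/5 point scheme listed above.

    solution is one of "none", "wrong", "partial", or "complete"
    (hypothetical labels chosen for this sketch).
    """
    if solution not in VALID_SOLUTIONS:
        raise ValueError(f"unknown solution label: {solution}")
    if not answer_correct:
        # ASSUMPTION: any wrong answer earns 0 points; the rubric only
        # states this for a wrong answer with a wrong solution.
        return 0
    if solution == "complete":
        return 5
    if solution == "partial":
        return 3
    # Correct answer with no solution or with a wrong solution.
    return 1


for correct, solution in [(False, "wrong"), (True, "none"),
                          (True, "partial"), (True, "complete")]:
    print(correct, solution, "->", score_response(correct, solution))
```

Writing the rule out this way makes it easier to tell students in advance how borderline responses, such as a wrong answer with a partly correct solution, will be treated.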
Assessment
A. Let us review what you have learned about constructing traditional tests.
1. What factors should be considered when choosing a particular test
format?
2. What are the major categories and formats of traditional tests?
3. When are the following traditional tests appropriate to use?
- Multiple-choice test
- Matching-type test
- True or false test
- Short-answer test
- Essay test
- Problem-solving test
4. How should the items for the above traditional tests be constructed?
To check whether you have learned the important information about
constructing the traditional types of tests, please complete the following
graphical representation:
an assessment plan for your chosen subject. List down the desired
learning outcomes and subject topic or lesson; and for each desired
learning outcome, identify the appropriate test format to assess learners’
achievement of the outcome. It is important that you have an
assessment plan for each subject.
Desired learning outcome: Apply the concepts of demand and supply in
actual cases
Subject topic or lesson: Effects of change of demand and supply on
market price; Exchange Rate; Change in the Price of Goods in the
Market; Price Ceiling and Price Floor
Test format: Essay, problem sets, case analysis, and exercises
Others
B. Now that you are able to identify the types of assessment that you will
employ for each desired learning outcome for a subject, you are now ready
to construct sample tests for the subject. Construct a three-part test that
includes test formats of your choice. In the development of the test, you will
need the following information:
1. Desired learning outcomes for subject area.
2. Level of cognitive/thinking skills appropriate to assess the desired
learning outcomes
3. Appropriate test format to use
4. Number of items per learning outcome or area and the weights
5. Number of points for each item and total number of points for the whole
test
Note: In the development of the test, you should take into consideration
the guidelines on developing table of specifications and on constructing
the test items.
C. Evaluate the sample tests that you have developed by using the following
checklist for the three test formats that you used.
1. Is the item completely true or completely false?
2. Is the item written in simple, easy-to-follow statements?
3. Are negatives avoided?
4. Are absolutes such as “always” and “never” used sparingly
or not at all?
5. Do items express only a single idea?
6. Is the use of unfamiliar vocabulary avoided?
7. Is the item or statement not lifted from the text, lecture, or
other materials?
D. Evaluate the level of your skills in developing different test formats using
the following scale:
Level        Performance Benchmarking                       Multiple-  Matching-  True-   Short-   Essay
                                                            Choice     Type       False   Answer
Proficient   I know this very well. I can teach others      4          4          4       4        4
             how to make one.
Master       I can do it by myself, though I sometimes      3          3          3       3        3
             make mistakes.
Developing   I am getting there, though I still need help   2          2          2       2        2
             to be able to perfect it.
Novice       I cannot do it myself. I need help to make     1          1          1       1        1
             a good/effective test.
F. Test your understanding about constructing test items for different test
formats. Answer the following items.
1. What do you call the statements of what learners are expected to do or
demonstrate as a result of engaging in the learning process?
A. Desired learning outcomes C. Learning intents
B. Learning goals D. Learning objectives
2. Which of the following is NOT a factor to consider when choosing a
particular test format?
A. Desired learning outcomes of the lesson
B. Grade level of students
C. Learning activities
D. Level of thinking to be assessed
3. Ms. Daniel is planning to use a traditional/conventional type of
classroom assessment for her Trigonometry quarterly quiz. Which of
the following test formats will she likely NOT use?
A. Fill-in-the-blank test C. Multiple-choice
B. Matching type D. Oral presentation
4. What is the type of test in which the learners are asked to formulate
their own answers?
A. Alternative response type C. Multiple-choice type
B. Constructed-response type D. Selected-response type
5. What is the type of true or false test item in which the statement is
presented with a key word or brief phrase that is underlined, and the
student has to supply the correct word or phrase?
A. A-B variation C. T-F substitution variation
B. T-F correction question D. Yes-No variation
6. What is the type of test item in which learners are required to answer a
question by filling in a blank with the correct word or phrase?
A. Essay test
B. Fill-in-the-blank or completion test item
C. Modified true or false test
D. Short answer test
7. What is the most appropriate test format to use if teachers want to
measure the learners’ higher order thinking skills, particularly their
abilities to reason, analyze, synthesize, and evaluate?
A. Essay C. Problem solving skills
B. Matching type D. True or False
8. What is the first step when planning to construct a final examination in
Algebra?
A. Come up with a table of specifications
B. Decide on the length of the test
C. Define the desired learning outcomes
D. Select the type of test to construct
9. What is the type of learning outcome that Dr. Oňas is assessing if he
wants to construct a Multiple-choice test for his Philippine History
class?
A. Knowledge C. Problem solving skills
B. Performance D. Product
10. In constructing a fill-in-the-blank or completion test, what guidelines
should be followed?
Educators’ Feedback
Mr. Cudera teaches Practical Research 1 and 2 in a public senior high
school. When asked about his experiences in writing test items for his
subjects, he cited his practice of referring back to the expected learning
outcomes as specified in the DepEd Curriculum Guide and using varied
types of assessments to measure his students’ achievement of these
expected outcomes. This is what he shared:
“As a teacher in senior high school, I always make sure that my periodical exams
measure the expected learning competencies as stipulated in the curriculum guide
of the Department of Education. I then create a table of specifications, wherein I
follow the correct item allocation per competency based on the number of hours
being taught in the class and the appropriate cognitive domain expected of every
learning competency. I make sure that in assessing students, I am always guided
by the DepEd Order No. 8, s. 2015 also known as the Policy Guidelines on
Classroom Assessment for the K to 12 Basic Education Program.
For this school year, I was assigned to teach Practical Research 1 and 2 courses.
To assess students’ learning or achievement, I first conducted formative
assessment to provide me some background on what students know about
Research. The result of the formative assessment allowed me to revise my lesson
plans and gave me some directions on how to proceed with and handle the
courses.
As part of the course requirements, I gave the students a lot of writing activities,
wherein they were required to write the drafts of each part of research. For each
work submitted, I read, checked, and gave comments and suggestions on how to
improve their drafts. I then allowed them to rewrite and revise their works. The
final research paper is used as basis for summative assessment.
I made use of different types of tests to determine how my students are performing
in my class. I administered selected-response type of test such as multiple-choice
test, matching type, completion tests and true or false to determine how much
they have learned about the different concepts, methods, and data gathering and
analysis procedures used in research. In the development of the test items, I made
sure that I edit them for content, grammar, and spelling. I also checked if the test
items conformed to the table of specifications.
Furthermore, I also relied heavily on essay tests and other performance tasks. As
I have mentioned, I required students to produce or write the different parts of a
research paper as outputs. They were also required to gather data for their
research. I utilized a rubric that was conceptualized collaboratively with my
students in order to evaluate their outputs. I used 360-degree evaluation of their
output, wherein aside from my assessment, other members would assess the
work of others and the leader would also evaluate the work of the members.
I also conducted item analysis after every periodical exam to identify the least
mastered competencies for a given period, which helps to improve the performance of
the students.”
References
Brame, C. (2013). Writing good multiple choice test questions. Retrieved on
August 26, 2020 from [Link]
pages/writing-good-multiple-choice-test-questions/.
Clay, B. (2001). A Short Guide to Writing Effective Test Questions. Kansas
Curriculum Center, Department of Education: Kansas, USA. Retrieved
on August 25, 2020 from
[Link]
David et al. (2020). Assessment in Learning 1. Manila: Rex Book Store.
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon
City: Adriana Publishing Co., Inc.
Popham, W. (2011). Classroom Assessment: What teachers need to know.
Boston, MA: Pearson Education, Inc.
Reiner et al. (2020). Preparing Effective Essay Questions: A Self-directed
Workbook for Educators. Utah, USA: New Forums Press. Available in
[Link]
Truckee Meadows Community College (2015, February 18). Writing Multiple
Choice Test Questions. [Video]. YouTube.
[Link]
Lesson 3: Improving a Classroom-based Assessment
Pre-discussion
By now, it is assumed that you already know how to plan a classroom test
by specifying the purpose for constructing it and the instructional outcomes
to be assessed, and by preparing a test blueprint to guide the construction
process. The techniques and strategies for selecting and constructing
different item formats to match the intended instructional outcomes make up
the second phase of the test development process, which was the content of
the preceding lesson. The process, however, is not complete without ensuring
that the classroom instrument is valid for the purpose for which it is
intended. Ensuring this requires reviewing and improving the items, which is
the next stage in the process. This lesson offers pre-service teachers
practical and necessary ways of improving teacher-developed assessment tools.
What to Expect?
At the end of the lesson, the students can:
1. list down the different ways for judgmental item-improvement and other
empirically-based procedures;
2. evaluate which type of test item-improvement is appropriate to use;
3. compute and interpret the results for index of difficulty, index of
discrimination and distracter efficiency; and
4. demonstrate knowledge on the procedures for improving a classroom-
based assessment.
Judgmental Item-Improvement
This approach basically makes use of human judgment in reviewing the
items. The judges are the teachers themselves, who know exactly what the test
is for, the instructional outcomes to be assessed, and the items’ level of
difficulty appropriate to their class; the teacher’s peers or colleagues, who
are familiar with the curriculum standards for the target grade level, the
subject matter content, and the ability of the learners; and the students
themselves, who can perceive difficulties based on their past experiences.
Teachers’ Own Review
It is always advisable for teachers to take a second look at the
assessment tools they have devised for a specific purpose. To presume
perfection right after construction may lead to failure to detect
shortcomings of the test or assessment tasks. Popham (2011) gives five
suggestions for teachers to follow in exercising judgment:
1. Adherence to item-specific guidelines and general item-writing
commandments. The preceding lesson has provided specific
guidelines in writing various forms of objective and non-objective
constructed-response types and the selected-response type for
measuring higher-level thinking skills. Teachers should use these
guidelines to check how the items have been planned and written,
particularly their alignment to the intended instructional outcomes.
2. Contribution to score-based inference. The teacher examines if the
expected scores generated by the test contribute to making valid
inferences about the learners. Can the scores reveal the amount of
learning achieved or show what has been mastered? Can the scores
support inferences about the students’ capability to move on to the next
instructional level? Or do the scores obtained fail to make any difference
in describing or differentiating various abilities?
3. Accuracy of contents. This review should especially be considered
when tests are being used some time after they were developed.
Changes that may have occurred due to new discoveries or developments
can redefine the contents of a summative test. If this happens, the
items or the key to correction may need to be revisited.
4. Absence of content gaps. This review criterion is especially useful in
strengthening the score-based inference capability of the test. If the
current tool misses out on important content now prescribed by a new
curriculum standard, the score will likely not give an accurate
description of what is expected to be assessed. The teacher always
ensures that the assessment tool matches what is currently required to
be learned. This is a way to check on the content validity of the test.
5. Fairness. The discussion on item-writing guidelines always warns
against unintentionally helping uninformed students obtain higher
scores. This can happen through inadvertent grammatical clues, unattractive
distracters, ambiguous problems, and messy test instructions.
Sometimes, unfairness can happen because of undue advantage
received by a particular group, like those seated in the front of the
classroom or those coming from a particular socio-economic level.
Getting rid of faulty and biased items and writing clear instructions
definitely add to the fairness of the test.
Peer review
There are schools that encourage peer or collegial review of assessment
instruments among themselves. Time is provided for this activity and it has
almost always yielded good results for improving tests and performance-based
assessment tasks. During these teacher dyad or triad sessions, those teaching
the same subject area can openly review together the classroom tests and
tasks they have devised against some consensual criteria. The suggestions
given by test experts can actually be used collegially as basis for a review
checklist:
a. Do the items follow the specific and general guidelines in writing
items especially on:
Being aligned to instructional objectives?
Making the problem clear and unambiguous?
Providing plausible options?
Avoiding unintentional clues?
Having only one correct answer?
b. Are the items free from inaccurate content?
c. Are the items free from obsolete content?
d. Are the test instructions clearly written for students to follow?
e. Is the level of difficulty of the test appropriate to the level of the learners?
f. Is the test fair to all kinds of students?
Student Review
Engagement of students in reviewing items has become a laudable
practice for improving classroom tests. The judgment is based on the students’
experience in taking the test and their impressions and reactions during the
testing event. The process can be efficiently carried out through the use of a
review questionnaire. Popham (2011) illustrates a sample questionnaire shown
in the textbox below. It is better to conduct the review activity a day after
taking the test so the students still remember the experience when they see a
blank copy of the test.
Item-Improvement Questionnaire for Students
Empirically-based Procedures
Item-improvement using empirically-based methods is aimed at
improving the quality of an item using students’ responses to the test. Test
developers refer to this technical process as item analysis, as it utilizes data
obtained separately for each item. An item is considered good when its
quality indices, i.e., difficulty index and discrimination index, meet certain
characteristics. For a norm-referenced test, these two indices are related since
the level of difficulty of an item contributes to its discriminability. An item is good
if it can discriminate between those who perform well in the test and those who
do not. However, an extremely easy item, one that can be answered correctly
by more than 85% of the group, or an extremely difficult item, one that can
be answered correctly by only 15% of the group, is not expected to perform
well as a “discriminator”. The group will appear to be quite homogeneous with
items of this kind. They are weak items since they do not contribute to
“score-based inference”.
The difficulty index, however, takes a different meaning when used in
the context of criterion-referenced interpretation or testing for mastery. An item
with a high difficulty index will not be considered an “easy item” and
therefore a weak item, but rather an item that displays the capability of the
learners to perform the expected outcome. It therefore becomes evidence
of mastery.
Particularly for objective tests, the responses are binary in form, i.e., right
or wrong, translated into numerical figures as 1 and 0, for obtaining nominal
data like frequency, percentage and proportion. Useful data then are in the
form:
a. Total number of students answering the item (T)
b. Total number of students answering the item right (R)
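From these two counts, the proportion of students who answered an item correctly (its p-value) follows as R divided by T. The following is a minimal Python sketch of that computation; the function name and the ten sample responses are hypothetical and only illustrate the 1/0 coding described above.

```python
def proportion_correct(responses):
    """Return p = R / T for one item, given responses coded 1 (right) or 0 (wrong)."""
    total = len(responses)   # T: number of students answering the item
    right = sum(responses)   # R: number of students answering the item right
    if total == 0:
        raise ValueError("no responses to analyze")
    return right / total


# Hypothetical responses of ten students to a single item.
item_responses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
p = proportion_correct(item_responses)
print(f"R = {sum(item_responses)}, T = {len(item_responses)}, p = {p:.2f}")
```

Because the responses are coded 1 and 0, summing them counts the correct answers directly, so no separate tally step is needed.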
Difficulty Index
An item is difficult if the majority of students are unable to provide the correct
answer. The item is easy if the majority of the students are able to answer correctly.
An item can discriminate if the examinees who score high in the test answer
it correctly more often than the examinees who got low scores.
Below is a data set of five items on the addition and subtraction of
integers. Follow the procedure to determine the difficulty and discrimination of
each item.
1. Get the total score of each student and arrange scores from highest to
lowest.
Item 1 Item 2 Item 3 Item 4 Item 5
Student 1 0 0 1 1 1
Student 2 1 1 1 0 1
Student 3 0 0 0 1 1
Student 4 0 0 0 0 1
Student 5 0 1 1 1 1
Student 6 1 0 1 1 0
Student 7 0 0 1 1 0
Student 8 0 1 1 0 0
Student 9 1 0 1 1 1
Student 10 1 0 1 1 0
2. Obtain the upper and lower 27% of the group. Multiplying 0.27 by the total
number of students (10) gives 2.7, which rounds to 3. Get the top three
students and the bottom three students based on their scores. The top three
students are students 2, 5, and 9. The bottom three students are students 7,
8, and 4. The rest of the students are not included in the item analysis.
3. Obtain the proportion correct for each item. This is computed separately
for the upper 27% group and the lower 27% group by summing the correct
answers per item and dividing by the number of students in the group.
                                   Item 1  Item 2  Item 3  Item 4  Item 5  Total score
Student 2                          1       1       1       0       1       4
Student 5                          0       1       1       1       1       4
Student 9                          1       0       1       1       1       4
Total                              2       2       3       2       3
Proportion of the high group (pH)  0.67    0.67    1.00    0.67    1.00
Student 7                          0       0       1       1       0       2
Student 8                          0       1       1       0       0       2
Student 4                          0       0       0       0       1       1
Total                              0       1       2       1       1
Proportion of the low group (pL)   0.00    0.33    0.67    0.33    0.33
Item difficulty =
Computations
Item 1 Item 2 Item 3 Item 4 Item 5
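The three steps of the procedure can be sketched in Python using the data set given earlier. Two points are assumptions of this sketch rather than statements from the text: (1) item difficulty is taken as the average of pH and pL, a common convention for the upper-lower group method, because the formula itself is not shown above; and (2) tied students keep their original order when ranked, which happens to reproduce the bottom group used in the text (student 3 also has a total score of 2). Discrimination is computed as the difference between pH and pL.

```python
# Responses (1 = right, 0 = wrong) of the ten students to the five items,
# copied from the data set above.
responses = {
    1: [0, 0, 1, 1, 1],
    2: [1, 1, 1, 0, 1],
    3: [0, 0, 0, 1, 1],
    4: [0, 0, 0, 0, 1],
    5: [0, 1, 1, 1, 1],
    6: [1, 0, 1, 1, 0],
    7: [0, 0, 1, 1, 0],
    8: [0, 1, 1, 0, 0],
    9: [1, 0, 1, 1, 1],
    10: [1, 0, 1, 1, 0],
}
N_ITEMS = 5

# Step 1: rank students by total score, highest to lowest.
# sorted() is stable, so students with equal scores keep their original order.
ranked = sorted(responses, key=lambda s: sum(responses[s]), reverse=True)

# Step 2: upper and lower 27% groups (0.27 x 10 = 2.7, rounded to 3).
n_group = round(0.27 * len(responses))
upper = ranked[:n_group]
lower = ranked[-n_group:]


def proportion_correct(group, item):
    """Proportion of the students in `group` who answered `item` correctly."""
    return sum(responses[s][item] for s in group) / len(group)


# Step 3: proportion correct per item in each group, then the two indices.
p_high = [proportion_correct(upper, i) for i in range(N_ITEMS)]
p_low = [proportion_correct(lower, i) for i in range(N_ITEMS)]
difficulty = [(h + l) / 2 for h, l in zip(p_high, p_low)]   # ASSUMED formula
discrimination = [h - l for h, l in zip(p_high, p_low)]     # pH - pL

print("Upper group:", upper, "Lower group:", lower)
for i in range(N_ITEMS):
    print(f"Item {i + 1}: pH={p_high[i]:.2f} pL={p_low[i]:.2f} "
          f"difficulty={difficulty[i]:.2f} D={discrimination[i]:.2f}")
```

If a different tie-breaking rule or difficulty formula is prescribed in class, only the ranking line and the `difficulty` line need to change.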
Discrimination Index
The power of an item to discriminate between informed and
uninformed groups, or between more knowledgeable and less knowledgeable
learners, is shown using the item-discrimination index (D). This is an item
statistic that can reveal useful information for improving an item. Basically,
an item discrimination index shows the relationship between the student’s
performance in an item (i.e., right or wrong) and his or her total performance
in the test, represented by the total score. Item-total correlation is usually
part of a package for item analysis. High item-total correlations indicate that
the items contribute well to the total score, so that responding
correctly to these items gives a better chance of obtaining relatively high total
scores in the whole test or subtest.
For classroom tests, the discrimination index shows if a difference exists
between the item performance of those who scored high and those who scored
low in the test. As a general rule, the higher the discrimination index (D), the more
marked the magnitude of the difference is, and thus, the more discriminating
the item is. The nature of the difference, however, can take different directions.
a. Positively discriminating item – proportion of high scoring group is
greater than that of the low scoring group
b. Negatively discriminating item – proportion of high scoring group is
less than that of the low scoring group
c. Not discriminating item – proportion of high scoring group is equal
to that of the low scoring group
Computing the discrimination index therefore requires obtaining the
difference between the proportion of the high-scoring group getting the item
correct and the proportion of the low-scoring group getting the item correct
using this simple formula:
D = RU/TU – RL/TL
where D = item discrimination index
RU = number of upper group getting the item correct
TU = number of upper group
RL = number of lower group getting the item correct
TL = number of lower group
When the two groups are of equal size, another calculation brings about the
same result:
D = (RU – RL)/T
where RU = number of upper group getting the item correct
RL = number of lower group getting the item correct
T = number of either group
As you can see, R/T is actually the p-value of an item. So to get
D is to get the difference between the p-value of the upper group and the
p-value of the lower group. The formula for the discrimination index (D) can
therefore also be given as (Popham, 2011):
D = pU – pL
where pU is the p-value for upper group (RU/TU)
pL is the p-value for lower group (RL/TL)
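The formulas for D can be compared side by side in a short Python sketch. The function names and the sample counts (8 of 10 in the upper group and 3 of 10 in the lower group answering correctly) are hypothetical; the second function applies only when both groups have the same size.

```python
import math


def discrimination_index(r_upper, t_upper, r_lower, t_lower):
    """D = RU/TU - RL/TL, which is the same as pU - pL."""
    return r_upper / t_upper - r_lower / t_lower


def discrimination_index_equal_groups(r_upper, r_lower, t):
    """D = (RU - RL)/T, valid only when both groups have the same size T."""
    return (r_upper - r_lower) / t


def direction(d, tol=1e-9):
    """Label the direction of discrimination described in the text."""
    if d > tol:
        return "positively discriminating"
    if d < -tol:
        return "negatively discriminating"
    return "not discriminating"


# Hypothetical counts: 8 of 10 upper-group and 3 of 10 lower-group students
# answered the item correctly.
d_first = discrimination_index(8, 10, 3, 10)
d_second = discrimination_index_equal_groups(8, 3, 10)
print(f"D = {d_first:.2f} ({direction(d_first)})")
print("Both formulas agree:", math.isclose(d_first, d_second))
```

A small tolerance is used when labeling the direction because proportions computed with floating-point division may differ from zero by a negligible amount.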
For purposes of evaluating the discriminating power of items, Popham
(2011) offers the guidelines proposed by Ebel and Frisbie (1991) shown below.
The teachers can be guided on how to select the satisfactory items and what to
do to improve the rest.
Discrimination Index   Item Evaluation
.40 and above          Very good items
.30 - .39              Reasonably good items, but possibly subject to improvement
.20 - .29              Marginal items, usually needing improvement
.19 and below          Poor items, to be rejected or improved by revision
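The guidelines in the table above can be applied mechanically once D has been computed. The following is a minimal Python sketch; the function name is hypothetical, and rounding D to two decimal places before comparing it with the cut-offs is an assumption made here because the table lists its ranges to two decimals.

```python
def evaluate_item(d):
    """Return the evaluation of an item from its discrimination index D,
    following the ranges in the table above."""
    d = round(d, 2)  # ASSUMPTION: the table's ranges are read at two decimals
    if d >= 0.40:
        return "Very good item"
    if d >= 0.30:
        return "Reasonably good item, but possibly subject to improvement"
    if d >= 0.20:
        return "Marginal item, usually needing improvement"
    return "Poor item, to be rejected or improved by revision"


# Hypothetical discrimination indices for four items.
for d in (0.67, 0.33, 0.25, -0.10):
    print(f"D = {d:+.2f}: {evaluate_item(d)}")
```

Negatively discriminating items fall under the lowest range and are therefore flagged for rejection or revision.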
Distracter Analysis
Another empirical procedure to discover areas for item-improvement
utilizes an analysis of the distribution of responses across the distracters.
When the difficulty index and discrimination index of the item
suggest that it is a candidate for revision, distracter analysis becomes a useful
follow-up.
In distractor analysis, however, we are no longer interested in how test
takers select the correct answer, but in how effectively the distractors
function by drawing test takers away from the correct answer. The number
of times each distractor is selected is noted in order to determine the
effectiveness of the distractor. We would expect a distractor to be selected
by enough candidates for it to be a viable distractor. What exactly is an
acceptable value? This depends to a large extent on the difficulty of the item
itself and what we consider to be an acceptable item difficulty value for test
items. If we assume that 0.7 is an appropriate item difficulty value, then
we should expect the remaining 0.3 to be about evenly distributed among the
distractors. Let us take the following test item as an example:
Let us assume that 100 students took the test. If we assume that A is the
answer and the item difficulty is 0.7, then 70 students answered correctly. What
about the remaining 30 students and the effectiveness of the three distractors?
If all 30 selected D, the distractors B and C are useless in their role as
distractors. Similarly, if 15 students selected D and another 15 selected B, then
C is not an effective distractor and should be replaced. The ideal situation would
be for each of the three distractors to be selected by 10 students. Therefore, for
an item which has an item difficulty of 0.7, the ideal effectiveness of each
distractor can be quantified as 10/100 or 0.1. What would be the ideal value for
distractors in a four option multiple choice item when the item difficulty of the
item is 0.4? Hint: You need to identify the proportion of students who did not
select the correct option.
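The reasoning above reduces to one line of arithmetic: the proportion of
students who miss the item, divided evenly among the distractors. A minimal
sketch follows; the function name is hypothetical, and the even split is the
ideal case described in the text, not something real data will match exactly.

```python
def ideal_distractor_share(difficulty, n_options):
    """Ideal proportion of examinees choosing each distractor.

    difficulty: item difficulty (proportion answering correctly)
    n_options:  total number of options, including the correct answer
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be between 0 and 1")
    if n_options < 2:
        raise ValueError("an item needs at least one distractor")
    n_distractors = n_options - 1
    return (1 - difficulty) / n_distractors


# The worked example from the text: difficulty 0.7, four options.
share = ideal_distractor_share(0.7, 4)
print(round(share, 2))        # 0.1
print(round(share * 100))     # 10 students out of 100
```

You can check your answer to the question above by calling the same function
with an item difficulty of 0.4.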
From a different perspective, the item discrimination formula can also be
used in distractor analysis. The concept of upper and lower groups
still applies, but the analysis and expectation differ slightly from
the regular item discrimination that we looked at earlier. Instead of
expecting a positive value, we should logically expect a negative value, since
more students from the lower group should select the distractors. Each
distractor can have its own item discrimination value in order to analyse how
the distractors work and ultimately refine the effectiveness of the test item
itself. If we use the above item as an example, the item discrimination concept
can be used to assess the effectiveness of each distractor. Consider a class of
100 students, from which we form upper and lower groups of 30 students each.
Assume the following results are observed:
Distractor analysis can be a useful tool in evaluating the effectiveness of
our distractors. It is important for us to be mindful of the distractors that we use
in a multiple choice format test as when distractors are not effective, they are
virtually useless. As a result, there is a greater possibility that students will be
able to select the correct answer by guessing as the options have been
reduced.
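The idea of giving each option its own discrimination value can be sketched as
follows. The response counts below are hypothetical and are not the results
from the example item; they only assume, as in the text, upper and lower groups
of 30 students each, with A as the correct answer.

```python
def option_discrimination(upper_counts, lower_counts, group_size):
    """Return (upper - lower) / group_size for every option of one item."""
    return {
        option: (upper_counts[option] - lower_counts[option]) / group_size
        for option in upper_counts
    }


# Hypothetical counts of students choosing each option (A is the key).
upper = {"A": 20, "B": 3, "C": 4, "D": 3}
lower = {"A": 10, "B": 8, "C": 4, "D": 8}

indices = option_discrimination(upper, lower, group_size=30)
for option, d in indices.items():
    print(option, round(d, 2))
# A 0.33
# B -0.17
# C 0.0
# D -0.17
```

In this made-up case the key (A) discriminates positively, distractors B and D
behave as expected with negative values, and distractor C attracts both groups
equally, so it would be the first candidate for revision.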
Summary
Judgmental item-improvement is accomplished through teacher’s own
review, peer review, and student review.
Enhancement of test and test items may be possible using empirically-
based procedures like computing the index of difficulty, discrimination index
or distracter analysis.
For items with one correct alternative worth a single point, the item difficulty
is simply the percentage of students who answer an item correctly.
Item discrimination refers to the ability of an item to differentiate among
students on the basis of how well they know the material being tested.
One important element in the quality of a multiple choice item is the quality
of the item's distractors. A distractor analysis addresses the performance of
these incorrect response options.
Enrichment
Read the following studies:
1. “Difficulty Index, Discrimination Index and Distractor Efficiency in
Multiple Choice Questions,” available from
[Link]
2. “Item Discrimination and Distractor Analysis: A Technical Report on
Thirty Multiple Choice Core Mathematics Achievement Test Items,”
available from [Link]
3. “Index and Distractor Efficiency in a Formative Examination in
Community Medicine,” available from
[Link]
4. “Impact of distractors in item analysis of multiple choice questions,”
available from [Link]
Assessment
A. Below are descriptions of procedures done to review and improve test items.
On the space provided, write J if a judgmental approach is used and E if
an empirically-based approach is used.
1. The Math coordinator of Grade 7 classes examined the periodical tests
by the Math teachers to see if their items are aligned to the target
outcomes for the first quarter.
2. The alternatives of the multiple-choice items of the Social Studies test
were reviewed to discover if they have only one correct answer.
3. To determine if the items are efficiently discriminating between the more
able students and the less able ones, a Biology teacher obtained a
discrimination index (D) of the items.
4. A Technology Education teacher was interested to see if the criterion-
referenced test he has devised shows a difference in the items’ post-test
and pre-test p-values.
5. An English teacher conducted a session with his students to find out if
there are other responses acceptable in their literature test. He
encouraged them to rationalize their answers.
B. A final test in Science was administered to a Grade 6 class of 50. The
teacher wants to improve further the items for next year’s use. Calculate a
quality index using the given data and indicate the possible revision needed
by some items.
C. Below are additional data collected for the same items. Calculate another
quality index and indicate what needs to be done with the obtained index as
a basis.
Item Upper Lower Index Revision needed to be done
Group Group
1 25 9
2 9 9
3 2 8
4 38 8
5 1 7
D. A distracter analysis is given for a test item given to a class of 60. Obtain
the necessary item statistics using the given data.
Item    Difficulty   Discrimination   Group    Alternatives
N=30    index        index                     A    B    C    D    Omit
1                                     Upper
                                      Lower
E. For each item, write the letter of the correct answer on the space
provided.
1. Below are different ways of utilizing the concept of discrimination as an
index of item quality EXCEPT
a. Getting the proportion of those answering the item correctly over
those answering the items
b. Obtaining the difference between the proportion of high-scoring
group and the proportion of low-scoring group getting the item
correctly
c. Getting how much better the performance of the class by item is after
instruction than before
d. Differentiating the performance in an item of a group that has
received instruction and a group that has not
2. What can enable some students to answer items correctly even without
having enough knowledge for what is intended to be measured?
a. Clear and brief test instructions
b. Comprehensible statement of the item stem
c. Obviously correct and obviously wrong alternatives
d. Simple sentence structure of the problem
3. An instructor is going to prepare an end-of-course summative test.
What major consideration should the instructor observe so it will differ
from a unit test?
a. Inclusion of all intended learning outcomes of the course
b. Appropriate length of the test to cover all subject matter topics
c. Preparation of a key to correction in advance for ease of scoring
d. Adequate sampling of higher-level learning outcomes
4. Among the strategies for improving test questions given below, which
is empirical in approach?
a. Items that students find confusing are collected and are revised
systematically
b. Teachers who are teaching the same subject matter collegially
meet to discuss the alignment of items to their learning outcomes
c. Item responses of high-scoring group are compared with those of
the low-scoring group
d. The teacher examines the stem and alternatives for accuracy of
content
5. Which of the following multiple-choice item data shows a need for
revision?
Item A B C D
1 Upper Group 5* 4 9 2
Lower Group 15 0 5 0
2 Upper Group 2 4 12* 2
Lower Group 4 4 5 7
3 Upper Group 2 14* 2 0
Lower Group 4 4 5 7
4 Upper Group 2 4 2 10*
Lower Group 8 5 0 7
*correct answer
References
Conduct the Item Analysis. Retrieved from
[Link]
David et al. (2020). Assessment in Learning 1. Manila: Rex Book Store.
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon
City: Adriana Publishing Co., Inc.
ExamSoft (2015, August 4). Putting it All Together: Using Distractor Analysis.
[Video]. YouTube. [Link]
ExamSoft (2015, July 21). The Definition of Item Difficulty. [Video]. YouTube.
[Link]
ExamSoft (2015, July 23). Twenty-Seven Percent: The Index of Discrimination.
[Video]. YouTube. [Link]
Exploring Reliability in Academic Achievement. Retrieved from
[Link]
Mahjabeen et al. (2017). Efficiency in Multiple Choice Questions. Annals of
PIMS. Available in
[Link]
Popham, W. (2011). Classroom Assessment: What teachers need to know.
Boston, MA: Pearson Education, Inc.
Professional Testing, Inc. (2020). Building High Quality Examination
Programs. Retrieved from
[Link]
The Graide Network, Inc. (2019). Importance of Validity and Reliability in
Classroom Assessments. Retrieved from
[Link]
quality-testing-reliability-and-validity
Lesson 4: Establishing Test Validity and Reliability
Pre-discussion
To be able to successfully perform the expected performance tasks,
students should have prepared a test following the proper procedure with clear
learning targets (objectives), a table of specifications, and pre-test data per
item. In the previous lesson, guidelines were provided for constructing tests
following different formats. Students have also learned that an assessment
becomes valid when the test items represent a good set of objectives, and these
should be found in the table of specifications. The learning objectives or
targets will help them construct appropriate test items.
What to Expect?
At the end of this lesson, the students can:
1. explain the different tests of validity;
2. identify the most practical test to apply when validating a typical
teacher-made assessment;
3. tell when to use a certain type of reliability test;
4. apply the suitable method of reliability test given a set of assessment
results/test data; and
5. decide whether a test is valid or reliable.
Test Validity
A test is valid when it measures what it is supposed to measure. Validity
pertains to the connection between the purpose of the test and which data the
teacher chooses to quantify that purpose.
If a quarterly exam is valid, then the contents should directly measure
the objectives of the curriculum. If a scale that measures personality is
composed of five factors, then the scores on the five factors should have items
that are highly correlated. If an entrance exam is valid, it should predict
students’ grades after the first semester.
It is better to understand the definition through looking at examples of
invalidity. Colin Foster, an expert in mathematics education at the University of
Nottingham, gives the example of a reading test meant to measure literacy that
is given in a very small font size. A highly literate student with bad eyesight may
fail the test because they cannot physically read the passages supplied. Thus,
such a test would not be a valid measure of literacy (though it may be a valid
measure of eyesight). Such an example highlights the fact that validity is wholly
dependent on the purpose behind a test. More generally, in a study plagued by
weak validity, “it would be possible for someone to fail the test situation
rather than the intended test subject.”
                     that are strongly correlated.      determine which items are highly
                                                        correlated to form a factor.

Concurrent Validity  When two or more measures are      The scores on the measures should
                     present for each examinee that     be correlated.
                     measure the same characteristic

Convergent Validity  When the components or factors     Correlation is done for the factors
                     of a test are hypothesized to      of the test.
                     have a positive correlation

Divergent Validity   When the components or factors     Correlation is done for the factors
                     of a test are hypothesized to      of the test.
                     have a negative correlation.
                     An example is the scores in a
                     test on intrinsic and extrinsic
                     motivation.
A case is provided for each type of validity to illustrate how it is
conducted. After reading the cases about the different kinds of validity,
look for a partner and answer the questions that follow each case. Discuss
your answers. You may use other references and browse the internet.
1. Content Validity
A coordinator in science is checking the science test paper for Grade 4.
She asked the Grade 4 science teacher to submit the table of specifications
containing the objectives of the lesson and the corresponding items. The
coordinator checked whether each item is aligned with the objectives.
How are the objectives used when creating test items?
How is content validity determined when given the objectives and the
items in a test?
What should be present in a test table of specifications when determining
content validity?
Who checks the content validity of items?
2. Face Validity
The assistant principal browsed the test paper made by the math
teacher. She checked if the contents of the items are about mathematics. She
examined if the instructions are clear. She browsed through the items to check
if the grammar is correct and if the vocabulary is within the students’ level
of understanding.
What can be done in order to ensure that the assessment appears to be
effective?
What practices are done in conducting face validity?
Why is face validity the weakest form of validity?
3. Predictive Validity
The school’s admissions office developed an entrance examination. The
officials wanted to determine if the results of the entrance examination are
accurate in identifying good students. They took the first-quarter grades of
the students who were accepted. They correlated the entrance exam results and
the first-quarter grades. They found significant and positive correlations
between the entrance examination scores and grades. The entrance
examination results predicted the grades of students after the first quarter.
Thus, the entrance examination has predictive validity.
Why are two measures needed in predictive validity?
What is the assumed connection between these two measures?
How can we determine if a measure has predictive validity?
What statistical analysis is done to determine predictive validity?
How can the test results of predictive validity be interpreted?
4. Concurrent Validity
A school Guidance Counsellor administered a math achievement test to
Grade 6 students. She also has a copy of the students’ grades in math. She
wanted to verify if the math grades of the students are measuring the same
competencies as the math achievement test. The school counsellor correlated
the math achievement scores and math grades to determine if they are
measuring the same competencies.
What needs to be available when conducting concurrent validity?
At least how many tests are needed for conducting concurrent validity?
What statistical analysis can be used to establish concurrent validity?
How are the results of a correlation coefficient interpreted for concurrent
validity?
5. Construct Validity
A science test made by a Grade 10 teacher was composed of four
domains: matter, living things, force and motion, and earth and space. There
are 10 items under each domain. The teacher wanted to determine if the 10 items
made under each domain really belonged to that domain. The teacher
consulted an expert in test measurement. They conducted a procedure called
factor analysis. Factor analysis is a statistical procedure done to determine if
the items written will load under the domain they belong to.
What type of test requires construct validity?
What should the test have in order to verify its constructs?
What are constructs and factors in a test?
How can these factors be verified if they are appropriate for the test?
What results come out in construct validity?
How are the results in construct validity interpreted?
The construct validity of a measure is reported in journal articles. The
following are guided questions used when searching for the construct validity
of a measure from reports:
What was the purpose of construct validity?
What type of test was used?
What are the dimensions or factors that were studied using construct
validity?
What procedure was used to establish the construct validity?
What statistics were used for the construct validity?
What were the results of the test’s construct validity?
6. Convergent Validity
A Math teacher developed a test to be administered at the end of the
school year, which measures number sense, patterns and algebra,
measurement, geometry, and statistics. It is assumed by the math teacher that
students’ competencies in number sense improve their capacity to learn
patterns and algebra and other concepts. After administering the test, the
scores were separated for each area, and these five domains were inter-
correlated using Pearson r. The positive correlation between number sense
and patterns and algebra indicates that, when number sense scores increase,
the patterns and algebra scores also increase. This shows that student learning
of number sense scaffolds patterns and algebra competencies.
What should a test have in order to conduct convergent validity?
What are done with the domains in a test on convergent validity?
What analysis is used to determine convergent validity?
How are the results in convergent validity interpreted?
7. Divergent Validity
An English teacher taught Grade 11 students a metacognitive awareness
strategy for comprehending a paragraph. She wanted to determine if
the performance of her students in reading comprehension would reflect well in
the reading comprehension test. She administered the same reading
comprehension test to another class which was not taught the metacognitive
awareness strategy. She compared the results using a t-test of independent
samples and found that the class that was taught the metacognitive awareness
strategy performed significantly better than the other group. The test has
divergent validity.
What conditions are needed to conduct divergent validity?
What assumption is being proved in divergent validity?
What statistical analysis can be used to establish divergent validity?
How are the results of divergent validity interpreted?
Test Reliability
Reliability is not at all concerned with intent, instead asking whether the
test used to collect data produces accurate results. In this context, accuracy is
defined by consistency or as to whether the results could be replicated.
Reliability is also the consistency of the responses to a measure under three
conditions:
1. when retested on the same person;
2. when retested on the same measure; and
3. similarity of responses across items that measure the same
characteristic.
In the first condition, a consistent response is expected when the test is
given to the same participants. In the second condition, reliability is attained
if the responses to the same test are consistent with the responses to an
equivalent test, or another test that measures the same characteristic,
when administered at a different time. In the third condition, there is
reliability when the person responds in the same way or consistently across
items that measure the same characteristic.
There are different factors that affect the reliability of a measure. The
reliability of a measure can be high or low, depending on the following factors:
1. The number of items in a test – The more items a test has, the higher
the likelihood of reliability. The probability of obtaining consistent
scores is high because of the large pool of items.
2. Individual differences of participants – Every participant possesses
characteristics that affect their performance in a test, such as fatigue,
concentration, innate ability, perseverance, and motivation. These
individual factors change over time and affect the consistency of the
answers in a test.
3. External environment – The external environment may include room
temperature, noise level, depth of instruction, exposure to materials,
and quality of instruction, which could cause changes in the responses
of examinees in a test.
Methods in Testing Reliability

1. Test-retest
How is this reliability done?
You have a test, and you need to administer it at one time to a group of
examinees. Administer it again at another time to the “same group” of
examinees. There is a time interval of not more than 6 months between the
first and second administration of tests that measure stable characteristics,
such as standardized aptitude tests. The post-test can be given with a
minimum time interval of 30 minutes. The responses in the test should more or
less be the same across the two points in time.
Test-retest is applicable for tests that measure stable variables, such as
aptitude and psychomotor measures (e.g., typing test, tasks in physical
education).
What statistics is used?
Correlate the test scores from the first and the next administration.
Significant and positive correlation indicates that the test has temporal
stability over time. Correlation refers to a statistical procedure where a
linear relationship is expected for two variables. Pearson Product Moment
Correlation or Pearson r may be used because test data are usually in an
interval scale (refer to a statistics book for Pearson r).

2. Parallel Forms
How is this reliability done?
There are two versions of a test. The items need to exactly measure the
same skill. Each test version is called a “form.” Administer one form at one
time and the other form at another time to the “same” group of participants.
The responses on the two forms should be more or less the same.
Parallel forms are applicable if there are two versions of the test. This is
usually done when the test is repeatedly used for different groups, such as
entrance examinations and licensure examinations. Different versions of the
test are given to a different group of examinees.
What statistics is used?
Correlate the test results for the first form and the second form. A
significant and positive correlation coefficient is expected. The significant
and positive correlation indicates that the responses in the two forms are the
same or consistent. Pearson r is usually used for this analysis.

3. Split-Half
How is this reliability done?
Administer a test to a group of examinees. The items need to be split in
halves, usually using the odd-even technique. In this technique, get the
sum of the points in the odd-numbered items and correlate it with the sum of
points of the even-numbered items. Each examinee will have two scores
coming from the same test. The scores on each set should be close or
consistent.
Split-half is applicable when the test has a large number of items.
What statistics is used?
Correlate the two sets of scores using Pearson r. After the correlation, use
another formula called the Spearman-Brown Coefficient. The correlation
coefficients obtained using Pearson r and Spearman-Brown should be significant
and positive to mean that the test has internal consistency reliability.

4. Test of Internal Consistency Using Kuder-Richardson and Cronbach’s Alpha
Method
How is this reliability done?
This procedure involves determining if the scores for each item are
consistently answered by the examinees. After administering the test to a
group of examinees, it is necessary to determine and record the scores for
each item. The idea here is to see if the responses per item are consistent
with each other.
This technique will work well when the assessment tool has a large number of
items. It is also applicable for scales and inventories (e.g., Likert scale
from “strongly agree” to “strongly disagree”).
What statistics is used?
A statistical analysis called Cronbach’s alpha or the Kuder-Richardson is used
to determine the internal consistency of the items. A Cronbach’s alpha value
of 0.60 and above indicates that the test items have internal consistency.

5. Inter-rater Reliability
How is this reliability done?
This procedure is used to determine the consistency of multiple raters when
using rating scales and rubrics to judge performance. The reliability here
refers to the similar or consistent ratings provided by more than one rater or
judge when they use an assessment tool.
Inter-rater is applicable when the assessment requires the use of multiple
raters.
What statistics is used?
A statistical analysis called Kendall’s W coefficient of concordance is used
to determine if the ratings provided by multiple raters agree with each other.
A significant Kendall’s W value indicates that the raters concur or agree with
each other in their ratings.
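The split-half procedure described above can be sketched in a few lines. This
is an illustration only: the six-student, six-item score matrix is
hypothetical, the function names are not from any particular library, and the
Spearman-Brown step uses the usual form for two halves of equal length,
2r / (1 + r).

```python
import math


def pearson_r(x, y):
    """Pearson r from the raw-score formula (needs non-constant scores)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def spearman_brown(r_half):
    """Step up the half-test correlation to a full-test reliability."""
    return 2 * r_half / (1 + r_half)


# Hypothetical data: one row per student, one column per item (1 = correct).
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

# Odd-even technique: items 1, 3, 5 versus items 2, 4, 6.
odd_totals = [sum(row[0::2]) for row in scores]
even_totals = [sum(row[1::2]) for row in scores]

r_half = pearson_r(odd_totals, even_totals)
print(round(r_half, 2))                  # 0.74
print(round(spearman_brown(r_half), 2))  # 0.85
```

The stepped-up value is larger than the half-test correlation because each
half contains only half of the items, and shorter tests tend to be less
reliable.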
1. Linear regression
Linear regression is demonstrated when you have two variables that are
measured, such as two sets of scores in a test taken at two different times by
the same participants. When the two scores are plotted in a graph (with X- and
Y-axis), they tend to form a straight line. The straight line formed by the two
sets of scores can produce a linear regression. When a straight line is formed,
we can say that there is a correlation between the two sets of scores. This
correlation is shown in the graph given. The graph is called a scatterplot.
Each point in the scatterplot is a respondent with two scores (one for each
test).
Figure 1. Scatterplot diagram
2. Computation of Pearson r correlation
The index of the linear regression is called a correlation coefficient. When
the points in a scatterplot tend to fall along the straight line, the
correlation is said to be strong. When the direction of the scatterplot is
directly proportional, the correlation coefficient will have a positive value.
If the line is inverse, the correlation coefficient will have a negative value.
The statistical analysis used to determine the correlation coefficient is
called the Pearson r. The Pearson r is obtained using the following formula, as
illustrated below.
Formula:
$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$
where
N – Number of examinees with paired scores
∑X – Add all the X scores (Monday scores)
∑Y – Add all the Y scores (Tuesday scores)
X² – Square the value of the X scores (Monday scores)
Y² – Square the value of the Y scores (Tuesday scores)
XY – Multiply the X and Y scores
∑X² – Add all the squared values of X
∑Y² – Add all the squared values of Y
∑XY – Add all the products of X and Y
Monday Test   Tuesday Test
X             Y             X²         Y²          XY
10            20            100        400         200
9             15            81         225         135
6             12            36         144         72
10            18            100        324         180
12            19            144        361         228
4             8             16         64          32
5             7             25         49          35
7             10            49         100         70
16            17            256        289         272
8             13            64         169         104
∑X=87         ∑Y=139        ∑X²=871    ∑Y²=2125    ∑XY=1328
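Applying the formula with N = 10, we have:
$$r = \frac{10(1328) - (87)(139)}{\sqrt{\left[10(871) - (87)^2\right]\left[10(2125) - (139)^2\right]}} = \frac{1187}{\sqrt{(1141)(1929)}} \approx 0.80$$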
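The same result can be reproduced with a short script. This is a minimal
sketch that follows the raw-score form of the Pearson r formula and uses the
Monday and Tuesday scores from the table; the function and variable names are
chosen here for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson r computed from the sums used in the worked table."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least two scores")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero when either set of scores is constant.
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


monday = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]          # X scores
tuesday = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]     # Y scores

print(sum(monday), sum(tuesday))         # 87 139
print(round(pearson_r(monday, tuesday), 2))  # 0.8
```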
Determining the Strength of a Correlation
The strength of the correlation also indicates the strength of the reliability
of the test. This is indicated by the value of the correlation coefficient. The
closer the value is to 1.00 or -1.00, the stronger the correlation. Below is
the guide:
0.80-1.00   Very strong relationship
0.60-0.79   Strong relationship
0.40-0.59   Substantial/marked relationship
0.20-0.39   Weak relationship
0.00-0.19   Negligible relationship
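The guide can be turned into a small lookup for interpreting a computed
coefficient. The sketch below makes two assumptions that the guide implies but
does not state: the absolute value of r is used, so that -0.85 is as strong as
0.85, and r is rounded to two decimal places before it is compared with the
ranges. The function name is hypothetical.

```python
def correlation_strength(r):
    """Describe the strength of a correlation coefficient using the guide."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient lies between -1 and 1")
    size = round(abs(r), 2)  # strength ignores the sign of r
    if size >= 0.80:
        return "Very strong relationship"
    if size >= 0.60:
        return "Strong relationship"
    if size >= 0.40:
        return "Substantial/marked relationship"
    if size >= 0.20:
        return "Weak relationship"
    return "Negligible relationship"


print(correlation_strength(0.80))   # Very strong relationship
print(correlation_strength(-0.45))  # Substantial/marked relationship
print(correlation_strength(0.10))   # Negligible relationship
```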
Student               Item 1   Item 2   Item 3   Item 4   Item 5   Total for each   Score-Mean   (Score-Mean)²
                                                                   case (X)
A                     5        5        4        4        1        19               2.8          7.84
B                     3        4        3        3        2        15               -1.2         1.44
C                     2        5        3        3        3        16               -0.2         0.04
D                     1        4        2        3        3        13               -3.2         10.24
E                     3        3        4        4        4        18               1.8          3.24
Total for each item   14       21       16       17       13       Mean of X = 16.2              ∑(Score-Mean)² = 22.8
(∑X)
Mean                  2.8      4.2      3.2      3.4      2.6
SD²                   2.2      0.7      0.7      0.3      1.3      ∑SD² = 5.2                    SD² of totals = 22.8/4 = 5.7
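The Cronbach’s alpha formula is given by:
$$\alpha = \left(\frac{n}{n-1}\right)\left(\frac{\sigma_t^2 - \sum \sigma_i^2}{\sigma_t^2}\right)$$
where n is the number of items, σt² is the variance of the total scores, and
∑σi² is the sum of the item variances.
Hence, using the values in the table (n = 5, σt² = 5.7, ∑σi² = 5.2),
$$\alpha = \left(\frac{5}{5-1}\right)\left(\frac{5.7 - 5.2}{5.7}\right) \approx 0.11$$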
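The computation of Cronbach’s alpha from the item scores of Students A to E
can be sketched as follows. The sketch assumes sample variances (dividing by
the number of students minus one), which is what reproduces the SD² row of the
table; the function name is chosen here for illustration.

```python
from statistics import variance  # sample variance (divides by n - 1)


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-student item-score rows."""
    n_items = len(rows[0])
    # Variance of each item across students (the SD² row of the table).
    item_vars = [variance([row[i] for row in rows]) for i in range(n_items)]
    # Variance of the students' total scores.
    total_var = variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (total_var - sum(item_vars)) / total_var


scores = [
    [5, 5, 4, 4, 1],  # Student A
    [3, 4, 3, 3, 2],  # Student B
    [2, 5, 3, 3, 3],  # Student C
    [1, 4, 2, 3, 3],  # Student D
    [3, 3, 4, 4, 4],  # Student E
]

alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # 0.11
```

A value this far below 0.60 indicates that the responses to these five items
are not internally consistent.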
The scores given by the three raters are first computed by summing up
the total rating for each demonstration. The mean is obtained for the sum of
ratings (mean of the sums of ratings = 8.4). The mean is subtracted from each
of the sums of ratings (D). Each difference is squared (D²), then the sum of
squares is computed (∑D² = 33.2). The mean and the summation of squared
differences are substituted in the Kendall’s W formula. In the formula, m is
the number of raters while k is the number of students who perform the
demonstrations.
Let us consider the formula and the substitution of values:
$$W = \frac{12\sum D^2}{m^2\,k\,(k^2 - 1)}$$
Summary
A test is valid when it measures what it is supposed to measure. It can be
categorized as face, content, construct, predictive, concurrent, convergent,
or divergent validity.
Reliability is the consistency of the responses to a measure. It can be
established through test-retest, parallel forms, split-half, internal
consistency, and inter-rater reliability.
Enrichment
A. Get a journal article about a study that developed a measure or conducted
validity or reliability tests. You may also download from any of the following
open source.
Google Scholar
Directory of open access journals
Multidisciplinary open access journals
Allied academics journals
Your task is to write a short report focusing on important information on how
the authors conducted and established test validity and reliability. Provide
the following information.
1. Purpose of the study
2. Describe the instrument with its underlying factors
3. Validity technique used in the study and analysis they used
4. Reliability techniques used in the study and analysis used
5. Results of the tests validity and reliability
B. Learn more on Reliability and Validity in Student Assessment by watching
a clip from [Link]
C. Read Magno’s (2009) work titled “Demonstrating the Difference
between Classical Test Theory and Item Response Theory Using Derived
Test Data,” published in the International Journal of Educational and
Psychological Assessment, Volume 1. Access through
[Link]
Assessment
A. Indicate the type of reliability applicable for each case. Write the type of
reliability on the space before the number.
Reliability Cases
Type
1. Mr. Perez conducted a survey of his students to determine
their study habits. Each item is answered using a five-point
scale (always, often, sometimes, rarely, never). He wanted
to determine if the responses for each item are consistent.
What reliability technique is recommended?
2. A teacher administered a spelling test to her students. After
a day, another spelling test was given with the same length
and stress of words. What reliability can be used for the
two spelling tests?
3. A PE teacher requested two judges to rate the dance
performance of her students in physical education. What
reliability can be used to determine the reliability of the
judgements?
4. An English teacher administered a 20-item test to determine
students’ use of verbs given a subject. The
scores were divided into items 1 to 10, and another for
items 11 to 20. The teacher correlated the two sets of
scores from the same test. What reliability is done
here?
5. A computer teacher gave a set of typing tests on
Wednesday and gave the same set the following week.
The teacher wanted to know if the students’ typing skills
are consistent. What reliability can be used?
B. Indicate the type of validity applicable for each case. Write the type of
validity on the blank before the number.
1. The science coordinator developed a science test to determine
who among the students will be placed in an advanced science
section. The students who scored high in the science test were
selected. After two quarters, the grades of the students in the
advanced science were determined. The scores in the science
test were correlated with the science grades to check if the
science test was accurate in the selection of students. What type
of validity was used?
2. A test composed of listening comprehension, reading
comprehension, and visual comprehension items was
administered to students. The researcher determined if the
scores on each area refer to the same skill on comprehension.
The researcher hypothesized a significant and positive
relationship among these factors. What validity was established?
3. The guidance counsellor conducted an interest inventory that
measured the following factors: realistic, investigative, artistic,
scientific, enterprising, and conventional. The guidance
counsellor wanted to provide evidence that the items constructed
really belong to the factor proposed. After her analysis, the
proposed items had high factor loadings on the domain they
belong to. What validity was conducted?
4. The technology and livelihood education teacher developed a
performance task to determine student competency in preparing
a dessert. The students were tasked with selecting a dessert,
preparing the ingredients, and making the dessert in the kitchen.
The teacher developed a set of criteria to assess the dessert.
What type of validity is shown here?
5. The teacher in a robotics class taught students how to create a
program to make the arms of a robot move. The assessment
was a performance task making a program to make three kinds
of robot arm movements. The same assessment task was given
to students with no robotics class. The programming
performance of the two classes was compared. What validity
was established?
Your task is to determine whether the spelling test is reliable and valid. Use
the data to determine the following: (1) split-half reliability, (2) Cronbach’s alpha, (3)
predictive validity with the English grade, (4) convergent validity between
words with single and two stresses, and (5) the difficulty index of each item.
Student No.  Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7  Item 8  Item 9  Item 10  English grade
1            1       0       0       1       1       1       0       1       1       0        80
2            0       0       0       1       1       1       1       1       0       0        81
3            1       1       0       0       1       0       1       0       1       1        83
4            0       1       0       0       1       1       1       1       1       0        85
5            0       1       1       0       1       1       1       0       1       1        84
6            1       0       1       0       1       1       1       1       1       1        89
7            1       0       1       1       1       1       1       1       0       1        87
8            1       1       1       0       1       1       1       1       1       1        87
9            1       1       1       1       1       1       1       1       0       1        89
10           1       1       1       1       0       0       1       1       1       1        90
11           0       1       1       1       0       1       1       1       1       0        90
12           1       0       1       1       1       1       1       1       1       1        87
13           1       1       1       1       1       1       1       0       1       1        88
14           1       1       0       1       1       1       1       1       1       1        88
15           1       1       1       1       1       0       1       1       0       1        85
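If you prefer to check your hand computations with software, the following minimal sketch shows one possible way to compute the split-half reliability, Cronbach's alpha, the difficulty index, and the correlation with the English grades from the table above. It assumes an odd-even split of the items, the Spearman-Brown correction, and population variances; the function and variable names are illustrative only, and your teacher may require a different split or formula.

```python
from statistics import mean, pvariance

# Item responses (1 = correct, 0 = wrong) copied from the table above,
# one row per student, followed by the English grades.
responses = [
    [1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
]
grades = [80, 81, 83, 85, 84, 89, 87, 87, 89, 90, 90, 87, 88, 88, 85]


def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of equal length."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


k = len(responses[0])                      # number of items
totals = [sum(row) for row in responses]   # total score per student

# Split-half (assumed odd-even split), stepped up with Spearman-Brown.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]
r_half = pearson_r(odd_half, even_half)
split_half = 2 * r_half / (1 + r_half)

# Cronbach's alpha from the item variances and the total-score variance.
item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
alpha = k / (k - 1) * (1 - sum(item_vars) / pvariance(totals))

# Difficulty index: proportion of students who answered each item correctly.
difficulty = [mean(row[i] for row in responses) for i in range(k)]

# Predictive validity evidence: correlation of test totals with English grades.
predictive_r = pearson_r(totals, grades)

print(round(split_half, 2), round(alpha, 2), round(predictive_r, 2))
print([round(p, 2) for p in difficulty])
```

The printed values still need to be interpreted using the guidelines discussed in the lesson on reliability and validity.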
D. Create a short test and report its validity and reliability. Select a grade level
and subject. Choose one or two learning competencies and make
10 to 20 items for these learning competencies. Consult your teacher on
the items and the table of specifications.
1. Have your items checked by experts to see if they are aligned with the
selected competencies.
2. Revise your items based on the reviews provided by the experts.
3. Make a layout of your test and administer it to about 100 students.
4. Encode your data. You may use an application to compute the
needed statistical analysis.
5. Determine the following:
Split-half reliability
Cronbach’s alpha
Item difficulty and discrimination
Write a report on your procedure. The report will contain the following parts:
Introduction. Give the purpose of the study. Describe what the test measures,
its components, the competencies selected, and the kind of items. Rationalize the
need to determine the validity and reliability of the test.
Method. Describe the participants who took the test. Describe what the
test measures, number of items, test format, and how content validity was
established. Describe the procedure on how data was collected or how the test
was administered. Describe what statistical analysis was used.
Results. Present the results in a table and provide the necessary
interpretations. Make sure to show the results of the split-half reliability,
Cronbach’s alpha, construct validity of the items with the underlying factors,
convergent validity of the domains, and item difficulty and discrimination.
Discussion. Provide implications about the test validity and reliability.
E. Multiple Choice
Choose the letter of the correct and best answer in every item.
1. Which is a way of establishing test reliability?
A. The test is examined if free from errors and properly administered.
B. Scores in a test with different versions are correlated to test if they
are parallel.
C. The components or factors of the test contain items that are
strongly uncorrelated.
D. Two or more measures are correlated to show the same
characteristics of the examinee.
2. What is being established if items in the test are consistently answered
by the students?
A. Internal consistency C. test-retest
B. Inter-rater reliability D. split-half
3. Which type of validity was established if the components or factors of a
test are hypothesized to have a negative correlation?
A. Construct validity C. Content validity
B. Predictive validity D. Divergent validity
4. How do we determine if an item is easy or difficult?
A. An item is easy if a majority of the students are not able to provide the
correct answer. The item is easy if a majority of the students are able
to answer correctly.
B. An item is difficult if a majority of the students are not able to provide the
correct answer. The item is difficult if a majority of the students are
able to answer correctly.
C. An item can be determined to be difficult if the examinees who scored high in
the test can answer more of the items correctly than the examinees
who got low scores. If not, the item is easy.
D. An item can be determined to be easy if the examinees who scored high in the
test can answer more of the items correctly than the examinees who
got low scores. If not, the item is difficult.
5. Which is used when the scores of the two variables measured by a test
taken at two different times by the same participants are correlated?
A. Pearson r correlation C. Significance of the correlation
B. Linear regression D. positive and negative correlation
Results | The tables and interpretation necessary are all present. All the required analyses are complete and accurately interpreted. | There is one table and interpretation missing. One table and/or interpretation does not have accurate content. | There are two tables and interpretations that are missing. Two tables and interpretations have inaccurate information. | There are more than two tables and interpretations that are missing. Three or more tables and interpretations have inaccurate information.
Discussion | Implications of the test’s validity and reliability are well explained with three or more supporting reviews. Detailed discussion on the results of reliability and validity is provided with explanation. | Implications of the test’s validity and reliability are explained with two supporting reviews. One of the results for reliability and validity is not provided with explanation. | Implications of the test’s validity and reliability are explained with no supporting review. Two of the results for validity and reliability are not provided with explanation. | Implications of the test’s validity and reliability are not explained, and there is no supporting review. Three or more of the results for validity and reliability are not provided with explanation.
G. Summarize the results of your performance in doing the culminating task
using the checklist below.
Ready  Not yet ready  Learning Targets
□      □              1. I can independently decide on the appropriate type of
                         validity and reliability to be used for a test.
□      □              2. I can analyse results of the test data independently.
□      □              3. I can interpret the results from the statistical analysis of
                         the test.
□      □              4. I can distinguish the use of each type of test reliability.
□      □              5. I can distinguish the use of each type of test validity.
□      □              6. I can explain the procedure on establishing test validity
                         and reliability.
CHAPTER 4
ORGANIZATION, UTILIZATION, AND COMMUNICATION OF TEST
RESULTS
Overview
As we have learned in previous lessons, tests used to measure
learning or achievement are a form of assessment. They are undertaken to gather
data about student learning. These test results can assist teachers and the
school in making informed decisions to improve curriculum and instruction.
Thus, collected information such as test scores should be organized so that
its meaning can be appreciated. Charts and tables are the most common ways of
presenting data. In addition, statistical measures are used to help interpret
the data correctly.
Most often, students are interested to know, “What is my score in the
test?” Nonetheless, the more critical question is, “What does one’s score
mean?” Test score interpretation is important not just for the students
concerned but also for their parents. Knowing how a certain student performs with
respect to the group or other members of the class is important. Similarly, it is
significant to determine the intellectual characteristics of the students through
their scores or grades.
For example, an overall score in the 60th percentile in mathematics
would place the learner in the average group. The learner’s performance is as
good as or better than that of 60% of the students in the group. A closer
look into the sub-skill scores of the pupil can help teachers and parents
identify problem areas. For instance, a child may be good in addition and
subtraction but may be struggling in multiplication and division.
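As an illustration of the percentile idea, the following minimal sketch computes a percentile rank as the percentage of scores in the group that are equal to or lower than a given score. The function name and the ten scores are hypothetical, and other definitions of percentile rank are also in use.

```python
def percentile_rank(score, group_scores):
    """Percentage of scores in the group that are at or below `score`."""
    at_or_below = sum(1 for s in group_scores if s <= score)
    return 100 * at_or_below / len(group_scores)


# Hypothetical mathematics scores of ten learners.
math_scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

rank_of_80 = percentile_rank(80, math_scores)
print(rank_of_80)
```

With these made-up scores, a learner who scored 80 performed as well as or better than six of the ten learners in the group.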
In some cases, assessment and grading are used interchangeably, but
they are different. One difference is that assessment focuses on the
learner. It gathers information about what the student knows and what he/she
can do. Grading is a part of evaluation because it involves judgment made by
the teacher. This chapter concludes with the grading system in the Philippines’
K to 12 program. Other reporting systems shall likewise be introduced and
discussed. A short segment on progress monitoring is included to provide
pre-service teachers with an idea of how to track student progress through
formative assessment.
Objective
Upon completion of the chapter, the students can demonstrate their
knowledge, understanding and skills in organizing, presenting, utilizing and
communicating the test results.
Pre-discussion
At the end of this lesson, pre-service teachers are expected to present,
in an organized manner, test data collected from an existing database or
from pilot-tested materials in any of the assessment tools implemented in the
earlier lessons. Your success in this performance task will be determined by
whether you can organize ungrouped raw test results through tables; use
frequency distributions to present test data; describe the characteristics of
frequency polygons, histograms, and bar graphs, and interpret them;
interpret test data presented through tables and graphs; determine which
types of tables and graphs are appropriate for a given set of data; and use
technology such as statistical software in organizing and interpreting test data.
What to Expect?
At the end of the lesson, the students can:
1. organize the raw data from a test;
2. construct a frequency distribution;
3. acquire knowledge on the basic rules in preparing tables and graphs;
4. summarize test data using appropriate table or graph;
5. use Microsoft Excel to construct appropriate graphs for a data set;
6. interpret the graph of a frequency and cumulative frequency
distribution; and
7. characterize a frequency distribution graph in terms of skewness and
kurtosis.
Frequency Distribution
A different tabulation scheme aggregates values into bins such that each
bin encompasses a range of values. For example, the heights of the students
in a class could be organized into the following frequency table.
This principle of classifying data into groups is called frequency
distribution. In this process, we combine the scores into a relatively small
number of class intervals and then indicate the number of cases in each class.
Step 2:
The second step is to decide the number and size of the groupings to be
used. First, decide the size of the class interval.
According to H.E. Garrett (1985:4), the most “commonly used grouping
intervals are 3, 5, 10 units in length.” The size should be such that the number of
classes falls within 5 to 10. This number can be determined approximately by
dividing the range by the grouping interval tentatively chosen.
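The check described in Step 2 can be sketched in a few lines. The lowest and highest scores below are hypothetical, and the function name is illustrative only; the sketch simply divides the range by each tentative interval size and keeps the sizes that give roughly 5 to 10 classes.

```python
def approximate_classes(lowest, highest, size):
    """Approximate number of classes: range divided by the interval size."""
    return (highest - lowest) / size


# Hypothetical lowest and highest scores in a set of test results.
lowest, highest = 42, 88

# Try the commonly used interval sizes of 3, 5, and 10 units.
candidates = {size: approximate_classes(lowest, highest, size) for size in (3, 5, 10)}

# Keep only the sizes that yield about 5 to 10 classes.
suitable = [size for size, n in candidates.items() if 5 <= n <= 10]

print(candidates)
print(suitable)
```

For this made-up range of 46, only an interval size of 5 gives a number of classes within the recommended limits.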
Step 3:
Prepare the class intervals. It is natural to start the intervals with their
lowest scores at multiples of the size of the intervals. For example, when the
interval is 3, the classes can start at 9, 12, 15, 18, etc. Also, when the interval is 5,
the classes can start at 5, 10, 15, 20, etc.
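As a minimal sketch of this step, the lower limit of the first class can be found by rounding the lowest score down to the nearest multiple of the interval size. The function name and the sample scores are hypothetical.

```python
def lower_limits(lowest, highest, size):
    """Lower limits of the classes, each a multiple of the interval size."""
    start = (lowest // size) * size   # round the lowest score down to a multiple
    return list(range(start, highest + 1, size))


# Interval size of 3 with hypothetical scores running from 11 to 20.
print(lower_limits(11, 20, 3))

# Interval size of 5 with hypothetical scores running from 7 to 22.
print(lower_limits(7, 22, 5))
```

Starting at a multiple of the interval size is a convention for readability, not a mathematical requirement.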
The class intervals can be expressed in three different ways:
First Type:
The first type of class interval includes all scores.
For example:
10 - 15 includes scores of 10, 11, 12, 13 and 14 but not 15
15 - 20 includes scores of 15, 16, 17, 18 and 19 but not 20
20 - 25 includes scores of 20, 21, 22, 23 and 24 but not 25
In this type of classification, the upper limit of each class is repeated
as the lower limit of the next class.
This repetition can be avoided in the following type.
Second Type:
In this type the class intervals are arranged in the following way:
10 - 14 includes scores of 10, 11, 12, 13 and 14
15 - 19 includes scores of 15, 16, 17, 18 and 19
20 - 24 includes scores of 20, 21, 22, 23 and 24
Here, there is no question of confusion about the scores in the higher
and lower limits as the scores are not repeated.
Third Type:
Sometimes, we are confused about the exact limits of class intervals,
because computations very often require working with exact limits. A
score of 10 actually covers 9.5 to 10.5, and a score of 11 covers 10.5 to 11.5. Thus,
the interval 10 to 14 actually contains scores from 9.5 to 14.5. The same
principle holds no matter what the size of the interval is or where it begins in terms of
a given score. In the third type of classification, we use the real lower and upper
limits.
9.5 - 14.5
14.5 - 19.5
19.5 - 24.5 and so on.
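The conversion from the second type to the third type can be sketched as follows, assuming scores are recorded in whole units so that half a unit is subtracted from the lower limit and added to the upper limit. The function name is illustrative only.

```python
def exact_limits(lower, upper, unit=1):
    """Real lower and upper limits of a class whose scores are in steps of `unit`."""
    half = unit / 2
    return (lower - half, upper + half)


for lower, upper in [(10, 14), (15, 19), (20, 24)]:
    real_lower, real_upper = exact_limits(lower, upper)
    # The interval size is the distance between the real limits.
    print(lower, upper, real_lower, real_upper, real_upper - real_lower)
```

Note that the distance between the real limits (5 units here) is the size of the class interval.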
Step 4:
Once we have adopted a set of class intervals, we need to list the
intervals in order. Then, we tally each score in its proper
interval. (See illustration in Table 1.)
Step 5:
Make a column to the right of the tallies headed “f” (frequency). Write
the total number of tallies in each class interval under column f. The sum of
the f column will be the total number of cases, “N”.
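Steps 3 to 5 can also be carried out with a short program. The sketch below is only an illustration: the ten scores are hypothetical, the function name is made up, and the classes are written in the second type (for example, 60 - 64), listed from the highest class to the lowest.

```python
from collections import Counter


def frequency_distribution(scores, size):
    """Return (lower, upper, f) rows, highest class first."""
    # Tally each score under the lower limit of its class.
    tallies = Counter((s // size) * size for s in scores)
    lowest_class, highest_class = min(tallies), max(tallies)
    rows = []
    for lower in range(highest_class, lowest_class - 1, -size):
        rows.append((lower, lower + size - 1, tallies.get(lower, 0)))
    return rows


# Hypothetical scores of ten students.
scores = [62, 67, 71, 58, 64, 69, 73, 66, 55, 61]

table = frequency_distribution(scores, size=5)
for lower, upper, f in table:
    print(f"{lower} - {upper}   {f}")

n = sum(f for _, _, f in table)   # the sum of the f column is N
print("N =", n)
```

The sum of the f column should always equal the number of scores that were tallied.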
The next matrix contains the scores of students in mathematics.
Tabulate the scores into a frequency distribution using a class interval of 5 units.
Solution:
Table 2. Cumulative Frequency and Class Midpoint (n=60)
Class Interval (CI)  f    Midpoint (M)  cf>  cf<  Cumulative % >  Cumulative % <
90 - 94              2    92            2    60   3%              100%
85 - 89              2    87            4    58   7%              97%
80 - 84              4    82            8    56   13%             93%
75 - 79              8    77            16   52   27%             87%
70 - 74              7    72            23   44   38%             73%
65 - 69              10   67            33   37   55%             62%
60 - 64              9    62            42   27   70%             45%
55 - 59              6    57            48   18   80%             30%
50 - 54              5    52            53   12   88%             20%
45 - 49              3    47            56   7    93%             12%
40 - 44              2    42            58   4    97%             7%
35 - 39              2    37            60   2    100%            3%
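The cumulative columns of Table 2 can be reproduced from the f column alone. The following minimal sketch, with illustrative variable names, accumulates the frequencies from the top of the table for the first cumulative column and from the bottom for the second, then converts both to percentages rounded to whole numbers.

```python
from itertools import accumulate

# Class limits and frequencies from Table 2, highest class first.
classes = [(90, 94), (85, 89), (80, 84), (75, 79), (70, 74), (65, 69),
           (60, 64), (55, 59), (50, 54), (45, 49), (40, 44), (35, 39)]
f = [2, 2, 4, 8, 7, 10, 9, 6, 5, 3, 2, 2]
n = sum(f)

# Midpoint of each class: halfway between its lower and upper limits.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# First cumulative column: running total starting from the highest class.
cf_from_top = list(accumulate(f))

# Second cumulative column: running total starting from the lowest class.
cf_from_bottom = list(accumulate(reversed(f)))[::-1]

# Cumulative percentages, rounded to the nearest whole number.
pct_from_top = [round(100 * c / n) for c in cf_from_top]
pct_from_bottom = [round(100 * c / n) for c in cf_from_bottom]

for row in zip(classes, f, midpoints, cf_from_top, cf_from_bottom,
               pct_from_top, pct_from_bottom):
    print(row)
```

The last entry of each running total equals n, which is a quick way to check the table for tallying errors.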
Graphic Representation of Data
Most of us are familiar with the saying, “A picture is worth a thousand
words.” By the same token, “a graph can be worth a hundred or a thousand
numbers.” The use of tables may not be enough to give a clear picture of the
properties of a group of test scores. If numbers presented in tables are
transformed into visual models, then the reader becomes more interested in
reading the material. Consequently, understanding of the information and
problems for discussion is facilitated. Graphs are very useful for the comparison
of test results of different groups of examinees.
The graphic method is mainly used to give a simple, permanent idea and
to emphasize the relative aspect of data. Graphic presentation is highly desired
when a fact at one time or over a period of time has to be described. It must be
stressed that tabulation of statistical data is necessary, while graphic
presentation is not. Data is plotted on a graph from a table. This means that
graphic form cannot replace tabular form of data. It can only supplement the
tabular form.
Graphic presentation has a number of advantages, some of which are
enumerated below:
1. Graphs are visual aids which give a bird’s eye view of a given set of
numerical data. They present the data in simple, readily
comprehensible form.
2. Graphs are generally more attractive, fascinating and impressive than
the set of numerical data. They are more appealing to the eye and
leave a more lasting impression on the mind as compared to the dry
and uninteresting statistical figures. Even a layman, who has no
knowledge of statistics, can understand them easily.
3. They are more eye-catching and as such are extensively used to present
statistical figures and facts in most exhibitions, trade or industrial
fairs, public functions, statistical reports, etc. Graphs have universal
applicability.
4. They register a meaningful impression on the mind almost before we
think. They also save a lot of time as very little effort is required to
grasp them and draw meaningful inferences from them.
5. Another advantage of graphic form of data is that they make the
principal characteristics of groups and series visible at a glance. If the
data is not presented in graphic form, the viewer will have to study the
whole details about a particular phenomenon and this takes a lot of
time. When data is presented in graphic form, we can have information
without going into many details.
6. If the relationship between two variables is to be studied, graphic form
of data is a useful device. Graphs help us in studying the relations of
one part to the other and to the whole set of data.
7. Graphic form of data is also a very useful device to suggest the direction
of investigations. Investigations cannot be conducted without any
regard to the desired aim and the graphic form helps in fulfilling that
desired aim by suggesting the direction of investigations.
8. In short, graphic form of statistical data converts the complex and huge
data into a readily intelligible form and introduces an element of
simplicity in it.
3. Identify figure axes by the variables under analysis;
4. Quote the source which provided the data, if required;
5. Demonstrate the scale being used; and
6. Be self-explanatory.
The graph's vertical axis should always start with zero. A common type of
distortion is starting this axis with values higher than zero. Whenever this
happens, differences between variables are overestimated, as can be seen
in Figure 1.
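A small computation shows why a vertical axis that does not start at zero overstates differences. The two mean scores below are hypothetical, and the function name is illustrative only; the sketch compares how tall one bar is drawn relative to the other under two different axis starting values.

```python
def drawn_height_ratio(smaller, larger, axis_start=0):
    """How many times taller the larger bar is drawn than the smaller bar."""
    return (larger - axis_start) / (smaller - axis_start)


# Hypothetical mean scores of two groups.
group_a, group_b = 80, 90

honest = drawn_height_ratio(group_a, group_b, axis_start=0)
distorted = drawn_height_ratio(group_a, group_b, axis_start=75)

print(honest)     # axis starting at zero
print(distorted)  # axis starting at 75
```

With the axis starting at zero, the second bar is drawn only 1.125 times as tall as the first. With the axis starting at 75, the same bar is drawn three times as tall, although the scores have not changed.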
polygon can also be superimposed to compare several frequency
distributions, which cannot be done with histograms.
You can construct a frequency polygon manually using the histogram
in Figure 2 by following these simple steps:
a. Locate the midpoint on the top of each bar. Bear in mind that the
height of each bar represents the frequency in each class interval,
and the width of the bar is the class interval. As such, that point in the
middle of each bar is actually the midpoint of that class interval.
b. Draw a line to connect all the midpoints in consecutive order.
c. The line graph is an estimate of the frequency polygon of the test
scores.
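The same steps can be expressed as a list of points to be connected. The sketch below uses the midpoints and frequencies of Table 2 as the data, which is an assumption about what the histogram in Figure 2 displays. Closing the polygon at zero frequency one interval below and one interval above the data is a common convention added here, not a step stated above.

```python
# Midpoints and frequencies of the classes in Table 2, lowest class first.
midpoints = [37, 42, 47, 52, 57, 62, 67, 72, 77, 82, 87, 92]
frequencies = [2, 2, 3, 5, 6, 9, 10, 7, 8, 4, 2, 2]
size = 5  # class interval size

# Steps a and b: one point at the top-middle of each bar, in consecutive order.
points = list(zip(midpoints, frequencies))

# Assumed convention: anchor the line to the x-axis at both ends.
polygon = [(midpoints[0] - size, 0)] + points + [(midpoints[-1] + size, 0)]

for x, y in polygon:
    print(x, y)
```

Plotting these points in order and joining them with straight lines, by hand or with any charting tool, gives the frequency polygon.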
Thus, consider the class interval of 70-74, where cf> and cf< are 23
and 44, respectively. This means that 23 students (or 38%) have
scores of 70 and above, while 44 students (or 73%) have scores
of 74 and below. (Please see illustrations in Figures 3 and 4.)
3. Bar Graph
This graph is often used to present frequencies in categories of a
qualitative variable. It looks very similar to a histogram and is constructed in the
same manner, but spaces are placed between consecutive bars. The
columns represent the categories, and the height of each bar, as in a
histogram, represents the frequency. If experimental data are graphed, the
independent variable in categories is usually plotted on the x-axis, while
the dependent variable is plotted on the y-axis. When the
variable on the horizontal or x-axis is categorical, bar graphs can also be
presented horizontally. Bar graphs are very useful in comparing the test
performance of groups categorized in two or more variables. Following are
some examples of bar graphs.
4. Circle graph (Pie Chart)
One commonly used method to present categorical data is the
circle graph. You have learned in basic mathematics that there are 360°
in a full circle. As such, the categories can be represented by the sectors
of the circle that appear like slices of a pie; thus, the name pie graph. The size of
each sector is determined by the percentage of students who belong in each
category, such as the ones shown in Figures 8 and 9.
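To see how the percentages translate into sectors, the following minimal sketch converts the number of students in each category into a percentage and into the corresponding angle out of 360°. The category names and counts are hypothetical.

```python
# Hypothetical number of students in each performance category.
counts = {
    "Outstanding": 6,
    "Very Satisfactory": 12,
    "Satisfactory": 18,
    "Fairly Satisfactory": 4,
}
total = sum(counts.values())

# Each sector takes the same share of 360 degrees as its share of the students.
percentages = {name: 100 * c / total for name, c in counts.items()}
angles = {name: 360 * c / total for name, c in counts.items()}

for name in counts:
    print(name, percentages[name], angles[name])
```

The percentages should add up to 100 and the angles to 360, which is a useful check before drawing the graph.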
Selection of the most appropriate graph for a given set of data can be
facilitated by some computer software or applications. A common application is
the Chart Wizard in Microsoft Excel which offers an array of different charts
along with several variants.
Variations on the Shapes of Frequency Distributions
As discussed earlier, a frequency distribution is an arrangement of a set of
observations. These observations in the field of education or other sciences are
empirical data that illustrate situations in the real world. With the world
population reaching 7.6 billion, you can imagine hundreds of possible frequency
distributions representing different groups and subgroups taken from an
infinitely large population. It is reasonable to expect that there will be variations
in the shapes of frequency distributions. Researchers, scientists, and educators
have found that empirical data, when recorded, fit the following shapes of
frequency distributions.
What is skewness?
Examine the graphs below.
The higher frequencies are concentrated in the middle of the distribution. A
number of experiments have shown that IQ scores, height, and weight of
human beings follow a normal distribution.
The graphs of Figures 10c and 10d are asymmetrical in shape. The
degree of asymmetry of a graph is called skewness. Basic principles of a
coordinate system tell us that, as we move toward the right of the x-axis, the
numerical value increases. Likewise, as we move up the y-axis, the scale value
becomes higher. Thus, in a negatively-skewed distribution, scores cluster on the
right side: more students get higher scores, and the tail, where the
frequencies are lower, points to the left or toward the lower scores. On the other
hand, in a positively-skewed distribution, scores cluster on the left side. This
means that more students get lower scores, and the tail, where the
frequencies are lower, points to the right or toward the higher scores.
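Although skewness is described visually in this lesson, the direction of skew can also be checked numerically. The sketch below uses the moment coefficient of skewness, which is one of several available formulas and is an assumption made here for illustration; the three sets of scores are hypothetical.

```python
def skewness(scores):
    """Moment coefficient of skewness: third moment over (second moment)^1.5."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5


# Hypothetical score sets.
symmetric = [1, 2, 3, 4, 5]
tail_to_right = [1, 1, 2, 2, 3, 10]     # most scores low, a few high
tail_to_left = [1, 8, 9, 9, 10, 10]     # most scores high, a few low

print(skewness(symmetric))      # zero for a symmetric set
print(skewness(tail_to_right))  # positive value
print(skewness(tail_to_left))   # negative value
```

A positive result corresponds to a positively-skewed distribution (tail toward the higher scores), and a negative result to a negatively-skewed distribution (tail toward the lower scores).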
The graph in Figure 10b is a rectangular distribution. It occurs when the
frequency of each score or class interval of scores is the same or almost
the same, so it is also called a uniform distribution.
We have differentiated the four graphs in terms of skewness, which
refers to their symmetry or asymmetry (non-symmetry). Another way of
characterizing a frequency distribution is with respect to the number of “peaks”
seen on the curve. Refer to the following graphs.
Figure 12. A bimodal frequency distribution
We call this a bimodal distribution. Those with more than two peaks
are called multimodal distributions. In addition, unimodal, bimodal, or
multimodal distributions may or may not be symmetric. Look back at the negatively-skewed
and positively-skewed distributions in Figures 10c and 10d. Both have one
peak; hence, they are also unimodal distributions.
What is kurtosis?
Another way of contrasting frequency distributions is illustrated below.
Let us consider the graphs of three frequency distributions in Figure 13.
X is the flattest distribution. It has a platykurtic (platy, meaning broad or
flat) distribution. Y is the normal distribution and it is a mesokurtic (meso,
meaning intermediate) distribution. Z is the steepest or slimmest, and is called
a leptokurtic (lepto, meaning narrow) distribution.
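As a preview of how this visual idea can be quantified, the following minimal sketch computes excess kurtosis using the moment formula (fourth moment divided by the squared second moment, minus 3). This formula is one common choice and is an assumption here; the two sets of scores are hypothetical.

```python
def excess_kurtosis(scores):
    """Fourth moment over squared second moment, minus 3 (the normal curve's value)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m4 / m2 ** 2 - 3


# Hypothetical score sets.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # scores spread evenly
peaked = [5, 5, 5, 5, 5, 5, 5, 1, 9]   # scores piled at the center, two extremes

print(excess_kurtosis(flat))    # negative: platykurtic
print(excess_kurtosis(peaked))  # positive: leptokurtic
```

A value near zero would indicate a mesokurtic distribution, similar to the normal curve.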
What curve has more extreme scores than the normal distribution?
What curve has more scores that are far from the central value (or
average) than does the normal distribution?
For the time being, the characteristics are simply described visually.
The succeeding lesson connects these visual characteristics to some important
statistical measures.
Summary
Test data are better appreciated and communicated if they are arranged,
organized, and presented in a clear and concise manner.
A frequency distribution is a list, table or graph that displays the frequency
of various outcomes in a sample. Each entry in the table contains the
frequency or count of the occurrences of values within a particular group
or interval.
There are steps to follow in constructing a frequency distribution.
Tables and graphs are common tools that help readers better understand
the test results.
The graphic method is mainly used to give a simple, permanent idea and
to emphasize the relative aspect of data.
Tabulation of statistical data is necessary, whereas graphic
presentation is not.
Data are plotted on a graph from a table. This means that graphic form
cannot replace tabular form of data but can definitely supplement it.
Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same to
the left and right of the center point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed
relative to a normal distribution. Data sets with high kurtosis tend to have
heavy tails, or outliers, while data sets with low kurtosis tend to have light
tails, or lack of outliers.
Enrichment
1. Explore the Chart Wizard facility of the Microsoft Excel application.
2. Read the following articles:
a. “How to Create a Chart in Excel using the Chart Wizard” from
[Link]
b. “Are the Skewness and Kurtosis Useful Statistics?” from
[Link]
skewness-and-kurtosis-useful-statistics
3. Watch the following videos:
a. “MS Excel - Pie, Bar, Column & Line Chart” by Tutorials Point (India)
Ltd. (2018, January 15) from
[Link]
b. “How to Construct a Frequency Distribution Table” from
[Link]
Assessment
A. Let us see how well you understood what has been presented in this
lesson.
1. Consider the table showing the results of a reading examination of a set of
students.
Class Interval  Midpoint  f   Cumulative Frequency  Cumulative Percentage
140-144         142       2
135-139         137       7
130-134         132       9
125-129         127       14
120-124         122       10
115-119         117       6
110-114         112       2   2
f. What is the upper limit of the class with the lowest frequency?
g. The entry in the lowest class interval of the 4th column is done for
you. From the lower class interval, can you fill up the remaining
blanks upward? How did you do it?
h. Look at the entire column on cumulative frequency. What is the
cumulative frequency of the highest class interval? How do you
compare this cumulative frequency with the number of students
who took the test?
i. The last column is labeled cumulative percentage. What should be
the first entry at the bottom of the column? How did you determine
it? Can you fill up the entire column with the right percentage? How
do you do these in two ways? Which is the easy way?
j. Take a look at the values in the table, in particular, the frequency
column. What type of distribution (positively skewed, negatively
skewed, and symmetrical) is depicted by the given values? Why do
you say so?
k. What type of graph is most appropriate for this frequency table?
2. Analyze the figures in the succeeding pages and answer the questions
that pertain to each graph.
For Figure 15:
a. What is the shape of the frequency distributions as to symmetry?
b. What is the estimated value of the highest score in each
distribution? What does this value indicate?
c. Which section got the highest average? Which section got the
lowest?
For Figure 17:
a. If the center dotted line is taken as the average, how do you
compare the average of the three frequency distributions?
b. In what aspects do the three distributions differ?
c. Imagine Xs placed inside each of the three curves, where X
represents a score. How do you compare the spread of the scores
in the three frequency distributions from their respective averages?
d. In which section did the scores spread most?
e. Which section has scores closest to the average?
3. Now, to further see how well you were able to comprehend all the topics
discussed earlier, fill in the answer to each box in the diagram below.
B. Accomplish the following activities to know the extent to which you have
understood the concepts introduced in this lesson.
1. The following aptitude test scores have been recorded in a guidance
office.
140 88 115 91 96
93 117 99 101 108
98 123 119 146 107
107 111 100 125 110
83 127 116 113 104
126 114 110 114 138
109 102 113 106 90
107 91 102 103 135
104 101 131 87 124
113 135 126 112 140
4. Cumulative frequency; and
5. Cumulative percentage.
f. Construct a histogram from the given scores.
g. Draw a frequency polygon superimposed on the histogram you
constructed in f.
h. Using your data in e.5, draw a cumulative percentage polygon.
Figure 19 shows a graph constructed from a first quarter exam in Science
gathered from 193 STEM students; 100 are males and 93 females. Give three
statements on test performance of STEM students as depicted in the figure.
C. Use the given Self-confidence Inventory to gather the data that you need
to apply what you have learned in this lesson.
Self-Confidence Inventory
Put a check mark (√) in the appropriate column that describes how you find yourself in
the following situations. There is no right or wrong response to each item, so
feel free to express your true self. Results will be kept strictly confidential.
Statements    Always (5)    Almost Always (4)    Sometimes (3)    Seldom (2)    Never (1)
1. I feel that I have a
number of positive
qualities.
2. I feel I am a worthy
person to my family,
friends, and classmates.
3. I am inclined to think I am
a failure.
4. I have many
accomplishments as what
others of my age have
done.
5. I feel I do not have much
to be proud of my family.
6. I am happy with who I
am.
7. I feel I have not
contributed much as a
son/daughter to my
parents.
8. I feel that my classmates
are afraid to approach me
for help.
9. I am afraid to make
mistakes.
10. I am not bothered
about what people say
about me.
11. With how I am going,
the future will be bright for
me.
12. I get excited when I
try new things.
13. I cannot sleep when I
hear negative things
about me.
14. I am as important as
other people.
15. I feel depressed
when I do not succeed in
what I plan to achieve.
You are to work in a group of five to accomplish the following:
1. Administer the Self-Confidence Inventory to at least 100 respondents with
the following requirements:
a. Different characteristics/demographics (e.g., year level or age, gender,
professional background, etc.)
b. Classify the respondents into only two groups if you have 100
respondents, or at most three groups if you have more than 100
respondents.
2. Get the score of each respondent using the point system below:
Always = 5 points Seldom = 2 points
Almost Always = 4 points Never = 1 point
Sometimes = 3 points
7. With reference to the degree of skewness, kurtosis, and number of peaks
of the graphs that you have constructed, make a descriptive
comparison of the level of self-confidence between or among groups of
respondents.
8. Make a written report whose sections are aligned with tasks 1-7
above.
D. Use the rubric below in evaluating your report on the Self-Confidence
Inventory. Then let another group rate your work using the same
rubric.
Criteria, with descriptors for 3 points, 2 points, and 1 point
I. Methodology and Response Scoring
Organization
   3 points: Administration of the inventory was organized and coherent with the
   instructions given.
   2 points: Administration of the inventory was organized but not coherent with
   the instructions given.
   1 point: Administration of the inventory was neither organized nor coherent
   with the instructions given.
Response Scoring
   3 points: Scoring of respondents’ answers is all correct and accurate.
   2 points: Scoring of responses has one to two errors.
   1 point: Scoring of responses showed three or more errors.
II. Frequency Distribution
Accuracy of Content
   3 points: All content given is correct.
   2 points: One to two errors are evident.
   1 point: Three or more errors are evident.
Representation
   3 points: All relevant information is presented correctly.
   2 points: One to two pieces of relevant information are presented incorrectly.
   1 point: Three or more pieces of relevant information are presented incorrectly.
III. Graphic Display of Data
Completeness
   3 points: The graph contains complete information (i.e., has title, labels,
   and legend).
   2 points: The graph has one piece of information missing.
   1 point: The graph has two or more pieces of information missing.
Neatness and Organization
   3 points: The graph is very neat and easy to read.
   2 points: The graph is generally neat and readable.
   1 point: The graph is difficult to read.
IV. Written Report
Content
   3 points: All contents are relevant to the task.
   2 points: Minor irrelevances are present.
   1 point: Both irrelevances and misinterpretations are present.
Explanation
   3 points: Explanation is clear and relevant to answer the questions.
   2 points: Explanation is somewhat clear but relevant to the questions.
   1 point: Explanation is difficult to understand and does not directly answer
   the questions.
Completion
   3 points: All problems and required activities are completed.
   2 points: One to two of the problems and activities are not completed.
   1 point: Three or more of the problems and activities are not completed.
E. Answer the following multiple-choice items.
1. If the lowest score in a distribution is 71 with a class interval of 5, what is
the most appropriate first class interval?
A. 71-75 C. 70-75
B. 67-75 D. 70-74
Refer to the figure below to answer the next questions:
4. What period shows the highest increase of students passing the subjects?
A. 1st Quarter
B. 2nd Quarter
C. 3rd Quarter
D. 4th Quarter
5. What is the rate of increase of passing from the 2nd to 3rd quarter?
A. 75%
B. 50%
C. 33%
D. 25%
F. Supplemental Exercises
1. The following is a frequency distribution of examination marks:
Class interval f
90 – 94 6
85 – 89 9
80 – 84 7
75 – 79 13
70 – 74 14
65 – 69 19
60 – 64 11
55 – 59 11
50 – 54 9
45 – 49 8
40 – 44 8
Answer the following questions. You are free to consult your teacher
should you have concerns over these exercises.
a. What is the size of the class interval?
b. What is the exact limit of the class interval with an observed
frequency of 13? How did you determine it?
c. Without graphing, how do you see the shape of the graph? Is it
symmetrical or skewed? Is it unimodal or bimodal? Give a
statement or two to support your answer.
d. Sketch the graph of the frequency distribution using the data on the
table.
e. Confirm your answer in c. With the shape of the graph you have
drawn in d, are you correct in your thinking?
f. Create a 3rd column containing the midpoints of the class intervals.
Did you get a whole number or a decimal number?
g. Create a 4th column to indicate the cumulative frequency starting
from the lowest class interval.
h. Create a 5th column to represent the cumulative percentage.
2. Talk to a teacher/cooperating teacher or practicing student teacher in
your area of specialization. Then, request a set of test results from a
periodical examination of a class with at least 50 students. With these
results, work on the following:
a. Arrange the test scores from highest to lowest.
b. Tally the occurrence of the scores.
c. Prepare a frequency distribution and a cumulative frequency
distribution for these data using an appropriate class interval.
d. Write the exact limits and the midpoints of the class intervals for the
frequency distributions.
e. Sketch a histogram for the data you have summarized.
f. Superimpose a frequency polygon in the histogram you have
drawn.
g. Describe the graph of the frequency distribution you have done as
to its (a) symmetry and (b) modal point.
3. This work can be done with a classmate or a friend. With your partner
do the following:
a. Secure three (3) sets of test results from the same test given to
students in any grading period this year or last year. This could be
in any subject. Tell the teacher or student teacher who will allow
you to access the test about the confidentiality of information you
will access from them. Inform the source of your data that all results
will only be used for this academic undertaking, and no teacher or
student names, or even the school will be mentioned in any of the
reports.
b. Tally the scores in each section separately.
c. Make a separate frequency distribution table for each of the three
classes using the same class intervals.
d. Draw a graphical representation of the test performance of the three groups,
all contained in one graph and using the same class intervals.
e. Interpret the results in each section.
f. Make a qualitative comparison of test performance between and
among the three sections.
Educators’ inputs
I have been teaching statistics for many years at both undergraduate
and graduate programs in the College of Education. I am happy with the
illustration of statistics in the area of assessment and evaluation. It gave me the
opportunity to teach statistics in different contexts. It provided me with a
practical application of statistics in the assessment of students’ learning. Topics
in statistics like data handling can be boring to many. When I present, for
example, test score data with 100 observations, my students usually just look
at the data and wait for my next step. When I group data into a frequency
distribution table, I see my students amazed with the summarized and
condensed form of information. When I create a graph out of the frequency
table for grouped data, I see students becoming more attentive to pictures and
graphs. However, drawing graphs in the traditional way, with different steps to
follow and the do’s and don’ts, could cause anxiety for many. When I create
graphs, I usually do it through SPSS software, which has been an indispensable
tool in my statistics class. When I use this software vis-à-vis the traditional way
of computing, students are amazed. While I should be happy when my students
show interest in what they learn in my class, I am also concerned that there is
an erosion in my students’ ability to read hidden information in a graph, which
a teacher should be concerned about. Being mindful of this, I have some
favorite lines that I use when teaching the use of
tables and graphs in organizing and presenting test data:
It seems you prefer seeing pictures rather than tables and reading text.
But let me give you some words of caution. When you read materials with
pictures or graphs in published works on competitive achievement tests, or
other forms of advertisement and reports, be critical. A picture or a graph may
be deceiving. With some tricks, you can be misled. With some mechanical
manipulations, like compressing or expanding graphs with incorrect scaling,
you can be deceived. Do not rely completely on the visuals; examine the
underlying information and detect the missing information.
References
Lesson 2: Analysis, Interpretation, and Use of Test Data
Pre-discussion
Discussions in this lesson will build upon the concepts and examples
presented in the preceding lesson, which focused on the tabular and graphical
presentation and interpretation of test results. This time, other ways of
summarizing test data using descriptive statistics, which provides a more
precise means of describing a set of scores, will be introduced. The word
“measures” is commonly associated with numerical and quantitative data.
Hence, the prerequisite to understanding the concepts contained in this lesson
is your basic knowledge of mathematics, e.g., summation of values, simple
operations on integers, squaring and finding square roots, etc.
What to Expect?
At the end of the lesson, the students can:
1. find the mean, median, and mode of test score distribution;
2. determine the different measures of dispersion of test scores;
3. calculate the measure of position;
4. relate standard deviation and normal distribution;
5. transform raw scores to standardized scores (z, T and stanine);
6. compute the measure of covariability using the long process and Excel;
and
7. interpret test data applying measures of central tendency, variability,
position, and covariability.
Mean. This is the most preferred measure of central tendency for use
with test scores, also referred to as the arithmetic mean. The computation is
very simple. When a student has added up the examination scores he/she
made in a subject during the grading period and divided the sum by the number of
examinations taken, then he/she has computed the arithmetic mean.
That is, $\bar{X} = \dfrac{\sum X}{N}$, where $\bar{X}$ = the mean, $\sum X$ = the sum of all the scores,
and N = the number of scores in the set.
Consider again the test scores of students given in Table 1, which is the
same set of test scores used in the previous lesson.
The mean is the sum of all the scores from 53 down to the last score,
which is 35, divided by the total number of cases.
That is,
$\bar{X} = \dfrac{\sum X}{N} = \dfrac{53 + 36 + 57 + \dots + 60 + 49 + 35}{100}$
You have many ways of computing the mean. The traditional long and
tedious computation techniques have outlasted their relevance due to the
advancement of technology and the emergence of statistical software. Using
your specific calculator, you will see the symbols $\bar{x}$ and $\sum x$. Just follow the
simple steps indicated in the guide. There are also simple steps in Microsoft
Excel. Different versions of the statistical software SPSS offer the fastest way
of obtaining the mean, even with hundreds of scores in a set. There is no loss
of original information because you are dealing with original individual scores.
The use of statistical software will be explained later.
While we recognize the power of technology, there is information that goes
unappreciated because of the short-hand processing of data through
mechanical computations. Look at the conventional way of presenting data in a
frequency distribution table as done in the previous lesson:
In the traditional way, it cannot be denied that you can see at a glance
how the scores are distributed among the range of values in a condensed
manner. You can even estimate the average of the scores by looking at the
frequency in each class interval. In the absence of a statistical program, the mean
can be computed with the following formula.
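One common way to estimate the mean from a frequency distribution is to weight each class midpoint by its frequency. The short Python sketch below illustrates that approach only; it is an assumption about the intended method, and the frequency table in it is hypothetical (it is not the lesson's Table 2).

```python
# Hypothetical frequency table: (lower limit, upper limit, frequency).
# These values are for illustration only.
classes = [(40, 44, 3), (45, 49, 5), (50, 54, 8), (55, 59, 4)]


def grouped_mean(table):
    """Estimate the mean as sum(frequency * midpoint) / sum(frequency)."""
    total_frequency = sum(f for _, _, f in table)
    weighted_sum = sum(((low + high) / 2) * f for low, high, f in table)
    return weighted_sum / total_frequency


estimated_mean = grouped_mean(classes)
print(round(estimated_mean, 2))
```

Because each score is replaced by the midpoint of its class, the result is an estimate and may differ slightly from the mean of the original individual scores.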
Median. It is the value that divides the ranked scores into halves, or the middle
value of the ranked scores. If the number of scores is odd, then there is only
one middle value that gives the median. However, if the number of scores in
the set is an even number, then there are two middle values, and the median is
their average. But if there are more than 50 scores, arranging the scores and
finding the middle value will take time.
This formula will help you determine the median:
4. Find the exact limits of the median class. In this case, the class is 44.5-49.5;
the lower limit, then, is 44.5.
Summing up these steps and substituting these values to the formula,
we have:
Mode. It is the easiest measure of central tendency to obtain. The mode is the
score or value with the highest frequency in the set of scores. If the scores are
arranged in a frequency distribution, the mode is estimated as the midpoint of
the class interval which has the highest frequency. This class interval with the
highest frequency is also called the modal class. In a graphical representation
of the frequency distribution, the mode is the value in the horizontal axis at which
the curve is at its highest point (peak). If there are two highest points, then
there are two modes, as discussed in the previous lesson. When all the scores in
a group have the same frequency, the group of scores has no mode.
Considering the test data in Table 2, it can be seen that the highest
frequency of 21 occurred in the class interval 45 - 49. The rough estimate of the
mode is 47, which is the midpoint of the class interval.
Because manual computations of the mean, median, and mode are lengthy
and tedious, technology makes them simpler through the use of the
Microsoft Excel application. Here’s a simple guide to follow.
B12. Calculating the mean of scores located in several columns and rows is
also possible, provided they are all selected or defined.
Median in Excel
Mode in Excel
The MODE function helps you find the value that occurs most often.
When you are working on a large amount of data, this function can be a lot of
help. To find the most frequently occurring value in Excel, use the MODE function and
select the range you want to find the mode of. In our example below, we use
=MODE(B2:B12) and since 2 students have scored 55 we get the answer as
55.
In situations where there are two or more modes in your data set, the
Excel MODE function will return the lowest mode.
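For readers who want to cross-check the three measures outside Excel, the following is a minimal Python sketch using the standard `statistics` module. The scores are hypothetical and chosen so that two values tie for the highest frequency. Note that this shows Python's behavior with tied modes, which is not necessarily the same as Excel's.

```python
import statistics

# Hypothetical test scores; 55 and 48 each occur twice.
scores = [55, 48, 55, 62, 48, 70, 51]

mean_score = statistics.mean(scores)      # sum of scores / number of scores
median_score = statistics.median(scores)  # middle value of the ranked scores
first_mode = statistics.mode(scores)      # first mode encountered (Python 3.8+)
all_modes = statistics.multimode(scores)  # every value tied for highest frequency

print(round(mean_score, 2), median_score, first_mode, all_modes)
```

Listing all tied values with `multimode` makes a bimodal set of scores visible instead of silently reporting only one of the modes.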
Scale of Measurement
There are four levels of measurement that apply to the treatment of test
data: nominal, ordinal, interval, and ratio. In nominal measurement, the
number is used for labeling or identification purposes only. An example is the
student’s identification number or section number. In data processing, instead
of labeling gender as female, a code “1” is used to denote Female and “2” to
denote Male. While “2” is numerically greater than “1,” in this case the
difference of 1 has no meaning; it does not indicate that Male is better than
Female. The purpose is simply to differentiate or categorize the subjects by
gender.
The ordinal level of measurement is used when the values can be ranked
in some order of characteristics. The numeric values are used to indicate the
difference in traits under consideration. Academic awards are made on the
basis of an order of performance: first honors, second honors, third honors, and
so on. Some assessment tools require students to rank their interests or hobbies,
or even career choices. Percentile ranks in a national assessment test or
entrance examination are examples of measurement in an ordinal scale.
Percentile scores become more useful and meaningful than simple raw scores
in university entrance or division-wide examinations.
The interval level of measurement, which has the properties of both the
nominal and ordinal scales, is attained when the values can describe the
magnitude of the differences between groups or when the intervals between
the numbers are equal. “Equal interval” means that the distance between the
things represented by 3 and 4 is the same as the distance represented by 4 and 5.
The difference between the temperatures 30° and 40° is the same as that
between 90° and 100°. However, there is no true zero point. Zero degrees
on the Celsius thermometer does not mean zero or absence of heat; 0° is an
arbitrary value, a convenient starting point. With an arbitrary zero point, there is a
restriction with interval data. We cannot say that an 80° object is twice as hot
as a 40° object. In the educational setting, a student who gets a score of 120 in a
reading ability test is not twice as good a reader as one who got a score of 60
in the same test.
The highest level of measurement is the ratio scale. As such, it carries
the properties of the nominal, ordinal, and interval scales. Its additional
advantage is the presence of a true zero point, where zero indicates the total
absence of the trait being measured. A 0 cm as a measure of width means no
width, 0 km as a measure of distance means no distance traveled, and 0 words
spelled means no word was spelled at all. Test scores as measures of
achievement in many school subjects are often treated as interval scale.
However, if achievement in a performance test in Physical Education is
measured by the number of “push-ups” one can do in a minute or the distance run in
an hour, or in a typing class where we count the words typed in a minute or
words spelled correctly, then these are all on the ratio scale.
Now, the most likely questions that cross your mind are:
Which measure of central tendency should I use?
Do I have to use all the three since the statistical program can
automatically give the three measures the easiest way?
Generally, the mean is the most used measure of central tendency
because it is appropriate for interval and ratio variables, which are higher levels
of measurement. Because every score enters its computation, its value changes
when even a single score changes, which is why it is regarded as the most
accurate measure to represent a set of scores. In research, it is the most
utilized measure, specifically when you want to make an inference
about the population characteristics on the basis of the observed sample value.
For the median, in some cases, we could have one very high score (or very
few high scores) and many low scores. This is especially true when the test is
difficult, or when students are not well prepared for the test. This will result in
many low scores and a few high scores that will lead to a positively-skewed
distribution. In the same way, when the test is too easy for students, there will
be many high scores, which lead to a negatively-skewed distribution. In both
cases, the mean can give an erroneous impression of central tendency because
its value is pulled by the extreme values that reduce its role as the
representative value of the set of scores. Hence, the median is a better
measure. It is the value that occupies the middle position among the ranked
values; thus, it is less likely to be drawn toward the direction of extreme scores.
It is an ordinal statistic but can also be used for interval or ratio data distribution.
The mode is determined by the highest frequency of observation that
makes it a nominal statistic.
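The pull of extreme values on the mean can be seen in a small sketch. The scores below are hypothetical and mimic a difficult test with many low scores and a single very high score (a positively-skewed set).

```python
import statistics

# Hypothetical scores from a difficult test: many low scores, one very high.
scores = [10, 12, 12, 13, 14, 15, 15, 16, 48]

mean_score = statistics.mean(scores)      # pulled upward by the score of 48
median_score = statistics.median(scores)  # middle value of the ranked scores

print(round(mean_score, 2), median_score)
```

Here the median stays with the bulk of the scores, while the mean is drawn toward the extreme score, which is the reason the median is preferred for skewed distributions.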
How do measures of central tendency determine skewness?
The mean, median, and mode may further be compared if they have
been calculated from the same frequency distribution. In a perfectly
symmetrical distribution, the mean, median, and mode have the same value,
and the value of the median is between the mean and the mode. This shape is
illustrated in Figure 1.
On the other hand, in a negatively-skewed distribution, as shown in the
graph below, the mode remains at the peak of the curve, but it will have the
largest value. The mean will have the smallest value as influenced by the
extremely low scores, and the median still lies between the mode and the mean.
You can see that different distributions may be symmetrical and may have
the same average values (mean, median, mode), but how the scores in each
distribution are spread out around these measures is different. In A, as shown
in Figure 4, scores range between 40 and 60; in B, between 30 and
70; and in C, between about 20 and 80. Measures of variability give us the
estimate to determine how the scores are compressed, which contributes to the
“flatness” or “peakedness” of the distribution.
There are several indices of variability, and the most commonly used in
the area of assessment are the following.
Range. It is the difference between the highest ($X_H$) and the lowest ($X_L$) scores
in a distribution. It is the simplest measure of variability but is also considered
the least accurate measure of dispersion because its value is determined by
just two scores in the group. It does not take into consideration the spread of all
scores; its value simply depends on the highest and lowest scores. Its value
could be drastically changed by a single value. Consider the following
examples:
Determine the range for the following scores: 9, 9, 9, 12, 12, 13, 15, 15, 17,
17, 18, 18, 20, 20, 20.
Range = $X_H - X_L$
= 20 – 9
= 11
Now, replace one of the high scores, say, the last score, and
make it 50. The range becomes:
Range = $X_H - X_L$
= 50 – 9
= 41
We observe that by changing just a single score, the range increased greatly.
This can be interpreted as a large dispersion of test scores; however, when you
look at the individual scores, it is not.
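The same two computations can be reproduced with a few lines of Python. This is only a sketch for checking the arithmetic; the helper name `score_range` is introduced here for illustration.

```python
scores = [9, 9, 9, 12, 12, 13, 15, 15, 17, 17, 18, 18, 20, 20, 20]


def score_range(values):
    """Range = highest score minus lowest score."""
    return max(values) - min(values)


original_range = score_range(scores)        # 20 - 9 = 11

# Replace only the last score with 50 and recompute.
altered_scores = scores[:-1] + [50]
altered_range = score_range(altered_scores)  # 50 - 9 = 41

print(original_range, altered_range)
```

Fourteen of the fifteen scores are unchanged, yet the range almost quadruples, which shows why the range alone is a weak index of dispersion.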
Variance and Standard Deviation. Standard deviation is the most widely used
measure of variability and is considered as the most accurate to represent the
deviations of individual scores from the mean values in the distribution.
Let us examine the following test score distribution:
Class A Class B Class C
22 16 12
18 15 12
16 15 12
14 14 12
12 12 12
11 11 12
9 11 12
7 9 12
6 9 12
5 8 12
Note that while the distributions contain different scores, they have the
same mean. If we ask how well each mean represents the scores in its respective
distribution, there will be no doubt with the mean of distribution C because each
score in the distribution is 12. How about in distributions A and B? For these
two distributions, the mean of 12 is a better estimate of the scores in distribution
B than in distribution A. We can see that no score in B is more than 4 points
away from the mean of 12. However, in distribution A, six of the 10 scores are 4
points or more away from the mean. We can also say that there is less variability
of scores in B than in A. However, we cannot just determine which distribution
is dispersed or not by merely looking at the numbers, especially when there are
many scores. We need a reliable index of variability, such as the variance or
standard deviation, that takes into consideration all the scores.
Recall that $\sum (X - \bar{X})$ is the sum of the deviation scores from the mean, which
is equal to zero. As such, we square each deviation score, then sum up all the
squared deviation scores, and divide the sum by the number of cases. This yields the
variance. Extracting its square root gives us the standard deviation.
The measure is generally defined by the formula:
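Class A                                          Class B
X    $X - \bar{X}$     $(X - \bar{X})^2$         X    $X - \bar{X}$     $(X - \bar{X})^2$
22   22 - 12 = 10      100                       16   16 - 12 = 4       16
18   18 - 12 = 6       36                        15   15 - 12 = 3       9
16   16 - 12 = 4       16                        15   15 - 12 = 3       9
14   14 - 12 = 2       4                         14   14 - 12 = 2       4
12   12 - 12 = 0       0                         12   12 - 12 = 0       0
11   11 - 12 = -1      1                         11   11 - 12 = -1      1
9    9 - 12 = -3       9                         11   11 - 12 = -1      1
7    7 - 12 = -5       25                        9    9 - 12 = -3       9
6    6 - 12 = -6       36                        9    9 - 12 = -3       9
5    5 - 12 = -7       49                        8    8 - 12 = -4       16
$\bar{X} = 12$, $\sum (X - \bar{X})^2 = 276$     $\bar{X} = 12$, $\sum (X - \bar{X})^2 = 74$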
The values 276 and 74 are the sums of the squared deviations of scores
in Class A and Class B, respectively. If these are divided by the number of scores
in each class, this gives the variance ($S^2$):
The values above are both in squared units, while our original units of
scores are not in squared units. When we find their square roots, we obtain
values that are on the same scale of units as the original set of scores. These
give the respective standard deviation (S) of each class, computed as
follows:
When you are finding the variance using Excel, you can simply use the
VAR function and select the range, and you will find the desired variance. We
take the data in Class A to find the variance of the scores obtained by students.
So we use =VAR(A2:A11). In the case of Class B, the same function is used,
but the data in the cells from B2 to B11 are considered. Thus, we use
=VAR(B2:B11).
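As a cross-check outside Excel, the sketch below computes the variance and standard deviation of Class A and Class B with Python's standard `statistics` module. It reports both conventions: dividing the sum of squared deviations by N (the number of scores) and dividing by N - 1 (the sample convention). Which of the two a given spreadsheet function applies should be confirmed in its documentation; the dictionary name `results` is introduced here only for illustration.

```python
import math
import statistics

class_a = [22, 18, 16, 14, 12, 11, 9, 7, 6, 5]
class_b = [16, 15, 15, 14, 12, 11, 11, 9, 9, 8]

results = {}
for name, scores in (("Class A", class_a), ("Class B", class_b)):
    pop_var = statistics.pvariance(scores)   # sum of squared deviations / N
    samp_var = statistics.variance(scores)   # sum of squared deviations / (N - 1)
    results[name] = {
        "pop_var": pop_var,
        "pop_sd": math.sqrt(pop_var),
        "samp_var": samp_var,
        "samp_sd": math.sqrt(samp_var),
    }

for name, values in results.items():
    print(name, {key: round(value, 2) for key, value in values.items()})
```

With the sums of squares 276 and 74 and N = 10, dividing by N gives 27.6 and 7.4, while dividing by N - 1 gives larger values. When a software result differs from a manual result, the divisor is the first thing to check.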
In both instances, the Excel values match the results of the
manual process, provided the same divisor is used: VAR and STDEV divide by
N - 1, while VARP and STDEVP divide by N. For a larger number of scores in a
distribution, Microsoft Excel
or other software is more appropriate and efficient in obtaining the variance and
standard deviation. This can be done in a few seconds if you have already
entered and saved the data used to get the measure of central tendency. In
addition, the VAR and STDEV functions can still be used even if scores are
encoded in several columns and rows, as shown in the next illustration.
In addition, since the standard deviation is a measure of dispersion,
a large standard deviation indicates greater score variability than
a small standard deviation. If the standard deviation is small, the scores are
closely clustered around the mean, or the graph of the distribution is
compressed, whether it is symmetrical or skewed.
where Sk = skewness
$\bar{X}$ = mean
Mdn = median
SD = standard deviation
difference between mean and median moves farther from 0, the coefficient of
skewness changes to either lower or higher values.
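A coefficient built from the mean, median, and standard deviation can be sketched as follows. The sketch assumes Pearson's median-based coefficient, $Sk = \dfrac{3(\bar{X} - Mdn)}{SD}$; this specific form, including the factor of 3, is an assumption, since the formula itself is not reproduced in this section. The Class A scores from the variability discussion are reused, with the standard deviation computed by dividing by N.

```python
import statistics


def pearson_skewness(mean, median, sd):
    """Assumed form: Sk = 3 * (mean - median) / SD."""
    return 3 * (mean - median) / sd


class_a = [22, 18, 16, 14, 12, 11, 9, 7, 6, 5]

sk_a = pearson_skewness(
    statistics.mean(class_a),
    statistics.median(class_a),
    statistics.pstdev(class_a),  # standard deviation using N as divisor
)
print(round(sk_a, 2))
```

A value of 0 arises when the mean equals the median; a positive value means the mean lies above the median, and a negative value means it lies below.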
Measures of Position
While measures of central tendency and measures of dispersion are
used often in assessment, there are other methods of describing data
distributions such as using measures of position or location. It is about the
score’s position in the distribution. What are these measures?
Figure 6. Quartiles
The following example illustrates these measures.
Given the following scores, find the 1st quartile, 3rd quartile, and quartile
deviation.
90, 85, 85, 86, 100, 105, 109, 110, 88, 105, 100, 112
Steps:
1. Arrange the scores in decreasing order.
2. From the bottom, find the points below which 25% of the score values
and 75% of the score values fall.
3. Find the average of the two scores at each of these points to determine
Q1 and Q3, respectively.
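The steps above can be sketched in Python by splitting the ranked scores into a lower half and an upper half and taking the median of each half. The quartile deviation is computed here as half the distance between Q3 and Q1, which is the usual definition and is stated as an assumption because the formula is not shown in this section. Sorting in increasing rather than decreasing order does not change the result.

```python
import statistics

scores = [90, 85, 85, 86, 100, 105, 109, 110, 88, 105, 100, 112]

ordered = sorted(scores)
half = len(ordered) // 2

lower_half = ordered[:half]    # lowest 50% of the scores
upper_half = ordered[-half:]   # highest 50% of the scores

q1 = statistics.median(lower_half)   # (86 + 88) / 2 = 87.0
q3 = statistics.median(upper_half)   # (105 + 109) / 2 = 107.0
quartile_deviation = (q3 - q1) / 2   # (107 - 87) / 2 = 10.0

print(q1, q3, quartile_deviation)
```

Statistical packages use several slightly different quartile conventions, so software output for the same scores may differ a little from this halves-based method.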
Note that in the above example, the upper and lower 50% each contain an even
number of scores, so the median in each half is the average of the two center
values. Consequently, applying the formula gives the quartile
Decile. It divides the distribution into 10 equal parts. There are 9 deciles such
that 10% of the scores are equal to or less than decile 1 (D1), 20% of the
scores are equal to or less than decile 2 (D2), and so on. A student whose mark
is below the first decile is said to belong in decile 1. A student whose mark is
above the 9th decile belongs to decile 10. If there are a small number of data
values, decile is not appropriate to use.
Percentile. It divides the distribution into one hundred equal parts. In the same
manner, there are 99 percentiles such that 1% of the scores are
less than the first percentile, 2% of the scores are less than the second
percentile, and so on. For example, if you scored 95 in a 100-item test, and your
percentile rank is 99th, then this means that 99% of those who took the test
performed lower than you. This also means that you belong to the top 1% of
those who took the test. In many cases, percentiles are wrongly interpreted as
percentage scores. For example, 75% as a percentage score means you got 75
items correct out of a hundred items, which is a mark or grade reflecting
performance level. But a percentile is a measure of position, such that the 75th
percentile as your mark means 75% of the students who took the test got a lower
score than you, or your score is located in the upper 25% of the class who took
the same test. For very large data sets, percentiles are appropriate to use for
accuracy. This is one reason why percentiles are commonly used in national
assessments or university entrance examinations with large datasets or scores
in the thousands.
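The difference between a percentage score and a percentile rank can be illustrated with a short sketch. The class scores are hypothetical, and the helper `percentile_rank` simply counts the percentage of scores that fall below a given score, following the interpretation given above.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores in the group that are lower than the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical scores of 20 students in a 100-item test: 41, 42, ..., 60.
class_scores = list(range(41, 61))

my_score = 56
percentage_score = 100 * my_score / 100       # 56 items correct out of 100
my_percentile_rank = percentile_rank(my_score, class_scores)

print(percentage_score, my_percentile_rank)
```

In this hypothetical class, a student with only 56% of the items correct is still at the 75th percentile, because 15 of the 20 students scored lower.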
different categories. In the first example, one is the distribution of mathematics
scores while the other is the distribution of science scores. To make the
comparison logical, we need a measure of relative dispersion, also known as
the coefficient of variation (CV). This is simply the ratio of the standard
= 0.277
= 28%.
Looking at 10 and 5 as the standard deviations of mathematics and
science scores, respectively, you may be led to judge that the set of scores
in mathematics is twice as dispersed as the scores in science. From the
computed coefficients of variation as measures of relative dispersion, we can
clearly see that the scores in mathematics are more homogeneous than the
scores in science.
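The comparison can be sketched in Python, taking the coefficient of variation as the standard deviation divided by the mean and expressed as a percentage. The standard deviations of 10 and 5 come from the discussion above, but the two means used below are hypothetical values chosen only for illustration.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: 100 * SD / mean."""
    return 100 * sd / mean


# Standard deviations from the text; the means are hypothetical.
math_cv = coefficient_of_variation(sd=10, mean=50)
science_cv = coefficient_of_variation(sd=5, mean=20)

print(math_cv, science_cv)
```

With these assumed means, the mathematics scores have the larger standard deviation but the smaller coefficient of variation, so relative to their mean they are the more homogeneous set.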
The Normal Distribution
The normal distribution is a special kind of symmetrical distribution that
is most frequently used to compare scores. It has been found that when
frequency polygons for large distributions of scores of naturally occurring
phenomena and characteristics (IQ, height, income, test scores, etc.) are drawn
as smooth curves, one curve stands out, which is the bell-shaped curve, as
seen below. This curve has a small percentage of observations on both tails
and a bigger percentage on the inner part of the curve. The shape of this
particular curve is known as the normal curve, hence the name normal
distribution.
99.73% of the scores fall within three standard deviations below
and above the mean
Figure 8 illustrates the theoretical model.
From the above figures, we can state the properties of the normal
distribution:
1. The mean, median, and mode are all equal.
2. The curve is symmetrical. As such, the value of a specific area on the
left is equal to the value of its corresponding area on the right.
3. The curve changes from concave to convex and approaches the X-
axis, but the tails do not touch the horizontal axis.
4. The total area under the curve is equal to 1.
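The proportions of scores that fall within one, two, and three standard deviations of the mean can be verified numerically. The sketch below uses the error function from Python's `math` module; the helper name `proportion_within` is introduced here for illustration.

```python
import math


def proportion_within(k):
    """Area under the normal curve between -k and +k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 2))
```

Rounded to two decimals, the three areas are 68.27%, 95.45%, and 99.73%. Because the curve is symmetrical, half of each area lies below the mean and half above it.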
Standard Scores
In the preceding sections, we discussed raw scores, which are the
original scores collected from an actual testing activity. However, there are
situations where computing measures from raw scores may not be enough.
Consider a situation where you, as a student, want to know in which subjects you
performed best and poorest to determine where you need to exert more effort. Or
maybe in the past, you took entrance examinations in more than one university
and asked yourself in which university you performed best. In cases like these,
we cannot find the answer by merely relying on a single score. More concretely,
if you get a score of 86 in English and 90 in Physics, you cannot conclude that
you performed better in Physics than in English. This may sound ridiculous
because 90 is higher than 86. Say you later learned that the mean score of the
class in English was 80 and that the mean score in Physics was higher than 90.
A score like 86 or 90 is not meaningful
unless it is compared with other test scores. In particular, a score can be
interpreted more meaningfully if we know the mean and variability of the other
scores where that single score belongs. Knowing this, a raw score can be
converted into standard scores.
Z-score. There are many kinds of standard scores. The most useful is the
z-score, which is often used to express a raw score in relation to the mean and
standard deviation. This relationship is expressed in the following formula:
Figure 9. A Comparison of Score Distribution with Different
Means and Standard Deviations
From the above, if 86 and 90 are your scores in the two subjects, you
can confidently say that, compared with the rest of your class, you performed
better in English than in Physics. That is because in English, your performance
is 2 standard deviations above the mean, while in Physics, you are 2.5 standard
deviations below the mean. While 90 is numerically higher, this score is more
than half standard deviation below the average performance of the class where
you belong, while 86 is above the mean and even 2 standard deviations above
it. Having been transformed in the same scale, the two graphs can be taken as
one standard distribution of score.
We may want to look deeper into the scores. Note that in Figure 9, the
shaded areas in the two graphs indicate the proportion of scores below yours.
This is the same as saying that this proportion represents the students in your
class who scored lower than you. But to be more specific on this proportion,
you need to look back at Figure 8 and the converted z-scores. Examining the
areas under the normal curve, we can say that about 98% in the class scored
below you in English, while only 1.2% scored below you in Physics. We assume
here that the scores are normally distributed. So, which subject makes you
feel better?
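The English case can be sketched in Python. The score of 86 and the class mean of 80 come from the discussion above; the standard deviation of 3 is an assumption chosen so that the score lies 2 standard deviations above the mean, as stated. The z-score is taken as the raw score minus the mean, divided by the standard deviation, and the proportion below is read from the standard normal curve.

```python
from statistics import NormalDist


def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd


# Score and mean from the text; the standard deviation of 3 is assumed.
english_z = z_score(raw=86, mean=80, sd=3)

# Proportion of a normally distributed class scoring below this z-score.
proportion_below = NormalDist().cdf(english_z)

print(english_z, round(100 * proportion_below, 1))
```

The result, roughly 97.7%, corresponds to the "about 98%" quoted for English and holds only under the stated assumption that the scores are normally distributed.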
T-Score. As you can see in the computation of the z-score, it can give us a
negative number, which simply means the score is below the mean. However,
communicating a negative z-score as below the mean may not be
understandable to others. We will not even say to students that they got a
negative z-score. Also, a z-score may be a repeating or non-repeating
decimal, which may not be comfortable for others. One option is to convert a
z-score into a T-score, which is a transformed standard score. To do this, there
is scaling in which a mean of 0 in a z-score is transformed into a mean of 50, and
the standard deviation in a z-score is multiplied by 10. The corresponding
equation is:
T-score = 50 + 10z
For z = -2:
T-score = 50 + 10 (-2)
= 50 - 20
= 30
For z = 2:
T-score = 50 + 10 (2)
= 50 + 20
= 70
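The conversion T-score = 50 + 10z is simple enough to check with a few lines of Python; the function name `t_score` is introduced here for illustration.

```python
def t_score(z):
    """Transform a z-score to the T scale (mean 50, standard deviation 10)."""
    return 50 + 10 * z


for z in (-2, 0, 2):
    print(z, t_score(z))
```

A z-score of -2 becomes 30, the mean (z = 0) becomes 50, and a z-score of 2 becomes 70, so negative values no longer appear in the reported scores.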
Examining further Figures 9 and 10, where z-scores range from -4 to
+4, more than 99% of the scores are between 20 and 80.
Example:
Scores in stanine scale have some limitations. Since they are in a 9-
point scale and expressed as a whole number, they are not precise. Different
z-scores or T-scores may have the same stanine score equivalent.
With the above percentage distribution of scores in each stanine, you can
directly convert a set of raw scores into stanine scores. Simply arrange the raw
scores from lowest to highest, and with the percentage of scores in each
stanine, we can directly assign the appropriate stanine score in each raw score.
On interpretation of stanine score, let us say Kate has a stanine score of 2. We
can see that her score falls within the 7 percent of scores just above the lowest
4 percent, that is, between the 4th and 11th percentiles. In the same way, if
John’s score is in the 6th stanine, it falls between the 60th and 77th percentiles,
simply because 60 percent of the scores are below the 6th stanine and 23 percent
of the scores are above the 6th stanine. For
qualitative description, stanine scores of 1, 2, and 3 are considered as below
average; 4, 5, and 6 are average, and 7, 8, and 9 are above average. Thus, we
can say that your score of 86 in English is above average. Similarly, Kate’s
score is below average while that of John is average. Figure 11 illustrates the
equivalence of the different commonly-used standard scores.
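The stanine assignment described above can be sketched in Python. This is only an illustration: it assumes the usual 4-7-12-17-20-17-12-7-4 percent distribution of scores across the nine stanines, and the function names and the handling of scores that fall exactly on a boundary are choices made for this sketch.

```python
from bisect import bisect_right

# Cumulative percentage of scores at the upper edge of stanines 1 to 8,
# assuming the usual 4-7-12-17-20-17-12-7-4 percent distribution.
CUM_PERCENT = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine_from_percentile(percentile_rank):
    """Return the stanine (1 to 9) for a percentile rank from 0 to 100."""
    # Assumption: a rank exactly on a boundary goes to the higher stanine.
    return bisect_right(CUM_PERCENT, percentile_rank) + 1

def describe(stanine):
    """Qualitative description of a stanine score, as used in the text."""
    if stanine <= 3:
        return "below average"
    if stanine <= 6:
        return "average"
    return "above average"

print(stanine_from_percentile(8))   # 2
print(stanine_from_percentile(65))  # 6
print(describe(2))                  # below average
print(describe(6))                  # average
```

A percentile rank of 8 lands in stanine 2, like Kate's score, and a rank of 65 lands in stanine 6, like John's score.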
Measures of Covariability
There are situations when we look at examinees’ performance
measures and ask ourselves what could explain such scores. Measures of
covariability tell us, to a certain extent, the relationship between two tests or
factors. Admittedly, a score may not be due to a single factor alone but to
several factors, directly or indirectly observable, which are also related to one
another. This section is limited to the case of two scores that are
hypothesized to be related to one another.
When we are interested in finding the degree of relationship between
two scores, we are dealing with the correlation between two variables. The
statistical measure is the correlation coefficient, an index number that ranges
from -1.0 to +1.0. The value -1.0 indicates a perfect negative correlation, 0.00
no correlation at all, and +1.0 a perfect positive correlation. Many
correlation studies have been conducted in the field of assessment and research;
in practice, the coefficients seldom reach exactly +1.0 or -1.0, and the closer a
value is to either extreme, the stronger the relationship.
The various types of relationships are illustrated in the following scatter-
plot diagrams:
Some examples of interpreting Correlation coefficients are as follows:
First, we will encode the two (2) sets of raw data representing X and Y
in either two columns or two rows in Excel as shown in the next illustration. In any
free cell, type =CORREL(A3:A12,B3:B12) as reflected in the worksheet. The
A3:A12 are the cell addresses of the first variable X, while B3:B12 are the cell
addresses of the second variable Y. Take note that the cell ranges of the 2
variables are separated by a comma.
In contrast, the manual computation is more involved, but it gives the
same result.
Both procedures gave a correlation coefficient of
0.705 between performance scores in Reading and Problem-Solving. This
coefficient indicates a moderate correlation between performance in reading
and problem-solving.
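For readers who prefer code to a spreadsheet, the manual computation can be sketched in Python using one common raw-score formula for the Pearson coefficient. Whether this is the exact layout of the manual worksheet in the text is an assumption, and the scores below are hypothetical, not the Reading and Problem-Solving data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation using the raw-score formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical scores for illustration only.
reading = [2, 4, 6, 8, 10]
problem_solving = [1, 3, 2, 5, 4]

r = pearson_r(reading, problem_solving)
print(round(r, 3))  # 0.8
```

With these five pairs, the numerator is 80 and the denominator is 100, so r is 0.8.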
There are some precautions to be observed with regard to the computed
r as a correlation coefficient. First, it should not be interpreted as a percent. Thus,
0.705 should not be interpreted as 70%. If we want to extend the meaning of
0.705, we need to compute the value of r², which is the coefficient of
determination. This coefficient gives the percentage of the variance in one
variable that is associated with the other variable. Remember how variance and
standard deviation were explained in the previous sections?
Now, with reference to the two variables indicated in Table 3 and the
computed r of 0.705, we get an r² of about 0.497, or roughly 50%. This
means that, of the total variance (taken as 1 or 100%), about half is shared by
the two variables. If about 50% of the variance observed in problem-solving
scores is associated with reading scores, then the other 50% or so is due to
other factors. This concept is depicted in Figure 12. Second, while the
correlation coefficient shows the relationship between two variables, it should
not be interpreted as causation. Considering our example, we could not say
that the scores in the reading test cause about half of the variance of the
problem-solving test scores. Relationship is different from causation.
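The arithmetic behind the coefficient of determination can be checked with a short Python sketch. Only the value r = 0.705 comes from the text; the variable names are illustrative.

```python
r = 0.705  # correlation coefficient reported in the text

r_squared = r ** 2                     # coefficient of determination
unexplained = (1 - r_squared) * 100    # percent of variance not shared

print(round(r_squared, 3))    # 0.497
print(round(unexplained, 1))  # 50.3
```

Squaring 0.705 gives a value just under one half, which is why r itself must not be read as a percentage.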
Figure 13. Covariation of Performance in Reading and Problem-
Solving
Figure 14. Correlation between Performance in Reading and
Problem Solving
Summary
Measures of central tendency include the mean, median and mode.
Mean is the average of given sets of scores. You should add the
numbers up then divide by the number of the cases.
Median is the number in the middle when you order the numbers in an
ascending order. If there are two numbers in the middle, you should
take the average of those two numbers.
Mode is the number which is repeated the most in the set.
Measures of dispersion include the range, interquartile range, semi-
interquartile range, quartile deviation, variance and standard deviation.
The range of a dataset is the difference between the largest and
smallest values in that dataset.
The interquartile range is the spread of the middle half of the data, that
is, the difference between the upper and lower quartiles. Dividing this by 2
gives us the quartile deviation.
Variance is the average squared difference of the values from the
mean. Unlike the previous measures of variability, the variance
includes all values in the calculation by comparing each value to the
mean.
The standard deviation is the standard or typical difference between
each data point and the mean.
Measures of location can be categorized as percentile, decile and
quartile.
A normal distribution is sometimes called the bell curve because of its
shape, and it occurs naturally in many situations. The bell curve is
symmetrical: half of the data falls to the left of the mean and
the other half falls to the right.
Standardized scores can be a z-score, T-score or stanine.
Measures of covariability tell us, to a certain extent, the relationship
between two tests or factors.
Assessment
A. Answer the following questions orally.
1. What are the measures of central tendency?
2. What are the measures of dispersion?
3. What are the measures of position?
4. What is covariability?
5. Differentiate the different ways of converting raw scores into standard
scores.
B. Given the following test scores in Math: 5, 20, 13, 15, 12, 19, 20, 10, 17,
11, and 16.
1. What is the mean? How did you find it?
2. What is the median? Why did you say it is the median?
3. What is the mode? How did you determine this measure?
C. Examine the frequency polygons below:
3. In which distribution are the scores most dispersed? What range of
scores is indicated by this distribution?
4. Which distribution has the smallest standard deviation? Why did you
say so? How do you describe this distribution?
D. Let me see further how you understand the earlier discussion by
considering the following class scenario:
Ms. Dioneza’s best students in Spelling obtained the following scores in a
competitive test:
89, 90, 91, 92, 93, 94, 95, 95, 96, 98
Dr. Protacio’s best students in Spelling got the following scores in the
same competitive test:
60, 90, 92, 93, 93, 94, 96, 96, 97, 98
1. Which measure of central tendency is most appropriate for the
students of Ms. Dioneza? Why did you say so? Then, what is the value
of that measure?
2. Which measure of central tendency is most appropriate for the
students of Dr. Protacio? Why? Then, what is the value of that
measure?
3. For each group, find the following:
a. Range
b. Q1, Q2 & Q3
c. Semi-interquartile range
d. Variance
e. Standard deviation
E. Determine the type of distribution depicted by the following measures:
1. Mean = 80.45 Mdn = 80.78 Mode = 80.25
2. Mean = 120 Mdn = 130 Mode = 150
3. Mean = 89.78 Mdn = 82.16 Mode = 82.10
F. The following is a frequency distribution of the scores of 10 persons:
X f
30 1
28 2
20 3
15 1
13 2
10 1
H. Refer to the figure below as the frequency polygons representing entrance
test scores of three groups of students in different programs.
1. What is the mean score of Education students?
2. What is the mean score of Engineering students?
3. What is the mean score of Business students?
4. Which group of students has the most dispersed scores in the test?
Why?
5. What distribution is symmetrical? What distribution is skewed? Why?
I. The following is a frequency distribution of year-end examination marks in
a certain secondary school.
Class Interval f
60 - 65 2
55 - 59 5
50 - 54 6
45 - 49 8
40 - 44 11
35 - 39 10
30 - 34 11
25 - 29 20
20 - 24 17
15 - 19 6
10 - 14 4
N = 100
3. Make a sketch of the graph of the frequency distribution. Describe the
graph of the distribution as to its skewness.
4. Find:
a. Third quartile or the 75th percentile (P75)
b. First quartile or the 25th percentile (P25)
c. Semi-interquartile range
J. A common exit examination is given to 400 students in a university. The
scores are normally distributed and the mean is 78 with a standard
deviation of 6. Paul had a score of 72 and Vic a score of 84. What are the
corresponding z-scores of Paul and Vic? How many students would be
expected to score between the scores of Paul and Vic? Explain your
answer.
K. Jayson obtained a score of 40 in his Mathematics test and 34 in his
Reading test. The class mean score in Mathematics is 45 with a standard
deviation of 4 while in Reading, the mean score is 50 with a standard
deviation of 7. On which test did Jayson do better compared to the rest of
the class? Explain your work.
L. Following are sets of scores on two variables: X for Reading
Comprehension and Y for Reasoning Skills, administered to a sample of
students:
X 11 9 15 7 5 9 8 4 8 11
Y 13 8 14 9 8 7 7 5 10 12
empirical data that are experiential and will depict the true picture in the
field of education.
Activity 1
Form a group of four classmates. Here is the task you need to
work on with your team members.
Secure a set of old test papers that have been scored by a teacher or a
student teacher. It is advised that the number of cases be at least 100. It is
understood that the 100 scores came from the same test. See to it that
you observe confidentiality of the documents you have requested. No
name should be identified in your written work; use codes instead to identify
the observations or cases.
Here are the tasks:
1. Prepare a data set for the test scores.
2. With the aid of your scientific calculator, or Microsoft Office Excel,
find the following:
a. Mean
b. Median
c. Mode
3. Describe the type of distribution of test scores.
4. Which measure of central tendency is most appropriate to describe
the distribution? Why do you say so? Explain.
5. Select three (3) students in the list. Describe the test performance
of each student relative to the performance of all the students in the
whole class.
Activity 2
Interview a teacher on how she decides to pass or fail a student. Report the
specific questions that you have asked and the corresponding responses you
have captured in this interview activity. Your analysis and presentation should
reflect the application of the measures of central tendency, measures of
variability, and of the measures of location.
Activity 3
With your team members, make a visit to any of the following offices of a
school:
1. Office of the Guidance Counselor
2. Testing Center
3. Office of Student Affairs
Request any of the following data that are available to you:
1. IQ test scores
2. Aptitude test scores
3. Admission test scores
4. Qualifying examination scores
5. Any psychological test scores
It is advised that the number of observations should be at least 50. The
larger the number of cases, the better.
Please emphasize that the data will be held in full confidentiality. They have
the option not to give the names of the examinees as an ethical consideration. Do
not apply coding technique to label the different observation.
From the test scores that you have gathered, do the following:
1. Make a frequency distribution of the test scores.
2. Find the following measures: mean, median, and mode.
3. Describe the type of distribution.
4. Determine the most appropriate measure of central tendency to
describe the average IQ/Aptitude/Performance of the examinees.
5. For those who gather IQ test scores, interview the personnel in-charge
on how they interpret the numerical value of IQ test scores (as above
average, average, etc. whatever applies).
6. Find the standard deviation.
7. Consider at least 5 student scores in your data set and try computing
the following:
a. z-scores
b. T-scores
c. Percentile equivalent
d. Stanine scores
8. Make a report to your class on the following:
a. The methods and procedures in the test data collection
b. Summary of the data set to your class using tables, figures, and
graphs
c. Summary of your findings on questions 1-7 above.
N. Test Yourself.
In each item, choose the letter which you think can best represent the answer
to the given problem situation. Give a statement to justify your choice.
1. Which distribution is negatively skewed in the figure shown?
A. Distribution X
B. Distribution Y
C. Distribution Z
D. Distribution W
2. What is the preferred measure of central tendency in a test distribution
where there are a small number of scores that are extremely high or low?
A. Median C. Variance
B. Mean D. Mode
3. The following scores were obtained from a short spelling test:
10, 8, 11, 13, 12, 13, 8, 16, 11, 8, 7, 9.
What is the modal score?
A. 8 C. 11
B. 10 D. 13
4. What does it mean when a student got a score at the 70th percentile on
a test?
A. The performance of the student is above average
B. The student answered 70 percent of the items correctly
C. The student got at least 70 percent of the correct answers
D. The student’s score is equal to or above the scores of 70 percent of the
other students in the class
5. Which best describes a normal distribution?
A. Positively skewed C. Symmetric
B. Negatively skewed D. Bimodal
6. What does a large standard deviation indicate?
A. Scores are not normally distributed
B. Scores are not widely spread, and the median is an unreliable
measure of central tendency.
C. Scores are widely distributed where the mean may not be a reliable
measure of central tendency.
D. Scores are not widely distributed, and the mean is recommended
as a more reliable measure of central tendency
7. In a normal distribution, approximately what percentage of scores is
expected to fall within three standard deviations of the mean?
A. 34% C. 95%
B. 68% D. 99%
8. Which of the following is interpreted as the percentage of scores in a
reference group that falls below a particular raw score?
A. Standard scores C. Reference group
B. Percentile rank D. T-score
9. For the data illustrated in the scatter plot
below, what is the reasonable product-
moment correlation coefficient?
A. 1.0
B. -1.0
C. 0.90
D. -0.85
10. A Pearson test statistic yields a
correlation coefficient (r) of 0.90. If X
represents scores on vocabulary and Y,
the reading comprehension test scores, which of the following best
explains r = 0.90?
A. The degree of association between X and Y is 81%.
B. The strength of relationship between vocabulary and reading
comprehension is 90%.
C. There is almost perfect positive relationship between vocabulary
test scores and reading comprehension.
D. 81% of the variance observed in Y can be attributed to the variance
observed in X.
O. Refer to these test data:
38 15 10 27 18 26 27 30 17 43
21 36 27 34 49 31 40 20 30 36
42 23 34 39 46 21 39 33 37 34
26 43 27 28 34 36 27 35 47 45
18 28 22 42 20 18 35 27 47 34
60 51 63 37 43 60 38 25 45 18
31 29 64 18 31 28 32 22 18 49
22 21 21 31 26 27 27 32 41 40
35 38 34 24 37 35 35 22 38 25
19 28 28 42 33 25 27 25 50 18
a. How do you find the mean if data are grouped? What is the most
appropriate class interval for this set of data? What other information
do you need to generate in order to find the mean?
b. How do we determine the median of the scores?
In what class interval did the median fall?
What is the cumulative frequency below the median class?
What is the frequency of the median class?
What is the lower limit of the median class?
What is the median equal to?
P. Take hold of a scientific calculator. Use the raw score formula in finding the
standard deviation. If you have Excel or other software, you can work directly
on the data.
1. What is the variance?
2. What is the standard deviation?
Q. Assuming that the scores are normally distributed, what is the range of
scores that would fall within:
1. ±1σ from the mean. Explain your answer.
2. ±2σ from the mean. How did you determine this range of scores?
3. ±3σ from the mean. Explain your answer.
4. Define the 25th percentile. How did you determine it?
5. Where is the 75th percentile? How did you locate it? How many fall on
this percentile?
6. How many got a percentile rank of 99? How did you determine it?
Explain.
R. With the help of your scientific calculator, Excel, or other software, load a data
set, either hypothetical or actual data, that you have accessed from
additional documents or any research studies. Analyse the data and enter
the values into the following table:
Mean
Median
Mode
Range
Variance
Standard Deviation
Skewness
Kurtosis
Select an appropriate type of graph (bar chart, histogram, pie graph, box
plot, or line graph) for this variable and have Excel or SPSS draw the graph.
Why did you select this particular type of graph?
Examine the values you have written in the table. Discuss the graph in
relation to the values in the table.
Educator’s Input
In many programs, topics on measures of central tendency, measures
of variability, and measure of correlation are taught generally by mathematics
teachers. This is because these topics are core content in Statistics and for
many, statistics is mathematics. Many personal encounters with students made
me think that instructional discourse on the subject of Measurement and
Assessment of Learning that involves statistics tends to be restricted to
performing mathematical procedures, such as calculating the mean, median,
mode, standard deviation, etc. My graduate students who are already
practitioners in their own field have shared their year-end reports of student
performance, including computations of the mean, median, mode, and standard
deviation. I am happy to hear that they did not encounter difficulties in
mechanical computations with scientific calculators and Microsoft Excel. When
I asked why they have to compute all the three measures of central tendency
for their student performance report, they could not give me a convincing
reason. I further asked what they did with the standard deviation and how they
utilized all those numerical values. I did not get a logical justification. While they
were confident in providing the conceptual meaning of the mean, median, mode, and
variability measures, their grasp of the “big ideas” that underlie these statistical
concepts and their functions in the improvement of learning appeared very
much wanting. Although I may have been guilty of being too procedural in
my own teaching of Statistics during my early years of teaching, I later
discovered a teaching approach that deepens my students’ understanding of
measurement theories, concepts, and principles. The strategy of “posing a
relevant scenario”, or simply “scenario-posing”, for students to examine and
explore can be effective. From the scenario, I can generate prompts to invite
my students to participate in the class discussion. Most likely, students become
interested because in presenting a scenario, I create a story. I want to believe
that a story-form presentation works across age levels, not only for young
children. In the early years, the scenarios I
presented were mostly hypothetical; my objective was primarily to dramatize
the main idea embodied in the statistical measure. In later years, these
scenarios became more real and effortless based on my actual experiences as
a teacher educator and a researcher.
In presenting the scenario, I always have in mind the “seed concepts”,
which are coherent with my lesson objectives. I have been practicing this
approach for quite some time, and I think teachers using this strategy will be
able to deepen student knowledge by eliciting from them the nuances of the
information embodied in the story problem. These examples may be worth
sharing:
1. The town I live in had a population of 5,000 native people in 2017, and the
mean income per person was Php 30,000.00. Now, suppose Mr. Manuel, a
millionaire from a distant region, moves to my town in 2018. Let us
say that the income of Mr. Manuel is Php 120,000,000.00. So, with
5,001 people now living in my town, what is the mean income of my
town? Php 53,989.00? Does this information indicate that the 5,000
natives in my town suddenly made Php 23,989 more in income in
2018?
(The message conveyed in this situation is that the presence of an
outlier or extreme value in the dataset makes the mean a
less appropriate measure of average.)
2. Mary entered the classroom and saw one seatmate, Jane, on the verge of crying.
She asked her why she was about to cry. Jane answered, “I got a grade of 75
in Physics.”
Mary saw another seatmate, Daniel, who was smiling and happy. She asked
him what made him happy. Daniel answered, “I got a grade of 75 in
Biology.”
Mary turned back to Jane and told her, “You do not have a reason to be
sad. Look at Daniel. He got the same grade of 75 in Biology, and he is
happy. So, be happy like Daniel.”
Is Mary correct in her statement to Jane?
(This scenario can deepen the understanding of the mean and standard
deviation and their functions, and eventually can serve as a takeoff point
in introducing z-scores as one standard score and its use in comparing test
performance.)
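The arithmetic in the first scenario can be verified with a short Python sketch. As a simplifying assumption made only for this illustration, every native resident is given exactly the mean income of Php 30,000; the scenario itself states only the mean.

```python
from statistics import mean, median

# Assumption for illustration: each of the 5,000 natives earns the mean income.
natives = [30_000] * 5_000
incomes_2018 = natives + [120_000_000]  # Mr. Manuel moves in

print(round(mean(incomes_2018)))  # 53989
print(median(incomes_2018))       # 30000
```

The mean jumps to about Php 53,989 because of a single extreme value, while the median stays at Php 30,000 under the stated assumption. This is the sense in which an outlier makes the mean a less appropriate measure of average.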
Lesson 3: Grading and Reporting of Test Results
Pre-discussion
Grading and reporting learners’ test performance is a complex task. It
requires specific knowledge, skills, and experience. To perform successfully the
assigning of grades and reporting of level of performance or achievement, pre-
service teachers should be able to understand its purpose, identify different
methods of scoring and grading test performance, differentiate the various
types of test scores, and interpret test results based on norms and pre-set
standards.
What to Expect?
At the end of the lesson, the students can:
1. define grading;
2. discuss the different methods of scoring tests and performance tasks, the
different types of test scores, the guidelines on grading tests and
performance tasks, and how to communicate test scores;
3. prepare scoring rubrics for performance tasks;
4. discuss the assessment system in the Department of Education as
contained in DO No. 8, s. 2015; and
5. compute the grades of learners based on DepEd guidelines.
To learn how to assign grades and report learners’ test performance in
a meaningful and effective manner, it is important that you review your prior
knowledge and experience, as well as the standards or policies used by your
institution in grading and reporting learners’ performance in the test and the
course as a whole. You may also need to read books and other references on
the topics to validate your prior knowledge and to further enhance your
knowledge and skills.
Definition of Grading
In his paper, Magno (2010) discussed that an effective and efficient way of
recording and reporting evaluation results is very important and useful to the
persons concerned in the school setting. Hence, it is very important that
students’ progress is recorded and reported to them, their parents and
teachers, school administrators, and counselors as well, because this
information is used to guide and motivate students to learn and to establish
cooperation and collaboration between the home and the school. It is also used
in certifying the students’ qualifications for higher educational levels and for
employment. In the educational setting, grades are used to record and report
students’ progress.
Grades are essential in education because through them students’ learning can be
assessed, quantified, and communicated. Every teacher needs to assign
grades which are based on assessment tools such as tests, quizzes, projects,
and so on. Through these grades, achievement of learning goals can be
communicated to the students and parents, teachers, administrators, and
counselors. However, it should be remembered that grades are just a part of
communicating student achievement; therefore, they must be used with additional
feedback methods.
Grading implies (a) combining several assessments, (b) translating the
result into some type of scale that has evaluative meaning, and (c) reporting
the result in a formal way. From this definition, we can clearly say that grading
is more than quantitative values as many may see it; rather, it is a process.
Grades are frequently mistaken for scores. However, it must be clarified
that scores make up the grades. Grades are the ones written in the report cards
of students, which compile students’ progress and
achievement throughout a quarter, a trimester, a semester, or a school
year.
Grades are symbols used to convey the overall performance or
achievement of a student and they are frequently used for summative
assessments of students. Take for instance two long exams, five quizzes, and
ten homework assignments as requirements for a quarter in a particular subject
area. To arrive at grades, a teacher must be able to combine scores from the
different sets of requirements and compute or translate them according to the
assigned weights or percentages. Then, he or she should also be able to design
effective ways to communicate them to students, parents,
administrators, and others who are concerned. Another term, less commonly
used, that refers to the process is marking. Figure 1 shows a graphical
interpretation summarizing the grading process.
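The step of combining scores according to assigned weights can be sketched in Python. The weights, maximum scores, and raw scores below are hypothetical and chosen only to mirror the example of two long exams, five quizzes, and ten homework assignments; they are not prescribed by any grading policy discussed in this lesson.

```python
# Hypothetical weights for illustration; they must add up to 1.0.
WEIGHTS = {"long_exams": 0.50, "quizzes": 0.30, "homework": 0.20}

def percent(scores, max_score):
    """Total of raw scores as a percentage of the total possible points."""
    return sum(scores) / (len(scores) * max_score) * 100

def quarter_grade(component_percent, weights=WEIGHTS):
    """Weighted combination of the component percentages."""
    return sum(component_percent[name] * w for name, w in weights.items())

# Hypothetical raw scores for one student.
components = {
    "long_exams": percent([40, 45], 50),       # two long exams, 50 points each
    "quizzes": percent([8, 9, 7, 10, 6], 10),  # five quizzes, 10 points each
    "homework": percent([10] * 10, 10),        # ten homework assignments
}

grade = quarter_grade(components)
print(round(grade, 1))  # 86.5
```

Here the component percentages are 85, 80, and 100, so the weighted result is 42.5 + 24 + 20 = 86.5. Translating this number into a scale with evaluative meaning and reporting it formally are separate steps of the grading process.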
There are various reasons why we assign grades and report learners’
test performance. Grades are alphabetical or numerical symbols/marks that
indicate the degree to which learners are able to achieve the intended learning
outcomes. Grades do not exist in a vacuum but are part of the instructional
process and serve as a feedback loop between the teacher and learners. They
are one of the ways to communicate the level of learning of the learners in
specific course content. They give feedback on what specific
topic/s learners have mastered and what they need to focus on more when they
review for summative assessment or final exams. In a way, grades serve as a
motivator for learners to study and do better in the next tests to maintain or
improve their final grade.
Grades also provide the parents, who have the greatest stake in
learners’ education, precise information about their children’s achievements.
They give teachers the bases for improving their teaching and learning
practices and for identifying learners who need further educational intervention.
They are also useful to school administrators who want to evaluate the
effectiveness of the instructional programs in developing the needed skills and
competencies of the learners.
Magno (2020) emphasized that the purposes of grading can be
categorized into four major parts in the educational setting.
Feedback. Feedback plays an important role in the field of education such that
it provides information about the students’ progress or lack of it. Feedback can be
addressed to three distinct groups concerned in the teaching and learning
process: parents, students, and teachers.
Feedback to Students. Grades are one way of providing feedback to
students, since it is through grades that students can recognize their
strengths and weaknesses. Upon knowing these strengths and
weaknesses, students can further develop their competencies
and improve on their deficiencies. Grades also help students keep track of
their progress and identify changes in their performance. The value of this
feedback increases with the age and year level of the students: grades are
given more importance and meaning by a high school student than by a
Grade 1 student. Motivation, however, also plays a role in grades. Grade 1
students are motivated to get high grades because of external rewards, while
high school students are also motivated internally to improve their
competencies and performance.
Feedback to Teachers. Grades serve as relevant information to teachers. It is
through grades of students that they can somehow (a) assess whether
students were able to acquire the competencies they are supposed to have
after instruction; (b) assess whether their instruction plan and
implementation was effective for the students; (c) reflect about their
teaching strategies and methods; (d) reflect about possible positive and
negative factors that might have affected the grades of students before,
during and after instruction; and (e) evaluate whether the program was
indeed effective or not. Given these beneficial purposes of grades to
teachers, we can say that teaching and learning is a two-way
interrelated process: it is not only the students who learn from the
teacher, but the teacher also learns from the students. Through the grades of
students, a teacher can undergo self-assessment and self-reflection
in order to improve, recognize the relative
effectiveness of varying instructional approaches across different student
groups, and be flexible and effective across different
situations.
Administrative Purposes
At their end, school administrators can use the grades of students for a
more general purpose than teachers do: they can utilize
grades to evaluate programs, to identify and assess areas that need to be
improved, and to determine whether or not the curriculum goals and objectives of
the school have been attained by the students through their institution.
Promotion and Retention. Grades can serve as one factor in determining whether a
student will be promoted to the next level or not. Through the grades of a
student, it can be judged whether he or she has acquired the skills and
competencies required for a certain level and has achieved the curriculum goals
and objectives of the school and/or the state. In some schools, the grade
of a student is a factor taken into consideration for his/her eligibility to join
extracurricular activities (performing and theater arts, varsity, cheering
squads, etc.). Grades are also used to qualify a student to enter high
school or college in some cases. Other policies may arise depending on
the schools’ internal regulations. At times, failing marks may prohibit a
student from being part of the varsity team, running for office, joining
school organizations, and enjoying some privileges that students with passing
grades get. In some colleges and universities, students who get passing grades
are given priority in enrolling for the succeeding term over
students who get failing grades.
Placement of Students and Awards. Grades can be used for placement. They
are factors to be considered in placing students according to their
competencies and deficiencies, so that teaching can be more focused on
developing the strengths and improving the weaknesses of students. For
example, students who consistently get high, average, or failing grades can
be grouped into separate sections, where teachers can focus on and
emphasize the students’ needs and demands to ensure a more productive
teaching-learning process. Another example, which is more domain-specific,
is grouping together students who have the same competency in a certain
subject. Through this strategy, students who have high ability in Science
can further improve their knowledge and skills by receiving more complex
and advanced topics and activities at a faster pace, and students who have
low ability in Science can receive simpler and more specific topics at a
slower pace. Aside from placement of students, grades are frequently used
as a basis for academic awards. Many or almost all schools, universities,
and colleges have honor rolls and dean’s lists to recognize student
achievement and performance. Grades also determine graduation awards
for the overall achievement or excellence a student has garnered
throughout his or her education, in a single subject or for the whole
program he or she has taken.
Program Evaluation and Improvement. Through the grades of students
taking a certain program, program effectiveness can be evaluated to some
extent. Grades of students can be a factor used in determining whether
the program was effective or not. Through the evaluation process, some
factors that might have affected the program’s effectiveness can be
identified and minimized to improve the program further for future
implementations.
Admission and Selection. Organizations outside the school also use
grades as a reference for admission. When students transfer from one school
to another, their grades play a crucial role in their admission. Most colleges
and universities also use the grades of students in their senior year in high
school together with the grade they obtain in the entrance exam.
However, grades from academic records and high-stakes tests are not the
sole basis for admission; some colleges and universities also require
recommendations from the school, teachers, and/or counselors about
students’ behavior and conduct. The use of grades is not limited to the
educational context. Grades are also used in employment, for job selection
purposes, and at times even by insurance companies that use grades as a
basis for giving discounts on insurance rates.
Number Right Scoring (NR). It entails assigning positive values only to correct
answers while giving a score of zero to incorrect answers. The test score is
the sum of the scores for correct responses. One major concern with this
scoring method is that learners may get the correct answer by guessing,
which affects test reliability and validity.
Example: Solve for X: 3(X + 8) – (X – 12) = -28.
a. X = 32 c. X = -32
b. X = 8 d. X = -8
For the above item, the correct answer is c (X = -32), and this response
will merit a score. Responses other than c will be given zero (0) points.
Item # Score
1 1
2 0
3 -0.25
4 1
5 1
6 0
7 -0.25
8 1
9 1
10 1
Total 6 + (-0.50) = 5.5
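The two conventional scoring methods can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure. It assumes that a score of 0 in the table above stands for an omitted item and that -0.25 stands for a wrong answer; the table itself does not label them. The names number_right, negative_marking, and marks are hypothetical.

```python
def number_right(marks):
    """Number right (NR): one point per correct answer, zero otherwise."""
    return sum(1 for m in marks if m == "correct")


def negative_marking(marks, penalty=0.25):
    """Negative marking (NM): +1 if correct, -penalty if wrong, 0 if omitted."""
    points = {"correct": 1.0, "wrong": -penalty, "omitted": 0.0}
    return sum(points[m] for m in marks)


# Ten items following the pattern of scores in the table above
# (assumption: 0 = omitted item, -0.25 = wrong answer)
marks = ["correct", "omitted", "wrong", "correct", "correct",
         "omitted", "wrong", "correct", "correct", "correct"]

print(number_right(marks))      # 6
print(negative_marking(marks))  # 5.5
```

Under number right scoring the learner earns 6 points, while the two penalties of 0.25 bring the negative marking total down to 5.5, matching the total in the table.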
Both the NR and NM methods of scoring multiple-choice tests are prone to
guessing, which affects test validity and reliability. As a result,
nonconventional scoring methods were introduced, which include: (1) Partial
Credit Scoring Method, (2) Multiple Answer Scoring Method, (3) Retrospective
Correcting for Guessing, and (4) Standard-Setting Scoring Method.
Example:
Linda obtained a score of 55% in her Reading Test. What does her score
mean? Justify your answer.
a. Linda got 55% of the test items correct.
b. Linda was able to answer correctly more than half of the items.
c. Linda obtained a raw score lower than those obtained by 55% of her
peers.
d. If the test has 60 items, Linda would probably have 33 correct answers.
For this item, each response option has an assigned score with its
corresponding rationale. An example of how the item can be scored is shown
below:
Option   Points   Rationale
A        3        Since the score was presented in percent, this is the correct
                  interpretation.
B        1        While the interpretation may be correct, it does not give a
                  more specific meaning to the score. Besides, the same
                  interpretation can also be applicable to scores higher than
                  51%.
C        0        This interpretation is wrong, as it is applicable
                  to a score at the 55th percentile rank.
D        2        This interpretation gives an example of how the score was
                  computed.
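As a rough sketch, the option-by-option scoring above amounts to a simple lookup. The point values are copied from the table. Giving 0 points to a blank or unrecognized response is an assumption made only for illustration, and the names OPTION_POINTS and partial_credit are hypothetical.

```python
# Points per option, copied from the table above
OPTION_POINTS = {"A": 3, "B": 1, "C": 0, "D": 2}


def partial_credit(choice):
    """Return the points earned for the chosen option.

    A blank or unrecognized choice earns 0 (an assumption for this sketch).
    """
    if not choice:
        return 0
    return OPTION_POINTS.get(choice.strip().upper(), 0)


for choice in ["a", "B", "c", "D"]:
    print(choice.upper(), partial_credit(choice))
```

Unlike number right scoring, a learner who picks option D still earns 2 of the 3 possible points, because that option reflects partial understanding of the percent score.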
will get 4 points. Learners who choose the wrong option will be penalized using
the negative marking method.
Retrospective Correcting for Guessing. It considers omitted or no-answer
items as incorrect, forcing learners to give an answer for every item even if
they do not know the correct answer. The correction for guessing is
implemented later, or retroactively. This can be done by comparing
learners’ answers in multiple-choice items with their answers in other test
formats, such as a short-answer test.
Standard-setting. It entails using standards when scoring multiple-choice
items, particularly standards set through norm-referenced or criterion-
referenced assessment. Standards based on norm-referenced assessment
are derived from the test performance of a certain group of learners, while
standards from criterion-referenced assessment are based on preset
standards specified from the very start by the teacher or the school in general.
For example, for a final examination in algebra, the Mathematics
Department can set the passing score (e.g., the 75th percentile rank) based on
the norms derived from the scores of learners for the past three years. To do
this, the department will need to collect the previous scores of learners on the
same or equivalent final exams and apply the formula for standard scores to
compute the percentile ranks for each range of scores. On the other hand, in
criterion-referenced standard setting, passing grades/scores are usually set by
the department or the school based on their own standards (e.g., A (90-100
percent), B (80-89 percent), C (70-79 percent), or F (0-69 percent)).
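The contrast between the two kinds of standards can be illustrated with a short sketch. The letter bands are the ones given in the example above. For the norm-referenced cut-off, the sketch counts past scores directly instead of applying the standard-score formula mentioned in the text; this simplification is an assumption of the sketch. The list past_scores is made-up data, and all function names are hypothetical.

```python
def letter_grade(percent):
    """Criterion-referenced: map a percent score to the preset bands."""
    if percent >= 90:
        return "A"
    if percent >= 80:
        return "B"
    if percent >= 70:
        return "C"
    return "F"


def percentile_rank(score, past_scores):
    """Percent of past scores that fall at or below the given score."""
    at_or_below = sum(1 for s in past_scores if s <= score)
    return 100 * at_or_below / len(past_scores)


def norm_cutoff(past_scores, target_rank=75):
    """Norm-referenced: lowest past score whose percentile rank reaches the target."""
    return min(s for s in past_scores
               if percentile_rank(s, past_scores) >= target_rank)


# Hypothetical final-exam scores pooled from previous years
past_scores = [52, 60, 61, 67, 70, 74, 78, 81, 85, 90, 93, 96]

print(letter_grade(83))          # B
print(norm_cutoff(past_scores))  # 85
```

With these made-up scores, a learner needs at least 85 to reach the 75th percentile rank, whereas the criterion-referenced bands stay the same no matter how previous groups performed.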
Marking or scoring constructed-response tests, such as essay and
performance tests, requires standardized scoring schemes so that scores are
reliable and have the same valid meaning for all learners. There are four types
of rating scales for the assessment of writing, which can also be applied to other
authentic or performance-type assessments. These four types of scoring are (1)
Holistic, (2) Analytic, (3) Primary Trait, and (4) Multiple Trait scoring.
Holistic Scoring. It involves giving a single, overall assessment score for an
essay, writing composition, or other performance-type assessment as a
whole. Although the scoring rubric for holistic scoring lays out specific
criteria for evaluating a task, raters do not assign a score for each criterion.
Instead, as they read a writing task or observe a performance task, they
balance strengths and weaknesses among the various criteria to arrive at
an overall assessment. Holistic scoring is considered efficient in terms of
time and cost. It also does not penalize poor performance based on only
one aspect (e.g., content, delivery, organization, vocabulary, or coherence
for an oral presentation). However, holistic scoring is said to provide
insufficient diagnostic information about the students’ ability, as it
does not identify the areas for improvement, and to be difficult to interpret,
as it does not detail the basis for evaluation.
The following is an example of a rubric for an oral presentation:
Rating/Grade        Characteristics
A (Exemplary)       Is very organized. Has a clear opening statement that catches
                    the audience’s interest. Content of report is comprehensive and
                    demonstrates substance and depth. Delivery is very clear and
                    understandable. Uses slides/multimedia equipment effortlessly
                    to enhance presentation.
B (Satisfactory)    Is mostly organized. Has opening statement relevant to topic.
                    Has appropriate pace and is without distracting mannerisms.
                    Looks at slides to keep on track.
C (Emerging)        Has opening statement relevant to topic but does not give
                    outline of speech; is somewhat disorganized. Lacks content and
                    depth in the discussion of the topic. Delivery is fast and not
                    clear; some items not covered well. Relies heavily on slides and
                    notes and makes little eye contact.
D (Unacceptable)    Has no opening statement regarding the focus of the
                    presentation. Does not give adequate coverage of topic. Is often
                    hard to understand, with voice that is too soft or too loud and
                    pace that is too quick or too slow. Just reads slides; slides have
                    too much text.
scores in one scale may influence the ratings of others. It is also difficult to
create.
Below is an example of an analytic scoring for a research paper. In this
scoring, learners’ work is to be assessed based on different parts of the
research paper, namely, introduction, method, results, discussion, conclusion,
and recommendations, as well as on some criteria (e.g., spelling, grammar,
documentation, and format).
Rubric for a Final Research Paper
Rating scale (applied to each criterion):
Expert (4): At least indicators a to c are satisfied.
Proficient (3): Any two of the given indicators are satisfied.
Apprentice (2): Any one of the given indicators is satisfied.
Novice (1): None of the given indicators is satisfied.
Criteria/Indicators
1. Introduction
a. Clearly identifies and discusses research focus/purpose.
b. Research focus is clearly grounded in previous research/theoretically
relevant literature.
c. Significance of study is clearly identified (and how it adds to previous
research).
d. Others, please specify.
2. Method
Provides accurate and thorough information on the following:
a. Research method, design, and context
b. Data source, collection procedure, and tools
c. Data analysis
d. Others, please specify.
3. Results
a. Results are clearly explained at a comprehensive level and are
well-organized.
b. Tables/figures clearly and concisely convey the data.
c. Statistical analyses are appropriate tests and are accurately interpreted.
d. Others, please specify.
4. Conclusions, Discussion, and Recommendations
a. Interpretations/analysis of results are thoughtful and insightful; are
clearly informed by the study’s results; and thoroughly address how they
supported, refuted, and/or informed the hypotheses/proposition.
b. Discussions on how the study relates to and/or enhances the present
scholarship in this area are adequate.
c. Suggestions for further research in this area are insightful and thoughtful.
d. Others, please specify.
5. Documentation and Quality of Sources
a. Cites all data obtained from other sources.
b. APA style is accurately used in both text and references.
c. Sources are all scholarly and clearly relate to research.
d. Others, please specify.
6. Spelling and Grammar
a. No error in spelling.
b. No error in grammar.
c. No error in the use of punctuation marks.
d. Others, please specify.
7. Manuscript Format
a. Title page has proper APA formatting.
b. Used correct headings and subheadings consistently.
c. Proper margins were observed.
d. Others, please specify.
Final Grade
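One possible way to apply the rubric is sketched below. It counts how many of indicators a to c are satisfied for each criterion and converts the count to the 4-to-1 scale. The rubric does not say how the final grade is obtained, so adding up the seven criterion scores is an assumption of this sketch, as are the sample ratings and the names used.

```python
# Number of indicators a-c satisfied -> points on the rubric scale
POINTS = {3: 4, 2: 3, 1: 2, 0: 1}
LABELS = {4: "Expert", 3: "Proficient", 2: "Apprentice", 1: "Novice"}


def rate_criterion(satisfied):
    """Rate one criterion from the set of indicator letters that were met."""
    count = len(set(satisfied) & {"a", "b", "c"})
    return POINTS[count]


# Hypothetical ratings for one research paper
paper = {
    "Introduction": {"a", "b", "c"},
    "Method": {"a", "b"},
    "Results": {"a", "b", "c"},
    "Conclusions, Discussion, and Recommendations": {"b"},
    "Documentation and Quality of Sources": {"a", "b", "c"},
    "Spelling and Grammar": {"a", "c"},
    "Manuscript Format": set(),
}

scores = {criterion: rate_criterion(met) for criterion, met in paper.items()}
for criterion, points in scores.items():
    print(f"{criterion}: {points} ({LABELS[points]})")

# Assumption: the final grade is the sum of the criterion scores
total = sum(scores.values())
print(f"Total: {total} out of {4 * len(scores)}")
```

Because each criterion keeps its own score, the learner can see at a glance that, in this made-up case, the manuscript format and the discussion need the most work.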
Primary Trait Scoring. It focuses on only one aspect or criterion of a task, and
a learner’s performance is evaluated on only that trait. This scoring system
defines a primary trait in the test that will then be scored. For example, if a
teacher in a political science class asks his students to write an essay on
the advantages and disadvantages of Martial Law (i.e., the writing task),
the basic question addressed in scoring is, "Did the writer successfully
accomplish the purpose of this task?" With this focus, the
teacher would ignore errors in conventions of written language and instead
focus on overall rhetorical effectiveness. One disadvantage of this scoring
scheme is that it is often difficult to focus exclusively on one trait, such that
other traits may be included when scoring. Thus, it is important that a very
detailed scoring guide is used for each specific task.
Multiple-Trait Scoring. It requires that an essay test or performance task be
scored on more than one aspect, with scoring criteria in place so that they are
consistent with the prompt. Multiple-trait scoring is task-specific, and the
features to be scored vary from task to task, thus requiring separate scores for
different criteria. Multiple-trait scoring is similar to analytic scoring because
of its focus on several categories or criteria. However, while analytic scoring
evaluates more traditional and generic dimensions of language production,
multiple-trait scoring focuses on specific features or performance required
to fulfill the given task or tasks. For example, scoring criteria for writing
performance may include the abilities to present an argument clearly, to
organize one’s ideas, and to use language accurately in terms of grammar,
punctuation, and spelling.
whole. For example, a raw score of 95 would look impressive, but only if
there are 100 items in the test. However, if the test contains 500 items, then
the raw score of 95 is not good at all. A test that only gives raw score but
not the total number of items does not measure and communicate the
learner’s performance or achievement. Raw scores may be useful if
everyone knows the test and what it covers, how many possible right
answers there are, and how learners typically do in the test.
2. Percentage Score. This refers to the percent of items answered correctly
in a test. The number of items answered correctly is typically converted to
percent based on the total possible score. The percentage score is
interpreted as the percent of content, skills, or knowledge that the learner
has a solid grasp of. Just like the raw score, the percentage score has a
limitation: there is no way of comparing the percentage correct obtained in
one test with the percentage correct in another test with a different
difficulty level.
Percentage score is most appropriate to use in teacher-made or
criterion-referenced tests. It suits a teacher-made test that is administered
commonly to a class or to students taking the same course with the same
contents or syllabus. In this way, the students’ test performance can be
compared with that of their classmates or of their peers in another section.
In the same manner, percentage score is suitable to use in subjects wherein
a standard has been set. For example, if an algebra subject sets a passing
score of 60% in a test (e.g., because it is considered average), the teachers
and learners would know from a learner’s percentage score whether he or
she has met the desired level of competencies.
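The conversion from raw score to percentage score can be sketched as follows, reusing the raw score of 95 from the earlier discussion of raw scores and the 60% passing score from the algebra example. The function names and the 60-item test used in the last two lines are hypothetical.

```python
def percentage_score(raw, total_items):
    """Convert a raw score to the percent of items answered correctly."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return 100 * raw / total_items


def meets_standard(raw, total_items, passing=60):
    """Check a percentage score against a preset passing score (in percent)."""
    return percentage_score(raw, total_items) >= passing


# The same raw score means very different things on tests of different lengths
print(percentage_score(95, 100))  # 95.0
print(percentage_score(95, 500))  # 19.0

# Hypothetical 60-item algebra test with a 60% passing score
print(meets_standard(36, 60))  # True
print(meets_standard(33, 60))  # False
```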
Aside from the above test scores, the decision on what type of test
scores to use is based on whether the learners’ test performance is to be
compared with a standard or criterion or with the scores of other learners or
peers. This decision will entail the choice between the two major types of
grading system: 1) criterion-referenced; and 2) norm-referenced grading
system.
3. Criterion-Referenced Grading System. This is a grading system wherein
learners’ test scores or achievement levels are based on their
performance in specified learning goals and outcomes and performance
standards. Criterion-referenced grades provide a measure of how well the
learners have achieved the preset standards, regardless of how everyone
else does. It is therefore important that the desired outcomes and the
standards that determine proficiency and success are clear to the learners
at the very start. These should be indicated in the course syllabus. Criterion-
referenced grading is premised on the assumption that learners’
performance is independent of the performance of the other learners in their
group/class.
The following are some of the types of criterion-referenced scores or
grades:
a. Pass or Fail Grade. This type of score is most appropriate if the test or
assessment is used primarily or entirely to make a pass or fail decision. In
this type of scoring, a standard or cut-off score is preset, and a learner is
given a score of Pass if he or she surpasses the expected level of
performance or the cut-off score. Pass or Fail scoring is most appropriate
for comprehensive or licensure exams because there is no limit to the
number of examinees who can pass or fail. Each individual examinee’s
performance is compared to an absolute standard and not to the
performance of others.
Pass or fail grading has the following advantages: (1) it takes
pressure off the learners in getting a high letter or numerical grade,
allowing them to relax while still getting the needed education; (2) it gives
learners a clear-cut idea of their strengths and weaknesses; and
(3) it allows learners to focus on true understanding or learning of the
course content rather than on specific details that will help them receive
a high letter or numerical score. However, this type of grading also
eliminates competitiveness because learners no longer find the urgency
or the need to work hard to get a higher grade, does not provide an accurate
representation of the performance level and knowledge of the learners, and
cannot be converted to an exact score.
b. Letter Grade. This is one of the most commonly used grading systems.
Letter grades usually form a five-level grading
scale labeled from A to E or F, with A representing the highest level of
achievement or performance, and E or F, the lowest grade,
representing a failing grade. These are often used for all forms of
learners’ work, such as quizzes, essays, projects, and assignments.
While letter grades look simple and easy to understand, the true
meaning of letters is not always clear to learners, parents, or other
stakeholders. The teachers’ rating and the stakeholders’ interpretation
of the grade are often different. As such, it is important that descriptors
for each grade are included in the reporting sheet to ensure accurate
interpretation of the letter grades. An example of the descriptors for letter
grades is presented below.
Letter Grades Interpretation
A Excellent
B Good
C Satisfactory
D Poor
E Unacceptable
performance by dividing each grade category into three levels, such that
a grade of A can be assigned as A+, A, and A-; B as B+, B, and B-; and
so on. Plus (+) and minus (-) grades provide a finer discrimination
between achievement or performance levels. They also increase the
accuracy of grades as a reflection of learners’ performance; enhance
student motivation (i.e., to get a high A rather than an A-); and
discriminate among performances in a very similar pool of learners, such
as those in advanced courses or star sections. However, the +/- grading
system is viewed as unfair, particularly for learners in the highest
category; creates stress for learners; and is more difficult for teachers,
as they need to deal with more grade categories when grading learners.
Examples of the descriptors for plus (+) and minus (-) letter grades are
presented in the next matrix:
(+)/(-) Letter Grades Interpretation
A+ Excellent
A Superior
A- Very Good
B+ Good
B Very Satisfactory
B- High Average
C+ Average
C Fair
C- Pass
D Conditional
E/F Failed
Categorical grading methods have the same drawbacks as letter
grades. Like letter grades, the categorical grades provide cut-offs
between levels that are often arbitrary, lack the richness of more detailed
reporting methods, and fail to provide feedback or information that can
be used to diagnose learners’ weaknesses and refer for remediation.
4. Norm-Referenced Grading System. In this method of grading,
learners’ test scores are compared with those of their peers. Norm-
referenced grading involves rank ordering learners and expressing a
learner’s score in relation to the achievement of the rest of the group
(i.e., class or grade level, school, etc.). The peer group usually serves as
the normative group (e.g., class, age group, year level). Unlike
criterion-referenced scoring, norm-referenced scoring does not tell what
the learners actually achieved; it only indicates the learners’
achievement in relation to their peers’ performance. Norm-referenced
grading allows teachers to: 1) compare learners’ test performance with
that of other learners; 2) compare learners’ performance in one test
(subtest) with another test (subtest); and 3) compare learners’
performance in one form of the test with another form of the test
administered at an earlier date.
There are different types of norm-referenced scores:
a. Developmental Score. This is a score that has been transformed from
raw scores and reflects the average performance at age and grade levels.
There are two kinds of developmental scores: 1) grade-equivalent; and
2) age-equivalent scores.
i. Grade-Equivalent Score is described as both a growth score
and a status score. The grade equivalent of a given raw score on
any test indicates the grade level at which the typical learner
earns this raw score. It describes the test performance of a learner in
terms of a grade level and the months since the beginning of the
school year. A decimal point is used between the grade and
month in grade equivalents. For example, a score of 7.5 means
that the learner did as well as a typical Grade 7 learner taking the test at
the end of the fifth month of the school year.
ii. Age-Equivalent Score indicates the age level at which it is typical for a
learner to obtain such a raw score. It reflects a learner’s
performance in terms of chronological age as compared to
those in the norm group. Age-equivalent scores are written with a
hyphen between years and months. For example, a learner’s
score of 11-5 means that his or her age equivalent is 11 years and 5
months, indicating a test performance that is similar to that of
learners aged 11 years and 5 months in the norm group.
b. Percentile Rank. This indicates the percentage of scores that fall at or
below a given score. Percentile ranks range from 1 to 99. For example,
if a learner obtained a percentile rank of 75 in a standardized
achievement test, it means that the learner scored as high as or higher
than 75% of the learners or peers in the norm group. Percentile ranks
are not equal-interval data: the score differences between percentile ranks
at the extreme or end ranges are larger than those in the middle range. For
example, the differences between the 90th and 95th percentile ranks and
between the 5th and 10th percentile ranks are larger than the difference
between the 50th and 55th percentile ranks.
c. Stanine Score. This system expresses test results in nine equal steps,
which range from one (lowest) to nine (highest). A stanine score of 5 is
interpreted as the "average" stanine. Percentile ranks are grouped into
stanines, with the following verbal interpretations:
d. Standard Score. Standard scores are raw scores that are converted into a
common scale of measurement that provides a meaningful description of the
individual scores within the distribution. A standard score describes the
difference of the raw score from a sample mean, expressed in standard
deviations. The two most commonly used standard scores are (1) z-score
and (2) T-score.
i. Z-score is one type of standard score. Z-scores have a mean of 0 and
a standard deviation of 1. It is computed using the following formula:
z = (X - M) / SD
where X is the raw score, M is the mean of the group, and SD is the
standard deviation of the group.
Standard scores are useful when you want to compare learners’ test
performance across two distributions. For example:
                      Class A      Class B
Standard Deviation    1            5
Mean Score            85           90
Score of Student 1    90 (Luis)    95 (Michael)
While the difference between the raw scores of Luis and Michael from
the mean is the same (i.e., 5), Michael’s standard score is lower than
Luis’ standard score (z of 1 vs. z of 5). This is because the
variability in scores in Michael’s class is higher than that in Luis’
class. As such, it is appropriate to convert raw scores into standard
scores, since raw scores mean different things in different situations or
for different learners.
A z-score can either be positive or negative. The (+) and (-) signs do
not indicate the magnitude of z-score; rather, they indicate the
direction of raw scores from the mean. A positive (+) z-score means
that the score is higher than the group mean, while a negative (-) z-
score indicates that the raw score is lower than the group mean.
ii. T-score is another type of standard score, wherein the mean is
equal to 50 and the standard deviation is equal to 10. It is a linear
transformation of z-scores, which have a mean of 0 and a standard
deviation of 1. It is computed from a z-score with the following formula:
T = 50 + 10z
A T-score of 50 is considered ―average‖, with T-score ranging from
40 to 60 as within the normal range. T-score of 30 and below and T-
scores of 70 and above are interpreted as low and high test
performance, respectively.
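The two standard scores can be checked with a short sketch that uses the figures for Luis and Michael from the example above. The z-score is taken as the difference of the raw score from the group mean divided by the standard deviation, and the T-score uses the stated mean of 50 and standard deviation of 10. The function names are hypothetical.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the group mean, in standard deviations."""
    return (raw - mean) / sd


def t_score(z):
    """Linear transformation of a z-score to a scale with mean 50 and SD 10."""
    return 50 + 10 * z


# Figures from the Class A / Class B example
luis = z_score(90, mean=85, sd=1)
michael = z_score(95, mean=90, sd=5)

print(luis, t_score(luis))        # 5.0 100.0
print(michael, t_score(michael))  # 1.0 60.0
```

Both learners are 5 raw-score points above their class means, yet Michael ends up with a T-score of 60, at the upper edge of the normal range, while Luis’ T-score of 100 is far above it because scores in his class vary much less.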
The following figure presents the relationships among the various
standard scores and percentile rank in a normal distribution.
1. Stick to the purpose of the assessment. Before coming up with an
assessment, it is first important to determine the purpose of the test. Will the
assessment be used for diagnostic purposes? Will it be a formative
assessment or a summative assessment? Diagnostic and formative
assessments are generally not graded. Diagnostic assessments are
primarily used to gather feedback about the learners’ prior knowledge or
misconceptions before the start of a learning activity, while results from
formative assessments are used to determine what learners need to improve
on or what topics or course contents need to be addressed and given
emphasis by the teacher. Formative assessment results are also to be used
by learners to reflect on and monitor their own learning and, therefore,
should not be graded. On the other hand, an assessment that is used as a
summative evaluation should be assigned a grade as it is aimed to
determine how well the learners were able to achieve the desired learning
outcomes.
2. Be guided by the desired learning outcomes. The learners should be
informed early on of what is expected of them insofar as learning outcomes
are concerned, as well as how they will be assessed and graded in the test.
Such information can be disseminated through the course syllabus or during
course introduction. Accordingly, the tests or performance tasks that will be
conducted in the class should only include and focus on the intended
learning outcomes and the specified course contents. Should the test include
items that have not been discussed or are not part of the course syllabus,
these items should not be included in the computation of the final test score.
3. Develop grading criteria. Grading criteria to be used in traditional tests
and performance tasks should be made clear to the students. Similarly,
learners should also be informed of the weight of each criterion. Grading
criteria and weights should be applied fairly and consistently. A holistic or
analytic rubric can be used to map out the grading criteria.
Developing grading criteria may be tedious. However, having clear
criteria can save time in the grading process, make the grading process
more consistent and fair, communicate expectations to learners, and help
learners understand how their work is graded.
4. Inform learners what scoring methods are to be used. Learners should
be made aware before the start of testing whether their test responses are
to be scored based on the number right, negative marking, or
nonconventional scoring methods. In this way, the learners will be guided
on how to mark their responses during the test. Such instruction should be
followed and applied consistently to every learner in the same class and to
other learners in the same course but in different classes.
5. Decide on what type of test scores to use. As discussed earlier, there are
different ways by which student learning can be measured and presented.
Performance in a particular test can be measured and reported through raw
scores, percentage scores, criterion-referenced scores, or norm-referenced
scores. It is important that different types of grading scheme be used for
different tests, assignments, or performance tasks. Learners should also be
informed at the start of what grading system is to be used for a particular
test or task.
given based on the overall judgment of the learners’ writing composition.
A holistic rubric is viewed as more convenient for teachers, as it requires
fewer areas or aspects of writing to evaluate. However, it does not provide
specific feedback on which course topics/contents or criteria the students
are weak in and need to improve on. On the other hand, an analytic scoring
system requires that the essay be evaluated based on each of the criteria. It
provides useful feedback on the learner’s strengths and weaknesses for each
course content or criterion.
3. Prepare the rubric. In developing a rubric, the skills and competencies
related to essay writing should first be identified. These skills and
competencies represent the criteria. Then, performance benchmarks and
point values are determined. Performance benchmarks can be numerical
categories, but the most frequently used are descriptors with a corresponding
rating scale.
Point Values    Sample Performance Benchmarks
1               Needs Improvement   Beginning      Novice          Inadequate
2               Satisfactory        Developing     Apprentice      Developing
3               Good                Accomplished   Proficient      Proficient
4               Exemplary           Exceptional    Distinguished   Skilled
8. Get two or more raters for essays that are high-stakes, such as those used
for admission, placement, or scholarship screening purposes. The final grade
will be the average of all the ratings given.
9. Write comments next to the learner’s responses to provide feedback on
how well he or she performed in the essay test.
Illustrative Example:
Assume that a student has obtained the following raw scores in the
different components of the English subject:
Components              Total Score    Total Possible Score
Written Works           145            160
Performance Tasks       100            120
Quarterly Assessment    50             50
Now, let us compute the grade.
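A minimal sketch of the computation for the raw scores above is given below. Each component is first converted to a percentage score and then weighted. The weights of 30% for Written Works, 50% for Performance Tasks, and 20% for Quarterly Assessment are assumed here for a language subject; they should be checked against the DepEd guidelines that this lesson refers to. The result is an initial grade only, since the official transmutation table is not reproduced in this sketch. All names in the code are hypothetical.

```python
# Assumed component weights for a language subject (verify against DepEd guidelines)
WEIGHTS = {
    "Written Works": 0.30,
    "Performance Tasks": 0.50,
    "Quarterly Assessment": 0.20,
}

# (learner's total score, total possible score) from the example above
scores = {
    "Written Works": (145, 160),
    "Performance Tasks": (100, 120),
    "Quarterly Assessment": (50, 50),
}


def percentage_score(raw, possible):
    """Convert a component raw score to a percentage score."""
    return 100 * raw / possible


def initial_grade(scores, weights):
    """Sum of the weighted percentage scores of all components."""
    return sum(percentage_score(*scores[name]) * weight
               for name, weight in weights.items())


for name, (raw, possible) in scores.items():
    ps = percentage_score(raw, possible)
    print(f"{name}: PS = {ps:.2f}, weighted = {ps * WEIGHTS[name]:.2f}")

grade = initial_grade(scores, WEIGHTS)
print(f"Initial grade: {grade:.2f}")  # Initial grade: 88.85
```

Under these assumed weights, the percentage scores of 90.63, 83.33, and 100.00 contribute 27.19, 41.67, and 20.00 points respectively, for an initial grade of about 88.85.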
Summary
In this lesson, we were able to discuss exhaustively the purpose of grading and
communicating learners’ test performance, the various methods in marking or
scoring tests and even performance tasks, the different methods in grading
learners’ performance in assessments, types of test scores, general guidelines
in grading tests or performance tasks, general guidelines in scoring essay tests,
and how test results should be communicated. Finally, the guidelines on classroom
assessment of the DepEd K to 12 Basic Education Program were likewise
highlighted.
Enrichment
1. Read the following articles:
1. Magno, C. (2010). The Functions of Grading. The Assessment
Handbook, Vol. 3.
2. Guskey, T. R. (2001). Grading and Reporting Student Learning. Corwin
Press: KY, USA.
3. Brookhart, Susan M. (2013). How to Create and Use Rubrics for
formative assessment and grading. Virginia, USA: ASCD.
2. Watch this video:
Nancy Heilbronner (2019, April 2). Grading and Reporting. [Video].
YouTube: [Link]
Assessment
A. Let us review what you have learned about grading and communicating
test results.
1. What are the purposes of grading and communicating learners’ test
performance?
2. What are the different methods in marking or scoring tests or
performance tasks?
3. What are the different methods in grading learners’ performance in
assessments?
4. What are the different types of test scores?
5. What are the general guidelines in grading tests or performance tasks?
6. What are the general guidelines in scoring essay tests?
7. How should test results be communicated?
B. After the discussion on grading and reporting test scores, you are now ready
to identify the methods of scoring/grading and the types of scores that you can
employ in your assessments. Let us apply what you have learned by
extending the assessment plan that you developed in an earlier lesson,
or you may start a new one. In addition to the desired learning outcomes,
course topics, and test formats that you have listed for each subject,
please identify the methods of scoring, types of grades, and reporting
strategies that you will employ.
Now, use this template for your own sample plan.
Desired learning outcomes | Course Topic | Assessment Method | Method of Scoring | Types of Test Score
C. Let us then come up with a grading and reporting scheme for each type of
assessment that you will employ in each of your subjects. In the
development of the grading and reporting scheme, you need the following
information:
1. Purpose of Assessment: Why is this assessment being conducted? Is
it for learners’ monitoring and improvement (formative), or is it for
demonstrating student achievement (summative)?
2. Desired Learning Outcomes for the Topic/Subject Area: What are
the learning outcomes expected from the learners for this unit/subject?
3. Type of Assessment: How will each outcome be measured?
4. Grading Criteria: What are the criteria to include that demonstrate
achievement of the stated desired learning outcomes?
5. Scoring/Grading Method: How will the tests/performance tasks be
scored?
6. Type of Score: What types of scores are appropriate to indicate
the students’ level of achievement or performance?
For example: Final Grade for Entrepreneurship Course
Purpose of assessment: Determine learners’ understanding of concepts, underlying principles, and processes of developing a business plan
Desired Learning Outcomes: The learner can present alone or with his/her classmates an acceptable detailed business plan.
Types of Test/Assessment | Grading Criteria | Scoring Method | Type of Score
Written Test (Multiple-choice) | Remembering; Understanding | Number Right | Percentage score
Written Test (Essay) | Content; Organization; Language usage; Support | Holistic Scoring | Letter Grades
Performance Assessment-Product (Research Report on the Development of Business Plan) | Market Analysis; Competitive Analysis; Marketing Strategy; Administrative Personnel; Critical Risks; Financial Data and Projections; Timeliness of submission; Grammar and spelling | Analytic Scoring | Categorical Grades
Essay | Content; Organization; Language usage; Support | Holistic Scoring |
D. Evaluate the sample grading and reporting scheme that you have
developed for each assessment by using the rubric.
Criteria | Inadequate (1) | Developing (2) | Proficient (3)
Purpose of the test
Inadequate (1): The purpose of testing is not specified in the grading and reporting system suggested for the subject area covered.
Developing (2): The purpose of testing is specified; however, it is not clear or relevant to the grading and reporting system suggested for the subject area covered.
Proficient (3): The purpose of testing is clearly specified and relevant to the grading and reporting system suggested for the subject area covered.
Identification of intended Learning Outcomes
Inadequate (1): The intended learning outcomes in the unit/topic/course are not identified and specified in the grading and reporting scheme.
Developing (2): The intended learning outcomes are listed but they are not clearly described.
Proficient (3): The intended learning outcomes in the unit/topic/course are explicitly specified.
Types of Tests
Inadequate (1): The tests by which students’ level of achievement is measured are not valid and appropriate to measure the extent to which learners have achieved the intended outcomes.
Developing (2): The tests are appropriate, but they will not provide a complete and valid measure of the extent to which learners have achieved the intended outcomes.
Proficient (3): The tests will provide an adequate and accurate measure of the extent to which the learners have achieved the intended outcomes.
Grading Criteria
Inadequate (1): No criterion is included in the grading and reporting scheme.
Developing (2): The same criteria or performance standards are used for all kinds of learners’ test performance and outputs.
Proficient (3): Different criteria or performance standards are used for different types of tests or performance tasks.
Scoring Methods
Inadequate (1): The scoring methods used to grade learners’ level of achievement or performance are not valid and appropriate.
Developing (2): The scoring methods are appropriate, but they will not provide a complete and valid measure of the extent to which learners have achieved the intended outcomes.
Proficient (3): The scoring methods will provide an adequate and accurate measure of the extent to which learners have achieved the intended outcomes.
Types of Scores
Inadequate (1): The types of scores included in the grading and reporting scheme will not provide a valid measure of learners’ level of achievement and performance.
Developing (2): The types of scores are appropriate measures; however, they are not adequate to assess learners’ level of achievement and performance.
Proficient (3): The types of scores will serve as concrete evidence of the learners’ level of achievement or performance.
b. Holistic scoring d. primary trait scoring
e. Assessing each aspect of a performance task and assigning a
score for each criterion.
a. Analytic scoring c. multiple trait scoring
b. Holistic scoring d. primary trait scoring
2. Identify the types of scores described below.
a. Simply gives the number of items answered correctly on a test.
a. Percentile rank c. raw score
b. Percentage score d. standard scores
b. Tells you the percentage of scores that falls at or below your
score.
a. Percentile rank c. raw score
b. Percentage score d. standard scores
c. Compares the performance of a learner with those of his or her
peers.
a. Criterion-referenced c. norm-referenced
b. Letter grade d. pass or fail
F. Give three (3) main reasons why you need to assign or give grades to your
students’ test results, and justify your answer. Your response to this question
will be evaluated using the holistic rubric below.
Criteria | 1 Point | 2 Points | 3 Points
Knowledge/Understanding of concept
1 Point: Demonstrates no or limited understanding of the topic/concept
2 Points: Demonstrates fair understanding of the topic/concept
3 Points: Demonstrates extensive knowledge and strong understanding of the topic/concept
Argument/Conclusion
1 Point: Makes an inaccurate argument or conclusion
2 Points: Makes an accurate but incomplete argument or conclusion
3 Points: Makes an accurate and complete argument or conclusion
Support
1 Point: Provides inappropriate and insufficient examples or evidence to support the argument or conclusion
2 Points: Provides appropriate but insufficient examples or evidence to support argument or conclusion
3 Points: Provides appropriate and sufficient evidence or examples to support argument or conclusion
Explanation/Reasoning
1 Point: Does not provide explanation/justification to argument or conclusion
2 Points: Provides good explanation or justification that links the argument/conclusion and examples
3 Points: Provides excellent explanation or reasoning that links examples to argument or conclusion
Educator’s Input
I have extensive experience in the development of various
standardized tests in our university as head of the testing unit for 21 years. I
have been involved in the development of tests used for admission and
placement of our college and senior high school students, as well as tests used
for screening our employee applicants. I have also conceptualized and developed
scales and rubrics for performance evaluation of our personnel. For the
high-stakes tests, we make use of norm-referenced scoring, particularly percentile
and stanine scores. These scores were developed with normative groups as
references for each academic program/college. For example, norms for
undergraduate and graduate school admission tests were based on the scores
of applicants to the different program offerings or colleges. Decisions on whether
to accept or reject student applicants are based on the University-set
cut-off percentile rank score. On the other hand, answers to the essays are rated
using the Pass/Fail scoring method. For scales and rubrics, criterion-referenced
scores are used.
As a faculty member in the graduate school, I extensively use number right
scoring for selected-response tests, holistic scoring for essay tests, and analytic
scoring for performance tasks. For holistic and analytic scoring, I employ
rubrics. I have developed rubrics for final research output, peer review, paper
oral presentation, research proposal, and projects. The rubrics help me in
evaluating both the performance- and product-based outputs of my students.
Grades are communicated to students during the grade consultation
day. The bases of the final grade, which I normally present to the students
during the first meeting, are again explained to the students. Students are then
given the opportunity to ask questions about their grades.
References
David et al. (2020). Assessment in Learning 1. Manila: Rex Book Store.
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon
City: Adriana Publishing Co., Inc.
D.O. No. 8, s. 2015 (Policy Guidelines on Classroom Assessment for the K to
12 Basic Education Program)
du Plessis, S. (2017, November). 5 Reasons Why Grades Are Important.
Retrieved from [Link]
important/
Marzano, Robert (September 2020). What Are Grades For? In Transforming
Classroom Grading. Retrieved from
[Link]
Grades-For%C2%[Link]
Appendix A – Course Syllabus
Republic of the Philippines
SULTAN KUDARAT STATE
UNIVERSITY
ACCESS, EJC Montilla, 9800 City of
Tacurong Province of Sultan Kudarat
College of Teacher Education
First Semester, Academic Year 2020-
2021
UNIVERSITY VISION
A trailblazer in arts, science and technology in the region.
UNIVERSITY MISSION
The University shall primarily provide advanced instruction and
professional training in science and technology, agriculture, fisheries,
education and other related field of study. It shall undertake research
and extension services, and provide progressive leadership in its area
of specialization.
UNIVERSITY GOAL
To produce graduates with excellence and dignity in arts, science and
technology.
a. Enhance competency development, commitment, professionalism, unity
and true spirit of service for public accountability, transparency and
delivery of quality services;
b. Provide relevant programs and professional trainings that will respond to
the development needs of the region;
c. Strengthen local and international collaborations and partnerships
for borderless programs;
d. Develop a research culture among faculty and students;
e. Develop and promote environmentally-sound and market-driven
knowledge and technologies at par with international standards;
f. Promote research-based information and technologies for sustainable
development;
g. Enhance resource generation and mobilization to sustain financial
viability of the university.
b. Discuss outcomes-based education as a concept / / / /
c. Explain the fundamental concepts and principles of assessment in learning / / / /
d. Formulate the learning objectives and targets considering the purpose and methods of assessment / / / / /
e. Decide the right assessment tool based on the suitable form and use / / /
f. Plan out a written test through the use of a Table of Specifications (TOS) / / / /
g. Construct a test based on the learning objectives, outcomes/targets and guidelines of varied test formats / / / /
h. Enhance the quality of test through judgmental test-improvement and other empirically-based procedures / / / / / /
i. Ensure the validity and reliability of the constructed test / / / / / /
j. Organize the data derived from tests using tables and charts / / / / / / /
k. Use statistics to analyze, interpret and use test data in decision making / / / / / /
l. Observe the guidelines in test scoring and grading as well as its methods of reporting / / / / /
7. Course Contents
Course Objectives, Topics, Time Allotment | Desired Student Learning Outcomes | Outcomes-Based Assessment (OBA) Activities | Evidence of Outcomes | Course Learning Outcomes | Program Objectives | Values Integration
Lesson 0. Course Orientation (3 hours)
Topics: Course Syllabus; Basic academic policies
Desired Student Learning Outcomes:
1. Explain the vision and mission, and significant academic policies of the University
2. Enumerate the course desired learning outcomes
3. Use the syllabus as reference for independent learning
4. Simulate the computation of one’s grades given the criteria
Outcomes-Based Assessment (OBA) Activities: Recite sincerely the University Vision and Mission; Involvement in the G-class
Evidence of Outcomes: Oral Recitation (OR); Class Participation Rating (CPR)
Course Learning Outcomes: a
Program Objectives: a, b
Values Integration: Accountability, Excellence
CHAPTER 1. OUTCOMES-BASED EDUCATION
Lesson 1. Understanding Outcomes-Based Education (3 hours)
Topics: Meaning of Education; What is OBE?; Educational Landscape for Higher Education; The Outcomes of Education
Desired Student Learning Outcomes:
1. Discuss outcomes-based education, its meaning, brief history and characteristics
2. Identify the procedures in the implementation of OBE in subjects or courses
3. Define outcomes and discuss each type of outcomes
Outcomes-Based Assessment (OBA) Activities: Self-assessment as contained in the last part of the module; Involvement in the G-class; Quiz
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Case Report Rating (CRR); Class Participation Rating (CPR)
Course Learning Outcomes: b
Program Objectives: b, g, h, p
Values Integration: Honesty, Transparency, Justice
CHAPTER 2. INTRODUCTION TO ASSESSMENT IN LEARNING
Lesson 1. Basic Concepts and Principles in Assessing Learning (4.5 hours)
Topics: Meaning of Assessment; Meaning of Learning; Evaluation and Measurement; Principles in Assessing Learning; Grading and Testing
Desired Student Learning Outcomes:
1. Make a personal definition of assessment
2. Compare assessment with measurement and evaluation
3. Discuss testing and grading
4. Explain the different principles in assessing learning
5. Relate an experience as a student or pupil related to each principle
6. Comment on the tests administered by the past teachers
7. Perform simple evaluation
Outcomes-Based Assessment (OBA) Activities: Sharing of personal experiences on testing and grading practices of past teachers through a case presentation; Quiz; Self-assessment as contained in the last part of the module; Involvement in the G-class
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Case Report Rating (CRR); Class Participation Rating (CPR)
Course Learning Outcomes: c
Program Objectives: b, g, h, p
Values Integration: Honesty, Transparency, Justice
Lesson 2. Assessment Purposes, Learning Objectives/Targets and Appropriate Methods (4.5 hours)
Topics: Purpose of Classroom Assessment; Bloom’s Taxonomy of Educational Objectives; Learning Objectives; Learning Targets; Matching Appropriate Assessment Methods
Desired Student Learning Outcomes:
1. Articulate the purpose of classroom assessment
2. Tell the difference between the Bloom’s Taxonomy and the Revised Bloom’s Taxonomy in stating learning objectives
3. Apply the Revised Bloom’s Taxonomy in writing learning objectives
4. Discuss the importance of learning targets in instruction
5. Formulate learning targets
6. Match the assessment methods with specific learning objectives/targets
Outcomes-Based Assessment (OBA) Activities: Completion of Table of Learning Objectives/targets; Presentation of matrix of learning targets and methods of assessment; Quiz; Self-assessment as contained in the last part of the module; Involvement in the G-class
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Case Report Rating (CRR); Class Participation Rating (CPR)
Course Learning Outcomes: d
Program Objectives: b, g, h, p, r
Values Integration: Objectivity, Justice, Truthfulness
Lesson 3. Different Classifications of Assessment (6 hours)
Topics: Educational test vs. Psychological test; Teacher-made test vs. Standardized test; Constructed-response test vs. Selected-response test; Achievement test vs. Aptitude test; Power test vs. speed test
Desired Student Learning Outcomes:
1. Compare the following forms of assessment: educational vs. psychological, teacher-made vs. standardized, selected-response vs. constructed-response, achievement vs. aptitude, and power vs. speed
2. Give examples of each classification of test
3. Illustrate situations in the use of different classifications of assessment
4. Decide on the kind of assessment to be used
Outcomes-Based Assessment (OBA) Activities: Completed table of learning objectives/targets; Completed matrix of learning targets and methods of assessment; Quiz; Self-assessment as contained in the last part of the module; Involvement in the G-class
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Class Participation Rating (CPR)
Course Learning Outcomes: e
Program Objectives: b, g, p
Values Integration: Fairness, Respect, Accountability
CHAPTER 3. DEVELOPMENT AND ENHANCEMENT OF TESTS
Lesson 1. Planning a Written Test (6 hours)
Topics: Planning a Test; Defining the Test Objectives or Learning Outcomes for Assessment; Objectives for Testing; Table of Specifications; General Steps in Developing a Table of Specifications; Different Formats of a Table of Specifications
Desired Student Learning Outcomes:
1. Define the necessary instructional outcomes to be included in a written test
2. Describe what is a table of specifications (TOS) and its formats
3. Prepare a TOS for a written test
4. Demonstrate the systematic steps in making a TOS
Outcomes-Based Assessment (OBA) Activities: Completed Table of Specifications; Quiz; Oral Recitation; Self-assessment as contained in the last part of the module; Involvement in the G-class
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Checklist Rating (CLR); Class Participation Rating (CPR)
Course Learning Outcomes: f
Program Objectives: b, g, h, p
Values Integration: Excellence, Perseverance, Honesty
Lesson 2. Construction of Written Tests (6 hours)
Topics: Constructing various Types of Traditional Test Formats; General Guidelines in the Selection of Appropriate Test Format; Categories and Formats of Traditional Tests; General Guidelines in Test Item Construction (Multiple-Choice, Matching-type, True-False, Short-answer, Essay Tests, Problem-solving)
Desired Student Learning Outcomes:
1. Describe the characteristics of selected-response and constructed-response tests
2. Classify whether a test is selected-response or constructed-response
3. Identify the test format that is most appropriate to a particular learning outcome/target
4. Apply the general guidelines in constructing test items
5. Prepare a written test based on the prepared TOS
6. Evaluate a given teacher-made test based on guidelines
Outcomes-Based Assessment (OBA) Activities: Constructed written test in varied formats; Quiz; Oral Recitation; Self-assessment as contained in the last part of the module; Participation in the G-class
Evidence of Outcomes: Oral Recitation (OR); Exercises/Quiz Scores (EQS); Class Participation Rating (CPR)
Course Learning Outcomes: g
Program Objectives: b, g, j, p
Values Integration: Justice, Respect, Hard work, Responsibility
Lesson 3. Improving a Classroom-Based Assessment (3 hours)
Topics: Judgmental item-improvement (Teacher’s own Review, Peer Review, Student Review); Other Empirically-based Procedures (Difficulty Index, Index of Discrimination, Distracter Analysis)
Desired Student Learning Outcomes:
1. List down the different ways for judgmental item-improvement and other empirically-based procedures
2. Evaluate which type of test item-improvement is appropriate to use
3. Compute and interpret the results for index of difficulty, index of discrimination and distracter efficiency
4. Demonstrate knowledge on the procedures for improving a classroom-based assessment
Outcomes-Based Assessment (OBA) Activities: Teachers and Peer Review results; Item Analysis Results (Difficulty Index and Index of Discrimination); Quiz; Oral Recitation; Involvement in the G-class
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Checklist Rating (CLR); Class Participation Rating (CPR)
Course Learning Outcomes: h
Program Objectives: b, g, h, n, p, r, t
Values Integration: Integrity, Justice, Objectivity
Lesson 4. Establishing Test Validity and Reliability (6 hours)
Topics: Validity Test (Content, Face, Predictive, Concurrent, Construct, Convergent, Divergent); Reliability Test (Test-retest, Parallel forms, Split-half, Internal consistency, Inter-rater)
Desired Student Learning Outcomes:
1. Explain the different tests of validity
2. Identify the most practical test to apply when validating a typical teacher-made assessment
3. Tell when to use a certain type of reliability test
4. Apply the suitable method of reliability test given a set of assessment results/test data
5. Decide whether a test is valid or reliable
Outcomes-Based Assessment (OBA) Activities: Validity and Reliability Test results; Quiz; Oral Recitation; Self-assessment as contained in the last part of the module; Involvement in the G-class
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Checklist Rating (CLR); Class Participation Rating (CPR)
Course Learning Outcomes: i
Program Objectives: b, g, h, n, p, r, t
Values Integration: Integrity, Justice, Objectivity
CHAPTER 4. ORGANIZATION, UTILIZATION, AND COMMUNICATION OF TEST RESULTS
Lesson 1. Organization of Test Data Using Tables and Graphs (6 hours)
Topics: Frequency Distribution; Cumulative Frequency Distribution; Determining the Midpoint of the Class Intervals; Using Excel Chart Wizard; Graphic Representation of Data; Skewness and Kurtosis
Desired Student Learning Outcomes:
1. Organize the raw data from a test
2. Construct a frequency distribution
3. Acquire knowledge on the basic rules in preparing tables and graphs
4. Summarize test data using appropriate table or graph
5. Use Microsoft Excel to construct appropriate graphs for a data set
6. Interpret the graph of a frequency and cumulative frequency distribution
7. Characterize a frequency distribution graph in terms of skewness and kurtosis
Outcomes-Based Assessment (OBA) Activities: Completed frequency distribution table; Completed tables and graphs; Self-assessment as contained in the last part of the module; Demonstration of steps in the use of Excel Chart Wizard; Oral Recitation; Involvement in the G-class
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Checklist Rating (CLR); Class Participation Rating (CPR)
Course Learning Outcomes: j
Program Objectives: a, b, g, j, n, p, r, t
Values Integration: Simplicity, Order, Fairness
Topics: Application of Excel; Scale of Measurement; Measures of Dispersion; Measures of Position
Desired Student Learning Outcomes:
... dispersion of test scores
3. Calculate the measure of position
4. Relate standard deviation and normal distribution
5. Transform raw scores to standardized scores (z, T and stanine)
6. Compute the measure of covariability using the long process and Excel
7. Interpret test data applying measures of central tendency, variability, position, and covariability
Outcomes-Based Assessment (OBA) Activities: Computation results; Oral Recitation; Quiz; Involvement in the G-class
Evidence of Outcomes: (CLR); Class Participation Rating (CPR)
Lesson 3. Grading and Reporting of Test Results (6 hours)
Topics: Grading and Reporting; Purposes of Grading and Reporting Learners’ Test Performance; Methods in Scoring Tests or Performance Tasks; Types of Test Scores; Guidelines in Grading Tests or Performance Tasks; Grading System in K to 12 Basic Education Program
Desired Student Learning Outcomes:
1. Define what is grading
2. Discuss the different methods in scoring tests and performance tasks; different types of test scores; guidelines on grading tests and performance test scores; and how to communicate test scores
3. Prepare scoring rubrics for performance tasks
4. Discuss the assessment system in the Department of Education as contained in DO No. 8, s. 2015
5. Compute the grades of learners based on DepEd guidelines
Outcomes-Based Assessment (OBA) Activities: Self-assessment as contained in the last part of the module; Scoring Rubrics; Oral Recitation; Quiz; Participation/Involvement in the G-class
Evidence of Outcomes: Exercises/Quiz Scores (EQS); Checklist Rating (CLR); Class Participation Rating (CPR)
Course Learning Outcomes: l
Program Objectives: a, b, g, j, s, t
Values Integration: Respect, Truthfulness, Justice
8. Course Evaluation
Course Requirements The following are the course requirements: (a) Examinations (Midterm and Final); (b) Quizzes/Exercises; and, (c) Class
Participation/involvement
Course Policies All students must adhere to these class guidelines: (a) act politely, responsibly and with maturity; (b) arrive on time and be ready for
instruction; (c) set cell phones in silent mode and keep them inside the bags; (d) contribute to an orderly learning environment; (e)
consult the professor when deemed necessary; (f) establish good rapport with professors; (g) maintain silence during oral
reports/presentations; and, (h) cooperate in classroom activities or any task performances.
Grading System
Midterm Grade: Midterm Examination (50%); Quizzes/Assignments (30%); Participation/Attendance (20%)
Final Term Grade: Midterm Examination (50%); Quizzes/Assignments (30%); Participation/Attendance (20%)
Schedule of Examination
Midterm: October 11-12, 2020*
Final Term: December 11-13, 2020*
*tentative
References
Book
Andrade, H. (2010). Students as the definitive source of formative assessment: Academic self-assessment and the self-regulation of learning. In H. Andrade & G. Cizek (Eds.), Handbook of formative assessment (pp. 90–105). New York, NY: Routledge.
Brookhart, Susan M. (2013). How to Create and Use Rubrics for formative assessment and grading. Virginia, USA: ASCD.
David et al. (2020). Assessment in Learning 1. Manila: Rex Book Store.
De Guzman, E. and Adamos, J. (2015). Assessment of Learning 1. Quezon City: Adriana Publishing Co., Inc.
Fives, H. & DiDonato-Barnes, N. (February 2013). Classroom Test Construction: The Power of a Table of Specifications. Practical Assessment, Research & Evaluation, Volume 18, (3).
Hattie, John. Visible Learning for Teachers: Maximizing Impact on Learning. New York: Routledge, 2012.
Klenowski, V. (1995). Student self-evaluation processes in student-centred teaching and learning contexts of Australia and England. Assessment in Education: Principles, Policy & Practice, 2(2).
Macayan, J. (2017). Implementing Outcome-Based Education (OBE) Framework: Implications for Assessment of Students’ Performance. Educational Measurement and Evaluation Review, Vol. 8 (1, 1-10).
Magno, C. (2011). A Closer Look at other Taxonomies of Learning: A Guide for Assessing Student Learning. The Assessment Handbook, Vol. 5.
Magno, C. (2010). The Functions of Grading Students. The Assessment Handbook, 3, 50-58.
Maxwell, Graham S. (2001). Teacher Observation in Student Assessment. (Discussion Paper). The University of Queensland.
McMillan, J. and Hearn, J. (2008). Student Self-Assessment. Educational Horizons. Retrieved from [Link]
Moss, Connie and Susan Brookhart. Learning Targets: Helping Students Aim for Understanding in Today’s Lesson. Alexandria: ASCD, 2012.
Navarro, L., Santos, R. and Corpuz, B. (2017). Assessment of Learning 1 (3rd ed.). Quezon City: Lorimar Publishing, Inc.
Online
Alberta Education (2008, October 1). Types of Classroom Assessment. Retrieved from [Link]
Aptitude Tests. Retrieved from [Link] [Link]/[Link]
Armstrong, P. (2020). Bloom’s Taxonomy. TN: Vanderbilt University Center for Teaching. Retrieved from [Link] pages/blooms-taxonomy/
Cherry, Kendra (2020, February 06). How Achievement Tests Measure What People Have Learned. Retrieved from [Link] 2794805
Classroom Assessment. Retrieved from [Link]
Clayton, Heather. “Power Standards: Focusing on the Essential.” Making the Standards Come Alive! Alexandria, VA: Just ASK Publications, 2016. Access at [Link]/just-ask-resource-center/e-newsletters/msca/power-standards/
EL Education (2020). Students Unpack a Learning Target and Discuss Academic Vocabulary. [Video]. [Link]
Fisher, M. Jr. R. (2020). Student Assessment in Teaching and Learning. Retrieved from [Link] assessment-in-teaching-and-learning/
Improving your Test Questions. [Link] evaluation/exam-scoring/improving-your-test-questions?src=cte-migration-map&url=%2Ftesting%2Fexam%2Ftest_ques.html
Isaacs, Geoff (1996). Bloom’s Taxonomy of Educational Objectives. The University of Queensland: TEDI. Retrieved from [Link]
Kurt, Serhat. (2019, April 24). Using Bloom’s Taxonomy to Write Effective Learning Objectives: The ABCD Approach. Retrieved from [Link] taxonomy-to-write-effective-learning-objectives-the-abcd-approach/
LSI (2018, November 10). 3 Types of Learning Targets. An excerpt from Creating & Using Learning Targets & Performance Scales: How Teachers Make Better Instructional Decisions, by Carla Moore, Libby H. Garst, and Robert J. Marzano. Retrieved from [Link] types-of-learning-targets/
Phelan, C. and Wren, J. (2006). Exploring Reliability in Classroom Assessment. Retrieved from [Link]
Shabatura, J. (2013, September 27). Using Bloom’s Taxonomy to Write Effective Learning Objectives. Retrieved from [Link] blooms-taxonomy/
The Graide Network (2018, September 10). Importance of Validity and Reliability in Classroom Assessments. [Link] quality-testing-reliability-and-validity
University of Lethbridge (2020). Creating Assessments. Retrieved from [Link]
Relevance of Contribution to topic under discussion | Contributions, when made, are off-topic or distract class from discussion | Contributions are sometimes off-topic or distracting | Contributions are always relevant | Contributions are relevant and promote deeper analysis of the topic
Preparation | Student is not adequately prepared; does not appear to have read the material in advance of class | Student has read the material but not closely or has read only some of the assigned material in advance of class | Student has read and thought about the material in advance of class | Student is consistently well prepared; frequently raises questions or comments on material outside
Case Study Grading Rubric
Each item is rated on the following rubric. [1= Very poor; 2 = Poor; 3 = Adequate; 4 = Good; 5 = Excellent]
Item | Scores
1. Evidence of preparation (organized presentation, presentation/discussion flows well, no awkward pauses or confusion from the group/individual, evidence you did your homework) | 1 2 3 4 5
2. Content (group/individual presented accurate & relevant information, appeared knowledgeable about the case studies assigned and the topic discussed, offered strategies for dealing with the problems identified in the case studies) | 1 2 3 4 5
3. Enthusiasm/Audience Awareness (demonstrates strong enthusiasm about topic during entire presentation; significantly increases audience understanding and knowledge of topic; convinces an audience to recognize the validity and importance of the subject) | 1 2 3 4 5
4. Delivery (clear and logical organization, effective introduction and conclusion, creativity, transition between speakers, oral communication skills - eye contact) | 1 2 3 4 5
5. Discussion (group/individual initiates and maintains class discussion concerning assigned case studies, use of visual aids, good use of time, involves classmates) | 1 2 3 4 5
The normal distribution, often referred to as a bell curve, describes how test scores are expected to be distributed: most scores cluster around the mean, with fewer scores appearing as they move away from the mean. This distribution is foundational in interpreting standardized test scores, as it allows for the calculation of percentile ranks and standard deviations, helping educators understand how a student's performance compares to the broader population. Understanding the normal distribution aids in identifying outliers, ensuring fair comparisons, and aligning assessments against predicted academic standards.
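As a minimal sketch of how the normal distribution supports score interpretation, the code below converts a raw score into a z-score, a T-score, and a percentile rank. The test mean of 75, standard deviation of 10, and raw score of 85 are hypothetical, and the percentile rank assumes that the scores are normally distributed.

```python
from statistics import NormalDist


def z_score(raw, mean, sd):
    """Number of standard deviations the raw score lies from the mean."""
    return (raw - mean) / sd


def t_score(z):
    """T-score: a standard score rescaled to a mean of 50 and a standard deviation of 10."""
    return 50 + 10 * z


def percentile_rank(raw, mean, sd):
    """Percentage of scores at or below the raw score, ASSUMING a normal distribution."""
    return NormalDist(mu=mean, sigma=sd).cdf(raw) * 100


# Hypothetical test statistics and learner score.
test_mean, test_sd, learner_score = 75, 10, 85

z = z_score(learner_score, test_mean, test_sd)
print(f"z = {z:.2f}, T = {t_score(z):.0f}")
print(f"Percentile rank = {percentile_rank(learner_score, test_mean, test_sd):.2f}")
```

A score one standard deviation above the mean falls at roughly the 84th percentile, which is why standard scores and percentile ranks can be read against each other when the distribution is close to normal.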
Holistic rubrics provide a single overall score based on an overall impression of an essay, which can be more convenient and quicker for scoring. However, they offer less specific feedback to students about areas that need improvement. Analytic rubrics, conversely, assess multiple criteria separately, presenting detailed feedback on specific aspects such as content, organization, and grammar. They yield comprehensive insights into student performance but are more time-consuming to create and use. The choice between these rubrics depends on the depth of feedback desired and the time available for assessment.
Bloom's Taxonomy provides a structured framework for creating learning objectives and assessments by categorizing cognitive skills from lower-order to higher-order thinking: remembering, understanding, applying, analyzing, evaluating, and creating. Teachers can use this taxonomy to design lesson objectives that promote critical thinking and deeper learning. By aligning assessment methods with the various levels of the taxonomy, teachers can ensure that students are tested comprehensively on different cognitive skills rather than just rote memorization. This approach helps in writing objectives and assessments that foster intellectual engagement and development.
Test reliability and validity are crucial to ensuring that educational assessments measure what they are intended to measure and that results are consistent over time. Reliability refers to the consistency of test results across multiple administrations or different forms of the test, indicating that outcomes are stable and repeatable. Validity concerns whether the test measures what it claims to measure, ensuring that the inferences and decisions made based on test results are appropriate. Together, these properties support the credibility of assessments in accurately reflecting student knowledge and skills.
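One common way to quantify consistency across two administrations is the Pearson correlation between the two sets of scores (test-retest reliability). The sketch below is a minimal illustration with made-up scores; the function names are hypothetical, and it assumes both lists have the same students in the same order and that the scores are not all identical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def retest_reliability(first, second):
    """Test-retest reliability estimated as the correlation of two administrations."""
    return pearson_r(first, second)


# Hypothetical scores of five students on two administrations of the same test.
first = [78, 85, 62, 90, 70]
second = [80, 83, 65, 92, 68]
print(round(retest_reliability(first, second), 2))
```

A coefficient close to 1 suggests stable, repeatable results. Note that this addresses reliability only; a test can be highly reliable and still fail to measure what it claims to measure.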
When developing a grading scheme for assessments, educators should consider the assessment's purpose (formative or summative), the learning outcomes expected, the type of assessment methods used, and the grading criteria that represent the achievement of learning outcomes. The chosen scoring method (e.g., holistic or analytic) and the type of grade (e.g., letter grades, percentage scores) should reflect the assessment's objectives and be aligned with the tasks being assessed. Defining clear grading criteria ensures consistency, fairness, and transparency in evaluating student performance.
Empirical item analysis techniques, such as computing item difficulty and discrimination indices, allow educators to refine multiple-choice tests by identifying which questions distinguish well between high and low performers. A good item discriminates adequately, meaning it is answered correctly more often by students with higher overall test scores. Difficulty indices show how hard each question is, so that items can be matched to the ability of a diverse student population. Item analysis improves test quality by helping ensure that questions are clear, fair, and valid measures of student knowledge.
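The two indices can be sketched as follows. This illustration assumes the common upper-group and lower-group method, in which the discrimination index is the proportion correct in the top-scoring group minus the proportion correct in the bottom-scoring group. The 27% group size is a widely used convention, not something stated in this passage, and the function names and data are hypothetical.

```python
def item_difficulty(item_responses):
    """Proportion of students answering the item correctly (1 = correct, 0 = wrong)."""
    return sum(item_responses) / len(item_responses)


def item_discrimination(item_responses, total_scores, fraction=0.27):
    """Upper-group minus lower-group proportion correct for one item."""
    n = len(total_scores)
    k = max(1, round(n * fraction))  # size of each comparison group
    # Student indices ordered from highest to lowest total test score.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = order[:k], order[-k:]
    p_upper = sum(item_responses[i] for i in upper) / k
    p_lower = sum(item_responses[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data: six students, one item, and their total test scores.
item = [1, 1, 1, 0, 0, 0]
totals = [10, 9, 8, 3, 2, 1]
print(item_difficulty(item))                             # 0.5
print(item_discrimination(item, totals, fraction=0.5))   # 1.0
```

Here the item is of moderate difficulty and discriminates perfectly, since every student in the upper group answered it correctly and no student in the lower group did. A value near zero or below zero would flag the item for review.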
Educators can enhance student learning and motivation by providing timely, specific, and constructive feedback that focuses on strengths and targeted areas for improvement. Feedback should guide students on how to approach their learning tasks, encourage self-reflection, and set clear goals. It is crucial to establish a positive feedback environment where students feel supported and motivated to make adjustments to their learning strategies. Clear feedback also helps boost student confidence and engagement by demonstrating progress and achievable next steps.
Outcomes-Based Education (OBE) emphasizes defining clear learning outcomes for students and aligns teaching and assessment strategies with these outcomes, leading to a focus on measuring whether students meet specific objectives. This approach requires teachers to design assessments that are directly tied to learning outcomes, which influences how they evaluate and provide feedback on student performance. OBE encourages the use of formative assessments to continuously monitor and support student learning, ensuring that teaching methods are adjusted to help students achieve the desired outcomes.
Formative assessments (Assessment for Learning) are ongoing assessments used by teachers to provide feedback and guide instructional decisions, supporting student learning during the educational process. They help identify student needs and allow for adjustments in teaching methods. Summative assessments (Assessment of Learning), on the other hand, evaluate student learning at the end of an instructional period, such as finals or standardized tests, providing a summary of what students have achieved. While formative assessments focus on feedback and student improvement, summative assessments are primarily for recording achievements.
Criterion-referenced grading evaluates student performance against a fixed set of criteria or learning standards, providing clear benchmarks for achievement and encouraging mastery of content. Norm-referenced grading, however, ranks students in relation to each other, which can motivate competitive improvement but may not reflect individual mastery of the material. The choice of grading method can influence educational outcomes by shaping how achievement is perceived and understood by students, educators, and stakeholders, potentially affecting motivation, confidence, and instructional strategies.
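The difference between the two approaches can be illustrated with a short sketch. The cutoff scores, letter grades, and class scores below are hypothetical; they only show that a criterion-referenced grade depends on fixed standards, while a norm-referenced standing depends on how the rest of the group performed.

```python
def criterion_grade(score, cutoffs, lowest="F"):
    """Grade from fixed cutoffs, given as (minimum score, grade) in descending order."""
    for minimum, grade in cutoffs:
        if score >= minimum:
            return grade
    return lowest


def norm_referenced_standing(score, group_scores):
    """Percentage of the group scoring below this student."""
    below = sum(s < score for s in group_scores)
    return 100 * below / len(group_scores)


# Hypothetical cutoffs and two hypothetical classes.
cutoffs = [(90, "A"), (80, "B"), (70, "C")]
strong_class = [85, 92, 95, 97]
weak_class = [60, 70, 85, 90]

print(criterion_grade(85, cutoffs))                # B in either class
print(norm_referenced_standing(85, strong_class))  # 0.0
print(norm_referenced_standing(85, weak_class))    # 50.0
```

The same score of 85 earns the same criterion-referenced grade everywhere, but its norm-referenced standing changes with the comparison group.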