Criterion-referenced tests and assessments are designed to measure student performance against a fixed set of predetermined criteria or learning standards (i.e., concise, written descriptions of what students are expected to know and be able to do at a specific stage of their education). In elementary and secondary education, criterion-referenced tests are used to evaluate whether students have learned a specific body of knowledge or acquired a specific skill set, such as the curriculum taught in a course, academic program, or content area.
If students perform at or above the established expectations—for example, by answering a certain percentage of questions correctly—they will pass the test, meet the expected standards, or be deemed “proficient.” On a criterion-referenced test, every student taking the exam could theoretically fail if they don’t meet the expected standard; alternatively, every student could earn the highest possible score. On criterion-referenced tests, it is not only possible, but desirable, for every student to pass the test or earn a perfect score. Criterion-referenced tests have been compared to driver’s-license exams, which require would-be drivers to achieve a minimum passing score to earn a license.
Criterion-Referenced vs. Norm-Referenced Tests
Norm-referenced tests are designed to rank test takers on a “bell curve,” or a distribution of scores that resembles, when graphed, the outline of a bell (i.e., a small percentage of students performing poorly, most performing near the average, and a small percentage performing well). To produce a bell curve each time, test questions are carefully designed to accentuate performance differences among test takers, not to determine if students have achieved specified learning standards, learned required material, or acquired specific skills. Unlike norm-referenced tests, criterion-referenced tests measure performance against a fixed set of criteria.
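The contrast between the two approaches can be illustrated with a minimal Python sketch. The scores, the cut score of 70, and the function names are hypothetical and chosen only for illustration; real norm-referenced tests compare students to a large reference group rather than to a single class.

```python
def criterion_referenced(scores, cut_score):
    """Judge each student against a fixed cut score, independent of other students."""
    return [score >= cut_score for score in scores]


def percentile_ranks(scores):
    """Norm-referenced view: percentage of test takers scoring strictly below each student."""
    n = len(scores)
    return [100 * sum(other < score for other in scores) / n for score in scores]


# Hypothetical class in which every student scores well above the cut score.
scores = [95, 92, 90, 88, 85]

passed = criterion_referenced(scores, cut_score=70)  # every entry is True
ranks = percentile_ranks(scores)                     # students are still spread from top to bottom
```

In this sketch every student meets the criterion, yet the percentile ranks still place one student at the bottom of the group, because a ranking only describes performance relative to other test takers.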
Criterion-referenced tests may include multiple-choice questions, true-false questions, “open-ended” questions (e.g., questions that ask students to write a short response or an essay), or a combination of question types. Individual teachers may design the tests for use in a specific course, or they may be created by teams of experts for large companies that have contracts with state departments of education. Criterion-referenced tests may be high-stakes tests—i.e., tests that are used to make important decisions about students, educators, schools, or districts—or they may be “low-stakes tests” used to measure the academic achievement of individual students, identify learning problems, or inform instructional adjustments.
Well-known examples of criterion-referenced tests include Advanced Placement exams and the National Assessment of Educational Progress, which are both standardized tests administered to students throughout the United States. When testing companies develop criterion-referenced standardized tests for large-scale use, they usually have committees of experts determine the testing criteria and passing scores, or the number of questions students will need to answer correctly to pass the test. Scores on these tests are typically expressed as a percentage.
Passing scores (or “cut-off scores”) on criterion-referenced tests are judgment calls made by either individuals or groups. It’s theoretically possible, for example, that a given test-development committee, if it had been made up of different individuals with different backgrounds and viewpoints, would have determined different passing scores for a certain test. For example, one group might determine that a minimum passing score is 70 percent correct answers, while another group might establish the cut-off score at 75 percent correct. For a related discussion, see proficiency.
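To make the effect of that choice concrete, the following Python sketch applies the two cut-off scores mentioned above (70 and 75 percent) to the same set of results. The student scores and the function name are invented for illustration and do not come from any real test.

```python
def proficiency_rate(percent_correct, cut_score):
    """Return the percentage of students scoring at or above the cut-off score."""
    passing = sum(1 for p in percent_correct if p >= cut_score)
    return 100 * passing / len(percent_correct)


# Hypothetical percent-correct results for ten students.
results = [62, 68, 71, 73, 74, 76, 80, 85, 90, 95]

rate_at_70 = proficiency_rate(results, cut_score=70)
rate_at_75 = proficiency_rate(results, cut_score=75)
```

With these made-up results, eight of ten students pass at a cut-off of 70 percent, but only five of ten pass at 75 percent, even though no student's performance has changed.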
Criterion-referenced tests created by individual teachers are also very common in American public schools. For example, a history teacher may devise a test to evaluate understanding and retention of a unit on World War II. The criteria in this case might include the causes and timeline of the war, the nations that were involved, the dates and circumstances of major battles, and the names and roles of certain leaders. The teacher may design a test to evaluate student understanding of the criteria and determine a minimum passing score.
While criterion-referenced test scores are often expressed as percentages, and many have minimum passing scores, the test results may also be scored or reported in alternative ways. For example, results may be grouped into broad achievement categories—such as “below basic,” “basic,” “proficient,” and “advanced”—or reported on a 1–5 numerical scale, with the numbers representing different levels of achievement. As with minimum passing scores, proficiency levels are judgment calls made by individuals or groups that may choose to modify proficiency levels by raising or lowering them.
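As a rough sketch of how category-based reporting might work, the Python example below maps a percent-correct score to one of the four achievement levels named above. The threshold values (50, 70, and 90) are assumptions made up for this example; actual thresholds are set by whoever designs the test.

```python
from bisect import bisect_right

LEVELS = ["below basic", "basic", "proficient", "advanced"]

# Hypothetical minimum scores for "basic", "proficient", and "advanced".
DEFAULT_THRESHOLDS = [50, 70, 90]


def achievement_level(percent_correct, thresholds=DEFAULT_THRESHOLDS):
    """Map a percent-correct score to an achievement level.

    bisect_right counts how many thresholds the score meets or exceeds,
    and that count is the index of the matching level.
    """
    return LEVELS[bisect_right(thresholds, percent_correct)]


level_default = achievement_level(72)               # uses the thresholds above
level_raised = achievement_level(72, [50, 75, 90])  # "proficient" bar raised to 75
```

Under the default thresholds a score of 72 is reported as “proficient,” while raising the proficiency threshold to 75 reports the same score as “basic.”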
The following are a few representative examples of how criterion-referenced tests and scores may be used:
- To determine whether students have learned expected knowledge and skills. If the criterion-referenced tests are used to make decisions about grade promotion or diploma eligibility, they would be considered “high-stakes tests.”
- To determine if students have learning gaps or academic deficits that need to be addressed. For a related discussion, see formative assessment.
- To evaluate the effectiveness of a course, academic program, or learning experience by using “pre-tests” and “post-tests” to measure learning progress over the duration of the instructional period.
- To evaluate the effectiveness of teachers by factoring test results into job-performance evaluations. For a related discussion, see value-added measures.
- To measure progress toward the goals and objectives described in an “individualized education plan” for students with disabilities.
- To determine if a student or teacher is qualified to receive a license or certificate.
- To measure the academic achievement of students in a given state, usually for the purposes of comparing academic performance among schools and districts.
- To measure the academic achievement of students in a given country, usually for the purposes of comparing academic performance among nations. A few widely used examples of international-comparison tests include the Programme for International Student Assessment (PISA), the Progress in International Reading Literacy Study (PIRLS), and the Trends in International Mathematics and Science Study (TIMSS).
Criterion-referenced tests are the most widely used type of test in American public education. All the large-scale standardized tests used to measure public-school performance, hold schools accountable for improving student learning results, and comply with state or federal policies—such as the No Child Left Behind Act—are criterion-referenced tests, including the assessments being developed to measure student achievement of the Common Core State Standards. Criterion-referenced tests are used for these purposes because the goal is to determine whether educators and schools are successfully teaching students what they are expected to learn.
Criterion-referenced tests are also used by educators and schools practicing proficiency-based learning, a term that refers to systems of instruction, assessment, grading, and academic reporting that are based on students demonstrating mastery of the knowledge and skills they are expected to learn before they progress to the next lesson, get promoted to the next grade level, or receive a diploma. In most cases, proficiency-based systems use state learning standards to determine academic expectations and define “proficiency” in a given course, content area, or grade level. Criterion-referenced tests are one method used to measure academic progress and achievement in relation to standards.
Following a wide variety of state and federal policies aimed at improving school and teacher performance, criterion-referenced standardized tests have become an increasingly prominent part of public schooling in the United States. When focused on reforming schools and improving student achievement, these tests are used in a few primary ways:
- To hold schools and educators accountable for educational results and student performance. In this case, test scores are used as a measure of effectiveness, and low scores may trigger a variety of consequences for schools and teachers.
- To evaluate whether students have learned what they are expected to learn. In this case, test scores are seen as a representative indicator of student achievement.
- To identify gaps in student learning and academic progress. Test scores may be used, along with other information about students, to diagnose learning needs so that educators can provide appropriate services, instruction, or academic support.
- To identify achievement gaps among different student groups. Students of color, students who are not proficient in English, students from low-income households, and students with physical or learning disabilities tend to score, on average, well below white students from more educated, higher income households on standardized tests. In this case, exposing and highlighting achievement gaps may be seen as an essential first step in the effort to educate all students well, which can lead to greater public awareness and resulting changes in educational policies and programs.
- To determine whether educational policies are working as intended. Elected officials and education policy makers may rely on standardized-test results to determine whether their laws and policies are working as intended, or to compare educational performance from school to school or state to state. They may also use the results to persuade the public and other elected officials that their policies are in the best interest of children and society.
The widespread use of high-stakes standardized tests in the United States has made criterion-referenced tests an object of criticism and debate. While many educators believe that criterion-referenced tests are a fair and useful way to evaluate student, teacher, and school performance, others argue that the overuse, and potential misuse, of the tests could have negative consequences that outweigh their benefits.
The following are a few representative arguments typically made by proponents of criterion-referenced testing:
- The tests are better suited to measuring learning progress than norm-referenced exams, and they give educators information they can use to improve teaching and school performance.
- The tests are fairer to students than norm-referenced tests because they don’t compare the relative performance of students; they evaluate achievement against a common and consistently applied set of criteria.
- The tests apply the same learning standards to all students, which can hold underprivileged or disadvantaged students to the same high expectations as other students. Historically, students of color, students who are not proficient in English, students from low-income households, and students with physical or learning disabilities have suffered from lower academic achievement, and many educators contend that this pattern of underperformance results, at least in part, from lower academic expectations. Raising academic expectations for these student groups, and making sure they reach those expectations, is believed to promote greater equity in education.
- The tests can be constructed with open-ended questions and tasks that require students to use higher-level cognitive skills such as critical thinking, problem solving, reasoning, analysis, or interpretation. Multiple-choice and true-false questions promote memorization and factual recall, but they do not ask students to apply what they have learned to solve a challenging problem or write insightfully about a complex issue, for example. For a related discussion, see 21st century skills and Bloom’s taxonomy.
The following are representative arguments typically made by critics of criterion-referenced testing:
- The tests are only as accurate or fair as the learning standards upon which they are based. If the standards are vaguely worded, or if they are either too difficult or too easy for the students being evaluated, the associated test results will reflect the flawed standards. A test administered in eleventh grade that reflects a level of knowledge and skill students should have acquired in eighth grade would be one general example. Alternatively, tests may not be appropriately “aligned” with learning standards, so that even if the standards are clearly written, age appropriate, and focused on the right knowledge and skills, the test might not be designed well enough to measure achievement of the standards.
- The process of determining proficiency levels and passing scores on criterion-referenced tests can be highly subjective or misleading, and the potential consequences can be significant, particularly if the tests are used to make high-stakes decisions about students, teachers, and schools. Because reported “proficiency” rises and falls in direct relation to the standards or cut-off scores used to make a proficiency determination, it’s possible to manipulate the perception and interpretation of test results by raising or lowering either the standards or the passing scores. And when educators are evaluated based on test scores, their job security may rest on potentially misleading or flawed results. Even the reputations of national education systems can be negatively affected when a large percentage of students fail to achieve “proficiency” on international assessments.
- The subjective nature of proficiency levels allows the tests to be exploited for political purposes to make it appear that schools are doing either better or worse than they actually are. For example, some states have been accused of lowering proficiency standards on standardized tests to increase the number of students achieving “proficiency,” and thereby avoid the consequences that may result from large numbers of students failing to achieve expected or required proficiency levels: negative press, public criticism, and, in states that base graduation eligibility on test scores, large numbers of students being held back or denied diplomas.
- If the tests primarily utilize multiple-choice questions—which, in the case of standardized testing, makes scoring faster and less expensive because it can be done by computers rather than human scorers—they will promote rote memorization and factual recall in schools, rather than the higher-order thinking skills students will need in college, careers, and adult life. For example, the overuse or misuse of standardized testing can encourage a phenomenon known as “teaching to the test,” which means that teachers focus too much on test preparation and the academic content that will be evaluated by standardized tests, typically at the expense of other important topics and skills.