Reliability and validity


These are very important terms in measurement. I'll present them now as definitions, and then refer back to them as I continue with today's presentation. This material will be a bit boring, perhaps, but I need to put these basic terms "on the table" before we can start eating some practical examples.

 

I think we may all have an idea of how to use the word "reliability".

 

Does QANTAS flight QF71 depart on time every day?

(Is it reliable, can we count on it?)

During the monsoon, will the rain start every day in the mid-afternoon?

My student, Carlos, consistently gets the best marks in my measurement class.

He has had the top scores on all assignments and tests.

(Carlos is a very reliable performer.)

 

Reliability = consistency

 

 

Reliability is also a term used as one measure of the quality of a test.

 

It is often referred to as test consistency. If I give students a test on Monday and then test them again on Wednesday with the same instrument, the same test, the same items, do they get the same scores? If I could take one student and repeatedly test him or her with the same instrument, would his or her scores be close to the same each time?

 

Note: it may seem strange to think that we would actually do this, as surely the students will remember the items and may have thought about them between Monday and Wednesday. Nonetheless, "test-retest" reliability is a term often encountered in "measurement".

 

The reliability of a test is generally found by computing a "reliability coefficient", usually expressed on a numeric scale which ranges from 0.00 to 1.00. In test-retest reliability, the reliability coefficient is based on the correlation between the scores from two administrations of the same test.
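
Here is a minimal sketch of how such a coefficient could be computed, assuming we have each student's scores from the Monday and Wednesday sittings; the scores below are invented purely for illustration.

    # Test-retest reliability as the Pearson correlation between two
    # administrations of the same test (requires Python 3.10+).
    # The scores below are made-up illustration data, not real results.
    from statistics import correlation

    monday    = [72, 85, 60, 91, 78, 66, 88, 74]
    wednesday = [70, 83, 63, 90, 80, 64, 86, 76]

    r = correlation(monday, wednesday)
    print(f"Test-retest reliability coefficient: {r:.2f}")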

 

One of the most popular ways of finding the reliability of a test or scale is to compute "coefficient alpha", also known as "Cronbach's alpha". Coefficient alpha is in fact probably the most commonly used index of reliability. It is also referred to as an index of a test's or scale's "internal consistency".
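
As a sketch of what is involved, coefficient alpha can be computed from item-level scores with the classical formula alpha = (k / (k - 1)) x (1 - sum of item variances / variance of total scores), where k is the number of items. The small data set below is invented purely for illustration.

    # Cronbach's alpha from a small table of item scores:
    # rows = students, columns = items (0 = wrong, 1 = right).
    # The data are invented for illustration only.
    from statistics import pvariance

    scores = [
        [1, 0, 1, 1],
        [1, 1, 1, 0],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
    ]

    k = len(scores[0])                                    # number of items
    item_vars = [pvariance(col) for col in zip(*scores)]  # variance of each item
    total_var = pvariance([sum(row) for row in scores])   # variance of total scores

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(f"Coefficient alpha: {alpha:.2f}")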

 

Good tests will have reliability coefficients in excess of 0.90; teacher-made tests will seldom have a figure this high -- they're more likely to have reliabilities in the 0.80 to 0.85 range (often less, unfortunately).

 

Errors which arise when using tests

 

There are quite a few factors which can arise when we administer a test, factors which introduce error, making a person's test score differ from what it would otherwise be. For example, the test environment might have aspects which bother some of the test takers, such as poor lighting, noise, or a room temperature too hot or too cold. These are environmental factors.

 

The person being tested may not be in the best of health on the test day; s/he may be worried about a parent or a child, or may, perhaps, have recently had an argument with someone and be upset. Test anxiety is fairly common -- some students will perform poorly because they're nervous. These personal factors can have a real impact on a person's test result.

 

The items on a test are usually just a sample of the items which could have been used. If we take two mathematics teachers, for example, and ask them to create a 30-item test of simple geometry, the items they produce will likely be quite different even when the teachers are using a common list of learning objectives, have lectured from exactly the same syllabus, and have used the same teaching methods and materials.

 

Often, as in the FIMS study to be discussed shortly, test items are sampled from a larger "pool" of items. The FIMS study had a pool of 102 items to draw from, but only 14 were drawn to make the test we will soon be discussing.

 

Thus a student's score on a particular test will depend on the version of the test which was used on the day, and/or on the particular sample of items selected from a large item pool.
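
A tiny simulation makes the point, assuming a hypothetical pool of 102 items numbered 1 to 102, as in the FIMS example: two independent draws of 14 items will produce two quite different tests.

    # Sketch: drawing two 14-item tests from a hypothetical 102-item pool.
    # Different draws give different tests -- one source of score variation.
    import random

    pool = list(range(1, 103))      # item IDs 1..102 (invented)

    test_a = sorted(random.sample(pool, 14))
    test_b = sorted(random.sample(pool, 14))

    print("Test A items:", test_a)
    print("Test B items:", test_b)
    print("Items in common:", sorted(set(test_a) & set(test_b)))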

 

It is common to use a statistic called the "error of measurement" when interpreting test results. When we test a student we should acknowledge that the score which results one day may not be the same if we could test the student again. We should, and easily can, use an error-of-measurement statistic to estimate the amount of error in the test score.
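
Here is a minimal sketch of the classical version of this statistic, usually called the standard error of measurement: SEM = SD x sqrt(1 - reliability). The standard deviation and reliability figures below are invented for illustration.

    # Standard error of measurement: SEM = SD * sqrt(1 - reliability).
    # The figures below are invented for illustration.
    from math import sqrt

    sd = 10.0            # standard deviation of the test scores
    reliability = 0.85   # e.g. coefficient alpha for the test

    sem = sd * sqrt(1 - reliability)
    print(f"Standard error of measurement: {sem:.2f}")

    # Roughly 68% of the time, a student's "true" score will lie within
    # about one SEM of the observed score.
    observed = 75
    print(f"Likely band: {observed - sem:.2f} to {observed + sem:.2f}")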

 

Validity

 

Validity is another core term which refers to the quality of a test.

 

It refers to the degree to which the test measures what it is supposed to measure. While this may seem only too obvious, I will give examples of mathematics items which may be measuring something other than maths.

 

Tools of the trade

 

Computers. These days there's a considerable choice of computer apps (or programs) which we can use to write, administer, and score tests. The best of these will give an estimate of test reliability as well as an indication of the possible extent of error which may have accompanied the testing process.

 

I use a laptop for my work with tests and many of the examples I will be showing you today will be based on software I have written and published. But I'm quite sure there are also iPad-based apps. And there are web-based "apps" too, such as Survey Monkey, QUIA, and Google Classroom.

 

Here's a link to a PDF document that discusses selected professional software programs for item and test analysis, systems which are free for students to use.