Note to myself: workbook saved on Imation USB as: StuIQ_10Sep2017.xlsx
In 1995 the New Zealand government set out to design a new test of "mathematics aptitude".
They assembled a team of expert teachers, and developed a subject design matrix similar to that used in the FIMS project.
They wrote about 300 test items. About a third of them were supply items; the rest were multiple-choice items with 5 options (4 distractors).
They then developed two "parallel" tests, each with 70 items (some supply items, some multiple-choice items).
These tests were referred to as "Form A" and "Form B".
"Parallel" here means that the same number of items was selected from each of the content categories, and the same proportion from each of the performance expectation levels. The 70 items on Form A were different from the 70 items on Form B, but the two sets were matched on content and performance level -- the forms were judged to be "parallel", having items of similar content and difficulty.
They then selected a stratified random sample of 300 junior high-school students to test.
I will display results using Excel and Lertap.
The histograms for the two tests were similar.
The scatterplot of Form A vs Form B is interesting. It basically displays the desired pattern of correlation: students who scored high on Form A tended to score high on Form B.
The reliability, the correlation between Form A and Form B test scores, was 0.84 -- this would be referred to as "parallel-forms" or "equivalent-forms" reliability. It's a good value.
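As a rough sketch of what that figure is: parallel-forms reliability is just the Pearson correlation between each student's Form A score and Form B score. The function name `pearson_r` and the six pairs of scores below are made up for illustration; they are not the New Zealand data, and Excel or Lertap would do this calculation for the real workbook.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares, all about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for six students (assumption, not the real sample).
form_a = [22, 35, 41, 48, 55, 63]
form_b = [25, 31, 44, 45, 58, 60]
print(round(pearson_r(form_a, form_b), 2))
```

A value near 1 means students keep much the same rank order on the two forms; a value near 0 means the forms rank them almost independently.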
The csem worksheet can be used to apply the binomial (as in the marbles example) to derive an estimate of a range which is likely to cover the "true" percentage-correct score.
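A minimal sketch of the binomial idea, assuming the simple standard error of a proportion, sqrt(p(1-p)/n), and a range of plus or minus one standard error. The exact formula in the csem worksheet is not given in these notes and may differ; `binomial_score_range` and its `z` parameter are names invented for this illustration.

```python
from math import sqrt

def binomial_score_range(num_correct, num_items, z=1.0):
    """Return (low, high) percentage-correct bounds, z standard errors either side."""
    p = num_correct / num_items          # observed proportion correct
    se = sqrt(p * (1 - p) / num_items)   # binomial standard error of p
    low = max(0.0, p - z * se)           # keep the range inside 0..1
    high = min(1.0, p + z * se)
    return 100 * low, 100 * high

# A student with 35 of 70 items correct (50%).
low, high = binomial_score_range(35, 70)
print(f"{low:.1f}% to {high:.1f}%")
```

Under these assumptions the range is widest for scores near 50% and shrinks towards the extremes, which is why the error is called "conditional": it depends on the score.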
The FA35+ worksheet shows what would happen if we applied a cutoff score of 50% (35 items correct).
Suppose we said that students had to get a Form A score of 35 or more to pass. If we could then test them again with a carefully-developed parallel form (Form B in this example), would we find consistent results? Would the same students pass when tested again?
We will see that not all of them will.
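The check can be sketched as a small pass/fail cross-tabulation: apply the same cutoff to both forms and count how many students get the same decision twice. The function name and the six score pairs are hypothetical, chosen only so that each of the four outcomes appears; they are not taken from the FA35+ worksheet.

```python
def pass_fail_consistency(form_a, form_b, cutoff=35):
    """Cross-tabulate pass/fail decisions on two forms and return the agreement rate."""
    counts = {"pass_both": 0, "fail_both": 0, "pass_a_only": 0, "pass_b_only": 0}
    for a, b in zip(form_a, form_b):
        pass_a = a >= cutoff
        pass_b = b >= cutoff
        if pass_a and pass_b:
            counts["pass_both"] += 1
        elif not pass_a and not pass_b:
            counts["fail_both"] += 1
        elif pass_a:
            counts["pass_a_only"] += 1
        else:
            counts["pass_b_only"] += 1
    total = sum(counts.values())
    # Proportion of students classified the same way on both forms.
    agreement = (counts["pass_both"] + counts["fail_both"]) / total
    return counts, agreement

# Hypothetical scores (assumption, not the real sample).
form_a = [22, 35, 41, 48, 33, 36]
form_b = [25, 31, 44, 45, 37, 38]
counts, agreement = pass_fail_consistency(form_a, form_b)
print(counts, round(agreement, 2))
```

The inconsistent decisions tend to come from students whose scores sit close to the cutoff, where a few items either way flips the result.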
An "item analysis" should be done to see that the items are discriminating as we want them to -- on each item, the top students should do better than the bottom students.
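One simple way to put a number on this is an upper-lower discrimination index: the proportion correct in the top-scoring group minus the proportion correct in the bottom-scoring group. The sketch below assumes top and bottom groups of 27% each, a common convention; these notes do not say which index Lertap reports, and the function name and data are invented for illustration.

```python
def upper_lower_discrimination(item_correct, total_scores, fraction=0.27):
    """Proportion correct in the top group minus proportion correct in the bottom group."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Student indices ordered from lowest to highest total test score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = order[:group_size]
    upper = order[-group_size:]
    p_upper = sum(item_correct[i] for i in upper) / group_size
    p_lower = sum(item_correct[i] for i in lower) / group_size
    return p_upper - p_lower

# Hypothetical data for ten students: 1 = item answered correctly, 0 = not.
totals = [10, 20, 30, 40, 50, 60, 70, 65, 15, 45]
item = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
print(round(upper_lower_discrimination(item, totals), 2))
```

A clearly positive value means the item separates top from bottom students in the intended direction; a value near zero or below zero flags an item worth reviewing.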
I will show some item response plots.
It can be shown that reliability is related to item discrimination.
And, it can be shown that item discrimination for multiple-choice items requires effective distractors; the weak students will choose the distractors, but the top students will not.
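A distractor check can be sketched by counting, for one multiple-choice item, how often each option was chosen by the top group and by the bottom group. As before, the 27% groups, the function name, and the responses are assumptions made for this illustration; "A" is taken to be the keyed (correct) answer.

```python
from collections import Counter

def option_counts_by_group(responses, total_scores, fraction=0.27):
    """Map each chosen option to (count in top group, count in bottom group)."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Student indices ordered from lowest to highest total test score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = Counter(responses[i] for i in order[:group_size])
    upper = Counter(responses[i] for i in order[-group_size:])
    options = sorted(set(responses))
    return {opt: (upper[opt], lower[opt]) for opt in options}

# Hypothetical responses of ten students to one item keyed "A".
totals = [10, 20, 30, 40, 50, 60, 70, 65, 15, 45]
responses = ["A", "C", "B", "A", "A", "A", "A", "A", "D", "B"]
for option, (top, bottom) in option_counts_by_group(responses, totals).items():
    print(option, "top:", top, "bottom:", bottom)
```

For an effective item, the keyed option draws mostly top students while each distractor draws mostly bottom students; a distractor that nobody picks, or that attracts the top group, is not doing its job.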