Measuring Reliability

We distinguish among three types of reliability:

Test-retest reliability (a measure of stability)
Interrater reliability (a measure of agreement)
Internal consistency reliability (a measure of how correlated the items of a measure are with one another)

Test-retest and Interrater Reliability

Test-retest and interrater reliability are most often indexed with a product-moment correlation if our measure is continuous. To assess test-retest reliability, for example, we would give the measure to a group of people on two occasions, separated by a specified period of time. We then compute the correlation between the scores obtained on the two occasions. With test-retest reliability, we should ALWAYS specify the length of time when we are reporting the level of reliability (e.g., "The test-retest reliability for this measure over 10 weeks was .72."). Similarly, we assess interrater reliability by having two raters rate the same group of subjects and then computing the product-moment correlation between their ratings. The procedures for computing these reliability indices using SPSS for Windows are detailed elsewhere on this website.
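As an illustration of the computation described above, the following Python sketch computes a product-moment correlation between scores from two occasions. It is not the SPSS procedure referred to in the text; the function name pearson_r and the scores are hypothetical and were made up for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for six people tested on two occasions.
occasion_1 = [12, 15, 9, 20, 17, 11]
occasion_2 = [13, 14, 10, 18, 19, 12]

r = pearson_r(occasion_1, occasion_2)
print(f"Test-retest reliability: {r:.2f}")
```

The same function could be applied to interrater reliability by passing the ratings of the two raters instead of the scores from the two occasions.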

Note that unless you expect a characteristic to be stable over time, you should not expect high test-retest reliability. Sometimes we naively think that all measures should ideally have all types of reliability, but it depends entirely on what we are measuring and what its theoretical characteristics are.

For example, if we are measuring something like anxiety, we would expect that it would go up and down depending on the situation. Therefore, we would not generally expect high test-retest reliability. In fact, we could say it more strongly. If we did get high test-retest reliability on our anxiety measure, we might well question whether we are really measuring anxiety as we conceptualize this construct.

When the measure being used is categorical, typically a measure of percent agreement is used. For example, if we were categorizing patients on the basis of their clinical diagnoses, we would use an agreement measure. We could technically use an agreement measure for continuous data as well, although this is rarely done in practice. The percent agreement index is simply the percentage of times that two raters agree or the percentage of times that the same classification is made on retest. The procedures for computing this simple index are described in Chapter 9 of the text.
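The percent agreement index is simple enough to sketch in a few lines of Python. The function name percent_agreement and the diagnoses below are hypothetical, chosen only to illustrate the calculation.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of cases on which two sets of classifications match."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


# Hypothetical classifications of five patients by two raters.
rater_a = ["psychotic", "not psychotic", "psychotic", "not psychotic", "not psychotic"]
rater_b = ["psychotic", "not psychotic", "not psychotic", "not psychotic", "not psychotic"]

# The raters agree on 4 of the 5 patients.
print(percent_agreement(rater_a, rater_b))  # 80.0
```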

The problem with the percent agreement index is that it is easier to agree consistently if the number of categories is small or if most of the participants fall into a single category. If, for example, we were diagnosing patients as either psychotic or not psychotic (just two categories), this is a relatively easy discrimination compared to giving them precise diagnoses. If one rater thought that a patient qualified for the diagnosis of bipolar disorder and the other rater thought the patient qualified for the diagnosis of schizophrenia, they would still be in agreement, because both of these are considered psychotic disorders. However, if dozens of diagnoses are possible, the raters must agree exactly on which diagnosis applies to be in agreement. In other words, the amount of agreement will depend on how fine a discrimination is required.

The Kappa coefficient takes this into account and also takes into account the fact that it is easier to agree if almost everyone is in one category. Many advanced statistical packages can compute Kappa, although the version of SPSS for Windows included with this text does not. The procedures for the manual computation of Kappa are detailed elsewhere on this website.
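To make the chance correction concrete, here is a minimal Python sketch that assumes the Kappa in question is Cohen's kappa for two raters: observed agreement minus the agreement expected by chance, divided by one minus the agreement expected by chance. The function name cohens_kappa and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters classifying the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(ratings_a)
    # Observed proportion of agreement.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category proportions.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if p_chance == 1:
        raise ValueError("kappa is undefined when both raters use one category")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: 20 patients, almost all rated "not psychotic" (n).
rater_a = ["n"] * 17 + ["n", "p", "p"]
rater_b = ["n"] * 17 + ["p", "n", "p"]

# The raters agree on 18 of 20 patients (90 percent agreement), but
# chance agreement is .82 because nearly everyone is in one category.
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # kappa = 0.44
```

In this made-up example, 90 percent agreement shrinks to a kappa of about .44 once the ease of agreeing on a lopsided distribution is taken into account.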

Internal Consistency Reliability

Internal consistency reliability is a measure of how intercorrelated the items of a measure are. When the items of a measure are highly intercorrelated, it means that the items are all apparently measuring the same characteristic or trait.

We expect to get high internal consistency reliability when our measure is theoretically trait-like. For example, if we are measuring a construct such as "knowledge of research methods" with a 50-item test based on material in this textbook, we would think of this as a trait and would expect students who know a lot about research methods to do consistently well on most items and those who know very little to do consistently poorly on most items. If, on the other hand, we were measuring dozens of different knowledge domains in a single test, we would expect much less agreement among the items, with some people being knowledgeable on certain topics, while others are knowledgeable about different topics. In this latter case, we would expect low to moderate internal consistency reliability.

The most widely used index of internal consistency reliability is called coefficient alpha. Please note that coefficient alpha has nothing to do with the alpha level of statistical tests, even though the names are similar. Unfortunately, as in many areas of research, independent development of concepts has led to a confusing array of names and procedures for internal consistency reliability. We will briefly review them here, BUT we would like to make it clear that coefficient alpha is the most appropriate and most general of the internal consistency coefficients and is the one that should be used. Many statistical analysis packages will compute coefficient alpha for you, although the version of SPSS for Windows that can be bundled with this text does not.
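For readers without access to a package that computes coefficient alpha, the following Python sketch uses the standard formula: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name coefficient_alpha and the responses are hypothetical, and the layout (one row per respondent, one column per item) is an assumption of this example.

```python
from statistics import variance


def coefficient_alpha(scores):
    """Coefficient alpha for a list of rows (one row of item scores per person)."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent needs a score on every item")
    # Transpose so that each tuple holds all responses to one item.
    items = list(zip(*scores))
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical responses of five people to a four-item scale (1 to 5).
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

print(f"coefficient alpha = {coefficient_alpha(responses):.2f}")
```

The sketch uses sample variances throughout; alpha comes out the same with population variances, as long as the same kind is used for the items and for the total.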

Two other terms that are often used to refer to internal consistency reliability are split-half reliability and KR-20. Splitting the test into two halves and correlating those halves will give us some sense of how intercorrelated the items of the test are. There are two problems with this approach. The first is that there are many different ways in which one could split a test, and most of the time, different splits of the test will produce different levels of correlation. The second is that when you correlate half the test with the other half of the test, you get a measure of the reliability of half of the test, not the reliability of the whole test. There is a statistical method for estimating the reliability of the whole test (the Spearman-Brown formula), but it requires additional computations. Coefficient alpha is equal to the mean of all possible split-half reliabilities, and it is much easier to compute alpha directly than to compute all possible split-half reliabilities.
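The two-step split-half procedure can be sketched in Python as follows. The sketch assumes one particular split (odd-numbered versus even-numbered items) and uses the Spearman-Brown formula, r_full = 2 * r_half / (1 + r_half), as the method for estimating the reliability of the whole test. The function names and the responses are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate whole-test reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(scores):
    """Odd-even split-half reliability for rows of item scores."""
    # Score each half separately for every respondent.
    odd_half = [sum(row[0::2]) for row in scores]
    even_half = [sum(row[1::2]) for row in scores]
    return spearman_brown(pearson_r(odd_half, even_half))


# A half-test correlation of .50 corresponds to a whole-test estimate of .67.
print(f"{spearman_brown(0.5):.2f}")  # 0.67

# Hypothetical responses of five people to a four-item scale.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(f"split-half reliability = {split_half_reliability(responses):.2f}")
```

A different split of the same items (for example, first half versus second half) would generally give a different value, which is the first problem noted above.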

KR-20 is short for Kuder-Richardson formula #20, which is the 20th formula in a classic paper by Kuder and Richardson detailing this method of quantifying internal consistency reliability. The term KR-20 is sometimes used interchangeably with coefficient alpha, although this is not accurate. Technically, KR-20 is a computational formula for internal consistency that can be used ONLY when all items are dichotomous (either entirely right or entirely wrong). Many measures are of this nature. In such a case, KR-20 and coefficient alpha will give exactly the same reliability. However, if some or all of the items can be partially correct, then KR-20 cannot be used.
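The equivalence of KR-20 and coefficient alpha for dichotomous items can be checked with a short Python sketch. It assumes the usual form of KR-20, (k / (k - 1)) * (1 - sum of p * q / variance of total scores), where p is the proportion passing an item and q = 1 - p. Population variances are used in both functions so that p * q matches the item variance exactly. The function names and the 0/1 responses are hypothetical.

```python
from statistics import pvariance


def kr20(scores):
    """KR-20 for rows of dichotomous (0 or 1) item scores."""
    n = len(scores)
    k = len(scores[0])
    # Proportion of respondents answering each item correctly.
    p = [sum(item) / n for item in zip(*scores)]
    pq_sum = sum(p_i * (1 - p_i) for p_i in p)
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - pq_sum / total_variance)


def coefficient_alpha(scores):
    """Coefficient alpha computed from item and total-score variances."""
    k = len(scores[0])
    item_variance_sum = sum(pvariance(item) for item in zip(*scores))
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical right (1) / wrong (0) answers of six people to four items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

print(f"KR-20 = {kr20(answers):.4f}")
print(f"alpha = {coefficient_alpha(answers):.4f}")
```

The two printed values agree because the variance of a 0/1 item is exactly p * q. If any item allowed partial credit, p * q would no longer equal the item variance, and only the alpha function would remain appropriate.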

KR-20 was developed long before computers were available for statistical analysis. It was designed to simplify computation so that one could compute an index of reliability in just a few hours. With modern computers and computer software, KR-20 has been rendered obsolete.