
Reliability

This page focuses on psychometric reliability in the context of Evidence-Based Assessment. There are other more general and comprehensive discussions of reliability on Wikipedia and elsewhere.

Reliability describes the reproducibility of a score. It is expressed as a number that ranges from 0 (no reliable variance) to 1.0 (perfect reliability). Conceptually, the number can be thought of as the correlation of the score with itself. In classical test theory, reliability is defined as the ratio of true-score variance to observed-score variance; equivalently, it is the squared correlation of the observed score with the "true" score.

There are different facets of reliability. Internal consistency looks at whether different parts of the scale measure the same thing. One way of assessing internal consistency is to split the items randomly into two halves and then examine the correlation between the two half scores (split-half reliability). Cronbach's alpha is the most widely used form of internal consistency reliability. Conceptually, alpha is the average of all possible split-half versions. Internal consistency is the most widely reported form of reliability because it is the most convenient and least expensive to estimate, not because it is always the most appropriate choice.

Evaluating norms and reliability

Rubric for evaluating norms and reliability for assessments (extending Hunsley & Mash, 2008, 2018; *indicates new construct or category)
Norms
  Adequate: Mean and standard deviation for total score (and subscores if relevant) from a large, relevant clinical sample
  Good: Mean and standard deviation for total score (and subscores if relevant) from multiple large, relevant samples, at least one clinical and one nonclinical
  Excellent: Same as "good," but must be from a representative sample (i.e., random sampling, or matching to census data)
  Too good: Not a concern

Internal consistency (Cronbach's alpha, split-half, etc.)
  Adequate: Most evidence shows Cronbach's alpha values of .70 to .79
  Good: Most reported alphas .80 to .89
  Excellent: Most reported alphas ≥ .90
  Too good: Alpha is also tied to scale length and content coverage; very high alphas may indicate that the scale is longer than needed, or that it has a very narrow scope

Inter-rater reliability
  Adequate: Most evidence shows kappas of .60 to .74, or intraclass correlations (ICCs) of .70 to .79
  Good: Most reported kappas .75 to .84, or ICCs .80 to .89
  Excellent: Most kappas ≥ .85, or ICCs ≥ .90
  Too good: Very high levels of agreement are often achieved by re-rating from audio or transcript

Test-retest reliability (stability)
  Adequate: Most evidence shows test-retest correlations ≥ .70 over a period of several days or weeks
  Good: Most evidence shows test-retest correlations ≥ .70 over a period of several months
  Excellent: Most evidence shows test-retest correlations ≥ .70 over a year or longer
  Too good: The key consideration is an appropriate time interval; many constructs would not be stable for years at a time

*Repeatability
  Adequate: Bland-Altman plots (Bland & Altman, 1986) show small bias and/or weak trends; coefficient of repeatability is tolerable compared to clinical benchmarks (Vaz, Falkmer, Passmore, Parsons, & Andreou, 2013)
  Good: Bland-Altman plots and corresponding regressions show no significant bias and no significant trends; coefficient of repeatability is tolerable
  Excellent: Bland-Altman plots and corresponding regressions show no significant bias and no significant trends across multiple studies; coefficient of repeatability is small enough that it is not clinically concerning
  Too good: Not a concern
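Following one common definition (Bland & Altman, 1986), the coefficient of repeatability can be computed as 1.96 times the standard deviation of the test-retest differences, alongside the mean difference (bias) plotted in a Bland-Altman analysis. The scores below are hypothetical values for illustration only.

```python
import statistics

# Hypothetical test-retest scores for 10 patients (illustrative values only).
time1 = [12, 15, 11, 20, 18, 14, 16, 13, 19, 17]
time2 = [13, 14, 12, 21, 17, 15, 15, 14, 20, 18]

diffs = [b - a for a, b in zip(time1, time2)]
bias = statistics.fmean(diffs)       # mean difference (systematic shift)
cr = 1.96 * statistics.stdev(diffs)  # coefficient of repeatability
loa = (bias - cr, bias + cr)         # 95% limits of agreement for the plot

print(round(bias, 2), round(cr, 2))
```

A Bland-Altman plot would show each patient's difference against the pair's mean, with horizontal lines at the bias and the two limits of agreement; whether the resulting coefficient is "tolerable" depends on the clinical benchmarks for the construct.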
This article is issued from Wikiversity. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.