COMPENDIUM OF PRIMERS
Evaluating Statistics
Probability and Odds and Interpreting Their Ratios
Type I and Type II Errors
Absolute vs. Relative Differences
Correlation Coefficients
95% CIs for the Number Needed to Treat
Statistical Significance and P values
95% Confidence Intervals
Evaluating Study Designs
Before–After Studies: Evaluating a Report of a "Successful" Intervention
Group Randomized Trials
Cost-Effectiveness Analysis
Interpreting Surveys
Utilities
Scores: What Counts?
Special Topics
Lead-Time, Length, and Overdiagnosis Biases
Dissecting a Medical Imperative
HEDIS
Geographic Variation in Health Care
American College of Physicians-
American Society of Internal Medicine
Community Health Plans
Primer on Probability and Odds and Interpreting Their Ratios
Chance is measured by using either probabilities (a ratio of occurrence to the whole) or odds (a ratio of occurrence to nonoccurrence). Consider measuring the chance of breast-feeding among 1000 new mothers. If 600 ultimately breast-feed, the probability of breast-feeding is 600/1000, or 0.6 (often expressed as 60%), whereas the odds of breast-feeding are 600/400, or 1.5 (often expressed as 1.5 to 1). Table 1 summarizes the characteristics of probability and odds.
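[Table 1. Characteristics of probability and odds. Only one row label survives, "Transformation to other measure," with the fragment "1 – probability."]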
Probabilities and odds contain the same information and are equally valid measures of chance. In the case of infrequent events (i.e., probability < 0.1 or 10%), the distinction is unimportant (probability and odds have essentially the same value). However, as shown in Table 2, probability and odds take on very different values as the chance of an event increases.
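The relationship between the two measures can be sketched in a few lines of Python. This is an illustration only: the function names are invented here, and the probabilities in the loop are arbitrary values chosen to show where the two measures agree and where they part ways.

```python
def probability_to_odds(p: float) -> float:
    """Odds = probability / (1 - probability)."""
    if not 0 <= p < 1:
        raise ValueError("probability must be in [0, 1)")
    return p / (1 - p)


def odds_to_probability(odds: float) -> float:
    """Probability = odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


if __name__ == "__main__":
    # Breast-feeding example from the text: 600 of 1000 new mothers.
    print(probability_to_odds(600 / 1000))
    # Illustrative values: rare events give nearly equal probability and
    # odds; common events give very different numbers.
    for p in (0.01, 0.05, 0.10, 0.50, 0.90):
        print(f"probability {p:.2f} -> odds {probability_to_odds(p):.3f}")
```

For a probability of 0.01 the odds are about 0.0101, whereas for a probability of 0.9 the odds are about 9.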
Although probabilities are often reported in the medical literature, it is rare to see odds reported. On the other hand, ratios of probabilities (i.e., relative risks, or risk ratios [RRs]) and odds (i.e., odds ratios [ORs]) are seen often. And it is in these ratios of ratios that the distinction between probability and odds may be both important and ambiguous.
When the chances of common events are being compared, ORs and RRs substantially diverge in value. Let's return to the breast-feeding example. Imagine a randomized trial of a lactation-support system. The probability of breast-feeding in the control group is 60% (or an odds of 1.5); in the intervention group, it is 90% (or an odds of 9). Table 3 shows that the relative risk is 1.5 while the odds ratio is 6.
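Table 3. Breast-feeding example (values as given in the text)
PROBABILITY OF BREAST-FEEDING: control group, 60% (odds 1.5); intervention group, 90% (odds 9)
RELATIVE RISK (INTERVENTION VS. CONTROL): 0.9/0.6 = 1.5
ODDS RATIO (INTERVENTION VS. CONTROL): 9/1.5 = 6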
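The two ratios in the breast-feeding trial can be reproduced with a short sketch. The function names are illustrative, and the rare-event comparison in the example (2% vs. 1%) is a hypothetical added here to show that the two ratios nearly coincide when events are infrequent.

```python
def relative_risk(p_intervention: float, p_control: float) -> float:
    """Ratio of two probabilities."""
    return p_intervention / p_control


def odds_ratio(p_intervention: float, p_control: float) -> float:
    """Ratio of two odds, each computed as p / (1 - p)."""
    odds_intervention = p_intervention / (1 - p_intervention)
    odds_control = p_control / (1 - p_control)
    return odds_intervention / odds_control


if __name__ == "__main__":
    # Lactation-support trial from the text: 90% vs. 60% breast-feeding.
    print(relative_risk(0.90, 0.60), odds_ratio(0.90, 0.60))
    # Hypothetical rare event: 2% vs. 1%.
    print(relative_risk(0.02, 0.01), odds_ratio(0.02, 0.01))
```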
Effective Clinical Practice ■
May/June 2000 Volume 3 Number 3
In general, ORs are more extreme (i.e., farther away from 1) than are RRs. ORs that are greater than 1 exaggerate the increase in risk (i.e., OR > RR); ORs that are less than 1 exaggerate the decrease in risk (i.e., OR < RR). Practically speaking, the discrepancy between the two measures is relevant only when relatively common events are being compared. Readers should begin to worry about the distinction when baseline probabilities exceed 10% to 20%. And, as shown in Table 4, they might reasonably pursue a conversion when baseline probabilities are greater than 50%.
It is important to emphasize that ORs and RRs are equally valid, but different, measures. Readers are seeing more and more ORs in the medical literature, largely because of the increased use of logistic regression. Because most people are more familiar with probabilities than odds, ORs are often interpreted as RRs. When events are common, this misinterpretation substantially exaggerates the association being reported. If the goal is clarity, the probability (or absolute event rate) for each group is tough to beat.
Talfryn H, Davies O, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ. 1998;316:989-91.
Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratios in cohort studies of common outcomes. JAMA. 1998;280:1690-1.
[Table 4. Approximate relative risk for odds ratios greater than 1 and approximate relative risk for odds ratios less than 1. The table values did not survive extraction.]
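A conversion from an odds ratio to an approximate relative risk can be sketched as follows. The formula is the one proposed in the Zhang and Yu article cited above, RR = OR / ((1 - P0) + P0 x OR), where P0 is the baseline (control group) probability. Whether Table 4 was built with exactly this formula is an assumption, and the function name is illustrative.

```python
def odds_ratio_to_relative_risk(odds_ratio: float, baseline_probability: float) -> float:
    """Approximate RR from an OR and the control-group probability P0.

    RR = OR / ((1 - P0) + P0 * OR)
    """
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    if not 0 <= baseline_probability <= 1:
        raise ValueError("baseline probability must be in [0, 1]")
    p0 = baseline_probability
    return odds_ratio / ((1 - p0) + p0 * odds_ratio)


if __name__ == "__main__":
    # Breast-feeding trial: OR of 6 with a 60% baseline probability.
    print(odds_ratio_to_relative_risk(6, 0.60))
    # With a rare baseline event the OR and RR are nearly the same.
    print(odds_ratio_to_relative_risk(2, 0.01))
```

Applied to the breast-feeding trial (OR of 6, baseline probability of 60%), the conversion returns the relative risk of 1.5 reported in the text.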
Primer on Type I and Type II Errors
Statistical tests are tools that help us assess the role of chance as an explanation of patterns observed in data. The most common "pattern" of interest is how two groups compare in terms of a single outcome. After a statistical test is performed, investigators (and readers) can arrive at one of two conclusions:
1) The pattern is probably not due to chance (i.e., in common jargon, "There was a significant difference" or "The study was positive").
2) The pattern is likely due to chance (i.e., in common jargon, "There was no significant difference" or "The study was negative").
No matter how well the study is performed, either conclusion may be wrong. As shown in the Table below, a mistake about the first conclusion is labeled a type I error and a mistake about the second is labeled a type II error.
Table. Possible error for each conclusion
"Positive" study: Type I error
"Negative" study (no significant difference): Type II error
Note that a type I error is possible only in a positive study, and a type II error is possible only in a negative study. Thus, this is one of the few areas of medicine where you can make only one mistake at a time.
Type I Errors
A type I error is analogous to a false-positive result during diagnostic testing: A difference is shown when in "truth" there is none. Researchers have long been concerned about making this mistake and have conventionally demanded that the probability of a type I error be less than 5%. This convention is operationalized in the familiar critical threshold for P values: P must be less than 0.05 before we conclude that a study is positive. This means we are willing to accept that in 100 positive studies, at most 5 will be due to chance alone. The probability that a type I error has occurred in a positive study is the exact P value reported. For example, if the P value is 0.001, then the probability that the study has yielded false-positive results is 1 in 1000.*
Type II Errors
A type II error is analogous to a false-negative result during diagnostic testing: No difference is shown when in "truth" there is one. Traditionally, this error has received less attention from researchers than type I error and, consequently, may occur more often. Type II errors are generally the result of a researcher studying too few participants. To avoid the error, some researchers perform a sample size calculation before beginning a study and, as part of the calculation, assert what a "true difference" is and accept that they will miss it 10% to 20% of the time (i.e., type II error rate of 0.1 or 0.2). Regardless of how a study was planned, when faced with a negative study readers must be aware of the possibility of a type II error. Determining the likelihood of such an error is not a simple calculation but a judgment.
Role of 95% CIs in Assessing Type II Errors
The best way to decide whether a type II error exists is to ask two questions: 1) Is the observed effect clinically important? and 2) To what extent does the confidence interval include clinically important effects? The more important the observed effect and the more the confidence interval includes important effects, the more likely that a type II error exists.
To gain some experience with this approach, consider the confidence intervals from three hypothetical randomized trials in the Figure. Each trial addresses the efficacy of an intervention to prevent a localized cancer from spreading. The outcome is the relative risk (RR) of metastasis (ratio of the risk in the intervention group over the risk in the control group). The interventions are not trivial, and you assert that you only consider risk reductions of greater than 10% to be clinically important. Note that each confidence interval includes 1; that is, each study is negative. There are no "significant differences" here. Which study is most likely to have a type II error?
*This statement only considers the role of chance. Readers should be aware, however, that observed patterns may also be the result of bias.
Effective Clinical Practice ■
November/December 2001 Volume 4 Number 6
FIGURE. Role of 95% CIs in assessing type II errors.
Study A: relative risk (RR) 1.0 (95% CI, 0.9 to 1.1)
Study B: RR 1.0 (95% CI, 0.5 to 1.5)
Study C: RR 0.7 (95% CI, 0.48 to 1.02)
Study A suggests that the intervention has no effect (i.e., the relative risk is 1) and is very precise (i.e., the confidence interval is narrow). You can be confident that it is not missing an important difference. In other words, you can be confident that there's no type II error.
Study B suggests that the intervention has no effect (i.e., the RR is 1) but is very imprecise (i.e., the confidence interval is wide). This study may be missing an important difference. In other words, you should be worried about type II error, but this study is just as likely to be missing an important harmful effect as an important beneficial one. A type II error is possible, and it could be in either direction.
Study C suggests that the intervention has a clinically important beneficial effect (i.e., the RR is much less than 1) and is also very imprecise. Most of the confidence interval includes clinically important beneficial effects. Consequently, a type II error is very likely. This is a study you would like to see repeated using a larger sample.
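The second question (how much of the confidence interval covers clinically important effects) can be given a rough numerical form. The sketch below measures the share of each interval's width lying below an RR of 0.9, the "greater than 10% risk reduction" threshold asserted in the text. It is a crude device invented for this illustration: it treats every value in the interval as equally plausible and works on the plain RR scale, neither of which is strictly true, so it supports the judgment rather than replacing it.

```python
def fraction_of_ci_below(ci_low: float, ci_high: float, threshold: float) -> float:
    """Share of the interval's width that lies below `threshold`."""
    if ci_high <= ci_low:
        raise ValueError("ci_high must exceed ci_low")
    covered = min(ci_high, threshold) - ci_low
    return max(0.0, covered) / (ci_high - ci_low)


# (point estimate, lower limit, upper limit) for the three hypothetical trials
STUDIES = {
    "A": (1.0, 0.9, 1.1),
    "B": (1.0, 0.5, 1.5),
    "C": (0.7, 0.48, 1.02),
}
IMPORTANT_RR = 0.9  # a risk reduction greater than 10%

if __name__ == "__main__":
    for name, (rr, low, high) in STUDIES.items():
        share = fraction_of_ci_below(low, high, IMPORTANT_RR)
        print(f"Study {name}: RR {rr}, {share:.0%} of CI shows an important benefit")
```

By this measure none of study A's interval, about 40% of study B's, and about 78% of study C's lies in the clinically important range, which matches the ordering described in the text.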
Primer on Absolute vs. Relative Differences
When presenting data comparing two or more groups, researchers (and reporters) naturally focus on differences. Compared with others, one group may (pick one): cost more, have longer hospital stays, or have higher complication rates. These relations may be expressed as either absolute or relative differences. An absolute difference is a subtraction; a relative difference is a ratio. Because this choice may influence how big a difference "feels," readers need to be alert to the distinction.
When the units are counts, such as dollars, the distinction between absolute and relative differences is obvious: group 1 costs $30,000 more; group 1 had 40% higher costs. But when the units are percentages (frequently used to describe rates, probabilities, and proportions), it can be difficult to determine whether a stated difference is absolute or relative.
Consider the risk for blindness in a patient with diabetes over a 5-year period. If the risk for blindness is 2 in 100 (2%) in a group of patients treated conventionally and 1 in 100 (1%) in patients treated intensively, the absolute difference is derived by simply subtracting the two risks:
2% - 1% = 1%
Expressed as an absolute difference, intensive therapy reduces the 5-year risk for blindness by 1%.
The relative difference is the ratio of the two risks. (NB: Relative risk, relative rate, rate ratios, and odds ratios are all examples of relative differences.) Given the data above, the relative difference is:
1% / 2% = 0.5
Expressed as a relative difference, intensive therapy reduces the risk for blindness by half.
Both expressions have their place. Without any qualification, both statements ("reduced the risk by 1%" and "reduced the risk by 50%") could be construed as representing either an absolute or relative difference. But most important, note the difference in "feel." A statement of "reduced the risk by 1%" does feel like a smaller effect than "reduced the risk by 50%."
The most frequent problem readers will face is the reporting of an isolated relative difference. Research abstracts, medical review articles, and general circulation newspapers and magazines are filled with statements like "60% decrease in costs," "twice as many days in the hospital," or "20% decrease in mortality." These statements provide no information about the starting point. For example, the statement, "The risk for disease X was cut in half" gives no information about where you started. As shown in the Table below, there is a wide range of risks that can be cut in half.
Table. RISK FOR DISEASE (surviving rows)
20% (2/10) cut in half to 10% (1/10)
2% (2/100) cut in half to 1% (1/100)
Consequently, when you're presented with a relative difference ("60% more") and you really want to get a complete picture of what's going on, make sure you ask the question, "From what?" If the goal is clarity, the actual data (the dollars, the hospital days, and the mortality rates) for each group are tough to beat.
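The two ways of expressing the blindness example can be written out as a small sketch. The function names are illustrative. The second call uses the 20% vs. 10% row from the table to show that very different starting points share the same relative difference.

```python
def absolute_difference(risk_control: float, risk_treated: float) -> float:
    """A subtraction: how many percentage points the risk falls."""
    return risk_control - risk_treated


def relative_difference(risk_control: float, risk_treated: float) -> float:
    """A ratio: the treated risk as a fraction of the control risk."""
    return risk_treated / risk_control


if __name__ == "__main__":
    for control, treated in ((0.02, 0.01), (0.20, 0.10)):
        abs_diff = absolute_difference(control, treated)
        rel_diff = relative_difference(control, treated)
        print(f"{control:.0%} -> {treated:.0%}: "
              f"absolute difference {abs_diff:.0%}, relative difference {rel_diff}")
```

Both rows have a relative difference of 0.5 ("cut in half"), but the absolute differences are 1 and 10 percentage points, which is why the question "From what?" matters.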
Effective Clinical Practice ■
November/December 1999 Volume 2 Number 6
Primer on Correlation Coefficients
Researchers are often interested in how two continuous variables relate to one another. To examine the relationship between body mass and fasting blood sugar, for example, one might study 20 people and measure both variables in each. The simplest approach to examine the relationship is to draw a picture, a scatterplot (an x–y graph), of body mass vs. fasting blood sugar. In this case, there are 20 dots, each representing one person. Scatterplots of other relationships may involve different units of analysis, as shown in Table 1.
Any of these relationships can also be quantified by a single number: the correlation coefficient, also known as r. Because journals frequently only publish the number (and not the picture), this primer offers three questions to help readers visualize and interpret correlation coefficients.
Table 1.
VARIABLE 1 | VARIABLE 2 | UNIT OF ANALYSIS
Body mass | Fasting blood sugar | Individual
[Other surviving entries, row alignment lost in extraction: Pneumococcal vaccination; Years in practice; Pap smear compliance; Physicians per capita; Death rate]
What Is the Sign on the Coefficient?
The first step is to look at the sign on r. If r is a positive number, the variables are directly related. In other words, as one goes up, so does the other (height and weight are a good example). If r is a negative number, the variables are inversely related. In other words, as one goes up, the other goes down (an example might be age and exercise capacity in adults). Knowing the sign helps you visualize the slope in the scatterplot, as shown in Figure 1.
FIGURE 1. Scatterplots when r is positive (directly related) and when r is negative (inversely related).
What Is the Magnitude of the Coefficient?
The next step is to consider how big r is; r ranges from –1 to 1. An r of 0 signifies absolutely no correlation, whereas an r of –1 or 1 signifies a perfect correlation (all the data points fall on a line). In practice, r always has some intermediate value: there's always some correlation between two variables, but it's never perfect. The bigger the absolute value of r (i.e., the closer to –1 or 1), the stronger the correlation. The smaller the absolute value (i.e., the closer to 0), the weaker the correlation.
To provide perspective on what various r's look like, Figure 2 shows three positive correlation coefficients and their associated scatterplots. (The scatterplots for the negative correlation coefficients would simply be mirror images.) Note that it may be difficult to see a relationship when r is between –0.3 and 0.3.
Effective Clinical Practice ■
March/April 2001 Volume 4 Number 2
FIGURE 2.
The absolute magnitude of r is also a major determinant of statistical significance (the other being the number of observations). Consider 20 observations as depicted above. An r of 0.3 (a weak correlation) has an associated P value of 0.2. The P value falls with stronger correlations: P = 0.005 for an r of 0.6 and P < 0.0001 for an r of 0.9.
Does the Coefficient Reflect a General Relationship or an Outlier?
A critical reader will want to consider if seeing a scatterplot might influence the interpretation of r. As shown in Figure 3, a single extreme data point (an outlier) can have a powerful effect on the correlation coefficient when the sample size is small.
To mitigate this problem, r is often recalculated substituting ranks for the raw data. (An r calculated using raw data is called a Pearson r, while an r calculated using ranks is called a Spearman r. A reported r should be assumed to be Pearson r unless otherwise noted.)
For example, fasting blood sugar levels of 610, 320, 290, and 280 mg/dL would be converted to ranks 1, 2, 3, and 4; body weights of 350, 270, 220, and 210 lb would be converted to ranks 1, 2, 3, and 4; and the data point (610, 220) would become (1, 3). This recalculation does not eliminate the effect of outliers, but it does help to dampen their effects (in Figure 3, from left to right the recalculated r's are 0.56, 0.62, and 0.37). In small samples, this recalculation can be particularly important.
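The rank substitution can be sketched in plain Python. The helper names are illustrative, and the code assumes there are no tied values. Only the pairing (610, 220) comes from the text; the other three pairings in the example data are hypothetical, chosen so that the four blood sugar and four weight values match those listed above.

```python
from math import sqrt


def ranks_descending(values):
    """Rank 1 = largest value, as in the text's example. Assumes no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from raw values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def spearman_r(xs, ys):
    """Spearman correlation: the Pearson r of the ranks."""
    return pearson_r(ranks_descending(xs), ranks_descending(ys))


if __name__ == "__main__":
    blood_sugar = [610, 320, 290, 280]   # mg/dL
    body_weight = [220, 350, 270, 210]   # lb; pairing is hypothetical except (610, 220)
    print(ranks_descending(blood_sugar), ranks_descending(body_weight))
    print(pearson_r(blood_sugar, body_weight), spearman_r(blood_sugar, body_weight))
```

With these values the data point (610, 220) becomes (1, 3) after ranking, as described in the text. Because the extreme reading of 610 counts only as "rank 1," it cannot pull the coefficient as far as it does in the raw-data calculation.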
FIGURE 3.
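The P values quoted earlier for 20 observations (about 0.2 for an r of 0.3, about 0.005 for an r of 0.6, and below 0.0001 for an r of 0.9) can be roughly reproduced with the sketch below. It uses the Fisher z approximation and the standard normal distribution from the Python standard library. That choice is an assumption made here for simplicity: the exact test is usually based on a t distribution, so the results are close to, but not identical to, the figures in the text.

```python
from math import atanh, sqrt
from statistics import NormalDist


def approx_p_value_for_r(r: float, n: int) -> float:
    """Two-sided P value for a correlation coefficient (Fisher z approximation)."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    for r in (0.3, 0.6, 0.9):
        print(f"r = {r}, n = 20: approximate P = {approx_p_value_for_r(r, 20):.4g}")
    # The same r becomes "significant" with more observations.
    print(f"r = 0.3, n = 100: approximate P = {approx_p_value_for_r(0.3, 100):.4g}")
```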
Although correlation coefficients are an efficient way to communicate the relationship between two variables, they are not sufficient to interpret a relationship. The unit of analysis also matters. For example, a strong positive correlation between influenza and pneumococcal vaccination rates measured among physicians should be interpreted differently than the same coefficient measured among clinics. The former may imply that physicians have different beliefs about vaccinations, whereas the latter may simply reflect that clinics differ in the resources devoted to vaccination (e.g., reminder systems, nurse-run vaccination).
Finally, correlation coefficients do not communicate information about whether one variable moves in response to another. There is no attempt to distinguish between the two variables, that is, to establish one as dependent and the other as independent. Thus, relationships identified using correlation coefficients should be interpreted for what they are: associations, not causal relationships.
A compendium of ecp primers from past issues can be viewed and/or requested at http://www.acponline.org/journals/ecp/primers.htm.
Primer on 95% CIs for the Number Needed To Treat
Few, if any, therapeutic interventions benefit every patient. One way to gauge the likelihood that one patient will benefit is to calculate the number needed to treat (NNT), that is, the number of patients who must be treated for one to benefit. The general approach is as follows:
  Percentage with outcome (standard treatment)
- Percentage with outcome (new treatment)
= Absolute risk reduction
100 / Absolute risk reduction = Number needed to treat
For example, consider a randomized trial in which 50% of the participants die in the control group and 40% die in the intervention group. The absolute risk reduction for death is thus 10%, and the NNT to avoid a death is 10 (100/10).* This treatment would be preferred over a competing treatment whose NNT to avoid death was 20.
NNT can be calculated using any dichotomous outcome (an outcome that a patient either experiences or does not experience). In most cases, the NNT is calculated by using an adverse outcome, one that most persons would prefer to avoid (e.g., angina, myocardial infarction, cardiac death, any death). But because different outcomes are possible, an NNT of 10 is not always preferable to an NNT of 20 (e.g., if the former were for angina and the latter for any death). Therefore, an NNT should always be accompanied by a clearly specified outcome.
As is the case with all variables measured in research, the NNT is an estimate. The precision of the estimate is largely a function of how many people were studied and is reflected by using a 95% CI. The 95% CI for an NNT is the range of values in which we would expect to find the "true" NNT 95% of the time.† In some cases, the range may also include the possibility of harm. A 95% CI for an NNT that contains the possibility for both harm and benefit passes through infinity. In other words, an intervention with no effect has an NNT of infinity. This notion is probably most easily understood by considering the continuum of possible NNTs:
[Figure: continuum of possible NNTs, running from increasing harm (number needed to treat to harm one person) through infinity to benefit (number needed to treat to benefit one person).]
95% CIs for NNTs that contain the possibility of both harm and benefit are probably best communicated graphically. Altman introduced the concept in a recent article in the BMJ (1) and proposed the following labels: NNT (benefit) and NNT (harm).
The importance of a graphic display is best demonstrated by example. Consider 95% CIs for NNTs for lipid-lowering therapy. The outcome is death from any cause. In the Scandinavian Simvastatin Survival Study (4S) (2) (which studied simvastatin in patients who had either angina or previous myocardial infarction), the 95% CIs for NNTs did not pass through infinity. The NNT (benefit) was 30 (95% CI, 19 to 68). In the Air Force Coronary/Texas Atherosclerosis Prevention Study (AFCAPS/TexCAPS) (3) (which studied lovastatin in patients without heart disease who had normal cholesterol levels), however, the CI does pass through infinity. The NNT (harm) was 1130 (95% CI, NNT (benefit) 153 to ∞ to NNT (harm) 120). For most of us, these data would be better summarized in a figure:
[Figure: 95% CIs for the NNTs in the two trials. Outcome: any death. Surviving label: (primary prevention).]
NNT and the 95% CIs for NNT are relatively new concepts. Whether they represent a genuine advance in communicating data to clinicians is unknown. As always, we are interested in
1. Altman DG. Confidence intervals for the number needed to treat. BMJ. 1998;317:1309-12.
2. Scandinavian Simvastatin Survival Study Group. Randomized trial of cholesterol lowering in 4444 patients with coronary heart disease: the Scandinavian Simvastatin Survival Study (4S). Lancet. 1994;344:1383-9.
3. Downs JR, Clearfield M, Weis S, et al. Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: results of AFCAPS/TexCAPS. JAMA. 1998;279:1615-22.
*For readers who prefer decimals, NNT = 1/Absolute risk reduction. In this example, 1/0.1 or 10.
†Apologies to statistical purists who would direct the reader toward a more formal definition for a 95% CI: "The interval computed from the sample data which, were the study repeated multiple times, would contain the unknown parameter 95% of the time."
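The NNT arithmetic, and the way a confidence interval "passes through infinity," can be sketched as follows. The function names are illustrative. The primer reports the confidence limits only as NNTs, so the absolute risk reduction (ARR) limits used in the example are back-calculated here as 100/NNT, with a negative ARR standing for harm; they are not figures taken from the trials themselves.

```python
import math


def number_needed_to_treat(pct_standard: float, pct_new: float) -> float:
    """NNT = 100 / absolute risk reduction (in percentage points).

    Returns infinity when there is no difference; a negative result means harm.
    """
    arr = pct_standard - pct_new
    return math.inf if arr == 0 else 100 / arr


def _nnt_from_arr(arr_pct: float) -> float:
    return math.inf if arr_pct == 0 else 100 / abs(arr_pct)


def describe_nnt_interval(arr_low: float, arr_high: float) -> str:
    """Express a CI for the ARR (percentage points) in NNT language."""
    if arr_low > arr_high:
        raise ValueError("arr_low must not exceed arr_high")
    if arr_low > 0:   # whole interval shows benefit
        return (f"NNT (benefit) {_nnt_from_arr(arr_high):.0f} "
                f"to {_nnt_from_arr(arr_low):.0f}")
    if arr_high < 0:  # whole interval shows harm
        return (f"NNT (harm) {_nnt_from_arr(arr_low):.0f} "
                f"to {_nnt_from_arr(arr_high):.0f}")
    # The interval includes an ARR of 0, so it passes through infinity.
    return (f"NNT (benefit) {_nnt_from_arr(arr_high):.0f} to infinity "
            f"to NNT (harm) {_nnt_from_arr(arr_low):.0f}")


if __name__ == "__main__":
    print(number_needed_to_treat(50, 40))                # worked example in the text
    print(describe_nnt_interval(100 / 68, 100 / 19))     # limits quoted for 4S
    print(describe_nnt_interval(-100 / 120, 100 / 153))  # limits quoted for AFCAPS/TexCAPS
```

The design point is that the interval is computed on the ARR scale, where "no effect" is simply 0, and converted to NNTs only for display.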
Effective Clinical Practice ■
May/June 1999 Volume 2 Number 3
Primer on Statistical Significance and P Values
In the world of medical journals, few phrases evoke more authority than "the differences observed were statistically significant." Unfortunately, readers frequently accord too much importance to this statement and are often distracted from more pressing issues. This Primer reviews the meaning of the term statistical significance and includes some important caveats for critical readers to consider whenever it is used.
Assessing the Role of Chance
Consider a study of a new weight loss program: Group A receives the intervention and loses an average of 10 pounds, while group B serves as a control and loses an average of 3 pounds. The main effect of the weight loss program is therefore estimated to be a 7-pound weight loss (on average). But we would rarely expect that any two groups would have exactly the same amount of weight change. So could it just be chance that group A lost more weight?
There are two basic statistical methods used to assess the role of chance: confidence intervals (the subject of next issue's Primer) and hypothesis testing. As shown in the Figure below, both use the same fundamental inputs.
FIGURE 1. Statistical approach to comparing two groups. Inputs: 1. the main effect (the difference in the outcome); 2. the variance in the main effect. Hypothesis testing: state a null hypothesis (the main effect is 0), then calculate the test statistic (main effect / variance) to determine the P value. Confidence interval: 95% confidence interval around the main effect.
Hypothesis testing goes on to consider a condition, the null hypothesis, that no difference exists. In this case, the null hypothesis is that the weight change in the two groups is the same. The test addresses the question, "If the true state of affairs is no difference (i.e., the null hypothesis is true), what is the probability of observing this difference (i.e., 7 lbs) or one more extreme (i.e., 8 lbs, 9 lbs, etc.)?" This probability is called the P value and, for most of us, translates roughly to "the probability that the observed result is due to chance."
If the P value is less than 5%, researchers typically assert that the findings are "statistically significant." In the case of the weight loss program, if the chance of observing a difference of 7 pounds or more (when, in fact, none exists) is less than 5 in 100, then the weight loss program is presumed to have a real effect.
Table 1. Relationship between common language and the statistical language of hypothesis testing
P < 0.05: "Unlikely due to chance"; the null hypothesis was rejected
P > 0.05: "Due to chance"; the null hypothesis could not be rejected
Table 1 shows how our common language relates to the statistical language of hypothesis testing.
Factors That Influence P Values
Statistical significance (meaning a low P value) depends on three factors: the main effect itself and the two factors that make up the variance. Here is how each relates to the P value:
• The magnitude of the main effect. A 7-lb difference will have a lower P value (i.e., more likely to be statistically significant) than a 1-lb difference.
• The number of observations. A 7-lb difference observed in a study with 500 patients in each group will have a lower P value than a 7-lb difference observed in a study with 25 patients in each group.
• The spread in the data (commonly measured as a standard deviation). If everybody in group A loses about 10 pounds and everybody in group B loses about 3 pounds, the P value will be lower than if there is a wide variation in individual weight changes (even if the group averages remain at 10 and 3 pounds). Note: More observations do not reduce spread in data.
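The three factors can be explored with a small sketch. It makes several assumptions that are not part of the primer's own method: the groups are equal in size, the standard deviation is known and identical in both groups (20 lb is borrowed from the footnote to this primer's Table 2), and a normal approximation is used in place of a t test. The test statistic is formed by dividing the main effect by its standard error, which is the usual form of the ratio the figure describes as main effect / variance.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_sided_p(mean_difference: float, sd: float, n_per_group: int) -> float:
    """Approximate two-sided P value for a difference between two group means."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Magnitude of the main effect: 7 lb vs. 1 lb (25 per group, SD 20 lb).
    print(approx_two_sided_p(7, 20, 25), approx_two_sided_p(1, 20, 25))
    # Number of observations: 25 vs. 500 per group.
    print(approx_two_sided_p(7, 20, 25), approx_two_sided_p(7, 20, 500))
    # Spread in the data: SD of 20 lb vs. a hypothetical SD of 5 lb.
    print(approx_two_sided_p(7, 20, 25), approx_two_sided_p(7, 5, 25))
```

Under these assumptions the 7-lb difference is not significant with 25 patients per group (P of roughly 0.2) but is highly significant with 500 per group.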
2001 American College of Physicians–American Society of Internal Medicine
Caveats about the Importance of P Values
Unfortunately, P values and statistical significance are often accorded too much weight. Critical readers should bear three facts in mind:
1. The P < 0.05 threshold is wholly arbitrary.
There is nothing magical about a 5% chance; it's simply a convenient convention and could just as easily be 10% or 1%. The arbitrariness of the 0.05 threshold is most obvious when P values are near the cut-off. To call one finding significant when the P value is 0.04 and another not significant when it is 0.06 vastly overstates the difference between the two findings.
Critical readers should also realize that dichotomizing P values into simply "significant" and "insignificant" loses information in the same way that dichotomizing any clinical laboratory value into "normal" and "abnormal" does. Although serum sodium levels of 115 and 132 are both below normal, the former is of much greater concern than the latter. Similarly, although both are significant, a P value of 0.001 is much more "significant" than a P value of 0.04.
2. Statistical significance does not translate into clinical importance.
Although it is tempting to equate statistical significance with clinical importance, critical readers should avoid this temptation. To be clinically important requires a substantial change in an outcome that matters. Statistically significant changes, however, can be observed with trivial outcomes. And because significance is powerfully influenced by the number of observations, statistically significant changes can be observed with trivial changes in important outcomes. As shown in Table 2, large studies can be significant without being clinically important and small studies may be important without being significant.
3. Chance is rarely the most pressing issue.
Finally, because P values are quantifiable and seemingly objective, it's easy to overemphasize the importance of statistical significance. For most studies, the biggest threat to an author's conclusion is not random error (chance), but systematic error (bias). Thus, readers must focus on the more difficult, qualitative questions: Are these the right patients? Are these the right outcomes? Are there measurement biases? Are observed associations confounded by other factors?
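Table 2. Big Studies Make Small Differences "Significant"*
[Surviving column headings: WEIGHT LOSS; MAIN EFFECT; (IN EACH GROUP). Surviving interpretations: "Not significant, but promising"; "Significant, but clinically unimportant." The numeric entries did not survive extraction.]
*The standard deviation of the weight change is assumed to be 20 lb.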
Effective Clinical Practice ■
July/August 2001 Volume 4 Number 4
Primer on 95% Confidence Intervals
Readers frequently face questions about the role of chance in a
tion addressed by a 95% CI. In this example the study abstract
study's results. The traditional approach has been to consider the
might read:
probability that an observed result is due to chance–the P value.
The mean weight loss was 10 pounds for patients in
However, P values provide no information on the results' preci-
the intervention group and 3 pounds for patients in
sion—that is, the degree to which they would vary if measured
the control group, resulting in a mean difference of
multiple times. Consequently, journals are increasingly empha-
7 pounds and a 95% CI of 2 to 12. In other words,
sizing a second approach: reporting a range of plausible results,
95% of the time the true effect of the intervention will
better known as the 95% confidence interval (CI). This Primer
be within the range from 2 to 12 pounds.
reviews the concept of CIs and their relationship to P values.
To conceptualize the more formal definition of a 95% CI, it
is useful to consider what would happen if the study were repeat-
Assessing the Role of Chance
ed 100 times. Obviously, not every study would result in a 7-
There are two basic statistical methods used to assess the role of
pound weight loss in favor of the intervention. Simply due to the
chance: hypothesis testing (which results in a P value–the sub-
ject of last issue's Primer) and 95% CIs. As shown in Figure 1,
both use the same fundamental inputs.
1. the
main effect
(the difference in the outcome)
2. the variance in the main effect
State a null hypothesis
(
the main effect is 0)
95% confidence interval
around the main effect
Calculate the test statistic
(main effect/variance)
to determine
P value
FIGURE 1. Statistical approach for comparing two groups.
FIGURE 2. Every study can have a 95% CI.
Consider a study of a new weight loss program: Group A receives the intervention and loses an average of 10 pounds, whereas group B serves as a control and loses an average of 3 pounds. The main effect of the weight loss program is therefore estimated to be a 7-pound weight loss (on average).

But readers should recognize that the true effect of the program may not be exactly a 7-pound weight loss. Instead, the true effect is best represented as a range. What is the range of effects that might be expected just by chance? That is the question [...]

[...] play of chance, weight loss would be greater in some studies and less in others, and some studies might show that the controls lost more weight. As shown in Figure 2, we can generate a 95% CI for each study.

Note that for 95 out of 100 studies, the CI contains the truth (and 5 times out of 100 it does not). This example helps explain the formal definition of a 95% CI: "The interval computed from the sample data which, were the study repeated multiple times, would contain the true effect 95% of the time."

Factors That Influence 95% CIs

Confidence intervals really are a measure of how precise an estimated effect is. The range of a CI is dependent on the two factors that cause the main effect to vary:

1) The number of observations. This factor is largely under the investigator's control. A 7-pound difference observed in a study with 500 patients in each group will have a narrower CI than a 7-pound difference observed in a study with 25 patients in each group.

2) The spread in the data (commonly measured as a standard deviation). This factor is largely outside the investigator's control. Consider the two comparisons in Figure 3. In both cases, the mean weight loss in group A is 10 pounds and the mean weight loss in group B is 3 pounds. If everybody in group A loses about 10 pounds and everybody in group B loses about 3 pounds, then the CI will be narrower (left part of figure) than if individual weight changes are spread all over the map (right part of figure).

FIGURE 3. More diversity in weight loss equals a larger 95% CI. The horizontal bars represent group means. (Both panels: mean difference, 7 lb.)

Readers will occasionally encounter CIs calculated for other confidence levels (e.g., 90% or 99%). The higher the degree of confidence, the wider the confidence interval. Thus, a 99% CI for the 7-pound difference would have to be wider than the 95% CI for the same data.

Relationship between 95% CIs and P values

Information about the P value is contained in the 95% CI. As shown in Figure 4, the P value can be inferred based on whether the finding of "no difference" falls within the CI.

FIGURE 4. Relationship between P value and 95% CI.
If the 95% CI includes no difference between groups, then the P value is > 0.05.
If the 95% CI does not include no difference between groups, then the P value is < 0.05.

So, given a CI of 2 to 12 pounds for the 7-pound difference, one could infer that the P value is less than 0.05. Alternatively, given a CI of –3 to 17 pounds for the 7-pound difference, one could infer that the P value is greater than 0.05. If the CI terminates exactly on no difference, such as 0 to 14 pounds, then the P value is exactly 0.05.

Remember that the value for no difference depends on the type of effect measure used. When the effect measure involves a subtraction, the value for no difference is 0. When the effect measure involves a ratio, the value for no difference is 1. As shown in Table 1, readers must pay careful attention to this in order to reliably interpret the CI.

Although P values and 95% CIs are related, CIs are preferred because they convey information about the range of plausible effects. In other words, the CI provides the reader with some sense of how precise the estimate of the effect is. This is a valuable dimension that is not contained within a P value.

But, like P values, 95% CIs do not answer two critical questions: 1) Is the result correct? 2) Is the observed effect "important"? To answer the first question, readers must seek other data and evaluate the possibility of systematic error (bias). To answer the second, they must rely on their own clinical judgment.
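The two factors described above, along with the choice of confidence level, can be sketched numerically. The block below is a minimal illustration rather than the primer's own method: it assumes a normal approximation, equal group sizes, and a common standard deviation. The standard deviations of 15 lb and 5 lb are hypothetical values chosen only to show the pattern, and the function names are invented for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(diff, sd, n_per_group, level=0.95):
    """Normal-approximation CI for a difference between two group means.

    Assumes both groups have the same size and the same standard deviation.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    se = sd * sqrt(2 / n_per_group)            # standard error of the difference
    return diff - z * se, diff + z * se


def two_sided_p(diff, sd, n_per_group):
    """Two-sided P value for the null hypothesis that the difference is 0."""
    se = sd * sqrt(2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(diff) / se))


# Hypothetical inputs: the 7-lb difference from the text; the SDs are assumptions.
small_study = diff_ci(7, sd=15, n_per_group=25)
large_study = diff_ci(7, sd=15, n_per_group=500)
tight_data = diff_ci(7, sd=5, n_per_group=25)
ci_99 = diff_ci(7, sd=15, n_per_group=25, level=0.99)

for label, (low, high) in [
    ("n=25,  SD=15, 95% CI", small_study),
    ("n=500, SD=15, 95% CI", large_study),
    ("n=25,  SD=5,  95% CI", tight_data),
    ("n=25,  SD=15, 99% CI", ci_99),
]:
    print(f"{label}: {low:.1f} to {high:.1f} lb")
```

Under these assumptions, more patients or less spread narrows the interval, a higher confidence level widens it, and the interval that crosses 0 is the one whose P value exceeds 0.05.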
Table 1. Examples Demonstrating 95% CIs and P Values

Example | Effect measure | Value for no difference | CI includes no difference? | Statistically significant (P < 0.05)?
The average weight loss was 7 lbs (95% CI, –3 to 17) | Difference | 0 | Yes | No
42% absolute reduction in the need for intubation (95% CI, 7% to 70%) | Difference | 0 | No | Yes
The relative risk for cancer was 2.3 for smokers compared with nonsmokers (95% CI, 1.8 to 3.0) | Relative risk | 1 | No | Yes
The odds ratio for readmission was 0.8 for managed care patients (95% CI, 0.3 to 1.2) | Odds ratio | 1 | Yes | No
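The rule in the text (0 is the "no difference" value for subtractions, 1 for ratios) can be applied mechanically to the four examples. The following is a small sketch; the function names and the "difference"/"ratio" labels are introduced here for illustration and do not come from the primer.

```python
def no_difference_value(effect_type):
    """Return 0 for measures built by subtraction, 1 for measures built as ratios."""
    return 0.0 if effect_type == "difference" else 1.0


def significant_at_005(effect_type, ci_low, ci_high):
    """True when the 95% CI excludes the 'no difference' value (P < 0.05).

    A CI that ends exactly on the no-difference value corresponds to
    P = 0.05 and is treated here as not below 0.05.
    """
    null = no_difference_value(effect_type)
    return not (ci_low <= null <= ci_high)


# The four examples from the table: (label, effect type, CI lower, CI upper).
examples = [
    ("weight loss, 7 lb", "difference", -3, 17),
    ("intubation, 42% absolute reduction", "difference", 7, 70),
    ("cancer, relative risk 2.3", "ratio", 1.8, 3.0),
    ("readmission, odds ratio 0.8", "ratio", 0.3, 1.2),
]

results = [significant_at_005(kind, low, high) for _, kind, low, high in examples]

for (label, _, low, high), is_significant in zip(examples, results):
    verdict = "P < 0.05" if is_significant else "P > 0.05"
    print(f"{label} (95% CI, {low} to {high}): {verdict}")
```

Note that the odds ratio interval of 0.3 to 1.2 would look "significant" if it were mistakenly compared with 0 instead of 1, which is the error the text warns against.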
A compendium of ecp primers from past issues can be viewed and/or requested at http://www.acponline.org/journals/ecp/primers.htm.
Effective Clinical Practice ■
September/October 2001 Volume 4 Number 5
A Primer on Before–After Studies: Evaluating a Report of a "Successful" Intervention

It can be difficult to rigorously evaluate a clinical management or quality improvement intervention. Because these interventions generally occur at a system level (i.e., throughout the clinic, the hospital, or the health plan), it may not be practical to obtain suitable concurrent controls (clinics, hospitals, or plans not exposed to the intervention). As illustrated below, a common approach is to measure outcomes before the intervention is implemented and compare them with outcomes measured afterward, an approach often called a before–after study (or a pre–post study).

Although academics can easily criticize the lack of a concurrent control group, managers still need to make decisions on the basis of data available to them. This primer is intended to provide guidance on how to think critically about a report of a "successful" intervention obtained from a before–after study.

As with any report of "success," readers should start by asking three questions: Is the outcome unimportant? Is the magnitude of the change trivial? Were critical outcomes ignored? If the reader is comfortable that the answer to each is no, then he or she must go on to challenge the fundamental inference: that the "success" is a consequence of the intervention. The validity of this inference is threatened with an affirmative response to any of the following questions:

Would all participants in the "before group" be eligible for the "after group"? A typical before–after study compares the outcomes of hospitalized patients before and after some system intervention. Thus, different patients are often involved (e.g., patients admitted with pneumonia in June are compared with patients admitted with pneumonia in July). If only certain patients are eligible for the intervention, however, an inference about the success of the intervention can be seriously flawed. Consider a study of the effect of an outpatient low-molecular-weight heparin program (which, by necessity, excludes the sickest patients) on the average length of stay of patients with deep venous thrombosis (DVT). A comparison of cost between all patients who have DVT (before) and patients who have DVT and are eligible for the outpatient program (after) would dramatically overestimate the effect of the intervention. The best estimate of the intervention's effect would come from comparing all patients with DVT (before) with all patients with DVT (after), including both those who are eligible and those who are ineligible for the program. The comparability of patients in the before group and the after group is particularly relevant in assessments of the effect of guidelines (which generally apply to select patient subgroups).

Is there evidence for a prevailing "temporal trend"? Many outcomes change over time, regardless of whether an intervention has been applied. Consider a before–after study testing an intervention to reduce length of stay in the hospital. The average length of stay is 5 days before the introduction of the intervention but is 4.7 days after introduction. It is tempting to believe that the intervention caused the change. On the other hand, there is a prevailing temporal trend: Length of stay has been decreasing everywhere across time (at least until recently). The same problem would arise in a before–after study that tested an intervention to increase the use of aspirin in patients who have had a myocardial infarction. It would be difficult to untangle whether the observed change is the result of the intervention or dramatic television advertising. Because many forces are likely to be acting on outcomes that people care about, it is important to question whether an intervention is truly responsible for "success," particularly if outcomes are improving everywhere.

Were study participants selected because they were "outliers"? Understandably, some before–after studies target "problem areas" and select persons who are "outliers" (that is, participants who have extreme values in some measure). These studies may follow the same participants over time and face another threat to validity: regression to the mean. Examples could include a study of case management in patients who have had high utilization in the past or a study of an intensive communication tutorial in physicians who have been judged by their patients to have poor communication skills. Even if there is no intervention, participants selected because of extreme values will, on average, be found to have less extreme values with repeated measurement. Extremely high utilization in 1 year tends not to be so high the next (some patients may have had a major heart attack, stroke, or other catastrophic event that does not occur again in the next year); a group of physicians with extremely poor communication skills will tend to improve (some may have had a personal crisis that resolves in the ensuing year). Note that in neither case are the participants expected to return to the mean; they just become less extreme. Regression to the mean sets the stage to ascribe changes to a case management program or a communication tutorial when they actually represent the natural course of events.
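Regression to the mean, as described for the high-utilization patients, can be reproduced in a short simulation with no intervention at all. This is only a sketch under invented assumptions: each patient is given a stable "usual" utilization level plus independent year-to-year noise, and every number (population size, means, spreads, the top-10% cutoff) is hypothetical.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

N = 10_000
# Hypothetical model: a stable usual level per patient, plus yearly noise.
usual = [rng.gauss(10, 2) for _ in range(N)]
year1 = [u + rng.gauss(0, 3) for u in usual]
year2 = [u + rng.gauss(0, 3) for u in usual]  # no intervention of any kind

# Select the "outliers": roughly the top 10% of utilizers in year 1.
cutoff = sorted(year1)[int(0.9 * N)]
outliers = [i for i in range(N) if year1[i] >= cutoff]

before = mean(year1[i] for i in outliers)
after = mean(year2[i] for i in outliers)
overall = mean(year1)

print(f"all patients, year 1:  {overall:.1f}")
print(f"outliers, year 1:      {before:.1f}")
print(f"same outliers, year 2: {after:.1f}")
```

In this setup the selected group's average falls in year 2 yet stays above the overall average, matching the text's point that outliers become less extreme without returning all the way to the mean. A before–after comparison of these patients would show an apparent "improvement" even though nothing was done.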
Effective Clinical Practice ■
September/October 1999 Volume 2 Number 5
Although it is always possible that a change observed in a before–after study is a consequence of the intervention, affirmative responses to any of the preceding questions make the inference more tenuous. Alternatively, the inference is strengthened when investigators paid careful attention to the comparability of the participants. Inferences are further strengthened when the observed change is substantial, unique, and occurs quickly after the intervention (in other words, when it is difficult to ascribe the finding to temporal trends). The confusing effect of regression to the mean can be avoided if participants are not selected because they are outliers. Nonetheless, inferences from a before–after study should be seen as being based on circumstantial evidence. If the accuracy of the inference is important, readers and researchers alike must ask whether there is a reasonable opportunity to test the intervention by using concurrent controls.
Primer on Group Randomized Trials
Group randomized trials are experiments in which the intervention occurs at the level of the group (typically physicians or clinics) but observations are made on individuals within the groups (e.g., patients). Because group randomized trials are increasingly common in health services research, critical readers should understand their rationale, the implications of group size vs. number of groups, and the limitations of the approach.

Why Randomize by Group?

Group randomization is particularly useful when there is a high risk for contamination if group members are randomized as individuals. For example, an investigator studying the effects of a clinical practice guideline can't assume that a provider caring for patients in the intervention arm will not apply this knowledge to the patients assigned to the control arm. Such contamination biases the study toward a finding of no effect. Randomizing at the level of the physician avoids this source of contamination because physicians are either exposed or not exposed to the intervention. If there are concerns that intervention physicians will contaminate control physicians in the same clinic, randomization should occur at the clinic level.

Group Size vs. Number of Groups

To illustrate some of the issues raised by group randomization, consider a trial to test a cholesterol management guideline. Physicians would be randomly assigned to a control or an intervention arm while the outcome (say, the mean change in cholesterol after 6 months) would be measured on their patients. As shown in the Figure, however, there are many possible combinations of group size and number of groups.

FIGURE. The relationship between group size and the number of groups. Size for each group is 200. (Designs shown for each arm: 5 patients/group; 10 patients/group, labeled medium group size; 25 patients/group. Arrow label: decreased ability to estimate effect.)

In each case we have 200 patient observations (100 patients in each arm), but as group size increases there are fewer physicians. With smaller group size, there is less information on many physicians; with larger group size, there is more information on only a few physicians. Because the study is intended to measure the impact of the guideline on physicians, the design with 40 physicians is more likely to detect a significant intervention effect than the one with only 8 physicians, despite the equivalent size of the patient sample. In other words, collecting a large amount of information on patients in one physician practice allows something precise to be said about that physician but adds little to the ability to answer the study question.

Although ideally there should be as many physicians as possible, practical considerations often limit enrollment. The number of physicians available and willing to participate is often limited. It can be very expensive to enroll and train a physician. It is often easier to recruit many patients and a few physicians than it is to recruit many physicians. Thus, there is a trade-off between increasing group size (often the most expedient way to increase sample size) and increasing the number of groups (generally the most effective way to increase power).

Sample Size in Group Randomized Trials

The ability to make statistical inferences is inversely related to variability in the outcome measure. In this example, the variability in cholesterol can come from two sources: differences among patients and differences among physicians (presumably in their ability to influence the patients' cholesterol either through behavior modification or pharmacologic treatment). The proportion of cholesterol variability attributable to physicians is called the intraclass correlation (the term is a misnomer because it has nothing to do with correlation), or rho. As rho increases, a greater share of the variability comes from physicians, so that increasing the number of physicians will become more important. If rho is small, then increasing the number of patients per physician may be sufficient to increase the power to detect an effect. Rho can only be zero if there is no systematic difference between groups. In other words, 1) physicians do not differ in their response to education and 2) the patients of one physician do not differ systematically from those of another. A typical rho in this setting is between 0.01 and 0.04.

Table 1 illustrates how changes in the intraclass correlation affect the sample size needed to produce equivalent levels of precision. As the intraclass correlation increases, the total number of patients needed also increases. In addition, Table 1 shows how the effect is modified by the number of physicians. When the intraclass correlation is 0.03, for example, a study with 10 physicians in each arm requires 486 patients to achieve the same precision as a study with 278 patients and 20 physicians in each arm. Notice that with 4 physicians in each arm, no number of patients would provide sufficient information to answer the study question. This illustrates a major limitation of group randomized trials: It may be impossible to collect enough data at the patient level to make up for a small number of groups. The important lesson here is that the effective sample size in a group randomized trial is not related only to the number of patients but depends on the number of groups and the intraclass correlation.

Table 1. Relationship between Intraclass Correlation, Sample Size, and Number of Groups
* No physician effect.

Comparability of Patients

One of the most important advantages of randomization is that, if the trial is large enough, it is fair to assume that the study groups will be comparable with respect to all variables (measured and unmeasured). This enhances our ability to make inferences about the effect of the intervention on the outcome. In contrast to randomized trials of individuals, group randomized trials involve only a limited number of groups, typically 15 or 20. Thus, there are rarely enough groups to ensure even distribution of variables that could confound the treatment effect and bias the outcomes.

As a result, investigators need to collect information on important confounders and plan analyses that will control for these factors. These analyses require special techniques that directly incorporate the group structure (cluster analyses). It would be a mistake in our hypothetical example to simply compare the average cholesterol levels in the treatment and control group with, say, a standard z-test. For example, a study with rho = 0.03, 10 physicians per group, and 486 total patients would be equivalent to a study with rho = 0 and 200 total patients. A z-test would calculate a standard error based on 486 patients, when the effective sample size is only 200. Statistical analysis that ignores this fact can give falsely low P values and overly optimistic confidence intervals.

Policymakers and managers are increasingly interested in moving "hard science" to the vagaries of actual clinical practice. To help translate efficacy into effectiveness, interventions are being directed to physicians (or groups of physicians). Group randomization is the best approach to make valid inferences about their value.

This Primer was contributed by Michael L. Beach, MD, PhD, Dartmouth–Hitchcock Medical Center, Lebanon, New Hampshire.
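The sample-size figures quoted in this primer (486 patients with 10 physicians, 278 with 20, and no attainable number with 4, all at rho = 0.03) can be approximated with the widely used design-effect formula, effective n = n / (1 + (m - 1) * rho), where m is the number of patients per physician. The primer does not state which formula produced its table, so this is an assumption; the function names below are invented for the sketch, and patients and physicians are counted over the same set of groups.

```python
def effective_sample_size(n_patients, n_physicians, rho):
    """Independent-patient equivalent of a clustered sample (design-effect formula)."""
    per_physician = n_patients / n_physicians
    design_effect = 1 + (per_physician - 1) * rho
    return n_patients / design_effect


def patients_needed(target_effective_n, n_physicians, rho):
    """Patients required to reach target_effective_n, or None if unattainable.

    With a fixed number of physicians, the effective sample size can never
    reach n_physicians / rho, however many patients are added.
    """
    ceiling = n_physicians / rho if rho > 0 else float("inf")
    if target_effective_n >= ceiling:
        return None
    return target_effective_n * (1 - rho) / (1 - target_effective_n * rho / n_physicians)


# Target: the precision of 200 independent patients (rho = 0), as in the text.
for physicians in (20, 10, 4):
    needed = patients_needed(200, physicians, rho=0.03)
    if needed is None:
        print(f"{physicians} physicians: no number of patients is enough")
    else:
        print(f"{physicians} physicians: about {needed:.0f} patients")
```

Under this assumed formula the results are about 277 and 485 patients, close to the 278 and 486 quoted in the text, and 4 physicians can never match 200 independent patients because the ceiling is 4 / 0.03, or roughly 133.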
Effective Clinical Practice ■
January/February 2001 Volume 4 Number 1
Primer on Cost-Effectiveness Analysis
Cost-effectiveness analysis (CEA) is a technique for selecting among competing wants wherever resources are limited. Developed in the military, CEA was first applied to health care in the mid-1960s and was introduced with enthusiasm to clinicians by Weinstein and Stason in 1977:

"If these approaches were to become widely understood and accepted by the key decision makers in the health-care sector, including the physician, important health benefits or cost savings might be realized."

Regardless of whether this hope was realized, CEA has since become a common feature in medical literature.

The Basics of CEA

CEA is a technique for comparing the relative value of various clinical strategies. In its most common form, a new strategy is compared with current practice (the "low-cost alternative") in the calculation of the cost-effectiveness ratio:

$$\text{CE ratio} = \frac{\text{cost}_{\text{new strategy}} - \text{cost}_{\text{current practice}}}{\text{effect}_{\text{new strategy}} - \text{effect}_{\text{current practice}}}$$

The result might be considered as the "price" of the additional outcome purchased by switching from current practice to the new strategy (e.g., $10,000 per life year). If the price is low enough, the new strategy is considered "cost-effective."

It's important to carefully consider exactly what that statement means. If a strategy is dubbed "cost-effective" and the term is used as its creators intended, it means that the new strategy is a good value. Note that being cost-effective does not mean that the strategy saves money, and just because a strategy saves money doesn't mean that it is cost-effective. Also note that the very notion of cost-effective requires a value judgment: what you think is a good price for an additional outcome, someone else [...]

It's also worthwhile to recognize that CEA is only relevant to certain decisions. Table 1 delineates the various ways a new strategy might compare with an existing approach. Note that a CEA is relevant only if a new strategy is both more effective and more costly (or both less effective and less costly).

Table 1. Conditions under Which CEA Is Relevant
New strategy is more effective and more costly: CEA relevant
New strategy is more effective and less costly: Adopt new strategy
New strategy is less effective and more costly: New strategy is [...]
New strategy is less effective and less costly: CEA relevant

An Example

Consider two strategies intended to lengthen life in patients with heart disease. One is simple and cheap (e.g., aspirin and β-blockers); the other is more complex, more expensive, and more effective (e.g., medication plus cardiac catheterization, angioplasty, stents, and bypass). For simplicity, we will assume that doing nothing has no cost and no effectiveness. Table 2 shows the relevant data.

Note that CEA is about marginal (also called incremental) costs and benefits. So the marginal cost of a simple strategy is the difference between the cost of that strategy and the cost of doing nothing. The marginal cost for the complex strategy is the difference between the cost of the complex strategy and the cost of the simple strategy (not the cost of doing nothing). The calculation is similar for effectiveness. The final outcome measure for the analysis is the CE ratio: the ratio of marginal cost to marginal [...]

Table 2. A CEA Examining Three Strategies
(Surviving entries: 5.5 years; 0.5 years)
Effective Clinical Practice ■
September/October 2000 Volume 3 Number 5
Things To Ask

If a study is of interest and its primary outcome is a cost-effectiveness ratio, critical readers should seek answers to the following questions:

1. Are the relevant strategies being compared?
Because CEA involves marginal cost and benefits, the choice of which strategies to compare can drive the calculation and the conclusion of a CEA. Consider the effect of repeating the above analysis without the simple strategy (Table 3).

Table 3. A CEA Examining Two Strategies
(Surviving entries: 5.5 years; 5.5 years)

By excluding the simple strategy, the CE ratio for the complex strategy falls from $90,000 per life-year to $9091 per life-year. Thus, CEA is very sensitive to the choice of strategies being compared. Readers need to carefully consider whether the choice being presented is really the choice that interests clinicians.

2. How good are the effectiveness data?
It's hard to get too excited about cost-effectiveness if the effectiveness of the strategy is really unknown. So as a first step, the critical reader should examine the information used for effectiveness. Ideally, the data should come from randomized trials. If they don't, you'll want to scrutinize the face validity of the assumptions. Unfortunately, sometimes the analyses get way ahead of the data (one CEA was published on autologous bone marrow transplantation in metastatic breast cancer 8 years before a randomized trial showed no benefit).

3. Do the effectiveness data reflect how the strategy will be used in the real world?
Even if the effectiveness data are from randomized trials, it's important to ask whether they really pertain to the population and setting in which the strategy is likely to be applied. Consider a CEA of carotid endarterectomy in asymptomatic patients with more than 70% stenosis. If the trial data represent the best surgical practice while broad implementation of the strategy would involve community providers, then effectiveness is being overestimated, as is cost-effectiveness. A similar problem may occur if the trials involve patient selection criteria that are not easily replicated in practice. A critical reader of CEAs should carefully consider the generalizability of the effectiveness data.

4. Where do the cost data come from?
The basic question here is, "Was resource use modeled, or was it measured in real practice?" In modeling, investigators have to make assumptions about which services are likely to be utilized differently, thus driving the difference in cost. The measurement of resource use in practice has the advantage of capturing utilization that may not be anticipated by investigators (e.g., extra testing, extra visits, readmissions).
In either approach, there can be considerable debate about how to attach dollar amounts to utilization counts (debates that can get very tedious very quickly). Critical readers should look at the utilization counts themselves and have some confidence about the face validity of the dollars attached to them (probably the most practical standard being the Medicare fee schedule/allowed charges). If more utilization doesn't equal more money, something [...]

5. Who's funding the CEA?
Unfortunately, funding sources seem to matter. There is now considerable evidence that researchers with ties to drug companies are indeed more likely to report favorable results than are researchers without such ties. Because they are so sensitive to both the choice of strategies and assumptions, CEAs are particularly susceptible to bias, intentional or not. Consequently, some journals have chosen not to publish industry-supported CEAs. For those that are published, readers must consider the conflict posed by funding from a manufacturer of one of the analyzed [...]

6. Did we get anywhere?
Finally, readers may want to consider whether the entire exercise somehow helped them with a decision. Although some CEAs have extremely high CE ratios (i.e., > $200,000 per quality-adjusted life-year, a poor value) and others have very low CE ratios (i.e., < $10,000 per quality-adjusted life-year, a good value), most fall somewhere in the middle. Analyses with CE ratios of $50,000 per quality-adjusted life-year may conclude with an assertion that the analyzed strategy is "cost-effective." Whether or not this helps anyone make a decision is hard to [...]
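The sensitivity described under question 1 can be checked with a few lines of arithmetic. The cost figures in the primer's tables did not survive in this copy, so the values below are back-calculated assumptions: they are chosen to be consistent with the surviving entries (5.5 years and 0.5 years) and with the two ratios quoted in the text ($90,000 and $9091 per life-year). The function and constant names are hypothetical.

```python
def ce_ratio(cost_new, effect_new, cost_ref, effect_ref):
    """Marginal cost divided by marginal effectiveness (dollars per life-year)."""
    return (cost_new - cost_ref) / (effect_new - effect_ref)


# (cost in dollars, effectiveness in life-years). These values are assumptions
# back-calculated from the ratios in the text, not the primer's table data.
NOTHING = (0, 0.0)
SIMPLE = (5_000, 5.0)
COMPLEX = (50_000, 5.5)

# Three-strategy analysis: each strategy is compared with the next cheaper one.
simple_vs_nothing = ce_ratio(*SIMPLE, *NOTHING)
complex_vs_simple = ce_ratio(*COMPLEX, *SIMPLE)

# Two-strategy analysis: the simple strategy is left out of the comparison.
complex_vs_nothing = ce_ratio(*COMPLEX, *NOTHING)

print(f"simple vs. nothing:  ${simple_vs_nothing:,.0f} per life-year")
print(f"complex vs. simple:  ${complex_vs_simple:,.0f} per life-year")
print(f"complex vs. nothing: ${complex_vs_nothing:,.0f} per life-year")
```

With these assumed inputs, the same complex strategy costs $90,000 per life-year when judged against the simple strategy but about $9091 per life-year when judged against doing nothing, which is the shift the text describes.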
Azimi NA, Welch HG. The effectiveness of cost-effectiveness analysis in containing costs. J Gen Intern Med. 1998;13:664-9.
Doubilet P, Weinstein MC, McNeil BJ. Use and misuse of the term "cost-effective" in medicine. N Engl J Med. 1986;314:253-6.
Drummond MF, Richardson WS, O'Brien BJ, Levine M, Heyland D. Users' guides to the medical literature. XIII. How to use an article on economic analysis of clinical practice. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA. 1997;277:1552-7.
Eddy DM. Cost-effectiveness analysis: a conversation with my father.
Eddy DM. Cost-effectiveness analysis: is it up to the task? JAMA. 1992;267:
Eddy DM. Cost-effectiveness analysis: the inside story. JAMA. 1992;268:
Eddy DM. Cost-effectiveness analysis: will it be accepted? JAMA. 1992;268:
Friedberg M, Saffran B, Stinson TJ, Nelson W, Bennett CL. Evaluation of conflict of interest in economic analyses of new drugs used in oncology. JAMA. 1999;282:1453-7.
Kassirer JP, Angell M. The Journal's policy on cost-effectiveness analyses. N Engl J Med. 1994;331:669-70.
O'Brien BJ, Heyland D, Richardson WS, Levine M, Drummond MF. Users' guides to the medical literature. XIII. How to use an article on economic analysis of clinical practice. B. What are the results and will they help me in caring for my patients? Evidence-Based Medicine Working Group
Russell LB, Gold MR, Siegel JE, Daniels N, Weinstein MC, for the Panel on Cost-Effectiveness in Health and Medicine. The role of cost-effectiveness analysis in health and medicine. JAMA. 1996;276:1172-7.
Siegel JE, Weinstein MC, Russell LB, Gold MR, for the Panel on Cost-Effectiveness in Health and Medicine. Recommendations for reporting cost-effectiveness analyses. JAMA. 1996;276:1339-41.
Weinstein MC, Siegel JE, Gold MR, Kamlet MS, Russell LB, for the Panel on Cost-Effectiveness in Health and Medicine. Recommendations of the panel on cost-effectiveness in health and medicine. JAMA. 1996;276:1253-58.
Weinstein MC, Stason WB. Foundations of cost-effectiveness analysis for health and medical practice. N Engl J Med. 1977;296:716-21.
Primer on Interpreting Surveys
To answer their research questions, investigators often need to ask questions of others. These questions may revolve around how people feel, what people know, and what people think. Some examples are given in the following table.

How do people feel?
  How do patients with lung cancer feel after having chemotherapy?
  How do physicians react to having their decisions reviewed?
  How much do healthy women fear [...]

What do people know?
  What do patients know about the benefit of chemotherapy in lung cancer?
  What do physicians know about the evidence supporting certain therapies?
  What do women know about their risk for heart disease?

What do people think?
  Do patients with lung cancer think they should be told the average [...]
  Do physicians think that there is a better way to change their behavior?
  Do women think that they are getting too much or too little information?

To address these questions, investigators must systematically question a defined group of individuals (in other words, administer a survey). This can be done in person, by mail, by phone, or over the Internet. Because surveys are increasingly common in the medical literature, readers need to be able to critically evaluate the survey method. Two questions are fundamental: 1) Who do the respondents represent? 2) What do their [...]

Who Do the Respondents Represent?

Like most types of research, surveys are useful only to the extent that they help us learn something about a defined population. The population we are interested in learning about is called the target population. Surveys are almost always based on a sample of the target population, and the respondents may not accurately represent this population.

Consider the following example. Suppose you are interested in how well adults with asthma are schooled in the use of spacers with inhalers. You question a sample of adults who are members of an asthma registry, and one third respond. You are surprised by how educated most of them are about the technique. You conclude that there is little need for further [...]

What's wrong with this conclusion? Patients in the registry may be more motivated than patients in general. Furthermore, patients who received the survey and did not know the answers to the questions might have decided not to complete it. Therefore, it is possible that your conclusion is wrong and that, in fact, most asthmatic persons do not understand the use of [...]

To avoid this general problem, readers need to ask themselves how well the respondents represent the target population. As shown in the following figure, there are three basic steps of selection between the target population (about which the conclusion will be drawn) and the actual sample (where the data come from). The reduction at each step potentially threatens a conclusion about the target population.

Figure (selection steps):
Target Population (who the researchers want): Adults with Asthma
Sample Frame (who they can get)
Selected Sample (who they try to get): Drawn from Registry
Actual Sample (who they end up with): Respondents

Target Population ➔ Sample Frame

The sample frame is the portion of the target population that is accessible to researchers (e.g., persons who read newspapers, persons with phones). Often, the sample frame is some sort of list (e.g., a membership list). But individuals who are accessible may differ from those who are not. For example, persons with phones are different from persons without phones, and physicians who are members of professional organizations are different from those who are not. Readers should carefully judge how the sample frame might systematically differ from the target population.
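The asthma registry example can be made concrete with simple arithmetic. Every number below is a hypothetical assumption chosen so that about one third of the registry responds, as in the example; the primer gives no such figures. The sketch only shows how differential response can make respondents look better informed than the registry as a whole.

```python
# Hypothetical registry: every number here is an assumption for illustration.
registry_size = 1000
share_who_understand = 0.40       # true share who understand spacer use
response_rate_understand = 0.60   # people who know the answers respond more often
response_rate_other = 0.155       # people who do not know often skip the survey

understand = registry_size * share_who_understand
other = registry_size - understand

respondents_understand = understand * response_rate_understand
respondents_other = other * response_rate_other
respondents = respondents_understand + respondents_other

overall_response_rate = respondents / registry_size
observed_share = respondents_understand / respondents

print(f"overall response rate:             {overall_response_rate:.0%}")
print(f"true share who understand:         {share_who_understand:.0%}")
print(f"share among respondents (biased):  {observed_share:.0%}")
```

Under these assumptions a minority of registry members understand the technique, yet a clear majority of respondents do, so a conclusion drawn from respondents alone would be misleading. The same logic applies again at the step from target population to registry if registry members are more motivated than patients in general.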
Effective Clinical Practice ■
January/February 2002 Volume 5 Number 1
Sample Frame ➔ Selected Sample
Although researchers may try to contact the entire sample frame, in many cases this would involve an unmanageable number of individuals. The selected sample is the portion of the sample frame that the researchers actually try to contact. If the selected sample is randomly selected from the sample frame, readers can be confident that this step does not seriously threaten generalizability. If it is selected by some other means, readers must be more circumspect. Suppose the selected sample is 100 patients who appear consecutively in an outpatient clinic (consecutive sample) or 100 persons who respond to a newspaper advertisement (convenience sample). Although both approaches are reasonable places to begin to learn about a topic, the first does not adequately represent patients coming to clinic (because it overrepresents persons who visit the clinic frequently) and the second does not adequately represent persons who read newspapers.
Selected Sample ➔ Actual Sample
Not everyone who is contacted responds to a survey. The final sample is the portion of the selected sample that chooses to respond. However, the decision not to respond is usually not random; that is, respondents and nonrespondents usually differ. Patients who respond to questions about their disease may be more educated, have a smaller number of other problems, and care more about health. Physicians who respond to questions about guidelines may be more likely to believe that guidelines are important and more likely to be compliant. To judge these factors, readers need to consider the response rate. Whenever response rates are less than perfect (< 90%) and particularly when they are low (< 50%), readers should ask themselves how nonrespondents are likely to differ from respondents.
What Do Their Answers Mean?
Having decided who the respondents represent, readers can proceed to making judgments about their responses. The real challenge is to think about validity: How well do the survey questions do their job? Validity is the degree to which a particular indicator measures what it is supposed to measure rather than reflecting some other phenomenon. Although there are numerous kinds of validity (and even more names for each kind), it may be more useful for readers to consider validity as a spectrum, as in the following illustration.
[Figure: spectrum of validity, labeled "Increasingly Subjective." Question labels as far as they survive in the source: "Do the responses agree with a gold ..."; "Are the responses similar to what you would expect given ..."; "Does this question mean the same thing to you as it does in ..."]
At one extreme, readers can determine the extent to which researchers have compared the performance of their question with an external gold standard. Examples of criterion validity include comparing reported age with birth certificates, reported weight with measured weight, and reported eyesight with visual acuity. Although readers may be much more confident about a question that has been validated against an explicit criterion, they must also ask whether it may have been more accurate to simply apply the gold standard (e.g., why ask about weight when you can measure it?). Unfortunately, there is no criterion for many important questions (e.g., questions about what people think).
At the other extreme, readers need to consider for themselves whether the questions seem appropriate and reasonably complete "on the face of it." To really judge face validity, readers should look at (and journals should publish) the exact language used in the question. Face validity has the disadvantage of being entirely subjective. At the same time, it may be the only type of validity that can be applied to the important subjective questions that survey researchers are trying to answer.
Construct validity is somewhere between criterion validity and face validity. When the "gold standard" is not very objective but other data are available with which to judge a question's performance, we are in the realm of construct validity. The basic idea behind construct validity is that if your measurement does what you think it does, it should behave in certain ways. For example, the level of self-reported pain would be expected to decrease when respondents are given morphine. Wherever possible, readers should look for evidence that the pattern of responses is generally what would be expected given other data.
It is increasingly common to see the answers for several questions aggregated into a single score ("The mean PDQ score for dentists was 2.5 points higher than for lawyers; P = 0.03"). If possible, readers should try to move beyond the score to consider the validity of individual questions. But because use of scores is increasing, readers also need to seek some grounding about what the scores mean ("Is 2.5 big or little?"). Sometimes this grounding can be achieved by knowing the mean score for groups with which one is familiar or by knowing how much a score changes after a familiar event. Knowing that the development of a new chronic disease translates to approximately a 5-point drop in the Physical Component Summary score of the SF-36, for example, helps give a sense for this measure of health status.
Survey research is an important way of learning what our patients understand and what they want. At the same time, it is often cluttered with unnecessary complexity and jargon. More important, false conclusions are a constant possibility. Simply figuring out what questions were asked and who the respondents were will go a long way toward avoiding these problems.
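The response-rate check described in this primer (rates under 90% are less than perfect, rates under 50% are low) can be sketched in a few lines of Python. This is only an illustration: the function names, the wording of the returned labels, and the counts in the example are assumptions, not part of the primer.

```python
def response_rate(selected: int, responded: int) -> float:
    """Fraction of the selected sample that actually responded."""
    if selected <= 0:
        raise ValueError("selected sample must be positive")
    if not 0 <= responded <= selected:
        raise ValueError("responded must be between 0 and selected")
    return responded / selected


def nonresponse_concern(rate: float) -> str:
    """Map a response rate onto the primer's two rough thresholds.

    The label text is illustrative wording, not quoted from the primer.
    """
    if rate < 0.50:
        return "low: nonrespondents may differ substantially"
    if rate < 0.90:
        return "less than perfect: ask how nonrespondents differ"
    return "high: nonresponse is a smaller concern"


# Hypothetical counts: 300 registry members contacted, one third respond.
rate = response_rate(selected=300, responded=100)
print(f"{rate:.0%} response rate -> {nonresponse_concern(rate)}")
```

A label like this is only a prompt for the reader's own judgment about how nonrespondents might differ; it says nothing about the direction or size of any bias.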
Primer on Utilities
Utilities are numerical expressions of patient preferences for a particular state of health. Although utilities and measures of functional status both reflect quality of life, utilities describe how patients feel about or value living with a given clinical condition, and measures of functional status generally reflect the limitations experienced by patients with a clinical condition (e.g., New York Heart Association class for congestive heart failure). Utilities are typically assessed on a scale from 0 (death or worst health imaginable) to 1 (best health).
Patient utilities may be measured by using a variety of techniques (Figure). With the simplest approach, the visual analogue scale, patients simply mark an "X" on a continuous scale between 0 and 1. More commonly, utilities are elicited by asking patients to make a series of choices to identify at what point they are indifferent about the choice between two options. There are two commonly used iterative approaches to assessing utilities. With the time trade-off method, for example, patients might be asked whether they would prefer to live 10 years in good health or 20 years with a disabling stroke. If they chose the latter, the choice might be modified to 15 years in good health or 20 years living with a disabling stroke. This iterative process would continue until a patient was indifferent about the choice between the two options; for example, that living 12 years in good health was equivalent to living 20 years with a disabling stroke. In this case, the utility for stroke is the ratio of the two values: 12/20 = 0.6 (Figure). With the standard gamble method, a patient is instead asked to choose between life with a specific condition and a gamble with variable probabilities of life without the condition and death.
Average utilities for a wide variety of clinical conditions or symptoms may be obtained from the literature. One often-used catalogue is the Beaver Dam study.1 This population-based study describes utilities (obtained by two different methods) for patients with a variety of common clinical conditions, such as severe back pain (0.87), insulin-dependent diabetes (0.72), and cataract (0.94).
One familiar application of utilities is the quality-adjusted life-year (QALY). To calculate QALYs, time spent in a particular outcome state is multiplied by the utility for life in that state. For example, 10 years after a disabling stroke (utility of 0.6) is equivalent to 6.0 QALYs (10 × 0.6 = 6.0 QALYs). This aggregate measure is frequently used in decision analysis and cost-effectiveness analysis to compare the relative value of clinical interventions.
Reference
1. Fryback DG, Dasbach EJ, Klein R, et al. The Beaver Dam Health Outcomes Study: initial catalog of health-state quality factors. Med Decis Making. 1993;13:89-
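The time trade-off and QALY arithmetic in this primer is simple enough to sketch directly. The Python below is a minimal illustration using the primer's own stroke numbers; the function names are hypothetical and chosen only for this example.

```python
def time_tradeoff_utility(years_good_health: float,
                          years_with_condition: float) -> float:
    """Utility at the indifference point of a time trade-off.

    The patient is indifferent between `years_good_health` in good health
    and `years_with_condition` living with the condition.
    """
    if years_with_condition <= 0:
        raise ValueError("years_with_condition must be positive")
    if not 0 <= years_good_health <= years_with_condition:
        raise ValueError("years_good_health must lie in [0, years_with_condition]")
    return years_good_health / years_with_condition


def qalys(years_in_state: float, utility: float) -> float:
    """Quality-adjusted life-years: time in a state times its utility."""
    if not 0 <= utility <= 1:
        raise ValueError("utility must lie between 0 and 1")
    return years_in_state * utility


# Primer's example: 12 years in good health ~ 20 years with a disabling stroke.
stroke_utility = time_tradeoff_utility(12, 20)
print(f"utility for stroke: {stroke_utility:.1f}")
print(f"10 years after stroke: {qalys(10, stroke_utility):.1f} QALYs")
```

The same `qalys` function can be applied to catalogued utilities (for example, the values quoted above from the Beaver Dam study), provided the time horizon is supplied by the analyst.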
A Primer on Scores: What Counts?
To judge the effects of clinical interventions, researchers look for changes in certain key variables, better known as outcome measures. Some of the most familiar (and most important) outcome measures are dichotomous variables (so-called "0,1 variables"): They either happen or they don't. Examples include heart attacks, strokes, and death. Other outcome measures can take on many values. Physiologic and laboratory measurements fall into this category (such as blood pressure, serum sodium levels, and CD4 counts), as do various functional status and symptom scales (such as the Glasgow Coma Scale to classify level of consciousness and visual analog scales to classify level of pain).
Over the past two decades, a new type of outcome measure has been increasingly used in clinical research: scores. A score is a composite measure; in other words, it is derived from several individual variables. A score may be the composite of multiple dichotomous variables, multiple physiologic and laboratory measurements, multiple scales, or any combination thereof. Scores are used primarily to measure multiattribute patient function (e.g., Mini-Mental Status Score is a metric for classifying the combined functions of orientation, computational ability, and short-term memory) or to predict risk for various outcomes (e.g., heart attack, breast cancer, or death).
Because they may summarize several different variables (which may have various weights), it can be difficult to know what a score really means. If the topic is of interest and the primary outcome is a score, critical readers should seek answers to the following questions (Table 1). (If you can't answer these questions, it's tough to know what counts as an important effect.)
Table 1
What's being measured?
Which end is up?
What are some benchmarks?
What's being measured?
The first step is to try to get a handle on the construct. This can be harder than you think. Like so many things in medicine, scores often go by their acronym (and even when you know what the acronym stands for, you may not be that much closer to the construct). Consider the following examples. PCS stands for physical component summary; it is an overall measure of physical function assessed by self-report (part of the Medical Outcomes Study SF-36). APACHE II stands for Acute Physiologic and Chronic Health Evaluation (second version); it is a prognostic measure for intensive care unit patients that is used to predict
Which end is up?
Sometimes it's hard to know whether a higher score is a good thing or a bad thing. A high PCS score, for example, is good. A high APACHE II score, on the other hand, most definitely is not.
Knowing the range of possible values is the next step for getting a feel for the results. Some scores, such as the PCS score, range from 0 to 100. But many do not (APACHE II ranges from 0
What are some benchmarks?
The reader needs context: some grounding on what an expected score would be for a defined set of individuals. Published norms are available for the PCS score.1 For example, in the general U.S. population, the average PCS score for men over 65 years is 42. A healthy 40-year-old will have an APACHE II score
Finally, the reader needs help to make judgments about what constitutes an important change. In other words, a reader needs a clinical correlation. A 5-point decrease in the PCS score, for example, is equivalent to developing a new chronic disease like congestive heart failure. Of course, the information is not as precise as we would like (the severity of congestive heart failure varies from person to person, as does its impact), but it's a lot better than nothing. A change in APACHE II from 12 to 24 is associated with an absolute increase in inpatient mortality of 30% (from approximately 10% to over 40%).
To make sense of scores, readers should try to answer the preceding questions. Unfortunately, authors often fail to provide the needed information. In these cases, if readers want to really understand what a score means, they must do the hard work
* Finding the answers can be challenging. One excellent resource for understanding functional health scores is McDowell I, Newell C. Measuring Health, 2nd ed. Oxford: Oxford Univ
1. SF-36 Physical and Mental Health Summary Scales: A User's Manual. Boston: The Health Institute, New England Medical Center; 1994.
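The APACHE II benchmark in this primer (inpatient mortality rising from approximately 10% to over 40%) is stated as an absolute increase. As a rough sketch, the Python below computes both the absolute and the relative change for those two figures; the function names are hypothetical, and 0.10 and 0.40 are rounded stand-ins for the primer's approximate values.

```python
def absolute_change(before: float, after: float) -> float:
    """Difference between two risks, on the same 0-1 scale."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Ratio of the later risk to the earlier risk."""
    if before <= 0:
        raise ValueError("baseline risk must be positive")
    return after / before


# Rounded stand-ins for mortality at APACHE II scores of 12 and 24.
mortality_at_12 = 0.10
mortality_at_24 = 0.40

abs_diff = absolute_change(mortality_at_12, mortality_at_24)
rel_diff = relative_change(mortality_at_12, mortality_at_24)
print(f"absolute increase: {abs_diff * 100:.0f} percentage points")
print(f"relative increase: {rel_diff:.1f}-fold")
```

Reporting both figures side by side is one way to give a score change the kind of clinical grounding the primer asks for.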
A compendium of ecp primers from past issues can be viewed and/or requested at http://www.acponline.org/journals/ecp/primers.htm.
Effective Clinical Practice ■
July/August 2000 Volume 3 Number 4
Primer on Lead-Time, Length, and Overdiagnosis Biases
The apparent effects of early diagnosis and intervention (measured in terms of how screening-detected cases compare with cases detected by signs and symptoms) are always more favorable than the real effects (measured in terms of how a population that is screened compares with a population that is not). The comparison between screening-detected cases and others overestimates benefit because the former consists of cases that were diagnosed earlier, progress more slowly, and may never become clinically relevant. This comparison, therefore, is said to be biased. In fact, three biases exist that inflate the survival of
1. Lead-time bias: Overestimation of survival duration among screen-detected cases (relative to those detected by signs and symptoms) when survival is measured from diagnosis. In the figure below (representing one patient), the patient survives for 10 years after clinical diagnosis and survives for 15 years after the screening-detected diagnosis. However, this simply reflects earlier diagnosis because the overall survival time of the patient is unchanged.
[Figure: timeline for one patient. o = Time of disease onset. Dx = Time when disease is clinically obvious without testing.]
2. Length bias: Overestimation of survival duration among screening-detected cases caused by the relative excess of slowly progressing cases. These cases are disproportionately identified by screening because the probability of detection is directly proportional to the length of time during which they are detectable (and thereby inversely proportional to the rate of progression). In the following figure (representing 12 patients), 2 of 6 rapidly progressive cases are detected, whereas 4 of 6 slowly progressive cases are detected.
[Figure: 12 patients in two panels, "Rapidly Progressive" and "Slowly Progressive."]
3. Overdiagnosis bias: Overestimation of survival duration among screen-detected cases caused by inclusion of pseudodisease (subclinical disease that would not become overt before the patient dies of other causes). Some researchers further divide pseudodisease into two categories: one in which the disease does not progress (type I) and another in which the disease does progress, but so slowly that it never becomes clinically evident to the patient (type II). Inclusion of either type as being a "case" of disease improves apparent outcomes of screening-detected
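The numbers used in this primer for lead-time bias (10 versus 15 years of survival) and length bias (2 of 6 rapid cases versus 4 of 6 slow cases detected) can be reproduced with a short Python sketch. The ages below are hypothetical values chosen only so that the survival figures match the primer; the function name is likewise an assumption.

```python
from fractions import Fraction


def survival_from_diagnosis(age_at_diagnosis: int, age_at_death: int) -> int:
    """Years survived when the clock starts at diagnosis."""
    return age_at_death - age_at_diagnosis


# Lead-time bias: same patient, same death, earlier diagnosis by screening.
age_at_death = 75           # hypothetical
age_at_clinical_dx = 65     # hypothetical: diagnosed from signs and symptoms
age_at_screen_dx = 60       # hypothetical: diagnosed by screening

clinical_survival = survival_from_diagnosis(age_at_clinical_dx, age_at_death)
screen_survival = survival_from_diagnosis(age_at_screen_dx, age_at_death)
lead_time = screen_survival - clinical_survival
print(f"survival after clinical diagnosis: {clinical_survival} years")
print(f"survival after screening diagnosis: {screen_survival} years")
print(f"lead time (no change in age at death): {lead_time} years")

# Length bias: slowly progressive cases are overrepresented among
# screening-detected cases (counts taken from the primer's figure).
rapid_total, slow_total = 6, 6
rapid_detected, slow_detected = 2, 4

slow_share_all = Fraction(slow_total, rapid_total + slow_total)
slow_share_screened = Fraction(slow_detected, rapid_detected + slow_detected)
print(f"slow cases among all cases: {slow_share_all}")
print(f"slow cases among screening-detected cases: {slow_share_screened}")
```

In this toy setup the patient dies at the same age either way, and the mix of cases shifts from one half slow to two thirds slow purely because of how detection works.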
evaluations of screening will ultimately need to be based on a combination of trial data and decision modeling. Perhaps the best we can expect is to have the real effect of an early detection strategy demonstrated under a few well-specified conditions and then make careful inferences about how changing conditions (e.g., target population, screening frequency, new tests) will affect net
At first glance, there is every reason to believe that early detection should work. If people are examined carefully enough by using advanced laboratory or imaging technologies, then most disease ought to be "caught" at an early stage. It also stands to reason that disease found earlier will be easier to treat. Consequently, much of the mortality and morbidity of advanced disease should be preventable. The idea of early detection is so appealing that there has been a dramatic growth in the use of diagnostic tests, either as part of systematic efforts (the Appendix Table provides the current cancer screening recommendations of the U.S. Preventive Services Task Force and the American Cancer Society) or as more routine testing in general (witness the finding, also in this issue, that one quarter of the elderly in Miami undergo echocardiography each year4).
But there are downsides to early detection. First, many people must be involved but only a few can benefit. To encourage people to be screened, proponents must articulate a message that motivates people to do so (exemplified by the "1-in-9" statistic for breast cancer). Too often this persuasion involves overstating the risk for the target disorder and exaggerating the potential
Effective Clinical Practice ■
March/April 1999 Volume 2 Number 2
Primer on Dissecting a Medical Imperative
Clinicians often face medical imperatives, which are broad statements that endorse a course of action. Consider two familiar medical imperatives: invest in patient safety and screen for cancer. Supporting these imperatives are the assertions that eliminating mistakes and early cancer detection will save lives.
Medical imperatives are rarely the result of a single study. Instead, they are generally the product of a complex mixture of observation, reasoning, and belief. Because the actions they engender may be beneficial, distracting, or possibly even harmful, critical readers will want to carefully consider the line of reasoning on which they are based. Several steps may be useful in this regard.
Diagram the Line of Reasoning
Diagramming the argument that supports an imperative provides the structure necessary to carefully consider the issue. Figure 1 is a prototype for the line of reasoning for each of the above examples (other constructions are, of course, possible).
[Figure 1, left chain: There are many adverse events. Adverse events are often preventable. Adverse events are often associated with death. Eliminating medical errors will save thousands of lives. Invest in patient safety.]
[Figure 1, right chain: Diagnostic tests can detect small cancers in asymptomatic people. Early stage cancer should be treated. Patients with early ... Early treatment leads to improved outcomes. Early cancer detection and treatment will save thousands of lives. Screen for cancer.]
FIGURE 1. Lines of reasoning underlying two imperatives.
Understand the Vocabulary
The process of depicting the argument also helps to identify critical issues of definition (e.g., What constitutes an error? What constitutes cancer?) that may have important implications when the imperative is put into action (e.g., Do doctors agree on what an error is? Do pathologists agree on who has early cancer?). Carefully understanding the vocabulary may also help identify subtle changes in words (e.g., from preventable adverse event to error) that may have tremendous influence on public policy.
Distinguish between Observation and Inference
Once an argument is diagrammed, each element should be considered in terms of its source. Is it the product of an observation or the result of an inference? Generally, the observations appear earlier in the line of argument.
Critically Examine the Observations
The observations are typically the result of published findings and should be subject to the same scrutiny given any important finding (e.g., Is it relevant? Is it valid? Is it generalizable?).
Look Out for Leaps of Faith
Next, consider the inferences carefully. Some may be cautious and conservative; others may be reckless. The most common problem is to confuse association and causality (e.g., "Because people who die in the hospital often experience adverse events, preventing adverse events will save lives" or "Because patients with early disease do well, early treatment will improve outcomes").
Ask about Vested Interest
How impartial is the person (or group) promoting the imperative? Obviously, some degree of intellectual interest is expected. But the presence of strong professional and/or financial interests may unduly influence the call for action (e.g., safety consultants calling for safety initiatives, mammographers calling for mammography).
Consider Unintended Effects
Finally, think hard about the net effects (intended and unintended) of the proposed course of action. Even the simplest action can have unintended effects. For example, cancer screening may help some people avoid late-stage disease, yet lead others to be treated unnecessarily (e.g., those with nonprogressive cancer). And all actions have opportunity costs. For example, dollars devoted to nurse clinicians to improve patient safety are dollars taken from something else. If that something is routine hospital nursing services, the net effect may be to diminish patient safety. Just because net effects are difficult to predict, it doesn't mean they can be ignored.
It's important to think about medical imperatives carefully. When you do so, you will probably find that most are oversimplifications. Unfortunately, the world is more complex than any of us would like. Most imperatives are probably neither right nor wrong; instead, there are settings where they are useful and others where they are not.
Effective Clinical Practice ■
November/December 2000 Volume 3 Number 6
A Primer on HEDIS
Although many people talk about report cards for medical care, there are few working examples. The most prominent is the Health Plan Employer Data and Information Set, better known as HEDIS. Used by over 400 health plans, HEDIS is a set of standardized performance measures intended to help purchasers and patients compare health plans in terms of quality (instead of simply comparing costs).
HEDIS is perhaps best thought of as a standardized test for health plans. As in most standardized tests, different sections test different domains (e.g., mathematics, language skills). Each domain contains a series of performance measures (e.g., individual questions). Table 1 shows the seven HEDIS domains and selected performance measures.
DOMAIN: SELECTED PERFORMANCE MEASURES*
Effectiveness of care
  See Tables 2 and 3
Access and availability of care
  Proportion of enrollees with preventive/ambulatory health visits during the reporting year (calculated separately for children and adults)
  Number of providers (primary, behavioral health, obstetric and prenatal, and
  Availability of language interpretation services
Satisfaction with experience of care
  Member satisfaction
Health plan stability
  Disenrollment
  Provider turnover
  Indicators of financial stability (e.g., revenue, loss, reserves held by plan)
Use of services
  Visits (prenatal care, well-child, adolescent well-care, other ambulatory care)
  Frequency of selected procedures
  Cesarean section rate
  Vaginal birth after cesarean rate
  Inpatient utilization (acute care, maternity care, newborns, mental health,
  Outpatient drug utilization
Cost of care
  Actual expense per member per month
  High-occurrence/high-cost DRGs (e.g., stroke, TIA, pneumonia, asthma, COPD, chest pain, angina pectoris, heart failure and shock, major joint replacement)
Health plan descriptive information
  Total enrollment and enrollment by payer
  Provider characteristics (board certification, residency completion,
  Report of plan affiliations with public health, community-based and school-based agencies
  Cultural diversity of Medicaid membership
* COPD = chronic obstructive pulmonary disease; DRG = diagnosis-related group; TIA = transient ischemic attack.
HEDIS measures of greatest interest to clinicians are in the effectiveness-of-care domain. Table 2 lists the performance measures, describes how each is calculated, and reports the most recent averages available for the Alliance of Community Health Plans and the national average (representing all participating plans). In each case, a higher proportion is presumed to represent better care. Some patients, however, may have an informed preference to forgo some of these services, such as certain immunizations (see the article by Mehl in this issue).
The individual performance measures have evolved over time. When HEDIS was initiated in 1991, the effectiveness measures focused on vaccination and screening rates. Measures were added subsequently to reflect treatment quality in diabetic and post–myocardial infarction patients. New measures to examine care of patients with hypertension, asthma, chlamydia, and menopause have been proposed for the next version of HEDIS (Table 3).
As HEDIS performance measures become more complex, so do the questions about measurement methods (e.g., Does a blood pressure of 145/95 mm Hg require control? What constitutes a sufficient discussion of treatment options?).
HEDIS is managed by the National Committee for Quality Assurance (NCQA). NCQA is encouraging the broad use of HEDIS data by employers, consumers, and other health care professionals to compare health plans. Further information can be found at www.ncqa.org.
Effective Clinical Practice ■
November/December 1999 Volume 2 Number 6
Current Performance Measures in the Effectiveness-of-Care Domain*
Measure | Service counted | Eligible enrollees | 1997 ACHP
Childhood immunization rate | DPT, polio, MMR, hepatitis B, HIB | |
Adolescent immunization rate | 2nd MMR, hepatitis B, chicken pox | |
Advice to quit smoking | Received advice to quit | Adults ≥ 18 yr who are current smokers |
Breast cancer screening rate | One or more mammograms in the past 2 years | Women aged 52–69 yr |
Cervical cancer screening | One or more Pap tests in the past 3 years | Women aged 21–64 yr |
Rate of prenatal care in the first trimester | Prenatal care visit between 176 and 280 days before delivery | Women who delivered live |
Check-ups after delivery | Postpartum visit between 21 and 56 days after | Women who delivered live |
β-blocker treatment rate | β-blocker dispensed within 7 days after | Adults ≥ 35 yr admitted with a diagnosis of AMI |
Diabetic retinal examination rate | Retinal examination by an eye care professional | Adults ≥ 31 yr who have |
Rate of follow-up after hospitalization for mental illness | Visit with mental health provider within 30 days of discharge | Individuals ≥ 6 yr admitted with a mental health |
* ACHP = Alliance of Community Health Plans; AMI = acute myocardial infarction; DPT = diphtheria, pertussis, tetanus; HIB = Haemophilus influenzae type B; MMR = measles, mumps, and rubella; Pap = Papanicolaou.
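Each effectiveness-of-care measure above is a proportion: enrollees who received the service divided by enrollees in the eligible population. The Python sketch below illustrates that calculation for the breast cancer screening rate (women aged 52–69 yr with one or more mammograms in the past 2 years). The `Enrollee` record, its field names, and the sample data are all hypothetical; the actual HEDIS specification may include further eligibility rules not shown here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Enrollee:
    """Hypothetical, simplified enrollee record for illustration only."""
    sex: str                       # "F" or "M"
    age: int                       # age in years
    mammograms_past_2_years: int   # count of mammograms in the past 2 years


def breast_cancer_screening_rate(enrollees: Iterable[Enrollee]) -> float:
    """Proportion of eligible women with one or more recent mammograms."""
    eligible = [e for e in enrollees if e.sex == "F" and 52 <= e.age <= 69]
    if not eligible:
        raise ValueError("no eligible enrollees in this plan")
    screened = [e for e in eligible if e.mammograms_past_2_years >= 1]
    return len(screened) / len(eligible)


# Made-up plan membership: four women are eligible, two were screened.
plan = [
    Enrollee("F", 55, 1),
    Enrollee("F", 60, 0),
    Enrollee("F", 68, 2),
    Enrollee("F", 52, 0),
    Enrollee("F", 45, 1),   # too young: excluded from the denominator
    Enrollee("M", 58, 0),   # not in the eligible population
]
screening_rate = breast_cancer_screening_rate(plan)
print(f"breast cancer screening rate: {screening_rate:.0%}")
```

Because the denominator is restricted to the eligible population, questions about who counts as eligible matter as much as the count of services delivered.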
New Effectiveness-of-Care Performance Measures for HEDIS 2000
Measure | Service counted | Eligible enrollees
Controlling high blood pressure | Blood pressure controlled to below 140/90 mm Hg | Enrollees with high blood pressure
Appropriate medications for people with asthma | Received medications for long-term control (e.g., inhaled corticosteroids) | Enrollees with chronic asthma
 | Tested for chlamydia | Sexually active women aged 15–25 yr
Management of menopause* | Breadth, depth, and personalization of |
* This measure encourages plans to discuss with women the pros and cons of various treatment options, such as hormone replacement therapy, so that they can make more informed choices.
Primer on Geographic Variation in Health Care
Although regional variation in health care has long been recognized,1 studies describing variation in intervention rates across geographic areas continue to appear regularly in medical journals. This primer is intended to help readers make sense of reports about geographic variation. We focus on two basic questions: 1) How much variation is there? 2) What causes variation?
How Much Variation Is There?
Regional rates of medical interventions always vary. Chance alone creates some degree of variation in intervention rates, particularly when many geographic areas are compared. Although debate remains about which method is best, a variety of statistical approaches can be used to evaluate the role of chance in studies of geographic variation.2
In most studies, however, geographic variation in intervention rates is not due to chance alone (i.e., it is statistically significant). So readers must consider the "clinical significance" of observed variations: How much variation is there? Many studies simply report the extremal range (ratio of highest to lowest rates) to reflect the magnitude of variation (e.g., "rates of carotid endarterectomy varied 7-fold, from 1.1 to 7.6 per 1000 enrollees").3 However, this measure can be misleading because procedures performed infrequently generally appear more variable than more common procedures. The extremal range also reflects rates only in high and low outlier regions, thus ignoring practice patterns in all other regions.
To compare procedures reliably, variation measures should be standardized (i.e., on the same scale). One approach is to divide observed procedure rates in each region by the overall average. As illustrated in Figure 1, plotting standardized rates is useful for comparing the "variation profiles" of different procedures.4 Some procedures, such as hip fracture repair and colectomy for colon cancer, vary little: regional rates cluster near the national average. In contrast, radical prostatectomy and back surgery vary markedly: their variation patterns are scattered diffusely. Peripheral arterial angioplasty varies even more than these high-variation benchmarks.5
[Figure 1 plot. Horizontal axis: Type of Procedure. Vertical axis (label partly lost in the source): rate in HRR relative to U.S.]
FIGURE 1. Variation profiles of six common procedures. Data for peripheral angioplasty from Axelrod and colleagues.3 Other figures derived from 1995–6 national Medicare data from the Dartmouth Atlas of Health Care.5 CABG = coronary artery bypass grafting; HRR = hospital referral region.
What Causes Variation?
Considering the entire sequence of steps by which a patient ultimately gets to surgery (or any medical intervention) is a useful way to understand the potential explanations for geographic variation (Figure 2).
Prevalence of Disease
Procedure rates may vary because of underlying differences in disease prevalence across regions. For example, generally higher rates of cardiovascular interventions in the southeastern United States may be in part related to a higher prevalence of cigarette smoking and other risk factors in that region.
Access to Care
To receive a procedure, patients must first get into the medical system. Procedure rates may vary if there are regional differences in access (e.g., related to socioeconomic status, insurance) or patient proclivity to seek medical care (e.g., related to
2001 American College of Physicians–American Society of Internal Medicine
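The two variation measures described under "How Much Variation Is There?" (the extremal range, and regional rates standardized by dividing by the overall average) can be sketched in Python as follows. Only the lowest and highest rates (1.1 and 7.6 per 1000 enrollees) come from the primer's carotid endarterectomy example; the region names, the two middle rates, and the overall average of 3.8 are hypothetical values added for illustration.

```python
from typing import Dict


def extremal_range(rates: Dict[str, float]) -> float:
    """Ratio of the highest regional rate to the lowest."""
    lowest = min(rates.values())
    if lowest <= 0:
        raise ValueError("lowest rate must be positive")
    return max(rates.values()) / lowest


def standardized_rates(rates: Dict[str, float],
                       overall_rate: float) -> Dict[str, float]:
    """Each regional rate divided by the overall (e.g., national) average."""
    if overall_rate <= 0:
        raise ValueError("overall_rate must be positive")
    return {region: rate / overall_rate for region, rate in rates.items()}


# Rates per 1000 enrollees. 1.1 and 7.6 are the primer's extremes;
# the other regions and the overall average are made up.
carotid_endarterectomy = {
    "Region A": 1.1,
    "Region B": 3.0,
    "Region C": 4.2,
    "Region D": 7.6,
}
overall_average = 3.8

fold = extremal_range(carotid_endarterectomy)
profile = standardized_rates(carotid_endarterectomy, overall_average)
print(f"extremal range: {fold:.1f}-fold")
for region, ratio in profile.items():
    print(f"{region}: {ratio:.2f} x overall average")
```

The extremal range uses only two of the regions, whereas the standardized profile keeps every region on a common scale, which is what makes profiles of different procedures comparable.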
Decision To Treat
Potential reasons for
Finally, it is important to consider how treatment decisions are
geographic variation
in intervention rates:
made, particularly in instances where treatment is not con-
strained to a single therapeutic option. Several components of
this decision process may contribute to regional variation in
intervention rates. Primary care physicians may vary in their
propensity to refer patients to specialists (and delegate decision
Variation in disease incidence
making to them). Specialists may vary in their beliefs about the
risks and benefits of a given procedure, and thus vary in the rec-
ommendations they give patients. Finally, there may be regional
variation in the degree to which individual patient preferences are
incorporated into clinical decisions.
Variation in access, patient
proclivity to seek care
Differences in the degree to which procedures vary can be
explained in the context of these components of decision mak-
ing. Consider hip fracture repair, a low-variation procedure. Hip
fracture prevalence does not vary geographically—all patients
seek care, the diagnosis is usually made without discretionary
testing, and decisions about treatment are constrained to a sin-
Variation in use of diagnostic testing
gle option (surgery). In contrast, regional rates of radical prosta-
tectomy vary widely. This is not surprising: Prostate cancer
prevalence varies widely (likely due to variation in testing), and
Diagnosis of surgically
treatable condition
there is wide disagreement among both primary care physicians
and specialists about the risks and benefits of several different
Variation in primary care physician
proclivity to refer to specialists
Geographic variation studies often identify unrecognized
problems in clinical decision making. These studies stimulate us
Variation in specialist's beliefs about
to ask, but cannot answer, the question, "Which rate is right?"
procedure risks and benefits
Research aimed at better understanding of clinical effectiveness,
patient preferences, and economic implications is necessary for
Variation in how patient preferences are incorporated into decision making
addressing this basic question.
FIGURE 2. Process by which a healthy person becomes a patient
and ultimately receives a medical intervention and potential
reasons for geographic variation in intervention rates.
Decision To Test
Many surgically treatable conditions are identified primarily by
diagnostic tests (e.g., prostate-specific antigen testing,
coronary angiography). Thus, surgery rates may vary because of
regional variation in the use of diagnostic testing. For example,
regional rates of carotid endarterectomy have been shown to be
highly correlated with rates of carotid ultrasonography (3).
References
1. Wennberg JE, Gittelsohn A. Small area variation in health
care delivery. Science. 1973;182:1102-8.
2. Diehr P, Cain K, Connell F, Volinn E. What is too much
variation? The null hypothesis in small area analysis. Health Serv
3. Wennberg JE, Cooper MM. Practice variations and the quality
of surgical care for common conditions. In: 1999 Dartmouth Atlas
of Health Care. Chicago: American Hospital Publishing; 1999.
4. Birkmeyer JD, Sharp SM, Finlayson SRG, Fisher ES, Wennberg
JE. Variation profiles of common surgical procedures. Surgery.
1998;124:917-23.
5. Axelrod DA, Fendrick AM, Wennberg DE, Birkmeyer JD, Siewers
AE. Cardiologists performing peripheral angioplasties: impact on
utilization. Eff Clin Pract. 2001;4:191-8.
This primer was contributed by John D. Birkmeyer, MD, Dartmouth
Medical School, Hanover, New Hampshire.
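As a rough illustration of the Decision To Test point above (regional
surgery rates tracking regional testing rates), the following sketch
computes a Pearson correlation coefficient across regions. The regional
rates are invented for illustration and are not data from the cited
study; the function and variable names are likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rates per 1000 enrollees in six regions (illustrative only).
ultrasound_rates = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]
endarterectomy_rates = [1.1, 1.9, 2.4, 3.6, 3.9, 5.0]

r = pearson_r(ultrasound_rates, endarterectomy_rates)
print(f"r = {r:.2f}")
```

A coefficient near 1 indicates that regions with more testing also tend
to have more surgery. It does not by itself show that the additional
testing causes the additional surgery.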
A compendium of ecp primers from past issues can be viewed and/or requested at http://www.acponline.org/journals/ecp/primers.htm.
Effective Clinical Practice, September/October 2001, Volume 4, Number 5
Source: http://www.vaoutcomes.com/downloads/Compendium_of_Primers.pdf