## Doi:10.1016/j.jclinepi.2004.03.01

Journal of Clinical Epidemiology 57 (2004) 1223–1231
Methods to assess intended effects of drug treatment in
observational studies are reviewed
Olaf H. Edwin P. Bruce M. Diederik E.
Sean D. Bruno H.Ch. Hubert G.M. A. de
a *Department of Pharmacoepidemiology and Pharmacotherapy, Utrecht Institute of Pharmaceutical Sciences (UIPS), Utrecht University, Sorbonnelaan 16, 3584 CA Utrecht, the Netherlands*

b *Centre for Biostatistics, Utrecht University, Utrecht, the Netherlands*

c *Cardiovascular Health Research Unit, Medicine, Health Services, and Epidemiology, University of Washington, Seattle, WA, USA*

d *Julius Centre for Health Sciences and Primary Care, Utrecht Medical Centre (UMC), Utrecht, the Netherlands*

e *Departments of Pharmacy and Health Services, University of Washington, Seattle, WA, USA*

f *Department of Epidemiology and Biostatistics, Erasmus University Rotterdam, Rotterdam, the Netherlands*
Accepted 30 March 2004

**Background and objective:** To review methods that seek to adjust for confounding in observational studies when assessing intended drug effects.

**Methods:** We reviewed the statistical, economical and medical literature on the development, comparison and use of methods adjusting for confounding.

**Results:** In addition to standard statistical techniques of *(logistic) regression* and *Cox proportional hazards regression*, alternative methods have been proposed to adjust for confounding in observational studies. A first group of methods focus on the main problem of nonrandomization by balancing treatment groups on observed covariates: *selection*, *matching*, *stratification*, *multivariate confounder score*, and *propensity score methods*, of which the latter can be combined with stratification or various matching methods. Another group of methods look for variables to be used like randomization in order to adjust also for unobserved covariates: *instrumental variable methods*, *two-stage least squares*, and *grouped-treatment approach*. Identifying these variables is difficult, however, and assumptions are strong. *Sensitivity analyses* are useful tools in assessing the robustness and plausibility of the estimated treatment effects to variations in assumptions about unmeasured confounders.

**Conclusion:** In most studies regression-like techniques are routinely used for adjustment for confounding, although alternative methods are available. More complete empirical evaluations comparing these methods in different situations are needed. © 2004 Elsevier Inc. All rights reserved.

*Keywords:* Review; Confounding; Observational studies; Treatment effectiveness; Intended drug effects; Statistical methods
\* Corresponding author. Fax: +31 30 253 9166.

0895-4356/04/$ – see front matter © 2004 Elsevier Inc. All rights reserved.

*O.H. Klungel et al. / Journal of Clinical Epidemiology 57 (2004) 1223–1231*

**1. Introduction**

In the evaluation of intended effects of drug therapies, well-conducted *randomized controlled trials* (RCTs) have been widely accepted as the scientific standard. The key component of RCTs is the randomization procedure, which allows us to focus on only the outcome variable or variables in the different treatment groups in assessing an unbiased treatment effect. Because adequate randomization will assure that treatment groups will differ on all known and unknown prognostic factors only by chance, probability theory can easily be used in making inferences about the treatment effect in the population under study (confidence intervals, significance). Proper randomization should remove all kinds of potential selection bias, such as physician preference for giving the new treatment to selected patients or patient preference for one of the treatments in the trial. Randomization does not assure equality on all prognostic factors in the treatment groups, especially with small sample sizes, but it assures confidence intervals and *P*-values to be valid by using probability theory.

There are settings where a randomized comparison of treatments may not be feasible due to ethical, economic or other constraints. Also, RCTs usually exclude particular groups of patients (because of age, other drug usage or noncompliance); are mostly conducted under strict, protocol-driven conditions; and are generally of shorter duration than
the period that drugs are used in clinical practice. Thus, RCTs typically provide evidence of what can be achieved with treatments under the controlled conditions in selected groups of patients for a defined period of treatment.

The main alternatives are *observational studies*. Their validity for assessing intended effects of therapies has long been debated and remains controversial. The recent example of the potential cardiovascular risk reducing effects of hormone replacement therapy (HRT) illustrates this controversy. Most observational studies indicated that HRT reduces the risk of cardiovascular disease, whereas RCTs demonstrated that HRT increases cardiovascular risk. The main criticism of observational studies is the absence of a randomized assignment of treatments, with the result that uncontrolled confounding by unknown, unmeasured, or inadequately measured covariates may provide an alternative explanation for the treatment effect.

Along with these criticisms, many different methods have been proposed in the literature to assess treatment effects in observational studies. With all these methods, the main objective is to deal with the potential bias caused by the nonrandomized assignment of treatments, a problem also known as *confounding*.

Here we review existing methods that seek to achieve valid and feasible assessment of treatment effects in observational studies.

**2. Design for observational studies**

A first group of methods for dealing with potential bias following from nonrandomized observational studies is to narrow the treatment and/or control group in order to create more comparable groups on one or more measured characteristics. This can be done by selection of subjects or by choosing a specific study design. These methods can also be seen as only a first step in removing bias, in which case further reduction of bias has to be attained by means of data-analytical techniques.

*2.1. Historical controls*

Before the introduction and acceptance of the RCT as the gold standard for assessing the intended effect of treatments, it was common to compare the outcome of treated patients with the outcome of *historical controls* (patients previously untreated or otherwise treated). An example of this method can be found in Kalra et al. The authors assessed the rates of stroke and bleeding in patients with atrial fibrillation receiving warfarin anticoagulation therapy in general medical clinics and compared these with the rates of stroke and bleeding among similar patients with atrial fibrillation who received warfarin in an RCT.

Using historical controls as a comparison group is in general a problematic approach, because the factor time can play an important role. Changes of the characteristics of a general population or subgroup over time are not uncommon. Furthermore, there may exist differences in population definitions between different research settings.

*2.2. Candidates for treatment*

If current treatment guidelines exist, the comparison between the treated and the untreated group can be improved by choosing for the untreated group only those subjects who are candidates for the treatment under study according to these guidelines. As a preliminary selection, this method was used in a cohort study to estimate the effect of drug treatment of hypertension on the incidence of stroke in the general population by selecting candidates on the basis of their blood pressure and the presence of other cardiovascular risk factors. The selection of a cohort of candidates for treatment can also be conducted by a panel of physicians after presenting them the clinical characteristics of the patients in the study.

*2.3. Comparing treatments for the same indication*

When different classes of drugs, prescribed for the same indication, have to be studied, at least some similarity in prognostic factors between treatment groups occurs naturally. This strategy was used in two case–control studies to compare the effects of different antihypertensive drug therapies on the risks of myocardial infarction and ischemic stroke. Only patients who used antihypertensive drugs for the indication hypertension were included in these studies (and also some subgroups that had other indications such as angina for drugs that can be used to treat high blood pressure were removed).

*2.4. Case–crossover and case–time–control design*

The use of matched case–control (case–referent) studies when the occurrence of a disease is rather rare is a well-known research design in epidemiology. This type of design can also be adopted when a strong treatment effect is suspected or when a cohort is available from which the subjects are selected (nested case–control study). Variations of this design have been proposed to control for confounding due to differences between exposed and unexposed patients. One such variant is the *case–crossover study*, in which event periods are compared with control periods within cases of patients who experienced an event. This study design may avoid bias resulting from differences between exposed and nonexposed patients, but variations in the underlying disease state within individuals could still confound the association between treatment and outcome. An extension of this design is the *case–time–control design*, which also takes into account changes of exposure levels over time. With this design and with certain assumptions, confounding due to time trends in exposure can be removed, but variations in the severity of disease over time within individuals, although probably correlated with exposure
levels, cannot be controlled. In a study comparing the effect of high and moderate β-agonist use on the risk of fatal or near-fatal asthma attacks, the odds ratio (OR) from a case–time–control analysis controlling for time trends in exposure turned out to be much lower (OR = 1.2, 95% confidence interval [CI] = 0.5–3.0) than in a conventional case–control analysis (OR = 3.1, 95% CI = 1.8–5.4).

Advantages of these designs, in which each subject is its own control, are the considerably reduced intersubject variability and the exclusion of alternative explanations from possible confounders. These methods are, on the other hand, of limited use, because for only some treatments the outcome can be measured at both the control period and the event period, and thereby excluding possible carryover effects.

**3. Data-analytical techniques**

Another group of bias reducing methods are the data-analytical techniques, which can be divided into model-based techniques (regression-like methods) and methods without underlying model assumptions (stratification and matching).

*3.1. Stratification and matching*

Intuitive and simple methods to improve the comparison between treatment groups in assessing treatment effects are the techniques of *stratification* (subclassification) and *matching* on certain covariates as a data analytical technique. The limitations and advantages of these methods are in general the same. Advantages are (i) clear interpretation and communication of results, (ii) direct warning when treatment groups do not adequately overlap on used covariates, and (iii) no assumptions about the relation between outcome and covariates (e.g., linearity). The main limitation of these techniques is that in general only one or two covariates or rough strata or categories are possible. More covariates will easily result in many empty strata in case of stratification and many mismatches in case of matching. Another disadvantage is that continuous variables have to be classified, using (mostly) arbitrary criteria.

These techniques can easily be combined with methods like propensity scores and multivariate confounder score, as will be discussed below, using the advantages of clear interpretation and absence of assumptions about functional relationships.

*3.2. Asymmetric stratification*

A method found in the literature that is worth mentioning is *asymmetric stratification*. Compared to cross-stratification of more covariates, in this method each stratum of the first covariate is subdivided by the covariate that has the highest correlation with the outcome within that stratum. For instance, men are subdivided on the existence of diabetes mellitus because of the strongest relationship with the risk of a stroke, and women are subdivided by the history of a previous cardiovascular disease. By pooling all treatment effects in the strata in the usual way, a corrected treatment effect can be calculated. Although by this method more covariates can be handled than with normal stratification, most of them will be partly used. We are unaware of any medical study in which this method has been used.

*3.3. Common multivariable statistical techniques*

Compared to selection, restriction, stratification, or matching, more advanced multivariable statistical techniques have been developed to reduce bias due to differences in prognosis between treatment groups in observational studies. By assessing a model with outcome as the dependent and type of treatment as the independent variable of interest, many prognostic factors can be added to the analysis to adjust the treatment effect for these confounders. Well known and frequently used methods are *multivariable linear regression*, *logistic regression*, and *Cox proportional hazards regression* (survival analysis). The main advantage over the earlier mentioned techniques is that more prognostic variables, quantitative and qualitative, can be used for adjustment, due to a model that is imposed on the data. It is obvious that also in these models the number of subjects or the number of events puts a restriction on the number of covariates; a ratio of 10–15 subjects or events per independent variable is mentioned in the literature.

An important disadvantage of these techniques when used for adjusting a treatment effect for confounding is the danger of extrapolations when the overlap on covariates between treatment groups is too limited. While matching or stratification gives a warning or breaks down, regression analysis will still compute coefficients. Mainly when two or more covariates are used, a check on adequate overlap of the joint distributions of the covariates will be seldom performed. The use of a functional form of the relationship between outcome and covariates is an advantage for dealing with more covariates, but has its drawback, mainly when treatment groups have different covariate distributions. In that case, the results are heavily dependent on the chosen relationship (e.g., linearity).

*3.4. Propensity score adjustment*

An alternative way of dealing with confounding caused by nonrandomized assignment of treatments in cohort studies is the use of *propensity scores*, a method developed by Rosenbaum and Rubin. D'Agostino found that "the propensity score for an individual, defined as the conditional probability of being treated given the individual's covariates, can be used to balance the covariates in observational studies, and thus reduce bias." In other words, by this method a collection of covariates is replaced by a single covariate, being a function of the original ones. For an individual $i$ ($i = 1, \ldots, n$) with vector $\mathbf{x}_i$ of observed covariates,
the propensity score is the probability $e(\mathbf{x}_i)$ of being treated ($Z_i = 1$) versus not being treated ($Z_i = 0$):

$$e(\mathbf{x}_i) = \Pr(Z_i = 1 \mid \mathbf{X}_i = \mathbf{x}_i)$$

where it is assumed that the $Z_i$ are independent, given the $X$'s.

By using logistic regression analysis, for instance, for every subject a probability (propensity score) is estimated that this subject would have been treated, on the basis of the measured covariates. Subjects in treatment and control groups with (nearly) equal propensity scores will tend to have the same distributions of the covariates used and can be considered similar. Once a propensity score has been computed, this score can be used in three different ways to adjust for the uncontrolled assignment of treatments: (i) as a matching variable, (ii) as a stratification variable, and (iii) as a continuous variable in a regression model (covariance adjustment). Examples of these methods can be found in two studies of the effect of early statin treatment on the short-term risk of death.

The most preferred methods are stratification and matching, because with only one variable (the propensity score) the disadvantages noted in section 3.1 disappear and the clear interpretation and absence of model-based adjustments remain as the main advantages. When classified into quintiles or deciles, a stratified analysis on these strata of the propensity score is most simple to adopt. Within these classes, most of the bias due to the measured confounders disappears. Matching, on the other hand, can be much more laborious because of the continuous scale of the propensity score. Various matching methods have been proposed. In all these methods, an important role is given to the distance matrix, of which the cells are most often defined as simply the difference in propensity score between treated and untreated patients. A distinction between methods can be made between *pair-matching* (one treated to one untreated patient) and *matching with multiple controls* (two, three, or four). The latter method should be used when the number of untreated patients is much greater than the number of treated patients; an additional gain in bias reduction can be reached when a variable number per pair, instead of a fixed number, is used. Another distinction can be made between *greedy methods* and *optimal methods*. A greedy method selects at random a treated patient and looks for an untreated patient with smallest distance to form a pair. In subsequent steps, all other patients are considered for which a match can be made within a defined maximum distance. An optimal method, on the other hand, takes the whole distance matrix into account to look for the smallest total distance between all possible pairs. An optimal method combined with a variable number of controls should be the preferred method.

The method of propensity scores was evaluated in a simulation study, and it was found that the bias due to omitted confounders was of similar magnitude as for regression adjustment. The bias due to misspecification of the propensity score model was, however, smaller than the bias due to misspecification of the multivariable regression model. Therefore, propensity score adjustment is less sensitive to assumptions about the functional form of the association of a particular covariate with the outcome (e.g., linear or quadratic). Recently, the propensity score method was compared to logistic regression in a simulation study with a low number of events and multiple confounders. With respect to the sensitivity to model misspecification (robustness) and empirical power, the authors found the propensity score method to be superior overall. With respect to the empirical coverage probability, bias, and precision, they found the propensity score method to be superior only when the number of events per confounder was low (say, 7 or less). When there were more events per confounder, logistic regression performed better on the criteria of bias and coverage probability.

*3.5. Multivariate confounder score*

The *multivariate confounder score* was suggested by Miettinen as a method to adjust for confounding in case–control studies. Although Miettinen did not specifically propose this method to adjust for confounding in studies of intended effects of treatment, the multivariate confounder score is very similar to the propensity score, except that the propensity score is not conditional on the outcome of interest, whereas the multivariate confounder score is conditional on not being a case.

The multivariate confounder score has been evaluated for validity. Theoretically and in simulation studies, this score was found to exaggerate significance, compared to the propensity score. The point estimates in these simulations were, however, similar for propensity score and multivariate confounder score.

*3.6. Instrumental variables*

A technique widely used in econometrics, but not yet generally applied in medical research, is the use of *instrumental variables* (IV). This method can be used for the estimation of treatment effects (the effect of treatment on the treated) in observational studies as an alternative to making causal inferences in RCTs. In short, an instrumental variable is an observable factor associated with the actual treatment but not directly affecting outcome. Unlike standard regression models, two equations are needed to capture these relationships:

$$D_i = \alpha_0 + \alpha_1 Z_i + v_i$$

$$Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i$$

where $Y_i$ is outcome, $D_i$ is treatment, $Z_i$ is the instrumental variable or assignment, and $\alpha_1 \neq 0$. Both treatment $D$ and assignment $Z$ can be either continuous or dichotomous. In case of a dichotomous $D$, the first equation can be written as $D_i^* = \alpha_0 + \alpha_1 Z_i + v_i$, where $D_i^*$ is a latent index ($D_i^* > 0 \rightarrow D_i = 1$; otherwise $D_i = 0$).
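The two-equation structure can be illustrated with a small simulation. The sketch below is not taken from the article: the parameter values, the unmeasured confounder `u`, and the helper `ols_slope` are all assumptions chosen for illustration, with a continuous treatment and a dichotomous instrument. It compares the naive regression of outcome on treatment with the ratio of two least-squares slopes (outcome on instrument, divided by treatment on instrument), which is the standard instrumental variable estimator for a single instrument.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical parameter values, chosen only for illustration.
ALPHA0, ALPHA1 = 0.5, 1.5   # treatment equation: D = ALPHA0 + ALPHA1*Z + v
BETA0, BETA1 = 2.0, -3.0    # outcome equation:   Y = BETA0 + BETA1*D + eps

rng = random.Random(0)
z, d, y = [], [], []
for _ in range(20000):
    u = rng.gauss(0, 1)                 # unmeasured confounder (assumption)
    z_i = rng.choice((0.0, 1.0))        # dichotomous instrument, independent of u
    v_i = u + rng.gauss(0, 1)           # error in the treatment equation
    eps_i = 2 * u + rng.gauss(0, 1)     # error in the outcome equation, shares u
    d_i = ALPHA0 + ALPHA1 * z_i + v_i
    y_i = BETA0 + BETA1 * d_i + eps_i
    z.append(z_i)
    d.append(d_i)
    y.append(y_i)

naive_slope = ols_slope(d, y)    # confounded by u through v and eps
first_stage = ols_slope(z, d)    # estimates ALPHA1
reduced_form = ols_slope(z, y)   # estimates BETA1 * ALPHA1
iv_estimate = reduced_form / first_stage

print(f"naive regression slope: {naive_slope:.2f}")
print(f"first-stage slope:      {first_stage:.2f}")
print(f"IV estimate of beta1:   {iv_estimate:.2f}")
```

Because `u` enters both error terms, the naive slope is pulled away from the simulated `BETA1`, whereas the ratio estimator stays close to it as long as the instrument is unrelated to `u`. That last condition is built into the simulation here; in real data it cannot be checked directly.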

By the first equation it is explicitly expressed that it is unknown how treatments are assigned (at least we know it was not random) and that we like to explain why one is treated and the other is not by a variable $Z$. Substituting the first equation into the second gives:

$$Y_i = (\beta_0 + \beta_1\alpha_0) + \beta_1\alpha_1 Z_i + (\beta_1 v_i + \varepsilon_i)$$

The slope $\beta_1\alpha_1$ can be estimated by least squares regression and is, when $Z$ is dichotomous, the difference in outcome between $Z = 0$ and $Z = 1$ (i.e., the intention-to-treat estimator). In order to estimate the direct treatment effect $\beta_1$ of treatment $D$ on outcome $Y$, this estimator $\beta_1\alpha_1$ must be divided by $\alpha_1$, the effect of $Z$ on $D$ from the first equation. As an illustration, it can be seen that in case of a perfect instrument (e.g., random assignment), a perfect relationship exists between $Z$ and $D$ and the parameter $\alpha_1 = 1$, in which case the intention-to-treat estimator and the instrumental variable estimator coincide. By using two equations to describe the problem, the implicit but important assumption is made that $Z$ has no effect on outcome $Y$ other than through its effect on treatment $D$ ($\operatorname{cov}[Z_i, \varepsilon_i] = 0$). Other assumptions are that $\alpha_1 \neq 0$ and that there is no subject $i$ "who does the opposite of its assignment". This is illustrated in the following example.

One of the earliest examples of the use of instrumental variables (simultaneous equations) in medical research was in the study of Permutt and Hebel, where the effect of smoking on birth weight was studied. The treatment consisted of encouraging pregnant women to stop smoking. The difference in mean birth weight between the treatment groups, the intention-to-treat estimator ($\beta_1\alpha_1$), was found to be 92 g, whereas the difference in mean cigarettes smoked per day was −6.4. This leads to an estimated effect $\beta_1$ of 92/−6.4 ≈ −15, meaning an increase of about 15 g in birth weight for every cigarette per day smoked less. The assumption that the encouragement to stop smoking ($Z$) does not affect birth weight ($Y$) other than through smoking behavior seems plausible. Also the assumption that there is no woman who did not stop smoking because she was encouraged to stop is probably fulfilled.

Another example of the use of an instrumental variable can be found in the study of McClellan et al., where the effect of cardiac catheterization on mortality was assessed. The difference in distance between their home and the nearest hospital that performed cardiac catheterizations and the nearest hospital that did not perform this procedure was used as an instrumental variable. Patients with a relatively small difference in distance to both types of hospitals (<2.5 miles) did not differ from patients with a larger difference in distance to both types of hospitals (≥2.5 miles) with regard to observed characteristics such as age, gender, and comorbidity; however, patients who lived relatively closer to a hospital that performed cardiac catheterizations more often received this treatment (26%) compared to patients who lived farther away (20%). Thus, the differential distance affected the probability of receiving cardiac catheterization, whereas it could reasonably be assumed that differential distance did not directly affect mortality.

As stated above, the main limitation of instrumental variables estimation is that it is based on the assumption that the instrumental variable only affects outcome by being a predictor for the treatment assignment and no direct predictor for the outcome (exclusion restriction). This assumption is difficult to fulfill; more important, it is practically untestable. Another limitation is that the treatment effect may not be generalizable to the population of patients whose treatment status was not determined by the instrumental variable. This problem is similar to that seen with RCTs, where estimated treatment effects may not be generalizable to a broader population. Finally, when variation in the likelihood of receiving a particular therapy is small between groups of patients based on an instrumental variable, differences in outcome due to this differential use of the treatment may be very small and, hence, difficult to assess.

*3.7. Simultaneous equations and two-stage least squares*

The method just described as instrumental variables is in fact a simple example of the more general methods of *simultaneous equations estimation*, widely used in economics and econometrics. When there are only two simultaneous equations and regression analysis is used, this method is also known as *two-stage least squares* (TSLS). In the first stage treatment $D$ is explained by one or more variables that do not directly influence the outcome variable $Y$. In the second stage this outcome is explained by the predicted probability of receiving a particular treatment, which is adjusted for measured and unmeasured covariates. An example of this method is used to assess the effects of parental drinking on the behavioral health of children. Parental drinking (the treatment) is not randomized, probably associated with unmeasured factors (e.g., parental skills) and estimated in the first stage by exogenous or instrumental variables that explain and constrain parents' drinking behavior (e.g., price, number of relatives drinking).

Because the method of simultaneous equations and two-stage least squares covers the technique of instrumental variables, the same assumptions and limitations can be mentioned here. We have chosen to elaborate the instrumental variables approach, because in the medical literature these types of methods are more known under that name.

*3.8. Ecologic studies and grouped-treatment effects*

Ample warning can be found in the literature against the use of *ecologic studies* to describe relationships on the individual level (the ecologic fallacy); a correlation found at the aggregated level (e.g., hospital) cannot be interpreted as a correlation at the patient level. Wen and Kramer, however, proposed the use of ecologic studies as a method to deal with confounding at the individual level when intended treatment effects have to be estimated. In situations
where considerable variation in the utilization of treatments exists across geographic areas independent of the severity of disease but mainly driven by practice style, the "relative immunity from confounding by indication may outweigh the 'ecologic fallacy'" by performing an ecologic study. Of course, such ecologic studies have low statistical power by the reduced number of experimental units and tell us little about the individuals in the compared groups. Moreover, Naylor argues that the limitations of the proposed technique in order to remove confounding by indication are too severe to consider an aggregated analysis as a serious alternative when individual level data are available.

An alternative method described in the literature is known as the *grouped-treatment approach*. Keeping the analysis at the individual level, the individual treatment variable will be replaced by an ecological or grouped-treatment variable, indicating the percentage of treated persons at the aggregated level. With this method the relative immunity for confounding by indication by an aggregated analysis is combined with the advantage of correcting for variation at the individual level. In fact this method is covered by the method of *two-stage least squares*, where in the first stage more variables are allowed to assess the probability of receiving the treatment. This method faces the same assumptions as the instrumental variables approach discussed earlier. Most important is the assumption that unmeasured variables do not produce an association between prognosis and the grouped-treatment variable, which in practice will be hard to satisfy.

**4. Validations and sensitivity analyses**

Horwitz et al. proposed to validate observational studies by constructing a cohort of subjects in clinical practice that is restricted by the inclusion criteria of RCTs. Similarity in estimated treatment effects from the observational studies and the RCTs would provide empirical evidence for the validity of the observational method. Although this may be correct in specific situations, it does not provide evidence for the validity of observational methods for the evaluation of treatments in general.

To answer the question whether observational studies produce similar estimates of treatment effects compared to randomized studies, several authors have compared the results of randomized and nonrandomized studies for a number of conditions, sometimes based on meta-analyses. In general, these reviews have concluded that the direction of treatment effects assessed in nonrandomized studies is often, but not always, similar to the direction of the treatment effects in randomized studies, but that differences between nonrandomized and randomized studies in the estimated magnitude of treatment effect are very common. Trials may under- or overestimate the actual treatment effect, and the same is true for nonrandomized comparison of treatments. Therefore, these comparisons should not be interpreted as true validations.

A *sensitivity analysis* can be a valuable tool in assessing the possible influence of an unmeasured confounder. This method was probably first used by Cornfield et al. when they attacked Fisher's hypothesis that the apparent association between smoking and lung cancer could be explained by an unmeasured genetic confounder related to both smoking and lung cancer. The problem of nonrandomized assignment to treatments in observational studies can be thought of as a problem of unmeasured confounding factors. Instead of stating that an unmeasured confounder can explain the treatment effect found, sensitivity analyses try to find a lower bound for the magnitude of association between that confounder and the treatment variable. Lin et al. developed a general approach for assessing the sensitivity of the treatment effect to the confounding effects of unmeasured confounders after adjusting for measured covariates, assuming that the true treatment effect can be represented in a regression model. The plausibility of the estimated treatment effects will increase if the estimated treatment effects are insensitive over a wide range of plausible assumptions about these unmeasured confounders.

**5. Summary and discussion**

Although randomized clinical trials remain the gold standard in the assessment of intended effects of drugs, observational studies may provide important information on effectiveness under everyday circumstances and in subgroups not previously studied in RCTs. The main defect in these studies is the incomparability of groups, giving a possible alternative explanation for any treatment effect found. Thus, focus in such studies is directed toward adjustment for confounding effects of covariates.

Along with standard methods of *appropriate selection of reference groups*, *stratification* and *matching*, we discussed multivariable statistical methods such as *(logistic) regression* and *Cox proportional hazards regression* to correct for confounding. In these models, the covariates, added to a model with 'treatment' as the only explanation, give alternative explanations for the variation in outcome, resulting in a corrected treatment effect. In fact, the main problem of balancing the treatment and control groups according to some covariates has been avoided. A method that more directly attacks the problem of imbalance between treatment and control group is the method of *propensity scores*. By trying to explain this imbalance with measured covariates, a score is computed which can be used as a single variable to match both groups. Alternatively, this score can be used as a stratification variable or as a single covariate in a regression model.

In all these techniques, an important limitation is that adjustment can only be achieved for *measured* covariates, implicating possible measurement error on these covariates (e.g., the severity of a past disease) and possible omission of other important, unmeasured covariates. A method

*O.H. Klungel et al. / Journal of Clinical Epidemiology 57 (2004) 1223–1231*
not limited by these shortcomings is a technique known as
drugs are advised to be taken lifelong. Another purpose of

*instrumental variables*. In this approach, the focus is on
observational studies is to investigate the causes of interindi-
finding a variable (the instrument) that is related to the
vidual variability in drug response. Most causes of variability
allocation of treatments, but is related to outcome only
in drug response are unknown. Observational studies can
because of its relation to treatment. This technique can
also be used to assess the intended effects of drugs in patients
achieve the same effect as randomization in bypassing the
that were excluded from RCTs (e.g., very young patients, or
usual way in which physicians allocate treatment according
patients with different comorbidities and polypharmacy),
to prognosis, but its rather strong assumptions limit its use
or in patients that were studied in RCTs but who might still
in practice. Related techniques are

*two-stage least squares*
respond differently (e.g., because of genetic differences).

and the

*grouped-treatment approach*, sharing the same limi-
Comparison between the presented methods to assess
tations. All these methods are summarized in
adjusted treatment effects in observational studies is mainly
Given the limitations of observational studies, the evi-
based on theoretical considerations, although some empirical
dence in assessing intended drug effects from observational
evidence is available. A more complete empirical evaluation
studies will be in general less convincing than from well con-
that compares the different adjustment methods with respect
ducted RCTs. The same of course is true when RCTs are
to the estimated treatment effects under several conditions

*not *well conducted (e.g., lacking double blinding or exclu-
will be needed to assess the validity of the different meth-
sions after randomization). This means that due to differ-
ods. Preference for one method or the other can be expressed
ences in quality, size or other characteristics disagreement
in terms of bias, precision, power, and coverage probability
among RCTs is not uncommon In general we sub-
of the methods, whereas the different conditions can be
scribe to the view that observational studies including appro-
defined by means of, for instance, the severity of the dis-
priate adjustments are less suited to assess new intended
ease, the number of covariates, the strength of association
drug effects (unless the expected effect is very large), but can
between covariates and outcome, the association among the
certainly be valuable for assessing the long-term beneficial
covariates, and the amount of overlap between the groups.

effects of drugs already proven effective in short-term RCTs.

These empirical evaluations can be performed with existing
For instance, the RCTs of acetylsalicylic acid that demon-
databases or computer simulations. Given the lack of empiri-
strated the beneficial effects in the secondary prevention of
cal evaluations for comparisons of the different methods and
coronary heart disease were of limited duration, but these
the importance of the assessment of treatment effects in
observational studies, more effort should be directed toward these evaluations.

**Table 1.** Strengths and limitations of methods to assess treatment effects in nonrandomized, observational studies

Design approaches

Historical controls
• Easy to identify comparison group
• Treatment effect often biased

Candidates for treatment
• Useful for preliminary selection
• Difficult to identify not treated candidates

Treatments for the same indication
• Similarity of prognostic factors
• Only useful for diseases treated with
• Only effectiveness of one drug compared to another

Case–crossover and case–time–control
• Reduced variability by intersubject
• Only useful to assess time-limited effects
• Possible crossover effects

Stratification and (weighted) matching
• Clear interpretation / no assumptions
• Only a few covariates or rough categories
• Clarity of incomparability on used
• More covariates than with normal
• Still limited number of covariates

Common statistical techniques: regression, logistic regression, survival analysis
• More covariates than matching or
• Focus is not on balancing groups
• Adequate overlap between groups difficult to assess
• Easy to perform

Propensity scores
• Many covariates possible
• Performs better with only a few number of events per confounder

Multivariate confounder score
• Less insensitive to
• Exaggerates significance
• Immune to confounding by indication
• Loss of power by reduced number of units
• Loss of information at the individual level

Instrumental variables (IV), two-stage least squares;
• Large differences per area are needed
• Difficult to identify instrumental variable(s)
• Strong assumption that IV is unrelated with factors directly affecting outcome

**References**

[1] Friedman LM, Furberg CD, DeMets DL. Fundamentals of clinical trials. St Louis: Mosby-Year Book; 1996.

[2] Chalmers I. Why transition from alternation to randomisation in clinical trials was made [Letter]. BMJ 1999;319:1372.

[3] Schulz KF, Grimes DA. Allocation concealment in randomised trials: defending against deciphering. Lancet 2002;359:614–8.

[4] Urbach P. The value of randomization and control in clinical trials. Stat Med 1993;12:1421–31; discussion 1433–41.

[5] Feinstein AR. Current problems and future challenges in randomized clinical trials. Circulation 1984;70:767–74.

[6] Gurwitz JH, Col NF, Avorn J. The exclusion of the elderly and women from clinical trials in acute myocardial infarction. JAMA 1992;268:

[7] Wieringa NF, de Graeff PA, van der Werf GT, Vos R. Cardiovascular drugs: discrepancies in demographics between pre- and post-registration use. Eur J Clin Pharmacol 1999;55:537–44.

[8] MacMahon S, Collins R. Reliable assessment of the effects of treatment on mortality and major morbidity, II: observational studies.

[9] McKee M, Britton A, Black N, McPherson K, Sanderson C, Bain C. Methods in health services research. Interpreting the evidence: choosing between randomised and non-randomised studies. BMJ 1999;319:

[10] Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med

[11] Grodstein F, Clarkson TB, Manson JE. Understanding the divergent data on postmenopausal hormone therapy. N Engl J Med 2003;348:

[12] Beral V, Banks E, Reeves G. Evidence from randomised trials on the long-term effects of hormone replacement therapy. Lancet 2002;

[13] Messerli FH. Case–control study, meta-analysis, and bouillabaisse: putting the calcium antagonist scare into context [Editorial]. Ann Intern Med 1995;123:888–9.

[14] Grobbee DE, Hoes AW. Confounding and indication for treatment in evaluation of drug treatment for hypertension. BMJ 1997;315:1151–4.

[15] Rosenbaum PR. Observational studies. 2nd edition. New York: Springer; 2002.

[16] Sacks H, Chalmers TC, Smith H Jr. Randomized versus historical controls for clinical trials. Am J Med 1982;72:233–40.

[17] Kalra L, Yu G, Perez I, Lakhani A, Donaldson N. Prospective cohort study to determine if trial efficacy of anticoagulation for stroke prevention in atrial fibrillation translates into clinical effectiveness. BMJ

[18] Ioannidis JP, Polycarpou A, Ntais C, Pavlidis N. Randomised trials comparing chemotherapy regimens for advanced non-small cell lung cancer: biases and evolution over time. Eur J Cancer 2003;39:

[19] Klungel OH, Stricker BH, Breteler MM, Seidell JC, Psaty BM, de Boer A. Is drug treatment of hypertension in clinical practice as effective as in randomized controlled trials with regard to the reduction of the incidence of stroke? Epidemiology 2001;12:339–44.

[20] Johnston SC. Identifying confounding by indication through blinded prospective review. Am J Epidemiol 2001;154:276–84.

[21] Psaty BM, Heckbert SR, Koepsell TD, Siscovick DS, Raghunathan TE, Weiss NS, Rosendaal FR, Lemaitre RN, Smith NL, Wahl PW. The risk of myocardial infarction associated with antihypertensive drug therapies. JAMA 1995;274:620–5.

[22] Klungel OH, Heckbert SR, Longstreth WT Jr, Furberg CD, Kaplan RC, Smith NL, Lemaitre RN, Leufkens HG, de Boer A, Psaty BM. Antihypertensive drug therapies and the risk of ischemic stroke. Arch Intern Med 2001;161:37–43.

[23] Abi-Said D, Annegers JF, Combs-Cantrell D, Suki R, Frankowski RF, Willmore LJ. A case–control evaluation of treatment efficacy: the example of magnesium sulfate prophylaxis against eclampsia in patients with preeclampsia. J Clin Epidemiol 1997;50:419–23.

[24] Concato J, Peduzzi P, Kamina A, Horwitz RI. A nested case–control study of the effectiveness of screening for prostate cancer: research design. J Clin Epidemiol 2001;54:558–64.

[25] Maclure M. The case–crossover design: a method for studying transient effects on the risk of acute events. Am J Epidemiol 1991;133:

[26] Greenland S. Confounding and exposure trends in case–crossover and case–time–control designs. Epidemiology 1996;7:231–9.

[27] Suissa S. The case–time–control design. Epidemiology 1995;6:

[28] Suissa S. The case–time–control design: further assumptions and conditions. Epidemiology 1998;9:441–5.

[29] Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 1968;24:295–313.

[30] Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997;127:757–63.

[31] Cook EF, Goldman L. Asymmetric stratification: an outline for an efficient method for controlling confounding in cohort studies. Am J Epidemiol 1988;127:626–39.

[32] Psaty BM, Koepsell TD, Lin D, Weiss NS, Siscovick DS, Rosendaal FR, Pahor M, Furberg CD. Assessment and control for confounding by indication in observational studies. J Am Geriatr Soc 1999;47:749–54.

[33] Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol

[34] Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373–9.

[35] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41–55.

[36] D'Agostino RB Jr. Tutorial in biostatistics: propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med 1998;17:2265–81.

[37] Stenestrand U, Wallentin L. Early statin treatment following acute myocardial infarction and 1-year survival. JAMA 2001;285:430–6.

[38] Aronow HD, Topol EJ, Roe MT, Houghtaling PL, Wolski KE, Lincoff AM, Harrington RA, Califf RM, Ohman EM, Kleiman NS, Keltai M, Wilcox RG, Vahanian A, Armstrong PW, Lauer MS. Effect of lipid-lowering therapy on early mortality after acute coronary syndromes: an observational study. Lancet 2001;357:1063–8.

[39] Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985;39:33–8.

[40] Ming K, Rosenbaum PR. Substantial gains in bias reduction from matching with a variable number of controls. Biometrics 2000;56:

[41] Drake C. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics 1993;49:1231–6.

[42] Cepeda MS, Boston R, Farrar JT, Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol 2003;158:280–7.

[43] Miettinen OS. Stratification by a multivariate confounder score. Am J

[44] Pike MC, Anderson J, Day N. Some insights into Miettinen's multivariate confounder score approach to case–control study analysis. Epidemiol Community Health 1979;33:104–6.

[45] Newhouse JP, McClellan M. Econometrics in outcomes research: the use of instrumental variables. Annu Rev Public Health 1998;19:17–34.

[46] Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc 1996;91:444–55.

[47] Permutt T, Hebel JR. Simultaneous-equation estimation in a clinical trial of the effect of smoking on birth weight. Biometrics 1989;45:

[48] McClellan M, McNeil BJ, Newhouse JP. Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? Analysis using instrumental variables. JAMA 1994;272:859–66.

[49] Angrist JD, Imbens GW. Two-stage least squares estimation of average causal effects in models with variable treatment intensity. J Am Stat

[50] Snow Jones A, Miller DJ, Salkever DS. Parental use of alcohol and children's behavioural health: a household production analysis. Health

[51] Wen SW, Kramer MS. Uses of ecologic studies in the assessment of intended treatment effects. J Clin Epidemiol 1999;52:7–12.

[52] Naylor CD. Ecological analysis of intended treatment effects: caveat emptor. J Clin Epidemiol 1999;52:1–5.

[53] Johnston SC, Henneman T, McCulloch CE, van der Laan M. Modeling treatment effects on binary outcomes with grouped-treatment variables and individual covariates. Am J Epidemiol 2002;156:753–60.

[54] Horwitz RI, Viscoli CM, Clemens JD, Sadock RT. Developing improved observational methods for evaluating therapeutic effectiveness. Am J Med 1990;89:630–8.

[55] Hlatky MA, Califf RM, Harrell FE Jr, Lee KL, Mark DB, Pryor DB. Comparison of predictions based on observational data with the results of randomized controlled clinical trials of coronary artery bypass surgery. J Am Coll Cardiol 1988;11:237–45.

[56] Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials. N Engl J Med 2000;342:1878–86.

[57] Ioannidis JP, Haidich AB, Pappa M, Pantazis N, Kokori SI, Tektonidou MG, Contopoulos-Ioannidis DG, Lau J. Comparison of evidence of treatment effects in randomized and nonrandomized studies. JAMA

[58] Kunz R, Oxman AD. The unpredictability paradox: review of empirical comparisons of randomised and non-randomised clinical trials.

[59] Cornfield J, Haenszel W, Hammond EC, Lilienfeld AM, Shimkin MB, Wynder EL. Smoking and lung cancer: recent evidence and a discussion of some questions. J Natl Cancer Inst 1959;22:173–203.

[60] Fisher RA. Lung cancer and cigarettes? Nature 1958;182:108.

[61] Lin DY, Psaty BM, Kronmal RA. Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Bio-

[62] LeLorier J, Gregoire G, Benhaddad A, Lapierre J, Derderian F. Discrepancies between meta-analyses and subsequent large randomized, controlled trials. N Engl J Med 1997;337:536–42.

[63] Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408–12.
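As a supplementary illustration of the propensity score approach described in the Summary and discussion, the following is a minimal sketch in Python and not the authors' implementation. It assumes a single measured confounder, a simulated data set with a true treatment effect of 1.0, and stratification into five strata of the estimated score; the function names `fit_propensity` and `stratified_effect` and all numerical settings are hypothetical choices made for this example only.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_propensity(x, t, n_iter=15):
    """Logistic regression of treatment t on one covariate x (Newton-Raphson)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += ti - p            # gradient, intercept
            g1 += (ti - p) * xi     # gradient, slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def stratified_effect(y, t, score, n_strata=5):
    """Size-weighted average of within-stratum treated-minus-control differences."""
    n = len(score)
    order = sorted(range(n), key=lambda i: score[i])
    total, weight = 0.0, 0
    for s in range(n_strata):
        idx = order[s * n // n_strata:(s + 1) * n // n_strata]
        treated = [y[i] for i in idx if t[i] == 1]
        control = [y[i] for i in idx if t[i] == 0]
        if treated and control:  # skip strata without overlap
            diff = sum(treated) / len(treated) - sum(control) / len(control)
            total += diff * len(idx)
            weight += len(idx)
    return total / weight


# Simulated (synthetic) data: x confounds treatment allocation and outcome.
random.seed(0)
n = 20000
true_effect = 1.0
x = [random.gauss(0.0, 1.0) for _ in range(n)]
t = [1 if random.random() < sigmoid(1.5 * xi) else 0 for xi in x]
y = [true_effect * ti + 2.0 * xi + random.gauss(0.0, 1.0) for xi, ti in zip(x, t)]

b0, b1 = fit_propensity(x, t)
score = [sigmoid(b0 + b1 * xi) for xi in x]

y1 = [yi for yi, ti in zip(y, t) if ti == 1]
y0 = [yi for yi, ti in zip(y, t) if ti == 0]
naive = sum(y1) / len(y1) - sum(y0) / len(y0)
adjusted = stratified_effect(y, t, score)

print(f"naive difference:      {naive:.2f}")
print(f"stratified on score:   {adjusted:.2f}")
print(f"true simulated effect: {true_effect:.2f}")
```

In this simulated setting the crude comparison is strongly confounded, while stratifying on the estimated score removes most, but not all, of the bias. The sketch only adjusts for the measured covariate, which is the limitation the discussion emphasizes for all methods of this type.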

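The instrumental variable and two-stage least squares approach summarized in the discussion can be sketched in a similar way. This is a hedged illustration under stated assumptions, not the procedure of any cited study: the data are simulated, the confounder `u` is treated as unmeasured, the binary instrument `z` affects the outcome only through treatment, and the helper names `ols_fit` and `two_stage_least_squares` are hypothetical.

```python
import random


def ols_fit(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def two_stage_least_squares(y, t, z):
    """Stage 1: predict treatment from the instrument. Stage 2: regress outcome on the prediction."""
    a, b = ols_fit(z, t)
    t_hat = [a + b * zi for zi in z]
    return ols_fit(t_hat, y)[1]


# Simulated (synthetic) data with an unmeasured confounder u.
random.seed(1)
n = 50000
true_effect = 1.0
u = [random.gauss(0.0, 1.0) for _ in range(n)]           # unmeasured prognosis
z = [1 if random.random() < 0.5 else 0 for _ in range(n)]  # instrument
t = [1 if 2.0 * zi + ui + random.gauss(0.0, 1.0) > 1.0 else 0
     for zi, ui in zip(z, u)]
y = [true_effect * ti + 2.0 * ui + random.gauss(0.0, 1.0)
     for ti, ui in zip(t, u)]

naive_slope = ols_fit(t, y)[1]          # ignores u, so it is confounded
iv_estimate = two_stage_least_squares(y, t, z)

print(f"naive regression slope: {naive_slope:.2f}")
print(f"two-stage estimate:     {iv_estimate:.2f}")
print(f"true simulated effect:  {true_effect:.2f}")
```

The two-stage estimate is close to the simulated effect only because the instrument was constructed to be independent of `u` and strongly related to treatment. With real data neither condition can be checked directly, which is why the discussion describes the assumptions as rather strong.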
Source: http://www.statisticor.nl/pdf/Klungel%20O%20-%20Martens%20EP%20-%20Methods%20to%20assess%20intended%20drug%20effects%20are%20reviewed.pdf
