The Adequacy of Reporting Randomized, Controlled Trials in the Evaluation of Antidepressants
David L Streiner, PhD, CPsych1, Russell Joffe, MD2
Objective: To evaluate the adequacy of the reports of methodological and
statistical aspects of randomized, controlled trials that evaluate antidepressant
medications and the degree to which their results can be used in subsequent metaanalyses.
Data Sources: Randomized, controlled trials published in English that compared
2 antidepressant drugs with a placebo were reviewed. Papers were located
using Medline and reference lists.
Data Extraction: Each paper was evaluated using a checklist, and 3 summary
criteria scores were derived: minimal, ideal, and overall.
Results: Only 9 of the 69 papers met the minimal criteria to be included
in a metaanalysis (which would report the final sample size and an estimate
of the mean and of the standard deviation); and 0 met all of the ideal
criteria for reporting clinical trials. Out of a possible 100 points, the
mean score for the articles was 51, and the highest score was 80. Scores
were not related to the citation impact factor of the journal in which the article was published.
Conclusions: Researchers must become more aware of the criteria for reporting
clinical trials, and editors must insist more strenuously that these criteria be met.
(Can J Psychiatry 1998;43:1026–1030)
Key Words: statistical reporting, metaanalysis, clinical trials
Over the past few years, there has been increasing emphasis placed on using
the principles of evidence-based medicine (EBM) in medical (1) and psychiatric
(2) care and teaching. The central premise of EBM is that all aspects of
clinical practice (including diagnosis, prognosis, and intervention) should
be based on the best available evidence. When the issue involves selecting
a therapy for a given condition, it is generally agreed that the most convincing
results come from randomized, controlled trials (RCTs), in which some of
the subjects are randomly allocated to receive the test treatment and the
others receive either a placebo or standard care (3). The primary advantage
of RCTs is that they minimize bias between groups caused by confounding
variables (4). Even firmer conclusions can be drawn from well-conducted
metaanalyses, in which the results of many trials are combined to arrive
at an overall estimate of the effectiveness of the treatment (5). Thus,
the validity of the decision regarding which therapeutic intervention to
use is predicated on a chain of events: 1) the RCTs are well-designed,
executed, and analyzed; 2) the results of these trials are properly reported;
and 3) the metaanalyses include all the relevant studies and are themselves properly conducted.
Various sets of criteria have been developed for each of these steps. Numerous
books and articles have been written regarding subject selection and allocation,
minimization of bias, the counting of events, analytic problems, and other
issues involved in the design, execution, and analysis of RCTs (for example,
6,7). Although there are still some unresolved issues in metaanalysis,
such as whether or not it is necessary to reanalyze data from existing
trials (8) or whether to use random- or fixed-effect models (9), the process
itself has been relatively well described (10). This paper focuses on the
intermediate stage: the proper reporting of the results of individual trials.
As is the case with the first and last steps, there have been many sets
of criteria proposed for presenting the findings of RCTs (11–13). Indeed,
within the past few years, many leading medical journals have presented
their own guidelines or have called for such criteria to be developed (13–17).
These statements were largely motivated by reports that highlighted the
generally poor quality of many studies in the literature (11,18–21); the
problem is not only with articles in English but also with those in French,
German, Spanish, and Italian (22).
Most guidelines agree that the articles, in reporting results, should present,
as a minimum, 1) the original sample sizes in each group; 2) the number
of subjects who did not complete the trial and the reasons for this; 3)
simple summary data for the primary outcome (measures of central tendency
and variability) so that the reader can reproduce the results; 4) the statistical
procedures that were used; 5) the numerical value of the statistical tests;
and 6) the exact P level rather than P relative to some benchmark (usually
< 0.05 or < 0.01). Without this information, it is extremely difficult,
if not impossible, for the reader to determine whether the study has internal
validity (23). Further, these data are necessary if the trial results are
to be included in subsequent metaanalyses. Despite these guidelines, which
simply formalize what many statistical editors look for when reviewing
manuscripts, a recent study found that a significant proportion of articles
meeting the design criteria for a metaanalysis could not be included, because
some necessary information was missing from the report, usually the measure
of central tendency and variability (24). Even among those articles that
were included in the analysis, it was sometimes necessary to estimate mean
values from graphs and to use measures of variability based on other studies
employing the same outcome measures. This paper reanalyzes those papers
to document their adequacy in reporting results, the criteria that were
most problematic, and the relationship, if any, between the completeness
of the reporting and the editorial impact of the journal in which the
paper was published.
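The minimal reporting items discussed above (final sample sizes, group means, and SDs) are exactly what a metaanalyst needs to compute a standardized effect size. A minimal sketch in Python follows; the function name and all numbers are ours, hypothetical, and not drawn from any of the reviewed trials:

```python
import math

def standardized_effect_size(mean_drug, sd_drug, n_drug,
                             mean_placebo, sd_placebo, n_placebo):
    """Cohen's d from the summary data a trial report should contain:
    final sample sizes plus the mean and SD for each group."""
    # Pool the two group variances, weighted by degrees of freedom
    pooled_var = ((n_drug - 1) * sd_drug ** 2 +
                  (n_placebo - 1) * sd_placebo ** 2) / (n_drug + n_placebo - 2)
    # Lower depression-scale scores are better, so drug < placebo gives d > 0
    return (mean_placebo - mean_drug) / math.sqrt(pooled_var)

# Hypothetical posttreatment scores on a depression rating scale
d = standardized_effect_size(12.0, 6.0, 40, 16.0, 7.0, 38)
print(round(d, 2))  # 0.61
```

When a paper omits the SDs or the final sample sizes, this computation is simply impossible, which is why such trials drop out of metaanalyses.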
Method

The method of locating these articles has been described previously (24).
Briefly, the articles constitute RCTs published in English that compared
2 different antidepressants with a placebo condition (a list of the articles
is available from the authors). After eliminating articles that were superseded
by later ones reporting on the same subjects, 69 papers were reviewed by
a psychiatrist or a clinical psychologist. These papers were published
in 26 different journals. The interrater agreement (κ) of whether or not
the articles met the inclusion criteria was 0.80. Articles were evaluated
for their reports of 1) whether or not explicit diagnostic criteria were
used and by whom; 2) initial and final sample sizes and the reasons for
which participants dropped out; 3) the setting (inpatient, outpatient,
or mixed) and, if mixed, the number of subjects from each; 4) indices of
change; 5) some index of within-group variability; 6) the statistical tests;
7) numerical results of the tests; and 8) the exact P levels versus an
indication relative to some value. (A copy of the rating sheet is available
from the authors.)
Using these criteria, 3 scores were derived for each paper. The Minimum
Criteria score was 1 if the paper reported 1) the final sample size; 2)
some value for the magnitude of the change for each group, even if it had
to be estimated from a graph; and 3) some index of variability, which also
was credited if it could be estimated from a graph. If the paper failed
on any criterion, it was given a score of 0. This score reflects whether
there is the minimum amount of information in the paper for it to be included
in a metaanalysis. The Ideal Criteria score was 1 if the paper reported
1) the initial sample size; 2) the final sample size; 3) numerical values
for the pre- and postintervention measure or the change score; 4) actual
values for the appropriate standard deviations; 5) which statistical test
was used; 6) the actual value of the statistic; and 7) the exact P level.
Again, the paper was assigned a score of 0 if any criterion was not satisfied.
Finally, a continuous Overall Criteria score was derived by assigning 10
points each for reporting 1) the initial sample size, 2) the final sample
size, 3) the reasons for which subjects dropped out of the trial, 4) pre-
and posttest scores, 5) numerical values for these, 6) pre- and posttest
SDs, 7) actual values of the SDs, 8) the statistic used, 9) the value of
the statistic, and 10) the exact P level. For criteria 5 and 6, only 5
points were awarded if these indices had to be estimated from a graph;
and for criterion 10, only 5 points were given if P was indicated to be
less than some benchmark. Finally, 1 point was given for reporting only
posttest means and SDs (criteria 4 and 5). Each study could have a score
between 0 and 100.
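The Overall Criteria scheme above can be sketched in code. This is our own illustrative rendering, not the authors' rating sheet; the field names and the handling of partial credit are assumptions based on the rules just described:

```python
PARTIAL_CREDIT = {"graph": 5, "benchmark": 5, "post_only": 1}

ITEMS = ["initial_n", "final_n", "dropout_reasons", "pre_post_scores",
         "score_values", "pre_post_sds", "sd_values", "test_named",
         "test_value", "exact_p"]

def overall_criteria_score(paper):
    """Overall Criteria score, 0-100: 10 points per fully reported item,
    partial credit for values estimated from a graph, for P given only
    relative to a benchmark, or for posttest-only reporting."""
    score = 0
    for item in ITEMS:
        value = paper[item]
        if value is True:
            score += 10                      # item fully reported
        elif value in PARTIAL_CREDIT:
            score += PARTIAL_CREDIT[value]   # item partially reported
    return score

# Hypothetical paper: means estimated from a graph, no SDs,
# test named but its value not given, P reported only as < 0.05
example = {"initial_n": True, "final_n": True, "dropout_reasons": False,
           "pre_post_scores": True, "score_values": "graph",
           "pre_post_sds": False, "sd_values": False,
           "test_named": True, "test_value": False, "exact_p": "benchmark"}
print(overall_criteria_score(example))  # 50
```

A paper reporting every item fully would score 100; the hypothetical paper above, with its graph-only means and benchmark-only P value, scores 50, close to the mean observed in this survey.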
The Citation Impact Factor (CIF) for each journal was determined using
the 1995 Journal Citation Reports (25). One journal was not listed; omitting
the article from the analysis or assigning the journal a CIF of 0 showed
no appreciable difference in the results. These scores were correlated
with the Overall Criteria score using Pearson's product-moment correlation coefficient.
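As a sketch, Pearson's product-moment correlation can be computed directly; the Overall Criteria scores and CIF values below are fabricated purely for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    ss_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (ss_x * ss_y)

# Fabricated Overall Criteria scores and journal CIFs, illustration only
scores = [38, 45, 51, 55, 60, 80]
cifs = [0.9, 1.0, 3.5, 1.7, 2.1, 2.8]
print(round(pearson_r(scores, cifs), 2))  # 0.58
```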
Results

Reporting of Sample Sizes
All of the papers except 1 (98.6%) reported the initial sample sizes of
the groups; and the vast majority (64; 92.8%) indicated how many participants
completed the trial. Among the 67 studies that had dropouts, 46 (68.7%)
reported the reasons for which subjects discontinued, and 21 (31.3%) did not.
Reporting of Findings
Of the 69 papers, 45 (65.2%) reported both the pre- and posttreatment means
for the main outcome measure (often with intermediate values as well);
6 (8.7%) gave only a single value to reflect the change over the course
of the trial; 3 (4.3%) reported only the posttreatment score; and 15 (21.7%)
did not report any values for the outcome measure. The actual numerical
value of these scores (pre- and posttreatment, change, or posttreatment
only) was reported in 28 papers (40.6%); in 26 articles (37.7%), the values
had to be estimated from a graph.
Only 9 papers (13.0%) indicated the SD of the scores: 8 (11.6%) reported
the SD for the pre- and posttreatment scores, and 1 (1.4%) gave the SD
for the posttreatment score. No paper, even those that reported the results
as change scores, gave the SD of the change score; and the majority (60;
87.0%) did not give the SD at all. Of the 9 papers that did provide SDs,
4 (5.8%) reported the numerical value; SDs in the remaining 5 papers (7.2%)
had to be estimated from a graph. In some cases, the standard error (SE)
was shown, so the SD could be calculated using the SE value and the sample size.
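Since the SE of the mean equals SD/√n, recovering the SD is a one-line conversion (the values below are illustrative, not taken from any reviewed paper):

```python
import math

def sd_from_se(se, n):
    """Recover the standard deviation from a reported standard error
    of the mean: SE = SD / sqrt(n), so SD = SE * sqrt(n)."""
    return se * math.sqrt(n)

# e.g., a plotted SE of 0.9 for a group with 36 completers
print(sd_from_se(0.9, 36))  # 5.4
```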
Reporting of Statistical Testing
Sixty papers (87.0%) indicated which statistical tests were employed. Most
often, this information was provided in the Results or Methods section
of the paper, and it was not always obvious which test was used for a particular
result. However, only 17 articles (24.6%) gave the actual value of the
test statistic. In 23 cases (33.3%), the exact value of the P level was
given; in 33 cases (47.8%), P was reported relative to some benchmark
(usually < 0.05); and in 13 papers (18.8%), no value was provided, with the
authors simply reporting whether or not the results were significant.
Other Methodological Issues
It has been shown previously that the effect size (ES) is larger in studies
in which strict diagnostic criteria were used than in studies that did
not report using objective criteria (24). In this sample, 53 papers (76.8%)
reported having used such criteria (usually from the Diagnostic and Statistical
Manual of Mental Disorders [DSM-III or DSM-III-R] or the International Classification
of Diseases [ICD-9], the Research Diagnostic Criteria, or the
Feighner criteria). However, only 25 papers (36.2%) stated how these
criteria were implemented. All but 2 of the articles (97.1%) indicated
whether the study used inpatients, outpatients, or both. Of those that
used patients from both settings, 78.3% (18/23) reported how many subjects
came from each.
Only 9 of the studies (13.0%) had a Minimum Criteria score of 1, and 0
had an Ideal Criteria score of 1. The mean of the Overall Criteria scores
for the studies was 51.1 (SD 14.09). The correlation between the Overall
Criteria score and the CIF was 0.288 (P < 0.05), and a scatterplot is shown
in Figure 1. It indicates that some articles in journals with the highest
CIFs had relatively poor Overall Criteria scores and, conversely, some
of the best-reported articles were in journals with the lowest CIFs.
Figure 1. Scatterplot of Citation Impact Factor and Overall Criteria score for 68 articles.
Discussion

The main finding from this study is that, despite the publication of guidelines
for reporting the results of RCTs (for example, 13–16), the quality of
the average paper in this survey was less than satisfactory. The 69 papers
included in this review were selected because they met the criteria for
a metaanalysis. Thus, there is no reason to suspect that they are systematically
worse than other published articles in psychiatry. In fact, because they
all had to meet stringent design criteria regarding random assignment of
patients and appropriate comparison conditions, they may actually be the
better articles in the field. Yet, if a metaanalyst were to require that
each article report just the final sample size, and numerical values for
the means and SDs, only 9 articles (13%) would be included. If all of the
criteria listed by the various journal guidelines had to be met, then no
article would qualify. The only way that the previous metaanalysis (24)
could reflect a more representative sample of papers was by including articles
in which it was possible to estimate mean values from graphs and by using
values for the SDs from other studies. The dilemma is that, by doing so,
error is introduced for these parameters. Conversely, to have eliminated
these papers would have resulted in estimates for the overall ES based
on a much smaller number of values and would have made the search for factors
that affect the ESs much more difficult.
Some of the purported benefits of metaanalyses are that they are more objective
than narrative summaries and may utilize a less biased sample of evidence
(29,30). Despite this, Oxman and Guyatt, for example, found 5 reviews that
had conflicting recommendations for managing mild hypertension (31); Greenberg
and others' metaanalysis (32) came to a conclusion regarding the effectiveness
of antidepressants diametrically opposed to that in Joffe and others' metaanalysis
(24). One possible reason for different metaanalyses arriving at discrepant
results is that various authors may have different rules regarding which
articles to include or reject based on the reporting of methodological
and statistical details. Thus, each metaanalysis may begin with the same
population of studies, but end up with different subsets of articles for
inclusion, based on various criteria. While the relationship between the
methodological rigour of a study and the magnitude of its findings is still
a topic of debate, there are some indications that more poorly executed
ones show larger ESs than more tightly controlled ones (33). The effect
of this would be that more inclusive metaanalyses may report larger effects
than more restrictive ones. This could be resolved by all studies adhering
to a minimum set of criteria to lessen discrepancies among metaanalyses
of articles included for review.
The purpose of this paper is not to blame either researchers or editors.
Rather, it is to highlight an existing problem, found in other fields of
medicine as well (for example, 33), so that it can be ameliorated. The solution
is not difficult and has already been identified by many journals and societies
(for example, 16): results of studies should be reported in such a way
that the readers can verify the findings and come to their own conclusions
about the clinical importance of the outcome. This can be done only if
certain information is present in the articles. This would include, at
the very least, what we have called the Minimum Criteria and ideally should
include more complete information: the Ideal Criteria. These would include
the initial and final sample sizes for each group; the numerical values
for the pre- and postintervention means and SDs (or the change scores);
the names and numerical values of the statistical tests used; and the exact
value of the P level. Responsibility for ensuring that these data are a
part of all papers rests with 3 sets of people: the authors, the reviewers,
and the editors. What is needed is a concerted will by the authors to include
the information; by the reviewers to identify what necessary data are missing;
and by the editors to refuse to publish the results of any study, no matter
how well executed, unless the data appear in the paper.
All studies represent a significant investment of time, effort, and money
by the researchers and sponsoring agencies and of time and risk of adverse
reactions for the participants. Publication of these studies in a way that
does not allow the results to be fully utilized by readers or metaanalysts
is poor use of these resources. Both the clinical and research communities
would be better served if the publication guidelines promulgated by the
journals were followed more rigorously.
Clinical Implications

The majority of articles reporting the results of trials of antidepressants
do not contain sufficient information to allow them to be included in metaanalyses.
Metaanalyses of the effectiveness or efficacy of antidepressants may yield misleading results.
We cannot be as confident as we would like to be that the drugs have been
proven to be effective.
Limitations

This study was based on only 1 type of trial: antidepressants used in studies
with 2 active arms and a placebo control. It did not look at studies comparing
1 antidepressant with a placebo or with other classes of drugs.
Acknowledgements

The authors thank Dr Alex Jadad, Dr Meir Steiner, and Dr Trevor Young for
their help and advice.
References

1. Cook DJ, Sibbald WJ, Vincent JL, Cerra FB. Evidence based critical care medicine: what is it and what can it do for us? Crit Care Med 1996;24:334–7.
2. Goldner EM, Bilsker D. Evidence-based psychiatry. Can J Psychiatry 1995;40:97–101.
3. Department of Clinical Epidemiology and Biostatistics. How to read clinical journals: IV. To determine etiology or causation. Can Med Assoc J 1981;124:985–90.
4. Streiner DL, Norman GR. PDQ epidemiology. 2nd ed. St Louis: Mosby; 1996.
5. Glass GV. Primary, secondary, and meta-analysis of research. Educational Researcher 1976;5:3–8.
6. Cochran WG. Sampling techniques. New York: Wiley; 1977.
7. Sackett DL, Gent M. Controversy in counting and attributing events in clinical trials. N Engl J Med 1979;301:1410–2.
8. Yusuf S. Obtaining medically meaningful answers from an overview of randomized clinical trials. Stat Med 1987;6:281–6.
9. Hedges LV. Meta-analysis. Journal of Educational Statistics 1992;17:279–96.
10. Einarson TR, Leeder JS, Koren G. A method for meta-analysis of epidemiological studies. Drug Intelligence and Clinical Pharmacy 1988;22:813–24.
11. Gøtzsche PC. Methodology and overt and hidden bias in reports of 196 double-blind trials of nonsteroidal antiinflammatory drugs in rheumatoid arthritis. Control Clin Trials 1989;10:31–56.
12. Meinert CL. Clinical trials: design, conduct, and analysis. New York: Oxford University Press; 1986.
13. Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, and others. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996;276:637–9.
14. Rennie D. Reporting randomized controlled trials: an experiment and a call for responses from readers. JAMA 1995;273:1054–5.
15. Standards of Reporting Trials Group. A proposal for structured reporting of randomized controlled trials. JAMA 1994;272:1926–31.
16. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Ann Intern Med 1988;108:266–73.
17. Working Group on Recommendations for Reporting of Clinical Trials in the Biomedical Literature. Call for comments on a proposal to improve reporting of clinical trials in the biomedical literature. Ann Intern Med 1994;121:894–5.
18. Fletcher RH, Fletcher SW. Clinical research in general medical journals: a 30-year perspective. N Engl J Med 1979;301:180–3.
19. Altman DG. The scandal of poor medical research. BMJ 1994;308:283–4.
20. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med 1987;317:426–32.
21. Gardner MJ, Bond J. An exploratory study of statistical assessment of papers published in the British Medical Journal. JAMA 1990;263:1355–7.
22. Moher D, Fortin P, Jadad AR, Jüni P, Klassen T, Le Lorier J, and others. Completeness of reporting of trials published in languages other than English: implications for conduct and reporting of systematic reviews. Lancet 1996;347:363–6.
23. Cook TD, Campbell DT. Quasi-experimentation: design and analysis issues for field settings. Boston: Houghton Mifflin; 1979.
24. Joffe R, Sokolov S, Streiner DL. Antidepressant treatment of depression: a metaanalysis. Can J Psychiatry 1996;41:613–6.
25. Garfield E. Journal citation reports. Philadelphia: Institute for Scientific Information; 1995.
26. National Center for Health Statistics. International classification of diseases. 9th ed. Washington (DC): US Department of Health and Human Services.
27. Spitzer RL, Endicott J, Robins E. Research diagnostic criteria: rationale and reliability. Arch Gen Psychiatry 1978;35:773–82.
28. Feighner JP, Robins E, Guze SB, Woodruff RA, Winokur G, Munoz R. Diagnostic criteria for use in psychiatric research. Arch Gen Psychiatry 1972;26:57–63.
29. Olkin I. Meta-analysis: reconciling the results of independent studies. Stat Med 1995;14:457–72.
30. Peto R. Why do we need systematic overviews of randomized trials? Stat Med 1987;6:233–40.
31. Oxman AD, Guyatt GH. Guidelines for reading literature reviews. Can Med Assoc J 1988;138:697–703.
32. Greenberg RP, Bornstein RF, Greenberg MD, Fisher S. A meta-analysis of antidepressant outcome under "blinder" conditions. J Consult Clin Psychol 1992;60:664–9.
33. Jadad AR. Systematic reviews and meta-analyses in pain relief research: what can (and cannot) they do for us? Proceedings of the 8th World Congress on Pain; 1996; Seattle. Seattle: International Association for the Study of Pain (IASP). p 445–52.
Manuscript received February 1998 and accepted March 1998.
Portions of this paper were presented at the Canadian Academy of Psychiatric
Epidemiology, Calgary, Alberta, 16 September 1997.
1Professor, Department of Psychiatry, University of Toronto, Toronto, Ontario;
Assistant Vice President for Research, Director, Kunin-Lunenfeld Applied
Research Unit, Baycrest Centre for Geriatric Care, North York, Ontario.
2Dean, Vice President, Faculty of Health Sciences, McMaster University, Hamilton, Ontario.
Address for correspondence: Dr DL Streiner, Kunin-Lunenfeld Applied Research
Unit, Baycrest Centre for Geriatric Care, 3560 Bathurst Street, North York,
ON M6A 2E1
Can J Psychiatry, Vol 43, December 1998