The Adequacy of Reporting Randomized, Controlled Trials in the Evaluation of Antidepressants

David L Streiner, PhD, CPsych¹, Russell Joffe, MD²


Objective: To evaluate the adequacy of the reports of methodological and statistical aspects of randomized, controlled trials that evaluate antidepressant medications and the degree to which their results can be used in subsequent metaanalyses.

Data Sources: Randomized, controlled trials published in English that compared 2 antidepressant drugs with a placebo were reviewed. Papers were located using Medline and reference lists.

Data Extraction: Each paper was evaluated using a checklist, and 3 summary criteria scores were derived: minimal, ideal, and overall.

Results: Only 9 of the 69 papers met the minimal criteria for inclusion in a metaanalysis (that is, they reported the final sample size and an estimate of the mean and of the standard deviation), and none met all of the ideal criteria for reporting clinical trials. Out of a possible 100 points, the mean score for the articles was 51, and the highest score was 80. Scores were not related to the citation impact factor of the journal in which they appeared.

Conclusions: Researchers must become more aware of the criteria for reporting clinical trials, and editors must insist more strenuously that these criteria be satisfied.

(Can J Psychiatry 1998;43:1026–1030)

Key Words: statistical reporting, metaanalysis, clinical trials

Over the past few years, there has been increasing emphasis placed on using the principles of evidence-based medicine (EBM) in medical (1) and psychiatric (2) care and teaching. The central premise of EBM is that all aspects of clinical practice (including diagnosis, prognosis, and intervention) should be based on the best available evidence. When the issue involves selecting a therapy for a given condition, it is generally agreed that the most convincing results come from randomized, controlled trials (RCTs), in which some of the subjects are randomly allocated to receive the test treatment and the others receive either a placebo or standard care (3). The primary advantage of RCTs is that they minimize bias between groups caused by confounding variables (4). Even firmer conclusions can be drawn from well-conducted metaanalyses, in which the results of many trials are combined to arrive at an overall estimate of the effectiveness of the treatment (5). Thus, the validity of the decision regarding which therapeutic intervention to use is predicated on a chain of events: 1) the RCTs are well designed, executed, and analyzed; 2) the results of these trials are properly reported; and 3) the metaanalyses include all the relevant studies and are themselves properly analyzed.

Various sets of criteria have been developed for each of these steps. Numerous books and articles have been written regarding subject selection and allocation, minimization of bias, the counting of events, analytic problems, and other issues involved in the design, execution, and analysis of RCTs (for example, 6,7). Although there are still some unresolved issues in metaanalysis, such as whether or not it is necessary to reanalyze data from existing trials (8) or whether to use random- or fixed-effect models (9), the process itself has been relatively well described (10). This paper focuses on the intermediate stage: the proper reporting of the results of individual trials.

As is the case with the first and last steps, there have been many sets of criteria proposed for presenting the findings of RCTs (11–13). Indeed, within the past few years, many leading medical journals have presented their own guidelines or have called for such criteria to be developed (13–17). These statements were largely motivated by reports that highlighted the generally poor quality of many studies in the literature (11,18–21); the problem is not only with articles in English but also with those in French, German, Spanish, and Italian (22).

Most guidelines agree that the articles, in reporting results, should present, as a minimum, 1) the original sample sizes in each group; 2) the number of subjects who did not complete the trial and the reasons for this; 3) simple summary data for the primary outcome (measures of central tendency and variability) so that the reader can reproduce the results; 4) the statistical procedures that were used; 5) the numerical value of the statistical tests; and 6) the exact P level rather than P relative to some benchmark (usually < 0.05 or < 0.01). Without this information, it is extremely difficult, if not impossible, for the reader to determine whether the study has internal validity (23). Further, these data are necessary if the trial results are to be included in subsequent metaanalyses. Despite these guidelines, which simply formalize what many statistical editors look for when reviewing manuscripts, a recent study found that a significant proportion of articles meeting the design criteria for a metaanalysis could not be included, because some necessary information was missing from the report, usually the measure of central tendency and variability (24). Even among those articles that were included in the analysis, it was sometimes necessary to estimate mean values from graphs and to use measures of variability based on other studies employing the same outcome measures. This paper reanalyzes those papers to document their adequacy in reporting results, the criteria that were most problematic, and the relationship, if any, between the completeness of the reporting and the editorial “impact” of the journal in which the paper was published.
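To illustrate why the group sizes, means, and measures of variability are the minimum a metaanalyst needs, the following is a minimal sketch of a standardized mean difference. Cohen's d is used here only as an assumption for illustration; the text does not say which effect-size index the earlier metaanalysis (24) used, and the function name and the numbers are hypothetical.

```python
import math


def standardized_mean_difference(mean_drug, sd_drug, n_drug,
                                 mean_placebo, sd_placebo, n_placebo):
    """Cohen's d: difference in group means divided by the pooled SD.

    Every argument must come from the trial report; if the means, SDs,
    or final sample sizes are missing, d cannot be computed.
    """
    if n_drug < 2 or n_placebo < 2:
        raise ValueError("each group needs at least 2 subjects")
    pooled_var = (((n_drug - 1) * sd_drug ** 2
                   + (n_placebo - 1) * sd_placebo ** 2)
                  / (n_drug + n_placebo - 2))
    return (mean_drug - mean_placebo) / math.sqrt(pooled_var)


# Hypothetical improvement scores (larger = more improvement).
d = standardized_mean_difference(12.0, 6.0, 30, 8.0, 6.0, 30)
print(round(d, 3))
```

If the SD has to be borrowed from another study or read off a graph, any error in it passes directly into d, because the SD sits in the denominator.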

Methods

The method of locating these articles has been described previously (24). Briefly, the articles constitute RCTs published in English that compared 2 different antidepressants with a placebo condition (a list of the articles is available from the authors). After eliminating articles that were superseded by later ones reporting on the same subjects, 69 papers were reviewed by a psychiatrist or a clinical psychologist. These papers were published in 26 different journals. The interrater agreement (κ) on whether or not the articles met the inclusion criteria was 0.80. Articles were evaluated for their reports of 1) whether or not explicit diagnostic criteria were used and by whom; 2) initial and final sample sizes and the reasons for which participants dropped out; 3) the setting (inpatient, outpatient, or mixed) and, if mixed, the number of subjects from each; 4) indices of change; 5) some index of within-group variability; 6) the statistical tests; 7) numerical results of the tests; and 8) the exact P levels versus an indication relative to some value. (A copy of the rating sheet is available from the authors.)
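As an illustration of the agreement statistic, here is a minimal sketch of a chance-corrected agreement coefficient, assuming that the reported value is Cohen's kappa for two raters making include/exclude decisions. The ratings below are hypothetical and were chosen only so that the result matches the reported 0.80; they are not the study data.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical decisions for 10 articles (1 = include, 0 = exclude).
rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))
```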

Using these criteria, 3 scores were derived for each paper. The Minimum Criteria score was 1 if the paper reported 1) the final sample size; 2) some value for the magnitude of the change for each group, even if it had to be estimated from a graph; and 3) some index of variability, which also was credited if it could be estimated from a graph. If the paper failed on any criterion, it was given a score of 0. This score reflects whether there is the minimum amount of information in the paper for it to be included in a metaanalysis. The Ideal Criteria score was 1 if the paper reported 1) the initial sample size; 2) the final sample size; 3) numerical values for the pre- and postintervention measure or the change score; 4) actual values for the appropriate standard deviations; 5) which statistical test was used; 6) the actual value of the statistic; and 7) the exact P level. Again, the paper was assigned a score of 0 if any criterion was not satisfied.
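The two all-or-nothing scores can be sketched as follows. This is an illustrative encoding, not the authors' rating sheet: the class, field names, and the "numeric"/"graph"/"none" labels are assumptions introduced here, and the single variability field simplifies the distinction between any index of variability and the SD itself.

```python
from dataclasses import dataclass


@dataclass
class PaperReport:
    """What one paper reports (hypothetical field names)."""
    initial_n: bool
    final_n: bool
    change_values: str    # "numeric", "graph", or "none"
    variability: str      # "numeric", "graph", or "none"
    test_named: bool
    statistic_value: bool
    exact_p: bool


def minimum_criteria(p: PaperReport) -> int:
    """1 if final n, change, and variability are available, even from a graph."""
    return int(p.final_n
               and p.change_values != "none"
               and p.variability != "none")


def ideal_criteria(p: PaperReport) -> int:
    """1 only if all 7 items are reported, with numeric means and SDs."""
    return int(p.initial_n and p.final_n
               and p.change_values == "numeric"
               and p.variability == "numeric"
               and p.test_named and p.statistic_value and p.exact_p)


# Hypothetical paper: means and variability readable only from a figure.
paper = PaperReport(initial_n=True, final_n=True, change_values="graph",
                    variability="graph", test_named=True,
                    statistic_value=False, exact_p=False)
print(minimum_criteria(paper), ideal_criteria(paper))
```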

Finally, a continuous Overall Criteria score was derived by assigning 10 points each for reporting 1) the initial sample size, 2) the final sample size, 3) the reasons for which subjects dropped out of the trial, 4) pre- and posttest scores, 5) numerical values for these, 6) pre- and posttest SDs, 7) actual values of the SDs, 8) the statistic used, 9) the value of the statistic, and 10) the exact P level. For criteria 5 and 6, only 5 points were awarded if these indices had to be estimated from a graph; and for criterion 10, only 5 points were given if P was indicated to be less than some benchmark. In addition, 1 point was given for reporting only posttest means and SDs (criteria 4 and 5). Each study could have a score between 0 and 100.

The Citation Impact Factor (CIF) for each journal was determined using the 1995 Journal Citation Reports (25). One journal was not listed; omitting the article from the analysis or assigning the journal a CIF of 0 showed no appreciable difference in the results. These scores were correlated with the Overall Criteria score using Pearson’s Product Moment Correlation Coefficient.

Results

Reporting of Sample Sizes

All of the papers except 1 (98.6%) reported the initial sample sizes of the groups; and the vast majority (64; 92.8%) indicated how many participants completed the trial. Among the 67 studies that had dropouts, 46 (68.7%) reported the reasons for which subjects discontinued, and 21 (31.3%) did not.

Reporting of Findings

Of the 69 papers, 45 (65.2%) reported both the pre- and posttreatment means for the main outcome measure (often with intermediate values as well); 6 (8.7%) gave only a single value to reflect the change over the course of the trial; 3 (4.3%) reported only the posttreatment score; and 15 (21.7%) did not report any values for the outcome measure. The actual numerical value of these scores (pre- and posttreatment, change, or posttreatment only) was reported in 28 papers (40.6%); in 26 articles (37.7%), the values had to be estimated from a graph.

Only 9 papers (13.0%) indicated the SD of the scores: 8 (11.6%) reported the SD for the pre- and posttreatment scores, and 1 (1.4%) gave the SD for the posttreatment score. No paper, even those that reported the results as change scores, gave the SD of the change score; and the majority (60; 87.0%) did not give the SD at all. Of the 9 papers that did provide SDs, 4 (5.8%) reported the numerical value; SDs in the remaining 5 papers (7.2%) had to be estimated from a graph. In some cases, the standard error (SE) was shown, so the SD could be calculated using the SE value and the sample size.
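The conversion from a reported SE of the mean to the SD follows from SE = SD / sqrt(n). A minimal sketch, with a hypothetical function name and example values:

```python
import math


def sd_from_se(se, n):
    """Recover the SD from the standard error of the mean and the group size."""
    if n <= 0:
        raise ValueError("n must be positive")
    if se < 0:
        raise ValueError("se must be non-negative")
    return se * math.sqrt(n)


# Hypothetical group: SE of 1.2 on 25 subjects.
sd = sd_from_se(1.2, 25)
print(round(sd, 2))
```

The conversion needs the sample size on which the SE was based, which is one more reason the final group sizes have to be reported.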

Reporting of Statistical Testing

Sixty papers (87.0%) indicated which statistical tests were employed. Most often, this information was provided in the Results or Methods section of the paper, and it was not always obvious which test was used for a particular result. However, only 17 articles (24.6%) gave the actual value of the test statistic. In 23 cases (33.3%), the exact value of the P level was given; in 33 cases (47.8%), P was reported relative to some benchmark (usually < 0.05); and in 13 papers (18.8%), no value was provided, with the authors simply reporting that the results were significant or not.

Other Methodological Issues

It has been shown previously that the effect size (ES) is larger in studies in which strict diagnostic criteria were used than in studies that did not report using objective criteria (24). In this sample, 53 papers (76.8%) reported having used such criteria (usually from the Diagnostic and Statistical Manual of Mental Disorders [DSM-III or DSM-III-R] or International Classification of Diseases [ICD-9] [26], the Research Diagnostic Criteria [27], or the Feighner criteria [28]). However, only 25 papers (36.2%) stated how these criteria were implemented. All but 2 of the articles (97.1%) indicated whether the study used inpatients, outpatients, or both. Of those that used patients from both settings, 78.3% (18/23) reported how many subjects came from each.

Criteria Scores

Only 9 of the studies (13.0%) had a Minimum Criteria score of 1, and 0 had an Ideal Criteria score of 1. The mean of the Overall Criteria scores for the studies was 51.1 (SD 14.09). The correlation between the Overall Criteria score and the CIF was 0.288 (P < 0.05), and a scatterplot is shown in Figure 1. It indicates that some articles in journals with the highest CIFs had relatively poor Overall Criteria scores and, conversely, some of the best-reported articles were in journals with the lowest CIFs.

Figure 1. Scatterplot of Citation Impact Factor and Overall Criteria score for 68 articles.
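The reported correlation can be checked from r and the number of articles alone. The sketch below converts r to a t statistic on n - 2 degrees of freedom, assuming n = 68 (the number of articles in Figure 1) and a two-tailed test; the function name is hypothetical.

```python
import math


def t_from_r(r, n):
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    if n < 3:
        raise ValueError("n must be at least 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


r, n = 0.288, 68
t = t_from_r(r, n)
shared_variance = r ** 2   # proportion of variance the two scores share

print(round(t, 2), round(shared_variance, 3))
```

With 66 degrees of freedom, the two-tailed critical value at the 0.05 level is roughly 2.0, so the t value is consistent with the reported P < 0.05, while the squared correlation shows that the two scores share only about 8% of their variance.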

Discussion

The main finding from this study is that, despite the publication of guidelines for reporting the results of RCTs (for example, 13–16), the quality of the average paper in this survey was less than satisfactory. The 69 papers included in this review were selected because they met the criteria for a metaanalysis. Thus, there is no reason to suspect that they are systematically worse than other published articles in psychiatry. In fact, because they all had to meet stringent design criteria regarding random assignment of patients and appropriate comparison conditions, they may actually be the better articles in the field. Yet, if a metaanalyst were to require that each article report just the final sample size, and numerical values for the means and SDs, only 9 articles (13%) would be included. If all of the criteria listed by the various journal guidelines had to be met, then no article would qualify. The only way that the previous metaanalysis (24) could reflect a more representative sample of papers was by including articles in which it was possible to estimate mean values from graphs and by using values for the SDs from other studies. The dilemma is that, by doing so, error is introduced for these parameters. Conversely, to have eliminated these papers would have resulted in estimates for the overall ES based on a much smaller number of values and would have made the search for factors that affect the ESs much more difficult.

Some of the purported benefits of metaanalyses are that they are more objective than narrative summaries and may utilize a less biased sample of evidence (29,30). Despite this, Oxman and Guyatt, for example, found 5 reviews that had conflicting recommendations for managing mild hypertension (31); Greenberg and others’ metaanalysis (32) came to a conclusion regarding the effectiveness of antidepressants diametrically opposed to that in Joffe and others’ metaanalysis (24). One possible reason for different metaanalyses arriving at discrepant results is that various authors may have different rules regarding which articles to include or reject based on the reporting of methodological and statistical details. Thus, each metaanalysis may begin with the same population of studies, but end up with different subsets of articles for inclusion, based on various criteria. While the relationship between the methodological rigour of a study and the magnitude of its findings is still a topic of debate, there are some indications that more poorly executed ones show larger ESs than more tightly controlled ones (33). The effect of this would be that more inclusive metaanalyses may report larger effects than more restrictive ones. This could be resolved by all studies adhering to a minimum set of criteria to lessen discrepancies among metaanalyses of articles included for review.

The purpose of this paper is not to blame either researchers or editors. Rather, it is to highlight an existing problem—found in other fields of medicine as well (for example, 33)—so that it can be ameliorated. The solution is not difficult and has already been identified by many journals and societies (for example, 16): results of studies should be reported in such a way that the readers can verify the findings and come to their own conclusions about the clinical importance of the outcome. This can be done only if certain information is present in the articles. This would include, at the very least, what we have called the Minimum Criteria and ideally should include more complete information—the Ideal Criteria. These would include the initial and final sample sizes for each group; the numerical values for the pre- and postintervention means and SDs (or the change scores); the names and numerical values of the statistical tests used; and the exact value of the P level. Responsibility for ensuring that these data are a part of all papers rests with 3 sets of people: the authors, the reviewers, and the editors. What is needed is a concerted will by the authors to include the information; by the reviewers to identify what necessary data are missing; and by the editors to refuse to publish the results of any study, no matter how well executed, unless the data appear in the paper.

All studies represent a significant investment of time, effort, and money by the researchers and sponsoring agencies and of time and risk of adverse reactions for the participants. Publication of these studies in a way that does not allow the results to be fully utilized by readers or metaanalysts is a poor use of these resources. Both the clinical and research communities would be better served if the publication guidelines promulgated by the journals were followed more rigorously.


Acknowledgements

The authors thank Dr Alex Jadad, Dr Meir Steiner, and Dr Trevor Young for their help and advice.

References

1. Cook DJ, Sibbald WJ, Vincent JL, Cerra FB. Evidence based critical care medicine: what is it and what can it do for us? Crit Care Med 1996;24:334–7.

2. Goldner EM, Bilsker D. Evidence-based psychiatry. Can J Psychiatry 1995;40:97–101.

3. Department of Clinical Epidemiology and Biostatistics. How to read clinical journals: IV. To determine etiology or causation. Can Med Assoc J 1981;124:985–90.

4. Streiner DL, Norman GR. PDQ epidemiology. 2nd ed. St Louis: Mosby; 1996.

5. Glass GV. Primary, secondary, and meta-analysis of research. Educational Researcher 1976;5:3–8.

6. Cochran WG. Sampling techniques. New York: Wiley; 1977.

7. Sackett DL, Gent M. Controversy in counting and attributing events in clinical trials. N Engl J Med 1979;301:1410–2.

8. Yusuf S. Obtaining medically meaningful answers from an overview of randomized clinical trials. Stat Med 1987;6:281–6.

9. Hedges LV. Meta-analysis. Journal of Educational Statistics 1992;17:279–96.

10. Einarson TR, Leeder JS, Koren G. A method for meta-analysis of epidemiological studies. Drug Intelligence and Clinical Pharmacy 1988;22:813–24.

11. Gøtzsche PC. Methodology and overt and hidden bias in reports of 196 double-blind trials of nonsteroidal antiinflammatory drugs in rheumatoid arthritis. Control Clin Trials 1989;10:31–56.

12. Meinert CL. Clinical trials: design, conduct, and analysis. New York: Oxford University Press; 1986.

13. Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, and others. Improving the quality of reporting of randomized controlled trials: The CONSORT statement. JAMA 1996;276:637–9.

14. Rennie D. Reporting randomized controlled trials: an experiment and a call for responses from readers. JAMA 1995;273:1054–5.

15. Standards of Reporting Trials Group. A proposal for structured reporting of randomized controlled trials. JAMA 1994;272:1926–31.

16. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Ann Intern Med 1988;108:266–73.

17. Working Group on Recommendations for Reporting of Clinical Trials in the Biomedical Literature. Call for comments on a proposal to improve reporting of clinical trials in the biomedical literature. Ann Intern Med 1994;121:894–5.

18. Fletcher RH, Fletcher SW. Clinical research in general medical journals: a 30-year perspective. N Engl J Med 1979;301:180–3.

19. Altman DG. The scandal of poor medical research. BMJ 1994;308:283–4.

20. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med 1987;317:426–32.

21. Gardner MJ, Bond J. An exploratory study of statistical assessment of papers published in the British Medical Journal. JAMA 1990;263:1355–7.

22. Moher D, Fortin P, Jadad AR, Jüni P, Klassen T, Le Lorier J, and others. Completeness of reporting of trials published in languages other than English: implications for conduct and reporting of systematic reviews. Lancet 1996;347:363–6.

23. Cook TD, Campbell DT. Quasi-experimentation: design and analysis issues for field settings. Boston: Houghton-Mifflin; 1979.

24. Joffe R, Sokolov S, Streiner DL. Antidepressant treatment of depression: a metaanalysis. Can J Psychiatry 1996;41:613–6.

25. Garfield E. Journal citation reports. Philadelphia: Institute for Scientific Information; 1996.

26. National Center for Health Statistics. International classification of diseases. 9th ed. Washington (DC): U.S. Department of Health and Human Services; 1980.

27. Spitzer RL, Endicott J, Robins E. Research diagnostic criteria: rationale and reliability. Arch Gen Psychiatry 1978;35:773–82.

28. Feighner JP, Robins E, Guze SB, Woodruff RA, Winokur G, Munoz R. Diagnostic criteria for use in psychiatric research. Arch Gen Psychiatry 1972;26:57–63.

29. Olkin I. Meta-analysis: reconciling the results of independent studies. Stat Med 1995;14:457–72.

30. Peto R. Why do we need systematic overviews of randomized trials? Stat Med 1987;6:233–40.

31. Oxman AD, Guyatt GH. Guidelines for reading literature reviews. Can Med Assoc J 1988;138:697–703.

32. Greenberg RP, Bornstein RF, Greenberg MD, Fisher S. A meta-analysis of antidepressant outcome under “blinder” conditions. J Consult Clin Psychol 1992;60:664–9.

33. Jadad AR. Systematic reviews and meta-analyses in pain relief research: what can (and cannot) they do for us? Proceedings of the 8th World Congress on Pain; 1996; Seattle. Seattle: International Association for the Study of Pain (IASP). p 445–52.


Summary

Objective: To evaluate the adequacy of the reporting of the methodological and statistical aspects of randomized, controlled trials that evaluate antidepressants, and the extent to which the results of these trials can be used in subsequent metaanalyses.

Data Sources: Randomized, controlled trials published in English that compared 2 antidepressants with a placebo were identified. The articles were located through Medline and reference lists.

Data Extraction: Each article was evaluated using a checklist, and 3 summary sets of criteria were derived: minimum, ideal, and overall.

Results: Only 9 of the 69 articles met the minimum criteria for inclusion in a metaanalysis (a report of the final sample size and an estimate of the mean and of the standard deviation); none met all of the ideal criteria for reporting clinical trials. Out of a possible 100 points, the mean score for the articles was 51, and the maximum was 80. Scores were not related to the citation impact factor of the journal in which the articles appeared.

Conclusions: Researchers must be made more aware of the criteria for reporting clinical trials, and editors must ensure more strictly that these criteria are met.


Manuscript received February 1998 and accepted March 1998.

Portions of this paper were presented at the Canadian Academy of Psychiatric Epidemiology, Calgary, Alberta, 16 September 1997.

¹Professor, Department of Psychiatry, University of Toronto, Toronto, Ontario; Assistant Vice President for Research, Director, Kunin-Lunenfeld Applied Research Unit, Baycrest Centre for Geriatric Care, North York, Ontario.

²Dean, Vice President, Faculty of Health Sciences, McMaster University, Hamilton, Ontario.

Address for correspondence: Dr DL Streiner, Kunin-Lunenfeld Applied Research Unit, Baycrest Centre for Geriatric Care, 3560 Bathurst Street, North York, ON  M6A 2E1

email: dstreiner@rotman-baycrest.on.ca

Can J Psychiatry, Vol 43, December 1998