To deliver high-quality clinical care to patients with diabetes and other chronic conditions, clinicians must understand the evidence available from studies that have been performed to address important clinical management questions. In an evidence-based approach to clinical care, the evidence from clinical research should be integrated with clinical expertise, pathophysiological knowledge, and an understanding of patient values. Accordingly, in an effort to synthesize information from many studies, the publication of diabetes meta-analyses, using either observational or clinical trial data, has increased markedly in recent years. In this regard, guidelines have been developed to direct the performance of meta-analysis to provide consistency among contributions. Thus, when done appropriately, meta-analysis can provide estimates from clinically and statistically homogeneous but underpowered studies and is useful in supporting clinical decisions, guidelines, and cost-effectiveness analysis. However, often these conditions are not met, the data considered are unreliable, and the results should not be assumed to be any more valid than the data underlying the included studies. To provide an understanding of both sides of the argument, we provide a discussion of this topic as part of this two-part point-counterpoint narrative. In the point narrative as presented below, Dr. Home provides his opinion and review of the data to date showing that we need to carefully evaluate meta-analyses and to learn what results are reliable. In the counterpoint narrative following Dr. Home’s contribution, Drs. Golden and Bass emphasize that an effective system exists to guide meta-analysis and that rigorously conducted, high-quality systematic reviews and meta-analyses are an indispensable tool in evidence synthesis despite their limitations.
—William T. Cefalu, MD
Editor in Chief, Diabetes Care
Meta-analysis is a means of combining data from studies with a common end point with the intention of providing a more reliable estimate of the effect size of some intervention or observation (1). Effect size is useful in medicine in deciding whether an intervention has a benefit-to-risk ratio high enough to justify clinical use, and in economic analysis for cost-effectiveness. Combining data usually narrows CIs, allowing tighter limits to be set on potential outcomes and making statistical significance more likely. But concerns about the limits of meta-analysis were expressed early on, and those concerns are now often ignored (2).
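To make the CI-narrowing effect concrete, the sketch below pools hypothetical hazard ratios by inverse-variance fixed-effect weighting. The figures are invented for illustration and are not taken from any trial discussed in this article.

```python
import math

def pool_fixed_effect(hrs, cis):
    """Inverse-variance fixed-effect pooling of hazard ratios.

    hrs: per-study hazard ratios; cis: matching (lower, upper) 95% CIs.
    Returns the pooled HR with its 95% CI.
    """
    z = 1.959964  # two-sided 95% normal quantile
    weights, log_hrs = [], []
    for point, (lo, hi) in zip(hrs, cis):
        # SE of the log HR back-calculated from the CI width
        se = (math.log(hi) - math.log(lo)) / (2 * z)
        weights.append(1 / se ** 2)
        log_hrs.append(math.log(point))
    pooled = sum(w * y for w, y in zip(weights, log_hrs)) / sum(weights)
    se_pooled = math.sqrt(1 / sum(weights))
    return (math.exp(pooled),
            math.exp(pooled - z * se_pooled),
            math.exp(pooled + z * se_pooled))

# Three invented trials, each individually nonsignificant (every CI crosses 1.00)
hr, lo, hi = pool_fixed_effect(
    [0.85, 0.80, 0.90],
    [(0.70, 1.03), (0.63, 1.02), (0.75, 1.08)])
# The pooled upper CI now falls below 1.00: combining made significance appear
```

Note that the narrowing of the interval is purely arithmetic; the pooled estimate is only as trustworthy as the three inputs.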
The number of reports in which “meta-analysis” and “diabetes” are both contained in the title has expanded rapidly in recent years (Fig. 1). Sadly, many of these studies have significant weaknesses that escape the combined skills of peer reviewers and editors. Meanwhile, the clinical community and the media seem to believe that anything that reaches statistical significance as the result of meta-analysis has some mystical access to unchallengeable truth. This is then widely disseminated to those without the skills needed to make appropriately reserved judgments.
An approach that unintentionally promotes this misunderstanding is evidence grading. Often, this is a hierarchy in which randomized controlled trials (RCTs) occupy the top rank, together with meta-analyses of them (3). Unfortunately, RCTs are themselves of very variable quality, and often the data extracted from them for meta-analysis are secondary and unreliable (4). The same is true of observational studies. However, since the term “meta-analysis” appears only in the top rank of the evidence hierarchy, the nonexpert can be forgiven for assuming that any meta-analysis must be first-rank evidence.
This article, while acknowledging the usefulness of meta-analysis, attempts to highlight some of the method’s weaknesses. The author is a clinical researcher, not a statistician, so the technical issues are not fully covered; the focus is instead on problems accessible to the clinician.
Some problems of meta-analysis
Unreliable primary data
The best data for meta-analysis come from a series of studies sharing the same primary end point. This is now beginning to be true for some cardiovascular (CV) outcomes such as myocardial infarction (MI), after years of discussion, but independent adjudication is still required (5). Otherwise, only all-cause mortality is generally regarded as reliable. CV mortality is problematic because terminal events are often CV even if another disease (e.g., cancer) is responsible, whereas sudden death is often indeterminate though usually counted as CV in the absence of other information. Furthermore, investigator reporting, like death certification, is unreliable for many events—diagnoses are generally made on probabilities, not certainties. Was it a stroke, a transient ischemic attack, a fit, or a syncopal event, and by what definition? This problem is compounded in meta-analyses in which the data used are not trial primary or secondary end points but are taken from adverse event reports, themselves subject to high levels of unreliability.
Recently, meta-analyses have been published for CV outcomes with the dipeptidyl peptidase 4 (DPP-4) inhibitors (6). Commendably, the authors have often avoided use of the word “meta-analysis.” The data for sitagliptin were largely taken from studies designed to address glucose control and general safety, with no special emphasis on CV safety (7). While retrospective adjudication of events can be attempted, data extracted for long-past events are necessarily tenuous. Furthermore, evidence for a possible event may be missed if dedicated pages in a clinical report form were not used. More recently, to meet U.S. Food and Drug Administration requirements, such approaches to CV end points have been incorporated into glucose-lowering phase 2/3 studies, so data quality has improved. However, three other problems of meta-analysis affecting the DPP-4 inhibitor CV findings are discussed below.
An important example of “difficult data” in diabetes is hypoglycemia. Hypoglycemic symptoms are often nonspecific, particularly in those with other diseases, the elderly, and the obese, while in type 2 diabetes self-glucose testing is performed infrequently. A particular problem here is that of nocturnal events. The situation in type 1 diabetes is now complicated by erratic and sometimes biased use of continuous glucose monitoring. As a result of these problems, hypoglycemia event rates vary markedly between studies of what seem like similar interventions in similar populations, with no clarity as to which study is more valid. Reported hypoglycemia rates tend to decrease with study duration, perhaps due to reporting fatigue. The effect of this is that hypoglycemia meta-analyses may give some useful estimate of the relative rate of hypoglycemia on two therapies (8) but cannot give useful estimates of the more important absolute change.
The problems of study duration
Meta-analysis can be used to disguise some of the simple study duration issues that affect single studies. For example, most phase 3 studies (median in diabetes around 26 weeks) would be too short to identify the kind of fracture/osteoporotic problems that stem from thiazolidinedione use (9). But meta-analysis of a series of short-term studies may allow sufficient numbers of fractures to be accumulated to limit the upper CI, and the agent may then be provisionally judged safe, quite inappropriately (10). Other dangers relate to oncogenic effects and work both ways: time may be necessary to uncover genotoxic or indeed growth-promoting effects, whereas for dapagliflozin the timing of events relative to exposure suggests that the apparent imbalance in bladder cancer malignancies was at least partly related to clinically silent preexisting disease (11).
Less obvious may be combined analyses of studies of very different duration. The type 2 diabetes MI/CV meta-analyses used four or five studies, including both the early-terminated Action to Control Cardiovascular Risk in Diabetes (ACCORD; 3.5 years median) and the extension of the UK Prospective Diabetes Study (UKPDS; 15 years) (12–16). The meta-analyses agree on a 15–18% reduction in CV outcomes for a 10 mmol/mol (0.9% units) reduction in HbA1c. However, beyond confirming that blood glucose control matters, that is not clinically useful, because it is impossible to know how long such control has to be maintained to gain the benefit. Note also the problem here of expressing the result in terms of HbA1c change: a reduction of 10 mmol/mol (0.9% units) is difficult to achieve early after diagnosis once lifestyle change is in place (17), but later in the disease (after 20 years) use of multiple agents including insulin may be delivering a 33 mmol/mol (3.0% units) reduction—can the gain be multiplied up?
Study populations
Related to the last problem is conflation of results from different study populations. It is clear that ACCORD (obese North Americans), Action in Diabetes and Vascular Disease (ADVANCE; global, including many oriental Asians), Prospective Pioglitazone Clinical Trial in Macrovascular Events (PROactive; secondary CV), and Veterans Affairs Diabetes Trial (VADT; mainly men) are very different populations (Table 1) (15,18–20). Participants in each of these studies were enrolled relatively late in the course of their disease (e.g., 8 to 12 years from diagnosis at entry), yet combined with them is UKPDS, with a population studied from 3 months after diagnosis (16). It does not follow that the results of meta-analysis will be generalizable to any particular population, or to any individual with diabetes.
This problem is not addressed by statistical tests of heterogeneity such as Cochran Q or the Altman I2 statistic (21). These statistics consider whether the numerical results of the studies (e.g., the hazard ratio [HR] and its CIs) are consistent or different. However, they are blind to clinical issues, so that, for example, no statistical heterogeneity might be found in studies done in a mixture of rats, dogs, and humans if the HRs were comparable, or when combining data from different antihypertensive drug classes. If I2 is high, then examination of the studies will generally quickly identify why that is, but really the meta-analysis ought never to have been done in the first place.
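The blindness of these statistics to clinical sense is easy to demonstrate. The sketch below computes Cochran Q and I2 from study-level log HRs under fixed-effect weighting (the numbers are invented): numerically consistent results yield an I2 of 0 no matter what the underlying studies actually enrolled.

```python
def cochran_q_i2(log_hrs, ses):
    """Cochran Q and the I2 inconsistency statistic for study-level
    log hazard ratios with standard errors (fixed-effect weights)."""
    w = [1 / s ** 2 for s in ses]
    pooled = sum(wi * y for wi, y in zip(w, log_hrs)) / sum(w)
    q = sum(wi * (y - pooled) ** 2 for wi, y in zip(w, log_hrs))
    df = len(log_hrs) - 1
    # I2 is the share of Q in excess of its expectation, floored at zero
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Three near-identical HRs: I2 is 0, even if the "studies" had mixed
# rats, dogs, and humans
_, i2_consistent = cochran_q_i2([-0.10, -0.12, -0.11], [0.10, 0.10, 0.10])

# Three scattered HRs: I2 is high, flagging numerical inconsistency only
_, i2_scattered = cochran_q_i2([-0.40, 0.00, 0.30], [0.10, 0.10, 0.10])
```

Neither result says anything about whether the studies belonged together clinically in the first place.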
Many of the phase 3 glucose-lowering studies used for secondary CV meta-analysis are of individuals selected as being in particularly good health, including a low prior incidence of CV disease. This may mean that meta-analyses of phase 3 studies are not suitable for assessment of CV risk, for the reports will not reflect risk in typical diabetes populations. A reverse issue relates to testing for CV protection. Because the evidence from UKPDS and the late megatrials is that glucose-lowering benefit only kicks in over the longer term, no benefit is expected to be seen in early studies unless another pathogenetic mechanism is operating. Indeed, if the early indications from the incretin studies are real (6), then this would point to nonglucose-lowering mechanisms.
Tempting targets for meta-analysis exist in diabetes, but the analyses have not been performed for reasons of population heterogeneity. Examples are the rosiglitazone versus metformin data for malignancy from A Diabetes Outcome Progression Trial (ADOPT) and Rosiglitazone Evaluated for Cardiovascular Outcomes and Regulation of Glycemia in Diabetes (RECORD) (22). Both studies found no difference, such that combined data would give an estimate closer to 1.00 with tighter CIs. But even before remembering that one study was of monotherapy in relatively obese North Americans, and the other of combination oral agent therapy across broader European populations, is more information gained by combining the two estimates? Another thiazolidinedione example is for stroke. Using the same studies used by Nissen and Wolski (23) for MI, stroke was statistically significantly reduced by rosiglitazone compared with mixed comparators (24). Furthermore, stroke in RECORD, PROactive, and ADOPT also gave HRs below 0.80 (none statistically significant) (18,25,26), making it clear that meta-analysis would give a significant headline result (Table 2).
Such cautions did not prevent an analysis of heart failure data (27). Little was served by this—heart failure was a known problem of thiazolidinediones from the late 1990s, and adequately powered data were then available from a single RCT (28). But Singh et al. (27) included among their four selected studies one that specifically recruited people with heart failure, and thus with very different absolute outcome rates for further heart failure events, CV events, and death (29). As a result, the meta-analysis findings cannot be applied to any clinical population or to further economic assessments.
Even measurements such as HbA1c and body weight can lack validity when combined from seemingly similar studies performed as part of a phase 2/3 development program. HbA1c change is known to be baseline dependent—starting therapy at higher levels results in larger falls (30,31). Accordingly, taking unadjusted data from studies with different baseline values, such as monotherapy and dual and triple combination therapy, has no useful clinical meaning but is still performed (32). Body weight change is similarly affected by baseline HbA1c, because falls in blood glucose reduce calorie loss through glycosuria—simple meta-analysis is similarly incapable of giving a meaningful result.
Data snooping and multiple testing
A danger is that breach of statistical norms may be followed by publication of a meta-analysis without realization that the process of statistical testing has been undermined. One example of this is for malignancy in which multiple meta-analyses of different organ sites are followed by highlighting of a single outlier organ result. This occurred in one of the insulin glargine quartet of articles in Diabetologia in 2009, in which breast cancer popped up as apparently statistically significant in the Swedish database study (33).
More subtly, two other examples have occurred in diabetes care, “more subtly” because it is not possible even for the authors themselves to know the extent to which they proceeded with meta-analysis only after noting that it was likely to generate a significant and therefore publishable result. The five large glucose-lowering CV trials all delivered central HR estimates below 1.00 (Table 2) (15,16,18–20). That in itself is close to statistical significance (P = 0.0625), and with upper CIs all below 1.10 or close to it, anyone familiar with statistical analysis will know that meta-analysis will give a statistically significant result. Three such publications followed the next year (12–14). Similarly, the early central estimates for CV protection in all five available composite analyses of the DPP-4 inhibitors were below 1.00 (Fig. 2). Not surprisingly, then, this presented an attractive target for meta-analysis, presentation, and publication (6). This reviewer predicts with some confidence that this meta-analysis finding is not correct but biased toward a positive effect, as will be shown by the purpose-designed RCTs (34,35).
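The P = 0.0625 quoted above is simply a two-sided sign test: if each trial's central HR were equally likely to fall on either side of 1.00 under the null hypothesis, the chance that all five land on the same side is small. A one-line check:

```python
def sign_test_p(n_trials):
    """Two-sided probability that all n independent trial estimates fall
    on the same side of HR = 1.00 when each side is equally likely."""
    return 2 * 0.5 ** n_trials

p_sign = sign_test_p(5)  # 2 * (1/32) = 0.0625, the value quoted in the text
```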
This means that such meta-analyses are hypothesis generating, not hypothesis testing. Nissen and Wolski (23) acknowledged this in open session when discussing their rosiglitazone CV meta-analysis in 2007. Such reservations have not stopped clinical commentators in esteemed journals from writing that the authors “have shown” that rosiglitazone causes increased “CV events.” A hypothesis-generating article cannot “show” anything. Further, since stroke was not addressed (see above), it is not CV events that were considered, only MI and mortality.
Drug regulators have a legitimate interest in data snooping when trying to understand the importance of non–statistically significant trends in safety issues in the limited data available in a licensing application. Here the intention is not to establish the actual safety of a new medication, but rather to develop a feel for the probability that such a trend might reflect a real issue. Predefined meta-analysis of the insulin degludec data for CV outcomes was broadly reassuring, though based on small numbers, but data snooping in weaker data, using smaller comparisons and extension studies, was less so (36,37). In these circumstances, the predefined analysis should be the one quoted by researchers and clinicians, with the secondary analyses reserved for regulatory decision making only.
Study population sizes
Chance plays games with the allocation of events to the active and comparator sides of randomized study populations. Data monitoring committees often watch an adverse outcome accumulate eight events in one group and only one in the other. But multiple small studies are not immune to this effect, as in the reported outcomes for CV effects in the early studies of saxagliptin and linagliptin (38,39). A clue here is the forest plot, in which the larger composite studies, notably the study with vildagliptin, give central hazard estimates closer to 1.00 (Fig. 2) (6). Indeed, the vildagliptin finding is probably a better guide to the likely result of the formal outcome studies for the other agents than their own estimates from meta-analysis. A useful rule of thumb here: if the total event number is <50, beware; if <100, be skeptical; and if <150, be cautious.
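How surprising is an eight-to-one split? Under 1:1 randomization and truly equal hazard, each event is equally likely to fall in either arm, so the two-sided probability of a split at least that lopsided is a simple binomial tail (a sketch; no trial data are involved):

```python
from math import comb

def lopsided_split_p(total_events, minority_max):
    """Two-sided probability of observing a split at least as uneven as
    (total - minority_max) vs minority_max events between the two arms
    of a 1:1 randomized trial when the true hazards are equal."""
    tail = sum(comb(total_events, k) for k in range(minority_max + 1))
    return 2 * tail / 2 ** total_events

p_split = lopsided_split_p(9, 1)  # 8 vs 1 or worse: about 0.04
```

With dozens of adverse-event categories under surveillance, a roughly 1-in-25 chance per category makes such splits an expected nuisance rather than evidence.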
One well-known example is the meta-analysis by Nissen and Wolski (23) of the early rosiglitazone studies, following a similar analysis put into the public domain by GlaxoSmithKline some 8 months earlier (40). Nissen and Wolski were constrained to meta-analyze studies that mostly had <5 events in all, several indeed with 0 events. Because an HR cannot be calculated for the latter, they were excluded, but of course 0 versus 0 represents equal hazard (and not zero information), so exclusion inflates the final difference. Recalculation by more sophisticated methodology halved the central estimate of adverse effect and rendered it non–statistically significant (41).
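The effect of discarding double-zero studies can be illustrated with a continuity-corrected inverse-variance pooling of odds ratios, a common if crude device (the trial counts below are invented, and this is not the methodology used in either the original analysis or the recalculation):

```python
import math

def pooled_or(trials, include_double_zero=True, cc=0.5):
    """Continuity-corrected (add 0.5 to each cell) inverse-variance pooled
    odds ratio. trials: (events_active, n_active, events_control, n_control).
    With the correction, a double-zero trial contributes an OR of 1.00
    with a wide CI; dropping it inflates any apparent difference."""
    num = den = 0.0
    for ea, na, ec, nc in trials:
        if not include_double_zero and ea == 0 and ec == 0:
            continue
        a, b = ea + cc, na - ea + cc  # active arm: events, non-events
        c, d = ec + cc, nc - ec + cc  # control arm: events, non-events
        log_or = math.log(a * d / (b * c))
        var = 1 / a + 1 / b + 1 / c + 1 / d
        num += log_or / var
        den += 1 / var
    return math.exp(num / den)

# Invented sparse-event trials, several with no events in either arm
trials = [(3, 100, 1, 100), (2, 100, 0, 100),
          (0, 100, 0, 100), (0, 100, 0, 100), (0, 100, 0, 100)]
or_all = pooled_or(trials, include_double_zero=True)
or_dropped = pooled_or(trials, include_double_zero=False)
# or_dropped sits further from 1.00 than or_all
```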
Problems inherent in included studies
Meta-analysis cannot correct for the inherent problems of the included studies, with the exception of inadequate study size (event rate and power). Some attempt is often made to deal with this by consideration of study quality, using published criteria such as Jadad scoring or the GRADE approach (42,43). Unfortunately, these criteria deal mainly with technical issues and make no attempt to assess clinical quality (e.g., whether a clinically appropriate comparator was chosen, whether the enrolled population was appropriate, and whether follow-up was long enough) (Table 3). Those without clinical expertise may then perform a meta-analysis with limited generalizability to a clinical population.
A particular problem relates to meta-analysis of observational studies. Observational studies of clinical interventions, all too easy to perform in these days of large electronic clinical databases, are subject to hidden confounding, that is, biases that are not known (44,45). Unfortunately, the same bias might be expected in a series of database studies, and when meta-analysis is then used the findings become seemingly more solid. It is difficult by definition to provide examples of something that is unknown, but a couple of semihypothetical scenarios can be given. Metformin is a long-established medication, used early in the course of diabetes since the results of the UKPDS substudy were published in 1998, but many practitioners were brought up understanding that its contraindications included liver disease, renal disease, and cardiac disease. As a result, there is a bias in its use toward younger patients with shorter disease duration and better CV outcomes (46). By contrast, insulin is used later in the course of diabetes, in individuals with more complex overall health problems and often taking multiple therapies.
Accordingly, metformin does well and insulin does badly in prescription database studies of, for example, malignancy and CV disease (46,47). This will be true for meta-analysis as well as for the original observational studies, but it is negated by properly conducted RCTs. Less obviously, two medications in one class may be marketed to different target populations, for example one to primary care and one to specialists, the latter seemingly having worse outcomes, particularly on meta-analysis. Use of newer medications by teaching hospitals, which tend also to see more critically ill and complex medical cases, is another potentially confounding example.
Available studies
As well as the issues discussed above—for example, short-term studies with poor data quality—the available studies may limit conclusions through problems with comparators. When metformin was compared with other oral agents for malignancy, statistical heterogeneity was low, as judged by an I2 of 13%, implying comparable results in the underlying studies, while the HR of 0.98 (95% CI 0.77–1.23) implied no advantage (48). However, 67% of the events came from comparisons with a thiazolidinedione, limiting the meaning of any conclusions relative to other glucose-lowering therapies.
Lack of balance in study size may mean that a meta-analysis is merely a restatement of the results of the largest study. Lincoff et al. (49) published a meta-analysis of CV events with pioglitazone, in which 80% of the events came from PROactive. The remainder of the data came from several shorter studies with diverse comparators and ill-defined end point collection, but in reality the findings are a restatement of those of PROactive. If an outcome study is reasonably designed and conducted, it should displace anything based on poorer quality or observational data.
CONCLUSIONS
The problems discussed above suggest that the statistical and headline-grabbing power of meta-analysis must be applied and interpreted with caution. In the CV studies of glucose-lowering in which underpowering occurred, the finding of statistically significant reduction of MI and CV events is useful—but only when applied in the context of other information such as duration of the intervention and the populations that might benefit (Table 3). Taken together, the current DPP-4 inhibitor data are also positive, but the extent of the advantage is probably overestimated, such that, beyond assurance of CV safety, there is little information that is clinically applicable (Fig. 2).
On the insulin glargine cancer issue, the careful meta-analyses by Boyle and colleagues (50,51) were of assistance in clarifying an area of some concern even before the reporting of the Outcome Reduction with Initial Glargine Intervention (ORIGIN) study, but here again the meta-analyses of the RCTs were limited by the duration, size, and quality of the underlying studies. The care taken by Boyle and colleagues in handling the observational studies does, however, emphasize the importance of considering the clinical quality of the underlying data and not just using technical criteria such as Jadad scoring.
Meta-analysis will remain important in the regulatory arena, in which weak data, particularly on safety and long-term outcomes, are usual at the time of submission for licensing, and in which what is needed is the best possible understanding of the possibilities and probabilities of risks and benefits. For this reason alone, meta-analysis will be with us for the foreseeable future, but clinicians and the media do need to learn which results are reliable, which are merely signals of possible effect, and which are associations without likely causation. Editors and reviewers of manuscripts have a duty to ensure that cautions are made explicit even when the headline results are given emphasis, while those presenting results to the media sometimes need more humility in making their findings public.
Acknowledgments
P.D.H. or institutions he is associated with received funding for his advisory, educational, and lecturing activities from AstraZeneca/BMS Alliance, Boehringer Ingelheim, Eli Lilly, GlaxoSmithKline, Merck Sharp & Dohme, Merck Serono, Novo Nordisk, Novartis, Sanofi, and Takeda Pharmaceuticals, including for some of the medications mentioned in this article. No other potential conflicts of interest relevant to this article were reported.