Grading the quality of evidence and the strength of recommendations
Judgments about evidence and recommendations in healthcare are complex. For example, those who must decide between recommending selective serotonin reuptake inhibitors (SSRIs) and tricyclics for the treatment of moderate depression must agree on which outcomes to consider, which evidence to include for each outcome, how to assess the quality of that evidence, and how to determine whether SSRIs do more good than harm compared with tricyclics. Because resources are always limited and money that is allocated to treating depression cannot be spent on other worthwhile interventions, they may also need to decide whether any incremental health benefits are worth the additional costs.
Systematic reviews of the effects of healthcare provide essential, but not sufficient information for making well informed decisions. Reviewers and people who use reviews draw conclusions about the quality of the evidence, either implicitly or explicitly. Such judgments guide subsequent decisions. For example, clinical actions are likely to differ depending on whether one concludes that the evidence that warfarin reduces the risk of stroke in patients with atrial fibrillation is convincing (high quality) or that it is unconvincing (low quality).
Similarly, practice guidelines and people who use them draw conclusions about the strength of recommendations, either implicitly or explicitly. Using the same example, a guideline that recommends that patients with atrial fibrillation should be treated may suggest that all patients definitely should be treated or that patients should probably be treated, implying that treatment may not be warranted in all patients.
A systematic and explicit approach to making judgments such as these can help to prevent errors, facilitate critical appraisal of these judgments, and improve communication of this information. Since the 1970s a growing number of organizations have employed various systems to grade the quality (level) of evidence and the strength of recommendations. Unfortunately, different organizations use different systems to grade evidence and recommendations. The same evidence and recommendation could be graded as “II-2, B”, “C+, 1”, or “strong evidence, strongly recommended” depending on which system is used. This is confusing and impedes effective communication.
Criteria for applying or using GRADE
One of the aims of the GRADE Working Group is to reduce unnecessary confusion arising from multiple systems for grading evidence and recommendations. To avoid adding to this confusion by having multiple variations of the GRADE system, we suggest that the criteria below should be met when saying that the GRADE system was used. Also, while users may believe there are good reasons for modifying the GRADE system, we discourage the use of “modified GRADE approaches” that differ substantially from the approach described by the GRADE Working Group.
On the other hand, we encourage and welcome constructive criticism of the GRADE approach, suggestions for improvements, and involvement in the GRADE Working Group. Like most scientific approaches to advancing healthcare, the GRADE approach will continue to evolve in response to new evidence and to meet the needs of systematic review authors, guideline developers and other users.
Suggested criteria for stating that the GRADE system was used:
- “Quality of evidence” should be defined consistently with one of the two definitions (for guidelines or for systematic reviews) used by the GRADE Working Group.
- Explicit consideration should be given to each of the GRADE criteria for assessing the quality of evidence (risk of bias/study limitations, directness, consistency of results, precision, publication bias, magnitude of the effect, dose-response gradient, and the influence of residual plausible confounding and bias, or “antagonistic bias”), although different terminology may be used.
- The overall quality of evidence should be assessed for each important outcome and expressed using four (e.g. high, moderate, low, very low) or, if justified, three (e.g. high, moderate, and low, with very low and low combined into low) categories based on definitions for each category that are consistent with the definitions used by the GRADE Working Group.
- Evidence summaries (narrative or in table format) should be used as the basis for judgments about the quality of evidence and the strength of recommendations. Ideally, full evidence profiles suggested by the GRADE Working Group should be used and these should be based on systematic reviews. At a minimum, the evidence that was assessed and the methods that were used to identify and appraise that evidence should be clearly described. In particular, reasons for upgrading and downgrading should be described transparently.
- Explicit consideration should be given to each of the GRADE criteria for assessing the strength of a recommendation (the balance of desirable and undesirable consequences, quality of evidence, values and preferences, and resource use) and a general approach should be reported (e.g. if and how costs were considered, whose values and preferences were assumed, etc.).
- The strength of recommendations should be expressed using two categories (weak/conditional and strong) for or against a management option and the definitions for each category should be consistent with those used by the GRADE Working Group. Different terminology to express weak/conditional and strong recommendations may be used, although the interpretation and implications should be preserved.
- Decisions about the strength of the recommendations should ideally be transparently reported.
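
For readers who keep assessments in software, the criteria above can be sketched as a small record structure. This is only an illustrative sketch, not a tool or schema published by the GRADE Working Group: the class, field and function names (`OutcomeAssessment`, `Recommendation`, `unaddressed`, `collapse_to_three`) and the shortened criterion labels are assumptions made for this example. It shows one way to check that every criterion has been explicitly considered and to express the three-category variant.

```python
from dataclasses import dataclass, field
from enum import Enum

# Criterion labels shortened from the list above; the labels themselves
# are illustrative, not official GRADE identifiers.
QUALITY_CRITERIA = (
    "risk of bias",
    "directness",
    "consistency",
    "precision",
    "publication bias",
    "magnitude of effect",
    "dose-response gradient",
    "residual plausible confounding",
)

STRENGTH_CRITERIA = (
    "balance of consequences",
    "quality of evidence",
    "values and preferences",
    "resource use",
)


class Quality(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"
    VERY_LOW = "very low"


class Strength(Enum):
    STRONG = "strong"
    WEAK = "weak/conditional"


class Direction(Enum):
    FOR = "for"
    AGAINST = "against"


@dataclass
class OutcomeAssessment:
    """Quality rating for one important outcome (hypothetical record)."""
    outcome: str
    quality: Quality
    # Maps a criterion label to the free-text judgment recorded for it.
    judgments: dict = field(default_factory=dict)

    def unaddressed(self):
        """Return the quality criteria with no recorded judgment."""
        return [c for c in QUALITY_CRITERIA if c not in self.judgments]


@dataclass
class Recommendation:
    """Strength and direction for one management option (hypothetical record)."""
    option: str
    strength: Strength
    direction: Direction
    judgments: dict = field(default_factory=dict)

    def unaddressed(self):
        """Return the strength criteria with no recorded judgment."""
        return [c for c in STRENGTH_CRITERIA if c not in self.judgments]


def collapse_to_three(quality):
    """Three-category variant: very low and low are combined into low."""
    return Quality.LOW if quality is Quality.VERY_LOW else quality


if __name__ == "__main__":
    # Placeholder values only; they do not describe any real evidence.
    example = OutcomeAssessment(
        "example outcome",
        Quality.VERY_LOW,
        {"risk of bias": "placeholder judgment"},
    )
    print(example.unaddressed())
    print(collapse_to_three(example.quality).value)
```

A record of this kind only documents that each criterion was considered and what was concluded; the judgments themselves still have to be made and reported by the review or guideline authors.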