Rate each included study with a validated tool, use two independent reviewers, justify every judgment, and show how ratings shape your synthesis.
Doing Quality Assessment In A Systematic Review: The Core Workflow
Quality assessment tells readers whether study results are likely to be trustworthy. Many teams use the phrase “risk of bias,” which points to the same idea: features of a study that could distort its findings. Your workflow should be simple, repeatable, and well documented.
The big picture is the same across topics. Pick the right tool for each study design, define decision rules before you start, train the team with a small pilot, rate each domain, and record the reason for every call. Then feed those ratings into your analysis plan so the judgments actually matter.
Choose The Right Tool For Your Studies
Match tools to designs. A randomized trial needs different questions than a cohort study or a diagnostic accuracy paper. The Cochrane Handbook describes widely used tools for common designs, and many fields have their own add-ons. Use the tool that best fits the methods you will encounter.
| Study Design | Recommended Tool | Core Domains |
|---|---|---|
| Randomized trials | RoB 2 | Randomization process, deviations, missing data, measurement, reporting |
| Nonrandomized interventions | ROBINS-I | Confounding, selection, classification, deviations, missing data, measurement, reporting |
| Diagnostic accuracy | QUADAS-2 | Patient selection, index test, reference standard, flow and timing |
| Prognosis | PROBAST | Participants, predictors, outcomes, analysis |
| Systematic reviews | AMSTAR 2 | Protocol, search, selection, data, bias, synthesis, funding |
| Observational studies | JBI checklists or Newcastle-Ottawa Scale (NOS) | Selection, comparability, exposure or outcome |
Some tools give a domain-level call only, while others ask for an overall rating. Either way, keep to the intent of the tool. Do not invent extra scores or sum domain points unless the developers say to do that. If a design falls outside these tools, state why and cite the checklist you choose.
Plan, Calibrate, And Pilot
Write decision rules in your protocol. Spell out what “low,” “some concerns,” and “high” mean for each domain in your topic. Add examples that fit your area. Store the rules in your extraction form so they sit next to the questions your team answers.
Calibrate before you rate the full set. Pick five to ten diverse studies. Have two people rate them independently, then compare notes. Tighten unclear rules and add missing examples. This small investment reduces back-and-forth later and improves agreement.
How To Assess Study Quality For Systematic Reviews: Tools And Tips
Prepare Your Form
Build one form per design. Mirror each domain in the tool and add short guidance under every item. Include a text box for free-form notes and a required field for the reason behind each call. Keep answer options close to the tool language so exports are easy to read.
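If you keep the form in a spreadsheet export or a small script, a consistent record structure makes the required fields hard to skip. Here is a minimal sketch in Python; the field names and example entry are illustrative, not part of any official tool format:

```python
# Minimal sketch of a rating-form entry. Field names are illustrative,
# not part of any official tool export format.
from dataclasses import dataclass, field

@dataclass
class DomainRating:
    study_id: str
    domain: str                 # e.g. "Randomization process"
    judgement: str              # "low", "some concerns", or "high"
    reason: str                 # required free-text justification
    page_refs: list = field(default_factory=list)

# Example of the kind of entry a reviewer might record
entry = DomainRating(
    study_id="Smith 2021",
    domain="Missing outcome data",
    judgement="some concerns",
    reason="Attrition 18% vs 9% between arms; no sensitivity analysis reported.",
    page_refs=["p. 7", "Supplementary Table 2"],
)
```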
Judge Each Domain
Work domain by domain, not question by question. Read the methods section, any protocols, and supplementary files. Answer the signaling questions, cite page numbers, and write a short note that ties evidence to the choice you make. If information is missing, say so plainly.
Decide The Overall Rating
Use the rules of the tool. With RoB 2, one “high risk” domain usually makes the study “high” overall. With ROBINS-I, one “critical” domain moves the whole study to “critical.” Do not water down or re-label ratings to fit a preference for nuance; use notes to show nuance instead.
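The "worst domain sets the overall call" logic can be written down so every rater applies it the same way. A minimal sketch, assuming judgments are stored as plain strings; always follow each tool's full algorithm, since RoB 2, for example, can also reach "high" from several "some concerns" domains:

```python
# Sketch of the "worst domain sets the overall call" rule. Labels mirror
# RoB 2 / ROBINS-I terms; check the tool's own algorithm for extra conditions,
# such as RoB 2 reaching "high" from multiple "some concerns" domains.
SEVERITY = {"low": 0, "some concerns": 1, "moderate": 1,
            "serious": 2, "high": 2, "critical": 3}

def overall_rating(domain_judgements):
    """Return the most severe judgement across domains."""
    return max(domain_judgements, key=lambda j: SEVERITY[j.lower()])

print(overall_rating(["low", "some concerns", "high"]))  # high
print(overall_rating(["low", "moderate", "critical"]))   # critical
```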
Record Reasons That Others Can Audit
Every rating should stand on its own. A third person reading your notes should see the link from evidence to judgment. Quote or paraphrase the parts of the paper that drove your call, and store the reference in your form. If you contacted authors, record the date and their reply.
Use Two Reviewers And Resolve Disagreements
Independent rating is the guardrail that protects against drift. Assign two reviewers to each study. They rate without seeing each other’s calls, then compare. A short huddle solves many mismatches. For tougher cases, bring in a third reviewer. Track reasons for changes so you can explain them later.
Agreement statistics can be helpful for process checks. Simple percent agreement is easy to read. A kappa or a Gwet AC1 can add context when categories are imbalanced. Report the metric and the slice of studies used so readers can judge the number.
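A short script keeps these checks honest. The sketch below computes percent agreement and Cohen's kappa with scikit-learn on made-up ratings; Gwet's AC1 needs a dedicated package and is not shown here:

```python
# Sketch of simple agreement checks between two reviewers. Ratings below
# are made up; cohen_kappa_score comes from scikit-learn.
from sklearn.metrics import cohen_kappa_score

reviewer_a = ["low", "high", "some concerns", "low", "low", "high"]
reviewer_b = ["low", "high", "low", "low", "low", "high"]

percent_agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)
kappa = cohen_kappa_score(reviewer_a, reviewer_b)

print(f"Percent agreement: {percent_agreement:.2f}")  # 0.83
print(f"Cohen's kappa: {kappa:.2f}")
```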
Turn Ratings Into Synthesis And GRADE
Quality assessment only helps readers when it shapes the analysis. Predefine how ratings influence your synthesis: exclude “critical” studies, run sensitivity checks without “high” studies, or down-weight uncertain evidence. When you grade the body of evidence, use the GRADE Handbook to judge certainty by outcome.
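A predefined filter makes these rules concrete. The sketch below uses pandas with illustrative column names and values; the meta-analysis itself is fit elsewhere:

```python
# Sketch of a predefined sensitivity filter: exclude "critical" studies from
# the primary analysis and rerun without "high" risk studies as a check.
import pandas as pd

studies = pd.DataFrame({
    "study_id": ["A 2019", "B 2020", "C 2021", "D 2022"],
    "overall_rob": ["low", "high", "some concerns", "critical"],
})

primary = studies[studies["overall_rob"] != "critical"]
sensitivity = studies[studies["overall_rob"].isin(["low", "some concerns"])]

print(primary["study_id"].tolist())      # ['A 2019', 'B 2020', 'C 2021']
print(sensitivity["study_id"].tolist())  # ['A 2019', 'C 2021']
```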
Show the impact of ratings in plots and tables. A traffic-light figure gives a quick scan. Forest plots that label or color studies by risk status help readers connect methods to results. Keep the primary model clean and move exploratory runs to supplements.
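Dedicated packages exist for these figures, but a basic traffic-light plot takes only a few lines. A sketch with matplotlib, using made-up study names and domain calls:

```python
# Sketch of a homemade traffic-light figure. Study names and domain calls
# are made up (0 = low, 1 = some concerns, 2 = high).
import matplotlib.pyplot as plt
import numpy as np

domains = ["Randomization", "Deviations", "Missing data", "Measurement", "Reporting"]
studies = ["A 2019", "B 2020", "C 2021"]
calls = np.array([[0, 0, 1, 0, 0],
                  [2, 1, 1, 0, 2],
                  [0, 0, 0, 1, 0]])

colors = {0: "#2e7d32", 1: "#f9a825", 2: "#c62828"}
fig, ax = plt.subplots(figsize=(6, 2.5))
for i in range(len(studies)):
    for j in range(len(domains)):
        ax.scatter(j, i, s=400, color=colors[calls[i, j]])
ax.set_xticks(range(len(domains)))
ax.set_xticklabels(domains, rotation=30, ha="right")
ax.set_yticks(range(len(studies)))
ax.set_yticklabels(studies)
ax.invert_yaxis()
fig.tight_layout()
fig.savefig("traffic_light.png", dpi=200)
```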
Report Your Methods With PRISMA
Readers need a clear map of your process. State the tool used for each design, who rated, how disagreements were handled, and how ratings fed into synthesis. The PRISMA 2020 checklist points to the items that clearly belong in your methods, results, figures, and appendices.
Special Cases And Practical Notes
Cluster And Crossover Trials
For cluster trials, check whether the analysis adjusted for clustering. If not, mark the measurement or analysis domain at higher concern and adjust the effect or variance when possible. With crossover trials, look for carryover and washout; note the period used in the analysis.
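The usual variance fix multiplies the variance by the design effect, 1 + (m - 1) * ICC, where m is the average cluster size and ICC the intracluster correlation. A small sketch with illustrative numbers; take the ICC from the paper or a documented external source:

```python
# Sketch of the design-effect adjustment for a cluster trial that ignored
# clustering: inflate the variance by 1 + (m - 1) * ICC. Values are illustrative.
import math

se_unadjusted = 0.10      # reported standard error of the effect
avg_cluster_size = 25     # m, average cluster size
icc = 0.02                # intracluster correlation coefficient

design_effect = 1 + (avg_cluster_size - 1) * icc
se_adjusted = se_unadjusted * math.sqrt(design_effect)

print(f"Design effect: {design_effect:.2f}")  # 1.48
print(f"Adjusted SE: {se_adjusted:.3f}")      # 0.122
```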
Nonrandomized Interventions
List the confounders that matter for your question. If a study did not measure or adjust for these, rate confounding accordingly. Look for design features that limit bias, such as target trial emulation, propensity scores with good balance, or instrumental variable methods with strong assumptions spelled out.
Diagnostic Accuracy Studies
Check blinding between index and reference tests, the timing between tests, and whether the reference standard is valid for your setting. Pay attention to patient flow and exclusions, which can inflate accuracy if not handled well.
Systematic Reviews As Inputs
When a review supplies data to your umbrella review, rate it with AMSTAR 2. Prefer reviews with a protocol, a full search, duplicate screening, clear risk of bias methods, and transparent synthesis.
Differentiate Bias Types And Small-Study Effects
Study Bias Versus Reporting Bias
Study-level bias comes from design and conduct, while reporting bias comes from what gets written or posted. Your domain ratings judge design and conduct. To catch reporting bias, compare outcomes listed in protocols or registries with outcomes in the paper. If planned outcomes vanish or new ones appear, note that pattern in the reporting domain.
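A quick set comparison makes this registry-versus-paper check systematic. The outcome names below are made up:

```python
# Sketch of a registry-versus-paper outcome check. Outcome names are made up;
# any mismatch goes into the notes for the reporting domain.
registered = {"all-cause mortality", "hospital readmission", "quality of life"}
reported = {"all-cause mortality", "hospital readmission", "6-minute walk distance"}

dropped = registered - reported   # planned but not reported
added = reported - registered     # reported but not planned

print("Missing from paper:", sorted(dropped))  # ['quality of life']
print("New in paper:", sorted(added))          # ['6-minute walk distance']
```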
Small-Study Effects And Funnel Plots
Small trials can overstate benefits because they stop early, use flexible methods, or remain unpublished when results are neutral. Funnel plots and simple tests can flag asymmetry when you have enough studies. Treat these tools as signals, not verdicts. Pair them with a close read of methods and context.
When A Test Adds No Value
Asymmetry tests need at least ten studies per comparison and a mix of sizes. With fewer studies, a test adds noise. In that case, lean on design ratings, registries, and a forthright limitations paragraph so readers see the full picture.
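When you do have enough studies, an Egger-style regression is one common option: regress the standardized effect on precision and look at the intercept. A sketch with made-up effects and standard errors, using statsmodels:

```python
# Sketch of an Egger-style regression: standardized effect regressed on
# precision, with a non-zero intercept suggesting asymmetry. Data are made up;
# only run this with roughly ten or more studies.
import numpy as np
import statsmodels.api as sm

effects = np.array([0.35, 0.20, 0.55, 0.10, 0.42, 0.05, 0.60, 0.15, 0.30, 0.48])
ses = np.array([0.30, 0.12, 0.35, 0.10, 0.28, 0.09, 0.40, 0.11, 0.20, 0.33])

z = effects / ses        # standardized effects
precision = 1 / ses
model = sm.OLS(z, sm.add_constant(precision)).fit()

print(f"Intercept: {model.params[0]:.2f}, p = {model.pvalues[0]:.3f}")
```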
Handling Missing Information And Author Queries
Gaps in methods sections are common. Before you rate a domain as unclear, check protocols, trial registries, and supplements. Many journals host extra files with randomization details, adjudication manuals, or data dictionaries that answer your questions. Document every source you checked.
If major details remain missing, send a short, polite query to the corresponding author. Ask specific questions, one domain at a time. Store messages and replies with the study record. If no reply arrives after a fair window, state that fact and rate based on the evidence in hand. Readers value clear notes more than optimistic guesses.
Quality Assessment For Qualitative And Mixed-Methods Reviews
Some reviews include qualitative studies or mixed-methods projects. These need different lenses. Tools from JBI and CASP ask about sampling, data collection, reflexivity, and credibility checks. Apply the matching checklist and keep it separate from the tools you use for trials or cohorts. Synthesis methods differ, yet the principles hold: two reviewers, written rules, traceable reasons, and a clear path from ratings to the way you combine findings.
Training, Timing, And Workflows
Schedule training in short blocks. Start with one domain, rate studies, then pause for a group debrief. This keeps the language of the tool fresh and avoids inconsistent shortcuts that crop up when people rush. Agree on time budgets and give raters protected blocks of time to work. A shared chat thread for quick questions can prevent drift across weeks.
Build Reproducible Forms And Logs
Keep your rating form, codebook, pilot notes, and decision log under version control. Export a clean table for your supplement that lists each study, each domain, the overall call, and a one-line reason. Readers should be able to trace any data point back to a source. Keep a changelog that lists dates, editors, and the reason behind each rule change so the audit trail stays clean.
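A small export script keeps the supplement table in sync with the form. A sketch with pandas; column names and entries are illustrative:

```python
# Sketch of a supplement export: one row per study and domain, plus the
# overall call and a one-line reason. Entries are made up.
import pandas as pd

rows = [
    {"study_id": "A 2019", "domain": "Randomization process",
     "judgement": "low", "overall": "low",
     "reason": "Central computer-generated allocation (p. 4)."},
    {"study_id": "B 2020", "domain": "Missing outcome data",
     "judgement": "high", "overall": "high",
     "reason": "34% attrition, no imputation or sensitivity analysis (p. 6)."},
]

pd.DataFrame(rows).to_csv("supplement_rob_table.csv", index=False)
```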
Common Pitfalls And Fixes
Mixing Study Quality With Reporting Quality
Poor reporting can hide bias, but absence of evidence is not proof of a good method. If a method is unclear, mark the domain as unclear or “some concerns” and ask authors when feasible. Do not bump ratings up because the paper reads well.
Creating Homemade Scores
Binning domains and summing a total score looks tidy, but it distorts the intent of validated tools. Keep to the rules the tools provide, and carry nuance in your notes and subgroup plans.
Letting Ratings Sit Outside The Analysis
Quality judgments matter only when they steer decisions. Tie each rating pattern to a planned action. If no action fits, say so and explain why the findings still help the reader.
From Ratings To Actions
| Signal | Rating Rule | Action In Synthesis |
|---|---|---|
| Serious deviation from protocol | High risk in “deviations” domain | Exclude from primary meta-analysis; include in sensitivity |
| Large, unexplained imbalance at baseline | High risk in “randomization” or confounding | Downgrade certainty; assess in subgroup or meta-regression |
| Selective outcome reporting suspected | High risk in “reporting” domain | Rate outcome certainty down for bias in GRADE |
Map Ratings To Outcomes
Bias can differ by outcome within the same study. A trial may measure mortality well but handle quality-of-life poorly. When outcomes differ, make separate domain calls and keep a tidy link between each outcome and its rating. This makes your GRADE tables line up with the way decisions are made in practice.
- Rate by outcome when methods differ within a study.
- Prefer patient-centered outcomes when you summarize certainty.
- Explain downgrades in plain language that mirrors the domain names.
This extra granularity adds a little time up front, then pays off when you write main messages. Readers can see which outcomes are solid, which are shaky, and why your confidence shifts across endpoints. That clarity builds trust in your judgments and keeps debates grounded in methods, not hunches.
Helpful Templates And Visuals
Use a simple color scheme for domain calls and a legend that repeats the exact tool wording. A one-page appendix that shows your decision tree for each domain speeds peer review. Many teams now share blank and filled forms alongside data and code.
Ethics, Registration, And Transparency
Post your protocol on a registry such as PROSPERO if your field uses it, and archive the final rating form. Note funding and conflicts for the included studies and for your team. Readers should see who paid for what and how you handled any ties.
Quick Reference Checklist
- Define tools per design.
- Write domain rules with examples.
- Pilot on a diverse set.
- Use two independent raters.
- Record reasons and page cites.
- Predefine how ratings change synthesis.
- Apply GRADE by outcome.
- Report methods with PRISMA.
- Share forms, logs, and data.
