Table 3 Evaluation frameworks of AI studies

From: Artificial intelligence tool development: what clinicians need to know?

No. | Evaluation/reporting framework | Content

1. APPRAISE-AI [32]

APPRAISE-AI evaluates the quality of AI studies at the model development stage across six domains: clinical relevance, data quality, methodological conduct, robustness of results, reporting quality and reproducibility. These domains comprise 24 items with a maximum overall score of 100 points; higher scores indicate stronger methodological and reporting quality

2. MI-CLAIM checklist [37]

Minimum Information about Clinical Artificial Intelligence Modelling (MI-CLAIM) is a tool to improve transparent reporting of AI algorithms in medicine. It aims, first, to enable direct assessment of clinical impact, including fairness and bias, and second, to allow rapid replication of the technical design process of any legitimate clinical AI study. Its six parts are: (1) study design, comprising the clinical setting, performance measures, population composition and the current baselines against which performance is measured; (2) data partitions for model training and testing; (3) optimisation and final model selection; (4) performance evaluation, reported both for the model itself and for its clinical performance metrics; (5) model examination as a “sanity check” to uncover biases and understand model behaviour; and (6) a reproducible pipeline through complete sharing of the code

3. CODE-EHR checklist [38]

The CODE-EHR framework aims to improve the design and reporting of research studies that use structured electronic health-care data. It calls for clarity in reporting and defines a set of minimum and preferred standards for the processes involved in (1) coding, dataset construction and linkage, (2) details and transparency of the preceding step, (3) disease and outcome definitions, (4) analysis, and (5) research governance, which emphasises patient and public engagement throughout the development process. Researchers are advised to use this checklist in the design phase of their study to ensure that important criteria for successful research and research impact are met

4. DECIDE-AI reporting guideline [39]

This comprises the key items to be reported in early-stage clinical studies of AI-based decision support systems in healthcare, to facilitate the appraisal of these studies and the replicability of their findings. It has 17 AI-specific reporting items (with 28 subitems) and 10 generic reporting items, with an explanatory paragraph for each item

5. SPIRIT-AI reporting guideline [40]

The SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence) extension is a reporting guideline for clinical trial protocols evaluating interventions with an AI component. It adds 15 new items to the 33 core SPIRIT 2013 items. SPIRIT-AI requires clear descriptions of the AI intervention, including the instructions and skills needed for use, the setting in which the AI intervention will be integrated, considerations for the handling of input and output data, the human-AI interaction and analysis of error cases

6. CONSORT-AI extension [41]

This adds 14 new items to the core CONSORT 2010 items. It recommends that investigators provide clear descriptions of the AI intervention, including the instructions and skills required for use, the setting in which the AI intervention is integrated, the handling of inputs and outputs of the AI intervention, the human-AI interaction and an analysis of error cases

7. TRIPOD [42] and TRIPOD+AI [43]

TRIPOD+AI contains a 27-item checklist that aims to promote complete, accurate and transparent reporting of studies that develop a prediction model or evaluate its performance. Complete reporting facilitates study appraisal, model evaluation and model implementation. It supports the reporting of research in which a multivariable prediction model is developed (or updated) or validated (tested) using any (supervised) ML technique. The checklists are not a quality appraisal tool

8. PROBAST [44] and PROBAST-AI [45, 46]

The Prediction model Risk Of Bias ASsessment Tool (PROBAST) comprises four domains (participants, predictors, outcome and analysis) and contains 20 signalling questions to facilitate risk-of-bias assessment of prediction model studies across study design, conduct and analysis. PROBAST-AI comprises two components: model development and model evaluation. For model development, users assess quality and applicability with 16 targeted signalling questions; for model evaluation, 18 targeted questions are used to assess risk of bias and applicability. Both components share four domains (participants and data sources, predictors, outcome and analysis), with the prediction model’s applicability rated specifically in the first three domains

9. STARD-AI [47]

The Standards for Reporting of Diagnostic Accuracy Studies AI Extension (STARD-AI) is used to report diagnostic accuracy/test studies. STARD-AI is under development

10. MINIMAR [48]

MINimum Information for Medical AI Reporting (MINIMAR) specifies the minimum information necessary to understand intended predictions, target populations and hidden biases, and to assess the generalisability of these emerging technologies, across four sections: (1) information on the population providing the training data, (2) training data demographics, (3) detailed information about the model architecture and development, and (4) model evaluation, optimisation and validation, to clarify how local model optimisation can be achieved and to enable replication and resource sharing

11. CLAIM [49]

The Checklist for Artificial Intelligence in Medical Imaging (CLAIM) is modelled on the STARD guideline and extends it to address applications of AI in medical imaging, including classification, image reconstruction, text analysis and workflow optimisation, to guide complete reporting of research

12. CHEERS-AI [50, 51]

The Consolidated Health Economic Evaluation Reporting Standards-AI (CHEERS-AI) extension assists in reporting health economic evaluations that estimate the value for money (cost-effectiveness) of AI interventions

13. IDEAL checklists [52, 53]

The Innovation, Development, Exploration, Assessment, and Long-term (IDEAL) Framework describes the five stages through which surgical therapy innovation normally passes. Each IDEAL stage is defined by key research questions. The checklists are intended to provide a minimum list of concepts that authors should include in a report of surgical and device innovation, and can be used both prospectively, to help plan a study, and retrospectively, to assist in appraisal. The IDEAL-D Framework for Device Innovation is a consensus statement on the preclinical stage of development (Stage 0) [54]

14. FUTURE-AI checklist [55]

The guiding principles of FUTURE-AI are (1) fairness, (2) universality, (3) traceability, (4) usability, (5) robustness and (6) explainability. They aim to guide developers, evaluators and other stakeholders in delivering medical AI tools in health imaging that are trustworthy and optimised for real-world practice

15. OPTICA [56]

The Organisational PerspecTIve Checklist for AI solutions adoption (OPTICA) is a comprehensive and practical checklist for assessing the adoption of AI solutions in healthcare organisations. It was developed through a consensus process involving multiple subject-matter domain experts and decision-makers across the authors' organisation. It comprises 13 chapters, each containing 3 to 12 checklist items, totalling 77 items. There is no scoring; the checklist items require a qualitative, case-specific evaluation process

16. ALTAI [21, 57]

The Assessment List for Trustworthy AI (ALTAI) is provided by the European Commission’s High-Level Expert Group on Artificial Intelligence. It comprises seven requirements for trustworthy AI: (1) human agency and oversight, (2) technical robustness and safety, (3) privacy and data governance, (4) transparency, (5) diversity, non-discrimination and fairness, (6) societal and environmental well-being and (7) accountability, with 60 questions in total