Clinical artificial intelligence algorithms are becoming increasingly accurate. However, accurate algorithms do not necessarily improve surgical patient and system outcomes or safety. In fact, even accurate algorithms can at times worsen outcomes through unintended effects. It is only after robust implementation studies and clinical trials that these algorithms can be reliably determined to be ready for safe clinical use. The predictive performance of an artificial intelligence algorithm may be described in terms of a number of metrics, including accuracy and receiver operating characteristic curves. However, these measures of predictive performance are not measures of efficacy or effectiveness. Through effects such as automation bias, in which overreliance on automated processes produces errors of omission or commission, artificial intelligence algorithms have the potential to worsen outcomes. Accordingly, a stepwise and rigorous process is required to demonstrate that artificial intelligence algorithms improve patient outcomes.

The steps involved in the development of a clinical artificial intelligence algorithm can be considered similar to those in the development of a clinical decision rule or risk-stratification scale. These steps have been outlined previously and primarily comprise derivation studies, validation studies and implementation studies. Derivation studies involve the development of an algorithm. Examples of algorithms developed at this stage include those utilizing ensemble learning (such as boosting and bootstrap aggregating) and deep learning. The derivation study will include a report of the model’s performance. The term ‘validation’ is used in multiple different ways in the surgical and artificial intelligence literature. Normally, validation studies evaluate the performance of these same models in a separate, prospective and/or external data set. There are guidelines regarding the processes that should be employed in validation studies, including the use of measures of discrimination and calibration and the evaluation of generalisability in diverse settings. A lack of robust validation studies has been cited as one reason for the limited success of medical artificial intelligence implementation studies, although examples of such validation studies have been provided by international consortia. If model performance in validation studies remains at a level that may provide utility, the process may then proceed to implementation studies (an illustrative sketch of a derivation-then-validation assessment is given below).

In artificial intelligence parlance, ‘implementation studies’ may refer either to the strict sense of the term or to clinical trials that implement models. Randomized clinical trials are familiar to clinicians, and this type of evidence forms the foundation of modern evidence-based practice. These trials must show a benefit in patient- or system-oriented factors before an intervention is ready for clinical use. Clinical trials of artificial intelligence algorithms are no different. However, the strict definition of the term ‘implementation study’ refers to studies that examine how an intervention is implemented and which factors influence the effect of its use. Such factors include the acceptability, uptake, costs and sustainability of an intervention. These studies are of particular relevance to clinical artificial intelligence algorithms, as significant issues with interpretability may limit uptake.
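To make the derivation and validation steps described above more concrete, the following sketch shows, in Python with scikit-learn, how a hypothetical risk model might be derived with an ensemble (boosting) method and then assessed for discrimination (area under the receiver operating characteristic curve) and calibration (Brier score and a reliability curve). The data, model choice and parameter values are illustrative assumptions rather than anything specified in this perspective, and a genuine validation study would use a separate, prospective and/or external clinical data set rather than a random split of the derivation cohort.

```python
# Illustrative sketch only: derivation of a hypothetical risk model followed by
# an assessment of discrimination and calibration on held-out data.
# All data and parameter choices are assumptions for demonstration purposes.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve

# Synthetic stand-in for a derivation cohort (a real study would use clinical data).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_deriv, X_valid, y_deriv, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Derivation: fit an ensemble (boosting) model.
model = GradientBoostingClassifier(random_state=0).fit(X_deriv, y_deriv)

# Validation: discrimination and calibration on data not used for derivation.
p_valid = model.predict_proba(X_valid)[:, 1]
print(f"ROC AUC (discrimination): {roc_auc_score(y_valid, p_valid):.3f}")
print(f"Brier score (calibration): {brier_score_loss(y_valid, p_valid):.3f}")

# Reliability curve: observed event rate within bins of predicted risk.
observed, predicted = calibration_curve(y_valid, p_valid, n_bins=10)
for obs, pred in zip(observed, predicted):
    print(f"mean predicted risk {pred:.2f} -> observed event rate {obs:.2f}")
```

In a genuine validation study these measures would be re-examined in external and prospective cohorts to evaluate generalisability, as the guidelines cited above recommend.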
Important considerations arise during study design, such as the applicability of the null hypothesis paradigm and the use of techniques such as data enrichment of study arms to address potential artificial intelligence-related biases. The critical appraisal of artificial intelligence clinical trials and implementation studies is important for surgical systems. First, it is necessary to identify whether clinical trials have in fact been conducted for a particular algorithm. Relative to other approaches to system modification, there is currently a paucity of published clinical trials of artificial intelligence algorithms in the surgical literature. If a clinical trial has been conducted, it requires careful evaluation. The critical appraisal of artificial intelligence clinical trials may be more familiar to clinicians than the evaluation of derivation and validation studies. Whereas derivation and validation studies may focus on the intricacies of model development and use complex statistical methodologies, artificial intelligence clinical trials should focus on the improvement of patient outcomes. This means that, as with standard clinical trials, artificial intelligence clinical trial designs should be evaluated for issues with randomisation, blinding, generalisability and the clinical significance of endpoints. These trials should compare novel interventions with the existing standard of care to evaluate comparative effectiveness (an illustrative sample-size sketch for such a comparison is given below). They should be as rigorous as, if not more rigorous than, trials evaluating the effect of new pharmacological therapeutics or other medical devices. Guidelines have also been published to assist with this critical appraisal process. A recent systematic review provides a summary of randomized healthcare machine learning clinical trials; it identified frequent methodological issues, variable reporting and a significant risk of bias.

The regulatory processes relating to artificial intelligence algorithms, which are subject to ongoing discussion, have complexities beyond the scope of this perspective. However, the regulation of clinical artificial intelligence algorithms will be inherently related to evidence generated through artificial intelligence clinical trials. As more artificial intelligence algorithms near clinical
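The trial-design considerations above can be illustrated with a simple, entirely hypothetical sample-size calculation for a two-arm randomized trial comparing an artificial intelligence-assisted care pathway against the existing standard of care on a binary endpoint. The event rates, type I error and power used below are invented for illustration and are not taken from this perspective or from any cited trial; the sketch uses the statsmodels library in Python.

```python
# Illustrative sketch only: approximate sample size per arm for a hypothetical
# two-arm randomized trial of an AI-assisted pathway versus standard of care.
# The event rates, alpha and power below are assumptions, not values from the article.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_standard = 0.15  # assumed event (e.g. complication) rate with standard care
p_ai = 0.10        # assumed event rate with the AI-assisted pathway

effect_size = proportion_effectsize(p_standard, p_ai)  # Cohen's h for two proportions

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # two-sided type I error
    power=0.80,            # desired statistical power
    ratio=1.0,             # equal allocation between arms
    alternative="two-sided",
)
print(f"Approximate participants required per arm: {n_per_arm:.0f}")
```

Even under these optimistic assumptions the required cohort runs to several hundred patients per arm, which helps to explain the paucity of published artificial intelligence clinical trials noted above.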