Misdiagnosis by physicians occurs in approximately 5% of outpatients. Computerised diagnostic decision support (CDDS) programmes can help, and interest in this area has increased alongside advances in artificial intelligence and wider availability of clinical data. CDDS were originally designed for doctors; those called symptom checkers are designed to assist patients directly by creating differential diagnoses and advising on the need for further care. The health technology company Babylon recently claimed that its Babylon Diagnostic and Triage System outperformed the average human doctor on a subset of the Royal College of General Practitioners exam. The company supported this claim with an internal evaluation study, the results of which were met with scepticism because of methodological concerns. In particular, data in the trials were entered by doctors, not the intended lay users, and no statistical significance testing was performed. Comparisons between the Babylon Diagnostic and Triage System and seven doctors were sensitive to outliers; poor performance of just one doctor skewed results in favour of the Babylon Diagnostic and Triage System. Qualitative assessment of diagnosis appropriateness made by three clinicians exhibited high levels of disagreement. Comparison to historical results from a study by Semigran and colleagues produced high scores for the Babylon Diagnostic and Triage System but was potentially biased by unblinded selection of a subset of 30 of 45 test cases. The detailed analysis is shown in the appendix. Babylon is commended for releasing a fairly detailed description of the system development and the three evaluation studies. This is an important first step in determining its performance and safety. Overall, these results suggest that the Babylon Diagnostic and Triage System potentially showed some improvement compared to the average symptom checker in the Semigran study. However, methodological issues mean that any performance improvement is not proven.
It is not possible to determine how well the Babylon Diagnostic and Triage System would perform on a broader randomised set of cases or with data entered by patients instead of doctors. Babylon’s study does not offer convincing evidence that its Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse. If this study is the only evidence for the performance of the Babylon Diagnostic and Triage System, then it appears to be early in stage 2 of the STEAD framework (preclinical). Further clinical evaluation is necessary to ensure confidence in patient safety. Similar concerns with the performance of other CDDS for patients have been reported. Wolf and colleagues showed a high false negative rate in three of four systems designed to detect melanomas from images, which if used in the real world could falsely reassure patients and put their lives at risk. Symptom checkers with significant false negative rates could create similar dangers if used by patients presenting with high-risk diseases such as cardiac ischaemia, pulmonary embolism, or meningitis. These cases highlight the urgent need for guidelines on robust evaluation of CDDS directed at patients for safety, efficacy, effectiveness, and cost. Such guidelines should form the basis of a regulatory framework, as there is currently minimal regulatory oversight of these technologies. Without such structure, commercial entities have little incentive to develop…

Published Online November 6, 2018. http://dx.doi.org/10.1016/S0140-6736(18)32819-8
               