Diagnostic accuracy of three computer-aided detection systems for detecting pulmonary tuberculosis on chest radiography when used for screening

Abstract

This study independently evaluated the diagnostic accuracy of three artificial intelligence (AI)-based computer-aided detection (CAD) systems for detecting pulmonary tuberculosis (TB) on chest X-rays (CXRs) from global migrant screening, against both microbiological and radiological reference standards (MRS and RadRS, respectively). Retrospective clinical data and CXR images were collected from the International Organization for Migration (IOM) global database of pre-migration health assessment TB screening for US-bound migrants. A total of 2,812 participants were included in the analysis against RadRS, of whom 1,769 (62.9%) had accompanying microbiological test results and were included in the analysis against MRS. All CXRs were interpreted by three CAD systems (CAD4TB v6, Lunit INSIGHT v4.9.0, and qXR v2) in an offline setting and re-interpreted by two expert radiologists in a blinded fashion. Performance was evaluated using receiver operating characteristic (ROC) curves and estimates of sensitivity and specificity at different CAD thresholds against both reference standards, and was compared with that of the expert radiologists. The area under the ROC curve against MRS was highest for Lunit (0.85; 95% CI 0.83–0.87), followed by qXR (0.75; 95% CI 0.72–0.77) and then CAD4TB (0.71; 95% CI 0.68–0.73). At a set specificity of 70%, Lunit had the highest sensitivity (81.4%; 95% CI 77.9–84.6); at a set sensitivity of 90%, specificity was also highest for Lunit (54.5%; 95% CI 51.7–57.3). The CAD systems performed comparably to the expert radiologists in sensitivity (98.3%) and, except for CAD4TB, in specificity (13.7%). Similar trends were observed when using RadRS. The area under the ROC curve against RadRS was highest for CAD4TB (0.87; 95% CI 0.86–0.89) and Lunit (0.87; 95% CI 0.85–0.88), followed by qXR (0.81; 95% CI 0.80–0.83).
At a set specificity of 70%, CAD4TB had the highest sensitivity (84.1%; 95% CI 82.3–85.8), followed by Lunit (80.9%; 95% CI 78.9–82.7); at a set sensitivity of 90%, specificity was also highest for CAD4TB (54.6%; 95% CI 51.3–57.8). In conclusion, the three CAD systems had broadly similar diagnostic accuracy for TB screening, and accuracy comparable to that of an expert radiologist against MRS. Across the two reference standards, Lunit performed better than both qXR and CAD4TB against MRS, and CAD4TB and Lunit performed better than qXR against RadRS. Moreover, the performance of the CAD systems can be affected by the characteristics of population subgroups. The main limitation was that the study relied on retrospective data and that microbiological testing was not routinely done in individuals with a low suspicion of TB and a normal CXR. Our findings suggest that CAD systems could be a useful tool for TB screening programs in remote, high-TB-prevalence settings where access to expert radiologists may be limited. However, further large-scale prospective studies are needed to address outstanding questions around the operational performance and technical requirements of the CAD systems.
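To illustrate what "sensitivity at a set specificity" means, the following is a minimal sketch, not the study's analysis code. It assumes that a higher CAD score indicates greater suspicion of TB and that a case is called positive when its score is at or above the threshold. The function name `sensitivity_at_specificity` and the toy scores and labels are hypothetical and are not study data.

```python
def sensitivity_at_specificity(scores, labels, target_specificity):
    """Find the lowest threshold whose specificity meets the target.

    scores: CAD abnormality scores (assumed: higher = more suspicious).
    labels: 1 if the reference standard is positive for TB, else 0.
    Returns (threshold, sensitivity, specificity).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    # Candidate thresholds: every observed score, plus one above the maximum
    # (which calls every case negative, so specificity reaches 1.0).
    candidates = sorted(set(scores)) + [max(scores) + 1]

    # Raising the threshold can only raise specificity and lower sensitivity,
    # so the first threshold meeting the target keeps sensitivity highest.
    for threshold in candidates:
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s >= threshold for s in positives) / len(positives)
            return threshold, sensitivity, specificity


# Toy example (hypothetical scores, not study data).
toy_scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 95]
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
print(sensitivity_at_specificity(toy_scores, toy_labels, 0.70))
```

The same search run in the other direction (the highest threshold whose sensitivity is still at least 90%) would give specificity at a set sensitivity. With few cases the achieved specificity can overshoot the target, as in the toy data, where the closest attainable value at or above 70% is 80%.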