
Recent advances in ultrasonography (US) technology and the growing number of thyroid and neck US examinations performed during health checkups have increased the likelihood of detecting thyroid nodules.1) In real-time US, the accurate diagnosis of thyroid nodules demands a high level of concentration from physicians, and diagnoses may therefore vary with factors such as the skill level and fatigue of specialists.2,3) With the rapid development of computer technology, artificial intelligence (AI)-based computer-aided diagnosis (CAD) is impacting medical imaging diagnosis in diverse ways.4,5) In particular, convolutional neural network (CNN) technology enables multi-level self-learning from medical images, and this can be applied to actual medical image analysis.6,7) CNN technology can yield accurate, objective diagnostic results, especially in cancer diagnosis, and assist doctors across various medical fields to increase efficiency.8,9) In the field of thyroid nodule diagnosis, CNNs can classify thyroid nodules on US images as benign or malignant.10,11) Moreover, CNNs can play an assistive role in improving the diagnostic performance of inexperienced physicians.12,13)
In our institution, we developed AI-CAD software, the Severance Artificial intelligence program (SERA), using 13,560 US images of thyroid nodules labeled as either benign or malignant.14) SERA runs over the internet, so it can be used on both computers and mobile phones, which improves user accessibility. When SERA is run on a computer, the user operates SERA with US images uploaded to the picture archiving and communication system (PACS). However, several issues can arise when images are downloaded from PACS for use elsewhere. Security restrictions forbid the direct linking of SERA with PACS images on in-hospital computers, and users may have to find ways to import US images to personal computers to operate SERA. Furthermore, local clinics without a connection to online PACS would have to use additional data transportation methods to access US images. Thus, alternative ways to access US images to operate SERA are needed.
Instead of the complex steps described above for accessing US images, it would be more convenient and accessible if users could take a picture of the PACS monitor with their mobile phone and run the image through SERA without saving it permanently on their device. However, pictures taken with a mobile phone may contain noise that can affect the resolution of US images and the diagnostic ability of SERA. Therefore, in this study, we investigated whether SERA using pictures taken with a mobile phone would show diagnostic performance for thyroid cancers similar to that of SERA using images directly downloaded from PACS.
The Institutional Review Board approved this retrospective study and required neither patient approval nor informed consent for our review of patient images and records.
From October 2019 to December 2019, 579 patients over the age of 19 who underwent US-guided fine-needle aspiration (US-FNA) on thyroid lesions in our institution were included. Among them, 211 patients who had thyroid nodules less than 1 cm in size were excluded. We also excluded 109 nodules in 109 patients who did not receive further management such as repeat FNA or surgery after nondiagnostic or indeterminate US-FNA results. Finally, 259 thyroid nodules in 259 patients were included (Fig. 1).
Five radiologists with 6 to 21 years of experience in thyroid imaging performed US and US-FNA using an EPIQ 7 system (Philips Medical Systems, Bothell, WA, USA) with a 7 to 17 MHz linear transducer. US features, including composition, echogenicity, shape, margin, and calcification, were reviewed and analyzed by the radiology specialists before US-guided FNA.15) Composition was classified as solid, predominantly solid (solid component≥50%), or predominantly cystic (solid component<50%). Echogenicity was classified as hyperechoic or isoechoic (hyperechogenicity or isoechogenicity compared to the surrounding thyroid parenchyma), hypoechoic (hypoechogenicity compared to the surrounding thyroid parenchyma), or marked hypoechoic (hypoechogenicity compared to the strap muscles). Margin was classified as well-defined, or microlobulated or irregular. Calcification was classified as negative, macrocalcifications, eggshell calcifications, or microcalcifications. Shape was classified as parallel or non-parallel (greater in the anteroposterior dimension than the transverse dimension, taller-than-wide shape). A representative US image of each thyroid nodule was selected by an experienced radiologist (J.Y.K.) and saved as a JPEG image in PACS. Afterwards, the same radiologist drew a rectangular region of interest including all thyroid nodules on the US image using the Windows 10 paint program. The US image of the thyroid nodules was displayed on a PACS monitor with a resolution of 1920×1080 pixels. The same radiologist then used a mobile phone (Samsung Galaxy Note 8, 12 megapixel) to take a picture of the US image shown on the PACS monitor screen. Imaging of the PACS monitor was performed in a brightly lit room.
The distance from the PACS monitor was set to about 20 cm, and other settings, such as ambient light, background color, camera angle, exposure time, and contrast value, were manually controlled.16,17) To comply with research ethics, all personal information and patient records appearing on the PACS monitor were kept out of the camera frame. All images were permanently deleted from the mobile phone after SERA diagnosis.
SERA is a deep learning-based CAD that was trained with 13,560 US images of thyroid nodules that were larger than 1 cm in size and either surgically confirmed or cytologically proven as benign or malignant by the Bethesda system.14) The thyroid US images were collected at Severance Hospital from 2004 to 2019 and consisted of 7,160 cases of malignant thyroid nodules and 6,400 cases of benign thyroid nodules. To improve the training speed and accuracy of the transfer learning algorithm of the VGG16 network, each nodule was extracted from the thyroid image as a region of interest and irrelevant parts were removed. In addition, the extracted thyroid images were converted into grayscale images with pixel values from 0 to 255. SERA analyzes the US image uploaded by the user and presents the malignancy probability of the image as a continuous value between 0 and 100. Currently, SERA can be accessed through “http://seracse.yonsei.ac.kr” only by membership registration.
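The preprocessing described above (cropping the nodule as a region of interest and converting it to 8-bit grayscale) can be sketched as follows. This is a minimal illustration and not the SERA implementation: the function name `crop_and_grayscale`, the end-exclusive ROI convention, and the luminance weights (0.299, 0.587, 0.114) are assumptions, since the exact conversion used for SERA is not stated here.

```python
def crop_and_grayscale(pixels, roi):
    """Crop a rectangular ROI and convert it to 8-bit grayscale.

    pixels: list of rows, each row a list of (r, g, b) tuples with 0-255 channels.
    roi:    (top, left, bottom, right), end-exclusive (assumed convention).
    """
    top, left, bottom, right = roi
    gray = []
    for row in pixels[top:bottom]:
        gray.append([
            # Assumed luminance weights; result is clamped to the 0-255 range.
            min(255, max(0, round(0.299 * r + 0.587 * g + 0.114 * b)))
            for r, g, b in row[left:right]
        ])
    return gray


# Tiny synthetic 3x3 "image"; only the lower-right 2x2 block is kept.
black, white, red = (0, 0, 0), (255, 255, 255), (255, 0, 0)
image = [
    [black, black, black],
    [black, white, red],
    [black, black, white],
]
patch = crop_and_grayscale(image, (1, 1, 3, 3))
print(patch)
```

Removing everything outside the region of interest means the network only sees the nodule and its immediate surroundings, which is the stated motivation for the cropping step.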
Based on the review and analysis of US images, one radiologist (J.Y.K.) reclassified the US images according to the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS, TR) as TR2 (not suspicious, 2 points), TR3 (mildly suspicious, 3 points), TR4 (moderately suspicious, 4-6 points), or TR5 (highly suspicious, 7 or more points).18) The same radiologist operated SERA using images directly downloaded from PACS and pictures taken with a mobile phone. Fig. 2 demonstrates each diagnostic process. The diagnostic results of SERA were defined as SERA_P and SERA_M depending on which images were analyzed. The malignancy rates from SERA were divided into categories of 2 to 5, according to the recommended malignancy rate ranges of ACR TI-RADS.
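The two categorization steps above can be sketched in a few lines. The point ranges for TR2 to TR5 are the ones quoted in the text; everything else is an assumption made for illustration: the function names, the treatment of fewer than 2 points as TR1, and above all the probability cutoffs (2%, 5%, 20%), which are placeholder values standing in for the "recommended malignancy rate ranges" and are not confirmed by the text as the values used in this study.

```python
def tirads_level_from_points(points):
    """Map an ACR TI-RADS point total to a TR level (point ranges as quoted in the text)."""
    if points < 0:
        raise ValueError("points must be non-negative")
    if points >= 7:
        return 5  # highly suspicious
    if points >= 4:
        return 4  # moderately suspicious (4-6 points)
    if points == 3:
        return 3  # mildly suspicious
    if points == 2:
        return 2  # not suspicious
    return 1      # assumption: fewer than 2 points is treated as TR1


def tirads_level_from_probability(prob_percent, cutoffs=(2.0, 5.0, 20.0)):
    """Bin a 0-100 malignancy probability into levels 2-5.

    The default cutoffs are hypothetical placeholders, not values taken from the study.
    """
    if not 0.0 <= prob_percent <= 100.0:
        raise ValueError("probability must be between 0 and 100")
    low, mid, high = cutoffs
    if prob_percent <= low:
        return 2
    if prob_percent <= mid:
        return 3
    if prob_percent <= high:
        return 4
    return 5


point_levels = [tirads_level_from_points(p) for p in (0, 2, 3, 4, 6, 7, 10)]
prob_levels = [tirads_level_from_probability(p) for p in (1.0, 4.0, 12.5, 74.6)]
print(point_levels, prob_levels)
```

Keeping the cutoffs as a parameter makes it easy to substitute whichever ranges a given reading protocol actually specifies.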
Continuous variables were presented as means±standard deviations and categorical variables were presented as numbers (percentages). Youden’s index was used to determine the optimal cut-off point for SERA_P and SERA_M on the receiver operating characteristic (ROC) curve, that is, the point that maximized the sum of sensitivity and specificity. To compare diagnostic performance measures such as the sensitivity and specificity of SERA_P, SERA_M, and radiologists, a logistic regression analysis for clustered data (generalized estimating equation method) was performed. The area under the curve (AUC) was calculated using the ROC curve and compared using the DeLong method.
For all statistical processing, SAS (version 9.4, SAS Inc., Cary, NC, USA) was used, and a two-sided p-value of 0.05 or less was considered statistically significant.
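To make the cut-off selection concrete, the following sketch computes Youden's index (J = sensitivity + specificity - 1) over all candidate thresholds and the AUC as the probability that a malignant case scores higher than a benign one. It is an illustration on made-up scores, not the SAS procedure used in the study; the function names and the toy data are hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (threshold, J) maximizing J; a case is positive when score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


def auc_mann_whitney(scores, labels):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy malignancy probabilities (0-100) with labels: 1 = malignant, 0 = benign.
toy_scores = [5, 12, 30, 45, 60, 72, 88, 95]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j_value = youden_cutoff(toy_scores, toy_labels)
auc_value = auc_mann_whitney(toy_scores, toy_labels)
print(cutoff, j_value, auc_value)
```

On this toy data, two thresholds tie for the largest J, and the sketch reports the lower one; a real analysis would need an explicit rule for such ties.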
Table 1 summarizes the baseline characteristics of the 259 patients, of whom 40 had malignant and 219 had benign thyroid nodules. The mean age of all patients, patients with malignant thyroid nodules, and patients with benign thyroid nodules was 50.59±13.26 years, 47±14.86 years, and 51.24±12.87 years, respectively. Neither sex nor age had a significant association with thyroid cancer (p=0.419 and p=0.063, respectively). The mean size of all, malignant, and benign thyroid nodules was 23.24±11.91 mm, 14.8±6.82 mm, and 24.78±12.01 mm, respectively, indicating that benign thyroid nodules were significantly larger than malignant nodules (p<0.001). All US findings were significantly different between benign and malignant nodules (all p<0.05) (Table 1). The mean value of SERA_P was 22.45±26.34% for benign nodules and 74.6±29.83% for malignant nodules (p<0.001). The mean value of SERA_M was 26.74±28.48% for benign nodules and 76.41±27.62% for malignant nodules (p<0.001).
Table 1. Baseline characteristics of the study cohort
| | Total (n=259) | Malignant (n=40) | Benign (n=219) | p-value |
---|---|---|---|---|
| Sex | | | | 0.419 |
| Women | 201 (77.61%) | 33 (82.5%) | 168 (76.71%) | |
| Men | 58 (22.39%) | 7 (17.5%) | 51 (23.29%) | |
| Age (years) | 50.59±13.26 | 47±14.86 | 51.24±12.87 | 0.063 |
| Nodule size (mm) | 23.24±11.92 | 14.8±6.82 | 24.78±12.01 | <0.001 |
| US features | | | | |
| Composition | | | | 0.001 |
| Solid | 167 (64.48%) | 36 (90%) | 131 (59.82%) | |
| Predominantly solid | 72 (27.8%) | 4 (10%) | 68 (31.05%) | |
| Predominantly cystic | 20 (7.72%) | 0 (0%) | 20 (9.13%) | |
| Echogenicity | | | | <0.001 |
| Hyperechoic or isoechoic | 151 (58.3%) | 3 (7.5%) | 148 (67.58%) | |
| Hypoechoic | 103 (39.77%) | 33 (82.5%) | 70 (31.96%) | |
| Marked hypoechoic | 5 (1.93%) | 4 (10%) | 1 (0.46%) | |
| Margin | | | | <0.001 |
| Well-defined | 215 (83.01%) | 13 (32.5%) | 202 (92.24%) | |
| Microlobulated or irregular | 44 (16.99%) | 27 (67.5%) | 17 (7.76%) | |
| Calcification | | | | <0.001 |
| Negative | 190 (73.36%) | 14 (35%) | 176 (80.37%) | |
| Macrocalcifications | 31 (11.97%) | 3 (7.5%) | 28 (12.79%) | |
| Eggshell calcifications | 7 (2.7%) | 1 (2.5%) | 6 (2.74%) | |
| Microcalcifications | 31 (11.97%) | 22 (55%) | 9 (4.11%) | |
| Shape | | | | <0.001 |
| Parallel | 238 (91.9%) | 26 (65%) | 212 (96.8%) | |
| Non-parallel | 21 (8.1%) | 14 (35%) | 7 (3.2%) | |
Malignancy risks calculated from SERA_P and SERA_M were categorized according to values recommended by ACR TI-RADS (Figs. 3, 4). According to the SERA_M results, the malignancy risk of TR2, TR3, TR4, and TR5 was 0% (0/39), 3.61% (3/83), 4.69% (3/64), and 42.5% (34/80), respectively. According to the SERA_P results, the malignancy risk of TR2, TR3, TR4, and TR5 was 0% (0/39), 3.26% (3/92), 10.77% (7/65), and 47.62% (30/63), respectively. According to the radiologists, the malignancy risk of TR2, TR3, TR4, and TR5 was 0% (0/66), 0% (0/68), 13.64% (12/88), and 35.14% (13/37), respectively.
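The per-category malignancy risks are simple proportions, and the SERA_P counts quoted above can be checked directly. The short sketch below is illustrative only: it recomputes the percentages from the (malignant, total) counts in the text and confirms that they add up to the 40 malignant nodules among 259; the variable and function names are hypothetical.

```python
# (malignant, total) counts per category for SERA_P, as quoted in the text.
sera_p_counts = {
    "TR2": (0, 39),
    "TR3": (3, 92),
    "TR4": (7, 65),
    "TR5": (30, 63),
}


def malignancy_rates(counts):
    """Return the malignancy risk per category as a percentage rounded to 2 decimals."""
    return {
        category: round(100.0 * malignant / total, 2)
        for category, (malignant, total) in counts.items()
    }


rates = malignancy_rates(sera_p_counts)
total_malignant = sum(m for m, _ in sera_p_counts.values())
total_nodules = sum(n for _, n in sera_p_counts.values())
print(rates, total_malignant, total_nodules)
```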
The optimal cutoff value was set at TR4; nodules were considered malignant if the ACR TI-RADS category was 5 or greater. There was no significant difference between the AUC of SERA_P (0.8; 95% confidence interval [CI], 0.728 to 0.872) and that of SERA_M (0.82; 95% CI, 0.758 to 0.882) (p=0.526, Table 2, Fig. 5). The diagnostic sensitivities of SERA_P (75%; 95% CI, 61.58 to 88.42) and SERA_M (85%; 95% CI, 73.93 to 96.07) were not statistically different (p=0.091). SERA_P showed significantly higher diagnostic specificity (84.93%; 95% CI, 80.19 to 89.67) than SERA_M (79%; 95% CI, 73.6 to 84.39) (p=0.008). The AUC of the radiologists was 0.856 (95% CI, 0.788 to 0.923), which was not statistically different from that of SERA_P (0.8) or SERA_M (0.82) (p=0.163 and p=0.414, respectively). Radiologists showed a sensitivity of 77.5% (95% CI, 64.56 to 90.44) and a specificity of 93.61% (95% CI, 90.37 to 96.85). Radiologists showed higher diagnostic specificity (93.61%) than SERA_P (84.93%) and SERA_M (79%) (p=0.001 and p<0.001, respectively). However, sensitivity was not statistically different between radiologists and SERA_P (77.5% vs. 75%; p=0.739) or between radiologists and SERA_M (77.5% vs. 85%; p=0.361) (Table 2).
Table 2. Comparison of diagnostic performances for thyroid cancers among SERA_P, SERA_M, and radiologists
| | SERA_P | SERA_M | Radiologists | p-value* | p-value† | p-value‡ |
---|---|---|---|---|---|---|
| Sensitivity | 75 (61.58-88.42) | 85 (73.93-96.07) | 77.5 (64.56-90.44) | 0.091 | 0.739 | 0.361 |
| Specificity | 84.93 (80.19-89.67) | 79 (73.6-84.39) | 93.61 (90.37-96.85) | 0.008 | 0.001 | <0.001 |
| AUC | 0.8 (0.728-0.872) | 0.82 (0.758-0.882) | 0.856 (0.788-0.923) | 0.526 | 0.163 | 0.414 |
The number in parentheses indicates the 95% confidence interval.
AUC: area under the receiver operating characteristic curve, SERA: Severance Artificial intelligence program, SERA_M: results from SERA using images taken from a mobile phone, SERA_P: results from SERA using images directly downloaded from PACS
*p-value for comparing SERA_P and SERA_M. †p-value for comparing SERA_P and radiologists. ‡p-value for comparing SERA_M and radiologists.
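As a worked check of the SERA_P row, sensitivity and specificity can be recomputed from the category counts reported in the text: 30 of the 40 malignant nodules fell in TR5, and 33 of the 63 TR5 nodules were benign, leaving 186 of the 219 benign nodules below TR5. The sketch below pairs each proportion with a normal-approximation (Wald) interval. This is an assumption for illustration: the study reports using a generalized estimating equation approach for comparisons and does not state how the intervals were derived, and the function name is hypothetical.

```python
import math


def proportion_with_wald_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) in percent using a Wald interval."""
    p = successes / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return 100.0 * p, 100.0 * (p - half_width), 100.0 * (p + half_width)


# SERA_P at the TR5 cutoff: 30/40 malignant detected, 186/219 benign correctly excluded.
sensitivity = proportion_with_wald_ci(30, 40)
specificity = proportion_with_wald_ci(186, 219)
print("sensitivity: %.2f (%.2f-%.2f)" % sensitivity)
print("specificity: %.2f (%.2f-%.2f)" % specificity)
```

Under these assumptions the intervals come out as 61.58 to 88.42 for sensitivity and 80.19 to 89.67 for specificity, which agrees with the values listed for SERA_P in Table 2.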
In this study, the AUC and sensitivity of SERA_M were not significantly different from those of SERA_P. This suggests that pictures of original US images taken with a mobile phone can be used for AI-CAD in daily practice. Moreover, further technical support for phone cameras or external equipment may compensate for poor picture quality and increase the diagnostic ability of SERA_M.
Real-time US is the main diagnostic tool for thyroid nodules, but its diagnostic performance depends on the user’s skill.19,20) Meanwhile, AI-CAD is known to be more objective than human readers, and various studies have shown that its diagnostic performance is comparable to that of experienced physicians.14,21) However, previous research was all conducted with US images stored on computers, and thus the images analyzed were of adequate quality and resolution for AI-CAD diagnosis.22) For better accessibility and user convenience in some facilities, pictures taken with a mobile phone can be a good alternative because the process does not require clinicians to undertake additional steps to access the PACS database to run SERA.
Rapid developments in the resolution and performance of mobile phone cameras have encouraged attempts to use pictures taken by mobile phones for medical use.23) Studies have shown that the diagnostic performance of physicians interpreting X-ray or US images transmitted by mobile phones can be similar to that of physicians interpreting original images.24,25) However, due to the imperfections and aberration of mobile phone lenses, there is a limitation to collecting pictures free of diffraction without additional devices and equipment.26) Using a mobile phone to take pictures may generate noise, depending on environmental factors, such as the distance from the monitor and lighting.17,27) Thus, the process of taking pictures of the PACS monitor would inevitably generate blurring caused by light diffraction, resulting in obstacles that would lower the diagnostic abilities of SERA compared to when SERA is run on original US images.28) Meanwhile, previous studies have shown that AI-CAD using pictures of X-rays taken by mobile phones can show excellent performance when detecting cardiac devices.29) Likewise, there is a possibility that even with lower resolution or more artifacts, US images taken by mobile phones may be of sufficient quality to achieve respectable diagnostic performance when used with AI-CAD. Therefore, our study aimed to evaluate whether SERA using pictures of lower quality taken by mobile phones could achieve comparable diagnostic value to SERA using original US images downloaded from PACS. Compared to SERA_P and radiologists, SERA_M showed satisfactory AUC and sensitivity. However, SERA_M showed the lowest specificity, followed by SERA_P and radiologists. This indicates that SERA still needs fine-tuning to improve its diagnostic performance. Based on the suggested risk ranges of ACR TI-RADS, categorization using SERA_P results showed satisfactory malignancy rates. 
The malignancy rates of SERA_M were generally within the suggested risk ranges of ACR TI-RADS, except for TR4 (4.69%). This suggests that SERA, even with pictures taken by mobile phones, has decent categorization abilities similar to ACR TI-RADS. Therefore, pictures taken by a mobile phone can be used to categorize nodules.
There are several limitations to our study. First, our study was a retrospective study conducted with data collected from a referral center. Thus, selection bias was unavoidable, as reflected in the high malignancy rate (15.4%) of the test set of thyroid nodules. Selection bias may have also arisen because we only included nodules that underwent surgery or FNA, and we also excluded patients with indeterminate results lost to follow-up. Second, our study population was not large, with a total of 259 nodules after exclusion (40 malignant and 219 benign). Our findings need to be confirmed with a larger study population. Third, while radiologists prospectively obtained the individual US features from thyroid US, only one radiologist assigned final assessments according to ACR TI-RADS through retrospective review. Also, we did not consider the differences in how the US descriptors were defined compared to ACR TI-RADS in our data analysis. If we had analyzed final assessments during real-time examinations, the results might have been different. Fourth, our study was conducted using a single mobile phone. Different models or camera settings may affect the quality of images, resulting in deviations in diagnostic results. Fifth, this study was conducted in one tertiary hospital with high-quality US machines. When SERA is used in local hospitals with US images of lower quality, the performance of both SERA_P and SERA_M might be degraded. Further assessment or validation of our findings should be conducted at an external institution. Finally, we only included thyroid nodules that underwent US-FNA but not surgery and, consequently, did not include surgical histopathology in our analysis. Thus, there is the possibility of false-negative or false-positive results, although those rates are expected to be very low, as previous literature reports a false-negative rate of 3% and false-positive rates of around 3-4%.30)
In conclusion, SERA performed with pictures taken by a mobile phone showed a comparable diagnostic performance to SERA performed with images directly downloaded from PACS. Using the results of this study, we hope that further research on operating AI-CAD programs on mobile devices can make AI-CAD more convenient and accessible to users.
This study was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (2021R1A2C2007492). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
No potential conflict of interest relevant to this article was reported.