Original Article
Corresponding author: Melisa Öçbe (melisabozkurtt@windowslive.com). © 2024 Shila Kahalian, Marieh Rajabzadeh, Melisa Öçbe, Mahmut Sabri Medisoglu.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Kahalian S, Rajabzadeh M, Öçbe M, Medisoglu MS (2024) ChatGPT-4.0 in oral and maxillofacial radiology: prediction of anatomical and pathological conditions from radiographic images. Folia Medica 66(6): 863-868. https://doi.org/10.3897/folmed.66.e135584
Introduction: ChatGPT can generate human-like text and analyze and interpret medical images using natural language processing (NLP) algorithms. It can generate real-time prediagnoses, recognize patterns, and learn from previous cases to improve accuracy by combining patient history, symptoms, and image characteristics. It has recently been used for learning about maxillofacial diseases, writing and translating radiology reports, and identifying anatomical landmarks, among other applications.
Materials and methods: In this study, 52 radiographic images were queried on the OpenAI application ChatGPT-4.0. The responses were evaluated with and without using clues for specific radiographs to see if adding clues during prompting improved diagnostic accuracy.
Results: The true prediagnosis rate without any clue was 30.7%. Adding one clue significantly increased this rate to 56.9%. There was no significant difference in the accurate diagnosis of anatomical landmarks, cysts, and tumors (p>0.05). However, including internal structure information improved diagnostic accuracy (p<0.05).
Conclusion: ChatGPT-4.0 tended to misdiagnose closely located anatomical structures; its performance improved when additional clues were provided, but its ability to generate diverse differential diagnoses remains limited.
Keywords: anatomical landmarks, artificial intelligence, ChatGPT, oral radiology
ChatGPT (Chat Generative Pre-trained Transformer) has lately become the most popular artificial intelligence (AI) chatbot, developed by OpenAI (San Francisco, CA, USA).
ChatGPT has the ability to analyze and interpret medical images using natural language processing (NLP) algorithms, which is a major advantage in radiology.
Given the rapid adoption of AI in healthcare, exploring ChatGPT’s diagnostic potential in dentomaxillofacial radiology is essential.
The aim of this study was to evaluate the performance of ChatGPT-4.0, specifically utilizing its image-based conversation feature. The study focuses on assessing the tool’s ability to accurately identify anatomical landmarks, cysts, and tumors by highlighting these structures with arrows and providing relevant diagnostic clues.
In this study, a total of 52 radiographic images (panoramic radiographs, periapical radiographs, and cone-beam computed tomography sections) were obtained from the archive of the Department of Oral and Maxillofacial Radiology at the Health and Technology University in Kocaeli and analyzed using the OpenAI ChatGPT-4.0 application. Images depicting anatomical structures were selected randomly, while those used for cyst/tumor detection were chosen from cases with confirmed histopathological diagnoses. All images were collected from the archive between May and July 2024. The inclusion criteria required suboptimal image quality. The radiographs were classified into three categories: anatomical landmarks (n=18), odontogenic and non-odontogenic cysts (n=11), and tumors (n=23).
The process for choosing clues was systematic, aimed at emulating clinical context by providing ChatGPT with pertinent anatomical and structural information to improve its interpretative capability. The clues were selected based on key diagnostic criteria commonly used by radiologists and were categorized into four types:
1. Location: Each radiograph included a clue about the anatomical location (e.g., “mandible,” “maxilla,” “anterior,” or “posterior”) to contextualize the image and aid ChatGPT in narrowing down diagnostic possibilities based on spatial orientation.
2. Internal structure: Clues about the internal characteristics of the lesion or structure were provided, describing radiopacity or radiolucency, the presence of an impacted tooth, ground glass appearance, or textural qualities (heterogeneous or homogeneous) within the region of interest. These descriptions were intended to guide ChatGPT’s interpretation toward specific pathologies by highlighting relevant radiographic features.
3. Peripheral structure: Information on the borders of the lesion, such as “well-defined sclerotic,” “well-defined corticated,” or “ill-defined,” was added to indicate lesion boundary characteristics. This clue type helped distinguish between benign and malignant pathologies, as border definitions can be critical in radiological assessment.
4. Adjacent structures: Clues on proximity to critical anatomical landmarks (e.g., “close to mandibular canal,” “adjacent to maxillary sinus floor”) were included to provide context on nearby structures. This aimed to enhance ChatGPT’s understanding of the lesion’s spatial relationships, which are essential for accurate localization and diagnosis.
Each image underwent a standardized prompting sequence: initially presented without clues, followed by a second presentation incorporating the designated clues as outlined above.
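The two-pass prompting sequence described above can be sketched as follows. This is an illustrative reconstruction, not the study’s actual prompt wording: the image name, the question phrasing, and the clue strings are hypothetical examples built around the four clue categories.

```python
# Illustrative sketch of the two-pass prompting protocol: each radiograph is
# queried once without clues and once with the designated clues appended.
# All prompt wording and example values here are assumptions.

def build_prompts(image_name, clues):
    """Return the (no-clue, with-clue) prompt pair for one radiograph."""
    base = (f"What is the most likely prediagnosis for the structure "
            f"marked by the arrow in {image_name}?")
    clue_text = " ".join(f"{kind}: {value}." for kind, value in clues.items())
    return base, f"{base} Clues - {clue_text}"

# Hypothetical clue set covering the four categories used in the study.
clues = {
    "Location": "posterior mandible",
    "Internal structure": "unilocular radiolucency around an impacted tooth",
    "Peripheral structure": "well-defined corticated border",
    "Adjacent structures": "close to the mandibular canal",
}
no_clue, with_clue = build_prompts("panoramic_001.png", clues)
```

The second prompt simply extends the first, so any change in the response can be attributed to the added clue text rather than to a different question.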
After collecting all of ChatGPT-4.0’s responses, one oral and maxillofacial radiologist and one anatomy specialist independently evaluated each answer, determining its correctness against the expected diagnosis. Responses were scored on a scale from 0 to 2 (Fig. 1).
The prompt scoring rubric is depicted. Responses to the prompts were evaluated on a scale from 0 to 2; if multiple prediagnoses were suggested, 2 points were given when the correct diagnosis appeared first in the list.
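The rubric can be expressed as a small scoring function. The 2-point condition (correct diagnosis listed first) is stated in the caption; the 1-point condition for a correct diagnosis appearing later in the list is our assumed reading of the 0–2 scale, not something the text spells out.

```python
def score_response(suggested, correct):
    """Score one ChatGPT response against the expected diagnosis.

    2 - correct diagnosis is first in the list of prediagnoses
    1 - correct diagnosis appears, but not first (assumed reading of the rubric)
    0 - correct diagnosis absent
    """
    suggested = [s.strip().lower() for s in suggested]
    correct = correct.strip().lower()
    if not suggested:
        return 0
    if suggested[0] == correct:
        return 2
    if correct in suggested:
        return 1
    return 0
```

For example, a response listing "dentigerous cyst" first when the histopathological diagnosis is a dentigerous cyst would score 2, while one listing it second would score 1.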
Data were analyzed using IBM SPSS Statistics 22 (SPSS Inc., Chicago, IL, USA). Descriptive statistics were used to summarize the average prompt scores, and the chi-square test was used to compare qualitative data. Statistical significance was set at p<0.05.
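The with-clue versus without-clue comparison amounts to a chi-square test on a 2×2 table of correct versus incorrect responses. The sketch below shows the arithmetic in plain Python; the counts are illustrative (16/52 ≈ 30.7% correct without clues, and a hypothetical 30/52 with clues — the study does not report raw counts), and no Yates continuity correction is applied.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (df=1, no continuity correction) for the
    2x2 table [[a, b], [c, d]], plus its p-value from the chi2(1) tail."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        exp = row * col / n          # expected count under independence
        chi2 += (obs - exp) ** 2 / exp
    # For df=1: P(X^2 >= chi2) = 1 - erf(sqrt(chi2 / 2))
    p = 1.0 - math.erf(math.sqrt(chi2 / 2.0))
    return chi2, p

# Illustrative counts: (correct, incorrect) without clues, then with clues.
chi2, p = chi_square_2x2(16, 36, 30, 22)
```

With these hypothetical counts the test rejects independence at p<0.05, mirroring the direction of the reported result; SPSS would produce the same statistic for the same table.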
The accuracy of the prediagnoses was measured, and the results were analyzed statistically. The true prediagnosis rate without any clue was 30.7%; when at least one clue was included, the rate of true answers increased significantly to 56.9%. The average prompt score was 9.3 for anatomical structures, 4.1 for odontogenic and non-odontogenic cysts, and 7.6 for odontogenic and non-odontogenic tumors. No statistically significant difference was found between the prediagnosis of anatomical structures, cysts, and tumors (p>0.05).
A statistical analysis was performed to determine the significance of the different clue types. No statistically significant association was found for the location, peripheral structure, radiopacity/radiolucency, or adjacent structure clues (p>0.05). However, mentioning the internal structure of the anatomical or pathological condition was significantly associated with a true prediagnosis, indicating that this type of information is particularly effective in enhancing diagnostic accuracy (p<0.05).
The analysis of false predictions made by ChatGPT revealed that errors often involved confusing adjacent anatomical structures. Specifically, nutrient canals of mandibular molars were frequently misdiagnosed as mandibular molar roots, the anterior nasal spine was misdiagnosed as the incisive canal, and the greater palatine foramen was mistaken for the maxillary sinus. These misdiagnoses highlight the AI’s challenge in distinguishing between closely situated structures, suggesting a need for improved contextual understanding and differentiation capabilities.
Additionally, the study found that querying about lesions associated with impacted teeth did not consistently lead to the accurate prediagnosis of a dentigerous cyst. However, when the presence of an impacted tooth was explicitly mentioned in the prompt, ChatGPT correctly suggested a dentigerous cyst but did not identify other potential pathologies such as adenomatoid odontogenic tumor, odontogenic myxoma, or ameloblastoma. This indicates that while the AI can recognize specific pathologies when given explicit cues, it may overlook other relevant differential diagnoses in the context of impacted teeth. Ameloblastoma, tooth socket, external oblique ridge, median palatal suture, genial tubercle, and mandibular canal were detected by ChatGPT without any location, peripheral, internal, or adjacent structure information (Table 1).
Table 1. Structures or lesions queried without any location, internal structure, peripheral structure, or adjacent structure information, and the prediagnoses given by ChatGPT
| Anatomical structure or lesion | Pre-diagnosis |
| Nutrient canals (Mandible posterior) | Mandibular molar roots |
| Nutrient canals (Mandible anterior) | Mandibular fracture |
| Internal oblique ridge | Mandibular canal |
| Anterior nasal spine | Incisive canal |
| Greater palatine foramen | Maxillary sinus |
| Mandibular incisive foramen | Periapical lesion and Stafne bone cavity |
| Dentigerous cyst | Periapical lesion |
| Odontogenic keratocyst | Traumatic bone cyst and ameloblastoma |
| Buccal bifurcation cyst | Radicular cyst and dentigerous cyst |
| Radicular cyst (Maxilla anterior region) | Nasopalatine canal cyst |
| Odontoma | Dentigerous cyst |
| Odontogenic myxoma | Odontogenic keratocyst and osteoradionecrosis |
| Hypercementosis | Internal root resorption |
| Osteoma | Sialolithiasis and mandibular torus |
| Cementoblastoma | Periapical cemento-osseous dysplasia |
| Osteoid osteoma | Cemento-osseous dysplasia and ameloblastic fibro-odontoma |
| Osteoblastoma | Ameloblastic fibro-odontoma |
| Arteriovenous malformation | Stafne bone cavity |
| Idiopathic osteosclerosis | Condensing osteitis |
| Torus maxillaris | Internal root resorption and maxillary sinus floor |
| Calcifying epithelial odontogenic tumor | Dentigerous cyst and odontogenic keratocyst |
In this study, we evaluated the capability of ChatGPT-4.0 to diagnose dental radiographs using its image-based conversation feature, focusing on anatomical landmarks and lesions such as tumors and odontogenic and non-odontogenic cysts. Initially, without providing any clues, the prediagnosis rate was 30.7%. Subsequently, when we offered at least one clue, the rate increased to 56.9%. With additional hints, the average prompt scores increased.
In previous studies, Silva et al. evaluated ChatGPT-3.5’s ability to describe radiolucent lesions in panoramic radiographs and establish differential diagnoses. It correctly identified the diagnosis as the first hypothesis in 25% of cases, within the first two hypotheses in 57.14% of cases, and within the first three in 67.85% of cases, and 46% of responses included contraindications, indicating a reliance on general patterns rather than detailed evaluations.
Despite its improved performance with additional hints, the analysis of ChatGPT’s diagnostic performance revealed several common misdiagnoses, often involving the confusion of closely situated anatomical structures.
When comparing ChatGPT versions 3, 3.5, and 4.0 in predicting preliminary diagnoses based on radiological images, notable differences in performance emerge. ChatGPT-3, while capable of basic interpretation and clinical decision-making tasks, often lacked the nuance needed for accurate radiological predictions.
In the present study, ChatGPT-4.0’s diagnostic performance was significantly influenced by the inclusion of specific clues. The true prediagnosis rate increased from 30.7% without clues to 56.9% with at least one clue, emphasizing the importance of contextual information in improving AI accuracy. Specifically, clues related to internal structure were found to be statistically significant (p<0.05), underscoring their critical role in enhancing diagnostic precision. Albagieh et al. did not include diagnostic clues in their study design and found lower overall accuracy among large language models (LLMs), with a median accuracy score of 50%.
The current study highlighted ChatGPT-4.0’s tendency to misdiagnose adjacent anatomical structures, revealing challenges in differentiating closely situated elements. For example, nutrient canals were mistaken for mandibular molar roots, while the anterior nasal spine was confused with the incisive canal. This points to the model’s limitations in spatial recognition, suggesting a need for more advanced contextual understanding to improve AI accuracy in complex anatomical regions. Albagieh et al., on the other hand, reported moderate agreement among LLMs (kappa=0.622), which was higher than the weak agreement observed among senior residents (kappa=0.396). This suggests that while LLMs may offer more consistent responses across cases, they still lack the nuanced understanding that human clinicians can apply to diverse clinical scenarios.
The present study found that ChatGPT-4.0’s diagnostic accuracy varied across different case types, with an average prompt score of 9.3 for anatomical structures, 4.1 for odontogenic cysts, and 7.6 for tumors. This suggests that while the model performs relatively well in anatomical identification, it has limitations in diagnosing pathologies, particularly when dealing with complex cases like odontogenic cysts and tumors. Similarly, Albagieh et al. observed significant performance divergence in specific cases, such as the management of a large painful ulcer in a post-kidney transplant patient, where LLMs uniformly favored topical corticosteroids, contrasting with 70% of residents who chose intralesional injections (p=0.022). These findings further indicate that AI models, including ChatGPT, may adhere to general treatment patterns but often miss the clinical nuance needed for accurate management decisions.
This study has several limitations. First, it used 52 radiographic images from a single institution, limiting the generalizability of the results. The dataset may not represent the full range of anatomical and pathological conditions, especially rare cases. Image selection based on suboptimal quality could introduce bias, affecting AI performance with higher-quality or more diverse images. Diagnostic accuracy depended on the inclusion of specific prompts, which may affect reproducibility in other clinical scenarios. Additionally, the study was conducted under controlled conditions, lacking real-world clinical validation, where factors such as patient history and clinician judgment influence diagnostics. Finally, ChatGPT-4.0’s ability to generate a comprehensive list of differential diagnoses, especially for complex cases, was not fully assessed.
The findings of this study indicate that ChatGPT-4.0, in its current iteration, lacks sufficient accuracy in detecting oral and maxillofacial pathologies. Although incorporating specific clues into the prompt enhanced ChatGPT-4.0’s ability to generate a more comprehensive and accurate preliminary diagnosis list, the results revealed that its diagnostic capabilities remain limited.
Consequently, further development and refinement are necessary to improve ChatGPT-4.0’s diagnostic accuracy and usability before it can be considered a viable option for patient use. Future research should also explore integrating more advanced algorithms and expanding the dataset to encompass a broader range of pathologies, which could ultimately help ChatGPT-4.0 evolve into a more reliable diagnostic aid in the field of oral and maxillofacial radiology.