The Potentials and Limitations of ChatGPT in X-ray Interpretation

Collective Journal of Robotics and AI

https://doi.org/10.70107/collectjroboticsandai-art0059

Affiliations
1 Newcastle Medicine (NUMed) University Malaysia
2 International Medical School, MSU, Malaysia

*Corresponding Author: Siamak Sarrafan, International Medical School, MSU, Malaysia.

Citation: Sarrafan S, Iftikhar Hanif M, Azmi I. The Potentials and Limitations of ChatGPT in X-ray Interpretation. Collect J Robotics and AI. Vol 2 (1) 2025; ART0059.

Abstract

Since the development of Artificial Intelligence (AI), there has been a continuous effort to apply it in medicine, and medical imaging was one of the first areas where it was put into practice. ChatGPT is a chatbot built on a large language model; it is accessible and affordable. This study investigates the capabilities and limitations of ChatGPT in diagnosing long bone fractures. A total of 200 X-ray images of upper and lower limb fractures were evaluated separately by two independent orthopaedic surgeons and by ChatGPT. While ChatGPT’s image analysis showed some promise, our study revealed multiple critical concerns regarding its diagnostic accuracy. Providing patient history and clinical signs substantially improved the accuracy of ChatGPT’s diagnoses. Our results do not support the use of current ChatGPT models for interpreting or diagnosing upper and lower limb fractures from radiologic images. However, with the rapid advancement of large language models and anticipated improvements in ChatGPT, accurate fracture diagnosis using this model may become achievable.

Keywords: Artificial Intelligence, ChatGPT, X-ray, Fracture, AI in Healthcare, Orthopaedic.

Introduction

Background and Rationale

X-rays have been extensively used since their discovery in 1895. Their affordability and accessibility make them the modality of choice for diagnosing most orthopaedic problems, particularly fractures [33,41,29]. However, X-rays have limitations in diagnosing fractures. The possibility of a misdiagnosis or missing a fracture has always been a challenge and can compromise patient management. Moreover, remote healthcare centres may lack access to X-ray imaging or a professional to interpret the images [28,2,34]. Therefore, an accessible and affordable program that assists in diagnosing fractures or provides a second opinion would be highly beneficial. Recent advances in large language models like ChatGPT, particularly its ability to process and interpret images, seem promising. This technology could potentially be used for analysing medical images such as X-rays, and there has been significant interest in its application [25,21,43,26].

The Emergence of Artificial Intelligence in Medical Imaging

In the 1960s, the Dendral project produced a program capable of solving complex problems. Although initially designed for organic chemistry, it later became the foundation for MYCIN, a program developed at Stanford University for identifying bacteria and suggesting suitable antibiotics [6,36]. Today, it is difficult to find an innovation that does not incorporate AI in some capacity. Virtual reality and self-driving cars are just a few examples of AI’s integration into daily life, and its role is expected to expand further. However, when considering AI’s potential applications, it is essential to maintain an evidence-based assessment of its realistic capabilities and to avoid both overestimation and underestimation [38,42]. Medical science is no exception. Recently, AI has been integrated into various aspects of medical technology, with promising results in enhancing diagnostic investigations and treatment planning [4,18,19,25,47].

Studies of ChatGPT’s performance in interpreting medical images, however, have yielded mixed results. Some have shown that ChatGPT performs well in analysing X-ray images but struggles with interpreting ultrasound results. Additionally, while it may provide seemingly accurate analyses of X-rays, its explanations may lack the depth needed for scientific validation and widespread clinical adoption [29,40]. Evaluations of GPT-4V on medical licensing exams such as the USMLE, including image-based questions, and on radiology board exams have shown promising results [25]. There is overwhelming agreement that newer versions of GPT with image-processing capabilities outperform previous ones [29]. However, some studies highlight inconsistencies in producing accurate and reliable results [24]. These mixed findings emphasize the need for further research to determine AI’s actual capability before its widespread implementation in clinical settings.

Study Design

In our study, 200 X-rays of upper and lower extremity fractures, equally divided between upper and lower limbs, were analysed using ChatGPT-4. The X-rays were converted to PDF at a minimum resolution of 300 DPI to preserve image clarity and contrast. Importantly, no patient history, symptoms, or details about the site of injury were provided to ChatGPT-4. All X-ray images were pre-evaluated by two orthopaedic surgeons, both of whom could clearly identify the fractures. Our study revealed significant limitations in ChatGPT-4’s ability to interpret extremity X-rays, which are discussed below.

Common Problems Encountered

Incorrect site identification: In 35% of cases, ChatGPT-4 misidentified the anatomical region of the X-ray. For example, a foot X-ray was sometimes misclassified as a hand X-ray.
Failure to detect fractures: In 40% of cases, ChatGPT-4 failed to detect fractures, even when an obvious fracture line was visible.
Incorrect diagnoses: Alarmingly, in 25% of cases, ChatGPT-4 provided incorrect diagnoses. For instance, it reported hip dislocation when the hip joint was not dislocated or identified non-existent fractures at unrelated sites.
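
To put the reported percentages in terms of image counts, the arithmetic can be sketched as below. This is an illustration only: it assumes each percentage is taken over all 200 images, which the text does not state explicitly, and the names `TOTAL_IMAGES`, `REPORTED_PERCENT`, and `implied_counts` are hypothetical.

```python
# Assumption: every percentage uses the full set of 200 images as its
# denominator. The article does not state the denominator explicitly.
TOTAL_IMAGES = 200

# Percentages as reported in the list above.
REPORTED_PERCENT = {
    "incorrect site identification": 35,
    "failure to detect fracture": 40,
    "incorrect diagnosis": 25,
}


def implied_counts(percentages, total):
    """Convert whole-number percentages into case counts using integer math."""
    return {label: total * pct // 100 for label, pct in percentages.items()}


counts = implied_counts(REPORTED_PERCENT, TOTAL_IMAGES)
for label, n in counts.items():
    print(f"{label}: {n} of {TOTAL_IMAGES}")
```

Under that assumption the three categories correspond to 70, 80, and 50 images. The percentages sum to 100, but the text does not say whether the categories are mutually exclusive, so these counts should not be read as a partition of the data set.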

Effect of Hints and Additional Information

We investigated whether providing hints or supplementary information could improve accuracy. In 20 cases where fractures were initially undiagnosed, an arrow marking the fracture site improved the diagnosis in only 30% of cases (6 of 20). In the remaining 70% (14 of 20), ChatGPT-4 continued to misinterpret the X-ray findings despite the visual cue.
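
Because only 20 cases received the arrow hint, the 30% figure carries wide statistical uncertainty. The sketch below is not part of the study's analysis; it applies a standard Wilson score interval to the reported figures as an illustration, and the names `wilson_interval`, `HINTED_CASES`, and `IMPROVED_PERCENT` are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Figures reported in the text: 20 hinted cases, 30% of which improved.
HINTED_CASES = 20
IMPROVED_PERCENT = 30
improved = HINTED_CASES * IMPROVED_PERCENT // 100  # 6 cases

low, high = wilson_interval(improved, HINTED_CASES)
print(f"{improved}/{HINTED_CASES} improved; 95% interval: {low:.2f} to {high:.2f}")
```

With 6 of 20 cases, the interval spans roughly 0.15 to 0.52, so a sample of this size cannot pin down the benefit of a visual cue with much precision.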

Effect of Providing Clinical History and Symptoms

When a comprehensive patient history, including the mechanism of injury and signs such as the precise point of tenderness, was provided, ChatGPT-4’s diagnostic accuracy improved to 74%.

Ethical and Legal Considerations

Our study found that ChatGPT-4 often refused to analyse X-ray images, citing concerns about patient confidentiality. We had to specify that our request was for educational purposes before ChatGPT-4 would describe the given X-rays. Similar challenges were observed with other AI applications handling medical images, such as the DALL-E 3 image generator. These findings align with previous studies [39,7,10]. ChatGPT’s accessibility raises legal and ethical concerns, particularly in medical applications where patient health is directly affected [14,7,5,12].

We believe that the use of ChatGPT and other Natural Language Processing (NLP) models in radiology presents significant legal, ethical, and regulatory challenges. Clear privacy standards must therefore be established before these technologies are widely used; such standards are crucial to preventing unauthorized access to sensitive medical data. Additionally, patients should be informed about the role of NLP models in their diagnostic process, and healthcare providers are responsible for transparency in AI-assisted radiological analysis. Regulatory agencies, such as the U.S. Food and Drug Administration (FDA), are working to address AI-related challenges in medical imaging. However, there are currently no specific guidelines for using ChatGPT for educational purposes or supplementary diagnostics [22,32,45].

Conclusion

The hype surrounding AI advancements, such as ChatGPT passing medical board exams, may create an inflated perception of its capabilities, so it is crucial to recognize the current limitations [45,49,14]. Our study demonstrates that while ChatGPT-4’s image-upload feature is a significant advancement, it is not yet reliable for diagnosing fractures independently. Future iterations, such as ChatGPT-5, may improve diagnostic accuracy [28,24,39]. For now, ChatGPT-4 is not suitable as a standalone X-ray image interpreter, although it can assist in drafting reports once a diagnosis has already been made [32,5]. If ChatGPT is specifically trained for X-ray interpretation in the future, it may become a valuable tool in medical imaging. Until then, it is not a substitute for dedicated AI diagnostic tools such as Computer-Aided Diagnosis (CADx) or Computer-Aided Detection (CADe) software [4,14,23,46].

Future Directions

Despite its current shortcomings, particularly in accurately identifying fractures, ChatGPT-4 holds potential for future improvement. Its limitations highlight the need for specialized training and exposure to diverse, high-quality clinical datasets. Future iterations, such as GPT-5, may integrate real-time clinical data and enhanced image-processing capabilities, significantly improving AI’s accuracy and utility in medical diagnostics [14,24]. As AI technology advances, it has the potential to become an essential tool in radiology and other medical fields, complementing human expertise and improving patient outcomes [32,5]. While AI is not yet a substitute for expert physicians, continuous testing and refinement could establish it as a valuable supplementary resource in the near future [7,30].
