References
- 1. Acosta, J. N., Falcone, G. J., Rajpurkar, P., & Topol, E. J. (2022). Multimodal biomedical AI. Nature Medicine, 28(9), 1773-1784.
- 2. Aaland, M. O., & Smith, S. A. (1996). Some study on X-ray diagnostics. Journal of Diagnostic Imaging, 10(2), 101-105.
- 3. Born, C. T., et al. (1989). Orthopedic study. Journal of Orthopedic Research, 7(4), 543-550.
- 4. Brin, D., Sorin, V., Barash, Y., Konen, E., Nadkarni, G., Glicksberg, B. S., & Klang, E. (2023). Assessing GPT-4 Multimodal Performance in Radiological Image Analysis. medRxiv.
- 5. Dave, T., Athaluri, S. A., & Singh, S. (2023). ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Frontiers in Artificial Intelligence, 6.
- 6. Feigenbaum, E. A., Buchanan, B. G., & Lederberg, J. (1970). On generality and problem-solving: A case study using the DENDRAL program. ntrs.nasa.gov.
- 7. Fournier-Tombs, E., & McHardy, J. (2023). A Medical Ethics Framework for Conversational Artificial Intelligence. J Med Internet Res, 25.
- 8. Goodman, L. R. (1995). Basic Radiology. W.B. Saunders.
- 9. Iftikhar, M., Sarrafan, S., & Ramamurti, V. (2023). Study on the application of AI in orthopedic surgery. Ecronicon, 14(6), 101-110.
- 10. Johnson, D., Goodman, R., Patrinely, J., Stone, C., Zimmerman, E., Donald, R., et al. (2023). Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. Research Square.
- 11. Lederberg, J. (1990). How DENDRAL was conceived and born. CiteSeer X (The Pennsylvania State University), 14-44.
- 12. Mesko, B., & Topol, E. J. (2023). The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digital Medicine, 6(1), 120.
- 13. Myers, T. G., Ramkumar, P. N., Ricciardi, B. F., Urish, K. L., Kipper, J., & Ketonis, C. (2020). Artificial Intelligence and Orthopaedics: An Introduction for Clinicians. The Journal of Bone and Joint Surgery American Volume.
- 14. Naik, N., Hameed, B. M. Z., Shetty, D. K., et al. (2022). Legal and Ethical Consideration in Artificial Intelligence in Healthcare: Who Takes Responsibility? Frontiers in Surgery, 9(862322), 1-6.
- 15. OpenAI (2023). GPT-4V(ision) system card. OpenAI.
- 16. Panchbhai, A. (2015). Principles of Diagnostic Radiology. McGraw-Hill.
- 17. Sarrafan, S., Beheshti, B., Iftikhar, M., & Ramamurti, V. (2023). Applications of Intelligent Implants for Infection Control in Orthopaedics: an Innovative Approach. Ecronicon, 14(6).
- 18. Senkaiahliyan, S., Toma, A., Ma, J., et al. (2023). GPT-4V(ision) Unsuitable for Clinical Care and Education: A Clinician-Evaluated Assessment. medRxiv.
- 19. Singhal, K., et al. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172-180.
- 20. Swain, G., Burns, K. A., & Etkind, P. (2008). Preparedness: medical ethics versus public health ethics. J Public Health Manag Pract, 14(4), 354-357.
- 21. Thawkar, O., Shaker, A., Mullappilly, S., et al. (2023). XrayGPT: Chest Radiographs Summarization using Large Medical Vision-Language Models. arXiv.
- 22. Thirunavukarasu, D. S. J., et al. (2023). Large language models in medicine. Nature Medicine, 29(8), 1930–1940.
- 23. Toma, A., Lawler, P. R., Ba, J., et al. (2023). Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031.
- 24. Wang, C., Liu, S., Yang, H., et al. (2023). Ethical Considerations of Using ChatGPT in Health Care. J Med Internet Res, 25.
- 25. Yang, Z., Li, L., Lin, K., et al. (2023). The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). Published online October 11, 2023. Accessed November 11, 2023.
- 26. Acosta, J. N., Falcone, G. J., Rajpurkar, P., & Topol, E. J. (2022). Multimodal biomedical AI. Nature Medicine, 28(9), 1773-1784.
- 27. Aaland, M. O., & Smith, S. A. (1996). Some study on X-ray diagnostics. Journal of Diagnostic Imaging, 10(2), 101-105.
- 28. Born, C. T., & Zahar, D. R. (1989). Orthopedic study on X-ray diagnostics. Journal of Orthopedic Research, 7(4), 543-550.
- 29. Brin, D., Sorin, V., Barash, Y., Konen, E., Nadkarni, G., Glicksberg, B. S., & Klang, E. (2023). Assessing GPT-4 Multimodal Performance in Radiological Image Analysis. medRxiv.
- 30. Dave, T., Athaluri, S. A., & Singh, S. (2023). ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Frontiers in Artificial Intelligence, 6.
- 31. Feigenbaum, E. A., Buchanan, B. G., & Lederberg, J. (1970). On generality and problem-solving: A case study using the DENDRAL program. ntrs.nasa.gov.
- 32. Fournier-Tombs, E., & McHardy, J. (2023). A Medical Ethics Framework for Conversational Artificial Intelligence. J Med Internet Res, 25.
- 33. Goodman, L. R. (1995). Basic Radiology. W.B. Saunders.
- 34. Iftikhar, M., Sarrafan, S., & Ramamurti, V. (2023). Study on the application of AI in orthopedic surgery. Ecronicon, 14(6), 101-110.
- 35. Johnson, D., Goodman, R., Patrinely, J., Stone, C., Zimmerman, E., Donald, R., et al. (2023). Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. Research Square.
- 36. Lederberg, J. (1990). How DENDRAL was conceived and born. CiteSeer X (The Pennsylvania State University), 14-44.
- 37. Mesko, B., & Topol, E. J. (2023). The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digital Medicine, 6(1), 120.
- 38. Myers, T. G., Ramkumar, P. N., Ricciardi, B. F., Urish, K. L., Kipper, J., & Ketonis, C. (2020). Artificial Intelligence and Orthopaedics: An Introduction for Clinicians. The Journal of Bone and Joint Surgery American Volume.
- 39. Naik, N., Hameed, B. M. Z., Shetty, D. K., Swain, D., Shah, M., Paul, R., et al. (2022). Legal and Ethical Consideration in Artificial Intelligence in Healthcare: Who Takes Responsibility? Frontiers in Surgery, 9(862322), 1-6.
- 40. OpenAI (2023). GPT-4V(ision) system card. OpenAI.
- 41. Panchbhai, A. (2015). Principles of Diagnostic Radiology. McGraw-Hill.
- 42. Sarrafan, S., Beheshti, B., Iftikhar, M., & Ramamurti, V. (2023). Applications of Intelligent Implants for Infection Control in Orthopaedics: an Innovative Approach. Ecronicon, 14(6).
- 43. Senkaiahliyan, S., Toma, A., Ma, J., Chan, A. W., Ha, A., An, K. R., Suresh, H., Rubin, B., & Wang, B. (2023). GPT-4V(ision) Unsuitable for Clinical Care and Education: A Clinician-Evaluated Assessment. medRxiv.
- 44. Singhal, K., et al. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172-180.
- 45. Swain, G., Burns, K. A., & Etkind, P. (2008). Preparedness: medical ethics versus public health ethics. J Public Health Manag Pract, 14(4), 354-357.
- 46. Thawkar, O., Shaker, A., Mullappilly, S., Cholakkal, H., Anwer, R. M., Khan, S., Laaksonen, J., & Khan, F. S. (2023). XrayGPT: Chest Radiographs Summarization using Large Medical Vision-Language Models. arXiv.
- 47. Thirunavukarasu, D. S. J., et al. (2023). Large language models in medicine. Nature Medicine, 29(8), 1930–1940.
- 48. Toma, A., Lawler, P. R., Ba, J., Krishnan, R. G., Rubin, B. B., & Wang, B. (2023). Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031.
- 49. Wang, C., Liu, S., Yang, H., Guo, J., Wu, Y., & Liu, J. (2023). Ethical Considerations of Using ChatGPT in Health Care. J Med Internet Res, 25.
The Potentials and Limitations of ChatGPT in X-ray Interpretation
Affiliations
1 Newcastle Medicine (NUMed) University, Malaysia
2 International Medical School, MSU, Malaysia
*Corresponding Author: Siamak Sarrafan, International Medical School, MSU, Malaysia.
Citation: Sarrafan S, Iftikhar Hanif M, Azmi I. The Potentials and Limitations of ChatGPT in X-ray Interpretation. Collect J Robotics and AI. Vol 2 (1) 2025; ART0059.
Abstract
Since the development of Artificial Intelligence, there has been a continuous effort to utilize it in the medical field. Medical imaging was one of the first areas where Artificial Intelligence was put into practice. ChatGPT is an accessible and affordable chatbot built on a large language model. This study investigates the capabilities and limitations of ChatGPT in diagnosing long bone fractures. A total of 200 X-ray images of upper and lower limb fractures were evaluated separately by two independent orthopaedic surgeons and by ChatGPT. While ChatGPT’s image analysis showed some promise, our study revealed multiple critical concerns regarding its diagnostic accuracy. We noted that integrating patient history and clinical signs significantly enhanced the accuracy of ChatGPT’s diagnoses. Our results do not support the use of current ChatGPT models for interpreting or diagnosing upper and lower limb fractures based on radiologic images. However, with the rapid advancement of large language models and anticipated future improvements in ChatGPT, accurate fracture diagnosis using this model may become achievable.
Keywords: Artificial Intelligence, ChatGPT, X-ray, Fracture, AI in Healthcare, Orthopaedic.
Introduction
Background and Rationale
X-rays have been extensively used since their discovery in 1895. Their affordability and accessibility make them the modality of choice for diagnosing most orthopaedic problems, particularly fractures [33,41,29]. However, X-rays have limitations in diagnosing fractures. The possibility of a misdiagnosis or missing a fracture has always been a challenge and can compromise patient management. Moreover, remote healthcare centres may lack access to X-ray imaging or a professional to interpret the images [28,2,34]. Therefore, an accessible and affordable program that assists in diagnosing fractures or provides a second opinion would be highly beneficial. Recent advances in large language models like ChatGPT, particularly its ability to process and interpret images, seem promising. This technology could potentially be used for analysing medical images such as X-rays, and there has been significant interest in its application [25,21,43,26].
The Emergence of Artificial Intelligence in Medical Imaging
In the 1960s, DENDRAL, a program capable of solving complex problems, was developed. Although initially designed for organic chemistry, it later became the foundation for MYCIN, a program developed at Stanford University for identifying bacteria and suggesting suitable antibiotics [6,36]. Today, it is difficult to find an innovation that does not incorporate AI in some capacity. Virtual reality and self-driving cars are just two examples of AI’s integration into daily life, and its role is expected to expand further. However, when considering AI’s potential applications, it is essential to maintain an evidence-based assessment of its realistic capabilities, as both overestimation and underestimation should be avoided [38,42]. The field of medical science is no exception. Recently, AI has been integrated into various aspects of medical technology, demonstrating promising results in enhancing diagnostic investigations and treatment planning [4,18,19,25,47]. However, studies of ChatGPT’s performance in interpreting medical images have yielded mixed results. Some studies have shown that ChatGPT performs well in analysing X-ray images but struggles with interpreting ultrasound results. Additionally, while it may provide seemingly accurate analyses of X-rays, its explanations may lack the depth needed for scientific validation and widespread clinical adoption [29,40]. Moreover, evaluations of GPT-4V on medical licensing exams such as the USMLE, including image-based questions, and on radiology board exams have shown promising results [25]. There is overwhelming agreement that newer versions of GPT with image-processing capabilities outperform previous ones [29]. However, some studies highlight inconsistencies in producing accurate and reliable results [24]. These mixed findings emphasize the need for further research to determine AI’s actual capability before its widespread implementation in clinical settings.
Study Design
In our study, 200 X-rays of upper and lower extremity fractures—equally divided between upper and lower limbs—were analysed using ChatGPT-4. These X-rays were formatted into PDFs to ensure optimal image clarity and contrast, with a minimum resolution of 300 DPI. Importantly, no patient history, symptoms, or details about the site of injury were provided to ChatGPT-4. All X-ray images were pre-evaluated by two orthopaedic surgeons, both of whom could clearly identify the fractures. Our study revealed significant limitations in ChatGPT-4’s ability to interpret extremity X-rays, which are discussed below.
Common Problems Encountered
- Incorrect site identification: In 35% of cases, ChatGPT-4 misidentified the anatomical region of the X-ray. For example, a foot X-ray was sometimes misclassified as a hand X-ray.
- Failure to detect fractures: In 40% of cases, ChatGPT-4 failed to detect fractures, even when an obvious fracture line was visible.
- Incorrect diagnoses: Alarmingly, in 25% of cases, ChatGPT-4 provided incorrect diagnoses. For instance, it reported hip dislocation when the hip joint was not dislocated or identified non-existent fractures at unrelated sites.
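To make the scale of these error rates concrete, the short sketch below converts the reported percentages into image counts. It assumes that each percentage is taken over all 200 images in the study; the text does not state whether the three categories overlap, so the counts are shown per category only. The function and variable names are illustrative and do not come from the study.

```python
# Assumption: each reported percentage refers to all 200 images in the study.
TOTAL_IMAGES = 200

# Percentages as reported in the list above.
reported_percentages = {
    "incorrect site identification": 35,
    "failure to detect fracture": 40,
    "incorrect diagnosis": 25,
}


def percent_to_count(percent: int, total: int) -> int:
    """Convert a whole-number percentage of `total` into a count of images."""
    return round(total * percent / 100)


counts = {
    label: percent_to_count(percent, TOTAL_IMAGES)
    for label, percent in reported_percentages.items()
}

for label, n in counts.items():
    print(f"{label}: {n} of {TOTAL_IMAGES} images")
```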
Effect of Hints and Additional Information
We investigated whether providing hints or supplementary information could improve accuracy. In 20 cases where fractures were initially undiagnosed, adding an arrow marking the fracture site improved the diagnosis in only 30% of cases (6 of 20). In the remaining 70%, ChatGPT-4 continued to misinterpret the X-ray findings despite the visual cue.
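Because this sub-experiment involved only 20 images, the 30% figure carries considerable statistical uncertainty. The sketch below illustrates this with a Wilson score interval for a binomial proportion. This is a standard method that we introduce here for illustration; it is not an analysis reported in the study. It assumes that 30% of 20 cases corresponds to 6 improved diagnoses and uses z = 1.96 for an approximate 95% interval.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the Wilson score interval (lower, upper) for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 30% of the 20 arrow-marked cases corresponds to 6 improved diagnoses.
lower, upper = wilson_interval(successes=6, n=20)
print(f"Approximate 95% interval: {lower:.1%} to {upper:.1%}")
```

Under these assumptions the interval spans roughly 15% to 52%, so the benefit of the visual cue cannot be estimated precisely from 20 cases.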
Effect of Providing Clinical History and Symptoms
When a comprehensive patient history, including the mechanism of injury and signs such as the precise point of tenderness, was provided, ChatGPT-4’s diagnostic accuracy improved to 74%.
Ethical and Legal Considerations
Our study found that ChatGPT-4 often refused to analyse X-ray images due to concerns about patient confidentiality. We had to specify that our request was for educational purposes before ChatGPT-4 would provide descriptions of the given X-rays. Similar challenges were observed with other AI applications handling medical images, such as the DALL-E 3 image generator. These findings align with previous studies [39,7,10]. ChatGPT’s accessibility raises legal and ethical concerns, particularly in medical applications where patient health is directly impacted [14,7,5,12]. We believe that the use of ChatGPT and other Natural Language Processing (NLP) models in radiology presents significant legal, ethical, and regulatory challenges. Therefore, clear privacy standards must be established before these technologies are widely used. Such standards are crucial to preventing unauthorized access to sensitive medical data. Additionally, patients should be informed about the role of NLP models in their diagnostic process. Transparency in AI-assisted radiological analysis is the responsibility of healthcare providers. Regulatory agencies, such as the U.S. Food and Drug Administration (FDA), are working to address AI-related challenges in medical imaging. However, there are currently no specific guidelines for using ChatGPT for educational purposes or supplementary diagnostics [22,32,45].
Conclusion
The hype surrounding AI advancements, such as ChatGPT passing medical board exams, may create an inflated perception of its capabilities, so it is crucial to recognize the current limitations [45,49,14]. Our study demonstrates that while ChatGPT-4’s image-upload feature is a significant advancement, it is not yet reliable for diagnosing fractures independently. Future iterations, such as ChatGPT-5, may improve diagnostic accuracy [28,24,39]. For now, ChatGPT-4 is not suitable as a standalone X-ray image interpreter, although it can assist in drafting reports once a diagnosis has already been made [32,5]. If ChatGPT is specifically trained for X-ray interpretation in the future, it may become a valuable tool in medical imaging. Until then, it is not a substitute for dedicated AI diagnostic tools such as Computer-Aided Diagnosis (CADx) or Computer-Aided Detection (CADe) software [4,14,23,46].
Future Directions
Despite its current shortcomings, particularly in accurately identifying fractures, ChatGPT-4 holds potential for future improvement. Its limitations highlight the need for specialized training and exposure to diverse, high-quality clinical datasets. Future iterations, such as GPT-5, may integrate real-time clinical data and enhanced image-processing capabilities, significantly improving AI’s accuracy and utility in medical diagnostics [14,24]. As AI technology advances, it has the potential to become an essential tool in radiology and other medical fields, complementing human expertise and improving patient outcomes [32,5]. While AI is not yet a substitute for expert physicians, continuous testing and refinement could establish it as a valuable supplementary resource in the near future [7,30].