Study finds the accuracy and reliability of chatbot responses encouraging, though errors still occur.

Researchers Douglas Johnson, M.D., M.S.C.I., and Lee Wheless, M.D., Ph.D., led colleagues at Vanderbilt University Medical Center in putting ChatGPT to the test of responding to physician questions. Their findings may support the use of a medical chatbot in the future.

The findings, reported in JAMA Network Open, were somewhat better than expected, though they came with some limitations.

“We were fairly impressed with ChatGPT and the other chatbots that were available,” Johnson said.

While chatbots have performed well on the multiple-choice questions of board exams, data have been limited on how well the technology handles more open-ended questions, for example: What treatment is optimal for a certain subgroup of patients?

Testing the Chatbot

To explore ChatGPT's capacity to handle such open-ended questions posed by physicians, Johnson, Wheless, and their colleagues recruited a team of 33 physicians from 17 specialties to generate questions relevant to their fields.

The physicians were asked to submit questions at various levels of difficulty, some more and some less likely to be answerable with a yes or no. The queries also needed to fall within the model's trained knowledge base.

Chatbot responses to questions were rated for outcomes such as accuracy, completeness, and reproducibility.
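For illustration only, here is a minimal sketch, in Python, of how such per-question ratings might be recorded. The field names, the structure itself, and any scale beyond the 1-to-6 accuracy scale described below are assumptions; the study's actual data schema and grading rubrics are not detailed in this article.

    from dataclasses import dataclass

    # Hypothetical per-question record; everything beyond the 1-6 accuracy
    # scale described in the article is assumed for illustration.
    @dataclass
    class RatedResponse:
        specialty: str        # e.g., "dermatology"
        difficulty: str       # e.g., "easy", "medium", "hard"
        binary: bool          # True if the question expects a yes/no answer
        accuracy: float       # 1-6, with 6 = completely correct
        completeness: float   # rated on a separate scale (assumed)
        reproducible: bool    # whether a repeat query gave a consistent answer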

How the Chatbot Fared

The researchers said they were surprised by the overall accuracy of ChatGPT. On a scale of 1 to 6, with 6 being completely correct, the chatbot's overall median accuracy score was 5.5 (interquartile range, 4.0-6.0).
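As a minimal sketch of how such a summary is computed, the Python snippet below takes hypothetical accuracy ratings on the 1-to-6 scale and derives the median and interquartile range. The scores are illustrative values chosen so the summary matches the reported figures; they are not the study's data.

    from statistics import median, quantiles

    # Illustrative accuracy ratings (1-6 scale, 6 = completely correct).
    scores = [2, 3, 3.5, 4, 4.5, 5, 5.5, 5.5, 6, 6, 6, 6, 6]

    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    print(f"median accuracy: {median(scores)}")   # 5.5
    print(f"interquartile range: {q1}-{q3}")      # 4.0-6.0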

“That’s very reassuring,” Wheless said.

However, he explained: “When it was wrong, it was very wrong.”

Furthermore, the model provided wrong answers with the same level of confidence as correct ones.

“There is no way of telling it was a wrong answer without having the expert knowledge,” Wheless said.

Surprisingly to the researchers, the chatbot performed at about the same level of accuracy whether producing binary answers or more descriptive responses. It also performed about the same in accuracy and completeness across questions of varied difficulty.

While the study was initially conducted using ChatGPT version 3.5, the team upgraded to the newer version when ChatGPT version 4 was released. With that, the bot’s performance seemed to improve.

On some specific inquiries, the tool also improved its accuracy over a period of days, hinting at its capacity to learn as it operates.

“I think the general idea is, repeated questioning is telling the model that there’s something wrong with the initial answer, and so it needs to go back and refine that,” Wheless suggested.
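As a rough illustration of the repeated-questioning pattern Wheless describes, here is a minimal Python sketch that keeps the earlier exchange in the conversation and asks the model to revisit its answer. The ask_chatbot callable is a hypothetical stand-in for whatever chat interface is in use; nothing here reflects the study's actual protocol.

    from typing import Callable

    # `ask_chatbot` is a hypothetical function that takes a conversation
    # (a list of {"role": ..., "content": ...} messages) and returns a reply.
    def refine_answer(question: str,
                      ask_chatbot: Callable[[list[dict]], str],
                      rounds: int = 3) -> str:
        history = [{"role": "user", "content": question}]
        answer = ask_chatbot(history)
        for _ in range(rounds - 1):
            # Re-asking signals that the previous answer may need revisiting.
            history.append({"role": "assistant", "content": answer})
            history.append({"role": "user",
                            "content": "Please double-check and refine your previous answer."})
            answer = ask_chatbot(history)
        return answer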

Implementation Now and in the Future

Johnson said ChatGPT has come a long way, even since the time of their study. In the future, a medical chatbot could be useful in helping patients make decisions about whether to go see a doctor and potentially serve as a support tool for physicians.

“It can be a valuable resource for physicians, but it should not be used as a standalone resource,” Johnson said. “The information should be checked and validated.”

Wheless also pointed out other limitations of the healthcare chatbot's abilities. For example, even though the medical chatbot may be able to extract a conclusion from a study, it could lack the ability to ascertain the study's level of bias.

Johnson said his team has recently been studying how a chatbot handles questions involving immunotherapy toxicities, such as how well the tool adheres to guideline-based information on the topic.

He also pointed to other efforts at Vanderbilt geared toward automating some areas of physician workload, such as ways large language models or healthcare chatbots can assist in patient care.

“Hopefully this will allow us to spend more time with patients and less time doing paperwork.”

About the Experts

Douglas Johnson, M.D.

Douglas Johnson, M.D., M.S.C.I., is associate professor in the Division of Hematology and Oncology at Vanderbilt University Medical Center and leads the Melanoma Clinical Research Program at Vanderbilt-Ingram Cancer Center. His research explores ways to profile cancers to predict which patients will benefit from immune therapies, as well as the effectiveness and toxicities of immune therapies in high-risk patients.

Lee Wheless, M.D.

Lee Wheless, M.D., Ph.D., is an assistant professor of dermatology at Vanderbilt University School of Medicine. His research focuses on understanding skin cancer risk, including through deep learning analyses of pathology images to identify high-risk lesions.