Artificial intelligence-driven chatbots are increasingly fielding queries from patients, physicians, medical researchers and insurance companies, but little regulation exists for their use in the healthcare setting.
“If a healthcare system wants to put a chatbot on their webpage for anybody to use, how do we know that’s a good product?” asks bioethicist Susannah Rose, M.S.S.W., Ph.D. “We need a way to assess their quality.”
That was Rose’s thinking behind a new project at Vanderbilt University Medical Center: the Vanderbilt Chatbot Accuracy and Reliability Evaluation System (V-CARES).
Principal investigator Rose is working with Zhijun Yin, Ph.D., M.S., from the department of Biomedical Informatics and Computer Science, to establish a “measuring stick” for chatbot safety and reliability. Development of the rating system is supported by ARPA-H funding and may transform how healthcare systems assess and select AI tools, Rose said.
Assessment Framework
V-CARES evaluates chatbots that are based on large language models (LLMs), which are trained on vast amounts of text to understand and generate human-like responses. Chatbots can provide fast, personalized responses that may be more useful than the static medical information on websites such as those of the Mayo Clinic or Cleveland Clinic. But the quality of chatbot information can also be harder to verify: chatbots are known to sometimes “hallucinate,” producing answers that contain incorrect information, omit important details, or insert unwarranted preferences or beliefs.
“As healthcare systems increasingly use chatbots to do different kinds of work, we need a way to assess their quality.”
While some LLMs, like Microsoft’s Copilot, provide references for their answers, most – including OpenAI’s ChatGPT and Google’s Gemini – do not, making quality assessment more challenging for non-experts.
The new V-CARES framework will include a rating system for existing LLMs and establish quality standards for those used in healthcare settings, so that physicians, healthcare organizations and insurance companies can make better-informed decisions about which LLMs are safe and effective.
Human-Centered Development
For their first case study, V-CARES will incorporate feedback from approximately 400 patient participants diagnosed with mental illnesses, including major depression and anxiety. Other feedback will come from medical professionals, mental health specialists and community members.
In the initial study, the researchers will use these responses to detect hallucinations, omissions and misaligned values in LLMs, drawing on diverse input to ensure the V-CARES system weighs both technical accuracy and real-world clinical utility.
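The article does not describe the project’s scoring methodology in technical detail, but as a rough illustration of how reviewer feedback on chatbot responses might be aggregated into a rating, here is a minimal Python sketch. The class and function names, the three error categories, and the letter-grade cutoffs are all hypothetical assumptions for illustration, not the actual V-CARES rubric.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical annotation of one chatbot response by one human reviewer.
# Field names are illustrative; they are not the actual V-CARES rubric.
@dataclass
class Annotation:
    hallucination: bool       # response contains fabricated or incorrect information
    omission: bool            # response leaves out clinically important details
    misaligned_values: bool   # response inserts unwarranted preferences or beliefs

def error_rates(annotations):
    """Aggregate reviewer annotations into per-category error rates (0.0-1.0)."""
    return {
        "hallucination": mean(a.hallucination for a in annotations),
        "omission": mean(a.omission for a in annotations),
        "misaligned_values": mean(a.misaligned_values for a in annotations),
    }

def letter_grade(rates):
    """Map the average error rate to a coarse letter grade (cutoffs are made up)."""
    overall = mean(rates.values())
    for cutoff, grade in [(0.02, "A"), (0.05, "B+"), (0.10, "B"), (0.20, "C+"), (0.35, "C")]:
        if overall <= cutoff:
            return grade
    return "D"

if __name__ == "__main__":
    # Example: five reviewers annotate one chatbot answer to a depression-related query.
    reviews = [
        Annotation(False, False, False),
        Annotation(True, False, False),
        Annotation(False, True, False),
        Annotation(False, False, False),
        Annotation(False, False, True),
    ]
    rates = error_rates(reviews)
    print(rates)                # per-category error rates, e.g. hallucination: 0.2
    print(letter_grade(rates))  # composite letter grade
```

In practice, a framework like this would also have to weight categories by clinical severity and account for reviewer disagreement; the sketch only shows the basic idea of turning human judgments into a comparable score.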
Rose said her team’s approach stands out as “equally weighted between the human side and the AI side.”
This human-centered approach, she added, “democratizes the standards” being developed for evaluation.
Creating Market Pressure
Beyond providing transparency for patients and healthcare providers, V-CARES could create market pressure for technology companies to improve commercial LLM systems.
“If a company knows they received a rating of C+ for medical-related content, this may motivate them to improve their product,” Rose noted, adding that it could help drive competition based on quality, rather than just interface design or marketing.
The project, which launched in September 2024, is funded by ARPA-H to deliver quick, iterative product development. The team expects to complete its first evaluation framework for depression and anxiety applications within a year.
Following validation in mental health applications, the system will expand to assess chatbots across other medical domains. Moving forward, V-CARES could help healthcare organizations make more informed decisions about which AI tools to implement, strengthening trust in those tools.