Scientists at Google have created an AI chatbot for conducting medical interviews. It matched or surpassed human primary care practitioners on most criteria, including accuracy, politeness, and empathy.
Before doctors begin treating, they do a lot of talking. The initial conversation with the patient is extremely important, as it can lead to a correct or incorrect diagnosis. But today, where there is talking, there is AI. Can a chatbot based on a large language model replace primary care practitioners as a gateway for patients and serve them just as faithfully?
Last year, we reported on an attempt to do just that. That study, where AI clearly outperformed human health practitioners, had many limitations: for instance, the model was not specifically trained to provide medical advice, and it was pitched against Reddit threads where questions were answered by human practitioners. In addition, in the few months since that paper was published, large language models have made great strides.
This time, the challenge was picked up by a heavyweight: Google itself. A team of researchers from Google Research and Google DeepMind published a pre-print paper (meaning that it has not been peer-reviewed yet) that describes a dedicated chatbot for conducting medical interviews. The idea was to create a system that “would understand clinical language, intelligently acquire information under uncertainty, and engage in natural, diagnostically useful medical conversations with patients and those who care for them”. The chatbot was then pitched against board-certified primary care physicians, and… you can probably guess what happened.
Empathetic as only a machine can be?
The system is called Articulate Medical Intelligence Explorer (AMIE), and first, it had to be trained. Choosing proper datasets was a challenge in itself, and the researchers ended up using a variety of sources, such as summaries and transcripts of audio recordings from real-world medical visits. AMIE was then fine-tuned using feedback from both AI and humans. Among other things, it mastered the art of diagnostic dialogue by impersonating all three agents involved: the patient, the doctor, and the moderator who monitors the exchange and provides feedback. Over many iterations, it learned from its own mistakes and kept improving.
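The self-play setup described above can be sketched in a few lines. Everything here is an illustrative stand-in, not Google's actual implementation: the role-playing functions, the critic's scoring, and the loop structure are assumptions made to show the shape of the idea, with the real language model replaced by stub functions.

```python
# Conceptual sketch of the self-play training loop described above.
# respond() and critique() are toy stand-ins for the language model
# acting as patient, doctor, and moderator; they are NOT AMIE's code.

def respond(role: str, dialogue: list[str]) -> str:
    """Stand-in for the model answering in a given role."""
    return f"{role}: turn {len(dialogue) + 1}"

def critique(dialogue: list[str]) -> float:
    """Stand-in for the moderator scoring an exchange (toy score in [0, 1])."""
    return min(1.0, len(dialogue) / 10)

def self_play_iteration(num_turns: int = 4) -> tuple[list[str], float]:
    """One simulated consultation: the same model alternates roles."""
    dialogue: list[str] = []
    for _ in range(num_turns):
        dialogue.append(respond("patient", dialogue))
        dialogue.append(respond("doctor", dialogue))
    return dialogue, critique(dialogue)

def training_loop(iterations: int = 3) -> list[float]:
    """Repeat self-play; in the real system, highly rated dialogues
    would feed back into fine-tuning so the model keeps improving."""
    scores = []
    for _ in range(iterations):
        _, score = self_play_iteration()
        scores.append(score)
    return scores

print(training_loop())
```

The key point the sketch captures is that a single model generates both sides of the conversation and the feedback on it, so each iteration produces its own training signal without new human data.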
Then the experiment commenced. One of its limitations was that it employed not real patients but trained actors, who interacted in accordance with the scenarios they were given. Their interlocutor, either AMIE or a board-certified primary care physician (PCP), was assigned in a blinded, randomized way. The conversations were then assessed both by the simulated patients and by human physicians who had no connection to the physicians answering the questions.
At the end of the day, AMIE outperformed the human physicians in 24 out of 26 categories. It matched them in acquiring information but prevailed on metrics such as differential diagnosis, producing lists that were more accurate and complete than those provided by the PCPs. Just as in the aforementioned earlier study, AMIE particularly excelled in empathy and communication skills: both the simulated patients and the evaluating physicians rated it, on average, as more polite, honest, and trustworthy than the PCPs.
The machine is not necessarily better
The researchers note that PCPs are usually not trained in conversing with the patient via a text-based chat, which might have affected their performance. They cannot be expected to match AI’s speed, consistency, patience, and tirelessness. A face-to-face visit or even a telehealth chat might still hold many advantages in some settings. However, an online chat could be the only option available for many people, especially in poorer countries and communities, which makes AMIE a big step towards democratizing healthcare.
“To our knowledge, this is the first time that a conversational AI system has ever been designed optimally for diagnostic dialogue and taking the clinical history,” Alan Karthikesalingam, a research scientist at Google Health and the study’s co-author, said to Nature, adding that the results should be interpreted with caution and humility. “This in no way means that a language model is better than doctors in taking clinical history,” Karthikesalingam noted.
The authors summarize the work in the paper itself: “In this study, we introduced AMIE, an LLM based AI system optimised for clinical dialogue with diagnostic reasoning capabilities. We compared AMIE consultations to those performed by PCPs using a randomized, double-blind crossover study with human simulated patients in the style of an Objective Structured Clinical Examination (OSCE). Notably, our study was not designed to be representative of clinical conventions either for traditional OSCE evaluations, for remote- or tele-medical consultation practices, or for the ways clinicians usually use text and chat messaging to communicate with patients. Our evaluation instead mirrored the most common way by which people interact with LLMs today, leveraging a potentially scalable and familiar mechanism for AI systems to engage in remote diagnostic dialogue. In this setting, we observed that AMIE, an AI system optimised specifically for the task, outperformed PCPs on simulated diagnostic conversations when evaluated along multiple clinically-meaningful axes of consultation quality.”
Tu, T., Palepu, A., Schaekermann, M., Saab, K., Freyberg, J., Tanno, R., … & Natarajan, V. (2024). Towards Conversational Diagnostic AI. arXiv preprint arXiv:2401.05654.