April 19, 2024
Artificial intelligence answers medical questions

One of the best-known large AI language models is OpenAI's ChatGPT. Trained on vast amounts of data, the model can answer a wide range of questions and generate easily understandable text from just a few prompts. DeepMind, the AI powerhouse owned by Google, now wants to achieve something similar in medicine.

In any case, such language models have clear potential for answering medical questions, says Clemens Heitzinger, one of the heads of the Center for Artificial Intelligence and Machine Learning at TU Wien. "The advantage is that anyone can use them and that patients can interact with these models in natural language," he explains to science.ORF.at, adding: "Of course, you have to pay close attention to how reliable the AI's recommendations and generated responses actually are."

Heitzinger was not involved in the DeepMind model, but he has been working on another AI model that suggests treatment steps for patients with sepsis (blood poisoning) and thus increases their chances of survival.

New evaluation procedure

To verify the performance of AI language models, experts often use evaluation procedures known as benchmarks. Such tests show how useful a model is in practice.

However, in a study recently presented in the journal "Nature," DeepMind experts note that previous benchmarks often have only limited significance in medicine: most of them rate the performance of language models only on individual medical tests. The experts therefore present a new benchmark, MultiMedQA. It combines a total of seven datasets: six with questions from medical research and from patients, and a new dataset of over 3,000 medical questions that are frequently searched online.
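Conceptually, such a benchmark evaluation boils down to comparing model answers with reference answers across many questions. The following minimal Python sketch illustrates the idea for a multiple-choice dataset; the sample question and the ask_model function are hypothetical stand-ins, not DeepMind's actual evaluation code.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# The sample question and ask_model() are hypothetical stand-ins,
# not DeepMind's actual evaluation pipeline.

dataset = [
    {"question": "A deficiency of which vitamin causes scurvy?",
     "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D"},
     "answer": "B"},
    # ... a real benchmark contains thousands of such questions
]

def ask_model(question, options):
    # Stand-in for a call to a language model; here it naively
    # picks the first option so the sketch runs end to end.
    return next(iter(options))

correct = sum(ask_model(q["question"], q["options"]) == q["answer"]
              for q in dataset)
print(f"Accuracy: {correct / len(dataset):.1%}")
```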

Revamped AI model

Building on Google's PaLM language model, the DeepMind experts created a revised model for medical questions that performs at least as well as other modern AI language models on most datasets of the MultiMedQA benchmark. The new model, called Med-PaLM, was tested with questions similar to those used in medical licensing exams in the United States. On average, it was 17 percent more accurate than comparable language models.

In an evaluation by physicians, Med-PaLM even performed as well as medical professionals in many respects. Nine clinicians assessed the model's performance, with each response to randomly selected questions from the benchmark datasets rated by one person. The result: 92.6 percent of Med-PaLM's responses matched the scientific consensus, close to the 92.9 percent achieved by the physicians' responses.

In several other respects, however, the quality of the AI-generated information has not yet reached the level of medical professionals. Nearly 19 percent of Med-PaLM's responses contained incorrect or inappropriate content; this was the case for only 1.4 percent of the experts' responses.
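The reported figures are simple proportions over the individually rated responses. A minimal sketch of the computation, using made-up ratings rather than the study's data:

```python
# Each model response receives a clinician rating; the published
# percentages are plain proportions over all rated responses.
# These ratings are made-up examples, not the study's data.

ratings = [
    {"matches_consensus": True,  "harmful_or_wrong": False},
    {"matches_consensus": True,  "harmful_or_wrong": False},
    {"matches_consensus": False, "harmful_or_wrong": True},
    # ... one entry per rated response
]

n = len(ratings)
consensus = sum(r["matches_consensus"] for r in ratings) / n
flagged = sum(r["harmful_or_wrong"] for r in ratings) / n
print(f"Scientific consensus: {consensus:.1%}, "
      f"incorrect/inappropriate content: {flagged:.1%}")
```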

Commercial vs. scientific interests

According to Heitzinger, the reason for this lies in the data used to train the model: "These large language models depend on the quality of the datasets used for learning." To ensure that a model produces as few incorrect responses as possible, the datasets used should therefore be checked carefully.

It would therefore generally be important for researchers to gain insight into these data. However, the commercial interests of large companies often stand in the way; for Med-PaLM, too, the training data cannot be inspected. "At the end of the day, these are of course also trade secrets, and not every company will be happy to show its cards," says Heitzinger.

First attempts at regulation

The fact that datasets remain hidden is not only a problem in medicine. Only recently, Swiss researchers showed that AI language models can generate highly convincing false reports that are nearly indistinguishable from posts by real people on platforms such as Twitter.

To regulate the use of AI in the future, the European Parliament is working on the "AI Act." It would be the world's first comprehensive AI law, dividing AI systems into four categories according to the risk they pose. Facial recognition software for real-time surveillance of the population is considered particularly risky, and its use is to be banned completely. How language models such as ChatGPT and Med-PaLM will be classified, however, has not yet been conclusively regulated.

There is still much work to be done

In any case, it is still too early to use Med-PaLM in daily medical practice, and even the DeepMind developers acknowledge this. There are still too many limitations, and the approach can be improved in several areas. For example, each response, whether from a physician or from Med-PaLM, was rated by only a single person in the study, which could skew the results. The citation of medical sources also needs further improvement.
