Funny or sad? AI's response to what percentage of factual questions are answered incorrectly by LLMs
I don't use AI (for the most part). I have an Android phone, and the Google widget has three icons: the magnifying glass, which claims to be full-on AI; the microphone, which is voice-to-text that then invokes standard Google search; and the camera, which is supposed to be a visual search using AI.
Using the microphone icon, I asked this question: "What percentage of factual questions are wrongly answered by LLMs?"
The first answer is the AI summary. I don't know if the actual magnifying glass is supposed to use a different, more accurate version of AI, but AI damns itself with its own answer quoted below:
------
The percentage of factual questions wrongly answered by LLMs varies significantly depending on the model, the type of question, the subject matter, and the evaluation benchmark used. Studies show a wide range of performance, with some models failing to answer correctly more often than not.
Key findings from various studies include:
Accuracy Range:
- One analysis of eight AI models showed that more than 60% of queries were incorrectly answered overall.
- Another study found that most models achieved a factual accuracy of 30-50% on short-form questions.
- A BBC study showed that 45% of AI queries produced erroneous answers.

Model Performance:
- Model accuracy varied in an evaluation on disease epidemiology questions:
  - ChatGPT-4: 76.2% of responses were accurate (23.8% were wrong).
  - Google Gemini: 50.0% accurate.
  - ChatGPT-3.5: 45.2% accurate.
- In a multilingual fact-checking analysis, GPT-4o achieved 64.83% accuracy on factual claims, while GPT-3.5 Turbo was 58.19% accurate.

Question Type Impact:
- Misconception-Prone Questions: LLMs perform poorly on the TruthfulQA benchmark, which uses questions designed to elicit common human misconceptions. The best-performing models in an initial study only achieved around 54-58% truthfulness, compared to human performance of 94%.
- Domain-Specific Questions: Some LLMs achieve high factual accuracy (nearly 90-95%) on simple tasks with sufficient information in clinical cases but struggle with complex reasoning and rare diseases.
- Long-form Answers: Factual accuracy tends to decrease in long-form responses, dropping from over 50% for facts requested early in a prompt to around 30% for later facts.

Hallucination and Error Propagation:
- Models sometimes provide more wrong answers than correct ones in specific contexts, showing a systematic failure to retrieve correct information.
- Errors can "snowball," decreasing the accuracy of subsequent answers within the same response.
Factual errors, often called "hallucinations," remain a challenge for even the most advanced LLMs. Human oversight and verification of critical information provided by these models are still needed.