
General Discussion


UniqueUserName

(403 posts)
Wed Nov 26, 2025, 06:49 AM

Funny or sad? AI's response to what percentage of factual questions are answered incorrectly by LLMs

I don't use AI (for the most part). I have an Android phone, and the Google widget has three icons: the magnifying glass, which claims to be full-on AI; the microphone, which does voice-to-text and then invokes a standard Google search; and the camera, which is supposed to be a visual search using AI.

Using the microphone icon, I asked this question: "What percentage of factual questions are wrongly answered by LLMs?"
The first result was the AI summary. I don't know whether the magnifying glass icon is supposed to use a different, more accurate version of the AI, but the AI damns itself with its own answer, quoted below:

------
The percentage of factual questions wrongly answered by LLMs varies significantly depending on the model, the type of question, the subject matter, and the evaluation benchmark used. Studies show a wide range of performance, with some models failing to answer correctly more often than not.
Key findings from various studies include:

Accuracy Range:
One analysis of eight AI models showed that more than 60% of queries were incorrectly answered overall.
Another study found that most models achieved a factual accuracy of 30-50% on short-form questions.
A BBC study showed that 45% of AI queries produced erroneous answers.

Model Performance:
Model accuracy varied in an evaluation on disease epidemiology questions:
ChatGPT-4: 76.2% of responses were accurate (23.8% were wrong).
Google Gemini: 50.0% accurate.
ChatGPT-3.5: 45.2% accurate.
In a multilingual fact-checking analysis, GPT-4o achieved 64.83% accuracy on factual claims, while GPT-3.5 Turbo was 58.19% accurate.

Question Type Impact:
Misconception-Prone Questions: LLMs perform poorly on the TruthfulQA benchmark, which uses questions designed to elicit common human misconceptions. The best-performing models in an initial study only achieved around 54-58% truthfulness, compared to human performance of 94%.
Domain-Specific Questions: Some LLMs achieve high factual accuracy (nearly 90-95%) on simple tasks with sufficient information in clinical cases but struggle with complex reasoning and rare diseases.
Long-form Answers: Factual accuracy tends to decrease in long-form responses, dropping from over 50% for facts requested early in a prompt to around 30% for later facts.

Hallucination and Error Propagation:
Models sometimes provide more wrong answers than correct ones in specific contexts, showing a systematic failure to retrieve correct information.
Errors can "snowball," decreasing the accuracy of subsequent answers within the same response.
Factual errors, often called "hallucinations," remain a challenge for even the most advanced LLMs. Human oversight and verification of critical information provided by these models are still needed.
