Rude Prompts, Better Answers: How Tone Impacts LLM Accuracy

Article: Neutral | Community: Neutral | Deeply Divisive

Researchers evaluated how varying levels of politeness in prompts affected the accuracy of ChatGPT 4o across math, science, and history questions. The study found that 'Very Rude' prompts achieved the highest accuracy at 84.8%, while 'Very Polite' prompts scored lowest at 80.8%. This suggests that modern AI models, unlike people, may give more accurate answers when addressed in an impolite tone.

Key Points

  • The study tested five levels of prompt politeness ranging from Very Polite to Very Rude using ChatGPT 4o.
  • Impolite prompts outperformed polite ones, with 'Very Rude' prompts achieving the highest accuracy at 84.8%.
  • The findings suggest a shift in how modern LLMs process tonal variations compared to older models studied in previous research.
  • The research highlights that the pragmatic wording of a prompt significantly influences the model's performance on academic tasks.
  • The results raise questions about the social dimensions of human-AI interaction and the effectiveness of traditional social norms when prompting AI.
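
To make the setup in the points above concrete, here is a minimal sketch of how one base question could be expanded into five tone variants and scored per tone. The prefix wording, the function names, and the record format are all hypothetical; the study's actual prompts are not reproduced in this summary, and no model is called here.

```python
from collections import defaultdict

# Hypothetical tone prefixes for illustration only; these are NOT the
# study's actual prompt wordings.
TONE_PREFIXES = {
    "Very Polite": "Would you be so kind as to answer the following question? ",
    "Polite": "Please answer the following question. ",
    "Neutral": "",
    "Rude": "Answer this, if you can manage it. ",
    "Very Rude": "Answer this and don't waste my time. ",
}


def make_variants(question):
    """Return one prompt per tone level for a single base question."""
    return {tone: prefix + question for tone, prefix in TONE_PREFIXES.items()}


def accuracy_by_tone(records):
    """Map (tone, is_correct) pairs to the fraction answered correctly per tone."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tone, is_correct in records:
        totals[tone] += 1
        correct[tone] += int(is_correct)
    return {tone: correct[tone] / totals[tone] for tone in totals}
```

Keeping the base question identical across variants is the point of the design: any accuracy difference between tones can then be attributed to the wrapper text rather than to the question itself.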

Sentiment

The sentiment is highly divided, characterized by a tension between technical pragmatism and concern for the psychological effects of normalized hostility.

In Agreement

  • Berating models can sometimes break them out of repetitive error cycles or 'crap fixes'.
  • Direct or 'rude' prompts may remove unnecessary 'fluff' tokens that distract the model from the core task.
  • Adversarial or high-pressure prompts might simulate a type of 'investment' that forces better reasoning.
  • AI is an inanimate tool, and treating it with human-like politeness is a category error that can be seen as degrading to actual human interaction.

Opposed

  • The 4-percentage-point accuracy gain is not worth the risk of habituating hostile behavior that could leak into real-world social interactions.
  • The study's sample of 250 prompts is too small to support statistically significant conclusions, so the difference may just be noise.
  • The 'polite' prompts used in the study were poorly constructed and sounded passive-aggressive rather than genuinely respectful.
  • Other models, particularly from Anthropic, are known to push back or terminate conversations when faced with abusive language.
  • Kindness is a habit for the benefit of the speaker's own character, regardless of whether the receiver is sentient.
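
The sample-size objection above can be checked with a rough calculation. The sketch below runs a plain two-proportion z-test on the reported 84.8% and 80.8% figures. It is not the study's own analysis, and the per-tone sample size is an assumption: the summary gives only the total of 250 prompts, so the usage lines assume 50 prompts per tone across five tones. It also treats the two groups as independent, which ignores any pairing of questions across tones or repeated runs.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """Unpaired two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


very_rude, very_polite = 0.848, 0.808  # accuracies reported in the summary

# Absolute gap in percentage points vs. relative improvement in percent.
gap_points = (very_rude - very_polite) * 100
relative_gain = (very_rude - very_polite) / very_polite * 100

# ASSUMPTION: 50 prompts per tone (250 prompts / 5 tones).
z, p = two_proportion_z(very_rude, very_polite, n1=50, n2=50)
print(f"gap = {gap_points:.1f} points, relative gain = {relative_gain:.1f}%")
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

Under these assumptions the gap falls well short of conventional significance, which is consistent with the objection. A paired test over repeated runs could reach a different conclusion, so this sketch bounds the concern rather than settling it.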