Apple: LLMs Accurately Recognize Activities from Captioned Audio and Motion Data

Added Nov 22, 2025
Article: Positive | Community: Negative/Mixed

Apple shows that LLMs can infer user activities by fusing text captions derived from audio with predictions from IMU motion sensors, so raw audio never reaches the model, a privacy-conscious design. Evaluated on Ego4D with Gemini-2.5-pro and Qwen-32B, the approach achieves above-chance zero-shot accuracy and improves further with one-shot examples. It reduces the need for bespoke multimodal models, and Apple has released supplemental materials so others can replicate the study.

Key Points

  • LLMs perform late fusion over text captions from audio and IMU predictions to recognize activities without using raw audio (see the code sketch after this list).
  • Zero-shot classification beats chance; one-shot examples further boost accuracy across 12 daily activities.
  • Evaluations used Ego4D data with Gemini-2.5-pro and Qwen-32B in both closed-set and open-ended settings.
  • The method works well when aligned multimodal training data is limited, avoiding extra memory/compute for custom multimodal models.
  • Apple published supplemental materials to enable researchers to reproduce the results.
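
To make the late-fusion step concrete, here is a minimal Python sketch of how the two modalities might be combined purely as text before an LLM call. The activity list, prompt wording, and the `llm` callable are illustrative assumptions, not the paper's actual prompts or label set; a wrapper around Gemini-2.5-pro or Qwen-32B would slot in as the `llm` argument.

```python
from typing import Callable, Optional, Sequence

# Hypothetical closed-set label space. The paper classifies 12 daily
# activities, but this exact list is an assumption for illustration.
ACTIVITIES = [
    "cooking", "cleaning", "eating", "watching tv", "reading",
    "using a computer", "exercising", "doing laundry",
    "playing with children", "talking", "walking", "sleeping",
]

def build_fusion_prompt(
    audio_caption: str,
    imu_predictions: Sequence[str],
    one_shot_example: Optional[str] = None,
) -> str:
    """Late fusion as prompt construction: both modalities enter the
    LLM as text, so raw audio never reaches the model."""
    lines = [
        "You are given text descriptions derived from a wearable device.",
        f"Audio caption: {audio_caption}",
        f"Motion (IMU) model predictions: {', '.join(imu_predictions)}",
        f"Pick the user's current activity from: {', '.join(ACTIVITIES)}.",
        "Answer with exactly one label.",
    ]
    if one_shot_example is not None:
        # One-shot prompting: prepend a worked example, which the paper
        # reports improves accuracy over the zero-shot setting.
        lines.insert(1, f"Example:\n{one_shot_example}")
    return "\n".join(lines)

def classify_activity(
    llm: Callable[[str], str],  # any text-in/text-out wrapper around the LLM
    audio_caption: str,
    imu_predictions: Sequence[str],
    one_shot_example: Optional[str] = None,
) -> str:
    prompt = build_fusion_prompt(audio_caption, imu_predictions, one_shot_example)
    return llm(prompt).strip().lower()

# Example call (my_llm is any prompt -> completion function):
# classify_activity(my_llm, "dishes clink and water runs in a sink",
#                   ["cooking", "cleaning"])
```

Because fusion happens at the prompt level, the same classifier can run in a closed-set mode (as above) or open-ended mode by dropping the label list, matching the two evaluation settings described in the key points.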

Sentiment

The Hacker News community is predominantly skeptical and concerned about the surveillance implications of this research. While a few commenters acknowledge the technical interest and potential health benefits, the overwhelming response frames the work as advancing surveillance capabilities. The community largely views Apple's privacy-first framing with suspicion, arguing that the underlying capability is concerning regardless of current implementation.

In Agreement

  • The technique is technically interesting and the use of text summaries rather than raw audio is a privacy-conscious design choice
  • Activity recognition could improve health features like fall detection and distinguishing falls from playing with children
  • LLM-based late fusion is a pragmatic approach that avoids the need for bespoke multimodal training data

Opposed

  • This research fundamentally enables surveillance infrastructure regardless of its stated privacy benefits
  • Activity recognition from sensor data is not new — alarms have been sounded since Android apps began requesting motion sensor permissions
  • Data collected now will become more exploitable as technology advances, following a "harvest now, decrypt later" pattern
  • Even people with nothing to hide contribute to a surveillance ecosystem that harms journalists and activists
  • LLMs are not genuinely necessary for this task and their inclusion feels career-driven rather than technically justified
  • The inevitability of universal tracking through ubiquitous vibration-sensing devices is deeply concerning