Apple: LLMs Accurately Recognize Activities from Captioned Audio and Motion Data

Added Nov 22, 2025
Article: Positive | Community: Negative/Mixed

Apple shows that LLMs can infer user activities by fusing text captions derived from audio with predictions from IMU motion sensors, so raw audio never reaches the model, a privacy-conscious design. Evaluated on Ego4D with Gemini-2.5-pro and Qwen-32B, the approach achieves above-chance zero-shot accuracy and improves further with one-shot examples. It reduces the need for bespoke multimodal models, and Apple has released supplemental materials so others can replicate the study.

Key Points

  • LLMs perform late fusion over text captions from audio and IMU predictions to recognize activities without using raw audio (see the code sketch after this list).
  • Zero-shot classification beats chance; one-shot examples further boost accuracy across 12 daily activities.
  • Evaluations used Ego4D data with Gemini-2.5-pro and Qwen-32B in both closed-set and open-ended settings.
  • The method works well when aligned multimodal training data is limited, avoiding extra memory/compute for custom multimodal models.
  • Apple published supplemental materials to enable researchers to reproduce the results.
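
To make the late-fusion step concrete, here is a minimal Python sketch of how the two modalities might be combined purely as text before an LLM call. The activity list, prompt wording, and the `llm` callable are illustrative assumptions, not the paper's actual prompts or label set; a wrapper around Gemini-2.5-pro or Qwen-32B would slot in as the `llm` argument.

```python
from typing import Callable, Optional, Sequence

# Hypothetical closed-set label space. The paper classifies 12 daily
# activities, but this exact list is an assumption for illustration.
ACTIVITIES = [
    "cooking", "cleaning", "eating", "watching tv", "reading",
    "using a computer", "exercising", "doing laundry",
    "playing with children", "talking", "walking", "sleeping",
]

def build_fusion_prompt(
    audio_caption: str,
    imu_predictions: Sequence[str],
    one_shot_example: Optional[str] = None,
) -> str:
    """Late fusion as prompt construction: both modalities enter the
    LLM as text, so raw audio never reaches the model."""
    lines = [
        "You are given text descriptions derived from a wearable device.",
        f"Audio caption: {audio_caption}",
        f"Motion (IMU) model predictions: {', '.join(imu_predictions)}",
        f"Pick the user's current activity from: {', '.join(ACTIVITIES)}.",
        "Answer with exactly one label.",
    ]
    if one_shot_example is not None:
        # One-shot prompting: prepend a worked example, which the paper
        # reports improves accuracy over the zero-shot setting.
        lines.insert(1, f"Example:\n{one_shot_example}")
    return "\n".join(lines)

def classify_activity(
    llm: Callable[[str], str],  # any text-in/text-out wrapper around the LLM
    audio_caption: str,
    imu_predictions: Sequence[str],
    one_shot_example: Optional[str] = None,
) -> str:
    prompt = build_fusion_prompt(audio_caption, imu_predictions, one_shot_example)
    return llm(prompt).strip().lower()

# Example call (my_llm is any prompt -> completion function):
# classify_activity(my_llm, "dishes clink and water runs in a sink",
#                   ["cooking", "cleaning"])
```

Because fusion happens at the prompt level, the same classifier can run in a closed-set mode (as above) or open-ended mode by dropping the label list, matching the two evaluation settings described in the key points.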

Sentiment

The Hacker News community is predominantly skeptical and concerned about the surveillance implications of this research. While a few commenters acknowledge the technical interest and potential health benefits, the overwhelming response frames the work as advancing surveillance capabilities. The community largely views Apple's privacy-first framing with suspicion, arguing that the underlying capability is concerning regardless of current implementation.

In Agreement

  • The technique is technically interesting and the use of text summaries rather than raw audio is a privacy-conscious design choice
  • Activity recognition could improve health features like fall detection and distinguishing falls from playing with children
  • LLM-based late fusion is a pragmatic approach that avoids the need for bespoke multimodal training data

Opposed

  • This research fundamentally enables surveillance infrastructure regardless of its stated privacy benefits
  • Activity recognition from sensor data is not new — alarms have been sounded since Android apps began requesting motion sensor permissions
  • Data collected now will become more exploitable as technology advances, following a "harvest now, decrypt later" pattern
  • Even people with nothing to hide contribute to a surveillance ecosystem that harms journalists and activists
  • LLMs are not genuinely necessary for this task and their inclusion feels career-driven rather than technically justified
  • The inevitability of universal tracking through ubiquitous vibration-sensing devices is deeply concerning