Talking with Actionbits: A part-enhanced VLM for action and interaction recognition in animals

Mar 21, 2026
Yang Yang, Ren Nakagawa, Risa Shinoda, Hiroaki Santo, Kenji Oyama, Takenao Ohkawa, Fumio Okura
Abstract
Understanding animal actions and interactions is essential for behavior analysis and ecological monitoring. Although large-scale in-the-wild datasets have advanced animal action recognition, existing methods still struggle with fine-grained motion, spatial relations, and multi-individual interactions. To address these challenges, we introduce AIRA, a unified framework for Action and Interaction Recognition in Animals. Built upon a vision-language model (VLM), AIRA learns in an action-centered representation space defined by body parts and their corresponding motions, thereby improving robustness to background noise and enabling cross-species generalization via a unified mammal-centric part ontology. To model actions, we treat body parts and motion as primary cues and introduce Actionbit tokens: compact part-motion representations generated by a large language model (LLM) that encode which parts move and how. We further propose Part-Enhanced Prompt Fine-tuning (PEPF) to make the VLM explicitly sensitive to part and pose cues. Within PEPF, the Action-actionbit Alignment (AbA) module enriches action representations with fine-grained part-motion semantics, and Part-Vision Prompting (PVP) extracts keyframes through action-aware prompting. Experiments across multiple benchmarks show consistent improvements in both action and interaction recognition, highlighting the importance of action-centered adaptation and relational reasoning for understanding animal behavior in the wild.
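To make the Actionbit idea concrete, the sketch below shows one possible reading of the abstract: LLM-generated part-motion phrases are embedded as "actionbits", pooled into an action-centered representation, and aligned with a video feature from the VLM. All names and the example actionbits are illustrative assumptions for exposition, not the paper's actual implementation or API.

```python
# Hypothetical sketch of Actionbit-style alignment, assuming a CLIP-like
# text tower and a precomputed video embedding from the VLM.
import torch
import torch.nn.functional as F

# Example part-motion phrases an LLM might generate for "dog jumping"
# ("which parts move and how"); purely illustrative.
actionbits = [
    "hind legs: push off the ground",
    "forelegs: extend forward",
    "torso: rises and arches",
    "tail: lifts",
]

def encode_text(texts, dim=512):
    """Stand-in text encoder; a real system would use the VLM's text encoder."""
    torch.manual_seed(0)
    return F.normalize(torch.randn(len(texts), dim), dim=-1)

actionbit_feats = encode_text(actionbits)             # (4, 512) part-motion embeddings
action_feat = F.normalize(actionbit_feats.mean(0, keepdim=True), dim=-1)  # pooled action representation
video_feat = F.normalize(torch.randn(1, 512), dim=-1)  # placeholder video embedding

# Alignment score between the action-centered representation and the video:
# cosine similarity, since both vectors are unit-normalized.
score = (action_feat @ video_feat.T).item()
print(f"alignment score: {score:.3f}")
```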
Type
Publication
Sensors, 26(6):1969