Talking with Actionbits: A part-enhanced VLM for action and interaction recognition in animals

Mar 21, 2026
Yang Yang, Ren Nakagawa, Risa Shinoda, Hiroaki Santo, Kenji Oyama, Takenao Ohkawa, Fumio Okura
Abstract
Understanding animal actions and interactions is essential for behavior analysis and ecological monitoring. Although large-scale in-the-wild datasets have advanced animal action recognition, existing methods still struggle with fine-grained motion, spatial relations, and multi-individual interactions. To address these challenges, we introduce AIRA, a unified framework for Action and Interaction Recognition in Animals. Built upon a vision-language model (VLM), AIRA learns in an action-centered representation space defined by body parts and their corresponding motions, thereby improving robustness to background noise and enabling cross-species generalization via a unified mammal-centric part ontology. To model actions, we treat body parts and motion as primary cues and introduce Actionbit tokens: compact part-motion representations generated by a large language model (LLM) that encode which parts move and how. We further propose Part-Enhanced Prompt Fine-tuning (PEPF) to make the VLM explicitly sensitive to part and pose cues. Within PEPF, the Action-actionbit Alignment (AbA) module enriches action representations with fine-grained part-motion semantics, and Part-Vision Prompting (PVP) extracts keyframes through action-aware prompting. Experiments across multiple benchmarks show consistent improvements in both action and interaction recognition, highlighting the importance of action-centered adaptation and relational reasoning for understanding animal behavior in the wild.
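To make the Actionbit idea concrete, the sketch below shows one possible reading of the abstract: LLM-generated part-motion phrases are embedded as "actionbits", pooled into an action-centered representation, and aligned with a video feature from the VLM. All names and the example actionbits are illustrative assumptions for exposition, not the paper's actual implementation or API.

```python
# Hypothetical sketch of Actionbit-style alignment, assuming a CLIP-like
# text tower and a precomputed video embedding from the VLM.
import torch
import torch.nn.functional as F

# Example part-motion phrases an LLM might generate for "dog jumping"
# ("which parts move and how"); purely illustrative.
actionbits = [
    "hind legs: push off the ground",
    "forelegs: extend forward",
    "torso: rises and arches",
    "tail: lifts",
]

def encode_text(texts, dim=512):
    """Stand-in text encoder; a real system would use the VLM's text encoder."""
    torch.manual_seed(0)
    return F.normalize(torch.randn(len(texts), dim), dim=-1)

actionbit_feats = encode_text(actionbits)             # (4, 512) part-motion embeddings
action_feat = F.normalize(actionbit_feats.mean(0, keepdim=True), dim=-1)  # pooled action representation
video_feat = F.normalize(torch.randn(1, 512), dim=-1)  # placeholder video embedding

# Alignment score between the action-centered representation and the video:
# cosine similarity, since both vectors are unit-normalized.
score = (action_feat @ video_feat.T).item()
print(f"alignment score: {score:.3f}")
```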
Type
Publication
Sensors, 26(6):1969