Domain Generalization using Action Sequences for Egocentric Action Recognition

Politecnico di Torino  
amirshayan.nasirimajd@studenti.polito.it
Pattern Recognition Letters (June 2025)

Egocentric Action Recognition (EAR) models struggle to classify the same action when it is performed in different environments (top). Despite the different visual contexts, the sequence of actions remains the same. Considering the previous and subsequent actions within a sequence therefore proves beneficial for improving the cross-domain robustness of EAR models.

Abstract

Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging action sequences, we aim to enhance the model’s generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. Additionally, we enhance the model’s robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100 show that SeqDG yields a +2.4% relative average improvement in cross-domain action recognition in unseen environments, while on EGTEA the model achieves +0.6% Top-1 accuracy over the state of the art in intra-domain action recognition.

The SeqDG Architecture

SeqDG is a domain generalization framework for egocentric action recognition that leverages the consistency of action sequences across different environments. The model takes both visual inputs (video clips) and textual narrations (e.g., verb-noun pairs) for a sequence of actions surrounding a central action. Visual features are encoded using a transformer encoder with a classification token, while textual features are extracted using a pre-trained language model. The central action in the sequence is masked in both modalities, and two separate decoders with cross-modal attention reconstruct the missing visual and textual information — a process that encourages the model to capture temporal and semantic dependencies across the sequence.
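The sketch below illustrates this masked-reconstruction idea (SeqRec) in PyTorch. It is a minimal approximation under stated assumptions, not the released implementation: the feature dimension, sequence length, layer counts, and module names are illustrative, and both modalities are taken as pre-extracted per-action features (clip features for video, language-model embeddings for the narrations).

```python
# Minimal sketch of the SeqRec component (assumed hyperparameters, not the authors' code).
import torch
import torch.nn as nn

class SeqRecSketch(nn.Module):
    def __init__(self, d_model=512, seq_len=5, n_heads=8, n_layers=2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))  # classification token
        self.vis_mask = nn.Parameter(torch.zeros(d_model))         # learned mask for the central visual feature
        self.txt_mask = nn.Parameter(torch.zeros(d_model))         # learned mask for the central text feature
        self.pos = nn.Parameter(torch.zeros(1, seq_len + 1, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.vis_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Two decoders with cross-modal attention: each reconstructs its own
        # modality while attending to the other modality's sequence.
        self.vis_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.txt_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)

    def forward(self, vis_feats, txt_feats):
        # vis_feats, txt_feats: (B, seq_len, d_model) pre-extracted clip / narration features
        B, L, _ = vis_feats.shape
        center = L // 2
        vis_in, txt_in = vis_feats.clone(), txt_feats.clone()
        vis_in[:, center] = self.vis_mask   # mask the central action in both modalities
        txt_in[:, center] = self.txt_mask
        vis_in = torch.cat([self.cls_token.expand(B, -1, -1), vis_in], dim=1) + self.pos
        enc = self.vis_encoder(vis_in)                           # (B, L+1, d_model)
        cls_out, vis_ctx = enc[:, 0], enc[:, 1:]
        vis_rec = self.vis_decoder(vis_ctx, txt_in)[:, center]   # visual decoder attends to text
        txt_rec = self.txt_decoder(txt_in, vis_ctx)[:, center]   # text decoder attends to visual context
        return cls_out, vis_rec, txt_rec                         # CLS for classification, reconstructions for SeqRec
```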

Mixing action sequences across domains

SeqDG includes a sequence-mixing augmentation strategy (SeqMix), where actions with the same label but from different domains are combined to improve robustness to domain shifts. The final action classification is performed using the CLS token from the visual encoder, and the model is jointly trained with classification and reconstruction losses. During inference, only visual inputs are used.
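As a rough illustration of how these objectives could be combined, the snippet below sketches a joint loss on top of the module above. The loss weights, the linear classifier, and MSE as the reconstruction criterion for both modalities are assumptions rather than details taken from the paper.

```python
# Hedged sketch of joint training: classification on the CLS token plus
# reconstruction of the masked central action (assumed criteria and weights).
import torch.nn.functional as F

def seqdg_loss(cls_out, vis_rec, txt_rec, vis_target, txt_target, labels,
               classifier, lambda_vis=1.0, lambda_txt=1.0):
    cls_loss = F.cross_entropy(classifier(cls_out), labels)  # action classification
    vis_loss = F.mse_loss(vis_rec, vis_target)                # reconstruct masked visual feature
    txt_loss = F.mse_loss(txt_rec, txt_target)                # reconstruct masked text feature
    return cls_loss + lambda_vis * vis_loss + lambda_txt * txt_loss
```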

Given a sequence from domain d_i, we replace its action a_i with an action a_j that shares the same label but comes from a sequence belonging to a different domain d_j.
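A possible implementation of this swap, assuming pre-extracted per-action features and a simple label-indexed bank of candidates, could look like the following; the data layout, mixing probability, and helper names are illustrative choices rather than the authors' exact procedure.

```python
# Illustrative SeqMix-style augmentation on pre-extracted features.
import random
from collections import defaultdict

def build_label_bank(samples):
    """Group (features, label, domain) samples by action label."""
    bank = defaultdict(list)
    for feats, label, domain in samples:
        bank[label].append((feats, domain))
    return bank

def seqmix(sequence, seq_domain, bank, p=0.5):
    """Replace actions in `sequence` (a list of (features, label) pairs) with
    same-label actions drawn from a different domain, with probability p."""
    mixed = []
    for feats, label in sequence:
        candidates = [f for f, d in bank[label] if d != seq_domain]
        if candidates and random.random() < p:
            feats = random.choice(candidates)  # same label, different domain
        mixed.append((feats, label))
    return mixed
```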

Experimental results

We validate SeqDG on the EPIC-KITCHENS-100 UDA benchmark, comparing against both UDA and DG methods in the cross-domain setting (target validation set). Models are evaluated in terms of Top-1 and Top-5 accuracy (%) for Verb, Noun, and Action.
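For reference, Top-k accuracy can be computed as in the generic sketch below; this is not code from the benchmark's evaluation toolkit.

```python
# Generic Top-k accuracy (not tied to the EPIC-KITCHENS-100 evaluation code).
import torch

def topk_accuracy(logits, labels, k=5):
    """logits: (N, num_classes), labels: (N,). Returns accuracy in percent."""
    topk = logits.topk(k, dim=1).indices                # (N, k) highest-scoring classes
    correct = (topk == labels.unsqueeze(1)).any(dim=1)  # hit if the true class is among the top k
    return 100.0 * correct.float().mean().item()
```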

Domain generalization results on EK100

Cite us


@article{NASIRIMAJD2025,
  title   = {Domain generalization using action sequences for egocentric action recognition},
  journal = {Pattern Recognition Letters},
  year    = {2025},
  issn    = {0167-8655},
  doi     = {10.1016/j.patrec.2025.06.010},
  url     = {https://www.sciencedirect.com/science/article/pii/S0167865525002387},
  author  = {Amirshayan Nasirimajd and Chiara Plizzari and Simone Alberto Peirone and Marco Ciccone and Giuseppe Averta and Barbara Caputo}
}