Saliency-Driven Multi-Scale Feature Discrepancy Fusion for Fine-Grained Video Anomaly Detection
Abstract
Video Anomaly Detection (VAD) is a critical task in intelligent surveillance, underpinning public safety, traffic management, and emergency response. However, detecting small-scale, transient anomalies in complex scenes remains challenging due to the scarcity of anomaly samples and the difficulty of capturing fine-grained features. To address these issues, this paper proposes a dynamic feature enhancement framework built upon the Masked Autoencoder (MAE) architecture. At its core is the Multi-Scale Discrepancy Saliency Fusion (MDSF) module, which explicitly models and dynamically amplifies channel-wise feature discrepancies between the teacher and student networks, thereby enhancing the saliency of anomalous regions. MDSF further integrates multi-scale semantic features through a saliency-guided fusion strategy, enabling the model to capture anomalies across varying spatial and temporal resolutions. The method is trained end-to-end without pre-trained weights and is evaluated on standard benchmark datasets, including UCSD Ped2, CUHK Avenue, and ShanghaiTech. Experimental results show that the MDSF module significantly improves detection accuracy while keeping computational complexity low, demonstrating its practical value and strong generalization for real-world video anomaly detection.
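To make the mechanism described above concrete, the following PyTorch sketch illustrates one plausible form of the MDSF module. It is a minimal sketch under stated assumptions, not the authors' published implementation: it assumes every scale shares the same channel width, that "discrepancy" means the absolute teacher-student feature difference, and that the saliency weights come from a learned 1x1 projection. All names (MDSF, gate, saliency) are hypothetical.

```python
# Hypothetical sketch of a Multi-Scale Discrepancy Saliency Fusion (MDSF)
# module, based only on the abstract: (1) amplify channel-wise teacher-student
# feature discrepancies, (2) fuse scales under a saliency-guided weighting.
# NOT the authors' code; module and parameter names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MDSF(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Channel-attention gate that dynamically rescales per-channel
        # discrepancies according to their global magnitude.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # 1x1 conv mapping a discrepancy tensor to a spatial saliency map.
        self.saliency = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, teacher_feats, student_feats):
        # teacher_feats / student_feats: lists of (B, C, H_i, W_i) tensors,
        # one entry per scale, ordered from fine to coarse.
        target_size = teacher_feats[0].shape[-2:]
        fused = torch.zeros_like(teacher_feats[0])
        for t, s in zip(teacher_feats, student_feats):
            d = (t - s).abs()                      # channel-wise discrepancy
            d = d * (1.0 + self.gate(d))           # dynamic amplification
            # Bring every scale to the finest resolution before fusion.
            d = F.interpolate(d, size=target_size,
                              mode="bilinear", align_corners=False)
            w = torch.sigmoid(self.saliency(d))    # per-scale saliency weight
            fused = fused + w * d                  # saliency-guided fusion
        return fused                               # anomaly-salient feature map
```

A frame-level anomaly score could then be derived from `fused`, for example as the spatial maximum after averaging over channels; the abstract does not specify the paper's actual scoring function, so that step is likewise an assumption.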
License
Copyright (c) 2025 INNO-PRESS: Journal of Emerging Applied AI

This work is licensed under a Creative Commons Attribution 4.0 International License.