FCSC: Weakly-Supervised Temporal Action Localization via Feature Calibration-assisted Sequence Comparison
DOI:
https://doi.org/10.63313/JCSFT.9032
Keywords:
Weakly-supervised, Multi-modal feature calibration module, Sequence similarity optimization, Maximum consistent subsequence
Abstract
Weakly-supervised temporal action localization aims to identify action categories and their start and end times in untrimmed videos, given only video-level labels. How to calibrate features across modalities in this task, and how to further refine action boundaries using the similarity of common action sequences, remain open problems. To address these issues, we propose a novel network framework: weakly-supervised temporal action localization via feature calibration-assisted sequence comparison (FCSC). The core of FCSC is the Multi-Modal Feature Calibration Module (MFCM), which uses global and local contextual information from the primary and auxiliary modalities to enhance the RGB and optical flow features, respectively, achieving deep feature calibration. In addition, the framework introduces an improved distinguishable edit-distance metric for sequence similarity optimization (SSO) and maximum consistent subsequence (MCS) extraction, narrowing the gap between the classification and localization tasks. Extensive experiments show that FCSC achieves average mAPs of 47.7% and 27.9% on the THUMOS14 and ActivityNet1.2 temporal action localization benchmarks, respectively, verifying the effectiveness of the model.
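To make the calibration idea concrete, the sketch below shows one plausible reading of cross-modal feature calibration in PyTorch: a global context vector pooled from the auxiliary stream gates the primary stream channel-wise, while a local temporal-convolution branch adds a per-snippet refinement. The module name `CrossModalCalibration`, the feature dimension, and the gating/residual design are illustrative assumptions, not the authors' released MFCM implementation.

```python
# Minimal sketch of cross-modal feature calibration (illustrative only,
# not the paper's MFCM code). One modality is treated as "primary", the
# other as "auxiliary": a global context gate and a local temporal branch
# from the auxiliary stream recalibrate the primary features.
import torch
import torch.nn as nn

class CrossModalCalibration(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        # Global branch: pool auxiliary features over time, then produce
        # a channel-wise sigmoid gate for the primary features.
        self.global_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # Local branch: temporal convolution over the auxiliary stream,
        # yielding a per-snippet additive refinement.
        self.local_refine = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, primary: torch.Tensor, auxiliary: torch.Tensor) -> torch.Tensor:
        # primary, auxiliary: (B, T, D) snippet features (e.g. RGB and flow).
        g = self.global_gate(auxiliary.mean(dim=1))           # (B, D) global context
        local = self.local_refine(auxiliary.transpose(1, 2))  # (B, D, T) local context
        return primary * g.unsqueeze(1) + local.transpose(1, 2)

# Calibrate each modality with the other as auxiliary, then fuse.
rgb, flow = torch.randn(2, 160, 1024), torch.randn(2, 160, 1024)
calib = CrossModalCalibration(1024)
rgb_hat = calib(rgb, flow)    # flow as auxiliary for RGB
flow_hat = calib(flow, rgb)   # RGB as auxiliary for flow
fused = torch.cat([rgb_hat, flow_hat], dim=-1)  # (2, 160, 2048)
```

The sequence-comparison side of FCSC rests on two classical string algorithms: an edit distance that scores disagreement between two snippet-label sequences, and the longest common subsequence as one natural reading of a "maximum consistent subsequence". The sketch below implements the plain dynamic programs; the paper's improved, distinguishable variant used for training is not reproduced here, and the example sequences are invented for illustration.

```python
# Hedged sketch of the sequence-comparison primitives (standard DP versions,
# not the paper's improved metric).
def edit_distance(a: list, b: list) -> int:
    # Levenshtein distance between two label sequences.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def longest_common_subsequence(a: list, b: list) -> list:
    # DP table for LCS length, then backtrack to recover the subsequence.
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# Example: per-snippet class predictions from two streams (0 = background).
rgb_seq  = [0, 1, 1, 1, 0, 2, 2, 0]
flow_seq = [0, 0, 1, 1, 2, 2, 2, 0]
print(edit_distance(rgb_seq, flow_seq))               # disagreement score: 3
print(longest_common_subsequence(rgb_seq, flow_seq))  # shared action pattern
```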
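In this toy setting, a smaller edit distance between the two streams' label sequences indicates better cross-modal agreement, and the recovered common subsequence marks the snippets on which both streams concur, which is the kind of signal a sequence-comparison loss can use to tighten action boundaries.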
License
Copyright (c) 2025 Erytis Publishing Limited.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.