WavePolyp: Video Polyp Segmentation via Hierarchical Wavelet-Based Feature Aggregation and Inter-Frame Divergence Perception
International Conference on Learning Representations (ICLR)
Yuhua Zhang1, Guilian Chen1, Yuanqin He1, Huisi Wu1, Jing Qin2
1Shenzhen University
2The Hong Kong Polytechnic University
Abstract
Automatic polyp segmentation from colonoscopy videos is a crucial technique that helps clinicians diagnose more accurately and efficiently, and thereby helps prevent polyps from developing into cancer. However, video polyp segmentation (VPS) remains challenging due to (1) significant inter-frame divergence in videos, (2) the strong camouflage of polyps against normal colon structures, and (3) the clinical requirement for real-time performance. In this paper, we propose a novel segmentation network, WavePolyp, which consists of two innovative components: a hierarchical wavelet-based feature aggregation (HWFA) module and inter-frame divergence perception (IDP) blocks. Specifically, HWFA excavates and amplifies discriminative information from the high-frequency and low-frequency features obtained by wavelet decomposition, and hierarchically aggregates them into refined spatial representations within each frame. This module strengthens intra-frame spatial features, effectively addressing the camouflage of polyps against normal colon structures. Furthermore, IDP perceives and captures inter-frame polyp divergence through a temporal divergence perception mechanism, enabling accurate polyp tracking while mitigating the temporal inconsistencies caused by large variations across frames. Extensive experiments on the SUN-SEG and CVC-612 datasets demonstrate that our method outperforms other state-of-the-art methods. Code will be released upon publication.
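HWFA starts from a wavelet decomposition of intra-frame features into low- and high-frequency components. The abstract does not state which wavelet is used, so the sketch below assumes a single-level orthonormal 2D Haar transform applied to one feature channel. The helper names (`haar_dwt2`, `haar_idwt2`, `amplify_high`) are hypothetical, and the fixed `gain` in `amplify_high` only stands in for the learned excavation and amplification inside HWFA, which is not reproduced here.

```python
import numpy as np


def haar_dwt2(x: np.ndarray):
    """Single-level orthonormal 2D Haar transform of an (H, W) map with even H, W.

    Returns the low-frequency band LL and the three high-frequency detail
    bands (LH, HL, HH), each of shape (H/2, W/2). Assumption: Haar wavelet.
    """
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D map with even height and width")
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    ll = (a + b + c + d) / 2  # local average: smooth, low-frequency content
    lh = (a - b + c - d) / 2  # left minus right: responds to vertical edges
    hl = (a + b - c - d) / 2  # top minus bottom: responds to horizontal edges
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, (lh, hl, hh)


def haar_idwt2(ll: np.ndarray, highs) -> np.ndarray:
    """Inverse of haar_dwt2: rebuild the (2h, 2w) map from the four bands."""
    lh, hl, hh = highs
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=np.result_type(ll, lh, hl, hh))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x


def amplify_high(x: np.ndarray, gain: float) -> np.ndarray:
    """Hypothetical stand-in for HWFA's learned amplification: scale only the
    high-frequency bands by a fixed gain, then reconstruct."""
    ll, highs = haar_dwt2(x)
    return haar_idwt2(ll, tuple(gain * h for h in highs))
```

Because the transform is orthonormal, `haar_idwt2` reconstructs the input exactly and the four sub-bands preserve its energy. A flat region yields all-zero detail bands, so edge and texture cues, such as a polyp boundary, appear only in LH, HL and HH.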

Figure 1: Visual comparison of our proposed method with different state-of-the-art (SOTA) methods on the SUN-SEG-Easy and SUN-SEG-Hard test sets. Red, green and yellow represent the GT, prediction and their overlapping regions, respectively.

Figure 2: Visualization of module ablation on test set SUN-SEG-Hard.

Figure 3: t-SNE visualization of features. Red represents lesion regions, while blue represents non-lesion regions.

Figure 4: Overview of WavePolyp, whose two main components are hierarchical wavelet-based feature aggregation (HWFA) and inter-frame divergence perception (IDP).

Figure 5: Visual analysis of frequency-based feature disentanglement. From left to right: (a) Original Frame, (b) Prediction Mask, (c) High-Frequency Features, and (d) Low-Frequency Features.

Figure 6: Impact of IDP on temporal stability. Row 2: w/o IDP; Row 3: Ours. The IDP module ensures consistent tracking and suppresses flickering. (Red: GT, Green: Pred, Yellow: Overlap).
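The abstract does not specify how IDP measures inter-frame divergence. As a rough illustration only, the sketch below assumes divergence can be approximated by the mean absolute difference between the feature maps of consecutive frames, and adds a simple flicker check: the IoU between consecutive predicted masks, which drops sharply when a prediction vanishes for one frame, the kind of instability Figure 6 shows without IDP. Both functions (`divergence_map`, `consecutive_iou`) are hypothetical; they are neither the paper's IDP block nor its evaluation metric.

```python
import numpy as np


def divergence_map(feat_prev: np.ndarray, feat_curr: np.ndarray) -> np.ndarray:
    """Hypothetical inter-frame divergence: mean absolute difference over the
    channel axis of two (C, H, W) feature maps, giving an (H, W) map."""
    if feat_prev.shape != feat_curr.shape:
        raise ValueError("feature maps must have the same shape")
    return np.abs(feat_curr - feat_prev).mean(axis=0)


def consecutive_iou(masks: np.ndarray) -> np.ndarray:
    """IoU between each pair of consecutive binary masks in a (T, H, W) stack.

    Returns T-1 values; low values flag frame-to-frame flicker. Two empty
    masks are treated as perfectly consistent (IoU = 1).
    """
    prev, curr = masks[:-1].astype(bool), masks[1:].astype(bool)
    inter = np.logical_and(prev, curr).sum(axis=(1, 2))
    union = np.logical_or(prev, curr).sum(axis=(1, 2))
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0)
```

Because `consecutive_iou` only inspects output masks, it can be applied to any segmentation model's predictions on a video clip.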
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (No. 62273241), the Natural Science Foundation of Guangdong Province, China (No. 2024A1515011946), the Shenzhen Research Foundation for Basic Research, China (No. JCYJ20250604181940054), and a grant under the Collaborative Research with World-leading Research Groups scheme of The Hong Kong Polytechnic University (project no. G-SACF).
BibTeX
@article{zhangwavepolyp,
title={{WavePolyp}: Video Polyp Segmentation via Hierarchical Wavelet-Based Feature Aggregation and Inter-Frame Divergence Perception},
author={Zhang, Yuhua and Chen, Guilian and He, Yuanqin and Wu, Huisi and Qin, Jing}
}