Super-efficient Echocardiography Video Segmentation via Proxy- and Kernel-Based Semi-supervised Learning
Association for the Advancement of Artificial Intelligence (AAAI)
Huisi Wu1* Jingyin Lin1 Wende Xie1 Jing Qin2
1Shenzhen University 2The Hong Kong Polytechnic University
Automatic segmentation of the left ventricular endocardium in echocardiography videos is critical for assessing various cardiac functions and improving the diagnosis of cardiac diseases. It remains a challenging task, however, owing to heavy speckle noise, significant shape variability of cardiac structures, and limited labeled data. In particular, the demand for real-time performance in clinical practice makes the task even harder. In this paper, we propose a novel proxy- and kernel-based semi-supervised segmentation network (PKEcho-Net) to comprehensively address these challenges. We first propose a multi-scale region proxy (MRP) mechanism to model region-wise contexts, in which a learnable region proxy with an arbitrary shape is developed in each layer of the encoder, allowing the network to identify homogeneous semantics and hence alleviate the influence of speckle noise on segmentation. To sufficiently and efficiently exploit temporal consistency, and unlike traditional methods that only utilize the temporal contexts of two neighboring frames via feature warping or a self-attention mechanism, we formulate semi-supervised segmentation with a group of learnable kernels, which can naturally and uniformly encode the appearance of the left ventricular endocardium and extract inter-frame contexts across the whole video, so as to cope with the rapid shape variability of cardiac structures. Extensive experiments have been conducted on two well-known public echocardiography video datasets, EchoNet-Dynamic and CAMUS. Our model achieves the best performance-efficiency trade-off compared with other state-of-the-art approaches, attaining comparable accuracy at a much faster speed.
Figure 1: Challenges in echocardiography video segmentation. (a) Speckle noise and blurred contours. (b)-(c) Inter- and intra-sequence shape variability of cardiac structures.
Figure 2: Overview of our PKEcho-Net, which mainly consists of an MRP module and a KSS mechanism for semi-supervised segmentation of echocardiography videos (only the first frame in ED and the last frame in ES are annotated).
Figure 3: More details of MRP. (a) Our MRP mechanism in each layer of the encoder. (b) Region proxies are established by learning the relationships between the 3×3 neighborhood of each pixel in f_i and the corresponding regions in f_{i-1}.
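To make the region-proxy idea concrete, the following is a minimal sketch of how a soft pixel-to-proxy assignment and the resulting proxy update could be computed. All names and the exact aggregation rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def region_proxy_affinity(f_i, proxies):
    """Soft-assign each pixel of a feature map to a set of region proxies.

    f_i:     (H, W, C) feature map of the current encoder layer
    proxies: (N, C) learnable region-proxy embeddings
    returns: (H, W, N) affinity map; each pixel's weights sum to 1 over N
    """
    logits = f_i @ proxies.T                       # (H, W, N) similarity scores
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def update_proxies(f_i, affinity):
    """Aggregate pixel features into updated proxy embeddings (weighted mean),
    so that each proxy summarizes one homogeneous, arbitrarily shaped region."""
    C = f_i.shape[-1]
    N = affinity.shape[-1]
    w = affinity.reshape(-1, N)                    # (HW, N) assignment weights
    feats = f_i.reshape(-1, C)                     # (HW, C) pixel features
    denom = w.sum(axis=0).reshape(-1, 1) + 1e-6    # (N, 1) normalizer
    return (w.T @ feats) / denom                   # (N, C) updated proxies
```

Because each pixel is pooled into a region-level representation rather than attended to individually, per-pixel speckle noise is averaged out within each homogeneous region.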
Figure 4: Illustration of KSS. We iteratively train a group of learnable kernels to uniformly encode the consistent left ventricular appearance and the inter-frame contexts across the whole video, which is also simpler and more memory-efficient.
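One iteration of such kernel-based mask prediction could be sketched as below, in the spirit of dynamic-kernel segmentation (e.g. K-Net-style updates). The update rule and all names here are illustrative assumptions, not the paper's exact formulation; the key point is that the same kernels see mask-pooled evidence from every frame of the clip at once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kernel_segmentation_step(feats, kernels):
    """One refinement iteration of kernel-based mask prediction.

    feats:   (T, H, W, C) features for all T frames of the clip
    kernels: (K, C) learnable mask kernels shared across the whole video
    returns: masks (T, H, W, K) and refined kernels (K, C)
    """
    # Each kernel acts as a 1x1 convolution producing one mask per frame.
    masks = sigmoid(np.einsum('thwc,kc->thwk', feats, kernels))
    # Mask-pooled group features: aggregate per-kernel evidence over all
    # frames jointly, injecting inter-frame context into each kernel.
    pooled = np.einsum('thwk,thwc->kc', masks, feats)
    pooled /= masks.sum(axis=(0, 1, 2)).reshape(-1, 1) + 1e-6
    kernels = kernels + pooled   # simple additive update (sketch)
    return masks, kernels
```

Because the kernels, not per-frame pairwise attention maps, carry the temporal context, memory grows with the number of kernels rather than with the squared number of pixels.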
Figure 5: Visual comparison with different state-of-the-art methods on the CAMUS and EchoNet-Dynamic test sets. Red, green, and yellow regions represent the ground truth, prediction, and their overlapping regions, respectively.
Figure 6: Correlation graphs for the clinical metrics on the CAMUS (left) and EchoNet-Dynamic (right) test sets.
Figure 7: Performance vs. efficiency on the CAMUS test set.
Figure 8: Visual comparison of feature maps restored from different affinity maps. (a) Input image. (b) Ground truth. (c) Feature map extracted by the backbone. (d)-(g) Feature maps gradually restored to the input image size using affinity maps.
This work was supported in part by the National Natural Science Foundation of China (Nos. 61973221 and 62273241), the Natural Science Foundation of Guangdong Province, China (No. 2019A1515011165), the COVID-19 Prevention Project of Guangdong Province, China (No. 2020KZDZX1174), the Major Project of the New Generation of Artificial Intelligence (No. 2018AAA0102900), and the Hong Kong Research Grants Council under the General Research Fund Scheme (Project No. 15205919).