College of Computer Science and Software Engineering, Shenzhen University

MOGeo: Beyond One-to-One Cross-View Object Geo-localization

IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Bo Lv, Qingwang Zhang, Le Wu, Yuanyuan Li, Yingying Zhu^*

Shenzhen University

Figure 1. Comparison of cross-view object geo-localization in single-object and multi-object scenarios. Click points represent query objects, while bounding boxes in geo-tagged satellite images indicate location information. Points and bounding boxes of the same color form an object pair, such as p₁ and b₁, where b₁’s geographic location is considered the position of p₁.

Abstract

Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object. This assumption does not match the complex, multi-object geo-localization requirements of real-world applications, which makes these methods unsuitable for practical scenarios. To bridge the gap between the realistic setting and the existing task, we propose a new task, called Cross-View Multi-Object Geo-Localization (CVMOGL). To advance the CVMOGL task, we first construct a benchmark, CMLocation, which includes two datasets: CMLocation-V1 and CMLocation-V2. Furthermore, we propose a novel cross-view multi-object geo-localization method, MOGeo, and benchmark it against existing state-of-the-art methods. Extensive experiments are conducted under various application scenarios to validate the effectiveness of our method. The results demonstrate that cross-view object geo-localization in the more realistic setting remains a challenging problem, encouraging further research in this area.

Figure 2. Overview of our framework.

Figure 3. Cross-view multi-object feature fusion module. As shown in the figure, this module sequentially fuses query object features with the reference image to obtain an attention map for each object, and finally combines fused features with the attention map to further enhance localization in the reference image.
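The flow described in the Figure 3 caption can be illustrated with a toy sketch. This is not the MOGeo module itself, which is a learned network whose details are not given on this page. The sketch assumes a plain dot-product similarity followed by a softmax over reference locations, and the names `attention_map` and `fuse_objects` are hypothetical. It only shows the general pattern: each query object is processed in turn, yields its own attention map over the reference image, and that map reweights the reference features.

```python
import math

def attention_map(query_feat, ref_feats):
    """Softmax-normalized dot-product similarity between one query object
    feature (length C) and every reference location (H x W x C).
    Dot product + softmax is an assumption, not the paper's design."""
    scores = [[sum(q * r for q, r in zip(query_feat, cell)) for cell in row]
              for row in ref_feats]
    peak = max(max(row) for row in scores)  # subtract max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def fuse_objects(query_feats, ref_feats):
    """Process query objects one after another and return, per object,
    (attention map, attention-weighted reference features)."""
    results = []
    for q in query_feats:
        attn = attention_map(q, ref_feats)
        weighted = [[[a * r for r in cell] for a, cell in zip(arow, rrow)]
                    for arow, rrow in zip(attn, ref_feats)]
        results.append((attn, weighted))
    return results

# Toy 2x2 reference map with 2-channel features and two query objects.
ref = [[[1.0, 0.0], [0.0, 1.0]],
       [[0.0, 0.0], [1.0, 1.0]]]
queries = [[4.0, 0.0], [0.0, 4.0]]
outputs = fuse_objects(queries, ref)
```

In this toy setup each query object gets a separate attention map, so two objects in the same query image can point to different regions of the same reference image, which is the behavior the multi-object setting requires.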

Figure 4. Relationship between model parameters and inference time, measured on the validation set of CMLocation-V1.

Acknowledgement

This work was supported in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2026A1515011137, in part by the Key Project of the Department of Education of Guangdong Province under Grant 2023ZDZX1016, and in part by the Shenzhen Science and Technology Program under Grant JCYJ20240813142510014 and Grant 20220810142553001.

Downloads

Paper