نبذة مختصرة : Urban scene understanding and functional identification are essential for accurately characterizing the spatial structure and optimizing the city layouts during rapid urbanization. Multimodal data is important for recognizing the distribution patterns of urban functions and revealing internal details. Previous studies have focused primarily on remote sensing imagery and points of interest (POIs) data, overlooking the role of building characteristics in determining functions of urban scenes. These studies are also limited in terms of mining and fusing multimodal features. To address these challenges, this study proposes a multimodal fusion framework that integrates remote sensing imagery, POIs, and building footprints for urban scene understanding and functional mapping. The framework employs a dual-branch model that extracts visual semantic features from the remote sensing imagery and socioeconomic features from auxiliary data, such as POIs and building footprints. A branch attention module is designed to assign weights to dual-branch features. Additionally, a multiscale feature fusion module is introduced to extract and combine multiscale features through modal interaction. Experiments in Beijing and Chengdu validate the effectiveness of the proposed framework with overall accuracy of 90.04% and 92.07%, and kappa coefficient of 0.881 and 0.895, respectively. This study provides empirical evidence to support accurate urban planning and further promote urban sustainable development. The source code is at: https://github.com/sssuchen/MMFF.
No Comments.