REST

REST: Holistic Learning for End-to-End Semantic Segmentation of Whole-Scene Remote Sensing Imagery

¹Wuhan University, ²University of Trento, ³Cornell University,
⁴South China University of Technology, ⁵Purdue University

IEEE TPAMI 2025

Abstract

Semantic segmentation of remote sensing imagery (RSI) is a fundamental task that aims at assigning a category label to each pixel. To pursue precise segmentation with one or more fine-grained categories, semantic segmentation often requires holistic segmentation of whole-scene RSI (WRI), which is normally characterized by a large size. However, conventional deep learning methods struggle to handle holistic segmentation of WRI due to the memory limitations of the graphics processing unit (GPU), forcing the use of suboptimal strategies such as cropping or fusion, which degrade performance. Here, we introduce the Robust End-to-end semantic Segmentation architecture for whole-scene remoTe sensing imagery (REST). REST is the first intrinsically end-to-end framework for truly holistic segmentation of WRI, supporting a wide range of encoders and decoders in a plug-and-play fashion. It enables seamless integration with mainstream semantic segmentation methods, and even with more advanced foundation models. Specifically, we propose a novel spatial parallel interaction mechanism (SPIM) within REST to overcome GPU memory constraints and achieve global context awareness. Unlike traditional parallel methods, SPIM enables REST to process a WRI effectively and efficiently by combining parallel computation with a divide-and-conquer strategy. Both theoretical analysis and experiments demonstrate that REST attains near-linear throughput scalability as additional GPUs are employed. Extensive experiments demonstrate that REST consistently outperforms existing cropping-based and fusion-based methods across a variety of scenarios, ranging from single-class to multi-class segmentation, from multispectral to hyperspectral imagery, and from satellite to drone platforms. The robustness and versatility of REST are expected to offer a promising solution for the holistic segmentation of WRI, with the potential for further extension to large-size medical imagery segmentation.
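
To make the divide-and-conquer idea concrete, the following minimal Python sketch partitions a whole-scene image extent into a grid of non-overlapping tiles, one per GPU. It illustrates only the spatial "divide" step. It is not the SPIM implementation, and it omits the cross-tile interaction that SPIM relies on for global context awareness. The function names `split_extent` and `partition_scene`, the 2×2 grid, and the 6800×7200 scene size are assumptions made for illustration.

```python
from typing import List, Tuple

Tile = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def split_extent(length: int, parts: int) -> List[Tuple[int, int]]:
    """Split [0, length) into `parts` contiguous ranges of near-equal size."""
    if parts <= 0 or length < parts:
        raise ValueError("need 1 <= parts <= length")
    base, extra = divmod(length, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` ranges receive one additional pixel.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def partition_scene(height: int, width: int, rows: int, cols: int) -> List[Tile]:
    """Return a rows x cols grid of non-overlapping tiles covering the scene."""
    return [
        (top, left, bottom, right)
        for top, bottom in split_extent(height, rows)
        for left, right in split_extent(width, cols)
    ]


if __name__ == "__main__":
    # Hypothetical scene size and device count, chosen only for illustration.
    tiles = partition_scene(height=6800, width=7200, rows=2, cols=2)
    for device_id, (top, left, bottom, right) in enumerate(tiles):
        print(f"device {device_id}: rows {top}:{bottom}, cols {left}:{right}")
```

Unlike cropping-based inference, where each tile is segmented in isolation, the tiles here are only a way to distribute memory across devices. Any exchange of features between neighbouring tiles would have to be added on top of this partition.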

The superiority of REST

Per-class result comparison on the Five-Billion-Pixels dataset. SkySense with UPerNet is chosen as the baseline.

Comparison of performance (IoU) and efficiency (inference time) across different methods on the GLH-Water dataset.
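
For readers unfamiliar with the metric, the sketch below computes intersection over union (IoU) for a single foreground class such as water. It is a generic illustration of the standard definition, not the evaluation code used for these experiments. The function name `binary_iou` and the choice to return 1.0 when both masks are empty are assumptions.

```python
from typing import Sequence


def binary_iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """IoU between two binary masks given as nested sequences of 0/1 values."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    # Tiny hypothetical masks: 1 marks a water pixel, 0 marks background.
    predicted = [[1, 1], [0, 0]]
    reference = [[1, 0], [1, 0]]
    print(f"IoU = {binary_iou(predicted, reference):.3f}")
```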

Performance of REST with different baselines (i.e., encoders and decoders) on the Five-Billion-Pixels dataset.

Performance of REST on the Five-Billion-Pixels dataset with different image sizes. SkySense with UPerNet is chosen as the baseline.

REST further improves the strong capabilities of various remote sensing foundation models (RSFMs) on the Five-Billion-Pixels dataset. Without REST, RSFMs can only handle cropped image tiles with a size of 2048×2048. With REST integrated, RSFMs can handle whole-scene remote sensing imagery with clearly improved performance and only slightly more model parameters. SkySense (Swin-Huge) requires 16 NVIDIA A100 GPUs for holistic segmentation, while 4 NVIDIA A100 GPUs are enough for the others.

REST is compatible with remote sensing foundation models

Explainability analysis of REST

Visualization of feature maps on the Five-Billion-Pixels dataset. Enhanced with REST, the model can exploit the features of the entire spatial region in the WRI. SkySense with UPerNet is chosen as the baseline.

Visualization of t-SNE results on the Five-Billion-Pixels dataset. SkySense with UPerNet is chosen as the baseline. When combined with REST, the features show more distinct class boundaries, indicating that REST strengthens feature representation.

Confusion matrices of the results on the Five-Billion-Pixels dataset. (a) Confusion matrix of the baseline. (b) Confusion matrix of the baseline with REST. SkySense with UPerNet is chosen as the baseline. After REST is introduced, accuracy improves across categories and confusion between fine-grained categories decreases.
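
As a reading aid for the confusion matrices, the sketch below shows how per-class accuracy can be read off a confusion matrix by dividing each diagonal entry by its row total. It assumes rows are ground-truth classes and columns are predicted classes, which may differ from the layout used in the figure. The function name `per_class_accuracy`, the class names, and the counts are hypothetical and are not results from the paper.

```python
from typing import List, Sequence


def per_class_accuracy(confusion: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of each ground-truth class that was predicted correctly.

    Assumes confusion[i][j] counts pixels of true class i predicted as class j.
    """
    accuracies = []
    for class_index, row in enumerate(confusion):
        total = sum(row)
        # A class with no ground-truth pixels has undefined accuracy.
        accuracies.append(row[class_index] / total if total else float("nan"))
    return accuracies


if __name__ == "__main__":
    # Hypothetical counts for three fine-grained water classes.
    class_names = ["river", "lake", "pond"]
    confusion = [
        [8, 1, 1],  # true river
        [2, 6, 2],  # true lake
        [0, 5, 5],  # true pond
    ]
    for name, accuracy in zip(class_names, per_class_accuracy(confusion)):
        print(f"{name}: {accuracy:.2f}")
```

Large off-diagonal entries, such as the hypothetical pond pixels predicted as lake in the last row, are what confusion between fine-grained categories looks like in such a matrix.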

Visualization of experimental results

Visualization of segmentation results on the Five-Billion-Pixels dataset. REST successfully distinguishes between fine-grained categories (e.g., river, lake, pond), whereas competing methods mostly confuse them.

Visualization of segmentation results on the GLH-Water dataset. REST accurately extracts the complete water body, whereas the competing methods miss parts of it in various locations.

Visualization of segmentation results on the UAVid dataset. Compared with other methods, REST correctly identifies the vehicle in the image as a moving car rather than a static car, extracts the road completely, and markedly reduces misclassification.

Visualization of segmentation results on the WHU-OHS dataset. REST produces better segmentation results than the competing methods even on this challenging hyperspectral imagery dataset.

BibTeX

@article{rest2025,
  title={REST: Holistic Learning for End-to-End Semantic Segmentation of Whole-Scene Remote Sensing Imagery},
  author={Chen, Wei and Bruzzone, Lorenzo and Dang, Bo and Gao, Yuan and Deng, Youming and Yu, Jin-Gang and Yuan, Liangqi and Li, Yansheng},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
  volume={},
  number={},
  pages={1-18},
  publisher={IEEE},
  doi={10.1109/TPAMI.2025.3609767}
}