Abstract: Image Coding for Machines (ICM) aims to develop systems and frameworks for image compression and transmission tailored to computer vision tasks such as object detection and instance segmentation. This paper focuses on multi-task ICM that targets not only object detection and instance segmentation but also image compression. In recent years, Contrastive Language-Image Pre-training (CLIP) has demonstrated a powerful capability for extracting text and image features and has been applied to various vision tasks. Inspired by CLIP, the channel-wise context model, and the masked convolutional neural network (PixelCNN), we propose a Blur CLIP context model (BCcm) for reducing bitrate usage and a two-hyperprior multi-task framework augmented by BCcm (TMFBC). Through experiments, we observe that down- and up-sample resize pairs can significantly reduce bitrate usage, so we integrate a plug-and-play down- and up-sample resize pair at both ends of TMFBC. Additionally, we propose a ResNet Pyramid Hierarchical Feature Extractor (RPHFE), built on down- and up-sample resize pairs, to endow decoded images with richer multi-scale features. We term the resulting pipeline TMFBC with down- and up-sample pairs (TMFBC-SP). We compare the proposed TMFBC-SP with state-of-the-art (SOTA) methods and show that it achieves higher object detection and instance segmentation accuracy, measured by mean average precision (mAP), at lower bitrates than existing approaches. We also compare our method with coarse-to-fine and context models on image compression; it achieves higher PSNR at a lower bitrate.
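As a rough illustration of the plug-and-play down- and up-sample resize pair mentioned above, the minimal sketch below wraps an arbitrary compression model with a downsampling step before coding and an upsampling step after reconstruction. The `ResizePairWrapper` class name, the scale factor, and the interpolation mode are hypothetical choices for illustration, not the authors' implementation.

```python
# Minimal sketch of a plug-and-play down/up-sample resize pair around a codec.
# The wrapped `codec` module, scale factor, and interpolation mode are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResizePairWrapper(nn.Module):
    def __init__(self, codec: nn.Module, scale: float = 0.5, mode: str = "bicubic"):
        super().__init__()
        self.codec = codec    # any image compression model (placeholder here)
        self.scale = scale    # downsampling factor applied before coding
        self.mode = mode      # interpolation used by the resize pair

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        # Downsample before compression to reduce the number of coded pixels.
        x_small = F.interpolate(x, scale_factor=self.scale, mode=self.mode,
                                align_corners=False)
        x_hat_small = self.codec(x_small)
        # Upsample the reconstruction back to the original resolution.
        return F.interpolate(x_hat_small, size=(h, w), mode=self.mode,
                             align_corners=False)


if __name__ == "__main__":
    identity_codec = nn.Identity()                 # stand-in for a learned codec
    wrapper = ResizePairWrapper(identity_codec, scale=0.5)
    img = torch.rand(1, 3, 512, 512)
    print(wrapper(img).shape)                      # torch.Size([1, 3, 512, 512])
```

Because the wrapper only resizes tensors at the input and output, it can be attached to an existing compression pipeline without modifying the codec itself, which is the sense in which such a resize pair is "plug-and-play".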