Cross-modal Co-occurrence Attributes Alignments for Person Search by Language
Published in ACM International Conference on Multimedia (ACM MM), 2022
Recommended citation: Kai Niu, Linjiang Huang (Corresponding Author), Yan Huang, Peng Wang, Liang Wang, Yanning Zhang. "Cross-modal Co-occurrence Attributes Alignments for Person Search by Language". ACM International Conference on Multimedia (ACM MM), 2022.
Abstract
Person search by language refers to retrieving pedestrian images of interest based on a free-form natural language description, which has important applications in smart video surveillance. Although great efforts have been made to align images with sentences, the challenge of reporting bias, i.e., attributes being only partially matched across modalities, still introduces noise and seriously hampers accurate retrieval. To address this challenge, we propose a novel cross-modal matching method called Matrix Decomposition Guided Cross-modal Attribute Alignments (MDGCAA), which better handles noise and obtains significant improvements in retrieval performance for person search by language. First, we construct visual and textual attribute dictionaries via matrix decomposition, and then carry out cross-modal alignments using denoising reconstruction features to suppress the noise from pedestrian-unrelated elements. Second, we re-gather image pixels and sentence words under the guidance of the attribute dictionaries, adaptively constituting more discriminative co-occurrence attributes in both modalities. The re-gathered co-occurrence attributes are then aligned one-to-one across modalities, alleviating the noise from non-corresponding attributes. The whole MDGCAA method can be trained end-to-end without any pre-processing, i.e., it incurs no additional computational overhead. It significantly outperforms existing solutions and achieves state-of-the-art retrieval performance on two large-scale benchmarks, CUHK-PEDES and RSTPReid.
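To make the two-step idea in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of dictionary-guided denoising reconstruction and attribute-wise re-gathering. The class name, dimensions, and the softmax-based assignment standing in for the paper's matrix-decomposition coefficients are all assumptions for illustration only.

```python
# Hypothetical sketch: a learnable attribute dictionary reconstructs token features
# (image pixels or sentence words), and tokens are re-gathered per dictionary atom
# for one-to-one cross-modal alignment. Softmax assignments stand in for the
# matrix-decomposition coefficients described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeDictionaryAlign(nn.Module):
    def __init__(self, dim=256, num_atoms=12):
        super().__init__()
        # Each dictionary atom represents one latent pedestrian attribute.
        self.dictionary = nn.Parameter(torch.randn(num_atoms, dim) * 0.02)

    def forward(self, feats):
        # feats: (batch, tokens, dim) -- image pixels or word embeddings.
        atoms = F.normalize(self.dictionary, dim=-1)            # (A, D)
        coeff = torch.softmax(feats @ atoms.t(), dim=-1)        # token-to-atom assignments (B, T, A)
        recon = coeff @ atoms                                    # denoised reconstruction (B, T, D)
        # Re-gather tokens per attribute via normalized weighted pooling.
        weights = coeff / (coeff.sum(dim=1, keepdim=True) + 1e-6)
        attr_feats = weights.transpose(1, 2) @ feats             # (B, A, D)
        return recon, attr_feats

def one_to_one_alignment_loss(img_attrs, txt_attrs):
    # Align the a-th visual attribute with the a-th textual attribute (cosine distance).
    img = F.normalize(img_attrs, dim=-1)
    txt = F.normalize(txt_attrs, dim=-1)
    return (1 - (img * txt).sum(-1)).mean()

# Usage with random tensors standing in for backbone outputs.
module = AttributeDictionaryAlign(dim=256, num_atoms=12)
pixels = torch.randn(4, 48, 256)   # pooled image regions
words = torch.randn(4, 20, 256)    # word embeddings
_, img_attrs = module(pixels)
_, txt_attrs = module(words)
loss = one_to_one_alignment_loss(img_attrs, txt_attrs)
```

In this sketch the same dictionary module is shared across modalities so that the a-th visual and textual attribute features can be aligned directly; the actual method's dictionary construction and alignment losses may differ in detail.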