Relation-aware Attention for Video Captioning via Graph Learning

Published in Pattern Recognition (PR, Impact Factor: 8.0), 136: 109204, April 2023

Video captioning often uses an attentive encoder-decoder as the baseline model. However, the conventional attention mechanism still suffers from two problems. First, the attended visual feature is often irrelevant to the target word state, because the attention process uses only a unidirectional flow from vision to linguistics and lacks the reverse flow. Second, each attention result is computed independently, conditioned only on the previous word states and ignoring the attention information from the past and future, which is inconsistent with how humans attend. In this paper, we improve the conventional attention mechanism into a relation-aware attention mechanism. To this end, we propose two graph learning strategies: a linguistics-to-vision heterogeneous graph (HTG) and a vision-to-vision homogeneous graph (HMG). The HTG enhances the inter-relation of attention by reversely modeling the relation of each word to every attended visual feature, supporting proper semantic alignment between them. The HMG enhances the intra-relation of attention by capturing the relations among all of the attended visual features, which leverages attention information from the past and future to guide the current attention step. Extensive experiments on two public datasets show that our proposed method not only significantly improves the baseline model but also outperforms state-of-the-art methods.

Download paper here
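
The sketch below is a minimal, illustrative toy (not the authors' released code) of the idea described in the abstract: conventional attention produces one attended visual feature per decoding step, a vision-to-vision graph (HMG) then relates all attended features across past and future steps, and a linguistics-to-vision graph (HTG) models the reverse flow from word states back to the attended features. It assumes PyTorch; all layer names, adjacency constructions, and tensor shapes are assumptions for illustration only.

```python
# Minimal sketch of relation-aware attention with graph learning.
# Not the paper's implementation; layer names and shapes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationAwareAttention(nn.Module):
    def __init__(self, vis_dim, word_dim, hid_dim):
        super().__init__()
        # Conventional additive attention (the baseline component).
        self.att_v = nn.Linear(vis_dim, hid_dim)
        self.att_h = nn.Linear(word_dim, hid_dim)
        self.att_out = nn.Linear(hid_dim, 1)
        # Projections used to build the two graphs (assumed forms).
        self.hmg_proj = nn.Linear(vis_dim, vis_dim)   # vision-to-vision graph
        self.htg_proj = nn.Linear(word_dim, vis_dim)  # linguistics-to-vision graph

    def attend(self, feats, word_states):
        # feats: (B, N, vis_dim), word_states: (B, T, word_dim)
        scores = self.att_out(torch.tanh(
            self.att_v(feats).unsqueeze(1) + self.att_h(word_states).unsqueeze(2)))
        alpha = F.softmax(scores.squeeze(-1), dim=-1)     # (B, T, N)
        return torch.bmm(alpha, feats)                    # one attended feature per step

    def forward(self, feats, word_states):
        ctx = self.attend(feats, word_states)             # (B, T, vis_dim)
        # HMG: adjacency over all attended features (past and future steps),
        # then propagate information among them.
        hmg_adj = F.softmax(
            torch.bmm(ctx, self.hmg_proj(ctx).transpose(1, 2)), dim=-1)  # (B, T, T)
        ctx = ctx + torch.bmm(hmg_adj, ctx)
        # HTG: reverse flow, relating each word state to every attended feature.
        q = self.htg_proj(word_states)                    # (B, T, vis_dim)
        htg_adj = F.softmax(torch.bmm(q, ctx.transpose(1, 2)), dim=-1)   # (B, T, T)
        return ctx + torch.bmm(htg_adj, ctx)


# Usage with random tensors: 2 videos, 20 frame features, 8 decoding steps.
feats = torch.randn(2, 20, 512)
words = torch.randn(2, 8, 256)
out = RelationAwareAttention(512, 256, 128)(feats, words)
print(out.shape)  # torch.Size([2, 8, 512])
```

The point of the sketch is the two residual graph-propagation steps after conventional attention: the HMG-style step lets each attended feature see the others before it is used, and the HTG-style step re-aligns word states with the attended features; the paper's actual graph construction and training details are given in the publication above.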