Video Description with Spatial-Temporal Attention
Published in Proceedings of the ACM International Conference on Multimedia (ACM MM), pp. 1014–1022, 2017
Temporal attention has been widely used in video description to adaptively focus on important frames. However, most existing methods based on temporal attention suffer from recognition errors and missing details, because they employ only coarse frame-level global features. Inspired by recent successful work on image description with spatial attention, we propose a spatial-temporal attention (STAT) method to address these problems. In particular, we first exploit object-level local features to address the problem of missing details. Second, the STAT method selects relevant local features by spatial attention and then attends to important frames by temporal attention to recognize the related semantics. The proposed two-stage attention mechanism recognizes salient objects more precisely and with higher recall, and automatically focuses on the most relevant spatial-temporal segments given the sentence context. Extensive experiments on two well-known benchmarks show that the STAT method outperforms state-of-the-art methods on MSVD with a BLEU-4 score of 0.511, and achieves a superior BLEU-4 score of 0.374 on MSR-VTT-10K. Compared to a variant without local features, the relative improvements of our STAT method are 10.1% and 0.8% on the two benchmarks, respectively; compared to a variant using only temporal attention, the relative improvements are 18.3% and 9.0%, respectively.
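To make the two-stage mechanism concrete, the following is a minimal PyTorch sketch of spatial-then-temporal attention conditioned on the decoder state (the sentence context). The dimensions, layer names, and the additive (Bahdanau-style) scoring function are illustrative assumptions, not the paper's exact architecture: spatial attention first pools object-level features into one vector per frame, and temporal attention then pools those frame vectors into a single context vector for predicting the next word.

```python
# Hypothetical sketch of two-stage spatial-temporal attention.
# Assumptions: additive attention scoring, object-level features already
# extracted per frame; the paper's actual layers may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTemporalAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        # Stage 1 (spatial): scores object-level features within each frame.
        self.spatial_feat = nn.Linear(feat_dim, attn_dim)
        self.spatial_hid = nn.Linear(hidden_dim, attn_dim)
        self.spatial_score = nn.Linear(attn_dim, 1)
        # Stage 2 (temporal): scores the spatially attended frame vectors.
        self.temporal_feat = nn.Linear(feat_dim, attn_dim)
        self.temporal_hid = nn.Linear(hidden_dim, attn_dim)
        self.temporal_score = nn.Linear(attn_dim, 1)

    def forward(self, objects: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        """objects: (batch, frames, regions, feat_dim) object-level local features.
        hidden:  (batch, hidden_dim) decoder state, i.e. the sentence context.
        Returns a (batch, feat_dim) context vector for generating the next word.
        """
        b, t, r, d = objects.shape
        h = hidden.view(b, 1, 1, -1)

        # Spatial attention: select relevant objects inside each frame.
        s_scores = self.spatial_score(
            torch.tanh(self.spatial_feat(objects) + self.spatial_hid(h))
        ).squeeze(-1)                                   # (b, t, r)
        s_weights = F.softmax(s_scores, dim=-1)
        frames = (s_weights.unsqueeze(-1) * objects).sum(dim=2)  # (b, t, d)

        # Temporal attention: focus on the most relevant frames.
        t_scores = self.temporal_score(
            torch.tanh(self.temporal_feat(frames)
                       + self.temporal_hid(hidden.view(b, 1, -1)))
        ).squeeze(-1)                                   # (b, t)
        t_weights = F.softmax(t_scores, dim=-1)
        return (t_weights.unsqueeze(-1) * frames).sum(dim=1)     # (b, d)


# Usage sketch: 2 videos, 26 sampled frames, 10 detected regions each.
attn = SpatialTemporalAttention(feat_dim=2048, hidden_dim=512)
context = attn(torch.randn(2, 26, 10, 2048), torch.randn(2, 512))
print(context.shape)  # torch.Size([2, 2048])
```

In this sketch the decoder would call the module once per generated word, so the spatial and temporal weights are recomputed as the sentence context evolves, which is what lets the model shift its focus across objects and frames during decoding.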