Cross-Modal and Hierarchical Modeling of Video and Text

Bowen Zhang*, Hexiang Hu*, and Fei Sha


Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene composed of multiple clips or shots, each depicting a semantically coherent event or action. Similarly, a paragraph may contain sentences on different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate superior performance by the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks, and show their utility in zero-shot action recognition and video captioning.


  1. Propose hierarchical modeling of cross-modal sequential data.
  2. Preserve the correspondence of complex structures across modalities through discriminative and contrastive losses.
  3. Achieve state-of-the-art performance on video and paragraph retrieval.
  4. Present a systematic study of several tasks involving video and language.
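
To make the idea of a contrastive loss over paired video and text embeddings concrete, here is a minimal NumPy sketch of a bidirectional max-margin ranking loss, one common form of contrastive objective in cross-modal retrieval. It is an illustration under stated assumptions, not the paper's implementation: the function names, the margin value, and the use of cosine similarity are hypothetical choices made here, and the encoders that produce the embeddings are left out entirely.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def bidirectional_ranking_loss(video_emb, text_emb, margin=0.2):
    """Max-margin ranking loss over a batch of paired embeddings.

    Assumption (not from the source): row i of `video_emb` and row i of
    `text_emb` describe the same item, and every other row in the batch
    serves as a negative example.
    """
    v = l2_normalize(np.asarray(video_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))

    sim = v @ t.T          # sim[i, j]: cosine similarity of video i and text j
    pos = np.diag(sim)     # similarity of each matched pair

    # Video -> text: a mismatched text should score at least `margin`
    # lower than the matched text for the same video.
    cost_v2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Text -> video: the same constraint with the roles swapped.
    cost_t2v = np.maximum(0.0, margin + sim - pos[None, :])

    # Exclude the matched pairs themselves from the penalty.
    mask = 1.0 - np.eye(sim.shape[0])
    return float(((cost_v2t + cost_t2v) * mask).sum() / sim.shape[0])
```

In a hierarchical setting, a loss of this kind could in principle be applied at more than one level, for example once to clip and sentence embeddings and once to video and paragraph embeddings; how the levels are combined is a design choice this sketch does not address.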



  1. Video and Text Retrieval: ActivityNet Dense Caption Dataset and DiDeMo Dataset
  2. Video Captioning: ActivityNet Dense Caption Dataset
  3. Zero-shot Action Recognition: ActivityNet V1.3



B. Zhang*, H. Hu* and F. Sha, Cross-Modal and Hierarchical Modeling of Video and Text, in European Conference on Computer Vision (ECCV), 2018.