Real-time Action Recognition with Enhanced Motion Vector CNNs

Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao and Hanli Wang


The deep two-stream architecture has exhibited excellent performance on video-based action recognition. The most computationally expensive step in this approach is the calculation of optical flow, which prevents it from running in real time. This paper accelerates the architecture by replacing optical flow with motion vectors, which can be obtained directly from compressed videos without extra calculation. However, motion vectors lack fine structure and contain noisy and inaccurate motion patterns, leading to an evident degradation of recognition performance. Our key insight for relieving this problem is that optical flow and motion vectors are inherently correlated. Transferring the knowledge learned by the optical flow CNN to the motion vector CNN can significantly boost the performance of the latter. Specifically, we introduce three strategies for this: initialization transfer, supervision transfer, and their combination. Experimental results show that our method achieves recognition performance comparable to the state of the art while processing 390.7 frames per second, which is 27 times faster than the original two-stream method.


In this paper, we propose using motion vectors to accelerate the processing speed of two-stream ConvNets. However, directly using motion vectors as input yields inferior performance. To improve the performance, we propose the enhanced motion vector CNN. The insight behind this algorithm is that motion vectors and optical flow share some common knowledge, even though they belong to different domains. The knowledge of the optical flow CNN can therefore enhance the performance of the motion vector CNN.

  • Motion vector CNN: We first use ffmpeg to extract motion vectors from videos. Then, we choose the Clarifai network architecture and train the model parameters from scratch.

  • Enhanced motion vector CNN: We first use optical flow to train an optical flow CNN (OF-CNN) and use the OF-CNN as the pre-trained model. Then, we fine-tune the model parameters of the enhanced motion vector CNN (EMV-CNN) using two losses. The first is the ground truth loss, which is supervised by the ground truth label. The second is a softmax cross-entropy loss, which is supervised by the output of the OF-CNN.
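The two-loss fine-tuning described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the released training code: the function names, the `teacher_weight` balance factor, and the `temperature` parameter are assumptions introduced here for illustration, and the values used in the actual experiments are not stated on this page.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax; `temperature` is an assumed softening parameter."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def emv_loss(mv_logits, of_logits, labels, teacher_weight=1.0, temperature=1.0):
    """Sum of a ground truth loss and a loss supervised by the OF-CNN output.

    mv_logits: (N, C) outputs of the motion vector CNN (the model being trained)
    of_logits: (N, C) outputs of the pre-trained optical flow CNN (held fixed)
    labels:    (N,) integer ground truth class indices
    """
    n = mv_logits.shape[0]
    eps = 1e-12  # avoids log(0)

    # Ground truth loss: standard softmax cross-entropy against the labels.
    p_mv = softmax(mv_logits)
    gt_loss = -np.log(p_mv[np.arange(n), labels] + eps).mean()

    # Second loss: softmax cross-entropy against the OF-CNN's soft predictions.
    p_teacher = softmax(of_logits, temperature)
    p_student = softmax(mv_logits, temperature)
    teacher_loss = -(p_teacher * np.log(p_student + eps)).sum(axis=1).mean()

    # Hypothetical weighting; the actual balance between the two terms is an assumption.
    return gt_loss + teacher_weight * teacher_loss
```

Setting `teacher_weight=0.0` reduces this to ordinary supervised training, which makes it easy to compare training with and without supervision from the OF-CNN.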
  • Results

  • Examples of Motion Vectors and Optical Flows

  • Speed comparison of optical flow and motion vectors

  • Performance of OF-CNN, MV-CNN and EMV-CNN on UCF-101

  • Performance of our method on UCF-101 and THUMOS14

  • Downloads

    Code, Model, and Prototxt for motion vector CNN are released.

    Steps for implementing MV-CNN

    1. Use the motion vector code to generate motion vector images.
    2. Generate the file list for training. The list should follow this format
    3. We use the VideoData layer as input and Clarifai Net (CNN-M-2048) with PReLU to train MV-CNN.
    4. For data augmentation, random crop, random flip and multi-scale (scale_ratio: [1,.875,.75]) are used.
    5. Following these steps, you should get 74.4% accuracy (MV-CNN trained from scratch) on UCF-101 Split1.
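The augmentation in step 4 can be approximated with the sketch below. It is a NumPy illustration rather than the VideoData layer itself: the function name `augment`, the output size of 224, the square crop, and the nearest-neighbour resize are all assumptions made for this example. Whether the horizontal motion component should be negated when flipping is not stated here, so the sketch does not handle it.

```python
import numpy as np

SCALE_RATIOS = (1.0, 0.875, 0.75)  # scale_ratio values from step 4

def augment(image, out_size=224, rng=None):
    """Random multi-scale crop plus random horizontal flip.

    image: (H, W) or (H, W, C) array holding a motion vector image.
    out_size: assumed network input size (not given in the steps above).
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]

    # Multi-scale: crop side is the shorter image side times a random ratio.
    crop = int(min(h, w) * rng.choice(SCALE_RATIOS))

    # Random crop: choose the top-left corner uniformly among valid positions.
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]

    # Random flip: mirror horizontally with probability 0.5.
    if rng.random() < 0.5:
        patch = patch[:, ::-1]

    # Nearest-neighbour resize of the square patch to out_size x out_size.
    idx = np.arange(out_size) * crop // out_size
    return patch[idx][:, idx]
```

Passing a seeded generator, for example `augment(img, rng=np.random.default_rng(0))`, makes the sampled crop and flip reproducible.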


    B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, Real-time Action Recognition with Enhanced Motion Vector CNNs, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.