Action Recognition and Localization in a Weakly Supervised Manner

Problem Overview

One of the biggest challenges in Video Content Analysis is the annotation requirements at scale.

The growth of visual data, and the need to make meaningful sense of it, calls for sophisticated models. These models are data-hungry and require granular annotations to learn patterns in the data.

To reduce this dependency, we need algorithms that rely far less on annotation.

To demonstrate this idea, we have benchmarked the performance of a novel weakly supervised method for action recognition and localization.

In this method, an action is localized in a video using only class labels, without any bounding-box annotations.
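To make the difference in annotation cost concrete, the following sketch contrasts the two supervision regimes as data records. The field names, clip identifier, and box values are hypothetical, purely for illustration:

```python
# Fully supervised training data: every frame of the clip carries an
# (x, y, w, h) bounding box around the actor, plus the class label.
fully_supervised_sample = {
    "clip": "video_0001.mp4",          # hypothetical clip identifier
    "label": "basketball_dunk",
    "boxes": {                          # frame index -> (x, y, w, h)
        0: (112, 40, 64, 128),
        1: (114, 41, 64, 128),
        # ... one box per frame, expensive to annotate at scale
    },
}

# Weakly supervised training data: a single clip-level class label
# is all that is available -- no spatial annotation at all.
weakly_supervised_sample = {
    "clip": "video_0001.mp4",
    "label": "basketball_dunk",
}
```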


Deep learning has been successfully applied to action recognition and localization tasks because of its ability to learn discriminative features automatically from images/videos. Since the success of these methods depends on the availability of accurate spatiotemporal annotations, they cannot leverage the benefit of rapid growth in unstructured video data having only clip-level annotations. Also, most of the existing deep learning methods treat each frame independently and thus ignore temporal continuity (i.e., motion information) in a video clip which is crucial for action recognition and localization.


We approach the problem by identifying key descriptor points in a video, along with their mutual interactions for a particular action. These mutual relationships amplify the presence of that action in the video. Background and irrelevant feature points are filtered out by thresholding the mutual relationship index with respect to the key descriptors.
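The filtering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, cosine similarity stands in for the mutual relationship index, and the threshold value is an assumption:

```python
import numpy as np

def filter_descriptors(features, key_idx, threshold=0.5):
    """Keep descriptors whose strongest relationship to any key descriptor
    clears the threshold; the rest are treated as background/noise.

    features: (N, D) array of local descriptors.
    key_idx: indices of the key descriptor points for the action.
    Returns the indices of retained descriptors.
    """
    # Cosine similarity as a stand-in for the mutual relationship index.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed[key_idx].T      # (N, K) score vs each key point
    relationship = sim.max(axis=1)        # strongest link to any key point
    return np.where(relationship >= threshold)[0]
```

Points with no strong link to any key descriptor fall below the threshold and are dropped, which is how background clutter is suppressed without ever seeing a bounding box.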

The steps followed are as follows:

  • Extract local features and global features
  • Capture mutual relationships among local descriptors
  • Remove noisy points by thresholding action-specific mutual relationship score
  • Create a relationship tube/graph for each action point
  • Learn to classify the graph into certain action buckets
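The steps above can be sketched end to end. Everything here is an assumption for illustration: the feature extractor is a random stub, cosine similarity stands in for the mutual relationship score, the threshold is arbitrary, and the graph readout is a crude mean-pooling that a downstream classifier could consume:

```python
import numpy as np

def extract_local_features(clip, n_points=32, dim=16, seed=0):
    # Stand-in for a real local/global feature extractor (step 1).
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_points, dim))

def mutual_relationships(features):
    # Pairwise cosine similarity as the mutual relationship score (step 2).
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T

def build_action_graph(scores, threshold=0.3):
    # Keep only edges whose score clears the threshold (steps 3 and 4);
    # the resulting adjacency matrix is the "relationship graph".
    adj = (scores >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def graph_signature(features, adj):
    # Aggregate node features over retained edges into one fixed-length
    # vector per clip, which a classifier would map to an action bucket
    # (step 5).
    degree = adj.sum(axis=1, keepdims=True) + 1e-8
    pooled = (adj @ features) / degree
    return pooled.mean(axis=0)

clip = "video_0001.mp4"  # placeholder clip identifier
feats = extract_local_features(clip)
adj = build_action_graph(mutual_relationships(feats))
sig = graph_signature(feats, adj)
```

In the actual method the classifier over these graph signatures is learned from clip-level labels alone, which is what makes the supervision weak.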


The method has shown state-of-the-art performance on benchmark datasets when compared against both fully supervised methods and other weakly supervised methods.