Action Recognition and Localization in a Weakly Supervised Manner

Problem Overview

One of the biggest challenges in Video Content Analysis is the annotation requirements at scale.

The growth of visual data, and the need to make meaningful sense of it, calls for sophisticated models. These models are data-hungry and require granular annotations to learn patterns in the data.

To reduce this dependency, we need algorithms that rely far less on annotation.

To demonstrate this idea, we have benchmarked the performance of a novel weakly supervised method for action recognition and localization.

In this method, an action is localized in a video using only class labels, without any bounding-box annotations.
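To make the difference in annotation cost concrete, the following sketch contrasts the two supervision regimes as data records. The field names, clip identifier, and box values are hypothetical, purely for illustration:

```python
# Fully supervised training data: every frame of the clip carries an
# (x, y, w, h) bounding box around the actor, plus the class label.
fully_supervised_sample = {
    "clip": "video_0001.mp4",          # hypothetical clip identifier
    "label": "basketball_dunk",
    "boxes": {                          # frame index -> (x, y, w, h)
        0: (112, 40, 64, 128),
        1: (114, 41, 64, 128),
        # ... one box per frame, expensive to annotate at scale
    },
}

# Weakly supervised training data: a single clip-level class label
# is all that is available -- no spatial annotation at all.
weakly_supervised_sample = {
    "clip": "video_0001.mp4",
    "label": "basketball_dunk",
}
```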


Deep learning has been successfully applied to action recognition and localization tasks because of its ability to learn discriminative features automatically from images/videos. Since the success of these methods depends on the availability of accurate spatiotemporal annotations, they cannot leverage the benefit of rapid growth in unstructured video data having only clip-level annotations. Also, most of the existing deep learning methods treat each frame independently and thus ignore temporal continuity (i.e., motion information) in a video clip which is crucial for action recognition and localization.


We approach the problem by identifying key descriptor points in a video, along with their mutual interactions for a particular action. These mutual relationships amplify the presence of that action in the video. Background and irrelevant feature points are filtered out by thresholding the mutual relationship index with respect to the key descriptors.
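The filtering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, cosine similarity stands in for the mutual relationship index, and the threshold value is an assumption:

```python
import numpy as np

def filter_descriptors(features, key_idx, threshold=0.5):
    """Keep descriptors whose strongest relationship to any key descriptor
    clears the threshold; the rest are treated as background/noise.

    features: (N, D) array of local descriptors.
    key_idx: indices of the key descriptor points for the action.
    Returns the indices of retained descriptors.
    """
    # Cosine similarity as a stand-in for the mutual relationship index.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed[key_idx].T      # (N, K) score vs each key point
    relationship = sim.max(axis=1)        # strongest link to any key point
    return np.where(relationship >= threshold)[0]
```

Points with no strong link to any key descriptor fall below the threshold and are dropped, which is how background clutter is suppressed without ever seeing a bounding box.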

The steps followed are as follows:

  • Extract local features and global features
  • Capture mutual relationships among local descriptors
  • Remove noisy points by thresholding action-specific mutual relationship score
  • Create a relationship tube/graph for each action point
  • Learn to classify the graph into certain action buckets
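The steps above can be sketched end to end. Everything here is an assumption for illustration: the feature extractor is a random stub, cosine similarity stands in for the mutual relationship score, the threshold is arbitrary, and the graph readout is a crude mean-pooling that a downstream classifier could consume:

```python
import numpy as np

def extract_local_features(clip, n_points=32, dim=16, seed=0):
    # Stand-in for a real local/global feature extractor (step 1).
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_points, dim))

def mutual_relationships(features):
    # Pairwise cosine similarity as the mutual relationship score (step 2).
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T

def build_action_graph(scores, threshold=0.3):
    # Keep only edges whose score clears the threshold (steps 3 and 4);
    # the resulting adjacency matrix is the "relationship graph".
    adj = (scores >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def graph_signature(features, adj):
    # Aggregate node features over retained edges into one fixed-length
    # vector per clip, which a classifier would map to an action bucket
    # (step 5).
    degree = adj.sum(axis=1, keepdims=True) + 1e-8
    pooled = (adj @ features) / degree
    return pooled.mean(axis=0)

clip = "video_0001.mp4"  # placeholder clip identifier
feats = extract_local_features(clip)
adj = build_action_graph(mutual_relationships(feats))
sig = graph_signature(feats, adj)
```

In the actual method the classifier over these graph signatures is learned from clip-level labels alone, which is what makes the supervision weak.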


The method has shown state-of-the-art performance on benchmark datasets when compared against both fully supervised methods and other weakly supervised methods.