In this project, we seek to reduce the computation and memory overheads of backbone networks used in applications such as detection, tracking, and pose estimation.
Learning to capture long-range inter-dependencies in visual data is of primary interest for deep convolutional neural networks. However, the convolution operations in vanilla CNNs capture only local relations and are therefore inefficient at capturing long-range dependencies.
We address this limitation by capturing long-range dependencies in visual data while keeping the computation and parameter overhead low. We have observed that dividing the feature maps into multiple subspaces and learning to capture spatial as well as cross-channel interactions among them provides a compute-efficient way to capture long-range dependencies in the data.
Convolutional neural networks (CNNs) have achieved exceptional performance in various cognitive tasks. This performance stems from their rich representational power, which in turn stems from deeper and wider layers. Deeper and wider layers boost the expressiveness and discrimination ability of the network by circumventing two limitations of convolution operators, namely locality and linearity. In CNNs, convolution operators capture local (e.g., 3 × 3) feature correlations and enable weight sharing to reduce the number of learnable parameters. Multiple convolution operators are stacked to enlarge the receptive field and capture long-range dependencies, which makes CNNs deeper. Further, since the linearity of the convolution operation captures the non-linear abstractions of the input data inefficiently, CNNs employ a higher number of filters per layer, which are learned to capture all the possible variations of the same latent concept. This, however, makes CNNs wider. Altogether, deeper and wider layers lead to a high computational cost (measured in the number of floating-point operations, or FLOPs) and a large number of parameters, which makes deploying CNNs on resource-constrained platforms quite challenging.
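The cost argument above can be made concrete with a small calculation. The following is a minimal sketch, not part of the original work: the function names conv_cost and receptive_field and the layer sizes are illustrative assumptions. It counts multiply-accumulate operations (MACs) rather than FLOPs, since conventions differ on whether one MAC counts as one or two FLOPs.

```python
def conv_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Return (parameters, MACs) for one standard k x k convolution layer."""
    weights = c_out * c_in * k * k
    params = weights + (c_out if bias else 0)
    # Every output position of every output channel needs c_in * k * k MACs.
    macs = weights * h_out * w_out
    return params, macs


def receptive_field(num_layers, k=3):
    """Receptive field of num_layers stacked k x k convolutions with stride 1."""
    return 1 + num_layers * (k - 1)


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    base = conv_cost(c_in=64, c_out=64, k=3, h_out=56, w_out=56)
    wide = conv_cost(c_in=128, c_out=128, k=3, h_out=56, w_out=56)
    print("64 -> 64 channels (params, MACs):", base)
    print("128 -> 128 channels (params, MACs):", wide)
    print("receptive field after 1, 2, 3 layers:",
          [receptive_field(n) for n in (1, 2, 3)])
```

Under these assumptions, doubling the width of a layer quadruples its MACs, while each additional 3 × 3 layer enlarges the receptive field by only two pixels. This is why covering long-range dependencies through depth and width alone is expensive.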
Let F ∈ R^(c × h × w) be a set of intermediate feature maps, where c is the number of channels and h and w are the spatial dimensions of the feature maps. Our objective is to learn to capture spatial inter-dependencies in the feature maps without incurring significant parameter and computation overhead. As shown in the figure, Subspace Attention Pooling divides the input feature maps (F) into g mutually exclusive groups [F_1, F_2, ..., F_ñ, ..., F_g], where each group has G feature maps. For a set of attention maps, we define F_ñ as a group of intermediate feature maps and proceed as follows.
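The division into subspaces described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name split_into_subspaces is hypothetical, the channel count c is assumed to be divisible by g, and channels are grouped contiguously, which the text does not specify.

```python
def split_into_subspaces(feature_maps, g):
    """Split a list of c feature maps into g disjoint groups of G = c // g maps."""
    c = len(feature_maps)
    if c % g != 0:
        # Assumption: every group holds the same number of feature maps.
        raise ValueError("number of channels must be divisible by g")
    G = c // g
    # Contiguous grouping (assumed): group n holds channels n*G .. (n+1)*G - 1.
    return [feature_maps[n * G:(n + 1) * G] for n in range(g)]


# Toy input: c = 8 feature maps of size h x w = 2 x 2,
# each filled with its own channel index so the groups are easy to inspect.
c, h, w = 8, 2, 2
F = [[[ch] * w for _ in range(h)] for ch in range(c)]

groups = split_into_subspaces(F, g=4)
print("number of groups g:", len(groups))
print("feature maps per group G:", len(groups[0]))
```

Each group can then be processed independently, so any per-group attention parameters scale with G rather than with the full channel count c.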
Here, U and V are the attention parameters. We approximate the attention map A_map with a low-rank factorization, A_map = U V^T, where U, V ∈ R^(f × 1). G_sq(·) denotes the squeeze operation, i.e., a weighted sum of the individual feature maps. This way of pooling enables the network to capture spatial relations in the feature maps. The pooled features are then passed through a multi-layer perceptron, which broadcasts the spatial information from each subspace. The final set of feature maps is obtained by the G_scale(·) operation, which distributes the spatial information among the feature maps.
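The benefit of the low-rank form A_map = U V^T can be illustrated with a short sketch. It assumes that the squeeze for a single f × f feature map X is the sum of X weighted element-wise by A_map; the text does not give the exact formula, so this is an assumption, and the function names and numeric values below are hypothetical. Under that assumption the weighted sum equals U^T X V, so the f × f attention map never has to be materialized and only 2f parameters are stored instead of f².

```python
def outer(u, v):
    """Rank-1 attention map A = u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def weighted_sum_full(x, a):
    """Sum of x[i][j] * a[i][j] using the materialized attention map."""
    return sum(x[i][j] * a[i][j]
               for i in range(len(x)) for j in range(len(x[0])))


def weighted_sum_low_rank(x, u, v):
    """Same quantity computed as u^T x v, without building the f x f map."""
    xv = [sum(row[j] * v[j] for j in range(len(v))) for row in x]
    return sum(u[i] * xv[i] for i in range(len(u)))


f = 3
u = [0.5, 1.0, -1.0]  # hypothetical learned values
v = [2.0, 0.0, 1.0]   # hypothetical learned values
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]  # one toy f x f feature map

full = weighted_sum_full(x, outer(u, v))
low = weighted_sum_low_rank(x, u, v)
print("full map:", full, "low rank:", low)
print("parameters, full map vs. low rank:", f * f, "vs.", len(u) + len(v))
```

Both routes give the same pooled value. The multi-layer perceptron and the G_scale(·) step are left out here because the text does not specify their form.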
Results show that Subspace Attention Pooling achieves a 26% reduction in computation overhead and a 33% reduction in memory overhead without degrading the performance of MobileNet-V1 and MobileNet-V2 on various classification and localization tasks.
It can be integrated into existing backbone networks with simple and minimal changes, which makes it applicable across a range of tasks.