Human Action Recognition (HAR) is increasingly common in our society today. It can be found in self-driving cars, surveillance systems and cashier-free stores such as Amazon Go. Classifying and predicting human actions is difficult, mainly because it relies heavily on video data, which contains noise in the form of unrelated information about the surroundings as well as temporal aspects. One method to address these issues is a two-stream Convolutional Neural Network architecture that determines the spatial aspect of the action using a single image and the temporal aspect using stacks of Optical Flow (OF) frames. An issue with OF is the limited Frames Per Second (FPS) that the method supports. To combat the low FPS, a Dynamic Images (DI) network can be used, which utilizes Approximate Rank Pooling to create motion-representing images from video data more quickly. The increased FPS of DI networks makes multi-view real-time HAR feasible.
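As a minimal sketch of how Approximate Rank Pooling collapses a video clip into a single dynamic image, the version below uses the closed-form frame weights from Bilen et al.'s dynamic-images formulation; the function name, array shapes, and test values are illustrative, not taken from this study.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a stack of T video frames into one dynamic image
    via Approximate Rank Pooling: a fixed linear weighting of the
    frames, with no per-video optimization.

    `frames` has shape (T, H, W, C); the name is illustrative."""
    T = frames.shape[0]
    t = np.arange(1, T + 1)
    # Harmonic numbers H_t = sum_{i=1}^{t} 1/i, with H_0 = 0.
    H = np.concatenate(([0.0], np.cumsum(1.0 / t)))
    # Closed-form weights: alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}).
    alpha = 2.0 * (T - t + 1) - (T + 1) * (H[T] - H[:-1])
    # Weighted sum over the time axis yields one (H, W, C) image.
    return np.tensordot(alpha, frames.astype(np.float32), axes=1)
```

The weights sum to zero, so a perfectly static clip maps to an all-zero dynamic image and only motion survives the pooling, which is what makes the result a compact temporal descriptor.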
In this study, a data set is gathered at a self-service vending fridge with a multi-view camera setup. A DI network is used together with different fusion models to investigate the effect of the multi-view camera setup. It is concluded that fusing the DI networks in a multi-view setup can, in one specific case, namely support vector classifier fusion, provide statistical evidence of increased mean accuracy compared with stand-alone single-view DI networks.
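To illustrate the idea of fusing per-view networks, the sketch below performs the simplest score-level (late) fusion, averaging hypothetical class scores from two camera views; the study's support vector classifier fusion instead learns this combination, and all numbers here are made up for illustration.

```python
import numpy as np

# Hypothetical per-view class scores for 4 clips and 3 action
# classes, standing in for the outputs of two single-view
# DI networks (values are illustrative only).
view_a = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.6, 0.3, 0.1]])
view_b = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.2, 0.5, 0.3]])

# Late fusion: average the per-view scores, then pick the most
# likely class per clip. A learned fusion model (e.g. an SVC over
# concatenated scores) replaces this fixed averaging rule.
fused = (view_a + view_b) / 2.0
pred = fused.argmax(axis=1)
print(pred.tolist())  # → [0, 1, 2, 0]
```

Note how the third clip is ambiguous from view A alone but resolved once view B's scores are folded in, which is the intuition behind combining multiple camera views.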