An Efficient Deep Learning Solution to 3D Hand Pose Estimation for XR Applications Using Monochrome Cameras

Yasmin Baba

Abstract

Recent technological advancements in the field of deep learning are providing new ways to naturalise user interactions in extended reality (XR) applications. One effective approach is 3D hand tracking, which enables users to interact with virtual content using their hands rather than controllers. 3D hand tracking is a computer vision task most commonly referred to as 3D hand pose estimation, which aims to estimate the locations of human hand joints from images and videos. This work proposes a novel and lightweight deep learning solution to 3D hand pose estimation, providing a model that is suitable for real-time use on embedded devices such as XR headsets. The proposed solution is split into two deep neural networks: a convolutional neural network that performs 2D hand joint estimation, and a feed-forward neural network that lifts the 2D coordinates to 3D by inferring depth. The design of the proposed 2D joint prediction sub-network incorporates parts of the MobileNetV2 architecture, followed by a final differentiable spatial-to-numerical transform (DSNT) layer that allows for accurate numerical coordinate regression. The depth estimation sub-network uses a simple multi-stage feed-forward architecture to iteratively refine the predicted depth coordinates. Given the current lack of research and solutions surrounding the use of monochrome cameras in hand pose estimation, this work provides a first look at hand pose estimation using monochrome data, together with an analysis of how the loss of colour information in training data impacts the performance of a hand pose prediction model.
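
To make the two-stage pipeline described above concrete, the sketch below outlines one possible PyTorch implementation of its components: a MobileNetV2-style backbone with a heatmap head and a DSNT layer for 2D joint regression, and a multi-stage feed-forward network that lifts the 2D joints to 3D by iteratively refining a per-joint depth estimate. This is a minimal illustration only; the module names, layer sizes, number of joints and stages, and the single-channel input adaptation are assumptions and do not reflect the exact configuration used in this work.

```python
# Minimal PyTorch sketch of a two-stage hand pose pipeline (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2


def dsnt(heatmaps):
    """Differentiable spatial-to-numerical transform (DSNT).

    Converts per-joint heatmaps (B, J, H, W) into normalised 2D coordinates
    in [-1, 1] by taking the probability-weighted mean of pixel locations.
    """
    b, j, h, w = heatmaps.shape
    # Normalise each heatmap into a spatial probability distribution.
    probs = F.softmax(heatmaps.view(b, j, -1), dim=-1).view(b, j, h, w)
    xs = torch.linspace(-1.0, 1.0, w, device=heatmaps.device)
    ys = torch.linspace(-1.0, 1.0, h, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # expected x per joint
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # expected y per joint
    return torch.stack([x, y], dim=-1)        # (B, J, 2)


class Joint2DNet(nn.Module):
    """2D joint estimator: MobileNetV2-style backbone + heatmap head + DSNT."""

    def __init__(self, num_joints=21):
        super().__init__()
        backbone = mobilenet_v2(weights=None)
        # Accept single-channel (monochrome) input instead of RGB.
        backbone.features[0][0] = nn.Conv2d(1, 32, 3, stride=2, padding=1, bias=False)
        self.features = backbone.features
        self.head = nn.Conv2d(1280, num_joints, kernel_size=1)  # one heatmap per joint

    def forward(self, x):
        heatmaps = self.head(self.features(x))
        return dsnt(heatmaps)                  # (B, J, 2) 2D coordinates


class DepthLiftNet(nn.Module):
    """Multi-stage feed-forward network that lifts 2D joints to 3D by
    predicting a depth value per joint and refining it at each stage."""

    def __init__(self, num_joints=21, hidden=256, stages=2):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Linear(num_joints * 2 + num_joints, hidden),
                nn.ReLU(),
                nn.Linear(hidden, num_joints),
            )
            for _ in range(stages)
        ])

    def forward(self, joints_2d):
        b, j, _ = joints_2d.shape
        flat_2d = joints_2d.reshape(b, -1)
        depth = torch.zeros(b, j, device=joints_2d.device)
        for stage in self.stages:
            # Each stage refines the current depth estimate given the 2D pose.
            depth = depth + stage(torch.cat([flat_2d, depth], dim=-1))
        return torch.cat([joints_2d, depth.unsqueeze(-1)], dim=-1)  # (B, J, 3)


# Example: monochrome crop -> 2D joints -> 3D joints.
img = torch.randn(1, 1, 224, 224)
joints_2d = Joint2DNet()(img)
joints_3d = DepthLiftNet()(joints_2d)
print(joints_3d.shape)  # torch.Size([1, 21, 3])
```

Splitting the problem this way keeps each sub-network small: the image-processing stage only has to solve the 2D localisation problem, while the lifting stage operates on a compact vector of joint coordinates, which is what makes the overall model lightweight enough for embedded, real-time use.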