3D dynamic hand gestures recognition using the Leap Motion sensor and Convolutional Neural Networks

Andrea Ranieri - CNR-IMATI

History of computer vision in 4 slides #1


History of computer vision in 4 slides #2

History of computer vision in 4 slides #3

  • Deep: 7 hidden “weight” layers
  • Learned: all feature extractors are initialized with white Gaussian noise and learned from the data
  • Entirely supervised
  • More data = good

  • Trained with stochastic gradient descent on two NVIDIA GPUs for about a week
  • 650,000 neurons, 60,000,000 parameters, 630,000,000 connections
  • Final feature layer: 4096-dimensional



History of computer vision in 4 slides #4


CNNs glossary #1

  • 2D Convolutions
  • Actually performed over the 3-channel volume
  • Different kernels (learned) convolved with the input at different layers produce different outputs (feature maps)
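
As a minimal sketch of the bullets above, the following pure-Python function slides one kernel over a multi-channel input and sums over the spatial window and over all channels, producing a single feature map. The function name `conv2d_volume`, the "valid" padding, and the stride of 1 are assumptions made for illustration. Like most deep learning frameworks, it computes a cross-correlation (the kernel is not flipped).

```python
def conv2d_volume(image, kernel):
    """Slide one kernel over a multi-channel image (valid padding, stride 1).

    image:  list of C channels, each an H x W list of lists
    kernel: list of C channels, each a kH x kW list of lists
    Returns a single (H - kH + 1) x (W - kW + 1) feature map.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    k_h, k_w = len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(height - k_h + 1):
        row = []
        for j in range(width - k_w + 1):
            acc = 0.0
            # The sum runs over the spatial window AND over all channels.
            for c in range(channels):
                for di in range(k_h):
                    for dj in range(k_w):
                        acc += image[c][i + di][j + dj] * kernel[c][di][dj]
            row.append(acc)
        out.append(row)
    return out


# Toy 3-channel 3x3 input: every channel holds the same ramp of values.
channel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
image = [channel, channel, channel]

# A 2x2 kernel that only responds to the main diagonal of the first channel.
kernel = [
    [[1, 0], [0, 1]],
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
]

feature_map = conv2d_volume(image, kernel)
print(feature_map)
```

A different kernel applied to the same input would give a different feature map; in a CNN the kernel values are learned rather than hand-written as they are here.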

2D convolution

CNNs glossary #2

  • Lower layer filters learn simple patterns (lines, curves, color gradients)
  • Higher layer filters learn complex patterns (eyes, faces, textures, distinctive components of objects)

CNN kernels

CNNs glossary #3

  • CNNs are trained in two phases:
    • in the forward pass, features are extracted from the input image and the output of the network is compared to the ground truth through a loss function
    • in the backward pass, neurons’ parameters (weights and biases) are adjusted through backpropagation (1989) and gradient descent
  • Before ResNets, the vanishing gradient problem made deep CNNs difficult to train, because the so-called “loss landscape” was too noisy for gradient descent to make progress
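
The two phases can be illustrated on the smallest possible "network": a single linear neuron trained with a squared-error loss. This is only a sketch of the forward/backward idea, not of a CNN; the function name `train_step`, the toy data point, and the learning rate are assumptions chosen for illustration.

```python
def train_step(w, b, x, y_true, lr):
    """One forward and one backward pass for a single linear neuron."""
    # Forward pass: compute the output and compare it to the ground truth.
    y_pred = w * x + b
    loss = (y_pred - y_true) ** 2
    # Backward pass: gradients of the loss w.r.t. the weight and the bias.
    grad_w = 2 * (y_pred - y_true) * x
    grad_b = 2 * (y_pred - y_true)
    # Gradient descent: move the parameters against the gradient.
    return w - lr * grad_w, b - lr * grad_b, loss


w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=2.0, y_true=4.0, lr=0.05)
    losses.append(loss)

print(losses[0], losses[-1])
```

In a real CNN the gradients are obtained layer by layer through backpropagation instead of being written out by hand, but the update rule has the same shape.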

Visualizing the Loss Landscape of Neural Nets

CNNs glossary #4

  • At a higher level of abstraction, a CNN model is trained starting from:
    • the network architecture (ResNet-34, ResNet-50, EfficientNet-B4, etc.)
    • the dataset (typically composed of training/validation/test set) and a set of data augmentation transformations
    • the loss function (cross-entropy loss, MSE, MAE, etc.)
    • the choice of hyperparameters (batch size, learning rate, number of training epochs, etc.)
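
As an example of one of these ingredients, here is a minimal sketch of the cross-entropy loss for a single sample, computed from raw network outputs (logits). The helper names `softmax` and `cross_entropy` and the toy logits are assumptions for illustration; frameworks ship their own, more general implementations.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


# An undecided network (all classes equally likely) pays log(num_classes).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)

# A network that is confident about the correct class pays almost nothing.
confident_loss = cross_entropy([0.0, 0.0, 10.0, 0.0], target_index=2)

print(uniform_loss, confident_loss)
```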

Dynamic Hands Gestures ResNet-50 Training Results

3D Dynamic Hand Gestures Recognition

problem outline
  • Dynamic Hand Gestures Recognition can be split into two subproblems:
    • acquisition of the skeletons of the hands (hard)
      • through hardware like gloves or Leap Motion
      • through software with other NN models like Google MediaPipe
      • (we’re using the Leap Motion sensor now because it’s convenient, but our approach is versatile)
    • actual understanding of the dynamic gesture using “all its history” (also hard)
      • training “traditional” tabular data classifiers (SVMs, feed-forward NN, LSTMs)
      • training CNN classifiers to leverage 2D image structure -> powerful (learned) feature extractors + transfer learning

grab, pinch, tap, swipe-left, swipe-right, swipe-O, swipe-V, OK, expand, three

3D Dynamic Hand Gestures Recognition

our approach
  • Acquire hand skeleton data via the Leap Motion sensor (138 floats + label)
  • Reinterpret the data to match the Leap Motion connection map to obtain both nodes and edges
  • Draw the skeleton into a custom 3D visualizer made with VisPy (PyQt5 backend)
  • Draw only the nodes representing the fingertips
  • As the gesture progresses, keep the fingertips history and draw it with decreasing alpha values
    • The history of the gesture fades away with time
  • Add variable amounts of 3D noise as further data augmentation
  • Capture the canvas of the 3D visualizer as the gesture progresses to create the dataset
    • With different amounts of fingertips history and noise
    • 4.6 GB, ~77k images (starting from 468 training sequences)
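
The fading history and the noise augmentation can be sketched as follows. This is not the actual visualizer code: the linear fade, the Gaussian noise model, and the names `fading_alphas` and `add_noise` are all assumptions made for illustration, and the drawing itself (VisPy) is left out.

```python
import random


def fading_alphas(history_len, min_alpha=0.0):
    """One alpha per history sample: oldest first, newest last (fully opaque).

    Assumption: the fade is linear; the slides do not say which curve is used.
    """
    if history_len == 1:
        return [1.0]
    step = (1.0 - min_alpha) / (history_len - 1)
    return [min_alpha + k * step for k in range(history_len)]


def add_noise(points, sigma, rng):
    """Jitter each (x, y, z) point with zero-mean Gaussian noise (assumed model)."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


rng = random.Random(0)  # fixed seed so the augmentation is reproducible

# Toy fingertip trail: one fingertip moving along the x axis for 5 frames.
trail = [(float(t), 0.0, 0.0) for t in range(5)]

alphas = fading_alphas(len(trail))
noisy_trail = add_noise(trail, sigma=0.01, rng=rng)

for point, alpha in zip(noisy_trail, alphas):
    print(point, alpha)
```

Rendering the same sequence with several history lengths and noise levels is what multiplies a few hundred recorded sequences into tens of thousands of training images.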

swipe-O-hst-050 swipe-O-hst-200 swipe-O-hst-400 swipe-O-hst-400-noise

3D Dynamic Hand Gestures Recognition

training the network and results

view-based-method-classification-accuracy view-based-method-false-positives

3D Dynamic Hand Gestures Recognition

results #2
  • Advantages of this approach:


Future research directions #1

  • What’s the weak point of our current approach?
    • There is no way to determine when a gesture starts or ends (it’s just a “single image” classifier)
    • Once again, the developer has the burden of deciding when one gesture ends and another begins, when two predictions are part of the same gesture, etc.
  • How do we do it?
    • One prediction every two seconds (avg. duration of the gestures)
    • Thresholds
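
A threshold on the classifier output can be sketched like this: a prediction is reported only when the top class probability is high enough, otherwise it is discarded. The function name `accept_prediction`, the threshold value of 0.8, and the toy probabilities are assumptions for illustration, not the values used in our experiments.

```python
def accept_prediction(probs, labels, threshold=0.8):
    """Return the top label if its probability clears the threshold, else None."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return labels[best] if probs[best] >= threshold else None


labels = ["grab", "pinch", "tap"]

confident = accept_prediction([0.05, 0.90, 0.05], labels)
ambiguous = accept_prediction([0.40, 0.35, 0.25], labels)

print(confident)  # accepted: the top probability is above the threshold
print(ambiguous)  # rejected: no class is confident enough
```

Calling such a filter once per prediction window (every two seconds in our case) still leaves the start and end of the gesture undetermined, which is the weak point described above.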


Future research directions #2


Future research directions #3

  • Same SHREC 2020 contest dataset
    • This time not just single images, but entire sequences (30 frames)
    • Plain sequences with basic data augmentation
      • without partial sequences, without noise
  • The good news is that the network is learning




