History of computer vision in 4 slides #1

Credits: DSpace@MIT - Seymour A. Papert, The Summer Vision Project

History of computer vision in 4 slides #2

Credits: Kirill Danilyuk - CarND Project 1: Lane Lines Detection — A Complete Pipeline

History of computer vision in 4 slides #3

Deep: 7 hidden “weight” layers
Learned: all feature extractors initialized at white Gaussian noise and learned from the data
Entirely supervised
More data = good
Trained with stochastic gradient descent on two NVIDIA GPUs for about a week
650,000 neurons, 60,000,000 parameters, 630,000,000 connections
Final feature layer: 4096-dimensional

Alexnet

neuron

Credits: Alex Krizhevsky - ImageNet Classification with Deep Convolutional Neural Networks

History of computer vision in 4 slides #4

Very deep: up to 152 layers in the original paper
Solves the the vanishing gradient problem through skip connections
Batch normalization instead of dropout act as a regularizer

Credits: Deep residual learning for image recognition

CNNs glossary #1

2D Convolutions
Actually performed over the 3-channel volume
Different kernels (learned) convolved with the input at different layers produce different outputs (feature maps)

2D convolution

Credits: Practical 3a: Convolutional Networks by Deep Learning Indaba

CNNs glossary #2

Lower layer filters learn simple patterns (lines, curves, color gradients)
Higher layer filters learn complex patterns (eyes, faces, textures, distinctive components of objects)

Credits: Visualizing and Understanding Convolutional Networks

CNNs glossary #3

CNNs are trained in two phases:
- in the forward pass, features are extracted from the input image and the output of the network is compared to the ground truth through a loss function
- in the backward pass, neurons’ parameters (weights and biases) are adjusted through backpropagation (1989) and gradient descent
Before ResNets, the vanishing gradient problem made deep CNNs difficult to train, because the so called “loss landscape” was too noisy for gradient descent to make progress

Visualizing the Loss Landscape of Neural Nets

Credits: Visualizing the Loss Landscape of Neural Nets

CNNs glossary #4

At a higher level of abstraction, a CNN model is trained starting from:
- the network architecture (ResNet-34, ResNet-50, EfficientNet-B4, etc.)
- the dataset (typically composed of training/validation/test set) and a set of data augmentation transformations
- the loss function (crossentropy loss, MSE, MAE, etc.)
- the choice of hyperparameters (batch size, learning rate, number of training epochs, etc.)

Dynamic Hands Gestures ResNet-50 Training Results

3D Dynamic Hand Gestures Recognition

problem outline

Dynamic Hand Gestures Recognition can be split into two subproblems:
- acquisition of the skeletons of the hands (hard)
  - through hardware like gloves or Leap Motion
  - through software with other NN models like Google MediaPipe
  - (we’re using the Leap Motion sensor now because it’s convenient, but our approach is versatile)
- actual understanding of the dynamic gesture using “all its history” (also hard)
  - training “traditional” tabular data classifiers (SVMs, feed-forward NN, LSTMs)
  - training CNN classifiers to leverage 2D image structure -> powerful (learned) features extractors + transfer learning

grab pinch tap swipe-left swipe-right swipe-O swipe-V three

3D Dynamic Hand Gestures Recognition

our approach

Acquire hands skeleton data via the Leap Motion sensor (138 floats + label)
Reinterpret the data to match the Leap Motion connection map to obtain both nodes and edges
Draw the skeleton into a custom 3D visualizer made with VisPy (PyQt5 backend)
Draw only the nodes representing the fingertips
As the gesture progresses, keep the fingertips history and draw it with decreasing alpha values
- The history of the gesture fades away with time
Add variable amounts of 3D noise as further data augmentation
Capture the canvas of the 3D visualizer as the gesture progresses to create the dataset
- With different amounts of fingertips history and noise
- 4.6 Gb, ~77k images (starting from 468 training sequences)

swipe-O-hst-050 swipe-O-hst-200 swipe-O-hst-400 swipe-O-hst-400-noise

3D Dynamic Hand Gestures Recognition

training the network and results

The network has been trained in Python using Jupyter Notebook and the Fast.ai v1 library
- a little bit more high level than Pytorch
- great community that keeps it up to date with literature
ResNet-50 pretrained on ImageNet and on our other 2k image hand gestures dataset
- Progressive resizing technique (192x108, 384x216, 576x324 and 960x540)
- Standard augmentation (rotate, crop+pad, flip, warp, zoom, brightness, contrast)
- Fit one cycle policy for training (i.e. cyclical learning rate)

view-based-method-classification-accuracy view-based-method-false-positives

Credits: SFINGE 3D: A novel benchmark for online detection and recognition of heterogeneous hand gestures from 3D fingers’ trajectories

3D Dynamic Hand Gestures Recognition

results #2

Advantages of this approach:
- real-time (ResNet-50 theoretical throughput is 536 FPS on an Nvidia P40)
- scalable (with the dataset size)

view-based-method-results-table

Credits: SFINGE 3D: A novel benchmark for online detection and recognition of heterogeneous hand gestures from 3D fingers’ trajectories

Future research directions #1

What’s the weak point of our current approach?
- There is no way to determine when a gesture starts or ends (it’s just a “single image” classifier)
- Once again, the developer has the burden of deciding when one gesture ends and another begins, when two predictions are part of the same gesture, etc.
How do we do it?
- One prediction every two seconds (avg. duration of the gestures)
- Thresholds

3d-dynamic-hand-gestures-classifier-animated-gif

Future research directions #2

Two (or three?) possible ways to make the NN deal with temporal information:
- 3D ResNets - Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs? (2020)
- CNN+LSTM - Beyond Short Snippets: Deep Networks for Video Classification (2015)
- Vision Transformers? - An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale (under review at ICLR 2021)
Currently working on:
- 3D-ResNet-50-KMS
  - pretrained on Kinetics-700 (K), Moments in Time (M), STAIR-Actions (S)

resnet-3d-training-round-2

Future research directions #3

Same SHREC 2020 contest dataset
- This time not just single images, but entire sequences (30 frames)
- Plain sequences with basic data augmentation
  - without partial sequences, without noise
The good news is that the network is learning

Questions?

cat-meleon

3D dynamic hand gestures recognition using the Leap Motion sensor and

Convolutional Neural Networks

Andrea Ranieri - CNR-IMATI

History of computer vision in 4 slides #1

History of computer vision in 4 slides #2

History of computer vision in 4 slides #3

History of computer vision in 4 slides #4

CNNs glossary #1

CNNs glossary #2

CNNs glossary #3

CNNs glossary #4

3D Dynamic Hand Gestures Recognition

problem outline

3D Dynamic Hand Gestures Recognition

our approach

3D Dynamic Hand Gestures Recognition

training the network and results

3D Dynamic Hand Gestures Recognition

results #2

Future research directions #1

Future research directions #2

Future research directions #3

Questions?