Dense 3D Mapping Based on ElasticFusion and Mask-RCNN

December 14, 2018

Demo Video

Project Goal

The goal of this project is to build a simultaneously localization and semantic mapping (SLAM) system.

Source Code:

All C++ source code is available on my GitHub Page.

Pipeline:

A real-time dense visual SLAM (ElasticFusion) system to generate surfel map.
A segmentor based on Mask-RCNN to do semantic segmentation on input 2D RGB streams, and then project semantic segmentation label from 2D pixel to surfel on 3D dense map.
A bayesian update scheme to refresh the semantic segmentation results on existing surfels.

The system flow is below:

MaskFusion Pipeline

1. Tracking

I employ ElasticFusion [1] as the pipeline’s SLAM part. It gives the camera pose estimation by minimise the joint geometric and photometric cost function:

\begin{equation} E_{track}=E_{icp}+w_{rgb}E_{rgb} \end{equation}

The geometric cost function $E_{icp}$ is given by iterative closest point (ICP) algorithm:

\begin{equation} E_{icp}=\sum (\textbf{v}^{k}-exp(\hat{\xi})\textbf{T}\textbf{v}_{t}^{k})\cdot \textbf{n}^{k})^2 \end{equation}

The photometric cost function $E_{rgb}$ represent the cost over the photometric error (intensity difference) between pixels:

\begin{equation} E_{rgb}=\sum_{\textbf{u}\in\Omega}(I(\textbf{u},C_{t}^{l})-I(\pi(\textbf{K}exp(\hat{\xi})\textbf{T}\textbf{p}(\textbf{u},D_{t}^{l})),\hat{C}_{t-1}^{a}))^2 \end{equation}

where as above $\textbf{T}$ is the current estimate of the transformation from the previous camera pose to the current one.

2. Mask R-CNN

Mask R-CNN [2] serves as the segmentation part. It provides superior segmentation quality of at most 80 different classes at 5Hz frame rate. Its outputs is a set of binary mask within corresponding instance’s class ID and confidences. Mask R-CNN’s architecture is below:

Mask R-CNN Architecture

Here are all 80 different classes Mask R-CNN can do instance segmentation:

Class Color

The Mask R-CNN doesn’t output each classes confidence of the pixel, which is required by the Bayesian Update part. So I do the following process of the Mask-R-CNN outputs (bin masks $M$ , corresponding $class$ and $confidence$ to get the confidence of the $i$ th class for pixel $\textbf{u}$ :

Post Process

3. Bayesian Update

Fusion

In the 3D surfel map generated by ElasticFusion, each surfel stores a discrete probabilities distribution, $P(L_s=l_i)$ over the set of class labels. Each newly generated surfel is initialised with a uniform distribution over the semantic classes, as it begin with no a priori knowledge.

Once Mask R-CNN thread make the inference of $\textbf{I}_k$ done, we get a per-pixel independent probability distribution over the class labels $P(O_{\textbf{u}}=l_{i} \vert \textbf{I}_{k})$ , then the following recursive Bayesian update [4] will update all class label probabilities per surfel, and finally normalise with $Z$ to yield a proper distribution.

Bayesian Update

4. Future Work

Mask R-CNN do execellent in segment the main part of the object from background, but it can not give a perfect boundary of the instance. However, those boundary are usually discontinuities in depth value. So maybe we can refine the segmentation results by doing geometric segmentation given Mask R-CNN inference and depth frame.

Comparison

Reference

[1] ElasticFusion: Real-Time Dense SLAM and Light Source Estimation, T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison and S. Leutenegger, IJRR ‘16

[2] Mask R-CNN, Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick, ICCV ‘17

[3] Kolkir C++ implementation of mask-rcnn.

[4] SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks, J. McCormac, A. Handa, A. J. Davison, and S. Leutenegger, ICRA ‘17