Paper Notes: End-to-End Learnable Geometric Vision by Backpropagating PnP Optimization
This paper combines the classical PnP method with deep learning, and the overall idea is relatively simple. The main focus is how to backpropagate gradients through the traditional PnP solver to the neural network, enabling end-to-end training and even operation without given data associations (Blind PnP).
1 Backpropagating a PnP solver (BPnP)
First, we describe the PnP problem using mathematical language.
Define g as a PnP solver, and its output y is the solved 6DoF pose:
$$\boldsymbol{y}=g(\boldsymbol{x}, \boldsymbol{z}, \mathbf{K})\tag{1}$$
where $\boldsymbol{x}$ denotes the observed 2D coordinates of the feature points in the image, $\boldsymbol{z}$ the corresponding 3D point coordinates in space (the two form one-to-one 2D-3D correspondences), and $\mathbf{K}$ is the camera intrinsic matrix:
$$\begin{array}{l}\boldsymbol{x}=\left[\begin{array}{llll}\boldsymbol{x}_{1}^{T} & \boldsymbol{x}_{2}^{T} & \ldots & \boldsymbol{x}_{n}^{T}\end{array}\right]^{T} \in \mathbb{R}^{2 n \times 1} \\ \boldsymbol{z}=\left[\begin{array}{llll}\boldsymbol{z}_{1}^{T} & \boldsymbol{z}_{2}^{T} & \ldots & \boldsymbol{z}_{n}^{T}\end{array}\right]^{T} \in \mathbb{R}^{3 n \times 1}\end{array}\tag{2}$$
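As a concrete illustration of eq. (1), the forward solve can be carried out by any classical PnP routine. Below is a minimal NumPy/OpenCV sketch with made-up correspondences; the solver flag `SOLVEPNP_ITERATIVE` is an arbitrary choice for this example, not necessarily the one used in the paper.

```python
import numpy as np
import cv2

# Toy 2D-3D correspondences (hypothetical values), n = 6 points.
n = 6
rng = np.random.default_rng(0)
z = rng.random((n, 3))                                   # 3D points z_i
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                          # camera intrinsics
dist = np.zeros(5)                                       # no lens distortion

# Synthesise the 2D observations x_i by projecting with a ground-truth pose.
rvec_gt = np.array([0.1, 0.2, 0.3])
tvec_gt = np.array([0.0, 0.0, 4.0])
x, _ = cv2.projectPoints(z, rvec_gt, tvec_gt, K, dist)   # shape (n, 1, 2)

# Forward pass y = g(x, z, K): a single call to a classical PnP solver.
ok, rvec, tvec = cv2.solvePnP(z, x, K, dist, flags=cv2.SOLVEPNP_ITERATIVE)
print(ok, rvec.ravel(), tvec.ravel())                    # recovered 6DoF pose
```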
In fact, solving PnP amounts to the following optimization problem:
$$\boldsymbol{y}=\underset{\boldsymbol{y} \in SE(3)}{\arg \min } \sum_{i=1}^{n}\left\|\boldsymbol{r}_{i}\right\|_{2}^{2}\tag{3}$$
where $\boldsymbol{\pi}_{i}=\pi\left(\boldsymbol{z}_{i} \mid \boldsymbol{y}, \mathbf{K}\right)$ is the projection of $\boldsymbol{z}_{i}$ under pose $\boldsymbol{y}$, and $\boldsymbol{r}_{i}=\boldsymbol{x}_{i}-\boldsymbol{\pi}_{i}$ is the reprojection error.
Thus, we can simplify it as:
$$\boldsymbol{y}=\underset{\boldsymbol{y} \in SE(3)}{\arg \min } \quad\left\|\boldsymbol{x}-\boldsymbol{\pi}\right\|_{2}^{2}\tag{4}$$
where
$$\boldsymbol{\pi}=\left[\begin{array}{llll}\boldsymbol{\pi}_{1}^{T} & \boldsymbol{\pi}_{2}^{T} & \ldots & \boldsymbol{\pi}_{n}^{T}\end{array}\right]^{T} \in \mathbb{R}^{2 n \times 1}\tag{5}$$
stacks the projections of all $n$ points.
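The projection $\pi$ is the only piece that has to be written in a differentiable framework; the objective, its gradients, and the implicit differentiation below are all built on top of it. A minimal PyTorch sketch, assuming an axis-angle plus translation parameterization of the pose $\boldsymbol{y}$:

```python
import torch

def project(z, y, K):
    """pi(z | y, K): project n 3D points z (n, 3) to the image with pose
    y = [rvec, tvec] (6,) (axis-angle + translation) and intrinsics K (3, 3)."""
    rvec, t = y[:3], y[3:]
    theta = torch.linalg.norm(rvec) + 1e-12            # rotation angle
    k = rvec / theta                                    # rotation axis
    zero = torch.zeros((), dtype=y.dtype)
    skew = torch.stack([torch.stack([zero, -k[2], k[1]]),
                        torch.stack([k[2], zero, -k[0]]),
                        torch.stack([-k[1], k[0], zero])])
    R = (torch.eye(3, dtype=y.dtype) + torch.sin(theta) * skew
         + (1.0 - torch.cos(theta)) * (skew @ skew))    # Rodrigues' formula
    cam = z @ R.T + t                                   # points in the camera frame
    uv = cam @ K.T                                      # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]                       # perspective division -> (n, 2)

# Reprojection residuals of eq. (3), given observations x of shape (n, 2):
# r = x - project(z, y, K)
```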
1.1 Implicit function differentiation
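The rule used here is the implicit function theorem: if a constraint $f(\boldsymbol{x}, \boldsymbol{y})=\mathbf{0}$ holds in a neighbourhood of a solution and $\partial f / \partial \boldsymbol{y}$ is invertible there, then $\boldsymbol{y}$ is locally a differentiable function of $\boldsymbol{x}$ with
$$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}=-\left[\frac{\partial f}{\partial \boldsymbol{y}}\right]^{-1}\left[\frac{\partial f}{\partial \boldsymbol{x}}\right]$$
The next subsection constructs such a constraint $f$ from the optimality condition of the PnP objective, so that this rule can be applied to the solver output.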
1.2 Constructing the constraint function f
Define the objective function of PnP as:
$$o(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}, \mathbf{K})=\sum_{i=1}^{n}\left\|\boldsymbol{r}_{i}\right\|_{2}^{2}\tag{6}$$
When the objective function reaches its minimum, we have:
$$\left.\frac{\partial o(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}, \mathbf{K})}{\partial \boldsymbol{y}}\right|_{\boldsymbol{y}=g(\boldsymbol{x}, \boldsymbol{z}, \mathbf{K})}=\mathbf{0}\tag{7}$$
We take this stationarity condition as the constraint function:
$$f(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}, \mathbf{K})=\left[f_{1}, \ldots, f_{m}\right]^{T}\tag{8}$$
where each component is the partial derivative of the objective with respect to one pose parameter,
$$f_{j}=\frac{\partial o(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}, \mathbf{K})}{\partial y_{j}}, \quad j=1, \ldots, m\tag{9}$$
with $m$ the dimension of the pose parameterization.
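Because $f$ is itself a gradient, it can be obtained with one call to automatic differentiation rather than derived by hand. A sketch reusing the `project` function above (so $m = 6$ for the axis-angle plus translation pose):

```python
import torch

def objective(x, y, z, K):
    """o(x, y, z, K): sum of squared reprojection errors, eq. (6)."""
    r = x - project(z, y, K)                 # residuals r_i, shape (n, 2)
    return (r ** 2).sum()

def constraint_f(x, y, z, K):
    """f = grad_y o: the stationarity constraint of eqs. (7)-(9), an m-vector (m = 6)."""
    if not y.requires_grad:
        y = y.detach().requires_grad_(True)
    o = objective(x, y, z, K)
    # create_graph=True keeps f itself differentiable, which the backward pass needs.
    return torch.autograd.grad(o, y, create_graph=True)[0]
```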
1.3 Forward and backward pass
We first rewrite the PnP solver to make the initial pose $\boldsymbol{y}^{(0)}$ explicit:
$$\boldsymbol{y}=g\left(\boldsymbol{x}, \boldsymbol{z}, \mathbf{K}, \boldsymbol{y}^{(0)}\right)\tag{10}$$
According to the implicit function differentiation rule:
$$\begin{aligned}\frac{\partial g}{\partial \boldsymbol{x}} &=-\left[\frac{\partial f}{\partial \boldsymbol{y}}\right]^{-1}\left[\frac{\partial f}{\partial \boldsymbol{x}}\right] \\ \frac{\partial g}{\partial \boldsymbol{z}} &=-\left[\frac{\partial f}{\partial \boldsymbol{y}}\right]^{-1}\left[\frac{\partial f}{\partial \boldsymbol{z}}\right] \\ \frac{\partial g}{\partial \mathbf{K}} &=-\left[\frac{\partial f}{\partial \boldsymbol{y}}\right]^{-1}\left[\frac{\partial f}{\partial \mathbf{K}}\right]\end{aligned}\tag{11}$$
During backpropagation the network provides the gradient of the loss with respect to the output, $\nabla \boldsymbol{y}$; the gradients with respect to each input then follow from the chain rule:
$$\nabla \boldsymbol{x}=\left[\frac{\partial g}{\partial \boldsymbol{x}}\right]^{T} \nabla \boldsymbol{y}, \quad \nabla \boldsymbol{z}=\left[\frac{\partial g}{\partial \boldsymbol{z}}\right]^{T} \nabla \boldsymbol{y}, \quad \nabla \mathbf{K}=\left[\frac{\partial g}{\partial \mathbf{K}}\right]^{T} \nabla \boldsymbol{y}\tag{12}$$
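In practice the Jacobians of $f$ in eq. (11) can themselves be obtained by automatic differentiation, since `constraint_f` above is a differentiable PyTorch function. A sketch of the resulting backward pass (the exact mechanics in the authors' code may differ):

```python
import torch

def bpnp_input_grads(x, y, z, K, grad_y):
    """Implicit backward pass of eqs. (11)-(12): for each input v in {x, z, K},
    grad_v = (dg/dv)^T grad_y with dg/dv = -[df/dy]^{-1}[df/dv]."""
    J = torch.autograd.functional.jacobian
    dfdy = J(lambda v: constraint_f(x, v, z, K), y)                    # (6, 6)
    w = torch.linalg.solve(dfdy.T, grad_y)                             # [df/dy]^{-T} grad_y
    dfdx = J(lambda v: constraint_f(v, y, z, K), x).reshape(6, -1)     # (6, 2n)
    dfdz = J(lambda v: constraint_f(x, y, v, K), z).reshape(6, -1)     # (6, 3n)
    dfdK = J(lambda v: constraint_f(x, y, z, v), K).reshape(6, -1)     # (6, 9)
    grad_x = (-dfdx.T @ w).reshape_as(x)
    grad_z = (-dfdz.T @ w).reshape_as(z)
    grad_K = (-dfdK.T @ w).reshape_as(K)
    return grad_x, grad_z, grad_K
```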
1.4 Implementation notes
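One natural way to package the two passes is a custom `torch.autograd.Function`: the forward runs a classical solver, the backward applies the implicit gradients above. This is a sketch under that assumption; `cv2.solvePnP` stands in for whatever solver the authors actually call, and the initial pose $\boldsymbol{y}^{(0)}$ of eq. (10) is omitted for brevity.

```python
import numpy as np
import cv2
import torch

class BPnPFunction(torch.autograd.Function):
    """PnP as an autograd op: classical solve forward, implicit gradients backward."""

    @staticmethod
    def forward(ctx, x, z, K):
        # A fuller version would also accept y^(0) and pass it as an extrinsic guess.
        _, rvec, tvec = cv2.solvePnP(z.detach().numpy().astype(np.float64),
                                     x.detach().numpy().astype(np.float64),
                                     K.detach().numpy().astype(np.float64), None)
        y = torch.tensor(np.concatenate([rvec.ravel(), tvec.ravel()]), dtype=x.dtype)
        ctx.save_for_backward(x, y, z, K)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        x, y, z, K = ctx.saved_tensors
        with torch.enable_grad():                     # the Jacobians need grad mode
            return bpnp_input_grads(x, y, z, K, grad_y)
```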
2 End-to-end learning with BPnP
2.1 Pose estimation
This section describes estimating the pose $\boldsymbol{y}$ from the keypoint coordinates $\boldsymbol{x}=h(\mathbf{I} ; \boldsymbol{\theta})$ predicted by a network, given a known map $\boldsymbol{z}$ and camera intrinsics $\mathbf{K}$, as shown below:
The loss is described by the following formula:
$$l(\boldsymbol{x}, \boldsymbol{y})=\left\|\pi(\boldsymbol{z} \mid \boldsymbol{y}, \mathbf{K})-\pi\left(\boldsymbol{z} \mid \boldsymbol{y}^{*}, \mathbf{K}\right)\right\|_{2}^{2}+\lambda R(\boldsymbol{x}, \boldsymbol{y})\tag{13}$$
An important gradient in the above pipeline is computed as follows:
$$\frac{\partial \ell}{\partial \boldsymbol{\theta}}=\frac{\partial l}{\partial \boldsymbol{y}} \frac{\partial g}{\partial \boldsymbol{x}} \frac{\partial h}{\partial \boldsymbol{\theta}}+\frac{\partial l}{\partial \boldsymbol{x}} \frac{\partial h}{\partial \boldsymbol{\theta}}\tag{14}$$
The figure below shows the convergence process in two cases, where the first row uses $\lambda = 1$ and the second row $\lambda = 0$.
Figure 1: the case $h(\mathbf{I} ; \boldsymbol{\theta})=\boldsymbol{\theta}$, i.e. the keypoint coordinates themselves are the trainable parameters.
Figure 2: $h$ is a modified VGG11 network.
Both experiments converge whether or not the regularization term is used, and with $\lambda = 1$ both converge to better results.
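Putting eqs. (13)-(14) together, one training step could look like the sketch below; it reuses `project` and `BPnPFunction` from above, and the regularizer $R(\boldsymbol{x}, \boldsymbol{y})$ is taken, for illustration only, to be the reprojection-consistency term $\|\boldsymbol{x}-\pi(\boldsymbol{z} \mid \boldsymbol{y}, \mathbf{K})\|_{2}^{2}$.

```python
import torch

def training_step(model, optimizer, image, z, K, y_star, lam=1.0):
    """One step of Sec. 2.1: keypoints from the network, pose from BPnP, loss eq. (13)."""
    x = model(image).reshape(-1, 2)              # x = h(I; theta), predicted keypoints
    y = BPnPFunction.apply(x, z, K)              # y = g(x, z, K), gradients via eq. (11)
    pose_term = ((project(z, y, K) - project(z, y_star, K)) ** 2).sum()
    reg = ((x - project(z, y, K)) ** 2).sum()    # assumed form of R(x, y)
    loss = pose_term + lam * reg                 # eq. (13)
    optimizer.zero_grad()
    loss.backward()                              # realises the chain rule of eq. (14)
    optimizer.step()
    return loss.item()
```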
2.2 SfM with calibrated cameras
This section addresses structure-from-motion with calibrated cameras: given 2D keypoint observations $\boldsymbol{x}^{(j)}$ in $N$ views and the known intrinsics $\mathbf{K}$, the 3D structure $\boldsymbol{z}$ (parameterized by $\boldsymbol{\theta}$) is the quantity being learned, while the pose of each view is recovered from it by the PnP solver, $\boldsymbol{y}^{(j)}=g\left(\boldsymbol{x}^{(j)}, \boldsymbol{z}^{(j)}, \mathbf{K}\right)$.
The loss is defined as follows:
$$l\left(\left\{\boldsymbol{y}^{(j)}\right\}_{j=1}^{N}, \boldsymbol{z}\right)=\sum_{j=1}^{N}\left\|\boldsymbol{x}^{(j)}-\pi\left(\boldsymbol{z}^{(j)} \mid \boldsymbol{y}^{(j)}, \mathbf{K}\right)\right\|_{2}^{2}\tag{15}$$
The gradient derivation in the process is as follows:
$$\frac{\partial \ell}{\partial \boldsymbol{\theta}}=\sum_{j=1}^{N}\left(\frac{\partial l}{\partial \boldsymbol{z}^{(j)}} \frac{\partial \boldsymbol{z}^{(j)}}{\partial \boldsymbol{\theta}}+\frac{\partial l}{\partial \boldsymbol{y}^{(j)}} \frac{\partial \boldsymbol{y}^{(j)}}{\partial \boldsymbol{z}^{(j)}} \frac{\partial \boldsymbol{z}^{(j)}}{\partial \boldsymbol{\theta}}\right)\tag{16}$$
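A sketch of the corresponding objective, assuming for simplicity that every point is visible in every view (so $\boldsymbol{z}^{(j)}=\boldsymbol{z}$) and that the structure itself is the learnable quantity ($\boldsymbol{\theta}=\boldsymbol{z}$):

```python
import torch

def sfm_loss(z, views, K):
    """views: list of N tensors x^(j) of shape (n, 2), one per calibrated camera."""
    loss = 0.0
    for x_j in views:
        y_j = BPnPFunction.apply(x_j, z, K)                     # pose of view j, via PnP
        loss = loss + ((x_j - project(z, y_j, K)) ** 2).sum()   # one term of eq. (15)
    return loss

# Typical use: z = torch.nn.Parameter(z_init); stepping an optimiser on
# sfm_loss(z, views, K) realises the gradient of eq. (16).
```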
The figure below demonstrates the convergence process of the SfM optimization:
2.3 Camera calibration
The camera calibration process is as follows:
The loss is defined as follows:
$$l(\mathbf{K}, \boldsymbol{y})=\|\boldsymbol{x}-\pi(\boldsymbol{z} \mid \boldsymbol{y}, \mathbf{K})\|_{2}^{2}\tag{17}$$
The gradient derivation is analogous to (14), with the intrinsics $\mathbf{K}=h(\mathbf{I} ; \boldsymbol{\theta})$ now the quantity produced from the learnable parameters:
$$\frac{\partial \ell}{\partial \boldsymbol{\theta}}=\frac{\partial l}{\partial \boldsymbol{y}} \frac{\partial g}{\partial \mathbf{K}} \frac{\partial h}{\partial \boldsymbol{\theta}}+\frac{\partial l}{\partial \mathbf{K}} \frac{\partial h}{\partial \boldsymbol{\theta}}\tag{18}$$
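A sketch of this setup, where the learnable quantities are the four intrinsic parameters (focal lengths and principal point, an assumed parameterization) and the pose is recovered by BPnP:

```python
import torch

def calibration_loss(fx, fy, cx, cy, x, z):
    """Eq. (17) with learnable intrinsics; x (n, 2) and z (n, 3) are fixed."""
    zero, one = torch.zeros_like(fx), torch.ones_like(fx)
    K = torch.stack([torch.stack([fx, zero, cx]),
                     torch.stack([zero, fy, cy]),
                     torch.stack([zero, zero, one])])
    y = BPnPFunction.apply(x, z, K)                   # pose from the current intrinsics
    return ((x - project(z, y, K)) ** 2).sum()        # reprojection error of eq. (17)
```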
3 Object pose estimation with BPnP
Finally, the authors design an object pose estimation pipeline based on BPnP:
Paper & Code
Paper: https://arxiv.org/abs/1909.06043
Code: https://github.com/BoChenYS/BPnP
Video: https://www.youtube.com/watch?v=eYmoAAsiBEE