# Paper Notes: ASLFeat: Learning Local Features of Accurate Shape and Localization

This paper builds on D2-Net; its main innovations are as follows:

1) Using Deformable Convolution for dense transformation estimation and feature extraction

2) Using a feature pyramid to adapt to spatial resolution and using low-level details for precise keypoint localization

# 1 Methods

## 1.1 Prerequisites

The network design in this article is based on the following two works: DCN and D2-Net. First, let's review the main ideas of these two works:

**Deformable convolutional networks (DCN)**

Deformable Convolutional Networks (DCN) aim to learn dynamic receptive fields. For traditional convolution, the formula is:

\begin{equation}\mathbf{y}(\mathbf{p})=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}+\mathbf{p}_{n}\right)\end{equation}

where \mathbf{p} is the center coordinate of the convolution, \mathbf{p}_{n} enumerates the offsets within the kernel grid \mathcal{R}, and \mathbf{x}(\cdot) is the feature value at that location. DCN adds a learned offset \Delta \mathbf{p}_{n} and a modulation weight \Delta \mathbf{m}_{n} on top of this:

\begin{equation}\mathbf{y}(\mathbf{p})=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}+\mathbf{p}_{n}+\Delta \mathbf{p}_{n}\right) \cdot \Delta \mathbf{m}_{n}\end{equation}

Since \Delta \mathbf{p}_{n} is generally fractional, bilinear interpolation is used in the actual implementation.
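As a concrete sketch, the response at a single location follows directly from the formula above. This toy NumPy version (3x3 kernel, zero padding) is for illustration only, not the paper's implementation:

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample x at a fractional position (py, px) with bilinear
    interpolation, returning 0 outside the image (zero padding)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < h and 0 <= xx < w:
                val += (1 - abs(py - yy)) * (1 - abs(px - xx)) * x[yy, xx]
    return val

def deform_conv_point(x, weights, p, offsets, mods):
    """Deformable convolution response at center p:
    y(p) = sum_n w(p_n) * x(p + p_n + dp_n) * dm_n."""
    grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]  # 3x3 range R
    y = 0.0
    for n, (di, dj) in enumerate(grid):
        dy, dx = offsets[n]
        y += weights[n] * bilinear_sample(x, p[0] + di + dy, p[1] + dj + dx) * mods[n]
    return y
```

With zero offsets and unit modulation this reduces exactly to an ordinary convolution.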

**D2-Net**

The core idea of D2-Net is describe-and-detect: combining the descriptor and the detector in a single network. From a learned feature map \mathbf{y}, local descriptors are obtained via L2 normalization, and the detector score is computed from \mathbf{y} as follows:

Calculate local score:

\begin{equation}\alpha_{i j}^{c}=\frac{\exp \left(\mathbf{y}_{i j}^{c}\right)}{\sum_{\left(i^{\prime}, j^{\prime}\right) \in \mathcal{N}(i, j)} \exp \mathbf{y}_{i^{\prime} j^{\prime}}^{c}}\end{equation}

where \mathcal{N}(i, j) is the set of neighboring pixels, e.g. 9 pixels for a 3 \times 3 window.

Calculate channel-wise-score:

\begin{equation}\beta_{i j}^{c}=\mathbf{y}_{i j}^{c} / \max _{t} \mathbf{y}_{i j}^{t}\end{equation}

The final detector score at each location is the maximum over channels of the product of the two:

\begin{equation}s_{i j}=\max _{c}\left(\alpha_{i j}^{c} \beta_{i j}^{c}\right)\end{equation}
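A minimal NumPy sketch of this D2-Net scoring, assuming a non-negative feature map y (e.g. after ReLU); the explicit loops favor clarity over speed:

```python
import numpy as np

def d2net_score(y, radius=1):
    """D2-Net detection score for a feature map y of shape (H, W, C):
    alpha = local softmax per channel, beta = ratio-to-max over channels,
    fused as s_ij = max_c(alpha * beta)."""
    h, w, c = y.shape
    e = np.exp(y)
    alpha = np.zeros_like(y)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            alpha[i, j] = e[i, j] / e[i0:i1, j0:j1].sum(axis=(0, 1))
    beta = y / y.max(axis=2, keepdims=True)
    return (alpha * beta).max(axis=2)
```

A single strong response dominates both its spatial neighborhood and its channel, so it receives the highest score.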

Based on this, the subsequent improvements in this article are proposed.

## 1.2 DCN with Geometric Constraints

The authors argue that the original DCN has too many degrees of freedom and may predict arbitrary deformations. For visual localization, however, the deformation between views is global and has limited degrees of freedom, typically 1) similarity, 2) affine, or 3) homography.

An unconstrained DCN may thus spend capacity on irrelevant parameters without guaranteeing a geometrically valid transform; the authors address this by imposing geometric constraints on the predicted offsets.

**Affine-constrained DCN**

A transform with only rotation and uniform scaling (i.e., a similarity) is:

\begin{equation}\mathbf{S}=\lambda R(\theta)=\lambda\left(\begin{array}{cc}\cos (\theta) & \sin (\theta) \\-\sin (\theta) & \cos (\theta)\end{array}\right)\end{equation}

In some works, such as AffNet, a shearing component is additionally estimated. Similarly, this paper defines the affine transform as:

\begin{equation}\mathbf{A}=\mathbf{S} \mathbf{A}^{\prime}=\lambda R(\theta)\left(\begin{array}{cc}a_{11} & 0 \\ a_{21} & a_{22}\end{array}\right), \quad \text { where } \operatorname{det}\left(\mathbf{A}^{\prime}\right)=1\end{equation}
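A sketch of composing such a constrained transform and turning it into DCN offsets via \Delta \mathbf{p}_{n}=\mathbf{T} \mathbf{p}_{n}-\mathbf{p}_{n} (the paper applies this same formula for all transform types). The det-1 parameterization of A' here (a12 = 0, a22 = 1/a11) is an illustrative assumption, not necessarily the paper's exact choice:

```python
import numpy as np

def affine_transform(lam, theta, a11, a21):
    """Compose A = S A' = lam * R(theta) * A' with det(A') = 1.
    Assumption: lower-triangular A' with a12 = 0 and a22 = 1/a11,
    which enforces det(A') = 1 by construction."""
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    A_prime = np.array([[a11, 0.0],
                        [a21, 1.0 / a11]])  # det(A') = a11 * (1/a11) = 1
    return lam * R @ A_prime

def constrained_offsets(T):
    """Kernel offsets induced by a 2x2 transform T on the regular 3x3
    grid R: dp_n = T p_n - p_n for each p_n in R."""
    grid = np.array([(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)], float)
    return grid @ T.T - grid
```

For the identity transform the offsets vanish, recovering a plain convolution.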

**Homography-constrained DCN**

Homography is usually estimated from four point pairs. Following "Unsupervised deep homography", this paper solves the \mathbf{H} matrix with a differentiable linear solver. The standard linear system is \mathbf{M h}=\mathbf{0}, where \mathbf{M} \in \mathbb{R}^{8 \times 9} and \mathbf{h} is the 9-vector of \mathbf{H}'s entries. Imposing \mathbf{H}_{33}=1 and, since the convolution grid already accounts for translation, \mathbf{H}_{13}=\mathbf{H}_{23}=0 leaves 6 unknowns. The system is rewritten as \hat{\mathbf{M}}_{(i)} \hat{\mathbf{h}}=\hat{\mathbf{b}}_{(i)} with \hat{\mathbf{M}}_{(i)} \in \mathbb{R}^{2 \times 6}, where for each matching point pair (u_i, v_i) \leftrightarrow (u_i', v_i'):

\begin{equation}\hat{\mathbf{M}}_{(i)}=\left[\begin{array}{cccccc}0 & 0 & -u_{i} & -v_{i} & v_{i}^{\prime} u_{i} & v_{i}^{\prime} v_{i} \\u_{i} & v_{i} & 0 & 0 & -u_{i}^{\prime} u_{i} & -u_{i}^{\prime} v_{i}\end{array}\right]\end{equation}

and:

\begin{equation}\hat{\mathbf{b}}_{(i)}=\left[-v_{i}^{\prime}, u_{i}^{\prime}\right]^{T} \in \mathbb{R}^{2 \times 1}\end{equation}

Stacking the four point pairs gives the final linear system:

\begin{equation}\hat{\mathbf{M}} \hat{\mathbf{h}}=\hat{\mathbf{b}}\end{equation}

which is solved for \hat{\mathbf{h}} with a differentiable linear solver (tf.matrix_solve).

In practice, the four corner offsets \{(-1,-1),(1,-1),(1,1),(-1,1)\} are transformed by \mathbf{H} to obtain the corresponding points that form the four point pairs.
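The construction above can be sketched in NumPy (using np.linalg.lstsq in place of TensorFlow's solver). Consistent with the structure of the M matrix, the recovered H has h13 = h23 = 0 and h33 = 1:

```python
import numpy as np

def solve_h(pts, pts_p):
    """Recover the 6 free parameters of H (h13 = h23 = 0, h33 = 1)
    from four point pairs via the linear system M_hat h_hat = b_hat.
    pts, pts_p: sequences of four (u, v) and (u', v') coordinates."""
    M = np.zeros((8, 6))
    b = np.zeros(8)
    for i, ((u, v), (up, vp)) in enumerate(zip(pts, pts_p)):
        M[2 * i]     = [0, 0, -u, -v, vp * u, vp * v]
        M[2 * i + 1] = [u, v, 0, 0, -up * u, -up * v]
        b[2 * i], b[2 * i + 1] = -vp, up
    h11, h12, h21, h22, h31, h32 = np.linalg.lstsq(M, b, rcond=None)[0]
    return np.array([[h11, h12, 0.0],
                     [h21, h22, 0.0],
                     [h31, h32, 1.0]])

def apply_h(H, p):
    """Project a point with the homography (homogeneous division)."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]
```

Round-tripping the four kernel corners through a known constrained H recovers it exactly, which is a handy sanity check.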

By defining all the geometric transformations above, the final offset with geometric constraints is obtained using the following formula:

\begin{equation}\Delta \mathbf{p}_{n}=\mathbf{T} \mathbf{p}_{n}-\mathbf{p}_{n}, \text { where } \mathbf{p}_{n} \in \mathcal{R}\end{equation}

## 1.3 Selective and Accurate Keypoint Detection

**Keypoint peakiness measurement**

In D2-Net, the detector score combines a spatial and a channel-wise response. For the channel-wise score, the ratio-to-max is used, which only weakly reflects how a response stands out within the actual distribution across channels. The authors therefore rework both scores so that the detector d measures how peaked a response is:

\begin{equation}\beta_{i j}^{c}=\operatorname{softplus}\left(\mathbf{y}_{i j}^{c}-\frac{1}{C} \sum_{t} \mathbf{y}_{i j}^{t}\right)\end{equation}

Correspondingly:

\begin{equation}\alpha_{i j}^{c}=\operatorname{softplus}\left(\mathbf{y}_{i j}^{c}-\frac{1}{|\mathcal{N}(i, j)|} \sum_{\left(i^{\prime}, j^{\prime}\right) \in \mathcal{N}(i, j)} \mathbf{y}_{i^{\prime} j^{\prime}}^{c}\right)\end{equation}

The softplus activation function is:

\begin{equation}f(x)=\ln \left(1+e^{x}\right)\end{equation}

**Multi-level keypoint detection (MulDet)**

The keypoint localization accuracy of D2-Net is insufficient because detection is performed on low-resolution feature maps. There are various methods to restore spatial resolution (as shown in the figure below), such as learning additional feature decoders or using dilated convolutions. However, these methods either increase the number of learning parameters or consume a large amount of GPU memory or computational power.

The authors propose a simple and effective solution that does not require additional learning weights. By leveraging the inherent pyramid feature hierarchy of convolutional networks, detection is combined from multiple feature levels, i.e., a hierarchical scale fusion method.

\begin{equation}\hat{\mathbf{s}}=\frac{1}{\sum_{l} w_{l}} \sum_{l} w_{l} \mathbf{s}^{(l)}\end{equation}

The final score is the weighted average of the scores at corresponding positions across the different levels.
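Putting the peakiness scores and the multi-level fusion together, a NumPy sketch; per-level detection is assumed to follow D2-Net's max-over-channels fusion of alpha and beta, and lower-resolution score maps are assumed to be upsampled to full resolution before the weighted sum:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def peakiness_score(y, radius=1):
    """Per-level detection score: beta = channel peakiness, alpha = local
    spatial peakiness, fused as s_ij = max_c(alpha * beta).
    y: feature map of shape (H, W, C)."""
    h, w, c = y.shape
    beta = softplus(y - y.mean(axis=2, keepdims=True))
    alpha = np.zeros_like(y)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            alpha[i, j] = softplus(y[i, j] - y[i0:i1, j0:j1].mean(axis=(0, 1)))
    return (alpha * beta).max(axis=2)

def muldet_fuse(scores, weights):
    """Weighted average of per-level score maps, each already resized
    to full resolution."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

Because every level contributes a normalized weight, fusing identical score maps leaves the map unchanged.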

## 1.4 Learning Framework

**Network architecture**

The final network design is shown below:

Similar to VGG's design, there are two conv layers per scale, and the three conv layers at the last level (conv6, conv7, and conv8) are deformable. The MulDet part computes detection scores from conv1, conv3, and conv8.

In formula 14, the weights (w_1, w_2, w_3) are set to (1, 2, 3), and in formula 3 the corresponding neighborhood sizes \mathcal{N}(i, j) for the three levels are (3, 2, 1).

**Loss design**

Let \mathcal{C} represent the set of feature point matches in the image pair (I, I^{\prime}). Similar to D2-Net, the Loss is defined as follows:

\begin{equation}\mathcal{L}\left(I, I^{\prime}\right)=\frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{\hat{s}_{c} \hat{s}_{c}^{\prime}}{\sum_{q \in \mathcal{C}} \hat{s}_{q} \hat{s}_{q}^{\prime}} \mathcal{M}\left(\mathbf{f}_{c}, \mathbf{f}_{c}^{\prime}\right)\end{equation}

where \hat{s}_{c} and \hat{s}_{c}^{\prime} are the detector scores in I and I^{\prime}, and \mathbf{f}_{c} and \mathbf{f}_{c}^{\prime} the corresponding descriptors. \mathcal{M}(\cdot, \cdot) is a ranking loss that replaces the hardest-triplet loss of D2-Net and is defined as:

\begin{equation}\begin{array}{l}\mathcal{M}\left(\mathbf{f}_{c}, \mathbf{f}_{c}^{\prime}\right)=\left[D\left(\mathbf{f}_{c}, \mathbf{f}_{c}^{\prime}\right)-m_{p}\right]_{+}+ \\\quad\left[m_{n}-\min \left(\min _{k \neq c} D\left(\mathbf{f}_{c}, \mathbf{f}_{k}^{\prime}\right), \min _{k \neq c} D\left(\mathbf{f}_{k}, \mathbf{f}_{c}^{\prime}\right)\right)\right]_{+}\end{array}\end{equation}

where D(\cdot, \cdot) is the Euclidean distance, and m_p and m_n are set to 0.2 and 1.0, respectively.
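A vectorized NumPy sketch of this hardest-contrastive ranking loss over C matched descriptor pairs (for clarity only, not the paper's code):

```python
import numpy as np

def ranking_loss(F, Fp, m_p=0.2, m_n=1.0):
    """Ranking loss M(f_c, f'_c) for C matched pairs.
    F, Fp: (C, D) descriptor arrays where row c of F matches row c of Fp.
    Returns the per-pair loss vector."""
    # Pairwise Euclidean distances between all descriptors.
    D = np.linalg.norm(F[:, None, :] - Fp[None, :, :], axis=-1)
    pos = np.diag(D)  # distance of each matching pair
    # Hardest negative per anchor, excluding the true match (k != c).
    off = D + np.eye(len(F)) * 1e9
    neg = np.minimum(off.min(axis=1), off.min(axis=0))
    return np.maximum(pos - m_p, 0) + np.maximum(m_n - neg, 0)
```

When matches are exact and all negatives lie beyond the margin m_n, the loss is zero, as expected from the hinge terms.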

# 2 Experiments

## 2.1 Image Matching

The experimental results are as follows:

## 2.2 Comparisons with other methods

The experimental results are as follows:

## 2.3 3D Reconstruction

The experimental results are as follows:

## 2.4 Visual Localization

The experimental results are as follows:

# Paper & Code

**Paper**

https://arxiv.org/abs/2003.10071

**Code**

https://github.com/lzx551402/ASLFeat
