Paper Notes: ASLFeat: Learning Local Features of Accurate Shape and Localization

December 1, 2021

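This paper is a further improvement on D2-Net. Its main contributions are:

1) It uses deformable convolution for dense transformation estimation and feature extraction.
2) It uses the feature pyramid to restore spatial resolution and exploits low-level details for accurate keypoint localization.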

1 Methods

1.1 Prerequisites

The network design builds on two earlier works, DCN and D2-Net. Their main ideas are reviewed first:

Deformable convolutional networks (DCN)

Deformable Convolutional Networks (DCN) aim to learn a dynamic receptive field. A regular convolution is written as:

$\begin{equation}\mathbf{y}(\mathbf{p})=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}+\mathbf{p}_{n}\right)\end{equation}$
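where $\mathbf{p}$ is the coordinate of the convolution center, $\mathbf{p}_{n}$ is an offset within the sampling grid $\mathcal{R}$, $\mathbf{w}(\cdot)$ is the kernel weight, and $\mathbf{x}(\cdot)$ returns the pixel value at a location. DCN adds a predicted offset $\Delta \mathbf{p}_{n}$ and a feature weight $\Delta \mathbf{m}_{n}$ on top of this: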

$\begin{equation}\mathbf{y}(\mathbf{p})=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}+\mathbf{p}_{n}+\Delta \mathbf{p}_{n}\right) \cdot \Delta \mathbf{m}_{n}\end{equation}$
Because $\Delta \mathbf{p}_{n}$ is generally fractional, implementations sample $\mathbf{x}$ with bilinear interpolation.
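To make the deformable sampling concrete, here is a minimal NumPy sketch that evaluates the DCN formula at a single center for a single-channel input. The function names (`bilinear`, `deform_conv_at`), the zero value outside the image and the (row, col) ordering of the grid are my own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def bilinear(x, py, px):
    """Sample the 2-D array x at a fractional (row, col); outside the image counts as 0."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yy, xx in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
        if 0 <= yy < h and 0 <= xx < w:
            weight = (1 - abs(py - yy)) * (1 - abs(px - xx))
            val += weight * x[yy, xx]
    return val

# 3x3 sampling grid R, stored as (row, col) displacements p_n.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deform_conv_at(x, w, p, offsets, masks):
    """Sum over the grid of w(p_n) * x(p + p_n + dp_n) * dm_n at one center p."""
    total = 0.0
    for n, (dy, dx) in enumerate(GRID):
        oy, ox = offsets[n]
        sample = bilinear(x, p[0] + dy + oy, p[1] + dx + ox)
        total += w[dy + 1, dx + 1] * sample * masks[n]
    return total

x = np.arange(25, dtype=float).reshape(5, 5)   # toy image, x[r, c] = 5r + c
w = np.ones((3, 3)) / 9.0                      # averaging kernel
plain = deform_conv_at(x, w, (2, 2), np.zeros((9, 2)), np.ones(9))
shifted = deform_conv_at(x, w, (2, 2), np.full((9, 2), 0.5), np.ones(9))
```

With zero offsets and unit weights the result reduces to the regular convolution; shifting every sample by half a pixel moves the whole receptive field, which is what the learned offsets do per location.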

D2-Net

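The core idea of D2-Net is to merge the descriptor and the detector into one (describe-and-detect) by learning a single feature map $\mathbf{y}$. Local descriptors are obtained by L2-normalizing $\mathbf{y}$, while the detector score is computed from $\mathbf{y}$ as follows: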

Compute the local score:

$\begin{equation}\alpha_{i j}^{c}=\frac{\exp \left(\mathbf{y}_{i j}^{c}\right)}{\sum_{\left(i^{\prime}, j^{\prime}\right) \in \mathcal{N}(i, j)} \exp \mathbf{y}_{i^{\prime} j^{\prime}}^{c}}\end{equation}$
where $\mathcal{N}(i, j)$ is the set of neighboring pixels; for a $3 \times 3$ window it contains 9 pixels in total, including $(i, j)$ itself.

Compute the channel-wise score:

$\begin{equation}\beta_{i j}^{c}=\mathbf{y}_{i j}^{c} / \max _{t} \mathbf{y}_{i j}^{t}\end{equation}$
The final detector score is the maximum over channels of the product of the two:

\begin{equation}s_{i j}=\max _{t}\left(\alpha_{i j}^{t} \beta_{i j}^{t}\right)\end{equation}
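The three D2-Net formulas above can be sketched in a few lines of NumPy. This is only an illustration: the function name `d2net_score`, the fixed 3x3 neighborhood, the zero padding at the border and the small epsilon guarding the division are my assumptions.

```python
import numpy as np

def d2net_score(y):
    """y: (H, W, C) non-negative feature map. Returns an (H, W) detection score."""
    h, w, _ = y.shape
    e = np.exp(y)
    # Sum of exp(y) over each 3x3 neighborhood, per channel (zero padding at the border).
    padded = np.pad(e, ((1, 1), (1, 1), (0, 0)))
    neigh = sum(padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3))
    alpha = e / neigh                                           # local score
    beta = y / np.maximum(y.max(axis=2, keepdims=True), 1e-12)  # ratio-to-max
    return (alpha * beta).max(axis=2)                           # max over channels

y = np.zeros((5, 5, 2))
y[2, 2, 0] = 5.0            # a single strong response
s = d2net_score(y)
```

In this toy map the only non-zero score sits at the peak, because the ratio-to-max term is zero wherever the response itself is zero.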

The improvements proposed in this paper build on these two ideas.

1.2 DCN with Geometric Constraints

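The authors argue that the original DCN has too many degrees of freedom and can predict arbitrary deformations. In visual localization, however, the deformation is global and has only a few degrees of freedom, typically one of 1) similarity, 2) affine and 3) homography.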

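A standard DCN therefore learns parameters that are not needed and still cannot guarantee that the geometric constraint holds. The authors address this by imposing geometric constraints.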

Affine-constrained DCN

A transform composed of a rotation and a scale (a similarity transform) is usually written as:

$\begin{equation}\mathbf{S}=\lambda R(\theta)=\lambda\left(\begin{array}{cc}\cos (\theta) & \sin (\theta) \\-\sin (\theta) & \cos (\theta)\end{array}\right)\end{equation}$
Some works, such as AffNet, additionally introduce a shearing term. Following them, the paper defines the affine transform as:

\begin{equation}\begin{aligned}\mathbf{A} &=\mathbf{S} A^{\prime}=\lambda R(\theta) A^{\prime} \\&=\lambda\left(\begin{array}{cc}\cos (\theta) & \sin (\theta) \\-\sin (\theta) & \cos (\theta)\end{array}\right)\left(\begin{array}{ll}a_{11}^{\prime} & 0 \\a_{21}^{\prime} & a_{22}^{\prime}\end{array}\right)\end{aligned}\end{equation}
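As a quick sanity check of this decomposition, the following sketch composes $\mathbf{A}$ from its five parameters. The function name `affine_from_params` and the argument names are hypothetical; in the network these values would be predicted per pixel rather than passed in by hand.

```python
import numpy as np

def affine_from_params(lam, theta, a11, a21, a22):
    """Compose A = lam * R(theta) * A' with a lower-triangular shape matrix A'."""
    rot = np.array([[np.cos(theta), np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])
    shape = np.array([[a11, 0.0],
                      [a21, a22]])
    return lam * rot @ shape

A = affine_from_params(lam=2.0, theta=np.pi / 6, a11=1.2, a21=0.3, a22=0.8)
```

Because the rotation has unit determinant, the determinant of the result is lam^2 * a11 * a22, so the scale and the shape matrix fully control how much area the sampling grid covers.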

Homography-constrained DCN

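A homography can be solved from 4 point correspondences. Following "Unsupervised deep homography", the paper solves for the matrix $\mathbf{H}$ with a differentiable linear solver. The linear system is usually written as $\mathbf{M h}=\mathbf{0}$ with $\mathbf{M} \in \mathbb{R}^{8 \times 9}$, where $\mathbf{h}$ is a 9-dimensional vector holding the 9 entries of $\mathbf{H}$. Three constraints are imposed, $\mathbf{H}_{33}=1$ and $\mathbf{H}_{31}=\mathbf{H}_{32}=0$, which leaves 6 free parameters. The system is rewritten as $\hat{\mathbf{M}}_{(i)} \hat{\mathbf{h}}=\hat{\mathbf{b}}_{(i)}$ with $\hat{\mathbf{M}}_{(i)} \in \mathbb{R}^{2 \times 6}$, where for each correspondence between $(u_i, v_i)$ and $(u_i', v_i')$: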

\begin{equation}\hat{\mathbf{M}}_{(i)}=\left[\begin{array}{cccccc}0 & 0 & -u_{i} & -v_{i} & v_{i}^{\prime} u_{i} & v_{i}^{\prime} v_{i} \\u_{i} & v_{i} & 0 & 0 & -u_{i}^{\prime} u_{i} & -u_{i}^{\prime} v_{i}\end{array}\right]\end{equation}

and:

\begin{equation}\hat{\mathbf{b}}_{(i)}=\left[-v_{i}^{\prime}, u_{i}^{\prime}\right]^{T} \in \mathbb{R}^{2 \times 1}\end{equation}

Stacking the blocks of all correspondences gives the final linear system:

\begin{equation}\hat{\mathbf{M}} \hat{\mathbf{h}}=\hat{\mathbf{b}}\end{equation}

A differentiable linear solver (tf.matrix_solve) then yields $\hat{\mathbf{h}}$.
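The construction above can be checked numerically. The sketch below stacks $\hat{\mathbf{M}}_{(i)}$ and $\hat{\mathbf{b}}_{(i)}$ for four corner points and solves the system with `np.linalg.lstsq`, used here as a stand-in for the TensorFlow solver. It assumes that $\hat{\mathbf{h}}$ is ordered as $(h_{11}, h_{12}, h_{21}, h_{22}, h_{31}, h_{32})$ for a column-vector homography with no translation; that ordering is my reading of the matrix $\hat{\mathbf{M}}_{(i)}$ and may differ from the paper's indexing convention. The helper names and the test matrix `H_true` are made up for the example.

```python
import numpy as np

def build_system(src, dst):
    """Stack the 2x6 blocks M_(i) and the 2x1 blocks b_(i) for every correspondence."""
    rows, rhs = [], []
    for (u, v), (up, vp) in zip(src, dst):
        rows.append([0, 0, -u, -v, vp * u, vp * v])
        rows.append([u, v, 0, 0, -up * u, -up * v])
        rhs.extend([-vp, up])
    return np.array(rows, dtype=float), np.array(rhs, dtype=float)

def solve_h(src, dst):
    """Least-squares solution of M h = b (stand-in for a differentiable solver)."""
    M, b = build_system(src, dst)
    h_hat, *_ = np.linalg.lstsq(M, b, rcond=None)
    return h_hat

# Four corner points and a hypothetical ground-truth transform (assumed convention).
src = np.array([(-1, -1), (1, -1), (1, 1), (-1, 1)], dtype=float)
H_true = np.array([[1.10, 0.20, 0.0],
                   [-0.10, 0.90, 0.0],
                   [0.05, -0.03, 1.0]])
proj = (H_true @ np.c_[src, np.ones(4)].T).T
dst = proj[:, :2] / proj[:, 2:3]          # perspective division

h_hat = solve_h(src, dst)
```

Four correspondences give 8 equations for 6 unknowns, so the system is overdetermined; with exact correspondences the least-squares solution recovers the six free entries.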

In practice, the authors take the four offsets $\{(-1,-1),(1,-1),(1,1),(-1,1)\}$ and multiply them by $\mathbf{H}$ to obtain the new offsets.

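With the geometric transforms above defined, and writing $\mathbf{T}$ for the chosen one ($\mathbf{S}$, $\mathbf{A}$ or $\mathbf{H}$), the geometry-constrained offsets are obtained as: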

\begin{equation}\Delta \mathbf{p}_{n}=\mathbf{T} \mathbf{p}_{n}-\mathbf{p}_{n}, \text { where } \mathbf{p}_{n} \in \mathcal{R}\end{equation}
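A small sketch of this offset formula for a 3x3 grid follows. The function name `constrained_offsets` is hypothetical, and treating a 3x3 matrix as a homography applied in homogeneous coordinates (with perspective division) is my assumption about how the formula extends to that case.

```python
import numpy as np

# 3x3 sampling grid R as (x, y) points, row by row.
GRID = np.array([(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)], dtype=float)

def constrained_offsets(T):
    """Return the (9, 2) offsets T p_n - p_n for every grid point p_n."""
    if T.shape == (2, 2):                      # similarity or affine
        moved = GRID @ T.T
    else:                                      # 3x3 homography (assumed handling)
        hom = np.c_[GRID, np.ones(len(GRID))] @ T.T
        moved = hom[:, :2] / hom[:, 2:3]
    return moved - GRID

theta = np.pi / 4
S = 1.5 * np.array([[np.cos(theta), np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])
offsets = constrained_offsets(S)
```

All nine offsets are now determined by the handful of parameters in the transform, instead of 18 free values as in an unconstrained DCN. For a 2x2 transform the center point never moves.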

1.3 Selective and Accurate Keypoint Detection

Keypoint peakiness measurement

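In D2-Net, the detector score is derived jointly from the spatial and the channel-wise responses. The channel-wise score uses ratio-to-max, and one possible drawback is that it is only weakly related to the actual distribution of the responses along the channel. The authors therefore revise both scores, the main intent being to use peakiness as the criterion for the detector score: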

\begin{equation}\beta_{i j}^{c}=\operatorname{softplus}\left(\mathbf{y}_{i j}^{c}-\frac{1}{C} \sum_{t} \mathbf{y}_{i j}^{t}\right)\end{equation}

Correspondingly:

\begin{equation}\alpha_{i j}^{c}=\operatorname{softplus}\left(\mathbf{y}_{i j}^{c}-\frac{1}{|\mathcal{N}(i, j)|} \sum_{\left(i^{\prime}, j^{\prime}\right) \in \mathcal{N}(i, j)} \mathbf{y}_{i^{\prime} j^{\prime}}^{c}\right)\end{equation}

where the softplus activation is:

\begin{equation}f(x)=\ln \left(1+e^{x}\right)\end{equation}
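The two peakiness scores can be sketched as below. I assume a plain 3x3 neighborhood, edge padding at the border, and that the two scores are still combined by a channel-wise max of their product as in D2-Net; none of these details are spelled out in these notes, and the function names are my own.

```python
import numpy as np

def softplus(x):
    """ln(1 + e^x), computed in a numerically stable way."""
    return np.logaddexp(0.0, x)

def peakiness_score(y):
    """y: (H, W, C) feature map. Returns an (H, W) detection score."""
    h, w, _ = y.shape
    # Channel-wise peakiness: response minus the mean over channels.
    beta = softplus(y - y.mean(axis=2, keepdims=True))
    # Spatial peakiness: response minus the mean over the 3x3 neighborhood.
    padded = np.pad(y, ((1, 1), (1, 1), (0, 0)), mode="edge")
    local_mean = sum(padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)) / 9.0
    alpha = softplus(y - local_mean)
    return (alpha * beta).max(axis=2)

y = np.zeros((5, 5, 2))
y[2, 2, 0] = 5.0
s = peakiness_score(y)
```

Unlike ratio-to-max, a flat region no longer scores exactly zero: it gets softplus(0)^2 = (ln 2)^2, while a response that stands out from both its spatial neighbors and the other channels scores much higher.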

Multi-level keypoint detection (MulDet)

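Keypoints from D2-Net are not localized accurately enough, because detection is performed on a low-resolution feature map. There are several ways to restore the spatial resolution (shown in the figure below), for example learning an additional feature decoder or using dilated convolutions, but these either increase the number of learnable parameters or consume a large amount of GPU memory or compute.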

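The authors propose a simple yet effective solution that needs no extra learnable weights: exploit the pyramidal feature hierarchy inherent in a convolutional network and combine the detections from multiple feature levels, i.e. a hierarchical scale fusion.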

\begin{equation}\hat{\mathbf{s}}=\frac{1}{\sum_{l} w_{l}} \sum_{l} w_{l} \mathbf{s}^{(l)}\end{equation}

The final score is a weighted sum of the per-level scores at the same location, normalized by the sum of the weights.
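A minimal sketch of this fusion, assuming the per-level score maps have already been resized to a common resolution (the resizing step is not shown, and the function name `fuse_scores` is hypothetical):

```python
import numpy as np

def fuse_scores(scores, weights):
    """scores: list of (H, W) maps at the same resolution; weights: one scalar per level."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(scores, axis=0)                 # (L, H, W)
    return np.tensordot(weights, stacked, axes=1) / weights.sum()

levels = [np.full((4, 4), 1.0), np.full((4, 4), 2.0), np.full((4, 4), 3.0)]
fused = fuse_scores(levels, weights=[1, 2, 3])
```

With weights 1, 2, 3 and constant maps 1, 2, 3, every fused pixel equals (1 + 4 + 9) / 6, so larger weights pull the result toward the corresponding level.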

1.4 Learning Framework

Network architecture

The final network design is shown in the figure below:

As in VGG, each scale has two conv layers. At the last scale, three deformable conv layers are used (conv6, conv7 and conv8). MulDet uses the features of three layers: conv1, conv3 and conv8.

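The fusion weights of the multi-level score (Equation 14 in the paper) are set to $w_{l}=1,2,3$, and the dilation rates used to find the neighboring pixels $\mathcal{N}(i, j)$ in the local score (Equation 3) are $3,2,1$, respectively.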

Loss design

Let $\mathcal{C}$ be the set of keypoint correspondences for the image pair $(I, I^{\prime})$. Similar to D2-Net, the loss is defined as:

\begin{equation}\mathcal{L}\left(I, I^{\prime}\right)=\frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{\hat{s}_{c} \hat{s}_{c}^{\prime}}{\sum_{q \in \mathcal{C}} \hat{s}_{q} \hat{s}_{q}^{\prime}} \mathcal{M}\left(\mathbf{f}_{c}, \mathbf{f}_{c}^{\prime}\right)\end{equation}

where $\hat{s}_{k}$ and $\hat{s}_{k}^{\prime}$ are the detector scores in $I$ and $I^{\prime}$, $\mathbf{f}_{k}$ and $\mathbf{f}_{k}^{\prime}$ are the descriptors, and $\mathcal{M}(\cdot, \cdot)$ is the ranking loss, which replaces the hardest-triplet loss of D2-Net. Here $\mathcal{M}$ is defined as:

\begin{equation}\begin{array}{l}\mathcal{M}\left(\mathbf{f}_{c}, \mathbf{f}_{c}^{\prime}\right)=\left[D\left(\mathbf{f}_{c}, \mathbf{f}_{c}^{\prime}\right)-m_{p}\right]_{+}+ \\\quad\left[m_{n}-\min \left(\min _{k \neq c} D\left(\mathbf{f}_{c}, \mathbf{f}_{k}^{\prime}\right), \min _{k \neq c} D\left(\mathbf{f}_{k}, \mathbf{f}_{c}^{\prime}\right)\right)\right]_{+}\end{array}\end{equation}

where $D(\cdot, \cdot)$ is the Euclidean distance, $[\cdot]_{+}$ denotes $\max(\cdot, 0)$, and $m_p$ and $m_n$ are set to 0.2 and 1.0 respectively.
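The two loss formulas can be written compactly in NumPy. This is a sketch on toy data rather than the training code: the function names are mine, row c of `f` is assumed to match row c of `f2`, and no gradients are involved.

```python
import numpy as np

def ranking_term(f, f2, m_p=0.2, m_n=1.0):
    """f, f2: (N, D) matched descriptors. Returns the (N,) values of M(f_c, f'_c)."""
    d = np.linalg.norm(f[:, None, :] - f2[None, :, :], axis=2)   # d[c, k] = D(f_c, f'_k)
    pos = np.diag(d)
    neg = d + np.diag(np.full(len(d), np.inf))                   # exclude k == c
    # Hardest negative in either direction: over f'_k (rows) and over f_k (columns).
    hardest = np.minimum(neg.min(axis=1), neg.min(axis=0))
    return np.maximum(pos - m_p, 0.0) + np.maximum(m_n - hardest, 0.0)

def detection_weighted_loss(f, f2, s, s2):
    """Average of the ranking terms, each weighted by its normalized score product."""
    w = s * s2
    return np.mean(w / w.sum() * ranking_term(f, f2))

f = np.eye(3)                       # three orthonormal toy descriptors
ones = np.ones(3)
loss_good = detection_weighted_loss(f, f.copy(), ones, ones)
loss_bad = detection_weighted_loss(f, f[[1, 0, 2]], ones, ones)  # two matches swapped
```

Correspondences with high detection scores in both images contribute more to the loss, so the detector is pushed to score highly where the descriptors are already discriminative.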

2 Experiments

2.1 Image Matching

The results are as follows:

2.2 Comparisons with other methods

The results are as follows:

2.3 3D Reconstruction

The results are as follows:

2.4 Visual Localization

The results are as follows:

Paper & Source Code

Paper
https://arxiv.org/abs/2003.10071

Source code
https://github.com/lzx551402/ASLFeat

About The Author

skylook

An enthusiast of augmented reality and image recognition technology.
