Paper Notes: ASLFeat: Learning Local Features of Accurate Shape and Localization

December 1, 2021

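This paper builds on D2-Net and improves it further. Its main contributions are:

1) Deformable convolution is used for dense transformation estimation and feature extraction.
2) A feature pyramid is used to adapt to spatial resolution, and low-level details are used for accurate keypoint localization.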

1 Method

1.1 Prerequisites

The network design builds on two earlier works, DCN and D2-Net. Their main ideas are reviewed first:

Deformable convolutional networks (DCN)

Deformable Convolutional Networks (DCN) aim to learn a dynamic receptive field. A regular convolution is written as:

$\begin{equation}\mathbf{y}(\mathbf{p})=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}+\mathbf{p}_{n}\right)\end{equation}$
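where $\mathbf{p}$ is the center coordinate of the convolution, $\mathbf{p}_{n}$ is an offset within the sampling grid $\mathcal{R}$ of the kernel, and $\mathbf{x}(\cdot)$ returns the value at the given location. DCN extends this with a predicted offset $\Delta \mathbf{p}_{n}$ and a feature weight $\Delta \mathbf{m}_{n}$: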

$\begin{equation}\mathbf{y}(\mathbf{p})=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}+\mathbf{p}_{n}+\Delta \mathbf{p}_{n}\right) \cdot \Delta \mathbf{m}_{n}\end{equation}$
Since $\Delta \mathbf{p}_{n}$ is generally fractional, the implementation samples $\mathbf{x}$ with bilinear interpolation.
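To make the DCN formula concrete, here is a minimal NumPy sketch that evaluates it at a single location for a single-channel map and a 3x3 kernel. The function names (`bilinear_sample`, `deformable_conv_at`) and the zero-outside-the-map border handling are illustrative assumptions, not taken from the paper or from any DCN library.

```python
import numpy as np

def bilinear_sample(x, r, c):
    """Sample the 2-D array x at a fractional (row, col).
    Assumption: values outside the map count as zero."""
    h, w = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                # Bilinear weight of the integer neighbor (rr, cc).
                weight = (1.0 - abs(r - rr)) * (1.0 - abs(c - cc))
                value += weight * x[rr, cc]
    return value

def deformable_conv_at(x, w, p, offsets, masks):
    """y(p) = sum_n w(p_n) * x(p + p_n + dp_n) * dm_n for a 3x3 kernel.
    offsets: (9, 2) array of (d_row, d_col); masks: (9,) array."""
    grid = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]  # the set R
    y = 0.0
    for n, (dr, dc) in enumerate(grid):
        r = p[0] + dr + offsets[n][0]
        c = p[1] + dc + offsets[n][1]
        y += w[dr + 1, dc + 1] * bilinear_sample(x, r, c) * masks[n]
    return y
```

With all offsets set to zero and all masks set to one, this reduces to the regular convolution in the first equation.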

D2-Net

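The core idea of D2-Net is to merge the descriptor and the detector into one (describe-and-detect) by learning a single feature map. Local descriptors are obtained by L2-normalizing the feature vector at each location, while the detector score is computed from the feature map $\mathbf{y}$ as follows.

Compute the local score: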

$\begin{equation}\alpha_{i j}^{c}=\frac{\exp \left(\mathbf{y}_{i j}^{c}\right)}{\sum_{\left(i^{\prime}, j^{\prime}\right) \in \mathcal{N}(i, j)} \exp \mathbf{y}_{i^{\prime} j^{\prime}}^{c}}\end{equation}$
where $\mathcal{N}(i, j)$ is the set of neighboring pixels, for example the 9 pixels of a $3 \times 3$ window around $(i, j)$.

Compute the channel-wise score:

$\begin{equation}\beta_{i j}^{c}=\mathbf{y}_{i j}^{c} / \max _{t} \mathbf{y}_{i j}^{t}\end{equation}$
The final detector score is the maximum over channels of the product of the two:

\begin{equation}s_{i j}=\max _{c}\left(\alpha_{i j}^{c} \beta_{i j}^{c}\right)\end{equation}
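The three D2-Net formulas above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `d2net_score` is made up, the feature map is assumed non-negative (e.g. after ReLU), the neighborhood is a 3x3 window with zero padding at the border, and the small `eps` in the ratio-to-max is my addition to avoid division by zero.

```python
import numpy as np

def d2net_score(y, eps=1e-8):
    """y: (H, W, C) non-negative feature map -> (H, W) detection score."""
    h, w, _ = y.shape
    e = np.exp(y)
    # Sum of exp(y) over the 3x3 neighborhood, per channel (zero padding).
    padded = np.pad(e, ((1, 1), (1, 1), (0, 0)))
    neigh_sum = np.zeros_like(e)
    for dr in range(3):
        for dc in range(3):
            neigh_sum += padded[dr:dr + h, dc:dc + w, :]
    alpha = e / neigh_sum                               # local (spatial) score
    beta = y / (y.max(axis=2, keepdims=True) + eps)     # channel-wise ratio-to-max
    return (alpha * beta).max(axis=2)                   # max over channels
```

A location scores high only when its response is both a spatial local maximum and the dominant channel at that pixel.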

The improvements described below are built on top of this.

1.2 DCN with Geometric Constraints

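The authors argue that the original DCN has too many degrees of freedom and can predict arbitrary deformations. For visual localization, however, the deformation is global and has only a few degrees of freedom, typically 1) similarity, 2) affine and 3) homography.

A conventional DCN therefore learns parameters that do not matter while still failing to guarantee that the geometric constraints hold. The authors address this by adding geometric constraints.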

Affine-constrained DCN

A transform consisting of rotation and scaling (a similarity transform) is usually written as:

$\begin{equation}\mathbf{S}=\lambda R(\theta)=\lambda\left(\begin{array}{cc}\cos (\theta) & \sin (\theta) \\-\sin (\theta) & \cos (\theta)\end{array}\right)\end{equation}$
Some works, such as AffNet, additionally introduce a skew (residual shape) term $A^{\prime}$. Following them, the paper defines the affine transform as:

\begin{equation}\begin{aligned}\mathbf{A} &=\mathbf{S} A^{\prime}=\lambda R(\theta) A^{\prime} \\&=\lambda\left(\begin{array}{cc}\cos (\theta) & \sin (\theta) \\-\sin (\theta) & \cos (\theta)\end{array}\right)\left(\begin{array}{ll}a_{11}^{\prime} & 0 \\a_{21}^{\prime} & a_{22}^{\prime}\end{array}\right)\end{aligned}\end{equation}
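As a small illustration of the decomposition above, the sketch below composes $\mathbf{A}$ from a scale, a rotation angle and the three entries of the lower-triangular matrix $A^{\prime}$. The function names are hypothetical, and in the network these parameters would be predicted per location rather than passed in by hand.

```python
import numpy as np

def rotation(theta):
    """R(theta), using the same sign convention as the equations above."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def affine_shape(lam, theta, a11, a21, a22):
    """A = lam * R(theta) * A', with A' lower triangular."""
    residual = np.array([[a11, 0.0], [a21, a22]])
    return lam * rotation(theta) @ residual
```

Setting $A^{\prime}$ to the identity recovers the similarity transform $\mathbf{S}=\lambda R(\theta)$.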

Homography-constrained DCN

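A homography can usually be solved from 4 point correspondences. Following "Unsupervised deep homography", the paper obtains the $\mathbf{H}$ matrix with a differentiable linear solver. The linear system is usually written as $\mathbf{M h}=\mathbf{0}$ with $\mathbf{M} \in \mathbb{R}^{8 \times 9}$, where $\mathbf{h}$ is a 9-dimensional vector holding the 9 entries of $\mathbf{H}$. Three constraints are imposed, $\mathbf{H}_{33}=1$ and $\mathbf{H}_{13}=\mathbf{H}_{23}=0$ (no translation), so only 6 parameters remain: $\hat{\mathbf{h}}=\left[\mathbf{H}_{11}, \mathbf{H}_{12}, \mathbf{H}_{21}, \mathbf{H}_{22}, \mathbf{H}_{31}, \mathbf{H}_{32}\right]^{T}$. The system is rewritten as $\hat{\mathbf{M}}_{(i)} \hat{\mathbf{h}}=\hat{\mathbf{b}}_{(i)}$, where $\hat{\mathbf{M}}_{(i)} \in \mathbb{R}^{2 \times 6}$, and for each correspondence between $(u_i, v_i)$ and $(u_i', v_i')$: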

\begin{equation}\hat{\mathbf{M}}_{(i)}=\left[\begin{array}{cccccc}0 & 0 & -u_{i} & -v_{i} & v_{i}^{\prime} u_{i} & v_{i}^{\prime} v_{i} \\u_{i} & v_{i} & 0 & 0 & -u_{i}^{\prime} u_{i} & -u_{i}^{\prime} v_{i}\end{array}\right]\end{equation}

and:

\begin{equation}\hat{\mathbf{b}}_{(i)}=\left[-v_{i}^{\prime}, u_{i}^{\prime}\right]^{T} \in \mathbb{R}^{2 \times 1}\end{equation}

Stacking the equations of all correspondences gives the final linear system:

\begin{equation}\hat{\mathbf{M}} \hat{\mathbf{h}}=\hat{\mathbf{b}}\end{equation}

Solving it with a differentiable linear solver (`tf.matrix_solve`) yields $\hat{\mathbf{h}}$.
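The construction of $\hat{\mathbf{M}}$ and $\hat{\mathbf{b}}$ can be sketched in NumPy as below. Two things here are my assumptions rather than the paper's method: the stacked system from 4 correspondences is 8x6, so the sketch swaps in `np.linalg.lstsq` (least squares, not differentiable) for the solver named in the notes, and the function names are made up for illustration.

```python
import numpy as np

def apply_homography(H, pts):
    """Map (N, 2) points through a 3x3 homography."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

def solve_constrained_homography(src, dst):
    """src, dst: (4, 2) arrays of (u, v) and (u', v').
    Returns H with H13 = H23 = 0 and H33 = 1."""
    rows, rhs = [], []
    for (u, v), (up, vp) in zip(src, dst):
        # Two rows of M_hat(i) and the matching entries of b_hat(i).
        rows.append([0.0, 0.0, -u, -v, vp * u, vp * v])
        rows.append([u, v, 0.0, 0.0, -up * u, -up * v])
        rhs.extend([-vp, up])
    M = np.array(rows, dtype=float)
    b = np.array(rhs, dtype=float)
    # Assumption: least squares stands in for the differentiable solver.
    h, *_ = np.linalg.lstsq(M, b, rcond=None)
    return np.array([[h[0], h[1], 0.0],
                     [h[2], h[3], 0.0],
                     [h[4], h[5], 1.0]])
```

When the four target points are produced by a homography that already satisfies the constraints, the least-squares solution recovers it up to numerical precision.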

In practice, the four offsets $\{(-1, -1),(1, -1),(1, 1),(-1, 1)\}$ are multiplied by $\mathbf{H}$ to obtain the new offsets.

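With all the geometric transforms above defined, the geometrically constrained offsets are obtained as follows, where $\mathbf{T}$ is the chosen transform ($\mathbf{S}$, $\mathbf{A}$ or $\mathbf{H}$):

\begin{equation}\Delta \mathbf{p}_{n}=\mathbf{T} \mathbf{p}_{n}-\mathbf{p}_{n}, \text { where } \mathbf{p}_{n} \in \mathcal{R}\end{equation}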
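The offset formula can be illustrated with a short NumPy sketch that warps the 3x3 sampling grid by a given transform and subtracts the original grid. The function name `constrained_offsets`, the (x, y) ordering of grid points and the handling of the 3x3 case through homogeneous coordinates are assumptions made for this example.

```python
import numpy as np

def constrained_offsets(T):
    """Delta p_n = T p_n - p_n for the 3x3 grid R.
    T is either a 2x2 matrix (similarity / affine) or a 3x3 homography."""
    grid = np.array([(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)],
                    dtype=float)  # the 9 kernel positions p_n
    T = np.asarray(T, dtype=float)
    if T.shape == (2, 2):
        warped = grid @ T.T
    else:
        homog = np.hstack([grid, np.ones((9, 1))]) @ T.T
        warped = homog[:, :2] / homog[:, 2:3]
    return warped - grid
```

The result has shape (9, 2), one offset per kernel position, and is zero everywhere when the transform is the identity.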

1.3 Selective and Accurate Keypoint Detection

Keypoint peakiness measurement

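In D2-Net, the detector score is obtained jointly from the spatial and the channel-wise responses. The channel-wise score uses ratio-to-max, and a possible drawback is that the score becomes only weakly tied to the actual distribution of responses along the channel dimension. The authors therefore modify both scores as follows, mainly so that peakiness serves as the criterion for the detector score: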

\begin{equation}\beta_{i j}^{c}=\operatorname{softplus}\left(\mathbf{y}_{i j}^{c}-\frac{1}{C} \sum_{t} \mathbf{y}_{i j}^{t}\right)\end{equation}

Correspondingly:

\begin{equation}\alpha_{i j}^{c}=\operatorname{softplus}\left(\mathbf{y}_{i j}^{c}-\frac{1}{|\mathcal{N}(i, j)|} \sum_{\left(i^{\prime}, j^{\prime}\right) \in \mathcal{N}(i, j)} \mathbf{y}_{i^{\prime} j^{\prime}}^{c}\right)\end{equation}

where the softplus activation is:

\begin{equation}f(x)=\ln \left(1+e^{x}\right)\end{equation}
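A minimal NumPy sketch of the two peakiness scores follows. It assumes a 3x3 neighborhood with edge-replicated padding, and it assumes the two scores are combined as in the D2-Net score above (product, then maximum over channels); the notes do not restate that step, and the function names are hypothetical.

```python
import numpy as np

def softplus(x):
    """ln(1 + e^x), computed in a numerically stable way."""
    return np.logaddexp(0.0, x)

def peakiness_score(y):
    """y: (H, W, C) feature map -> (H, W) score."""
    h, w, _ = y.shape
    padded = np.pad(y, ((1, 1), (1, 1), (0, 0)), mode="edge")
    local_mean = np.zeros_like(y, dtype=float)
    for dr in range(3):
        for dc in range(3):
            local_mean += padded[dr:dr + h, dc:dc + w, :]
    local_mean /= 9.0                                      # mean over N(i, j)
    alpha = softplus(y - local_mean)                       # spatial peakiness
    beta = softplus(y - y.mean(axis=2, keepdims=True))     # channel peakiness
    # Assumption: combined like the D2-Net score (max over channels).
    return (alpha * beta).max(axis=2)
```

On a constant feature map both differences are zero, so every pixel receives the same score $(\ln 2)^2$; an isolated peak stands out clearly above that baseline.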

Multi-level keypoint detection (MulDet)

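The keypoints of D2-Net are not localized accurately enough, because detection is performed on low-resolution feature maps. There are several ways to restore spatial resolution (see the figure below), for example learning an additional feature decoder or using dilated convolutions, but these approaches either increase the number of learned parameters or consume a large amount of GPU memory or computation.

The authors propose a simple yet effective solution that needs no extra learned weights: it exploits the pyramidal feature hierarchy inherent in convolutional networks and combines detections from multiple feature levels, i.e. a hierarchical scale fusion.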

\begin{equation}\hat{\mathbf{s}}=\frac{1}{\sum_{l} w_{l}} \sum_{l} w_{l} \mathbf{s}^{(l)}\end{equation}

The final score is the weighted average of the scores of all levels at the same location.
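The fusion step can be sketched as below. The sketch assumes the per-level score maps have already been resampled to a common resolution, which is a preprocessing step not covered here; `fuse_scores` is a hypothetical name.

```python
import numpy as np

def fuse_scores(score_maps, weights):
    """Weighted average of per-level score maps.
    score_maps: list of (H, W) arrays at the same resolution (assumption).
    weights: one scalar w_l per level."""
    stacked = np.stack(score_maps, axis=0)                    # (L, H, W)
    w = np.asarray(weights, dtype=float).reshape(-1, 1, 1)
    return (w * stacked).sum(axis=0) / w.sum()
```

For example, `fuse_scores([s1, s2, s3], [1, 2, 3])` gives the deepest level three times the influence of the shallowest one.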

1.4 Learning Framework

Network architecture

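The final network design is shown in the figure below:

As in VGG, each scale has two conv layers. The last scale uses three deformable conv layers (conv6, conv7 and conv8), and MulDet uses the features of three layers: conv1, conv3 and conv8.

For these three levels, the weights in the multi-level fusion equation are $w_{l}=1,2,3$, and the corresponding settings of $\mathcal{N}(i, j)$ in the local score are $3,2,1$.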

Loss design

Let $\mathcal{C}$ denote the set of feature correspondences in the image pair $(I, I^{\prime})$. Similar to D2-Net, the loss is defined as:

\begin{equation}\mathcal{L}\left(I, I^{\prime}\right)=\frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{\hat{s}_{c} \hat{s}_{c}^{\prime}}{\sum_{q \in \mathcal{C}} \hat{s}_{q} \hat{s}_{q}^{\prime}} \mathcal{M}\left(\mathbf{f}_{c}, \mathbf{f}_{c}^{\prime}\right)\end{equation}

where $\hat{s}_{c}$ and $\hat{s}_{c}^{\prime}$ are the detector scores of the correspondence in $I$ and $I^{\prime}$, $\mathbf{f}_{c}$ and $\mathbf{f}_{c}^{\prime}$ are the descriptors, and $\mathcal{M}(\cdot, \cdot)$ is a ranking loss that replaces the hardest-triplet loss used in D2-Net. Here $\mathcal{M}$ is defined as: