Paper Notes: ASLFeat: Learning Local Features of Accurate Shape and Localization

December 1, 2021


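This paper is a further improvement on D2-Net. Its main contributions are:

1) It uses deformable convolution to perform dense transformation estimation and feature extraction.
2) It uses the feature pyramid to adapt the spatial resolution and exploits low-level details to localize keypoints accurately.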

1 Method

1.1 Prerequisites

The network design in this paper builds on two earlier works, DCN and D2-Net. We first review the main idea of each.

Deformable convolutional networks (DCN)

The main goal of Deformable Convolutional Networks (DCN) is to learn a dynamic receptive field. A conventional convolution is written as:

$\begin{equation}\mathbf{y}(\mathbf{p})=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}+\mathbf{p}_{n}\right)\end{equation}$
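where $\mathbf{p}$ is the coordinate of the convolution center, $\mathbf{p}_{n}$ is an offset within the sampling grid $\mathcal{R}$ of the convolution, and $\mathbf{x}(\cdot)$ is the value at the given location. DCN adds a predicted offset $\Delta \mathbf{p}_{n}$ and a feature weight $\Delta \mathbf{m}_{n}$: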

$\begin{equation}\mathbf{y}(\mathbf{p})=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}+\mathbf{p}_{n}+\Delta \mathbf{p}_{n}\right) \cdot \Delta \mathbf{m}_{n}\end{equation}$
Because $\Delta \mathbf{p}_{n}$ is generally fractional, the implementation samples $\mathbf{x}$ with bilinear interpolation.
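To make the two formulas concrete, here is a minimal pure-Python sketch of a deformable convolution evaluated at a single location. It assumes a single-channel input, a $3 \times 3$ grid $\mathcal{R}$, and zero padding outside the input; the function names `bilinear` and `deform_conv_at` are illustrative only and do not come from the paper or from any library.

```python
import math

def bilinear(x, r, c):
    """Sample the 2-D grid x (list of rows) at fractional (r, c); zero outside."""
    h, w = len(x), len(x[0])
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                # Weight decays linearly with distance to each of the 4 neighbours.
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr][cc]
    return value

def deform_conv_at(x, w, p, offsets, masks):
    """y(p) = sum_n w(p_n) * x(p + p_n + dp_n) * dm_n over a 3x3 grid R."""
    grid = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    y = 0.0
    for n, (dr, dc) in enumerate(grid):
        r = p[0] + dr + offsets[n][0]
        c = p[1] + dc + offsets[n][1]
        y += w[n] * bilinear(x, r, c) * masks[n]
    return y
```

With all offsets set to zero and all masks set to one, `deform_conv_at` reduces to the conventional convolution of the first formula.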

D2-Net

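The core idea of D2-Net is describe-and-detect: the descriptor and the detector are derived from one network. Given the learned feature map $\mathbf{y}$, the local descriptors are obtained by L2-normalizing the feature vector at each location, and the detector score is computed from $\mathbf{y}$ as follows.

Local score: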

$\begin{equation}\alpha_{i j}^{c}=\frac{\exp \left(\mathbf{y}_{i j}^{c}\right)}{\sum_{\left(i^{\prime}, j^{\prime}\right) \in \mathcal{N}(i, j)} \exp \mathbf{y}_{i^{\prime} j^{\prime}}^{c}}\end{equation}$
where $\mathcal{N}(i, j)$ is the set of neighboring pixels around $(i, j)$; for example, a $3 \times 3$ neighborhood contains 9 pixels in total, including $(i, j)$ itself.

Channel-wise score:

$\begin{equation}\beta_{i j}^{c}=\mathbf{y}_{i j}^{c} / \max _{t} \mathbf{y}_{i j}^{t}\end{equation}$
The final detector score is the maximum over channels of the product of the two scores:

\begin{equation}s_{i j}=\max _{t}\left(\alpha_{i j}^{t} \beta_{i j}^{t}\right)\end{equation}
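The three formulas above can be sketched for a single pixel as follows. This is an illustration rather than the D2-Net implementation: it assumes the feature maps are stored as nested lists `y[c][row][col]`, uses a $3 \times 3$ neighborhood clipped at the image border, and returns 0 when no channel has a positive response (a guard added here to avoid dividing by zero). The name `d2net_score` is hypothetical.

```python
import math

def d2net_score(y, i, j):
    """Detector score s_ij for feature maps y[c][row][col]."""
    channels, h, w = len(y), len(y[0]), len(y[0][0])
    peak = max(y[t][i][j] for t in range(channels))
    if peak <= 0:
        return 0.0  # assumption: no positive response means no keypoint
    best = 0.0
    for c in range(channels):
        # alpha: soft local maximum over the 3x3 neighbourhood N(i, j)
        neigh = [y[c][a][b]
                 for a in range(max(0, i - 1), min(h, i + 2))
                 for b in range(max(0, j - 1), min(w, j + 2))]
        alpha = math.exp(y[c][i][j]) / sum(math.exp(v) for v in neigh)
        # beta: ratio-to-max across channels
        beta = y[c][i][j] / peak
        best = max(best, alpha * beta)
    return best
```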

The paper builds its improvements on top of this formulation.

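1.2 DCN with geometric constraints

The authors argue that the original DCN has too many degrees of freedom and can therefore predict arbitrary deformations. In visual localization, however, the deformation as a whole has limited degrees of freedom and is typically restricted to 1) similarity, 2) affine, or 3) homography.

Conventional DCN therefore learns parameters that do not matter while still being unable to guarantee that the geometric constraint holds. To address this, the authors constrain the predicted offsets geometrically.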

Affine-constrained DCN

A transformation composed of rotation and scaling is typically written as:

$\begin{equation}\mathbf{S}=\lambda R(\theta)=\lambda\left(\begin{array}{cc}\cos (\theta) & \sin (\theta) \\-\sin (\theta) & \cos (\theta)\end{array}\right)\end{equation}$
Some works, such as AffNet, additionally introduce a shear term. Following AffNet, this paper defines the affine transformation as:

\begin{equation}\begin{aligned}\mathbf{A} &=\mathbf{S} A^{\prime}=\lambda R(\theta) A^{\prime} \\&=\lambda\left(\begin{array}{cc}\cos (\theta) & \sin (\theta) \\-\sin (\theta) & \cos (\theta)\end{array}\right)\left(\begin{array}{ll}a_{11}^{\prime} & 0 \\a_{21}^{\prime} & a_{22}^{\prime}\end{array}\right)\end{aligned}\end{equation}
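As a small worked example of this decomposition, the sketch below composes $\mathbf{A}$ from a scale, a rotation angle, and the three entries of the lower-triangular factor $A^{\prime}$. The function name `affine_shape` and its argument names are illustrative; how the network predicts these parameters is not covered here.

```python
import math

def affine_shape(lam, theta, a11, a21, a22):
    """Return A = lam * R(theta) * A', with A' = [[a11, 0], [a21, a22]]."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rot = [[cos_t, sin_t], [-sin_t, cos_t]]   # R(theta) as written in the notes
    resid = [[a11, 0.0], [a21, a22]]          # lower-triangular A'
    return [[lam * sum(rot[r][k] * resid[k][c] for k in range(2))
             for c in range(2)]
            for r in range(2)]
```

Setting `a11 = a22 = 1` and `a21 = 0` recovers the rotation-and-scale matrix $\mathbf{S}$.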

Homography-constrained DCN

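A homography can generally be solved from 4 point correspondences. Following the approach of "Unsupervised Deep Homography", the paper obtains $\mathbf{H}$ with a differentiable linear solver. The linear system is usually written as $\mathbf{M h}=\mathbf{0}$ with $\mathbf{M} \in \mathbb{R}^{8 \times 9}$, where $\mathbf{h}$ is the 9-dimensional vector holding the entries of $\mathbf{H}$. Imposing the three constraints $\mathbf{H}_{33}=1$ and $\mathbf{H}_{13}=\mathbf{H}_{23}=0$ (no translation) leaves 6 free parameters, $\hat{\mathbf{h}}=\left[\mathbf{H}_{11}, \mathbf{H}_{12}, \mathbf{H}_{21}, \mathbf{H}_{22}, \mathbf{H}_{31}, \mathbf{H}_{32}\right]^{T}$. The system is then rewritten as $\hat{\mathbf{M}}_{(i)} \hat{\mathbf{h}}=\hat{\mathbf{b}}_{(i)}$, where $\hat{\mathbf{M}}_{(i)} \in \mathbb{R}^{2 \times 6}$ and, for each correspondence between $(u_i, v_i)$ and $(u_i', v_i')$: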

\begin{equation}\hat{\mathbf{M}}_{(i)}=\left[\begin{array}{cccccc}0 & 0 & -u_{i} & -v_{i} & v_{i}^{\prime} u_{i} & v_{i}^{\prime} v_{i} \\u_{i} & v_{i} & 0 & 0 & -u_{i}^{\prime} u_{i} & -u_{i}^{\prime} v_{i}\end{array}\right]\end{equation}

and:

\begin{equation}\hat{\mathbf{b}}_{(i)}=\left[-v_{i}^{\prime}, u_{i}^{\prime}\right]^{T} \in \mathbb{R}^{2 \times 1}\end{equation}

Stacking the blocks of all correspondences gives the final linear system:

\begin{equation}\hat{\mathbf{M}} \hat{\mathbf{h}}=\hat{\mathbf{b}}\end{equation}

A differentiable linear solver (tf.matrix_solve) is then used to solve for $\hat{\mathbf{h}}$.

In practice, the four corner offsets $\{(-1,-1),(1,-1),(1,1),(-1,1)\}$ are multiplied by $\mathbf{H}$ to obtain the new offsets.
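The construction above can be checked numerically with a small pure-Python sketch. It stacks the $2 \times 6$ blocks for four correspondences and solves the resulting $8 \times 6$ system in the least-squares sense through the normal equations. That choice is an assumption made here to obtain a square system; the notes only say that a differentiable linear solver is used. The helper names `solve_linear`, `homography_from_pairs`, and `warp` are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def homography_from_pairs(src, dst):
    """Estimate H (with H13 = H23 = 0, H33 = 1) from point pairs."""
    m, b = [], []
    for (u, v), (u2, v2) in zip(src, dst):
        m.append([0.0, 0.0, -u, -v, v2 * u, v2 * v])
        m.append([u, v, 0.0, 0.0, -u2 * u, -u2 * v])
        b.extend([-v2, u2])
    # Normal equations (M^T M) h = M^T b: an assumption, see the lead-in.
    mtm = [[sum(row[r] * row[c] for row in m) for c in range(6)]
           for r in range(6)]
    mtb = [sum(row[r] * b[k] for k, row in enumerate(m)) for r in range(6)]
    h11, h12, h21, h22, h31, h32 = solve_linear(mtm, mtb)
    return [[h11, h12, 0.0], [h21, h22, 0.0], [h31, h32, 1.0]]

def warp(h, point):
    """Apply H to a 2-D point using homogeneous coordinates."""
    u, v = point
    x, y, z = (row[0] * u + row[1] * v + row[2] for row in h)
    return (x / z, y / z)
```

Calling `warp(h, corner)` on each of the four corner offsets yields the transformed corners described above.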

With any of the geometric transformations $\mathbf{T}$ defined above, the geometrically constrained offsets are finally obtained as:

\begin{equation}\Delta \mathbf{p}_{n}=\mathbf{T} \mathbf{p}_{n}-\mathbf{p}_{n}, \text { where } \mathbf{p}_{n} \in \mathcal{R}\end{equation}
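The offset formula can be illustrated with a short sketch for the case where $\mathbf{T}$ is a $2 \times 2$ matrix (similarity or affine). A homography would additionally require the projective division, which is omitted here. The function name `constrained_offsets` and the $3 \times 3$ grid are assumptions for illustration.

```python
def constrained_offsets(t):
    """Return [(p_n, dp_n)] over the 3x3 grid R, where dp_n = T p_n - p_n."""
    grid = [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)]
    result = []
    for x, y in grid:
        tx = t[0][0] * x + t[0][1] * y   # first row of T applied to p_n
        ty = t[1][0] * x + t[1][1] * y   # second row of T applied to p_n
        result.append(((x, y), (tx - x, ty - y)))
    return result
```

The identity transformation produces zero offsets everywhere, so the layer then behaves like a regular convolution; the grid center always keeps a zero offset because $\mathbf{T}$ here is linear.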

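1.3 Selective and accurate keypoint detection

Keypoint peakiness measurement

In D2-Net, the detector score is obtained jointly from the spatial and the channel-wise responses. The channel-wise score uses ratio-to-max, which may weaken its relation to the actual distribution of responses along the channel. The authors therefore redefine both scores, mainly so that peakiness becomes the criterion for the detector score. The channel-wise score becomes: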

\begin{equation}\beta_{i j}^{c}=\operatorname{softplus}\left(\mathbf{y}_{i j}^{c}-\frac{1}{C} \sum_{t} \mathbf{y}_{i j}^{t}\right)\end{equation}

Correspondingly, the local score becomes:

\begin{equation}\alpha_{i j}^{c}=\operatorname{softplus}\left(\mathbf{y}_{i j}^{c}-\frac{1}{|\mathcal{N}(i, j)|} \sum_{\left(i^{\prime}, j^{\prime}\right) \in \mathcal{N}(i, j)} \mathbf{y}_{i^{\prime} j^{\prime}}^{c}\right)\end{equation}

The softplus activation function is:

\begin{equation}f(x)=\ln \left(1+e^{x}\right)\end{equation}
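A minimal sketch of the two peakiness scores for one channel at one pixel is given below. It assumes feature maps stored as nested lists `y[c][row][col]` and a $3 \times 3$ neighborhood clipped at the border; the names `softplus` and `peakiness` are illustrative.

```python
import math

def softplus(x):
    """f(x) = ln(1 + e^x)."""
    return math.log1p(math.exp(x))

def peakiness(y, i, j, c):
    """Return (alpha, beta) for channel c at pixel (i, j) of y[c][row][col]."""
    channels, h, w = len(y), len(y[0]), len(y[0][0])
    neigh = [y[c][a][b]
             for a in range(max(0, i - 1), min(h, i + 2))
             for b in range(max(0, j - 1), min(w, j + 2))]
    # alpha: response minus the spatial mean over N(i, j)
    alpha = softplus(y[c][i][j] - sum(neigh) / len(neigh))
    # beta: response minus the mean over all C channels at (i, j)
    channel_mean = sum(y[t][i][j] for t in range(channels)) / channels
    beta = softplus(y[c][i][j] - channel_mean)
    return alpha, beta
```

On a completely flat feature map both differences are zero, so both scores equal $\ln 2$; a response that stands out from its neighbors or from the other channels scores higher.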

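Multi-level keypoint detection (MulDet)

D2-Net localizes keypoints inaccurately because detection is performed on low-resolution feature maps. There are several ways to restore the spatial resolution (see the figure below), such as learning an additional feature decoder or using dilated convolutions, but these either increase the number of learned parameters or consume considerable GPU memory or compute.

The authors propose a simple and effective solution that needs no additional learned weights: exploit the inherent pyramidal feature hierarchy of the convolutional network and detect jointly across several feature levels, i.e. hierarchical scale fusion.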

\begin{equation}\hat{\mathbf{s}}=\frac{1}{\sum_{l} w_{l}} \sum_{l} w_{l} \mathbf{s}^{(l)}\end{equation}

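The final score at each location is the weighted sum of the scores at the same location across levels, normalized by the sum of the weights.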
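The fusion step can be sketched as below, under the assumption that the per-level score maps have already been brought to a common resolution (how that is done is not covered in these notes). The function name `fuse_scores` is hypothetical.

```python
def fuse_scores(score_maps, weights):
    """s_hat = (sum_l w_l * s^(l)) / (sum_l w_l) for same-sized 2-D score maps."""
    total = sum(weights)
    h, w = len(score_maps[0]), len(score_maps[0][0])
    return [[sum(wl * s[i][j] for wl, s in zip(weights, score_maps)) / total
             for j in range(w)]
            for i in range(h)]
```

For example, `fuse_scores([s1, s2, s3], [1, 2, 3])` gives the third level three times the influence of the first.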

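1.4 Learning framework

Network architecture

The final network design is shown in the figure below:

As in VGG, each scale has two conv layers, and the last stage uses three deformable convs (conv6, conv7, conv8). MulDet uses the features of three layers: conv1, conv3, and conv8.

In the fusion formula (Eq. 14), the weights are $w_{l}=1,2,3$; in the local score (Eq. 3), $\mathcal{N}(i, j)=3,2,1$, for the three levels respectively.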

Loss design

$\mathcal{C}$ denotes the set of keypoint correspondences in the image pair $(I, I^{\prime})$. Similar to D2-Net, the loss is defined as:

\begin{equation}\mathcal{L}\left(I, I^{\prime}\right)=\frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{\hat{s}_{c} \hat{s}_{c}^{\prime}}{\sum_{q \in \mathcal{C}} \hat{s}_{q} \hat{s}_{q}^{\prime}} \mathcal{M}\left(\mathbf{f}_{c}, \mathbf{f}_{c}^{\prime}\right)\end{equation}

where $\hat{s}_{c}$ and $\hat{s}_{c}^{\prime}$ are the detector scores of correspondence $c$ in $I$ and $I^{\prime}$, and $\mathbf{f}_{c}$ and $\mathbf{f}_{c}^{\prime}$ are the corresponding descriptors. $\mathcal{M}(\cdot, \cdot)$ is a ranking loss that replaces the hardest-triplet loss of D2-Net. $\mathcal{M}$ is defined as: