Paper Notes: Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation

Classic dot-product attention works well, but its quadratic time and space complexity limits its use on high-resolution images, since the amount of input data for images and videos is typically far larger than for text. This paper proposes a linear attention module that aims to remove this bottleneck without significant performance loss.

The main contributions of this paper are:
1) A new linearized attention mechanism is proposed, reducing the computational complexity from $O(N^2)$ to $O(N)$;
2) The proposed attention mechanism can be combined with networks in a more general and flexible way;
3) In semantic segmentation, introducing this attention mechanism improves performance over various baselines.

1 METHODOLOGY

A. Definition of Dot-Product Attention

Given an input feature matrix $\boldsymbol{X}=\left[\boldsymbol{x}_{1}, \cdots, \boldsymbol{x}_{N}\right] \in \mathbb{R}^{N \times D_{x}}$, where $N$ is the feature length and $D_{x}$ is the feature dimension, dot-product attention generates the query, key, and value matrices through the projection matrices $\boldsymbol{W}_{q} \in \mathbb{R}^{D_{x} \times D_{k}}$, $\boldsymbol{W}_{k} \in \mathbb{R}^{D_{x} \times D_{k}}$, and $\boldsymbol{W}_{v} \in \mathbb{R}^{D_{x} \times D_{v}}$, respectively, as follows:

$$\begin{aligned}&\boldsymbol{Q}=\boldsymbol{X} \boldsymbol{W}_{q} \in \mathbb{R}^{N \times D_{k}} \\&\boldsymbol{K}=\boldsymbol{X} \boldsymbol{W}_{k} \in \mathbb{R}^{N \times D_{k}} \\&\boldsymbol{V}=\boldsymbol{X} \boldsymbol{W}_{v} \in \mathbb{R}^{N \times D_{v}}\end{aligned}\tag{1}$$
where $\boldsymbol{Q}$ and $\boldsymbol{K}$ must share the same dimension $D_{k}$.

On this basis, a normalization function $\rho$ is introduced to measure the similarity between $\boldsymbol{q}_{i} \in \mathbb{R}^{D_{k}}$ and $\boldsymbol{k}_{j} \in \mathbb{R}^{D_{k}}$ as $\rho\left(\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}\right) \in \mathbb{R}$. Scaled dot-product attention is essentially a weighted average of $\boldsymbol{v}_{j}$ with $\rho\left(\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}\right)$ as the weight. For the entire matrices, dot-product attention is defined as:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\rho\left(\boldsymbol{Q} \boldsymbol{K}^{T}\right) \boldsymbol{V}\tag{2}$$
The most commonly used normalization function is softmax:

$$\rho\left(\boldsymbol{Q} \boldsymbol{K}^{T}\right)=\operatorname{softmax}_{\text{row}}\left(\boldsymbol{Q} \boldsymbol{K}^{T}\right)\tag{3}$$
For $\boldsymbol{Q} \in \mathbb{R}^{N \times D_{k}}$ and $\boldsymbol{K}^{T} \in \mathbb{R}^{D_{k} \times N}$, the product $\boldsymbol{Q}\boldsymbol{K}^{T} \in \mathbb{R}^{N \times N}$, so the time and space complexity of $\rho$ is $O(N^2)$.
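To make the quadratic cost concrete, here is a minimal NumPy sketch of standard softmax dot-product attention following Eqs. (1)–(3); the function name, shapes, and random weights are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax_attention(X, Wq, Wk, Wv):
    """Standard dot-product attention, Eqs. (1)-(3); cost is dominated by the (N, N) score matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # (N, Dk), (N, Dk), (N, Dv)
    scores = Q @ K.T                                 # (N, N): the quadratic bottleneck
    scores -= scores.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax, Eq. (3)
    return weights @ V                               # (N, Dv)

# illustrative shapes: N "pixels", feature dims Dx, Dk, Dv
N, Dx, Dk, Dv = 1024, 64, 32, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((N, Dx))
Wq, Wk, Wv = (rng.standard_normal((Dx, d)) for d in (Dk, Dk, Dv))
out = softmax_attention(X, Wq, Wk, Wv)               # out.shape == (N, Dv)
```

For a high-resolution feature map, $N$ is the number of spatial positions, so the $(N, N)$ score matrix quickly becomes the memory bottleneck.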

B. Generalization of Dot-Product Attention Based on Kernel

The definition of dot-product attention can be generalized through kernels. With the softmax normalization, the $i$-th output row can be written as:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{i}=\frac{\sum_{j=1}^{N} e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}} \boldsymbol{v}_{j}}{\sum_{j=1}^{N} e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}}}\tag{4}$$
Here $e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}}$ is the weight in the weighted average of $\boldsymbol{v}_{j}$, and this formula can be further generalized as:

$$\begin{gathered}D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{i}=\frac{\sum_{j=1}^{N} \operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right) \boldsymbol{v}_{j}}{\sum_{j=1}^{N} \operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)} \\ \operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right) \geq 0\end{gathered}\tag{5}$$
Obviously, when $\operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)=e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}}$, this formula is equivalent to standard softmax attention. In computer vision, this form of attention is commonly known as the Non-Local network.

Clearly, if we simply drop the exponential and define $\operatorname{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j) = \boldsymbol{q}_i^{T}\boldsymbol{k}_j$, the non-negativity requirement is no longer satisfied. To satisfy it, we can introduce kernel feature maps and redefine the similarity as:

$$\operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)=\phi\left(\boldsymbol{q}_{i}\right)^{T} \varphi\left(\boldsymbol{k}_{j}\right)\tag{6}$$
Then the Attention formula can be redefined as:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{i}=\frac{\sum_{j=1}^{N} \phi\left(\boldsymbol{q}_{i}\right)^{T} \varphi\left(\boldsymbol{k}_{j}\right) \boldsymbol{v}_{j}}{\sum_{j=1}^{N} \phi\left(\boldsymbol{q}_{i}\right)^{T} \varphi\left(\boldsymbol{k}_{j}\right)}=\frac{\phi\left(\boldsymbol{q}_{i}\right)^{T} \sum_{j=1}^{N} \varphi\left(\boldsymbol{k}_{j}\right) \boldsymbol{v}_{j}^{T}}{\phi\left(\boldsymbol{q}_{i}\right)^{T} \sum_{j=1}^{N} \varphi\left(\boldsymbol{k}_{j}\right)}\tag{7}$$
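Below is a minimal NumPy sketch of the associativity trick behind Eq. (7): once the similarity factorizes as $\phi(\boldsymbol{q}_i)^{T} \varphi(\boldsymbol{k}_j)$, the key–value aggregation is computed once and shared across all queries. For simplicity a single feature map is applied to both queries and keys, and the ELU+1 map shown is only an illustrative non-negative choice, not the paper's:

```python
import numpy as np

def kernel_attention(Q, K, V, feature_map):
    """Kernelized attention, Eq. (7): aggregate keys/values once, reuse for every query."""
    Qf = feature_map(Q)              # (N, Dk), non-negative features of the queries
    Kf = feature_map(K)              # (N, Dk), non-negative features of the keys
    KV = Kf.T @ V                    # (Dk, Dv): sum_j phi(k_j) v_j^T, shared by all queries
    Z = Kf.sum(axis=0)               # (Dk,):   sum_j phi(k_j),        shared normalizer
    return (Qf @ KV) / (Qf @ Z)[:, None]   # (N, Dv); no (N, N) matrix is formed

# one possible non-negative feature map (an illustrative choice, not the paper's)
def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))
```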

C. Linear Attention Mechanism

Unlike many previous linearization methods, this paper approximates the exponential with a first-order Taylor expansion:

$$e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}} \approx 1+\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}\tag{8}$$
To ensure $\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j} \geq -1$, we apply $l_2$ normalization to $\boldsymbol{q}_{i}$ and $\boldsymbol{k}_{j}$:

$$\operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)=1+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)\tag{9}$$
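This guarantees non-negativity because the inner product of two unit vectors is a cosine similarity, bounded in $[-1, 1]$ (the angle $\theta_{ij}$ below is notation introduced here, not the paper's):

$$\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)=\cos \theta_{ij} \in[-1,1] \quad \Longrightarrow \quad \operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)=1+\cos \theta_{ij} \geq 0$$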
Then the Attention formula can be rewritten as:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{i}=\frac{\sum_{j=1}^{N}\left(1+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)\right) \boldsymbol{v}_{j}}{\sum_{j=1}^{N}\left(1+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)\right)}=\frac{\sum_{j=1}^{N} \boldsymbol{v}_{j}+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T} \sum_{j=1}^{N}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right) \boldsymbol{v}_{j}^{T}}{N+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T} \sum_{j=1}^{N}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)}\tag{10}$$
Using the vector form:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\frac{\sum_{j} \boldsymbol{V}_{i, j}+\left(\frac{\boldsymbol{Q}}{\|\boldsymbol{Q}\|_{2}}\right)\left(\left(\frac{\boldsymbol{K}}{\|\boldsymbol{K}\|_{2}}\right)^{T} \boldsymbol{V}\right)}{N+\left(\frac{\boldsymbol{Q}}{\|\boldsymbol{Q}\|_{2}}\right) \sum_{j}\left(\frac{\boldsymbol{K}}{\|\boldsymbol{K}\|_{2}}\right)_{i, j}^{T}}\tag{11}$$

where $\sum_{j=1}^{N}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right) \boldsymbol{v}_{j}^{T}$ and $\sum_{j=1}^{N}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)$ can be precomputed once and reused for every query, which is what brings the overall complexity down to $O(N)$.
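Putting Eqs. (9)–(11) together, here is a minimal NumPy sketch of the proposed linear attention; the variable names and the `eps` term are my own additions for numerical safety, and the official PyTorch implementations are linked in the repositories below:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention of Eqs. (10)-(11): cost is linear in the feature length N."""
    N = Q.shape[0]
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)   # row-wise l2-normalized queries
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + eps)   # row-wise l2-normalized keys
    KV = Kn.T @ V                    # (Dk, Dv): sum_j (k_j / ||k_j||) v_j^T, computed once
    Ksum = Kn.sum(axis=0)            # (Dk,):   sum_j  k_j / ||k_j||,         computed once
    numerator = V.sum(axis=0) + Qn @ KV          # (N, Dv); sum_j v_j is shared by all queries
    denominator = N + Qn @ Ksum                  # (N,)
    return numerator / denominator[:, None]      # (N, Dv)
```

Unlike the softmax version, no $N \times N$ matrix is ever formed: the two precomputed sums have shapes $D_k \times D_v$ and $D_k$, so memory and time scale linearly with $N$.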

2 EXPERIMENTAL RESULTS

The authors introduce the proposed linear attention into several common semantic segmentation frameworks and obtain consistent improvements over the baselines, demonstrating the general effectiveness of the proposed module.

Paper & Code

Paper
Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation

Code
https://github.com/lironui/Linear-Attention-Mechanism
https://github.com/lironui/MAResU-Net
https://github.com/lironui/MACU-Net

