Paper Notes: Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation

Classic dot-product attention works well, but its quadratic time and space complexity limits its use on high-resolution images, since the amount of input data for images and videos is typically far larger than for text. This paper proposes a linear attention module that aims to remove this bottleneck without significant performance loss.

The main contributions of this paper are:
1) A new linearized attention mechanism is proposed, reducing the computational complexity from $O(N^2)$ to $O(N)$;
2) The proposed attention mechanism can be combined with networks in a more general and flexible way;
3) In semantic segmentation, introducing this attention mechanism improves performance over various baselines.

1 METHODOLOGY

A. Definition of Dot-Product Attention

Given an input feature matrix $\boldsymbol{X}=\left[\boldsymbol{x}_{1}, \cdots, \boldsymbol{x}_{N}\right] \in \mathbb{R}^{N \times D_{x}}$, where $N$ is the feature length and $D_{x}$ is the feature dimension, dot-product attention generates the query, key, and value matrices through the projection matrices $\boldsymbol{W}_{q} \in \mathbb{R}^{D_{x} \times D_{k}}$, $\boldsymbol{W}_{k} \in \mathbb{R}^{D_{x} \times D_{k}}$, and $\boldsymbol{W}_{v} \in \mathbb{R}^{D_{x} \times D_{v}}$, respectively, as follows:

$$\begin{aligned}&\boldsymbol{Q}=\boldsymbol{X} \boldsymbol{W}_{q} \in \mathbb{R}^{N \times D_{k}} \\&\boldsymbol{K}=\boldsymbol{X} \boldsymbol{W}_{k} \in \mathbb{R}^{N \times D_{k}} \\&\boldsymbol{V}=\boldsymbol{X} \boldsymbol{W}_{v} \in \mathbb{R}^{N \times D_{v}}\end{aligned}\tag{1}$$
where $\boldsymbol{Q}$ and $\boldsymbol{K}$ must share the same dimension $D_{k}$.

On this basis, a normalization function $\rho$ is introduced to measure the similarity between $\boldsymbol{q}_{i} \in \mathbb{R}^{D_{k}}$ and $\boldsymbol{k}_{j} \in \mathbb{R}^{D_{k}}$ as $\rho\left(\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}\right) \in \mathbb{R}$. Scaled dot-product attention is essentially a weighted average of $\boldsymbol{v}_{j}$ with $\rho\left(\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}\right)$ as the weight. For the entire matrices, dot-product attention is defined as:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\rho\left(\boldsymbol{Q} \boldsymbol{K}^{T}\right) \boldsymbol{V}\tag{2}$$
The most commonly used normalization function is softmax:

$$\rho\left(\boldsymbol{Q} \boldsymbol{K}^{T}\right)=\operatorname{softmax}_{\text{row}}\left(\boldsymbol{Q} \boldsymbol{K}^{T}\right)\tag{3}$$
For $\boldsymbol{Q} \in \mathbb{R}^{N \times D_{k}}$ and $\boldsymbol{K}^{T} \in \mathbb{R}^{D_{k} \times N}$, the product $\boldsymbol{Q}\boldsymbol{K}^{T} \in \mathbb{R}^{N \times N}$, so the time and space complexity of $\rho$ is $O(N^2)$.
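To make the quadratic cost concrete, here is a minimal NumPy sketch of standard softmax dot-product attention following Eqs. (1)–(3); the function name, shapes, and random weights are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax_attention(X, Wq, Wk, Wv):
    """Standard dot-product attention, Eqs. (1)-(3); cost is dominated by the (N, N) score matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # (N, Dk), (N, Dk), (N, Dv)
    scores = Q @ K.T                                 # (N, N): the quadratic bottleneck
    scores -= scores.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax, Eq. (3)
    return weights @ V                               # (N, Dv)

# illustrative shapes: N "pixels", feature dims Dx, Dk, Dv
N, Dx, Dk, Dv = 1024, 64, 32, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((N, Dx))
Wq, Wk, Wv = (rng.standard_normal((Dx, d)) for d in (Dk, Dk, Dv))
out = softmax_attention(X, Wq, Wk, Wv)               # out.shape == (N, Dv)
```

For a high-resolution feature map, $N$ is the number of spatial positions, so the $(N, N)$ score matrix quickly becomes the memory bottleneck.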

B. Generalization of Dot-Product Attention Based on Kernel

The definition of dot-product attention can be generalized through kernels. With the softmax normalization, the $i$-th output row can be written as:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{i}=\frac{\sum_{j=1}^{N} e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}} \boldsymbol{v}_{j}}{\sum_{j=1}^{N} e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}}}\tag{4}$$
Here $e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}}$ is the weight in the weighted average of $\boldsymbol{v}_{j}$, and this formula can be further generalized as:

$$\begin{gathered}D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{i}=\frac{\sum_{j=1}^{N} \operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right) \boldsymbol{v}_{j}}{\sum_{j=1}^{N} \operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)} \\ \operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right) \geq 0\end{gathered}\tag{5}$$
Obviously, when $\operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)=e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}}$, this formula is equivalent to standard softmax attention. In computer vision, this form of attention is commonly known as the Non-Local network.

Clearly, if we simply drop the exponential and define $\operatorname{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j) = \boldsymbol{q}_i^{T}\boldsymbol{k}_j$, the non-negativity requirement is no longer satisfied. To satisfy it, we can introduce kernel feature maps and redefine the similarity as:

$$\operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)=\phi\left(\boldsymbol{q}_{i}\right)^{T} \varphi\left(\boldsymbol{k}_{j}\right)\tag{6}$$
Then the Attention formula can be redefined as:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{i}=\frac{\sum_{j=1}^{N} \phi\left(\boldsymbol{q}_{i}\right)^{T} \varphi\left(\boldsymbol{k}_{j}\right) \boldsymbol{v}_{j}}{\sum_{j=1}^{N} \phi\left(\boldsymbol{q}_{i}\right)^{T} \varphi\left(\boldsymbol{k}_{j}\right)}=\frac{\phi\left(\boldsymbol{q}_{i}\right)^{T} \sum_{j=1}^{N} \varphi\left(\boldsymbol{k}_{j}\right) \boldsymbol{v}_{j}^{T}}{\phi\left(\boldsymbol{q}_{i}\right)^{T} \sum_{j=1}^{N} \varphi\left(\boldsymbol{k}_{j}\right)}\tag{7}$$
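Below is a minimal NumPy sketch of the associativity trick behind Eq. (7): once the similarity factorizes as $\phi(\boldsymbol{q}_i)^{T} \varphi(\boldsymbol{k}_j)$, the key–value aggregation is computed once and shared across all queries. For simplicity a single feature map is applied to both queries and keys, and the ELU+1 map shown is only an illustrative non-negative choice, not the paper's:

```python
import numpy as np

def kernel_attention(Q, K, V, feature_map):
    """Kernelized attention, Eq. (7): aggregate keys/values once, reuse for every query."""
    Qf = feature_map(Q)              # (N, Dk), non-negative features of the queries
    Kf = feature_map(K)              # (N, Dk), non-negative features of the keys
    KV = Kf.T @ V                    # (Dk, Dv): sum_j phi(k_j) v_j^T, shared by all queries
    Z = Kf.sum(axis=0)               # (Dk,):   sum_j phi(k_j),        shared normalizer
    return (Qf @ KV) / (Qf @ Z)[:, None]   # (N, Dv); no (N, N) matrix is formed

# one possible non-negative feature map (an illustrative choice, not the paper's)
def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))
```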

C. Linear Attention Mechanism

Unlike many previous linearization methods, this paper approximates the exponential with a first-order Taylor expansion:

$$e^{\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}} \approx 1+\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j}\tag{8}$$
To ensure $\boldsymbol{q}_{i}^{T} \boldsymbol{k}_{j} \geq -1$, we apply $l_2$ normalization to $\boldsymbol{q}_{i}$ and $\boldsymbol{k}_{j}$:

$$\operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)=1+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)\tag{9}$$
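This guarantees non-negativity because the inner product of two unit vectors is a cosine similarity, bounded in $[-1, 1]$ (the angle $\theta_{ij}$ below is notation introduced here, not the paper's):

$$\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)=\cos \theta_{ij} \in[-1,1] \quad \Longrightarrow \quad \operatorname{sim}\left(\boldsymbol{q}_{i}, \boldsymbol{k}_{j}\right)=1+\cos \theta_{ij} \geq 0$$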
Then the Attention formula can be rewritten as:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})_{i}=\frac{\sum_{j=1}^{N}\left(1+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)\right) \boldsymbol{v}_{j}}{\sum_{j=1}^{N}\left(1+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)\right)}=\frac{\sum_{j=1}^{N} \boldsymbol{v}_{j}+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T} \sum_{j=1}^{N}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right) \boldsymbol{v}_{j}^{T}}{N+\left(\frac{\boldsymbol{q}_{i}}{\left\|\boldsymbol{q}_{i}\right\|_{2}}\right)^{T} \sum_{j=1}^{N}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)}\tag{10}$$
Using the vector form:

$$D(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\frac{\sum_{j} \boldsymbol{V}_{i, j}+\left(\frac{\boldsymbol{Q}}{\|\boldsymbol{Q}\|_{2}}\right)\left(\left(\frac{\boldsymbol{K}}{\|\boldsymbol{K}\|_{2}}\right)^{T} \boldsymbol{V}\right)}{N+\left(\frac{\boldsymbol{Q}}{\|\boldsymbol{Q}\|_{2}}\right) \sum_{j}\left(\frac{\boldsymbol{K}}{\|\boldsymbol{K}\|_{2}}\right)_{i, j}^{T}}\tag{11}$$

where $\sum_{j=1}^{N}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right) \boldsymbol{v}_{j}^{T}$ and $\sum_{j=1}^{N}\left(\frac{\boldsymbol{k}_{j}}{\left\|\boldsymbol{k}_{j}\right\|_{2}}\right)$ can be precomputed once and reused for every query, which is what brings the overall complexity down to $O(N)$.
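Putting Eqs. (9)–(11) together, here is a minimal NumPy sketch of the proposed linear attention; the variable names and the `eps` term are my own additions for numerical safety, and the official PyTorch implementations are linked in the repositories below:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention of Eqs. (10)-(11): cost is linear in the feature length N."""
    N = Q.shape[0]
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)   # row-wise l2-normalized queries
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + eps)   # row-wise l2-normalized keys
    KV = Kn.T @ V                    # (Dk, Dv): sum_j (k_j / ||k_j||) v_j^T, computed once
    Ksum = Kn.sum(axis=0)            # (Dk,):   sum_j  k_j / ||k_j||,         computed once
    numerator = V.sum(axis=0) + Qn @ KV          # (N, Dv); sum_j v_j is shared by all queries
    denominator = N + Qn @ Ksum                  # (N,)
    return numerator / denominator[:, None]      # (N, Dv)
```

Unlike the softmax version, no $N \times N$ matrix is ever formed: the two precomputed sums have shapes $D_k \times D_v$ and $D_k$, so memory and time scale linearly with $N$.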

2 EXPERIMENTAL RESULTS

The authors introduce the proposed linear attention into several common semantic segmentation frameworks and obtain consistent improvements over the baselines, demonstrating the general effectiveness of the proposed module.

Paper & Code

Paper
Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation

Code
https://github.com/lironui/Linear-Attention-Mechanism
https://github.com/lironui/MAResU-Net
https://github.com/lironui/MACU-Net

