Grad-CAM Guided Channel-spatial Attention Module for Fine-grained Visual Classification: Summary [XAI-13]


Wonju Seo 2021. 6. 21. 14:33

Paper title: Grad-CAM Guided Channel-spatial Attention Module for Fine-grained Visual Classification

Paper URL: https://arxiv.org/abs/2101.09666


Summary of key points:

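1) While looking through recent work, I came across a short paper that uses Grad-CAM to improve fine-grained visual classification (FGVC), so I'd like to introduce it here. The core idea is to apply Grad-CAM to an existing channel-spatial attention mechanism so that the model attends to the more class-discriminative parts of the object. From the paper, the authors seem to use Grad-CAM because attention mechanisms can also focus on unimportant regions such as the background. The key point is that Grad-CAM is not used to apply attention directly. Instead, the attention weights are pushed to resemble the weights obtained from Grad-CAM, so that the map produced by the attention mechanism ends up resembling the CAM produced by Grad-CAM.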

2) The figure below shows Grad-CAM being added alongside an existing channel-spatial attention module to guide the attention.

The next figure shows in more detail how the Grad-CAM guided channel-spatial attention module is built.

In the formulation above, $F_{cp}$ denotes the convolution and pooling operations, and $A=[a_1,a_2,...,a_C]\in R^{C\times W \times H}$ denotes the feature map.

(1) First, the feature map is passed through global average pooling $F_{cg}$, and the result is passed through two fully connected layers (including a softmax function) $F_{cr}$ to obtain the per-channel weights $S=[s_1,s_2,...,s_C]\in R^C$.

(2) Next, the feature map $A$ is re-scaled by $S$, which yields the weighted feature map $B=[b_1,b_2,...,b_C]\in R^{C\times W\times H}$.

$b_c = F_{cm}(a_c, s_c) = a_c \cdot F_{cr}(F_{cg}(A))_c$

(3) Next, $B$ is passed through $F_{fa}$, which consists of a channel-wise summation and a 2D softmax function: the feature maps of $B$ are flattened along the channel dimension to obtain the spatial attention weights $T\in R^{W\times H}$.

(4) Finally, the channel-spatial attention-weighted feature map $D=[d_1,d_2,...,d_C]\in R^{C\times W \times H}$ is obtained.

$d_c = F_{sm}(a_c, T) = a_c \odot F_{fa}(B)$

Here, $\odot$ is the Hadamard product, and $T$ is given by:

$T = F_{fa}(B) = \frac{\sum_{c=1}^{C} b_c}{\sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{c=1}^{C} b_{c,i,j}}$

Classification is then performed on the resulting $D$. $F_{tc}$ is a multiple-FC classifier responsible for classification.
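Steps (1) to (4) can be traced on a tiny example. The following is only a minimal sketch in plain Python, not the authors' implementation: feature maps are nested lists indexed as `[C][W][H]`, the FC weights `w1` and `w2` are hypothetical, the ReLU between the two FC layers is an assumption (the activation is not stated here), and the features are assumed non-negative so that the normalising sum in $T$ is positive. The spatial weights follow the sum-normalised formula for $T$ given above.

```python
import math

def global_avg_pool(A):
    """F_cg: average each channel's W x H map into one scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in A]

def linear(x, weight):
    """Bias-free fully connected layer; weight has shape [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def channel_spatial_attention(A, w1, w2):
    # (1) S = F_cr(F_cg(A)); the ReLU between the FC layers is an assumption
    hidden = [max(0.0, h) for h in linear(global_avg_pool(A), w1)]
    S = softmax(linear(hidden, w2))
    # (2) b_c = a_c * s_c
    B = [[[s * v for v in row] for row in ch] for ch, s in zip(A, S)]
    # (3) T = channel-wise sum of B divided by the sum over all entries of B
    n_w, n_h = len(B[0]), len(B[0][0])
    summed = [[sum(ch[i][j] for ch in B) for j in range(n_h)] for i in range(n_w)]
    total = sum(sum(row) for row in summed)
    T = [[v / total for v in row] for row in summed]
    # (4) d_c = a_c ⊙ T (element-wise product with the same T for every channel)
    D = [[[a * t for a, t in zip(a_row, t_row)] for a_row, t_row in zip(ch, T)]
         for ch in A]
    return S, T, D

# Toy input: C=2 channels of size 2x2, identity matrices as hypothetical FC weights
A = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
S, T, D = channel_spatial_attention(A, identity, identity)
print(S, T)
```

In this sketch both `S` (over channels) and `T` (over spatial positions) sum to 1, and `D` keeps the same `[C][W][H]` shape as `A`.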

3) Now that the channel-spatial attention module is complete, Grad-CAM has to be added. In Grad-CAM, the importance of each feature map $c$ for class $k$ is expressed as follows.

$\beta_c^k = \frac{1}{W\times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\frac{\partial y^k}{\partial A_{c,i,j}}$
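To make the Grad-CAM weight concrete, here is a small hedged sketch in plain Python. `gradcam_channel_weights` just takes the spatial mean of the gradients for each channel, as in the formula. Everything else is an illustrative assumption: `toy_score` is a hypothetical class score (global average pooling followed by a linear layer with made-up weights), and `numerical_grads` uses central finite differences as a stand-in for backpropagation.

```python
def gradcam_channel_weights(grads):
    """beta_c^k: spatial mean of dy^k/dA_{c,i,j} for each channel c."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in grads]

def numerical_grads(score_fn, A, eps=1e-6):
    """Central finite differences of a scalar score w.r.t. every entry of A.

    A toy stand-in for backpropagation; A is restored after each perturbation.
    """
    grads = [[[0.0] * len(row) for row in ch] for ch in A]
    for c, ch in enumerate(A):
        for i, row in enumerate(ch):
            for j, value in enumerate(row):
                row[j] = value + eps
                plus = score_fn(A)
                row[j] = value - eps
                minus = score_fn(A)
                row[j] = value
                grads[c][i][j] = (plus - minus) / (2 * eps)
    return grads

def toy_score(A, class_weights=(2.0, -1.0)):
    """Hypothetical class score y^k: global average pooling, then a linear layer."""
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in A]
    return sum(w * p for w, p in zip(class_weights, pooled))

A = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
beta = gradcam_channel_weights(numerical_grads(toy_score, A))
print(beta)
```

For this toy score every gradient entry of channel $c$ equals $w_c/(W\times H)$, so `beta` comes out close to `[0.5, -0.25]`: a channel that pushes the class score up gets a positive weight, and one that pushes it down gets a negative weight.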

Using the $\beta$ obtained this way, a new loss is defined as the KL divergence between it and the $S$ obtained from the attention module.

$L_{GGAM} = \frac{1}{2}\left(KL(S\,||\,\tilde\beta^k) + KL(\tilde\beta^k\,||\,S)\right)$

In the equation above, $\tilde \beta^k$ is $\beta$ passed through a sigmoid (that is, $\tilde \beta_c^k = \mathrm{sigmoid}(\beta_c^k)$), and KL denotes the KL divergence.
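The symmetric KL term can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the paper's code: the KL divergence is taken as the plain sum $\sum_c p_c \log(p_c/q_c)$ over channels, $\tilde\beta^k$ is used directly after the sigmoid without renormalisation (whether it is renormalised is not stated here), and the small `eps` is added only to avoid $\log 0$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl(p, q, eps=1e-12):
    """Sum over channels of p_c * log(p_c / q_c); eps guards against log(0)."""
    return sum(pc * math.log((pc + eps) / (qc + eps)) for pc, qc in zip(p, q))

def ggam_loss(S, beta):
    """0.5 * (KL(S || beta_tilde) + KL(beta_tilde || S)), beta_tilde = sigmoid(beta)."""
    beta_tilde = [sigmoid(b) for b in beta]
    return 0.5 * (kl(S, beta_tilde) + kl(beta_tilde, S))

# Attention weights that already match sigmoid(beta) give a loss of (numerically) zero
print(ggam_loss([0.5, 0.5], [0.0, 0.0]))
# Mismatched weights are penalised
print(ggam_loss([0.9, 0.1], [0.0, 0.0]))
```

Adding the two directions gives $\frac{1}{2}\sum_c (S_c - \tilde\beta_c^k)\log(S_c/\tilde\beta_c^k)$, in which every term is non-negative, so under these assumptions the loss is zero only when the attention weights agree with the Grad-CAM weights.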

Finally, the original cross-entropy loss is added to obtain the final loss.

$Loss = L_{CE} + \lambda L_{GGAM}$
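As a rough illustration of how the two terms combine, the sketch below computes a softmax cross-entropy for a single sample and adds a weighted GGAM term. The logits, the target index, the value passed as `l_ggam`, and the choice of `lam` are all made-up numbers for illustration, not values from the paper.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def total_loss(logits, target, l_ggam, lam):
    """Loss = L_CE + lambda * L_GGAM."""
    return cross_entropy(logits, target) + lam * l_ggam

# Hypothetical values: 3-class logits, true class 0, an already-computed GGAM loss
print(total_loss([2.0, 0.5, -1.0], target=0, l_ggam=0.44, lam=1.0))
```

Setting `lam=0.0` recovers plain cross-entropy training, which corresponds to the setting without the GGAM-Loss.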

4) Looking first at the numerical performance, the proposed method outperforms the SOTA methods on two datasets.

In addition, an ablation comparing the removal of spatial attention, channel attention, or the GGAM-Loss against using all of them shows that using all three gives the best performance.

To evaluate visual performance, Grad-CAM outputs were compared for the baseline, channel attention, spatial attention, and channel-spatial attention, each with and without the GGAM-Loss. With the GGAM-Loss applied, the heatmap forms more precisely around the object.

(Considering object localization alone, channel attention + GGAM-Loss and channel-spatial attention w/o GGAM-Loss seem to perform well. However, since extracting high-order features is the better approach for the FGVC problem, the fact that the proposed method assigns high importance around the beak suggests that the authors achieved what they intended.)

Finally, varying $\lambda$ shows that performance improves over the case where the GGAM-Loss is not used.

This paper gave a quick look at how channel attention and spatial attention are packaged into a module. Rather than forming a CAM with Grad-CAM as-is and feeding it back in as attention, the authors add a loss to guide the attention, which I think is a very useful approach.
