Towards Learning Spatially Discriminative Feature Representations: Summary [XAI-21]


Wonju Seo 2021. 9. 13. 13:40

Paper title: Towards Learning Spatially Discriminative Feature Representations

Paper URL: https://arxiv.org/abs/2109.01359


Summary of key points:

1) The authors propose adding a term based on the class activation map (CAM) to the loss, which leads to better image classification, a regularization effect, and gains in transfer learning, few-shot learning, and knowledge distillation. According to the results, CAM-loss improves performance because it suppresses the effect of the background and places more emphasis on the discriminative regions.

The figure below shows class-agnostic activation map (CAAM) results: training with CAM-loss gives higher label accuracy than training with cross entropy (CE). With CE, the CAAM also responds to the background and to non-essential regions (for the ox, the black part resembles a black bear). With CAM-loss, the CAAM instead focuses on the more discriminative region (for the ox, around the face) and suppresses the non-essential regions and the background.

2) First, let us define CAM.

$f_k(x,y)$ denotes the activation of the $k$-th feature map at position $(x,y)$. After the global average pooling layer, $f_k$ becomes

$$F_k = \frac{1}{H \times W} \sum_{x,y} f_k(x,y)$$

where $H$ and $W$ are the height and width of the feature map. For a given target class $i$, the input to the softmax layer is

$$z_i = \sum_k w_k^i F_k$$

where $w_k^i$ is the weight for class $i$ on unit $k$ and indicates the importance of $F_k$ for class $i$. Substituting $F_k$, $z_i$ can be written as

$$z_i = \frac{1}{H \times W} \sum_k w_k^i \sum_{x,y} f_k(x,y)$$

$$= \frac{1}{H \times W} \sum_{x,y} \sum_k w_k^i f_k(x,y)$$

From this, CAM is defined as

$$CAM_i(x,y) = \sum_k w_k^i f_k(x,y)$$

CAAM is then defined as

$$CAAM(x,y) = \sum_k f_k(x,y)$$
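The definitions above can be checked numerically. The following is only a minimal NumPy sketch on random toy data: the array names (`f`, `w`, `F`, `z`, `cam`, `caam`) and the sizes are assumptions made for illustration, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, Wd, C = 4, 3, 3, 5             # channels, height, width, classes (toy sizes)
f = rng.random((K, H, Wd))           # f[k, x, y]: feature maps of the last conv layer
w = rng.standard_normal((C, K))      # w[i, k]: FC weight of unit k for class i

F = f.mean(axis=(1, 2))              # global average pooling: F_k
z = w @ F                            # softmax inputs: z_i = sum_k w_k^i F_k

cam = np.einsum("ik,kxy->ixy", w, f)  # CAM_i(x, y) for every class i
caam = f.sum(axis=0)                  # CAAM(x, y): unweighted sum over channels

# z_i equals the spatial mean of CAM_i, as in the derivation above.
assert np.allclose(z, cam.mean(axis=(1, 2)))
```

The only difference between the two maps is the class weight $w_k^i$: CAAM is what CAM becomes when every channel is weighted equally.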

After min-max normalizing $CAAM$ and $CAM_i$ (written $CAAM'$ and $CAM'_i$), $L_{cam}$ is defined as

$$L_{cam} = \frac{1}{H \times W} \sum_{x,y} \lVert CAAM'(x,y) - CAM'_i(x,y) \rVert_{l_1}$$

Minimizing this term drives $CAAM'$ toward $CAM'_i$.
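As a rough illustration of $L_{cam}$ for a single sample, here is a NumPy sketch. The function names and the small `eps` added to the denominator are assumptions for this example; the post does not say how a constant map is handled.

```python
import numpy as np

def min_max_normalize(m, eps=1e-8):
    """Rescale a 2-D map to [0, 1]; eps (an assumption) avoids division by zero."""
    return (m - m.min()) / (m.max() - m.min() + eps)

def cam_loss_term(f, w, target):
    """L_cam for one sample: mean absolute difference between CAAM' and CAM'_target.

    f: feature maps of shape (K, H, W); w: FC weights of shape (C, K).
    """
    caam = f.sum(axis=0)                       # sum_k f_k(x, y)
    cam = np.tensordot(w[target], f, axes=1)   # sum_k w_k^i f_k(x, y)
    return float(np.abs(min_max_normalize(caam) - min_max_normalize(cam)).mean())
```

If all weights for the target class are equal and positive, the normalized CAM and CAAM coincide and the term is zero; the more the class weights single out a subset of channels, the larger the term becomes.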

$L_{ce}$ is the cross-entropy loss, defined below, and the final CAM-loss combines the two terms:

$$L_{ce} = -\log \frac{e^{z_i}}{\sum_j e^{z_j}}$$

$$CAM\text{-}loss = \alpha L_{cam} + L_{ce}$$
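The two formulas above translate into a few lines of plain Python. This is a sketch with hypothetical function names; the log-sum-exp shift is a standard numerical-stability choice and is not something the post specifies.

```python
import math

def cross_entropy(z, target):
    """L_ce = -log(exp(z_target) / sum_j exp(z_j)), computed via log-sum-exp."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum - z[target]

def cam_loss_total(l_cam, l_ce, alpha):
    """CAM-loss = alpha * L_cam + L_ce."""
    return alpha * l_cam + l_ce
```

For example, with two equal logits `cross_entropy([0.0, 0.0], 0)` equals $\log 2$, the loss of a uniform two-class prediction.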

$\alpha$ is the combination ratio. The key point is that $L_{ce}$ updates $W$, while $L_{cam}$ updates $\theta$. This is done to remove the correlation between $W$ and $L_{cam}$. (Here $W$ denotes the weights of the fully connected layer after global average pooling, not the feature-map width, and $\theta$ denotes the backbone weights.)

$\alpha$ follows an epoch-dependent schedule: the constant $c$ in it is 0 before a certain epoch and non-zero after that epoch. The reason is that the CAMs generated early in training are too discrete, whereas CAMs closer to the final epochs give better heatmaps. The value of $\alpha$ was chosen through the ablation test below.
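One way to read this schedule is as a step function over epochs. The sketch below is an assumption about its shape only: the names `start_epoch` and `c`, and the choice to switch on exactly at `start_epoch`, are hypothetical, and the actual formula and values are those given in the paper.

```python
def alpha_schedule(epoch, start_epoch, c):
    """Return 0 before start_epoch and the constant c from start_epoch onward."""
    return c if epoch >= start_epoch else 0.0
```

With such a schedule, training uses plain cross entropy until the CAMs have had time to become meaningful, and only then adds the $L_{cam}$ term.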

Conventional knowledge distillation has a weak student imitate a strong teacher by minimizing the Kullback-Leibler divergence between their outputs. The final loss is

$$L = \beta L_{ce} + (1-\beta) L_{kd} + \gamma L_{ccm}$$

where $\beta$ and $\gamma$ are combination ratios. Looking at the terms in detail,

$$L_{kd} = \frac{1}{n} \sum_{i=1}^{n} \tau^2 \left( p_{ti}^{\tau} \log p_{ti}^{\tau} - p_{ti}^{\tau} \log p_{si}^{\tau} \right)$$

where $\tau$ is the temperature factor, $p_{si}^{\tau}$ is the student's soft target, and $p_{ti}^{\tau}$ is the teacher's soft target.
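A small NumPy sketch of $L_{kd}$ follows. It assumes that $i$ indexes the $n$ samples of a batch and that each term is summed over classes, which is one reading of the formula above; the function names are hypothetical.

```python
import numpy as np

def softmax_t(logits, tau):
    """Temperature-scaled softmax over the last axis (soft targets)."""
    z = np.asarray(logits, dtype=float) / tau
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau):
    """tau^2 * KL(teacher || student), averaged over the batch.

    Both inputs have shape (n, num_classes).
    """
    p_t = softmax_t(teacher_logits, tau)
    p_s = softmax_t(student_logits, tau)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1)  # one value per sample
    return float(tau ** 2 * kl.mean())
```

The loss is zero when the student reproduces the teacher's soft targets exactly and positive otherwise.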

Next, $L_{at}$ measures how well the attention maps of the two models agree:

$$L_{at} = \lVert CAM'_{si} - CAM'_{ti} \rVert_{l_1}$$

where $CAM'_{si}$ is the normalized student CAM and $CAM'_{ti}$ is the normalized teacher CAM.

The CAAM-CAM matching (CCM) proposed by the authors differs from AT in that it uses the student's CAAM:

$$L_{ccm} = \lVert CAAM'_s - CAM'_{ti} \rVert_{l_1}$$

where $CAAM'_s$ is the normalized student CAAM.
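The difference between AT and CCM is easiest to see side by side. The sketch below is illustrative only: it assumes student and teacher maps share the same spatial size (otherwise one would have to be resized), the `eps` in the normalization is an assumption, and all function names are hypothetical.

```python
import numpy as np

def min_max_normalize(m, eps=1e-8):
    """Rescale a 2-D map to [0, 1]; eps (an assumption) avoids division by zero."""
    return (m - m.min()) / (m.max() - m.min() + eps)

def l1_distance(a, b):
    return float(np.abs(a - b).sum())

def at_loss(f_s, w_s, f_t, w_t, target):
    """L_at: normalized student CAM vs normalized teacher CAM."""
    cam_s = np.tensordot(w_s[target], f_s, axes=1)
    cam_t = np.tensordot(w_t[target], f_t, axes=1)
    return l1_distance(min_max_normalize(cam_s), min_max_normalize(cam_t))

def ccm_loss(f_s, f_t, w_t, target):
    """L_ccm: normalized student CAAM vs normalized teacher CAM."""
    caam_s = f_s.sum(axis=0)   # no student class weights involved
    cam_t = np.tensordot(w_t[target], f_t, axes=1)
    return l1_distance(min_max_normalize(caam_s), min_max_normalize(cam_t))

def distillation_loss(l_ce, l_kd, l_ccm, beta, gamma):
    """L = beta * L_ce + (1 - beta) * L_kd + gamma * L_ccm."""
    return beta * l_ce + (1.0 - beta) * l_kd + gamma * l_ccm
```

Note that `ccm_loss` never touches the student's classifier weights: it constrains the student's feature maps directly, whereas `at_loss` compares maps that already depend on each model's own class weights.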

(The paper explains why $L_{ccm}$ works better than using $L_{at}$ alone; that part is worth reading.)

The overall structure of CAM-loss is as follows.

3) Results

(1) Image classification, applied to various network structures: The table below shows that with CAM-loss the top-1 error decreases for every model on both datasets.

The figure below also shows that CAM-loss can provide a regularization effect.

(2) Combination with regularization methods: Even when various regularization methods are used, applying CAM-loss, alone or together with them, brought a considerable performance improvement.

(3) Comparison with other loss functions: CAM-loss also performs better than other loss functions.

(4) Transfer learning: When a model pre-trained on ImageNet-1k is fine-tuned on the CUB and Stanford Dogs datasets, CAM-loss again performs better than plain CE.

(5) Few-shot learning: CAM-loss also performs better in the few-shot learning setting.

Why does CAM-loss improve few-shot learning performance? According to the authors, it is because CAM-loss suppresses the effect of the background. The figure below suggests that few-shot image classification is affected by the background.

(6) Knowledge distillation: The knowledge distillation results show that the proposed CCM outperforms KD and AT.

Recent studies seem to show that CAM improves performance by removing background effects and emphasizing the discriminative regions. CAM will probably become an essential ingredient for improving performance. This may mean that model explainability helps improve performance, which I read as a very meaningful development.
