[Paper] YOLO 9000

Abstract

9000개 이상의 object category를 탐지할 수 있는 방법

여러가지 개선 사항
multi-scale training method
speed, accuracy의 trade off
object detection + classification 함께 training 하는 방법

Introduction

대부분의 detection method가 작은 물체로 한정되어있다 현재의 object detection dataset이 다른 task의 dataset에 비해 한정되어있다

가장 일반적인 object detection task dataset -> 수십~수백의 테그가 있는 수천~수십만의 이미지

때문에 이미 가직 ㅗ있는 방대한 분류 데이터를 활용하여 현재의 detection system 범위를 확장하는 새로운 방법 제안

서로 다른 데이터셋을 결합

deteion 및 calsslfication 데이터 모두에 대해 object detector를 train 할 수 있는 joint training 알고리즘을 제안한다

YOLO 9000 : 실시간으로 9000개 이상의 다양한 객체 카테고리를 탐지할 수 있는 object detector

YOLO의 단점 : 상당한 수의 위치 결정 오류 region-proposal method에 비해 사애적으로 낮은 recall을 가진다.

따라서 분류 정확도를 유지하며 recall과 위치 결정을 개선하는데 중점

YOLO v2를 통해 빠른 속도를 유지하며 더 정확한 deterctor 네트워크를 확장하는 대신 네트워크를 단순화하고 representation을 배우기 쉽도록 한다.

Better

정확도를 올리기 위한 방법

Batch Normalization

YOLO의 모든 컨볼루션 레이어에 배치 정규화를 추가 -> mAP에서 성능 개선 + 모델 정규화에 도움 모델에서 dropout 제거

High resolution classifier

ImageNet에서 pre-trained된 classifier 사용

원래의 YOLO -> 224 x 224에서 학습 후 448로 증가 이는 네트워크가 object detection으로 전환하고 새로운 입력 해상도에 맞춰야 함을 의미

YOLOc2 : 먼저 분류 네트워크를 ImageNet에서 전체 448 x 448 해상도로 10 epoch동안 미세 조정 (filter를 조정) 이후 결과 네트워크를 detection에 fine tune

Convolutinal with Anchor Boxes

YOLO는 convolution feature extractor 위에 있는 fc layer를 사용하여 bbox의 좌ㅏ표를 에측

좌표를 직접 예측하는 대신 Faster R-CNN은 사전에 선택된 기준을 사용하여 bbox 예측 -> 좌표 대신 offset 예측 문제가 단순하고 네트워크가 학습하기 쉬움

YOLO에서 fclayer를 제거하고 anchor box를 사용하여 bbox 예측

하나의 pooling layer를 제거하여 conv layer output을 high resolution으로 만든다
네트워크를 축소하여 448 x 448 대신 416 입력이미지에서 작동(홀수 개의 위치 생성) -> 단일 center cell 만들어짐
큰 객체들은 이미지 중앙을 차지하는 경향이 있느넫, 중앙에 단일 center cell이 있는것이 이러한 객체 예측에 도움

v1(cell 별로 2개 -> 7x7x2 갯수의 bbox)보다 많은 수의 bbox 예측

anchor box 사용 시 mAP 감소하지만 recall 값 상승 (recall 값 높음 -> 모델이 실제 객체의 위치를 예측한 비율이 높음)

YOLO v1 -> region proposal 기반의 방법보다 이미지 당 상대적으로 적은 bbox 예측하기 때문

anchor bo를 통해 더 많은 수의 bbox를 예측하며 실제 객체의 위치 잘 포착

Dimension clusters

기존에는 anchor box의 크기와 비율을 사전에 정의 but 더 좋은 학습 시작을 위해 사전 조건을 선택 -> k-means clustering 통해 prior 값 탐색

데이터셋의 모든 gtbox의 w,h 값을 사용하여 k-means clustering 수행

일반적인 euclidean distance -> 큰 bbox는 작은 box에 비해 큰 error 발생 box의 크기와 무관하게 선택한 prior이 좋은 IoU 값을 가지도록 하기 위해 새로운 distance metric 사용

k = 5일 때 적절한 trade-off

Direction location prediction

YOLO와 anchor box를 함께 사용했을 때 문제점은 초기 iteration 시 모델이 불안정한 점

bbox는 위 식과 같이 위치 조정 t_x,t_y 는 제한된 범위가 없기 때문에 anchor box는 이미지 내의 임의의 지점에 위치할 수 있는 문제가 있음

이를 위해 YOLO의 방식을 사용하여 grid cell에 상대적인 위치 좌표를 예측하는 방법을 선택

C_S,C_y 는 좌상단 offset 예측하는 bbox의 좌표가 0~1 사이의 값을 가진다. 이를 위해 t_x,t_y에 logistic regression(σ) 적용

정리 : Dimension clustering을 통해 최적의 prior 선택, anchor box 중심부 좌표를 직접 예측

Fine-Grained Features

YOLO v2는 최종적으로 13 x 13 크기의 feature map을 출력 -> 큰 객체 예측은 좋지만 작은 객체에 대해서는 좋지 않음

마지막 pooling 수행 전 feature map 추출하여 26 x 26 크기의 feature map 생성 이후 channel 유지하며 13 x 13 크기의 feature map 생성 -> 보다 작은 객체에 대한 정보 함축

최종적으로 13 x 13 크기의 feature map 각 grid cell 별로 5개의 bbox가 20개 class score 와 confidence, x, y, w, h 예측

Multi-scale Training

다양한 입력 이미지를 사용하여 네트워크 학습 10 batch 마다 입력 이미지의 크기 랜덤하게 선택하여 학습

모델이 이미지를 1/32로 downsample 시키기 때문에, 입력 이미지 크기를 32배수 중 선택(320~608)

다양한 크기의 이미지를 입력으로 받을 수 있고, 속도와 정확도 사이의 trade-off

Faster

Darknet-19

YOLO v1은 모델 마지막에 fc layer 통해 prediction 수행

하지만 이는 parameter 수 증가와 느린 detection 속도라는 단점 존재 마지막 layer에 global average pooling 사용하여 fc layer 제거

Training for classification

Darknet-19 를 class가 1000개인 데이터셋을 통해 학습

Training for detection

마지막 conv layer 제거하고 3x3 conv layer로 대체 이후 1x1 conv layer 추가

Stronger

detection dataset -> 일반적이고 범용적인 객체에 대한 정보 classification dataset -> 세부적인 객체에 대한 정보

이 때 예를들면 요크셔 테리어 와 개를 별개의 class로 분류할 가능성 있음

Hierarchy classification

이를 위해 계층적인 tree 의 WordTree 제안

각 노드는 범주를 의미 하위 범주는 자식 노드가 되고 자식 노드의 하위는 또 자식 노드가 된다.

특정 범주에 속할 확률은 조건부 확률의 곱으로 표현

Dataset combination with WordTree

ImageNet과 COCO dataset을 합쳐 WordTree 구성 (4 : 1 비율)

Joint classification and detection

grid cell 별로 3개의 anchor box를 사용하여 학습

네트워크가 detection dataset의 이미지를 보면 detection loss는 원래 방법대로, classification loss는 특정 범주와 상위 범주에 대해서만 loss 계산

네트워크가 classification dataset의 이미지를 보면 classification loss에 대해서만 backward pass 수행

localization loss : 객체의 중심이 존재하는 grid cell의 bbox loss와 존재하지 않는 cell의 bbox loss를 각각 구한 후 더함

DDOING

[Paper] YOLO 9000

Abstract

Introduction

Better

Batch Normalization

High resolution classifier

Convolutinal with Anchor Boxes

Dimension clusters

Direction location prediction

Fine-Grained Features

Multi-scale Training

Faster

Darknet-19

Training for classification

Training for detection

Stronger

Hierarchy classification

Dataset combination with WordTree

Joint classification and detection

참고

2024.04.30
[Paper] Paint by Example: Exemplar-based Image Editing with Diffusion Models

2024.04.17
[Paper] DEVA: Anything with Decoupled Video Segmentation

2024.04.14
[Paper] Video Panoptic Segmentation

2024.04.12
[Paper] Video Object Segmentation with Episodic Graph Memory Networks

DDOING

Abstract

Introduction

Better

Batch Normalization

High resolution classifier

Convolutinal with Anchor Boxes

Dimension clusters

Direction location prediction

Fine-Grained Features

Multi-scale Training

Faster

Darknet-19

Training for classification

Training for detection

Stronger

Hierarchy classification

Dataset combination with WordTree

Joint classification and detection

참고

2024.04.30 [Paper] Paint by Example: Exemplar-based Image Editing with Diffusion Models

2024.04.17 [Paper] DEVA: Anything with Decoupled Video Segmentation

2024.04.14 [Paper] Video Panoptic Segmentation

2024.04.12 [Paper] Video Object Segmentation with Episodic Graph Memory Networks

2024.04.30
[Paper] Paint by Example: Exemplar-based Image Editing with Diffusion Models

2024.04.17
[Paper] DEVA: Anything with Decoupled Video Segmentation

2024.04.14
[Paper] Video Panoptic Segmentation

2024.04.12
[Paper] Video Object Segmentation with Episodic Graph Memory Networks