1 of 48

Getting Familiar with the Segment Anything Model (SAM)

김경환

2 of 48

About the Speaker

김경환 [GitHub | LinkedIn]

ML Engineer | Smilegate AI Center

#AI #5thYear #MLDeveloper #Startup

#Vision #RL #FoundationModel #AI Application

3 of 48

SAM: theoretical background
SAM use cases
SAM example hands-on
Q&A

4 of 48

Segment Anything Model

https://segment-anything.com/demo

5 of 48

Motivation

Recently, Large Language Models (LLMs) have shown strong zero-shot and few-shot generalization.
A model that, like an LLM, is pre-trained on a large dataset and shows strong zero-shot generalization on downstream tasks is called a Foundation Model [1].
In computer vision there have also been attempts to build Foundation Models from vision-language datasets, such as CLIP [2] and ALIGN.

[1] R. Bommasani et al., "On the Opportunities and Risks of Foundation Models," arXiv preprint arXiv:2108.07258, 2021.
[2] A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision," arXiv preprint arXiv:2103.00020, 2021.

6 of 48

Motivation

Strong zero-shot generalization on downstream tasks?

gif ref: https://wikidocs.net/178446

Next-word prediction task

Q&A generation task

Sentiment analysis task

Text summarization task

7 of 48

Motivation

However, beyond vision-language tasks, computer vision still has many tasks left to solve.

Image Segmentation
3D reconstruction
High resolution
…

For these tasks, it is hard to build an image dataset large enough to train a Foundation Model.

8 of 48

Objective

Let's build a Foundation Model for image segmentation!

How?

9 of 48

Foundation Model for Image Segmentation

Three questions to answer to reach the goal:

What Task will enable zero-shot generalization?
What is the corresponding Model architecture?
What Data can power this task and model?

10 of 48

Segment Anything Task

In NLP, the next-token prediction task is defined in order to train Foundation Models.

gif ref: https://wikidocs.net/178446

11 of 48

Segment Anything Task

Promptable Segmentation Task

→ Given an image and any "prompt", return a "valid mask".

12 of 48

Segment Anything Task

Prompt?

→ Specifies what to segment in the image.
→ Can be points, a bounding box, a mask, or even text.

13 of 48

Segment Anything Task

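Valid mask?

→ Even when the prompt is ambiguous, the model must output a reasonable mask (for example, a point on a shirt could mean the shirt or the person wearing it).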

14 of 48

Segment Anything Data Engine & Dataset

For the next-token prediction task, "text data" can be collected at scale from the web.
The "segmentation masks" needed to train the promptable segmentation task are hard to obtain.

→ Let's build a large-scale segmentation dataset ourselves!

15 of 48

Segment Anything Data Engine & Dataset

Three stages for collecting mask data at scale (1.1B masks)

1. Assisted-manual stage
2. Semi-automatic stage
3. Fully automatic stage

16 of 48

Segment Anything Data Engine & Dataset

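1. Assisted-manual stage

SAM is first trained on public segmentation datasets
Professional annotators create data in a web-based interface, assisted by the initially trained SAM
The model is retrained incrementally using only the newly collected data (6 retraining rounds)
4.3M masks collected from 120k images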

17 of 48

Segment Anything Data Engine & Dataset

2. Semi-automatic stage

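The goal is to increase the diversity of masks
The SAM trained in stage 1 pre-fills the annotation screen with high-confidence masks
Annotators label the remaining objects
The model is retrained incrementally on the newly collected data (5 retraining rounds)
5.9M masks collected from 180k images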

18 of 48

Segment Anything Data Engine & Dataset

3. Fully automatic stage

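Fully automated annotation stage
The model trained through stage 2 is prompted with a 32 x 32 regular grid of points to obtain masks
Only masks with a high predicted IoU are kept
Post-processing such as removing duplicate masks is applied
1.1B masks collected from 11M images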
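
The grid prompting and IoU filtering steps above can be sketched as follows. This is a minimal, standard-library illustration, not the code used to build the dataset: the helper names, the dictionary layout, and the 0.88 threshold are assumptions chosen for the example.

```python
def build_point_grid(n_per_side):
    """Return n_per_side**2 (x, y) points evenly spaced in the unit square."""
    offset = 1.0 / (2 * n_per_side)  # keep points off the image border
    coords = [offset + i / n_per_side for i in range(n_per_side)]
    return [(x, y) for y in coords for x in coords]


def keep_confident(masks, iou_thresh):
    """Keep only masks whose predicted IoU is at or above the threshold."""
    return [m for m in masks if m["predicted_iou"] >= iou_thresh]


grid = build_point_grid(32)
print(len(grid))  # 1024

# Hypothetical model outputs: one predicted IoU per candidate mask.
candidates = [
    {"id": 0, "predicted_iou": 0.95},
    {"id": 1, "predicted_iou": 0.62},
    {"id": 2, "predicted_iou": 0.91},
]
kept = keep_confident(candidates, iou_thresh=0.88)  # threshold is illustrative
print([m["id"] for m in kept])  # [0, 2]
```

Each grid point is normalized to [0, 1], so it can be scaled to any image size by multiplying by the image width and height.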

19 of 48

Segment Anything Data Engine & Dataset

SA-1B dataset

Contains only the masks generated automatically by SAM, not the ones created by annotators

20 of 48

Segment Anything Model

Goals:

Promptable Segmentation Task
Interactive use in the real world (speed)

Model constraints needed to meet the goals:

Flexible prompt support
Handling of prompt ambiguity
Real-time mask computation

Segment Anything Model (SAM):

Image encoder

Prompt encoder

Mask decoder

21 of 48

Segment Anything Model

A powerful image encoder computes the image embedding
A prompt encoder computes the prompt embedding
A lightweight mask decoder combines the two embeddings to predict masks

22 of 48

Segment Anything Model: Image encoder

To handle high-resolution images, a Vision Transformer (ViT) pre-trained with a Masked Autoencoder (MAE) is used
The embedding of a given image can be reused across different prompts.

→ The embedding is computed only once per image.

code: https://github.com/facebookresearch/segment-anything/blob/6fdee8f2727f4506cfbbe553e23b895e27956588/segment_anything/modeling/image_encoder.py#L17
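
The encode-once, prompt-many-times pattern can be illustrated with a toy class. Everything here is a hypothetical stand-in: `CachedSegmenter` is not part of any library, the "encoder" is a row sum, and the "decoder" is a placeholder. Only the method names are borrowed from `SamPredictor.set_image` and `SamPredictor.predict` in the official repository.

```python
class CachedSegmenter:
    """Toy stand-in for the encode-once, prompt-many-times usage pattern."""

    def __init__(self):
        self.encoder_calls = 0
        self._embedding = None

    def set_image(self, image):
        # Expensive step: run the (pretend) image encoder once per image.
        self.encoder_calls += 1
        self._embedding = [sum(row) for row in image]

    def predict(self, point):
        # Cheap step: every new prompt reuses the cached embedding.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        x, y = point
        return self._embedding[y] + x  # placeholder for the mask decoder


segmenter = CachedSegmenter()
segmenter.set_image([[1, 2], [3, 4]])
results = [segmenter.predict(p) for p in [(0, 0), (1, 0), (1, 1)]]
print(segmenter.encoder_calls)  # 1
```

Three prompts are answered while the encoder runs only once, which is what makes interactive use practical when the encoder is the slow part.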

23 of 48

Segment Anything Model: Prompt encoder

Sparse prompt: points, boxes, text

Specifies what to segment in the image

Dense prompt: masks

Spatially corresponds to the image
Added element-wise to the image embedding

code: https://github.com/facebookresearch/segment-anything/blob/6fdee8f2727f4506cfbbe553e23b895e27956588/segment_anything/modeling/prompt_encoder.py#L16
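
A minimal sketch of the element-wise addition used for dense prompts, assuming the mask has already been embedded to the same spatial shape as the image embedding. The single-channel 2 x 2 grids and the function name are simplifications for illustration; real embeddings have many channels.

```python
def add_dense_prompt(image_emb, mask_emb):
    """Element-wise sum of two equally shaped 2D grids."""
    same_rows = len(image_emb) == len(mask_emb)
    if not same_rows or any(len(a) != len(b) for a, b in zip(image_emb, mask_emb)):
        raise ValueError("dense prompt must match the image embedding's shape")
    return [
        [a + b for a, b in zip(row_img, row_mask)]
        for row_img, row_mask in zip(image_emb, mask_emb)
    ]


image_emb = [[0.5, 1.0], [1.5, 2.0]]
mask_emb = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical embedded mask prompt
combined = add_dense_prompt(image_emb, mask_emb)
print(combined)  # [[1.5, 1.0], [1.5, 3.0]]
```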

24 of 48

Segment Anything Model: Mask decoder

Takes the image embedding and the prompt embedding and outputs masks
The prompt encoder and mask decoder together predict a mask within 50 ms
Designed to predict multiple masks to handle ambiguity (whole, part, sub-part)

code: https://github.com/facebookresearch/segment-anything/blob/6fdee8f2727f4506cfbbe553e23b895e27956588/segment_anything/modeling/mask_decoder.py#L16
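
When several candidate masks come back for one ambiguous prompt, a common way to pick one is to take the candidate with the highest predicted score. The sketch below assumes the decoder returns one score per mask; the string "masks" and the score values are placeholders, and `pick_best_mask` is a hypothetical helper.

```python
def pick_best_mask(masks, scores):
    """Return (index, mask) of the candidate with the highest predicted score."""
    if not masks or len(masks) != len(scores):
        raise ValueError("need one score per candidate mask")
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, masks[best]


# Placeholder candidates for a single click: whole object, part, sub-part.
candidates = ["whole", "part", "sub-part"]
scores = [0.71, 0.93, 0.64]
index, mask = pick_best_mask(candidates, scores)
print(index, mask)  # 1 part
```

Keeping all three candidates and letting the user choose is an equally valid design for interactive tools.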

25 of 48

Applications: One click Segmentation

code: https://github.com/MrSyee/SAM-remove-background/blob/main/jupyternotebook/sam_click.ipynb

26 of 48

Applications: Everything

https://segment-anything.com/demo

The Everything demo on the official demo site

28 of 48

Applications: Everything

Process 1,024 points in 16 batches of 64
Infer masks and iou_prediction
Filter the masks

Remove masks below the threshold
Remove overlapping masks

Remove holes and islands from each mask

code: https://github.com/facebookresearch/segment-anything/blob/main/segment_anything/automatic_mask_generator.py
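
The batching and filtering steps can be sketched with the standard library alone. This is an illustration under assumptions, not the linked implementation: candidates are represented only by a bounding box and a score, overlap is measured with box IoU and greedy non-maximum suppression, the thresholds are illustrative, and hole and island removal is left out.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_masks(candidates, score_thresh, overlap_thresh):
    """Drop low-score candidates, then greedily drop overlapping ones."""
    confident = [c for c in candidates if c["score"] >= score_thresh]
    confident.sort(key=lambda c: c["score"], reverse=True)
    kept = []
    for cand in confident:
        if all(box_iou(cand["box"], k["box"]) <= overlap_thresh for k in kept):
            kept.append(cand)
    return kept


points = list(range(1024))  # stand-ins for the 32 x 32 grid points
batches = list(batched(points, 64))
print(len(batches))  # 16

candidates = [
    {"name": "a", "box": (0, 0, 10, 10), "score": 0.95},
    {"name": "b", "box": (1, 1, 10, 10), "score": 0.90},    # near-duplicate of "a"
    {"name": "c", "box": (20, 20, 30, 30), "score": 0.92},
    {"name": "d", "box": (40, 40, 50, 50), "score": 0.50},  # below the score threshold
]
kept = filter_masks(candidates, score_thresh=0.88, overlap_thresh=0.7)
print([c["name"] for c in kept])  # ['a', 'c']
```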

29 of 48

Applications: SAM with CLIP

Find all masks with Everything
Measure the similarity between the query text and each mask with CLIP
Keep only the masks with a high similarity score

code: https://github.com/Curt-Park/segment-anything-with-clip
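
The similarity filtering step can be sketched as a cosine-similarity threshold. The vectors below are hand-made placeholders for the embeddings CLIP would produce for the query text and for each masked image region; the function names and the 0.9 threshold are assumptions, and the linked repository may score masks differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_masks(text_emb, mask_embs, threshold):
    """Return indices of masks whose embedding is close enough to the text."""
    return [
        i for i, emb in enumerate(mask_embs)
        if cosine_similarity(text_emb, emb) >= threshold
    ]


text_emb = [1.0, 0.0]                             # placeholder query embedding
mask_embs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # placeholder mask embeddings
selected = select_masks(text_emb, mask_embs, threshold=0.9)
print(selected)  # [0]
```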

30 of 48

Applications: Grounded Segment Anything

https://github.com/IDEA-Research/Grounded-Segment-Anything

Run open-vocabulary object detection with Grounding DINO
Segment the object inside each bounding box with SAM

31 of 48

Applications: Inpaint Anything (w/ Stable Diffusion)

https://github.com/geekyutao/Inpaint-Anything

Segment the target object with SAM
Regenerate the image with Stable Diffusion or a similar model

Text prompt: "a teddy bear on a bench"

32 of 48

Applications: SAM-Track

SAM segments the first frame.
DeAOT predicts the masks for the following frames.

code: https://github.com/z-x-yang/Segment-and-Track-Anything

33 of 48

Applications: Awesome Segment Anything

34 of 48

Applications: Segment Everything Examples

code: https://github.com/annotation-ai/technical-demo

35 of 48