Digging Into Self-Supervised Monocular Depth Estimation(monodepth2)

mooh_0812 2022. 3. 4. 16:16

https://arxiv.org/abs/1806.01260

Abstract

a minimum reprojection loss, designed to robustly handle occlusions
a full-resolution multi-scale sampling method that reduces visual artifacts
auto-masking loss to ignore training pixels that violate camera motion assumptions

1. Introduction

Estimating depth from a single image is hard (without a second image to enable triangulation).
Generating high-quality depth-from-color is attractive as an inexpensive alternative to LIDAR.
Being able to extract depth from images also makes it easier to use large unlabeled image datasets for the pretraining of deep networks for downstream discriminative tasks.
However, collecting large and varied training datasets with accurate ground-truth depth for supervised learning is itself difficult.
In response, several recent self-supervised approaches show that monocular depth estimation models can instead be trained using only synchronized stereo pairs or monocular video (though challenges remain).
The paper proposes three architectural and loss innovations that lead to large improvements in monocular depth estimation when training with monocular video, stereo pairs, or both:
1. An appearance matching loss that addresses the occluded pixels that arise when using monocular supervision
2. Auto-masking: an approach that ignores pixels where no relative camera motion is observed in monocular training
3. A multi-scale appearance matching loss that performs all image sampling at the input resolution, reducing depth artifacts

2. Related Work

2.1 Supervised Depth Estimation

combining local predictions
non-parametric scene sampling(DepthTransfer: Depth Extraction from Video Using Non-parametric Sampling)
end-to-end supervised learning

(Depth Map Prediction from a Single Image using a Multi-Scale Deep Network)

(Deeper Depth Prediction with Fully Convolutional Residual Networks)

(Deep Ordinal Regression Network for Monocular Depth Estimation)

The methods above are all fully supervised and need ground truth at training time, but ground-truth depth is hard to acquire in real-world settings.
Alternatives include weakly supervised training data, synthetic training data, and so on.

2.2 Self-supervised Depth Estimation

Since no ground truth is available, image reconstruction is used as the supervisory signal instead.

Self-supervised Stereo Training

- Uses synchronized stereo pairs: by predicting the pixel disparities between the pair, the network learns monocular depth estimation (the stereo pair takes the place of GT)

Self-supervised Monocular Training

- Uses monocular videos: the camera pose between frames is estimated in addition to depth

3. Method

3.1 Self-Supervised Training

Finding a depth value for every pixel is an ill-posed problem.
Classical binocular and multi-view stereo methods reduce this ambiguity by enforcing smoothness in the depth maps.
The network is trained for view synthesis: it predicts the appearance of the target image from the viewpoint of another image.

- minimization of a photometric reprojection error at training time

$I_t$ : target image (single color input)
$D_t$ : depth map
$I_{t'}$ : source view
$T_{t \to t'}$ : relative pose of each source view $I_{t'}$ with respect to the target image $I_t$'s pose
$L_p$ : photometric reprojection error
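
As a reference, these quantities are combined roughly as follows in the paper (restated here from its Eq. 1 and 2; $K$, the camera intrinsics matrix, is not introduced elsewhere in these notes):

$$L_p = \sum_{t'} pe(I_t, I_{t' \to t})$$

$$I_{t' \to t} = I_{t'} \left< proj(D_t, T_{t \to t'}, K) \right>$$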

pe : photometric reconstruction error
proj() : the 2D coordinates of the projected depths $D_t$ in $I_{t'}$ (the locations obtained by projecting with the depth and the pose $T_{t \to t'}$ from $I_t$ to $I_{t'}$)
$\left< \right>$ : sampling operator
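
A minimal NumPy sketch of what proj() computes, assuming a pinhole camera. The intrinsics matrix `K`, the 4x4 pose layout, and the function signature are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def proj(depth, T, K):
    """Project every target pixel into the source view.

    depth: (H, W) depth map D_t of the target image
    T:     (4, 4) relative pose T_{t->t'} (target camera to source camera)
    K:     (3, 3) camera intrinsics (assumed; not defined in these notes)
    Returns an (H, W, 2) array of (x, y) pixel coordinates in the source image.
    """
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))          # pixel grid
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0)
    pix = pix.reshape(3, -1).astype(float)                    # homogeneous pixels
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)     # back-project to 3D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])      # homogeneous 3D points
    src = (T @ cam_h)[:3]                                     # move into source frame
    uv = K @ src                                              # apply intrinsics
    uv = uv[:2] / (uv[2:3] + 1e-7)                            # perspective divide
    return uv.T.reshape(H, W, 2)
```

The sampling operator then reads the source image at these (generally non-integer) coordinates, which is why bilinear interpolation is needed.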

The error is commonly the L1 distance between bilinearly sampled pixel values; this paper uses the L1 distance combined with SSIM.
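
In the paper's notation this is $pe(I_a, I_b) = \frac{\alpha}{2}(1 - \mathrm{SSIM}(I_a, I_b)) + (1 - \alpha)\|I_a - I_b\|_1$. Below is a rough NumPy sketch for single-channel images in [0, 1]. The 3x3 mean window, the reflect padding, and the constants `c1` and `c2` are assumptions chosen for illustration.

```python
import numpy as np

def box3(x):
    """3x3 mean filter with reflect padding (window size is an assumption)."""
    H, W = x.shape
    p = np.pad(x, 1, mode="reflect")
    return sum(p[i:i + H, j:j + W] for i in range(3) for j in range(3)) / 9.0

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM map for single-channel images with values in [0, 1]."""
    mu_a, mu_b = box3(a), box3(b)
    var_a = box3(a * a) - mu_a ** 2
    var_b = box3(b * b) - mu_b ** 2
    cov = box3(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def pe(a, b, alpha=0.85):
    """Per-pixel photometric error: alpha/2 * (1 - SSIM) + (1 - alpha) * L1."""
    dssim = np.clip((1.0 - ssim(a, b)) / 2.0, 0.0, 1.0)
    return alpha * dssim + (1.0 - alpha) * np.abs(a - b)
```

Identical images give an error of (numerically) zero everywhere, and the result is a per-pixel map, which is what the later minimum and masking steps operate on.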

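$\alpha = 0.85$. An edge-aware smoothness term is also used.

Here $d_{t}^{*} = d_{t} / \bar{d_t}$ is the mean-normalized inverse depth.
For monocular training, the two frames temporally adjacent to the target image (previous and next) serve as source images, and the model learns to estimate both pose and depth.
For stereo training, the other image of the stereo pair is the source image.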
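
The edge-aware smoothness term mentioned above is, in the paper, $L_s = |\partial_x d_t^*| e^{-|\partial_x I_t|} + |\partial_y d_t^*| e^{-|\partial_y I_t|}$. A small NumPy sketch follows, assuming a grayscale image and forward differences; both are simplifications for illustration.

```python
import numpy as np

def smoothness_loss(disp, img):
    """Edge-aware smoothness on mean-normalized inverse depth.

    disp: (H, W) inverse depth d_t
    img:  (H, W) grayscale target image I_t (grayscale is an assumption)
    """
    d = disp / (disp.mean() + 1e-7)              # d* = d / mean(d)
    dx_d = np.abs(d[:, 1:] - d[:, :-1])          # |dx d*|
    dy_d = np.abs(d[1:, :] - d[:-1, :])          # |dy d*|
    dx_i = np.abs(img[:, 1:] - img[:, :-1])      # |dx I|
    dy_i = np.abs(img[1:, :] - img[:-1, :])      # |dy I|
    # Depth gradients are penalized less where the image itself has an edge.
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

A depth discontinuity that coincides with an image edge therefore costs less than the same discontinuity in a flat image region.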

3.2 Improved Self-Supervised Depth Estimation

Per-Pixel Minimum Reprojection Loss

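When computing the reprojection error from multiple source images, existing methods average the errors computed from the individual source images.
This causes problems when a pixel is visible in the target image but not in a source image.
Even if the network predicts the correct depth for such a pixel, the corresponding color in the source image is occluded, so the pixel difference is counted as error and penalized.
To address this, the paper takes the per-pixel minimum instead of the average.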
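
In the paper's notation the loss becomes $L_p = \min_{t'} pe(I_t, I_{t' \to t})$. The toy example below contrasts the two reductions. The error values are made up for illustration, and `reprojection_loss` is a hypothetical helper name.

```python
import numpy as np

def reprojection_loss(errors, reduce="min"):
    """errors: (S, H, W) photometric errors, one map per source image."""
    return errors.min(axis=0) if reduce == "min" else errors.mean(axis=0)

# Toy case: pixel (0, 0) has the correct depth but is occluded in source 0,
# so only source 1 matches it well. Pixel (0, 1) is visible in both.
errors = np.array([[[0.9, 0.1]],    # source 0 (e.g. previous frame)
                   [[0.1, 0.1]]])   # source 1 (e.g. next frame)

avg = reprojection_loss(errors, "mean")  # occlusion inflates pixel (0, 0)
mn = reprojection_loss(errors, "min")    # occluded view is ignored per pixel
```

With averaging, the occluded pixel is penalized even though its depth is right; with the minimum, each pixel is matched only against the source view in which it is best explained.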

Auto-Masking Stationary Pixels

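Self-supervised monocular training usually assumes a moving camera and a static scene.
When this assumption breaks, performance suffers, which shows up in the depth map as "holes" of infinite depth.
Motivated by this, the paper proposes a new auto-masking method: pixels whose appearance does not change between the target and source images are excluded from the loss.
This lets the network ignore frames where the camera is not moving and objects that move at the same speed as the camera.
As in earlier work, a per-pixel mask $\mu$ is applied to the loss. Earlier methods had the network estimate the mask or derived it from object motion, whereas here a simpler rule is used.
If the reprojection error against the unwarped source image is no larger than the error against the warped source image, the pixel is treated as such a case and its mask is set to 0; otherwise the mask is 1.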
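
Written with an Iverson bracket, the paper's rule is $\mu = [\min_{t'} pe(I_t, I_{t' \to t}) < \min_{t'} pe(I_t, I_{t'})]$, so $\mu = 1$ keeps a pixel and $\mu = 0$ drops it. A minimal sketch, with made-up error values and a hypothetical function name:

```python
import numpy as np

def auto_mask(warped_err, identity_err):
    """Binary mask mu: 1 keeps the pixel in the loss, 0 ignores it.

    warped_err:   (S, H, W) pe(I_t, I_{t'->t}) for each warped source image
    identity_err: (S, H, W) pe(I_t, I_{t'}) for each unwarped source image
    """
    return (warped_err.min(axis=0) < identity_err.min(axis=0)).astype(float)

# Pixel (0, 0): the unwarped source already matches, e.g. an object moving
# with the camera. Pixel (0, 1): warping clearly improves the match.
warped = np.array([[[0.30, 0.05]], [[0.25, 0.40]]])
identity = np.array([[[0.02, 0.50]], [[0.03, 0.60]]])

mu = auto_mask(warped, identity)
masked = mu * warped.min(axis=0)   # masked per-pixel photometric loss
```

The mask needs no extra network output: it is computed from error maps that the forward pass already produces.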

Multi-scale Estimation

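Multi-scale depth prediction is used to avoid getting stuck in local minima caused by the bilinear sampler.
Previous methods compute a separate loss at each scale of the decoder and sum them.
The authors observe that this tends to create holes in low-texture regions at low resolution.
Instead of computing the photometric error on low-resolution images, the paper upsamples the low-resolution depth maps to the original input resolution and computes the error (pe) at that higher resolution (Fig. 3(d)).
This resembles matching patches, because a single depth value at low resolution corresponds to a whole patch once it is warped at high resolution.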
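
The patch analogy can be seen in a tiny sketch. Nearest-neighbour upsampling is used here only for brevity; the interpolation method is an assumption, and a smoother scheme such as bilinear would normally be preferred.

```python
import numpy as np

def upsample_nearest(depth_lowres, factor):
    """Repeat each low-resolution depth value over a factor x factor patch."""
    return np.repeat(np.repeat(depth_lowres, factor, axis=0), factor, axis=1)

low = np.array([[1.0, 2.0],
                [3.0, 4.0]])
full = upsample_nearest(low, 4)   # shape (8, 8), same size as the input image
```

Each low-resolution value now drives the warping of a 4x4 block of full-resolution pixels, so the photometric error for that value is effectively evaluated over a patch.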

Final Training Loss

combine our per-pixel smoothness and masked photometric losses

$$L = \mu L_p + \lambda L_s$$
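
A sketch of how the terms might be combined across scales. The default `lam=1e-3` is an assumption, since these notes do not give $\lambda$, and averaging over scales is likewise assumed.

```python
import numpy as np

def total_loss(reproj_per_scale, mask_per_scale, smooth_per_scale, lam=1e-3):
    """L = mu * L_p + lambda * L_s, averaged over pixels and then over scales.

    reproj_per_scale: list of (H, W) per-pixel minimum reprojection errors
    mask_per_scale:   list of (H, W) binary auto-masks mu
    smooth_per_scale: list of scalar smoothness losses L_s
    """
    per_scale = [
        (m * r).mean() + lam * s
        for r, m, s in zip(reproj_per_scale, mask_per_scale, smooth_per_scale)
    ]
    return float(np.mean(per_scale))
```

Because every scale's error is computed at the input resolution, the per-scale maps all share the same shape.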

3.3 Additional Considerations

network based on the general U-Net architecture

ResNet18 as encoder

start with weights pretrained on ImageNet