Deep Learning for Computer Vision
3d representation
- depth map
- voxel grid
- point cloud
- mesh
- implicit function
From 2d image to depth map:
- 2d U-Net (unet2d)
From 2d image to 3d voxel grid:
- Fully connected
- Transformer
- voxel tube (channel dimension as the z-axis of the camera)
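The voxel tube idea can be sketched with array shapes alone: the C output channels of a 2d network are read as D = C depth slices along the camera z-axis. This is a minimal NumPy sketch, not a full model; the random feature map stands in for a real CNN output, and the sigmoid occupancy head is an assumption.

```python
import numpy as np

def voxel_tube(features_2d):
    """Read a 2d feature map (C, H, W) as voxel occupancy probabilities (D, H, W).

    Each channel is treated as one depth slice along the camera z-axis,
    so D == C and the data itself is not rearranged.
    """
    # Sigmoid maps per-voxel logits to occupancy probabilities (assumed head).
    return 1.0 / (1.0 + np.exp(-features_2d))

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(32, 24, 24))  # placeholder for a 2d CNN output
voxels = voxel_tube(feature_map)             # 32 depth slices of 24 x 24
```

The design choice is that the network stays fully 2d; only the interpretation of the channel axis changes.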
From 2d image to point cloud:
- FC
- Transformer
From 2d image to mesh:
- 2d image -> 3d mask on a grid -> mesh -> deformed mesh
Model on 3d:
- voxel grid: conv3d
- point cloud: point-wise MLP (conv1d with kernel size 1)
- mesh: graph convolution
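To make the point-cloud entry concrete, the sketch below checks that a point-wise MLP layer gives the same result as a conv1d with kernel size 1 written out as an explicit loop over points. It uses NumPy only; the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def pointwise_mlp(points, weight, bias):
    """Apply one shared linear layer + ReLU to every point at once.

    points: (N, C_in), weight: (C_in, C_out), bias: (C_out,)
    """
    return np.maximum(points @ weight + bias, 0.0)

def conv1d_kernel1(points, weight, bias):
    """Conv1d with kernel size 1, spelled out: each output position
    sees exactly one input position and the weights are shared."""
    out = np.empty((points.shape[0], weight.shape[1]))
    for i in range(points.shape[0]):
        out[i] = np.maximum(points[i] @ weight + bias, 0.0)
    return out

rng = np.random.default_rng(0)
cloud = rng.normal(size=(5, 3))   # 5 points with xyz coordinates
w = rng.normal(size=(3, 8))
b = rng.normal(size=8)

features = pointwise_mlp(cloud, w, b)
perm = rng.permutation(5)         # reorder the points
features_permuted = pointwise_mlp(cloud[perm], w, b)
```

Because every point is processed independently, reordering the input points only reorders the output rows, which suits an unordered point cloud.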
Mean Average Precision (mAP) in Object detection
AP = area under Precision vs. Recall Curve
- Pick a class (e.g. dog), and a threshold for IoU (e.g. 0.5, IoU > 0.5 means detected)
- Sort all the predicted boxes by the probability of dog class.
- Iterate over these boxes from high probability to low.
- At each box, compute the Precision and Recall as if the probability threshold were set to that box's score.
- A box counts as a detection (TP) if its IoU with a GT box is > 0.5, otherwise it is a FP.
- Plot the Precision vs. Recall curve and compute the area under it.
To get AP=1, hit all GT boxes with IoU > 0.5 and have no FP detections ranked above any TP.
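The steps above can be sketched in plain Python for one class on one image. Assumptions in this sketch: boxes are (x1, y1, x2, y2), each prediction is greedily matched to the best still-unmatched GT box, and the area is the plain step-wise sum under the curve. Benchmark implementations typically interpolate the curve and pool detections over the whole dataset, so treat this as an illustration rather than a reference evaluator.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_precision(predictions, gt_boxes, iou_threshold=0.5):
    """AP for one class. predictions: list of (score, box)."""
    if not gt_boxes:
        return 0.0
    matched = set()          # indices of GT boxes already detected
    true_pos = false_pos = 0
    curve = []               # (recall, precision) after each prediction
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        # Find the best-overlapping GT box that is not matched yet.
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if idx not in matched and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou > iou_threshold:
            matched.add(best_idx)
            true_pos += 1
        else:
            false_pos += 1
        curve.append((true_pos / len(gt_boxes),
                      true_pos / (true_pos + false_pos)))
    # Area under the curve: precision times each increase in recall.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in curve:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)),     # TP
         (0.8, (50, 50, 60, 60)),   # FP ranked above the second TP
         (0.7, (20, 20, 30, 31))]   # TP with IoU of about 0.91
ap_dog = average_precision(preds, gts)   # 0.5 * 1 + 0.5 * (2/3) = 5/6
```

Moving the false positive to the bottom of the ranking would give AP = 1, which matches the rule stated above.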
mAP
Repeat the AP computation for every class and average the results; optionally repeat at several IoU thresholds and average over those as well.
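As a small illustration of the averaging, the sketch below takes a table of per-class AP values at two IoU thresholds and reduces it to one number. The table layout, the function name, and the AP values are made up for this example.

```python
from statistics import mean

def mean_average_precision(ap_table):
    """ap_table[iou_threshold][class_name] -> AP.

    Averages over classes first, then over IoU thresholds.
    """
    per_threshold = [mean(per_class.values()) for per_class in ap_table.values()]
    return mean(per_threshold)

# Illustrative numbers only, not measured results.
ap_table = {
    0.5:  {"dog": 0.80, "cat": 0.60},
    0.75: {"dog": 0.50, "cat": 0.30},
}
map_score = mean_average_precision(ap_table)   # mean of 0.70 and 0.40
```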
R-CNN
Non-DL region proposal (e.g. selective search); resample (crop and resize) the image in each region; run a CNN on each region; output a class and bounding box per region.
Fast R-CNN
Run a CNN on the entire image once; non-DL region proposal; resample the CNN features in each region to a fixed size (e.g. 7x7); run a small CNN on each region; output a class and bounding box per region.
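The "resample the features in each region into 7x7" step can be sketched as max pooling over a grid of bins. This is a simplified NumPy version that assumes the region is given in integer feature-map cells with an exclusive far edge; the function name is hypothetical, and other resampling schemes (e.g. bilinear sampling) are equally valid readings of the notes.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=7):
    """Pool one region of a (C, H, W) feature map to (C, output_size, output_size).

    roi: (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    channels = feature_map.shape[0]
    # Bin edges that split the region into output_size parts per axis.
    ys = np.linspace(y1, y2, output_size + 1)
    xs = np.linspace(x1, x2, output_size + 1)
    pooled = np.empty((channels, output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        y_lo = int(np.floor(ys[i]))
        y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)   # at least one row
        for j in range(output_size):
            x_lo = int(np.floor(xs[j]))
            x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)  # at least one column
            # Max over the cells that fall inside this bin, per channel.
            pooled[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return pooled

features = np.arange(2 * 20 * 20, dtype=float).reshape(2, 20, 20)
pooled = roi_max_pool(features, roi=(2, 4, 16, 18))   # 14 x 14 region -> 7 x 7
```

Whatever the size of the proposed region, the output has a fixed shape, so the same small per-region network can be applied to every proposal.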
Faster R-CNN
Same as Fast R-CNN, except that the region proposal is learned (DL): on top of the CNN output feature map, add another small CNN. Every position of its output corresponds to K anchor boxes, and for each anchor it predicts (a probability of being a region, the bbox).
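To make "every position is K anchor boxes" concrete, the sketch below enumerates the anchors for a small feature map in plain Python. The stride, sizes, and aspect ratios are illustrative assumptions, not values taken from these notes; the proposal network would then predict one region probability and one bbox correction per anchor.

```python
def make_anchors(feat_h, feat_w, stride, sizes=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """List (x1, y1, x2, y2) anchors in image pixels.

    There are K = len(sizes) * len(ratios) anchors per feature-map position,
    all centered on the image location that position corresponds to.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride   # center of this cell in the image
            cy = (row + 0.5) * stride
            for size in sizes:
                for ratio in ratios:    # ratio = width / height
                    w = size * ratio ** 0.5
                    h = size / ratio ** 0.5   # keeps the area at size ** 2
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(feat_h=4, feat_w=5, stride=16)
k = len(anchors) // (4 * 5)   # anchors per position
```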
Single Stage Object Detection (SSD)
All R-CNNs have two stages: region proposal + final bbox. SSD keeps only the first stage of Faster R-CNN: a CNN runs on the entire image, and instead of a DL-based region proposal (region or not), the network outputs the class and the bbox at each anchor box directly. SSD is roughly 10x faster than Faster R-CNN but not as accurate.