Tesla AI Day
https://www.youtube.com/watch?v=j0z4FweCy4M
Cache feature maps on disk
- shared backbone + multiple heads;
- backbone output feature maps are cached on disk;
- fine-tune the heads on the cached features (see the sketch after this list).
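A minimal PyTorch sketch of that workflow, assuming a toy `backbone` and `head`; all module names, shapes, and the loss are placeholders of mine, not details from the talk:

```python
import torch
import torch.nn as nn

# Toy shared backbone and one task head (illustrative, not the real architecture).
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))

# Pass 1: run the frozen backbone once and cache its output feature maps on disk.
backbone.eval()
with torch.no_grad():
    for i, image in enumerate(torch.rand(8, 1, 3, 128, 128)):   # stand-in for a dataset
        torch.save(backbone(image), f"features_{i}.pt")

# Pass 2: fine-tune only the head from the cached features; the backbone never re-runs.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
for i in range(8):
    features = torch.load(f"features_{i}.pt")
    loss = head(features).sum()                                  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```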
Late fusion -> early fusion
- vector space: a 3d space / 2d bird's-eye view. From the later slides (Multi-Cam Features HxWxCxT = 20x80x256x60), this vector space seems to be an ego-centric 2d bird's-eye view.
- late fusion: detect on each 2d image, then back-project the detections to 3d.
- early fusion (not so early): run a backbone on each 2d image; fuse into the vector space; run the heads on the vector space (sketched below).
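A rough sketch of that early-fusion layout, loosely following the 20x80x256 multi-cam feature shape from the slides. The module choices (`EarlyFusionNet`, a 1x1-conv fusion, interpolation into the BEV grid) are stand-ins of mine, not the actual architecture:

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Per-camera backbone -> fuse into one BEV 'vector space' -> heads on the BEV."""
    def __init__(self, num_cams=8, bev_hw=(20, 80), bev_c=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, 32, 3, stride=4, padding=1)  # shared across cameras
        self.fuse = nn.Conv2d(num_cams * 32, bev_c, 1)            # stand-in for the real fusion/back-projection
        self.bev_hw = bev_hw
        self.head = nn.Conv2d(bev_c, 1, 1)                        # e.g. a per-cell occupancy head

    def forward(self, images):                                    # images: (B, num_cams, 3, H, W)
        b, n = images.shape[:2]
        feats = self.backbone(images.flatten(0, 1))               # backbone on each 2d image
        feats = feats.unflatten(0, (b, n)).flatten(1, 2)          # stack camera features channel-wise
        bev = nn.functional.interpolate(self.fuse(feats), self.bev_hw)  # fuse into the BEV grid
        return self.head(bev)                                     # heads operate in the vector space

out = EarlyFusionNet()(torch.rand(2, 8, 3, 128, 256))
print(out.shape)                                                  # torch.Size([2, 1, 20, 80])
```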
Transformer for fuse / back-project
- each image provides keys and values
- a positionally encoded 3d grid / 2d bird's-eye-view grid provides the queries
- the Transformer does the data-dependent back-projection part (see the cross-attention sketch below)
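A minimal cross-attention sketch of that idea: flattened image features serve as keys/values, and learned positional encodings of the BEV grid serve as queries. All sizes are scaled-down guesses of mine, not numbers from the talk:

```python
import torch
import torch.nn as nn

B, C = 2, 64
n_img_tokens = 8 * 10 * 20        # 8 cameras x a small flattened feature map (assumed sizes)
n_bev_cells = 20 * 80             # the BEV (vector-space) grid, flattened

# Keys/values come from the per-camera image features; queries are positional
# encodings of the BEV grid.
image_feats = torch.rand(B, n_img_tokens, C)
bev_pos_enc = nn.Parameter(torch.rand(1, n_bev_cells, C))        # learned BEV positional encoding

attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
bev_feats, _ = attn(query=bev_pos_enc.expand(B, -1, -1),
                    key=image_feats, value=image_feats)
print(bev_feats.shape)            # (2, 1600, 64): one fused feature per BEV grid cell
```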
Camera rectify
- It sounds like correcting the per-camera orientation errors from installation in software, by warping each image into a common virtual camera.
- This happens before everything else; it is essentially pre-processing (a warping sketch follows).
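One possible way to implement such rectification is a homography warp into a canonical virtual camera, H = K R K^-1; this is my own guess at the mechanism, and the intrinsics `K` and mounting-error angles below are made-up values:

```python
import cv2
import numpy as np

# Warp an image into a canonical virtual camera: H = K @ R @ K^-1.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R, _ = cv2.Rodrigues(np.deg2rad([0.5, -1.0, 0.2]))    # small assumed orientation error
H = K @ R @ np.linalg.inv(K)

frame = np.zeros((480, 640, 3), dtype=np.uint8)       # stand-in for a camera frame
rectified = cv2.warpPerspective(frame, H, (640, 480)) # everything downstream sees this view
```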
Video Queue
- The queue feeds into an RNN.
- features: kinematics (velocity and acceleration) + vector-space features + positional encodings
- time queue: push an entry every fixed time interval.
- spatial queue: push an entry whenever the car travels a certain distance (a toy queue is sketched after this list).
- This can be thought of as fusing time and space, and it is late fusion again. Could the future direction be early fusion using optical flow / cost volumes?
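A toy version of such a queue; `FeatureQueue`, the GRU, the push intervals, and all feature sizes are my own placeholders, not values confirmed by the talk:

```python
from collections import deque
import torch
import torch.nn as nn

class FeatureQueue:
    """Keep the last `maxlen` snapshots; push on a time trigger or a distance trigger."""
    def __init__(self, maxlen=60, dt=0.027, ds=1.0):   # assumed intervals
        self.buf, self.dt, self.ds = deque(maxlen=maxlen), dt, ds
        self.last_t, self.last_pos = None, None

    def maybe_push(self, t, pos, kinematics, bev_feat, pos_enc):
        first = self.last_t is None
        time_due = first or (t - self.last_t) >= self.dt                        # time queue
        dist_due = first or torch.linalg.norm(pos - self.last_pos) >= self.ds   # spatial queue
        if time_due or dist_due:
            # entry = kinematics (velocity/acceleration) + BEV features + positional encoding
            self.buf.append(torch.cat([kinematics, bev_feat, pos_enc]))
            self.last_t, self.last_pos = t, pos

q = FeatureQueue()
for step in range(100):
    q.maybe_push(t=step * 0.01, pos=torch.tensor([step * 0.3, 0.0]),
                 kinematics=torch.rand(6), bev_feat=torch.rand(256), pos_enc=torch.rand(32))

rnn = nn.GRU(input_size=6 + 256 + 32, hidden_size=128, batch_first=True)
out, _ = rnn(torch.stack(list(q.buf)).unsqueeze(0))   # queue contents go into the RNN
```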
Spatial RNN
- This sounds like a larger/global 2d bird's-eye view that is not ego-centric.
- Only the area the car can currently "see" from its ego-centric vector space gets updated.
- However, it is unclear how large this non-ego-centric bird's-eye view is, and how such a large tensor is kept in memory (a toy version appears below).
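A toy guess at how such a spatial RNN could work: keep a large world-fixed hidden map and run an RNN cell only on the locally visible patch. The map size, channel count, and the GRU cell are assumptions of mine, not details from the talk:

```python
import torch
import torch.nn as nn

# World-fixed BEV hidden state; only the window the car currently "sees" is updated.
MAP_H, MAP_W, C, LOCAL = 512, 512, 32, 20             # assumed sizes
hidden_map = torch.zeros(MAP_H * MAP_W, C)             # non-ego-centric hidden state
cell = nn.GRUCell(input_size=C, hidden_size=C)

@torch.no_grad()
def update(ego_feat, top, left):
    """Run the RNN cell only on the visible LOCAL x LOCAL patch of the global map."""
    rows = torch.arange(top, top + LOCAL)
    cols = torch.arange(left, left + LOCAL)
    idx = (rows[:, None] * MAP_W + cols[None, :]).reshape(-1)   # flat indices of the patch
    hidden_map[idx] = cell(ego_feat.reshape(-1, C), hidden_map[idx])

update(ego_feat=torch.rand(LOCAL, LOCAL, C), top=250, left=260)
```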