UM-CV 17 3D Vision
Topics in 3D vision: computing correspondences, stereo, structure from motion, simultaneous localization and mapping (SLAM), view synthesis, differentiable graphics, etc.
Focus: 3D shape prediction - Shape representations, shape comparison metrics, camera systems, coordinates, datasets, and an example of Mesh R-CNN.
@Credits: EECS 498.008 | Video Lecture: UM-CV
Personal work for the course assignments: github repo.
Notice on Usage and Attribution
These are personal class notes based on the University of Michigan EECS 498.008 / 598.008 course. They are intended solely for personal learning and academic discussion, with no commercial use.
For detailed information, please refer to the complete notice at the end of this document.
Shape representations
Depth map
A depth map assigns a depth value to each pixel in the image; an RGB image paired with a depth map is an RGB-D image.
- Cannot capture occluded regions. Depth can be recorded directly by some types of 3D sensors.
- Task: Predicting Depth Maps
- Problem: Scale/depth ambiguity: a small object close to the camera looks identical to a larger object farther away, so absolute scale cannot be recovered from a single image.
- Solution: Use a loss function that is invariant to scale, e.g. a loss on log depths that cancels out any global scaling (see the sketch below).
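A minimal sketch of such a scale-invariant loss, following the log-space formulation of Eigen et al. (2014); the function name, tensor shapes, and `eps` guard are my own assumptions:

```python
import torch

def scale_invariant_loss(pred, gt, eps=1e-8):
    """Scale-invariant log-depth loss in the spirit of Eigen et al. (2014).

    pred, gt: positive depth maps of any (matching) shape.
    Scaling pred by a constant shifts d by a constant, and the second
    term cancels that shift, so the loss value is unchanged.
    """
    d = torch.log(pred + eps) - torch.log(gt + eps)  # per-pixel log ratio
    n = d.numel()
    return (d ** 2).sum() / n - d.sum() ** 2 / n ** 2
```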
Surface Normals
A surface normal map gives, for each pixel, the vector normal to the object's surface in the world at that pixel. Ground-truth normals: 3 × H × W.
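The lecture compares predicted and ground-truth normals with a per-pixel dot product; a minimal sketch of a cosine-similarity loss in that spirit (the `1 - cos` form and the mean reduction are assumptions):

```python
import torch.nn.functional as F

def normal_loss(pred, gt):
    """Per-pixel cosine loss between predicted and ground-truth normals.

    pred, gt: (B, 3, H, W) normal maps. Cosine similarity is the
    normalized dot product, so minimizing 1 - cos aligns directions.
    """
    cos = F.cosine_similarity(pred, gt, dim=1)  # (B, H, W), in [-1, 1]
    return (1.0 - cos).mean()
```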
Voxel grid
Represent a shape with a V × V × V grid of occupancies. Just like segmentation masks in Mask R-CNN, but in 3D.
- (+) Conceptually simple
- (-) Needs high spatial resolution to capture fine structures
- (-) Scaling to high resolutions is nontrivial
- Processing voxel inputs: 3D convolution, the direct analogue of 2D convolution with kernels sliding over x, y, and z
- Generating Voxel Shapes from 2D Images
- 2D image -> 2D CNN -> FC -> 3D CNN -> voxel grid (see the sketch after this list)
- Problem: 3D convolution is computationally expensive and memory-intensive, e.g. storing a 1024³ voxel grid of float32 values takes 4 GB of memory.
- Voxel tubes (3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction, ECCV 2016): give the final 2D conv layer V filters and interpret its output as a "tube" of V voxel scores per pixel. Problem: loses translation invariance in the z direction.
- Scaling voxel networks: octree-based methods (OctNet: Learning Deep 3D Representations at High Resolutions, CVPR 2017)
- Nested shape layers (Matryoshka Networks: Predicting 3D Geometry via Nested Shape Layers, CVPR 2018)
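A toy sketch of the 2D-encoder-to-3D-decoder pipeline above; the layer sizes, 64 × 64 input, and 32³ output resolution are illustrative assumptions, not the architecture of any cited paper:

```python
import torch
import torch.nn as nn

class VoxelPredictor(nn.Module):
    """Toy image-to-voxel network: 2D CNN -> FC -> 3D transposed convs."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # 3 x 64 x 64 image
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128 * 4 * 4 * 4),
        )
        self.decoder = nn.Sequential(            # 128 x 4^3 -> 1 x 32^3
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, img):
        z = self.encoder(img).view(-1, 128, 4, 4, 4)
        return self.decoder(z)                   # per-voxel occupancy logits

scores = VoxelPredictor()(torch.randn(1, 3, 64, 64))  # (1, 1, 32, 32, 32)
```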
Implicit Surface
Learn a function to classify arbitrary 3D points as inside/outside the shape. A signed distance function (SDF) gives the Euclidean distance to the surface of the shape; its sign gives inside/outside.
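A minimal sketch of querying such an implicit function on a grid of points; the plain MLP below is a placeholder (real systems such as DeepSDF or Occupancy Networks also condition on a latent shape code):

```python
import torch
import torch.nn as nn

# Hypothetical coordinate MLP: maps a 3D point to an inside/outside logit.
implicit_fn = nn.Sequential(
    nn.Linear(3, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),                 # > 0: inside, < 0: outside
)

# Query the function on a dense grid of points; a surface mesh can then
# be extracted from the level set, e.g. with marching cubes.
V = 32
axis = torch.linspace(-1, 1, V)
pts = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
occupancy = implicit_fn(pts.reshape(-1, 3)).reshape(V, V, V)
```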
Point cloud
A set of points in 3D space.
- (+) Can represent fine structures without huge numbers of points
- (-) Requires new architectures, losses, etc.
- (-) Doesn't explicitly represent the surface of the shape: extracting a mesh for rendering or other tasks requires post-processing into another format.
- Processing point cloud inputs, e.g. segmentation: PointNet (PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, CVPR 2017)
- Generating Pointcloud Outputs: A Point Set Generation Network for 3D Object Reconstruction from a Single Image, CVPR 2017
- Chamfer distance: measures the distance between two point clouds as the sum of squared nearest-neighbor distances in both directions (see the sketch below):
  $$d_{\mathrm{CD}}(X, Y) = \sum_{x \in X} \min_{y \in Y} \lVert x - y \rVert_2^2 + \sum_{y \in Y} \min_{x \in X} \lVert x - y \rVert_2^2$$
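A brute-force sketch of the Chamfer distance above for small point clouds; computing the full pairwise distance matrix is O(NM), so real pipelines use more efficient nearest-neighbor queries:

```python
import torch

def chamfer_distance(x, y):
    """Chamfer distance between point clouds x: (N, 3) and y: (M, 3).

    For each point in one set, take the squared distance to its nearest
    neighbor in the other set, and sum over both directions.
    """
    d = torch.cdist(x, y) ** 2          # (N, M) pairwise squared distances
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()
```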
Mesh
Triangle mesh:
- Vertices: Set of V points in 3D space
- Faces: Set of triangles over the Vertices
- (+) Standard representation for graphics
- (-) Requires new neural network architectures
Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images, ECCV 2018
Key ideas:
- Idea 1: Iterative mesh refinement
- Start from an initial ellipsoid mesh; the network predicts an offset for each vertex; repeat.
- Predicting triangle meshes: graph convolution (see the sketch after this list)
- Input: Graph with a feature vector at each vertex.
- Output: New feature vector for each vertex.
- Graph convolution: $h_v^{(l+1)} = \sigma\left(\sum_{u \in \mathcal{N}(v)} \frac{1}{c_v} W^{(l)} h_u^{(l)}\right)$, where $\mathcal{N}(v)$ is the neighborhood of vertex $v$ and $c_v$ is a normalization constant such as the neighborhood size.
- Idea 2: How to incorporate image information?
- Aligned vertex features:
- For each vertex of the mesh, use camera information to project it onto the image plane.
- Use bilinear interpolation to sample a CNN feature at the projected location.
- Loss function
- The same shape can be represented by many different meshes, so the loss should be invariant to the particular mesh representation of the prediction and the ground truth.
- Loss: Chamfer distance between points sampled from the predicted mesh and points sampled from the ground-truth mesh.
- GEOMetrics: Exploiting Geometric Structure for Graph-Encoded Objects, ICML 2019
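A minimal sketch of the graph-convolution update above, using a dense adjacency matrix for clarity (real mesh networks use sparse neighbor lists; the bias term inside `nn.Linear` and the ReLU nonlinearity are assumptions):

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph-convolution layer over vertex features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # h: (V, in_dim) vertex features; adj: (V, V) 0/1 adjacency with
        # self-loops, so each vertex also keeps its own feature.
        c = adj.sum(dim=1, keepdim=True)        # c_v = neighborhood size
        return torch.relu(self.W(adj @ h / c))  # sigma(W * mean of neighbors)
```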
Shape Comparison Metrics
IoU over 3D shapes
Chamfer distance
- Problem: Very sensitive to outliers, since a single stray point contributes a large squared nearest-neighbor distance to the sum
F1 Score
- Precision@t = fraction of predicted points within distance t of some ground-truth point.
- Recall@t = fraction of ground-truth points within distance t of some predicted point.
- F1@t = harmonic mean of Precision@t and Recall@t (see the sketch below); robust to outliers, unlike Chamfer distance.
- What Do Single-View 3D Reconstruction Networks Learn?, CVPR 2019
Fig: F1 score
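A minimal sketch of F1@t between two sampled point clouds; the brute-force distance matrix and the small epsilon guard are assumptions:

```python
import torch

def f1_score(pred, gt, t):
    """F1@t for point clouds pred: (N, 3) and gt: (M, 3)."""
    d = torch.cdist(pred, gt)                             # (N, M) distances
    precision = (d.min(dim=1).values < t).float().mean()  # pred near gt
    recall = (d.min(dim=0).values < t).float().mean()     # gt near pred
    return 2 * precision * recall / (precision + recall + 1e-8)
```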
Camera Systems, Coordinates
Canonical Coordinates: Predict the 3D shape in a canonical coordinate system (e.g. the front of a chair is +z) regardless of the viewpoint of the camera.
View Coordinates: Predict 3D shape aligned to the viewpoint of the camera.
Many papers predict in canonical coordinates because it makes loading data easier. However, canonical coordinates break the "principle of feature alignment": predictions should be aligned to inputs.
Idea: View-Centric Voxel Predictions
Datasets
3D datasets are often object-centric, e.g. ShapeNet: ~50 categories, ~50k synthetic 3D CAD models.
Pix3D: IKEA furniture models aligned to ~17k real images. Some papers train on ShapeNet and show qualitative results here, but rely on ground-truth segmentation masks.
Example
Mesh R-CNN
Input: Single RGB image
Output: a set of detected objects; for each object:
Mask R-CNN heads:
- bounding box
- category label
- instance segmentation mask
Mesh head:
- 3D triangle mesh
Problem with mesh deformation: the topology is fixed by the initial mesh, so all predictions are homeomorphic to the initial mesh. Mesh R-CNN therefore first predicts coarse voxel scores, converts them into an initial mesh, and then refines that mesh by deforming its vertices.
Mesh R-CNN Pipeline
Fig: Mesh R-CNN Pipeline
Add shape regularizers to the loss function to encourage the predicted mesh to be regular and smooth, e.g. by penalizing long edges (see the sketch below).
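A minimal sketch of such an edge-length regularizer; the (E, 2) edge-index layout is an assumption:

```python
import torch

def edge_length_loss(verts, edges):
    """Mean squared edge length, penalizing long edges so the
    predicted mesh stays regular and smooth.

    verts: (V, 3) vertex positions; edges: (E, 2) vertex-index pairs.
    """
    v0, v1 = verts[edges[:, 0]], verts[edges[:, 1]]
    return ((v0 - v1) ** 2).sum(dim=1).mean()
```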
Concept: Amodal completion - predicting parts of objects that are occluded in the image.
Notice on Usage and Attribution
This note is based on the University of Michigan's publicly available course EECS 498.008 / 598.008 and is intended solely for personal learning and academic discussion, with no commercial use.
- Nature of the Notes: These notes include extensive references and citations from course materials to ensure clarity and completeness. However, they are presented as personal interpretations and summaries, not as substitutes for the original course content.
- Original Course Resources: Please refer to the official University of Michigan website for complete and accurate course materials.
- Third-Party Open Access Content: This note may reference Open Access (OA) papers or resources cited within the course materials. These materials are used under their original Open Access licenses (e.g., CC BY, CC BY-SA).
- Proper Attribution: Every referenced OA resource is appropriately cited, including the author, publication title, source link, and license type.
- Copyright Notice: All rights to third-party content remain with their respective authors or publishers.
- Content Removal: If you believe any content infringes on your copyright, please contact me, and I will promptly remove the content in question.
Thanks to the University of Michigan and the contributors to the course for their openness and dedication to accessible education.