**Assignment 2: Single View to 3D**
Gunjan Sethi: gunjans@andrew.cmu.edu
(#) Contents
- [Fitting a Voxel Grid](#q11)
- [Fitting a Pointcloud](#q12)
- [Fitting a Mesh](#q13)
- [Reconstructing 3D from Single View: Image to Voxel Grid](#q21)
- [Reconstructing 3D from Single View: Image to Pointcloud](#q22)
- [Reconstructing 3D from Single View: Image to Mesh](#q23)
- [Quantitative Comparisons](#q24)
- [Hyperparameter Variations](#q25)
- [Interpreting the Model](#q26)
(#) Exploring Loss Functions
(##) 1.1. Fitting a Voxel Grid
The goal is to define the binary cross entropy loss to help fit a 3D binary voxel
grid.
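A minimal sketch of this loss, assuming the decoder outputs raw (unnormalized) occupancy logits and the ground truth is a binary occupancy grid:
```
import torch

def voxel_loss(voxel_pred, voxel_gt):
    # BCE between predicted occupancy logits and the {0, 1} ground-truth
    # grid; `with_logits` folds the sigmoid into the loss for stability.
    return torch.nn.functional.binary_cross_entropy_with_logits(voxel_pred, voxel_gt)
```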
Command:
`python fit_data.py --type 'vox' --max_iter 25000`
| Source Voxel Grid | Ground Truth Voxel Grid | Optimized Voxel Grid |
| --- | --- | --- |
| (image) | (image) | (image) |
(##) 1.2. Fitting a Pointcloud
The goal is to implement the chamfer loss function to help fit a 3D pointcloud.
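A sketch of the symmetric chamfer loss, assuming batched (B, N, 3) point clouds and using pytorch3d's `knn_points` for the nearest-neighbor queries (mean vs. sum reduction is a design choice):
```
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # Squared distance from each point to its nearest neighbor in the
    # other cloud, accumulated in both directions.
    d_src = knn_points(point_cloud_src, point_cloud_tgt, K=1).dists  # (B, N, 1)
    d_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1).dists  # (B, M, 1)
    return d_src.mean() + d_tgt.mean()
```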
Command:
`python fit_data.py --type 'point' --max_iter 12000`
| Source Pointcloud | Ground Truth Pointcloud |
| --- | --- |
| (image) | (image) |
| 3k Iters | 6k Iters | 9k Iters | 12k Iters |
| --- | --- | --- | --- |
| (image) | (image) | (image) | (image) |
(##) 1.3. Fitting a Mesh
The goal is to define Laplacian smoothing loss to help fit a 3D mesh.
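A minimal sketch, assuming pytorch3d's built-in uniform Laplacian regularizer (typically weighted against the chamfer term during fitting):
```
from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh):
    # Penalizes each vertex's offset from the centroid of its neighbors,
    # discouraging spiky, irregular deformations of the mesh.
    return mesh_laplacian_smoothing(mesh, method="uniform")
```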
Command:
`python fit_data.py --type 'mesh' --max_iter 10000`
| Source Mesh | Ground Truth Mesh |
| --- | --- |
| (image) | (image) |
| 3k Iters | 6k Iters | 9k Iters | Optimized |
| --- | --- | --- | --- |
| (image) | (image) | (image) | (image) |
(#) Reconstructing 3D from Single View
The goal is to train a single view to 3D pipeline for voxels, pointclouds and meshes.
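All three pipelines share the same image encoder; only the decoder changes per representation. A sketch of the shared front end, assuming a pretrained ResNet-18 whose pooled 512-d feature feeds the decoder:
```
import torch
from torchvision import models

# ResNet-18 with the classification head removed; global average pooling
# leaves a 512-d feature per image. (Newer torchvision versions take a
# `weights=` argument instead of `pretrained=True`.)
resnet = models.resnet18(pretrained=True)
encoder = torch.nn.Sequential(*list(resnet.children())[:-1])

images = torch.randn(64, 3, 137, 137)         # (B, 3, H, W) rendered views
feats = encoder(images).flatten(start_dim=1)  # (B, 512)
```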
(##) 2.1. Image to Voxel Grid - Ablations
(###) Single ConvTranspose3D Layer
Trained for 7.5k iters with batch size 64.
`nn.ConvTranspose3d(512, 1, kernel_size=32, stride=1)`
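A quick shape check (illustration only, not the training code): with kernel size 32 and stride 1, this single transposed convolution maps the encoder feature, viewed as a (512, 1, 1, 1) volume, directly to the 32³ grid:
```
import torch

decoder = torch.nn.ConvTranspose3d(512, 1, kernel_size=32, stride=1)
x = torch.randn(64, 512, 1, 1, 1)  # encoder feature as a 1x1x1 volume
print(decoder(x).shape)            # torch.Size([64, 1, 32, 32, 32])
```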
| Images | Ground Truth Mesh | Predicted Voxel Grid |
| -- | -- | -- |
| (image) | (image) | (image) |
(###) 4-Layer ConvTranspose3D Model
This decoder model is inspired by Pix2Vox. Trained for 9k iters with batch size 64.
```
self.layer1 = torch.nn.Sequential(
torch.nn.ConvTranspose3d(512, 128, kernel_size=8, stride=2, padding=1),
torch.nn.BatchNorm3d(128),
)
self.layer2 = torch.nn.Sequential(
torch.nn.ConvTranspose3d(128, 32, kernel_size=8, stride=2, padding=1),
torch.nn.BatchNorm3d(32),
)
self.layer3 = torch.nn.Sequential(
torch.nn.ConvTranspose3d(32, 8, kernel_size=4, stride=2, padding=1),
torch.nn.BatchNorm3d(8),
)
self.layer4 = torch.nn.Sequential(
torch.nn.ConvTranspose3d(8, 1, kernel_size=1),
)
```
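With these kernel/stride/padding choices, each layer roughly doubles the spatial resolution: the 512-channel 1³ feature grows to 128×6³, then 32×16³, then 8×32³, and the final 1×1 convolution collapses it to the single-channel 32³ occupancy grid.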
| Images | Ground Truth Mesh | Predicted Voxel Grid |
| -- | -- | -- |
| (image) | (image) | (image) |
(###) 3-Layer FC Decoder [Final]
Trained for 8k iterations with batch size 64.
```
nn.Sequential(
nn.Linear(512, 2048),
nn.ReLU(),
nn.Linear(2048, 8192),
nn.ReLU(),
nn.Linear(8192, (32 * 32 * 32))
)
```
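The flat 32768-d output is then reshaped to a (B, 32, 32, 32) grid of occupancy logits before the binary cross-entropy loss is applied.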
| Images | Ground Truth Mesh | Predicted Voxel Grid |
| -- | -- | -- |
| (image) | (image) | (image) |
(##) 2.2. Image to Pointcloud - Ablations
(###) Single FC Layer
Trained for 10k iters with batch size 2.
```
nn.Linear(512, (args.n_points * 3))
```
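The flat output is reshaped to (B, args.n_points, 3) to form the predicted pointcloud.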
| Images | Ground Truth Mesh | Predicted Pointcloud |
| -- | -- | -- |
| (image) | (image) | (image) |
(###) 3-Layer FC Decoder [Final]
Trained for 10k iters with batch size 64.
```
nn.Sequential(
nn.Linear(512, 1024),
nn.ReLU(),
nn.Linear(1024, 4096),
nn.ReLU(),
nn.Linear(4096, (args.n_points * 3))
)
```
| Images | Ground Truth Mesh | Predicted Pointcloud |
| -- | -- | -- |
| (image) | (image) | (image) |
(##) 2.3. Image to Mesh - Ablations
(###) 3-Layer FC Decoder [Final]
Trained for 7k iterations with batch size 64.
```
nn.Sequential(
nn.Linear(512, 1024),
nn.ReLU(),
nn.Linear(1024, 4096),
nn.ReLU(),
nn.Linear(4096, mesh_pred.verts_packed().shape[0] * 3)
)
```
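The decoder output is interpreted as per-vertex offsets that deform a fixed source mesh (e.g., an ico-sphere). A sketch of that step, assuming `feats` is the (B, 512) encoder feature and `decoder` is the module above:
```
# Predict per-vertex offsets and apply them to the source mesh.
deform = decoder(feats)          # (B, V * 3)
deform = deform.reshape(-1, 3)   # packed (B * V, 3), as offset_verts expects
mesh_pred = mesh_pred.offset_verts(deform)
```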
| Images | Ground Truth Mesh | Predicted Mesh |
| -- | -- | -- |
| (image) | (image) | (image) |
(##) 2.4. Quantitative Comparisons
Simple fully connected decoders work well for all three 3D representations. With comparable architectures, however, voxels show a lower F1 score than pointclouds and meshes, likely because the 32³ discretization struggles to represent thin structures such as chair legs.
| Voxel | Pointcloud | Mesh |
| -- | -- | -- |
| F1_0.05: 93.043; Avg F1_0.05: 90.688 | F1_0.05: 99.760; Avg F1_0.05: 96.874 | F1_0.05: 99.199; Avg F1_0.05: 95.833 |
(##) 2.5. Hyperparameter Variations
(###) Voxel Predictions
Increasing the batch size from 2 to 64 significantly boosted the F1 score (by roughly 20%). Qualitatively, the models trained with batch size 64 also capture distinguishing characteristics of individual chairs.
| Groundtruth | Trained with batchsize = 2 | Trained with batchsize = 64 |
| -- | -- | -- |
| (image) | (image) | (image) |
(###) Pointcloud Predictions
Increasing the batch size from 2 to 64 boosted the F1 score by 5%. The model trained with the larger batch size better captures nuances of the chair structure, for example, the legs in the example below.
| Groundtruth | Trained with batchsize = 2 | Trained with batchsize = 64 |
| -- | -- | -- |
| (image) | (image) | (image) |
Further, increasing the number of predicted points beyond 5000 does not improve performance: with 1000 points the F1 score is around 70, but it rises above 90 with 5000 points. The models below are trained with batch size 2.
| Groundtruth | n_points = 1000 | n_points = 5000 | n_points = 10000 |
| -- | -- | -- | -- |
| (image) | (image) | (image) | (image) |
(##) 2.6. Model Interpretation
(###) Voxel Predictions Through Iterations
| Sample 0 | Sample 140 |
| -- | -- |
| (image) | (image) |
(###) Mesh Deformations Through Iterations
| Sample 0 | Sample 140 |
| -- | -- |
| (image) | (image) |