**Assignment 2: Single View to 3D**

Gunjan Sethi: gunjans@andrew.cmu.edu

(#) Contents

- [Fitting a Voxel Grid](#q11)
- [Fitting a Pointcloud](#q12)
- [Fitting a Mesh](#q13)
- [Reconstructing 3D from Single View: Image to Voxel Grid](#q21)
- [Reconstructing 3D from Single View: Image to Pointcloud](#q22)
- [Reconstructing 3D from Single View: Image to Mesh](#q23)
- [Quantitative Comparisons](#q24)
- [Hyperparameter Variations](#q25)
- [Interpreting the Model](#q26)

(#) Exploring Loss Functions

(##) 1.1. Fitting a Voxel Grid

The goal is to define the binary cross entropy loss to help fit a 3D binary voxel grid.

Command: `python fit_data.py --type 'vox' --max_iter 25000`

| Source Voxel Grid | Ground Truth Voxel Grid | Optimized Voxel Grid |
| --- | --- | --- |
| | | |

(##) 1.2. Fitting a Pointcloud

The goal is to implement the chamfer loss function to help fit a 3D pointcloud.

Command: `python fit_data.py --type 'point' --max_iter 12000`

| Source Pointcloud | Ground Truth Pointcloud |
| --- | --- |
| | |

| 3k Iters | 6k Iters | 9k Iters | 12k Iters |
| --- | --- | --- | --- |
| | | | |

(##) 1.3. Fitting a Mesh

The goal is to define the Laplacian smoothing loss to help fit a 3D mesh.

Command: `python3 fit_data.py --type 'mesh' --max_iter 10000`

| Source Mesh | Ground Truth Mesh |
| --- | --- |
| | |

| 3k Iters | 6k Iters | 9k Iters | Optimized |
| --- | --- | --- | --- |
| | | | |

(#) Reconstructing 3D from Single View

The goal is to train a single-view-to-3D pipeline for voxels, pointclouds, and meshes.

(##) 2.1. Image to Voxel Grid - Ablations

(###) Single ConvTranspose3D Layer

Trained for 7.5k iters with batch size 64. A single transposed convolution maps the (512, 1, 1, 1) image feature directly to a (1, 32, 32, 32) grid of occupancy logits.

`nn.ConvTranspose3d(512, 1, kernel_size=32, stride=1)`

| Images | Ground Truth Mesh | Predicted Voxels |
| --- | --- | --- |
| | | |
| | | |
| | | |

(###) 4-Layer ConvTranspose3D Model

This decoder model is inspired by Pix2Vox. Trained for 9k iters with batch size 64.

```
# Upsamples the (B, 512, 1, 1, 1) image feature to (B, 1, 32, 32, 32)
# occupancy logits; the spatial size grows 1^3 -> 6^3 -> 16^3 -> 32^3.
self.layer1 = torch.nn.Sequential(
    torch.nn.ConvTranspose3d(512, 128, kernel_size=8, stride=2, padding=1),
    torch.nn.BatchNorm3d(128),
)
self.layer2 = torch.nn.Sequential(
    torch.nn.ConvTranspose3d(128, 32, kernel_size=8, stride=2, padding=1),
    torch.nn.BatchNorm3d(32),
)
self.layer3 = torch.nn.Sequential(
    torch.nn.ConvTranspose3d(32, 8, kernel_size=4, stride=2, padding=1),
    torch.nn.BatchNorm3d(8),
)
self.layer4 = torch.nn.Sequential(
    # 1x1x1 conv collapses the channels to a single occupancy channel.
    torch.nn.ConvTranspose3d(8, 1, kernel_size=1),
)
```

| Images | Ground Truth Mesh | Predicted Voxels |
| --- | --- | --- |
| | | |
| | | |
| | | |

(###) 4 FC Layers [Final]

Trained for 8k iterations with batch size 64.

```
# Maps the 512-d image feature to 32^3 occupancy logits,
# which are reshaped to (B, 1, 32, 32, 32) afterwards.
nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 8192),
    nn.ReLU(),
    nn.Linear(8192, 32 * 32 * 32),
)
```

| Images | Ground Truth Mesh | Predicted Voxels |
| --- | --- | --- |
| | | |
| | | |
| | | |

(##) 2.2. Image to Pointcloud - Ablations

All ablations below vary the decoder model.

(###) Single FC Layer

Trained for 10k iters with batch size 2.

```
# Predicts all n_points 3D coordinates in one linear map;
# the output is reshaped to (B, n_points, 3).
nn.Linear(512, args.n_points * 3)
```

| Images | Ground Truth Mesh | Predicted Pointcloud |
| --- | --- | --- |
| | | |
| | | |
| | | |

(###) 3-Layer FC Decoder [Final]

Trained for 10k iters with batch size 64.

```
# Three FC layers from the 512-d image feature to (B, n_points, 3) points.
nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, args.n_points * 3),
)
```

| Images | Ground Truth Mesh | Predicted Pointcloud |
| --- | --- | --- |
| | | |
| | | |
| | | |
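Both pointcloud decoders are trained with the chamfer loss from Section 1.2. For reference, a minimal sketch of that loss, assuming PyTorch3D's `knn_points`; the function name `chamfer_loss` and the mean reduction are assumptions rather than the exact code used here:

```
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # Sketch of the Section 1.2 loss; the mean reduction is an assumption.
    # Both inputs are (B, N, 3) tensors; N may differ between the two clouds.
    # knn_points returns squared distances to the K nearest neighbors.
    dists_src = knn_points(point_cloud_src, point_cloud_tgt, K=1).dists  # (B, N_src, 1)
    dists_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1).dists  # (B, N_tgt, 1)
    # Symmetric chamfer distance: average nearest-neighbor distance in both directions.
    return dists_src.mean() + dists_tgt.mean()
```

Summing the two directional terms keeps the loss symmetric, so the prediction can neither collapse onto a few target points nor leave parts of the target uncovered.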
(##) 2.3. Image to Mesh - Ablations

(###) 3-Layer FC Decoder [Final]

Trained for 7k iterations with batch size 64.

```
# Predicts a 3D offset for every vertex of the initial template mesh;
# the output is reshaped to (n_verts, 3) and applied to mesh_pred.
nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, mesh_pred.verts_packed().shape[0] * 3),
)
```

| Images | Ground Truth Mesh | Predicted Mesh |
| --- | --- | --- |
| | | |
| | | |
| | | |

(##) 2.4. Quantitative Comparisons

Simple fully-connected decoders work well for all three 3D representations. With comparable architectures, however, the voxel model reaches a lower F1 score than the pointcloud and mesh models.

| Voxel | Pointcloud | Mesh |
| --- | --- | --- |
| F1_0.05: 93.043; Avg F1_0.05: 90.688 | F1_0.05: 99.760; Avg F1_0.05: 96.874 | F1_0.05: 99.199; Avg F1_0.05: 95.833 |
| | | |

(##) 2.5. Hyperparameter Variations

(###) Voxel Predictions

Increasing the batch size from 2 to 64 significantly boosted the F1 score (by 20%). Qualitatively as well, the models trained with batch size 64 capture distinguishing characteristics of individual chairs.

| Ground Truth | Trained with batch size = 2 | Trained with batch size = 64 |
| --- | --- | --- |
| | | |

(###) Pointcloud Predictions

Increasing the batch size from 2 to 64 boosted the F1 score by 5%. The model trained with the larger batch size better captures nuances of the chair structure, for example the structure of the chair legs in the example below.

| Ground Truth | Trained with batch size = 2 | Trained with batch size = 64 |
| --- | --- | --- |
| | | |

Further, increasing the number of predicted points beyond 5000 does not improve performance: with 1000 points the F1 score is around 70, but it climbs above 90 with 5000 points. The models below are trained with batch size 2.

| Ground Truth | n_points = 1000 | n_points = 5000 | n_points = 10000 |
| --- | --- | --- | --- |
| | | | |

(##) 2.6. Model Interpretation

(###) Voxel Predictions Through Iterations

| Sample 0 | Sample 140 |
| --- | --- |
| | |

(###) Mesh Deformations Through Iterations

| Sample 0 | Sample 140 |
| --- | --- |
| | |
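The voxel snapshots above are rendered from intermediate predictions. One way to turn raw decoder outputs into renderable geometry (a sketch of an assumed pipeline, not necessarily the exact script used here) is to threshold the sigmoid occupancies and run PyTorch3D's `cubify`:

```
import torch
from pytorch3d.ops import cubify

@torch.no_grad()
def voxels_to_mesh(voxel_logits, threshold=0.5):
    # Assumed helper; the name and the 0.5 threshold are not from the report.
    # voxel_logits: (B, 1, 32, 32, 32) raw outputs of the voxel decoder.
    occupancy = torch.sigmoid(voxel_logits).squeeze(1)  # (B, 32, 32, 32) probabilities
    # cubify places a cube at each cell whose occupancy exceeds the threshold
    # and returns a Meshes object that can be rendered like any other mesh.
    return cubify(occupancy, thresh=threshold)
```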