- Title: Physical Primitive Decomposition
- Author: Zhijian Liu, William T. Freeman, Joshua B. Tenenbaum and Jiajun Wu
- Affiliation: Massachusetts Institute of Technology, Google Research
- Publication status: European Conference on Computer Vision (ECCV 2018)
- Short name: PPD
- Year: 2018
- Idea: 5
- Usability: 4
- Presentation: 5
- Overall: 4
- Objects are made of functional parts, each with distinct geometry, physics, functionality, and affordances.
- For intelligent agents to better explore and interact with the world, there must be development of a distributed, physical, interpretable representation of objects.
- Goal is to explain the shape and physics of an object with shape primitives with physical parameters.
- Proposed a formulation that learns a part-based object representation from both visual observations and physical interactions. Starting from a single image and a voxelised shape, the model recovers the geometric primitives and infers their physical properties from texture.
- The contributions include:
- The problem of physical primitive decomposition - learning of compact, disentangled primitives.
- Present a novel learning paradigm that learns to characterise shapes in physical primitives to explain both their geometry and physics.
- Demonstrate that their system can achieve good performance on both synthetic and real data.
-
Evaluates their system for physical primitive decomposition in three scenerios:
- Decomposing Block Towers: They generate a dataset of synthetic block towers, where each block has distinct geometry and physics. The model recontructs the physical primitives by making use of both appearance and motion cues.
- Decomposing Tools: They evaluate the system in a set of synthetic tools, demonstrating its applicability to daily-life shapes.
- Decomposing Real Objects: They built a new dataset of real block towers in dynamic scenes, and evaluated the model's generalisation power to real videos.
-
Ablation studies are also presented to understand how each source of information contributes to the final performance.
-
Human behavioral experiments to contrast the performance of the model compared with humans is undertaken, where the question of 'which block is heavier' is asked.
- Ablation study refers to removing some feature of the model or algorithm, and seeing how that affects performance.
- Texture for materials is obtained by cropping the center portion of images from the MINC dataset.
- 3D Warehouse, where the unrelated models were manually removed.
- Network CNN encoder input is: voxels (323 grid), images and object trajectories.
- Predicted metrics: size = (sx sy, sz), position = (px, py, pz), rotation in quaternion form = (qw, qx, qy, qz) and density.
- Shape parameters are continuous real values, where density is a discrete value. Therefore, the density value is discretised into ND = 100 slots, so that estimating desnity becomes a ND-way classification.
- The voxel feature, image feature and trajectory feature are concatenated together and mapped to a low dimensional feature using a fully-connected layer. The set of physical primitives are sequentially predicted by a recurrent generator.
- Loss function consists of a geometry loss and physics loss.
- The performance of shape reconstruction by the F1 score between the prediction and ground truth is used. Each primitive in prediction is labeled as a true positive if its IoU with a ground-truth primitive is greater than 0.5.
- For physics estimations, they employed two types of metrics i) density measures top-k accuracy and root-mean-square error (RMSE) and ii) trajetory measure: mean-absolute error (MAE) between simulated trajectory (using predicted physical parameters) and ground-truth trajectory.
- Perform PCA on the point clouds and align models by their PCA axes.
- Use energy-based optimisation to fit the primitives from the point cloud, and assign each vertex to its nearest primitive and refine each primitive with the minium oriented bounding box of vertices assigned to it.
- Pre-train model on their PPD model due to there being less training examples.
- Purchased sets of blocks with different materials from Amazon.
- Block tower is placed at a specific position on the desk, and they used a copper ball (hung by a pendulum) to hit the tower.
- The appearance of every fram in RGB video is used to extract a 3D trajectory, by using 2D keypoint tracking and 3D pose reconstruction.
- For shape reconstruction, the model achieves 97.5 in terms of F1 score.
- For physics estimation, experiments suggest that appearance alone is not sufficient for density estimation.
- When combining with object trajectories, the estimation of the density is much better.
- 85.9 in terms of F1 score.
- The shape recontruction is not as good as that of the block towers dataset because the synthetic tools are more complicated, and orientations might introduce some ambiguity.
- The physics estimation is better since there are less primitives in the synthetic tool dataset.
- The model cannot effectively predict the physical parameters because images and object trajectories are much noiser than those in synthetic dataset.
- RMSE = 18.7 over the whole dataset and 10.1 over block towers with two blocks.
- 3D Warehouse: https://3dwarehouse.sketchup.com
- Use combination of multiple networks learning different information for CSGNet paper.
- Primitives
- Binvox
- Blender
- Bullet Physics Engine