Abstract: In this work, we research and evaluate end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth. The network learns to predict four classes, as well as a camera-to-bird's-eye-view mapping, which is shown to be more robust than using an inertial measurement unit (IMU)-aided flat-plane assumption. At its core, it utilizes a variational encoder-decoder network that encodes the front-view visual information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian coordinate system. It is demonstrated that the network learns to be invariant to pitch and roll perturbations of the camera view without requiring IMU data. The evaluations on Cityscapes show that our end-to-end learning of semantic-metric occupancy grids achieves 72.1% frequency-weighted IoU, compared to 60.2% when using an IMU-aided flat-plane assumption. Furthermore, our network achieves real-time inference rates of approximately 35 Hz for an input image with a resolution of 256×512 pixels and an output map with 64×64 occupancy grid cells on a Titan V GPU.
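The data flow described above can be sketched minimally: a 256×512 front-view image is encoded into a latent distribution, a sample is drawn via the standard VAE reparameterization trick, and the decoder emits a 64×64 grid with a 4-class distribution per cell. This is a shape-level NumPy sketch under assumed dimensions; the stand-in linear maps and function names (`encode`, `decode`) are hypothetical, as the paper's actual network is convolutional.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, latent_dim=128):
    # Hypothetical encoder stand-in: global average pooling plus random
    # linear projections to the latent mean and log-variance.
    feat = image.mean(axis=(0, 1))  # pooled front-view features, shape (3,)
    w_mu = rng.standard_normal((latent_dim, feat.size))
    w_lv = rng.standard_normal((latent_dim, feat.size))
    return w_mu @ feat, w_lv @ feat

def reparameterize(mu, logvar):
    # z = mu + sigma * eps: the standard VAE sampling trick, which keeps
    # the sampling step differentiable during training.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, grid=64, classes=4):
    # Hypothetical decoder stand-in: linear map from the latent code to
    # per-cell class logits, then a softmax over the class axis.
    w = rng.standard_normal((grid * grid * classes, z.size)) * 0.01
    logits = (w @ z).reshape(grid, grid, classes)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

image = rng.random((256, 512, 3))              # front-view RGB input
mu, logvar = encode(image)
occupancy = decode(reparameterize(mu, logvar)) # top-view semantic grid
print(occupancy.shape)                         # (64, 64, 4)
```

Each output cell holds a proper probability distribution over the four semantic classes, so the map can be thresholded or argmax-ed downstream.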