Direct 3D Detection of Vehicles in Monocular Images with a CNN Based 3D Decoder

Michael Weber, David Michael Fürst, J. Marius Zöllner

In: IEEE Intelligent Vehicles Symposium (IV-2019). IEEE, 2019.


In autonomous driving, objects such as surrounding vehicles are usually detected in monocular RGB images by 2D bounding box detectors. The resulting 2D objects can be used for a first coarse 3D position estimate, but a precise location requires additional sensor data. For further use in sensor fusion systems and environment maps, it is preferable to detect objects, together with their orientation and dimensions, directly in 3D coordinates. To address this 3D object detection task, we propose a direct 3D bounding box estimator that is realized as a CNN decoder module and can be connected to most 2D object detectors, such as SSD [1], OverFeat [2], YOLO [3] and RetinaNet [4], or directly to CNN feature extractors such as VGG [2] and ResNet [5]. The 3D parameters of the objects, such as dimensions and orientation, are predicted directly by the CNN module. To train this complex MultiNet architecture successfully, we propose a combination and modification of current loss functions. The fastest of the proposed network module combinations detects objects in 3D camera coordinates at a frame rate of 28 fps.
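
To illustrate what a 3D bounding box in camera coordinates amounts to, the following minimal sketch builds the eight corners of a box from a center, dimensions and a single orientation angle, and projects them into the image with a pinhole model. It is not the decoder described above. The function names, the axis convention (x right, y down, z forward), the box parameterization and the intrinsic matrix `K` are all assumptions made for the example.

```python
import numpy as np


def box3d_corners(center, dims, yaw):
    """Return the 8 corners (8x3) of a 3D box in camera coordinates.

    Assumed convention (not taken from the paper): x right, y down,
    z forward; center is the box centroid; dims = (length, height, width);
    yaw is a rotation about the vertical (y) axis.
    """
    l, h, w = dims
    # Corner offsets in the box's own frame, centred at the origin.
    x = np.array([l, l, -l, -l, l, l, -l, -l]) / 2.0
    y = np.array([h, h, h, h, -h, -h, -h, -h]) / 2.0
    z = np.array([w, -w, -w, w, w, -w, -w, w]) / 2.0
    corners = np.stack([x, y, z], axis=1)

    # Rotate about the y axis, then translate to the box center.
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return corners @ rot_y.T + np.asarray(center, dtype=float)


def project_to_image(points, K):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates (pinhole model)."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


# Hypothetical intrinsics and a hypothetical vehicle 10 m in front of the camera.
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
corners = box3d_corners(center=(0.0, 0.0, 10.0), dims=(4.0, 1.5, 2.0), yaw=0.0)
pixels = project_to_image(corners, K)
```

Here `pixels` holds one image point per box corner, so the 2D extent of the projected box can be compared with a 2D detection of the same object.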

German Research Center for Artificial Intelligence