Panoptic Segmentation Enables Real-World Computer Vision in Autonomous Driving

maadaa.ai
Dec 28, 2022

KEYWORDS: #AI #ComputerVision #PanopticSegmentation #AutonomousDriving #Tesla #maadaa #RadarSensor #InstanceSegmentation #SemanticSegmentation

Just a few days ago, a news story reporting that “Tesla told the FCC that it plans to market a new radar starting next month” [1] went viral across the internet.

Source: electrek.co

Some believe it will be a huge improvement. However, others are concerned that Tesla may not achieve its promised self-driving capability with the current hardware.

Elon Musk believes that a human drives using vision alone, so a vehicle can also depend on vision only. In Musk’s words:

“When your vision works, it works better than the best human because it’s like having eight cameras, it’s like having eyes in the back of your head, besides your head, and has three eyes of different focal distances looking forward. This is — and processing it at a speed that is superhuman. There’s no question in my mind that with a pure vision solution, we can make a car that is dramatically safer than the average person.” [7]

Source: teslarati.com

However, as Electrek reported, Musk did not rule out radar sensors completely: he believes that a “very high-resolution radar” would help, but such a radar sensor did not exist at the time.

Interestingly enough, back in December of last year, Tesla’s (former) Director of Artificial Intelligence, Andrej Karpathy, shared that he was looking for people to help Tesla cluster together parts of an image or video that belong to the same object class and get them labeled. [2]

Source: mindy-support.com

This clustering of various parts of an image or video is a type of data annotation called panoptic segmentation.

Whether Tesla’s Autopilot ends up using pure vision or vision plus radar, panoptic segmentation, as one of the fundamental AI technologies, plays a pivotal role in the autonomous driving field.

Autonomous driving is a typical dense prediction domain in computer vision, involving semantic segmentation, instance segmentation, and panoptic segmentation.

In this article, let’s closely examine how panoptic segmentation works in the autonomous driving industry.

What is Panoptic Segmentation?

In brief, the task definition is simple: Each pixel of an image must be assigned a semantic label and an instance id. Pixels with the same label and id belong to the same object; for stuff labels, the instance id is ignored.
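To make the definition concrete, here is a minimal sketch of how a panoptic result can be represented. The integer-packing scheme below is a hypothetical convention (similar in spirit to the COCO panoptic format), not any dataset’s exact encoding:

```python
import numpy as np

# Hypothetical encoding: pixel_value = semantic_label * 1000 + instance_id.
LABEL_DIVISOR = 1000

def encode_panoptic(semantic: np.ndarray, instance: np.ndarray) -> np.ndarray:
    """Pack a semantic-label map and an instance-id map into one integer map."""
    return semantic * LABEL_DIVISOR + instance

def decode_panoptic(panoptic: np.ndarray):
    """Recover the semantic label and instance id of every pixel."""
    return panoptic // LABEL_DIVISOR, panoptic % LABEL_DIVISOR

# Toy 2x3 image: class 7 = "car" (a thing class), class 21 = "road" (stuff).
semantic = np.array([[7, 7, 21],
                     [7, 7, 21]])
# Two different cars get distinct ids; stuff pixels keep instance id 0,
# since the instance id is ignored for stuff labels.
instance = np.array([[1, 2, 0],
                     [1, 2, 0]])

panoptic = encode_panoptic(semantic, instance)
sem, inst = decode_panoptic(panoptic)
assert (sem == semantic).all() and (inst == instance).all()
```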

For a more formal and detailed treatment of panoptic segmentation, see Kirillov et al.’s original “Panoptic Segmentation” paper.

Why Panoptic Segmentation?

Panoptic segmentation is essential to the safety and accuracy of self-driving vehicles.

To construct an effective autonomous driving system, granular scene comprehension and enhanced scene perception are necessary.

The authors of [3] propose using panoptic segmentation to enhance sensor-fusion-based environment perception, which can benefit from the rich information panoptic segmentation provides.

With panoptic segmentation, images and video can be accurately parsed for both semantic information (which pixels belong to automobiles, pedestrians, or drivable space) and instance information (which pixels belong to this car versus that car).

Furthermore, separating the foreground from the background allows for better estimates of the distances between objects.

The AI-driven machine control system can use panoptic segmentation to distinguish and capture all of this information at once, in real time, in order to assess the situation and make quick, precise decisions when accelerating, braking, and steering.
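As an illustration, suppose the perception stack returns a panoptic id map plus per-segment metadata. The field names below (`id`, `category_id`, `isthing`) are an assumed schema modeled on common panoptic APIs, and the category ids are invented for the example:

```python
import numpy as np

def split_segments(panoptic: np.ndarray, segments_info: list):
    """Separate a panoptic id map into per-instance 'thing' masks and merged
    'stuff' masks. `segments_info` holds one dict per segment with 'id',
    'category_id', and 'isthing' keys (an assumed but typical schema)."""
    things, stuff = [], {}
    for seg in segments_info:
        mask = panoptic == seg["id"]
        if seg["isthing"]:
            things.append((seg["category_id"], mask))  # one mask per object
        else:
            # Stuff: the instance id is ignored, merge all pixels of the class.
            prev = stuff.get(seg["category_id"], np.zeros_like(mask))
            stuff[seg["category_id"]] = prev | mask
    return things, stuff

# Toy downstream check: flag a pedestrian inside a crude "path ahead" region
# (here simply the lower-central third of the image). Category id is invented.
PEDESTRIAN = 11

def pedestrian_ahead(things, height: int, width: int) -> bool:
    roi = np.zeros((height, width), dtype=bool)
    roi[height // 2:, width // 3: 2 * width // 3] = True
    return any(cat == PEDESTRIAN and (mask & roi).any() for cat, mask in things)
```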

For example, in [4] NVIDIA offers an efficient method for pixel-level semantic and instance segmentation of camera images using a single DNN capable of performing multiple tasks. This strategy allows the training of a DNN based on panoptic segmentation, which seeks to comprehend the scene holistically rather than piecemeal.
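The paper’s exact architecture isn’t reproduced here; the following PyTorch sketch only illustrates the shared-backbone, multi-head pattern such a multi-task DNN uses. The layer sizes and the offset-based instance head are illustrative assumptions (in the spirit of Panoptic-DeepLab), not NVIDIA’s actual network:

```python
import torch
import torch.nn as nn

class PanopticMultiTaskNet(nn.Module):
    """Illustrative multi-task network: a shared encoder feeds separate
    semantic and instance heads, so one forward pass serves both subtasks."""
    def __init__(self, num_classes: int = 19):
        super().__init__()
        self.backbone = nn.Sequential(  # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Per-pixel class scores for semantic segmentation.
        self.semantic_head = nn.Conv2d(64, num_classes, 1)
        # Per-pixel 2D offset to the owning instance's center, one common
        # way to make instances separable.
        self.instance_head = nn.Conv2d(64, 2, 1)

    def forward(self, x):
        feats = self.backbone(x)
        return self.semantic_head(feats), self.instance_head(feats)

net = PanopticMultiTaskNet()
sem_logits, offsets = net(torch.randn(1, 3, 128, 256))
print(sem_logits.shape, offsets.shape)  # (1, 19, 32, 64) and (1, 2, 32, 64)
```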

All of this is executed with the help of appropriate hardware, such as cameras and LiDAR sensors.

Meanwhile, on the hardware side, self-driving cars collect LiDAR data, which has boosted research on panoptic segmentation of LiDAR point clouds. For example, in [5] Stefano Gasperini et al. present Panoster, a panoptic segmentation method for LiDAR point clouds. As shown in the figure below, their method directly delivers instance IDs through a learning-based clustering solution, embedded in the model and optimized for pure, non-fragmented clusters. As [6] points out, it is important to understand both the semantic class of each point in a LiDAR sweep and which instance of that class it belongs to.

Results of LiDAR panoptic segmentation
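Panoster’s clustering is learned end-to-end inside the model, so the snippet below is only a simplified geometric stand-in, not its actual method: given per-point semantic predictions, a classic density-based clusterer (scikit-learn’s DBSCAN) groups the “thing” points of each class into instance IDs:

```python
import numpy as np
from sklearn.cluster import DBSCAN

THING_CLASSES = {1, 2}  # hypothetical class ids, e.g. car and pedestrian

def lidar_panoptic(points: np.ndarray, sem_pred: np.ndarray) -> np.ndarray:
    """Assign an instance id to every point of a LiDAR sweep.
    points: (N, 3) xyz coordinates; sem_pred: (N,) predicted class ids.
    Stuff points keep instance id 0; thing points of each class are
    grouped spatially (a simplification of Panoster's learned clustering)."""
    inst = np.zeros(len(points), dtype=np.int64)
    next_id = 1
    for cls in THING_CLASSES:
        idx = np.where(sem_pred == cls)[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(points[idx])
        for cluster in np.unique(labels):
            if cluster == -1:  # DBSCAN noise: leave as instance id 0
                continue
            inst[idx[labels == cluster]] = next_id
            next_id += 1
    return inst
```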

Sounds a bit difficult to understand?

Well, for example, it is quite common to encounter complex driving scenarios such as construction along the road, individuals in crowds, or vehicles in traffic jams.

Such road conditions have proven significantly challenging for AI systems, which must perceive objects’ full structures even when they are partially hidden.

In such cases, bounding boxes are not enough, since many objects do not fit neatly into boxes. That is why panoptic segmentation is so needed: it provides a more detailed understanding of complex driving scenarios.

Related open datasets

In order to train a panoptic segmentation model, we first need labeled training data.

In this section, we introduce openly available panoptic segmentation datasets for both 2D images and 3D point cloud data.

1. Mapillary Vistas Dataset

The Mapillary Vistas Dataset can be used for instance, semantic, and panoptic segmentation. It is a large-scale, traffic-related collection of segmented street-scene images and one of the more challenging benchmarks: 25,000 images split into 18,000 training, 2,000 validation, and 5,000 test images. It covers 65 classes in total, 28 of them “stuff” classes and 37 “thing” classes, with image sizes ranging from 1024 × 768 to 4000 × 6000. As shown in the figure below, the dataset contains diverse street-level images with pixel-accurate and instance-specific human annotations for understanding street scenes.

Mapillary Vistas Dataset

Panoptic-DeepLab achieves state-of-the-art performance on the Mapillary Vistas Dataset: with a SWideRNet backbone, it reaches 44.8 PQ on the validation set. Code can be found at https://github.com/google-research/deeplab2; notably, this repository is the official codebase for the DeepLab series of models, including Axial-DeepLab, Panoptic-DeepLab, ViP-DeepLab, etc.

Link: https://www.mapillary.com/dataset/vistas

2. Cityscapes Dataset

Cityscapes is the most widely used dataset for panoptic segmentation, focusing on semantic understanding of urban street scenes. It collects street views from 50 cities over a span of several months. The dataset contains 5,000 images of egocentric driving scenes in urban environments, split into 2,975 training, 500 validation, and 1,525 test images. It has 19 classes with dense pixel annotations, 8 of which also carry instance-level masks.

Cityscapes

PanopticFCN achieves 61.4 PQ on this dataset with a single-path framework; code can be found at https://github.com/dvlab-research/PanopticFCN.
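The PQ (panoptic quality) numbers quoted above come from the standard metric for this task: predicted and ground-truth segments are matched when their IoU exceeds 0.5, and PQ = (sum of matched IoUs) / (TP + 0.5·FP + 0.5·FN). Here is a minimal sketch of the final computation, assuming the matching step is already done:

```python
def panoptic_quality(matched_ious, num_pred, num_gt):
    """Compute PQ from already-matched segment pairs.
    matched_ious: IoU values of matched (IoU > 0.5) pred/gt segment pairs;
    num_pred / num_gt: total predicted and ground-truth segments.
    PQ = sum(IoU of TPs) / (|TP| + 0.5*|FP| + 0.5*|FN|)."""
    tp = len(matched_ious)
    fp = num_pred - tp  # unmatched predictions
    fn = num_gt - tp    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0

# Two matched segments with IoUs 0.9 and 0.7, one spurious prediction,
# one missed ground-truth segment:
print(panoptic_quality([0.9, 0.7], num_pred=3, num_gt=3))  # ~0.533
```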

Link: https://www.cityscapes-dataset.com/

3. Indian Driving Dataset

The Indian Driving Dataset (IDD) proposes a novel dataset for road-scene understanding in unstructured environments. Unlike other urban scene-understanding datasets, IDD consists of scenes without well-delineated infrastructure such as lanes and sidewalks. As a result, each scene contains significantly more “thing” instances than in other datasets, with only a small number of well-defined categories for traffic participants.

It consists of 10,000 images, finely annotated with 34 classes, collected from 182 drive sequences on Indian roads. The label set is expanded compared to popular benchmarks such as Cityscapes to account for the new classes. The images were obtained from a front-facing camera attached to a car driven around Hyderabad, Bangalore, and their outskirts. Most images are 1080p, with some at 720p and other resolutions.

Link: https://idd.insaan.iiit.ac.in/

4. BDD100K Panoptic Segmentation

BDD100K is a large driving video dataset captured in different cities in the US. It consists of 100,000 videos of roughly 40 seconds each, of which 10,000 have pixel-wise annotations. The annotations use 10 thing categories (mainly for non-stationary objects) and 30 stuff categories.

Link: https://doc.bdd100k.com/download.html

5. SemanticKITTI Panoptic Segmentation

SemanticKITTI is a dataset of LiDAR sequences of street scenes in Karlsruhe, Germany. It contains 11 driving sequences with panoptic segmentation labels, using 6 thing and 16 stuff categories.

Link: http://www.semantic-kitti.org/dataset.html#download
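SemanticKITTI stores its panoptic labels compactly: each point’s label is a single 32-bit unsigned integer, with the semantic class in the lower 16 bits and the instance id in the upper 16 bits (as documented by the dataset’s API). A minimal reader might look like this (the path in the usage comment is illustrative):

```python
import numpy as np

def read_semantickitti_labels(path: str):
    """Read a SemanticKITTI .label file: one uint32 per LiDAR point,
    lower 16 bits = semantic class id, upper 16 bits = instance id."""
    raw = np.fromfile(path, dtype=np.uint32)
    semantic = raw & 0xFFFF  # class of each point
    instance = raw >> 16     # instance id within that class
    return semantic, instance

# Usage:
# sem, inst = read_semantickitti_labels("sequences/08/labels/000000.label")
```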

6. nuScenes-lidarseg

nuScenes is a large-scale autonomous driving dataset. It consists of 1,000 scenes of 20 seconds each, recorded on urban streets in Singapore and Boston. The dataset includes point clouds captured by a LiDAR sensor as well as synchronized camera data. The nuScenes-lidarseg annotations use 23 thing and 9 stuff classes.

Link: https://www.nuscenes.org/nuscenes#download

Reference List:

[1] https://electrek.co/2022/12/06/tesla-radar-car-next-month-self-driving-suite-concerns/

[2] https://mindy-support.com/news-post/using-panoptic-segmentation-to-train-autonomous-vehicles/

[3] Multi-task Network for Panoptic Segmentation in Automated Driving

[4] Pixel-Perfect Perception: How AI Helps Autonomous Vehicles See Outside the Box

[5] Panoster: End-to-end Panoptic Segmentation of LiDAR Point Clouds

[6] LiDAR Panoptic Segmentation for Autonomous Driving

[7] https://www.teslarati.com/tesla-commits-pure-vision-approach-model-s-model-x-no-radar/


maadaa.ai

maadaa.ai is committed to providing professional, agile and secure data products and services to the global AI industry.