BEV+Transformer: Next Generation of Autonomous Vehicles and Data Challenges?
KEYWORDS: Autonomous Driving Industry, AI Technology in Autonomous Driving, BEV (Bird’s Eye-View) in Autonomous Vehicles, Data Labeling in Autonomous Vehicles, Challenges of Data Annotation in Autonomous Vehicles.
The autonomous driving industry is undoubtedly one of the fields where AI has brought revolutionary innovation. Autonomous driving technology is now entering a new generation of technical frameworks centered on BEV (Bird's Eye View) and the Transformer, bringing a leap forward in perception and generalization capabilities.
However, training a working BEV model requires a large amount of data acquisition and preprocessing, which has a crucial impact on the performance of scene perception.
And given such massive data volumes, how can the quality of data labeling be guaranteed? This is an issue worth exploring in depth. In this article, we analyze the development of autonomous driving algorithms, the advantages of BEV+Transformer, the importance and challenges of data annotation, and possible solutions.
1. BEV+Transformer Improves Autonomous Driving Perception and Generalization Ability
The autonomous driving algorithm stack can be divided into three parts: perception, decision-making, and planning/control. Among them, the perception module is the key component, and it has gone through several model iterations: CNN (2011–2016), RNN+GAN (2016–2018), BEV (2018–2020), Transformer+BEV (2020 to present), and Occupancy Network (2022 to present).
At present, BEV+Transformer has become one of the mainstream approaches. Automated driving solutions that emphasize perception and rely on lightweight maps are opening a new chapter for the industry.
BEV stands for Bird's Eye View. It presents vehicle and scene information from a top-down perspective and serves as a cross-camera, multimodal fusion representation in autonomous driving systems. Its core idea is to turn traditional 2D image perception into 3D perception: the model takes 2D images as input and outputs results in a 3D frame. An optimal feature representation is obtained across the multi-camera views, allowing the vehicle to reason about its relationship to the surrounding space.
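As a rough illustration of how 2D camera observations relate to a bird's-eye-view grid, the Python sketch below projects BEV grid cells back into a single camera image using assumed intrinsic and extrinsic matrices. This is a minimal geometric sketch, not the pipeline of any particular vendor, and all calibration values are made up for the example.

```python
import numpy as np

def bev_grid_points(x_range=(-50, 50), y_range=(-50, 50), resolution=0.5):
    """Centers of BEV grid cells on the ground plane (z = 0), in ego coordinates."""
    xs = np.arange(x_range[0], x_range[1], resolution)
    ys = np.arange(y_range[0], y_range[1], resolution)
    xx, yy = np.meshgrid(xs, ys, indexing="ij")
    zz = np.zeros_like(xx)
    return np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)  # (N, 3)

def project_to_image(points_ego, K, T_cam_from_ego):
    """Project ego-frame 3D points into pixel coordinates of one camera."""
    pts_h = np.concatenate([points_ego, np.ones((len(points_ego), 1))], axis=1)
    pts_cam = (T_cam_from_ego @ pts_h.T).T[:, :3]        # ego frame -> camera frame
    in_front = pts_cam[:, 2] > 0.1                        # keep points ahead of the camera
    uvw = (K @ pts_cam.T).T
    depth = np.clip(uvw[:, 2:3], 1e-6, None)              # avoid division by zero
    uv = uvw[:, :2] / depth                                # perspective division
    return uv, in_front

# Assumed example calibration (illustrative values, not from any real vehicle):
# a forward-facing camera mounted 1.5 m above the ground.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],    # ego y (left) -> camera -x
              [0.0, 0.0, -1.0],    # ego z (up)   -> camera -y
              [1.0, 0.0, 0.0]])    # ego x (fwd)  -> camera  z
t_cam_in_ego = np.array([0.0, 0.0, 1.5])
T_cam_from_ego = np.eye(4)
T_cam_from_ego[:3, :3] = R
T_cam_from_ego[:3, 3] = -R @ t_cam_in_ego

bev_points = bev_grid_points()
uv, valid = project_to_image(bev_points, K, T_cam_from_ego)
print(valid.sum(), "of", len(bev_points), "BEV cells fall in front of this camera")
```

In a full system, the pixel coordinates returned here would be used to sample image features into each BEV cell, one camera at a time, before the cameras are fused.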
BEV offers two main advantages.
First, autonomous driving is fundamentally a 3D, bird's-eye-view perception problem. The BEV perspective provides more comprehensive scene information, which helps the vehicle perceive its surroundings and make accurate decisions.
Second, it facilitates multimodal fusion. Autonomous driving systems usually use a variety of sensors, such as cameras, lidar, and millimeter-wave radar. The BEV perspective expresses the data from different sensors uniformly on the same plane, which makes fusing and processing sensor data much more convenient.
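To make the "same plane" idea concrete, here is a toy sketch that fuses a camera-derived BEV feature map with a lidar-derived one simply by concatenating them channel-wise on a shared grid. The grid size, channel counts, and the concatenation-plus-convolution fusion are assumptions chosen for clarity, not a description of any production system.

```python
import torch
import torch.nn as nn

class NaiveBEVFusion(nn.Module):
    """Fuse per-sensor BEV feature maps that share the same ground-plane grid."""
    def __init__(self, cam_channels=64, lidar_channels=32, out_channels=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # Both inputs are (batch, channels, H_bev, W_bev) on the same grid, so fusion
        # reduces to channel-wise concatenation followed by a small conv block.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

# Example: a 200 x 200 BEV grid (e.g. 100 m x 100 m at 0.5 m resolution).
cam_bev = torch.randn(1, 64, 200, 200)
lidar_bev = torch.randn(1, 32, 200, 200)
fused = NaiveBEVFusion()(cam_bev, lidar_bev)
print(fused.shape)  # torch.Size([1, 128, 200, 200])
```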
The Transformer is a deep learning model built on the self-attention mechanism, and its global attention makes it well suited to view transformation: every position in the target domain attends to every position in the source domain at the same "distance", overcoming the local receptive-field limitation of convolutional layers in CNNs.
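A minimal sketch of this idea, loosely in the spirit of cross-attention-based BEV methods: a set of learnable BEV queries attends to flattened image features from all cameras, so each BEV cell can draw on any image location regardless of distance. The dimensions and the use of PyTorch's built-in MultiheadAttention are illustrative assumptions, not any company's actual architecture.

```python
import torch
import torch.nn as nn

class ImageToBEVCrossAttention(nn.Module):
    """BEV queries attend globally to multi-camera image features."""
    def __init__(self, embed_dim=256, num_heads=8, bev_h=50, bev_w=50):
        super().__init__()
        # One learnable query per BEV cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, image_features):
        # image_features: (batch, num_tokens, embed_dim), where num_tokens is the
        # flattened feature positions from all cameras concatenated together.
        batch = image_features.shape[0]
        queries = self.bev_queries.unsqueeze(0).expand(batch, -1, -1)
        bev_features, _ = self.cross_attn(query=queries,
                                          key=image_features,
                                          value=image_features)
        return bev_features  # (batch, bev_h * bev_w, embed_dim)

# Example: 6 cameras, each producing a 20 x 30 feature map with 256 channels.
tokens = torch.randn(1, 6 * 20 * 30, 256)
bev = ImageToBEVCrossAttention()(tokens)
print(bev.shape)  # torch.Size([1, 2500, 256])
```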
Combining the two makes full use of the spatial information that BEV provides and the Transformer's ability to model heterogeneous data from multiple sources, yielding more accurate environmental perception, longer-horizon motion planning, and more global decision-making.
Tesla was the first company in the industry to use BEV+Transformer for visual perception tasks. At Tesla AI Day, Tesla revealed many of the complex inner workings of the neural networks that power FSD. One of the most interesting building blocks is called "Image-to-BEV Transformation + Multi-Camera Fusion". At its center is a Transformer module, or more specifically, a cross-attention module. This breakthrough allowed Musk to confidently declare that Tesla's perception relies not on lidar or millimeter-wave radar but on pure vision to obtain accurate three-dimensional world information.
Following this practice, many autonomous vehicle companies and automotive suppliers have begun to experiment with BEV+Transformer. Representative Chinese companies include NIO, Lixiang, Xpeng, Baidu, Horizon, Haomo.ai, and others.
2. How can the quality of data annotation be guaranteed at massive data scale?
Although BEV+Transformer has become the mainstream direction for autonomous driving algorithms, the prerequisite is a working BEV model. Training one requires a large amount of data acquisition and pre-processing, which has a significant impact on the performance of scene perception.
According to Tesla in 2022, the FSD model built on BEV+Transformer has 1 billion parameters, about 10 times as many as the previous version of the model.
“In order to build Autonomy, we obviously need to train our neural net with data from millions of vehicles,” Musk said at Tesla’s Q2 earnings call in July. He explained: “This has been proven over and over again, the more training data you have, the better the results. ”
Musk added: “It barely works at 2 million. It’s — it’s slightly worse at 3 million. It’s like, wow, OK, we’re seeing something. But then, you get to, like, 10 million training examples, it’s, like, it becomes incredible.”
But data alone isn't enough. Training a BEV model is still a huge systems-engineering effort, and one of the key steps is labeling the data well. The quality of the data labeling directly determines the quality of the model.
In addition, given the complexity of roads around the world, especially in Asian cities, the variability of weather conditions, and other factors, the autonomous driving industry faces large and highly diverse labeling targets. Ensuring high-quality data labeling, and completing it within a relatively short period of time, has become an urgent problem for autonomous driving companies to solve.
To help the autonomous driving industry solve this problem, maadaa.ai has launched the MaidX Auto-4D platform after years of R&D and testing. It is a "human-in-the-loop" auto-annotation platform that supports BEV multi-sensor fusion data.
Based on years of experience in the automated driving field, maadaa.ai has found that the platform can reduce the cost of manual labeling by 90% or more.
For example, with purely manual data production, one hour of driving data requires 3,000 to 5,000 man-hours of labeling. With MaidX Auto-4D, the same hour of data can be reduced to 200 to 500 man-hours.
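Taking the midpoints of those ranges as a rough sanity check of the claimed savings:

```python
manual_hours = 4000      # midpoint of the 3,000-5,000 man-hour range
assisted_hours = 350     # midpoint of the 200-500 man-hour range
reduction = 1 - assisted_hours / manual_hours
print(f"{reduction:.0%} reduction in man-hours")  # roughly 91% reduction
```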
The "human-in-the-loop" data labeling process consists of Data Quality Check, Data Cleansing, Automatic Labeling (approximately 95%), Manual Inspection and Correction, and Ground Truth generation.
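As a purely hypothetical sketch of how such a human-in-the-loop sequence could be wired together in code (the stage names follow the list above, but every type and function here is a placeholder, not part of maadaa.ai's actual API):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    frame_id: int
    blurred: bool = False

@dataclass
class Label:
    frame_id: int
    verified: bool = False

# Placeholder stages; real implementations would call the annotation engine and review UI.
def passes_quality_check(frame: Frame) -> bool:
    return not frame.blurred                      # Data Quality Check

def cleanse(frame: Frame) -> Frame:
    return frame                                  # Data Cleansing (no-op placeholder)

def auto_label(frame: Frame) -> Label:
    return Label(frame.frame_id)                  # Automatic Labeling (~95% of the work)

def human_review(label: Label) -> Label:
    label.verified = True                         # Manual Inspection and Correction
    return label

def human_in_the_loop_pipeline(raw_frames):
    checked = [f for f in raw_frames if passes_quality_check(f)]
    cleaned = [cleanse(f) for f in checked]
    auto_labels = [auto_label(f) for f in cleaned]
    reviewed = [human_review(lbl) for lbl in auto_labels]
    return [lbl for lbl in reviewed if lbl.verified]   # Ground Truth

ground_truth = human_in_the_loop_pipeline([Frame(0), Frame(1, blurred=True), Frame(2)])
print(len(ground_truth), "frames exported as ground truth")
```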
The benefits of the MaidX Auto-4D platform are:
1. Auto-Annotation Engine supporting BEV Multi-Sensor Fusion Data.
2. Support data annotation for multi-sensor fusion under BEV.
3. Omni-directional target object attributes (ID, dimension, position, classification, velocity, acceleration, occlusion relationship, visibility, trajectory, lane assignment, etc.)
4. Forward/backward tracking, which improves detection accuracy for the current frame by using both historical and future information (see the sketch after this list).
5. Deep learning models supporting multi-tasking (Segmentation, Object Detection, etc.)
6. Reduces manual labeling costs by 90% or more.
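The sketch below illustrates the forward/backward idea from item 4 with a deliberately simple example: smoothing a noisy object trajectory with a centered window (using past and future frames) versus a causal window (past frames only). The constant-velocity toy data, window size, and averaging scheme are assumptions, not the platform's actual tracker.

```python
import numpy as np

def causal_smooth(positions, window=5):
    """Online smoothing: each frame only uses historical frames."""
    out = np.empty_like(positions, dtype=float)
    for t in range(len(positions)):
        out[t] = positions[max(0, t - window + 1): t + 1].mean()
    return out

def forward_backward_smooth(positions, window=5):
    """Offline smoothing: each frame uses both historical and future frames."""
    half = window // 2
    out = np.empty_like(positions, dtype=float)
    for t in range(len(positions)):
        out[t] = positions[max(0, t - half): t + half + 1].mean()
    return out

# Toy example: an object moving at constant velocity, observed with noise.
rng = np.random.default_rng(0)
true_positions = np.linspace(0.0, 20.0, 40)
observed = true_positions + rng.normal(0.0, 0.5, size=40)

causal_err = np.abs(causal_smooth(observed) - true_positions).mean()
offline_err = np.abs(forward_backward_smooth(observed) - true_positions).mean()
print(f"causal error {causal_err:.3f} m vs forward/backward error {offline_err:.3f} m")
```

Because the centered window has no lag on a moving object, the forward/backward estimate tracks the true trajectory more closely than the online one, which is the intuition behind using future frames when annotating recorded data offline.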
The MaidX Auto-4D platform is a one-stop solution for efficient, high-quality annotation of massive and diverse data. While ensuring quality, it also shortens annotation time, reduces annotation cost, and improves labeling efficiency by 3 to 4 times.
For more information, please visit: https://maadaa.ai/Industries/Autonomous_Driving