XPENG announces further breakthrough in autonomous driving
06 Jan 2026
In collaboration with Peking University, XPENG has developed FastDriveVLA, a novel visual token pruning framework designed specifically for end-to-end autonomous driving Vision-Language-Action (VLA) models. It enables autonomous driving AI to "drive like a human" by focusing only on essential information.
The research has been accepted at the Association for the Advancement of Artificial Intelligence (AAAI) 2026 conference, one of the world's premier AI venues, which this year had a highly selective acceptance rate of just 17.6% (4,167 papers accepted out of 23,680 submissions).
As large AI models evolve rapidly, VLA models are being widely adopted in end-to-end autonomous driving systems thanks to their strong capabilities in complex scene understanding and action reasoning. These models encode camera images into large numbers of visual tokens, which serve as the foundation for the model to "see" the world and make driving decisions. However, processing that many tokens increases the computational load on board the vehicle, hurting inference speed and real-time performance.
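To make the token bottleneck concrete, the minimal sketch below (illustrative only, not XPENG's code; the image size, patch size, and embedding dimension are assumed values) shows how a ViT-style encoder turns a single camera image into visual tokens, and why the token count drives on-vehicle inference cost:

```python
# Illustrative sketch: how an image becomes visual tokens in a ViT-style
# encoder. Image size, patch size, and embedding dim are assumptions.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each patch to a token."""
    def __init__(self, patch_size=14, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (B, 3, H, W)
        feats = self.proj(x)                       # (B, dim, H/p, W/p)
        return feats.flatten(2).transpose(1, 2)    # (B, N, dim)

img = torch.randn(1, 3, 448, 448)                  # one camera frame (assumed size)
tokens = PatchEmbed()(img)
print(tokens.shape)                                # (1, 1024, 768): N = (448/14)^2 = 1024 tokens
# Self-attention cost grows faster than linearly with N, so multi-camera
# inputs producing thousands of tokens quickly dominate inference time.
```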
While visual token pruning has been recognised as a viable way to accelerate VLA inference, existing approaches, whether based on text-visual attention or on token similarity, have shown limitations in driving scenarios. To address this, FastDriveVLA takes inspiration from how human drivers focus on relevant foreground information (e.g., lanes, vehicles, pedestrians) while ignoring non-critical background areas.
The method introduces an adversarial foreground-background reconstruction strategy that strengthens the model's ability to identify and retain valuable tokens. On the nuScenes autonomous driving benchmark, it reduced the number of visual tokens from 3,249 to 812, cutting computational load by nearly 7.5x while maintaining high planning accuracy.
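The published framework trains its token scorer with the adversarial foreground-background reconstruction objective described above; the hedged sketch below illustrates only the pruning step itself. The plain learned relevance head and top-k selection are assumptions for illustration, not the paper's exact design:

```python
# Hedged sketch (not the published FastDriveVLA implementation): keep only the
# highest-scoring visual tokens, assuming the scorer has been trained to favour
# foreground content such as lanes, vehicles, and pedestrians.
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    def __init__(self, dim=768, keep=812):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # hypothetical learned relevance head
        self.keep = keep

    def forward(self, tokens):                             # tokens: (B, N, dim), e.g. N = 3249
        scores = self.scorer(tokens).squeeze(-1)           # (B, N) relevance per token
        idx = scores.topk(self.keep, dim=1).indices        # indices of the tokens to keep
        idx = idx.sort(dim=1).values                       # preserve original spatial order
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return torch.gather(tokens, 1, idx)                # (B, keep, dim)

pruned = TokenPruner()(torch.randn(1, 3249, 768))
print(pruned.shape)   # (1, 812, 768): 3,249 -> 812 tokens, as reported on nuScenes
```

Note that going from 3,249 to 812 tokens is roughly a 4x reduction in token count; because self-attention cost grows faster than linearly with the number of tokens, the overall compute saving can be larger, which is consistent with the reported figure of nearly 7.5x.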
This marks the second time this year that XPENG has been recognised at a top-tier global AI conference. In addition, at its Tech Day in November 2025, the brand unveiled its VLA 2.0 architecture, which removes the "language translation" step and enables direct Visual-to-Action generation, a breakthrough that redefines the conventional V-L-A pipeline.
These accomplishments reflect XPENG's full-stack in-house capabilities, from model architecture design and training to distillation and vehicle deployment. Looking ahead, XPENG says it remains committed to achieving Level 4 (L4) autonomous driving.
