V2XPnP
Vehicle-to-Everything (V2X) Spatio-Temporal Fusion
for Multi-Agent Perception and Prediction
What is V2XPnP?
V2XPnP is the first open-source V2X spatio-temporal fusion framework for cooperative perception and prediction.
- A novel intermediate fusion model (what to transmit) within one-step communication (when to transmit).
- A unified Transformer architecture that integrates diverse attention fusion modules for V2X spatio-temporal information (how to fuse).
- Comprehensive benchmarks and open-source code for multi-agent perception and prediction.
V2XPnP Sequential Dataset is the first large-scale, real-world V2X sequential dataset featuring multiple agents and all V2X collaboration modes (VC, IC, V2V, I2I).
- Multiple connected and automated agents: two vehicles and two infrastructure units.
- Multi-modal sequential sensor data (40k LiDAR frames and 208k camera frames), object trajectories (136 objects per scene and 10 object types), and map data (PCD map and vector map) across 24 intersections.
V2XPnP Fusion Framework
The cooperative temporal perception and prediction task requires integrating temporal information across historical frames and spatial information from multiple agents, which can be summarized as three questions: (1) What information to transmit? (2) When to transmit it? (3) How to fuse information across the spatial and temporal dimensions?
What to Transmit
- Early Fusion: Raw historical perception data
- Late Fusion: Detected results at each historical frame
- Intermediate Fusion: Intermediate spatio-temporal features
When to Transmit
- Multi-step Communication: Share the current frame's data in each transmission, and obtain the complete history through multiple transmissions.
- One-step Communication: Share all historical data within a single transmission.
We advocate Intermediate Fusion within One-step Communication because it effectively balances the trade-off between accuracy and transmission load. Moreover, its ability to transmit intermediate spatio-temporal features makes it well suited for end-to-end perception and prediction, allowing features to be shared across multiple tasks and reducing computational demand.
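To make the distinction concrete, here is a minimal Python sketch of the two communication strategies; the names (`Packet`, `one_step_communication`, the temporal encoder) are entirely hypothetical and not part of the released code. Multi-step communication sends one per-frame payload per transmission, while one-step communication compresses the history into a single intermediate spatio-temporal feature before sending.

```python
# Minimal sketch (not the official API) contrasting the two communication strategies.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np


@dataclass
class Packet:
    agent_id: int
    payload: np.ndarray  # what is actually sent over V2X


def multi_step_communication(agent_id: int, current_frame_feature: np.ndarray) -> Packet:
    """Multi-step: each transmission carries only the current frame;
    the receiver must accumulate T transmissions to recover the history."""
    return Packet(agent_id, current_frame_feature)


def one_step_communication(agent_id: int, history_features: List[np.ndarray],
                           temporal_encoder: Callable[[np.ndarray], np.ndarray]) -> Packet:
    """One-step: the full history is first compressed into one intermediate
    spatio-temporal feature, then sent in a single transmission."""
    fused = temporal_encoder(np.stack(history_features))  # (T, C, H, W) -> (C, H, W)
    return Packet(agent_id, fused)


if __name__ == "__main__":
    history = [np.zeros((64, 32, 32), dtype=np.float32) for _ in range(4)]
    pkt = one_step_communication(0, history, lambda h: h.mean(axis=0))
    print(pkt.payload.shape)  # (64, 32, 32): one payload instead of four
```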
V2XPnP leverages a Unified Transformer Structure for effective spatio-temporal fusion, comprising Temporal Attention, Self-spatial Attention, Multi-agent Spatial Attention, and Map Attention. Each agent first extracts its inter-frame and self-spatial features, which can support single-vehicle perception and prediction while reducing the communication load; the multi-agent spatial attention module then fuses the single-agent features across agents.
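The attention ordering described above can be illustrated with a minimal PyTorch-style sketch. The module composition, tensor shapes, and token layout below are assumptions for illustration only, not the released implementation.

```python
# Sketch of the fusion ordering: temporal -> self-spatial -> multi-agent spatial -> map attention.
import torch
import torch.nn as nn


class SpatioTemporalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.multi_agent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.map_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def encode_single_agent(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (T, N, C) per-frame BEV tokens of one agent (T frames, N tokens).
        Returns (N, C) spatio-temporal tokens computed on the sender side,
        so only the fused feature needs to be transmitted."""
        # Temporal attention: each spatial token attends over its own history.
        x = feats.permute(1, 0, 2)                    # (N, T, C)
        x, _ = self.temporal_attn(x, x, x)
        x = x[:, -1]                                  # keep current-frame query, (N, C)
        # Self-spatial attention within the agent's own BEV feature.
        x, _ = self.self_spatial_attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))
        return x.squeeze(0)                           # (N, C)

    def fuse(self, ego: torch.Tensor, others: torch.Tensor, map_tokens: torch.Tensor):
        """ego: (N, C) ego tokens; others: (M, C) tokens received from other agents;
        map_tokens: (K, C) vector-map tokens."""
        q = ego.unsqueeze(0)
        kv = torch.cat([ego, others]).unsqueeze(0)
        x, _ = self.multi_agent_attn(q, kv, kv)       # multi-agent spatial attention
        x, _ = self.map_attn(x, map_tokens.unsqueeze(0), map_tokens.unsqueeze(0))
        return x.squeeze(0)                           # (N, C) fused feature
```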
V2XPnP Sequential Dataset
Our dataset comprises 100 scenarios and supports all collaboration modes: vehicle-centric (VC), infrastructure-centric (IC), vehicle-to-vehicle (V2V), and infrastructure-to-infrastructure (I2I).
Data Annotation and Sequential Processing
- 3D bounding boxes are annotated with SUSTechPOINTS by expert annotators, with eight rounds of review and revision.
- A V2X sequential data processing pipeline tracks objects across time and across different agents' views based on the multi-agent spatio-temporal graph (a toy illustration of the idea follows below).
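As a toy illustration of the spatio-temporal graph idea referenced above (not the actual pipeline), the sketch below links boxes with temporal edges (adjacent frames, same agent) and spatial edges (same frame, different agents) and treats connected components as global track IDs; the center-distance association rule and threshold are placeholders.

```python
# Toy sketch: association edges across time and agents, connected components as global IDs.
import numpy as np
from itertools import combinations


def link_boxes(boxes, dist_thresh=2.0):
    """boxes: list of dicts with 'agent', 'frame', 'center' (np.ndarray of shape (2,)).
    Returns a global track ID per box via union-find over association edges."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in combinations(range(len(boxes)), 2):
        a, b = boxes[i], boxes[j]
        spatial_edge = a["frame"] == b["frame"] and a["agent"] != b["agent"]
        temporal_edge = a["agent"] == b["agent"] and abs(a["frame"] - b["frame"]) == 1
        if (spatial_edge or temporal_edge) and \
                np.linalg.norm(a["center"] - b["center"]) < dist_thresh:
            union(i, j)  # add an edge in the spatio-temporal graph

    return [find(i) for i in range(len(boxes))]
```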
Trajectory and Map Generation
- A ground-truth trajectory dataset derived from all agents' data, together with a trajectory retrieval module that returns the observable trajectories of surrounding objects based on their actual visibility relationships (see the sketch after this list).
- LiDAR sequences are fused to form the PCD map, and the vector map is built in RoadRunner; the final map follows the Waymo format.
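A hypothetical sketch of what the trajectory retrieval step could look like: given full ground-truth tracks and per-frame visibility for each agent, it returns only the observed segments. The dictionary layouts and the function name are assumptions, not the released dataset API.

```python
# Hypothetical visibility-based trajectory retrieval (data structures are assumed).
from collections import defaultdict


def retrieve_observable_trajectories(gt_trajectories, visibility):
    """gt_trajectories: {obj_id: [(frame, x, y), ...]} full ground-truth tracks.
    visibility: {(agent_id, frame): set(obj_id)} objects actually seen by an agent.
    Returns {agent_id: {obj_id: [(frame, x, y), ...]}} observable trajectories."""
    observable = defaultdict(lambda: defaultdict(list))
    for (agent_id, frame), visible_ids in visibility.items():
        for obj_id in visible_ids:
            for f, x, y in gt_trajectories.get(obj_id, []):
                if f == frame:
                    observable[agent_id][obj_id].append((f, x, y))
    return observable
```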
Qualitative Results
Qualitative results of cooperative perception and prediction across different fusion models in sample VC and IC scenarios. Our V2XPnP model shows more accurate perception and prediction results.
Benchmark
VC Cooperative Perception and Prediction Benchmark
Method | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
No Fusion | ✓ | | 43.9 | 1.87 | 3.24 | 33.8 | 24.3 |
No Fusion-FnF* | ✓ | | 53.4 | 1.55 | 2.81 | 34.3 | 31.6 |
Late Fusion | ✓ | | 58.1 | 1.59 | 2.82 | 32.4 | 33.0 |
Early Fusion | ✓ | ✓ | 60.3 | 1.37 | 2.49 | 33.8 | 36.7 |
V2VNet* | ✓ | | 48.6 | 2.10 | 3.75 | 42.3 | 25.3 |
F-Cooper* | ✓ | ✓ | 66.0 | 1.35 | 2.56 | 36.1 | 38.7 |
V2XPnP (Ours) | ✓ | ✓ | 71.6 | 1.35 | 2.36 | 31.7 | 48.2 |
IC Cooperative Perception and Prediction Benchmark
Method | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
No Fusion | ✓ | | 46.4 | 1.69 | 3.06 | 36.2 | 28.8 |
No Fusion-FnF* | ✓ | | 56.7 | 1.34 | 2.65 | 41.4 | 31.7 |
Late Fusion | ✓ | | 55.9 | 1.39 | 2.44 | 30.1 | 32.9 |
Early Fusion | ✓ | ✓ | 60.5 | 1.39 | 2.63 | 32.8 | 39.5 |
V2VNet* | ✓ | | 33.6 | 1.95 | 3.53 | 44.2 | 16.3 |
F-Cooper* | ✓ | ✓ | 60.2 | 1.21 | 2.32 | 36.3 | 36.3 |
V2XPnP (Ours) | ✓ | ✓ | 71.0 | 1.18 | 2.16 | 34.0 | 46.0 |
V2V Cooperative Perception and Prediction Benchmark
Method | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
No Fusion | ✓ | | 40.8 | 1.99 | 3.38 | 34.0 | 19.8 |
No Fusion-FnF* | ✓ | | 51.9 | 1.67 | 3.12 | 39.3 | 27.5 |
Late Fusion | ✓ | | 55.3 | 1.75 | 3.07 | 34.0 | 30.5 |
Early Fusion | ✓ | ✓ | 53.0 | 1.64 | 3.11 | 40.2 | 26.9 |
V2VNet* | ✓ | | 43.1 | 3.10 | 5.55 | 46.8 | 19.4 |
F-Cooper* | ✓ | ✓ | 60.2 | 1.69 | 3.22 | 41.1 | 34.4 |
V2XPnP (Ours) | ✓ | ✓ | 70.5 | 1.78 | 3.28 | 39.9 | 40.6 |
I2I Cooperative Perception and Prediction Benchmark
Method | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
No Fusion | ✓ | | 51.0 | 1.69 | 3.06 | 36.2 | 31.7 |
No Fusion-FnF* | ✓ | | 56.6 | 1.34 | 2.65 | 41.4 | 31.7 |
Late Fusion | ✓ | | 61.3 | 1.41 | 2.50 | 30.0 | 41.6 |
Early Fusion | ✓ | ✓ | 64.6 | 1.57 | 2.98 | 39.9 | 37.7 |
V2VNet* | ✓ | | 41.1 | 1.83 | 3.34 | 40.4 | 23.2 |
F-Cooper* | ✓ | ✓ | 58.6 | 1.34 | 2.58 | 40.0 | 33.6 |
V2XPnP (Ours) | ✓ | ✓ | 69.2 | 1.26 | 2.31 | 36.5 | 42.8 |
Cooperative Temporal Perception Benchmark
Dataset | No Fusion (No Temp) | No Fusion (FaF*) | No Fusion (V2XPnP) | Early Fusion (No Temp) | Early Fusion (FaF*) | Early Fusion (V2XPnP) | Intermediate Fusion (No Temp) | Intermediate Fusion (FaF*) | Intermediate Fusion (V2XPnP) |
VC | 43.9 | 57.1 | 60.3 | 63.5 | 67.0 | 71.0 | 65.1 | 70.3 | 74.0 |
IC | 46.4 | 61.1 | 64.7 | 61.0 | 65.5 | 71.4 | 61.1 | 67.1 | 73.2 |
V2V | 40.8 | 53.7 | 59.1 | 54.9 | 56.4 | 66.6 | 58.0 | 61.4 | 69.4 |
I2I | 51.0 | 61.2 | 64.7 | 63.4 | 66.0 | 71.6 | 58.5 | 62.9 | 72.4 |
All metrics in this table are AP@0.5 (%).
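For reference, the sketch below shows how AP@0.5 is commonly computed for detection: detections are matched greedily to ground truth in descending confidence order at an IoU threshold of 0.5, and AP is the area under the resulting precision-recall curve. This is a simplified, axis-aligned 2D version with no precision interpolation, whereas the benchmark evaluates 3D/BEV boxes.

```python
# Simplified AP@0.5 computation for axis-aligned 2D boxes (illustrative only).
import numpy as np


def iou_2d(a, b):
    """a, b: [x1, y1, x2, y2] axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def average_precision(dets, scores, gts, iou_thresh=0.5):
    """dets: (N, 4) boxes, scores: (N,) confidences, gts: (M, 4) boxes."""
    order = np.argsort(-np.asarray(scores))          # descending confidence
    matched = set()
    tp = np.zeros(len(dets))
    for rank, i in enumerate(order):
        ious = [iou_2d(dets[i], g) if j not in matched else 0.0
                for j, g in enumerate(gts)]
        j = int(np.argmax(ious)) if len(gts) else -1
        if j >= 0 and ious[j] >= iou_thresh:
            matched.add(j)                           # greedy one-to-one matching
            tp[rank] = 1.0
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(len(gts), 1)
    precision = cum_tp / (np.arange(len(dets)) + 1)
    # Area under the precision-recall curve (no interpolation).
    ap = 0.0
    for dr, p in zip(np.diff(np.concatenate(([0.0], recall))), precision):
        ap += dr * p
    return ap
```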
Traditional Cooperative Prediction Benchmark
Dataset | Method | Att Pred AP | Att Pred ADE | Att Pred FDE | Att Pred MR | LSTM Pred AP | LSTM Pred ADE | LSTM Pred FDE | LSTM Pred MR |
VC | No Fusion | 43.9 | 1.87 | 3.24 | 33.8 | 43.9 | 2.91 | 4.77 | 35.0 |
VC | Late Fusion | 58.1 | 1.59 | 2.81 | 34.3 | 58.1 | 2.76 | 4.60 | 33.7 |
VC | GT | – | 0.60 | 1.26 | 23.0 | – | 0.66 | 1.31 | 23.0 |
IC | No Fusion | 46.4 | 2.10 | 3.75 | 42.3 | 46.4 | 2.11 | 3.67 | 35.8 |
IC | Late Fusion | 55.9 | 1.39 | 2.44 | 30.1 | 55.9 | 2.61 | 4.40 | 32.7 |
IC | GT | – | 0.63 | 1.35 | 26.2 | – | 0.66 | 1.31 | 22.8 |
V2V | No Fusion | 40.8 | 1.99 | 3.38 | 34.0 | 40.8 | 2.98 | 4.82 | 34.4 |
V2V | Late Fusion | 55.3 | 1.75 | 3.07 | 34.0 | 55.3 | 2.87 | 4.79 | 35.0 |
V2V | GT | – | 0.60 | 1.26 | 22.9 | – | 0.66 | 1.31 | 22.8 |
I2I | No Fusion | 51.0 | 1.69 | 3.06 | 36.2 | 51.0 | 2.11 | 3.67 | 35.9 |
I2I | Late Fusion | 61.3 | 1.41 | 2.50 | 30.0 | 61.3 | 2.44 | 4.18 | 32.1 |
I2I | GT | – | 0.63 | 1.35 | 26.2 | – | 0.61 | 1.31 | 25.0 |
Att Pred: Attention Predictor; LSTM Pred: LSTM Predictor. AP@0.5 and MR are in %; ADE and FDE are in meters.
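The trajectory metrics reported above follow standard definitions; the short sketch below computes ADE, FDE, and miss rate from predicted and ground-truth trajectories. The 2 m miss threshold is illustrative, and EPA, which additionally couples detection quality, is omitted here.

```python
# Standard trajectory metrics: ADE, FDE (meters) and miss rate (%).
import numpy as np


def trajectory_metrics(pred, gt, miss_thresh=2.0):
    """pred, gt: arrays of shape (num_agents, horizon, 2) in meters."""
    errors = np.linalg.norm(pred - gt, axis=-1)      # (num_agents, horizon) displacement errors
    ade = errors.mean()                              # average over agents and horizon
    fde_per_agent = errors[:, -1]                    # final-step error per agent
    fde = fde_per_agent.mean()
    miss_rate = 100.0 * (fde_per_agent > miss_thresh).mean()
    return ade, fde, miss_rate
```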
Download
V2XPnP Sequential Dataset
Coming soon.
BibTeX
@article{zhou2024v2xpnp,
title={V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction},
author={Zhou, Zewei and Xiang, Hao and Zheng, Zhaoliang and Zhao, Seth Z. and Lei, Mingyue and Zhang, Yun and Cai, Tianhui and Liu, Xinyi and Liu, Johnson and Bajji, Maheswari and Pham, Jacob and Xia, Xin and Huang, Zhiyu and Zhou, Bolei and Ma, Jiaqi},
journal={arXiv preprint arXiv:2412.01812},
year={2024}}
Contact Us: jiaqima@ucla.edu