V2XPnP

Vehicle-to-Everything (V2X) Spatio-Temporal Fusion

for Multi-Agent Perception and Prediction

What is V2XPnP?

 

V2XPnP is the first open-source V2X spatio-temporal fusion framework for cooperative perception and prediction.

  • A novel intermediate fusion model (what to transmit) operating within one-step communication (when to transmit).
  • A unified Transformer architecture that integrates diverse attention fusion modules for V2X spatio-temporal information (how to fuse).
  • Comprehensive benchmarks and open-source code for multi-agent perception and prediction.

 

V2XPnP Sequential Dataset is the first large-scale, real-world V2X sequential dataset featuring multiple agents and all V2X collaboration modes (VC, IC, V2V, I2I).

  • Multiple connected and automated agents: two vehicles and two infrastructure units.
  • Multi-modal sequential sensor data (40k LiDAR frames and 208k camera frames), object trajectories (136 objects per scene and 10 object types), and map data (PCD map and vector map) across 24 intersections.

 

V2XPnP Fusion Framework

The cooperative temporal perception and prediction task requires integrating temporal information across historical frames and spatial information from multiple agents, which can be summarized as three questions: (1) What information to transmit? (2) When to transmit it? (3) How to fuse information across spatial and temporal dimensions?

What to Transmit

  • Early Fusion: Raw historical perception data
  • Late Fusion: Detected results at each historical frame
  • Intermediate Fusion: Intermediate spatio-temporal features

 

When to Transmit

  • Multi-step Communication: Share only the current frame’s data in each transmission, and obtain the complete history through multiple transmissions.
  • One-step Communication: Share all historical data within a single transmission.

 

We advocate Intermediate Fusion within One-step Communication because it effectively balances accuracy against transmission load. Moreover, transmitting intermediate spatio-temporal features makes this design well-suited for end-to-end perception and prediction, allowing features to be shared across multiple tasks and reducing computational demand.
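As a rough, hedged illustration of the design space above, the following Python sketch contrasts what the three fusion strategies transmit and how much data the two communication schemes move. All names and sizes (EarlyFusionMsg, HISTORY_LEN, BEV_SHAPE, etc.) are assumptions for illustration, not the released V2XPnP interface.

# Illustrative sketch (assumed names and sizes, not the V2XPnP API) of the
# "what to transmit" / "when to transmit" design space.
from dataclasses import dataclass
import numpy as np

HISTORY_LEN = 5                # number of historical frames (assumed)
BEV_SHAPE = (64, 100, 100)     # C x H x W of one intermediate BEV feature (assumed)

@dataclass
class EarlyFusionMsg:          # raw historical sensor data
    lidar_frames: list         # one (N x 4) point cloud per historical frame

@dataclass
class LateFusionMsg:           # detection results at each historical frame
    boxes_per_frame: list      # one (M x 7) box array per historical frame

@dataclass
class IntermediateFusionMsg:   # intermediate spatio-temporal feature
    feature: np.ndarray        # single tensor summarizing the whole history

def one_step_payload() -> int:
    """One-step: the temporally fused feature is packed into a single message."""
    msg = IntermediateFusionMsg(feature=np.zeros(BEV_SHAPE, dtype=np.float32))
    return msg.feature.nbytes

def multi_step_payload() -> int:
    """Multi-step: one per-frame feature is sent each step, so assembling the
    full history costs HISTORY_LEN transmissions."""
    per_frame = np.zeros(BEV_SHAPE, dtype=np.float32)
    return HISTORY_LEN * per_frame.nbytes

if __name__ == "__main__":
    print(f"one-step : {one_step_payload() / 1e6:.1f} MB per agent")
    print(f"multi-step: {multi_step_payload() / 1e6:.1f} MB per agent, accumulated")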

V2XPnP leverages a unified Transformer structure for effective spatio-temporal fusion, comprising Temporal Attention, Self-spatial Attention, Multi-agent Spatial Attention, and Map Attention modules. Each agent first extracts its inter-frame and self-spatial features, which support single-vehicle perception and prediction while reducing the communication load; the multi-agent spatial attention module then fuses these single-agent features across agents.
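A minimal PyTorch-style sketch of this fusion order (temporal, self-spatial, multi-agent spatial, then map attention) is given below. Module names, tensor shapes, and hyperparameters are assumptions for illustration; the released implementation may organize these modules differently.

# Minimal sketch of the spatio-temporal fusion order described above.
# Names, shapes, and hyperparameters are assumptions, not the official code.
import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.agent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.map_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, map_tokens):
        # feats: (A, T, N, D) = agents x history frames x BEV tokens x channels
        # map_tokens: (M, D) shared vectorized map tokens
        A, T, N, D = feats.shape

        # 1) Temporal attention: fuse each token's history within one agent.
        x = feats.permute(0, 2, 1, 3).reshape(A * N, T, D)
        x, _ = self.temporal_attn(x, x, x)
        x = x[:, -1].reshape(A, N, D)      # keep the fused current-frame token

        # 2) Self-spatial attention: fuse tokens inside each agent's own BEV.
        x, _ = self.self_spatial_attn(x, x, x)

        # 3) Multi-agent spatial attention: each ego token attends to the
        #    corresponding tokens shared by the other agents.
        y = x.permute(1, 0, 2)             # (N, A, D): agents become the sequence
        y, _ = self.agent_attn(y, y, y)
        x = y.permute(1, 0, 2)

        # 4) Map attention: condition the fused features on the map tokens.
        m = map_tokens.unsqueeze(0).repeat(A, 1, 1)
        x, _ = self.map_attn(x, m, m)
        return x                           # (A, N, D) fused spatio-temporal features

if __name__ == "__main__":
    feats = torch.randn(3, 5, 128, 256)    # 3 agents, 5 frames, 128 BEV tokens
    map_tokens = torch.randn(32, 256)
    fused = SpatioTemporalFusion()(feats, map_tokens)
    print(fused.shape)                     # torch.Size([3, 128, 256])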

V2XPnP Sequential Dataset

Our dataset comprises 100 scenarios and supports all collaboration modes: vehicle-centric (VC), infrastructure-centric (IC), vehicle-to-vehicle (V2V), and infrastructure-to-infrastructure (I2I).

Data Annotation and Sequential Processing

  • 3D bounding boxes are annotated with SUSTechPOINTS by expert annotators, with eight rounds of review and revision.
  • We develop a V2X sequential data processing pipeline that tracks objects across time and across different agents’ views based on a multi-agent spatio-temporal graph (a minimal association sketch follows this list).
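The per-frame association such a pipeline relies on can be illustrated with a small, hedged sketch: matching boxes between consecutive frames (or between two agents' views after coordinate alignment) by center distance with the Hungarian algorithm. The data layout and the 2 m gating threshold are assumptions, not the actual pipeline parameters.

# Illustrative association step: match box centers across two frames (or two
# aligned agent views) with the Hungarian algorithm. Threshold is assumed.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_centers: np.ndarray, curr_centers: np.ndarray,
              max_dist: float = 2.0):
    """Return (prev_idx, curr_idx) pairs whose center distance is below max_dist."""
    if len(prev_centers) == 0 or len(curr_centers) == 0:
        return []
    cost = np.linalg.norm(prev_centers[:, None, :] - curr_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] < max_dist]

# Toy usage: three boxes in frame t-1, three in frame t; matched pairs would
# carry their track IDs forward, the unmatched box starts a new track.
prev_c = np.array([[0.0, 0.0, 0.0], [10.0, 5.0, 0.0], [20.0, -3.0, 0.0]])
curr_c = np.array([[0.5, 0.1, 0.0], [10.4, 5.2, 0.0], [40.0, 0.0, 0.0]])
print(associate(prev_c, curr_c))   # [(0, 0), (1, 1)]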

 

Trajectory and Map Generation

  • We provide a ground-truth trajectory dataset derived from all agents’ data, together with a trajectory retrieval module that returns the observable trajectories of surrounding objects based on their actual visibility relationships (see the sketch after this list).
  • LiDAR sequences are fused to form the PCD map, and the vector map is built in RoadRunner; the final map follows the Waymo format.
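A minimal sketch of the visibility-based retrieval idea follows, under an assumed data layout (per-object trajectory samples plus aligned per-frame visibility flags); the released module will differ in interface and detail.

# Keep only the trajectory samples an observing agent actually perceives.
# Data layout is assumed for illustration.
from typing import Dict, List, Tuple

def retrieve_observable(trajectories: Dict[int, List[Tuple[float, float, float]]],
                        visibility: Dict[int, List[bool]]):
    """trajectories: object_id -> [(timestamp, x, y), ...];
    visibility: object_id -> per-sample visibility flags for the observer."""
    observable = {}
    for obj_id, samples in trajectories.items():
        flags = visibility.get(obj_id, [])
        kept = [s for s, seen in zip(samples, flags) if seen]
        if kept:
            observable[obj_id] = kept
    return observable

# Toy usage: object 7 is occluded in the middle frame.
trajs = {7: [(0.0, 1.0, 2.0), (0.1, 1.5, 2.2), (0.2, 2.0, 2.4)]}
vis = {7: [True, False, True]}
print(retrieve_observable(trajs, vis))   # {7: [(0.0, 1.0, 2.0), (0.2, 2.0, 2.4)]}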

 

Qualitative Results

Qualitative results of cooperative perception and prediction across different fusion models in sample VC and IC scenarios. Our V2XPnP model yields more accurate perception and prediction results.

Benchmark

VC Cooperative Perception and Prediction Benchmark

| Method         | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
|----------------|-----|-----|--------------|-----------|-----------|----------|-----------|
| No Fusion      |     |     | 43.9         | 1.87      | 3.24      | 33.8     | 24.3      |
| No Fusion-FaF* |     |     | 53.4         | 1.55      | 2.81      | 34.3     | 31.6      |
| Late Fusion    |     |     | 58.1         | 1.59      | 2.82      | 32.4     | 33.0      |
| Early Fusion   |     |     | 60.3         | 1.37      | 2.49      | 33.8     | 36.7      |
| V2VNet*        |     |     | 48.6         | 2.10      | 3.75      | 42.3     | 25.3      |
| F-Cooper*      |     |     | 66.0         | 1.35      | 2.56      | 36.1     | 38.7      |
| V2XPnP (Ours)  |     |     | 71.6         | 1.35      | 2.36      | 31.7     | 48.2      |
IC Cooperative Perception and Prediction Benchmark

| Method         | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
|----------------|-----|-----|--------------|-----------|-----------|----------|-----------|
| No Fusion      |     |     | 46.4         | 1.69      | 3.06      | 36.2     | 28.8      |
| No Fusion-FaF* |     |     | 56.7         | 1.34      | 2.65      | 41.4     | 31.7      |
| Late Fusion    |     |     | 55.9         | 1.39      | 2.44      | 30.1     | 32.9      |
| Early Fusion   |     |     | 60.5         | 1.39      | 2.63      | 32.8     | 39.5      |
| V2VNet*        |     |     | 33.6         | 1.95      | 3.53      | 44.2     | 16.3      |
| F-Cooper*      |     |     | 60.2         | 1.21      | 2.32      | 36.3     | 36.3      |
| V2XPnP (Ours)  |     |     | 71.0         | 1.18      | 2.16      | 34.0     | 46.0      |
V2V Cooperative Perception and Prediction Benchmark

| Method         | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
|----------------|-----|-----|--------------|-----------|-----------|----------|-----------|
| No Fusion      |     |     | 40.8         | 1.99      | 3.38      | 34.0     | 19.8      |
| No Fusion-FaF* |     |     | 51.9         | 1.67      | 3.12      | 39.3     | 27.5      |
| Late Fusion    |     |     | 55.3         | 1.75      | 3.07      | 34.0     | 30.5      |
| Early Fusion   |     |     | 53.0         | 1.64      | 3.11      | 40.2     | 26.9      |
| V2VNet*        |     |     | 43.1         | 3.10      | 5.55      | 46.8     | 19.4      |
| F-Cooper*      |     |     | 60.2         | 1.69      | 3.22      | 41.1     | 34.4      |
| V2XPnP (Ours)  |     |     | 70.5         | 1.78      | 3.28      | 39.9     | 40.6      |
I2I Cooperative Perception and Prediction Benchmark

| Method         | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
|----------------|-----|-----|--------------|-----------|-----------|----------|-----------|
| No Fusion      |     |     | 51.0         | 1.69      | 3.06      | 36.2     | 31.7      |
| No Fusion-FaF* |     |     | 56.6         | 1.34      | 2.65      | 41.4     | 31.7      |
| Late Fusion    |     |     | 61.3         | 1.41      | 2.50      | 30.0     | 41.6      |
| Early Fusion   |     |     | 64.6         | 1.57      | 2.98      | 39.9     | 37.7      |
| V2VNet*        |     |     | 41.1         | 1.83      | 3.34      | 40.4     | 23.2      |
| F-Cooper*      |     |     | 58.6         | 1.34      | 2.58      | 40.0     | 33.6      |
| V2XPnP (Ours)  |     |     | 69.2         | 1.26      | 2.31      | 36.5     | 42.8      |
Cooperative Temporal Perception Benchmark

| Dataset | No Fusion (No Temp) | No Fusion (FaF*) | No Fusion (V2XPnP) | Early Fusion (No Temp) | Early Fusion (FaF*) | Early Fusion (V2XPnP) | Inter Fusion (No Temp) | Inter Fusion (FaF*) | Inter Fusion (V2XPnP) |
|---------|---------------------|------------------|--------------------|------------------------|---------------------|-----------------------|------------------------|---------------------|-----------------------|
| VC      | 43.9                | 57.1             | 60.3               | 63.5                   | 67.0                | 71.0                  | 65.1                   | 70.3                | 74.0                  |
| IC      | 46.4                | 61.1             | 64.7               | 61.0                   | 65.5                | 71.4                  | 61.1                   | 67.1                | 73.2                  |
| V2V     | 40.8                | 53.7             | 59.1               | 54.9                   | 56.4                | 66.6                  | 58.0                   | 61.4                | 69.4                  |
| I2I     | 51.0                | 61.2             | 64.7               | 63.4                   | 66.0                | 71.6                  | 58.5                   | 62.9                | 72.4                  |

All metrics in this table are AP@0.5 (%).

Traditional Cooperative Prediction Benchmark

| Dataset | Method      | Att Pred AP | Att Pred ADE | Att Pred FDE | Att Pred MR | LSTM Pred AP | LSTM Pred ADE | LSTM Pred FDE | LSTM Pred MR |
|---------|-------------|-------------|--------------|--------------|-------------|--------------|---------------|---------------|--------------|
| VC      | No Fusion   | 43.9        | 1.87         | 3.24         | 33.8        | 43.9         | 2.91          | 4.77          | 35.0         |
| VC      | Late Fusion | 58.1        | 1.59         | 2.81         | 34.3        | 58.1         | 2.76          | 4.60          | 33.7         |
| VC      | GT          |             | 0.60         | 1.26         | 23.0        |              | 0.66          | 1.31          | 23.0         |
| IC      | No Fusion   | 46.4        | 2.10         | 3.75         | 42.3        | 46.4         | 2.11          | 3.67          | 35.8         |
| IC      | Late Fusion | 55.9        | 1.39         | 2.44         | 30.1        | 55.9         | 2.61          | 4.40          | 32.7         |
| IC      | GT          |             | 0.63         | 1.35         | 26.2        |              | 0.66          | 1.31          | 22.8         |
| V2V     | No Fusion   | 40.8        | 1.99         | 3.38         | 34.0        | 40.8         | 2.98          | 4.82          | 34.4         |
| V2V     | Late Fusion | 55.3        | 1.75         | 3.07         | 34.0        | 55.3         | 2.87          | 4.79          | 35.0         |
| V2V     | GT          |             | 0.60         | 1.26         | 22.9        |              | 0.66          | 1.31          | 22.8         |
| I2I     | No Fusion   | 51.0        | 1.69         | 3.06         | 36.2        | 51.0         | 2.11          | 3.67          | 35.9         |
| I2I     | Late Fusion | 61.3        | 1.41         | 2.50         | 30.0        | 61.3         | 2.44          | 4.18          | 32.1         |
| I2I     | GT          |             | 0.63         | 1.35         | 26.2        |              | 0.61          | 1.31          | 25.0         |

Att Pred: attention predictor; LSTM Pred: LSTM predictor. AP is AP@0.5 (%), MR is in %, and ADE/FDE are in meters.
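For reference, the trajectory metrics reported above can be computed as in the following numpy sketch. The array layout and the 2 m miss-rate threshold are assumptions for illustration; EPA additionally couples detection matching with prediction error and is omitted here.

# Sketch of ADE / FDE / MR for a set of matched future trajectories.
# Layout and the 2 m miss threshold are assumed for illustration.
import numpy as np

def trajectory_metrics(pred: np.ndarray, gt: np.ndarray, miss_thresh: float = 2.0):
    """pred, gt: (N, T, 2) arrays of N predicted / ground-truth future trajectories."""
    err = np.linalg.norm(pred - gt, axis=-1)        # (N, T) per-step L2 error
    ade = err.mean()                                # average displacement error
    fde_per_traj = err[:, -1]                       # final-step error per trajectory
    fde = fde_per_traj.mean()                       # final displacement error
    mr = float((fde_per_traj > miss_thresh).mean()) # miss rate
    return float(ade), float(fde), mr

# Toy usage with 2 trajectories over 3 future steps.
gt = np.zeros((2, 3, 2))
pred = np.array([[[0.1, 0.0], [0.2, 0.0], [0.4, 0.0]],
                 [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]])
print(trajectory_metrics(pred, gt))   # ADE ≈ 1.12, FDE ≈ 1.70, MR = 0.5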

Download

V2XPnP Sequential Dataset

Coming soon.

BibTeX

@article{zhou2024v2xpnp,
  title={V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction},
  author={Zhou, Zewei and Xiang, Hao and Zheng, Zhaoliang and Zhao, Seth Z. and Lei, Mingyue and Zhang, Yun and Cai, Tianhui and Liu, Xinyi and Liu, Johnson and Bajji, Maheswari and Pham, Jacob and Xia, Xin and Huang, Zhiyu and Zhou, Bolei and Ma, Jiaqi},
  journal={arXiv preprint arXiv:2412.01812},
  year={2024}
}

Copyright © 2023 UCLA Mobility Lab
All Rights Reserved
Contact Us: jiaqima@ucla.edu