V2XPnP
Vehicle-to-Everything (V2X) Spatio-Temporal Fusion
for Multi-Agent Perception and Prediction
What is V2XPnP?
V2XPnP is the first open-source V2X spatio-temporal fusion framework for cooperative perception and prediction.
- A novel intermediate fusion model (what to transmit) within one-step communication (when to transmit).
- A unified Transformer architecture that integrates diverse attention fusion modules for V2X spatio-temporal information (how to fuse).
- Comprehensive benchmarks and open-source code for multi-agent perception and prediction.
V2XPnP Sequential Dataset is the first large-scale, real-world V2X sequential dataset featuring multiple agents and all V2X collaboration modes (VC, IC, V2V, I2I).
- Multiple connected and automated agents: two vehicles and two infrastructure units.
- Multi-modal sequential sensor data (40k LiDAR frames and 208k camera frames), object trajectories (136 objects per scene and 10 object types), and map data (PCD map and vector map) across 24 intersections.
V2XPnP Fusion Framework
The cooperative temporal perception and prediction task requires integrating temporal information across historical frames and spatial information from multiple agents, which can be summarized as three questions: (1) What information to transmit? (2) When to transmit it? (3) How to fuse information across the spatial and temporal dimensions?
What to Transmit
- Early Fusion: Raw historical perception data
- Late Fusion: Detected results at each historical frame
- Intermediate Fusion: Intermediate spatio-temporal features
When to Transmit
- Multi-step Communication: Share the current frame's data in each transmission, and obtain the complete history through multiple transmissions.
- One-step Communication: Share all historical data within a single transmission.
We advocate Intermediate Fusion within One-step Communication because it effectively balances the trade-off between accuracy and transmission load. Moreover, its ability to transmit intermediate spatio-temporal features makes it well suited for end-to-end perception and prediction, allowing features to be shared across multiple tasks and reducing computational demand.
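To make the distinction concrete, here is a minimal Python sketch of the two communication strategies; the names (`Packet`, `one_step_communication`, the temporal encoder) are entirely hypothetical and not part of the released code. Multi-step communication sends one per-frame payload per transmission, while one-step communication compresses the history into a single intermediate spatio-temporal feature before sending.

```python
# Minimal sketch (not the official API) contrasting the two communication strategies.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np


@dataclass
class Packet:
    agent_id: int
    payload: np.ndarray  # what is actually sent over V2X


def multi_step_communication(agent_id: int, current_frame_feature: np.ndarray) -> Packet:
    """Multi-step: each transmission carries only the current frame;
    the receiver must accumulate T transmissions to recover the history."""
    return Packet(agent_id, current_frame_feature)


def one_step_communication(agent_id: int, history_features: List[np.ndarray],
                           temporal_encoder: Callable[[np.ndarray], np.ndarray]) -> Packet:
    """One-step: the full history is first compressed into one intermediate
    spatio-temporal feature, then sent in a single transmission."""
    fused = temporal_encoder(np.stack(history_features))  # (T, C, H, W) -> (C, H, W)
    return Packet(agent_id, fused)


if __name__ == "__main__":
    history = [np.zeros((64, 32, 32), dtype=np.float32) for _ in range(4)]
    pkt = one_step_communication(0, history, lambda h: h.mean(axis=0))
    print(pkt.payload.shape)  # (64, 32, 32): one payload instead of four
```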
V2XPnP leverages a Unified Transformer Structure for effective spatio-temporal fusion, comprising Temporal Attention, Self-spatial Attention, Multi-agent Spatial Attention, and Map Attention. Each agent first extracts its inter-frame and self-spatial features, which can support single-vehicle perception and prediction while reducing the communication load; the multi-agent spatial attention module then fuses the single-agent features across agents.
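The attention ordering described above can be illustrated with a minimal PyTorch-style sketch. The module composition, tensor shapes, and token layout below are assumptions for illustration only, not the released implementation.

```python
# Sketch of the fusion ordering: temporal -> self-spatial -> multi-agent spatial -> map attention.
import torch
import torch.nn as nn


class SpatioTemporalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.multi_agent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.map_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def encode_single_agent(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (T, N, C) per-frame BEV tokens of one agent (T frames, N tokens).
        Returns (N, C) spatio-temporal tokens computed on the sender side,
        so only the fused feature needs to be transmitted."""
        # Temporal attention: each spatial token attends over its own history.
        x = feats.permute(1, 0, 2)                    # (N, T, C)
        x, _ = self.temporal_attn(x, x, x)
        x = x[:, -1]                                  # keep current-frame query, (N, C)
        # Self-spatial attention within the agent's own BEV feature.
        x, _ = self.self_spatial_attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))
        return x.squeeze(0)                           # (N, C)

    def fuse(self, ego: torch.Tensor, others: torch.Tensor, map_tokens: torch.Tensor):
        """ego: (N, C) ego tokens; others: (M, C) tokens received from other agents;
        map_tokens: (K, C) vector-map tokens."""
        q = ego.unsqueeze(0)
        kv = torch.cat([ego, others]).unsqueeze(0)
        x, _ = self.multi_agent_attn(q, kv, kv)       # multi-agent spatial attention
        x, _ = self.map_attn(x, map_tokens.unsqueeze(0), map_tokens.unsqueeze(0))
        return x.squeeze(0)                           # (N, C) fused feature
```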
V2XPnP Sequential Dataset
Our dataset comprises 100 scenarios and supports all collaboration modes: vehicle-centric (VC), infrastructure-centric (IC), vehicle-to-vehicle (V2V), and infrastructure-to-infrastructure (I2I).
Data Annotation and Sequential Processing
- 3D bounding boxes are annotated with SUSTechPOINTS by expert annotators, with eight rounds of review and revision.
- A V2X sequential data processing pipeline tracks objects across time and across different agents' views based on the multi-agent spatio-temporal graph (a toy illustration of the idea follows below).
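As a toy illustration of the spatio-temporal graph idea referenced above (not the actual pipeline), the sketch below links boxes with temporal edges (adjacent frames, same agent) and spatial edges (same frame, different agents) and treats connected components as global track IDs; the center-distance association rule and threshold are placeholders.

```python
# Toy sketch: association edges across time and agents, connected components as global IDs.
import numpy as np
from itertools import combinations


def link_boxes(boxes, dist_thresh=2.0):
    """boxes: list of dicts with 'agent', 'frame', 'center' (np.ndarray of shape (2,)).
    Returns a global track ID per box via union-find over association edges."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in combinations(range(len(boxes)), 2):
        a, b = boxes[i], boxes[j]
        spatial_edge = a["frame"] == b["frame"] and a["agent"] != b["agent"]
        temporal_edge = a["agent"] == b["agent"] and abs(a["frame"] - b["frame"]) == 1
        if (spatial_edge or temporal_edge) and \
                np.linalg.norm(a["center"] - b["center"]) < dist_thresh:
            union(i, j)  # add an edge in the spatio-temporal graph

    return [find(i) for i in range(len(boxes))]
```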
Trajectory and Map Generation
- A ground-truth trajectory dataset derived from all agents' data, together with a trajectory retrieval module that returns the observable trajectories of surrounding objects based on their actual visibility relationships (see the sketch after this list).
- LiDAR sequences are fused to form the PCD map, and the vector map is built in RoadRunner; the final map follows the Waymo format.
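A hypothetical sketch of what the trajectory retrieval step could look like: given full ground-truth tracks and per-frame visibility for each agent, it returns only the observed segments. The dictionary layouts and the function name are assumptions, not the released dataset API.

```python
# Hypothetical visibility-based trajectory retrieval (data structures are assumed).
from collections import defaultdict


def retrieve_observable_trajectories(gt_trajectories, visibility):
    """gt_trajectories: {obj_id: [(frame, x, y), ...]} full ground-truth tracks.
    visibility: {(agent_id, frame): set(obj_id)} objects actually seen by an agent.
    Returns {agent_id: {obj_id: [(frame, x, y), ...]}} observable trajectories."""
    observable = defaultdict(lambda: defaultdict(list))
    for (agent_id, frame), visible_ids in visibility.items():
        for obj_id in visible_ids:
            for f, x, y in gt_trajectories.get(obj_id, []):
                if f == frame:
                    observable[agent_id][obj_id].append((f, x, y))
    return observable
```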
Qualitative Results
Qualitative results of cooperative perception and prediction across different fusion models in sample VC and IC scenarios. Our V2XPnP model shows more accurate perception and prediction results.
Benchmark
VC Cooperative Perception and Prediction Benchmark
Method | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
No Fusion | ✓ | | 43.9 | 1.87 | 3.24 | 33.8 | 24.3 |
No Fusion-FnF* | ✓ | | 53.4 | 1.55 | 2.81 | 34.3 | 31.6 |
Late Fusion | ✓ | | 58.1 | 1.59 | 2.82 | 32.4 | 33.0 |
Early Fusion | ✓ | ✓ | 60.3 | 1.37 | 2.49 | 33.8 | 36.7 |
V2VNet* | ✓ | | 48.6 | 2.10 | 3.75 | 42.3 | 25.3 |
F-Cooper* | ✓ | ✓ | 66.0 | 1.35 | 2.56 | 36.1 | 38.7 |
V2XPnP (Ours) | ✓ | ✓ | 71.6 | 1.35 | 2.36 | 31.7 | 48.2 |
IC Cooperative Perception and Prediction Benchmark
Method | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
No Fusion | ✓ | | 46.4 | 1.69 | 3.06 | 36.2 | 28.8 |
No Fusion-FnF* | ✓ | | 56.7 | 1.34 | 2.65 | 41.4 | 31.7 |
Late Fusion | ✓ | | 55.9 | 1.39 | 2.44 | 30.1 | 32.9 |
Early Fusion | ✓ | ✓ | 60.5 | 1.39 | 2.63 | 32.8 | 39.5 |
V2VNet* | ✓ | | 33.6 | 1.95 | 3.53 | 44.2 | 16.3 |
F-Cooper* | ✓ | ✓ | 60.2 | 1.21 | 2.32 | 36.3 | 36.3 |
V2XPnP (Ours) | ✓ | ✓ | 71.0 | 1.18 | 2.16 | 34.0 | 46.0 |
V2V Cooperative Perception and Prediction Benchmark
Method | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
No Fusion | ✓ | | 40.8 | 1.99 | 3.38 | 34.0 | 19.8 |
No Fusion-FnF* | ✓ | | 51.9 | 1.67 | 3.12 | 39.3 | 27.5 |
Late Fusion | ✓ | | 55.3 | 1.75 | 3.07 | 34.0 | 30.5 |
Early Fusion | ✓ | ✓ | 53.0 | 1.64 | 3.11 | 40.2 | 26.9 |
V2VNet* | ✓ | | 43.1 | 3.10 | 5.55 | 46.8 | 19.4 |
F-Cooper* | ✓ | ✓ | 60.2 | 1.69 | 3.22 | 41.1 | 34.4 |
V2XPnP (Ours) | ✓ | ✓ | 70.5 | 1.78 | 3.28 | 39.9 | 40.6 |
I2I Cooperative Perception and Prediction Benchmark
Method | E2E | Map | AP@0.5 (%) ↑ | ADE (m) ↓ | FDE (m) ↓ | MR (%) ↓ | EPA (%) ↑ |
No Fusion | ✓ | | 51.0 | 1.69 | 3.06 | 36.2 | 31.7 |
No Fusion-FnF* | ✓ | | 56.6 | 1.34 | 2.65 | 41.4 | 31.7 |
Late Fusion | ✓ | | 61.3 | 1.41 | 2.50 | 30.0 | 41.6 |
Early Fusion | ✓ | ✓ | 64.6 | 1.57 | 2.98 | 39.9 | 37.7 |
V2VNet* | ✓ | | 41.1 | 1.83 | 3.34 | 40.4 | 23.2 |
F-Cooper* | ✓ | ✓ | 58.6 | 1.34 | 2.58 | 40.0 | 33.6 |
V2XPnP (Ours) | ✓ | ✓ | 69.2 | 1.26 | 2.31 | 36.5 | 42.8 |
Cooperative Temporal Perception Benchmark
Dataset | No Fusion (No Temp) | No Fusion (FaF*) | No Fusion (V2XPnP) | Early Fusion (No Temp) | Early Fusion (FaF*) | Early Fusion (V2XPnP) | Intermediate Fusion (No Temp) | Intermediate Fusion (FaF*) | Intermediate Fusion (V2XPnP) |
VC | 43.9 | 57.1 | 60.3 | 63.5 | 67.0 | 71.0 | 65.1 | 70.3 | 74.0 |
IC | 46.4 | 61.1 | 64.7 | 61.0 | 65.5 | 71.4 | 61.1 | 67.1 | 73.2 |
V2V | 40.8 | 53.7 | 59.1 | 54.9 | 56.4 | 66.6 | 58.0 | 61.4 | 69.4 |
I2I | 51.0 | 61.2 | 64.7 | 63.4 | 66.0 | 71.6 | 58.5 | 62.9 | 72.4 |
All metrics in this table are AP@0.5 (%).
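For reference, the sketch below shows how AP@0.5 is commonly computed for detection: detections are matched greedily to ground truth in descending confidence order at an IoU threshold of 0.5, and AP is the area under the resulting precision-recall curve. This is a simplified, axis-aligned 2D version with no precision interpolation, whereas the benchmark evaluates 3D/BEV boxes.

```python
# Simplified AP@0.5 computation for axis-aligned 2D boxes (illustrative only).
import numpy as np


def iou_2d(a, b):
    """a, b: [x1, y1, x2, y2] axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def average_precision(dets, scores, gts, iou_thresh=0.5):
    """dets: (N, 4) boxes, scores: (N,) confidences, gts: (M, 4) boxes."""
    order = np.argsort(-np.asarray(scores))          # descending confidence
    matched = set()
    tp = np.zeros(len(dets))
    for rank, i in enumerate(order):
        ious = [iou_2d(dets[i], g) if j not in matched else 0.0
                for j, g in enumerate(gts)]
        j = int(np.argmax(ious)) if len(gts) else -1
        if j >= 0 and ious[j] >= iou_thresh:
            matched.add(j)                           # greedy one-to-one matching
            tp[rank] = 1.0
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(len(gts), 1)
    precision = cum_tp / (np.arange(len(dets)) + 1)
    # Area under the precision-recall curve (no interpolation).
    ap = 0.0
    for dr, p in zip(np.diff(np.concatenate(([0.0], recall))), precision):
        ap += dr * p
    return ap
```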
Traditional Cooperative Prediction Benchmark
Dataset | Method | Att Pred AP | Att Pred ADE | Att Pred FDE | Att Pred MR | LSTM Pred AP | LSTM Pred ADE | LSTM Pred FDE | LSTM Pred MR |
VC | No Fusion | 43.9 | 1.87 | 3.24 | 33.8 | 43.9 | 2.91 | 4.77 | 35.0 |
VC | Late Fusion | 58.1 | 1.59 | 2.81 | 34.3 | 58.1 | 2.76 | 4.60 | 33.7 |
VC | GT | – | 0.60 | 1.26 | 23.0 | – | 0.66 | 1.31 | 23.0 |
IC | No Fusion | 46.4 | 2.10 | 3.75 | 42.3 | 46.4 | 2.11 | 3.67 | 35.8 |
IC | Late Fusion | 55.9 | 1.39 | 2.44 | 30.1 | 55.9 | 2.61 | 4.40 | 32.7 |
IC | GT | – | 0.63 | 1.35 | 26.2 | – | 0.66 | 1.31 | 22.8 |
V2V | No Fusion | 40.8 | 1.99 | 3.38 | 34.0 | 40.8 | 2.98 | 4.82 | 34.4 |
V2V | Late Fusion | 55.3 | 1.75 | 3.07 | 34.0 | 55.3 | 2.87 | 4.79 | 35.0 |
V2V | GT | – | 0.60 | 1.26 | 22.9 | – | 0.66 | 1.31 | 22.8 |
I2I | No Fusion | 51.0 | 1.69 | 3.06 | 36.2 | 51.0 | 2.11 | 3.67 | 35.9 |
I2I | Late Fusion | 61.3 | 1.41 | 2.50 | 30.0 | 61.3 | 2.44 | 4.18 | 32.1 |
I2I | GT | – | 0.63 | 1.35 | 26.2 | – | 0.61 | 1.31 | 25.0 |
Att Pred: Attention Predictor; LSTM Pred: LSTM Predictor. AP@0.5 and MR are in %; ADE and FDE are in meters.
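The trajectory metrics reported above follow standard definitions; the short sketch below computes ADE, FDE, and miss rate from predicted and ground-truth trajectories. The 2 m miss threshold is illustrative, and EPA, which additionally couples detection quality, is omitted here.

```python
# Standard trajectory metrics: ADE, FDE (meters) and miss rate (%).
import numpy as np


def trajectory_metrics(pred, gt, miss_thresh=2.0):
    """pred, gt: arrays of shape (num_agents, horizon, 2) in meters."""
    errors = np.linalg.norm(pred - gt, axis=-1)      # (num_agents, horizon) displacement errors
    ade = errors.mean()                              # average over agents and horizon
    fde_per_agent = errors[:, -1]                    # final-step error per agent
    fde = fde_per_agent.mean()
    miss_rate = 100.0 * (fde_per_agent > miss_thresh).mean()
    return ade, fde, miss_rate
```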
Download
V2XPnP Sequential Dataset
Coming soon.
BibTeX
@article{zhou2024v2xpnp,
title={V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction},
author={Zhou, Zewei and Xiang, Hao and Zheng, Zhaoliang and Zhao, Seth Z. and Lei, Mingyue and Zhang, Yun and Cai, Tianhui and Liu, Xinyi and Liu, Johnson and Bajji, Maheswari and Pham, Jacob and Xia, Xin and Huang, Zhiyu and Zhou, Bolei and Ma, Jiaqi},
journal={arXiv preprint arXiv:2412.01812},
year={2024}}
Contact Us: jiaqima@ucla.edu