PointActionCLIP: Preventing Transfer Degradation in Point Cloud Action Recognition with a Triple-Path CLIP

The overall architecture of our PointActionCLIP.

Abstract

Directly applying CLIP to point cloud action recognition can cause severe accuracy collapse. In this paper, we propose PointActionCLIP, which successfully prevents this transfer degradation with a triplepath CLIP, including the image path, the sequence path, and the label path. Specifically, the image path projects the 3D point cloud sequence onto a 2D image sequence and uses a visual encoder to extract its feature. It also captures the temporal feature of the image sequence with a temporal encoding transformer. The sequence path adopts a pretrained sequence encoder to encode the original point cloud sequence to obtain its spatiotemporal feature. The label path encodes the candidate labels with a text encoder. Finally, we fuse the output of the three paths to obtain the predicted action label. Extensive experiments validate that PointActionCLIP outperforms state-of-the-art (SOTA) methods.

Type
Publication
In 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing
Click the Cite button above to demo the feature to enable visitors to import publication metadata into their reference management software.