Vision-language models (VLMs) demonstrate impressive generalization capabilities in autonomous driving. However, extending these capabilities to other sensors, such as 4D radar, is hindered by the lack of large-scale annotated datasets, making direct training infeasible. Knowledge distillation from pretrained VLMs such as CLIP offers a promising path to bridge this gap, but the large modality gap between sparse 4D radar data and dense visual-language features remains a challenge. We argue that naive feature-level distillation and simple contrastive alignment are insufficient. We propose Structural Knowledge Distillation for 4D radar (SKD-4R), a framework that instead distills the underlying relational structure of the pretrained VLM's embedding space. Our method transfers both the cross-modal (image-text) and intra-modal (image-image, text-text) relational graphs from the CLIP teacher to the 4D radar encoder. To enable training, we construct the K-Radar-language dataset, a new benchmark pairing 4D radar with language annotations. On this dataset, SKD-4R improves retrieval recall across the 4D radar, image, and language modalities, establishing structural knowledge distillation as an effective path toward multimodal alignment for low-resource sensors.
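To make the core idea concrete, the sketch below shows one plausible form of structural (relational) distillation: row-normalized similarity graphs are computed over a batch in the frozen teacher's embedding space and in the radar student's space, and the student is penalized for deviating from the teacher's graphs via a KL term. This is an illustrative sketch only, not the paper's implementation; all function names, the temperature value, and the choice of KL divergence are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def relation_graph(a, b, tau=0.07):
    # each row is a distribution over the batch: how sample i of `a`
    # relates to every sample of `b` (cosine similarity, temperature tau)
    return softmax(l2norm(a) @ l2norm(b).T / tau, axis=-1)

def kl(p, q, eps=1e-8):
    # mean row-wise KL(p || q) between two batches of distributions
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def structural_kd_loss(radar, img, txt, tau=0.07):
    # cross-modal: the radar->text graph should match the teacher's
    # image->text graph
    cross = kl(relation_graph(img, txt, tau), relation_graph(radar, txt, tau))
    # intra-modal: the radar->radar graph should match the teacher's
    # image->image and text->text graphs
    intra = (kl(relation_graph(img, img, tau), relation_graph(radar, radar, tau))
             + kl(relation_graph(txt, txt, tau), relation_graph(radar, radar, tau)))
    return cross + intra

# toy batch: frozen CLIP image/text embeddings (teacher) and a radar student
rng = np.random.default_rng(0)
B, d = 8, 32
img = rng.normal(size=(B, d))
txt = rng.normal(size=(B, d))
radar = img + 0.1 * rng.normal(size=(B, d))

loss = structural_kd_loss(radar, img, txt)
```

Because the loss compares pairwise relations rather than individual features, the radar encoder need not share a feature dimension or scale with the teacher; only the batch-level relational structure is transferred.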