X-Distill: Cross-Architecture Vision Distillation Enables Data-Efficient Visuomotor Learning

Anonymous RAL Submission

Teaser Figure

Abstract

Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce settings typical of robot learning, where compact CNNs with strong inductive biases are easier to optimize. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that combines the strengths of both architectures. Our approach performs offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly finetuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or finetuned DINOv2 encoders. Notably, X-Distill also surpasses stronger baselines that utilize privileged 3D observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
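
The distillation stage described in the abstract can be illustrated with a toy sketch. The page does not state the loss or the projection used, so everything here is an assumption made for illustration: the objective is a mean-squared feature-matching loss, the `teacher` function is a fixed stand-in for the frozen DINOv2 model, and the `student` is a tiny linear map standing in for ResNet-18 plus a projection layer. All function names are hypothetical.

```python
import random


def teacher(x):
    """Frozen stand-in for the DINOv2 teacher: a fixed map that is never updated."""
    return [2.0 * x[0] - x[1], 0.5 * x[1]]


def student(w, x):
    """Trainable stand-in for ResNet-18 plus a projection to the teacher's width."""
    return [w[i][0] * x[0] + w[i][1] * x[1] for i in range(2)]


def distill_loss(w, batch):
    """Mean-squared error between student and teacher features (assumed objective)."""
    total = 0.0
    for x in batch:
        t, s = teacher(x), student(w, x)
        total += sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(t)
    return total / len(batch)


def distill_step(w, batch, lr):
    """One gradient-descent step on the student only; the teacher stays fixed."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for x in batch:
        t, s = teacher(x), student(w, x)
        for i in range(2):
            err = s[i] - t[i]
            for j in range(2):
                grad[i][j] += 2.0 * err * x[j] / (len(t) * len(batch))
    return [[w[i][j] - lr * grad[i][j] for j in range(2)] for i in range(2)]


def run_demo(steps=200, lr=0.5, seed=0):
    """Distill on random 2-D inputs; returns (initial loss, final loss, weights)."""
    rng = random.Random(seed)
    batch = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(64)]
    w = [[0.0, 0.0], [0.0, 0.0]]
    initial = distill_loss(w, batch)
    for _ in range(steps):
        w = distill_step(w, batch, lr)
    return initial, distill_loss(w, batch), w
```

In this toy setting the student weights approach the teacher's fixed map. In the actual method, the resulting encoder would then be finetuned jointly with the diffusion policy head, which this sketch does not cover.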

Real-World Experiments

Task Setup & Generalization

Real-world task configurations

Visualization of configurations for our real-world tasks. The orange arrow provides a schematic representation of the gripper trajectory as derived from the data. The green regions represent the distribution of object/robot configurations seen during training demonstrations, while the red regions illustrate the novel configurations used for generalization testing.

Task-by-Task Video Results

Task 1: Move Cube

4-Way Comparison of Move Cube task.

Task 2: Move Brush

4-Way Comparison of Move Brush task.

Task 3: Writing "AGI"

4-Way Comparison of Writing "AGI" task.

Robustness to Perturbation: X-Distill (left) robustly adapts, while ResNet-scratch (right) fails.

Task 4: Drawer Open

4-Way Comparison of Drawer Open task.

Task 5: Door Close

4-Way Comparison of Door Close task.

Qualitative Analysis

Feature Space Separability (t-SNE)

t-SNE visualization of learned feature spaces

t-SNE visualization of learned feature spaces on the "Writing AGI" task. Our X-Distill encoder forms three distinct clusters corresponding to the task's semantic stages. Its Silhouette Score of 0.472 indicates greater cluster cohesion and separation than the baselines, quantitatively confirming a well-separated feature space. This semantic separability is crucial for the policy to accurately identify the current task stage, enabling precise long-horizon planning for the sequential writing task.
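
For readers unfamiliar with the metric, the following is a minimal sketch of the standard Silhouette Score, assuming Euclidean distance. The page does not specify how features were extracted or how stage labels were assigned, so the points and labels used with it are hypothetical. Scores lie in [-1, 1], and higher values mean tighter, better-separated clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical 2-D embeddings for three task stages.
demo_points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 10), (0, 11)]
stage_labels = [0, 0, 1, 1, 2, 2]
mixed_labels = [0, 1, 2, 0, 1, 2]
```

Calling `silhouette_score(demo_points, stage_labels)` on these well-separated toy clusters gives a score close to 1, while `mixed_labels` gives a much lower one.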

Saliency Map Visualization

Saliency map comparison

Saliency map comparison on the "Writing AGI" task. We visualize the model's visual focus at the beginning of each writing stage. Our X-Distill encoder correctly shifts its attention from the gripper (before 'A'), to the letter 'A' (before 'G'), and finally to the letter 'G' (before 'I'). Baseline models exhibit diffuse or irrelevant attention.

Simulation Experiment

The figure below presents training curves on representative tasks from MetaWorld, Adroit, and DexArt, demonstrating the consistent performance advantage of X-Distill throughout the learning process.

Training curves on representative simulation tasks. Success rates are shown for selected tasks from MetaWorld, Adroit, and DexArt.

The tables below provide comprehensive quantitative comparisons against strong baselines on all benchmark tasks. Our method achieves superior or competitive performance across the majority of tasks, particularly excelling in challenging manipulation scenarios like handle pulling, peg insertion, and complex multi-step operations. The results highlight X-Distill's robustness across varying task difficulties and embodiment domains.

The ablation study tables examine the impact of different teacher-student architecture combinations. Notably, the DINOv2-L to ResNet-18 configuration emerges as the most effective balance between performance and efficiency, while the consistent superiority of distilled representations over from-scratch training underscores the value of our knowledge distillation approach.

Main Results on MetaWorld Tasks

MetaWorld (Easy)

| Alg \ Task | Lever Pull | Door Close | Drawer Open | Door Lock | Door Unlock |
| --- | --- | --- | --- | --- | --- |
| ResNet-scratch | 30 ± 18 | 100 ± 0 | 77 ± 6 | 70 ± 19 | 82 ± 5 |
| Theia | 48 ± 14 | 100 ± 0 | 100 ± 0 | 47 ± 3 | 83 ± 8 |
| Depth-Anything | 17 ± 2 | 100 ± 0 | 63 ± 5 | 48 ± 2 | 78 ± 13 |
| DINOv2 | 47 ± 5 | 100 ± 0 | 100 ± 0 | 63 ± 2 | 90 ± 4 |
| X-Distill (Ours) | 75 ± 8 | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |

MetaWorld (Easy)

| Alg \ Task | Drawer Close | Faucet Close | Faucet Open | Handle Press | Handle Pull |
| --- | --- | --- | --- | --- | --- |
| ResNet-scratch | 100 ± 0 | 100 ± 0 | 100 ± 0 | 83 ± 13 | 25 ± 22 |
| Theia | 100 ± 0 | 3 ± 3 | 7 ± 3 | 100 ± 0 | 13 ± 10 |
| Depth-Anything | 100 ± 0 | 88 ± 6 | 100 ± 0 | 85 ± 7 | 18 ± 2 |
| DINOv2 | 100 ± 0 | 93 ± 2 | 100 ± 0 | 100 ± 0 | 28 ± 5 |
| X-Distill (Ours) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 | 95 ± 4 |

MetaWorld (Easy)

| Alg \ Task | Handle Pull Side | Plate Slide | Plate Slide Back | Plate Slide Back Side | Plate Slide Side |
| --- | --- | --- | --- | --- | --- |
| ResNet-scratch | 3 ± 5 | 90 ± 14 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| Theia | 12 ± 16 | 40 ± 30 | 38 ± 8 | 62 ± 47 | 2 ± 3 |
| Depth-Anything | 12 ± 2 | 80 ± 12 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2 | 48 ± 5 | 80 ± 4 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| X-Distill (Ours) | 95 ± 7 | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |

MetaWorld (Easy)

| Alg \ Task | Reach Wall | Window Close | Window Open | Reach | Peg Unplug Side |
| --- | --- | --- | --- | --- | --- |
| ResNet-scratch | 77 ± 2 | 100 ± 0 | 93 ± 10 | 47 ± 13 | 55 ± 8 |
| Theia | 67 ± 3 | 42 ± 6 | 95 ± 5 | 48 ± 6 | 10 ± 0 |
| Depth-Anything | 48 ± 10 | 90 ± 14 | 70 ± 4 | 45 ± 4 | 22 ± 6 |
| DINOv2 | 53 ± 6 | 100 ± 0 | 78 ± 13 | 52 ± 2 | 38 ± 2 |
| X-Distill (Ours) | 73 ± 6 | 100 ± 0 | 100 ± 0 | 52 ± 6 | 87 ± 2 |

MetaWorld (Medium)

| Alg \ Task | Coffee Push | Bin Picking | Coffee Pull | Push Wall | Peg Insert Side |
| --- | --- | --- | --- | --- | --- |
| ResNet-scratch | 82 ± 10 | 68 ± 9 | 55 ± 4 | 48 ± 10 | 28 ± 13 |
| Theia | 32 ± 3 | 10 ± 0 | 2 ± 3 | 3 ± 6 | 0 ± 0 |
| Depth-Anything | 38 ± 2 | 32 ± 2 | 47 ± 6 | 33 ± 2 | 13 ± 2 |
| DINOv2 | 35 ± 0 | 52 ± 13 | 52 ± 6 | 48 ± 6 | 35 ± 4 |
| X-Distill (Ours) | 97 ± 5 | 95 ± 4 | 95 ± 4 | 80 ± 0 | 88 ± 2 |

MetaWorld (Medium / Hard / Very Hard)

| Alg \ Task | Sweep | Sweep Into | Pick Out of Hole | Disassemble |
| --- | --- | --- | --- | --- |
| ResNet-scratch | 22 ± 2 | 33 ± 9 | 38 ± 9 | 50 ± 29 |
| Theia | 12 ± 3 | 37 ± 3 | 0 ± 0 | 38 ± 6 |
| Depth-Anything | 20 ± 4 | 22 ± 6 | 42 ± 10 | 43 ± 6 |
| DINOv2 | 48 ± 5 | 52 ± 5 | 48 ± 9 | 38 ± 5 |
| X-Distill (Ours) | 85 ± 4 | 78 ± 5 | 48 ± 6 | 88 ± 6 |

Main Results on Adroit and DexArt Tasks

| Alg \ Task | Adroit Door | Adroit Pen | Adroit Relocate | DexArt Laptop | DexArt Toilet |
| --- | --- | --- | --- | --- | --- |
| ResNet-scratch | 47 ± 7 | 18 ± 2 | 48 ± 8 | 52 ± 5 | 57 ± 2 |
| Theia | 7 ± 2 | 14 ± 1 | 5 ± 4 | 0 ± 0 | 48 ± 2 |
| Depth-Anything | 52 ± 6 | 16 ± 1 | 53 ± 5 | 70 ± 4 | 62 ± 2 |
| DINOv2 | 57 ± 6 | 38 ± 11 | 60 ± 7 | 53 ± 13 | 63 ± 5 |
| X-Distill (Ours) | 73 ± 9 | 60 ± 11 | 72 ± 5 | 65 ± 4 | 62 ± 2 |
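
As a reading aid for the Adroit and DexArt table, the sketch below averages each method's reported mean success rate over the five tasks. The unweighted average is an illustrative summary computed from the table entries, not a figure reported by the authors, and it ignores the ± spreads. The names `ADROIT_DEXART_MEANS` and `average_success` are hypothetical.

```python
# Mean success rates (%) copied from the table, in the order:
# Adroit Door, Adroit Pen, Adroit Relocate, DexArt Laptop, DexArt Toilet.
ADROIT_DEXART_MEANS = {
    "ResNet-scratch": [47, 18, 48, 52, 57],
    "Theia": [7, 14, 5, 0, 48],
    "Depth-Anything": [52, 16, 53, 70, 62],
    "DINOv2": [57, 38, 60, 53, 63],
    "X-Distill (Ours)": [73, 60, 72, 65, 62],
}


def average_success(table):
    """Unweighted average of the per-task means for each method."""
    return {alg: sum(vals) / len(vals) for alg, vals in table.items()}


averages = average_success(ADROIT_DEXART_MEANS)
for alg, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{alg}: {avg:.1f}")
```

Under this summary X-Distill averages 66.4, ahead of DINOv2 at 54.2, although Depth-Anything still leads on the individual Laptop task.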

Ablation Study: Teacher-Student Architecture Combinations

Ablation results on MetaWorld tasks. Tasks are grouped by difficulty and arranged across visually aligned sections.

MetaWorld - Easy Tasks

| Teacher | Student | Lever Pull | Door Close | Drawer Open | Door Lock |
| --- | --- | --- | --- | --- | --- |
| DINOv2-L | ResNet-18 (11M) | 75 ± 8 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2-L | ViT-S-Half (11M) | 40 ± 4 | 100 ± 0 | 100 ± 0 | 68 ± 10 |
| DINOv2-L | ConvNeXt (89M) | 73 ± 8 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2-S | ResNet-18 (11M) | 82 ± 12 | 100 ± 0 | 100 ± 0 | 100 ± 0 |

MetaWorld - Easy Tasks

| Teacher | Student | Door Unlock | Drawer Close | Faucet Close | Faucet Open |
| --- | --- | --- | --- | --- | --- |
| DINOv2-L | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2-L | ViT-S-Half (11M) | 68 ± 2 | 100 ± 0 | 22 ± 17 | 100 ± 0 |
| DINOv2-L | ConvNeXt (89M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2-S | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |

MetaWorld - Easy Tasks

| Teacher | Student | Handle Press | Handle Pull | Handle Pull Side | Plate Slide |
| --- | --- | --- | --- | --- | --- |
| DINOv2-L | ResNet-18 (11M) | 100 ± 0 | 95 ± 4 | 95 ± 7 | 100 ± 0 |
| DINOv2-L | ViT-S-Half (11M) | 98 ± 2 | 13 ± 5 | 10 ± 0 | 93 ± 2 |
| DINOv2-L | ConvNeXt (89M) | 100 ± 0 | 80 ± 15 | 78 ± 6 | 98 ± 2 |
| DINOv2-S | ResNet-18 (11M) | 100 ± 0 | 98 ± 3 | 90 ± 13 | 100 ± 0 |

MetaWorld - Easy Tasks

| Teacher | Student | Plate Slide Back | Plate Slide Back Side | Plate Slide Side | Reach Wall |
| --- | --- | --- | --- | --- | --- |
| DINOv2-L | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 73 ± 6 |
| DINOv2-L | ViT-S-Half (11M) | 85 ± 7 | 100 ± 0 | 100 ± 0 | 72 ± 5 |
| DINOv2-L | ConvNeXt (89M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 67 ± 3 |
| DINOv2-S | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 75 ± 5 |

MetaWorld - Easy/Medium Tasks

| Teacher | Student | Window Close | Window Open | Coffee Push | Bin Picking |
| --- | --- | --- | --- | --- | --- |
| DINOv2-L | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 97 ± 5 | 95 ± 4 |
| DINOv2-L | ViT-S-Half (11M) | 92 ± 8 | 95 ± 4 | 35 ± 4 | 10 ± 4 |
| DINOv2-L | ConvNeXt (89M) | 100 ± 0 | 100 ± 0 | 88 ± 8 | 88 ± 6 |
| DINOv2-S | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 95 ± 5 | 85 ± 9 |

MetaWorld - Easy/Medium Tasks

| Teacher | Student | Reach | Peg Unplug Side | Coffee Pull | Push Wall |
| --- | --- | --- | --- | --- | --- |
| DINOv2-L | ResNet-18 (11M) | 52 ± 6 | 87 ± 2 | 95 ± 4 | 80 ± 0 |
| DINOv2-L | ViT-S-Half (11M) | 50 ± 4 | 33 ± 2 | 23 ± 18 | 37 ± 6 |
| DINOv2-L | ConvNeXt (89M) | 52 ± 3 | 88 ± 8 | 83 ± 3 | 73 ± 8 |
| DINOv2-S | ResNet-18 (11M) | 52 ± 6 | 88 ± 6 | 95 ± 0 | 83 ± 3 |

MetaWorld - Medium/Hard Tasks

| Teacher | Student | Peg Insert Side | Sweep | Sweep Into | Pick Out of Hole |
| --- | --- | --- | --- | --- | --- |
| DINOv2-L | ResNet-18 (11M) | 88 ± 2 | 85 ± 4 | 78 ± 5 | 48 ± 6 |
| DINOv2-L | ViT-S-Half (11M) | 7 ± 6 | 45 ± 8 | 20 ± 4 | 2 ± 2 |
| DINOv2-L | ConvNeXt (89M) | 57 ± 16 | 75 ± 5 | 78 ± 6 | 50 ± 5 |
| DINOv2-S | ResNet-18 (11M) | 90 ± 5 | 85 ± 5 | 78 ± 3 | 43 ± 16 |

MetaWorld - Very Hard Tasks

| Teacher | Student | Disassemble |
| --- | --- | --- |
| DINOv2-L | ResNet-18 (11M) | 88 ± 6 |
| DINOv2-L | ViT-S-Half (11M) | 40 ± 7 |
| DINOv2-L | ConvNeXt (89M) | 83 ± 8 |
| DINOv2-S | ResNet-18 (11M) | 90 ± 5 |