X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning

Anonymous RAL Submission


Abstract

Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation step, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly finetuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated tasks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or finetuned DINOv2 encoders. Notably, X-Distill also surpasses stronger baselines that utilize privileged 3D observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
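The offline distillation stage described in the abstract can be illustrated with a deliberately tiny, dependency-free sketch. Everything in it is an assumption made for illustration: the frozen teacher and the student are stand-in linear maps rather than DINOv2 and ResNet-18, the objective is a plain mean-squared error between student and teacher features (the abstract does not state which loss is used), and the names `matvec`, `distill_step`, and `run_distillation` are hypothetical.

```python
import random

def matvec(weights, x):
    """Apply a linear map (a list of rows) to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(a, b):
    """Mean-squared error between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def distill_step(student, teacher, image, lr):
    """One SGD step pulling the student's features toward the frozen teacher's."""
    target = matvec(teacher, image)   # teacher weights are never updated
    pred = matvec(student, image)
    loss = mse(pred, target)
    n = len(pred)
    for i, row in enumerate(student):
        # Gradient of the MSE with respect to output i of the student.
        grad_out = 2.0 * (pred[i] - target[i]) / n
        for j in range(len(row)):
            row[j] -= lr * grad_out * image[j]
    return loss

def run_distillation(steps=300, in_dim=4, feat_dim=3, lr=0.3, seed=0):
    """Toy stand-in for the offline stage: random 'images', linear models."""
    rng = random.Random(seed)
    teacher = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(feat_dim)]
    student = [[0.0] * in_dim for _ in range(feat_dim)]
    losses = []
    for _ in range(steps):
        image = [rng.uniform(-1, 1) for _ in range(in_dim)]
        losses.append(distill_step(student, teacher, image, lr))
    return losses

losses = run_distillation()
```

In this toy setting the per-step loss shrinks as the student's features approach the teacher's. In the pipeline the abstract describes, the resulting student would then be finetuned jointly with the diffusion policy head; that second stage is not sketched here.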

Real-World Experiments

Task Setup & Generalization

Visualization of configurations for our real-world tasks. The orange arrow provides a schematic representation of the gripper trajectory as derived from the data. The green regions represent the distribution of object/robot configurations seen during training demonstrations, while the red regions illustrate the novel configurations used for generalization testing.

Task-by-Task Video Results

Task 1: Move Cube

4-Way Comparison of Move Cube task.

Task 2: Move Brush

4-Way Comparison of Move Brush task.

Task 3: Writing "AGI"

4-Way Comparison of Writing "AGI" task.

Robustness to Perturbation: X-Distill (left) robustly adapts, while ResNet-scratch (right) fails.

Task 4: Drawer Open

4-Way Comparison of Drawer Open task.

Task 5: Door Close

4-Way Comparison of Door Close task.

Qualitative Analysis

Feature Space Separability (t-SNE)

t-SNE visualization of learned feature spaces on the "Writing AGI" task. Our X-Distill encoder forms three distinct clusters corresponding to the task's semantic stages. Its Silhouette Score of 0.472 quantitatively confirms a well-separated feature space, indicating higher cluster cohesion and separation than the baselines. This semantic separability is crucial for the policy to accurately identify the current task stage, enabling precise long-horizon planning for the sequential writing task.
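For readers unfamiliar with the metric, the Silhouette Score averages, over all points, the quantity (b - a) / max(a, b), where a is a point's mean distance to its own cluster and b is its mean distance to the nearest other cluster; values range from -1 to 1. The sketch below is a minimal pure-Python version using Euclidean distance. The toy 2D points and stage labels are made up for illustration, and the source does not state whether its score is computed on the t-SNE embedding or on the original features.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    n = len(points)
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i in range(n):
        # Mean distance from point i to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                dists = [math.dist(points[i], points[j]) for j in members]
                mean_dist[c] = sum(dists) / len(dists)
        own = labels[i]
        if own not in mean_dist:
            continue  # a singleton cluster contributes a coefficient of 0
        a = mean_dist[own]
        b = min(d for c, d in mean_dist.items() if c != own)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Hypothetical 2D features: three tight groups, one per task stage.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 9), (5, 10)]
stage_labels = [0, 0, 1, 1, 2, 2]   # labels that match the groups
mixed_labels = [0, 1, 2, 0, 1, 2]   # labels that cut across the groups

separated = silhouette_score(points, stage_labels)
mixed = silhouette_score(points, mixed_labels)
```

Labels that align with the spatial groups yield a score close to 1, while labels that cut across them yield a much lower one, which is the sense in which a higher score reflects better stage separability.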

Saliency Map Visualization

Saliency map comparison on the "Writing AGI" task. We visualize the model's visual focus at the beginning of each writing stage. Our X-Distill encoder correctly shifts its attention from the gripper (before 'A'), to the letter 'A' (before 'G'), and finally to the letter 'G' (before 'I'). Baseline models exhibit diffuse or irrelevant attention.

Simulation Experiment

The figure below presents training curves for representative tasks from MetaWorld, Adroit, and DexArt, demonstrating the consistent performance advantage of X-Distill throughout the learning process.

Training curves on representative simulation tasks. Success rates are shown for selected tasks from MetaWorld, Adroit, and DexArt.

The tables below provide comprehensive quantitative comparisons against strong baselines on all benchmark tasks. Our method achieves superior or competitive performance across the majority of tasks, particularly excelling in challenging manipulation scenarios like handle pulling, peg insertion, and complex multi-step operations. The results highlight X-Distill's robustness across varying task difficulties and embodiment domains.

The ablation study tables examine the impact of different teacher-student architecture combinations. Notably, the DINOv2-L to ResNet-18 configuration emerges as the most effective balance between performance and efficiency, while the consistent superiority of distilled representations over from-scratch training underscores the value of our knowledge distillation approach.

Main Results on MetaWorld Tasks

MetaWorld (Easy)
Alg \ Task	Lever Pull	Door Close	Drawer Open	Door Lock	Door Unlock
ResNet-scratch	30 ± 18	100 ± 0	77 ± 6	70 ± 19	82 ± 5
Theia	48 ± 14	100 ± 0	100 ± 0	47 ± 3	83 ± 8
Depth-Anything	17 ± 2	100 ± 0	63 ± 5	48 ± 2	78 ± 13
DINOv2	47 ± 5	100 ± 0	100 ± 0	63 ± 2	90 ± 4
X-Distill (Ours)	75 ± 8	100 ± 0	100 ± 0	100 ± 0	100 ± 0

MetaWorld (Easy)
Alg \ Task	Drawer Close	Faucet Close	Faucet Open	Handle Press	Handle Pull
ResNet-scratch	100 ± 0	100 ± 0	100 ± 0	83 ± 13	25 ± 22
Theia	100 ± 0	3 ± 3	7 ± 3	100 ± 0	13 ± 10
Depth-Anything	100 ± 0	88 ± 6	100 ± 0	85 ± 7	18 ± 2
DINOv2	100 ± 0	93 ± 2	100 ± 0	100 ± 0	28 ± 5
X-Distill (Ours)	100 ± 0	100 ± 0	100 ± 0	100 ± 0	95 ± 4

MetaWorld (Easy)
Alg \ Task	Handle Pull Side	Plate Slide	Plate Slide Back	Plate Slide Back Side	Plate Slide Side
ResNet-scratch	3 ± 5	90 ± 14	100 ± 0	100 ± 0	100 ± 0
Theia	12 ± 16	40 ± 30	38 ± 8	62 ± 47	2 ± 3
Depth-Anything	12 ± 2	80 ± 12	100 ± 0	100 ± 0	100 ± 0
DINOv2	48 ± 5	80 ± 4	100 ± 0	100 ± 0	100 ± 0
X-Distill (Ours)	95 ± 7	100 ± 0	100 ± 0	100 ± 0	100 ± 0

MetaWorld (Easy)
Alg \ Task	Reach Wall	Window Close	Window Open	Reach	Peg Unplug Side
ResNet-scratch	77 ± 2	100 ± 0	93 ± 10	47 ± 13	55 ± 8
Theia	67 ± 3	42 ± 6	95 ± 5	48 ± 6	10 ± 0
Depth-Anything	48 ± 10	90 ± 14	70 ± 4	45 ± 4	22 ± 6
DINOv2	53 ± 6	100 ± 0	78 ± 13	52 ± 2	38 ± 2
X-Distill (Ours)	73 ± 6	100 ± 0	100 ± 0	52 ± 6	87 ± 2

MetaWorld (Medium)
Alg \ Task	Coffee Push	Bin Picking	Coffee Pull	Push Wall	Peg Insert Side
ResNet-scratch	82 ± 10	68 ± 9	55 ± 4	48 ± 10	28 ± 13
Theia	32 ± 3	10 ± 0	2 ± 3	3 ± 6	0 ± 0
Depth-Anything	38 ± 2	32 ± 2	47 ± 6	33 ± 2	13 ± 2
DINOv2	35 ± 0	52 ± 13	52 ± 6	48 ± 6	35 ± 4
X-Distill (Ours)	97 ± 5	95 ± 4	95 ± 4	80 ± 0	88 ± 2

MetaWorld (Medium / Hard / Very Hard)
Alg \ Task	Sweep	Sweep Into	Pick Out of Hole	Disassemble
ResNet-scratch	22 ± 2	33 ± 9	38 ± 9	50 ± 29
Theia	12 ± 3	37 ± 3	0 ± 0	38 ± 6
Depth-Anything	20 ± 4	22 ± 6	42 ± 10	43 ± 6
DINOv2	48 ± 5	52 ± 5	48 ± 9	38 ± 5
X-Distill (Ours)	85 ± 4	78 ± 5	48 ± 6	88 ± 6

Main Results on Adroit and DexArt Tasks

Alg \ Task	Adroit Door	Adroit Pen	Adroit Relocate	DexArt Laptop	DexArt Toilet
ResNet-scratch	47 ± 7	18 ± 2	48 ± 8	52 ± 5	57 ± 2
Theia	7 ± 2	14 ± 1	5 ± 4	0 ± 0	48 ± 2
Depth-Anything	52 ± 6	16 ± 1	53 ± 5	70 ± 4	62 ± 2
DINOv2	57 ± 6	38 ± 11	60 ± 7	53 ± 13	63 ± 5
X-Distill (Ours)	73 ± 9	60 ± 11	72 ± 5	65 ± 4	62 ± 2
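As a reading aid, the mean success rates in the Adroit and DexArt table above can be aggregated per method. The unweighted average over the five tasks is an illustrative aggregation chosen here, not a metric reported by the source, and the ± spreads are ignored.

```python
# Mean success rates (%) copied from the Adroit / DexArt table,
# in column order: Door, Pen, Relocate, Laptop, Toilet.
results = {
    "ResNet-scratch":   [47, 18, 48, 52, 57],
    "Theia":            [7, 14, 5, 0, 48],
    "Depth-Anything":   [52, 16, 53, 70, 62],
    "DINOv2":           [57, 38, 60, 53, 63],
    "X-Distill (Ours)": [73, 60, 72, 65, 62],
}

def average_success(table):
    """Unweighted mean over tasks for each method (illustrative aggregation)."""
    return {name: sum(rates) / len(rates) for name, rates in table.items()}

averages = average_success(results)
best_method = max(averages, key=averages.get)

for name, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {avg:5.1f}")
```

Under this aggregation X-Distill averages 66.4, ahead of DINOv2 at 54.2, even though Depth-Anything is higher on Laptop and DINOv2 on Toilet.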

Ablation Study: Teacher-Student Architecture Combinations

Ablation results on MetaWorld tasks. Tasks are grouped by difficulty and arranged across visually aligned sections.

MetaWorld - Easy Tasks
Teacher	Student	Lever Pull	Door Close	Drawer Open	Door Lock
DINOv2-L	ResNet-18 (11M)	75 ± 8	100 ± 0	100 ± 0	100 ± 0
DINOv2-L	ViT-S-Half (11M)	40 ± 4	100 ± 0	100 ± 0	68 ± 10
DINOv2-L	ConvNeXt (89M)	73 ± 8	100 ± 0	100 ± 0	100 ± 0
DINOv2-S	ResNet-18 (11M)	82 ± 12	100 ± 0	100 ± 0	100 ± 0

MetaWorld - Easy Tasks
Teacher	Student	Door Unlock	Drawer Close	Faucet Close	Faucet Open
DINOv2-L	ResNet-18 (11M)	100 ± 0	100 ± 0	100 ± 0	100 ± 0
DINOv2-L	ViT-S-Half (11M)	68 ± 2	100 ± 0	22 ± 17	100 ± 0
DINOv2-L	ConvNeXt (89M)	100 ± 0	100 ± 0	100 ± 0	100 ± 0
DINOv2-S	ResNet-18 (11M)	100 ± 0	100 ± 0	100 ± 0	100 ± 0

MetaWorld - Easy Tasks
Teacher	Student	Handle Press	Handle Pull	Handle Pull Side	Plate Slide
DINOv2-L	ResNet-18 (11M)	100 ± 0	95 ± 4	95 ± 7	100 ± 0
DINOv2-L	ViT-S-Half (11M)	98 ± 2	13 ± 5	10 ± 0	93 ± 2
DINOv2-L	ConvNeXt (89M)	100 ± 0	80 ± 15	78 ± 6	98 ± 2
DINOv2-S	ResNet-18 (11M)	100 ± 0	98 ± 3	90 ± 13	100 ± 0

MetaWorld - Easy Tasks
Teacher	Student	Plate Slide Back	Plate Slide Back Side	Plate Slide Side	Reach Wall
DINOv2-L	ResNet-18 (11M)	100 ± 0	100 ± 0	100 ± 0	73 ± 6
DINOv2-L	ViT-S-Half (11M)	85 ± 7	100 ± 0	100 ± 0	72 ± 5
DINOv2-L	ConvNeXt (89M)	100 ± 0	100 ± 0	100 ± 0	67 ± 3
DINOv2-S	ResNet-18 (11M)	100 ± 0	100 ± 0	100 ± 0	75 ± 5

MetaWorld - Easy/Medium Tasks
Teacher	Student	Window Close	Window Open	Coffee Push	Bin Picking
DINOv2-L	ResNet-18 (11M)	100 ± 0	100 ± 0	97 ± 5	95 ± 4
DINOv2-L	ViT-S-Half (11M)	92 ± 8	95 ± 4	35 ± 4	10 ± 4
DINOv2-L	ConvNeXt (89M)	100 ± 0	100 ± 0	88 ± 8	88 ± 6
DINOv2-S	ResNet-18 (11M)	100 ± 0	100 ± 0	95 ± 5	85 ± 9

MetaWorld - Easy/Medium Tasks
Teacher	Student	Reach	Peg Unplug Side	Coffee Pull	Push Wall
DINOv2-L	ResNet-18 (11M)	52 ± 6	87 ± 2	95 ± 4	80 ± 0
DINOv2-L	ViT-S-Half (11M)	50 ± 4	33 ± 2	23 ± 18	37 ± 6
DINOv2-L	ConvNeXt (89M)	52 ± 3	88 ± 8	83 ± 3	73 ± 8
DINOv2-S	ResNet-18 (11M)	52 ± 6	88 ± 6	95 ± 0	83 ± 3

MetaWorld - Medium/Hard Tasks
Teacher	Student	Peg Insert Side	Sweep	Sweep Into	Pick Out of Hole
DINOv2-L	ResNet-18 (11M)	88 ± 2	85 ± 4	78 ± 5	48 ± 6
DINOv2-L	ViT-S-Half (11M)	7 ± 6	45 ± 8	20 ± 4	2 ± 2
DINOv2-L	ConvNeXt (89M)	57 ± 16	75 ± 5	78 ± 6	50 ± 5
DINOv2-S	ResNet-18 (11M)	90 ± 5	85 ± 5	78 ± 3	43 ± 16

MetaWorld - Very Hard Tasks
Teacher	Student	Disassemble
DINOv2-L	ResNet-18 (11M)	88 ± 6
DINOv2-L	ViT-S-Half (11M)	40 ± 7
DINOv2-L	ConvNeXt (89M)	83 ± 8
DINOv2-S	ResNet-18 (11M)	90 ± 5