Anonymous RAL Submission
Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly finetuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or finetuned DINOv2 encoders. Notably, X-Distill also surpasses stronger baselines that utilize privileged 3D observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
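The offline distillation stage described above can be illustrated with a toy sketch. The abstract states only that a frozen teacher's representations are transferred to a compact student, so everything concrete here is an assumption made for illustration: the mean-squared-error feature-matching loss, the 2-D stand-in "images", the linear student, and the helper names `teacher_features`, `student_features`, `distill_loss`, and `distill_step`. It is not the authors' implementation.

```python
import random

def teacher_features(x):
    # Hypothetical stand-in for the frozen teacher (DINOv2 in the paper).
    # It has no trainable state and is never updated.
    return [2.0 * x[0] - x[1], 0.5 * x[0] + x[1]]

def student_features(w, x):
    # Hypothetical stand-in for the student (ResNet-18 in the paper):
    # a 2x2 linear map with trainable weights w.
    return [w[0][0] * x[0] + w[0][1] * x[1],
            w[1][0] * x[0] + w[1][1] * x[1]]

def distill_loss(w, batch):
    # Assumed objective: mean squared error between student and teacher features.
    total = 0.0
    for x in batch:
        s, t = student_features(w, x), teacher_features(x)
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(batch)

def distill_step(w, batch, lr):
    # One gradient-descent step on the student weights only.
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for x in batch:
        s, t = student_features(w, x), teacher_features(x)
        for i in range(2):
            for j in range(2):
                grad[i][j] += 2.0 * (s[i] - t[i]) * x[j] / len(batch)
    return [[w[i][j] - lr * grad[i][j] for j in range(2)] for i in range(2)]

rng = random.Random(0)
# Toy "dataset": 64 random 2-D inputs standing in for ImageNet images.
images = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(64)]
weights = [[0.0, 0.0], [0.0, 0.0]]

initial_loss = distill_loss(weights, images)
for _ in range(200):
    weights = distill_step(weights, images, lr=0.1)
final_loss = distill_loss(weights, images)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

In the pipeline the abstract describes, the distilled student would then be finetuned jointly with the diffusion policy head on the target task. That second stage is not shown here.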
Visualization of configurations for our real-world tasks. The orange arrow provides a schematic representation of the gripper trajectory as derived from the data. The green regions represent the distribution of object/robot configurations seen during training demonstrations, while the red regions illustrate the novel configurations used for generalization testing.
4-Way Comparison of Move Cube task.
4-Way Comparison of Move Brush task.
4-Way Comparison of Writing "AGI" task.
Robustness to Perturbation: X-Distill (left) robustly adapts, while ResNet-scratch (right) fails.
4-Way Comparison of Drawer Open task.
4-Way Comparison of Door Close task.
t-SNE visualization of learned feature spaces on the "Writing AGI" task. Our X-Distill encoder forms three distinct clusters corresponding to the task's semantic stages. Its Silhouette Score of 0.472 quantitatively confirms a well-separated feature space, indicating greater cluster cohesion and separation than the baselines. This semantic separability is crucial for the policy to accurately identify the current task stage, enabling precise long-horizon planning for the sequential writing task.
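For readers unfamiliar with the metric, the Silhouette Score can be computed as in the minimal sketch below. It uses the standard definition (per-point score (b - a) / max(a, b), averaged over all points) with Euclidean distance. The toy points and labels are hypothetical and unrelated to the paper's features, and the source does not say which distance metric or feature space (raw or t-SNE-projected) was used for the reported value.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 2-D features: three tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 10), (1, 10)]
separated = silhouette_score(toy_points, [0, 0, 1, 1, 2, 2])
# The same points with labels that cut across the groups.
mixed = silhouette_score(toy_points, [0, 1, 2, 0, 1, 2])
print(separated, mixed)
```

Scores lie in [-1, 1]. Values near 1 mean points sit much closer to their own cluster than to any other, which is the sense in which a higher score indicates better-separated task stages.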
Saliency map comparison on the "Writing AGI" task. We visualize the model's visual focus at the beginning of each writing stage. Our X-Distill encoder correctly shifts its attention from the gripper (before 'A'), to the letter 'A' (before 'G'), and finally to the letter 'G' (before 'I'). Baseline models exhibit diffuse or irrelevant attention.
The figure below presents the training curves across representative tasks from MetaWorld, Adroit, and DexArt, demonstrating the consistent performance advantages of our X-Distill approach throughout the learning process.
The tables below provide comprehensive quantitative comparisons against strong baselines on all benchmark tasks. Our method achieves superior or competitive performance across the majority of tasks, particularly excelling in challenging manipulation scenarios like handle pulling, peg insertion, and complex multi-step operations. The results highlight X-Distill's robustness across varying task difficulties and embodiment domains.
The ablation study tables examine the impact of different teacher-student architecture combinations. Notably, the DINOv2-L to ResNet-18 configuration emerges as the most effective balance between performance and efficiency, while the consistent superiority of distilled representations over from-scratch training underscores the value of our knowledge distillation approach.
| MetaWorld (Easy) | |||||
|---|---|---|---|---|---|
| Alg \ Task | Lever Pull | Door Close | Drawer Open | Door Lock | Door Unlock |
| ResNet-scratch | 30 ± 18 | 100 ± 0 | 77 ± 6 | 70 ± 19 | 82 ± 5 |
| Theia | 48 ± 14 | 100 ± 0 | 100 ± 0 | 47 ± 3 | 83 ± 8 |
| Depth-Anything | 17 ± 2 | 100 ± 0 | 63 ± 5 | 48 ± 2 | 78 ± 13 |
| DINOv2 | 47 ± 5 | 100 ± 0 | 100 ± 0 | 63 ± 2 | 90 ± 4 |
| X-Distill (Ours) | 75 ± 8 | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| MetaWorld (Easy) | |||||
|---|---|---|---|---|---|
| Alg \ Task | Drawer Close | Faucet Close | Faucet Open | Handle Press | Handle Pull |
| ResNet-scratch | 100 ± 0 | 100 ± 0 | 100 ± 0 | 83 ± 13 | 25 ± 22 |
| Theia | 100 ± 0 | 3 ± 3 | 7 ± 3 | 100 ± 0 | 13 ± 10 |
| Depth-Anything | 100 ± 0 | 88 ± 6 | 100 ± 0 | 85 ± 7 | 18 ± 2 |
| DINOv2 | 100 ± 0 | 93 ± 2 | 100 ± 0 | 100 ± 0 | 28 ± 5 |
| X-Distill (Ours) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 | 95 ± 4 |
| MetaWorld (Easy) | |||||
|---|---|---|---|---|---|
| Alg \ Task | Handle Pull Side | Plate Slide | Plate Slide Back | Plate Slide Back Side | Plate Slide Side |
| ResNet-scratch | 3 ± 5 | 90 ± 14 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| Theia | 12 ± 16 | 40 ± 30 | 38 ± 8 | 62 ± 47 | 2 ± 3 |
| Depth-Anything | 12 ± 2 | 80 ± 12 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2 | 48 ± 5 | 80 ± 4 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| X-Distill (Ours) | 95 ± 7 | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| MetaWorld (Easy) | |||||
|---|---|---|---|---|---|
| Alg \ Task | Reach Wall | Window Close | Window Open | Reach | Peg Unplug Side |
| ResNet-scratch | 77 ± 2 | 100 ± 0 | 93 ± 10 | 47 ± 13 | 55 ± 8 |
| Theia | 67 ± 3 | 42 ± 6 | 95 ± 5 | 48 ± 6 | 10 ± 0 |
| Depth-Anything | 48 ± 10 | 90 ± 14 | 70 ± 4 | 45 ± 4 | 22 ± 6 |
| DINOv2 | 53 ± 6 | 100 ± 0 | 78 ± 13 | 52 ± 2 | 38 ± 2 |
| X-Distill (Ours) | 73 ± 6 | 100 ± 0 | 100 ± 0 | 52 ± 6 | 87 ± 2 |
| MetaWorld (Medium) | |||||
|---|---|---|---|---|---|
| Alg \ Task | Coffee Push | Bin Picking | Coffee Pull | Push Wall | Peg Insert Side |
| ResNet-scratch | 82 ± 10 | 68 ± 9 | 55 ± 4 | 48 ± 10 | 28 ± 13 |
| Theia | 32 ± 3 | 10 ± 0 | 2 ± 3 | 3 ± 6 | 0 ± 0 |
| Depth-Anything | 38 ± 2 | 32 ± 2 | 47 ± 6 | 33 ± 2 | 13 ± 2 |
| DINOv2 | 35 ± 0 | 52 ± 13 | 52 ± 6 | 48 ± 6 | 35 ± 4 |
| X-Distill (Ours) | 97 ± 5 | 95 ± 4 | 95 ± 4 | 80 ± 0 | 88 ± 2 |
| MetaWorld (Medium / Hard / Very Hard) | ||||
|---|---|---|---|---|
| Alg \ Task | Sweep | Sweep Into | Pick Out of Hole | Disassemble |
| ResNet-scratch | 22 ± 2 | 33 ± 9 | 38 ± 9 | 50 ± 29 |
| Theia | 12 ± 3 | 37 ± 3 | 0 ± 0 | 38 ± 6 |
| Depth-Anything | 20 ± 4 | 22 ± 6 | 42 ± 10 | 43 ± 6 |
| DINOv2 | 48 ± 5 | 52 ± 5 | 48 ± 9 | 38 ± 5 |
| X-Distill (Ours) | 85 ± 4 | 78 ± 5 | 48 ± 6 | 88 ± 6 |
| Alg \ Task | Adroit: Door | Adroit: Pen | Adroit: Relocate | DexArt: Laptop | DexArt: Toilet |
|---|---|---|---|---|---|
| ResNet-scratch | 47 ± 7 | 18 ± 2 | 48 ± 8 | 52 ± 5 | 57 ± 2 |
| Theia | 7 ± 2 | 14 ± 1 | 5 ± 4 | 0 ± 0 | 48 ± 2 |
| Depth-Anything | 52 ± 6 | 16 ± 1 | 53 ± 5 | 70 ± 4 | 62 ± 2 |
| DINOv2 | 57 ± 6 | 38 ± 11 | 60 ± 7 | 53 ± 13 | 63 ± 5 |
| X-Distill (Ours) | 73 ± 9 | 60 ± 11 | 72 ± 5 | 65 ± 4 | 62 ± 2 |
Main results on MetaWorld tasks. Tasks are grouped by difficulty and arranged across visually aligned sections.
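As a quick way to check claims such as "superior or competitive performance across the majority of tasks", the short script below aggregates the mean success rates from the Adroit and DexArt table above. It is only a reading aid: the numbers are copied from that table, the helper names `average` and `best_methods` are hypothetical, and the reported ± spreads are ignored, so a one-point gap (as on Toilet) should not be read as a meaningful difference.

```python
# Mean success rates (%) copied from the Adroit / DexArt table; spreads omitted.
tasks = ["Door", "Pen", "Relocate", "Laptop", "Toilet"]
results = {
    "ResNet-scratch": [47, 18, 48, 52, 57],
    "Theia": [7, 14, 5, 0, 48],
    "Depth-Anything": [52, 16, 53, 70, 62],
    "DINOv2": [57, 38, 60, 53, 63],
    "X-Distill (Ours)": [73, 60, 72, 65, 62],
}

def average(values):
    # Unweighted mean across tasks.
    return sum(values) / len(values)

def best_methods(results, task_index):
    # All methods tied for the highest mean success rate on one task.
    top = max(scores[task_index] for scores in results.values())
    return [name for name, scores in results.items() if scores[task_index] == top]

averages = {name: average(scores) for name, scores in results.items()}
winners = {task: best_methods(results, i) for i, task in enumerate(tasks)}

for name, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {avg:5.1f}")
for task, names in winners.items():
    print(f"{task:9s} best: {', '.join(names)}")
```

On these five tasks X-Distill has the highest unweighted average and the top mean on Door, Pen, and Relocate, while Depth-Anything leads on Laptop and DINOv2 leads by one point on Toilet.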
| MetaWorld - Easy Tasks | |||||
|---|---|---|---|---|---|
| Teacher | Student | Lever Pull | Door Close | Drawer Open | Door Lock |
| DINOv2-L | ResNet-18 (11M) | 75 ± 8 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2-L | ViT-S-Half (11M) | 40 ± 4 | 100 ± 0 | 100 ± 0 | 68 ± 10 |
| DINOv2-L | ConvNeXt (89M) | 73 ± 8 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2-S | ResNet-18 (11M) | 82 ± 12 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| MetaWorld - Easy Tasks | |||||
|---|---|---|---|---|---|
| Teacher | Student | Door Unlock | Drawer Close | Faucet Close | Faucet Open |
| DINOv2-L | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2-L | ViT-S-Half (11M) | 68 ± 2 | 100 ± 0 | 22 ± 17 | 100 ± 0 |
| DINOv2-L | ConvNeXt (89M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| DINOv2-S | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| MetaWorld - Easy Tasks | |||||
|---|---|---|---|---|---|
| Teacher | Student | Handle Press | Handle Pull | Handle Pull Side | Plate Slide |
| DINOv2-L | ResNet-18 (11M) | 100 ± 0 | 95 ± 4 | 95 ± 7 | 100 ± 0 |
| DINOv2-L | ViT-S-Half (11M) | 98 ± 2 | 13 ± 5 | 10 ± 0 | 93 ± 2 |
| DINOv2-L | ConvNeXt (89M) | 100 ± 0 | 80 ± 15 | 78 ± 6 | 98 ± 2 |
| DINOv2-S | ResNet-18 (11M) | 100 ± 0 | 98 ± 3 | 90 ± 13 | 100 ± 0 |
| MetaWorld - Easy Tasks | |||||
|---|---|---|---|---|---|
| Teacher | Student | Plate Slide Back | Plate Slide Back Side | Plate Slide Side | Reach Wall |
| DINOv2-L | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 73 ± 6 |
| DINOv2-L | ViT-S-Half (11M) | 85 ± 7 | 100 ± 0 | 100 ± 0 | 72 ± 5 |
| DINOv2-L | ConvNeXt (89M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 67 ± 3 |
| DINOv2-S | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 100 ± 0 | 75 ± 5 |
| MetaWorld - Easy/Medium Tasks | |||||
|---|---|---|---|---|---|
| Teacher | Student | Window Close | Window Open | Coffee Push | Bin Picking |
| DINOv2-L | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 97 ± 5 | 95 ± 4 |
| DINOv2-L | ViT-S-Half (11M) | 92 ± 8 | 95 ± 4 | 35 ± 4 | 10 ± 4 |
| DINOv2-L | ConvNeXt (89M) | 100 ± 0 | 100 ± 0 | 88 ± 8 | 88 ± 6 |
| DINOv2-S | ResNet-18 (11M) | 100 ± 0 | 100 ± 0 | 95 ± 5 | 85 ± 9 |
| MetaWorld - Easy/Medium Tasks | |||||
|---|---|---|---|---|---|
| Teacher | Student | Reach | Peg Unplug Side | Coffee Pull | Push Wall |
| DINOv2-L | ResNet-18 (11M) | 52 ± 6 | 87 ± 2 | 95 ± 4 | 80 ± 0 |
| DINOv2-L | ViT-S-Half (11M) | 50 ± 4 | 33 ± 2 | 23 ± 18 | 37 ± 6 |
| DINOv2-L | ConvNeXt (89M) | 52 ± 3 | 88 ± 8 | 83 ± 3 | 73 ± 8 |
| DINOv2-S | ResNet-18 (11M) | 52 ± 6 | 88 ± 6 | 95 ± 0 | 83 ± 3 |
| MetaWorld - Medium/Hard Tasks | |||||
|---|---|---|---|---|---|
| Teacher | Student | Peg Insert Side | Sweep | Sweep Into | Pick Out of Hole |
| DINOv2-L | ResNet-18 (11M) | 88 ± 2 | 85 ± 4 | 78 ± 5 | 48 ± 6 |
| DINOv2-L | ViT-S-Half (11M) | 7 ± 6 | 45 ± 8 | 20 ± 4 | 2 ± 2 |
| DINOv2-L | ConvNeXt (89M) | 57 ± 16 | 75 ± 5 | 78 ± 6 | 50 ± 5 |
| DINOv2-S | ResNet-18 (11M) | 90 ± 5 | 85 ± 5 | 78 ± 3 | 43 ± 16 |
| MetaWorld - Very Hard Tasks | ||
|---|---|---|
| Teacher | Student | Disassemble |
| DINOv2-L | ResNet-18 (11M) | 88 ± 6 |
| DINOv2-L | ViT-S-Half (11M) | 40 ± 7 |
| DINOv2-L | ConvNeXt (89M) | 83 ± 8 |
| DINOv2-S | ResNet-18 (11M) | 90 ± 5 |