So you think you know distillation; it's easy, right?
We thought so too with @XiaohuaZhai @__kolesnikov__ @_arohan_ and the amazing @royaleerieme and Larisa Markeeva.
Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)
🧵👇https://t.co/3SlkXVZcG3
This is not a fancy novel method. It's plain old distillation.
But we investigate it thoroughly, for model compression, through the lens of *function matching*.
We highlight two crucial principles that are often missed: Consistency and Patience. Only when applied jointly do they give good results!
0. Intuition: Want the student to replicate _the whole function_ represented by the teacher, everywhere that we expect data in input space.
This is a much stronger view than the commonly used "teacher generates better/more informative labels for the data". See pic above.
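Roughly, the objective this view leads to looks like the sketch below (minimal JAX; `teacher_apply`/`student_apply` are placeholder forward passes, not our actual code):

```python
import jax
import jax.numpy as jnp

# Function matching: push the student's predictive distribution towards the
# teacher's on the SAME input x, i.e. minimize KL(teacher || student).
def distill_loss(student_params, teacher_params, x, temperature=1.0):
    t_logits = teacher_apply(teacher_params, x)  # placeholder teacher forward pass
    s_logits = student_apply(student_params, x)  # placeholder student forward pass
    t_probs = jax.nn.softmax(t_logits / temperature)
    s_logprobs = jax.nn.log_softmax(s_logits / temperature)
    # Cross-entropy against teacher probs = KL(t || s) up to the teacher's entropy.
    return -jnp.mean(jnp.sum(t_probs * s_logprobs, axis=-1))
```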
1. Consistency: to achieve this, teacher and student need to see the exact same view (crop) of the image. In particular, this means no pre-computed teacher logits! We can also generate many more views via mixup (see the sketch below).
Other approaches may look good early, but eventually fall behind consistency.
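Concretely, something like this per batch (a sketch only; `random_crop` is a placeholder augmentation helper, and the exact augmentations matter less than applying them identically to the one input that both teacher and student see):

```python
import jax
import jax.numpy as jnp

def make_distillation_batch(rng, images):
    # Consistency: sample ONE augmented view and feed the *same* pixels to both
    # teacher and student -- teacher logits are computed on the fly, never cached.
    crop_rng, mix_rng = jax.random.split(rng)
    views = random_crop(crop_rng, images)            # placeholder augmentation
    # Mixup creates many more input points on which to match the function.
    lam = jax.random.beta(mix_rng, a=1.0, b=1.0)
    mixed = lam * views + (1.0 - lam) * views[::-1]  # mix with a reversed batch
    return mixed  # no ground-truth labels needed: the teacher provides the targets
```

Note that the teacher is queried inside the training loop on `mixed`, which is exactly what rules out pre-computed logits.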
2. Patience: The function matching task is HARD! We need to train *a lot* longer than is typical, and we actually haven't reached saturation yet. Overfitting is not an issue: when function matching, an "overfit" student just matches the teacher more closely, which is exactly what we want! (Note: with pre-computed teacher logits, we do overfit.)
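For a sense of scale, a "patient" schedule might look like the sketch below. The numbers and the optimizer choice are illustrative stand-ins, not our exact setup; the point is training orders of magnitude longer than a typical ~90-epoch run.

```python
import optax

# Patience: an aggressively long schedule (thousands of epochs, not hundreds).
steps_per_epoch = 1_281_167 // 4096          # ImageNet-1k images / batch size (illustrative)
total_steps = 9_600 * steps_per_epoch        # illustrative epoch count
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=3e-3,
    warmup_steps=5_000, decay_steps=total_steps)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=1e-4)
```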