
As promised in the paper, torchdistill now supports 🤗 @huggingface transformers, accelerate & datasets packages for deep learning & knowledge distillation experiments with ⤵️ LOW coding cost😎

https://t.co/4cWQIL8x1Z

Paper, new results, trained models and Google Colab are 🔽

1/n

Paper:
https://t.co/GX9JaDRW2r
Preprint: https://t.co/ttiihRjFmG

This work covers the key concepts of torchdistill and reproduces ImageNet results with KD methods presented at CVPR, ICLR, ECCV and NeurIPS

Code, training logs, configs and trained models are all available🙌

2/n

With the latest torchdistill, I attempted to reproduce the TEST results of BERT and applied knowledge distillation to BERT-B/L (student/teacher) to improve BERT-B on the GLUE benchmark

BERT-L (FT): 80.2 (80.5)
BERT-B (FT): 77.9 (78.3)
BERT-B (KD): 78.9

The pretrained model weights are available on the @huggingface model hub🤗
https://t.co/mYapfFGoxH

For these experiments, I used Google Colab as the computing resource🖥️

So, you should be able to try similar experiments based on the following examples!
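
As a rough illustration of the recipe behind these numbers, here is a minimal sketch of soft-label knowledge distillation written directly against 🤗 transformers rather than torchdistill's config-driven pipeline; the model names, label count and the temperature/alpha values are illustrative assumptions, not the settings used in the paper.

```python
# A minimal soft-label KD sketch with 🤗 transformers (NOT torchdistill's API;
# model names, num_labels and the temperature/alpha values are assumptions).
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

teacher = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=2)
student = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

teacher.eval()
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
temperature, alpha = 4.0, 0.5  # assumed KD hyperparameters

def kd_step(texts, labels):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(labels)
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_out = student(**batch, labels=labels)
    # KL divergence between softened teacher and student distributions (Hinton-style KD)
    kd_loss = F.kl_div(
        F.log_softmax(student_out.logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Blend the distillation loss with the ordinary cross-entropy on the hard labels
    loss = alpha * kd_loss + (1 - alpha) * student_out.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

print(kd_step(["a great movie", "a terrible movie"], [1, 0]))
```

In the actual experiments, the whole pipeline is configured through torchdistill and run on Colab, as linked above.
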
Calculating convolution output sizes is something I found particularly hard, even after I first understood how convolutions work.

I couldn't remember the formula because I didn't understand exactly how it worked.

So here's my attempt to get some intuition behind the calculation.🔣👇


BTW if you haven't read the thread 🧵 on 1D, 2D, 3D CNN, you may want to check it out


First, observe the picture below🖼


The 2 x 2 filter slides over the
3 rows 2 times, and the
4 columns 3 times.

So, let's try subtracting the filter size first
3 - 2 = 1
4 - 2 = 2

That falls 1 short of the actual counts, so we add 1 to both.
3 - 2 + 1 = 2
4 - 2 + 1 = 3

Hence the formula so far becomes:

output size = input size - filter size + 1
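
A tiny sketch to check this against the picture above (the function name is my own, just for illustration):

```python
def conv_output_size(n, f):
    """Output length along one dimension for an n-wide input and f-wide filter (no padding, stride 1)."""
    return n - f + 1

# The 3 x 4 input with a 2 x 2 filter from the picture above:
print(conv_output_size(3, 2), conv_output_size(4, 2))  # -> 2 3
```
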
Now let's discuss padding0⃣

Zero padding makes it possible to keep the output the same size as the input by adding extra rows and columns of zeros around it.

It gives the filter extra space to slide over, making up for the space lost at the borders.
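
Extending the sketch above with a padding of p zeros on each side (again, the function name and numbers are just for illustration):

```python
def conv_output_size(n, f, p=0):
    """Output length for an n-wide input, f-wide filter and p zeros padded on each side (stride 1)."""
    return n - f + 2 * p + 1

# 'Same' padding for a 3-wide filter: pad by 1 on each side and the size is preserved.
print(conv_output_size(5, 3))       # -> 3
print(conv_output_size(5, 3, p=1))  # -> 5
```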