We introduce a new paradigm for dataset creation based on human 🧑‍💻 and machine 🤖 collaboration, which brings together the generative strength of LMs and the evaluative strength of humans. And we collect WaNLI, a dataset of 108K NLI examples! 🧵
Paper: https://t.co/IUXcm9wIh2
Our pipeline starts with an existing dataset (MNLI), and uses data maps to automatically identify pockets of examples that demonstrate challenging 🧠 reasoning patterns relative to a trained model. Then we use GPT-3 to generate new examples likely to have the same pattern. 2/
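A minimal sketch of the data-map step, assuming the usual data-map statistics: confidence is the mean probability a model assigns to the gold label across training epochs, and variability is the standard deviation of that probability. The function names, the input format, and the choice to rank by variability are illustrative assumptions, not the pipeline's actual code.

```python
from statistics import mean, pstdev


def data_map_stats(gold_probs_per_epoch):
    """Return (confidence, variability) for one training example.

    gold_probs_per_epoch: probabilities the model gave the gold label,
    one value per training epoch (hypothetical input format).
    """
    confidence = mean(gold_probs_per_epoch)
    variability = pstdev(gold_probs_per_epoch)
    return confidence, variability


def most_ambiguous(examples, k):
    """Return the ids of the k examples with the highest variability.

    examples: dict mapping example id -> list of per-epoch gold-label
    probabilities. Ranking by variability is an assumption made for
    illustration; other regions of the data map could be selected instead.
    """
    variability = {ex_id: data_map_stats(p)[1] for ex_id, p in examples.items()}
    return sorted(variability, key=variability.get, reverse=True)[:k]
```

For instance, an example whose gold-label probability swings between 0.1 and 0.9 across epochs would rank above one that stays at 0.9 throughout.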
Next we propose a new metric, also inspired by data maps, to automatically filter generations for those most likely to aid model learning. Finally, we validate ✅ the generated examples through crowdworkers, who assign a gold label and (optionally) revise for quality ✍️. 3/
Remarkably, replacing MNLI with WaNLI (which is 4x smaller) for training improves performance on seven OOD test sets 🧪, including by 11% on HANS and 9% on ANLI. Under a data augmentation setting, combining MNLI with WaNLI is more effective than using other augmentation sets. 4/
Our method addresses limitations of crowdsourcing, where workers may resort to repetitive writing strategies 🤷, and leverages the great progress in text generation. We get the best of both worlds: 🤖's ability to produce diverse examples, and 🧑‍💻's ability to evaluate them. 5/