We introduce a new paradigm for dataset creation based on human 🧑‍💻 and machine 🤖 collaboration, which brings together the generative strength of LMs and the evaluative strength of humans. With it, we collect 🎉 WaNLI, a dataset of 108K NLI examples! 🧵

Paper: https://t.co/IUXcm9wIh2

Our pipeline starts with an existing dataset (MNLI), and uses data maps 📜 to automatically identify pockets of examples that demonstrate challenging 🧐 reasoning patterns relative to a trained model. Then we use GPT-3 to generate new examples likely to have the same pattern. 2/
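A minimal sketch of the data-map step, assuming the standard data-map coordinates: an example's confidence is the mean probability the model assigns to its gold label across training epochs, and its variability is the standard deviation of that probability. Ranking by variability is one way to surface "ambiguous" seed examples. The function names and the toy probabilities are hypothetical and are not the authors' code.

```python
import statistics
from typing import Dict, List, Tuple


def data_map_coordinates(gold_probs_per_epoch: List[float]) -> Tuple[float, float]:
    """Return (confidence, variability) for a single training example.

    gold_probs_per_epoch: the model's probability for the gold label,
    recorded at the end of each training epoch (hypothetical input format).
    """
    confidence = statistics.mean(gold_probs_per_epoch)
    # Population standard deviation across epochs.
    variability = statistics.pstdev(gold_probs_per_epoch)
    return confidence, variability


def most_ambiguous(examples: Dict[str, List[float]], k: int) -> List[str]:
    """Return the IDs of the k examples with the highest variability."""
    variability = {ex_id: data_map_coordinates(p)[1] for ex_id, p in examples.items()}
    return sorted(variability, key=variability.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy training dynamics for three MNLI-style examples (made-up numbers).
    dynamics = {
        "easy": [0.90, 0.92, 0.95],  # consistently confident and correct
        "ambiguous": [0.20, 0.80, 0.40],  # model flips back and forth
        "hard": [0.10, 0.10, 0.05],  # consistently wrong
    }
    print(most_ambiguous(dynamics, 1))
```

High-variability examples are the ones the model keeps changing its mind about, which is why they make natural candidates for "pockets" of challenging reasoning patterns to hand to the generator.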
Next we propose a new metric, also inspired by data maps, to automatically filter generations for those most likely to aid model learning. Finally, we validate ✅ the generated examples through crowdworkers, who assign a gold label 🟡 and (optionally) revise for quality ✍️. 3/
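Generated examples have no gold label yet, so a data-map-style filter cannot track "the gold label's" probability. A minimal sketch, assuming the metric takes each candidate label's variability across training checkpoints and keeps the maximum (our reading of "inspired by data maps", not the paper's code). The function names and the three-label input format are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence


def estimated_max_variability(probs_per_checkpoint: Sequence[Sequence[float]]) -> float:
    """probs_per_checkpoint[e][y]: model probability for label y at checkpoint e.

    Because the true label is unknown, compute the variability for every
    candidate label and keep the largest one.
    """
    num_labels = len(probs_per_checkpoint[0])
    variabilities = [
        statistics.pstdev([probs[y] for probs in probs_per_checkpoint])
        for y in range(num_labels)
    ]
    return max(variabilities)


def keep_top_generations(
    generations: Dict[str, List[List[float]]], k: int
) -> List[str]:
    """Keep the k generated examples with the highest estimated max variability."""
    scores = {g: estimated_max_variability(p) for g, p in generations.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Label order assumed: entailment, neutral, contradiction (made-up numbers).
    gens = {
        "stable": [[0.9, 0.05, 0.05], [0.92, 0.04, 0.04]],
        "flip": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    }
    print(keep_top_generations(gens, 1))
```

Taking the maximum over labels means a generation scores high if the model is unstable about any label, which serves as a stand-in for gold-label variability before crowdworkers assign one.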
Remarkably, replacing MNLI with WaNLI (which is 4x smaller) for training improves performance 📈 on seven OOD test sets 🧪, including by 11% on HANS and 9% on ANLI. Under a data augmentation setting, combining MNLI with WaNLI is more effective than using other augmentation sets. 4/
Our method addresses limitations of crowdsourcing, where workers may resort to repetitive writing strategies 🤷, and leverages the great progress in text generation 📃. We get the best of both worlds: 🤖’s ability to produce diverse examples, and 🧑‍💻’s ability to evaluate them. 5/
We hope our work demonstrates the promise of leveraging LMs in a controlled way to aid the dataset creation process, and encourages the community to think of dataset curation as an AI challenge itself 💡. Co-authored with @swabhz @nlpnoah @YejinChoinka 💟 6/6