“Utility is in the Eye of the User: A Critique of NLP Leaderboards” Paper Summary & Analysis

Discussion led by Katie Yang & Stephen Tse, Intelligent Systems subteam

Objectives / Goals of the Paper

NLP models have become increasingly complex over the last decade, with GPT, GPT-2, BERT, and GPT-3 using ever more parameters and computational power. GPT-3, the current SOTA model from OpenAI, has 175 billion parameters and cost an estimated $5 million to train, far more than is feasible for all but the largest industrial research labs. The constant drive for higher accuracy, fostered in part by the ubiquitous use of leaderboards in NLP, has meant that other factors important for practical model use have been neglected. Leaderboards measure a model's performance on a test set; since models are ranked solely by accuracy, researchers have a strong incentive to focus entirely on this metric, ignoring key considerations such as generalizability, efficiency, and fairness. Additionally, models may become unnecessarily complex just to eke out every last bit of accuracy.
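As a toy illustration of this incentive (not taken from the paper): the accuracy values below are hypothetical placeholders, while the parameter counts are the commonly cited figures for each model. A leaderboard that sorts by accuracy alone makes the largest model "win" even when the gain is marginal and the cost difference is enormous.

```python
# Toy sketch: a single-metric leaderboard hides the cost of small accuracy gains.
# Accuracy values are hypothetical placeholders; parameter counts are the
# commonly cited figures for each model.
models = [
    {"name": "DistilBERT", "params_millions": 66,  "accuracy": 0.88},
    {"name": "BERT-base",  "params_millions": 110, "accuracy": 0.89},
    {"name": "BERT-large", "params_millions": 340, "accuracy": 0.90},
]

# A typical leaderboard ranks by accuracy alone, so the largest model comes first,
# even though the cost column tells a very different story.
for m in sorted(models, key=lambda m: m["accuracy"], reverse=True):
    print(f'{m["name"]:<11} acc={m["accuracy"]:.2f}  params={m["params_millions"]}M')
```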

Paper Contributions

The authors recommend some combination of submission caps, reporting bias metrics alongside performance, dynamic test sets, and surfacing other statistics about a model, such as its size and computational cost, rather than reporting accuracy alone. A submission cap would prevent people from tuning their models to the test set, which amounts to hyperparameter tuning on held-out data rather than a genuine measurement of model performance. However, even with a cap in place, submitters can create multiple accounts for "dummy runs." Dynamic test sets would also force submitters to build more robust models that are not simply tuned to one specific test set. One concern with this strategy, however, is that regenerating the test set may reveal more information about the underlying data than a single static test set does. In addition, some might object to the extra randomness this introduces into the evaluation. One solution would be to have a "preliminary" test set (for leaderboard purposes) and a separate "final" test set used to produce the final rankings, as sketched below.
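The sketch below shows one way the submission cap and the preliminary/final split could fit together. All names (`Leaderboard`, `split_holdout`, `MAX_SUBMISSIONS_PER_TEAM`) and the cap of five submissions are illustrative assumptions, not details from the paper.

```python
import random

MAX_SUBMISSIONS_PER_TEAM = 5  # illustrative cap, not a value from the paper


def split_holdout(examples, prelim_fraction=0.5, seed=0):
    """Split hidden test data into a 'preliminary' set (shown on the public
    leaderboard) and a sealed 'final' set used only for the final ranking."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * prelim_fraction)
    return shuffled[:cut], shuffled[cut:]


class Leaderboard:
    def __init__(self, prelim_set, final_set):
        self.prelim_set = prelim_set
        self.final_set = final_set
        self.submission_counts = {}
        self.final_scores = {}

    def submit(self, team, predict_fn, score_fn):
        # Enforce the cap so teams cannot tune hyperparameters against the hidden set.
        used = self.submission_counts.get(team, 0)
        if used >= MAX_SUBMISSIONS_PER_TEAM:
            raise RuntimeError(f"{team} has used all {MAX_SUBMISSIONS_PER_TEAM} submissions")
        self.submission_counts[team] = used + 1

        # Only the preliminary score is revealed during the competition.
        prelim_score = score_fn(predict_fn, self.prelim_set)
        # The final score is recorded but kept sealed until rankings are frozen.
        self.final_scores[team] = score_fn(predict_fn, self.final_set)
        return prelim_score
```

Under this setup, a dynamic test set could be approximated by periodically re-running `split_holdout` with a new seed or with freshly collected examples.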

Paper Limitations, Further Research, & Potential Applications

The paper is not clear about how it measures a model's bias or cost efficiency. For example, BERT-base and BERT-large remain quite popular, despite the paper's insistence that DistilBERT is more practical. To explore this further, future work could draw on Prof. Joachims's current research [exposure of rankings, fairness in rankings] to examine what other factors users look for in models. Published research on fairness measures already exists, so future researchers can build on that knowledge base.
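One way to make such trade-offs concrete, purely as a sketch and not the paper's actual method, is to rank models by a user-specified utility that weighs accuracy against model size. The weighting and the accuracy values below are arbitrary assumptions (the same illustrative numbers as in the earlier sketch); only the parameter counts are commonly cited figures.

```python
import math


def utility(accuracy, params_millions, size_penalty=0.02):
    """Toy utility: accuracy minus a penalty that grows with log model size.
    size_penalty encodes how much a particular user values compactness."""
    return accuracy - size_penalty * math.log10(params_millions)


# Hypothetical accuracies; commonly cited parameter counts.
candidates = [
    ("DistilBERT", 0.88, 66),
    ("BERT-base",  0.89, 110),
    ("BERT-large", 0.90, 340),
]

for name, acc, params in sorted(candidates, key=lambda c: utility(c[1], c[2]), reverse=True):
    print(f"{name:<11} utility={utility(acc, params):.3f}")
```

With the default `size_penalty` of 0.02, BERT-large still ranks first; raising it to 0.05 puts DistilBERT on top, illustrating the paper's point that the "right" ranking depends on how much the individual user values compactness versus raw accuracy.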

Cornell Data Science is an engineering project team @Cornell that seeks to prepare students for a career in data science.