“Utility is in the Eye of the User: A Critique of NLP Leaderboards” Paper Summary & Analysis

Discussion led by Katie Yang & Stephen Tse, Intelligent Systems subteam

Objectives / Goals of the Paper

NLP models have become increasingly complex over the last decade, with GPT, GPT-2, BERT, and GPT-3 using ever more parameters and computational power. GPT-3, the current SOTA model from OpenAI, has 175 billion parameters and reportedly cost several million dollars to train, far more than is feasible for all but the largest industrial research labs. The constant drive for higher accuracy, fostered in part by the ubiquitous use of leaderboards in NLP, has meant that other factors important for practical model use have been neglected. A leaderboard measures the performance of a model on a fixed test set; since models are ranked solely by this metric, researchers have a strong incentive to focus on it exclusively, ignoring key qualities such as generalizability, efficiency, and fairness. Models may also become unnecessarily complex just to eke out the last bit of performance.

The authors therefore define a framework, grounded in microeconomics, for understanding the usefulness of NLP models to “NLP practitioners,” the consumers of these models. They define “utility” as the benefit a practitioner gains from using a model for a task. Their goal is then to propose ways of constructing leaderboards that account for the diversity of use cases across practitioners.
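To make the utility idea concrete, here is a minimal illustrative sketch of a linear utility function over model attributes. The model names, attribute values, and weights below are invented for demonstration and are not from the paper; the paper’s framework allows any practitioner-specific utility function.

```python
# Illustrative sketch: practitioner utility as a weighted sum of
# (normalized) model attributes. All numbers here are hypothetical.

def utility(attrs, weights):
    """Weighted sum of model attributes under a practitioner's weights."""
    return sum(weights[k] * attrs[k] for k in weights)

models = {
    # accuracy and "cheapness" (inverse cost) on a 0-1 scale, invented
    "big_model":   {"accuracy": 0.92, "cheapness": 0.2},
    "small_model": {"accuracy": 0.88, "cheapness": 0.6},
}

# A practitioner who cares almost entirely about accuracy:
accuracy_first = {"accuracy": 0.95, "cheapness": 0.05}
# A practitioner with a tight compute budget:
budget_first = {"accuracy": 0.4, "cheapness": 0.6}

def rank(weights):
    """Return the model with the highest utility under these weights."""
    return max(models, key=lambda m: utility(models[m], weights))

print(rank(accuracy_first))  # -> big_model
print(rank(budget_first))    # -> small_model
```

Two practitioners facing the same leaderboard thus prefer different models, which is exactly why a single accuracy ranking cannot serve every consumer.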

The paper references prior work identifying the problem that top-ranked models on machine learning leaderboards are often brittle, biased models that overfit the test data, and it surveys the solutions those works propose before suggesting its own. These solutions range from certificates of performance against adversarial examples, to separate metrics that judge model bias, to changes to the submission process itself, either limiting submissions to prevent hyperparameter tuning against the test set or using a dynamic test set.

Paper Contributions

The authors recommend some combination of limited submission caps, bias reporting, dynamic test sets, and publishing other statistics about each model, such as model size and computational cost, rather than pure test-set performance alone. A submission cap would discourage tuning models to the test set, which amounts to hyperparameter tuning rather than a genuine measurement of model quality; even so, submitters can evade a cap by creating multiple accounts for “dummy runs.” Dynamic test sets would force submitters to build more robust models that are not simply tuned to one fixed test set. One concern with this strategy is that it reveals more information about the underlying dataset than a static, hidden test set does; another is that the added randomness in evaluation may draw complaints. One solution would be to keep a “preliminary” test set for leaderboard purposes and a separate “final” test set used to produce the final rankings.
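The preliminary/final split described above could be sketched as follows. The function names, the random split, and the toy task are our own invention for illustration, not a mechanism specified in the paper.

```python
# Hypothetical sketch of the two-test-set idea: a "preliminary" split drives
# the public leaderboard, while a held-out "final" split decides the final
# ranking, limiting how much submitters can tune to leaderboard feedback.
import random

def split_test_set(examples, prelim_fraction=0.5, seed=0):
    """Randomly partition a test set into preliminary and final portions."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * prelim_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

# Toy task (hypothetical): the label is 1 exactly when the input is positive.
test_set = [(x, int(x > 0)) for x in range(-10, 10)]
prelim, final = split_test_set(test_set)

model = lambda x: int(x > 0)  # a toy "model" that solves the toy task
public_score = accuracy(model, prelim)  # shown on the public leaderboard
final_score = accuracy(model, final)    # used only for the final ranking
```

Because submitters only ever see feedback from the preliminary portion, gains that come from overfitting the leaderboard should fail to transfer to the final score.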

Differentiating itself from other papers on similar topics, this paper acknowledges that although leaderboards do not value certain model attributes, that does not mean NLP researchers have ignored those problems. It also proposes concrete ways leaderboards could improve their rankings and reporting for the benefit of NLP practitioners.

Paper Limitations, Further Research, & Potential Applications

The paper is not clear about how the bias and cost efficiency of a model should be measured. For example, BERT-base and BERT-large remain fairly popular despite the paper’s argument that DistilBERT is more practical. To explore this further, one might draw on Prof. Joachims’s current work [exposure of rankings, fairness in rankings] to examine what other factors users look for in models. Published research on fairness measures already exists, so future researchers can build on that knowledge base.

The ideas from this paper can be applied to improve current model-evaluation leaderboards. This is an especially impactful application for NLP, given how prevalent leaderboards are in the field, but the paper’s points also apply to leaderboards for models of all kinds, such as those on Kaggle.




Cornell Data Science is an engineering project team @Cornell that seeks to prepare students for a career in data science.
