Baselines are important in Machine Learning. One criticism I’ve had of the NLP field for years is that all of the fancy Deep Learning (DL) was often just marginally better (if at all) than bag-of-words-based models for many of the common tasks.

I remember as an undergrad, I spent a long time studying the heck out of Socher & Manning’s Tree-RNN suite of papers (RAE, MV-RNN, RNTN). I remember being so impressed, only to then see Chris Manning present at NAACL 2015 and hear him casually mention that all those other fancy-math, linguistically-inspired tree networks couldn’t actually beat paragraph2vec (which is a pretty simple model that straight up ignores the ordering of words in a sentence)… but don’t worry because the new Tree-LSTM finally can now! That really stuck with me. It made me much more skeptical of “the hype.” And for what it’s worth, Manning wasn’t trying to pull a fast one (he was on a 2012 paper which showed that well-built bag-of-words models beat models like the RAE); I just missed that paper amid all of the DL hype. Now, I always want to see strong baselines. Not wimpy un-tuned Logistic Regression that anyone can beat by 20 points, but an actual assessment of “How hard is this task? How simple is this data?”

Of course this complaint is not limited to just the NLP community. I’ve also found myself very frustrated with weak baselines in ML for Health papers as well. Really, every research field will have projects which don’t do a good job evaluating their contribution. But the ML community should be trying as hard as it can to limit the number of unforced errors.

I decided to write this blog post to honor some lovely baselines I’ve seen over the years.

Case Study: Image Captioning

Also during my formative years in undergrad, I was working on an image captioning task with a team of grad students: given an image, generate a sentence describing it (e.g. “a man is playing a guitar”). Our model was as fancy as one could do in 2015: a CNN feature-extractor fed into an LSTM decoder to output a sequence of words. It was one of the most interesting things I’d seen in undergrad research.

Image captioning example from MS COCO. Input is this image and output is something like “a dog with a frisbee in its mouth with a leash attached to it.”

But as you know, Deep Learning language generation (especially back then) could produce something like “a man is a man is a guitar”, which left something to be desired, though it was still pretty cool that the model got most of it right. Then I came across a really interesting paper which looked at Nearest Neighbor methods for image captioning.

Devlin et al. 2015: When we use a CNN to extract features from an image, you can plot a t-SNE of those images and observe whether similar images have similar captions.

Rather than using deep models to decode our image representations into words, what if we just returned our “generated” caption to be the caption of the closest image in CNN-feature-extraction space? It’s actually slightly more involved than that, but not by much (they look at a few neighbors and select the “consensus” caption). As it turns out, you can do pretty well with that on a dataset where the sentence complexity is at the level of “a man is playing guitar” or “a train is stopped at a train station”.

Devlin et al. 2015: Decently close image captions without any language generation. The authors took a test image (which needs to be captioned), found some similar-looking images, and used either BLEU or CIDEr to find what the “consensus” caption was among the options available.
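To make the idea concrete, here is a minimal sketch of nearest-neighbor consensus captioning. It is not the authors' implementation: the function names are hypothetical, the feature vectors are assumed to be plain lists of floats that some CNN already produced, and a toy unigram-overlap score stands in for BLEU/CIDEr.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def caption_similarity(c1, c2):
    """Toy stand-in for BLEU/CIDEr (an assumption): unigram F1 overlap."""
    t1, t2 = Counter(c1.lower().split()), Counter(c2.lower().split())
    overlap = sum((t1 & t2).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(t1.values())
    recall = overlap / sum(t2.values())
    return 2 * precision * recall / (precision + recall)


def consensus_caption(query_features, train_features, train_captions, k=3):
    """Caption a query image by borrowing a caption from its k nearest training images."""
    if not train_features:
        raise ValueError("need at least one training image")
    # Rank training images by similarity to the query in feature space.
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: cosine(query_features, train_features[i]),
        reverse=True,
    )
    candidates = [train_captions[i] for i in ranked[:k]]
    if len(candidates) == 1:
        return candidates[0]

    # The "consensus" caption is the candidate most similar, on average, to the others.
    def mean_agreement(idx):
        others = [c for j, c in enumerate(candidates) if j != idx]
        return sum(caption_similarity(candidates[idx], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]
```

Note that nothing here generates language: every output is a caption some human already wrote for a different image, which is exactly why it makes such a useful sanity check.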

The authors evaluated this approach and they found that it performs almost as well as fancy decoding models when evaluating with standard Natural Language Generation metrics (e.g. CIDEr, BLEU, METEOR). They did note that in a human evaluation, the truly-generated captions were preferred. Still though, you can do surprisingly well with this simple approach. And that helps researchers diagnose poor model performance and task complexity:
– bad feature-extractors?
– bad language generation?
– low density of similar training caption examples?

I was impressed.

Case Study: Automatic “Reading Comprehension”

In 2015, Google DeepMind published Teaching Machines to Read and Comprehend so that the field could have some complex datasets allowing them to work on impressive problems like reading comprehension. The paper has over 1,000 citations.

Each entry is a multi-sentence passage for the model to read and then answer a fill-in-the-blank question.

I might’ve picked on Chris Manning earlier, but in all honesty he’s a great scientist. That’s why he’s one of the most famous and successful NLP researchers in the world. And in 2016, his team conducted A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task, which acknowledged that although DeepMind did provide some baselines in their initial paper, “we suspect that their baselines are not that strong.”

They show that a simple feature-based (i.e. non-Deep Learning) classifier can outperform nearly all of the deep models provided in the original paper. They look more closely at many of the examples and conclude that the dataset is easier than previously understood because of the way it was artificially constructed by bootstrapping “supervised” examples from news summary bullet points. In short, the dataset wasn’t really measuring “reading comprehension.”

Chen et al. 2016: A feature-engineered classifier did much better than the initially reported “baselines” in the 2015 paper. The first 8 rows come from the 2015 paper; the final row is the 2016 paper’s shallow classifier.

This example might be well-known to many because the Chen/Manning paper was a finalist for “Best Paper” at ACL 2016. In that paper, they also build their own deep model which reaches the same performance as their “human performance” manual review of 100 randomly-selected examples. They conclude that such a dataset is ill-suited for trying to challenge models to do “reading comprehension.” If someone wants to work on that task, they need a better dataset.

Thanks, strong baselines! You prevented the field from overfitting a series of models to a dataset that wasn’t rich/complex enough!

Case Study: Automatic Text Summarization

The following year, more of Manning’s students continued research with the CNN/DailyMail dataset. Sure, it was not well-suited for reading comprehension, but the articles and bullet point summaries are actually a natural fit for the task of Text Summarization. Using the same data example as above, the new task is for a model to ingest the “Context”/passage and generate the “Query”/summary.

The model that See et al. 2017 built combined the two competing paradigms in language generation at the time, LSTM-decoders and Pointer Networks, to make a Pointer-Generator Network. Each model had its own strengths and weaknesses, and the team was able to do even better by combining them. You can learn about the model at the author’s accessible blog post.

See et al. 2017: To my knowledge, the state-of-the-art for abstractive text summarization on the CNN/DailyMail dataset.
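As I understand the paper, the “combining” boils down to mixing two probability distributions at each decoding step: the decoder’s usual distribution over the vocabulary, and a copy distribution given by the attention weights over the source words. Here is a rough sketch of just that mixing step; the function and argument names are my own, and the real model learns p_gen and the attention weights rather than taking them as inputs.

```python
from collections import defaultdict


def pointer_generator_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix a generation distribution with a copy distribution over source tokens.

    p_gen:         probability of generating from the vocabulary (between 0 and 1)
    vocab_dist:    dict mapping word -> probability from the decoder's softmax
    attention:     attention weights, one per source token (summing to 1)
    source_tokens: the source words those attention weights point at
    """
    final = defaultdict(float)
    # Generating: scale the vocabulary distribution by p_gen.
    for word, prob in vocab_dist.items():
        final[word] += p_gen * prob
    # Copying: each source position contributes its attention mass to its word,
    # so out-of-vocabulary source words can still be produced.
    for weight, word in zip(attention, source_tokens):
        final[word] += (1.0 - p_gen) * weight
    return dict(final)
```

With p_gen close to 1 this behaves like a plain decoder (the “pen”), and with p_gen close to 0 it behaves like pure copying (the “highlighter”).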

In NLP, there are two kinds of summarization: abstractive and extractive. Abstractive summarization (think pen) generates new text to try to summarize/paraphrase the original source. Extractive summarization (think highlighter), on the other hand, copies snippets from the source and pastes them together without generating any new text of its own. As you might imagine, abstractive summarization is very hard. This work is an abstractive model which tries to leverage concepts from extractive summarization.

So how does it do? Better than any other abstractive model to date! That’s pretty cool. On the other hand… it still does worse than generating a “summary” which is just returning the first 3 sentences of the article. For what it’s worth, the best extractive model is able to beat that “lead-3 baseline” on its dataset of comparison… but not by much (39.6 vs 39.2).

See et al. 2017: Results show that the best-performing model for summarizing an article is… returning the first 3 sentences of the article.
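For a sense of just how simple that lead-3 baseline is, here is a sketch of it. The regex sentence splitter is my own naive assumption (real news text with abbreviations and quotes needs a proper sentence tokenizer), and the function name is hypothetical.

```python
import re


def lead_n_summary(article, n=3):
    """Return the first n sentences of an article as its 'summary'."""
    # Naive splitter (an assumption for this sketch): break on whitespace
    # that follows a '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s]
    return " ".join(sentences[:n])
```

No training data, no parameters, no GPU. It works as well as it does because news articles tend to put the most important information first.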

These super complex Deep Learning models can’t yet beat “literally just return the first 3 sentences of the article and pretend that is a summary.” What does that say about our models? What does it say about the dataset? What does it say about how hard summarization can be?

It’s worth noting poor performance could reflect weak models, an inappropriate task/dataset, or ineffective evaluation metrics. But if it’s a bad dataset or metric and the model is secretly good, we still shouldn’t make grand claims until we can back them up.

Case Study: Word Vector Analogies

In that vein, for our final example, let’s look at how baselines can debug an evaluation and show that a metric isn’t as good as you might have originally thought.

In 2013, Mikolov et al. released a series of papers defining word2vec, which took the NLP community by storm. The first such paper showed that they could learn interesting word vectors which seemed to have this “linear offset” structure that captured intuitive relationships. Word2vec became so popular (i.e. over 12,000 citations) that the word analogy task — especially the famous “King – Man + Woman = Queen” example — became a staple of word embedding evaluations.

How to go from the stated analogy to the set of instructions to run to see whether your word vectors are good enough to get the analogy correct.

The figure above outlines how word analogies “A is to B as C is to ___” were used to evaluate word embeddings. A lookup engine loads your embeddings, gets an analogy from a dataset, does the vector arithmetic, and finds the word whose embedding is closest to that result (excluding the three query words). The more queries that the lookup engine gets correct using your word vectors, the better your vectors must be.
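A bare-bones version of such a lookup engine might look like the sketch below. The function names are hypothetical, embeddings are assumed to be a dict from word to a list of floats, and “closest” is taken to mean highest cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ___' via c + (b - a) and a nearest-neighbor lookup."""
    target = [
        cv + bv - av
        for av, bv, cv in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def analogy_accuracy(embeddings, dataset):
    """Fraction of (a, b, c, expected) analogies the lookup engine gets right."""
    correct = sum(
        solve_analogy(embeddings, a, b, c) == expected
        for a, b, c, expected in dataset
    )
    return correct / len(dataset)
```

For example, `solve_analogy(embeddings, "man", "woman", "king")` computes king + (woman − man) and returns its nearest neighbor among all other words. Notice that the exclusion of the query words is baked into the engine itself, which matters for what comes next.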

Word analogy evaluations were not seen as the end-all-be-all metric, but performance on the task was taken relatively seriously for a while (and maybe to some people still is). But in 2016, I saw a paper at an ACL workshop which explored the analogy task a bit more critically. Essentially, the author wanted to see whether the lookup engine was actually measuring how well word2vec was encoding analogies. What effect was the nearest neighbor having in distorting the space? In the example below, the “linear offset” relationship doesn’t hold at all, yet because of how the space is structured, “screaming” would still be returned, and the example would be counted as a success.

Linzen 2016: When a* − a is small and b and b* are close, the expected answer may be returned even when the offsets are inconsistent (here screaming is closest to x).

The paper explored different query engines to try to debug what effect the nearest neighbor lookup was having to obscure the vector arithmetic: there are the standard Mikolov and Levy-Goldberg queries, as well as ignoring the vector offset, and also moving in the opposite direction of the offset. Using “Singular to Plural” in the figure below as an example, if you do c+(b-a) your accuracy is 80% but if you travel in the exact opposite direction with c-(b-a) you can still achieve 45%. Similarly, “Only-b” (e.g. instead of “king-man+woman” you instead only query “king”) achieves 1/3 to 1/2 of the performance of the full analogy.
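Here is a toy sketch of three of those query engines side by side. The vectors are made up by me to illustrate the failure mode (they are not data from the paper), the function names are hypothetical, and I stick with the “A is to B as C is to ___” naming from above, so the “Only-b” query corresponds to looking up only C here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest(embeddings, target, exclude):
    """Word whose embedding is most similar to target, skipping excluded words."""
    candidates = [w for w in embeddings if w not in exclude]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def query(embeddings, a, b, c, engine="add"):
    """Run one analogy query 'a is to b as c is to ___' with the chosen engine."""
    va, vb, vc = embeddings[a], embeddings[b], embeddings[c]
    if engine == "add":           # standard: c + (b - a)
        target = [z + (y - x) for x, y, z in zip(va, vb, vc)]
    elif engine == "opposite":    # move the wrong way: c - (b - a)
        target = [z - (y - x) for x, y, z in zip(va, vb, vc)]
    elif engine == "only_c":      # ignore the offset entirely
        target = list(vc)
    else:
        raise ValueError(f"unknown engine: {engine}")
    return nearest(embeddings, target, exclude={a, b, c})


# Toy vectors (invented): singular and plural forms sit very close together.
toy = {
    "cat": [1.0, 0.0, 0.0],
    "cats": [1.0, 0.0, 0.2],
    "dog": [0.0, 1.0, 0.0],
    "dogs": [0.0, 1.0, 0.2],
    "car": [0.7, 0.7, 0.0],
}

answers = {e: query(toy, "cat", "cats", "dog", e) for e in ("add", "opposite", "only_c")}
```

In this toy space all three engines return “dogs”: because the offset is tiny and the query words are excluded, the nearest remaining neighbor of “dog” is the expected answer no matter which direction you move in.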

Obviously, the full analogy engine has the highest accuracy, but I found this analysis incredibly interesting. It made me re-think something I’d spent years taking for granted. I really like Linzen’s paper!

Linzen 2016: Performance on different analogy types (y-axis) as you vary the query engine (x-axis).

As a total tangent, the day the author presented their paper at RepEval 2016 was the day I got to meet Levy and Goldberg in person (woohoo!).

I think it’s fair to say that word2vec analogies were overhyped. Ian Goodfellow literally wrote the book on Deep Learning, and even he had been misled about parts of the analogy task.

This tweet is from May 2019.

Of course, in fairness to Ian Goodfellow, he is a Computer Vision researcher, not an NLP person. If there was a miscommunication, it was that NLP folks didn’t challenge the word analogy task enough. That’s why we need baselines not just for understanding datasets but also for sanity-checking evaluation metrics.

Good Baselines are Good Science

I think the above work is honestly so cool! Each one “debugs” the complexity of tasks or datasets that many researchers use. And they remind us to have humility in the models that we build and the results that we report.

In all honesty, baselines aren’t sexy. People don’t get grants for baselines just like they don’t get grants for replication. But they’re very important.

As a scientific community, we need to be the ones policing our own hype. Sometimes a new model comes along and it’s seriously revolutionary (e.g. Computer Vision in 2012). But in NLP, we spent a long time between 2012 and now trying to make Deep Learning work, but weren’t seeing the same record-shattering performance as in Vision. Maybe that’ll be different now with BERT and Transformers. With TA’ing the past two semesters, I really haven’t explored that yet, but GPT-2 definitely has seemed impressive. But whatever we do, we should compare against good baselines.

