ML4H 2019: Machine Learning for Health

A couple weeks ago, I attended the Machine Learning for Health workshop at NeurIPS 2019! It was a lot of fun. 

The workshop was recorded, and the videos are already available on the website, which is tremendously helpful for: researchers who couldn’t attend because of visa or conference capacity limitations, young researchers who couldn’t attend but are looking to understand the field, or anyone else who wasn’t able to attend.

I took some notes on each talk, because there are 5-6 hours of video (with no discernible way to speed it up to 2x). This post is ~9,000 words, which takes only 35-45 minutes to read. I tried to use accessible language to explain the concepts from the videos, though I invite you to use this as a guide to identify which talks you are especially interested in watching at their full length.

Please let me know if you identify any typos or misunderstandings (my twitter handle is @willieboag).

Introduction (Brett Beaulieu-Jones)

@beaulieujones

The workshop’s General Chair, Brett Beaulieu-Jones, started the event by recognizing that the event was held on unceded territory of the Coast Salish peoples (a recognized protocol for guests to acknowledge the host nation, its people, and its land). Additionally, he pointed us towards the NeurIPS code of conduct, including the hotline email NeurIPShotline@gmail.com to report inappropriate behavior. 

Brett gave an overview of the submissions and acceptances: abstract acceptance was 34.3% (68/198) and full paper acceptance was 17.1% (19/111). There was a marked increase in papers consulting or being authored by clinicians. And this year half of the papers used multiple datasets (up from just 16% last year). Wow!

Digital Biology (Daphne Koller)

@DaphneKoller

The first invited speaker was Daphne Koller, talking about her new work on Machine Learning for Drug Discovery. Drug discovery has been slowing down exponentially, despite humanity’s great progress on finding breakthrough treatments in the last 50 years. This trend is also known as Eroom’s Law (i.e. Moore’s Law backwards). It is very difficult to know which drugs to pursue (because of the great time and expense involved), so it would be very helpful to have a “compass” to help guide which directions to take. Fortunately, as she pointed out, these sorts of heuristic, pattern-matching guesses are exactly what Machine Learning can provide.

She described how the convergence of the “ML Revolution” and the ability to generate biological data (using tools like CRISPR) has created an inflection point in “Digital Biology.” At her new startup, they have a data factory for cell lines, where they are able to perturb cells using CRISPR and then phenotype them with RNA sequencing or imaging. Using deep learning, they create a low-dimensional manifold on which the cells exist. Within this “remarkable organization” of the data, they then cluster the cells to try to identify which structures are similar. This would be especially useful for differentiating labels that are very heterogeneous (e.g. we now know that “breast cancer” isn’t just one thing) and probably shouldn’t be treated with one-size-fits-all drugs.

Figure: Once the cells have been perturbed to explore the space, deep learning can find interesting cell representations, and then those representations can be clustered, as shown here.

I don’t know as much about the economics of pharmaceutical research, though I’m reminded of a presentation I saw about the financial constraints of drug discovery. As Daphne explains, investing in R&D for drugs is boom-or-bust: 95% of the time, you waste your hundreds of millions of dollars of investment, but 5% of the time, it pays off 60-fold (note: numbers may vary depending on drug type). It’s technically a very good “return on investment”, but unless you have enough money to take many shots on goal, you’ll likely end up as a bust. The talk above proposes financing models (e.g. a really clever idea to issue bonds on a cancer drug that will pay off with 98% success rate). In the financial constraints talk, Andrew Lo describes the mega-fund of many shots on goal as essentially a collective action problem, because individual companies can’t take 150 shots on goal at once. Dr. Koller’s work is interesting because it’s not trying to solve that collective problem but rather to increase the 5% individual success rate into something higher. As an analogy, the mega-fund would be like an industrial-grade tomato processing machine that efficiently processes tomatoes but only for large farms that can afford it, whereas Daphne’s efforts would be analogous to new tomato seeds that increase the yield for any farmer, regardless of their farm size. Though I do wonder how successful the drug discovery “compass” would need to be in order to start paying off (e.g. if it raises the chance from just 5% to 6%, is that enough to discover drugs at an observably faster rate?).
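As a quick back-of-the-envelope check on that boom-or-bust framing (my own arithmetic, using the rough numbers quoted above), the expected return is strongly positive even though most individual bets go to zero:

```python
# Rough expected-value arithmetic for the boom-or-bust claim above
# (illustrative numbers only, as quoted in the talk).
p_success, payoff_multiple = 0.05, 60

expected_multiple = p_success * payoff_multiple + (1 - p_success) * 0
print(expected_multiple)  # 3.0: on average you triple your money...
print(1 - p_success)      # ...but 95% of individual bets return nothing
```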

Daphne then gave a whirlwind tour of the initial results that her startup has achieved so far, and then concluded with a philosophical reflection on the progression of science: there were periods where one or two fields were able to make incredible amounts of progress over a short period of time:

  1. Late 1800s: Chemistry (moved away from Alchemy into understanding elements)
  2. 1900s: Physics (understood the connections between matter & energy, space & time)
  3. 1950s: Computing (silicon chips allowed advanced computations)
  4. 1990s:
    1. Microbiology (microscopy, sequencing allowed measurement at new scale)
    2. “Era of Data” (computing PLUS stats and optimization)
  5. Next: Digital Biology (reading, interpreting, and writing biology using new technologies)

Spotlight: Transfer Learning for Medical Imaging (Maithra Raghu)

@maithra_raghu

Maithra talked about her work Transfusion: Understanding Transfer Learning for Medical Imaging, which tries to understand why transfer learning seems to work well on some tasks. Transfer learning involves pretraining on a task (e.g. ImageNet) and fine-tuning on your task of interest (even something clinical, such as MRIs). As we’ve seen, this approach has become standard practice in Vision (with pre-trained ImageNet models) and NLP (with BERT). But one thing that is not clear is why a pre-trained ImageNet model would do well on very different domains, such as medical tasks (e.g. reading MRIs). This work runs a few experiments to try to probe this phenomenon further.

They began with simple performance evaluations as a starting point. Using standard ImageNet architectures on chest x-ray and fundus photo data, they showed that randomly initialized weights achieve similar performance to a pre-trained network. However, models with pre-trained weights converge faster than randomly initialized models. But this might not be because the weights themselves are good — for instance, it could be that the scaling of the weights yields better gradient flow. To test this, they created a third model with randomly sampled weights, but sampled in a way that preserves the scaling and variance of the pre-trained weights. They found this scale-preserving network also converges notably faster than the random initialization.
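To make that third condition concrete, here is a minimal sketch (my own, not the authors’ code) of sampling random weights that preserve each pretrained layer’s scale while discarding its learned structure:

```python
import torch

def scale_preserving_init(pretrained_state_dict):
    """For each pretrained weight tensor, sample i.i.d. Gaussian weights with
    the same per-layer mean and standard deviation. The learned structure is
    destroyed, but the scale (and hence gradient flow) is preserved."""
    new_state = {}
    for name, w in pretrained_state_dict.items():
        if w.dtype.is_floating_point:
            new_state[name] = torch.randn_like(w) * w.std() + w.mean()
        else:
            new_state[name] = w.clone()  # e.g. integer buffers left untouched
    return new_state
```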

Additionally, they compared transfer learning for small models against transfer learning for large models. They studied this by looking at the correlation between the weights of a model before training vs after training. They found that (even when randomly initialized) large models had higher correlations between their initial and final weights. This means the larger models were changing less during the fine-tuning process.
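A sketch of that diagnostic (my own implementation of the idea, not theirs) is just the Pearson correlation between a layer’s weights at initialization and after fine-tuning:

```python
import numpy as np

def weight_correlation(weights_before, weights_after):
    """Pearson correlation between a layer's weights before and after
    fine-tuning; values near 1 mean training barely moved the weights."""
    a = np.asarray(weights_before).ravel()
    b = np.asarray(weights_after).ravel()
    return np.corrcoef(a, b)[0, 1]
```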

Figure: Showing the correlation of model weights before vs after training.

I like this work! In general, I’m a fan of kicking the tires on a model or dataset to see whether you get unexpected results. Science is often driven by tinkerers who are trying to figure out why something is happening. There is still more work to be done, but I appreciated this presentation!

Spotlight: Survival Regression (Xinyu Li and Chirag Nagpal)

I couldn’t find a twitter profile for Xinyu. Chirag’s handle is @nagpalchirag.

Xinyu described her work in survival analysis, Deep Survival Experts: A Fully Parametric Survival Regression Model. In case you’re not familiar with survival analysis, she gives a brief overview, but I’ll summarize her overview here: it’s like an ordinary regression problem for predicting time to an event, except that some data points are censored (e.g. they never came back to the doctor, so you don’t know whether they died or not; you only know the most recent time you saw them alive). It’s actually slightly more complicated than that, because they predict more than just one event (e.g. time to breast cancer, time to cardiovascular disease, etc).

Figure: Typical survival analysis setup. Patients 2 and 3 are censored before any events occurred. Patient 1 has a known cardiovascular disease event at time T1. Patient 4 has a known event of breast cancer at time T4.

Xinyu then gave a tour of previous work in survival analysis modeling, including Kaplan-Meier, Cox Proportional Hazards, DeepSurv (2017), and DeepHit (2018). These models have increasing complexity in how the covariates interact with the baseline hazard (e.g. log-linear vs deep learning) and in multi-event modeling. Earlier models assumed that the hazard ratio is constant over time (the proportional hazards assumption, which is very unlikely to hold in the medical domain), but DeepHit’s solution of discretizing time into buckets doesn’t scale well to long time horizons. They propose a new approach: Deep Survival Machines.

Chirag described their model. Illustrated in the figure below, it works as a mixture over many parametric survival distributions (i.e. fit a bunch of models, potentially on different time scales, and then at test time, attend to whichever model is most relevant). Parameter inference is performed separately for censored and uncensored data, where both terms can be written in closed form (e.g. via the PDF or CDF), which allows them to get a gradient and optimize the entire model with gradient descent.
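To make the censored/uncensored split concrete, here is a hedged sketch of the kind of loss this implies, using a mixture of Weibull distributions (my own simplification; the paper’s actual parameterization and priors differ):

```python
import torch

def mixture_weibull_survival_loss(event_time, is_censored, shape, scale, mix_logits):
    """Sketch of a mixture-of-Weibulls survival loss (not the authors' code).

    shape, scale: (batch, K) positive Weibull parameters predicted from covariates.
    mix_logits:   (batch, K) unnormalized mixture weights (the softmax gate).
    Uncensored points contribute the log pdf; censored points contribute the
    log survival function (probability of surviving past the censoring time).
    """
    t = event_time.unsqueeze(-1)                      # (batch, 1)
    log_w = torch.log_softmax(mix_logits, dim=-1)     # log mixture weights
    log_surv = -(t / scale) ** shape                  # Weibull log-survival
    log_pdf = (torch.log(shape) - torch.log(scale)
               + (shape - 1) * (torch.log(t) - torch.log(scale))
               + log_surv)                            # Weibull log-pdf
    ll_uncensored = torch.logsumexp(log_w + log_pdf, dim=-1)
    ll_censored = torch.logsumexp(log_w + log_surv, dim=-1)
    ll = torch.where(is_censored, ll_censored, ll_uncensored)
    return -ll.mean()
```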

Figure: There are multiple deep network survival models (in red) that predict from the covariates, X. Additionally, there is a softmax (blue) over X which selects the mixture weights that interpolates between those aforementioned models.

You might be surprised by how many simple concepts in academia are obfuscated to sound fancier than they actually are. Xinyu and Chirag explained their motivation so clearly that the model seemed like an obvious solution, which is a sign that they did a great job. Additionally, this is a really nice example of what the Machine Learning for Healthcare community tries to do in general: take existing models which struggle with domain-imposed constraints (e.g. poor scaling to long time horizons) and use that to motivate the development of a new model that performs better!

Predicting Acute Kidney Injury (Cian Hughes and Nenad Tomasev)

Dr. Hughes’s twitter handle is @CianH. Dr. Tomasev’s handle is @weballergy.

This presentation by Cian and Nenad was a tale of two projects, both about Acute Kidney Injury (AKI). In the first half, Cian talked about an app they developed (NPJ Digital Medicine) to help improve the treatment of AKI in the clinic without using any Machine Learning (just a simple rule-based definition of AKI). For the second half of the talk, Nenad described a proof-of-concept study (Nature) they did on retrospective data to show that ML could help identify cases of AKI up to two days before they occur, which they expect would help with planning and management of treatment if it gets deployed.

When I was listening to their talk, I was struck by how similar it sounded to the many sepsis projects that the ML for Healthcare community has been working on for years: an acute problem that affects millions of people, existing efforts to do better with national mandates (i.e. the National Patient Safety Alert for AKI vs the Surviving Sepsis campaign and SEP-1 bundles for sepsis), and ML researchers working on early detection. Nenad’s talk reminded me a lot of the set of papers by Joe Futoma et al. (ICML 2017, MLHC 2017) predicting sepsis early using RNNs. Both of these approaches did a lot right, including simulating a prospective study to see how early the method is able to catch the onset:

Figure: From Nenad’s talk showing that the model suspects AKI hours before the clinical definition of AKI onset.
Figure: From Futoma et al. 2017 showing that the model suspects sepsis hours before the clinical definition of sepsis onset.

Nenad went into detail about all of the well-executed experiments they did, including the simulated prospective study, evaluating on heldout sites, and a discussion on how AUC isn’t very clinically meaningful because so much of that region isn’t “clinically appropriate.” He also noted that because the VA population is male-dominated, the model currently performs worse on female patients. They haven’t actually deployed this model, though I think they hope to.

Personally, I was most interested in Cian’s talk about the Streams app that they built to create “digital pathways” for standardized care protocols and specialist response teams. He mentioned their paper which evaluated the tool and found that it helped patients receive treatments faster and have shorter lengths of stay. I think work like that is among the most important things Computer Scientists should be focusing on; we don’t need fancy Deep Learning for many of these problems. Atul Gawande’s amazing book The Checklist Manifesto (note: audiobooks are your friend!) emphasizes that for a large fraction of medicine, the problem isn’t that people don’t know what to do; it’s that to save lives you need to do things without mistakes and without letting things slip through the cracks. Even simple checklists — if well-designed to enable better communication and culture — can often deliver the lion’s share of the improvement. We see the same thing with California’s tremendous efforts, which cut maternal mortality rates to ⅓ of what they used to be over the last two decades.

Obviously, I’m also excited to see how much better the care team could do using ML to forecast AKI as opposed to reacting to the rule-based definition, which tends to be met after some damage has already been done. Looking forward to hearing more. Also, I think it was a really wise move to partner with the VA for their prediction research.

Spotlight: Privacy-Preserving Human Fall Detection (Umar Asif)

I couldn’t find a twitter handle for Dr. Asif.

The goal of this work is fall detection. In particular, they generated synthetic data of stick figures and used a physics engine to model the figures falling. This was their training data, and they evaluated performance on two real, public datasets.

In their evaluation, they investigated two questions:

  1. How much is performance hurt by training on synthetic data vs training on real data?
  2. Does pose estimation (as opposed to direct pixel space prediction) improve generalization across datasets?

It seems pretty suspicious that they do *better* with synthetic (as opposed to real) data on pose inputs for dataset 2; perhaps there is large variance in the evaluation. It’d be interesting to see confidence intervals to better understand that. For generalization, they find that pose estimation improves performance, especially for dataset 2. Once again, I’m a little surprised by some of these results, in particular: why does dataset 2 have such low accuracy in pixel space but not pose space for both training datasets? Is the use of accuracy (as opposed to AUC, F1, etc) hiding some kind of label distribution issue? If they trained on a heldout segment of dataset 2, should we expect much better performance, or is dataset 2 simply difficult to measure? I’d love to see the evaluation fleshed out a little bit more.

Figure: Accuracy of models for: different training data (real dataset 1 vs synthetic), different input representation (original RGB pixel space vs stick figure poses), and on two different heldout datasets.

He ended his talk with a discussion of limitations, which is great practice; every paper should try to do this. Umar mentioned they’d like to model depth. In case they’re not familiar: at the NeurIPS ML4H 2017 workshop, Fei-Fei Li talked about her group’s work using depth sensors to deidentify videos for privacy-preserving smart hospitals (MLHC 2017). Even earlier, Zhang et al. published Privacy Preserving Automatic Fall Detection for Elderly Using RGBD Cameras in 2012. These works were not cited in this paper, though they might be useful places to look when developing future work.

Figure: A figure from Zhang et al. 2012’s work on depth-based Privacy Preserving Fall Detection.

Spotlight: Embarrassingly Simple Exploitation of Segmentation Supervision (Paul Jäger)

@pfjaeger

Paul gave a really clear presentation about his paper Retina U-Net: Embarrassingly Simple Exploitation of Segmentation Supervision for Medical Object Detection. He gave great motivation for the work and a comprehensive evaluation. The proposed model is able to perform the appropriate clinical task while still using the provided fine-grained annotations! And the code is publicly available as a toolkit. I liked this talk a lot!

Semantic Segmentation is an extremely popular approach for tasks in Medical Image Analysis, but these pixel-level predictions don’t actually answer the kinds of questions that are typically of clinical interest. The task of interest might be patient-level (e.g. cancer or no cancer) but the annotations are of nodules at the pixel-level. Reducing those annotations to the patient-level would be a large loss of information to give to the model, which is especially challenging because medical datasets are often much smaller than general domain ones. This results in a frequent mismatch between the prediction (and the evaluation metric that necessarily follows) and the clinical task that you actually care about. In Medical Imaging, pixel-level metrics like Dice Score are often incorrectly used to answer clinical tasks at the object-level. Evaluating models on clinically relevant scales is an essential part of impact-driven research in our field. 

Figure: There are different scales that we could model, and each scale has its own kinds of questions that it can answer that are of clinical interest. We evaluate different scales using different metrics (e.g. classification correctness vs counting objects vs pixel overlap).

Choosing a beneficial scale for training is a valid approach, so long as predictions are transferred to the clinical scale for evaluation. However, as illustrated by this figure below, it is not always obvious how to aggregate these predictions into an object-level score. Should it be max-value? Average value? Majority vote? 
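As a toy illustration of that ambiguity (the heuristics below are my own examples, not the paper’s method): the same set of pixel scores can yield very different object-level decisions depending on the rule.

```python
import numpy as np

def aggregate_pixel_scores(pixel_probs, strategy="max"):
    """Aggregate per-pixel malignancy scores for one candidate object into a
    single object-level score, using a few common ad hoc heuristics."""
    pixel_probs = np.asarray(pixel_probs)
    if strategy == "max":
        return pixel_probs.max()            # one confident pixel decides
    if strategy == "mean":
        return pixel_probs.mean()           # diluted by easy background-ish pixels
    if strategy == "majority":
        return float((pixel_probs > 0.5).mean() > 0.5)  # vote of thresholded pixels
    raise ValueError(strategy)

# The three strategies can disagree on the very same object:
scores = [0.95, 0.2, 0.3, 0.4]
print([aggregate_pixel_scores(scores, s) for s in ("max", "mean", "majority")])
```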

Figure: Simple demonstration of the challenge in aggregating many pixel-level predictions into a single object-level prediction.

Rather than trying to do an ad hoc aggregation heuristic, this work learns the aggregation in an end-to-end manner. They propose Retina U-Net, which essentially adds a pixel-wise loss to the standard RetinaNet for object detection. This hybrid model addresses the challenge of aligning a model’s output structure to the scale of a clinical task, but it also maintains data-efficient training on small datasets by using the pixel-wise annotations. They evaluated this method on two datasets: LIDC and an in-house Breast MRI dataset. The experiments were very comprehensive, testing both 2D and 3D, as well as against segmentation+aggregation baselines and against instance segmentation.
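The core of that hybrid idea is simple enough to sketch (this is my own illustration of the combined objective, not the released toolkit code): keep the detection losses and add an auxiliary pixel-wise segmentation loss so the fine-grained annotations still supervise the feature extractor.

```python
import torch.nn.functional as F

def retina_unet_style_loss(det_class_loss, det_box_loss,
                           seg_logits, pixel_labels, seg_weight=1.0):
    """Combine object-detection losses with an auxiliary pixel-wise
    segmentation loss computed on the decoder's feature maps."""
    seg_loss = F.cross_entropy(seg_logits, pixel_labels)  # (N,C,H,W) vs (N,H,W)
    return det_class_loss + det_box_loss + seg_weight * seg_loss
```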

Figure: Their proposed Retina U-Net model gets the best of both worlds: losses for both the pixel-level feature extractors and also the more coarse object detection task.

Spotlight: Generating Synthetic Images of Rare Classes (Jerry Wei)

I couldn’t find a twitter profile for Jerry.

Jerry presented his work on Generative Image Translation for Data Augmentation in Colorectal Histopathology Images. Essentially, they want to tackle the problem of rare classes in small datasets by using GANs to generate synthetic images of those rare classes. They used CycleGAN to generate images of rare examples (adenomatous polyps) from normal examples (normal colonic mucosa), similar to how, in the general domain, CycleGANs learn to generate images of zebras from horses. Using an image-to-image translation model provides a starting structure for the generated images, because adenomatous polyps always originally began as normal colonic mucosa.

Using the augmented data for training, they improved downstream performance on identifying whether an image is adenomatous polyps or normal colonic mucosa. However, they decided to test whether training on *just* the synthetic images would achieve good performance, and they found a large decrease in performance without any real images in the training set.

My favorite part of this work was their “Pathologist Turing Test”, where they asked pathologists to spot which images were real and which were generated. Some of the doctors did indeed show significant ability to spot the fakes, but most of them didn’t show huge expertise at the task. I appreciated that they noted limitations with this study, namely that pathologists are not trained to detect real vs fake images, and that they were shown only fixed-size tiles (not whole slides) for the question. Even still, I was curious what Pathologist 1 from Task B saw that allowed them to perform so well on this task, so I checked the paper:

“Based on feedback from pathologists, fake sessile serrated adenoma images were easier to identify because our CycleGAN model created a subtle mosaic-like pattern in the whitespace of images. Sessile serrated adenomas tended to have more whitespace because they are defined by a single large crypt (of mostly whitespace), which might explain why it was easier to detect fake sessile serrated adenomas than tubular adenomas.”

Figure: The Pathologist Turing Test to see whether doctors could spot the reals from the fakes.

Models of Cognition (Emily Fox)

I couldn’t find a twitter profile for Dr. Fox.

Emily presented two projects about modeling cognition: one at Apple (predicting cognitive impairment from smartphone activity) and one at the University of Washington (building a generative model for co-activated regions in the brain). This was a very interesting talk, and she did a really great job presenting. I especially liked her slides & visuals.

For the first project, she talked about the passive sensing capabilities of the iPhone (though she alluded to the opportunity for intervention-based experiments with smart and wearable technology). They ran a study with 113 affirmatively-consenting iPhone users, where the researchers built an ML model to predict whether a user has Alzheimer’s Disease based on the sequence of apps that they access over a series of sessions. A prediction was made at the person-level, where a person was represented as a set of “sessions”, and each session was a sequence of apps that the user accessed between unlocking and locking their phone. As shown in the figure below, a session was represented as the average embedding of the apps in that session, though they also ran experiments using a simple bag-of-words count-based session representation as well. The sessions were then clustered using k-means to assign a cluster label to each session. Finally, the person-level representation was formed by counting the number of sessions of each cluster label that a given person had (e.g. 3 sessions from cluster #1, 0 sessions from cluster #2, etc).

Figure: A single session (i.e. sequences of apps used) is represented as the centroid embedding of the apps in the session. Each app’s embedding is learned using the word2vec algorithm (where you have a target/center “word”, surrounded by context “words” that you want to predict).
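Here is a minimal sketch of that whole pipeline as I understood it (the toy apps, random embeddings, and cluster count are mine; the real system learned word2vec-style app embeddings from actual usage logs):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder app embeddings; in practice these would be word2vec-style
# vectors learned from sequences of app launches.
app_embeddings = {"mail": np.random.randn(16),
                  "maps": np.random.randn(16),
                  "photos": np.random.randn(16)}

def session_vector(session):
    """A session = list of apps; represent it as the mean app embedding."""
    return np.mean([app_embeddings[a] for a in session], axis=0)

sessions_by_person = {
    "person_1": [["mail", "maps"], ["photos"]],
    "person_2": [["maps"], ["mail", "photos", "mail"]],
}

# Cluster all session vectors, then represent each person as a histogram
# of how many of their sessions fall into each cluster.
all_sessions = [session_vector(s) for p in sessions_by_person.values() for s in p]
kmeans = KMeans(n_clusters=2, n_init=10).fit(np.stack(all_sessions))

def person_vector(sessions, k=2):
    labels = kmeans.predict(np.stack([session_vector(s) for s in sessions]))
    return np.bincount(labels, minlength=k)

print({p: person_vector(s) for p, s in sessions_by_person.items()})
```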

The results of this experiment (shown below) are interesting. They found that the embedding representation performed much better than a one-hot encoding (though it’s not clear to me whether they still did the clustering for that baseline). I’m a bit surprised that the AUC could be as good as 0.8 for just N=113 split between train/test, especially since the centroid embedding approach disregards the sequence order of app usage. One of the most interesting parts of this was that she highlighted the Apple “ResearchKit” tool, which essentially creates the infrastructure for other researchers to run experiments like this by recruiting people who affirmatively consent to volunteer their data to the researchers for the project. I think that idea opens up a lot of possibilities for potential research questions, and I like that it brings user consent to the forefront of every study. I’m interested to see what kind of impact this will have had 5 years from now.

Figure: Ablation study to see how well the embedding-cluster model performs compared to baselines.

The second paper she presented was early work in trying to model regions of the brain. Based on the current neuroscience literature, it seems that many complex behaviors are the result of multiple brain regions acting in correlated ways (analogous to how, when someone walks, multiple joint angles are controlled in a correlated manner). For this project, they built a variational autoencoder, which encodes an observation down to a low-dimensional representation and then decodes that representation with group-specific generators (e.g. one for the neck, one for the right elbow, etc).
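Here is a compact sketch of that architecture as I understood it (my own toy PyTorch version; the class name, layer sizes, and details are made up): a shared encoder produces one latent code, and each group gets its own small generator decoding from that shared code.

```python
import torch
import torch.nn as nn

class GroupVAE(nn.Module):
    """Minimal sketch of a VAE whose decoder is split into group-specific
    generators (one per joint group / brain region), all driven by the same
    shared latent code."""
    def __init__(self, group_dims, latent_dim=8):
        super().__init__()
        total = sum(group_dims)
        self.encoder = nn.Sequential(nn.Linear(total, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        # One small generator per group, all reading the shared latent z.
        self.generators = nn.ModuleList([
            nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, d))
            for d in group_dims])

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = torch.cat([g(z) for g in self.generators], dim=-1)
        return recon, mu, logvar
```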

Figure: The variational autoencoder, with group-specific generators.

The following figures show how this applies to the joint example (and to brain region activation, which is the phenomenon of interest). They enumerate the generated groups by latent dimension to see, for a given latent variable, which joints it controls. They found that for the brain data, some of the dimensions controlled known region networks in the brain.

Figure: How the group-specific generators show co-activations of regions when doing activities of interest.

Safety is the Foundation of Medical ML (Luke Oakden-Rayner)

Luke doesn’t have a twitter… I’m just kidding! If you’re not already following him, his handle is @DrLukeOR

Luke talked about safety, or as he said: “Machine Learning in Healthcare without constant thought towards safety is not actually Medical Machine Learning, it’s just ML that happens to be on medical [data].” In medicine, human experts are predisposed to watch out for the rare one-in-a-million cases, whereas ML is predisposed to the average cases, because that minimizes the loss function for the majority of training data points. You can read more extensively about this in his blog posts on Medical AI Safety here, here, and here.

He enumerated some of the factors that make Medical ML uniquely important, including:

  • High risk.
  • Long tail of rare, dangerous outliers. 
  • Human/user error is a major problem.
  • Potential for external/systemic risks. 
  • Experiments in controlled environments are not adequate to demonstrate safety.

Surprise twist!! This isn’t actually unique to Healthcare. In fact, these bullet points describe self-driving cars. Luke used this psych out to suggest that maybe we should be thinking about Medical ML more like self-driving cars and less like recommendation engines.

Luke began with an example of an AI tool that has been in medical practice for over 20 years, which allows us to compare its promise vs its impact. The figure below shows the 1996 FDA approval of computer-aided diagnosis (CAD) for breast mammography screening, which had demonstrated improved diagnosis when doctors used the system (left). Fast forward to 2015, and we find that the system doesn’t help the people who use it, and even leads to worse performance than humans on their own (right).

Figure: The figure on the left is from the 1996 FDA application, showing that humans using CAD performed better. The figure on the right shows that humans using CAD (red) actually perform worse than humans not using CAD (black).

One source of error is Human Error and Misuse. For the CAD scenario — even though humans were told only to use it to catch cancers that would otherwise be missed — humans began to over-rely on the system and began using it to rule out cancer. These “unintended consequences” happen frequently with technology, and when they are completely foreseeable, we shouldn’t say “Don’t look at me, I’m just the technical person. We told them not to do it, so that should be good enough.” As my own example, Q-tip boxes explicitly state not to put the swabs inside one’s ear… yet people constantly do it. The misuse there is entirely foreseeable. It reminds me of a great piece I read this year called Do Artifacts Have Politics?, which argues that technology necessarily changes dynamics and social structures, and those effects cannot be separated from the tools that enabled them.

Another source of error is what he calls Hidden Stratification (which he had a paper about at the workshop). This occurs when a given label is overly reductive and collapses too many distinct concepts into one umbrella group. For instance, the label “lung cancer” actually describes a broad category which contains dozens of different pathologies. Another example (shown in the figure below) is “pneumothorax” (which has a mortality rate around 30% when untreated). When a pneumothorax is treated (as is the case for the majority of instances where it appears in chest x-rays), things are okay, but untreated cases are critically important to identify. Unfortunately, the untreated cases are rarer, and so the ML training process does not fit to them; when evaluating ML models on the important cases, performance drops by about 10%.

Figure: Average performance for pneumothorax detection (0.87) hides the fact that performance is notably worse on the rare-but-important cases (0.77).
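A minimal sketch of how one might check for this (my own illustration with synthetic placeholder data, not their evaluation code) is to recompute the headline metric using only the critical subclass’s positives:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 2000)                  # pneumothorax vs not
subtype = rng.choice(["treated", "untreated"],     # hidden stratum within the
                     size=2000, p=[0.9, 0.1])      # positive class
y_score = np.clip(0.6 * y_true + 0.5 * rng.random(2000), 0, 1)

print("overall AUC:", roc_auc_score(y_true, y_score))

# Re-score using only the critical (untreated) positives vs all negatives;
# with real data this is where the ~10% drop described above would show up.
mask = ((y_true == 1) & (subtype == "untreated")) | (y_true == 0)
print("untreated-only AUC:", roc_auc_score(y_true[mask], y_score[mask]))
```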

For ML-based medical devices, regulators do not require Phase 3 trials (which is when the tool is tested in the real world) before marketing it commercially. Would anyone hop in a self-driving car that has never been tested on real roads? Unfortunately, the regulators have no interest in pushing towards requiring Phase 3. In pharmaceuticals, even after a drug clears Phase 2, it still has a 1-in-2 chance of never succeeding. For now, a minimum required effort is to talk to domain experts about what the rare, dangerous outliers are, and then to collect examples of those cases and test on them to ensure we can at least handle them.

Spotlight: Localization with Limited Annotation for Chest X-rays (Eyal Rozenberg)

I couldn’t find a twitter profile for Eyal.

There was a slight error with the video for the beginning of Eyal’s presentation, which makes it a little hard to follow (because it skips the task and motivation). Essentially, they want to improve disease localization by leveraging unannotated data. In that sense, it reminded me a bit of Paul Jäger’s spotlight earlier about combining annotations from different levels of granularity. Eyal’s paper is Localization with Limited Annotation for Chest X-rays.

The dataset (ChestX-ray14) has over 100k images labeled with 14 different diagnoses, though they also used a small annotated dataset (N=880) with localized bounding boxes. The baseline model performs localization (i.e. object detection) and image classification simultaneously with a joint loss, but they identify several potential issues with this approach. Many of their solutions are pretty intuitive: they require multiple predicted patches before returning a match (to reduce False Positives), they replace multiplication with addition to combat numerical underflow, they model contextual relationships between patches with a CRF (because the baseline treats patches as independent), and they do some kind of low-pass filtering (called “anti-aliasing”) between CNN layers so as to preserve shift invariance. In that sense, it seems to complement the Retina U-Net paper nicely.
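Of those tricks, the anti-aliasing one is the easiest to sketch; I believe it refers to the BlurPool-style idea of low-pass filtering before striding (this is my own minimal version, not the paper’s implementation):

```python
import torch
import torch.nn.functional as F

def blur_pool2d(x, stride=2):
    """Apply a small fixed binomial blur (low-pass filter) before striding,
    so downsampling is less sensitive to one-pixel shifts of the input."""
    k = torch.tensor([1.0, 2.0, 1.0])
    kernel = k[:, None] * k[None, :]
    kernel = (kernel / kernel.sum()).view(1, 1, 3, 3)
    c = x.shape[1]
    kernel = kernel.repeat(c, 1, 1, 1).to(x)   # depthwise: one blur per channel
    return F.conv2d(x, kernel, stride=stride, padding=1, groups=c)
```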

Figure: The final model, which involves their novel loss function (which imposes the multi-patch prediction constraint and uses addition instead of multiplication), anti-aliasing between convolution layers to preserve shift invariance, and a CRF layer on top to model contextual relationships between patches.

Spotlight: Pain Evaluation in Video using Multitask Learning (Xiaojing Xu)

I couldn’t find a twitter profile for Xiaojing.

Xiaojing gave a good presentation about predicting pain scores from video data. The UNBC-McMaster Shoulder Pain Dataset has 200 videos (for 25 subjects) with many human expert-labeled pain scores, both at the frame-level (e.g. PSPI, AU) and the video-level (e.g. VAS, OPR, SEN, AFF). The goal of this work is to predict the VAS score for a video, and this is done with a 3-stage network and multi-task learning to incorporate signal from all of the labels. You can read her paper Pain Evaluation in Video using Extended Multitask Learning from Multidimensional Measurements for more details.

Figure: Their architecture, showing the frame-level prediction (stage 1), the video-level aggregation (stage 2), and the ensemble over the other predicted pain scores (stage 3).

Her model works in 3 stages. The first stage is frame-level predictions, where the input is the image and the output is the PSPI score. For the 2nd stage, the sequence of PSPI scores are aggregated with statistics and fed into a single layer to predict VAS. The 3rd and final stage is an ensemble over the 4 predicted video-level scores to refine the VAS prediction. They evaluated baselines using Mean Absolute Error (MAE), and tested: stages 1+2 without multitask learning (MAE of 2.34), stages 1+2 with MTL (2.20), and the full model (1.95). They also found that humans were able to achieve an MAE of 1.76.
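As a rough sketch of what stage 2 means in practice (the particular statistics here are my own guess, not necessarily the paper’s), the variable-length sequence of frame-level PSPI predictions gets summarized into a fixed-length feature vector that a small regressor maps to VAS:

```python
import numpy as np

def video_level_features(frame_pspi_scores):
    """Summarize per-frame PSPI predictions into fixed-length statistics
    that feed the video-level VAS regressor."""
    s = np.asarray(frame_pspi_scores, dtype=float)
    return np.array([s.mean(), s.max(), s.std(), np.median(s),
                     np.percentile(s, 90)])

# e.g. a video whose frames show a brief spike of pain
print(video_level_features([0.1, 0.2, 3.5, 4.0, 0.3, 0.1]))
```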

I thought this work was pretty straightforward. I appreciated that it had a clear goal (i.e. predict VAS from video-level data), though I have no way of evaluating whether that was the best goal (I’ll assume it was). I do wonder, however, how generalizable this will be to other tasks, which might not have numerous additional labels/scores that could be used for multi-task prediction. That said, it could be the case that the pain scores themselves aren’t the value-add, but rather that predicting any reasonable additional labels helps the model with training because of gradient flow. I’d be interested to see an experiment similar to the one described by Maithra, where the weights learned by the final model are destroyed but their scales are preserved: would retraining that new model (without multitask learning) yield similarly good scores in the 1.95 range?

Spotlight: 15-way Sentence Prediction for Patients That Have Difficulty Speaking (Arnav Kapur)

@arnavkapur

Arnav presented a summary of his paper Non-Invasive Silent Speech Recognition in Multiple Sclerosis with Dysphonia. The goal of their work was to help patients with MS who need technological assistance to speak; they built a tool that attaches electrodes to the patient and uses the small electric signals (from muscle movement) to predict which sentence the patient is trying to say (out of a possible space of 15 hard-coded sentences). The novelty seemed to be much more focused on their measurement tool and the 3-person pilot study that they ran, since their technical methods amounted mostly to lots of signal pre-processing and a simple convolutional neural network. I thought this work was an interesting change of emphasis, and I appreciated the effort they put into working with real patients, getting consent, building a tool based on the users’ preferences, and grounding the presentation in the language of clinical impact. That said, I think this work suffers from methodological challenges, and (based on the evaluation metrics they report) might be overstating the clinical impact it can deliver so far.

Figure: The 15 sentences that their model can generate.

This paper’s emphasis was on the pilot study, not the technical components of the work. That said, even for papers with a clinical emphasis, this field does expect a certain level of technical rigor. In his presentation, Arnav spent virtually no time on the model (“We use a convolutional neural net”) or the way it was trained (“a stratified cross-validation run”). Instead, much more attention was devoted to a video of the patients using the system — which, to be clear, is really nice grounding, but should be in addition to, not at the expense of, technical rigor. I had to read their paper to better understand the details, which I’ll highlight now to complement the talk:

  • “We non-invasively collected such recordings from 3 Multiple Sclerosis (MS) patients with dysphonia using pre-gelled Ag/AgCl surface electrodes while they voicelessly and minimally articulated 15 different sentences 10 times each.”
  • “We employed personalized machine learning for the patients, i.e. the data of each patient was kept separate and not mixed with others. Hence, the architecture was trained with personal data to generate separate models for each of the 3 patients. The primary reasons are the attributes and features present in the data, which are unique to each patient due to their personal characteristics and stage of the disease and subsequent speech pathology, and the variance in the electrode positions.”
  • “We used repeated (5 times) stratified 5-fold cross-validation to evaluate our model on the 15 class classification task. The models achieved overall test accuracy of 0.79 (±0.01), 0.87 (±0.01) and 0.77 (±0.02) for the three MS patients P1, P2 and P3 respectively”
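For readers less familiar with that last evaluation protocol, here is a minimal sketch of repeated stratified 5-fold cross-validation in scikit-learn (the synthetic 15-class data and logistic-regression stand-in are mine, not the paper’s EMG features or CNN):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data: 150 examples (10 per class) over 15 "sentence" classes.
X, y = make_classification(n_samples=150, n_features=20, n_informative=10,
                           n_classes=15, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```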

As a pilot study, I think this work is nice. And there definitely is a value in accepting papers of pilot studies, as a way to reward the work and incentivize others to do it as well. That said, I do wish that there had been more methodological rigor; the N=3 population is especially concerning, and they fit one model per patient, which means their 6-layer CNNs (with over 10,000 parameters to fit) were trained on 120 examples (i.e. 80% of 150). They also don’t compare against other baselines, which would be a great way to demonstrate how easy/difficult this task is. Because the data is (understandably) not public, readers have no way to flesh out the technical analysis themselves. I would’ve loved to see more baselines.

I also was a little confused about their metrics for evaluating clinical impact. Their average accuracy is 81%, which means they incorrectly predict about 1-in-5 sentences (and without an option for the patient to say “That’s not what I meant to say”). Nonetheless, they report their words per minute (which seems a little strange to do when selecting from a menu of pre-selected sentences) and find that they achieve 94 wpm. With 4 words per sentence, that is 4 seconds per prediction. As a pilot study, I think these are very promising results! Though it does seem a little apples-to-oranges to compare against existing silent speech recognition technologies (which have an open vocabulary to generate any words) with a head-to-head 94 wpm vs 10 wpm. I do hope that they continue refining this work with a larger cohort of patients and build upon it!

As I said earlier, I appreciate how much of an emphasis on clinical impact this work had. I think the ML for Health community stands to benefit from incorporating those aspects of this work. However, I also think that this work stands to benefit from many of the things that the community does well, including: methodological rigor, larger sample sizes, and more comprehensive evaluations.

Predicting Cardiac Arrest Design for Deployment (Anna Goldenberg)

I couldn’t find a twitter handle for Dr. Goldenberg.

Anna spoke about lessons learned from an ongoing effort to deploy a cardiac arrest model in a hospital. She described three steps of this process so far: a retrospective prediction model, a survey of what doctors want from the tool, and an explanation-based model that addresses what the doctors said they were interested in. I think this talk is so important, because there are very few deployments in hospitals among the ML for Health community; sharing lessons learned and best practices is a critical step forward as we try to improve lives with the tools we hope to build.

The first step was a retrospective study of a cardiac arrest prediction model, presented at MLHC 2018. They used a CNN-LSTM model to predict cardiac events 5-15 minutes before the event happens, which would help the team prepare for the event. The model performed well, but then they did a sanity check which initially left them a little dejected. Although the hospital only has 100 cardiac arrest events per year, their model would be making 3,000,000 predictions (because it would predict every 5 minutes for 30 beds, all the time). This means that even a great model with only a 1% error rate would still produce 30,000 False Positives, which would dwarf the number of actual events & potentially cause alarm fatigue. When they ran the evaluation and found a 15% False Positive Rate, they were surprised to hear that the clinical staff was actually pretty optimistic: when viewing those predictions in context, a lot of them made sense (e.g. the patient indeed was going to have a cardiac arrest, but the staff had already known to give treatment before it was too late).
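Here’s the back-of-the-envelope version of that arithmetic (my own calculation, using the numbers quoted in the talk):

```python
# 30 beds, one prediction every 5 minutes, all year long
predictions_per_year = 30 * (60 // 5) * 24 * 365
false_alarm_rate = 0.01          # even a "great" model with 1% error

print(predictions_per_year)                           # ~3.15 million predictions
print(int(predictions_per_year * false_alarm_rate))   # ~31,500 false alarms
```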

Next, Anna discussed the survey her team conducted about Medical ML explanations for her MLHC 2019 paper What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. They found that junior doctors wanted a safety net (i.e. a “tool that will be constantly monitoring the situation and alerting clinicians to the onset of potentially critical events”), whereas more senior doctors didn’t want to be bothered with what they already knew, and only wanted to know when the system disagrees with them & why. They also learned that explanations are very task- and constraint-dependent: doctors want lots of info when they have lots of time (e.g. at the end-of-shift handoff), but during critical situations they only want the small pieces of actionable information. And they wanted per-event explanations (i.e. “Why did you make *this* decision?”) as opposed to per-model explanations (i.e. “What are the most relevant features for this model, in general?”).

With these considerations in mind, they developed an explainability model for a time series to answer “What in the past could have caused this event?”. The work is currently under review, but the Feed Forward Counterfactual (FFC) Model infers feature importance by forecasting the next observation given the previous observations, and comparing that expectation against what is actually observed. As the figure below shows, this difference between expectation and observed signal helps flag strange behavior that the doctors could then investigate for themselves to see whether it is concerning.
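The intuition is easy to sketch (this is my own toy illustration; the actual FFC model uses a learned forecaster over the full time series): compare each signal’s forecast against its observed value and surface the biggest deviations.

```python
import numpy as np

def deviation_scores(observed, forecast, eps=1e-6):
    """Relative 'surprise' per clinical signal: how far the observed value
    deviates from what the forecaster expected."""
    residual = np.abs(np.asarray(observed, float) - np.asarray(forecast, float))
    return residual / (np.abs(np.asarray(forecast, float)) + eps)

obs      = {"heart_rate": 145, "resp_rate": 18, "spo2": 96}   # placeholder vitals
expected = {"heart_rate": 90,  "resp_rate": 17, "spo2": 97}   # forecaster output

scores = deviation_scores(list(obs.values()), list(expected.values()))
for name, s in sorted(zip(obs, scores), key=lambda kv: -kv[1]):
    print(f"{name}: deviation {s:.2f}")   # heart_rate surfaces as most surprising
```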

Figure: Feed Forward Counterfactual (FFC) Model for forecasting signals, and seeing which signals deviated a lot from the expected observations.

She concluded the talk with some very good advice: it is critical for deployment to discuss and design these tools alongside the users, and that includes nurses just as much as (if not more than) doctors, because nurses will often be the ones actually using these tools.

Lessons Learned: Computer Vision for Medical Imaging (Lily Peng and Dale Webster)

I couldn’t find a twitter profile for Dr. Webster. Dr. Peng’s twitter handle is @lhpeng.

This talk from Lily Peng and Dale Webster was about their experiences developing and deploying computer vision systems in the clinic. Dale described considerations on the modeling side of things, and Lily talked about the challenges and successes of integrating the tool into a real healthcare system. I liked this talk, because I think the field is ready to move beyond just theoretical models for papers and start deploying models to improve patients’ lives. We need case studies like this one to help us learn what the best practices are (hint: it’s to work closely with the people using the tool, understand their needs, and integrate it into their workflow in a useful way). I wonder if in future years we will begin to see more User Interface work at ML4H venues. I’d personally love to see that, especially as a proving ground for how to bridge between ML folks and domain experts when deploying systems, which the younger field of Algorithmic Fairness is also beginning to work towards!

Dale pointed out that there are a lot of problems in healthcare imaging where it’s fine to use pre-trained architectures and pre-trained models, and the real challenges come from correctly defining the problem and then collecting high quality labels. He recounted some examples of when good predictive performance was misleading, such as when a model purported to get 0.95 AUC but was actually having so much success because the training dataset was curated by merging two quite different sets (one of which had only negative examples). He offered the suggestion to test for these obvious mistakes with model explanations, decision trees on metadata, etc.

After they evaluated on retrospective studies, they found that the humans were better at limiting False Positives, whereas the models were better at making sure nothing slipped through the cracks and got missed. They hoped to combine these strengths in a human+ML hybrid model, but found that, originally, the hybrid wasn’t doing as well. After looking further into it, they realized that the poor user interface (i.e. displaying a heat map adjacent to the image, with poor localization) wasn’t helping, and once they worked with a designer on better visualizations, the human+ML predictor was the best. As she described, when it comes to integrating AI into the system, there’s “not a lot of AI there”, yet those small improvements can have large payoffs in actual impact on patients’ lives.

Lily talked about the deployed diabetic retinopathy screening tool in Thailand. They began with a one-site pilot study, where they went to the clinics and mapped out how patients, doctors, and nurses flowed through the clinic. They interviewed patients about what was important to them, and were sometimes surprised by the answers. For instance, many didn’t say accuracy, but rather things like “I don’t want to travel 5 hours to the hospital”, which meant that False Positives were very costly (i.e. don’t call a patient back willy-nilly if you’re not sure because a picture is slightly blurry). She described the importance of incorporating patient values, especially before integrating the system, because involving patients in the design process is an important way to build trust with the people who will actually be impacted by the tools.

Figure: Mapped organization flow of the Thailand clinic in December 2018.

Panel

The panel included Cian, Anna, Danielle Belgrave (@DaniCMBelg), and Luke.

Q: What are some areas that might be most fruitful in the next 5 years, for instance for new PhD students?

Cian: Before you decide on techniques, choose the right problem.

Anna: NLP models right now are too computationally intensive. It’s important to start thinking about scalability and environment constraints.

Danielle: Look for good clinical collaborators. It’s hard to have the right context for these tools without one.

Luke: The most important thing to be a medical ML practitioner is to start understanding medical systems.

Q: For people who don’t have easy access to medical collaborators, what can they do?

Danielle: Ultimately, you want clinical collaborators. There’s also a lot of important work to be done working with designers on system usability. When working on clinical problems, it’s essential to understand the context deeply. This can be acquired through reading to some extent.

Cian: Agreed on user design and understanding clinical context. Clinical conferences can be a good way for ML researchers to reach out and find potential collaborators.

Luke: Young doctors are excited about these kinds of collaborations. Perhaps in online communities as well.

Anna: Participate in talks and panels. Ask questions. Collaborators are there. “If somebody really can’t find any collaborators, then I think they’re just not looking hard enough.”

Q: Lots of success in ML4H on Imaging and maybe some Signal Processing. How to extend to other modalities like notes or structured EHRs?

Anna: Many tools being deployed are using EHR data (e.g. Stanford, Kaiser, NYU, Duke, Toronto). Lots of examples of medical imaging data because it gets collected in a more standardized way. The availability of data has been driving opportunistic research. Also some interesting new work in mental health NLP.

Luke: I’m not sure “better models” can help us do better on EHR. Imaging success comes from being able to exploit spatial relationships that EHR data doesn’t have. 

Anna: I don’t think it’s a failure if random forest works to solve a problem. Also, deploying algorithms isn’t largely a technical challenge… it requires lots of support from hospital staff, IT, admins, etc at the hospital. It’s a big commitment on their part, and we’re just now beginning to get buy in.

Cian: Hospitals have digitized in the last 10 years (e.g. electronic recording of vital signs).  Interoperability in imaging (e.g. Dicom) has been ahead of EHRs (e.g. FHIR), which has allowed more standardization and transferability.

Danielle: We should think beyond “Machine Learning for Healthcare” to “Machine Learning for Health.” We need to get well-curated datasets, and think about longitudinal data to understand health and the progression of disease.

Q: Ground truth labels in healthcare are expensive and often suffer from poor inter-annotator agreement. I’d love to hear more about the panel’s experience in building labeled datasets in healthcare.

Cian: Spend time talking with clinical collaborators about what it is you’re trying to predict. In the future, we can hopefully extract high quality labels from historical clinical practice data.

Anna: Pretty much impossible if you don’t have access to a clinician. Things like ICD9 codes are biased towards billing, not diagnosis. Online learning and active learning techniques will hopefully help. There’s an example from Regina Barzilay’s group, where they were able to achieve better performance by predicting the outcome itself rather than an intermediate, human-derived proxy, like density.

Q: How does disease heterogeneity impact this? What happens when the “true label” is not clear?

Luke: When the data is heterogeneous, the data might be worse for some sub-populations than others. Try to favor tasks with objective ground truth; even panels can have a lot of disagreements.

Danielle: There is a large opportunity for research in new methods to better capture uncertainty. Heterogeneity can happen both across annotators or for a given patient with multiple measurements.

Q: Machine Learning models often amplify unfairness against underrepresented populations. For the industry folks on the panel: what policies do you have in place to probe the fairness of deployed systems? In particular, how do you measure fairness across many intersectional groups?

Cian: For our retinal disease project, we sought out a hospital with one of the most ethnically diverse populations in our country. We built models to ensure we were performing appropriately on subpopulations. We tried collecting data from across the globe that could help us assess performance even in the face of things like variation in the tools that capture the images (and therefore image granularity). This work is still ongoing.

Danielle: These are the two main challenges of Fair ML in Health: a representative dataset and doing post-surveillance studies to see what unintended consequences might arise from a deployed system.

Anna: These are good practices to do. I’m not sure there are policies in place to ensure these things are checked. Sometimes, policies can get in the way, such as how the Canadian system does not record race, which makes it impossible to audit them for racial bias.

Luke: Once you can measure disparities, then you need to make a conscious choice to improve conditions for the affected groups. As an ML researcher, you need to supply the info of what the model is doing so that the people in governance can make the decision.

Q: Given a model with really strong performance in a research setting, how should one decide whether to pursue deployment in the clinic?

Anna: We convened a deployment seminar in Toronto to learn how other institutions have done this. Priorities (e.g. billing, care, etc) differ. Having a clinical champion is very important in this process. Oftentimes, health economists are involved in the decision-making for how tools affect the system.

Cian: When you have a model with strong performance, you might be under pressure from collaborators to deploy ASAP. There needs to be assurance of safety and effectiveness. You’ll also need post-deployment assessments.

Danielle: One easy win for deploying models is in the areas of diagnostics. But we shouldn’t necessarily try to test everything just because we can. These are ethical questions beyond just the basics of ML.

Luke: We have hundreds of models for research, dozens of models approved for sale, and almost no one buying them. The biggest barrier seems to be no clear value proposition for the right people with these technologies.

Brett: Also worth noting that “high performance” is typically describing accuracy, which is a proxy for whatever your real goal is for improving outcome.

Q: Many of the largest attempts to deploy clinical machine learning are coming from industry. However, many members of the public have expressed concern about large tech companies controlling their data. Do you feel these concerns are warranted?

Luke: Historically, medical research has been done without consent. New laws (e.g. in the EU) are saying that’s not okay anymore. This is a new climate, and researchers will need to deal with it. But let’s not be too down on the commercial space; we need both sides of this process.

Cian: We want to do deep public engagement. We should also have strong privacy assurances. There should be clear information about data use and retainment.

Q: Two complementary challenges in the deployment of medical ML systems are: First, the need to rigorously evaluate the behavior of a system pre-deployment, and Second, the desire to allow continual learning over time post-deployment. Is there a tension between these two goals?

Anna: It’s not a tension, it’s a natural progression: we need to ensure safety and efficacy at each step. There should be more research in this area.

Cian: There is an opportunity for us to engage with regulators as a community now about the benefits and challenges of continuous learning systems.

Luke: “Continuous learning systems give me the absolute heebee jeebees from a safety and regulation perspective”. Even with static models, performance changes over time (e.g. dataset shift). Regulators have never dealt with anything like this before. In the near term, have a re-regulation process.

Anna: Regulation bodies (e.g. FDA) are already addressing continuous learning systems. The whole point of continuous learning systems is to correct for drift that might trip up static systems. We should get these systems in place sooner. We might not get it right at first, but doctors aren’t getting it right at all.

Danielle: A medical ML system is just an extension of the current system, with existing concerns about safety.

Luke: We should be cautious. The way medical advancement happens is slow and methodical. That slowness is a feature, not a bug. If everything changes at much faster speeds, then we’ll be facing a new set of problems.

Conclusions

I thought the workshop went really well this year! I loved the selection of speakers and the interesting topics they’re working on! It is clear to me that the field is moving towards a focus on safety, deployments, and user interfaces. There is still a place for technical innovations to model the unique nature of clinical data, but as this field matures, Spiderman reminds us that our great power comes with the great responsibility to think about the impact these tools will start to have on the world. And very importantly, we need to be working with clinical collaborators to understand the context of these tools. At the 2016 NeurIPS ML4H workshop, the first keynote speaker, Dr. Leo Celi (an intensivist at Beth Israel), said that if you can’t find a clinical collaborator, then you should take your valuable brain power and work on a different field rather than “waste your time” working on healthcare data without a clinician.

Speaking of Leo (who is an MD), that reminds me of another recurring theme I noted throughout the day. A few speakers said that although ML has its problems, doctors aren’t very good either. To be honest, that point really rubs me the wrong way when it comes from a non-clinician. Yes, clinicians get many things wrong. But computer scientists get a lot of things wrong too; no field holds a monopoly on failures or arrogance. I believe it is counterproductive for computer scientists to throw doctors under the bus; that will hurt the trust between two communities that NEED to build bridges. I think we computer science folks need to be humble, partner with clinical collaborators, and let the clinicians police their own arrogance. Our technology has also been causing a lot of harm to society, between social media disrupting democracy and algorithmic bias exacerbating inequities.

Also, I’d like to once again reiterate how grateful I am that these talks were recorded! That’s really great for science 🙂

As one last note, I noticed that the gender breakdown of spotlights and invited speakers was 10/17 (59%) men and 7/17 (41%) women. This is certainly much better than many conferences I’ve been to in the past… but that said, I think we should be shooting for 50% representation unless there is a really good reason otherwise. I do appreciate that the organizers likely put effort into this already, but we all have our roles to play, and in this case, I’m happy to be the one on the outside reminding them that they can do better.

Speaking of gender representation, I wanted to audit myself to ensure word-count fairness in what I wrote. For talks that had multiple speakers, I attributed the whole writeup to each of them (i.e. double-counted the writing). The word counts do not include figure captions or excerpted quotes from authors’ papers. When I ran the initial check, I saw concerning disparities that I didn’t have a good justification for: I was writing less about women speakers than men speakers. I felt that needed to be addressed/corrected. Here are the results of the revised writeups (a minimal sketch of this kind of check appears after the list):

  • Initial Version
    • Invited Speakers: 485 words/woman, 568 words/man
    • Spotlight Speakers: 307 words/woman, 375 words/man
    • Panel: 635 words for men, 465 words for women
  • Revised Version
    • Invited Speakers: 567 words/woman, 568 words/man
    • Spotlight Speakers: 356 words/woman, 358 words/man
    • Panel: 591 words for men, 591 words for women
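
For transparency, here is a minimal sketch of the kind of check I ran. The speakers, genders, and text below are placeholders rather than the real data, and the real version stripped captions and excerpted quotes before counting:

```python
# Minimal sketch of the word-count audit. Speakers/genders/text are placeholders;
# talks with multiple speakers are attributed in full to each speaker (double-counted),
# and captions/excerpted quotes are assumed to already be stripped from the text.
from collections import defaultdict

writeups = [
    # (speaker, gender, writeup text)
    ("Speaker A", "woman", "placeholder writeup text"),
    ("Speaker B", "man", "placeholder writeup text"),
]

total_words = defaultdict(int)   # gender -> total words written
num_speakers = defaultdict(int)  # gender -> number of speakers

for _speaker, gender, text in writeups:
    total_words[gender] += len(text.split())
    num_speakers[gender] += 1

for gender in total_words:
    avg = total_words[gender] / num_speakers[gender]
    print(f"{avg:.0f} words/{gender}")
```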

Corrections and Updates

12/29/2019: I realized I did basic math wrong. 10/17 is 59%, not 63%.

Looking back, 2019 went well for me. I’m grateful for the opportunities I was given and the fun that I had. The main highlights for me were TA’ing my first two classes (Machine Learning for Healthcare and Foundations of Internet Policy), interning at Aledade for the summer, working with Dracut Public Schools on a few extracurricular opportunities (including Dracut DI), and finding a little bit of time to get some research done. On a personal note, I also found a great partner in DC for the summer, and have been seeing her since June. I read lots of really great books, including Dignity by Donna Hicks, Doing Good Better by William MacAskill, The Book of Why by Judea Pearl and Dana Mackenzie, The Brethren by Bob Woodward and Scott Armstrong, and Switch by Dan Heath and Chip Heath.

I was able to serve the MIT Community with various roles as President of the CSAIL Student Social Committee, Discussion Chair of the Science Policy Initiative, and Treasurer of the Association of Student Activities. And I learned some really valuable lessons about power and responsibility: power can arise in all kinds of ways (seniority within a group, gender dynamics, instructor-student interactions, or otherwise) and when communication breaks down, the responsibility falls on the person in power to make sure everyone is on the same page.

But 2019 wasn’t just a series of successes. Part of that success came from privilege and luck, of course, but still other parts of it came from taking chances. For the rest of this reflection post, I want to tell you two stories from this year, a success and a failure.

Working at Aledade

At the beginning of the year, I wanted to find a summer internship where I could help make the world a better place. There were some opportunities in research — which seemed interesting — but I wanted to do something more directly impactful, especially since I already do research as a grad student. 

I heard about the company Aledade, which was working with independent primary care doctors to show that a new kind of business model can work in healthcare: value-based care. I learned more about their specific business model (Accountable Care Organizations) from their podcast, The ACO Show. Essentially, the organization works to align the incentives so that doctors win when patients win. Corporate business decisions can get in the way of patient care, like when a hospital orders unnecessary tests & profits from providing those services. In value-based care, the insurer pays for outcomes, not volume, which encourages preventative care (which is much cheaper than reactive treatment) to make sure patients stay healthy and don’t slip through the cracks.

Aledade’s mantra is to make decisions that are “good for doctors, good for patients, and good for society.” It’s a really interesting concept: creating a market where doing the profitable thing means doing the right thing! I was interested in trying to help & to learn from them.

I submitted a resume online. A little while went by, and I had other offers I needed to get back to as quickly as I could, but I still hadn’t heard from Aledade. This is where I tried something new for me. The CEO and co-founder of the company, Farzad Mostashari, is a very friendly guy and is also active on Twitter. So I decided to tweet at him and ask if I could DM him my resume. I figured the worst case scenario was that he’d think it was weird and wouldn’t respond (which would result in the same not-having-the-job for me as if I hadn’t reached out at all). Fortunately for me, he got back to me, and we set up an interview.

I ended up spending my summer at Aledade, and it was a really great experience. The project was meaningful, and the people were great! I even got to help out with the podcast (I asked if there was anything I could do to help & they wanted to give me something to do, so I got to send a couple cold-call emails to potential guests). Oftentimes, it doesn’t hurt to ask!

Applying for Tech Congress

Throughout my service in MIT organizations, I’ve learned that governing is hard: even when you want to do the most good for the most amount of people, it’s hard to know what the right answers are. No one is an expert in everything, because there are so many topics and fields (e.g. tech, healthcare, education, labor, criminal justice, immigration, agriculture, foreign affairs, climate change, etc), which is why the government relies on experts to help them understand what options to consider for making the best policies. This has been hurt by ideological efforts to actively undermine government, like when Newt Gingrich was able to dissolve the Office of Technology Assessment in the 1990s. Is anyone now surprised when Congress lacks technical expertise on issues like regulating Facebook and protecting consumers online?

My friend Andy helped write a blog post about the importance of technical experts serving a Civic “Tour of Duty” in government, both to gain experience/perspective for oneself and also to help the government make well-informed decisions. Justice Stephen Breyer believes that citizen participation is crucial to making a workable society; without everyone doing their part, it won’t work for the people.

That is why I was very excited to hear about the Tech Congress program! It was created a few years ago to place technical experts on Capitol Hill as staffers in Congress for a year. Additionally, the program emphasizes the importance of diversity, and offers a competitive stipend in order to attract a broader pool of applicants (because it doesn’t require independent wealth in order to support oneself). I think this program is a great idea!

I applied to Tech Congress. As part of the application process I wrote a brief essay about a technical topic, along with recommendations for how Congress should handle it. I wrote about algorithmic bias, which is currently unregulated, and deeply problematic. My initial attempt at the essay began with:

In her popular book Weapons of Math Destruction, Cathy O’Neil identifies key elements that can make algorithms dangerous. Opaque algorithms do not have to explain their decisions, which makes it difficult to identify or correct bias. Unaccountable algorithms do not have to answer for mistakes they make or correct for future decisions. Scale is what transforms a nuisance into a catastrophe.

If someone is denied a loan for being left-handed, that seems wrong but probably not worth correcting with policy; perhaps other people will be denied loans for being right-handed, and it could “even out.” But if things do not even out — if the bias is always directed one way, perhaps by race — then society should intervene. Technology at scale makes it easier for biases to correlate, often times invisibly until it’s too late.

This usually boils down to an imbalanced dataset (e.g. teaching a model “What does a criminal look like?” on a dataset collected from over-policing some neighborhoods). But it can get very subtle (e.g. a healthcare risk model could be biased by socioeconomic status if it learns to identify expensive brand name drugs as top predictors but doesn’t recognize their low-cost generic equivalents). These things can be very hard to see coming beforehand.

Because my goal was to make the essay understandable to a non-technical audience, I decided to ask for help! I posted my first draft essay on Facebook, and asked friends to give me feedback so that I could make it better. In total, that post got 22 comments, and a lot of really useful feedback, including eliminating jargon-y phrases and questioning assumptions that I took for granted (such as when people ask “How can math be biased?”). The result of the feedback helped me write something that I was very proud of! I thought it was clear and informative.

“How can math be biased?” That is like asking how cars can crash; math is a tool and it can be misused if you’re not careful. In the last few years, there have been many high-profile examples of biased AI, including Google’s facial recognition thinking black people were “gorillas” or Amazon’s automatic resume filter down-weighting resumes from women’s colleges. Responsible scientists have created conferences to study this issue: the most well-known one to date is called FAT/ML (Fairness, Accountability, and Transparency in Machine Learning).

AI learns to make decisions by finding patterns in data; if the dataset has biases, then the model likely will too. Some biases are “obvious” to see coming (e.g. teaching a model “What does a criminal look like?” on a dataset collected by over-policing some neighborhoods). However, other biases can be more subtle (e.g. a healthcare risk model could be biased by socioeconomic status if it learns to identify expensive brand name drugs as top predictors but doesn’t recognize their low-cost generic equivalents).

The scale of technology makes it easier for biases to affect thousands or even millions of people, oftentimes invisibly until it’s too late.

Unfortunately, I was not accepted to Tech Congress’s program this year. They received hundreds of applications (which was bad news for me, but good news for America, I guess).

It was, of course, disappointing to not be selected, but after some thought, I was okay.

  • For one, I am still a PhD student at my dream school, doing really exciting research, and surrounded by great people. I have a lot to be thankful for already.
  • Additionally, my application wasn’t what they were looking for right now, but I still had fun writing that essay. Getting crowdsourced feedback from my non-EECS friends was a really great opportunity to help me examine my own assumptions & get experience communicating important topics.
  • But there was a third reason why I realized I was going to be okay: rejections mean that I’m reaching & trying to grow. If you don’t ask, then the answer is automatically no. Kim Liao argues that you should aim to get 100 rejections a year to keep yourself from letting failure stop you before you even start. This perspective helped me accept that not everything pans out successfully.

Overall, 2019 was a really good year for me. Here’s hoping 2020 will be anywhere near as good. I’m obviously very fortunate to now be at a great school like MIT, which will (deservedly or not) open some doors for me that wouldn’t be available if I was still at UMass Lowell. But regardless of where I’m at, I can always shoot for the goal of “better”. And part of that is putting myself out there and asking for what I want.

Online Political Ads

As the Discussion Chair for MIT’s Science Policy Initiative, I’m responsible for picking a topic and leading discussion on interesting subjects in science and technology policy. So far this year, we talked about Diversity in STEM and about Science Funding. This month, in honor of the elections the following day, we met and talked about technology and democracy.

The announcement for the event that students saw, outlining many ways the conversation may go (depending on what they wanted to talk about).

You can read the notes of what we discussed here.

The main thing everyone was interested in discussing was social media. We began with whether it was a good thing or a bad thing for Twitter to announce they won’t be selling political ads (unlike Facebook, which announced it will run all political ads regardless of intentional lies or misrepresentation). Interestingly, the room was pretty evenly split: roughly ⅓ said good idea, ⅓ said bad idea, and ⅓ were unsure.

Mark Zuckerberg tells Congresswoman Ocasio-Cortez that she’d probably be allowed to run ads on Facebook lying about how Republicans voted, if she wanted to.

Some of the concern came from how “political” would be defined. Although social media platforms do already have working guidelines for how they’ve been handling political ads (for purposes of heightened transparency and disclosures), this concern is still very real. Some members sympathized with Twitter because “in some sense, ‘Truth’ has become political.” The days of unchallenged narratives from whomever is in power have ended with tools like social media that (sometimes for good and sometimes for ill) let all parties tell their story on their own terms. Now that tech enables like-minded people to build power and organization without “permission”, competing narratives often emerge.

One especially alarming aspect is targeting different users with different information. Companies like Cambridge Analytica were used to build predictive profiles of users in order to target what messaging would be most effective for each individual (using data collected from an unethical and unauthorized data breach). This scenario is most concerning when you don’t even know which ads are being shown; efforts from journalists like ProPublica to understand who sees which ads have pressured Facebook to release a database of political ads, though ProPublica also showed the Facebook database failed to capture some ads from the NRA, unions, and electoral reform activists.

The 2019 Netflix documentary “The Great Hack” examines predictive targeting, social media disinformation, and the role that Cambridge Analytica played in the Brexit and Trump campaigns.

Of course, some of these concerns are not new. We’ve been policing communication since the 1927 Radio Act, followed by establishing the FCC and then instituting the Fairness Doctrine for broadcasting (particularly TV and radio) for decades. And even before then, the telegraph and later the telephone were communication technologies that turned the world on its head by connecting people beyond the local scale to the national one. In many respects, the internet and social media are just a more powerful version of the same socio-technical discussions we’ve been having for over a century. Some members of Congress have introduced The Honest Ads Act to increase transparency requirements, which would be an improvement in some areas, but it wouldn’t do enough to address the unique challenges posed by the new technology.

One concern with targeting — even with transparency (which is absolutely essential) — is that such precision is a new form of power which had never been available to advertisers before. They used to have to air their messages at community-level granularity for tv or newspapers, which usually acted as a regularizing force against extremism. But with individual targeting, you can tell two different groups exactly what they want to hear, and they likely won’t compare notes later to see whether you were completely honest with each of them.

So who is responsible for overcoming fake news? Is it just a problem for the tech companies, or should we expect online users to act responsibly? Some members suggested that with better civics education (including classes on how to recognize fake news), perhaps people would be less susceptible.

Just as we saw when talking about diversity, people make better decisions but have less fun when they are in diverse teams that challenge their views and make them rethink their assumptions. People don’t intuitively want to be told they might be wrong, even if it’s better for everyone (because no one is right about everything). Similarly, no one wants to fact check their friends or people on “their side.” But maybe they should. Researchers studying online disinformation campaigns have found that bad actors create extremist fake content for either side of a conflict & then send that content out to make that issue the most divisive and salient (such as when Russia pushed polarizing content to both communities in 2016 through their “Blacktivist” and “Back the Badge” pages). No one thinks “Yep, this is probably manipulating me right now.” It’s always the “other” people who are the ones being influenced and fooled.

Personally, although I do think everyone should act more responsibly as individuals, I don’t think that will solve the problem. Yes, I think people should fact-check any surprising news/claims they hear. And I think they shouldn’t take the bait from trolls and that they shouldn’t be unnecessarily mean. But that’s not the driving force of these problems. The causes of these problems are structural, and we need to tackle the source.

The goal of misinformation is to deceive or persuade, but the goal of disinformation is to demoralize. If you can convince someone that the world is gray and that truth doesn’t exist, then you can convince them to give up. Because social media allows for easily fighting an existing narrative by using one’s own “alternative facts,” this is more of a concern now than ever before. Data profiling allows bad actors to target each of us in individualized ways: they’ll demoralize some of us into giving up, and they’ll divide the rest of us up so that we fight against each other (instead of those profiting from the status quo).

An example of how Cambridge Analytica described their campaign in Trinidad and Tobago to induce apathy among the youth and sway the election.

Private tech companies built these powerful tools which allow for never-before-seen levels of data profiling and targeted messaging. As Spider-Man reminds us, with great power comes great responsibility.

At the SPI discussion, some members asked:

Do you really want private companies deciding what people can say?

Which is a fair point. On the other hand, I’m even more concerned with the following question:

Do you really think private companies should be profiting from amplifying lies and sowing division?

Even beyond advertisements, companies like Facebook moderate all of the content on their platforms, subject to community guidelines (e.g. cannot incite violence or violate copyright). The current practice for content moderation at Facebook is to hire an army of human moderators. However, they aren’t exactly doing a very good job with that: after two investigations by The Verge exposed how poorly the moderators were being treated, the vendor that Facebook contracts with (so as to avoid having to treat the moderators as employees of FB) decided to stop doing business with them. The investigations uncovered trauma-inducing work environments, poor resources for counseling, much lower wages than a job at Facebook would pay, filthy offices, and a pattern of manager mistreatment (including sexual harassment). The solution cannot be to just scale those kinds of operations up to handle more content.

There have been some attempts to use AI and Natural Language Processing algorithms to help with content moderation, though these tools have severe limitations, including: no understanding of social/cultural context, less ability to adapt to new patterns/tropes with just a few examples, and less common sense (i.e. heavier reliance on rigid optimizations) to avoid being gamed. Additionally, introducing AI can bring biases of its own, which arise in hard-to-foresee ways. There may be promising roads there, but the tools aren’t ready to put our trust in yet. Either way, Facebook will be doing this content moderation on posts from users. Should we really expand the task by adding political ads into this mess too?

Of course, these questions are easy to debate philosophically, but the companies are going to keep existing, and 2020 is two months away. They’ll need to decide what they’re going to do for the 2020 US elections. In the short term, maybe it’s best to shut down the sale of political ads until they can figure out responsible use. I don’t think they should be experimenting with such high stakes when there is an election that could be decided by a few tens of thousands of well-targeted votes.

If political ads were banned, politicians (and everyone else) would still be able to speak on platforms. Deciding not to sell ads (to everyone) is not censorship; it’s just not amplifying a message. Political content can still go viral, but it should do so on the substance of the content, not on how much the buyer is able to spend on reaching their audience. Why should the powerful be able to spend more to spread their message, and why should Facebook be profiting from that?

This post is from a non-lawyer for non-lawyers. I hope to convey why I think Justice Kagan is so good at what she does, and I do this by talking about a lot of examples of Supreme Court cases (in language that I hope everyone can follow).

Elena Kagan, the 112th Justice of the Supreme Court, is one of my heroes. She is one of the best questioners at Oral Arguments, and her world view is optimistic yet pragmatic. Fundamentally, she wants to connect with people so that everyone is heard. She writes with clarity so that everyone — even non-legal folks like myself — can understand. And she speaks with humility and empathy so that she and her “adversary” can see each other’s points and work together on finding the right answer, rather than just “winning”.

And honestly, she loves what she does. She just has fun with it, and occasionally she’s a huge dork! In 2015, she wrote the majority opinion for Kimble v Marvel Entertainment, a case involving the royalties to the inventor of a “web-slinging toy”, where she filled her opinion with Spidey puns:

  • “Patents endow their holders with certain superpowers, but only for a limited time,”
  • “The parties set no end date for royalties, apparently contemplating that they would continue for as long as kids want to imitate Spider-Man (by doing whatever a spider can),”
  • “Indeed, [prior case law’s] close relation to a whole web of precedents means that overruling it could threaten others,”
  • “What we can decide, we can undecide,” she concluded. “But stare decisis teaches that we should exercise that authority sparingly. Cf. S. Lee and S. Ditko, Amazing Fantasy No. 15: ‘SpiderMan,’ p. 13 (1962) (‘[I]n this world, with great power there must also come—great responsibility’).”

She literally cited Stan Lee’s comic book! It was such a fun and endearing opinion; you can tell she really had a blast. Similarly, in a 2011 campaign finance case (Arizona Free Enterprise v. Bennett), Justice Kagan’s dissenting opinion commented on the “chutzpah” of the people challenging public financing with a bold and strange argument.

Her writing is direct. Her writing is accessible. And her writing is relatable. I wish every Justice (no matter their ideology or methods) put as much thought and consideration into their communication as she does.

The Queen of Oral Arguments

Seriously, Justice Kagan is *so* good!

She is easily one of the smartest and most qualified Justices. She’s probably the best questioner at oral arguments; she can distill the essence of a case down succinctly.

Take Helsinn Healthcare v. Teva Pharmaceuticals, which is about whether a private sale of an invention voided patent protections (because an invention can only be patented if it hasn’t been sold before). The law says that “a person shall be entitled to a patent unless the patent was … in public use, or on sale, or otherwise available to the public [prior to the filing.]” I really struggled to follow the arguments of this case, because I’m not a lawyer. Everyone was very focused on the “otherwise” clause, and how to parse it linguistically.

Then Justice Kagan asked a hypothetical that really helped clarify for me what everyone was talking about: “suppose I say don’t buy peanut butter cookies, pecan pie—this is the key one, ready—brownies, or any dessert that otherwise contains nuts. Do I—do I violate the injunction if I buy nutless brownies?” That was just classic Kagan. It was a very succinct and persuasive question. I was finally able to figure out why this “otherwise” clause’s interpretation would have such a large impact on how this case could come out.

As another example, in 2016, the Court decided 6-2 in Bank Markazi v. Peterson that Congress had not violated separation of powers by passing an act which ended up resolving an issue that was concurrently being litigated in the courts. Chief Justice Roberts was in the dissent, and in his dissenting opinion, he argues that:
1) The Court says it would reject a law that says “Smith wins” because such a statute “would create no new substantive law.”
2) The actual law in question is still simply picking winners & losers.

In the Chief’s opinion, the phrase “Smith wins” appears 6 times, as he really leans into that hypothetical to hammer home his point that he thinks the Court is wrong. But where did that succinct way of arguing come from? If you check the transcripts of oral arguments, you’ll find that it was none other than Justice Kagan who distilled it down to that pithy framing.

KAGAN
Does that mean, Mr. Olson, if I could, you're conceding that Congress could not say, we have a particular case, Smith v. Jones, Smith ought to win? Congress cannot say that, right?

...

ADVOCATE
Agree that that would implicate concerns about separation of power, just directing a judgment.

But to the extent --

KAGAN
So if that's right, now Congress takes a look at this case and says, we can't just say Smith wins.

And then we just -- we take a look at the case and we say, oh, if you just tweaked the law in this particular way, Smith would win.

So we tweak the law in this particular way for this case only.
But we don't say Smith wins.

We just say we're tweaking the law in this particular case for this case only.

Is that all right?

Richard Feynman said that if you can’t explain something in simple terms, you don’t understand it. Justice Kagan (because of how much work she puts into preparing for each case) understands it all. She can explain an issue so simply that even a non-lawyer or a Chief Justice can understand. In all seriousness, Chief Justice Roberts is actually also a great questioner; probably the 2nd best questioner on the Court. But Kagan is #1 in my opinion. The only other Justice in the same ballpark is Justice Sam Alito, and that’s only on the rare occasion when he really wants to destroy someone’s argument, like his book banning question in Citizens United or his parade of edge cases in Minnesota Voters Alliance v Mansky (if you’d like to see that, search for “not only does it have to be a political message, but it has to be well-known”).

Listen to Oral Arguments and you’ll often hear her suggest “Well, let’s ask Justice X’s question another way.” Whenever you hear that, you’re in for a treat.

Her Approach to Law

Justice Kagan is a textualist. If that term is unfamiliar, it’s probably as intuitive as it sounds: it means she carefully parses the words of the law and tries to figure out what it says, not what Congress meant to say.

Although textualism tends to be associated with the Conservative Justices, that’s not always how it shakes out. Justice Scalia famously often found himself ruling in favor of criminal defendants, quipping that “I have defended criminal defendants’ rights—because they’re there in the original Constitution—to a greater degree than most judges have.” In cases like Kyllo v US and US v Jones, he argued that the Fourth Amendment protected defendants against unreasonable searches and seizures in their homes and of their property, because that’s what the Constitution says. Another avowed textualist, Justice Neil Gorsuch, wrote for a unanimous Court in New Prime Inc. v. Oliveira that the phrase “contracts of employment”, as originally understood, covered all sorts of work agreements (including “contracting”) when the law was passed in 1925.

So like I said, Kagan is a very serious textualist. She and Justice Gorsuch are probably the two most serious textualists on the Court, which might surprise some because they often find themselves on opposite sides of high-profile 5-4s, including Trump v Hawaii, Rucho v Common Cause, and Husted v. Philip Randolph Institute. But they both use the tools of textualism as a starting point as they try to interpret the law, even if it ultimately leads them to different conclusions.

Sometimes, we get to see the two go at it with dueling opinions. Looking at Kisor v Wilkie as an example, Justice Kagan writes for the Court that stare decisis (i.e. respecting previous rulings because people organize reliance interests around that stability) wins the day, and a previous case is not overruled. Justice Gorsuch writes a “concurrence” where he attacks this idea (because he believes much less in the importance of precedent than she does), saying “Still, today’s decision is more a stay of execution than a pardon. The Court cannot muster even five votes to say that Auer is lawful or wise. Instead, a majority retains Auer only because of stare decisis.” This example shows that even two textualists can strongly disagree because of how they weight different factors.

Of the “Liberal” (i.e. Democratic-appointed) Justices, Kagan and Breyer are closer to what one might characterize as “moderate” in the sense that they are more likely to occasionally cross over and vote in a decision with the 5 “Conservative” (i.e. Republican-appointed) Justices to produce an opinion that is 6-3 (e.g. Lucia v SEC, which Kagan actually wrote) or 7-2 (e.g. NFIB v Sebelius to make the state Medicaid expansion optional instead of essentially mandatory).

That said, in all of the high-profile cases that the average person might have heard of (e.g. Citizens United, Shelby County, Hobby Lobby, Trump v Hawaii, Rucho, etc), Justice Kagan and the 3 other “Liberals” are all often in the dissent together for 5-4s. Some suggest these 5-4s are (or at the very least really appear to be) political, with the argument being that if the 5 couldn’t even get Breyer or Kagan to join, then they weren’t particularly interested in remaining non-divisive.

She Makes a Great Role Model

Elena Kagan is a legal badass. She was the first female Solicitor General of the United States and the first female Dean of Harvard Law School. She’s the fourth woman on the Supreme Court, and the eighth Jewish Justice. And in addition to all of that, she’s also one of the smartest Justices in generations. A lot of people have a lot of reasons to see her as a role model.

In 2017, I was Captain America for Halloween, so in 2018 I decided to continue the trend of dressing up as people I look up to.

For Halloween 2018, I dressed up as American hero Justice Elena Kagan.

At one point in an interview, she was asked about the best part of being a Justice, and she talked about the importance of the role. When asked about the worst part, she talked about losing a case. No one likes to lose. She is very competitive, and while she uses that fire to motivate herself to always be the best she can, she doesn’t let it get in the way of her relationships.

When she was the Junior Justice, she had to be on the cafeteria committee and she got a frozen yogurt machine, which everyone loved. She has a very collegial demeanor with all of her colleagues. She goes to the opera with Justice Ginsburg but she also used to go hunting with Justice Scalia when he was alive (which began after she jokingly promised a Republican Senator she’d do that during the confirmation process).

She famously got along very well with all of the professors at Harvard Law when she was the Dean, and that was one of the reasons she was able to be so effective in that role. She honestly seems like a great person who works hard but also really cares. I can think of no higher praise than that!

The Baseline Manifesto

Baselines are important in Machine Learning. One criticism I’ve had of the NLP field for years is that all of the fancy Deep Learning (DL) was often just marginally better (if at all) than bag-of-words-based models for many of the common tasks.

I remember as an undergrad, I spent a long time studying the heck out of Socher & Manning’s Tree-RNN suite of papers (RAE, MV-RNN, RNTN). I remember being so impressed, only to then see Chris Manning present at NAACL 2015 and hear him casually mention that all those fancy-math, linguistically-inspired tree networks couldn’t actually beat paragraph2vec (which is a pretty simple model that straight up ignores the ordering of words in a sentence)… but don’t worry, because the new Tree-LSTM finally can! That really stuck with me. It made me much more skeptical of “the hype.” And for what it’s worth, Manning wasn’t trying to pull a fast one; he was on a 2012 paper which showed that a well-built bag-of-words model beat models like the RAE. I just missed that paper amid all of the DL hype. Now, I always want to see strong baselines. Not wimpy un-tuned Logistic Regression that anyone can beat by 20 points, but an actual assessment of “How hard is this task? How simple is this data?”

Of course this complaint is not limited to just the NLP community. I’ve also found myself very frustrated with weak baselines in ML for Health papers as well. Really, every research field will have projects which don’t do a good job evaluating their contribution. But the ML community should be trying as hard as it can to limit the number of unforced errors.

I decided to write this blog post to honor some lovely baselines I’ve seen over the years.

Case Study: Image Captioning

Also during my formative years in undergrad, I was working on an image captioning task with a team of grad students: given an image, generate a sentence describing it (e.g. “a man is playing a guitar”). Our model was as fancy as one could get in 2015: a CNN feature-extractor fed into an LSTM decoder to output a sequence of words. It was one of the most interesting things I’d seen in undergrad research.

Image captioning example from MS COCO. Input is this image and output is something like “a dog with a frisbee in its mouth with a leash attached to it.”

But as you know, Deep Learning language generation (especially back then) could look something like “a man is a man is a guitar”, which left something to be desired but was still pretty cool, given how much it got mostly right. Then I came across a really interesting paper which looked at Nearest Neighbor methods for image captioning.

Devlin et al. 2015: When a CNN is used to extract features from images, you can plot a t-SNE of those images and observe whether similar images have similar captions.

Rather than using deep models to decode our image representations into words, what if we just returned our “generated” caption as the caption of the closest image in CNN-feature-extraction space? It’s actually slightly more involved than that, but not by much (they look at a few neighbors and select the “consensus” caption). As it turns out, you can do pretty well with that on a dataset where the sentence complexity is at the level of “a man is playing guitar” or “a train is stopped at a train station”.
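
Here is a rough sketch of that idea, assuming you already have CNN feature vectors and captions for the training set; the caption_similarity function below is a crude word-overlap stand-in for the BLEU/CIDEr-based consensus scoring the paper actually uses:

```python
# Minimal sketch of a nearest-neighbor "consensus caption" baseline (not the paper's exact setup).
# Assumes train_feats is an (N, D) array of CNN features and train_caps is a list of N captions.
import numpy as np

def caption_similarity(a: str, b: str) -> float:
    # Crude word-overlap stand-in for BLEU/CIDEr, just for illustration.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def consensus_caption(test_feat, train_feats, train_caps, k=5):
    # 1) Find the k most similar training images in CNN-feature space (cosine similarity).
    sims = train_feats @ test_feat / (
        np.linalg.norm(train_feats, axis=1) * np.linalg.norm(test_feat) + 1e-8)
    neighbor_caps = [train_caps[i] for i in np.argsort(-sims)[:k]]
    # 2) "Generate" the neighbor caption that agrees most, on average, with the other neighbors.
    best = max(range(len(neighbor_caps)),
               key=lambda i: np.mean([caption_similarity(neighbor_caps[i], neighbor_caps[j])
                                      for j in range(len(neighbor_caps)) if j != i]))
    return neighbor_caps[best]
```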

Devlin et al. 2015: Decently close image captions without any language generation. The authors took a test image (which needs to be captioned), found some similar-looking images, and used either BLEU or CIDEr to find the “consensus” caption among the options available.

The authors evaluated this approach and they found that it performs almost as well as fancy decoding models when evaluating with standard Natural Language Generation metrics (e.g. CIDEr, BLEU, METEOR). They did note that in a human evaluation, the truly-generated captions were preferred. Still though, you can do surprisingly well with this simple approach. And that helps researchers diagnose poor model performance and task complexity:
– bad feature-extractors?
– bad language generation?
– low density of similar training caption examples?

I was impressed.

Case Study: Automatic “Reading Comprehension”

In 2015, Google DeepMind published Teaching Machines to Read and Comprehend so that the field could have some complex datasets allowing it to work on impressive problems like reading comprehension. The paper has over 1,000 citations.

Each entry is a multi-sentence passage for the model to read and then answer a fill-in-the-blank question.

I might’ve picked on Chris Manning earlier, but in all honesty he’s a great scientist. That’s why he’s one of the most famous and successful NLP researchers in the world. And in 2016, his team conducted A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task, which acknowledged that although Google DeepMind did provide some baselines in their initial paper, “we suspect that their baselines are not that strong.”

They show that a simple feature-based (i.e. non-Deep Learning) classifier can outperform nearly all of the deep models provided in the original paper. They look more closely at many of the examples and conclude that the dataset is easier than previously understood because of the way it was artificially constructed by bootstrapping “supervised” examples from news summary bullet points. In short, the dataset wasn’t really measuring “reading comprehension.”
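
To give a flavor of what “feature-based” means here (these particular features are illustrative, not Chen et al.’s exact feature set), the classifier scores each candidate entity with simple signals like these and trains a linear model on top:

```python
# Illustrative hand-engineered features for scoring a candidate entity in a
# CNN/DailyMail passage + question pair (not Chen et al.'s exact feature set).
def entity_features(entity: str, passage: str, question: str) -> dict:
    p_tokens = passage.lower().split()
    q_tokens = question.lower().split()
    e = entity.lower()
    return {
        "freq_in_passage": p_tokens.count(e),           # how often the entity is mentioned
        "appears_in_question": int(e in q_tokens),       # does the entity also appear in the question?
        "first_mention_position": (p_tokens.index(e)     # earlier mentions are often the answer
                                    if e in p_tokens else len(p_tokens)),
    }

# A simple linear classifier (e.g. logistic regression) ranking candidate entities
# with features like these is the kind of "shallow" model that turned out to be
# surprisingly competitive.
```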

Chen et al. 2016: A feature-engineered classifier did much better than the initially reported “baselines” in the 2015 paper. The first 8 rows come from the 2015 paper; the final row is the 2016 paper’s shallow classifier.

This example might be well-known to many because the Chen/Manning paper was a finalist for “Best Paper” at ACL 2016. In that paper, they also build their own deep model, which reaches roughly the same performance as their “human performance” estimate from a manual review of 100 randomly-selected examples. They conclude that such a dataset is ill-suited for trying to challenge models to do “reading comprehension.” If someone wants to work on that task, they need a better dataset.

Thanks, strong baselines! You prevented the field from overfitting a series of models to a dataset that wasn’t rich/complex enough!

Case Study: Automatic Text Summarization

The following year, more of Manning’s students continued research with the CNN/DailyMail dataset. Sure, it was not well-suited for reading comprehension, but the articles and bullet point summaries are actually a natural fit for the task of Text Summarization. Using the same data example as above, the new task is for a model to ingest the “Context”/passage and generate the “Query”/summary.

The model that See et al. 2017 built combined the two competing paradigms in language generation at the time, LSTM decoders and Pointer Networks, into a Pointer-Generator Network. Each paradigm had its own strengths and weaknesses, and the team was able to do even better by combining them. You can learn about the model in the author’s accessible blog post.

See et al. 2017: To my knowledge, the state-of-the-art for abstractive text summarization on the CNN/DailyMail dataset.

In NLP, there are two kinds of summarization: abstractive and extractive. Abstractive summarization (think pen) generates new text to try to summarize/paraphrase the original source. Extractive summarization (think highlighter), on the other hand, copies snippets from the source and pastes them together without generating any new text of its own. As you might imagine, abstractive summarization is very hard. This work is an abstractive model which tries to leverage concepts from extractive summarization.

So how does it do? Better than any other abstractive model to date! That’s pretty cool. On the other hand… it still does worse than generating a “summary” which is just returning the first 3 sentences of the article. For what it’s worth, the best extractive model is able to beat that “lead-3 baseline” on its dataset of comparison… but not by much (39.6 vs 39.2).
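
For reference, the lead-3 baseline really is as simple as it sounds; here is a minimal sketch (using NLTK’s sentence splitter, though any splitter would do):

```python
# Minimal lead-3 "summarizer": the summary is just the first three sentences of the article.
# Requires NLTK's punkt tokenizer data (nltk.download("punkt")); evaluation would then use ROUGE.
from nltk.tokenize import sent_tokenize

def lead_3(article: str) -> str:
    return " ".join(sent_tokenize(article)[:3])
```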

See et al. 2017: Results show that the best-performing model for summarizing an article is… returning the first 3 sentences of the article.

These super complex Deep Learning models can’t yet beat “literally just return the first 3 sentences of the article and pretend that is a summary.” What does that say about our models? What does it say about the dataset? What does it say about how hard summarization can be?

It’s worth noting poor performance could reflect weak models, an inappropriate task/dataset, or ineffective evaluation metrics. But if it’s a bad dataset or metric and the model is secretly good, we still shouldn’t make grand claims until we can back them up.

Case Study: Word Vector Analogies

In that vein, for our final example, let’s look at how baselines can debug an evaluation and show that a metric isn’t as good as you might have originally thought.

In 2013, Mikolov et al. released a series of papers defining word2vec, which took the NLP community by storm. The first such paper showed that they could learn interesting word vectors which seemed to have this “linear offset” structure that captured intuitive relationships. Word2vec became so popular (i.e. over 12,000 citations) that the word analogy task — especially the famous “King – Man + Woman = Queen” example — became a staple of word embedding evaluations.

How to go from the stated analogy to the set of instructions to run to see whether your word vectors are good enough to get the analogy correct.

The figure above outlines how word analogies “A is to B as C is to ___” came to be used to evaluate word embeddings. A lookup engine loads your embeddings, gets an analogy from a dataset, does the vector arithmetic, and finds the word whose embedding is closest to that result (excluding the three query words). The more queries the lookup engine gets correct using your word vectors, the better your vectors must be.
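
Here is a minimal sketch of that lookup engine, assuming the embeddings are unit-normalized rows of a NumPy array:

```python
# Minimal sketch of the standard analogy lookup: "a is to b as c is to ___" is answered
# by finding the word closest (by cosine similarity) to b - a + c, excluding the query words.
import numpy as np

def analogy(a, b, c, vocab, vectors):
    # vocab: list of words; vectors: (V, D) array of unit-normalized embeddings
    idx = {w: i for i, w in enumerate(vocab)}
    target = vectors[idx[b]] - vectors[idx[a]] + vectors[idx[c]]
    target = target / np.linalg.norm(target)
    sims = vectors @ target
    for w in (a, b, c):                 # standard trick: never return one of the query words
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# e.g. analogy("man", "king", "woman", vocab, vectors) would hopefully return "queen"
```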

Word analogy evaluations were not seen as the be-all and end-all metric, but performance on the task was taken relatively seriously for a while (and maybe by some people still is). But in 2016, I saw a paper at an ACL workshop which explored the analogy task a bit more critically. Essentially, the author (Tal Linzen) wanted to see whether the lookup engine was actually measuring how well word2vec was encoding analogies. What effect was the nearest neighbor lookup having in distorting the space? In the example below, the “linear offset” relationship doesn’t hold at all, yet because of how the space is structured, “screaming” would still be returned, and the example would be counted as a success.

Linzen 2016: When a* − a is small and b and b* are close, the expected answer may be returned even when the offsets are inconsistent (here screaming is closest to x).

The paper explored different query engines to try to debug what effect the nearest neighbor lookup was having in obscuring the vector arithmetic: the standard Mikolov and Levy-Goldberg queries, as well as ignoring the vector offset entirely, and even moving in the opposite direction of the offset (sketched below). Using “Singular to Plural” in the figure below as an example, if you do c+(b-a) your accuracy is 80%, but if you travel in the exact opposite direction with c-(b-a) you can still achieve 45%. Similarly, “Only-b” (e.g. instead of “king-man+woman” you only query “king”) achieves 1/3 to 1/2 the performance of the full analogy.
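
Here is a rough sketch of the flavor of those alternative query engines, following the descriptions above rather than the paper’s exact definitions, and reusing the vocab/vectors setup from the earlier snippet:

```python
# Rough sketch of alternative analogy query engines (names are mine, based on the text above).
import numpy as np

def analogy_query(a, b, c, vocab, vectors, mode="add"):
    idx = {w: i for i, w in enumerate(vocab)}
    va, vb, vc = vectors[idx[a]], vectors[idx[b]], vectors[idx[c]]
    if mode == "add":            # standard lookup: c + (b - a)
        target = vc + (vb - va)
    elif mode == "only_b":       # ignore the offset entirely: just look near b (e.g. "king")
        target = vb
    elif mode == "reverse":      # travel in the opposite direction of the offset: c - (b - a)
        target = vc - (vb - va)
    else:
        raise ValueError(f"unknown mode: {mode}")
    sims = vectors @ (target / np.linalg.norm(target))
    for w in (a, b, c):          # still exclude the three query words
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]
```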

Obviously, the full analogy engine has the highest accuracy, but I found this analysis incredibly interesting. It made me re-think something I’d spent years taking for granted. I really like Linzen’s paper!

Linzen 2016: Performance on different analogy types (y-axis) as you vary the query engine (x-axis).

As a total tangent, the day the author presented their paper at RepEval 2016 was the day I got to meet Levy and Goldberg in person (woohoo!).

I think it’s fair to say that word2vec analogies were overhyped. Ian Goodfellow literally wrote the book on Deep Learning, and even he had been misled about parts of the analogy task.

This tweet is from May 2019.

Of course, in fairness to Ian Goodfellow, he is a Computer Vision researcher, not an NLP person. If there was a miscommunication, it was that NLP folks didn’t challenge the word analogy task enough. That’s why we need baselines not just for understanding datasets but also for sanity-checking evaluation metrics.

Good Baselines are Good Science

I think the above work is honestly so cool! Each one “debugs” the complexity of tasks or datasets that many researchers use. And they remind us to have humility in the models that we build and the results that we report.

In all honesty, baselines aren’t sexy. People don’t get grants for baselines, just like they don’t get grants for replication. But they’re very important.

As a scientific community, we need to be the ones policing our own hype. Sometimes a new model comes along and it’s seriously revolutionary (e.g. Computer Vision in 2012). But in NLP, we spent a long time between 2012 and now trying to make Deep Learning work without seeing the same record-shattering performance as in Vision. Maybe that’ll be different now with BERT and Transformers. Between TA’ing the past two semesters, I really haven’t explored that yet, but GPT-2 definitely has seemed impressive. But whatever we do, we should compare against good baselines.

In January 2019, I took a one-week class over Winter Break about science policy and how America funds innovation. It was a very interesting topic, and overall I quite enjoyed it! One interesting component was that it was interactive: students were regularly encouraged in-class to have discussions and reflect on the material. I thought that was a great idea, though once we began, there was something glaring that I noticed: the male students were talking much more frequently than the female students. I’m sure everyone has anecdotes about this, so I decided to collect a little data: I made tallies in my notebook every time someone of each gender spoke. I certainly wasn’t looking to single anyone out, but I was curious to understand what was going on with a little more clarity.

Soon, I realized that just doing tallies didn’t capture the whole story. There were other noteworthy things that I wanted to try to understand & capture. As is usually the case, we had an unstated norm that you should raise your hand before speaking. But there were some people who tended to break that norm, and especially at first, those people all happened to be men. So I wanted to start also tracking “called upon” participation vs “not called upon” participation.

It felt like some periods of discussion had really gender-diverse back-and-forths while other stretches were male-dominated (and one time female-dominated). Occasionally this felt topic-dependent, such as the higher participation of women in the conversation about Grace Hopper and diverse teams. But more often than not, it was still noticeable during “neutral” topics. It was frustrating seeing someone raise their hand to get called on, only to have someone else speak up over them anyway (this happened to one woman at least 3 times). I think that when someone got leapfrogged, it sent a subtle, exclusionary message of “this is not quite your conversation” (especially when the topic would then change). So I tried generalizing my anonymous tally to an anonymous sequence of which gender was speaking (to see whether my intuition about streaks was actually happening in this particular class).

The final tally of participation can be seen here. To summarize, men spoke more than twice as often as women in general. Men were four times as likely as women to speak without being called on by the discussion leader. That seemed bad to me.

Data

I’ll start by saying that the data here (and everywhere) is imperfect, because counting things like called/uncalled discussion requires arbitrary decisions (e.g. should I count the instructor? Should I count the discussion leader? Should I count a panelist student who isn’t leading the discussion but spoke up without being called on? How do I count someone speaking up uncalled-on during an awkward silence?). Usually I didn’t count the instructor or discussion leader, but sometimes I did if it was an inorganic interruption. If someone responded immediately to someone talking to them, I didn’t count that as a second discussion tally.

With that said, here is the data that I recorded. For Monday, I just had tallies. Around Tuesday afternoon, I started recording a time series of discussion to better capture the two phenomena above (uncalled-upon participation and gendered streaks). For days where I logged a time series, I show the summary counts below.

Monday (started counting at 11 am, left class at 3pm):


                 Male    Female
Called upon        24         9
Uncalled upon       4         2

Tuesday Morning


                 Male    Female
Called upon        27        21
Uncalled upon      10         2

Tuesday Afternoon


                 Male    Female
Called upon        40        20
Uncalled upon      13         3

Wednesday


                 Male    Female
Called upon        13        11
Uncalled upon       5         1

Thursday


                 Male    Female
Called upon        27        10
Uncalled upon       9         4

Friday


                 Male    Female
Called upon         8         6
Uncalled upon       1         1

Discussion

I have the following observations from looking at this data. Underlying all of them, however, is the caveat that this is a rather small sample size. It would be wise not to overgeneralize or read too much into this one experiment.

  1. Men talked more than twice as often as women, even when called upon. I think there are plenty of plausible explanations for this. In my opinion, regardless of whether one gender actually was “smarter” (whatever that means, if anything), it seemed like men had more confidence / less apprehension to say what they were thinking about.
  2. In the few times I did make a note of the student discussion leader’s gender, I didn’t notice large differences in participation depending on that variable. That doesn’t agree with what my intuition would be, but that’s what my (small) collected sample indicated.
  3. Even when the course staff took great steps (e.g. inviting 50/50 men and women on the Friday panel), gendered norms outside of our class’s control limited how effective these ended up being. For the panel, both women left to care for their sick children, which led to an all-male panel. This was very unfortunate.

Personally, I never used to pay much attention to gender dynamics in conversations. I didn’t really notice until a close friend pointed out to me that not only are there general trends, but that I, specifically, sometimes interrupted or dismissed women during conversation. He told me this in a non-accusatory way to try to make me feel less defensive, which I think made a world of difference. In my experience, calling someone out for something makes them feel defensive, and it is counterproductive at getting everyone on the same team to address these issues. In reality, we’re all trying our best here, and sometimes there are blind spots we miss. But ever since I noticed that particular spot I’d been overlooking, it’s been a lot harder for me to unsee moments in my own life when women get fewer opportunities to speak.

Moving Forward

Obviously, I’m just one person. I don’t have many answers for how to make the world a better place. The following are some thoughts that I have about what seem like they would help.

For instructors / discussion leaders

If you’re going to have class discussions (which are a lot of fun & often very helpful), letting things sort themselves out might not lead to a desirable outcome. Obviously, every class is different. But if you do take a hands-off approach, at least pay attention to whether that has undesirable effects.

Many of these social dynamics are policed by informal norms of behavior, which differ from person-to-person and culture-to-culture. So it might be helpful to simply establish the rules of the road (whatever you want them to be) upfront, such as by making it clear the discussion leader can step in and say “please wait to be called on before talking”.

Alternatively, you could do a stronger form of participation where students opt-out instead of opt-in. You could go down the row as a “popcorn”, or you could use a system of name tags so that you can cold call people to show the conversation is for everyone (though some people are introverted, so strongly consider giving an opt-out if you’re cold calling people and putting them on the spot).

Also, if many people want to talk, it could be helpful to have a queue (either explicitly on the board or even just informally: “Let’s hear from Josh, then Sarah, then Dave”). This can give people a sense that their ideas matter and that other people can’t jump the line.

For everyone

It’s like that old saying goes “You’re not in traffic… you *are* traffic.” In a perfect world, everyone would feel included to say what they want to say without feeling like their contributions are unwanted or unappreciated. Unfortunately, whether we intend it to happen or not, most conversations I’m a part of don’t reach that ideal. I think if everyone in the conversation is mindful about questions like “Am I talking too much?” and “Did that person get to make their point without being interrupted?” then we’ll get a lot closer. 

If you’ve never really noticed or thought of this issue before, you could feel encouraged to do your own informal tally for a week in classes or meetings. It doesn’t need to be capital “R” Research. Maybe you’re in more balanced meetings than I am! If so, please let me know what the secret to success is!

One last note: for this class, I did “contaminate” my data a little bit. After 2 days, I felt like I’d collected enough evidence to point to a real pattern. I figured calling people out would have come across as aggressive and virtue-signalling, and would have alienated people (which could cause them to feel defensive and mistrustful). Instead, I tried talking to my peers one-on-one during class breaks about my informal observations. Some people were more receptive to the concerns than others. I’m not actually sure there is a surefire way to convince someone that this is a problem and that they might also want to be concerned about it. 🤷🏻‍♂️

Parting Thoughts

If I may editorialize a bit, it’s always been strange to me when I have conversations with folks about diversity in STEM and they think that biological differences between men and women are likely to explain a lot of these disparities. I’m certainly no biologist, so I can’t definitively comment on the science, but if that were the explanation, how would it explain the history of Computer Science? Comp Sci has many women pioneers, including Ada Lovelace, Grace Hopper, Katherine Johnson, the ENIAC programmers, and the 10,000+ women codebreakers in WW2.

In 1985, about 1-in-3 Computer Science students were women. But the late ‘80s ushered in the era of the personal computer, which was marketed to boys as a machine for playing games. Families bought PCs for their sons, but not as often for their daughters. Not long afterwards, when those children arrived in Intro to CS classes, the boys had had a head start on programming experience. Over time, it seems this performance gap solidified into a self-fulfilling prophecy and a culture about one’s “inherent ability” to understand CS. Within a few decades, women’s participation in CS had been cut in half, even as it rose in other fields such as law, medicine, and the physical sciences.

Maybe there are some biological differences in some areas – who knows? But for conversations about women in STEM, it seems far more likely to me that we’re dealing predominantly with nurture, not nature.

Revisiting Past Decisions: I’m Going to Try Watching Football Again

Over a year ago, I decided to stop watching football. To briefly summarize that thought process: I used to be a huge football fan, but after I started reflecting on the importance of going against one’s own culture/interests to do what is “right” (e.g. the idea of a conservative supporting some basic gun safety laws), I decided I should walk the walk.

Football causes intense brain trauma to players, and that affects over a million athletes across high school, college, and professional football. The summer before I stopped watching, a study came out showing that of the 111 brains of former NFL players examined, 110 had CTE (99%); 88% of the college players’ brains examined also had CTE. Factor in that poor kids are more likely to play football as their lottery ticket out, plus the link between race and socioeconomic status, and the whole thing seems wrong.

So why would I return to watching football? Has any of that changed?

No, all of that is still true. But that has never been the whole universe of considerations for me. What it comes down to now is that football is a social connection. I’ve never been to Seattle or Jacksonville, but if I meet someone from there, talking about football is a good icebreaker.

Is The Social Benefit Really Worth It?

I think so, yes. One major concern of mine is how many people live in their bubbles and never have to confront good-faith disagreements from other people. It makes it easy to strawman any disagreement, and it results in a political environment of yelling at each other while the wealthy and corporations consume all of the economic growth from the last 40 years. Geography (i.e. urban vs rural divides) and class already make this hard, and technology has only accelerated it. I want to try to interact with more “real” people (i.e. people who don’t all think like MIT PhD students), and I think that gets non-trivially hindered by my long list of things I’ve given up. Navigating those spaces is already uncomfortable and hard, and common ground can be a very helpful starting point. So maybe it’s time for me to reassess.

Let’s say you and I are at dinner and you order a pitcher of beer for us. That’s when I awkwardly tell you that I don’t drink. We get our menus, and I decline splitting a pepperoni pizza because I don’t eat meat. I guess you probably feel like you shouldn’t tell that story about the amazing T-Bone that you ate last month. But it’s no problem! You pivot the conversation to the latest Antonio Brown drama. Sorry, once again, because I haven’t been watching football.

In reality, I do try to superficially talk football when meeting new people if they bring it up (though I’ve now missed two offseasons’ worth of players moving around). But the basic idea still stands: people don’t want to feel judged. Everyone just wants to enjoy their thing without feeling like someone thinks they’re a bad person. Is every problem in the world really worth adding those tiny social wrinkles to such a large number of my interactions?

What was the last thing you changed your mind about? I don’t know about you, but for me it’s rarely because some “other” convinced me I was wrong & dumb & needed to adopt their solution. No, it has usually been someone I could relate to telling me, “Yeah, I struggle with _____ and haven’t figured out what to do yet.” That’s what actually gets me thinking and reflecting on what I should do. I might not end up exactly where they do, but this give-and-take about “the right thing” needs trust and social connection for us to figure out how to be our best selves.

Well Then Why Is It Important for Football But Not Vegetarianism?

Essentially, it boils down to my current belief that I can’t pull off all of these hard lines at once. I don’t drink for personal and health reasons, but I gave up football & meat because they seemed like “the right thing to do.” Now I think I’m at a point where one of them needs to give, and if I have to pick which one stays, I pick vegetarianism. I think the scale of suffering for animals (who don’t have any say in the matter) is worse than what football does to poor kids (who don’t have very good choices, but have more autonomy than the animals do). PETA estimates that in the US we kill 9 billion chickens per year; given that 70-85 million people died in World War II, that’s a death toll of over 100 WW2s each year. For chickens alone. And the way factory farms treat these animals would be criminal if done to other animals, like dogs or cats.
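(For the curious, here’s the back-of-the-envelope arithmetic behind that “over 100 WW2s” claim, assuming roughly the midpoint of the 70-85 million range cited above:

9,000,000,000 chickens per year ÷ ~77,500,000 WW2 deaths ≈ 116 WW2-sized death tolls per year.)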

What Does Not Watching Football Do Anyway?

Like many things in life (such as recycling more or going vegetarian), it’s not so much about the literal individual impact. It’s more about solidarity and signalling that things could be better if everyone chose to do <insert_good_thing_here>. To me, it’s showing that you think some cause is important enough to change your own habits over, rather than just insisting that other people have to do all of the changing.

To be honest, in March 2018 I was looking for something I could do to tell myself that I was doing “more”, and sacrificing football was that thing for me. At the time, it was probably the right move for me as I started my journey of trying things and seeing what felt right and what seemed to help others. It was a good process of grabbing on to what I could in areas like activism, protests, governance, research, and more. There was no way for me to get everything 100% right on the first try. But it’s been a useful journey.

In December 2018, I put my convictions to the test. I was offered ~$3,000 to consult on a high school sports analytics project. Should I turn down the offer because I’d sworn off football, or should I take the job and donate the money to a good cause, like fighting malaria in developing countries or promoting diversity in STEM? I decided that my participation in the project didn’t hurt anyone & I agreed to join. Now I’m using the money to help start a local chapter of Girls Who Code at my old high school, and depending on how much that costs, I could still have a lot left over to donate to a cause that would otherwise not have gotten that money. The exercise forced me to put a dollar amount on the choices I’d made and to assess whether they were worth it. In that case, sticking to the boycott wasn’t. And in the case of social costs and benefits, I again don’t think it’s worth it.

But What If I’m Wrong? What If Football Really Is As Bad Or Worse?

That could be! As time has gone on, I’ve only gotten more confused & uncertain about what is or isn’t “right”. Some things seem pretty obviously correct, such as fairness, dignity, and education. But a lot of things are hard. I’m viewing this decision as a course correction after maybe going too far in one direction last year, but either way I should probably reassess again in a year or two and see if I feel any differently.

I feel like I’m at a point where I’m on the fence about football & could end up on either side at the end of the day. Honestly, I’m probably more uncomfortable with my Amazon Prime membership because of how much power it gives Amazon, how it eliminates competitors that can’t keep up, how much control the company has over us, and how relentless it has been about efficiency.

I really don’t like the pressure I feel like I’ve put on myself to be someone who “doesn’t watch football for moral reasons” (not that anyone actually cares what I do, but we always feel like society is closely watching & scrutinizing us). Besides, there’s a spectrum of football participation; it’s not a binary. I can watch some games without buying merchandise or attending in person. In fact, it’s pretty hard to avoid following the sport 100%: what do you do when it’s on at a bar, or when your newsfeed is full of football posts from friends? For now, I’m going to start with baby steps.