A couple weeks ago, I attended the Machine Learning for Health workshop at NeurIPS 2019! It was a lot of fun.
The workshop was recorded, and the videos are already available on the website, which is tremendously helpful for researchers who couldn’t attend because of visa or conference capacity limitations, young researchers looking to understand the field, or anyone else who wasn’t able to attend.
I took some notes on each talk, because there are 5-6 hours of video (with no discernible way to speed it up to 2x). This post is ~9,000 words, which takes only 35-45 minutes to read. I tried to use accessible language to explain the concepts from the videos, though I invite you to use this as a guide to identify which talks you are especially interested in watching at their full length.
Please let me know if you identify any typos or misunderstandings (my twitter handle is @willieboag).
The workshop’s General Chair, Brett Beaulieu-Jones, started the event by recognizing that the event was held on unceded territory of the Coast Salish peoples (a recognized protocol for guests to acknowledge the host nation, its people, and its land). Additionally, he pointed us towards the NeurIPS code of conduct, including the hotline email NeurIPShotline@gmail.com to report inappropriate behavior.
Brett gave an overview of the submissions and acceptances: abstract acceptance was 34.3% (68/198) and full paper acceptance was 17.1% (19/111). There was a marked increase in papers consulting or being authored by clinicians. And this year half of the papers used multiple datasets (up from just 16% last year). Wow!
The first invited speaker was Daphne Koller, talking about her new work on Machine Learning for Drug Discovery. Drug discovery has been slowing down exponentially, despite humanity’s great progress on finding breakthrough treatments in the last 50 years. This trend is also known as Eroom’s Law (i.e. Moore’s Law backwards). It is very difficult to know which drugs to pursue (because of the great time and expense involved), so it would be very helpful to have a “compass” to guide which directions to take. Fortunately, as she pointed out, these sorts of heuristic, pattern-matching guesses are exactly what Machine Learning can provide.
She described how the convergence of the “ML Revolution” and the ability to generate biological data (using tools like CRISPR) has created an inflection point in “Digital Biology.” At her new startup, they have a data factory for cell lines, where they are able to perturb cells using CRISPR and then phenotype them with RNA Sequencing or imaging. Using deep learning, they create a low-dimensional manifold on which the cells exist. In this “remarkable organization” of the data, they then cluster the data to try to identify which structures are similar. This would be especially useful for differentiating labels that are very heterogeneous (e.g. we now know that “breast cancer” isn’t just one thing) and probably shouldn’t be treated with one-size-fits-all drugs.
I don’t know as much about the economics of pharmaceutical research, though I’m reminded of a presentation I saw about the financial constraints of drug discovery. As Daphne explains, investing in R&D for drugs is boom-or-bust: 95% of the time, you waste your hundreds of millions of dollars of investment, but 5% of the time, it pays off 60-fold (note: numbers may vary depending on drug type). It’s technically a very good “return on investment”, but unless you have enough money to take many shots on goal, you’ll likely end up as a bust. The talk above proposes financing models (e.g. a really clever idea to issue bonds on a cancer drug that will pay off with 98% success rate). In the financial constraints talk, Andrew Lo describes how the mega-fund of many shots on goal is essentially a collective action problem, because individual companies can’t take 150 shots on goal at once. Dr. Koller’s work is interesting because it’s not trying to solve the collective problem but rather to raise that 5% individual success rate into something higher. As an analogy, the mega-fund would be like an industrial-grade tomato processing machine that efficiently processes tomatoes but only for large farms that can afford it, whereas Daphne’s efforts would be analogous to new tomato seeds that increase the yield for any farmer, regardless of their farm size. Though I do wonder how successful the drug discovery “compass” would need to be in order to start paying off (e.g. if it raises the chance from just 5% to 6%, is that enough to discover drugs at an observably faster rate?).
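To make the boom-or-bust intuition concrete, here is a back-of-envelope sketch using the (approximate, per the talk) numbers above; the 150-shot figure is the mega-fund size mentioned later:

```python
# Hypothetical numbers from the talk: 95% of drug programs fail (total
# loss), 5% succeed with a ~60x payoff. These are rough illustrations,
# not precise industry figures.
p_success = 0.05
payoff_multiple = 60

# Expected multiple on a single "shot on goal" -- good in expectation
expected_multiple = p_success * payoff_multiple  # 3.0x on average

# But one firm taking one shot usually busts; a mega-fund taking
# 150 shots almost surely hits at least once.
p_at_least_one = 1 - (1 - p_success) ** 150

print(expected_multiple)         # 3.0
print(round(p_at_least_one, 4))  # 0.9995
```

This is why diversification (the mega-fund) and raising `p_success` (the ML "compass") attack the same problem from different angles.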
Daphne then gave a whirlwind tour of the initial results that her startup has achieved so far, and then concluded with a philosophical reflection on the progression of science: there were periods where one or two fields were able to make incredible amounts of progress over a short period of time:
- Late 1800s: Chemistry (moved away from Alchemy into understanding elements)
- 1900s: Physics (understood the connections between matter & energy, space & time)
- 1950s: Computing (silicon chips allowed advanced computations)
- Microbiology (microscopy, sequencing allowed measurement at new scale)
- “Era of Data” (computing PLUS stats and optimization)
- Next: Digital Biology (reading, interpreting, and writing biology using new technologies)
Maithra talked about her work Transfusion: Understanding Transfer Learning for Medical Imaging, which worked to understand why transfer learning seems to work well on some tasks. Transfer learning involves pretraining on a task (e.g. ImageNet) and fine-tuning on your task of interest (even something clinical, such as MRIs). As we’ve seen, this approach has become standard practice in Vision (with pre-trained ImageNet models) and NLP (with BERT). But one thing that is not clear is why a pre-trained ImageNet model would do well on very different domains, such as medical tasks (e.g. reading MRIs). This work runs a few experiments to try to probe this phenomenon further.
They began with simple performance evaluations as a starting point. Using standard ImageNet architectures on chest x-ray and fundus photos data, they showed that randomly initialized weights achieve similar performance as a pre-trained network. However, models with pre-trained weights converge faster than randomly initialized models. But this might not be because the weights themselves are good — for instance, it could be that the scaling of the weights yields better gradient flow. To test this, they created a third model with randomly sampled weights, but sampled in a way to preserve the scaling and variance of the pre-trained weights. They found this scale-preserved network also converges notably faster than the random initialization.
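A minimal sketch of that third, scale-preserving initialization (my reading of the experiment; the Gaussian sampling and the toy layer shape below are my own assumptions for illustration):

```python
import numpy as np

def scale_preserving_init(pretrained_w, rng):
    """Sample random weights that match the mean and variance of a
    pretrained layer's weights -- random values, preserved scale."""
    mu, sigma = pretrained_w.mean(), pretrained_w.std()
    return rng.normal(mu, sigma, size=pretrained_w.shape)

rng = np.random.default_rng(0)

# toy "pretrained" conv kernel (values are made up)
w = rng.normal(0.01, 0.05, size=(3, 3, 64))
w_rand = scale_preserving_init(w, rng)

# the new weights are random, but their scale roughly matches
print(round(float(w.std()), 3), round(float(w_rand.std()), 3))
```

If this scaled-random network trains nearly as fast as the truly pretrained one, the speedup is attributable to weight scaling rather than the learned features themselves.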
Additionally, they compared transfer learning for small models against transfer learning for large models. They studied this by looking at the correlation between weights of a model before training vs after training. They found that (even when randomly initialized), large models had higher correlations in weights. This means that the larger models were changing less from the finetuning process.
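The before/after weight comparison can be sketched as a simple Pearson correlation over flattened weights (the simulated "large" and "small" model layers below are illustrative stand-ins):

```python
import numpy as np

def weight_correlation(w_init, w_trained):
    """Pearson correlation between a layer's weights before and after
    fine-tuning; values near 1 mean the layer barely moved."""
    return np.corrcoef(w_init.ravel(), w_trained.ravel())[0, 1]

rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)

# simulate a "large model" layer that changes little during fine-tuning
w_large = w0 + 0.1 * rng.normal(size=1000)
# ...and a "small model" layer that changes a lot
w_small = w0 + 2.0 * rng.normal(size=1000)

print(weight_correlation(w0, w_large) > weight_correlation(w0, w_small))  # True
```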
I like this work! In general, I’m a fan of probing a dataset to see whether kicking the tires gets unexpected results. Science is often driven by tinkerers who are trying to figure out why something is happening. There is still more work to be done, but I appreciated this presentation!
I couldn’t find a twitter profile for Xinyu. Chirag’s handle is @nagpalchirag.
Xinyu described her work in survival analysis, Deep Survival Experts: A Fully Parametric Survival Regression Model. In case you’re not familiar with survival analysis, she gives a brief overview, but I’ll summarize her overview here: it’s like an ordinary regression problem for predicting time to an event, except that some data points are censored (e.g. they never came back to the doctor, so you don’t know whether they died or not; you only know the most recent time you saw them alive). It’s actually slightly more complicated than that, because they predict more than just one event (e.g. time to breast cancer, time to cardiovascular disease, etc).
Xinyu then gave a tour of previous work in survival analysis modeling, including Kaplan-Meier, Cox Proportional Hazard, DeepSurv (2017), and DeepHit (2018). These models had increasing complexity for how the covariates interact with the baseline hazard (e.g. log-linear vs deep learning) and multi-event modeling. Earlier models assumed that the proportional hazard over time is constant (which is very unlikely in the medical domain), but DeepHit’s solution to discretize into buckets doesn’t scale well to long time horizons. They propose a new approach: Deep Survival Machines.
Chirag describes their model. Illustrated in the figure below, it works as a mixture over many parametric survival distributions (i.e. fit a bunch of models, potentially on different time scales, and then at test time, attend to whichever model is most relevant). Parameter inference is performed separately for censored and uncensored data, where both terms are available in closed form (e.g. PDF or CDF), which allows them to compute gradients and optimize the entire model with gradient descent.
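Here is a rough sketch of that censored/uncensored likelihood split, using Weibull components as the parametric distributions (the paper's actual component family, mixture weights, and covariate-conditioning are not reproduced here; the numbers below are made up for illustration):

```python
import numpy as np

def weibull_pdf(t, k, lam):
    return (k / lam) * (t / lam) ** (k - 1) * np.exp(-(t / lam) ** k)

def weibull_survival(t, k, lam):  # S(t) = 1 - CDF(t), in closed form
    return np.exp(-(t / lam) ** k)

def mixture_log_likelihood(t, event, weights, ks, lams):
    """Log-likelihood of a mixture of parametric survival distributions.
    Uncensored points (event=1) contribute the mixture PDF; censored
    points (event=0) contribute the mixture survival function. Both
    are differentiable, so the whole model can be trained by gradient
    descent."""
    pdf = sum(w * weibull_pdf(t, k, l) for w, k, l in zip(weights, ks, lams))
    surv = sum(w * weibull_survival(t, k, l) for w, k, l in zip(weights, ks, lams))
    return np.sum(event * np.log(pdf) + (1 - event) * np.log(surv))

t = np.array([2.0, 5.0, 8.0, 10.0])     # times to event / last observation
event = np.array([1, 1, 0, 0])          # last two patients are censored
ll = mixture_log_likelihood(t, event, [0.5, 0.5], [1.5, 2.0], [4.0, 9.0])
print(np.isfinite(ll))  # True
```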
You might be surprised by how many simple concepts in academia are obfuscated to sound fancier than they actually are. Xinyu and Chirag explained this motivation so clearly that it seemed like an obvious solution, which is a sign that they did a great job. Additionally, this is a really nice example of what the Machine Learning for Healthcare community tries to do, in general: take existing models which struggle on domain-imposed constraints (e.g. poor scaling to long time horizons) and use that to motivate the development of a new model that performs better!
This presentation by Cian and Nenad was a tale of two projects. Both were about Acute Kidney Injury (AKI). In the first half, Cian talked about an app they developed (NPJ Digital Medicine) to help improve the treatment of AKI in the clinic without using any Machine Learning (just a simple rule-based definition of AKI). For the second half of the talk, Nenad described a proof of concept study (Nature) they did on retrospective data to show that ML could help identify cases of AKI up to two days before it occurs, which they expect would help with planning and management of treatment if it gets deployed.
When I was listening to their talk, I was struck by how similar it sounded to the many sepsis projects that the ML for Healthcare community has been working on for years: acute problem that affects millions of people, existing efforts to do better with national mandates (i.e. National Patient Safety Alert for AKI vs Surviving Sepsis campaign and SEP-1 bundles for sepsis), and ML researchers working on early detection. Nenad’s talk reminded me a lot of the set of papers by Joe Futoma et al. (ICML 2017, MLHC 2017) predicting sepsis early using RNNs. Both of these approaches did a lot right, including simulating a prospective study to see how early the method is able to catch the onset:
Nenad went into detail about all of the well-executed experiments they did, including the simulated prospective study, evaluating on heldout sites, and a discussion on how AUC isn’t very clinically meaningful because so much of that region isn’t “clinically appropriate.” He also noted that because the VA population is male-dominated, the model currently performs worse on female patients. They haven’t actually deployed this model, though I think they hope to.
Personally, I was most interested in hearing Cian’s talk about the Streams app that they built to create “digital pathways” for standardized care protocols and specialist response teams. He mentioned their paper which evaluated the tool and found that it helped patients receive treatments faster and have shorter lengths of stay. I think work like that is among the most important things Computer Scientists should be focusing on; we don’t need fancy Deep Learning for many of these problems. Atul Gawande’s amazing book The Checklist Manifesto (note: audiobooks are your friend!) emphasizes how a large fraction of medicine isn’t that people don’t know what to do; it’s that to save lives you need to do things without mistakes and without letting things slip through the cracks. Even simple checklists — if well-designed to enable better communication and culture — can often deliver the lion’s share of improvement. We see the same thing with California’s tremendous efforts to cut maternal mortality rates to ⅓ of what they used to be over the last two decades.
Obviously, I’m also excited to see how much better the care team could do using ML to forecast AKI as opposed to reacting to the rule-based definition, which tends to be met after some damage has already been done. Looking forward to hearing more. Also, I think it was a really wise move to partner with the VA for their prediction research.
The goal of this work is fall detection. In particular, they generated synthetic data of stick figures and used a physics engine to model the figures falling. This was their training data, and they evaluated performance on two real, public datasets.
In their evaluation, they investigated two questions:
- How much is performance hurt by training on synthetic data vs training on real data?
- Does pose estimation (as opposed to direct pixel space prediction) improve generalization across datasets?
It seems pretty suspicious that they do *better* with synthetic (as opposed to real) data on pose-input for dataset 2; perhaps there is large variance in the evaluation. It’d be interesting to see confidence intervals to better understand that. For generalization, they find that pose estimation improves performance, especially for dataset 2. Once again, I’m a little surprised by some of these results, in particular: why is dataset 2 so low accuracy in pixels-space but not pose-space for both training datasets? Is the use of accuracy (as opposed to AUC, F1, etc) hiding some kind of label distribution issue? If they trained on a heldout segment on dataset 2, should we expect much better performance, or is dataset 2 simply difficult to measure? I’d love to see the evaluation fleshed out a little bit more.
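The worry about accuracy hiding a label distribution issue can be illustrated with a toy example (not their data): under a skewed label distribution, a trivial majority-class predictor scores high accuracy while catching zero falls.

```python
# Toy illustration: 90% "no fall" (0), 10% "fall" (1)
labels = [0] * 90 + [1] * 10
preds = [0] * 100  # always predict "no fall"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall_falls = sum(p == y == 1 for p, y in zip(preds, labels)) / 10

print(accuracy)      # 0.9  -- looks fine
print(recall_falls)  # 0.0  -- useless for the task we care about
```

This is why metrics like AUC or F1 (or at least reporting the label distribution) would make the comparison easier to interpret.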
He ended his talk with a discussion of limitations, which is great practice; every paper should try to do this. Umar mentioned they’d like to model depth. If they’re not familiar: at the NeurIPS workshop ML4H 2017, Fei-Fei Li talked about her group’s work using depth sensors to deidentify videos for privacy-preserving smart hospitals (MLHC 2017). Even earlier, Zhang et al. published on Privacy Preserving Automatic Fall Detection for Elderly Using RGBD Cameras in 2012. These works were not cited in this paper, though they might be useful places to look when developing future work.
Paul gave a really clear presentation about his paper Retina U-Net: Embarrassingly Simple Exploitation of Segmentation Supervision for Medical Object Detection. He gave great motivation for the work and a comprehensive evaluation. The proposed model is able to perform the appropriate clinical task while still using the provided fine-grained annotations! And the code is publicly available as a toolkit. I liked this talk a lot!
Semantic Segmentation is an extremely popular approach for tasks in Medical Image Analysis, but these pixel-level predictions don’t actually answer the kinds of questions that are typically of clinical interest. The task of interest might be patient-level (e.g. cancer or no cancer) but the annotations are of nodules at the pixel-level. Reducing those annotations to the patient-level would be a large loss of information to give to the model, which is especially challenging because medical datasets are often much smaller than general domain ones. This results in a frequent mismatch between the prediction (and the evaluation metric that necessarily follows) and the clinical task that you actually care about. In Medical Imaging, pixel-level metrics like Dice Score are often incorrectly used to answer clinical tasks at the object-level. Evaluating models on clinically relevant scales is an essential part of impact-driven research in our field.
Choosing a beneficial scale for training is a valid approach, so long as predictions are transferred to the clinical scale for evaluation. However, as illustrated by this figure below, it is not always obvious how to aggregate these predictions into an object-level score. Should it be max-value? Average value? Majority vote?
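The candidate heuristics can be sketched in a few lines (the pixel scores below are made up):

```python
import numpy as np

# Three common ways to aggregate pixel-level scores into a single
# object-level score; none is obviously "right", which motivates
# learning the aggregation end to end instead.
pixel_scores = np.array([0.1, 0.2, 0.9, 0.8, 0.3])  # one candidate object

max_value = pixel_scores.max()                        # most confident pixel
average = pixel_scores.mean()                         # average confidence
majority_vote = float((pixel_scores > 0.5).mean() > 0.5)  # most pixels above 0.5?

print(max_value, round(float(average), 2), majority_vote)  # 0.9 0.46 0.0
```

Note how the three heuristics can disagree on the very same object: max says "probably positive" while average and majority vote say "negative".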
Rather than relying on an ad hoc aggregation heuristic, this work learns the aggregation in an end-to-end manner. They propose Retina U-Net, which essentially adds a pixel-wise loss to the standard Retina-Net for object detection. This hybrid model addresses the challenge of aligning a model’s output structure to the scale of a clinical task, but it also maintains data-efficient training on small datasets by using the pixel-wise annotations. They evaluated this method on two datasets: LIDC and an in-house Breast MRI dataset. The experiments were very comprehensive, testing both 2D and 3D, as well as against segmentation+aggregation baselines and against instance segmentation.
Jerry presented his work on Generative Image Translation for Data Augmentation in Colorectal Histopathology Images. Essentially, they want to tackle the problem of rare classes in small datasets by using GANs to generate synthetic images of those rare classes. They used CycleGAN to generate images of rare examples (adenomatous polyps) from normal examples (normal colonic mucosa), like how in the general domain CycleGANs learn how to generate images of zebras from horses. Using this image-to-image translation model provides a starting structure for the generated images, because adenomatous polyps always originally began as normal colonic mucosa.
Using the augmented data for training, they improved downstream performance on identifying whether an image is adenomatous polyps or normal colonic mucosa. However, they decided to test whether training on *just* the synthetic images would achieve good performance, and they found a large decrease in performance without any real images in the training set.
My favorite part of this work was their “Pathologist Turing Test”, where they asked pathologists to spot which images were real and which were generated. Some of the doctors did indeed show significant ability to spot the fakes, but most of them didn’t show huge expertise at the task. I appreciated that they noted limitations with this study, namely that pathologists are not trained to detect real vs fake, and they were shown only fixed-size tiles (not slides) for the question. Even still, I was curious what Pathologist 1 from Task B saw that allowed them to perform so well on this task, so I checked the paper:
“Based on feedback from pathologists, fake sessile serrated adenoma images were easier to identify because our CycleGAN model created a subtle mosaic-like pattern in the whitespace of images. Sessile serrated adenomas tended to have more whitespace because they are defined by a single large crypt (of mostly whitespace), which might explain why it was easier to detect fake sessile serrated adenomas than tubular adenomas.”
Emily presented two projects about modeling cognition: one at Apple (predicting cognitive impairment from smartphone activity) and one at the University of Washington (building a generative model for co-activated regions in the brain). This was a very interesting talk, and she did a really great job presenting. I especially liked her slides & visuals.
For the first project, she talked about the passive sensing capabilities of the iPhone (though she alluded to the opportunity for intervention-based experiments with smart and wearable technology). They ran a study with 113 affirmatively-consenting iPhone users, where the researchers built an ML model to predict whether a user has Alzheimer’s Disease based on the sequence of apps that they access over a series of sessions. A prediction was at the person-level, where a person was represented as a set of “sessions”, and each session was a sequence of apps that the user accessed between unlocking and locking their phone. As shown in the figure below, a session was represented as the average embedding of the apps in that session, though they also ran experiments using a simple bag-of-words count-based session representation. The sessions were then clustered using k-means to assign a cluster label to each session. Finally, the person-level representation was formed by counting the number of cluster labels that a given person had (e.g. had 3 sessions from cluster #1, 0 sessions from cluster #2, etc).
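The session-to-person pipeline might look roughly like this (the app names, embedding size, and centroids below are hypothetical stand-ins; a real pipeline would learn the embeddings and fit k-means properly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical app embeddings (app -> vector); names are illustrative.
app_embeddings = {app: rng.normal(size=8) for app in ["mail", "maps", "photos", "news"]}

# A person is a list of sessions; a session is the apps used between
# unlocking and locking the phone.
sessions = [["mail", "news"], ["maps"], ["photos", "photos", "mail"]]

# 1) represent each session as the average embedding of its apps
session_vecs = np.array([np.mean([app_embeddings[a] for a in s], axis=0)
                         for s in sessions])

# 2) assign each session to its nearest cluster centroid
#    (standing in for the fitted k-means step)
centroids = rng.normal(size=(2, 8))
labels = np.argmin(((session_vecs[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

# 3) person-level representation = histogram of session cluster labels
person_vec = np.bincount(labels, minlength=2)
print(person_vec.sum())  # 3 -- one count per session
```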
The results of this experiment (shown below) are interesting. They found that the embedding representation performed much better than a one-hot encoding (though it’s not clear to me whether they still did the clustering for that baseline). I’m a bit surprised that the AUC could be as good as 0.8 for just N=113 between train/test, especially since the centroid embedding approach disregards the sequence order of app usage. One of the most interesting parts of this was that she highlighted the Apple “Research Kit” tool which seems to essentially create the infrastructure for other researchers to create experiments like this by recruiting people who affirmatively consent to volunteer their data to the researcher for the project. I think that idea opens up a lot of possibilities for potential research questions, and I like that it brings user consent to the forefront of every study. I’m interested to see what kind of impact this will have had 5 years from now.
The second paper she presented was early work in trying to model regions of the brain. Based on current neuroscience literature, it seems that many complex behaviors are the result of multiple brain regions acting in correlated ways (like how, analogously, when someone walks, their joints are controlled by multiple angles acting in a correlated manner). For this project, they build a variational autoencoder, which encodes an observation down to a low-dimensional representation, and then decodes that representation with group-specific generators (e.g. one for neck, one for right elbow, etc).
The following two figures show how this applies to the joint example (and analogously to brain region activation, which is the phenomenon of interest). By traversing each latent dimension and inspecting what the group-specific generators produce, they can identify which joints a given latent variable controls. They found that for the brain data, some of the dimensions controlled known region networks in the brain.
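The decode step with group-specific generators might be sketched like this (linear maps stand in for the decoder networks; the group names and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared low-dimensional latent code drives separate per-group
# generators (e.g. one per joint, or one per brain region).
latent_dim = 4
groups = {"neck": 3, "right_elbow": 2, "left_knee": 2}  # group -> output size

# linear stand-ins for the group-specific decoder networks
decoders = {g: rng.normal(size=(latent_dim, d)) for g, d in groups.items()}

def decode(z):
    """Decode one latent code with every group's own generator."""
    return {g: z @ W for g, W in decoders.items()}

out = decode(rng.normal(size=latent_dim))
print({g: v.shape for g, v in out.items()})
```

Because all groups share the same latent code, moving along one latent dimension produces correlated changes across groups, which is exactly the co-activation structure being modeled.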
Luke talked about safety, or as he said: “Machine Learning in Healthcare without constant thought towards safety is not actually Medical Machine Learning, it’s just ML that happens to be on medical [data].” In medicine, human experts are predisposed to watching out for the rare one-in-a-million cases, whereas ML is predisposed to the average cases, because that minimizes the loss function for the majority of training data points. You can read more extensively about this in his blog posts on Medical AI Safety here, here, and here.
He enumerated some of the factors that make Medical ML uniquely important, including:
- High risk.
- Long tail of rare, dangerous outliers.
- Human/user error is a major problem.
- Potential for external/systemic risks.
- Experiments in controlled environments are not adequate to demonstrate safety.
Surprise twist!! This isn’t actually unique to Healthcare. In fact, these bullet points describe self-driving cars. Luke used this psych out to suggest that maybe we should be thinking about Medical ML more like self-driving cars and less like recommendation engines.
Luke began with an example AI tool that has been in medical practice for over 20 years, which allows us to compare its promise vs its impact. The figure below shows the 1996 FDA approval for computer-aided diagnosis (CAD) for breast mammography screening, which had demonstrated improvement for diagnosis when doctors used the system (left). Fast forward to 2015, and we find that the system doesn’t help people who use it, and it even reaches worse performance than the human on their own (right).
One source of error is Human Error and Misuse. For the CAD scenario — even though humans were told only to use it to catch cancers that would otherwise be missed — humans began to over-rely on the system and began using it to rule out cancer. These “unintended consequences” happen frequently with technology, and when they are completely foreseeable, we shouldn’t say “Don’t look at me, I’m just the technical person. We told them not to do it, so that should be good enough.” As my own example, Q-tip boxes explicitly state not to put the swabs inside one’s ear… yet people constantly do it. It’s entirely foreseeable. It reminds me of a great piece I read this year called Do Artifacts Have Politics? which argues that technology necessarily changes dynamics and social structures, and those effects cannot be separated from the tools that enabled them.
Another source of error is what he calls Hidden Stratification (which he had a paper about for the workshop). This occurs when a given label is overly reductive and collapses too many distinct concepts into one umbrella group. For instance, the label “lung cancer” actually describes a broad category which contains dozens of different pathologies. Another example (shown in the figure below), is “pneumothorax” (which has a mortality rate around 30% when untreated). When a pneumothorax is treated (as is the case for the majority of instances it appears in for chest x-rays), things are okay, but untreated cases are critically important to identify. Unfortunately, the untreated cases are rarer, and so the ML training process does not fit to them; when evaluating ML models on the important cases, performance drops by about 10%.
For ML-based medical devices, regulators do not require Phase 3 trials (when the tool is tested in the real world) before marketing it commercially. Would anyone hop in a self-driving car that has never been tested on real roads? Unfortunately, the regulators have no interest in pushing towards requiring Phase 3. In pharmaceuticals, even after a drug clears Phase 2, it still has a 1-in-2 chance of never succeeding. For now, a minimum required effort is to talk to domain experts about what the rare, dangerous outliers are, and then to collect examples of those cases and test on them to ensure we can at least handle those.
There was a slight error with the video for the beginning of Eyal’s presentation, which makes it a little hard to follow (because it skips the task and motivation). Essentially, they want to improve disease localization by leveraging unannotated data. In that sense, it reminded me a bit of Paul Jäger’s spotlight earlier about combining annotations from different levels of granularity. Eyal’s paper is Localization with Limited Annotation for Chest X-rays.
The dataset (ChestX-ray14) has over 100k images labeled with 14 different diagnoses, though they also used a small annotated dataset (N=880) with localized bounding boxes. The baseline model performs localization (i.e. object detection) and image classification simultaneously with a joint loss, but they identify several potential issues with this approach. Many of these solutions are pretty intuitive: they require multiple predicted patches before returning a match (to reduce False Positives), they replace multiplication with addition to combat numerical underflow, they model contextual relationships between patches with a CRF (because the baseline treats patches as independent), and they do some kind of low-pass filter thing (called “anti-aliasing”) between CNN layers so as to preserve shift invariance. In that sense, it seems to complement the Retina-U Net paper nicely.
Xiaojing gave a good presentation about predicting pain scores from video data. The UNBC McMaster Shoulder Pain Dataset has 200 videos (for 25 subjects) with many human expert-labeled pain scores both at the frame-level (e.g. VAS, AU) and video-level (e.g. VAS, OPR, SEN, AFF). The goal of this work is to predict the VAS score for a video, and this is done with a 3-stage network and multi-task learning to incorporate signal from all of the labels. You can read her paper Pain Evaluation in Video using Extended Multitask Learning from Multidimensional Measurements for more details.
Her model works in 3 stages. The first stage is frame-level predictions, where the input is the image and the output is the PSPI score. For the 2nd stage, the sequence of PSPI scores are aggregated with statistics and fed into a single layer to predict VAS. The 3rd and final stage is an ensemble over the 4 predicted video-level scores to refine the VAS prediction. They evaluated baselines using Mean Absolute Error (MAE), and tested: stages 1+2 without multitask learning (MAE of 2.34), stages 1+2 with MTL (2.20), and the full model (1.95). They also found that humans were able to achieve an MAE of 1.76.
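Stage 2 and the MAE evaluation can be sketched as follows (the statistics chosen, the linear weights, and the toy values below are illustrative stand-ins, not the trained model):

```python
import numpy as np

# Stage 2 sketch: summarize the per-frame PSPI sequence with simple
# statistics, then a single linear layer predicts the video-level VAS.
pspi = np.array([0.0, 1.0, 3.0, 2.0, 0.5])  # per-frame predictions (made up)
stats = np.array([pspi.mean(), pspi.max(), pspi.std(), pspi.sum()])
w, b = np.array([0.5, 0.8, 0.2, 0.1]), 0.3   # random stand-in weights
vas_pred = stats @ w + b

# Mean Absolute Error, the metric used in their evaluation
def mae(pred, true):
    return np.mean(np.abs(np.asarray(pred) - np.asarray(true)))

print(mae([2.0, 4.0], [1.0, 6.0]))  # 1.5
```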
I thought this work was pretty straightforward. I appreciated that it had a clear goal (i.e. predict VAS from video-level data), though I have no way of evaluating whether that was the best goal (I’ll assume it was). I do wonder, however, how generalizable this will be to other tasks, which might not have numerous additional labels/scores that could be used for multi-task prediction. That said, it could be the case that the pain scores themselves aren’t the value-add, but rather, that predicting any reasonable additional labels helps the model with training because of gradient flow. I’d be interested to see an experiment similar to one described by Maithra, where the weights learned by the final model are discarded but their scales are preserved: would retraining that new model (without multitask) yield similarly good scores in the 1.95 range?
Arnav presented a summary of his paper Non-Invasive Silent Speech Recognition in Multiple Sclerosis with Dysphonia. The goal of their work was to help patients with MS who need technology-assistance to speak; they built a tool that attaches electrodes to the patients and uses the small electric signals (from muscle movement) to predict which sentence the patient is trying to say (out of a possible space of 15 hard-coded sentences). The novelty seemed to be much more focused on their measurement tool and the 3-person pilot study that they ran, since their technical methods amounted mostly to lots of signal pre-processing and a simple convolutional neural network. I thought this work was an interesting change of emphasis, and I appreciated the effort they put into working with real patients, getting consent, building a tool based on the users’ preferences, and grounding the presentation in the language of clinical impact. That said, I think this work suffers from methodological challenges, and (based on the evaluation metrics they report) might be overstating the clinical impact they can deliver so far.
This paper’s emphasis was on the pilot study, not the technical components of this work. That said, even for papers with a clinical emphasis, this field does expect a certain level of technical rigor. In his presentation, Arnav spent virtually no time on the model (“We use a convolutional neural net”) or the way it was trained (“a stratified cross-validation run”). Instead, there was much more attention devoted to a video of the patients using the system — which, to be clear, is really nice grounding, but should be in addition to, not at the expense of, technical rigor. I had to read their paper to better understand the details, which I’ll highlight now to complement the talk:
- “We non-invasively collected such recordings from 3 Multiple Sclerosis (MS) patients with dysphonia using pre-gelled Ag/AgCl surface electrodes while they voicelessly and minimally articulated 15 different sentences 10 times each.”
- “We employed personalized machine learning for the patients, i.e. the data of each patient was kept separate and not mixed with others. Hence, the architecture was trained with personal data to generate separate models for each of the 3 patients. The primary reasons are the attributes and features present in the data, which are unique to each patient due to their personal characteristics and stage of the disease and subsequent speech pathology, and the variance in the electrode positions.”
- “We used repeated (5 times) stratified 5-fold cross-validation to evaluate our model on the 15 class classification task. The models achieved overall test accuracy of 0.79 (±0.01), 0.87 (±0.01) and 0.77 (±0.02) for the three MS patients P1, P2 and P3 respectively”
As a pilot study, I think this work is nice. And there definitely is value in accepting pilot-study papers, as a way to reward the work and incentivize others to do it as well. That said, I do wish that there had been more methodological rigor; the N=3 population is especially concerning, and they fit one model per patient, which means their 6-layer CNNs (with over 10,000 parameters to fit) were trained on 120 examples (i.e. 80% of 150). They also don’t compare against other baselines, which would have been a great way to demonstrate how easy/difficult this task is; and because the data is (understandably) not public, readers have no way to flesh out that technical analysis themselves.
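To make the sample-size arithmetic concrete, here is a small Python sketch of the stratified 5-fold protocol the paper describes (the recordings themselves are stand-in placeholders, since the real data is not public): 15 sentences × 10 repetitions gives 150 examples per patient, so each fold trains on just 120 examples.

```python
from collections import defaultdict
import random

# Hypothetical stand-in for one patient's data: 15 sentence classes,
# 10 voiceless repetitions of each (150 examples total, as in the paper).
examples = [(f"recording_{c}_{r}", c) for c in range(15) for r in range(10)]

def stratified_5fold(examples, seed=0):
    """Yield (train, test) splits where each of the 5 folds holds
    exactly 2 of the 10 repetitions of every sentence class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append((x, y))
    for items in by_class.values():
        rng.shuffle(items)
    for fold in range(5):
        train, test = [], []
        for items in by_class.values():
            for i, item in enumerate(items):
                (test if i % 5 == fold else train).append(item)
        yield train, test

for train, test in stratified_5fold(examples):
    assert len(train) == 120 and len(test) == 30
```

Repeating this whole procedure 5 times with different shuffles (as the paper does) gives the reported ± variation, but it never changes the core fact that each model only ever sees 120 training examples.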
I also was a little confused about their metrics for evaluating clinical impact. Their average accuracy is 81%, which means they incorrectly predict about 1-in-5 sentences (and without an option for “That’s not what I meant to say”). Nonetheless, they report their words per minute (which seems a little strange to do when selecting from a menu of pre-selected sentences) and find that they achieve 94 wpm. With 4 words per sentence, that works out to roughly 2.5 seconds per prediction. As a pilot study, I think these are very promising results! Though it does seem a little apples-to-oranges to make a head-to-head comparison of 94 wpm vs 10 wpm against existing silent speech recognition technologies (which have an open vocabulary to generate any words). I do hope that they continue refining this work with a larger cohort of patients, and build upon this!
As I said earlier, I appreciate how much of an emphasis on clinical impact this work had. I think the ML for Health community stands to benefit from incorporating those aspects of this work. However, I also think that this work stands to benefit from many of the things that the community does well, including: methodological rigor, larger sample sizes, and more comprehensive evaluations.
Anna spoke about lessons learned from an ongoing effort to deploy a cardiac arrest model in a hospital. She described three steps of this process so far: a retrospective prediction model, a survey of what doctors want from the tool, and an explanation-based model that addresses what the doctors said they were interested in. I think this talk is so important, because there are very few deployments in hospitals among the ML for Health community; sharing lessons learned and best practices is a critical step forward as we try to improve lives with the tools we hope to build.
The first step was a retrospective study of a cardiac arrest prediction model, presented at MLHC 2018. They used a CNN-LSTM model to predict cardiac events 5-15 minutes before the event happens, which would help the team prepare for the event. The model performed well, but then they did a sanity check which initially left them a little dejected. Although the hospital only has 100 cardiac arrest events per year, their model would be making 3,000,000 predictions (because it would predict every 5 minutes for 30 beds, around the clock). This means that even a great model with only 1% error would still produce 30,000 False Positives, which would dwarf the number of actual events & potentially cause alarm fatigue. When they ran the evaluation and found a 15% False Positive Rate, they were surprised to hear that the clinical staff was actually pretty optimistic: the staff said that when viewing those predictions in context, a lot of them made sense (e.g. the patient indeed was headed toward a cardiac arrest, but the staff had already known to give treatment before it was too late).
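That alarm-fatigue arithmetic is easy to sanity-check. This back-of-the-envelope Python sketch uses the numbers from the talk (30 beds, one prediction per bed every 5 minutes); the small mismatch with the quoted 3,000,000 and 30,000 figures is just rounding.

```python
# One prediction per bed every 5 minutes, 30 beds, year-round.
predictions_per_year = 30 * (60 // 5) * 24 * 365   # 3,153,600 predictions
true_events_per_year = 100

# Even a model with a 1% false positive rate buries the ~100 real
# events per year under tens of thousands of false alarms.
false_alarms_at_1pct = round(0.01 * predictions_per_year)    # ~31,500
false_alarms_at_15pct = round(0.15 * predictions_per_year)   # ~473,000
```

This is exactly the base-rate problem: with roughly a 1-in-30,000 prior of an event at any given prediction, even a very accurate classifier is overwhelmingly wrong when it fires — which is why the staff's contextual read of the alarms mattered so much.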
Next, Anna discussed the survey her team conducted about Medical ML explanations for her MLHC 2019 paper What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. They found that junior doctors wanted a safety net (i.e. “tool that will be constantly monitoring the situation and alerting clinicians to the onset of potentially critical events”), whereas more senior doctors didn’t want to be bothered with what they already knew, and only wanted to know when the system disagrees with them & why. They also learned that explanations are very task- and constraint-dependent: doctors want lots of info when there is lots of time (e.g. at the end-of-shift handoff), but during critical situations they only want the small pieces of actionable information. And they wanted per-event explanations (i.e. “Why did you make *this* decision?”) as opposed to per-model ones (i.e. “What are the most relevant features for this model, in general?”).
With these considerations in mind, they developed an explainability model for a time series to answer “What in the past could have caused this event?”. The work is currently under review, but the Feed Forward Counterfactual (FFC) Model infers feature importance by forecasting the next observation given the previous observations, and comparing that expectation against what is actually observed. As the figure below shows, this difference between expectation and observed signal helps flag strange behavior that the doctors could then investigate for themselves to see whether it is concerning.
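The FFC model itself is still under review, so its details aren't public, but the underlying intuition — forecast the next observation from the past, then flag observations that deviate from that forecast — can be illustrated with a deliberately toy Python sketch, where a rolling-mean forecaster stands in for the real feed-forward model:

```python
import statistics

def deviation_scores(series, window=5):
    """Toy illustration of the counterfactual idea (NOT the authors' FFC
    model): forecast each point as the mean of the previous `window`
    observations, and score how surprising the actual observation is,
    measured in rolling standard deviations."""
    scores = []
    for t in range(window, len(series)):
        past = series[t - window:t]
        forecast = statistics.mean(past)
        spread = statistics.pstdev(past) or 1.0  # avoid division by zero
        scores.append(abs(series[t] - forecast) / spread)
    return scores

# A steady heart-rate signal with one sudden jump: the jump is the
# point where observation and forecast disagree most.
hr = [72, 71, 73, 72, 72, 71, 73, 72, 110, 72]
scores = deviation_scores(hr)
```

The real model learns a much richer forecaster over many features, but the output has the same flavor: a per-timestep, per-feature "surprise" signal that a doctor can inspect to decide whether the deviation is clinically concerning.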
She concluded the talk with some very good advice: it is critical for deployment to discuss and design these tools alongside the users, and that includes nurses just as much as (if not more than) doctors, because nurses will often be the ones actually using these tools.
This talk from Lily Peng and Dale Webster was about their experiences from developing and deploying computer vision systems in the clinic. Dale described considerations on the modeling side of things, and Lily talked about the challenges and successes with integrating the tool into a real healthcare system. I liked this talk, because I think the field is ready to move beyond just theoretical models for papers and start deploying models to improve patients’ lives. We need case studies like this one to help us learn what the best practices are (hint: it’s to work closely with the people using the tool, understand their needs, and integrate it into their workflow in a useful way). I wonder if in future years, we will begin to see more User Interface work at ML4H venues. I’d personally love to see that, especially as a proving ground for how to bridge between ML folks and domain experts when deploying systems, which the younger field of Algorithmic Fairness is also beginning to work towards!
Dale pointed out that there are a lot of problems in healthcare imaging where it’s fine to use pre-trained architectures and pre-trained models, and the real challenges come from correctly defining the problem and then collecting high quality labels. He recounted some examples of when good predictive performance was misleading, such as when a model purported to get 0.95 AUC but was actually having so much success because the training dataset was curated from merging two quite different sets (one of which had only negative examples). He offered the suggestion to test for these obvious mistakes with model explanations, decision trees on metadata, etc.
After they evaluated on retrospective studies, they found that the humans were better at limiting False Positives, whereas the models were better at making sure nothing slipped through the cracks and got missed. They hoped to combine the strengths in a human+ML hybrid model, but found that originally, the hybrid wasn’t doing as well. After looking further into it, they realized that the poor user interface (i.e. displaying a heat map adjacent to the image, with poor localization) wasn’t helping, and once they worked with a designer for better visualizations, the human+ML predictor was the best. As she described, when it comes to integrating AI into the system, there’s “not a lot of AI there”, yet those small improvements can have large payoffs in actual impact on patients’ lives.
Lily talked about the deployed diabetic retinopathy screening tool in Thailand. They began with a one-site pilot study, where they went to the clinics and mapped out how patients, doctors, and nurses flowed through the clinic. They interviewed patients about what was important to them, and were sometimes surprised by the answers. For instance, many didn’t say accuracy, but rather things like “I don’t want to travel 5 hours to the hospital”, which meant that False Positives were very costly (i.e. don’t call a patient back willy nilly just because a picture is slightly blurry and you’re not sure). She described the importance of incorporating patient values, especially before integrating the system, because involving them in the design process is an important way to build trust with the patients who will actually be impacted by the tools.
Q: What are some areas that might be most fruitful in the next 5 years, for instance for new PhD students?
Cian: Before you decide on techniques, choose the right problem.
Anna: NLP models right now are too computationally intensive. It’s important to start thinking about scalability and environment constraints.
Danielle: Look for good clinical collaborators. It’s hard to have the right context for these tools without one.
Luke: The most important thing to be a medical ML practitioner is to start understanding medical systems.
Q: For people who don’t have easy access to medical collaborators, what can they do?
Danielle: Ultimately, you want clinical collaborators. There’s also a lot of important work to be done working with designers on system usability. When working on clinical problems, it’s essential to understand the context deeply. This can be acquired through reading to some extent.
Cian: Agreed on user design and understanding clinical context. Clinical conferences can be a good way for ML researchers to reach out and find potential collaborators.
Luke: Young doctors are excited about these kinds of collaborations. Perhaps in online communities as well.
Anna: Participate in talks and panels. Ask questions. Collaborators are there. “If somebody really can’t find any collaborators, then I think they’re just not looking hard enough.”
Q: Lots of success in ML4H on Imaging and maybe some Signal Processing. How to extend to other modalities like notes or structured EHRs?
Anna: Many tools being deployed are using EHR data (e.g. Stanford, Kaiser, NYU, Duke, Toronto). Lots of examples of medical imaging data because it gets collected in a more standardized way. The availability of data has been driving opportunistic research. Also some interesting new work in mental health NLP.
Luke: I’m not sure “better models” can help us do better on EHR. Imaging success comes from being able to exploit spatial relationships that EHR data doesn’t have.
Anna: I don’t think it’s a failure if random forest works to solve a problem. Also, deploying algorithms isn’t largely a technical challenge… it requires lots of support from staff, IT, admins, etc. at the hospital. It’s a big commitment on their part, and we’re just now beginning to get buy-in.
Cian: Hospitals have digitized in the last 10 years (e.g. electronic recording of vital signs). Interoperability in imaging (e.g. Dicom) has been ahead of EHRs (e.g. FHIR), which has allowed more standardization and transferability.
Danielle: We should think beyond “Machine Learning for Healthcare” to “Machine Learning for Health.” We need to get well-curated datasets, and think about longitudinal data to understand health and the progression of disease.
Q: Ground truth labels in healthcare are expensive and often suffer from poor inter-annotator agreement. I’d love to hear more about the panel’s experience in building labeled datasets in healthcare.
Cian: Spend time talking with clinical collaborators about what it is you’re trying to predict. In the future, hopefully we can extract high quality labels from historical clinical practice data.
Anna: Pretty much impossible if you don’t have access to a clinician. Things like ICD9 codes are biased for billing, not diagnosis. Online learning and active learning techniques will hopefully help. There’s an example from Regina Barzilay’s group, where they were able to achieve better performance by predicting the outcome itself rather than an intermediate, human-derived proxy, like density.
Q: How does disease heterogeneity impact this? What happens when the “true label” is not clear?
Luke: When the data is heterogeneous, the data might be worse for some sub-populations than others. Try to favor tasks with objective ground truth; even panels can have a lot of disagreements.
Danielle: There is a large opportunity for research in new methods to better capture uncertainty. Heterogeneity can happen both across annotators or for a given patient with multiple measurements.
Q: Machine Learning models often amplify unfairness against underrepresented populations. For the industry folks on the panel: what policies do you have in place to probe the fairness of deployed systems? In particular, how do you measure fairness across many intersectional groups?
Cian: For our retinal disease project, we sought out a hospital with one of the most ethnically diverse populations in our country. We built models to ensure we were performing appropriately on subpopulations. We tried collecting data from across the globe that could help us assess performance even in the face of things like variation in the tools that capture the images (and therefore image granularity). This work is still ongoing.
Danielle: These are the two main challenges of Fair ML in Health: a representative dataset and doing post-surveillance studies to see what unintended consequences might arise from a deployed system.
Anna: These are good practices to do. I’m not sure there are policies in place to ensure these things are checked. Sometimes, policies can get in the way, such as how the Canadian system does not record race, which makes it impossible to audit them for racial bias.
Luke: Once you can measure disparities, then you need to make a conscious choice to improve conditions for the affected groups. As an ML researcher, you need to supply the info of what the model is doing so that the people in governance can make the decision.
Q: Given a model with really strong performance in a research setting, how should one decide whether to pursue deployment in the clinic?
Anna: We convened a deployment seminar in Toronto to learn how other institutions have done this. Priorities (e.g. billing, care, etc) differ. Having a clinical champion is very important in this process. Oftentimes, health economists are involved in the decision-making for how tools affect the system.
Cian: When you have a model with strong performance, you might be under pressure from collaborators to deploy ASAP. There needs to be assurance of safety and effectiveness. You’ll also need post-deployment assessments.
Danielle: One easy win for deploying models is in the areas of diagnostics. But we shouldn’t necessarily try to test everything just because we can. These are ethical questions beyond just the basics of ML.
Luke: We have hundreds of models for research, dozens of models approved for sale, and almost no one buying them. The biggest barrier seems to be no clear value proposition for the right people with these technologies.
Brett: Also worth noting that “high performance” is typically describing accuracy, which is a proxy for whatever your real goal is for improving outcome.
Q: Many of the largest attempts to deploy clinical machine learning are coming from industry. However, many members of the public have expressed concern about large tech companies controlling their data. Do you feel these concerns are warranted?
Luke: Historically, medical research has been done without consent. New laws (e.g. in the EU) are saying that’s not okay anymore. This is a new climate, and researchers will need to deal with it. But let’s not be too down on the commercial space; we need both sides of this process.
Cian: We want to do deep public engagement. We should also have strong privacy assurances. There should be clear information about data use and retainment.
Q: Two complementary challenges in the deployment of medical ML systems are: First, the need to rigorously evaluate the behavior of a system pre-deployment, and Second, the desire to allow continual learning over time post-deployment. Is there a tension between these two goals?
Anna: It’s not a tension, it’s a natural progression: we need to ensure safety and efficacy at each step. There should be more research in this area.
Cian: There is an opportunity for us to engage with regulators as a community now about the benefits and challenges of continuous learning systems.
Luke: “Continuous learning systems give me the absolute heebee jeebees from a safety and regulation perspective”. Even with static models, performance changes over time (e.g. dataset shift). Regulators have never dealt with anything like this before. In near-term, have a re-regulation process.
Anna: Regulation bodies (e.g. FDA) are already addressing continuous learning systems. The whole point of continuous learning systems is to correct for drift that might trip up static systems. We should get these systems in place sooner. We might not get it right at first, but doctors aren’t getting it right at all.
Danielle: A medical ML system is just an extension of the current system, with existing concerns about safety.
Luke: We should be cautious. The way medical advancement happens is slow and methodical. That slowness is a feature, not a bug. If everything changes at much faster speeds, then we’ll be facing a new set of problems.
I thought the workshop went really well this year! I loved the selection of speakers, and the interesting topics they’re working on! It is clear to me that the field is moving towards a focus on safety, deployment, and user interfaces. There is still a place for technical innovations to model the unique nature of clinical data, but as this field matures, Spider-Man reminds us that with our great power comes the great responsibility to think about the impact these tools will start to have on the world. And very importantly, we need to be working with clinical collaborators to understand the context of these tools. At the 2016 NeurIPS ML4H workshop, the first keynote speaker, Dr. Leo Celi (an intensivist at Beth Israel), said that if you can’t find a clinical collaborator, then you should take your valuable brain power and work on a different field rather than “waste your time” working on healthcare data without a clinician.
Speaking of Leo (who is an MD), that reminds me of another recurring reaction I had throughout the day. There were a few speakers who said that although ML has its problems, doctors aren’t very good either. To be honest, that point really rubs me the wrong way when it comes from a non-clinician. Yes, clinicians get many things wrong. But computer scientists get a lot of things wrong too; no field holds a monopoly on failures or arrogance. However, I believe it is counterproductive for computer scientists to throw doctors under the bus; that will hurt the trust between two communities that NEED to build bridges. I think we computer science folks need to be humble, partner with clinical collaborators, and let the clinicians police their own arrogance. Our technology has also been causing a lot of harm to society, between social media disrupting democracy and algorithmic bias exacerbating inequities.
Also, I’d like to once again reiterate how grateful I am that these talks were recorded! That’s really great for science 🙂
As one last note, I noticed that the gender breakdown of spotlights and invited speakers was: 10/17 (59%) men, 7/17 (41%) women. This is certainly much better than many conferences I’ve been to in the past… but that said, I think we should be shooting for 50% representation unless one has a really good reason otherwise. I do appreciate that the organizers likely put effort into this already, but we all have our roles to play, and in this case, I’m happy to be the one on the outside reminding them that they can do better.
Speaking of gender representation, I wanted to audit myself to ensure word count fairness in what I wrote. For talks that had multiple speakers, I attributed the whole writeup to each of them (i.e. double-counted the writing). The word counts here do not count figure captions or excerpted quotes from authors’ papers. When I ran the initial check, I saw concerning disparities that I didn’t have a good justification for: I was writing less about women speakers than men speakers. I felt that needed to be addressed/corrected. Here are the results of the revised writeups:
- Initial Version
- Invited Speakers: 485 words/woman, 568 words/man
- Spotlight Speakers: 307 words/woman, 375 words/man
- Panel: 635 words for men, 465 words for women
- Revised Version
- Invited Speakers: 567 words/woman, 568 words/man
- Spotlight Speakers: 356 words/woman, 358 words/man
- Panel: 591 words for men, 591 words for women
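For what it’s worth, the audit itself is simple to reproduce. This Python sketch (with made-up placeholder word counts, not my real data) shows the double-counting convention described above, where a talk’s full word count is attributed to each of its speakers:

```python
from collections import defaultdict

# Hypothetical stand-in numbers: one entry per speaker. A talk with two
# speakers appears twice, once per speaker, double-counting its words.
entries = [
    ("woman", 560), ("woman", 574),   # two single-speaker talks
    ("man", 568), ("man", 568),       # one two-speaker talk, counted twice
]

totals = defaultdict(lambda: [0, 0])  # gender -> [word total, speaker count]
for gender, words in entries:
    totals[gender][0] += words
    totals[gender][1] += 1

# Average words written per speaker, by gender.
averages = {g: w / n for g, (w, n) in totals.items()}
```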
Corrections and Updates
12/29/2019: I realized I did basic math wrong. 10/17 is 59%, not 63%.