Why I just love dropping out of MOOCs

I love massive open online courses. I love everything about them. I love the format. I love the platforms. I love the teachers. I love the flexibility and the lack of friction.

And I love dropping out of them.

This picture turned up in my Twitter feed recently, from Class Central by way of the Economist and tweeted by @EricTopol (well worth following).

[Image: mooc-trends]

It’s references all the way down

We can also see (from Class Central again) that the number of classes is following the same trend.

[Image: growth-of-moocs]

This is exactly what I would hope happens with MOOCs. Khan’s law, or something. A roughly 1.2 year doubling time, and a massive expansion of low cost education opportunities worldwide.

On a side note, I’m not sure why the 2017 extrapolation is for levelling off. The trend looks pretty solid up through 2016. But even if we have reached saturation, the stats are pretty amazing.

Despite this growth, there are critics. They say MOOCs are hype, and that they don't deliver on their educational promise. And they think they know why.



DON’T TELL ME HOW TO DO MY LEARNING

“There is only a 7% completion rate,” they say. “The majority drop out and don’t get an education.”

This is both true and misleading. Dropping out is not the same as a lack of engagement.

MOOCs are not like brick-and-mortar courses.

MOOCs have no barrier to entry. Most of these courses can be tried for free, at no risk. There are tons of real-world courses I have taken that I would have abandoned given the choice, were it not for the sunk costs, the risk of tanking my GPA, or the requirements of my degree. They didn't interest me, and they didn't seem relevant. Maybe being forced to sit through them was a good thing, because in my teens and twenties I wouldn't have made good choices.

But MOOC students are not teens. They are adult learners*. Over 75% of MOOC participants have a bachelor’s degree or higher.

The key feature of adult learners is that they are self-directed. This is usually loosely defined as “being able to learn a syllabus reasonably autonomously, with some direction in the form of book lists and in-person teaching.” Really, this sentence encapsulates the ethos of higher education.

But being an adult learner means so much more than this.

It also means knowing what parts of the syllabus are important to you, what parts are not relevant, what parts can be skipped, and what you want to achieve. This is not how traditional education works, where we set goalposts for learners and define the scope of their learning.

An adult learner without barriers to access can engage with education in a whole range of ways that brick-and-mortar students cannot.


[Image: studentpatternsinmoocs2]

These aren’t fixed groups. I engage in all of these ways, depending on the course and my needs.

As an adult learner, the ability to try out a course and see if any part of it matches my needs is fantastic.

I’ve signed up to courses I have never started, because they seemed interesting but I didn’t have the time. They sit in my Coursera and EdX profiles like Steam games after the summer sale. For both the MOOCs and the games I will probably never get around to them, but ancient Chinese history and Arx Fatalis both seemed worth trying when it cost me next to nothing.

I’ve dropped courses at the start because they didn’t cover the material in a way I enjoyed, and I had alternatives. I could choose between several instructors and formats, and so I tried all of them before I focused on one. As an example, there must be over a dozen world-class introductory computer science courses available. CS50 is super popular, but I really don’t like the big lecture style**.

I’ve dropped courses in the middle because I was interested in a particular subtopic in the syllabus. I know a little about a lot, and so big chunks of many courses are redundant. But there are gems of knowledge I can mine from beyond the well-trodden paths of the introductory lectures.

I’ve dropped courses at the end because the whole course was interesting and worthwhile, but the projects and quizzes I needed to pass it were useless.

I’ve completed courses because it was all worth it.

I’ve augmented courses with books and blogs and Khan academy and other online lectures.

I must have dropped out of between fifty and one hundred MOOCs, and in almost all of those I at least watched the introduction. In many of them I put in between ten and fifty hours studying the material. And that is not the same as “dropping out”. Not as people seem to use the term when criticising MOOCs. I learnt in the way I wanted, when I wanted, how I wanted.

And it seems like other people do too.

[Image: activity-per-week]

A case study: Twelve thousand people started the first session of a Bioelectricity MOOC on Coursera. Eight thousand opened at least one video. A thousand watched all the videos.

Only three hundred and fifty completed the course.

So we could call this a 97% drop out rate, and wail and gnash our teeth.

Or we could say that a thousand people watched an entire course on bioelectricity. Three times as many as “completed” the course. They learnt it all (to some degree), but didn’t do all the assignments. I wonder how many years they would have to run the course at Duke to teach that many students?

Or we could say that, roughly eyeballing that chart above, there were around thirty-two thousand video views spread across the sixteen videos. That works out to two thousand person-equivalents watching whole-course-equivalents. People just chose which parts to watch, and left when they had their fill.

Or, we could just say that twelve thousand people engaged with learning in a self-directed manner.

I kind of see drop out rates as completely missing the point of MOOCs. They don’t measure anything we should care about. I would love to see more detailed data on how people engage, what they actually do, how long they spend on the material (both on site and off-site). I’m sure that if the staggering growth in MOOCs continues, there will be a whole field of academics who make their careers out of studying this data.

But they won’t care about drop out rates.

Because dropping out is awesome.


* One of the areas I do research in is the application of artificial intelligence in education, but specifically for teaching experts. I think I am at least partially justified in claiming some knowledge here. Forcing adults to do a specific sort of learning is like herding cats, if the cats could just spit in your face and walk away with no repercussions.
** I actually still watched at least half of CS50, because it offered some things the other courses didn’t. Picking and choosing is what adult learners do.

Predicting Medical AI in 2017

Welcome to 2017!

What a blast 2016 was. It seemed like every day there was a new, massive breakthrough in deep learning research. It was also the year that the wider world really started to take notice. The media, professional groups, and the general public all climbed aboard the AI hype train in 2016. Governments commissioned major reports and conservative economic forums discussed the future of work (like, whether work will still exist).

So what made 2016 so special? In the last year we saw deep learning systems beating a Go champion, outperforming human stenographers at speech transcription, generating high quality images and amazingly human voices from raw text, saving massive amounts of electricity, nearing human level in translation services and beating humans at lipreading, and many more amazing and unprecedented advances. Any one of these achievements would have been inconceivable a few short years ago.

But in medicine, the progress was much more modest. Indeed, for most of the year I thought we would not see any big disruptive breakthrough of the sort described above. But, at five minutes to the midnight of 2016, Google pulled a rabbit out of their hat with their work on diabetic retinopathy assessment. For the first time, we saw a computer system truly compete with doctors at a medical task.

Doctors started talking seriously about AI in 2016. Specialties in the firing line (like radiology and pathology) have led the conversation, although I'm not convinced by the prevailing wisdom on the short, mid and long term outlook for our professions.

I’ll talk more about the prevailing wisdom some other time, but I think part of the reason people get it wrong is that long term prediction is really hard, especially when the pace of change is so rapid and the role technology is taking is (arguably) unprecedented.

But the near term is much more clear, so I thought it might be worthwhile grabbing my prediction goggles to look to the future. Let’s consider what AI tricks and treats might be in store for the medical world in 2017.

[Image: steampunk-victorian-goggles-welding-glasses]

Donning my +1 Glasses of Future Sight. Note: I don’t actually have these and could never wear them. Even sunglasses look ridiculous on me.

Phased and confused

So what's next for medical AI in 2017? In my last blogpost* I talked about the sensible way clinical trials in medicine are divided into different phases, and suggested we can understand AI research in a similar way. This is a nice way to appreciate translational research, because these phases reflect how likely it is for an application to reach clinical practice, and how long it will take to get there.

So when you read about a successful clinical trial, you can understand the impact of the results in context. As a rough rule of thumb, the chance of an eventual clinical product and the time until that product is available will be:

Preclinical complete: 5% chance, 10 years

Phase I complete: 10% chance, 8 years

Phase II complete: 50% chance, 5 years

Phase III complete: 80% chance, 1 year
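
If it helps to see the rule of thumb as something concrete, here is a toy Python sketch of the same numbers (they are just the rough figures above, nothing more):

```python
# The rough rule-of-thumb figures from above, encoded as data. Purely illustrative.
RULE_OF_THUMB = {
    "preclinical complete": (0.05, 10),  # (chance of a product, years until available)
    "phase I complete":     (0.10, 8),
    "phase II complete":    (0.50, 5),
    "phase III complete":   (0.80, 1),
}

def outlook(stage, current_year=2017):
    chance, years = RULE_OF_THUMB[stage]
    return f"{stage}: ~{chance:.0%} chance of a product, earliest around {current_year + years}"

for stage in RULE_OF_THUMB:
    print(outlook(stage))
```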

We don't know if AI research will mimic these numbers, because no AI trials have made it past phase II yet. It is likely that phase I and II trials for AI can be much quicker, because clinical trials require long follow-up times whereas AI trials use retrospective data. But the phase III and regulatory stages should be pretty similar to clinical trials.

The great thing about this framework is that it predicts how soon AI can fulfill its promise in medicine. This makes it a very nice way to ground our predictions: we only need to estimate how many good quality trials of each phase will be performed, and the expected impact flows from there.

To be clear about what I am predicting here:

  1. I am only considering deep learning research in my predictions. As I have said in the past, older machine learning methods are still widely used and published in medicine, but there is no good reason to expect they will suddenly achieve breakthrough performance after 30 years without this success. Nor should we expect them to work amazingly well in medicine when they are not revolutionising other areas of technology. These older technologies make steady, incremental progress and still have a lot to offer, but aren’t going to be unexpectedly disruptive.
  2. I am only concerned with systems that will directly alter patient care. Self-help apps that recommend you see a doctor at the first sign of trouble, image processing systems, data gathering systems and so on are cool and have a big role to play, but they aren’t “medical”. I’m talking about systems that do doctor work, and need regulatory approval.
  3. I don’t know what I don’t know. If the big tech companies have large unpublished studies that are ahead of publicly available research, I’ll be surprised, but I can’t incorporate that into my predictions. Unless someone wants to tell me about it 🙂

[Image]

You can’t expect me to know about unknown unknowns!

With that definitional stuff out of the way, on to the predictions.


PHASE I

Prediction:

We will see the quantity of phase I research double in 2017.

Argument:

It is really hard to know what we should call phase I research. Technically, every CS student project with a public medical dataset and every medical Kaggle competition is a phase I study. But almost none of these will ever end up becoming products, because they are sort of “throwaway” research. There is no infrastructure to take the projects further. This isn’t something we see in clinical trials, because even preclinical and phase I trials are expensive. The lack of a cost barrier to entry in AI studies confuses the whole space to some extent.

It means I will have to be more narrow in my definitions. I will say a “true” phase I trial is one performed by a research group, published in peer-reviewed literature. These are more analogous to the phase I clinical trials, because researchers are motivated by impact and therefore they usually select projects that can grow into bigger things.

If this is our definition, then we anecdotally see something like five to ten good quality phase I AI trials each month. I “investigated” this by looking at the results of a couple of Google Scholar searches (e.g. “deep learning medicine”), covering the last 6 months. Five to ten per month seemed about right. If anything, I am over-estimating the output at the moment, which would make my prediction a bit optimistic.

I think this will happen because the number of researchers interested in deep learning is growing massively. The barrier to entry is low, practically anyone who can code can spin up cutting edge neural networks without much fuss. We see huge increases in conference attendance, new heavily attended conferences focused on deep learning targeted at medical researchers, and packed-to-the-brim deep learning workshops at the top medical ML conferences (several of the organisers for that one are actually my collaborators).

Not to mention the fact that there are an enormous number of medical informatics and medical machine learning folks who are still publishing on old methods, and they are going to catch on eventually and transition to deep learning.

I think I am being fairly conservative when I say there will be between ten and twenty good quality phase I AI trials per month by the end of 2017. I wouldn’t be surprised with a tripling instead of a doubling.



PHASE II

Prediction:

We will see several (3-5) large phase II AI trials published in the medical literature.

Argument:

This prediction doesn’t seem very impressive, but considering we have a history of exactly one phase II trial for a deep learning system performing a medical task (see the previous blogpost), we are actually looking at a three to five-fold increase in one year.

(On a side note, there have actually been a few other trials that I think are on the cusp of “phase II-ness”, but fall just short for a variety of reasons)

I’m not sure if I am being over-optimistic here. Google spent a lot of effort getting an army of ophthalmologists to create a dataset for them, a level of effort I am not sure other groups are ready for yet.

From the public side, anecdata suggests that funding bodies are loath to give large grants for deep learning applications research that hasn't been supported by large studies already (we could call this the "phase I to phase II funding gap"). Many of the academic labs will probably keep focusing on old techniques for a while, if for no reason other than funding availability. I foresee zero to two phase II studies coming out of public institutions in 2017.

There will be some role for cashed-up startups, but I haven’t really had any catch my eye yet. This is probably because their public statements are investor-facing, not aimed at convincing doctors. It is difficult to grok if they know what they are doing. Enlitic is probably the most likely to surprise me, but a complete phase II trial would be a surprise.

Overall, I would be surprised if a single good phase II study came out of a startup in 2017. Maybe next year.

It will probably be up to the large tech companies to push this along, and not having any direct knowledge about the work these groups are doing behind closed doors, I am not sure about their appetites.

Though if I had to guess, considering the amount of money on the table if they are successful, I would suspect they are quite hungry. I'm mainly talking about Google, Microsoft, IBM et al., although the big med-tech companies could play a role.

I predict that two to five phase II trials will come out of established industry groups.


PHASE III

Prediction:

We won't see any complete phase III trials in 2017.

Argument:

I would love to be wrong here, but I just can’t see it happening. To be successful in phase III, you need to show that using the system is as good or better than using human doctors, on real patients in real clinics. That is a whole new ballgame.

Let's take Google's diabetic retinopathy study, assuming it is the only one that is almost ready for phase III and could get past an ethics board. Even if they could formulate a really good use case here (I'm not an ophthalmologist, but maybe something along the lines of a screening system that can avoid the need for specialist review in mild cases and therefore save money with no decrease in patient safety), we are looking at at least a year or two of follow-up.

Diseases like diabetes move slowly. If you reduce the follow-up period, fewer people will suffer the events you are watching out for (like blindness). This doesn’t mean you can’t do it, but you will need a larger cohort to prove it works and is safe. This adds to the cost, which is already going to be high (seven figures high).

If a phase III trial comes out of nowhere, I don't know who would have approved it ethically.

If anyone can think of good candidate research groups that might be closing in on phase III that I haven't heard of, let me know. I will update my predictions accordingly.

But as things stand, I don’t expect any phase III studies in 2017.



MISCELLANEOUS PREDICTIONS

Some random AI and non-AI stuff, just for fun.

  1. Medical apps will continue to proliferate. Things like smartphone skin cancer detectors (that just recommend you see a doctor for anything unusual), health trackers and quantified self stuff, medication trackers/reminder systems, falls prediction, psychological health support bots and so on. Anything that doesn't need regulation. I doubt any will achieve significant market penetration, but these are the kind of low-hanging fruit that will see a lot of effort and start-up interest.
  2. News stories will continue to breathlessly report the end of doctors, despite all evidence to the contrary.
  3. Radiologists and other “threatened” specialists will continue to talk about this at every major meeting and in opinion pieces in all the major journals, but the conversation isn’t going to really go anywhere new this year. It won’t really change unless we see disruption at scale.
  4. On the non-AI front, augmented and virtual reality isn’t going to do much of anything useful in medicine in 2017.
  5. Similarly, 3D printing isn’t going to do much of anything useful in medicine in 2017.
  6. Genomics will continue to progress incrementally, without any major breakthroughs. Deep learning for genomics is tricky. We will cross the $1000 genome barrier this year though, which is actually pretty amazing.
  7. The biotech revolution will start picking up steam. More really effective targeted cancer therapies, more stem-cell stuff, more rejuvenation tech including the first evidence on senolytics (anti-aging treatments). We should get an idea about how effective new anti-Alzheimer's treatments are, and whether metformin extends human lifespans.

I think that is all I have for now. The most important implication is that without any phase III trials in 2017, we are at least two years away from any clinical application that can displace doctors. So rejoice, medics, and be merry.

I’ve tried to make my primary predictions numerical, and therefore falsifiable. A few of the throw-away predictions at the end are semi-falsifiable too (these are bolded). I will definitely be coming back to these by the end of the year, and we can see how I did.

Happy new year, and cheers!


*To be honest, I only started writing that last post because I wanted to write this one. Now I’m glad I did, because I really do think the framework is useful.

The three phases of medical AI trials

In a recent blogpost I explored how to critically read medical artificial intelligence research, focusing on the relevance of these experiments to clinical practice. It has since struck me that we don't have a simple, clear way to discuss the idea that some studies are still a long way from use in the clinic, while others have progressed much closer to translation into practice.

The medical researchers in the audience might recognise this concept, because this is a case where medicine has already solved the problem.

See, clinical trials are grouped into categories based on how useful the results are going to be to clinical practice. These groups are called the phases of medical research, and they reflect the common path from preliminary work to clinical translation; they are pretty much the required path for clinical innovations to take if they want to be accepted by doctors and regulators. Broadly speaking, most research that involves humans (I will use drug trials to illustrate the concept) falls into one of three categories.

Phase I is the first round of safety checks. A drug is tested in a small group of people to make sure nothing terrible happens. At this stage we barely even consider efficacy (how well the drug works). We just want to know it doesn't kill people. If we get hints that it works really well, great, but that isn't the primary motivation of the study.

Phase II assesses safety more thoroughly. This requires a larger group, to identify rarer side effects. Because of the larger sample, we can start finding some evidence of efficacy but it will never be enough to justify clinical use.

Phase III is the difficult, expensive, important stage. The study is designed with the express purpose of finding out how useful the drug is. This usually means a large number of people using it for a long period. The methods and analysis need to be able to hold up under heavy scrutiny from the FDA or similar regulatory authorities.

Technically there are also pre-clinical trials (animal models), and phase 4 trials (follow-up once the drug is available), but phases I-III are where ideas become treatments.

[Image]

This diagram doesn’t add anything useful except a splash of colour

I think we have a very similar progression in medical AI research, as almost all studies I have seen can fit into a few well defined categories. I highly doubt that the system I present below is rigorous or covers many of the edge cases, but it should form a useful framework when designing and reading about research in the field.


ARTIFICIAL INTELLIGENCE TRIALS

It makes sense for a framework of AI trials to mirror the structure of other clinical trials. The three phase concept is familiar, intuitive and would possibly go some distance in bridging the gap in understanding between medical and artificial intelligence researchers. It might even make it easier to convince doctors and regulators that your new state of the art medical AI system is ready for patients.

The key difference between clinical trials and AI trials is that in phase I and phase II, safety is not a concern for AI systems. These systems will not be applied to patient care at all until phase III, so there is no risk to humans. This is called “negligible risk research” among the ethics boards I usually interact with.

Note that the framework below is intended for use with software systems, not physical systems like surgical robots. A similar framework would exist for these systems, but the details would differ significantly.


Phase I:

Overview: This phase tries to identify tasks which are unfeasible, where the intended model is not promising enough to warrant further research. For tasks that seem promising at this stage, it will guide model design choices and inform cohort selection in the next phase of research.

Study design: The AI system is trained and tested on a small retrospective cohort. This means the data was collected in the past for other reasons, and the researchers simply use it to try to identify factors relevant to the task they want to solve. The classic example in ML research is using a public dataset.

Usually the cohort size will number between twenty and a few hundred, and is not expected to be large enough to accurately characterise model performance or make claims about efficacy.

The cohort is similar to the population the model is targeted at, but it is rarely exactly the same. Choices are often made to simplify the experiments, and these choices limit the ability of researchers to generalise the results more broadly. For example, a dataset of hospital patients is often used because it is readily available, even though the goal is to apply the system to the general (non-hospitalised) population. These design choices will often be made by researchers not specifically trained in cohort selection (i.e. by computer scientists instead of biostatisticians/epidemiologists/medical researchers).

The task itself will often be simplified as well, to aid the analysis. Proxy tasks are often targeted (we call these surrogate endpoints), instead of attempting to measure the ultimate goal of the research. An example of a surrogate endpoint from my previous blog would be the study that measured the precision and regularity of stitch placement ex-vivo with a surgical robot, rather than the effect on the patient complication rate. Good performance at the former task is not direct evidence that the system improves patient outcomes, but the latter is an experiment that could never get past an ethics board with an untested system, since it would need to be applied to patients.

Costs: The majority of the cost of phase I trials is in the researcher time, designing and training the models.

Time to translation: In clinical trials, we might expect around ten years between a successful phase I trial and a consumer-ready product.

Examples: every medical deep learning trial ever (except one). These are published at a rate of several per week, by groups ranging from high end researchers to undergraduate students. Even Kaggle competitions with medical data and a clinical target would count.


Phase II:

Overview: This phase will identify the ideas that are worth pursuing in phase III studies. Since phase III trials are expensive and time consuming, phase II experiments aim to discover the most promising model architectures, goals and patient cohorts.

Study design: The AI system is tested on a big cohort, large enough that the performance is representative of the expected maximum performance for the model design. The cohort should reflect the target population closely, although some significant differences are still likely. The major confounding variables should be accounted for, or explicitly recognised and acknowledged where they are not controlled. Cohort selection for phase II studies will often require the assistance of study design experts (biostatisticians, epidemiologists).

Cohorts in phase II AI trials are likely to number in the tens of thousands or more. This is much larger than is common in phase II clinical trials, accounting for the need in machine learning research for both training and testing cohorts. If you don’t know what this means, just accept that it will double your required cohort size at minimum compared to a similar clinical trial. 
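
If you want to see what that split looks like mechanically, here is a minimal sketch (scikit-learn, and the proportions and numbers are my own assumptions, purely for illustration):

```python
# Illustrative only: splitting a labelled cohort into training, validation and
# test sets. Only the held-out test set plays the role of a clinical trial
# cohort; the rest is "spent" on building the model.
from sklearn.model_selection import train_test_split
import numpy as np

n_cases = 20_000                      # hypothetical labelled cohort
X = np.random.rand(n_cases, 50)       # stand-in for image features
y = np.random.randint(0, 2, n_cases)  # stand-in for labels (disease yes/no)

# Hold out 30% for evaluation, then split the remainder into train/validation.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.2, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 11200 2800 6000
```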

The data will almost always still be retrospective, but the task itself will be very similar to the clinical task that the researchers seek to automate.

Costs: The majority of cost in phase II trials will be in gathering, labeling and processing the large training dataset. The costs for model design at this stage will vary, depending on the novelty of the machine learning methods.

Time to translation: In clinical trials, we might expect around five to eight years between a successful phase II trial and a consumer-ready product.

Examples: the Google study on diabetic retinopathy. This study is the only one I have seen that could be called phase II in this framework. Over 10,000 cases were used to test the system, which was trained on 130,000 images. This system performs on par with medical specialists and should accurately reflect the clinical performance (within a margin of error), and thus could legitimately form the basis for a phase III clinical trial.


Phase III:

Overview: Phase III trials are for proving clinical utility. The goal is to show how effective the system is at the clinical task in a controlled environment.

Study design: The AI system is tested on a large prospective cohort that accurately reflects the target population. Prospective means the patients are gathered prior to application of the system, and then followed up for long enough to assess the effects. The study aims to demonstrate change in a medical metric, such as improvement in patient outcomes or a reduction in the costs of clinical care (without increased harm).

Cohort selection is critical in this phase, as the system will only be accepted in clinical practice for populations that match the study cohort. A significant amount of effort is spent on study design, often requiring multiple experts working for several months.

Cohort size is more variable in phase III, and will be guided by the size of the effects identified during phase II studies (a statistical power calculation). It is possible that a phase III trial for a particularly efficacious system could be smaller than the phase II study that created the AI model. That said, I personally expect that the first phase III AI system trials will have to "overpower" their cohorts to overcome the conservative bias* of medical research.
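
For readers who have not met a power calculation before, here is a minimal sketch using statsmodels; the event rates are invented, and a real phase III design would involve a biostatistician and far more care:

```python
# Rough sample-size sketch for a two-arm trial comparing an event rate of 10%
# (standard care) against 8% (AI-assisted care). Numbers are invented.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.10, 0.08)   # Cohen's h for the two proportions
n_per_arm = NormalIndPower().solve_power(effect_size=effect,
                                         alpha=0.05,   # false positive rate
                                         power=0.8,    # chance of detecting the effect
                                         ratio=1.0,    # equal-sized arms
                                         alternative='two-sided')
print(round(n_per_arm))  # roughly 1,600 patients per arm
```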

Task selection will reflect the use case of the system. Clinical and regulatory acceptance will require proof in the same task the system will be deployed to perform (a regulatory endpoint). Again, this will require extensive planning and discussion with domain experts.

Costs: The majority of the cost during phase III trials is in the study design, cohort enrollment and management, data analysis and publication expenses. As the computer system design is largely finalised during phase II experiments, the machine learning cost during phase III should be small, although engineering costs may be much higher.

Since these studies are prospective, follow-up periods must be long enough to capture the clinical outcomes in question. For events like heart attacks, this often means several years of follow up. The costs of running studies like this can be enormous.

Time to translation: In clinical trials, we might expect around two to five years between a successful phase III trial and a consumer-ready product. The regulatory approval process can take a really long time!

Examples: No phase III trials have ever been performed using deep learning systems.

It could be argued that Computer Aided Diagnosis (CAD) systems for radiology have undergone phase III trials in the past, particularly in mammography. These systems were an older (and less performant) style of machine learning. This history could actually make the translation of deep learning systems harder, because phase IV (post-deployment) experience with CAD systems has been disappointing.


PHASES SET TO THRILL

[Image: thereturnofarchons]

Vale, Leonard

It seems to me that this kind of framework could help solve some of the problems I have written about previously, particularly regarding science communication with the public and the media. Simple categories like those I have described can identify up-front how close (or far) to clinical translation an AI system is, and that will make understanding the research much easier for everyone.

They might also help to calibrate our expectations. Almost no clinical research ever makes it through the whole system, and it would be reasonable to expect a similar culling process. Since we try to keep track of the more advanced clinical trials, we know that only 18% of phase II trials reach phase III, and probably less than 50% of phase III trials succeed.

[Image: nine-out-of-ten-stat-big]

It is probably even worse for AI systems, since the barrier to performing a phase I study (particularly with public data) is so low. It might be fair to estimate that less than one in a thousand AI trials are ever going to progress past the first phase. We see publication of five to ten medical AI papers per week, but we have only ever seen a single phase II trial.
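
Multiplying those rough stage-to-stage figures through gives a sense of the funnel (back-of-the-envelope only, using the numbers I have quoted in this post):

```python
# Back-of-the-envelope attrition funnel, using the rough figures above.
p_phase1_to_phase2 = 1 / 1000   # guess: the barrier to a phase I AI study is very low
p_phase2_to_phase3 = 0.18       # quoted figure for clinical trials
p_phase3_success   = 0.50       # "probably less than 50%"

p_idea_to_clinic = p_phase1_to_phase2 * p_phase2_to_phase3 * p_phase3_success
print(f"{p_idea_to_clinic:.5%}")  # about 0.009%, or roughly 1 in 11,000 phase I AI studies
```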

As a little bit of further cold water, it is estimated that the average drug takes more than ten years and almost a billion dollars to get from lab to market. AI systems might be easier and cheaper than that, but we don’t really have any evidence to justify this view. No AI trial has made it to phase III or beyond to find out.

Finally, a framework like this could also provide a clear road-map for researchers. Start with these sort of experiments, then move up to something like this, and by the end you will have a system that will (hopefully) address the concerns of doctors and regulators. In my experience computer scientists and engineers often find these kinds of study design choices non-obvious, and having a rough guide for how to get from idea to medical product could be helpful.

One of my new years resolutions is to try to limit the length of my blog posts so they are more digestible, so I will end this piece before it climbs too far over two thousand words 🙂

Thanks for reading and sharing.


* the conservative bias is a feature, not a bug. The first example of a new medical innovation faces a higher barrier to acceptance than subsequent implementations. This is because of the precautionary principle. The more we test a method of medical science, the better we understand it and the better we can predict the risks. For largely untested methods, we err on the side of caution.

Standardised reports might be good for humans, but they are probably bad for artificial intelligence

After an amazingly high number of readers for my last blog post (thanks to everyone who read and shared it), I have started writing a series of posts on the big question in radiology – will radiologists be replaced by machines in the near future? Geoff Hinton thinks we have five to ten years left, and as one of the handful of top deep learning researchers in the world, he is always worth listening to when he talks. Since I want to explore the topic in much more depth than the majority of articles that have popped up recently, it will take some time. But I was distracted when an alert for this paper popped up in my inbox, and I decided to write about it.

The paper was published in Radiology (our top journal) and is titled “Common Data Elements in Radiology”. The authors Dr Rubin and Dr Kahn are famous in the world of imaging informatics, and are probably some of the most recognisable names in the game. Their knowledge of radiology, informatics and computer science outstrips my own by several orders of magnitude, and their clinical and academic experience beats mine by decades. And I think they got something wrong.

[Image: really-sweating]

Me right now

First of all, let’s talk about the paper. It is about standardised reporting on steroids, using a system of templates with fixed responses in drop-down lists (they call these common data elements, or CDEs). A partial example is included below (Figure 1 from the paper).

[Image: radiol-2016161553-fig1a]

A partial example of a CDE for radiology. Maybe it would work as an app?

We all know that standardised reporting is an unpopular topic among radiologists. There has been some limited success in achieving uptake, particularly in oncologic imaging (TNM reports, BIRADS/LIRADS/PIRADS and so on). But mostly radiologists prefer free-form or ‘narrative’ reports. I haven’t seen numbers on this, but my own straw polling comes out at almost 100% against heavily structured reporting.

I actually like structured reporting myself, although I recognise the task of converting radiologists to structured reporting wholesale is somewhat Quixotic. But none of this is where I disagree with the authors.

You see, the authors make a very specific claim in this paper. They say that by hyper-standardising reports by using limited choice “common data elements”, we will improve our ability to use computers to extract meaning from our records. This is a claim about artificial intelligence, and I think it is wrong.


It is all about the information, theoriously.

“It would be of great value if computers could read and understand radiologic reports.”

Rubin, D.L. and Kahn Jr, C.E., 2016. Common Data Elements in Radiology. Radiology, p.161553.

It sure would. But drop down boxes won’t help. In fact, I might even argue that converting radiologists to use CDEs universally in their reports could be an effective way to prevent the automation of radiology. So if you are worried about our future robot overlords, maybe you should get on board the standardised reporting train.

To be clear, I am talking about computers understanding radiology reports. There is no question that template reporting helps with traditional data mining. But traditional data mining (doing things like counting up word occurrences) isn’t computer understanding by any stretch.

More on that a bit later. First let’s look at the problem.

I will start with an example, and then get to the meat of the argument. Consider two sentences:

  1. “There is a subacute infarct.”
  2. “There is an abnormality consistent with a subacute infarct although a low-grade tumour cannot be excluded.”
Pfftt. Infarcts and tumours are, like, totally different. You can’t be very good at radiology.

An infarct is a stroke, a blood clot that has gone to the brain, for those who haven’t heard the term.

I’m sure most readers prefer the first sentence. We get told all the time that hedging is bad, that qualifying our statements frustrates our referrers. But let’s assume an equally skilled and motivated radiologist generated these sentences. Do they mean the same thing? If we wanted to “read and understand” these reports, would we say they are equal? Keep this example in mind, we will come back to it.

To the argument. It is a bit complex, and it is going to touch on a field of mathematics and signal processing called information theory. I will try my best to keep this discussion accessible for the non-math geeks out there.

Information theory was proposed by Claude Shannon in the late 1940s as a mathematical way to understand the transmission of information. Telephones were all the rage, and new maths were needed to optimise their use.

This is a very general concept – information goes from one place or time to another, through some medium. It doesn't only apply to technology. You want to send me an idea, maybe where you want to go for lunch today (always important in radiology). You translate the idea into words in your brain, your mouth moves, the air moves, my eardrums vibrate, my nerves and brain activate and translate these signals into an idea in my head. But how similar are the two ideas, yours and mine? Did it get through ok, or was something lost in translation? Information theory is about transmitting information efficiently: the most meaning for the least effort.

You can think of it like Twitter. How much information can you fit in 140 characters? Maybe you can drp sm f th vwls? There are some amazing abbreviations you really should know about. Why use boring words when a :), or a ;), or even a 😮 can do the same job?

This is called compression, just like in mp3 music and zip files. You fit more information in less space. But it doesn’t always work. Audiophiles lament compressed sound quality and prefer FLACs, photographers and digital artists would rather work with TIFF or RAW images. This is because there is a limit to how much you can compress something until it isn’t the same as the original.

These are the two flavours of compression. Lossless or reversible compression means you can squeeze the information down and it is unchanged when you recover it. Lossy or irreversible compression is when you squeeze too hard and something is permanently lost. Maybe the song is a bit crackly, or your image looks pixelated.

This is the key question: can you recreate the original data exactly with only the compressed version and some sort of decompression tool? If yes, your compression is lossless.
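
The distinction is easy to demonstrate in a couple of lines of Python (a toy illustration, not a claim about how reports are actually stored):

```python
# Lossless: a zlib round trip recovers the original text exactly.
import zlib

report = "There is an abnormality consistent with a subacute infarct " \
         "although a low-grade tumour cannot be excluded."
compressed = zlib.compress(report.encode("utf-8"))
assert zlib.decompress(compressed).decode("utf-8") == report  # nothing lost

# Lossy: collapsing the sentence to a single keyword cannot be reversed.
lossy = "infarct: present"
# No decompression function can recover the hedging, the differential
# diagnosis, or the implied probability from "infarct: present" alone.
```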

[Image]

Lossless compression was invented in early 2014

Which brings us to the actual point. Drs Rubin and Kahn are advocating irreversible information compression in radiology reports, and our future AI assistants need all that info.


Passing notes in hospitals

The first thing we have to understand is that radiology reports (like most medical records) are relatively information-poor. The vast majority of information a medical expert uses to make decisions is not included in the record, and in fact never even touches the conscious mind of the practitioner. The classic paper on medical decision making and much of the work that follows it suggests that most of the time we rely on pattern matching with stored experience, rather than a conscious application of knowledge.

Since almost all of this is internal and subconscious, we have no record of it. Instead we have a few notes that at worst may be post-hoc rationalisations (not the most trustworthy data for machine learning). Many radiologists understand this subconscious element of practice implicitly. We often say that our decision is made within seconds of seeing a case, just as clinicians often say they know who is sick in a room the second after they walk through the door. That is what subconscious pattern matching feels like.

I could go on about this for ages, and I will probably return to it in a future post (there are some really interesting overlaps with concepts like dark knowledge from Geoff Hinton), but for this discussion we can just acknowledge that we are already dealing with a low-information environment in medicine. We don’t want to make it worse by chucking out more of what we have.

Let us go back to our example.

  1. “There is a subacute infarct.”
  2. “There is an abnormality consistent with a subacute infarct although a low-grade tumour cannot be excluded.”
It can be very hard to tell the difference, honest.

Now imagine you are decompressing these two pieces of information. If we assume that the radiologist was trying to be precise, are these sentences expressing the same thing?

What do you think the probability of the patient having an infarct is in each example?

It should be clear that patient two has a lower chance of having an infarct than patient one. The hedging actually implies a different meaning. We hedge to express uncertainty based on a complex array of unstated data. What hasn’t been said? Maybe the second patient had preserved grey-white differentiation, or there was more mass effect than expected, or there was a compelling clinical history for tumour. Maybe it just felt a bit wrong for no reason the radiologist could put into words.

Let’s go further. Rank these in order of likelihood that the patient has a subacute infarct, high to low.

  1. The differential list includes low grade tumour and subacute infarct.
  2. Appearances consistent with a subacute infarct.
  3. There is a subacute infarct.
  4. Appearances may suggest a subacute infarct.
  5. The differential list includes subacute infarct and low grade tumour.
  6. Appearances compatible with a subacute infarct.
Don’t think I can’t see you skipping this learning opportunity.

These are very similar sentences, in terms of word use and dictionary meaning. But a human can read and understand that these sentences are different. We take the concise, compressed wording and we reconstruct the implied meaning.

Maybe you think the list should look something like this?

  1. There is a subacute infarct.
  2. Appearances consistent with a subacute infarct.
  3. Appearances compatible with a subacute infarct.
  4. Appearances may suggest a subacute infarct.
  5. The differential list includes subacute infarct and low grade tumour.
  6. The differential list includes low grade tumour and subacute infarct.

You could probably even put a rough estimate on the probability, if you had written the sentence. Maybe the first sentence is your way of expressing a greater than 99% chance of an infarct, the fourth is something like 75%, and the sixth might be around 40%.

That is a lot of information contained in a few short words of free-form text. Consider then the nuance contained in several lines of description instead of a single sentence.

In contrast, a CDE reduces everything to a limited set of choices. If we imagine posing each question as a binary choice (not that Rubin and Kahn advocate this), we could define the 50% certainty threshold as our decision boundary. Now we can’t tell the difference between sentence number 1 and number 5.

  1. Infarct: present.
  2. Infarct: present.
  3. Infarct: present.
  4. Infarct: present.
  5. Infarct: present.
  6. Infarct: absent.
Where is all my information?
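
To make the information loss concrete, here is a toy sketch. The probabilities are the rough numbers I suggested above; the ones I did not state are invented for illustration:

```python
# Toy illustration: the probabilities are rough guesses, not measurements.
hedged_sentences = {
    "There is a subacute infarct.":                                          0.99,
    "Appearances consistent with a subacute infarct.":                       0.95,
    "Appearances compatible with a subacute infarct.":                       0.90,
    "Appearances may suggest a subacute infarct.":                           0.75,
    "The differential list includes subacute infarct and low grade tumour.": 0.55,
    "The differential list includes low grade tumour and subacute infarct.": 0.40,
}

# A binary common data element throws the nuance away: everything above the
# 50% decision boundary collapses to the same value.
for sentence, p_infarct in hedged_sentences.items():
    cde_value = "present" if p_infarct >= 0.5 else "absent"
    print(f"{p_infarct:.2f} -> infarct: {cde_value}")
```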

Now, I’ve actually simplified this a little, and in doing so I have elided the human approach to the problem. Humans not only decompress the information they receive, but they also decode it. A decoder uses a set of rules (a code) to better understand communication. In the case of humans, this is shared knowledge. If a stranger says to you “let’s go to my favourite place for dinner”, the meaning of this will be impossible for you to reconstruct. But if you share that knowledge with the other person, you can recover the exact meaning of the phrase. The decompression relies on information that is not within the transmitted message itself.

This doesn’t affect the argument, but it is worth considering in the context of text analysis in medicine. Doctors have a huge shared body of knowledge they rely on to communicate, and medical notes are communications between doctors. They transfer information through time, via the medium of the page, in quickly scribbled shorthand. A lot of the compression of this information relies on encoding.

Deep learning systems are great, but we don’t yet have the capacity to teach them complex, multi-domain codes like “medicine”. This might be a hard limit to how much a deep learning model can recover from medical text alone.

But these networks do learn a great deal, and since they don’t have this decoder it raises the question – how do these systems learn?


Can we teach a machine to love understand?

Rubin and Kahn are clearly thinking about what we might call ‘traditional’ data analysis. By using a limited set of keywords (a lexicon) and a restricted range of options, we can do statistical analysis very easily. Each case is a row in a spreadsheet, and each keyword has a score.

[Image]

Spreadsheets, the cleanest of sheets.

We call this ‘clean’ data, because it is so well organised. From here we can do great things*. Like keep track of trends in your population. Identify patient subgroups. Improve safety by auditing scan usage and outcomes. This sort of data is just great for human-guided analysis. But that is the point. We still need humans to understand it.
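
For what it is worth, that style of keyword-per-column analysis really is only a few lines of pandas (the column names and values here are invented):

```python
# Toy example of keyword-level analysis on CDE-style data.
import pandas as pd

reports = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "infarct":    ["present", "absent", "present", "absent"],
    "tumour":     ["absent", "absent", "present", "absent"],
})

# Trivial to audit and count once every report is a row of fixed keywords.
print(reports["infarct"].value_counts())
print(pd.crosstab(reports["infarct"], reports["tumour"]))
```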

But what about machines? How could artificial intelligences understand text?

Deep learning is the closest we come to human-like understanding in computers. For example, deep learning systems can recognise objects in images by identifying the features of similar objects they have learned from previous examples. So the system identifies a dog by the presence of fur, the canine body shape, the relative size compared to other objects, common dog actions and poses, and so on.

[Image: dog-breeds-frisbee-catching]

What makes a dog a dog? Pepperidge farm Deep learning remembers.

This seems a lot like understanding what a dog is, visually at least. It is also why many people are worried about the future of radiology – there is no reason the same techniques can’t recognise lung cancer by shape, location, background lung appearance, associated features like lymphadenopathy and other complex things that humans can learn to recognise but computers haven’t been able to see until recently.

In fact, there is no reason these techniques can’t work better than humans, like they have in object recognition since early 2015.

These techniques are fairly general, and so we can teach machines about text in the same ways. They learn "language" by example, which letters and words go together to express concepts. This set of tasks (called natural language processing) has been a bit more resistant to deep learning than computer vision, but the models are getting pretty darn good.

In fact, from some of my own work: I was supervising some undergrad computer science students doing deep learning this year, and we trained language models to generate radiology reports. The remarkable thing is that these models worked by choosing individual letters, one at a time. The computers learned spelling, punctuation, syntax, grammar and even what concepts fit together in sentences, paragraphs and report sections.

An example of our work is included below:

CT HEAD

CLINICAL HISTORY: CONFUSION.

TECHNIQUE: PLAIN AXIAL SCANS FROM BASE OF SKULL TO VERTEX.

REPORT: THERE IS NO ACUTE INTRA OR EXTRA AXIAL HAEMORRHAGE. NO EXTRA AXIAL COLLECTION. THERE IS NO SPACE OCCUPYING LESION.

COMMENT: NO EVIDENCE OF ANY MASS LESION OR MASS EFFECT.

While this computer generated report is cool, it is completely old-hat for deep learning.
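
For the curious, the core of a character-level language model is surprisingly small. The sketch below (PyTorch here, and not the students' actual code) just shows the shape of the idea: given the characters so far, predict the next one.

```python
# Minimal character-level language model sketch. Purely illustrative.
import torch
import torch.nn as nn

class CharLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)  # scores for the next character

    def forward(self, char_ids, state=None):
        x = self.embed(char_ids)           # (batch, seq_len, embed_dim)
        out, state = self.lstm(x, state)   # (batch, seq_len, hidden_dim)
        return self.head(out), state       # next-character prediction at every step

# Training pairs each character with the character that follows it:
text = "REPORT: THERE IS NO ACUTE INTRA OR EXTRA AXIAL HAEMORRHAGE."
vocab = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(vocab)}
ids = torch.tensor([[char_to_id[c] for c in text]])

model = CharLM(vocab_size=len(vocab))
logits, _ = model(ids[:, :-1])             # inputs: all but the last character
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))  # targets: the next characters
print(float(loss))  # untrained, so roughly log(vocab size)
```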

Like in images, where the systems learn to recognise a dog from its features, in text understanding the networks discover useful “features” of text. They learn that headings are important, and that certain combinations of letters occur after one heading but not another. They learn that two sentences in the same report shouldn’t contradict each other. They learn that the longer a sentence is, the more likely the next character is a period.

Without the medical decoder, it isn’t complete understanding. But it is a very good approximation of what can be learned in the reports alone, and it turns out that this level of understanding is still very useful.

We can look at other parts of image analysis to understand this. The holy grail for computer aided diagnosis is multimodal or sequence-to-sequence learning. You feed in all your scans, all the matching reports, and your system learns a model that turns images into reports.

You know, does diagnostic radiology.

Like, all of it.

Is this plausible, using only the information contained in text? Believe it or not, when we are talking about photos, this works quite well. Take an image, turn it into text that describes it. No human involvement, no decoder that understands the complex multimodal stuff that humans do about objects and scenes. Just a computer program that can process thousands of images per second for under a cent of electricity.

[Image: deep-learning-cases-text-and-image-processing]

Wouldn’t it be a vegetable stand? Silly computer.

To do this we need good training data. The above system was trained on images and human-like descriptive text. If the example caption for that image had said something like “vegetable: present, human: present” the results would not be anywhere near as impressive.
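
To make the "image in, text out" structure concrete, here is a heavily stripped-down sketch of an encoder-decoder model (PyTorch again, and nothing like a working radiology system): an image encoder produces a vector, and a text decoder is conditioned on that vector.

```python
# Skeleton of an image-captioning style model: image -> vector -> text.
# Purely illustrative; a real system needs far more than this.
import torch
import torch.nn as nn

class ImageToReport(nn.Module):
    def __init__(self, vocab_size, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(            # tiny stand-in for a CNN image encoder
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, image, report_tokens):
        feat = self.encoder(image)               # (batch, feat_dim)
        h0 = feat.unsqueeze(0)                   # condition the text decoder on the image
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(self.embed(report_tokens), (h0, c0))
        return self.head(out)                    # next-token scores

model = ImageToReport(vocab_size=100)
fake_scan = torch.randn(2, 1, 64, 64)            # two fake single-channel images
fake_tokens = torch.randint(0, 100, (2, 20))     # two fake report prefixes
print(model(fake_scan, fake_tokens).shape)       # torch.Size([2, 20, 100])
```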

So what is it about descriptive text that is useful in these models? How does the additional information get used?


The geometry of disease

I’ve said two things that seem to contradict each other. One is that humans have a vast store of pre-existing knowledge that is complex and multimodal and therefore cannot yet be squeezed into a neural network. The other is that outside of radiology these systems work very well combining text and visual understanding.

So which is it? Can machines understand reports without the human ability to decode them?

The answer is a definite "sort-of". Neural networks are very good at learning the useful structure in one domain (like images), and mapping it to another domain (like text). They do this by finding "universal" representations of things: a way to describe an object or concept in a vector of numbers that is the same whether you are talking about what it looks like, or how it is used in a sentence.

[0.23, 0.111, 0.4522, 0.99 …]

An example of what a sentence vector might look like. They are usually >100 numbers long.

A good example is Google's translation system, which was trained on various language pairs to transform sentences between languages, but works quite well even on language pairs it has never seen before. The vector representation of a sentence can be considered an "interlingua" or universal language.

The key here is that what is learnt and stored by these models are the relationships between concepts, and this is really nicely visualised by exploring the geometry of the vectors.

Last week I said that we need to have a good understanding of the medical task we want the machine to perform. So what are radiology reports trying to express?

If you boil it down to the essence, we want our reports to describe the image variation relevant to diagnostic and treatment decisions. This is what we want our models to learn too, and things like grammar and punctuation are pretty much irrelevant (which is why the generated report above is cool but unexciting).

So let’s try to visualise this useful variation (please note that these visualisations are not from data – I have made them up to illustrate the concepts). If we took a sentence about pulmonary embolism (blood clots in the lungs), what elements of that sentence define untreated mortality risk, and therefore the need for treatment? Doctors know that the size of the clot is the most important feature in an untreated patient. Since this is almost always described in the reports, it can be learned by a computer.

We could visualise this by embedding the text in a ‘mortality space’ where there is increasing mortality risk as we go from left to right. Ignore the Y-axis, it doesn’t reflect anything here. It is just easier to show on a page in a two dimensional plot.

[Image: vspace1]

Understanding common variants in terminology is trivial for these systems, so a ‘large main pulmonary artery PE’ would be located around the same place as the ‘saddle embolism’.

The key thing to appreciate here is that if a system understands reports, this mortality space will be universal. A mortality interlingua. Any report observation could be placed into this space and the location in the space would relate to the risk of death.

[Image: vspace2]

The other interesting thing about these sort of spaces is that you can perform arithmetic within them. You can add concepts together, subtract them and so on.

[Image: vspace3]
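
A toy numerical version of that arithmetic (the two-dimensional "concept vectors" here are invented for illustration; real learned embeddings are hundreds of dimensions and come from data):

```python
# Invented 2-D "concept vectors", purely to illustrate adding and subtracting
# concepts in an embedding space.
import numpy as np

concept = {
    "small peripheral PE": np.array([0.2, 0.1]),
    "saddle embolism":     np.array([0.9, 0.1]),
    "right heart strain":  np.array([0.3, 0.0]),
}

# Adding "right heart strain" to "saddle embolism" lands even further along
# the mortality axis than either concept alone.
combined = concept["saddle embolism"] + concept["right heart strain"]
print(combined)  # [1.2 0.1]

def mortality_position(vec):
    """In this toy space, the first axis is the 'mortality direction'."""
    return vec[0]

print(sorted(concept, key=lambda name: mortality_position(concept[name])))
```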

This is how deep learning systems might develop an ‘understanding’ of reports. They learn the useful relationships, like mortality risk.

So how would we learn this ‘mortality space’? We would need the outcomes – how many untreated patients died, and the inputs – the reports. This isn’t something we could ever do for ethical reasons, but it serves to illustrate the idea of a conceptual interlingua.

Let us return to our holy grail though. We want to learn an image to text interlingua for radiology, so we can put in a scan and output a report. We can visualise this space too, which would be learned by identifying the variation in the images that matches the variation in the text.

[Image: vspace4]

In this example, the left to right axis appears to reflect the concept of mild to severe variation in a disease – a feature that is described in reports and can be learned from the text alone. Note also that while the text is discrete (mild, moderate, severe), the images are not. This type of system should be able to recognise a case halfway between mild and moderate even if that ‘position’ in the spectrum has never been explicitly described in the text.

[Image: vspace6b]

You can imagine other useful and predictive spectra of disease as well that aren't directly related to size or scale. How aggressive a tumour looks, the shape of an abdominal aneurysm, and so on.

Because of this, there is even more promise here beyond simply automating radiology reports. We can apply this new mathematically defined understanding of useful image relationships to further research. We could take other data (like mortality outcomes) and explore how the outcome varies with location in the image space, and maybe learn entirely new things about predictive radiology.

But back to the topic at hand. What if we did not have rich, descriptive reports? What if we only have labels that say a disease is present or absent? Then our space doesn’t look as useful.

vspace5

You can see that we no longer have these beautiful disease spectra. We can only identify that the disease is there, exactly as the label described.

It should be clear that to fulfil the true promise of medical AI, these systems need all the information we can give them. They are actually very good at sifting through too much information to find the useful stuff, but they can’t create information from nothing. They can’t reverse lossy compression. And this is why strict lexicon-based reporting could hurt us in the long run.
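To make the lossy compression point painfully literal, a tiny made-up example:

```python
# Once a graded finding is squashed to present/absent, no model can recover
# the grade from the label alone. The report wording below is invented.
severity = ["mild", "moderate", "severe", "moderate", "mild"]  # what a report might say
label = [1 if s != "none" else 0 for s in severity]            # what a strict lexicon keeps
print(label)  # [1, 1, 1, 1, 1] - the spectrum is gone, and it cannot be reconstructed
```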


Potayto-Potahto

I don’t think I really disagree with Rubin and Kahn (phew), because I think we are talking about different things. They are interested in the type of data analysis we use in 99% of radiology research today, and I’m talking about the stuff that is going to be the future of our field. It isn’t even really about their paper. It is about the more general push to limit the nuance in radiology reports, restricting our vocabulary, reducing hedging and qualification. CDEs are just the extreme version.

The issue here is that many strong voices are promoting standardised reporting. RSNA is a leading proponent, which really means something considering they are responsible for the biggest conference and the most widely read journal in radiology (among many other things). There are tons of articles on the topic, all pro-standardised reporting. There are opinion pieces and keynote speeches.

I can foresee the argument that we can just make CDEs more complex, and that no-one (least of all Kahn or Rubin) is recommending binary questions. We could add descriptors like ‘mild’, ‘moderate’ and ‘severe’, for example.

But there is a limit here. The more long drop-down boxes, the more clicks, the more barriers between thought and page, the less satisfying the experience. Front-end developers and usability experts know this well. In my own hospital the radiologists rebelled against using a single (long) drop-down box to protocol studies, a simple process of click, scroll, click. They chose to go back to a tick-and-sign paper form rather than use the electronic system.

CDEs are much the same. They might be used, but only if they are easy and fast. That limits how complex they can be, and there is definitely no way we can capture all of the variation of free-text reports in CDEs.

The key point is this: any limit on our reporting language now may be good in the short term, but in the long run it could hurt the development of AI systems.

I don’t know how important this is in the context of radiology. It might be only a minor impediment with big short term benefits. It could be a major stumbling block. If Geoff Hinton is right, and we only have 5 to 10 years left in radiology as it exists today, should we continue putting effort into standardisation? These questions have never been explored, and we should be thinking about them.

It is up to those of us who work with AI to talk about this. If we go around saying things like “standardisation will help machines understand reports”, we might be giving the wrong impression to an audience of radiologists who see self-driving cars in the news but have never heard of a linear model.

Anyway, I’m going to knuckle down to write something about my upcoming series on automation in radiology, so it might be a little while until my next post. See you then!


*Great things – terrible, yes, but great.

Do machines actually beat doctors?

doctor-who-into-the-dalek_article_story_large

Spoiler: You know what they say about headlines that end with a question mark, right?

If you ask academic machine learning experts about the things that annoy them, high up the list is going to be overblown headlines about how machines are beating humans at some task where that is completely untrue. This is partially because reality is already so damn amazing there is no need for hyperbole. AlphaGo beat Lee Sedol convincingly. Most of Atari is solved. Professional transcriptionists lose to voice recognition systems.

Object recognition has been counted on the machine side of the tally for years (albeit with a few more reservations).

But not medicine. Not yet.

Considering the headlines we see, this may surprise many people. For someone who watches the medical AI space, it seems like a day can’t go by without some new article reporting on a new piece of research in which the journalists say machines are outperforming human doctors. I’m sure anyone who stumbles on this blog has seen many of them.

A few examples:

Computer Program Beats Doctors at Distinguishing Brain Tumours from Radiation Changes (Neuroscience News, 2016)

Computers trounce pathologists in predicting lung cancer type, severity (Stanford Medicine News Centre, 2016)

Artificial Intelligence Reads Mammograms With 99% Accuracy (Futurism, 2016)

Digital Diagnosis: Intelligent Machines Do a Better Job Than Humans (Singularity Hub, 2016)

I didn’t even have to search for these. Almost all of them are still at the top of my Twitter feed.

Now, these are pretty compelling headlines. The second one is even from the Stanford Uni press, not a clickbait farm. But I hope I can explain why they are both reasonably true statements, but also completely wrong. I think this could be useful for a lot of people, and not just layfolk. Courtesy of Reddit, apparently even some researchers in the field think the machines are already winning.

So I am writing this survival guide: How to read the medical AI reports with a critical eye, and see the truth through the hype.

Because the truth is already amazing and beautiful. We don’t need the varnish.


The three traps of medical AI articles

There are three major ways these articles get it wrong. They either don’t understand medicine, they don’t understand AI, or they don’t actually compare doctors and machines.


1) Humans don’t do that

The first one is the most important, because this afflicts healthcare technologists as much as journalists. It is also the most common, and therefore the major culprit behind these headlines.

Journalists, technologists, futurists and so on mostly don’t understand medicine.

Medicine is complex. The biology, the therapeutics, the whole system is so vast it is beyond the scope of any one human mind. Doctors and other healthcare professionals get a feel for it, in a vague and nebulous way, but even that is ephemeral. Some tidbits to remind us how complex treating people actually is:

  • We train for 12 years minimum to become specialists in a subfield of medicine. Doctors are required by law to keep learning throughout their careers, and only hit peak performance after decades.
  • Researchers dedicate their lives to tiny fractions of human biology.
  • For every doctor or manager there are thousands of other highly trained personnel keeping everything going. In many countries healthcare employs more people than any other industry, and most of them have been through tertiary education.

Medicine is massive. Medical research output is larger than any other discipline by orders of magnitude. The scale is mindboggling.

  • You think NIPS is getting cramped with a few thousand visitors? The biggest conference in radiology, RSNA (on this weekend), has over fifty thousand attendees.
  • The impact factor of our top journal is nearly sixty. It has over six hundred thousand readers. The Proceedings of NIPS is under 5. A few teams publish in Nature, sure, but even Nature is only 38.
  • Funding totals are hard to pin down, but public funding in the US runs at a ratio of something like 3 to 1: that is, for medicine versus all of the rest of science combined.
  • In PubMed alone (which only indexes the top 4000 or so journals) there are something like a million medical articles indexed per year.

Medicine is idiosyncratic. Most of it has grown around a questionable evidence base. Wrong results, misinterpreted results, unreproducible results, no results. Many of our decisions are made for non-scientific reasons, guided by culture, politics, finance, the law. Unless you have been inside it, it is unlikely you will understand it very well. Even from the inside it doesn’t make much sense.

This isn’t bragging. Medicine is a mess.

This is explaining why your intuitions about what doctors do are mostly wrong. Just because something sounds medical, and seems to be in the scope of medical practice, it doesn’t mean doctors are actually doing it.

And if doctors don’t do it, learn it, get good at it, value it … is it useful to say that machines are better at it?

Let’s have a look at some examples.

The article I was recommended in the reddit thread is a pretty famous one. It is a great article from a great research team, and a very valuable contribution that has been taken massively out of context.

It was publicised as Computers trounce pathologists in predicting lung cancer type, severity by the Stanford Medicine News Centre. Fighting talk for sure! Apparently the machine learning system they created has vastly outstripped human pathologists in some sort of predictive task. I am going to ignore the multiple medical errors in the piece, and focus on the meat of the problem. They say computers are better at predicting something about cancer.

This is where the alarm bells should ring. If you see the word “predictive” in the headline, you can almost stop there.

Rule 1: Doctors don’t do prediction. 

This is completely unintuitive, but almost always true. Let’s use this article as an example.

“Pathology as it is practiced now is very subjective,” said Michael Snyder, PhD, professor and chair of genetics. “Two highly skilled pathologists assessing the same slide will agree only about 60 percent of the time.”

So, skilled doctors are pretty much tossing coins in this task. That doesn’t sound right. Maybe we should check the research article itself, maybe the journalists just got it wrong? Here is line two of the abstract.

However, human evaluation of pathology slides cannot accurately predict patients’ prognoses.

Emphasis mine.

Is it clear yet?

Humans don’t do this task.

They trained a computer to identify which patients with cancer will survive for shorter times. That sounds medical, right? It sounds useful, right?

It isn’t. We have no evidence it can help.

Pathologists have trained to provide an answer that will alter treatment choices. Surgery or no surgery. Chemotherapy or radiotherapy. Both, all three, none. These are not the same thing as defining how long someone will live, and there has been no reason to get good at the latter.

Will doing prognosis research help? For sure. I’m fully on board with the Stanford research team here. This is the future of medicine. Predictive analysis is a great, unbiased way of identifying useful patient groups. It will undoubtedly lead to better treatment decisions. We call this precision medicine. We call it that because it is different from what we currently do: imprecision medicine, built on a whole bunch of compromises and simplifications that work really well despite it all.

crukmig_1000img-12647

My favourite chart in medicine. We are winning, even without being able to do prediction.

The point is that we need a defined idea of what “beating doctors” actually looks like. If we accept that a machine outperforming a doctor at anything vaguely medical is enough, we have trivialised the entire concept. Self-driving cars are better than humans at driving without using hands. It verges on tautological.

Prediction isn’t the only place this error in understanding rears its head. Look at the widely reported Autonomous Robot Surgeon Bests Humans in World First, where a robot “outperformed” human surgeons at suturing (stitching) up a pig’s intestines. Again, amazing work by an amazing team in a very good journal. They created an autonomous bowel-suturing robot. This is a great step forward. In context.

f3-large

Figure 3 from the paper. Some really nice results.

Look at the figure. What did they test? Exactness of suture spacing. How much pressure it took to force the repaired bowel to leak. These are mechanical metrics, and it is unclear how they relate to outcomes. Leaking sounds clinical, but there is no proof of a direct relationship between the force needed to cause a leak in the lab and the number of actual leaks in practice. There could be a threshold effect, with no appreciable benefit to “better” suturing. There could be a sigmoidal pattern, or some other more complex relationship. Stronger anastomoses could even be worse; stranger things have happened in medicine (stents impregnated with anti-clotting agents created more clots). We just don’t know.

The bottom three are different. Number of mistakes, time in surgery, presence of surgical complications. These are things that matter, that surgeons keep track of as metrics of their own performance. And STAR is no better than comparisons here. Much longer theatre time, no significant difference in mistakes or complications.

You are probably a little confused now, because it seems like I just described a different set of plots. STAR looks like it does pretty well in those last three.

STAR is open surgery. A surgeon would immediately understand this, and ignore the LAP and RAS results. It isn’t a fair comparison. They cut a big hole in the pig, and pulled the intestines out through it to repair them. That is a big deal, and comparing it to a human using a laparoscope is like asking them to tie a hand behind their back. The risk to the patient is much higher with an open procedure.

We use laparoscopic surgery despite slightly higher complication rates because it is better for the patient if you don’t cut a big, dangerous hole in them.

Compared to human surgeons using an OPEN technique … STAR underperforms. Three times as long under general anaesthesia is no small thing.

As Andrej Karpathy says – human accuracy is not a point, but a curve. We trade off accuracy against effort. Surgeons don’t bother with millimetre exact stitch spacing, presumably because it doesn’t help. I’m not up to date on the last hundred years of surgical research, but I am totally happy to take as given that if more careful suturing helped, surgeons would be doing it (or maybe not, often culture trumps evidence).

It is the same thing with predicting cancer survival. Pathologists don’t try to divide people into a dozen survival categories, if all clinical doctors want is to make a binary decision about surgery. It would be overkill. We do what we need to, and no more, and it already costs too much.

So maybe there is a more general rule for deciding when machines beat doctors?

Rule 1: Use a fair comparison

Rule 1a: Doctors don’t do prediction

Rule 1b: Ask a doctor what they actually do, and what a fair test might be. Doctors trade off accuracy for effort, and optimise for outcomes (be it health, financial, political, cultural etc.)

Does this mean we need to do large randomised control trials to find out if any system actually helps with outcomes?

I wouldn’t go that far. There are certainly tasks I can think of where the causal chain is understood enough to make an accurate inference. For example, in the paper above, luminal reduction post bowel repair has been tested thoroughly enough to know that a 20% or more reduction is needed to have a high chance of symptoms. We can use that as a comparison point. But saying 13% is better than 17% … we might need further testing to make that claim (or ask a bowel surgeon!).

So that is the first problem I see in “superhuman” medical systems research. But not all tasks are inappropriately chosen. Some tasks are exactly what doctors do, and we know exactly what doing better would look like. For example, Computer Program Beats Doctors at Distinguishing Brain Tumors from Radiation Changes shows that computers can do better than radiologists at distinguishing radiation necrosis (something that happens after radiotherapy) from brain tumour recurrence. This is very important, very hard for radiologists, and a great target for computational approaches.

Which brings us to the second common error.


2) These are not the AIs you are looking for

AI is AI, right? Machine learning is eating the world? Deep learning is so hot right now? Sure, except when it isn’t.

Not all machine learning is created equal, and not all of it is groundbreaking, even if most people don’t see the difference or think that it matters.

It matters.

Because the paper in AJNR (again, great paper, important paper) about brain tumours doesn’t use deep learning. This is incredibly common in the radiology literature right now, because some major papers starting in 2010/2011 showed that an old style of image analysis could do some interesting things, like identify tumour subtypes in cancer cases from medical images.

These techniques are not loosely based on the human brain. They don’t “see” the world. They aren’t “cognitive” or “intelligent” or whatever other buzzwords are flying around.

These techniques have been around for decades, and we have had the computational power to run them on laptops for almost as long. There has been no hard barrier to doing this work for a long time. So why would it suddenly succeed now, when hundreds or thousands of previous attempts have failed?

Now, that isn’t an argument on its own, but it should be concerning. Non-deep systems don’t exactly have a track record of beating humans at human-like tasks.

The same techniques didn’t beat humans in object recognition. They didn’t help solve Go, or Atari. They didn’t beat human transcriptionists or drive cars safely and autonomously for hundreds of millions of miles. They never left the parking lot.

Rule 2: Deep learning doesn’t use human-designed features

The old style of image analysis was to get humans to try to describe images with maths, in hand-crafted matrices of numbers. This is super hard, so the best we could do is identify the building blocks of images. Things like edges and small patterns. We could then quantify how much of each pattern was in an image or image region.

This is what they do in the paper.

braintexture

For starters, you can see why this is so hard for radiologists. A and E look identical.

What they are doing here is taking the region that is brighter (has more fluid in it) and quantifying how much of various textures is present. They try over a hundred textures in a cohort of around fifty patients, select the best performing ones and combine them into a predictive signature. They then use that signature to outperform humans to some level of statistical certainty.
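As a sketch of what this style of pipeline looks like, with stand-ins of my own choosing (scikit-image GLCM textures for the features, a simple univariate filter for the selection step, logistic regression to combine them, and random numbers in place of real patients):

```python
# A generic sketch of the "old style" pipeline: hand-crafted texture features,
# pick the best performers, combine them into a signature. The feature set,
# selection method and classifier here are illustrative, not the paper's recipe.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def texture_features(region):
    """Quantify a handful of GLCM texture properties for one image region."""
    glcm = graycomatrix(region, distances=[1, 2], angles=[0, np.pi / 2], levels=256)
    props = ["contrast", "homogeneity", "energy", "correlation", "dissimilarity"]
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])

# Stand-in data: ~50 "patients", each a small grayscale patch, with a binary label.
# A real cohort would use segmented lesions from actual scans.
rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(50, 32, 32), dtype=np.uint8)
labels = rng.integers(0, 2, size=50)

X = np.array([texture_features(p) for p in patches])

# Select the top features, then fit a classifier on them: the "signature".
signature = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression(max_iter=1000))
signature.fit(X, labels)
print(signature.score(X, labels))  # training accuracy only - says little about new patients
```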

Any statistically trained person reading this is hearing alarm bells right now, hopefully.

Probably the biggest problem with using human-defined features is that you will need to test them all, and select the best ones.

Multiple hypothesis testing is a weird beast. I really want to do a blog post on this at some point, because I really do find it strange. But the moral of the story is: if you test lots of hypotheses (“texture x detects cancer” is one such hypothesis), you will get false positives. At a significance threshold of 0.05, each truly useless feature still has about a 1 in 20 chance of looking significant just by chance. Test 100 features at that threshold and you should expect around 5 spurious results.

Feature selection – choosing the best performing features – probably makes it worse, not better. You expected a handful of dodgy results, and then you went and picked the top ten features.
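You can watch this happen with pure noise. A quick simulation, with sizes loosely mirroring the hundred-features, fifty-patients setup above (nothing here comes from the actual paper):

```python
# Pure-noise "textures" tested against a random outcome: some will clear the
# significance threshold anyway, and the top-ranked ones will look predictive.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_patients, n_features = 50, 100
features = rng.normal(size=(n_patients, n_features))  # textures with no real signal
outcome = rng.integers(0, 2, size=n_patients)          # e.g. necrosis vs recurrence

pvals = np.array([
    ttest_ind(features[outcome == 0, j], features[outcome == 1, j]).pvalue
    for j in range(n_features)
])

print("features 'significant' at p < 0.05:", np.sum(pvals < 0.05))
print("p-values of the top 10 selected features:", np.sort(pvals)[:10])
# Despite zero true signal, a handful of features clear the threshold, and the
# ten you would select look impressively "predictive" on this cohort.
```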

I love the mRMR algorithm used in this paper for feature selection, and use it myself. But dimensionality reduction right at the end isn’t a fix for overfitting. You’ve already overfit your data. The feature selection helps us explore the predictions and present them, nothing more and nothing less.

The thing is, all researchers understand this. We know that we are probably overfitting when we have very small n and larger p (small sample, more features than samples). We try to mitigate this as best we can, with techniques like hold-out validation sets, cross-validation and so on. This team did all of that. Absolutely perfectly, it is quality work.

But all researchers in this field still know results like this can’t be trusted. Not really. We might not need large randomised clinical trials, but unless a system is tested on a lot more cases, hopefully from a completely different patient cohort, forget about it.

But don’t take my word for it. Let’s read the paper.

Our study did have its limitations. As a feasibility study, the reported results are preliminary because our study was limited by a relatively small sample size, both for the training and holdout cohorts.

Emphasis mine. The researchers are exactly spot-on here (I honestly can’t think of an example where medical researchers of this calibre have overstated their results).

It isn’t just sample sizes. You can perfectly split your train and test sets, but if you try a dozen different algorithms to see which one works best, you have overfit your data (picture from the Stanford paper again). Which is fine, again, but needs to be recognised.

ncomms12474-f1

Testing multiple algorithms can tell you a rough range of the true test accuracy, but you shouldn’t expect the same results in a new data set.
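The same effect is easy to simulate: a dozen ‘models’ that are all truly coin flips, scored on one small test set, and the best one reported.

```python
# Twelve models that all guess at random (true accuracy 50%), evaluated on the
# same small test set. Reporting the best one overstates performance.
import numpy as np

rng = np.random.default_rng(7)
n_test, n_models = 100, 12
labels = rng.integers(0, 2, size=n_test)

accuracies = [(rng.integers(0, 2, size=n_test) == labels).mean() for _ in range(n_models)]

print("best of 12 models on this test set:", max(accuracies))
print("expected accuracy on genuinely new data: 0.5")
```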

One more thing to say here, a little more controversial. Public datasets. Be very cautious with public datasets, especially if you have worked with them before or have ever read a paper or blogpost or tweet about someone else working on them. Because you just contaminated your test set, Kagglers. You know what techniques work better than others in this dataset, which has its own idiosyncrasies and biases. The chance of spurious results that fit the bias rather than the true research target is very high.

Many machine learning researchers feel this way about ImageNet, and don’t get very excited by the weekly “new state of the art” results unless there is a big jump in accuracy. Because hundreds of groups are working on that data, trying hundreds of models with wide hyperparameter searches. There is no chance they are not overfitting.

My machine learning colleagues shrug their shoulders. It is just accepted, take each result with a grain of salt and move on. It would just be nice if someone told the journalists and the public.

So, a better formulation for rule 2.

Rule 2: Read the paper

Rule 2a: If it isn’t deep learning, it probably isn’t better than a doctor.

Rule 2b: Overfitting is easy and unavoidable in small and public datasets. Look for larger scale tests, multiple unrelated cohorts, real-world patients.


3) That doesn’t mean what you think it means

Whew. Almost there, thanks for sticking around this long.

Type 3 error is easy to spot: the article never even mentions what the headline states, or it completely misunderstands the research.

Digital Diagnosis: Intelligent Machines Do a Better Job Than Humans from Singularity Hub is a good example. There is not a single mention in the article of a head-to-head comparison. It is all projection and conjecture. It isn’t necessarily a bad article, even. But the headline doesn’t fit.

Artificial Intelligence Reads Mammograms With 99% Accuracy from Futurism is a bit more egregious. This article is about research in using natural language processing. It has nothing to do with reading mammograms, but instead mining the text of the reports that radiologists make. The headline is wrong, and so is a lot of the article.

Rule 3: read the article

Easy peasy.


The doctor is victorious…

So where does that leave us?

I remain convinced that we have yet to see a machine outperform a doctor in any task that is relevant to actual medical practice. The slowly building wealth of preliminary research suggests that won’t last forever, but for now I haven’t seen a case where the robots win.

I hope my rules will be useful, to help distinguish between great research that isn’t quite there yet, and the true breakthroughs that are worth getting very excited about.

And if I have missed a piece of research somewhere, let me know.

Except, while I was writing this – literally this last paragraph – it became untrue.

Google just published this paper in the Journal of the American Medical Association (impact factor 37 🙂 ). And since it actually lives up to the hype, it is a great way to end this piece. Because any worthy set of rules should still work when the situation changes.

They trained a deep learning system to diagnose diabetic retinopathy (damage to the blood vessels in the eye) from a picture of the retina. This is a task that ophthalmologists currently perform using the exact same technique, looking at the retina through a fundoscope.

Google’s system performed on par with the experts, in a large clinical dataset (around 130,000 images). While this isn’t necessarily “outperforming” human doctors, it probably costs under a cent per patient to run the model. An ophthalmologist costs a lot more than that, and honestly has better things to do with their time. I am happy to call that a win for the machines.
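For anyone wondering what the general recipe looks like in code, here is a heavily simplified sketch: a standard pretrained network fine-tuned to grade retinal photographs. The backbone, head and fake data are my own stand-ins, not Google’s actual architecture, training setup or images.

```python
# A hedged sketch of the general transfer-learning recipe, not Google's method.
import torch
import torch.nn as nn
from torchvision import models

# Pretrained backbone (downloads ImageNet weights), new head: referable retinopathy yes/no.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch: in reality this is where the expensive part lives - 130,000+
# fundus photographs, each graded by multiple ophthalmologists.
images = torch.randn(4, 3, 224, 224)
grades = torch.randint(0, 2, (4,))

optimizer.zero_grad()
loss = loss_fn(model(images), grades)
loss.backward()
optimizer.step()
```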

Let’s look at my rules. Do they work?

Rule 1 – is it a task human doctors do, done with the same inputs? Yep.

Rule 2 – is it deep learning, with a decent sized dataset? Yep.

Rule 3 – is it actually a thing? Yep.

So you see, I can be proven wrong with my own system. Science!

Can’t call me a cynical, turf-protecting doctor now.

As a final note, it is worth looking at why the Google system worked. They paid to make a good dataset. A lot, presumably. They had a panel of between 2 and 7 ophthalmologists grade every single one of the 130,000+ images (from a set of 54 ophthalmologists). That is a huge undertaking. I don’t even know 54 ophthalmologists.

This technology is probably close to ready for a large randomised control trial, and that is a HUGE deal.

This is what we will see in the next few years. There will be many tasks like this, where computers can do exactly what humans do if someone is willing to build the dataset. Most medical tasks probably aren’t right for it, but enough will be that this will start to happen frequently.

Exciting times indeed.

Health Informatics, big data and computer gamers @ SAHMRI

Thanks again to HISA for inviting us, and for the excellent Q & A / meet and greet after the talks. In particular thanks to Chris Radbone for organising the event.

We got a tour of SAHMRI before we started, and I can say that the building is as impressive from the inside as it is from the outside.

cx5genivqaauc8u

The spaceship vibe of the outside continues on the inside.

A very interesting layout too – very open plan throughout (some researchers love this, some hate it), but I really liked the idea of having the wet-labs right next to the office space. Do some experiments, walk ten metres, and analyse the data! It also made everything feel much more “science-y” to have the labs on display (with glass walls to maintain a controlled environment). One should never discount the benefits of a research workplace putting the Science! front and centre.

p375189263-3

The mad scientist vibe is strong at SAHMRI, and it is awesome

With SAHMRI, the new Royal Adelaide Hospital and the new Med School and School of Nursing all together at the North-West end of Adelaide, I honestly think South Australia is going to have one of the most exciting medical precincts anywhere in the world. A great time to do research in SA.

Adelaide Uni Health

The new medical and nursing school on North Terrace looks like the future

The event opened with a welcome from Louise Schaper, the CEO of HISA, who spoke about the organisation.

cyfrrdjuuaee4sr

Are CEOs allowed to be this fun?

Louise is a great speaker, and gave an exciting overview of the role of HISA in Australia and worldwide. Since most people don’t really grok what health informatics is, I thought I would put a little explanation up for people to stumble across. Louise explained that she stumbled across health informatics on the web herself after she struggled to identify what to call her own interest – “people who make computer systems for healthcare”.

That isn’t really a complete description though, because undergrad IT and CS students can make “healthcare” apps and many doctors work with data in Excel or Access. Health informaticists are people who straddle healthcare and information technology, and have at least a working knowledge of both fields. Healthcare is so complicated that technologists cannot have much of an impact without health knowledge, and most doctors barely know one end of a keyboard from the other. Health informaticists come from both sides of the health/tech divide. They can be medical professionals like doctors, nurses, physios, pharmacists and OTs who have learned coding, infrastructure and data. They can also be computer scientists, database experts, and data scientists who have trained to learn about health systems, healthcare funding, research, and disease.

So if you too have been looking for the right terminology to use for someone who builds health IT systems but also understands the problems we face in healthcare, look no further. You want Health Informaticists.

Key points from Louise:

  • HISA is a leading health informatics organisation worldwide, and has the highest membership per capita of any HI organisation.
  • HISA reaches around 13,000 people on their mailing lists and through other channels.
  • HISA runs exams for certification as a health informaticist. Hundreds have already sat these exams.
  • They are commonly approached by businesses who say they want to hire people with mixed IT and health skillsets but don’t know where to find them, as well as by graduates who can’t find jobs. There is a workforce mismatch that HISA are working to solve.
  • HISA are moving more strongly into advocacy in the government sector, and strong media engagement.
  • They run the Health Informatics Conference (HIC), which looks fantastic.

Beyond that, the talks went well. Prof Lyle Palmer (my PhD supervisor) spoke about big data infrastructure and epidemiology. It was a great (and fun) talk.

cyf0pt7uaaacxdc

Check out the picture on screen. Health misinformation is a challenge.

Highlights for me:

  • We already have tons of data; this isn’t the limiting factor in medical research. The major problem is that genomic (and other ‘omic) data is growing 2-3 times more rapidly than hard-drive storage, so a major bottleneck is the ability to store and process the data.
  • The other major bottleneck is statistical methods to understand the data at such high dimensions (the perennial problem in genomics). Both of these are part of the health informatics remit.
  • We know very little about why some people get chronic diseases and some don’t. Genetic effect sizes are very small but even strong environmental exposures only cause disease in a minority.
  • A great story was the success of genetic testing with Abacavir (anti-HIV) treatment, where a major life-threatening side effect went from an incidence rate of over 7% to under half a percent with the introduction of genetic screening.
  • Early settlers in Australia had very large families. One family line studied in WA contained over 100,000 members.

I spoke too, giving a longer talk about deep learning and precision radiology (about 30 minutes this time), so I got to cover a lot more ground. I “borrowed” Blaise Agüera y Arcas’ mathematically intuitive explanation of deep learning, which I think is one of the best I have seen for a tech-savvy audience. Not too heavy on the formulas, but not too light and hand-wavy.

The crowd-pleasing moment was when I suggested that computer gamers were directly responsible for deep learning working. Without gamers buying graphics cards for the last 20 years, we would be decades away from being capable of deep learning. The ability to do deep learning actually came quite late in the GPU development cycle – we had Crysis for 5 years before AlexNet was a thing. Imagine the state of AI research if we never went further than Wolfenstein3d or Doom.

cyf8l0wvqaabfim

I probably helped more people playing Quake than by doing medicine

So next time someone complains you are spending too much time playing games, or you are telling your children to get out more, remember that gamers have changed the world and will be responsible for trillions of dollars of economic benefits in the coming decades 🙂

I also sneakily showed some new (and very exciting) unpublished results from our research group, but those will have to stay under wraps publicly until we have finished our experiments!

 

Seminar @ SAHMRI

I will be giving a talk about my PhD project to the Health Informatics Society of Australia (SA branch) with my supervisors Professor Lyle Palmer and Associate Professor Gustavo Carneiro.

Professor Palmer works at the School of Public Health at the University of Adelaide. He is a world-renowned genetic epidemiologist and a previous executive scientific director of the Ontario Health Study (one of the largest and most detailed patient cohorts in the world). Check out his impressive citation list and an interview from 2013 about his work in Ontario.

Associate Professor Carneiro is a computer scientist at the Australian Centre for Visual Technologies, the largest deep learning and computer vision lab in Australasia. His work focuses on medical image analysis with many widely cited papers, and he has organised the deep learning in medical imaging workshops at MICCAI (the Medical Image Computing and Computer Assisted Intervention conference, one of the premier conferences in the field) in 2015 and 2016. (Edit: Gustavo can’t make it due to other commitments, so I will be filling in for him and talking about deep learning in his stead.)

The seminar is going to cover topics including precision medicine, medical image analysis and deep learning in medicine.

The talk will be at SAHMRI in the auditorium on the 25th of November. This will actually be the first time I have been inside SAHMRI, which I am really looking forward to – the architecture is pretty great! Who doesn’t want to give a talk in a spaceship?

The event is open to anyone, so if you are interested in big data in medicine, deep learning or precision radiology, you can register here and come along.