A huge new CT brain dataset was released today, with the goal of training models to detect intracranial haemorrhage. So far, it looks pretty good, although I haven’t dug into it in detail yet (and the devil is often in the detail).
The dataset has been released for a competition, which obviously lead to the usual friendly rivalry on Twitter:
Of course, this lead to cynicism from the usual suspects as well.
And the conversation continued from there, with thoughts ranging from “but since there is a hold out test set, how can you overfit?” to “the proposed solutions are never intended to be applied directly” (the latter from a previous competition winner).
As the discussion progressed, I realised that while we “all know” that competition results are more than a bit dubious in a clinical sense, I’ve never really seen a compelling explanation for why this is so.
Hopefully that is what this post is, an explanation for why competitions are not really about building useful AI systems.
DISCLAIMER: I originally wrote this post expecting it to be read by my usual readers, who know my general positions on a range of issues. Instead, it was spread widely on Twitter and HackerNews, and it is pretty clear that I didn’t provide enough context for a number of statements made. I am going to write a followup to clarify several things, but as a quick response to several common criticisms:

I don’t think AlexNet is a better model than ResNet. That position would be ridiculous, particularly given all of my published work uses resnets and densenets, not AlexNets.

I think this miscommunication came from me not defining my terms: a “useful” model would be one that works for the task it was trained on. It isn’t a model architecture. If architectures are developed in the course of competitions that are broadly useful, then that is a good architecture, but the particular implementation submitted to the competition is not necessarily a useful model.

The stats in this post are wrong, but they are meant to be wrong in the right direction. They are intended for illustration of the concept of crowdbased overfitting, not accuracy. Better approaches would almost all require information that isn’t available in public leaderboards. I may update the stats at some point to make them more accurate, but they will never be perfect.

I was trying something new with this post – it was a response to a Twitter conversation, so I wanted to see if I could write it in one day to keep it contemporaneous. Given my usual process is spending several weeks and many rewrites per post, this was a risk. I think the post still serves its purpose, but I don’t personally think the risk paid off. If I had taken even another day or two, I suspect I would have picked up most of these issues before publication. Mea culpa.
Let’s have a battle
Nothing wrong with a little competition.*
So what is a competition in medical AI? Here are a few options:
 getting teams to try to solve a clinical problem
 getting teams to explore how problems might be solved and to try novel solutions
 getting teams to build a model that performs the best on the competition test set
 a waste of time
Now, I’m not so jaded that I jump to the last option (what is valuable to spend time on is a matter of opinion, and clinical utility is only one consideration. More on this at the end of the article).
But what about the first three options? Do these models work for the clinical task, and do they lead to broadly applicable solutions and novelty, or are they only good in the competition and not in the real world?
(Spoiler: I’m going to argue the latter).
Good models and bad models
Should we expect this competition to produce good models? Let’s see what one of the organisers says.
Cool. Totally agree. The lack of large, welllabeled datasets is the biggest major barrier to building useful clinical AI, so this dataset should help.
But saying that the dataset can be useful is not the same thing as saying the competition will produce good models.
So to define our terms, let’s say that a good model is a model that can detect brain haemorrhages on unseen data (cases that the model has no knowledge of).
So conversely, a bad model is one that doesn’t detect brain haemorrhages in unseen data.
These definitions will be noncontroversial. Machine Learning 101. I’m sure the contest organisers agree with these definitions, and would prefer their participants to be producing good models rather than bad models. In fact, they have clearly set up the competition in a way designed to promote good models.
It just isn’t enough.
Epi vs ML, FIGHT!
If only academic arguments were this cute
ML101 (now personified) tells us that the way to control overfitting is to use a holdout test set, which is data that has not been seen during model training. This simulates seeing new patients in a clinical setting.
ML101 also says that holdout data is only good for one test. If you test multiple models, then even if you don’t cheat and leak test information into your development process, your best result is probably an outlier which was only better than your worst result by chance.
So competition organisers these days produce holdout test sets, and only let each team run their model on the data once. Problem solved, says ML101. The winner only tested once, so there is no reason to think they are an outlier, they just have the best model.
Not so fast, buddy.
Let me introduce you to Epidemiology 101, who claims to have a magic coin.
Epi101 tells you to flip the coin 10 times. If you get 8 or more heads, that confirms the coin is magic (while the assertion is clearly nonsense, you play along since you know that 8/10 heads equates to a pvalue of <0.05 for a fair coin, so it must be legit).
Unbeknownst to you, Epi101 does the same thing with 99 other people, all of whom think they are the only one testing the coin. What do you expect to happen?
If the coin is totally normal and not magic, around 5 people will find that the coin is special. Seems obvious, but think about this in the context of the individuals. Those 5 people all only ran a single test. According to them, they have statistically significant evidence they are holding a “magic” coin.
Now imagine you aren’t flipping coins. Imagine you are all running a model on a competition test set. Instead of wondering if your coin is magic, you instead are hoping that your model is the best one, about to earn you $25,000.
Of course, you can’t submit more than one model. That would be cheating. One of the models could perform well, the equivalent of getting 8 heads with a fair coin, just by chance.
Good thing there is a rule against it submitting multiple models, or any one of the other 99 participants and their 99 models could win, just by being lucky…
Multiple hypothesis testing
The effect we saw with Epi101’s coin applies to our competition, of course. Due to random chance, some percentage of models will outperform other ones, even if they are all just as good as each other. Maths doesn’t care if it was one team that tested 100 models, or 100 teams.
Even if certain models are better than others in a meaningful sense^, unless you truly believe that the winner is uniquely able to MLwizard, you have to accept that at least some other participants would have achieved similar results, and thus the winner only won because they got lucky. The real “best performance” will be somewhere back in the pack, probably above average but below the winner^^.
Epi101 says this effect is called multiple hypothesis testing. In the case of a competition, you have a ton of hypotheses – that each participant was better than all others. For 100 participants, 100 hypotheses.
One of those hypotheses, taken in isolation, might show us there is a winner with statistical significance (p<0.05). But taken together, even if the winner has a calculated “winning” pvalue of less than 0.05, that doesn’t mean we only have a 5% chance of making an unjustified decision. In fact, if this was coin flips (which is easier to calculate but not absurdly different), we would have a greater than 99% chance that one or more people would “win” and come up with 8 heads!
That is what an AI competition winner is; an individual who happens to get 8 heads while flipping fair coins.
Interestingly, while ML101 is very clear that running 100 models yourself and picking the best one will result in overfitting, they rarely discuss this “overfitting of the crowds”. Strange, when you consider that almost all ML research is done of heavily overtested public datasets …
So how do we deal with multiple hypothesis testing? It all comes down to the cause of the problem, which is the data. Epi101 tells us that any test set is a biased version of the target population. In this case, the target population is “all patients with CT head imaging, with and without intracranial haemorrhage”. Let’s look at how this kind of bias might play out, with a toy example of a small hypothetical population:
In this population, we have a pretty reasonable “clinical” mix of cases. 3 intracerebral bleeds (likely related to high blood pressure or stroke), and two traumatic bleeds (a subdural on the right, and an extradural second from the left).
Now let’s sample this population to build our test set:
Randomly, we end up with mostly extraaxial (outside of the brain itself) bleeds. A model that performs well on this test will not necessarily work as well on real patients. In fact, you might expect a model that is really good at extraaxial bleeds at the expense of intracerebral bleeds to win.
But Epi101 doesn’t only point out problems. Epi101 has a solution.
So powerful
There is only one way to have an unbiased test set – if it includes the entire population! Then whatever model does well in the test will also be the best in practice, because you tested it on all possible future patients (which seems difficult).
This leads to a very simple idea – your test results become more reliable as the test set gets larger. We can actually predict how reliable test sets are using power calculations.
These are power curves. If you have a rough idea of how much better your “winning” model will be than the next best model, you can estimate how many test cases you need to reliably show that it is better.
So to find out if you model is 10% better than a competitor, you would need about 300 test cases. You can also see how exponentially the number of cases needed grows as the difference between models gets narrower.
Let’s put this into practice. If we look at another medical AI competition, the SIIMACR pneumothorax segmentation challenge, we see that the difference in Dice scores (ranging between 0 and 1) is negligible at the top of the leaderboard. Keep in mind that this competition had a dataset of 3200 cases (and that is being generous, they don’t all contribute to the Dice score equally).
So the difference between the top two was 0.0014 … let’s chuck that into a sample size calculator.
Ok, so to show a significant difference between these two results, you would need 920,000 cases.
But why stop there? We haven’t even discussed multiple hypothesis testing yet. This absurd number of cases needed is simply if there was ever only one hypothesis, meaning only two participants.
If we look at the leaderboard, there were 351 teams who made submissions. The rules say they could submit two models, so we might as well assume there were at least 500 tests. This has to produce some outliers, just like 500 people flipping a fair coin.
Epi101 to the rescue. Multiple hypothesis testing is really common in medicine, particularly in “big data” fields like genomics. We have spent the last few decades learning how to deal with this. The simplest reliable way to manage this problem is called the Bonferroni correction^^.
The Bonferroni correction is super simple: you divide the pvalue by the number of tests to find a “statistical significance threshold” that has been adjusted for all those extra coin flips. So in this case, we do 0.05/500. Our new pvalue target is 0.0001, any result worse than this will be considered to support the null hypothesis (that the competitors performed equally well on the test set). So let’s plug that in our power calculator.
Cool! It only increased a bit… to 2.6 million cases needed for a valid result :p
Now, you might say I am being very unfair here, and that there must be some small group of good models at the top of the leaderboard that are not clearly different from each other^^^. Fine, lets be generous. Surely noone will complain if I compare the 1st place model to the 150th model?
So still more data than we had. In fact, I have to go down to the 192nd placeholder to find a result where the sample size was enough to produce a “statistically significant” difference.
But maybe this is specific to the pneumothorax challenge? What about other competitions?
In MURA, we have a test set of 207 xrays, with 70 teams submitting “no more than two models per month”, so lets be generous and say 100 models were submitted. Running the numbers, the “first place” model is only significant versus the 56th placeholder and below.
In the RSNA Pneumonia Detection Challenge, there were 3000 test images with 350 teams submitting one model each. The first place was only significant compared to the 30th place and below.
And to really put the cat amongst the pigeons, what about outside of medicine?
As we go left to right in ImageNet results, the improvement year on year slows (the effect size decreases) and the number of people who have tested on the dataset increases. I can’t really estimate the numbers, but knowing what we know about multiple testing does anyone really believe the SOTA rush in the mid 2010s was anything but crowdsourced overfitting?
So what are competitions for?
They obviously aren’t to reliably find the best model. They don’t even really reveal useful techniques to build great models, because we don’t know which of the hundred plus models actually used a good, reliable method, and which method just happened to fit the underpowered test set.
You talk to competition organisers … and they mostly say that competitions are for publicity. And that is enough, I guess.
AI competitions are fun, community building, talent scouting, brand promoting, and attention grabbing.
But AI competitions are not to develop useful models.
Great post. It’s worth noting another problem with model competitions – for most scoring systems, they reward overconfidence.
Specifically, if I have a model that I expect performs about as well as everyone else’s, I maximize my probability of winning not by correctly reporting results, but by extremizing my predictions slightly, since higher expected variation in score increases my odds of winning. See my tweetstorm here: https://twitter.com/davidmanheim/status/1080458380806893568
LikeLike
I have a beginner question:
In the example of calculating the required sample size when comparing only the 1st and 150th submission, 99.99% was used as the required confidence level.
But when comparing only two models, the Bonferroni correction tells us that a pvalue of 0.05/2 = 0.025 would be sufficient.
So therefore, if only testing two hypotheses, the required confidence level in the ‘calculator’ should be 97.5%, shouldn’t it?
I don’t think the previous Bonferroni calculation with 500 hypotheses applies here, because it seems like the comparison between 1st and 150th is done under the assumption that the top submissions are so similar, that they really are the same model. Therefore the number of unique hypotheses in this case should be 2.
Please let me know what I am missing here, I am new to this. Also thanks for the article, I enjoyed it very much.
LikeLiked by 1 person
The only group of hypotheses being tested is that one model is better than the others. Technically I am guilty of salami slicing by paring down the data until I can show a significant difference. This is the equivalent of generating new hypotheses until you get an answer you can publish!
So you are right that I am being naughty, but it is because we don’t get to discard our primary hypothesis when we feel like it.
LikeLike
Generally speaking, not too many innovations can be made for image classification and segmentation, but many possibilities in more complicated tasks of speech, nlp and cv (e.g., multi object tracking, video spatial temporal detection, etc.).
LikeLike
Luke, what do you think about the fact that most wellperforming models are extremely similar? This was pointed out and analyzed in more detail by https://arxiv.org/pdf/1905.12580.pdf . To me, this observation also conforms well to reality – in each computer vision Kaggle competition most of the submissions are convolutional neural networks based on a small set of architectures. I feel that these submissions seem to mainly differ by the values of various hyperparameters and the type of ensembling employed. These submissions cover but a tiny patch of the computer vision model space.
Sometimes one or a few teams end up way in front of the “peloton”. It seems to me that this has always required additional insights about the data (I would appreciate a counterexample). As you correctly point out, medical image related Kaggle challenges produce leader boards with no appreciable score gaps, which presumably means that their winners have not come up with groundbreaking insights.
LikeLike
Great post. There is another (possibly epi 101) issue related to your multiple testing points: regression to the mean. When you observe something perform extremely well (or poorly), expect it to do worse (or better) next time. You’re virtually guaranteed that the top model will do worse on a new data set than it did in the competition.
Further, classification accuracy and similar counting metrics are in very bad for comparing models because an arbitrarily small change in model can lead to an arbitrarily large change in accuracy.
So it’s all an exercise in overinterpreting noise.
LikeLike
It depends a lot on competition dataset, metric and unique solutions found by competitors.
While I would agree about the level of randomness in the leaderboard standings, it should not be as significant as described here, otherwise bestfitting for example would not be able to win (or be within top 5 or so) for so many competitions so reliably.
On the data science bowl, the private test distribution was significantly different to the public, while the same team has won the first place. On the last RSNA competitions, two top solutions just swapped the order on private/public.
With the high randomness it would be almost impossible to enter a dozen of competitions and to win or at least end up in top 5 in all of them.
LikeLike
@Dmytro I respectfully disagree. If there is more randomness to how a certain kind of model performs it is highly likely to place in the top more often than other models and virtually guaranteed to place top 5.
Try this simulation in R:
experiment < plyr::raply(1000, {
nns < data.frame(score=rnorm(50, sd=1.2), model="NN")
glms < data.frame(score=rnorm(50), model="GLM")
competition < rbind(nns, glms)
ranks < order(competition$score, decreasing = T)
standings < competition$model[ranks]
c(win=standings[1] == "NN", top="NN" %in% standings[1:5])
}); colMeans(experiment)
Will yield something like a 75% chance to win and a 99% chance to place top 5 even though both on average perform exactly the same.
LikeLike
I agree the group of models with more randomeness would more likely to occupy top places. But this experiemnt does not represent the competition from one person point of view. The kaggle case is more like: the chance to stay in top 5 with 2 submissions vs other 2000, and to achieve this over multiple competitons.
If bestfitting for example has the best model, more randomness would only decrease his chance to finish in to 5. For an average model of cource, more randomness would increase the chance to win at least sometimes. But not by much as depending on metrics, more randomness would more likely to penalise more each individual sample in the test dataset metrics.
While I’d agree about randomness or results, and very often top models are not suitable for production due to complexity of solution, I don’t think the best model would be around top 3040% of standings. Much more likely it would still be in top 10%, with the exception of the few competitions with very noisy labels.
I have done a dozen of competitions or so on kaggle and other platforms, and almost always when I lost, the winning solution was better than mine, even while both were within 12% of leaderboard standings. I certainly would not claim I lost mostly due to randomness. Very often one of the top published solutions was nice and elegant.
LikeLike
You are right, I somehow did not understand that bestfitting is a person and took it as a class of models.
LikeLike
There is a clearer test of this, though less statistical. Write a model and commit it 20 times on kaggle. At each point submit the output. You will definitely see variation.
LikeLike
Thank you Mr Luke for the very informative and educational posts.
However, I do believe that the top performers/ grandmasters in kaggle all understand all of what you got to say here about overfitting for the holdout test, real life application differences, randomness etc, etc… and they all are definitely very highly skilled machine learning experts as well.
I dont think any of the top performers there are just lucky without talents, skills and experience by any means. If you do spend time reading their code, you’ll understand what I mean.
I think you saying stuff like ‘winning a competition is just a coin toss’ is not very well – thought out. Besides, if you don’t think that most Kaggle Grandmasters do not know the stuff you know about machine learning, you are well overestimating yourself. Please have some humility in your words. You are basically spitting on some of the most passionate and skilled machine learning people in the world.
LikeLike
don’t you think it’s possible to use better statistical technique to find the top performing models that we still can not diffrentiate between them given the test set sample size?
using bonferroni correction for more than 2 models is absurdly low power statistical test.
LikeLike