Doctors are weak learners

Published

February 2, 2023

Most of us have experienced a medical misdiagnosis of some kind. It is a frustrating experience that can result in enormous time wasted with unnecessary office visits and medical tests, plus significant stress and prolonged illness.

At the heart of the matter is the fact that many illnesses are still diagnosed primarily through human expertise, with little understanding of the uncertainty behind those predictions. Even when lab tests are involved, interpretation of results and diagnostic conclusions still usually rely on the judgment of individual doctors. For instance, medical imaging such as MRIs or CT scans usually involves a trained radiologist reading the result and writing up a conclusion, which is then interpreted and communicated by the patient’s provider. The diagnosis is therefore the product of expert intuition from a pair of trained specialists. Put another way, it is the result of those experts’ combined knowledge and training, and equally of their gaps in knowledge and training.

There’s some interesting data that bears this out. Back in 2018, it was first shown that AI outperforms most expert dermatologists at detecting skin cancer, and reams of papers have come out since then improving on those results. If we go a bit further back to 2015, we find evidence that individual dermatologists misdiagnose at much higher rates than they would if they combined their votes on each diagnosis (though this ‘collective intelligence’ is still much worse than AI). The authors argue that this effect is due to each doctor having their own biases - their own way of looking at the same signs, symptoms, and test results, and drawing their own unique and often incorrect conclusions.

This shouldn’t really be too surprising. Each doctor’s understanding is innately limited by human abilities of perception and of cognitive integration across diverse forms of observation and patient data. In addition, each doctor can only practically be exposed to a (very) small subset of all the patient data that might be relevant to their training. In other words, each doctor can only fit their own limited mental model of disease to a limited sample of the data, leading to different doctors having different diagnostic models that are all suboptimal in their own unique way.

In the parlance of machine learning and artificial intelligence, these kinds of models are termed ‘weak learners’. Concretely, a weak learner is a deliberately simple statistical model that has limited mathematical complexity and is often only allowed to ‘see’ a limited subset of the data.
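To make that concrete, here’s a minimal sketch in Python (using scikit-learn and one of its toy datasets, which is my choice for illustration, not anything from the studies above): a depth-1 decision tree, the classic weak learner, fit on only a small random slice of the data.

```python
# A minimal sketch of a single weak learner: a depth-1 decision tree (a
# "decision stump") that can only ask one question about the data, fit on a
# small random slice of the training set. The dataset and library choices
# (scikit-learn's built-in breast cancer data) are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limited exposure: this learner sees only ~10% of the available cases.
rng = np.random.default_rng(0)
subset = rng.choice(len(X_train), size=len(X_train) // 10, replace=False)

# Limited complexity: max_depth=1 means a single yes/no split.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_train[subset], y_train[subset])

print("one weak learner, test accuracy:", stump.score(X_test, y_test))
```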

What’s fascinating is that, in the ML literature, weak learners actually serve a very important role in producing very high-performing models. The idea is to use an ensemble, which means training many weak learners on the same task and then making final predictions by taking a vote among your learners. In fact, for many types of problems, particularly those where the data is tabular, these types of models are consistently the best performing (see here and here). But the key takeaway is that you need to ensemble across many, many of these weak learners (i.e. hundreds or thousands, depending on the size of the data). It also helps to train the learners sequentially, so that each subsequent learner corrects the mistakes of the ones before it - this is the idea behind boosting.
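Here’s a sketch of what ensembling those stumps looks like, under the same illustrative setup as above (and assuming scikit-learn 1.2 or later for the `estimator` argument): bagging, where hundreds of stumps each see a different random sample and then vote, and gradient boosting, where stumps are trained one after another so each corrects the errors of the ensemble so far.

```python
# Ensembling the same kind of stump, two ways (illustrative dataset and
# hyperparameters; `estimator=` assumes scikit-learn >= 1.2).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: 500 stumps, each fit on its own random 10% of the training data,
# combined by a vote.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=500,
    max_samples=0.1,
    random_state=0,
)
bagged.fit(X_train, y_train)
print("500 bagged stumps:", bagged.score(X_test, y_test))

# Boosting: 500 stumps trained sequentially, each one fit to the errors of
# the ensemble built so far.
boosted = GradientBoostingClassifier(max_depth=1, n_estimators=500, random_state=0)
boosted.fit(X_train, y_train)
print("500 boosted stumps:", boosted.score(X_test, y_test))
```

Random forests and gradient-boosted trees - the consistently strong tabular models mentioned above - are exactly these two ideas, usually with slightly deeper trees.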

One of the findings of the 2015 skin cancer study was that, indeed, if you ensemble the predictions of all the doctors, you get predictions that are on average far better than those of any individual doctor. Even so, AI now vastly outperforms the best human experts at skin cancer detection as well as a host of other imaging-based diagnostics. Basically, the best predictions emerge the way a forest emerges when you zoom out from lots of individual trees, where each tree is one guess that helps build an overall consensus. When you only get one doctor’s opinion, you only see one of those trees.
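A toy simulation shows why voting helps. The numbers below aren’t from either study - I’m assuming hypothetical diagnosticians who are each right 65% of the time and whose errors are independent, which is the best case for voting; real doctors’ errors are correlated, so the real-world gains are smaller.

```python
# A toy simulation of "collective intelligence" by majority vote. The 65%
# individual accuracy and the independence of errors are assumptions for
# illustration, not numbers from either study.
import numpy as np

rng = np.random.default_rng(0)
n_cases = 100_000
individual_accuracy = 0.65

for n_doctors in (1, 5, 25, 101):
    # Each doctor is independently correct with probability 0.65 on each case.
    votes = rng.random((n_cases, n_doctors)) < individual_accuracy
    majority_correct = votes.sum(axis=1) > n_doctors / 2
    print(f"{n_doctors:>3} doctors voting: {majority_correct.mean():.1%} of cases correct")
```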

But two important questions remain:

1. Are doctors aware that they’re not very accurate?
2. Are they communicating this to their patients?

Logically, I would think “no.” First, most established diagnostic methods are taken for granted - their sensitivity and specificity are often not considered in a patient setting. Second, determining real-world diagnostic accuracy would require that doctors get unbiased feedback from an unbiased sample of their real patients, which clearly isn’t happening. Third, even if doctors were sufficiently aware of their own uncertainty, communicating it to patients would require a culture that actively encourages humility.

Anecdotally, I think patients would also say “no” - i.e. that they often can’t tell if their doctor is really sure or just guessing. I could be wrong though.

Finally, there is some data. In one study, general internists were given two easy and two difficult case vignettes to diagnose. Doctors were right 55.3% of the time on the easy cases, but only 5.8% of the time on the difficult ones. Regardless, their confidence was nearly the same - 7.2 out of 10 for easy cases versus 6.4 for difficult ones. That’s so close that the authors suspect the difference is “clinically insignificant.”

So we can establish that an individual doctor or provider can’t be all that accurate, because they’re basically a weak learner. But in many cases they’re also overconfident, and that’s a problem: if each doctor’s errors are baked into their own individual mental model, overconfidence means that uncertainty never gets acknowledged or communicated. More generally, that suggests a connection between doctors’ overconfidence and misdiagnosis across the healthcare spectrum.

I’m not sure how providers should try to estimate their uncertainty, but it does seem like it would be really worthwhile if they could. Until then, it also seems like we would all benefit from more discussion and transparency about diagnostic error and more access to data-driven diagnostic tools.