Hi,
I have seen that the results of the methods in the challenge were compared with the accuracy of radiologists as defined here: http://pubs.rsna.org/doi/pdf/10.1148/radiol.2016161174 (sensitivity = 0.87, specificity = 0.89). It was also mentioned in the Community period call. I think this comparison is not fair: the two situations are quite different, and the numbers should not be compared. I will try to explain why.
1. First, let me explain the difference for the negative cases:
* Radiologist: For a doctor in practice, usually nobody checks the mammograms again (at least in the US), so only those negative cases that develop symptoms and come back receive further examination and biopsy. If a cancer is growing slowly, even if it is there and it is visible, it might not be diagnosed within a year.
* In the challenge: Every exam classified as negative by the programs was also checked by a radiologist. This way, many cases classified as negative by the computer were called back for further evaluation and subsequently received a cancer diagnosis, which makes them false negatives.
So let me stress this again: for a radiologist, NONE of the cases are automatically double-checked by somebody else. For the programs here, EACH ONE of the cases classified as negative was checked by a radiologist, and SOME of them were called back for further evaluation and biopsies.
I think it is not very hard to see that if all of your negatives are double-checked, it is much easier to find false negatives than if none of them were double-checked. This leads to a large difference in the estimated false negative rate/sensitivity.
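To make this concrete, here is a minimal sketch with entirely made-up numbers (not challenge data): the same classifier, with the same true miss rate, ends up with a very different measured sensitivity depending on how many of its missed cancers are ever discovered during follow-up.

```python
# Toy illustration with made-up numbers (not challenge data): the measured
# sensitivity depends on how many of the missed cancers are ever discovered.

def measured_sensitivity(true_positives, missed_cancers_found):
    # Sensitivity as it would be *measured*: TP / (TP + discovered false negatives).
    return true_positives / (true_positives + missed_cancers_found)

true_positives = 70   # cancers the reader/program flags correctly
missed_cancers = 30   # cancers it misses (hidden ground truth)

# Challenge-style ascertainment: every negative is re-read by a radiologist,
# so (say) most of the missed cancers are found within the follow-up year.
challenge_sens = measured_sensitivity(true_positives, missed_cancers_found=25)

# Clinical-style ascertainment: negatives are not re-read; only the missed
# cancers that become symptomatic within a year are counted as false negatives.
clinical_sens = measured_sensitivity(true_positives, missed_cancers_found=10)

print(f"challenge-style ascertainment: {challenge_sens:.2f}")  # ~0.74
print(f"clinical-style ascertainment:  {clinical_sens:.2f}")   # ~0.88
```

The true sensitivity is 0.70 in both cases; only the way false negatives are counted changes.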
2. The positives
I think it's a less serious issue, but some of the false positives of the programs could actually have been diagnosed with cancer if a biopsy had been performed; because the doctors didn't call them back for evaluation, they were not diagnosed then, and their visible cancer was growing slowly enough that it was only diagnosed 2-3 years later. This cannot happen with a radiologist, where each one of the positive cases is evaluated.
So the problem boils down to one thing. Breast cancer often grows slowly, and just because somebody has not been diagnosed within a year after a negative exam, it does not necessarily mean that she does not have a cancer which is visible and could be diagnosed. This makes the comparison meaningless.
Just to make sure, I want to stress that I think the methods and numbers describing the performance of radiologists are fine, and they measure what they are supposed to measure. This is just a different situation, and those numbers do not apply here.
I think the situation in this challenge is much closer to the setup used in the DMIST study (http://www.nejm.org/doi/full/10.1056/NEJMoa052911), and the results published there should be used as the baseline. In that study each mammogram was read by two radiologists independently, and a patient was called back if either of the doctors wanted to call her back. So in this study EACH negative exam was reviewed by another doctor, just like in the challenge. In this study the performance of radiologists was: sensitivity = 0.7 and specificity = 0.92 (after 1 year of follow-up with digital mammograms, see Table 4).
I understand that the concept is not very easy to digest, but I hope I have managed to explain my thoughts. What do you think?
@tschaffter @Justin.Guinney @gustavo @ynikulin @yuanfang.guan @bill_lotter and others
Thanks,
Dezso
(Edited for formatting, sorry it was a mess)
Hi Dezso,
I looked at the code you provided but did not find the actual model. If possible, could you please share it? I have already run the top winner's model on an independent set and would like to compare the results with the other winners.
Thanks,
Serghei

Let's continue to discuss this thread here: https://www.synapse.org/#!Synapse:syn9935146/discussion/threadId=2131

Welcome!
> The same paper ( http://pubs.rsna.org/doi/pdf/10.1148/radiol.2016161174 ) has the following stats. 1996-2005: Sensitivity 0.787, Specificity 0.895; 2004-2008: 0.849, 0.903; 2007-2013: 0.869, 0.889. So my guess is that maybe there has been improvement. I will check with a radiologist we know.
The first period's data are probably mostly from film mammograms, so part of that improvement comes from the digital technology itself, which makes it less relevant here. Since the 2004-2008 period there has been about a 2% change in sensitivity. This is much smaller than the difference compared to the DMIST trial: sens = 0.7, spec = 0.92. There was probably some improvement in the performance of radiologists since the DMIST trial (2001-2003), but that should only be around 2-4%, and the remaining ~15% is due to methodological differences (double-checked mammograms).
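Just to spell out the arithmetic behind that estimate (rough numbers taken from the two papers; the split into "real improvement" versus "methodology" is of course only a guess):

```python
# Rough decomposition of the sensitivity gap, in percentage points.
# The 2-4 point "real improvement" figure is only a guess, as stated above.
bcsc_sens  = 0.869   # single reading, clinical follow-up (2007-2013 cohort)
dmist_sens = 0.70    # negatives effectively double-read, 1-year follow-up (2001-2003)

total_gap = bcsc_sens - dmist_sens            # ~0.17, i.e. ~17 percentage points
for real_gain in (0.02, 0.04):                # plausible real improvement since DMIST
    print(f"real gain {real_gain:.2f} -> methodological gap {total_gap - real_gain:.2f}")
# the methodological part of the gap comes out around 0.13-0.15
```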
Dezso

I am Sijia from Eagle Eye, jumping into the discussion here.
The same paper ( http://pubs.rsna.org/doi/pdf/10.1148/radiol.2016161174 ) has the following stats. 1996-2005: Sensitivity 0.787, Specificity 0.895; 2004-2008: 0.849, 0.903; 2007-2013: 0.869, 0.889. So my guess is that maybe there has been improvement. I will check with a radiologist we know.
If it gives any hope: I heard just weeks ago from a group claiming a similar level (0.87, 0.89) for mammograms, and they are in the middle of commercializing it. It is probably way too early to tell, as I don't know their study design at all, and I cannot share their stats because I have a non-disclosure agreement with them.
What I am not comfortable with is the hard statistic of sensitivity 0.869 and specificity 0.889, as it is very difficult to compare across different datasets and potentially different measures of true breast cancer cases (though at first glance it appears to me to be the same - through a cancer registry; I will look closer, or anyone kind enough is welcome to point out the difference).
Another thing I want to point out is that in the same paper, a sensitivity > 0.75 and a specificity of 0.88-0.95 are proposed as acceptable. So we are probably already close to, or within, the range of acceptable (human expert level) performance. I personally like the way it is presented in the Stanford paper (Dermatologist-level classification of skin cancer with deep neural networks): https://www.nature.com/nature/journal/v542/n7639/fig_tab/nature21056_F3.html

> furthermore, many 'cancers' identified by screening mammography are simply over-diagnosis (false positives). that for decades it was reported not to reduce the breast cancer death rate is a piece of evidence that, if it had been evaluated appropriately, its performance is probably close to random. otherwise, one cannot explain why early diagnosis doesn't lead to better treatment
I don't think we should start the decades-long and very controversial debate about the utility of breast cancer screening here. We should concentrate on the task of detecting cancer on X-rays and leave the never-ending fight to the breast cancer epidemiology experts.
> furthermore, many 'cancers' identified by screening mammography are simply over-diagnosis (false positives).
I don't think we can deal with the possible false positives in a clean way. Maybe the best we can do is assume that there are not too many of them. I think the issue of false negatives is a much bigger one, and if we clear that up, the comparison will be almost fair.
Dezso "that is definitely wrong. the problem is how to make them understand and what shall we do if they don't."
I did not get how you came to such a conclusion. I see the problem; however, I would prefer a thorough analysis (ideally by an experienced radiologist) to a fast conclusion. Usually there are a lot of nuances in how exactly the statistics were calculated.
Also, could you please explain what you meant by "a much better prediction algorithm, that calls for 100, and 90 are true, would have a much lower performance measure (0.9/ 0.9)." - what does "100, and 90 are true" mean?
Finally, which data did you mean when saying "many 'cancers' identified by screening mammography are simply over-diagnosis (false positives)"? If you mean the DREAM (or DDSM) database, then the gold standard was biopsy-proven, so false positives should not contribute to the 'cancers'. If you mean the papers about human performance, I cannot say, but I am pretty sure they should also have been biopsy-proven, otherwise it does not make a lot of sense.
that is definitely wrong. the problem is how to make them understand and what shall we do if they don't. it is like this: 100 out of 1000 people are speeding, the police catch 10, 9 of them are true speeders, and later 1 more missed speeder is found, so they report a recall of 9/(9+1) = 0.9 and a specificity of 0.99 - but the actual recall is 9/100, less than 0.1.
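As a concrete, worked version of this analogy (the numbers are of course hypothetical):

```python
# Worked numbers for the speeding analogy above (hypothetical, just arithmetic).
drivers  = 1000
speeders = 100                      # actually speeding (hidden ground truth)

caught, falsely_accused = 9, 1      # police stop 10 drivers, 9 of them truly speeding
later_found_misses = 1              # the only missed speeders ever discovered

reported_recall = caught / (caught + later_found_misses)     # 9/10   = 0.90
reported_specificity = (drivers - speeders - falsely_accused) / (drivers - speeders)
                                                              # 899/900 ≈ 0.999
actual_recall = caught / speeders                             # 9/100  = 0.09

print(reported_recall, round(reported_specificity, 3), actual_recall)
```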
and what is the next step, if they agree that there isn't much room for improvement? i think it is necessary to propose something more meaningful to do, e.g. delivering the models to other cohorts.
a much better prediction algorithm, that calls for 100, and 90 are true, would have a much lower performance measure (0.9/ 0.9).
furthermore, many 'cancers' identified by screening mammography are simply over-diagnosis (false positives). that for decades it was reported not to reduce the breast cancer death rate is a piece of evidence that, if it had been evaluated appropriately, its performance is probably close to random. otherwise, one cannot explain why early diagnosis doesn't lead to better treatment

> I would like to hear the opinion of an experienced radiologist who could also qualitatively compare the two studies and suggest the most appropriate one.
I agree.

Oh, sorry, my misunderstanding. I am not sure, though, that nothing has changed since then in mammography analysis - normally teaching approaches in universities constantly change and adapt. To sum it up, I would like to hear the opinion of an experienced radiologist who could also qualitatively compare the two studies and suggest the most appropriate one.
Hi Yaroslav,
Just a small correction, the second study is from US/Canada, and it was funded by the NCI. From the abstract:
> A total of 49,528 asymptomatic women presenting for screening mammography at 33 sites in the United States and Canada underwent both digital and film mammography
Here are their small Wikipedia page (https://en.wikipedia.org/wiki/Digital_Mammographic_Imaging_Screening_Trial) and the description of the trial on clinicaltrials.gov (https://clinicaltrials.gov/show/NCT00008346).
So I think this study is coming from relevant sources.
As far as I know, 2005 was not that long ago (I was already 15 then! :) ). Mammographic reading performance hasn't changed very much since then. See this article from 2007 (http://www.nejm.org/doi/full/10.1056/NEJMoa066099): sensitivity = 0.83, specificity = 0.89. This is a bit lower than the numbers reported in the most recent article, but the change is not huge.
Dezso

Could you please paste a link to a version of the document that anyone can download?
At least I cannot download http://pubs.rsna.org/doi/pdf/10.1148/radiol.2016161174
Thanks

Hello Dezso,
Thanks for sharing your thoughts. I think I understood your concerns: basically, what we call "false positives" and "false negatives" is strongly biased because of the single reading by a radiologist. It would be really interesting to know if this is indeed the reason that explains such a huge difference in human performance reported by the two studies. If we take the numbers from your paper, we are clearly much closer to human-level performance (probably even already there; we did not look at the specificity at sensitivity 0.7). I agree that it is important to have a correct estimation of human performance, for the obvious reason of the prize money and, even more importantly, for a correct understanding of where the algorithms stand and their acceptance by the community: logically, many more hospitals would be interested in using (super-)human level algorithms than under-performing ones. One last thing: it is probably also important to keep in mind that the second study is much older (2005, apparently) and the data could have a slightly different distribution.
I would like to hear the opinion of the Organizers and the invited radiologists, because it is indeed important to compare comparable things. I guess if a radiologist reads both papers, it will be clearer to them where the difference comes from and which one should be taken as the average human performance.
Yaroslav