I'd like to use this article to advertise our new paper just published on arXiv:
Shen, L. (2017) End-to-end Training for Whole Image Breast Cancer Diagnosis using An All Convolutional Design. arXiv:1708.09427 [cs, stat].
https://arxiv.org/abs/1708.09427
Companion website: https://github.com/lishen/end2end-all-conv
Abstract
We develop an end-to-end training algorithm for whole-image breast cancer diagnosis based on mammograms. It has the advantage of training a deep learning model without relying on cancer lesion annotations. Our approach is implemented using an all convolutional design that is simple yet provides superior performance in comparison with the previous methods. With modest model averaging, our best models achieve an AUC score of 0.91 on the DDSM data and 0.96 on the INbreast data. We also demonstrate that a trained model can be easily transferred from one database to another with different color profiles using only a small amount of training data.
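(For readers who want to reproduce the scoring step, here is a rough sketch of how model averaging and AUC scoring are typically done. This is not the paper's actual evaluation code; the file names and arrays are placeholders, and a single sigmoid output per image is assumed.)

# Sketch: average whole-image predictions from several trained Keras models
# and score the averaged probabilities with AUC on a held-out test set.
import numpy as np
from keras.models import load_model
from sklearn.metrics import roc_auc_score

model_paths = ["model_a.h5", "model_b.h5"]      # placeholder file names
x_test = np.load("x_test.npy")                  # (n, height, width, 1) mammograms
y_test = np.load("y_test.npy")                  # 1 = cancer, 0 = benign/normal

# Average the per-image cancer probabilities across models (assumes a single
# sigmoid output; for a 2-class softmax, take the cancer column instead).
probs = [load_model(p).predict(x_test).ravel() for p in model_paths]
avg_prob = np.mean(probs, axis=0)
print("AUC:", roc_auc_score(y_test, avg_prob))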
Comments and suggestions are highly encouraged!
Cheers,
Li
@davecg,
I'm not aware of this bias in scanners. I used the CBIS-DDSM data for my paper. According to their website: https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM#942c6f2390be4435b79c614ae2408986
All images are first converted to optical density values and then to 16-bit gray scale. Will this standardized processing remove the scanner issue?
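(For context, a generic illustration of that kind of conversion, not the exact CBIS-DDSM procedure; the OD cap and scaling constants here are assumptions. Optical density is conventionally defined as OD = -log10(I / I_max), and the OD values are then rescaled to the 16-bit range.)

# Generic illustration: map raw scanner intensities to optical density and
# rescale to 16-bit gray scale. Not the exact CBIS-DDSM pipeline.
import numpy as np

def to_16bit_optical_density(raw, i_max=None, od_max=3.0):
    raw = raw.astype(np.float64)
    i_max = raw.max() if i_max is None else i_max
    od = -np.log10(np.clip(raw, 1.0, None) / i_max)   # optical density
    od = np.clip(od, 0.0, od_max)                     # cap the usable OD range
    return np.round(od / od_max * 65535).astype(np.uint16)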
My best models are uploaded to the website. You are welcome to download them to generate saliency maps. I'm very curious to see what the result looks like. Thanks!

@davecg:
I'm not familiar with INbreast, but when I was testing out models on DDSM I noticed that there is a fair amount of data leakage in that dataset. When I created saliency maps for one of my models, I noticed that part of how it was discriminating between cancer and benign was by identifying general characteristics of the image, i.e. which scanner was used for that case.
The vast majority of the cases digitized with the DBA scanner are normal, the HOWTEK scanner is more mixed, and the LUMISYS scanner has the highest proportion of cancers.
DBA: benign 430, cancer 97
HOWTEK: benign 725, cancer 424
LUMISYS: benign 551, cancer 393
You can get a 25% specificity at ~90% sensitivity just by calling cases scanned using DBA negative and the rest positive.
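(A quick back-of-the-envelope check of that claim, using only the counts listed above; this is just arithmetic on the table, not a model.)

# Call every DBA case negative and every HOWTEK/LUMISYS case positive, then
# compute the resulting sensitivity and specificity from the counts above.
counts = {
    "DBA":     {"benign": 430, "cancer": 97},
    "HOWTEK":  {"benign": 725, "cancer": 424},
    "LUMISYS": {"benign": 551, "cancer": 393},
}
predicted_positive = ["HOWTEK", "LUMISYS"]   # everything not scanned on DBA

tp = sum(counts[s]["cancer"] for s in predicted_positive)   # 817
fn = counts["DBA"]["cancer"]                                # 97
tn = counts["DBA"]["benign"]                                # 430
fp = sum(counts[s]["benign"] for s in predicted_positive)   # 1276
print("sensitivity = %.3f" % (tp / (tp + fn)))              # ~0.894
print("specificity = %.3f" % (tn / (tn + fp)))              # ~0.252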
There were also some repeated cases in the dataset under different names.
It might be worth looking into the issue of data leakage in the datasets a bit more for your paper, e.g. to make sure that your model isn't calling the 97 positive cases from the DBA scanner negative. It might also be worth rebalancing the training dataset to include equal numbers of positive and negative cases from each scanner.
As I said, I'm not familiar with INbreast, but there may be similar issues with that dataset.

@ramiben,
I agree with you. The models have not been evaluated in the most stringent manner so far. I'll definitely get back to fix this once I have enough resources/time. Right now, I really just want to get the idea out in the open.
Anyway, I consider publishing on arXiv to be a continuous process. There will be v2, v3, etc. of this paper. Thank you for the feedback!

Indeed, I agree that the training burden is time consuming. My comment, of course, concerns the generalization of your reported performance. I have previously experienced cases where there were large variations in performance (AUC) between folds/random splits.
The 3rd point was a comment on performance evaluation too. As you are reporting your results on a held-out set, it should be carefully chosen to lower the chance that you have accidentally reached an easy or particularly hard test set. In the Dream Challenge, we, as organizers, had a certain procedure to do that.
Good luck with the paper,
Rami

Copying Li Shen's Reply ...
I was not aware of your weakly supervised works. It seems a lot of papers appeared in this area all of a sudden recently. I'll read your papers once I get a chance. I'll try to describe my method more accurately in the 2nd version.
My evaluations were conducted on a single split. I think it's best to do multiple splits and then take an average AUC score. But training whole-image models takes many hours and I have only a single GPU, so doing multiple splits would take a very long time because I have many convolutional designs to try. The main purpose of my paper is to study which design may provide the best result. Since all designs were evaluated on the same train-val-test sets, it was a fair comparison among them. To provide a highly reliable evaluation of a model, it is best to perform multi-center, large-scale evaluations, and that is what the top performers are doing during the community phase. Unfortunately, I'm no longer participating in this stage.
My splits were based on patients/cases. I stated that very clearly in the paper. Each split was also done in a "stratified" fashion to keep the positive/negative ratio the same between the train and test sets. I should have mentioned that in the paper. Another thing to correct in the 2nd version.
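(For concreteness, here is a generic sketch of that kind of evaluation, not the paper's actual code: repeated patient-level, stratified splits with the AUC averaged across repetitions. It assumes a recent scikit-learn that provides StratifiedGroupKFold, and train_and_predict is a placeholder for whatever trains a model and returns test-set probabilities.)

# Sketch: repeated patient-level, stratified splits with the AUC averaged
# across repetitions. StratifiedGroupKFold keeps all images of a patient in
# one fold while roughly preserving the positive/negative ratio.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import roc_auc_score

def mean_auc(images, labels, patient_ids, train_and_predict, n_splits=5, seed=0):
    # train_and_predict(train_idx, test_idx) -> predicted probabilities for test_idx
    cv = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in cv.split(images, labels, groups=patient_ids):
        probs = train_and_predict(train_idx, test_idx)
        aucs.append(roc_auc_score(labels[test_idx], probs))
    return float(np.mean(aucs)), float(np.std(aucs))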
I don't seem to understand your 3rd point. Is that a question?
Thank you for the feedback!

Hi Li,
Thank you for sharing this paper. Indeed very interesting. We have published two accepted papers in the Eurographics Workshop on Visual Computing for Biology and Medicine, which I presented just last week. The papers address weakly supervised MG classification. I attach the links to the papers, which are now available in the workshop proceedings. We also discussed the impact of image downsizing and therefore took a patch-wise approach, but without any labeled ROIs available at any stage (defined as weakly labeled). In one paper we suggest a method that can provide localization even on weakly labeled sets, and in the other we target the class imbalance with an AUC loss function, which can be implemented rather easily.
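(As a hedged illustration of that last point: one common way to implement a differentiable AUC surrogate is a pairwise ranking loss on the positive/negative score differences within a mini-batch. This is a generic sketch in TensorFlow/Keras, not necessarily the formulation used in the workshop papers, and it assumes each mini-batch contains both classes.)

# Generic pairwise AUC surrogate: squared hinge on the margin between every
# positive score and every negative score in the mini-batch.
import tensorflow as tf

def pairwise_auc_loss(margin=0.2):
    def loss(y_true, y_pred):
        y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred = tf.reshape(y_pred, [-1])
        pos = tf.boolean_mask(y_pred, y_true > 0.5)    # positive-class scores
        neg = tf.boolean_mask(y_pred, y_true < 0.5)    # negative-class scores
        # AUC counts how often a positive outranks a negative; penalize pairs
        # where the positive does not beat the negative by at least `margin`.
        diff = tf.expand_dims(pos, 1) - tf.expand_dims(neg, 0)
        return tf.reduce_mean(tf.square(tf.nn.relu(margin - diff)))
    return loss

# Example: model.compile(optimizer="adam", loss=pairwise_auc_loss(), ...)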
I have some comments and questions regarding your paper that may be relevant to all.
I think it is better to emphasize and note in the abstract that labeled ROIs are used in the first stage (although from an external data set), to differentiate it from the whole branch of weakly supervised methods.
Are your evaluations based on a single split? Did you further conduct Monte Carlo experiments or cross-validation? Are the train-test splits made on images or patients? It is highly important that the train and test sets consist of mutually exclusive patients. Is there any balancing done on the splits, for instance in terms of findings?
In the Dream Challenge the validation set was carefully chosen based on several experiments showing that it can be a representative set.
Paper links:
https://drive.google.com/open?id=0B3-4p5Bx0j11bUVrRmJKUEs5ZTA
https://drive.google.com/open?id=0B3-4p5Bx0j11ZlE4cUVSbExBalE
Our results on INbreast are based on training and testing on the same data set. We didn't get the chance to transfer-learn from our large data set due to the publication deadline. It is indeed on our to-do list.
Regards,
Rami

Thanks. I will try all of them using the Keras API, following your GitHub description of the image file preprocessing.
Serghei

Hi @sam417,
Are you familiar with Keras' API?
All you need to do is load the model and then use the predict or evaluate functions (https://keras.io/models/model/). I don't know which model would be best suited for your data. Maybe just use them all? This should be rather easy to do.
Li
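(For anyone who wants to try what Li describes above, a minimal sketch, assuming a single-output Keras model; the file names are placeholders and the preprocessing must match the GitHub instructions.)

# Minimal sketch: load a released whole-image model and get cancer
# probabilities for a set of preprocessed mammograms.
import numpy as np
from keras.models import load_model

model = load_model("whole_image_model.h5")      # placeholder file name; use one from the repo

# x must already be preprocessed and shaped to the model's input,
# e.g. (n_images, height, width, 1), scaled as described on the GitHub page.
x = np.load("my_mammograms.npy")                # placeholder
probs = model.predict(x, batch_size=4)          # per-image confidence outputs
print(probs[:10])

# With ground-truth labels, evaluate() reports the compiled loss/metrics:
# y = np.load("my_labels.npy")
# print(model.evaluate(x, y, batch_size=4))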
Hi Li,
Sounds interesting. I would like to test your model on an independent set, without fine-tuning, and compare it with the top-model results I have already created for my set. I just want to feed my set to your final model and get confidence outputs. Could you tell me which script and model I should use?
Thanks,
Serghei