I'd like to announce that we have just published the second version of our manuscript, "End-to-end Training for Whole Image Breast Cancer Diagnosis using An All Convolutional Design". In this version, we mainly fixed the following:
- We made a direct comparison of our methods with the top-performing team's method (YaroslavNet) on DDSM. After careful tuning, YaroslavNet achieved a single-model AUC score of 0.83 and an augmented score of 0.86, which are a few points below our best models.
- In the previous version, we mistakenly used softmax activation on the heatmap. In this version, we replaced it with ReLU.
- We found that the heatmap activation was not critical to performance. However, the heatmap itself is a bottleneck and can therefore be removed to improve performance.
- The Introduction section has been rewritten to make the purpose clearer.
- We have also reorganized the material and improved the writing across the board.
We would like to thank everyone who made constructive comments on the first version, and we welcome more comments on the second version!
Best,
Li
Created by Li Shen
I have had a chance to read your code. Here is my understanding of your network structure (correct me if I'm wrong):
6 VGGs -> 1024 3x3 conv -> 512 1x1 conv -> 5 1x1 conv (heatmap) -> 5x5 max pool -> 16 3x2 conv (=16 FC) -> 8 1x1 conv (=8 FC) -> final output
There are several 1x1 conv layers, which basically serve as element-wise FC operations on feature maps. The 16 3x2 conv layer is effectively 16 FC units, since the input feature map is 3x2; the 8 1x1 conv layer is likewise 8 FC units, since its input feature map is 1x1. So when you say your network is also "fully convolutional", that is not accurate.
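This equivalence is easy to check numerically. Below is a minimal NumPy sketch (the 3x2x5 shapes follow the structure above, but the weights are random stand-ins, not the actual network's):

```python
import numpy as np

# A conv layer whose kernel covers the entire input feature map is
# numerically identical to an FC layer on the flattened map.
# Hypothetical sizes matching the 3x2 case above: 5 input channels, 16 filters.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((3, 2, 5))          # H x W x C input feature map
kernels = rng.standard_normal((3, 2, 5, 16))   # full-size kernels, 16 filters

# "Convolution" with a kernel the size of the input: one output position.
conv_out = np.einsum('hwc,hwcf->f', fmap, kernels)

# The same computation as a fully connected layer on the flattened input.
fc_weights = kernels.reshape(-1, 16)           # (3*2*5) x 16 weight matrix
fc_out = fmap.reshape(-1) @ fc_weights

assert np.allclose(conv_out, fc_out)
```

The two outputs agree exactly, which is why a full-size conv window and an FC layer are the same operation up to tensor shape.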
It turns out my implementation of YaroslavNet is quite close to yours (you are welcome to read my code). The only difference is that the "1024 3x3 conv -> 512 1x1 conv" part is replaced by a "3x3 avg pool /1". Note that I used stride=1 for the 3x3 avg pooling. I used avg pooling because, in the patch classifier, I used avg pooling to replace the two FC layers, and it worked fairly well while greatly reducing the number of parameters. I'm not against FC layers; in my experience, they can often make loss curves look smooth. However, they are also parameter heavy and can slow down learning.
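For concreteness, a 3x3 average pooling with stride 1 can be sketched as follows (toy shapes, not the actual feature-map sizes). Unlike the conv pair it substitutes for, pooling carries zero learnable parameters:

```python
import numpy as np

def avg_pool_3x3_stride1(x):
    """3x3 average pooling, stride 1, 'valid' padding, on an H x W x C map."""
    h, w, c = x.shape
    out = np.zeros((h - 2, w - 2, c))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = x[i:i + 3, j:j + 3].mean(axis=(0, 1))
    return out

# Toy 5x5x2 feature map for illustration.
x = np.arange(5 * 5 * 2, dtype=float).reshape(5, 5, 2)
y = avg_pool_3x3_stride1(x)
assert y.shape == (3, 3, 2)   # stride 1 shrinks each spatial dim by 2
```

A "1024 3x3 conv -> 512 1x1 conv" pair, by contrast, carries millions of weights, which is the parameter saving referred to above.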
This also highlights the difference between my design and yours. In my design, I treat the last feature map from the patch classification network as a new image and add more convolutional blocks on top of it to learn from that new image. These are all genuine convolutional layers, not 1x1 convs simulating FC operations. I also found that I could remove the heatmap to improve performance. This leads to a very simple design - a stack of all convolutional blocks - that works well.
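A rough sketch of that idea, with hypothetical shapes and random weights standing in for trained conv blocks (this is an illustration of the design, not the manuscript's actual code):

```python
import numpy as np

rng = np.random.default_rng(1)

def conv3x3_stride2(x, n_filters):
    """A toy 3x3 conv, stride 2, with ReLU -- standing in for one conv block."""
    h, w, c = x.shape
    k = rng.standard_normal((3, 3, c, n_filters)) * 0.1
    oh, ow = (h - 3) // 2 + 1, (w - 3) // 2 + 1
    out = np.zeros((oh, ow, n_filters))
    for i in range(oh):
        for j in range(ow):
            patch = x[2 * i:2 * i + 3, 2 * j:2 * j + 3]
            out[i, j] = np.einsum('hwc,hwcf->f', patch, k)
    return np.maximum(out, 0.0)  # ReLU

# Treat the patch classifier's last feature map as a "new image"
# (hypothetical 16x16x8 size) and stack conv blocks on top of it.
fmap = rng.standard_normal((16, 16, 8))
x = conv3x3_stride2(fmap, 16)        # first added conv block -> 7x7x16
x = conv3x3_stride2(x, 32)           # second added conv block -> 3x3x32
score = x.mean(axis=(0, 1))          # global average pooling: no FC, no heatmap
assert score.shape == (32,)
```

The point is that every layer above is a genuine convolution over a spatial map; nothing is flattened until the final pooling.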
I am sorry if I upset you by not checking your code. I believe science is often advanced by friendly competition. I wish you good luck in the Collaborative Phase.
- "How am I supposed to know your description is different from your implementation?" - well, check the code, for instance. It is not the Linux kernel, just several short Python scripts. I also had limits on the write-up size. But the problem is deeper: you claim that "FC layers are a poor fit for breast cancer diagnosis because they require a convolutional layer's output to be flattened, which eliminates all the spatial information." This does not happen in our implementation. And yes, if you do not change the input image size AND the convolution window size equals the incoming feature map's spatial size, FC layers are equivalent to conv layers. In our approach that is true.
- About the ReLU in our implementation: I did not write it down for one simple and evident reason: without a ReLU, the next layer would be useless, because applying a linear operator directly to the result of a linear operator can, with the same success, be replaced by a single linear operator. That is why non-linearities are put in between. Also, about "cuts vs. bounds": https://en.wikipedia.org/wiki/Bounded_function. Unless you assume that only the negative part can be unbounded (which would be a strange assumption), applying ReLU does not bound the distribution.
- **"I have tried hard to replicate your result."** - well, indeed, that is the problem. You did not replicate it, and you have not checked several important points, as described above. You are testing things on a much smaller subset and claiming that an "all convolutional design" is better than our approach - but, with the same success, our approach can be called exactly the same. There is no fundamental problem with convolutional vs. FC layers in our approach, and there could not be. About the bottleneck: yes, it can be an issue, but in my opinion the way you performed your experiments does not permit one to judge.
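The point above about stacked linear operators collapsing into one can be checked with a tiny example (hypothetical toy weights, chosen so that one pre-activation is negative):

```python
import numpy as np

# Toy weights for illustration; any values with a negative pre-activation work.
x = np.array([1.0, 2.0])
W1 = np.array([[1.0, -3.0],
               [0.0,  1.0]])
W2 = np.array([[1.0],
               [1.0]])

# Two stacked linear layers with no activation in between collapse
# into a single linear layer whose weight matrix is the product W1 @ W2.
two_layers = (x @ W1) @ W2
one_layer = x @ (W1 @ W2)
assert np.allclose(two_layers, one_layer)       # identical: [0.]

# A ReLU in between breaks the collapse, which is why the non-linearity
# is needed for the second layer to add modelling power.
with_relu = np.maximum(x @ W1, 0.0) @ W2
assert not np.allclose(with_relu, one_layer)    # [1.] vs [0.]
```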
- **"in my opinion it should be clearly stated in the abstract that you are using CBIS-DDSM. This makes your results not directly comparable to mine and others' on DDSM, because it is just a small subset of the original DDSM. In total it contains ~2.5k images vs. 10k in the original DDSM."**: I did not leave that out intentionally; CBIS-DDSM is simply more convenient to use than the original DDSM. In the Results section, I state clearly that CBIS-DDSM is used. I'll make it clearer in the next round.
- **"in our implementation for DDSM and for SC1 there are no fully connected layers at all, neither in DetectorNet nor in the final image-level network. Please check the code here: ..."**: This is frustrating. How am I supposed to know that your description differs from your implementation? And I do not agree that a convolutional layer is equivalent to a fully connected layer; they are clearly different. Or do you mean something else? Please clarify, if you will.
- **"in the original implementation the output of the DetectorNet also first passes through a ReLU. Please check the code. Also, using ReLU does not bound the output in any way - it just cuts the negative values, leaving the positive values unchanged."**: It was not written in your write-up, so it took me some effort to figure out by myself. And in my opinion, "cuts" is the same as "bounds": ReLU "bounds" the negative values at zero. Our difference lies only in word choice.
- **"the DDSM model I uploaded right after the Competitive Phase ended already achieved ~88.5% in terms of AUC, without augmentation and per image; the AUC plot was attached. And again, it was on 20% of the original DDSM, without the 2 benign-without-callback volumes."**: That's a very good score. But as you said, it's comparing apples to oranges. I have tried hard to replicate your result, but since your description is not 100% accurate (probably with good intentions, though), my implementation is not an exact replica of your method.
Anyway, these points do not invalidate my comparisons. A major contribution of my manuscript is presenting many different network structures and comparing their pros and cons. I want this work to be a basis for other researchers to build upon and further improve the methods. Thank you for your comments clarifying your method; I appreciate that.
Li
Dear Li,
I have a number of remarks about this paper:
- "**On DDSM, our best single-model achieves a per-image AUC score of 0.88**": in my opinion it should be clearly stated in the abstract that you are using CBIS-DDSM. This makes your results not directly comparable to mine and others' on DDSM, because it is just a small subset of the original DDSM. In total it contains ~2.5k images vs. 10k in the original DDSM.
- "**Another important difference between our method and the top performing team's method is the use of FC layers.**": in our implementation for DDSM and for SC1 there are no fully connected layers at all, neither in DetectorNet nor in the final image-level network. Please check the code here: https://www.synapse.org/#!Synapse:syn9819697. In fact, I referred to fully connected layers just for ease of understanding, meaning that they are applied to the whole feature map; internally they were implemented as convolutional layers, which makes no difference at all apart from the topology of the resulting tensors.
- "**We hypothesize that is because the unbounded values can make the top layers saturate too early. Therefore, we use relu on the heatmaps instead in this study and find the convergence to improve.**": in the original implementation the output of the DetectorNet also first passes through a ReLU. Please check the code. Also, using ReLU does not bound the output in any way - it just cuts the negative values, leaving the positive values unchanged.
- "**This finally gives a per-image test AUC score of 0.83 (Table 3), which is close to our all convolutional networks but still a few points below our best models**": the DDSM model I uploaded right after the Competitive Phase ended already achieved ~88.5% in terms of AUC, without augmentation and per image; the AUC plot was attached. And again, it was on 20% of the original DDSM, without the 2 benign-without-callback volumes.
Very best,
Yaroslav