Dear all,
Please find here a couple of useful papers on the classification of whole mammograms:
http://cs.adelaide.edu.au/~carneiro/publications/multimodal.pdf
https://arxiv.org/pdf/1612.05968.pdf
For training, you can use the public databases
(DDSM and MIAS):
http://www.mammoimage.org/databases/
http://marathon.csee.usf.edu/Mammography/Database.html
I found this interesting Python script to generate augmented data:
https://github.com/benanne/kaggle-ndsb/blob/master/data.py#L315-L323
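In the same spirit, here is a minimal sketch of random affine augmentation with skimage; the parameter ranges are illustrative, not taken from the linked script:
```
import numpy as np
import skimage.transform as tf

def random_affine(image, rng=np.random):
    # Sample augmentation parameters; the ranges below are only an example
    tform = tf.AffineTransform(
        scale=(rng.uniform(0.9, 1.1), rng.uniform(0.9, 1.1)),
        rotation=np.deg2rad(rng.uniform(-15, 15)),
        shear=np.deg2rad(rng.uniform(-5, 5)),
        translation=(rng.uniform(-10, 10), rng.uniform(-10, 10)))
    # warp() resamples the image along the (inverse) coordinate map
    # and fills the borders with zeros
    return tf.warp(image, tform, mode='constant', cval=0.0)
```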
Some interesting results from another challenge (take inspiration -> SqueezeNet):
https://blog.getnexar.com/nexars-deep-learning-challenge-the-winners-reveal-their-secrets-e80c24147f2d#.xfqbmtn8s
Here are more details:
https://medium.freecodecamp.com/recognizing-traffic-lights-with-deep-learning-23dae23287cc#.peruiq96z
Enjoy!
P.S. Consider that the training data provided by the organizers are highly imbalanced:
STDOUT: 2017-02-26T19:36:47.534041029Z num. total images = 317617
STDOUT: 2017-02-26T19:36:47.534083004Z num. positive images = 1114, 0.350736893806 %
Here is a simple Python script to undersample the negative images down to the number of positive images.
```
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 31 23:32:33 2017
@author: me
"""
import pandas as pd
import sys
import random
import os

if __name__ == '__main__':
    examsMetadataFilename = sys.argv[1]
    imagesCrosswalkFilename = sys.argv[2]
    imageLabelsTempFilename = sys.argv[3]
    outputDir = sys.argv[4]
    random.seed(float(sys.argv[5]))
    metadata = pd.read_csv(examsMetadataFilename, sep="\t", na_values='.')
    images = pd.read_csv(imagesCrosswalkFilename, sep="\t", na_values='.')
    labels = pd.read_csv(imageLabelsTempFilename, sep=" ", na_values='.',
                         header=None)
    labels.columns = ['filename', 'cancer']
    imagesPos = set(labels.filename[labels.cancer == 1])
    imagesNeg = set(labels.filename[labels.cancer == 0])
    print "num. positive images = {}".format(len(imagesPos))
    print "num. negative images = {}".format(len(imagesNeg))
    # Randomly undersample the negatives down to the number of positives
    imagesNeg = set(random.sample(imagesNeg, len(imagesPos)))
    print "after sampling: num. negative images = {}".format(len(imagesNeg))
    imagesNew = imagesPos.union(imagesNeg)
    # Update the image crosswalk and the exams metadata
    images = images.loc[images.filename.isin(imagesNew)]
    metadata = metadata.loc[metadata.subjectId.isin(set(images.subjectId))]
    # Write the new exams metadata and images crosswalk for training
    metadata.to_csv(os.path.join(outputDir, "exams_metadata_train_UNDER.tsv"),
                    sep="\t", na_rep='.', index=False, header=True)
    images.to_csv(os.path.join(outputDir, "images_crosswalk_train_UNDER.tsv"),
                  sep="\t", na_rep='.', index=False, header=True)
```
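For reference, a hypothetical invocation (the script name, labels file, and seed are placeholders; the argument order is exams metadata, images crosswalk, image labels, output directory, random seed): `python undersample.py /metadata/exams_metadata.tsv /metadata/images_crosswalk.tsv image_labels.txt /preprocessedData/metadata 42`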
@davecg
Yes, Dave, I knew that. It's a very good package.
The only problem is that those techniques work in the feature space, so you first need to extract the features from the images.
If you use a fine-tuned network to extract the features, then you face the same problem, because it is fitted on the imbalanced dataset.
The two solutions suggested by @subcosmos would help in this case: having a balanced batch or a weighted loss can help in getting a good network, and after that you extract the features (see the sketch below).
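For example, here is a minimal sketch of a class-weighted loss with inverse-frequency weights, in plain NumPy (not Caffe-specific):
```
import numpy as np

def class_weights(labels):
    # Inverse-frequency weights: w_c = N / (K * n_c)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts.astype(float))
    return dict(zip(classes, weights))

def weighted_log_loss(y_true, p_pred, weights, eps=1e-12):
    # Binary cross-entropy where each sample is weighted by its class weight
    w = np.where(y_true == 1, weights[1], weights[0])
    ce = -(y_true * np.log(p_pred + eps) + (1 - y_true) * np.log(1 - p_pred + eps))
    return np.mean(w * ce)
```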
There is also the scikit-learn contrib module "imblearn", which is interesting and designed for imbalanced datasets, with functions for both oversampling and undersampling.
You could try the advanced techniques (e.g. SMOTE or Tomek links) on the generated feature vectors, or the random sampling options.
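For instance, a minimal sketch with imblearn on precomputed feature vectors (depending on your imblearn version, the method is fit_sample or fit_resample):
```
import numpy as np
from imblearn.over_sampling import SMOTE

# Placeholder data: X holds feature vectors extracted from the images,
# y the binary cancer labels (here ~5% positives)
X = np.random.rand(1000, 256)
y = np.random.binomial(1, 0.05, 1000)

# Synthesize new minority-class samples in the feature space
X_res, y_res = SMOTE(random_state=42).fit_sample(X, y)
```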
Yes, that's a good strategy. Unfortunately, I was playing with Caffe, where there's no such option. I was thinking of using a weighted loss, but again nothing is available in Caffe without recompiling, so I simply used undersampling with aggregation of multiple models.
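A minimal sketch of the aggregation step, assuming scikit-learn-style models, each trained on a different undersampled subset:
```
import numpy as np

def ensemble_predict(models, X):
    # Average the positive-class probabilities over the ensemble
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```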
Have you tried models other than VGG, such as GoogLeNet or AlexNet? I wanted to try the deep residual model (ResNet), but I didn't have much time (I joined the challenge too late).
I would also give SqueezeNet a shot: it achieves almost the same performance as AlexNet on ImageNet, but the final model is much smaller. This way one can easily train multiple networks and combine them (since the limit is 1 GB). I use stratified sampling while training: half of each image batch is cancer and half is no cancer. Despite this, my training and testing AUC is still computed on the underlying distribution of rare cancers.
With stratified sampling, the AUC converges much faster, even though the distribution that the AUC is calculated on is very different from the one being trained on! That shocked me.
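A minimal sketch of such a stratified (50/50) batch sampler, assuming you have index arrays for the two classes:
```
import numpy as np

def balanced_batches(pos_idx, neg_idx, batch_size, rng=np.random):
    # Each batch is half cancer, half no cancer, regardless of the true class ratio
    half = batch_size // 2
    while True:
        batch = np.concatenate([
            rng.choice(pos_idx, half, replace=True),    # oversample the rare positives
            rng.choice(neg_idx, half, replace=False)])  # fresh negatives every batch
        rng.shuffle(batch)
        yield batch
```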
Sadly I didn't think of this sampling approach until last week. Up until now I was doing class weighting of the loss function. @subcosmos
Thanks for your update. I think that the AUC is large on the 500 images because of the small number of positive images.
I am looking forward to seeing your solution.
Foolishly, I trained my models only on the provided training data (/trainingData), and I discovered just two days ago that external databases were allowed.
So all I can do is share my experience with other users. I definitely see the 500 as too small; I'm just surprised at how little data augmentation helped on it. I augmented the 500 images to about 100k with Keras using rotations, skews, translations, etc., but the net still recognizes the training images very well and gets to 0.95 AUC in just a few thousand epochs.
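For reference, a minimal sketch of that kind of affine augmentation with Keras' ImageDataGenerator (the parameter values here are illustrative, not the ones I used):
```
from keras.preprocessing.image import ImageDataGenerator

# Affine augmentation: rotations, shifts, shear, zoom, and horizontal flips
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
    fill_mode='constant',
    cval=0.0)

# x_train: (n, height, width, channels), y_train: labels
# for x_batch, y_batch in datagen.flow(x_train, y_train, batch_size=32):
#     ...train on the augmented batch...
```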
I think better augmentation would make use of nonlinear warping of the images instead of just affine transforms.
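A minimal sketch of such a nonlinear (elastic) deformation for a 2-D grayscale image, in the spirit of Simard et al., using scipy (alpha and sigma are illustrative):
```
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_transform(image, alpha=1000.0, sigma=30.0, rng=np.random):
    # Smooth a random displacement field, then resample the image along it
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(image.shape[0]), np.arange(image.shape[1]),
                       indexing='ij')
    coords = np.vstack([(y + dy).ravel(), (x + dx).ravel()])
    return map_coordinates(image, coords, order=1).reshape(image.shape)
```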
And yeah, external data would be best, but I think it's too late for me. I'm going to focus on other competitions, since I think this one has run its course and it's not reasonable to accomplish much given the constraints.
I did pull off one accomplishment here: using a simple VGG net, I was able to produce a network that isolates the nipple and milk-duct region of the breast with very high accuracy. Using this, I've been able to extract native-resolution tumor patches from the source DICOMs instead of relying on resizing the images down. I think I spent way too much time on this strategy, though; it works, but there isn't sufficient computational time to hone the algorithm. @subcosmos
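This is not my actual pipeline, but a minimal sketch of native-resolution patch extraction from a DICOM, assuming pydicom and a patch center already found by the detector:
```
import numpy as np
import pydicom  # older versions: `import dicom` and dicom.read_file()

def extract_patch(dcm_path, center_row, center_col, size=512):
    # Crop at native resolution instead of downsizing the whole mammogram
    arr = pydicom.dcmread(dcm_path).pixel_array
    half = size // 2
    r0 = int(np.clip(center_row - half, 0, arr.shape[0] - size))
    c0 = int(np.clip(center_col - half, 0, arr.shape[1] - size))
    return arr[r0:r0 + size, c0:c0 + size]
```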
Remember that you can also use external mammography datasets (public and private).
I've shared a couple of them above; they are commonly used as benchmarks in papers.
Also, consider that the training data provided by the organizers contain less than 1% positive images.
You should train your model both on external databases (strongly advised) and on the provided training data (/trainingData on the DREAM servers).
The 500 pilot images are just a joke and should not be used for training or evaluating models: they are too few to obtain a sufficiently general model or to evaluate the performance of your trained model. I would suggest splitting the available data (public/private/DREAM) into two random train/validation subsets (90%/10%) and then using those to build your best model.
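A minimal sketch of such a split with scikit-learn, stratified on the labels so that the rare positives appear in both subsets (the filenames and labels here are placeholders):
```
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: replace with your image filenames and binary cancer labels
filenames = np.array(["img_%05d.png" % i for i in range(1000)])
labels = np.random.binomial(1, 0.05, 1000)

# 90%/10% split, stratified on the labels
train_files, val_files, y_train, y_val = train_test_split(
    filenames, labels, test_size=0.1, stratify=labels, random_state=42)
```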
Here is a script to preprocess the images in the /trainingData folder of the DREAM servers. It generates 512x512-pixel PNG files from the provided .dcm files. You can easily adapt it to external datasets.
```
#!/bin/bash
# Author: Thomas Schaffter (thomas.schaff...@gmail.com)
# Last update: 2016-11-02
#
# Modified by me
IMAGES_DIRECTORY="/trainingData"
EXAMS_METADATA_FILENAME="/metadata/exams_metadata.tsv"
IMAGES_CROSSWALK_FILENAME="/metadata/images_crosswalk.tsv"
PREPROCESS_DIRECTORY="/preprocessedData"
PREPROCESS_IMAGES_DIRECTORY="$PREPROCESS_DIRECTORY/images"
PREPROCESS_METADATA_DIRECTORY="$PREPROCESS_DIRECTORY/metadata"
LMDB_DIRECTORY="$PREPROCESS_DIRECTORY/lmdb"
mkdir -p $LMDB_DIRECTORY
mkdir -p $PREPROCESS_IMAGES_DIRECTORY
mkdir -p $PREPROCESS_METADATA_DIRECTORY
# Count the positive images
python count_pos_images.py $EXAMS_METADATA_FILENAME \
$IMAGES_CROSSWALK_FILENAME
echo "Resizing and converting $(find $IMAGES_DIRECTORY -name "*.dcm" | wc -l) DICOM images to PNG format"
find $IMAGES_DIRECTORY/ -name "*.dcm" | parallel "convert {} -resize 512x512! $PREPROCESS_IMAGES_DIRECTORY/{/.}.png" # faster than mogrify
echo "PNG images have been successfully saved to $PREPROCESS_IMAGES_DIRECTORY/."
echo "Select ROI and resize 512x512"
python select_ROI_and_resize.py 512 $PREPROCESS_IMAGES_DIRECTORY
echo "Flip the right images"
python flip_right_imgs.py $IMAGES_CROSSWALK_FILENAME \
$PREPROCESS_IMAGES_DIRECTORY
```
count_pos_images.py
```
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Sat Jan 28 19:42:32 2017
@author: me
"""
import pandas as pd
import sys

if __name__ == '__main__':
    examsMetadataFilename = sys.argv[1]
    imagesCrosswalkFilename = sys.argv[2]
    # Pandas converts columns that have missing values to double
    metadata = pd.read_csv(examsMetadataFilename, sep="\t", na_values='.')
    # Read the image metadata
    images = pd.read_csv(imagesCrosswalkFilename, sep="\t", na_values='.')
    # Count the positive images of the left breasts
    numImagesPosL = 0
    metadataL = metadata.loc[metadata.cancerL == 1]
    for i in range(0, metadataL.shape[0]):
        # Select the images corresponding to the cancer status
        imageMetadata = images.loc[(images.subjectId == metadataL.subjectId.iloc[i]) &
                                   (images.examIndex == metadataL.examIndex.iloc[i]) &
                                   (images.laterality == "L")]
        numImagesPosL += imageMetadata.shape[0]
    # Count the positive images of the right breasts
    numImagesPosR = 0
    metadataR = metadata.loc[metadata.cancerR == 1]
    for i in range(0, metadataR.shape[0]):
        imageMetadata = images.loc[(images.subjectId == metadataR.subjectId.iloc[i]) &
                                   (images.examIndex == metadataR.examIndex.iloc[i]) &
                                   (images.laterality == "R")]
        numImagesPosR += imageMetadata.shape[0]
    numImagesPos = numImagesPosL + numImagesPosR
    # Print the totals
    print "num. total images = {}".format(images.shape[0])
    print "num. positive images = {}, {} %".format(
        numImagesPos, float(numImagesPos) / images.shape[0] * 100)
```
select_ROI_and_resize.py
```
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Thu Jan 26 22:02:29 2017
@author: me
"""
import numpy as np
import cv2
import sys
import glob

if __name__ == '__main__':
    imageSize = int(sys.argv[1])
    imageDir = sys.argv[2]
    listPNG = glob.glob(imageDir + "/*.png")
    # Detect the ROI containing the mammogram and remove the other regions
    for i in range(0, len(listPNG)):
        print "{}".format(listPNG[i])
        im = cv2.imread(listPNG[i])
        # Convert to grayscale
        gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
        newGray = np.copy(gray)
        # Extract the contours (OpenCV 2.x signature; OpenCV 3.x returns three values)
        contours, _ = cv2.findContours(gray, cv2.RETR_LIST,
                                       cv2.CHAIN_APPROX_SIMPLE)
        # Compute the bounding-box area of each contour
        area = np.zeros(len(contours))
        k = 0
        for cnt in contours:
            x, y, w, h = cv2.boundingRect(cnt)
            area[k] = w * h
            k += 1
        # Set all the small regions to black
        textMaskColor = (0, 0, 0)
        for contour in contours:
            x, y, w, h = cv2.boundingRect(contour)
            if w * h < np.max(area):
                cv2.rectangle(newGray, (x, y), (x + w, y + h), textMaskColor, -1)
        # Select the region corresponding to the tissue (the largest contour)
        areaMax = np.argmax(area)
        x, y, w, h = cv2.boundingRect(contours[areaMax])
        newGray = cv2.cvtColor(newGray, cv2.COLOR_GRAY2RGB)
        imROI = newGray[y:y + h, x:x + w, :]
        # Resize the ROI to imageSize x imageSize
        imROI = cv2.resize(imROI, (imageSize, imageSize),
                           interpolation=cv2.INTER_NEAREST)
        # Overwrite the original image with the cropped and resized ROI
        cv2.imwrite(listPNG[i], imROI)
```
flip_right_imgs.py
```
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Thu Jan 26 20:41:36 2017
@author: me
"""
import pandas as pd
import cv2
import sys

if __name__ == '__main__':
    imagesTrainCrosswalkFilename = sys.argv[1]
    imageDir = sys.argv[2]
    # Read the image metadata
    images = pd.read_csv(imagesTrainCrosswalkFilename, sep="\t", na_values='.')
    # Select the right mammographies
    imagesR = images.loc[images.laterality == "R", ].copy()
    # Update the extension from .dcm to .png
    imagesR.loc[:, 'filename'] = imagesR.filename.str.replace('.dcm', '.png')
    # Flip the right mammographies horizontally to match the left ones
    for index, row in imagesR.iterrows():
        img = cv2.imread(imageDir + "/" + row.filename)
        imgFlipped = cv2.flip(img, 1)
        cv2.imwrite(imageDir + "/" + row.filename, imgFlipped)
```
Thank you kindly :)
I used data augmentation on the 500 pilot images to hone my algorithms, but with little success. Even dramatic image transformations resulted in overfitting. Maybe the larger sample counts here will make data augmentation feasible.