CLASSIFICATION ALGORITHMS AND PUBLIC DATABASES (LINKS INSIDE)

Dear all, please find here a couple of useful papers on the classification of whole mammograms:

http://cs.adelaide.edu.au/~carneiro/publications/multimodal.pdf
https://arxiv.org/pdf/1612.05968.pdf

For training you can use the public databases (DDSM and MIAS):

http://www.mammoimage.org/databases/
http://marathon.csee.usf.edu/Mammography/Database.html

I found this interesting Python script to generate augmented data:

https://github.com/benanne/kaggle-ndsb/blob/master/data.py#L315-L323

Some interesting results from another challenge (take inspiration -> SqueezeNet):

https://blog.getnexar.com/nexars-deep-learning-challenge-the-winners-reveal-their-secrets-e80c24147f2d#.xfqbmtn8s

More details here:

https://medium.freecodecamp.com/recognizing-traffic-lights-with-deep-learning-23dae23287cc#.peruiq96z

Enjoy!

P.S. Consider that the training data provided by the organizers is highly imbalanced:

```
STDOUT: 2017-02-26T19:36:47.534041029Z num. total images = 317617
STDOUT: 2017-02-26T19:36:47.534083004Z num. positive images = 1114, 0.350736893806 %
```

Here is a simple Python script to undersample the negative images down to the same number of positive images:

```
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 31 23:32:33 2017

Undersample the negative images so that the training set contains as many
negatives as positives, then update the exams metadata and images crosswalk
accordingly.

Usage (assuming you save this script as undersample.py):
    python undersample.py EXAMS_METADATA IMAGES_CROSSWALK IMAGE_LABELS OUTPUT_DIR SEED
"""

import os
import random
import sys

import pandas as pd

if __name__ == '__main__':
    examsMetadataFilename = sys.argv[1]
    imagesCrosswalkFilename = sys.argv[2]
    imageLabelsTempFilename = sys.argv[3]
    outputDir = sys.argv[4]
    random.seed(float(sys.argv[5]))

    metadata = pd.read_csv(examsMetadataFilename, sep="\t", na_values='.')
    images = pd.read_csv(imagesCrosswalkFilename, sep="\t", na_values='.')
    labels = pd.read_csv(imageLabelsTempFilename, sep=" ", na_values='.', header=None)
    labels.columns = ['filename', 'cancer']

    imagesPos = set(labels.filename[labels.cancer == 1])
    imagesNeg = set(labels.filename[labels.cancer == 0])
    print "num. positive images = {}".format(len(imagesPos))
    print "num. negative images = {}".format(len(imagesNeg))

    # Randomly keep as many negative images as there are positive ones
    imagesNeg = set(random.sample(imagesNeg, len(imagesPos)))
    print "after sampling: num. negative images = {}".format(len(imagesNeg))
    imagesNew = imagesPos.union(imagesNeg)

    # Update the images crosswalk and the exams metadata
    images = images.loc[images.filename.isin(imagesNew)]
    metadata = metadata.loc[metadata.subjectId.isin(set(images.subjectId))]

    # Write the new exams metadata and images crosswalk for training
    metadata.to_csv(os.path.join(outputDir, "exams_metadata_train_UNDER.tsv"),
                    sep="\t", na_rep='.', index=False, header=True)
    images.to_csv(os.path.join(outputDir, "images_crosswalk_train_UNDER.tsv"),
                  sep="\t", na_rep='.', index=False, header=True)
```

@davecg Yes, Dave, I knew that. It's a very good package. The only problem is that those techniques work in feature space, so you first need to extract the features from the images. If you use a fine-tuned network to extract the features, you face the same problem, because the network was fitted on the imbalanced dataset. The two solutions suggested by @subcosmos would help in this case: having balanced batches or a weighted loss helps you get a good network, and after that you can extract the features.
There is also the scikit-learn contrib module "imblearn", which is interesting and designed specifically for imbalanced datasets, with functions for both oversampling and undersampling. You could try the advanced techniques (e.g. SMOTE or Tomek links) on the extracted feature vectors, or the random sampling options.
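For instance, here is a minimal sketch of balancing extracted feature vectors with imblearn. The feature matrix X and labels y below are random placeholders for whatever your network extracts, and depending on the imblearn version the method is called fit_sample or fit_resample:

```
# Minimal sketch: balancing extracted feature vectors with imblearn.
# X (n_samples x n_features) and y (0/1 labels) are placeholders for
# features extracted, e.g., from the penultimate layer of a CNN.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X = np.random.rand(1000, 512)           # placeholder feature vectors
y = np.array([0] * 990 + [1] * 10)      # highly imbalanced labels

# Synthesize new minority samples in feature space
X_smote, y_smote = SMOTE(random_state=42).fit_sample(X, y)

# ...or simply undersample the majority class at random
X_under, y_under = RandomUnderSampler(random_state=42).fit_sample(X, y)

print "SMOTE: {} samples, undersampling: {} samples".format(len(y_smote), len(y_under))
```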
Yes, that's a good strategy. Unfortunately I was playing with Caffe, where there's no such option. I was thinking of using a weighted loss, but again nothing is available in Caffe without recompiling, so I simply used undersampling with aggregation of multiple models. Have you tried models other than VGG, such as GoogLeNet or AlexNet? I wanted to try the deep residual model but I didn't have much time (I joined the challenge too late). I would also give SqueezeNet a shot: it gives almost identical performance to AlexNet on ImageNet, but the final model is much smaller. This way one can easily train multiple networks and combine them (since the size limit is 1 GB).
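For frameworks where a weighted loss is available (or easy to add), the idea is just to scale each sample's loss by the inverse frequency of its class. A minimal numpy sketch of class-weighted binary cross-entropy follows; the function and weighting scheme are illustrative, not taken from any specific framework:

```
# Minimal sketch of class-weighted binary cross-entropy: each sample's
# loss is scaled by the inverse frequency of its class, so rare positives
# contribute as much in total as the abundant negatives.
import numpy as np

def weighted_bce(y_true, y_pred, eps=1e-7):
    # Weight each class by n_samples / (2 * n_class): rare classes get
    # weights > 1, frequent classes get weights < 1.
    n = float(len(y_true))
    n_pos = max(y_true.sum(), 1.0)
    n_neg = max(n - n_pos, 1.0)
    w = np.where(y_true == 1, n / (2.0 * n_pos), n / (2.0 * n_neg))
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return np.mean(w * -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Example: 99 negatives, 1 positive, all predicted at 0.1
y_true = np.array([0] * 99 + [1], dtype=float)
y_pred = np.full(100, 0.1)
print "weighted BCE = {:.4f}".format(weighted_bce(y_true, y_pred))
```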
I use stratified sampling while training: half of each image batch is cancer and half is no cancer. Despite this, my training and testing AUC is still computed on the underlying distribution of rare cancers. With stratified sampling the AUC converges much faster, even though the distribution the AUC is computed on is very different from the one being trained on! That shocked me. Sadly, I didn't think of this sampling approach until last week; up until now I was doing class weighting of the loss function.
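A minimal sketch of such a balanced batch generator (the 50/50 split and all names are illustrative; any training loop that consumes a Python generator can use it):

```
# Minimal sketch of a stratified (50/50) minibatch generator: each batch
# contains as many positive as negative samples, regardless of how rare
# the positives are in the full dataset.
import numpy as np

def balanced_batches(X, y, batch_size=32):
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    half = batch_size // 2
    while True:
        # Sample each class independently; positives are drawn with
        # replacement since there are far fewer of them.
        p = np.random.choice(pos_idx, half, replace=True)
        n = np.random.choice(neg_idx, half, replace=False)
        idx = np.concatenate([p, n])
        np.random.shuffle(idx)
        yield X[idx], y[idx]

# Example usage with dummy data (small images to keep memory low)
X = np.random.rand(1000, 64, 64)
y = np.array([0] * 990 + [1] * 10)
batches = balanced_batches(X, y, batch_size=8)
xb, yb = next(batches)
print "batch labels: {}".format(yb)
```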
@subcosmos Thanks for your update. I think the AUC is so large on the 500 images because of the small number of positive images. I am looking forward to seeing your solution. Foolishly, I trained my models only on the provided training data (/trainingData), and I discovered just two days ago that external databases were allowed. So all I can do is share my experience with other users.
I definitely see the 500 as too small; I was just surprised at how little data augmentation helped on it. I augmented the 500 images to about 100k with Keras using rotations, skews, translations, etc., but the net still recognizes the training images very well and gets to 0.95 AUC in just a few thousand epochs. I think better augmentation would use nonlinear warping of the images instead of just affine transforms. And yeah, external data would be best, but I think it's too late for me. I'm going to focus on other competitions, since I think this one has run its course and it's not reasonable to accomplish too much given the constraints. I did pull off one accomplishment here: using a simple VGG net, I was able to produce a network that isolates the nipple and milk ducts region of the breast with very high accuracy. Using this, I've been able to extract native-resolution tumor patches from the source DICOMs instead of relying on downsized images. I think I spent way too much time on this strategy, though: it works, but there isn't sufficient computational time to hone the algorithm.
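For reference, a minimal sketch of this kind of affine augmentation with Keras' ImageDataGenerator; the parameter values are illustrative only, and the image sizes are kept small so the example runs quickly:

```
# Minimal sketch of affine data augmentation with Keras; parameter values
# are illustrative, not the ones used in the post above.
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # random rotations up to +/- 20 degrees
    width_shift_range=0.1,   # horizontal translations
    height_shift_range=0.1,  # vertical translations
    shear_range=0.2,         # shear ("skew") transforms
    zoom_range=0.1,
    horizontal_flip=True,
    fill_mode='nearest')     # how to fill pixels created by the transform

# X: (n_samples, height, width, channels), y: labels (dummy data here)
X = np.random.rand(500, 64, 64, 1)
y = np.random.randint(0, 2, 500)

# Stream augmented batches indefinitely; a nonlinear (e.g. elastic) warp
# could be plugged in via the preprocessing_function argument available
# in recent Keras versions.
for batch_x, batch_y in datagen.flow(X, y, batch_size=32):
    # ...feed batch_x / batch_y to model.fit or train_on_batch here
    break
```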
@subcosmos Remember that you can also use external mammography datasets (public and private); I've shared a couple of them that are usually used as benchmarks in papers. Also, consider that the training data provided by the organizers contains less than 1% positive images. You should train your model on external databases (strongly advised) and on the provided training data (/trainingData on the DREAM servers). The 500 images are just a joke and should not be used for training or evaluating the models: they are too few to obtain a sufficiently general model or to evaluate the performance of your trained model. I would suggest splitting the available data (public/private/DREAM) into random train/validation subsets (90%/10%) and then using those to build your best model.

Here is a script to preprocess the images in the /trainingData folder of the DREAM servers. It generates 512x512-pixel PNG files from the provided DCM files. You can easily adapt it to external datasets.

```
#!/bin/bash
# Author: Thomas Schaffter (thomas.schaff...@gmail.com)
# Last update: 2016-11-02
#
# Modified by me

IMAGES_DIRECTORY="/trainingData"
EXAMS_METADATA_FILENAME="/metadata/exams_metadata.tsv"
IMAGES_CROSSWALK_FILENAME="/metadata/images_crosswalk.tsv"

PREPROCESS_DIRECTORY="/preprocessedData"
PREPROCESS_IMAGES_DIRECTORY="$PREPROCESS_DIRECTORY/images"
PREPROCESS_METADATA_DIRECTORY="$PREPROCESS_DIRECTORY/metadata"
LMDB_DIRECTORY="$PREPROCESS_DIRECTORY/lmdb"

mkdir -p $LMDB_DIRECTORY
mkdir -p $PREPROCESS_IMAGES_DIRECTORY
mkdir -p $PREPROCESS_METADATA_DIRECTORY

# Count the positive images
python count_pos_images.py $EXAMS_METADATA_FILENAME \
    $IMAGES_CROSSWALK_FILENAME

echo "Resizing and converting $(find $IMAGES_DIRECTORY -name "*.dcm" | wc -l) DICOM images to PNG format"
find $IMAGES_DIRECTORY/ -name "*.dcm" | parallel "convert {} -resize 512x512! $PREPROCESS_IMAGES_DIRECTORY/{/.}.png" # faster than mogrify
echo "PNG images have been successfully saved to $PREPROCESS_IMAGES_DIRECTORY/."

echo "Select ROI and resize 512x512"
python select_ROI_and_resize.py 512 $PREPROCESS_IMAGES_DIRECTORY

echo "Flip the right images"
python flip_right_imgs.py $IMAGES_CROSSWALK_FILENAME \
    $PREPROCESS_IMAGES_DIRECTORY
```

count_pos_images.py

```
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Sat Jan 28 19:42:32 2017

@author: me
"""

import pandas as pd
import sys

if __name__ == '__main__':
    examsMetadataFilename = sys.argv[1]
    imagesCrosswalkFilename = sys.argv[2]

    # Pandas converts columns that have missing values to double
    metadata = pd.read_csv(examsMetadataFilename, sep="\t", na_values='.')
    # Read the image metadata
    images = pd.read_csv(imagesCrosswalkFilename, sep="\t", na_values='.')

    # Count the number of positive images (left breast)
    numImagesPosL = 0
    metadataL = metadata.loc[metadata.cancerL == 1]
    for i in range(0, metadataL.shape[0]):
        # Select the images corresponding to this exam and laterality
        imageMetadata = images.loc[(images.subjectId == metadataL.subjectId.iloc[i]) &
                                   (images.examIndex == metadataL.examIndex.iloc[i]) &
                                   (images.laterality == "L")]
        numImagesPosL = numImagesPosL + imageMetadata.shape[0]

    # Count the number of positive images (right breast)
    numImagesPosR = 0
    metadataR = metadata.loc[metadata.cancerR == 1]
    for i in range(0, metadataR.shape[0]):
        imageMetadata = images.loc[(images.subjectId == metadataR.subjectId.iloc[i]) &
                                   (images.examIndex == metadataR.examIndex.iloc[i]) &
                                   (images.laterality == "R")]
        numImagesPosR = numImagesPosR + imageMetadata.shape[0]

    numImagesPos = numImagesPosL + numImagesPosR

    print "num. total images = {}".format(images.shape[0])
    print "num. positive images = {}, {} %".format(numImagesPos, float(numImagesPos) / images.shape[0] * 100)
```

select_ROI_and_resize.py

```
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Thu Jan 26 22:02:29 2017

@author: me
"""

import numpy as np
import cv2
import sys
import glob

if __name__ == '__main__':
    imageSize = int(sys.argv[1])
    imageDir = sys.argv[2]

    listPNG = glob.glob(imageDir + "/*.png")

    # Detect the ROI containing the mammogram and remove the other regions
    # (labels, markers, etc.): the breast tissue is assumed to be the
    # largest connected region in the image.
    for i in range(0, len(listPNG)):
        print "{}".format(listPNG[i])
        im = cv2.imread(listPNG[i])

        # Convert to grayscale
        gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
        newGray = np.copy(gray)

        # Find the contours of all non-zero regions (OpenCV 2.x signature)
        contours, _ = cv2.findContours(gray, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

        # Compute the bounding-box area of each contour
        area = np.zeros(len(contours))
        k = 0
        for cnt in contours:
            x, y, w, h = cv2.boundingRect(cnt)
            area[k] = w * h
            k += 1

        # Blank out all the regions except the largest one
        textMaskColor = (0, 0, 0)
        for contour in contours:
            x, y, w, h = cv2.boundingRect(contour)
            if w * h < np.max(area):
                cv2.rectangle(newGray, (x, y), (x + w, y + h), textMaskColor, -1)

        # Crop the region corresponding to the tissue
        areaMax = np.argmax(area)
        x, y, w, h = cv2.boundingRect(contours[areaMax])
        newGray = cv2.cvtColor(newGray, cv2.COLOR_GRAY2RGB)
        imROI = newGray[y:y + h, x:x + w, :]

        # Resize and overwrite the original image
        imROI = cv2.resize(imROI, (imageSize, imageSize), interpolation=cv2.INTER_NEAREST)
        cv2.imwrite(listPNG[i], imROI)
```

flip_right_imgs.py

```
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Thu Jan 26 20:41:36 2017

@author: me
"""

import pandas as pd
import cv2
import sys

if __name__ == '__main__':
    imagesTrainCrosswalkFilename = sys.argv[1]
    imageDir = sys.argv[2]

    # Read the image metadata
    images = pd.read_csv(imagesTrainCrosswalkFilename, sep="\t", na_values='.')

    # Select the right-breast mammograms
    imagesR = images.loc[images.laterality == "R", ].copy()

    # Update the extension (the DICOMs have been converted to PNG)
    imagesR.loc[:, 'filename'] = imagesR.filename.str.replace('.dcm', '.png')

    # Flip the right mammograms horizontally to match the left ones
    for index, row in imagesR.iterrows():
        img = cv2.imread(imageDir + "/" + row.filename)
        imgFlipped = cv2.flip(img, 1)
        cv2.imwrite(imageDir + "/" + row.filename, imgFlipped)
```
Thank you kindly :) I used data augmentation on the 500 pilot images to hone my algorithms, but with little success: even dramatically transforming the images resulted in overtraining. Maybe the higher sample counts here will make data augmentation viable.
