Just a thought:
I think that the multiple submissions (maximum 9 = 3 * 3) represent a multiple testing problem that calls for a correction of the final p-values.
Since the leaderboard dataset is always the same, people are effectively performing multiple tests on the same data (with different models or parameters) in order to achieve the best AUC.
I would therefore suggest a multiple-testing correction that depends on the number of previous submissions, in order to reduce the bias introduced by repeatedly checking the leaderboard AUC.
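To make that concrete, here is a minimal sketch of what I have in mind, assuming a simple Bonferroni-style correction (the threshold, the function names and the example numbers are my own illustration, not anything defined by the competition):

```python
# Sketch: shrink the significance threshold for an AUC-improvement p-value
# by the number of leaderboard checks already made (Bonferroni correction).

def corrected_alpha(n_submissions, alpha=0.05):
    """Bonferroni-corrected significance level after n_submissions leaderboard checks."""
    return alpha / max(n_submissions, 1)

def is_significant(p_value, n_submissions, alpha=0.05):
    """Declare an AUC improvement significant only at the corrected level."""
    return p_value < corrected_alpha(n_submissions, alpha)

# With the maximum of 9 submissions, a raw p-value would need to be
# below 0.05 / 9 ~= 0.0056 to count as a real improvement.
print(is_significant(p_value=0.01, n_submissions=9))  # False
```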
Another option would have been to keep participants unaware of the performance of their submissions, but unfortunately we can see the scores and adjust (perhaps overfit) our models accordingly.
From an interesting post on stackexchange:
http://stats.stackexchange.com/questions/137481/how-bad-is-hyperparameter-tuning-outside-cross-validation
"The effects of this bias can be very great. A good demonstration of this is given by the open machine learning competitions that feature in some machine learning conferences. These generally have a training set, a validation set and a test set. The competitors don't get to see the labels for either the validation set or the test set (obviously). The validation set is used to determine the ranking of competitors on a leaderboard that everyone can see while the competition is in progress. It is very common for those at the head of the leaderboard at the end of the competition to be very low in the final ranking based on the test data. This is because they have tuned the hyper-parameters for their learning systems to maximise their performance on the leaderboard and in doing so have over-fitted the validation data by tuning their model. More experienced users pay little or no attention to the leaderboard and adopt more rigorous unbiased performance estimates to guide their methodology.
The example in my paper (mentioned by Jacques) shows that the effects of this kind of bias can be of the same sort of size as the difference between learning algorithms, so the short answer is don't use biased performance evaluation protocols if you are genuinely interested in finding out what works and what doesn't. The basic rule is "treat model selection (e.g. hyper-parameter tuning) as an integral part of the model fitting procedure, and include that in each fold of the cross-validation used for performance evaluation".
The fact that regularisation is less prone to over-fitting than feature selection is precisely the reason that LASSO etc. are good ways of performing feature selection. However, the size of the bias depends on the number of features, size of dataset and the nature of the learning task (i.e. there is an element that depends on the particular dataset and will vary from application to application). The data-dependent nature of this means that you are better off estimating the size of the bias by using an unbiased protocol and comparing the difference (reporting that the method is robust to over-fitting in model selection in this particular case may be of interest in itself).
G. C. Cawley and N. L. C. Talbot (2010), "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, 11, p. 2079, section 5.2."
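To illustrate the quoted rule ("include model selection in each fold of the cross-validation"), here is a minimal nested cross-validation sketch with scikit-learn; the toy dataset, the SVC model and the parameter grid are just placeholders I picked for illustration, not anything from the competition:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the real training set.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner loop: hyper-parameter tuning (model selection).
inner = GridSearchCV(SVC(probability=True),
                     param_grid={"C": [0.1, 1, 10]},
                     scoring="roc_auc", cv=3)

# Outer loop: performance estimate of the whole procedure, tuning included,
# so the reported AUC is not inflated by the hyper-parameter search.
outer_auc = cross_val_score(inner, X, y, scoring="roc_auc", cv=5)
print(outer_auc.mean())
```

The point is that the AUC reported by the outer loop already accounts for the tuning, which is exactly what the leaderboard score does not do.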
Comments and thoughts are welcome.