Problem with the data!

We have found that in 5 of the dataset pairs in the Practice Censoring folder there are mild to major problems with the outcomes: the observed outcomes don't match the appropriate counterfactual outcomes (i.e. the potential outcome corresponding to the treatment received). In fact, in the problem observations I've checked by hand, the observed outcome doesn't match either potential outcome. The R code below demonstrates the lack of correspondence as the proportion of matching outcomes per dataset (pair); any value below 1 indicates a problem. Note that until this problem is resolved these 5 pairs of treatment-outcome/counterfactual data cannot be used as test data for the competition. Please advise.

```
inputFiles <- list.files("censoring", ".\\.csv$", full.names = TRUE)
## remove counterfactual files
inputFiles <- inputFiles[!grepl("._cf\\.csv$", inputFiles)]

match.pct <- numeric(length(inputFiles))
names(match.pct) <- sapply(strsplit(inputFiles, "/"), tail, 1L)

for (i in seq_along(inputFiles)) {
  file <- inputFiles[i]
  yz <- read.csv(file)
  ## matching counterfactual file: <name>.csv -> <name>_cf.csv
  cf <- read.csv(sub("\\.csv$", "_cf.csv", file))
  ## proportion of rows where observed y equals the potential outcome
  ## for the treatment actually received (rows with missing y are dropped)
  match.pct[i] <- with(yz, mean(y == z * cf$y1 + (1 - z) * cf$y0, na.rm = TRUE))
}

as.matrix(match.pct)
                                          [,1]
110f6dc8583c456ea0dd242d5d598497.csv 0.5556928
3ebc51612e034ff99e8632a228dae430.csv 1.0000000
5a147c7e542a4ea5b22da127b654666b.csv 1.0000000
5ad181455e954bcba44743e1f2d7824e.csv 0.7356785
74420a1794304013bb7a5a8f61994d71.csv 0.7468097
8ff38d337ec842dab1b8c01076e24816.csv 0.2459916
9333a461d3944d089ef60cdf3b88fd40.csv 1.0000000
ac6e494cbc254dc599be26a2a17f229c.csv 1.0000000
ae51149d38ce42609e00bf5701e4fe88.csv 1.0000000
d1546da12d8e4daf8fe6771e2187954d.csv 0.6910000
d4ae3280e4e24ca395533e429726fafc.csv 1.0000000
e36aca1030264e638452ea4053cbb42c.csv 1.0000000
```

Created by Jennifer Hill jlhillny
Just an update that I have heard that the deadline will be extended and that this will be announced soon.
I vote you guys to win it based on noticing this error; that was the challenge, actually, very clever. Just kidding. I am not 100% sure even of the param of interest, as it says population ATE; or do they mean the ATE for the sample, like last year? I asked this two days ago to no avail, but maybe someone has cleared that up?
Any update on this request?
Yes, there was indeed some confusion, and in the rush I accidentally uploaded some of the intermediate files. Many thanks for bringing that up. Regarding an extension of the deadline, it is not up to me to decide (I'm only here for the technical stuff), but I forwarded the request to the challenge director and I believe he will post an update soon.
Unfortunately there is still a problem in the new counterfactual data files. The values for y1 are missing whenever y is missing in the observed data file.
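For anyone who wants to check their own copies, here is a minimal sketch of how the missingness pattern could be compared. It assumes the column names `y`, `z`, `y0`, and `y1` used in the code in the first post; the two small data frames are made-up stand-ins for one observed/counterfactual file pair, not values from the challenge data.

```r
## Toy data standing in for one observed file (yz) and its _cf file (cf).
## Values are invented for illustration only.
yz <- data.frame(z = c(1, 0, 1, 0, 1),
                 y = c(2.1, NA, NA, 0.7, 3.3))
cf <- data.frame(y0 = c(1.0, 0.4, 1.2, 0.7, 2.0),
                 y1 = c(2.1, NA, NA, 1.9, 3.3))

## Cross-tabulate missingness in the observed y against missingness in y1.
miss.tab <- table(y.missing  = is.na(yz$y),
                  y1.missing = is.na(cf$y1))
print(miss.tab)

## TRUE when y1 is missing on exactly the rows where y is missing.
same.pattern <- identical(is.na(yz$y), is.na(cf$y1))
same.pattern
```

With real files, `yz` and `cf` would come from `read.csv()` as in the first post. A counterfactual file should have no missing `y1` at all, so any count in the `y1.missing = TRUE` column points to the problem described here.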
Thanks so much for fixing this so quickly. Unfortunately this hiccup has led to several days of wasted time (for us and I'm sure for others) going down rabbit holes to figure out why models were yielding such bad results. Would it be possible to extend the deadline by a few days to let everyone make up the lost time?
Thank you Susan and Jennifer for notifying us of this problem. We apologize, you are indeed correct. It turns out we had an error in reporting the counter-factual outcome when treatment==1 in some of the simulations. This was fixed recently and we re-uploaded the files with the correct values. We would like to point out that the values in the observation files were not affected in any way, and so if you already have results they should not be affected. Of course, this will change any evaluation estimation you may have run, so be sure to check those again. We will issue a general email explaining this shortly, and we hope this will not have a significant impact on the participants.
I noticed the same thing! I got similarly poor results from TMLE, outcome regression only, IPTW, and finally Matching, before finally turning my attention to the supplied datasets. Here is one problematic example (74420a1794304013bb7a5a8f61994d71, 26% missing outcomes). It looks as if the observed data were generated as Y = Delta * [A * (Y1 - 4.998558) + (1 - A) * Y0], where Delta = 1 indicates the outcome is observed. This violates the consistency assumption. Even if we replace Delta by some function f(A,W) in the above equation, e.g., I(f(A,W) > 0.26), it appears that Delta is always 1 when this function is 1, so we can only ever observe (Y1 - x) when A = 1, never Y1. If I'm right, then it's impossible to recover the correct ATT or ATE.
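A rough sketch of how one could look for such a shift, assuming the column names `y`, `z`, `y0`, and `y1` from the first post (so A is `z` and Delta = 1 corresponds to a non-missing `y`). The helper `treated_offset` and the toy data, including the shift of 5, are hypothetical and only illustrate the check.

```r
## Hypothetical helper: differences between the observed y and the y1 column
## for treated units whose outcome is observed.
treated_offset <- function(yz, cf) {
  keep <- yz$z == 1 & !is.na(yz$y) & !is.na(cf$y1)
  yz$y[keep] - cf$y1[keep]
}

## Toy data (invented): treated observed outcomes are y1 shifted by -5.
cf <- data.frame(y0 = c(1, 2, 3, 4),
                 y1 = c(6, 7, 8, 9))
yz <- data.frame(z = c(1, 0, 1, 1),
                 y = c(1, 2, NA, 4))   # 6 - 5, y0, missing, 9 - 5

d <- treated_offset(yz, cf)
summary(d)

## Consistency holds for the treated units only if every difference is zero.
all(abs(d) < 1e-8)
```

If consistency held, every difference would be zero. A single constant nonzero value across treated units would be in line with the shift described above, while scattered values would suggest something else is going on.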
