Sorry if this is a silly question, but I have not yet had the opportunity to run my scripts against the real data. In the synthetic dataset, the drug_exposure file has a "verbatim_end_date" column that discloses dates from years up to 2018, well after the outcome prediction date, which goes up to 2010. I believe that a dead person would not have a medication end date long after their death, so a record in 2018 means that the patient did not die in 2010... Have the organizers ensured that there is no leakage through this column in the real data?

Created by Ariel Yehuda Israel arielis
Hi @arielis, You shouldn't read too much into the content of the synthetic dataset. Its main purpose is to mimic the form and structure of the real UW data. Variable correlations, date distributions, condition distributions, etc., are not representative of the real data. Only the number of visits per patient is close to the real data. Our main goal with this synthetic data was to catch obvious errors early and to help participants get a rough estimate of how long their models would run on the real data. I hope this clears things up! Tim
My question is in fact not only about the "verbatim_end_date" column. When running on the synthetic data in the fast lane, where the infer set contains 28,484 patients with a last visit date between 2008-01-06 and 2010-01-06, there are as many as 18,951 patients (66%) whose last "verbatim_end_date" is at least 180 days after the last visit, and also 17,645 patients (61%) whose last "observation_period_end_date" is at least 180 days after the last visit.

Here are the counts (from the infer dataset):

outcome *180 days after visit_start_datetime*:

drug_exposure_start_datetime_days_after_visit_start      2627
verbatim_end_date_days_after_visit_start                 18951
drug_exposure_end_datetime_days_after_visit_start        0
death_datetime_days_after_visit_start                    189
procedure_datetime_days_after_visit_start                4155
observation_period_end_date_days_after_visit_start       17645
observation_period_start_date_days_after_visit_start     55
condition_start_datetime_days_after_visit_start          3785
condition_end_datetime_days_after_visit_start            245
measurement_datetime_days_after_visit_start              5225
observation_datetime_days_after_visit_start              3526

So I suppose these patients can be marked as certainly alive.
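For anyone who wants to run the same kind of check, counts like the ones above could be computed roughly as follows. This is only a minimal sketch, not the exact script used: it assumes the tables are loaded as pandas DataFrames that share a person_id column, the helper name count_late_patients is hypothetical, and the toy rows are made up purely for illustration.

```python
import pandas as pd


def count_late_patients(visits, records, date_col, min_days=180):
    """Count patients whose latest `date_col` value falls at least
    `min_days` after their latest visit_start_datetime.

    Hypothetical helper; assumes both frames have a person_id column.
    """
    last_visit = (
        pd.to_datetime(visits["visit_start_datetime"])
        .groupby(visits["person_id"])
        .max()
    )
    last_record = (
        pd.to_datetime(records[date_col])
        .groupby(records["person_id"])
        .max()
    )
    # Subtraction aligns on person_id; patients missing from either
    # table produce NaT and are dropped.
    gap = (last_record - last_visit).dropna()
    return int((gap >= pd.Timedelta(days=min_days)).sum())


# Toy data (invented for illustration only).
visits = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "visit_start_datetime": ["2008-05-01", "2009-01-01", "2009-06-01", "2009-03-01"],
})
drug_exposure = pd.DataFrame({
    "person_id": [1, 2, 3],
    "verbatim_end_date": ["2018-05-01", "2009-06-15", "2010-01-01"],
})

late = count_late_patients(visits, drug_exposure, "verbatim_end_date")
print(late)
```

Running the same function over each date column of each table would give one count per column, which is how a table like the one above could be assembled.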
