Hi @trberg, I've started looking at the data and reviewed the Task 1 Gold Standard workflow, and have some questions. Please bear with the long detail, as I'm trying to describe some nuances that are significant obstacles for me.

1. The instructions say: "outcome - patients are considered true positives if they have a hospitalized_visit_start_date within 5 weeks (35 days) of the outpatient_visit_start_date. Patients whose only hospitalized_visit_start_date occurs on the same day as the outpatient_visit_start_date, will not be evaluated. **Patients who have their hospitalized_visit_start_date on the same day as the outpatient_visit_start_date, but who have a separate hospitalized_visit_start_date within 5 weeks of the outpatient_visit_start_date are considered true positives.**"

Can you explain your rationale for this last part that I bolded? (I sketch my reading of this rule in the P.S. at the end of this post.) If someone was admitted to the hospital on the same day as the outpatient visit, they were probably sick enough to concern the outpatient provider and were sent to emergency/inpatient services. If they were discharged from emergency/inpatient and then re-admitted later, I would think the most likely scenarios are: a) they recovered enough from COVID-19 in the inpatient setting to be discharged, but later worsened and were readmitted; b) they recovered in-hospital and were discharged, but were later admitted for something else, either non-COVID or sequelae of COVID (the gold-standard workflow does not appear to check whether the hospitalization is COVID-related); or c) the patient was transferred to a different hospital (maybe this is fixed with macrovisits?). None of these three scenarios seems like a helpful outcome for deciding whether patients presenting at an outpatient visit go on to be hospitalized for COVID-19. For this reason, it would make more sense to me to drop any patient hospitalized on the same day as the outpatient visit, and not look for a second visit.

Further, I'm concerned we will be building models that predict hospitalization not just for COVID-19, but often based on whether patients have other conditions that predispose them to hospitalization regardless of COVID-19 status. I suspect there may be many instances of children needing procedures that require hospitalization: they get an outpatient test prior to admission per policy, some test positive, they wait until they test negative, and then have their procedure done within the 35-day window. I'm guessing that the patients hospitalized on the day of the outpatient visit are probably more likely to be COVID-19-related admissions than those we use for our actual outcome. Did you check the distribution of COVID-19 diagnoses and procedures within the hospitalizations to see whether these hospitalizations are for COVID-19?

2. It seems from your response in this [thread](https://www.synapse.org/#!Synapse:syn25875374/discussion/threadId=8421) that our training models, if they are to be helpful in predicting the test data, should use all patient data up to and including the date of the outpatient visit following the first positive COVID-19 test, and nothing afterwards. Presumably this is so we can take advantage of whatever information is available from that outpatient visit plus past history. If that is the case, will we also have access to the fact of a hospitalized visit that starts on the same day as the outpatient visit? After all, it overlaps the observation period.
If so, given the evaluation criteria discussed above in 1, we should assign a 100% confidence of the outcome as you will not be evaluating patients who do not have subsequent hospitalizations: every patient with an immediate hospitalization who makes it into your evaluation set has to have a positive outcome! Correct me if I'm wrong, but this seems like a not-good immortal time bias to have baked into the outcome. Even if you withhold inpatient data for that day from our models, it seems likely that there could be enough evidence in the outpatient data to infer whether there was a same-day inpatient visit -- like an O~2~ saturation of 85.

3. The instructions say, "For your final submission, we are expecting that you submit a full workbook that takes in training data in standard OMOP format as well as testing data in the same OMOP format. Your code should ingest the OMOP tables, train a model, ingest the testing data, and then finally output a prediction file that has one prediction (a continuous score between 0 and 1) for each person_id in the testing data person table." Given that our training data allows us to peek into the future, can you describe how the test data will be given so that cannot happen? Can you specifically describe how our workflow should process the test data? Do we need our own version of the gold-standard workflow so that we do not look into the future? Will there be two different person tables for train and test? What about all the other tables? I'm trying to architect the workflow, and it is really important to know this at the beginning stages.

4. Will you be scheduling a kick-off call for other questions and answers?

Many thanks, @christophe.lambert
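P.S. To make sure I'm reading the rule in point 1 correctly, here is a minimal pandas sketch of my interpretation. The table layout and column names below (one row per person_id / outpatient_visit_start_date / hospitalized_visit_start_date pair) are my own assumptions, not the actual gold-standard code:

```python
# Sketch only -- my reading of the stated rule, not the official gold-standard workflow.
# Assumes one row per (person_id, outpatient_visit_start_date, hospitalized_visit_start_date)
# pair, with dates already parsed; patients with no hospitalization have NaT.
import pandas as pd

def label_outcomes(visits: pd.DataFrame) -> pd.DataFrame:
    v = visits.copy()
    v["days_to_hosp"] = (
        v["hospitalized_visit_start_date"] - v["outpatient_visit_start_date"]
    ).dt.days

    def label(group: pd.DataFrame) -> pd.Series:
        later = (group["days_to_hosp"] > 0) & (group["days_to_hosp"] <= 35)
        same_day = group["days_to_hosp"] == 0
        if later.any():
            # A separate hospitalization within 35 days: true positive,
            # even if there was also a same-day hospitalization.
            return pd.Series({"outcome": 1, "evaluated": True})
        if same_day.any():
            # Only hospitalization starts on the day of the outpatient visit:
            # not evaluated at all, per the instructions.
            return pd.Series({"outcome": 0, "evaluated": False})
        # No hospitalization within the window: evaluated as a negative.
        return pd.Series({"outcome": 0, "evaluated": True})

    return v.groupby("person_id").apply(label).reset_index()
```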

Created by Christophe Lambert (@christophe.lambert)
Hi @trberg, please see my comment in the other [thread](https://www.synapse.org/#!Synapse:syn25875374/discussion/threadId=8502&replyId=26290) where I advocate that we be given all new patient data for training, up to and including the outpatient visit events, not just the patients meeting the gold-standard criteria. Models that use the additional patient data outside the gold standard NEED that information to perform properly. I would be happy to discuss with you offline -- this is ESSENTIAL for my approach, given the very small sample size of gold-standard cases. Thank you, Christophe
Hi @JMele, I'll be sending out an email later today clarifying the submission structure, but the test data will only contain patients that meet the gold-standard criteria. So you will not have to subset or filter out any patients from the test data; you can assume all patients need a prediction. Thank you, Tim
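So, roughly speaking, the final output just needs one row per person_id in the test person table. Here is an illustrative sketch of that shape; the file names are only placeholders, and a real workbook would of course replace the constant score with its model's prediction:

```python
# Illustrative sketch of the required output shape; file names are assumptions.
import pandas as pd

# Read the test-split OMOP person table provided with the testing data.
test_person = pd.read_csv("person.csv")

# Placeholder score -- a real workbook would call something like
# model.predict_proba(features)[:, 1] here instead of a constant.
predictions = pd.DataFrame({
    "person_id": test_person["person_id"],
    "score": 0.5,  # one continuous value in [0, 1] per person_id
})

predictions.to_csv("predictions.csv", index=False)
```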
Hi @trberg, Thank you for the information you've provided above. I have a few clarifying questions regarding the organization of our final code workbooks. The instructions indicate that we are only technically required to input the Person OMOP table into our workbooks, but it looks as though we need at least a few OMOP tables to generate the COVID-19-positive pediatric patient set from the Task 1 Gold Standard Script. Are we required to create this "half gold-standard" set, that is, subset the data to include only pediatric patients with an outpatient diagnosis of COVID-19? Or can we assume the person table we are given already meets these requirements? In short, are we expected to generate a prediction for every patient in the testing person table, or is further extraction necessary before generating our testing set, or will we be given a similar gold-standard file (minus the outcome) for the testing patient IDs for our final submissions? Any clarification here is greatly appreciated. Thank you! Jessica Mele
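For reference, this is roughly what I mean by the "half gold-standard" step. The concept ids, file names, and age cutoff below are only my guesses for illustration, not the actual Task 1 Gold Standard Script:

```python
# Rough guess at the "half gold-standard" subset; the concept ids, file names,
# and age cutoff are assumptions, not the official Task 1 Gold Standard Script.
import pandas as pd

person = pd.read_csv("person.csv")
condition = pd.read_csv("condition_occurrence.csv")
visit = pd.read_csv("visit_occurrence.csv")

COVID_CONDITION_CONCEPTS = {37311061}   # illustrative COVID-19 concept id
OUTPATIENT_VISIT_CONCEPT = 9202         # OMOP "Outpatient Visit" (assumed)
PEDIATRIC_MAX_AGE = 18                  # cutoff assumed; the challenge defines its own

covid_ids = set(
    condition.loc[condition["condition_concept_id"].isin(COVID_CONDITION_CONCEPTS),
                  "person_id"]
)
outpatient_ids = set(
    visit.loc[visit["visit_concept_id"] == OUTPATIENT_VISIT_CONCEPT, "person_id"]
)
pediatric_ids = set(
    # Crude age as of 2021, for illustration only.
    person.loc[2021 - person["year_of_birth"] < PEDIATRIC_MAX_AGE, "person_id"]
)

cohort_ids = covid_ids & outpatient_ids & pediatric_ids
```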
Sorry, wrong link. I've edited my previous post with the correct link. Thank you
Hi @trberg, the registration link you provided above signs you up for the recurring **Monday** N3C Community Forum Zoom call. Is that correct? Thanks, @christophe.lambert
Thanks, @trberg, for your thorough response! I look forward to hearing what the organizing committee thinks. An alternative to further reducing our sample size of outcomes would be to have us build models using only information from before the first outpatient visit after a positive COVID-19 test (not including the visit itself), and use any hospitalization within 35 days as the outcome. If you did this, models would not have information from the outpatient visit, but would instead consider the history of chronic conditions and past acute ones, assuming we have some decent window of lookback time. Such a model could be medically relevant as a tool to prioritize vaccination efforts. The 12-17 age group has a low vaccination rate as it is, and understanding the co-occurring conditions of the children we should most encourage to vaccinate could impact public policy favorably. I would really love to have an additional 4517 outcomes to work with. I'm concerned about the lack of power already, and about having it further eroded by subtracting out the double-hospitalized patients. Kind regards, @christophe.lambert
Hi @christophe.lambert,

(1) The rationale was that even if a patient is immediately hospitalized, there may be some information from the outpatient visit that would indicate whether the patient is at risk of hospitalization further into the future. However, after reading through your different points, I think you may be right and that this wouldn't be the optimal way to handle these situations. I'll bring your points up to the challenge organizing committee this Wednesday and see what they say.

> Further, I'm concerned we will be building models that predict hospitalization not just for COVID-19, but often based on whether patients have other conditions that predispose them to hospitalization regardless of COVID-19 status.

We know that we are running the risk of building models that predict other types of hospitalizations. Determining the cause of hospitalization can be difficult with EHR data, and we decided against going in that direction with this challenge. We did some sensitivity analyses prior to launching and found that there were still easily identifiable COVID-related hospitalizations occurring within 5 weeks after the initial outpatient visit (we looked for COVID- and MIS-C-related codes to identify whether any COVID-related visits were still occurring 5 weeks out).

(2) In that thread, I mistakenly said that the observation window should _start_ at the outpatient_visit_start_date. I meant to say that the observation window should _end_ on the _outpatient_visit_start_date_. You're correct: the idea is that you would use all the clinical history available at the time of the visit to make a risk prediction.

> If that is the case, will we also have access to the fact of a hospitalized visit that starts on the same day as the outpatient visit? After all, it overlaps the observation period. If so, given the evaluation criteria discussed above in 1, we should assign a 100% confidence of the outcome as you will not be evaluating patients who do not have subsequent hospitalizations: every patient with an immediate hospitalization who makes it into your evaluation set has to have a positive outcome! Correct me if I'm wrong, but this seems like a not-good immortal time bias to have baked into the outcome. Even if you withhold inpatient data for that day from our models, it seems likely that there could be enough evidence in the outpatient data to infer whether there was a same-day inpatient visit -- like an O2 saturation of 85.

This is a good point. We were planning on removing the same-day hospitalized visits, but there could potentially be other indicators of inpatient status. If we do end up excluding these patients, as brought up in point 1, I think this problem will be solved.

(3) The training data will not be cut off, so you will be responsible for building your training data from the input OMOP tables, but the testing data will be cut off at the _outpatient_visit_start_date_. You will need your own version of the gold-standard file so that you do not look into the future in your training data (see the rough sketch at the end of this post). There will be two different person tables for training and testing, and there will also be two different versions of the other tables.

(4) Yes! A kickoff call has been scheduled for this Wednesday at 11:30 am PT. [Register Here](https://uw-phi.zoom.us/meeting/register/tJUqcu-srTktHdEuyUHkiy4uhEvEBwXQmfPz)

Let me know if your questions aren't completely clarified, and thank you very much for all of the feedback.

Thank you, @trberg
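Addendum to point (3): a rough sketch of the kind of per-patient cutoff I mean, assuming you have built a gold-standard file with one outpatient_visit_start_date per person. The table and column names here are just for illustration, not the actual challenge code:

```python
# Rough sketch of a per-patient cutoff at the index outpatient visit; the
# gold_standard table and the column names below are assumptions.
import pandas as pd

def truncate_at_index_date(table: pd.DataFrame,
                           gold_standard: pd.DataFrame,
                           date_col: str) -> pd.DataFrame:
    """Keep only rows dated on or before each person's outpatient_visit_start_date."""
    merged = table.merge(
        gold_standard[["person_id", "outpatient_visit_start_date"]],
        on="person_id", how="inner",
    )
    keep = merged[date_col] <= merged["outpatient_visit_start_date"]
    return merged.loc[keep].drop(columns="outpatient_visit_start_date")

# Example usage on a couple of OMOP domain tables (names assumed):
# conditions = truncate_at_index_date(conditions, gold_standard, "condition_start_date")
# measurements = truncate_at_index_date(measurements, gold_standard, "measurement_date")
```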

Clarity on Task 1 outcome and how to structure our models for testing