I was wondering about these two questions:

- I understand the synpuf data is synthetic. However, does it follow some data generation process such that we can expect generalization from its training data to its evaluation data? Or is it all randomly generated without any signal?
- Can we get synthetic data that is approximately as large (in number of records and number of codes) as the private dataset on which our models will be trained? This is just so that we can estimate whether our training process is fast enough to complete within an hour.

Created by Anand Avati avati
Hi @ssmk, We will be releasing the new synthetic dataset in the next couple of days. We are currently reviewing the synthetic generation process and the data itself prior to release. Thank you for your patience, Tim
Hi @trberg, may I check with you when we will have this new synthetic dataset? Thanks
Hi @avati,

> I understand synpuf data is synthetic. However, does it follow some data generation process such that we can expect generalization from its training to its evaluation? Or are they all randomly generated without any signal?

The synpuf data was generated by the Centers for Medicare & Medicaid Services (CMS). [See here for more info](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.html). The only signal is that the synthetic data roughly resembles CMS data (it is skewed toward an older demographic). You should not expect a model trained on it to generalize.

> Can we get synthetic data that is approximately as large (in number of records and number of codes) as the private dataset against which our models will be trained on? This is just so that we can calculate an estimate of whether or not our training process is sufficiently fast to complete within an hour.

Yes, we are currently building a new synthetic dataset that will be available shortly. This new dataset will be more similar to the UW data in size and form. Keep in mind that the one-hour time limit applies only to the current synpuf data on the NCATS cloud server. You will have more time to train your model on the UW data.
