Hi, we found some incorrectly parsed CIDs in "Training Data.xls".
There are 2 major points:
1) A single incorrectly parsed data point in the Snitz section (row 21 in Training Data.xls).
2) Compared with the original publication, 445 out of 520 data points in the Bushdid section have between 1 and 7 differing CIDs.
Compared with our corrected data, these differing CIDs drastically change the chemical identity of the compounds.
For example,
in row 369 in "Training Data.xls" (Dataset: Bushdid, Mixture Label: 1), the combination of CIDs is:
[ 22311 ... 0 ... 997 ... 0 ... 8842 ... ]
On the other hand, our parsed combination in row 3 in "processed_bushdid.csv" (Bushdid_ID: 1) is:
[ 440917 ... 7463 ... 999 ... 443162 ... 7793 ... ]
First of all, a CID of '0' cannot be parsed into a molecule.
In addition, '997' (2-oxo-3-phenylpropanoic acid; C1=CC=C(C=C1)CC(=O)C(=O)O; parsed from PubChem) is not chemically identical to '999' (2-phenylacetic acid; C1=CC=C(C=C1)CC(=O)O; parsed from PubChem).
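For anyone who wants to run the same kind of check on other rows, here is a minimal sketch of the comparison in Python. It is not the code from "DatasetConstructor.ipynb"; the function name `compare_cids` is hypothetical, and it assumes the two CID lists for a mixture are aligned position by position. The sample lists contain only the entries visible in the example above (the elided "..." entries are left out), so treating them as aligned pairs is an assumption made for illustration.

```python
from typing import List, Tuple


def compare_cids(training: List[int], corrected: List[int]) -> Tuple[List[int], List[int]]:
    """Compare two aligned CID lists for one mixture.

    Returns a pair:
      - positions in `training` that hold a 0 CID (cannot be parsed into a molecule)
      - positions where the training CID differs from the corrected CID
    """
    if len(training) != len(corrected):
        raise ValueError("CID lists must have the same length")
    zero_positions = [i for i, cid in enumerate(training) if cid == 0]
    mismatches = [i for i, (a, b) in enumerate(zip(training, corrected)) if a != b]
    return zero_positions, mismatches


# Only the entries visible in the post; the elided ones are not reconstructed.
training_row = [22311, 0, 997, 0, 8842]
corrected_row = [440917, 7463, 999, 443162, 7793]

zero_positions, mismatches = compare_cids(training_row, corrected_row)
print("0 CIDs at positions:", zero_positions)
print("differing positions:", mismatches)
```

Running the function over every Bushdid row and counting the length of the second list would give a per-mixture number of differing CIDs, assuming both tables list the components in the same order.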
For the details, please refer to our Google Drive.
https://drive.google.com/drive/folders/1w7YMekQKBfSzdPPWB6A1YjGQGe_UogxX?usp=drive_link
If you wish to replicate this analysis, see the Colab notebook "DatasetConstructor.ipynb" in the drive.
In the same drive, "DREAM report.docx" explains our observations.
Thanks.
Created by Sean Park Tteokbokki

I'm reading through the report now and it is very thorough. Thank you for posting this!
Looking in your drive, do you have the true data for the train/leaderboard/test sets? It seems like you have the different Bushdid tables available (with dilutions!) but I'd really appreciate it if you'd upload the corrected data. Thanks a lot!

Hi, thanks for finding this and sharing it with the community.
We put together the training set from publicly available sources to help participants build their models. We tried it on a baseline model and the results were as expected.
People are free to use any training set they want.
Thanks.

To add, we believe this issue is different from the other incorrect CID post in the forum.
The report doc @Tteokbokki posted has really concrete examples of this issue:
[Report doc](https://docs.google.com/document/d/1IALFakSyQMq5M5tZ9IEYYo4XrTkphXYd/edit?usp=sharing&ouid=100954538238939066449&rtpof=true&sd=true)