I have downloaded the file a few times, but there is no data in the CSV file. The download is from DTC. Could you please provide direction on getting the data?
Created by Peter Gandy (pgandy)
Hi Robert,
Thanks for your detailed response and for the suggestions you kindly made. Much appreciated!
Since I was using the Office 365 version of Excel, I decided to check whether a full version of Excel could open the CSV file, but I still had no luck. Also, last week I clicked on the link inside the API documentation (https://drugtargetcommons.fimm.fi/api/data/bioactivity/%5D/) and it failed to open, but today I realized that it works after removing the "%5D" part. I also tried TextEdit, but it failed to open the CSV file (perhaps some other text editor can open it).
However, as I mentioned in my earlier message, I managed to open the file using pandas but couldn't sort and filter it the way I wanted (due to my poor scripting skills).
I do agree with you that R might be easier for data manipulation, but I think at this point I will just pass on this challenge.
Thanks
Hi there,
The raw database dump is about 2 GB. Depending on your version of Excel, it might not support the number of rows that exist in this dataset, but you should be able to open it in any standard text editor.
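To check whether a download really is empty without opening it in Excel, the file can be inspected with a few lines of Python. This is only a minimal sketch using the standard library: the function names are made up for illustration, the file is assumed to be UTF-8 (undecodable bytes are replaced rather than raising), and the row limit is the 1,048,576-row worksheet limit of Excel 2007 and later.

```python
import csv
from itertools import islice

# Worksheet row limit in Excel 2007 and later (the header row counts too).
EXCEL_MAX_ROWS = 1_048_576


def peek_csv(path, n=5):
    """Return the header and the first n data rows without loading the whole file."""
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no content
        rows = list(islice(reader, n))
    return header, rows


def count_data_rows(path):
    """Count data rows (header excluded) by streaming through the file."""
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header if there is one
        return sum(1 for _ in reader)


def fits_in_excel(n_data_rows):
    """True if the header plus the data rows fit in a single worksheet."""
    return n_data_rows + 1 <= EXCEL_MAX_ROWS


# Example usage (assumes the download was saved as DTC_data.csv):
# header, rows = peek_csv("DTC_data.csv")
# print(header, rows)
# print(fits_in_excel(count_data_rows("DTC_data.csv")))
```

If `peek_csv` returns an empty header and no rows, the download itself is empty; if it returns data but the row count exceeds the limit, the file is fine and the problem lies with Excel.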
I am a little confused about this comment: 'but it has about six million rows, not really clean and easy to sort for not a programmer like me.' Apologies if I am misunderstanding, but are you saying that you are new to Python? If so, as an alternative to pandas & Python, may I suggest trying R/RStudio and the excellent tidyverse suite of packages, which you may find more approachable than Python. Here is a great tutorial for getting started with data manipulation in R: https://datacarpentry.org/R-ecology-lesson/index.html
Also, as noted above, DTC has an API, so if you'd like to download just a subset of interactions, you can do that as well. The instructions for accessing the documentation are in a post in this thread.
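If the API is not convenient, a subset can also be cut from the full download locally. The sketch below streams the CSV row by row and writes only the matching rows to a smaller file, which should then be small enough for Excel. The function name is hypothetical, and the column name and value in the usage comment (`gene_names`, `EGFR`) are assumptions for illustration; check the actual header of the file first.

```python
import csv


def write_subset(src_path, dst_path, column, wanted):
    """Copy the header plus every row whose `column` value is in `wanted`.

    Streams the input, so memory use stays small even for a multi-GB file.
    Returns the number of data rows written.
    """
    wanted = set(wanted)
    kept = 0
    with open(src_path, newline="", encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)          # first row holds the column names
        idx = header.index(column)     # raises ValueError if the column is absent
        writer.writerow(header)
        for row in reader:
            # Skip short or malformed rows instead of failing on them.
            if len(row) > idx and row[idx] in wanted:
                writer.writerow(row)
                kept += 1
    return kept


# Example usage (column name and value are assumptions, not confirmed):
# n = write_subset("DTC_data.csv", "DTC_subset.csv", "gene_names", {"EGFR"})
# print(n, "rows written")
```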
I also downloaded the CSV file a few times, through the older (5 hours of download time) and newer mirror links and through the DTC link and script, but Excel only shows an empty page. I think the general expectation is that a CSV file should open in software like Notepad, WordPad or Excel, and if it is going to fail in Excel, a note about that could be added right next to the download link.
When I read this page, I opened the file in Python using pandas, but it has about six million rows, not really clean and easy to sort for not a programmer like me.
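For anyone stuck at the same point, filtering and sorting a file of this size in pandas is usually easier when it is read in chunks. The following is a rough sketch, not a tested recipe for the DTC file: the function name is made up, and the column names in the usage comment (`gene_names`, `standard_value`) are assumptions that need to be checked against the real header. Everything is read as text to sidestep mixed-type columns, and the sort column is converted to numbers afterwards, with unparseable values placed last.

```python
import pandas as pd


def filter_and_sort(path, filter_col, keep_values, sort_col, chunksize=500_000):
    """Keep rows whose filter_col is in keep_values, then sort by sort_col."""
    keep_values = list(keep_values)
    matches = []
    # Reading in chunks keeps memory use bounded for a multi-GB file.
    for chunk in pd.read_csv(path, chunksize=chunksize, dtype=str):
        matches.append(chunk[chunk[filter_col].isin(keep_values)])
    result = pd.concat(matches, ignore_index=True)
    # Convert the sort column to numbers; values that cannot be parsed become NaN.
    result[sort_col] = pd.to_numeric(result[sort_col], errors="coerce")
    return result.sort_values(sort_col, na_position="last").reset_index(drop=True)


# Example usage (column names are assumptions, not confirmed):
# egfr = filter_and_sort("DTC_data.csv", "gene_names", ["EGFR"], "standard_value")
# egfr.to_csv("egfr_sorted.csv", index=False)
```

The filtered result is normally far smaller than the original, so saving it with `to_csv` gives a file that a spreadsheet program can handle.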
I also couldn't find the subset of the data that was mentioned in Peter's question. Thank you,
The page is timing out.
Yes, the DTC has an API. You can find instructions for that API here: https://drugtargetcommons.fimm.fi/ under Download -> Api documentation.
I'd refer you to @Guru if you have any specific questions about the DTC API.
Thank you, I got the data I need. I have another question: is it possible to get a subset of the data instead of the full 2 GB?
Peter
You may find that Excel has trouble with such a large file. I am not able to open this file on my machine using Excel. I have to use R or Python (or, if you just want to have a look, a text editor like Sublime).
I am downloading now. I will let you know. In the past, when I tried to download the data, the file would open, but with no data inside.
If you are having trouble with the direct download from DTC, we have also mirrored the training data on Synapse. You can access it here: https://www.synapse.org/#!Synapse:syn17017461
Hi Peter,
Clicking the above link downloads a 2 GB CSV for me. What happens when you download the above linked file?
Yes, the DTC training dataset.
Hi Peter,
Could you specify which file you are downloading?
Are you downloading the DTC training dataset? https://drugtargetcommons.fimm.fi/static/Excell_files/DTC_data.csv
This is a 2 GB download.